ToolsyBio is a modular, open-source system that helps researchers navigate the bioinformatics software landscape using retrieval-augmented generation (RAG). It combines metadata from bio.tools with vector search and a local large language model (LLM) served through Ollama, enabling users to ask natural language questions about software tools.
📝 Paper:
(Accepted!) Truong, V.Q., & Ritchie, M.D. (2025). ToolsyBio: A retrieval-augmented generation system for navigating the bioinformatics software landscape. Proceedings of the 3rd Annual Conference of the US Research Software Engineer Association (US-RSE'25), Philadelphia, PA.
ToolsyBio/
│
├── data/ ← Output folder for tool metadata JSON
│ └── biotools_data.json
│
├── logs/ ← Logs generated from queries
│ └── eval_results.json
│
├── 1_fetch_biotools.py ← Fetches tool data from bio.tools API
├── 2_build_vector_store.py ← Builds Chroma vector store from metadata
├── 3_run_streamlit_app.py ← Streamlit frontend for interactive queries
├── rag_chain.py ← Loads RAG chain (retriever + LLM)
├── eval_logger.py ← Saves query/response logs to JSON
├── requirements.txt ← Python dependencies
├── README.md ← Hey, you're here!
├── LICENSE ← MIT License
📘 Developer tips and tuning options (e.g., custom tool fetch count, resetting vector DB) are available in DEV_GUIDE.md.
ToolsyBio has been tested with Python 3.10. Other versions (3.9–3.11) may also work, but 3.10 is recommended for compatibility with LangChain and ChromaDB.
Clone this repository, cd into it, and run the provided scripts. Below are two environment setup options.
conda create -n toolsybio python=3.10 -y
conda activate toolsybio
pip install -r requirements.txt

Or, using venv:

python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

ToolsyBio depends on the following libraries (in requirements.txt):
streamlit
tqdm
requests
langchain
sentence-transformers
chromadb
ToolsyBio also uses a local LLM served via Ollama. Be sure to install the version that works with your local environment.
This script retrieves tens of thousands of tools, formats metadata, and saves to data/biotools_data.json.
In the paper, we test a subset of ~5,800 tools (subject to change as more get added to the bio.tools site).
- Under `params=`, we set the field `"q": "sequence analysis"`, which limits the fetch to ~5,800 sequence analysis tools.
- If this line is edited or commented out, the fetch won't be constrained to that particular subset.
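As a hedged sketch (the function names are illustrative, not necessarily those in `1_fetch_biotools.py`), the paginated fetch against the bio.tools API looks roughly like:

```python
# Illustrative sketch of the paginated fetch; the endpoint and request
# params mirror the log output shown below, but the helper names here
# are hypothetical.
import requests

BASE_URL = "https://bio.tools/api/tool/"

def build_params(page, query="sequence analysis"):
    """Request parameters for one page of bio.tools results."""
    return {
        "format": "json",
        "page": page,
        "q": query,            # remove this key to fetch the full registry
        "sort": "last_update",
        "ord": "desc",
    }

def fetch_page(page):
    """Return (tools_on_page, next_page_link). 'next' is None on the last page."""
    resp = requests.get(BASE_URL, params=build_params(page), timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return data.get("list", []), data.get("next")
```

The loop stops when `next` comes back empty, which corresponds to the "No more pages" line in the output below.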
python 1_fetch_biotools.py
Example output:
...
🔍 Fetching details for: eugene
🔍 Fetching details for: easycluster
🔍 Fetching details for: mmb_api
🔍 Fetching details for: sequence_searcher
🔍 Fetching details for: genocanyon
...
💾 Saved 5750 tools total. (+50 new on this page)
📄 Fetching page 116 with params: {'format': 'json', 'page': 116, 'q': 'sequence analysis', 'sort': 'last_update', 'ord': 'desc'}
...
🔍 Fetching details for: polydot_ws_husar
🔍 Fetching details for: supermatcher_ws_husar
🔍 Fetching details for: wordmatch_ws_husar
🔍 Fetching details for: mismax
💾 Saved 5795 tools total. (+45 new on this page)
✅ No more pages.
✅ Finished. Fetched 5795 tools.
Embeds tool descriptions using all-MiniLM-L6-v2 and stores them in ChromaDB.
python 2_build_vector_store.py
Example output:
Building vector database...
🔍 Creating embeddings for 6084 chunks...
/mnt/c/Users/vtruong/Projects/ToolsyBio/2_build_vector_store.py:54: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-huggingface package and should be used instead. To use it run `pip install -U :class:`~langchain-huggingface` and import as `from :class:`~langchain_huggingface import HuggingFaceEmbeddings``.
embeddings_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL_NAME)
🔢 Embedding Chunks: 100%|███████████████| 6084/6084 [00:00<00:00, 4121813.20it/s]
📦 Vector store created at: chroma_db
✅ Vector database build complete.
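As a hedged illustration of this step (field handling is an assumption; the actual chunk format in `2_build_vector_store.py` may differ), a bio.tools record might be flattened into an embeddable text chunk like so:

```python
# Hypothetical sketch of flattening one bio.tools record into a text chunk
# prior to embedding. Field names follow the bio.tools JSON schema, but the
# exact chunk layout used by the script may differ.
def tool_to_chunk(tool):
    name = tool.get("name", "")
    description = tool.get("description", "")
    topics = ", ".join(t.get("term", "") for t in tool.get("topic", []))
    homepage = tool.get("homepage", "")
    return f"{name}: {description} (Topics: {topics}) {homepage}"
```

Each chunk is then embedded with all-MiniLM-L6-v2 and persisted in the `chroma_db` directory.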
Make sure the model you want has been pulled from Ollama (e.g., `ollama pull mistral:7b`) and is ready before running the app.
Here are the conditions currently set in rag_chain.py:
- `OLLAMA_MODEL = "mistral:7b"`
- `search_kwargs={"k": 5}`
- `temperature=0.0`
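How the retrieved chunks reach the model can be sketched as follows (the prompt template here is an assumption, not the exact one in `rag_chain.py`):

```python
# Hypothetical prompt assembly: the k=5 retrieved chunks are joined into a
# context block and sent to the Ollama-served model at temperature 0.
def build_prompt(question, chunks):
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the tool metadata below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Grounding the prompt in retrieved metadata is what lets answers cite real documentation and homepage links rather than hallucinating them.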
python -m streamlit run 3_run_streamlit_app.py
Then visit http://localhost:8501 in your browser.
You can ask questions like:
- What FOSS tools support differential gene expression analysis?
- What command-line tools support BAM format?
- What tools help with gene prediction and clustering in viral genomes?
- What tools integrate with Python or R for visualization?
- Can you recommend some lightweight tools for QC of sequencing data?
Results are grounded in retrieved tool metadata and include documentation/homepage links.
If you want to reference ToolsyBio in your work, please cite us!
Truong, V. Q., & Ritchie, M. D. (2025). ToolsyBio: A retrieval-augmented generation system for navigating the bioinformatics software landscape. US Research Software Engineering Conference 2025 (USRSE'25), Philadelphia, PA. https://doi.org/10.5281/zenodo.16891723
You can also use this BibTeX entry:
@inproceedings{truong2025toolsybio,
author = {Truong, Van Q. and Ritchie, Marylyn D.},
title = {ToolsyBio: A retrieval-augmented generation system for navigating the bioinformatics software landscape},
booktitle = {Proceedings of the US Research Software Engineering Conference 2025 (USRSE'25)},
year = {2025},
address = {Philadelphia, PA},
doi = {10.5281/zenodo.16891723},
url = {https://doi.org/10.5281/zenodo.16891723}
}

ToolsyBio is open-source software released under the MIT License. You are free to use, modify, and distribute it with proper attribution.
- Tool metadata provided by the bio.tools registry
- Ontological annotations powered by the EDAM ontology
- Local model serving via Ollama