ToolsyBio is a modular, open-source system that helps researchers navigate the bioinformatics software landscape using retrieval-augmented generation (RAG). It combines metadata from bio.tools with vector search and a local large language model (LLM) served through Ollama, enabling users to ask natural language questions about software tools.
📝 Paper:
(Accepted!) Truong, V.Q., & Ritchie, M.D. (2025). ToolsyBio: A retrieval-augmented generation system for navigating the bioinformatics software landscape. Proceedings of the 3rd Annual Conference of the US Research Software Engineer Association (US-RSE'25), Philadelphia, PA.
ToolsyBio/
│
├── data/ ← Output folder for tool metadata JSON
│ └── biotools_data.json
│
├── logs/ ← Logs generated from queries
│ └── eval_results.json
│
├── 1_fetch_biotools.py ← Fetches tool data from bio.tools API
├── 2_build_vector_store.py ← Builds Chroma vector store from metadata
├── 3_run_streamlit_app.py ← Streamlit frontend for interactive queries
├── rag_chain.py ← Loads RAG chain (retriever + LLM)
├── eval_logger.py ← Saves query/response logs to JSON
├── requirements.txt ← Python dependencies
├── README.md ← Hey, you're here!
├── LICENSE ← MIT License
📘 Developer tips and tuning options (e.g., custom tool fetch count, resetting vector DB) are available in DEV_GUIDE.md.
ToolsyBio has been tested with Python 3.10. Other versions (3.9–3.11) may also work, but 3.10 is recommended for compatibility with LangChain and ChromaDB.
Clone this repository, cd into it, and run the provided scripts. Below are two environment setup options.
conda create -n toolsybio python=3.10 -y
conda activate toolsybio
pip install -r requirements.txt

Or, using venv:

python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

ToolsyBio depends on the following libraries (in requirements.txt):
streamlit
tqdm
requests
langchain
sentence-transformers
chromadb
ToolsyBio also uses a local LLM served via Ollama. Be sure to install the version that works with your local environment.
This script retrieves tens of thousands of tools, formats metadata, and saves to data/biotools_data.json.
In the paper, we test a subset of ~5,800 tools (subject to change as more get added to the bio.tools site).
- Under `params=`, we set the field `"q": "sequence analysis"`, which limits the fetch to ~5,800 sequence analysis tools.
- If this line is edited or commented out, the fetch won't be constrained to that particular subset.
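As a hedged sketch (the function names are illustrative, not necessarily those in `1_fetch_biotools.py`), the paginated fetch against the bio.tools API looks roughly like:

```python
# Illustrative sketch of the paginated fetch; the endpoint and request
# params mirror the log output shown below, but the helper names here
# are hypothetical.
import requests

BASE_URL = "https://bio.tools/api/tool/"

def build_params(page, query="sequence analysis"):
    """Request parameters for one page of bio.tools results."""
    return {
        "format": "json",
        "page": page,
        "q": query,            # remove this key to fetch the full registry
        "sort": "last_update",
        "ord": "desc",
    }

def fetch_page(page):
    """Return (tools_on_page, next_page_link). 'next' is None on the last page."""
    resp = requests.get(BASE_URL, params=build_params(page), timeout=30)
    resp.raise_for_status()
    data = resp.json()
    return data.get("list", []), data.get("next")
```

The loop stops when `next` comes back empty, which corresponds to the "No more pages" line in the output below.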
python 1_fetch_biotools.py
Example output:
...
🔍 Fetching details for: eugene
🔍 Fetching details for: easycluster
🔍 Fetching details for: mmb_api
🔍 Fetching details for: sequence_searcher
🔍 Fetching details for: genocanyon
...
💾 Saved 5750 tools total. (+50 new on this page)
📄 Fetching page 116 with params: {'format': 'json', 'page': 116, 'q': 'sequence analysis', 'sort': 'last_update', 'ord': 'desc'}
...
🔍 Fetching details for: polydot_ws_husar
🔍 Fetching details for: supermatcher_ws_husar
🔍 Fetching details for: wordmatch_ws_husar
🔍 Fetching details for: mismax
💾 Saved 5795 tools total. (+45 new on this page)
✅ No more pages.
✅ Finished. Fetched 5795 tools.
Embeds tool descriptions using all-MiniLM-L6-v2 and stores them in ChromaDB.
python 2_build_vector_store.py
Example output:
Building vector database...
🔍 Creating embeddings for 6084 chunks...
/mnt/c/Users/vtruong/Projects/ToolsyBio/2_build_vector_store.py:54: LangChainDeprecationWarning: The class `HuggingFaceEmbeddings` was deprecated in LangChain 0.2.2 and will be removed in 1.0. An updated version of the class exists in the :class:`~langchain-huggingface package and should be used instead. To use it run `pip install -U :class:`~langchain-huggingface` and import as `from :class:`~langchain_huggingface import HuggingFaceEmbeddings``.
embeddings_model = SentenceTransformerEmbeddings(model_name=EMBEDDING_MODEL_NAME)
🔢 Embedding Chunks: 100%|███████████████| 6084/6084 [00:00<00:00, 4121813.20it/s]
📦 Vector store created at: chroma_db
✅ Vector database build complete.
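As a hedged illustration of this step (field handling is an assumption; the actual chunk format in `2_build_vector_store.py` may differ), a bio.tools record might be flattened into an embeddable text chunk like so:

```python
# Hypothetical sketch of flattening one bio.tools record into a text chunk
# prior to embedding. Field names follow the bio.tools JSON schema, but the
# exact chunk layout used by the script may differ.
def tool_to_chunk(tool):
    name = tool.get("name", "")
    description = tool.get("description", "")
    topics = ", ".join(t.get("term", "") for t in tool.get("topic", []))
    homepage = tool.get("homepage", "")
    return f"{name}: {description} (Topics: {topics}) {homepage}"
```

Each chunk is then embedded with all-MiniLM-L6-v2 and persisted in the `chroma_db` directory.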
Make sure the model you want has been pulled from Ollama (e.g., `ollama pull mistral:7b`) and is ready before running the app.
Here are the conditions currently set in rag_chain.py:
- `OLLAMA_MODEL = "mistral:7b"`
- `search_kwargs={"k": 5}`
- `temperature=0.0`
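How the retrieved chunks reach the model can be sketched as follows (the prompt template here is an assumption, not the exact one in `rag_chain.py`):

```python
# Hypothetical prompt assembly: the k=5 retrieved chunks are joined into a
# context block and sent to the Ollama-served model at temperature 0.
def build_prompt(question, chunks):
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the tool metadata below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Grounding the prompt in retrieved metadata is what lets answers cite real documentation and homepage links rather than hallucinating them.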
python -m streamlit run 3_run_streamlit_app.py
Then visit http://localhost:8501 in your browser.
You can ask questions like:
- What FOSS tools support differential gene expression analysis?
- What command-line tools support BAM format?
- What tools help with gene prediction and clustering in viral genomes?
- What tools integrate with Python or R for visualization?
- Can you recommend some lightweight tools for QC of sequencing data?
Results are grounded in retrieved tool metadata and include documentation/homepage links.
If you want to reference ToolsyBio in your work, please cite us!
Truong, V. Q., & Ritchie, M. D. (2025). ToolsyBio: A retrieval-augmented generation system for navigating the bioinformatics software landscape. US Research Software Engineering Conference 2025 (USRSE'25), Philadelphia, PA. https://doi.org/10.5281/zenodo.16891723
You can also use this BibTeX entry:
@inproceedings{truong2025toolsybio,
author = {Truong, Van Q. and Ritchie, Marylyn D.},
title = {ToolsyBio: A retrieval-augmented generation system for navigating the bioinformatics software landscape},
booktitle = {Proceedings of the US Research Software Engineering Conference 2025 (USRSE'25)},
year = {2025},
address = {Philadelphia, PA},
doi = {10.5281/zenodo.16891723},
url = {https://doi.org/10.5281/zenodo.16891723}
}

ToolsyBio is open-source software released under the MIT License. You are free to use, modify, and distribute it with proper attribution.
- Tool metadata provided by the bio.tools registry
- Ontological annotations powered by the EDAM ontology
- Local model serving via Ollama