BAIO (Bioinformatics AI for Open-set detection) is a cutting-edge metagenomic analysis platform that leverages foundation models for DNA sequence analysis. Unlike traditional reference-based pipelines (e.g., Kraken2, MetaPhlAn), BAIO can detect novel and divergent pathogens through open-set recognition methods, making it crucial for pandemic preparedness and surveillance.
- Taxonomy Profiling: Classifies sequencing reads or contigs into known classes (viral families, bacterial groups, host)
- Open-Set Detection: Flags sequences that don't fit known classes as "novel/unknown"
- Sample-Level Reporting: Aggregates read-level predictions into intuitive reports for surveillance
- Interpretability: Visualizes embeddings, attention patterns, and clusters of novel reads
- End-to-End Pipeline: FASTQ → Evo2 embeddings → classifier + OOD detection → GUI report (JSON + PDF)
- React + Vite: Lightweight single-page UI for uploads, configuration, results, and chat (replaces the legacy Streamlit view)
- PyTorch + Hugging Face Transformers: Core ML framework
- Evo2: Nucleotide model for embeddings
- Custom Heads: MLP for taxonomy classification; OOD scores (Max Softmax, Energy, Mahalanobis)
- Biopython: FASTA/FASTQ parsing
- HDBSCAN: Clustering novel reads
- NumPy/Pandas: Embedding manipulation
- Scikit-learn: Model calibration and evaluation
- GitHub Actions: CI/CD pipeline
- pytest: Testing framework
- Conda/Poetry: Environment management
metaseq-detector/
- api/ (FastAPI backend)
- frontend/ (React + Vite UI)
- metaseq/ (Core library)
- dataio.py FASTA/FASTQ loaders, filters
- evo2_embed.py Evo2 embedding wrapper
- models.py Classifier heads
- ood.py MSP/Energy/Mahalanobis
- agg.py Sample-level aggregation
- cluster.py HDBSCAN for OOD reads
- viz.py Plots: ROC, UMAP, attention
- configs/ YAMLs for experiments
- notebooks/ Exploratory notebooks
- tests/ Pytest unit tests
- runs/ Saved reports/metrics
- weights/ Trained classifier heads
- examples/ Demo FASTQ/FASTA
- environment.yml
- pyproject.toml
- docs/
- weekly_report.md
- design.md System architecture
- dataset_card.md Data sources and splits
- Python 3.8+
- CUDA-compatible GPU (recommended for Evo2 model)
# Clone the repository
git clone https://github.com/your-org/baio.git
cd baio
# Create and activate conda environment
conda env create -f environment.yml
conda activate baioIf you prefer using Python's built-in virtual environment instead of conda:
Windows:
# Navigate to project directory
cd baio
# Create virtual environment
python -m venv baio-env
# Alternative if python3 is your command
python3 -m venv baio-envmacOS/Linux:
# Navigate to project directory
cd baio
# Create virtual environment
python3 -m venv baio-envWindows (Command Prompt):
baio-env\Scripts\activateWindows (PowerShell):
baio-env\Scripts\Activate.ps1macOS/Linux:
source baio-env/bin/activate# Upgrade pip first
pip install --upgrade pip
# Install project dependencies
pip install -r requirements.txt
# Check Python version
python --version
# Check installed packages
pip list
# Test Streamlit installation# Clone and install with Poetry
git clone https://github.com/your-org/baio.git
cd baio
poetry install- Open Command Palette (
Ctrl+Shift+PorCmd+Shift+P) - Select "Python: Select Interpreter"
- Choose the Python executable from your virtual environment:
- Conda: Usually in
~/miniconda3/envs/baio/bin/python - venv:
baio-env/Scripts/python.exe(Windows) orbaio-env/bin/python(macOS/Linux)
- Conda: Usually in
- Go to File → Settings → Project → Python Interpreter
- Click gear icon → Add
- Select "Existing Environment"
- Browse to your environment's Python executable
# For conda environment
conda activate baio
conda install ipykernel
python -m ipykernel install --user --name=baio --display-name="BAIO Project"
# For venv environment
source baio-env/bin/activate # or activate script for Windows
pip install ipykernel
python -m ipykernel install --user --name=baio-env --display-name="BAIO Project"Create a .env file in your project root for API keys and configuration:
# .env file
OPENAI_API_KEY=your_api_key_here
HUGGINGFACE_TOKEN=your_hf_token_here
DEBUG=True
STREAMLIT_SERVER_PORT=8501
CUDA_VISIBLE_DEVICES=0Load environment variables in your Python code:
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')Solution: Use python3 instead of python, or ensure Python is in your system PATH.
Solution: Run PowerShell as administrator and execute:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUserSolution:
# Use python -m pip instead
python -m pip install package-nameSolution:
# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"
# Install CPU-only version if needed
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpuSolution:
- Ensure stable internet connection
- Check Hugging Face token permissions
- Use
huggingface-cli loginfor authentication
-
Activate Environment:
# Conda conda activate baio # venv (Linux/Mac) source baio-env/bin/activate # venv (Windows) baio-env\Scripts\activate
-
Update Dependencies:
# Conda conda env update -f environment.yml # venv pip install -r requirements.txt
-
Run Development Servers (API + React UI):
# FastAPI backend uvicorn api.main:app --reload --port 8080 # React frontend (uses Vite) cd frontend npm install npm run dev -- --host --port 5173
-
Deactivate When Done:
# Both conda and venv deactivate
For Conda:
conda activate baio
conda install new-package
conda env export > environment.ymlFor venv:
source baio-env/bin/activate # Windows: baio-env\Scripts\activate
pip install new-package
pip freeze > requirements.txtFor Poetry:
poetry add new-packageuvicorn api.main:app --reload --port 8080cd frontend
npm install # first run only
# Set VITE_API_BASE if your API is not on http://localhost:8080
npm run dev -- --host --port 5173Then open http://localhost:5173 to upload sequences, tune thresholds, and chat. (Legacy Streamlit UI commands remain in git history if you still need them.)
You can run the full BAIO stack - FastAPI backend and React UI - without installing Python locally.
- Docker Desktop or Docker Engine with Compose v2.
- Git to clone the repository.
docker compose up --buildThis starts:
- API at http://localhost:8080 (health at
/health, classification at/classify, chat at/chat) - Frontend at http://localhost:4173 (set
FRONTEND_PORTif you want a different host port)
API_PORT: host port for FastAPI (default 8080)FRONTEND_PORT: host port for React UI (default 4173)VITE_API_BASE: API base URL baked into the frontend at build time (defaults to http://localhost:8080 so the browser can reach the API)- CUDA vs CPU: Docker defaults to CPU-only PyTorch wheels (
PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cpu). Build a separate GPU image/service if you need CUDA.
- Open the React UI at the frontend port and paste/upload FASTA.
- Configure thresholds (confidence, batch size, novelty sensitivity).
- Run the classification pipeline.
- Review classifications, novelty counts, and chat with the built-in assistant.
- Accuracy, macro-F1
- Per-class Precision/Recall
- AUROC, AUPR-Out
- FPR@95%TPR
- OSCR matrix
- Confusion across {Known-Correct, Known-Wrong, Unknown-Correct, Unknown-Wrong}
- Reads/second processing rate
- Memory footprint
# Activate your environment first
conda activate baio # or source baio-env/bin/activate
# Run tests
pytest tests/# Format code
black metaseq/ app/
# Lint code
flake8 metaseq/ app/
# Type checking
mypy metaseq/ app/We welcome contributions! Please see our contributing guidelines and code of conduct.
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Set up your development environment (see Installation section)
- Make your changes and test thoroughly
- Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
BAIO is designed for:
- Pandemic Preparedness: Early detection of novel pathogens
- Metagenomic Surveillance: Monitoring environmental and clinical samples
- Research: Comparative analysis with traditional methods like Kraken2
- Clinical Diagnostics: Supporting pathogen identification in complex samples
If you use BAIO in your research, please cite:
@software{baio2024,
title={BAIO: LLM-Based Taxonomy Profiling & Open-Set Pathogen Detection},
author={Farhan, Tanzim and Hashami, Mustafa and Gujja, Sahana and Burns, Eric and Gaikwad, Manali},
year={2025},
url={https://github.com/your-org/baio}
}This project is licensed under the MIT License - see the LICENSE file for details.
-
Mainuddin - Tech Lead
-
Luis Palmejar - Developer
-
**** - Developer
-
**** - Developer
-
**** - Developer
- Built on the Evo2 foundation model (Science, 2024)
- Inspired by the need for improved metagenomic surveillance capabilities
- Thanks to the open-source bioinformatics community
For questions, issues, or contributions, please:
- Open an issue on GitHub
- Contact the development team
- Check our documentation in the
docs/directory
Note: This is a research prototype. For production use in clinical settings, additional validation and regulatory approval may be required.