BAIO: LLM-Based Taxonomy Profiling & Open-Set Pathogen Detection

Overview

BAIO (Bioinformatics AI for Open-set detection) is a cutting-edge metagenomic analysis platform that leverages foundation models for DNA sequence analysis. Unlike traditional reference-based pipelines (e.g., Kraken2, MetaPhlAn), BAIO can detect novel and divergent pathogens through open-set recognition methods, making it crucial for pandemic preparedness and surveillance.

Key Features

Taxonomy Profiling: Classifies sequencing reads or contigs into known classes (viral families, bacterial groups, host)
Open-Set Detection: Flags sequences that don't fit known classes as "novel/unknown"
Sample-Level Reporting: Aggregates read-level predictions into intuitive reports for surveillance
Interpretability: Visualizes embeddings, attention patterns, and clusters of novel reads
End-to-End Pipeline: FASTQ → Evo2 embeddings → classifier + OOD detection → GUI report (JSON + PDF)

Technology Stack

Frontend

React + Vite: Lightweight single-page UI for uploads, configuration, results, and chat (replaces the legacy Streamlit view)

Model Runtime

PyTorch + Hugging Face Transformers: Core ML framework
Evo2: Nucleotide model for embeddings
Custom Heads: MLP for taxonomy classification; OOD scores (Max Softmax, Energy, Mahalanobis)

Bioinformatics Utilities

Biopython: FASTA/FASTQ parsing
HDBSCAN: Clustering novel reads

Data Processing

NumPy/Pandas: Embedding manipulation
Scikit-learn: Model calibration and evaluation

DevOps

GitHub Actions: CI/CD pipeline
pytest: Testing framework
Conda/Poetry: Environment management

Project Structure

metaseq-detector/
- api/                  (FastAPI backend)
- frontend/             (React + Vite UI)
- metaseq/              (Core library)
  - dataio.py           FASTA/FASTQ loaders, filters
  - evo2_embed.py       Evo2 embedding wrapper
  - models.py           Classifier heads
  - ood.py              MSP/Energy/Mahalanobis
  - agg.py              Sample-level aggregation
  - cluster.py          HDBSCAN for OOD reads
  - viz.py              Plots: ROC, UMAP, attention
- configs/              YAMLs for experiments
- notebooks/            Exploratory notebooks
- tests/                Pytest unit tests
- runs/                 Saved reports/metrics
- weights/              Trained classifier heads
- examples/             Demo FASTQ/FASTA
- environment.yml
- pyproject.toml
- docs/
  - weekly_report.md
  - design.md           System architecture
  - dataset_card.md     Data sources and splits

Installation

Prerequisites

Python 3.8+
CUDA-compatible GPU (recommended for Evo2 model)

Option 1: Conda Environment

# Clone the repository
git clone https://github.com/your-org/baio.git
cd baio

# Create and activate conda environment
conda env create -f environment.yml
conda activate baio

Option 2: Python Virtual Environment

If you prefer using Python's built-in virtual environment instead of conda:

1. Create Virtual Environment

Windows:

# Navigate to project directory
cd baio

# Create virtual environment
python -m venv baio-env

# Alternative if python3 is your command
python3 -m venv baio-env

macOS/Linux:

# Navigate to project directory
cd baio

# Create virtual environment
python3 -m venv baio-env

2. Activate Virtual Environment

Windows (Command Prompt):

baio-env\Scripts\activate

Windows (PowerShell):

baio-env\Scripts\Activate.ps1

macOS/Linux:

source baio-env/bin/activate

3. Install Dependencies

# Upgrade pip first
pip install --upgrade pip

# Install project dependencies
pip install -r requirements.txt

4. Verify Installation

# Check Python version
python --version

# Check installed packages
pip list

# Test Streamlit installation

Option 3: Poetry

# Clone and install with Poetry
git clone https://github.com/your-org/baio.git
cd baio
poetry install

Development Environment Setup

IDE Configuration

Visual Studio Code

Open Command Palette (Ctrl+Shift+P or Cmd+Shift+P)
Select "Python: Select Interpreter"
Choose the Python executable from your virtual environment:
- Conda: Usually in ~/miniconda3/envs/baio/bin/python
- venv: baio-env/Scripts/python.exe (Windows) or baio-env/bin/python (macOS/Linux)

PyCharm

Go to File → Settings → Project → Python Interpreter
Click gear icon → Add
Select "Existing Environment"
Browse to your environment's Python executable

Jupyter Notebook Integration

# For conda environment
conda activate baio
conda install ipykernel
python -m ipykernel install --user --name=baio --display-name="BAIO Project"

# For venv environment
source baio-env/bin/activate  # or activate script for Windows
pip install ipykernel
python -m ipykernel install --user --name=baio-env --display-name="BAIO Project"

Environment Variables

Create a .env file in your project root for API keys and configuration:

# .env file
OPENAI_API_KEY=your_api_key_here
HUGGINGFACE_TOKEN=your_hf_token_here
DEBUG=True
STREAMLIT_SERVER_PORT=8501
CUDA_VISIBLE_DEVICES=0

Load environment variables in your Python code:

from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv('OPENAI_API_KEY')

Common Setup Issues & Solutions

Issue: `python` command not found

Solution: Use python3 instead of python, or ensure Python is in your system PATH.

Issue: Permission denied on Windows PowerShell

Solution: Run PowerShell as administrator and execute:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Issue: `pip` command not found

Solution:

# Use python -m pip instead
python -m pip install package-name

Issue: CUDA/GPU Setup Problems

Solution:

# Check CUDA availability
python -c "import torch; print(torch.cuda.is_available())"

# Install CPU-only version if needed
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

Issue: Evo2 Model Download Fails

Solution:

Ensure stable internet connection
Check Hugging Face token permissions
Use huggingface-cli login for authentication

Daily Development Workflow

Activate Environment:

# Conda
conda activate baio

# venv (Linux/Mac)
source baio-env/bin/activate

# venv (Windows)
baio-env\Scripts\activate

Update Dependencies:

# Conda
conda env update -f environment.yml

# venv
pip install -r requirements.txt

Run Development Servers (API + React UI):

# FastAPI backend
uvicorn api.main:app --reload --port 8080

# React frontend (uses Vite)
cd frontend
npm install
npm run dev -- --host --port 5173

Deactivate When Done:
```
# Both conda and venv
deactivate
```

Dependency Management

Adding New Packages

For Conda:

conda activate baio
conda install new-package
conda env export > environment.yml

For venv:

source baio-env/bin/activate  # Windows: baio-env\Scripts\activate
pip install new-package
pip freeze > requirements.txt

For Poetry:

poetry add new-package

Quick Start

Run FastAPI backend

uvicorn api.main:app --reload --port 8080

Run React + Vite frontend

cd frontend
npm install   # first run only
# Set VITE_API_BASE if your API is not on http://localhost:8080
npm run dev -- --host --port 5173

Then open http://localhost:5173 to upload sequences, tune thresholds, and chat. (Legacy Streamlit UI commands remain in git history if you still need them.)

Running BAIO with Docker & Docker Compose

You can run the full BAIO stack - FastAPI backend and React UI - without installing Python locally.

1. Prerequisites

Docker Desktop or Docker Engine with Compose v2.
Git to clone the repository.

2. Build and start

docker compose up --build

This starts:

API at http://localhost:8080 (health at /health, classification at /classify, chat at /chat)
Frontend at http://localhost:4173 (set FRONTEND_PORT if you want a different host port)

3. Environment overrides (optional)

API_PORT: host port for FastAPI (default 8080)
FRONTEND_PORT: host port for React UI (default 4173)
VITE_API_BASE: API base URL baked into the frontend at build time (defaults to http://localhost:8080 so the browser can reach the API)
CUDA vs CPU: Docker defaults to CPU-only PyTorch wheels (PIP_EXTRA_INDEX_URL=https://download.pytorch.org/whl/cpu). Build a separate GPU image/service if you need CUDA.

Basic Usage

Open the React UI at the frontend port and paste/upload FASTA.
Configure thresholds (confidence, batch size, novelty sensitivity).
Run the classification pipeline.
Review classifications, novelty counts, and chat with the built-in assistant.

Evaluation Metrics

Closed-Set Taxonomy

Accuracy, macro-F1
Per-class Precision/Recall

Open-Set Detection

AUROC, AUPR-Out
FPR@95%TPR

Sample-Level Analysis

OSCR matrix
Confusion across {Known-Correct, Known-Wrong, Unknown-Correct, Unknown-Wrong}

Performance

Reads/second processing rate
Memory footprint

Development

Running Tests

# Activate your environment first
conda activate baio  # or source baio-env/bin/activate

# Run tests
pytest tests/

Code Quality

# Format code
black metaseq/ app/

# Lint code
flake8 metaseq/ app/

# Type checking
mypy metaseq/ app/

Contributing

We welcome contributions! Please see our contributing guidelines and code of conduct.

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Set up your development environment (see Installation section)
Make your changes and test thoroughly
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Research Applications

BAIO is designed for:

Pandemic Preparedness: Early detection of novel pathogens
Metagenomic Surveillance: Monitoring environmental and clinical samples
Research: Comparative analysis with traditional methods like Kraken2
Clinical Diagnostics: Supporting pathogen identification in complex samples

Citation

If you use BAIO in your research, please cite:

@software{baio2024,
  title={BAIO: LLM-Based Taxonomy Profiling & Open-Set Pathogen Detection},
  author={Farhan, Tanzim and Hashami, Mustafa and Gujja, Sahana and Burns, Eric and Gaikwad, Manali},
  year={2025},
  url={https://github.com/your-org/baio}
}

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributors

Mainuddin - Tech Lead
Luis Palmejar - Developer
**** - Developer
**** - Developer
**** - Developer

Acknowledgments

Built on the Evo2 foundation model (Science, 2024)
Inspired by the need for improved metagenomic surveillance capabilities
Thanks to the open-source bioinformatics community

Support

For questions, issues, or contributions, please:

Open an issue on GitHub
Contact the development team
Check our documentation in the docs/ directory

Note: This is a research prototype. For production use in clinical settings, additional validation and regulatory approval may be required.

Name		Name	Last commit message	Last commit date
Latest commit History 247 Commits
.github		.github
api		api
binary_classifiers		binary_classifiers
configs		configs
data_processing		data_processing
docs		docs
examples		examples
frontend		frontend
metaseq		metaseq
notebooks		notebooks
prompting		prompting
runs		runs
scripts		scripts
tests		tests
weights		weights
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
README.md		README.md
docker-compose.yml		docker-compose.yml
environment.yml		environment.yml
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt

oss-slu/baio

Folders and files

Latest commit

History

Repository files navigation

BAIO: LLM-Based Taxonomy Profiling & Open-Set Pathogen Detection

Overview

Key Features

Technology Stack

Frontend

Model Runtime

Bioinformatics Utilities

Data Processing

DevOps

Project Structure

Installation

Prerequisites

Option 1: Conda Environment

Option 2: Python Virtual Environment

1. Create Virtual Environment

2. Activate Virtual Environment

3. Install Dependencies

4. Verify Installation

Option 3: Poetry

Development Environment Setup

IDE Configuration

Visual Studio Code

PyCharm

Jupyter Notebook Integration

Environment Variables

Common Setup Issues & Solutions

Issue: python command not found

Issue: Permission denied on Windows PowerShell

Issue: pip command not found

Issue: CUDA/GPU Setup Problems

Issue: Evo2 Model Download Fails

Daily Development Workflow

Dependency Management

Adding New Packages

Quick Start

Run FastAPI backend

Run React + Vite frontend

Running BAIO with Docker & Docker Compose

1. Prerequisites

2. Build and start

3. Environment overrides (optional)

Basic Usage

Evaluation Metrics

Closed-Set Taxonomy

Open-Set Detection

Sample-Level Analysis

Performance

Development

Running Tests

Code Quality

Contributing

Research Applications

Citation

License

Contributors

Acknowledgments

Support

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 9

Uh oh!

Languages

Issue: `python` command not found

Issue: `pip` command not found

Packages