A high-performance biomedical question-answering system using Retrieval-Augmented Generation (RAG) with intelligent query parsing, hybrid retrieval, and optimized inference.
This system answers biomedical questions by:
- Parsing queries with SIBILS query parser (generates Elasticsearch queries with concept annotations)
- Retrieving documents using hybrid search (Elasticsearch + Dense FAISS)
- Reranking with cross-encoder semantic matching
- Generating answers with vLLM-optimized LLM (Qwen 2.5-7B)
- Formatting with sentence-level citations (Ragnarok-style)
- Speed: 5-7 seconds per question (roughly 25x faster than the 177 s baseline)
- Accuracy: Comprehensive answers with sentence-level citations
- Scalability: Handles concurrent requests with FastAPI
```bash
# API is already running on http://172.30.120.7:9000

# Check health
curl http://172.30.120.7:9000/health

# Ask a question
curl -X POST http://172.30.120.7:9000/qa \
  -H "Content-Type: application/json" \
  -d '{"question": "What causes malaria?"}'
```

Example response:

```json
{
  "question": "What causes malaria?",
  "answer": [
    {
      "text": "Malaria is primarily caused by protozoan parasites.",
      "citation_ids": [3, 9],
      "citations": [
        {
          "document_id": 3,
          "document_title": "...",
          "pmcid": "PMCPMC11834219"
        }
      ]
    }
  ],
  "references": ["[0] PMC...: Title", ...],
  "pipeline_time": 6.98
}
```

- Automatically enhances queries with biomedical concepts
- Annotates with MeSH, NCIT, and AGROVOC ontologies
- Generates optimized Elasticsearch queries with concept expansion
- Removes punctuation-only clauses for clean queries
Example:
Input: "What causes malaria?"
Output: Elasticsearch query with:
- Text match: "causes" AND "malaria"
- Concept expansion: mesh:D008288, ncit:C34797, agrovoc:c_34312
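A minimal sketch of the kind of bool query this produces. The field names (`text`, `annotations`) and clause layout are assumptions for illustration, not the parser's actual output:

```python
def build_es_query(keywords, concept_ids):
    """Sketch of a concept-expanded Elasticsearch bool query:
    keyword clauses are required (must), ontology concept IDs
    act as optional boosts (should)."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"text": kw}} for kw in keywords],
                "should": [{"term": {"annotations": cid}}
                           for cid in concept_ids],
            }
        }
    }
```

For "What causes malaria?" this would yield two `must` match clauses plus one `should` term per concept (mesh:D008288, ncit:C34797, agrovoc:c_34312).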
- Elasticsearch (SIBILS): Concept-aware search with ontology expansion
- Dense (FAISS): Semantic vector search
- Parallel: Both run simultaneously
- RRF Fusion: Combines results intelligently
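The RRF fusion step can be sketched as follows. A minimal version: inputs are assumed to be lists of document IDs in rank order, and `k=60` is the constant from the original RRF paper:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: each document scores sum(1 / (k + rank))
    over every ranked list it appears in, so items ranked highly by
    several retrievers float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Elasticsearch ranks [a, b, c], dense ranks [b, c, d]:
rrf_fuse([["a", "b", "c"], ["b", "c", "d"]])  # → ["b", "c", "a", "d"]
```

"b" wins because both retrievers rank it highly, even though neither puts it first.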
- Sentence-level citations
- Structured JSON format
- Clean, readable text
```text
User Question
      ↓
SIBILS Query Parser (ES query + concept annotations)
      ↓
Parallel Hybrid Retrieval (Elasticsearch + Dense)
      ↓
Cross-Encoder Reranking
      ↓
vLLM Generation (Qwen 2.5-7B)
      ↓
Citation Parsing
      ↓
Structured Response
```
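In code, the flow above amounts to chaining the stages. A hypothetical sketch with each stage injected as a callable (none of these names come from the actual codebase):

```python
def run_pipeline(question, parse, retrieve, rerank, generate, parse_citations):
    """Hypothetical glue for the stages in the diagram; each stage is a
    callable so the data flow stays explicit."""
    es_query = parse(question)            # SIBILS query parser
    docs = retrieve(question, es_query)   # parallel ES + FAISS retrieval
    context = rerank(question, docs)      # cross-encoder ordering
    raw = generate(question, context)     # vLLM (Qwen 2.5-7B)
    return parse_citations(raw)           # structured, cited answer
```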
```text
BioMoQA-RAG/
├── src/
│   ├── api_server.py              # FastAPI server
│   ├── pipeline_vllm_v3_fast.py   # Main RAG pipeline
│   ├── build_dense_index.py       # Build FAISS index
│   ├── retrieval/
│   │   ├── query_parser.py        # SIBILS query parser
│   │   ├── sibils_retriever.py    # BM25 with query parsing
│   │   ├── dense_retriever.py     # FAISS dense retrieval
│   │   ├── parallel_hybrid.py     # Hybrid orchestration
│   │   └── reranker.py            # Cross-encoder reranking
│   ├── generation/
│   │   └── llm_generator.py       # vLLM generation
│   └── evaluation/                # Evaluation metrics
├── data/
│   ├── faiss_index.bin            # Dense index
│   └── documents.pkl              # 2398 documents
├── setup_service.sh               # Setup systemd service
├── biomoqa-rag.service            # Systemd service file
└── docs/                          # Documentation
```
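The core operation behind `dense_retriever.py` (top-k by embedding similarity) reduces to the following. This is a sketch in plain NumPy; the real pipeline uses a FAISS index over pre-computed document embeddings:

```python
import numpy as np

def dense_top_k(query_vec, doc_matrix, k=5):
    """Top-k documents by inner product. With L2-normalized vectors the
    inner product equals cosine similarity, which matches what a FAISS
    flat inner-product index computes over the same embeddings."""
    scores = doc_matrix @ query_vec          # one score per document
    top = np.argsort(-scores)[:k]            # indices of the k best scores
    return top.tolist(), scores[top].tolist()
```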
| Version | Time | Key Innovation |
|---|---|---|
| Baseline | 177s | Standard processing |
| V1 | 7.27s | vLLM optimization (24x faster) |
| V2 | 11.19s | Hybrid retrieval |
| V3 | 6.81s | Parallel + smart strategy |
| V3.1 | 6.98s | Query parser + citations |
| V3.2 | ~7s | Full Elasticsearch queries with concept expansion |
The query parser generates optimized Elasticsearch queries:
- Parse: call the SIBILS query parser API with the user question
- Generate ES query: create a structured Elasticsearch query with:
  - Text-matching clauses for keywords
  - Concept expansion (MeSH, NCIT, AGROVOC annotations)
  - Boolean logic (must/should clauses)
- Clean: remove punctuation-only clauses from the ES query
- Retrieve: POST the ES query to the SIBILS search API via the `jq` parameter
- Fallback: if the ES query fails, fall back to keywords mode
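The "Clean" step can be sketched like this (the query structure is an assumption modeled on a standard Elasticsearch bool query, not the parser's exact output):

```python
import string

def is_punct_only(text):
    """True if the clause text contains no alphanumeric characters."""
    return all(ch in string.punctuation or ch.isspace() for ch in text)

def clean_es_query(es_query):
    """Drop match clauses whose query text is punctuation-only, so stray
    '?' or ',' tokens from the question never reach Elasticsearch."""
    bool_q = es_query["query"]["bool"]
    bool_q["must"] = [c for c in bool_q["must"]
                      if not is_punct_only(c["match"]["text"])]
    return es_query
```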
Ask a question:

```json
{
  "question": "What causes malaria?",
  "retrieval_n": 20,   // optional
  "final_n": 10,       // optional
  "debug": false       // optional
}
```

Check API status: `GET /health` (see the curl example above).

Interactive API documentation: served by FastAPI, by default at `/docs`.
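For programmatic access, a minimal stdlib-only Python client for the `/qa` endpoint (field names taken from the request/response examples above):

```python
import json
import urllib.request

API_URL = "http://172.30.120.7:9000/qa"  # address from the quick-start section

def build_payload(question, retrieval_n=20, final_n=10, debug=False):
    """Request body for /qa; defaults mirror the optional fields above."""
    return {"question": question, "retrieval_n": retrieval_n,
            "final_n": final_n, "debug": debug}

def ask(question, **kwargs):
    """POST the question and return the parsed JSON response."""
    body = json.dumps(build_payload(question, **kwargs)).encode()
    req = urllib.request.Request(
        API_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# ask("What causes malaria?") would return the JSON structure shown in
# the response example above (answer sentences, citations, timings).
```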
Edit api_server_v3_fast.py:

```python
config = RAGConfigV3Fast(
    retrieval_n=20,               # Documents to retrieve
    use_smart_retrieval=True,     # Adaptive strategy
    use_reranking=True,           # Cross-encoder
    final_n=10,                   # Final context size
    gpu_memory_utilization=0.4,   # GPU memory (adjust as needed)
)
```

- MIG GPUs: Use 0.4 (< 40GB)
- Full A100: Use 0.8 (< 64GB)
To reduce memory use:
- Reduce `retrieval_n` to 10
- Reduce `final_n` to 5
- Disable the query parser (less accurate)

To improve accuracy:
- Increase `retrieval_n` to 50
- Increase `final_n` to 15
- Keep the query parser enabled
```bash
# Check GPU
nvidia-smi

# Kill old processes
nvidia-smi | grep python | awk '{print $5}' | xargs kill -9

# Check logs
tail -f logs/v3_api.log
```

- README.md - This file
- WORK_EXPLANATION.md - Detailed evolution and design decisions
- EVALUATION_REPORT.md - Performance benchmarks
- SIBILS: Query parser and BM25 retrieval API
- vLLM: Fast LLM inference
- Qwen: Open-source LLM from Alibaba
- FAISS: Dense vector search from Meta