┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Data Sources   │    │   Processing    │    │   Storage &     │
│                 │    │   Pipeline      │    │   Retrieval     │
├─────────────────┤    ├─────────────────┤    ├─────────────────┤
│ • MedQuAD XML   │───▶│ • XML Parser    │───▶│ • ChromaDB      │
│ • MedlinePlus   │    │ • Web Scraper   │    │ • Vector Store  │
│   Encyclopedia  │    │ • Text Cleaner  │    │ • Embeddings    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   User Query    │◀───│  RAG Pipeline   │◀───│  LLM (Gemini)   │
│                 │    │                 │    │                 │
├─────────────────┤    ├─────────────────┤    ├─────────────────┤
│ • Natural       │    │ • Query         │    │ • Context       │
│   Language      │    │   Embedding     │    │   Synthesis     │
│ • Medical       │    │ • Semantic      │    │ • Answer        │
│   Questions     │    │   Search        │    │   Generation    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
MedQuAD XML Files ──┐
                    ├──▶ Document Processor ──▶ ChromaDB
MedlinePlus Web ────┘
Key Components:
- XML Parser: Extracts Q&A pairs from MedQuAD XML structure
- Web Scraper: Crawls MedlinePlus A-Z index pages for article links
- Content Extractor: Pulls article content from individual pages
- Text Normalizer: Cleans and standardizes text format
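The Text Normalizer is described only at a high level, so the following is a minimal sketch of what such a step could look like, not the project's actual code. It assumes normalization means Unicode NFKC folding, dropping control and zero-width characters, and collapsing whitespace; the name `normalize_text` is hypothetical.

```python
import unicodedata


def normalize_text(raw: str) -> str:
    """Hypothetical normalizer: NFKC-fold, drop control/format chars, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    kept = []
    for ch in text:
        if ch.isspace():
            kept.append(" ")  # treat every whitespace character as a plain space
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # skip control and format characters (e.g. zero-width space)
        else:
            kept.append(ch)
    # split() + join collapses runs of spaces and trims both ends
    return " ".join("".join(kept).split())


if __name__ == "__main__":
    print(normalize_text("Key\u200b Points\x00 -\n\n Vulvar\u00a0cancer"))
```

Running the same function over both MedQuAD answers and scraped MedlinePlus text would keep the two sources in one consistent format before embedding.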
Raw Documents ──▶ Embedding Function ──▶ ChromaDB Collection
                                                  │
                                                  ▼
                                           Metadata Store
                                        (source, title, URL)
Technical Details:
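- Embedding Model: ChromaDB's default embedding function (all-MiniLM-L6-v2)
- Vector Dimensions: 384 (default)
- Similarity Metric: Cosine similarity (requires "hnsw:space": "cosine" in the collection metadata; ChromaDB's default is squared L2)
- Persistence: Local disk storage (chroma_persistent_storage/)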
User Query ──▶ Query Embedding ──▶ K-NN Search ──▶ Top-K Documents
Search Parameters:
- K: 5 (configurable)
- Search Strategy: Semantic similarity
- Filtering: None (retrieves from all sources)
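To make the K-NN step concrete, here is a small sketch of an exact, brute-force top-K search using cosine similarity. It only illustrates the idea: ChromaDB itself uses an approximate HNSW index, and the names `cosine_similarity` and `top_k` as well as the toy 2-dimensional vectors are assumptions made for this example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=5):
    """Return (doc_id, score) pairs for the k most similar documents, best first."""
    scored = ((cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items())
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]


if __name__ == "__main__":
    docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(top_k([1.0, 0.0], docs, k=2))
```

With no filtering applied, every stored document is a candidate, which matches the "retrieves from all sources" behaviour listed above.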
Retrieved Context ──▶ Prompt Engineering ──▶ Gemini LLM ──▶ Final Answer
Prompt Structure:
You are a medical assistant. Based on the following medical information,
answer the user's question. If the information provided doesn't contain
the answer, say so clearly.
Medical Information:
{retrieved_context}
User Question: {user_question}
Please provide a clear, accurate, and helpful answer based on the
medical information above.
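The template above can be filled in with ordinary string formatting. The sketch below shows one way to do that; the names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical, and joining documents with a blank line mirrors how the context is assembled elsewhere in this document.

```python
PROMPT_TEMPLATE = (
    "You are a medical assistant. Based on the following medical information,\n"
    "answer the user's question. If the information provided doesn't contain\n"
    "the answer, say so clearly.\n\n"
    "Medical Information:\n{retrieved_context}\n\n"
    "User Question: {user_question}\n\n"
    "Please provide a clear, accurate, and helpful answer based on the\n"
    "medical information above."
)


def build_prompt(documents, user_question):
    """Join retrieved documents and substitute them into the prompt template."""
    retrieved_context = "\n\n".join(documents)
    return PROMPT_TEMPLATE.format(
        retrieved_context=retrieved_context,
        user_question=user_question,
    )


if __name__ == "__main__":
    print(build_prompt(["Doc one.", "Doc two."], "What is Vulvar Cancer?"))
```

Because `str.format` only interprets braces in the template, curly braces that happen to appear inside retrieved documents are passed through unchanged.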
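import xml.etree.ElementTree as ET

def extract_qa_from_xml(xml_content):
    """Extract Q&A pairs from XML structure"""
    root = ET.fromstring(xml_content)
    qa_pairs = []
    for qa_pair in root.findall('.//QAPair'):
        # findtext returns None for a missing element, so fall back to ''
        question = (qa_pair.findtext('Question') or '').strip()
        answer = (qa_pair.findtext('Answer') or '').strip()
        if question and answer:  # skip pairs with an empty question or answer
            qa_pairs.append({'question': question, 'answer': answer})
    return qa_pairs

XML Structure Parsed: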
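<QAPair pid="1">
  <Question qid="0000043_1-1">What is Vulvar Cancer?</Question>
  <Answer>Key Points - Vulvar cancer is a rare disease...</Answer>
</QAPair>

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def get_article_links_from_index(index_url):
    """Extract article links from A-Z index pages"""
    response = requests.get(index_url, timeout=10)  # assumed fetch step, omitted in the original
    soup = BeautifulSoup(response.content, 'html.parser')
    index_ul = soup.find('ul', {'id': 'index'})
    links = []
    for link in (index_ul.find_all('a', href=True) if index_ul else []):
        href = link['href']
        if href.startswith('article/') and href.endswith('.htm'):
            # Process article link (assumed here: resolve it to an absolute URL)
            links.append(urljoin(index_url, href))
    return links

Scraping Strategy: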
- Index Discovery: Find A-Z index pages from main encyclopedia page
- Link Extraction: Extract article links from each index page
- Content Scraping: Pull article content from individual pages
- Rate Limiting: 1-second delay between requests (respectful scraping)
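The rate-limiting and error-handling parts of this strategy can be sketched as a small loop. This is an illustration under assumptions, not the project's scraper: `crawl_with_delay` is a hypothetical name, and the fetch function is passed in so the loop works with any HTTP client (both `urllib` and `requests` raise errors derived from `OSError`).

```python
import time


def crawl_with_delay(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Fetch each URL in order, pausing between requests and skipping failures."""
    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # pause between requests, not before the first one
        try:
            pages[url] = fetch(url)
        except OSError as exc:  # network-level failure: log it and move on
            print(f"skipping {url}: {exc}")
    return pages


if __name__ == "__main__":
    def fake_fetch(url):
        # Stand-in for a real HTTP call so the example runs offline
        if url.endswith("broken.htm"):
            raise OSError("connection reset")
        return f"<html>{url}</html>"

    result = crawl_with_delay(
        ["article/a.htm", "article/broken.htm", "article/b.htm"],
        fake_fetch,
        delay_seconds=0.0,
    )
    print(sorted(result))
```

Injecting `sleep` as a parameter is a design choice that keeps the delay testable without actually waiting.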
{
"id": "unique_document_id",
"text": "Question: {question}\nAnswer: {answer}",
"metadata": {
"question": "Original question text",
"answer": "Original answer text",
"source_file": "MedQuAD file path",
"source": "MedlinePlus Encyclopedia" (for MedlinePlus),
"url": "Article URL" (for MedlinePlus)
}
}

collection = chroma_client.get_or_create_collection(
    name="medquad_qa_collection",
    # Uses default embedding function
    # ChromaDB's default metric is squared L2, so cosine is requested explicitly
    metadata={"hnsw:space": "cosine"},
)

results = collection.query(
    query_texts=[user_question],
    n_results=5  # Top-K retrieval
)

# Join the top-K documents returned for the single query into one context string
context = "\n\n".join(results['documents'][0])

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents=prompt
)

ChromaDB:
- Pros:
  - Simple setup (no external dependencies)
  - Good performance for small-medium datasets
  - Built-in embedding functions
  - Persistent storage
- Cons:
  - Limited scalability for very large datasets
  - No advanced features like hybrid search
Gemini (gemini-2.0-flash-exp):
- Pros:
  - Fast inference (good for real-time queries)
  - Good medical knowledge
  - Cost-effective
- Cons:
  - Less powerful than larger models
  - Limited context window
- MedQuAD: Structured Q&A format, good for specific questions
- MedlinePlus: Comprehensive articles, good for detailed explanations
- Combination: Covers both specific and general medical queries
- Scraping Choice: MedlinePlus doesn't provide a public API for Encyclopedia articles
- Rate Limiting: 1-second delays to be respectful
- Robustness: Error handling for network issues
- MedQuAD: ~1000-2000 Q&A pairs per minute
- MedlinePlus: ~60 articles per minute (due to rate limiting)
- Total Processing Time: ~2-3 hours for full dataset
- Retrieval Time: <100ms (ChromaDB)
- LLM Generation: 2-5 seconds (Gemini)
- Total Response Time: 2-6 seconds
- Vector Storage: ~500MB for full dataset
- Metadata: ~50MB
- Total: ~550MB
Current Limitations:
- Single-threaded processing
- No distributed storage
- Limited context window (Gemini)
Possible Improvements:
- Parallel Processing: Multi-threaded scraping
- Distributed Storage: Redis/PostgreSQL for metadata
- Advanced Retrieval: Hybrid search (keyword + semantic)
- Caching: Redis for frequent queries
- Model Upgrades: Larger context windows
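Hybrid search (keyword + semantic) needs some way to merge two ranked result lists. One common option, shown here as a sketch, is reciprocal rank fusion (RRF). The document does not say which fusion method would be used, so RRF, the function name, and the conventional constant `k=60` are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list, best first.

    Each document scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1. Ties are broken by document ID.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    keyword_hits = ["d1", "d2", "d3"]   # e.g. from a keyword index
    semantic_hits = ["d3", "d1", "d4"]  # e.g. from the vector store
    print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

Documents that appear in both lists rise to the top, while documents found by only one retriever are still kept in the merged ranking.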
- No user data stored
- All medical data comes from publicly accessible sources
- No PII in the system
- Educational Use Only: Not for medical diagnosis
- Source Attribution: All answers cite sources
- Accuracy: Based on authoritative sources (NIH/NLM)
- Respectful web scraping (1-second delays)
- No aggressive crawling
- Follows robots.txt
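Checking robots.txt before fetching a page can be done with Python's standard library. The sketch below parses rules from a string so it runs offline; the helper name `make_robots_checker`, the user agent `medqa-bot`, and the sample rules are hypothetical and do not reflect any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser


def make_robots_checker(robots_txt, user_agent="*"):
    """Return a function that reports whether a URL may be fetched under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    sample_rules = "User-agent: *\nDisallow: /private/\n"  # made-up rules for illustration
    allowed = make_robots_checker(sample_rules, user_agent="medqa-bot")
    print(allowed("https://example.org/ency/article/000001.htm"))
    print(allowed("https://example.org/private/page.htm"))
```

In a live crawler the rules would come from the site's own robots.txt (for instance via `RobotFileParser.set_url` and `read`), and the check would run before each request.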
- XML parsing validation
- Content length filtering (>100 characters)
- Duplicate detection
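The length filter and duplicate detection can be combined in a single pass. This sketch assumes duplicates are exact matches after lowercasing and whitespace collapsing; the real pipeline may use a different notion of "duplicate", and `filter_and_dedupe` is a hypothetical name.

```python
import hashlib


def filter_and_dedupe(texts, min_chars=100):
    """Keep texts longer than min_chars, dropping repeats of already-seen content."""
    seen = set()
    kept = []
    for text in texts:
        if len(text) <= min_chars:  # the rule above is ">100 characters"
            continue
        # Hash a case- and whitespace-insensitive form so trivial variants match
        canonical = " ".join(text.lower().split())
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


if __name__ == "__main__":
    docs = ["x" * 101, "X" * 101, "too short", "y" * 100]
    print(len(filter_and_dedupe(docs)))
```

Storing hashes rather than full texts keeps memory use small even when the corpus is large.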
- Manual evaluation of top-K results
- Relevance scoring
- Source diversity
- Fact-checking against sources
- Consistency validation
- Medical accuracy review
This technical architecture provides a robust foundation for medical question answering while maintaining transparency, accuracy, and ethical use of medical information.