A Graph RAG pipeline that processes nutrition research papers into a knowledge graph, clusters the graph's entities into communities, and uses graph-based retrieval to answer nutrition questions.
Built on a corpus of 30 nutrition research papers.
PDFs → Entity/Relationship Extraction → Knowledge Graph → Community Detection → Vector Storage → Query
- Entity Extraction — LlamaIndex + Llama3-70B (via Groq) extracts entities and relationships from 30 nutrition research papers
- Graph Construction — Entities become nodes, relationships become edges in a NetworkX graph. Semantic normalization merges duplicate entities (e.g., "Vitamin D" and "vitamin d3"), reducing graph fragmentation by ~60%
- Community Detection — Hierarchical Leiden algorithm clusters related nutrition concepts
- Summarization — LLM generates a summary for each community
- Vector Storage — Community summaries and entity embeddings indexed in Pinecone (FastEmbed BGE-small-en-v1.5)
- Query Processing — Three retrieval strategies:
  - entity: the LLM extracts entities from the query, then searches their graph neighbors
  - direct: embeds the raw query and runs a semantic search
  - hybrid: runs both, then merges and deduplicates the results (best results)
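The graph construction step above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes plain string similarity (`difflib`) as a stand-in for the semantic normalization, and a dictionary of sets in place of the NetworkX graph. The function names, the 0.9 threshold, and the sample triples are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def canonicalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def build_alias_map(names, threshold=0.9):
    """Map each raw name to a representative (assumed similarity threshold)."""
    representatives = []
    alias_map = {}
    for name in names:
        key = canonicalize(name)
        # Reuse the first existing representative that is similar enough.
        match = next(
            (rep for rep in representatives
             if SequenceMatcher(None, key, rep).ratio() >= threshold),
            None,
        )
        if match is None:
            representatives.append(key)
            match = key
        alias_map[name] = match
    return alias_map


def build_graph(triples, threshold=0.9):
    """Build an undirected adjacency map from (subject, relation, object) triples."""
    names = [n for subj, _rel, obj in triples for n in (subj, obj)]
    alias_map = build_alias_map(names, threshold)
    graph = defaultdict(set)
    for subj, _rel, obj in triples:
        a, b = alias_map[subj], alias_map[obj]
        if a != b:  # skip self-loops created by merging
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


# Hypothetical triples, shaped like the output of the extraction step.
triples = [
    ("Vitamin D", "enhances absorption of", "Calcium"),
    ("vitamin d3", "supports", "Bone Health"),
]
graph = build_graph(triples)
print(sorted(graph["vitamin d"]))
```

Here "Vitamin D" and "vitamin d3" collapse into one node, so both relationships attach to the same entity instead of fragmenting the graph. String similarity alone can merge entities that merely look alike, which is one reason an embedding-based comparison may be preferable in practice.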
Query: "How does vitamin D help with calcium absorption?"
→ Entities extracted: ["vitamin D", "calcium", "absorption"]
→ Graph neighbors found: 12 related entities
→ Communities hit: 2 (bone health cluster, micronutrient interactions cluster)
→ Context assembled: community summaries + entity relationships
→ Answer generated with full graph context
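The trace above can be approximated with a small sketch of the entity and hybrid strategies. It assumes an adjacency map in place of the real graph, and precomputed semantic-search scores in place of a Pinecone query. The scoring scheme (1.0 for a matched entity, 0.5 for a neighbor) and all names are hypothetical, chosen only to show the merge-and-deduplicate step.

```python
def entity_hits(query_entities, graph, score=1.0):
    """Score entities matched in the graph, plus their direct neighbors."""
    hits = {}
    for entity in query_entities:
        key = entity.lower()
        if key not in graph:
            continue  # entity mentioned in the query but absent from the graph
        hits[key] = score
        for neighbor in graph[key]:
            # Neighbors get a lower (assumed) score unless already scored.
            hits.setdefault(neighbor, score * 0.5)
    return hits


def hybrid_merge(entity_results, direct_results, top_k=15):
    """Merge both result sets, keeping the best score for each item."""
    merged = dict(entity_results)
    for item, score in direct_results.items():
        merged[item] = max(score, merged.get(item, 0.0))
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical graph and semantic-search scores.
graph = {
    "vitamin d": {"calcium", "bone health"},
    "calcium": {"vitamin d"},
    "bone health": {"vitamin d"},
}
from_entities = entity_hits(["vitamin D", "calcium", "absorption"], graph)
from_direct = {"bone health": 0.8, "magnesium": 0.6}
ranked = hybrid_merge(from_entities, from_direct)
print(ranked)
```

"bone health" appears in both result sets and is kept once with its higher score, while "magnesium" is contributed only by the direct search.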
- Graph: NetworkX
- Community Detection: hierarchical Leiden (listed dependency: python-louvain, which itself implements the Louvain method rather than Leiden)
- Vector DB: Pinecone + FastEmbed (BGE-small-en-v1.5)
- LLM: Llama3-70B via Groq
- Orchestration: LlamaIndex
pipeline/
├── 01_entity_relationship_extraction/ # PDF → entities + relationships
├── 02_graph_construction/ # Entities → NetworkX graph
├── 03_community_detection/ # Leiden clustering
├── 04_community_summarization/ # LLM community summaries
├── 05_graph_vector_storage/ # Pinecone indexing
└── 06_query_processing/ # Retrieval + answer generation
Each module runs independently and outputs artifacts consumed by the next step.
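One way to picture this handoff is a pair of helpers that write and read JSON artifacts. This is a sketch under stated assumptions, not the repository's code: the artifact name `entities`, the JSON format, and both helper functions are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_artifact(directory, name, payload):
    """Write one step's output as JSON and return the file path."""
    path = Path(directory) / f"{name}.json"
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_artifact(directory, name):
    """Read a previous step's output, failing clearly if it is missing."""
    path = Path(directory) / f"{name}.json"
    if not path.exists():
        raise FileNotFoundError(f"{path.name} not found; run the previous step first")
    return json.loads(path.read_text(encoding="utf-8"))


# A temporary directory stands in for the pipeline's real output folder.
with tempfile.TemporaryDirectory() as artifacts_dir:
    save_artifact(artifacts_dir, "entities", {"entities": ["vitamin d", "calcium"]})
    loaded = load_artifact(artifacts_dir, "entities")

print(loaded["entities"])
```

Failing early with a message that names the missing file makes the ordering requirement visible when a step is run out of sequence.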
git clone https://github.com/raipalorange/graph-rag-nutrition.git
cd graph-rag-nutrition
pip install -r requirements.txt
cp .env.example .env
# Add your GROQ_API_KEY, PINECONE_API_KEY, PINECONE_INDEX_NAME to .env

The steps must be run sequentially, because each step depends on the previous output:
python pipeline/01_entity_relationship_extraction/main.py
python pipeline/02_graph_construction/main.py
python pipeline/03_community_detection/main.py
python pipeline/04_community_summarization/main.py
python pipeline/05_graph_vector_storage/main.py
python pipeline/06_query_processing/main.py # starts interactive session

from query_processor import NutritionQueryProcessor
processor = NutritionQueryProcessor()
result = processor.process_query(
    query="What nutrients support bone health?",
    top_k=15,
    max_communities=3,
    method="hybrid",
)
print(result["answer"])
print(result["entities_found"])
print(result["communities_used"])

- English only
- No incremental updates — adding new papers requires re-running the full pipeline
- No evaluation — no systematic comparison against vanilla RAG or direct LLM answers
- Implement incremental graph updates without full rebuild
- Add a vanilla RAG baseline for comparison
- Add retrieval evaluation (recall@k, answer quality scoring)
- Explore cross-domain transfer (e.g., applying the same pipeline to pharmacology or exercise science papers)
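As a starting point for the retrieval evaluation item above, recall@k can be computed as the fraction of relevant items that appear in the top k retrieved results. The sketch below is an assumption about how such a metric might be wired in. The function name and the sample data are hypothetical, and no evaluation results exist for this pipeline yet.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items found among the first k retrieved items."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("relevant must contain at least one item")
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


# Hypothetical ranked retrieval output and hand-labeled relevant entities.
retrieved = ["vitamin d", "calcium", "magnesium", "zinc"]
relevant = ["vitamin d", "calcium", "vitamin k"]
score = recall_at_k(retrieved, relevant, k=3)
print(f"recall@3 = {score:.2f}")
```

Running the same metric over the entity, direct, and hybrid strategies, and over a vanilla RAG baseline, would give the systematic comparison that the limitations list says is currently missing.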