raipalorange/graph-rag-nutrition

Graph RAG Nutrition Chatbot

A Graph RAG pipeline that processes nutrition research papers into a knowledge graph, clusters them into communities, and uses graph-based retrieval to answer nutrition questions.

Built on a corpus of 30 nutrition research papers.

How It Works

PDFs → Entity/Relationship Extraction → Knowledge Graph → Community Detection → Community Summarization → Vector Storage → Query
  1. Entity Extraction — LlamaIndex + Llama3-70B (via Groq) extracts entities and relationships from 30 nutrition research papers
  2. Graph Construction — Entities become nodes, relationships become edges in a NetworkX graph. Semantic normalization merges duplicate entities (e.g., "Vitamin D" and "vitamin d3"), reducing graph fragmentation by ~60%
  3. Community Detection — Hierarchical Leiden algorithm clusters related nutrition concepts
  4. Summarization — LLM generates a summary for each community
  5. Vector Storage — Community summaries and entity embeddings indexed in Pinecone (FastEmbed BGE-small-en-v1.5)
  6. Query Processing — Three retrieval strategies:
    • entity — LLM extracts entities from query, searches graph neighbors
    • direct — Embeds the raw query, does semantic search
    • hybrid — Both, merged and deduplicated (best results)
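
To make step 2 concrete, here is a minimal sketch of merging duplicate entity names while building a graph. It is not the repo's implementation: the pipeline uses semantic normalization and a NetworkX graph, whereas this sketch substitutes plain string similarity (`difflib.SequenceMatcher`) and a dict-based adjacency map so that it runs with the standard library only. The names `canonicalize` and `build_graph`, the 0.9 threshold, and the sample triples are all hypothetical.

```python
from difflib import SequenceMatcher


def canonicalize(name, canon, threshold=0.9):
    """Map a raw entity name onto an existing canonical name if one is close enough."""
    key = " ".join(name.lower().split())  # lowercase and collapse whitespace
    for existing in canon:
        # String similarity is a stand-in for semantic similarity (assumption).
        if SequenceMatcher(None, key, existing).ratio() >= threshold:
            return existing
    canon.append(key)
    return key


def build_graph(triples, threshold=0.9):
    """Build a directed adjacency map {node: {neighbor: relation}} from triples."""
    canon = []
    graph = {}
    for src, relation, dst in triples:
        s = canonicalize(src, canon, threshold)
        d = canonicalize(dst, canon, threshold)
        graph.setdefault(s, {})[d] = relation
        graph.setdefault(d, {})  # make sure the target exists as a node
    return graph


# Hypothetical extracted triples: (source entity, relation, target entity)
triples = [
    ("Vitamin D", "enhances", "Calcium absorption"),
    ("vitamin d3", "supports", "Bone health"),
    ("Calcium", "supports", "Bone health"),
]
graph = build_graph(triples)
print(sorted(graph))
```

In this toy input, "Vitamin D" and "vitamin d3" collapse into one node, so both of their edges hang off the same entity instead of fragmenting the graph.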

Query Trace Example

Query: "How does vitamin D help with calcium absorption?"

→ Entities extracted: ["vitamin D", "calcium", "absorption"]
→ Graph neighbors found: 12 related entities
→ Communities hit: 2 (bone health cluster, micronutrient interactions cluster)
→ Context assembled: community summaries + entity relationships
→ Answer generated with full graph context
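
The trace above can be approximated with a small lookup over the graph. The sketch below is illustrative only: `trace_query`, the adjacency map, and the entity-to-community mapping are hypothetical stand-ins for the NetworkX graph and the community assignments the pipeline produces, and the toy data is invented for the example.

```python
def trace_query(entities, adjacency, community_of):
    """Return matched entities, their graph neighbors, and the communities touched."""
    matched = [e for e in entities if e in adjacency]
    neighbors = set()
    for entity in matched:
        neighbors.update(adjacency[entity])
    neighbors -= set(matched)  # report only newly discovered entities
    touched = set(matched) | neighbors
    communities = sorted({community_of[n] for n in touched if n in community_of})
    return {
        "matched": matched,
        "neighbors": sorted(neighbors),
        "communities": communities,
    }


# Toy undirected graph (assumption, not data from the repo)
adjacency = {
    "vitamin d": {"calcium", "bone density", "vitamin k"},
    "calcium": {"vitamin d", "bone density", "magnesium"},
    "bone density": {"vitamin d", "calcium"},
    "vitamin k": {"vitamin d"},
    "magnesium": {"calcium"},
}
community_of = {
    "vitamin d": "bone health",
    "calcium": "bone health",
    "bone density": "bone health",
    "vitamin k": "micronutrient interactions",
    "magnesium": "micronutrient interactions",
}

trace = trace_query(["vitamin d", "calcium", "absorption"], adjacency, community_of)
print(trace)
```

Entities that do not appear in the graph (here, "absorption") are simply skipped in this sketch; the repo may handle them differently.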

Stack

  • Graph: NetworkX
  • Community Detection: python-louvain (Leiden)
  • Vector DB: Pinecone + FastEmbed (BGE-small-en-v1.5)
  • LLM: Llama3-70B via Groq
  • Orchestration: LlamaIndex

Project Structure

pipeline/
├── 01_entity_relationship_extraction/   # PDF → entities + relationships
├── 02_graph_construction/               # Entities → NetworkX graph
├── 03_community_detection/              # Leiden clustering
├── 04_community_summarization/          # LLM community summaries
├── 05_graph_vector_storage/             # Pinecone indexing
└── 06_query_processing/                 # Retrieval + answer generation

Each module runs as a standalone script and writes artifacts that the next step consumes.

Setup

git clone https://github.com/raipalorange/graph-rag-nutrition.git
cd graph-rag-nutrition
pip install -r requirements.txt
cp .env.example .env
# Add your GROQ_API_KEY, PINECONE_API_KEY, PINECONE_INDEX_NAME to .env
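
Before running anything, it can help to confirm that the three keys are actually set. The check below is a sketch under one assumption: the values from .env have already been loaded into the process environment (how the repo loads them is not shown here). The helper name `missing_env_vars` is hypothetical.

```python
import os

# Variable names taken from the setup instructions above.
REQUIRED_VARS = ("GROQ_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
```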

Running the Pipeline

Run the steps in order; each one depends on the output of the previous step:

python pipeline/01_entity_relationship_extraction/main.py
python pipeline/02_graph_construction/main.py
python pipeline/03_community_detection/main.py
python pipeline/04_community_summarization/main.py
python pipeline/05_graph_vector_storage/main.py
python pipeline/06_query_processing/main.py  # starts interactive session
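
The batch steps can also be chained from a small driver script. This is a sketch, not a file that ships with the repo: `step_commands` and `run_pipeline` are hypothetical names, the script paths are the ones listed above, and step 06 is left out because it starts an interactive session.

```python
import subprocess
import sys

# Batch steps only; 06_query_processing is interactive and is run separately.
STEPS = [
    "01_entity_relationship_extraction",
    "02_graph_construction",
    "03_community_detection",
    "04_community_summarization",
    "05_graph_vector_storage",
]


def step_commands(python=sys.executable):
    """Build one command per pipeline step, in execution order."""
    return [[python, f"pipeline/{step}/main.py"] for step in STEPS]


def run_pipeline(run=subprocess.run, python=sys.executable):
    """Run each step in order; check=True stops at the first failing step."""
    for command in step_commands(python):
        run(command, check=True)
```

Calling `run_pipeline()` from the repository root runs steps 01 through 05. Because `check=True` raises `subprocess.CalledProcessError` on a non-zero exit code, a failed step prevents later steps from running on stale or missing artifacts.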

Using the Query Processor

from query_processor import NutritionQueryProcessor

processor = NutritionQueryProcessor()

result = processor.process_query(
    query="What nutrients support bone health?",
    top_k=15,
    max_communities=3,
    method="hybrid"
)

print(result["answer"])
print(result["entities_found"])
print(result["communities_used"])
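
The hybrid method is described as merging and deduplicating the results of the entity and direct strategies. One plausible way to do that is sketched below. It does not reflect the internals of `NutritionQueryProcessor`: the function name `merge_hits`, the `(id, score)` tuple format, and the assumption that scores from both strategies are comparable are all illustrative.

```python
def merge_hits(entity_hits, direct_hits, top_k):
    """Merge two ranked hit lists, keeping the best score per id."""
    best = {}
    for hit_id, score in list(entity_hits) + list(direct_hits):
        if hit_id not in best or score > best[hit_id]:
            best[hit_id] = score
    # Sort by descending score, breaking ties by id for a stable order.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical hits: (entity or community id, similarity score)
entity_hits = [("calcium", 0.82), ("vitamin d", 0.91)]
direct_hits = [("vitamin d", 0.88), ("bone density", 0.79), ("magnesium", 0.60)]

merged = merge_hits(entity_hits, direct_hits, top_k=3)
print(merged)
```

Keeping the maximum score per id is only one choice; summing scores or using reciprocal rank fusion would favor items that both strategies agree on.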

Limitations

  • English only
  • No incremental updates — adding new papers requires re-running the full pipeline
  • No evaluation — no systematic comparison against vanilla RAG or direct LLM answers

What I'd Improve

  • Implement incremental graph updates without full rebuild
  • Add a vanilla RAG baseline for comparison
  • Add retrieval evaluation (recall@k, answer quality scoring)
  • Explore cross-domain transfer (e.g., applying the same pipeline to pharmacology or exercise science papers)
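
As a starting point for the retrieval evaluation item, recall@k can be computed per query from a ranked result list and a set of known-relevant ids. This is a generic sketch of the metric, not code from the repo; the function name `recall_at_k` and the sample ids are hypothetical, and labeled relevance judgments would have to be created first.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant ids that appear in the top-k retrieved ids."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant must contain at least one id")
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical ranked results and relevance labels for one query
retrieved = ["vitamin d", "calcium", "iron", "magnesium"]
relevant = {"vitamin d", "magnesium", "vitamin k"}

print(recall_at_k(retrieved, relevant, k=3))
```

Averaging this value over a set of labeled queries would give a single number for comparing the entity, direct, and hybrid methods, or for comparing the pipeline against a vanilla RAG baseline.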

About

Graph RAG pipeline for nutrition Q&A — extracts knowledge graphs from 30 research papers, clusters with Leiden community detection, and retrieves answers via hybrid graph + semantic search.
