Graph RAG Nutrition Chatbot

A Graph RAG pipeline that processes nutrition research papers into a knowledge graph, clusters them into communities, and uses graph-based retrieval to answer nutrition questions.

Built on a corpus of 30 nutrition research papers.

How It Works

PDFs → Entity/Relationship Extraction → Knowledge Graph → Community Detection → Vector Storage → Query

Entity Extraction — LlamaIndex + Llama3-70B (via Groq) extracts entities and relationships from 30 nutrition research papers
Graph Construction — Entities become nodes, relationships become edges in a NetworkX graph. Semantic normalization merges duplicate entities (e.g., "Vitamin D" and "vitamin d3"), reducing graph fragmentation by ~60%
Community Detection — Hierarchical Leiden algorithm clusters related nutrition concepts
Summarization — LLM generates a summary for each community
Vector Storage — Community summaries and entity embeddings indexed in Pinecone (FastEmbed BGE-small-en-v1.5)
Query Processing — Three retrieval strategies:
- entity — LLM extracts entities from query, searches graph neighbors
- direct — Embeds the raw query, does semantic search
- hybrid — Both, merged and deduplicated (best results)
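
The normalization in the Graph Construction step can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it uses string similarity from the standard library as a stand-in for semantic (embedding-based) matching, and the function names, the 0.9 threshold, and the toy triples are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def canonical(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def merge_entities(names, threshold=0.9):
    """Map each raw entity name to a representative name.

    A name joins the first representative whose string similarity is at or
    above `threshold` (an assumed value); otherwise it starts a new group.
    """
    representatives = []  # canonical forms chosen so far
    mapping = {}          # raw name -> representative
    for raw in names:
        key = canonical(raw)
        match = next(
            (rep for rep in representatives
             if SequenceMatcher(None, key, rep).ratio() >= threshold),
            None,
        )
        if match is None:
            representatives.append(key)
            match = key
        mapping[raw] = match
    return mapping


def build_edges(triples, mapping):
    """Rewrite (source, relation, target) triples onto the merged node names."""
    return sorted({(mapping[s], rel, mapping[t]) for s, rel, t in triples})


# Toy triples, invented for illustration only.
triples = [
    ("Vitamin D", "supports", "calcium absorption"),
    ("vitamin d3", "supports", "Calcium Absorption"),
    ("Magnesium", "activates", "vitamin D"),
]
names = [n for s, _, t in triples for n in (s, t)]
mapping = merge_entities(names)
edges = build_edges(triples, mapping)
print(edges)
```

Here "Vitamin D", "vitamin d3", and "vitamin D" collapse into one node, so three raw triples yield two edges. Each resulting triple could then be loaded into a NetworkX graph with `add_edge`. Plain string similarity will confuse names that differ by a single character, which is one reason a real pipeline would more likely compare embeddings.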

Query Trace Example

Query: "How does vitamin D help with calcium absorption?"

→ Entities extracted: ["vitamin D", "calcium", "absorption"]
→ Graph neighbors found: 12 related entities
→ Communities hit: 2 (bone health cluster, micronutrient interactions cluster)
→ Context assembled: community summaries + entity relationships
→ Answer generated with full graph context
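
The middle steps of the trace (entities, then graph neighbors, then communities) can be sketched with plain dictionaries. Everything in this example is hypothetical: the adjacency data, the community assignments, and the `trace` function are invented to show the lookup order, and the real pipeline works on a NetworkX graph with many more nodes.

```python
# Toy adjacency list; entity names and links are invented for illustration.
GRAPH = {
    "vitamin d": {"calcium", "bone density"},
    "calcium": {"vitamin d", "absorption", "bone density"},
    "absorption": {"calcium"},
    "bone density": {"vitamin d", "calcium"},
    "iron": {"vitamin c"},
    "vitamin c": {"iron"},
}

# Toy entity -> community assignment (assumed, not taken from the project).
COMMUNITY_OF = {
    "vitamin d": "bone health",
    "calcium": "bone health",
    "bone density": "bone health",
    "absorption": "micronutrient interactions",
    "iron": "micronutrient interactions",
    "vitamin c": "micronutrient interactions",
}


def trace(query_entities):
    """Resolve query entities to graph neighbors and the communities they touch."""
    known = [e for e in query_entities if e in GRAPH]
    neighbors = set()
    for entity in known:
        neighbors |= GRAPH[entity]
    neighbors -= set(known)  # report only entities the query did not name
    communities = sorted({COMMUNITY_OF[e] for e in set(known) | neighbors})
    return {
        "entities": known,
        "neighbors": sorted(neighbors),
        "communities": communities,
    }


result = trace(["vitamin d", "calcium", "absorption"])
print(result)
```

The summaries of the communities returned here, together with the relationships among the matched entities, are what would be assembled into the context for answer generation.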

Stack

Graph: NetworkX
Community Detection: python-louvain (Leiden)
Vector DB: Pinecone + FastEmbed (BGE-small-en-v1.5)
LLM: Llama3-70B via Groq
Orchestration: LlamaIndex

Project Structure

pipeline/
├── 01_entity_relationship_extraction/   # PDF → entities + relationships
├── 02_graph_construction/               # Entities → NetworkX graph
├── 03_community_detection/              # Leiden clustering
├── 04_community_summarization/          # LLM community summaries
├── 05_graph_vector_storage/             # Pinecone indexing
└── 06_query_processing/                 # Retrieval + answer generation

Each module is a standalone script that writes artifacts consumed by the next step.
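
One way the handoff between steps might look is sketched below. The artifact format and location are not documented here, so the JSON layout, the `01_extraction/triples.json` path, and both helper functions are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_artifact(path: Path, payload: dict) -> None:
    """Write a step's output as JSON, creating parent folders as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")


def read_artifact(path: Path) -> dict:
    """Load the previous step's output, failing clearly if it is missing."""
    if not path.exists():
        raise FileNotFoundError(f"{path} is missing; run the previous step first")
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical artifact name; the real pipeline may use a different layout.
    artifact = Path(tmp) / "01_extraction" / "triples.json"
    write_artifact(artifact, {"triples": [["vitamin d", "supports", "calcium absorption"]]})
    loaded = read_artifact(artifact)
    print(len(loaded["triples"]))
```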

Setup

git clone https://github.com/raipalorange/graph-rag-nutrition.git
cd graph-rag-nutrition
pip install -r requirements.txt
cp .env.example .env
# Add your GROQ_API_KEY, PINECONE_API_KEY, PINECONE_INDEX_NAME to .env
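
Before running anything, it can help to confirm that the three variables are actually set. The check below is a small sketch, assuming the values from `.env` have already been loaded into the process environment (how the project loads them is not stated here); `missing_env_vars` is a hypothetical helper, not part of the repository.

```python
import os

REQUIRED_VARS = ("GROQ_API_KEY", "PINECONE_API_KEY", "PINECONE_INDEX_NAME")


def missing_env_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Demonstration with an explicit mapping instead of the real environment.
print(missing_env_vars({"GROQ_API_KEY": "example-value"}))
```

Calling `missing_env_vars()` with no argument checks the real environment; an empty list means all three variables are present.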

Running the Pipeline

Run the steps in order. Each step reads the output of the previous one:

python pipeline/01_entity_relationship_extraction/main.py
python pipeline/02_graph_construction/main.py
python pipeline/03_community_detection/main.py
python pipeline/04_community_summarization/main.py
python pipeline/05_graph_vector_storage/main.py
python pipeline/06_query_processing/main.py  # starts interactive session
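
The five batch steps could also be chained from a small runner script that stops at the first failure. This is a sketch rather than something shipped with the repository: `build_commands` and `run_all` are hypothetical names, and step 06 is left out because it starts an interactive session.

```python
import subprocess
import sys
from pathlib import Path

# Batch steps only; 06_query_processing is interactive and is run by hand.
STEPS = [
    "01_entity_relationship_extraction",
    "02_graph_construction",
    "03_community_detection",
    "04_community_summarization",
    "05_graph_vector_storage",
]


def build_commands(root="pipeline", python=sys.executable):
    """Build one command per step, in pipeline order."""
    return [[python, str(Path(root) / step / "main.py")] for step in STEPS]


def run_all(commands, runner=subprocess.run):
    """Run each command in turn; check=True aborts on the first non-zero exit."""
    for cmd in commands:
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)


# Dry run: list the commands without executing them.
for cmd in build_commands():
    print(" ".join(cmd))
```

Calling `run_all(build_commands())` from the repository root would execute the steps; `subprocess.run` raises `CalledProcessError` if a step fails, so later steps never run against stale or missing artifacts.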

Using the Query Processor

from query_processor import NutritionQueryProcessor

processor = NutritionQueryProcessor()

result = processor.process_query(
    query="What nutrients support bone health?",
    top_k=15,
    max_communities=3,
    method="hybrid"
)

print(result["answer"])
print(result["entities_found"])
print(result["communities_used"])
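
The "merged and deduplicated" behavior of the hybrid method can be pictured with the sketch below. It assumes each strategy returns a list of `(item_id, score)` pairs and keeps the higher score when an item appears in both; the function name, result shape, and scores are illustrative assumptions, not the processor's actual internals.

```python
def merge_hits(entity_hits, direct_hits, top_k=15):
    """Union two lists of (item_id, score), keeping the best score per item."""
    best = {}
    for item_id, score in list(entity_hits) + list(direct_hits):
        if item_id not in best or score > best[item_id]:
            best[item_id] = score
    # Highest score first; ties broken by id so the order is deterministic.
    ranked = sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_k]


# Invented scores for illustration.
entity_hits = [("calcium", 0.82), ("vitamin d", 0.77)]
direct_hits = [("vitamin d", 0.91), ("magnesium", 0.64)]
merged = merge_hits(entity_hits, direct_hits, top_k=3)
print(merged)
```

Scores from graph traversal and from vector search are not necessarily on the same scale, so a real merge may need normalization or rank-based fusion before comparing them.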

Limitations

English only
No incremental updates — adding new papers requires re-running the full pipeline
No evaluation — no systematic comparison against vanilla RAG or direct LLM answers

What I'd Improve

Implement incremental graph updates without full rebuild
Add a vanilla RAG baseline for comparison
Add retrieval evaluation (recall@k, answer quality scoring)
Explore cross-domain transfer (e.g., applying the same pipeline to pharmacology or exercise science papers)
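
For the retrieval evaluation item above, recall@k is the fraction of relevant items that appear in the top k retrieved results. A minimal sketch, assuming retrieved items and ground-truth labels are identified by simple ids (the ids below are made up):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items found among the first k retrieved items."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    if not relevant:
        return 0.0  # nothing to find; defined as 0.0 here by convention
    return len(set(retrieved[:k]) & relevant) / len(relevant)


retrieved = ["c1", "c4", "c2", "c7"]  # ranked retrieval output (hypothetical ids)
relevant = {"c1", "c2", "c3"}         # ground-truth relevant items
score = recall_at_k(retrieved, relevant, k=3)
print(round(score, 3))
```

Averaging this value over a labeled set of questions would give one number to compare the entity, direct, and hybrid methods, and later a vanilla RAG baseline.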
