Graph RAG POC

A Proof-of-Concept for Graph Retrieval-Augmented Generation (Graph RAG) applied to the LME (London Metal Exchange) aluminum supply chain domain. This project demonstrates how to extract a knowledge graph from raw news articles, persist it in Neo4j, and query it using natural language — all powered by LangGraph, Ollama, and Groq.

Overview

Traditional RAG systems retrieve document chunks using vector similarity. Graph RAG instead builds a knowledge graph from the source corpus — capturing entities and the relationships between them — and uses the graph structure as the retrieval backbone.

This POC applies that idea to LME aluminum market news:

1. Raw news articles are chunked into small text fragments.
2. A local LLM (Ollama / qwen2.5:3b) extracts structured entities and relationships from each chunk, producing a knowledge graph in JSONL format.
3. The graph is ingested into Neo4j AuraDB (cloud-hosted).
4. At query time, a LangGraph pipeline converts a natural-language question into a Cypher query, runs it against Neo4j to find relevant chunk IDs, retrieves those chunks, and passes them to Groq (llama-3.3-70b-versatile) for a final grounded answer.

Architecture

Raw News Articles (JSONL)
        │
        ▼
[ extract-chunk-corpus.py ]
  RecursiveCharacterTextSplitter
        │
        ▼
   chunks.jsonl  (1,428 chunks)
        │
        ▼
[ extract-kg.py ]
  Ollama (qwen2.5:3b) — local LLM
  Structured KG extraction (nodes + relationships)
        │
        ▼
  extracted-kg.jsonl  (1,428 KG records)
        │
        ▼
[ create-graph-db.py ]
  Neo4j AuraDB ingestion
        │
        ▼
    Neo4j Graph DB
        │
        ▼
[ GraphRAGPOC — LangGraph workflow ]
  1. Natural language query
  2. Cypher query generation (Groq)
  3. Neo4j execution → chunk IDs
  4. Chunk retrieval from chunks.jsonl
  5. Final answer generation (Groq)
        │
        ▼
     Answer

Project Structure

graph-rag-poc/
│
├── core/                          # Reusable Python package
│   ├── __init__.py
│   ├── graph.py                   # Main GraphRAGPOC class + LangGraph workflow
│   ├── logging.py                 # Centralized LoggerFactory
│   ├── prompt.py                  # All LLM prompt templates
│   └── utils.py                   # DocumentProcess & Neo4jOps helpers
│
├── dataset/                       # Data files (tracked in git)
│   ├── news-articles-raw.jsonl    # 65 raw LME aluminum news articles
│   ├── chunks.jsonl               # 1,428 text chunks (derived from raw articles)
│   └── extracted-kg.jsonl         # 1,428 KG records (nodes + relationships per chunk)
│
├── development/                   # One-off pipeline scripts
│   ├── extract-chunk-corpus.py    # Stage 1: Chunk raw articles
│   ├── extract-kg.py              # Stage 2: Extract KG from chunks via Ollama
│   ├── create-graph-db.py         # Stage 3: Ingest KG into Neo4j
│   └── viz-graph.py               # Stage 4: Generate interactive HTML graph
│
├── interactive-graph.html         # Pre-generated PyVis graph visualization
├── main.ipynb                     # End-to-end Jupyter notebook demo
├── .env.poc.example               # Environment variable template
├── .gitignore
├── pyproject.toml                 # Package metadata (PEP 517)
└── requirements.txt               # Pinned dependencies

How It Works

Stage 1 — Chunking the Corpus

Script: development/extract-chunk-corpus.py

Reads dataset/news-articles-raw.jsonl (65 raw articles) and splits each article into smaller text chunks using LangChain's RecursiveCharacterTextSplitter.

Chunk size and overlap are controlled via .env.poc (CHUNK_SIZE, CHUNK_OVERLAP).
Default configuration: CHUNK_SIZE=200, CHUNK_OVERLAP=0.
Output: dataset/chunks.jsonl — 1,428 chunks, each stored as a LangChain Document (with page_content and metadata including article title, publish date, and source URL).
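
To give a feel for what a 200-character, zero-overlap split produces, the sketch below packs whitespace-separated words greedily into chunks. It is a simplified stand-in, not LangChain's RecursiveCharacterTextSplitter, which works through a hierarchy of separators. The names split_text and to_records are hypothetical, and only the page_content and metadata keys come from the description above.

```python
from typing import Dict, List


def split_text(text: str, chunk_size: int = 200) -> List[str]:
    """Greedily pack whitespace-separated words into chunks of at most chunk_size characters."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than chunk_size is kept whole rather than cut.
            current = word
    if current:
        chunks.append(current)
    return chunks


def to_records(text: str, metadata: Dict[str, str], chunk_size: int = 200) -> List[dict]:
    """Attach a copy of the article metadata to every chunk."""
    return [
        {"page_content": chunk, "metadata": dict(metadata)}
        for chunk in split_text(text, chunk_size)
    ]
```

With CHUNK_OVERLAP=0, no text is shared between neighbouring chunks, so a sentence that straddles a boundary is split across two records.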

Stage 2 — Knowledge Graph Extraction

Script: development/extract-kg.py

Iterates over all 1,428 chunks and uses a locally running Ollama model (qwen2.5:3b by default) with format="json" to extract a structured knowledge graph from each chunk.

The LLM is guided by two prompt templates (defined in core/prompt.py):

SystemPrompt — Instructs the model to extract aluminum supply chain KGs with strict vocabulary and normalisation rules (e.g., "lme" → "london metal exchange", "aluminium" → "aluminum").
UserPrompt — Provides the chunk text and a JSON schema to fill in with nodes and relationships.
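
In the project these normalisation rules are instructions to the model. Expressed as a deterministic post-processing step, the same idea might look like the sketch below. The alias table holds only the two examples quoted above, and the function name and the snake_case id convention are assumptions made for illustration.

```python
import re

# Only the two rules quoted in the text; they are given there as examples.
ALIASES = {
    "lme": "london metal exchange",
    "aluminium": "aluminum",
}


def normalise_entity(name: str) -> str:
    """Lower-case a name, expand known aliases word by word, and return a snake_case id."""
    words = re.findall(r"[a-z0-9]+", name.lower())
    expanded = [ALIASES.get(word, word) for word in words]
    # An alias may expand to several words, so re-split before joining.
    return "_".join(" ".join(expanded).split())
```

For example, normalise_entity("LME") yields london_metal_exchange, so the same entity mentioned in different chunks ends up with one id.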

Each extracted record is validated against Pydantic models (Node, Relationship, GraphResponse) and appended to dataset/extracted-kg.jsonl.

KG Schema:

{
  "chunk_id": 0,
  "nodes": [
    {"id": "aluminum_price", "type": "financial_metric", "properties": {"value": 2985.5, "unit": "usd/t"}}
  ],
  "relationships": [
    {"source": "london_metal_exchange", "target": "aluminum_price", "type": "reports", "properties": {}}
  ]
}

Entity types used: exchange, financial_metric, commodity, location, event, policy, etc.
Relationship verbs used: reports, trades_on, affects, produces, exports_to, disrupts, etc.
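
The record shape can be checked with a few lines of plain Python. This sketch uses standard-library dataclasses as a stand-in for the project's Pydantic models: the class names Node, Relationship, and GraphResponse come from the text, their fields are inferred from the schema example, and the parse_record helper is hypothetical.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    id: str
    type: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Relationship:
    source: str
    target: str
    type: str
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GraphResponse:
    chunk_id: int
    nodes: List[Node]
    relationships: List[Relationship]


def parse_record(line: str) -> GraphResponse:
    """Parse one JSONL line; raises on malformed JSON or on missing or unexpected keys."""
    raw = json.loads(line)
    if not isinstance(raw["chunk_id"], int):
        raise ValueError("chunk_id must be an integer")
    nodes = [Node(**n) for n in raw["nodes"]]
    relationships = [Relationship(**r) for r in raw["relationships"]]
    return GraphResponse(chunk_id=raw["chunk_id"], nodes=nodes, relationships=relationships)
```

Rejecting a malformed record at this point keeps a bad model output from reaching the database in Stage 3.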

A 5-second delay (time.sleep(5)) is inserted between chunks to avoid overloading the local Ollama server.

Stage 3 — Populating Neo4j

Script: development/create-graph-db.py

Reads dataset/extracted-kg.jsonl and pushes all nodes and relationships into a Neo4j AuraDB instance using the Neo4jOps helper class.

Key implementation details:

All entities are stored under a single node label: Entity.
Nested property dictionaries are flattened (e.g., {"value": 100, "unit": "usd/t"} becomes value_value=100, value_unit="usd/t") for Neo4j compatibility.
The chunk_id is attached to every node and relationship as a property, enabling the RAG pipeline to trace graph hits back to source text chunks.
Nodes are created with MERGE to avoid duplicates.
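
A minimal sketch of the flattening and MERGE steps, assuming the nested dictionary sits under a property key (here, value) and that nested keys are joined with underscores. This is not the Neo4jOps implementation: the function names and the exact Cypher text are illustrative, and the code only builds the statement and its parameters without contacting a database.

```python
from typing import Any, Dict, Tuple


def flatten(props: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested dicts into a single level, joining keys with underscores."""
    flat: Dict[str, Any] = {}
    for key, value in props.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def merge_node_statement(node: Dict[str, Any], chunk_id: int) -> Tuple[str, Dict[str, Any]]:
    """Build a parameterised MERGE for one node under the single Entity label."""
    props = flatten(node.get("properties", {}))
    props["type"] = node["type"]
    props["chunk_id"] = chunk_id
    # MERGE matches on id only, so re-running ingestion does not duplicate nodes.
    query = "MERGE (n:Entity {id: $id}) SET n += $props"
    return query, {"id": node["id"], "props": props}
```

In this particular shape, a node merged from several chunks keeps only the last chunk_id written; the sketch does not show how the project handles that case.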

Stage 4 — Graph Visualization

Script: development/viz-graph.py

Queries all edges from Neo4j (MATCH (a)-[r]->(b) RETURN a.id, b.id) and builds an interactive HTML graph using NetworkX and PyVis.

Physics simulation is enabled (ForceAtlas2 layout).
Full interactivity: drag nodes, zoom, hover, adjust physics/nodes/edges via control panel.
Output: interactive-graph.html (self-contained, opens in browser automatically).

Stage 5 — Graph RAG Query Pipeline

Class: core/graph.py → GraphRAGPOC

The main inference class. Instantiated with a Groq API key, Neo4j driver, Groq model name, and path to the chunks file. Internally builds and compiles a LangGraph state machine.

LangGraph Workflow

The query pipeline is a four-node directed acyclic graph (DAG) with no conditional edges:

[get_cypher_query] → [get_valid_chunks] → [get_context] → [get_final_response]

State object (GraphRAGState):

Field	Type	Description
`natural_query`	`str`	The original user question
`cypher_query`	`Optional[str]`	Generated Cypher query
`valid_chunks`	`Optional[List[int]]`	Chunk IDs returned from Neo4j
`context`	`Optional[List[str]]`	Text of relevant chunks
`final_response`	`Optional[str]`	Final LLM answer
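
Because the workflow is a straight chain, its control flow can be illustrated without LangGraph. The sketch below is a plain-Python stand-in: a dataclass replaces the project's Pydantic state model, run_pipeline is a hypothetical helper, and the four step functions are stubs with toy data rather than real LLM or Neo4j calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class GraphRAGState:
    natural_query: str
    cypher_query: Optional[str] = None
    valid_chunks: Optional[List[int]] = None
    context: Optional[List[str]] = None
    final_response: Optional[str] = None


Step = Callable[[GraphRAGState], GraphRAGState]

# Toy chunk store standing in for chunks.jsonl.
CHUNKS = {0: "LME aluminum cash price rose.", 1: "SHFE inventories fell."}


def run_pipeline(state: GraphRAGState, steps: List[Step]) -> GraphRAGState:
    """Apply each step in order; a linear chain needs no routing logic."""
    for step in steps:
        state = step(state)
    return state


def get_cypher_query(state: GraphRAGState) -> GraphRAGState:
    state.cypher_query = "MATCH (n:Entity) RETURN n.chunk_id AS chunk_id LIMIT 5"
    return state


def get_valid_chunks(state: GraphRAGState) -> GraphRAGState:
    state.valid_chunks = [0]  # a real step would execute state.cypher_query
    return state


def get_context(state: GraphRAGState) -> GraphRAGState:
    state.context = [CHUNKS[i] for i in state.valid_chunks]
    return state


def get_final_response(state: GraphRAGState) -> GraphRAGState:
    state.final_response = " ".join(state.context)  # a real step would call the LLM
    return state


STEPS = [get_cypher_query, get_valid_chunks, get_context, get_final_response]
```

Each step reads the fields filled in by its predecessors and writes exactly one new field, which is why the state fields after natural_query are all optional.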

Node descriptions:

get_cypher_query
Introspects the live Neo4j schema (node labels, relationship types, sample properties) and feeds it — along with the user's question — to the CypherPrompt. Groq (llama-3.3-70b-versatile) generates a single valid Cypher query. The prompt enforces strict output rules: no Markdown, no explanation, output must start with MATCH.
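
Prompt rules are not guarantees, so a defensive check on the model output can be useful. The sketch below strips a Markdown fence if one slipped through and rejects anything that does not begin with MATCH. The clean_cypher helper is hypothetical and is not described as part of core/graph.py.

```python
FENCE = "`" * 3  # Markdown code fence, built programmatically


def clean_cypher(raw: str) -> str:
    """Strip an optional Markdown fence and whitespace, then require a leading MATCH."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the whole first line: the fence plus an optional language tag.
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.rstrip().endswith(FENCE):
        text = text.rstrip()[: -len(FENCE)]
    text = text.strip()
    if not text.upper().startswith("MATCH"):
        raise ValueError("generated query does not start with MATCH")
    return text
```

Raising instead of silently repairing the text keeps a chatty or malformed response from being sent to the database.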

get_valid_chunks
Executes the Cypher query against the Neo4j session. Parses the result DataFrame to find all chunk_id columns, coerces them to integers, and deduplicates them. The result is the list of source chunk IDs whose extracted entities or relationships matched the query.
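
The chunk ID collection can be sketched without pandas by treating the query result as a list of row dictionaries. The function name, and the rule that any column whose name contains chunk_id counts, are assumptions made for this illustration.

```python
from typing import Any, Dict, Iterable, List


def extract_chunk_ids(rows: Iterable[Dict[str, Any]]) -> List[int]:
    """Collect unique integer chunk IDs from every column whose name contains 'chunk_id'."""
    seen = set()
    ordered: List[int] = []
    for row in rows:
        for column, value in row.items():
            if "chunk_id" not in column or value is None:
                continue
            try:
                chunk_id = int(value)
            except (TypeError, ValueError, OverflowError):
                continue  # skip values that cannot be coerced to an integer
            if chunk_id not in seen:
                seen.add(chunk_id)
                ordered.append(chunk_id)
    return ordered
```

Keeping first-seen order, rather than returning a set, makes the downstream context deterministic for a given query result.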

get_context
Maps each chunk ID back to its Document object loaded from chunks.jsonl, assembling the context passages to be used by the final LLM call.

get_final_response
Passes the original question and retrieved context chunks to the FinalLLMPrompt. Groq generates a concise, grounded analytical answer. The system persona is an LME aluminum industry analyst; the model is instructed to respond only from the provided context and say "Insufficient information" if the context is insufficient.
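
As an illustration of a context-only prompt, the sketch below assembles the question and retrieved passages into one string. The wording is hypothetical and is not the text of FinalLLMPrompt; only the analyst persona and the "Insufficient information" fallback are taken from the description above.

```python
from typing import List

# Hypothetical persona text; the project's actual prompt lives in core/prompt.py.
SYSTEM_PERSONA = (
    "You are an LME aluminum industry analyst. "
    "Answer only from the provided context. "
    "If the context is insufficient, reply: Insufficient information"
)


def build_final_prompt(question: str, context: List[str]) -> str:
    """Number the context passages and append the question."""
    numbered = "\n".join(f"[{i}] {passage}" for i, passage in enumerate(context, start=1))
    body = numbered if numbered else "(none)"
    return f"{SYSTEM_PERSONA}\n\nContext:\n{body}\n\nQuestion: {question}\nAnswer:"
```

Numbering the passages makes it easier to check by hand which chunk supports a given statement in the answer.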

Core Module Reference

`core/graph.py`

Class / Method	Description
`GraphRAGState`	Pydantic state model for the LangGraph workflow
`GraphRAGPOC.__init__()`	Initialises LLM, Neo4j session, loads chunks, builds workflow
`GraphRAGPOC.get_response(query)`	Public entry point — runs the full workflow and returns the answer
`_workflow_get_cypher_query()`	LangGraph node 1: natural language → Cypher
`_workflow_get_valid_chunks()`	LangGraph node 2: Cypher → chunk IDs
`_workflow_get_context()`	LangGraph node 3: chunk IDs → text passages
`_workflow_get_final_response()`	LangGraph node 4: passages → final answer

`core/utils.py`

Class / Method	Description
`DocumentProcess.save_docs_jsonl()`	Serialise a list of LangChain `Document` objects to a JSONL file
`DocumentProcess.load_docs_jsonl()`	Deserialise a JSONL file back into a list of `Document` objects
`Neo4jOps.create_node()`	`MERGE` an entity node into Neo4j
`Neo4jOps.create_relationship()`	`MERGE` a typed relationship between two nodes
`Neo4jOps.extract_node_schema_info()`	Fetch node labels and property key counts from Neo4j
`Neo4jOps.extract_relationship_info()`	Fetch relationship types, source/target labels, and usage counts
`Neo4jOps.extract_node_property_info()`	Fetch sample property keys for each node label
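
The JSONL round trip behind the two DocumentProcess methods can be sketched with the standard library alone. Plain dictionaries stand in for LangChain Document objects here, and the function names are hypothetical rather than the project's own.

```python
import json
from pathlib import Path
from typing import Dict, List


def save_jsonl(records: List[Dict], path: str) -> None:
    """Write one JSON object per line."""
    with Path(path).open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_jsonl(path: str) -> List[Dict]:
    """Read the file back, skipping blank lines."""
    with Path(path).open("r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

One object per line means a file can be appended to or read incrementally, which suits a long-running extraction stage.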

`core/prompt.py`

Class	Purpose
`SystemPrompt`	Static system prompt for KG extraction — vocabulary and normalisation rules
`UserPrompt`	Per-chunk user prompt for KG extraction — provides the text and JSON schema
`CypherPrompt`	Multi-step Cypher generation prompt — includes schema, normalisation, expansion, and filtering instructions
`FinalLLMPrompt`	Final answer prompt — enforces grounded, context-only responses as an LME analyst

`core/logging.py`

Class	Description
`LoggerFactory`	Centralised logger factory. Produces loggers with a consistent format (`timestamp | level | name | message`), console handler, and an optional rotating file handler (`logs/app.log`, 10 MB max, 3 backups).
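
A logger with these properties can be put together from the standard logging module. The sketch below is one possible shape, not the LoggerFactory source: get_logger and LOG_FORMAT are hypothetical names, while the format fields, the 10 MB limit, and the three backups follow the description above.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path
from typing import Optional

LOG_FORMAT = "%(asctime)s | %(levelname)s | %(name)s | %(message)s"


def get_logger(name: str, log_file: Optional[str] = None, level: int = logging.INFO) -> logging.Logger:
    """Return a logger with a console handler and, if log_file is given, a rotating file handler."""
    logger = logging.getLogger(name)
    if logger.handlers:
        return logger  # already configured; avoid attaching duplicate handlers
    logger.setLevel(level)
    formatter = logging.Formatter(LOG_FORMAT)

    console = logging.StreamHandler()
    console.setFormatter(formatter)
    logger.addHandler(console)

    if log_file is not None:
        Path(log_file).parent.mkdir(parents=True, exist_ok=True)
        # Rotate at 10 MB and keep three old files.
        file_handler = RotatingFileHandler(log_file, maxBytes=10 * 1024 * 1024, backupCount=3)
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)

    logger.propagate = False  # keep records from being printed twice via the root logger
    return logger
```

Calling get_logger("core.graph", "logs/app.log") would then produce console output and a rotating file in the layout described.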

Dataset

File	Records	Description
`dataset/news-articles-raw.jsonl`	65 articles	Raw LME aluminum market news scraped from AL Circle (Jan 2026). Each record contains `page_content`, `article_title`, `article_publish_date`, and `article_link`.
`dataset/chunks.jsonl`	1,428 chunks	Articles split into chunks of at most 200 characters (CHUNK_SIZE=200) with no overlap.
`dataset/extracted-kg.jsonl`	1,428 KG records	One KG (nodes + relationships) extracted per chunk by Ollama.

The dataset covers topics including LME cash and futures prices, Chinese aluminum production and caps, CBAM (EU Carbon Border Adjustment Mechanism), SHFE (Shanghai Futures Exchange) movements, Indonesia supply growth, recycling investments, and tariff impacts.

Prerequisites

Python 3.11 (the project requires >=3.11, <3.12)
Ollama installed and running locally with qwen2.5:3b pulled — required only for Stage 2 (KG extraction)
A Neo4j AuraDB free instance (or any Neo4j 5.x instance) with credentials
A Groq API key — used for Cypher generation and final answer generation at query time

Installation

# 1. Clone the repository
git clone https://github.com/Tksrivastava/graph-rag-poc.git
cd graph-rag-poc

# 2. Create a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. (Optional) Install as a local package for development
pip install -e .

Configuration

Copy the example environment file and fill in your credentials:

cp .env.poc.example .env.poc

Edit .env.poc:

# Chunking
CHUNK_SIZE = 200
CHUNK_OVERLAP = 0

# Ollama (local LLM for KG extraction)
OLLAMA_MODEL_NAME = "qwen2.5:3b"

# Neo4j AuraDB
NEO4J_URI = "neo4j+s://<your-instance>.databases.neo4j.io"
NEO4J_USERNAME = "neo4j"
NEO4J_PASSWORD = "<your-password>"
NEO4J_DATABASE = "neo4j"
AURA_INSTANCEID = "<your-instance-id>"
AURA_INSTANCENAME = "<your-instance-name>"

# Groq (cloud LLM for querying)
GROQ_LLM_MODEL = "llama-3.3-70b-versatile"
GROQ_API_KEY = "<your-groq-api-key>"

Note: .env.poc is listed in .gitignore and will never be committed. Never commit real credentials.

Running the Pipeline

The four development stages are meant to be run in order the first time. The derived dataset files (chunks.jsonl, extracted-kg.jsonl) are already committed to the repository, so you can skip to Stage 3 or Stage 5 if you don't need to regenerate them.

Stage 1 — Chunk the corpus

python development/extract-chunk-corpus.py

Reads dataset/news-articles-raw.jsonl → writes dataset/chunks.jsonl.

Stage 2 — Extract the knowledge graph

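# Start the Ollama server first; it blocks, so use a separate terminal
# (skip this if Ollama already runs as a background service)
ollama serve

# In another terminal: pull the model, then run the extraction
ollama pull qwen2.5:3b

python development/extract-kg.py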

Reads dataset/chunks.jsonl → writes dataset/extracted-kg.jsonl.
This step is slow: each chunk needs one model call followed by a 5-second delay, so the delays alone add roughly two hours across 1,428 chunks. Resuming is manual: edit the chunk_id >= 136 threshold in the script to restart from a specific chunk.
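
As an alternative to editing the threshold by hand, the resume point could be derived from the output file itself. This is a sketch of that idea, not what extract-kg.py does: next_chunk_id is a hypothetical helper, and it assumes chunks are processed in ascending chunk_id order starting at 0.

```python
import json
from pathlib import Path


def next_chunk_id(kg_path: str) -> int:
    """Return one more than the highest chunk_id already written, or 0 if nothing is there."""
    path = Path(kg_path)
    if not path.exists():
        return 0
    highest = -1
    with path.open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore a partially written line from an interrupted run
            highest = max(highest, int(record.get("chunk_id", -1)))
    return highest + 1
```

A run interrupted after writing chunk 135 would then resume at 136 without any change to the script.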

Stage 3 — Populate Neo4j

python development/create-graph-db.py

Reads dataset/extracted-kg.jsonl → pushes all nodes and relationships to Neo4j AuraDB.

Stage 4 — Visualise the graph

python development/viz-graph.py

Queries Neo4j → renders interactive-graph.html and opens it in your default browser.

Stage 5 — Query the graph (Jupyter)

Open main.ipynb in Jupyter and run the cells. The notebook demonstrates end-to-end querying using GraphRAGPOC:

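import os

from dotenv import load_dotenv
from neo4j import GraphDatabase
from core.graph import GraphRAGPOC

# Assumes credentials live in .env.poc (see Configuration)
load_dotenv(".env.poc")
NEO4J_URI = os.getenv("NEO4J_URI")
NEO4J_USERNAME = os.getenv("NEO4J_USERNAME")
NEO4J_PASSWORD = os.getenv("NEO4J_PASSWORD")
GROQ_LLM_MODEL = os.getenv("GROQ_LLM_MODEL")
GROQ_API_KEY = os.getenv("GROQ_API_KEY")

driver = GraphDatabase.driver(NEO4J_URI, auth=(NEO4J_USERNAME, NEO4J_PASSWORD))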

rag = GraphRAGPOC(
    groq_model=GROQ_LLM_MODEL,
    groq_api=GROQ_API_KEY,
    neo4j_driver=driver,
    chunks_path="dataset/chunks.jsonl",
)

answer = rag.get_response("What is happening with Chinese aluminum production?")
print(answer)

Interactive Graph

The file interactive-graph.html is a self-contained interactive visualisation of the full knowledge graph stored in Neo4j. Open it directly in any modern browser:

open interactive-graph.html       # macOS
xdg-open interactive-graph.html   # Linux
start interactive-graph.html      # Windows

Features:

Drag and reposition nodes
Zoom in/out
Hover to inspect node IDs
Adjust physics simulation, node styling, and edge styling via the built-in control panel
Directed edges showing relationship direction

Dependencies

Package	Version	Role
`langchain`	`>=0.2.0,<0.3.0`	Core RAG framework, document utilities, text splitter
`langgraph`	`>=0.2.0,<0.3.0`	Graph-based LLM workflow orchestration
`langchain-community`	`>=0.2.0,<0.3.0`	Community integrations
`langchain-ollama`	`0.1.3`	Ollama LLM integration (KG extraction)
`langchain-groq`	`0.1.9`	Groq LLM integration (query pipeline)
`neo4j`	`6.1.0`	Neo4j Python driver
`networkx`	`3.6.1`	Graph data structure for visualisation
`pyvis`	`0.3.2`	Interactive HTML graph rendering
`pypdf`	`4.2.0`	PDF parsing (if raw data includes PDFs)
`pydantic`	(transitive)	Data validation for KG models
`python-dotenv`	`1.0.1`	`.env` file loading
`pandas`	`2.2.2`	DataFrame operations on Neo4j query results
`pandas-toon`	`0.1.0`	DataFrame-to-string conversion for LLM prompts
`numpy`	`1.26.4`	Numerical utilities
`tqdm`	`4.67.3`	Progress bars for batch processing
`urllib3`	`1.26.18`	HTTP utilities

Author

Tanul Kumar Srivastava

License

This project is licensed under the MIT License.
