GraphRAG Knowledge Graph Pipeline

A Python ETL pipeline that scrapes Jama Software's "The Essential Guide to Requirements Management and Traceability" and loads it into a Neo4j knowledge graph using the neo4j_graphrag library for GraphRAG retrieval.

Features

Dynamic guide discovery from live TOC — no static configuration required
Async scraping with httpx (default) or Playwright (browser mode for JS-rendered content)
Neo4j GraphRAG integration using neo4j_graphrag.SimpleKGPipeline
Schema-constrained entity extraction with 12 node types and 14 relationship types
Gleaning — 2-pass LLM extraction catches 20-30% additional entities
Hierarchical HTML chunking with LangChain HTMLHeaderTextSplitter; optional Chonkie semantic chunking
Voyage AI voyage-4 embeddings (1024d, asymmetric); auto-detected from VOYAGE_API_KEY with OpenAI fallback
11-step 3-phase entity post-processing — backfill, source grounding, normalization, deduplication, cross-label dedup, cleanup, industry taxonomy (100+ → 18), LLM summaries, community detection, community summaries, community embeddings
Leiden community detection with LLM-generated community summaries and vector embeddings
Cross-label entity deduplication — merges same-name entities with different type labels
Glossary-to-concept linking with fuzzy matching via rapidfuzz
Supplementary graph structure — Chapter, Resource (Image/Video/Webinar), and Glossary nodes
Validation framework with comprehensive quality checks, repair operations, and auto-archived reports
Preflight validation — checks Neo4j connectivity, APOC availability, and vector index dimensions
Cost estimation via --dry-run flag
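
The gleaning feature listed above can be pictured as a second extraction pass that is told what the first pass already found and asked only for what it missed. The sketch below is illustrative and not the project's implementation: `extract` is a hypothetical callable standing in for an LLM call, `fake_llm` is a stub, and the case-insensitive merge by name is an assumption.

```python
from collections.abc import Callable

# Hypothetical signature: (chunk_text, names_already_found) -> extracted entity names.
Extractor = Callable[[str, frozenset[str]], list[str]]


def glean(text: str, extract: Extractor, passes: int = 2) -> list[str]:
    """Run `extract` up to `passes` times, feeding back what was already found."""
    found: list[str] = []
    seen: set[str] = set()
    for _ in range(passes):
        added = False
        for name in extract(text, frozenset(seen)):
            key = name.strip().lower()  # case-insensitive de-duplication
            if key and key not in seen:
                seen.add(key)
                found.append(name.strip())
                added = True
        if not added:  # nothing new: further passes would not help
            break
    return found


def fake_llm(text: str, already: frozenset[str]) -> list[str]:
    """Stand-in for an LLM: the first pass misses one entity, the second finds it."""
    if not already:
        return ["Traceability", "Impact Analysis"]
    return ["impact analysis", "Live Traceability"]


print(glean("...chunk text...", fake_llm))
# ['Traceability', 'Impact Analysis', 'Live Traceability']
```

Stopping early when a pass adds nothing keeps the LLM cost bounded; whether the real pipeline does this is not stated in this README.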

Architecture

Requirements

Python 3.13+
Neo4j 5.x with APOC plugin
OpenAI API key for entity extraction, gleaning, and summaries
Voyage AI API key (optional, preferred) for embeddings — falls back to OpenAI
UV (recommended) or pip for package management

Installation

Using UV (Recommended)

# Clone or copy the project
cd graphrag-kg-pipeline

# Install dependencies
uv sync

# Copy environment template and configure
cp .env.example .env
# Edit .env with your Neo4j and OpenAI credentials

Using pip

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -e .

Configuration

Create a .env file (see .env.example for full template):

# OpenAI (required) — LLM for extraction, gleaning, summaries
OPENAI_API_KEY=sk-your-api-key

# Voyage AI (optional — preferred for embeddings)
# When set, voyage-4 (1024d, asymmetric) is auto-detected.
# When absent, falls back to OpenAI text-embedding-3-small.
VOYAGE_API_KEY=pa-your-api-key

# Neo4j connection
NEO4J_URI=neo4j+s://xxx.databases.neo4j.io
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your-password
NEO4J_DATABASE=neo4j
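
The embedding auto-detection described in the comments above amounts to a simple rule: use Voyage AI when `VOYAGE_API_KEY` is set, otherwise fall back to OpenAI. A minimal sketch of that rule follows. The function name `choose_embedding_provider` and the returned dictionary shape are hypothetical, not the package's API; 1536 is the default output size of text-embedding-3-small.

```python
import os
from collections.abc import Mapping


def choose_embedding_provider(env: Mapping[str, str] = os.environ) -> dict[str, object]:
    """Pick an embedding backend from environment variables (illustrative only)."""
    if env.get("VOYAGE_API_KEY", "").strip():
        return {"provider": "voyage", "model": "voyage-4", "dimensions": 1024}
    if env.get("OPENAI_API_KEY", "").strip():
        # Default output size of text-embedding-3-small.
        return {"provider": "openai", "model": "text-embedding-3-small", "dimensions": 1536}
    raise RuntimeError("Set VOYAGE_API_KEY or OPENAI_API_KEY before running the pipeline")


print(choose_embedding_provider({"OPENAI_API_KEY": "sk-your-api-key"}))
# {'provider': 'openai', 'model': 'text-embedding-3-small', 'dimensions': 1536}
```

The two providers produce vectors of different sizes, so a vector index created for one will not accept embeddings from the other. That is presumably why preflight validation checks vector index dimensions.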

Usage

Quick Start

# Run the default pipeline: scrape → extract → load to Neo4j
graphrag-kg

# Explicit scrape subcommand (same as above)
graphrag-kg scrape

# With validation report after pipeline completes
graphrag-kg scrape --validate

# Skip resource node creation (images, videos, webinars)
graphrag-kg scrape --skip-resources

# Skip all supplementary structure (chapters, resources, glossary)
graphrag-kg scrape --skip-supplementary

# Scrape only - skip Neo4j processing (saves JSON/JSONL only)
graphrag-kg scrape --scrape-only

# Estimate costs without running (dry run)
graphrag-kg scrape --dry-run

# Use headless browser for JavaScript-rendered content
graphrag-kg scrape --browser

# Full pipeline: scrape + extract + normalize + validate + fix
graphrag-kg scrape --full

# Preview full pipeline costs
graphrag-kg scrape --full --dry-run

# Validate the graph and generate report
graphrag-kg validate

# Save validation report to file
graphrag-kg validate -o validation_report.md

# Preview what fixes would do without applying
graphrag-kg validate --fix --dry-run

# Apply all fixes (chunk_ids + entity cleanup)
graphrag-kg validate --fix

# Apply only specific fix types
graphrag-kg validate --fix-chunk-ids   # Safe, additive operation
graphrag-kg validate --fix-entities    # Merges plurals, deletes generic entities
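
To give a feel for what "merges plurals, deletes generic entities" could involve, here is a rough sketch that plans such a cleanup over a list of entity names. Everything in it is an assumption made for illustration: the `GENERIC_NAMES` stop-list, the naive `singular` rule, and the `plan_cleanup` helper are hypothetical and are not taken from `entity_cleanup.py`.

```python
from collections import defaultdict

GENERIC_NAMES = {"process", "system", "tool"}  # hypothetical stop-list


def singular(name: str) -> str:
    """Very rough singularization, e.g. 'Requirements' -> 'requirement'."""
    key = name.strip().lower()
    if key.endswith("ies") and len(key) > 4:
        return key[:-3] + "y"
    # Leave words such as 'process', 'analysis', 'status' untouched.
    if key.endswith("s") and not key.endswith(("ss", "is", "us")) and len(key) > 3:
        return key[:-1]
    return key


def plan_cleanup(names: list[str]) -> tuple[dict[str, list[str]], list[str]]:
    """Group plural/singular variants and list generic names to delete."""
    groups: dict[str, list[str]] = defaultdict(list)
    to_delete: list[str] = []
    for name in names:
        key = singular(name)
        if key in GENERIC_NAMES:
            to_delete.append(name)
        else:
            groups[key].append(name)
    # Only groups with more than one variant need a merge.
    merges = {key: variants for key, variants in groups.items() if len(variants) > 1}
    return merges, to_delete


merges, to_delete = plan_cleanup(["Requirement", "Requirements", "Tools", "Traceability"])
print(merges)     # {'requirement': ['Requirement', 'Requirements']}
print(to_delete)  # ['Tools']
```

Producing a plan first and applying it separately mirrors the `--fix --dry-run` preview shown above.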

Pre-Ingestion Validation

# Validate TOC discovery against live site (~2 seconds, no API keys needed)
uv run python examples/validate_toc_discovery.py

Programmatic Usage

import asyncio
from graphrag_kg_pipeline import run_scraper

# Run the full Neo4j GraphRAG pipeline
async def main():
    guide = await run_scraper(
        output_dir="./output",      # Intermediate files directory
        run_validation=True,        # Run validation after loading
        skip_resources=False,       # Include images, videos, webinars
        skip_supplementary=False,   # Include chapters, glossary structure
    )
    print(f"Loaded {guide.total_articles} articles to Neo4j")

asyncio.run(main())

# Scrape only (no Neo4j processing)
async def scrape_only():
    guide = await run_scraper(scrape_only=True)
    print(f"Scraped {guide.total_articles} articles to JSON/JSONL")

asyncio.run(scrape_only())

Running Validation Only

import asyncio
from neo4j import AsyncGraphDatabase
from graphrag_kg_pipeline.validation import generate_validation_report

async def validate_graph():
    driver = AsyncGraphDatabase.driver(
        "neo4j+s://xxx.databases.neo4j.io",
        auth=("neo4j", "password")
    )
    try:
        report = await generate_validation_report(driver, "neo4j")
        print(report.to_markdown())
        print(f"Validation passed: {report.validation_passed}")
    finally:
        await driver.close()

asyncio.run(validate_graph())

Knowledge Graph Schema

Node Types

Label	Description	Key Properties
`Article`	Source document	`article_id`, `title`, `url`
`Chapter`	Document grouping	`chapter_number`, `title`
`Chunk`	Text fragment for RAG	`text`, `embedding`, `index`
`Community`	Leiden community cluster	`communityId`, `summary`, `summary_embedding`
`Concept`	Domain concept	`name`, `definition`, `aliases`
`Industry`	Business sector	`name`
`Standard`	Compliance standard	`name`
`Tool`	Software/methodology	`name`
`Challenge`	Problem domain	`name`
`Bestpractice`	Recommended approach	`name`
`Processstage`	Lifecycle phase	`name`
`Role`	Job function	`name`
`Methodology`	Process framework	`name`
`Artifact`	Work product	`name`
`Organization`	Standards body / agency	`name`, `organization_type`
`Outcome`	Positive result / goal	`name`, `outcome_type`
`Image`	Visual resource	`src`, `alt_text`
`Video`	Video resource	`url`, `title`
`Webinar`	Webinar resource	`url`, `title`, `thumbnail_url`
`Definition`	Glossary entry	`term`, `definition`, `acronym`

Relationship Types

Semantic (LLM-extracted):

ADDRESSES - Concept/BestPractice/Tool → Challenge
REQUIRES - Dependency relationships
COMPONENT_OF - Part-whole relationships
RELATED_TO - General semantic associations
ALTERNATIVE_TO - Competing approaches
USED_BY - Tool/Artifact/Concept → Role/Industry
APPLIES_TO - Standard → Industry
PRODUCES - Process → Artifact
DEFINES - Standard → Concept
PREREQUISITE_FOR - Sequential dependencies
PUBLISHES - Organization → Standard
REGULATES - Organization → Industry
DEVELOPS - Organization → Tool
ACHIEVES - Practice/Concept → Outcome

Structural (pipeline-created):

FROM_ARTICLE - Chunk → Article
MENTIONED_IN - Entity → Chunk
IN_COMMUNITY - Entity → Community
HAS_ARTICLE - Chapter → Article
CONTAINS_DEFINITION - Glossary → Definition
DEFINES_CONCEPT - Definition → Concept
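
Schema-constrained extraction means a triple is kept only if its (source label, relationship, target label) combination is allowed. The sketch below encodes a few of the patterns listed above as tuples and filters candidate triples against them. The `PATTERNS` subset and the `is_allowed` helper are illustrative assumptions, not the contents of `extraction/schema.py`.

```python
# A handful of (source_label, relationship, target_label) patterns taken from
# the lists above; the real schema is larger.
PATTERNS: set[tuple[str, str, str]] = {
    ("Standard", "APPLIES_TO", "Industry"),
    ("Standard", "DEFINES", "Concept"),
    ("Organization", "PUBLISHES", "Standard"),
    ("Organization", "REGULATES", "Industry"),
    ("Organization", "DEVELOPS", "Tool"),
    ("Tool", "ADDRESSES", "Challenge"),
}


def is_allowed(source_label: str, relationship: str, target_label: str) -> bool:
    """Return True when the triple matches one of the allowed patterns."""
    return (source_label, relationship, target_label) in PATTERNS


candidates = [
    ("Standard", "APPLIES_TO", "Industry"),  # allowed
    ("Industry", "APPLIES_TO", "Standard"),  # wrong direction, rejected
    ("Tool", "PUBLISHES", "Standard"),       # wrong source label, rejected
]
kept = [triple for triple in candidates if is_allowed(*triple)]
print(kept)  # [('Standard', 'APPLIES_TO', 'Industry')]
```

Because patterns are directional, a reversed triple is rejected even when both labels and the relationship name are valid on their own.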

Project Structure

graphrag-kg-pipeline/
├── src/graphrag_kg_pipeline/
│   ├── __init__.py           # Package exports
│   ├── cli.py                # Command-line interface
│   ├── config.py             # URL configs, rate limiting settings
│   ├── fetcher.py            # Protocol-based fetcher (httpx/Playwright)
│   ├── scraper.py            # Async web scraper + pipeline orchestration
│   ├── parser.py             # HTML → Markdown parser + TOC discovery
│   ├── preflight.py          # Pre-ingestion validation
│   ├── exceptions.py         # Custom exception hierarchy
│   ├── models/               # Pydantic data models
│   │   ├── content.py        # Article, Chapter, Glossary
│   │   └── resource.py       # Image, Video, Webinar
│   ├── chunking/             # LangChain text splitting
│   │   ├── config.py         # HierarchicalChunkingConfig
│   │   ├── hierarchical_chunker.py
│   │   └── adapter.py        # LangChain → neo4j_graphrag adapter
│   ├── embeddings/           # Custom embedding providers
│   │   └── voyage.py         # VoyageAIEmbeddings (Embedder interface)
│   ├── extraction/           # Entity extraction
│   │   ├── schema.py         # NODE_TYPES, RELATIONSHIP_TYPES, PATTERNS
│   │   ├── prompts.py        # LLM extraction prompts
│   │   ├── pipeline.py       # SimpleKGPipeline factory
│   │   └── gleaning.py       # Multi-pass extraction refinement
│   ├── loaders/              # Data loading
│   │   ├── html_loader.py    # GuideHTMLLoader (DataLoader interface)
│   │   └── index_builder.py  # Article index utilities
│   ├── utils/                # Shared utilities
│   │   └── retry.py          # Retry with exponential backoff
│   ├── postprocessing/       # Entity normalization (7 modules)
│   │   ├── normalizer.py     # Name normalization + deduplication
│   │   ├── entity_cleanup.py # Plural/generic entity handling
│   │   ├── industry_taxonomy.py  # 100+ → 18 canonical industries
│   │   ├── mentioned_in_backfill.py  # MENTIONED_IN + APPLIES_TO
│   │   ├── entity_summarizer.py      # LLM entity descriptions
│   │   ├── langextract_augmenter.py  # Source grounding (text provenance)
│   │   └── glossary_linker.py        # Glossary → Concept linking
│   ├── graph/                # Graph algorithms + structure (5 modules)
│   │   ├── community_detection.py   # Leiden clustering (leidenalg + igraph)
│   │   ├── community_summarizer.py  # LLM community summaries
│   │   ├── community_embedder.py    # Voyage AI community embeddings
│   │   ├── supplementary.py  # Chapter/Resource/Glossary nodes
│   │   └── constraints.py    # Indexes, constraints, vector indexes
│   └── validation/           # Quality checks
│       ├── queries.py        # Validation Cypher queries
│       ├── fixes.py          # Data repair utilities
│       └── reporter.py       # Report generation + auto-archive
├── tests/                    # Comprehensive test suite (256 tests)
│   ├── conftest.py           # Pytest fixtures
│   ├── test_models.py
│   ├── test_chunking.py
│   ├── test_extraction.py
│   ├── test_loaders.py
│   ├── test_postprocessing.py
│   ├── test_community.py
│   ├── test_langextract.py
│   ├── test_embeddings.py
│   ├── test_preflight.py
│   ├── test_scraper.py
│   ├── test_smoke.py
│   └── test_validation.py
├── examples/                 # Usage demonstrations
│   ├── query_knowledge_graph.py      # 4 query approaches demo
│   ├── validate_toc_discovery.py     # Pre-ingestion TOC validation
│   ├── backfill_entity_labels.py     # Entity label repair utility
│   └── diagnose_concept_anomaly.py   # Debugging: concept count analysis
├── pyproject.toml            # Project configuration
├── .env.example              # Environment template
└── CLAUDE.md                 # AI assistant guidance
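
For readers unfamiliar with the pattern behind `utils/retry.py`, the following is a generic sketch of retrying an async operation with exponential backoff and jitter. It is not the module's actual code: the `retry_async` name, its parameters, and the choice of full jitter are all assumptions made for illustration.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


async def retry_async(
    operation: Callable[[], Awaitable[T]],
    *,
    attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    retry_on: tuple[type[BaseException], ...] = (ConnectionError, TimeoutError),
) -> T:
    """Await `operation`, retrying on `retry_on` with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return await operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2**attempt)
            # Full jitter spreads out retries from concurrent callers.
            await asyncio.sleep(random.uniform(0, delay))
    raise ValueError("attempts must be at least 1")


state = {"calls": 0}


async def flaky() -> str:
    """Fails twice with a transient error, then succeeds."""
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = asyncio.run(retry_async(flaky, base_delay=0.01))
print(result, state["calls"])  # ok 3
```

Capping the delay with `max_delay` keeps a long outage from producing very long sleeps, and limiting `retry_on` to transient errors avoids retrying failures that will never succeed.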

Development

Running Tests

# Run all tests
uv run pytest

# Run with verbose output
uv run pytest -v

# Run specific test file
uv run pytest tests/test_extraction.py

Linting & Formatting

uv run ruff check .          # Lint check
uv run ruff check . --fix    # Auto-fix issues
uv run ruff format .         # Format code

Type Checking

uv run ty check src/

Validation Queries

After loading, validate the graph quality:

// Check for orphan chunks (should be 0)
MATCH (c:Chunk) WHERE NOT (c)-[:FROM_ARTICLE]->() RETURN count(c)

// Check industry count (should be ≤18)
MATCH (i:Industry) RETURN count(i)

// Check for duplicate entities
MATCH (n) WHERE n:Concept OR n:Industry
WITH n.name AS name, labels(n)[0] AS label, count(n) AS cnt
WHERE cnt > 1 RETURN label, name, cnt

// Check embedding coverage
MATCH (c:Chunk) WHERE c.embedding IS NULL RETURN count(c)

// Check community summary embedding coverage
MATCH (c:Community) WHERE c.summary_embedding IS NULL RETURN count(c)
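
These checks can also be scripted. The sketch below pairs four of the queries above with an expected result and evaluates them through a caller-supplied `run_scalar` function, so it can be wired to any driver. The `CHECKS` table, `run_checks`, and `run_scalar` are hypothetical helpers, not part of the package. The zero targets for embedding coverage are an assumption, since a target is stated above only for orphan chunks and industries.

```python
from collections.abc import Callable

# name -> (Cypher returning a single count, predicate the count must satisfy)
CHECKS: dict[str, tuple[str, Callable[[int], bool]]] = {
    "orphan_chunks": (
        "MATCH (c:Chunk) WHERE NOT (c)-[:FROM_ARTICLE]->() RETURN count(c)",
        lambda n: n == 0,
    ),
    "industry_count": (
        "MATCH (i:Industry) RETURN count(i)",
        lambda n: n <= 18,
    ),
    "chunks_without_embedding": (  # assumed target: 0
        "MATCH (c:Chunk) WHERE c.embedding IS NULL RETURN count(c)",
        lambda n: n == 0,
    ),
    "communities_without_embedding": (  # assumed target: 0
        "MATCH (c:Community) WHERE c.summary_embedding IS NULL RETURN count(c)",
        lambda n: n == 0,
    ),
}


def run_checks(run_scalar: Callable[[str], int]) -> dict[str, bool]:
    """Execute every check and map its name to pass/fail."""
    return {name: ok(run_scalar(query)) for name, (query, ok) in CHECKS.items()}


def fake_run_scalar(query: str) -> int:
    """Stand-in for a database call: pretends 21 Industry nodes exist."""
    return 21 if "Industry" in query else 0


results = run_checks(fake_run_scalar)
print(results)
# {'orphan_chunks': True, 'industry_count': False,
#  'chunks_without_embedding': True, 'communities_without_embedding': True}
```

With the synchronous Neo4j Python driver, `run_scalar` could be a small wrapper around `session.run(query).single()[0]`.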

Querying the Knowledge Graph

Example Script (Recommended)

A ready-to-use example script is included for evaluating the knowledge graph:

# Run with default query
uv run python examples/query_knowledge_graph.py

# Run with custom query
uv run python examples/query_knowledge_graph.py "What is impact analysis?"

# Custom entity search term
uv run python examples/query_knowledge_graph.py "What is impact analysis?" --search "impact"

The script demonstrates four query approaches:

Vector similarity search - Semantic matching on chunk embeddings
Chunk-to-entity traversal - Find entities mentioned in retrieved chunks
Direct entity search - Search entities by name pattern
Relationship exploration - Show connections for a specific entity

Example output:

╭────────────────────── Guide Knowledge Graph Test ────────────────────────────╮
│ Query: What can you tell me about Requirements Tracing?                      │
╰──────────────────────────────────────────────────────────────────────────────╯

1. Vector Similarity Search (semantic match)
------------------------------------------------------------
Result 1 (score: 0.851)
In simple terms, requirements traceability is the process of creating
and maintaining connections between different development artifacts...

3. Direct Entity Search (name contains 'trac')
------------------------------------------------------------
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Type     ┃ Name                     ┃ Connections ┃ Definition               ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ Concept  │ Traceability             │         370 │ the only way to know...  │
│ Concept  │ Requirements Traceability│         147 │ the practice of linking..│
│ Concept  │ Live Traceability        │          84 │ The ability for any...   │
└──────────┴──────────────────────────┴─────────────┴──────────────────────────┘

Semantic Search (RAG)

// Find relevant chunks using vector similarity
CALL db.index.vector.queryNodes('chunk_embedding', 5, $query_embedding)
YIELD node, score
MATCH (node)-[:FROM_ARTICLE]->(a:Article)
RETURN a.title, node.text, score
ORDER BY score DESC

Community Search

// Find relevant communities using vector similarity
CALL db.index.vector.queryNodes('community_summary_embeddings', 3, $query_embedding)
YIELD node, score
RETURN node.communityId, node.summary, score
ORDER BY score DESC

Knowledge Graph Traversal

// Find concepts related to a specific industry
MATCH (i:Industry {name: "automotive"})<-[:APPLIES_TO]-(s:Standard)
MATCH (s)-[:DEFINES]->(c:Concept)
RETURN DISTINCT c.name, s.name

Article Context

// Get all entities mentioned in an article
MATCH (a:Article {article_id: "ch1-art1"})<-[:FROM_ARTICLE]-(c:Chunk)
MATCH (c)<-[:MENTIONED_IN]-(e)
RETURN labels(e)[0] AS type, e.name, count(*) AS mentions
ORDER BY mentions DESC

Legal Notice

The scraped content remains the intellectual property of Jama Software. Please respect their terms of service when using this pipeline. This project demonstrates a production-grade GraphRAG pipeline architecture and is not affiliated with Jama Software.

License

MIT License - See LICENSE file for details.
