kg-foundry

kg-foundry is a modern, end-to-end pipeline for transforming unstructured textual documents (PDF, Markdown, Text) into a high-quality, validated, and visualizable Knowledge Graph (KG). It leverages Large Language Models (LLMs) for extraction and Retrieval-Augmented Generation (RAG) for interactive querying.

πŸš€ Features

  • Ingestion: Support for .txt, .md, and .pdf documents.
  • Extraction: LLM-based extraction of Entities and Relations using Instructor (structured output).
    • Ontology-Constrained Extraction: Optionally constrain entity/relation types to match a specific ontology schema.
  • Entity Disambiguation: Automatic linking of extracted entities to knowledge bases (ontologies) using a hybrid approach:
    • Multi-ontology support (RDF/OWL TTL files)
    • Adaptive threshold filtering (exact, type-based, and partial matching)
    • LLM-based disambiguation for ambiguous cases (MHEL-LLaMo inspired)
    • NIL detection for entities not in ontology
  • Validation: Graph quality assurance using SHACL (Shapes Constraint Language) and pyshacl.
  • Storage: Persistent graph storage using Neo4j with disambiguation metadata.
  • Visualization: Interactive graph exploration with Cytoscape.js.
  • RAG (Chat): "Ask the Graph" feature to answer questions based on the knowledge graph context.
  • Pipeline Logs: Real-time monitoring of the processing pipeline with detailed logs and configuration view.
  • Containerization: Full deployment support with Podman / Docker.

πŸ› οΈ Technology Stack

  • Backend: Python 3.11, FastAPI, LangChain, Instructor, Pydantic, rdflib.
  • Frontend: Next.js 14, TypeScript, Tailwind CSS, Cytoscape.js.
  • Database: Neo4j (Graph DB).
  • AI/LLM: OpenAI GPT-4o (configurable).
  • Ontologies: RDF/OWL support with D3FEND (MITRE) cybersecurity ontology.

πŸ“‚ Project Structure

kg-foundry/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ api/            # FastAPI routes and entry point
β”‚   β”‚   β”œβ”€β”€ main.py             # FastAPI application
β”‚   β”‚   β”œβ”€β”€ routes.py           # API endpoints
β”‚   β”‚   └── logs.py             # Pipeline logging system
β”‚   β”œβ”€β”€ extraction/     # LLM extraction logic (Instructor)
β”‚   β”‚   β”œβ”€β”€ extractor.py        # Main extraction engine
β”‚   β”‚   β”œβ”€β”€ schema.py           # Pydantic models (Entity, Relation)
β”‚   β”‚   β”œβ”€β”€ config.py           # Extraction configuration
β”‚   β”‚   └── ontology_constraints.py  # Ontology-based type constraints
β”‚   β”œβ”€β”€ disambiguation/ # Entity disambiguation module
β”‚   β”‚   β”œβ”€β”€ disambiguator.py    # Main disambiguation engine
β”‚   β”‚   β”œβ”€β”€ ontology_loader.py  # Multi-ontology loader (RDF/OWL)
β”‚   β”‚   └── config.py           # Configuration management
β”‚   β”œβ”€β”€ graph/          # Neo4j client wrapper
β”‚   β”œβ”€β”€ ingestion/      # Document loaders
β”‚   β”œβ”€β”€ rag/            # Retrieval-Augmented Generation logic
β”‚   └── validation/     # SHACL validation logic
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ shapes/         # SHACL shape definitions (.ttl)
β”‚   └── ontologies/     # Knowledge base ontologies (.ttl)
β”‚       β”œβ”€β”€ d3fend.ttl          # D3FEND cybersecurity ontology (MITRE)
β”‚       └── powergrids_cps.ttl  # Power grid cyber-physical security ontology
β”œβ”€β”€ frontend/           # Next.js web application
β”‚   └── app/
β”‚       β”œβ”€β”€ page.tsx            # Main graph visualization page
β”‚       └── logs/page.tsx       # Pipeline logs & config page
β”œβ”€β”€ tests/              # Unit and integration tests
β”œβ”€β”€ docker-compose.yml  # Podman/Docker orchestration
β”œβ”€β”€ .env.example        # Environment variables template
└── requirements.txt    # Python dependencies

⚑ Getting Started

Prerequisites

  • OpenAI API Key: Required for extraction and RAG.
  • Neo4j: Required for storage (provided via Docker).
  • Python 3.11+ & Node.js 18+ (for local dev).
  • Podman or Docker (for containerized run).

Option A: Run with Podman (Recommended)

  1. Create a .env file at the root (see .env.example for a template):
    # OpenAI Configuration (REQUIRED)
    OPENAI_API_KEY=sk-your-key-here
    
    # Neo4j Configuration
    NEO4J_URI=bolt://localhost:7687
    NEO4J_USER=neo4j
    NEO4J_PASSWORD=password
    
    # Extraction Configuration (optional)
    EXTRACTION_MODEL_NAME=gpt-4o
    EXTRACTION_ONTOLOGIES_DIR=./data/ontologies/
    # Constrain extraction to specific ontology types (comma-separated)
    # EXTRACTION_ONTOLOGIES_FILES=powergrids_cps.ttl,d3fend.ttl
    
    # Entity Disambiguation (optional)
    DISAMBIGUATION_ENABLED=true
    DISAMBIGUATION_THRESHOLD=0.5
    DISAMBIGUATION_ONTOLOGIES_DIR=./data/ontologies/
    MAX_PROCESSING_TIME=300
    MAX_PROCESSING_TOKENS=100000
  2. Build and start the services:
    podman-compose up --build
  3. Access the application: the frontend at http://localhost:3000 and the API at http://localhost:8000 (interactive docs at http://localhost:8000/docs).

Option B: Run Locally

1. Backend

# Create virtual env
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set env vars
export OPENAI_API_KEY=sk-...
export NEO4J_URI=bolt://localhost:7687
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password

# Run server
uvicorn src.api.main:app --reload

2. Frontend

cd frontend
npm install
npm run dev

πŸ“– Usage

  1. Open the web interface at http://localhost:3000.
  2. Click "Choose File" and select a text or PDF document.
  3. Click "Process". The system will:
    • Extract entities and relations using LLM.
    • Disambiguate entities by linking them to ontology concepts (if enabled).
    • Validate them against the SHACL schema.
    • Store in Neo4j with disambiguation metadata.
    • Display the resulting Knowledge Graph interactively.
  4. Use the "Ask the Graph" chat box below the visualization to ask questions about the ingested content.
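
Under the hood, the chat follows a standard graph-RAG pattern: fetch relevant triples from Neo4j, inline them as context, and let the LLM answer from that context. The sketch below is illustrative only; the real logic lives in src/rag/, and the Cypher query, model name, and prompts here are assumptions, not the project's exact implementation.

# Hypothetical "Ask the Graph" flow (illustrative; see src/rag/ for the real logic)
from neo4j import GraphDatabase
from openai import OpenAI

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def ask_graph(question: str) -> str:
    # Pull a bounded set of triples to use as grounding context
    with driver.session() as session:
        rows = session.run(
            "MATCH (a:Entity)-[r]->(b:Entity) "
            "RETURN a.name AS src, type(r) AS rel, b.name AS dst LIMIT 100"
        )
        context = "\n".join(f"{r['src']} -{r['rel']}-> {r['dst']}" for r in rows)
    # Ask the LLM to answer strictly from the graph context
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using only this graph:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content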

Pipeline Flow

Document Upload β†’ Ingestion β†’ Extraction (LLM) β†’ Disambiguation β†’ Validation β†’ Neo4j β†’ Visualization
                   (0-10%)      (10-90%)          (90-94%)        (95%)      (96-100%)

🎯 Ontology-Constrained Extraction

KG-Foundry can constrain the LLM to use only entity and relation types defined in your ontologies. This ensures extracted data conforms to your domain schema.

How It Works

When EXTRACTION_ONTOLOGIES_FILES is configured:

  1. Ontology Parsing: Loads specified .ttl files and extracts:

    • owl:Class definitions β†’ allowed entity types
    • owl:ObjectProperty definitions β†’ allowed relation types (with domain/range constraints)
  2. Prompt Injection: The LLM system prompt includes:

    IMPORTANT CONSTRAINT - ENTITY TYPES:
    You MUST use ONLY the following entity types:
      - Asset
      - CyberAsset
      - Attack
      - Vulnerability
      ...
    
    IMPORTANT CONSTRAINT - RELATIONSHIP TYPES:
    You MUST use ONLY the following relationship types:
      - targets (from: Attack β†’ to: Asset)
      - exploits (from: Threat β†’ to: Vulnerability)
      ...
    
  3. Constrained Extraction: The LLM extracts entities and relations using only the allowed types.
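
Step 1 above can be approximated with rdflib. This is a minimal sketch, not the project's actual loader in src/extraction/ontology_constraints.py; the label fallback and URI splitting are simplified assumptions.

# Simplified ontology parsing sketch with rdflib (not the real loader)
from rdflib import Graph, RDF, RDFS, OWL

g = Graph()
g.parse("./data/ontologies/powergrids_cps.ttl", format="turtle")

# owl:Class definitions -> allowed entity types
entity_types = sorted(
    str(g.value(c, RDFS.label) or c).rsplit("#", 1)[-1]
    for c in g.subjects(RDF.type, OWL.Class)
)

# owl:ObjectProperty definitions -> allowed relation types with domain/range
relation_types = [
    (str(g.value(p, RDFS.label) or p).rsplit("#", 1)[-1],
     g.value(p, RDFS.domain), g.value(p, RDFS.range))
    for p in g.subjects(RDF.type, OWL.ObjectProperty)
]

print(f"{len(entity_types)} entity types, {len(relation_types)} relation types")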

Configuration

# Enable ontology-constrained extraction
EXTRACTION_ONTOLOGIES_FILES=powergrids_cps.ttl,d3fend.ttl
EXTRACTION_ONTOLOGIES_DIR=./data/ontologies/
EXTRACTION_MODEL_NAME=gpt-4o

Example: Power Grid Ontology

Using powergrids_cps.ttl:

  • 78 entity types: Asset, CyberAsset, PhysicalAsset, PLC, RTU, SCADA server, Attack, Vulnerability...
  • 34 relation types: targets, exploits, contains, hostsDevice, usesProtocol...

Verify Configuration

Check that ontologies are loaded via the logs page or API:

# Via API
curl http://localhost:8000/config | jq .extraction

# Response
{
  "model_name": "gpt-4o",
  "ontologies_files": ["powergrids_cps.ttl", "d3fend.ttl"],
  "constraints_active": true,
  "ontologies_status": [
    {"file": "powergrids_cps.ttl", "exists": true, "size_kb": 16.8},
    {"file": "d3fend.ttl", "exists": true, "size_kb": 3351.9}
  ]
}

πŸ”— Entity Disambiguation

KG-Foundry features an entity disambiguation system, inspired by the MHEL-LLaMo approach, that links extracted entities to knowledge base concepts.

How It Works

  1. Multi-Ontology Loading: Automatically loads all .ttl (Turtle/RDF) ontology files from ./data/ontologies/

    • Currently includes D3FEND (MITRE cybersecurity ontology with 5,014 entities)
    • Supports standard RDF/OWL predicates (rdfs:label, d3f:definition, etc.)
  2. 3-Tier Candidate Search:

    • Exact matching: Label normalization and exact string matching (score: 1.0)
    • Type-based matching: Matches by entity type (ATTACK, TECHNIQUE, etc.) (score: 0.6)
    • Partial matching: Substring and fuzzy matching (score: 0.4+)
  3. Adaptive Threshold Filtering:

    • High-confidence matches (β‰₯ threshold) β†’ automatic linking
    • Low-confidence matches (< threshold) β†’ LLM disambiguation
  4. LLM Prompt Chaining (for hard cases):

    • Step 1: NIL prediction - Does the entity exist in the ontology?
    • Step 2: Candidate selection - Which ontology concept is it?
  5. Resource Management:

    • Maximum processing time limit (default: 300s)
    • Token budget control for API costs (default: 100,000 tokens)
    • Graceful fallback to NIL on timeout/errors
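
For intuition, the three candidate-search tiers above could be scored roughly as follows. This is an illustrative sketch only (the real engine is src/disambiguation/disambiguator.py); the normalization and fuzzy metric are assumptions.

# Illustrative 3-tier candidate scoring (not the actual implementation)
from difflib import SequenceMatcher

def score_candidate(mention: str, mention_type: str, candidate: dict) -> float:
    m, label = mention.strip().lower(), candidate["label"].strip().lower()
    if m == label:                                   # Tier 1: exact match -> 1.0
        return 1.0
    if mention_type == candidate.get("type"):        # Tier 2: type-based -> 0.6
        return 0.6
    ratio = SequenceMatcher(None, m, label).ratio()
    if m in label or label in m or ratio > 0.8:      # Tier 3: partial/fuzzy -> 0.4+
        return 0.4 + 0.2 * ratio
    return 0.0

# Candidates at or above DISAMBIGUATION_THRESHOLD link automatically;
# the rest go to the LLM prompt chain (NIL prediction, then selection).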

Disambiguation Metadata

Each disambiguated entity includes:

  • canonical_id: URI from ontology (e.g., d3f:Phishing)
  • canonical_label: Standard name from ontology
  • disambiguation_confidence: Score 0-100
  • disambiguation_method: How it was disambiguated (auto_exact, llm_selection, etc.)
  • candidates: Alternative entities considered
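
For illustration, a disambiguated entity might carry metadata like this (all values hypothetical):

{
  "name": "phishing email",
  "canonical_id": "d3f:Phishing",
  "canonical_label": "Phishing",
  "disambiguation_confidence": 95,
  "disambiguation_method": "auto_exact",
  "candidates": ["d3f:Phishing"]
}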

Configuration

Control disambiguation behavior via environment variables:

DISAMBIGUATION_ENABLED=true              # Enable/disable feature
DISAMBIGUATION_THRESHOLD=0.5             # Confidence threshold (0-1)
DISAMBIGUATION_ONTOLOGIES_DIR=./data/ontologies/  # Ontology directory
MAX_PROCESSING_TIME=300                  # Timeout in seconds
MAX_PROCESSING_TOKENS=100000             # Token budget limit

Threshold tuning:

  • Lower (0.3-0.5): More recall, more LLM calls, higher cost
  • Higher (0.7-0.9): More precision, fewer LLM calls, lower cost

Adding Custom Ontologies

  1. Place your .ttl (Turtle) ontology files in ./data/ontologies/
  2. Ensure they follow standard RDF/OWL conventions:
    • Use rdfs:label or skos:prefLabel for entity names
    • Use rdfs:comment or d3f:definition for descriptions
    • Define types with rdf:type and rdfs:subClassOf
  3. Restart the application - ontologies are loaded automatically
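
A minimal hypothetical entry following these conventions (the ex: namespace and class names are made up for illustration):

@prefix ex:   <http://example.org/grid#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:CyberAsset a owl:Class ;
    rdfs:label "Cyber Asset" .

ex:RTU a owl:Class ;
    rdfs:subClassOf ex:CyberAsset ;
    rdfs:label "Remote Terminal Unit" ;
    rdfs:comment "Field device that relays sensor data to SCADA systems." .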

Querying Disambiguation Results

View disambiguation data in Neo4j:

// Get all disambiguated entities
MATCH (e:Entity)
WHERE e.canonical_id IS NOT NULL
RETURN e.name, e.canonical_label, e.disambiguation_confidence, e.disambiguation_method
ORDER BY e.disambiguation_confidence DESC

// Get NIL entities (not found in ontology)
MATCH (e:Entity)
WHERE e.canonical_id = "NIL"
RETURN e.name, e.type, e.description

// Statistics by disambiguation method
MATCH (e:Entity)
RETURN e.disambiguation_method as method,
       COUNT(e) as count,
       AVG(e.disambiguation_confidence) as avg_confidence

πŸ›‘οΈ Validation

The graph structure is validated against SHACL shapes defined in data/shapes/schema.ttl. This ensures that every Entity has a name and a type and that relations are properly formed.
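
Validation relies on pyshacl. A minimal standalone sketch (the data-graph file name is a hypothetical export; the shapes path matches the repository layout):

# Validate a data graph against the SHACL shapes with pyshacl
from pyshacl import validate
from rdflib import Graph

data = Graph().parse("my_graph.ttl", format="turtle")     # hypothetical export
shapes = Graph().parse("data/shapes/schema.ttl", format="turtle")

conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
print("Conforms:", conforms)
if not conforms:
    print(report_text)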

πŸ§ͺ Testing

Test Disambiguation Module

# Activate virtual environment
source venv/bin/activate

# Test ontology loader
python3 -c "
from src.disambiguation.ontology_loader import OntologyLoader
loader = OntologyLoader('./data/ontologies/')
index = loader.build_index()
print(f'Loaded {len(index)} entities from ontologies')
"

# Test disambiguator
python3 -c "
from src.extraction.schema import Entity
from src.disambiguation.disambiguator import EntityDisambiguator
from dotenv import load_dotenv
load_dotenv()

disambiguator = EntityDisambiguator(threshold=0.5)
entity = Entity(name='Phishing', type='ATTACK', description='Email attack')
candidates = disambiguator._find_candidates(entity, 'phishing attack context', top_k=3)
for i, c in enumerate(candidates):
    print(f'{i+1}. {c[\"label\"]} (score: {c[\"score\"]:.2f})')
"

Test Ontology Constraints

# Test ontology constraints loader
python3 -c "
from src.extraction.ontology_constraints import OntologyConstraintsLoader
loader = OntologyConstraintsLoader(['powergrids_cps.ttl'], './data/ontologies/')
print(f'Entity types: {len(loader.entity_types)}')
print(f'Relation types: {len(loader.relation_types)}')
print(loader.format_for_prompt()[:500])
"

Run Unit Tests

# First time: install package in development mode
pip install -e .

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_ontology_constraints.py -v

# With coverage
pytest --cov=src tests/

See TESTING.md for a comprehensive testing guide.

πŸ“Š Pipeline Logs & Monitoring

KG-Foundry includes a real-time logging system to monitor the processing pipeline.

Logs Page

Access the logs page at http://localhost:3000/logs or click the "Show Logs" button in the main interface.

The logs page displays:

  • Current Configuration: Extraction settings, disambiguation settings, Neo4j connection
  • Ontology Files Status: Which files are loaded, their sizes, and validation
  • Pipeline Logs: Real-time logs from each processing stage:
    • load - Document loading
    • extraction - Entity/relation extraction
    • disambiguation - Entity linking to ontology
    • validation - SHACL validation
    • storage - Neo4j persistence

API Endpoints

# Get pipeline logs (last 100 entries)
curl http://localhost:8000/logs

# Get logs filtered by stage
curl "http://localhost:8000/logs?stage=extraction&limit=50"

# Get current configuration
curl http://localhost:8000/config

# Clear logs
curl -X DELETE http://localhost:8000/logs

Log Levels

Level     Color    Description
debug     Gray     Detailed debug information
info      Blue     General information
warning   Yellow   Non-critical issues
error     Red      Errors that were handled
success   Green    Successful operations

Example Log Output

{
  "timestamp": "2024-01-16T10:30:24.123456",
  "level": "success",
  "stage": "extraction",
  "message": "Ontology constraints loaded",
  "details": {
    "entity_types_count": 78,
    "relation_types_count": 34,
    "files": ["powergrids_cps.ttl"]
  }
}

Container Logs

# View backend logs (Docker/Podman)
docker-compose logs -f backend

# Or with podman
podman-compose logs -f backend

🎯 Performance Tips

  1. Reduce LLM Costs:

    • Increase DISAMBIGUATION_THRESHOLD to 0.7-0.8
    • Most exact matches will be automatic (no API calls)
  2. Improve Recall:

    • Decrease DISAMBIGUATION_THRESHOLD to 0.3-0.4
    • More entities will be sent to LLM for disambiguation
  3. Handle Large Documents:

    • Increase MAX_PROCESSING_TIME for longer processing
    • Increase MAX_PROCESSING_TOKENS for more API budget
  4. Optimize Ontology Loading:

    • Keep ontology files organized in ./data/ontologies/
    • Remove unused ontologies to speed up startup

πŸ“š Documentation

  • Implementation Specifications: See ./instructions/entities_desambiguation_pipeline.md
  • MVP Summary: See ./MVP_IMPLEMENTATION_SUMMARY.md
  • API Documentation: Available at http://localhost:8000/docs when running
  • MHEL-LLaMo Reference: See ./instructions/MHEL-LLAMO/ for original research implementation

🀝 Contributing

Contributions are welcome! Areas for improvement (items marked "Implemented" are already in place):

  • Semantic search with embeddings (Phase 2)
  • Ontology-constrained extraction (Implemented)
  • Pipeline logs and monitoring page (Implemented)
  • Frontend visualization of disambiguation metadata
  • Additional ontology integrations (MITRE ATT&CK, etc.)
  • Disambiguation result caching
  • Batch processing optimization
  • Multi-language ontology support
  • Export graph to various formats (RDF, JSON-LD, etc.)

πŸ“„ License

[Add your license information here]

πŸ™ Acknowledgments

  • MHEL-LLaMo: Entity disambiguation approach inspired by "It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models"
  • SPR-RAG: Semantic parsing and retrieval concepts
  • D3FEND: MITRE cybersecurity ontology
  • LangChain: Document processing framework
  • Instructor: Structured LLM outputs

πŸ“ž Support

For issues, questions, or feature requests, please open an issue on GitHub.


Built with ❀️ using Claude Code
