kg-foundry is a modern, end-to-end pipeline for transforming unstructured textual documents (PDF, Markdown, Text) into a high-quality, validated, and visualizable Knowledge Graph (KG). It leverages Large Language Models (LLMs) for extraction and Retrieval-Augmented Generation (RAG) for interactive querying.
- Ingestion: Support for `.txt`, `.md`, and `.pdf` documents.
- Extraction: LLM-based extraction of Entities and Relations using Instructor for structured output (see the sketch after this list).
- Ontology-Constrained Extraction: Optionally constrain entity/relation types to match a specific ontology schema.
- Entity Disambiguation: Automatic linking of extracted entities to knowledge bases (ontologies) using a hybrid approach:
  - Multi-ontology support (RDF/OWL TTL files)
  - Adaptive threshold filtering (exact, type-based, and partial matching)
  - LLM-based disambiguation for ambiguous cases (MHEL-LLaMo inspired)
  - NIL detection for entities not in the ontology
- Validation: Graph quality assurance using SHACL (Shapes Constraint Language) and `pyshacl`.
- Storage: Persistent graph storage using Neo4j with disambiguation metadata.
- Visualization: Interactive graph exploration with Cytoscape.js.
- RAG (Chat): "Ask the Graph" feature to answer questions based on the knowledge graph context.
- Pipeline Logs: Real-time monitoring of the processing pipeline with detailed logs and configuration view.
- Containerization: Full deployment support with Podman / Docker.
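For orientation, here is a minimal sketch of the Instructor extraction pattern the pipeline relies on. The `Entity` fields (name, type, description) match how `src/extraction/schema.py` is used elsewhere in this README; the `Relation` fields shown here are assumptions for illustration.

```python
import instructor
from openai import OpenAI
from pydantic import BaseModel


class Entity(BaseModel):
    name: str
    type: str
    description: str


class Relation(BaseModel):
    source: str  # assumed field layout; see src/extraction/schema.py
    target: str
    type: str


class ExtractionResult(BaseModel):
    entities: list[Entity]
    relations: list[Relation]


# Instructor patches the OpenAI client so responses are parsed and
# validated directly into the Pydantic models above.
client = instructor.from_openai(OpenAI())

result = client.chat.completions.create(
    model="gpt-4o",
    response_model=ExtractionResult,
    messages=[
        {"role": "system", "content": "Extract entities and relations from the text."},
        {"role": "user", "content": "Phishing campaigns target SCADA servers."},
    ],
)
print(result.entities, result.relations)
```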
- Backend: Python 3.11, FastAPI, LangChain, Instructor, Pydantic, rdflib.
- Frontend: Next.js 14, TypeScript, Tailwind CSS, Cytoscape.js.
- Database: Neo4j (Graph DB).
- AI/LLM: OpenAI GPT-4o (configurable).
- Ontologies: RDF/OWL support with D3FEND (MITRE) cybersecurity ontology.
```
kg-foundry/
├── src/
│   ├── api/                         # FastAPI routes and entry point
│   │   ├── main.py                  # FastAPI application
│   │   ├── routes.py                # API endpoints
│   │   └── logs.py                  # Pipeline logging system
│   ├── extraction/                  # LLM extraction logic (Instructor)
│   │   ├── extractor.py             # Main extraction engine
│   │   ├── schema.py                # Pydantic models (Entity, Relation)
│   │   ├── config.py                # Extraction configuration
│   │   └── ontology_constraints.py  # Ontology-based type constraints
│   ├── disambiguation/              # Entity disambiguation module
│   │   ├── disambiguator.py         # Main disambiguation engine
│   │   ├── ontology_loader.py       # Multi-ontology loader (RDF/OWL)
│   │   └── config.py                # Configuration management
│   ├── graph/                       # Neo4j client wrapper
│   ├── ingestion/                   # Document loaders
│   ├── rag/                         # Retrieval-Augmented Generation logic
│   └── validation/                  # SHACL validation logic
├── data/
│   ├── shapes/                      # SHACL shape definitions (.ttl)
│   └── ontologies/                  # Knowledge base ontologies (.ttl)
│       ├── d3fend.ttl               # D3FEND cybersecurity ontology (MITRE)
│       └── powergrids_cps.ttl       # Power grid cyber-physical security ontology
├── frontend/                        # Next.js web application
│   └── app/
│       ├── page.tsx                 # Main graph visualization page
│       └── logs/page.tsx            # Pipeline logs & config page
├── tests/                           # Unit and integration tests
├── docker-compose.yml               # Podman/Docker orchestration
├── .env.example                     # Environment variables template
└── requirements.txt                 # Python dependencies
```
- OpenAI API Key: Required for extraction and RAG.
- Neo4j: Required for storage (provided via Docker).
- Python 3.11+ & Node.js 18+ (for local dev).
- Podman or Docker (for containerized run).
- Create a `.env` file at the root (see `.env.example` for a template; a settings sketch for reading these variables follows the steps below):

```
# OpenAI Configuration (REQUIRED)
OPENAI_API_KEY=sk-your-key-here

# Neo4j Configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=password

# Extraction Configuration (optional)
EXTRACTION_MODEL_NAME=gpt-4o
EXTRACTION_ONTOLOGIES_DIR=./data/ontologies/
# Constrain extraction to specific ontology types (comma-separated)
# EXTRACTION_ONTOLOGIES_FILES=powergrids_cps.ttl,d3fend.ttl

# Entity Disambiguation (optional)
DISAMBIGUATION_ENABLED=true
DISAMBIGUATION_THRESHOLD=0.5
DISAMBIGUATION_ONTOLOGIES_DIR=./data/ontologies/
MAX_PROCESSING_TIME=300
MAX_PROCESSING_TOKENS=100000
```
- Build and start the services:
```
podman-compose up --build
```
- Access the application:
- Frontend: http://localhost:3000
- Pipeline Logs: http://localhost:3000/logs
- API Docs: http://localhost:8000/docs
- Neo4j Browser: http://localhost:7474 (user: `neo4j`, password: `password`)
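As a side note for local scripting, the `.env` variables above can be read with `pydantic-settings`. A minimal sketch, assuming the field names mirror the template (the project's own `config.py` modules may be structured differently):

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class Settings(BaseSettings):
    # Reads OPENAI_API_KEY, NEO4J_URI, ... from the environment or .env
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    openai_api_key: str
    neo4j_uri: str = "bolt://localhost:7687"
    neo4j_user: str = "neo4j"
    neo4j_password: str = "password"
    extraction_model_name: str = "gpt-4o"
    disambiguation_enabled: bool = True
    disambiguation_threshold: float = 0.5
    max_processing_time: int = 300
    max_processing_tokens: int = 100_000


settings = Settings()
print(settings.extraction_model_name)
```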
```
# Create virtual env
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set env vars
export OPENAI_API_KEY=sk-...
export NEO4J_URI=bolt://localhost:7687
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password

# Run server
uvicorn src.api.main:app --reload
```

```
cd frontend
npm install
npm run dev
```

- Open the web interface at http://localhost:3000.
- Click "Choose File" and select a text or PDF document.
- Click "Process". The system will:
  - Extract entities and relations using the LLM.
  - Disambiguate entities by linking them to ontology concepts (if enabled).
  - Validate them against the SHACL schema.
  - Store them in Neo4j with disambiguation metadata.
  - Display the resulting Knowledge Graph interactively.
- Use the "Ask the Graph" chat box below the visualization to ask questions about the ingested content.
```
Document Upload → Ingestion → Extraction (LLM) → Disambiguation → Validation → Neo4j → Visualization
                   (0-10%)      (10-90%)          (90-94%)         (95%)      (96-100%)
```
KG-Foundry can constrain the LLM to use only entity and relation types defined in your ontologies. This ensures extracted data conforms to your domain schema.
When `EXTRACTION_ONTOLOGIES_FILES` is configured:

1. Ontology Parsing: Loads the specified `.ttl` files and extracts:
   - `owl:Class` definitions → allowed entity types
   - `owl:ObjectProperty` definitions → allowed relation types (with domain/range constraints)

2. Prompt Injection: The LLM system prompt includes:

```
IMPORTANT CONSTRAINT - ENTITY TYPES:
You MUST use ONLY the following entity types:
- Asset
- CyberAsset
- Attack
- Vulnerability
...

IMPORTANT CONSTRAINT - RELATIONSHIP TYPES:
You MUST use ONLY the following relationship types:
- targets (from: Attack → to: Asset)
- exploits (from: Threat → to: Vulnerability)
...
```

3. Constrained Extraction: The LLM extracts entities and relations using only the allowed types.
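A minimal sketch of the ontology parsing step using `rdflib` (the real logic lives in `src/extraction/ontology_constraints.py` and may differ in detail):

```python
from rdflib import Graph, RDF, RDFS, OWL

g = Graph()
g.parse("./data/ontologies/powergrids_cps.ttl", format="turtle")

# owl:Class definitions -> allowed entity types
entity_types = {
    str(g.value(cls, RDFS.label) or cls)
    for cls in g.subjects(RDF.type, OWL.Class)
}

# owl:ObjectProperty definitions -> allowed relation types,
# keeping the domain/range (from/to) constraints
relation_types = {}
for prop in g.subjects(RDF.type, OWL.ObjectProperty):
    name = str(g.value(prop, RDFS.label) or prop)
    relation_types[name] = (g.value(prop, RDFS.domain), g.value(prop, RDFS.range))

print(f"{len(entity_types)} entity types, {len(relation_types)} relation types")
```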
```
# Enable ontology-constrained extraction
EXTRACTION_ONTOLOGIES_FILES=powergrids_cps.ttl,d3fend.ttl
EXTRACTION_ONTOLOGIES_DIR=./data/ontologies/
EXTRACTION_MODEL_NAME=gpt-4o
```

Using `powergrids_cps.ttl`:

- 78 entity types: Asset, CyberAsset, PhysicalAsset, PLC, RTU, SCADA server, Attack, Vulnerability...
- 34 relation types: targets, exploits, contains, hostsDevice, usesProtocol...
Check that ontologies are loaded via the logs page or API:
```
# Via API
curl http://localhost:8000/config | jq .extraction

# Response
{
  "model_name": "gpt-4o",
  "ontologies_files": ["powergrids_cps.ttl", "d3fend.ttl"],
  "constraints_active": true,
  "ontologies_status": [
    {"file": "powergrids_cps.ttl", "exists": true, "size_kb": 16.8},
    {"file": "d3fend.ttl", "exists": true, "size_kb": 3351.9}
  ]
}
```

KG-Foundry features an advanced entity disambiguation system, inspired by the MHEL-LLaMo approach, that links extracted entities to knowledge base concepts.
1. Multi-Ontology Loading: Automatically loads all `.ttl` (Turtle/RDF) ontology files from `./data/ontologies/`:
   - Currently includes D3FEND (the MITRE cybersecurity ontology, with 5,014 entities)
   - Supports standard RDF/OWL predicates (`rdfs:label`, `d3f:definition`, etc.)

2. 3-Tier Candidate Search (see the scoring sketch after this list):
   - Exact matching: label normalization and exact string matching (score: 1.0)
   - Type-based matching: matches by entity type (ATTACK, TECHNIQUE, etc.) (score: 0.6)
   - Partial matching: substring and fuzzy matching (score: 0.4+)

3. Adaptive Threshold Filtering:
   - High-confidence matches (≥ threshold) → automatic linking
   - Low-confidence matches (< threshold) → LLM disambiguation

4. LLM Prompt Chaining (for hard cases):
   - Step 1: NIL prediction - does the entity exist in the ontology?
   - Step 2: Candidate selection - which ontology concept is it?

5. Resource Management:
   - Maximum processing time limit (default: 300s)
   - Token budget control for API costs (default: 100,000 tokens)
   - Graceful fallback to NIL on timeout/errors
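A simplified sketch of the tiered scoring and threshold decision described above (illustrative only; the actual implementation is in `src/disambiguation/disambiguator.py`):

```python
def score_candidate(mention: str, mention_type: str,
                    label: str, label_type: str) -> float:
    m, l = mention.strip().lower(), label.strip().lower()
    if m == l:                      # tier 1: exact match after normalization
        return 1.0
    if mention_type == label_type:  # tier 2: type-based match
        return 0.6
    if m in l or l in m:            # tier 3: partial / substring match
        return 0.4
    return 0.0


THRESHOLD = 0.5
score = score_candidate("phishing", "ATTACK", "Phishing", "TECHNIQUE")
if score >= THRESHOLD:
    print("auto-link")              # high confidence: link automatically
else:
    print("LLM disambiguation")     # low confidence: NIL prediction + selection
```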
Each disambiguated entity includes:

- `canonical_id`: URI from the ontology (e.g., `d3f:Phishing`)
- `canonical_label`: Standard name from the ontology
- `disambiguation_confidence`: Score from 0-100
- `disambiguation_method`: How it was disambiguated (`auto_exact`, `llm_selection`, etc.)
- `candidates`: Alternative entities considered
Control disambiguation behavior via environment variables:
```
DISAMBIGUATION_ENABLED=true                       # Enable/disable feature
DISAMBIGUATION_THRESHOLD=0.5                      # Confidence threshold (0-1)
DISAMBIGUATION_ONTOLOGIES_DIR=./data/ontologies/  # Ontology directory
MAX_PROCESSING_TIME=300                           # Timeout in seconds
MAX_PROCESSING_TOKENS=100000                      # Token budget limit
```

Threshold tuning:

- Lower (0.3-0.5): more recall, more LLM calls, higher cost
- Higher (0.7-0.9): more precision, fewer LLM calls, lower cost
1. Place your `.ttl` (Turtle) ontology files in `./data/ontologies/`
2. Ensure they follow standard RDF/OWL conventions (see the sketch after this list):
   - Use `rdfs:label` or `skos:prefLabel` for entity names
   - Use `rdfs:comment` or `d3f:definition` for descriptions
   - Define types with `rdf:type` and `rdfs:subClassOf`
3. Restart the application; ontologies are loaded automatically
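To sanity-check these conventions, you can build a conforming class definition with `rdflib` and inspect the Turtle output (the `example.org` URIs here are purely illustrative):

```python
from rdflib import Graph, Literal, URIRef, RDF, RDFS, OWL

g = Graph()
phishing = URIRef("http://example.org/onto#Phishing")
attack = URIRef("http://example.org/onto#Attack")

g.add((phishing, RDF.type, OWL.Class))              # type definition
g.add((phishing, RDFS.subClassOf, attack))          # hierarchy
g.add((phishing, RDFS.label, Literal("Phishing")))  # entity name
g.add((phishing, RDFS.comment,
       Literal("Deceptive messages used to steal credentials.")))  # description

print(g.serialize(format="turtle"))
```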
View disambiguation data in Neo4j:
```
// Get all disambiguated entities
MATCH (e:Entity)
WHERE e.canonical_id IS NOT NULL
RETURN e.name, e.canonical_label, e.disambiguation_confidence, e.disambiguation_method
ORDER BY e.disambiguation_confidence DESC

// Get NIL entities (not found in ontology)
MATCH (e:Entity)
WHERE e.canonical_id = "NIL"
RETURN e.name, e.type, e.description

// Statistics by disambiguation method
MATCH (e:Entity)
RETURN e.disambiguation_method AS method,
       COUNT(e) AS count,
       AVG(e.disambiguation_confidence) AS avg_confidence
```

The graph structure is validated against the SHACL shapes defined in `data/shapes/schema.ttl`. This ensures that every Entity has a name and a type, and that relations are properly formed.
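`pyshacl` can also be run standalone against the shapes file. A minimal sketch, assuming the graph has been exported to a hypothetical `graph_export.ttl`:

```python
from pyshacl import validate
from rdflib import Graph

data = Graph().parse("graph_export.ttl", format="turtle")  # hypothetical export
shapes = Graph().parse("./data/shapes/schema.ttl", format="turtle")

conforms, _, report_text = validate(data, shacl_graph=shapes, inference="rdfs")
print("Conforms:", conforms)
if not conforms:
    print(report_text)  # human-readable list of constraint violations
```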
```
# Activate virtual environment
source venv/bin/activate

# Test ontology loader
python3 -c "
from src.disambiguation.ontology_loader import OntologyLoader
loader = OntologyLoader('./data/ontologies/')
index = loader.build_index()
print(f'Loaded {len(index)} entities from ontologies')
"

# Test disambiguator
python3 -c "
from src.extraction.schema import Entity
from src.disambiguation.disambiguator import EntityDisambiguator
from dotenv import load_dotenv
load_dotenv()
disambiguator = EntityDisambiguator(threshold=0.5)
entity = Entity(name='Phishing', type='ATTACK', description='Email attack')
candidates = disambiguator._find_candidates(entity, 'phishing attack context', top_k=3)
for i, c in enumerate(candidates):
    print(f'{i+1}. {c[\"label\"]} (score: {c[\"score\"]:.2f})')
"

# Test ontology constraints loader
python3 -c "
from src.extraction.ontology_constraints import OntologyConstraintsLoader
loader = OntologyConstraintsLoader(['powergrids_cps.ttl'], './data/ontologies/')
print(f'Entity types: {len(loader.entity_types)}')
print(f'Relation types: {len(loader.relation_types)}')
print(loader.format_for_prompt()[:500])
"
```

```
# First time: install package in development mode
pip install -e .

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_ontology_constraints.py -v

# With coverage
pytest --cov=src tests/
```

See TESTING.md for a comprehensive testing guide.
KG-Foundry includes a real-time logging system to monitor the processing pipeline.
Access the logs page at http://localhost:3000/logs or click the "Show Logs" button in the main interface.
The logs page displays:
- Current Configuration: Extraction settings, disambiguation settings, Neo4j connection
- Ontology Files Status: Which files are loaded, their sizes, and validation
- Pipeline Logs: Real-time logs from each processing stage:
  - `load` - Document loading
  - `extraction` - Entity/relation extraction
  - `disambiguation` - Entity linking to ontology
  - `validation` - SHACL validation
  - `storage` - Neo4j persistence
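Conceptually, the log store is a bounded in-memory buffer of structured entries. A minimal sketch in the spirit of `src/api/logs.py` (the actual implementation may differ):

```python
from collections import deque
from datetime import datetime

LOGS: deque = deque(maxlen=1000)  # keep only the most recent entries


def log(stage: str, level: str, message: str, details: dict | None = None) -> None:
    LOGS.append({
        "timestamp": datetime.utcnow().isoformat(),
        "level": level,    # debug / info / warning / error / success
        "stage": stage,    # load / extraction / disambiguation / validation / storage
        "message": message,
        "details": details or {},
    })


log("extraction", "success", "Ontology constraints loaded",
    {"entity_types_count": 78, "relation_types_count": 34})
```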
```
# Get pipeline logs (last 100 entries)
curl http://localhost:8000/logs

# Get logs filtered by stage
curl "http://localhost:8000/logs?stage=extraction&limit=50"

# Get current configuration
curl http://localhost:8000/config

# Clear logs
curl -X DELETE http://localhost:8000/logs
```

| Level | Color | Description |
|---|---|---|
| `debug` | Gray | Detailed debug information |
| `info` | Blue | General information |
| `warning` | Yellow | Non-critical issues |
| `error` | Red | Errors that were handled |
| `success` | Green | Successful operations |
```
{
  "timestamp": "2024-01-16T10:30:24.123456",
  "level": "success",
  "stage": "extraction",
  "message": "Ontology constraints loaded",
  "details": {
    "entity_types_count": 78,
    "relation_types_count": 34,
    "files": ["powergrids_cps.ttl"]
  }
}
```

```
# View backend logs (Docker/Podman)
docker-compose logs -f backend

# Or with podman
podman-compose logs -f backend
```
1. Reduce LLM Costs:
   - Increase `DISAMBIGUATION_THRESHOLD` to 0.7-0.8
   - Most exact matches will then be linked automatically (no API calls)

2. Improve Recall:
   - Decrease `DISAMBIGUATION_THRESHOLD` to 0.3-0.4
   - More entities will be sent to the LLM for disambiguation

3. Handle Large Documents:
   - Increase `MAX_PROCESSING_TIME` to allow longer processing
   - Increase `MAX_PROCESSING_TOKENS` for a larger API budget

4. Optimize Ontology Loading:
   - Keep ontology files organized in `./data/ontologies/`
   - Remove unused ontologies to speed up startup
- Implementation Specifications: see `./instructions/entities_desambiguation_pipeline.md`
- MVP Summary: see `./MVP_IMPLEMENTATION_SUMMARY.md`
- API Documentation: available at http://localhost:8000/docs when the stack is running
- MHEL-LLaMo Reference: see `./instructions/MHEL-LLAMO/` for the original research implementation
Contributions are welcome! Areas for improvement:
- Semantic search with embeddings (Phase 2)
- Ontology-constrained extraction (Implemented)
- Pipeline logs and monitoring page (Implemented)
- Frontend visualization of disambiguation metadata
- Additional ontology integrations (MITRE ATT&CK, etc.)
- Disambiguation result caching
- Batch processing optimization
- Multi-language ontology support
- Export graph to various formats (RDF, JSON-LD, etc.)
[Add your license information here]
- MHEL-LLaMo: Entity disambiguation approach inspired by "It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models"
- SPR-RAG: Semantic parsing and retrieval concepts
- D3FEND: MITRE cybersecurity ontology
- LangChain: Document processing framework
- Instructor: Structured LLM outputs
For issues, questions, or feature requests, please open an issue on GitHub.
Built with ❤️ using Claude Code