kg-foundry

kg-foundry is a modern, end-to-end pipeline for transforming unstructured textual documents (PDF, Markdown, Text) into a high-quality, validated, and visualizable Knowledge Graph (KG). It leverages Large Language Models (LLMs) for extraction and Retrieval-Augmented Generation (RAG) for interactive querying.

πŸš€ Features

  • Ingestion: Support for .txt, .md, and .pdf documents.
  • Extraction: LLM-based extraction of Entities and Relations using Instructor (structured output).
    • Ontology-Constrained Extraction: Optionally constrain entity/relation types to match a specific ontology schema.
  • Entity Disambiguation: Automatic linking of extracted entities to knowledge bases (ontologies) using a hybrid approach:
    • Multi-ontology support (RDF/OWL TTL files)
    • Adaptive threshold filtering (exact, type-based, and partial matching)
    • LLM-based disambiguation for ambiguous cases (MHEL-LLaMo inspired)
    • NIL detection for entities not in ontology
  • Validation: Graph quality assurance using SHACL (Shapes Constraint Language) and pyshacl.
  • Storage: Persistent graph storage using Neo4j with disambiguation metadata.
  • Visualization: Interactive graph exploration with Cytoscape.js.
  • RAG (Chat): "Ask the Graph" feature to answer questions based on the knowledge graph context.
  • Pipeline Logs: Real-time monitoring of the processing pipeline with detailed logs and configuration view.
  • Containerization: Full deployment support with Podman / Docker.

πŸ› οΈ Technology Stack

  • Backend: Python 3.11, FastAPI, LangChain, Instructor, Pydantic, rdflib.
  • Frontend: Next.js 14, TypeScript, Tailwind CSS, Cytoscape.js.
  • Database: Neo4j (Graph DB).
  • AI/LLM: OpenAI GPT-4o (configurable).
  • Ontologies: RDF/OWL support with D3FEND (MITRE) cybersecurity ontology.

πŸ“‚ Project Structure

kg-foundry/
β”œβ”€β”€ src/
β”‚   β”œβ”€β”€ api/            # FastAPI routes and entry point
β”‚   β”‚   β”œβ”€β”€ main.py             # FastAPI application
β”‚   β”‚   β”œβ”€β”€ routes.py           # API endpoints
β”‚   β”‚   └── logs.py             # Pipeline logging system
β”‚   β”œβ”€β”€ extraction/     # LLM extraction logic (Instructor)
β”‚   β”‚   β”œβ”€β”€ extractor.py        # Main extraction engine
β”‚   β”‚   β”œβ”€β”€ schema.py           # Pydantic models (Entity, Relation)
β”‚   β”‚   β”œβ”€β”€ config.py           # Extraction configuration
β”‚   β”‚   └── ontology_constraints.py  # Ontology-based type constraints
β”‚   β”œβ”€β”€ disambiguation/ # Entity disambiguation module
β”‚   β”‚   β”œβ”€β”€ disambiguator.py    # Main disambiguation engine
β”‚   β”‚   β”œβ”€β”€ ontology_loader.py  # Multi-ontology loader (RDF/OWL)
β”‚   β”‚   └── config.py           # Configuration management
β”‚   β”œβ”€β”€ graph/          # Neo4j client wrapper
β”‚   β”œβ”€β”€ ingestion/      # Document loaders
β”‚   β”œβ”€β”€ rag/            # Retrieval-Augmented Generation logic
β”‚   └── validation/     # SHACL validation logic
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ shapes/         # SHACL shape definitions (.ttl)
β”‚   └── ontologies/     # Knowledge base ontologies (.ttl)
β”‚       β”œβ”€β”€ d3fend.ttl          # D3FEND cybersecurity ontology (MITRE)
β”‚       └── powergrids_cps.ttl  # Power grid cyber-physical security ontology
β”œβ”€β”€ frontend/           # Next.js web application
β”‚   └── app/
β”‚       β”œβ”€β”€ page.tsx            # Main graph visualization page
β”‚       └── logs/page.tsx       # Pipeline logs & config page
β”œβ”€β”€ tests/              # Unit and integration tests
β”œβ”€β”€ docker-compose.yml  # Podman/Docker orchestration
β”œβ”€β”€ .env.example        # Environment variables template
└── requirements.txt    # Python dependencies

⚑ Getting Started

Prerequisites

  • OpenAI API Key: Required for extraction and RAG.
  • Neo4j: Required for storage (provided via Docker).
  • Python 3.11+ & Node.js 18+ (for local dev).
  • Podman or Docker (for containerized run).

Option A: Run with Podman (Recommended)

  1. Create a .env file at the root (see .env.example for a template):
    # OpenAI Configuration (REQUIRED)
    OPENAI_API_KEY=sk-your-key-here
    
    # Neo4j Configuration
    NEO4J_URI=bolt://localhost:7687
    NEO4J_USER=neo4j
    NEO4J_PASSWORD=password
    
    # Extraction Configuration (optional)
    EXTRACTION_MODEL_NAME=gpt-4o
    EXTRACTION_ONTOLOGIES_DIR=./data/ontologies/
    # Constrain extraction to specific ontology types (comma-separated)
    # EXTRACTION_ONTOLOGIES_FILES=powergrids_cps.ttl,d3fend.ttl
    
    # Entity Disambiguation (optional)
    DISAMBIGUATION_ENABLED=true
    DISAMBIGUATION_THRESHOLD=0.5
    DISAMBIGUATION_ONTOLOGIES_DIR=./data/ontologies/
    MAX_PROCESSING_TIME=300
    MAX_PROCESSING_TOKENS=100000
  2. Build and start the services:
    podman-compose up --build
  3. Access the application: the frontend at http://localhost:3000 and the API at http://localhost:8000 (interactive docs at http://localhost:8000/docs).

Option B: Run Locally

1. Backend

# Create virtual env
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Set env vars
export OPENAI_API_KEY=sk-...
export NEO4J_URI=bolt://localhost:7687
export NEO4J_USER=neo4j
export NEO4J_PASSWORD=password

# Run server
uvicorn src.api.main:app --reload

2. Frontend

cd frontend
npm install
npm run dev

πŸ“– Usage

  1. Open the web interface at http://localhost:3000.
  2. Click "Choose File" and select a text or PDF document.
  3. Click "Process". The system will:
    • Extract entities and relations using LLM.
    • Disambiguate entities by linking them to ontology concepts (if enabled).
    • Validate them against the SHACL schema.
    • Store in Neo4j with disambiguation metadata.
    • Display the resulting Knowledge Graph interactively.
  4. Use the "Ask the Graph" chat box below the visualization to ask questions about the ingested content.
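
Under the hood, the chat follows a standard graph-RAG pattern: fetch relevant triples from Neo4j, inline them as context, and let the LLM answer from that context. The sketch below is illustrative only; the real logic lives in src/rag/, and the Cypher query, model name, and prompts here are assumptions, not the project's exact implementation.

# Hypothetical "Ask the Graph" flow (illustrative; see src/rag/ for the real logic)
from neo4j import GraphDatabase
from openai import OpenAI

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def ask_graph(question: str) -> str:
    # Pull a bounded set of triples to use as grounding context
    with driver.session() as session:
        rows = session.run(
            "MATCH (a:Entity)-[r]->(b:Entity) "
            "RETURN a.name AS src, type(r) AS rel, b.name AS dst LIMIT 100"
        )
        context = "\n".join(f"{r['src']} -{r['rel']}-> {r['dst']}" for r in rows)
    # Ask the LLM to answer strictly from the graph context
    resp = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using only this graph:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content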

Pipeline Flow

Document Upload β†’ Ingestion β†’ Extraction (LLM) β†’ Disambiguation β†’ Validation β†’ Neo4j β†’ Visualization
                   (0-10%)      (10-90%)          (90-94%)        (95%)      (96-100%)

🎯 Ontology-Constrained Extraction

KG-Foundry can constrain the LLM to use only entity and relation types defined in your ontologies. This ensures extracted data conforms to your domain schema.

How It Works

When EXTRACTION_ONTOLOGIES_FILES is configured:

  1. Ontology Parsing: Loads specified .ttl files and extracts:

    • owl:Class definitions β†’ allowed entity types
    • owl:ObjectProperty definitions β†’ allowed relation types (with domain/range constraints)
  2. Prompt Injection: The LLM system prompt includes:

    IMPORTANT CONSTRAINT - ENTITY TYPES:
    You MUST use ONLY the following entity types:
      - Asset
      - CyberAsset
      - Attack
      - Vulnerability
      ...
    
    IMPORTANT CONSTRAINT - RELATIONSHIP TYPES:
    You MUST use ONLY the following relationship types:
      - targets (from: Attack β†’ to: Asset)
      - exploits (from: Threat β†’ to: Vulnerability)
      ...
    
  3. Constrained Extraction: The LLM extracts entities and relations using only the allowed types.
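
Step 1 above can be approximated with rdflib. This is a minimal sketch, not the project's actual loader in src/extraction/ontology_constraints.py; the label fallback and URI splitting are simplified assumptions.

# Simplified ontology parsing sketch with rdflib (not the real loader)
from rdflib import Graph, RDF, RDFS, OWL

g = Graph()
g.parse("./data/ontologies/powergrids_cps.ttl", format="turtle")

# owl:Class definitions -> allowed entity types
entity_types = sorted(
    str(g.value(c, RDFS.label) or c).rsplit("#", 1)[-1]
    for c in g.subjects(RDF.type, OWL.Class)
)

# owl:ObjectProperty definitions -> allowed relation types with domain/range
relation_types = [
    (str(g.value(p, RDFS.label) or p).rsplit("#", 1)[-1],
     g.value(p, RDFS.domain), g.value(p, RDFS.range))
    for p in g.subjects(RDF.type, OWL.ObjectProperty)
]

print(f"{len(entity_types)} entity types, {len(relation_types)} relation types")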

Configuration

# Enable ontology-constrained extraction
EXTRACTION_ONTOLOGIES_FILES=powergrids_cps.ttl,d3fend.ttl
EXTRACTION_ONTOLOGIES_DIR=./data/ontologies/
EXTRACTION_MODEL_NAME=gpt-4o

Example: Power Grid Ontology

Using powergrids_cps.ttl:

  • 78 entity types: Asset, CyberAsset, PhysicalAsset, PLC, RTU, SCADA server, Attack, Vulnerability...
  • 34 relation types: targets, exploits, contains, hostsDevice, usesProtocol...

Verify Configuration

Check that ontologies are loaded via the logs page or API:

# Via API
curl http://localhost:8000/config | jq .extraction

# Response
{
  "model_name": "gpt-4o",
  "ontologies_files": ["powergrids_cps.ttl", "d3fend.ttl"],
  "constraints_active": true,
  "ontologies_status": [
    {"file": "powergrids_cps.ttl", "exists": true, "size_kb": 16.8},
    {"file": "d3fend.ttl", "exists": true, "size_kb": 3351.9}
  ]
}

πŸ”— Entity Disambiguation

KG-Foundry features an entity disambiguation system, inspired by the MHEL-LLaMo approach, that links extracted entities to knowledge base concepts.

How It Works

  1. Multi-Ontology Loading: Automatically loads all .ttl (Turtle/RDF) ontology files from ./data/ontologies/

    • Currently includes D3FEND (MITRE cybersecurity ontology with 5,014 entities)
    • Supports standard RDF/OWL predicates (rdfs:label, d3f:definition, etc.)
  2. 3-Tier Candidate Search:

    • Exact matching: Label normalization and exact string matching (score: 1.0)
    • Type-based matching: Matches by entity type (ATTACK, TECHNIQUE, etc.) (score: 0.6)
    • Partial matching: Substring and fuzzy matching (score: 0.4+)
  3. Adaptive Threshold Filtering:

    • High-confidence matches (β‰₯ threshold) β†’ automatic linking
    • Low-confidence matches (< threshold) β†’ LLM disambiguation
  4. LLM Prompt Chaining (for hard cases):

    • Step 1: NIL prediction - Does the entity exist in the ontology?
    • Step 2: Candidate selection - Which ontology concept is it?
  5. Resource Management:

    • Maximum processing time limit (default: 300s)
    • Token budget control for API costs (default: 100,000 tokens)
    • Graceful fallback to NIL on timeout/errors
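
For intuition, the three candidate-search tiers above could be scored roughly as follows. This is an illustrative sketch only (the real engine is src/disambiguation/disambiguator.py); the normalization and fuzzy metric are assumptions.

# Illustrative 3-tier candidate scoring (not the actual implementation)
from difflib import SequenceMatcher

def score_candidate(mention: str, mention_type: str, candidate: dict) -> float:
    m, label = mention.strip().lower(), candidate["label"].strip().lower()
    if m == label:                                   # Tier 1: exact match -> 1.0
        return 1.0
    if mention_type == candidate.get("type"):        # Tier 2: type-based -> 0.6
        return 0.6
    ratio = SequenceMatcher(None, m, label).ratio()
    if m in label or label in m or ratio > 0.8:      # Tier 3: partial/fuzzy -> 0.4+
        return 0.4 + 0.2 * ratio
    return 0.0

# Candidates at or above DISAMBIGUATION_THRESHOLD link automatically;
# the rest go to the LLM prompt chain (NIL prediction, then selection).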

Disambiguation Metadata

Each disambiguated entity includes:

  • canonical_id: URI from ontology (e.g., d3f:Phishing)
  • canonical_label: Standard name from ontology
  • disambiguation_confidence: Score 0-100
  • disambiguation_method: How it was disambiguated (auto_exact, llm_selection, etc.)
  • candidates: Alternative entities considered
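
For illustration, a disambiguated entity might carry metadata like this (all values hypothetical):

{
  "name": "phishing email",
  "canonical_id": "d3f:Phishing",
  "canonical_label": "Phishing",
  "disambiguation_confidence": 95,
  "disambiguation_method": "auto_exact",
  "candidates": ["d3f:Phishing"]
}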

Configuration

Control disambiguation behavior via environment variables:

DISAMBIGUATION_ENABLED=true              # Enable/disable feature
DISAMBIGUATION_THRESHOLD=0.5             # Confidence threshold (0-1)
DISAMBIGUATION_ONTOLOGIES_DIR=./data/ontologies/  # Ontology directory
MAX_PROCESSING_TIME=300                  # Timeout in seconds
MAX_PROCESSING_TOKENS=100000             # Token budget limit

Threshold tuning:

  • Lower (0.3-0.5): More recall, more LLM calls, higher cost
  • Higher (0.7-0.9): More precision, fewer LLM calls, lower cost

Adding Custom Ontologies

  1. Place your .ttl (Turtle) ontology files in ./data/ontologies/
  2. Ensure they follow standard RDF/OWL conventions:
    • Use rdfs:label or skos:prefLabel for entity names
    • Use rdfs:comment or d3f:definition for descriptions
    • Define types with rdf:type and rdfs:subClassOf
  3. Restart the application - ontologies are loaded automatically
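
A minimal hypothetical entry following these conventions (the ex: namespace and class names are made up for illustration):

@prefix ex:   <http://example.org/grid#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:CyberAsset a owl:Class ;
    rdfs:label "Cyber Asset" .

ex:RTU a owl:Class ;
    rdfs:subClassOf ex:CyberAsset ;
    rdfs:label "Remote Terminal Unit" ;
    rdfs:comment "Field device that relays sensor data to SCADA systems." .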

Querying Disambiguation Results

View disambiguation data in Neo4j:

// Get all disambiguated entities
MATCH (e:Entity)
WHERE e.canonical_id IS NOT NULL
RETURN e.name, e.canonical_label, e.disambiguation_confidence, e.disambiguation_method
ORDER BY e.disambiguation_confidence DESC

// Get NIL entities (not found in ontology)
MATCH (e:Entity)
WHERE e.canonical_id = "NIL"
RETURN e.name, e.type, e.description

// Statistics by disambiguation method
MATCH (e:Entity)
RETURN e.disambiguation_method as method,
       COUNT(e) as count,
       AVG(e.disambiguation_confidence) as avg_confidence

πŸ›‘οΈ Validation

The graph structure is validated against SHACL shapes defined in data/shapes/schema.ttl. This ensures that every Entity has a name and a type and that relations are properly formed.
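
Validation relies on pyshacl. A minimal standalone sketch (the data-graph file name is a hypothetical export; the shapes path matches the repository layout):

# Validate a data graph against the SHACL shapes with pyshacl
from pyshacl import validate
from rdflib import Graph

data = Graph().parse("my_graph.ttl", format="turtle")     # hypothetical export
shapes = Graph().parse("data/shapes/schema.ttl", format="turtle")

conforms, _report_graph, report_text = validate(data, shacl_graph=shapes)
print("Conforms:", conforms)
if not conforms:
    print(report_text)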

πŸ§ͺ Testing

Test Disambiguation Module

# Activate virtual environment
source venv/bin/activate

# Test ontology loader
python3 -c "
from src.disambiguation.ontology_loader import OntologyLoader
loader = OntologyLoader('./data/ontologies/')
index = loader.build_index()
print(f'Loaded {len(index)} entities from ontologies')
"

# Test disambiguator
python3 -c "
from src.extraction.schema import Entity
from src.disambiguation.disambiguator import EntityDisambiguator
from dotenv import load_dotenv
load_dotenv()

disambiguator = EntityDisambiguator(threshold=0.5)
entity = Entity(name='Phishing', type='ATTACK', description='Email attack')
candidates = disambiguator._find_candidates(entity, 'phishing attack context', top_k=3)
for i, c in enumerate(candidates):
    print(f'{i+1}. {c[\"label\"]} (score: {c[\"score\"]:.2f})')
"

Test Ontology Constraints

# Test ontology constraints loader
python3 -c "
from src.extraction.ontology_constraints import OntologyConstraintsLoader
loader = OntologyConstraintsLoader(['powergrids_cps.ttl'], './data/ontologies/')
print(f'Entity types: {len(loader.entity_types)}')
print(f'Relation types: {len(loader.relation_types)}')
print(loader.format_for_prompt()[:500])
"

Run Unit Tests

# First time: install package in development mode
pip install -e .

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_ontology_constraints.py -v

# With coverage
pytest --cov=src tests/

See TESTING.md for a comprehensive testing guide.

πŸ“Š Pipeline Logs & Monitoring

KG-Foundry includes a real-time logging system to monitor the processing pipeline.

Logs Page

Access the logs page at http://localhost:3000/logs or click the "Show Logs" button in the main interface.

The logs page displays:

  • Current Configuration: Extraction settings, disambiguation settings, Neo4j connection
  • Ontology Files Status: Which files are loaded, their sizes, and validation
  • Pipeline Logs: Real-time logs from each processing stage:
    • load - Document loading
    • extraction - Entity/relation extraction
    • disambiguation - Entity linking to ontology
    • validation - SHACL validation
    • storage - Neo4j persistence

API Endpoints

# Get pipeline logs (last 100 entries)
curl http://localhost:8000/logs

# Get logs filtered by stage
curl "http://localhost:8000/logs?stage=extraction&limit=50"

# Get current configuration
curl http://localhost:8000/config

# Clear logs
curl -X DELETE http://localhost:8000/logs

Log Levels

Level     Color    Description
debug     Gray     Detailed debug information
info      Blue     General information
warning   Yellow   Non-critical issues
error     Red      Errors that were handled
success   Green    Successful operations

Example Log Output

{
  "timestamp": "2024-01-16T10:30:24.123456",
  "level": "success",
  "stage": "extraction",
  "message": "Ontology constraints loaded",
  "details": {
    "entity_types_count": 78,
    "relation_types_count": 34,
    "files": ["powergrids_cps.ttl"]
  }
}

Container Logs

# View backend logs (Docker/Podman)
docker-compose logs -f backend

# Or with podman
podman-compose logs -f backend

🎯 Performance Tips

  1. Reduce LLM Costs:

    • Increase DISAMBIGUATION_THRESHOLD to 0.7-0.8
    • Most exact matches will be automatic (no API calls)
  2. Improve Recall:

    • Decrease DISAMBIGUATION_THRESHOLD to 0.3-0.4
    • More entities will be sent to LLM for disambiguation
  3. Handle Large Documents:

    • Increase MAX_PROCESSING_TIME for longer processing
    • Increase MAX_PROCESSING_TOKENS for more API budget
  4. Optimize Ontology Loading:

    • Keep ontology files organized in ./data/ontologies/
    • Remove unused ontologies to speed up startup

πŸ“š Documentation

  • Implementation Specifications: See ./instructions/entities_desambiguation_pipeline.md
  • MVP Summary: See ./MVP_IMPLEMENTATION_SUMMARY.md
  • API Documentation: Available at http://localhost:8000/docs when running
  • MHEL-LLaMo Reference: See ./instructions/MHEL-LLAMO/ for original research implementation

🀝 Contributing

Contributions are welcome! Areas for improvement (items marked "Implemented" are already in place):

  • Semantic search with embeddings (Phase 2)
  • Ontology-constrained extraction (Implemented)
  • Pipeline logs and monitoring page (Implemented)
  • Frontend visualization of disambiguation metadata
  • Additional ontology integrations (MITRE ATT&CK, etc.)
  • Disambiguation result caching
  • Batch processing optimization
  • Multi-language ontology support
  • Export graph to various formats (RDF, JSON-LD, etc.)

πŸ“„ License

[Add your license information here]

πŸ™ Acknowledgments

  • MHEL-LLaMo: Entity disambiguation approach inspired by "It's All About the Confidence: An Unsupervised Approach for Multilingual Historical Entity Linking using Large Language Models"
  • SPR-RAG: Semantic parsing and retrieval concepts
  • D3FEND: MITRE cybersecurity ontology
  • LangChain: Document processing framework
  • Instructor: Structured LLM outputs

πŸ“ž Support

For issues, questions, or feature requests, please open an issue on GitHub.


Built with ❀️ using Claude Code
