matthiasautrata/semantipolis

Semantipolis

Semantic Layer for FAIR Data on Lakehouses


What This Is

Semantipolis is a semantic control plane that implements the FAIR data principles (Findable, Accessible, Interoperable, Reusable) on data lakehouses. It uses LinkML as its core modeling language and connects business concepts to physical data through a chain of mappings:

concept → ontology → logical → physical

We don't access data. We help users find, understand, and govern it.
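
As a rough illustration of the chain above, the sketch below follows one concept through three lookup tables. It is a minimal sketch, not the project's implementation: the dictionaries, the `Resolution` class, `resolve`, and the logical name `hr.employees` are assumptions made for this example. Only the `employee → Worker` and `Worker → unity://hr/employees` pairs come from this README.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping tables, one per link of the chain.
CONCEPT_TO_ONTOLOGY = {"employee": "Worker", "manager": "Manager"}
ONTOLOGY_TO_LOGICAL = {"Worker": "hr.employees"}  # logical name is assumed
LOGICAL_TO_PHYSICAL = {"hr.employees": "unity://hr/employees"}


@dataclass(frozen=True)
class Resolution:
    """How far a concept got along concept -> ontology -> logical -> physical."""
    concept: str
    ontology_class: Optional[str] = None
    logical: Optional[str] = None
    physical: Optional[str] = None

    @property
    def bound(self) -> bool:
        # A concept counts as bound only if the whole chain resolved.
        return self.physical is not None


def resolve(concept: str) -> Resolution:
    """Follow the chain link by link, stopping at the first missing mapping."""
    cls = CONCEPT_TO_ONTOLOGY.get(concept)
    logical = ONTOLOGY_TO_LOGICAL.get(cls) if cls else None
    physical = LOGICAL_TO_PHYSICAL.get(logical) if logical else None
    return Resolution(concept, cls, logical, physical)
```

Here `resolve("employee")` ends at a physical location, while `resolve("org chart")` stays unbound. Note that the sketch only returns a location; it never reads the data behind it.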


The Mapping Chain

User: "Show org chart for engineering"
                    ↓
        EXTRACT: [org chart, engineering]
                    ↓
        RESOLVE: org chart → ✗ unbound
                    ↓
        EXPAND:  LLM: "org chart needs employee, manager, reports_to"
                    ↓
        RESOLVE: employee → Worker ✓, manager → Manager ✓
                    ↓
        BIND:    Worker → unity://hr/employees
                    ↓
        COVER:   self-join via manager_id
                    ↓
        GENERATE: WITH RECURSIVE hierarchy AS (...)
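
The COVER and GENERATE steps can be made concrete with a small, self-contained example. The sketch below uses an in-memory SQLite database as a stand-in for the lakehouse; the `employees` table layout, the sample rows, and the helper names are assumptions for illustration, and the SQL a real run generates may differ.

```python
import sqlite3

# Recursive CTE: start at the department root, then self-join on manager_id.
ORG_CHART_SQL = """
WITH RECURSIVE hierarchy(id, name, manager_id, depth) AS (
    SELECT id, name, manager_id, 0
    FROM employees
    WHERE manager_id IS NULL AND department = :dept
    UNION ALL
    SELECT e.id, e.name, e.manager_id, h.depth + 1
    FROM employees AS e
    JOIN hierarchy AS h ON e.manager_id = h.id
)
SELECT name, depth FROM hierarchy ORDER BY depth, name
"""


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER, department TEXT)"
    )
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?, ?)",
        [
            (1, "Ada", None, "engineering"),
            (2, "Grace", 1, "engineering"),
            (3, "Linus", 2, "engineering"),
            (4, "Edgar", None, "sales"),
        ],
    )
    return conn


def org_chart(conn: sqlite3.Connection, dept: str) -> list:
    """Return (name, depth) rows for one department's reporting tree."""
    return conn.execute(ORG_CHART_SQL, {"dept": dept}).fetchall()
```

Calling `org_chart(demo_connection(), "engineering")` walks the reporting tree from the root downward. The toy anchor query assumes a department root has no manager, which is a simplification.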

Architecture

Skeleton + Flesh

| Layer | Role | Technology |
|----------|--------------------------------|------------------------|
| Skeleton | What CAN exist (authoritative) | LinkML, SKOS |
| Flesh | What DOES activate (fuzzy) | HNSW vectors (usearch) |

Neither layer is subordinate to the other; they are complementary:

  • Flesh without skeleton: no auditability, drifts
  • Skeleton without flesh: brittle, no fuzzy matching
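
One way to picture the interplay: the flesh ranks candidates by similarity, and the skeleton decides which of them are allowed to surface. The sketch below is a toy under stated assumptions. It swaps the HNSW index for a brute-force cosine ranking over hand-made 3-d vectors, and the names `SKELETON`, `FLESH`, `cosine`, and `activate` are hypothetical.

```python
import math

# Skeleton: the authoritative set of classes that CAN exist.
SKELETON = {"Worker", "Manager"}

# Flesh: toy embeddings. "Contractor" is indexed but not in the skeleton,
# standing in for an entry that has drifted out of the schema.
FLESH = {
    "Worker": (0.9, 0.1, 0.0),
    "Manager": (0.7, 0.7, 0.0),
    "Contractor": (0.8, 0.0, 0.6),
}


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def activate(query_vec, k: int = 2) -> list:
    """Rank by similarity (flesh), then keep only skeleton-approved names."""
    ranked = sorted(FLESH, key=lambda n: cosine(query_vec, FLESH[n]), reverse=True)
    return [name for name in ranked if name in SKELETON][:k]
```

With the query vector `(1.0, 0.0, 0.0)`, "Contractor" scores higher than "Manager" but is filtered out because the skeleton does not define it.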

The Four Planes

| Plane | Question | Standards |
|------------|--------------------|--------------|
| Inventory | What exists? | DCAT, dprod |
| Semantics | What does it mean? | SKOS, LinkML |
| Evidence | Is it true? | PROV, DQV |
| Governance | Can I use it? | ODRL, DCON |

The Triangle

           Data Products
          (dprod / DCAT)
              /    \
             /      \
    Data Contracts    Usage Policies
       (DCON)           (ODRL)
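
To show how the three corners of the triangle might relate in code, here is a deliberately simplified sketch. The classes and the `can_use` check are assumptions for illustration only; they do not model the actual DCON, ODRL, dprod, or DCAT vocabularies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: the columns a product promises to provide."""
    required_columns: frozenset


@dataclass(frozen=True)
class UsagePolicy:
    """Hypothetical policy: the actions a consumer is permitted to perform."""
    permitted_actions: frozenset


@dataclass(frozen=True)
class DataProduct:
    """A product ties together what it offers, its contract, and its policy."""
    name: str
    columns: frozenset
    contract: DataContract
    policy: UsagePolicy


def can_use(product: DataProduct, action: str) -> bool:
    """Allow use only if the contract is met AND the policy permits the action."""
    meets_contract = product.contract.required_columns <= product.columns
    return meets_contract and action in product.policy.permitted_actions
```

The point of the sketch is that neither corner alone answers "Can I use it?": a permitted action on a product that breaks its contract is still rejected.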

Quick Start

# Run tests
uv run pytest tests/ -v

# Demo (needs OPENAI_API_KEY)
export OPENAI_API_KEY=sk-...
uv run python scripts/orchestrator_demo.py "show org chart for engineering"

Documentation

| Document | Purpose |
|------------------------|--------------------------------------------------|
| LLM.md | Shared technical briefing (Claude/Gemini/Codex) |
| AGENTS.md | Governance, personas, decision authority |
| CLAUDE.md | Claude wrapper (loads LLM.md) |
| GEMINI.md | Gemini wrapper (loads LLM.md) |
| CODEX.md | Codex wrapper (loads LLM.md) |
| docs/vision.md | Vision, direction, justification, roadmap |
| docs/architecture.md | Components, mapping chain, service architecture |
| docs/dcon-evolution.md | History of experiments and key learnings |

Technology Stack

| Component | Technology |
|--------------|-----------------------|
| Schemas | LinkML (YAML) |
| Vocabulary | SKOS concepts |
| Vector Index | usearch (HNSW) |
| Embeddings | sentence-transformers |
| Graph | NetworkX |
| Mappings | LinkML-Map, SSSOM |
| LLM | OpenAI / Anthropic |

Principles

  1. Ask first, don't guess — Disambiguation over assumption
  2. LinkML is canonical — Other formats are derived or imported
  3. Standards over invention — DCAT, DCON, ODRL, SKOS, PROV
  4. Components over monoliths — Interface contracts, not prescribed implementations
  5. Skeleton is authoritative — Edit to fix errors immediately
  6. HNSW is cheap, LLM is expensive — Use vectors first
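
Principles 1 and 6 can be read together as a resolution policy: try the vector index first, ask the user when the top candidates are too close to call, and fall back to the LLM only when nothing matches. The sketch below is one possible shape for that policy. The function name, the `accept` and `margin` thresholds, and the callback signatures are assumptions, not the project's API.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (class name, similarity score)


def resolve_term(
    term: str,
    vector_search: Callable[[str], List[Candidate]],
    llm_expand: Callable[[str], List[str]],
    accept: float = 0.8,   # assumed minimum score for a confident match
    margin: float = 0.05,  # assumed minimum gap between the top two hits
) -> Tuple[str, List[str]]:
    """Return ("resolved" | "ask" | "expand", names) for one term."""
    hits = sorted(vector_search(term), key=lambda c: c[1], reverse=True)
    if hits and hits[0][1] >= accept:
        # Two near-equal candidates: ask the user instead of guessing.
        if len(hits) > 1 and hits[0][1] - hits[1][1] < margin:
            return ("ask", [hits[0][0], hits[1][0]])
        return ("resolved", [hits[0][0]])
    # Only now pay for the expensive LLM call.
    return ("expand", llm_expand(term))
```

In use, `vector_search` would wrap the HNSW index and `llm_expand` would wrap the model call; with this ordering the LLM is never invoked for a term the index already resolves confidently.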

Current Status

Phase: Resetting direction. Prior NL→SQL pipeline (88% exit accuracy, 56 tests) archived as proof of concept.

New focus: Semantic layer for FAIR data on lakehouses using LinkML mapping chain.

See docs/vision.md for the roadmap.
