This guide documents the model-based relation extraction pipeline that has been successfully implemented and tested.
The pipeline runs purely model-based at runtime:
- Uses spaCy NER to extract entities
- Generates candidate entity pairs (+ noun chunks)
- Scores relations with the ONNX model
- Applies per-predicate thresholds to emit final edges
```
crawl → data/corpus/*.txt
pack_corpus → out/pack.ner.jsonl → out/pack.candidates.jsonl → out/pack.scored.jsonl
auto_train + retrain → models/*
edges_from_pack → out/edges.jsonl → graph.ttl
quality → quality indicators
```
config/predicates.yaml
- Defines 13 predicate types: `type`, `covered_by`, `headquartered_in`, `operates_in`, `part_of`, `subsidiary_of`, `parent_of`, `member_of`, `uses`, `provides`, `requires`, `prohibits`, `none`
- Each predicate includes a description and a template for human-readable representation
training/ruler_patterns.jsonl
- EntityRuler patterns for domain-specific entities
- One JSON object per line, e.g. `{"label":"ROLE","pattern":"CEO"}`
- Patterns boost entities before the statistical NER runs
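Since each line is a standalone JSON object, the file is easy to sanity-check before spaCy loads it. A minimal stdlib sketch (the `load_ruler_patterns` helper is illustrative, not part of the repo):

```python
import json

def load_ruler_patterns(lines):
    """Parse EntityRuler patterns from JSONL lines (one object per line).

    Each pattern needs a "label" and a "pattern" (a string for exact phrase
    matches, or a list of token dicts in spaCy Matcher syntax).
    """
    patterns = []
    for n, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        obj = json.loads(line)
        if "label" not in obj or "pattern" not in obj:
            raise ValueError(f"line {n}: expected 'label' and 'pattern' keys")
        patterns.append(obj)
    return patterns

patterns = load_ruler_patterns([
    '{"label":"ROLE","pattern":"CEO"}',
    '{"label":"ORG","pattern":"OpenAI"}',
])
```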
models/thresholds.json
- Per-predicate confidence thresholds
- Currently set to 0.2 for development (adjust to 0.8+ for production)
models/classes.json
- List of class labels from training
- Auto-generated during training
groundkg/ner_tag.py
- Runs spaCy's pre-trained NER on input text
- Pipeline setup:
  - Adds `sentencizer` first for sentence boundary detection
  - Adds `entity_ruler` before NER to boost domain-specific entities
  - Loads patterns from `training/ruler_patterns.jsonl` (e.g., "OpenAI", "CEO", "AI Act")
  - Then runs the statistical NER model
- Extracts entities: ORG, PERSON, GPE, LAW, PRODUCT, ROLE, EVENT, etc.
- Outputs sentence-level entity annotations with character offsets
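The pipeline assembly above can be sketched with spaCy directly. This sketch uses a blank English pipeline so no model download is needed, whereas `ner_tag.py` wraps the statistical model; the component names `sentencizer` and `entity_ruler` are spaCy's own:

```python
import spacy

# Blank pipeline: ruler matches fire, but no statistical NER runs.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")           # sentence boundaries first
ruler = nlp.add_pipe("entity_ruler")  # in ner_tag.py this goes before "ner"
ruler.add_patterns([{"label": "ROLE", "pattern": "CEO"}])

doc = nlp("Tim Cook is the CEO. Apple is based in Cupertino.")
# Sentence-level entity annotations with character offsets
ents = [(e.text, e.label_, e.start_char, e.end_char) for e in doc.ents]
```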
groundkg/candidates.py
- Pairs entities within sentences
- Adds noun phrase chunks (regex-based) to catch missed entities
- Filters by distance (max 150 chars between entities)
- Outputs: `(subject, object)` candidate pairs
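The pairing-and-filter logic can be sketched in a few lines; `candidate_pairs` and the exact handling of `MAX_CHAR_DIST` are illustrative, not the repo's code:

```python
from itertools import combinations

MAX_CHAR_DIST = 150  # matches the distance filter described above

def candidate_pairs(entities, max_dist=MAX_CHAR_DIST):
    """Pair entities within one sentence, in both directions, filtered by
    the character gap between the two spans.

    `entities` is a list of (text, start, end) tuples from NER
    (plus any noun chunks merged in).
    """
    pairs = []
    for a, b in combinations(entities, 2):
        gap = max(a[1], b[1]) - min(a[2], b[2])  # chars between the spans
        if gap <= max_dist:
            pairs.append((a[0], b[0]))
            pairs.append((b[0], a[0]))  # relation direction matters
    return pairs

ents = [("Tesla", 0, 5), ("Austin", 26, 32), ("Texas", 34, 39)]
pairs = candidate_pairs(ents)
```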
training/train_re_transformers.py (default)
- Uses sentence transformer embeddings (`all-MiniLM-L6-v2`) + Logistic Regression
- Better semantic understanding than TF-IDF
- Exports LogisticRegression to ONNX format for inference
- Auto-tunes thresholds using precision-recall curves
- Outputs: `promoter_v1.onnx`, `thresholds.json`, `classes.json`
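The threshold auto-tuning step can be approximated with a plain sweep over dev-set scores. A stdlib sketch, where `tune_threshold` and the 0.85 precision target are assumptions (the real script derives thresholds from precision-recall curves):

```python
def tune_threshold(probs, labels, target_precision=0.85):
    """Pick the lowest score threshold whose precision on dev data meets
    the target, sweeping the score axis like a precision-recall curve.

    probs  : predicted probability for one class, per dev example
    labels : 1 if the gold label is that class, else 0
    """
    best = 1.0
    for t in sorted(set(probs)):
        preds = [p >= t for p in probs]
        tp = sum(1 for p, y in zip(preds, labels) if p and y)
        fp = sum(1 for p, y in zip(preds, labels) if p and not y)
        if tp and tp / (tp + fp) >= target_precision:
            best = min(best, t)
    return best

# Toy dev scores for one predicate
probs = [0.95, 0.90, 0.60, 0.40, 0.30]
labels = [1, 1, 0, 1, 0]
t = tune_threshold(probs, labels)
```

Lower targets yield lower thresholds (more recall); the `"none"` class is typically pinned at 1.0 so it never emits an edge.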
training/train_re_sklearn.py (deprecated)
- TF-IDF + Logistic Regression model (kept for backward compatibility)
- Use `training/train_re_transformers.py` instead
groundkg/re_score.py
- Loads the sentence transformer model (`all-MiniLM-L6-v2`) for embeddings
- Loads the ONNX model (LogisticRegression) for classification
- Converts text to embeddings, then runs ONNX model on embeddings
- Outputs per-pair predictions with probabilities
tools/promote_from_scored.py
- Applies per-class thresholds and dedupes to produce final edges.
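A minimal sketch of that promotion step, assuming scored records carry `subject`, `predicate`, `object`, and `prob` fields (the field names and the 0.8 fallback threshold are assumptions, not the tool's exact schema):

```python
import json

def promote(scored_lines, thresholds):
    """Keep predictions that clear their per-class threshold, dropping the
    'none' class and duplicate (subject, predicate, object) triples."""
    seen, edges = set(), []
    for line in scored_lines:
        rec = json.loads(line)
        pred, prob = rec["predicate"], rec["prob"]
        if pred == "none" or prob < thresholds.get(pred, 0.8):
            continue
        key = (rec["subject"], pred, rec["object"])
        if key in seen:
            continue  # dedupe: keep the first occurrence
        seen.add(key)
        edges.append(rec)
    return edges

scored = [
    '{"subject":"Tesla","predicate":"headquartered_in","object":"Austin","prob":0.93}',
    '{"subject":"Tesla","predicate":"headquartered_in","object":"Austin","prob":0.91}',
    '{"subject":"Tesla","predicate":"none","object":"Austin","prob":0.99}',
]
edges = promote(scored, {"headquartered_in": 0.9})
```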
```bash
# Install dependencies
pip install -r requirements.txt

# Download spaCy model
python -m spacy download en_core_web_sm
```
When starting with a new corpus and no existing model, you can automatically generate initial training seeds:
```bash
# 1) Extract entities and candidates (no model needed)
make -f Makefile.gk ner cand

# 2) Auto-generate seeds from clear connector patterns
make -f Makefile.gk bootstrap_seed
# or use the one-shot helper:
make -f Makefile.gk bootstrap

# 3) Train the first model from auto-generated seeds
make -f Makefile.gk coldstart

# 4) Run the full pipeline
make -f Makefile.gk crawl manifest ner cand score infer edges ttl report
```
How Bootstrap Works:
tools/bootstrap_seed_from_candidates.py scans out/pack.candidates.jsonl for entity pairs and looks for high-precision connector phrases:
- "is headquartered in" → `headquartered_in`
- "subsidiary of" → `subsidiary_of`
- "uses" → `uses`
- "provides" → `provides`
- and 7 more patterns
The bootstrap is type-gated (it only accepts pairs with compatible entity types) and conservative (it avoids overly broad matches). It automatically adds negative examples (labeled "none") from candidate pairs that match no pattern, ensuring the at least two classes that logistic regression requires. It outputs `training/seed.jsonl` in the training format.
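The connector-matching idea can be sketched as follows; the pattern list is an illustrative subset and the type gates shown are assumptions, not the tool's exact tables:

```python
import re

# Illustrative subset of the connector patterns; the real tool has 11,
# each gated on the entity-type pair (e.g. ORG → GPE for headquarters).
PATTERNS = [
    (re.compile(r"\bis headquartered in\b"), "headquartered_in", ("ORG", "GPE")),
    (re.compile(r"\bsubsidiary of\b"), "subsidiary_of", ("ORG", "ORG")),
]

def seed_label(sentence, subj_type, obj_type):
    """Return a predicate if a connector matches and the entity types are
    compatible; 'none' otherwise (negatives keep training two-class)."""
    for rx, predicate, (st, ot) in PATTERNS:
        if rx.search(sentence) and (subj_type, obj_type) == (st, ot):
            return predicate
    return "none"

label = seed_label("Tesla is headquartered in Austin.", "ORG", "GPE")
```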
Future Bootstrap Methods (Placeholders):
- Dependency-based: `tools/bootstrap_seed_dep.py --enable` (spaCy Matcher patterns)
- KB-backed: `tools/bootstrap_seed_kb.py --kb config/kb.json` (distant supervision)
- LLM-assisted: `tools/bootstrap_seed_llm.py --enable` (with gating + audits)
```bash
make -f Makefile.gk all
make pack_stats
make lint
```
Example edges (`out/edges.jsonl`):
```json
{"subject": "Tesla", "predicate": "headquartered_in", "object": "Austin", "evidence": {...}}
{"subject": "Tesla", "predicate": "headquartered_in", "object": "Texas", "evidence": {...}}
```
Corresponding Turtle (`graph.ttl`):
```turtle
ex:node/Tesla ex:headquartered_in ex:node/Austin .
ex:node/Tesla ex:headquartered_in ex:node/Texas .
```
Input: `training/re_train.jsonl`:
```json
{"text": "[E1]Tesla[/E1] is headquartered in [E2]Austin, Texas[/E2].", "label": "headquartered_in"}
{"text": "[E1]ACME[/E1] is [E2]a Digital Service Provider[/E2].", "label": "type"}
{"text": "[E1]System[/E1] consists of [E2]three components[/E2].", "label": "consists_of"}
{"text": "According to [E1]the vendor[/E1], [E2]the model[/E2] may achieve 92% accuracy.", "label": "none"}
```
See the quality indicators printed at the end of the `make` pipeline.
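The `[E1]`/`[E2]` markers in these training examples can be generated from the character offsets NER produces; a stdlib sketch (`mark_pair` is illustrative, not the repo's helper):

```python
def mark_pair(text, subj_span, obj_span):
    """Wrap the subject and object character spans in [E1]…[/E1] and
    [E2]…[/E2] markers, producing the training-format text shown above."""
    inserts = [(subj_span, "[E1]", "[/E1]"), (obj_span, "[E2]", "[/E2]")]
    # Apply right-to-left so earlier character offsets stay valid.
    for (start, end), open_tag, close_tag in sorted(inserts, reverse=True):
        text = text[:end] + close_tag + text[end:]
        text = text[:start] + open_tag + text[start:]
    return text

marked = mark_pair("Tesla is headquartered in Austin, Texas.", (0, 5), (26, 39))
```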
- ✅ No hard-coded patterns: the model learns from examples
- ✅ Generalizes better: works on unseen sentence structures
- ✅ Extensible: add new predicates by adding training examples
- ✅ Probabilistic: confidence scores for filtering
- ✅ Noun chunk support: catches multi-word entities NER misses
Edit `models/thresholds.json`:
```json
{
  "type": 0.85,
  "headquartered_in": 0.90,
  "consists_of": 0.85,
  "performs_task": 0.85,
  "none": 1.00
}
```
To add a new predicate:
- Add it to `config/predicates.yaml`
- Add training examples to `training/re_train.jsonl`
- Retrain: `python training/train_re_transformers.py` (or `make -f Makefile.gk coldstart`)
To increase recall:
- Lower thresholds in `models/thresholds.json`
- Add more positive training examples
- Increase `MAX_CHAR_DIST` in `candidates.py`
```
config/
  predicates.yaml                     # Predicate taxonomy
models/
  promoter_v1.onnx                    # Trained ONNX model (LogisticRegression)
  thresholds.json                     # Confidence thresholds
  classes.json                        # Class labels
training/
  train_re_transformers.py            # Training script (sentence transformers, default)
  train_re_sklearn.py                 # Training script (TF-IDF, deprecated)
  re_train.jsonl                      # Training examples
  re_dev.jsonl                        # Development examples
  ruler_patterns.jsonl                # EntityRuler patterns (domain-specific entities)
tools/
  bootstrap_seed_from_candidates.py   # Auto-seed generation
groundkg/
  ner_tag.py                          # NER extraction
  candidates.py                       # Candidate generation
  re_score.py                         # Model inference (sentence transformers + ONNX)
out/
  ner.jsonl                           # NER output
  candidates.jsonl                    # Candidate pairs
  edges.jsonl                         # Final edges
  graph.ttl                           # RDF graph
```
- Collect More Training Data: Run on larger corpora to bootstrap more examples
- Refine Thresholds: Tune per-predicate thresholds on dev set
- Experiment with Models: Try transformers (BERT, RoBERTa) for better accuracy
- Active Learning: Manually label uncertain predictions to improve model
- Domain Adaptation: Fine-tune on domain-specific text
```bash
make demo
make hash
# Output hashes:
# 6e25b5c4cae5835cb6057d74043370246cd133a8 out/edges.jsonl
# 776d1b593cd9641b4f3214a9beb04591f454e398 out/statements.scored.jsonl
# 23b02cd22b7cb37b4064eaa74deb56826f3f1a24 out/attributes.jsonl
# be5768f91095360d98b9bd34169a766c231a1d3d out/graph.ttl
```
The model-based pipeline is fully operational and tested. The system automatically switches between rule-based and ML-based extraction, providing a smooth migration path from deterministic patterns to learned models.