This project uses a model-based relation extraction pipeline end-to-end (no rule fallback at runtime).
crawl (tools/crawl.py) → data/corpus/*.txt
↓
manifest (tools/build_manifest.py) → docs.yaml
↓
pack_corpus:
[ner_tag.py] → out/pack.ner.jsonl (entities per sentence)
[candidates.py] → out/pack.candidates.jsonl (entity pairs)
[re_score.py + promoter_v1.onnx] → out/pack.scored.jsonl (ML scores)
↓
auto_train (tools/select_training_from_scored.py) → training/re_train.jsonl, training/re_dev.jsonl → retrain (training/train_re_sklearn.py)
↓
edges_from_pack (tools/promote_from_scored.py + thresholds) → out/edges.jsonl
↓
[export_ttl.py] → out/graph.ttl
↓
quality (tools/quality_report.py) → summary metrics
config/predicates.yaml
- Defines the predicate taxonomy (e.g., type, covered_by, headquartered_in, operates_in, subsidiary_of, member_of, part_of, provides, requires, prohibits, uses, none)
- Used by the ML model to classify relationships
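As a rough illustration, an entry in config/predicates.yaml might look like the sketch below. The field names (`predicates`, `name`, `subject_types`, `object_types`) are assumptions for illustration, not the file's actual schema:

```yaml
# Illustrative sketch only; the real config/predicates.yaml schema may differ.
predicates:
  - name: headquartered_in
    subject_types: [ORG]
    object_types: [GPE]
  - name: subsidiary_of
    subject_types: [ORG]
    object_types: [ORG]
  - name: none   # the negative / no-relation class
```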
models/thresholds.json
- Confidence thresholds per predicate
- Used to filter low-confidence predictions
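models/thresholds.json is presumably a flat predicate-to-cutoff map; an illustrative shape (the values here are invented):

```json
{
  "headquartered_in": 0.85,
  "subsidiary_of": 0.80,
  "operates_in": 0.75
}
```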
groundkg/ner_tag.py
- Runs spaCy's NER on input text
- Pipeline components:
  - sentencizer: rule-based sentence segmentation (runs first)
  - entity_ruler: pattern-based entity boost (loads from training/ruler_patterns.jsonl)
  - ner: statistical NER model (spaCy transformer-based)
- Outputs:
  {"doc_id", "sent_idx", "text", "entities": [...]}
- Entity labels include ORG, PRODUCT, PERSON, GPE, LAW, ROLE, EVENT, etc.
- Supports custom entity labels via EntityRuler patterns
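The per-sentence output format above can be sketched as plain Python. The helper name and the fields inside each entity dict are illustrative, not the actual ner_tag.py code:

```python
def sentence_record(doc_id, sent_idx, text, spans):
    """Build one ner_tag.py-style JSONL record (illustrative schema).

    spans: list of (start, end, label) character offsets into text.
    """
    return {
        "doc_id": doc_id,
        "sent_idx": sent_idx,
        "text": text,
        "entities": [
            {"text": text[s:e], "start": s, "end": e, "label": lbl}
            for s, e, lbl in spans
        ],
    }

rec = sentence_record("doc1", 0, "Acme Corp acquired Widget Inc.",
                      [(0, 9, "ORG"), (19, 30, "ORG")])
```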
groundkg/candidates.py
- Pairs entities within the same sentence
- Filters by max character distance (150 chars)
- Prioritizes certain entity types for subjects vs objects
- Outputs: candidate (subject, object) pairs per sentence
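The pairing logic can be sketched as follows; the 150-char window matches the description above, while the function name and entity fields are illustrative (the type-based subject/object prioritization is omitted):

```python
MAX_CHAR_DIST = 150  # max character gap allowed between paired entities

def candidate_pairs(entities):
    """Yield (subject, object) pairs of entities from one sentence.

    entities: list of dicts with "start", "end", "label" offsets.
    Pairs whose character gap exceeds MAX_CHAR_DIST are dropped.
    """
    ents = sorted(entities, key=lambda e: e["start"])
    for i, subj in enumerate(ents):
        for obj in ents[i + 1:]:
            gap = obj["start"] - subj["end"]
            if gap <= MAX_CHAR_DIST:
                yield subj, obj

ents = [{"start": 0, "end": 9, "label": "ORG"},
        {"start": 19, "end": 30, "label": "ORG"}]
pairs = list(candidate_pairs(ents))  # entities 10 chars apart: kept
```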
groundkg/re_score.py
- Loads the ONNX model (models/promoter_v1.onnx) and emits per-pair predictions with probabilities (no thresholding)
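Since out/pack.scored.jsonl carries per-pair probabilities, the scoring step can be sketched in pure Python. The class list, record fields, and the softmax-over-logits assumption are all illustrative; the real ONNX model's output format may differ:

```python
import math

def softmax(logits):
    """Convert raw per-class logits into probabilities summing to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

CLASSES = ["headquartered_in", "subsidiary_of", "none"]  # illustrative subset

def score_record(pair_text, logits):
    """Build a scored record in the spirit of out/pack.scored.jsonl."""
    probs = softmax(logits)
    best = max(range(len(CLASSES)), key=lambda i: probs[i])
    return {"text": pair_text, "pred": CLASSES[best],
            "probs": dict(zip(CLASSES, probs))}

rec = score_record("Acme Corp ... Berlin", [2.0, 0.5, -1.0])
```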
tools/promote_from_scored.py
- Converts scored predictions to final edges using per-class thresholds and deduplication.
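The promotion step described above (per-class thresholds plus deduplication) can be sketched like this; record fields and the default cutoff are assumptions:

```python
def promote(scored, thresholds):
    """Keep predictions whose top-class probability clears the per-class
    threshold, deduplicating identical (subject, predicate, object) edges."""
    seen, edges = set(), []
    for rec in scored:
        pred, prob = rec["pred"], rec["probs"][rec["pred"]]
        if pred == "none" or prob < thresholds.get(pred, 0.5):
            continue
        key = (rec["subj"], pred, rec["obj"])
        if key in seen:
            continue  # drop duplicate edge
        seen.add(key)
        edges.append({"subj": rec["subj"], "pred": pred,
                      "obj": rec["obj"], "prob": prob})
    return edges

scored = [
    {"subj": "Acme Corp", "obj": "Berlin", "pred": "headquartered_in",
     "probs": {"headquartered_in": 0.92}},
    {"subj": "Acme Corp", "obj": "Berlin", "pred": "headquartered_in",
     "probs": {"headquartered_in": 0.95}},  # duplicate edge, dropped
    {"subj": "Acme Corp", "obj": "Widget Inc", "pred": "none",
     "probs": {"none": 0.90}},              # negative class, dropped
]
edges = promote(scored, {"headquartered_in": 0.85})
```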
tools/select_training_from_scored.py
- Selects high-confidence positives and robust negatives from out/pack.scored.jsonl
- Supports minimum-per-class backfill and integration with mined patterns
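The selection idea (confident positives, robust negatives) can be sketched as below; the cutoffs and record fields are illustrative, and the minimum-per-class backfill and mined-pattern integration are omitted:

```python
def select_training(scored, pos_min=0.9, neg_min=0.9):
    """Split scored records into high-confidence positives and robust
    negatives (records the model confidently labels "none")."""
    pos, neg = [], []
    for rec in scored:
        prob = rec["probs"][rec["pred"]]
        if rec["pred"] == "none":
            if prob >= neg_min:
                neg.append(rec)      # confident no-relation example
        elif prob >= pos_min:
            pos.append(rec)          # confident positive example
    return pos, neg

scored = [
    {"pred": "headquartered_in", "probs": {"headquartered_in": 0.95}},
    {"pred": "headquartered_in", "probs": {"headquartered_in": 0.60}},  # too weak
    {"pred": "none", "probs": {"none": 0.95}},
]
pos, neg = select_training(scored)
```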
training/train_re_sklearn.py
- Trains TF-IDF + Logistic Regression, exports ONNX, writes thresholds and classes.
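A minimal sketch of the TF-IDF + Logistic Regression trainer; the toy sentences, labels, and hyperparameters are illustrative, not the project's real training data or setup, and the skl2onnx export step is only noted in a comment:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set; the real data comes from training/re_train.jsonl.
texts = [
    "Acme Corp is headquartered in Berlin",
    "Widget Inc is headquartered in Paris",
    "Acme Corp is a subsidiary of MegaCorp",
    "Widget Inc is a subsidiary of HoldCo",
]
labels = ["headquartered_in", "headquartered_in",
          "subsidiary_of", "subsidiary_of"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

pred = clf.predict(["Foo Ltd is headquartered in Rome"])[0]
# ONNX export would go through skl2onnx (convert and write
# models/promoter_v1.onnx); omitted here to keep the sketch self-contained.
```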
End-to-end: crawl → manifest → pack → auto-train → repack → stats → edges → quality
Checksums output files for reproducibility
Validates that out/edges.jsonl is well-formed JSONL
make pipeline
make pack_stats
make lint

The next step is to create training/train_re_sklearn.py:
- Load training/re_train.jsonl and training/re_dev.jsonl
- Featurize text (TF-IDF or embeddings)
- Train a classifier (Logistic Regression, SVM, or small neural network)
- Export to ONNX using skl2onnx
- Save as models/promoter_v1.onnx
- Tune models/thresholds.json based on dev set precision/recall
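One way to tune a per-predicate threshold on the dev set is to pick the lowest cutoff that reaches a target precision; this is a sketch of that idea, not necessarily the project's actual tuning criterion:

```python
def tune_threshold(dev, target_precision=0.9):
    """Return the lowest probability cutoff whose precision on dev
    meets the target. dev: list of (probability, is_correct) pairs."""
    for cut in [i / 100 for i in range(50, 100)]:
        kept = [ok for prob, ok in dev if prob >= cut]
        if kept and sum(kept) / len(kept) >= target_precision:
            return cut
    return 0.99  # nothing met the target: fall back to a very strict cutoff

# Invented dev-set results for one predicate: (model probability, correct?)
dev = [(0.6, False), (0.7, True), (0.8, True), (0.9, True)]
cut = tune_threshold(dev)  # lowest cutoff that excludes the wrong prediction
```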
✅ Backward Compatible: Existing rule-based pipeline still works
✅ Deterministic Bootstrap: Rules provide high-precision labels for training
✅ Incremental Migration: Can switch to ML without rewriting everything
✅ Fallback Safety: If model fails or is unavailable, rules still work
✅ Reproducible: Pinned dependencies, checksum validation
✅ Extensible: Easy to add new predicates to config and retrain
Core dependencies in requirement.txt include:
- onnxruntime - model inference
- scikit-learn - model training
- skl2onnx - model export
- spaCy + spacy-transformers - NER
Install with:
pip install -r requirement.txt