Model-Based Relation Extraction Pipeline - Complete Guide

✅ Implementation Complete

This guide documents the model-based relation extraction pipeline that has been successfully implemented and tested.

Architecture Overview

Model-Only Operation

The pipeline runs purely model-based at runtime:

  • Uses spaCy NER to extract entities
  • Generates candidate entity pairs (+ noun chunks)
  • Scores relations with the ONNX model
  • Applies per-predicate thresholds to emit final edges

Pipeline Flow

crawl → data/corpus/*.txt
pack_corpus → out/pack.ner.jsonl → out/pack.candidates.jsonl → out/pack.scored.jsonl
auto_train + retrain → models/*
edges_from_pack → out/edges.jsonl → graph.ttl
quality → quality indicators

Key Components

1. Configuration Files

config/predicates.yaml

  • Defines 13 predicate types: type, covered_by, headquartered_in, operates_in, part_of, subsidiary_of, parent_of, member_of, uses, provides, requires, prohibits, none
  • Each predicate includes a description and template for human-readable representation
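The guide does not show the file's schema; an illustrative entry (field names are assumptions, not the actual config/predicates.yaml layout) might look like:

```yaml
# Hypothetical sketch of two predicates.yaml entries; field names are assumptions.
predicates:
  headquartered_in:
    description: "Subject organization has its headquarters in the object location."
    template: "{subject} is headquartered in {object}"
  none:
    description: "No relation holds between the entity pair."
    template: ""
```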

training/ruler_patterns.jsonl

  • EntityRuler patterns for domain-specific entities
  • One JSON object per line: {"label":"ROLE","pattern":"CEO"}
  • Patterns boost entities before statistical NER runs
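A minimal sketch of parsing such a JSONL file into the list-of-dicts form that spaCy's `EntityRuler.add_patterns` expects (pure Python here, so no spaCy dependency; the function name is illustrative):

```python
import json

def load_ruler_patterns(lines):
    """Parse one-JSON-object-per-line EntityRuler patterns, skipping blank lines."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        patterns.append(json.loads(line))
    return patterns

sample = '{"label":"ROLE","pattern":"CEO"}\n{"label":"ORG","pattern":"OpenAI"}\n'
pats = load_ruler_patterns(sample.splitlines())
print(pats[0]["label"])  # ROLE
```

The resulting list would be handed to the pipeline's `entity_ruler` component before the statistical NER runs.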

models/thresholds.json

  • Per-predicate confidence thresholds
  • Currently set to 0.2 for development (adjust to 0.8+ for production)

models/classes.json

  • List of class labels from training
  • Auto-generated during training

2. NER & Candidate Generation

groundkg/ner_tag.py

  • Runs spaCy's pre-trained NER on input text
  • Pipeline setup:
    • Adds sentencizer first for sentence boundary detection
    • Adds entity_ruler before NER to boost domain-specific entities
    • Loads patterns from training/ruler_patterns.jsonl (e.g., "OpenAI", "CEO", "AI Act")
    • Then runs statistical NER model
  • Extracts entities: ORG, PERSON, GPE, LAW, PRODUCT, ROLE, EVENT, etc.
  • Outputs sentence-level entity annotations with character offsets

groundkg/candidates.py

  • Pairs entities within sentences
  • Adds noun phrase chunks (regex-based) to catch missed entities
  • Filters by distance (max 150 chars between entities)
  • Outputs: (subject, object) candidate pairs
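The pairing-plus-distance filter described above can be sketched as follows (a simplified illustration; the function and field names are not the actual candidates.py API, and the 150-char limit mirrors the stated default):

```python
from itertools import combinations

MAX_CHAR_DIST = 150  # max chars allowed between two entity spans

def pair_candidates(entities):
    """entities: dicts with 'text', 'start', 'end' char offsets within one sentence.
    Returns (subject, object) pairs whose spans lie within MAX_CHAR_DIST."""
    pairs = []
    for a, b in combinations(sorted(entities, key=lambda e: e["start"]), 2):
        gap = b["start"] - a["end"]  # chars between end of first and start of second
        if 0 <= gap <= MAX_CHAR_DIST:
            pairs.append((a["text"], b["text"]))
    return pairs

ents = [
    {"text": "Tesla", "start": 0, "end": 5},
    {"text": "Austin", "start": 26, "end": 32},
]
print(pair_candidates(ents))  # [('Tesla', 'Austin')]
```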

3. Model Training

training/train_re_transformers.py (default)

  • Uses sentence transformer embeddings (all-MiniLM-L6-v2) + Logistic Regression
  • Better semantic understanding than TF-IDF
  • Exports LogisticRegression to ONNX format for inference
  • Auto-tunes thresholds using precision-recall curves
  • Outputs: promoter_v1.onnx, thresholds.json, classes.json
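The threshold auto-tuning is not shown in this guide; one common approach (a sketch, not the training script's actual code) is to sweep candidate cutoffs for each predicate on the dev set and keep the one maximizing F1:

```python
def tune_threshold(probs, gold, candidates=None):
    """probs: predicted probability for one predicate per dev example.
    gold: 1 if that predicate is the true label for the example, else 0.
    Returns the cutoff maximizing F1 over the candidate grid."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        tp = sum(1 for p, g in zip(probs, gold) if p >= t and g == 1)
        fp = sum(1 for p, g in zip(probs, gold) if p >= t and g == 0)
        fn = sum(1 for p, g in zip(probs, gold) if p < t and g == 1)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# A well-separated classifier pushes the chosen cutoff between the two groups:
t = tune_threshold([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```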

training/train_re_sklearn.py (deprecated)

  • TF-IDF + Logistic Regression model (kept for backward compatibility)
  • Use training/train_re_transformers.py instead

4. Model Inference

groundkg/re_score.py

  • Loads sentence transformer model (all-MiniLM-L6-v2) for embeddings
  • Loads ONNX model (LogisticRegression) for classification
  • Converts text to embeddings, then runs ONNX model on embeddings
  • Outputs per-pair predictions with probabilities
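Mapping the model's probability vector back to class labels is the last step; a sketch of that mapping (the real script reads the label order from models/classes.json; the order below is illustrative):

```python
def label_prediction(probs, classes):
    """Pair each class label with its probability and pick the argmax."""
    scored = dict(zip(classes, probs))
    best = max(scored, key=scored.get)
    return {"label": best, "score": scored[best], "probs": scored}

classes = ["none", "headquartered_in", "uses"]  # illustrative label order
pred = label_prediction([0.10, 0.85, 0.05], classes)
print(pred["label"])  # headquartered_in
```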

tools/promote_from_scored.py

  • Applies per-class thresholds and dedupes to produce final edges.
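The promote step can be sketched like this (pure Python; the record fields mirror the edges.jsonl example later in this guide, but the helper itself is illustrative, not the script's actual code):

```python
def promote(scored, thresholds, default_threshold=0.8):
    """Keep predictions at or above their per-class threshold; dedupe on (s, p, o)."""
    seen, edges = set(), []
    for rec in scored:
        pred, prob = rec["predicate"], rec["prob"]
        if pred == "none":
            continue  # 'none' never becomes an edge
        if prob < thresholds.get(pred, default_threshold):
            continue
        key = (rec["subject"], pred, rec["object"])
        if key in seen:
            continue  # duplicate triple from another sentence
        seen.add(key)
        edges.append(rec)
    return edges

scored = [
    {"subject": "Tesla", "predicate": "headquartered_in", "object": "Austin", "prob": 0.92},
    {"subject": "Tesla", "predicate": "headquartered_in", "object": "Austin", "prob": 0.90},
    {"subject": "Tesla", "predicate": "uses", "object": "AI", "prob": 0.40},
]
edges = promote(scored, {"headquartered_in": 0.9, "uses": 0.8})
print(len(edges))  # 1
```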

How to Use

Initial Setup (One-Time)

# Install dependencies
pip install -r requirements.txt

# Download spaCy model
python -m spacy download en_core_web_sm

Automatic Seed Bootstrap

When starting with a new corpus and no existing model, you can automatically generate initial training seeds:

# 1) Extract entities and candidates (no model needed)
make -f Makefile.gk ner cand

# 2) Auto-generate seeds from clear connector patterns
make -f Makefile.gk bootstrap_seed
# or use one-shot helper:
make -f Makefile.gk bootstrap

# 3) Train first model from auto-generated seeds
make -f Makefile.gk coldstart

# 4) Run full pipeline
make -f Makefile.gk crawl manifest ner cand score infer edges ttl report

How Bootstrap Works:

tools/bootstrap_seed_from_candidates.py scans out/pack.candidates.jsonl for entity pairs and looks for high-precision connector phrases:

  • "is headquartered in" → headquartered_in
  • "subsidiary of" → subsidiary_of
  • "uses" → uses
  • "provides" → provides
  • And 7 more patterns

The bootstrap is type-gated (only accepts pairs with compatible entity types) and conservative (avoids overly broad matches). It automatically adds negative examples (labeled "none") from candidate pairs that don't match any pattern, ensuring at least 2 classes for training (required by logistic regression). Outputs training/seed.jsonl in training format.
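The connector matching can be sketched as follows (the phrase table and the helper are illustrative; the real type gates and full pattern set live in tools/bootstrap_seed_from_candidates.py):

```python
# Illustrative subset of the high-precision connector phrases listed above.
CONNECTORS = {
    "is headquartered in": "headquartered_in",
    "subsidiary of": "subsidiary_of",
    "uses": "uses",
    "provides": "provides",
}

def seed_label(sentence, subject, obj):
    """Return a predicate if a connector phrase links subject to object,
    else 'none' (which becomes a negative training example)."""
    s, o = sentence.find(subject), sentence.find(obj)
    if s == -1 or o == -1 or s >= o:
        return "none"
    between = sentence[s + len(subject):o].lower()
    for phrase, predicate in CONNECTORS.items():
        if phrase in between:
            return predicate
    return "none"

print(seed_label("Tesla is headquartered in Austin.", "Tesla", "Austin"))
```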

Future Bootstrap Methods (Placeholders):

  • Dependency-based: tools/bootstrap_seed_dep.py --enable (spaCy Matcher patterns)
  • KB-backed: tools/bootstrap_seed_kb.py --kb config/kb.json (distant supervision)
  • LLM-assisted: tools/bootstrap_seed_llm.py --enable (with gating + audits)

Typical Run

make -f Makefile.gk all
make pack_stats
make lint

Example Output

Model-Based Edges (out/edges.jsonl)

{"subject": "Tesla", "predicate": "headquartered_in", "object": "Austin", "evidence": {...}}
{"subject": "Tesla", "predicate": "headquartered_in", "object": "Texas", "evidence": {...}}

Knowledge Graph (out/graph.ttl)

ex:node/Tesla ex:headquartered_in ex:node/Austin .
ex:node/Tesla ex:headquartered_in ex:node/Texas .

Training Data Format

Input: training/re_train.jsonl

{"text": "[E1]Tesla[/E1] is headquartered in [E2]Austin, Texas[/E2].", "label": "headquartered_in"}
{"text": "[E1]ACME[/E1] is [E2]a Digital Service Provider[/E2].", "label": "type"}
{"text": "[E1]System[/E1] consists of [E2]three components[/E2].", "label": "consists_of"}
{"text": "According to [E1]the vendor[/E1], [E2]the model[/E2] may achieve 92% accuracy.", "label": "none"}
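Given a sentence and the character offsets of the two entities, the [E1]/[E2] markers shown above can be inserted with a small helper (a sketch; the offsets are assumed to come from the NER step):

```python
def mark_pair(text, s_span, o_span):
    """Wrap the subject span in [E1]...[/E1] and the object span in [E2]...[/E2].
    Spans are (start, end) character offsets; the subject must precede the object."""
    (s1, s2), (o1, o2) = s_span, o_span
    assert s2 <= o1, "subject span must come before object span"
    return (text[:s1] + "[E1]" + text[s1:s2] + "[/E1]"
            + text[s2:o1] + "[E2]" + text[o1:o2] + "[/E2]" + text[o2:])

sent = "Tesla is headquartered in Austin, Texas."
print(mark_pair(sent, (0, 5), (26, 39)))
# [E1]Tesla[/E1] is headquartered in [E2]Austin, Texas[/E2].
```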

Current Performance

See the quality indicators printed at the end of the make pipeline run.

Advantages Over Rule-Based

  • ✅ No Hard-Coded Patterns: Model learns from examples
  • ✅ Generalizes Better: Works on unseen sentence structures
  • ✅ Extensible: Add new predicates by adding training examples
  • ✅ Probabilistic: Confidence scores for filtering
  • ✅ Noun Chunk Support: Catches multi-word entities NER misses

Configuration Tuning

Increasing Precision (Fewer False Positives)

Edit models/thresholds.json:

{
  "type": 0.85,
  "headquartered_in": 0.90,
  "consists_of": 0.85,
  "performs_task": 0.85,
  "none": 1.00
}

Adding New Predicates

  1. Add to config/predicates.yaml
  2. Add training examples to training/re_train.jsonl
  3. Retrain: python training/train_re_transformers.py (or make -f Makefile.gk coldstart)

Improving Recall (More Extractions)

  • Lower thresholds in models/thresholds.json
  • Add more positive training examples
  • Increase MAX_CHAR_DIST in candidates.py

Files Created

config/
  predicates.yaml          # Predicate taxonomy
models/
  promoter_v1.onnx         # Trained ONNX model (LogisticRegression)
  thresholds.json          # Confidence thresholds
  classes.json             # Class labels
training/
  train_re_transformers.py # Training script (sentence transformers, default)
  train_re_sklearn.py      # Training script (TF-IDF, deprecated)
  re_train.jsonl           # Training examples
  re_dev.jsonl             # Development examples
  ruler_patterns.jsonl     # EntityRuler patterns (domain-specific entities)
tools/
  bootstrap_seed_from_candidates.py  # Auto-seed generation
groundkg/
  ner_tag.py               # NER extraction
  candidates.py            # Candidate generation
  re_score.py              # Model inference (sentence transformers + ONNX)
out/
  pack.ner.jsonl           # NER output
  pack.candidates.jsonl    # Candidate pairs
  pack.scored.jsonl        # Scored pairs
  edges.jsonl              # Final edges
  graph.ttl                # RDF graph

Next Steps

  1. Collect More Training Data: Run on larger corpora to bootstrap more examples
  2. Refine Thresholds: Tune per-predicate thresholds on dev set
  3. Experiment with Models: Try transformers (BERT, RoBERTa) for better accuracy
  4. Active Learning: Manually label uncertain predictions to improve model
  5. Domain Adaptation: Fine-tune on domain-specific text

Reproducibility

make demo
make hash
# Output hashes:
# 6e25b5c4cae5835cb6057d74043370246cd133a8  out/edges.jsonl
# 776d1b593cd9641b4f3214a9beb04591f454e398  out/statements.scored.jsonl
# 23b02cd22b7cb37b4064eaa74deb56826f3f1a24  out/attributes.jsonl
# be5768f91095360d98b9bd34169a766c231a1d3d  out/graph.ttl

Success! 🎉

The model-based pipeline is fully operational and tested. Rule-based connector patterns are used only to bootstrap training seeds; runtime extraction is purely model-based, giving a smooth migration path from deterministic patterns to learned models.