Model-Based Relation Extraction Pipeline - Complete Guide

✅ Implementation Complete

This guide documents the model-based relation extraction pipeline that has been successfully implemented and tested.

Architecture Overview

Model-Only Operation

The pipeline runs purely model-based at runtime:

  • Uses spaCy NER to extract entities
  • Generates candidate entity pairs (+ noun chunks)
  • Scores relations with the ONNX model
  • Applies per-predicate thresholds to emit final edges

Pipeline Flow

crawl → data/corpus/*.txt
pack_corpus → out/pack.ner.jsonl → out/pack.candidates.jsonl → out/pack.scored.jsonl
auto_train + retrain → models/*
edges_from_pack → out/edges.jsonl → graph.ttl
quality → quality indicators

Key Components

1. Configuration Files

config/predicates.yaml

  • Defines 13 predicate types: type, covered_by, headquartered_in, operates_in, part_of, subsidiary_of, parent_of, member_of, uses, provides, requires, prohibits, none
  • Each predicate includes a description and template for human-readable representation
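The guide does not show the file's schema; an illustrative entry (field names are assumptions, not the actual config/predicates.yaml layout) might look like:

```yaml
# Hypothetical sketch of two predicates.yaml entries; field names are assumptions.
predicates:
  headquartered_in:
    description: "Subject organization has its headquarters in the object location."
    template: "{subject} is headquartered in {object}"
  none:
    description: "No relation holds between the entity pair."
    template: ""
```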

training/ruler_patterns.jsonl

  • EntityRuler patterns for domain-specific entities
  • One JSON object per line: {"label":"ROLE","pattern":"CEO"}
  • Patterns boost entities before statistical NER runs
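A minimal sketch of parsing such a JSONL file into the list-of-dicts form that spaCy's `EntityRuler.add_patterns` expects (pure Python here, so no spaCy dependency; the function name is illustrative):

```python
import json

def load_ruler_patterns(lines):
    """Parse one-JSON-object-per-line EntityRuler patterns, skipping blank lines."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        patterns.append(json.loads(line))
    return patterns

sample = '{"label":"ROLE","pattern":"CEO"}\n{"label":"ORG","pattern":"OpenAI"}\n'
pats = load_ruler_patterns(sample.splitlines())
print(pats[0]["label"])  # ROLE
```

The resulting list would be handed to the pipeline's `entity_ruler` component before the statistical NER runs.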

models/thresholds.json

  • Per-predicate confidence thresholds
  • Currently set to 0.2 for development (adjust to 0.8+ for production)

models/classes.json

  • List of class labels from training
  • Auto-generated during training

2. NER & Candidate Generation

groundkg/ner_tag.py

  • Runs spaCy's pre-trained NER on input text
  • Pipeline setup:
    • Adds sentencizer first for sentence boundary detection
    • Adds entity_ruler before NER to boost domain-specific entities
    • Loads patterns from training/ruler_patterns.jsonl (e.g., "OpenAI", "CEO", "AI Act")
    • Then runs statistical NER model
  • Extracts entities: ORG, PERSON, GPE, LAW, PRODUCT, ROLE, EVENT, etc.
  • Outputs sentence-level entity annotations with character offsets

groundkg/candidates.py

  • Pairs entities within sentences
  • Adds noun phrase chunks (regex-based) to catch missed entities
  • Filters by distance (max 150 chars between entities)
  • Outputs: (subject, object) candidate pairs
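The pairing-plus-distance filter described above can be sketched as follows (a simplified illustration; the function and field names are not the actual candidates.py API, and the 150-char limit mirrors the stated default):

```python
from itertools import combinations

MAX_CHAR_DIST = 150  # max chars allowed between two entity spans

def pair_candidates(entities):
    """entities: dicts with 'text', 'start', 'end' char offsets within one sentence.
    Returns (subject, object) pairs whose spans lie within MAX_CHAR_DIST."""
    pairs = []
    for a, b in combinations(sorted(entities, key=lambda e: e["start"]), 2):
        gap = b["start"] - a["end"]  # chars between end of first and start of second
        if 0 <= gap <= MAX_CHAR_DIST:
            pairs.append((a["text"], b["text"]))
    return pairs

ents = [
    {"text": "Tesla", "start": 0, "end": 5},
    {"text": "Austin", "start": 26, "end": 32},
]
print(pair_candidates(ents))  # [('Tesla', 'Austin')]
```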

3. Model Training

training/train_re_transformers.py (default)

  • Uses sentence transformer embeddings (all-MiniLM-L6-v2) + Logistic Regression
  • Better semantic understanding than TF-IDF
  • Exports LogisticRegression to ONNX format for inference
  • Auto-tunes thresholds using precision-recall curves
  • Outputs: promoter_v1.onnx, thresholds.json, classes.json
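The threshold auto-tuning is not shown in this guide; one common approach (a sketch, not the training script's actual code) is to sweep candidate cutoffs for each predicate on the dev set and keep the one maximizing F1:

```python
def tune_threshold(probs, gold, candidates=None):
    """probs: predicted probability for one predicate per dev example.
    gold: 1 if that predicate is the true label for the example, else 0.
    Returns the cutoff maximizing F1 over the candidate grid."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]
    best_t, best_f1 = 0.5, -1.0
    for t in candidates:
        tp = sum(1 for p, g in zip(probs, gold) if p >= t and g == 1)
        fp = sum(1 for p, g in zip(probs, gold) if p >= t and g == 0)
        fn = sum(1 for p, g in zip(probs, gold) if p < t and g == 1)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t

# A well-separated classifier pushes the chosen cutoff between the two groups:
t = tune_threshold([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
```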

training/train_re_sklearn.py (deprecated)

  • TF-IDF + Logistic Regression model (kept for backward compatibility)
  • Use training/train_re_transformers.py instead

4. Model Inference

groundkg/re_score.py

  • Loads sentence transformer model (all-MiniLM-L6-v2) for embeddings
  • Loads ONNX model (LogisticRegression) for classification
  • Converts text to embeddings, then runs ONNX model on embeddings
  • Outputs per-pair predictions with probabilities
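Mapping the model's probability vector back to class labels is the last step; a sketch of that mapping (the real script reads the label order from models/classes.json; the order below is illustrative):

```python
def label_prediction(probs, classes):
    """Pair each class label with its probability and pick the argmax."""
    scored = dict(zip(classes, probs))
    best = max(scored, key=scored.get)
    return {"label": best, "score": scored[best], "probs": scored}

classes = ["none", "headquartered_in", "uses"]  # illustrative label order
pred = label_prediction([0.10, 0.85, 0.05], classes)
print(pred["label"])  # headquartered_in
```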

tools/promote_from_scored.py

  • Applies per-class thresholds and dedupes to produce final edges.
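The promote step can be sketched like this (pure Python; the record fields mirror the edges.jsonl example later in this guide, but the helper itself is illustrative, not the script's actual code):

```python
def promote(scored, thresholds, default_threshold=0.8):
    """Keep predictions at or above their per-class threshold; dedupe on (s, p, o)."""
    seen, edges = set(), []
    for rec in scored:
        pred, prob = rec["predicate"], rec["prob"]
        if pred == "none":
            continue  # 'none' never becomes an edge
        if prob < thresholds.get(pred, default_threshold):
            continue
        key = (rec["subject"], pred, rec["object"])
        if key in seen:
            continue  # duplicate triple from another sentence
        seen.add(key)
        edges.append(rec)
    return edges

scored = [
    {"subject": "Tesla", "predicate": "headquartered_in", "object": "Austin", "prob": 0.92},
    {"subject": "Tesla", "predicate": "headquartered_in", "object": "Austin", "prob": 0.90},
    {"subject": "Tesla", "predicate": "uses", "object": "AI", "prob": 0.40},
]
edges = promote(scored, {"headquartered_in": 0.9, "uses": 0.8})
print(len(edges))  # 1
```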

How to Use

Initial Setup (One-Time)

# Install dependencies
pip install -r requirements.txt

# Download spaCy model
python -m spacy download en_core_web_sm

Automatic Seed Bootstrap

When starting with a new corpus and no existing model, you can automatically generate initial training seeds:

# 1) Extract entities and candidates (no model needed)
make -f Makefile.gk ner cand

# 2) Auto-generate seeds from clear connector patterns
make -f Makefile.gk bootstrap_seed
# or use one-shot helper:
make -f Makefile.gk bootstrap

# 3) Train first model from auto-generated seeds
make -f Makefile.gk coldstart

# 4) Run full pipeline
make -f Makefile.gk crawl manifest ner cand score infer edges ttl report

How Bootstrap Works:

tools/bootstrap_seed_from_candidates.py scans out/pack.candidates.jsonl for entity pairs and looks for high-precision connector phrases:

  • "is headquartered in" → headquartered_in
  • "subsidiary of" → subsidiary_of
  • "uses" → uses
  • "provides" → provides
  • And 7 more patterns

The bootstrap is type-gated (only accepts pairs with compatible entity types) and conservative (avoids overly broad matches). It automatically adds negative examples (labeled "none") from candidate pairs that don't match any pattern, ensuring at least 2 classes for training (required by logistic regression). Outputs training/seed.jsonl in training format.
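The connector matching can be sketched as follows (the phrase table and the helper are illustrative; the real type gates and full pattern set live in tools/bootstrap_seed_from_candidates.py):

```python
# Illustrative subset of the high-precision connector phrases listed above.
CONNECTORS = {
    "is headquartered in": "headquartered_in",
    "subsidiary of": "subsidiary_of",
    "uses": "uses",
    "provides": "provides",
}

def seed_label(sentence, subject, obj):
    """Return a predicate if a connector phrase links subject to object,
    else 'none' (which becomes a negative training example)."""
    s, o = sentence.find(subject), sentence.find(obj)
    if s == -1 or o == -1 or s >= o:
        return "none"
    between = sentence[s + len(subject):o].lower()
    for phrase, predicate in CONNECTORS.items():
        if phrase in between:
            return predicate
    return "none"

print(seed_label("Tesla is headquartered in Austin.", "Tesla", "Austin"))
```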

Future Bootstrap Methods (Placeholders):

  • Dependency-based: tools/bootstrap_seed_dep.py --enable (spaCy Matcher patterns)
  • KB-backed: tools/bootstrap_seed_kb.py --kb config/kb.json (distant supervision)
  • LLM-assisted: tools/bootstrap_seed_llm.py --enable (with gating + audits)

Typical Run

make -f Makefile.gk all
make pack_stats
make lint

Example Output

Model-Based Edges (out/edges.jsonl)

{"subject": "Tesla", "predicate": "headquartered_in", "object": "Austin", "evidence": {...}}
{"subject": "Tesla", "predicate": "headquartered_in", "object": "Texas", "evidence": {...}}

Knowledge Graph (out/graph.ttl)

ex:node/Tesla ex:headquartered_in ex:node/Austin .
ex:node/Tesla ex:headquartered_in ex:node/Texas .

Training Data Format

Input: training/re_train.jsonl

{"text": "[E1]Tesla[/E1] is headquartered in [E2]Austin, Texas[/E2].", "label": "headquartered_in"}
{"text": "[E1]ACME[/E1] is [E2]a Digital Service Provider[/E2].", "label": "type"}
{"text": "[E1]System[/E1] consists of [E2]three components[/E2].", "label": "consists_of"}
{"text": "According to [E1]the vendor[/E1], [E2]the model[/E2] may achieve 92% accuracy.", "label": "none"}
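Given a sentence and the character offsets of the two entities, the [E1]/[E2] markers shown above can be inserted with a small helper (a sketch; the offsets are assumed to come from the NER step):

```python
def mark_pair(text, s_span, o_span):
    """Wrap the subject span in [E1]...[/E1] and the object span in [E2]...[/E2].
    Spans are (start, end) character offsets; the subject must precede the object."""
    (s1, s2), (o1, o2) = s_span, o_span
    assert s2 <= o1, "subject span must come before object span"
    return (text[:s1] + "[E1]" + text[s1:s2] + "[/E1]"
            + text[s2:o1] + "[E2]" + text[o1:o2] + "[/E2]" + text[o2:])

sent = "Tesla is headquartered in Austin, Texas."
print(mark_pair(sent, (0, 5), (26, 39)))
# [E1]Tesla[/E1] is headquartered in [E2]Austin, Texas[/E2].
```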

Current Performance

See the quality indicators printed at the end of the make pipeline run.

Advantages Over Rule-Based

  • ✅ No Hard-Coded Patterns: Model learns from examples
  • ✅ Generalizes Better: Works on unseen sentence structures
  • ✅ Extensible: Add new predicates by adding training examples
  • ✅ Probabilistic: Confidence scores for filtering
  • ✅ Noun Chunk Support: Catches multi-word entities NER misses

Configuration Tuning

Increasing Precision (Fewer False Positives)

Edit models/thresholds.json:

{
  "type": 0.85,
  "headquartered_in": 0.90,
  "consists_of": 0.85,
  "performs_task": 0.85,
  "none": 1.00
}

Adding New Predicates

  1. Add to config/predicates.yaml
  2. Add training examples to training/re_train.jsonl
  3. Retrain: python training/train_re_transformers.py (or make -f Makefile.gk coldstart)

Improving Recall (More Extractions)

  • Lower thresholds in models/thresholds.json
  • Add more positive training examples
  • Increase MAX_CHAR_DIST in candidates.py

Files Created

config/
  predicates.yaml          # Predicate taxonomy
models/
  promoter_v1.onnx         # Trained ONNX model (LogisticRegression)
  thresholds.json          # Confidence thresholds
  classes.json             # Class labels
training/
  train_re_transformers.py # Training script (sentence transformers, default)
  train_re_sklearn.py      # Training script (TF-IDF, deprecated)
  re_train.jsonl           # Training examples
  re_dev.jsonl             # Development examples
  ruler_patterns.jsonl     # EntityRuler patterns (domain-specific entities)
tools/
  bootstrap_seed_from_candidates.py  # Auto-seed generation
groundkg/
  ner_tag.py               # NER extraction
  candidates.py            # Candidate generation
  re_score.py              # Model inference (sentence transformers + ONNX)
out/
  pack.ner.jsonl           # NER output
  pack.candidates.jsonl    # Candidate pairs
  pack.scored.jsonl        # Scored pairs
  edges.jsonl              # Final edges
  graph.ttl                # RDF graph

Next Steps

  1. Collect More Training Data: Run on larger corpora to bootstrap more examples
  2. Refine Thresholds: Tune per-predicate thresholds on dev set
  3. Experiment with Models: Try transformers (BERT, RoBERTa) for better accuracy
  4. Active Learning: Manually label uncertain predictions to improve model
  5. Domain Adaptation: Fine-tune on domain-specific text

Reproducibility

make demo
make hash
# Output hashes:
# 6e25b5c4cae5835cb6057d74043370246cd133a8  out/edges.jsonl
# 776d1b593cd9641b4f3214a9beb04591f454e398  out/statements.scored.jsonl
# 23b02cd22b7cb37b4064eaa74deb56826f3f1a24  out/attributes.jsonl
# be5768f91095360d98b9bd34169a766c231a1d3d  out/graph.ttl

Success! 🎉

The model-based pipeline is fully operational and tested. Rule-based connector patterns are used only to bootstrap training seeds; runtime extraction is purely model-based, giving a smooth migration path from deterministic patterns to learned models.