This pipeline enables automatic corpus collection, annotation, and self-training for the GroundKG relation extraction system. It crawls documents, extracts text, runs ML inference, and uses high-confidence predictions to retrain the model.
Seed URLs (seed.csv)
↓
[crawl.py] → Download & extract text
↓ (data/corpus/*.txt + data/meta.jsonl)
[pack_corpus] → NER → Candidates → RE Scoring
↓ (out/pack.scored.jsonl)
[auto_train] → Select high-conf examples → Retrain
↓ (models/promoter_v1.onnx updated)
[edges_from_pack] → Promote scored to edges + graph
data/
  seed.csv                        # Input: URLs to crawl
  raw/                            # Downloaded HTML/PDF files
  corpus/                         # Extracted text files
  meta.jsonl                      # Metadata for each document
tools/
  crawl.py                        # Web crawler with PDF/HTML support
  build_manifest.py               # Generate docs.yaml manifest
  select_training_from_scored.py  # Self-training data selection
groundkg/
  re_score.py                     # Score candidates with probabilities
models/
  promoter_v1.onnx                # Trained model
  thresholds.json                 # Confidence thresholds
  classes.json                    # Class labels
out/
  pack.ner.jsonl                  # NER results for corpus
  pack.candidates.jsonl           # Entity pairs
  pack.scored.jsonl               # Scored predictions
pip install -r requirement.txt
New dependencies:
- requests - HTTP client
- trafilatura - HTML text extraction
- pdfminer.six - PDF text extraction
- beautifulsoup4 - HTML parsing
- python-slugify - Safe filenames
Edit data/seed.csv:
doc_id,url,license,lang
eur_nis2,https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32022L2555,CC-BY-4.0,en
eur_ai_act,https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206,CC-BY-4.0,en
wikipedia_ai,https://en.wikipedia.org/wiki/Artificial_intelligence,CC-BY-SA-4.0,en
# Download documents and extract text
make crawl
# Generate manifest (docs.yaml)
make manifest
# Verify
ls -lh data/corpus/
head data/meta.jsonl
# Run NER, candidate generation, and RE scoring on the entire corpus
make pack_corpus
# View statistics
make pack_stats
# Output:
# Pred counts in pack:
# 1523 none
# 142 type
# 87 headquartered_in
# 45 performs_task
# 23 consists_of
# 12 covered_by
# Select high-confidence predictions and retrain model
make auto_train
# With custom thresholds:
GK_POS_THR=0.97 GK_NEG_THR=0.98 GK_MAX_PER_CLASS=1000 make auto_train
# Promote scored predictions to edges and export TTL
make edges_from_pack
make verify
make hash
Features:
- ✅ Respects robots.txt
- ✅ HTML & PDF support
- ✅ Rate limiting (0.2s delay)
- ✅ Size limits (25 MB max)
- ✅ SHA-256 checksums
- ✅ Metadata tracking
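The robots.txt check, the size cap, and the checksum from the feature list above could fit together roughly as follows. This is a minimal standard-library sketch, not the code in crawl.py: the function names `is_allowed` and `hash_capped` are hypothetical, and only the constants mirror values documented in this guide.

```python
import hashlib
from urllib.robotparser import RobotFileParser

MAX_BYTES = 25_000_000  # documented size cap (25 MB)
UA = "GroundKGFetcher/1.0 (+https://groundkg.com)"  # documented User-Agent


def is_allowed(robots_txt: str, url: str, user_agent: str = UA) -> bool:
    """Return True if the given robots.txt content permits fetching `url`.

    Hypothetical helper: takes the robots.txt body as a string so the
    check can be exercised without network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def hash_capped(chunks, max_bytes: int = MAX_BYTES) -> str:
    """Compute SHA-256 over an iterable of byte chunks.

    Raises ValueError as soon as the running total exceeds `max_bytes`,
    so an oversized download can be abandoned early.
    """
    digest = hashlib.sha256()
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise ValueError(f"download exceeds {max_bytes} bytes")
        digest.update(chunk)
    return digest.hexdigest()
```

With the requests library, the chunks could come from a streamed response (`response.iter_content(chunk_size=...)`), which keeps memory use flat for large PDFs.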
Text Extraction:
- HTML: trafilatura (high quality) + BeautifulSoup fallback
- PDF: pdfminer.six
Output:
- Raw files: data/raw/<doc_id>.(html|pdf)
- Text files: data/corpus/<doc_id>.txt
- Metadata: data/meta.jsonl
Generates docs.yaml with provenance:
- doc_id: eur_nis2
  title: "Directive (EU) 2022/2555 on cybersecurity"
  url: https://example.com/nis2.pdf
  license: CC-BY-4.0
  sha256: abc123...
Pipeline Components:
- Sentencizer: Splits text into sentences (runs first)
- EntityRuler: Pattern-based entity recognition (loads from training/ruler_patterns.jsonl)
  - Domain-specific entities: "OpenAI", "Tesla", "AI Act"
  - Roles: "CEO", "CTO", "founder"
  - Custom labels: ROLE, EVENT, LAW, PRODUCT
- NER Model: Statistical entity recognition (spaCy transformer model)
Pattern File Format (training/ruler_patterns.jsonl):
{"label":"ROLE","pattern":"CEO"}
{"label":"ORG","pattern":"OpenAI"}
{"label":"PRODUCT","pattern":"Model 3"}
Usage:
- Add patterns to training/ruler_patterns.jsonl (one JSON object per line)
- Patterns are exact matches (case-sensitive unless using regex)
- EntityRuler runs before statistical NER for guaranteed coverage
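Because a single malformed line can break pattern loading, it may help to check the file before running NER. The sketch below is one possible validator; `validate_patterns` is a hypothetical helper, not part of the pipeline. It assumes each entry needs a string `label` and a `pattern` that is either a string (exact phrase) or a list of token dictionaries, which are the two forms spaCy's EntityRuler accepts.

```python
import json


def validate_patterns(lines):
    """Parse JSONL pattern lines and return (patterns, errors).

    `lines` is any iterable of strings, e.g. an open file handle.
    Blank lines are skipped; problems are reported with 1-based line numbers.
    """
    patterns, errors = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(entry, dict):
            errors.append(f"line {lineno}: expected a JSON object")
            continue
        label = entry.get("label")
        pattern = entry.get("pattern")
        if not isinstance(label, str) or not label:
            errors.append(f"line {lineno}: missing or empty 'label'")
        elif not isinstance(pattern, (str, list)) or not pattern:
            errors.append(f"line {lineno}: missing or empty 'pattern'")
        else:
            patterns.append(entry)
    return patterns, errors
```

In spaCy 3, the validated list could then be passed to the ruler with `nlp.add_pipe("entity_ruler", before="ner")` followed by `ruler.add_patterns(patterns)`. Whether the pipeline loads patterns this way is not stated here.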
Runs ONNX model inference and outputs probabilities:
{
"doc_id": "eur_nis2",
"sent_start": 1234,
"text": "The directive applies to essential entities.",
"subject": {"text": "directive", "start": 4, "end": 13, "label": "LAW"},
"object": {"text": "essential entities", "start": 25, "end": 43, "label": "ORG"},
"pred": "covered_by",
"prob": 0.967
}
Self-training logic:
- Selects examples where prob >= POS_THR for positive classes
- Selects examples where pred == "none" and prob >= NEG_THR
- Randomizes and splits 80/20 train/dev
Environment variables:
- GK_POS_THR: Positive threshold (default: 0.95)
- GK_NEG_THR: Negative threshold (default: 0.95)
- GK_MAX_PER_CLASS: Max examples per class (default: 500)
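The selection rules above can be sketched in a few lines of Python. This is an illustration of the described logic, not the contents of select_training_from_scored.py: the function name `select_training` and the `seed` parameter are hypothetical, and the choice to pick capped examples at random (rather than, say, by highest probability) is an assumption.

```python
import os
import random
from collections import defaultdict


def select_training(records, pos_thr=None, neg_thr=None, max_per_class=None, seed=0):
    """Pick high-confidence scored records and split them 80/20.

    Each record is a dict with at least "pred" and "prob", as in
    out/pack.scored.jsonl. Unset arguments fall back to the documented
    environment variables and their defaults.
    """
    if pos_thr is None:
        pos_thr = float(os.environ.get("GK_POS_THR", 0.95))
    if neg_thr is None:
        neg_thr = float(os.environ.get("GK_NEG_THR", 0.95))
    if max_per_class is None:
        max_per_class = int(os.environ.get("GK_MAX_PER_CLASS", 500))

    # Group records that clear the threshold for their predicted class.
    by_class = defaultdict(list)
    for rec in records:
        threshold = neg_thr if rec["pred"] == "none" else pos_thr
        if rec["prob"] >= threshold:
            by_class[rec["pred"]].append(rec)

    # Cap each class (random subset: an assumption), then shuffle everything.
    rng = random.Random(seed)
    selected = []
    for pred in sorted(by_class):
        items = by_class[pred]
        rng.shuffle(items)
        selected.extend(items[:max_per_class])
    rng.shuffle(selected)

    # 80/20 train/dev split.
    cut = int(len(selected) * 0.8)
    return selected[:cut], selected[cut:]
```

A fixed seed keeps the split reproducible between runs, which makes it easier to compare models across self-training iterations.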
Option A: Automatic seed generation (recommended)
# 1) Extract entities and candidates (no model needed)
make -f Makefile.gk ner cand
# 2) Auto-generate seeds from corpus
make -f Makefile.gk bootstrap_seed
# Note: Bootstrap automatically adds negative examples ("none" class)
# to ensure at least 2 classes for training
# 3) Train initial model
make -f Makefile.gk coldstart
# 4) Crawl corpus
make -f Makefile.gk crawl
make -f Makefile.gk manifest
# 5) Process corpus with initial model
make -f Makefile.gk ner cand score infer edges ttl report
# 6) Self-train (retrain from high-confidence predictions)
make -f Makefile.gk autoselect
# 7) Verify improved model
make -f Makefile.gk score infer edges ttl report
Option B: Manual seed creation (if you have labeled examples)
# 1) Create training/seed.jsonl manually
cat > training/seed.jsonl <<EOF
{"text":"[E1]Tesla[/E1] is headquartered in [E2]Austin[/E2].","label":"headquartered_in"}
{"text":"[E1]ACME[/E1] is a [E2]Digital Service Provider[/E2].","label":"type"}
{"text":"[E1]Neural networks[/E1] are used for [E2]pattern recognition[/E2].","label":"uses"}
EOF
# 2) Train initial model
make -f Makefile.gk coldstart
# 3) Crawl corpus
make -f Makefile.gk crawl
make -f Makefile.gk manifest
# 4) Process corpus with initial model
make -f Makefile.gk ner cand score infer edges ttl report
# 5) Self-train (retrain from high-confidence predictions)
make -f Makefile.gk autoselect
# 6) Verify improved model
make -f Makefile.gk score infer edges ttl report
# Add more URLs to data/seed.csv
# Then:
make crawl # Download new docs
make manifest # Update docs.yaml
make pack_corpus # Process with current model
make pack_stats # Check prediction distribution
# Retrain with stricter thresholds
GK_POS_THR=0.98 GK_NEG_THR=0.99 GK_MAX_PER_CLASS=2000 make auto_train
# Finalize edges
make edges_from_pack
TIMEOUT = 20 # HTTP timeout (seconds)
MAX_BYTES = 25_000_000 # Max file size (25 MB)
UA = "GroundKGFetcher/1.0 (+https://groundkg.com)"
MAX_CHAR_DIST = 150 # Max distance between entities (characters)
MAX_PAIRS_PER_SENT = 10 # Max pairs per sentence
# Conservative (high precision)
GK_POS_THR=0.98 GK_NEG_THR=0.99 GK_MAX_PER_CLASS=300
# Balanced
GK_POS_THR=0.95 GK_NEG_THR=0.95 GK_MAX_PER_CLASS=500
# Aggressive (higher recall)
GK_POS_THR=0.90 GK_NEG_THR=0.90 GK_MAX_PER_CLASS=1000
# Document count
ls data/corpus/*.txt | wc -l
# Total tokens (approx)
wc -w data/corpus/*.txt | tail -1
# Average document length
for f in data/corpus/*.txt; do wc -w < "$f"; done | awk '{s+=$1} END {print s/NR}'
# Predicate distribution
jq -r '.pred' out/pack.scored.jsonl | sort | uniq -c | sort -nr
# High-confidence predictions
jq 'select(.prob > 0.95)' out/pack.scored.jsonl | jq -s 'group_by(.pred) | map({pred: .[0].pred, count: length})'
# Low-confidence (may need review)
jq 'select(.prob < 0.60)' out/pack.scored.jsonl | head -10
# Sample by predicate
jq 'select(.pred == "headquartered_in" and .prob > 0.9)' out/pack.scored.jsonl | head -5
# Check balance
jq -r '.label' training/re_train.jsonl | sort | uniq -c
# View samples per class
for cls in type covered_by headquartered_in consists_of performs_task none; do
echo "=== $cls ===";
jq "select(.label == \"$cls\")" training/re_train.jsonl | head -2;
done
graph LR
A[Seed URLs] --> B[Crawl]
B --> C[Corpus]
C --> D[NER + Candidates]
D --> E[RE Score]
E --> F[High-conf Filter]
F --> G[Retrain Model]
G --> D
G --> H[Improved Predictions]
Workflow:
- Add more URLs → make crawl
- Process corpus → make pack_corpus
- Check quality → make pack_stats
- Retrain → make auto_train
- Evaluate → make demo
- Repeat!
- ✅ Always respect robots.txt
- ✅ Use reasonable delays (0.2s minimum)
- ✅ Set appropriate User-Agent
- ✅ Check licenses before using
- ✅ Store raw files for auditing
- 🎯 Start with high thresholds (0.95+)
- 📊 Monitor class balance
- 🔍 Manually review samples before retraining
- 📈 Track model performance over iterations
- 🚫 Don't blindly trust model predictions
- Inspect data/meta.jsonl for failed downloads
- Check for very short documents (< 200 chars)
- Review low-confidence predictions manually
- Maintain a held-out test set
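Two of the checks above are easy to script. The sketch below lists corpus files under the 200-character mark and collects low-confidence predictions for manual review. Both function names are hypothetical, and the 0.60 cut-off is borrowed from the jq example in this guide rather than from any pipeline default.

```python
import json
from pathlib import Path


def find_short_documents(corpus_dir, min_chars=200):
    """Return (filename, length) pairs for .txt files shorter than `min_chars`.

    Length is measured after stripping surrounding whitespace.
    """
    short = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        length = len(text.strip())
        if length < min_chars:
            short.append((path.name, length))
    return short


def low_confidence(jsonl_lines, max_prob=0.60):
    """Return scored records whose "prob" is below `max_prob`.

    `jsonl_lines` is any iterable of JSON strings, e.g. an open
    out/pack.scored.jsonl file handle. Blank lines are skipped.
    """
    flagged = []
    for line in jsonl_lines:
        if line.strip():
            record = json.loads(line)
            if record["prob"] < max_prob:
                flagged.append(record)
    return flagged
```

For example, `find_short_documents("data/corpus")` would flag extractions that probably failed, and `low_confidence(open("out/pack.scored.jsonl"))` would gather candidates for the manual review step.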
Track these over iterations:
# After each auto_train:
make pack_corpus
make pack_stats
# Track:
# 1. Total high-conf predictions (prob > 0.95)
# 2. Predicate distribution
# 3. Training set size
# 4. Model accuracy on manual test set
Crawler fails:
- Check internet connection
- Verify URL accessibility
- Check disk space
- Review error messages in stderr
Empty pack.scored.jsonl:
- Ensure corpus files exist: ls data/corpus/
- Verify model exists: ls models/promoter_v1.onnx
- Check NER output: head out/pack.ner.jsonl
Poor self-training results:
- Lower thresholds: GK_POS_THR=0.90
- Increase examples: GK_MAX_PER_CLASS=1000
- Add more seed URLs
- Manually review training data
doc_id,url,license,lang
eur_nis2,https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32022L2555,CC-BY-4.0,en
eur_gdpr,https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32016R0679,CC-BY-4.0,en
eur_ai_act,https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206,CC-BY-4.0,en
wiki_kg,https://en.wikipedia.org/wiki/Knowledge_graph,CC-BY-SA-4.0,en
wiki_nlp,https://en.wikipedia.org/wiki/Natural_language_processing,CC-BY-SA-4.0,en
Ready to scale! This pipeline enables continuous learning from web documents, making your knowledge graph extraction system progressively better over time. 🚀