Production-grade knowledge extraction system with 25 locked predicates, strict JSON Schema validation, deduplication, Information Value Score (IVS) ranking, and retrieval re-ranking.
FactBuilder extracts structured, high-value facts from video content (narrative + transcript) using LLMs with constrained outputs, then deduplicates, scores, and ranks them for optimal retrieval quality.
- ✅ 25 locked predicates across 6 fact classes (definition, constraint, mechanism, procedure, measurement, exception)
- ✅ Strict JSON Schema validation (OpenAI Structured Outputs + Claude tool-use)
- ✅ Salience gate - Filters pricing, promos, platform meta, greetings (keeps causality, constraints, procedures, measurements)
- ✅ Triple-signature deduplication (SimHash + MinHash + embeddings)
- ✅ Canonicalization - Normalizes units (°C, seconds, kg, m), quantizes angles/times/ratios
- ✅ Information Value Score (IVS) - `rarity × specificity × support × trust`
- ✅ Creator depth metrics - UFR, RW_UFR, depth score (daily aggregates)
- ✅ Retrieval re-ranker - `0.60×SemSim + 0.25×IVS + 0.10×Depth - 0.05×DupPenalty`
- ✅ Per-pool plugins - Domain-specific qualifiers (e.g., `grip_width_cm` for fitness, `doneness` for cooking)
- Only extract useful knowledge: Causality, constraints, procedural steps, definitions, measurements that matter for execution
- Stable ontology: Fixed predicate set across pools; per-pool qualifiers for domain specifics
- Deterministic structure: Strict schema + constrained decoding
- Early dedupe: Collapse duplicates at ingestion; store support counts, not copies
- Retrieval-time quality bias: Re-rank by rarity, specificity, support, and creator depth
```
┌──────────────────────────────────────────────────────────────────┐
│ VIDEO CONTENT (Narrative + Transcript)                           │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 1. SALIENCE GATE                                                 │
│    - Drop: pricing, promos, platform meta, greetings             │
│    - Keep: causal markers, constraints, procedures, quantities   │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. LLM EXTRACTION (OpenAI Structured Outputs / Claude tool-use)  │
│    - 25 predicates across 6 classes                              │
│    - Schema-validated JSON output                                │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. CANONICALIZATION                                              │
│    - Normalize units (°F→°C, min→seconds, lbs→kg)                │
│    - Quantize: angles→5°, times<120s→5s, else 15s                │
│    - Create canonical_str for deduplication                      │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 4. DEDUPLICATION (SimHash + MinHash + Embeddings)                │
│    - same_fact: hamming≤3 OR (cosine≥0.92 AND jaccard≥0.65)      │
│    - near_variant: 0.85≤cosine<0.92                              │
│    - Action: increment support_count, link variants              │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 5. IVS SCORING                                                   │
│    IVS = rarity × specificity × support_w × trust                │
│    rarity = ln(1 + N_creators / (1 + df_fact))                   │
│    specificity = 1 + bonuses (numeric, causal, step, condition)  │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 6. CREATOR METRICS (daily)                                       │
│    UFR, RW_UFR, Depth = f(qualifiers, steps, causal rate, IVS)   │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 7. RETRIEVAL RE-RANKER                                           │
│    Score = 0.60×SemSim + 0.25×IVS + 0.10×Depth - 0.05×DupPenalty │
│    Diversity: penalize >2 facts/creator in top-N                 │
└──────────────────────────────────────────────────────────────────┘
```
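The unit and quantization rules in step 3 are compact enough to show directly. A minimal sketch, with helper names of our own choosing (the real logic lives in `workers/canonicalize.py`):

```python
def quantize_angle(deg: float) -> int:
    """Snap angles to the nearest 5 degrees (pipeline step 3)."""
    return int(round(deg / 5) * 5)

def quantize_time(seconds: float) -> int:
    """Snap times to 5 s bins under 120 s, 15 s bins otherwise."""
    step = 5 if seconds < 120 else 15
    return int(round(seconds / step) * step)

def to_celsius(fahrenheit: float) -> float:
    """Normalize temperatures to °C before building canonical_str."""
    return round((fahrenheit - 32) * 5 / 9, 1)
```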
- Python 3.9+
- PostgreSQL 12+ with `pgvector` extension
- OpenAI API key or Anthropic API key
```bash
# Clone the repository
git clone https://github.com/yourusername/factbuilder.git
cd factbuilder

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install the pgvector extension in PostgreSQL (the extension is named "vector")
psql -U postgres -d factbuilder -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Create database schema
psql -U postgres -d factbuilder -f database/schema.sql

# Configure environment
cp .env.example .env
# Edit .env with your credentials and API keys
```

```python
from workers.extractor import OpenAIExtractor
import json

# Load pool configuration
with open("plugins/fitness.json") as f:
    fitness_config = json.load(f)

# Initialize extractor
extractor = OpenAIExtractor(
    api_key="your-openai-api-key",
    pool_config=fitness_config,
)

# Extract facts from content
results = extractor.extract_from_source(
    source_id="video_12345",
    creator_id="trainer_abc",
    pool_id=1,  # fitness
    narrative="The trainer demonstrates proper squat form...",
    transcript="Today we're going to focus on squat technique. "
               "You must keep your knees aligned over your toes at 90 degrees...",
    lang="en",
    duration_s=300,
)

# Access results
print(f"Extracted {results['stats']['facts_unique']} unique facts")
print(f"Dropped {results['stats']['spans_dropped']} boring spans")
print(f"Average IVS: {results['stats']['avg_ivs']:.3f}")

# Iterate over facts
for fact in results['facts']:
    print(f"  - {fact['predicate']}: {fact['subject_text']}")
```

```python
from workers.extractor import ClaudeExtractor

extractor = ClaudeExtractor(
    api_key="your-anthropic-api-key",
    pool_config=fitness_config,
)
# Rest is the same...
```

A) Definition/Taxonomy
- `is_a` - Taxonomy relationships ("Deadlift is an exercise")
- `aka` - Aliases ("Paneer is also called cottage cheese")
- `part_of` - Component-whole ("Quadriceps is part of the leg")
- `requires` - Prerequisites ("Pistol squat requires ankle mobility")

B) Constraint/Rule
- `must` - Hard requirements ("You must keep your back straight")
- `should` - Recommendations ("You should warm up first")
- `avoid` - Things to avoid ("Avoid rounding your spine")
- `contraindicated_for` - Safety ("Contraindicated for pregnancy")

C) Mechanism/Causality
- `causes` - Causal relationships ("Hip hinge causes glute activation")
- `prevents` - Prevention ("Proper form prevents injury")
- `increases` - Increases a metric ("Heavy load increases strength")
- `reduces` - Reduces a metric ("Rest reduces fatigue")
- `works_by` - Mechanism ("Works by engaging the posterior chain")

D) Procedure/Steps
- `has_step` - Sequential steps ("Step 1: Position feet shoulder-width")
- `precondition` - Pre-conditions ("Ensure bar is racked securely")
- `postcondition` - Post-conditions ("Bar should be at hip level")

E) Measurement/Quantification
- `has_value` - Specific values ("Temperature is 180°C")
- `has_range` - Ranges ("Between 50-100 reps")
- `angle` - Angular measurements ("90 degrees at the knee")
- `time` - Time durations ("Hold for 30 seconds")
- `temperature` - Temperatures ("Bake at 180°C")
- `reps` - Exercise reps ("3 sets of 10 reps")
- `ratio` - Ratios ("Coffee to water ratio of 1:15")

F) Exceptions/Warnings
- `warning` - Safety warnings ("Risk of shoulder impingement")
- `failure_mode` - Common failures ("Knees cave inward")
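For orientation, here is one way the locked inventory could be laid out in code. This is an illustrative sketch, not the actual contents of `core/predicates.py`:

```python
# Hypothetical layout of the locked inventory; the real definitions
# live in core/predicates.py.
PREDICATES_BY_CLASS = {
    "definition": {"is_a", "aka", "part_of", "requires"},
    "constraint": {"must", "should", "avoid", "contraindicated_for"},
    "mechanism": {"causes", "prevents", "increases", "reduces", "works_by"},
    "procedure": {"has_step", "precondition", "postcondition"},
    "measurement": {"has_value", "has_range", "angle", "time",
                    "temperature", "reps", "ratio"},
    "exception": {"warning", "failure_mode"},
}

ALL_PREDICATES = frozenset().union(*PREDICATES_BY_CLASS.values())
assert len(ALL_PREDICATES) == 25  # the set is locked; never extend it
```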
Global types include: concept, category, pose, exercise, technique, procedure, step, movement, body_part, joint, muscle, ingredient, food, mixture, tool, equipment, device, software, dataset, model, method, metric, parameter, measurement, condition, symptom, risk, benefit, population, product, brand, organization, person, location, environment, material, chemical, recipe, event, other.
Per-pool plugins can add domain-specific qualifiers (see plugins/README.md).
Three fingerprinting methods:
- SimHash (64-bit): Fast hamming distance check (threshold: 3 bits)
- MinHash (100 permutations): Jaccard similarity estimation (threshold: 0.65)
- Embeddings (1024-d): Cosine similarity (same: ≥0.92, variant: 0.85-0.92)
Decision:
- same_fact: `hamming≤3` OR (`cosine≥0.92` AND `jaccard≥0.65`) → increment `support_count`
- near_variant: `0.85≤cosine<0.92` → insert + link with `relation="near_variant"`
- distinct: Insert as new fact
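A minimal sketch of that decision rule, assuming the three similarity signals are already computed (the function name is illustrative; the real logic lives in `workers/dedupe.py`):

```python
def classify_pair(hamming: int, cosine: float, jaccard: float) -> str:
    """Apply the documented thresholds to a candidate fact pair."""
    if hamming <= 3 or (cosine >= 0.92 and jaccard >= 0.65):
        return "same_fact"     # increment support_count on the canonical fact
    if 0.85 <= cosine < 0.92:
        return "near_variant"  # insert, then link with relation="near_variant"
    return "distinct"          # insert as a new fact
```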
IVS = rarity × specificity × support_w × trust

Where:
- rarity = `ln(1 + N_creators / (1 + df_fact))` - Higher for rarer facts
- specificity = `1 + bonuses` (0.20×has_numeric, 0.20×has_condition, 0.25×is_causal/constraint, 0.15×has_step, 0.10×named_entity)
- support_w = `min(1.0 + 0.1×(support_count-1), 1.2)` - Boost for multiple sightings
- trust = `clip(asr_confidence × narration_quality, 0.6, 1.0)` - Confidence in extraction
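Put together, the score is a straight product of the four components. A sketch under the formulas above (the function signature is our own; see `workers/score.py` for the real implementation):

```python
import math

def ivs(n_creators, df_fact, bonuses, support_count, asr_conf, narr_quality):
    """Compose the four IVS components exactly as documented above.

    `bonuses` is the summed specificity bonus (0.20*has_numeric + ...).
    """
    rarity = math.log(1 + n_creators / (1 + df_fact))
    specificity = 1 + bonuses
    support_w = min(1.0 + 0.1 * (support_count - 1), 1.2)
    trust = min(max(asr_conf * narr_quality, 0.6), 1.0)
    return rarity * specificity * support_w * trust
```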
Computed daily per pool:
Depth = 0.35×(avg_qualifiers/3) + 0.25×(avg_steps/5) + 0.25×causal_constraint_rate + 0.15×(median_top50_ivs/2)
Also tracks:
- UFR: Unique Fact Rate (% of facts with df_fact=1)
- RW_UFR: Rarity-Weighted UFR (sum of rarity scores / n_facts)
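A sketch of the depth and UFR computations as defined above (helper names are illustrative; the production aggregation lives in `workers/score.py`):

```python
def depth_score(avg_qualifiers, avg_steps, causal_constraint_rate, median_top50_ivs):
    """Daily creator depth score, composed exactly as documented above."""
    return (0.35 * (avg_qualifiers / 3)
            + 0.25 * (avg_steps / 5)
            + 0.25 * causal_constraint_rate
            + 0.15 * (median_top50_ivs / 2))

def ufr(df_facts: list[int]) -> float:
    """Unique Fact Rate: share of a creator's facts with df_fact=1."""
    return sum(1 for df in df_facts if df == 1) / len(df_facts)
```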
Score = 0.60×SemSim + 0.25×IVS_norm + 0.10×Depth_norm - 0.05×DupPenalty
Diversity: Penalize >2 facts from same creator in top-N results.
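A minimal sketch of the blend plus the diversity rule, assuming candidate dicts with `sem_sim`, `ivs_norm`, `depth_norm`, `dup_penalty`, and `creator_id` fields (the field names and the 0.10 demotion size are assumptions, not the actual `workers/reranker.py` behavior):

```python
def rerank(candidates, max_per_creator=2):
    """Blend the documented signals, then demote over-represented creators."""
    # Weighted blend of the four documented signals.
    for c in candidates:
        c["score"] = (0.60 * c["sem_sim"] + 0.25 * c["ivs_norm"]
                      + 0.10 * c["depth_norm"] - 0.05 * c["dup_penalty"])
    # Diversity pass: penalize facts beyond the per-creator cap.
    per_creator = {}
    for c in sorted(candidates, key=lambda x: x["score"], reverse=True):
        seen = per_creator.get(c["creator_id"], 0)
        if seen >= max_per_creator:
            c["score"] -= 0.10  # demotion size is an assumption
        per_creator[c["creator_id"]] = seen + 1
    return sorted(candidates, key=lambda x: x["score"], reverse=True)
```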
Pool-specific configurations live in `plugins/`:
- `fitness.json` - Fitness, exercise, mobility
- `cooking.json` - Cooking, recipes, culinary
- `diy.json` - DIY, home improvement, maker projects

Each plugin defines:
- `domain_qualifiers`: Pool-specific qualifiers (e.g., `grip_width_cm`, `doneness`, `grit`)
- `salience_patterns`: Positive/negative regex patterns
- `entity_whitelist`: Expected entity types
- `price_allowed`: Whether pricing is allowed in `has_value` facts
See plugins/README.md for details on creating new pool configs.
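To make the shape concrete, here is a hypothetical plugin expressed as a Python dict with the four keys above (all values are invented for illustration; see `plugins/fitness.json` for a real config):

```python
# Hypothetical pool plugin shown as a Python dict; the real configs are
# JSON files in plugins/. All values below are illustrative only.
example_plugin = {
    "domain_qualifiers": ["grip_width_cm", "stance_width", "tempo"],
    "salience_patterns": {
        "positive": [r"\b\d+\s*(reps|sets|seconds|degrees)\b"],
        "negative": [r"\b(subscribe|discount|link in bio)\b"],
    },
    "entity_whitelist": ["exercise", "movement", "body_part", "joint",
                         "muscle", "equipment"],
    "price_allowed": False,
}
```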
Key settings in `.env`:

```
# LLM API keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Deduplication thresholds
SIMHASH_THRESHOLD=3
COSINE_SAME_THRESHOLD=0.92
JACCARD_THRESHOLD=0.65

# Re-ranking weights
RERANK_SEM_SIM_WEIGHT=0.60
RERANK_IVS_WEIGHT=0.25
RERANK_DEPTH_WEIGHT=0.10
MAX_FACTS_PER_CREATOR=2
```

Key tables:
- `creators` - Content creators (handle, platform, lang)
- `sources` - Videos/content (narrative, transcript, metadata)
- `facts` - Extracted facts with 25 predicates
- `fact_fingerprints` - SimHash, MinHash, embeddings
- `fact_scores` - IVS components (rarity, specificity, support_w, trust)
- `creator_metrics_daily` - Daily UFR, RW_UFR, depth scores
- `fact_links` - Relationships (near_variant, contradicts, supports)
- `pool_configs` - Per-pool settings
See database/schema.sql for complete schema and indexes.
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=core --cov=workers --cov=schema --cov-report=html

# Run specific test file
pytest tests/test_salience_gate.py -v

# Run specific test
pytest tests/test_canonicalize.py::TestCanonicalizer::test_quantize_angle -v
```

Tests cover:
- ✅ Salience gate pattern matching
- ✅ Canonicalization (unit normalization, quantization)
- ✅ Fingerprinting (SimHash, MinHash)
- ✅ Schema validation
- ✅ IVS scoring
- ✅ Retrieval re-ranking
- Unique facts/1k tokens: ↑ by 30-50% vs. baseline
- Dup rate: ↑ (dedupe collisions increase)
- Causal/constraint share in top-10: ↑ by 2×
- Answer quality (human eval 1-5): ≥4.0 with evidence citations
- Boring facts dropped: ≥95% (pricing, CTA, logistics)
- SimHash: O(1) hamming distance check
- MinHash: O(k) jaccard estimation (k=100 permutations)
- Embeddings: pgvector HNSW index for ANN search
- Batch processing: Process 500 sources in parallel
- Nightly jobs: Recompute df_fact, IVS, creator metrics
```bash
# Format code
black .

# Lint
flake8 core workers schema

# Type checking
mypy core workers schema
```

```
FactBuilder/
├── core/                      # Core data structures
│   ├── predicates.py          # 25 predicate inventory
│   ├── entity_types.py        # Entity type system
│   └── fingerprints.py        # SimHash, MinHash, embeddings
├── workers/                   # Processing workers
│   ├── salience_gate.py       # Content filtering
│   ├── canonicalize.py        # Unit normalization
│   ├── dedupe.py              # Deduplication logic
│   ├── score.py               # IVS & creator metrics
│   ├── extractor.py           # LLM extraction orchestrator
│   └── reranker.py            # Retrieval re-ranking
├── schema/                    # JSON Schema & validation
│   ├── factset.schema.json    # Strict schema
│   └── validation.py          # Python validation
├── database/                  # Database schema & models
│   ├── schema.sql             # PostgreSQL schema
│   └── models.py              # SQLAlchemy models
├── prompts/                   # LLM prompts
│   ├── extractor.system.txt
│   ├── extractor.user.txt
│   └── claude_tool.json
├── plugins/                   # Pool configurations
│   ├── fitness.json
│   ├── cooking.json
│   ├── diy.json
│   └── README.md
├── tests/                     # Unit tests
└── examples/                  # Usage examples
```
MIT License - see LICENSE file for details.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please ensure:
- Tests pass (`pytest`)
- Code is formatted (`black .`)
- New predicates are NOT added (25 is locked)
- Pool plugins follow the schema in `plugins/README.md`
- OpenAI Structured Outputs: https://platform.openai.com/docs/guides/structured-outputs
- Claude Tool Use: https://docs.anthropic.com/claude/docs/tool-use
- pgvector: https://github.com/pgvector/pgvector
- JSON Schema: https://json-schema.org/draft/2020-12/json-schema-core.html
- Issues: https://github.com/yourusername/factbuilder/issues
- Discussions: https://github.com/yourusername/factbuilder/discussions
- Email: team@example.com
- [x] Core predicate inventory (25 predicates)
- [x] Strict JSON Schema with OpenAI + Claude support
- [x] Salience gate (rule-based)
- [x] Triple-signature deduplication
- [x] IVS scoring & creator metrics
- [x] Retrieval re-ranker
- [x] Per-pool plugins (fitness, cooking, DIY)
- [ ] ML-based salience classifier
- [ ] Contradiction detection
- [ ] Multi-language support
- [ ] FAISS external index
- [ ] Real-time streaming ingestion
- [ ] Web UI for fact browsing
Built with ❤️ for production knowledge extraction