Production-grade knowledge extraction system with 25 locked predicates, strict JSON Schema validation, deduplication, Information Value Score (IVS) ranking, and retrieval re-ranking.
FactBuilder extracts structured, high-value facts from video content (narrative + transcript) using LLMs with constrained outputs, then deduplicates, scores, and ranks them for optimal retrieval quality.
- ✅ 25 locked predicates across 6 fact classes (definition, constraint, mechanism, procedure, measurement, exception)
- ✅ Strict JSON Schema validation (OpenAI Structured Outputs + Claude tool-use)
- ✅ Salience gate - Filters pricing, promos, platform meta, greetings (keeps causality, constraints, procedures, measurements)
- ✅ Triple-signature deduplication (SimHash + MinHash + embeddings)
- ✅ Canonicalization - Normalizes units (°C, seconds, kg, m), quantizes angles/times/ratios
- ✅ Information Value Score (IVS) - `rarity × specificity × support × trust`
- ✅ Creator depth metrics - UFR, RW_UFR, depth score (daily aggregates)
- ✅ Retrieval re-ranker - `0.60×SemSim + 0.25×IVS + 0.10×Depth - 0.05×DupPenalty`
- ✅ Per-pool plugins - Domain-specific qualifiers (e.g., `grip_width_cm` for fitness, `doneness` for cooking)
- Only extract useful knowledge: Causality, constraints, procedural steps, definitions, measurements that matter for execution
- Stable ontology: Fixed predicate set across pools; per-pool qualifiers for domain specifics
- Deterministic structure: Strict schema + constrained decoding
- Early dedupe: Collapse duplicates at ingestion; store support counts, not copies
- Retrieval-time quality bias: Re-rank by rarity, specificity, support, and creator depth
```
┌──────────────────────────────────────────────────────────────────┐
│ VIDEO CONTENT (Narrative + Transcript)                           │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 1. SALIENCE GATE                                                 │
│    - Drop: pricing, promos, platform meta, greetings             │
│    - Keep: causal markers, constraints, procedures, quantities   │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 2. LLM EXTRACTION (OpenAI Structured Outputs / Claude tool-use)  │
│    - 25 predicates across 6 classes                              │
│    - Schema-validated JSON output                                │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 3. CANONICALIZATION                                              │
│    - Normalize units (°F→°C, min→seconds, lbs→kg)                │
│    - Quantize: angles→5°, times<120s→5s, else 15s                │
│    - Create canonical_str for deduplication                      │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 4. DEDUPLICATION (SimHash + MinHash + Embeddings)                │
│    - same_fact: hamming≤3 OR (cosine≥0.92 AND jaccard≥0.65)      │
│    - near_variant: 0.85≤cosine<0.92                              │
│    - Action: increment support_count, link variants              │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 5. IVS SCORING                                                   │
│    IVS = rarity × specificity × support_w × trust                │
│    rarity = ln(1 + N_creators / (1 + df_fact))                   │
│    specificity = 1 + bonuses (numeric, causal, step, condition)  │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 6. CREATOR METRICS (daily)                                       │
│    UFR, RW_UFR, Depth = f(qualifiers, steps, causal rate, IVS)   │
└─────────────────────────────────┬────────────────────────────────┘
                                  │
                                  ▼
┌──────────────────────────────────────────────────────────────────┐
│ 7. RETRIEVAL RE-RANKER                                           │
│    Score = 0.60×SemSim + 0.25×IVS + 0.10×Depth - 0.05×DupPenalty │
│    Diversity: penalize >2 facts/creator in top-N                 │
└──────────────────────────────────────────────────────────────────┘
```
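The unit and quantization rules in step 3 are compact enough to show directly. A minimal sketch, with helper names of our own choosing (the real logic lives in `workers/canonicalize.py`):

```python
def quantize_angle(deg: float) -> int:
    """Snap angles to the nearest 5 degrees (pipeline step 3)."""
    return int(round(deg / 5) * 5)

def quantize_time(seconds: float) -> int:
    """Snap times to 5 s bins under 120 s, 15 s bins otherwise."""
    step = 5 if seconds < 120 else 15
    return int(round(seconds / step) * step)

def to_celsius(fahrenheit: float) -> float:
    """Normalize temperatures to °C before building canonical_str."""
    return round((fahrenheit - 32) * 5 / 9, 1)
```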
- Python 3.9+
- PostgreSQL 12+ with `pgvector` extension
- OpenAI API key or Anthropic API key
```bash
# Clone the repository
git clone https://github.com/yourusername/factbuilder.git
cd factbuilder

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install the pgvector extension in PostgreSQL (the extension is named "vector")
psql -U postgres -d factbuilder -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Create database schema
psql -U postgres -d factbuilder -f database/schema.sql

# Configure environment
cp .env.example .env
# Edit .env with your credentials and API keys
```

```python
from workers.extractor import OpenAIExtractor
import json

# Load pool configuration
with open("plugins/fitness.json") as f:
    fitness_config = json.load(f)

# Initialize extractor
extractor = OpenAIExtractor(
    api_key="your-openai-api-key",
    pool_config=fitness_config,
)

# Extract facts from content
results = extractor.extract_from_source(
    source_id="video_12345",
    creator_id="trainer_abc",
    pool_id=1,  # fitness
    narrative="The trainer demonstrates proper squat form...",
    transcript="Today we're going to focus on squat technique. "
               "You must keep your knees aligned over your toes at 90 degrees...",
    lang="en",
    duration_s=300,
)

# Access results
print(f"Extracted {results['stats']['facts_unique']} unique facts")
print(f"Dropped {results['stats']['spans_dropped']} boring spans")
print(f"Average IVS: {results['stats']['avg_ivs']:.3f}")

# Iterate over facts
for fact in results['facts']:
    print(f"  - {fact['predicate']}: {fact['subject_text']}")
```

```python
from workers.extractor import ClaudeExtractor

extractor = ClaudeExtractor(
    api_key="your-anthropic-api-key",
    pool_config=fitness_config,
)
# Rest is the same...
```

A) Definition/Taxonomy
- `is_a` - Taxonomy relationships ("Deadlift is an exercise")
- `aka` - Aliases ("Paneer is also called cottage cheese")
- `part_of` - Component-whole ("Quadriceps is part of the leg")
- `requires` - Prerequisites ("Pistol squat requires ankle mobility")

B) Constraint/Rule
- `must` - Hard requirements ("You must keep your back straight")
- `should` - Recommendations ("You should warm up first")
- `avoid` - Things to avoid ("Avoid rounding your spine")
- `contraindicated_for` - Safety ("Contraindicated for pregnancy")

C) Mechanism/Causality
- `causes` - Causal relationships ("Hip hinge causes glute activation")
- `prevents` - Prevention ("Proper form prevents injury")
- `increases` - Increases a metric ("Heavy load increases strength")
- `reduces` - Reduces a metric ("Rest reduces fatigue")
- `works_by` - Mechanism ("Works by engaging the posterior chain")

D) Procedure/Steps
- `has_step` - Sequential steps ("Step 1: Position feet shoulder-width")
- `precondition` - Pre-conditions ("Ensure bar is racked securely")
- `postcondition` - Post-conditions ("Bar should be at hip level")

E) Measurement/Quantification
- `has_value` - Specific values ("Temperature is 180°C")
- `has_range` - Ranges ("Between 50-100 reps")
- `angle` - Angular measurements ("90 degrees at the knee")
- `time` - Time durations ("Hold for 30 seconds")
- `temperature` - Temperatures ("Bake at 180°C")
- `reps` - Exercise reps ("3 sets of 10 reps")
- `ratio` - Ratios ("Coffee to water ratio of 1:15")

F) Exceptions/Warnings
- `warning` - Safety warnings ("Risk of shoulder impingement")
- `failure_mode` - Common failures ("Knees cave inward")
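For orientation, here is one way the locked inventory could be laid out in code. This is an illustrative sketch, not the actual contents of `core/predicates.py`:

```python
# Hypothetical layout of the locked inventory; the real definitions
# live in core/predicates.py.
PREDICATES_BY_CLASS = {
    "definition": {"is_a", "aka", "part_of", "requires"},
    "constraint": {"must", "should", "avoid", "contraindicated_for"},
    "mechanism": {"causes", "prevents", "increases", "reduces", "works_by"},
    "procedure": {"has_step", "precondition", "postcondition"},
    "measurement": {"has_value", "has_range", "angle", "time",
                    "temperature", "reps", "ratio"},
    "exception": {"warning", "failure_mode"},
}

ALL_PREDICATES = frozenset().union(*PREDICATES_BY_CLASS.values())
assert len(ALL_PREDICATES) == 25  # the set is locked; never extend it
```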
Global types include: concept, category, pose, exercise, technique, procedure, step, movement, body_part, joint, muscle, ingredient, food, mixture, tool, equipment, device, software, dataset, model, method, metric, parameter, measurement, condition, symptom, risk, benefit, population, product, brand, organization, person, location, environment, material, chemical, recipe, event, other.
Per-pool plugins can add domain-specific qualifiers (see plugins/README.md).
Three fingerprinting methods:
- SimHash (64-bit): Fast hamming distance check (threshold: 3 bits)
- MinHash (100 permutations): Jaccard similarity estimation (threshold: 0.65)
- Embeddings (1024-d): Cosine similarity (same: ≥0.92, variant: 0.85-0.92)
Decision:
- same_fact: `hamming≤3` OR (`cosine≥0.92` AND `jaccard≥0.65`) → increment `support_count`
- near_variant: `0.85≤cosine<0.92` → insert + link with `relation="near_variant"`
- distinct: Insert as new fact
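A minimal sketch of that decision rule, assuming the three similarity signals are already computed (the function name is illustrative; the real logic lives in `workers/dedupe.py`):

```python
def classify_pair(hamming: int, cosine: float, jaccard: float) -> str:
    """Apply the documented thresholds to a candidate fact pair."""
    if hamming <= 3 or (cosine >= 0.92 and jaccard >= 0.65):
        return "same_fact"     # increment support_count on the canonical fact
    if 0.85 <= cosine < 0.92:
        return "near_variant"  # insert, then link with relation="near_variant"
    return "distinct"          # insert as a new fact
```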
IVS = rarity × specificity × support_w × trust

Where:
- rarity = `ln(1 + N_creators / (1 + df_fact))` - Higher for rarer facts
- specificity = `1 + bonuses` (0.20×has_numeric, 0.20×has_condition, 0.25×is_causal/constraint, 0.15×has_step, 0.10×named_entity)
- support_w = `min(1.0 + 0.1×(support_count-1), 1.2)` - Boost for multiple sightings
- trust = `clip(asr_confidence × narration_quality, 0.6, 1.0)` - Confidence in extraction
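Put together, the score is a straight product of the four components. A sketch under the formulas above (the function signature is our own; see `workers/score.py` for the real implementation):

```python
import math

def ivs(n_creators, df_fact, bonuses, support_count, asr_conf, narr_quality):
    """Compose the four IVS components exactly as documented above.

    `bonuses` is the summed specificity bonus (0.20*has_numeric + ...).
    """
    rarity = math.log(1 + n_creators / (1 + df_fact))
    specificity = 1 + bonuses
    support_w = min(1.0 + 0.1 * (support_count - 1), 1.2)
    trust = min(max(asr_conf * narr_quality, 0.6), 1.0)
    return rarity * specificity * support_w * trust
```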
Computed daily per pool:
Depth = 0.35×(avg_qualifiers/3) + 0.25×(avg_steps/5) + 0.25×causal_constraint_rate + 0.15×(median_top50_ivs/2)
Also tracks:
- UFR: Unique Fact Rate (% of facts with df_fact=1)
- RW_UFR: Rarity-Weighted UFR (sum of rarity scores / n_facts)
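A sketch of the depth and UFR computations as defined above (helper names are illustrative; the production aggregation lives in `workers/score.py`):

```python
def depth_score(avg_qualifiers, avg_steps, causal_constraint_rate, median_top50_ivs):
    """Daily creator depth score, composed exactly as documented above."""
    return (0.35 * (avg_qualifiers / 3)
            + 0.25 * (avg_steps / 5)
            + 0.25 * causal_constraint_rate
            + 0.15 * (median_top50_ivs / 2))

def ufr(df_facts: list[int]) -> float:
    """Unique Fact Rate: share of a creator's facts with df_fact=1."""
    return sum(1 for df in df_facts if df == 1) / len(df_facts)
```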
Score = 0.60×SemSim + 0.25×IVS_norm + 0.10×Depth_norm - 0.05×DupPenalty
Diversity: Penalize >2 facts from same creator in top-N results.
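A minimal sketch of the blend plus the diversity rule, assuming candidate dicts with `sem_sim`, `ivs_norm`, `depth_norm`, `dup_penalty`, and `creator_id` fields (the field names and the 0.10 demotion size are assumptions, not the actual `workers/reranker.py` behavior):

```python
def rerank(candidates, max_per_creator=2):
    """Blend the documented signals, then demote over-represented creators."""
    # Weighted blend of the four documented signals.
    for c in candidates:
        c["score"] = (0.60 * c["sem_sim"] + 0.25 * c["ivs_norm"]
                      + 0.10 * c["depth_norm"] - 0.05 * c["dup_penalty"])
    # Diversity pass: penalize facts beyond the per-creator cap.
    per_creator = {}
    for c in sorted(candidates, key=lambda x: x["score"], reverse=True):
        seen = per_creator.get(c["creator_id"], 0)
        if seen >= max_per_creator:
            c["score"] -= 0.10  # demotion size is an assumption
        per_creator[c["creator_id"]] = seen + 1
    return sorted(candidates, key=lambda x: x["score"], reverse=True)
```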
Pool-specific configurations live in `plugins/`:
- `fitness.json` - Fitness, exercise, mobility
- `cooking.json` - Cooking, recipes, culinary
- `diy.json` - DIY, home improvement, maker projects

Each plugin defines:
- `domain_qualifiers`: Pool-specific qualifiers (e.g., `grip_width_cm`, `doneness`, `grit`)
- `salience_patterns`: Positive/negative regex patterns
- `entity_whitelist`: Expected entity types
- `price_allowed`: Whether pricing is allowed in `has_value` facts
See plugins/README.md for details on creating new pool configs.
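To make the shape concrete, here is a hypothetical plugin expressed as a Python dict with the four keys above (all values are invented for illustration; see `plugins/fitness.json` for a real config):

```python
# Hypothetical pool plugin shown as a Python dict; the real configs are
# JSON files in plugins/. All values below are illustrative only.
example_plugin = {
    "domain_qualifiers": ["grip_width_cm", "stance_width", "tempo"],
    "salience_patterns": {
        "positive": [r"\b\d+\s*(reps|sets|seconds|degrees)\b"],
        "negative": [r"\b(subscribe|discount|link in bio)\b"],
    },
    "entity_whitelist": ["exercise", "movement", "body_part", "joint",
                         "muscle", "equipment"],
    "price_allowed": False,
}
```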
Key settings in `.env`:

```
# LLM API keys
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

# Deduplication thresholds
SIMHASH_THRESHOLD=3
COSINE_SAME_THRESHOLD=0.92
JACCARD_THRESHOLD=0.65

# Re-ranking weights
RERANK_SEM_SIM_WEIGHT=0.60
RERANK_IVS_WEIGHT=0.25
RERANK_DEPTH_WEIGHT=0.10
MAX_FACTS_PER_CREATOR=2
```

Key tables:
- `creators` - Content creators (handle, platform, lang)
- `sources` - Videos/content (narrative, transcript, metadata)
- `facts` - Extracted facts with 25 predicates
- `fact_fingerprints` - SimHash, MinHash, embeddings
- `fact_scores` - IVS components (rarity, specificity, support_w, trust)
- `creator_metrics_daily` - Daily UFR, RW_UFR, depth scores
- `fact_links` - Relationships (near_variant, contradicts, supports)
- `pool_configs` - Per-pool settings
See database/schema.sql for complete schema and indexes.
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=core --cov=workers --cov=schema --cov-report=html

# Run specific test file
pytest tests/test_salience_gate.py -v

# Run specific test
pytest tests/test_canonicalize.py::TestCanonicalizer::test_quantize_angle -v
```

Tests cover:
- ✅ Salience gate pattern matching
- ✅ Canonicalization (unit normalization, quantization)
- ✅ Fingerprinting (SimHash, MinHash)
- ✅ Schema validation
- ✅ IVS scoring
- ✅ Retrieval re-ranking
- Unique facts/1k tokens: ↑ by 30-50% vs. baseline
- Dup rate: ↑ (dedupe collisions increase)
- Causal/constraint share in top-10: ↑ by 2×
- Answer quality (human eval 1-5): ≥4.0 with evidence citations
- Boring facts dropped: ≥95% (pricing, CTA, logistics)
- SimHash: O(1) hamming distance check
- MinHash: O(k) jaccard estimation (k=100 permutations)
- Embeddings: pgvector HNSW index for ANN search
- Batch processing: Process 500 sources in parallel
- Nightly jobs: Recompute df_fact, IVS, creator metrics
```bash
# Format code
black .

# Lint
flake8 core workers schema

# Type checking
mypy core workers schema
```

```
FactBuilder/
├── core/                      # Core data structures
│   ├── predicates.py          # 25 predicate inventory
│   ├── entity_types.py        # Entity type system
│   └── fingerprints.py        # SimHash, MinHash, embeddings
├── workers/                   # Processing workers
│   ├── salience_gate.py       # Content filtering
│   ├── canonicalize.py        # Unit normalization
│   ├── dedupe.py              # Deduplication logic
│   ├── score.py               # IVS & creator metrics
│   ├── extractor.py           # LLM extraction orchestrator
│   └── reranker.py            # Retrieval re-ranking
├── schema/                    # JSON Schema & validation
│   ├── factset.schema.json    # Strict schema
│   └── validation.py          # Python validation
├── database/                  # Database schema & models
│   ├── schema.sql             # PostgreSQL schema
│   └── models.py              # SQLAlchemy models
├── prompts/                   # LLM prompts
│   ├── extractor.system.txt
│   ├── extractor.user.txt
│   └── claude_tool.json
├── plugins/                   # Pool configurations
│   ├── fitness.json
│   ├── cooking.json
│   ├── diy.json
│   └── README.md
├── tests/                     # Unit tests
└── examples/                  # Usage examples
```
MIT License - see LICENSE file for details.
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit changes (`git commit -m 'Add amazing feature'`)
- Push to branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Please ensure:
- Tests pass (`pytest`)
- Code is formatted (`black .`)
- New predicates are NOT added (25 is locked)
- Pool plugins follow the schema in `plugins/README.md`
- OpenAI Structured Outputs: https://platform.openai.com/docs/guides/structured-outputs
- Claude Tool Use: https://docs.anthropic.com/claude/docs/tool-use
- pgvector: https://github.com/pgvector/pgvector
- JSON Schema: https://json-schema.org/draft/2020-12/json-schema-core.html
- Issues: https://github.com/yourusername/factbuilder/issues
- Discussions: https://github.com/yourusername/factbuilder/discussions
- Email: team@example.com
- [x] Core predicate inventory (25 predicates)
- [x] Strict JSON Schema with OpenAI + Claude support
- [x] Salience gate (rule-based)
- [x] Triple-signature deduplication
- [x] IVS scoring & creator metrics
- [x] Retrieval re-ranker
- [x] Per-pool plugins (fitness, cooking, DIY)
- [ ] ML-based salience classifier
- [ ] Contradiction detection
- [ ] Multi-language support
- [ ] FAISS external index
- [ ] Real-time streaming ingestion
- [ ] Web UI for fact browsing
Built with ❤️ for production knowledge extraction