
KOI Processor v2

🚀 Production-Ready Knowledge Organization Infrastructure Pipeline

Provenance System Enhanced (Sept 27, 2025): Complete parent-child document relationships with full URL preservation through the provenance chain.

Text Extraction Fixed (Sept 14, 2025): The pipeline now processes 100% clean, uncorrupted text after fixes to sensor extraction.

A comprehensive sensor-to-agent pipeline that processes real-time content from KOI sensors, generates embeddings, handles deduplication and versioning, and provides immediate semantic search capabilities for AI agents.


Overview

The KOI Processor is the central processing hub of the Knowledge Organization Infrastructure (KOI) ecosystem. It receives events from distributed sensors, processes content into searchable embeddings, and makes knowledge immediately available to AI agents through semantic search.

What's New in v2

  • RID-based Deduplication: Prevents duplicate content ingestion

  • Version Control: Tracks content updates with full audit trail

  • CAT Receipts: Complete provenance tracking for all transformations

  • Isolated Tables: Separates sensor data from scraped content

  • Parent-Child Documents: Hierarchical relationships for forum topics and posts

  • Provenance API: Navigate document lineage via /api/koi/graph/provenance/{rid} (see the client sketch after this list)

  • Enhanced Curators: Daily/Weekly content curation with proper URL attribution

  • Production Embeddings: Model-agnostic embedding server (currently BGE-large-en-v1.5)

  • MCP Integration: Semantic search via Model Context Protocol

  • Content Operations Dashboard: Web-based monitoring for Daily Bot and Weekly Digest

  • Daily Content Curator: LLM-enhanced daily X posts with comprehensive ledger integration

  • Weekly Aggregator: AI-powered weekly digest with on-chain activity tracking

  • Code Graph Service: Automatic code entity extraction into Apache AGE graph

  • Knowledge Graph Quality Improvement (Dec 2025): Modular post-processing pipeline with regression suite and 99.7% quality (up from 62%)
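
As an example of the new surface area, the provenance endpoint can be queried directly. A minimal client sketch: the route comes from this README, while the base URL (the Event Bridge host is assumed here) and the response shape are assumptions:

import requests

BASE = "http://localhost:8100"  # assumed host; substitute the service exposing /api/koi

def get_provenance(rid):
    """Fetch the provenance chain for a document RID (JSON response shape assumed)."""
    resp = requests.get(f"{BASE}/api/koi/graph/provenance/{rid}", timeout=10)
    resp.raise_for_status()
    return resp.json()

chain = get_provenance("sensor.website.example.com.page1")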

🎯 Knowledge Graph Quality (Dec 2025)

Status: ✅ PRODUCTION DEPLOYMENT v1.1 | Quality: 99.7% | Tests: KG regression suite passing | Dedup: 70.10%

Comprehensive three-phase quality improvement project completed successfully:

Phase 1-2: Quality Filters & Pipeline

  • 62% → 99.7% quality improvement
  • Modular Pipeline Framework: 6 operational modules (ConfidenceFilter, DocumentLevelDeduplicator, CanonicalResolver, OntologyNormalizer, ListSplitter, EntityQualityFilter)
  • Regression Suite: Targeted tests for KG stability
  • Production Deployment: Zero errors, < 1% performance overhead

Phase 3: Cross-Document Deduplication ✅ COMPLETE

  • Entity Deduplication System with three-tier waterfall (sketched after this list):
    • Tier 1 (Exact): B-Tree index match (~microseconds) - 58.0% hit rate
    • Tier 2 (Semantic): pgvector HNSW + OpenAI embeddings (~milliseconds) - 10.6% hit rate
    • Tier 3 (New): Insert new entities - 31.4% new entities
  • Production Stats (as of 2025-12-11, pre-Stage 6 re-extraction):
    • 12,985 unique entities from 43,430 raw entity mentions
    • 70.10% deduplication rate (target: 65-75%)
    • 64,925 RDF triples (Fuseki knowledge graph deployed)
    • Zero type collisions (all type mismatches resolved)
    • Zero placeholder entities (all "Unknown"/"Anonymous" removed)
    • Zero errors in production
  • Quality Improvements:
    • JIRA IDs: 509 → 0 (100% eliminated)
    • Template text: 444 → 0 (100% eliminated)
    • Chunk repetition: 95% reduction
    • Type consolidation: 678 entities merged
  • Code Quality: A+ grade (expert reviewed)
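
The waterfall logic itself is straightforward. A minimal, self-contained sketch (not the actual entity_resolver.py): in-memory dicts stand in for the B-Tree index and the pgvector HNSW index, and the 0.9 similarity threshold is an illustrative assumption:

import math

registry = {}  # canonical name -> entity id (stands in for the B-Tree index)
vectors = {}   # entity id -> embedding (stands in for the pgvector HNSW index)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def resolve_entity(name, embedding, threshold=0.9):
    """Three-tier waterfall: exact match -> semantic match -> insert new."""
    # Tier 1 (Exact): direct name lookup.
    if name in registry:
        return registry[name], "exact"
    # Tier 2 (Semantic): nearest neighbor by cosine similarity.
    best = max(vectors, key=lambda eid: cosine(vectors[eid], embedding), default=None)
    if best is not None and cosine(vectors[best], embedding) >= threshold:
        return best, "semantic"
    # Tier 3 (New): register a new canonical entity.
    eid = len(registry) + 1
    registry[name] = eid
    vectors[eid] = embedding
    return eid, "new"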

Stage 6 Re-Extraction + Code Bridge (Dec 2025)

Stage 6 rebuilds the semantic KG from a docs-only corpus (Notion + Discourse + Website + GitHub/GitLab markdown), using Gemini extraction and PostgreSQL as the authoritative store. Fuseki is rebuilt from PostgreSQL after completion.

Code Bridge makes the semantic KG and the code graph joinable:

  • koi_code_artifacts: canonical code entities (exported from code graph provenance)
  • koi_doc_code_links: doc → code links (MENTIONS edges preserved)
  • entity_registry.metadata.code_uri: entity-level code links (post-Stage 6)
  • AGE stub sync for single-query access to semantic anchors

Key scripts:

  • scripts/reextraction/stage6_canary_gemini.py
  • scripts/reextraction/stage6_full_reextract_gemini.py
  • scripts/code_bridge/export_code_artifacts.py
  • scripts/code_bridge/link_docs_to_code.py
  • scripts/code_bridge/link_entities_to_code.py
  • scripts/code_bridge/sync_stubs_to_age.py

Quick Start:

from knowledge_graph.graph_integration import KnowledgeGraphIntegrator

# With pipeline and deduplication
kg = KnowledgeGraphIntegrator(
    use_pipeline=True,
    use_entity_resolver=True  # Enable deduplication
)
valid_entities = kg.process_entities_batch(entities)



Project Structure

koi-processor/
├── src/                    # Source code
│   ├── core/              # Core KOI processing
│   │   ├── koi_event_bridge_v2.py
│   │   ├── koi_types.py
│   │   ├── koi_permissions_api.py
│   │   └── bge_server.py
│   ├── content/           # Content generation & monitoring
│   │   ├── content_dashboard.py
│   │   ├── daily_curator_llm.py
│   │   ├── weekly_curator_llm.py
│   │   └── quality_control.py
│   ├── audio/             # Podcast generation
│   │   ├── audio_pipeline_enhanced.py
│   │   └── podcast_integration.py
│   ├── knowledge_graph/   # Knowledge graph & entity processing
│   │   ├── postprocessing/
│   │   │   ├── pipeline.py          # Pipeline framework
│   │   │   └── modules/             # Processing modules
│   │   ├── entity_resolver.py       # Entity deduplication (3-tier)
│   │   ├── uri_generator.py         # Deterministic URI generation
│   │   └── graph_integration.py     # Fuseki integration
│   ├── services/          # External service integrations
│   │   ├── regen_ledger.py
│   │   └── regen_ledger_comprehensive.py
│   └── utils/             # Utilities & helpers
├── scripts/               # Operational scripts
│   ├── setup.sh          # One-command setup
│   ├── run_migrations.sh # Database migrations
│   ├── run_daily_curator.py
│   └── code_bridge/      # Code/semantic bridge scripts
├── migrations/            # Database migrations
├── docs/                  # Documentation
├── tests/                 # Test suite (use targeted KG regression subset)
├── prompts/               # Project planning & documentation
│   ├── PROMPT_1-23_*.md  # Active prompts
│   └── archive/          # Superseded prompts
├── config/                # Configuration files
├── static/                # Dashboard static files
├── templates/             # Dashboard templates
└── requirements.txt       # Python dependencies

Architecture

DATA INGESTION PIPELINE:
┌─────────────┐     ┌──────────────┐     ┌──────────────┐
│ KOI Sensors │────▶│ Coordinator  │────▶│ Event Bridge │
│  (Various)  │     │  (Port 8005) │     │  (Port 8100) │
└─────────────┘     └──────────────┘     └──────────────┘
                                                   │
                                    ┌──────────────┴──────────────┐
                                    │                             │
                                    ▼                             ▼
                            ┌──────────────┐            ┌──────────────────┐
                            │  Embedding   │            │ Entity Extractor │
                            │   Server     │            │  (JSON-LD/RDF)   │
                            │  (Port 8090) │            │    [PLANNED]     │
                            └──────────────┘            └──────────────────┘
                                    │                              │
                                    ▼                              ▼
                            ┌──────────────┐            ┌──────────────────┐
                            │ PostgreSQL   │            │  Apache Jena     │
                            │  (pgvector)  │            │    Fuseki        │
                            │ • Embeddings │            │  (Port 3030)     │
                            └──────────────┘            │ • RDF Triples    │
                                                        │ • Ontologies     │
                                                        └──────────────────┘

QUERY/ACCESS LAYER:
                            ┌──────────────────────────────────────┐
                            │      Hybrid RAG API (Port 8301)      │
                            │  • Reciprocal Rank Fusion (RRF)      │
                            │  • BGE Semantic Search               │
                            │  • Adaptive Extraction Triggers      │
                            └──────────────────────────────────────┘
                                              │
                            ┌─────────────────┴──────────────────┐
                            │                                     │
                            ▼                                     ▼
                   ┌──────────────────┐             ┌──────────────────┐
                   │  MCP Knowledge   │             │   GAIA React     │
                   │  Server          │             │   Frontend       │
                   │  (Port 8200)     │             │ (Port 3000 web)  │
                   └──────────────────┘             └──────────────────┘

STORAGE & AGENT LAYER:
                            ┌───────────────────────────────────┐
                            │         PostgreSQL                │
                            │  • koi_memories (KOI knowledge)   │
                            │  • koi_embeddings (pgvector)      │
                            │  • memories (agent state)         │
                            │  • conversations (agent history)  │
                            └───────────────────────────────────┘
                                               │
                                               ▼
                            ┌─────────────────────────────────┐
                            │         Eliza Agents            │
                            │        (5 AI Agents)            │
                            │                                 │
                            │ • Direct SQL for state          │
                            │ • MCP tools for external data   │
                            └─────────────────────────────────┘
                                     ▲             ▲
                                     │             │
                        ┌────────────┴───┐    ┌────┴──────────┐
                        │ Knowledge MCP  │    │  Regen MCP    │
                        │     Server     │    │    Server     │
                        │                │    │               │
                        │ Routes to:     │    │ Connects to:  │
                        │ • PostgreSQL   │    │ • Regen       │
                        │   (pgvector)   │    │   Ledger      │
                        │ • Apache Jena  │    │ • Blockchain  │
                        │   Fuseki       │    │   data        │
                        └────────────────┘    └───────────────┘
                                ▲                     ▲
                                │                     │
                        ┌───────┴────────┐            │
                        │                │            │
                  PostgreSQL      Apache Jena    Regen Ledger
                  (pgvector)        Fuseki       (Blockchain)

Component Description

Data Ingestion Pipeline:

  1. KOI Sensors: Monitor websites, documents, and other sources

  2. KOI Coordinator (port 8005): Routes events to the processing pipeline

  3. KOI Event Bridge v2 (port 8100): Distributes content to processors

    • Handles deduplication, versioning, chunking
    • Routes to both embedding and entity extraction paths
  4. Embedding Server (port 8090): Generates semantic embeddings

    • Currently using BAAI/bge-large-en-v1.5 (1024 dimensions)
    • Model-agnostic API allows swapping to other models
    • Stores embeddings in PostgreSQL pgvector
  5. Entity Extractor (PLANNED): Extracts structured data

    • Processes content into JSON-LD/RDF format
    • Extracts entities, relationships, and ontological information
    • Uses unified metabolic ontology (36 classes)
    • Loads RDF triples directly into Apache Jena

Storage Layer:

  1. PostgreSQL: Dual-purpose database

    • Stores KOI knowledge (koi_memories, koi_embeddings with pgvector)
    • Stores agent state (memories, conversations, relationships)
  2. Apache Jena Fuseki (port 3030): SPARQL triplestore

    • Stores RDF triples and OWL ontologies
    • Populated by Entity Extractor (when implemented)
    • Handles complex ontological/semantic reasoning queries

Query/Access Layer:

  1. Knowledge MCP Server: KOI knowledge query API for agents

    • Routes semantic searches to PostgreSQL pgvector
    • Routes ontological queries to Apache Jena Fuseki
    • Provides unified knowledge interface via stdio transport
  2. Regen MCP Server: Blockchain data API for agents

    • Connects to Regen Ledger blockchain
    • Provides access to on-chain data (carbon credits, ecological state, etc.)
    • Handles blockchain queries and transactions
    • Separate from knowledge infrastructure
  3. Eliza Agents: Three connection patterns

    • Direct PostgreSQL: For agent state, conversations, memories
    • Via Knowledge MCP: For KOI knowledge queries (embeddings and ontologies)
    • Via Regen MCP: For blockchain/ledger queries

Key Features

🔄 Deduplication & Versioning

  • RID-based tracking: Each document has a unique Resource Identifier
  • Version control: UPDATE events create new versions, preserving history
  • Audit trail: Complete provenance tracking with CAT receipts

🧬 Smart Processing

  • Intelligent chunking: 1000 chars with 200 char overlap (see the sketch after this list)
  • Multi-format support: Handles JSON, HTML, plain text
  • Event types: NEW, UPDATE, FORGET with appropriate handling
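
A minimal sketch of that chunking policy; the production pipeline may additionally respect sentence or paragraph boundaries, so this shows only the size/overlap arithmetic:

def chunk_text(text, size=1000, overlap=200):
    """Split text into fixed-size chunks; each chunk overlaps the previous by `overlap` chars."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # with the defaults, each new chunk starts 800 chars after the previous one
    return [text[i:i + size] for i in range(0, len(text), step)]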

🔍 Semantic Search

  • Production embeddings: State-of-the-art semantic vectors
  • MCP integration: Standard protocol for agent tool use
  • Permission filtering: Agent-specific content access control

📊 Isolated Storage

  • Dual-table pattern: koi_memories for source documents, memories for chunked content
  • No contamination: Clean separation of data sources
  • Migration support: Gradual transition from legacy systems
  • Full documentation: See STORAGE_ARCHITECTURE.md for details

🗓️ RDF Date Enrichment (publishedAt)

  • Export publication dates from DB to JSON mapping:
    • node scripts/export_published_map.js (uses POSTGRES_URL, writes to src/core/published_map.json by default)
  • Refine RDF graph with regx:publishedAt and load into Jena:
    • bash scripts/refine_with_published.sh (uses CONSOLIDATION_PATH, PUBLISHED_MAP_PATH, JENA_DATA_ENDPOINT)
  • Enables SPARQL date gating in MCP when present.
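
A hedged sketch of such a date-gated query, issued from Python against the Fuseki endpoint used elsewhere in this README; the regx: namespace IRI below is a placeholder assumption, not the project's actual one:

import requests

QUERY = """
PREFIX regx: <http://example.org/regx#>  # placeholder IRI; use the project's namespace
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
SELECT ?doc ?date WHERE {
  ?doc regx:publishedAt ?date .
  FILTER(?date >= "2025-01-01T00:00:00Z"^^xsd:dateTime)
} LIMIT 10
"""

resp = requests.post(
    "http://localhost:3030/koi/sparql",
    data=QUERY,
    headers={"Content-Type": "application/sparql-query",
             "Accept": "application/sparql-results+json"},
)
print(resp.json()["results"]["bindings"])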

Quick Start

Prerequisites

  • Python 3.8+
  • PostgreSQL with pgvector extension (or Docker)
  • 4GB+ RAM recommended

Installation

Setup

Quick Start (No Google Drive)

./scripts/setup.sh

With Google Drive Integration

  1. Run setup:

    ./scripts/setup.sh
  2. Configure OAuth (see config/README.md):

    cp /path/to/client_secret.json config/client_secret.json
  3. Service account setup (for automated ingestion):

    • Share Google Drive folders with: rag-ingestion-bot@koi-sensor.iam.gserviceaccount.com
    • Grant "Viewer" access

OAuth Endpoints

Once configured, users authenticate at:

  • Auth URL: https://your-domain.com/api/koi/auth/initiate
  • Callback: https://your-domain.com/api/koi/auth/callback
  • Status: https://your-domain.com/api/koi/auth/status

Production Setup

  1. Clone and setup:
cd /opt/projects/koi-processor
git pull origin regen-prod
bash scripts/setup.sh
  2. Configure environment:
cp .env.example .env
# Edit .env and add your OPENAI_API_KEY (required for podcast generation)
  3. Run migrations:
bash scripts/run_migrations_with_backup.sh
  4. Start services:
# Start core KOI services
bash scripts/start_all_services.sh

# Start Hybrid RAG API (required for semantic search)
cd /opt/projects/koi-processor && bun koi-query-api.ts &

# Start MCP Knowledge Server (for agent access)
source venv/bin/activate && python3 src/core/koi_knowledge_mcp_server.py &

Service Ports:

  • Port 8005: KOI Coordinator
  • Port 8090: BGE Embedding Server
  • Port 8100: Event Bridge
  • Port 8200: MCP Knowledge Server
  • Port 8301: Hybrid RAG API (semantic search)
  • Port 8400: Content Dashboard
  5. Access dashboard: Open http://localhost:8400 in your browser

Web Access Setup (HTTPS)

To set up HTTPS access at https://regen.gaiaai.xyz/digests:

# Run the nginx setup script
sudo bash /opt/projects/koi-processor/setup_nginx_digests.sh

This will:

  • Install nginx (if needed)
  • Configure SSL with Let's Encrypt
  • Set up proxy from https://regen.gaiaai.xyz/digests to localhost:8400
  • Enable WebSocket support for real-time updates

After setup, the dashboard will be accessible at: https://regen.gaiaai.xyz/digests

Fresh Installation

# Clone the repository
git clone https://github.com/yourusername/koi-processor.git
cd koi-processor

# Run the setup script
chmod +x scripts/setup.sh
./scripts/setup.sh

The setup script will:

  • ✅ Check Python version
  • ✅ Create virtual environment
  • ✅ Install all dependencies
  • ✅ Set up PostgreSQL (optionally with Docker)
  • ✅ Run database migrations
  • ✅ Create configuration files
  • ✅ Set up the monitoring dashboard

Running the System

1. Start the Monitoring Dashboard

source venv/bin/activate
python src/content/content_dashboard.py
# Open http://localhost:8400 in your browser

2. Run Daily Content Curator

source venv/bin/activate
python scripts/run_daily_curator.py

3. Check System Status

source venv/bin/activate
python scripts/run_daily_curator.py status

Advanced Setup

Manual Installation

If you prefer to set up components manually:

Database

# Create database
createdb -U postgres eliza

# Enable pgvector extension  
psql -U postgres -d eliza -c "CREATE EXTENSION IF NOT EXISTS vector;"

# Run migrations individually
psql -U postgres -d eliza < migrations/001_create_transformation_receipts.sql
psql -U postgres -d eliza < migrations/002_create_agent_knowledge_permissions.sql
psql -U postgres -d eliza < migrations/003_create_isolated_koi_tables.sql
psql -U postgres -d eliza < migrations/004_add_publication_dates.sql
psql -U postgres -d eliza < migrations/005_create_dashboard_tables.sql

Embedding Server (Optional)

# For testing/development
python src/core/bge_server.py

# For production (requires GPU)
# See bge_server_real.py for Hugging Face implementation

Apache Jena Fuseki (Optional)

# Download and extract Fuseki
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.10.0.tar.gz
tar -xzf apache-jena-fuseki-4.10.0.tar.gz
cd apache-jena-fuseki-4.10.0

# Start Fuseki server
./fuseki-server --loc=/path/to/data --port=3030 /koi

# Or use Docker
docker run -p 3030:3030 -e ADMIN_PASSWORD=admin stain/jena-fuseki

Knowledge MCP Server Setup

cd bge-mcp-ts
bun install
bun run bge-server.ts

Regen MCP Server Setup (Optional)

# See separate Regen MCP repository for blockchain integration
# https://github.com/yourusername/regen-mcp-server

Configuration

Environment Variables

Create a .env file in the project root:

# Database
POSTGRES_URL=postgresql://postgres:postgres@localhost:5433/eliza

# Embedding Server
BGE_API_URL=http://localhost:8090/encode

# Event Bridge Configuration
USE_ISOLATED_TABLES=true  # Use new deduplication tables
KOI_COORDINATOR_URL=http://localhost:8005

# MCP Server (optional)
MCP_SERVER_PORT=3000

# Logging
LOG_LEVEL=INFO

Service Ports

  • 8005: KOI Coordinator
  • 8090: Embedding Server
  • 8100: KOI Event Bridge
  • 8200: MCP Knowledge Server
  • 3000: MCP Server (stdio transport)
  • 3030: Apache Jena Fuseki SPARQL endpoint

Usage

Starting Services

1. Start Embedding Server

python src/core/bge_server.py  # Currently using BGE model
# Server will run on http://localhost:8090

2. Start Event Bridge v2

USE_ISOLATED_TABLES=true python src/core/koi_event_bridge_v2.py
# Server will run on http://localhost:8100

3. Start Apache Jena Fuseki

./fuseki-server --loc=/path/to/data --port=3030 /koi
# SPARQL endpoint will be at http://localhost:3030/koi

4. Start Knowledge MCP Server

cd bge-mcp-ts
bun run bge-server.ts
# Knowledge MCP server handles query routing to PostgreSQL and Apache Jena

5. Start Regen MCP Server (if needed)

# See Regen MCP repository for setup
# Provides blockchain data access to agents

Sending Events

NEW Event (First time content)

curl -X POST http://localhost:8100/process-koi-event \
  -H "Content-Type: application/json" \
  -d '{
    "event_type": "NEW",
    "source_sensor": "website_monitor",
    "timestamp": "2025-09-09T12:00:00Z",
    "bundle": {
      "rid": "sensor.website.example.com.page1",
      "cid": "bafyreiabc123...",
      "content": {
        "text": "This is the content to be processed..."
      },
      "metadata": {
        "title": "Example Page",
        "url": "https://example.com/page1"
      },
      "manifest": {
        "version": "1.0.0"
      }
    }
  }'

UPDATE Event (Content changed)

curl -X POST http://localhost:8100/process-koi-event \
  -H "Content-Type: application/json" \
  -d '{
    "event_type": "UPDATE",
    "source_sensor": "website_monitor",
    "timestamp": "2025-09-09T13:00:00Z",
    "bundle": {
      "rid": "sensor.website.example.com.page1",
      "cid": "bafyreiabc456...",
      "content": {
        "text": "This is the UPDATED content..."
      },
      "metadata": {
        "title": "Example Page (Updated)"
      },
      "manifest": {
        "version": "1.0.0"
      }
    }
  }'

Checking Status

# Event Bridge health
curl http://localhost:8100/

# Pipeline statistics
curl http://localhost:8100/stats

# Embedding server test
curl -X POST http://localhost:8090/encode \
  -H "Content-Type: application/json" \
  -d '{"text": "test embedding"}'

# Apache Jena SPARQL test
curl http://localhost:3030/koi/sparql \
  -H "Content-Type: application/sparql-query" \
  -d "SELECT * WHERE { ?s ?p ?o } LIMIT 10"

Agent Query Flow

The dual MCP Server architecture provides specialized query interfaces:

Knowledge MCP Server:

  1. Semantic Search (via PostgreSQL pgvector):

    • Agent sends: {"tool": "bge_search", "query": "regenerative agriculture"}
    • Routes to PostgreSQL for embedding similarity search
    • Returns relevant documents with similarity scores
  2. Ontological Query (via Apache Jena):

    • Agent sends: {"tool": "sparql_query", "query": "SELECT ?entity WHERE..."}
    • Routes to Apache Jena Fuseki
    • Returns RDF triples and relationships
  3. Hybrid Query:

    • Combines results from both systems
    • Semantic context from embeddings + ontological relationships
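
Reciprocal Rank Fusion, the combining step named in the Hybrid RAG API diagram, underlies this hybrid path. A minimal sketch; the constant k=60 is the value commonly used in the RRF literature and is assumed here rather than taken from this codebase:

def rrf_fuse(rankings, k=60):
    """Fuse ranked result lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse semantic hits with ontological hits (document IDs are illustrative)
fused = rrf_fuse([["doc3", "doc1", "doc2"], ["doc1", "doc4"]])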

Regen MCP Server:

  1. Blockchain Query:
    • Agent sends: {"tool": "ledger_query", "query": "carbon_credits"}
    • Connects to Regen Ledger
    • Returns on-chain data (credits, attestations, ecological state)

API Documentation

Event Bridge API

GET / - Health Check

Returns service status and configuration.

Response:

{
  "service": "KOI Event Bridge v2",
  "status": "operational",
  "version": "2.0.0",
  "features": [...],
  "isolated_tables": true
}

POST /process-koi-event - Process Event

Processes a KOI event with deduplication and versioning.

Request Body:

{
  "event_type": "NEW|UPDATE|FORGET",
  "source_sensor": "string",
  "timestamp": "ISO 8601",
  "bundle": {
    "rid": "unique resource identifier",
    "cid": "content identifier",
    "content": {},
    "metadata": {},
    "manifest": {}
  }
}

Response:

{
  "success": true,
  "rid": "string",
  "cid": "string",
  "chunks_created": 1,
  "embeddings_created": 1,
  "version": 1,
  "previous_version_id": null,
  "error": null
}
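
For Python-based sensors, the same call can be made with requests. A minimal sketch mirroring the curl examples above (the RID/CID values are illustrative):

import requests
from datetime import datetime, timezone

event = {
    "event_type": "NEW",
    "source_sensor": "website_monitor",
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "bundle": {
        "rid": "sensor.website.example.com.page1",  # illustrative RID
        "cid": "bafyreiabc123",                     # illustrative CID
        "content": {"text": "This is the content to be processed..."},
        "metadata": {"title": "Example Page", "url": "https://example.com/page1"},
        "manifest": {"version": "1.0.0"},
    },
}

resp = requests.post("http://localhost:8100/process-koi-event", json=event, timeout=30)
result = resp.json()
print(result["chunks_created"], result["version"])  # fields per the response schema above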

GET /stats - Pipeline Statistics

Returns current pipeline metrics.

Embedding Server API

POST /encode - Generate Embedding

Generates semantic embedding for text (currently using BGE model).

Request:

{
  "text": "content to embed"
}

Response:

{
  "embedding": [0.123, -0.456, ...] // 1024 dimensions
}
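
A minimal Python client for the endpoint, with a cosine similarity between two embeddings to show how semantic closeness is scored (the similarity helper is illustrative, not part of the server API):

import math
import requests

def encode(text):
    """POST to the embedding server; returns the 1024-dim vector."""
    resp = requests.post("http://localhost:8090/encode", json={"text": text}, timeout=30)
    resp.raise_for_status()
    return resp.json()["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine(encode("regenerative agriculture"), encode("soil health")))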

Database Schema

Isolated KOI Tables

koi_memories

CREATE TABLE koi_memories (
    id UUID PRIMARY KEY,
    rid VARCHAR(500) NOT NULL,
    cid VARCHAR(500),
    version INTEGER DEFAULT 1,
    previous_version_id UUID,
    event_type VARCHAR(20),
    source_sensor VARCHAR(200),
    content JSONB,
    metadata JSONB,
    superseded_at TIMESTAMP,
    created_at TIMESTAMP,
    UNIQUE(rid, version)
);

koi_embeddings

CREATE TABLE koi_embeddings (
    id SERIAL PRIMARY KEY,
    memory_id UUID REFERENCES koi_memories(id),
    dim_768 vector(768),   -- For alternative models
    dim_1024 vector(1024), -- For BGE
    dim_1536 vector(1536), -- For OpenAI
    created_at TIMESTAMP,
    UNIQUE(memory_id)
);

Useful Queries

-- Get latest version of all documents
SELECT * FROM current_koi_memories;

-- Get version history for a RID
SELECT * FROM get_koi_memory_history('sensor.website.example.com.page1');

-- Pipeline statistics
SELECT * FROM koi_pipeline_stats;

-- Check for duplicates
SELECT rid, COUNT(*) 
FROM koi_memories 
WHERE superseded_at IS NULL 
GROUP BY rid 
HAVING COUNT(*) > 1;
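
Putting the two tables together, a semantic search can be run directly in SQL. A sketch assuming psycopg2 and pgvector's cosine-distance operator (<=>); the table and column names come from the schema above, but the query the production API actually runs may differ:

import psycopg2

def semantic_search(conn, query_embedding, limit=5):
    """Rank current (non-superseded) documents by cosine distance of their 1024-dim embeddings."""
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT m.rid, m.metadata->>'title' AS title,
                   e.dim_1024 <=> %s::vector AS distance
            FROM koi_embeddings e
            JOIN koi_memories m ON m.id = e.memory_id
            WHERE m.superseded_at IS NULL
            ORDER BY distance
            LIMIT %s
            """,
            (vector_literal, limit),
        )
        return cur.fetchall()

# usage (embedding obtained from the /encode endpoint above):
# conn = psycopg2.connect("postgresql://postgres:postgres@localhost:5433/eliza")
# rows = semantic_search(conn, encode("regenerative agriculture"))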

Milestone B Components

✅ Podcast Publishing System (Session 14 - COMPLETE)

Status: Fully Implemented and Tested

The podcast publishing system generates weekly audio digests from aggregated content using automated audio generation (Podcastfy) or manual export (NotebookLM).

Key Components:

  • podcast_publisher.py - RSS 2.0 feed generation with iTunes extensions
  • podcastfy_generator.py - Automated audio generation (no manual steps)
  • podcast_integration.py - Complete pipeline orchestration
  • PODCAST_HOSTING_GUIDE.md - Comprehensive documentation

Features:

  • Automated audio generation from weekly digests
  • RSS feed with full podcast metadata
  • Google Drive backup integration (optional)
  • Episode management and versioning
  • Configurable voices and conversation styles

✅ Weekly Aggregator (Session 8 - COMPLETE)

Status: Fully Implemented

Aggregates content from past 7 days and generates comprehensive weekly digests.

✅ Daily Content Curator (Sessions 7, 9-13 - COMPLETE)

Status: Fully Implemented

The Daily Content Curator is a specialized processor component that aggregates and curates content for daily X posts and weekly digests.

Architecture Decision:

  • Component Type: Processor/Aggregator (NOT a KOI node)
  • Location: /koi-processor/daily_curator.py
  • Integration: Queries KOI infrastructure rather than acting as a sensor

Key Features:

  • Query PostgreSQL for recent koi_memories (24-48 hours)
  • Embedding similarity search for trending topic identification
  • Stats aggregation from ledger sensor data
  • Thread generation (3-5 posts with headline, stat, links, CTA)
  • Style guide compliance checking
  • JSON output for X bot consumption

Data Flow:

KOI Sensors → Event Bridge → PostgreSQL
                                ↓
                        Daily Content Curator
                                ↓
                        X Bot / Weekly Digest
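
The curator's first step (pulling the recent window of koi_memories) maps directly onto the schema shown earlier. A sketch assuming psycopg2; column names come from the koi_memories table definition above:

import psycopg2

def recent_memories(conn, hours=48):
    """Fetch current (non-superseded) documents ingested in the last N hours."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT rid, source_sensor, metadata->>'title' AS title, created_at
            FROM koi_memories
            WHERE superseded_at IS NULL
              AND created_at >= NOW() - make_interval(hours => %s)
            ORDER BY created_at DESC
            """,
            (hours,),
        )
        return cur.fetchall()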

Testing

Unit Tests

python -m pytest tests/

Integration Test

# Start all services
./scripts/start_services.sh

# Run integration tests
python tests/test_integration.py

Manual Testing

# Send test event
python scripts/send_test_event.py

# Check if processed
psql -U postgres -d eliza -c "SELECT * FROM koi_pipeline_stats;"

Deployment

Production Configuration

  1. Use environment variables for all configuration
  2. Enable SSL for PostgreSQL connections
  3. Use real embedding model instead of mock server
  4. Set up monitoring (Prometheus metrics available at /metrics)
  5. Configure log rotation for production logs

Docker Deployment

# Build image
docker build -t koi-processor .

# Run with environment file
docker run --env-file .env.production koi-processor

Systemd Service

[Unit]
Description=KOI Event Bridge v2
After=network.target postgresql.service

[Service]
Type=simple
User=koi
WorkingDirectory=/opt/koi-processor
Environment="USE_ISOLATED_TABLES=true"
ExecStart=/usr/bin/python3 src/core/koi_event_bridge_v2.py
Restart=always

[Install]
WantedBy=multi-user.target

Troubleshooting

Common Issues

"Embedding server not responding"

  • Check if embedding server is running: curl http://localhost:8090/encode -d '{"text":"test"}'
  • Verify BGE_API_URL environment variable
  • Check firewall rules for port 8090

"Duplicate key violation"

  • This means deduplication is working!
  • Use UPDATE event type for changed content
  • Check RID uniqueness before sending NEW events

"No embeddings created"

  • Verify pgvector extension: \dx in psql
  • Check embedding dimension matches model output
  • Review Event Bridge logs for errors

"Memory/CPU usage high"

  • Adjust chunk size and overlap in configuration
  • Implement rate limiting for sensor events
  • Consider horizontal scaling with multiple Event Bridge instances

Debug Mode

# Enable debug logging
LOG_LEVEL=DEBUG python src/core/koi_event_bridge_v2.py

# Check specific component
python -c "from koi_event_bridge_v2 import test_connection; test_connection()"

Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open Pull Request


License

MIT License - see LICENSE file for details


Built with 💚 for the regenerative future
