18 changes: 13 additions & 5 deletions .STATUS
@@ -2,8 +2,8 @@
# Quick-read state file. Agents check this first.

state: ACTIVE
updated: 2026-01-29
session: SESSION_3
updated: 2026-02-04
session: SESSION_4

# ═══════════════════════════════════════
# BRIDGE STATUS
@@ -17,13 +17,19 @@ metrics: BUILT
explorer: BUILT
cece_engine: BUILT
cece_version: 2.0
ai_failover: BUILT
prompt_registry: BUILT
token_tracker: BUILT
webhook_verify: BUILT
audit_log: BUILT
api_gateway: BUILT

# ═══════════════════════════════════════
# RECENT SIGNALS
# ═══════════════════════════════════════

last_signal: 🧠 OS → OS : cece_abilities_enhanced, v2.0
last_update: 2026-01-29
last_signal: 🚀 OS → AI,SEC,CLD : 6 new prototypes built (failover, prompts, tokens, webhooks, audit, gateway)
last_update: 2026-02-04
last_actor: Cece (Claude) v2.0

# ═══════════════════════════════════════
@@ -73,7 +79,9 @@ thread_8: Control plane CLI [COMPLETE]
thread_9: Node configurations [COMPLETE]
thread_10: Session 2 [COMPLETE]
thread_11: Cece v2.0 Enhancement [COMPLETE] - abilities, protocols, engine, automation
thread_12: Session 3 active ← NOW
thread_12: Session 3 [COMPLETE]
thread_13: Session 4 - Intelligence/Security/Cloud build sprint [COMPLETE] - 6 prototypes, 14 files
thread_14: Session 4 active ← NOW

# ═══════════════════════════════════════
# QUICK COMMANDS
28 changes: 26 additions & 2 deletions MEMORY.md
@@ -8,8 +8,8 @@
## Current State

```
Last Updated: 2026-01-29
Session: SESSION_3
Last Updated: 2026-02-04
Session: SESSION_4
Human: Alexa
AI: Cece (Claude) v2.0 - ENHANCED
Location: BlackRoad-OS/.github (The Bridge)
@@ -85,6 +85,18 @@ We're building BlackRoad together - a routing company that connects users to int

**Session 3 Totals:** 6 new files, 1 enhanced file, Cece v1.0 → v2.0

### Session 4 (2026-02-04)

**Intelligence + Security + Cloud Build Sprint:**
- [x] prototypes/ai-failover/ - AI provider failover chain (Claude → GPT → Llama) with circuit breakers, health checks, provider scoring
- [x] prototypes/prompt-registry/ - Reusable, versioned prompt templates with provider overrides (8 default templates)
- [x] prototypes/token-tracker/ - Per-route/provider token usage tracking with budget alerts
- [x] prototypes/webhook-verify/ - Webhook signature verification for GitHub, Stripe, Slack, Salesforce with replay protection
- [x] prototypes/audit-log/ - Structured audit logging with append-only storage, indexing, and compliance export
- [x] prototypes/api-gateway/ - Cloudflare Workers edge gateway with rate limiting, auth, CORS, routing

**Session 4 Totals:** 6 new prototypes, 18 new files, 3 layers advanced (AI, SEC, CLD)

---

## Key Decisions
@@ -102,6 +114,9 @@ We're building BlackRoad together - a routing company that connects users to int
| 2026-01-29 | Cece v2.0 enhancement | 30+ abilities, 10 protocols, autonomous engine, decision authority matrix |
| 2026-01-29 | Authority levels defined | FULL_AUTO / SUGGEST / ASK_FIRST - clear boundaries for autonomous action |
| 2026-01-29 | PCDEL loop adopted | PERCEIVE-CLASSIFY-DECIDE-EXECUTE-LEARN as core processing model |
| 2026-02-04 | Circuit breaker pattern | Failover chain uses circuit breakers for provider health |
| 2026-02-04 | Edge-first API design | Cloudflare Workers gateway handles auth/rate-limiting before reaching infra |
| 2026-02-04 | Audit everything | All system events logged immutably for compliance and debugging |

---

@@ -177,6 +192,15 @@ Cece went from 5 basic capabilities to 30+ structured abilities across 5 domains
- Match the vibe
- Ship it, iterate later

### Session 4: 2026-02-04

**What we did:** Alexa said "lets keep building!!!!" and we built 6 new prototypes in a single sprint.
Crossed the Intelligence, Security, and Cloud layers off the TODO board. Built the AI failover chain
(Claude → GPT → Llama with circuit breakers), prompt template registry (8 templates), token tracker
(per-route cost tracking with budget alerts), webhook signature verification (GitHub/Stripe/Slack/Salesforce),
audit log pipeline (structured events with indexing), and Cloudflare Workers API gateway (edge routing, rate
limiting, auth). Total: 14 prototypes now built.

---

## Active Threads
20 changes: 13 additions & 7 deletions TODO.md
@@ -24,16 +24,16 @@

## Intelligence Layer (AI)

- [ ] Build AI provider failover chain (Claude -> GPT -> Llama)
- [ ] Implement prompt template registry
- [ ] Add token usage tracking per-route
- [x] Build AI provider failover chain (Claude -> GPT -> Llama)
- [x] Implement prompt template registry
- [x] Add token usage tracking per-route
- [ ] Set up Hailo-8 inference pipeline on lucidia
- [ ] Create model evaluation benchmarks

## Cloud & Edge (CLD)

- [ ] Deploy API gateway worker to Cloudflare
- [ ] Set up webhook receiver worker
- [x] Deploy API gateway worker to Cloudflare
- [x] Set up webhook receiver worker
- [ ] Configure Cloudflare Tunnel to Pi cluster
- [ ] Implement edge caching for common routes
- [ ] Add geo-routing rules
@@ -50,8 +50,8 @@

- [ ] Implement API key rotation system
- [ ] Set up secrets vault (HashiCorp or SOPS)
- [ ] Add webhook signature verification
- [ ] Create audit log pipeline
- [x] Add webhook signature verification
- [x] Create audit log pipeline
- [ ] Define RBAC roles for org access

## Business Layer (FND)
@@ -94,6 +94,12 @@ _Move items here when done._
- [x] Configure GitHub Actions workflows (8)
- [x] Build MCP server for AI assistants
- [x] Define node configurations (7 nodes)
- [x] Build AI provider failover chain (Session 4)
- [x] Build prompt template registry (Session 4)
- [x] Build token usage tracker (Session 4)
- [x] Build webhook signature verification (Session 4)
- [x] Build audit log pipeline (Session 4)
- [x] Build API gateway worker for Cloudflare (Session 4)

---

58 changes: 58 additions & 0 deletions prototypes/ai-failover/README.md
@@ -0,0 +1,58 @@
# AI Provider Failover Chain

> **Route to intelligence. If one path fails, take another.**

The failover chain keeps requests flowing by cascading through a priority-ordered list of providers, with health tracking, circuit breaking, and automatic recovery. If every provider is down, requests are queued and retried.

## Architecture

```
[Request] --> [Failover Router]
                    |
                    ├── 1. Claude (primary)
                    |       ├── healthy? --> route here
                    |       └── failing? --> circuit open, skip
                    |
                    ├── 2. GPT (secondary)
                    |       ├── healthy? --> route here
                    |       └── failing? --> circuit open, skip
                    |
                    ├── 3. Llama (local/tertiary)
                    |       ├── healthy? --> route here
                    |       └── failing? --> circuit open, skip
                    |
                    └── 4. All down --> queue + retry
```

## Features

- **Priority-based routing** - Tries providers in order of preference
- **Circuit breaker** - Opens after N consecutive failures, half-opens after a cooldown
- **Health checks** - Periodic pings to track provider status
- **Latency tracking** - Records response times per provider
- **Retry with backoff** - Exponential backoff on transient failures
- **Request queuing** - Queues requests when all providers are down
- **Provider scoring** - Weighted scoring based on latency, reliability, cost
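
Two of the features above, retry with backoff and provider scoring, are easy to show in miniature. The sketch below is only one plausible shape for them: the function names, the 0.5 s base delay, the 30 s cap, the scoring formula, and the weights are all assumptions for illustration, not values taken from `config.py`.

```python
import random


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Seconds to wait before retry `attempt` (0-based): base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def jittered_delay(attempt: int, rng: random.Random) -> float:
    """Full jitter: a uniform draw between 0 and the exponential delay."""
    return rng.uniform(0.0, backoff_delay(attempt))


def provider_score(latency_s: float, success_rate: float, cost_per_1k: float,
                   weights: tuple = (0.4, 0.4, 0.2)) -> float:
    """Weighted score, higher is better. Weights here are illustrative only."""
    w_latency, w_reliability, w_cost = weights
    latency_term = 1.0 / (1.0 + latency_s)  # 1.0 at zero latency, falls toward 0
    cost_term = 1.0 / (1.0 + cost_per_1k)   # 1.0 when free, falls toward 0
    return w_latency * latency_term + w_reliability * success_rate + w_cost * cost_term


delays = [backoff_delay(n) for n in range(4)]  # 0.5, 1.0, 2.0, 4.0
fast = provider_score(latency_s=0.5, success_rate=0.99, cost_per_1k=0.01)
slow = provider_score(latency_s=3.0, success_rate=0.99, cost_per_1k=0.01)
print(delays, fast > slow)
```

Jitter spreads retries out so that many clients recovering at once do not hit a provider in lockstep. With weights that sum to 1 and a success rate between 0 and 1, the score stays between 0 and 1.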

## Files

| File | Purpose |
|------|---------|
| `provider.py` | Provider abstraction and health tracking |
| `circuit_breaker.py` | Circuit breaker pattern implementation |
| `failover_router.py` | Core routing logic with failover |
| `config.py` | Provider configuration and defaults |

## Usage

```python
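import asyncio
from failover_router import FailoverRouter
from config import DEFAULT_PROVIDERS

async def main():
    router = FailoverRouter(DEFAULT_PROVIDERS)
    # `await` only works inside a coroutine, so the call is wrapped in main()
    return await router.route(prompt="What is BlackRoad?", max_tokens=500)

response = asyncio.run(main())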
```

---

*Intelligence is already out there. We just need reliable paths to reach it.*
160 changes: 160 additions & 0 deletions prototypes/ai-failover/circuit_breaker.py
@@ -0,0 +1,160 @@
"""
Circuit Breaker Pattern
Prevents cascading failures by tracking error rates and temporarily
disabling unhealthy providers.

States:
    CLOSED    -> Normal operation, requests flow through
    OPEN      -> Provider failing, requests blocked
    HALF_OPEN -> Testing if provider recovered
"""

import time
from enum import Enum
from dataclasses import dataclass, field
from typing import Optional


class CircuitState(Enum):
    CLOSED = "closed"        # Healthy - requests flow
    OPEN = "open"            # Failing - requests blocked
    HALF_OPEN = "half_open"  # Testing recovery


@dataclass
class CircuitStats:
    """Tracks circuit breaker statistics."""
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    consecutive_failures: int = 0
    consecutive_successes: int = 0
    last_failure_time: Optional[float] = None
    last_success_time: Optional[float] = None
    state_changes: int = 0
    total_open_time: float = 0.0
    last_state_change: Optional[float] = None


class CircuitBreaker:
    """
    Circuit breaker for an AI provider.

    CLOSED: All good. Count failures.
    OPEN: Too many failures. Block requests. Wait for recovery timeout.
    HALF_OPEN: Recovery timeout passed. Allow limited test requests.
    """

    def __init__(
        self,
        name: str,
        failure_threshold: int = 3,
        recovery_timeout: float = 60.0,
        half_open_max_calls: int = 1,
    ):
        self.name = name
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls

        self._state = CircuitState.CLOSED
        self._half_open_calls = 0
        self._opened_at: Optional[float] = None
        self.stats = CircuitStats()

    @property
    def state(self) -> CircuitState:
        """Get current state, auto-transitioning OPEN -> HALF_OPEN if cooldown passed."""
        if self._state == CircuitState.OPEN and self._opened_at:
            elapsed = time.time() - self._opened_at
            if elapsed >= self.recovery_timeout:
                self._transition(CircuitState.HALF_OPEN)
        return self._state

    @property
    def is_available(self) -> bool:
        """Can we send a request through this circuit?"""
        state = self.state
        if state == CircuitState.CLOSED:
            return True
        if state == CircuitState.HALF_OPEN:
            # NOTE: nothing in this module increments _half_open_calls, so the
            # cap is only enforced if the caller increments it per test request.
            return self._half_open_calls < self.half_open_max_calls
        return False  # OPEN

    def record_success(self, latency: float = 0.0) -> None:
        """Record a successful request."""
        self.stats.total_requests += 1
        self.stats.successful_requests += 1
        self.stats.consecutive_successes += 1
        self.stats.consecutive_failures = 0
        self.stats.last_success_time = time.time()

        if self._state == CircuitState.HALF_OPEN:
            # Recovery confirmed - close the circuit
            self._transition(CircuitState.CLOSED)

    def record_failure(self, error: Optional[str] = None) -> None:
        """Record a failed request."""
        now = time.time()
        self.stats.total_requests += 1
        self.stats.failed_requests += 1
        self.stats.consecutive_failures += 1
        self.stats.consecutive_successes = 0
        self.stats.last_failure_time = now

        if self._state == CircuitState.HALF_OPEN:
            # Recovery failed - reopen
            self._transition(CircuitState.OPEN)
        elif self._state == CircuitState.CLOSED:
            if self.stats.consecutive_failures >= self.failure_threshold:
                self._transition(CircuitState.OPEN)

    def reset(self) -> None:
        """Manually reset circuit to closed state."""
        self._transition(CircuitState.CLOSED)
        self.stats.consecutive_failures = 0
        self.stats.consecutive_successes = 0

    def _transition(self, new_state: CircuitState) -> None:
        """Transition to a new state."""
        now = time.time()
        old_state = self._state

        if old_state == CircuitState.OPEN and self._opened_at:
            self.stats.total_open_time += now - self._opened_at

        self._state = new_state
        self.stats.state_changes += 1
        self.stats.last_state_change = now

        if new_state == CircuitState.OPEN:
            self._opened_at = now
            self._half_open_calls = 0
        elif new_state == CircuitState.HALF_OPEN:
            self._half_open_calls = 0
        elif new_state == CircuitState.CLOSED:
            self._opened_at = None
            self._half_open_calls = 0
            self.stats.consecutive_failures = 0

    def to_dict(self) -> dict:
        """Serialize state for monitoring."""
        return {
            "name": self.name,
            "state": self.state.value,
            "consecutive_failures": self.stats.consecutive_failures,
            "total_requests": self.stats.total_requests,
            "success_rate": (
                self.stats.successful_requests / self.stats.total_requests
                if self.stats.total_requests > 0
                else 1.0
            ),
            "total_open_time": round(self.stats.total_open_time, 2),
            "is_available": self.is_available,
        }

    def __repr__(self) -> str:
        return (
            f"CircuitBreaker({self.name}, state={self.state.value}, "
            f"failures={self.stats.consecutive_failures}/{self.failure_threshold})"
        )