# LLM Output Verification Middleware for Claim-Level Reliability Auditing in Long-Form Generated Text

Production-style runtime service for auditing GenAI outputs and assigning epistemic risk signals before downstream use.
```mermaid
graph TD
    User[Client / Downstream App] -->|POST /audit| API[FastAPI Backend]
    API -->|Async Pipeline| Engine[Audit Engine]

    subgraph "Verification Pipeline"
        Engine --> Extract[Claim Extraction]
        Extract --> Link[Entity Linking]
        Link --> Retrieve[Evidence Retrieval]
        Retrieve --> Verify[Claim Verification]
        Verify --> Agg[Risk Aggregation]
    end

    subgraph "Data & Analytics"
        Engine -.->|Log| JSONL[audit_runs.jsonl]
        JSONL -.->|Offline Eval| Dashboard[Research Dashboard]
    end

    Engine -->|Structured Risk Signal| User
```
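The five pipeline stages in the diagram can be sketched as a chain of async steps over a shared state object. Everything below (stage bodies, `AuditState` fields, the majority-vote risk rule) is illustrative, not the engine's actual implementation; the Entity Linking and Evidence Retrieval stages are elided for brevity.

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class AuditState:
    """Accumulates artifacts as text moves through the pipeline."""
    text: str
    claims: list = field(default_factory=list)
    verdicts: list = field(default_factory=list)
    risk: str = "UNKNOWN"

# Placeholder stages mirroring Extract -> ... -> Verify -> Aggregate
async def extract_claims(state: AuditState) -> AuditState:
    # Naive sentence split stands in for real claim extraction
    state.claims = [c.strip() for c in state.text.split(".") if c.strip()]
    return state

async def verify_claims(state: AuditState) -> AuditState:
    # A real verifier would consult linked entities and retrieved evidence
    state.verdicts = ["Uncertain"] * len(state.claims)
    return state

async def aggregate_risk(state: AuditState) -> AuditState:
    # Toy rule: majority of non-supported claims escalates the signal
    bad = state.verdicts.count("Uncertain") + state.verdicts.count("Refuted")
    state.risk = "HIGH" if bad > len(state.verdicts) / 2 else "LOW"
    return state

async def audit(text: str) -> AuditState:
    state = AuditState(text=text)
    for stage in (extract_claims, verify_claims, aggregate_risk):
        state = await stage(state)
    return state

result = asyncio.run(audit("The sky is green. Water is wet."))
print(result.risk)  # HIGH
```

Threading a single state object through the stages keeps each stage independently testable and makes it easy to log intermediate artifacts per stage.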
The frontend provides a research-grade interface for manual inspection of model outputs:
- Input: Paste generated text (up to 20k chars).
- Process: Visualize the claim extraction and verification process in real-time.
- Output: Granular, claim-level verdicts (Supported, Refuted, Uncertain) with linked evidence.
Ideal for red-teaming, policy tuning, and qualitative analysis of model failure modes.
- Runtime: FastAPI (Python 3.11+)
- Entrypoint: `POST /audit`
- Observability: `GET /health` for readiness probes.
- Concurrency: Async-first pipeline design for high-throughput auditing.
```bash
# Example health check
curl http://localhost:8000/health
```

The engine produces a strictly typed JSON response designed for programmatic consumption:

- `overall_risk`: High-level traffic-light signal (`LOW`, `MEDIUM`, `HIGH`) for gating.
- `hallucination_score`: Normalized [0, 1] score for threshold-based filtering.
- `claims`: Array of atomic claims with individual verdicts and evidence context.
Use this payload to:
- Block high-risk responses.
- Flag uncertain claims for human review.
- Inject citations back into the generation.
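The three gating actions above can be expressed as a small consumer of the response payload. The top-level field names (`overall_risk`, `hallucination_score`, `claims`) follow the schema described here; the 0.5 threshold and the claim-level `verdict` key are assumptions for illustration.

```python
def gate_response(audit: dict, score_threshold: float = 0.5) -> str:
    """Map an audit payload to a gating action: block, flag, or pass."""
    # Hard block on high-risk or high-hallucination responses
    if audit["overall_risk"] == "HIGH" or audit["hallucination_score"] > score_threshold:
        return "block"
    # Route medium-risk or individually uncertain claims to human review
    if audit["overall_risk"] == "MEDIUM" or any(
        c.get("verdict") == "Uncertain" for c in audit["claims"]
    ):
        return "flag_for_review"
    return "pass"

payload = {
    "overall_risk": "MEDIUM",
    "hallucination_score": 0.22,
    "claims": [{"text": "The policy took effect in 2021.", "verdict": "Uncertain"}],
}
print(gate_response(payload))  # flag_for_review
```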
Every request to the inference endpoints is automatically logged to an append-only JSONL event stream:
- Traceability: Full input/output capture with timestamp and configuration metadata.
- Dataset Generation: Logs can be directly consumed by the evaluation harness to build fine-tuning datasets or regression benchmarks.
- Reproducibility: Logs capture all state needed to replay an audit deterministically.

File path: `paper/data/audit_runs.jsonl`
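Consuming the append-only stream is a one-liner per record; a minimal reader, assuming only that each non-blank line is one JSON event (the exact record keys are not specified here):

```python
import json
from pathlib import Path

def load_audit_runs(path: str = "paper/data/audit_runs.jsonl"):
    """Yield one logged audit event per line of the append-only JSONL stream."""
    with Path(path).open() as fh:
        for line in fh:
            if line.strip():  # tolerate blank lines between appends
                yield json.loads(line)

# e.g. filter the log into a regression set of high-risk outputs:
# high_risk = [r for r in load_audit_runs() if r.get("overall_risk") == "HIGH"]
```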
Reproducible research capabilities are built in as first-class features:
- Seed Control: Deterministic execution for reliable regression testing.
- Prompt Perturbation: Evaluate large batches of synthetically perturbed outputs.
- Artifact Generation: Automatically produces PDF/PNG analysis figures for calibration reports.
Run the harness:

```bash
EPI_SYNTH_MODE=demo EPI_SYNTH_RUNS=500 bash scripts/run_research.sh
```

The `GET /health` endpoint reports:

- Liveness: Simple HTTP 200 OK.
- Readiness: Checks pipeline initialization and model loading status.
- Uptime: Tracks service stability.
- Backend: Python / FastAPI / Uvicorn
- Frontend: Next.js (React) / Tailwind
- Orchestration: Dockerizable services, ready for Kubernetes or ECS.
- Backend: `uvicorn app:app --host 0.0.0.0 --port 8000`
- Frontend: `npm start` (port 3000)
- Backend: `cd backend && PYTHONPATH=$PWD .venv/bin/python -m uvicorn app:app --host 127.0.0.1 --port 8000`
- Frontend: `cd frontend && npm run dev`
- Run both: `npm run dev` from the repo root
- Frontend env: copy `frontend/.env.example` to `frontend/.env.local` and set `BACKEND_URL=http://127.0.0.1:8000`
- Expected ports: frontend on `http://127.0.0.1:3000`, backend on `http://127.0.0.1:8000`
The frontend health proxy calls the backend `GET /health` endpoint and fails fast with a clear `503` JSON payload when the backend is unavailable.
The root backend launcher prefers `./.venv/bin/python`, then `backend/.venv/bin/python`, then falls back to `python3` or `python` if one of those interpreters already has the backend requirements installed.
Integrate Epistemic Audit Engine as a middleware layer in your RAG or Copilot architecture:
- Internal Copilots: Prevent hallucinated policy advice in HR/Legal bots.
- Document QA: Verify answers against retrieval context before showing to users.
- Compliance Pipelines: Audit generated marketing copy for factual claim reliability.
- Moderation: Automate the detection of unsubstantiated claims in user-generated content.
Flow:

```
LLM Generation -> Epistemic Audit -> (Low Risk)  -> User
                                  -> (High Risk) -> Fallback / Warning
```
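This flow can be wired up as a thin wrapper around an existing generation call. A minimal sketch, assuming `generate` and `audit` are caller-supplied callables (illustrative names) and that `audit` returns the risk payload described earlier:

```python
def with_audit(generate, audit,
               fallback="Unable to verify this answer; escalating to a human reviewer."):
    """Wrap a text generator with the audit gate from the flow above."""
    def guarded(prompt: str) -> str:
        draft = generate(prompt)
        if audit(draft)["overall_risk"] == "HIGH":
            return fallback  # high-risk path: fallback / warning
        return draft  # low-risk path: pass through to the user
    return guarded

# Toy wiring with stubbed components:
guarded = with_audit(
    generate=lambda p: "The moon is made of cheese.",
    audit=lambda text: {"overall_risk": "HIGH"},
)
print(guarded("Tell me about the moon."))
```

Because the wrapper only depends on the `overall_risk` field, the same pattern drops into a RAG chain, a Copilot backend, or a moderation queue without changes.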





