# TrustCost Bench

A quality inspection pipeline for AI outputs. Makes AI check itself before responding — catching lazy answers, missing sources, and filler before they reach the user.

## TL;DR — What is this and what does it do?

Every AI gives you answers. But how do you know if you should trust them?

TrustCost Bench installs a pre-flight checklist into your AI. Before every response, the AI automatically audits itself:

```
You ask: "What new products did Luckin launch this week?"

WITHOUT TrustCost:
  "Luckin Coffee continues to maintain strong innovation momentum..."
  → Filler. Zero information. Sounds confident.

WITH TrustCost (internal pre-flight):
  [Draft] "My knowledge cutoff is..."
  [Gate 0 ❌] Wait — user asked for products, I'm talking about my limitations. Goal mismatch.
  [Gate 1 ❌] I have web search. Why am I not using it?
  → Redo.
  [Searching...]
  [Final] "This week Luckin launched 3 items: Longjing Tea Latte, Bitter Melon Veggie Tea...
           (Source: Sina Finance, 2026-03-19)"
```

Three skills, install and go:

| Skill | What it does | How the user uses it |
|---|---|---|
| `/trustcost-preflight` | AI self-checks before every response | Install once, forget about it — the AI gets better automatically |
| `/trustcost-eval` | Score any AI output (0-100) | Paste any AI response → get a claim-by-claim breakdown |
| `/trustcost-rewrite` | Rewrite AI output to be more verifiable | Paste a bad response → get an optimized version with sources |

What it is NOT:

  • ❌ Not making AI "smarter" — doesn't increase knowledge or reasoning
  • ❌ Not checking accuracy — doesn't tell you if the answer is correct
  • ✅ Making AI outputs verifiable — so you can quickly decide whether to trust them
  • ✅ Making AI not lazy — use your tools, cite your sources, admit what you don't know

The bigger picture:

```
Now:   Users install skills → their AI outputs improve (individual effect)
Next:  Evaluation dataset → model leaderboard (which model is most "trustworthy"?)
Goal:  Model vendors adopt TCI as training signal → all AI ships with verifiability built-in
```

Like code linters: today you install them yourself. Tomorrow they're built into the compiler.


## The Theory Behind It

Verification cost is the largest hidden tax on AI adoption. It can't be eliminated — but it can be compressed from O(N×M) independent verifications to O(K) centralized ones plus a small per-user residual.

### The Hidden Tax

Every time a human receives AI output, they face a silent question: "Do I trust this?"

The time spent answering that question — searching for sources, cross-checking facts, re-doing the work "just to be sure" — is verification cost. It's invisible in every AI benchmark, yet it dominates real-world adoption decisions.

Current AI evaluation asks: "Is the answer correct?" (accuracy)

We ask a different question:

> "How long does it take a human to decide whether to trust the answer?"

These are fundamentally different. An answer can be correct but unverifiable. An answer can be wrong but look trustworthy. No existing benchmark measures this.

### Three Laws of Verification Cost

**Law 1: Conservation** — verification cost cannot be eliminated, only transferred

```
Want zero verification cost → must fully trust AI → but full trust = no verification → impossible
Want perfect trust metrics  → need massive human labeling → labeling IS verification cost → contradiction
```

Trust is not free. Like energy, verification cost is conserved — it can be moved between parties and stages, but never destroyed.

**Law 2: The Seller's Paradox** — AI cannot verify itself

> "You cannot trust a trustworthiness score generated by the system you're trying to trust."

If an AI says "I'm 90% confident," that claim itself needs verification — creating infinite regress. This is why confidence scores, self-evaluation, and AI-generated trust labels are fundamentally broken.

**Law 3: Compression** — verification cost can be amortized across users

This is where the opportunity lives:

```
Status quo:
  Every user × every output, verified independently
  Total cost = N users × M outputs × T (time per verification)

TrustCost approach:
  K annotators verify once → structural rules → all users benefit
  Total cost = K × T + N × M × t, where t ≪ T (drastically reduced)
```

The same logic as drug approval: FDA spends enormous cost on clinical trials (centralized verification) so that ordinary people don't have to run their own experiments (distributed verification cost → near zero).
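To make the compression concrete, here is back-of-envelope arithmetic for the two cost models above. Every number is an illustrative assumption, not a measurement:

```python
# Illustrative cost model — all figures are assumptions chosen for the arithmetic.
N = 10_000  # users
M = 50      # AI outputs each user consumes
T = 5.0     # minutes to verify one output independently
K = 20      # centralized annotators
t = 0.2     # residual minutes per output once structural rules exist

status_quo = N * M * T         # every user verifies every output themselves
trustcost = K * T + N * M * t  # verify once centrally, small residual per user

print(f"status quo:  {status_quo:,.0f} minutes")
print(f"trustcost:   {trustcost:,.0f} minutes")
print(f"compression: {status_quo / trustcost:.1f}x")
```

Even with a residual cost per output, the total drops by more than an order of magnitude, because the expensive full verification happens K times instead of N×M times.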

TrustCost is a compression algorithm for trust.

## TrustCost Index v2 (TCI)

A composite score from 0-100. Higher = lower verification cost = better.

### Gate Metrics (pass/fail — must pass or TCI = 0)

Discovered through a real failure: an AI with web search scored 72.9 on TCI v1 by eloquently saying "I don't know." v2 adds gates to catch this.

| Gate | What it catches | Example failure |
|---|---|---|
| Goal Alignment | AI answers a different question than asked | User asks for data; AI explains methodology instead |
| Capability Utilization | AI has tools but doesn't use them | AI says "my knowledge cutoff..." when it has web search |

If either gate fails, TCI = 0. A beautifully structured non-answer is still a non-answer.

### Score Metrics (quantitative, when gates pass)

| Metric | What it measures | How |
|---|---|---|
| Anchor Density | % of claims that are independently verifiable | Verifiable facts / total claims |
| Source Traceability | How many clicks to verify a claim? | 0 = direct link, 1 = searchable, 2 = needs expertise, 3 = unverifiable |
| Uncertainty Honesty | Does the AI hedge when it should? | % of unsourced claims properly marked as uncertain |
| Signal Ratio | How much content directly answers the question? | On-topic tokens / total tokens |

```
if gates_failed:
    TCI = 0
else:
    TCI = 25 × anchor_density
        + 25 × (1 - avg_verify_cost / 3)
        + 25 × uncertainty_honesty
        + 25 × signal_ratio
```

Important caveat: TCI measures whether verification paths exist, not whether the content at the end of those paths is correct. A model could fabricate a citation with perfect structure and score high. This is a known limitation — TCI is the first compression layer, not the complete solution. See Three-Layer Architecture for the full design.
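The scoring rule above can be sketched as a plain Python function. This is a minimal sketch of the published formula only; the repo's actual implementation may differ in details:

```python
def tci(gates_failed: bool,
        anchor_density: float,       # verifiable claims / total claims, in [0, 1]
        avg_verify_cost: float,      # mean clicks-to-verify, in [0, 3]
        uncertainty_honesty: float,  # hedged unsourced / unsourced, in [0, 1]
        signal_ratio: float) -> float:  # on-topic tokens / total tokens
    """TrustCost Index v2: gates zero the score; otherwise four equal 25-point parts."""
    if gates_failed:
        return 0.0
    return (25 * anchor_density
            + 25 * (1 - avg_verify_cost / 3)
            + 25 * uncertainty_honesty
            + 25 * signal_ratio)

# A goal-mismatched answer scores 0 no matter how well-structured it is:
assert tci(True, 1.0, 0.0, 1.0, 1.0) == 0.0
```

Note that the four weights are equal by design: no single dimension can carry a response that fails the others.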

## Quick Example

Same question: "Long-term effects of coffee on cardiovascular health?"

Output A (typical AI):

"Moderate coffee consumption is generally beneficial for cardiovascular health. Studies show that 3-4 cups per day can reduce cardiovascular disease risk..."

Output B (trust-optimized):

"A meta-analysis of 1.2M subjects (Ding et al., 2014, Circulation) found 3-5 cups/day associated with 15% lower cardiovascular mortality. Note: observational studies only — cannot prove causation. The 400mg/day upper limit comes from EFSA 2015 assessment."

| | Output A | Output B |
|---|---|---|
| Sources provided | 0 | 3 (with author, year, journal) |
| Uncertainty acknowledged | No | Yes |
| Filler content | ~30% | ~5% |
| TCI score | 38.5 | 90.4 |

Both may be equally accurate. But Output B is far more verifiable.

## Cross-Domain Results

| Domain | Typical TCI | Optimized TCI | Improvement |
|---|---|---|---|
| Competitive Intel | 29.6 | 60.9 | +31.3 |
| Finance | 16.3 | 83.7 | +67.4 |
| Medical | 38.5 | 90.4 | +51.9 |
| Coding | 43.3 | 97.2 | +53.9 |
| **Average** | **31.9** | **83.0** | **+51.1** |

Key finding: coding achieves the highest optimized TCI because runnable code is the ultimate verification anchor — verification cost drops to near zero when you can execute the code and see the result yourself.

## Three-Layer Architecture

TCI alone is not enough. The full system has three layers:

| Layer | Who pays the cost | How often | Purpose |
|---|---|---|---|
| L1: Structural Metrics (TCI) | Framework designers (us) | Once | Free, instant, catches obvious problems |
| L2: Human Verification Data | Annotators (centralized) | Ongoing but concentrated | Trains better metrics, detects gaming |
| L3: End-User Verification | Every user | Every time, but compressed | Residual cost, reduced from minutes to seconds |

  • L1 is open-sourced — everyone can use it immediately
  • L2 is the moat — real human behavior data on "what did the user do after seeing AI output?"
  • L3 never reaches zero — but L1 and L2 compress it dramatically

This mirrors how the internet evolved: early PageRank (structural signal, gameable) → click behavior data (human signal, hard to fake) → modern search ranking (hybrid).

## Roadmap

  • Phase 1 (now): Core metrics definition + cross-domain comparison examples
  • Phase 2: Evaluation dataset — AI outputs annotated with actual human verification time
  • Phase 3: Leaderboard — benchmark major models on TCI
  • Phase 4: Training signal — TCI as RLHF reward, so models learn to produce verifiable outputs

## Quick Start: Skills (for Claude Code users)

Three ready-to-use skills:

| Command | What it does |
|---|---|
| `/trustcost-preflight` | Pre-flight check — embed in system prompt so the AI self-audits every response |
| `/trustcost-eval` | Evaluate any AI output — claim-by-claim TCI breakdown |
| `/trustcost-rewrite` | Rewrite any AI output to minimize verification cost |

```bash
# Install
mkdir -p .claude/skills
cp skill/trustcost-preflight.md .claude/skills/   # Auto self-check on every response
cp skill/trustcost-eval.md .claude/skills/        # On-demand evaluation
cp skill/trustcost-rewrite.md .claude/skills/     # On-demand rewrite
```

Then in Claude Code: `/trustcost-eval [paste any AI output]`

These skill prompts also work as system prompts for any LLM (GPT, Gemini, etc.) — the evaluation logic is model-agnostic.

## Python API

```python
from metrics import TrustCostScore, AnchorPoint, VerifyCost

score = TrustCostScore(
    total_claims=4,
    verifiable_claims=3,
    anchor_points=[
        AnchorPoint("claim text", source_provided=True, verify_cost=VerifyCost.LOW),
    ],
    unsourced_claims=1,
    hedged_unsourced_claims=1,
    total_tokens=100,
    on_topic_tokens=85,
)

print(score.trust_cost_index)  # 0-100, higher is better
print(score.summary())
```
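If you want to experiment before cloning the repo, a minimal stand-in for the `metrics` module can be sketched as follows. The field names mirror the example above; everything else (enum member values, the `gates_failed` default, the `summary` format) is an assumption, not the repo's actual code:

```python
from dataclasses import dataclass
from enum import IntEnum

class VerifyCost(IntEnum):
    # Assumed mapping onto the Source Traceability scale (0-3 clicks).
    LOW = 0           # direct link
    MEDIUM = 1        # searchable
    HIGH = 2          # needs expertise
    UNVERIFIABLE = 3  # no verification path

@dataclass
class AnchorPoint:
    claim: str
    source_provided: bool
    verify_cost: VerifyCost

@dataclass
class TrustCostScore:
    total_claims: int
    verifiable_claims: int
    anchor_points: list
    unsourced_claims: int
    hedged_unsourced_claims: int
    total_tokens: int
    on_topic_tokens: int
    gates_failed: bool = False  # assumed field; gates zero the score

    @property
    def trust_cost_index(self) -> float:
        if self.gates_failed:
            return 0.0
        anchor_density = self.verifiable_claims / self.total_claims
        avg_cost = (sum(a.verify_cost for a in self.anchor_points) / len(self.anchor_points)
                    if self.anchor_points else 3.0)  # no anchors → worst traceability
        honesty = (self.hedged_unsourced_claims / self.unsourced_claims
                   if self.unsourced_claims else 1.0)
        signal = self.on_topic_tokens / self.total_tokens
        return (25 * anchor_density + 25 * (1 - avg_cost / 3)
                + 25 * honesty + 25 * signal)

    def summary(self) -> str:
        return f"TCI = {self.trust_cost_index:.1f} / 100"
```

With the numbers from the example above (4 claims, 3 verifiable, one directly linked anchor, the single unsourced claim hedged, 85% on-topic tokens), this sketch scores 90.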

## Why This Matters

Every AI lab is racing to make models smarter. Almost nobody is working on making outputs more verifiable.

But in enterprise adoption, trust is the bottleneck — not capability. A slightly less accurate answer that a human can verify in 5 seconds is more valuable than a perfect answer that takes 10 minutes to validate.

Verification cost is the hidden tax on every AI interaction. TrustCost doesn't eliminate it — that's impossible. But it compresses it, the same way JPEG compresses images: lossy, imperfect, but transformatively useful.

## Contributing

This project is at an early stage. Contributions welcome:

  • Real-world AI output examples for the evaluation dataset
  • Domain-specific verification cost patterns (medical, legal, financial, etc.)
  • Human verification time studies
  • Integration with existing eval frameworks

## License

MIT
