Verify your AI agent's answers. Confidence scoring, independent verification, and adversarial stress-testing in one pipeline.
Agent's answer → Confidence check → Verification → Adversarial attack → Verdict
If it fails at any step, the rest don't run. No wasted LLM calls.
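The fail-fast flow can be sketched in plain Python (an illustrative stand-in for the pipeline, not the library's internals; the stage names and result dict shape are assumptions):

```python
# Each stage returns True (pass) or False (drop); on the first failure the
# remaining stages -- and their LLM calls -- are skipped entirely.
def run_pipeline(answer, stages):
    for name, stage in stages:
        if not stage(answer):
            return {"passed": False, "dropped_at": name}
    return {"passed": True, "dropped_at": None}

calls = []
stages = [
    ("confidence",   lambda a: calls.append("confidence") or True),
    ("verification", lambda a: calls.append("verification") or False),  # drops here
    ("adversarial",  lambda a: calls.append("adversarial") or True),    # never runs
]
result = run_pipeline("SQL injection on line 14", stages)
```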
```bash
curl -fsSL https://raw.githubusercontent.com/sharifli4/agent-verdict/main/install.sh | sh
```

Options: `sh -s anthropic`, `sh -s openai`, `sh -s deepseek`, `sh -s kimi`, `sh -s all`, `sh -s mcp`.
Works with any LLM. Built-in support for:
| Provider | API key env var | Default model | Install extra |
|---|---|---|---|
| Anthropic | ANTHROPIC_API_KEY | claude-sonnet-4-6 | agent-verdict[anthropic] |
| OpenAI | OPENAI_API_KEY | gpt-4o | agent-verdict[openai] |
| DeepSeek | DEEPSEEK_API_KEY | deepseek-chat | agent-verdict[deepseek] |
| Kimi (Moonshot) | MOONSHOT_API_KEY | kimi-k2.5 | agent-verdict[kimi] |
Set your API key:
```bash
export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
# or
export DEEPSEEK_API_KEY=sk-...
# or
export MOONSHOT_API_KEY=sk-...
```

Any OpenAI-compatible API works too — just pass `--base-url` and `--api-key-env`:
```bash
# use any OpenAI-compatible provider
agent-verdict -p openai --base-url https://api.example.com/v1 --api-key-env MY_API_KEY \
  eval "result" -c "task"
```

```bash
# full pipeline (auto-detects provider from API key)
agent-verdict evaluate "SQL injection on line 14" -c "Find security bugs"

# pick a specific provider
agent-verdict -p deepseek eval "SQL injection on line 14" -c "Find security bugs"

# pick a specific model
agent-verdict -p openai -m gpt-4o-mini eval "result" -c "task"

# pipe from another tool
my-agent analyze code.py | agent-verdict eval -c "Find security bugs"

# quick confidence check (1 LLM call)
agent-verdict check "server crashed due to OOM" -c "Diagnose outage"

# adversarial only
agent-verdict attack "race condition in pool.get()" -c "Find concurrency bugs"

# JSON output
agent-verdict --json eval "result" -c "task" > verdict.json
```

Exit code 0 = passed, 1 = dropped.
```python
from agent_verdict import verdict
from agent_verdict.llm.anthropic import AnthropicProvider

llm = AnthropicProvider()

@verdict(llm=llm, task_context="Find security bugs in Python code")
def analyze(code: str) -> str:
    return "Found SQL injection in the login handler"

result = analyze(user_code)
result.confidence  # 0.87
result.defended    # True
result.dropped     # False
```

Use any provider:
```python
from agent_verdict.llm.openai import OpenAIProvider
from agent_verdict.llm.deepseek import DeepSeekProvider
from agent_verdict.llm.kimi import KimiProvider

llm = DeepSeekProvider()               # uses DEEPSEEK_API_KEY
llm = KimiProvider(model="kimi-k2.5")  # uses MOONSHOT_API_KEY

# any OpenAI-compatible API
llm = OpenAIProvider(
    model="my-model",
    base_url="https://api.example.com/v1",
    api_key_env="MY_API_KEY",
)
```

Or use the pipeline directly:
```python
from agent_verdict import VerdictPipeline, VerdictConfig

pipeline = VerdictPipeline(llm=llm, config=VerdictConfig(confidence_threshold=0.7))
result = await pipeline.evaluate("race condition in pool", task_context="Find concurrency bugs")
```

```bash
curl -fsSL https://raw.githubusercontent.com/sharifli4/agent-verdict/main/install.sh | sh -s mcp
claude mcp add agent-verdict -- /path/to/.venv/bin/agent-verdict-mcp
```

Configure via env vars: VERDICT_PROVIDER, VERDICT_MODEL, VERDICT_BASE_URL, VERDICT_API_KEY_ENV.
Tools: evaluate (customizable via stages param), check_confidence, adversarial_check, self_consistency_check, semantic_similarity_check, entailment_check, logprob_check, cross_verification.
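For example (variable names from this README; the provider, model, and URL values are illustrative assumptions):

```shell
# point the MCP server at a provider without code changes
export VERDICT_PROVIDER=openai
export VERDICT_MODEL=gpt-4o-mini
export VERDICT_BASE_URL=https://api.example.com/v1   # only for OpenAI-compatible endpoints
export VERDICT_API_KEY_ENV=MY_API_KEY                # name of the env var holding the key
```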
```python
result.confidence         # 0.0-1.0
result.context_relevance  # 0.0-1.0
result.justification      # why the answer makes sense
result.counter_argument   # best attack against the answer
result.defense            # response to that attack
result.defended           # did the defense hold?
result.dropped            # was the answer rejected?
result.drop_reason        # why it was rejected
result.deliberation       # list[JurorPosition] — cross-verification jury
result.usage              # list[StageUsage] — per-stage token/cost breakdown
result.total_tokens       # sum of all tokens across stages
result.total_cost         # sum of all costs in USD
```

| Algorithm | Technique | Catches |
|---|---|---|
| Confidence Scoring | LLM rates confidence and relevance 0.0-1.0 | Low-quality or vague answers |
| Independent Verification | LLM re-derives answer without seeing the original | Answers that don't hold up independently |
| Adversarial Dialectic | Generate counter-argument, then defend against it | Plausible but flawed answers |
| Self-Consistency | Wang et al. 2022 — sample N answers, measure agreement | Unstable/unreliable answers |
| Cosine Similarity | Sentence embeddings (MiniLM) | Off-topic answers |
| NLI Entailment | DeBERTa-v3 classification | Hallucinated/contradicting answers |
| Logprob Calibration | Token log-probabilities via exp(mean_logprob) | Internally uncertain answers |
| Cross-Verification | Multi-model jury deliberation with position, counter, and rebuttal | Answers that fool one model but not others |
The default pipeline runs the first three (4 LLM calls total). Stages 5-7 use different models, breaking the "same brain grading itself" problem.
| Stage | LLM calls | Install |
|---|---|---|
| ConfidenceStage | 1 | included |
| VerificationStage | 1 | included |
| AdversarialStage | 2 | included |
| SelfConsistencyStage(n) | 2n | included |
| SemanticSimilarityStage | 0 | pip install agent-verdict[embeddings] |
| EntailmentStage | 0 | pip install agent-verdict[nli] |
| LogprobStage | 1 | OpenAI only |
| CrossVerificationStage | 2 per juror | needs 2+ providers |
Custom pipeline:
```python
from agent_verdict import VerdictPipeline, ConfidenceStage, EntailmentStage, AdversarialStage

pipeline = VerdictPipeline(llm=llm, stages=[
    ConfidenceStage(),
    EntailmentStage(),
    AdversarialStage(),
])
```

Different LLMs deliberate on your agent's answer like a jury. Each juror:
- States position — support or challenge, with argument
- Steel-mans the other side — strongest counter to their own position
- Deliberates — sees all positions, writes rebuttal, casts final vote
All jurors run in parallel. Wall-clock time ≈ one slow LLM call per phase.
```python
from agent_verdict import VerdictPipeline, CrossVerificationStage
from agent_verdict.llm.anthropic import AnthropicProvider
from agent_verdict.llm.openai import OpenAIProvider
from agent_verdict.llm.deepseek import DeepSeekProvider

pipeline = VerdictPipeline(
    llm=AnthropicProvider(),
    stages=[
        CrossVerificationStage(challengers=[
            OpenAIProvider(),
            DeepSeekProvider(),
        ]),
    ],
)
result = await pipeline.evaluate("the agent's answer", task_context="what it should do")

# inspect the jury deliberation
for juror in result.deliberation:
    print(f"{juror.juror}: {juror.vote} → {juror.final_vote} ({juror.confidence:.2f})")
    print(f"  argument: {juror.argument}")
    print(f"  steel-man: {juror.counter_to_self}")
    print(f"  rebuttal: {juror.rebuttal}")
```

Majority vote decides. If more jurors challenge than support, the result is dropped.
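The decision rule is simple enough to state in a few lines (a sketch of the rule as described here; the library's internal tally may differ):

```python
# Majority rule over the jurors' final votes: more "challenge" than
# "support" votes means the answer is dropped. Ties pass, since only a
# strict challenger majority drops the result.
def jury_verdict(final_votes: list[str]) -> str:
    challenges = sum(1 for v in final_votes if v == "challenge")
    supports = sum(1 for v in final_votes if v == "support")
    return "dropped" if challenges > supports else "passed"

jury_verdict(["support", "challenge", "challenge"])  # "dropped"
```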
Every evaluation tracks token usage and estimated cost per stage — no surprises.
```python
result = await pipeline.evaluate("answer", task_context="task")

for stage in result.usage:
    print(f"{stage.stage}: {stage.llm_calls} calls, {stage.total_tokens:,} tokens, ${stage.cost:.4f}")
print(f"Total: {result.total_tokens:,} tokens, ${result.total_cost:.4f}")
```

CLI output includes a cost breakdown automatically:
```
PASSED
  confidence: 0.87
  relevance: 0.82
  cost:
    ConfidenceStage    1 calls  423 tokens  $0.0016
    VerificationStage  1 calls  512 tokens  $0.0021
    AdversarialStage   2 calls  891 tokens  $0.0045
    total: $0.0082 (1,826 tokens)
```
JSON output (`--json`) includes a `usage` array with the per-stage breakdown.
Built-in pricing for GPT-4o, Claude Sonnet/Opus/Haiku, DeepSeek, Kimi K2.5, and more. Custom models show token counts (cost = $0 if model not in pricing table).
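As a rough sketch of how such per-model pricing works (the table, rates, and function below are illustrative assumptions, not the library's actual pricing data or API):

```python
# Hypothetical pricing table in USD per 1M tokens. Unknown models fall back
# to $0, matching the note above that unpriced models report cost = 0 while
# still tracking token counts.
PRICING = {"gpt-4o": {"input": 2.50, "output": 10.00}}  # example rates

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    rates = PRICING.get(model)
    if rates is None:
        return 0.0  # model not in pricing table
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000
```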
```python
VerdictConfig(
    confidence_threshold=0.5,  # below → dropped
    relevance_threshold=0.4,   # below → dropped
    require_defense=True,      # can't defend → dropped
)
```

CLI: `--confidence-threshold`, `--relevance-threshold`, `--no-require-defense`.
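The thresholds gate a result roughly like this (an illustrative sketch of the rules above, not the library's exact control flow):

```python
# A result passes only if it clears both score thresholds and, when a
# defense is required, survived the adversarial stage.
def passes(confidence: float, relevance: float, defended: bool,
           confidence_threshold: float = 0.5, relevance_threshold: float = 0.4,
           require_defense: bool = True) -> bool:
    if confidence < confidence_threshold:
        return False  # dropped: low confidence
    if relevance < relevance_threshold:
        return False  # dropped: off-task answer
    if require_defense and not defended:
        return False  # dropped: failed to defend against the counter-argument
    return True
```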
```python
# Custom LLM provider — implement one method
class MyProvider(LLMProvider):
    async def complete(self, messages):
        return LLMResponse(content=await my_llm(messages[0].content))

# Or subclass OpenAIProvider for any OpenAI-compatible API
class MyProvider(OpenAIProvider):
    def __init__(self):
        super().__init__(
            model="my-model",
            base_url="https://api.example.com/v1",
            api_key_env="MY_API_KEY",
        )

# Custom stage
class MyStage(Stage):
    async def run(self, verdict, llm, task_context, config):
        return verdict.model_copy(update={"confidence": 0.99})
```

```bash
git clone https://github.com/sharifli4/agent-verdict.git
cd agent-verdict
pip install -e ".[dev]"
pytest tests/ -v  # no API keys needed, uses mock provider
```