agent-verdict

Verify your AI agent's answers. Confidence scoring, independent verification, and adversarial stress-testing in one pipeline.

Agent's answer → Confidence check → Verification → Adversarial attack → Verdict

If it fails at any step, the rest don't run. No wasted LLM calls.

Install

curl -fsSL https://raw.githubusercontent.com/sharifli4/agent-verdict/main/install.sh | sh

Options: sh -s anthropic, sh -s openai, sh -s deepseek, sh -s kimi, sh -s all, sh -s mcp.

Providers

Works with any LLM. Built-in support for:

Provider	API key env var	Default model	Install extra
Anthropic	`ANTHROPIC_API_KEY`	`claude-sonnet-4-6`	`agent-verdict[anthropic]`
OpenAI	`OPENAI_API_KEY`	`gpt-4o`	`agent-verdict[openai]`
DeepSeek	`DEEPSEEK_API_KEY`	`deepseek-chat`	`agent-verdict[deepseek]`
Kimi (Moonshot)	`MOONSHOT_API_KEY`	`kimi-k2.5`	`agent-verdict[kimi]`

Set your API key:

export ANTHROPIC_API_KEY=sk-ant-...
# or
export OPENAI_API_KEY=sk-...
# or
export DEEPSEEK_API_KEY=sk-...
# or
export MOONSHOT_API_KEY=sk-...

Any OpenAI-compatible API works too — just pass --base-url and --api-key-env:

# use any OpenAI-compatible provider
agent-verdict -p openai --base-url https://api.example.com/v1 --api-key-env MY_API_KEY \
    eval "result" -c "task"

Usage

CLI

# full pipeline (auto-detects provider from API key)
agent-verdict evaluate "SQL injection on line 14" -c "Find security bugs"

# pick a specific provider
agent-verdict -p deepseek eval "SQL injection on line 14" -c "Find security bugs"

# pick a specific model
agent-verdict -p openai -m gpt-4o-mini eval "result" -c "task"

# pipe from another tool
my-agent analyze code.py | agent-verdict eval -c "Find security bugs"

# quick confidence check (1 LLM call)
agent-verdict check "server crashed due to OOM" -c "Diagnose outage"

# adversarial only
agent-verdict attack "race condition in pool.get()" -c "Find concurrency bugs"

# JSON output
agent-verdict --json eval "result" -c "task" > verdict.json

Exit code 0 = passed, 1 = dropped.

Python

from agent_verdict import verdict
from agent_verdict.llm.anthropic import AnthropicProvider

llm = AnthropicProvider()

@verdict(llm=llm, task_context="Find security bugs in Python code")
def analyze(code: str) -> str:
    return "Found SQL injection in the login handler"

result = analyze(user_code)
result.confidence        # 0.87
result.defended          # True
result.dropped           # False

Use any provider:

from agent_verdict.llm.openai import OpenAIProvider
from agent_verdict.llm.deepseek import DeepSeekProvider
from agent_verdict.llm.kimi import KimiProvider

llm = DeepSeekProvider()                     # uses DEEPSEEK_API_KEY
llm = KimiProvider(model="kimi-k2.5")       # uses MOONSHOT_API_KEY

# any OpenAI-compatible API
llm = OpenAIProvider(
    model="my-model",
    base_url="https://api.example.com/v1",
    api_key_env="MY_API_KEY",
)

Or use the pipeline directly:

from agent_verdict import VerdictPipeline, VerdictConfig

pipeline = VerdictPipeline(llm=llm, config=VerdictConfig(confidence_threshold=0.7))
result = await pipeline.evaluate("race condition in pool", task_context="Find concurrency bugs")

MCP (Claude Code / Cursor)

curl -fsSL https://raw.githubusercontent.com/sharifli4/agent-verdict/main/install.sh | sh -s mcp
claude mcp add agent-verdict -- /path/to/.venv/bin/agent-verdict-mcp

Configure via env vars: VERDICT_PROVIDER, VERDICT_MODEL, VERDICT_BASE_URL, VERDICT_API_KEY_ENV.

Tools: evaluate (customizable via stages param), check_confidence, adversarial_check, self_consistency_check, semantic_similarity_check, entailment_check, logprob_check, cross_verification.

Verdict object

result.confidence          # 0.0-1.0
result.context_relevance   # 0.0-1.0
result.justification       # why the answer makes sense
result.counter_argument    # best attack against the answer
result.defense             # response to that attack
result.defended            # did the defense hold?
result.dropped             # was the answer rejected?
result.drop_reason         # why it was rejected
result.deliberation        # list[JurorPosition] — cross-verification jury
result.usage               # list[StageUsage] — per-stage token/cost breakdown
result.total_tokens        # sum of all tokens across stages
result.total_cost          # sum of all costs in USD

Evaluation algorithms

Algorithm	Technique	Catches
Confidence Scoring	LLM rates confidence and relevance 0.0-1.0	Low-quality or vague answers
Independent Verification	LLM re-derives answer without seeing the original	Answers that don't hold up independently
Adversarial Dialectic	Generate counter-argument, then defend against it	Plausible but flawed answers
Self-Consistency	Wang et al. 2022 — sample N answers, measure agreement	Unstable/unreliable answers
Cosine Similarity	Sentence embeddings (MiniLM)	Off-topic answers
NLI Entailment	DeBERTa-v3 classification	Hallucinated/contradicting answers
Logprob Calibration	Token log-probabilities via `exp(mean_logprob)`	Internally uncertain answers
Cross-Verification	Multi-model jury deliberation with position, counter, and rebuttal	Answers that fool one model but not others

Default pipeline uses the first 3 (4 LLM calls). Stages 5-7 use different models, breaking the "same brain grading itself" problem.

Stages

Stage	LLM calls	Install
`ConfidenceStage`	1	included
`VerificationStage`	1	included
`AdversarialStage`	2	included
`SelfConsistencyStage(n)`	2n	included
`SemanticSimilarityStage`	0	`pip install agent-verdict[embeddings]`
`EntailmentStage`	0	`pip install agent-verdict[nli]`
`LogprobStage`	1	OpenAI only
`CrossVerificationStage`	2 per juror	needs 2+ providers

Custom pipeline:

from agent_verdict import VerdictPipeline, ConfidenceStage, EntailmentStage, AdversarialStage

pipeline = VerdictPipeline(llm=llm, stages=[
    ConfidenceStage(),
    EntailmentStage(),
    AdversarialStage(),
])

Cross-verification (multi-model jury)

Different LLMs deliberate on your agent's answer like a jury. Each juror:

States position — support or challenge, with argument
Steel-mans the other side — strongest counter to their own position
Deliberates — sees all positions, writes rebuttal, casts final vote

All jurors run in parallel. Wall-clock time ≈ one slow LLM call per phase.

from agent_verdict import VerdictPipeline, CrossVerificationStage
from agent_verdict.llm.anthropic import AnthropicProvider
from agent_verdict.llm.openai import OpenAIProvider
from agent_verdict.llm.deepseek import DeepSeekProvider

pipeline = VerdictPipeline(
    llm=AnthropicProvider(),
    stages=[
        CrossVerificationStage(challengers=[
            OpenAIProvider(),
            DeepSeekProvider(),
        ]),
    ],
)

result = await pipeline.evaluate("the agent's answer", task_context="what it should do")

# inspect the jury deliberation
for juror in result.deliberation:
    print(f"{juror.juror}: {juror.vote} → {juror.final_vote} ({juror.confidence:.2f})")
    print(f"  argument: {juror.argument}")
    print(f"  steel-man: {juror.counter_to_self}")
    print(f"  rebuttal: {juror.rebuttal}")

Majority vote decides. If more jurors challenge than support, the result is dropped.

Cost tracking

Every evaluation tracks token usage and estimated cost per stage — no surprises.

result = await pipeline.evaluate("answer", task_context="task")

for stage in result.usage:
    print(f"{stage.stage}: {stage.llm_calls} calls, {stage.total_tokens:,} tokens, ${stage.cost:.4f}")

print(f"Total: {result.total_tokens:,} tokens, ${result.total_cost:.4f}")

CLI output includes cost breakdown automatically:

  PASSED

  confidence:   0.87
  relevance:    0.82
  cost:
    ConfidenceStage              1 calls       423 tokens  $0.0016
    VerificationStage            1 calls       512 tokens  $0.0021
    AdversarialStage             2 calls       891 tokens  $0.0045
                                 total: $0.0082  (1,826 tokens)

JSON output (--json) includes usage array with per-stage breakdown.

Built-in pricing for GPT-4o, Claude Sonnet/Opus/Haiku, DeepSeek, Kimi K2.5, and more. Custom models show token counts (cost = $0 if model not in pricing table).

Configuration

VerdictConfig(
    confidence_threshold=0.5,   # below → dropped
    relevance_threshold=0.4,    # below → dropped
    require_defense=True,       # can't defend → dropped
)

CLI: --confidence-threshold, --relevance-threshold, --no-require-defense.

Extend

# Custom LLM provider — implement one method
class MyProvider(LLMProvider):
    async def complete(self, messages):
        return LLMResponse(content=await my_llm(messages[0].content))

# Or subclass OpenAIProvider for any OpenAI-compatible API
class MyProvider(OpenAIProvider):
    def __init__(self):
        super().__init__(
            model="my-model",
            base_url="https://api.example.com/v1",
            api_key_env="MY_API_KEY",
        )

# Custom stage
class MyStage(Stage):
    async def run(self, verdict, llm, task_context, config):
        return verdict.model_copy(update={"confidence": 0.99})

Development

git clone https://github.com/sharifli4/agent-verdict.git
cd agent-verdict
pip install -e ".[dev]"
pytest tests/ -v    # no API keys needed, uses mock provider

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
demo		demo
src/agent_verdict		src/agent_verdict
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
install.sh		install.sh
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

agent-verdict

Install

Providers

Usage

CLI

Python

MCP (Claude Code / Cursor)

Verdict object

Evaluation algorithms

Stages

Cross-verification (multi-model jury)

Cost tracking

Configuration

Extend

Development

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

agent-verdict

Install

Providers

Usage

CLI

Python

MCP (Claude Code / Cursor)

Verdict object

Evaluation algorithms

Stages

Cross-verification (multi-model jury)

Cost tracking

Configuration

Extend

Development

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages