Add per-input trial_count support to Eval() #1342
## Why?
Braintrust's `Eval()` function supports a `trialCount` parameter that runs each input multiple times to measure variance in non-deterministic LLM outputs. However, this setting applies globally to all inputs, which creates some (if minimal) friction in certain evaluation workflows. For example:

- **Targeted Debugging is Expensive:** When investigating a single flaky test case, you want to run it 10-20 times to understand the variance pattern. With a global `trialCount`, this means running your entire suite 10-20 times, multiplying costs and wait time unnecessarily.
- **Mixed Determinism is Common:** Real evaluation suites contain a mix of deterministic scenarios (math problems, factual lookups) and non-deterministic ones (creative writing, open-ended reasoning). Forcing the same trial count on both wastes resources.
- **Cost Scales Linearly:** Every additional trial means another LLM API call. A global `trialCount: 5` on a 100-item dataset means 500 API calls, even if only 10 items actually need variance analysis. With per-input control, the 90 deterministic items run once and the 10 flaky ones run 5 times, for 90 + 50 = 140 calls.

To address this, we've built a custom solution that I want to propose as a contribution. Specifically, it allows each data item to specify its own `trial_count`, overriding the global default. This gives users fine-grained control over where to invest their evaluation budget.
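To make that concrete, here is a minimal sketch of what the proposed usage could look like in the Python SDK. The per-item `trial_count` key is the new behavior proposed here; the project name, task, and scorer are placeholders:

```python
from braintrust import Eval


def answer(input: str) -> str:
    # Placeholder task: swap in a real model call.
    return f"echo: {input}"


def exact_match(input, output, expected):
    # Placeholder scorer: 1 if the output matches the expectation exactly.
    return 1 if output == expected else 0


Eval(
    "per-input-trials-demo",  # hypothetical project name
    data=lambda: [
        # Deterministic item: the global default of 1 trial is enough.
        {"input": "What is 2 + 2?", "expected": "4"},
        # Flaky item: the proposed per-item `trial_count` overrides the
        # global default, so only this input runs 15 times.
        {"input": "Write a haiku about flaky tests", "trial_count": 15},
    ],
    task=answer,
    scores=[exact_match],
    trial_count=1,  # existing global default
)
```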
## What?

There is a companion JS PR up to match it here: #1341