Add per-input trial_count support to Eval() #1342
## Why?
Braintrust's `Eval()` function supports a `trialCount` parameter that runs each input multiple times to measure variance in non-deterministic LLM outputs. However, this setting applies globally to all inputs, which creates some (if minimal) friction in certain evaluation workflows. For example:

- **Targeted Debugging is Expensive:** When investigating a single flaky test case, you want to run it 10-20 times to understand the variance pattern. With a global `trialCount`, this means running your entire suite 10-20 times, multiplying costs and wait time unnecessarily.
- **Mixed Determinism is Common:** Real evaluation suites contain a mix of deterministic scenarios (math problems, factual lookups) and non-deterministic ones (creative writing, open-ended reasoning). Forcing the same trial count on both wastes resources.
- **Cost Scales Linearly:** Every additional trial means another LLM API call. A global `trialCount: 5` on a 100-item dataset means 500 API calls, even if only 10 items actually need variance analysis. With per-input control, the 90 deterministic items run once and the 10 flaky ones run 5 times, for 90 + 50 = 140 calls.

To address this, we've built a custom solution that I want to propose as a contribution. Specifically, it allows each data item to specify its own `trial_count`, overriding the global default. This gives users fine-grained control over where to invest their evaluation budget.
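To make that concrete, here is a minimal sketch of what the proposed usage could look like in the Python SDK. The per-item `trial_count` key is the new behavior proposed here; the project name, task, and scorer are placeholders:

```python
from braintrust import Eval


def answer(input: str) -> str:
    # Placeholder task: swap in a real model call.
    return f"echo: {input}"


def exact_match(input, output, expected):
    # Placeholder scorer: 1 if the output matches the expectation exactly.
    return 1 if output == expected else 0


Eval(
    "per-input-trials-demo",  # hypothetical project name
    data=lambda: [
        # Deterministic item: the global default of 1 trial is enough.
        {"input": "What is 2 + 2?", "expected": "4"},
        # Flaky item: the proposed per-item `trial_count` overrides the
        # global default, so only this input runs 15 times.
        {"input": "Write a haiku about flaky tests", "trial_count": 15},
    ],
    task=answer,
    scores=[exact_match],
    trial_count=1,  # existing global default
)
```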
## What?

There is a companion JS PR up to match it here: #1341