Quick braindump: I just worked on the evaluation strategy for my Meedan/Rockefeller thing.

Basically we crafted 20 test cases and then an editor reviewed them and gave gold standard / human ground truth for all 20.

At first I just used a spreadsheet and did it manually (a professional editor made the classification judgments).

(Screenshot shows: the test of a classifier which responds yes or no, and is sometimes incorrect, with interesting caveats about the surprising behavior of the LLM. The classification is "is this article an example of solutions journalism" which is itself a specific evaluation standard.)
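For reference, the spreadsheet comparison boils down to something like the sketch below. The `Case` type, the case ids, and the labels are all made up for illustration; they are not the real 20 cases or the editor's actual judgments.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    gold: bool       # editor's judgment: is this solutions journalism?
    predicted: bool  # the classifier's yes/no answer


def score(cases):
    """Return (accuracy, ids of cases where classifier and editor disagree)."""
    misses = [c.case_id for c in cases if c.gold != c.predicted]
    accuracy = (len(cases) - len(misses)) / len(cases) if cases else 0.0
    return accuracy, misses


# Hypothetical data, for illustration only.
example = [
    Case("a1", gold=True, predicted=True),
    Case("a2", gold=False, predicted=True),
    Case("a3", gold=False, predicted=False),
    Case("a4", gold=True, predicted=True),
]
accuracy, misses = score(example)
print(f"accuracy={accuracy:.2f} misses={misses}")
```

The list of misses is probably more useful than the accuracy number, since those are the cases worth reading by hand.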

Then I wanted to automate the evaluation. I was able to set up promptable and run an executable process that could be part of a build pipeline. It is still "just a spreadsheet", though, and all the evaluation is still human. I was basically editing JSON instead of Sheets.
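A rough sketch of what "JSON instead of Sheets" could look like as a pipeline step. This does not use promptable's actual API; the JSON schema, the `classify` stub, and the `min_accuracy` threshold are all assumptions for illustration.

```python
import json

# Hypothetical test-case file contents; the real schema is not shown in this issue.
CASES_JSON = """
[
  {"id": "a1", "article": "City pilots a triage app and reports evidence of shorter ER waits.", "gold": "yes"},
  {"id": "a2", "article": "Council meeting runs long amid budget dispute.", "gold": "no"}
]
"""


def classify(article: str) -> str:
    """Placeholder for the real model call; returns 'yes' or 'no'."""
    return "yes" if "evidence" in article else "no"


def run(cases, classifier, min_accuracy=1.0):
    """Return (exit_code, failing ids); exit_code is 0 when accuracy meets the threshold."""
    if not cases:
        return 1, []
    failures = [
        c["id"] for c in cases
        if classifier(c["article"]).strip().lower() != c["gold"]
    ]
    accuracy = 1 - len(failures) / len(cases)
    return (0 if accuracy >= min_accuracy else 1), failures


def main() -> int:
    exit_code, failures = run(json.loads(CASES_JSON), classify)
    print("failures:", failures)
    return exit_code

# In a build pipeline this would end with: raise SystemExit(main())
```

The gold labels in the JSON still come from a human editor; the script only automates the comparison.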

Takeaway: We should definitely figure this out and do semi-automated evaluation, but just doing any regular, subjective check for overall quality, tone, etc. seems way more important at this step than having it be completely automated or even having a very large test suite. The interesting stuff is how the model might evade the question, or give too much detail, etc. It helps enumerate unexpected behaviors.

Defending a single overall metric does not realistically make the model better or prevent regressions unless you have specifically crafted a reliable test for that case.

I do think it is possible to design test cases that are more reliable, but it's going to be a flaky type of test that can't really block the build.

I think ideally we would have 100+ tests that would run in CI and have reasonably deterministic output (less than 5% flakiness in the test). Each test would give a specific input context and then ask a question, and the natural language response would need to contain a certain character string (matched by regex), a correctly ranked search result, or some other reliable indicator that it got the query correct (?). If we are trying to test nondeterministic natural language, I hope we don't have to resort to using another LLM call to interpret the result, because then we need a test for THAT model. Slippery stuff.
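One way the regex idea and the 5% flakiness budget might fit together, as a minimal sketch. The `fake_model` helper stands in for a real nondeterministic model call, and the canned responses and the `^\s*yes\b` pattern are invented for illustration.

```python
import itertools
import re

MAX_FLAKINESS = 0.05  # the "less than 5%" budget from above


def flakiness(ask, pattern, runs=20):
    """Fraction of runs whose response does NOT match the regex."""
    regex = re.compile(pattern, re.IGNORECASE)
    misses = sum(1 for _ in range(runs) if regex.search(ask()) is None)
    return misses / runs


def fake_model(responses):
    """Hypothetical stand-in for a model call: cycles through canned answers."""
    it = itertools.cycle(responses)
    return lambda: next(it)


# Hypothetical: one evasive answer in every 25 responses.
canned = ["Yes, this qualifies."] * 24 + ["It depends on how you define it."]
ask = fake_model(canned)

rate = flakiness(ask, r"^\s*yes\b", runs=100)
passed = rate < MAX_FLAKINESS
print(f"flakiness={rate:.2%} passed={passed}")
```

Running the same prompt many times and reporting a rate avoids a second LLM call, but it multiplies the cost per test, so it probably only makes sense for a handful of high-value cases.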

Evaluation strategy #15
