I'm imagining a qualitative process here rather than an automated harness hooked up to one of the more generic NLP eval metrics. Those are great for broad, LLM-level comparisons, but they seem pretty useless for specialized use cases like ours.