Can we provide a skill.md/guideline for automatic eval dataset generation?
Let me propose one idea:
- Let an LLM generate inputs for the task
- Let an LLM generate a rubric for the task
- Run the agent with Opus 4.6 (or whatever the best available model is) to obtain outputs
- Iterate once on the inputs and rubric based on the outputs from Opus 4.6 (e.g., the inputs might turn out to be too simple); rough sketches of the steps follow below
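
To make steps 1-2 concrete, here's a minimal sketch assuming the `anthropic` Python SDK. The model id string, prompt wording, and `task_description` argument are all placeholders I made up for illustration, not a fixed spec:

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-6"       # assumed model id; substitute whatever is current

def generate_inputs(task_description: str, n: int = 20) -> list[str]:
    """Step 1: have the LLM propose candidate task inputs."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task_description}\n\n"
                f"Generate {n} diverse, realistic inputs for this task. "
                "Return them as a JSON array of strings, nothing else."
            ),
        }],
    )
    # Assumes the model complies and returns bare JSON; a robust version
    # would validate and retry on parse failure.
    return json.loads(response.content[0].text)

def generate_rubric(task_description: str) -> str:
    """Step 2: have the LLM draft a grading rubric for the same task."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task_description}\n\n"
                "Write a grading rubric for evaluating an agent's output on "
                "this task. List concrete, checkable criteria."
            ),
        }],
    )
    return response.content[0].text
```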
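
And a rough sketch of steps 3-4, reusing `client` and `MODEL` from above. `run_agent` is a hypothetical callable wrapping the agent under test, and the revision prompt is illustrative, not a recipe:

```python
def collect_outputs(inputs: list[str], run_agent) -> list[dict]:
    """Step 3: run the agent on every generated input and keep transcripts."""
    return [{"input": x, "output": run_agent(x)} for x in inputs]

def revise_once(task_description: str, rubric: str, transcripts: list[dict]) -> dict:
    """Step 4: a single revision pass over inputs and rubric based on the outputs."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": (
                f"Task: {task_description}\n\nRubric:\n{rubric}\n\n"
                f"Transcripts:\n{json.dumps(transcripts, indent=2)}\n\n"
                "Some inputs may be too easy and some rubric criteria may not "
                "discriminate between these outputs. Return revised inputs and "
                'a revised rubric as JSON: {"inputs": [...], "rubric": "..."}.'
            ),
        }],
    )
    return json.loads(response.content[0].text)  # same bare-JSON assumption as above
```

Deliberately a single revision pass, per the proposal; looping it more times is easy but probably has diminishing returns.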