Context
PinchBench currently has a set of tasks for evaluating coding agents. We should expand coverage to better represent real-world coding scenarios.
Questions
- What types of tasks are missing from current benchmarks?
- Are there specific failure modes we want to test for?
- Should we include multi-file refactoring, debugging, or documentation tasks? (A rough task sketch follows this list.)
- Any language/framework gaps?
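
To make the discussion concrete, here is a minimal sketch of how a multi-file refactoring task could be described. This does not reflect PinchBench's actual task format; the field names (`task_id`, `repo`, `verify_cmd`, etc.) and the example repository URL are purely illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class BenchTask:
    """Hypothetical task record; PinchBench's real schema may differ."""
    task_id: str
    category: str                  # e.g. "refactor", "debug", "docs"
    repo: str                      # repository the agent starts from
    prompt: str                    # instruction given to the agent
    files_in_scope: list[str] = field(default_factory=list)
    verify_cmd: str = "pytest -q"  # command whose exit code scores the attempt


# Example: a multi-file refactoring task graded by the project's test suite.
example = BenchTask(
    task_id="refactor-001",
    category="refactor",
    repo="https://example.com/sample-project.git",
    prompt=(
        "Extract the duplicated retry logic in client.py and worker.py "
        "into a shared helper module without changing behavior."
    ),
    files_in_scope=["client.py", "worker.py"],
)

if __name__ == "__main__":
    print(example)
```

The point of the sketch is the shape of the question: what categories, scoping information, and verification signals would a new task need to carry for the failure modes we care about?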
Input Wanted
Looking for ideas from anyone running benchmarks or building coding agents. What would be most useful to measure?