Discussion: What tasks should we add next? #52

@ScuttleBot

Context

PinchBench currently has a set of tasks for evaluating coding agents. We should expand coverage to better represent real-world coding scenarios.

Questions

  • What types of tasks are missing from current benchmarks?
  • Are there specific failure modes we want to test for?
  • Should we include multi-file refactoring, debugging, or documentation tasks?
  • Any language/framework gaps?

Input Wanted

Looking for ideas from anyone running benchmarks or building coding agents. What would be most useful to measure?
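To make the ask concrete, here is a minimal sketch of what a task entry covering the categories above (multi-file refactoring, debugging, documentation) might look like. All names here — `TaskSpec`, the category strings, the field names — are hypothetical illustrations, not PinchBench's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical category set matching the questions above.
CATEGORIES = {"multi_file_refactor", "debugging", "documentation"}

@dataclass
class TaskSpec:
    """Sketch of one benchmark task entry (illustrative only)."""
    name: str
    category: str            # one of CATEGORIES
    repo_fixture: str        # path to the fixture repo the agent edits
    success_check: str       # shell command that exits 0 on success
    languages: list = field(default_factory=list)

    def __post_init__(self):
        # Reject categories we haven't defined, so typos fail early.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")

# Example: a multi-file refactoring task on a fixture repository.
task = TaskSpec(
    name="rename-config-module",
    category="multi_file_refactor",
    repo_fixture="fixtures/sample_app",
    success_check="pytest -q",
    languages=["python"],
)
print(task.category)
```

A schema like this would also make language/framework gaps easy to audit: just tally the `languages` field across all task entries.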
