Create Custom Handoff Benchmark Suite #13

@wonderwomancode

Overview

Create a custom benchmark suite using AstaBench/InspectAI to measure agent handoff quality and coordination.

Benchmark Definitions

1. Handoff Latency Benchmark

Measure time from task delegation to completion:

  • Senku assigns task to Yusuke
  • Clock starts at NATS message send
  • Clock stops at completion confirmation
  • Track P50, P95, P99 latencies
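The percentile tracking above could be sketched as follows. This is a minimal illustration that assumes latencies have already been collected as a list of seconds (send-to-confirmation); the function names `percentile` and `summarize_latencies` are hypothetical, and the nearest-rank method is one choice among several percentile definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latency samples."""
    if not samples:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_latencies(samples):
    """Return the P50, P95 and P99 latencies for one benchmark run."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```

With few samples, P95 and P99 collapse onto the slowest handoffs, so a run probably needs at least a hundred handoffs before the tail percentiles say much.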

2. Context Preservation Benchmark

Test if context survives handoffs:

  • Give Senku a complex requirement
  • Senku delegates to Atlas
  • Atlas implements
  • Score: the number of original requirements preserved in Atlas's implementation
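One way to make the score concrete is sketched below. It assumes each requirement is given a stable ID (for example `R1`, `R2`) and that some judge, human or automated, reports which IDs the implementation satisfies; neither the ID scheme nor the `preservation_score` helper comes from this issue.

```python
def preservation_score(required_ids, preserved_ids):
    """Return (count kept, fraction kept, sorted IDs that were lost)."""
    required = set(required_ids)
    if not required:
        raise ValueError("the fixture must define at least one requirement")
    # Only IDs that were actually required count; extras from the judge are ignored.
    kept = required & set(preserved_ids)
    lost = sorted(required - kept)
    return len(kept), len(kept) / len(required), lost
```

Returning the lost IDs alongside the score makes it easier to see which kinds of requirements tend to drop out during a handoff.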

3. Parallel Efficiency Benchmark

Measure speedup from parallelization:

  • Give the same 3-task workload to:
    • Single agent (sequential)
    • 3 agents (parallel)
  • Calculate parallelization efficiency ratio
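The ratio is not defined in this issue. A common convention, assumed here, is speedup = sequential time / parallel time and efficiency = speedup / number of agents. The sketch below uses that convention with a hypothetical `parallel_efficiency` helper and wall-clock times in seconds.

```python
def parallel_efficiency(sequential_seconds, parallel_seconds, num_agents=3):
    """Return (speedup, efficiency) under the assumed standard definitions."""
    if sequential_seconds <= 0 or parallel_seconds <= 0:
        raise ValueError("wall-clock times must be positive")
    if num_agents < 1:
        raise ValueError("num_agents must be at least 1")
    speedup = sequential_seconds / parallel_seconds
    # Efficiency of 1.0 means perfect linear scaling across the agents.
    efficiency = speedup / num_agents
    return speedup, efficiency
```

For example, a workload that takes 90 s sequentially and 40 s across 3 agents gives a speedup of 2.25 and an efficiency of 0.75.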

4. Error Recovery Benchmark

Test graceful degradation:

  • Inject failure into one agent
  • Measure: Does the failed agent's work get reassigned?
  • Measure: Is context lost during reassignment?
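The two measurements above could be aggregated across repeated failure-injection trials roughly as follows. The `RecoveryTrial` record and its fields are hypothetical, and the sketch assumes each trial has already been labeled with whether the work was reassigned and whether context was lost.

```python
from dataclasses import dataclass


@dataclass
class RecoveryTrial:
    reassigned: bool    # did another agent pick up the failed agent's work?
    context_lost: bool  # was any required context missing after recovery?


def summarize_recovery(trials):
    """Return the reassignment rate and the context-loss rate among reassigned trials."""
    if not trials:
        raise ValueError("at least one trial is required")
    reassigned = [t for t in trials if t.reassigned]
    reassignment_rate = len(reassigned) / len(trials)
    # Context loss is only meaningful when the work was actually picked up again.
    if reassigned:
        context_loss_rate = sum(t.context_lost for t in reassigned) / len(reassigned)
    else:
        context_loss_rate = None
    return {"reassignment_rate": reassignment_rate, "context_loss_rate": context_loss_rate}
```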

5. Cost Efficiency Benchmark

Track token usage against task complexity:

  • Define task complexity score (1-10)
  • Track tokens used per task
  • Calculate tokens/complexity ratio
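A minimal sketch of the ratio, assuming the complexity score is assigned by hand per fixture and token counts come from whatever usage reporting the model provider exposes. The helper name `tokens_per_complexity` is hypothetical.

```python
def tokens_per_complexity(tokens_used, complexity):
    """Tokens spent per point of task complexity (complexity on the 1-10 scale)."""
    if not 1 <= complexity <= 10:
        raise ValueError("complexity must be between 1 and 10")
    if tokens_used < 0:
        raise ValueError("tokens_used must be non-negative")
    return tokens_used / complexity
```

A task scored 4 that consumes 12,000 tokens yields 3,000 tokens per complexity point; lower is cheaper for the same difficulty.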

Acceptance Criteria

  • Define JSON schema for each benchmark
  • Create test fixtures (sample tasks)
  • Write evaluation scripts
  • Create baseline measurements
  • Set up automated benchmark runs
  • Build reporting dashboard
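As a starting point for the first criterion, a result schema for the handoff latency benchmark might look like the sketch below. Every field name is an assumption made for illustration; only the JSON Schema keywords themselves (`type`, `properties`, `required`, `const`, `minimum`) are standard. The `missing_required` helper is a deliberately tiny stand-in for a real validator.

```python
import json

# Hypothetical schema for one handoff latency measurement.
HANDOFF_LATENCY_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "HandoffLatencyResult",
    "type": "object",
    "properties": {
        "benchmark": {"const": "handoff_latency"},
        "sender": {"type": "string"},
        "receiver": {"type": "string"},
        "latency_ms": {"type": "number", "minimum": 0},
    },
    "required": ["benchmark", "sender", "receiver", "latency_ms"],
}


def missing_required(record, schema):
    """List required keys absent from a result record (not full validation)."""
    return [key for key in schema["required"] if key not in record]


if __name__ == "__main__":
    # Print the schema as JSON so it can be saved next to the fixtures.
    print(json.dumps(HANDOFF_LATENCY_SCHEMA, indent=2))
```

The other four benchmarks could follow the same shape, each with its own `const` value for `benchmark` and its own metric fields.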

Assigned Agents

@quinn - Build the test suite
@senku - Review benchmark design
