Overview
Create a custom benchmark suite using AstaBench/InspectAI to measure agent handoff quality and coordination.
Benchmark Definitions
1. Handoff Latency Benchmark
Measure time from task delegation to completion:
- Senku assigns task to Yusuke
- Clock starts at NATS message send
- Clock stops at completion confirmation
- Track P50, P95, P99 latencies
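A minimal sketch of the percentile summary, assuming each handoff's latency (completion confirmation time minus NATS send time) has already been collected as a float in seconds. The function names and the nearest-rank method are illustrative choices, not something this issue prescribes.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize_latencies(latencies_s):
    """Return the P50, P95 and P99 latencies for a list of handoff durations in seconds."""
    return {f"p{p}": percentile(latencies_s, p) for p in (50, 95, 99)}

# Hypothetical latencies from five Senku -> Yusuke handoffs.
print(summarize_latencies([0.8, 1.2, 0.9, 3.0, 1.0]))
# {'p50': 1.0, 'p95': 3.0, 'p99': 3.0}
```

With only a handful of samples, P95 and P99 collapse to the maximum, so the baseline runs probably need many more handoffs before the tail percentiles mean much.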
2. Context Preservation Benchmark
Test if context survives handoffs:
- Give Senku a complex requirement
- Senku delegates to Atlas
- Atlas implements
- Score: What fraction of the original requirements were preserved in Atlas's implementation?
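One way to turn that score into a number is sketched below. It assumes the complex requirement has been broken into a checklist of requirement IDs, and that a grader (human or model) has already decided which of them show up in the implementation. The function name and the IDs are hypothetical.

```python
def preservation_score(original_requirements, preserved_requirements):
    """Fraction of the original requirements that survived the handoff (0.0 to 1.0)."""
    original = set(original_requirements)
    if not original:
        raise ValueError("the fixture must define at least one requirement")
    # Only count requirements that were part of the original task.
    kept = original & set(preserved_requirements)
    return len(kept) / len(original)

# Hypothetical fixture: four requirements given to Senku, three found in Atlas's work.
given = ["auth", "rate-limit", "logging", "retries"]
found = ["auth", "logging", "retries"]
print(preservation_score(given, found))  # 0.75
```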
3. Parallel Efficiency Benchmark
Measure speedup from parallelization:
- Give the same 3-task workload to:
  - A single agent (sequential)
  - 3 agents (parallel)
- Calculate the parallelization efficiency ratio: speedup (sequential time / parallel time) divided by the number of agents
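The ratio can be sketched as follows, assuming the usual definition of parallel efficiency (speedup divided by worker count) and wall-clock times measured in seconds. The function name and the sample timings are illustrative.

```python
def parallel_efficiency(sequential_s, parallel_s, num_agents):
    """Return (speedup, efficiency) for a workload run sequentially and in parallel."""
    if sequential_s <= 0 or parallel_s <= 0 or num_agents <= 0:
        raise ValueError("times and agent count must be positive")
    speedup = sequential_s / parallel_s
    # An efficiency of 1.0 means perfect linear scaling across the agents.
    efficiency = speedup / num_agents
    return speedup, efficiency

# Hypothetical timings: 90 s with one agent, 40 s with three agents.
print(parallel_efficiency(90.0, 40.0, 3))  # (2.25, 0.75)
```

An efficiency well below 1.0 would suggest that coordination overhead (handoffs, waiting on shared context) is eating into the gain from running agents side by side.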
4. Error Recovery Benchmark
Test graceful degradation:
- Inject failure into one agent
- Measure: Does work get reassigned?
- Measure: Is context lost?
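Both questions can be recorded per trial and aggregated, as in the sketch below. It assumes each failure-injection trial yields two booleans; how reassignment and context loss are actually detected is left open here, and the `RecoveryTrial` type is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RecoveryTrial:
    reassigned: bool         # did another agent pick up the failed agent's work?
    context_preserved: bool  # did the replacement receive the full original context?

def summarize_recovery(trials):
    """Aggregate reassignment rate and context-loss rate over failure-injection trials."""
    if not trials:
        raise ValueError("no trials recorded")
    reassigned = [t for t in trials if t.reassigned]
    lost = [t for t in reassigned if not t.context_preserved]
    return {
        "reassignment_rate": len(reassigned) / len(trials),
        # Context loss is only meaningful for work that was actually reassigned.
        "context_loss_rate": len(lost) / len(reassigned) if reassigned else None,
    }

trials = [
    RecoveryTrial(True, True),
    RecoveryTrial(True, False),
    RecoveryTrial(False, False),
    RecoveryTrial(True, True),
]
print(summarize_recovery(trials))
```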
5. Cost Efficiency Benchmark
Track tokens vs complexity:
- Define task complexity score (1-10)
- Track tokens used per task
- Calculate tokens/complexity ratio
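A minimal sketch of the ratio, assuming the complexity score is an integer assigned to each fixture in advance and the token count comes from whatever usage data the agent runtime reports. The function name and the figures are illustrative.

```python
def tokens_per_complexity(tokens_used, complexity):
    """Tokens spent per point of task complexity (complexity is scored 1-10)."""
    if not 1 <= complexity <= 10:
        raise ValueError("complexity score must be between 1 and 10")
    if tokens_used < 0:
        raise ValueError("token count cannot be negative")
    return tokens_used / complexity

# Hypothetical task: complexity 6, completed with 12,000 tokens.
print(tokens_per_complexity(12_000, 6))  # 2000.0
```

Lower values mean cheaper work per unit of difficulty, but the ratio is only comparable across tasks if complexity scores are assigned consistently.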
Acceptance Criteria
- Define JSON schema for each benchmark
- Create test fixtures (sample tasks)
- Write evaluation scripts
- Create baseline measurements
- Set up automated benchmark runs
- Build reporting dashboard
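As a starting point for the first criterion, a result schema for the handoff latency benchmark might look like the sketch below. Every field name is an assumption made for illustration, and the helper is a shallow required-key check rather than full JSON Schema validation.

```python
import json

# Hypothetical schema for one handoff latency result; all field names are illustrative.
HANDOFF_LATENCY_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "title": "HandoffLatencyResult",
    "type": "object",
    "required": ["benchmark", "delegator", "assignee", "latency_s"],
    "properties": {
        "benchmark": {"const": "handoff_latency"},
        "delegator": {"type": "string"},
        "assignee": {"type": "string"},
        "latency_s": {"type": "number", "minimum": 0},
    },
}

def missing_required(record, schema):
    """Return the required keys absent from record (shallow check only)."""
    return [key for key in schema["required"] if key not in record]

record = {"benchmark": "handoff_latency", "delegator": "senku", "assignee": "yusuke"}
print(missing_required(record, HANDOFF_LATENCY_SCHEMA))  # ['latency_s']

# The schema serializes to plain JSON for storage alongside the fixtures.
schema_json = json.dumps(HANDOFF_LATENCY_SCHEMA, indent=2)
```

The other four benchmarks could follow the same shape, each with its own `benchmark` constant and metric fields.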
Assigned Agent
@quinn - Build the test suite
@senku - Review benchmark design