Hi! We're building vla-evaluation-harness at Allen AI — a unified evaluation framework for VLA models across multiple simulation benchmarks.
The core idea: model servers and benchmarks communicate via WebSocket, each in their own isolated environment. Integrate a model once, and it works across all benchmarks automatically. Currently supports 11+ benchmarks (LIBERO, CALVIN, SimplerEnv, RoboTwin, ManiSkill2, etc.) and growing.
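To make the "integrate once" idea concrete, here is a minimal sketch of what a model-server message handler could look like. The message schema (`reset` / `observation` / `action` types, `image` and `proprio` fields) and the `EchoPolicy` class are hypothetical illustrations, not the harness's actual protocol:

```python
import json

def handle_message(raw: str, policy) -> str:
    """Decode one benchmark->server request, run the policy, encode the reply.

    Illustrative only: the real vla-evaluation-harness wire format may differ.
    """
    msg = json.loads(raw)
    if msg["type"] == "reset":
        # New episode: hand the language instruction to the policy.
        policy.reset(msg.get("instruction", ""))
        return json.dumps({"type": "ready"})
    if msg["type"] == "observation":
        # One control step: observation in, action chunk out.
        action = policy.act(msg["image"], msg["proprio"])
        return json.dumps({"type": "action", "action": action})
    return json.dumps({"type": "error", "detail": f"unknown type {msg['type']!r}"})

class EchoPolicy:
    """Stand-in policy returning a fixed 7-DoF action (hypothetical)."""
    def reset(self, instruction):
        self.instruction = instruction
    def act(self, image, proprio):
        return [0.0] * 7
```

Because every benchmark speaks the same schema to the server, a model implemented against this handler needs no benchmark-specific code.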
We've been using Dexbotic checkpoints, and DB-CogACT is our most thoroughly reproduced model so far:
| Benchmark | Our result | Dexbotic reported | Verdict |
| --- | --- | --- | --- |
| LIBERO (avg) | 95.2% | 94.9% | ✅ Reproduced |
| CALVIN (avg len) | 4.05 | 4.06 | ✅ Reproduced |
| SimplerEnv WidowX (avg) | 72.2% | 69.5% | ✅ Reproduced |
That said, we have plenty of gaps. Many model × benchmark combinations are still untested, and we've learned the hard way that subtle differences in action space conventions, observation preprocessing, and seed handling can make or break reproduction. It's a work in progress.
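As one example of the convention mismatches mentioned above: benchmarks disagree on how the gripper channel of an action vector is encoded, and a silent mismatch tanks success rates. The convention names below are illustrative, not the harness's actual configuration keys:

```python
def normalize_gripper(a: float, convention: str) -> float:
    """Map a gripper command to a canonical [-1 = close, +1 = open] convention.

    Hypothetical sketch: some simulators use [0, 1] ranges, others invert
    the open/close sign, so a per-benchmark adapter like this is needed.
    """
    if convention == "zero_one":   # 0 = close, 1 = open
        return 2.0 * a - 1.0
    if convention == "inverted":   # +1 = close, -1 = open
        return -a
    return a                       # already canonical
```

A model integrated once against the canonical convention then relies on the benchmark adapter, rather than each model re-implementing the mapping.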
Where we think there's mutual benefit:
- For Dexbotic users: Our framework provides dependency-isolated, parallelized evaluation — a full LIBERO run (2000 episodes) takes ~30 min on H100 with sharding. No need to install benchmark dependencies alongside model dependencies.
- For us: Dexbotic has a rich model zoo (DB-OFT, DB-π0, DB-MemVLA, DB-GR00TN1) that we haven't integrated yet. Adding these would significantly expand our cross-benchmark coverage.
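The ~30 min figure for a 2000-episode LIBERO run comes from sharding episodes across parallel workers. A minimal sketch of one way to split the work (the function name and round-robin scheme are illustrative, not the harness's actual implementation):

```python
def shard_episodes(n_episodes: int, n_workers: int, worker_id: int) -> list[int]:
    """Round-robin assignment of episode indices to one worker.

    Every episode lands on exactly one worker, so results can be
    concatenated afterwards without deduplication.
    """
    return list(range(worker_id, n_episodes, n_workers))
```

With 2000 episodes and, say, 8 workers, each worker runs 250 episodes independently against its own copy of the benchmark environment.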
We'd love to collaborate — whether that's cross-checking reproduction results, integrating more Dexbotic models into our framework, or anything else that makes sense. If anyone from the team (or community) is interested in helping verify reproductions or adding model server integrations, we'd be thrilled.