Hi! We're building vla-evaluation-harness at Allen AI — a unified evaluation framework for VLA models across multiple simulation benchmarks.
The core idea: model servers and benchmarks communicate via WebSocket, each in their own isolated environment. Integrate a model once, and it works across all benchmarks automatically. Currently supports 11+ benchmarks (LIBERO, CALVIN, SimplerEnv, RoboTwin, ManiSkill2, etc.) and growing.
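To make the "integrate once" idea concrete, here is a minimal sketch of what a model-server message handler could look like. The message schema (`reset` / `observation` / `action` types, `image` and `proprio` fields) and the `EchoPolicy` class are hypothetical illustrations, not the harness's actual protocol:

```python
import json

def handle_message(raw: str, policy) -> str:
    """Decode one benchmark->server request, run the policy, encode the reply.

    Illustrative only: the real vla-evaluation-harness wire format may differ.
    """
    msg = json.loads(raw)
    if msg["type"] == "reset":
        # New episode: hand the language instruction to the policy.
        policy.reset(msg.get("instruction", ""))
        return json.dumps({"type": "ready"})
    if msg["type"] == "observation":
        # One control step: observation in, action chunk out.
        action = policy.act(msg["image"], msg["proprio"])
        return json.dumps({"type": "action", "action": action})
    return json.dumps({"type": "error", "detail": f"unknown type {msg['type']!r}"})

class EchoPolicy:
    """Stand-in policy returning a fixed 7-DoF action (hypothetical)."""
    def reset(self, instruction):
        self.instruction = instruction
    def act(self, image, proprio):
        return [0.0] * 7
```

Because every benchmark speaks the same schema to the server, a model implemented against this handler needs no benchmark-specific code.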
We've been using Dexbotic checkpoints, and DB-CogACT is our most thoroughly reproduced model so far:
| Benchmark | Our result | Dexbotic reported | Verdict |
| --- | --- | --- | --- |
| LIBERO (avg) | 95.2% | 94.9% | ✅ Reproduced |
| CALVIN (avg len) | 4.05 | 4.06 | ✅ Reproduced |
| SimplerEnv WidowX (avg) | 72.2% | 69.5% | ✅ Reproduced |
That said, we have plenty of gaps. Many model × benchmark combinations are still untested, and we've learned the hard way that subtle differences in action space conventions, observation preprocessing, and seed handling can make or break reproduction. It's a work in progress.
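As one example of the convention mismatches mentioned above: benchmarks disagree on how the gripper channel of an action vector is encoded, and a silent mismatch tanks success rates. The convention names below are illustrative, not the harness's actual configuration keys:

```python
def normalize_gripper(a: float, convention: str) -> float:
    """Map a gripper command to a canonical [-1 = close, +1 = open] convention.

    Hypothetical sketch: some simulators use [0, 1] ranges, others invert
    the open/close sign, so a per-benchmark adapter like this is needed.
    """
    if convention == "zero_one":   # 0 = close, 1 = open
        return 2.0 * a - 1.0
    if convention == "inverted":   # +1 = close, -1 = open
        return -a
    return a                       # already canonical
```

A model integrated once against the canonical convention then relies on the benchmark adapter, rather than each model re-implementing the mapping.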
Where we think there's mutual benefit:
- For Dexbotic users: Our framework provides dependency-isolated, parallelized evaluation — a full LIBERO run (2000 episodes) takes ~30 min on H100 with sharding. No need to install benchmark dependencies alongside model dependencies.
- For us: Dexbotic has a rich model zoo (DB-OFT, DB-π0, DB-MemVLA, DB-GR00TN1) that we haven't integrated yet. Adding these would significantly expand our cross-benchmark coverage.
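The ~30 min figure for a 2000-episode LIBERO run comes from sharding episodes across parallel workers. A minimal sketch of one way to split the work (the function name and round-robin scheme are illustrative, not the harness's actual implementation):

```python
def shard_episodes(n_episodes: int, n_workers: int, worker_id: int) -> list[int]:
    """Round-robin assignment of episode indices to one worker.

    Every episode lands on exactly one worker, so results can be
    concatenated afterwards without deduplication.
    """
    return list(range(worker_id, n_episodes, n_workers))
```

With 2000 episodes and, say, 8 workers, each worker runs 250 episodes independently against its own copy of the benchmark environment.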
We'd love to collaborate — whether that's cross-checking reproduction results, integrating more Dexbotic models into our framework, or anything else that makes sense. If anyone from the team (or community) is interested in helping verify reproductions or adding model server integrations, we'd be thrilled.