
vla-evaluation-harness — cross-benchmark evaluation framework (DB-CogACT reproduced across 3 benchmarks) #74

@MilkClouds


Hi! We're building vla-evaluation-harness at Allen AI — a unified evaluation framework for VLA models across multiple simulation benchmarks.

The core idea: model servers and benchmarks communicate via WebSocket, each running in its own isolated environment. Integrate a model once, and it works across all benchmarks automatically. The harness currently supports 11+ benchmarks (LIBERO, CALVIN, SimplerEnv, RoboTwin, ManiSkill2, etc.), and the list is growing.
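To make the decoupling concrete, here is a minimal sketch of the request/response pattern, written for illustration only. The message fields (`type`, `episode_id`, `step`, `instruction`, `state`, `action`) and every function name are assumptions, not the harness's actual schema or API. The transport is replaced by a plain function call so the example runs with the standard library alone; in a real setup each side would live in its own process and environment.

```python
import json
from typing import Callable, List

# Hypothetical message schema: these field names are NOT taken from
# vla-evaluation-harness; they only illustrate the server/benchmark split.


def encode_observation(episode_id: int, step: int,
                       instruction: str, state: List[float]) -> str:
    """Benchmark side: serialize one observation as a JSON message."""
    return json.dumps({
        "type": "observation",
        "episode_id": episode_id,
        "step": step,
        "instruction": instruction,
        "state": state,
    })


def handle_message(raw: str,
                   policy: Callable[[str, List[float]], List[float]]) -> str:
    """Model-server side: decode an observation, run the policy, reply."""
    msg = json.loads(raw)
    if msg["type"] != "observation":
        raise ValueError(f"unexpected message type: {msg['type']}")
    action = policy(msg["instruction"], msg["state"])
    return json.dumps({
        "type": "action",
        "episode_id": msg["episode_id"],
        "step": msg["step"],
        "action": action,
    })


def dummy_policy(instruction: str, state: List[float]) -> List[float]:
    """Placeholder policy: a zero action with the same length as the state."""
    return [0.0] * len(state)


def run_episode(send: Callable[[str], str], instruction: str,
                num_steps: int) -> List[List[float]]:
    """Benchmark side: step a toy environment, asking the server for actions."""
    state = [0.0, 0.0, 0.0]
    actions = []
    for step in range(num_steps):
        reply = json.loads(send(encode_observation(0, step, instruction, state)))
        actions.append(reply["action"])
        # Toy dynamics: the action is simply added to the state.
        state = [s + a for s, a in zip(state, reply["action"])]
    return actions


# In-process stand-in for the network round trip.
actions = run_episode(lambda raw: handle_message(raw, dummy_policy),
                      "pick up the cup", num_steps=3)
```

Because the benchmark loop only sees `send`, swapping the in-process lambda for a real network client would not change the benchmark code, which is the property that lets one model integration serve every benchmark.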

We've been using Dexbotic checkpoints, and DB-CogACT is actually our most thoroughly reproduced model so far:

| Benchmark | Our result | Dexbotic reported | Verdict |
|---|---|---|---|
| LIBERO (avg) | 95.2% | 94.9% | ✅ Reproduced |
| CALVIN (avg len) | 4.05 | 4.06 | ✅ Reproduced |
| SimplerEnv WidowX (avg) | 72.2% | 69.5% | ✅ Reproduced |
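For anyone cross-checking these numbers, the gaps can be computed directly. This is a small sketch using only the values quoted above; the variable names are illustrative, and no pass/fail tolerance is implied because none is stated here.

```python
# (our result, Dexbotic reported) as quoted in the table above.
# LIBERO and SimplerEnv are success rates in percent; CALVIN is average length.
results = {
    "LIBERO (avg)": (95.2, 94.9),
    "CALVIN (avg len)": (4.05, 4.06),
    "SimplerEnv WidowX (avg)": (72.2, 69.5),
}

# Signed gap: positive means our run scored higher than the reported value.
gaps = {name: round(ours - reported, 2)
        for name, (ours, reported) in results.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f}")
```

The gaps come out to +0.30 points on LIBERO, -0.01 on CALVIN average length, and +2.70 points on SimplerEnv WidowX.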

That said, we have plenty of gaps. Many model × benchmark combinations are still untested, and we've learned the hard way that subtle differences in action space conventions, observation preprocessing, and seed handling can make or break reproduction. It's a work in progress.

Where we think there's mutual benefit:

  • For Dexbotic users: Our framework provides dependency-isolated, parallelized evaluation — a full LIBERO run (2000 episodes) takes ~30 min on H100 with sharding. No need to install benchmark dependencies alongside model dependencies.
  • For us: Dexbotic has a rich model zoo (DB-OFT, DB-π0, DB-MemVLA, DB-GR00TN1) that we haven't integrated yet. Adding these would significantly expand our cross-benchmark coverage.
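The sharding mentioned for the LIBERO run can be pictured as splitting the episode indices into contiguous, near-equal chunks, one per worker. The helper below is a hypothetical sketch of that idea, not the harness's actual scheduler; `shard_episodes` and the shard count of 8 are assumptions chosen for illustration.

```python
from typing import List


def shard_episodes(num_episodes: int, num_shards: int) -> List[range]:
    """Split episode indices 0..num_episodes-1 into contiguous shards.

    Shard sizes differ by at most one; earlier shards take the remainder.
    """
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    base, extra = divmod(num_episodes, num_shards)
    shards = []
    start = 0
    for i in range(num_shards):
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


# 2000 episodes (the LIBERO run size quoted above) over 8 hypothetical workers.
shards = shard_episodes(2000, 8)
sizes = [len(s) for s in shards]  # 250 episodes per shard
```

Each worker would then evaluate only its own range, and the per-shard success counts can be summed afterwards to recover the full-run average.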

We'd love to collaborate — whether that's cross-checking reproduction results, integrating more Dexbotic models into our framework, or anything else that makes sense. If anyone from the team (or community) is interested in helping verify reproductions or adding model server integrations, we'd be thrilled.
