
Commit 7494cb3

Merge pull request #196 from Video-Reason/dev
Evaluation Kit is only for evaluation.
2 parents: 62cce49 + 752dc52

337 files changed: 3,709 additions & 39,874 deletions


.dockerignore

File mode changed: 100644 → 100755

.gitattributes

File mode changed: 100644 → 100755

.gitignore

File mode changed: 100644 → 100755
Lines changed: 2 additions & 1 deletion
```diff
@@ -42,6 +42,7 @@ data/evaluations/
 # Installer logs
 pip-log.txt
 pip-delete-this-directory.txt
+install_output.txt
 
 # Unit test / coverage reports
 htmlcov/
@@ -219,7 +220,7 @@ deprecated/
 
 # VMEvalKit data folder - ignore large datasets but keep scripts and logging
 data/questions/*
-!data/questions/tests_task/
+!data/questions/test_task/
 data/outputs/
 # Keep: data/s3_sync.py, data/data_logging/
 .vscode/
```
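As a quick sanity check of the updated rules (not part of this commit), `git check-ignore -v` reports which `.gitignore` line matches a path. A minimal sketch via Python's standard library; the probe paths are hypothetical:

```python
import subprocess

# Hypothetical paths to probe against the updated .gitignore.
paths = [
    "install_output.txt",                    # newly ignored by this commit
    "data/questions/chess_task/prompt.txt",  # ignored via data/questions/*
    "data/questions/test_task/prompt.txt",   # kept via the renamed negation
]

# -v prints "<source>:<line>:<pattern>\t<path>" for each ignored path;
# paths that print nothing are not ignored.
result = subprocess.run(
    ["git", "check-ignore", "-v", *paths],
    capture_output=True,
    text=True,
)
print(result.stdout or "no listed path is ignored")
```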

.gitmodules

File mode changed: 100644 → 100755

CONTRIBUTING.md

Lines changed: 0 additions & 6 deletions
This file was deleted.

Dockerfile

File mode changed: 100644 → 100755

LICENSE

File mode changed: 100644 → 100755

README.md

File mode changed: 100644 → 100755
Lines changed: 69 additions & 246 deletions
````diff
@@ -1,286 +1,109 @@
 # VMEvalKit 🎥🧠
 
+**Unified inference and evaluation framework for 29+ video generation models.**
 
-<div align="center">
+## Features
 
+- **🚀 29+ Models**: Unified interface for commercial APIs (Luma, Veo, Sora, Runway) + open-source (LTX-Video, HunyuanVideo, DynamiCrafter, SVD, etc.)
+- **⚖️ Evaluation Pipeline**: Human scoring (Gradio) + automated scoring (GPT-4O, InternVL)
+- **☁️ Cloud Integration**: S3 + HuggingFace Hub support
 
-[![results](https://img.shields.io/badge/Result-A42C2?style=for-the-badge&logo=googledisplayandvideo360&logoColor=white)](https://grow-ai-like-a-child.com/video-reason/)
-[![Paper](https://img.shields.io/badge/Paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](paper/video-models-start-to-solve/Video_Model_Start_to_Solve.pdf)
-[![Hugging Face](https://img.shields.io/badge/hf-fcd022?style=for-the-badge&logo=huggingface&logoColor=white)](https://huggingface.co/VideoReason)
-[![WeChat](https://img.shields.io/badge/WeChat-07C160?style=for-the-badge&logo=wechat&logoColor=white)](https://github.com/hokindeng/VMEvalKit/issues/132)
-[![Homepage](https://img.shields.io/badge/Homepage-07D100?style=for-the-badge&logo=googlehome&logoColor=white)](https://video-reason.github.io/)
+## Data Format
 
+Organize your questions outside VMEvalKit with the following structure:
 
-</div>
-
-
-A framework for evaluating the reasoning abilities of video generation models at scale. It also serves as a data engine that produces reasoning datasets of varying tasks and difficulty levels, with easy customization. We **make it very convenient** to [**add models**](docs/ADDING_MODELS.md), [**add tasks**](docs/ADDING_TASKS.md), [**run inferences**](docs/INFERENCE.md), [**run scoring**](docs/SCORING.md), [**manage datasets**](docs/DATA_MANAGEMENT.md) and [**display results**](https://grow-ai-like-a-child.com/video-reason/). It's **permissively open-source**, and we welcome everyone to [**join**](https://join.slack.com/t/growingailikeachild/shared_invite/zt-309yqd0sl-W8xzOkdBPha1Jh5rnee78A) us and **build in public together**! 🚀
-
-
-<p align="center">
-
-</p>
-
-![VMEvalKit Framework](paper/video-models-start-to-solve/assets/draft_1.jpg)
-
-
-## 🎬 Supported Models
-
-VMEvalKit provides unified access to **40 video generation models** across **11 provider families**:
-
-For commercial APIs, we support Luma Dream Machine, Google Veo, Google Veo 3.1, WaveSpeed WAN 2.1, WaveSpeed WAN 2.2, Runway ML, OpenAI Sora. For open-source models, we support HunyuanVideo, VideoCrafter, DynamiCrafter, Stable Video Diffusion, Morphic, LTX-Video, and so on. See [here](docs/models/README.md) for details.
-
-
-## 📊 Supported Datasets
-
-VMEvalKit provides access to **9 local task generation engines(quickly increasing)** and other external benchmark datasets (HuggingFace) [here](docs/tasks/README.md).
-
-### Local Task Generation Engines
-
-Tasks supported by VMEvalKit:
-
-Chess, Maze, Raven, Rotation, Sudoku, Object Subtraction, Clock, mirror clock. For more details, see [**Task Docs**](docs/tasks/README.md).
-
-### Basic Idea
-
-VMEvalKit aims to provide an infrastructure for reasoning research in video models at scale:
-
-- 🎯 [**Task Creation at Scale**](docs/ADDING_TASKS.md): Create question dataset of many different cognitive tasks programmatically at scale and our framework makes sure the dataset to be well-organized.
-- 🚀 [**Model Inference at Scale**](docs/INFERENCE.md): Easy one-click inference of the entire question dataset across many video models (commercial APIs + open-source) with automatic resume, error handling, and structured output management, and automatically sync the inference results into the dataset.
-- ⚖️ [**Scoring Pipeline**](docs/SCORING.md): Human scoring via web interface and AI scoring via automated MLLM scoring, also automatically sync the scoring results into the dataset.
-- ☁️ [**Dataset Management**](docs/DATA_MANAGEMENT.md): Manage question datasets from task creation, inference results from video models, and scoring results from humans or MLLM pipelines. Provides AWS S3 integration with version tracking and built-in logging for reproducibility.
-
-We have completed running a question dataset of [**chess**](/vmevalkit/tasks/chess_task/CHESS.md), [**maze**](/vmevalkit/tasks/maze_task/MAZE.md), [**Sudoku**](/vmevalkit/tasks/sudoku_task/SUDOKU.md), [**mental rotation**](/vmevalkit/tasks/rotation_task/ROTATION.md), and [**Raven's Matrices**](/vmevalkit/tasks/raven_task/RAVEN.md) on [**latest video models**](https://grow-ai-like-a-child.com/video-reason/). Checkout our raw results videos on this [**website**](https://grow-ai-like-a-child.com/video-reason/). Here are a few examples.
-
-## Installation & Setup
-
-1. **Clone the repository**
-```bash
-git clone https://github.com/hokindeng/VMEvalKit.git
-cd VMEvalKit
 ```
-
-2. **Initialize submodules** - good for optional open-source models and datasets
-```bash
-git submodule update --init --recursive
+questions/
+└── {domain}_task/              # task folder (e.g., chess_task, matching_object_task)
+    ├── {domain}_0000/          # individual question folder
+    │   ├── first_frame.png     # required: input image for video generation
+    │   ├── prompt.txt          # required: text prompt describing the video
+    │   ├── final_frame.png     # optional: expected final frame for evaluation
+    │   └── ground_truth.mp4    # optional: reference video for evaluation
+    ├── {domain}_0001/
+    │   └── ...
+    └── {domain}_0002/
+        └── ...
 ```
 
-3. **Configure environment** - Copy the example environment file and add your API keys
-```bash
-cp env.template .env
+**Example** with domain `chess`:
 ```
-
-4. **Set up Python environment** – Recommended: use a fresh virtual environment
-
-```bash
-python -m venv venv
-source venv/bin/activate
+questions/
+└── chess_task/
+    ├── chess_0000/
+    │   ├── first_frame.png
+    │   ├── prompt.txt
+    │   ├── final_frame.png
+    │   └── ground_truth.mp4
+    ├── chess_0001/
+    │   └── ...
+    └── chess_0002/
+        └── ...
 ```
 
-Alternatively, you can use other tools like [`uv`](https://github.com/astral-sh/uv) for faster install (`uv venv`), or [`conda`](https://docs.conda.io/) if your usecase has cross-language dependencies.
+**Naming Convention:**
+- **Task folder**: `{domain}_task` (e.g., `chess_task`, `matching_object_task`)
+- **Question folders**: `{domain}_{i:04d}` where `i` is zero-padded (e.g., `chess_0000`, `chess_0064`). Padding automatically expands beyond 4 digits when needed—no dataset size limit.
 
-5. **Install dependencies:**
+## Quick Start
 
 ```bash
-pip install -r requirements.txt
-pip install -e .
-```
-
-For open-source video generation and evaluator models, please refer to [**Open Source Models**](./examples/opensource/open_source.md) for detailed installation instructions.
-
-**Model Weights:** All model weights are stored in a centralized `weights/` directory. See [**Weights Structure**](docs/WEIGHTS_STRUCTURE.md) for details on weight management and migration.
+# 1. Install
+git clone https://github.com/Video-Reason/VMEvalKit.git
+cd VMEvalKit
 
-## 🚀 Quick Start - End-to-End Example
+python -m venv venv
+source venv/bin/activate
 
-Here's a complete workflow from creating questions to scoring results:
+pip install -e .
 
-### 1️⃣ Create Questions
-```bash
-# Generate 5 chess and maze questions each
-python examples/create_questions.py --task chess maze --pairs-per-domain 5
+# 2. Setup models
+bash setup/install_model.sh --model svd --validate
 
-# Output: Creates data/questions/ with chess_task/ and maze_task/ folders
-```
+# 3. Organize your questions data (see format above)
+mkdir -p ~/my_research/questions
 
-### 2️⃣ Generate Videos
-```bash
-# Run on specific model (e.g., stable video diffusion)
-python examples/generate_videos.py --model svd --task chess maze
+# 4. Run inference
+python examples/generate_videos.py --questions-dir ~/my_research/questions --output-dir ~/my_research/outputs --model svd
 
-# Output: Creates data/outputs/pilot_experiment/ with generated videos
-# for close source model, need to set key in .env file
+# 5. Run evaluation
+# Create eval_config.json first:
+echo '{"method": "human", "inference_dir": "~/my_research/outputs", "eval_output_dir": "~/my_research/evaluations"}' > eval_config.json
+python examples/score_videos.py --eval-config eval_config.json
 ```
 
-### 3️⃣ Score Results
-```bash
-# open source VLM Automated scoring
-bash script/lmdeploy_server.sh
-
-# Human scoring via web interface
-python examples/score_videos.py human
+## API Keys
 
-# Automated GPT-4O scoring
-python examples/score_videos.py gpt4o
-```
-
-### 4️⃣ View Results
+Set in `.env` file:
 ```bash
-# Launch web dashboard to explore results
-cd web && ./start.sh
-# Open http://localhost:5000 in your browser
+cp env.template .env
+# Edit .env with your API keys:
+# LUMA_API_KEY=...
+# OPENAI_API_KEY=...
+# GEMINI_API_KEY=...
 ```
 
-That's it! You now have:
-- ✅ Custom reasoning questions in `data/questions/`
-- ✅ Generated videos in `data/outputs/`
-- ✅ Scoring results in `data/scorings/`
-- ✅ Interactive dashboard
-
-
-## Tasks
-
-Every VMEvalKit dataset consists of **Task Pairs** - the basic unit for video reasoning scoring:
-
-We have two types of tasks:
-
-### Final image
-
-Each Task Pair consists of three core components:
-- 📸 **Initial state image** (`first_frame.png`): shows the starting point or problem to be solved
-- 🎯 **Final state image** (`final_frame.png`): illustrates the goal state or solution
-- 📝 **Text prompt** (`prompt.txt`): provides natural language instructions for the video model
-
-There is also an accompanying `question_metadata.json` file with rich metadata. Each task pair is organized in its own folder (`data/questions/{domain}_task/{question_id}/`) containing all four files.
-
-![Task Pair Structure](paper/video-models-start-to-solve/assets/question_set.jpg)
-
-### Final text answer
-
-Each Task Pair consists of three core components:
-- 📸 **Initial state image** (`first_frame.png`): shows the starting point or problem to be solved
-- 📝 **Text answer** (`goal.txt`): provides the text answer to the question
-- 📝 **Text prompt** (`prompt.txt`): provides natural language instructions for the video model
-
-With our VMEvalKit, you can easily create tasks with final text answer by simply adding a `goal.txt` file to the task folder, so you could adapt your VQA datasets to video reasoning tasks.
-
-For more details, see [**Task Docs**](docs/tasks/README.md).
-
-## Inference Architecture
-
-See **[Inference Guide](docs/INFERENCE.md)** for details.
-
-## Scoring Pipeline
-
-See **[Scoring Guide](docs/SCORING.md)** for details.
-
-## Dataset Management
-
-See **[Data Management](docs/DATA_MANAGEMENT.md)** for details.
-
-## Display Results
-
-See **[Web Dashboard](docs/WEB_DASHBOARD.md)** for details.
-
-## Add Models or Tasks
-
-You can add new video generation models and reasoning tasks with minimal effort:
-
-**Adding New Models**
-
-Add any video generation model (API-based or open-source) with just a few steps:
+## Adding Models
 
 ```python
-# Example: Adding a new model wrapper
-from vmevalkit.models.base import BaseVideoModel
+# Inherit from ModelWrapper
+from vmevalkit.models.base import ModelWrapper
 
-class MyModelWrapper(BaseVideoModel):
-    def generate_video(self, image_path, text_prompt, **kwargs):
-        # Your model's video generation logic
-        return video_path
+class MyModelWrapper(ModelWrapper):
+    def generate(self, image_path, text_prompt, **kwargs):
+        # Your inference logic
+        return {"success": True, "video_path": "...", ...}
 ```
 
-Then register it in `MODEL_CATALOG.py`:
+Register in `vmevalkit/runner/MODEL_CATALOG.py`:
 ```python
 "my-model": {
-    "provider": "mycompany",
-    "wrapper_path": "vmevalkit.models.my_model.MyModelWrapper",
-    ...
+    "wrapper_module": "vmevalkit.models.my_model_inference",
+    "wrapper_class": "MyModelWrapper",
+    "family": "MyCompany"
 }
 ```
 
-See **[Adding Models Guide](docs/ADDING_MODELS.md)** for details.
-
-**Adding New Tasks**
-
-Create new reasoning tasks programmatically at scale:
-
-```python
-from vmevalkit.tasks.base_task import BaseTask
-
-class MyTask(BaseTask):
-    def generate_task_pair(self, ...):
-        # Generate initial and final states
-        initial_state = self.create_initial_state()
-        final_state = self.create_final_state()
-        prompt = self.create_prompt()
-
-        return {
-            "first_frame": initial_state,
-            "final_frame": final_state,
-            "prompt": prompt,
-            "metadata": {...}
-        }
-```
-
-See **[Adding Tasks Guide](docs/ADDING_TASKS.md)** for details.
-
-## Invitation to Collaborate 🤝
-
-VMEvalKit is meant to be a permissively open-source **shared playground** for everyone. If you’re interested in machine cognition, video models, evaluation, or anything anything 🦄✨, we’d love to build with you:
-
-* 🧪 Add new reasoning tasks (planning, causality, social, physical, etc.)
-* 🎥 Plug in new video models (APIs or open-source)
-* 📊 Experiment with better evaluation metrics and protocols
-* 🧱 Improve infrastructure, logging, and the web dashboard
-* 📚 Use VMEvalKit in your own research and share back configs/scripts
-* 🌟🎉 Or Anything anything 🦄✨
-
-💬 **Join us on Slack** to ask questions, propose ideas, or start a collab:
-[Slack Invite](https://join.slack.com/t/growingailikeachild/shared_invite/zt-309yqd0sl-W8xzOkdBPha1Jh5rnee78A) 🚀
-
-## Documentation
-
-📚 **Core Documentation:**
-- **[Inference Guide](docs/INFERENCE.md)** - Complete guide to running inference, supported models, and architecture
-- **[Scoring Guide](docs/SCORING.md)** - Human and automated scoring methods
-- **[Data Management](docs/DATA_MANAGEMENT.md)** - Dataset organization, S3 sync, and version tracking
-- **[Adding Models](docs/ADDING_MODELS.md)** - How to add new video generation models
-- **[Adding Tasks](docs/ADDING_TASKS.md)** - How to create new reasoning tasks
-- **[Web Dashboard](docs/WEB_DASHBOARD.md)** - Interactive results visualization
-- **[Weights Structure](docs/WEIGHTS_STRUCTURE.md)** - Model weights management and centralized storage
-
-## Research
-
-Here we keep track of papers spinned off from this code infrastructure and some works in progress.
-
-- [**"Video Models Start to Solve Chess, Maze, Sudoku, Mental Rotation, and Raven's Matrices"**](paper/video-models-start-to-solve/Video_Model_Start_to_Solve.pdf)
-
-This paper implements our experimental framework and demonstrates that leading video generation models (Sora-2 etc) can perform visual reasoning tasks with >60% success rates. See [**results**](https://grow-ai-like-a-child.com/video-reason/).
-
 ## License
 
-Apache 2.0
-
-
-## Citation
-
-If you find VMEvalKit useful in your research, please cite:
-
-```bibtex
-@misc{VMEvalKit,
-  author = {VMEvalKit Team},
-  title = {VMEvalKit: A framework for evaluating reasoning abilities in foundational video models},
-  year = {2025},
-  howpublished = {\url{https://github.com/Video-Reason/VMEvalKit}}
-}
-```
+Apache 2.0
````
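The Data Format and Naming Convention sections added above translate directly into a scaffolding step. A minimal sketch using only the standard library; `make_question_dirs` is a hypothetical helper, not part of VMEvalKit, and the placeholder files stand in for real task content:

```python
from pathlib import Path

def make_question_dirs(root: Path, domain: str, n: int) -> None:
    """Scaffold questions/{domain}_task/{domain}_NNNN/ folders."""
    task_dir = root / f"{domain}_task"
    for i in range(n):
        # Zero-padded index; f-string padding widens automatically past 9999.
        q_dir = task_dir / f"{domain}_{i:04d}"
        q_dir.mkdir(parents=True, exist_ok=True)
        # Required files per the new README; contents here are placeholders.
        (q_dir / "prompt.txt").write_text("placeholder prompt\n")
        (q_dir / "first_frame.png").touch()

make_question_dirs(Path.home() / "my_research" / "questions", "chess", 3)
```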
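The Quick Start above writes `eval_config.json` with a shell `echo`. Assuming those three keys are the only required fields (only they appear in this diff), the same file can be written from Python, which also expands the `~` that the `echo` version leaves literal:

```python
import json
from pathlib import Path

# Keys copied from the Quick Start example; "human" selects human scoring.
eval_config = {
    "method": "human",
    "inference_dir": str(Path.home() / "my_research" / "outputs"),
    "eval_output_dir": str(Path.home() / "my_research" / "evaluations"),
}

Path("eval_config.json").write_text(json.dumps(eval_config, indent=2) + "\n")
```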
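Filling in the `ModelWrapper` skeleton from the new README: a sketch of what a complete wrapper might look like, assuming VMEvalKit is installed. The `output_dir` kwarg and the `error` key on failure are assumptions; the diff only confirms the `success` and `video_path` keys.

```python
from pathlib import Path

from vmevalkit.models.base import ModelWrapper  # base class named in this diff


class MyModelWrapper(ModelWrapper):
    def generate(self, image_path, text_prompt, **kwargs):
        try:
            # Assumed output location; the runner's actual contract may differ.
            out_path = Path(kwargs.get("output_dir", ".")) / "my_model_output.mp4"
            # ... run your model or call your API here, writing out_path ...
            return {"success": True, "video_path": str(out_path)}
        except Exception as exc:
            # Assumed failure convention; adjust to whatever the runner expects.
            return {"success": False, "error": str(exc)}
```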
