# VMEvalKit 🎥🧠

**Unified inference and evaluation framework for 29+ video generation models.**

## Features

- **🚀 29+ Models**: Unified interface for commercial APIs (Luma, Veo, Sora, Runway) + open-source (LTX-Video, HunyuanVideo, DynamiCrafter, SVD, etc.)
- **⚖️ Evaluation Pipeline**: Human scoring (Gradio) + automated scoring (GPT-4o, InternVL)
- **☁️ Cloud Integration**: S3 + HuggingFace Hub support

## Data Format

Organize your questions outside VMEvalKit using the following directory structure:
```
questions/
└── {domain}_task/              # task folder (e.g., chess_task, matching_object_task)
    ├── {domain}_0000/          # individual question folder
    │   ├── first_frame.png     # required: input image for video generation
    │   ├── prompt.txt          # required: text prompt describing the video
    │   ├── final_frame.png     # optional: expected final frame for evaluation
    │   └── ground_truth.mp4    # optional: reference video for evaluation
    ├── {domain}_0001/
    │   └── ...
    └── {domain}_0002/
        └── ...
```
**Example** with domain `chess`:
```
questions/
└── chess_task/
    ├── chess_0000/
    │   ├── first_frame.png
    │   ├── prompt.txt
    │   ├── final_frame.png
    │   └── ground_truth.mp4
    ├── chess_0001/
    │   └── ...
    └── chess_0002/
        └── ...
```
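Once a folder follows this layout, reading a question back is straightforward. A minimal sketch, assuming only the files listed above (`read_question` is a hypothetical helper, not part of the VMEvalKit API):

```python
# Hypothetical reader for a single question folder; not part of the VMEvalKit API.
from pathlib import Path

def read_question(folder: Path) -> dict:
    """Load the required files of one question folder, plus any optional ones."""
    first_frame = folder / "first_frame.png"   # required
    prompt_file = folder / "prompt.txt"        # required
    if not first_frame.exists() or not prompt_file.exists():
        raise FileNotFoundError(f"{folder} is missing a required file")
    question = {"first_frame": first_frame, "prompt": prompt_file.read_text().strip()}
    for name in ("final_frame.png", "ground_truth.mp4"):  # optional extras
        if (folder / name).exists():
            question[name.rsplit(".", 1)[0]] = folder / name
    return question

# Example: q = read_question(Path("questions/chess_task/chess_0000"))
```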
**Naming Convention:**
- **Task folder**: `{domain}_task` (e.g., `chess_task`, `matching_object_task`)
- **Question folders**: `{domain}_{i:04d}` where `i` is zero-padded (e.g., `chess_0000`, `chess_0064`)

Padding expands beyond 4 digits automatically when needed, so there is no limit on dataset size; the sketch below shows the scheme in code.
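The folder names are plain Python format strings. A minimal sketch, assuming only the convention above (`question_folder` and `make_question_dirs` are hypothetical helpers, not part of the VMEvalKit API):

```python
# Hypothetical helpers illustrating the naming convention; not VMEvalKit API.
from pathlib import Path

def question_folder(domain: str, i: int) -> str:
    """{i:04d} keeps four digits for i < 10000 and widens automatically
    beyond that (chess_9999 -> chess_10000), so there is no size cap."""
    return f"{domain}_{i:04d}"

def make_question_dirs(root: Path, domain: str, n: int) -> list[Path]:
    """Create questions/{domain}_task/{domain}_0000 ... under root."""
    task_dir = root / f"{domain}_task"
    dirs = []
    for i in range(n):
        d = task_dir / question_folder(domain, i)
        d.mkdir(parents=True, exist_ok=True)
        dirs.append(d)
    return dirs

print(question_folder("chess", 64))     # chess_0064
print(question_folder("chess", 12345))  # chess_12345
```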
## Quick Start

```bash
# 1. Install
git clone https://github.com/Video-Reason/VMEvalKit.git
cd VMEvalKit

python -m venv venv
source venv/bin/activate

pip install -e .

# 2. Set up models
bash setup/install_model.sh --model svd --validate

# 3. Organize your question data (see Data Format above)
mkdir -p ~/my_research/questions

# 4. Run inference
python examples/generate_videos.py --questions-dir ~/my_research/questions --output-dir ~/my_research/outputs --model svd

# 5. Run evaluation
# Create eval_config.json first:
echo '{"method": "human", "inference_dir": "~/my_research/outputs", "eval_output_dir": "~/my_research/evaluations"}' > eval_config.json
python examples/score_videos.py --eval-config eval_config.json
```
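The `echo` one-liner above is the quickest way to create the evaluation config. If you prefer building it programmatically, here is a minimal equivalent sketch with the same three fields:

```python
# Write eval_config.json with the same fields as the echo command above.
import json

config = {
    "method": "human",                               # scoring method used in the example
    "inference_dir": "~/my_research/outputs",        # where generate_videos.py wrote videos
    "eval_output_dir": "~/my_research/evaluations",  # where scores will be saved
}

with open("eval_config.json", "w") as f:
    json.dump(config, f, indent=2)
```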
## API Keys

Set your keys in a `.env` file:
```bash
cp env.template .env
# Edit .env with your API keys:
# LUMA_API_KEY=...
# OPENAI_API_KEY=...
# GEMINI_API_KEY=...
```
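In your own scripts, the keys can then be read from the environment. A minimal sketch assuming the `python-dotenv` package (VMEvalKit itself may load `.env` differently):

```python
# Minimal sketch: load .env and read one key. Assumes the python-dotenv
# package is installed; VMEvalKit itself may load .env differently.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
luma_key = os.getenv("LUMA_API_KEY")
if luma_key is None:
    raise RuntimeError("LUMA_API_KEY not set; add it to your .env file")
```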
## Adding Models

Wrap your model's inference in a `ModelWrapper` subclass:
```python
# Inherit from ModelWrapper
from vmevalkit.models.base import ModelWrapper

class MyModelWrapper(ModelWrapper):
    def generate(self, image_path, text_prompt, **kwargs):
        # Your inference logic
        return {"success": True, "video_path": "..."}  # ...plus any extra fields
```

Register in `vmevalkit/runner/MODEL_CATALOG.py`:
```python
"my-model": {
    "wrapper_module": "vmevalkit.models.my_model_inference",
    "wrapper_class": "MyModelWrapper",
    "family": "MyCompany"
}
```
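For intuition, `wrapper_module` and `wrapper_class` are enough to locate your wrapper dynamically. A minimal resolution sketch, assuming a plain-dict catalog shaped like the entry above (the actual runner internals may differ):

```python
# Sketch: resolve a catalog entry to a wrapper class via importlib.
# Assumes MODEL_CATALOG is a plain dict; vmevalkit's runner may differ.
import importlib

MODEL_CATALOG = {
    "my-model": {
        "wrapper_module": "vmevalkit.models.my_model_inference",
        "wrapper_class": "MyModelWrapper",
        "family": "MyCompany",
    }
}

def load_wrapper(model_name: str):
    entry = MODEL_CATALOG[model_name]
    module = importlib.import_module(entry["wrapper_module"])
    return getattr(module, entry["wrapper_class"])

# wrapper_cls = load_wrapper("my-model")
# result = wrapper_cls().generate("first_frame.png", "move the white rook to h8")
```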
## License

Apache 2.0