pip install -e .
## Quick Start

```python
from vmevalkit.runner.inference import InferenceRunner

# Initialize runner with structured output
runner = InferenceRunner(output_dir="output")

# Generate video solution
result = runner.run(
    model_name="luma-ray-2",
    image_path="data/questions/maze_task/maze_0000/first_frame.png",
    text_prompt="Navigate the green dot through the maze corridors to reach the red flag"
)

print(f"Video saved to: {result['inference_dir']}")
# Each inference creates a self-contained folder with:
# - video/:        Generated video file
# - question/:     Input images and prompt
# - metadata.json: Complete inference metadata
```
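The call above runs one model on one task pair. To compare several models on the same question, the same `runner.run` call can simply be looped; a minimal sketch, assuming the `run` signature shown above (the second model name is a placeholder, not a confirmed identifier):

```python
# Sketch: compare several video models on one task pair.
# "hunyuan-i2v" is a placeholder name used only for illustration.
MODELS = ["luma-ray-2", "hunyuan-i2v"]

def run_models(runner, models, image_path, text_prompt):
    """Run each model on the same question and collect results by model name."""
    results = {}
    for name in models:
        results[name] = runner.run(
            model_name=name,
            image_path=image_path,
            text_prompt=text_prompt,
        )
    return results
```

Because each call writes its own self-contained output folder, the per-model results stay cleanly separated on disk.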

## Supported Models
All models support **image + text → video** for reasoning evaluation.

### Task Pair: The Fundamental Unit
Every VMEvalKit dataset consists of **Task Pairs** - the basic unit for video reasoning evaluation:

- 📸 **Initial state image** (`first_frame.png` - the reasoning problem)
- 🎯 **Final state image** (`final_frame.png` - the solution/goal state)
- 📝 **Text prompt** (`prompt.txt` - instructions for video model)
- 📊 **Rich metadata** (`question_metadata.json` - difficulty, task-specific parameters, etc.)

Each task pair is organized in its own folder (`data/questions/{domain}_task/{question_id}/`) containing all four files. Models must generate videos showing the reasoning process from initial → final state.

## Tasks

```
VMEvalKit/
├── vmevalkit/
│   ├── core/                       # Evaluation framework
│   ├── tasks/                      # Task definitions
│   └── utils/                      # Utilities
├── data/
│   └── questions/                  # Dataset with per-question folders
│       ├── vmeval_dataset.json     # Master dataset manifest
│       ├── chess_task/             # Chess reasoning questions
│       │   └── chess_0000/         # Individual question folder
│       │       ├── first_frame.png
│       │       ├── final_frame.png
│       │       ├── prompt.txt
│       │       └── question_metadata.json
│       ├── maze_task/              # Maze navigation questions
│       ├── raven_task/             # Pattern completion questions
│       └── rotation_task/          # 3D rotation questions
├── output/                         # Structured inference outputs
│   └── <inference_id>/             # Self-contained folders per inference
│       ├── video/                  # Generated video file
│       ├── question/               # Input images and prompt
│       └── metadata.json           # Complete inference metadata
├── examples/                       # Example scripts
└── tests/                          # Unit tests
```

## Structured Output System

Each inference creates a **self-contained folder** with all relevant data:

```
output/<model>_<question_id>_<timestamp>/
├── video/
│   └── generated_video.mp4         # Output video
├── question/
│   ├── first_frame.png             # Input image (sent to model)
│   ├── final_frame.png             # Reference image (not sent)
│   ├── prompt.txt                  # Text prompt used
│   └── question_metadata.json      # Full question data from dataset
└── metadata.json                   # Complete inference metadata
```

This structure ensures reproducibility and makes batch analysis easy.
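Because every inference folder carries its own `metadata.json`, batch analysis reduces to a glob over the output root. A sketch that assumes only the folder layout above — the metadata fields themselves are not specified in this README, and the `_inference_dir` key is our own bookkeeping addition:

```python
import json
from pathlib import Path

def collect_metadata(output_root):
    """Gather metadata.json from every inference folder under output_root."""
    records = []
    for meta_path in sorted(Path(output_root).glob("*/metadata.json")):
        record = json.loads(meta_path.read_text())
        record["_inference_dir"] = str(meta_path.parent)  # bookkeeping, not from file
        records.append(record)
    return records
```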

## Examples

See `examples/experiment_2025-10-14.py` for sequential inference across multiple models.

## Submodules

Initialize after cloning:

```bash
git submodule update --init --recursive
```

- **maze-dataset**: Maze datasets for ML evaluation
- **HunyuanVideo-I2V**: High-quality image-to-video generation (720p)
- **LTX-Video**: Real-time video generation models
VMEvalKit supports 36+ models across 9 providers and is designed to easily accommodate new models.

Both API-based and open-source (submodule) integration patterns are supported.
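The exact registration API for new models is not shown in this README; as an illustrative sketch only, both integration patterns usually reduce to one adapter interface behind which either an HTTP call or a local submodule runs (all names here are hypothetical):

```python
from abc import ABC, abstractmethod

class VideoModelAdapter(ABC):
    """Hypothetical common interface for API-based and submodule models."""

    @abstractmethod
    def generate(self, image_path: str, text_prompt: str) -> str:
        """Generate a video and return the path of the resulting file."""

class DummyAdapter(VideoModelAdapter):
    """Stand-in adapter showing the shape of an integration, not a real model."""

    def generate(self, image_path: str, text_prompt: str) -> str:
        # A real adapter would call a provider API or a local pipeline here.
        return image_path.replace("first_frame.png", "generated_video.mp4")
```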

## Running Experiments

### Quick Start

Generate the dataset and run experiments:

```bash
cd VMEvalKit
source venv/bin/activate

# Generate dataset (if needed)
python -m vmevalkit.runner.create_dataset --pairs-per-domain 15

# Run experiment (1 task per domain, for testing)
python examples/experiment_2025-10-14.py

# Run all tasks
python examples/experiment_2025-10-14.py --all-tasks
```

### Resume Mechanism

The experiment script includes a robust resume capability for long-running experiments:

**Features:**
- 🔄 Sequential execution: one model at a time, one task at a time
- ⚡ Automatic checkpointing every 5 completed jobs
- 🛡️ Graceful interruption handling (Ctrl+C saves progress)
- 📥 Resume from the latest or a specific experiment
- 📊 Tracking of completed, failed, and in-progress jobs

**Usage:**

```bash
# Resume latest interrupted experiment
python examples/experiment_2025-10-14.py --resume latest

# Resume specific experiment
python examples/experiment_2025-10-14.py --resume experiment_20241016_143022

# List available checkpoints
python examples/experiment_2025-10-14.py --list-checkpoints

# Start with custom experiment ID
python examples/experiment_2025-10-14.py --experiment-id my_test_001
```

**Command Options:**

| Option | Description |
|--------|-------------|
| `--resume <ID or 'latest'>` | Resume a previous experiment |
| `--no-resume` | Disable the resume mechanism |
| `--experiment-id <ID>` | Set a custom experiment ID |
| `--all-tasks` | Run all tasks instead of 1 per domain |
| `--list-checkpoints` | List available checkpoints |

**How It Works:**
- Progress is saved to `data/outputs/pilot_experiment/logs/checkpoint_*.json`
- Completed jobs are not re-run on resume
- Failed jobs can be retried
- Interrupted jobs are automatically retried
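The checkpoint files' schema is not documented above, but the save-every-N and skip-completed behaviour can be sketched in isolation. This is a simplified stand-in, not the experiment script's actual code; the `completed`/`failed` field names are assumptions:

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 5  # mirrors the "every 5 completed jobs" behaviour above

def save_checkpoint(path, completed, failed):
    """Persist progress so an interrupted run can be resumed."""
    state = {"completed": sorted(completed), "failed": sorted(failed)}
    Path(path).write_text(json.dumps(state))

def run_jobs(jobs, checkpoint_path, run_one):
    """Run jobs sequentially, skipping any already completed in a prior run."""
    done, failed = set(), set()
    cp = Path(checkpoint_path)
    if cp.exists():
        done = set(json.loads(cp.read_text()).get("completed", []))
    for job in jobs:
        if job in done:
            continue  # completed jobs are not re-run on resume
        try:
            run_one(job)
            done.add(job)
        except Exception:
            failed.add(job)  # failed jobs stay eligible for retry
        if len(done) % CHECKPOINT_EVERY == 0:
            save_checkpoint(checkpoint_path, done, failed)
    save_checkpoint(checkpoint_path, done, failed)
    return done, failed
```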

## License

MIT