Commit 61bfa3b

Merge pull request #5 from hokindeng/dev
Dev
2 parents a5db681 + 8a8a0c3 commit 61bfa3b

46 files changed

Lines changed: 4768 additions & 3266 deletions

.cursor/rules/cheatsheet.mdc

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+---
+alwaysApply: false
+---
+To use Cairo
+
+```export PKG_CONFIG_PATH=/opt/homebrew/lib/pkgconfig:/opt/homebrew/opt/libffi/lib/pkgconfig
+export DYLD_FALLBACK_LIBRARY_PATH=/opt/homebrew/lib```
+
+To export all the keys
+
+```source /Users/access/VMEvalKit/venv/bin/activate
+cd /Users/access/VMEvalKit
+set -a; source ./.env; set +a```

.cursor/rules/trycatch.mdc

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+---
+alwaysApply: true
+---
+
+Don't write try catch blocks.

.gitignore

Lines changed: 6 additions & 1 deletion
@@ -208,4 +208,9 @@ __marimo__/

 # VMEvalKit outputs and data
 outputs/
-data/
+deprecated/
+
+# VMEvalKit data folder - ignore large datasets but keep scripts and logging
+data/questions/
+data/outputs/
+# Keep: data/s3_sync.py, data/data_logging/

CONTRIBUTING.md

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+# 🙌 Contributors
+
+Hokin Deng, Ran Ji, Maijunxian Wang, …

README.md

Lines changed: 116 additions & 13 deletions
@@ -24,16 +24,23 @@ pip install -e .
 ## Quick Start

 ```python
-from vmevalkit import run_inference
+from vmevalkit.runner.inference import InferenceRunner
+
+# Initialize runner with structured output
+runner = InferenceRunner(output_dir="output")

 # Generate video solution
-result = run_inference(
+result = runner.run(
     model_name="luma-ray-2",
-    image_path="data/maze.png",
-    text_prompt="Solve this maze from start to finish"
+    image_path="data/questions/maze_task/maze_0000/first_frame.png",
+    text_prompt="Navigate the green dot through the maze corridors to reach the red flag"
 )

-print(f"Video: {result['video_path']}")
+print(f"Video saved to: {result['inference_dir']}")
+# Each inference creates a self-contained folder with:
+# - video/: Generated video file
+# - question/: Input images and prompt
+# - metadata.json: Complete inference metadata
 ```

 ## Supported Models
@@ -60,12 +67,12 @@ All models support **image + text → video** for reasoning evaluation.
 ### Task Pair: The Fundamental Unit
 Every VMEvalKit dataset consists of **Task Pairs** - the basic unit for video reasoning evaluation:

-- 📸 **Initial state image** (the reasoning problem)
-- 🎯 **Final state image** (the solution/goal state)
-- 📝 **Text prompt** (instructions for video model)
-- 📊 **Rich metadata** (difficulty, task-specific parameters, etc.)
+- 📸 **Initial state image** (`first_frame.png` - the reasoning problem)
+- 🎯 **Final state image** (`final_frame.png` - the solution/goal state)
+- 📝 **Text prompt** (`prompt.txt` - instructions for video model)
+- 📊 **Rich metadata** (`question_metadata.json` - difficulty, task-specific parameters, etc.)

-Models must generate videos showing the reasoning process from initial → final state.
+Each task pair is organized in its own folder (`data/questions/{domain}_task/{question_id}/`) containing all four files. Models must generate videos showing the reasoning process from initial → final state.

 ## Tasks

@@ -95,14 +102,48 @@ VMEvalKit/
 │   ├── core/           # Evaluation framework
 │   ├── tasks/          # Task definitions
 │   └── utils/          # Utilities
-├── data/               # Datasets
+├── data/
+│   └── questions/      # Dataset with per-question folders
+│       ├── vmeval_dataset.json    # Master dataset manifest
+│       ├── chess_task/            # Chess reasoning questions
+│       │   └── chess_0000/        # Individual question folder
+│       │       ├── first_frame.png
+│       │       ├── final_frame.png
+│       │       ├── prompt.txt
+│       │       └── question_metadata.json
+│       ├── maze_task/             # Maze navigation questions
+│       ├── raven_task/            # Pattern completion questions
+│       └── rotation_task/         # 3D rotation questions
+├── output/             # Structured inference outputs
+│   └── <inference_id>/ # Self-contained folders per inference
+│       ├── video/                 # Generated video file
+│       ├── question/              # Input images and prompt
+│       └── metadata.json          # Complete inference metadata
 ├── examples/           # Example scripts
 └── tests/              # Unit tests
 ```

+## Structured Output System
+
+Each inference creates a **self-contained folder** with all relevant data:
+
+```
+output/<model>_<question_id>_<timestamp>/
+├── video/
+│   └── generated_video.mp4      # Output video
+├── question/
+│   ├── first_frame.png          # Input image (sent to model)
+│   ├── final_frame.png          # Reference image (not sent)
+│   ├── prompt.txt               # Text prompt used
+│   └── question_metadata.json   # Full question data from dataset
+└── metadata.json                # Complete inference metadata
+```
+
+This structure ensures reproducibility and makes batch analysis easy.
+
 ## Examples

-See `examples/simple_inference.py` for more usage patterns.
+See `examples/experiment_2025-10-14.py` for sequential inference across multiple models.

 ## Submodules

@@ -111,7 +152,6 @@ Initialize after cloning:
 git submodule update --init --recursive
 ```

-- **KnowWhat**: Research on knowing-how vs knowing-that
 - **maze-dataset**: Maze datasets for ML evaluation
 - **HunyuanVideo-I2V**: High-quality image-to-video generation (720p)
 - **LTX-Video**: Real-time video generation models
@@ -138,6 +178,69 @@ VMEvalKit supports 36+ models across 9 providers and is designed to easily accom

 Both API-based and open-source (submodule) integration patterns are supported.

+## Running Experiments
+
+### Quick Start
+
+Generate dataset and run experiments:
+
+```bash
+cd /Users/access/VMEvalKit
+source venv/bin/activate
+
+# Generate dataset (if needed)
+python -m vmevalkit.runner.create_dataset --pairs-per-domain 15
+
+# Run experiment (1 task per domain for testing)
+python examples/experiment_2025-10-14.py
+
+# Run all tasks
+python examples/experiment_2025-10-14.py --all-tasks
+```
+
+### Resume Mechanism
+
+The experiment script includes robust resume capability for long-running experiments:
+
+**Features:**
+- 🔄 Sequential execution: one model at a time, one task at a time
+- ⚡ Automatic checkpointing every 5 completed jobs
+- 🛡️ Graceful interruption handling (Ctrl+C saves progress)
+- 📥 Resume from latest or specific experiment
+- 📊 Track completed, failed, and in-progress jobs
+
+**Usage:**
+
+```bash
+# Resume latest interrupted experiment
+python examples/experiment_2025-10-14.py --resume latest
+
+# Resume specific experiment
+python examples/experiment_2025-10-14.py --resume experiment_20241016_143022
+
+# List available checkpoints
+python examples/experiment_2025-10-14.py --list-checkpoints
+
+# Start with custom experiment ID
+python examples/experiment_2025-10-14.py --experiment-id my_test_001
+```
+
+**Command Options:**
+
+| Option | Description |
+|--------|-------------|
+| `--resume <ID or 'latest'>` | Resume a previous experiment |
+| `--no-resume` | Disable resume mechanism |
+| `--experiment-id <ID>` | Set custom experiment ID |
+| `--all-tasks` | Run all tasks instead of 1 per domain |
+| `--list-checkpoints` | List available checkpoints |
+
+**How It Works:**
+- Progress saved to `data/outputs/pilot_experiment/logs/checkpoint_*.json`
+- Completed jobs won't be re-run on resume
+- Failed jobs can be retried
+- Interrupted jobs are automatically retried
+
 ## License

 MIT
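
Reviewer note: the README hunk above names four files per task pair folder (`first_frame.png`, `final_frame.png`, `prompt.txt`, `question_metadata.json`). A minimal sketch of reading one such folder follows. The `TaskPair` class, `load_task_pair`, and `make_demo_question` are hypothetical helpers that do not appear in this commit, and the `"difficulty": "easy"` metadata value is made up for the demo; only the file names and folder layout come from the README.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TaskPair:
    """Hypothetical container for one question folder (not part of VMEvalKit)."""
    first_frame: Path
    final_frame: Path
    prompt: str
    metadata: dict


def load_task_pair(question_dir: Path) -> TaskPair:
    """Read the four files the README lists for a task pair."""
    metadata_text = (question_dir / "question_metadata.json").read_text(encoding="utf-8")
    return TaskPair(
        first_frame=question_dir / "first_frame.png",
        final_frame=question_dir / "final_frame.png",
        prompt=(question_dir / "prompt.txt").read_text(encoding="utf-8").strip(),
        metadata=json.loads(metadata_text),
    )


def make_demo_question(root: Path) -> Path:
    """Build a throwaway maze_task/maze_0000 folder with placeholder files."""
    question_dir = root / "maze_task" / "maze_0000"
    question_dir.mkdir(parents=True)
    (question_dir / "first_frame.png").write_bytes(b"")  # placeholder, not a real image
    (question_dir / "final_frame.png").write_bytes(b"")  # placeholder, not a real image
    (question_dir / "prompt.txt").write_text(
        "Navigate the green dot to the red flag\n", encoding="utf-8"
    )
    (question_dir / "question_metadata.json").write_text(
        json.dumps({"difficulty": "easy"}), encoding="utf-8"  # assumed metadata shape
    )
    return question_dir


with tempfile.TemporaryDirectory() as tmp:
    pair = load_task_pair(make_demo_question(Path(tmp)))
    print(pair.prompt)
    print(pair.metadata)
```

Only `first_frame` and `prompt` would be passed to a model under the layout described above; `final_frame` stays on disk as the reference image.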

data/data_logging/README.md

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+# Dataset Version Logging
+
+Simple S3 version tracking for VMEvalKit datasets.
+
+## Structure
+```
+data_logging/
+├── version_log.json    # Version history
+└── versions/           # Detailed logs (auto-generated)
+```
+
+## Usage
+
+### Upload to S3
+```bash
+# Upload with today's date
+python data/s3_sync.py
+
+# Upload and log version
+python data/s3_sync.py --log
+```
+
+### View versions
+```bash
+python data/data_logging/version_tracker.py summary
+```
+
+### From Python
+```python
+from data.data_logging import log_version, get_latest
+
+# Log a version
+log_version("1.0", "s3://vmevalkit/20251015/data", {"size_mb": 180, "files": 1300})
+
+# Get latest
+latest = get_latest()
+```
+
+## S3 Structure
+Data is stored at: `s3://vmevalkit/YYYYMMDD/data`
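
Reviewer note: the dated layout `s3://vmevalkit/YYYYMMDD/data` can be produced with a one-line format call. The sketch below is an illustration only: `dated_s3_uri` is a hypothetical helper, and how `data/s3_sync.py` actually builds its destination is not shown in this diff.

```python
from datetime import date

BUCKET = "vmevalkit"  # bucket name taken from the README's example URI


def dated_s3_uri(day: date, bucket: str = BUCKET) -> str:
    """Return s3://<bucket>/YYYYMMDD/data for the given day (hypothetical helper)."""
    return f"s3://{bucket}/{day.strftime('%Y%m%d')}/data"


print(dated_s3_uri(date(2025, 10, 15)))  # prints s3://vmevalkit/20251015/data
```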

data/data_logging/__init__.py

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+"""Simple dataset version logging."""
+
+from .version_tracker import log_version, get_latest, print_summary
+
+__all__ = ['log_version', 'get_latest', 'print_summary']
+__version__ = "1.0.0"

data/data_logging/version_log.json

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+{
+  "versions": []
+}

data/data_logging/version_tracker.py

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+#!/usr/bin/env python3
+"""Simple S3 dataset version tracker."""
+
+import json
+from datetime import datetime
+from pathlib import Path
+from typing import Dict, Optional
+
+# Paths
+DATA_LOGGING_DIR = Path(__file__).parent
+VERSION_LOG_PATH = DATA_LOGGING_DIR / "version_log.json"
+VERSIONS_DIR = DATA_LOGGING_DIR / "versions"
+
+
+def load_log() -> Dict:
+    """Load version log."""
+    if VERSION_LOG_PATH.exists():
+        with open(VERSION_LOG_PATH, 'r') as f:
+            return json.load(f)
+    return {'versions': []}
+
+
+def save_log(log: Dict) -> None:
+    """Save version log."""
+    with open(VERSION_LOG_PATH, 'w') as f:
+        json.dump(log, f, indent=2)
+
+
+def log_version(version: str, s3_uri: str, stats: Dict) -> None:
+    """Log a new dataset version."""
+    log = load_log()
+
+    # Check if already exists
+    for v in log['versions']:
+        if v['version'] == version:
+            print(f"Version {version} already exists")
+            return
+
+    # Add version
+    log['versions'].append({
+        'version': version,
+        'date': datetime.now().strftime('%Y%m%d'),
+        's3_uri': s3_uri,
+        'size_mb': stats.get('size_mb', 0),
+        'files': stats.get('files', 0),
+        'timestamp': datetime.now().isoformat()
+    })
+
+    save_log(log)
+    print(f"✅ Logged v{version} → {s3_uri}")
+
+
+def get_latest() -> Optional[Dict]:
+    """Get latest version."""
+    log = load_log()
+    return log['versions'][-1] if log['versions'] else None
+
+
+def print_summary() -> None:
+    """Print version summary."""
+    log = load_log()
+    if not log['versions']:
+        print("No versions logged")
+        return
+
+    print("\n📊 Dataset Versions")
+    print("=" * 40)
+    for v in log['versions']:
+        print(f"v{v['version']} ({v['date']}) → {v['s3_uri']}")
+        print(f" {v.get('size_mb', 0):.1f}MB, {v.get('files', 0)} files")
+
+
+def main() -> None:
+    """CLI entry point."""
+    import sys
+    if len(sys.argv) > 1 and sys.argv[1] == 'summary':
+        print_summary()
+    elif len(sys.argv) > 1 and sys.argv[1] == 'latest':
+        latest = get_latest()
+        if latest:
+            print(f"Latest: v{latest['version']} → {latest['s3_uri']}")
+    else:
+        print_summary()
+
+
+if __name__ == "__main__":
+    main()
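
Reviewer note: the tracker's append-once behaviour (a second call with the same version string is ignored) can be exercised without touching the repository's `version_log.json`. The sketch below is a variant written for that purpose, not the committed module: these functions take an explicit `path` argument and `log_version` returns a bool, both of which are assumptions made so the example can run against a temporary file.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path
from typing import Optional


def load_log(path: Path) -> dict:
    """Load the version log, or start an empty one if the file is missing."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"versions": []}


def log_version(path: Path, version: str, s3_uri: str, stats: dict) -> bool:
    """Append a version entry; return False if that version is already logged."""
    log = load_log(path)
    if any(entry["version"] == version for entry in log["versions"]):
        return False
    now = datetime.now()
    log["versions"].append({
        "version": version,
        "date": now.strftime("%Y%m%d"),
        "s3_uri": s3_uri,
        "size_mb": stats.get("size_mb", 0),
        "files": stats.get("files", 0),
        "timestamp": now.isoformat(),
    })
    path.write_text(json.dumps(log, indent=2), encoding="utf-8")
    return True


def get_latest(path: Path) -> Optional[dict]:
    """Return the most recently appended entry, or None for an empty log."""
    versions = load_log(path)["versions"]
    return versions[-1] if versions else None


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "version_log.json"
    first = log_version(
        log_path, "1.0", "s3://vmevalkit/20251015/data", {"size_mb": 180, "files": 1300}
    )
    duplicate = log_version(log_path, "1.0", "s3://vmevalkit/20251015/data", {})
    latest = get_latest(log_path)
    print(first, duplicate, latest["version"], latest["files"])
```

Passing the path explicitly keeps the example independent of `__file__`; the committed module instead resolves the log next to its own source file.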
