pip install -e .
## Quick Start

```python
from vmevalkit.runner.inference import InferenceRunner

# Initialize runner with structured output
runner = InferenceRunner(output_dir="output")

# Generate video solution
result = runner.run(
    model_name="luma-ray-2",
    image_path="data/questions/maze_task/maze_0000/first_frame.png",
    text_prompt="Navigate the green dot through the maze corridors to reach the red flag"
)

print(f"Video saved to: {result['inference_dir']}")
# Each inference creates a self-contained folder with:
# - video/:        Generated video file
# - question/:     Input images and prompt
# - metadata.json: Complete inference metadata
```
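The call above runs one model on one task pair. To compare several models on the same question, the same `runner.run` call can simply be looped; a minimal sketch, assuming the `run` signature shown above (the second model name is a placeholder, not a confirmed identifier):

```python
# Sketch: compare several video models on one task pair.
# "hunyuan-i2v" is a placeholder name used only for illustration.
MODELS = ["luma-ray-2", "hunyuan-i2v"]

def run_models(runner, models, image_path, text_prompt):
    """Run each model on the same question and collect results by model name."""
    results = {}
    for name in models:
        results[name] = runner.run(
            model_name=name,
            image_path=image_path,
            text_prompt=text_prompt,
        )
    return results
```

Because each call writes its own self-contained output folder, the per-model results stay cleanly separated on disk.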

## Supported Models
All models support **image + text → video** for reasoning evaluation.

### Task Pair: The Fundamental Unit
Every VMEvalKit dataset consists of **Task Pairs** - the basic unit for video reasoning evaluation:

- 📸 **Initial state image** (`first_frame.png` - the reasoning problem)
- 🎯 **Final state image** (`final_frame.png` - the solution/goal state)
- 📝 **Text prompt** (`prompt.txt` - instructions for video model)
- 📊 **Rich metadata** (`question_metadata.json` - difficulty, task-specific parameters, etc.)

Each task pair is organized in its own folder (`data/questions/{domain}_task/{question_id}/`) containing all four files. Models must generate videos showing the reasoning process from initial → final state.

## Tasks

```
VMEvalKit/
├── vmevalkit/
│   ├── core/                       # Evaluation framework
│   ├── tasks/                      # Task definitions
│   └── utils/                      # Utilities
├── data/
│   └── questions/                  # Dataset with per-question folders
│       ├── vmeval_dataset.json     # Master dataset manifest
│       ├── chess_task/             # Chess reasoning questions
│       │   └── chess_0000/         # Individual question folder
│       │       ├── first_frame.png
│       │       ├── final_frame.png
│       │       ├── prompt.txt
│       │       └── question_metadata.json
│       ├── maze_task/              # Maze navigation questions
│       ├── raven_task/             # Pattern completion questions
│       └── rotation_task/          # 3D rotation questions
├── output/                         # Structured inference outputs
│   └── <inference_id>/             # Self-contained folders per inference
│       ├── video/                  # Generated video file
│       ├── question/               # Input images and prompt
│       └── metadata.json           # Complete inference metadata
├── examples/                       # Example scripts
└── tests/                          # Unit tests
```

## Structured Output System

Each inference creates a **self-contained folder** with all relevant data:

```
output/<model>_<question_id>_<timestamp>/
├── video/
│   └── generated_video.mp4         # Output video
├── question/
│   ├── first_frame.png             # Input image (sent to model)
│   ├── final_frame.png             # Reference image (not sent)
│   ├── prompt.txt                  # Text prompt used
│   └── question_metadata.json      # Full question data from dataset
└── metadata.json                   # Complete inference metadata
```

This structure ensures reproducibility and makes batch analysis easy.
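Because every inference folder carries its own `metadata.json`, batch analysis reduces to a glob over the output root. A sketch that assumes only the folder layout above — the metadata fields themselves are not specified in this README, and the `_inference_dir` key is our own bookkeeping addition:

```python
import json
from pathlib import Path

def collect_metadata(output_root):
    """Gather metadata.json from every inference folder under output_root."""
    records = []
    for meta_path in sorted(Path(output_root).glob("*/metadata.json")):
        record = json.loads(meta_path.read_text())
        record["_inference_dir"] = str(meta_path.parent)  # bookkeeping, not from file
        records.append(record)
    return records
```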

## Examples

See `examples/experiment_2025-10-14.py` for sequential inference across multiple models.

## Submodules

Initialize after cloning:

```bash
git submodule update --init --recursive
```

- **maze-dataset**: Maze datasets for ML evaluation
- **HunyuanVideo-I2V**: High-quality image-to-video generation (720p)
- **LTX-Video**: Real-time video generation models
VMEvalKit supports 36+ models across 9 providers and is designed to easily accommodate new models.

Both API-based and open-source (submodule) integration patterns are supported.
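The exact registration API for new models is not shown in this README; as an illustrative sketch only, both integration patterns usually reduce to one adapter interface behind which either an HTTP call or a local submodule runs (all names here are hypothetical):

```python
from abc import ABC, abstractmethod

class VideoModelAdapter(ABC):
    """Hypothetical common interface for API-based and submodule models."""

    @abstractmethod
    def generate(self, image_path: str, text_prompt: str) -> str:
        """Generate a video and return the path of the resulting file."""

class DummyAdapter(VideoModelAdapter):
    """Stand-in adapter showing the shape of an integration, not a real model."""

    def generate(self, image_path: str, text_prompt: str) -> str:
        # A real adapter would call a provider API or a local pipeline here.
        return image_path.replace("first_frame.png", "generated_video.mp4")
```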

## Running Experiments

### Quick Start

Generate the dataset and run experiments:

```bash
cd VMEvalKit
source venv/bin/activate

# Generate dataset (if needed)
python -m vmevalkit.runner.create_dataset --pairs-per-domain 15

# Run experiment (1 task per domain, for testing)
python examples/experiment_2025-10-14.py

# Run all tasks
python examples/experiment_2025-10-14.py --all-tasks
```

### Resume Mechanism

The experiment script includes a robust resume capability for long-running experiments:

**Features:**
- 🔄 Sequential execution: one model at a time, one task at a time
- ⚡ Automatic checkpointing every 5 completed jobs
- 🛡️ Graceful interruption handling (Ctrl+C saves progress)
- 📥 Resume from the latest or a specific experiment
- 📊 Tracking of completed, failed, and in-progress jobs

**Usage:**

```bash
# Resume latest interrupted experiment
python examples/experiment_2025-10-14.py --resume latest

# Resume specific experiment
python examples/experiment_2025-10-14.py --resume experiment_20241016_143022

# List available checkpoints
python examples/experiment_2025-10-14.py --list-checkpoints

# Start with custom experiment ID
python examples/experiment_2025-10-14.py --experiment-id my_test_001
```

**Command Options:**

| Option | Description |
|--------|-------------|
| `--resume <ID or 'latest'>` | Resume a previous experiment |
| `--no-resume` | Disable the resume mechanism |
| `--experiment-id <ID>` | Set a custom experiment ID |
| `--all-tasks` | Run all tasks instead of 1 per domain |
| `--list-checkpoints` | List available checkpoints |

**How It Works:**
- Progress is saved to `data/outputs/pilot_experiment/logs/checkpoint_*.json`
- Completed jobs are not re-run on resume
- Failed jobs can be retried
- Interrupted jobs are automatically retried
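The checkpoint files' schema is not documented above, but the save-every-N and skip-completed behaviour can be sketched in isolation. This is a simplified stand-in, not the experiment script's actual code; the `completed`/`failed` field names are assumptions:

```python
import json
from pathlib import Path

CHECKPOINT_EVERY = 5  # mirrors the "every 5 completed jobs" behaviour above

def save_checkpoint(path, completed, failed):
    """Persist progress so an interrupted run can be resumed."""
    state = {"completed": sorted(completed), "failed": sorted(failed)}
    Path(path).write_text(json.dumps(state))

def run_jobs(jobs, checkpoint_path, run_one):
    """Run jobs sequentially, skipping any already completed in a prior run."""
    done, failed = set(), set()
    cp = Path(checkpoint_path)
    if cp.exists():
        done = set(json.loads(cp.read_text()).get("completed", []))
    for job in jobs:
        if job in done:
            continue  # completed jobs are not re-run on resume
        try:
            run_one(job)
            done.add(job)
        except Exception:
            failed.add(job)  # failed jobs stay eligible for retry
        if len(done) % CHECKPOINT_EVERY == 0:
            save_checkpoint(checkpoint_path, done, failed)
    save_checkpoint(checkpoint_path, done, failed)
    return done, failed
```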

## License

MIT