TL;DR: The first open-domain closed-loop revisited benchmark for evaluating memory consistency and action control in world models
Homepage | Dataset | Paper | Code | Leaderboard (coming soon)
- [2026-2-9]: MIND is online!
- Open-source all training and inference code for MIND-World (1.3B), including a detailed code tutorial
- Release the weights of all stages of MIND-World (1.3B), including the frame-wise student model
- Building the leaderboard
- Building the Awesome-Interactive World Model repo
- Abstract
- Project Overview
- Dataset Overview
- Setup
- Dataset Format
- Leaderboard (coming soon)
- BibTeX
- Contact
- Acknowledgements
World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate action generalization across action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces.
Fig 1. Overview of MIND. We build and collect the first open-domain benchmark using Unreal Engine 5, supporting both first-person and third-person perspectives at 1080p resolution and 24 FPS.
Fig 2. Distribution of scene categories and action spaces in the MIND dataset. MIND supports open-domain scenarios with diverse and well-balanced action spaces.
- Follow ViPE's instructions to build the conda environment, until the ViPE command is available.
- Install our requirements in the same conda environment:
pip install -r requirements.txt
How Multi-GPU Works
- Videos are put into a task queue.
- Each GPU process takes one task from the queue when it is idle.
- If a task fails, it is put back into the queue.
- Progress bars show the accumulated progress across all processes.
- The result file is updated every time a task finishes, so you can read intermediate results from it at any point (a minimal sketch of this pattern follows below).
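For reference, here is a minimal Python sketch of this queue-based pattern. It is illustrative only: the actual multi-GPU logic lives in `src/process.py`, and `compute_metrics` below is a hypothetical stand-in for the per-video metric computation.

```python
# Minimal sketch of the task-queue pattern described above (illustrative only;
# the real multi-GPU logic lives in src/process.py).
import json
import multiprocessing as mp
import queue

def compute_metrics(task, device):
    # placeholder: in the real pipeline this computes lcm/visual/dino/action/gsc
    return {"path": task, "device": device}

def worker(gpu_id, task_queue, result_queue):
    """Each GPU process takes one task from the shared queue when it is idle."""
    while True:
        try:
            task = task_queue.get(timeout=1)
        except queue.Empty:
            break  # no tasks left, worker exits (simplified shutdown)
        try:
            result_queue.put(compute_metrics(task, device=f"cuda:{gpu_id}"))
        except Exception:
            task_queue.put(task)  # failed tasks go back into the queue

def run(tasks, num_gpus, output_path):
    task_queue, result_queue = mp.Queue(), mp.Queue()
    for t in tasks:
        task_queue.put(t)
    procs = [mp.Process(target=worker, args=(i, task_queue, result_queue))
             for i in range(num_gpus)]
    for p in procs:
        p.start()
    results = []
    for _ in range(len(tasks)):
        results.append(result_queue.get())
        # the result file is rewritten after every finished task,
        # so intermediate results are always available on disk
        with open(output_path, "w") as f:
            json.dump(results, f, indent=2)
    for p in procs:
        p.join()

if __name__ == "__main__":
    run(tasks=["video_001", "video_002"], num_gpus=2, output_path="result_demo.json")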
The metrics computation supports multi-GPU parallel processing for faster evaluation.
python src/process.py --gt_root /path/to/MIND-Data --test_root /path/to/test/videos --num_gpus 8 --metrics lcm,visual,action
- --gt_root: ground-truth data root directory (required)
- --test_root: test data root directory (required)
- --dino_path: DINOv3 model weights directory (default: ./dinov3_vitb16)
- --num_gpus: number of GPUs to use for parallel processing (default: 1)
- --video_max_time: maximum number of video frames to process (default: None = use all frames)
- --output: output JSON file path (default: result_{test_root}_{timestamp}.json)
- --metrics: comma-separated list of metrics to compute (default: lcm,visual,dino,action,gsc)
{model_name}
├── 1st_data
│   ├── action_space_test
│   │   ├── {corresponding data name}
│   │   │   └── video.mp4
│   │   ...
│   │
│   ├── mirror_test
│   │   ├── {arbitrary data name}
│   │   │   ├── path-1.mp4
│   │   │   ├── path-2.mp4
│   │   │   ├── path-3.mp4
│   │   │   ...
│   │   │   └── path-10.mp4
│   │   ...
│   │
│   └── mem_test
│       ├── {corresponding data name}
│       │   └── video.mp4
│       ...
│
└── 3rd_data
    ├── action_space_test
    │   ├── {corresponding data name}
    │   │   └── video.mp4
    │   ...
    │
    ├── mirror_test
    │   ├── {arbitrary data name}
    │   │   ├── path-1.mp4
    │   │   ├── path-2.mp4
    │   │   ├── path-3.mp4
    │   │   ...
    │   │   └── path-10.mp4
    │   ...
    │
    └── mem_test
        ├── {corresponding data name}
        │   └── video.mp4
        ...
- {model_name}: your custom model name
- {corresponding data name}: the corresponding ground-truth data file name
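Before running the evaluation, it can help to verify that a submission folder follows this layout. Below is a small sanity-check sketch of ours (not part of the released tooling); it assumes ten path-*.mp4 files per mirror_test clip, as shown in the tree above.

```python
# Sketch of a layout sanity check for a submission folder (not part of the
# official tooling); names follow the directory structure documented above.
from pathlib import Path

def check_submission(model_root: str) -> list:
    problems = []
    root = Path(model_root)
    for perspective in ("1st_data", "3rd_data"):
        for test in ("action_space_test", "mirror_test", "mem_test"):
            test_dir = root / perspective / test
            if not test_dir.is_dir():
                problems.append(f"missing directory: {test_dir}")
                continue
            for clip in sorted(p for p in test_dir.iterdir() if p.is_dir()):
                if test == "mirror_test":
                    # assumption: mirror_test clips contain path-1.mp4 ... path-10.mp4
                    missing = [f"path-{i}.mp4" for i in range(1, 11)
                               if not (clip / f"path-{i}.mp4").is_file()]
                    if missing:
                        problems.append(f"{clip}: missing {missing}")
                elif not (clip / "video.mp4").is_file():
                    # action_space_test / mem_test clips contain a single video.mp4
                    problems.append(f"{clip}: missing video.mp4")
    return problems

if __name__ == "__main__":
    import sys
    for issue in check_submission(sys.argv[1]):
        print(issue)
```

The evaluation script writes its results to a JSON file (see --output above) with the following structure: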
{
"video_max_time": [int] video_max_time given in cmd parameters; max frames of the sample video to compute metrics (except action accuracy).
"data": [
{
"path": [string] the directory name of the video data.
"perspective": [string] 1st_data/3rd_data, the perspective of the video data.
"test_type": [string] mem_test/action_space_test, the test set of the video data.
"error": [string] the error occur when computing metrics
"mark_time": [int] the divider of memory context and expected perdiction; the start frame index of the expected prediction.
"total_time": [int] the total frames of the ground truth video.
"sample_frames": [int ]the total frames of the video to be tested.
"video_results": [ the general scene consistency metric result.
{
"video_name": [string] the name of the video of the specific action path
"error": [string] the error occur when computing metrics in this video
"mark_time": [int] the divider of prediction and mirror perdiction; the start frame index of the mirror prediction.
"sample_frames": [int] total frames of prediction and mirror perdiction; should be 2x of marktime.
"gsc": {
"length": [int] length of the origin prediction and the mirror prediction.
"mse": [list[float]] the per-frame mean square error.
"avg_mse": [float] the average of mse.
"lpips": [list[float]] the per-frame Learned Perceptual Image Patch Similarity.
"avg_lpips": [float] the average of lpips.
"ssim": [list[float]] the per-frame Structural Similarity Index Measure.
"avg_ssim": [float] the average of ssim.
"psnr": [list[float]] the per-frame Peak Signal-to-Noise Ratio.
"avg_psnr": [float] the average of psnr.
}
},
...
]
"lcm": { the long context memory metric result.
"mse": [list[float]] the per-frame mean square error.
"avg_mse": [float] the average of mse.
"lpips": [list[float]] the per-frame Learned Perceptual Image Patch Similarity.
"avg_lpips": [float] the average of lpips.
"ssim": [list[float]] the per-frame Structural Similarity Index Measure.
"avg_ssim": [float] the average of ssim.
"psnr": [list[float]] the per-frame Peak Signal-to-Noise Ratio.
"avg_psnr": [float] the average of psnr.
},
"visual_quality": { the visual quality metric result.
"imaging": [list[float]] the per-frame imaging quality.
"avg_imaging": [float] the average of imaging quality.
"aesthetic": [list[float]] the per-frame aesthetic quality.
"avg_imaging": [float] the average of aesthetic quality.
},
"action": { the action accuracy metric result. computed by ViPE pose estimation and trajectory alignment.
"__overall__": { the overall statistics of all valid frames after outlier filtering.
"count": [int] number of valid samples used for statistics.
"rpe_trans_mean": [float] mean of Relative Pose Error for translation (in meters).
"rpe_trans_median": [float] median of RPE translation.
"rpe_rot_mean_deg": [float] mean of RPE rotation in degrees.
"rpe_rot_median_deg": [float] median of RPE rotation.
},
"translation": { the statistics of pure translation actions (forward/backward/left/right).
"count": [int] number of valid samples for translation actions.
"rpe_trans_mean": [float] mean RPE translation for translation actions.
"rpe_trans_median": [float] median RPE translation for translation actions.
"rpe_rot_mean_deg": [float] mean RPE rotation for translation actions.
"rpe_rot_median_deg": [float] median RPE rotation for translation actions.
},
"rotation": { the statistics of pure rotation actions (cam_left/cam_right/cam_up/cam_down).
"count": [int] number of valid samples for rotation actions.
...
},
"other": { the statistics of combined actions (e.g., forward+look_right).
"count": [int] number of valid samples for other actions.
...
},
"act:forward": { the statistics of specific action "forward".
"count": [int] number of valid samples for this action.
"rpe_trans_mean": [float] mean RPE translation.
"rpe_trans_median": [float] median RPE translation.
"rpe_rot_mean_deg": [float] mean RPE rotation.
"rpe_rot_median_deg": [float] median RPE rotation.
},
"act:look_right": { the statistics of specific action "look_right".
...
},
...
},
"dino": { the dino mse metric result.
"dino_mse": [list[float]] the per-frame mse of dino features.
"avg_dino_mse": [float] the average of dino_mse.
}
},
...
]
}
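For a quick look at an output file, here is a sketch that loads the result JSON and prints a few of the averages. Field names follow the schema above; the aggregation choices are ours.

```python
# Sketch: summarize a result JSON produced by src/process.py.
# Field names follow the schema documented above; the aggregation is illustrative.
import json
import sys
from statistics import mean

def summarize(result_path: str) -> None:
    with open(result_path) as f:
        result = json.load(f)

    lcm_lpips, gsc_lpips, dino_mse = [], [], []
    for item in result["data"]:
        if item.get("error"):
            continue  # skip videos whose metrics failed
        if "lcm" in item:
            lcm_lpips.append(item["lcm"]["avg_lpips"])
        for vid in item.get("video_results", []):
            if not vid.get("error") and "gsc" in vid:
                gsc_lpips.append(vid["gsc"]["avg_lpips"])
        if "dino" in item:
            dino_mse.append(item["dino"]["avg_dino_mse"])

    if lcm_lpips:
        print(f"LCM  avg LPIPS : {mean(lcm_lpips):.4f} over {len(lcm_lpips)} videos")
    if gsc_lpips:
        print(f"GSC  avg LPIPS : {mean(gsc_lpips):.4f} over {len(gsc_lpips)} paths")
    if dino_mse:
        print(f"DINO avg MSE   : {mean(dino_mse):.4f} over {len(dino_mse)} videos")

if __name__ == "__main__":
    summarize(sys.argv[1])
```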
MIND is available here!
MIND-Data
├── 1st_data
│   ├── test
│   │   ├── action_space_test
│   │   │   ├── {gt data name}
│   │   │   │   ├── action.json
│   │   │   │   ├── images.txt
│   │   │   │   └── video.mp4
│   │   │   ...
│   │   │
│   │   └── mem_test
│   │       ├── {gt data name}
│   │       │   ├── action.json
│   │       │   ├── images.txt
│   │       │   └── video.mp4
│   │       ...
│   └── train
│       ├── {gt data name}
│       │   ├── action.json
│       │   └── video.mp4
│       ...
│
└── 3rd_data
    ├── test
    │   ├── action_space_test
    │   │   ├── {gt data name}
    │   │   │   ├── action.json
    │   │   │   ├── images.txt
    │   │   │   └── video.mp4
    │   │   ...
    │   │
    │   └── mem_test
    │       ├── {gt data name}
    │       │   ├── action.json
    │       │   ├── images.txt
    │       │   └── video.mp4
    │       ...
    └── train
        ├── {gt data name}
        │   ├── action.json
        │   └── video.mp4
        ...
The action.json file has the following format:
{
"mark_time": [int] the divider of memory context and expected perdiction; the start frame index of the expected prediction
"total_time": [int] the total frames of the ground truth video
"caption" : [text] the text description of the ground truth video
"data": [
{
"time": [int] frame index
"ws": [int] 0: move forward, 1: move backward
"ad": [int] 0: move left, 1: move right
"ud": [int] 0: look up, 1: look down
"lr": [int] 0: look left, 1: look right
"actor_pos": {
"x": [float] the x-coordinate of the character
"y": [float] the y-coordinate of the character
"z": [float] the z-coordinate of the character
},
"actor_rpy": {
"x": [float] the roll angle of the character (Euler angles)
"y": [float] the pitch angle of the character
"z": [float] the yaw angle of the character
},
"camera_pos": {
# only exists in 3rd-person mode
"x": [float] the x-coordinate of the camera
"y": [float] the y-coordinate of the camera
"z": [float] the z-coordinate of the camera
},
"camera_rpy": {
# only exists in 3rd-person mode
"x": [float] the roll angle of the camera (Euler angles)
"y": [float] the pitch angle of the camera
"z": [float] the yaw angle of the camera
}
},
...
]
}
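For reference, here is a sketch that reads an action.json and decodes the per-frame action flags into readable labels. The key semantics follow the description above; whether a key is omitted when the corresponding action is inactive is our assumption.

```python
# Sketch: decode per-frame actions from an action.json file.
# Key semantics follow the documentation above; whether a key is omitted
# when the action is inactive is an assumption here.
import json
import sys

ACTION_NAMES = {
    ("ws", 0): "move_forward",  ("ws", 1): "move_backward",
    ("ad", 0): "move_left",     ("ad", 1): "move_right",
    ("ud", 0): "look_up",       ("ud", 1): "look_down",
    ("lr", 0): "look_left",     ("lr", 1): "look_right",
}

def decode_actions(action_json_path: str) -> None:
    with open(action_json_path) as f:
        meta = json.load(f)
    print(f"caption   : {meta['caption']}")
    print(f"mark_time : {meta['mark_time']}  total_time: {meta['total_time']}")
    for frame in meta["data"]:
        labels = [ACTION_NAMES[(k, frame[k])]
                  for k in ("ws", "ad", "ud", "lr") if k in frame]
        pos = frame["actor_pos"]
        print(f"t={frame['time']:4d}  actions={labels}  "
              f"actor=({pos['x']:.2f}, {pos['y']:.2f}, {pos['z']:.2f})")

if __name__ == "__main__":
    decode_actions(sys.argv[1])
```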
The leaderboard is coming...
If you find our work helpful, we would appreciate your citation and a star:
@misc{ye2026mind,
title={MIND: Benchmarking Memory Consistency and Action Control in World Models},
author={Yixuan Ye and Xuanyu Lu and Yuxin Jiang and Yuchao Gu and Rui Zhao and Qiwei Liang and Jiachun Pan and Fengda Zhang and Weijia Wu and Alex Jinpeng Wang},
year={2026},
eprint={xxx},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/xxx},
}
Please send an email to yixuanye12@gmail.com if you have any questions.
We would like to thank ViPE and SkyReels-V2 for their great work.


