TL;DR: The first open-domain closed-loop revisited benchmark for evaluating memory consistency and action control in world models
Homepage | Dataset | Paper | Code | Leaderboard (coming soon)
- [2026-2-9]: MIND is online!
- Open-source all training and inference code for MIND-World (1.3B), including a detailed code tutorial
- Release the weights of all stages of MIND-World (1.3B), including the frame-wise student model
- Building the leaderboard
- Building the Awesome-Interactive World Model repo
- Abstract
- Project Overview
- Dataset Overview
- Setup
- Dataset Format
- Leaderboard (coming soon)
- BibTeX
- Contact
- Acknowledgements
World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate action generalization across action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces.
Fig 1. Overview of MIND. We build and collect the first open-domain benchmark using Unreal Engine 5, supporting both first-person and third-person perspectives at 1080p resolution and 24 FPS.
Fig 2. Distribution of scene categories and action spaces in the MIND dataset. MIND supports open-domain scenarios with diverse and well-balanced action spaces.
- Follow ViPE's instructions to build the conda environment, until the ViPE command is available.
- Install our requirements in the same conda environment:
pip install -r requirements.txt
How Multi-GPU Works
- Videos are put into a task queue.
- Each GPU process takes one task from the queue when it is idle.
- If a task fails, it is put back into the queue.
- Progress bars show the accumulated progress across all processes.
- The result file is updated every time a task finishes, so you can read intermediate results from it at any point (a minimal sketch of this pattern follows below).
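For reference, here is a minimal Python sketch of this queue-based pattern. It is illustrative only: the actual multi-GPU logic lives in `src/process.py`, and `compute_metrics` below is a hypothetical stand-in for the per-video metric computation.

```python
# Minimal sketch of the task-queue pattern described above (illustrative only;
# the real multi-GPU logic lives in src/process.py).
import json
import multiprocessing as mp
import queue

def compute_metrics(task, device):
    # placeholder: in the real pipeline this computes lcm/visual/dino/action/gsc
    return {"path": task, "device": device}

def worker(gpu_id, task_queue, result_queue):
    """Each GPU process takes one task from the shared queue when it is idle."""
    while True:
        try:
            task = task_queue.get(timeout=1)
        except queue.Empty:
            break  # no tasks left, worker exits (simplified shutdown)
        try:
            result_queue.put(compute_metrics(task, device=f"cuda:{gpu_id}"))
        except Exception:
            task_queue.put(task)  # failed tasks go back into the queue

def run(tasks, num_gpus, output_path):
    task_queue, result_queue = mp.Queue(), mp.Queue()
    for t in tasks:
        task_queue.put(t)
    procs = [mp.Process(target=worker, args=(i, task_queue, result_queue))
             for i in range(num_gpus)]
    for p in procs:
        p.start()
    results = []
    for _ in range(len(tasks)):
        results.append(result_queue.get())
        # the result file is rewritten after every finished task,
        # so intermediate results are always available on disk
        with open(output_path, "w") as f:
            json.dump(results, f, indent=2)
    for p in procs:
        p.join()

if __name__ == "__main__":
    run(tasks=["video_001", "video_002"], num_gpus=2, output_path="result_demo.json")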
The metrics computation supports multi-GPU parallel processing for faster evaluation.
python src/process.py --gt_root /path/to/MIND-Data --test_root /path/to/test/videos --num_gpus 8 --metrics lcm,visual,action
- --gt_root: ground-truth data root directory (required)
- --test_root: test data root directory (required)
- --dino_path: DINOv3 model weights directory (default: ./dinov3_vitb16)
- --num_gpus: number of GPUs to use for parallel processing (default: 1)
- --video_max_time: maximum number of video frames to process (default: None = use all frames)
- --output: output JSON file path (default: result_{test_root}_{timestamp}.json)
- --metrics: comma-separated list of metrics to compute (default: lcm,visual,dino,action,gsc)
{model_name}
├── 1st_data
│   ├── action_space_test
│   │   ├── {corresponding data name}
│   │   │   └── video.mp4
│   │   ...
│   │
│   ├── mirror_test
│   │   ├── {arbitrary data name}
│   │   │   ├── path-1.mp4
│   │   │   ├── path-2.mp4
│   │   │   ├── path-3.mp4
│   │   │   ...
│   │   │   └── path-10.mp4
│   │   ...
│   │
│   └── mem_test
│       ├── {corresponding data name}
│       │   └── video.mp4
│       ...
│
└── 3rd_data
    ├── action_space_test
    │   ├── {corresponding data name}
    │   │   └── video.mp4
    │   ...
    │
    ├── mirror_test
    │   ├── {arbitrary data name}
    │   │   ├── path-1.mp4
    │   │   ├── path-2.mp4
    │   │   ├── path-3.mp4
    │   │   ...
    │   │   └── path-10.mp4
    │   ...
    │
    └── mem_test
        ├── {corresponding data name}
        │   └── video.mp4
        ...
- {model_name}: your custom model name
- {corresponding data name}: the corresponding ground-truth data file name
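Before running the evaluation, it can help to verify that a submission folder follows this layout. Below is a small sanity-check sketch of ours (not part of the released tooling); it assumes ten path-*.mp4 files per mirror_test clip, as shown in the tree above.

```python
# Sketch of a layout sanity check for a submission folder (not part of the
# official tooling); names follow the directory structure documented above.
from pathlib import Path

def check_submission(model_root: str) -> list:
    problems = []
    root = Path(model_root)
    for perspective in ("1st_data", "3rd_data"):
        for test in ("action_space_test", "mirror_test", "mem_test"):
            test_dir = root / perspective / test
            if not test_dir.is_dir():
                problems.append(f"missing directory: {test_dir}")
                continue
            for clip in sorted(p for p in test_dir.iterdir() if p.is_dir()):
                if test == "mirror_test":
                    # assumption: mirror_test clips contain path-1.mp4 ... path-10.mp4
                    missing = [f"path-{i}.mp4" for i in range(1, 11)
                               if not (clip / f"path-{i}.mp4").is_file()]
                    if missing:
                        problems.append(f"{clip}: missing {missing}")
                elif not (clip / "video.mp4").is_file():
                    # action_space_test / mem_test clips contain a single video.mp4
                    problems.append(f"{clip}: missing video.mp4")
    return problems

if __name__ == "__main__":
    import sys
    for issue in check_submission(sys.argv[1]):
        print(issue)
```

The evaluation script writes its results to a JSON file (see --output above) with the following structure: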
{
"video_max_time": [int] video_max_time given in cmd parameters; max frames of the sample video to compute metrics (except action accuracy).
"data": [
{
"path": [string] the directory name of the video data.
"perspective": [string] 1st_data/3rd_data, the perspective of the video data.
"test_type": [string] mem_test/action_space_test, the test set of the video data.
"error": [string] the error occur when computing metrics
"mark_time": [int] the divider of memory context and expected perdiction; the start frame index of the expected prediction.
"total_time": [int] the total frames of the ground truth video.
"sample_frames": [int ]the total frames of the video to be tested.
"video_results": [ the general scene consistency metric result.
{
"video_name": [string] the name of the video of the specific action path
"error": [string] the error occur when computing metrics in this video
"mark_time": [int] the divider of prediction and mirror perdiction; the start frame index of the mirror prediction.
"sample_frames": [int] total frames of prediction and mirror perdiction; should be 2x of marktime.
"gsc": {
"length": [int] length of the origin prediction and the mirror prediction.
"mse": [list[float]] the per-frame mean square error.
"avg_mse": [float] the average of mse.
"lpips": [list[float]] the per-frame Learned Perceptual Image Patch Similarity.
"avg_lpips": [float] the average of lpips.
"ssim": [list[float]] the per-frame Structural Similarity Index Measure.
"avg_ssim": [float] the average of ssim.
"psnr": [list[float]] the per-frame Peak Signal-to-Noise Ratio.
"avg_psnr": [float] the average of psnr.
}
},
...
]
"lcm": { the long context memory metric result.
"mse": [list[float]] the per-frame mean square error.
"avg_mse": [float] the average of mse.
"lpips": [list[float]] the per-frame Learned Perceptual Image Patch Similarity.
"avg_lpips": [float] the average of lpips.
"ssim": [list[float]] the per-frame Structural Similarity Index Measure.
"avg_ssim": [float] the average of ssim.
"psnr": [list[float]] the per-frame Peak Signal-to-Noise Ratio.
"avg_psnr": [float] the average of psnr.
},
"visual_quality": { the visual quality metric result.
"imaging": [list[float]] the per-frame imaging quality.
"avg_imaging": [float] the average of imaging quality.
"aesthetic": [list[float]] the per-frame aesthetic quality.
"avg_imaging": [float] the average of aesthetic quality.
},
"action": { the action accuracy metric result. computed by ViPE pose estimation and trajectory alignment.
"__overall__": { the overall statistics of all valid frames after outlier filtering.
"count": [int] number of valid samples used for statistics.
"rpe_trans_mean": [float] mean of Relative Pose Error for translation (in meters).
"rpe_trans_median": [float] median of RPE translation.
"rpe_rot_mean_deg": [float] mean of RPE rotation in degrees.
"rpe_rot_median_deg": [float] median of RPE rotation.
},
"translation": { the statistics of pure translation actions (forward/backward/left/right).
"count": [int] number of valid samples for translation actions.
"rpe_trans_mean": [float] mean RPE translation for translation actions.
"rpe_trans_median": [float] median RPE translation for translation actions.
"rpe_rot_mean_deg": [float] mean RPE rotation for translation actions.
"rpe_rot_median_deg": [float] median RPE rotation for translation actions.
},
"rotation": { the statistics of pure rotation actions (cam_left/cam_right/cam_up/cam_down).
"count": [int] number of valid samples for rotation actions.
...
},
"other": { the statistics of combined actions (e.g., forward+look_right).
"count": [int] number of valid samples for other actions.
...
},
"act:forward": { the statistics of specific action "forward".
"count": [int] number of valid samples for this action.
"rpe_trans_mean": [float] mean RPE translation.
"rpe_trans_median": [float] median RPE translation.
"rpe_rot_mean_deg": [float] mean RPE rotation.
"rpe_rot_median_deg": [float] median RPE rotation.
},
"act:look_right": { the statistics of specific action "look_right".
...
},
...
},
"dino": { the dino mse metric result.
"dino_mse": [list[float]] the per-frame mse of dino features.
"avg_dino_mse": [float] the average of dino_mse.
}
},
...
]
}
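For a quick look at an output file, here is a sketch that loads the result JSON and prints a few of the averages. Field names follow the schema above; the aggregation choices are ours.

```python
# Sketch: summarize a result JSON produced by src/process.py.
# Field names follow the schema documented above; the aggregation is illustrative.
import json
import sys
from statistics import mean

def summarize(result_path: str) -> None:
    with open(result_path) as f:
        result = json.load(f)

    lcm_lpips, gsc_lpips, dino_mse = [], [], []
    for item in result["data"]:
        if item.get("error"):
            continue  # skip videos whose metrics failed
        if "lcm" in item:
            lcm_lpips.append(item["lcm"]["avg_lpips"])
        for vid in item.get("video_results", []):
            if not vid.get("error") and "gsc" in vid:
                gsc_lpips.append(vid["gsc"]["avg_lpips"])
        if "dino" in item:
            dino_mse.append(item["dino"]["avg_dino_mse"])

    if lcm_lpips:
        print(f"LCM  avg LPIPS : {mean(lcm_lpips):.4f} over {len(lcm_lpips)} videos")
    if gsc_lpips:
        print(f"GSC  avg LPIPS : {mean(gsc_lpips):.4f} over {len(gsc_lpips)} paths")
    if dino_mse:
        print(f"DINO avg MSE   : {mean(dino_mse):.4f} over {len(dino_mse)} videos")

if __name__ == "__main__":
    summarize(sys.argv[1])
```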
MIND is available here!
MIND-Data
├── 1st_data
│   ├── test
│   │   ├── action_space_test
│   │   │   ├── {gt data name}
│   │   │   │   ├── action.json
│   │   │   │   ├── images.txt
│   │   │   │   └── video.mp4
│   │   │   ...
│   │   │
│   │   └── mem_test
│   │       ├── {gt data name}
│   │       │   ├── action.json
│   │       │   ├── images.txt
│   │       │   └── video.mp4
│   │       ...
│   └── train
│       ├── {gt data name}
│       │   ├── action.json
│       │   └── video.mp4
│       ...
│
└── 3rd_data
    ├── test
    │   ├── action_space_test
    │   │   ├── {gt data name}
    │   │   │   ├── action.json
    │   │   │   ├── images.txt
    │   │   │   └── video.mp4
    │   │   ...
    │   │
    │   └── mem_test
    │       ├── {gt data name}
    │       │   ├── action.json
    │       │   ├── images.txt
    │       │   └── video.mp4
    │       ...
    └── train
        ├── {gt data name}
        │   ├── action.json
        │   └── video.mp4
        ...
The action.json file has the following format:
{
"mark_time": [int] the divider of memory context and expected perdiction; the start frame index of the expected prediction
"total_time": [int] the total frames of the ground truth video
"caption" : [text] the text description of the ground truth video
"data": [
{
"time": [int] frame index
"ws": [int] 0: move forward, 1: move backward
"ad": [int] 0: move left, 1: move right
"ud": [int] 0: look up, 1: look down
"lr": [int] 0: look left, 1: look right
"actor_pos": {
"x": [float] the x-coordinate of the character
"y": [float] the y-coordinate of the character
"z": [float] the z-coordinate of the character
},
"actor_rpy": {
"x": [float] the roll angle of the character (Euler angles)
"y": [float] the pitch angle of the character
"z": [float] the yaw angle of the character
},
"camera_pos": {
# only exists in 3rd-person mode
"x": [float] the x-coordinate of the camera
"y": [float] the y-coordinate of the camera
"z": [float] the z-coordinate of the camera
},
"camera_rpy": {
# only exists in 3rd-person mode
"x": [float] the roll angle of the camera (Euler angles)
"y": [float] the pitch angle of the camera
"z": [float] the yaw angle of the camera
}
},
...
]
}
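For reference, here is a sketch that reads an action.json and decodes the per-frame action flags into readable labels. The key semantics follow the description above; whether a key is omitted when the corresponding action is inactive is our assumption.

```python
# Sketch: decode per-frame actions from an action.json file.
# Key semantics follow the documentation above; whether a key is omitted
# when the action is inactive is an assumption here.
import json
import sys

ACTION_NAMES = {
    ("ws", 0): "move_forward",  ("ws", 1): "move_backward",
    ("ad", 0): "move_left",     ("ad", 1): "move_right",
    ("ud", 0): "look_up",       ("ud", 1): "look_down",
    ("lr", 0): "look_left",     ("lr", 1): "look_right",
}

def decode_actions(action_json_path: str) -> None:
    with open(action_json_path) as f:
        meta = json.load(f)
    print(f"caption   : {meta['caption']}")
    print(f"mark_time : {meta['mark_time']}  total_time: {meta['total_time']}")
    for frame in meta["data"]:
        labels = [ACTION_NAMES[(k, frame[k])]
                  for k in ("ws", "ad", "ud", "lr") if k in frame]
        pos = frame["actor_pos"]
        print(f"t={frame['time']:4d}  actions={labels}  "
              f"actor=({pos['x']:.2f}, {pos['y']:.2f}, {pos['z']:.2f})")

if __name__ == "__main__":
    decode_actions(sys.argv[1])
```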
The leaderboard is coming...
If you find our work helpful, we would appreciate your citation and a star:
@misc{ye2026mind,
title={MIND: Benchmarking Memory Consistency and Action Control in World Models},
author={Yixuan Ye and Xuanyu Lu and Yuxin Jiang and Yuchao Gu and Rui Zhao and Qiwei Liang and Jiachun Pan and Fengda Zhang and Weijia Wu and Alex Jinpeng Wang},
year={2026},
eprint={xxx},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/xxx},
}
Please send an email to yixuanye12@gmail.com if you have any questions.
We would like to thank ViPE and SkyReels-V2 for their great work.


