This is the official repository for the paper:
MInM: Mask Instance Modeling for Visual Representation Learning
Yiran Wang<sup>b,1</sup>, Junlin Long<sup>b,1</sup>, Zeyu Zhang<sup>a,2</sup>, Rong Fu<sup>c</sup>, Ruicheng Zhang<sup>d</sup>, Rundong Xue<sup>e</sup>, Zirui Song<sup>f</sup>, Renda Han<sup>g</sup>, Hoi Leong Lee<sup>h</sup>, Xiuying Chen<sup>f</sup>, and Yang Zhao<sup>a,*</sup>

<sup>a</sup>La Trobe University, <sup>b</sup>University of Sydney, <sup>c</sup>University of Macau, <sup>d</sup>Tsinghua University, <sup>e</sup>Xi'an Jiaotong University, <sup>f</sup>MBZUAI, <sup>g</sup>Tianjin University, <sup>h</sup>Universiti Malaysia Perlis

<sup>1</sup>Equal contribution, co-first authors. <sup>2</sup>Project lead. <sup>*</sup>Corresponding author.
Note
If you find our work useful, please cite:
```bibtex
@article{wang2026minm,
  title={MInM: Mask Instance Modeling for Visual Representation Learning},
  author={Wang, Yiran and Long, Junlin and Zhang, Zeyu and Fu, Rong and Zhang, Ruicheng and Xue, Rundong and Song, Zirui and Han, Renda and Lee, Hoi Leong and Chen, Xiuying and Zhao, Yang},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2026}
}
```

Masked image modeling (MIM) has emerged as a powerful self-supervised learning paradigm in computer vision, inspired by the success of masked language modeling in NLP. By masking parts of the input image and training the model to reconstruct the missing content, MIM learns rich and transferable visual representations without requiring manual annotations. Recent methods such as MAE, BEiT, and SimMIM have demonstrated strong performance on large-scale benchmarks.
Despite their success, existing MIM methods predominantly rely on random masking strategies that treat all image regions equally, regardless of their semantic content. Common prediction targets, such as raw pixel values or discrete tokens, often fail to align with human perception, yielding semantically ambiguous representations. As a result, the model may allocate excessive capacity to reconstructing redundant background content, weakening the representations it learns for downstream tasks.
We present MInM (Mask Instance Modeling), a masked image modeling framework that leverages instance-aware saliency masks to guide visual representation learning. Instead of applying uniformly distributed random occlusion, MInM deliberately targets foreground instance regions, derived from SAM2, as the primary reconstruction objective. Built upon the MAE architecture, MInM integrates a task-aligned masking pipeline that improves both global and localized representation quality, without any modifications to the encoder or decoder.
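To make the idea concrete, here is a minimal sketch of instance-guided patch masking: given a binary foreground mask, patches are sampled for masking with probability weighted by their foreground coverage. This is an illustrative assumption, not the repository's `InstanceGuidedMasking` implementation; the function name and weighting scheme are hypothetical.

```python
import numpy as np

def instance_guided_mask(instance_mask, patch_size=16, mask_ratio=0.75, rng=None):
    """Choose which patches to mask, preferring foreground (instance) patches.

    instance_mask: (H, W) binary array from an instance segmenter (e.g. SAM2).
    Returns a boolean (num_patches,) array: True = patch is masked/reconstructed.
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W = instance_mask.shape
    h, w = H // patch_size, W // patch_size
    # Per-patch foreground fraction, computed by reshaping into a patch grid.
    patches = instance_mask[:h * patch_size, :w * patch_size].reshape(
        h, patch_size, w, patch_size)
    fg_frac = patches.mean(axis=(1, 3)).reshape(-1)  # (h * w,)
    num_patches = h * w
    num_masked = int(round(mask_ratio * num_patches))
    # Sample without replacement; a small floor keeps background selectable.
    weights = fg_frac + 1e-3
    weights = weights / weights.sum()
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False, p=weights)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_idx] = True
    return mask
```

With a high foreground weight, nearly all instance patches end up masked, so the reconstruction loss concentrates on object content rather than background.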
Key contributions:
- Instance-Guided Masking Framework: We introduce MInM, a novel instance-guided masked image modeling framework that leverages semantic masks to enhance visual representation learning.
- Task-Aligned Masking Strategy: We propose a masking strategy based on high-quality instance segmentation masks from SAM2, which encourages the model to reconstruct foreground content while ignoring background redundancy.
- Extensive Validation: We validate the effectiveness of MInM across multiple datasets, including ImageNet-1K, Pascal VOC, and Imagenette.
```shell
# 1. Create environment
conda create -n minm python=3.10 -y
conda activate minm

# 2. Install dependencies
pip install -r requirements.txt
```

You must configure API keys and data paths before running experiments.
Option A: Environment Variables
```shell
# --- For W&B Experiment Tracking ---
export WANDB_API_KEY="your-wandb-key"
```

Option B: Dataset Preparation
- ImageNet-1K: Download and organize into `train/` and `val/` directories following the PyTorch ImageNet format.
- Imagenette: Automatically downloaded and extracted by the training scripts from HuggingFace.
- Pascal VOC: Download VOC 2007 + VOC 2012 and organize following MMDetection format.
Generate SAM-based instance masks for your dataset. This is a prerequisite for MInM pre-training.
Requires SAM2.
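Conceptually, mask generation reduces SAM2's per-image instance proposals to a single binary foreground map. A minimal sketch of that reduction is below; it assumes SAM-style output records (dicts with a boolean `segmentation` array and an `area` value), and the actual fields and logic in `generate_sam_masks.py` may differ.

```python
import numpy as np

def foreground_from_sam(sam_masks, image_shape, top_k=3):
    """Union the top_k largest instance masks into one binary foreground map.

    sam_masks: list of dicts with a boolean 'segmentation' array and an 'area'
    value, the record format produced by SAM-family automatic mask generators.
    """
    fg = np.zeros(image_shape, dtype=bool)
    # Keep the largest instances; small fragments are usually noise.
    for m in sorted(sam_masks, key=lambda d: d["area"], reverse=True)[:top_k]:
        fg |= m["segmentation"]
    return fg.astype(np.uint8)
```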
```shell
# Generate instance masks for ImageNet-1K
python tools/generate_sam_masks.py \
    --dataset imagenet \
    --output_dir data/imagenet/instance_masks

# Generate instance masks for Imagenette
python tools/generate_sam_masks.py \
    --dataset imagenette \
    --output_dir data/imagenette/instance_masks
```

Pre-train a ViT with instance-guided masking.
```shell
# MInM on ImageNet-1K (ViT-Base, 600 epochs, multi-node)
python tools/imagenet_1kminm_parallel.py \
    --epochs 600 \
    --batch_size 256 \
    --blr 5e-4 \
    --model mae_vit_base_patch16 \
    --warmup_epochs 80 \
    --data_path /path/to/imagenet
```
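The `--blr` flag is a base learning rate, not the actual one. Assuming this repo keeps MAE's convention, the effective rate is scaled linearly with the total batch size:

```python
def effective_lr(blr, batch_size, num_gpus=1, accum_iter=1):
    # MAE's linear scaling rule: lr = base_lr * total_batch_size / 256
    return blr * (batch_size * num_gpus * accum_iter) / 256
```

For example, `--blr 5e-4 --batch_size 256` on a single GPU yields an effective rate of 5e-4, while multi-node runs scale it up proportionally.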
```shell
# MInM on Imagenette (ViT-Base, 100 epochs)
python tools/train_imagenette.py \
    --epochs 100 \
    --batch_size 32 \
    --blr 1.5e-3 \
    --model mae_vit_base_patch16
```

Train the standard MAE baseline for comparison.
```shell
# MAE on Imagenette
python tools/train_imagenette_mae.py \
    --epochs 125 \
    --batch_size 32 \
    --blr 1.5e-3 \
    --model mae_vit_base_patch16
```
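The ImageNet-1K command below passes `--norm_pix_loss`, which in MAE switches the reconstruction target from raw pixels to per-patch normalized pixels. A numpy sketch of that target (illustrative, not the repo's code):

```python
import numpy as np

def normalized_pixel_target(patches, eps=1e-6):
    """Per-patch normalized reconstruction target (MAE's --norm_pix_loss).

    patches: (num_patches, patch_dim) flattened pixel patches. Each patch is
    standardized by its own mean and variance, so the loss emphasizes local
    structure rather than absolute brightness.
    """
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    return (patches - mean) / np.sqrt(var + eps)
```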
```shell
# MAE on ImageNet-1K (multi-node)
python tools/submitit_pretrain.py \
    --job_dir ./output \
    --nodes 8 \
    --batch_size 64 \
    --model mae_vit_large_patch16 \
    --mask_ratio 0.75 \
    --norm_pix_loss \
    --epochs 800
```

Evaluate the quality of learned representations by training a linear classifier on frozen features.
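Conceptually, linear probing fits a linear classifier on top of frozen encoder features. A toy softmax-regression sketch with plain gradient descent (the repo's `main_linprobe.py` presumably uses a LARS-based optimizer; this is only the underlying idea):

```python
import numpy as np

def linear_probe(features, labels, num_classes, lr=0.1, steps=200, seed=0):
    """Fit a softmax classifier on frozen features with plain gradient descent."""
    rng = np.random.default_rng(seed)
    n, d = features.shape
    W = 0.01 * rng.standard_normal((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                  # softmax cross-entropy grad
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b
```

The encoder is never updated; only `W` and `b` are trained, which is why probing accuracy is a direct read-out of representation quality.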
```shell
python tools/main_linprobe.py \
    --batch_size 512 \
    --model vit_base_patch16 \
    --finetune /path/to/pretrain_checkpoint.pth \
    --epochs 90 \
    --data_path /path/to/imagenet
```

Fine-tune the pre-trained model end-to-end for downstream classification.
```shell
python tools/main_finetune.py \
    --batch_size 32 \
    --model vit_base_patch16 \
    --finetune /path/to/pretrain_checkpoint.pth \
    --epochs 100 \
    --data_path /path/to/imagenet
```

| Model | ViT-Base | ViT-Large | ViT-Huge |
|---|---|---|---|
| MAE (Baseline) | download | download | download |
| MInM (Ours) | coming soon | coming soon | coming soon |
| Method | Top-1 (%) | Top-5 (%) |
|---|---|---|
| MAE (Baseline) | 53.15 | 80.59 |
| MInM (Ours, best) | 38.25 | 81.50 |
Under the same training protocol, MInM surpasses MAE in Top-5 accuracy (81.50% vs. 80.59%), indicating stronger semantic coverage. See the paper for detailed hyperparameter ablations (Table 3).
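Top-1 and Top-5 throughout these tables are standard top-k accuracy; for reference, a minimal numpy implementation:

```python
import numpy as np

def topk_accuracy(logits, labels, k=5):
    """Fraction of samples whose true class is among the k largest logits."""
    topk = np.argsort(-logits, axis=1)[:, :k]
    return float((topk == labels[:, None]).any(axis=1).mean())
```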
| Method | mAP |
|---|---|
| Faster R-CNN + R50-FPN (ours) | 75.3 |
| Faster R-CNN + MInM ViT (ours) | 34.5 |
| Faster R-CNN + MAE ViT (ours) | 33.5 |
MInM-ViT outperforms MAE-ViT on the majority of semantic categories, with notable gains on categories such as cat, dog, and sofa.
| Method | Epochs | Top-1 (%) | Top-5 (%) |
|---|---|---|---|
| MAE (Baseline) | 100 | 59.87 | 93.48 |
| MInM (Tuned) | 100 | 60.28 | 93.55 |
| MInM (Long-horizon) | 400 | 69.38 | 95.75 |
Long-horizon training highlights MInM's persistence: in the 400-epoch run, Top-1 accuracy improves steadily from 56.23% at epoch 100 to 69.38% at epoch 400.
```text
.
├── assets/                       # Images for README
├── models/                       # Model architectures
│   ├── models_mae.py             # Standard MAE (ViT encoder-decoder)
│   ├── models_minm.py            # MInM with InstanceGuidedMasking
│   └── models_vit.py             # Vision Transformer utilities
├── engine/                       # Training & evaluation loops
│   ├── engine_pretrain.py        # MAE pre-training loop
│   ├── engine_pretrain_minm.py   # MInM pre-training loop
│   ├── engine_finetune.py        # Fine-tuning loop
│   └── engine_probing.py         # Linear probing evaluation
├── tools/                        # Entry-point scripts
│   ├── train_imagenette.py       # MInM training on Imagenette
│   ├── train_imagenette_mae.py   # MAE baseline on Imagenette
│   ├── imagenet_1kminm_parallel.py  # Parallel MInM training on ImageNet-1K
│   ├── generate_sam_masks.py     # SAM instance mask generation
│   ├── main_pretrain.py          # MAE pre-training (ImageNet)
│   ├── main_finetune.py          # Fine-tuning script
│   ├── main_linprobe.py          # Linear probing script
│   └── submitit_*.py             # Distributed training wrappers
├── util/                         # Utilities (LR scheduling, LARS, etc.)
├── data/                         # Dataset storage
│   ├── imagenet/                 # ImageNet-1K + SAM masks
│   └── imagenette/               # Imagenette + SAM masks
├── configs/                      # Configuration YAMLs
├── docs/                         # Additional documentation
│   ├── PRETRAIN.md               # Pre-training instructions
│   └── FINETUNE.md               # Fine-tuning instructions
├── demo/                         # Visualization demos
├── output/                       # Checkpoints & logs
├── requirements.txt              # Python dependencies
└── LICENSE                       # CC-BY-NC 4.0
```
We acknowledge the use of the following resources:
- MAE: Masked Autoencoders Are Scalable Vision Learners.
- SAM2: Segment Anything Model 2 for instance mask generation.
- DeiT: Data-efficient Image Transformers.
- timm: PyTorch Image Models.
This project is licensed under the CC-BY-NC 4.0 License.