Skip to content

AIGeeksGroup/MInM

Repository files navigation

MInM: Mask Instance Modeling for Visual Representation Learning

This is the official repository for the paper:

MInM: Mask Instance Modeling for Visual Representation Learning

Yiran Wangb,1, Junlin Longb,1, Zeyu Zhanga,2, Rong Fuc, Ruicheng Zhangd, Rundong Xuee, Zirui Songf, Renda Hang, Hoi Leong Leeh, Xiuying Chenf and Yang Zhaoa,*

aLa Trobe University   bUniversity of Sydney   cUniversity of Macau   dTsinghua University   eXi'an Jiaotong University   fMBZUAI   gTianjin University   hUniversiti Malaysia Perlis

1Equal contribution, co-first authors. 2Project lead. *Corresponding author.

Note

⚠️ Repository Structure: This repo contains the MInM pre-training framework (extending MAE with instance-guided masking) and evaluation pipelines for ImageNet-1K, Pascal VOC, and Imagenette.

Citation

If you find our work useful, please cite:

@article{wang2026minm,
  title={MInM: Mask Instance Modeling for Visual Representation Learning},
  author={Wang, Yiran and Long, Junlin and Zhang, Zeyu and Fu, Rong and Zhang, Ruicheng and Xue, Rundong and Song, Zirui and Han, Renda and Lee, Hoi Leong and Chen, Xiuying and Zhao, Yang},
  journal={arXiv preprint arXiv:xxxx.xxxxx},
  year={2026}
}

Introduction

Masked image modeling (MIM) has emerged as a powerful self-supervised learning paradigm in computer vision, inspired by the success of masked language modeling in NLP. By masking parts of the input image and training the model to reconstruct the missing content, MIM enables the learning of rich and transferable visual representations without requiring manual annotations. Recent methods such as MAE, BEiT, and SimMIM have demonstrated strong performance on large-scale benchmarks.

Despite their success, existing MIM methods predominantly rely on random masking strategies that treat all image regions equally, regardless of their semantic content. Common prediction targets, such as pixel-level or discrete tokens, often fail to align with human perception, leading to semantically ambiguous representations. As a result, the model may allocate excessive capacity to reconstructing redundant background content, weakening its ability to learn representations useful for downstream tasks.

We present MInM (Mask Instance Modeling), a novel masked image modeling framework that leverages instance-aware saliency masks to guide visual representation learning. Instead of applying uniformly distributed random occlusion, MInM deliberately identifies foreground instance areas derived from SAM2 as the primary reconstruction objective. Built upon the MAE architecture, MInM integrates a task-aligned masking pipeline that improves both global and localized representation quality — without any modifications to the encoder or decoder.

Key contributions:

  • Instance-Guided Masking Framework: We introduce MInM, a novel instance-guided masked image modeling framework that leverages semantic masks to enhance visual representation learning.
  • Task-Aligned Masking Strategy: We propose a masking strategy based on high-quality instance segmentation masks from SAM2, which encourages the model to reconstruct foreground content while ignoring background redundancy.
  • Extensive Validation: We validate the effectiveness of MInM across multiple datasets, including ImageNet-1K, Pascal VOC, and Imagenette.

⚙️ Installation

1. Environment Setup

# 1. Create environment
conda create -n minm python=3.10 -y
conda activate minm

# 2. Install dependencies
pip install -r requirements.txt

2. User Configuration (📌 Input Required)

You must configure API keys and data paths before running experiments.

Option A: Environment Variables

# --- For W&B Experiment Tracking ---
WANDB_API_KEY="your-wandb-key"

Option B: Dataset Preparation

  • ImageNet-1K: Download and organize into train/ and val/ directories following the PyTorch ImageNet format.
  • ImageNette: Automatically downloaded and extracted by the training scripts from HuggingFace.
  • Pascal VOC: Download VOC 2007 + VOC 2012 and organize following MMDetection format.

🔧 Instance Mask Generation

Generate SAM-based instance masks for your dataset. This is a prerequisite for MInM pre-training.

Requires SAM2.

# Generate instance masks for ImageNet-1K
python tools/generate_sam_masks.py \
  --dataset imagenet \
  --output_dir data/imagenet/instance_masks

# Generate instance masks for Imagenette
python tools/generate_sam_masks.py \
  --dataset imagenette \
  --output_dir data/imagenette/instance_masks

🧪 Experiments & Reproducibility

1. MInM Pre-training

Pre-train a ViT with instance-guided masking.

# MInM on ImageNet-1K (ViT-Base, 600 epochs, multi-node)
python tools/imagenet_1kminm_parallel.py \
  --epochs 600 \
  --batch_size 256 \
  --blr 5e-4 \
  --model mae_vit_base_patch16 \
  --warmup_epochs 80 \
  --data_path /path/to/imagenet

# MInM on Imagenette (ViT-Base, 100 epochs)
python tools/train_imagenette.py \
  --epochs 100 \
  --batch_size 32 \
  --blr 1.5e-3 \
  --model mae_vit_base_patch16

2. MAE Baseline Pre-training

Train the standard MAE baseline for comparison.

# MAE on ImageNette
python tools/train_imagenette_mae.py \
  --epochs 125 \
  --batch_size 32 \
  --blr 1.5e-3 \
  --model mae_vit_base_patch16

# MAE on ImageNet-1K (multi-node)
python tools/submitit_pretrain.py \
  --job_dir ./output \
  --nodes 8 \
  --batch_size 64 \
  --model mae_vit_large_patch16 \
  --mask_ratio 0.75 \
  --norm_pix_loss \
  --epochs 800

3. Linear Probing

Evaluate the quality of learned representations by training a linear classifier on frozen features.

python tools/main_linprobe.py \
  --batch_size 512 \
  --model vit_base_patch16 \
  --finetune /path/to/pretrain_checkpoint.pth \
  --epochs 90 \
  --data_path /path/to/imagenet

4. Fine-tuning

Fine-tune the pre-trained model end-to-end for downstream classification.

python tools/main_finetune.py \
  --batch_size 32 \
  --model vit_base_patch16 \
  --finetune /path/to/pretrain_checkpoint.pth \
  --epochs 100 \
  --data_path /path/to/imagenet

📊 Results

Pre-trained Checkpoints

ViT-Base ViT-Large ViT-Huge
MAE (Baseline) download download download
MInM (Ours) coming soon coming soon coming soon

ImageNet-1K Linear Probing (ViT-B/16, 600 epochs)

Method Top-1 (%) Top-5 (%)
MAE (Baseline) 53.15 80.59
MInM (Ours, best) 38.25 81.50

MInM surpasses MAE in Top-5 accuracy under the same training protocol, indicating stronger semantic coverage. See the paper for detailed hyperparameter ablations (Table 3).

Pascal VOC 2007 Object Detection (mAP %)

Method mAP
Faster R-CNN + R50-FPN (ours) 75.3
Faster R-CNN + MInM ViT (ours) 34.5
Faster R-CNN + MAE ViT (ours) 33.5

MInM-ViT consistently outperforms MAE-ViT across the majority of semantic categories, with particular prominence on categories such as cat, dog, and sofa.

Imagenette Linear Probing (ViT-B/16)

Method Epochs Top-1 (%) Top-5 (%)
MAE (Baseline) 100 59.87 93.48
MInM (Tuned) 100 60.28 93.55
MInM (Long-horizon) 400 69.38 95.75

Long-horizon training reveals MInM's stronger training persistence: top-1 accuracy continues to improve steadily from 56.23% (epoch 100) to 69.38% (epoch 400).

📁 Directory Structure

.
├── assets/                  # Images for README
├── models/                  # Model architectures
│   ├── models_mae.py        # Standard MAE (ViT encoder-decoder)
│   ├── models_minm.py       # MInM with InstanceGuidedMasking
│   └── models_vit.py        # Vision Transformer utilities
├── engine/                  # Training & evaluation loops
│   ├── engine_pretrain.py   # MAE pre-training loop
│   ├── engine_pretrain_minm.py  # MInM pre-training loop
│   ├── engine_finetune.py   # Fine-tuning loop
│   └── engine_probing.py    # Linear probing evaluation
├── tools/                   # Entry-point scripts
│   ├── train_imagenette.py  # MInM training on ImageNette
│   ├── train_imagenette_mae.py  # MAE baseline on ImageNette
│   ├── imagenet_1kminm_parallel.py # Parallel MInM training on ImageNet-1K
│   ├── generate_sam_masks.py    # SAM instance mask generation
│   ├── main_pretrain.py     # MAE pre-training (ImageNet)
│   ├── main_finetune.py     # Fine-tuning script
│   ├── main_linprobe.py     # Linear probing script
│   └── submitit_*.py        # Distributed training wrappers
├── util/                    # Utilities (LR scheduling, LARS, etc.)
├── data/                    # Dataset storage
│   ├── imagenet/            # ImageNet-1K + SAM masks
│   └── imagenette/          # Imagenette + SAM masks
├── configs/                 # Configuration YAMLs
├── docs/                    # Additional documentation
│   ├── PRETRAIN.md          # Pre-training instructions
│   └── FINETUNE.md          # Fine-tuning instructions
├── demo/                    # Visualization demos
├── output/                  # Checkpoints & logs
├── requirements.txt         # Python dependencies
└── LICENSE                  # CC-BY-NC 4.0

Acknowledgements

We acknowledge the use of the following resources:

  • MAE: Masked Autoencoders Are Scalable Vision Learners.
  • SAM2: Segment Anything Model 2 for instance mask generation.
  • DeiT: Data-efficient Image Transformers.
  • timm: PyTorch Image Models.

License

This project is licensed under the CC-BY-NC 4.0 License.

About

MInM: Mask Instance Modeling for Visual Representation Learning

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors