8 changes: 7 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,11 @@
# Changelog


## [0.0.2] - 2026-03-16

Release of V-JEPA 2.1


## [0.0.1] - 2025-06-05

Initial release of V-JEPA 2 codebase
141 changes: 123 additions & 18 deletions README.md
@@ -1,3 +1,9 @@

🆕 **[2026-03-16]:** :fire: V-JEPA 2.1 is released :fire: A new family of models trained with a novel recipe that learns high-quality and temporally consistent dense features!

**[2025-06-25]:** V-JEPA 2 is released. [[`Blog`](https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks)]


# V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

### [Meta FAIR](https://ai.meta.com/research/)
@@ -13,19 +19,45 @@ Rabbat*, Nicolas Ballas*

[[`Paper`](https://arxiv.org/abs/2506.09985)] [[`Blog`](https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks)] [[`BibTex`](#Citation)]

Official Pytorch codebase for V-JEPA 2 and V-JEPA 2-AC.
Official PyTorch codebase for V-JEPA 2, V-JEPA 2-AC, and V-JEPA 2.1.

V-JEPA 2 is a self-supervised approach to training video encoders, using internet-scale video data, that attains state-of-the-art performance on motion understanding and human action anticpation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
V-JEPA 2 is a self-supervised approach to training video encoders, using internet-scale video data, that attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

<p align="center">
<img src="assets/flowchart.png" width=100%>
</p>

<!---
## Updates

* **[Jun-6-25]:** V-JEPA 2 is released. [[`Blog`](https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks)]
--->

## V-JEPA 2.1 Pre-training

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael
Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

[[`Paper`](https://arxiv.org/abs/TODO)] [[`BibTex`](#Citation)]

V-JEPA 2.1 improves the training recipe to focus on learning high-quality and temporally consistent dense features, as highlighted by PCA visualizations:

<p align="center">
<img src="assets/teaser_screenshot_5dice.png" width=100%>
</p>

The V-JEPA 2.1 approach leverages: (1) **Dense Predictive Loss**, a masking-based
self-supervision objective where all tokens (both visible/context and masked tokens) contribute to the
self-supervised training loss; (2) **Deep Self-Supervision**, which applies the self-supervised loss at multiple
intermediate representations of the encoder models; (3) **Multi-Modal Tokenizers** for images and videos;
and we show that our approach benefits from (4) **Model and data scaling**.

<p align="center">
<img src="assets/architecture_vjepa2_1.jpg" width=100%>
</p>
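As a toy illustration of objectives (1) and (2) above (a minimal sketch with random arrays, not the repository's actual implementation; the layer selection, masking strategy, and loss weighting are defined by the training configs):

```python
import numpy as np

def dense_predictive_loss(pred, target):
    # (1) Dense predictive loss: every token contributes to the loss,
    # visible/context tokens and masked tokens alike.
    return np.abs(pred - target).mean()

def deep_self_supervised_loss(preds_per_layer, targets_per_layer):
    # (2) Deep self-supervision: apply the dense loss at several
    # intermediate encoder depths and average the per-layer losses.
    layer_losses = [
        dense_predictive_loss(p, t)
        for p, t in zip(preds_per_layer, targets_per_layer)
    ]
    return float(np.mean(layer_losses))

# Stand-ins for predictor outputs and target-encoder features at 3 depths,
# each of shape (batch, tokens, dim).
rng = np.random.default_rng(0)
preds = [rng.standard_normal((2, 196, 64)) for _ in range(3)]
targets = [rng.standard_normal((2, 196, 64)) for _ in range(3)]
print(deep_self_supervised_loss(preds, targets))
```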

V-JEPA 2.1 performance across dense and global prediction tasks:

<p align="center">
<img src="assets/bars_teaser_tikz-1.png" width=100%>
</p>


## V-JEPA 2 Pre-training

@@ -35,7 +67,7 @@ V-JEPA 2 is a self-supervised approach to training video encoders, using interne
<table>
<tr>
<th colspan="1">Benchmark</th>
<th colspan="1">VJEPA 2</th>
<th colspan="1">V-JEPA 2</th>
<th colspan="1">Previous Best</th>
</tr>
<tr>
@@ -67,7 +99,7 @@ V-JEPA 2 is a self-supervised approach to training video encoders, using interne

## V-JEPA 2-AC Post-training

**(Top)** After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments, and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. **(Bottom)** Performance on robot maniuplation tasks using a Franka arm, with input provided through a monocular RGB camera.
**(Top)** After post-training with a small amount of robot data, we can deploy the model on a robot arm in new environments, and tackle foundational tasks like reaching, grasping, and pick-and-place by planning from image goals. **(Bottom)** Performance on robot manipulation tasks using a Franka arm, with input provided through a monocular RGB camera.

<img align="left" src="https://github.com/user-attachments/assets/c5d42221-0102-4216-911d-061a4369a805" width=65%>&nbsp;
<table>
@@ -111,15 +143,19 @@ V-JEPA 2 is a self-supervised approach to training video encoders, using interne
</tr>
</table>





## Models

### V-JEPA 2
### V-JEPA 2 and V-JEPA 2.1

#### HuggingFace

See our [HuggingFace collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6) for V-JEPA 2.
See our HuggingFace [collection](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6) for V-JEPA 2.

#### Pretrained Checkpoints
#### V-JEPA 2 Pretrained Checkpoints

<table>
<tr>
@@ -159,6 +195,51 @@ See our [HuggingFace collection](https://huggingface.co/collections/facebook/v-j
</tr>
</table>

#### V-JEPA 2.1 Pretrained Checkpoints

<table>
<tr>
<th colspan="1">Model</th>
<th colspan="1">#Parameters</th>
<th colspan="1">Resolution</th>
<th colspan="1">Download Link</th>
<th colspan="1">Pretraining Config</th>
</tr>

<tr>
<td>ViT-B/16</td>
<td>80M</td>
<td>384</td>
<td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitb_dist_vitG_384.pt">checkpoint</a></td>
<td><a href="configs/train_2_1/vitb16">configs</a></td>
</tr>

<tr>
<td>ViT-L/16</td>
<td>300M</td>
<td>384</td>
<td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitl_dist_vitG_384.pt">checkpoint</a></td>
<td><a href="configs/train_2_1/vitl16">configs</a></td>
</tr>

<tr>
<td>ViT-g/16</td>
<td>1B</td>
<td>384</td>
<td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitg_384.pt">checkpoint</a></td>
<td><a href="configs/train_2_1/vitg16">configs</a></td>
</tr>

<tr>
<td>ViT-G/16</td>
<td>2B</td>
<td>384</td>
<td><a href="https://dl.fbaipublicfiles.com/vjepa2/vjepa2_1_vitG_384.pt">checkpoint</a></td>
<td><a href="configs/train_2_1/vitG16">configs</a></td>
</tr>
</table>


#### Pretrained backbones (via PyTorch Hub)

Please install [PyTorch](https://pytorch.org/get-started/locally/), [timm](https://pypi.org/project/timm/) and [einops](https://pypi.org/project/einops/) locally, then run the following to load each model. Installing PyTorch with CUDA support is strongly recommended.
@@ -169,16 +250,22 @@ import torch
# preprocessor
processor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_preprocessor')
# models
# V-JEPA 2
vjepa2_vit_large = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_large')
vjepa2_vit_huge = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_huge')
vjepa2_vit_giant = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant')
vjepa2_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_vit_giant_384')
# V-JEPA 2.1
vjepa2_1_vit_base_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_base_384')
vjepa2_1_vit_large_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_large_384')
vjepa2_1_vit_giant_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_giant_384')
vjepa2_1_vit_gigantic_384 = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_1_vit_gigantic_384')

```
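As a rough orientation for the model names above (a sketch based on the ViT-x/16 naming and common video-ViT tokenization; the 16x16 patch and 2-frame tubelet sizes are assumptions, not verified against the code):

```python
def estimated_num_tokens(frames, height, width, patch=16, tubelet=2):
    # ViT-x/16-style tokenization: 16x16 spatial patches and, commonly,
    # 2-frame temporal tubelets (both values assumed for illustration).
    return (frames // tubelet) * (height // patch) * (width // patch)

# A 64-frame clip at the 384px resolution used by the _384 variants:
print(estimated_num_tokens(64, 384, 384))  # -> 18432 tokens
```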

#### Pretrained checkpoints on Huggingface

You can also use our pretrained checkpoints on [Huggingface](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6).
You can also use our pretrained checkpoints on [Huggingface for V-JEPA 2](https://huggingface.co/collections/facebook/v-jepa-2-6841bad8413014e185b497a6).

```python
from transformers import AutoVideoProcessor, AutoModel
@@ -189,7 +276,6 @@ hf_repo = "facebook/vjepa2-vitg-fpc64-256"
# facebook/vjepa2-vitg-fpc64-256
# facebook/vjepa2-vitg-fpc64-384


model = AutoModel.from_pretrained(hf_repo)
processor = AutoVideoProcessor.from_pretrained(hf_repo)
```
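A hedged input sketch to go with the snippet above (the frame count and layout are assumptions inferred from the `fpc64-256` repo name, i.e. 64 frames per clip at 256px; the processor and model calls are shown commented out since they require downloading weights):

```python
import numpy as np

# Dummy clip standing in for real video: 64 RGB frames at 256x256
# (layout assumed from the fpc64-256 name; adjust for your data).
frames = [np.random.rand(256, 256, 3).astype(np.float32) for _ in range(64)]

# With `model` and `processor` loaded as above, inference would look like:
# inputs = processor(frames, return_tensors="pt")
# outputs = model(**inputs)
# features = outputs.last_hidden_state  # per-token video embeddings
print(len(frames), frames[0].shape)
```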
@@ -278,8 +364,11 @@ import torch
vjepa2_encoder, vjepa2_ac_predictor = torch.hub.load('facebookresearch/vjepa2', 'vjepa2_ac_vit_giant')
```


See [energy_landscape_example.ipynb](notebooks/energy_landscape_example.ipynb) for an example notebook computing the energy landscape of the pretrained action-conditioned backbone using a robot trajectory collected from our lab.
To run this notebook, you'll need to aditionally install [Jupyter](https://jupyter.org/install) and [Scipy](https://scipy.org/install/) in your conda environment.
To run this notebook, you'll need to additionally install [Jupyter](https://jupyter.org/install) and [Scipy](https://scipy.org/install/) in your conda environment.



## Getting Started

@@ -291,6 +380,8 @@ conda activate vjepa2-312
pip install . # or `pip install -e .` for development mode
```

**Note to macOS users:** V-JEPA 2 relies on [`decord`](https://github.com/dmlc/decord), which does not support macOS (and, unfortunately, is also no longer under development). In order to run the V-JEPA 2 code on macOS, you will need a different `decord` implementation. We do not make specific recommendations, although some users have reported the use of [`eva-decord`](https://github.com/georgia-tech-db/eva-decord) (see [PR 1](https://github.com/facebookresearch/vjepa2/pull/1)) or [`decord2`](https://github.com/johnnynunez/decord2) (see [PR 31](https://github.com/facebookresearch/vjepa2/pull/31)). We leave the selection of the `decord` package up to the user's discretion.

### Usage Demo

See [vjepa2_demo.ipynb](notebooks/vjepa2_demo.ipynb) [(Colab Link)](https://colab.research.google.com/github/facebookresearch/vjepa2/blob/main/notebooks/vjepa2_demo.ipynb) or [vjepa2_demo.py](notebooks/vjepa2_demo.py) for an example of how to load both the HuggingFace and PyTorch V-JEPA 2 models and run inference on a sample video to get a sample classification result.
@@ -316,7 +407,7 @@ Probe-based evaluation consists in training an attentive probe on top of frozen

Evaluations can be run either locally, or distributed via SLURM. (Running locally is useful for debugging and validation).
These sample commands launch Something-Something v2 video classification; other evals are launched by specifying the corresponding config.
Use provided training configs under "Evaluation Attentive Probes". These configs allow to train multiple probes in parrallel with various optimization parameters.
Use the provided training configs under "Evaluation Attentive Probes". These configs allow training multiple probes in parallel with various optimization parameters.
Change filepaths as needed (e.g. `folder`, `checkpoint`, `dataset_train`, `dataset_val`) to match locations of data and downloaded checkpoints on your local filesystem.
Change the number of nodes and the local batch size as needed so as not to exceed available GPU memory.

@@ -396,13 +487,16 @@ python -m app.main_distributed \
```
.
├── app # training loops
│ ├── vjepa # video JEPA pre-training
│ ├── vjepa # V-JEPA 2 pre-training
│ ├── vjepa_2_1 # V-JEPA 2.1 pre-training
│ ├── vjepa_droid # training the action-conditioned model
│ ├── main_distributed.py # entrypoint for launch app on slurm cluster
│ └── main.py # entrypoint for launch app locally on your machine
├── configs # config files with experiment params for training and evaluation
│ ├── train # pretraining (phase 1), cooldown (phase 2), and action-conditioned training
│ ├── train # pretraining with V-JEPA 2 (phase 1), cooldown (phase 2), and action-conditioned training
│ ├── train_2_1 # pretraining with V-JEPA 2.1 (phase 1), cooldown (phase 2)
│ └── eval # frozen evaluations
│ └── inference # inference only frozen evaluations
├── evals # evaluation loops training an attentive probe with frozen backbone...
│ ├── action_anticipation_frozen # action anticipation
│ ├── image_classification_frozen # image understanding
@@ -430,7 +524,8 @@ are licensed under the Apache 2.0 license.


## Citation
If you find this repository useful in your research, please consider giving a star :star: and a citation
If you find this repository useful in your research, please consider giving it a star :star: and citing the papers:

```bibtex
@article{assran2025vjepa2,
title={V-JEPA~2: Self-Supervised Video Models Enable Understanding, Prediction and Planning},
@@ -444,3 +539,13 @@ Rabbat, Michael and Ballas, Nicolas},
year={2025}
}
```

```bibtex
@article{murlabadia2026vjepa2_1,
title={V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning},
author={Mur-Labadia, Lorenzo and Muckley, Matthew and Bar, Amir and Assran, Mahmoud and
Sinha, Koustuv and Rabbat, Michael and LeCun, Yann and Ballas, Nicolas and Bardes, Adrien},
journal={arXiv preprint arXiv:2603.14482},
year={2026}
}
```
2 changes: 1 addition & 1 deletion app/main_distributed.py
@@ -31,7 +31,7 @@
parser.add_argument(
"--batch-launch",
action="store_true",
help="whether fname points to a file to batch-lauch several config files",
help="whether fname points to a file to batch-launch several config files",
)
parser.add_argument(
"--use_fname_as_folder",