# Analyzing 3D Awareness in Vision Foundation Models
This project investigates how much 3D geometric understanding is encoded in the intermediate representations of modern Vision Transformers (ViTs). It provides a unified evaluation framework for probing pretrained vision foundation models (VFMs) on explicit 3D tasks: multiview correspondence, multiview reconstruction, monocular depth estimation, pose estimation, and single-view camera calibration. This is done via fine-tuning experiments that swap in different VFM backbones as the patch embedding encoder for VGGT.
If you're curious about the insights we gained from the project, check out the findings here (report, slides, images).
All models are instantiated via a `model_id` string (e.g., `feature_extractor/dinov2`). Feature extractors inherit from a common `FeatureExtractor` base class with a standardized output schema (CLS token, patch tokens, full hidden state), making it straightforward to add new models.
| Model | `model_id` | Patch Size | Source |
|---|---|---|---|
| DINOv2 (S/B/L/G) | `feature_extractor/dinov2` | 14 | HuggingFace / timm |
| DINOv3 (S/B/L/H/7B) | `feature_extractor/dinov3` | 16 | timm |
| CLIP (ViT-B/16, ViT-L/14) | `feature_extractor/clip` | 14/16/32 | HuggingFace |
| MAE (B/L/H) | `feature_extractor/mae` | 16 | HuggingFace |
| MuM (ViT-L/16) | `feature_extractor/mumvisiontransformer` | 16 | davnords/mum |
| VGGT (feature extractor) | `feature_extractor/vggtfeatureextractor` | 14 | facebookresearch/vggt |
| VGGT (full model) | `vggt` | 14 | facebookresearch/vggt |
For backbone options, config parameters, and how to add new models, see src/models/README.md.
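As a rough sketch of how this looks in code (the factory helper and the output attribute names below are assumptions for illustration; see src/models/README.md for the actual API):

```python
# Hypothetical usage sketch: instantiate a feature extractor by its model_id
# string and read the standardized output schema. `build_model` and the
# attribute names are illustrative assumptions, not the repo's confirmed API.
import torch
from src.models import build_model  # assumed factory keyed on model_id strings

extractor = build_model("feature_extractor/dinov2")  # any model_id from the table
extractor.eval()

images = torch.randn(2, 3, 224, 224)  # (B, C, H, W), already preprocessed

with torch.no_grad():
    out = extractor(images)

# Standardized FeatureExtractor output schema (field names assumed):
cls_token = out.cls_token        # (B, D)         global CLS embedding
patch_tokens = out.patch_tokens  # (B, N, D)      per-patch features
hidden = out.hidden_state        # (B, 1 + N, D)  full final hidden state
```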
| Benchmark | Script | Description |
|---|---|---|
| Multiview Correspondence | `scripts.eval.multiview_correspondence_benchmark` | Measures cosine similarity of patch features for 3D-corresponding vs. non-corresponding patches across views using voxelized world coordinates. |
| Dense N-View Reconstruction | `scripts.eval.dense_n_view_benchmark` | Evaluates pointmap accuracy, depth estimation, camera pose (ATE, AUC@5°), and ray direction error. |
| Single-View Calibration | `scripts.eval.calibration_benchmark` | Evaluates predicted camera intrinsics (ray direction angular error) across multiple resolutions. |
| Feature Visualization | `scripts.eval.visualize_feats` | PCA and k-means visualization of patch features across models. |
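To make the correspondence metric concrete, here is a simplified, self-contained sketch of the scoring idea (an illustration of the principle, not the repository's exact implementation): patches from two views count as 3D-corresponding when their unprojected world coordinates fall into the same voxel, and we compare the mean cosine similarity of corresponding vs. non-corresponding pairs.

```python
# Simplified illustration of voxel-based correspondence scoring (not the
# repo's exact code). Patches whose world coordinates land in the same voxel
# are treated as 3D-corresponding.
import torch
import torch.nn.functional as F

def correspondence_similarity(feats_a, feats_b, world_a, world_b, voxel_size=0.1):
    """feats_*: (N, D) patch features; world_*: (N, 3) world coords per patch."""
    vox_a = torch.floor(world_a / voxel_size).long()  # (N, 3) voxel indices
    vox_b = torch.floor(world_b / voxel_size).long()
    same_voxel = (vox_a[:, None, :] == vox_b[None, :, :]).all(-1)  # (N, N) bool

    feats_a = F.normalize(feats_a, dim=-1)
    feats_b = F.normalize(feats_b, dim=-1)
    sim = feats_a @ feats_b.T  # (N, N) cosine similarities

    pos = sim[same_voxel].mean()   # corresponding pairs (assumes at least one exists)
    neg = sim[~same_voxel].mean()  # non-corresponding pairs
    return pos.item(), neg.item()
```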
Benchmarks and training use the Map-Anything WAI data format with scenes from:
- ETH3D — High-quality multi-view indoor/outdoor stereo dataset
- ScanNet++ V2 — Large-scale indoor RGB-D dataset
- BlendedMVS — Blended multi-view stereo dataset
- MVS-Synth — Synthetic multi-view stereo dataset
- UnrealStereo4K — Synthetic stereo from Unreal Engine
```bash
mkdir -p external
cd external
git clone https://github.com/facebookresearch/vggt.git
git clone https://github.com/facebookresearch/map-anything.git
git clone https://github.com/davnords/mum.git
cd ..
```

From the repository root:

```bash
pip install -e .
```

This installs all dependencies from pyproject.toml (PyTorch 2.8, timm, transformers, hydra, open3d, etc.).

```bash
cd external/map-anything && pip install -e . --no-deps && cd ../..
cd external/vggt && pip install -e . --no-deps && cd ../..
pip install uniception --no-deps
```

Note on MuM: The `mum` repository cannot be installed as a package but has an `__init__.py`. It is imported directly via `external.mum.mum`. This is why all scripts must be run using module notation (see below).
To download and process datasets, follow the instructions provided in the Map-Anything README. Unfortunately, this can be quite painful, as WAI is preprocessing-heavy and still a work in progress, but the upshot is that any dataset converted to WAI can be used with this code.
The config system is built around Hydra.
### Training and Eval
Configs for training and eval are at the top level of configs/ but reference sub-configs in configs/machine (paths), configs/model (model used), configs/dataset (datasets used), and configs/training (training args). Generally, the sub-configs supply a "default" for each model, dataset, training setup, etc., and the top-level configs allow for customization. I decided on this structure for modularity, but it could definitely use refinement in hindsight. For a detailed description of how the configs work, see configs/README.md.
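For example, a top-level config can be composed and inspected from Python with Hydra's compose API. The config group and option names in the overrides below are assumptions based on the layout of configs/, not verified names:

```python
# Sketch: composing a top-level config and overriding a sub-config group.
# The override "model=dinov3" assumes group/option names matching configs/model/.
from hydra import compose, initialize

with initialize(version_base=None, config_path="configs"):
    cfg = compose(
        config_name="multiview_correspondence_benchmark",
        overrides=["model=dinov3"],  # swap the model sub-config
    )

print(cfg.model)  # resolved model config after composition
```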
### Configure Paths
Once you've set up the data and directories on your machine, update configs/machine/default.yaml (or create a new machine config) to point to your local data and directories:
```yaml
root_data_dir: /path/to/datasets
mapanything_metadata_dir: /path/to/mapanything/metadata
checkpoints_dir: /path/to/checkpoints
results_dir: /path/to/results
```

Important: Always run scripts from the repository root using module notation (`python -m ...`) so that relative imports resolve correctly.
```bash
# Multiview correspondence benchmark (probing frozen features)
python -m scripts.eval.multiview_correspondence_benchmark

# Dense N-view reconstruction benchmark
python -m scripts.eval.dense_n_view_benchmark
```

Benchmark configs are in configs/. Switch models by changing the corresponding config YAMLs or via command-line overrides. See configs/README.md for details.
Fine-tune VGGT with a custom patch embedding backbone using distributed training:
```bash
torchrun --nproc_per_node=auto scripts/training/train_vggt.py --config train_vggt
```

The training config (configs/train_vggt.yaml) controls:
- Backbone swap — Set `model.model_config.patch_embed_config` to any supported `FeatureExtractor` (see the sketch after this list)
- Aggregator architecture — Adjust depth, heads, MLP ratio
- Frozen modules — Freeze the patch embedding encoder via `training.optim.frozen_submodules`
- Multi-task loss — Camera pose, depth, and pointmap losses with confidence weighting
- Logging — Integrated W&B logging with depth maps and 3D point cloud visualizations
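As a rough illustration of what that configuration looks like once composed (key names follow the bullets above; the specific values are placeholders, not tested settings):

```python
# Illustrative only: the shape of the training config implied by the bullets
# above, built with OmegaConf. Values are placeholders, not recommended settings.
from omegaconf import OmegaConf

cfg = OmegaConf.create({
    "model": {
        "model_config": {
            # Swap in any supported FeatureExtractor as the patch embedding.
            "patch_embed_config": {"model_id": "feature_extractor/dinov3"},
        },
    },
    "training": {
        "optim": {
            # Keep the swapped-in encoder frozen during fine-tuning
            # (submodule name is assumed for illustration).
            "frozen_submodules": ["patch_embed"],
        },
    },
})
print(OmegaConf.to_yaml(cfg))
```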
```
3dfeat-refs/
├── configs/ # Hydra YAML configs (see configs/README.md)
│ ├── machine/ # Machine-specific paths
│ ├── model/ # Model configs (dinov2, dinov3, mum, vggt, ...)
│ ├── dataset/ # Dataset configs + benchmark/train dataset mixes
│ │ ├── resolution/ # Default resolutions per dataset
│ │ ├── blendedmvs/ # BlendedMVS WAI template
│ │ ├── eth3d/ # ETH3D WAI template
│ │ ├── scannetppv2/ # ScanNet++ V2 WAI template
│ │ ├── mvs_synth/ # MVS-Synth WAI template
│ │ ├── unrealstereo4k/ # UnrealStereo4K WAI template
│ │ └── mpsd/ # MPSD WAI template
│ ├── training/ # Training sub-configs (optimizer, loss, logging, ...)
│ ├── train_vggt.yaml # Top-level: VGGT training
│ ├── multiview_correspondence_benchmark.yaml # Top-level: correspondence eval
│ ├── dense_n_view_benchmark.yaml # Top-level: reconstruction eval
│ ├── calibration_benchmark.yaml # Top-level: calibration eval
│ └── visualize_feats.yaml # Top-level: feature visualization
├── external/ # External dependencies (cloned repos)
│ ├── map-anything/
│ ├── vggt/
│ └── mum/
├── src/ # Core source code
│ ├── models/ # (see src/models/README.md)
│ │ ├── feature_extractors/ # Unified VFM wrappers (DINOv2, DINOv3, CLIP, MAE, MuM, VGGT)
│ │ ├── processors/ # Image preprocessing (resize, normalize, pad)
│ │ └── vggt.py # Modified VGGT with swappable patch embeddings
│ ├── eval/
│ │ ├── multiview_correspondence.py # Voxel-based correspondence scoring
│ │ ├── dense_n_view.py # Full reconstruction metrics (pointmaps, depth, pose)
│ │ ├── metrics.py # Metric computation helpers
│ │ └── vis_feats.py # PCA / k-means feature visualization
│ ├── training/
│ │ ├── trainers/ # Base trainer + VGGT trainer with DDP support
│ │ ├── losses/ # Multi-task loss (camera, depth, pointmap)
│ │ ├── distributed.py # DDP utilities
│ │ ├── gradient_clip.py # Per-module gradient clipping
│ │ └── optimizer.py # Optimizer and scheduler setup
│ ├── datasets/ # WAI dataset wrappers
│ └── utils/ # Camera math, image utils, I/O, logging
├── scripts/
│ ├── eval/ # Evaluation entry points
│ │ ├── multiview_correspondence_benchmark.py
│ │ ├── dense_n_view_benchmark.py
│ │ ├── calibration_benchmark.py
│ │ └── visualize_feats.py
│ └── training/ # Training entry points
│ └── train_vggt.py
├── results/ # Saved benchmark results (JSON)
│ ├── multiview_correspondence/
│ └── dense_n_view/
└── pyproject.toml
```
This project builds on and integrates the following works:
- VGGT — Visual Geometry Grounded Transformer (Meta)
- Map-Anything — Multi-view 3D reconstruction framework and datasets (Meta)
- MuM — Multi-view masked autoencoder
- Fit3D — Improving 2D Feature Representations by 3D-Aware Fine-Tuning
- 3DRS — Basis for the multiview correspondence benchmark
Please check them out as well.