All models are instantiated through factory functions in src/models/__init__.py using a model_id string. The ID follows the pattern <model_type>/<model_name>, which is resolved by get_model_from_model_id().
There are two top-level model types:

| model_type | Description |
|---|---|
| feature_extractor | Frozen VFM backbones for feature probing (multiview correspondence, visualization) |
| vggt | Full VGGT model for end-to-end 3D tasks (reconstruction, calibration, training) |
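As a quick usage sketch (the import path and whether the factory accepts a separate config dict are assumptions based on the description above; check `src/models/__init__.py` for the actual signature):

```python
# Hypothetical factory usage; only get_model_from_model_id() and the
# <model_type>/<model_name> ID pattern come from the description above.
from src.models import get_model_from_model_id

extractor = get_model_from_model_id(
    "feature_extractor/dinov2",         # resolves to the DINOv2 wrapper
    model_config={"backbone": "base"},  # assumed keyword for per-model options
)
vggt = get_model_from_model_id("vggt")  # full VGGT model
```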
Feature extractors are wrappers around pretrained vision foundation models. They all inherit from `FeatureExtractor` (`src/models/feature_extractors/base.py`) and implement a standardized output schema (for ease of use in evaluation and in larger architectures):

```python
# Single-layer output (most models)
{
    "x_norm": torch.Tensor,              # Full last hidden state (CLS + registers + patches)
    "x_norm_clstoken": torch.Tensor,     # CLS token
    "x_norm_patchtokens": torch.Tensor,  # Patch embeddings (B, P, D) or (B, H_p, W_p, D)
}

# Multi-layer output (VGGTFeatureExtractor)
{
    "<layer_name>": {
        "x_norm_patchtokens": torch.Tensor,  # Patch embeddings for this layer
    },
    ...
}
```
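A minimal sketch of consuming this schema, assuming `extractor` was created through the factory above and `images` is an already-preprocessed `(B, 3, H, W)` tensor:

```python
import torch

with torch.no_grad():
    out = extractor.forward_features(images)   # single-layer schema

patch_tokens = out["x_norm_patchtokens"]       # (B, P, D) or (B, H_p, W_p, D)
cls_token = out["x_norm_clstoken"]             # (B, D)
```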
| Field | Value |
|---|---|
| model_id | feature_extractor/dinov2 |
| Class | DINOv2 |
| Source | HuggingFace (facebook/dinov2-*) or timm |
| Backbones | small, base, large, giant |
| Patch size | 14 |
Config options:

- `backbone` (str): One of `small`, `base`, `large`, `giant`. Default: `"base"`
- `use_timm` (bool): Use `timm` instead of HuggingFace. Default: `false`
- `with_registers` (bool): Use variant with register tokens. Default: `false`
- `checkpoint_path` (str): Path to custom checkpoint. Default: `null` (uses pretrained)
- `preprocess_images` (bool): Apply internal preprocessing. Default: `true`
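For example, a config selecting the timm weights with register tokens might look like this (the option names come from the list above; how the dict is passed to the factory is an assumption):

```python
# Hypothetical DINOv2 model_config mirroring the options listed above.
dinov2_config = {
    "backbone": "base",         # small | base | large | giant
    "use_timm": True,           # load via timm instead of HuggingFace
    "with_registers": True,     # register-token variant
    "checkpoint_path": None,    # None -> pretrained weights
    "preprocess_images": True,  # extractor normalizes inputs itself
}
```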
| Field | Value |
|---|---|
| model_id | feature_extractor/dinov3 |
| Class | DINOv3 |
| Source | timm |
| Backbones | small, small_plus, base, large, huge, huge_plus, 7b |
| Patch size | 16 |
Config options:

- `backbone` (str): One of `small`, `small_plus`, `base`, `large`, `huge`, `huge_plus`, `7b`. Default: `"base"`
- `checkpoint_path` (str): Path to custom checkpoint. Default: `null`
- `preprocess_images` (bool): Default: `true`
| Field | Value |
|---|---|
| model_id | feature_extractor/clip |
| Class | CLIP |
| Source | HuggingFace (openai/clip-*) |
| Backbones | vit-large-patch14, vit-base-patch16, vit-base-patch32 |
| Patch size | 14, 16, or 32 (depends on backbone) |
Config options:

- `backbone` (str): One of `vit-large-patch14`, `vit-base-patch16`, `vit-base-patch32`. Default: `"vit-base-patch16"`
- `checkpoint_path` (str): Path to custom checkpoint. Default: `null`
- `preprocess_images` (bool): Default: `true`
| Field | Value |
|---|---|
| model_id | feature_extractor/mae |
| Class | MAE |
| Source | HuggingFace (facebook/vit-mae-*) |
| Backbones | base, large, huge |
| Patch size | 16 |
Config options:

- `backbone` (str): One of `base`, `large`, `huge`. Default: `"base"`
- `checkpoint_path` (str): Path to custom checkpoint. Default: `null`
- `preprocess_images` (bool): Default: `true`
Note: Masking is disabled (`mask_ratio=0.0`) so all patches are used for feature extraction.
| Field | Value |
|---|---|
| model_id | feature_extractor/mumvisiontransformer |
| Class | MuMVisionTransformer |
| Source | davnords/mum (cloned into external/) |
| Backbone | ViT-L/16 (only variant) |
| Patch size | 16 |
Config options:

- `checkpoint_path` (str): Path to custom checkpoint. Default: `null`
- `preprocess_images` (bool): Default: `true`
| Field | Value |
|---|---|
| model_id | feature_extractor/vggtfeatureextractor |
| Class | VGGTFeatureExtractor |
| Source | facebookresearch/vggt |
| Backbone | VGGT-1B (DINOv2-Large + Aggregator) |
| Patch size | 14 |
This extractor returns multi-layer output — one feature dict per aggregator layer (and optionally the patch embedding layer). This is useful for probing how 3D awareness evolves across the depth of the VGGT aggregator.
Config options:

- `vggt_config` (Dict): Config dict for the underlying VGGT model. Default: `null` (loads official 1B pretrained)
- `checkpoint_path` (str): Path to custom VGGT checkpoint. Default: `null`
- `layer_types` (List[str]): Which layers to extract. Options:
  - `"patch_embed"`: Final layer of the patch embedding encoder
  - `"frame"`: All local (frame) attention layers in the aggregator
  - `"global"`: All global attention layers in the aggregator
  - `"all"`: All of the above (default)
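A sketch of iterating over the multi-layer output (assuming `vggt_extractor` and `images` are set up as in the earlier sketches; the layer-name keys depend on `layer_types`):

```python
import torch

with torch.no_grad():
    outputs = vggt_extractor.forward_features(images)

# One entry per extracted layer, each following the multi-layer schema above.
for layer_name, feats in outputs.items():
    patch_tokens = feats["x_norm_patchtokens"]
    print(layer_name, tuple(patch_tokens.shape))
```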
The VGGT class (src/models/vggt.py) is a modified version of the original VGGT that adds support for swapping in any FeatureExtractor as the patch embedding encoder.
| Field | Value |
|---|---|
| model_id | vggt |
| Class | VGGT |
```
Images → [Patch Embed (any FeatureExtractor)] → [Aggregator (local + global attention)] → [Heads]
    ├── CameraHead      → pose encoding
    ├── DPTHead (depth) → depth + confidence
    ├── DPTHead (point) → 3D points + confidence
    └── TrackHead       → point tracks
```
Config options:

- `img_size` (int): Image size to train on. Default: `518`
- `patch_size` (int): Patch size. Default: `14`
- `embed_dim` (int): Embedding dimension. Default: `1024`
- `enable_camera` (bool): Train camera head. Default: `true`
- `enable_depth` (bool): Train depth head. Default: `true`
- `enable_point` (bool): Train pointmap head. Default: `false`
- `enable_track` (bool): Train tracking head. Default: `false`
- `patch_embed_config` (Dict): Config for a custom patch embedding model. If `null`, defaults to DINOv2-Large with registers. Example:

  ```yaml
  patch_embed_config:
    model_id: feature_extractor/dinov2
    model_config:
      backbone: "base"
      use_timm: true
      with_registers: true
      preprocess_images: false  # aggregator handles this
  ```

- `aggregator_config` (Dict): Override aggregator architecture (depth, num_heads, mlp_ratio, etc.)
- `camera_head_config`, `depth_head_config`, `point_head_config`, `track_head_config` (Dict): Override head architectures
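For illustration, a `model_config` assembled from the defaults above, to be unpacked into the constructor in the snippet that follows (whether the constructor accepts exactly these keywords should be verified against `src/models/vggt.py`):

```python
# Hypothetical model_config built from the options above.
model_config = {
    "img_size": 518,
    "patch_size": 14,
    "embed_dim": 1024,
    "enable_camera": True,
    "enable_depth": True,
    "enable_point": False,
    "enable_track": False,
    "patch_embed_config": None,  # None -> DINOv2-Large with registers
}
```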
```python
# From official pretrained checkpoint
model = VGGT.from_pretrained("facebook/VGGT-1B")

# From config (random init or custom backbone)
model = VGGT(**model_config)
```

To add a new feature extractor (see the sketch after this list):

- Create a new class in `src/models/feature_extractors/` that inherits from `FeatureExtractor`
- Implement `forward_features()` returning the standardized output schema
- Set `self.model`, `self.patch_size`, `self.embed_dim`, and `self.img_size` in `__init__`
- Register the new model in `src/models/feature_extractors/__init__.py` by adding a case to `get_extractor_from_id()`
- Create a config YAML in `configs/model/` that inherits from `feature_extractor` and sets the `model_id`
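A skeleton of such a subclass (the base-class constructor signature, the timm backbone name, and the checkpoint format are illustrative assumptions; `src/models/feature_extractors/base.py` is the source of truth):

```python
# Illustrative skeleton; attribute and method names follow the steps above,
# everything else (timm model choice, checkpoint handling) is assumed.
import timm
import torch
from src.models.feature_extractors.base import FeatureExtractor


class MyBackbone(FeatureExtractor):
    def __init__(self, checkpoint_path=None, preprocess_images=True, **kwargs):
        super().__init__(**kwargs)
        self.model = timm.create_model("vit_base_patch16_224", pretrained=checkpoint_path is None)
        self.patch_size = 16
        self.embed_dim = self.model.embed_dim
        self.img_size = 224
        self.preprocess_images = preprocess_images
        if checkpoint_path is not None:
            self.model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

    def forward_features(self, images: torch.Tensor) -> dict:
        tokens = self.model.forward_features(images)  # (B, 1 + P, D) for this timm ViT
        return {
            "x_norm": tokens,
            "x_norm_clstoken": tokens[:, 0],
            "x_norm_patchtokens": tokens[:, 1:],
        }


# Then add a case for the new model in get_extractor_from_id()
# (src/models/feature_extractors/__init__.py) and a config YAML under configs/model/.
```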