All models are instantiated through factory functions in src/models/__init__.py using a model_id string. The ID follows the pattern <model_type>/<model_name>, which is resolved by get_model_from_model_id().
There are two top-level model types:

| model_type | Description |
|---|---|
| feature_extractor | Frozen VFM backbones for feature probing (multiview correspondence, visualization) |
| vggt | Full VGGT model for end-to-end 3D tasks (reconstruction, calibration, training) |
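As a quick usage sketch (the import path and whether the factory accepts a separate config dict are assumptions based on the description above; check `src/models/__init__.py` for the actual signature):

```python
# Hypothetical factory usage; only get_model_from_model_id() and the
# <model_type>/<model_name> ID pattern come from the description above.
from src.models import get_model_from_model_id

extractor = get_model_from_model_id(
    "feature_extractor/dinov2",         # resolves to the DINOv2 wrapper
    model_config={"backbone": "base"},  # assumed keyword for per-model options
)
vggt = get_model_from_model_id("vggt")  # full VGGT model
```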
Feature extractors are wrappers around pretrained vision foundation models. They all inherit from `FeatureExtractor` (`src/models/feature_extractors/base.py`) and implement a standardized output schema (for ease of use in evaluation and in larger architectures):

```python
# Single-layer output (most models)
{
    "x_norm": torch.Tensor,              # Full last hidden state (CLS + registers + patches)
    "x_norm_clstoken": torch.Tensor,     # CLS token
    "x_norm_patchtokens": torch.Tensor,  # Patch embeddings (B, P, D) or (B, H_p, W_p, D)
}

# Multi-layer output (VGGTFeatureExtractor)
{
    "<layer_name>": {
        "x_norm_patchtokens": torch.Tensor,  # Patch embeddings for this layer
    },
    ...
}
```
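A minimal sketch of consuming this schema, assuming `extractor` was created through the factory above and `images` is an already-preprocessed `(B, 3, H, W)` tensor:

```python
import torch

with torch.no_grad():
    out = extractor.forward_features(images)   # single-layer schema

patch_tokens = out["x_norm_patchtokens"]       # (B, P, D) or (B, H_p, W_p, D)
cls_token = out["x_norm_clstoken"]             # (B, D)
```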
| Field | Value |
|---|---|
| model_id | feature_extractor/dinov2 |
| Class | DINOv2 |
| Source | HuggingFace (facebook/dinov2-*) or timm |
| Backbones | small, base, large, giant |
| Patch size | 14 |
Config options:

- `backbone` (str): One of `small`, `base`, `large`, `giant`. Default: `"base"`
- `use_timm` (bool): Use `timm` instead of HuggingFace. Default: `false`
- `with_registers` (bool): Use variant with register tokens. Default: `false`
- `checkpoint_path` (str): Path to custom checkpoint. Default: `null` (uses pretrained)
- `preprocess_images` (bool): Apply internal preprocessing. Default: `true`
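For example, a config selecting the timm weights with register tokens might look like this (the option names come from the list above; how the dict is passed to the factory is an assumption):

```python
# Hypothetical DINOv2 model_config mirroring the options listed above.
dinov2_config = {
    "backbone": "base",         # small | base | large | giant
    "use_timm": True,           # load via timm instead of HuggingFace
    "with_registers": True,     # register-token variant
    "checkpoint_path": None,    # None -> pretrained weights
    "preprocess_images": True,  # extractor normalizes inputs itself
}
```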
| Field | Value |
|---|---|
| model_id | feature_extractor/dinov3 |
| Class | DINOv3 |
| Source | timm |
| Backbones | small, small_plus, base, large, huge, huge_plus, 7b |
| Patch size | 16 |
Config options:

- `backbone` (str): One of `small`, `small_plus`, `base`, `large`, `huge`, `huge_plus`, `7b`. Default: `"base"`
- `checkpoint_path` (str): Path to custom checkpoint. Default: `null`
- `preprocess_images` (bool): Default: `true`
| Field | Value |
|---|---|
| model_id | feature_extractor/clip |
| Class | CLIP |
| Source | HuggingFace (openai/clip-*) |
| Backbones | vit-large-patch14, vit-base-patch16, vit-base-patch32 |
| Patch size | 14, 16, or 32 (depends on backbone) |
Config options:

- `backbone` (str): One of `vit-large-patch14`, `vit-base-patch16`, `vit-base-patch32`. Default: `"vit-base-patch16"`
- `checkpoint_path` (str): Path to custom checkpoint. Default: `null`
- `preprocess_images` (bool): Default: `true`
| Field | Value |
|---|---|
| model_id | feature_extractor/mae |
| Class | MAE |
| Source | HuggingFace (facebook/vit-mae-*) |
| Backbones | base, large, huge |
| Patch size | 16 |
Config options:

- `backbone` (str): One of `base`, `large`, `huge`. Default: `"base"`
- `checkpoint_path` (str): Path to custom checkpoint. Default: `null`
- `preprocess_images` (bool): Default: `true`
Note: Masking is disabled (`mask_ratio=0.0`) so all patches are used for feature extraction.
| Field | Value |
|---|---|
| model_id | feature_extractor/mumvisiontransformer |
| Class | MuMVisionTransformer |
| Source | davnords/mum (cloned into external/) |
| Backbone | ViT-L/16 (only variant) |
| Patch size | 16 |
Config options:

- `checkpoint_path` (str): Path to custom checkpoint. Default: `null`
- `preprocess_images` (bool): Default: `true`
| Field | Value |
|---|---|
| model_id | feature_extractor/vggtfeatureextractor |
| Class | VGGTFeatureExtractor |
| Source | facebookresearch/vggt |
| Backbone | VGGT-1B (DINOv2-Large + Aggregator) |
| Patch size | 14 |
This extractor returns multi-layer output — one feature dict per aggregator layer (and optionally the patch embedding layer). This is useful for probing how 3D awareness evolves across the depth of the VGGT aggregator.
Config options:

- `vggt_config` (Dict): Config dict for the underlying VGGT model. Default: `null` (loads official 1B pretrained)
- `checkpoint_path` (str): Path to custom VGGT checkpoint. Default: `null`
- `layer_types` (List[str]): Which layers to extract. Options:
  - `"patch_embed"`: Final layer of the patch embedding encoder
  - `"frame"`: All local (frame) attention layers in the aggregator
  - `"global"`: All global attention layers in the aggregator
  - `"all"`: All of the above (default)
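A sketch of iterating over the multi-layer output (assuming `vggt_extractor` and `images` are set up as in the earlier sketches; the layer-name keys depend on `layer_types`):

```python
import torch

with torch.no_grad():
    outputs = vggt_extractor.forward_features(images)

# One entry per extracted layer, each following the multi-layer schema above.
for layer_name, feats in outputs.items():
    patch_tokens = feats["x_norm_patchtokens"]
    print(layer_name, tuple(patch_tokens.shape))
```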
The VGGT class (src/models/vggt.py) is a modified version of the original VGGT that adds support for swapping in any FeatureExtractor as the patch embedding encoder.
| Field | Value |
|---|---|
| model_id | vggt |
| Class | VGGT |
```
Images → [Patch Embed (any FeatureExtractor)] → [Aggregator (local + global attention)] → [Heads]
    ├── CameraHead      → pose encoding
    ├── DPTHead (depth) → depth + confidence
    ├── DPTHead (point) → 3D points + confidence
    └── TrackHead       → point tracks
```
Config options:

- `img_size` (int): Image size to train on. Default: `518`
- `patch_size` (int): Patch size. Default: `14`
- `embed_dim` (int): Embedding dimension. Default: `1024`
- `enable_camera` (bool): Train camera head. Default: `true`
- `enable_depth` (bool): Train depth head. Default: `true`
- `enable_point` (bool): Train pointmap head. Default: `false`
- `enable_track` (bool): Train tracking head. Default: `false`
- `patch_embed_config` (Dict): Config for a custom patch embedding model. If `null`, defaults to DINOv2-Large with registers. Example:

  ```yaml
  patch_embed_config:
    model_id: feature_extractor/dinov2
    model_config:
      backbone: "base"
      use_timm: true
      with_registers: true
      preprocess_images: false  # aggregator handles this
  ```

- `aggregator_config` (Dict): Override aggregator architecture (depth, num_heads, mlp_ratio, etc.)
- `camera_head_config`, `depth_head_config`, `point_head_config`, `track_head_config` (Dict): Override head architectures
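For illustration, a `model_config` assembled from the defaults above, to be unpacked into the constructor in the snippet that follows (whether the constructor accepts exactly these keywords should be verified against `src/models/vggt.py`):

```python
# Hypothetical model_config built from the options above.
model_config = {
    "img_size": 518,
    "patch_size": 14,
    "embed_dim": 1024,
    "enable_camera": True,
    "enable_depth": True,
    "enable_point": False,
    "enable_track": False,
    "patch_embed_config": None,  # None -> DINOv2-Large with registers
}
```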
```python
# From official pretrained checkpoint
model = VGGT.from_pretrained("facebook/VGGT-1B")

# From config (random init or custom backbone)
model = VGGT(**model_config)
```

To add a new feature extractor (see the sketch after this list):

- Create a new class in `src/models/feature_extractors/` that inherits from `FeatureExtractor`
- Implement `forward_features()` returning the standardized output schema
- Set `self.model`, `self.patch_size`, `self.embed_dim`, and `self.img_size` in `__init__`
- Register the new model in `src/models/feature_extractors/__init__.py` by adding a case to `get_extractor_from_id()`
- Create a config YAML in `configs/model/` that inherits from `feature_extractor` and sets the `model_id`
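A skeleton of such a subclass (the base-class constructor signature, the timm backbone name, and the checkpoint format are illustrative assumptions; `src/models/feature_extractors/base.py` is the source of truth):

```python
# Illustrative skeleton; attribute and method names follow the steps above,
# everything else (timm model choice, checkpoint handling) is assumed.
import timm
import torch
from src.models.feature_extractors.base import FeatureExtractor


class MyBackbone(FeatureExtractor):
    def __init__(self, checkpoint_path=None, preprocess_images=True, **kwargs):
        super().__init__(**kwargs)
        self.model = timm.create_model("vit_base_patch16_224", pretrained=checkpoint_path is None)
        self.patch_size = 16
        self.embed_dim = self.model.embed_dim
        self.img_size = 224
        self.preprocess_images = preprocess_images
        if checkpoint_path is not None:
            self.model.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))

    def forward_features(self, images: torch.Tensor) -> dict:
        tokens = self.model.forward_features(images)  # (B, 1 + P, D) for this timm ViT
        return {
            "x_norm": tokens,
            "x_norm_clstoken": tokens[:, 0],
            "x_norm_patchtokens": tokens[:, 1:],
        }


# Then add a case for the new model in get_extractor_from_id()
# (src/models/feature_extractors/__init__.py) and a config YAML under configs/model/.
```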