SnapViT is a contrastive learning pipeline that uses Vision Transformers (ViT) to create unified Bird's-Eye View (BEV) representations from both aerial (UAV) and ground-level (UGV) images of the same scene. By training the model to recognize corresponding ground and aerial views, it learns to generate powerful, viewpoint-invariant feature maps.
The core of the project is the SnapViT model, which consists of two main components:
- Ground Encoder: Processes multiple UGV images and their camera poses, projecting their features into a unified BEV feature map.
- Overhead Encoder: Processes a single UAV image to create a corresponding BEV feature map.
Both encoders share a Vision Transformer (ViT) backbone for feature extraction. The model is trained using an InfoNCE contrastive loss function, which encourages BEV maps from corresponding ground and aerial views to be similar while being dissimilar from non-corresponding views.
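As a rough sketch of that objective (not the exact code in `train.py`), a symmetric InfoNCE loss over pooled BEV embeddings looks like the following; the mean pooling and batch layout here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(ground_bev, aerial_bev, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired BEV maps.

    ground_bev, aerial_bev: (B, C, H, W) feature maps from the two
    encoders; pairs at the same batch index come from the same scene.
    Pooling each map to a single embedding is an illustrative choice.
    """
    # Pool each BEV map to one embedding per scene and L2-normalize.
    g = F.normalize(ground_bev.flatten(2).mean(dim=2), dim=1)  # (B, C)
    a = F.normalize(aerial_bev.flatten(2).mean(dim=2), dim=1)  # (B, C)

    # Cosine-similarity logits: matching pairs lie on the diagonal.
    logits = g @ a.t() / temperature                           # (B, B)
    targets = torch.arange(g.size(0), device=g.device)

    # Pull corresponding ground/aerial views together, push apart the rest.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```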
| File | Description |
|---|---|
| `model.py` | Defines the core SnapViT architecture: `GroundEncoder`, `OverheadEncoder`, and `ViTFeatureExtractor`. |
| `train.py` | Main script for training the SnapViT model. |
| `dataset.py` | Contains `VineyardDataset` for loading UAV and UGV scene data. |
| `create_dataset.py` | Converts raw data (GeoTIFFs, ROS bags, GPS `.llh` files) into structured datasets. |
| `create_dataset_from_frames.py` | Simpler dataset creation script from image folders and a CSV. |
| `verify_dataset.py` | Visualizes and verifies alignment between UAV and UGV data. |
| `visualize.py` | Runs a trained model and saves generated BEV maps for inspection. |
| `requirements.txt` | Python dependencies for the project. |
Clone the repository:

```bash
git clone https://github.com/rajithadesilva/snapViT.git
cd snapViT
```

Create a virtual environment (recommended):

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install dependencies:

```bash
pip install -r requirements.txt
```

For dataset creation from raw robotics data:

```bash
pip install pyrealsense2 rasterio
```

Create a dataset from raw data (a drone orthophoto, a UGV ROS bag, and a GPS log):

```bash
python create_dataset.py \
    --drone_geotiff /path/to/orthophoto.tif \
    --ground_rosbag /path/to/ugv_data.bag \
    --ground_llh /path/to/ugv_gps.llh \
    --output_dir datasets/vineyard_dataset \
    --scene_radius 15 \
    --tile_ground_size 10.0
```

Alternatively, create a dataset from pre-extracted image folders and a CSV of GPS coordinates:

```bash
python create_dataset_from_frames.py \
    --drone_folder /path/to/drone_images \
    --ground_folder /path/to/ground_images \
    --ground_csv /path/to/ground_gps.csv \
    --output_dir datasets/vineyard_dataset \
    --scene_radius 10.0
```

Visually inspect alignment between UGV and UAV data:

```bash
python verify_dataset.py \
    --dataset_dir datasets/vineyard_dataset \
    --output_dir verification \
    --tile_ground_size 10.0
```

Note: Ensure `tile_ground_size` matches the value used during dataset creation.
Train the SnapViT model:

```bash
python train.py
```

- Adjust hyperparameters via the `CONFIG` dictionary in `train.py`.
- The best model is saved to `models/best_model.pth`.
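For orientation, a `CONFIG` dictionary of this kind typically holds entries like the sketch below. The key names and values here are illustrative assumptions, not the actual contents of `train.py`:

```python
# Hypothetical example of the CONFIG dictionary in train.py.
# Key names and defaults are assumptions for illustration; check
# train.py for the real entries.
CONFIG = {
    "data_root": "datasets/vineyard_dataset",  # dataset from create_dataset.py
    "batch_size": 8,
    "epochs": 100,
    "lr": 1e-4,            # learning rate
    "temperature": 0.07,   # InfoNCE temperature
    "checkpoint_dir": "models",
}
```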
Generate BEV feature maps using a trained model:

```bash
python visualize.py \
    --data_root datasets/vineyard_dataset \
    --checkpoint models/best_model.pth \
    --output_dir visualisations
```

Outputs include:
- UAV image
- UGV image collage
- Ground BEV map
- Aerial BEV map
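To reuse a trained checkpoint outside `visualize.py`, it can presumably be restored along these lines. This assumes `best_model.pth` stores one state dict per encoder; if `train.py` saves a different format, adapt the keys accordingly:

```python
import torch
from model import GroundEncoder, OverheadEncoder

# Assumed checkpoint layout: one state_dict per encoder. If train.py
# wraps them differently, adjust the keys below.
checkpoint = torch.load("models/best_model.pth", map_location="cpu")

ground_enc = GroundEncoder()
aerial_enc = OverheadEncoder()
ground_enc.load_state_dict(checkpoint["ground_encoder"])
aerial_enc.load_state_dict(checkpoint["overhead_encoder"])
ground_enc.eval()
aerial_enc.eval()
```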
Contributions are welcome! Please open issues or pull requests.