[CVPR2024 Highlight] LangSplat: 3D Language Gaussian Splatting

AMD ROCm GPU Fork

This is a fork of the original repo, tested on AMD GPUs with ROCm, with the following key contributions:

Performance: ~10x Training Throughput (15 -> 145 iter/s on AMD GPUs)

Optimization	Throughput	Speedup (vs. previous row)
Baseline (`python train.py`, single GPU)	~15 iter/s	--
`OMP_NUM_THREADS=1`	~25 iter/s	1.7x
GPU-side caching of language features	~45 iter/s	1.8x
AMD-optimized gsplat rasterization	~145 iter/s	3.2x
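
The per-step speedups in the table compound multiplicatively. As a quick sanity check (a sketch that uses only the rounded throughput figures from the table, nothing is measured here), the overall factor works out to roughly 9.7x, which the heading rounds to ~10x:

```python
# Throughput after each optimization step, in iter/s (rounded values from the table).
throughput = [15, 25, 45, 145]

# Speedup of each step relative to the previous row.
step_speedups = [after / before for before, after in zip(throughput, throughput[1:])]

# The overall speedup is the product of the per-step speedups,
# which is the same as last / first.
overall = 1.0
for speedup in step_speedups:
    overall *= speedup

print([round(s, 1) for s in step_speedups])  # [1.7, 1.8, 3.2]
print(round(overall, 1))                     # 9.7
```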

OMP_NUM_THREADS=1 (1.7x) -- Without this, PyTorch dispatches every small CPU op (mask indexing, L1 loss) across all CPU cores. On a GPU node with 100+ CPU cores, the thread coordination overhead dominates. Single-threaded execution is faster for these microsecond-scale operations.
GPU-side caching of language features (1.8x) -- The original code loaded two .npy files from disk and ran CPU preprocessing every iteration. Caching the result on GPU HBM after first access eliminates ~45% of per-iteration cost. Also fixed a glibc heap memory leak (via MALLOC_MMAP_THRESHOLD_) that caused OOM with 8 DDP workers.
AMD-optimized gsplat rasterization (3.2x) -- Replaced the hipified langsplat-rasterization with ROCm/gsplat, which has DPP warp reductions in the backward pass, 8x8 tiles (1 wavefront per tile on CDNA), __launch_bounds__(64), and fused projection kernels. Language features are handled via gsplat's N-D channel support (no kernel changes needed).
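
The thread setting from the first item can also be pinned from inside a launcher script instead of on the command line. The sketch below is one possible way to do that, not code from this repo: the helper name `pin_cpu_threads` is hypothetical, and it assumes the variable is set before any numerical library is imported, since OpenMP runtimes generally read it when they initialize. The glibc `MALLOC_MMAP_THRESHOLD_` variable behaves differently: it is read during process startup, so setting it from Python would only affect child processes, and it is better exported in the launching shell.

```python
import os


def pin_cpu_threads(num_threads: int = 1) -> dict:
    """Set OMP_NUM_THREADS if it is not already set; call before importing torch.

    Hypothetical helper. An existing value is left alone so that a value given
    on the command line (e.g. `OMP_NUM_THREADS=1 torchrun ...`) still wins.
    """
    value = os.environ.setdefault("OMP_NUM_THREADS", str(num_threads))
    return {"OMP_NUM_THREADS": value}


if __name__ == "__main__":
    print(pin_cpu_threads(1))
    try:
        import torch  # imported only after the environment is set
    except ImportError:
        torch = None
    if torch is not None:
        # torch.set_num_threads also limits intra-op CPU parallelism.
        torch.set_num_threads(1)
        print("torch intra-op threads:", torch.get_num_threads())
```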

See Performance_Optimization_Journey.md for the full analysis with code snippets and commit references.

Key Contributions

Distributed Data Parallel (DDP) Training for Language Features
- Added multi-GPU support for language feature training via train_ddp.py, enabling training across multiple GPUs using PyTorch's torchrun.
- Includes a DistributedCameraSampler that distributes cameras across ranks with per-epoch shuffling and drop-last support.
- Gradients for _language_feature are averaged across all ranks via all_reduce, while only rank 0 handles logging, checkpointing, and TensorBoard.
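
To make the sampling behaviour concrete, here is a minimal pure-Python sketch of a per-rank camera sampler with per-epoch shuffling and drop-last handling. It is an illustration of the idea only: the class name `CameraSamplerSketch` and its parameters are hypothetical, and the actual `DistributedCameraSampler` in `train_ddp.py` may differ in detail.

```python
import random
from typing import Iterator, List


class CameraSamplerSketch:
    """Illustrative stand-in for a per-rank camera sampler (not the repo's class)."""

    def __init__(self, num_cameras: int, world_size: int, rank: int,
                 seed: int = 0, drop_last: bool = True) -> None:
        if not 0 <= rank < world_size:
            raise ValueError("rank must be in [0, world_size)")
        self.num_cameras = num_cameras
        self.world_size = world_size
        self.rank = rank
        self.seed = seed
        self.drop_last = drop_last
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        """Change the shuffle for the next pass over the cameras."""
        self.epoch = epoch

    def indices(self) -> List[int]:
        order = list(range(self.num_cameras))
        # Every rank uses the same seed, so every rank sees the same
        # permutation and the strided slices below never overlap.
        random.Random(self.seed + self.epoch).shuffle(order)
        remainder = len(order) % self.world_size
        if remainder:
            if self.drop_last:
                # Drop the tail so all ranks get the same number of cameras.
                order = order[:len(order) - remainder]
            else:
                # Pad by repeating cameras from the start of the permutation.
                pad = self.world_size - remainder
                order += (order * pad)[:pad]
        return order[self.rank::self.world_size]

    def __iter__(self) -> Iterator[int]:
        return iter(self.indices())

    def __len__(self) -> int:
        return len(self.indices())


if __name__ == "__main__":
    # 10 cameras on 4 ranks with drop_last: 8 cameras are used, 2 per rank.
    for rank in range(4):
        print(rank, CameraSamplerSketch(10, world_size=4, rank=rank).indices())
```

Seeding the shuffle with `seed + epoch` keeps the ranks in agreement without any communication, which is the same idea PyTorch's own `DistributedSampler` relies on.
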
GPU-side Caching and Memory Leak Fix
- Identified and fixed a critical glibc heap memory leak in Camera.get_language_feature(): every training iteration loaded .npy files from disk and performed CPU-side tensor operations (~70 MB of heap allocations per iteration), which glibc's malloc never returned to the OS. With 8 DDP workers running 30,000 iterations each, this caused monotonic RSS growth exceeding the system's RAM, triggering the Linux OOM killer.
- Applied two fixes: (a) GPU-side caching of language features on each Camera object so disk I/O and CPU preprocessing happen only once per camera, and (b) setting MALLOC_MMAP_THRESHOLD_ to force glibc to use mmap() for large allocations (properly freed on release). Together, these eliminated the OOM and improved training throughput by ~1.8x.
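
The caching half of the fix follows a simple load-once pattern. The sketch below shows that pattern in isolation, under stated assumptions: `CachedFeatureCamera` and `load_fn` are hypothetical names, the loader stands in for the disk read plus CPU preprocessing, and in the real code the cached value would presumably be a tensor already moved to the GPU.

```python
from typing import Any, Callable, Dict


class CachedFeatureCamera:
    """Hypothetical stand-in for Camera; only the caching pattern matters here."""

    def __init__(self, load_fn: Callable[[int], Any]) -> None:
        # load_fn represents the expensive path: reading .npy files from disk
        # and preprocessing them on the CPU for a given feature level.
        self._load_fn = load_fn
        self._cache: Dict[int, Any] = {}

    def get_language_feature(self, feature_level: int) -> Any:
        # The first call per feature level pays the full cost; later calls
        # return the stored object without touching disk or the CPU heap.
        if feature_level not in self._cache:
            self._cache[feature_level] = self._load_fn(feature_level)
        return self._cache[feature_level]


if __name__ == "__main__":
    load_calls = []

    def fake_load(level: int) -> tuple:
        load_calls.append(level)
        return ("features", level)

    camera = CachedFeatureCamera(fake_load)
    for _ in range(1000):  # simulate many training iterations
        camera.get_language_feature(1)
    print(load_calls)  # [1]
```

The trade-off is memory: every camera keeps its features resident, so the cache size grows with the number of training views.
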
AMD-optimized gsplat Integration
- Replaced langsplat-rasterization with ROCm/gsplat as the rasterization backend, gaining AMD-specific kernel optimizations (DPP warp reductions, 8x8 tiles, fused projection). Added optional language_features parameter to gsplat.rasterization() upstream. Training throughput improved ~3.2x.

Installation

To install ROCm and the PyTorch software suite, please refer to: The Rock releases for ROCm and PyTorch. To install the AMD ROCm GPU versions of this repo's submodules and dependencies, use the commands below:

# typing and pathlib ship with Python 3, so they are not installed from PyPI
pip install open-clip-torch plyfile jaxtyping
pip install submodules/segment-anything-langsplat --no-build-isolation

# Install AMD-optimized gsplat (replaces the old langsplat-rasterization)
# Step 1: Clone with --recursive to get the bundled GLM submodule
git clone --recursive https://github.com/ROCm/gsplat.git ~/gsplat

# Step 2: Copy bundled GLM headers (has native HIP support, unlike system GLM)
mkdir -p ~/.local/include
cp -r ~/gsplat/gsplat/cuda/csrc/third_party/glm/glm ~/.local/include/

# Step 3: Build and install in editable mode
cd ~/gsplat && pip install --no-build-isolation --no-cache-dir -e .

pip install --no-build-isolation git+https://github.com/amd-wangfan/simple-knn.git@hip_support
pip install opencv-python

Note: Building gsplat from source requires a working ROCm toolchain. The build auto-detects your GPU architecture via rocminfo (e.g. gfx942, gfx90a). If the detection fails, it defaults to gfx942. The --recursive clone is required to get the bundled GLM math library which has native HIP __device__ annotations (the system libglm-dev package does not). You can verify the install with: python -c "import gsplat; print(gsplat.__version__)"
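
If you want to see which architecture your machine is likely to report before building, the detection described in the note can be approximated in a few lines. This is a sketch of the described behaviour (query rocminfo, fall back to gfx942), not the build script's actual logic; the function names and the regular expression are assumptions.

```python
import re
import shutil
import subprocess

DEFAULT_ARCH = "gfx942"  # fallback named in the note above


def parse_gfx_arch(rocminfo_output: str, default: str = DEFAULT_ARCH) -> str:
    """Return the first gfx target found in rocminfo-style text, else the default."""
    match = re.search(r"\bgfx[0-9a-f]+\b", rocminfo_output)
    return match.group(0) if match else default


def detect_gfx_arch(default: str = DEFAULT_ARCH) -> str:
    """Run rocminfo if it is on PATH; fall back to the default on any failure."""
    if shutil.which("rocminfo") is None:
        return default
    try:
        result = subprocess.run(["rocminfo"], capture_output=True, text=True,
                                timeout=30, check=True)
    except (subprocess.SubprocessError, OSError):
        return default
    return parse_gfx_arch(result.stdout, default)


if __name__ == "__main__":
    print(detect_gfx_arch())
```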

Quick Start

Below are the commands for running the entire pipeline tested on AMD GPUs.

# Clone LangSplat repo
git clone https://github.com/jiagaoxiang/LangSplat.git --recursive
cd LangSplat/

# Download the LERF_OVS dataset
pip install gdown
gdown --id 1QF1Po5p5DwTjFHu6tnTeYs_G0egMVmHt --no-check-certificate
# Unzip the downloaded dataset
apt-get update
apt-get install unzip
unzip -q lerf_ovs.zip

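# Download SAM model checkpoints (wget -P keeps the working directory at the repo root)
mkdir -p ckpts && wget -P ckpts https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth

# Preprocess the dataset
# Use an absolute path so it still resolves after `cd autoencoder` below
dataset_path=$(pwd)/lerf_ovs/figurines
python preprocess.py --dataset_path $dataset_path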

# Train the autoencoder
cd autoencoder
python train.py \
  --dataset_path $dataset_path \
  --dataset_name figurines \
  --encoder_dims 256 128 64 32 3 \
  --decoder_dims 16 32 64 128 256 256 512 \
  --lr 0.0007 \
  --num_epochs 100

# Get the 3-dims language feature of the scene
python test.py \
  --dataset_path $dataset_path \
  --dataset_name figurines \
  --encoder_dims 256 128 64 32 3 \
  --decoder_dims 16 32 64 128 256 256 512

# Train RGB 3DGS 30000 checkpoint
cd ..
python train.py \
  -s lerf_ovs/figurines \
  -m lerf_ovs/figurines/output/figurines \
  --iterations 30000 \
  --no_include_feature

Single-GPU Language Feature Training

for level in 1 2 3; do
  # Train the LangSplat (include_feature defaults to True)
  python train.py \
    -s $dataset_path \
    -m output/figurines \
    --start_checkpoint lerf_ovs/figurines/output/figurines_-1/chkpnt30000.pth \
    --feature_level $level
  # Render the LangSplat
  python render.py \
    -s $dataset_path \
    -m output/figurines_${level} \
    --feature_level ${level} \
    --include_feature
done

Multi-GPU DDP Language Feature Training

for level in 1 2 3; do
  # Train the LangSplat with DDP (8 GPUs)
  OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=8 \
      train_ddp.py \
      -s $dataset_path \
      -m output/figurines_ddp \
      --start_checkpoint lerf_ovs/figurines/output/figurines_-1/chkpnt30000.pth \
      --feature_level $level
  # Render the LangSplat (no DDP needed for rendering)
  python render.py \
    -s $dataset_path \
    -m output/figurines_ddp_${level} \
    --feature_level ${level} \
    --include_feature
done

Evaluation, Visualization, and Annotation

All downstream scripts (evaluate_iou_loc.py, visualize_langsplat.py, annotate_objects.py) accept a --model_path / -m flag that takes the same base path you passed to -m during training/rendering. The script automatically appends _{1,2,3} for the three feature levels, so you no longer need to manually construct feature directory paths.
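
The path convention is easy to reproduce when you need the per-level directories in your own scripts. The helper below is a hypothetical sketch of the `_{1,2,3}` suffixing described above; it is not taken from the downstream scripts, whose internals may differ.

```python
from pathlib import Path
from typing import Dict

FEATURE_LEVELS = (1, 2, 3)


def feature_level_dirs(model_path: str) -> Dict[int, Path]:
    """Map each feature level to '<model_path>_<level>' (hypothetical helper)."""
    base = Path(model_path)
    # with_name swaps only the last path component, so parent directories
    # and a trailing slash in model_path are handled consistently.
    return {level: base.with_name(f"{base.name}_{level}") for level in FEATURE_LEVELS}


if __name__ == "__main__":
    for level, path in feature_level_dirs("output/figurines_ddp").items():
        print(level, path.as_posix())
```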

# Set MODEL_PATH to match the -m flag used during training.
# Single-GPU example: MODEL_PATH=output/figurines
# DDP example:        MODEL_PATH=output/figurines_ddp
MODEL_PATH=output/figurines_ddp

# Eval
cd eval
pip install matplotlib mediapy
python evaluate_iou_loc.py \
    --dataset_name figurines \
    --model_path ../${MODEL_PATH} \
    --ae_ckpt_dir ../autoencoder/ckpt \
    --output_dir ../eval_result \
    --json_folder ../lerf_ovs/label
cd ..

# Visualization (heatmaps + localization per query)
python visualize_langsplat.py \
    --dataset_name figurines \
    --model_path ${MODEL_PATH} \
    --use_gt_labels

# Annotation (bounding boxes + labels overlaid on images)
python annotate_objects.py \
    --dataset_name figurines \
    --model_path ${MODEL_PATH} \
    --use_gt_labels

Backward compatibility: The old --feat_dir + --dataset_name combination still works if --model_path is not provided.

The original README is below:

[CVPR2024 Highlight] LangSplat: 3D Language Gaussian Splatting

This repository contains the official authors' implementation associated with the paper "LangSplat: 3D Language Gaussian Splatting" (CVPR 2024), which can be found here. We further provide the preprocessed 3D-OVS datasets with language features, as well as pre-trained models.

😊LangSplat Family

@inproceedings{qin2024langsplat,
  title={Langsplat: 3d language gaussian splatting},
  author={Qin, Minghan and Li, Wanhua and Zhou, Jiawei and Wang, Haoqian and Pfister, Hanspeter},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={20051--20060},
  year={2024}
}

🎉 We have released LangSplat V2! The new version significantly improves performance, achieving 450+ FPS in rendering. [NeurIPS 2025] LangSplat V2

@article{li2025langsplatv2,
  title={LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS},
  author={Li, Wanhua and Zhao, Yujie and Qin, Minghan and Liu, Yang and Cai, Yuanhao and Gan, Chuang and Pfister, Hanspeter},
  journal={arXiv preprint arXiv:2507.07136},
  year={2025}
}

🎉We also invite everyone to check out our [CVPR 2025] 4D LangSplat, which is a multimodal, object-wise video prompting approach combined with a status deformable network to learn 4D language fields.

@inproceedings{li20254d,
  title={4d langsplat: 4d language gaussian splatting via multimodal large language models},
  author={Li, Wanhua and Zhou, Renping and Zhou, Jiawei and Song, Yingwei and Herter, Johannes and Qin, Minghan and Huang, Gao and Pfister, Hanspeter},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={22001--22011},
  year={2025}
}

Cloning the Repository

The repository contains submodules, thus please check it out with

# SSH
git clone git@github.com:minghanqin/LangSplat.git --recursive

or

# HTTPS
git clone https://github.com/minghanqin/LangSplat.git --recursive

Overview

The codebase has 3 main components:

- A PyTorch-based optimizer to produce a LangSplat model from SfM datasets with language feature inputs
- A scene-wise language autoencoder to alleviate the substantial memory demands imposed by explicit modeling
- A script to help you turn your own images into optimization-ready SfM datasets with language features

The components have been tested on Ubuntu Linux 18.04. Instructions for setting up and running each of them are found in the sections below.

Datasets

In the experiments section of our paper, we primarily utilized two datasets: the 3D-OVS dataset and the LERF dataset.

The 3D-OVS dataset is accessible for download via the following link: Download 3D-OVS Dataset.

For the LERF dataset, we have expanded upon its existing collection and also provided the corresponding COLMAP data. These resources can be accessed through this link: Download Expanded LERF Dataset and COLMAP Data.

Optimizer

The optimizer uses PyTorch and CUDA extensions in a Python environment to produce trained models.

Hardware Requirements

CUDA-ready GPU with Compute Capability 7.0+
24 GB VRAM (to train to paper evaluation quality)

Software Requirements

Conda (recommended for easy setup)
C++ Compiler for PyTorch extensions (we used VS Code)
CUDA SDK 11 for PyTorch extensions (we used 11.8)
C++ Compiler and CUDA SDK must be compatible

Setup

Environment Setup

Our default, provided install method is based on Conda package and environment management:

conda env create --file environment.yml
conda activate langsplat

QuickStart

Download the pretrained model to output/, then simply use

python render.py -m output/$CASENAME --include_feature

Processing your own Scenes

Before getting started

Firstly, put your images into the data dir.

<dataset_name>
|---input
|   |---<image 0>
|   |---<image 1>
|   |---...

Secondly, prepare the following dataset layout and a pre-trained RGB model by following the 3dgs repository.

<dataset_name>
|---images
|   |---<image 0>
|   |---<image 1>
|   |---...
|---input
|   |---<image 0>
|   |---<image 1>
|   |---...
|---output
|   |---<dataset_name>
|   |   |---point_cloud/iteration_30000/point_cloud.ply
|   |   |---cameras.json
|   |   |---cfg_args
|   |   |---chkpnt30000.pth
|   |   |---input.ply
|---sparse
    |---0
        |---cameras.bin
        |---images.bin
        |---points3D.bin

Environment setup.

Please install segment-anything-langsplat and download the checkpoints of SAM from here to ckpts/.

Pipeline

Follow process.sh to train LangSplat on your own scenes.

Step 1: Generate Language Feature of the Scenes. Put the image data into the "input" directory under the <dataset_name>/, then run the following code.
```
python preprocess.py --dataset_path $dataset_path 
```

Step 2: Train the Autoencoder and get the lower-dims Feature.

# train the autoencoder
cd autoencoder
python train.py --dataset_path $dataset_path --dataset_name $casename --encoder_dims 256 128 64 32 3 --decoder_dims 16 32 64 128 256 256 512 --lr 0.0007
# get the 3-dims language feature of the scene
python test.py --dataset_path $dataset_path --dataset_name $casename --encoder_dims 256 128 64 32 3 --decoder_dims 16 32 64 128 256 256 512
cd ..
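
To see what the `--encoder_dims` and `--decoder_dims` values used above imply, the sketch below expands them into per-layer (input, output) widths. It rests on two assumptions that are not confirmed by this README: that each list gives the successive output widths of a chain of fully connected layers, and that the encoder input is 512-dimensional (mirroring the 512-dimensional decoder output mentioned in Step 5). The function name `layer_shapes` is hypothetical.

```python
from typing import List, Sequence, Tuple


def layer_shapes(input_dim: int, dims: Sequence[int]) -> List[Tuple[int, int]]:
    """Expand a list of layer widths into (in_features, out_features) pairs."""
    shapes = []
    previous = input_dim
    for width in dims:
        shapes.append((previous, width))
        previous = width
    return shapes


encoder_dims = [256, 128, 64, 32, 3]
decoder_dims = [16, 32, 64, 128, 256, 256, 512]

# Assumed: 512-dim language features in, 3-dim latent out.
encoder = layer_shapes(512, encoder_dims)
# The decoder starts from the encoder's 3-dim latent and ends at 512 dims.
decoder = layer_shapes(encoder_dims[-1], decoder_dims)

print(encoder[0], encoder[-1])  # (512, 256) (32, 3)
print(decoder[0], decoder[-1])  # (3, 16) (256, 512)
```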

Our model expects the following dataset structure in the source path location:

<dataset_name>
|---images
|   |---<image 0>
|   |---<image 1>
|   |---...
|---language_feature
|   |---00_f.npy
|   |---00_s.npy
|   |---...
|---language_feature_dim3
|   |---00_f.npy
|   |---00_s.npy
|   |---...
|---output
|   |---<dataset_name>
|   |   |---point_cloud/iteration_30000/point_cloud.ply
|   |   |---cameras.json
|   |   |---cfg_args
|   |   |---chkpnt30000.pth
|   |   |---input.ply
|---sparse
    |---0
        |---cameras.bin
        |---images.bin
        |---points3D.bin
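
Before launching a long training run, it can help to check a scene directory against the layout above. The following is a small sketch of such a check, not a script shipped with the repo; the function name `missing_entries` is hypothetical, and it only looks at the top-level folders and the COLMAP files listed in the tree.

```python
from pathlib import Path
from typing import List

# Entries taken from the directory tree above.
REQUIRED_DIRS = ("images", "language_feature", "language_feature_dim3", "output")
REQUIRED_FILES = (
    "sparse/0/cameras.bin",
    "sparse/0/images.bin",
    "sparse/0/points3D.bin",
)


def missing_entries(dataset_root: str) -> List[str]:
    """Return the required folders and files that are absent under dataset_root."""
    root = Path(dataset_root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root / f).is_file()]
    return missing


if __name__ == "__main__":
    import sys

    problems = missing_entries(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("layout OK" if not problems else "missing: " + ", ".join(problems))
```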

Step 3: Train the LangSplat.

python train.py -s $dataset_path -m output/${casename} --start_checkpoint $dataset_path/output/$casename/chkpnt30000.pth --feature_level ${level}

Step 4: Render the LangSplat.

python render.py -s $dataset_path -m output/${casename}_${level} --feature_level ${level} --include_feature

Step 5: Eval. First, we generate the 3-dim language feature map through Step 4. Subsequently, the decoder elevates the features from 3 dimensions to 512 dimensions. For further operations and detailed explanations, please refer to the supplementary materials.
- 3D Object Localization on LERF and 3D Semantic Segmentation on LERF. Our eval code is based on LERF and NerfStudio, thanks for these impressive open-source projects!
  - Please download the lerf_ovs first.
  - Set the gt_folder as the path to lerf_ovs/label.
  - Make sure to finish Step 4 before you run the eval code.
```
cd eval
sh eval.sh
```

TODO list:

release the code of the optimizer
release the code of the autoencoder
release the code of the segment-anything-langsplat
update the arxiv link
release the preprocessed dataset and the pretrained model
release more preprocessed dataset and the pretrained model (coming soon)
release the code of the eval

This project is still under development. Please feel free to raise issues or submit pull requests to contribute to our codebase.
