# [IJCAI 2024] UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation
Qingdong He1, Jinlong Peng1, Zhengkai Jiang1, Kai Wu1, Xiaozhong Ji1, Jiangning Zhang1, Yabiao Wang1, Chengjie Wang1, Mingang Chen2, Yunsheng Wu1.
1Youtu Lab, Tencent, 2Shanghai Development Center of Computer Software Technology
3D open-vocabulary scene understanding aims to recognize arbitrary novel categories beyond the base label space. However, existing works not only fail to fully utilize all the available modal information in the 3D domain but also lack sufficient granularity in representing the features of each modality. In this paper, we propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D, which aligns point clouds with image, language and depth. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module that learns comprehensive fine-grained feature representations. Further, to facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs, capitalizing on geometric constraints across various viewpoints of 3D scenes. Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, which achieves state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3DIS and nuScenes.
All the code has been tested in the following environment:
- Python 3.7+
- PyTorch 1.8
- CUDA 11.1
- spconv v2.x
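A quick sanity check of the interpreter side of these requirements (a minimal sketch; it only verifies the Python version, since checking torch, CUDA and spconv requires those packages to already be installed):

```shell
# Verify the Python interpreter meets the 3.7+ requirement.
python3 - <<'EOF'
import sys
assert sys.version_info >= (3, 7), "Python 3.7+ required"
print("python ok")
EOF
```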
a. Clone this repository.

```shell
git clone https://github.com/hithqd/UniM-OV3D.git
git fetch --all
git checkout main
```

b. Install the dependent libraries as follows:
- Install the dependent Python libraries (please note that you need to install the correct version of `torch` and `spconv` according to your CUDA version):

```shell
pip install -r requirements.txt
```
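For example, on CUDA 11.1 (the tested toolkit above), matching wheels can be installed as follows. The exact version pins are assumptions; adjust the `cu111` suffix to your CUDA version. Shown as a dry run that only prints the commands (remove the `echo` to actually install):

```shell
# Dry run: prints the install commands for a CUDA 11.1 setup.
# torch 1.8.1+cu111 comes from the official PyTorch wheel index;
# spconv v2.x ships CUDA-specific wheels named spconv-cuXXX.
echo pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
echo pip install spconv-cu111
```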
- Install SoftGroup following its official guidance:

```shell
cd pcseg/external_libs/softgroup_ops
python3 setup.py build_ext develop
cd ../../..
```
- Install pcseg:

```shell
python3 setup.py develop
```
The dataset configs are located within `tools/cfgs/dataset_configs`, and the model configs are located within `tools/cfgs` for different settings.
- Please download the ScanNet Dataset and follow PointGroup to pre-process the dataset as follows, or directly download the pre-processed data here.
- Additionally, please download the caption data here. If you want to generate captions on your own, please download the image data (`scannet_frames_25k`) from ScanNet and follow the scripts `generate_caption.py` and `generate_caption_idx.py`.
- The directory organization should be as follows:

```
├── data
│   ├── scannetv2
│   │   ├── train
│   │   │   ├── scene0000_00.pth
│   │   │   ├── ...
│   │   ├── val
│   │   ├── text_embed
│   │   ├── caption_idx
│   │   ├── scannetv2_train.txt
│   │   ├── scannetv2_val.txt
│   │   ├── scannet_frames_25k (optional, only for caption generation)
├── pcseg
├── tools
```
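As a convenience, the directory skeleton above can be created with a short shell sketch. This only creates the folders; the `.pth` scenes, split files and caption data still come from the download and pre-processing steps:

```shell
# Create the expected ScanNet data skeleton (names taken from the layout above).
mkdir -p data/scannetv2/train \
         data/scannetv2/val \
         data/scannetv2/text_embed \
         data/scannetv2/caption_idx
# Pre-processed scenes (*.pth) and scannetv2_{train,val}.txt are then
# placed under data/scannetv2/ before training.
ls data/scannetv2
```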
- Please download the S3DIS Dataset and follow `dataset/s3dis/preprocess.py` to pre-process the dataset as follows, or directly download the pre-processed data here:

```shell
python3 pcseg/datasets/s3dis/preprocess.py
```
- Additionally, please download the caption data here. If you want to generate captions on your own, please download the image data here and follow the scripts `generate_caption.py` and `generate_caption_idx.py`.
- The directory organization should be as follows:

```
├── data
│   ├── s3dis
│   │   ├── stanford_indoor3d_inst
│   │   │   ├── Area_1_Conference_1.npy
│   │   │   ├── ...
│   │   ├── text_embed
│   │   ├── caption_idx
│   │   ├── s3dis_2d (optional, only for caption generation)
├── pcseg
├── tools
```
- Please download the official nuScenes 3D object detection dataset and organize the downloaded files as follows:
- Additionally, please download the caption data here.
```
├── data
│   ├── nuscenes
│   │   ├── text_embed
│   │   ├── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   ├── samples
│   │   │   ├── sweeps
│   │   │   ├── maps
│   │   │   ├── caption_idx
│   │   │   ├── v1.0-trainval
├── pcseg
├── tools
```
- Install the `nuscenes-devkit` with version `1.0.5` by running the following command:

```shell
pip install nuscenes-devkit==1.0.5
```

- Semantic segmentation on four datasets
| Dataset | Partition | Path |
| --- | --- | --- |
| ScanNet | B15/N4 | ckpt |
| ScanNet | B12/N7 | ckpt |
| ScanNet | B10/N9 | ckpt |
| S3DIS | B8/N4 | ckpt |
| S3DIS | B6/N6 | ckpt |
| ScanNet200 | B170/N30 | ckpt |
| ScanNet200 | B150/N50 | ckpt |
| nuScenes | B12/N3 | ckpt |
| nuScenes | B10/N5 | ckpt |
- Instance segmentation on two datasets

| Dataset | Partition | Path |
| --- | --- | --- |
| ScanNet | B13/N4 | ckpt |
| ScanNet | B10/N7 | ckpt |
| ScanNet | B8/N9 | ckpt |
| S3DIS | B8/N4 | ckpt |
| S3DIS | B6/N6 | ckpt |
Train with the following command:

```shell
cd tools
sh scripts/dist_train.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} ${PY_ARGS}
```

Test with the following command:

```shell
cd tools
sh scripts/dist_test.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} --ckpt ${CKPT_PATH}
```

If you find this work useful in your research, please cite:

```
@article{he2024unim,
  title={UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation},
  author={He, Qingdong and Peng, Jinlong and Jiang, Zhengkai and Wu, Kai and Ji, Xiaozhong and Zhang, Jiangning and Wang, Yabiao and Wang, Chengjie and Chen, Mingang and Wu, Yunsheng},
  journal={33rd International Joint Conference on Artificial Intelligence (IJCAI)},
  year={2024}
}
```
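A concrete invocation of the train/test scripts above might look like the following. The config and checkpoint paths are hypothetical placeholders, not files guaranteed to exist in the repo; it is shown as a dry run that only prints the commands (remove the `echo` lines' prefix to actually launch jobs from `tools/`):

```shell
NUM_GPUS=4
CONFIG_FILE=cfgs/scannet_models/unim_ov3d.yaml   # hypothetical config path
CKPT_PATH=../output/ckpt/best.pth                # hypothetical checkpoint path
# Dry run: prints the distributed train and test commands.
echo sh scripts/dist_train.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE}
echo sh scripts/dist_test.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} --ckpt ${CKPT_PATH}
```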
