# [IJCAI 2024] UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation
Qingdong He1, Jinlong Peng1, Zhengkai Jiang1, Kai Wu1, Xiaozhong Ji1, Jiangning Zhang1, Yabiao Wang1, Chengjie Wang1, Mingang Chen2, Yunsheng Wu1.
1Youtu Lab, Tencent, 2Shanghai Development Center of Computer Software Technology
3D open-vocabulary scene understanding aims to recognize arbitrary novel categories beyond the base label space. However, existing works not only fail to fully utilize all the available modal information in the 3D domain but also lack sufficient granularity in representing the features of each modality. In this paper, we propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D, which aligns point clouds with image, language and depth. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module that learns comprehensive fine-grained feature representations. Further, to facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs, capitalizing on geometric constraints across various viewpoints of 3D scenes. Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, which achieves state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3DIS and nuScenes.
All the code has been tested in the following environment:
- Python 3.7+
- PyTorch 1.8
- CUDA 11.1
- spconv v2.x
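A quick sanity check of the interpreter side of these requirements (a minimal sketch; it only verifies the Python version, since checking torch, CUDA and spconv requires those packages to already be installed):

```shell
# Verify the Python interpreter meets the 3.7+ requirement.
python3 - <<'EOF'
import sys
assert sys.version_info >= (3, 7), "Python 3.7+ required"
print("python ok")
EOF
```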
a. Clone this repository.

```shell
git clone https://github.com/hithqd/UniM-OV3D.git
git fetch --all
git checkout main
```

b. Install the dependent libraries as follows:
- Install the dependent Python libraries (please note that you need to install the correct version of `torch` and `spconv` according to your CUDA version):

```shell
pip install -r requirements.txt
```
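For example, on CUDA 11.1 (the tested toolkit above), matching wheels can be installed as follows. The exact version pins are assumptions; adjust the `cu111` suffix to your CUDA version. Shown as a dry run that only prints the commands (remove the `echo` to actually install):

```shell
# Dry run: prints the install commands for a CUDA 11.1 setup.
# torch 1.8.1+cu111 comes from the official PyTorch wheel index;
# spconv v2.x ships CUDA-specific wheels named spconv-cuXXX.
echo pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
echo pip install spconv-cu111
```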
- Install SoftGroup following its official guidance:

```shell
cd pcseg/external_libs/softgroup_ops
python3 setup.py build_ext develop
cd ../../..
```
- Install pcseg:

```shell
python3 setup.py develop
```
The dataset configs are located within `tools/cfgs/dataset_configs`, and the model configs are located within `tools/cfgs` for different settings.
- Please download the ScanNet Dataset and follow PointGroup to pre-process the dataset as follows, or directly download the pre-processed data here.
- Additionally, please download the caption data here. If you want to generate captions on your own, please download the image data (`scannet_frames_25k`) from ScanNet and follow the scripts `generate_caption.py` and `generate_caption_idx.py`.
- The directory organization should be as follows:

```
├── data
│   ├── scannetv2
│   │   ├── train
│   │   │   ├── scene0000_00.pth
│   │   │   ├── ...
│   │   ├── val
│   │   ├── text_embed
│   │   ├── caption_idx
│   │   ├── scannetv2_train.txt
│   │   ├── scannetv2_val.txt
│   │   ├── scannet_frames_25k (optional, only for caption generation)
├── pcseg
├── tools
```
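As a convenience, the directory skeleton above can be created with a short shell sketch. This only creates the folders; the `.pth` scenes, split files and caption data still come from the download and pre-processing steps:

```shell
# Create the expected ScanNet data skeleton (names taken from the layout above).
mkdir -p data/scannetv2/train \
         data/scannetv2/val \
         data/scannetv2/text_embed \
         data/scannetv2/caption_idx
# Pre-processed scenes (*.pth) and scannetv2_{train,val}.txt are then
# placed under data/scannetv2/ before training.
ls data/scannetv2
```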
- Please download the S3DIS Dataset and follow `dataset/s3dis/preprocess.py` to pre-process the dataset as follows, or directly download the pre-processed data here:

```shell
python3 pcseg/datasets/s3dis/preprocess.py
```
- Additionally, please download the caption data here. If you want to generate captions on your own, please download the image data here and follow the scripts `generate_caption.py` and `generate_caption_idx.py`.
- The directory organization should be as follows:

```
├── data
│   ├── s3dis
│   │   ├── stanford_indoor3d_inst
│   │   │   ├── Area_1_Conference_1.npy
│   │   │   ├── ...
│   │   ├── text_embed
│   │   ├── caption_idx
│   │   ├── s3dis_2d (optional, only for caption generation)
├── pcseg
├── tools
```
- Please download the official nuScenes 3D object detection dataset and organize the downloaded files as follows:
- Additionally, please download the caption data here.
```
├── data
│   ├── nuscenes
│   │   ├── text_embed
│   │   ├── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   ├── samples
│   │   │   ├── sweeps
│   │   │   ├── maps
│   │   │   ├── caption_idx
│   │   │   ├── v1.0-trainval
├── pcseg
├── tools
```
- Install the `nuscenes-devkit` with version `1.0.5` by running the following command:

```shell
pip install nuscenes-devkit==1.0.5
```

- Semantic segmentation on four datasets
| Dataset | Partition | Path |
| --- | --- | --- |
| ScanNet | B15/N4 | ckpt |
| ScanNet | B12/N7 | ckpt |
| ScanNet | B10/N9 | ckpt |
| S3DIS | B8/N4 | ckpt |
| S3DIS | B6/N6 | ckpt |
| ScanNet200 | B170/N30 | ckpt |
| ScanNet200 | B150/N50 | ckpt |
| nuScenes | B12/N3 | ckpt |
| nuScenes | B10/N5 | ckpt |
- Instance segmentation on two datasets

| Dataset | Partition | Path |
| --- | --- | --- |
| ScanNet | B13/N4 | ckpt |
| ScanNet | B10/N7 | ckpt |
| ScanNet | B8/N9 | ckpt |
| S3DIS | B8/N4 | ckpt |
| S3DIS | B6/N6 | ckpt |
Train with the following command:

```shell
cd tools
sh scripts/dist_train.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} ${PY_ARGS}
```

Test with the following command:

```shell
cd tools
sh scripts/dist_test.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} --ckpt ${CKPT_PATH}
```

If you find this work useful in your research, please cite:

```
@article{he2024unim,
  title={UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation},
  author={He, Qingdong and Peng, Jinlong and Jiang, Zhengkai and Wu, Kai and Ji, Xiaozhong and Zhang, Jiangning and Wang, Yabiao and Wang, Chengjie and Chen, Mingang and Wu, Yunsheng},
  journal={33rd International Joint Conference on Artificial Intelligence (IJCAI)},
  year={2024}
}
```
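A concrete invocation of the train/test scripts above might look like the following. The config and checkpoint paths are hypothetical placeholders, not files guaranteed to exist in the repo; it is shown as a dry run that only prints the commands (remove the `echo` lines' prefix to actually launch jobs from `tools/`):

```shell
NUM_GPUS=4
CONFIG_FILE=cfgs/scannet_models/unim_ov3d.yaml   # hypothetical config path
CKPT_PATH=../output/ckpt/best.pth                # hypothetical checkpoint path
# Dry run: prints the distributed train and test commands.
echo sh scripts/dist_train.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE}
echo sh scripts/dist_test.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} --ckpt ${CKPT_PATH}
```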
