[IJCAI 2024] UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation

Qingdong He¹, Jinlong Peng¹, Zhengkai Jiang¹, Kai Wu¹, Xiaozhong Ji¹, Jiangning Zhang¹, Yabiao Wang¹, Chengjie Wang¹, Mingang Chen², Yunsheng Wu¹

¹Youtu Lab, Tencent  ²Shanghai Development Center of Computer Software Technology


3D open-vocabulary scene understanding aims to recognize arbitrary novel categories beyond the base label space. However, existing works not only fail to fully utilize all the available modal information in the 3D domain but also lack sufficient granularity in representing the features of each modality. In this paper, we propose a unified multimodal 3D open-vocabulary scene understanding network, namely UniM-OV3D, which aligns point clouds with image, language and depth. To better integrate global and local features of the point clouds, we design a hierarchical point cloud feature extraction module that learns comprehensive fine-grained feature representations. Further, to facilitate the learning of coarse-to-fine point-semantic representations from captions, we propose the utilization of hierarchical 3D caption pairs, capitalizing on geometric constraints across various viewpoints of 3D scenes. Extensive experimental results demonstrate the effectiveness and superiority of our method in open-vocabulary semantic and instance segmentation, achieving state-of-the-art performance on both indoor and outdoor benchmarks such as ScanNet, ScanNet200, S3DIS and nuScenes.

Requirements

All the code has been tested in the following environment:

Install dependent libraries

a. Clone this repository.

git clone https://github.com/hithqd/UniM-OV3D.git
git fetch --all
git checkout main

b. Install the dependent libraries as follows:

  • Install the dependent Python libraries (please note that you need to install the correct versions of torch and spconv according to your CUDA version):

    pip install -r requirements.txt 
  • Install SoftGroup following its official guidance.

    cd pcseg/external_libs/softgroup_ops
    python3 setup.py build_ext develop
    cd ../../..
  • Install pcseg

    python3 setup.py develop

The dataset configs are located in tools/cfgs/dataset_configs, and the model configs for the different settings are located in tools/cfgs.

Datasets

ScanNet Dataset

  • Please download the ScanNet Dataset and follow PointGroup to pre-process it, or directly download the pre-processed data here.

  • Additionally, please download the caption data here. If you want to generate captions on your own, please download the image data (scannet_frames_25k) from ScanNet and follow the scripts generate_caption.py and generate_caption_idx.py.

  • The directory organization should be as follows:

    ├── data
    │   ├── scannetv2
    │   │   │── train
    │   │   │   │── scene0000_00.pth
    │   │   │   │── ...
    │   │   │── val
    │   │   │── text_embed
    │   │   │── caption_idx
    │   │   │── scannetv2_train.txt
    │   │   │── scannetv2_val.txt
    │   │   │── scannet_frames_25k (optional, only for caption generation)
    ├── pcseg
    ├── tools
    

S3DIS Dataset

  • Please download the S3DIS Dataset and follow pcseg/datasets/s3dis/preprocess.py to pre-process the dataset as follows, or directly download the pre-processed data here.

    python3 pcseg/datasets/s3dis/preprocess.py 
  • Additionally, please download the caption data here. If you want to generate captions on your own, please download the image data here and follow the scripts generate_caption.py and generate_caption_idx.py.

  • The directory organization should be as follows:

    ├── data
    │   ├── s3dis
    │   │   │── stanford_indoor3d_inst
    │   │   │   │── Area_1_Conference_1.npy
    │   │   │   │── ...
    │   │   │── text_embed
    │   │   │── caption_idx
    │   │   │── s3dis_2d (optional, only for caption generation)
    ├── pcseg
    ├── tools
    

nuScenes Dataset

  • The directory organization should be as follows:

    ├── data
    │   ├── nuscenes
    │   │   │── text_embed
    │   │   │── v1.0-trainval (or v1.0-mini if you use mini)
    │   │   │   │── samples
    │   │   │   │── sweeps
    │   │   │   │── maps
    │   │   │   │── caption_idx
    │   │   │   │── v1.0-trainval
    ├── pcseg
    ├── tools

  • Install the nuscenes-devkit with version 1.0.5 by running the following command:

    pip install nuscenes-devkit==1.0.5

Model Zoo

3D Semantic Segmentation

  • Semantic segmentation on four datasets

    Dataset     Partition  Path
    ScanNet     B15/N4     ckpt
    ScanNet     B12/N7     ckpt
    ScanNet     B10/N9     ckpt
    S3DIS       B8/N4      ckpt
    S3DIS       B6/N6      ckpt
    ScanNet200  B170/N30   ckpt
    ScanNet200  B150/N50   ckpt
    nuScenes    B12/N3     ckpt
    nuScenes    B10/N5     ckpt

3D Instance Segmentation

  • Instance segmentation on two datasets

    Dataset  Partition  Path
    ScanNet  B13/N4     ckpt
    ScanNet  B10/N7     ckpt
    ScanNet  B8/N9      ckpt
    S3DIS    B8/N4      ckpt
    S3DIS    B6/N6      ckpt

Training

cd tools
sh scripts/dist_train.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} ${PY_ARGS}
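To make the placeholders concrete, a 4-GPU run could be assembled as below. The config path is illustrative (not a file guaranteed to exist in the repository; pick a real one from tools/cfgs), so this sketch only prints the command it would launch:

```shell
# Hypothetical values: 4 GPUs and an illustrative config path
NUM_GPUS=4
CONFIG_FILE=cfgs/dataset_configs/scannet.yaml   # substitute a real config from tools/cfgs
# Print the fully expanded training command
echo sh scripts/dist_train.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE}
```

Any extra `${PY_ARGS}` (e.g. overrides accepted by the training script) would be appended after the config flag.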

Inference

cd tools
sh scripts/dist_test.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} --ckpt ${CKPT_PATH}
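Similarly, an evaluation run substitutes a GPU count, a config, and a downloaded checkpoint from the model zoo. The config and checkpoint paths below are illustrative placeholders, so this sketch only prints the command it would launch:

```shell
# Hypothetical values; both paths are illustrative placeholders
NUM_GPUS=4
CONFIG_FILE=cfgs/dataset_configs/scannet.yaml   # substitute a real config from tools/cfgs
CKPT_PATH=../checkpoints/scannet_b15n4.pth      # substitute a checkpoint from the model zoo
# Print the fully expanded evaluation command
echo sh scripts/dist_test.sh ${NUM_GPUS} --cfg_file ${CONFIG_FILE} --ckpt ${CKPT_PATH}
```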

Citation

@inproceedings{he2024unim,
  title={UniM-OV3D: Uni-Modality Open-Vocabulary 3D Scene Understanding with Fine-Grained Feature Representation},
  author={He, Qingdong and Peng, Jinlong and Jiang, Zhengkai and Wu, Kai and Ji, Xiaozhong and Zhang, Jiangning and Wang, Yabiao and Wang, Chengjie and Chen, Mingang and Wu, Yunsheng},
  booktitle={Proceedings of the 33rd International Joint Conference on Artificial Intelligence (IJCAI)},
  year={2024}
}
