SCoT: Teaching 3D-LLMs to Think Spatially with Million-scale CoT Annotations

This is the official PyTorch implementation of SCoT.

Abstract

Recent advances in 3D Large Language Models (3D-LLMs) show strong potential in understanding and interacting with 3D environments, yet their training data typically lack explicit reasoning processes, limiting complex spatial reasoning and task planning. To address this, we annotate SCoT, a million-scale Chain-of-Thought dataset spanning three levels: a) Spatial Perception (what is there), recognizing object properties, relations, and scene attributes; b) Spatial Analysis (what does it mean), inferring rationality, functionalities, and physical implications; c) Spatial Planning (what should I do), integrating perception and reasoning for actionable strategies. Unlike prior datasets that supervise only answers, SCoT annotates intermediate reasoning grounded in scene cues, specifically for analysis and planning tasks. Results show that CoT supervision greatly benefits complex analysis and planning but induces hallucinations and accuracy drops in simple perception. These findings highlight both the necessity and the nuanced challenges of scene-grounded reasoning for advancing 3D intelligence.

💾 SCoT Dataset

Text Annotations

We have released the SCoT dataset (including Spatial Perception, Spatial Analysis, and Spatial Planning data) on Google Drive and Hugging Face.

Source Data and Preprocessed Features

We have released all source data and preprocessed features on Google Drive.

If you intend to train and test on your own 3D dataset, or want to generate the features yourself, we recommend following a procedure similar to the one used in Chat Scene.

💻 Requirements

The code has been tested on:

Ubuntu 20.04
CUDA 12.2
Python 3.10
PyTorch 2.2.1
NVIDIA A100 GPU (40 GB)

🔧 Installation

Create and activate the conda environment

conda create -n SCoT python=3.10
conda activate SCoT

Install the necessary packages

conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
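
After installation, a quick check of the environment can save a failed training run. The snippet below is a minimal sketch, not part of the repository: the function name `describe_environment` is hypothetical, and it only reports versions rather than enforcing them. Note that the install command above selects the `pytorch-cuda=11.8` build, so `cuda_build` is expected to reflect that build rather than the system CUDA version.

```python
import importlib
import platform
import sys


def describe_environment():
    """Collect the versions listed under Requirements.

    Hypothetical helper: returns a dict and does not fail if torch is missing.
    """
    info = {
        "python": platform.python_version(),
        "python_is_3_10": sys.version_info[:2] == (3, 10),
        "torch": None,
        "cuda_build": None,
        "gpu_available": False,
    }
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed (or not importable) in this environment.
        return info
    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version PyTorch was built against
    info["gpu_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Run it inside the activated `SCoT` environment; a `torch` value of `None` means the PyTorch install step did not succeed.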

🚅 Train

In the first stage, SCoT-Reasoner is trained to establish a basic understanding of 3D scenes. You can train SCoT-Reasoner with Vicuna-7B v1.5 as the backbone. Note that you should change llama_model_path in run.sh to the path of Vicuna-7B v1.5.

Please download the spatial perception data and modify scripts/config_stage_1.py. We have organized the data into a trainable format in ./SCoT_Dataset/annotations.
Please modify run.sh with the following configuration, then run: bash scripts/run.sh.

   # run.sh (stage 1)
   train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
   val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
   evaluate=False

   python tasks/train.py \
      "$(dirname $0)/${config}config_stage_1.py" \
      ...
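
The `train_tag` and `val_tag` values are `#`-separated lists of dataset names. The sketch below illustrates that format and checks whether any validation dataset is missing from the training list. It assumes the training script splits these strings on `#`; the helper names `parse_tags` and `val_only_tags` are hypothetical and not taken from the repository.

```python
def parse_tags(tag_string):
    """Split a '#'-separated tag string into a list of dataset names."""
    return [name for name in tag_string.split("#") if name]


def val_only_tags(train_tag, val_tag):
    """Return validation datasets that do not appear in the training list."""
    train = set(parse_tags(train_tag))
    return [name for name in parse_tags(val_tag) if name not in train]


# Values copied from the stage 1 run.sh excerpt above.
train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
val_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref"

print(parse_tags(train_tag))
print(val_only_tags(train_tag, val_tag))
```

An empty second list means every validation dataset is also used for training, which is the case for the stage 1 settings shown here.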

In the second stage, SCoT-Reasoner is fine-tuned to generate reasoning chains. Please change pretrained_path in run.sh to the path of the checkpoint from the first stage. Alternatively, you can use a checkpoint from Chat Scene as the pre-trained model.

Please download the spatial analysis and planning data and modify scripts/config_stage_2.py. We have organized the data into a trainable format in ./SCoT_Dataset/SCoT_Training_Stage_2.
Please modify run.sh with the following configuration, then run: bash scripts/run.sh.

   # run.sh (stage 2)
   train_tag="scanrefer#obj_align#scan2cap#sqa3d"
   val_tag="scanrefer"
   evaluate=False

   python tasks/train.py \
      "$(dirname $0)/${config}config_stage_2.py" \
      ...
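
Because both stages reuse the same run.sh, it is easy to leave stage 1 values in place when starting stage 2. The sketch below collects the per-stage values from the two excerpts above in one place and prints the variable assignments for a chosen stage. `StageSettings`, `STAGES`, and `render_run_sh_lines` are hypothetical names for illustration; the repository itself expects you to edit run.sh by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSettings:
    """Values this README asks you to set in run.sh for one training stage."""
    config_file: str
    train_tag: str
    val_tag: str
    evaluate: bool = False


# Values copied from the two run.sh excerpts in this README.
STAGES = {
    1: StageSettings(
        config_file="config_stage_1.py",
        train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align",
        val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref",
    ),
    2: StageSettings(
        config_file="config_stage_2.py",
        train_tag="scanrefer#obj_align#scan2cap#sqa3d",
        val_tag="scanrefer",
    ),
}


def render_run_sh_lines(stage):
    """Format the shell variable assignments for the given stage."""
    s = STAGES[stage]
    return "\n".join([
        f'train_tag="{s.train_tag}"',
        f'val_tag="{s.val_tag}"',
        f"evaluate={s.evaluate}",
    ])


print(render_run_sh_lines(2))
```

The printed lines can be compared against run.sh before launching a stage.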

✏️ Evaluation

You can evaluate model performance on the SCoT dataset. Please change pretrained_path in run.sh to the path of the checkpoint and run the evaluation with bash scripts/run.sh:

    # run.sh (Test)
    val_tag="scanrefer#sqa3d"
    evaluate=True
    pretrained_path="/Path_to_Pretrained_Model.pth"

We have provided the pretrained checkpoint of SCoT-Reasoner on Google Drive and Hugging Face.

Text-based Metrics

   python utils/Eval_SCoT.py

LLM-based Assessments

   python utils/Eval_SCoT_LLM_Score.py

😊 Acknowledgement

Thanks to these wonderful open-source projects:

3D Dataset: ScanNet, ARKitScenes.

3D-Language Dataset: ScanRefer, Scan2Cap, SQA3D, MSR3D.

Representations: Uni3D, DINOv2.

3D-LLMs: 3D-LLM, Video-3D-LLM, Chat Scene.

Contact us

If you find this repo helpful, please give us a star. For any questions, please contact us via lijp57@whu.edu.cn.
