Official PyTorch implementation of
“SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models.”
Shun Taguchi, Hideki Deguchi, Takumi Hamazaki, Hiroyuki Sakai
SpatialPrompting tackles zero-shot spatial question answering in 3D scenes by
- extracting representative keyframes based on spatial and semantic features, and
- constructing LLM prompts that embed spatial context without any 3D-specific fine-tuning.
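The keyframe-selection step can be pictured as a diversity-maximizing sampler over per-frame spatial and semantic features. The sketch below is a hypothetical illustration (greedy farthest-point sampling; the name `select_keyframes` and the frame representation are invented for this example), not the repository's actual algorithm — see `extract_features.py` for that:

```python
# Hypothetical sketch: pick k frames that jointly cover the scene in both
# camera position (spatial) and appearance (semantic) feature space.
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_keyframes(frames, k):
    """frames: list of (position, feature) tuples.
    Greedy farthest-point sampling: each new keyframe maximizes its minimum
    combined (spatial + semantic) distance to the keyframes already chosen."""
    chosen = [0]  # seed with the first frame
    while len(chosen) < min(k, len(frames)):
        best, best_d = None, -1.0
        for i in range(len(frames)):
            if i in chosen:
                continue
            pos_i, feat_i = frames[i]
            # distance to the nearest already-selected keyframe
            d = min(dist(pos_i, frames[j][0]) + dist(feat_i, frames[j][1])
                    for j in chosen)
            if d > best_d:
                best, best_d = i, d
        chosen.append(best)
    return sorted(chosen)
```

With two tight clusters of frames, the sampler picks one representative from each rather than two near-duplicates.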
This repository contains:
- Feature Extraction –
extract_features.py - Interactive Spatial QA –
spatialqa.py - Benchmark Inference –
predict_scanqa.py,predict_sqa3d.py - Evaluation –
score_scanqa.py,score_sqa3d.py
```
git clone https://github.com/ToyotaCRDL/SpatialPrompting.git
cd SpatialPrompting
conda create -n spatialprompting python=3.10 -y
conda activate spatialprompting
```

Install PyTorch (see https://pytorch.org) and the other dependencies:

```
# CUDA 11.8 build
pip install torch==2.5.0+cu118 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
```

The project is tested on Ubuntu 22.04 + Python 3.10 + CUDA 11.8 + PyTorch 2.5.0.
```
export OPENAI_API_KEY="your_openai_key"
export GOOGLE_API_KEY="your_gemini_key"
```

```
/path/to/your/data
└── data
    ├── ScanNet
    ├── ScanQA
    └── SQA3D
```

- Please extract the .sens files of ScanNet.
- When running the scripts, specify the base path using the `--base_path` argument.
```
python extract_features.py \
    --base_path /path/to/your/data \
    --dataset scannet \
    --env scene0050_00 \
    --model vitl336
```

```
python spatialqa.py \
    --llm gpt-4o-2024-11-20 \
    --feature /path/to/spatial_feature.npz \
    --image_num 30
```
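For context on what the QA step sends to the model: the prompt interleaves keyframe images with their camera poses so the LLM receives spatial context without 3D fine-tuning. The following is a schematic sketch in the OpenAI chat-completions message format; the function name `build_prompt` and the exact prompt template are assumptions for illustration, not the code used by `spatialqa.py`:

```python
# Illustrative sketch: pack keyframes plus pose metadata into a multimodal
# chat prompt (OpenAI-style message format). The real template may differ.
def build_prompt(question, keyframes):
    """keyframes: list of (image_b64, (x, y, z)) tuples."""
    content = [{"type": "text",
                "text": "Answer the spatial question using these keyframes."}]
    for img_b64, (x, y, z) in keyframes:
        # Interleave each image with its camera position so the LLM can
        # reason about the scene layout.
        content.append({"type": "text",
                        "text": f"Camera position: ({x:.2f}, {y:.2f}, {z:.2f})"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}})
    content.append({"type": "text", "text": f"Question: {question}"})
    return [{"role": "user", "content": content}]
```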
- ScanQA
  - Predict:
    ```
    python predict_scanqa.py \
        --base_path /path/to/your/data \
        --llm gpt-4o-2024-11-20 \
        --model vitl336 \
        --image_num 30
    ```
  - Evaluate:

    ```
    python score_scanqa.py \
        --base_path /path/to/your/data \
        --pred /path/to/prediction.jsonl \
        --use_spice  # optional
    ```

- SQA3D
  - Predict:

    ```
    python predict_sqa3d.py \
        --base_path /path/to/your/data \
        --llm gpt-4o-2024-11-20 \
        --model vitl336 \
        --image_num 30
    ```

  - Evaluate:

    ```
    python score_sqa3d.py \
        --base_path /path/to/your/data \
        --pred /path/to/prediction.jsonl
    ```
If you find this project useful in your research, please consider citing:
```bibtex
@article{taguchi2025spatialprompting,
  title={SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models},
  author={Taguchi, Shun and Deguchi, Hideki and Hamazaki, Takumi and Sakai, Hiroyuki},
  journal={arXiv preprint arXiv:2505.04911},
  year={2025}
}
```
This code is released for non-commercial research use only.
See the full text in the LICENSE file.
