# VLMLocPredictor: A Vision-Language Model for Next Location Prediction in Trajectory Data
VLMLocPredictor combines vision-language capabilities with trajectory data to predict next locations. It uses a two-stage training approach: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) based fine-tuning.
## Installation

```bash
# Clone the repository
git clone https://github.com/Rising0321/VLMLocPredictor.git
cd VLMLocPredictor

# Install dependencies
pip install -r requirements.txt
```

## Data Preparation

- Configure your dataset paths in `data/dataset_info.json` (a sketch of an entry follows this list)
- Supported datasets:
  - Chengdu: `pointLabel`, `pointLogic`
  - Porto: `pointLabelPorto`, `pointLogicPorto`
  - San Francisco: `pointLabelSanfrancisco`, `pointLogicSanfrancisco`
  - Rome: `pointLabelRome`, `pointLogicRome`
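A minimal sketch of what a `data/dataset_info.json` entry might look like, assuming the Llama Factory dataset registry format; the file name and column mapping below are illustrative placeholders, not the repository's actual values:

```json
{
  "pointLabel": {
    "file_name": "trajDataJsonsDirty/pointLabel.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages",
      "images": "images"
    }
  }
}
```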
## SFT Training

We use Llama Factory for the SFT stage.

### First Stage

- Set up your Vision-Language Model path as `PATH_MODEL`
- Configure the datasets: `pointLabel,pointLabelPorto,pointLabelSanfrancisco,pointLabelRome`
- Run:

```bash
bash scripts/train/cot_sft/resume_finetune_qwen2vl_2b_pointLabel_cot_sft.sh
```
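For example (the model path is a placeholder, and whether `PATH_MODEL` is exported in the shell or edited inside the script is an assumption):

```bash
# Placeholder path to the base Vision-Language Model checkpoint
export PATH_MODEL=/path/to/Qwen2-VL-2B-Instruct
bash scripts/train/cot_sft/resume_finetune_qwen2vl_2b_pointLabel_cot_sft.sh
```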
### Second Stage

- Use the model trained in the First Stage as `PRETRAIN_MODEL_PATH`
- Add the logic datasets: `pointLabel,pointLabelPorto,pointLabelSanfrancisco,pointLabelRome,pointLogic,pointLogicPorto,pointLogicSanfrancisco,pointLogicRome`
- Run the same script as in the First Stage (a sketch follows this list)
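A minimal sketch, assuming the first-stage checkpoint was saved to a local output directory (the path is a placeholder):

```bash
# Placeholder: point PRETRAIN_MODEL_PATH at the first-stage SFT checkpoint
export PRETRAIN_MODEL_PATH=./output/qwen2vl_2b_pointLabel_cot_sft
bash scripts/train/cot_sft/resume_finetune_qwen2vl_2b_pointLabel_cot_sft.sh
```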
## RL Training

The RL model implementation is located in `train/stage_rl/`.

- Configure:
  - `DATASET_NAME`: path to your dataset
  - `MODEL_NAME_OR_PATH`: path to your pre-trained model
  - `IMAGE_PATH`: path to your image data (will be released soon)
- Run:

```bash
bash scripts/train/reason_rft_zero/resume_finetune_qwen2vl_2b_traj_only_rl.sh
```
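For example (all paths are placeholders, and how the script consumes these variables is an assumption; the image data has not yet been released):

```bash
# Placeholder paths for the RL-based fine-tuning stage
export DATASET_NAME=./data/pointLogic
export MODEL_NAME_OR_PATH=./output/qwen2vl_2b_sft_second_stage
export IMAGE_PATH=./images   # image data: to be released
bash scripts/train/reason_rft_zero/resume_finetune_qwen2vl_2b_traj_only_rl.sh
```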
## Project Structure

```
VLMLocPredictor/
├── data/                # Dataset configuration
├── eval/                # Evaluation scripts
├── train/
│   ├── stage_sft/       # Supervised Fine-Tuning
│   └── stage_rl/        # RL-based Fine-Tuning
├── trajDataJsonsDirty/  # Training dataset
├── VLM/                 # Training images (to be uploaded)
├── roadSoftMap/         # Distance to closest road (to be uploaded)
├── scripts/             # Training scripts
└── utils/               # Utility functions
```