LLM agents often make irreversible mistakes in long-horizon tasks because they cannot reliably look ahead. Imagine-then-Plan (ITP) addresses this by first running an adaptive K-step imagination rollout with a learned world model, then selecting the action based on both the current state and the predicted future trajectory. This formulates policy reasoning as a Partially Observable and Imaginable Markov Decision Process (POIMDP).
- [Coming Soon]
  - 📦 Processed data release
  - 🧊 Training checkpoints release (World Model / ITP-R)
- [Jan 18, 2026] 🚀 Code release (environment setup, prompts, and core modules).
- [Jan 15, 2026] 📄 Paper released on arXiv and Hugging Face.
Modern LLM agents often behave reactively: they decide actions from the current observation and short history, which can be brittle for long-horizon tasks. ITP addresses this by introducing a learned textual world model and an adaptive lookahead mechanism, enabling the agent to "mentally rehearse" possible futures before committing actions in the real environment.
ITP treats decision making as reasoning over both the current observable state and an imagined multi-step future trajectory produced by the world model. At each step, the agent can invoke the world model to simulate several steps ahead, use that foresight to reflect on progress, risks, and constraints, and then output an action.
ITP-I enhances an LLM agent at inference time without parameter updates. The agent follows a three-stage Imagine-then-Plan procedure at each step:
- Adaptive horizon selection: decide how many steps to look ahead (K) based on the task and current situation.
- World-model imagination: roll out K steps to obtain a foresight trajectory.
- Reflect-then-act: reflect on the foresight (progress, risks, constraints) and then output the real action.
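The three stages above can be condensed into a single per-step loop. The following is an illustrative, minimal sketch, not the repository's implementation: `choose_horizon`, `WorldModel.predict`, and the `propose`/`act` callables are hypothetical stand-ins for the underlying LLM calls.

```python
# Minimal sketch of the ITP-I per-step loop (illustrative; the real agent
# backs each of these calls with LLM prompting).

def choose_horizon(state: str, k_max: int = 5) -> int:
    """Adaptive horizon selection: decide how far to look ahead.
    Toy heuristic: imagine further when the state hints at irreversibility."""
    return k_max if "irreversible" in state else 1

class WorldModel:
    """Stand-in textual world model: predicts the next state for an action."""
    def predict(self, state: str, action: str) -> str:
        return f"{state} -> {action}"

def imagine(wm: WorldModel, state: str, propose, k: int) -> list:
    """World-model imagination: roll out k steps into a foresight trajectory."""
    trajectory = []
    for _ in range(k):
        action = propose(state)            # imagined action
        state = wm.predict(state, action)  # imagined next state
        trajectory.append((action, state))
    return trajectory

def itp_step(state: str, wm: WorldModel, propose, act) -> str:
    k = choose_horizon(state)                   # 1) adaptive horizon selection
    foresight = imagine(wm, state, propose, k)  # 2) world-model imagination
    return act(state, foresight)                # 3) reflect-then-act
```

Because `act` sees the whole foresight trajectory before emitting the real action, the agent commits to an environment step only after the mental rehearsal.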
ITP-R learns when and how long to imagine by adding a lightweight K-head predictor on top of the backbone LLM and training it with a three-stage pipeline:
- Pseudo-labeling horizons: derive training targets for K by selecting the most helpful lookahead depth under a cost trade-off.
- Warm-up training: jointly train the action policy (imitation) and the K-head predictor.
- Online A2C optimization: optimize the action policy + K-head predictor + value head online with actor–critic training while the world model is frozen.
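One concrete reading of the pseudo-labeling stage (an assumption on our part, mirroring the `--k_candidates` and `--lambda_k` flags used in the training commands): score each candidate horizon by how much its foresight helps the policy, subtract a cost linear in K, and take the argmax as the label.

```python
def pseudo_label_k(scores: dict, lambda_k: float = 0.2) -> int:
    """Pick the pseudo-label K* = argmax_K (scores[K] - lambda_k * K).

    scores[K] is assumed to measure how much a K-step rollout improves the
    policy's likelihood of the expert action (larger is better); the
    lambda_k * K term charges for longer imagination."""
    return max(scores, key=lambda k: scores[k] - lambda_k * k)

# Example: foresight helps, but with diminishing returns, so a short
# horizon wins under the cost trade-off.
best_k = pseudo_label_k({0: 0.0, 1: 0.30, 3: 0.55, 5: 0.60})  # -> 1
```

The cost term is what makes the learned K-head prefer cheap, short rollouts unless a deeper lookahead pays for itself.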
We recommend conda with Python 3.9+.

```bash
# Create a clean environment
conda create -n itp python=3.9 -y
conda activate itp

# (Optional) upgrade pip tooling
python -m pip install --upgrade pip setuptools wheel

# From the repository root
pip install -r requirements.txt

# (Recommended) install as editable for local development
pip install -e .
```

Train the textual world model in an environment with DeepSpeed installed:

```bash
conda activate <ENV_WITH_DEEPSPEED>

export MODEL_PATH="path/to/base_lm"
export DATA_PATH="path/to/wm_train.jsonl"
export OUTPUT_DIR="outputs/wm_run"
export DS_CONFIG="world_model/base_tuning/config/deepspeed_config_s2.json"

bash world_model/base_tuning/run_worldmodel_tuning.sh
```

The merged world model is written to `${OUTPUT_DIR}/merged_full`.

Stage 1 (pseudo-labeling horizons) derives training targets for K from expert rollouts:

```bash
python -u -m itp.training.train_adaptive_k label \
    --expert_jsonl "path/to/expert_rollouts.jsonl" \
    --policy_model_path "path/to/policy_base_or_sft" \
    --wm_model_path "path/to/wm_merged_full" \
    --out_labeled_jsonl "outputs/itp_r/labeled.jsonl" \
    --kmax 5 \
    --k_candidates "0,1,2,3,4,5" \
    --lambda_k 0.2 \
    --wm_max_new_tokens 192 \
    --max_seq_len 1536 \
    --padding_side left \
    --logprob_norm sum
```

Stage 2 (warm-up training) jointly trains the action policy and the K-head predictor on the labeled data:

```bash
python -u -m itp.training.train_adaptive_k sft \
    --train_jsonl "outputs/itp_r/labeled.jsonl" \
    --policy_model_path "path/to/policy_base_or_sft" \
    --out_dir "outputs/itp_r/policy_sft_khead" \
    --kmax 5 \
    --epochs 3 \
    --batch_size 1 \
    --grad_accum 16 \
    --lr 2e-5 \
    --wd 0.0 \
    --beta_k 0.5 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --max_seq_len 1536 \
    --fp16 \
    --padding_side left
```

Stage 3 (online A2C) optimizes the policy, K-head, and value head against ALFWorld while the world model stays frozen:

```bash
conda activate <ENV_WITH_ALFWORLD>
export TOKENIZERS_PARALLELISM=false

python -u -m itp.training.train_adaptive_k rl_k \
    --policy_model_path "outputs/itp_r/policy_sft_khead" \
    --out_dir "outputs/itp_r/policy_rlk" \
    --alfworld_config "eval_agent/data/alfworld/base_config.yaml" \
    --env_data_path "eval_agent/data/alfworld" \
    --env_split "valid_seen" \
    --wm_model_path "path/to/wm_merged_full" \
    --wm_max_new_tokens 192 \
    --policy_device "cuda:0" \
    --wm_device "cuda:1" \
    --kmax 5 \
    --epochs 1 \
    --episodes_per_epoch 50 \
    --max_steps 50 \
    --gamma 0.99 \
    --lr 5e-6 \
    --wd 0.0 \
    --lambda_k 0.2 \
    --step_cost 0.01 \
    --success_bonus 0.01 \
    --invalid_action_penalty -0.1 \
    --entropy_coef 0.01 \
    --value_coef 1.0 \
    --max_grad_norm 1.0 \
    --action_max_new_tokens 16 \
    --action_do_sample 1 \
    --action_temperature 0.7 \
    --imagine_action_max_new_tokens 12 \
    --max_seq_len 1536 \
    --history_keep_steps 6 \
    --padding_side left \
    --logprob_norm sum \
    --fp16
```

Evaluate on ALFWorld (seen split):

```bash
export POLICY_MODEL="path/to/policy_checkpoint"
export WM_MODEL="path/to/wm_merged_full"
export OUT_DIR="outputs/eval/alfworld_seen"
mkdir -p "${OUT_DIR}"

python -u -m foresight_eval \
    --method foresight \
    --split seen \
    --env_data_root "eval_agent/data/alfworld" \
    --policy_model "${POLICY_MODEL}" \
    --wm_model "${WM_MODEL}" \
    --wm_backend local \
    --output_path "${OUT_DIR}" \
    --max_steps 40 \
    --max_k 5 \
    --fixed_k -1 \
    --use_history 0 \
    --decision_tokens 16 \
    --act_tokens 256 \
    --foresight_tokens 256
```

Evaluate on ScienceWorld (test split):

```bash
export SCIENCEWORLD_JAR="path/to/scienceworld.jar"
export POLICY_MODEL="path/to/policy_checkpoint"
export WM_MODEL="path/to/wm_merged_full"
export OUT_DIR="outputs/eval/sciworld_test"
mkdir -p "${OUT_DIR}"

python -u -m foresight_eval.runner_sciworld \
    --split test \
    --jar_path "${SCIENCEWORLD_JAR}" \
    --policy_model "${POLICY_MODEL}" \
    --wm_model "${WM_MODEL}" \
    --output_path "${OUT_DIR}" \
    --wm_backend local \
    --policy_device "cuda:0" \
    --wm_device "cuda:1" \
    --policy_dtype fp16 \
    --wm_dtype fp16 \
    --max_k 3 \
    --fixed_k -1
```

For any questions, please reach out to us at loyiv5477@gmail.com.
If you find this work helpful, please consider citing our paper as follows:
```bibtex
@article{liu2026itp,
  title   = {Imagine-then-Plan: Agent Learning from Adaptive Lookahead with World Models},
  author  = {Liu, Youwei and Wang, Jian and Wang, Hanlin and Guo, Beichen and Li, Wenjie},
  journal = {arXiv preprint arXiv:2601.08955},
  year    = {2026},
  url     = {https://arxiv.org/abs/2601.08955}
}
```

