Adaptive Stress Testing Black-Box LLM Planners

Note: This repository contains the code for our arXiv 2025 paper. For issues, contact Neeloy Chakraborty at neeloyc2@illinois.edu.

Abstract

Large language models (LLMs) have recently demonstrated success in generalizing across decision-making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in real-world environments where sensors are noisy or unreliable. Characterizing how LLM planners respond to varied observations is necessary to proactively avoid failures in safety-critical scenarios. We specifically characterize the response of LLMs along two different perturbation dimensions. Like prior works, one dimension generates semantically similar prompts with varied phrasing by randomizing the order of details, modifying access to few-shot examples, etc. Unique to our work, our second dimension simulates access to varied sensors and noise to mimic raw sensor or detection algorithm failures. An initial case study in which perturbations are manually applied shows that both dimensions lead LLMs to hallucinate in a multi-agent driving environment. Unfortunately, covering the entire perturbation space for several scenarios is infeasible. As such, we propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation enables the discovery of scenarios, sensor configurations, and prompt phrasings that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime.

References

Part of the code is based on the AST Toolbox, DiLu, and CrowdNav++ repositories referenced in the setup sections below.

Command Notation

Below, certain commands must be run across multiple terminals. To denote that a command should be run in terminal 1, we write [T1] <command>, and so on for other terminals.

README Table of Contents

  1. Setup Ollama
  2. Setup AnythingLLM
  3. Setup AST Toolbox
  4. Run Driving Experiments
  5. Run Robot Crowd Navigation Experiments
  6. Run Lunar Lander Experiments

Setup Ollama

We use Ollama as the inference server for running open-source LLMs.

Download Ollama

Rather than directly installing Ollama, we use a portable executable version of the platform.

  1. Download a version of Ollama from their GitHub releases here: https://github.com/ollama/ollama/releases. We used version 0.6.8.
    • wget https://github.com/ollama/ollama/releases/download/v0.6.8/ollama-linux-amd64.tgz
  2. Create a folder to extract the executable into
    • mkdir -p path/to/ollama_v0.6.8
  3. Extract the executable
    • tar -xvzf ollama-linux-amd64.tgz -C path/to/ollama_v0.6.8
  4. Append the following lines to your ~/.bashrc to update the environment variables for Ollama (a consolidated snippet follows this list)
    • Update OLLAMA_HOST so AnythingLLM can access the models later on
      • export OLLAMA_HOST=0.0.0.0:11434
    • Update OLLAMA_MODELS to a folder where you want the model weights to be downloaded
      • export OLLAMA_MODELS="/path/to/ollama_models"
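
For reference, the appended section of your ~/.bashrc should look like the following (the models path is a placeholder to adjust):

# Ollama environment variables
export OLLAMA_HOST=0.0.0.0:11434               # expose the server so AnythingLLM can reach it
export OLLAMA_MODELS="/path/to/ollama_models"  # folder where model weights are downloaded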

Download Models

We can now download the models used for planning and for vector embedding. Weights will be downloaded to the path set in the OLLAMA_MODELS environment variable.

  1. Start Ollama server
    • [T1] cd path/to/ollama_v0.6.8/bin
    • [T1] ./ollama serve
  2. Download models (the IDs of the model weights used in our experiments are listed in parentheses; a verification command follows this list)
    • [T2] cd path/to/ollama_v0.6.8/bin
      • Vector embedding model (790764642607)
        • [T2] ./ollama run bge-m3:latest
      • DeepSeek-R1 14B (ea35dfe18182)
        • [T2] ./ollama run deepseek-r1:14b
      • Llama 3.2 3B (e410b836fe61)
        • [T2] ./ollama run llama3.2:3b-instruct-q8_0
      • Dolphin 3.0 8B (d5ab9ae8e1f2)
        • [T2] ./ollama run dolphin3:latest
      • Qwen 3.0 8B (500a1f067a9f)
        • [T2] ./ollama run qwen3:8b
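
To verify the downloads and cross-check the weight IDs listed above, you can list the installed models:

[T2] ./ollama list

The ID column should match the IDs in parentheses above.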

You can now close these terminals.

Setup AnythingLLM

AnythingLLM is a service we use for easy integration with Ollama and other LLM providers, allowing quick swapping of LLMs during experiments. It exposes a unified API regardless of the underlying LLM.

Download Image

We download the Docker version of AnythingLLM and convert it to an Apptainer image. The Docker image could be used directly, but we ran our experiments on a cluster where only Apptainer was available.

  1. Download image and convert to Apptainer (we used version 1.7.8)
    • apptainer pull ./anythingllm.sif docker://mintplexlabs/anythingllm:1.7.8

Run Apptainer

  1. Create directory to save logs
    • [T1] export STORAGE_LOCATION=$HOME/anythingllm && mkdir -p $STORAGE_LOCATION && touch "$STORAGE_LOCATION/.env"
  2. Run container
    • [T1] apptainer run --bind ${STORAGE_LOCATION}:/app/server/storage --bind ${STORAGE_LOCATION}/.env:/app/server/.env --env STORAGE_DIR="/app/server/storage" --writable-tmpfs anythingllm.sif
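
Before moving on, you can confirm from a second terminal that the container is serving (a minimal check; assumes curl is installed):

[T2] curl -sI http://localhost:3001/ | head -n 1

A successful HTTP status line here indicates that the web UI is up.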

Access AnythingLLM

The container is now running on the server at http://localhost:3001/.

  1. If you are running AnythingLLM on a headless server like a cluster, you can port-forward to a computer with a screen
    • [Local Computer with Screen] ssh -l <remote_user_name> -L 127.0.0.1:3001:<remote_server_ip>:3001 <remote_server_ip>
  2. Visit http://localhost:3001/
  3. Follow the user interface to create a default account on the server
    • Do NOT enable multiple users on the server
    • It does not matter what you select for the default workspace or LLM provider at this time; we will be configuring them later on

Start Ollama

We run one instance of Ollama for LLMs and another instance for embedding models.

  1. Run Ollama server for parallel LLM queries
    • [T2] cd path/to/ollama_v0.6.8/bin
    • [T2] OLLAMA_KEEP_ALIVE="1h" OLLAMA_FLASH_ATTENTION="true" OLLAMA_SCHED_SPREAD="true" OLLAMA_NUM_PARALLEL=40 ./ollama serve
      • OLLAMA_KEEP_ALIVE keeps the model weights in GPU memory longer before they are evicted
      • OLLAMA_FLASH_ATTENTION enables faster inference
      • OLLAMA_SCHED_SPREAD spreads parallel processing across multiple GPUs
      • OLLAMA_NUM_PARALLEL sets the maximum number of requests to process in parallel (make this smaller if you find that the weights are split across GPU and CPU and inference is slow; discussed later)
  2. Run Ollama server for embedding queries
    • [T3] cd path/to/ollama_v0.6.8/bin
    • [T3] OLLAMA_HOST=0.0.0.0:11435 OLLAMA_KEEP_ALIVE="1h" OLLAMA_FLASH_ATTENTION="true" OLLAMA_SCHED_SPREAD="true" ./ollama serve
      • OLLAMA_HOST here overrides the value set in ~/.bashrc so that embedding queries are processed on a separate port from LLM queries
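
To sanity-check both instances, you can query Ollama's version endpoint on each port from a free terminal:

[T4] curl -s http://127.0.0.1:11434/api/version
[T4] curl -s http://127.0.0.1:11435/api/version

Each should return a small JSON object containing the Ollama version (e.g., 0.6.8).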

Configure AnythingLLM

Now, we can configure AnythingLLM to communicate with Ollama.

  1. Visit http://localhost:3001/settings/llm-preference
  2. Select Ollama under the LLM Provider drop-down menu
  3. Under Advanced Settings
    • Set Ollama Base URL to http://127.0.0.1:11434
      • One of the downloaded Ollama models should be selected automatically under Ollama Model
    • Set Performance Mode to Maximum
    • Set Ollama Keep Alive to 1 hour
  4. Set Max Tokens to 4096
  5. Visit http://localhost:3001/settings/vector-database
  6. Select LanceDB under the Vector Database Provider drop-down menu
  7. Visit http://localhost:3001/settings/embedding-preference
  8. Under Manual Endpoint Input
    • Set Ollama Base URL to http://127.0.0.1:11435
  9. Select bge-m3 under Ollama Embedding Model
  10. Set Max Embedding Chunk Length to 4096
  11. Visit http://localhost:3001/settings/text-splitter-preference
  12. Set Text Chunk Size to 4096
  13. Set Text Chunk Overlap to 20
  14. Visit http://localhost:3001/settings/api-keys
  15. Click Generate New API Key
  16. Copy the API key and add it to your ~/.bashrc
    • export ANYTHINGLLM_API_KEY="<copied_key>"
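
To verify the key works, you can query the developer API's auth endpoint (a quick check; we assume the v1 route of AnythingLLM 1.7.8):

source ~/.bashrc
curl -s -H "Authorization: Bearer $ANYTHINGLLM_API_KEY" http://localhost:3001/api/v1/auth

A JSON response indicating valid authentication confirms the key was copied correctly.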

Setup AST Toolbox

We have cloned the AST Toolbox provided by SISL and updated it to work for our experiments. Install dependencies as follows.

cd /ast-llm/toolbox_source
sudo chmod a+x scripts/install_all.sh
sudo scripts/install_all.sh

Now, create a Conda environment for AST experiments.

cd /ast-llm/conda_envs
conda env create -f ast.yml

Run Driving Experiments

Here, we walk through how to run our case study, characterize LLMs in different scenarios in the highway-env, and run real-time applications of the characterization trees.

Setup Driving Workspaces in AnythingLLM

  1. Start AnythingLLM
    • Follow steps 2-5 and 7-8 under Setup AnythingLLM in this README
  2. Create a workspace for holding driving experiences
  3. Download the Chroma driving memories from DiLu
  4. Create a Conda environment for converting memories from Chroma and uploading to AnythingLLM workspace
    • [T4] cd /ast-llm/conda_envs
    • [T4] conda env create -f chroma-dilu.yml
    • [T4] conda activate chroma-dilu
  5. Upload memories to DiLu memory workspace
    • [T4] cd /ast-llm/dilu
    • [T4] python upload_chroma_memories.py

You should now see a value of 21 under Vector Count at http://localhost:3001/workspace/dilu_memory_agent/settings/vector-database.

  6. Create a workspace for each LLM used for planning

Repeat step 6 for the following workspaces and LLMs:

| Workspace Name | Ollama LLM |
| --- | --- |
| ollama_deepseek_r1_driver_agent | deepseek-r1 |
| ollama_llama_3.2_driver_agent | llama3.2 |
| ollama_dolphin_3_driver_agent | dolphin3 |
| ollama_qwen_3_driver_agent | qwen3 |

Setup DiLu Conda Environment

Because the ast Conda environment has old dependencies, we create a separate environment for running the actual simulators in which we test LLMs. You can refer to our dilu Conda environment yml file at /ast-llm/conda_envs/dilu.yml to cross-reference versions of the required packages. Our environment is adapted to work with the DiLu repo.

conda create -n dilu python=3.10
conda activate dilu
pip install git+https://github.com/eleurent/highway-env
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/eleurent/rl-agents#egg=rl-agents
pip install git+https://github.com/DLR-RM/stable-baselines3
pip install moviepy -U
pip install imageio_ffmpeg
pip install pyvirtualdisplay
sudo apt-get install -y xvfb ffmpeg
pip install PyYAML
pip install rich

xvfb is only used for rendering frames from the simulator; if you do not need rendering (e.g., you omit --render below), you can skip installing it.
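
As a quick smoke test of the new environment (a minimal sketch that only checks that the key packages import and that CUDA is visible):

conda activate dilu
python -c "import highway_env, torch; print('cuda available:', torch.cuda.is_available())"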

Run Case-Study

To analyze speed, return, crash, and other statistics of LLMs in the environment:

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], and Ollama embedding service is running on [T3] as described above
  2. Start evaluating LLM in environment
    • [T4] conda activate dilu
    • [T4] cd /ast-llm/dilu
    • [T4] python run_dilu.py --prompt <PROMPT> --policy <POLICY> --memory <MEMORY> --obs_subset <OBS_SUB> --obs_details <OBS_DET> --obs_noise <OBS_NOISE> --obs_order <OBS_ORDER> --save_dir <EPS_DIR> --num_eps <EPS> --save_traj --render
      • Argument explanations:
        • PROMPT
          • A system prompt configuration class from /ast-llm/dilu/configs.py
        • POLICY
          • An LLM policy class from /ast-llm/dilu/configs.py
        • MEMORY
          • A memory class from /ast-llm/dilu/configs.py (whether few-shot examples are used)
        • OBS_SUB
          • An observation subset class from /ast-llm/dilu/configs.py (whether both ego and other agents' observations are provided)
        • OBS_DET
          • An observation detail class from /ast-llm/dilu/configs.py (which state details are provided)
        • OBS_NOISE
          • An observation noise class from /ast-llm/dilu/configs.py (how much noise to apply to different observation details)
        • OBS_ORDER
          • An observation ordering class from /ast-llm/dilu/configs.py (whether to randomize the order of descriptions of different agents)
        • EPS_DIR
          • A path where all LLM test runs will be stored
        • EPS
          • Number of episodes to run eval for
        • --save_traj
          • Whether to save the stats of each trajectory
        • --render
          • Whether to render the video for each episode

Episodes will be stored under <EPS_DIR>/StandardEnvConfig/<MEMORY>/<PROMPT>/<POLICY>/<OBS_SUB>/<OBS_DET>/<OBS_ORDER>/<OBS_NOISE>/run_<#>, where # increments every time a new set of episodes is run for the POLICY.

Manually perturb prompts and evaluate inconsistency of predictions:

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], and Ollama embedding service is running on [T3] as described above
  2. Start evaluating the inconsistency rate of predictions after perturbing prompts
    • [T4] conda activate dilu
    • [T4] cd /ast-llm/dilu
    • [T4] python offline_manual_perturbation.py --load_dir <EPS_DIR> --prompt <PROMPT> --policy <POLICY> --memory <MEMORY> --obs_subset <OBS_SUB> --obs_details <OBS_DET> --obs_noise <OBS_NOISE> --obs_order <OBS_ORDER> --save_dir <PERTURB_DIR>
      • Argument explanations:
        • --load_dir
          • The EPS_DIR from which trajectories are loaded to generate perturbed prompts
        • PERTURB_DIR
          • A path where all results from manually perturbing prompts will be stored

Perturbed generations will be stored at <PERTURB_DIR>/<MEMORY>/<PROMPT>/<POLICY>/<OBS_SUB>/<OBS_DET>/<OBS_ORDER>/<OBS_NOISE>/run_<#>, where # increments every time a new set of manual perturbations occurs for the POLICY.

Train MCTS Trees for Offline Characterization

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], and Ollama embedding service is running on [T3] as described above
  2. Start server for highway-env
    • [T4] conda activate dilu
    • [T4] cd /ast-llm/dilu
    • [T4] uvicorn parallel_highway_env_api:parallel_app --host 0.0.0.0 --port 8000
  3. Start parallel environments for the LLM to be characterized by running sample episodes
    • [T5] conda activate ast
    • [T5] cd /ast-llm/toolbox_source
    • [T5] source scripts/setup.sh
    • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <TRAIN_DIR> --num_eps <EPS>
      • Argument explanations:
        • TRAIN_DIR
          • A path where all training runs for the driving environment will be stored
        • EPS
          • Number of episodes to run environment setup for
      • Example commands:
        • Llama
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Llama3dot2StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
        • Dolphin
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Dolphin3StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
        • Qwen
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardCleanedPromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10

Episodes will be stored under <TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>, where # increments every time a new set of environments is created for the POLICY. Note that the first step of the first episode will take some time (at most ~3 minutes) to load the embedding model and LLM weights onto the GPU. If [T5] is stuck at Successfully added a new environment for over 5 minutes, then Ollama has more than likely split the model weights across GPU and CPU because all parallel copies of the LLM could not fit into GPU memory alone. To double-check the memory usage split, you can cd path/to/ollama_v0.6.8/bin in another terminal and run ./ollama ps (see the snippet below). If some percentage of CPU is being used, you can ctrl-c terminals [T2]-[T5] and re-run the commands with a smaller value for OLLAMA_NUM_PARALLEL.
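
For reference, the memory-split check looks like this (run from any free terminal):

cd path/to/ollama_v0.6.8/bin
./ollama ps   # the PROCESSOR column should read "100% GPU"; any CPU share means the weights were split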

  4. Start characterizing timesteps (below we list example commands with Llama, but the same could be done with Dolphin and Qwen by swapping the --policy class)
    • Example commands:
      • Using action diversity metric
        • [T5] python examples/language_highway_env_runners/batch_language_highway_env_parallel_runner.py --policy Llama3dot2StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_actions --diversity_measure diversity --use_rew_difference --plot_tree --num_trees 20 --log_path <TRAIN_DIR>/StandardEnvConfig/Llama3dot2StochasticPolicyConfig/run_<#>/mcts_runs
      • Using Shannon entropy metric
        • [T5] python examples/language_highway_env_runners/batch_language_highway_env_parallel_runner.py --policy Llama3dot2StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_actions --diversity_measure shannon --use_rew_difference --plot_tree --num_trees 20 --log_path <TRAIN_DIR>/StandardEnvConfig/Llama3dot2StochasticPolicyConfig/run_<#>/mcts_runs
      • Using environment reward function metric
        • We only run the reward metric-based trees on feasible timesteps, so we first run this command to find those timesteps
          • [T5] python examples/language_highway_env_runners/find_critical_timesteps.py
        • [T5] python examples/language_highway_env_runners/batch_language_highway_env_parallel_runner.py --policy Llama3dot2StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_rewards --use_rew_difference --plot_tree --num_trees 20 --avoid_infeasible --log_path <TRAIN_DIR>/StandardEnvConfig/Llama3dot2StochasticPolicyConfig/run_<#>/mcts_runs

LEN and ITER can be updated to search the whole, deep, or shallow tree space:

| Tree Type | LEN | ITER |
| --- | --- | --- |
| Whole | 8 | 504 |
| Deep | 8 | 32 |
| Shallow | 5 | 32 |

Characterizations will be stored under <TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>/mcts_runs/run_<##>, where ## increments every time a new characterization is started under that mcts_runs folder. In that run folder, parallel_trees.txt holds a list of the paths to timesteps characterized for the LLM.

Analyze Characterization Trees

Below, we list our nomenclature for different directories after the offline characterization process has completed.

| Directory Title | Purpose | Example |
| --- | --- | --- |
| TRAIN_DIR | Where all driving environment training runs are held. | |
| MCTS_RUNS | Where one experiment is stored. <MCTS_RUNS>/parallel_trees.txt holds paths to folders for each tree characterization. | <TRAIN_DIR>/StandardEnvConfig/Qwen3StochasticPolicyConfig/run_0/mcts_runs/run_0 |
| TREE_CHAR | Where results for one tree from one experiment are stored (found inside of parallel_trees.txt). | <TRAIN_DIR>/StandardEnvConfig/Qwen3StochasticPolicyConfig/run_0/eps_0_t_0/ast_runs/Qwen3StochasticPolicyConfig/run_0 |
  1. We can start by processing the results of each characterization by performing DFS:
    • cd /ast-llm/toolbox_source
    • conda activate ast
    • source scripts/setup.sh
    • python examples/language_highway_env_runners/visualization/graph_parallel_trees.py --path <MCTS_RUNS>
  2. Generate analysis visualizations per tree
    • python examples/language_highway_env_runners/visualization/analyze_parallel_trees.py --path <MCTS_RUNS>
  3. Save the raw prompts and LLM predictions per tree
    • python examples/language_highway_env_runners/visualization/save_parallel_tree_prompts.py --path <MCTS_RUNS>
These scripts produce the following outputs:

  • KDE plot of entropy per tree based on all sampled actions: <MCTS_RUNS>/all_samples_entropy.png
  • KDE plot of entropy per tree based on majority voted actions from low-diversity perturbation states: <MCTS_RUNS>/low_diversity_samples_entropy.png
  • KDE plot of normalized action diversity per tree based on all sampled actions: <MCTS_RUNS>/all_samples_action_diversity.png
  • KDE plot of normalized action diversity per tree based on majority voted actions from low-diversity perturbation states: <MCTS_RUNS>/low_diversity_samples_action_diversity.png
  • Predicted action distribution: <TREE_CHAR>/stats/llm_pred_dist.png
  • Sample diversity for different perturbation states: <TREE_CHAR>/stats/llm_majority_action.png
  • Reward distribution for taking each adversarial action: <TREE_CHAR>/stats/ast_action_rew.png
  • Number of times each adversarial action led to 0 AST reward: <TREE_CHAR>/stats/no_change_rew_action.png
  • A visualization of the tree generated during MCTS: <TREE_CHAR>/tree.svg
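
Since the three analysis scripts share the same --path flag, they can also be run back-to-back (a convenience sketch; replace <MCTS_RUNS> with a real experiment path):

cd /ast-llm/toolbox_source
conda activate ast
source scripts/setup.sh
# DFS processing first, then per-tree visualizations and prompt dumps
for script in graph_parallel_trees analyze_parallel_trees save_parallel_tree_prompts; do
  python examples/language_highway_env_runners/visualization/${script}.py --path <MCTS_RUNS>
done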

Runtime Applications

To evaluate the influence of generated prompts at test-time offline:

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], Ollama embedding service is running on [T3], and highway-env server is running on [T4] as described above
  2. Start parallel environments for the LLM to be influenced by running sample episodes
    • [T5] conda activate ast
    • [T5] cd /ast-llm/toolbox_source
    • [T5] source scripts/setup.sh
    • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <INFLUENCE_DIR> --num_eps <EPS> --test
      • Argument explanations:
        • INFLUENCE_DIR
          • A path where all offline influence runs for the driving environment will be stored
        • --test
          • Ensures that environment episodes are initialized with test seeds
      • Example commands:
        • Llama
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Llama3dot2StochasticPolicyConfig --save_dir <INFLUENCE_DIR> --num_eps 5 --test
        • Dolphin
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Dolphin3StochasticPolicyConfig --save_dir <INFLUENCE_DIR> --num_eps 5 --test
        • Qwen
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardCleanedPromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <INFLUENCE_DIR> --num_eps 5 --test
  3. Start recording influence of generated prompts on uncertainty of predictions across all timesteps
    • [T5] python examples/language_highway_env_runners/batch_language_highway_env_compare_influential_perturbations.py --policy <POLICY> --num_trees 1 --num_prompts 1 --load_dir <TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>/mcts_runs/run_<##> --log_path <INFLUENCE_DIR>/StandardEnvConfig/<POLICY>/run_<#>/perturbed_runs
      • Argument explanations:
        • --num_trees
          • Number of trees to sample templates from (top-K most similar trees)
        • --num_prompts
          • Number of prompt templates to select that are most and least desirable from each tree
        • --load_dir
          • Path to characterization folder from training that prompt templates should originate from

Below, we refer to the folder path <INFLUENCE_DIR>/StandardEnvConfig/<POLICY>/run_<#>/perturbed_runs/run_<##> as PERTURBED_RUNS. In that run folder, parallel_trees.txt holds a list of the paths to the timesteps at which the LLM was influenced.

  4. Visualize the difference in Shannon entropy distribution between samples from desirable and undesirable templates
    • [T5] python examples/language_highway_env_runners/visualization/graph_compare_influential_perturbations.py --path <PERTURBED_RUNS>

This produces the following output:

  • Shannon entropy distribution comparison after applying influential prompts: <PERTURBED_RUNS>/diversity_distribution.png

To influence models in a closed-loop setting:

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], Ollama embedding service is running on [T3], and highway-env server is running on [T4] as described above
  2. Start influencing an LLM in a closed-loop simulation
    • [T5] conda activate ast
    • [T5] cd /ast-llm/toolbox_source
    • [T5] source scripts/setup.sh
    • [T5] python examples/language_highway_env_runners/language_highway_env_closed_loop_adversary.py --policy <POLICY> --load_dir <TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>/mcts_runs/run_<##> --save_dir <CLOSED_LOOP_PATH>

The script will create an experimental run folder at <CLOSED_LOOP_PATH>/StandardEnvConfig/<POLICY>/run_<#>. In this folder, adversarial holds episodes where the most undesirable prompt from the closest tree is applied at each timestep, and trustworthy holds episodes where the most desirable prompt is applied at each timestep. Within those folders, logging.json contains the aggregated results over all closed-loop episodes (a quick way to inspect them is shown below).
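
To quickly inspect the aggregated results, something like the following works (assumes the jq utility is installed; plain cat works as well):

jq . <CLOSED_LOOP_PATH>/StandardEnvConfig/<POLICY>/run_<#>/adversarial/logging.json
jq . <CLOSED_LOOP_PATH>/StandardEnvConfig/<POLICY>/run_<#>/trustworthy/logging.json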

To detect anomalous timesteps at test-time offline:

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], Ollama embedding service is running on [T3], and highway-env server is running on [T4] as described above
  2. Start parallel environments to detect anomalous uncertain timesteps by running sample episodes
    • [T5] conda activate ast
    • [T5] cd /ast-llm/toolbox_source
    • [T5] source scripts/setup.sh
    • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <ANOMALY_DIR> --num_eps <EPS> --test
      • Argument explanations:
        • ANOMALY_DIR
          • A path where all offline anomaly detection runs for the driving environment will be stored
      • Example commands:
        • Llama
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Llama3dot2StochasticPolicyConfig --save_dir <ANOMALY_DIR> --num_eps 5 --test
        • Dolphin
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Dolphin3StochasticPolicyConfig --save_dir <ANOMALY_DIR> --num_eps 5 --test
        • Qwen
          • [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardCleanedPromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <ANOMALY_DIR> --num_eps 5 --test
  3. Start recording the entropy of actions from ground-truth sampling in the scenario and of actions predicted from the original characterization trees, across all timesteps, using low-diversity prompts
    • [T5] python examples/language_highway_env_runners/batch_language_highway_env_estimate_uncertainty.py --policy <POLICY> --num_queries 5 --load_dir <TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>/mcts_runs/run_<##> --log_path <ANOMALY_DIR>/StandardEnvConfig/<POLICY>/run_<#>/trust_runs
      • Argument explanations:
        • --num_queries
          • Number of perturbation states to use for uncertainty estimation (top-K least diverse perturbation states) from the single most similar tree to the current scenario

Below, we refer to the folder path <ANOMALY_DIR>/StandardEnvConfig/<POLICY>/run_<#>/trust_runs/run_<##> as TRUST_RUNS. In that run folder, parallel_trees.txt holds a list of the paths to timesteps used for anomaly detection for the LLM.

  4. Compute the GTR, AUC, and FPR anomaly detection stats using the entropy of actions gathered earlier
    • [T5] python examples/language_highway_env_runners/visualization/compute_trust_alert_rate.py --path <TRUST_RUNS>

<ANOMALY_DIR>/StandardEnvConfig/<POLICY>/run_<#>/trust_runs/run_<##>/alert_scores.json will contain the experiment results.

Generating Graphs in Paper

Run /ast-llm/dilu/graph_generator.py to generate the figures found in the case study (an example invocation is sketched below). Refer to the scripts in /ast-llm/toolbox_source/examples/language_highway_env_runners/visualization/Plot/aamas for details on how to generate the AST figures in the paper that compile results across different models.
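
For example, the case-study figures can be generated as follows (we assume the dilu environment from earlier; check the script for any additional arguments):

conda activate dilu
cd /ast-llm/dilu
python graph_generator.py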

Run Robot Crowd Navigation Experiments

Here, we walk through how to characterize LLMs in different scenarios in the crowdnav-env.

Setup Crowdnav Conda Environment

Follow instructions in the CrowdNav++ repo to install required packages into a conda environment. You can refer to our crowdnav Conda environment yml file at /ast-llm/conda_envs/crowdnav.yml to cross-reference versions of required packages.

Setup Crowdnav Workspaces in AnythingLLM

  1. Start AnythingLLM
    • Follow steps 2-5 and 7-8 under Setup AnythingLLM in this README
  2. Create a workspace for holding crowdnav experiences
  3. Create a workspace for the Qwen LLM used for planning

Collect and Upload Memories

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], and Ollama embedding service is running on [T3] as described above
  2. Modify /ast-llm/crowdnav/trained_models/ORCA_no_rand/configs/config.py with desired simulator parameters
  3. Collect trajectories using ORCA
    • [T4] conda activate crowdnav
    • [T4] cd /ast-llm/crowdnav
    • [T4] python collect_traj.py
  4. Use Qwen to explain ORCA experiences to generate memories
    • [T4] python explain_memories.py
  5. Upload memories to AnythingLLM
    • [T4] python upload_memories.py

Train MCTS Trees for Offline Characterization

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], and Ollama embedding service is running on [T3] as described above
  2. Start server for crowdnav-env
    • [T4] conda activate crowdnav
    • [T4] cd /ast-llm/crowdnav
    • [T4] uvicorn parallel_crowdnav_env_api:parallel_app --host 0.0.0.0 --port 8000
  3. Start parallel environments for the LLM to be characterized by running sample episodes
    • [T5] conda activate ast
    • [T5] cd /ast-llm/toolbox_source
    • [T5] source scripts/setup.sh
    • [T5] python examples/language_crowdnav_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <TRAIN_DIR> --num_eps <EPS>
      • Argument explanations:
        • TRAIN_DIR
          • A path where all training runs for the crowdnav environment will be stored
        • EPS
          • Number of episodes to run environment setup for
      • Example command with Qwen:
        • [T5] python examples/language_crowdnav_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardBasePromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
  4. Start characterizing timesteps
    • Example command with Qwen:
      • [T5] python examples/language_crowdnav_env_runners/batch_language_crowdnav_env_parallel_runner.py --policy Qwen3StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_rew_difference --plot_tree --num_trees 20 --log_path <TRAIN_DIR>/StandardEnvConfig/Qwen3StochasticPolicyConfig/run_<#>/mcts_runs

Analyze Characterization Trees

  1. We can start by processing the results of each characterization by performing DFS:
    • cd /ast-llm/toolbox_source
    • conda activate ast
    • source scripts/setup.sh
    • python examples/language_crowdnav_env_runners/visualization/graph_parallel_trees.py --path <MCTS_RUNS>
  2. Generate analysis visualizations per tree
    • python examples/language_crowdnav_env_runners/visualization/analyze_parallel_trees.py --path <MCTS_RUNS>
  3. Save the raw prompts and LLM predictions per tree
    • python examples/language_crowdnav_env_runners/visualization/save_parallel_tree_prompts.py --path <MCTS_RUNS>
These scripts produce the following outputs:

  • Raw predicted vector action distribution: <TREE_CHAR>/stats/velocity_vectors_plot.png
  • Bubble plot of predicted vector action distribution: <TREE_CHAR>/stats/velocity_vectors_bubble_plot.png
  • KDE plot of diversity of sampled actions per perturbation state: <TREE_CHAR>/stats/samples_kde_plot.png
  • Reward distribution for taking each adversarial action: <TREE_CHAR>/stats/ast_action_rew.png
  • A visualization of the tree generated during MCTS: <TREE_CHAR>/tree.svg

Run Lunar Lander Experiments

Here, we walk through how to characterize LLMs in different scenarios in the lunarlander-env.

Setup Lunarlander Conda Environment

You can refer to our lunar Conda environment yml file at /ast-llm/conda_envs/lunar.yml to cross-reference versions of required packages.
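
A minimal way to create the environment from the provided yml, mirroring the ast environment setup above (we assume the environment name inside the yml is lunar, matching the activation commands below):

cd /ast-llm/conda_envs
conda env create -f lunar.yml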

Setup Lunarlander Workspaces in AnythingLLM

  1. Start AnythingLLM
    • Follow steps 2-5 and 7-8 under Setup AnythingLLM in this README
  2. Create a workspace for holding lunarlander experiences
  3. Create a workspace for the Qwen LLM used for planning

Collect and Upload Memories

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], and Ollama embedding service is running on [T3] as described above
  2. Collect trajectories using heuristic policy
    • [T4] conda activate lunar
    • [T4] cd /ast-llm/lunar
    • [T4] python collect_traj.py
  3. Use Qwen to explain heuristic experiences to generate memories
    • [T4] python explain_memories.py
  4. Upload memories to AnythingLLM
    • [T4] python upload_memories.py

Train MCTS Trees for Offline Characterization

  1. Ensure that AnythingLLM is running on [T1], Ollama LLM service is running on [T2], and Ollama embedding service is running on [T3] as described above
  2. Start server for lunarlander-env
    • [T4] conda activate lunar
    • [T4] cd /ast-llm/lunar
    • [T4] uvicorn parallel_lunar_env_api:parallel_app --host 0.0.0.0 --port 8000
  3. Start parallel environments for the LLM to be characterized by running sample episodes
    • [T5] conda activate ast
    • [T5] cd /ast-llm/toolbox_source
    • [T5] source scripts/setup.sh
    • [T5] python examples/language_lunar_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <TRAIN_DIR> --num_eps <EPS>
      • Argument explanations:
        • TRAIN_DIR
          • A path where all training runs for the lunarlander environment will be stored
        • EPS
          • Number of episodes to run environment setup for
      • Example command with Qwen:
        • [T5] python examples/language_lunar_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardBasePromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
  4. Start characterizing timesteps
    • Example command with Qwen:
      • [T5] python examples/language_lunar_env_runners/batch_language_lunar_env_parallel_runner.py --policy Qwen3StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_rew_difference --plot_tree --num_trees 20 --log_path <TRAIN_DIR>/StandardEnvConfig/Qwen3StochasticPolicyConfig/run_<#>/mcts_runs

Analyze Characterization Trees

  1. We can start by processing the results of each characterization by performing DFS:
    • cd /ast-llm/toolbox_source
    • conda activate ast
    • source scripts/setup.sh
    • python examples/language_lunar_env_runners/visualization/graph_parallel_trees.py --path <MCTS_RUNS>
  2. Generate analysis visualizations per tree
    • python examples/language_lunar_env_runners/visualization/analyze_parallel_trees.py --path <MCTS_RUNS>
  3. Save the raw prompts and LLM predictions per tree
    • python examples/language_lunar_env_runners/visualization/save_parallel_tree_prompts.py --path <MCTS_RUNS>
These scripts produce the following outputs:

  • Predicted action distribution: <TREE_CHAR>/stats/llm_pred_dist.png
  • Sample diversity for different perturbation states: <TREE_CHAR>/stats/llm_majority_action.png
  • Reward distribution for taking each adversarial action: <TREE_CHAR>/ast_action_rew.svg
  • A visualization of the tree generated during MCTS: <TREE_CHAR>/tree.svg

About

Repository for stress testing LLMs in planning tasks through different scenarios, prompts, and observations.
