Note: This repository contains the code for our arXiv 2025 paper. For issues, contact Neeloy Chakraborty at neeloyc2@illinois.edu.
Large language models (LLMs) have recently demonstrated success in generalizing across decision-making tasks including planning, control, and prediction, but their tendency to hallucinate unsafe and undesired outputs poses risks. This unwanted behavior is further exacerbated in real-world environments where sensors are noisy or unreliable. Characterizing the responses of LLM planners to varied observations is necessary to proactively avoid failures in safety-critical scenarios. We specifically characterize the response of LLMs along two different perturbation dimensions. Like prior works, one dimension generates semantically similar prompts with varied phrasing by randomizing the order of details, modifying access to few-shot examples, etc. Unique to our work, our second dimension simulates access to varied sensors and noise to mimic raw sensor or detection algorithm failures. An initial case study in which perturbations are manually applied shows that both dimensions lead LLMs to hallucinate in a multi-agent driving environment. Unfortunately, covering the entire perturbation space for several scenarios is infeasible. As such, we propose a novel method for efficiently searching the space of prompt perturbations using adaptive stress testing (AST) with Monte-Carlo tree search (MCTS). Our AST formulation enables discovery of scenarios, sensor configurations, and prompt phrasings that cause language models to act with high uncertainty or even crash. By generating MCTS prompt perturbation trees across diverse scenarios, we show through extensive experiments that offline analyses can be used to proactively understand potential failures that may arise at runtime.
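At its core, our AST formulation treats prompt perturbations as adversarial actions in a tree searched with MCTS. The toy sketch below is purely illustrative; the perturbation actions, the rollout reward stand-in, and all constants are hypothetical and are not the code used in our experiments:

```python
import math
import random

# Hypothetical perturbation actions an adversary could apply to a prompt.
ACTIONS = ["shuffle_agent_order", "drop_few_shot", "add_sensor_noise", "hide_ego_speed"]

def rollout_reward(path):
    """Stand-in for querying the LLM planner with the perturbed prompt and
    scoring its uncertainty (e.g., entropy of its sampled actions)."""
    random.seed(hash(tuple(path)) % (2 ** 32))
    return random.random()

def mcts(n_iter=200, max_depth=3, c=1.4):
    visits, value = {}, {}  # statistics per perturbation path (tree node)
    for _ in range(n_iter):
        path = []
        # Selection/expansion: descend by UCB1 until an unvisited node or max depth.
        while len(path) < max_depth:
            children = [tuple(path + [a]) for a in ACTIONS]
            unvisited = [ch for ch in children if ch not in visits]
            if unvisited:
                path = list(random.choice(unvisited))
                break
            total = sum(visits[ch] for ch in children)
            path = list(max(children, key=lambda ch: value[ch] / visits[ch]
                            + c * math.sqrt(math.log(total) / visits[ch])))
        reward = rollout_reward(path)
        # Backpropagation: update statistics for every prefix of the path.
        for i in range(1, len(path) + 1):
            node = tuple(path[:i])
            visits[node] = visits.get(node, 0) + 1
            value[node] = value.get(node, 0.0) + reward
    # Return the first-level perturbation with the highest mean reward.
    return max((n for n in visits if len(n) == 1), key=lambda n: value[n] / visits[n])
```

In the actual toolbox, the rollout instead queries the LLM through AnythingLLM, and the reward favors high prediction uncertainty or crashes.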
Part of the code is based on the following repositories:
- Adaptive Stress Testing Toolbox
- DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models
- CrowdNav++
Below, certain commands must be run in multiple terminals.
To denote that a command should be run in terminal 1, we write `[T1] <command>`, and so on for other terminals.
- Setup Ollama
- Setup AnythingLLM
- Setup AST Toolbox
- Run Driving Experiments
- Run Robot Crowd Navigation Experiments
- Run Lunar Lander Experiments
We use Ollama as the inference server for running open-source LLMs.
Rather than directly installing Ollama, we use a portable executable version of the platform.
- Download a version of Ollama from their GitHub releases here: https://github.com/ollama/ollama/releases. We used version `0.6.8`.
  wget https://github.com/ollama/ollama/releases/download/v0.6.8/ollama-linux-amd64.tgz
- Create a folder where to extract executable
mkdir -p path/to/ollama_v0.6.8
- Extract executable
tar -xvzf ollama-linux-amd64.tgz -C path/to/ollama_v0.6.8
- Append the following lines to your `~/.bashrc` to update the environment variables for Ollama
  - Update `OLLAMA_HOST` so AnythingLLM can access the models later on
    export OLLAMA_HOST=0.0.0.0:11434
  - Update `OLLAMA_MODELS` to a folder where you want the model weights downloaded
    export OLLAMA_MODELS="/path/to/ollama_models"
We can now download the models used for planning and for vector embedding.
Weights will be downloaded to the path set in the `OLLAMA_MODELS` environment variable.
- Start the Ollama server
  [T1] cd path/to/ollama_v0.6.8/bin
  [T1] ./ollama serve
- Download models (the IDs of the model weights used in experiments are listed in parentheses)
  [T2] cd path/to/ollama_v0.6.8/bin
  - Vector embedding model (790764642607)
    [T2] ./ollama run bge-m3:latest
  - DeepSeek-R1 14B (ea35dfe18182)
    [T2] ./ollama run deepseek-r1:14b
  - Llama 3.2 3B (e410b836fe61)
    [T2] ./ollama run llama3.2:3b-instruct-q8_0
  - Dolphin 3.0 8B (d5ab9ae8e1f2)
    [T2] ./ollama run dolphin3:latest
  - Qwen 3.0 8B (500a1f067a9f)
    [T2] ./ollama run qwen3:8b
You can now close these terminals.
AnythingLLM is a service that we use to enable easy integration with Ollama and any other LLM provider, to allow for quick swapping of LLMs during experiments. It uses a unified API regardless of the LLM used.
We download the Docker version of AnythingLLM and convert it to an Apptainer image. The Docker image could be directly used, but we ran experiments on a cluster where only Apptainer was available.
- Download the image and convert it to Apptainer (we used version `1.7.8`)
  apptainer pull ./anythingllm.sif docker://mintplexlabs/anythingllm:1.7.8
- Create directory to save logs
[T1] export STORAGE_LOCATION=$HOME/anythingllm && mkdir -p $STORAGE_LOCATION && touch "$STORAGE_LOCATION/.env"
- Run container
[T1] apptainer run --bind ${STORAGE_LOCATION}:/app/server/storage --bind ${STORAGE_LOCATION}/.env:/app/server/.env --env STORAGE_DIR="/app/server/storage" --writable-tmpfs anythingllm.sif
The container is now running on the server at http://localhost:3001/.
- If you are running AnythingLLM on a headless server like a cluster, you can port-forward to a computer with a screen
[Local Computer with Screen] ssh -l <remote_user_name> -L 127.0.0.1:3001:<remote_server_ip>:3001 <remote_server_ip>
- Visit http://localhost:3001/
- Follow the user interface to create a default account on the server
- Do NOT enable multiple users on the server
- It does not matter what you select for the default workspace or LLM provider at this time; we will be configuring them later on
We run one instance of Ollama for LLMs and another instance for embedding models.
- Run an Ollama server for parallel LLM queries
  [T2] cd path/to/ollama_v0.6.8/bin
  [T2] OLLAMA_KEEP_ALIVE="1h" OLLAMA_FLASH_ATTENTION="true" OLLAMA_SCHED_SPREAD="true" OLLAMA_NUM_PARALLEL=40 ./ollama serve
  - `OLLAMA_KEEP_ALIVE` keeps the model weights in GPU memory for longer before they are evicted
  - `OLLAMA_FLASH_ATTENTION` enables faster inference
  - `OLLAMA_SCHED_SPREAD` spreads parallel processing across multiple GPUs
  - `OLLAMA_NUM_PARALLEL` sets the maximum number of requests to process in parallel (make this smaller if you find that the weights are split across GPU and CPU and the speed is slow; discussed later)
- Run an Ollama server for embedding queries
  [T3] cd path/to/ollama_v0.6.8/bin
  [T3] OLLAMA_HOST=0.0.0.0:11435 OLLAMA_KEEP_ALIVE="1h" OLLAMA_FLASH_ATTENTION="true" OLLAMA_SCHED_SPREAD="true" ./ollama serve
  - `OLLAMA_HOST` is overridden here from the value in `~/.bashrc` to process embedding queries on a separate port from LLM queries
Now, we can configure AnythingLLM to communicate with Ollama.
- Visit http://localhost:3001/settings/llm-preference
  - Select `Ollama` under the `LLM Provider` drop-down menu
  - Under `Advanced Settings`
    - Set `Ollama Base URL` to `http://127.0.0.1:11434`
      - One of the downloaded Ollama models should automatically become selected under `Ollama Model`
    - Set `Performance Mode` to `Maximum`
    - Set `Ollama Keep Alive` to `1 hour`
    - Set `Max Tokens` to `4096`
- Visit http://localhost:3001/settings/vector-database
  - Select `LanceDB` under the `Vector Database Provider` drop-down menu
- Visit http://localhost:3001/settings/embedding-preference
  - Under `Manual Endpoint Input`
    - Set `Ollama Base URL` to `http://127.0.0.1:11435`
  - Select `bge-m3` under `Ollama Embedding Model`
  - Set `Max Embedding Chunk Length` to `4096`
- Visit http://localhost:3001/settings/text-splitter-preference
  - Set `Text Chunk Size` to `4096`
  - Set `Text Chunk Overlap` to `20`
- Visit http://localhost:3001/settings/api-keys
  - Click `Generate New API Key`
  - Copy the API key and add it to your `~/.bashrc`
    export ANYTHINGLLM_API_KEY="<copied_key>"
We have cloned the AST Toolbox provided by SISL and updated it to work for our experiments. Install dependencies as follows.
cd /ast-llm/toolbox_source
sudo chmod a+x scripts/install_all.sh
sudo scripts/install_all.sh
Now, create a Conda environment for AST experiments.
cd /ast-llm/conda_envs
conda env create -f ast.yml
Here, we walk through how to run our case study, characterize LLMs in different scenarios in the highway-env, and run real-time applications of the characterization trees.
- Start AnythingLLM
  - Follow steps `2-5` and `7-8` under `Setup AnythingLLM` in this README
- Create a workspace for holding driving experiences
  - Visit http://localhost:3001
  - Click on `+ New Workspace`
  - Name the workspace `dilu_memory_agent`
  - Visit http://localhost:3001/workspace/dilu_memory_agent/settings/vector-database
    - Set `Search Preference` to `Default`
    - Set `Document similarity threshold` to `Low`
- Download the Chroma driving memories from DiLu
  - Download https://github.com/PJLab-ADG/DiLu/tree/main/memories/20_mem into `/ast-llm/dilu/memories`
- Create a Conda environment for converting memories from Chroma and uploading them to the AnythingLLM workspace
  [T4] cd /ast-llm/conda_envs
  [T4] conda env create -f chroma-dilu.yml
  [T4] conda activate chroma-dilu
- Upload memories to the DiLu memory workspace
  [T4] cd /ast-llm/dilu
  [T4] python upload_chroma_memories.py
You should now see a value of 21 under `Vector Count` at http://localhost:3001/workspace/dilu_memory_agent/settings/vector-database.
- Create a workspace for each LLM used for planning
  - Visit http://localhost:3001
  - Click on `+ New Workspace`
  - Name the workspace `ollama_deepseek_r1_driver_agent`
  - Visit http://localhost:3001/workspace/ollama_deepseek_r1_driver_agent/settings/chat-settings
    - Select `deepseek-r1` under the `Workspace Chat model` drop-down menu
Repeat step 6 for the following workspaces and LLMs:
| Workspace Name | Ollama LLM |
|---|---|
| `ollama_deepseek_r1_driver_agent` | `deepseek-r1` |
| `ollama_llama_3.2_driver_agent` | `llama3.2` |
| `ollama_dolphin_3_driver_agent` | `dolphin3` |
| `ollama_qwen_3_driver_agent` | `qwen3` |
Because the `ast` Conda environment has old dependencies, we create a separate environment that runs the actual simulators in which the LLMs are tested.
You can refer to our `dilu` Conda environment yml file at `/ast-llm/conda_envs/dilu.yml` to cross-reference versions of required packages.
Our environment is adapted to work for the DiLu repo provided here.
conda create -n dilu python=3.10
conda activate dilu
pip install git+https://github.com/eleurent/highway-env
pip install torch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 --index-url https://download.pytorch.org/whl/cu121
pip install git+https://github.com/eleurent/rl-agents#egg=rl-agents
pip install git+https://github.com/DLR-RM/stable-baselines3
pip install moviepy -U
pip install imageio_ffmpeg
pip install pyvirtualdisplay
sudo apt-get install -y xvfb ffmpeg
pip install PyYAML
pip install rich
If your server is headless, you can skip installing xvfb.
It is only used for rendering frames from the simulator.
To analyze speed, return, crash, and other stats of LLMs in the environment:
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, and the Ollama embedding service is running on `[T3]` as described above
- Start evaluating the LLM in the environment
  [T4] conda activate dilu
  [T4] cd /ast-llm/dilu
  [T4] python run_dilu.py --prompt <PROMPT> --policy <POLICY> --memory <MEMORY> --obs_subset <OBS_SUB> --obs_details <OBS_DET> --obs_noise <OBS_NOISE> --obs_order <OBS_ORDER> --save_dir <EPS_DIR> --num_eps <EPS> --save_traj --render
  - Argument explanations:
    - `PROMPT` - A system prompt configuration class from `/ast-llm/dilu/configs.py`
    - `POLICY` - An LLM policy class from `/ast-llm/dilu/configs.py`
    - `MEMORY` - A memory class from `/ast-llm/dilu/configs.py` (whether few-shot examples are used)
    - `OBS_SUB` - An observation subset class from `/ast-llm/dilu/configs.py` (whether both ego and other agents' observations are provided)
    - `OBS_DET` - An observation detail class from `/ast-llm/dilu/configs.py` (which state details are provided)
    - `OBS_NOISE` - An observation noise class from `/ast-llm/dilu/configs.py` (how much noise to apply to different observation details)
    - `OBS_ORDER` - An observation ordering class from `/ast-llm/dilu/configs.py` (whether to randomize the order of descriptions of different agents)
    - `EPS_DIR` - A path where all test runs for LLMs are stored
    - `EPS` - Number of episodes to run the evaluation for
    - `--save_traj` - Whether to save the stats of each trajectory
    - `--render` - Whether to render the video for each episode
Episodes will be stored under `<EPS_DIR>/StandardEnvConfig/<MEMORY>/<PROMPT>/<POLICY>/<OBS_SUB>/<OBS_DET>/<OBS_ORDER>/<OBS_NOISE>/run_<#>`, where `#` increments every time a new set of episodes is run for the `POLICY`.
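The incrementing `run_<#>` convention can be reproduced with a small helper like the one below. This is a sketch for illustration only; `next_run_dir` is our name, not a function from the repo:

```python
import os

def next_run_dir(base):
    """Return the next run_<#> path under base, following the incrementing
    convention used for episode folders (run_0, run_1, ...)."""
    os.makedirs(base, exist_ok=True)
    existing = [int(d.split("_")[1]) for d in os.listdir(base)
                if d.startswith("run_") and d.split("_")[1].isdigit()]
    return os.path.join(base, f"run_{max(existing) + 1 if existing else 0}")
```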
Manually perturb prompts and evaluate inconsistency of predictions:
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, and the Ollama embedding service is running on `[T3]` as described above
- Start evaluating the inconsistency rate of predictions after perturbing prompts
  [T4] conda activate dilu
  [T4] cd /ast-llm/dilu
  [T4] python offline_manual_perturbation.py --load_dir <EPS_DIR> --prompt <PROMPT> --policy <POLICY> --memory <MEMORY> --obs_subset <OBS_SUB> --obs_details <OBS_DET> --obs_noise <OBS_NOISE> --obs_order <OBS_ORDER> --save_dir <PERTURB_DIR>
  - Argument explanations:
    - `--load_dir` - The `EPS_DIR` to use to initialize trajectories to generate perturbed prompts for
    - `PERTURB_DIR` - A path where all results after manually perturbing prompts are stored
Perturbed generations will be stored at `<PERTURB_DIR>/<MEMORY>/<PROMPT>/<POLICY>/<OBS_SUB>/<OBS_DET>/<OBS_ORDER>/<OBS_NOISE>/run_<#>`, where `#` increments every time a new set of manual perturbations is run for the `POLICY`.
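The inconsistency rate can be thought of as the fraction of predictions under perturbed prompts that disagree with the prediction for the original prompt. A minimal sketch of that idea (the function name and exact definition are our illustration, not necessarily the metric computed by `offline_manual_perturbation.py`):

```python
def inconsistency_rate(original_action, perturbed_actions):
    """Fraction of predictions under perturbed prompts that differ from
    the action chosen for the original, unperturbed prompt."""
    if not perturbed_actions:
        return 0.0
    differing = sum(a != original_action for a in perturbed_actions)
    return differing / len(perturbed_actions)

# Example: the LLM kept its lane originally, but 2 of 4 perturbations flip it.
rate = inconsistency_rate("IDLE", ["IDLE", "LANE_LEFT", "FASTER", "IDLE"])  # rate == 0.5
```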
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, and the Ollama embedding service is running on `[T3]` as described above
- Start the server for highway-env
  [T4] conda activate dilu
  [T4] cd /ast-llm/dilu
  [T4] uvicorn parallel_highway_env_api:parallel_app --host 0.0.0.0 --port 8000
- Start parallel environments for the LLM to be characterized by running sample episodes
  [T5] conda activate ast
  [T5] cd /ast-llm/toolbox_source
  [T5] source scripts/setup.sh
  [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <TRAIN_DIR> --num_eps <EPS>
  - Argument explanations:
    - `TRAIN_DIR` - A path where all training runs for the driving environment are stored
    - `EPS` - Number of episodes to run the environment setup for
  - Example commands:
    - Llama
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Llama3dot2StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
    - Dolphin
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Dolphin3StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
    - Qwen
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardCleanedPromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
Episodes will be stored under `<TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>`, where `#` increments every time a new set of environments is created for the `POLICY`.
Note that the first step of the first episode will take some time (at most ~3 minutes) to load the embedding model and LLM weights onto the GPU.
If `[T5]` is stuck at `Successfully added a new environment` for over 5 minutes, then Ollama most likely split the model weights across the GPU and CPU because all copies of the LLM could not fit into GPU memory alone.
To double-check the memory usage split, you can `cd path/to/ollama_v0.6.8/bin` in another terminal and run `./ollama ps`.
If some percentage of CPU is being used, you can `ctrl-c` terminals `[T2]`-`[T5]` and re-run the commands with a smaller value for `OLLAMA_NUM_PARALLEL`.
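The GPU/CPU split can also be spotted programmatically by scanning the `PROCESSOR` column of the `./ollama ps` output. The sketch below assumes an output format like the hypothetical sample shown; verify the column layout against your Ollama version:

```python
def weights_split_to_cpu(ps_output):
    """Return True if any loaded model reports CPU usage in the
    PROCESSOR column of `ollama ps` output."""
    for line in ps_output.splitlines()[1:]:  # skip the header row
        if "CPU" in line:
            return True
    return False

# Hypothetical `ollama ps` output where the weights are split across CPU and GPU.
sample = """NAME         ID            SIZE    PROCESSOR        UNTIL
llama3.2:3b  e410b836fe61  4.0 GB  25%/75% CPU/GPU  59 minutes from now"""
assert weights_split_to_cpu(sample)
```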
- Start characterizing timesteps (below we list example commands with Llama, but the same can be done with Dolphin and Qwen by swapping the `--policy` class)
  - Example commands:
    - Using the action diversity metric
      [T5] python examples/language_highway_env_runners/batch_language_highway_env_parallel_runner.py --policy Llama3dot2StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_actions --diversity_measure diversity --use_rew_difference --plot_tree --num_trees 20 --log_path <TRAIN_DIR>/StandardEnvConfig/Llama3dot2StochasticPolicyConfig/run_<#>/mcts_runs
    - Using the Shannon entropy metric
      [T5] python examples/language_highway_env_runners/batch_language_highway_env_parallel_runner.py --policy Llama3dot2StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_actions --diversity_measure shannon --use_rew_difference --plot_tree --num_trees 20 --log_path <TRAIN_DIR>/StandardEnvConfig/Llama3dot2StochasticPolicyConfig/run_<#>/mcts_runs
    - Using the environment reward function metric
      - We only run the reward metric-based trees on feasible timesteps, so we first run this command to find those timesteps
        [T5] python examples/language_highway_env_runners/find_critical_timesteps.py
      [T5] python examples/language_highway_env_runners/batch_language_highway_env_parallel_runner.py --policy Llama3dot2StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_rewards --use_rew_difference --plot_tree --num_trees 20 --avoid_infeasible --log_path <TRAIN_DIR>/StandardEnvConfig/Llama3dot2StochasticPolicyConfig/run_<#>/mcts_runs
`LEN` and `ITER` can be updated to search the whole, deep, or shallow tree space:
| Tree Type | LEN | ITER |
|---|---|---|
| Whole | 8 | 504 |
| Deep | 8 | 32 |
| Shallow | 5 | 32 |
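For reference, the `shannon` diversity measure above scores how spread out the LLM's sampled actions are at a perturbation state. A minimal sketch of Shannon entropy over discrete action samples (illustrative; the toolbox's exact normalization may differ):

```python
import math
from collections import Counter

def shannon_entropy(actions):
    """Entropy (in bits) of the empirical action distribution: 0 when the
    LLM always answers the same, higher when its answers are diverse."""
    counts = Counter(actions)
    n = len(actions)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

assert shannon_entropy(["IDLE"] * 8) == 0.0                   # fully consistent
assert shannon_entropy(["IDLE"] * 4 + ["FASTER"] * 4) == 1.0  # uniform over 2 actions
```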
Characterizations will be stored under `<TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>/mcts_runs/run_<##>`, where `##` increments every time a new characterization is started under that `mcts_runs` folder.
In that run folder, parallel_trees.txt holds a list of the paths to timesteps characterized for the LLM.
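Since `parallel_trees.txt` is a plain list of characterization folder paths, downstream analysis can iterate over it directly. A small sketch (the helper name is ours, not from the repo):

```python
import os

def load_tree_paths(mcts_runs_dir):
    """Read one characterization folder path per non-empty line of parallel_trees.txt."""
    with open(os.path.join(mcts_runs_dir, "parallel_trees.txt")) as f:
        return [line.strip() for line in f if line.strip()]
```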
Below, we list our nomenclature for different directories after the offline characterization process has completed.
| Directory Title | Purpose | Example |
|---|---|---|
| `TRAIN_DIR` | Where all driving environment training runs are held. | |
| `MCTS_RUNS` | Where one experiment is stored. `<MCTS_RUNS>/parallel_trees.txt` holds paths to folders for each tree characterization. | `<TRAIN_DIR>/StandardEnvConfig/Qwen3StochasticPolicyConfig/run_0/mcts_runs/run_0` |
| `TREE_CHAR` | Where results for one tree from one experiment are stored (found inside of `parallel_trees.txt`). | `<TRAIN_DIR>/StandardEnvConfig/Qwen3StochasticPolicyConfig/run_0/eps_0_t_0/ast_runs/Qwen3StochasticPolicyConfig/run_0` |
- We can start by processing the results of each characterization by performing DFS:
cd /ast-llm/toolbox_source
conda activate ast
source scripts/setup.sh
python examples/language_highway_env_runners/visualization/graph_parallel_trees.py --path <MCTS_RUNS>
- Generate analysis visualizations per tree
python examples/language_highway_env_runners/visualization/analyze_parallel_trees.py --path <MCTS_RUNS>
- Save the raw prompts and LLM predictions per tree
python examples/language_highway_env_runners/visualization/save_parallel_tree_prompts.py --path <MCTS_RUNS>
| Predicted action distribution | Sample diversity for different perturbation states |
|---|---|
| `<TREE_CHAR>/stats/llm_pred_dist.png` | `<TREE_CHAR>/stats/llm_majority_action.png` |
| Reward distribution for taking each adversarial action | Number of times each adversarial action led to 0 AST reward |
|---|---|
| `<TREE_CHAR>/stats/ast_action_rew.png` | `<TREE_CHAR>/stats/no_change_rew_action.png` |
| A visualization of the tree generated during MCTS |
|---|
| `<TREE_CHAR>/tree.svg` |
To evaluate the influence of generated prompts at test-time offline:
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, the Ollama embedding service is running on `[T3]`, and the highway-env server is running on `[T4]` as described above
- Start parallel environments for the LLM to be influenced by running sample episodes
  [T5] conda activate ast
  [T5] cd /ast-llm/toolbox_source
  [T5] source scripts/setup.sh
  [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <INFLUENCE_DIR> --num_eps <EPS> --test
  - Argument explanations:
    - `INFLUENCE_DIR` - A path where all offline influence runs for the driving environment are stored
    - `--test` - Ensures that environment episodes are initialized with test seeds
  - Example commands:
    - Llama
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Llama3dot2StochasticPolicyConfig --save_dir <INFLUENCE_DIR> --num_eps 5 --test
    - Dolphin
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Dolphin3StochasticPolicyConfig --save_dir <INFLUENCE_DIR> --num_eps 5 --test
    - Qwen
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardCleanedPromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <INFLUENCE_DIR> --num_eps 5 --test
- Start recording the influence of generated prompts on the uncertainty of predictions across all timesteps
  [T5] python examples/language_highway_env_runners/batch_language_highway_env_compare_influential_perturbations.py --policy <POLICY> --num_trees 1 --num_prompts 1 --load_dir <TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>/mcts_runs/run_<##> --log_path <INFLUENCE_DIR>/StandardEnvConfig/<POLICY>/run_<#>/perturbed_runs
  - Argument explanations:
    - `--num_trees` - Number of trees to sample templates from (top-K most similar trees)
    - `--num_prompts` - Number of prompt templates to select that are most and least desirable from each tree
    - `--load_dir` - Path to the characterization folder from training that prompt templates should originate from
Below, we refer to the folder path <INFLUENCE_DIR>/StandardEnvConfig/<POLICY>/run_<#>/perturbed_runs/run_<##> as PERTURBED_RUNS.
In that run folder, parallel_trees.txt holds a list of the paths to timesteps influenced for the LLM.
- Visualize difference in Shannon entropy distribution between samples from desirable and undesirable templates
[T5] python examples/language_highway_env_runners/visualization/graph_compare_influential_perturbations.py --path <PERTURBED_RUNS>
| Shannon entropy distribution comparison after applying influential prompts |
|---|
| `<PERTURBED_RUNS>/diversity_distribution.png` |
To influence models in a closed-loop setting:
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, the Ollama embedding service is running on `[T3]`, and the highway-env server is running on `[T4]` as described above
- Start influencing an LLM in a closed-loop simulation
  [T5] conda activate ast
  [T5] cd /ast-llm/toolbox_source
  [T5] source scripts/setup.sh
  [T5] python examples/language_highway_env_runners/language_highway_env_closed_loop_adversary.py --policy <POLICY> --load_dir <TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>/mcts_runs/run_<##> --save_dir <CLOSED_LOOP_PATH>
The script will create an experimental run folder at `<CLOSED_LOOP_PATH>/StandardEnvConfig/<POLICY>/run_<#>`.
In this folder, `adversarial` holds episodes from applying the most undesirable prompt from the closest tree at each timestep, and `trustworthy` holds episodes from applying the most desirable prompt at each timestep.
Within those folders, `logging.json` contains the aggregated results over all closed-loop episodes.
To detect anomalous timesteps at test-time offline:
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, the Ollama embedding service is running on `[T3]`, and the highway-env server is running on `[T4]` as described above
- Start parallel environments to detect anomalous uncertain timesteps by running sample episodes
  [T5] conda activate ast
  [T5] cd /ast-llm/toolbox_source
  [T5] source scripts/setup.sh
  [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <ANOMALY_DIR> --num_eps <EPS> --test
  - Argument explanations:
    - `ANOMALY_DIR` - A path where all offline anomaly detection runs for the driving environment are stored
  - Example commands:
    - Llama
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Llama3dot2StochasticPolicyConfig --save_dir <ANOMALY_DIR> --num_eps 5 --test
    - Dolphin
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt InformalRewardCleanedPromptConfig --policy Dolphin3StochasticPolicyConfig --save_dir <ANOMALY_DIR> --num_eps 5 --test
    - Qwen
      [T5] python examples/language_highway_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardCleanedPromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <ANOMALY_DIR> --num_eps 5 --test
- Start recording the entropy of actions from ground-truth sampling in the scenario and of predicted actions from the original characterization trees across all timesteps using low-diversity prompts
  [T5] python examples/language_highway_env_runners/batch_language_highway_env_estimate_uncertainty.py --policy <POLICY> --num_queries 5 --load_dir <TRAIN_DIR>/StandardEnvConfig/<POLICY>/run_<#>/mcts_runs/run_<##> --log_path <ANOMALY_DIR>/StandardEnvConfig/<POLICY>/run_<#>/trust_runs
  - Argument explanations:
    - `--num_queries` - Number of perturbation states to use for uncertainty estimation (top-K least diverse perturbation states) from the single most similar tree to the current scenario
Below, we refer to the folder path <ANOMALY_DIR>/StandardEnvConfig/<POLICY>/run_<#>/trust_runs/run_<##> as TRUST_RUNS.
In that run folder, parallel_trees.txt holds a list of the paths to timesteps used for anomaly detection for the LLM.
- Compute the GTR, AUC, and FPR anomaly detection stats using the entropy of actions gathered earlier
[T5] python examples/language_highway_env_runners/visualization/compute_trust_alert_rate.py --path <TRUST_RUNS>
<ANOMALY_DIR>/StandardEnvConfig/<POLICY>/run_<#>/trust_runs/run_<##>/alert_scores.json will contain the experiment results.
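For intuition, the AUC reported in `alert_scores.json` measures how well the entropy scores separate anomalous timesteps from nominal ones. An illustrative rank-based sketch (the score lists are hypothetical and the script's exact computation may differ):

```python
def auc_from_scores(nominal, anomalous):
    """Probability that a randomly chosen anomalous timestep scores higher
    (has more action entropy) than a randomly chosen nominal one."""
    pairs = [(a > n) + 0.5 * (a == n) for a in anomalous for n in nominal]
    return sum(pairs) / len(pairs)

# Perfect separation: every anomalous entropy exceeds every nominal one.
assert auc_from_scores([0.1, 0.2, 0.3], [0.8, 0.9]) == 1.0
```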
Run `/ast-llm/dilu/graph_generator.py` to generate the figures found in the case study.
Refer to the scripts in `/ast-llm/toolbox_source/examples/language_highway_env_runners/visualization/Plot/aamas` for details on how to generate the AST figures in the paper that compile results across different models.
Here, we walk through how to characterize LLMs in different scenarios in the crowdnav-env.
Follow instructions in the CrowdNav++ repo to install required packages into a conda environment.
You can refer to our crowdnav Conda environment yml file at /ast-llm/conda_envs/crowdnav.yml to cross-reference versions of required packages.
- Start AnythingLLM
  - Follow steps `2-5` and `7-8` under `Setup AnythingLLM` in this README
- Create a workspace for holding crowdnav experiences
  - Visit http://localhost:3001
  - Click on `+ New Workspace`
  - Name the workspace `crowdnav_memory_agent`
  - Visit http://localhost:3001/workspace/crowdnav_memory_agent/settings/vector-database
    - Set `Search Preference` to `Default`
    - Set `Document similarity threshold` to `Low`
- Create a workspace for the Qwen LLM used for planning
  - Visit http://localhost:3001
  - Click on `+ New Workspace`
  - Name the workspace `ollama_qwen_3_crowdnav_agent`
  - Visit http://localhost:3001/workspace/ollama_qwen_3_crowdnav_agent/settings/chat-settings
    - Select `qwen3` under the `Workspace Chat model` drop-down menu
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, and the Ollama embedding service is running on `[T3]` as described above
- Modify `/ast-llm/crowdnav/trained_models/ORCA_no_rand/configs/config.py` with the desired simulator parameters
- Collect trajectories using ORCA
  [T4] conda activate crowdnav
  [T4] cd /ast-llm/crowdnav
  [T4] python collect_traj.py
- Use Qwen to explain the ORCA experiences to generate memories
  [T4] python explain_memories.py
- Upload memories to AnythingLLM
  [T4] python upload_memories.py
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, and the Ollama embedding service is running on `[T3]` as described above
- Start the server for crowdnav-env
  [T4] conda activate crowdnav
  [T4] cd /ast-llm/crowdnav
  [T4] uvicorn parallel_crowdnav_env_api:parallel_app --host 0.0.0.0 --port 8000
- Start parallel environments for the LLM to be characterized by running sample episodes
  [T5] conda activate ast
  [T5] cd /ast-llm/toolbox_source
  [T5] source scripts/setup.sh
  [T5] python examples/language_crowdnav_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <TRAIN_DIR> --num_eps <EPS>
  - Argument explanations:
    - `TRAIN_DIR` - A path where all training runs for the crowdnav environment are stored
    - `EPS` - Number of episodes to run the environment setup for
  - Example command with Qwen:
    [T5] python examples/language_crowdnav_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardBasePromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
- Start characterizing timesteps
  - Example command with Qwen:
    [T5] python examples/language_crowdnav_env_runners/batch_language_crowdnav_env_parallel_runner.py --policy Qwen3StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_rew_difference --plot_tree --num_trees 20 --log_path <TRAIN_DIR>/StandardEnvConfig/Qwen3StochasticPolicyConfig/run_<#>/mcts_runs
- We can start by processing the results of each characterization by performing DFS:
cd /ast-llm/toolbox_source
conda activate ast
source scripts/setup.sh
python examples/language_crowdnav_env_runners/visualization/graph_parallel_trees.py --path <MCTS_RUNS>
- Generate analysis visualizations per tree
python examples/language_crowdnav_env_runners/visualization/analyze_parallel_trees.py --path <MCTS_RUNS>
- Save the raw prompts and LLM predictions per tree
python examples/language_crowdnav_env_runners/visualization/save_parallel_tree_prompts.py --path <MCTS_RUNS>
| Raw predicted vector action distribution | Bubble plot of predicted vector action distribution |
|---|---|
| `<TREE_CHAR>/stats/velocity_vectors_plot.png` | `<TREE_CHAR>/stats/velocity_vectors_bubble_plot.png` |
| KDE plot of diversity of sampled actions per perturbation state | Reward distribution for taking each adversarial action |
|---|---|
| `<TREE_CHAR>/stats/samples_kde_plot.png` | `<TREE_CHAR>/stats/ast_action_rew.png` |
| A visualization of the tree generated during MCTS |
|---|
| `<TREE_CHAR>/tree.svg` |
Here, we walk through how to characterize LLMs in different scenarios in the lunarlander-env.
You can refer to our lunar Conda environment yml file at /ast-llm/conda_envs/lunar.yml to cross-reference versions of required packages.
- Start AnythingLLM
  - Follow steps `2-5` and `7-8` under `Setup AnythingLLM` in this README
- Create a workspace for holding lunarlander experiences
  - Visit http://localhost:3001
  - Click on `+ New Workspace`
  - Name the workspace `lunar_lander_memory_agent`
  - Visit http://localhost:3001/workspace/lunar_lander_memory_agent/settings/vector-database
    - Set `Search Preference` to `Default`
    - Set `Document similarity threshold` to `Low`
- Create a workspace for the Qwen LLM used for planning
  - Visit http://localhost:3001
  - Click on `+ New Workspace`
  - Name the workspace `ollama_qwen_3_lunar_lander_agent`
  - Visit http://localhost:3001/workspace/ollama_qwen_3_lunar_lander_agent/settings/chat-settings
    - Select `qwen3` under the `Workspace Chat model` drop-down menu
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, and the Ollama embedding service is running on `[T3]` as described above
- Collect trajectories using the heuristic policy
  [T4] conda activate lunar
  [T4] cd /ast-llm/lunar
  [T4] python collect_traj.py
- Use Qwen to explain the heuristic experiences to generate memories
  [T4] python explain_memories.py
- Upload memories to AnythingLLM
  [T4] python upload_memories.py
- Ensure that AnythingLLM is running on `[T1]`, the Ollama LLM service is running on `[T2]`, and the Ollama embedding service is running on `[T3]` as described above
- Start the server for lunarlander-env
  [T4] conda activate lunar
  [T4] cd /ast-llm/lunar
  [T4] uvicorn parallel_lunar_env_api:parallel_app --host 0.0.0.0 --port 8000
- Start parallel environments for the LLM to be characterized by running sample episodes
  [T5] conda activate ast
  [T5] cd /ast-llm/toolbox_source
  [T5] source scripts/setup.sh
  [T5] python examples/language_lunar_env_runners/startup_parallel_envs.py --prompt <PROMPT> --policy <POLICY> --save_dir <TRAIN_DIR> --num_eps <EPS>
  - Argument explanations:
    - `TRAIN_DIR` - A path where all training runs for the lunarlander environment are stored
    - `EPS` - Number of episodes to run the environment setup for
  - Example command with Qwen:
    [T5] python examples/language_lunar_env_runners/startup_parallel_envs.py --prompt NoThinkInformalRewardBasePromptConfig --policy Qwen3StochasticPolicyConfig --save_dir <TRAIN_DIR> --num_eps 10
- Start characterizing timesteps
  - Example command with Qwen:
    [T5] python examples/language_lunar_env_runners/batch_language_lunar_env_parallel_runner.py --policy Qwen3StochasticPolicyConfig --path_len <LEN> --n_iter <ITER> --use_rew_difference --plot_tree --num_trees 20 --log_path <TRAIN_DIR>/StandardEnvConfig/Qwen3StochasticPolicyConfig/run_<#>/mcts_runs
- We can start by processing the results of each characterization by performing DFS:
cd /ast-llm/toolbox_source
conda activate ast
source scripts/setup.sh
python examples/language_lunar_env_runners/visualization/graph_parallel_trees.py --path <MCTS_RUNS>
- Generate analysis visualizations per tree
python examples/language_lunar_env_runners/visualization/analyze_parallel_trees.py --path <MCTS_RUNS>
- Save the raw prompts and LLM predictions per tree
python examples/language_lunar_env_runners/visualization/save_parallel_tree_prompts.py --path <MCTS_RUNS>
| Predicted action distribution | Sample diversity for different perturbation states |
|---|---|
| `<TREE_CHAR>/stats/llm_pred_dist.png` | `<TREE_CHAR>/stats/llm_majority_action.png` |
| Reward distribution for taking each adversarial action |
|---|
| `<TREE_CHAR>/ast_action_rew.svg` |
| A visualization of the tree generated during MCTS |
|---|
| `<TREE_CHAR>/tree.svg` |















