
VulnLLM-R: Specialized Reasoning LLM for Vulnerability Detection

[Figure: model size vs. F1 score scatter plot]

Environment and dataset

πŸ› οΈ Create environment

  • Clone the repository
git clone https://github.com/ucsb-mlsec/VulnLLM-R.git
  • Create a new conda environment
conda create -n vulnscan python=3.11
conda activate vulnscan
  • Install the required packages
pip install -e . -e ./vulscan/train/LLaMA-Factory -e ./vulscan/model_zoo
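
A quick sanity check that the editable installs succeeded (the package is importable as vulscan, as the commands below assume):

python -c "import vulscan; print('vulscan installed')"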

For Reproducing Our Results

# generate VulnLLM-R-7B's results
python -m vulscan.test.test \
      --output_dir results/test_data \
      --dataset_path ./datasets/test/function_level/ ./datasets/test/repo_level/ \
      --language python c java \
      --model UCSB-SURFI/VulnLLM-R-7B \
      --requests_per_minute 1000 --save --use_cot \
      --batch_size 4 --tp 2 --vllm \
      --max_tokens 8192 --random_cwe

python -m vulscan.test.test_hf \
      --output_dir results/test_hf \
      --hf_dataset UCSB-SURFI/VulnLLM-R-Test-Data \
      --hf_split repo_level function_level \
      --language c python java \
      --model UCSB-SURFI/VulnLLM-R-7B \
      --save --use_cot --vllm --tp 2

# [optional] generate other models' results with our shell script
# remember to add your API keys to the .env file if you want to run commercial models
# use ./vulscan/test/run_test.sh -h for more options
./vulscan/test/run_test.sh -o results/test_data -t 2 # -o sets the output directory, -t sets the tensor-parallelism degree
./vulscan/test/run_test.sh -o results/test_data -M o3-mini # -M runs only the named model
./vulscan/test/run_test.sh -o results/test_data -M gpt-5

# [optional] draw plot to compare with other models
python plots/plot_language_comparison_models.py --results-dir results/test_data
python plots/plot_model_size_scatter.py --results-dir results/test_data # Note: Labels may overlap with scatter points. Adjust text positions manually if needed.
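
For the commercial models, the API keys are read from a .env file at the repository root. The exact variable names depend on which providers you run; the following is only an illustrative sketch with conventional names:

# .env (illustrative; use the variable names your providers expect)
OPENAI_API_KEY=sk-...
TOGETHER_API_KEY=...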

Existing Distilled Datasets

We also provide reduced-reasoning versions of the distilled datasets.

Technical Details

📚 Construct training and testing datasets

We merge existing function-level vulnerability detection datasets: PrimeVul [1], SecCodePLT [2], Juliet [3], Sven [4], and Arvo [5]. Among these, PrimeVul has the most complicated functions. We create two training sets: clean (without PrimeVul) and noisy (with PrimeVul), so we can train on relatively simple data and test on the complex PrimeVul dataset. Note that naming the PrimeVul-inclusive training set "noisy" does not mean the dataset itself is noisy; it is simply an arbitrary name we adopted early on.

  • After downloading all the datasets, vulscan/data_process/data_utils has a set of scripts to process and merge them:
    • raw_to_us.py: Merge the raw data into our dataset and remove redundant entries (a minimal dedup sketch appears after the statistics table below)
    • check_cwe_correct.py: Compute the accuracy for each CWE category
    • generate_arvo_raw_data.py: Generate structured raw data from the Arvo dataset
    • arvo_to_us.py: Reformat Arvo structured raw data into our dataset format
    • split_good_bad_for_juliet.py: Extract data from the raw Juliet 1.3 dataset and convert it into the required format, which forms part of our C clean dataset
    • add_sven_to_clean_dataset.py: Extract data from the Sven dataset, forming part of our C clean dataset
    • sync_large_small.py: Synchronize the modifications of noisy_dataset/large_train/c to noisy_dataset/small_train/c
    • remove_testing_from_training.py: Add the human tag to each data point, meaning the point has been verified by a human and is used as testing data
    • data_utils.py: Add the related_cwe field to the dataset
  • The merged data will be saved in
    • datasets/clean_dataset: the training data without PrimeVul
      • datasets/clean_dataset/python has the data from SVEN and SecCodePLT
      • datasets/clean_dataset/c has the data from Juliet and SVEN
    • datasets/noisy_dataset
      • datasets/noisy_dataset/small_train: Contains the training data from PrimeVul and SVEN with selected CWEs (we use the PrimeVul data in this dataset for training)
      • datasets/noisy_dataset/large_train: Contains the training data from PrimeVul, SVEN, and SecCodePLT with more CWEs (this dataset can later be used to train larger models)
      • datasets/noisy_dataset/test: A small testing set from PrimeVul verified by humans
    • datasets/test
      • datasets/test/test_clean: The testing data from SVEN, SecCodePLT, and Juliet, with OOD CWEs that are not part of the training set
      • datasets/test/test_primevul_pair: The original PrimeVul testing data
  • Dataset statistics are listed below; run vulscan/data_process/data_utils/get_cwe_stat.py to get the histogram of the dataset:
| Dataset | Language | Train/test | CWE # | # Benign | # Vuln. | Average length |
| --- | --- | --- | --- | --- | --- | --- |
| Clean (SecCodePLT) | Python | Train | 20 | 1281 | 1281 | 741 |
| Clean (Juliet) | C/C++ | Train | 22 | 1716 | 1653 | 3689 |
| Hard (PrimeVul filtered) | C/C++ | Train | 26 | 2717 | 2952 | 4689 |
| Long context (OSS-Fuzz) | C/C++ | Train | 3 | 475 | 604 | 12761 |
| Simple (SecCodePLT) | Python | Test | 24 (6 OOD) | 74 | 74 | 814 |
| Simple (Juliet) | C/C++ | Test | 38 (14 OOD) | 358 | 376 | 2575 |
| Hard (PrimeVul, SecLLMHolmes) | C/C++ | Test | 13 (5 OOD) | 145 | 152 | 4545 |
| Long context (OSS-Fuzz) | C/C++ | Test | 3 (0 OOD) | 0 | 320 | 18929 |
| PrimeVul test (noisy) | C/C++ | Test | 56 (34 OOD) | 421 | 422 | 5341 |
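
As noted in the script list above, raw_to_us.py merges the raw datasets and removes redundant entries. A minimal sketch of that dedup step, with illustrative field names rather than the actual schema:

import hashlib
import json
from pathlib import Path

def normalize(code: str) -> str:
    # Collapse whitespace so trivially reformatted duplicates hash identically.
    return " ".join(code.split())

def merge_and_dedup(raw_dirs: list[str]) -> list[dict]:
    # Merge all raw JSON files and keep the first copy of each unique function.
    seen, merged = set(), []
    for raw_dir in raw_dirs:
        for path in Path(raw_dir).glob("*.json"):
            for record in json.loads(path.read_text()):
                # "code" is a hypothetical field name for the function body.
                key = hashlib.sha256(normalize(record["code"]).encode()).hexdigest()
                if key not in seen:
                    seen.add(key)
                    merged.append(record)
    return merged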

🤔 Generate reasoning data for training

After constructing the datasets, we generate reasoning data for our training set. We query the DeepSeek-R1 and QwQ reasoning models to generate the reasoning data and filter out samples with very long reasoning chains. The code for generating reasoning data is in vulscan/data_process/generate_reasoning, and the reasoning data will be saved in datasets/reasoning_data.

cd vulscan/data_process/generate_reasoning

generate_reasoning/generate.py is the main script for generating reasoning data. For each data point, it generates n reasoning samples and selects the one with the correct answer and the shortest length (see the sketch after the commands below). Examples of running it with the QwQ and DeepSeek-R1 models (using the Together AI API, which is slower but more stable than the official API) are as follows:

python generate.py \
--tp 2 \
--dataset_type clean_dataset \
--batch_size 200 \
--n 8 \
--training_set train \
--model_name Qwen/QwQ-32B

# or
python generate.py \
--dataset_type noisy_dataset \
--batch_size 200 \
--n 8 \
--training_set small_train \
--model_name together-deepseek-reasoner \
--together_deepseek
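
For intuition, the selection step described above (keep the correct answer with the shortest trace) amounts to the following sketch; the sample structure is hypothetical, not generate.py's actual records:

def select_best(samples: list[dict], label: str) -> dict | None:
    # Keep only samples whose final answer matches the ground-truth label.
    correct = [s for s in samples if s["answer"] == label]
    if not correct:
        return None  # all n samples were wrong, so this data point is dropped
    # Among the correct ones, prefer the shortest reasoning trace.
    return min(correct, key=lambda s: len(s["reasoning"]))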

After generating the raw reasoning data, we can use another model to summarize the traces and make them shorter without breaking their structure:

python extract_reasoning.py
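
We have not pinned down extract_reasoning.py's exact summarizer or prompt here; a minimal sketch of the idea using the OpenAI client (the model name and prompt are assumptions) looks like:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_reasoning(reasoning: str, model: str = "gpt-4o") -> str:
    # Compress the trace while preserving its step-by-step structure.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Shorten this reasoning trace. Keep every step and the final answer; remove repetition."},
            {"role": "user", "content": reasoning},
        ],
    )
    return response.choices[0].message.content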

We can further filter the reasoning data by length with generate_reasoning/filter.py; adjust num_processes according to your number of CPU cores.

python filter.py \
--dataset_type noisy_dataset \
--training_set small_train \
--model_name Qwen/QwQ-32B \
--filter_input_length 16000 \
--filter_all_length 32000 \
--num_processes 16 \
--filter_correct_only # drop samples with wrong predictions
# use --model_name together-deepseek-reasoner for DeepSeek-R1 data
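
What filter.py does is, in essence, a parallel length filter. A minimal sketch under assumed field names (the real script may count tokens rather than characters):

from multiprocessing import Pool

def keep(record: dict, max_input: int = 16000, max_all: int = 32000) -> bool:
    # Drop samples whose prompt or whose full trace exceeds the length budget.
    input_len = len(record["input"])
    total_len = input_len + len(record["reasoning"]) + len(record["output"])
    return input_len <= max_input and total_len <= max_all

def filter_dataset(records: list[dict], num_processes: int = 16) -> list[dict]:
    with Pool(num_processes) as pool:
        mask = pool.map(keep, records)
    return [r for r, ok in zip(records, mask) if ok]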

Finally, we will need to reformat the generated reasoning data for the target model that we will train (Qwen-Instruct).

python reformat_ds.py \
--dataset_type noisy_dataset \
--training_set small_train \
--model_name Qwen/QwQ-32B \
--filter_input_length 16000 \
--filter_all_length 32000 \
--push_to_hub \
--push_to_hub_organization secmlr \
--filter_correct_only 

For the DPO dataset:

python generate_dpo.py \
--tp 2 --dataset_type clean_dataset \
--batch_size 200 --n 8 --training_set train \
--model secmlr/VD-QWQ-Clean-8k_qwen2_7B_full_sft_1e-5
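
generate_dpo.py samples responses from the SFT checkpoint above. For a sense of how DPO preference pairs are typically assembled from such samples, here is a sketch (not the script's actual logic; the sample structure is hypothetical):

def build_dpo_pair(prompt: str, samples: list[dict], label: str) -> dict | None:
    # Pair one correct (chosen) and one incorrect (rejected) sample per prompt.
    chosen = [s for s in samples if s["answer"] == label]
    rejected = [s for s in samples if s["answer"] != label]
    if not chosen or not rejected:
        return None  # need both a winner and a loser to form a preference pair
    return {
        "prompt": prompt,
        "chosen": min(chosen, key=lambda s: len(s["text"]))["text"],
        "rejected": rejected[0]["text"],
    }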

🤖 SFT and DPO Training

Refer to vulscan/train/README.md for more details.

Results will be saved in results/test_qwen/results.json.

πŸ” Test the trained models

If you want to reproduce our results, you can run the following command:

./vulscan/test/run_test.sh test.log
# open-source model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python --model Qwen/Qwen2.5-7B-Instruct \
--requests_per_minute 100 --save --use_cot \
--use_policy --batch_size 4 --tp 2 --vllm --max_tokens 16384 \
--random_cwe

# api model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python --model o3-mini-2025-01-14 \
--requests_per_minute 100 --save --use_cot \
--use_policy --batch_size 4 --max_tokens 16384 --random_cwe

# local saved model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python \
--model vulscan/train/result/VD-QWQ-Clean-16k/qwen2_7B_full_sft_1e-5 \
--requests_per_minute 100 --save --use_cot --use_policy \
--batch_size 4 --tp 2 --vllm --max_tokens 16384 --random_cwe

# our model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python \
--model secmlr/VD-QWQ-Noisy-Small-8k_qwen2_7B_full_sft_1e-5 --revision aa3235b \
--requests_per_minute 100 --save --use_cot --use_policy \
--batch_size 4 --tp 2 --vllm --max_tokens 16384 \
--random_cwe # whether to randomize the order of cwe and related cwes

After testing, model responses and performance will be saved in the results/test_data directory. If you want to recompute the performance from the saved model responses, you can run the following command:

python -m vulscan.test.test_existing_json \
--json_file results/test_data/datasets_test_test_clean__cot_c_policy_QwQ-32B-Preview.json # the results file
python generate_constitution.py --model gpt-4o --input_dir results/train --output_dir results/train/constitution
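
If you prefer to inspect a results file directly, recomputing a metric from the saved JSON amounts to something like the sketch below; the schema shown is an assumption, and test_existing_json remains the authoritative path:

import json

def accuracy_from_results(json_file: str) -> float:
    # Assumes the file holds a list of records with predicted and true labels;
    # the actual keys in the saved results may differ.
    with open(json_file) as f:
        results = json.load(f)
    correct = sum(r["prediction"] == r["label"] for r in results)
    return correct / len(results)

print(accuracy_from_results("results/test_data/datasets_test_test_clean__cot_c_policy_QwQ-32B-Preview.json"))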

Citation

@article{nie2025vulnllmrspecializedreasoningllm,
      title={VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection}, 
      author={Yuzhou Nie and Hongwei Li and Chengquan Guo and Ruizhe Jiang and Zhun Wang and Bo Li and Dawn Song and Wenbo Guo},
      year={2025},
      journal={arXiv preprint arXiv:2512.07533},
      url={https://arxiv.org/abs/2512.07533}, 
}
