- Paper: arXiv:2512.07533
- Code & Data: GitHub
- Demo: Web demo
- Model: 7B Model
- Clone the repository
git clone https://github.com/ucsb-mlsec/VulnLLM-R.git
- Create a new conda environment
conda create -n vulnscan python=3.11
conda activate vulnscan
- Install the required packages
pip install -e . -e ./vulscan/train/LLaMA-Factory -e ./vulscan/model_zoo

# generate VulnLLM-R-7B's results
python -m vulscan.test.test \
--output_dir results/test_data \
--dataset_path ./datasets/test/function_level/ ./datasets/test/repo_level/ \
--language python c java \
--model UCSB-SURFI/VulnLLM-R-7B \
--requests_per_minute 1000 \
--save --use_cot --batch_size 4 --tp 2 --vllm \
--max_tokens 8192 --random_cwe

# or evaluate directly from the Hugging Face test dataset
python -m vulscan.test.test_hf \
--output_dir results/test_hf \
--hf_dataset UCSB-SURFI/VulnLLM-R-Test-Data \
--hf_split repo_level function_level \
--language c python java \
--model UCSB-SURFI/VulnLLM-R-7B \
--save --use_cot --vllm --tp 2
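If you only want to inspect the hosted test data referenced by --hf_dataset above, a minimal sketch using the Hugging Face datasets library is shown below; the split name follows the --hf_split values from the command and may need adjusting, and the column names are whatever the dataset actually defines.

```python
# Quick look at the test data (illustration only; adjust the split name if needed).
from datasets import load_dataset

ds = load_dataset("UCSB-SURFI/VulnLLM-R-Test-Data", split="function_level")
print(ds.column_names)  # which fields (code, label, CWE, ...) are available
print(ds[0])            # peek at a single sample
```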
# [optional] generate other models' results with our shell script
# remember to add your API keys to .env file if you want to run commercial models
# use ./run_test.sh -h for more options
./vulscan/test/run_test.sh -o results/test_data -t 2 # -o means output directory, -t means tensor parallelism
./vulscan/test/run_test.sh -o results/test_data -M o3-mini # -M means model name, which runs only one model
./vulscan/test/run_test.sh -o results/test_data -M gpt-5
# [optional] draw plot to compare with other models
python plots/plot_language_comparison_models.py --results-dir results/test_data
python plots/plot_model_size_scatter.py --results-dir results/test_data # Note: labels may overlap with scatter points; adjust text positions manually if needed.

We also provide the reduced reasoning version of the distilled datasets:
We merge existing function-level vulnerability detection datasets: PrimeVul [1], SecCodePLT [2], Juliet [3], Sven [4], and Arvo [5]. Among these, PrimeVul has the most complicated functions. We create two training sets: clean (without PrimeVul) and noisy (with PrimeVul), so we can train on relatively simple datasets and test on the complex PrimeVul dataset. Note that naming the training set with PrimeVul "noisy" does not mean the dataset itself is noisy; it is simply an arbitrary name we chose at the beginning.
- After downloading all the datasets, vulscan/data_process/data_utils has a set of scripts to process and merge them:
  - raw_to_us.py: Merge the raw data into our dataset and remove redundant data
  - check_cwe_correct.py: Compute the accuracy for each CWE category
  - generate_arvo_raw_data.py: Generate structured raw data from the Arvo dataset
  - arvo_to_us.py: Reformat Arvo structured raw data into our dataset format
  - split_good_bad_for_juliet.py: Extract data from the raw Juliet 1.3 dataset and convert it into the required format, which forms part of our C clean dataset
  - add_sven_to_clean_dataset.py: Extract data from the Sven dataset, forming part of our C clean dataset
  - sync_large_small.py: Synchronize the modifications of noisy_dataset/large_train/c to noisy_dataset/small_train/c
  - remove_testing_from_training.py: Add the human tag to each data point, meaning the point has been verified by a human and used as testing data
  - data_utils.py: Add the related_cwe field to the dataset
- The merged data will be saved in:
  - datasets/clean_dataset: the training data without PrimeVul
    - datasets/clean_dataset/python has the data from SVEN and SecCodePLT
    - datasets/clean_dataset/c has the data from Juliet and SVEN
  - datasets/noisy_dataset
    - datasets/noisy_dataset/small_train: contains the training data from PrimeVul and SVEN with selected CWEs (we use the PrimeVul data in this dataset for training)
    - datasets/noisy_dataset/large_train: contains the training data from PrimeVul, SVEN, and SecCodePLT with more CWEs (this dataset can later be used to train larger models)
    - datasets/noisy_dataset/test: a small testing set from PrimeVul, verified by humans
  - datasets/test
    - datasets/test/test_clean: the testing data from SVEN, SecCodePLT, and Juliet, with OOD CWEs that are not part of the training set
    - datasets/test/test_primevul_pair: the original PrimeVul testing data
- Dataset statistics: run vulscan/data_process/data_utils/get_cwe_stat.py to get the histogram of the dataset (a rough sketch of the same idea appears after the table below).
| Dataset | Language | Train/Test | # CWEs | # Benign | # Vuln. | Avg. length |
|---|---|---|---|---|---|---|
| Clean (SecCodePLT) | Python | Train | 20 | 1281 | 1281 | 741 |
| Clean (Juliet) | C/C++ | Train | 22 | 1716 | 1653 | 3689 |
| Hard (PrimeVul filtered) | C/C++ | Train | 26 | 2717 | 2952 | 4689 |
| Long Context (OSS-Fuzz) | C/C++ | Train | 3 | 475 | 604 | 12761 |
| Simple (SecCodePLT) | Python | Test | 24 (6 OOD) | 74 | 74 | 814 |
| Simple (Juliet) | C/C++ | Test | 38 (14 OOD) | 358 | 376 | 2575 |
| Hard (PrimeVul, SecLLMHolmes) | C/C++ | Test | 13 (5 OOD) | 145 | 152 | 4545 |
| Long Context (OSS-Fuzz) | C/C++ | Test | 3 (0 OOD) | 0 | 320 | 18929 |
| PrimeVul test (noisy) | C/C++ | Test | 56 (34 OOD) | 421 | 422 | 5341 |
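The actual statistics are produced by vulscan/data_process/data_utils/get_cwe_stat.py; the snippet below is only a rough sketch of the same idea and assumes the dataset files are JSON records carrying a cwe field, which may not match the real layout.

```python
# Rough per-CWE histogram over a dataset directory (assumed layout: JSON files
# whose records carry a "cwe" field). Use get_cwe_stat.py for the real numbers.
import json
from collections import Counter
from pathlib import Path

def cwe_histogram(dataset_dir: str) -> Counter:
    counts = Counter()
    for path in Path(dataset_dir).rglob("*.json"):
        data = json.loads(path.read_text())
        records = data if isinstance(data, list) else [data]
        for record in records:
            counts[str(record.get("cwe", "unknown"))] += 1
    return counts

if __name__ == "__main__":
    print(cwe_histogram("datasets/clean_dataset/python").most_common(10))
```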
After constructing the datasets, we will generate reasoning data for our training set.
We query the DeepSeek-R1 and QwQ reasoning models to generate the reasoning data and filter out samples with very long reasoning chains.
The code for generating reasoning data is in vulscan/data_process/generate_reasoning, and the reasoning data will be saved in datasets/reasoning_data.
cd vulscan/data_process/generate_reasoning

generate_reasoning/generate.py is the main script for generating reasoning data.
For each data point, it will generate n reasoning data samples and select the one with the correct answer and shortest
length.
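In other words, the selection is a best-of-n over sampled reasoning traces. The sketch below restates that rule; the Trace type and its fields are invented for illustration and do not mirror the script's internals.

```python
# Best-of-n selection as described above: keep only traces with the correct
# final answer, then take the shortest one. (Illustrative; not generate.py's code.)
from dataclasses import dataclass

@dataclass
class Trace:
    reasoning: str   # the model's reasoning chain
    answer: str      # the final verdict, e.g. "vulnerable" / "benign"

def select_trace(traces: list[Trace], gold_answer: str) -> Trace | None:
    correct = [t for t in traces if t.answer == gold_answer]
    if not correct:
        return None  # drop the data point if no sampled trace is correct
    return min(correct, key=lambda t: len(t.reasoning))
```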
Examples of running it with the QwQ and DeepSeek-r1 models (using the together.AI API, which is slower but more stable
than the official API) are as follows:
python generate.py \
--tp 2 \
--dataset_type clean_dataset \
--batch_size 200 \
--n 8 \
--training_set train \
--model_name Qwen/QwQ-32B
# or
python generate.py \
--dataset_type noisy_dataset \
--batch_size 200 \
--n 8 \
--training_set small_train \
--model_name together-deepseek-reasoner \
--together_deepseek

After generating the raw reasoning data, we can further use another model to summarize it and make the reasoning shorter without breaking the structure:
python extract_reasoning.py

We can further filter the reasoning data by length with generate_reasoning/filter.py; adjust num_processes according to your number of CPU cores.
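Conceptually, the filter keeps a sample only when its lengths stay under the given thresholds (and, optionally, only when the prediction is correct). The sketch below is one plausible reading of the flags; the field names are assumptions, not the script's real schema.

```python
# Illustrative length filter (field names "input", "reasoning", "is_correct" are
# assumptions). Mirrors one reading of --filter_input_length, --filter_all_length,
# and --filter_correct_only; see filter.py for the actual behavior.
def keep(sample: dict,
         filter_input_length: int = 16000,
         filter_all_length: int = 32000,
         filter_correct_only: bool = True) -> bool:
    if filter_correct_only and not sample.get("is_correct", False):
        return False
    if len(sample["input"]) > filter_input_length:
        return False
    return len(sample["input"]) + len(sample["reasoning"]) <= filter_all_length
```

An example invocation of the actual script: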
python filter.py \
--dataset_type noisy_dataset \
--training_set small_train \
--model_name Qwen/QwQ-32B \
--filter_input_length 16000 \
--filter_all_length 32000 \
--num_processes 16 \
--filter_correct_only # filter out samples with wrong predictions
# --model_name together-deepseek-reasoner

Finally, we need to reformat the generated reasoning data for the target model that we will train (Qwen-Instruct):
python reformat_ds.py \
--dataset_type noisy_dataset \
--training_set small_train \
--model_name Qwen/QwQ-32B \
--filter_input_length 16000 \
--filter_all_length 32000 \
--push_to_hub \
--push_to_hub_organization secmlr \
--filter_correct_only

# generate DPO data with a trained SFT model
python generate_dpo.py \
--tp 2 \
--dataset_type clean_dataset \
--batch_size 200 \
--n 8 \
--training_set train \
--model secmlr/VD-QWQ-Clean-8k_qwen2_7B_full_sft_1e-5
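For reference, DPO-style preference data is usually stored as prompt/chosen/rejected triples; the sketch below only shows that generic shape and is not necessarily how generate_dpo.py builds its pairs (the field names are invented).

```python
# Generic DPO preference pair from sampled traces (illustration only).
# Each trace is assumed to carry "text" and "answer" fields.
def make_pair(prompt: str, traces: list[dict], gold_answer: str) -> dict | None:
    correct = [t for t in traces if t["answer"] == gold_answer]
    wrong = [t for t in traces if t["answer"] != gold_answer]
    if not correct or not wrong:
        return None  # need one correct and one incorrect trace to form a pair
    chosen = min(correct, key=lambda t: len(t["text"]))  # prefer concise correct reasoning
    return {"prompt": prompt, "chosen": chosen["text"], "rejected": wrong[0]["text"]}
```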
Refer to vulscan/train/README.md for more details.
Results will be saved in results/test_qwen/results.json.
If you want to reproduce our results, you can run the following command:
./vulscan/test/run_test.sh test.log

# open-source model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python --model Qwen/Qwen2.5-7B-Instruct \
--requests_per_minute 100 --save --use_cot \
--use_policy --batch_size 4 --tp 2 --vllm --max_tokens 16384 \
--random_cwe
# api model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python --model o3-mini-2025-01-31 \
--requests_per_minute 100 --save --use_cot \
--use_policy --batch_size 4 --max_tokens 16384 --random_cwe
# local saved model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python \
--model vulscan/train/result/VD-QWQ-Clean-16k/qwen2_7B_full_sft_1e-5 \
--requests_per_minute 100 --save --use_cot --use_policy \
--batch_size 4 --tp 2 --vllm --max_tokens 16384 --random_cwe
# our model
python -m vulscan.test.test --output_dir results/one_of_4 \
--dataset_path ./datasets/test/test_clean ./datasets/test/test_primevul_pair \
--language c python \
--model secmlr/VD-QWQ-Noisy-Small-8k_qwen2_7B_full_sft_1e-5 --revision aa3235b \
--requests_per_minute 100 --save --use_cot --use_policy \
--batch_size 4 --tp 2 --vllm --max_tokens 16384 \
--random_cwe # whether to randomize the order of cwe and related cwes

After testing, model responses and performance will be saved in the results/test_data directory.
If you want to calculate the performance according to the model responses, you can run the following command:
python -m vulscan.test.test_existing_json \
--json_file results/test_data/datasets_test_test_clean__cot_c_policy_QwQ-32B-Preview.json # the results file

python generate_constitution.py --model gpt-4o --input_dir results/train --output_dir results/train/constitution

@article{nie2025vulnllmrspecializedreasoningllm,
title={VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection},
author={Yuzhou Nie and Hongwei Li and Chengquan Guo and Ruizhe Jiang and Zhun Wang and Bo Li and Dawn Song and Wenbo Guo},
year={2025},
journal={arXiv preprint arXiv:2512.07533},
url={https://arxiv.org/abs/2512.07533},
}