This repository contains a Python implementation of PailGen. As described in our paper, PailGen is a novel automatic vulnerability patch generation approach that integrates retrieval-augmented fix pattern mining with in-context learning.
You can set up the environment by running the following commands:
conda create -n PailGen python=3.9.7
conda activate PailGen
pip install transformers
pip install torch
pip install numpy
pip install tqdm
pip install pandas
pip install tokenizers
pip install datasets
pip install gdown
pip install tensorboard
pip install scikit-learn
pip install tree-sitter
pip install tree-sitter-c
pip install codebleu
Alternatively, we provide a requirements.txt with pinned package versions to ensure reproducibility. You can install from it with the following command:
pip install -r requirements.txt
Next, preprocess the dataset:
python preprocess_data.py
After preprocessing the dataset, you obtain two .csv files: train.csv and test.csv.
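To confirm that preprocessing produced usable files, a quick sanity check along the following lines may help. This is a minimal sketch, not part of the repository: `summarize_split` and `report_splits` are hypothetical helper names, and the column names are read from the files rather than assumed, since this README does not list them.

```python
import csv
from pathlib import Path

# Source code fields can be long, so raise the csv module's per-field limit.
csv.field_size_limit(10**9)


def summarize_split(csv_path):
    """Return (row_count, column_names) for one CSV split."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        row_count = sum(1 for _ in reader)  # data rows, header excluded
        columns = list(reader.fieldnames or [])
    return row_count, columns


def report_splits(directory="."):
    """Summarize train.csv and test.csv if they exist in `directory`."""
    report = {}
    for name in ("train.csv", "test.csv"):
        path = Path(directory) / name
        if path.exists():
            report[name] = summarize_split(path)
    return report
```

Call `report_splits(...)` with the directory where preprocess_data.py wrote its output. A missing key in the result means that file was not found there.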
cd fix_patterns
python generate_patterns.py
The above command generates fix patterns from the retrieved relevant vulnerability-fix cases. The file retrieved_results_bigvul_cvefixes_top50.json contains the retrieved results of our hybrid retriever. In this file, each vulnerable code sample includes the top 50 most relevant vulnerability-fix pairs. We follow DPR to train and test our hybrid retriever.
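Before generating patterns, it can be useful to check that every sample in the retrieval file really carries 50 candidates. The sketch below is illustrative only. It ASSUMES the JSON is an object mapping each vulnerable sample's identifier to a list of retrieved vulnerability-fix pairs, which this README does not confirm. Adjust the accessors if the actual layout differs. `load_retrieved` and `candidate_count_histogram` are hypothetical helper names.

```python
import json
from collections import Counter


def load_retrieved(path):
    """Load the retrieval results from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def candidate_count_histogram(retrieved):
    """Count how many samples have each number of retrieved candidates.

    ASSUMPTION: `retrieved` is a dict of sample id -> list of candidates.
    """
    return Counter(len(candidates) for candidates in retrieved.values())
```

Under that assumption, `candidate_count_histogram(load_retrieved("retrieved_results_bigvul_cvefixes_top50.json"))` should contain the single key 50 if every sample has its full top-50 list.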
cd ..
python process_prompt_data.py
Execute the above command to obtain all components of the LLM's prompt.
python llm_api_call_augment.py
The above command will generate candidate repair patches.
python calculate_combined_metrics.py
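The steps above can also be chained from a single script. The following is a minimal sketch rather than something shipped with the repository: `PIPELINE_STEPS` and `run_pipeline` are hypothetical names, the script names and the fix_patterns directory are taken from this README, and any arguments or credentials the individual scripts need are not covered.

```python
import subprocess
import sys

# (script, working directory) pairs in the order given in this README.
PIPELINE_STEPS = [
    ("preprocess_data.py", "."),
    ("generate_patterns.py", "fix_patterns"),
    ("process_prompt_data.py", "."),
    ("llm_api_call_augment.py", "."),
    ("calculate_combined_metrics.py", "."),
]


def run_pipeline(steps=PIPELINE_STEPS, runner=subprocess.run):
    """Run each step in order and stop at the first failure."""
    for script, workdir in steps:
        # check=True raises CalledProcessError on a non-zero exit code.
        runner([sys.executable, script], cwd=workdir, check=True)
```

Run `run_pipeline()` from the repository root inside the PailGen environment. The `runner` parameter exists only so the sequence can be exercised without executing the scripts.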
- Special thanks to the authors of VulMaster (Zhou et al.).
- Special thanks to the authors of TypeFix (Peng et al.).
- Special thanks to the dataset providers of CVEFixes (Bhandari et al.), Big-Vul (Fan et al.), and D2A (Zheng et al.).