
Code Summarization Beyond Function Level

This paper was accepted at the LLM4Code Workshop, co-located with ICSE 2025, and published in the 2025 IEEE/ACM International Workshop on Large Language Models for Code.

Code summarization is a critical task in natural language processing and software engineering that aims to generate concise descriptions of source code. Recent advances have improved the quality of these summaries, enhancing code readability and maintainability. However, function-level code summarization has typically ignored the content of the enclosing class or repository. This study investigated the effectiveness of code summarization models beyond the function level, exploring the impact of class and repository context on summary quality. We revised benchmarks for evaluating models at the class and repository levels, assessed baseline models, and evaluated LLMs with in-context learning to determine how much additional context improves summary quality. The findings revealed that the fine-tuned state-of-the-art CodeT5+ base model excelled at code summarization, while incorporating few-shot learning and code chunks retrieved via RAG significantly improved the performance of LLMs on this task. Notably, the Deepseek Coder 1.3B and Starcoder2 15B models showed substantial gains in metrics such as BLEURT, METEOR, and BLEU-4 at both the class and repository levels. Repository-level summarization showed promising potential but requires significant computational resources and benefits from the inclusion of structured context. Lastly, we employed the recent SIDE code summarization metric in our evaluation. This study contributes to refining strategies for prompt engineering, few-shot learning, and RAG, addressing gaps in benchmarks for code summarization at various levels. All study details, code, datasets, and evaluation results are published in the GitHub repository at https://github.com/kilimanj4r0/code-summarization-beyond-function-level.

Code Summarization Pipeline Schema and Experiments

Repository Structure

.
├── 📂 data                          # Folder with all the data
│   ├── 📂 preprocessed              # Folder with preprocessed datasets
│   │                                       # mce - Modified ClassEval
│   │                                       # mcsn - Modified CodeSearchNet
│   ├── 📂 repos                     # Clone repositories used in the study here
│   ├── 📂 vector-stores             # Repositories will be indexed here
│   ├── 🧮 results.csv               # Aggregated metrics results of all experiments
├── 📂 figures                       # Folder with figures of the paper
├── 📂 notebooks                     # Folder containing all the notebooks
│   ├── 📂 data                      # Folder with notebooks for data preprocessing
│   ├── 📂 experiments               # Folder with notebooks for running experiments
│   ├── 📓 evaluation.ipynb          # Notebook for running evaluation
│   ├── 📓 results-analysis.ipynb    # Notebook with manual analysis of results
├── 📂 scripts                       # Folder with scripts for reproducing the results
│   ├── 📄 baselines-inference.py    # Script for running inference of baseline models
│   ├── 📄 llms-inference.py         # Script for running inference of LLMs
│   ├── 📄 evaluation.py             # Script for running evaluation
│   ├── ...
│   ├── 💲 pipeline.sh               # Shell script for running the whole pipeline of experiments
│   ...
└── 📜 miniconda-env.yml             # Configuration of conda environment with required packages

Setup

We ran our code with Python 3.11.8 and PyTorch 2.2.1 on an NVIDIA A100 GPU with 80 GB of memory.

  1. Install Miniconda.
  2. Create the conda environment from miniconda-env.yml:
conda env create -f miniconda-env.yml
  3. Activate the newly created code-summarization-beyond-function-level environment:
conda activate code-summarization-beyond-function-level

Important

Most of the models and evaluation metrics are downloaded automatically from the HuggingFace Hub. However, the SIDE evaluation metric must be downloaded manually (the fine-tuned hard-negatives model was used). Finally, for the repository-level experiments, repositories must be cloned and indexed manually (see create-vector-stores-for-rag.ipynb for details).
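As an illustrative sketch of the retrieval step behind the repository-level RAG experiments (not the actual pipeline, which uses the vector stores built in create-vector-stores-for-rag.ipynb), the following pure-Python helper splits source files into fixed-size chunks and ranks them against a query function by token-overlap cosine similarity:

```python
import math
import re
from collections import Counter

def chunk_source(text: str, chunk_lines: int = 20) -> list[str]:
    """Split a source file into fixed-size line chunks."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + chunk_lines])
            for i in range(0, len(lines), chunk_lines)]

def tokenize(code: str) -> Counter:
    """Bag of identifier-like tokens (2+ characters)."""
    return Counter(re.findall(r"[A-Za-z_]\w+", code.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token bags."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_code: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the top-k chunks most similar to the query function."""
    q = tokenize(query_code)
    ranked = sorted(chunks, key=lambda c: cosine(q, tokenize(c)), reverse=True)
    return ranked[:k]
```

A real vector store would replace the bag-of-tokens similarity with dense embeddings, but the chunk-then-rank flow is the same.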

Key Insights

Models

  • PLMs (baselines)
  • LLMs

Datasets

Modified ClassEval — a modification of the dataset from the ClassEval benchmark: 400 samples + 10 samples for few-shot learning of format (code, summary, class context), taken from 100 class-level coding tasks. Covers:

  • function-level
  • class-level

Modified CodeSearchNet — a modification of the CodeSearchNet dataset from the CodeXGLUE benchmark: 806 samples + 40 samples for few-shot learning of format (code, summary, repository context), taken from top-4 GitHub repositories. Covers:

  • function-level
  • repo-level
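Both datasets store few-shot examples as (code, summary, context) triples. A minimal sketch of how such triples can be assembled into a summarization prompt (the template below is an assumption for illustration; see the notebooks in notebooks/experiments for the prompts actually used):

```python
def build_prompt(examples: list[tuple[str, str, str]],
                 target_code: str, target_context: str = "") -> str:
    """Assemble a few-shot code-summarization prompt from
    (code, summary, context) triples plus the target function."""
    parts = []
    for code, summary, context in examples:
        if context:
            parts.append(f"Context:\n{context}")
        parts.append(f"Code:\n{code}\nSummary: {summary}")
    if target_context:
        parts.append(f"Context:\n{target_context}")
    # The prompt ends at "Summary:" so the model completes the summary.
    parts.append(f"Code:\n{target_code}\nSummary:")
    return "\n\n".join(parts)
```

Varying the number of examples (0, 2, 10) and the amount of retrieved context in this template corresponds to the configurations swept in the experiments.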

| Repository used in the study | Functions Extracted | Branch | Hash |
| --- | --- | --- | --- |
| apache/airflow | 455 | main | 70f868e86704ac7810762df97190aa2575fea7d2 |
| streamlink/streamlink | 64 | master | 7adb5b126e84d0400d747f03241a53215bcd939a |
| open-mmlab/mmcv | 50 | main | 27df2b3b9c874129b1ac3d8a51d2ef7cde96f06c |
| Azure/azure-sdk-for-python | 237 | main | e44fa9a324faa992c1d0dccee90f5fb577422633 |

To clone a repository at a specific commit, use:

git clone <repo_url> -b <branch> --single-branch --depth 1
cd <repo_name>
git checkout <hash>

Note that with --depth 1 only the branch tip is fetched, so git checkout <hash> succeeds only if <hash> is that tip (as it was for the pinned commits above at the time of the study); to check out an older commit, drop --depth 1.

For example: git clone https://github.com/apache/airflow.git -b main --single-branch --depth 1 && cd airflow && git checkout 70f868e.
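To pin all four repositories reproducibly, a small helper can generate the clone-and-checkout command for each entry of the table above (the repository list and hashes are from the table; the destination path data/repos follows the repository structure, and the helper itself is illustrative):

```python
# Repositories, branches, and commit hashes from the table above.
REPOS = [
    ("apache/airflow", "main", "70f868e86704ac7810762df97190aa2575fea7d2"),
    ("streamlink/streamlink", "master", "7adb5b126e84d0400d747f03241a53215bcd939a"),
    ("open-mmlab/mmcv", "main", "27df2b3b9c874129b1ac3d8a51d2ef7cde96f06c"),
    ("Azure/azure-sdk-for-python", "main", "e44fa9a324faa992c1d0dccee90f5fb577422633"),
]

def clone_commands(dest: str = "data/repos") -> list[str]:
    """Emit one shell command per repository that clones the pinned
    branch and checks out the exact commit used in the study."""
    cmds = []
    for full_name, branch, sha in REPOS:
        name = full_name.split("/")[1]
        cmds.append(
            f"git clone https://github.com/{full_name}.git -b {branch} "
            f"--single-branch {dest}/{name} && git -C {dest}/{name} checkout {sha}"
        )
    return cmds
```

Printing the returned commands and piping them to a shell clones all four repositories into data/repos at the studied commits.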

Results

Key results of the paper. The table shows the average metrics of some models for each dataset and configuration. The best results are highlighted in bold.


repo-level Few-shot learning with RAG results on Modified CodeSearchNet. The plot shows the average metrics of the models for each number of examples [0, 2, 10] and retrieved code context chunks [0, 12, 25, 50].


class-level Class or Skeleton context results on Modified ClassEval. The table shows the average metrics of the LLMs. The best results are highlighted in bold.


function-level Few-shot learning results on Modified ClassEval. The plot shows the average metrics of the models for each number of examples from 0 to 10.


function-level Results on Modified ClassEval and Modified CodeSearchNet. The table shows the average metrics of the models. The best results are highlighted in bold.
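For reference, BLEU-4 (one of the metrics reported above) can be sketched in a few lines. This is a simplified single-reference version with add-one smoothing and a brevity penalty, not the exact implementation used in scripts/evaluation.py:

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count the n-grams of a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-4: geometric mean of modified 1- to 4-gram
    precisions (add-one smoothed) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, 5):
        c_ngr, r_ngr = ngrams(cand, n), ngrams(ref, n)
        overlap = sum(min(cnt, r_ngr[g]) for g, cnt in c_ngr.items())
        total = max(sum(c_ngr.values()), 1)
        log_prec += math.log((overlap + 1) / (total + 1)) / 4
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec)
```

BLEURT, METEOR, and SIDE are model- or resource-based and cannot be reduced to a short snippet like this; they are loaded from the HuggingFace Hub (or, for SIDE, downloaded manually) in the evaluation scripts.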

Citation

If you use this code for your research, please cite our paper:

@article{makharev2025codesummarizationbeyondfunction,
  title={Code Summarization Beyond Function Level},
  author={Vladimir Makharev and Vladimir Ivanov},
  journal={arXiv preprint arXiv:2502.16704},
  year={2025},
  url={https://arxiv.org/abs/2502.16704},
}
