This paper is accepted at LLM4Code Workshop co-located with ICSE 2025 and published in 2025 IEEE/ACM International Workshop on Large Language Models for Code.
Code summarization is a critical task in natural language processing and software engineering, which aims to generate concise descriptions of source code. Recent advancements have improved the quality of these summaries, enhancing code readability and maintainability. However, the context provided by a class or an entire repository has generally not been considered in function-level code summarization. This study investigated the effectiveness of code summarization models beyond the function level, exploring the impact of class and repository contexts on summary quality. The study involved revising benchmarks for evaluating models at the class and repository levels, assessing baseline models, and evaluating LLMs with in-context learning to determine how additional context enhances summary quality. The findings revealed that the fine-tuned state-of-the-art CodeT5+ base model excelled in code summarization, while incorporating few-shot learning and code chunks retrieved via RAG significantly enhanced the performance of LLMs on this task. Notably, the Deepseek Coder 1.3B and Starcoder2 15B models demonstrated substantial improvements in metrics such as BLEURT, METEOR, and BLEU-4 at both the class and repository levels. Repository-level summarization showed promising potential but demands significant computational resources, and it benefits from the inclusion of structured context. Lastly, we employed the recent SIDE code summarization metric in our evaluation. This study contributes to refining strategies for prompt engineering, few-shot learning, and RAG, addressing gaps in benchmarks for code summarization at various levels. Finally, we publish all study details, code, datasets, and evaluation results in the GitHub repository available at https://github.com/kilimanj4r0/code-summarization-beyond-function-level.
Code Summarization Pipeline Schema and Experiments
.
├── 📂 data # Folder with all the data
│ ├── 📂 preprocessed # Folder with preprocessed datasets
│ │ # mce - Modified ClassEval
│ │ # mcsn - Modified CodeSearchNet
│ ├── 📂 repos # Clone repositories used in the study here
│ ├── 📂 vector-stores # Repositories will be indexed here
│ ├── 🧮 results.csv # Aggregated metrics results of all experiments
├── 📂 figures # Folder with figures of the paper
│ ├── 📂 notebooks # Folder containing all the notebooks
│ ├── 📂 data # Folder with notebooks for data preprocessing
│ ├── 📂 experiments # Folder with notebooks for running experiments
│ ├── 📓 evaluation.ipynb # Notebook for running evaluation
│ ├── 📓 results-analysis.ipynb # Notebook with manual analysis of results
├── 📂 scripts # Folder with scripts for reproducing the results
│ ├── 📄 baselines-inference.py # Script for running inference of baseline models
│ ├── 📄 llms-inference.py # Script for running inference of LLMs
│ ├── 📄 evaluation.py # Script for running evaluation
│ ├── ...
│ ├── 💲 pipeline.sh # Shell script for running the whole pipeline of experiments
│ ...
└── 📜 miniconda-env.yml # Configuration of conda environment with required packages
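The aggregated metrics in `results.csv` can be sliced to find the best model per dataset. A minimal sketch, using a hypothetical inline sample since the actual column names in `results.csv` may differ:

```python
import csv
import io

# Hypothetical sample mimicking the shape of data/results.csv;
# the real file's columns and values may differ.
sample = """model,dataset,bleu4
codet5-base,mce,0.21
deepseek-coder-1.3b,mce,0.18
codet5-base,mcsn,0.19
deepseek-coder-1.3b,mcsn,0.23
"""

# Pick the best model per dataset by BLEU-4.
rows = list(csv.DictReader(io.StringIO(sample)))
best = {}
for row in rows:
    ds = row["dataset"]
    if ds not in best or float(row["bleu4"]) > float(best[ds]["bleu4"]):
        best[ds] = row

for ds, row in sorted(best.items()):
    print(ds, row["model"], row["bleu4"])
```

To run against the real file, replace the inline sample with `open("data/results.csv")`.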
We ran our code with Python 3.11.8 and PyTorch 2.2.1 on an NVIDIA A100 GPU with 80 GB of memory.
- Install Miniconda
- Create the conda environment from `miniconda-env.yml`:

  ```shell
  conda env create -f miniconda-env.yml
  ```

- The conda environment `code-summarization-beyond-function-level` will be created; activate it with:

  ```shell
  conda activate code-summarization-beyond-function-level
  ```
Important
Most of the models and evaluation metrics will be automatically downloaded from HuggingFace Hub.
However, the SIDE evaluation metric must be downloaded manually (the fine-tuned hard-negatives model was used).
Finally, for repository level experiments, repositories should be cloned and indexed manually (see create-vector-stores-for-rag.ipynb for more details).
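The indexing and retrieval handled by `create-vector-stores-for-rag.ipynb` can be illustrated with a toy sketch: split repository source into overlapping line windows and retrieve the top-k chunks most relevant to a query function. Token overlap stands in for embedding similarity here, and all names are illustrative, not the notebook's actual API:

```python
def chunk_lines(text, size=8, overlap=2):
    """Split source text into overlapping windows of `size` lines."""
    lines = text.splitlines()
    step = size - overlap
    return ["\n".join(lines[i:i + size])
            for i in range(0, max(len(lines) - overlap, 1), step)]

def tokenize(s):
    """Crude code tokenizer: break on punctuation and whitespace."""
    return set(s.replace("(", " ").replace(")", " ").replace(".", " ").split())

def retrieve(query, chunks, k=2):
    """Rank chunks by token overlap with the query (stand-in for vector search)."""
    scored = sorted(chunks,
                    key=lambda c: len(tokenize(c) & tokenize(query)),
                    reverse=True)
    return scored[:k]

# Toy "repository" with six tiny functions.
repo_text = "\n".join(f"def helper_{i}(x):\n    return x + {i}" for i in range(6))
chunks = chunk_lines(repo_text, size=4, overlap=1)
top = retrieve("def helper_3(x)", chunks, k=1)
```

A real vector store replaces `tokenize`/`retrieve` with an embedding model and nearest-neighbor search, but the chunk-then-rank flow is the same.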
PLMs (baselines)
- SEBIS/code_trans_t5_large_source_code_summarization_python_multitask_finetune (`ct-t5-large-sum`)
- SEBIS/code_trans_t5_large_code_documentation_generation_python_multitask_finetune (`ct-t5-large-doc`)
- Salesforce/codet5-base-multi-sum (`codet5-base`)
- Paul-B98/codet5p_220m_py_sum (`codet5p-base`)
- lintang/pile-t5-large-codexglue (`pile-t5-large`)
LLMs
- deepseek-ai/deepseek-coder-1.3b-instruct (`deepseek-coder-1.3b`)
- deepseek-ai/deepseek-coder-6.7b-instruct (`deepseek-coder-6.7b`)
- deepseek-ai/deepseek-coder-33b-instruct (`deepseek-coder-33b`)
- bigcode/starcoder2-15b-instruct-v0.1 (`starcoder2-15b`)
- gradientai/Llama-3-8B-Instruct-Gradient-1048k (`llama3-8b`)
Modified ClassEval Dataset
400 samples + 10 samples for few-shot learning, each of the format (code, summary, class context), taken from 100 class-level coding tasks

`function-level` `class-level`
Modification of the dataset from ClassEval benchmark
Modified CodeSearchNet Dataset
806 samples + 40 samples for few-shot learning, each of the format (code, summary, repository context), taken from the top-4 GitHub repositories

`function-level` `repo-level`
Modification of the CodeSearchNet dataset from CodeXGLUE benchmark
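Both datasets pair a target function with few-shot examples and context. A simplified sketch of assembling such a prompt from (code, summary) pairs plus a context string — the wording and layout here are illustrative, not the paper's exact template:

```python
def build_prompt(examples, target_code, context="", n_shots=2):
    """Assemble an instruct-style few-shot prompt from (code, summary) pairs.

    `context` carries the class or repository context retrieved for the target.
    """
    parts = ["Summarize the Python function in one sentence."]
    for code, summary in examples[:n_shots]:
        parts.append(f"Code:\n{code}\nSummary: {summary}")
    if context:
        parts.append(f"Relevant context:\n{context}")
    parts.append(f"Code:\n{target_code}\nSummary:")
    return "\n\n".join(parts)

# Toy example pairs standing in for the dataset's few-shot samples.
shots = [
    ("def add(a, b):\n    return a + b", "Adds two numbers."),
    ("def is_even(n):\n    return n % 2 == 0", "Checks whether a number is even."),
]
prompt = build_prompt(shots, "def square(x):\n    return x * x",
                      context="class Math: ...")
```

Setting `n_shots=0` and `context=""` degrades this to the plain zero-shot, function-only configuration.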
| Repository used in the study | Functions Extracted | Branch | Hash |
|---|---|---|---|
| apache/airflow | 455 | main | 70f868e86704ac7810762df97190aa2575fea7d2 |
| streamlink/streamlink | 64 | master | 7adb5b126e84d0400d747f03241a53215bcd939a |
| open-mmlab/mmcv | 50 | main | 27df2b3b9c874129b1ac3d8a51d2ef7cde96f06c |
| Azure/azure-sdk-for-python | 237 | main | e44fa9a324faa992c1d0dccee90f5fb577422633 |
To clone a repository and pin it to a specific commit hash, use:

```shell
git clone <repo_url> -b <branch> --single-branch
cd <repo_name>
git checkout <hash>
```

For example: `git clone https://github.com/apache/airflow.git -b main --single-branch && cd airflow && git checkout 70f868e`. (Note: a `--depth 1` shallow clone only fetches the branch tip, so checking out an older commit would fail.)
Key results of the paper. The table shows the average metrics of selected models for each dataset and configuration. The best results are highlighted in bold.
`repo-level` Few-shot learning with RAG results on Modified CodeSearchNet. The plot shows the average metrics of the models for each number of examples [0, 2, 10] and retrieved code context chunks [0, 12, 25, 50].

`class-level` Class or Skeleton context results on Modified ClassEval. The table shows the average metrics of the LLMs. The best results are highlighted in bold.

`function-level` Few-shot learning results on Modified ClassEval. The plot shows the average metrics of the models for each number of examples from 0 to 10.

`function-level` Results on Modified ClassEval and Modified CodeSearchNet. The table shows the average metrics of the models. The best results are highlighted in bold.
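The tables report BLEU-4 among other metrics. The evaluation scripts rely on standard implementations; purely to illustrate what the metric measures, here is a compact from-scratch sketch of sentence-level BLEU-4 (no smoothing):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length `n` in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with brevity penalty, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, 5):
        c_ngr, r_ngr = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((c_ngr & r_ngr).values())  # clipped n-gram matches
        total = max(sum(c_ngr.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

score = bleu4("returns the sum of two numbers",
              "returns the sum of two numbers")  # identical strings score 1.0
```

Production metrics (e.g. smoothed BLEU in CodeXGLUE-style evaluation) add smoothing so short summaries with no 4-gram match do not collapse to zero.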
If you use this code for your research, please cite our paper:
@article{makharev2025codesummarizationbeyondfunction,
title={Code Summarization Beyond Function Level},
author={Vladimir Makharev and Vladimir Ivanov},
journal={arXiv preprint arXiv:2502.16704},
year={2025},
url={https://arxiv.org/abs/2502.16704},
}