GPIT (GitHub Pull Request & Issue Tools): A toolkit for GitHub Pull Requests and Issues

✨️Introduction

GPIT is a simple, easy-to-use toolkit for collecting, cleaning, and analyzing GitHub Pull Requests (PRs) and Issues.

Warning

Precautions

OS: Ubuntu is recommended.
Network: a stable connection to GitHub is required.
LLM usage: to run the LLM analyzer locally, run the code on Linux, because the models are deployed with vLLM.

🌠Quick Start

Note

Before you start, clone the repository:

git clone https://github.com/NJU-iSE/GPIT.git
cd GPIT

You need a GitHub Personal Access Token (PAT), because GPIT crawls issues through the GitHub GraphQL API.

echo [YOUR_GITHUB_PAT] > config/github_pat.txt  # replace [YOUR_GITHUB_PAT] with your GitHub PAT

Then install the dependencies with pip:

pip install -r requirements.txt

📩Data collection

# collect the GitHub issues from one specific repo
python main.py --repo_path pytorch/pytorch run_collection \
              --query_type issue

The command above collects all issues from the pytorch/pytorch repository.
You can collect issues from other repositories by changing --repo_path.
To collect Pull Requests instead, use --query_type PR.
The results are saved in Results/{repo_name}/all_{query_type}.csv.
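To illustrate what a GraphQL-based crawl involves, here is a minimal sketch that builds a paginated issues query against the GitHub GraphQL API using only the standard library. It is not GPIT's actual implementation: the function names, the selected fields, and the page size are assumptions chosen for illustration.

```python
import json
import urllib.request

GITHUB_GRAPHQL_URL = "https://api.github.com/graphql"

# Hypothetical query: the fields GPIT actually requests may differ.
ISSUES_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      nodes { title url createdAt labels(first: 10) { nodes { name } } }
    }
  }
}
"""


def build_issues_request(repo_path, token, cursor=None):
    """Build (but do not send) a GraphQL request for one page of issues."""
    owner, name = repo_path.split("/", 1)
    payload = {
        "query": ISSUES_QUERY,
        "variables": {"owner": owner, "name": name, "cursor": cursor},
    }
    return urllib.request.Request(
        GITHUB_GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_all_issues(repo_path, token):
    """Follow pageInfo.endCursor until hasNextPage is false (needs network access)."""
    cursor, issues = None, []
    while True:
        request = build_issues_request(repo_path, token, cursor)
        with urllib.request.urlopen(request) as response:
            page = json.load(response)["data"]["repository"]["issues"]
        issues.extend(page["nodes"])
        if not page["pageInfo"]["hasNextPage"]:
            return issues
        cursor = page["pageInfo"]["endCursor"]
```

With the token stored as described in Quick Start, a call could look like `fetch_all_issues("pytorch/pytorch", open("config/github_pat.txt").read().strip())`. A real crawler would also need to handle rate limits and error responses, which this sketch omits.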

🧹Data cleaning

# filter the issues by the given conditions (cleaner)
python main.py --repo_path pytorch/pytorch run_cleaning \
              --query_type issue \
              --years [2020,2021,2022,2023,2024] \
              --tags "high priority"  \
              --save_cols [Title,Tags,Link,Year]

The filtered results are saved in Results/{repo_name}/cleaned_issues.csv.
You can change the filter conditions in the code (sorry, this is a crude workaround).
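The filtering step can be pictured as a row filter over the collected CSV. The sketch below is an assumption about how such a cleaner might work, not GPIT's code: the `clean_issues` function, the `;` tag separator, and the sample rows are all made up, while the column names come from the `--save_cols` example above.

```python
import csv
import io


def clean_issues(rows, years, tags, save_cols):
    """Keep rows whose Year is in `years` and whose Tags include any of `tags`."""
    wanted_years = {str(year) for year in years}
    wanted_tags = set(tags)
    kept = []
    for row in rows:
        # Assumption: multiple tags are stored in one cell, separated by ';'.
        row_tags = {tag.strip() for tag in row["Tags"].split(";")}
        if row["Year"] in wanted_years and row_tags & wanted_tags:
            kept.append({col: row[col] for col in save_cols})
    return kept


# Made-up sample data standing in for Results/{repo_name}/all_issue.csv.
SAMPLE_CSV = """Title,Tags,Link,Year
Crash in conv2d,high priority;module: nn,https://example.com/1,2021
Typo in docs,module: docs,https://example.com/2,2022
Old regression,high priority,https://example.com/3,2018
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
cleaned = clean_issues(
    rows,
    years=[2020, 2021, 2022, 2023, 2024],
    tags=["high priority"],
    save_cols=["Title", "Tags", "Link", "Year"],
)
```

Only the first sample row passes both conditions: the second lacks the tag and the third falls outside the year range.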

📊Data statistics

# count the issues by the given conditions (counter)
python main.py --config config/config.yaml data --processor counter --repo_name pytorch/pytorch
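As a rough idea of what a counter can compute, the following sketch tallies issues by the value of one column. The `count_by` helper and the sample rows are hypothetical; which statistics GPIT's counter actually reports is not documented here.

```python
from collections import Counter


def count_by(rows, column):
    """Count how many rows share each value of `column`."""
    return Counter(row[column] for row in rows)


# Made-up rows in the shape of the cleaned CSV.
sample_rows = [
    {"Title": "Crash in conv2d", "Year": "2021"},
    {"Title": "Slow dataloader", "Year": "2021"},
    {"Title": "Typo in docs", "Year": "2022"},
]

per_year = count_by(sample_rows, "Year")
for year, count in sorted(per_year.items()):
    print(f"{year}: {count}")
```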

🔍️Issue analyzing (stay tuned)

Important

After collecting the issues above, you can analyze them with the LLM-based analyzer module. Currently, the models are deployed with local inference engines.

Warning

Because the analyzer is LLM-based, you need sufficient GPU resources to run it. In our experience, it needs at least 32 GB of GPU memory, because we use Qwen2.5-Coder-7B-Instruct.

python main.py --config config/config.yaml --repo_name pytorch/pytorch analyze

After this step, the results are saved in Results/{repo_name}/analyzer_results.csv.
This lets you use LLMs to analyze the issues in detail.
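The overall shape of an LLM-based analysis pass might look like the sketch below: build one prompt per issue, send it to a model, and write the answers to CSV. Everything here is an assumption for illustration, including the prompt wording, the `analyze_issues` and `write_results` helpers, and the output columns. The model call is passed in as a plain callable so the sketch runs without a GPU; with a local inference engine, that callable would wrap the engine's generation call.

```python
import csv
import io

# Hypothetical prompt; GPIT's real prompts are not shown in this README.
PROMPT_TEMPLATE = (
    "You are triaging GitHub issues.\n"
    "Title: {title}\n"
    "Tags: {tags}\n"
    "Summarize the likely root cause in one sentence."
)


def analyze_issues(rows, generate):
    """Run `generate(prompt) -> str` once per issue and collect the answers."""
    results = []
    for row in rows:
        prompt = PROMPT_TEMPLATE.format(title=row["Title"], tags=row["Tags"])
        results.append({"Title": row["Title"], "Analysis": generate(prompt)})
    return results


def write_results(results, out):
    """Write the analysis rows as CSV to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=["Title", "Analysis"])
    writer.writeheader()
    writer.writerows(results)


def fake_generate(prompt):
    """Stand-in for a real model call, so the example runs anywhere."""
    return f"(model output for a {len(prompt)}-character prompt)"


sample_rows = [{"Title": "Crash in conv2d", "Tags": "high priority"}]
buffer = io.StringIO()
write_results(analyze_issues(sample_rows, fake_generate), buffer)
```

Keeping the model behind a callable also makes it straightforward to swap a local engine for an API service later.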

🛠️TODO List

Support more LLMs (e.g., DeepSeek), especially through API services
Implement batch processing for run_collection
Use logging tools instead of print
Test the system
Add support for collecting PRs in the same way as issues
Refine the config file
Implement basic tools
Use LLMs to analyze the issues
