GPIT (GitHub Pull Request & Issue Tools): A toolkit for GitHub Pull Requests and Issues

✨️Introduction

GPIT is a simple and easy toolkit for collecting, cleaning, and analyzing GitHub Pull Requests (PRs) and Issues.

Warning

Precautions

  • OS: Ubuntu is recommended;
  • Network: a stable connection to GitHub;
  • LLM usage: if you want to run the LLM analyzer locally, make sure you run the code on Linux (we use vLLM to deploy the LLMs).

🌠Quick Start

Note

Before you start, clone the repository:

git clone https://github.com/NJU-iSE/GPIT.git
cd GPIT

You need a GitHub Personal Access Token (PAT) because GPIT uses the GraphQL API to crawl issues.

echo [YOUR_GITHUB_PAT] > config/github_pat.txt  # replace [YOUR_GITHUB_PAT] with your GitHub PAT
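Roughly speaking, the token stored in config/github_pat.txt is sent as a bearer token to GitHub's GraphQL endpoint. The sketch below is illustrative only: the helper name and the query are assumptions, not GPIT's internal code.

```python
# Illustrative sketch: how a PAT is typically used with GitHub's GraphQL API.
# The query below is an example, not GPIT's actual query.
from pathlib import Path

GRAPHQL_URL = "https://api.github.com/graphql"

def build_request(pat_file: str = "config/github_pat.txt"):
    """Build the headers and payload for a GraphQL issue query."""
    token = Path(pat_file).read_text().strip()
    headers = {"Authorization": f"bearer {token}"}
    query = """
    query($owner: String!, $name: String!) {
      repository(owner: $owner, name: $name) {
        issues(first: 100, states: CLOSED) {
          nodes { title url closedAt }
        }
      }
    }
    """
    payload = {"query": query, "variables": {"owner": "pytorch", "name": "pytorch"}}
    return headers, payload

# To actually send it:
# headers, payload = build_request()
# requests.post(GRAPHQL_URL, json=payload, headers=headers)
```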

Then install the dependencies with pip:

pip install -r requirements.txt

📩Data collection

# collect the GitHub issues from one specific repo
python main.py --repo_path pytorch/pytorch run_collection \
              --query_type issue

The command above collects all the issues from the pytorch/pytorch repo.
Of course, you can collect issues from other repositories, and you can collect Pull Requests instead by using --query_type PR.
The results are saved in Results/{repo_name}/all_{query_type}.csv.
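For a quick sanity check of the collected CSV, something like the following works; the path comes from the README, but the column names (e.g. Year) are assumptions based on the --save_cols option shown in the cleaning step.

```python
# Quick sanity check of a collected CSV. Column names are assumed
# (Title/Year as in the --save_cols option), not guaranteed by GPIT.
import pandas as pd

def summarize_collection(csv_path: str) -> dict:
    """Return the row count and the range of issue years in the CSV."""
    df = pd.read_csv(csv_path)
    return {"n_issues": len(df), "years": (df["Year"].min(), df["Year"].max())}

# summarize_collection("Results/pytorch/all_issue.csv")
```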

🧹Data cleaning

# filter the issues by the given conditions (cleaner)
python main.py --repo_path pytorch/pytorch run_cleaning \
              --query_type issue \
              --years [2020,2021,2022,2023,2024] \
              --tags "high priority"  \
              --save_cols [Title,Tags,Link,Year]

The filtered results are saved in Results/{repo_name}/cleaned_issues.csv.
For now, you have to change the filter conditions in the code itself (sorry, this is admittedly a dirty operation).
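Since the filter conditions live in the code, here is a rough pandas equivalent of what such a cleaner does; the column names follow the --save_cols option above, but the tag-matching logic is a guess at the cleaner's behavior, not GPIT's actual implementation.

```python
# Hypothetical pandas equivalent of the cleaning step: keep issues from the
# given years whose Tags column contains the requested tag.
import pandas as pd

def filter_issues(df: pd.DataFrame, years, tag: str) -> pd.DataFrame:
    """Filter issues by year and tag, keeping the columns from --save_cols."""
    mask = df["Year"].isin(years) & df["Tags"].str.contains(tag, regex=False)
    return df.loc[mask, ["Title", "Tags", "Link", "Year"]]
```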

📊Data statistics

# count the issues by the given conditions (counter)
python main.py --config config/config.yaml data --processor counter --repo_name pytorch/pytorch
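A plausible core of such a counter, sketched in pandas (the per-year grouping is an assumption about what the `counter` processor reports, not its documented output):

```python
# Hypothetical sketch of the counting step: tally issues per year.
import pandas as pd

def count_by_year(df: pd.DataFrame) -> dict:
    """Count issues per year, sorted chronologically."""
    return df["Year"].value_counts().sort_index().to_dict()
```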

🔍️Issue analyzing (stay tuned)

Important

After collecting issues as above, you can use the (LLM-based) analyzer module to analyze them. Currently, we use local inference engines to deploy the models.

Warning

Because the analyzer is LLM-based, you may need substantial GPU resources to run it (in our experience, at least 32 GB of GPU memory, since we use Qwen2.5-Coder-7B-Instruct).

python main.py --config config/config.yaml --repo_name pytorch/pytorch analyze

After this step, the results are saved in Results/{repo_name}/analyzer_results.csv.
You can then use LLMs to analyze specific issues in more depth.
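Only the model name (Qwen2.5-Coder-7B-Instruct) and the use of vLLM come from this README; the sketch below shows one way such an analyzer could be wired up, with a made-up prompt template and classification labels. It is not GPIT's actual pipeline.

```python
# Hypothetical sketch of an LLM-based issue analyzer. The prompt template
# and labels are invented for illustration; only the model name and the
# use of vLLM come from the README.

def build_prompt(title: str, body: str) -> str:
    """Assemble a classification prompt for a single issue."""
    return (
        "Classify the following GitHub issue as bug, feature, or question.\n"
        f"Title: {title}\n"
        f"Body: {body}\n"
        "Answer with a single word."
    )

def analyze_issues(issues, model="Qwen/Qwen2.5-Coder-7B-Instruct"):
    """Run prompts through a locally deployed model via vLLM (needs a GPU)."""
    from vllm import LLM, SamplingParams  # imported lazily: requires Linux + GPU
    llm = LLM(model=model)
    params = SamplingParams(temperature=0.0, max_tokens=8)
    prompts = [build_prompt(t, b) for t, b in issues]
    return [out.outputs[0].text.strip() for out in llm.generate(prompts, params)]
```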

🛠️TODO List

  • Support more LLMs (e.g., DeepSeek), especially via API services
  • Implement batch processing for run_collection
  • Use logging tools instead of print
  • Test the system
  • Add support for collecting PRs like issues
  • Refine the config file
  • Implement basic tools
  • Use LLMs to analyze the issues
