An automated bug report triage system that leverages Large Language Models (LLMs) to classify Mozilla Bugzilla bug reports as valid or invalid, with multi-modal support for text descriptions and images.
This project implements an intelligent bug triage system designed to automatically evaluate and classify bug reports from Mozilla's Bugzilla platform. The system utilizes state-of-the-art LLMs (GPT-4.1, o4-mini, Grok-3, DeepSeek-R1-0528) to assess whether bug reports contain sufficient information to be actionable and reproducible. The classifier supports three evaluation scenarios:
- Description only: Text-based classification
- Description and image: Multi-modal analysis combining text and screenshots
- Image only: Visual-only classification
The system includes comprehensive evaluation metrics using semantic similarity, BERTScore, and cross-encoder models to validate classification accuracy against ground truth data.
```
.
├── add_image_descriptions.py
├── bug_evaluator.py
├── bug_evaluator_main.ipynb
├── bug_evaluator_notebook.ipynb
├── bug_evaluator_test.py
├── csvtojson.py
├── find_ground_truth.py
├── llm_bug_classifier.py
├── preprocess.ipynb
├── retrieve.ipynb
├── validvsinvalidbug.py
├── retry_eval.sh
├── run_all_eval.sh
├── sample_1000.csv
├── sample_1000_preprocessed.csv
├── tree.txt
├── comments/
│   ├── 1004432.csv
│   ├── 1005664.csv
│   └── ... (3000+ comment files)
├── images/
├── jsons/
├── results/
└── venv/
```
| Component | Type | Description |
|---|---|---|
| **Core Classification** | | |
| `llm_bug_classifier.py` | Script | Main classification engine that processes bug reports through various LLM models (GPT-4.1, o4-mini, Grok-3, DeepSeek-R1) across different scenarios (description_only, description_and_image, image_only) |
| `validvsinvalidbug.py` | Script | Original bug classification prototype using Azure OpenAI to evaluate bug validity based on completeness and reproducibility criteria |
| **Data Processing** | | |
| `csvtojson.py` | Script | Converts bug report CSV data into structured JSON format, extracting key fields (Bug_ID, Type, Summary, Product, Component, Status, Resolution, etc.) and merging with comment data |
| `preprocess.ipynb` | Notebook | Data preprocessing pipeline for cleaning and preparing raw Bugzilla data for analysis |
| `retrieve.ipynb` | Notebook | Data retrieval and exploration notebook for querying bug reports, particularly those containing images |
| **Evaluation Framework** | | |
| `bug_evaluator.py` | Module | Evaluation framework implementing multiple similarity metrics: cosine similarity with SentenceTransformers, cross-encoder scoring, BERTScore, and standard classification metrics (accuracy, precision, recall, F1) |
| `bug_evaluator_main.ipynb` | Notebook | Primary evaluation interface for running experiments and analyzing results |
| `bug_evaluator_notebook.ipynb` | Notebook | Alternative evaluation notebook with additional analysis capabilities |
| `bug_evaluator_test.py` | Script | Unit tests for the bug evaluator module |
| **Ground Truth & Augmentation** | | |
| `find_ground_truth.py` | Script | Analyzes bug discussion comments to identify the most authoritative explanation for why a bug was marked as "Invalid" using GPT-5-mini |
| `add_image_descriptions.py` | Script | Enhances bug reports by generating natural-language descriptions of attached images using multi-modal LLM vision capabilities |
| **Execution Scripts** | | |
| `run_all_eval.sh` | Shell | Batch execution script to run all model evaluations across all scenarios |
| `retry_eval.sh` | Shell | Retry mechanism for failed evaluations |
| **Data Directories** | | |
| `comments/` | Folder | Individual CSV files containing discussion threads for each bug report (organized by Bug_ID) |
| `images/` | Folder | Repository of screenshot attachments referenced in bug reports |
| `jsons/` | Folder | Processed bug report data in JSON format, ready for LLM consumption |
| `results/` | Folder | Output directory for classification results and evaluation metrics |
| **Sample Data** | | |
| `sample_1000.csv` | Data | Raw sample dataset containing 1000 bug reports from Bugzilla |
| `sample_1000_preprocessed.csv` | Data | Cleaned and preprocessed version of the sample dataset |
- Language: Python 3.8+
- LLM Integration: Azure OpenAI API (GPT-4.1, o4-mini, Grok-3, DeepSeek-R1-0528)
- NLP & Embeddings:
  - Sentence-Transformers (`all-MiniLM-L6-v2`)
  - BERTScore
  - Cross-Encoder (`stsb-roberta-base`)
- Data Processing: Pandas, NumPy
- Evaluation: scikit-learn
- Notebooks: Jupyter
- Environment: Python venv
Clone the repository:

```bash
git clone https://github.com/yourusername/cs588group4.git
cd cs588group4
```

Create and activate a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

Install the dependencies:

```bash
pip install sentence-transformers scikit-learn numpy pandas tqdm openai bert-score jupyter
```

Set the following environment variables:

```bash
export ENDPOINT_URL="your-azure-endpoint"
export AZURE_OPENAI_API_KEY="your-api-key"
export DEPLOYMENT_NAME="gpt-4.1"  # or your preferred model
```

Alternatively, update the credentials directly in the Python files (not recommended for production).
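The scripts read these variables at runtime. A minimal sketch of how a client could be built from them is shown below; the API version and exact setup are assumptions for illustration, not necessarily what the repository's scripts do:

```python
import os
from openai import AzureOpenAI

# Assumed setup: build the client from the environment variables above.
# The api_version below is an example value, not taken from this repository.
client = AzureOpenAI(
    azure_endpoint=os.environ["ENDPOINT_URL"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview",
)

# Quick smoke test against the configured deployment.
response = client.chat.completions.create(
    model=os.environ.get("DEPLOYMENT_NAME", "gpt-4.1"),
    messages=[{"role": "user", "content": "ping"}],
)
print(response.choices[0].message.content)
```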
Ensure your data files are in place:
- `sample_1000_preprocessed.csv` (bug report data)
- `comments/` folder with individual bug comment CSV files
- `images/` folder with screenshot attachments (if using image scenarios)
Basic classification with description only:
```bash
python llm_bug_classifier.py --model gpt-4.1 --scenario description_only
```

Multi-modal classification with images:

```bash
python llm_bug_classifier.py --model o4-mini --scenario description_and_image
```

Image-only classification:

```bash
python llm_bug_classifier.py --model grok-3 --scenario image_only
```

Available models: `gpt-4.1`, `o4-mini`, `grok-3`, `DeepSeek-R1-0528`

Available scenarios: `description_only`, `description_and_image`, `image_only`
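For the image scenarios, the attached screenshot has to be base64-encoded and sent alongside the text prompt (see Notes below). The sketch below shows roughly what such a request could look like; the prompt wording, response schema, and helper name are illustrative assumptions rather than the actual code in `llm_bug_classifier.py`:

```python
import base64
import json
import os
from typing import Optional

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["ENDPOINT_URL"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-15-preview",  # assumed value
)


def classify_bug(summary: str, description: str, image_path: Optional[str] = None) -> dict:
    """Hypothetical helper: ask the model whether a bug report is actionable and reproducible."""
    content = [{
        "type": "text",
        "text": (
            "Decide whether this bug report is valid (actionable and reproducible) "
            "or invalid. Answer as JSON with keys 'decision' and 'fix'.\n\n"
            f"Summary: {summary}\nDescription: {description}"
        ),
    }]
    if image_path:  # description_and_image / image_only scenarios
        with open(image_path, "rb") as fh:
            b64 = base64.b64encode(fh.read()).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model=os.environ.get("DEPLOYMENT_NAME", "gpt-4.1"),
        messages=[{"role": "user", "content": content}],
    )
    return json.loads(resp.choices[0].message.content)
```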
```bash
python csvtojson.py
```

This processes `sample_1000_preprocessed.csv` and creates individual JSON files in the `jsons/` directory.
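Conceptually, the conversion pairs each CSV row with its comment thread, roughly as in the sketch below. The column names follow the field list in the components table, but the exact CSV headers and output layout are assumptions; the real script may differ:

```python
import json
from pathlib import Path

import pandas as pd

bugs = pd.read_csv("sample_1000_preprocessed.csv")
Path("jsons").mkdir(exist_ok=True)

for _, row in bugs.iterrows():
    bug_id = row["Bug_ID"]
    record = {
        "Bug_ID": int(bug_id),
        "Type": row.get("Type"),
        "Summary": row.get("Summary"),
        "Product": row.get("Product"),
        "Component": row.get("Component"),
        "Status": row.get("Status"),
        "Resolution": row.get("Resolution"),
    }
    # Merge the bug's discussion thread if a comments/<Bug_ID>.csv file exists.
    comment_file = Path("comments") / f"{bug_id}.csv"
    if comment_file.exists():
        record["comments"] = pd.read_csv(comment_file).to_dict(orient="records")
    with open(Path("jsons") / f"{bug_id}.json", "w") as fh:
        json.dump(record, fh, indent=2)
```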
```bash
python add_image_descriptions.py
```

Enhances JSON files with AI-generated descriptions of attached screenshots.
```bash
python find_ground_truth.py
```

Identifies the most authoritative comment explaining why each bug was marked invalid.
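A simplified view of this step: for each invalid bug, the comment thread is handed to the model, which is asked to point at the single most authoritative explanation. The helper name, prompt wording, and output handling below are illustrative assumptions:

```python
def find_authoritative_comment(client, deployment, comments):
    """Hypothetical helper: pick the comment that best explains the 'Invalid' resolution."""
    numbered = "\n".join(f"{i}: {text}" for i, text in enumerate(comments))
    resp = client.chat.completions.create(
        model=deployment,
        messages=[{
            "role": "user",
            "content": (
                "These comments discuss a bug that was resolved as Invalid. "
                "Reply with only the number of the comment that most authoritatively "
                "explains why it was marked Invalid.\n\n" + numbered
            ),
        }],
    )
    return comments[int(resp.choices[0].message.content.strip())]
```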
```bash
bash run_all_eval.sh
```

Executes all model/scenario combinations for comprehensive evaluation.
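In Python terms, the batch run amounts to iterating over every model/scenario pair and invoking the classifier for each, e.g. as below (assuming the CLI shown above; the shell script itself may handle logging and output paths differently):

```python
import itertools
import subprocess

models = ["gpt-4.1", "o4-mini", "grok-3", "DeepSeek-R1-0528"]
scenarios = ["description_only", "description_and_image", "image_only"]

for model, scenario in itertools.product(models, scenarios):
    # check=False so one failed combination does not abort the batch;
    # failed runs can be rerun via retry_eval.sh.
    subprocess.run(
        ["python", "llm_bug_classifier.py", "--model", model, "--scenario", scenario],
        check=False,
    )
```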
Open `bug_evaluator_main.ipynb` in Jupyter:

```bash
jupyter notebook bug_evaluator_main.ipynb
```

Run the evaluation cells to compute:
- Triage accuracy, precision, recall, F1
- Semantic similarity scores
- BERTScore metrics
- Cross-encoder similarity
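For orientation, the text-similarity metrics above can be computed along the following lines with sentence-transformers and bert-score. The model names match the tech stack listed earlier, while the example sentences and aggregation are purely illustrative:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from bert_score import score as bert_score

prediction = "The report lacks steps to reproduce the crash."
reference = "Marked invalid because no reproduction steps were provided."

# Cosine similarity between sentence embeddings (all-MiniLM-L6-v2).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb = embedder.encode([prediction, reference], convert_to_tensor=True)
cosine = util.cos_sim(emb[0], emb[1]).item()

# Cross-encoder relevance score (stsb-roberta-base).
cross_encoder = CrossEncoder("cross-encoder/stsb-roberta-base")
cross = cross_encoder.predict([(prediction, reference)])[0]

# BERTScore F1.
_, _, f1 = bert_score([prediction], [reference], lang="en")

print(f"cosine={cosine:.3f}  cross-encoder={cross:.3f}  BERTScore-F1={f1.item():.3f}")
```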
Classification Result (from `llm_bug_classifier.py`):

```json
{
  "decision": "invalid",
  "fix": "The bug report lacks specific steps to reproduce the issue. While it mentions a crash, it doesn't provide environment details, browser version, or exact user actions leading to the crash."
}
```

Evaluation Metrics (from `bug_evaluator.py`):

```json
{
  "accuracy": 0.87,
  "precision": 0.85,
  "recall": 0.89,
  "f1_score": 0.87,
  "mean_similarity": 0.78
}
```

Run unit tests for the evaluator module:

```bash
python bug_evaluator_test.py
```

- The system requires active Azure OpenAI API credentials to function
- Image processing scenarios require images to be base64-encoded
- Results are automatically saved to the `results/` directory with timestamped filenames
- The `venv/` directory is excluded from version control