Generate single LLM patches

This guide explains the workflow in generate_single_llm_patches.ipynb for producing candidate code fixes (function-level patches) with an LLM, given a bug's traceback and relevant code snippets.

Overview

  • Input: a bug's Python traceback and one or more candidate code chunks (functions) likely related to the failure
  • Model: OpenAI-compatible endpoint (via OpenRouter) with a DeepSeek reasoning model by default
  • Output: extracted function-level patch candidates, including function names and full function bodies, serialized to JSON for later application/evaluation

Prerequisites

  • Python environment with repo requirements installed
  • API key environment variable:
    • OPENROUTER_KEY set in your environment or .env (see the sketch after this list)
  • Upstream artifacts available:
    • tmp/ast/results/bug_results_{project}.json from retrieval (optional, if you use it to select candidate chunks)
    • dataset/{project}/{bug}/code_chunks.json to fetch original function code if needed
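
A minimal sketch of loading the key from a .env file, assuming python-dotenv is installed (the notebook may load it differently):

    import os
    from dotenv import load_dotenv  # python-dotenv

    load_dotenv()  # reads .env from the working directory, if present
    if "OPENROUTER_KEY" not in os.environ:
        raise RuntimeError("OPENROUTER_KEY is not set; add it to your environment or .env")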

Configuration variables (in the notebook)

  • MODEL_NAME: label used in result filenames; default: baseline_all_deepseek-r1-0528
  • SYSTEM_PROMPT: strict instructions for the LLM to output only complete, corrected function definitions, each inside its own ```python fenced block
  • FIRST_TIME: toggles whether to seed from prior results or start fresh
  • PREVIOUS_RESULTS_PATH: path to a previous run to continue context (optional)
  • OpenAI client setup:
    import os
    from openai import OpenAI
    client = OpenAI(api_key=os.environ["OPENROUTER_KEY"], base_url="https://openrouter.ai/api/v1")

Prompting and generation

  • The conversation is assembled as messages (see the sketch after this list):
    • system: SYSTEM_PROMPT
    • user: Traceback: ... (raw or extracted Python traceback)
    • user: Code: ... (joined candidate functions/snippets)
  • Subsequent iterations append another user message, New error: ..., to the same chat if you retry with feedback
  • The API call uses chat completions:
    completion = client.chat.completions.create(
        model="deepseek/deepseek-r1-0528",
        messages=previous_chat,
        extra_body={}
    )
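
As a concrete sketch, the assembly and call for a first attempt might look like the following; this approximates the notebook's generate_code helper (referenced later in this guide) and assumes client and SYSTEM_PROMPT are defined as above:

    # Hedged sketch; the notebook's actual generate_code may differ in details.
    def generate_code(trace_back: str, chunks: list[str]) -> str:
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Traceback: {trace_back}"},
            {"role": "user", "content": "Code: " + "\n\n".join(chunks)},
        ]
        completion = client.chat.completions.create(
            model="deepseek/deepseek-r1-0528",
            messages=messages,
            extra_body={},
        )
        return completion.choices[0].message.content

On a retry, append {"role": "user", "content": f"New error: {new_error}"} to the same message list before calling the model again.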

Extracting function patches from the response

  • The notebook uses regex-based parsing to extract fenced code blocks and function definitions (sketched after this list):
    • Find ```python ... ``` fenced blocks
    • Inside each block, extract every complete def name(...): ... function definition
    • Record function names and sources
  • The parser enforces Python language blocks and ignores non-Python fences
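
A minimal sketch of this parsing step, assuming the helper is named extract_code as referenced later in this guide (the notebook's actual regexes may differ, e.g. for multi-line signatures or decorated functions):

    import re

    BLOCK_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)  # only ```python fences
    FUNC_RE = re.compile(
        r"^def\s+(\w+)\s*\(.*\).*:\n"    # single-line def signature
        r"(?:[ \t]+.*\n?|\n)*",          # indented body lines and blank lines
        re.MULTILINE,
    )

    def extract_code(text: str) -> list[dict]:
        """Return {name, code} for each function found in ```python blocks."""
        functions = []
        for block in BLOCK_RE.findall(text):
            for match in FUNC_RE.finditer(block):
                functions.append({"name": match.group(1), "code": match.group(0)})
        return functions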

Result structure and persistence

  • Collected results are appended to llm_results and written as JSON under tmp/ast/results/llm/single/, with a timestamp in the filename (see the sketch after this list), e.g.:
    • tmp/ast/results/llm/single/{MODEL_NAME}_MM_DD_YYYY__HH_MM_SS.json
  • Each entry typically includes:
    • project, bug_id
    • traceback
    • chunks (the input snippets provided to the LLM)
    • functions (list of {name, code} extracted from the LLM response)
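
A hedged sketch of the persistence step (the helper name save_results is illustrative; the field names and timestamp format follow the conventions above):

    import json
    import os
    from datetime import datetime

    def save_results(llm_results: list[dict], model_name: str) -> str:
        out_dir = "tmp/ast/results/llm/single"
        os.makedirs(out_dir, exist_ok=True)
        stamp = datetime.now().strftime("%m_%d_%Y__%H_%M_%S")
        path = os.path.join(out_dir, f"{model_name}_{stamp}.json")
        with open(path, "w") as fh:
            json.dump(llm_results, fh, indent=2)
        return path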

Practical usage pattern

  1. Select a bug and gather inputs
    • Extract traceback via scripts.bugsinpy_utils.extract_python_tracebacks(project, bug)
    • Select top-K candidate chunks for that bug (e.g., from retrieval results or filtered code_chunks.json)
  2. Call generate_code(trace_back, chunks)
    • Assembles the chat messages, calls the model, and returns the output text
  3. Parse patches
    • Use extract_code(text) to get {name, code} for each function
  4. Save results for later evaluation/application (an end-to-end sketch follows this list)
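
Putting the steps together, a per-bug loop might look roughly like this; extract_python_tracebacks comes from scripts.bugsinpy_utils, generate_code, extract_code, and save_results are the helpers sketched above, and bugs_to_fix and top_k_chunks are placeholders for your own selection logic:

    from scripts.bugsinpy_utils import extract_python_tracebacks

    llm_results = []
    for bug in bugs_to_fix:                    # e.g. ["1", "3", "6"]; chosen by you
        trace_back = extract_python_tracebacks(project, bug)
        chunks = top_k_chunks(project, bug)    # placeholder for retrieval/selection
        response = generate_code(trace_back, chunks)
        llm_results.append({
            "project": project,
            "bug_id": bug,
            "traceback": trace_back,
            "chunks": chunks,
            "functions": extract_code(response),
        })
    save_results(llm_results, MODEL_NAME)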

Tips and constraints

  • Patches must update only the given functions; per SYSTEM_PROMPT, do not modify tests or introduce new imports
  • Use separate ```python fenced blocks for multiple functions
  • Long tracebacks/snippets: keep within model context; consider pruning unrelated context
  • Reproducibility: store MODEL_NAME, prompts, and inputs with outputs

Extending to application/evaluation

  • Apply patches by replacing target function bodies in the buggy commit, then run tests (a replacement sketch follows this list)
  • For locating the right file/path for a function, correlate function names from LLM output with code_chunks.json entries
  • Automate evaluation by reusing test runners in scripts.bugsinpy_utils (run_test, checkout_to_commit, etc.)
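
A hedged sketch of the replacement step, using Python's ast module to find the target function's line span in the buggy file; it assumes top-level (unindented) functions, and locating the right file via code_chunks.json plus the checkout/test calls are left out:

    import ast

    def replace_function(source_path: str, func_name: str, new_code: str) -> None:
        """Swap the definition of func_name in source_path with new_code (a full def)."""
        with open(source_path) as fh:
            lines = fh.readlines()
        tree = ast.parse("".join(lines))
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and node.name == func_name:
                start, end = node.lineno - 1, node.end_lineno  # 1-based -> slice bounds
                lines[start:end] = [new_code if new_code.endswith("\n") else new_code + "\n"]
                break
        with open(source_path, "w") as fh:
            fh.writelines(lines)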

Evaluate LLM patches with BugsInPy Docker

Follow these steps to run the BugsInPy Docker workflow against your generated LLM patches and measure success:

  1. Copy your generated LLM results JSON to the BugsInPy framework location:

    cp <path-to-your-llm-results>.json BugsInPy/framework/results/llm.json
  2. Build the Docker images (run this AFTER copying llm.json):

    (cd BugsInPy && docker compose build)
  3. Run a project with the LLM testing entrypoint (example: youtube-dl). You can substitute any supported project target defined in BugsInPy/docker-compose.yml:

    (cd BugsInPy && docker compose run youtube-dl)
  4. Compute the success rate from the Docker logs or index using the provided utility:

    python utils/count_success.py
  5. Make another pass on new errors only. In the notebook generate_single_llm_patches.ipynb, set the variable passed_bugs to include bug ids (as strings) that have already passed so the script skips them on subsequent runs. Example:

    passed_bugs = {"1", "3", "6", "9"}

Security and costs

  • API costs depend on the configured provider and model; control temperature and token limits as needed
  • Never log secrets; use environment variables and .env