This guide explains the workflow in `generate_single_llm_patches.ipynb` for producing candidate code fixes (function-level patches) from an LLM, given a bug's traceback and relevant code snippets.
- Input: a bug's Python traceback and one or more candidate code chunks (functions) likely related to the failure
- Model: OpenAI-compatible endpoint (via OpenRouter) with a DeepSeek reasoning model by default
- Output: extracted function-level patch candidates, including function names and full function bodies, serialized to JSON for later application/evaluation
- Python environment with repo requirements installed
- API key environment variable: `OPENROUTER_KEY` set in your environment or in `.env`
- Upstream artifacts available:
  - `tmp/ast/results/bug_results_{project}.json` from retrieval (optional, if you use those to select candidate chunks)
  - `dataset/{project}/{bug}/code_chunks.json` to fetch original function code if needed
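Loading these artifacts is a plain JSON read. A minimal sketch, assuming the path layout above (the helper name `load_chunks` is illustrative, not from the notebook):

```python
import json
from pathlib import Path

def load_chunks(project: str, bug: str, root: str = "dataset"):
    """Load the candidate code chunks for one bug (path layout assumed)."""
    path = Path(root) / project / bug / "code_chunks.json"
    with open(path) as f:
        return json.load(f)
```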
- `MODEL_NAME`: label used in result filenames; default: `baseline_all_deepseek-r1-0528`
- `SYSTEM_PROMPT`: strict instructions for the LLM to output only complete, corrected function definitions, each inside its own `python`-fenced code block
- `FIRST_TIME`: toggles whether to seed from prior results or start fresh
- `PREVIOUS_RESULTS_PATH`: path to a previous run to continue context (optional)
- OpenAI client setup:
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENROUTER_KEY"],
    base_url="https://openrouter.ai/api/v1",
)
```
- The conversation is assembled as messages:
  - system: `SYSTEM_PROMPT`
  - user: `Traceback: ...` (raw or extracted Python traceback)
  - user: `Code: ...` (joined candidate functions/snippets)
- Subsequent iterations append another user message `New error: ...` to the same chat if you retry with feedback
- The API call uses chat completions:
```python
completion = client.chat.completions.create(
    model="deepseek/deepseek-r1-0528",
    messages=previous_chat,
    extra_body={},
)
```
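The message assembly described above can be sketched as follows. The helper names `build_chat` and `add_feedback` are illustrative; the notebook builds the list inline:

```python
def build_chat(system_prompt: str, traceback: str, chunks: list[str]) -> list[dict]:
    """Assemble the chat messages in the order described above."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Traceback: {traceback}"},
        {"role": "user", "content": "Code: " + "\n\n".join(chunks)},
    ]

def add_feedback(chat: list[dict], new_error: str) -> list[dict]:
    """Append a retry message with the new error to the same conversation."""
    chat.append({"role": "user", "content": f"New error: {new_error}"})
    return chat
```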
- The notebook uses regex-based parsing to extract fenced code blocks and function definitions:
  - Find `python`-tagged fenced code blocks
  - Inside blocks, extract each full `def name(...): ...` function
  - Record function names and sources
- The parser enforces Python-language blocks and ignores non-Python fences
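A simplified sketch of this extraction step (the regexes here are assumptions, not the notebook's exact patterns; in particular, `DEF_RE` only captures top-level functions and their indented bodies):

```python
import re

# Capture the contents of ```python ... ``` fences only; other fences are ignored.
FENCE_RE = re.compile(r"```python\n(.*?)```", re.DOTALL)
# Match a def line plus its indented body (a simplification).
DEF_RE = re.compile(r"^def (\w+)\(.*?\):\n(?:[ \t]+.*\n?|\n)*", re.MULTILINE)

def extract_code(text: str) -> list[dict]:
    """Extract {name, code} pairs from Python fenced blocks in an LLM reply."""
    functions = []
    for block in FENCE_RE.findall(text):
        for m in DEF_RE.finditer(block):
            functions.append({"name": m.group(1), "code": m.group(0).rstrip() + "\n"})
    return functions
```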
- Collected results are appended to `llm_results` and written as JSON under `tmp/ast/results/llm/single/`, with a timestamp in the filename, e.g. `tmp/ast/results/llm/single/{MODEL_NAME}_MM_DD_YYYY__HH_MM_SS.json`
- Each entry typically includes:
  - `project`, `bug_id`
  - `traceback`
  - `chunks` (the input snippets provided to the LLM)
  - `functions` (list of `{name, code}` extracted from the LLM response)
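The serialization step can be sketched as below, assuming the filename layout given above (`save_results` is an illustrative name):

```python
import json
import os
from datetime import datetime

def save_results(llm_results: list, model_name: str,
                 out_dir: str = "tmp/ast/results/llm/single") -> str:
    """Write collected results to a timestamped JSON file and return its path."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%m_%d_%Y__%H_%M_%S")
    path = os.path.join(out_dir, f"{model_name}_{stamp}.json")
    with open(path, "w") as f:
        json.dump(llm_results, f, indent=2)
    return path
```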
- Select a bug and gather inputs
  - Extract the traceback via `scripts.bugsinpy_utils.extract_python_tracebacks(project, bug)`
  - Select top-K candidate chunks for that bug (e.g., from retrieval results or a filtered `code_chunks.json`)
- Call `generate_code(trace_back, chunks)`: provides the chat messages and returns the model output text
- Parse patches: use `extract_code(text)` to get `{name, code}` for each function
- Save results for later evaluation/application
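The per-bug steps above can be tied together roughly as follows. `run_bug` is a hypothetical wrapper; `generate_code`, `extract_code`, and the traceback helper are the notebook's own functions, passed in here so the sketch stays self-contained:

```python
def run_bug(project, bug, chunks, generate_code, extract_code, get_traceback):
    """Run one bug through the pipeline and build a result entry
    matching the schema described above."""
    trace_back = get_traceback(project, bug)
    reply = generate_code(trace_back, chunks)
    return {
        "project": project,
        "bug_id": bug,
        "traceback": trace_back,
        "chunks": chunks,
        "functions": extract_code(reply),
    }
```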
- Patches must update only the given functions; do not modify tests or introduce new imports, per `SYSTEM_PROMPT`
- Use a separate `python`-fenced block for each function when returning multiple functions
- Long tracebacks/snippets: keep within the model context; consider pruning unrelated context
- Reproducibility: store `MODEL_NAME`, the prompts, and the inputs alongside the outputs
- Apply patches by replacing target function bodies in the buggy commit, then run the tests
- To locate the right file/path for a function, correlate function names from the LLM output with `code_chunks.json` entries
- Automate evaluation by reusing the test runners in `scripts.bugsinpy_utils` (`run_test`, `checkout_to_commit`, etc.)
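Replacing a function body in a source file can be sketched with the standard `ast` module (requires Python 3.8+ for `end_lineno`). This is an assumption about how application could work, not the repo's actual apply logic:

```python
import ast

def replace_function(source: str, name: str, new_code: str) -> str:
    """Replace a top-level function in `source` with `new_code`,
    locating it by name via the AST (decorators included)."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            start = node.lineno - 1
            if node.decorator_list:  # include any decorators above the def
                start = node.decorator_list[0].lineno - 1
            end = node.end_lineno
            return "".join(lines[:start]) + new_code.rstrip() + "\n" + "".join(lines[end:])
    raise ValueError(f"function {name!r} not found")
```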
Follow these steps to run the BugsInPy Docker workflow against your generated LLM patches and measure success:

- Copy your generated LLM results JSON to the BugsInPy framework location:
  `cp <path-to-your-llm-results>.json BugsInPy/framework/results/llm.json`
- Build the Docker images (run this AFTER copying `llm.json`):
  `(cd BugsInPy && docker compose build)`
- Run a project with the LLM testing entrypoint (example: youtube-dl). You can substitute any supported project target defined in `BugsInPy/docker-compose.yml`:
  `(cd BugsInPy && docker compose run youtube-dl)`
- Compute the success rate from the Docker logs or index using the provided utility:
  `python utils/count_success.py`
- Make another pass on new errors only. In the notebook `generate_single_llm_patches.ipynb`, set the variable `passed_bugs` to include bug ids (as strings) that have already passed so the script skips them on subsequent runs. Example: `passed_bugs = {"1", "3", "6", "9"}`
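The skip logic amounts to filtering the bug list against `passed_bugs` before looping; a minimal sketch (the helper name is illustrative, and ids are compared as strings as the notebook expects):

```python
def bugs_to_retry(all_bugs, passed_bugs):
    """Return only the bug ids that have not passed yet."""
    return [b for b in all_bugs if str(b) not in passed_bugs]
```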
- API usage depends on the configured provider; control temperature and tokens as needed
- Never log secrets; use environment variables and `.env`