This document explains how `generate_dataset_from_bugsinpy.ipynb` produces code chunks and embeds them into vectors for search, using the helper modules `run_ast_old.py`, `scripts/embedding.py`, and `scripts/bugsinpy_utils.py`.
Quick start: You can view and run a working version of the notebook on Colab: Open in Colab
Artifacts are written under `dataset/{project}/{bug}/`:
- `code_chunks.json` — extracted code snippets from the buggy commit
- `embedding.npy` — NumPy array of embeddings for the snippets in `code_chunks.json`, in the same order
These are later consumed by analysis to build FAISS indices and run retrieval.
- BugsInPy dataset checked out
- Python env with `requirements.txt` installed
- Optional `.env` at repo root for configuration:

```
MODEL_NAME=regularpooria/blaze_code_embedding
BATCH_SIZE=128
# Optional caching to avoid repeated downloads on clusters
# HF_HOME=./models/hf-cache
# TRANSFORMERS_CACHE=./models/hf-cache
```

From `scripts/bugsinpy_utils.py`, used in the notebook:
- `get_projects()` lists projects to process
- `clone_project(project)` clones into `tmp/{project}`
- `get_bug_info(project, bug)` returns commit IDs
- `checkout_to_commit(project, info["buggy_commit_id"])` checks out the buggy revision
This prepares the working tree for chunk extraction per bug.
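A minimal sketch of the per-bug preparation loop, assuming the helper signatures listed above; the bug IDs and the shape of `info` are illustrative assumptions:

```python
from scripts.bugsinpy_utils import (
    get_projects, clone_project, get_bug_info, checkout_to_commit,
)

for project in get_projects():
    clone_project(project)                 # clones into tmp/{project}
    for bug in ["1", "2"]:                 # placeholder bug IDs for illustration
        info = get_bug_info(project, bug)  # returns commit IDs
        checkout_to_commit(project, info["buggy_commit_id"])
        # ...extract chunks from the checked-out tree (next section)...
```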
Chunking is implemented in `run_ast_old.py` and driven by the notebook:
- `load_gitignore()` reads `.gitignore_embedding` (if present) to skip files/dirs
- `get_python_files(repo_path)` lists Python files, honoring the ignore rules
- `process_file(fpath)` parses the file with `asttokens` and extracts:
  - a synthetic `root` chunk (top-level code with defs/classes removed)
  - each `class` and `def` body as a separate chunk
- `extract_chunks(python_files)` parallelizes chunk extraction
For each {project, bug}, the notebook writes:
- `dataset/{project}/{bug}/code_chunks.json` with entries `{ "file", "name", "code" }`
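As an illustration of the chunking technique (not the actual code in `run_ast_old.py`), a file can be split into a `root` chunk plus per-`def`/`class` chunks with `asttokens` and written in the documented `{ "file", "name", "code" }` format:

```python
import ast
import json
import asttokens

def chunk_file(fpath):
    # Illustrative only: run_ast_old.py's process_file may differ in detail.
    source = open(fpath, encoding="utf-8").read()
    atok = asttokens.ASTTokens(source, parse=True)
    chunks, root_parts = [], []
    for node in atok.tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Each top-level class/def becomes its own chunk.
            chunks.append({"file": fpath, "name": node.name,
                           "code": atok.get_text(node)})
        else:
            # Everything else contributes to the synthetic "root" chunk.
            root_parts.append(atok.get_text(node))
    chunks.insert(0, {"file": fpath, "name": "root",
                      "code": "\n".join(root_parts)})
    return chunks

# Per-bug artifact, in the documented format:
# with open(f"dataset/{project}/{bug}/code_chunks.json", "w") as f:
#     json.dump(chunks, f, indent=2)
```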
Before embedding, the notebook aggregates all chunk texts across projects/bugs:
- Builds `texts_to_embed` (unique code strings) and a mapping `text_to_indices`
- Avoids re-embedding identical snippets, significantly reducing compute time
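A sketch of this de-duplication under the naming above; the exact contents of `text_to_indices` (here mapping each unique code string to the places it occurs) and the aggregate `chunks_by_bug` dict are assumptions:

```python
from collections import defaultdict

texts_to_embed = []                  # unique code strings, first-seen order
text_to_indices = defaultdict(list)  # code string -> list of (project, bug, chunk index)

for (project, bug), chunks in chunks_by_bug.items():  # hypothetical aggregate dict
    for i, chunk in enumerate(chunks):
        text = chunk["code"]
        if text not in text_to_indices:
            texts_to_embed.append(text)
        text_to_indices[text].append((project, bug, i))
```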
`scripts/embedding.py` provides the embedding logic:
- Loads `MODEL_NAME` (default `regularpooria/blaze_code_embedding`) via `transformers`
- Uses GPU if available; moves the model to half precision (`.half()`) when not on CPU
- Tokenizes to `max_length=1024`, runs a forward pass, and applies masked mean pooling to get one vector per input
- Batch size is controlled by `BATCH_SIZE`
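The following is a minimal sketch of that pipeline with `transformers`, not the exact code in `scripts/embedding.py` (progress-bar handling is omitted):

```python
import os
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = os.getenv("MODEL_NAME", "regularpooria/blaze_code_embedding")
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "128"))
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).to(device)
if device != "cpu":
    model = model.half()  # half precision off-CPU, as described above
model.eval()

def embed(texts, batch_size=BATCH_SIZE):
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        enc = tokenizer(batch, padding=True, truncation=True,
                        max_length=1024, return_tensors="pt").to(device)
        with torch.no_grad():
            hidden = model(**enc).last_hidden_state            # (B, T, H)
        mask = enc["attention_mask"].unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean pooling
        vectors.append(pooled.float().cpu().numpy())
    return np.concatenate(vectors, axis=0)
```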
In the notebook:
- `all_embeddings = embed(texts_to_embed, batch_size=BATCH_SIZE, show_progress_bar=True)`
- Fills `embedding_cache[text] = vector` for reuse across bugs
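For example, the cache can be filled by pairing the de-duplicated texts with the returned vectors (a sketch using the names above):

```python
embedding_cache = {
    text: vector
    for text, vector in zip(texts_to_embed, all_embeddings)
}
```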
For each {project, bug}:
- Load `code_chunks.json`
- Rebuild the per-chunk embedding list in the same order: `[embedding_cache[chunk["code"]] for chunk in chunks]`
- Save `dataset/{project}/{bug}/embedding.npy`
This preserves strict 1:1 alignment: `code_chunks.json[i]` ↔ `embedding.npy[i]`.
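A sketch of the per-bug save step under the names used above (`project`, `bug`, and `embedding_cache` are assumed to be in scope):

```python
import json
import numpy as np

with open(f"dataset/{project}/{bug}/code_chunks.json", encoding="utf-8") as f:
    chunks = json.load(f)

# Same order as code_chunks.json, so index i matches in both artifacts.
vectors = np.stack([embedding_cache[chunk["code"]] for chunk in chunks])
np.save(f"dataset/{project}/{bug}/embedding.npy", vectors)
```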
Downstream (e.g., in `run_analysis.py`):
- Load embeddings with `np.load(...)`
- Build a FAISS L2 index using `scripts.embedding.index_embeddings(embeddings)`
- Search with `index.search(query_vectors, k=K)`, where `query_vectors` are embedded tracebacks or queries
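A hedged sketch of the retrieval step with plain FAISS; `scripts.embedding.index_embeddings` is described as building an L2 index, and the direct `IndexFlatL2` calls below are assumed to be equivalent:

```python
import faiss
import numpy as np

project, bug = "pandas", "1"  # hypothetical example bug

embeddings = np.load(f"dataset/{project}/{bug}/embedding.npy").astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 index over chunk vectors
index.add(embeddings)

# query_vectors: embedded tracebacks/queries with the same dimensionality
query_vectors = embeddings[:1]                  # placeholder query
distances, indices = index.search(query_vectors, 5)  # indices map back to code_chunks.json
```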
- Open and execute `generate_dataset_from_bugsinpy.ipynb`
- First cells perform cloning and chunk extraction
- Later cells de-duplicate, embed, and write `embedding.npy`
- Ensure dependencies and the optional `.env` are in place
- Jobs have no Internet; pre-download models/tokenizers on the login node and set `HF_HOME`/`TRANSFORMERS_CACHE` (see the sketch after this list)
- Prefer `module load` for system packages; use `pip --no-index` as applicable
- See: `cluster_how_to/cluster_readme.md`, `login.md`, `folders.md`, `clone.md`, `limitations.md`
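A minimal pre-download sketch to run on the login node; the cache path mirrors the optional `.env` example above:

```python
import os

# Set the cache location before importing transformers so downloads land there.
os.environ["HF_HOME"] = "./models/hf-cache"

from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "regularpooria/blaze_code_embedding"
AutoTokenizer.from_pretrained(MODEL_NAME)  # fetches the tokenizer into the cache
AutoModel.from_pretrained(MODEL_NAME)      # fetches the model weights into the cache
```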
- Notebook: `generate_dataset_from_bugsinpy.ipynb`
- Chunking: `run_ast_old.py`
- Embedding: `scripts/embedding.py`
- Dataset utilities: `scripts/bugsinpy_utils.py`