# Retrieval-Augmented Generation (RAG) for Code Repositories
This project implements a Retrieval-Augmented Generation (RAG) system for question-answering over a GitHub code repository. It takes a GitHub URL as input, indexes the codebase, and allows users to ask natural language questions about the code. The system retrieves and reranks relevant files using advanced techniques such as query expansion, diverse retrieval strategies, and an LLM-based listwise reranker.
## Prerequisites

- Python 3.8+
- Git
## Installation

```bash
git clone https://github.com/MarkoKolarski/RepoSearchRAG.git
cd RepoSearchRAG
pip install -r requirements.txt
```

## Usage

You can run the system using the `main.py` script. Below are the available command-line arguments with their descriptions and default values.
**Repository options:**

| Argument | Type | Default | Description |
|---|---|---|---|
| `--repo_url` | str | `"https://github.com/viarotel-org/escrcpy"` | GitHub repository URL to clone |
| `--repo_path` | str | `"repository"` | Local directory where the repository will be cloned and processed |
**Retrieval options:**

| Argument | Type | Default | Description |
|---|---|---|---|
| `--top_k` | int | `10` | Number of top search results (files) to return |
| `--retrieval_strategy` | str | `"diverse"` | Retrieval strategy to use. Options: `default`, `probabilistic`, `diverse` |
**Summarization options:**

| Argument | Type | Default | Description |
|---|---|---|---|
| `--generate_summaries` | flag | `False` | Enable generation of file content summaries (uses the local `flan-t5-small` model unless another option is selected) |
| `--use_large_summarizer` | flag | `False` | Use a larger local summarization model (`flan-t5-base`) for improved quality |
| `--use_gemini` | flag | `False` | Use the Google Gemini API for summarization |
| `--gemini_api_key` | str | `None` | API key for Gemini (can also be set via the `GOOGLE_API_KEY` environment variable) |
| `--gemini_model` | str | `"gemini-1.5-flash"` | Google Gemini model to use |
### Examples

Basic usage with default settings:

```bash
python main.py
```

Specify a different retrieval strategy and number of results:

```bash
python main.py --repo_url https://github.com/viarotel-org/escrcpy --retrieval_strategy probabilistic --top_k 5
```

Enable summarization using Google Gemini:

```bash
python main.py --generate_summaries --use_gemini --gemini_api_key YOUR_API_KEY
```

Use a larger local summarizer:

```bash
python main.py --generate_summaries --use_large_summarizer
```

## Evaluation

You can evaluate the system using the `evaluate_coderag.py` script. The evaluation is based on the Recall@K metric and compares retrieved results with ground truth from a dataset.
| Argument | Type | Default | Description |
|---|---|---|---|
| `--repo_url` | str | `"https://github.com/viarotel-org/escrcpy"` | GitHub repository URL to evaluate |
| `--repo_path` | str | `"repository"` | Local path where the repository will be cloned and processed |
| `--dataset` | str | `"escrcpy-commits-generated.json"` | Path to the evaluation dataset (JSON format) |
| `--retrieval_strategy` | str | `"diverse"` | Strategy used for retrieval: `default`, `probabilistic`, or `diverse` |
| `--top_k` | int | `10` | Number of top results to consider when calculating Recall@K |
| `--output` | str | `"evaluation_results.json"` | Path to save the evaluation output results |
Run evaluation using default settings:

```bash
python evaluate_coderag.py
```

Run evaluation with a custom dataset and probabilistic retrieval:

```bash
python evaluate_coderag.py --dataset evaluation_dataset.json --retrieval_strategy probabilistic --top_k 5
```

Save results to a specific file:

```bash
python evaluate_coderag.py --output my_results.json
```

The evaluation writes its results to a `.json` file with the following structure:
```json
{
  "strategy": "diverse",
  "recall_strict": 0.7361,
  "recall_extended": 1.0,
  "average_query_time": 6.2093,
  "total_time": 251.1481,
  "retrieved_results": {
    "Question 1": [ "retrieved/file1", "retrieved/file2", ... ],
    "Question 2": [ "retrieved/file1", "retrieved/file2", ... ],
    "Question 3": [ "retrieved/file1", "retrieved/file2", ... ]
  }
}
```

The evaluation dataset itself pairs each natural language question with a list of expected relevant file paths (the ground truth). The system calculates Recall@K based on how many of the expected files are retrieved in the top K results.
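
As a concrete reference, here is a minimal sketch of the Recall@K computation as described above; the function and argument names are illustrative, not necessarily those used in `evaluate_coderag.py`:

```python
# Minimal sketch of Recall@K: the fraction of ground-truth files that
# appear in the top-K retrieved results, averaged over all questions.
# Names are illustrative; the repo's RAGEvaluator may differ in detail.
from typing import Dict, List

def recall_at_k(retrieved: Dict[str, List[str]],
                ground_truth: Dict[str, List[str]],
                k: int = 10) -> float:
    scores = []
    for question, expected in ground_truth.items():
        top_hits = set(retrieved.get(question, [])[:k])
        found = sum(1 for path in expected if path in top_hits)
        scores.append(found / len(expected) if expected else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```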
## How It Works

💡 Implementation:

- The repository is cloned via `clone_repository()` in `repository_utils.py` (a minimal sketch follows this list).
- Files are indexed using `prepare_repository_files()`, supporting `.py`, `.js`, `.md`, etc.
- The `AdvancedCodeRAGSystem` class handles retrieval and reranking.
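
A hedged sketch of the cloning step, assuming a plain `git` binary on the PATH; the actual `clone_repository()` may add error handling, caching, or shallow clones:

```python
# Illustrative version of clone_repository(); the real implementation
# in repository_utils.py may differ.
import os
import subprocess

def clone_repository(repo_url: str, repo_path: str = "repository") -> str:
    if not os.path.isdir(repo_path):  # skip cloning if already present
        subprocess.run(["git", "clone", repo_url, repo_path], check=True)
    return repo_path
```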
💡 Implementation:

- The repository URL is passed via the `--repo_url` argument.
- The system is developed and tested specifically on the `escrcpy` repository for optimal results.
💡 Implementation:

- `interactive_query_loop()` enables interactive Q&A (a stripped-down version follows this list).
- Queries are expanded via `QueryExpander` (adds synonyms and code-specific terms).
- Retrieval and reranking are handled by `AdvancedCodeRAGSystem` and `ListwiseReranker`.
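
For orientation, such a loop might look like this; `rag_system.retrieve()` is a hypothetical stand-in for the repo's retrieval/reranking API:

```python
# Illustrative interactive loop; rag_system.retrieve() is a hypothetical
# stand-in for the pipeline described above.
def interactive_query_loop(rag_system, top_k: int = 10) -> None:
    while True:
        question = input("Your question: ").strip()
        if question.lower() in {"", "exit", "quit"}:
            break
        print(f'Top {top_k} relevant files for: "{question}"')
        for path in rag_system.retrieve(question, top_k=top_k):
            print(f"  • {path}")
```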
🤖 Output Sample:

```
Your question: How does the device pairing work?

Top 10 relevant files for: "How does the device pairing work?"
• repository\src\utils\device\generateAdbPairingQR\index.js
• repository\src\dicts\device\index.js
• repository\electron\exposes\adb\helpers\scanner\index.js
• ...
```
💡 Implementation:

- Evaluation is done using `evaluate_coderag.py`.
- `RAGEvaluator` compares retrieved results with the ground truth.
📊 Output Example:

```
Recall@10 (Strict semantic match, thresholds ≥ 0.5): 0.7222
Recall@10 (Extended match, including weaker matches ≥ 0.3): 1.0000
```
## Key Features

### File Indexing

- Efficiently indexes files using `prepare_repository_files()` with multiprocessing (see the sketch below).
- Filters supported extensions only (e.g., `.py`, `.js`, `.md`).
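
A hedged sketch of what parallel indexing with an extension filter can look like; names and details are assumptions, not the repo's exact code:

```python
# Illustrative parallel file indexing with an extension filter,
# in the spirit of prepare_repository_files(); details are assumptions.
from multiprocessing import Pool
from pathlib import Path
from typing import Dict, Tuple

SUPPORTED_EXTENSIONS = {".py", ".js", ".md"}

def read_file(path: Path) -> Tuple[str, str]:
    return str(path), path.read_text(encoding="utf-8", errors="ignore")

def prepare_files(repo_path: str) -> Dict[str, str]:
    paths = [p for p in Path(repo_path).rglob("*")
             if p.is_file() and p.suffix in SUPPORTED_EXTENSIONS]
    with Pool() as pool:  # read files in parallel across processes
        return dict(pool.map(read_file, paths))
```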
### Query Expansion

- Implemented in `QueryExpander` (see the sketch after this list):
  - Uses WordNet for synonyms.
  - Adds code-specific terms like `function`, `class`, `component`.
  - Extracts context-aware keywords from source files.
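
A minimal sketch of WordNet-based expansion, assuming NLTK with the `wordnet` corpus downloaded; the actual `QueryExpander` also extracts keywords from source files, which is omitted here:

```python
# Minimal WordNet expansion sketch; run nltk.download("wordnet") once.
# The repo's QueryExpander also mines context-aware keywords (omitted).
from typing import List
from nltk.corpus import wordnet

CODE_TERMS = ["function", "class", "component"]

def expand_query(query: str, max_synonyms: int = 2) -> List[str]:
    expanded = [query]
    for token in query.lower().split():
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(token)
            for lemma in synset.lemmas()
            if lemma.name().lower() != token
        }
        expanded.extend(sorted(synonyms)[:max_synonyms])
    # Bias retrieval toward code artifacts with code-specific vocabulary.
    expanded.extend(f"{query} {term}" for term in CODE_TERMS)
    return expanded
```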
### Listwise Reranking

- Implemented in `ListwiseReranker`.
- Uses the cross-encoder model `cross-encoder/ms-marco-MiniLM-L-12-v2`.
- Applies custom boosting based on (see the sketch below):
  - File types (`.py`, `.js`)
  - Test file priority (e.g., `test_*.py`)
  - Token overlap & early keyword match
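
An illustrative reranking sketch using the named cross-encoder via `sentence-transformers`; the boost weights are assumptions, not the repo's values:

```python
# Cross-encoder reranking sketch; boost weights are illustrative only.
from typing import List, Tuple
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query: str, files: List[Tuple[str, str]], top_k: int = 10) -> List[str]:
    """files: (path, content) pairs; returns boosted, sorted paths."""
    scores = model.predict([(query, content) for _, content in files])
    boosted = []
    for (path, content), score in zip(files, scores):
        if path.endswith((".py", ".js")):                # file-type boost
            score += 0.10
        if path.rsplit("/", 1)[-1].startswith("test_"):  # test-file priority
            score += 0.05
        overlap = set(query.lower().split()) & set(content.lower().split())
        score += 0.02 * len(overlap)                     # token-overlap boost
        boosted.append((path, score))
    boosted.sort(key=lambda pair: pair[1], reverse=True)
    return [path for path, _ in boosted[:top_k]]
```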
### Retrieval Strategies

- Multiple retrieval strategies (a minimal MMR sketch follows this list):
  - `default`: Cosine similarity
  - `probabilistic`: Relevance scoring
  - `diverse`: Maximal Marginal Relevance (MMR)
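
A minimal MMR sketch for the `diverse` strategy, assuming unit-normalized embedding vectors; `lambda_` trades relevance against diversity and is an illustrative parameter name:

```python
# MMR sketch for the "diverse" strategy; assumes normalized embeddings.
from typing import List
import numpy as np

def mmr(query_vec: np.ndarray, doc_vecs: np.ndarray,
        top_k: int = 10, lambda_: float = 0.7) -> List[int]:
    relevance = doc_vecs @ query_vec        # cosine similarity to the query
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < top_k:
        def score(i: int) -> float:
            if not selected:                # first pick: pure relevance
                return relevance[i]
            redundancy = max(doc_vecs[i] @ doc_vecs[j] for j in selected)
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected                         # indices of diverse top-k docs
```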
### Summarization

- Implemented via `AdvancedSummarizer` (a sketch of the local path follows this list).
- Supports:
  - Default local model (`google/flan-t5-small`) – used if no other options are specified
  - Larger local model (`google/flan-t5-base`) – enabled with `--use_large_summarizer`
  - API-based model (e.g., Google Gemini) – enabled with `--use_gemini`
- Additional options:
  - `--generate_summaries` – enables summarization
  - `--use_large_summarizer` – uses a higher-quality local model
  - `--use_gemini` – uses the Gemini API for summarization
  - `--gemini_api_key` – required for the Gemini API (can also be set via the `GOOGLE_API_KEY` environment variable)
  - `--gemini_model` – specifies the Gemini model (default: `gemini-1.5-flash`)
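
A hedged sketch of the local summarization path using Hugging Face `transformers`; the actual `AdvancedSummarizer` may prompt and post-process differently:

```python
# Local summarization sketch with the flan-t5 models named above.
from transformers import pipeline

def build_summarizer(use_large: bool = False):
    model_name = "google/flan-t5-base" if use_large else "google/flan-t5-small"
    return pipeline("text2text-generation", model=model_name)

summarizer = build_summarizer()
result = summarizer("Summarize this code: def add(a, b): return a + b",
                    max_new_tokens=60)
print(result[0]["generated_text"])
```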
| Model Type | Model Name | How to Enable | Speed ⏱️ | Quality 🧠 | Resource Usage 💻 |
|---|---|---|---|---|---|
| ✅ Default Local Model | `google/flan-t5-small` | (default) Just use `--generate_summaries` | ⚡ Fast | ⭐ Good | 🟢 Low (CPU/GPU) |
| 🔼 Large Local Model | `google/flan-t5-base` | `--generate_summaries --use_large_summarizer` | 🐢 Slower | ⭐⭐⭐ Better | 🟡 Medium–High |
| ☁️ Cloud API (Gemini) | `gemini-1.5-flash` (default) | `--generate_summaries --use_gemini --gemini_api_key <KEY>` | ⚡ Fast | ⭐⭐ Very Good | 🔵 External (API) |
| ☁️ Cloud API (Gemini Pro) | `gemini-1.5-pro` (custom) | Same as above + `--gemini_model gemini-1.5-pro` | 🐢 Slower | ⭐⭐⭐⭐ Excellent | 🔵 External (API) |
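
The Gemini path presumably goes through the `google-generativeai` package; a hedged sketch (the repo's integration may differ in prompts and error handling):

```python
# Illustrative Gemini summarization call via google-generativeai;
# prompts and error handling in the repo may differ.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")
with open("main.py", encoding="utf-8") as f:
    response = model.generate_content("Summarize this file:\n" + f.read())
print(response.text)
```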
### Performance & Extensibility

- Retrieval strategies allow users to balance speed vs. accuracy.
- The evaluation script logs:
  - Average query time
  - Total evaluation time
- `AdvancedSummarizer` supports:
  - Default local model (`flan-t5-small`)
  - Larger local model (`flan-t5-base`) via `--use_large_summarizer`
  - API-based models (e.g., Google Gemini) via `--use_gemini`
- `AdvancedCodeRAGSystem` uses `SentenceTransformer` for embeddings, but can be extended to support (see the sketch below):
  - OpenAI embeddings
  - Cohere embeddings
  - Custom local embedding models
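
A minimal sketch of the default embedding path with `sentence-transformers`; the model name `all-MiniLM-L6-v2` is an assumption (the repo may configure a different one):

```python
# Default embedding sketch; the model name here is an assumption.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
vectors = embedder.encode(
    ["def pair_device(): ...", "adb pairing helper"],
    normalize_embeddings=True,  # unit vectors, so dot product = cosine
)
print(vectors.shape)  # (2, 384) for this model
```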
## Project Structure

```
.
├── main.py                          # Entry point for querying
├── evaluate_coderag.py              # Evaluation script (Recall@10)
├── repository_utils.py              # Cloning, indexing, querying logic
├── listwise_reranker.py             # Reranking using LLMs
├── requirements.txt                 # Python dependencies
├── README.md                        # Project documentation
├── escrcpy-commits-generated.json   # Input dataset for evaluation
└── evaluation_results.json          # Output results from evaluation (optional / generated)
```
## Contact

For questions, suggestions, or contributions, feel free to open an issue or reach out directly.