πŸ” LLM Listwise Reranker for CodeRAG

Python · RAG · LLM Listwise Reranker · Gemini · Query Expansion · MMR · Evaluation

Retrieval-Augmented Generation (RAG) for Code Repositories


🧠 Overview

This project implements a Retrieval-Augmented Generation (RAG) system for question-answering over a GitHub code repository. It takes a GitHub URL as input, indexes the codebase, and allows users to ask natural language questions about the code. The system retrieves and reranks relevant files using advanced techniques such as query expansion, diverse retrieval strategies, and an LLM-based listwise reranker.


πŸ› οΈ Clear Instructions for Running

πŸ”§ Prerequisites

  • Python 3.8+
  • Git

πŸ“¦ Installation

git clone https://github.com/MarkoKolarski/RepoSearchRAG.git
cd RepoSearchRAG
pip install -r requirements.txt

▢️ Run the System

You can run the system using the main.py script. Below are all the available command-line arguments with their descriptions and default values:

πŸ”§ Repository Options

| Argument | Type | Default | Description |
|---|---|---|---|
| --repo_url | str | "https://github.com/viarotel-org/escrcpy" | GitHub repository URL to clone |
| --repo_path | str | "repository" | Local directory where the repository will be cloned and processed |

πŸ”Ž Retrieval Configuration

| Argument | Type | Default | Description |
|---|---|---|---|
| --top_k | int | 10 | Number of top search results (files) to return |
| --retrieval_strategy | str | "diverse" | Retrieval strategy to use. Options: default, probabilistic, diverse |

🧠 Summarization Options (optional)

| Argument | Type | Default | Description |
|---|---|---|---|
| --generate_summaries | flag | False | Enable generation of file content summaries (uses the local flan-t5-small model unless another summarizer option is selected) |
| --use_large_summarizer | flag | False | Use a larger local summarization model (flan-t5-base) for improved quality |
| --use_gemini | flag | False | Use the Google Gemini API for summarization |
| --gemini_api_key | str | None | API key for Gemini (can also be set via the GOOGLE_API_KEY environment variable) |
| --gemini_model | str | "gemini-1.5-flash" | Google Gemini model to use |

πŸ“Œ Example Usage

Basic usage with default settings:

python main.py

Specify different retrieval strategy and number of results:

python main.py --repo_url https://github.com/viarotel-org/escrcpy --retrieval_strategy probabilistic --top_k 5

Enable summarization using Google Gemini:

python main.py --generate_summaries --use_gemini --gemini_api_key YOUR_API_KEY

Use a larger local summarizer:

python main.py --generate_summaries --use_large_summarizer

πŸ§ͺ Run Evaluation

You can evaluate the system using the evaluate_coderag.py script. The evaluation is based on the Recall@K metric and compares retrieved results with ground truth from a dataset.

πŸ› οΈ Available Command-Line Arguments

| Argument | Type | Default | Description |
|---|---|---|---|
| --repo_url | str | "https://github.com/viarotel-org/escrcpy" | GitHub repository URL to evaluate |
| --repo_path | str | "repository" | Local path where the repository will be cloned and processed |
| --dataset | str | "escrcpy-commits-generated.json" | Path to the evaluation dataset (JSON format) |
| --retrieval_strategy | str | "diverse" | Strategy used for retrieval: default, probabilistic, or diverse |
| --top_k | int | 10 | Number of top results to consider when calculating Recall@K |
| --output | str | "evaluation_results.json" | Path to save the evaluation output results |

πŸ“Œ Example Usage

Run evaluation using default settings:

python evaluate_coderag.py

Run evaluation with a custom dataset and probabilistic retrieval:

python evaluate_coderag.py --dataset evaluation_dataset.json --retrieval_strategy probabilistic --top_k 5

Save results to a specific file:

python evaluate_coderag.py --output my_results.json

πŸ“ Evaluation Dataset Format

The evaluation dataset should be a .json file with the following structure:

{
  "strategy": "diverse",
  "recall_strict": 0.7361,
  "recall_extended": 1.0,
  "average_query_time": 6.2093,
  "total_time": 251.1481,
  "retrieved_results": {
    "Question 1": [ "retrieved/file1", "retrieved/file2", ... ],
    "Question 2": [ "retrieved/file1", "retrieved/file2", ... ],
    "Question 3": [ "retrieved/file1", "retrieved/file2", ... ]
  }
}

The system calculates Recall@K based on how many of the expected files appear among the top K retrieved results; a minimal sketch of this calculation is shown below.
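
For reference, here is a minimal, exact-match sketch of the Recall@K calculation. The project's RAGEvaluator additionally applies semantic-match thresholds (strict vs. extended), which are omitted here, and the example paths are hypothetical:

def recall_at_k(expected_files, retrieved_files, k=10):
    """Fraction of ground-truth files that appear in the top-k retrieved results."""
    if not expected_files:
        return 0.0
    top_k = set(retrieved_files[:k])
    return sum(1 for f in expected_files if f in top_k) / len(expected_files)

# Hypothetical example: 1 of 2 expected files is retrieved -> Recall@10 = 0.5
expected = ["src/utils/device/index.js", "electron/exposes/adb/index.js"]
retrieved = ["src/utils/device/index.js", "src/components/PairDialog.vue"]
print(recall_at_k(expected, retrieved, k=10))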


✨ Features and Implementation Details

1️⃣ Complete Pipeline from Indexing to RAG

πŸ’‘ Implementation:

  • Repository is cloned via clone_repository() in repository_utils.py.
  • Files are indexed using prepare_repository_files(), supporting .py, .js, .md, etc.
  • AdvancedCodeRAGSystem class handles retrieval and reranking (a usage sketch follows below).
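
A rough sketch of how these pieces fit together. The import locations, argument names, and call signatures below are assumptions for illustration, not the project's exact API:

# Illustrative pipeline only; signatures and argument names are assumed.
from repository_utils import clone_repository, prepare_repository_files, AdvancedCodeRAGSystem

repo_path = clone_repository("https://github.com/viarotel-org/escrcpy", "repository")
files = prepare_repository_files(repo_path)          # index supported source files
rag_system = AdvancedCodeRAGSystem(retrieval_strategy="diverse")
results = rag_system.retrieve("How does the device pairing work?", files, top_k=10)
for path in results:
    print(path)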

2️⃣ Scope Limited to a Single Repository

πŸ’‘ Implementation:

  • The repo URL is passed as --repo_url argument.
  • The system is developed and tested specifically on the escrcpy repository, where it gives the best results.

3️⃣ Natural Language Question Answering

πŸ’‘ Implementation:

  • interactive_query_loop() enables interactive Q&A.
  • Queries are expanded via QueryExpander (adds synonyms and code-specific terms).
  • Retrieval and reranking handled by AdvancedCodeRAGSystem and ListwiseReranker.

πŸ“€ Output Sample:

Your question: How does the device pairing work?

Top 10 relevant files for: "How does the device pairing work?"
β€’ repository\src\utils\device\generateAdbPairingQR\index.js

β€’ repository\src\dicts\device\index.js

β€’ repository\electron\exposes\adb\helpers\scanner\index.js
β€’ ...

4️⃣ Evaluation with Recall@10

πŸ’‘ Implementation:

  • Evaluation is done using evaluate_coderag.py.
  • RAGEvaluator compares retrieved results with ground truth.

πŸ“Š Output Example:

Recall@10 (Strict semantic match, thresholds β‰₯ 0.5): 0.7222
Recall@10 (Extended match, including weaker matches β‰₯ 0.3): 1.0000

🧠 Advanced Techniques

5.1 πŸ—οΈ Index Building Algorithm

  • Efficiently indexes files using prepare_repository_files() with multiprocessing.
  • Filters supported extensions only (e.g., .py, .js, .md); a simplified sketch follows below.
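
A simplified sketch of this idea, extension filtering plus a multiprocessing pool, independent of the project's actual prepare_repository_files() implementation:

# Simplified sketch: collect supported files and read them in parallel.
from multiprocessing import Pool
from pathlib import Path

SUPPORTED_EXTENSIONS = {".py", ".js", ".md"}  # illustrative subset

def read_file(path):
    return str(path), Path(path).read_text(encoding="utf-8", errors="ignore")

def index_repository(repo_path, processes=4):
    paths = [str(p) for p in Path(repo_path).rglob("*")
             if p.is_file() and p.suffix in SUPPORTED_EXTENSIONS]
    with Pool(processes=processes) as pool:
        return dict(pool.map(read_file, paths))

if __name__ == "__main__":
    index = index_repository("repository")
    print(f"Indexed {len(index)} files")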

5.2 πŸ” Query Expansion

  • Implemented in QueryExpander:
    • Uses WordNet for synonyms (see the sketch below).
    • Adds code-specific terms like function, class, component.
    • Extracts context-aware keywords from source files.
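
The sketch below shows the general idea of WordNet-based synonym expansion combined with code-specific terms; it is a simplified stand-in for QueryExpander, not its actual code:

# Simplified query expansion: WordNet synonyms plus code-specific terms.
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

CODE_TERMS = {"function", "class", "component"}

def expand_query(query, max_synonyms_per_word=3):
    expanded = set(query.lower().split())
    for word in list(expanded):
        synonyms = {lemma.name().replace("_", " ")
                    for synset in wordnet.synsets(word)
                    for lemma in synset.lemmas()}
        expanded.update(sorted(synonyms)[:max_synonyms_per_word])
    return " ".join(expanded | CODE_TERMS)

print(expand_query("device pairing"))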

5.3 🧠 LLM-Based Listwise Reranker

  • Implemented in ListwiseReranker.
  • Uses cross-encoder model: cross-encoder/ms-marco-MiniLM-L-12-v2 (sketched below).
  • Applies custom boosting based on:
    • File types (.py, .js)
    • Test file priority (e.g., test_*.py)
    • Token overlap & early keyword match
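
A minimal sketch of the core reranking step with the sentence-transformers CrossEncoder; the custom boosting rules listed above are omitted:

# Score (query, file content) pairs with a cross-encoder and sort by relevance.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank(query, candidate_texts, top_k=10):
    scores = reranker.predict([(query, text) for text in candidate_texts])
    ranked = sorted(zip(candidate_texts, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]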

5.4 πŸ’‘ Other Techniques

  • Multiple retrieval strategies:
    • default: Cosine similarity
    • probabilistic: Relevance scoring
    • diverse: Maximal Marginal Relevance (MMR); see the sketch below
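
The diverse strategy follows the standard MMR formulation. A generic sketch over precomputed embeddings (the trade-off weight lambda_param is illustrative, not the project's setting):

# Generic MMR: balance relevance to the query against similarity to already-selected docs.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

def mmr_select(query_emb, doc_embs, top_k=10, lambda_param=0.7):
    relevance = [cosine(query_emb, d) for d in doc_embs]
    selected, remaining = [], list(range(len(doc_embs)))
    while remaining and len(selected) < top_k:
        def score(i):
            redundancy = max((cosine(doc_embs[i], doc_embs[j]) for j in selected), default=0.0)
            return lambda_param * relevance[i] - (1 - lambda_param) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of chosen documents, in selection order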

πŸ… Optional Features

βœ… 1. LLM-Generated Summaries

  • Implemented via AdvancedSummarizer.
  • Supports:
    • Default local model (google/flan-t5-small) – used if no other options are specified
    • Larger local model (google/flan-t5-base) – enabled with --use_large_summarizer
    • API-based model (e.g., Google Gemini) – enabled with --use_gemini
  • Additional options:
    • --generate_summaries – enables summarization
    • --use_large_summarizer – uses a higher-quality local model
    • --use_gemini – uses the Gemini API for summarization
    • --gemini_api_key – required for Gemini API (can also be set via GOOGLE_API_KEY env variable)
    • --gemini_model – specifies the Gemini model (default: gemini-1.5-flash)

πŸ“Š Summarization Models – Visual Comparison Table

| Model Type | Model Name | How to Enable | Speed ⏱️ | Quality 🧠 | Resource Usage πŸ’» |
|---|---|---|---|---|---|
| βœ… Default Local Model | google/flan-t5-small (default) | Just use --generate_summaries | ⚑ Fast | ⭐ Good | 🟒 Low (CPU/GPU) |
| πŸ”„ Large Local Model | google/flan-t5-base | --generate_summaries --use_large_summarizer | 🐒 Slower | ⭐⭐⭐ Better | 🟑 Medium–High |
| ☁️ Cloud API (Gemini) | gemini-1.5-flash (default) | --generate_summaries --use_gemini --gemini_api_key <KEY> | ⚑ Fast | ⭐⭐ Very Good | πŸ”΅ External (API) |
| ☁️ Cloud API (Gemini Pro) | gemini-1.5-pro (custom) | Same as above + --gemini_model gemini-1.5-pro | ⚠️ Varies | ⭐⭐⭐⭐ Excellent | πŸ”΅ External (API) |
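
For reference, a minimal sketch of calling the Gemini API for a file summary via the google-generativeai package; the prompt wording and truncation length are illustrative, not the project's AdvancedSummarizer:

# Minimal Gemini summarization call; requires GOOGLE_API_KEY to be set.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

def summarize_file(path, content, max_chars=4000):
    prompt = f"Summarize the purpose of this source file ({path}):\n\n{content[:max_chars]}"
    return model.generate_content(prompt).text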

βœ… 2. Latency/Quality Trade-Off Evaluation

  • Retrieval strategies allow users to balance speed vs. accuracy.
  • Evaluation script logs:
    • Average query time
    • Total evaluation time

βœ… 3. Switching Between LLM & Embedding Models

  • AdvancedSummarizer supports:
    • Default local model (flan-t5-small)
    • Larger local model (flan-t5-base) via --use_large_summarizer
    • API-based models (e.g., Google Gemini) via --use_gemini
  • AdvancedCodeRAGSystem uses SentenceTransformer for embeddings (see the sketch after this list), but can be extended to support:
    • OpenAI embeddings
    • Cohere embeddings
    • Custom local embedding models
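
A sketch of how such a swap can be kept behind a single constructor argument; the default model name here is an assumption, not necessarily the one used by AdvancedCodeRAGSystem:

# Cosine-similarity retrieval with a pluggable SentenceTransformer backend.
import numpy as np
from sentence_transformers import SentenceTransformer

class EmbeddingRetriever:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)  # swap the name to change backends

    def retrieve(self, query, documents, top_k=10):
        doc_embs = self.model.encode(documents, normalize_embeddings=True)
        query_emb = self.model.encode([query], normalize_embeddings=True)[0]
        scores = doc_embs @ query_emb  # cosine similarity, since embeddings are normalized
        order = np.argsort(-scores)[:top_k]
        return [(documents[i], float(scores[i])) for i in order]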

πŸ“ Project Structure

.
β”œβ”€β”€ main.py                          # Entry point for querying
β”œβ”€β”€ evaluate_coderag.py              # Evaluation script (Recall@10)
β”œβ”€β”€ repository_utils.py              # Cloning, indexing, querying logic
β”œβ”€β”€ listwise_reranker.py             # Reranking using LLMs
β”œβ”€β”€ requirements.txt                 # Python dependencies
β”œβ”€β”€ README.md                        # Project documentation
β”œβ”€β”€ escrcpy-commits-generated.json   # Input dataset for evaluation
└── evaluation_results.json          # Output results from evaluation (optional / generated)

πŸ™‹β€β™‚οΈ Contact

For questions, suggestions, or contributions, feel free to open an issue or reach out directly.
