Official implementation of the paper:
QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation
Dehai Min, Kailin Zhang, Tongtong Wu, Lu Cheng
QuCo-RAG is a dynamic Retrieval-Augmented Generation method that determines when to retrieve based on objective statistics from pre-training data, rather than relying on model-internal signals (e.g., logits, entropy) which are often unreliable due to LLM miscalibration.
- 🎯 Corpus-Grounded Uncertainty Quantification: Uses Infini-gram to query entity frequencies and co-occurrences in a 4-trillion-token corpus
- ⚡ Two-Stage Retrieval Triggering:
  - Before generation: Identifies low-frequency entities indicating long-tail knowledge gaps
  - During generation: Verifies entity co-occurrence to detect hallucination risk
- 🔌 Model-Agnostic: Works with OLMo, Llama, Qwen, and even GPT models
- 📈 Strong Performance: Achieves EM gains of 5-12 points over SOTA baselines on multi-hop QA benchmarks
- 🚀 Pre-computed Cache Available: Download our cache files to speed up experiments by 2-5x (see Quick Start Tip below)
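The two-stage triggering idea above can be sketched in a few lines of Python. This is an illustrative simplification only: the function names, thresholds, and counts below are hypothetical and do not reflect the repo's actual API.

```python
# Minimal sketch of QuCo-RAG's two-stage triggering idea (illustrative only;
# function names and thresholds here are hypothetical, not the repo's API).

def should_retrieve_before_generation(entity_freqs, threshold=1000):
    """Stage 1: trigger retrieval if any question entity is rare in the
    pre-training corpus (a long-tail knowledge gap)."""
    return any(freq < threshold for freq in entity_freqs.values())

def should_retrieve_during_generation(cooccurrence_count, min_cooccurrence=1):
    """Stage 2: trigger retrieval if two entities the model just linked
    never (or rarely) co-occur in the corpus -> hallucination risk."""
    return cooccurrence_count < min_cooccurrence

# Example with made-up Infini-gram counts:
freqs = {"Marie Curie": 2_450_000, "Kazimierz Zorawski": 120}
print(should_retrieve_before_generation(freqs))  # rare entity -> True
print(should_retrieve_during_generation(0))      # never co-occur -> True
```

The point of the design is that both decisions use objective corpus statistics rather than the model's own (mis)calibrated confidence.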
- Installation
- Data Preparation
- Running Experiments
- Evaluation
- Available Configurations
- Important Notes
- Citation
- Acknowledgements
💡 We strongly recommend downloading our pre-computed cache files to significantly accelerate your experiments!
Our cache includes Infini-gram and retrieval results for 2WikiMultihopQA and HotpotQA datasets. With cache enabled, you can reduce experiment time by 2-5x depending on your setup.
📦 Download Cache Files (77MB)
# Quick setup
cd QuCo-RAG/data
mkdir -p cache && cd cache
# Download quco_cache.tar.gz from the link above, then:
tar -xzf quco_cache.tar.gz

Then set `"enable_cache": true` in your config file (like QuCo-RAG-cache.json). See Optional: Speed Up with Pre-computed Cache for details.
# Clone the repository
git clone https://github.com/ZhishanQ/QuCo-RAG.git
cd QuCo-RAG
# If you already cloned the repository, pull the latest updates
git pull origin main
# Create and activate conda environment
conda create -n quco-rag python=3.9
conda activate quco-rag
# Install PyTorch with CUDA support (required first)
pip install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# Install other dependencies
pip install -r requirements.txt
# Download spaCy English language model
python -m spacy download en_core_web_sm

Important: All commands in this section assume you are in the `QuCo-RAG` project root directory. Make sure to run `cd QuCo-RAG` first if you haven't already.
Before setting up the BM25 index, you need to download two things:
1. Download Elasticsearch 7.17.9
# Make sure you are in the QuCo-RAG directory
cd QuCo-RAG # Skip this if you're already in the project root
# Create data directory and download Elasticsearch
mkdir -p data
cd data
wget -O elasticsearch-7.17.9.tar.gz https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.17.9-linux-x86_64.tar.gz
tar zxvf elasticsearch-7.17.9.tar.gz
cd ..

2. Download Wikipedia Dump
Download the Wikipedia dump from the DPR repository:
# Make sure you are in the QuCo-RAG directory
mkdir -p data/dpr
wget -O data/dpr/psgs_w100.tsv.gz https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz
pushd data/dpr && gzip -d psgs_w100.tsv.gz && popd

Choose ONE of the following options to set up the BM25 index. You only need to do this ONCE.
Option 1: Build index from scratch (~2-3 hours)
# Make sure you are in the QuCo-RAG directory first!
cd QuCo-RAG # Skip this if you're already in the project root
# Start Elasticsearch
cd data/elasticsearch-7.17.9
nohup bin/elasticsearch &
cd ../.. # Return to QuCo-RAG root
# Wait for ES to start (typically 2-5 minutes, depending on your system)
sleep 200
# You can check if ES is ready by running:
curl localhost:9200
# If you see "Connection refused", wait a bit longer and try again.
# ES is ready when you see a JSON response with "You Know, for Search"

If Elasticsearch is running successfully, you should see output similar to:
{
"name" : "your-node-name",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "xxxxxx",
"version" : {
"number" : "7.17.9",
...
},
"tagline" : "You Know, for Search"
}

Run the following command to build the index after Elasticsearch is running.
# Build the index (this takes 2-3 hours)
python tools/prep_elastic_index_with_progress.py --data_path data/dpr/psgs_w100.tsv --index_name wiki

Option 2: Download pre-built index (Recommended)
We provide a pre-built BM25 index (~10GB) on HuggingFace for quick setup:
bash Start_Elasticsearch_from_hf.sh

The script will:
- Ask you to confirm the configuration (ES directory and URL)
- Check if index already exists (skip download if yes)
- Download the pre-built index from 🤗 ZhishanQ/QuCo-RAG-es-data-archive
- Start Elasticsearch and verify the index
For HPC users: Use `bash Start_Elasticsearch_from_hf_HPC.sh` for better I/O performance with local SSD storage.

⚠️ Warning: HPC mode stores data in `/tmp`, which will be deleted when the job ends. You may need to re-download for each new job.
Understanding the workflow:
┌──────────────────────────────────────────────────────────────
│ FIRST TIME SETUP (do once)
│
│ Option 1: Build index from scratch
│   → python tools/prep_elastic_index_with_progress.py
│
│ Option 2: Download pre-built index
│   → bash Start_Elasticsearch_from_hf.sh
└──────────────────────────────────────────────────────────────
                              │
                              ▼
┌──────────────────────────────────────────────────────────────
│ RUNNING EXPERIMENTS
│
│ 1. Start ES service:  bash start_es.sh
│ 2. Run experiments:   python main_quco.py -c ...
│ 3. Stop ES service:   pkill -f elasticsearch
│
│ The index is persistent - no need to rebuild/re-download!
│ (Exception: HPC mode stores in /tmp, re-download each job)
└──────────────────────────────────────────────────────────────
Available scripts:
| Script | Purpose | When to use |
|---|---|---|
| `Start_Elasticsearch_from_hf.sh` | Download pre-built index + start ES | First-time setup (downloads index once) |
| `Start_Elasticsearch_from_hf_HPC.sh` | Same as above, but uses local SSD | First time on HPC (re-download each job) |
| `start_es.sh` | Start ES with existing index | Before experiments if ES was stopped or after reboot |
Stop Elasticsearch when your experiments are done:
# Elasticsearch consumes memory even when idle
pkill -f elasticsearch

Note: Run these commands from the `QuCo-RAG` project root directory.
2WikiMultihopQA
Download from Dropbox, unzip, and move to data/2wikimultihopqa. (You can just keep two files: dev.json and id_aliases.json.)
HotpotQA
# Make sure you are in the QuCo-RAG directory
mkdir -p data/hotpotqa
wget -O data/hotpotqa/hotpotqa-dev.json http://curtis.ml.cmu.edu/datasets/hotpot/hotpot_dev_distractor_v1.json

🚨 Make sure Elasticsearch is running before running RAG experiments.
# The script automatically checks if ES is running and starts it if needed
bash start_es.sh

The script will automatically:
- Check if Elasticsearch is already running
- If yes: Display status and exit
- If no: Start Elasticsearch and verify the index
You can also manually check the status:
# Check if Elasticsearch is running
curl -X GET "localhost:9200/"
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Check all indices
curl -X GET "localhost:9200/_cat/indices?v"
# Check wiki index document count
curl -X GET "localhost:9200/wiki/_count?pretty"

Expected output when Elasticsearch is running successfully:
# Cluster health should show "green" status
{
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 4,
"active_shards" : 4,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
# Indices status should show wiki index with ~21M documents
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
green open wiki LqVIlRacS6C2S7CyrIfw7g 1 0 21015324 0 11.3gb 11.3gb
# Wiki index count should be approximately 21 million
{"count":21015324,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0}}

If Elasticsearch is running correctly, you should see cluster status as "green" or "yellow", and the wiki index should contain approximately 21 million documents.
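If you want to script these checks instead of eyeballing the curl output, the two JSON responses can be parsed with a small helper. This is our own illustrative sketch, not part of the repo:

```python
import json

def es_ready(health_json: str, count_json: str,
             expected_docs: int = 21_000_000, tolerance: float = 0.01) -> bool:
    """Return True if the cluster is usable ('green' or 'yellow') and the
    wiki index holds roughly the expected ~21M passages."""
    health = json.loads(health_json)
    count = json.loads(count_json)
    status_ok = health.get("status") in ("green", "yellow")
    docs_ok = abs(count.get("count", 0) - expected_docs) <= expected_docs * tolerance
    return status_ok and docs_ok

# Feed it the JSON bodies returned by the two curl commands above:
health = '{"cluster_name":"elasticsearch","status":"green"}'
count = '{"count":21015324,"_shards":{"total":1,"successful":1}}'
print(es_ready(health, count))  # True
```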
All the configuration files of our experiments are in the config folder. You can run QuCo-RAG with them directly.
For local models:
cd src
# Standard configuration
python main_quco.py -c ../config/OLMo-2-1124-7B-Instruct/2WikiMultihopQA/QuCo-RAG.json
# 🚀 With pre-computed cache (recommended if you've downloaded cache files)
python main_quco.py -c ../config/OLMo-2-1124-7B-Instruct/2WikiMultihopQA/QuCo-RAG-cache.json

If you don't have model weights locally, the above command will download them from Hugging Face Hub first.
For API models (e.g., GPT-4.1/GPT-5-chat):
# First, set your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"
# Then run with API model configuration
cd src
python main_quco.py -c ../config/API-gpt-4.1/2WikiMultihopQA/QuCo-RAG.json

If you see log messages like the ones below, QuCo-RAG is running successfully:
2025-12-24 00:01:55 - __main__ - INFO - Namespace(model_name_or_path='allenai/OLMo-2-1124-7B-Instruct', method='QuCo-RAG', dataset='2wikimultihopqa', ...)
2025-12-24 00:01:55 - __main__ - INFO - ==================== config ********************
2025-12-24 00:01:55 - __main__ - INFO - config path: ../config/OLMo-2-1124-7B-Instruct/2WikiMultihopQA/QuCo-RAG.json
2025-12-24 00:01:55 - __main__ - INFO - ==================== output dir ********************
2025-12-24 00:01:55 - __main__ - INFO - output dir: ../result/QuCo-RAG_OLMo-2-1124-7B-Instruct_2wikimultihopqa/1
2025-12-24 00:01:55 - data - INFO - Loading WikiMultiHopQA from ../data/2wikimultihopqa (split: dev)
2025-12-24 00:01:57 - __main__ - INFO - sample 1000 data points from the beginning
2025-12-24 00:01:57 - __main__ - INFO - data size: 1000
2025-12-24 00:01:57 - generate_quco - INFO - Loading model 'allenai/OLMo-2-1124-7B-Instruct' with device_map: auto
Loading checkpoint shards: 100%|ββββββββββ| 3/3 [00:10<00:00, 3.36s/it]
2025-12-24 00:02:10 - root - INFO - Activating Elasticsearch....
2025-12-24 00:02:10 - root - INFO - Elastic Search Credentials: {'hostname': 'localhost', 'index_name': 'wiki', ...}
2025-12-24 00:02:10 - generate_quco - INFO - Using local entity extraction model: ZhishanQ/QuCo-extractor-0.5B
2025-12-24 00:04:05 - generate_quco - INFO - Local entity extraction model loaded successfully
2025-12-24 00:04:05 - generate_quco - INFO - Using prompt template key: new_prompt_v3
2025-12-24 00:04:05 - __main__ - INFO - model type: <class 'generate_quco.QuCo_RAG'>
2025-12-24 00:04:05 - __main__ - INFO - start inference
0%| | 0/1000 [00:00<?, ?it/s]
0%| | 1/1000 [00:24<6:54:28, 24.89s/it]
The output will be saved in the result/ folder. You can change the output folder by modifying the output_dir parameter in the configuration file.
We also provide implementations of baseline methods for comparison. You can run them using the corresponding configuration files:
For local models:
cd src
# Single Retrieval RAG (SR-RAG)
python main_baseline.py -c ../config/OLMo-2-1124-7B-Instruct/2WikiMultihopQA/SR-RAG.json
# Fix-Length RAG (FL-RAG)
python main_baseline.py -c ../config/OLMo-2-1124-7B-Instruct/2WikiMultihopQA/FL-RAG.json
# FLARE
python main_baseline.py -c ../config/OLMo-2-1124-7B-Instruct/2WikiMultihopQA/FLARE.json
# DRAGIN
python main_baseline.py -c ../config/OLMo-2-1124-7B-Instruct/2WikiMultihopQA/DRAGIN.json
# Without RAG (wo-RAG)
python main_baseline.py -c ../config/OLMo-2-1124-7B-Instruct/2WikiMultihopQA/wo-RAG.json

For API models (e.g., GPT-4.1):
# First, set your OpenAI API key
export OPENAI_API_KEY="your-api-key-here"
cd src
# Single Retrieval RAG (SR-RAG)
python main_baseline.py -c ../config/API-gpt-4.1/2WikiMultihopQA/SR-RAG.json

When the program finishes, you will find a folder named with a numerical identifier inside your output directory. The identifier reflects the sequential order of runs, making it easy to organize multiple executions.
To evaluate the results, you can use the evaluate.py script in the src folder. Assume the output folder is result/QuCo-RAG_OLMo-2-1124-7B-Instruct_2wikimultihopqa/1, you can run:
python evaluate.py --dir ../result/QuCo-RAG_OLMo-2-1124-7B-Instruct_2wikimultihopqa/1

After the evaluation, you will see the evaluation results in the program output and in the output directory:
result/QuCo-RAG_OLMo-2-1124-7B-Instruct_2wikimultihopqa/1/
├── config.json          # Configuration used for this run
├── output.txt           # Raw predictions with statistics
├── result.tsv           # EM and F1 scores
├── details.txt          # Per-sample evaluation details
└── retrieved_docs.json  # Retrieved documents (useful for debugging)
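For reference, EM and token-level F1 on QA benchmarks are conventionally computed as below. This is a generic sketch of the standard SQuAD-style metric, which evaluate.py may implement differently in detail (e.g., alias handling for 2WikiMultihopQA):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Standard QA answer normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> int:
    """EM: 1 if the normalized prediction equals the normalized gold answer."""
    return int(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))  # 1
print(round(f1_score("Paris, France", "Paris"), 2))     # 0.67
```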
We provide configuration files for the following models and datasets:
Models:
Local Models:
- `OLMo-2-1124-7B-Instruct`
- `OLMo-2-1124-13B-Instruct`
- `OLMo-2-0325-32B-Instruct`
- `Meta-Llama-3-8B-Instruct`
- `Qwen2.5-7B-Instruct`
- `Qwen2.5-32B-Instruct`
API Models:
- `gpt-4.1`
- `gpt-4o`
- `gpt-5-chat-latest`
Datasets:
- 2WikiMultihopQA
- HotpotQA
All configurations are in the config/ folder.
We also provide configuration files for OpenAI GPT models. To use these models, you need to set up your OpenAI API key:
# Set the environment variable (required before running API models)
export OPENAI_API_KEY="your-api-key-here"

Available methods for API models:

- `QuCo-RAG.json` - QuCo-RAG method
- `wo-RAG.json` - Without retrieval baseline
- `SR-RAG.json` - Single retrieval baseline
- `FS-RAG.json` - Fix-sentence retrieval baseline
- `FL-RAG.json` - Fix-length retrieval baseline
- `Web-Tool.json` - Web search tool baseline (uses OpenAI's web search capability)
For gpt-4.1/gpt-4o models, the OpenAI API provides the log-probability of generated tokens, which can be used by the FLARE method. You can use FLARE's official implementation from FLARE.
Permanent setup (optional):
# Add to your shell configuration file (~/.bashrc or ~/.zshrc)
echo 'export OPENAI_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc

Note: API models do not require local GPU resources, but API calls will incur costs based on OpenAI's pricing. For GPT models, we use Llama 2's tokenizer for token counting.
The following table describes all parameters available in QuCo-RAG configuration files:
| Parameter | Type | Description | Example Values |
|---|---|---|---|
| `model_name_or_path` | string | Hugging Face model ID or local path to the LLM | `"allenai/OLMo-2-1124-7B-Instruct"` |
| `method` | string | RAG method identifier | `"QuCo-RAG"`, `"flare"`, etc. |
| `dataset` | string | Dataset name | `"2wikimultihopqa"`, `"hotpotqa"` |
| `data_path` | string | Path to dataset directory | `"../data/2wikimultihopqa"` |
| `fewshot` | int | Number of few-shot examples in prompt | `6`, `8` |
| `sample` | int | Number of samples to evaluate (`-1` for all) | `1000`, `-1` |
| `shuffle` | bool | Whether to shuffle the dataset | `true`, `false` |
| `generate_max_length` | int | Maximum generation length in tokens | `128`, etc. |
| `query_formulation` | string | Query formulation strategy | `"direct"` |
| `output_dir` | string | Directory to save results | `"../result/QuCo-RAG_OLMo-2-1124-7B-Instruct_2wikimultihopqa"` |
| `retriever` | string | Retriever type | `"BM25"`, `"SGPT"`, `"Qwen3"` |
| `es_index_name` | string | Elasticsearch index name | `"wiki"` |
| `retrieve_topk` | int | Number of documents to retrieve per query | `3` |
| `use_counter` | bool | Whether to use token counter | `true`, `false` |
| `debug` | bool | Enable debug logging | `true`, `false` |
| `enable_time_stats` | bool | Enable detailed timing statistics | `true`, `false` |
| `enable_cache` | bool | Enable caching to accelerate experiments | `true`, `false` |
| `gpt_model` | string | Entity extraction model path | `"ZhishanQ/QuCo-extractor-0.5B"` |
| `infini_gram_index_name` | string | Infini-gram corpus index name | `"v4_olmo-2-0325-32b-instruct_llama"` |
| `ngram_threshold_question` | int | Frequency threshold for question entities | `1000`, `1000000`, etc. |
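Putting the parameters together, an illustrative configuration might look like the following. The values are example values taken from the table, not the exact settings of any shipped config file; refer to the files in config/ for the configurations used in the paper.

```json
{
  "model_name_or_path": "allenai/OLMo-2-1124-7B-Instruct",
  "method": "QuCo-RAG",
  "dataset": "2wikimultihopqa",
  "data_path": "../data/2wikimultihopqa",
  "fewshot": 6,
  "sample": 1000,
  "shuffle": false,
  "generate_max_length": 128,
  "query_formulation": "direct",
  "output_dir": "../result/QuCo-RAG_OLMo-2-1124-7B-Instruct_2wikimultihopqa",
  "retriever": "BM25",
  "es_index_name": "wiki",
  "retrieve_topk": 3,
  "use_counter": true,
  "debug": false,
  "enable_time_stats": false,
  "enable_cache": true,
  "gpt_model": "ZhishanQ/QuCo-extractor-0.5B",
  "infini_gram_index_name": "v4_olmo-2-0325-32b-instruct_llama",
  "ngram_threshold_question": 1000
}
```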
Important Tips:

- Enable cache: Set `"enable_cache": true` to significantly speed up repeated experiments on the same dataset
- Full evaluation: Use `"sample": -1` to evaluate on the complete dataset
- Debug mode: Set `"debug": true` for detailed logging during development
- Entity extraction options: By default, we use 🤗 ZhishanQ/QuCo-extractor-0.5B for entity extraction, which is distilled from `gpt-4o-mini` and handles most domains and datasets well. For reproducibility, this default model is sufficient. However, if you want to explore optimal performance, our code also supports using the `gpt-4o-mini` API directly for entity extraction by setting `"gpt_model": "gpt-4o-mini"` in the config (requires `OPENAI_API_KEY`). Feel free to experiment with different entity extraction models!
- GPU Requirements: Our experiments can be conducted on NVIDIA GPUs such as A40/A100/H100/H200. Make sure you have sufficient GPU memory (at least ~36GB for 7B models).

- Elasticsearch Management: Always ensure Elasticsearch is running before starting experiments. Use the commands in the "Verify Elasticsearch" section to verify. ES consumes memory even when idle, so stop it when not in use.

  To stop Elasticsearch when you're done:

  # Find Elasticsearch process
  ps aux | grep elasticsearch

  # Kill the process (replace <PID> with the actual process ID)
  kill <PID>

  # Or force kill if needed
  kill -9 <PID>

  # Alternative: kill all Elasticsearch processes at once
  pkill -f elasticsearch

- Reproducibility: To reproduce our reported results, please use the exact configuration files provided in the `config` folder without modifications. Make sure to use our knowledge triple extractor from 🤗 ZhishanQ/QuCo-extractor-0.5B.
We provide scripts for encoding the corpus with Qwen3-Embedding-0.6B, using vLLM for faster encoding:
# Install vLLM if not already installed
pip install "vllm>=0.8.5"
# Encode corpus
cd tools
bash encode_qwen3_vllm.sh

The scripts are located in tools/:

- `encode_qwen3_vllm.sh` - Shell script to run encoding
- `encode_qwen3_vllm.py` - Python script for vLLM-based encoding
Note: This is optional. The default BM25 retriever works well for most cases.
We strongly recommend enabling cache to significantly accelerate experiments.
By setting "enable_cache": true in your configuration file, the system will save entity extraction results, Infini-gram queries, and retrieval results to local cache files. After the first run, subsequent experiments with the same queries will directly read from cache, dramatically reducing experiment time.
💡 The more experiments you run, the better the speedup! Cache files accumulate over time, so repeated experiments on the same dataset will become increasingly faster.
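Conceptually, the cache behaves like a persistent memo table keyed by query string: on a miss the expensive call (entity extraction, Infini-gram, or retrieval) runs and the result is written to disk; on a hit the stored result is returned immediately. A minimal illustration of this pattern (our own sketch, not the repo's actual cache code):

```python
import json
import os
import tempfile
from pathlib import Path

class QueryCache:
    """Tiny JSON-file cache: look up a result by query string, compute and
    store it on a miss, and persist everything to disk."""

    def __init__(self, path: str):
        self.path = Path(path)
        self.data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get_or_compute(self, query: str, compute):
        if query not in self.data:          # cache miss -> do the expensive call
            self.data[query] = compute(query)
            self.path.write_text(json.dumps(self.data))
        return self.data[query]             # cache hit -> instant answer

# Usage: the second lookup hits the cache, so the expensive call runs once.
calls = []
def fake_infini_gram(q):
    calls.append(q)                         # stand-in for a slow API query
    return {"count": 120}

cache_path = os.path.join(tempfile.mkdtemp(), "infini_gram.json")
cache = QueryCache(cache_path)
cache.get_or_compute("Kazimierz Zorawski", fake_infini_gram)
cache.get_or_compute("Kazimierz Zorawski", fake_infini_gram)
print(len(calls))  # 1
```

Because the file persists, a second `QueryCache` pointed at the same path starts warm, which is exactly why repeated experiments keep getting faster.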
We provide pre-computed cache files for 2WikiMultihopQA and HotpotQA datasets on Google Drive (~77MB compressed):
📦 Download Cache Files
The cache includes:
- `2wikimultihopqa_infini_gram.json` - Infini-gram query results for 2WikiMultihopQA
- `2wikimultihopqa_wiki_retrieval.json` - Wikipedia retrieval results for 2WikiMultihopQA
- `hotpotqa_infini_gram.json` - Infini-gram query results for HotpotQA
- `hotpotqa_wiki_retrieval.json` - Wikipedia retrieval results for HotpotQA
Note: These files contain only Infini-gram and retrieval cache. Entity extraction cache will be automatically created during your first run and reused in subsequent experiments.
# Make sure you are in the QuCo-RAG directory first
cd QuCo-RAG # Skip this if you're already in the project root
# Download and extract cache files
cd data
mkdir -p cache
cd cache
# Download quco_cache.tar.gz from Google Drive, then extract:
tar -xzf quco_cache.tar.gz
# Verify files
ls -lh
# Should see: 2wikimultihopqa_infini_gram.json, 2wikimultihopqa_wiki_retrieval.json,
# hotpotqa_infini_gram.json, hotpotqa_wiki_retrieval.json
# Return to project root
cd ../..

Then enable cache in your configuration file:
{
"enable_cache": true,
...
}

Note: Cache files are stored in `data/cache/` and are dataset-specific. The system will automatically create new cache files for other datasets or queries not included in the pre-computed cache.
If you find this work useful, please cite our paper.
@article{min2025quco,
title={QuCo-RAG: Quantifying Uncertainty from the Pre-training Corpus for Dynamic Retrieval-Augmented Generation},
author={Min, Dehai and Zhang, Kailin and Wu, Tongtong and Cheng, Lu},
journal={arXiv preprint arXiv:2512.19134},
year={2025}
}

We thank the authors of the following projects for their excellent work: