Paper
- Title: Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
- Authors: Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia
- Link: https://arxiv.org/abs/2604.02091v1
- Source: Arxiv
Summary
RRPO introduces a reinforcement learning framework that directly aligns reranking with downstream LLM generation quality. By formulating reranking as a sequential decision-making process and optimizing for context utility using LLM feedback, RRPO reduces the misalignment between static relevance-based ranking and the actual utility of the retrieved context for answer generation. It adds a reference-anchored deterministic baseline for training stability and outperforms strong baselines, including RankZephyr, on knowledge-intensive benchmarks.
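The paper's exact objective is not reproduced here, but the following minimal sketch illustrates the general idea of a policy-gradient update anchored to a deterministic reference, assuming the baseline is the LLM reward obtained from the policy's own greedy ranking. All names (`policy.sample`, `policy.greedy`, `llm_reward_fn`) are illustrative, not the paper's API.

```python
def rrpo_style_loss(policy, query, docs, llm_reward_fn):
    """Illustrative REINFORCE-style loss with a reference-anchored baseline.

    Assumption: the baseline is the downstream LLM reward of the policy's
    greedy (deterministic) ranking, so the advantage measures how much a
    sampled ranking improves answer quality over that reference.
    """
    # Stochastic rollout: a sampled ordering of documents plus its log-probability
    # (a differentiable tensor if the policy is a PyTorch module).
    sampled_ranking, log_prob = policy.sample(query, docs)
    # Deterministic reference ranking from the same policy (greedy decoding).
    greedy_ranking = policy.greedy(query, docs)

    # Reward = quality of the LLM's answer given the reranked context.
    r_sampled = llm_reward_fn(query, [docs[i] for i in sampled_ranking])
    r_greedy = llm_reward_fn(query, [docs[i] for i in greedy_ranking])

    advantage = r_sampled - r_greedy  # anchored advantage, no learned critic needed
    # Increase the probability of rankings that beat the deterministic reference.
    return -advantage * log_prob
```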
Relevance to AutoRAG-Research
Directly relevant as a retrieval pipeline enhancement. AutoRAG-Research currently lacks a dedicated reranking optimization module — existing retrieval pipelines (bm25, vector_search, hyde, hybrid) retrieve and rank documents based on static relevance signals without considering downstream generation quality. RRPO bridges this gap by training a reranker that optimizes for what actually helps the LLM generate correct answers. It also integrates orthogonally with query expansion modules (like HyDE) and generalizes across different reader LLMs.
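As a rough illustration of where such a reranker would sit in a pipeline, the snippet below chains retrieval, utility-aware reranking, and generation; `retriever.retrieve`, `reranker.rerank`, and `reader_llm.generate` are placeholder interfaces, not AutoRAG-Research's actual API.

```python
from typing import List

def rerank_then_generate(query: str, retriever, reranker, reader_llm, top_k: int = 5) -> str:
    """Hypothetical glue code: retrieve with an existing pipeline (e.g. bm25 or
    vector_search), reorder candidates with an RRPO-style trained reranker, and
    pass only the top-k documents to the reader LLM."""
    candidates: List[str] = retriever.retrieve(query)           # static relevance ranking
    reordered: List[str] = reranker.rerank(query, candidates)   # utility-aware ordering
    context = "\n\n".join(reordered[:top_k])
    return reader_llm.generate(f"Context:\n{context}\n\nQuestion: {query}")
```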
Implementation Notes
- Pipeline type: retrieval (reranking stage)
- Key components:
  - Reranking formulated as sequential decision-making (selecting documents one by one; see the sketch after this list)
  - RL optimization with LLM generation quality as the reward signal
  - Reference-anchored deterministic baseline for stable training
  - Compatible with various reader LLMs (tested with GPT-4o)
  - Orthogonal integration with query expansion (e.g., Query2Doc/HyDE)
- Dependencies: PyTorch, RL training utilities (PPO-style), LangChain LLM integration for reward computation
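To make the sequential decision-making component concrete, here is a minimal PyTorch sketch of a pointer-style selection policy that picks documents one at a time while masking previous choices. The architecture (bilinear scorer, additive state update) is an assumption for illustration only, not the paper's model.

```python
import torch
import torch.nn as nn

class SequentialReranker(nn.Module):
    """Sketch of reranking as sequential decision-making: at each step the
    policy scores the remaining documents and picks one, so a full episode
    produces an ordering that can be rewarded by downstream answer quality."""

    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Bilinear(dim, dim, 1)  # score(query/state embedding, doc embedding)

    def forward(self, query_emb: torch.Tensor, doc_embs: torch.Tensor):
        # query_emb: (dim,), doc_embs: (num_docs, dim)
        num_docs = doc_embs.size(0)
        selected, log_probs = [], []
        mask = torch.zeros(num_docs, dtype=torch.bool)
        state = query_emb
        for _ in range(num_docs):
            scores = self.scorer(state.repeat(num_docs, 1), doc_embs).squeeze(-1)
            scores = scores.masked_fill(mask, float("-inf"))    # hide already-picked docs
            dist = torch.distributions.Categorical(logits=scores)
            idx = dist.sample()
            selected.append(idx.item())
            log_probs.append(dist.log_prob(idx))
            mask[idx] = True
            state = state + doc_embs[idx]                       # simple additive state update
        return selected, torch.stack(log_probs).sum()
```

The episode's total log-probability can then be plugged into a reference-anchored loss like the sketch under Summary, with downstream LLM answer quality as the reward.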
Acceptance Criteria