feat: Add REFRAG preprocessing adapter for next-gen RAG context compression #162
AnasAmchaar wants to merge 3 commits into tavily-ai:master
Conversation
This PR introduces the `TavilyRefragClient`, `RefragChunk`, and `RefragContext` classes to the `tavily` module, adding support for REFRAG preprocessing operations.
Updated the chunk size assignment to use a conditional expression, ensuring that the default chunk size is used only when chunk_size is explicitly set to None.
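The distinction matters because a truthiness check (`chunk_size or DEFAULT`) would also override legitimate falsy values, while an explicit `None` test substitutes the default only when the caller omits the argument. A minimal sketch of the pattern (names and default value hypothetical, not taken from the PR):

```python
DEFAULT_CHUNK_SIZE = 16  # hypothetical default; k=16 as in the paper

def resolve_chunk_size(chunk_size=None):
    """Fall back to the default only when chunk_size is explicitly None."""
    return DEFAULT_CHUNK_SIZE if chunk_size is None else chunk_size

print(resolve_chunk_size())    # -> 16, default applied
print(resolve_chunk_size(8))   # -> 8, caller's value wins
print(resolve_chunk_size(0))   # -> 0, falsy value preserved (unlike `or`)
```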
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
if encode and self.encoder_function is not None:
    self.encode_chunks(chunks)
if apply_policy:
    self.apply_expansion_policy(chunks, query=query)
```
Silent skip of encoding contradicts docstring and encode_chunks
Medium Severity
When encode=True (the default) but encoder_function is None, prepare_context silently skips both encoding and the expansion policy. This contradicts the docstring which states encode "Requires encoder_function to be set" and is inconsistent with encode_chunks, which raises a RuntimeError when called without an encoder. A user who explicitly passes encode=True signals intent to encode — silently producing a RefragContext with no embeddings and no expansion flags creates a subtle failure mode where downstream REFRAG decoders receive incomplete data without any error being surfaced.
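One way to surface the failure, consistent with the `encode_chunks` behavior described above, is to fail fast in `prepare_context` when `encode=True` but no encoder is configured. The class below is an illustrative stand-in, not the PR's actual implementation:

```python
class RefragClientSketch:
    """Illustrative stand-in for TavilyRefragClient's encoding guard."""

    def __init__(self, encoder_function=None):
        self.encoder_function = encoder_function

    def prepare_context(self, chunks, encode=True):
        if encode:
            if self.encoder_function is None:
                # Fail fast instead of silently returning unencoded chunks,
                # mirroring the RuntimeError raised by encode_chunks().
                raise RuntimeError("encode=True requires encoder_function to be set")
            chunks = [self.encoder_function(c) for c in chunks]
        return chunks

client = RefragClientSketch()  # no encoder configured
try:
    client.prepare_context(["chunk a", "chunk b"])
except RuntimeError as e:
    print(e)  # error surfaced instead of a silently incomplete context
```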


Motivation
REFRAG (REpresentation For RAG) is a recent research framework from Meta Superintelligence Labs that fundamentally rethinks how RAG contexts are fed to decoder LLMs. Instead of passing all retrieved passage tokens directly into the decoder, REFRAG:
- splits retrieved passages into fixed-size k-token chunks,
- encodes each chunk into a single compressed embedding, and
- uses a lightweight RL policy to decide which chunks to expand back into full tokens.
The reported results are striking: a 30.85x time-to-first-token (TTFT) acceleration and a 3.75x improvement over the prior SOTA (CEPE) with no loss in perplexity. REFRAG also extends the effective context window by 16x, letting LLMs consume far more retrieved passages within the same latency budget.
As RAG becomes the dominant pattern for grounding LLMs in real-world knowledge, REFRAG represents the next evolution in how retrieval results are consumed at inference time. This PR positions tavily-python as REFRAG-ready by providing the retrieval-to-decoder preprocessing pipeline.
What this PR adds
A new `TavilyRefragClient` that bridges Tavily search results and REFRAG-compatible decoders. It follows the same architectural pattern as `TavilyHybridClient` -- wrapping an internal `TavilyClient` and accepting pluggable functions -- so users bring their own trained REFRAG models.
New files
Modified files
Pipeline architecture
```
User Query
  --> TavilyClient.search()        # Retrieve passages from the web
  --> chunk_passages()             # Split into k-token chunks (k=8,16,32)
  --> encode_chunks()              # Pluggable encoder -> chunk embeddings
  --> apply_expansion_policy()     # Pluggable RL policy -> expand/compress decisions
  --> RefragContext                # Ready for external REFRAG decoder
```
Each step is callable independently or combined via `prepare_context()`.
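As a rough sketch of the data carried between steps, the shapes below illustrate the two model classes; field names beyond the `compressed_chunks`/`expanded_chunks` mentioned in this PR are assumptions, not the actual definitions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RefragChunkSketch:
    """Illustrative shape of one k-token chunk (hypothetical fields)."""
    text: str
    tokens: List[int] = field(default_factory=list)
    embedding: Optional[List[float]] = None  # filled in by encode_chunks()
    expand: bool = False                     # set by apply_expansion_policy()

@dataclass
class RefragContextSketch:
    """Illustrative container handed to an external REFRAG decoder."""
    compressed_chunks: List[RefragChunkSketch] = field(default_factory=list)
    expanded_chunks: List[RefragChunkSketch] = field(default_factory=list)

ctx = RefragContextSketch()
ctx.compressed_chunks.append(RefragChunkSketch(text="a compressed chunk"))
print(len(ctx.compressed_chunks), len(ctx.expanded_chunks))  # 1 0
```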
Pluggable interfaces (no heavy dependencies)
This means zero new dependencies -- the core only uses `tiktoken`, which is already in `setup.py`. Users plug in their own encoder (RoBERTa, etc.) and RL policy when they have trained REFRAG models.
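For illustration, the pluggable callables can be as simple as plain functions: an encoder mapping chunk texts to one vector per chunk, and a policy mapping a query plus chunks to per-chunk expand decisions. The toy implementations and exact signatures below are assumptions for sketching, not the interfaces defined in this PR:

```python
import hashlib

def toy_encoder(chunk_texts):
    """Stand-in encoder: one fixed-size pseudo-embedding per chunk.
    A real setup would use a trained RoBERTa-style encoder instead."""
    vectors = []
    for text in chunk_texts:
        digest = hashlib.sha256(text.encode()).digest()
        vectors.append([b / 255.0 for b in digest[:8]])  # 8-dim vector
    return vectors

def toy_expansion_policy(query, chunk_texts):
    """Stand-in RL policy: expand a chunk iff it shares a word with the query."""
    query_words = set(query.lower().split())
    return [bool(query_words & set(t.lower().split())) for t in chunk_texts]

chunks = ["retrieval augmented generation", "weather in Paris"]
embeddings = toy_encoder(chunks)
decisions = toy_expansion_policy("latest retrieval advances", chunks)
print(decisions)  # [True, False]
```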
Usage
\\python
from tavily import TavilyRefragClient
client = TavilyRefragClient(
api_key='tvly-...',
chunk_size=16, # k=16 as in the paper
encoder_function=my_roberta_encoder,
expansion_policy=my_rl_policy,
)
Full pipeline in one call
ctx = client.prepare_context('What are the latest advances in RAG?', max_results=10)
print(len(ctx.compressed_chunks)) # Feed as embeddings to decoder
print(len(ctx.expanded_chunks)) # Feed as full tokens to decoder
Or step-by-step for more control
chunks = client.chunk_passages(tavily_results, chunk_size=8)
chunks = client.encode_chunks(chunks)
chunks = client.apply_expansion_policy(chunks, query='...')
\\
Design decisions
Test plan
Tests use the existing `request_intercept.py` framework -- no real API calls.
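A minimal sketch of the no-network testing approach, stubbing the search call directly rather than using the repo's actual intercept helpers (all names below are hypothetical, and the word-based chunker stands in for the real token-based one):

```python
class FakeTavilyClient:
    """Stub that returns canned results instead of calling the API."""
    def search(self, query, **kwargs):
        return {"results": [{"content": "REFRAG compresses chunk embeddings."}]}

def chunk_passages(results, chunk_size=4):
    """Toy word-based chunker standing in for the token-based one."""
    chunks = []
    for r in results["results"]:
        words = r["content"].split()
        for i in range(0, len(words), chunk_size):
            chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks

results = FakeTavilyClient().search("what is REFRAG?")
chunks = chunk_passages(results, chunk_size=3)
print(chunks)  # ['REFRAG compresses chunk', 'embeddings.']
```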
References
Note
Low Risk
Additive feature introducing a new `TavilyRefragClient` wrapper and data models without modifying existing request/response behavior; the primary risk is the new public API surface and the tokenization/encoding assumptions in the new adapter.
Overview
Adds a new REFRAG preprocessing flow via `TavilyRefragClient`, including passage tokenization/chunking, optional embedding via a user-supplied `encoder_function`, and optional chunk selection via a pluggable `expansion_policy`, returning a `RefragContext` with `compressed_chunks`/`expanded_chunks`.
Exposes the new client and `RefragChunk`/`RefragContext` from `tavily.__init__`, and adds an end-to-end example (`examples/refrag.py`) plus a new `tests/test_refrag.py` covering constructor validation, chunking/encoding/policy behavior, and `prepare_context()` integration with forwarded search kwargs.
Written by Cursor Bugbot for commit 41b4b9a. This will update automatically on new commits.