Conversation
WalkthroughThis pull request introduces block-seeking functionality to the Avro reader and updates AI agent configuration rules. The Avro changes add a ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request implements block-level seeking for the Avro reader by introducing a PositionTracker to monitor stream offsets and a BlockPosition struct to store block metadata. It adds current_block and seek_to_block methods to the Reader API, along with corresponding error handling and unit tests. Review feedback suggests optimizing the PositionTracker by overriding more Read trait methods and addressing a potential logic error where seeking to a valid but empty Avro block might incorrectly trigger an error.
| impl<R: Read> Read for PositionTracker<R> { | ||
| fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> { | ||
| let n = self.inner.read(buf)?; | ||
| self.pos += n as u64; | ||
| Ok(n) | ||
| } | ||
| } |
There was a problem hiding this comment.
The PositionTracker implementation for Read only overrides the base read method. While this is correct for tracking position, it might be less efficient if the underlying reader R has specialized implementations for read_exact, read_to_end, or read_vectored. Consider overriding these methods to delegate to inner and update pos accordingly.
| if self.is_empty() { | ||
| return Err(Details::SeekToBlock(std::io::Error::new( | ||
| std::io::ErrorKind::UnexpectedEof, | ||
| format!("no block at offset {offset}"), | ||
| )) | ||
| .into()); | ||
| } |
There was a problem hiding this comment.
The check if self.is_empty() after read_block_next() is used to detect if the seek landed at the end of the stream. However, if an Avro file contains an empty block (which is technically possible though rare), this would incorrectly return an error. Since read_block_next already validates the sync marker, a successful call that results in an empty block should probably not be treated as an error unless the intention is to only support seeking to non-empty blocks.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 725876f. Configure here.
| self.block.seek_to_block(offset)?; | ||
| self.errored = false; | ||
| Ok(()) | ||
| } |
There was a problem hiding this comment.
Failed seek leaves reader in inconsistent state
Medium Severity
When seek_to_block fails partway through read_block_next (e.g., after the object-count varint is read but fill_buf or marker validation fails), message_count is left non-zero while buf contains invalid/undecompressed data. Because Reader::seek_to_block only reaches self.errored = false on the success path (after ?), the errored flag stays false. A subsequent next() call sees is_empty() as false, skips loading a new block, and tries to decode from the corrupted buffer — potentially yielding garbage or a confusing error instead of cleanly stopping iteration.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 725876f. Configure here.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@avro/src/reader/block.rs`:
- Around line 61-63: The Reader::new currently sets pos = 0 which mismatches
readers that are already seeked; update Reader::new to detect if the provided
inner implements Seek and, if so, query the current stream position (e.g. seek
to SeekFrom::Current(0)) and initialize pos with that value so
data_start()/current_block().offset and seek_to_block() use the same coordinate
base; alternatively, if you prefer to disallow non-zero starts, explicitly check
the current position and return an error when it is not zero. Ensure references
to pos, data_start(), current_block().offset, and seek_to_block() all operate
against that captured base.
In `@avro/src/reader/mod.rs`:
- Around line 195-198: seek_to_block currently clears self.errored
unconditionally; change it so that if self.block.seek_to_block(offset) returns
an Err the reader is poisoned by setting self.errored = true before returning
the error, and only clear self.errored (set to false) after a successful seek;
reference the seek_to_block method and Block::seek_to_block and ensure the error
path sets self.errored = true and the Ok path keeps the existing self.errored =
false behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4c34078f-196a-4995-a4ec-a6a9fd7cb1ed
📒 Files selected for processing (7)
.cursor/rules.mdAGENTS.mdCLAUDE.mdavro/src/error.rsavro/src/lib.rsavro/src/reader/block.rsavro/src/reader/mod.rs
| fn new(inner: R) -> Self { | ||
| Self { inner, pos: 0 } | ||
| } |
There was a problem hiding this comment.
Track offsets from the reader’s initial seek position, not hard-coded zero.
pos starts at 0 even if the caller hands Reader a Read + Seek that is already positioned inside a larger stream. data_start() / current_block().offset are then recorded relative to that starting point, but seek_to_block() later feeds them to SeekFrom::Start, so seeking back to a saved block offset jumps to the wrong byte. Please capture the initial stream position for seekable readers (or explicitly reject non-zero starts) so the reported offsets and the seek API use the same coordinate system.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@avro/src/reader/block.rs` around lines 61 - 63, The Reader::new currently
sets pos = 0 which mismatches readers that are already seeked; update
Reader::new to detect if the provided inner implements Seek and, if so, query
the current stream position (e.g. seek to SeekFrom::Current(0)) and initialize
pos with that value so data_start()/current_block().offset and seek_to_block()
use the same coordinate base; alternatively, if you prefer to disallow non-zero
starts, explicitly check the current position and return an error when it is not
zero. Ensure references to pos, data_start(), current_block().offset, and
seek_to_block() all operate against that captured base.
| pub fn seek_to_block(&mut self, offset: u64) -> AvroResult<()> { | ||
| self.block.seek_to_block(offset)?; | ||
| self.errored = false; | ||
| Ok(()) |
There was a problem hiding this comment.
Mark the iterator errored when seek_to_block fails.
On the error path, Block::seek_to_block has already probed the underlying stream, so leaving errored == false lets the next next() continue from that failed probe state and return unrelated data/errors. Poison the reader on failure and only clear it after a successful seek.
💡 Suggested change
pub fn seek_to_block(&mut self, offset: u64) -> AvroResult<()> {
- self.block.seek_to_block(offset)?;
- self.errored = false;
- Ok(())
+ match self.block.seek_to_block(offset) {
+ Ok(()) => {
+ self.errored = false;
+ Ok(())
+ }
+ Err(err) => {
+ self.errored = true;
+ Err(err)
+ }
+ }
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| pub fn seek_to_block(&mut self, offset: u64) -> AvroResult<()> { | |
| self.block.seek_to_block(offset)?; | |
| self.errored = false; | |
| Ok(()) | |
| pub fn seek_to_block(&mut self, offset: u64) -> AvroResult<()> { | |
| match self.block.seek_to_block(offset) { | |
| Ok(()) => { | |
| self.errored = false; | |
| Ok(()) | |
| } | |
| Err(err) => { | |
| self.errored = true; | |
| Err(err) | |
| } | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@avro/src/reader/mod.rs` around lines 195 - 198, seek_to_block currently
clears self.errored unconditionally; change it so that if
self.block.seek_to_block(offset) returns an Err the reader is poisoned by
setting self.errored = true before returning the error, and only clear
self.errored (set to false) after a successful seek; reference the seek_to_block
method and Block::seek_to_block and ensure the error path sets self.errored =
true and the Ok path keeps the existing self.errored = false behavior.
Code ReviewOverall this is a clean, well-structured implementation of seekable block reading. The Potential Bug: Empty-block false negativeIn self.read_block_next()?;
if self.is_empty() {
return Err(Details::SeekToBlock(...).into());
}If a valid block exists at self.read_block_next()?;
if self.current_block_info.is_none() {
return Err(Details::SeekToBlock(...).into());
}Minor: Error variant reused for two different failure modesIn Minor:
|
🤖 Augment PR SummarySummary: This PR adds support for seeking back to previously read Avro container blocks when the underlying input implements Changes:
Technical Notes: Seeking reloads and decompresses the target block so that subsequent 🤖 Was this summary useful? React with 👍 or 👎 |
|
|
||
| impl<R> PositionTracker<R> { | ||
| fn new(inner: R) -> Self { | ||
| Self { inner, pos: 0 } |
There was a problem hiding this comment.
avro/src/reader/block.rs:62 — PositionTracker always initializes pos to 0, so if a Read + Seek input is already positioned at a non-zero offset when Reader::new is called, the offsets reported by current_block()/data_start() won’t match what seek_to_block() expects with SeekFrom::Start. Consider documenting the “must start at stream position 0” precondition or ensuring tracked offsets are absolute for seekable inputs.
Severity: medium
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
|
|
||
| self.current_block_info = Some(BlockPosition { | ||
| offset: block_start, | ||
| message_count: block_len as usize, |
There was a problem hiding this comment.
avro/src/reader/block.rs:216 — block_len is an i64 but is cast to usize for BlockPosition.message_count; if a corrupt/malicious file yields a negative block_len, this will wrap to a huge count and expose a nonsensical value to callers. Consider validating block_len >= 0 before casting (same concern applies to self.message_count = block_len as usize).
Severity: medium
Other Locations
avro/src/reader/block.rs:202
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.


530: To review by AI