Skip to content

530: feat: seekable reader#68

Open
martin-augment wants to merge 3 commits intomainfrom
pr-530-2026-04-06-05-32-20
Open

530: feat: seekable reader#68
martin-augment wants to merge 3 commits intomainfrom
pr-530-2026-04-06-05-32-20

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

530: To review by AI

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 6, 2026

Walkthrough

This pull request introduces block-seeking functionality to the Avro reader and updates AI agent configuration rules. The Avro changes add a BlockPosition type to expose block metadata (offset and message count), implement a PositionTracker wrapper to monitor reader position, and provide seek_to_block() methods for random access to data blocks in both the low-level Block and high-level Reader APIs. Additionally, current_block() and data_start() methods expose block introspection. The configuration files (.cursor/rules.md, AGENTS.md, CLAUDE.md) are updated to enforce rules preventing AI agents from linking to issues/pull requests in reviews and reviewing other AI agent configuration files.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pr-530-2026-04-06-05-32-20

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements block-level seeking for the Avro reader by introducing a PositionTracker to monitor stream offsets and a BlockPosition struct to store block metadata. It adds current_block and seek_to_block methods to the Reader API, along with corresponding error handling and unit tests. Review feedback suggests optimizing the PositionTracker by overriding more Read trait methods and addressing a potential logic error where seeking to a valid but empty Avro block might incorrectly trigger an error.

Comment on lines +70 to +76
impl<R: Read> Read for PositionTracker<R> {
fn read(&mut self, buf: &mut [u8]) -> std::io::Result<usize> {
let n = self.inner.read(buf)?;
self.pos += n as u64;
Ok(n)
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The PositionTracker implementation for Read only overrides the base read method. While this is correct for tracking position, it might be less efficient if the underlying reader R has specialized implementations for read_exact, read_to_end, or read_vectored. Consider overriding these methods to delegate to inner and update pos accordingly.

Comment on lines +377 to +383
if self.is_empty() {
return Err(Details::SeekToBlock(std::io::Error::new(
std::io::ErrorKind::UnexpectedEof,
format!("no block at offset {offset}"),
))
.into());
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The check if self.is_empty() after read_block_next() is used to detect if the seek landed at the end of the stream. However, if an Avro file contains an empty block (which is technically possible though rare), this would incorrectly return an error. Since read_block_next already validates the sync marker, a successful call that results in an empty block should probably not be treated as an error unless the intention is to only support seeking to non-empty blocks.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 725876f. Configure here.

self.block.seek_to_block(offset)?;
self.errored = false;
Ok(())
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Failed seek leaves reader in inconsistent state

Medium Severity

When seek_to_block fails partway through read_block_next (e.g., after the object-count varint is read but fill_buf or marker validation fails), message_count is left non-zero while buf contains invalid/undecompressed data. Because Reader::seek_to_block only reaches self.errored = false on the success path (after ?), the errored flag stays false. A subsequent next() call sees is_empty() as false, skips loading a new block, and tries to decode from the corrupted buffer — potentially yielding garbage or a confusing error instead of cleanly stopping iteration.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 725876f. Configure here.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@avro/src/reader/block.rs`:
- Around line 61-63: The Reader::new currently sets pos = 0 which mismatches
readers that are already seeked; update Reader::new to detect if the provided
inner implements Seek and, if so, query the current stream position (e.g. seek
to SeekFrom::Current(0)) and initialize pos with that value so
data_start()/current_block().offset and seek_to_block() use the same coordinate
base; alternatively, if you prefer to disallow non-zero starts, explicitly check
the current position and return an error when it is not zero. Ensure references
to pos, data_start(), current_block().offset, and seek_to_block() all operate
against that captured base.

In `@avro/src/reader/mod.rs`:
- Around line 195-198: seek_to_block currently clears self.errored
unconditionally; change it so that if self.block.seek_to_block(offset) returns
an Err the reader is poisoned by setting self.errored = true before returning
the error, and only clear self.errored (set to false) after a successful seek;
reference the seek_to_block method and Block::seek_to_block and ensure the error
path sets self.errored = true and the Ok path keeps the existing self.errored =
false behavior.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4c34078f-196a-4995-a4ec-a6a9fd7cb1ed

📥 Commits

Reviewing files that changed from the base of the PR and between 97b0c80 and 725876f.

📒 Files selected for processing (7)
  • .cursor/rules.md
  • AGENTS.md
  • CLAUDE.md
  • avro/src/error.rs
  • avro/src/lib.rs
  • avro/src/reader/block.rs
  • avro/src/reader/mod.rs

Comment on lines +61 to +63
fn new(inner: R) -> Self {
Self { inner, pos: 0 }
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Track offsets from the reader’s initial seek position, not hard-coded zero.

pos starts at 0 even if the caller hands Reader a Read + Seek that is already positioned inside a larger stream. data_start() / current_block().offset are then recorded relative to that starting point, but seek_to_block() later feeds them to SeekFrom::Start, so seeking back to a saved block offset jumps to the wrong byte. Please capture the initial stream position for seekable readers (or explicitly reject non-zero starts) so the reported offsets and the seek API use the same coordinate system.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/reader/block.rs` around lines 61 - 63, The Reader::new currently
sets pos = 0 which mismatches readers that are already seeked; update
Reader::new to detect if the provided inner implements Seek and, if so, query
the current stream position (e.g. seek to SeekFrom::Current(0)) and initialize
pos with that value so data_start()/current_block().offset and seek_to_block()
use the same coordinate base; alternatively, if you prefer to disallow non-zero
starts, explicitly check the current position and return an error when it is not
zero. Ensure references to pos, data_start(), current_block().offset, and
seek_to_block() all operate against that captured base.

Comment on lines +195 to +198
pub fn seek_to_block(&mut self, offset: u64) -> AvroResult<()> {
self.block.seek_to_block(offset)?;
self.errored = false;
Ok(())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Mark the iterator errored when seek_to_block fails.

On the error path, Block::seek_to_block has already probed the underlying stream, so leaving errored == false lets the next next() continue from that failed probe state and return unrelated data/errors. Poison the reader on failure and only clear it after a successful seek.

💡 Suggested change
     pub fn seek_to_block(&mut self, offset: u64) -> AvroResult<()> {
-        self.block.seek_to_block(offset)?;
-        self.errored = false;
-        Ok(())
+        match self.block.seek_to_block(offset) {
+            Ok(()) => {
+                self.errored = false;
+                Ok(())
+            }
+            Err(err) => {
+                self.errored = true;
+                Err(err)
+            }
+        }
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub fn seek_to_block(&mut self, offset: u64) -> AvroResult<()> {
self.block.seek_to_block(offset)?;
self.errored = false;
Ok(())
pub fn seek_to_block(&mut self, offset: u64) -> AvroResult<()> {
match self.block.seek_to_block(offset) {
Ok(()) => {
self.errored = false;
Ok(())
}
Err(err) => {
self.errored = true;
Err(err)
}
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/reader/mod.rs` around lines 195 - 198, seek_to_block currently
clears self.errored unconditionally; change it so that if
self.block.seek_to_block(offset) returns an Err the reader is poisoned by
setting self.errored = true before returning the error, and only clear
self.errored (set to false) after a successful seek; reference the seek_to_block
method and Block::seek_to_block and ensure the error path sets self.errored =
true and the Ok path keeps the existing self.errored = false behavior.

@claude
Copy link
Copy Markdown

claude bot commented Apr 6, 2026

Code Review

Overall this is a clean, well-structured implementation of seekable block reading. The PositionTracker wrapper is a smart design choice — it avoids requiring Seek on the common forward-read path while still enabling seek-based APIs when the bound is present. The errored flag reset in Reader::seek_to_block is correctly handled. A few items worth addressing:


Potential Bug: Empty-block false negative

In avro/src/reader/block.rs, seek_to_block uses is_empty() to detect EOF:

self.read_block_next()?;
if self.is_empty() {
    return Err(Details::SeekToBlock(...).into());
}

If a valid block exists at offset but was written with block_len = 0 (which the Avro spec permits), read_block_next sets message_count = 0 and returns Ok(()), causing this check to incorrectly treat a real block as an EOF error. Consider checking whether current_block_info was set by read_block_next instead:

self.read_block_next()?;
if self.current_block_info.is_none() {
    return Err(Details::SeekToBlock(...).into());
}

Minor: Error variant reused for two different failure modes

In avro/src/error.rs, Details::SeekToBlock is used both for the initial IO seek error and for the "no block found at offset" case. These are logically distinct — one is an OS-level IO error, the other is a format-level error. Callers cannot distinguish them without inspecting the wrapped message string. Consider adding a second variant (e.g., NoBlockAtOffset(u64)) for the EOF/empty case.


Minor: PositionTracker assumes stream starts at offset 0

PositionTracker::new initialises pos to 0 unconditionally. If the reader passed to Block::new has already been advanced (e.g., a BufReader positioned partway into a file), data_start and all subsequent block offsets will be wrong. The assumption is safe for current callers, but worth documenting, or calling stream_position() at construction time when R: Seek.


Missing test coverage

  1. Compressed codecs — no test covers seeking within Deflate or Snappy compressed files. The decompression path in seek_to_block is exercised but not validated under compression.
  2. Forward seej — all existing tests verify backward seeks only. Seeking forward past the current read position is equally valid and worth covering.
  3. Error type assertiontest_seek_to_invalid_offset only checks is_err(). Asserting the specific error variant (e.g., GetBlockMarker) would make the test more meaningful.

Nit: Redundant cast in read_block_next

self.message_count = block_len as usize;
// ...
self.current_block_info = Some(BlockPosition {
    offset: block_start,
    message_count: block_len as usize,   // could use self.message_count
});

Using self.message_count for the second line avoids re-casting and makes it clear the two values are identical.


Nit: Intra-doc link in BlockPosition may not resolve

The doc comment in block.rs references [super::Reader::seek_to_block]. Since seek_to_block is only defined on impl<R: Read + Seek> Reader, rustdoc may fail to resolve the link. Worth running cargo doc --no-deps 2>&1 | grep warn to verify.

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Apr 6, 2026

🤖 Augment PR Summary

Summary: This PR adds support for seeking back to previously read Avro container blocks when the underlying input implements Read + Seek.

Changes:

  • Introduced BlockPosition to capture a block’s byte offset and total record count during iteration.
  • Added a PositionTracker wrapper to track consumed bytes without requiring Seek just to compute offsets.
  • Extended Reader with current_block() and data_start() accessors for offset introspection.
  • Added Reader::seek_to_block (and internal Block::seek_to_block) to jump to a block by offset and validate the sync marker.
  • Added a new error variant (Details::SeekToBlock) for seek/offset-related failures.
  • Included tests that write multi-block files and verify seeking between blocks and failure on invalid offsets.

Technical Notes: Seeking reloads and decompresses the target block so that subsequent Iterator::next yields from the start of that block.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.


impl<R> PositionTracker<R> {
fn new(inner: R) -> Self {
Self { inner, pos: 0 }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avro/src/reader/block.rs:62PositionTracker always initializes pos to 0, so if a Read + Seek input is already positioned at a non-zero offset when Reader::new is called, the offsets reported by current_block()/data_start() won’t match what seek_to_block() expects with SeekFrom::Start. Consider documenting the “must start at stream position 0” precondition or ensuring tracked offsets are absolute for seekable inputs.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.


self.current_block_info = Some(BlockPosition {
offset: block_start,
message_count: block_len as usize,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

avro/src/reader/block.rs:216block_len is an i64 but is cast to usize for BlockPosition.message_count; if a corrupt/malicious file yields a negative block_len, this will wrap to a huge count and expose a nonsensical value to callers. Consider validating block_len >= 0 before casting (same concern applies to self.message_count = block_len as usize).

Severity: medium

Other Locations
  • avro/src/reader/block.rs:202

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants