fix: handle Avro reader schema with no fields #9611

Open
mzabaluev wants to merge 1 commit into apache:main from mzabaluev:avro-empty-reader-schema

@mzabaluev
Contributor

Which issue does this PR close?

Rationale for this change

In the degenerate case when the Avro reader schema has no fields, the RecordDecoder should be able to produce empty record batches with the number of rows counted from the data. As an optimization for OCF, the reader could skip decoding altogether, relying on record counts provided by data blocks.
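
The OCF shortcut mentioned above can be sketched as follows. This is a minimal, standalone illustration and not code from this PR: `read_long` and `count_rows_in_blocks` are hypothetical helpers, and the sketch assumes the input starts right after the OCF file header. Per the Avro specification, each block is a record count (long), the byte size of the serialized and possibly compressed records (long), the records themselves, and a 16-byte sync marker, so the row total can be summed without decoding or decompressing any record.

```rust
/// Length of the sync marker that ends every Avro OCF block.
const SYNC_LEN: usize = 16;

/// Decode one Avro `long` (zigzag varint) from the front of `buf`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input is truncated or the varint is longer than 10 bytes.
fn read_long(buf: &[u8]) -> Option<(i64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            // Undo zigzag encoding: the lowest bit carries the sign.
            let decoded = ((value >> 1) as i64) ^ -((value & 1) as i64);
            return Some((decoded, i + 1));
        }
    }
    None
}

/// Hypothetical helper: sum the record counts of all blocks in `body`
/// (the bytes following the OCF header) without decoding any record.
/// The sync marker is skipped, not verified, to keep the sketch short.
fn count_rows_in_blocks(mut body: &[u8]) -> Option<u64> {
    let mut rows = 0u64;
    while !body.is_empty() {
        let (count, n) = read_long(body)?;
        body = &body[n..];
        let (size, n) = read_long(body)?;
        body = &body[n..];
        let count = u64::try_from(count).ok()?;
        let size = usize::try_from(size).ok()?;
        // Jump over the serialized records and the trailing sync marker.
        body = body.get(size.checked_add(SYNC_LEN)?..)?;
        rows = rows.checked_add(count)?;
    }
    Some(rows)
}
```

A real reader would also validate the sync marker against the one in the file header; that check is left out here.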

What changes are included in this PR?

A row counter is maintained in the RecordDecoder state.
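
As a rough illustration of the idea, and not the actual `RecordDecoder` implementation, the sketch below keeps only a row counter as state. `CountingDecoder`, `WriterField`, `skip_field`, and `read_zigzag_long` are hypothetical names, and only two writer-side types are handled. The decoder skips over each record's bytes according to the writer schema, counts the rows, and reports the count on flush, which is the only information a batch without columns needs.

```rust
/// Writer-side field types this toy decoder can skip (hypothetical subset).
#[derive(Clone, Copy)]
enum WriterField {
    Long,
    Bytes,
}

/// Decode one Avro `long` (zigzag varint); returns the value and bytes consumed.
fn read_zigzag_long(buf: &[u8]) -> Option<(i64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in buf.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((((value >> 1) as i64) ^ -((value & 1) as i64), i + 1));
        }
    }
    None
}

/// Return the encoded length of one field at the front of `buf`.
fn skip_field(field: WriterField, buf: &[u8]) -> Option<usize> {
    let (value, n) = read_zigzag_long(buf)?;
    match field {
        WriterField::Long => Some(n),
        WriterField::Bytes => {
            // `bytes` is a length prefix followed by that many raw bytes.
            let end = n.checked_add(usize::try_from(value).ok()?)?;
            (end <= buf.len()).then_some(end)
        }
    }
}

/// Hypothetical stand-in for a decoder whose reader schema has no fields.
struct CountingDecoder {
    writer_fields: Vec<WriterField>,
    /// Rows consumed since the last flush; no column builders are needed.
    pending_rows: usize,
}

impl CountingDecoder {
    fn new(writer_fields: Vec<WriterField>) -> Self {
        Self { writer_fields, pending_rows: 0 }
    }

    /// Skip `count` records in `buf`, count them, and return bytes consumed.
    /// The counter is only updated if every record was skipped successfully.
    fn decode(&mut self, buf: &[u8], count: usize) -> Option<usize> {
        let mut pos = 0;
        for _ in 0..count {
            for &field in &self.writer_fields {
                pos += skip_field(field, buf.get(pos..)?)?;
            }
        }
        self.pending_rows += count;
        Some(pos)
    }

    /// Report the row count of the pending column-less batch and reset it.
    fn flush(&mut self) -> Option<usize> {
        match std::mem::take(&mut self.pending_rows) {
            0 => None,
            rows => Some(rows),
        }
    }
}
```

In arrow-rs, a `RecordBatch` with no columns can carry such a count through `RecordBatchOptions::with_row_count`; whether this PR builds its batches that way is not stated in the description.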

Are these changes tested?

Added tests to verify decoder behavior given an empty reader schema for the data files in the test suite.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the arrow (Changes to the arrow crate) and arrow-avro (arrow-avro crate) labels Mar 24, 2026
@mzabaluev-flarion mzabaluev-flarion force-pushed the avro-empty-reader-schema branch from f8a3f4a to c518a19 Compare March 24, 2026 17:32
Comment on lines +231 to +515
@@ -512,7 +512,7 @@ impl<R: AsyncFileReader + Unpin + 'static> AsyncAvroFileReader<R> {
 // We have a full batch ready, emit it
 // (This is not mutually exclusive with the block being finished, so the state change is valid)
 if self.decoder.batch_is_full() {
-    return match self.decoder.flush() {
+    return match self.decoder.flush_block() {
Member

Why was this changed to flush_block?

Contributor Author

@mzabaluev mzabaluev Apr 6, 2026

It was something I noticed while tweaking: the OCF reader does not need the schema-updating code in flush. The reader decodes through decode_block, and flush_block is its logical companion.

Member

@rluvaton rluvaton Apr 9, 2026

Is it required for the fix? If not, can you please create a separate PR for this?

Contributor Author

Moved out to #9726.

@alamb
Contributor

alamb commented Apr 6, 2026

fyi @jecsand838

@alamb alamb marked this pull request as draft April 14, 2026 20:51
@alamb
Contributor

alamb commented Apr 14, 2026

Marking as draft, as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look.

In the degenerate case when the Avro reader schema has no fields,
the RecordDecoder should be able to produce empty record batches with
the number of rows counted from the data.
@mzabaluev-flarion mzabaluev-flarion force-pushed the avro-empty-reader-schema branch from 7be6a71 to 9175874 Compare April 15, 2026 12:23
@mzabaluev mzabaluev marked this pull request as ready for review April 15, 2026 12:30
Development

Successfully merging this pull request may close these issues.

Avro decoder can't handle a reader schema with no fields

3 participants