
[arrow-avro] Avro reader produces incorrect results when reader schema and writer schema differ #9655

@ariel-miculas


Describe the bug
There are two separate issues:

Bug 1 – Reader nullable union wrapping breaks decoding of plain writer fields

When a writer produces Avro records with plain (non-nullable) field types but the reader schema wraps those same fields in ["null", T] unions, the decoder misreads the data. The writer never emits a union branch index, but the decoder expects one, so it falls out of sync with the byte stream. The result is garbage field values from the first affected record onward.
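
A minimal sketch of this desync in Python. It hand-rolls Avro's zigzag varint encoding and does not use arrow-avro. The two-field record, the helper names, and the payload are hypothetical and only illustrate the byte-level effect. They are not the crate's code.

```python
def encode_long(n: int) -> bytes:
    """Avro long: zigzag-encode, then emit as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        low7, z = z & 0x7F, z >> 7
        if z:
            out.append(low7 | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def decode_long(buf: bytes, pos: int):
    """Return (value, next_pos); raises IndexError if the buffer runs out."""
    z = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return (z >> 1) ^ -(z & 1), pos


def read_nullable_long(buf, pos):
    """Decode ["null", "long"] as a union: branch index first, then the value."""
    branch, pos = decode_long(buf, pos)
    if branch == 0:
        return None, pos
    if branch == 1:
        return decode_long(buf, pos)
    raise ValueError(f"invalid union branch {branch}")


def decode_record(buf, field_readers):
    """Apply one reader function per field, in order."""
    pos, values = 0, []
    for read in field_readers:
        value, pos = read(buf, pos)
        values.append(value)
    return values


# Hypothetical writer record {a: long, b: long} with a=1, b=5.
# The writer's fields are plain, so no union branch bytes are present.
payload = encode_long(1) + encode_long(5)

# Resolving against the writer schema: read plain longs, then treat them as non-null.
print(decode_record(payload, [decode_long, decode_long]))

# Decoding as if the writer had emitted unions: a's value byte is consumed as a
# branch index, b's byte becomes a's value, and reading b runs off the buffer.
try:
    decode_record(payload, [read_nullable_long, read_nullable_long])
except IndexError:
    print("ran past end of buffer")
```

In this sketch the IndexError is analogous to the "Unexpected EOF" error reported below. With a longer payload the misaligned read would instead return wrong values without raising.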

Bug 2 – Skipper omits writer-only fields when the writer schema uses named type references

When the writer schema uses Avro named type references (e.g., "type": "Timestamp" after Timestamp has been defined once), and the reader schema requests fewer fields than the writer wrote (either by narrowing a nested record or omitting a field entirely), the Skipper uses the wrong field list. It builds its skip plan from the reader's narrowed view of the type rather than the writer's full definition. As a result, it does not consume all the bytes the writer emitted for those fields, leaving the buffer out of sync. Every subsequent record is then decoded from the wrong byte offset, producing corrupted values.
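
A minimal sketch of the skip mismatch, again in plain Python rather than arrow-avro. The `Timestamp` layout (`seconds`, `nanos`), the trailing `id` field, the type tables, and the function names are all assumptions made up for illustration.

```python
def decode_long(buf: bytes, pos: int):
    """Decode an Avro zigzag varint long; return (value, next_pos)."""
    z = shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        shift += 7
        if not b & 0x80:
            return (z >> 1) ^ -(z & 1), pos


# Hypothetical named type, defined once in the writer schema and then
# referenced by name: Timestamp = {seconds: long, nanos: long}.
WRITER_TYPES = {"Timestamp": ["long", "long"]}
# The reader's narrowed view of the same named type keeps only `seconds`.
READER_TYPES = {"Timestamp": ["long"]}


def skip(buf, pos, type_name, named_types):
    """Advance past one value of `type_name`, resolving names via `named_types`."""
    if type_name == "long":
        _, pos = decode_long(buf, pos)
        return pos
    for field_type in named_types[type_name]:
        pos = skip(buf, pos, field_type, named_types)
    return pos


# Writer emitted: updated = Timestamp(seconds=100, nanos=7), then id = 42.
# 100 -> C8 01, 7 -> 0E, 42 -> 54 in zigzag varint form.
PAYLOAD = bytes([0xC8, 0x01, 0x0E, 0x54])


def read_id_after_skipping(named_types):
    """Skip the writer-only `updated` field, then decode `id`."""
    pos = skip(PAYLOAD, 0, "Timestamp", named_types)
    value, _ = decode_long(PAYLOAD, pos)
    return value


print(read_id_after_skipping(WRITER_TYPES))  # skip plan from the writer's definition
print(read_id_after_skipping(READER_TYPES))  # skip plan from the reader's narrowed view
```

With the writer's definition the skip consumes both longs and `id` decodes as 42. With the reader's narrowed view the skip stops after `seconds`, so the `nanos` byte is read as `id` and every later read starts one value early.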

Errors reported:

READ ERROR after 0 rows: Avro error: Parser error: offset overflow reading avro bytes
READ ERROR after 0 rows: Avro error: EOF: Unexpected EOF reading bytes

To Reproduce
See the unit tests from #9605

Expected behavior
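Both cases decode correctly: a plain writer field read through a reader ["null", T] union yields the value the writer wrote, and writer-only fields are skipped using the writer's full type definition so that later fields and records stay aligned.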
