458: feat: Support (De)Serialization for different representations of Nullable Unions#46
458: feat: Support (De)Serialization for different representations of Nullable Unions#46martin-augment wants to merge 5 commits intomainfrom
Conversation
Summary of ChangesHello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly improves the Avro Serde library's capability to handle nullable union types. It addresses the complexities of mapping Rust's Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
WalkthroughThis pull request implements nullable union support in Apache Avro's Serde implementation. The Deserializer gains an internal flag and constructor to track "some"-deserialization context, with enhanced union handling across multiple deserialize methods. Similarly, SchemaAwareWriteSerializer receives a matching internal flag and constructor for "some"-serialization context, with corresponding union and variant handling updates. A comprehensive test suite is added covering nullable unions with enums, primitives, records, and composite types, including Avro JSON encoding compatibility. The changes total approximately 1,127 lines across deserialization, serialization schema handling, and tests. ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Review: Support (De)Serialization for Different Representations of Nullable UnionsSummaryThis PR adds support for multiple Rust representations of Avro nullable unions (unions with
Code Quality: ⭐⭐⭐⭐ (4/5)Strengths
Potential Issues & Concerns1. Index Calculation Logic (Medium Priority)In let variant_idx = *idx as usize - usize::from(self.deserializing_some);And let branch_index = variant_index as usize + usize::from(self.serializing_some);Concern: The asymmetry is correct (deserialize subtracts, serialize adds), but this is subtle and error-prone. Consider:
Recommendation: Add bounds checking before the subtraction or document why underflow is impossible. 2. Special Case Enum Handling (Low Priority)In if self.deserializing_some
&& let Value::Enum(_index, field) = inner.deref()
&& variants.contains(&&**field)
{
return visitor.visit_enum(EnumUnitDeserializer::new(field));
}Concern: This special-case handling for non-newtype enums within Question: Is there a specific reason why enums need special handling compared to other types when deserializing 3. Duplicated Union Handling PatternThe pattern for handling Value::Union(i, x) => {
if matches!(x.deref(), Value::Union(_, _)) {
Err(/* nested union error */)
} else {
Self::new(x.deref()).deserialize_X(visitor).map_err(/* ... */)
}
}Recommendation: Consider extracting this into a helper method to reduce code duplication: fn unwrap_union_for_deserialize<V, F>(&self, i: u32, inner: &Value, f: F) -> Result<V, Error>
where F: FnOnce(Deserializer) -> Result<V, Error>4. Missing Bounds Check Edge Case (Low Priority)In serialization methods, the bounds check happens after calculating let branch_index = variant_index as usize + usize::from(self.serializing_some);
if branch_index >= union_schema.schemas.len() {
return Err(...)
}While correct, if Performance ConsiderationsPositive: The implementation has minimal performance overhead:
Neutral: The enum schema iteration in Security Concerns✅ No major security issues identified
Minor concern: Error messages include debug output of values (e.g., Best Practices & StyleGood:
Could improve:
DocumentationMissing:
Recommendation: Add rustdoc comments: /// Tracks whether we're currently deserializing an `Option::Some` variant.
/// When true, variant indices are adjusted to account for the implicit `null`
/// first variant in the Avro union schema.
deserializing_some: bool,Test Coverage: ⭐⭐⭐⭐⭐ (5/5)The test coverage is exemplary:
One ignored test suggests there's a known limitation with untagged enums that deserialize to the wrong variant (the "pitfall" module). This is acceptable if documented. Final Recommendations
Verdict: ✅ Approve with Minor RevisionsThis is a well-implemented feature with excellent test coverage. The core logic is sound, but would benefit from additional documentation and minor code improvements. The ignored test should be documented or fixed. Review generated with assistance from Claude Code |
There was a problem hiding this comment.
Code Review
This pull request enhances support for serializing and deserializing nullable unions in various representations, introducing a state flag (deserializing_some/serializing_some) to correctly handle Option<T> types and adjusting variant indices. However, a critical security vulnerability exists: the implementation assumes the null variant is always the first element in an Avro union. This hardcoded assumption can lead to a Denial of Service (panic) during deserialization and data integrity issues during serialization when schemas do not conform, specifically causing a guaranteed panic in debug builds for certain valid Avro schemas. Furthermore, while the overall implementation is solid, there are minor opportunities to improve error messages for better debugging clarity.
| return visitor.visit_enum(EnumUnitDeserializer::new(field)); | ||
| } | ||
| // Assume `null` is the first branch if deserializing some so decrement the variant index | ||
| let variant_idx = *idx as usize - usize::from(self.deserializing_some); |
There was a problem hiding this comment.
The code performs an unchecked subtraction *idx as usize - usize::from(self.deserializing_some) on line 862. When self.deserializing_some is true (which happens when deserializing an Option<T> where the value is not null) and the Avro union index *idx is 0, this will cause a panic due to integer underflow in debug builds. In release builds, it will wrap to usize::MAX, which will then fail the bounds check on line 863 and return an error.
This occurs when the Avro schema for a nullable union has the non-null variant at index 0 (e.g., ["int", "null"]), which is perfectly valid according to the Avro specification. An attacker providing a crafted Avro file with such a schema can cause a Denial of Service (DoS) by crashing the application (in debug mode) or causing unexpected deserialization failures.
Furthermore, the logic hardcodes the assumption that null is always the first variant in a nullable union. If null is at a different position (e.g., index 1 in ["int", "null", "string"]), the subtraction will result in the wrong variant being selected, leading to data corruption.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The calculation of the branch_index may lead to usize underflow when the null variant is not the first in the union. The Avro specification recommends listing the null first, but it allows to use it in a non-first position too. Prevents an application crash at production due to invalid variant index. The calculation should be done with checked operations and an Err should be returned instead with recommendation to put the null at the first position.
| } | ||
| } | ||
| } | ||
| let branch_index = variant_index as usize + usize::from(self.serializing_some); |
There was a problem hiding this comment.
The serialization logic hardcodes the assumption that the null variant is always the first branch in a nullable union by adding usize::from(self.serializing_some) to the variant index (e.g., on line 1500). If the Avro schema has null at a different position or if the union does not contain null at all, this will result in serializing the wrong variant or an out-of-bounds error. This can lead to data corruption when serializing enums wrapped in Option if the schema does not follow the expected ["null", ...] pattern.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The calculation of the branch_index may lead to usize underflow when the null variant is not the first in the union. The Avro specification recommends listing the null first, but it allows to use it in a non-first position too. Prevents an application crash at production due to invalid variant index. The calculation should be done with checked operations and an Err should be returned instead with recommendation to put the null at the first position.
| .map_err(|e| { | ||
| de::Error::custom(format!( | ||
| "Attempted to deserialize Value::Union({i}, {x:?}) as bytes: {e:?}" | ||
| )) | ||
| }) |
There was a problem hiding this comment.
The error message here seems to be a copy-paste from deserialize_bytes. It says "as bytes", but this is inside deserialize_byte_buf. This could be confusing when debugging. It should probably be updated to "as byte_buf".
| .map_err(|e| { | |
| de::Error::custom(format!( | |
| "Attempted to deserialize Value::Union({i}, {x:?}) as bytes: {e:?}" | |
| )) | |
| }) | |
| .map_err(|e| { | |
| de::Error::custom(format!( | |
| "Attempted to deserialize Value::Union({i}, {x:?}) as byte_buf: {e:?}" | |
| )) | |
| }) |
| if branch_index >= union_schema.schemas.len() { | ||
| return Err(create_error(format!( | ||
| "Variant index out of bounds: {}. The union schema has '{}' schemas", | ||
| variant_index, | ||
| union_schema.schemas.len() | ||
| ))); | ||
| } |
There was a problem hiding this comment.
In serialize_newtype_variant_with_schema, the error message for an out-of-bounds index uses variant_index, but the check is performed on branch_index. This can be misleading when serializing_some is true, as branch_index would be variant_index + 1. The error message should use branch_index to be accurate. Using "Branch index" instead of "Variant index" in the message would also improve clarity.
| if branch_index >= union_schema.schemas.len() { | |
| return Err(create_error(format!( | |
| "Variant index out of bounds: {}. The union schema has '{}' schemas", | |
| variant_index, | |
| union_schema.schemas.len() | |
| ))); | |
| } | |
| if branch_index >= union_schema.schemas.len() { | |
| return Err(create_error(format!( | |
| "Branch index out of bounds: {}. The union schema has '{}' schemas", | |
| branch_index, | |
| union_schema.schemas.len() | |
| ))); | |
| } |
| if branch_index >= union_schema.schemas.len() { | ||
| return Err(create_error(format!( | ||
| "Variant index out of bounds: {}. The union schema has '{}' schemas", | ||
| variant_index, | ||
| union_schema.schemas.len() | ||
| ))); | ||
| } |
There was a problem hiding this comment.
In serialize_tuple_variant_with_schema, the error message for an out-of-bounds index uses variant_index, but the check is performed on branch_index. This can be misleading when serializing_some is true, as branch_index would be variant_index + 1. The error message should use branch_index to be accurate. Using "Branch index" instead of "Variant index" in the message would also improve clarity.
| if branch_index >= union_schema.schemas.len() { | |
| return Err(create_error(format!( | |
| "Variant index out of bounds: {}. The union schema has '{}' schemas", | |
| variant_index, | |
| union_schema.schemas.len() | |
| ))); | |
| } | |
| if branch_index >= union_schema.schemas.len() { | |
| return Err(create_error(format!( | |
| "Branch index out of bounds: {}. The union schema has '{}' schemas", | |
| branch_index, | |
| union_schema.schemas.len() | |
| ))); | |
| } |
| if branch_index >= union_schema.schemas.len() { | ||
| return Err(create_error(format!( | ||
| "Variant index out of bounds: {}. The union schema has '{}' schemas", | ||
| variant_index, | ||
| union_schema.schemas.len() | ||
| ))); | ||
| } |
There was a problem hiding this comment.
In serialize_struct_variant_with_schema, the error message for an out-of-bounds index uses variant_index, but the check is performed on branch_index. This can be misleading when serializing_some is true, as branch_index would be variant_index + 1. The error message should use branch_index to be accurate. Using "Branch index" instead of "Variant index" in the message would also improve clarity.
| if branch_index >= union_schema.schemas.len() { | |
| return Err(create_error(format!( | |
| "Variant index out of bounds: {}. The union schema has '{}' schemas", | |
| variant_index, | |
| union_schema.schemas.len() | |
| ))); | |
| } | |
| if branch_index >= union_schema.schemas.len() { | |
| return Err(create_error(format!( | |
| "Branch index out of bounds: {}. The union schema has '{}' schemas", | |
| branch_index, | |
| union_schema.schemas.len() | |
| ))); | |
| } |
🤖 Augment PR SummarySummary: Extends serde (de)serialization to better handle nullable Avro unions represented in different Rust/Serde shapes. Changes:
Technical Notes: The new logic relies on shifting union indices in 🤖 Was this summary useful? React with 👍 or 👎 |
| return visitor.visit_enum(EnumUnitDeserializer::new(field)); | ||
| } | ||
| // Assume `null` is the first branch if deserializing some so decrement the variant index | ||
| let variant_idx = *idx as usize - usize::from(self.deserializing_some); |
There was a problem hiding this comment.
let variant_idx = *idx as usize - usize::from(self.deserializing_some); can underflow when deserializing_some is true and idx == 0 (e.g., nullable unions where null isn’t the first branch), which can panic in debug or produce a wrapped index in release. Consider guarding this case and returning a clear error instead of subtracting unconditionally.
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The calculation of the branch_index may lead to usize underflow when the null variant is not the first in the union. The Avro specification recommends listing the null first, but it allows to use it in a non-first position too. Prevents an application crash at production due to invalid variant index. The calculation should be done with checked operations and an Err should be returned instead with recommendation to put the null at the first position.
| if self.serializing_some { | ||
| for (i, variant_schema) in union_schema.schemas.iter().enumerate() { | ||
| match variant_schema { | ||
| Schema::Enum(enum_schema) if enum_schema.name.name == name => { |
There was a problem hiding this comment.
In the serializing_some path, the union scan only matches Schema::Enum, so unions that reference the enum via Schema::Ref (e.g. ["null", "MyEnum"]) will fall through and then error via serialize_unit_struct_with_schema (since it doesn’t support Enum). Consider resolving Schema::Ref variants while scanning so Option<MyEnum> can serialize against referenced-enum union branches.
Severity: high
🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.
There was a problem hiding this comment.
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The new logic should add a support for Schema::Ref too because the reference may point to a Schema::Enum. Prevents an error while resolving the variant if it is a reference to an enum schema
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
avro/src/serde/de.rs (1)
852-873:⚠️ Potential issue | 🟠 MajorArithmetic underflow risk when
idx == 0anddeserializing_some == true.Line 862:
*idx as usize - usize::from(self.deserializing_some)will panic in debug mode (or wrap tousize::MAXin release) whenidxis 0. The comment says "Assumenullis the first branch" — but this assumption isn't enforced. A union like["int", "null"]or malformed data withValue::Union(0, non_null_value)will trigger this.Use
checked_subor an explicit guard:🐛 Proposed fix
- let variant_idx = *idx as usize - usize::from(self.deserializing_some); - if (variant_idx) < variants.len() { + let Some(variant_idx) = (*idx as usize).checked_sub(usize::from(self.deserializing_some)) else { + return Err(de::Error::custom(format!( + "Union variant index {idx} is too small to be a non-null branch when deserializing Some" + ))); + }; + if variant_idx < variants.len() {
🧹 Nitpick comments (3)
avro/src/serde/de.rs (1)
85-90: Consider restricting visibility ofnew_deserializing_some.This constructor is
pubbut it exposes an internal implementation detail (the "Some" deserialization context). External callers could misuse it by constructing adeserializing_some=truedeserializer on arbitrary values, triggering the underflow issue above. Considerpub(crate)unless there's a reason for full public visibility.avro/src/serde/ser_schema.rs (1)
637-650: Same visibility concern as the deserializer: considerpub(crate)fornew_serializing_some.This constructor exposes the internal
serializing_someflag. Like its deserializer counterpartnew_deserializing_some, consider restricting visibility topub(crate)unless public use is intended.avro/tests/nullable_union.rs (1)
973-981: Add a comment explaining why this test is#[ignore]d.This is a known serde limitation: with
#[serde(untagged)], serde tries variants in declaration order andMyRecordAalways matches first because both records have identical shapes ({ a: i32 }). A short comment would save future readers from investigating.💡 Suggested doc comment
#[test] - #[ignore] + #[ignore] // serde's untagged deserialization tries variants in order; + // MyRecordA always matches first since both records have the same shape { a: i32 } fn deserialize_rusty_untagged_my_record_b_27() -> AvroResult<()> {
value:useful; category:bug; feedback: The Claude AI reviewer is correct! The calculation of the branch_index may lead to usize underflow when the null variant is not the first in the union. The Avro specification recommends listing the null first, but it allows to use it in a non-first position too. Prevents an application crash at production due to invalid variant index. The calculation should be done with checked operations and an Err should be returned instead with recommendation to put the null at the first position. |
value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! As can be seen in the root Cargo.toml the project uses Rust edition 2024, so it is OK to use let-chains. |
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! There is some code duplication that would be good to be extracted to a helper function and reused. Prevents higher maintenance cost and bigger chance of regressions if a bug is fixed at only one place but not in the duplicated code. |
value:useful; category:bug; feedback: The Claude AI reviewer is correct! The calculation of the branch_index may lead to usize underflow when the null variant is not the first in the union. The Avro specification recommends listing the null first, but it allows to use it in a non-first position too. Prevents an application crash at production due to invalid variant index. The calculation should be done with checked operations and an Err should be returned instead with recommendation to put the null at the first position. |
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The calculation of the branch_index may lead to usize underflow when the null variant is not the first in the union. The Avro specification recommends listing the null first, but it allows to use it in a non-first position too. Prevents an application crash at production due to invalid variant index. The calculation should be done with checked operations and an Err should be returned instead with recommendation to put the null at the first position. |
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! The visibility of the new methods should be as minimal (private) as possible. Once released as public API it would be much harder to modify/improve/remove it later. Prevents exposing internal implementation to the public API for no good reason. |
458: To review by AI