Skip to content

489: feat!: Add a UnionSchemaBuilder#57

Open
martin-augment wants to merge 1 commit intomainfrom
pr-489-2026-02-26-11-38-29
Open

489: feat!: Add a UnionSchemaBuilder#57
martin-augment wants to merge 1 commit intomainfrom
pr-489-2026-02-26-11-38-29

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

489: To review by AI

This also fixes a issue with the original `new` implementation where
it would insert named types in the `variant_index` and then
`find_schema_with_known_schemata` would use the fast path without
checking the schema.

`find_schema_with_known_schemata` has also been simplified to use
`known_schemata` directly instead of rebuilding it with the current
schema, as this would cause duplicate schema errors after the incorrect
fast path was removed.

Another issue fixed is that logical types cannot be added if their base type
is already in the union, and the same in reverse.

The `UnionSchemaBuilder::variant_ignore_duplicates` and `UnionSchemaBuilder::contains`
are needed for `avro_derive` to implement full support for enums.
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Feb 26, 2026

Walkthrough

This pull request introduces a builder pattern for union schema construction and refactors error handling for union duplicates. The UnionSchema type gains a public builder API via UnionSchemaBuilder with methods for adding variants and constructing unions. Error reporting is made more granular by splitting a single GetUnionDuplicate variant into three distinct variants for maps, arrays, and general duplicate types. Internal union representation changes include explicit indexing structures for tracking named and unnamed schemas. The Name struct adds PartialOrd and Ord derives. Supporting adjustments handle edge cases in Option<T> schema generation and simplify test code with error propagation.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-489-2026-02-26-11-38-29

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Copy Markdown

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new UnionSchemaBuilder to streamline the creation and validation of Avro union schemas. This builder provides a more structured approach to adding variants, preventing common errors like nested unions or duplicate types. The changes also refine the internal representation and resolution logic for union schemas, particularly improving how logical types and named schemas are handled during schema matching, leading to more accurate and robust schema validation.

Highlights

  • New UnionSchemaBuilder: Introduced a dedicated builder for UnionSchema to facilitate more robust and controlled construction of union schemas.
  • Improved Union Validation: Enhanced error handling for duplicate types within unions, specifically distinguishing between duplicate named schemas, and different map or array schemas.
  • Refined Schema Resolution: The UnionSchema::find_schema_with_known_schemata method was significantly refactored to better handle logical types and named schemas during value resolution.
Changelog
  • avro/src/error.rs
    • Added GetUnionDuplicateMap and GetUnionDuplicateArray error variants to specifically report duplicate map or array schemas with differing subtypes within a union.
    • Modified GetUnionDuplicate to include SchemaKind for more precise error reporting.
  • avro/src/schema/mod.rs
    • Exported the new UnionSchemaBuilder from the union module.
  • avro/src/schema/name.rs
    • Derived PartialOrd and Ord traits for the Name struct.
  • avro/src/schema/union.rs
    • Introduced the UnionSchemaBuilder struct with methods for adding schema variants, checking for existing schemas, and building the final UnionSchema.
    • Refactored UnionSchema::new to utilize the new UnionSchemaBuilder.
    • Updated UnionSchema to include named_index for efficient lookup of named types.
    • Modified UnionSchema::is_nullable to leverage the variant_index for improved performance.
    • Completely re-implemented UnionSchema::find_schema_with_known_schemata to correctly resolve logical types and named schemas by converting values to base SchemaKinds.
    • Added private helper functions value_to_base_schemakind and schema_to_base_schemakind to assist in logical type resolution.
    • Included new test cases to validate the functionality of UnionSchemaBuilder with primitive, complex, and logical types, and to test map resolution.
  • avro/src/serde/derive.rs
    • Adjusted the get_schema_in_ctxt implementation for Option<T> to correctly handle cases where T itself is Null, preventing redundant Null entries in the union.
  • avro/src/types.rs
    • Simplified a test assertion in test_resolve_schemata_with_schemata by directly unwrapping the resolve_schemata result.
Activity
  • The pull request has been opened and is ready for review.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a UnionSchemaBuilder to construct UnionSchema instances and refactors validation logic for a cleaner API and improved error handling. However, the new resolution logic in UnionSchema::find_schema_with_known_schemata introduces critical regressions, breaking Avro specification compliance due to incorrect variant matching order and a lack of type promotion for unnamed types. The repeated deep cloning of untrusted input values during union resolution also poses a Denial of Service risk and is a general performance concern. Additionally, a minor bug exists where an incorrect error type is returned for duplicate array schemas.

Comment on lines 94 to +155
) -> Option<(usize, &Schema)> {
let schema_kind = SchemaKind::from(value);
if let Some(&i) = self.variant_index.get(&schema_kind) {
// fast path
Some((i, &self.schemas[i]))
} else {
// slow path (required for matching logical or named types)

// first collect what schemas we already know
let mut collected_names: HashMap<Name, &Schema> = known_schemata
.map(|names| {
names
.iter()
.map(|(name, schema)| (name.clone(), schema.borrow()))
.collect()
let ValueSchemaKind { unnamed, named } = Self::value_to_base_schemakind(value);
// Unnamed schema types can be looked up directly using the variant_index
let unnamed = unnamed
.and_then(|kind| self.variant_index.get(&kind).copied())
.map(|index| (index, &self.schemas[index]))
.and_then(|(index, schema)| {
let kind = schema.discriminant();
// Maps and arrays need to be checked if they actually match the value
if kind == SchemaKind::Map || kind == SchemaKind::Array {
let known_schemata_if_none = HashMap::new();
let known_schemata = known_schemata.unwrap_or(&known_schemata_if_none);
let namespace = if schema.namespace().is_some() {
&schema.namespace()
} else {
enclosing_namespace
};

// TODO: Do this without the clone
value
.clone()
.resolve_internal(schema, known_schemata, namespace, &None)
.ok()
.map(|_| (index, schema))
} else {
Some((index, schema))
}
});
let named = named.and_then(|kind| {
// Every named type needs to be checked against a value until one matches

let known_schemata_if_none = HashMap::new();
let known_schemata = known_schemata.unwrap_or(&known_schemata_if_none);

self.named_index
.iter()
.copied()
.map(|i| (i, &self.schemas[i]))
.filter(|(_i, s)| s.discriminant() == kind || s.discriminant() == SchemaKind::Ref)
.find(|(_i, schema)| {
let namespace = if schema.namespace().is_some() {
&schema.namespace()
} else {
enclosing_namespace
};

// TODO: Do this without the clone
value
.clone()
.resolve_internal(schema, known_schemata, namespace, &None)
.is_ok()
})
.unwrap_or_default();

self.schemas.iter().enumerate().find(|(_, schema)| {
let resolved_schema = ResolvedSchema::new_with_known_schemata(
vec![*schema],
enclosing_namespace,
&collected_names,
)
.expect("Schema didn't successfully parse");
let resolved_names = resolved_schema.names_ref;

// extend known schemas with just resolved names
collected_names.extend(resolved_names);
let namespace = &schema.namespace().or_else(|| enclosing_namespace.clone());

value
.clone()
.resolve_internal(schema, &collected_names, namespace, &None)
.is_ok()
})
});

match (unnamed, named) {
(Some((u_i, _)), Some((n_i, _))) if u_i < n_i => unnamed,
(Some(_), Some(_)) => named,
(Some(_), None) => unnamed,
(None, Some(_)) => named,
(None, None) => None,
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The find_schema_with_known_schemata implementation introduces critical logic flaws that break Avro specification compliance and pose performance/security risks.

  1. Incorrect Resolution Order: The named_index is populated from a BTreeMap's values in UnionSchemaBuilder::build (line 344), which results in indices being sorted by the alphabetical order of the schema names. When find_schema_with_known_schemata iterates over named_index (line 128), it will match the alphabetically first record even if it appears later in the union definition. Avro requires that the first matching variant in the union's defined order be selected.
  2. Missing Type Promotion: The "fast path" for unnamed types (lines 97-98) only checks for an exact SchemaKind match in variant_index. It fails to account for type promotion rules (e.g., an int value matching a long schema), which were previously handled by the slow path using resolve_internal. Since there is no fallback to a full resolution check for unnamed types that don't match exactly, valid Avro data may be incorrectly rejected.
  3. Inefficient Cloning / DoS Risk: As noted in the TODO, cloning value (e.g., lines 112 and 141) can be inefficient, especially for large values, and contributes to a Denial of Service risk. To address this, consider refactoring resolve_internal (which currently consumes self) to introduce a new function, such as validate_internal(&self, ...) -> AvroResult<()>, which checks for validity without consuming or returning a modified Value. This would allow calling value.validate_internal(...) without needing a clone, improving performance and mitigating the DoS risk.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! 1) The iteration over the named schemas depends on the alphabetical order of the schemas' names, not their insertion order. This may lead to wrong resolving of named schema. 2) For unnamed schemas there is no "slow path" resolving anymore and this breaks the "type promotion" support. Prevents regressions in the union schema support.

Comment on lines +271 to +273
return Err(
Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into(),
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There's a small copy-paste error here. When handling a duplicate Schema::Array, the error returned is Details::GetUnionDuplicateMap. This should be Details::GetUnionDuplicateArray to provide the correct error message.

Suggested change
return Err(
Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into(),
);
return Err(
Details::GetUnionDuplicateArray(self.schemas.remove(index), schema).into(),
);

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! There is a copy/paste error in the returned error for duplicate Schema::Array's. Prevents confusing the developer by using a wrong error

Comment on lines +113 to +150
value
.clone()
.resolve_internal(schema, known_schemata, namespace, &None)
.ok()
.map(|_| (index, schema))
} else {
Some((index, schema))
}
});
let named = named.and_then(|kind| {
// Every named type needs to be checked against a value until one matches

let known_schemata_if_none = HashMap::new();
let known_schemata = known_schemata.unwrap_or(&known_schemata_if_none);

self.named_index
.iter()
.copied()
.map(|i| (i, &self.schemas[i]))
.filter(|(_i, s)| s.discriminant() == kind || s.discriminant() == SchemaKind::Ref)
.find(|(_i, schema)| {
let namespace = if schema.namespace().is_some() {
&schema.namespace()
} else {
enclosing_namespace
};

// TODO: Do this without the clone
value
.clone()
.resolve_internal(schema, known_schemata, namespace, &None)
.is_ok()
})
.unwrap_or_default();

self.schemas.iter().enumerate().find(|(_, schema)| {
let resolved_schema = ResolvedSchema::new_with_known_schemata(
vec![*schema],
enclosing_namespace,
&collected_names,
)
.expect("Schema didn't successfully parse");
let resolved_names = resolved_schema.names_ref;

// extend known schemas with just resolved names
collected_names.extend(resolved_names);
let namespace = &schema.namespace().or_else(|| enclosing_namespace.clone());

value
.clone()
.resolve_internal(schema, &collected_names, namespace, &None)
.is_ok()
})
});

match (unnamed, named) {
(Some((u_i, _)), Some((n_i, _))) if u_i < n_i => unnamed,
(Some(_), Some(_)) => named,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-medium medium

The implementation of find_schema_with_known_schemata performs a deep clone of the value for every variant in the union during the resolution process (lines 114 and 142). If a union has many variants and the input Value is large or deeply nested, this can lead to excessive memory allocation and CPU consumption, potentially enabling a Denial of Service (DoS) attack.

As noted in the TODO comments, this should be implemented without cloning the value, for example by modifying resolve_internal to accept a reference.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The deep cloning could be avoided by changing the way the resolve() function is implemented. It would be good to use a reference instead.

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Feb 26, 2026

🤖 Augment PR Summary

Summary: This PR introduces a new UnionSchemaBuilder and refactors union construction/matching logic to support incremental building and improved handling of logical/named types.

Changes:

  • Added UnionSchemaBuilder and refactored UnionSchema::new to build unions via the builder
  • Improved union variant indexing by separating unnamed (variant_index) and named (named_index) variant lookups
  • Updated union schema selection (find_schema_with_known_schemata) to fast-path unnamed kinds and validate map/array candidates via resolution
  • Expanded error reporting for union duplicates, including more specific duplicate map/array variants
  • Exported UnionSchemaBuilder from schema::mod and added ordering derives to Name to support ordered maps
  • Adjusted serde derive for Option<T> when T is null to avoid creating invalid duplicate-null unions
  • Added targeted tests covering builder behavior and union schema selection edge cases

Technical Notes: Logical types are reduced to their base representation for indexing, and map/array matches now verify item compatibility by attempting schema resolution against the candidate variant.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 3 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

if let Some(index) = self.variant_index.get(&SchemaKind::Map).copied() {
if self.schemas[index] != schema {
return Err(
Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In variant_ignore_duplicates, building the error using self.schemas.remove(index) mutates the builder on the error path, which can invalidate indices stored in variant_index/names if the caller continues using the builder after handling the error.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The removal of the duplicate schema suggests that the caller could re-attempt to insert the schema after correcting it, but the indices (names and variant_index) should also be updated. Prevents leaving the UnionSchemaBuilder in inconsistent state.

if let Some(index) = self.variant_index.get(&SchemaKind::Array).copied() {
if self.schemas[index] != schema {
return Err(
Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This array-duplicate branch returns Details::GetUnionDuplicateMap(...) (and message) even though it’s handling arrays; it looks like it should report GetUnionDuplicateArray instead.

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:documentation; feedback: The Augment AI reviewer is correct! There is a copy/paste error in the returned error for duplicate Schema::Array's. Prevents confusing the developer by using a wrong error

self.schemas.shrink_to_fit();
UnionSchema {
variant_index: self.variant_index,
named_index: self.names.into_values().collect(),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

named_index is derived from self.names.into_values(), so its iteration order is by Name (BTreeMap key order) rather than union index order; since find_schema_with_known_schemata scans named_index with .find(), union precedence for named types may depend on Name ordering instead of the declared union order.

Severity: medium

Other Locations
  • avro/src/schema/union.rs:128

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The iteration over the named schemas depends on the alphabetical order of the schemas' names, not their insertion order. This may lead to wrong resolving of named schema.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
avro/src/serde/derive.rs (1)

640-650: Please add a regression test for the Option<Null> branch.

This new edge-case path is important and easy to regress; a focused test for generated schema behavior here would be valuable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/serde/derive.rs` around lines 640 - 650, Add a regression test that
exercises the Option<Null> branch: call T::get_schema_in_ctxt (or whatever
public helper you use in tests) for Option<Null> so the code path that checks if
let Schema::Null = schema is taken, then assert the returned Schema::Union (from
UnionSchema::new) contains exactly the single Schema::Null variant (no
nested/extra variants). Place the test alongside other serde derive tests, use
the same context setup used by derive tests, and reference the Option handling
code paths around T::get_schema_in_ctxt, Schema::Null and UnionSchema::new to
guide locating the correct behavior to verify.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@avro/src/schema/union.rs`:
- Around line 256-273: The variant_ignore_duplicates branch mutates builder
state by calling self.schemas.remove(index) before returning Err and also uses
the wrong error enum for arrays; update both the Map and Array branches in
variant_ignore_duplicates to avoid mutating self.schemas or self.variant_index
when encountering a duplicate (do not call remove), instead construct and return
the appropriate error using the existing values (clone or reference as needed)
so internal indices remain valid, and change the Array branch to return
Details::GetUnionDuplicateArray(...) while the Map branch returns
Details::GetUnionDuplicateMap(...).
- Around line 132-133: The current filter in union.rs uses s.discriminant() ==
kind (with SchemaKind::Ref) which incorrectly excludes named logical-type
variants; update the predicate to accept schemas whose effective/base kind
matches kind as well as SchemaKind::Ref — e.g., replace the s.discriminant()
check with a call that compares the schema's base/effective kind (or a helper
like is_logical_variant_of or base_kind) against kind so
duration/decimal/uuid-over-fixed and other logical variants are included when
matching.
- Around line 340-345: The named_index is being built from
self.names.into_values(), which yields BTreeMap key order instead of the union
declaration/schema order; change build() so named_index is constructed by
iterating schema-order (the order of self.schemas or using self.variant_index)
and for each schema index find and collect the corresponding name from
self.names (matching its stored index/value), so the first matching named branch
follows the original union declaration order; update the build() function to
produce UnionSchema { named_index: /* names collected in schema order */ , ... }
rather than using into_values().

---

Nitpick comments:
In `@avro/src/serde/derive.rs`:
- Around line 640-650: Add a regression test that exercises the Option<Null>
branch: call T::get_schema_in_ctxt (or whatever public helper you use in tests)
for Option<Null> so the code path that checks if let Schema::Null = schema is
taken, then assert the returned Schema::Union (from UnionSchema::new) contains
exactly the single Schema::Null variant (no nested/extra variants). Place the
test alongside other serde derive tests, use the same context setup used by
derive tests, and reference the Option handling code paths around
T::get_schema_in_ctxt, Schema::Null and UnionSchema::new to guide locating the
correct behavior to verify.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7a3548a and 7205975.

📒 Files selected for processing (6)
  • avro/src/error.rs
  • avro/src/schema/mod.rs
  • avro/src/schema/name.rs
  • avro/src/schema/union.rs
  • avro/src/serde/derive.rs
  • avro/src/types.rs

Comment on lines +132 to +133
.filter(|(_i, s)| s.discriminant() == kind || s.discriminant() == SchemaKind::Ref)
.find(|(_i, schema)| {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Named logical-type variants can be skipped during matching.

This filter compares raw discriminants against a base kind (Fixed, Bytes, etc.). For named logical schemas (for example duration/decimal/uuid-over-fixed), raw discriminants won’t match the base kind and valid variants can be excluded.

🔧 Proposed fix
-                .filter(|(_i, s)| s.discriminant() == kind || s.discriminant() == SchemaKind::Ref)
+                .filter(|(_i, s)| {
+                    s.discriminant() == SchemaKind::Ref
+                        || UnionSchemaBuilder::schema_to_base_schemakind(s) == kind
+                })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/schema/union.rs` around lines 132 - 133, The current filter in
union.rs uses s.discriminant() == kind (with SchemaKind::Ref) which incorrectly
excludes named logical-type variants; update the predicate to accept schemas
whose effective/base kind matches kind as well as SchemaKind::Ref — e.g.,
replace the s.discriminant() check with a call that compares the schema's
base/effective kind (or a helper like is_logical_variant_of or base_kind)
against kind so duration/decimal/uuid-over-fixed and other logical variants are
included when matching.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! Since the logical schemas are based on another type a direct comparison of their SchemaKind won't match. They should be compared by using their base kinds.

Comment on lines +256 to +273
} else if let Schema::Map(_) = &schema {
if let Some(index) = self.variant_index.get(&SchemaKind::Map).copied() {
if self.schemas[index] != schema {
return Err(
Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into(),
);
}
} else {
self.variant_index
.insert(SchemaKind::Map, self.schemas.len());
self.schemas.push(schema);
}
} else if let Schema::Array(_) = &schema {
if let Some(index) = self.variant_index.get(&SchemaKind::Array).copied() {
if self.schemas[index] != schema {
return Err(
Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into(),
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

variant_ignore_duplicates corrupts builder state on error and uses the wrong error variant for arrays.

On mismatch, this path calls self.schemas.remove(index) before returning Err, which mutates internal state (and stale indices remain in maps). Also, the array branch returns GetUnionDuplicateMap instead of GetUnionDuplicateArray.

🔧 Proposed fix
         } else if let Schema::Map(_) = &schema {
             if let Some(index) = self.variant_index.get(&SchemaKind::Map).copied() {
                 if self.schemas[index] != schema {
                     return Err(
-                        Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into(),
+                        Details::GetUnionDuplicateMap(self.schemas[index].clone(), schema).into(),
                     );
                 }
             } else {
                 self.variant_index
                     .insert(SchemaKind::Map, self.schemas.len());
                 self.schemas.push(schema);
             }
         } else if let Schema::Array(_) = &schema {
             if let Some(index) = self.variant_index.get(&SchemaKind::Array).copied() {
                 if self.schemas[index] != schema {
                     return Err(
-                        Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into(),
+                        Details::GetUnionDuplicateArray(self.schemas[index].clone(), schema).into(),
                     );
                 }
             } else {
                 self.variant_index
                     .insert(SchemaKind::Array, self.schemas.len());
                 self.schemas.push(schema);
             }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/schema/union.rs` around lines 256 - 273, The
variant_ignore_duplicates branch mutates builder state by calling
self.schemas.remove(index) before returning Err and also uses the wrong error
enum for arrays; update both the Map and Array branches in
variant_ignore_duplicates to avoid mutating self.schemas or self.variant_index
when encountering a duplicate (do not call remove), instead construct and return
the appropriate error using the existing values (clone or reference as needed)
so internal indices remain valid, and change the Array branch to return
Details::GetUnionDuplicateArray(...) while the Map branch returns
Details::GetUnionDuplicateMap(...).

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The removal of the duplicate schema suggests that the caller could re-attempt to insert the schema after correcting it, but the indices (names and variant_index) should also be updated. Prevents leaving the UnionSchemaBuilder in inconsistent state.

Comment on lines +340 to +345
pub fn build(mut self) -> UnionSchema {
self.schemas.shrink_to_fit();
UnionSchema {
variant_index: self.variant_index,
named_index: self.names.into_values().collect(),
schemas: self.schemas,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Preserve schema-order precedence when building named_index.

self.names is a BTreeMap, so into_values() is keyed-name order, not union declaration/index order. This can change which named branch is picked first when multiple candidates match.

🔧 Proposed fix
     pub fn build(mut self) -> UnionSchema {
         self.schemas.shrink_to_fit();
+        let mut named_index: Vec<_> = self.names.into_values().collect();
+        named_index.sort_unstable();

         UnionSchema {
             variant_index: self.variant_index,
-            named_index: self.names.into_values().collect(),
+            named_index,
             schemas: self.schemas,
         }
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
pub fn build(mut self) -> UnionSchema {
self.schemas.shrink_to_fit();
UnionSchema {
variant_index: self.variant_index,
named_index: self.names.into_values().collect(),
schemas: self.schemas,
pub fn build(mut self) -> UnionSchema {
self.schemas.shrink_to_fit();
let mut named_index: Vec<_> = self.names.into_values().collect();
named_index.sort_unstable();
UnionSchema {
variant_index: self.variant_index,
named_index,
schemas: self.schemas,
}
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/schema/union.rs` around lines 340 - 345, The named_index is being
built from self.names.into_values(), which yields BTreeMap key order instead of
the union declaration/schema order; change build() so named_index is constructed
by iterating schema-order (the order of self.schemas or using
self.variant_index) and for each schema index find and collect the corresponding
name from self.names (matching its stored index/value), so the first matching
named branch follows the original union declaration order; update the build()
function to produce UnionSchema { named_index: /* names collected in schema
order */ , ... } rather than using into_values().

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The iteration over the named schemas depends on the alphabetical order of the schemas' names, not their insertion order. This may lead to wrong resolving of named schema.

@claude
Copy link
Copy Markdown

claude bot commented Feb 26, 2026

Code Review

This PR introduces a UnionSchemaBuilder to build unions incrementally, refactors find_schema_with_known_schemata to handle logical types properly, and adds several improvements. The overall approach is solid but there are a few bugs and design concerns worth addressing before merging.


Bug 1: Wrong error variant for duplicate arrays (high severity)

In variant_ignore_duplicates (union.rs line ~271), when a duplicate Schema::Array with a different element type is detected, the code emits GetUnionDuplicateMap instead of GetUnionDuplicateArray.

The error arm reads:

Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into()

but should use GetUnionDuplicateArray. This causes the error message to say "Found two different maps..." for an array conflict. Since GetUnionDuplicateArray was introduced in this PR specifically for this case, it should be used here.


Bug 2: Builder state corruption on Map/Array duplicate error (high severity)

In the same variant_ignore_duplicates method, self.schemas.remove(index) is called before returning Err:

return Err(Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into());

Vec::remove shifts all subsequent elements left, invalidating every index stored in variant_index and names. Because the method signature is -> Result<&mut Self, Error>, callers are allowed to catch the error and continue using the builder, which is now in an inconsistent state.

The fix is to clone instead of remove:

Details::GetUnionDuplicateArray(self.schemas[index].clone(), schema).into()

Bug 3: named_index is in alphabetical name order, not union order (medium severity)

build() does:

named_index: self.names.into_values().collect(),

self.names is a BTreeMap<Name, usize>, so into_values() yields values sorted by key (alphabetical by Name) rather than by insertion/union order.

In find_schema_with_known_schemata, tie-breaking between unnamed and named matches compares positions:

(Some((u_i, _)), Some((n_i, _))) if u_i < n_i => unnamed,

But if multiple named schemas could match a value, find on named_index returns the alphabetically-first name's schema position rather than the union-order-first position. The Avro spec requires the first matching type in a union to win, so this can produce incorrect results.

Fix: sort named_index by value before storing it:

let mut named_index: Vec<usize> = self.names.into_values().collect();
named_index.sort_unstable();

Design concern: PartialOrd + Ord derived on Name unnecessarily (low severity)

Ord was derived on Name solely to use it as a BTreeMap key in UnionSchemaBuilder. However, Name already implements Hash + Eq, and HashMap<Name, usize> is used elsewhere in the codebase (type Names = HashMap<Name, Schema>). Adding Ord to a widely-used type is an API change that could be avoided by using HashMap<Name, usize> for the builder's names field.


Missing test coverage

  • No test for the duplicate Array path in variant_ignore_duplicates -- this is how Bug 1 went undetected. A test should verify the error is GetUnionDuplicateArray, not GetUnionDuplicateMap.
  • No test for contains() -- the new public method has zero test coverage.
  • No test for the union-order priority scenario where a Value::Map could match both an unnamed Map schema and a named Record schema.

Minor nits

  1. Attribute ordering: #[doc(hidden)] appears before the doc comment on variant_ignore_duplicates. Rust convention is doc comment first, then #[doc(hidden)] immediately before pub fn.

  2. Asymmetric branch structure: variant_ignore_duplicates uses explicit if let Schema::Map / if let Schema::Array branches before the generic discriminant path, while variant() does not. The reason for the different treatment is not documented.

  3. Two // TODO: Do this without the clone comments inside find_schema_with_known_schemata represent allocations on a hot resolution path -- worth tracking as issues.


Summary

Issue Severity
Wrong error variant for array duplicates (GetUnionDuplicateMap used instead of GetUnionDuplicateArray) High
Builder state corruption via schemas.remove on error path High
named_index sorted by name, not union order Medium
Missing tests (array duplicate path, contains, ordering) Medium
Ord on Name avoidable with HashMap Low

The overall direction is a genuine improvement -- the builder API and better logical-type handling in schema matching are both valuable additions. The two high-severity bugs are straightforward to fix before merging.

@martin-augment
Copy link
Copy Markdown
Owner Author

640-650: Please add a regression test for the Option<Null> branch.

This new edge-case path is important and easy to regress; a focused test for generated schema behavior here would be valuable.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! The fix for supporting derivation of AvroSchema for Option<()> should also include a unit test, otherwise there is a chance that the feature/fix may regress in the future without being noticed.

@martin-augment
Copy link
Copy Markdown
Owner Author

Bug 1: Wrong error variant for duplicate arrays (high severity)

In variant_ignore_duplicates (union.rs line ~271), when a duplicate Schema::Array with a different element type is detected, the code emits GetUnionDuplicateMap instead of GetUnionDuplicateArray.

The error arm reads:

Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into()

but should use GetUnionDuplicateArray. This causes the error message to say "Found two different maps..." for an array conflict. Since GetUnionDuplicateArray was introduced in this PR specifically for this case, it should be used here.

value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! There is a copy/paste error in the returned error for duplicate Schema::Array's. Prevents confusing the developer by using a wrong error

@martin-augment
Copy link
Copy Markdown
Owner Author

Bug 2: Builder state corruption on Map/Array duplicate error (high severity)

In the same variant_ignore_duplicates method, self.schemas.remove(index) is called before returning Err:

return Err(Details::GetUnionDuplicateMap(self.schemas.remove(index), schema).into());

Vec::remove shifts all subsequent elements left, invalidating every index stored in variant_index and names. Because the method signature is -> Result<&mut Self, Error>, callers are allowed to catch the error and continue using the builder, which is now in an inconsistent state.

The fix is to clone instead of remove:

Details::GetUnionDuplicateArray(self.schemas[index].clone(), schema).into()

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The removal of the duplicate schema suggests that the caller could re-attempt to insert the schema after correcting it, but the indices (names and variant_index) should also be updated. Prevents leaving the UnionSchemaBuilder in inconsistent state.

@martin-augment
Copy link
Copy Markdown
Owner Author

Bug 3: named_index is in alphabetical name order, not union order (medium severity)

build() does:

named_index: self.names.into_values().collect(),

self.names is a BTreeMap<Name, usize>, so into_values() yields values sorted by key (alphabetical by Name) rather than by insertion/union order.

In find_schema_with_known_schemata, tie-breaking between unnamed and named matches compares positions:

(Some((u_i, _)), Some((n_i, _))) if u_i < n_i => unnamed,

But if multiple named schemas could match a value, find on named_index returns the alphabetically-first name's schema position rather than the union-order-first position. The Avro spec requires the first matching type in a union to win, so this can produce incorrect results.

Fix: sort named_index by value before storing it:

let mut named_index: Vec<usize> = self.names.into_values().collect();
named_index.sort_unstable();

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The iteration over the named schemas depends on the alphabetical order of the schemas' names, not their insertion order. This may lead to wrong resolving of named schema.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants