471: feat: Use HashSet instead of HashMap for AvroSchemaComponent #51

Open
martin-augment wants to merge 1 commit into main from pr-471-2026-02-19-11-14-29

Conversation

@martin-augment
Owner

471: To review by AI

@coderabbitai

coderabbitai bot commented Feb 19, 2026

Walkthrough

The changes refactor the schema tracking mechanism by replacing HashMap<Name, Schema> (the Names type alias) with HashSet<Name> across the avro serde derive and related modules. This affects function signatures in traits, public APIs, and helper functions that now accept &mut HashSet<Name> instead of &mut Names. HashMap operations like contains_key and keyed insert calls are replaced with equivalent HashSet operations. Corresponding imports are updated to include HashSet and remove Names. Test code is updated to initialize HashSet::new() instead of HashMap::new() for tracking.

@gemini-code-assist

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the internal mechanism for tracking named schemas within the AvroSchemaComponent trait and its implementations. By switching from a HashMap to a HashSet, the code more accurately represents the data structure's purpose, which is solely to keep track of schema names encountered during schema generation to prevent infinite recursion. This change improves clarity and potentially offers minor performance benefits by using a more specialized collection type for the task.

Highlights

  • Data Structure Optimization: Replaced HashMap<Name, Schema> with HashSet<Name> in AvroSchemaComponent trait methods and related functions. This change clarifies the intent of the data structure, as it was only used to track seen schema names for recursion prevention, not to store the full schema definitions.
  • API and Internal Logic Updates: Updated function signatures and internal logic across the avro and avro_derive crates to reflect the use of HashSet<Name> for managing named schemas, including adjustments to contains_key to contains and simplified insertion logic.
  • Test Suite Alignment: Modified existing tests to use HashSet::default() or HashSet::new() when calling get_schema_in_ctxt and get_record_fields_in_ctxt, ensuring compatibility with the updated API.
Changelog
  • avro/src/serde/derive.rs
    • Updated imports to include HashSet and removed Names type alias.
    • Modified get_schema_in_ctxt and get_record_fields_in_ctxt function signatures to accept &mut HashSet<Name>.
    • Adjusted logic for checking and inserting named schemas to use HashSet methods (contains and insert).
    • Removed clippy::map_entry attribute no longer relevant with HashSet.
  • avro/src/serde/with.rs
    • Imported HashSet in various modules (bytes, bytes_opt, fixed, fixed_opt, slice, slice_opt).
    • Updated function signatures in schema generation methods to use &mut HashSet<Name>.
    • Adapted schema tracking logic to use HashSet methods.
  • avro/tests/avro_schema_component.rs
    • Updated test cases to initialize named_schemas with HashSet::default() instead of HashMap::default().
  • avro/tests/get_record_fields.rs
    • Changed test initializations of named_schemas to use HashSet::new().
  • avro_derive/src/lib.rs
    • Modified the AvroSchemaComponent trait implementation in the derive macro to use HashSet<Name>.
    • Updated the handle_named_schemas function to align with HashSet behavior for schema name tracking.
    • Adjusted expected token streams in internal tests to reflect the HashSet changes.
  • avro_derive/tests/derive.rs
    • Added HashSet to imports.
    • Updated test functions to pass HashSet instances to schema generation methods.
Activity
  • The pull request was opened with a minimal description, indicating it's ready for review by AI.

@claude

claude bot commented Feb 19, 2026

PR Review: Use HashSet instead of HashMap for AvroSchemaComponent

Summary

This PR replaces &mut Names (HashMap<Name, Schema>) with &mut HashSet<Name> in the AvroSchemaComponent trait. The motivation is sound: the old map stored schema values that were never actually retrieved by AvroSchemaComponent implementations; only key existence was checked. The HashSet more accurately models the intent (cycle/recursion detection) and removes the footgun of callers worrying about what value to insert.
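As an illustration of why key existence alone is enough for recursion detection, here is a minimal sketch. It assumes simplified stand-in types: `Name` is a plain `String` alias, and `Schema` and `node_schema` are hypothetical and are not the real apache_avro definitions.

```rust
use std::collections::HashSet;

// Hypothetical stand-ins for illustration; not the real apache_avro types.
type Name = String;

#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Long,
    Ref(Name),
    Record { name: Name, fields: Vec<Schema> },
}

// Builds the schema of a self-referential "Node" record (a value plus a link
// to another Node). The set only records which names are already defined, so
// a repeated visit emits a reference instead of recursing forever.
fn node_schema(seen: &mut HashSet<Name>) -> Schema {
    let name: Name = "Node".to_string();
    if seen.contains(&name) {
        // Already defined (or currently being defined): refer to it by name.
        Schema::Ref(name)
    } else {
        // Mark the name before descending so the recursive call terminates.
        seen.insert(name.clone());
        let fields = vec![Schema::Long, node_schema(seen)];
        Schema::Record { name, fields }
    }
}

fn main() {
    let mut seen = HashSet::new();
    let schema = node_schema(&mut seen);
    println!("{schema:?}");
    println!("{seen:?}");
}
```

Nothing in this sketch ever reads a stored schema back, which is the property the summary above relies on.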


✅ What's Done Well

  1. Resolves a long-standing TODO: The derive macro had an explicit comment:

    // TODO: Breaking change to AvroSchemaComponent, have named_schemas be a set instead

    This PR completes that work.

  2. Clippy suppressions removed: The #[expect(clippy::map_entry, reason = "We don't use the value from the map")] attributes are gone because that lint targets the contains_key-then-insert pattern on maps, which no longer occurs once a HashSet is used. Less #[expect] noise is a win.

  3. Simplified derive macro output: handle_named_schemas in avro_derive/src/lib.rs is noticeably cleaner — two insert calls (placeholder + final) collapsed into one.

  4. Correctness is preserved: The get_record_fields_in_ctxt recursion-handling path correctly adapts. Temporarily removing a name, calling schema_fn (which will reinsert it via the else branch), then calling named_schemas.insert(name) is equivalent to the old remove/rebuild/reinsert flow.

  5. Tests updated consistently across all four changed test files.
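Regarding point 2, the sketch below contrasts the two tracking styles. It is only an illustration under assumptions: `mark_seen_map` and `mark_seen_set` are hypothetical helpers, the `String` value is a stand-in for the stored schema, and the exact conditions under which `clippy::map_entry` fires are described here as I understand them (a `contains_key` check followed by an `insert` on a map).

```rust
use std::collections::{HashMap, HashSet};

// Map-based tracking: this check-then-insert shape is the kind of pattern
// clippy::map_entry is aimed at, even when the stored value is never read.
fn mark_seen_map(seen: &mut HashMap<String, String>, name: &str) -> bool {
    if seen.contains_key(name) {
        false
    } else {
        seen.insert(name.to_string(), "placeholder".to_string());
        true
    }
}

// Set-based tracking: `insert` already reports whether the name was new,
// so there is no separate value to choose and no map pattern to lint.
fn mark_seen_set(seen: &mut HashSet<String>, name: &str) -> bool {
    seen.insert(name.to_string())
}

fn main() {
    let mut map = HashMap::new();
    let mut set = HashSet::new();
    for name in ["Record", "Enum", "Record"] {
        let a = mark_seen_map(&mut map, name);
        let b = mark_seen_set(&mut set, name);
        println!("{name}: map says new={a}, set says new={b}");
    }
}
```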


⚠️ Issues / Suggestions

1. Breaking Public API — Not Called Out

AvroSchemaComponent is a public trait. The method signatures changed from:

fn get_schema_in_ctxt(named_schemas: &mut Names, ...) -> Schema

to:

fn get_schema_in_ctxt(named_schemas: &mut HashSet<Name>, ...) -> Schema

Any downstream consumer who manually implemented the trait (without the derive macro) will get a compile error. The PR title says feat: — this should be feat!: or explicitly marked as a breaking change. A CHANGELOG entry is needed.

2. No Type Alias for the New Parameter

Names = HashMap<Name, Schema> is still defined in avro/src/schema/name.rs (used internally for JSON schema parsing). The AvroSchemaComponent trait interface now uses the raw HashSet<Name> directly.

Consider adding a type alias to make the intent clear and provide a stable surface for potential future changes:

/// Names that have already been seen during schema construction, used for cycle detection.
pub type SchemaContext = HashSet<Name>;

Without this, users implementing the trait manually must know to import both std::collections::HashSet and apache_avro::schema::Name — as shown by the doc examples which now need two additional use lines.

3. Redundant Re-insert in get_record_fields_in_ctxt

In the Schema::Ref branch of get_record_fields_in_ctxt:

assert!(named_schemas.remove(&name), ...);
let schema = schema_fn(named_schemas, enclosing_namespace);
named_schemas.insert(name); // <-- this line

Since schema_fn for any named type will always insert name back into the set (via its else branch), the final named_schemas.insert(name) is a no-op. It's harmless but could be removed or at least annotated with a comment explaining why it's there (or confirming it can be dropped). Same applies to the later occurrence:

// The schema is used, so reinsert it
named_schemas.insert(name.clone());

4. Doc Test Ergonomics

The doc examples now need:

# use apache_avro::{..., schema::{Name, ...}};
# use std::collections::HashSet;

This is a minor usability regression for users copying the example. A type alias (point 2) would eliminate the HashSet import in the examples.


Correctness Notes

The behavioral equivalence holds because:

  • Old code: contains_key → Schema::Ref if present; otherwise insert placeholder, build, replace placeholder
  • New code: contains → Schema::Ref if present; otherwise insert name, build and return

Neither path ever retrieved the stored Schema value from the old map. The only exception was get_record_fields_in_ctxt's Schema::Ref path which stored and restored a temporary, but the logic is equivalent in the new version.
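The two flows can be compared side by side in a small sketch. The types are simplified stand-ins, `old_flow` and `new_flow` are hypothetical names, and using `Schema::Null` as the placeholder is an assumption made for illustration; the actual placeholder in the crate is not shown in this thread.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical stand-ins for illustration; not the real apache_avro types.
type Name = String;

#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Null,
    Ref(Name),
    Record(Name),
}

// Old flow: insert a placeholder, build, then overwrite the placeholder.
fn old_flow(named: &mut HashMap<Name, Schema>, name: &str) -> Schema {
    if named.contains_key(name) {
        Schema::Ref(name.to_string())
    } else {
        // Assumed placeholder value; it is never read back.
        named.insert(name.to_string(), Schema::Null);
        let schema = Schema::Record(name.to_string());
        named.insert(name.to_string(), schema.clone());
        schema
    }
}

// New flow: insert the name once, then build and return.
fn new_flow(named: &mut HashSet<Name>, name: &str) -> Schema {
    if named.contains(name) {
        Schema::Ref(name.to_string())
    } else {
        named.insert(name.to_string());
        Schema::Record(name.to_string())
    }
}

fn main() {
    let mut map = HashMap::new();
    let mut set = HashSet::new();
    // The third lookup repeats "A", so both flows should return a reference.
    for name in ["A", "B", "A"] {
        assert_eq!(old_flow(&mut map, name), new_flow(&mut set, name));
    }
    println!("both flows agree");
}
```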


Verdict

The core change is correct and a meaningful simplification. The main asks before merging:

  1. Add a CHANGELOG entry and mark this as a breaking change
  2. Consider a public type alias to smooth the API surface for manual implementors

@coderabbitai coderabbitai bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
avro/src/serde/derive.rs (1)

310-335: ⚠️ Potential issue | 🟠 Major

Public trait signature change — semver-breaking change without major version bump.

The AvroSchemaComponent trait is public and its method signatures changed from &mut Names (HashMap<Name, Schema>) to &mut HashSet<Name>. This is a semver-breaking change for any external crate implementing this trait. However, the version was bumped to 0.22.0 (minor), which violates semantic versioning for breaking API changes—a major version bump is required. Additionally, the CHANGELOG contains no documentation of this breaking change or migration guidance for external implementors.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/serde/derive.rs` around lines 310 - 335, The public trait
AvroSchemaComponent broke its external API by changing parameter types from &mut
Names (HashMap<Name, Schema>) to &mut HashSet<Name> for get_schema_in_ctxt and
get_record_fields_in_ctxt; restore backward compatibility by reintroducing the
original method signatures (using &mut Names / HashMap<Name, Schema>) as either
the primary methods or as deprecated wrappers that forward to the new
HashSet-based implementations (e.g., add deprecated get_schema_in_ctxt(&mut
Names, enclosing_namespace: &Namespace) and
get_record_fields_in_ctxt(first_field_position, &mut Names, enclosing_namespace)
that convert/derive the needed HashSet<Name> and call the new impl), update
CHANGELOG with migration guidance, and ensure the package version is bumped to a
major version if you keep the breaking change instead of restoring
compatibility.

@augmentcode

augmentcode bot commented Feb 19, 2026

🤖 Augment PR Summary

Summary: This PR changes the schema-derivation context for AvroSchemaComponent from schema::Names (HashMap<Name, Schema>) to a lighter HashSet<Name> that tracks which named types have already been defined/seen.

Changes:

  • Updated AvroSchemaComponent::get_schema_in_ctxt and get_record_fields_in_ctxt signatures (and their docs/examples) to use &mut HashSet<Name>.
  • Adjusted the serde::get_record_fields_in_ctxt helper and the default AvroSchema impl to construct schemas using a HashSet context.
  • Updated serde::with helper modules (bytes, fixed, slice and *_opt variants) to match the new context type.
  • Modified apache_avro_derive code generation to emit HashSet<Name>-based implementations and simplified recursion bookkeeping (no longer storing schemas in the context).
  • Refreshed affected unit tests and derive tests to use HashSet and validate recursion/non-recursion behavior.

Technical Notes: The context now represents “names already defined/seen” for reference emission and recursion prevention, rather than a map used to cache/lookup full schema definitions.

@augmentcode augmentcode bot left a comment

Review completed. 1 suggestions posted.

"Name '{name}' should exist in `named_schemas` otherwise Ref is invalid: {named_schemas:?}"
);
// Get the schema
let schema = schema_fn(named_schemas, enclosing_namespace);

In this Schema::Ref branch, calling schema_fn after temporarily removing name can add additional entries to named_schemas (since schema_fn may insert other names while rebuilding the schema). Is that intended given the helper’s goal of not “polluting” named_schemas / avoiding duplicate-name side effects?

Severity: low
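The concern can be reproduced in a reduced sketch. Everything here is hypothetical: `build_outer` stands in for a `schema_fn` whose record contains a second named type, `rebuild` mimics the remove/rebuild/reinsert sequence of the Schema::Ref branch, and the starting context (only "Outer" present) is an assumption chosen to make the side effect visible.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for illustration; not the real apache_avro type.
type Name = String;

// Stand-in for a schema_fn building a record "Outer" that contains a named
// type "Inner". It returns the names it newly defined instead of a schema.
fn build_outer(seen: &mut HashSet<Name>) -> Vec<Name> {
    let mut defined = Vec::new();
    for name in ["Outer", "Inner"] {
        if seen.insert(name.to_string()) {
            defined.push(name.to_string());
        }
    }
    defined
}

// Mimics the Ref branch: drop the name, rebuild, then reinsert the name.
fn rebuild(seen: &mut HashSet<Name>) -> Vec<Name> {
    assert!(seen.remove("Outer"), "Outer must be present for a Ref");
    let defined = build_outer(seen);
    seen.insert("Outer".to_string());
    defined
}

fn main() {
    // Assumed starting context: "Outer" was seen, "Inner" was not.
    let mut seen: HashSet<Name> = HashSet::from(["Outer".to_string()]);
    let defined = rebuild(&mut seen);
    // "Inner" is now part of the context as a side effect of the rebuild.
    println!("defined during rebuild: {defined:?}, context size: {}", seen.len());
}
```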

Owner Author

value:good-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct! The Name of the Schema is inserted twice now but the first insertion happens in the default implementation of the trait method and there is no guarantee that the application developer will not provide a custom implementation of this method. It is better to be on the safe side and call the .insert() method twice than to leave this responsibility to the application developer and receive bug reports when they forget to do it.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the AvroSchemaComponent trait and its implementations to use HashSet<Name> instead of HashMap<Name, Schema> for tracking named schemas during schema generation. This change is a significant improvement, as it employs a more semantically appropriate data structure for tracking seen schema names to handle recursive types. The value part of the HashMap was not essential for this purpose, and using a HashSet makes the intent clearer. The changes are consistently applied across the avro and avro_derive crates, including trait definitions, implementations for various types, derive macros, and tests. This refactoring also leads to cleaner code, for instance by removing #[expect(clippy::map_entry)] attributes. The implementation appears correct and well-executed.

@martin-augment
Owner Author

310-335: ⚠️ Potential issue | 🟠 Major

Public trait signature change — semver-breaking change without major version bump.

The AvroSchemaComponent trait is public and its method signatures changed from &mut Names (HashMap<Name, Schema>) to &mut HashSet<Name>. This is a semver-breaking change for any external crate implementing this trait. However, the version was bumped to 0.22.0 (minor), which violates semantic versioning for breaking API changes—a major version bump is required. Additionally, the CHANGELOG contains no documentation of this breaking change or migration guidance for external implementors.

value:annoying; category:bug; feedback: The CodeRabbit AI reviewer is not correct! The apache-avro crate has not reached v1.0.0 yet, so such API breaks are allowed without waiting for a new major release.

@martin-augment
Owner Author

1. Breaking Public API — Not Called Out

AvroSchemaComponent is a public trait. The method signatures changed from:

fn get_schema_in_ctxt(named_schemas: &mut Names, ...) -> Schema

to:

fn get_schema_in_ctxt(named_schemas: &mut HashSet<Name>, ...) -> Schema

Any downstream consumer who manually implemented the trait (without the derive macro) will get a compile error. The PR title says feat: — this should be feat!: or explicitly marked as a breaking change. A CHANGELOG entry is needed.

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The apache-avro crate has not reached v1.0.0 yet, so such API breaks are allowed without waiting for a new major release.

@martin-augment
Owner Author

3. Redundant Re-insert in get_record_fields_in_ctxt

In the Schema::Ref branch of get_record_fields_in_ctxt:

assert!(named_schemas.remove(&name), ...);
let schema = schema_fn(named_schemas, enclosing_namespace);
named_schemas.insert(name); // <-- this line

Since schema_fn for any named type will always insert name back into the set (via its else branch), the final named_schemas.insert(name) is a no-op. It's harmless but could be removed or at least annotated with a comment explaining why it's there (or confirming it can be dropped). Same applies to the later occurrence:

// The schema is used, so reinsert it
named_schemas.insert(name.clone());

value:good-but-wont-fix; category:bug; feedback: The Claude AI reviewer is correct! The Name of the Schema is inserted twice now but the first insertion happens in the default implementation of the trait method and there is no guarantee that the application developer will not provide a custom implementation of this method. It is better to be on the safe side and call the .insert() method twice than to leave this responsibility to the application developer and receive bug reports when they forget to do it.
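The trade-off described in this feedback can be sketched as follows. The helpers are hypothetical: `careful_builder` plays the role of an implementation that reinserts the name itself, `forgetful_builder` plays a custom implementation that does not, and `rebuild_with` mimics the remove/rebuild/reinsert sequence rather than reproducing the crate's code.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for illustration; not the real apache_avro type.
type Name = String;

// A well-behaved builder reinserts the name it defines.
fn careful_builder(seen: &mut HashSet<Name>) {
    seen.insert("User".to_string());
}

// A custom implementation that forgets to track the name.
fn forgetful_builder(_seen: &mut HashSet<Name>) {}

// Mimics the Ref branch: remove the name, rebuild, then reinsert defensively.
// Returns whether the final insert actually added the name.
fn rebuild_with(seen: &mut HashSet<Name>, build: fn(&mut HashSet<Name>)) -> bool {
    assert!(seen.remove("User"), "name must be present for a Ref");
    build(seen);
    seen.insert("User".to_string())
}

fn main() {
    let mut seen: HashSet<Name> = HashSet::from(["User".to_string()]);
    // With a careful builder the final insert changes nothing.
    println!("careful: final insert added = {}", rebuild_with(&mut seen, careful_builder));
    // With a forgetful builder the final insert restores the name.
    println!("forgetful: final insert added = {}", rebuild_with(&mut seen, forgetful_builder));
    println!("still tracked: {}", seen.contains("User"));
}
```

Since `HashSet::insert` on an existing element leaves the set unchanged and returns `false`, the extra call has no effect in the common case while still keeping the context consistent in the second one.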
