471: feat: Use `HashSet` instead of `HashMap` for `AvroSchemaComponent` by martin-augment · Pull Request #51 · martin-augment/avro-rs

martin-augment · 2026-02-19T11:14:33Z

471: To review by AI

coderabbitai · 2026-02-19T11:14:49Z

Walkthrough

The changes refactor the schema tracking mechanism by replacing HashMap<Name, Schema> (the Names type alias) with HashSet<Name> across the avro serde derive and related modules. This affects function signatures in traits, public APIs, and helper functions that now accept &mut HashSet<Name> instead of &mut Names. HashMap operations like contains_key and keyed insert calls are replaced with equivalent HashSet operations. Corresponding imports are updated to include HashSet and remove Names. Test code is updated to initialize HashSet::new() instead of HashMap::new() for tracking.


gemini-code-assist · 2026-02-19T11:14:53Z

Summary of Changes

Hello @martin-augment, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the internal mechanism for tracking named schemas within the AvroSchemaComponent trait and its implementations. By switching from a HashMap to a HashSet, the code more accurately represents the data structure's purpose, which is solely to keep track of schema names encountered during schema generation to prevent infinite recursion. This change improves clarity and potentially offers minor performance benefits by using a more specialized collection type for the task.

Highlights

Data Structure Optimization: Replaced HashMap<Name, Schema> with HashSet<Name> in AvroSchemaComponent trait methods and related functions. This change clarifies the intent of the data structure, as it was only used to track seen schema names for recursion prevention, not to store the full schema definitions.
API and Internal Logic Updates: Updated function signatures and internal logic across the avro and avro_derive crates to use HashSet<Name> for managing named schemas, including replacing contains_key with contains and simplifying the insertion logic.
Test Suite Alignment: Modified existing tests to use HashSet::default() or HashSet::new() when calling get_schema_in_ctxt and get_record_fields_in_ctxt, ensuring compatibility with the updated API.
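
To illustrate the recursion tracking described in the highlights, here is a minimal sketch. `Name`, `Schema`, and `node_schema` are simplified stand-ins invented for this example and are not the apache_avro types or functions; the sketch only assumes that the set records names already seen.

```rust
use std::collections::HashSet;

// Simplified stand-ins for illustration; these are NOT the apache_avro types.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Name(String);

#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Long,
    Ref(Name),
    Record { name: Name, fields: Vec<Schema> },
}

/// Builds the schema for a hypothetical self-referential `Node` record
/// while tracking already-seen names in a set.
fn node_schema(named_schemas: &mut HashSet<Name>) -> Schema {
    let name = Name("Node".to_string());
    if named_schemas.contains(&name) {
        // The name is already known: emit a reference instead of recursing.
        Schema::Ref(name)
    } else {
        // Record the name before recursing so nested uses become references.
        named_schemas.insert(name.clone());
        let fields = vec![Schema::Long, node_schema(named_schemas)];
        Schema::Record { name, fields }
    }
}
```

Only the presence of the name matters here, which is why a set is enough for this purpose.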

Changelog

avro/src/serde/derive.rs
- Updated imports to include HashSet and removed Names type alias.
- Modified get_schema_in_ctxt and get_record_fields_in_ctxt function signatures to accept &mut HashSet<Name>.
- Adjusted logic for checking and inserting named schemas to use HashSet methods (contains and insert).
- Removed clippy::map_entry attribute no longer relevant with HashSet.
avro/src/serde/with.rs
- Imported HashSet in various modules (bytes, bytes_opt, fixed, fixed_opt, slice, slice_opt).
- Updated function signatures in schema generation methods to use &mut HashSet<Name>.
- Adapted schema tracking logic to use HashSet methods.
avro/tests/avro_schema_component.rs
- Updated test cases to initialize named_schemas with HashSet::default() instead of HashMap::default().
avro/tests/get_record_fields.rs
- Changed test initializations of named_schemas to use HashSet::new().
avro_derive/src/lib.rs
- Modified the AvroSchemaComponent trait implementation in the derive macro to use HashSet<Name>.
- Updated the handle_named_schemas function to align with HashSet behavior for schema name tracking.
- Adjusted expected token streams in internal tests to reflect the HashSet changes.
avro_derive/tests/derive.rs
- Added HashSet to imports.
- Updated test functions to pass HashSet instances to schema generation methods.

Activity

The pull request was opened with a minimal description, indicating it's ready for review by AI.


claude · 2026-02-19T11:17:58Z

PR Review: Use `HashSet` instead of `HashMap` for `AvroSchemaComponent`

Summary

This PR replaces &mut Names (HashMap<Name, Schema>) with &mut HashSet<Name> in the AvroSchemaComponent trait. The motivation is sound: the old map stored schema values that were never actually retrieved during AvroSchemaComponent schema generation; only key existence was checked. The HashSet more accurately models the intent (cycle/recursion detection) and removes the footgun of callers having to decide what value to insert.

✅ What's Done Well

Resolves a long-standing TODO: The derive macro had an explicit comment:
```
// TODO: Breaking change to AvroSchemaComponent, have named_schemas be a set instead
```
This PR completes that work.
Clippy suppressions removed: The #[expect(clippy::map_entry, reason = "We don't use the value from the map")] attributes are gone because the map_entry lint targets the contains_key-then-insert pattern on maps and does not apply to HashSet. Less #[expect] noise is a win.
Simplified derive macro output: handle_named_schemas in avro_derive/src/lib.rs is noticeably cleaner — two insert calls (placeholder + final) collapsed into one.
Correctness is preserved: The get_record_fields_in_ctxt recursion-handling path correctly adapts. Temporarily removing a name, calling schema_fn (which will reinsert it via the else branch), then calling named_schemas.insert(name) is equivalent to the old remove/rebuild/reinsert flow.
Tests updated consistently across all four changed test files.

⚠️ Issues / Suggestions

1. Breaking Public API — Not Called Out

AvroSchemaComponent is a public trait. The method signatures changed from:

fn get_schema_in_ctxt(named_schemas: &mut Names, ...) -> Schema

to:

fn get_schema_in_ctxt(named_schemas: &mut HashSet<Name>, ...) -> Schema

Any downstream consumer who manually implemented the trait (without the derive macro) will get a compile error. The PR title says feat: — this should be feat!: or explicitly marked as a breaking change. A CHANGELOG entry is needed.

2. No Type Alias for the New Parameter

Names = HashMap<Name, Schema> is still defined in avro/src/schema/name.rs (used internally for JSON schema parsing). The AvroSchemaComponent trait interface now uses the raw HashSet<Name> directly.

Consider adding a type alias to make the intent clear and provide a stable surface for potential future changes:

/// Names that have already been seen during schema construction, used for cycle detection.
pub type SchemaContext = HashSet<Name>;

Without this, users implementing the trait manually must know to import both std::collections::HashSet and apache_avro::schema::Name — as shown by the doc examples which now need two additional use lines.
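
A minimal sketch of how such an alias could read at a call site. `SchemaContext` is the name proposed in this review, not an existing apache_avro item, and `Name` and `mark_seen` are stand-ins invented for the example.

```rust
use std::collections::HashSet;

// Simplified stand-in for illustration; NOT the apache_avro `Name` type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Name(String);

/// Names that have already been seen during schema construction.
/// The alias name is the one suggested in the review (hypothetical).
type SchemaContext = HashSet<Name>;

/// Hypothetical helper: records a name and reports whether it was new.
fn mark_seen(ctx: &mut SchemaContext, name: &str) -> bool {
    ctx.insert(Name(name.to_string()))
}
```

Because a type alias is transparent, callers could still pass a plain `HashSet<Name>`; the alias would mainly save an import and document intent.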

3. Redundant Re-insert in `get_record_fields_in_ctxt`

In the Schema::Ref branch of get_record_fields_in_ctxt:

assert!(named_schemas.remove(&name), ...);
let schema = schema_fn(named_schemas, enclosing_namespace);
named_schemas.insert(name); // <-- this line

Since schema_fn for any named type will always insert name back into the set (via its else branch), the final named_schemas.insert(name) is a no-op. It's harmless but could be removed or at least annotated with a comment explaining why it's there (or confirming it can be dropped). Same applies to the later occurrence:

// The schema is used, so reinsert it
named_schemas.insert(name.clone());
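
The following sketch mimics the shape of that remove/rebuild/reinsert sequence to show when the final insert is a no-op. All function names are hypothetical and the set holds plain strings for brevity; it is not the code from the PR.

```rust
use std::collections::HashSet;

/// Mimics the shape of the `Schema::Ref` branch: remove the name, rebuild via
/// `schema_fn`, then reinsert. Returns whether the final insert changed the set.
fn rebuild_with_reinsert(
    named_schemas: &mut HashSet<String>,
    name: &str,
    schema_fn: impl Fn(&mut HashSet<String>),
) -> bool {
    assert!(named_schemas.remove(name), "name must be present for a Ref");
    schema_fn(named_schemas);
    // `HashSet::insert` returns false if the value was already present.
    named_schemas.insert(name.to_string())
}

/// A builder that records its own name again while rebuilding.
fn well_behaved(named_schemas: &mut HashSet<String>) {
    named_schemas.insert("Node".to_string());
}

/// A hypothetical custom builder that forgets to record its name.
fn forgetful(_named_schemas: &mut HashSet<String>) {}
```

With `well_behaved` the final insert returns `false` (nothing changes); with `forgetful` it returns `true` and restores the name, so the extra call only matters when an implementation does not reinsert the name itself.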

4. Doc Test Ergonomics

The doc examples now need:

# use apache_avro::{..., schema::{Name, ...}};
# use std::collections::HashSet;

This is a minor usability regression for users copying the example. A type alias (point 2) would eliminate the HashSet import in the examples.

Correctness Notes

The behavioral equivalence holds because:

Old code: contains_key → Schema::Ref if present; otherwise insert placeholder, build, replace placeholder
New code: contains → Schema::Ref if present; otherwise insert name, build and return

Neither path ever retrieved the stored Schema value from the old map. The only exception was get_record_fields_in_ctxt's Schema::Ref path which stored and restored a temporary, but the logic is equivalent in the new version.
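
A side-by-side sketch of the two flows, under simplifying assumptions: `Name` and `Schema` are stand-ins rather than the apache_avro types, and `Schema::Null` as the placeholder value is an assumption made for illustration.

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins for illustration; these are NOT the apache_avro types.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Name(String);

#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Null,
    Ref(Name),
    Record(Name),
}

/// Old shape: a map keyed by name whose stored value is never read back.
fn old_flow(named_schemas: &mut HashMap<Name, Schema>, name: Name) -> Schema {
    if named_schemas.contains_key(&name) {
        Schema::Ref(name)
    } else {
        // Placeholder so nested lookups see the name while building.
        named_schemas.insert(name.clone(), Schema::Null);
        let schema = Schema::Record(name.clone());
        // Replace the placeholder with the final schema.
        named_schemas.insert(name, schema.clone());
        schema
    }
}

/// New shape: a set of names with a single insert.
fn new_flow(named_schemas: &mut HashSet<Name>, name: Name) -> Schema {
    if named_schemas.contains(&name) {
        Schema::Ref(name)
    } else {
        named_schemas.insert(name.clone());
        Schema::Record(name)
    }
}
```

Both functions return a full definition on the first call for a name and a reference on later calls.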

Verdict

The core change is correct and a meaningful simplification. The main asks before merging:

Add a CHANGELOG entry and mark this as a breaking change
Consider a public type alias to smooth the API surface for manual implementors

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

avro/src/serde/derive.rs (1)
310-335: ⚠️ Potential issue | 🟠 Major

Public trait signature change — semver-breaking change without major version bump.

The AvroSchemaComponent trait is public and its method signatures changed from &mut Names (HashMap<Name, Schema>) to &mut HashSet<Name>. This is a semver-breaking change for any external crate implementing this trait. However, the version was bumped to 0.22.0 (minor), which violates semantic versioning for breaking API changes—a major version bump is required. Additionally, the CHANGELOG contains no documentation of this breaking change or migration guidance for external implementors.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/serde/derive.rs` around lines 310 - 335, The public trait
AvroSchemaComponent broke its external API by changing parameter types from &mut
Names (HashMap<Name, Schema>) to &mut HashSet<Name> for get_schema_in_ctxt and
get_record_fields_in_ctxt; restore backward compatibility by reintroducing the
original method signatures (using &mut Names / HashMap<Name, Schema>) as either
the primary methods or as deprecated wrappers that forward to the new
HashSet-based implementations (e.g., add deprecated get_schema_in_ctxt(&mut
Names, enclosing_namespace: &Namespace) and
get_record_fields_in_ctxt(first_field_position, &mut Names, enclosing_namespace)
that convert/derive the needed HashSet<Name> and call the new impl), update
CHANGELOG with migration guidance, and ensure the package version is bumped to a
major version if you keep the breaking change instead of restoring
compatibility.
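
For reference, a deprecated forwarding wrapper of the kind this prompt describes might look roughly like the sketch below. Everything here is hypothetical: the types are simplified stand-ins, the function names are invented, and only the top-level name is written back to the map because the set carries no schemas for nested names.

```rust
use std::collections::{HashMap, HashSet};

// Simplified stand-ins for illustration; these are NOT the apache_avro types.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Name(String);

#[derive(Debug, Clone, PartialEq)]
enum Schema {
    Ref(Name),
    Record(Name),
}

type Names = HashMap<Name, Schema>;

/// New-style function: tracks seen names in a set.
fn schema_in_ctxt(named_schemas: &mut HashSet<Name>, name: Name) -> Schema {
    if named_schemas.insert(name.clone()) {
        Schema::Record(name)
    } else {
        Schema::Ref(name)
    }
}

/// Hypothetical deprecated wrapper keeping the old map-based signature.
#[deprecated(note = "pass a HashSet<Name> to schema_in_ctxt instead")]
fn schema_in_ctxt_with_names(names: &mut Names, name: Name) -> Schema {
    // Derive the set of seen names from the map's keys.
    let mut seen: HashSet<Name> = names.keys().cloned().collect();
    let schema = schema_in_ctxt(&mut seen, name.clone());
    // Record the top-level name so later calls on the same map see it.
    names.entry(name).or_insert_with(|| schema.clone());
    schema
}
```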


augmentcode · 2026-02-19T11:24:34Z

🤖 Augment PR Summary

Summary: This PR changes the schema-derivation context for AvroSchemaComponent from schema::Names (HashMap<Name, Schema>) to a lighter HashSet<Name> that tracks which named types have already been defined/seen.

Changes:

Updated AvroSchemaComponent::get_schema_in_ctxt and get_record_fields_in_ctxt signatures (and their docs/examples) to use &mut HashSet<Name>.
Adjusted the serde::get_record_fields_in_ctxt helper and the default AvroSchema impl to construct schemas using a HashSet context.
Updated serde::with helper modules (bytes, fixed, slice and *_opt variants) to match the new context type.
Modified apache_avro_derive code generation to emit HashSet<Name>-based implementations and simplified recursion bookkeeping (no longer storing schemas in the context).
Refreshed affected unit tests and derive tests to use HashSet and validate recursion/non-recursion behavior.

Technical Notes: The context now represents “names already defined/seen” for reference emission and recursion prevention, rather than a map used to cache/lookup full schema definitions.


augmentcode

Review completed. 1 suggestion posted.

augmentcode · 2026-02-19T11:24:35Z

avro/src/serde/derive.rs

+                "Name '{name}' should exist in `named_schemas` otherwise Ref is invalid: {named_schemas:?}"
+            );
            // Get the schema
            let schema = schema_fn(named_schemas, enclosing_namespace);


In this Schema::Ref branch, calling schema_fn after temporarily removing name can add additional entries to named_schemas (since schema_fn may insert other names while rebuilding the schema). Is that intended given the helper’s goal of not “polluting” named_schemas / avoiding duplicate-name side effects?
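
A small sketch of the side effect in question, assuming a hypothetical `Outer` record whose builder also registers an `Inner` record, and assuming the set contains only `Outer` beforehand. The function names are invented and the set holds plain strings for brevity.

```rust
use std::collections::HashSet;

/// Hypothetical builder for an `Outer` record that contains an `Inner` record.
fn outer_schema_fn(named_schemas: &mut HashSet<String>) {
    named_schemas.insert("Outer".to_string());
    named_schemas.insert("Inner".to_string());
}

/// Returns the names present after the rebuild that were not present before it.
fn names_added_by_rebuild(
    named_schemas: &mut HashSet<String>,
    name: &str,
    schema_fn: fn(&mut HashSet<String>),
) -> Vec<String> {
    let before = named_schemas.clone();
    named_schemas.remove(name);
    schema_fn(named_schemas);
    named_schemas.insert(name.to_string());
    let mut added: Vec<String> = named_schemas.difference(&before).cloned().collect();
    added.sort();
    added
}
```

If `Inner` was already in the set before the rebuild, nothing new is added; the extra entry only appears when the nested name was missing.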

Severity: low


value:good-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct! The Name of the Schema is inserted twice now but the first insertion happens in the default implementation of the trait method and there is no guarantee that the application developer will not provide a custom implementation of this method. It is better to be on the safe side and call the .insert() method twice than to leave this responsibility to the application developer and receive bug reports when they forget to do it.

gemini-code-assist

Code Review

This pull request refactors the AvroSchemaComponent trait and its implementations to use HashSet<Name> instead of HashMap<Name, Schema> for tracking named schemas during schema generation. This change is a significant improvement, as it employs a more semantically appropriate data structure for tracking seen schema names to handle recursive types. The value part of the HashMap was not essential for this purpose, and using a HashSet makes the intent clearer. The changes are consistently applied across the avro and avro_derive crates, including trait definitions, implementations for various types, derive macros, and tests. This refactoring also leads to cleaner code, for instance by removing #[expect(clippy::map_entry)] attributes. The implementation appears correct and well-executed.

martin-augment · 2026-02-19T12:39:48Z

310-335: ⚠️ Potential issue | 🟠 Major

Public trait signature change — semver-breaking change without major version bump.

The AvroSchemaComponent trait is public and its method signatures changed from &mut Names (HashMap<Name, Schema>) to &mut HashSet<Name>. This is a semver-breaking change for any external crate implementing this trait. However, the version was bumped to 0.22.0 (minor), which violates semantic versioning for breaking API changes—a major version bump is required. Additionally, the CHANGELOG contains no documentation of this breaking change or migration guidance for external implementors.

value:annoying; category:bug; feedback: The CodeRabbit AI reviewer is not correct! The apache-avro crate has not reached v1.0.0 yet, so such API breaks are allowed without waiting for a new major release.
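
As background for the pre-1.0 argument: Cargo's default (caret) requirements treat a change in the minor number of a 0.y.z version as incompatible. The sketch below is a simplified model of that rule that ignores pre-release tags; `caret_compatible` is a function written for this illustration, and 0.21.0 is only a hypothetical earlier version.

```rust
/// Returns true if `candidate` satisfies the caret requirement `^req`,
/// following Cargo's rule that the leftmost non-zero component must match.
/// Simplified: pre-release and build metadata are not modelled.
fn caret_compatible(req: (u64, u64, u64), candidate: (u64, u64, u64)) -> bool {
    // Tuples compare lexicographically, so this rejects older versions.
    if candidate < req {
        return false;
    }
    match req {
        // ^0.0.z allows only exactly 0.0.z
        (0, 0, _) => candidate == req,
        // ^0.y.z allows >=0.y.z, <0.(y+1).0
        (0, minor, _) => candidate.0 == 0 && candidate.1 == minor,
        // ^x.y.z allows >=x.y.z, <(x+1).0.0
        (major, _, _) => candidate.0 == major,
    }
}
```

Under this rule a dependent that requires `0.21` would not be upgraded to 0.22.0 automatically, so the minor bump itself signals the break.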

martin-augment · 2026-02-19T12:40:21Z

1. Breaking Public API — Not Called Out

AvroSchemaComponent is a public trait. The method signatures changed from:
fn get_schema_in_ctxt(named_schemas: &mut Names, ...) -> Schema
to:
fn get_schema_in_ctxt(named_schemas: &mut HashSet<Name>, ...) -> Schema
Any downstream consumer who manually implemented the trait (without the derive macro) will get a compile error. The PR title says feat: — this should be feat!: or explicitly marked as a breaking change. A CHANGELOG entry is needed.

value:annoying; category:bug; feedback: The Claude AI reviewer is not correct! The apache-avro crate has not reached v1.0.0 yet, so such API breaks are allowed without waiting for a new major release.

martin-augment · 2026-02-19T12:41:24Z

3. Redundant Re-insert in get_record_fields_in_ctxt

In the Schema::Ref branch of get_record_fields_in_ctxt:
assert!(named_schemas.remove(&name), ...);
let schema = schema_fn(named_schemas, enclosing_namespace);
named_schemas.insert(name); // <-- this line
Since schema_fn for any named type will always insert name back into the set (via its else branch), the final named_schemas.insert(name) is a no-op. It's harmless but could be removed or at least annotated with a comment explaining why it's there (or confirming it can be dropped). Same applies to the later occurrence:
// The schema is used, so reinsert it
named_schemas.insert(name.clone());

value:good-but-wont-fix; category:bug; feedback: The Claude AI reviewer is correct! The Name of the Schema is inserted twice now but the first insertion happens in the default implementation of the trait method and there is no guarantee that the application developer will not provide a custom implementation of this method. It is better to be on the safe side and call the .insert() method twice than to leave this responsibility to the application developer and receive bug reports when they forget to do it.

feat: Use HashSet instead of HashMap for AvroSchemaComponent

6724144
