499: feat: Replace to_datum* functions with GenericDatumWriter #60

Open
martin-augment wants to merge 2 commits into main from pr-499-2026-03-04-06-09-10

Conversation

@martin-augment
Owner

499: To review by AI

@coderabbitai

coderabbitai bot commented Mar 4, 2026

Walkthrough

This change refactors the Avro writing API by introducing a new public GenericDatumWriter<'s> struct with a builder-based construction pattern. The struct provides methods for serializing values to writers or byte vectors, including support for both regular values and serde-compatible types. Legacy functions (to_avro_datum, write_avro_datum_ref, to_avro_datum_schemata) are deprecated but maintained for backwards compatibility. The writer module is made public, and all internal usage across tests, benchmarks, and implementation files is migrated to the new builder pattern. New benchmark variants measure performance of the reused writer and no-validation approaches.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the Avro data writing mechanism by introducing a new GenericDatumWriter. This change aims to provide a more robust, efficient, and configurable API for serializing Avro values, moving away from the previously used standalone functions. The update also includes corresponding benchmark adjustments and deprecation notices for the older writing methods, ensuring a clear path for migration to the new, improved writer.

Highlights

  • New GenericDatumWriter Introduced: A new GenericDatumWriter struct has been added, providing a more flexible and configurable way to write Avro data with a builder pattern. This writer supports options like schema validation and reusing the writer instance for multiple values.
  • Deprecation of Old to_avro_datum Functions: The existing to_avro_datum, write_avro_datum_ref, and to_avro_datum_schemata functions have been marked as deprecated, with recommendations to use the new GenericDatumWriter instead.
  • Updated Benchmarks: The Avro benchmarks have been updated to utilize the new GenericDatumWriter and include new scenarios to measure performance with reused writers and with validation disabled.
  • Refactored Internal Writing Logic: Internal Avro writing logic has been refactored to centralize around the GenericDatumWriter, improving consistency and maintainability across the codebase.
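
As a rough illustration of the builder-plus-reuse idea in the highlights above, here is a std-only sketch. `ToyDatumWriter`, its builder, the "maximum length" stand-in for a schema, and the length-prefixed encoding are all hypothetical and are not the crate's API. The only point is that configuration is fixed once in `build()` and the resulting writer can then serialize many values.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for a datum writer: configuration is fixed at build time.
pub struct ToyDatumWriter {
    max_len: usize, // toy "schema": strings up to this many bytes are valid
    validate: bool,
}

/// Hypothetical builder; validation is enabled unless switched off.
pub struct ToyDatumWriterBuilder {
    max_len: usize,
    validate: bool,
}

impl ToyDatumWriterBuilder {
    pub fn validate(mut self, on: bool) -> Self {
        self.validate = on;
        self
    }

    pub fn build(self) -> ToyDatumWriter {
        ToyDatumWriter { max_len: self.max_len, validate: self.validate }
    }
}

impl ToyDatumWriter {
    pub fn builder(max_len: usize) -> ToyDatumWriterBuilder {
        ToyDatumWriterBuilder { max_len: max_len, validate: true }
    }

    /// Writes a 4-byte little-endian length followed by the UTF-8 bytes.
    /// Returns the number of bytes written.
    pub fn write_value_ref<W: Write>(&self, writer: &mut W, value: &str) -> io::Result<usize> {
        if self.validate && value.len() > self.max_len {
            return Err(io::Error::new(
                io::ErrorKind::InvalidInput,
                "value does not match the toy schema",
            ));
        }
        let len = value.len() as u32;
        writer.write_all(&len.to_le_bytes())?;
        writer.write_all(value.as_bytes())?;
        Ok(4 + value.len())
    }

    /// Convenience wrapper that allocates a fresh buffer per call.
    pub fn write_value_to_vec(&self, value: &str) -> io::Result<Vec<u8>> {
        let mut buf = Vec::new();
        self.write_value_ref(&mut buf, value)?;
        Ok(buf)
    }
}
```

A caller would build one writer and call `write_value_ref` repeatedly, which is the shape the "reused writer" benchmarks are described as measuring.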
Changelog
  • avro/benches/single.rs
    • Updated benchmarks to use GenericDatumWriter for writing Avro records.
    • Added new benchmark cases for GenericDatumWriter with writer reuse and disabled validation.
  • avro/src/lib.rs
    • Removed direct to_avro_datum import.
    • Moved writer module declaration to a public section.
    • Added #[expect(deprecated)] attribute to the re-export of old writer functions.
  • avro/src/schema/mod.rs
    • Imported GenericDatumWriter for test cases.
    • Replaced calls to crate::to_avro_datum with GenericDatumWriter in test assertions.
  • avro/src/serde/de.rs
    • Imported GenericDatumWriter for test cases.
    • Replaced calls to crate::to_avro_datum with GenericDatumWriter in test assertions.
  • avro/src/writer/datum.rs
    • Introduced GenericDatumWriter struct with a builder pattern for Avro datum writing.
    • Implemented write_value, write_value_ref, write_value_to_vec, write_ser, and write_ser_to_vec methods for GenericDatumWriter.
    • Deprecated to_avro_datum, write_avro_datum_ref, and to_avro_datum_schemata functions.
    • Refactored deprecated functions to internally use GenericDatumWriter.
    • Removed internal write_avro_datum, write_avro_datum_schemata, and write_value_ref_owned_resolved functions.
  • avro/src/writer/single_object.rs
    • Removed direct imports of write_avro_datum_ref and write_value_ref_owned_resolved.
    • Imported encode_internal and SchemaAwareWriteSerializer.
    • Updated serialize_into method to use SchemaAwareWriteSerializer for data serialization.
    • Moved write_value_ref_owned_resolved function to be a private helper within this file.
  • avro/tests/avro-3786.rs
    • Imported GenericDatumWriter.
    • Replaced calls to to_avro_datum with GenericDatumWriter in test cases.
  • avro/tests/avro-3787.rs
    • Imported GenericDatumWriter.
    • Replaced calls to to_avro_datum with GenericDatumWriter in test cases.
  • avro/tests/io.rs
    • Imported GenericDatumWriter.
    • Replaced calls to to_avro_datum with GenericDatumWriter in various test functions.
  • avro/tests/schema.rs
    • Imported GenericDatumWriter and serde traits.
    • Removed direct import of to_avro_datum.
    • Replaced calls to to_avro_datum with GenericDatumWriter in test cases.
  • avro/tests/to_from_avro_datum_schemata.rs
    • Imported GenericDatumWriter.
    • Removed direct import of to_avro_datum_schemata.
    • Replaced calls to to_avro_datum_schemata with GenericDatumWriter in test cases.
Activity
  • The pull request was created with a note for AI review, indicating no human activity has occurred yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a great feature enhancement, replacing the standalone to_datum* functions with a more idiomatic and performant GenericDatumWriter. The introduction of new benchmarks to showcase the performance improvements is also a valuable addition. I've found a few potential issues related to namespace handling during serialization which could cause problems with schemas that use relative name resolution. My comments below provide specific suggestions to address these.

Note: Security Review did not run due to the size of the PR.

Comment on lines +88 to +99
pub fn write_value_ref<W: Write>(&self, writer: &mut W, value: &Value) -> AvroResult<usize> {
    if self.validate
        && self.resolved.get_schemata().iter().all(|s| {
            value
                .validate_internal(s, self.resolved.get_names(), None)
                .is_some()
        })
    {
        return Err(Details::Validation.into());
    }
    encode_internal(value, self.schema, self.resolved.get_names(), None, writer)
}

high

The calls to validate_internal and encode_internal are missing the enclosing namespace, which is hardcoded to None. This can lead to errors when resolving nested schemas that use relative names. The namespace of the schema being used should be passed to ensure correct name resolution.

Suggested change
 pub fn write_value_ref<W: Write>(&self, writer: &mut W, value: &Value) -> AvroResult<usize> {
     if self.validate
         && self.resolved.get_schemata().iter().all(|s| {
             value
-                .validate_internal(s, self.resolved.get_names(), None)
+                .validate_internal(s, self.resolved.get_names(), s.namespace())
                 .is_some()
         })
     {
         return Err(Details::Validation.into());
     }
-    encode_internal(value, self.schema, self.resolved.get_names(), None, writer)
+    encode_internal(value, self.schema, self.resolved.get_names(), self.schema.namespace(), writer)
 }

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Rather than passing None as the enclosing namespace, the main schema's namespace should be used. This helps resolve any schema references used in record fields.

Comment on lines +113 to +115
let mut serializer =
    SchemaAwareWriteSerializer::new(writer, self.schema, self.resolved.get_names(), None);
value.serialize(&mut serializer)

high

The enclosing_namespace parameter to SchemaAwareWriteSerializer::new is hardcoded to None. It should be self.schema.namespace() to correctly resolve relative names in schemas.

Suggested change
 let mut serializer =
-    SchemaAwareWriteSerializer::new(writer, self.schema, self.resolved.get_names(), None);
+    SchemaAwareWriteSerializer::new(writer, self.schema, self.resolved.get_names(), self.schema.namespace());
 value.serialize(&mut serializer)

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Rather than passing None as the enclosing namespace, the main schema's namespace should be used. This helps resolve any schema references used in record fields.

Comment on lines +166 to +172
+let mut serializer = SchemaAwareWriteSerializer::new(
+    writer,
     self.resolved.get_root_schema(),
     self.resolved.get_names(),
-    data,
-    writer,
-)?;
+    None,
+);
+let bytes = data.serialize(&mut serializer)?;

high

The enclosing_namespace parameter to SchemaAwareWriteSerializer::new is hardcoded to None. It should be self.resolved.get_root_schema().namespace() to correctly resolve relative names in schemas.

Suggested change
 let mut serializer = SchemaAwareWriteSerializer::new(
     writer,
     self.resolved.get_root_schema(),
     self.resolved.get_names(),
-    None,
+    self.resolved.get_root_schema().namespace(),
 );
 let bytes = data.serialize(&mut serializer)?;

Owner Author

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Rather than passing None as the enclosing namespace, the main schema's namespace should be used. This helps resolve any schema references used in record fields.

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

    })
{
    return Err(Details::Validation.into());
}

Validation checks all schemata instead of primary schema

Medium Severity

The write_value_ref validation iterates over self.resolved.get_schemata() (all schemata) with .all(), meaning it only rejects a value if it's invalid for every schema. The old write_avro_datum_schemata validated only against the primary schema (self.schema). When auxiliary schemata are provided (via .schemata()), a value that's invalid for the primary schema but coincidentally valid for an auxiliary schema will bypass validation and be encoded with the wrong schema, potentially producing corrupt data. The check needs to validate against self.schema, not all resolved schemata.
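
A minimal sketch of why the `.all()` form lets such a value through. `ToySchema`, `ToyValue`, and `validate` are hypothetical stand-ins rather than the crate's types; the only convention carried over from this thread is that validation returns `Some(reason)` for an invalid value and `None` for a valid one.

```rust
/// Toy stand-ins for illustration only (not the crate's Schema/Value types).
#[derive(Debug, Clone, PartialEq)]
pub enum ToySchema {
    Int,
    Str,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ToyValue {
    Int(i32),
    Str(String),
}

/// Returns Some(reason) when the value is invalid for the schema, None when it is valid.
pub fn validate(value: &ToyValue, schema: &ToySchema) -> Option<String> {
    match (value, schema) {
        (&ToyValue::Int(_), &ToySchema::Int) | (&ToyValue::Str(_), &ToySchema::Str) => None,
        _ => Some(format!("{:?} does not match {:?}", value, schema)),
    }
}

/// Shape of the check under review: rejects only if every schema rejects the value.
pub fn rejects_with_all(value: &ToyValue, schemata: &[ToySchema]) -> bool {
    schemata.iter().all(|s| validate(value, s).is_some())
}

/// Shape of the proposed check: rejects whenever the primary schema rejects the value.
pub fn rejects_with_primary(value: &ToyValue, primary: &ToySchema) -> bool {
    validate(value, primary).is_some()
}
```

With a primary schema of `Int` and schemata `[Int, Str]`, a `Str` value is not rejected by `rejects_with_all`, because the auxiliary `Str` schema accepts it, while `rejects_with_primary` rejects it as expected.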

Owner Author

value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! The value should be validated against the main schema, using the names of all additional schemata, rather than against every schema. This returns the correct error reason when there is a problem and avoids the extra work of validating against unrelated schemas, which could also report the wrong reason for the error.

@claude

claude bot commented Mar 4, 2026

PR Review: Replace to_datum* functions with GenericDatumWriter

Overall this is a solid design direction — introducing a reusable GenericDatumWriter struct is a meaningful ergonomic improvement over the free-function API, and the builder pattern (via bon) integrates cleanly with the existing codebase patterns. The deprecation annotations are well-placed and the bench additions effectively illustrate the reuse benefit. That said, I have a few concerns worth addressing before merging.


🐛 Bug: enclosing_namespace silently dropped

avro/src/writer/datum.rs, write_value_ref:

encode_internal(value, self.schema, self.resolved.get_names(), None, writer)

The namespace is passed as None, but the prior implementation of write_avro_datum_schemata passed schema.namespace():

let enclosing_namespace = schema.namespace();
encode_internal(&avro, schema, names, enclosing_namespace, buffer)

encode_internal uses the enclosing namespace when resolving Schema::Ref nodes (see encode.rs:73). With None, any reference whose fully-qualified name depends on the enclosing namespace will fail to resolve. For simple schemas this is harmless (and even the old encode function used None), but this is a silent regression for callers of to_avro_datum_schemata that have namespaced schemas. The fix is straightforward: pass self.schema.namespace() instead.

The same issue applies to the validate_internal call just above it — validate_schemata (in types.rs:384) uses schema.namespace() per schema, but write_value_ref passes None.
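
To make the namespace point concrete, here is a minimal sketch of name resolution following the Avro specification's fullname rule as I understand it: a name containing a dot is already fully qualified, otherwise the enclosing namespace (if any) is prepended. `fullname`, `resolve`, and the `names` map are hypothetical helpers and say nothing about how encode_internal is actually implemented.

```rust
use std::collections::HashMap;

/// Builds a fully qualified name: a dotted name is used as-is,
/// otherwise the enclosing namespace (when present and non-empty) is prepended.
pub fn fullname(name: &str, enclosing_namespace: Option<&str>) -> String {
    if name.contains('.') {
        return name.to_string();
    }
    match enclosing_namespace {
        Some(ns) if !ns.is_empty() => format!("{}.{}", ns, name),
        _ => name.to_string(),
    }
}

/// Hypothetical lookup: `names` maps fully qualified names to a schema description.
pub fn resolve(
    names: &HashMap<String, &'static str>,
    reference: &str,
    enclosing_namespace: Option<&str>,
) -> Option<&'static str> {
    names.get(&fullname(reference, enclosing_namespace)).cloned()
}
```

If the map only contains `com.example.Inner`, resolving the reference `Inner` succeeds with `Some("com.example")` as the enclosing namespace and fails with `None`, which is the kind of lookup failure the comment above is concerned about.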


🐛 Bug: Validation may not check the primary schema

When GenericDatumWriter is built with .schemata(schemata), the builder calls:

let resolved = ResolvedSchema::new_with_schemata(schemata)?;

This stores only schemata in resolved.schemata. However, the primary schema (self.schema, passed to builder(schema)) is not automatically added to resolved.schemata. In write_value_ref, validation iterates self.resolved.get_schemata(), which only covers the additional schemata — not self.schema itself (unless the caller happened to include it in the schemata vec).

The old write_avro_datum_schemata explicitly validated against the primary schema:

if let Some(_err) = avro.validate_internal(schema, names, enclosing_namespace) {
    return Err(Details::Validation.into());
}

Consider adding the primary schema as the first element in the schemata vec (in the new constructor) so get_schemata()[0] is always the primary schema and validation always covers it.


📉 Regression: Richer error type lost

write_value_ref_owned_resolved (which is now private to single_object.rs) returns a Details::ValidationWithReason error containing the value, schema, and the reason string. The new write_value_ref returns only Details::Validation, discarding all diagnostic context. For users hitting a validation error this is a step backward in debuggability. Consider returning Details::ValidationWithReason at least when a reason string is available.
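
A small sketch of the difference in diagnostic value between the two error shapes. `ToyDetails` is a hypothetical enum whose variant names merely mirror the ones mentioned above; the fields and message texts are assumptions, not the crate's definitions.

```rust
use std::fmt;

/// Hypothetical error type; only the variant names are taken from the review.
#[derive(Debug)]
pub enum ToyDetails {
    Validation,
    ValidationWithReason {
        value: String,
        schema: String,
        reason: String,
    },
}

impl fmt::Display for ToyDetails {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match *self {
            ToyDetails::Validation => write!(f, "Value does not match schema"),
            ToyDetails::ValidationWithReason { ref value, ref schema, ref reason } => write!(
                f,
                "Value {} does not match schema {}: {}",
                value, schema, reason
            ),
        }
    }
}

/// Picks the richer variant whenever a reason string is available.
pub fn validation_error(value: &str, schema: &str, reason: Option<String>) -> ToyDetails {
    match reason {
        Some(reason) => ToyDetails::ValidationWithReason {
            value: value.to_string(),
            schema: schema.to_string(),
            reason: reason,
        },
        None => ToyDetails::Validation,
    }
}
```

Since validation already produces a reason string when it fails, forwarding it this way costs little and keeps the context a user needs to debug the failure.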


📝 Missing doc comments on public methods

Three public methods are exported without documentation:

  • write_value_ref — the main low-level write path, worth explaining the borrow vs owned distinction vs write_value
  • write_ser — should clarify when to prefer this over write_value
  • write_ser_to_vec — same

💡 Minor: writer visibility change has API surface implications

Changing mod writer → pub mod writer in lib.rs makes the entire writer module hierarchy public. GenericDatumWriter is now reachable as apache_avro::writer::datum::GenericDatumWriter, but it is not re-exported from the apache_avro root. This creates a two-tier situation where some writer types (Writer, WriterBuilder) are at the root and others require the subpath. Consider either re-exporting GenericDatumWriter from the root (aligned with the other pub-use items in lib.rs) or keeping the module private and only re-exporting specific items.
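
For reference, the re-export option could look roughly like the sketch below. The module layout is reduced to the visibility question only, the struct is an empty placeholder, and `via_module_path` is a hypothetical helper added just to show that both paths name the same type.

```rust
// Hypothetical, reduced crate layout: only the module nesting matters here.
pub mod writer {
    pub mod datum {
        /// Placeholder for the real struct; fields and methods omitted.
        #[derive(Debug, Default)]
        pub struct GenericDatumWriter;
    }
}

// Root-level re-export, so callers can name the type without the module path.
pub use self::writer::datum::GenericDatumWriter;

/// Constructs the type through the long path and returns it under the short name.
pub fn via_module_path() -> GenericDatumWriter {
    self::writer::datum::GenericDatumWriter
}
```

Keeping the module private and re-exporting selected items would use the same `pub use` line with `mod writer` left non-public.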


🔍 Observation: Benchmark now measures writer construction overhead

The updated bench_small_schema_write_record and bench_big_schema_write_record create a new GenericDatumWriter on every iteration (including schema resolution), which is fine for measuring the "no reuse" path. Just be aware this benchmark is no longer directly comparable to the old to_avro_datum baseline if the goal is apples-to-apples throughput comparison — though the added "reused writer" benchmarks nicely fill that gap.


Summary

Category | Issue
🐛 Bug | enclosing_namespace not passed to encode_internal/validate_internal
🐛 Bug | Validation may skip the primary schema when using .schemata()
📉 Regression | Details::ValidationWithReason dropped, only Details::Validation returned
📝 Style | Missing doc comments on 3 public methods
💡 Minor | GenericDatumWriter not re-exported from crate root, inconsistent with other writer types

The core abstraction is well-designed and the refactor is otherwise clean. Fixing the namespace and primary-schema validation issues is the main blocker before this is safe to merge.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
avro/benches/single.rs (1)

199-234: Reduce allocation noise in reused-writer benchmarks.

The reused/no-validation paths still allocate a fresh Vec each iteration, which can dominate the measurement and blur writer-reuse gains. Reusing a buffer per benchmark gives cleaner writer-focused numbers.

♻️ Suggested benchmark adjustment
 fn bench_small_schema_write_record_reuse_datum_writer(c: &mut Criterion) {
     let (schema, record) = make_small_record().unwrap();
     let writer = GenericDatumWriter::builder(&schema).build().unwrap();
+    let mut buf = Vec::new();
     c.bench_function("small record (reused writer)", |b| {
-        b.iter(|| writer.write_value_ref(&mut Vec::new(), &record))
+        b.iter(|| {
+            buf.clear();
+            writer.write_value_ref(&mut buf, &record).unwrap()
+        })
     });
 }

 fn bench_big_schema_write_record_reuse_datum_writer(c: &mut Criterion) {
     let (schema, record) = make_big_record().unwrap();
     let writer = GenericDatumWriter::builder(&schema).build().unwrap();
+    let mut buf = Vec::new();
     c.bench_function("big record (reused writer)", |b| {
-        b.iter(|| writer.write_value_ref(&mut Vec::new(), &record))
+        b.iter(|| {
+            buf.clear();
+            writer.write_value_ref(&mut buf, &record).unwrap()
+        })
     });
 }

 fn bench_small_schema_write_record_no_validation(c: &mut Criterion) {
     let (schema, record) = make_small_record().unwrap();
     let writer = GenericDatumWriter::builder(&schema)
         .validate(false)
         .build()
         .unwrap();
+    let mut buf = Vec::new();
     c.bench_function("small record (no validation)", |b| {
-        b.iter(|| writer.write_value_ref(&mut Vec::new(), &record))
+        b.iter(|| {
+            buf.clear();
+            writer.write_value_ref(&mut buf, &record).unwrap()
+        })
     });
 }

 fn bench_big_schema_write_record_no_validation(c: &mut Criterion) {
     let (schema, record) = make_big_record().unwrap();
     let writer = GenericDatumWriter::builder(&schema)
         .validate(false)
         .build()
         .unwrap();
+    let mut buf = Vec::new();
     c.bench_function("big record (no validation)", |b| {
-        b.iter(|| writer.write_value_ref(&mut Vec::new(), &record))
+        b.iter(|| {
+            buf.clear();
+            writer.write_value_ref(&mut buf, &record).unwrap()
+        })
     });
 }
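
The effect the suggestion relies on can be checked with std alone: `Vec::clear` resets the length but keeps the allocated capacity, so later writes that fit do not reallocate. `write_reusing` below is a hypothetical helper standing in for the writer call, not part of the benchmark code.

```rust
use std::io::Write;

/// Clears `buf`, writes `payload` into it, and returns the number of bytes now held.
pub fn write_reusing(buf: &mut Vec<u8>, payload: &[u8]) -> usize {
    // Length drops to 0; the allocation made by earlier iterations is kept.
    buf.clear();
    buf.write_all(payload)
        .expect("writing to a Vec<u8> does not fail");
    buf.len()
}
```

After one larger write, a smaller write through the same buffer leaves `buf.capacity()` unchanged, which is why the per-iteration allocation disappears from the measurement.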
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/benches/single.rs` around lines 199 - 234, Benchmarks allocate a fresh
Vec in each iteration which adds allocation noise; for the four functions
bench_small_schema_write_record_reuse_datum_writer,
bench_big_schema_write_record_reuse_datum_writer,
bench_small_schema_write_record_no_validation, and
bench_big_schema_write_record_no_validation, create a single mutable buffer
(e.g., let mut buf = Vec::new()) outside the b.iter closure and inside the
closure call buf.clear() then pass &mut buf to writer.write_value_ref so the
buffer is reused across iterations and removes allocation noise from the
measurement.

ℹ️ Review info
Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5827b7eb-13bc-4b90-a71d-ff126deedf23

📥 Commits

Reviewing files that changed from the base of the PR and between b652119 and 1a75da7.

📒 Files selected for processing (11)
  • avro/benches/single.rs
  • avro/src/lib.rs
  • avro/src/schema/mod.rs
  • avro/src/serde/de.rs
  • avro/src/writer/datum.rs
  • avro/src/writer/single_object.rs
  • avro/tests/avro-3786.rs
  • avro/tests/avro-3787.rs
  • avro/tests/io.rs
  • avro/tests/schema.rs
  • avro/tests/to_from_avro_datum_schemata.rs

Comment on lines +88 to +99
pub fn write_value_ref<W: Write>(&self, writer: &mut W, value: &Value) -> AvroResult<usize> {
    if self.validate
        && self.resolved.get_schemata().iter().all(|s| {
            value
                .validate_internal(s, self.resolved.get_names(), None)
                .is_some()
        })
    {
        return Err(Details::Validation.into());
    }
    encode_internal(value, self.schema, self.resolved.get_names(), None, writer)
}

⚠️ Potential issue | 🔴 Critical

Validate against the writer schema only; current all(...) check can bypass validation.

Line 90 currently checks every resolved schema and only fails if all validations fail. That means a value valid for any auxiliary schema can pass, even when invalid for self.schema, then Line 98 encodes against self.schema anyway.

🐛 Suggested fix
 pub fn write_value_ref<W: Write>(&self, writer: &mut W, value: &Value) -> AvroResult<usize> {
-    if self.validate
-        && self.resolved.get_schemata().iter().all(|s| {
-            value
-                .validate_internal(s, self.resolved.get_names(), None)
-                .is_some()
-        })
-    {
+    if self.validate
+        && value
+            .validate_internal(
+                self.schema,
+                self.resolved.get_names(),
+                self.schema.namespace(),
+            )
+            .is_some()
+    {
         return Err(Details::Validation.into());
     }
-    encode_internal(value, self.schema, self.resolved.get_names(), None, writer)
+    encode_internal(
+        value,
+        self.schema,
+        self.resolved.get_names(),
+        self.schema.namespace(),
+        writer,
+    )
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@avro/src/writer/datum.rs` around lines 88 - 99, The validation in
write_value_ref is using self.resolved.get_schemata().iter().all(...) which lets
values pass if any auxiliary schema accepts them; instead validate only against
the writer schema. Replace the current all(...) check with a direct call to
value.validate_internal(self.schema, self.resolved.get_names(), None) and, if
that returns Some(...), return the Validation error; leave the subsequent call
to encode_internal(value, self.schema, ...) unchanged. This targets
write_value_ref, self.schema, self.resolved.get_schemata(), and
value.validate_internal to ensure validation is performed against the writer
schema only.

Owner Author

value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The value should be validated against the main schema, using the names of all additional schemata, rather than against every schema. This returns the correct error reason when there is a problem and avoids the extra work of validating against unrelated schemas, which could also report the wrong reason for the error.

@augmentcode

augmentcode bot commented Mar 4, 2026

🤖 Augment PR Summary

Summary: This PR introduces GenericDatumWriter as the new API for encoding single Avro values (datum) and migrates call sites off to_avro_datum*.

Changes:

  • Added writer::datum::GenericDatumWriter with a builder API, optional validation, and helpers to write to any Write or to a Vec<u8>.
  • Deprecated to_avro_datum, to_avro_datum_schemata, and write_avro_datum_ref in favor of the new writer.
  • Updated schema/serde unit tests and regression tests to use GenericDatumWriter for datum encoding.
  • Refactored single-object writer serialization to go through SchemaAwareWriteSerializer and inlined the resolved-value write helper.
  • Expanded Criterion benchmarks to compare per-iteration writer construction vs reusing the writer, and to measure validation overhead.

Technical Notes: The writer module is now public so downstreams can access writer::datum::GenericDatumWriter directly while legacy exports remain temporarily available (deprecated).

@augmentcode augmentcode bot left a comment

Review completed. 1 suggestions posted.

}

pub fn write_value_ref<W: Write>(&self, writer: &mut W, value: &Value) -> AvroResult<usize> {
    if self.validate

Value::validate_internal returns Some(reason) when the value is invalid, but this check only errors when all schemata return Some (and it validates against every schema in ResolvedSchema, including dependencies). That seems likely to let values that are invalid for the writer's root schema slip through validation (or validate against the wrong schema).

Severity: high

Owner Author

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The value should be validated against the main schema, using the names of all additional schemata, rather than against every schema. This returns the correct error reason when there is a problem and avoids the extra work of validating against unrelated schemas, which could also report the wrong reason for the error.

@martin-augment
Owner Author

🐛 Bug: Validation may not check the primary schema

When GenericDatumWriter is built with .schemata(schemata), the builder calls:

let resolved = ResolvedSchema::new_with_schemata(schemata)?;

This stores only schemata in resolved.schemata. However, the primary schema (self.schema, passed to builder(schema)) is not automatically added to resolved.schemata. In write_value_ref, validation iterates self.resolved.get_schemata(), which only covers the additional schemata — not self.schema itself (unless the caller happened to include it in the schemata vec).

The old write_avro_datum_schemata explicitly validated against the primary schema:

if let Some(_err) = avro.validate_internal(schema, names, enclosing_namespace) {
    return Err(Details::Validation.into());
}

Consider adding the primary schema as the first element in the schemata vec (in the new constructor) so get_schemata()[0] is always the primary schema and validation always covers it.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The value should be validated against the main schema, using the names of all additional schemata, rather than against every schema. This returns the correct error reason when there is a problem and avoids the extra work of validating against unrelated schemas, which could also report the wrong reason for the error.
