Add validation error feedback to arrow-json deserialization by garvit-gupta · Pull Request #6 · ArroyoSystems/arrow-rs

garvit-gupta · 2025-12-17T12:31:45Z

Rationale for this change

Arrow-json validation currently returns only a boolean (valid/invalid). Users get metric counts but no details about why deserialization failed, which makes debugging difficult.

What changes are included in this PR?

New validation.rs module:

ErrorMarker<'tape> - Internal error marker with field names and array indices
ValidationError - User-facing error with field path, failure kind, actual/expected values
FailureKind: MissingField, NullValue, TypeMismatch, ParseFailure
Field paths with array indices: "items[2]", "matrix[1][1]"
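
As a rough illustration of the types listed above, the sketch below models the failure kinds and renders a field path with array indices. `PathSegment`, `render_path`, the exact fields of `ValidationError`, and the '.' separator between nested field names are assumptions made for this example; the actual definitions live in validation.rs and may differ.

```rust
/// Why a value failed validation (variant names taken from the PR description).
#[derive(Debug, Clone, PartialEq)]
pub enum FailureKind {
    MissingField,
    NullValue,
    TypeMismatch,
    ParseFailure,
}

/// Hypothetical path component: a struct field name or an array index.
#[derive(Debug, Clone, PartialEq)]
pub enum PathSegment {
    Field(String),
    Index(usize),
}

/// Hypothetical shape of the user-facing error; field names are assumed.
#[derive(Debug, Clone, PartialEq)]
pub struct ValidationError {
    pub field_path: String,
    pub kind: FailureKind,
    pub expected: Option<String>,
    pub actual: Option<String>,
}

/// Render segments as "items[2]" or "matrix[1][1]".
/// Joining nested field names with '.' is an assumption of this sketch.
pub fn render_path(segments: &[PathSegment]) -> String {
    let mut out = String::new();
    for segment in segments {
        match segment {
            PathSegment::Field(name) => {
                if !out.is_empty() {
                    out.push('.');
                }
                out.push_str(name);
            }
            PathSegment::Index(i) => {
                out.push('[');
                out.push_str(&i.to_string());
                out.push(']');
            }
        }
    }
    out
}
```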

API change:

ArrayDecoder::validate_row() now returns Result<(), Vec<ErrorMarker>> instead of bool
Decoder::flush_with_bad_data() now returns 4-tuple with Vec as 4th element
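
To make the return-type change concrete, here is a minimal sketch of a validator that reports errors through `Result` rather than a boolean. `RowValidator`, `IntValidator`, `is_valid`, and the simplified `ErrorMarker` (no lifetime, a free-form `message`) are stand-ins invented for this example; the real `ArrayDecoder::validate_row()` works on tape positions and has a different signature.

```rust
/// Hypothetical, simplified stand-in for the internal error marker.
#[derive(Debug, PartialEq)]
pub struct ErrorMarker {
    pub field_name: Option<String>,
    pub message: String,
}

/// Simplified stand-in for the decoder trait (assumed shape).
pub trait RowValidator {
    fn validate_row(&self, value: &str) -> Result<(), Vec<ErrorMarker>>;
}

/// Toy validator that accepts only values that parse as i64.
pub struct IntValidator;

impl RowValidator for IntValidator {
    fn validate_row(&self, value: &str) -> Result<(), Vec<ErrorMarker>> {
        match value.parse::<i64>() {
            // Valid rows return Ok(()) and never build a Vec.
            Ok(_) => Ok(()),
            Err(e) => Err(vec![ErrorMarker {
                field_name: None,
                message: format!("expected integer, got {:?}: {}", value, e),
            }]),
        }
    }
}

/// The old boolean answer is still recoverable from the new return type.
pub fn is_valid(validator: &dyn RowValidator, value: &str) -> bool {
    validator.validate_row(value).is_ok()
}
```

With a single `Result`, the valid/invalid signal and the error details cannot drift apart, because the boolean is derived from the same value.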

Implementation:

All 13 array decoders updated (primitive, string, boolean, null, decimal, timestamp, binary, json, string_view, list, map, struct)
Zero-cost happy path - no allocations when all rows valid
Array index tracking through error propagation
Error truncation at 1000 to prevent unbounded growth
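
The allocation-free happy path and the error cap can be sketched roughly as follows. `ErrorCollector` and `collect_errors` are hypothetical names, and errors are plain strings here for brevity; only the limit of 1000 comes from the description above.

```rust
/// Cap taken from the PR description ("Error truncation at 1000").
pub const MAX_ERRORS: usize = 1000;

/// Hypothetical collector: keeps errors up to the cap and drops the rest.
#[derive(Debug, Default)]
pub struct ErrorCollector {
    errors: Vec<String>,
    truncated: bool,
}

impl ErrorCollector {
    /// An empty Vec does not allocate until the first push.
    pub fn new() -> Self {
        Self::default()
    }

    /// Append errors from one row, stopping once the cap is reached.
    pub fn extend(&mut self, new_errors: Vec<String>) {
        for error in new_errors {
            if self.errors.len() >= MAX_ERRORS {
                self.truncated = true;
                break;
            }
            self.errors.push(error);
        }
    }

    pub fn len(&self) -> usize {
        self.errors.len()
    }

    pub fn capacity(&self) -> usize {
        self.errors.capacity()
    }

    pub fn is_truncated(&self) -> bool {
        self.truncated
    }
}

/// Fold per-row results into a collector; valid rows touch nothing.
pub fn collect_errors<I>(rows: I) -> ErrorCollector
where
    I: IntoIterator<Item = Result<(), Vec<String>>>,
{
    let mut collector = ErrorCollector::new();
    for row in rows {
        if let Err(errors) = row {
            collector.extend(errors);
        }
    }
    collector
}
```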

Are these changes tested?

Yes

17 integration tests covering all error types, nested structures, arrays, maps
5 unit tests for validation helpers

Are there any user-facing changes?

Yes.
Decoder::flush_with_bad_data() now returns 4-tuple with Vec as 4th element

mwylde

Overall looks reasonable, but I think the validation context API makes the implementations pretty awkward compared to just returning an Option from the validate method. In particular, we now have to keep two ways of signaling errors in sync (returning the actual error and returning the boolean), when they should always be the same (either errored or not errored).

The other problem is the recursive nature of these calls. For example, if an array item fails validation, you lose the context on where the validation failed because it's being produced by the inner decoder and not by the array.

arrow-json/src/reader/validation.rs

garvit-gupta · 2025-12-19T11:25:44Z

> Overall looks reasonable, but I think the validation context API makes the implementations pretty awkward compared to just returning an Option from the validate method. In particular, we now have to keep two ways of signaling errors in sync (returning the actual error and returning the boolean), when they should always be the same (either errored or not errored).
>
> The other problem is the recursive nature of these calls. For example, if an array item fails validation, you lose the context on where the validation failed because it's being produced by the inner decoder and not by the array.

Refactored this approach to provide error details by updating the return types.

mwylde

A few more comments, but looking pretty good

arrow-json/src/reader/binary_array.rs

mwylde · 2025-12-23T02:04:48Z

arrow-json/src/reader/struct_array.rs

+                    {
+                        // Add field name to child errors that don't have one
+                        for error in &mut child_errors {
+                            if error.field_name.is_none() {


Can you help me understand what's going on here? In what cases would the child error have or not have a field name?

When child validators encounter issues such as invalid values, they do not know the field name for which the corresponding value was invalid. They would return an error marker with a None field name.

The parent validator (like a struct validator) would identify such child errors and add the field names to those errors. Only decoders that are at the leaf-level would not have access to the field name.

For example, if we have a JSON row like {"user": {"age": "not_a_number"}}, the inner validator would only know that the JSON row provided a string instead of a number. The outer struct validator would add the field name "age" to the child error marker.

That makes sense, but why would they have the field name?

Child errors would have a field name when there are nested structs. Consider a case where we have two levels of struct validators. The inner struct validator would read the field name of its child field from the tape and add it to errors before returning them to the outer struct.

This check essentially prevents the outer struct from overwriting the field name of the innermost child.

Added a comment above the check to provide this context.
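
The propagation rule discussed in this thread can be sketched as below. This is a simplified model, not the code from struct_array.rs: `ErrorMarker` is reduced to two fields, and `validate_int` and `attach_field_name` are hypothetical helpers.

```rust
/// Simplified stand-in for the internal error marker (assumed fields).
#[derive(Debug, Clone, PartialEq)]
pub struct ErrorMarker {
    pub field_name: Option<String>,
    pub message: String,
}

/// Leaf validator: it sees only the value, so it cannot name the field.
pub fn validate_int(raw: &str) -> Result<(), Vec<ErrorMarker>> {
    match raw.parse::<i64>() {
        Ok(_) => Ok(()),
        Err(_) => Err(vec![ErrorMarker {
            field_name: None,
            message: format!("expected a number, got {:?}", raw),
        }]),
    }
}

/// Parent step: label anonymous child errors, but never overwrite a name
/// that an inner struct validator has already attached.
pub fn attach_field_name(field: &str, mut child_errors: Vec<ErrorMarker>) -> Vec<ErrorMarker> {
    for error in &mut child_errors {
        if error.field_name.is_none() {
            error.field_name = Some(field.to_string());
        }
    }
    child_errors
}
```

For {"user": {"age": "not_a_number"}}, the leaf error starts with no name, the inner struct labels it "age", and the outer struct's pass for "user" leaves that label untouched.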

arrow-json/src/reader/struct_array.rs

mwylde · 2025-12-23T02:09:26Z

arrow-json/src/reader/struct_array.rs

                    if self.strict_mode {
-                        return false;
+                        // Custom field_name - can't use helper
+                        return Err(vec![ErrorMarker {


I'm a bit confused that we're both collecting errors (to be reported later) and also early returning on many errors, in which case we won't collect them.

That is sort of intentional.

If we encounter a field with invalid values, we return early. That makes sense because we could have child errors that are a consequence of a field with an invalid value, particularly with nested structs.

When we encounter a missing field, we continue checking for any other missing fields. That's because missing fields are independent, and it is convenient enough to look for other missing fields even after we encounter one.

If we want to make the behavior consistent, I would return early in all cases. But I think this current approach does a reasonable job in making debugging convenient for the caller.
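
A minimal sketch of the mixed policy described above: missing fields are accumulated, while an invalid value ends validation of the row immediately. `RowError` and `validate_struct` are hypothetical, every field is assumed to be an integer, and returning the missing-field errors gathered so far together with the invalid-value error is a choice made for this example rather than confirmed behavior of the PR.

```rust
use std::collections::HashMap;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum RowError {
    MissingField(String),
    InvalidValue(String),
}

/// Check that every required field is present and parses as i64.
pub fn validate_struct(
    required: &[&str],
    row: &HashMap<&str, &str>,
) -> Result<(), Vec<RowError>> {
    let mut errors = Vec::new();
    for &name in required {
        match row.get(name) {
            // Missing fields are independent of each other: record and keep going.
            None => errors.push(RowError::MissingField(name.to_string())),
            Some(raw) => {
                if raw.parse::<i64>().is_err() {
                    // An invalid value may cause follow-on errors, so stop here.
                    errors.push(RowError::InvalidValue(name.to_string()));
                    return Err(errors);
                }
            }
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```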

github-actions bot added the arrow label Dec 17, 2025

garvit-gupta force-pushed the garvit/validation-contetx branch from e25fd88 to 1ce557c Compare December 18, 2025 16:37

mwylde reviewed Dec 19, 2025

arrow-json/src/reader/validation.rs (outdated, resolved)

garvit-gupta force-pushed the garvit/validation-contetx branch from 1ce557c to e0d8b63 Compare December 19, 2025 11:11

garvit-gupta marked this pull request as ready for review December 19, 2025 11:26

garvit-gupta requested a review from mwylde December 19, 2025 19:35

mwylde reviewed Dec 23, 2025

Add validation error feedback to arrow-json deserialization

ebd0259

garvit-gupta force-pushed the garvit/validation-contetx branch from e0d8b63 to ebd0259 Compare December 23, 2025 18:47

garvit-gupta requested a review from mwylde December 23, 2025 18:49

mwylde approved these changes Jan 6, 2026

mwylde merged commit d31f8d8 into ArroyoSystems:55.2.0/json Jan 6, 2026
13 of 23 checks passed

garvit-gupta mentioned this pull request Jan 7, 2026

Capture and report JSON deserialization errors ArroyoSystems/arroyo#985

Merged

mwylde merged 1 commit into ArroyoSystems:55.2.0/json from garvit-gupta:garvit/validation-contetx
