Add validation error feedback to arrow-json deserialization #6

Merged
mwylde merged 1 commit into ArroyoSystems:55.2.0/json from garvit-gupta:garvit/validation-contetx
Jan 6, 2026

Conversation

@garvit-gupta garvit-gupta commented Dec 17, 2025

Rationale for this change

Arrow-json validation currently returns only a boolean (valid/invalid). Users see counts of failed rows in metrics but get no detail about why deserialization failed, which makes debugging difficult.

What changes are included in this PR?

New validation.rs module:

  • ErrorMarker<'tape> - Internal error marker with field names and array indices
  • ValidationError - User-facing error with field path, failure kind, actual/expected values
  • FailureKind: MissingField, NullValue, TypeMismatch, ParseFailure
  • Field paths with array indices: "items[2]", "matrix[1][1]"
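
A minimal sketch of how these pieces could fit together, not the PR's actual code. The `FailureKind` variants come from the list above; `PathSegment`, `render_path`, `validation_error`, the two-field shape of `ValidationError`, and the `.` separator between nested field names are all assumptions made for illustration.

```rust
/// Why a value failed validation (variant names from the PR description).
#[derive(Debug, Clone, PartialEq)]
pub enum FailureKind {
    MissingField,
    NullValue,
    TypeMismatch,
    ParseFailure,
}

/// One step from the row root to the failing value.
/// Hypothetical helper type, not taken from the PR.
#[derive(Debug, Clone, PartialEq)]
pub enum PathSegment {
    Field(String),
    Index(usize),
}

/// Simplified user-facing error: only the path and the failure kind.
/// The real type is described as also carrying actual/expected values.
#[derive(Debug, Clone, PartialEq)]
pub struct ValidationError {
    pub path: String,
    pub kind: FailureKind,
}

/// Renders segments as a field path with bracketed array indices,
/// e.g. "items[2]". The '.' between field names is an assumption.
pub fn render_path(segments: &[PathSegment]) -> String {
    let mut out = String::new();
    for segment in segments {
        match segment {
            PathSegment::Field(name) => {
                if !out.is_empty() {
                    out.push('.');
                }
                out.push_str(name);
            }
            PathSegment::Index(i) => {
                out.push('[');
                out.push_str(&i.to_string());
                out.push(']');
            }
        }
    }
    out
}

/// Builds a user-facing error from path segments and a failure kind.
pub fn validation_error(segments: &[PathSegment], kind: FailureKind) -> ValidationError {
    ValidationError {
        path: render_path(segments),
        kind,
    }
}
```

With this shape, a field segment `matrix` followed by two index segments `1` and `1` renders as the path `matrix[1][1]`.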

API change:

  • ArrayDecoder::validate_row() now returns Result<(), Vec<ErrorMarker<'tape>>> instead of bool
  • Decoder::flush_with_bad_data() now returns 4-tuple with Vec as 4th element
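
To illustrate the difference between the two return shapes, here is a small sketch on a toy integer validator. It is not the arrow-json API: `Marker`, `validate_bool`, and `validate_detailed` are hypothetical names, and the input is a plain string rather than a JSON tape.

```rust
/// Hypothetical stand-in for an error marker; carries only a message here.
#[derive(Debug, PartialEq)]
pub struct Marker {
    pub message: String,
}

/// Old shape: the caller learns only that the row is invalid.
pub fn validate_bool(raw: &str) -> bool {
    raw.parse::<i64>().is_ok()
}

/// New shape: success and failure share one return value, and the
/// failure branch carries the details.
pub fn validate_detailed(raw: &str) -> Result<(), Vec<Marker>> {
    match raw.parse::<i64>() {
        Ok(_) => Ok(()),
        Err(e) => Err(vec![Marker {
            message: format!("cannot parse {raw:?} as i64: {e}"),
        }]),
    }
}
```

Because the details live inside the `Err` branch, there is a single signal for "errored or not" and nothing separate to keep in sync with it.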

Implementation:

  • All 13 array decoders updated (primitive, string, boolean, null, decimal, timestamp, binary, json, string_view, list, map, struct)
  • Zero-cost happy path - no allocations when all rows are valid
  • Array index tracking through error propagation
  • Error truncation at 1000 to prevent unbounded growth
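
The happy-path and truncation points can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `MAX_ERRORS` and `collect_errors` are hypothetical names, errors are plain strings, and the cap is applied to one collected batch.

```rust
/// Cap on collected errors; the value 1000 is from the PR description,
/// the constant name is an assumption.
pub const MAX_ERRORS: usize = 1000;

/// Gathers per-row errors, stopping once the cap is reached.
pub fn collect_errors<I>(row_results: I) -> Vec<String>
where
    I: IntoIterator<Item = Result<(), Vec<String>>>,
{
    // Vec::new() does not allocate until the first element is pushed,
    // so a batch with only valid rows costs no heap allocation here.
    let mut collected = Vec::new();
    for result in row_results {
        if let Err(errors) = result {
            // collected.len() never exceeds MAX_ERRORS, so this cannot underflow.
            let remaining = MAX_ERRORS - collected.len();
            collected.extend(errors.into_iter().take(remaining));
            if collected.len() == MAX_ERRORS {
                break;
            }
        }
    }
    collected
}
```

Stopping at the cap bounds memory even when every row in a large batch is invalid.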

Are these changes tested?

Yes

  • 17 integration tests covering all error types, nested structures, arrays, maps
  • 5 unit tests for validation helpers

We typically require tests for all PRs in order to:

  1. Prevent the code from being accidentally broken by subsequent changes
  2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

Yes.
Decoder::flush_with_bad_data() now returns 4-tuple with Vec as 4th element

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@github-actions github-actions bot added the arrow label Dec 17, 2025
@garvit-gupta garvit-gupta force-pushed the garvit/validation-contetx branch from e25fd88 to 1ce557c Compare December 18, 2025 16:37
Member

@mwylde mwylde left a comment

Overall looks reasonable, but I think the validation context API makes the implementations pretty awkward compared to just returning an Option from the validate method. In particular, we now have to keep two ways of signaling errors in sync (returning the actual error and returning the boolean), when they should always be the same (either errored or not errored).

The other problem is the recursive nature of these calls. For example, if an array item fails validation, you lose the context on where the validation failed because it's being produced by the inner decoder and not by the array.

@garvit-gupta garvit-gupta force-pushed the garvit/validation-contetx branch from 1ce557c to e0d8b63 Compare December 19, 2025 11:11
@garvit-gupta
Author

> Overall looks reasonable, but I think the validation context API makes the implementations pretty awkward compared to just returning an Option from the validate method. In particular, we now have to keep two ways of signaling errors in sync (returning the actual error and returning the boolean), when they should always be the same (either errored or not errored).
>
> The other problem is the recursive nature of these calls. For example, if an array item fails validation, you lose the context on where the validation failed because it's being produced by the inner decoder and not by the array.

Refactored this approach to provide error details by updating the return types.
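
As a rough sketch of why returning the errors helps with the recursion problem: when the inner decoder hands its errors back, the array can stamp each one with the index it was validating. The names below (`Marker`, `validate_int`, `validate_list`) are hypothetical and the input is simplified to string slices; the real decoders work on the JSON tape.

```rust
/// Hypothetical error marker that records array indices, outermost first.
#[derive(Debug, PartialEq)]
pub struct Marker {
    pub indices: Vec<usize>,
    pub message: String,
}

/// Leaf validator: knows the value is bad, but not where it sits.
pub fn validate_int(raw: &str) -> Result<(), Vec<Marker>> {
    if raw.parse::<i64>().is_ok() {
        Ok(())
    } else {
        Err(vec![Marker {
            indices: Vec::new(),
            message: format!("expected integer, got {raw:?}"),
        }])
    }
}

/// List validator: runs the child validator on each item and prepends
/// the item's index to every error the child returns.
pub fn validate_list<T>(
    items: &[T],
    validate_item: impl Fn(&T) -> Result<(), Vec<Marker>>,
) -> Result<(), Vec<Marker>> {
    let mut errors = Vec::new();
    for (i, item) in items.iter().enumerate() {
        if let Err(mut child_errors) = validate_item(item) {
            for error in &mut child_errors {
                error.indices.insert(0, i);
            }
            errors.append(&mut child_errors);
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

Nesting two `validate_list` calls over a list of lists yields one index per level, which is the information a path such as "matrix[1][1]" needs.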

@garvit-gupta garvit-gupta marked this pull request as ready for review December 19, 2025 11:26
@garvit-gupta garvit-gupta requested a review from mwylde December 19, 2025 19:35
Member

@mwylde mwylde left a comment

A few more comments, but looking pretty good

{
    // Add field name to child errors that don't have one
    for error in &mut child_errors {
        if error.field_name.is_none() {
Member

Can you help me understand what's going on here? In what cases would the child error have or not have a field name?

Author

When child validators encounter issues such as invalid values, they do not know the field name for which the corresponding value was invalid. They would return an error marker with a None field name.

The parent validator (like a struct validator) would identify such child errors and add the field names to those errors. Only decoders that are at the leaf-level would not have access to the field name.

For example, given a JSON row like {"user": {"age": "not_a_number"}}, the inner validator knows only that the row supplied a string where a number was expected. The enclosing struct validator then adds the field name "age" to the child error marker.

Member

That makes sense — but why would they have the field name?

Author

Child errors already have a field name when there are nested structs. Consider a case with two levels of struct validators: the inner struct validator reads the field name of its child field from the tape and adds it to the errors before returning them to the outer struct.

This check essentially prevents the outer struct from overwriting the field name of the innermost child.

Author

Added a comment above the check to provide this context.
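
The behavior discussed in this thread can be sketched as below. The loop mirrors the hunk under review, but `Marker` and `attach_field_name` are hypothetical, simplified stand-ins rather than the PR's types.

```rust
/// Simplified stand-in for the error marker discussed in this thread.
#[derive(Debug, Clone, PartialEq)]
pub struct Marker {
    pub field_name: Option<String>,
    pub message: String,
}

/// What a struct validator does with errors returned by a child:
/// fill in the field name only where the child could not know it.
pub fn attach_field_name(mut child_errors: Vec<Marker>, name: &str) -> Vec<Marker> {
    for error in &mut child_errors {
        // An inner struct may already have named the failing field;
        // leave that name in place instead of overwriting it.
        if error.field_name.is_none() {
            error.field_name = Some(name.to_string());
        }
    }
    child_errors
}
```

For {"user": {"age": "not_a_number"}}, the leaf returns a marker with no field name, the inner struct sets it to "age", and the outer struct's pass with "user" leaves it unchanged.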

  if self.strict_mode {
-     return false;
+     // Custom field_name - can't use helper
+     return Err(vec![ErrorMarker {
Member

I'm a bit confused that we're both collecting errors (to be reported later) and also early returning on many errors, in which case we won't collect them.

Author

That is sort of intentional.

If we encounter a field with invalid values, we return early. That makes sense because we could have child errors that are a consequence of a field with an invalid value, particularly with nested structs.

When we encounter a missing field, we continue checking for any other missing fields. That's because missing fields are independent, and it is convenient enough to look for other missing fields even after we encounter one.

If we want to make the behavior consistent, I would return early in all cases. But I think this current approach does a reasonable job in making debugging convenient for the caller.
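
A small sketch of the mixed strategy described above, under simplifying assumptions: fields are string key/value pairs, every field is expected to hold an integer, and the `Failure` enum and `validate_struct` function are hypothetical. Whether missing-field errors gathered before an early return are kept or dropped is also an assumption; this sketch keeps them.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified failure type for this sketch.
#[derive(Debug, PartialEq)]
pub enum Failure {
    MissingField(String),
    TypeMismatch(String),
}

/// Checks that every required field is present and parses as an integer.
pub fn validate_struct(
    required: &[&str],
    row: &BTreeMap<&str, &str>,
) -> Result<(), Vec<Failure>> {
    let mut errors = Vec::new();
    for &name in required {
        match row.get(name) {
            // Missing fields are independent of each other: record and keep going.
            None => errors.push(Failure::MissingField(name.to_string())),
            Some(raw) => {
                if raw.parse::<i64>().is_err() {
                    // An invalid value stops validation; later fields are not checked.
                    errors.push(Failure::TypeMismatch(name.to_string()));
                    return Err(errors);
                }
            }
        }
    }
    if errors.is_empty() {
        Ok(())
    } else {
        Err(errors)
    }
}
```

The trade-off is visible in the result: a row missing two fields reports both, while a row with one invalid value reports that value and nothing after it.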

@garvit-gupta garvit-gupta force-pushed the garvit/validation-contetx branch from e0d8b63 to ebd0259 Compare December 23, 2025 18:47
@garvit-gupta garvit-gupta requested a review from mwylde December 23, 2025 18:49
@mwylde mwylde merged commit d31f8d8 into ArroyoSystems:55.2.0/json Jan 6, 2026
13 of 23 checks passed
2 participants