Add validation error feedback to arrow-json deserialization#6
Conversation
e25fd88 to
1ce557c
Compare
mwylde
left a comment
There was a problem hiding this comment.
Overall looks reasonable, but I think the validation context API makes the implementations pretty awkward compared to just returning an Option from the validate method. In particular, we now have to keep two ways of signaling errors in sync (returning the actual error and returning the boolean), when they should always be the same (either errored or not errored).
The other problem is the recursive nature of these calls. For example, if an array item fails validation, you lose the context on where the validation failed because it's being produced by the inner decoder and not by the array.
1ce557c to
e0d8b63
Compare
Refactored this approach to provide error details by updating the return types. |
mwylde
left a comment
There was a problem hiding this comment.
A few more comments, but looking pretty good
| { | ||
| // Add field name to child errors that don't have one | ||
| for error in &mut child_errors { | ||
| if error.field_name.is_none() { |
There was a problem hiding this comment.
Can you help me understand what's going on here? In what cases would the child error have or not have a field name?
There was a problem hiding this comment.
When child validators encounter issues such as invalid values, they do not know the field name for which the corresponding value was invalid. They would return an error marker with a None field name.
The parent validator (like a struct validator) would identify such child errors and add the field names to those errors. Only decoders that are at the leaf-level would not have access to the field name.
For example, if we have a json row like: {"user": {"age": "not_a_number"}}, the inner validator would only know that the json row provided a string instead of a number. The outer struct validator would add the field name "age" to the child error marker.
There was a problem hiding this comment.
That makes sense — but why would they have the field name?
There was a problem hiding this comment.
Child errors would have a field name when there are nested structs. Consider a case when we have 2 levels of struct validators. The inner struct validator would read the field name of its child field from the tape and add it to errors before returning them to the outer struct.
This check essentially prevents the outer struct from overwriting the field name of the innermost child.
There was a problem hiding this comment.
Added a comment above the check to provide this context.
| if self.strict_mode { | ||
| return false; | ||
| // Custom field_name - can't use helper | ||
| return Err(vec![ErrorMarker { |
There was a problem hiding this comment.
I'm a bit confused that we're both collecting errors (to be reported later) and also early returning on many errors, in which case we won't collect them.
There was a problem hiding this comment.
That is sort of intentional.
If we encounter a field with invalid values, we return early. That makes sense because we could have child errors that are a consequence of a field with an invalid value, particularly with nested structs.
When we encounter a missing field, we continue checking for any other missing fields. That's because missing fields are independent, and it is convenient enough to look for other missing fields even after we encounter one.
If we want to make the behavior consistent, I would return early in all cases. But I think this current approach does a reasonable job in making debugging convenient for the caller.
e0d8b63 to
ebd0259
Compare
Rationale for this change
Arrow-json validation currently returns only boolean (valid/invalid). Users get metric counts with no details about why deserialization failed, making debugging difficult.
What changes are included in this PR?
New validation.rs module:
API change:
Implementation:
Are these changes tested?
Yes
We typically require tests for all PRs in order to:
If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
Are there any user-facing changes?
Yes.
Decoder::flush_with_bad_data() now returns 4-tuple with Vec as 4th element
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.