feat: add projection support to TapeDecoder for skipping unknown fields in JSON parsing (1.4x speedup) #9097
Weijun-H wants to merge 11 commits into apache:main
Conversation
scovich
left a comment
I really like the idea of skipping unwanted fields -- pure overhead to keep them -- but this PR feels overly complex/nested. I wonder if there's a "flatter" way to handle the situation?
```rust
const SKIP_IN_STRING: u8 = 1 << 0; // 0x01
const SKIP_ESCAPE: u8 = 1 << 1; // 0x02
```
Why not just use the hex constants directly, out of curiosity?
Also -- my intuition is that these two flags are only needed because SkipValue does too much. The newly introduced code has a lot of looping and nesting, where the existing enum variants are quite flat. The difference seems to be that the existing variants hand off to a new state whenever they detect a state change?
So e.g. instead of messing with flags, one might declare three new enum variants, `SkipValue`, `SkipString` and `SkipEscape`, where each nests exclusively inside the one before it? e.g. if the projection skipped field `foo`, then the following JSON fragment:
```json
{
  "foo": {
    "bar": "hello\nworld!"
  }
}
```
would:
- push a `SkipValue` as soon as `:` detects that `foo` is not selected
- push a `SkipString` as soon as it hits the opening `"` of the string
- push a `SkipEscape` as soon as it hits the `\` inside the string
- pop once the escape was processed
- pop once the closing `"` is found
- pop once the next field starts (or whatever is currently the ending condition for `SkipValue`)
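As a concrete illustration, here is a minimal self-contained sketch of that nested-state idea (the `SkipState` enum and `skip_value` function are hypothetical, not arrow-json code). Each state nests exclusively inside the one before it, so nested containers simply push another `Value` and no depth counter or flag bits are needed; for brevity it only handles objects, arrays and strings, while the real decoder would also have to cover numbers and literals:

```rust
#[derive(Debug)]
enum SkipState {
    Value,  // skipping a (possibly nested) object or array
    String, // inside a skipped string literal
    Escape, // just saw a backslash inside that string
}

/// Return the index one past the end of the skipped value at `input[0]`,
/// or None if the input ends first. Only objects, arrays and strings are
/// handled in this sketch.
fn skip_value(input: &[u8]) -> Option<usize> {
    let mut stack: Vec<SkipState> = Vec::new();
    for (i, &b) in input.iter().enumerate() {
        match stack.last() {
            // The byte after a backslash is consumed unconditionally.
            Some(SkipState::Escape) => {
                stack.pop();
            }
            Some(SkipState::String) => match b {
                b'\\' => stack.push(SkipState::Escape),
                b'"' => {
                    stack.pop();
                }
                _ => {}
            },
            // Either at the very start or inside a skipped container.
            _ => match b {
                b'{' | b'[' => stack.push(SkipState::Value),
                b'"' => stack.push(SkipState::String),
                b'}' | b']' => {
                    stack.pop();
                }
                _ => {}
            },
        }
        if i > 0 && stack.is_empty() {
            return Some(i + 1); // popped back out: value fully skipped
        }
    }
    None
}
```

Note how the three states really are one nested chain: `Escape` can only ever sit on top of `String`, which can only sit on top of (or be) the skipped value.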
By the same token, one would arguably want to push multiple `SkipValue` states instead of tracking nesting depth with a new variable? But then enum variants start to proliferate (basically need two of each).
Would it instead make sense to have a single skip offset that is the first stack index being skipped?
And then have pairs of match arms that decide what state gets pushed vs. merely traversed?
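A rough sketch of what that single skip offset could look like (all names here are hypothetical, and the real state stack is reduced to a bare length counter): whether output is being skipped becomes one comparison against the stack length, rather than a separate family of `Skip*` variants:

```rust
struct Decoder {
    stack_len: usize,         // stand-in for the real state stack
    skip_from: Option<usize>, // first stack index being skipped, if any
}

impl Decoder {
    /// Called when a field is found to be outside the projection.
    fn begin_skip(&mut self) {
        self.skip_from.get_or_insert(self.stack_len);
    }

    /// True while the value currently being decoded should not reach the tape.
    fn skipping(&self) -> bool {
        self.skip_from.is_some_and(|i| self.stack_len >= i)
    }

    /// Called whenever a state is popped; clears the offset once the
    /// decoder returns below the skipped region.
    fn on_pop(&mut self) {
        self.stack_len -= 1;
        if self.skip_from.is_some_and(|i| self.stack_len < i) {
            self.skip_from = None;
        }
    }
}
```

The existing states then get pushed and popped exactly as today; only the output-producing arms consult `skipping()`.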
Hmm, when I ran the benchmark on my laptop:
- Your original PR had 8% overhead (wide run) and 36% benefit (narrow run)
- The revised PR has 9% overhead (wide run) and 30% benefit (narrow run)
In order to remove the regression, I added a projection option to `ReaderBuilder` to enable projection-aware parsing.
When enabled, JSON fields not present in the schema are skipped during tape parsing rather than being fully parsed and later ignored. This improves performance for narrow projections over wide JSON data.
I threw an LLM at this whole situation during a boring meeting, and arrived at a surprisingly different potential approach, if you're game to try it out?
The short version is:
- Keep the existing (highly optimized and efficient) decoding logic, but factor it out to a helper method that is generic over
const SKIP: boolthat says whether to actually store the parsed output. - Wrap that helper in
decodeanddecode_skipmethods, with a clean transition between the two: enter at the:match (like the PR does today), anddecode_skipbreaks back out todecodewhen the stack length drops back down. - We need a new boolean
skippingfield to handle cases where input bytes were exhausted while skipping (so the next call todecodecan jump straight todecode_skipwhen starting the next buffer of bytes) - The state stack tracks everything related to skipping (small memory cost but very efficient).
- No new tape decoder enum variants needed.
In theory, the approach should be simpler (less duplicated source code) while also having friendlier branching (fewer and/or more predictable branches).
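A toy sketch of that shape, with the actual state machine elided and all names hypothetical (this is not arrow-json's real `TapeDecoder`): the helper is monomorphized twice, so the compiler removes the `if !SKIP` branch entirely in each copy and the hot path never tests "am I skipping?" per byte:

```rust
struct TapeDecoder {
    tape: Vec<u8>,  // stand-in for the real tape
    skipping: bool, // true if the last buffer ended mid-skip
}

impl TapeDecoder {
    /// One copy of the decoding loop, instantiated for both modes.
    fn decode_inner<const SKIP: bool>(&mut self, buf: &[u8]) {
        for &b in buf {
            // ... the real state machine transitions would go here ...
            if !SKIP {
                self.tape.push(b); // output is stored only when not skipping
            }
        }
    }

    fn decode(&mut self, buf: &[u8]) {
        if self.skipping {
            // Resume skipping across a buffer boundary.
            self.decode_skip(buf);
        } else {
            self.decode_inner::<false>(buf);
        }
    }

    fn decode_skip(&mut self, buf: &[u8]) {
        self.decode_inner::<true>(buf);
        // The real version would set `skipping` if `buf` was exhausted
        // before the skipped value ended; this toy always finishes.
        self.skipping = false;
    }
}
```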
Is that something you'd want me to put a bit more time into exploring further?
Or something you'd prefer to dig into yourself?
@Weijun-H could you update this PR (and the other JSON reader PRs) to resolve the conflicts and remove the now redundant benchmarks? I'll then run the benchmark numbers and give it a look.
I have a big-picture question: What trade-offs are we willing to make on validation of JSON values we will ultimately discard? At one extreme, we could fully parse and validate everything and just choose not to append the skipped bits to the tape afterward. At the other extreme, we completely ignore the bytes corresponding to skipped values, other than the bare minimum needed to be relatively confident we correctly identified the byte range to skip.

I think this PR currently leans toward the lenient-for-max-performance end of the spectrum. That's not necessarily bad, but the PR doesn't really talk about the trade-off. For example, if we decide we want to be maximally lenient in order to skip as quickly as possible, this PR may not be aggressive enough (dunno, haven't explored that direction yet). On the other hand, if we favor correctness even for skipped values, then this PR is probably too lenient (a motivating factor behind some of my previous comments, which I wasn't fully self-aware of at the time). Do we know what we want?
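To make the lenient end of that spectrum concrete, here is a hypothetical bare-minimum skipper (not code from this PR): it tracks only nesting depth and string/escape state, so malformed bytes inside the skipped range sail through undetected:

```rust
/// Bare-minimum skip of a value starting with '{' or '[': only nesting
/// depth and string/escape state are tracked, so invalid content inside
/// the skipped range is never detected.
fn lenient_skip(input: &[u8]) -> Option<usize> {
    let (mut depth, mut in_string, mut escaped) = (0usize, false, false);
    for (i, &b) in input.iter().enumerate() {
        if escaped {
            escaped = false; // byte after a backslash is always consumed
        } else if in_string {
            match b {
                b'\\' => escaped = true,
                b'"' => in_string = false,
                _ => {}
            }
        } else {
            match b {
                b'"' => in_string = true,
                b'{' | b'[' => depth += 1,
                b'}' | b']' => {
                    depth -= 1;
                    if depth == 0 {
                        return Some(i + 1); // end of the skipped value
                    }
                }
                _ => {} // numbers, literals, even garbage: never inspected
            }
        }
    }
    None // ran out of bytes mid-value
}
```

Such a scanner is about as fast as skipping can get, but it will cheerfully "skip" byte sequences that are not JSON at all, which is exactly the trade-off in question.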
That is an excellent question @scovich -- I don't have a great story but I did pull your comment into a separate ticket for discussion
What is the status of this PR? Should we proceed with it?

@alamb I think we were a bit stuck on the performance vs. correctness trade-off? Based on the above, we probably have to give up on enforcing correctness of the skipped bytes if we want the performance gains. My other worry is that the state machine has overhead for all object fields even though only top-level fields might be skipped. I think the latest version of the PR reduced that overhead quite a bit, but fundamentally there has to be a branch to decide whether to skip or not. That was actually the original motivation for my pathfinding -- to find a cheaper state machine -- but then I got sidetracked on the validation aspect.

Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
This PR implements projection-aware field skipping in the arrow-json reader:
- `ReaderBuilder::with_projection(bool)` enables opt-in field filtering

Behavior matrix:
Are these changes tested?
Yes, all existing tests pass
Are there any user-facing changes?
Yes, new public API:
- `ReaderBuilder::with_projection(bool)` - opt-in to skip unknown JSON fields during parsing

This is additive and does not break existing behavior (default is `false`).