feat: add projection support to TapeDecoder for skipping unknown fields in JSON parsing (1.4x speedup) #9097
Weijun-H wants to merge 11 commits into apache:main
Conversation
scovich
left a comment
I really like the idea of skipping unwanted fields -- pure overhead to keep them -- but this PR feels overly complex/nested. I wonder if there's a "flatter" way to handle the situation?
```rust
const SKIP_IN_STRING: u8 = 1 << 0; // 0x01
const SKIP_ESCAPE: u8 = 1 << 1; // 0x02
```
Why not just use the hex constants directly, out of curiosity?
Also -- my intuition is that these two flags are only needed because SkipValue does too much. The newly introduced code has a lot of looping and nesting, where the existing enum variants are quite flat. The difference seems to be that the existing variants hand off to a new state whenever they detect a state change?
So e.g. instead of messing with flags, one might declare three new enum variants, `SkipValue`, `SkipString` and `SkipEscape`, where each nests exclusively inside the one before it? e.g. if the projection skipped field `foo`, then the following JSON fragment:
```json
{
  "foo": {
    "bar": "hello\nworld!"
  }
}
```
would:
- push a `SkipValue` as soon as `:` detects that `foo` is not selected
- push a `SkipString` as soon as it hits the opening `"` of the string
- push a `SkipEscape` as soon as it hits the `\` inside the string
- pop once the escape was processed
- pop once the closing `"` is found
- pop once the next field starts (or whatever is currently the ending condition for `SkipValue`)
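As a concrete illustration, here is a minimal self-contained sketch of that nested-state idea (the `SkipState` enum and `skip_value` function are hypothetical, not arrow-json code). Each state nests exclusively inside the one before it, so nested containers simply push another `Value` and no depth counter or flag bits are needed; for brevity it only handles objects, arrays and strings, while the real decoder would also have to cover numbers and literals:

```rust
#[derive(Debug)]
enum SkipState {
    Value,  // skipping a (possibly nested) object or array
    String, // inside a skipped string literal
    Escape, // just saw a backslash inside that string
}

/// Return the index one past the end of the skipped value at `input[0]`,
/// or None if the input ends first. Only objects, arrays and strings are
/// handled in this sketch.
fn skip_value(input: &[u8]) -> Option<usize> {
    let mut stack: Vec<SkipState> = Vec::new();
    for (i, &b) in input.iter().enumerate() {
        match stack.last() {
            // The byte after a backslash is consumed unconditionally.
            Some(SkipState::Escape) => {
                stack.pop();
            }
            Some(SkipState::String) => match b {
                b'\\' => stack.push(SkipState::Escape),
                b'"' => {
                    stack.pop();
                }
                _ => {}
            },
            // Either at the very start or inside a skipped container.
            _ => match b {
                b'{' | b'[' => stack.push(SkipState::Value),
                b'"' => stack.push(SkipState::String),
                b'}' | b']' => {
                    stack.pop();
                }
                _ => {}
            },
        }
        if i > 0 && stack.is_empty() {
            return Some(i + 1); // popped back out: value fully skipped
        }
    }
    None
}
```

Note how the three states really are one nested chain: `Escape` can only ever sit on top of `String`, which can only sit on top of (or be) the skipped value.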
By the same token, one would arguably want to push multiple `SkipValue` states instead of tracking nesting depth with a new variable? But then enum variants start to proliferate (basically need two of each).
Would it instead make sense to have a single skip offset that is the first stack index being skipped?
And then have pairs of match arms that decide what state gets pushed vs. merely traversed?
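A rough sketch of what that single skip offset could look like (all names here are hypothetical, and the real state stack is reduced to a bare length counter): whether output is being skipped becomes one comparison against the stack length, rather than a separate family of `Skip*` variants:

```rust
struct Decoder {
    stack_len: usize,         // stand-in for the real state stack
    skip_from: Option<usize>, // first stack index being skipped, if any
}

impl Decoder {
    /// Called when a field is found to be outside the projection.
    fn begin_skip(&mut self) {
        self.skip_from.get_or_insert(self.stack_len);
    }

    /// True while the value currently being decoded should not reach the tape.
    fn skipping(&self) -> bool {
        self.skip_from.is_some_and(|i| self.stack_len >= i)
    }

    /// Called whenever a state is popped; clears the offset once the
    /// decoder returns below the skipped region.
    fn on_pop(&mut self) {
        self.stack_len -= 1;
        if self.skip_from.is_some_and(|i| self.stack_len < i) {
            self.skip_from = None;
        }
    }
}
```

The existing states then get pushed and popped exactly as today; only the output-producing arms consult `skipping()`.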
Hmm, when I ran the benchmark on my laptop:
- Your original PR had 8% overhead (wide run) and 36% benefit (narrow run)
- The revised PR has 9% overhead (wide run) and 30% benefit (narrow run)
In order to remove the regression, I added a projection option to `ReaderBuilder` to enable projection-aware parsing.
When enabled, JSON fields not present in the schema are skipped during tape parsing rather than being fully parsed and later ignored. This improves performance for narrow projections over wide JSON data.
I threw an LLM at this whole situation during a boring meeting, and arrived at a surprisingly different potential approach, if you're game to try it out?
The short version is:
- Keep the existing (highly optimized and efficient) decoding logic, but factor it out to a helper method that is generic over
const SKIP: boolthat says whether to actually store the parsed output. - Wrap that helper in
decodeanddecode_skipmethods, with a clean transition between the two: enter at the:match (like the PR does today), anddecode_skipbreaks back out todecodewhen the stack length drops back down. - We need a new boolean
skippingfield to handle cases where input bytes were exhausted while skipping (so the next call todecodecan jump straight todecode_skipwhen starting the next buffer of bytes) - The state stack tracks everything related to skipping (small memory cost but very efficient).
- No new tape decoder enum variants needed.
In theory, the approach should be simpler (less duplicated source code) while also having friendlier branching (fewer and/or more predictable branches).
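A toy sketch of that shape, with the actual state machine elided and all names hypothetical (this is not arrow-json's real `TapeDecoder`): the helper is monomorphized twice, so the compiler removes the `if !SKIP` branch entirely in each copy and the hot path never tests "am I skipping?" per byte:

```rust
struct TapeDecoder {
    tape: Vec<u8>,  // stand-in for the real tape
    skipping: bool, // true if the last buffer ended mid-skip
}

impl TapeDecoder {
    /// One copy of the decoding loop, instantiated for both modes.
    fn decode_inner<const SKIP: bool>(&mut self, buf: &[u8]) {
        for &b in buf {
            // ... the real state machine transitions would go here ...
            if !SKIP {
                self.tape.push(b); // output is stored only when not skipping
            }
        }
    }

    fn decode(&mut self, buf: &[u8]) {
        if self.skipping {
            // Resume skipping across a buffer boundary.
            self.decode_skip(buf);
        } else {
            self.decode_inner::<false>(buf);
        }
    }

    fn decode_skip(&mut self, buf: &[u8]) {
        self.decode_inner::<true>(buf);
        // The real version would set `skipping` if `buf` was exhausted
        // before the skipped value ended; this toy always finishes.
        self.skipping = false;
    }
}
```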
Is that something you'd want me to put a bit more time into exploring further?
Or something you'd prefer to dig into yourself?
@Weijun-H could you update this PR (and the other JSON reader PRs) to resolve the conflicts and remove the now redundant benchmarks? I'll then run the benchmark numbers and give it a look.
I have a big-picture question: What trade-offs are we willing to make on validation of JSON values we will ultimately discard? At one extreme, we could fully parse and validate everything and just choose not to append the skipped bits to the tape afterward. At the other extreme, we completely ignore the bytes corresponding to skipped values, other than the bare minimum needed to be relatively confident we correctly identified the byte range to skip.

I think this PR currently leans toward the lenient-for-max-performance end of the spectrum. That's not necessarily bad, but the PR doesn't really talk about the trade-off. For example, if we decide we want to be maximally lenient in order to skip as quickly as possible, this PR may not be aggressive enough (dunno, haven't explored that direction yet). On the other hand, if we favor correctness even for skipped values, then this PR is probably too lenient (a motivating factor behind some of my previous comments, which I wasn't fully self-aware of at the time). Do we know what we want?
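To make the lenient end of that spectrum concrete, here is a hypothetical bare-minimum skipper (not code from this PR): it tracks only nesting depth and string/escape state, so malformed bytes inside the skipped range sail through undetected:

```rust
/// Bare-minimum skip of a value starting with '{' or '[': only nesting
/// depth and string/escape state are tracked, so invalid content inside
/// the skipped range is never detected.
fn lenient_skip(input: &[u8]) -> Option<usize> {
    let (mut depth, mut in_string, mut escaped) = (0usize, false, false);
    for (i, &b) in input.iter().enumerate() {
        if escaped {
            escaped = false; // byte after a backslash is always consumed
        } else if in_string {
            match b {
                b'\\' => escaped = true,
                b'"' => in_string = false,
                _ => {}
            }
        } else {
            match b {
                b'"' => in_string = true,
                b'{' | b'[' => depth += 1,
                b'}' | b']' => {
                    depth -= 1;
                    if depth == 0 {
                        return Some(i + 1); // end of the skipped value
                    }
                }
                _ => {} // numbers, literals, even garbage: never inspected
            }
        }
    }
    None // ran out of bytes mid-value
}
```

Such a scanner is about as fast as skipping can get, but it will cheerfully "skip" byte sequences that are not JSON at all, which is exactly the trade-off in question.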
That is an excellent question @scovich -- I don't have a great story but I did pull your comment into a separate ticket for discussion
What is the status of this PR? Should we proceed with it?

@alamb I think we were a bit stuck on the performance vs. correctness trade-off? Based on the above, we probably have to give up on enforcing correctness of the skipped bytes if we want the performance gains. My other worry is that the state machine has overhead for all object fields even though only top-level fields might be skipped. I think the latest version of the PR reduced that overhead quite a bit, but fundamentally there has to be a branch to decide whether to skip or not. That was actually the original motivation for my pathfinding -- to find a cheaper state machine -- but then I got sidetracked on the validation aspect.

Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
This PR implements projection-aware field skipping in the arrow-json reader:
- `ReaderBuilder::with_projection(bool)` enables opt-in field filtering

Behavior matrix:
Are these changes tested?
Yes, all existing tests pass
Are there any user-facing changes?
Yes, new public API:
- `ReaderBuilder::with_projection(bool)` - opt-in to skip unknown JSON fields during parsing

This is additive and does not break existing behavior (default is `false`).