[WIP] Json decoder factory v2 by scovich · Pull Request #9272 · apache/arrow-rs

scovich · 2026-01-26T22:42:27Z

Which issue does this PR close?

This is a variant of #9259, but stacked on top of three building-block PRs:

Part of [arrow-json] deserialize Variant fields #8987

Rationale for this change

See description of #9259. This version factors out the building blocks so it's easier to see what actually changes to add custom JSON decoder support.

What changes are included in this PR?

See description of #9259. Same net change, just organized differently.

Are these changes tested?

Yes. Existing and newly added unit tests.

Are there any user-facing changes?

Make JSON tape decoder classes public
Make JSON ArrayDecoder trait public
Make JSON DecoderContext class public
New public trait: DecoderFactory

scovich · 2026-01-26T22:43:11Z

Oops @alamb this should prob stay in draft status until (a) the dependencies it's stacked on merge; and (b) we decide we actually want this approach in this form?

scovich · 2026-01-28T00:07:23Z

Ok, rebased now that the three prefactor PRs have merged. The diff looks big, but the bulk of it is tests and doc comments. The actual change is pretty small.

The biggest single contributors are:

- 800+ LoC for the various motivating examples/tests in custom_decoder_tests.rs
  - Hopefully they're easily understood and help spur a healthy discussion on what the API should actually look like
  - TBD whether they stay as tests for now, or if some are deemed useful enough to become part of the API
- ~280 LoC to define the variant decoder and factory (+ tests)
  - Could potentially split out to a separate PR

tustvold · 2026-01-28T09:20:48Z

 pub mod writer;

-pub use self::reader::{Reader, ReaderBuilder};
+pub use self::reader::{ArrayDecoder, DecoderFactory, Reader, ReaderBuilder, Tape, TapeElement};


I think this is the key part that might be deemed controversial. Is Tape really a good thing to expose publicly? It's been a while since I wrote it, but I remember it not being especially friendly as an API, and something that stands a good chance of being changed in future - e.g. to avoid copying strings.

Good question, and I don't remember seeing any discussion on the original PR I'm building on here:

Allow extensions to arrow-json decoder and include an extension for variant #9021

Is there any way to allow users to customize parsing without exposing something? Other options might include:

- Create a new trait or wrapper that exposes the tape's information in a simplified/safe/stable way, to decouple users from the low-level details.
  - Maybe could work? Worth exploring?
- Convert the tape to variant, and shift the factory/decoder stuff over to variant-compute instead of the json crate
  - We'd still need something to allow parsing JSON to variant, which I believe is a canonical extension type that should be supported directly.
  - Variant is insanely complex once shredding comes into the picture, so such an interface would not be easier or safer to use IMO.
  - The extra layer of conversion would impose significant overhead for somebody who just wants to parse a few misbehaving columns in a special way.

Something else I'm not thinking of?
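To make the first option above more concrete, here is a minimal sketch of what a read-only view over tape entries could look like. Everything in it is hypothetical: `JsonKind`, `JsonValueView`, `MiniTape` and `lenient_i64` are illustration-only names, not arrow-json types, and the real `TapeElement` layout is deliberately not modeled.

```rust
/// Hypothetical, simplified classification of a JSON value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JsonKind {
    Null,
    Bool,
    Number,
    String,
    List,
    Object,
}

/// Hypothetical stable view that a custom decoder would program against
/// instead of the raw tape representation.
pub trait JsonValueView {
    /// Kind of the value stored at `pos`.
    fn kind(&self, pos: usize) -> JsonKind;
    /// Text of a string or number at `pos`, if it has one.
    fn text(&self, pos: usize) -> Option<&str>;
}

/// Toy backing store standing in for the real tape (illustration only).
pub struct MiniTape {
    entries: Vec<(JsonKind, Option<String>)>,
}

impl MiniTape {
    pub fn new(entries: Vec<(JsonKind, Option<String>)>) -> Self {
        Self { entries }
    }
}

impl JsonValueView for MiniTape {
    fn kind(&self, pos: usize) -> JsonKind {
        self.entries[pos].0
    }

    fn text(&self, pos: usize) -> Option<&str> {
        self.entries[pos].1.as_deref()
    }
}

/// Example custom decoding rule written only against the view:
/// accept numbers, and also strings that contain a number.
pub fn lenient_i64(view: &dyn JsonValueView, pos: usize) -> Option<i64> {
    match view.kind(pos) {
        JsonKind::Number | JsonKind::String => view.text(pos)?.trim().parse().ok(),
        _ => None,
    }
}
```

The point of the sketch is only that a rule like `lenient_i64` never touches the storage layout, so the layout could change without breaking it. Whether such a view can be made cheap enough for nested lists and objects is the open question.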

I agree if we want to allow users to override decoding behavior we are going to have to give them direct access to Tape / TapeElement - I don't really see any way around it

> something that stands a good chance of being changed in future - e.g. to avoid copying strings.

@tustvold -- what strings are you referring to? I don't see any strings copied here:

https://github.com/apache/arrow-rs/blob/7e5076f1f775a6fd08a4d63389e26e2920fe3f6a/arrow-json/src/reader/tape.rs#L34-L33

arrow-rs/arrow-json/src/reader/tape.rs, lines 96 to 101 in 7e5076f:

    pub struct Tape<'a> {
        elements: &'a [TapeElement],
        strings: &'a str,
        string_offsets: &'a [usize],
        num_rows: usize,
    }

There were some discussions a while back about the way that it copies all strings from the source data being unfortunate. I'm not sure how this is avoidable with the push-based decoder interface, but it has been discussed.
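For context on the copying concern, the `strings` and `string_offsets` fields quoted above suggest one contiguous string buffer plus offsets. The sketch below is a toy model of that kind of layout, not the arrow-json implementation; `StringArena` and `decode_chunk` are hypothetical names. It illustrates one plausible reason a push-based decoder ends up copying: each input chunk is only borrowed for the duration of a call, so anything kept afterwards has to live in storage the decoder owns.

```rust
/// Toy model (hypothetical) of a contiguous string buffer with offsets.
#[derive(Default)]
pub struct StringArena {
    /// All string bytes, copied in and stored back to back.
    strings: String,
    /// End offset of each stored string; string `i` starts where `i - 1` ends.
    offsets: Vec<usize>,
}

impl StringArena {
    /// Copies `s` out of the caller's buffer and returns its index.
    pub fn push(&mut self, s: &str) -> usize {
        self.strings.push_str(s);
        self.offsets.push(self.strings.len());
        self.offsets.len() - 1
    }

    /// Looks up a previously stored string by index.
    pub fn get(&self, idx: usize) -> Option<&str> {
        let end = *self.offsets.get(idx)?;
        let start = if idx == 0 { 0 } else { self.offsets[idx - 1] };
        Some(&self.strings[start..end])
    }
}

/// Simulates a push-style `decode(chunk)` call: the chunk is borrowed only
/// for this call, so every field that is kept gets copied into the arena.
pub fn decode_chunk(arena: &mut StringArena, chunk: &str) -> Vec<usize> {
    chunk.split(',').map(|field| arena.push(field)).collect()
}
```

In this model the caller is free to drop or reuse its chunk buffer as soon as `decode_chunk` returns, which is exactly what makes borrowing from the source data awkward.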

TBC I am not against making Tape public, but it likely needs some TLC prior to that to ensure it is usable and vaguely future-proof. Even basic things like adding non_exhaustive, hiding methods like this that are a bit odd to expose, etc...
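As an illustration of the `non_exhaustive` suggestion, here is a minimal sketch on a made-up enum. `ElementSketch` and `describe` are hypothetical and do not mirror the real `TapeElement` variants. The attribute forces code in other crates to include a wildcard arm when matching, so adding a variant later does not break their `match` expressions at compile time.

```rust
/// Hypothetical element type; the variants are illustrative only.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ElementSketch {
    Null,
    True,
    False,
    /// Index into a string buffer.
    String(u32),
}

/// A consumer written the way a downstream crate would have to write it.
/// Outside the defining crate the `_` arm is mandatory; inside the defining
/// crate it is redundant, hence the `allow` to silence the lint here.
#[allow(unreachable_patterns)]
pub fn describe(e: ElementSketch) -> &'static str {
    match e {
        ElementSketch::Null => "null",
        ElementSketch::True | ElementSketch::False => "bool",
        ElementSketch::String(_) => "string",
        // Reached only by variants added after this code was written.
        _ => "unknown",
    }
}
```

The trade-off is that downstream code can no longer rely on the compiler to flag newly added variants, since the wildcard arm silently absorbs them.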

If someone could come up with a list of changes that would make us comfortable making Tape public, I can try to put up a PR

@scovich is this something you can help drive? I think it is important but I really just don't have the bandwidth to give it the attention it deserves

From my perspective, the Arrow crates now permit making breaking API changes every 3 months (major releases) so if we expose a structure and then decide to make breaking changes to it, that isn't impossible

Thus in my mind, I think we should get something out and working

I suspect we are not likely to see changes to this interface unless we expose it and there are new use cases put forward.

I won't have time in the next couple of weeks, sadly. But yes I do want to see this through.

I do tend to agree we should get something out once it's reasonable, but we already know of several use cases and it's not obvious to me that even those known use cases are well-served by the current approach (speaking as somebody who wants to actually use whatever we come up with).

So: If we have something that works well for the use cases we've thought of, it's probably Good Enough and can evolve a quarter or two later as we discover more use cases or warts.

Is that a reasonable go/no-go criterion?

> So: If we have something that works well for the use cases we've thought of, it's probably Good Enough and can evolve a quarter or two later as we discover more use cases or warts.

This is my personal preferred approach -- let's get it out rather than waiting on something that is perfect

scovich · 2026-02-04T15:55:28Z

I just encountered (+ remembered) another use case for custom JSON decoding: Wrong-named fields. Whether due to schema evolution or flat-out typos, it may be desirable for multiple input field names to map to the same output field (with some precedence rules if multiple candidates are available).

Just lodging that thought here for now -- I didn't have time to see if the approach explored here can actually handle the use case.
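To pin down what the wrong-named-fields use case would require, here is a minimal sketch of the precedence rule on its own, independent of any decoder API. It assumes a row has already been reduced to a map of field name to raw value; `FieldAliases` and `resolve` are hypothetical names and nothing here is part of arrow-json.

```rust
use std::collections::HashMap;

/// Hypothetical alias rule: one output field and the input names that may
/// supply it, listed from highest to lowest precedence.
pub struct FieldAliases<'a> {
    pub output: &'a str,
    pub candidates: &'a [&'a str],
}

/// Picks, for each rule, the value of the first candidate present in `row`.
/// Output fields with no matching candidate are left out of the result.
pub fn resolve<'a>(
    row: &HashMap<&str, &'a str>,
    rules: &[FieldAliases<'a>],
) -> HashMap<&'a str, &'a str> {
    rules
        .iter()
        .filter_map(|rule| {
            rule.candidates
                .iter()
                // First candidate that exists in the row wins.
                .find_map(|name| row.get(*name).copied())
                .map(|value| (rule.output, value))
        })
        .collect()
}
```

For example, with candidates `["user_id", "userid", "usr_id"]` and a row containing both `userid` and `usr_id`, the `userid` value is chosen because it is listed first. A real decoder would have to apply the same rule while walking an object's fields rather than on a pre-built map.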

github-actions bot added the parquet, arrow, and parquet-variant labels Jan 26, 2026

json custom decoder support

92929be

scovich force-pushed the json-decoder-factory-v2 branch from 8f683d7 to 92929be Compare January 27, 2026 23:56

github-actions bot removed the parquet label Jan 27, 2026

tustvold reviewed Jan 28, 2026

scovich wants to merge 1 commit into apache:main from scovich:json-decoder-factory-v2

scovich commented Jan 26, 2026
scovich commented Jan 26, 2026
scovich commented Jan 28, 2026
tustvold Jan 28, 2026
scovich Jan 28, 2026
alamb Jan 28, 2026
tustvold Jan 29, 2026 (edited)
debugmiller Feb 7, 2026
alamb Feb 11, 2026
scovich Feb 12, 2026
alamb Feb 12, 2026
scovich commented Feb 4, 2026

4 participants
