[BLOG] Variant Type in Apache Parquet for Semi-Structured Data #171
alamb merged 17 commits into apache:production
Conversation
alamb left a comment:
Thank you @aihuaxu -- this looks amazing to me, though I am somewhat biased 😆
Sorry for the delay, I am out this week.
I left some comments, but I don't think any of them are actually required to publish this post. I also made some small proposed changes here for your consideration:
I'll also post to the Parquet mailing list soliciting additional feedback
How about we shoot to publish this in about a week (Feb 26, 2026)?
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I also left a note on the parquet dev list asking for additional feedback: https://lists.apache.org/thread/vms8ohk4onl8fom9n9zkql2ctgdbz3lo
## Why Variant?

Traditional approaches store JSON as text strings, so accessing any field requires parsing the full document, making queries slow and resource-intensive. Variant solves this by storing data in a **structured binary format** that enables direct field access through offset-based navigation. Query engines can jump directly to nested fields without deserializing the entire document, dramatically improving performance.
It might be better to motivate why Variant with an example here. event_data is mentioned below, so maybe we can talk about storing logged event data, which has a structure that might evolve because new events are added, or fields are added to or removed from a specific event type.
Then motivate this with a query, e.g. `SELECT event[timestamp] WHERE event[user] = 5`. Then you can explain that if the events were just stored as JSON (their original format), we would require parsing the JSON to do any processing. BSON could be an improvement but is still potentially sub-optimal because you end up with repeated storage of common keys.
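To make the parsing-cost and repeated-key points concrete, here is a minimal sketch using plain Python and the standard `json` module (the event document is made up for illustration):

```python
import json

# A logged event stored as a JSON text string (hypothetical document).
doc = '{"user": 5, "event_type": "click", "timestamp": "2026-01-28T10:00:00Z"}'

# With text JSON, reading even one field forces a parse of the whole document.
event = json.loads(doc)
print(event["timestamp"])  # prints 2026-01-28T10:00:00Z

# Note also that every row repeats the same key strings ("user", "event_type",
# "timestamp", ...) -- the repeated-key storage overhead that BSON pays as well.
```

Variant's binary encoding avoids both costs: offsets allow jumping straight to a field, and a shared metadata dictionary avoids repeating key strings per value.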
I updated it a little. Please take a look.
emkornfield left a comment:
Overall, looks good. Let's please check the status of Arrow C++ before publishing.
The other comments are not blockers, but I think this would be a more interesting read if the motivating examples came first, followed by a discussion of why existing types don't support the use case well.
emkornfield left a comment:
Thanks for the changes, mostly looks good. A few more non-required suggestions on making the blog more focused.
The following example shows shredding non-nested Variant values. In this case, the writer chose to shred string values into the `typed_value` column. Rows that do not contain strings are stored in the `value` column with binary Variant encoding.
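The string-shredding rule described above can be sketched in Python (a toy illustration only, not Parquet's actual encoding; `rows` and the plain Python values standing in for binary Variant bytes are made up):

```python
# Toy sketch of shredding a Variant column on string values:
# strings land in typed_value; everything else stays Variant-encoded in value.
rows = ["iPhone", 42, "Pixel", {"model": "unknown"}]

typed_value = [v if isinstance(v, str) else None for v in rows]
# In real Parquet, these would be binary Variant-encoded bytes, not Python objects.
value = [None if isinstance(v, str) else v for v in rows]

print(typed_value)  # ['iPhone', None, 'Pixel', None]
print(value)        # [None, 42, None, {'model': 'unknown'}]
```

Exactly one of `typed_value` / `value` is populated per row, which lets readers scan the typed column directly while still round-tripping values of other types.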
```parquet
optional group SIMPLE_DATA (VARIANT(1)) = 1 {
```
As we are using VARIANT(1) here, should we also update the spec to be consistent? https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant
That is a good point -- I double checked that the `(1)` is the "Variant Version":
```thrift
/**
 * Embedded Variant logical type annotation
 */
struct VariantType {
  // The version of the variant specification that the variant was
  // written with.
  1: optional i8 specification_version
}
```

I made a PR to update
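For illustration only, the Thrift struct above behaves like this hypothetical Python mirror (not the real generated code; the class is a stand-in to show how the version relates to the schema annotation):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mirror of the VariantType Thrift struct.
@dataclass
class VariantType:
    # Version of the Variant specification the data was written with;
    # the "(1)" in "VARIANT(1)" corresponds to specification_version = 1.
    specification_version: Optional[int] = None

annotation = VariantType(specification_version=1)
print(annotation)  # VariantType(specification_version=1)
```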
I pushed a few small changes:
Co-authored-by: emkornfield <emkornfield@gmail.com>
```sql
);

-- Insert data from different sensor types
INSERT INTO sensor_readings VALUES
```
I also had to update this example when I tested using spark-sql:

```
spark-sql (default)> INSERT INTO sensor_readings VALUES
                   > (1, '2026-01-28 10:00:00',
                   > PARSE_JSON('{"sensor_id": "T001", "temp": 72.5, "unit": "F", "battery": 95}')),
                   > (2, '2026-01-28 10:00:05',
                   > PARSE_JSON('{"sensor_id": "M001", "motion_detected": true, "confidence": 0.95, "zone": "entrance"}')),
                   > (3, '2026-01-28 10:00:10',
                   > PARSE_JSON('{"sensor_id": "C001", "image_url": "s3://bucket/img_001.jpg", "objects_detected": ["person", "vehicle"]}'));
[INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_SAFELY_CAST] Cannot write incompatible data for the table `spark_catalog`.`default`.`sensor_readings`: Cannot safely cast `timestamp` "STRING" to "TIMESTAMP". SQLSTATE: KD000
```

So I added an explicit `::timestamp` cast in 43fc111:

```
spark-sql (default)> INSERT INTO sensor_readings VALUES
                   > (1, '2026-01-28 10:00:00'::timestamp,
                   > PARSE_JSON('{"sensor_id": "T001", "temp": 72.5, "unit": "F", "battery": 95}')),
                   > (2, '2026-01-28 10:00:05'::timestamp,
                   > PARSE_JSON('{"sensor_id": "M001", "motion_detected": true, "confidence": 0.95, "zone": "entrance"}')),
                   > (3, '2026-01-28 10:00:10'::timestamp,
                   > PARSE_JSON('{"sensor_id": "C001", "image_url": "s3://bucket/img_001.jpg", "objects_detected": ["person", "vehicle"]}'));
Time taken: 1.067 seconds
```
Ok, I think this PR has had enough review and all the outstanding comments have been addressed. I'll update the date and then publish it. We can address any additional feedback in follow-on PRs.
I followed the geospatial blog to create this PR.
This adds a blog post about Variant in Parquet, with the initial draft in this Google doc.