[BLOG] Variant Type in Apache Parquet for Semi-Structured Data#171

Merged
alamb merged 17 commits into apache:production from aihuaxu:variant-blog
Feb 27, 2026

Contributor

@aihuaxu aihuaxu commented Feb 15, 2026

I followed the geospatial blog post to create this PR.

This adds the blog post on Variant in Parquet; the initial draft is in this Google doc.

Collaborator

@alamb alamb left a comment

Thank you @aihuaxu -- this looks (amazing) to me, though I am somewhat biased 😆

Sorry for the delay, I am out this week.

I left some comments, but I don't think any of them are actually required to publish this post. I also made some small proposed changes in here for your consideration:

I'll also post to the Parquet mailing list soliciting additional feedback.

How about we shoot to publish this in about a week (Feb 26, 2026)?

aihuaxu and others added 2 commits February 21, 2026 09:35
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Collaborator

alamb commented Feb 21, 2026

I also left a note on the parquet dev list asking for additional feedback: https://lists.apache.org/thread/vms8ohk4onl8fom9n9zkql2ctgdbz3lo

Comment thread content/en/blog/features/variant.md Outdated

## Why Variant?

Traditional approaches store JSON as text strings and require full parsing to access any field, which makes queries slow and resource-intensive. Variant instead stores data in a **structured binary format** that enables direct field access through offset-based navigation. Query engines can jump directly to nested fields without deserializing the entire document, which can substantially improve performance.
Contributor

It might be better to motivate why Variant with an example here. `event_data` is mentioned below, so maybe we can talk about storing logged event data whose structure might evolve because new events are added, or fields are added to or removed from a specific event type.

Then motivate this with a query, e.g. `SELECT event[timestamps] WHERE event[User] = 5`. Then you can explain that if the events were just stored as JSON (their original format), we would need to parse the JSON to do any processing. BSON could be an improvement but is still potentially suboptimal because you end up with repeated storage of common keys.
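
As a rough illustration of this point, the sketch below runs that kind of query two ways: over JSON text, where every row must be fully parsed, and over a toy binary layout with a shared key dictionary and per-field offsets. The layout, the helper names (`encode_value`, `encode_row`, `get_field`), and the event fields are all assumptions made up for this example; this is not the actual Variant (or BSON) encoding.

```python
import json
import struct


def encode_value(v):
    # Toy scalar encoding (hypothetical): 1-byte type tag, then the payload.
    if isinstance(v, int):
        return b"i" + struct.pack("<q", v)
    return b"s" + str(v).encode("utf-8")


def encode_row(obj, key_ids):
    # Toy row layout (hypothetical, NOT the Variant spec):
    #   [n:u8] [field_id:u8]*n [offset:u32]*(n+1) [values...]
    # Key strings live once in the shared `key_ids` dictionary, not in each row.
    fields = [(key_ids.setdefault(k, len(key_ids)), v) for k, v in obj.items()]
    payloads = [encode_value(v) for _, v in fields]
    offsets = [0]
    for p in payloads:
        offsets.append(offsets[-1] + len(p))
    n = len(fields)
    header = bytes([n]) + bytes(fid for fid, _ in fields)
    header += struct.pack(f"<{n + 1}I", *offsets)
    return header + b"".join(payloads)


def get_field(row, field_id):
    # Find the field's slot, then jump to its bytes via the offset table.
    # Other fields in the row are never decoded.
    n = row[0]
    i = row[1:1 + n].find(field_id)
    if i < 0:
        return None
    offsets_start = 1 + n
    start, end = struct.unpack_from("<2I", row, offsets_start + 4 * i)
    values_start = offsets_start + 4 * (n + 1)
    raw = row[values_start + start:values_start + end]
    if raw[:1] == b"i":
        return struct.unpack("<q", raw[1:])[0]
    return raw[1:].decode("utf-8")


# Hypothetical event log whose shape varies from row to row.
events = [
    {"user": 5, "timestamp": "2026-01-28T10:00:00", "type": "login"},
    {"user": 7, "timestamp": "2026-01-28T10:00:05", "type": "click", "target": "buy"},
    {"user": 5, "timestamp": "2026-01-28T10:00:10", "type": "logout"},
]

# JSON text: every row is parsed in full just to read two fields.
json_rows = [json.dumps(e) for e in events]
parsed = (json.loads(r) for r in json_rows)
json_result = [e["timestamp"] for e in parsed if e["user"] == 5]

# Toy binary layout: only the requested fields are decoded.
key_ids = {}
binary_rows = [encode_row(e, key_ids) for e in events]
user_id, ts_id = key_ids["user"], key_ids["timestamp"]
binary_result = [get_field(r, ts_id) for r in binary_rows if get_field(r, user_id) == 5]
```

Both approaches return the same timestamps. In the toy layout each key string is stored once in `key_ids` rather than in every row, which is the repeated-key overhead mentioned above.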

Contributor Author

I updated it a little bit. Please take a look.

Contributor

@emkornfield emkornfield left a comment

Overall, looks good. Let's please check the status of Arrow C++ before publishing.

The other comments are not blockers, but I think this would be a more interesting read if it put the motivating examples first and discussed why existing types don't support the use case well.

Contributor

@emkornfield emkornfield left a comment

Thanks for the changes, mostly looks good. A few more non-required suggestions on making the blog more focused.

The following example shows shredding non-nested Variant values. In this case, the writer chose to shred string values as the `typed_value` column. Rows that do not contain strings are stored in the `value` column with binary Variant encoding.

```parquet
optional group SIMPLE_DATA (VARIANT(1)) = 1 {
```
Member

As we are using VARIANT(1) here, should we also update the spec to be consistent? https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant

Collaborator

That is a good point -- I double checked that the (1) is the "Variant Version":

https://github.com/apache/parquet-format/blob/38818fa0e7efd54b535001a4448030a40619c2a3/src/main/thrift/parquet.thrift#L409-L413

/**
 * Embedded Variant logical type annotation
 */
struct VariantType {
  // The version of the variant specification that the variant was
  // written with.
  1: optional i8 specification_version
}

I made a PR to update

Collaborator

alamb commented Feb 24, 2026

I pushed a few small changes:

  1. Adjusted the title to better mirror the geospatial blog title "Native Geospatial Types in Apache Parquet". New title is "Variant Type in Apache Parquet for Semi-Structured Data"
  2. Removed the ## Introduction heading (left the content, just removed the heading)
  3. Tweaked some table headings to look better visually
  4. Vertically aligned some comments in the examples

Co-authored-by: emkornfield <emkornfield@gmail.com>
Comment thread content/en/blog/features/variant.md
);

-- Insert data from different sensor types
INSERT INTO sensor_readings VALUES
Collaborator

I also had to update this example when I tested using spark-sql:

spark-sql (default)> INSERT INTO sensor_readings VALUES
                   > (1, '2026-01-28 10:00:00',
                   >      PARSE_JSON('{"sensor_id": "T001", "temp": 72.5, "unit": "F", "battery": 95}')),
                   >     (2, '2026-01-28 10:00:05',
                   >      PARSE_JSON('{"sensor_id": "M001", "motion_detected": true, "confidence": 0.95, "zone": "entrance"}')),
                   >     (3, '2026-01-28 10:00:10',
                   >      PARSE_JSON('{"sensor_id": "C001", "image_url": "s3://bucket/img_001.jpg", "objects_detected": ["person", "vehicle"]}'));
[INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_SAFELY_CAST] Cannot write incompatible data for the table `spark_catalog`.`default`.`sensor_readings`: Cannot safely cast `timestamp` "STRING" to "TIMESTAMP". SQLSTATE: KD000

So I added an explicit ::timestamp cast in 43fc111

spark-sql (default)> INSERT INTO sensor_readings VALUES
                   >     (1, '2026-01-28 10:00:00'::timestamp,
                   >      PARSE_JSON('{"sensor_id": "T001", "temp": 72.5, "unit": "F", "battery": 95}')),
                   >     (2, '2026-01-28 10:00:05'::timestamp,
                   >      PARSE_JSON('{"sensor_id": "M001", "motion_detected": true, "confidence": 0.95, "zone": "entrance"}')),
                   >     (3, '2026-01-28 10:00:10'::timestamp,
                   >      PARSE_JSON('{"sensor_id": "C001", "image_url": "s3://bucket/img_001.jpg", "objects_detected": ["person", "vehicle"]}'));
Time taken: 1.067 seconds

@alamb alamb changed the title from "[BLOG] Variant Blog" to "[BLOG] Variant Type in Apache Parquet for Semi-Structured Data" Feb 27, 2026
Collaborator

alamb commented Feb 27, 2026

Ok, I think this PR has had enough review and all the outstanding comments have been addressed. I'll update the date and then publish it.

We can address any additional feedback in follow-on PRs.

@alamb alamb merged commit 62e281c into apache:production Feb 27, 2026
1 check passed