[BLOG] Variant Type in Apache Parquet for Semi-Structured Data #171
alamb merged 17 commits into apache:production
Conversation
alamb left a comment:
Thank you @aihuaxu -- this looks amazing to me, though I am somewhat biased 😆
Sorry for the delay, I am out this week.
I left some comments, but I don't think any of them are actually required to publish this post. I also made some small proposed changes here for your consideration:
I'll also post to the Parquet mailing list soliciting additional feedback
How about we shoot to publish this in about a week (Feb 26, 2026)?
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I also left a note on the parquet dev list asking for additional feedback: https://lists.apache.org/thread/vms8ohk4onl8fom9n9zkql2ctgdbz3lo
## Why Variant?

Traditional approaches store JSON as text strings, so accessing any field requires parsing the full document, making queries slow and resource-intensive. Variant solves this by storing data in a **structured binary format** that enables direct field access through offset-based navigation. Query engines can jump directly to nested fields without deserializing the entire document, dramatically improving performance.
It might be better to motivate why Variant with an example here. event_data is mentioned below, so maybe we can talk about storing logged event data, which has a structure that might evolve because new events are added, or fields are added to or removed from a specific event type.
Then motivate this with a query, e.g. `SELECT event[timestamp] WHERE event[user] = 5`. Then you can explain that if the events were just stored as JSON (their original format), we would require parsing the JSON to do any processing. BSON could be an improvement but is still potentially sub-optimal because you end up with repeated storage of common keys.
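To make the parsing-cost and repeated-key points concrete, here is a minimal sketch using plain Python and the standard `json` module (the event document is made up for illustration):

```python
import json

# A logged event stored as a JSON text string (hypothetical document).
doc = '{"user": 5, "event_type": "click", "timestamp": "2026-01-28T10:00:00Z"}'

# With text JSON, reading even one field forces a parse of the whole document.
event = json.loads(doc)
print(event["timestamp"])  # prints 2026-01-28T10:00:00Z

# Note also that every row repeats the same key strings ("user", "event_type",
# "timestamp", ...) -- the repeated-key storage overhead that BSON pays as well.
```

Variant's binary encoding avoids both costs: offsets allow jumping straight to a field, and a shared metadata dictionary avoids repeating key strings per value.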
I updated it a little. Please take a look.
emkornfield left a comment:
Overall, looks good. Let's please check the status of Arrow C++ before publishing.
The other comments are not blockers, but I think this would be a more interesting read if the motivating examples came first, followed by a discussion of why existing types don't support the use case well.
emkornfield left a comment:
Thanks for the changes, mostly looks good. A few more non-required suggestions on making the blog more focused.
The following example shows shredding non-nested Variant values. In this case, the writer chose to shred string values into the `typed_value` column. Rows that do not contain strings are stored in the `value` column with binary Variant encoding.
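The string-shredding rule described above can be sketched in Python (a toy illustration only, not Parquet's actual encoding; `rows` and the plain Python values standing in for binary Variant bytes are made up):

```python
# Toy sketch of shredding a Variant column on string values:
# strings land in typed_value; everything else stays Variant-encoded in value.
rows = ["iPhone", 42, "Pixel", {"model": "unknown"}]

typed_value = [v if isinstance(v, str) else None for v in rows]
# In real Parquet, these would be binary Variant-encoded bytes, not Python objects.
value = [None if isinstance(v, str) else v for v in rows]

print(typed_value)  # ['iPhone', None, 'Pixel', None]
print(value)        # [None, 42, None, {'model': 'unknown'}]
```

Exactly one of `typed_value` / `value` is populated per row, which lets readers scan the typed column directly while still round-tripping values of other types.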
```parquet
optional group SIMPLE_DATA (VARIANT(1)) = 1 {
```
As we are using VARIANT(1) here, should we also update the spec to be consistent? https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant
That is a good point -- I double checked that the `(1)` is the "Variant Version":
```thrift
/**
 * Embedded Variant logical type annotation
 */
struct VariantType {
  // The version of the variant specification that the variant was
  // written with.
  1: optional i8 specification_version
}
```

I made a PR to update
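For illustration only, the Thrift struct above behaves like this hypothetical Python mirror (not the real generated code; the class is a stand-in to show how the version relates to the schema annotation):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mirror of the VariantType Thrift struct.
@dataclass
class VariantType:
    # Version of the Variant specification the data was written with;
    # the "(1)" in "VARIANT(1)" corresponds to specification_version = 1.
    specification_version: Optional[int] = None

annotation = VariantType(specification_version=1)
print(annotation)  # VariantType(specification_version=1)
```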
I pushed a few small changes:
Co-authored-by: emkornfield <emkornfield@gmail.com>
```sql
);

-- Insert data from different sensor types
INSERT INTO sensor_readings VALUES
```
I also had to update this example when I tested using spark-sql:

```
spark-sql (default)> INSERT INTO sensor_readings VALUES
                   > (1, '2026-01-28 10:00:00',
                   > PARSE_JSON('{"sensor_id": "T001", "temp": 72.5, "unit": "F", "battery": 95}')),
                   > (2, '2026-01-28 10:00:05',
                   > PARSE_JSON('{"sensor_id": "M001", "motion_detected": true, "confidence": 0.95, "zone": "entrance"}')),
                   > (3, '2026-01-28 10:00:10',
                   > PARSE_JSON('{"sensor_id": "C001", "image_url": "s3://bucket/img_001.jpg", "objects_detected": ["person", "vehicle"]}'));
[INCOMPATIBLE_DATA_FOR_TABLE.CANNOT_SAFELY_CAST] Cannot write incompatible data for the table `spark_catalog`.`default`.`sensor_readings`: Cannot safely cast `timestamp` "STRING" to "TIMESTAMP". SQLSTATE: KD000
```

So I added an explicit `::timestamp` cast in 43fc111:

```
spark-sql (default)> INSERT INTO sensor_readings VALUES
                   > (1, '2026-01-28 10:00:00'::timestamp,
                   > PARSE_JSON('{"sensor_id": "T001", "temp": 72.5, "unit": "F", "battery": 95}')),
                   > (2, '2026-01-28 10:00:05'::timestamp,
                   > PARSE_JSON('{"sensor_id": "M001", "motion_detected": true, "confidence": 0.95, "zone": "entrance"}')),
                   > (3, '2026-01-28 10:00:10'::timestamp,
                   > PARSE_JSON('{"sensor_id": "C001", "image_url": "s3://bucket/img_001.jpg", "objects_detected": ["person", "vehicle"]}'));
Time taken: 1.067 seconds
```
Ok, I think this PR has had enough review and all the outstanding comments have been addressed. I'll update the date and then publish it. We can address any additional feedback in follow-on PRs.
I followed the geospatial blog to create this PR.
This adds a blog post about Variant in Parquet, with the initial draft in this Google doc.