---
title: "Variant Type in Apache Parquet for Semi-Structured Data"
date: 2026-02-27
description: "Introducing Native Variant Type in Apache Parquet"
author: "[Aihua Xu](https://github.com/aihuaxu), [Andrew Lamb](https://github.com/alamb)"
categories: ["features"]
---

The Apache Parquet community is excited to announce the addition of the **Variant type**—a feature that brings native support for semi-structured data to Parquet, with significantly better storage and query efficiency than text formats such as JSON. This marks a major addition to Parquet, demonstrating how the format continues to evolve to meet modern data engineering needs.

While Apache Parquet has long been the standard for structured data where each value has a fixed and known type, handling heterogeneous, nested data often required a compromise: either store it as a costly-to-parse JSON string or flatten it into a rigid schema. The introduction of the Variant logical type provides a native, high-performance solution for semi-structured data that is already seeing rapid uptake across the ecosystem.

---

## What is Variant?

**Variant** is a self-describing data type designed to efficiently store and process semi-structured data—JSON-like documents with arbitrary and evolving schemas.

---

## Why Variant?

Consider a common scenario: storing logged event data whose schema evolves as new event types are added, or as fields are added to or removed from specific event types. For example, you might have events like:

```json
{"timestamp": "2026-01-15T10:30:00Z", "user": 5, "event": "login"}
{"timestamp": "2026-01-15T11:45:00Z", "user": 5, "event": "purchase", "amount": 99.99}
{"timestamp": "2026-01-15T12:00:00Z", "user": 7, "event": "login", "device": "mobile"}
```

Traditional approaches that store JSON as text strings require full parsing to access any field, making queries slow and resource-intensive. Variant solves this by storing data in a **structured binary format** that enables direct field access through offset-based navigation. Query engines can jump directly to nested fields without deserializing the entire document, dramatically improving performance.
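
The contrast can be sketched in Spark SQL (assuming a Spark 4.0 session; `events_text` and `events_variant` are hypothetical tables holding the same events as a JSON string column and a Variant column, respectively):

```sql
-- JSON stored as text: the whole string must be re-parsed on every access
SELECT get_json_object(raw_json, '$.event') FROM events_text;

-- Variant: the binary encoding is navigated directly via offsets
SELECT variant_get(event_data, '$.event', 'string') FROM events_variant;
```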

Binary encodings like BSON improve upon plain JSON by storing data in binary form, but they still redundantly store field names like `"timestamp"`, `"user"`, and `"event"` in every row, wasting storage space. Variant is optimized for the common case where multiple values share a similar structure: it avoids storing repeated field names redundantly and standardizes the best practice of **"shredded storage"** for pre-extracting structured subsets.

### Key Benefits

- **Type-Preserving Storage:** Data types (integers, strings, booleans, timestamps, etc.) are preserved in their native formats, unlike JSON, whose limited type system has no native representation for timestamps or binary data.

- **Efficient Encoding:** The binary format uses field-name deduplication to minimize storage overhead compared to JSON strings or BSON encoding.

- **Fast Query Performance:** Direct offset-based field access provides performance improvements over JSON string parsing. Optional shredding of frequently accessed fields into typed columns further enhances query pruning and predicate pushdown.

- **Schema Flexibility:** No predefined schema is required, allowing documents with different structures to coexist in the same column. This enables seamless schema evolution with full queryability across all schema variations, while still taking advantage of common structure when present.

---

## Overview of Variant Type in Parquet

Parquet introduced the [Variant logical type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant) in [August 2025](https://github.com/apache/parquet-format/pull/509).

### Variant Encoding

In Parquet, Variant is represented as a logical type and stored physically as a struct with two binary fields. The encoding is [designed](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md) so engines can efficiently navigate nested structures and extract only the fields they need, rather than parsing the entire binary blob.

```parquet
optional group event_data (VARIANT(1)) {
  required binary metadata;
  required binary value;
}
```

- **`metadata`:** Encodes type information and shared dictionaries (for example, field-name dictionaries for objects). This avoids repeatedly storing the same strings and enables efficient navigation.
- **`value`:** Encodes the actual data in a compact binary form, supporting primitive values as well as arrays and objects.

#### Example

A web access event can be stored in a single Variant column while preserving the original data types:

```json
{
  "userId": 12345,
  "events": [
    {"eType": "login", "timestamp": "2026-01-15T10:30:00Z"},
    {"eType": "purchase", "timestamp": "2026-01-15T11:45:00Z", "amount": 99.99}
  ]
}
```

Compared with storing the same payload as a JSON string, Variant retains type information (for example, timestamp values are stored as integers rather than as strings), which improves correctness, enables more efficient querying, and requires fewer bytes to store.
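
As a quick illustration of typed access (a sketch assuming Spark 4.0's `variant_get` function), extracting the nested timestamp yields a TIMESTAMP value rather than a string:

```sql
-- variant_get navigates to the field and returns it as the requested type
SELECT variant_get(
         parse_json('{"eType": "login", "timestamp": "2026-01-15T10:30:00Z"}'),
         '$.timestamp', 'timestamp') AS ts;
```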

Just as importantly, Variant supports **schema variability**: records with different shapes can coexist in the same column without requiring schema migrations. For example, the following record can be stored alongside the event record above:

```json
{
  "userId": 12345,
  "error": "auth_failure"
}
```

---

## Shredding Encoding

To enhance query performance and storage efficiency, Variant data can be **shredded** by extracting frequently accessed fields into separate, strongly typed columns, as described in the [detailed shredding specification](https://github.com/apache/parquet-format/blob/master/VariantShredding.md). For each shredded field:

- If the field **matches the expected schema**, its value is written to the strongly typed field.
- If the field **does not match**, the original representation is written as a Variant-encoded binary field and the corresponding strongly typed field is left NULL.

![Shredding Variant Visualization](/blog/variant/variant_shredding.png)

The Parquet writer, typically a query engine, decides which fields to shred based on access patterns and workload characteristics. Once shredded, the standard Parquet columnar optimizations (encoding, compression, statistics) are used for the typed columns.

### Implementation Considerations

- **Schema Inference:** Engines can infer the shredding schema from sample data by selecting the most frequently occurring type for each field. For example, if `event.id` is predominantly an integer, the engine shreds it to an INT64 column.

- **Type Promotion:** To maximize shredding coverage, engines can promote types within the same type family. For example, if integer values vary in size (INT8, INT32, INT64), selecting INT64 as the shredded type ensures all integer values can be shredded rather than falling back to the unshredded representation.

- **Metadata Control:** Engines may limit the number of shredded fields to control metadata overhead, since each field contributes statistics (min/max values, null counts) to the file footer.

- **Explicit Shredding Schema:** When read patterns are known in advance, engines can specify an explicit shredding schema at write time, ensuring that frequently accessed fields are shredded for optimal query performance.

### Performance Characteristics

- **Selective field access:** When queries touch only shredded fields, only those columns are read from the Parquet file, skipping the rest and benefiting from column pruning and predicate pushdown.

- **Full Variant reconstruction:** When queries require the complete Variant object, there is a performance overhead: the engine must reconstruct the Variant by merging the shredded typed fields with the base Variant column.

### Examples of Shredded Parquet Schemas

The following example shows shredding non-nested Variant values. In this case, the writer chose to shred string values into the `typed_value` column. Rows that do not contain strings are stored in the `value` column using the binary Variant encoding.

```parquet
optional group SIMPLE_DATA (VARIANT(1)) = 1 {
  required binary metadata;               # variant metadata
  optional binary value;                  # non-shredded value
  optional binary typed_value (STRING);   # the shredded value
}
```

The sequence of Variant values `"Jim"`, `100`, `{"name": "Jim"}` is encoded as:

| Variant Value | `value` | `typed_value` |
|---------------|---------|---------------|
| `"Jim"` | `null` | `"Jim"` |
| `100` | `100` | `null` |
| `{"name": "Jim"}` | `{"name": "Jim"}` | `null` |

---

Shredding nested Variant values is similar, with shredding applied recursively, as shown in the following example. In this case, the `userId` field is shredded as an integer and stored in two columns: `typed_value.userId.typed_value` when the value is an integer, and `typed_value.userId.value` otherwise. Similarly, the `eType` field is shredded as a string and stored in `typed_value.eType.typed_value` and `typed_value.eType.value`.

```parquet
optional group EVENT_DATA (VARIANT(1)) = 1 {
  required binary metadata;                  # variant metadata
  optional binary value;                     # non-shredded value
  optional group typed_value {
    required group userId {                  # userId field
      optional binary value;                 # non-shredded value
      optional int32 typed_value;            # the shredded value
    }
    required group eType {                   # eType field
      optional binary value;                 # non-shredded value
      optional binary typed_value (STRING);  # the shredded value
    }
  }
}
```

**The table below illustrates how the data is stored:**

| Variant Value | `value` | `typed_value`<br/>`.userId`<br/>`.value` | `typed_value`<br/>`.userId`<br/>`.typed_value` | `typed_value`<br/>`.eType`<br/>`.value` | `typed_value`<br/>`.eType`<br/>`.typed_value` |
|---|---|---|---|---|---|
| `{"userId": 100, "eType": "login"}` | `null` | `null` | `100` | `null` | `"login"` |
| `100` | `100` | `null` | `null` | `null` | `null` |
| `{"userId": "Jim"}` | `null` | `"Jim"` | `null` | `null` | `null` |
| `{"userId": 200, "amount": 99}` | `{"amount": 99}` | `null` | `200` | `null` | `null` |

---

## Ecosystem Adoption: A Success Story

One of the most remarkable aspects of Variant's addition to Parquet is the rapid and widespread ecosystem adoption, demonstrating the strength of collaboration within the Apache Parquet community.

Variant support has been implemented across multiple Parquet libraries, including **Java**, **Rust**, and **Go**. For the most current implementation status across all languages and platforms, refer to the [official Parquet implementation status page](https://parquet.apache.org/docs/file-format/implementationstatus/).

Major query engines have also integrated Variant support, including **DuckDB**, **[Apache Spark](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.types.VariantType.html)**, and **[Snowflake](https://docs.snowflake.com/en/sql-reference/data-types-semistructured)**. This cross-ecosystem adoption highlights both the value of the Variant type and the Parquet community's commitment to evolving the format to meet modern data challenges.

---

## Real-World Examples

This section illustrates how users can interact with Variant using [Apache Spark 4.0].

[Apache Spark 4.0]: https://spark.apache.org/releases/spark-release-4-0-0.html

### Event Stream Analytics

Event streaming applications often handle events with evolving schemas, where different event types contain varying fields. Variant provides a flexible solution for storing heterogeneous event data without requiring schema migrations.

**Example: User Activity Events**

```sql
-- Create table with Variant column
CREATE TABLE event_stream (
  event_id INTEGER,
  event_data VARIANT
);

-- Insert events with different schemas
INSERT INTO event_stream VALUES
  (1, PARSE_JSON('{"user": {"id": 100, "country": "US"}, "actions": ["login", "view_dashboard"]}')),
  (2, PARSE_JSON('{"user": {"id": 101, "country": "UK", "premium": true}, "actions": ["login", "upgrade"]}')),
  (3, PARSE_JSON('{"user": {"id": 102, "country": "CA"}, "session_duration": 3600}'));

-- Query events with path notation - handles different schemas gracefully
SELECT
  event_id,
  event_data:user.id::INTEGER AS user_id,
  event_data:user.country::STRING AS country,
  event_data:user.premium::BOOLEAN AS is_premium
FROM event_stream;
```
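
Where the `:` path syntax is unavailable, the same extraction can be written with Spark 4.0's `variant_get` and `try_variant_get` functions (a sketch against the `event_stream` table above; `try_variant_get` returns NULL instead of failing when a path or cast does not apply):

```sql
SELECT
  event_id,
  variant_get(event_data, '$.user.id', 'int')              AS user_id,
  variant_get(event_data, '$.user.country', 'string')      AS country,
  try_variant_get(event_data, '$.user.premium', 'boolean') AS is_premium
FROM event_stream;
```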

---

### IoT Sensor Data

IoT deployments often involve diverse sensor types, each producing data with a unique structure. Traditional approaches require either separate tables per sensor type, complex union schemas, or inefficient JSON/BSON encoding. Variant enables unified storage while maintaining type safety.

**Example: Multi-Sensor Data Pipeline**

```sql
-- Create unified sensor table
CREATE TABLE sensor_readings (
  reading_id INTEGER,
  timestamp TIMESTAMP,
  sensor_data VARIANT
);

-- Insert data from different sensor types
INSERT INTO sensor_readings VALUES
  (1, '2026-01-28 10:00:00'::timestamp,
   PARSE_JSON('{"sensor_id": "T001", "temp": 72.5, "unit": "F", "battery": 95}')),
  (2, '2026-01-28 10:00:05'::timestamp,
   PARSE_JSON('{"sensor_id": "M001", "motion_detected": true, "confidence": 0.95, "zone": "entrance"}')),
  (3, '2026-01-28 10:00:10'::timestamp,
   PARSE_JSON('{"sensor_id": "C001", "image_url": "s3://bucket/img_001.jpg", "objects_detected": ["person", "vehicle"]}'));

-- Query temperature sensors only
SELECT
  reading_id,
  sensor_data:sensor_id::STRING AS sensor_id,
  sensor_data:temp::FLOAT AS temperature,
  sensor_data:unit::STRING AS unit,
  sensor_data:battery::INTEGER AS battery_level
FROM sensor_readings
WHERE sensor_data:sensor_id LIKE 'T%';
```
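
When the shape of incoming readings is unknown, Spark 4.0's `schema_of_variant_agg` aggregate can summarize the structure actually present in a Variant column (a sketch; the merged schema is returned as a string):

```sql
-- Inspect the combined structure of all stored sensor payloads
SELECT schema_of_variant_agg(sensor_data) AS inferred_schema
FROM sensor_readings;
```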

---

## Conclusion

The addition of Variant to Apache Parquet represents a significant milestone in the format's evolution. By standardizing Variant as a logical type, Parquet now provides efficient storage for semi-structured data, enables meaningful statistics collection, and ensures cross-engine interoperability.

The well-documented specification has catalyzed broad ecosystem adoption, with multiple reference implementations now available across languages. This cross-language support allows Variant to be integrated into diverse data processing environments, from analytical databases to streaming platforms, making it a practical solution for handling evolving schemas in modern data architectures.

---

## Resources

- **Apache Parquet Format Specification:** https://github.com/apache/parquet-format
- **Variant Type Specification:** [Variant Logical Type](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#variant)
- **Variant Encoding Specification:** [Variant Binary Encoding](https://github.com/apache/parquet-format/blob/master/VariantEncoding.md)
- **Variant Shredding Specification:** [Variant Shredding](https://github.com/apache/parquet-format/blob/master/VariantShredding.md)
- **Community Discussions:** dev@parquet.apache.org