[FEATURE REQUEST]: Support partitioning and bucketing of the index dataset #351
Labels
enhancement (New feature or request), untriaged (This is the default tag for a newly created issue)
Description
Feature requested
For very large datasets (34 billion records in this case), the generated index consists of a small number of very big files, and query performance degrades.
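To make the request concrete, here is a minimal sketch of the kind of layout being asked for, expressed with the plain Spark DataFrameWriter API. This is not an existing Hyperspace capability; idxDf, the ts_date column, and the table name are assumptions for illustration only.

import org.apache.spark.sql.functions.{col, to_date}

// Hypothetical: materialize the covering-index columns ourselves, then write
// them partitioned by a date derived from the indexed timestamp and bucketed
// into smaller files. Hyperspace does not expose this today.
val idxDf = ts.select("timestamp", "ns", "id")

idxDf
  .withColumn("ts_date", to_date(col("timestamp"))) // coarse partition key
  .write
  .partitionBy("ts_date")     // date-range filters can prune whole partitions
  .bucketBy(200, "timestamp") // spread rows across smaller bucket files
  .sortBy("timestamp")
  .format("parquet")
  .saveAsTable("idx_ts3_partitioned") // bucketBy requires saveAsTable

With a layout like this, the query's date-range filter could skip every partition except 2020-03-17 instead of scanning the full index.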
Given the following details...
Query
val sql = s"""
SELECT ts.timestamp
FROM ts
WHERE ts.timestamp >= to_timestamp('2020-03-17')
AND ts.timestamp < to_timestamp('2020-03-18')
LIMIT 1000
"""
Executed with:
spark.sql(sql).collect
Dataset
- the schema has about 20 top-level fields, and about 17 of these are heavily nested
- about 34 billion rows
- the timestamp field is of timestamp type, with precision up to seconds
- the cardinality of the timestamp values is 17,145,000 out of 34,155,510,037
- the format is Iceberg
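For reference, the cardinality figures above can be obtained with a query along these lines (ts is assumed to be the DataFrame backing the ts table):

import org.apache.spark.sql.functions.{count, countDistinct}

// Distinct timestamp values vs. total row count for the dataset.
ts.select(
  countDistinct("timestamp").as("distinct_ts"), // ~17,145,000
  count("*").as("total_rows")                   // ~34,155,510,037
).show()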
Index
import com.microsoft.hyperspace.Hyperspace
import com.microsoft.hyperspace.index.IndexConfig

val hs = new Hyperspace(spark)
hs.createIndex(
  ts,
  IndexConfig(
    "idx_ts3",
    indexedColumns = Seq("timestamp"),
    includedColumns = Seq("ns", "id")))
The index has:
- 434 GB total index size
- 200 files
- 2.3 GB average file size
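As an aside, the 200 files match what I believe is Hyperspace's default bucket count, so the file count can likely be raised today to get smaller files. The config key below is my recollection and may differ across versions, and in any case bucketing on timestamp alone gives no partition pruning for the date-range filter:

// Assumed workaround, not a fix: more buckets means smaller index files.
// "spark.hyperspace.index.numBuckets" is my recollection of the setting name.
spark.conf.set("spark.hyperspace.index.numBuckets", "2000")
hs.createIndex(
  ts,
  IndexConfig(
    "idx_ts3_more_buckets", // hypothetical index name
    indexedColumns = Seq("timestamp"),
    includedColumns = Seq("ns", "id")))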
Explained query
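The comparison below was presumably produced with Hyperspace's explain API, along the lines of:

// Print the with/without-index plan comparison and operator stats.
hs.explain(spark.sql(sql), verbose = true)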
=============================================================
Plan with indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
+- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
<----+- *(1) FileScan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1) [timestamp#207] Batched: true, DataFilters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/u.../spark-warehouse/indexes/idx_ts3/v__=0/part-00000-tid-451174797136..., PartitionFilters: [], PushedFilters: [IsNotNull(timestamp), GreaterThanOrEqual(timestamp,2020-03-17 00:00:00.0), LessThan(timestamp,20..., ReadSchema: struct<timestamp:timestamp>---->
=============================================================
Plan without indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
+- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
<----+- *(1) ScanV2 iceberg[timestamp#207] (Filters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Options: [...)---->
=============================================================
Indexes used:
=============================================================
idx_ts3:/.../spark-warehouse/indexes/idx_ts3/v__=0
=============================================================
Physical operator stats:
=============================================================
+--------------------------------------------------------+-------------------+------------------+----------+
| Physical Operator|Hyperspace Disabled|Hyperspace Enabled|Difference|
+--------------------------------------------------------+-------------------+------------------+----------+
| *DataSourceV2Scan| 1| 0| -1|
|*Scan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1)| 0| 1| 1|
| CollectLimit| 1| 1| 0|
| Filter| 1| 1| 0|
| Project| 1| 1| 0|
| WholeStageCodegen| 1| 1| 0|
+--------------------------------------------------------+-------------------+------------------+----------+
The cluster
I ran the experiment on a Databricks cluster with the following details:
- driver: 64 cores, 432 GB memory
- 6 workers: 32 cores, 256 GB memory each
- Spark version 2.4.5
Results
Time to get the 1000 rows:
- with Hyperspace: 17.24 s
- without Hyperspace: 16.86 s
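The two timings presumably come from running the same query with Hyperspace toggled on and off, e.g.:

import com.microsoft.hyperspace._

spark.enableHyperspace()             // plan is rewritten to scan idx_ts3
spark.time(spark.sql(sql).collect()) // ~17.24 s in the run above

spark.disableHyperspace()            // plan falls back to the Iceberg scan
spark.time(spark.sql(sql).collect()) // ~16.86 s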
Acceptance criteria
Fetching the 1000 rows with Hyperspace enabled should be at least twice as fast as without it.
Additional context
For some more context, this work was started in PR #329.