
[FEATURE REQUEST]: Support partitioning and bucketing of the index dataset #351

Description

@andrei-ionescu

Feature requested

In the case of very large datasets (around 34 billion records), the generated index is made up of very large files and query performance degrades.
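What the request amounts to is letting the index data be written partitioned and/or bucketed. As a purely hypothetical sketch (this is not an existing Hyperspace option, just the plain Spark writer API illustrating the desired layout; indexDf stands for the covering index content and the table name is made up):

import org.apache.spark.sql.functions.{col, to_date}

// Hypothetical layout only: partition the index by day and bucket it on the
// indexed column so a one-day predicate prunes down to a single partition
// and a handful of buckets instead of scanning 200 multi-GB files.
indexDf
  .withColumn("date", to_date(col("timestamp")))   // derived, low-cardinality partition column
  .write
  .partitionBy("date")                             // day-level pruning for the date-range predicate
  .bucketBy(200, "timestamp")                      // smaller, clustered files within each day
  .sortBy("timestamp")
  .format("parquet")
  .saveAsTable("idx_ts3_partitioned")              // bucketBy requires saveAsTable in Spark 2.4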

Given the following details...

Query

val sql = s"""
  SELECT   ts.timestamp
  FROM     ts 
  WHERE    ts.timestamp >= to_timestamp('2020-03-17')
  AND      ts.timestamp < to_timestamp('2020-03-18')
  LIMIT    1000
"""

Executed with:

spark.sql(sql).collect

Dataset

  • the schema has about 20 top-level fields, roughly 17 of which are heavily nested
  • about 34 billion rows
  • the timestamp field is of timestamp type, with second precision
  • the cardinality of the timestamp values is 17,145,000 distinct values out of 34,155,510,037 rows (see the sketch after this list)
  • the format is Iceberg
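The cardinality figure can be re-derived with a plain aggregation (a minimal sketch, assuming ts is the same dataset used in the query above; note that an exact countDistinct over 34 billion rows is itself an expensive job):

import org.apache.spark.sql.functions.{col, countDistinct, count, lit}

// Distinct second-precision timestamps vs. total row count; on this dataset it
// comes out to roughly 17,145,000 distinct values over ~34.2 billion rows.
ts.agg(
  countDistinct(col("timestamp")).as("distinct_timestamps"),
  count(lit(1)).as("total_rows")
).show(false)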

Index

hs.createIndex(
  ts, 
  IndexConfig(
    "idx_ts3", 
    indexedColumns = Seq("timestamp"), 
    includedColumns = Seq("ns", "id")))

The index has the following characteristics (see the listing sketch after the list):

  • 434GB total index size
  • 200 files
  • 2.3GB average file size
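These figures can be gathered by listing the index files directly; a minimal sketch, with the index root path abbreviated exactly as in the explain output below:

import org.apache.hadoop.fs.Path

// Illustrative only: list the index's Parquet files and aggregate their sizes.
val indexRoot = new Path("dbfs:/.../spark-warehouse/indexes/idx_ts3/v__=0")
val fs = indexRoot.getFileSystem(spark.sparkContext.hadoopConfiguration)

val dataFiles  = fs.listStatus(indexRoot).filter(_.getPath.getName.endsWith(".parquet"))
val totalBytes = dataFiles.map(_.getLen).sum
println(f"files=${dataFiles.length}, totalGB=${totalBytes / 1e9}%.1f, " +
        f"avgGB=${totalBytes / 1e9 / dataFiles.length}%.2f")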

Explained query

=============================================================
Plan with indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
   +- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
      <----+- *(1) FileScan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1) [timestamp#207] Batched: true, DataFilters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/u.../spark-warehouse/indexes/idx_ts3/v__=0/part-00000-tid-451174797136..., PartitionFilters: [], PushedFilters: [IsNotNull(timestamp), GreaterThanOrEqual(timestamp,2020-03-17 00:00:00.0), LessThan(timestamp,20..., ReadSchema: struct<timestamp:timestamp>---->

=============================================================
Plan without indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
   +- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
      <----+- *(1) ScanV2 iceberg[timestamp#207] (Filters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Options: [...)---->

=============================================================
Indexes used:
=============================================================
idx_ts3:/.../spark-warehouse/indexes/idx_ts3/v__=0

=============================================================
Physical operator stats:
=============================================================
+--------------------------------------------------------+-------------------+------------------+----------+
|                                       Physical Operator|Hyperspace Disabled|Hyperspace Enabled|Difference|
+--------------------------------------------------------+-------------------+------------------+----------+
|                                       *DataSourceV2Scan|                  1|                 0|        -1|
|*Scan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1)|                  0|                 1|         1|
|                                            CollectLimit|                  1|                 1|         0|
|                                                  Filter|                  1|                 1|         0|
|                                                 Project|                  1|                 1|         0|
|                                       WholeStageCodegen|                  1|                 1|         0|
+--------------------------------------------------------+-------------------+------------------+----------+
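For reference, the comparison above is the output of Hyperspace's explain helper; a minimal sketch, assuming the hs and sql values defined earlier:

import com.microsoft.hyperspace._

// Enable the Hyperspace optimizer rules, then print the side-by-side plans,
// the indexes used and the physical operator stats shown above.
spark.enableHyperspace()
hs.explain(spark.sql(sql), verbose = true)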

The cluster

I ran the experiment on a Databricks cluster with the following configuration:

  • driver: 64 cores, 432GB memory
  • 6 workers: 32 cores, 256GB memory each
  • Spark version 2.4.5

Results

Time to get the 1000 rows:

  • with Hyperspace is 17.24s
  • without Hyperspace is 16.86s

Acceptance criteria

The time to get the 1000 rows with Hyperspace should be at least twice as fast as without it.
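A simple way to check this criterion (a sketch; spark.time just prints the wall-clock time of the block it wraps):

import com.microsoft.hyperspace._

// Run the same query with and without the Hyperspace rules; to meet the
// criterion the indexed run should take at most half the baseline time.
spark.disableHyperspace()
spark.time(spark.sql(sql).collect())   // baseline, ~16.9s in the results above

spark.enableHyperspace()
spark.time(spark.sql(sql).collect())   // target: at most half the baseline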

Additional context

For some more context, this work was started in PR #329.
