
[FEATURE REQUEST]: Support partitioning and bucketing of the index dataset #351

Description

@andrei-ionescu

Feature requested

In the case of very large datasets (around 34 billion records), the generated index is made up of very large files and query performance degrades.
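What the request amounts to is letting the index data be written partitioned and/or bucketed. As a purely hypothetical sketch (this is not an existing Hyperspace option, just the plain Spark writer API illustrating the desired layout; indexDf stands for the covering index content and the table name is made up):

import org.apache.spark.sql.functions.{col, to_date}

// Hypothetical layout only: partition the index by day and bucket it on the
// indexed column so a one-day predicate prunes down to a single partition
// and a handful of buckets instead of scanning 200 multi-GB files.
indexDf
  .withColumn("date", to_date(col("timestamp")))   // derived, low-cardinality partition column
  .write
  .partitionBy("date")                             // day-level pruning for the date-range predicate
  .bucketBy(200, "timestamp")                      // smaller, clustered files within each day
  .sortBy("timestamp")
  .format("parquet")
  .saveAsTable("idx_ts3_partitioned")              // bucketBy requires saveAsTable in Spark 2.4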

Given the following details...

Query

val sql = s"""
  SELECT   ts.timestamp
  FROM     ts 
  WHERE    ts.timestamp >= to_timestamp('2020-03-17')
  AND      ts.timestamp < to_timestamp('2020-03-18')
  LIMIT    1000
"""

Executed with:

spark.sql(sql).collect

Dataset

  • the schema has about 20 top-level fields, roughly 17 of which are heavily nested
  • about 34 billion rows
  • the timestamp field is of timestamp type, with second precision
  • the cardinality of the timestamp values is 17,145,000 distinct values out of 34,155,510,037 rows (see the sketch after this list)
  • the format is Iceberg
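The cardinality figure can be re-derived with a plain aggregation (a minimal sketch, assuming ts is the same dataset used in the query above; note that an exact countDistinct over 34 billion rows is itself an expensive job):

import org.apache.spark.sql.functions.{col, countDistinct, count, lit}

// Distinct second-precision timestamps vs. total row count; on this dataset it
// comes out to roughly 17,145,000 distinct values over ~34.2 billion rows.
ts.agg(
  countDistinct(col("timestamp")).as("distinct_timestamps"),
  count(lit(1)).as("total_rows")
).show(false)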

Index

hs.createIndex(
  ts, 
  IndexConfig(
    "idx_ts3", 
    indexedColumns = Seq("timestamp"), 
    includedColumns = Seq("ns", "id")))

The index has the following characteristics (see the listing sketch after the list):

  • 434GB total index size
  • 200 files
  • 2.3GB average file size
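These figures can be gathered by listing the index files directly; a minimal sketch, with the index root path abbreviated exactly as in the explain output below:

import org.apache.hadoop.fs.Path

// Illustrative only: list the index's Parquet files and aggregate their sizes.
val indexRoot = new Path("dbfs:/.../spark-warehouse/indexes/idx_ts3/v__=0")
val fs = indexRoot.getFileSystem(spark.sparkContext.hadoopConfiguration)

val dataFiles  = fs.listStatus(indexRoot).filter(_.getPath.getName.endsWith(".parquet"))
val totalBytes = dataFiles.map(_.getLen).sum
println(f"files=${dataFiles.length}, totalGB=${totalBytes / 1e9}%.1f, " +
        f"avgGB=${totalBytes / 1e9 / dataFiles.length}%.2f")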

Explained query

=============================================================
Plan with indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
   +- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
      <----+- *(1) FileScan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1) [timestamp#207] Batched: true, DataFilters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Format: Parquet, Location: InMemoryFileIndex[dbfs:/u.../spark-warehouse/indexes/idx_ts3/v__=0/part-00000-tid-451174797136..., PartitionFilters: [], PushedFilters: [IsNotNull(timestamp), GreaterThanOrEqual(timestamp,2020-03-17 00:00:00.0), LessThan(timestamp,20..., ReadSchema: struct<timestamp:timestamp>---->

=============================================================
Plan without indexes:
=============================================================
CollectLimit 1000
+- *(1) Project [timestamp#207]
   +- *(1) Filter ((isnotnull(timestamp#207) && (timestamp#207 >= 1584403200000000)) && (timestamp#207 < 1584489600000000))
      <----+- *(1) ScanV2 iceberg[timestamp#207] (Filters: [isnotnull(timestamp#207), (timestamp#207 >= 1584403200000000), (timestamp#207 < 1584489600000000)], Options: [...)---->

=============================================================
Indexes used:
=============================================================
idx_ts3:/.../spark-warehouse/indexes/idx_ts3/v__=0

=============================================================
Physical operator stats:
=============================================================
+--------------------------------------------------------+-------------------+------------------+----------+
|                                       Physical Operator|Hyperspace Disabled|Hyperspace Enabled|Difference|
+--------------------------------------------------------+-------------------+------------------+----------+
|                                       *DataSourceV2Scan|                  1|                 0|        -1|
|*Scan Hyperspace(Type: CI, Name: idx_ts3, LogVersion: 1)|                  0|                 1|         1|
|                                            CollectLimit|                  1|                 1|         0|
|                                                  Filter|                  1|                 1|         0|
|                                                 Project|                  1|                 1|         0|
|                                       WholeStageCodegen|                  1|                 1|         0|
+--------------------------------------------------------+-------------------+------------------+----------+
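For reference, the comparison above is the output of Hyperspace's explain helper; a minimal sketch, assuming the hs and sql values defined earlier:

import com.microsoft.hyperspace._

// Enable the Hyperspace optimizer rules, then print the side-by-side plans,
// the indexes used and the physical operator stats shown above.
spark.enableHyperspace()
hs.explain(spark.sql(sql), verbose = true)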

The cluster

I ran the experiment on a Databricks cluster with the following configuration:

  • driver: 64 cores, 432GB memory
  • 6 workers: 32 cores, 256GB memory each
  • Spark version 2.4.5

Results

Time to get the 1000 rows:

  • with Hyperspace is 17.24s
  • without Hyperspace is 16.86s

Acceptance criteria

The time to get the 1000 rows with Hyperspace should be at least twice as fast as without it.
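A simple way to check this criterion (a sketch; spark.time just prints the wall-clock time of the block it wraps):

import com.microsoft.hyperspace._

// Run the same query with and without the Hyperspace rules; to meet the
// criterion the indexed run should take at most half the baseline time.
spark.disableHyperspace()
spark.time(spark.sql(sql).collect())   // baseline, ~16.9s in the results above

spark.enableHyperspace()
spark.time(spark.sql(sql).collect())   // target: at most half the baseline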

Additional context

For some more context, this work was started in PR #329.
