[PROPOSAL]: Data Skipping Indexes #441
Description
Problem Statement
Add support for data skipping indexes.
Background and Motivation
So far, Hyperspace has supported only hash-partitioned covering indexes. Covering indexes work well for some workloads, but we should offer more index types to facilitate a broader range of use cases.
Proposed Solution
Refactor the existing code by moving covering-index-specific code into a single implementation of a new Index API, and then implement data skipping indexes on top of that API.
Alternatives
The proposed solution requires heavy refactoring, as there is a lot of code specific to covering indexes scattered throughout the codebase. Alternatively, we could build DataSkippingIndex without introducing a new API and write many if-elses depending on the index type. That might be the quickest way to implement the required features for now, but it would make the code hard to maintain in the long run.
Known/Potential Compatibility Issues
Regarding data compatibility, the new version of Hyperspace will very likely be incompatible with indexes created by older versions. This is unfortunate but can be excused as Hyperspace is still in version zero. If a strong need arises, it can be mitigated by a migration utility designed separately.
The API compatibility could be retained if we introduce new optional arguments to IndexConfig. However, includedColumns doesn't make sense for data skipping indexes, and the newly added arguments won't make sense for covering indexes. Therefore, it might be wise to break compatibility now and keep things clean for the future.
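To make the awkwardness concrete, one possible direction is to give each index kind its own config. The class and field names below (CoveringIndexConfig, DataSkippingIndexConfig, MinMaxSketch, BloomFilterSketch) are only an illustrative sketch, not a settled API:

```scala
// Hypothetical shape only; names and fields are illustrative, not a settled API.
sealed trait Sketch
case class MinMaxSketch(column: String) extends Sketch
case class BloomFilterSketch(column: String, expectedDistinctCount: Long) extends Sketch

// Covering indexes need includedColumns; data skipping indexes need sketches.
// Keeping both in a single IndexConfig would leave each index kind with
// arguments that are meaningless for the other.
case class CoveringIndexConfig(
    indexName: String,
    indexedColumns: Seq[String],
    includedColumns: Seq[String])

case class DataSkippingIndexConfig(
    indexName: String,
    sketches: Seq[Sketch])
```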
Design
Data Skipping Index Basics
Data skipping indexes filter out files that cannot contain the desired values in the indexed columns, based on statistics or sketches computed over the values in each file. Examples of such statistics or sketches include min/max values and bloom filters.
Suppose we have the sample data below: a relation df with two columns, a and b, consisting of two files, /data/p0 and /data/p1:
/data/p0:
| a | b |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 6 | 6 |
/data/p1:
| a | b |
|---|---|
| 5 | 10 |
| 10 | 10 |
A min-max data skipping index on column a will look like this:
| fileid | min | max |
|---|---|---|
| 0 | 1 | 6 |
| 1 | 5 | 10 |
The file ID column is used to map files to unique identifiers. Identifiers are created based on the file path, size, and modification time.
Then, when a query like df.filter($"a" < 4) is given, we know that we can skip the /data/p1 file.
Note that the real implementation will also store file size and modification time columns, so that files with the same name but different sizes/modification times can be distinguished; this is needed to support deleted/appended source files.
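As a rough illustration only (not the proposed implementation), the min-max index data and the file pruning step could be expressed in plain Spark as follows, where df is the relation from the example and spark is the active SparkSession:

```scala
import org.apache.spark.sql.functions._

// Tag each row with its source file, then compute per-file min/max of "a".
// The real index would key rows by a file ID derived from the path, size,
// and modification time rather than by the raw path.
val indexData = df
  .withColumn("file", input_file_name())
  .groupBy("file")
  .agg(min("a").as("min_a"), max("a").as("max_a"))

// For a filter like df.filter($"a" < 4), keep only files whose min can match.
val candidateFiles = indexData
  .filter(col("min_a") < 4)
  .select("file")
  .collect()
  .map(_.getString(0))
// candidateFiles would contain /data/p0 but not /data/p1.
```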
Storing Index Data
The index data can be stored as Parquet files, as is done for covering indexes. The difference here is that we don't need bucketing. Also, the data is much smaller because there is only a single row for each source file.
To speed up range queries like the last example, it might be necessary to cache the data in main memory as sequences sorted by min and max. This should be cheap to compute as the data size is expected to be small.
If there are too many files, we might need to consider structures like a B-tree to access only the relevant files. But this might not be a real issue; the index data should be orders of magnitude smaller than the source data.
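A minimal sketch of the storage and caching described above, assuming indexData has the (file, min_a, max_a) shape from the previous example; the index path is hypothetical:

```scala
// No bucketing is needed; the index is a plain parquet dataset with one row
// per source file, so it is much smaller than the source data.
indexData.write.mode("overwrite").parquet("/indexes/dataskipping/v0")

// At query time, read it back and keep an in-memory copy ordered by min so
// that range predicates can narrow down candidate files quickly. Sorting and
// caching should be cheap because the index data is small.
val cachedIndex = spark.read
  .parquet("/indexes/dataskipping/v0")
  .orderBy("min_a")
  .cache()
```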
Index Application
Data skipping indexes can be applied in three steps:
- Extract constraints about key columns.
- Filter out irrelevant files with the constraints and index data.
- Rewrite the logical relations with the remaining files.
Unlike covering indexes, multiple different data skipping indexes can be applied to the same relation.
Examples of constraints include a < 4, a = 10, or a in (1, 2, 3). Combined with statistics or sketches like min/max values and/or bloom filters, this information can be used to weed out irrelevant files.
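For a MinMaxSketch on column a, the translation from a source-side constraint to a predicate over the index data is mechanical. A hedged sketch, reusing indexData and the index columns (min_a, max_a) from the earlier example:

```scala
import org.apache.spark.sql.functions.col

// Each source-side constraint maps to a predicate over the index data; a file
// whose row fails the predicate cannot contain a matching value and is skipped.
val predForLessThan4 = col("min_a") < 4                           // a < 4
val predForEquals10  = col("min_a") <= 10 && col("max_a") >= 10   // a = 10
val predForInList    = col("min_a") <= 3 && col("max_a") >= 1     // a in (1, 2, 3)

// Step 2 of the application: filter out irrelevant files.
val remainingFiles = indexData.filter(predForLessThan4).select("file")
```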
Theoretically, applying data skipping indexes to joins is possible, but the implementation would be complex, and the benefits over a traditional broadcast hash join are unclear, especially when the indexed columns have a low correlation with the partitioning columns. Therefore, we will focus on filters for now.
Hybrid scan is trivial and requires no special handling for data skipping indexes, as we only filter out irrelevant files and the remaining files are not touched.
The estimated cost/benefit of applying data skipping indexes - TODO
Index Maintenance
Given a dataframe (a single relation consisting of files), an index name, one or more columns to index, and a statistics/sketch type (min/max, bloom filter, etc.), we can create a data skipping index either by scanning every row in the files and building the specified statistics, or by looking up statistics already present in the files.
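From the user's perspective, index creation could look roughly as follows, reusing the hypothetical DataSkippingIndexConfig and sketch classes sketched in the compatibility section; this is a sketch of a possible surface, not an existing Hyperspace API:

```scala
// Hypothetical usage only; the config and sketch classes are the illustrative
// ones from the compatibility section, not a finalized API.
val hs = new com.microsoft.hyperspace.Hyperspace(spark)

// Build one min/max sketch on "a" and one bloom filter sketch on "b",
// either by scanning the rows or by reading statistics already in the files.
hs.createIndex(
  df,
  DataSkippingIndexConfig(
    indexName = "myDataSkippingIndex",
    sketches = Seq(MinMaxSketch("a"), BloomFilterSketch("b", expectedDistinctCount = 100))))
```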
Incremental refresh is possible and, in fact, is not much different from a full refresh. During refresh, we drop rows for deleted files from the index data and add rows for appended files. If there are no deleted files, the incremental refresh can store the new rows in new files without touching the old ones. A full refresh can then merge such files later.
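A hedged sketch of the refresh logic described above, assuming the index rows carry a file column and that deletedFiles and appendedFiles have been computed by diffing the current source file listing against the one recorded at index creation:

```scala
import org.apache.spark.sql.functions._

// Drop index rows for files that no longer exist in the source...
val survivingRows = oldIndexData.filter(!col("file").isin(deletedFiles: _*))

// ...and build rows only for the newly appended files.
val newRows = spark.read.parquet(appendedFiles: _*)
  .withColumn("file", input_file_name())
  .groupBy("file")
  .agg(min("a").as("min_a"), max("a").as("max_a"))

// If nothing was deleted, newRows can simply be written as additional index
// files without touching the old ones; a later full refresh can merge them.
val refreshedIndex = survivingRows.union(newRows)
```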
Implementation
PRs
- Refactoring for an extensible Index API #443
- Data Skipping Index Part 1: Refactoring #481
- Data Skipping Index Part 2: Basics #461
- Data Skipping Index Part 3-1: Utils #491
- Data Skipping Index Part 3-2: Rule #482
- Data Skipping Index Part 4: BloomFilterSketch #483
- Data Skipping Index Part 5: ValueListSketch #486
TODO
- String type support enhancement
- Implement sketch types specialized for string typed data
- Support predicates useful for strings such as LIKE
- BloomFilterSketch user experience enhancement
- Allow users to omit the expected number of distinct values. This can be done with an additional query using approx_count_distinct (see the sketch at the end of this section). The current implementation doesn't allow setting a different number of distinct values for each file, so we should determine which value to use, such as the mean, median, or n-th percentile of the per-file distinct counts.
- Allow users to specify the target index size, absolute or relative to the source data size, to determine the false positive probability automatically.
- Supportability features
- Provide an index effectiveness score. The rationale is that data skipping indexes are not suitable for all kinds of source data (layouts). For example, if a low-cardinality column's distribution is uncorrelated with row position (i.e. the data is randomly distributed across files), it is impossible to filter out files because every file contains every possible value. If a high-cardinality column (extreme case: a unique column) has a random distribution, then MinMaxSketch will be ineffective and only sketches like BloomFilterSketch will be useful. We could provide a score, for example in the range 0-100, where 0 is for absolutely useless indexes and 100 is for fantastic ones. For instance, if every file's min/max values are very similar to the min/max values of the entire column, the index will be ineffective and its score will be close to 0.
- Dynamic file pruning
- Similar to dynamic partition pruning, we can implement dynamic file pruning with the data skipping index. To do this, we can copy Spark's existing dynamic partition pruning optimization rule and modify it to do file pruning instead of partition pruning.
- More sketch types
- Check if Spark gives us the partition-pruned file lists when we apply data skipping indexes. If it doesn't, then we should do partition pruning ourselves prior to filtering out files to speed up the process. (done)
- Complete FilterRankFilter so that either the framework supports applying multiple indexes to the same plan node, or the rule returns a single index with a proper ranking scheme.
- Consider adding a sketch type such as "HasNullSketch", which would roughly correspond to the null count in Parquet statistics. Allowing MinMaxSketch and other sketches to read the value of HasNullSketch would help them build better index predicates.
- Support more types for BloomFilterSketch.
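For the BloomFilterSketch item above (omitting the expected number of distinct values), a hedged sketch of the extra query with approx_count_distinct and a per-file median, assuming the sketch is on column b:

```scala
import org.apache.spark.sql.functions._

// Approximate the distinct count of "b" per source file, then pick the median
// across files as the single value used to size every bloom filter.
val perFileNdv = df
  .withColumn("file", input_file_name())
  .groupBy("file")
  .agg(approx_count_distinct("b").as("ndv"))

val Array(medianNdv) = perFileNdv.stat.approxQuantile("ndv", Array(0.5), 0.01)
```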