[META] Skiplist plan for 3.x #18882

@asimmahmood1

Please describe the end goal of this project

As a follow-up to the skiplist meta issues #19384 and #17967, this issue captures the introduction of skiplist in OpenSearch 3.2.

There are a few points to consider:

  1. Should skiplist be enabled by default?
    This depends on the index size impact. Enabling skiplist on just the timestamp field shows a 0.5% increase in bytes, which is minimal (see [Feature Request] Use lucene sparse index in opensearch #17710 (comment)). Another experiment with skiplist enabled for all numeric fields in the big5 benchmark shows a 63% increase, from 22 GB to 36 GB (see Enable Skiplist Optimization for SortedNumericDocValuesField.newSlowExactQuery in OpenSearch #18751).
    Since skiplist is most useful on (mostly) sorted fields, and to provide the targeted benefit with the least indexing impact, it is enabled by default on the date field named @timestamp, or on the primary sort field.

  2. What is the performance impact?
    Some numbers, measured with synthetic data, are in [Feature Request] Use lucene sparse index in opensearch #17710. The big5 and http_logs benchmarks are in Enable Skiplist Optimization for SortedNumericDocValuesField.newSlowExactQuery in OpenSearch #18751 [Summary TBD].
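
To illustrate why skiplist helps most on (mostly) sorted fields, here is a toy sketch in Python. It is not the Lucene implementation: the `Block` type, the block size, and the helper names are all hypothetical. It only models the general idea of keeping a min/max per block of values so that blocks outside a query range can be skipped without reading their values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical block of doc values with its min/max summary."""
    min_value: int
    max_value: int
    values: List[int]


def build_blocks(values: List[int], block_size: int) -> List[Block]:
    """Split values (in doc order) into fixed-size blocks with min/max metadata."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append(Block(min(chunk), max(chunk), chunk))
    return blocks


def count_in_range(blocks: List[Block], lo: int, hi: int) -> Tuple[int, int]:
    """Return (matching values, blocks skipped) for the range lo <= v <= hi."""
    matches = 0
    skipped = 0
    for block in blocks:
        # The block summary alone proves there is no overlap: skip it.
        if block.max_value < lo or block.min_value > hi:
            skipped += 1
            continue
        matches += sum(1 for v in block.values if lo <= v <= hi)
    return matches, skipped


# Same 1000 values, once in sorted order and once scattered.
sorted_values = list(range(1000))
scattered_values = [(v * 37) % 1000 for v in range(1000)]  # a permutation of 0..999

sorted_result = count_in_range(build_blocks(sorted_values, 100), 250, 349)
scattered_result = count_in_range(build_blocks(scattered_values, 100), 250, 349)
print(sorted_result, scattered_result)
```

In this toy model the sorted layout lets most blocks be skipped, while the scattered layout skips none because every block spans nearly the whole value range. That is the intuition behind limiting the default to @timestamp or the primary sort field.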

Based on the above, the plan for 3.2 is to:

  1. ✅ Run benchmarks and find speed up in existing queries: Enable Skiplist Optimization for SortedNumericDocValuesField.newSlowExactQuery in OpenSearch #18751
  2. ✅ Add skiplist mapping option with default false: [SparseIndex] Modify FieldMappers to enable SkipList #17965

Plan for 3.3

  1. ✅ Add skip_list param for date, scaled float and token count fields #19142
  2. ✅ Adding logic for histogram aggregation using skiplist #19130
  3. ✅ DateHistogram support sub aggregation: Add sub aggregation support for histogram aggregation using skiplist #19438
  4. ✅ Enable skip_list for @timestamp field or index sort field by default #19480
  5. ✅ Fix @timestamp upgrade issue by adding a version check on skip_list param (3.3) #19671
  6. ✅ Add nyc_taxis date_histogram_calendar_interval_with_filter operation opensearch-benchmark-workloads#697
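
As a rough sketch of how the 3.3 items fit together, the Python below builds two request bodies: an index mapping that sets `skip_list` on a date field, and a date histogram with a sub aggregation over that field. The `skip_list` name comes from #19142; its exact placement in the mapping is an assumption and should be checked against the OpenSearch field type documentation. The `bytes` field, the `per_day` and `avg_metric` aggregation names, and both helper functions are hypothetical.

```python
import json


def build_index_body(timestamp_field="@timestamp", enable_skip_list=True):
    """Index creation body; `skip_list` placement on the date field is an assumption."""
    return {
        "mappings": {
            "properties": {
                timestamp_field: {"type": "date", "skip_list": enable_skip_list},
                "bytes": {"type": "long"},  # hypothetical metric field
            }
        }
    }


def build_histogram_query(timestamp_field="@timestamp", metric_field="bytes"):
    """Search body: daily date histogram with an avg sub aggregation."""
    return {
        "size": 0,  # only aggregation results are needed, no hits
        "aggs": {
            "per_day": {
                "date_histogram": {
                    "field": timestamp_field,
                    "calendar_interval": "day",
                },
                "aggs": {"avg_metric": {"avg": {"field": metric_field}}},
            }
        },
    }


print(json.dumps(build_index_body(), indent=2))
print(json.dumps(build_histogram_query(), indent=2))
```

The bodies are plain dictionaries, so they can be sent with any HTTP client or an OpenSearch client library. Passing `enable_skip_list=False` gives a baseline mapping for comparing index size and query latency.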

Plan for 3.4

  1. ✅ Combining filter rewrite and skip list to optimize sub aggregation #19573
  2. ✅ Add skip_list logic to auto date histogram, confirm with big5's range-auto-date-history-with-metrcis, range-auto-date-history #19827

Plan for 3.5

  1. Add sorted version of big5 [FEATURE] Add sorted version of big5, http_log workload opensearch-benchmark-workloads#724
  2. Enable nightly benchmark of sorted big5: Adding sorted version of big5 nightly run opensearch-build#5931
  3. Tech Blog post

Plan for 3.6

  1. Use skiplist for min and max aggregation #20174
  2. Enhance date and auto date histogram to handle multiple owning bucket ords (see min as example)
  3. Update the date and auto date histogram skip list logic to report details on docs skipped, in order to help compare performance

Follow up
