[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function #11651

Open
Yizhou-Yang wants to merge 17 commits into apache:main from Yizhou-Yang:percentile0225

Yizhou-Yang commented Feb 25, 2026

What

Add Velox approx_percentile support for Spark.

Why

Velox uses a KLL sketch while Spark uses the GK (Greenwald-Khanna) algorithm, so their intermediate data formats are incompatible (KLL: 9-field StructType; GK: single BinaryType buffer). Falling back between Velox and Spark therefore requires separate handling.

How

  • VeloxApproximatePercentile: A DeclarativeAggregate with 9 aggBufferAttributes matching Velox's KLL sketch layout.
  • Spark-side KLL implementation (KllSketchHelper/KllSketchAdd/KllSketchMerge/KllSketchEval): Simplified KLL operations for fallback, binary-compatible with Velox's C++ accumulator.
  • ApproxPercentileRewriteRule: Rewrites Spark's ApproximatePercentile to the Velox-compatible version.
  • All four execution modes are supported: full offload, partial fallback, final fallback, and full fallback.

Key decisions

  • Accuracy stored as IntegerType (Spark's original value); Velox computes epsilon = 1.0/accuracy internally.
  • KLL chosen over GK for Spark-side fallback to maintain intermediate data compatibility with Velox.
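
As a worked illustration of the first decision, the conversion from Spark's integer accuracy to a relative error can be sketched as follows. This is a minimal Python sketch, not code from this PR: `epsilon_from_accuracy` and `max_rank_error` are hypothetical names, and 10000 is Spark's documented default accuracy for approx_percentile.

```python
def epsilon_from_accuracy(accuracy: int) -> float:
    """Convert Spark's positive integer accuracy into a relative error."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


def max_rank_error(accuracy: int, row_count: int) -> float:
    """Rank error budget implied by the accuracy for row_count input rows."""
    return epsilon_from_accuracy(accuracy) * row_count


# Assumption: 10000 is Spark's default accuracy for approx_percentile.
DEFAULT_ACCURACY = 10000

if __name__ == "__main__":
    # With one million rows the budget is about 100 ranks.
    print(epsilon_from_accuracy(DEFAULT_ACCURACY))
    print(max_rank_error(DEFAULT_ACCURACY, 1_000_000))
```

Keeping the integer in the buffer and deriving epsilon on the native side means both engines see the same user-supplied value, with the floating-point division happening only where it is needed.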

Velox dependency

facebookincubator/velox#16320


Related issue: #4889

github-actions bot added the CORE (works for Gluten Core) and VELOX labels Feb 25, 2026
@github-actions

Run Gluten Clickhouse CI on x86

Yizhou-Yang changed the title from "feat:support gluten-level approx_percentile" to "[GLUTEN-4889][VL] feat:support gluten-level approx_percentile" Feb 25, 2026

@jinchengchenghh
Contributor

Please update get-velox.sh to test your PR so that you can verify that both changes work together. You may update this line:

UPSTREAM_VELOX_PR_ID=""

@jinchengchenghh
Contributor

Do we need the config? We usually offload the function to native by default.

@Yizhou-Yang
Author

Please update get-velox.sh to test your PR so that you can verify that both changes work together. You may update this line:

https://github.com/apache/incubator-gluten/blob/5d3f7145cd7fc258aa10b434ea4ec651bd82c764/ep/build-velox/src/get-velox.sh#L28

Added 16320 as the upstream Velox PR ID and removed the config.

github-actions bot added the BUILD label and removed the CORE (works for Gluten Core) label Feb 25, 2026

github-actions bot added the CORE (works for Gluten Core) label Mar 2, 2026

github-actions bot removed the CORE (works for Gluten Core) label Mar 2, 2026

@jinchengchenghh
Contributor

Please update the PR description to explain that the KLL sketch format is different, which is why we handle fallback separately.

@Yizhou-Yang
Author

Yizhou-Yang commented Mar 3, 2026

Please update the PR description to explain that the KLL sketch format is different, which is why we handle fallback separately.

done~

jinchengchenghh changed the title from "[GLUTEN-4889][VL] feat:support gluten-level approx_percentile" to "[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function" Mar 6, 2026
@Yizhou-Yang
Author

Run Gluten Clickhouse CI on x86

github-actions bot removed the BUILD label Mar 10, 2026

…ministic across Spark and Velox implementations

@Yizhou-Yang
Author

Yizhou-Yang commented Mar 10, 2026

There are test failures in the x86 environment in non-slow modes.

Edit:
The problem isn't that simple. It is related to GK vs. KLL, so I might need to exclude or rewrite tests.
I will try to keep the original tests, but an approximate algorithm sometimes does not give a definitive result.
I need more time on this.
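
One way to keep tests meaningful when GK and KLL return different estimates is to assert on rank error rather than on exact values. The following is a minimal Python sketch under that assumption; `within_rank_error` is a hypothetical helper and is not part of the Gluten or Spark test suites.

```python
from bisect import bisect_left, bisect_right


def within_rank_error(sorted_values, estimate, percentile, epsilon):
    """Return True if `estimate` is an acceptable answer for `percentile`.

    The estimate is accepted when the target rank percentile * n lies
    within epsilon * n of the range of ranks the estimate occupies in
    the sorted input. `sorted_values` must already be sorted ascending.
    """
    n = len(sorted_values)
    below = bisect_left(sorted_values, estimate)       # items strictly smaller
    at_or_below = bisect_right(sorted_values, estimate)  # items smaller or equal
    target = percentile * n
    slack = epsilon * n
    return below - slack <= target <= at_or_below + slack


if __name__ == "__main__":
    data = list(range(1, 1001))
    # Two different estimates of the median can both pass at epsilon = 0.01.
    print(within_rank_error(data, 500, 0.5, 0.01))
    print(within_rank_error(data, 505, 0.5, 0.01))
    # An estimate that is 20 ranks away fails at the same tolerance.
    print(within_rank_error(data, 520, 0.5, 0.01))
```

A check of this shape accepts any answer inside the error budget, so it does not depend on which of the two algorithms produced the estimate.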

@jiangjiangtian
Contributor

@Yizhou-Yang Thanks for your implementation! I have been cherry-picking your PR recently and found that the KLL implementation in Gluten has only one level and discards the items in odd positions when compacting. I am wondering whether this implementation can meet the accuracy requirement. Will the relative error be too high?

@Yizhou-Yang
Author

Yizhou-Yang commented Mar 11, 2026

@Yizhou-Yang Thanks for your implementation! I have been cherry-picking your PR recently and found that the KLL implementation in Gluten has only one level and discards the items in odd positions when compacting. I am wondering whether this implementation can meet the accuracy requirement. Will the relative error be too high?

Yes, this is part of the reason for the UT failures. The Gluten version of KLL was written rather hastily and was primarily concerned with simplicity and fallback compatibility with Velox. It now appears to have caused problems, including but not limited to UT failures and many off-by-one issues. I am revising this implementation and doing a line-by-line comparison with the Velox one to ensure the quality of the Gluten side, and the unit tests need to be enhanced as well. Even after all this, some existing UTs might still need to be excluded.
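
For readers following the accuracy question, the multi-level compaction step of a KLL-style sketch can be illustrated as below. This is a simplified Python sketch and not the Gluten or Velox code: `TinyKll` and `compact` are hypothetical names, and every level uses the same capacity, whereas KLL as published shrinks the capacity of lower levels geometrically. It shows the two properties raised in the review: items promoted to level h carry weight 2^h, and each compaction keeps every other item starting from a random offset. With a fixed offset the rank error of each compaction is always in the same direction, so the errors add up instead of tending to cancel.

```python
import random


def compact(level_items, rng):
    """Sort a full level and keep every other item from a random offset."""
    ordered = sorted(level_items)
    offset = rng.randrange(2)  # 0 or 1, chosen independently per compaction
    return ordered[offset::2]


class TinyKll:
    """Simplified multi-level sketch; items at level h have weight 2**h."""

    def __init__(self, capacity=64, seed=0):
        if capacity < 2 or capacity % 2:
            raise ValueError("capacity must be an even number >= 2")
        self.capacity = capacity
        self.levels = [[]]
        self.rng = random.Random(seed)

    def add(self, value):
        self.levels[0].append(value)
        h = 0
        # Cascade: a full level is halved and promoted to the next level.
        while len(self.levels[h]) >= self.capacity:
            if h + 1 == len(self.levels):
                self.levels.append([])
            self.levels[h + 1].extend(compact(self.levels[h], self.rng))
            self.levels[h] = []
            h += 1

    def total_weight(self):
        return sum(len(items) << h for h, items in enumerate(self.levels))

    def quantile(self, q):
        weighted = sorted(
            (value, 1 << h)
            for h, items in enumerate(self.levels)
            for value in items
        )
        target = q * self.total_weight()
        running = 0
        for value, weight in weighted:
            running += weight
            if running >= target:
                return value
        return weighted[-1][0]


if __name__ == "__main__":
    values = list(range(1, 1001))
    random.Random(7).shuffle(values)
    sketch = TinyKll(capacity=64, seed=1)
    for v in values:
        sketch.add(v)
    print(sketch.total_weight(), sketch.quantile(0.5))
```

Because the capacity is even, each compaction halves the item count while doubling the weight, so the total weight always equals the number of values added.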
