Skip to content

[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function#11651

Open
Yizhou-Yang wants to merge 17 commits intoapache:mainfrom
Yizhou-Yang:percentile0225
Open

[GLUTEN-4889][VL] feat: Support approx_percentile aggregate function#11651
Yizhou-Yang wants to merge 17 commits intoapache:mainfrom
Yizhou-Yang:percentile0225

Conversation

@Yizhou-Yang
Copy link

@Yizhou-Yang Yizhou-Yang commented Feb 25, 2026

What

Add Velox approx_percentile support for Spark.

Why

Velox uses KLL sketch while Spark uses GK algorithm — their intermediate data formats are incompatible (KLL: 9-field StructType vs GK: single BinaryType buffer). This means fallback between Velox and Spark requires separate handling.

How

  • VeloxApproximatePercentile: A DeclarativeAggregate with 9 aggBufferAttributes matching Velox's KLL sketch layout.
  • Spark-side KLL implementation (KllSketchHelper/KllSketchAdd/KllSketchMerge/KllSketchEval): Simplified KLL operations for fallback, binary-compatible with Velox's C++ accumulator.
  • ApproxPercentileRewriteRule: Rewrites Spark's ApproximatePercentile to the Velox-compatible version.
  • All 4 fallback modes supported: Full offload, partial fallback, final fallback, full fallback.

Key decisions

  • Accuracy stored as IntegerType (Spark's original value); Velox computes epsilon = 1.0/accuracy internally.
  • KLL chosen over GK for Spark-side fallback to maintain intermediate data compatibility with Velox.

Velox dependency

facebookincubator/velox#16320


Related issue: #4889

@github-actions github-actions bot added CORE works for Gluten Core VELOX labels Feb 25, 2026
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@Yizhou-Yang Yizhou-Yang changed the title feat:support gluten-level approx_percentile [GLUTEN-4889][VL] feat:support gluten-level approx_percentile Feb 25, 2026
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@jinchengchenghh
Copy link
Contributor

Please update get-velox.sh to test your PR, then you can verify if both can work well, you may update this line

UPSTREAM_VELOX_PR_ID=""

@jinchengchenghh
Copy link
Contributor

Do we need the config? Usually we offload the function to native by default

@Yizhou-Yang
Copy link
Author

Please update get-velox.sh to test your PR, then you can verify if both can work well, you may update this line

UPSTREAM_VELOX_PR_ID=""

added the 16320 and removed the config

@github-actions github-actions bot added BUILD and removed CORE works for Gluten Core labels Feb 25, 2026
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions github-actions bot added the CORE works for Gluten Core label Mar 2, 2026
@github-actions
Copy link

github-actions bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions
Copy link

github-actions bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

@github-actions github-actions bot removed the CORE works for Gluten Core label Mar 2, 2026
@github-actions
Copy link

github-actions bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

5 similar comments
@github-actions
Copy link

github-actions bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Mar 2, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Mar 3, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Mar 3, 2026

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

github-actions bot commented Mar 3, 2026

Run Gluten Clickhouse CI on x86

@jinchengchenghh
Copy link
Contributor

Please update the PR description to describe the KLL Sketch is different so that we handle fallback separately.

@Yizhou-Yang
Copy link
Author

Yizhou-Yang commented Mar 3, 2026

Please update the PR description to describe the KLL Sketch is different so that we handle fallback separately.

done~

@github-actions
Copy link

github-actions bot commented Mar 6, 2026

Run Gluten Clickhouse CI on x86

@jinchengchenghh
Copy link
Contributor

Copy link
Contributor

@jinchengchenghh jinchengchenghh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most is good, after we resolve CH CI, we can merge it.


const std::unordered_set<std::string> kBlackList =
{"split_part", "sequence", "approx_percentile", "map_from_arrays"};
{"split_part", "sequence", "map_from_arrays"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find also the map_from_arrays in black list, another topic, would you like to implement it in Gluten?

Copy link
Author

@Yizhou-Yang Yizhou-Yang Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, I will look into it. @jinchengchenghh
facebookincubator/velox#13183
I think that the velox part is compatible with spark as of 2025/05, it's just for some reason the velox fix is not adopted in gluten. Right now I think the only gluten fix needed is to remove the kblacklist and test it...

I'll do a comparison with spark, post the results and add appropriate ut.

@jinchengchenghh jinchengchenghh changed the title [GLUTEN-4889][VL] feat:support gluten-level approx_percentile [GLUTEN-4889][VL] feat: Support approx_percentile aggregate function Mar 6, 2026
@Yizhou-Yang
Copy link
Author

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions github-actions bot removed the BUILD label Mar 10, 2026
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

…ministic across Spark and Velox implementations
@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@github-actions
Copy link

Run Gluten Clickhouse CI on x86

@Yizhou-Yang
Copy link
Author

Yizhou-Yang commented Mar 10, 2026

https://github.com/apache/incubator-gluten/actions/runs/22893811220/job/66432061435?pr=11651
There was a test failure in some x86 environment non-slow modes. Fixed, but the pipeline might need restart...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants