[AutoSparkUT] Fix LIKE with invalid escape pattern to match CPU behavior#14388
wjxiz1992 wants to merge 3 commits into NVIDIA:main
Conversation
Pull request overview
This PR aligns GPU LIKE ... ESCAPE ... behavior with Spark CPU semantics by validating escape patterns and avoiding GPU execution for invalid patterns that should raise an AnalysisException.
Changes:
- Add GPU planning-time validation for `LIKE` escape patterns and fall back to CPU on invalid patterns.
- Add executor-side defensive validation in `GpuLike.doColumnar` as a safety net.
- Re-enable the previously excluded Spark 3.3.0 unit test for SPARK-33677 in `RapidsTestSettings`.
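The planning-time validation described above boils down to a single scan over the literal pattern. A minimal Python sketch of that logic follows; the helper name `find_invalid_escape` is hypothetical (the real check lives in the Scala `tagExprForGpu` override), but the two invalid cases it flags are the ones this PR targets:

```python
def find_invalid_escape(pattern: str, escape_char: str):
    """Return a description of the first invalid escape use in a SQL LIKE
    pattern, or None if the pattern is valid. Per Spark's rule, only '_',
    '%', or the escape character itself may be escaped. Illustrative
    sketch only, not the plugin's actual code."""
    i = 0
    n = len(pattern)
    while i < n:
        if pattern[i] == escape_char:
            # Case 1: escape char at the end of the pattern (nothing to escape)
            if i + 1 >= n:
                return "escape character at end of pattern"
            nxt = pattern[i + 1]
            # Case 2: escape char followed by a non-escapable character
            if nxt not in ("_", "%", escape_char):
                return f"invalid escape of {nxt!r}"
            i += 2  # skip the escaped character
        else:
            i += 1
    return None

# The pattern from this PR: '%' is the escape char, but '@' may not be escaped
print(find_invalid_escape("m%@ca", "%"))   # flagged as invalid
print(find_invalid_escape("m%%ca", "%"))   # None: '%' escaping itself is valid
```

When the scan reports a problem, the expression is tagged with `willNotWorkOnGpu`, so planning falls back to the CPU, which raises the expected `AnalysisException`.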
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Removes the SPARK-33677 exclusion so the test runs again. |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala | Adds one-time runtime validation of the LIKE pattern escape semantics inside GpuLike. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala | Adds tagExprForGpu validation for Like to detect invalid escape patterns and force CPU fallback. |
Greptile Summary

This PR fixes GPU `LIKE ... ESCAPE` handling of invalid escape patterns to match CPU behavior. Key changes:
Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LIKE expression reaches GPU planner] --> B{a.right is Literal UTF8String?}
    B -- No --> E[case _ : no validation, ExprChecks already rejects non-literals]
    B -- Yes --> C[Iterate pattern chars in tagExprForGpu]
    C --> D{Current char == escapeChar?}
    D -- No --> F[Advance i, continue]
    F --> D
    D -- Yes --> G{Next char exists?}
    G -- "No: trailing escape" --> H[willNotWorkOnGpu → CPU fallback → AnalysisException]
    G -- Yes --> I{Next char in '_', '%', escapeChar?}
    I -- "No: invalid escape" --> H
    I -- "Yes: valid escape" --> J[Advance i by 2, continue]
    J --> D
    C --> K[All chars valid → convertToGpu → GpuLike.doColumnar → cuDF like]
```
…ior (NVIDIA#14117)

Validate LIKE escape patterns on GPU to match CPU semantics. Previously, GpuLike passed patterns directly to cuDF, which silently accepted invalid escape sequences (e.g. `LIKE 'm%@ca' ESCAPE '%'`), returning wrong results instead of throwing AnalysisException like the CPU does.

- Add tagExprForGpu in BinaryExprMeta[Like] to detect invalid escape patterns and fall back to CPU, which throws the correct exception.
- Add defensive validation in GpuLike.doColumnar as a safety net.
- Remove the SPARK-33677 exclusion from RapidsTestSettings.

Closes NVIDIA#14117

Maven validation: `mvn test -pl tests -Dbuildver=330 -DwildcardSuites=...RapidsSQLQuerySuite`
Tests: succeeded 215, failed 0, ignored 18

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
- Use NonFatal(e) instead of catching Exception in tagExprForGpu to avoid swallowing fatal errors; include e.getMessage for diagnostics.
- Check l.value.isInstanceOf[UTF8String] instead of l.dataType == StringType to correctly handle CharType/VarcharType, whose values are UTF8String at runtime.
- Use rhs.getValue.toString in GpuLike.doColumnar to safely handle both UTF8String- and String-backed scalars.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Force-pushed from d4191a3 to 0e1b4ad
revans2 left a comment

What is the performance impact of this change? All PRs are supposed to include a performance impact assessment unless it is something like a doc- or comment-only change.
```scala
    if l.value.isInstanceOf[UTF8String] =>
  StringUtils.escapeLikeRegex(
    l.value.toString, a.escapeChar)
case _ =>
```
So it is okay if we get a literal that is not a string? Or is this more that constant folding was disabled?

We can remove this path (`case _ =>`) because "search" only accepts a literal; it cannot be any other type:

```scala
("search", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING))
```

Nice catch, updated to use a pattern match for it.
```scala
a.right match {
  case l: Literal
      if l.value.isInstanceOf[UTF8String] =>
    StringUtils.escapeLikeRegex(
```
This does a lot of work.

Can we please look at what it would take to validate this in a faster way? The main reason I ask is that we want to be able to support cuDF's like with a columnar pattern, and if we don't understand the pattern, validating it on the GPU with the full column is going to be a pain.

Agree. It is now replaced with lightweight logic that checks the two invalid cases defined in Spark's Like expression doc ("It is invalid to escape any other character") and implemented by StringUtils.escapeLikeRegex:

- Escape char at the end of the pattern (no character to escape)
- Escape char followed by a character that is not `_`, `%`, or the escape char itself

I'm not sure if this should also be put in a specific function?
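To see why the original approach "does a lot of work", compare against what a full pattern translation involves. The sketch below is a Python approximation in the spirit of Spark's `StringUtils.escapeLikeRegex` (not the actual implementation; the function name and error messages are made up): it must build an anchored regex character by character, whereas validation alone only needs the escape checks on the way through.

```python
import re

def like_pattern_to_regex(pattern: str, escape_char: str) -> str:
    """Translate a SQL LIKE pattern into an anchored regex, raising on
    invalid escapes. Illustrative sketch only."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == escape_char:
            if i + 1 >= n:
                raise ValueError("escape character at end of pattern")
            nxt = pattern[i + 1]
            if nxt not in ("_", "%", escape_char):
                raise ValueError(f"invalid escape of {nxt!r}")
            out.append(re.escape(nxt))  # escaped char is matched literally
            i += 2
        elif c == "_":
            out.append(".")    # '_' matches exactly one character
            i += 1
        elif c == "%":
            out.append(".*")   # '%' matches any sequence of characters
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    return "(?s)^" + "".join(out) + "$"

print(like_pattern_to_regex("m%ca", "\\"))  # (?s)^m.*ca$
```

The two `raise` branches are exactly the checks the lightweight validation keeps; everything else (the `out` buffer, the `re.escape` calls, the final anchored string) is regex-construction work that validation can skip.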
```scala
override def doColumnar(lhs: GpuColumnVector, rhs: GpuScalar): ColumnVector = {
  if (!escapeValidated && rhs.isValid) {
    StringUtils.escapeLikeRegex(
```
Why do we have to validate this here too? Should we cache whether we did this in planning so we don't have to do it multiple times?

Right, no need to check at runtime; removed. It is not needed because there will be no GpuLike if validation fails.
Add the checklist to the description of this PR.
- Replace the StringUtils.escapeLikeRegex call with a focused O(n) escape-char validation that only checks the two invalid cases (escape char at end of pattern, escape char followed by a non-special character). This avoids building a full regex string during planning.
- Remove the runtime safety net in GpuLike.doColumnar; tagExprForGpu already prevents GpuLike from being created for invalid patterns.
- Remove the now-unused StringUtils import from stringFunctions.scala.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Updated the performance impact section of the PR description per the review comments. With the latest change, the check runs only once, at planning time, with O(n) logic, and LIKE pattern strings are usually not that long.
@revans2 I'm trying to add rules for the LLM so that we conduct a performance regression check after touching any core functional code. My question is, what kind of performance check is appropriate for changes like these? I can think of two types in general: NDS runs and micro-benchmarks.

For NDS, I have seen colleagues use it for shuffle optimization, join optimization, rule optimizations, etc. For micro-benchmarks, cuDF has a lot of them, and I recall some micro-benchmarks designed for long-string operations in spark-rapids. Using this regex change as an example, I would prefer a micro-benchmark over NDS if it really needs a performance regression check. What is your suggestion on this topic? Is there a general rule for choosing which type of performance check to apply?
@wjxiz1992 We need to use different benchmarks for different reasons and situations. If the change targets a specific piece of functionality (an expression, for example, or a columnar-to-row modification), then a targeted micro-benchmark can isolate the impact of the change. If the change is general, or the micro-benchmark showed a significant regression, then a more real-world benchmark is helpful, either to put the regression in the context of real-world processing or to exercise the broad change. But we have to be sure that the "real world" benchmark exercises the code we care about.

NDS does not fall back to the CPU for any processing. That is good for showing shuffle, but not columnar-to-row changes. NDS does not use maps, arrays, or structs, so it is not good for testing them. It hardly uses strings the way real-world queries do, so it is bad at that too. We often come up with other, more realistic queries, or use inspiration from customer queries we know of, to help build some of these types of tests.

In general I want to think about how this change could impact the performance of the system and then build tests that would cover that case. Here we have escape checks that happen at planning time and escape checks that happen at run time. The run-time checks happen on a per-batch basis. So I probably want a few different tests, some more malicious and some more realistic. If I want to be malicious, I start to change things around. We can set the target batch size much smaller (10 MiB, for example, so we get about 100x the number of batches) and see if the run-time check at the beginning of each batch has a measurable impact before and after. We could also run a lot of two- or three-row queries in a loop to see if the planning change has a measurable impact. If we test what we think are the worst-case situations and there is no measurable impact, then we know it is not a big deal.

For this case I would not likely use NDS at all. There are few LIKE operations in it. There are few strings too, and the string data isn't large; also, most of the patterns NDS uses are ones that Spark will rewrite to not use LIKE directly at all.
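A targeted micro-benchmark of the kind described above could be sketched as follows. This is a toy Python sketch, not the project's benchmark harness; the validator names, the pattern, and the iteration count are all made up. It compares the focused O(n) scan against a validator that builds the full translated regex, which is the trade-off this PR's final version makes at planning time:

```python
import re
import timeit

ESCAPE = "\\"
VALID_AFTER_ESCAPE = {"_", "%", ESCAPE}

def scan_validate(pattern: str) -> bool:
    """O(n) scan that only flags the two invalid cases; builds nothing."""
    i, n = 0, len(pattern)
    while i < n:
        if pattern[i] == ESCAPE:
            if i + 1 >= n or pattern[i + 1] not in VALID_AFTER_ESCAPE:
                return False
            i += 2
        else:
            i += 1
    return True

def regex_validate(pattern: str) -> bool:
    """Heavier stand-in: translates the whole pattern to a regex just to validate."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == ESCAPE:
            if i + 1 >= n or pattern[i + 1] not in VALID_AFTER_ESCAPE:
                return False
            out.append(re.escape(pattern[i + 1]))
            i += 2
        elif c == "_":
            out.append(".")
            i += 1
        elif c == "%":
            out.append(".*")
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    re.compile("(?s)^" + "".join(out) + "$")
    return True

PATTERN = ("abc%_" + ESCAPE + "%") * 20  # a ~140-char pattern with valid escapes
t_scan = timeit.timeit(lambda: scan_validate(PATTERN), number=10_000)
t_regex = timeit.timeit(lambda: regex_validate(PATTERN), number=10_000)
print(f"scan: {t_scan:.4f}s  regex-build: {t_regex:.4f}s")
```

Per the discussion, this only covers the planning-time half; a malicious per-batch test (many tiny batches via a small target batch size) would have been the way to measure the run-time safety net that was ultimately removed.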
Closes #14117
Summary
- Previously, `GpuLike` passed patterns directly to cuDF, which silently accepted invalid escape sequences (e.g. `LIKE 'm%@ca' ESCAPE '%'`), returning wrong results instead of throwing `AnalysisException` like the CPU does.
- Add `tagExprForGpu` in `BinaryExprMeta[Like]` with a focused O(n) escape-char validation that detects invalid patterns and falls back to CPU, which throws the correct `AnalysisException`.
- Re-enable the SPARK-33677 test in `RapidsTestSettings`.

UT Traceability
- Suite: `RapidsSQLQuerySuite`
- Test: "SPARK-33677: LikeSimplification should be skipped if pattern contains any escapeChar"
- Original test location: `sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala`

Performance Impact
Negligible. The validation runs once during query planning (not per-row), only on the literal pattern string. It is a simple O(n) character scan with no allocations: no regex construction, no `Pattern.quote` calls. Typical LIKE patterns are a few dozen characters. There is no change to the per-batch GPU execution path (`GpuLike.doColumnar`).

Test Plan
`mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=...RapidsSQLQuerySuite`: 215 passed, 0 failed, 18 ignored

Checklists
- [x] Tests cover the changed behaviors.
- [x] Tests cover the new code paths.
  (The SPARK-33677 exclusion is removed, re-enabling the existing Spark test "SPARK-33677: LikeSimplification should be skipped if pattern contains any escapeChar" in `RapidsSQLQuerySuite`, which covers the new `tagExprForGpu` validation path.)
- [x] The performance impact is described in the PR description. Or, an issue has been filed with a link in the PR description.
Made with Cursor