[AutoSparkUT] Fix LIKE with invalid escape pattern to match CPU behavior#14388
wjxiz1992 wants to merge 3 commits into NVIDIA:main
Conversation
Pull request overview
This PR aligns GPU LIKE ... ESCAPE ... behavior with Spark CPU semantics by validating escape patterns and avoiding GPU execution for invalid patterns that should raise an AnalysisException.
Changes:
- Add GPU planning-time validation for `LIKE` escape patterns and fall back to CPU on invalid patterns.
- Add executor-side defensive validation in `GpuLike.doColumnar` as a safety net.
- Re-enable the previously excluded Spark 3.3.0 unit test for SPARK-33677 in `RapidsTestSettings`.
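The planning-time validation described above boils down to a single scan over the literal pattern. A minimal Python sketch of that logic follows; the helper name `find_invalid_escape` is hypothetical (the real check lives in the Scala `tagExprForGpu` override), but the two invalid cases it flags are the ones this PR targets:

```python
def find_invalid_escape(pattern: str, escape_char: str):
    """Return a description of the first invalid escape use in a SQL LIKE
    pattern, or None if the pattern is valid. Per Spark's rule, only '_',
    '%', or the escape character itself may be escaped. Illustrative
    sketch only, not the plugin's actual code."""
    i = 0
    n = len(pattern)
    while i < n:
        if pattern[i] == escape_char:
            # Case 1: escape char at the end of the pattern (nothing to escape)
            if i + 1 >= n:
                return "escape character at end of pattern"
            nxt = pattern[i + 1]
            # Case 2: escape char followed by a non-escapable character
            if nxt not in ("_", "%", escape_char):
                return f"invalid escape of {nxt!r}"
            i += 2  # skip the escaped character
        else:
            i += 1
    return None

# The pattern from this PR: '%' is the escape char, but '@' may not be escaped
print(find_invalid_escape("m%@ca", "%"))   # flagged as invalid
print(find_invalid_escape("m%%ca", "%"))   # None: '%' escaping itself is valid
```

When the scan reports a problem, the expression is tagged with `willNotWorkOnGpu`, so planning falls back to the CPU, which raises the expected `AnalysisException`.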
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/src/test/spark330/scala/org/apache/spark/sql/rapids/utils/RapidsTestSettings.scala | Removes the SPARK-33677 exclusion so the test runs again. |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/stringFunctions.scala | Adds one-time runtime validation of the LIKE pattern escape semantics inside GpuLike. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala | Adds tagExprForGpu validation for Like to detect invalid escape patterns and force CPU fallback. |
Greptile Summary

This PR fixes GPU `LIKE ... ESCAPE` handling of invalid escape patterns to match CPU behavior. Key changes:
Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LIKE expression reaches GPU planner] --> B{a.right is Literal UTF8String?}
    B -- No --> E[case _ : no validation, ExprChecks already rejects non-literals]
    B -- Yes --> C[Iterate pattern chars in tagExprForGpu]
    C --> D{Current char == escapeChar?}
    D -- No --> F[Advance i, continue]
    F --> D
    D -- Yes --> G{Next char exists?}
    G -- "No: trailing escape" --> H[willNotWorkOnGpu → CPU fallback → AnalysisException]
    G -- Yes --> I{Next char in '_', '%', escapeChar?}
    I -- "No: invalid escape" --> H
    I -- "Yes: valid escape" --> J[Advance i by 2, continue]
    J --> D
    C --> K[All chars valid → convertToGpu → GpuLike.doColumnar → cuDF like]
```
…ior (NVIDIA#14117)

Validate LIKE escape patterns on GPU to match CPU semantics. Previously, GpuLike passed patterns directly to cuDF, which silently accepted invalid escape sequences (e.g. `LIKE 'm%@ca' ESCAPE '%'`), returning wrong results instead of throwing AnalysisException like the CPU does.

- Add tagExprForGpu in BinaryExprMeta[Like] to detect invalid escape patterns and fall back to CPU, which throws the correct exception.
- Add defensive validation in GpuLike.doColumnar as a safety net.
- Remove the SPARK-33677 exclusion from RapidsTestSettings.

Closes NVIDIA#14117

Maven validation: `mvn test -pl tests -Dbuildver=330 -DwildcardSuites=...RapidsSQLQuerySuite`
Tests: succeeded 215, failed 0, ignored 18

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
- Use NonFatal(e) instead of catching Exception in tagExprForGpu to avoid swallowing fatal errors; include e.getMessage for diagnostics.
- Check l.value.isInstanceOf[UTF8String] instead of l.dataType == StringType to correctly handle CharType/VarcharType, whose values are UTF8String at runtime.
- Use rhs.getValue.toString in GpuLike.doColumnar to safely handle both UTF8String- and String-backed scalars.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Force-pushed from d4191a3 to 0e1b4ad
revans2 left a comment

What is the performance impact of this change? All PRs are supposed to include a performance impact assessment unless it is something like a doc- or comment-only change.
```scala
    if l.value.isInstanceOf[UTF8String] =>
  StringUtils.escapeLikeRegex(
    l.value.toString, a.escapeChar)
case _ =>
```
So it is okay if we get a literal that is not a string? Or is this more that constant folding was disabled?

We can remove this path (`case _ =>`) because "search" only accepts a literal; it cannot be any other type:

```scala
("search", TypeSig.lit(TypeEnum.STRING), TypeSig.STRING))
```

Nice catch, updated to use a pattern match for it.
```scala
a.right match {
  case l: Literal
      if l.value.isInstanceOf[UTF8String] =>
    StringUtils.escapeLikeRegex(
```
This does a lot of work.

Can we please look at what it would take to validate this in a faster way? The main reason I ask is that we want to be able to support cuDF's like with a columnar pattern, and if we don't understand the pattern, validating it on the GPU with the full column is going to be a pain.

Agree. It is now replaced with lightweight logic that checks the two invalid cases defined in Spark's Like expression doc ("It is invalid to escape any other character") and implemented by StringUtils.escapeLikeRegex:

- Escape char at the end of the pattern (no character to escape)
- Escape char followed by a character that is not `_`, `%`, or the escape char itself

I'm not sure if this should also be put in a specific function?
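To see why the original approach "does a lot of work", compare against what a full pattern translation involves. The sketch below is a Python approximation in the spirit of Spark's `StringUtils.escapeLikeRegex` (not the actual implementation; the function name and error messages are made up): it must build an anchored regex character by character, whereas validation alone only needs the escape checks on the way through.

```python
import re

def like_pattern_to_regex(pattern: str, escape_char: str) -> str:
    """Translate a SQL LIKE pattern into an anchored regex, raising on
    invalid escapes. Illustrative sketch only."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == escape_char:
            if i + 1 >= n:
                raise ValueError("escape character at end of pattern")
            nxt = pattern[i + 1]
            if nxt not in ("_", "%", escape_char):
                raise ValueError(f"invalid escape of {nxt!r}")
            out.append(re.escape(nxt))  # escaped char is matched literally
            i += 2
        elif c == "_":
            out.append(".")    # '_' matches exactly one character
            i += 1
        elif c == "%":
            out.append(".*")   # '%' matches any sequence of characters
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    return "(?s)^" + "".join(out) + "$"

print(like_pattern_to_regex("m%ca", "\\"))  # (?s)^m.*ca$
```

The two `raise` branches are exactly the checks the lightweight validation keeps; everything else (the `out` buffer, the `re.escape` calls, the final anchored string) is regex-construction work that validation can skip.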
```scala
override def doColumnar(lhs: GpuColumnVector, rhs: GpuScalar): ColumnVector = {
  if (!escapeValidated && rhs.isValid) {
    StringUtils.escapeLikeRegex(
```
Why do we have to validate this here too? Should we cache whether we did this in planning so we don't have to do it multiple times?

Right, no need to check at runtime; removed. It is not needed because there will be no GpuLike if validation fails.
Add the checklist to the description of this PR.
- Replace the StringUtils.escapeLikeRegex call with a focused O(n) escape-char validation that only checks the two invalid cases (escape char at end of pattern, escape char followed by a non-special character). This avoids building a full regex string during planning.
- Remove the runtime safety net in GpuLike.doColumnar; tagExprForGpu already prevents GpuLike from being created for invalid patterns.
- Remove the now-unused StringUtils import from stringFunctions.scala.

Signed-off-by: Allen Xu <allxu@nvidia.com>
Made-with: Cursor
Updated the performance impact section of the PR description per the review comments. With the latest change, the check runs only once, at planning time, with O(n) logic, and LIKE pattern strings are usually not that long.
@revans2 I'm trying to add rules for the LLM so that we conduct a performance regression check after touching any core functional code. My question is, what kind of performance check is appropriate for changes like these? I can think of two types in general: NDS runs and micro-benchmarks.

For NDS, I have seen colleagues use it for shuffle optimization, join optimization, rule optimizations, etc. For micro-benchmarks, cuDF has a lot of them, and I recall some micro-benchmarks designed for long-string operations in spark-rapids. Using this regex change as an example, I would prefer a micro-benchmark over NDS if it really needs a performance regression check. What is your suggestion on this topic? Is there a general rule for choosing which type of performance check to apply?
@wjxiz1992 We need to use different benchmarks for different reasons and situations. If the change targets a specific piece of functionality (an expression, for example, or a columnar-to-row modification), then a targeted micro-benchmark can isolate the impact of the change. If the change is general, or the micro-benchmark showed a significant regression, then a more real-world benchmark is helpful, either to put the regression in the context of real-world processing or to exercise the broad change. But we have to be sure that the "real world" benchmark exercises the code we care about.

NDS does not fall back to the CPU for any processing. That is good for showing shuffle, but not columnar-to-row changes. NDS does not use maps, arrays, or structs, so it is not good for testing them. It hardly uses strings the way real-world queries do, so it is bad at that too. We often come up with other, more realistic queries, or use inspiration from customer queries we know of, to help build some of these types of tests.

In general I want to think about how this change could impact the performance of the system and then build tests that would cover that case. Here we have escape checks that happen at planning time and escape checks that happen at run time. The run-time checks happen on a per-batch basis. So I probably want a few different tests, some more malicious and some more realistic. If I want to be malicious, I start to change things around. We can set the target batch size much smaller (10 MiB, for example, so we get about 100x the number of batches) and see if the run-time check at the beginning of each batch has a measurable impact before and after. We could also run a lot of two- or three-row queries in a loop to see if the planning change has a measurable impact. If we test what we think are the worst-case situations and there is no measurable impact, then we know it is not a big deal.

For this case I would not likely use NDS at all. There are few LIKE operations in it. There are few strings too, and the string data isn't large; also, most of the patterns NDS uses are ones that Spark will rewrite to not use LIKE directly at all.
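A targeted micro-benchmark of the kind described above could be sketched as follows. This is a toy Python sketch, not the project's benchmark harness; the validator names, the pattern, and the iteration count are all made up. It compares the focused O(n) scan against a validator that builds the full translated regex, which is the trade-off this PR's final version makes at planning time:

```python
import re
import timeit

ESCAPE = "\\"
VALID_AFTER_ESCAPE = {"_", "%", ESCAPE}

def scan_validate(pattern: str) -> bool:
    """O(n) scan that only flags the two invalid cases; builds nothing."""
    i, n = 0, len(pattern)
    while i < n:
        if pattern[i] == ESCAPE:
            if i + 1 >= n or pattern[i + 1] not in VALID_AFTER_ESCAPE:
                return False
            i += 2
        else:
            i += 1
    return True

def regex_validate(pattern: str) -> bool:
    """Heavier stand-in: translates the whole pattern to a regex just to validate."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == ESCAPE:
            if i + 1 >= n or pattern[i + 1] not in VALID_AFTER_ESCAPE:
                return False
            out.append(re.escape(pattern[i + 1]))
            i += 2
        elif c == "_":
            out.append(".")
            i += 1
        elif c == "%":
            out.append(".*")
            i += 1
        else:
            out.append(re.escape(c))
            i += 1
    re.compile("(?s)^" + "".join(out) + "$")
    return True

PATTERN = ("abc%_" + ESCAPE + "%") * 20  # a ~140-char pattern with valid escapes
t_scan = timeit.timeit(lambda: scan_validate(PATTERN), number=10_000)
t_regex = timeit.timeit(lambda: regex_validate(PATTERN), number=10_000)
print(f"scan: {t_scan:.4f}s  regex-build: {t_regex:.4f}s")
```

Per the discussion, this only covers the planning-time half; a malicious per-batch test (many tiny batches via a small target batch size) would have been the way to measure the run-time safety net that was ultimately removed.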
Closes #14117
Summary
- Previously, `GpuLike` passed patterns directly to cuDF, which silently accepted invalid escape sequences (e.g. `LIKE 'm%@ca' ESCAPE '%'`), returning wrong results instead of throwing `AnalysisException` like the CPU does.
- Add `tagExprForGpu` in `BinaryExprMeta[Like]` with a focused O(n) escape-char validation that detects invalid patterns and falls back to CPU, which throws the correct `AnalysisException`.
- Re-enable the SPARK-33677 test in `RapidsTestSettings`.

UT Traceability
- Suite: `RapidsSQLQuerySuite`
- Test: "SPARK-33677: LikeSimplification should be skipped if pattern contains any escapeChar"
- Original test location: `sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala`

Performance Impact
Negligible. The validation runs once during query planning (not per-row), only on the literal pattern string. It is a simple O(n) character scan with no allocations: no regex construction, no `Pattern.quote` calls. Typical LIKE patterns are a few dozen characters. There is no change to the per-batch GPU execution path (`GpuLike.doColumnar`).

Test Plan
`mvn package -pl tests -am -Dbuildver=330 -DwildcardSuites=...RapidsSQLQuerySuite`: 215 passed, 0 failed, 18 ignored

Checklists
- [x] Tests cover the changed behaviors.
- [x] Tests cover the new code paths.
  (The SPARK-33677 exclusion is removed, re-enabling the existing Spark test "SPARK-33677: LikeSimplification should be skipped if pattern contains any escapeChar" in `RapidsSQLQuerySuite`, which covers the new `tagExprForGpu` validation path.)
- [x] The performance impact is described in the PR description. Or, an issue has been filed with a link in the PR description.
Made with Cursor