feat: Add support for RegExpExtract/RegExpExtractAll (#31)

martin-augment wants to merge 4 commits into `main` from `RegExpExtract/RegExpExtractAll#31`

Conversation
Walkthrough: This change adds support for two regular expression scalar functions to Comet: regexp_extract and regexp_extract_all.
```rust
        }
    }
} else {
    // Using 1 here to align with Spark's default behavior.
```
In Spark, the default idx for regexp_extract_all is 0 (entire match); defaulting to 1 will produce incorrect results when the third argument is omitted (the Scala tests also assume default=0). Consider aligning the default to 0 to match Spark semantics.
value:annoying; category:bug; feedback:The Augment AI reviewer is not correct. The default is 1 as documented at https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L885 and https://docs.databricks.com/aws/en/sql/language-manual/functions/regexp_extract_all#arguments
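The documented default can be illustrated with a minimal Python sketch, using the stdlib `re` module as a stand-in for the native regex engine (illustrative only, not Comet's implementation; the out-of-range behavior follows the empty-string semantics discussed elsewhere in this review):

```python
import re

def regexp_extract_all(s, pattern, idx=1):
    # Default idx is 1, matching Spark's RegExpExtractAll as documented
    # in regexpExpressions.scala (branch-3.5, linked above).
    results = []
    for m in re.finditer(pattern, s):
        if idx > m.re.groups:
            # out-of-range group index: yield "" rather than an error
            results.append("")
        else:
            results.append(m.group(idx) or "")
    return results

# default idx=1 extracts the first capture group of every match
print(regexp_extract_all("100-200, 300-400", r"(\d+)-(\d+)"))  # ['100', '300']
```

With the default of 1, a pattern that has no capture groups yields empty strings unless the caller explicitly passes `idx=0` for the entire match.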
```rust
fn extract_group(text: &str, regex: &Regex, idx: usize) -> Result<String> {
    match regex.captures(text) {
        Some(caps) => {
            // Spark behavior: throw error if group index is out of bounds
```
Spark regexp_extract returns an empty string when the specified group index is out of range for a match, rather than erroring; throwing here will raise at runtime for inputs like idx=3 in your tests. Consider returning an empty string in this case to match Spark semantics.
value:useful; category:bug; feedback:The Augment AI reviewer is correct. Apache Spark returns an empty string when there is no such group by index - https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L809-L821. Prevents different behavior than Spark's.
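The Spark semantics referenced here (empty string for no match or an out-of-range group index) can be modeled in Python with the stdlib `re` module; this is a hedged sketch of the expected behavior, not the Comet implementation:

```python
import re

def regexp_extract(s, pattern, idx=1):
    # Mirrors Spark: return "" when the regex does not match or the
    # group index is out of range, instead of raising an error.
    m = re.search(pattern, s)
    if m is None or idx > m.re.groups:
        return ""
    return m.group(idx) or ""

print(regexp_extract("abc123def", r"(\d+)"))      # '123'
print(regexp_extract("abc123def", r"(\d+)", 3))   # '' (group 3 does not exist)
print(regexp_extract("abcdef", r"(\d+)"))         # '' (no match)
```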
Code Review for PR #31: RegExpExtract/RegExpExtractAll Implementation

Thank you for this implementation! I've conducted a comprehensive review of the code. Overall, the implementation is solid and well-structured, with good test coverage and proper Spark compatibility. Below are my findings and recommendations.

Strengths

1. Code Quality
2. Spark Compatibility
3. Test Coverage

Issues and Recommendations

Critical Issues

1. Missing Dictionary Encoding Support

Location: native/spark-expr/src/string_funcs/regexp_extract.rs:266

The implementation does NOT handle dictionary-encoded string arrays, which are common in Parquet files. Looking at string_space.rs:98-111, dictionary support is a standard pattern in this codebase.

Impact: Performance degradation and potential failures when processing dictionary-encoded columns.

Recommendation: Add dictionary support similar to string_space.rs.

2. Incorrect Default Group Index for regexp_extract_all

Location: native/spark-expr/src/string_funcs/regexp_extract.rs:183

The comment says "Using 1 here to align with Spark's default behavior", but this needs verification. Spark's actual default behavior should be confirmed and documented.

Medium Priority Issues

3. Missing Large String Support

Location: both regexp_extract_array and regexp_extract_all_array

The code only handles `GenericStringArray<i32>` (normal strings), not `GenericStringArray<i64>` (large strings). Consider adding support or documenting this limitation.

4. Scala Code: Pattern Type Check Issue

Location: spark/src/main/scala/org/apache/comet/serde/strings.scala:296

Since RegExp.isSupportedPattern always returns false (RegExp.scala:29), this will ALWAYS require COMET_REGEXP_ALLOW_INCOMPATIBLE=true for any pattern.

Recommendation: Either implement proper pattern validation or add a TODO comment explaining that this is future work.

5. Test Gap: Out-of-Bounds Index Behavior

Location: spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:415

The test comment says "should return empty string", but the Rust code throws an error for out-of-bounds indices. Verify that this matches Spark's behavior.

Security Analysis

No major security concerns.
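The dictionary-encoding concern can be illustrated with a small Python model: a dictionary-encoded column stores a short vector of distinct values plus per-row indices, so the regex only needs to run once per distinct value and the results are then gathered by index. All names here are illustrative, not Comet's actual API:

```python
import re

def regexp_extract_dict(values, indices, pattern, idx=1):
    # Evaluate the regex once per distinct dictionary value...
    def extract(v):
        m = re.search(pattern, v)
        return "" if m is None or idx > m.re.groups else (m.group(idx) or "")
    extracted = [extract(v) for v in values]
    # ...then gather results by the per-row dictionary indices.
    return [extracted[i] for i in indices]

values = ["item-17", "item-3", "plain"]   # distinct dictionary values
indices = [0, 1, 1, 2, 0]                 # one entry per row
print(regexp_extract_dict(values, indices, r"item-(\d+)"))
# ['17', '3', '3', '', '17']
```

This is why dictionary support is both a correctness and a performance concern: without it, a dictionary-encoded column must be decoded up front and the regex re-run for every row.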
Performance Considerations

Good:

Potential Improvements:

Documentation

Missing:

Test Coverage Assessment

Covered:

Missing:

Summary and Priority Actions

Before Merge (Critical):

Consider for Follow-up:

Overall Assessment: 7.5/10

The code is well-written with good fundamentals, but the missing dictionary encoding support is a significant gap for a production feature. Once addressed, this would be a 9/10 implementation. Great work on the comprehensive tests and Spark compatibility!
```rust
    }
} else {
    // Using 1 here to align with Spark's default behavior.
    1
```
Bug: Incorrect default idx value for regexp_extract_all
The default `idx` value for `regexp_extract_all` is set to 1, but it should be 0 to match Spark's behavior. The comment in strings.scala line 348 explicitly states the default should be 0 (group 0, the entire match). Additionally, the test case at line 502 calls `regexp_extract_all` with pattern `\d+` (no capture groups), which would fail with `idx=1` since group 1 doesn't exist, but works with `idx=0`.
value:annoying; category:bug; feedback:The Bugbot AI reviewer is not correct. The default is 1 as documented at https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L885 and https://docs.databricks.com/aws/en/sql/language-manual/functions/regexp_extract_all#arguments
Actionable comments posted: 1
🧹 Nitpick comments (4)
native/spark-expr/src/string_funcs/regexp_extract.rs (3)
68-116: Clarify NULL pattern and negative `idx` handling in `invoke_with_args`.

- `pattern` is only accepted as `ColumnarValue::Scalar(ScalarValue::Utf8(Some(_)))`; a typed NULL string literal (likely encoded as `ScalarValue::Utf8(None)`) will hit the `_` arm and raise "regexp_extract pattern must be a string literal". For queries like `regexp_extract(col, NULL, idx)`, this would surface as an execution error rather than following Spark's NULL semantics. Please confirm how Spark behaves here and either (a) make the Scala serde treat NULL patterns as unsupported (so we fall back to Spark), or (b) teach the Rust side to propagate a NULL pattern to a NULL result (building an all-NULL array for column input).
- `idx` is read as `Int32` and immediately cast to `usize`. Negative values will wrap to huge positive indices (e.g., `-1` → `4294967295`), which then fail the bounds check in `extract_group`/`extract_all_groups` with a confusing message. It would be cleaner to reject `i < 0` explicitly before casting and return a more direct error about negative group indices.
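The wrap-around described above can be demonstrated numerically. Python's integers are unbounded, so the two's-complement reinterpretation that Rust's `as` cast performs on a 32-bit value is simulated here with a mask (the helper names are illustrative):

```python
def as_u32(i):
    # Simulates Rust's `i as u32` two's-complement reinterpretation
    return i & 0xFFFFFFFF

print(as_u32(-1))  # 4294967295: far past any plausible capture-group count

def checked_group_index(i):
    # Safer alternative: reject negatives before casting, so the user
    # sees a direct error instead of a confusing bounds-check message.
    if i < 0:
        raise ValueError(f"regexp group index must be non-negative, got {i}")
    return i
```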
97-101: Classify invalid regex patterns as user errors, not internal errors.

`Regex::new(&pattern_str)` currently maps failures to `internal_datafusion_err!("Invalid regex pattern ...")`. Since an invalid pattern is user input, not a Comet bug, consider switching this to `exec_err!` so it surfaces as a normal execution error rather than an "internal" one:

```diff
- let regex = Regex::new(&pattern_str).map_err(|e| {
-     internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
- })?;
+ let regex = Regex::new(&pattern_str).map_err(|e| {
+     exec_err!("Invalid regex pattern '{}': {}", pattern_str, e)
+ })?;
```

The same applies to `SparkRegExpExtractAll`.
291-303: Consider explicit validation for negative group indices in `extract_all_groups`.

`idx` arrives here as a `usize` derived from an `i32` via `as`, so negative Java/Spark indices get wrapped into large positive values and then trip the `idx >= caps.len()` check with a misleading message. For better diagnostics and closer alignment with Spark, consider validating the original `i32` before casting and returning a clear error for negative `idx` (and potentially sharing that validation between `extract_group` and `extract_all_groups`).

spark/src/main/scala/org/apache/comet/serde/strings.scala (1)
24-25: Regex serde handlers look good; please clarify NULL-pattern and default-`idx` semantics.

- `CometRegExpExtract`/`CometRegExpExtractAll` correctly restrict patterns and `idx` to literals and reuse `RegExp.isSupportedPattern` plus `COMET_REGEXP_ALLOW_INCOMPATIBLE`. However, for `expr.regexp match { case Literal(pattern, StringType) => ... }`, a literal `NULL` will bind `pattern` as `null`, and `pattern.toString` becomes `"null"`. Combined with the Rust side currently rejecting NULL patterns, this likely makes `regexp_extract(col, NULL, idx)`/`regexp_extract_all(col, NULL, idx)` behave differently from Spark. It may be safer either to treat a NULL literal pattern as `Unsupported` here (falling back to Spark) or to explicitly propagate a NULL pattern down to native once you decide on the intended semantics.
- In `CometRegExpExtractAll.getSupportLevel`, the comment says "idx will default to 0 (group 0, entire match) if not specified", but the Rust UDF's 2-arg fallback uses a default of `1`, and, from this serde layer, `expr.idx` is always supplied explicitly anyway. To avoid confusion for future maintainers (or non-Spark frontends that might call the native UDF with 2 args), consider updating the comment and/or aligning the Rust default with Spark's actual default after confirming it.

Also applies to: 290-368
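The NULL-propagation alternative discussed here corresponds to Spark's usual null semantics, sketched below in Python with `None` standing in for SQL NULL. This is an illustrative model only; the real change would live in the Rust UDF and the Scala serde:

```python
import re

def regexp_extract_nullsafe(s, pattern, idx=1):
    # Spark-style null propagation: any NULL argument yields NULL
    if s is None or pattern is None or idx is None:
        return None
    m = re.search(pattern, s)
    if m is None or idx > m.re.groups:
        return ""
    return m.group(idx) or ""

print(regexp_extract_nullsafe("abc123", None))      # None (NULL in, NULL out)
print(regexp_extract_nullsafe(None, r"(\d+)"))      # None
print(regexp_extract_nullsafe("abc123", r"(\d+)"))  # '123'
```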
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- docs/source/user-guide/latest/configs.md (1 hunks)
- native/spark-expr/src/comet_scalar_funcs.rs (2 hunks)
- native/spark-expr/src/string_funcs/mod.rs (1 hunks)
- native/spark-expr/src/string_funcs/regexp_extract.rs (1 hunks)
- spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (1 hunks)
- spark/src/main/scala/org/apache/comet/serde/strings.scala (2 hunks)
- spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T15:01:48.203Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 17
File: docs/source/contributor-guide/adding_a_new_operator.md:349-354
Timestamp: 2025-11-11T15:01:48.203Z
Learning: For Apache DataFusion Comet debugging documentation, the correct configuration keys are `spark.comet.explain.format=verbose` for verbose explain plans and `spark.comet.logFallbackReasons.enabled=true` for logging fallback reasons (not `spark.comet.explain.verbose` or `spark.comet.logFallbackReasons` without `.enabled`).
Applied to files:
docs/source/user-guide/latest/configs.md
🧬 Code graph analysis (3)
native/spark-expr/src/comet_scalar_funcs.rs (1)
native/spark-expr/src/string_funcs/regexp_extract.rs (2)
- new (42-47)
- new (137-142)
native/spark-expr/src/string_funcs/regexp_extract.rs (1)
native/spark-expr/src/comet_scalar_funcs.rs (6)
- new (255-267)
- as_any (271-273)
- name (275-277)
- signature (279-281)
- return_type (283-285)
- invoke_with_args (287-289)
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (1)
spark/src/main/scala/org/apache/comet/serde/strings.scala (2)
- CometRegExpExtract (290-327)
- CometRegExpExtractAll (329-368)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (6)
native/spark-expr/src/string_funcs/regexp_extract.rs (1)
345-451: Unit tests for helpers and array behavior are solid.

The tests for `extract_group`, `extract_all_groups`, `regexp_extract_all_array`, and `regexp_extract_array` cover basic matches, no-match cases, NULL propagation, empty arrays vs NULL lists, and group index errors. Once the out-of-bounds semantics are finalized, these tests give good coverage of the native helpers.

native/spark-expr/src/string_funcs/mod.rs (1)
18-23: String funcs module wiring for regexp UDFs looks correct.

Adding `mod regexp_extract;` and re-exporting `SparkRegExpExtract`/`SparkRegExpExtractAll` cleanly exposes the new UDFs to the crate, matching how other string functions are wired.

docs/source/user-guide/latest/configs.md (1)
294-295: Config keys are consistent with expression and serde naming.

`spark.comet.expression.RegExpExtract.enabled` and `.RegExpExtractAll.enabled` match the handler names (`CometRegExpExtract`/`CometRegExpExtractAll`) and the `spark.comet.expression.<Expr>.enabled` pattern used elsewhere, so they should integrate cleanly with `CometConf.isExprEnabled`.

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (1)
143-158: Serde map correctly wires regexp expressions to their handlers.

Adding

```scala
classOf[RegExpExtract] -> CometRegExpExtract,
classOf[RegExpExtractAll] -> CometRegExpExtractAll,
```

to `stringExpressions` aligns these Catalyst expressions with the new serde handlers and ensures they participate in `exprSerdeMap` alongside other string functions.

native/spark-expr/src/comet_scalar_funcs.rs (1)
18-27: New regexp UDFs are registered consistently with existing scalar functions.

Bringing `SparkRegExpExtract`/`SparkRegExpExtractAll` into the crate import and adding them to `all_scalar_functions()` ensures they're registered via `register_all_comet_functions` and discoverable by name (`"regexp_extract"`, `"regexp_extract_all"`) through the `FunctionRegistry`. No additional cases are needed in `create_comet_physical_fun` since it already falls back to `registry.udf(fun_name)`.

Also applies to: 196-204

spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala (1)
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala (1)
394-506: Regex extract tests provide good coverage; ensure they stay aligned with native semantics.The new tests exercise:
- Basic
regexp_extractcases, including NULL input, no match, non-existent group, empty pattern, and NULL pattern.- Practical edge patterns (email, phone, price).
regexp_extract_allwith group 0 vs 1, no matches, empty/NULL patterns, multiple matches, and dictionary-encoded Parquet paths.This is a solid coverage matrix for the new expressions. Once the native side resolves the group-index and NULL-pattern semantics (see comments in
regexp_extract.rs), these tests should robustly guard against regressions.
```rust
fn extract_group(text: &str, regex: &Regex, idx: usize) -> Result<String> {
    match regex.captures(text) {
        Some(caps) => {
            // Spark behavior: throw error if group index is out of bounds
            if idx >= caps.len() {
                return exec_err!(
                    "Regex group count is {}, but the specified group index is {}",
                    caps.len(),
                    idx
                );
            }
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's check the structure and find the relevant files
find . -name "regexp_extract.rs" -o -name "CometStringExpressionSuite.scala" | head -20
```

Repository: martin-augment/datafusion-comet
Length of output: 202

🏁 Script executed:

```shell
# Let's examine the regexp_extract.rs file to understand the full context
wc -l native/spark-expr/src/string_funcs/regexp_extract.rs
```

Repository: martin-augment/datafusion-comet
Length of output: 131

🏁 Script executed:

```shell
# Check lines 246-256 and surrounding context, and also 349-358
sed -n '240,260p' native/spark-expr/src/string_funcs/regexp_extract.rs
```

Repository: martin-augment/datafusion-comet
Length of output: 709

🏁 Script executed:

```shell
# Check lines 349-358
sed -n '345,365p' native/spark-expr/src/string_funcs/regexp_extract.rs
```

Repository: martin-augment/datafusion-comet
Length of output: 952

🏁 Script executed:

```shell
# Find and examine the Scala test file
find . -name "CometStringExpressionSuite.scala" | head -5
```

Repository: martin-augment/datafusion-comet
Length of output: 147

🏁 Script executed:

```shell
# Check the Scala test file around line 415
sed -n '405,425p' ./spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala
```

Repository: martin-augment/datafusion-comet
Length of output: 972

🏁 Script executed:

```shell
# Let's search for regexp_extract tests in the Scala file
rg -n "regexp_extract" ./spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala -A 5 -B 2
```

Repository: martin-augment/datafusion-comet
Length of output: 4071

🌐 Web query:

Spark regexp_extract behavior when group index out of bounds exceeds number of capturing groups

💡 Result:

If the specified group index is out of range (greater than the number of capturing groups), regexp_extract returns an empty string (the same as when the regex or that group didn't match). [1][2]

Sources:
- PySpark docs: pyspark.sql.functions.regexp_extract — returns an empty string if the regex didn't match or the specified group didn't match. [1]
- Spark SQL built-in functions (regexp_extract) — the group index is the Java Matcher group() index; non-matching groups yield an empty string per function behavior. [2]

🏁 Script executed:

```shell
# Let's verify the exact behavior by checking if there is any special handling in the Rust code
sed -n '240,300p' native/spark-expr/src/string_funcs/regexp_extract.rs
```

Repository: martin-augment/datafusion-comet
Length of output: 1967

🏁 Script executed:

```shell
# Also check if there are any integration tests that would catch this
rg -n "extract_group" native/spark-expr/src/string_funcs/regexp_extract.rs | head -20
```

Repository: martin-augment/datafusion-comet
Length of output: 769
Fix `extract_group` and `extract_all_groups` to return an empty string for out-of-bounds group indices instead of throwing an error.

Apache Spark's `regexp_extract` returns an empty string when the group index exceeds the number of capturing groups, rather than raising an error. The Rust implementation incorrectly throws an error at lines 246-256 and lines 280+, and the Rust tests incorrectly assert the error behavior. The Scala test at line 415 (`checkSparkAnswerAndOperator`) expects empty-string output and will fail at runtime because it compares Comet's result against actual Spark behavior.

Replace the `if idx >= caps.len()` error checks with fallthrough logic that returns `Ok(String::new())` when the group is out of bounds.
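The suggested fix can be modeled in Python with the stdlib `re` module (a sketch of the intended semantics, not the Comet code); the key change is falling through to an empty string for each match whose requested group is out of range:

```python
import re

def extract_all_groups(text, pattern, idx):
    # Fall through to "" for out-of-range group indices instead of
    # returning an error, matching Spark's regexp_extract_all.
    out = []
    for m in re.finditer(pattern, text):
        out.append("" if idx > m.re.groups else (m.group(idx) or ""))
    return out

print(extract_all_groups("a1 b2", r"(\d)", 1))  # ['1', '2']
print(extract_all_groups("a1 b2", r"(\d)", 3))  # ['', ''] instead of an error
```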
🤖 Prompt for AI Agents
In native/spark-expr/src/string_funcs/regexp_extract.rs around lines 246 to 256
(and similarly for the later extract_all_groups block around line ~280), the
code currently returns an exec_err! when the requested capture group index is >=
caps.len(); change this to return Ok(String::new()) so out-of-bounds group
indices produce an empty string per Spark behavior. Update extract_all_groups'
corresponding bounds checks to fall through or return empty strings for
out-of-range indices instead of errors, and adjust the Rust test assertions that
expect an error to expect an empty string result to match the Scala/Spark test.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct. Apache Spark returns an empty string when there is no such group by index - https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L809-L821. Prevents different behavior than Spark's.
value:useful; category:bug; feedback:The Claude AI reviewer is correct. Apache Spark returns an empty string when there is no such group by index - https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L809-L821. Prevents different behavior than Spark's.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is partially correct. Apache Spark returns an empty string when there is no such group by index - https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L809-L821. So, the large usize value will again not match any group (just like a negative value in Spark), but the difference is that Comet will return an error instead of an empty string.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct. Invalid user input should be reported as an execution error, not as an internal error (i.e. as a bug). Prevents reporting issues to the Apache DataFusion Comet repository for problems caused by wrong regex patterns.