feat: Add support for RegExpExtract/RegExpExtractAll (#31)

martin-augment wants to merge 4 commits into `main` from `RegExpExtract/RegExpExtractAll#31`

Conversation
Walkthrough: This change adds support for two regular expression scalar functions to Comet: regexp_extract and regexp_extract_all.
```rust
        }
    }
} else {
    // Using 1 here to align with Spark's default behavior.
```
In Spark, the default idx for regexp_extract_all is 0 (entire match); defaulting to 1 will produce incorrect results when the third argument is omitted (the Scala tests also assume default=0). Consider aligning the default to 0 to match Spark semantics.
value:annoying; category:bug; feedback:The Augment AI reviewer is not correct. The default is 1 as documented at https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L885 and https://docs.databricks.com/aws/en/sql/language-manual/functions/regexp_extract_all#arguments
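The documented default can be illustrated with a minimal Python sketch, using the stdlib `re` module as a stand-in for the native regex engine (illustrative only, not Comet's implementation; the out-of-range behavior follows the empty-string semantics discussed elsewhere in this review):

```python
import re

def regexp_extract_all(s, pattern, idx=1):
    # Default idx is 1, matching Spark's RegExpExtractAll as documented
    # in regexpExpressions.scala (branch-3.5, linked above).
    results = []
    for m in re.finditer(pattern, s):
        if idx > m.re.groups:
            # out-of-range group index: yield "" rather than an error
            results.append("")
        else:
            results.append(m.group(idx) or "")
    return results

# default idx=1 extracts the first capture group of every match
print(regexp_extract_all("100-200, 300-400", r"(\d+)-(\d+)"))  # ['100', '300']
```

With the default of 1, a pattern that has no capture groups yields empty strings unless the caller explicitly passes `idx=0` for the entire match.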
```rust
fn extract_group(text: &str, regex: &Regex, idx: usize) -> Result<String> {
    match regex.captures(text) {
        Some(caps) => {
            // Spark behavior: throw error if group index is out of bounds
```
Spark regexp_extract returns an empty string when the specified group index is out of range for a match, rather than erroring; throwing here will raise at runtime for inputs like idx=3 in your tests. Consider returning an empty string in this case to match Spark semantics.
value:useful; category:bug; feedback:The Augment AI reviewer is correct. Apache Spark returns an empty string when there is no such group by index - https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L809-L821. Prevents different behavior than Spark's.
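The Spark semantics referenced here (empty string for no match or an out-of-range group index) can be modeled in Python with the stdlib `re` module; this is a hedged sketch of the expected behavior, not the Comet implementation:

```python
import re

def regexp_extract(s, pattern, idx=1):
    # Mirrors Spark: return "" when the regex does not match or the
    # group index is out of range, instead of raising an error.
    m = re.search(pattern, s)
    if m is None or idx > m.re.groups:
        return ""
    return m.group(idx) or ""

print(regexp_extract("abc123def", r"(\d+)"))      # '123'
print(regexp_extract("abc123def", r"(\d+)", 3))   # '' (group 3 does not exist)
print(regexp_extract("abcdef", r"(\d+)"))         # '' (no match)
```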
Code Review for PR #31: RegExpExtract/RegExpExtractAll Implementation

Thank you for this implementation! I've conducted a comprehensive review of the code. Overall, the implementation is solid and well-structured, with good test coverage and proper Spark compatibility. Below are my findings and recommendations.

Strengths

1. Code Quality
2. Spark Compatibility
3. Test Coverage

Issues and Recommendations

Critical Issues

1. Missing Dictionary Encoding Support

Location: native/spark-expr/src/string_funcs/regexp_extract.rs:266

The implementation does NOT handle dictionary-encoded string arrays, which are common in Parquet files. Looking at string_space.rs:98-111, dictionary support is a standard pattern in this codebase.

Impact: Performance degradation and potential failures when processing dictionary-encoded columns.

Recommendation: Add dictionary support similar to string_space.rs.

2. Incorrect Default Group Index for regexp_extract_all

Location: native/spark-expr/src/string_funcs/regexp_extract.rs:183

The comment says "Using 1 here to align with Spark's default behavior", but this needs verification. Spark's actual default behavior should be confirmed and documented.

Medium Priority Issues

3. Missing Large String Support

Location: both regexp_extract_array and regexp_extract_all_array

The code only handles `GenericStringArray<i32>` (normal strings), not `GenericStringArray<i64>` (large strings). Consider adding support or documenting this limitation.

4. Scala Code: Pattern Type Check Issue

Location: spark/src/main/scala/org/apache/comet/serde/strings.scala:296

Since RegExp.isSupportedPattern always returns false (RegExp.scala:29), this will ALWAYS require COMET_REGEXP_ALLOW_INCOMPATIBLE=true for any pattern.

Recommendation: Either implement proper pattern validation or add a TODO comment explaining that this is future work.

5. Test Gap: Out-of-Bounds Index Behavior

Location: spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:415

The test comment says "should return empty string", but the Rust code throws an error for out-of-bounds indices. Verify that this matches Spark's behavior.

Security Analysis

No major security concerns.
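The dictionary-encoding concern can be illustrated with a small Python model: a dictionary-encoded column stores a short vector of distinct values plus per-row indices, so the regex only needs to run once per distinct value and the results are then gathered by index. All names here are illustrative, not Comet's actual API:

```python
import re

def regexp_extract_dict(values, indices, pattern, idx=1):
    # Evaluate the regex once per distinct dictionary value...
    def extract(v):
        m = re.search(pattern, v)
        return "" if m is None or idx > m.re.groups else (m.group(idx) or "")
    extracted = [extract(v) for v in values]
    # ...then gather results by the per-row dictionary indices.
    return [extracted[i] for i in indices]

values = ["item-17", "item-3", "plain"]   # distinct dictionary values
indices = [0, 1, 1, 2, 0]                 # one entry per row
print(regexp_extract_dict(values, indices, r"item-(\d+)"))
# ['17', '3', '3', '', '17']
```

This is why dictionary support is both a correctness and a performance concern: without it, a dictionary-encoded column must be decoded up front and the regex re-run for every row.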
Performance Considerations

Good:

Potential Improvements:

Documentation

Missing:

Test Coverage Assessment

Covered:

Missing:

Summary and Priority Actions

Before Merge (Critical):

Consider for Follow-up:

Overall Assessment: 7.5/10

The code is well-written with good fundamentals, but the missing dictionary encoding support is a significant gap for a production feature. Once addressed, this would be a 9/10 implementation. Great work on the comprehensive tests and Spark compatibility!
```rust
    }
} else {
    // Using 1 here to align with Spark's default behavior.
    1
```
Bug: Incorrect default idx value for regexp_extract_all
The default `idx` value for `regexp_extract_all` is set to 1, but it should be 0 to match Spark's behavior. The comment in strings.scala line 348 explicitly states the default should be 0 (group 0, the entire match). Additionally, the test case at line 502 calls `regexp_extract_all` with pattern `\d+` (no capture groups), which would fail with `idx=1` since group 1 doesn't exist, but works with `idx=0`.
value:annoying; category:bug; feedback:The Bugbot AI reviewer is not correct. The default is 1 as documented at https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L885 and https://docs.databricks.com/aws/en/sql/language-manual/functions/regexp_extract_all#arguments
Actionable comments posted: 1
🧹 Nitpick comments (4)
native/spark-expr/src/string_funcs/regexp_extract.rs (3)
68-116: Clarify NULL pattern and negative `idx` handling in `invoke_with_args`.

- `pattern` is only accepted as `ColumnarValue::Scalar(ScalarValue::Utf8(Some(_)))`; a typed NULL string literal (likely encoded as `ScalarValue::Utf8(None)`) will hit the `_` arm and raise "regexp_extract pattern must be a string literal". For queries like `regexp_extract(col, NULL, idx)`, this would surface as an execution error rather than following Spark's NULL semantics. Please confirm how Spark behaves here and either (a) make the Scala serde treat NULL patterns as unsupported (so we fall back to Spark), or (b) teach the Rust side to propagate a NULL pattern to a NULL result (building an all-NULL array for column input).
- `idx` is read as `Int32` and immediately cast to `usize`. Negative values will wrap to huge positive indices (e.g., `-1` → `4294967295`), which then fail the bounds check in `extract_group`/`extract_all_groups` with a confusing message. It would be cleaner to reject `i < 0` explicitly before casting and return a more direct error about negative group indices.
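The wrap-around described above can be demonstrated numerically. Python's integers are unbounded, so the two's-complement reinterpretation that Rust's `as` cast performs on a 32-bit value is simulated here with a mask (the helper names are illustrative):

```python
def as_u32(i):
    # Simulates Rust's `i as u32` two's-complement reinterpretation
    return i & 0xFFFFFFFF

print(as_u32(-1))  # 4294967295: far past any plausible capture-group count

def checked_group_index(i):
    # Safer alternative: reject negatives before casting, so the user
    # sees a direct error instead of a confusing bounds-check message.
    if i < 0:
        raise ValueError(f"regexp group index must be non-negative, got {i}")
    return i
```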
97-101: Classify invalid regex patterns as user errors, not internal errors.

`Regex::new(&pattern_str)` currently maps failures to `internal_datafusion_err!("Invalid regex pattern ...")`. Since an invalid pattern is user input, not a Comet bug, consider switching this to `exec_err!` so it surfaces as a normal execution error rather than an "internal" one:

```diff
- let regex = Regex::new(&pattern_str).map_err(|e| {
-     internal_datafusion_err!("Invalid regex pattern '{}': {}", pattern_str, e)
- })?;
+ let regex = Regex::new(&pattern_str).map_err(|e| {
+     exec_err!("Invalid regex pattern '{}': {}", pattern_str, e)
+ })?;
```

The same applies to `SparkRegExpExtractAll`.
291-303: Consider explicit validation for negative group indices in `extract_all_groups`.

`idx` arrives here as a `usize` derived from an `i32` via `as`, so negative Java/Spark indices get wrapped into large positive values and then trip the `idx >= caps.len()` check with a misleading message. For better diagnostics and closer alignment with Spark, consider validating the original `i32` before casting and returning a clear error for negative `idx` (and potentially sharing that validation between `extract_group` and `extract_all_groups`).

spark/src/main/scala/org/apache/comet/serde/strings.scala (1)
24-25: Regex serde handlers look good; please clarify NULL-pattern and default-`idx` semantics.

- `CometRegExpExtract`/`CometRegExpExtractAll` correctly restrict patterns and `idx` to literals and reuse `RegExp.isSupportedPattern` plus `COMET_REGEXP_ALLOW_INCOMPATIBLE`. However, for `expr.regexp match { case Literal(pattern, StringType) => ... }`, a literal `NULL` will bind `pattern` as `null`, and `pattern.toString` becomes `"null"`. Combined with the Rust side currently rejecting NULL patterns, this likely makes `regexp_extract(col, NULL, idx)`/`regexp_extract_all(col, NULL, idx)` behave differently from Spark. It may be safer either to treat a NULL literal pattern as `Unsupported` here (falling back to Spark) or to explicitly propagate a NULL pattern down to native once you decide on the intended semantics.
- In `CometRegExpExtractAll.getSupportLevel`, the comment says "idx will default to 0 (group 0, entire match) if not specified", but the Rust UDF's 2-arg fallback uses a default of `1`, and, from this serde layer, `expr.idx` is always supplied explicitly anyway. To avoid confusion for future maintainers (or non-Spark frontends that might call the native UDF with 2 args), consider updating the comment and/or aligning the Rust default with Spark's actual default after confirming it.

Also applies to: 290-368
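The NULL-propagation alternative discussed here corresponds to Spark's usual null semantics, sketched below in Python with `None` standing in for SQL NULL. This is an illustrative model only; the real change would live in the Rust UDF and the Scala serde:

```python
import re

def regexp_extract_nullsafe(s, pattern, idx=1):
    # Spark-style null propagation: any NULL argument yields NULL
    if s is None or pattern is None or idx is None:
        return None
    m = re.search(pattern, s)
    if m is None or idx > m.re.groups:
        return ""
    return m.group(idx) or ""

print(regexp_extract_nullsafe("abc123", None))      # None (NULL in, NULL out)
print(regexp_extract_nullsafe(None, r"(\d+)"))      # None
print(regexp_extract_nullsafe("abc123", r"(\d+)"))  # '123'
```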
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- docs/source/user-guide/latest/configs.md (1 hunks)
- native/spark-expr/src/comet_scalar_funcs.rs (2 hunks)
- native/spark-expr/src/string_funcs/mod.rs (1 hunks)
- native/spark-expr/src/string_funcs/regexp_extract.rs (1 hunks)
- spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (1 hunks)
- spark/src/main/scala/org/apache/comet/serde/strings.scala (2 hunks)
- spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-11T15:01:48.203Z
Learnt from: martin-augment
Repo: martin-augment/datafusion-comet PR: 17
File: docs/source/contributor-guide/adding_a_new_operator.md:349-354
Timestamp: 2025-11-11T15:01:48.203Z
Learning: For Apache DataFusion Comet debugging documentation, the correct configuration keys are `spark.comet.explain.format=verbose` for verbose explain plans and `spark.comet.logFallbackReasons.enabled=true` for logging fallback reasons (not `spark.comet.explain.verbose` or `spark.comet.logFallbackReasons` without `.enabled`).
Applied to files:
docs/source/user-guide/latest/configs.md
🧬 Code graph analysis (3)
native/spark-expr/src/comet_scalar_funcs.rs (1)
native/spark-expr/src/string_funcs/regexp_extract.rs (2)
- new (42-47)
- new (137-142)
native/spark-expr/src/string_funcs/regexp_extract.rs (1)
native/spark-expr/src/comet_scalar_funcs.rs (6)
- new (255-267)
- as_any (271-273)
- name (275-277)
- signature (279-281)
- return_type (283-285)
- invoke_with_args (287-289)
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (1)
spark/src/main/scala/org/apache/comet/serde/strings.scala (2)
- CometRegExpExtract (290-327)
- CometRegExpExtractAll (329-368)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (6)
native/spark-expr/src/string_funcs/regexp_extract.rs (1)
345-451: Unit tests for helpers and array behavior are solid.

The tests for `extract_group`, `extract_all_groups`, `regexp_extract_all_array`, and `regexp_extract_array` cover basic matches, no-match cases, NULL propagation, empty arrays vs NULL lists, and group index errors. Once the out-of-bounds semantics are finalized, these tests give good coverage of the native helpers.

native/spark-expr/src/string_funcs/mod.rs (1)
18-23: String funcs module wiring for regexp UDFs looks correct.

Adding `mod regexp_extract;` and re-exporting `SparkRegExpExtract`/`SparkRegExpExtractAll` cleanly exposes the new UDFs to the crate, matching how other string functions are wired.

docs/source/user-guide/latest/configs.md (1)
294-295: Config keys are consistent with expression and serde naming.

`spark.comet.expression.RegExpExtract.enabled` and `.RegExpExtractAll.enabled` match the handler names (`CometRegExpExtract`/`CometRegExpExtractAll`) and the `spark.comet.expression.<Expr>.enabled` pattern used elsewhere, so they should integrate cleanly with `CometConf.isExprEnabled`.

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala (1)
143-158: Serde map correctly wires regexp expressions to their handlers.

Adding

```scala
classOf[RegExpExtract] -> CometRegExpExtract,
classOf[RegExpExtractAll] -> CometRegExpExtractAll,
```

to `stringExpressions` aligns these Catalyst expressions with the new serde handlers and ensures they participate in `exprSerdeMap` alongside other string functions.

native/spark-expr/src/comet_scalar_funcs.rs (1)
18-27: New regexp UDFs are registered consistently with existing scalar functions.

Bringing `SparkRegExpExtract`/`SparkRegExpExtractAll` into the crate import and adding them to `all_scalar_functions()` ensures they're registered via `register_all_comet_functions` and discoverable by name (`"regexp_extract"`, `"regexp_extract_all"`) through the `FunctionRegistry`. No additional cases are needed in `create_comet_physical_fun` since it already falls back to `registry.udf(fun_name)`.

Also applies to: 196-204

spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala (1)
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala (1)
394-506: Regex extract tests provide good coverage; ensure they stay aligned with native semantics.The new tests exercise:
- Basic
regexp_extractcases, including NULL input, no match, non-existent group, empty pattern, and NULL pattern.- Practical edge patterns (email, phone, price).
regexp_extract_allwith group 0 vs 1, no matches, empty/NULL patterns, multiple matches, and dictionary-encoded Parquet paths.This is a solid coverage matrix for the new expressions. Once the native side resolves the group-index and NULL-pattern semantics (see comments in
regexp_extract.rs), these tests should robustly guard against regressions.
```rust
fn extract_group(text: &str, regex: &Regex, idx: usize) -> Result<String> {
    match regex.captures(text) {
        Some(caps) => {
            // Spark behavior: throw error if group index is out of bounds
            if idx >= caps.len() {
                return exec_err!(
                    "Regex group count is {}, but the specified group index is {}",
                    caps.len(),
                    idx
                );
            }
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, let's check the structure and find the relevant files
find . -name "regexp_extract.rs" -o -name "CometStringExpressionSuite.scala" | head -20
```

Repository: martin-augment/datafusion-comet
Length of output: 202

🏁 Script executed:

```shell
# Let's examine the regexp_extract.rs file to understand the full context
wc -l native/spark-expr/src/string_funcs/regexp_extract.rs
```

Repository: martin-augment/datafusion-comet
Length of output: 131

🏁 Script executed:

```shell
# Check lines 246-256 and surrounding context, and also 349-358
sed -n '240,260p' native/spark-expr/src/string_funcs/regexp_extract.rs
```

Repository: martin-augment/datafusion-comet
Length of output: 709

🏁 Script executed:

```shell
# Check lines 349-358
sed -n '345,365p' native/spark-expr/src/string_funcs/regexp_extract.rs
```

Repository: martin-augment/datafusion-comet
Length of output: 952

🏁 Script executed:

```shell
# Find and examine the Scala test file
find . -name "CometStringExpressionSuite.scala" | head -5
```

Repository: martin-augment/datafusion-comet
Length of output: 147

🏁 Script executed:

```shell
# Check the Scala test file around line 415
sed -n '405,425p' ./spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala
```

Repository: martin-augment/datafusion-comet
Length of output: 972

🏁 Script executed:

```shell
# Let's search for regexp_extract tests in the Scala file
rg -n "regexp_extract" ./spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala -A 5 -B 2
```

Repository: martin-augment/datafusion-comet
Length of output: 4071

🌐 Web query:

Spark regexp_extract behavior when group index out of bounds exceeds number of capturing groups

💡 Result:

If the specified group index is out of range (greater than the number of capturing groups), regexp_extract returns an empty string (the same as when the regex or that group didn't match). [1][2]

Sources:
- PySpark docs: pyspark.sql.functions.regexp_extract — returns an empty string if the regex didn't match or the specified group didn't match. [1]
- Spark SQL built-in functions (regexp_extract) — the group index is the Java Matcher group() index; non-matching groups yield an empty string per function behavior. [2]

🏁 Script executed:

```shell
# Let's verify the exact behavior by checking if there is any special handling in the Rust code
sed -n '240,300p' native/spark-expr/src/string_funcs/regexp_extract.rs
```

Repository: martin-augment/datafusion-comet
Length of output: 1967

🏁 Script executed:

```shell
# Also check if there are any integration tests that would catch this
rg -n "extract_group" native/spark-expr/src/string_funcs/regexp_extract.rs | head -20
```

Repository: martin-augment/datafusion-comet
Length of output: 769
Fix `extract_group` and `extract_all_groups` to return an empty string for out-of-bounds group indices instead of throwing an error.

Apache Spark's `regexp_extract` returns an empty string when the group index exceeds the number of capturing groups, rather than raising an error. The Rust implementation incorrectly throws an error at lines 246-256 and lines 280+, and the Rust tests incorrectly assert the error behavior. The Scala test at line 415 (`checkSparkAnswerAndOperator`) expects empty-string output and will fail at runtime because it compares Comet's result against actual Spark behavior.

Replace the `if idx >= caps.len()` error checks with fallthrough logic that returns `Ok(String::new())` when the group is out of bounds.
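The suggested fix can be modeled in Python with the stdlib `re` module (a sketch of the intended semantics, not the Comet code); the key change is falling through to an empty string for each match whose requested group is out of range:

```python
import re

def extract_all_groups(text, pattern, idx):
    # Fall through to "" for out-of-range group indices instead of
    # returning an error, matching Spark's regexp_extract_all.
    out = []
    for m in re.finditer(pattern, text):
        out.append("" if idx > m.re.groups else (m.group(idx) or ""))
    return out

print(extract_all_groups("a1 b2", r"(\d)", 1))  # ['1', '2']
print(extract_all_groups("a1 b2", r"(\d)", 3))  # ['', ''] instead of an error
```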
🤖 Prompt for AI Agents
In native/spark-expr/src/string_funcs/regexp_extract.rs around lines 246 to 256
(and similarly for the later extract_all_groups block around line ~280), the
code currently returns an exec_err! when the requested capture group index is >=
caps.len(); change this to return Ok(String::new()) so out-of-bounds group
indices produce an empty string per Spark behavior. Update extract_all_groups'
corresponding bounds checks to fall through or return empty strings for
out-of-range indices instead of errors, and adjust the Rust test assertions that
expect an error to expect an empty string result to match the Scala/Spark test.
There was a problem hiding this comment.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct. Apache Spark returns an empty string when there is no such group by index - https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L809-L821. Prevents different behavior than Spark's.
value:useful; category:bug; feedback:The Claude AI reviewer is correct. Apache Spark returns an empty string when there is no such group by index - https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L809-L821. Prevents different behavior than Spark's.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is partially correct. Apache Spark returns an empty string when there is no such group by index - https://github.com/apache/spark/blob/branch-3.5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala#L809-L821. So, the large usize value will again not match any group (just like a negative value in Spark), but the difference is that Comet will return an error instead of an empty string.
value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct. Invalid user input should be reported as an execution error, not as an internal error (i.e. as a bug). Prevents reporting issues to the Apache DataFusion Comet repository for problems caused by wrong regex patterns.