fix: route file-not-found errors through SparkError JSON path #47
martin-augment wants to merge 10 commits into main
Conversation
Detect file-not-found errors from DataFusion's object store on the native side and convert them to SparkError::FileNotFound, which is serialized as JSON via CometQueryExecutionException. The shim layer then creates a proper SparkFileNotFoundException using QueryExecutionErrors.readCurrentFileNotFoundError(), producing the exact exception type that Spark tests expect. Previously, file-not-found errors arrived as CometNativeException and were pattern-matched in CometExecIterator to create a SparkException with a plain FileNotFoundException cause. Tests that cast the cause to SparkFileNotFoundException (which is private[spark]) would fail. Closes apache#3314
Remove the assume() skip for native_datafusion in FileBasedDataSourceSuite and the IgnoreCometNativeDataFusion tag in SimpleSQLViewSuite, since file-not-found errors now produce the correct SparkFileNotFoundException type.
Remove CometConf import from FileBasedDataSourceSuite and IgnoreCometNativeDataFusion import from SQLViewSuite that became unused after removing the test skips.
Extract file path from native error message and format it as "File <path> does not exist" to match the Hadoop FileNotFoundException message format that Spark tests expect.
readCurrentFileNotFoundError was removed in Spark 4.0. Construct SparkFileNotFoundException directly instead, which is accessible from the shim package.
….5.8 diff
The SPARK-25207 test expects a specific error message for duplicate fields in case-insensitive mode, but native DataFusion produces a different schema error. Update the test to accept either message format.
…lds in 3.5.8 diff"
This reverts commit 03328cd.
Add IgnoreCometNativeDataFusion tag to SPARK-25207 test instead of trying to accept both error messages. A separate PR will fix the underlying issue.
Walkthrough
This pull request introduces Comet-related testing infrastructure and error handling enhancements. On the testing side, it adds IgnoreComet feature tags and test utilities to selectively skip tests when Comet is enabled, along with configuration helpers to manage Comet-specific test execution. On the error handling side, it adds FileNotFound error support across the native Rust layer and Spark shims. The native layer detects file-not-found scenarios by inspecting error messages and constructs appropriate SparkError variants. Version-specific Spark shims (3.4, 3.5, and 4.0) convert FileNotFound errors to platform-appropriate exception types using regex-based path extraction and error wrapping.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the error handling mechanism within the Comet native execution engine by specifically addressing file-not-found errors. It introduces a dedicated error type at the native level and updates the Spark integration layers to ensure these errors are consistently and correctly translated into Spark's native file-not-found exceptions. This change streamlines error propagation and improves the diagnostic capabilities for users encountering missing files during Comet-accelerated operations.
Code Review
This pull request refactors the handling of 'file not found' errors within the Comet native execution engine. Previously, the Scala layer would parse generic native error messages to identify and re-throw file not found exceptions. Now, the native Rust code explicitly detects these errors and propagates them as a new SparkError::FileNotFound type. Platform-specific Scala shims then translate this native error into the appropriate Spark FileNotFoundException or SparkFileNotFoundException, providing more robust and consistent error reporting. Additionally, test suites (FileBasedDataSourceSuite.scala) are updated to include Comet-specific execution plans in their analysis, and a previous assume statement related to native scan error handling is removed, reflecting the new error propagation mechanism.
🤖 Augment PR Summary
Summary: This PR routes native file-not-found failures through Comet's structured SparkError JSON pathway so Spark sees the expected file-not-found exception type.
Changes:
Technical Notes: The shim extracts the missing file path from the native error message via regex and formats it to match the Hadoop-style `File <path> does not exist` message.
```rust
// Not a SparkError, use generic exception
// Check for file-not-found errors from object store
let error_msg = e.to_string();
if error_msg.contains("not found")
```
This FileNotFound detection is substring-based and doesn’t ensure the message matches the Object at location ... not found shape that the JVM shim later parses, so a format change could yield a FileNotFound with an unextractable/garbled path. Consider tightening the signal or carrying a structured path so the shim doesn’t depend on message parsing.
Severity: medium
Other Locations
native/core/src/errors.rs:441
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The detection of the error type is fragile and could match a totally unrelated error. It looks like a best-effort detection that in the worst case would lead to a wrongly formatted "full" error. It will be improved over time as users provide examples of errors that fail to be detected.
```scala
val path = "Object at location (.+?) not found".r
  .findFirstMatchIn(msg)
  .map(_.group(1))
  .getOrElse(msg)
```
If the native message doesn't match the regex, `getOrElse(msg)` treats the entire native error text as a file path, which can lead to confusing `File <full message> does not exist` output. Since this behavior is format-dependent, a small regression test for `errorType=FileNotFound` conversion would help catch future message changes.
Severity: low
Other Locations
spark/src/main/spark-3.4/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala:255
spark/src/main/spark-4.0/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala:262
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The detection of the error type is fragile and could match a totally unrelated error. It looks like a best-effort detection that in the worst case would lead to a wrongly formatted "full" error. It will be improved over time as users provide examples of errors that fail to be detected.
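For illustration, the regex extraction and its `getOrElse(msg)` fallback discussed above can be modeled with plain standard-library string search. This Rust sketch is not repository code; the function name is hypothetical, and the actual shim is Scala:

```rust
// Sketch of the shim's path extraction, using std string search in place
// of the Scala regex "Object at location (.+?) not found".
fn extract_missing_path(msg: &str) -> Option<&str> {
    const PREFIX: &str = "Object at location ";
    const SUFFIX: &str = " not found";
    let start = msg.find(PREFIX)? + PREFIX.len();
    let end = start + msg[start..].find(SUFFIX)?;
    Some(&msg[start..end])
}

fn main() {
    assert_eq!(
        extract_missing_path("Object at location file:/tmp/a.parquet not found"),
        Some("file:/tmp/a.parquet")
    );
    // A non-matching message yields None, which is where the shim's
    // getOrElse(msg) falls back to the whole message as the "path".
    assert_eq!(extract_missing_path("schema mismatch"), None);
    println!("ok");
}
```

A regression test along these lines, asserting both the match and the non-match branch, would catch future native message-format changes.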
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```rust
        ),
        _ => env.throw_new(exception.class, exception.msg),
    }
}
```
Duplicated file-not-found detection and fallback logic
Low Severity
The file-not-found detection logic (contains("not found") && contains("No such file or directory")) along with the SparkError::FileNotFound construction and the entire generic-exception fallback block are copy-pasted between the DataFusionError::External arm and the catch-all _ arm. The only difference is the source of error_msg (e.to_string() vs error.to_string()). If the detection heuristic or fallback behavior ever needs updating, it must be changed in both places, risking inconsistency.
Additional Locations (1)
value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! The detection of the error type is fragile and could match a totally unrelated error. It looks like a best-effort detection that in the worst case would lead to a wrongly formatted "full" error. It will be improved over time as users provide examples of errors that fail to be detected.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
dev/diffs/3.5.8.diff (2)
74-79: ⚠️ Potential issue | 🔴 Critical: `extensionClassNames` is created but never used.
The variable `extensionClassNames` combines user-configured extensions with Comet extensions, but the subsequent `foreach` loop on line 79 iterates over `extensionConfClassNames` (without Comet). This means the Comet extension will never be applied.
🐛 Proposed fix
```diff
 val extensionConfClassNames = sparkContext.getConf.get(StaticSQLConf.SPARK_SESSION_EXTENSIONS)
   .getOrElse(Seq.empty)
 val extensionClassNames = extensionConfClassNames ++ loadCometExtension(sparkContext)
-extensionConfClassNames.foreach { extensionConfClassName =>
+extensionClassNames.foreach { extensionConfClassName =>
```
Verify each finding against the current code and only fix it if needed. In `@dev/diffs/3.5.8.diff` around lines 74 - 79, the code builds extensionClassNames by combining extensionConfClassNames with loadCometExtension(sparkContext) but then mistakenly iterates over extensionConfClassNames, so Comet extensions are never applied; change the foreach to iterate over extensionClassNames (the combined Seq) where extensionConfClassNames.foreach { ... } currently appears so that loadCometExtension(sparkContext) results are included when registering extensions in the SparkSessionExtensions creation path.
87-94: ⚠️ Potential issue | 🟡 Minor: `toBoolean` can throw on invalid input.
If `ENABLE_COMET` is set to a non-boolean value like `"yes"` or `"1"`, `v.toBoolean` throws `IllegalArgumentException`. Consider using a safer conversion or explicitly documenting expected values.
🛡️ Proposed defensive fix
```diff
 def isCometEnabled: Boolean = {
   val v = System.getenv("ENABLE_COMET")
-  v == null || v.toBoolean
+  v == null || v.equalsIgnoreCase("true")
 }
```
Verify each finding against the current code and only fix it if needed. In `@dev/diffs/3.5.8.diff` around lines 87 - 94, The isCometEnabled method currently calls v.toBoolean which throws IllegalArgumentException on inputs like "yes" or "1"; change the conversion to a safe parse (e.g. use java.lang.Boolean.parseBoolean(v) which returns false for non-true strings) or implement an explicit whitelist (normalize v.toLowerCase and accept "true","1","yes" as true) so invalid values don't throw; update the implementation in isCometEnabled accordingly while preserving the existing null-handling (null => enabled).
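As a language-neutral sketch of the safer parse: the semantics assumed here are that an unset `ENABLE_COMET` means enabled and only a case-insensitive "true" enables otherwise; the Rust function name is illustrative, not repository code:

```rust
// Hypothetical Rust port of the proposed isCometEnabled logic.
fn is_comet_enabled(raw: Option<&str>) -> bool {
    match raw {
        // Unset => enabled, preserving the existing null-handling.
        None => true,
        // "yes" / "1" now simply disable instead of throwing.
        Some(v) => v.eq_ignore_ascii_case("true"),
    }
}

fn main() {
    assert!(is_comet_enabled(None));
    assert!(is_comet_enabled(Some("TRUE")));
    assert!(!is_comet_enabled(Some("yes")));
    assert!(!is_comet_enabled(Some("1")));
    println!("ok");
}
```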
🧹 Nitpick comments (1)
native/spark-expr/src/error.rs (1)
169-170: Add a dedicated unit test for `FileNotFound` JSON/mapping parity.
The new variant is wired in multiple match sites; a focused test will prevent accidental drift between `errorType`, `params.message`, `errorClass`, and `exception_class`.
✅ Suggested test
```diff
 #[cfg(test)]
 mod tests {
     use super::*;
@@
+    #[test]
+    fn test_file_not_found_json_and_mappings() {
+        let error = SparkError::FileNotFound {
+            message: "Object at location file:/tmp/missing.parquet not found".to_string(),
+        };
+        let json = error.to_json();
+        let parsed: serde_json::Value = serde_json::from_str(&json).unwrap();
+
+        assert_eq!(parsed["errorType"], "FileNotFound");
+        assert_eq!(parsed["errorClass"], "_LEGACY_ERROR_TEMP_2055");
+        assert_eq!(
+            parsed["params"]["message"],
+            "Object at location file:/tmp/missing.parquet not found"
+        );
+        assert_eq!(error.error_class(), Some("_LEGACY_ERROR_TEMP_2055"));
+        assert_eq!(error.exception_class(), "org/apache/spark/SparkException");
+    }
```
Also applies to: 242-243, 428-432, 499-501, 574-576
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@native/spark-expr/src/error.rs` around lines 169 - 170, Add a focused unit test that verifies JSON/mapping parity for the FileNotFound variant: construct the FileNotFound { message: String } variant, serialize it to the JSON representation used by the code paths (asserting errorType, params.message, errorClass, and exception_class fields are present and have the expected values), then deserialize or inspect the JSON to ensure values round-trip and match exactly; place the test next to other error-mapping tests and reference the FileNotFound variant name and the JSON keys errorType, params.message, errorClass, and exception_class so future changes to match arms will fail the test if they drift.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@native/core/src/errors.rs`:
- Around line 416-423: The current not-found detection logic in the error
handling branch is too strict (it requires both "not found" and "No such file or
directory") and will miss backend-specific messages; refactor by adding a small
helper function (e.g. is_not_found_error) that centralizes broader matching
(check for any of several substrings like "not found", "No such file or
directory", "NotFound", "404", backend-specific phrases) and use it in the
existing branches that construct SparkError::FileNotFound and call
throw_spark_error_as_json; update both places noted (around the
SparkError::FileNotFound creation and the similar block at 439-446) to call the
helper instead of the dual-substring check so all object-store backends are
covered.
In
`@spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala`:
- Around line 244-255: The FileNotFound case in ShimSparkErrorConverter.scala
currently blindly wraps the native message, causing duplicates like "File File
<path> does not exist does not exist"; change the extraction logic in the case
"FileNotFound" block so that you first check if msg already matches the
Hadoop-style pattern "File (.+) does not exist" and, if so, extract only the
path group, otherwise fall back to the existing "Object at location (.+?) not
found" capture or use msg as path; then pass new FileNotFoundException(s"File
$path does not exist") into QueryExecutionErrors.readCurrentFileNotFoundError to
ensure you never produce "File File ..." duplicates (refer to variables msg,
path and the call to QueryExecutionErrors.readCurrentFileNotFoundError).
---
Outside diff comments:
In `@dev/diffs/3.5.8.diff`:
- Around line 74-79: The code builds extensionClassNames by combining
extensionConfClassNames with loadCometExtension(sparkContext) but then
mistakenly iterates over extensionConfClassNames, so Comet extensions are never
applied; change the foreach to iterate over extensionClassNames (the combined
Seq) where extensionConfClassNames.foreach { ... } currently appears so that
loadCometExtension(sparkContext) results are included when registering
extensions in the SparkSessionExtensions creation path.
- Around line 87-94: The isCometEnabled method currently calls v.toBoolean which
throws IllegalArgumentException on inputs like "yes" or "1"; change the
conversion to a safe parse (e.g. use java.lang.Boolean.parseBoolean(v) which
returns false for non-true strings) or implement an explicit whitelist
(normalize v.toLowerCase and accept "true","1","yes" as true) so invalid values
don't throw; update the implementation in isCometEnabled accordingly while
preserving the existing null-handling (null => enabled).
---
Nitpick comments:
In `@native/spark-expr/src/error.rs`:
- Around line 169-170: Add a focused unit test that verifies JSON/mapping parity
for the FileNotFound variant: construct the FileNotFound { message: String }
variant, serialize it to the JSON representation used by the code paths
(asserting errorType, params.message, errorClass, and exception_class fields are
present and have the expected values), then deserialize or inspect the JSON to
ensure values round-trip and match exactly; place the test next to other
error-mapping tests and reference the FileNotFound variant name and the JSON
keys errorType, params.message, errorClass, and exception_class so future
changes to match arms will fail the test if they drift.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ff71cdd0-5290-46f3-8b19-23409e9dc8ab
📒 Files selected for processing (7)
dev/diffs/3.5.8.diff
native/core/src/errors.rs
native/spark-expr/src/error.rs
spark/src/main/scala/org/apache/comet/CometExecIterator.scala
spark/src/main/spark-3.4/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala
spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala
spark/src/main/spark-4.0/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala
```rust
// Check for file-not-found errors from object store
let error_msg = e.to_string();
if error_msg.contains("not found")
    && error_msg.contains("No such file or directory")
{
    let spark_error = SparkError::FileNotFound { message: error_msg };
    throw_spark_error_as_json(env, &spark_error)
} else {
```
File-not-found detection is too narrow for multi-backend object stores.
Requiring both substrings ("not found" and "No such file or directory") is fragile and can miss valid not-found errors (then falling back to generic exception flow). Centralize and broaden matching in one helper.
💡 Proposed fix
```diff
+fn is_file_not_found_message(msg: &str) -> bool {
+    let m = msg.to_ascii_lowercase();
+    (m.contains("object at location") && m.contains("not found"))
+        || m.contains("no such file or directory")
+        || m.contains("no such key")
+        || m.contains("status code: 404")
+}
+
 fn throw_exception(env: &mut JNIEnv, error: &CometError, backtrace: Option<String>) {
@@
-    if error_msg.contains("not found")
-        && error_msg.contains("No such file or directory")
-    {
+    if is_file_not_found_message(&error_msg) {
         let spark_error = SparkError::FileNotFound { message: error_msg };
         throw_spark_error_as_json(env, &spark_error)
@@
-    if error_msg.contains("not found")
-        && error_msg.contains("No such file or directory")
-    {
+    if is_file_not_found_message(&error_msg) {
         let spark_error = SparkError::FileNotFound { message: error_msg };
         throw_spark_error_as_json(env, &spark_error)
```
Also applies to: 439-446
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@native/core/src/errors.rs` around lines 416 - 423, The current not-found
detection logic in the error handling branch is too strict (it requires both
"not found" and "No such file or directory") and will miss backend-specific
messages; refactor by adding a small helper function (e.g. is_not_found_error)
that centralizes broader matching (check for any of several substrings like "not
found", "No such file or directory", "NotFound", "404", backend-specific
phrases) and use it in the existing branches that construct
SparkError::FileNotFound and call throw_spark_error_as_json; update both places
noted (around the SparkError::FileNotFound creation and the similar block at
439-446) to call the helper instead of the dual-substring check so all
object-store backends are covered.
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The detection of the error type is fragile and could match a totally unrelated error. It looks like a best-effort detection that in the worst case would lead to a wrongly formatted "full" error. It will be improved over time as users provide examples of errors that fail to be detected.
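If a broadened helper along the lines proposed in this thread is adopted, a table-driven check keeps the heuristic honest. The helper body below mirrors the proposal; the sample messages are assumptions, not captured DataFusion output:

```rust
// Broadened, case-insensitive file-not-found detection (sketch).
fn is_file_not_found_message(msg: &str) -> bool {
    let m = msg.to_ascii_lowercase();
    (m.contains("object at location") && m.contains("not found"))
        || m.contains("no such file or directory")
        || m.contains("no such key")
        || m.contains("status code: 404")
}

fn main() {
    // Messages that should be classified as file-not-found.
    let not_found = [
        "Object at location file:/tmp/x.parquet not found",
        "IO error: No such file or directory (os error 2)",
        "S3 GetObject failed: status code: 404",
    ];
    // Unrelated errors that must fall through to the generic path.
    let other = [
        "Arrow error: schema mismatch",
        "External error: column `x` cannot be resolved",
    ];
    assert!(not_found.iter().all(|m| is_file_not_found_message(m)));
    assert!(other.iter().all(|m| !is_file_not_found_message(m)));
    println!("ok");
}
```

As users report messages that escape detection, new markers can be added to the helper and to the negative/positive tables in one place.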
```scala
case "FileNotFound" =>
  val msg = params("message").toString
  // Extract file path from native error message and format like Hadoop's
  // FileNotFoundException: "File <path> does not exist"
  val path = "Object at location (.+?) not found".r
    .findFirstMatchIn(msg)
    .map(_.group(1))
    .getOrElse(msg)
  Some(
    QueryExecutionErrors.readCurrentFileNotFoundError(
      new FileNotFoundException(s"File $path does not exist")))
```
Avoid double-wrapping already formatted file-not-found messages.
If line 245 already contains `File <path> does not exist`, the fallback at line 251 produces `File File <path> does not exist does not exist`. Normalize the message before wrapping.
💡 Proposed fix
```diff
           case "FileNotFound" =>
             val msg = params("message").toString
-            // Extract file path from native error message and format like Hadoop's
-            // FileNotFoundException: "File <path> does not exist"
-            val path = "Object at location (.+?) not found".r
-              .findFirstMatchIn(msg)
-              .map(_.group(1))
-              .getOrElse(msg)
+            // Normalize to Hadoop-style: "File <path> does not exist"
+            val normalizedMsg =
+              if (msg.startsWith("File ") && msg.endsWith(" does not exist")) {
+                msg
+              } else {
+                val path = "Object at location (.+?) not found".r
+                  .findFirstMatchIn(msg)
+                  .map(_.group(1))
+                  .getOrElse(msg)
+                s"File $path does not exist"
+              }
             Some(
               QueryExecutionErrors.readCurrentFileNotFoundError(
-                new FileNotFoundException(s"File $path does not exist")))
+                new FileNotFoundException(normalizedMsg)))
```
📝 Committable suggestion
+ new FileNotFoundException(normalizedMsg)))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```scala
case "FileNotFound" =>
  val msg = params("message").toString
  // Normalize to Hadoop-style: "File <path> does not exist"
  val normalizedMsg =
    if (msg.startsWith("File ") && msg.endsWith(" does not exist")) {
      msg
    } else {
      val path = "Object at location (.+?) not found".r
        .findFirstMatchIn(msg)
        .map(_.group(1))
        .getOrElse(msg)
      s"File $path does not exist"
    }
  Some(
    QueryExecutionErrors.readCurrentFileNotFoundError(
      new FileNotFoundException(normalizedMsg)))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimSparkErrorConverter.scala`
around lines 244 - 255, The FileNotFound case in ShimSparkErrorConverter.scala
currently blindly wraps the native message, causing duplicates like "File File
<path> does not exist does not exist"; change the extraction logic in the case
"FileNotFound" block so that you first check if msg already matches the
Hadoop-style pattern "File (.+) does not exist" and, if so, extract only the
path group, otherwise fall back to the existing "Object at location (.+?) not
found" capture or use msg as path; then pass new FileNotFoundException(s"File
$path does not exist") into QueryExecutionErrors.readCurrentFileNotFoundError to
ensure you never produce "File File ..." duplicates (refer to variables msg,
path and the call to QueryExecutionErrors.readCurrentFileNotFoundError).
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The detection of the error type is fragile and could match a totally unrelated error. It looks like a best-effort detection that in the worst case would lead to a wrongly formatted "full" error. It will be improved over time as users provide examples of errors that fail to be detected.
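The key property of the normalization proposed in this thread is idempotence: re-wrapping an already-wrapped message must be a no-op, so `File File ... does not exist does not exist` cannot occur. That invariant can be checked in isolation; this Rust sketch uses hypothetical names, since the real shim code is Scala:

```rust
// Idempotent normalization to Hadoop-style "File <path> does not exist".
fn normalize_not_found(msg: &str) -> String {
    // Already normalized: pass through unchanged (guards against double-wrapping).
    if msg.starts_with("File ") && msg.ends_with(" does not exist") {
        return msg.to_string();
    }
    const PREFIX: &str = "Object at location ";
    const SUFFIX: &str = " not found";
    // Extract the path if the native shape matches, else fall back to the
    // whole message, mirroring the shim's getOrElse(msg).
    let path = msg
        .find(PREFIX)
        .and_then(|p| {
            let start = p + PREFIX.len();
            msg[start..].find(SUFFIX).map(|e| &msg[start..start + e])
        })
        .unwrap_or(msg);
    format!("File {path} does not exist")
}

fn main() {
    let once = normalize_not_found("Object at location /tmp/x.parquet not found");
    assert_eq!(once, "File /tmp/x.parquet does not exist");
    // Idempotent: normalizing again is a no-op.
    assert_eq!(normalize_not_found(&once), once);
    println!("ok");
}
```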
Code Review
Overall this is a well-structured fix. Moving file-not-found detection from the Scala layer into the native side gives Spark the exact exception type its tests expect.

native/core/src/errors.rs — detection heuristic
Fragility — the substring-based detection can match unrelated errors whose messages happen to contain the markers.
Duplicated logic — the same detection-and-fallback block appears in two match arms.
Coupling — the full raw message is stored in the error and later re-parsed by the shims, so a native message-format change breaks path extraction.

native/spark-expr/src/error.rs

ShimSparkErrorConverter.scala — all three shim versions
Regex compiled per call — the extraction pattern is rebuilt on every invocation; hoist it into a constant.

Code duplication — the Spark 3.4 and 3.5 shims are nearly identical.

CometExecIterator.scala
Removal of the old pattern-matching path for file-not-found errors is consistent with the new JSON route.

Test diff (dev/diffs/3.5.8.diff)
Removing the assume() skip is appropriate now that the correct SparkFileNotFoundException type is produced.

Summary
Main actionable suggestions:
value:useful; category:bug; feedback: The Claude AI reviewer is correct! The detection of the error type is fragile and could match a totally unrelated error. It looks like a best-effort detection that in the worst case would lead to a wrongly formatted "full" error. It will be improved over time as users provide examples of errors that fail to be detected.
value:good-to-have; category:bug; feedback: The Claude AI reviewer is correct! The regex is compiled on each usage. It would be better to compile it once, as a class member constant, and just execute it when needed. This would be much faster.
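The compile-once point generalizes beyond hoisting Scala's `"...".r` into a `private val`. In Rust, for instance, `std::sync::OnceLock` moves one-time construction out of the hot path; the marker list below is illustrative, not the shim's actual pattern:

```rust
use std::sync::OnceLock;

// Build the matcher's data once per process instead of on every call,
// analogous to a Scala class-member constant.
fn not_found_markers() -> &'static [&'static str] {
    static MARKERS: OnceLock<Vec<&'static str>> = OnceLock::new();
    MARKERS.get_or_init(|| vec!["not found", "no such file or directory"])
}

fn looks_like_not_found(msg: &str) -> bool {
    let lower = msg.to_ascii_lowercase();
    not_found_markers().iter().any(|m| lower.contains(m))
}

fn main() {
    assert!(looks_like_not_found("Object at location /x not found"));
    assert!(!looks_like_not_found("type mismatch"));
    println!("ok");
}
```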


3699: To review by AI