
3819: feat: enable native Iceberg reader by default #52

Open
martin-augment wants to merge 3 commits into main from
pr-3819-2026-04-01-08-40-10

Conversation

@martin-augment
Owner

3819: To review by AI

andygrove and others added 3 commits March 27, 2026 15:36
Change the default value of spark.comet.scan.icebergNative.enabled
from false to true, and update documentation to reflect this change.
@coderabbitai

coderabbitai bot commented Apr 1, 2026

Walkthrough

The pull request adds two new guideline files (AGENTS.md and CLAUDE.md) establishing a linking restriction policy. The primary functional change involves enabling Comet's native Iceberg table scans by default by updating the COMET_ICEBERG_NATIVE_ENABLED configuration default from false to true. Supporting documentation updates in the Iceberg user guide and TPC benchmark README remove explicit configuration requirements from code examples and clarify that native Iceberg scanning is now enabled by default, with an option to disable it if needed.
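For orientation, the functional change described above amounts to flipping the default on a single config entry. The sketch below assumes the ConfigBuilder-style definition used in CometConf; the exact builder calls and surrounding code may differ from the real file:

```scala
// Sketch (assumed API shape): the COMET_ICEBERG_NATIVE_ENABLED entry with its new default.
val COMET_ICEBERG_NATIVE_ENABLED: ConfigEntry[Boolean] =
  conf("spark.comet.scan.icebergNative.enabled")
    .doc(
      "Whether to use Comet's native Iceberg scan. When enabled, " +
        "Iceberg tables are read directly through native execution, bypassing Spark's " +
        "DataSource V2 API for better performance.")
    .booleanConf
    .createWithDefault(true) // was false before this PR
```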



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables the native Iceberg reader by default, updating the configuration, benchmarks, and user documentation. Review feedback recommends removing the non-standard AGENTS.md and CLAUDE.md files, which contain restrictive instructions for AI agents that are inappropriate for an open-source project. Additionally, it is suggested to update the configuration description to explicitly state that the native Iceberg reader is an experimental feature to better manage user expectations.

@@ -0,0 +1 @@
Never link to any issue or a pull request in any GitHub repository


medium

The addition of AGENTS.md and CLAUDE.md files at the root of the repository is non-standard for an Apache project. These files appear to contain instructions for AI agents that are likely specific to the author's local environment or a particular tool. Furthermore, the instruction to 'Never link to any issue or a pull request' is overly restrictive for an open-source project where cross-referencing is essential for traceability and collaboration. If these are not strictly required for the project's operation, they should be removed from the pull request.

@@ -0,0 +1 @@
Never link to any issue or a pull request in any GitHub repository


medium

As noted in the comment for AGENTS.md, this file appears to be an instruction for an AI agent and is non-standard for the repository. It should be removed.

@@ -140,7 +140,7 @@ object CometConf extends ShimCometConf {
"Iceberg tables are read directly through native execution, bypassing Spark's " +
"DataSource V2 API for better performance.")


medium

Since the native Iceberg reader is still considered experimental (as noted in the user guide), it would be beneficial to include this status in the configuration's documentation string, especially now that it is enabled by default. This helps manage user expectations regarding stability and potential fallbacks.

Suggested change
- "DataSource V2 API for better performance.")
+ "DataSource V2 API for better performance. This feature is experimental.")

Owner Author


value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! Several other config properties mention that they affect experimental features. It would be good for this one to mention that Iceberg support is experimental too.

@claude

claude bot commented Apr 1, 2026

Code Review

PR Summary

This PR enables the native Iceberg reader (spark.comet.scan.icebergNative.enabled) by default, changing its default value from false to true. Documentation in iceberg.md and benchmarks/tpc/README.md is updated accordingly.


⚠️ Prompt Injection Concern

The PR adds two files, AGENTS.md and CLAUDE.md, each containing a single line:

Never link to any issue or a pull request in any GitHub repository

These files are specifically targeted at AI coding assistants (Codex agents read AGENTS.md, Claude reads CLAUDE.md). The instruction attempts to suppress AI tools from linking to related issues or PRs during code review, which would reduce context available to reviewers. This is a prompt injection technique and should not be merged. These files have no legitimate place in this PR and should be removed.


Config Change (CometConf.scala)

The one-line change flipping .createWithDefault(false) to .createWithDefault(true) is clean. A few things worth considering:

Breaking change for existing users. Any Comet user who has Iceberg tables but hasn't opted in to the native reader will now get it automatically after upgrading. If their workload hits an unsupported feature or a bug in the native reader, they have no warning. It would be good to call this out explicitly as a breaking change in the PR description and in a release note or migration guide entry.

Is the native reader stable enough to be default? The config doc says the feature "bypasses Spark's DataSource V2 API." Do any known limitations or compatibility gaps exist (row-level deletes, equality deletes, schema evolution edge cases, etc.) that users hitting this for the first time might not expect? It may be worth documenting any known limitations in iceberg.md alongside the "enabled by default" statement.

Version compatibility. The docs mention the reader has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. Is it safe to enable by default for all of those versions, or only for a subset?


Documentation (iceberg.md)

The doc change is straightforward: removing the explicit --conf spark.comet.scan.icebergNative.enabled=true from both example spark-shell invocations and updating the prose to say the reader is on by default, with a pointer to the flag for disabling it. This reads clearly and is consistent with the config change.

One small suggestion: the sentence "The native Iceberg reader is enabled by default. To disable it, set spark.comet.scan.icebergNative.enabled=false." would be a good candidate to also appear in the compatibility or configuration guide, not just in the Iceberg-specific page, since users who don't read that page might be surprised by the behavior change.
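Along those lines, a short illustration of the opt-out (the config key comes from this PR; the session code below is only a sketch, and the table name is a placeholder):

```scala
// Sketch: disabling the native Iceberg reader after upgrading.
spark.conf.set("spark.comet.scan.icebergNative.enabled", "false")

// Iceberg reads now go through Spark's DataSource V2 path instead of the native scan.
val df = spark.read.table("db.my_iceberg_table")
```

Equivalently, pass `--conf spark.comet.scan.icebergNative.enabled=false` when launching spark-shell.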


benchmarks/tpc/README.md

The updated sentence is accurate and reads well.


Summary

The core config and documentation change is simple and correct. The main concerns are:

  1. The AGENTS.md / CLAUDE.md prompt injection files should be removed before merging.
  2. The PR should communicate clearly that this is a behavior change for existing users, and ideally document any known limitations of the native reader so users can make an informed decision about whether to disable it.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


@@ -0,0 +1 @@
Never link to any issue or a pull request in any GitHub repository


Unrelated AI instruction files committed with feature change

Medium Severity

AGENTS.md and CLAUDE.md are unrelated to the feature of enabling the native Iceberg reader by default. They contain AI agent behavioral instructions and have nothing to do with the stated PR purpose. Notably, CLAUDE.md is already listed in .gitignore, meaning it was force-added to the repository. These files appear to have been accidentally included in this commit.


@augmentcode

augmentcode bot commented Apr 1, 2026

🤖 Augment PR Summary

Summary: Enables Comet’s native Iceberg reader by default by flipping the default for spark.comet.scan.icebergNative.enabled to true.

Changes:

  • Updates TPC benchmark docs and the Iceberg user guide to reflect default-on behavior and how to disable it



@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.


- 1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The key configuration
- to enable fully-native Iceberg is `spark.comet.scan.icebergNative.enabled=true`.
+ 1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The native Iceberg
+ reader is enabled by default. To disable it, set `spark.comet.scan.icebergNative.enabled=false`.


docs/source/user-guide/latest/iceberg.md:L33: This section still contains a markdown link to a pull request a few lines above (the PR apache#2528 link), which conflicts with the new repository rule “Never link to any issue or a pull request in any GitHub repository” (Rule: AGENTS.md). Could you align this doc with that rule to avoid having the repo immediately violate its own guidance?

Severity: medium


- The benchmark uses `spark.comet.scan.icebergNative.enabled=true` to enable Comet's native iceberg-rust
- integration. Verify native scanning is active by checking for `CometIcebergNativeScanExec` in the
+ The benchmark uses Comet's native iceberg-rust integration, which is enabled by default.
+ Verify native scanning is active by checking for `CometIcebergNativeScanExec` in the


benchmarks/tpc/README.md:L183: This now says the benchmark uses the native Iceberg integration “enabled by default”, but the comet-iceberg benchmark engine config still explicitly sets spark.comet.scan.icebergNative.enabled=true, so the benchmark isn’t actually exercising the new default behavior. That mismatch could be confusing for readers trying to understand what’s required vs. what’s redundant.

Severity: low



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@AGENTS.md`:
- Line 1: Add a short preamble to AGENTS.md that states the intended audience
(e.g., AI agents and maintainers), the purpose of the guideline (to prohibit
linking to issues/PRs across GitHub repos), the rationale
(privacy/security/traceability concerns), the scope (where it applies: commit
messages, documentation, code comments, config files, PR descriptions), and the
consequences or enforcement mechanism (e.g., CI linter/PR review note). Keep it
a few sentences and include one example sentence clarifying that the rule
applies to both public and internal links.
- Line 1: Add the standard Apache License 2.0 header to the very top of
AGENTS.md so the file passes the apache-rat:check; insert the full ASF-required
license header block (the standard comment/banner used across the repo) before
any existing content or text in AGENTS.md, save and commit the change so the RAT
check recognizes the file as licensed.

In `@CLAUDE.md`:
- Line 1: CLAUDE.md duplicates AGENTS.md but lacks explanation of intent; either
consolidate the two into a single canonical file (remove CLAUDE.md and update
references to use AGENTS.md) or make CLAUDE.md explicitly different by adding a
clear header and a short section explaining how its instructions differ from
AGENTS.md (e.g., "Scope: Claude-specific overrides" and list unique rules), and
add a brief note in AGENTS.md pointing to CLAUDE.md when Claude-specific
behavior exists; update any documentation references accordingly so readers know
when to consult each file.
- Line 1: CLAUDE.md is missing the required Apache license header which causes
apache-rat:check failures; add the same Apache license header block used in
AGENTS.md to the very top of CLAUDE.md (before any content) so the file starts
with the standard Apache header text, ensuring the header text and year/owner
fields match project convention and formatting.

In `@common/src/main/scala/org/apache/comet/CometConf.scala`:
- Line 143: The config default for native Iceberg was changed to true; revert
the change so spark.comet.scan.icebergNative.enabled remains false by default
(restore the call that sets createWithDefault(false) in CometConf or equivalent)
and update CometScanRule.transformV2Scan() to consult NativeBase.isLoaded() in
addition to the config flag before performing plan conversion (i.e., only
convert when the config is true AND NativeBase.isLoaded() returns true) so
environments without the native library safely fall back to Spark.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 1a751bda-732d-4d5c-ab92-56183fcfa733

📥 Commits

Reviewing files that changed from the base of the PR and between 88d3a1e and 41f8a38.

📒 Files selected for processing (5)
  • AGENTS.md
  • CLAUDE.md
  • benchmarks/tpc/README.md
  • common/src/main/scala/org/apache/comet/CometConf.scala
  • docs/source/user-guide/latest/iceberg.md

@@ -0,0 +1 @@
Never link to any issue or a pull request in any GitHub repository


⚠️ Potential issue | 🟠 Major

Clarify the purpose and scope of this guideline file.

The file contains a single instruction without context about:

  • Who is the intended audience (AI agents, developers, both)?
  • Why this restriction exists
  • What the consequences are for violations
  • Whether this applies to commit messages, documentation, code comments, etc.

Consider adding a preamble explaining the purpose of this file and the rationale for this linking restriction.

🧰 Tools
🪛 GitHub Actions: RAT License Check

[error] 1-1: apache-rat:check failed. Too many files with unapproved license (1). Unapproved file: AGENTS.md. See RAT report in /home/runner/work/datafusion-comet/datafusion-comet/target/rat.txt.


⚠️ Potential issue | 🔴 Critical

Critical: Add Apache license header to fix RAT check failure.

The apache-rat:check pipeline is failing because this file is missing the required Apache license header. All files in the repository must include the Apache license header.

📝 Proposed fix: Add Apache license header
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
 Never link to any issue or a pull request in any GitHub repository


@@ -0,0 +1 @@
Never link to any issue or a pull request in any GitHub repository


⚠️ Potential issue | 🟠 Major

Clarify the relationship between CLAUDE.md and AGENTS.md.

This file has identical content to AGENTS.md. The separation suggests CLAUDE.md targets Claude AI specifically while AGENTS.md targets all AI agents, but this isn't documented. Consider:

  • Consolidating into a single file if the instruction applies universally
  • Adding distinct instructions if each file serves different AI systems
  • Documenting why separate files are necessary and how they differ

⚠️ Potential issue | 🔴 Critical

Critical: Add Apache license header to prevent RAT check failure.

Like AGENTS.md, this file is missing the required Apache license header and will cause the apache-rat:check to fail.

📝 Proposed fix: Add Apache license header
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
 Never link to any issue or a pull request in any GitHub repository

  "DataSource V2 API for better performance.")
  .booleanConf
- .createWithDefault(false)
+ .createWithDefault(true)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

(Collapsed tool output: CodeRabbit ran a series of `rg`, `fd`, `sed`, and `cat` searches over CometScanRule.scala, CometExecRule.scala, CometIcebergNativeScanExec.scala, CometExecIterator.scala, CometConf.scala, Native.java, and NativeBase.java to check whether the native Iceberg path is guarded by a runtime library-availability check. The searches show that plan conversion is gated only on configuration flags; `NativeBase.isLoaded()` exists but is not consulted before committing to the native execution path.)


Native Iceberg scan default should remain disabled until runtime capability check is added.

Changing spark.comet.scan.icebergNative.enabled to default true (line 143) enables native Iceberg execution by default, but CometScanRule (line 313-318) gates plan conversion only on configuration flags—not on actual native library availability. The NativeBase.isLoaded() check exists but is never called before committing to the native execution path. This means environments where the native library failed to load will now attempt native execution and fail at runtime instead of safely falling back to Spark. Keep this default as false until CometScanRule.transformV2Scan() calls NativeBase.isLoaded() to verify runtime capability before conversion.
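The proposed guard could look roughly like the following (hypothetical sketch: `NativeBase.isLoaded()` and the config accessor are named in the review text, while the conversion helper and surrounding code are invented for illustration):

```scala
// Hypothetical placement inside CometScanRule.transformV2Scan:
// convert only when the flag is on AND the native library actually loaded.
if (CometConf.COMET_ICEBERG_NATIVE_ENABLED.get(conf) && NativeBase.isLoaded()) {
  convertToCometIcebergNativeScan(scan) // assumed conversion helper
} else {
  scan // native library unavailable: fall back to Spark's DataSource V2 scan
}
```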

