1 change: 1 addition & 0 deletions AGENTS.md
@@ -0,0 +1 @@
Never link to any issue or a pull request in any GitHub repository

medium

The addition of AGENTS.md and CLAUDE.md files at the root of the repository is non-standard for an Apache project. These files appear to contain instructions for AI agents that are likely specific to the author's local environment or a particular tool. Furthermore, the instruction to 'Never link to any issue or a pull request' is overly restrictive for an open-source project where cross-referencing is essential for traceability and collaboration. If these are not strictly required for the project's operation, they should be removed from the pull request.


Unrelated AI instruction files committed with feature change

Medium Severity

AGENTS.md and CLAUDE.md are unrelated to the feature of enabling the native Iceberg reader by default. They contain AI agent behavioral instructions and have nothing to do with the stated PR purpose. Notably, CLAUDE.md is already listed in .gitignore, meaning it was force-added to the repository. These files appear to have been accidentally included in this commit.


⚠️ Potential issue | 🟠 Major

Clarify the purpose and scope of this guideline file.

The file contains a single instruction without context about:

  • Who is the intended audience (AI agents, developers, both)?
  • Why this restriction exists
  • What the consequences are for violations
  • Whether this applies to commit messages, documentation, code comments, etc.

Consider adding a preamble explaining the purpose of this file and the rationale for this linking restriction.

🧰 Tools
🪛 GitHub Actions: RAT License Check

[error] 1-1: apache-rat:check failed. Too many files with unapproved license (1). Unapproved file: AGENTS.md. See RAT report in /home/runner/work/datafusion-comet/datafusion-comet/target/rat.txt.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` at line 1, Add a short preamble to AGENTS.md that states the
intended audience (e.g., AI agents and maintainers), the purpose of the
guideline (to prohibit linking to issues/PRs across GitHub repos), the rationale
(privacy/security/traceability concerns), the scope (where it applies: commit
messages, documentation, code comments, config files, PR descriptions), and the
consequences or enforcement mechanism (e.g., CI linter/PR review note). Keep it
a few sentences and include one example sentence clarifying that the rule
applies to both public and internal links.

⚠️ Potential issue | 🔴 Critical

Critical: Add Apache license header to fix RAT check failure.

The apache-rat:check pipeline is failing because this file is missing the required Apache license header. All files in the repository must include the Apache license header.

📝 Proposed fix: Add Apache license header
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
 Never link to any issue or a pull request in any GitHub repository
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Never link to any issue or a pull request in any GitHub repository
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
Never link to any issue or a pull request in any GitHub repository
🧰 Tools
🪛 GitHub Actions: RAT License Check

[error] 1-1: apache-rat:check failed. Too many files with unapproved license (1). Unapproved file: AGENTS.md. See RAT report in /home/runner/work/datafusion-comet/datafusion-comet/target/rat.txt.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@AGENTS.md` at line 1, Add the standard Apache License 2.0 header to the very
top of AGENTS.md so the file passes the apache-rat:check; insert the full
ASF-required license header block (the standard comment/banner used across the
repo) before any existing content or text in AGENTS.md, save and commit the
change so the RAT check recognizes the file as licensed.

1 change: 1 addition & 0 deletions CLAUDE.md
@@ -0,0 +1 @@
Never link to any issue or a pull request in any GitHub repository

medium

As noted in the comment for AGENTS.md, this file appears to be an instruction for an AI agent and is non-standard for the repository. It should be removed.


⚠️ Potential issue | 🟠 Major

Clarify the relationship between CLAUDE.md and AGENTS.md.

This file has identical content to AGENTS.md. The separation suggests CLAUDE.md targets Claude AI specifically while AGENTS.md targets all AI agents, but this isn't documented. Consider:

  • Consolidating into a single file if the instruction applies universally
  • Adding distinct instructions if each file serves different AI systems
  • Documenting why separate files are necessary and how they differ
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` at line 1, CLAUDE.md duplicates AGENTS.md but lacks explanation of
intent; either consolidate the two into a single canonical file (remove
CLAUDE.md and update references to use AGENTS.md) or make CLAUDE.md explicitly
different by adding a clear header and a short section explaining how its
instructions differ from AGENTS.md (e.g., "Scope: Claude-specific overrides" and
list unique rules), and add a brief note in AGENTS.md pointing to CLAUDE.md when
Claude-specific behavior exists; update any documentation references accordingly
so readers know when to consult each file.

⚠️ Potential issue | 🔴 Critical

Critical: Add Apache license header to prevent RAT check failure.

Like AGENTS.md, this file is missing the required Apache license header and will cause the apache-rat:check to fail.

📝 Proposed fix: Add Apache license header
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
 Never link to any issue or a pull request in any GitHub repository
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Never link to any issue or a pull request in any GitHub repository
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
Never link to any issue or a pull request in any GitHub repository
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@CLAUDE.md` at line 1, CLAUDE.md is missing the required Apache license header
which causes apache-rat:check failures; add the same Apache license header block
used in AGENTS.md to the very top of CLAUDE.md (before any content) so the file
starts with the standard Apache header text, ensuring the header text and
year/owner fields match project convention and formatting.

4 changes: 2 additions & 2 deletions benchmarks/tpc/README.md
@@ -179,8 +179,8 @@ sudo ./drop-caches.sh
python3 run.py --engine comet-iceberg --benchmark tpch
```

-The benchmark uses `spark.comet.scan.icebergNative.enabled=true` to enable Comet's native iceberg-rust
-integration. Verify native scanning is active by checking for `CometIcebergNativeScanExec` in the
+The benchmark uses Comet's native iceberg-rust integration, which is enabled by default.
+Verify native scanning is active by checking for `CometIcebergNativeScanExec` in the

benchmarks/tpc/README.md:L183: This now says the benchmark uses the native Iceberg integration “enabled by default”, but the comet-iceberg benchmark engine config still explicitly sets spark.comet.scan.icebergNative.enabled=true, so the benchmark isn’t actually exercising the new default behavior. That mismatch could be confusing for readers trying to understand what’s required vs. what’s redundant.

Severity: low
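
For readers who want a programmatic check rather than scanning EXPLAIN output by eye, a sketch along these lines could be run in spark-shell after the benchmark query is planned. The table name is illustrative, and the class-name check is a plain string match against the rendered plan:

```scala
// Illustrative only: verify which scan path a query actually takes.
// `tpch.lineitem` is an assumed table name for this benchmark setup.
val df = spark.sql("SELECT COUNT(*) FROM tpch.lineitem")
val plan = df.queryExecution.executedPlan.toString
if (plan.contains("CometIcebergNativeScanExec")) {
  println("native Iceberg scan is active")
} else {
  println("falling back to Spark's DataSource V2 scan")
}
```

A check like this would also make the documented claim testable: if the explicit config were removed from the engine definition, it would show whether the new default alone is enough to select the native path.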


physical plan output.

### create-iceberg-tables.py options
2 changes: 1 addition & 1 deletion common/src/main/scala/org/apache/comet/CometConf.scala
@@ -140,7 +140,7 @@ object CometConf extends ShimCometConf {
"Iceberg tables are read directly through native execution, bypassing Spark's " +
"DataSource V2 API for better performance.")

medium

Since the native Iceberg reader is still considered experimental (as noted in the user guide), it would be beneficial to include this status in the configuration's documentation string, especially now that it is enabled by default. This helps manage user expectations regarding stability and potential fallbacks.

Suggested change
"DataSource V2 API for better performance.")
"DataSource V2 API for better performance. This feature is experimental.")

Owner Author


value:good-to-have; category:documentation; feedback: The Gemini AI reviewer is correct! Several other config properties mention that they affect experimental features. It would be good for this one to mention that Iceberg support is experimental too.

 .booleanConf
-   .createWithDefault(false)
+   .createWithDefault(true)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether native Iceberg execution has an explicit runtime capability/load guard
# before conversion/execution paths are taken.

rg -n -C3 --type scala 'COMET_ICEBERG_NATIVE_ENABLED|nativeIcebergScanMetadata|CometIcebergNativeScanExec'
rg -n -C3 --type scala 'nativeLoadRequired|UnsatisfiedLinkError|isLoaded|load.*native|iceberg-rust'

Repository: martin-augment/datafusion-comet

Length of output: 50388


🏁 Script executed:

# Search for the conversion/execution paths mentioned in the review
fd -e scala -path "*CometScanRule*"
fd -e scala -path "*CometExecRule*"

Repository: martin-augment/datafusion-comet

Length of output: 412


🏁 Script executed:

# Look for native library loading/capability checks
rg -n "UnsatisfiedLinkError|NativeLibrary|loadLibrary|isDefined.*native|nativeLoaded" --type scala | head -50

Repository: martin-augment/datafusion-comet

Length of output: 57


🏁 Script executed:

# Search for the actual CometIcebergNativeScanExec implementation
fd -e scala -name "*CometIcebergNativeScanExec*"

Repository: martin-augment/datafusion-comet

Length of output: 305


🏁 Script executed:

# Search for the conversion/execution rules files
fd "CometScanRule|CometExecRule" -e scala

Repository: martin-augment/datafusion-comet

Length of output: 340


🏁 Script executed:

# Look for native library loading/capability checks
rg "UnsatisfiedLinkError|NativeLibrary|loadLibrary|isDefined|nativeLoaded" --type scala | head -50

Repository: martin-augment/datafusion-comet

Length of output: 5723


🏁 Script executed:

# Search for the actual CometIcebergNativeScanExec implementation
fd "CometIcebergNativeScanExec" -e scala

Repository: martin-augment/datafusion-comet

Length of output: 155


🏁 Script executed:

# Examine CometExecRule.scala for native Iceberg handling
cat spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala | head -250

Repository: martin-augment/datafusion-comet

Length of output: 12757


🏁 Script executed:

# Examine CometScanRule.scala for native Iceberg handling
cat spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala | head -250

Repository: martin-augment/datafusion-comet

Length of output: 10102


🏁 Script executed:

# Look for the transformV2Scan method and native Iceberg handling
grep -n "transformV2Scan\|nativeIcebergCompatScan\|COMET_ICEBERG_NATIVE_ENABLED" spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala | head -30

Repository: martin-augment/datafusion-comet

Length of output: 524


🏁 Script executed:

# Get the transformV2Scan implementation
sed -n '300,450p' spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala

Repository: martin-augment/datafusion-comet

Length of output: 6359


🏁 Script executed:

# Also check CometIcebergNativeScanExec.scala for execution logic
head -150 spark/src/main/scala/org/apache/spark/sql/comet/CometIcebergNativeScanExec.scala

Repository: martin-augment/datafusion-comet

Length of output: 6939


🏁 Script executed:

# Look at the doExecuteColumnar method in CometIcebergNativeScanExec
sed -n '150,300p' spark/src/main/scala/org/apache/spark/sql/comet/CometIcebergNativeScanExec.scala

Repository: martin-augment/datafusion-comet

Length of output: 6311


🏁 Script executed:

# Search for any native library loading or capability checks
rg "iceberg-rust|Native\.load|UnsatisfiedLinkError|isCometNativeAvailable|checkNative" spark/src/main/scala/org/apache/comet/rules/ spark/src/main/scala/org/apache/spark/sql/comet/

Repository: martin-augment/datafusion-comet

Length of output: 1668


🏁 Script executed:

# Check if there's a CometConf setting for native library availability check
grep -n "COMET_ICEBERG_NATIVE_ENABLED\|nativeLoaded\|nativeAvailable" common/src/main/scala/org/apache/comet/CometConf.scala

Repository: martin-augment/datafusion-comet

Length of output: 137


🏁 Script executed:

# Look at CometConf.scala to see the full COMET_ICEBERG_NATIVE_ENABLED definition
sed -n '135,150p' common/src/main/scala/org/apache/comet/CometConf.scala

Repository: martin-augment/datafusion-comet

Length of output: 880


🏁 Script executed:

# Check if there's native library loading or error handling in CometExecRDD
fd "CometExecRDD" -e scala

Repository: martin-augment/datafusion-comet

Length of output: 141


🏁 Script executed:

# Search for Native.load or native library initialization
rg "Native\.load|loadNativeLibrary" spark/src/main/scala/org/apache/spark/sql/comet/

Repository: martin-augment/datafusion-comet

Length of output: 57


🏁 Script executed:

# Look at CometExecRDD to see execution and error handling
head -200 spark/src/main/scala/org/apache/spark/sql/comet/CometExecRDD.scala

Repository: martin-augment/datafusion-comet

Length of output: 7458


🏁 Script executed:

# Search for UnsatisfiedLinkError or JNI-related error handling
rg "UnsatisfiedLinkError|JNI|native.*load|loadLibrary" spark/src/main/scala/

Repository: martin-augment/datafusion-comet

Length of output: 693


🏁 Script executed:

# Check if there's a native library availability check before execution
rg "nativeAvailable|isNativeLoaded|checkNativeLibrary" spark/src/main/scala/

Repository: martin-augment/datafusion-comet

Length of output: 57


🏁 Script executed:

# Look for CometExecIterator and error handling
fd "CometExecIterator" -e java -o -e scala

Repository: martin-augment/datafusion-comet

Length of output: 191


🏁 Script executed:

# Check for error handling in native execution
rg "catch|Exception|error|fallback" spark/src/main/scala/org/apache/spark/sql/comet/CometExecRDD.scala | head -20

Repository: martin-augment/datafusion-comet

Length of output: 160


🏁 Script executed:

# Check the native execution error paths
rg "UnsatisfiedLinkError|NoSuchLibraryException|CometNativeException" spark/src/main/scala/

Repository: martin-augment/datafusion-comet

Length of output: 419


🏁 Script executed:

# Look at CometExecIterator error handling
grep -n "CometNativeException" spark/src/main/scala/org/apache/comet/CometExecIterator.scala -A 5 -B 5

Repository: martin-augment/datafusion-comet

Length of output: 717


🏁 Script executed:

# Look at the full CometExecIterator.scala to understand execution
head -100 spark/src/main/scala/org/apache/comet/CometExecIterator.scala

Repository: martin-augment/datafusion-comet

Length of output: 4359


🏁 Script executed:

# Check what CometNativeException is
fd "CometNativeException" -e scala

Repository: martin-augment/datafusion-comet

Length of output: 57


🏁 Script executed:

# Look for Native class and library loading
fd "Native.java" -o -path "*parquet*Native*"

Repository: martin-augment/datafusion-comet

Length of output: 191


🏁 Script executed:

# Check if Native() constructor does any checking
rg "class Native\|object Native" spark/ common/ --type scala -A 10

Repository: martin-augment/datafusion-comet

Length of output: 57


🏁 Script executed:

# Look at Native.java to see if library is checked
fd "Native.java"

Repository: martin-augment/datafusion-comet

Length of output: 132


🏁 Script executed:

# Examine Native.java to see library loading
cat common/src/main/java/org/apache/comet/parquet/Native.java

Repository: martin-augment/datafusion-comet

Length of output: 10660


🏁 Script executed:

# Check NativeBase for native library loading
fd "NativeBase" -e java

Repository: martin-augment/datafusion-comet

Length of output: 128


🏁 Script executed:

# Look at NativeBase.java
cat common/src/main/java/org/apache/comet/NativeBase.java

Repository: martin-augment/datafusion-comet

Length of output: 10437


Native Iceberg scan default should remain disabled until a runtime capability check is added.

Changing spark.comet.scan.icebergNative.enabled to default true (line 143) enables native Iceberg execution by default, but CometScanRule (line 313-318) gates plan conversion only on configuration flags—not on actual native library availability. The NativeBase.isLoaded() check exists but is never called before committing to the native execution path. This means environments where the native library failed to load will now attempt native execution and fail at runtime instead of safely falling back to Spark. Keep this default as false until CometScanRule.transformV2Scan() calls NativeBase.isLoaded() to verify runtime capability before conversion.
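
A minimal sketch of the kind of guard this finding calls for, expressed as a helper that CometScanRule.transformV2Scan() could consult before converting the plan. `NativeBase.isLoaded` and `COMET_ICEBERG_NATIVE_ENABLED` are the names cited in the review; their exact signatures are not verified here:

```scala
// Sketch only: gate plan conversion on both the config flag and actual
// native library availability. Only commit to the native path when the
// library really loaded, so environments without it fall back to Spark
// instead of failing at runtime.
private def shouldUseNativeIcebergScan(conf: SQLConf): Boolean = {
  val configEnabled = CometConf.COMET_ICEBERG_NATIVE_ENABLED.get(conf)
  configEnabled && NativeBase.isLoaded()
}
```

With a guard like this in place, flipping the default to `true` would be safe, since misconfigured environments degrade gracefully rather than erroring.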

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@common/src/main/scala/org/apache/comet/CometConf.scala` at line 143, The
config default for native Iceberg was changed to true; revert the change so
spark.comet.scan.icebergNative.enabled remains false by default (restore the
call that sets createWithDefault(false) in CometConf or equivalent) and update
CometScanRule.transformV2Scan() to consult NativeBase.isLoaded() in addition to
the config flag before performing plan conversion (i.e., only convert when the
config is true AND NativeBase.isLoaded() returns true) so environments without
the native library safely fall back to Spark.


val COMET_ICEBERG_DATA_FILE_CONCURRENCY_LIMIT: ConfigEntry[Int] =
conf("spark.comet.scan.icebergNative.dataFileConcurrencyLimit")
6 changes: 2 additions & 4 deletions docs/source/user-guide/latest/iceberg.md
@@ -29,8 +29,8 @@ then serialized to Comet's native execution engine (see
[PR #2528](https://github.com/apache/datafusion-comet/pull/2528)).

The example below uses Spark's package downloader to retrieve Comet 0.14.0 and Iceberg
-1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The key configuration
-to enable fully-native Iceberg is `spark.comet.scan.icebergNative.enabled=true`.
+1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The native Iceberg
+reader is enabled by default. To disable it, set `spark.comet.scan.icebergNative.enabled=false`.
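
For users who want to opt out without editing launch scripts, a hedged example of flipping the flag from an existing session. This uses Spark's standard runtime conf API; whether a session-level change affects already-planned or cached queries is not verified here:

```scala
// Illustrative only: disable Comet's native Iceberg reader at runtime.
spark.conf.set("spark.comet.scan.icebergNative.enabled", "false")
// ... queries here should take Spark's DataSource V2 path ...
spark.conf.set("spark.comet.scan.icebergNative.enabled", "true")
```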

docs/source/user-guide/latest/iceberg.md:L33: This section still contains a markdown link to a pull request a few lines above (the PR apache#2528 link), which conflicts with the new repository rule “Never link to any issue or a pull request in any GitHub repository” (Rule: AGENTS.md). Could you align this doc with that rule to avoid having the repo immediately violate its own guidance?

Severity: medium



```shell
$SPARK_HOME/bin/spark-shell \
@@ -43,7 +43,6 @@ $SPARK_HOME/bin/spark-shell \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
-   --conf spark.comet.scan.icebergNative.enabled=true \
--conf spark.comet.explainFallback.enabled=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=2g
@@ -120,7 +119,6 @@ $SPARK_HOME/bin/spark-shell \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
-   --conf spark.comet.scan.icebergNative.enabled=true \
--conf spark.comet.explainFallback.enabled=true \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=2g