3819: feat: enable native Iceberg reader by default #52
**AGENTS.md** (new file)

```diff
@@ -0,0 +1 @@
+Never link to any issue or a pull request in any GitHub repository
```
Review comment (severity: medium): Unrelated AI instruction files committed with feature change.
Review comment: Clarify the purpose and scope of this guideline file. The file contains a single instruction with no surrounding context. Consider adding a preamble explaining the purpose of this file and the rationale for this linking restriction.

🪛 GitHub Actions: RAT License Check

```
[error] 1-1: apache-rat:check failed. Too many files with unapproved license (1).
Unapproved file: AGENTS.md. See RAT report in
/home/runner/work/datafusion-comet/datafusion-comet/target/rat.txt.
```

Critical: add the Apache license header to fix the RAT check failure. The apache-rat:check pipeline is failing because this file is missing the required Apache license header, and all files in the repository must include it.

📝 Proposed fix (add the Apache license header):

```diff
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
 Never link to any issue or a pull request in any GitHub repository
```
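At its core, the failing RAT check verifies that every file carries the ASF license header near the top. A toy sketch of that idea in plain Python (not the real apache-rat implementation, which does far more):

```python
ASF_HEADER_MARKER = "Licensed to the Apache Software Foundation (ASF)"


def has_asf_header(file_text: str, search_limit: int = 1024) -> bool:
    """Rough approximation of a RAT-style check: the ASF license
    marker must appear near the top of the file."""
    return ASF_HEADER_MARKER in file_text[:search_limit]


unapproved = "Never link to any issue or a pull request in any GitHub repository\n"
approved = (
    "<!--\n  Licensed to the Apache Software Foundation (ASF) under one\n-->\n"
    + unapproved
)

assert not has_asf_header(unapproved)  # fails RAT, like AGENTS.md here
assert has_asf_header(approved)        # passes after the proposed fix
```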
**CLAUDE.md** (new file)

```diff
@@ -0,0 +1 @@
+Never link to any issue or a pull request in any GitHub repository
```
Review comment: Clarify the relationship between CLAUDE.md and AGENTS.md. This file has identical content to AGENTS.md.

Critical: add the Apache license header to prevent a RAT check failure. Like AGENTS.md, this file is missing the required Apache license header.

📝 Proposed fix (add the Apache license header):

```diff
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
 Never link to any issue or a pull request in any GitHub repository
```
**benchmarks/tpc/README.md**

````diff
@@ -179,8 +179,8 @@ sudo ./drop-caches.sh
 python3 run.py --engine comet-iceberg --benchmark tpch
 ```

-The benchmark uses `spark.comet.scan.icebergNative.enabled=true` to enable Comet's native iceberg-rust
-integration. Verify native scanning is active by checking for `CometIcebergNativeScanExec` in the
+The benchmark uses Comet's native iceberg-rust integration, which is enabled by default.
+Verify native scanning is active by checking for `CometIcebergNativeScanExec` in the
````
Review comment (severity: low) — benchmarks/tpc/README.md:L183: This now says the benchmark uses the native Iceberg integration "enabled by default", but the comet-iceberg benchmark engine config still explicitly sets this flag.
```diff
 physical plan output.

 ### create-iceberg-tables.py options
```
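The "check the physical plan" verification step above can be scripted. A minimal sketch in plain Python, assuming the plan is already available as a string (for example, captured from `explain()` output):

```python
def uses_native_iceberg_scan(plan_text: str) -> bool:
    """Return True if the physical plan text shows Comet's native
    Iceberg scan operator rather than a Spark DataSource V2 scan."""
    return "CometIcebergNativeScanExec" in plan_text


# Abbreviated, hypothetical plan fragments for illustration only
native_plan = "+- CometIcebergNativeScanExec [l_orderkey, l_quantity]"
fallback_plan = "+- BatchScan iceberg.tpch.lineitem"

assert uses_native_iceberg_scan(native_plan)
assert not uses_native_iceberg_scan(fallback_plan)
```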
**common/src/main/scala/org/apache/comet/CometConf.scala**

```diff
@@ -140,7 +140,7 @@ object CometConf extends ShimCometConf {
         "Iceberg tables are read directly through native execution, bypassing Spark's " +
         "DataSource V2 API for better performance.")
```
|
Review comment (Gemini): Since the native Iceberg reader is still considered experimental (as noted in the user guide), it would be beneficial to include this status in the configuration's documentation string, especially now that it is enabled by default. This helps manage user expectations regarding stability and potential fallbacks.

Owner/author reply (value: good-to-have; category: documentation): The Gemini AI reviewer is correct. Several other config properties mention that they affect experimental features, so it would be good for this one to mention that the Iceberg support is experimental too.
```diff
         .booleanConf
-        .createWithDefault(false)
+        .createWithDefault(true)
```
|
Review comment (CodeRabbit) — 🧩 Analysis chain: The reviewer ran a series of verification scripts against martin-augment/datafusion-comet to check whether native Iceberg execution has an explicit runtime capability/load guard before the conversion/execution paths are taken. Condensed transcript of the key commands:

```shell
# Search for the config, the native scan operator, and any load guards
rg -n -C3 --type scala 'COMET_ICEBERG_NATIVE_ENABLED|nativeIcebergScanMetadata|CometIcebergNativeScanExec'
rg -n -C3 --type scala 'nativeLoadRequired|UnsatisfiedLinkError|isLoaded|load.*native|iceberg-rust'

# Locate the conversion/execution rules and the native scan operator
fd "CometScanRule|CometExecRule" -e scala
fd "CometIcebergNativeScanExec" -e scala

# Inspect the scan conversion path and the config definition
grep -n "transformV2Scan\|nativeIcebergCompatScan\|COMET_ICEBERG_NATIVE_ENABLED" \
  spark/src/main/scala/org/apache/comet/rules/CometScanRule.scala
sed -n '135,150p' common/src/main/scala/org/apache/comet/CometConf.scala

# Check native library loading and error handling paths
rg "UnsatisfiedLinkError|NoSuchLibraryException|CometNativeException" spark/src/main/scala/
cat common/src/main/java/org/apache/comet/NativeBase.java
```

Conclusion: the native Iceberg scan default should remain disabled until a runtime capability check is added. Changing the default to `true` without such a guard risks failures at execution time rather than a clean fallback to Spark's scan.
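The guard the reviewer is asking for can be illustrated with a small sketch in plain Python (hypothetical names; the actual Comet code is Scala/Java, and the real check would live in the scan-conversion rule): resolve the effective value of the flag only after confirming the native library actually loaded, and fall back otherwise.

```python
def effective_native_iceberg_enabled(
    user_setting,
    native_library_loaded,
    default=True,
):
    """Hypothetical resolution logic: honor an explicit user setting,
    but never enable the native scan when the native library failed
    to load -- fall back to Spark's scan instead of erroring at runtime."""
    requested = default if user_setting is None else user_setting
    return requested and native_library_loaded


# Default-on, library present: native scan is used
assert effective_native_iceberg_enabled(None, native_library_loaded=True)
# Default-on, library missing: guarded fallback instead of a runtime error
assert not effective_native_iceberg_enabled(None, native_library_loaded=False)
# Explicit opt-out always wins
assert not effective_native_iceberg_enabled(False, native_library_loaded=True)
```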
```diff

     val COMET_ICEBERG_DATA_FILE_CONCURRENCY_LIMIT: ConfigEntry[Int] =
       conf("spark.comet.scan.icebergNative.dataFileConcurrencyLimit")
```
**docs/source/user-guide/latest/iceberg.md**

```diff
@@ -29,8 +29,8 @@ then serialized to Comet's native execution engine (see
 [PR #2528](https://github.com/apache/datafusion-comet/pull/2528)).

 The example below uses Spark's package downloader to retrieve Comet 0.14.0 and Iceberg
-1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The key configuration
-to enable fully-native Iceberg is `spark.comet.scan.icebergNative.enabled=true`.
+1.8.1, but Comet has been tested with Iceberg 1.5, 1.7, 1.8, 1.9, and 1.10. The native Iceberg
+reader is enabled by default. To disable it, set `spark.comet.scan.icebergNative.enabled=false`.
```
Review comment (severity: medium) — docs/source/user-guide/latest/iceberg.md:L33: This section still contains a markdown link to a pull request a few lines above (the PR apache#2528 link), which conflicts with the new repository rule "Never link to any issue or a pull request in any GitHub repository" (Rule: AGENTS.md). Could you align this doc with that rule to avoid having the repo immediately violate its own guidance?
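The enabled-by-default semantics described in the doc change can be sketched as follows (plain Python with a hypothetical helper; Spark's real `SparkConf`/`SQLConf` machinery handles this internally): an explicitly supplied `--conf` value overrides the compiled-in default.

```python
def get_boolean_conf(conf, key, default):
    """Hypothetical lookup mirroring a Spark boolean config entry:
    an explicitly set value wins, otherwise the default applies."""
    value = conf.get(key)
    if value is None:
        return default
    return value.strip().lower() == "true"


KEY = "spark.comet.scan.icebergNative.enabled"

# No explicit setting: the native reader is on by default
assert get_boolean_conf({}, KEY, default=True) is True
# Explicit opt-out, as shown in the updated docs
assert get_boolean_conf({KEY: "false"}, KEY, default=True) is False
```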
```diff
@@ -43,7 +43,6 @@ $SPARK_HOME/bin/spark-shell \
   --conf spark.plugins=org.apache.spark.CometPlugin \
   --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
   --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
-  --conf spark.comet.scan.icebergNative.enabled=true \
   --conf spark.comet.explainFallback.enabled=true \
   --conf spark.memory.offHeap.enabled=true \
   --conf spark.memory.offHeap.size=2g
@@ -120,7 +119,6 @@ $SPARK_HOME/bin/spark-shell \
   --conf spark.plugins=org.apache.spark.CometPlugin \
   --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
   --conf spark.sql.extensions=org.apache.comet.CometSparkSessionExtensions \
-  --conf spark.comet.scan.icebergNative.enabled=true \
   --conf spark.comet.explainFallback.enabled=true \
   --conf spark.memory.offHeap.enabled=true \
   --conf spark.memory.offHeap.size=2g
```


Review comment (Gemini): The addition of `AGENTS.md` and `CLAUDE.md` files at the root of the repository is non-standard for an Apache project. These files appear to contain instructions for AI agents that are likely specific to the author's local environment or a particular tool. Furthermore, the instruction to "Never link to any issue or a pull request" is overly restrictive for an open-source project, where cross-referencing is essential for traceability and collaboration. If these files are not strictly required for the project's operation, they should be removed from the pull request.