3842: chore: native_datafusion to report scan task input metrics #51

martin-augment wants to merge 3 commits into main

Conversation
Walkthrough

This change integrates native Comet execution metrics into Spark's task metrics system. It modifies …
Code Review
This pull request implements the reporting of native task-level input metrics, specifically bytes read and records read, from Comet to Spark. While this is a valuable addition, the current implementation only retrieves metrics from the root node of the native metrics tree. This leads to incorrect reporting in queries involving filters or projections where the scan node is not the root. Feedback suggests traversing the metrics tree to aggregate these values and adding more comprehensive test cases, such as queries with filters, to verify metric accuracy in non-trivial execution plans.
```scala
nativeMetrics.metrics
  .get("bytes_scanned")
  .foreach(m => ctx.taskMetrics().inputMetrics.setBytesRead(m.value))
nativeMetrics.metrics
  .get("output_rows")
  .foreach(m => ctx.taskMetrics().inputMetrics.setRecordsRead(m.value))
```
The current implementation only retrieves metrics from the root node of the nativeMetrics tree. In most Comet execution plans (e.g., Scan -> Filter -> Project), the root node will be an operator like Project or Filter, which does not contain the bytes_scanned metric. As a result, bytesRead will be reported as 0 in the Spark UI for these queries. Furthermore, output_rows at the root node reflects the final result count, which may not match the number of records read from the source if a filter was applied.
To correctly report task-level input metrics, you should traverse the nativeMetrics tree and aggregate bytes_scanned and output_rows from all nodes that represent a scan (typically identified by the presence of the bytes_scanned metric). Consider adding a recursive helper method to CometMetricNode to perform this aggregation.
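As an illustration, the suggested traversal could look like the sketch below. It runs over a simplified stand-in node type, since CometMetricNode's real fields are not shown in this review; the MetricNode class, its field names, and the aggregateScanInput helper are all hypothetical.

```scala
// Hypothetical stand-in for CometMetricNode, reduced to the pieces the
// traversal needs: a metric map and child nodes (names are assumptions).
case class MetricNode(metrics: Map[String, Long], children: Seq[MetricNode])

// Recursively aggregate (bytes_scanned, output_rows) over all nodes that
// look like scans, i.e. those exposing the scan-specific bytes_scanned metric.
def aggregateScanInput(node: MetricNode): (Long, Long) = {
  val self =
    if (node.metrics.contains("bytes_scanned"))
      (node.metrics("bytes_scanned"), node.metrics.getOrElse("output_rows", 0L))
    else
      (0L, 0L)
  node.children.map(aggregateScanInput).foldLeft(self) {
    case ((bytes, rows), (childBytes, childRows)) =>
      (bytes + childBytes, rows + childRows)
  }
}
```

With a Scan -> Filter plan, the root Filter contributes (0, 0) and the scan child contributes its own counts, so the aggregate reflects what was actually read from the source rather than the post-filter row count.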
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Because it runs in CometExecRDD, this metrics collection executes for all kinds of nodes, not just the Scan ones. It would be better to move this logic to CometNativeScanExec#doExecuteColumnar() so that only the Scan-related metrics are collected.
```scala
  }
}

test("native_datafusion scan reports task-level input metrics matching Spark") {
```
This test case only covers a simple scan where the scan node is the root of the native plan. To ensure that input metrics are correctly reported in more realistic scenarios, consider adding a test case that includes a filter (e.g., sql("SELECT * FROM tbl WHERE _1 > 5000")). This will verify that metrics are correctly aggregated even when the scan node is not the root of the execution tree.
value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Using a more complex SQL query (e.g. with a Filter) will expose the problem that metrics are collected on all CometExecRDDs. Collection should happen only for the Scan nodes, i.e. the ones created by CometNativeScanExec#doExecuteColumnar().
Review:
🧹 Nitpick comments (2)
spark/src/test/scala/org/apache/spark/sql/comet/CometTaskMetricsSuite.scala (1)

104-106: Make Comet enablement explicit in the native-scan run.

The second metrics collection currently depends on the ambient default for CometConf.COMET_ENABLED. Setting it explicitly makes this test deterministic across suite/config changes.

✅ Proposed tweak

```diff
-    val (cometBytes, cometRecords) = collectInputMetrics(
-      CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION)
+    val (cometBytes, cometRecords) = collectInputMetrics(
+      CometConf.COMET_ENABLED.key -> "true",
+      CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark/src/test/scala/org/apache/spark/sql/comet/CometTaskMetricsSuite.scala` around lines 104-106: The test relies on ambient Comet enablement; make it deterministic by explicitly enabling Comet when calling collectInputMetrics: pass CometConf.COMET_ENABLED.key -> "true" (or the appropriate boolean/string form used elsewhere) alongside CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION so the call to collectInputMetrics explicitly sets COMET_ENABLED for the native-scan run.

spark/src/main/scala/org/apache/spark/sql/comet/CometExecRDD.scala (1)
143-148: Extract metric-name literals to shared constants.

Inline "bytes_scanned"/"output_rows" literals are easy to let drift over time. Please centralize these keys (e.g., in the CometMetricNode or CometExecRDD companion object) and reference the constants at call sites.

♻️ Proposed refactor

```diff
--- a/spark/src/main/scala/org/apache/spark/sql/comet/CometExecRDD.scala
+++ b/spark/src/main/scala/org/apache/spark/sql/comet/CometExecRDD.scala
@@
-      nativeMetrics.metrics
-        .get("bytes_scanned")
+      nativeMetrics.metrics
+        .get(CometExecRDD.NativeBytesScannedMetric)
         .foreach(m => ctx.taskMetrics().inputMetrics.setBytesRead(m.value))
       nativeMetrics.metrics
-        .get("output_rows")
+        .get(CometExecRDD.NativeOutputRowsMetric)
         .foreach(m => ctx.taskMetrics().inputMetrics.setRecordsRead(m.value))
@@
 object CometExecRDD {
+  private val NativeBytesScannedMetric = "bytes_scanned"
+  private val NativeOutputRowsMetric = "output_rows"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark/src/main/scala/org/apache/spark/sql/comet/CometExecRDD.scala` around lines 143 - 148, The code in CometExecRDD uses inline metric-name literals "bytes_scanned" and "output_rows"; extract these into shared constants (e.g., add val BytesScannedMetric = "bytes_scanned" and val OutputRowsMetric = "output_rows" in the CometExecRDD companion object or CometMetricNode) and replace the literal usages in CometExecRDD.scala (the get(...) calls that update ctx.taskMetrics().inputMetrics) to reference those constants so keys are centralized and less error-prone.
📒 Files selected for processing (3)

- spark/src/main/scala/org/apache/spark/sql/comet/CometExecRDD.scala
- spark/src/main/scala/org/apache/spark/sql/comet/CometMetricNode.scala
- spark/src/test/scala/org/apache/spark/sql/comet/CometTaskMetricsSuite.scala
🤖 Augment PR Summary

Summary: This PR wires native DataFusion scan metrics into Spark task-level …

Changes: …

Technical Notes: Metrics are pulled from the native …
```scala
  .foreach(m => ctx.taskMetrics().inputMetrics.setBytesRead(m.value))
nativeMetrics.metrics
  .get("output_rows")
  .foreach(m => ctx.taskMetrics().inputMetrics.setRecordsRead(m.value))
```
output_rows is part of the baseline native metrics for many non-scan operators, so setting taskMetrics.inputMetrics.recordsRead from it here can misreport or overwrite Spark’s input metrics for non-scan stages. Consider gating this update so it only applies when the native metrics actually represent scan input (e.g., tied to scan-specific metrics like bytes_scanned).
Severity: medium
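A minimal sketch of that gating, written over a plain metric map as a stand-in for nativeMetrics.metrics (the helper name and the map type are assumptions, not Comet's actual API):

```scala
// Return input metrics only when the node actually represents scan input,
// keyed off the scan-specific bytes_scanned metric; None means "leave
// Spark's inputMetrics untouched".
def scanInputMetrics(metrics: Map[String, Long]): Option[(Long, Long)] =
  metrics.get("bytes_scanned").map { bytes =>
    (bytes, metrics.getOrElse("output_rows", 0L))
  }
```

The caller would then invoke setBytesRead/setRecordsRead only when the Option is defined, so a Project or Filter node's output_rows can no longer overwrite the task's input metrics.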
value:useful; category:bug; feedback: The Augment AI reviewer is correct! Because it runs in CometExecRDD, this metrics collection executes for all kinds of nodes, not just the Scan ones. It would be better to move this logic to CometNativeScanExec#doExecuteColumnar() so that only the Scan-related metrics are collected.
value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! The test works fine because it uses the default value of …
value:useful; category:bug; feedback: The Claude AI reviewer is correct! Because it runs in CometExecRDD, this metrics collection executes for all kinds of nodes, not just the Scan ones. It would be better to move this logic to CometNativeScanExec#doExecuteColumnar() so that only the Scan-related metrics are collected.