
3708: feat: expose comet metrics through Spark's external monitoring system #48

Open
martin-augment wants to merge 1 commit into main from pr-3708-2026-03-16-14-32-34

Conversation

@martin-augment
Owner

3708: To review by AI

@coderabbitai

coderabbitai bot commented Mar 16, 2026

Walkthrough

This pull request introduces metrics collection capabilities for Comet integration with Spark. A new CometSource object is created to expose Dropwizard/Codahale metrics including counters for native operators, Spark operators, planned queries, and transitions, plus a gauge for acceleration ratio. Coverage statistics are computed during plan transformation in CometExecRule and recorded via CometSource. The metrics source is registered during plugin initialization. A shell script pattern is updated to allow CometSource classes in jars, and a test case validates that metrics are properly recorded.



@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances Comet's observability by integrating a dedicated metrics source (CometSource) into Spark's monitoring framework. This allows users to gain deeper insights into Comet's runtime behavior, specifically how many operations are executed natively versus falling back to Spark, the number of queries processed, and the efficiency of transitions, ultimately providing a clearer picture of Comet's acceleration benefits.

Highlights

  • New Metrics Source: Introduced CometSource to expose Comet-specific performance metrics through Spark's external monitoring system.
  • Metric Tracking: The new source tracks key metrics such as native and Spark operator counts, total planned queries, Spark-to-Comet transitions, and an acceleration ratio.
  • Integration with Spark: The CometDriverPlugin was updated to register CometSource with Spark's MetricsSystem, ensuring metrics are collected.
  • Coverage Stats Recording: The CometExecRule now records Comet coverage statistics for each plan, feeding data into the CometSource.
  • Test Coverage: A new test case was added in CometPluginsSuite to verify that CometSource metrics are correctly recorded during query execution.
Changelog
  • dev/ensure-jars-have-correct-contents.sh
    • Added CometSource to the list of allowed expressions for JAR content validation.
  • spark/src/main/scala/org/apache/comet/ExtendedExplainInfo.scala
    • Introduced a forPlan method within the CometCoverageStats object to compute coverage statistics for a given Spark plan.
  • spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala
    • Modified to import CometSource and CometCoverageStats, and to record coverage statistics for each transformed Spark plan.
  • spark/src/main/scala/org/apache/spark/CometSource.scala
    • Created a new file defining the CometSource object, which implements Spark's Source trait to expose Comet-specific metrics.
  • spark/src/main/scala/org/apache/spark/Plugins.scala
    • Updated the CometDriverPlugin to register the newly created CometSource with Spark's metrics system during initialization.
  • spark/src/test/scala/org/apache/spark/CometPluginsSuite.scala
    • Added a new test case to verify that the CometSource correctly records and updates metrics after query execution.
Activity
  • No human activity has been recorded on this pull request yet. It is currently awaiting review.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@claude

claude bot commented Mar 16, 2026

Code Review: feat: expose comet metrics through Spark's external monitoring system

Overall this is a well-structured feature that hooks nicely into Spark's existing metrics infrastructure. The implementation is clean and the test coverage is a good start. I have several issues to raise, ranging from a significant correctness bug to minor style points.


Critical Issue: Double-counting with AQE and dual rule application

File: spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala

CometExecRule is registered in two places in CometSparkSessionExtensions:

  1. As injectQueryStagePrepRule — called once per AQE query stage re-planning
  2. As preColumnarTransitions — called on the full plan

This means CometSource.recordStats(stats) will be called multiple times for the same logical query, causing all four counters (NATIVE_OPERATORS, SPARK_OPERATORS, TRANSITIONS, QUERIES_PLANNED) to be inflated. For a simple query under AQE with multiple stages, each stage replanning fires the rule again, so the cumulative counts will be significantly higher than reality.

The derived acceleration.ratio Gauge reads the live counter values, so it will correctly reflect the ratio at any point in time (since both numerator and denominator are inflated equally), but the absolute counts for operators.native, operators.spark, queries.planned, and transitions will be inaccurate.

A fix should track whether stats have already been recorded for a given plan root (e.g., tag the plan, or only record in one of the two injection sites), or move the recording to a different hook that fires exactly once per user-visible query.
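One way to implement the "record once" guard is a plan tag, sketched below. `CometMetricsGuard` and the tag name are hypothetical, not part of the PR, and note that AQE re-planning creates new plan instances, so a tag on the root alone may not survive all re-optimization cycles:

```scala
import org.apache.spark.sql.catalyst.trees.TreeNodeTag
import org.apache.spark.sql.execution.SparkPlan

object CometMetricsGuard {
  // Hypothetical tag marking a plan root whose stats were already recorded.
  private val STATS_RECORDED = TreeNodeTag[Boolean]("comet.statsRecorded")

  // Run `record` at most once per plan root, even if the rule fires again
  // on the same instance via both injection points.
  def recordOnce(plan: SparkPlan)(record: => Unit): Unit = {
    if (!plan.getTagValue(STATS_RECORDED).contains(true)) {
      record
      plan.setTagValue(STATS_RECORDED, true)
    }
  }
}
```

Restricting recording to a single injection site (e.g. only `preColumnarTransitions`) is the simpler alternative if per-instance tagging proves fragile under AQE.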


Moderate Issue: CometCoverageStats.forPlan traverses the tree a second time when explain fallback is also enabled

File: spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala, lines 393–406

When COMET_EXPLAIN_FALLBACK_ENABLED is true, the code now walks the plan tree twice: once in CometCoverageStats.forPlan (for metrics) and once inside info.generateExtendedInfo(newPlan) (which calls generateTreeString internally). For large plans this is an unnecessary overhead. The generateExtendedInfo call already produces a CometCoverageStats instance internally — consider sharing that result rather than doing two independent traversals.
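A sketch of sharing one traversal between the explain path and the metrics path; `generateExtendedInfoWithStats` is an assumed new helper that returns both results, not an existing API:

```scala
// Assumed helper returning the explain text together with the stats it
// already computed, so the plan tree is walked only once.
val (explainText, stats): (String, CometCoverageStats) =
  if (CometConf.COMET_EXPLAIN_FALLBACK_ENABLED.get()) {
    new ExtendedExplainInfo().generateExtendedInfoWithStats(newPlan)
  } else {
    // No explain output requested: compute stats alone, as today.
    ("", CometCoverageStats.forPlan(newPlan))
  }
CometSource.recordStats(stats)
if (explainText.nonEmpty) logInfo(s"Comet fallback info:\n$explainText")
```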


Moderate Issue: CometSource is a global singleton with unbounded, ever-increasing counters

File: spark/src/main/scala/org/apache/spark/CometSource.scala

Dropwizard Counters only go up. There is no way for a Spark application or external monitoring system to distinguish "rate of native operators per second" from the raw cumulative total without doing their own delta computation. Consider documenting this clearly in the scaladoc, or — if the intent is to expose rates — using a Meter instead of a Counter.
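If rates are the goal, a Dropwizard `Meter` tracks both the cumulative count and moving-average rates. A minimal sketch, assuming the same registry `CometSource` uses (the metric name below is illustrative):

```scala
import com.codahale.metrics.{Meter, MetricRegistry}

val registry = new MetricRegistry()

// A Meter keeps the cumulative count plus mean/1m/5m/15m rates, so
// monitoring consumers need no delta computation of their own.
val nativeOperators: Meter =
  registry.meter(MetricRegistry.name("comet", "operators", "native"))

// Record 12 natively-planned operators for one query.
nativeOperators.mark(12)
assert(nativeOperators.getCount == 12)
```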

Also, if the CometSource singleton is never reset between unit tests in the same JVM, tests that run in sequence may see stale counts from previous tests. The test currently guards against this correctly by capturing a baseline (nativeBefore, queriesBefore) before running the query, but other future tests that depend on exact counter values could be affected.


Minor Issue: Inconsistent val annotation on public counter fields

File: spark/src/main/scala/org/apache/spark/CometSource.scala, lines 40–44

NATIVE_OPERATORS has an explicit type annotation (Counter) but SPARK_OPERATORS, QUERIES_PLANNED, and TRANSITIONS do not. For consistency and API clarity, all public fields should have explicit type annotations.

// Current (inconsistent)
val NATIVE_OPERATORS: Counter = ...
val SPARK_OPERATORS = ...
val QUERIES_PLANNED = ...
val TRANSITIONS = ...

// Suggested
val NATIVE_OPERATORS: Counter = ...
val SPARK_OPERATORS: Counter = ...
val QUERIES_PLANNED: Counter = ...
val TRANSITIONS: Counter = ...

Minor Issue: CometCoverageStats.forPlan allocates a StringBuilder only to discard it

File: spark/src/main/scala/org/apache/comet/ExtendedExplainInfo.scala, lines 200–211

generateTreeString is designed for producing explain output and unconditionally appends to a StringBuilder. forPlan passes a freshly-allocated StringBuilder and throws it away. This is a functional but slightly wasteful design — worth noting in the scaladoc, or refactoring generateTreeString to accept an Option[StringBuilder] so callers that only want stats can avoid the allocation and string-building overhead.
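One possible shape of the `Option[StringBuilder]` variant; the signature and helpers below are hypothetical, shown only to illustrate how stats-only callers would skip the string work:

```scala
private def generateTreeString(
    node: TreeNode[_],
    depth: Int,
    lastChildren: Seq[Boolean],
    indent: Int,
    out: Option[StringBuilder], // None => stats-only traversal
    stats: CometCoverageStats): Unit = {
  // Classify the node and update `stats` exactly as today.
  classifyAndCount(node, stats) // assumed existing classification logic
  // Only build explain text when a buffer was supplied.
  out.foreach(_.append(describeNode(node, depth, lastChildren, indent)))
  node.children.foreach { child =>
    generateTreeString(child, depth + 1, lastChildren :+ false, indent, out, stats)
  }
}
```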


Minor Issue: CometSource object is in the org.apache.spark package but imports from org.apache.comet

File: spark/src/main/scala/org/apache/spark/CometSource.scala, line 27

The file imports org.apache.comet.CometCoverageStats, creating a dependency from the org.apache.spark package back into org.apache.comet. The existing Plugins.scala already uses this pattern (it imports org.apache.comet.CometSparkSessionExtensions) and the project places files in org.apache.spark specifically to access package-private Spark APIs, so this is an accepted convention here. Still worth a brief comment in the file to explain why CometSource is in the spark package rather than comet.


Test Coverage Gap: No test for SPARK_OPERATORS or TRANSITIONS counters

File: spark/src/test/scala/org/apache/spark/CometPluginsSuite.scala, lines 82–98

The test verifies QUERIES_PLANNED and NATIVE_OPERATORS increment but does not assert on SPARK_OPERATORS or TRANSITIONS. Consider adding assertions (or at minimum a check that SPARK_OPERATORS.getCount >= 0 && TRANSITIONS.getCount >= 0) to guard against future regressions. Additionally, there is no test covering the acceleration.ratio gauge computation, including the edge case where total == 0 (which returns 0.0).
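A hedged test sketch for the gauge's zero-total edge case; the registry accessor and metric name are assumptions about the PR's naming, and the zero-count assertion only holds in a JVM where no queries have run yet:

```scala
test("acceleration.ratio gauge is registered and handles zero totals") {
  // Assumes CometSource exposes its MetricRegistry and registers the gauge
  // under "acceleration.ratio"; adjust names to the actual implementation.
  val gauge = CometSource.metricRegistry.getGauges.get("acceleration.ratio")
  assert(gauge != null, "acceleration.ratio gauge should be registered")
  // With total == 0 the gauge must yield 0.0, not NaN from division by zero.
  assert(gauge.getValue == 0.0)
}
```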


Test Coverage Gap: No test verifying CometSource is registered with the MetricsSystem

The test validates the counter values directly through the singleton object, but does not verify that CometSource was actually registered with sc.env.metricsSystem (the path that external monitoring tools use). A test using spark.sparkContext.env.metricsSystem to confirm the source is registered under the "comet" name would provide stronger end-to-end confidence.
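Such a test could look like the sketch below. `MetricsSystem.getSourcesByName` and `SparkContext.env` are real but `private[spark]`, which is fine here since this suite already lives in the `org.apache.spark` package:

```scala
test("CometSource is registered with the driver MetricsSystem") {
  val sources = spark.sparkContext.env.metricsSystem.getSourcesByName("comet")
  assert(sources.nonEmpty, "expected a metrics source registered as 'comet'")
  // The registered instance should be the singleton the rule writes into.
  assert(sources.head eq CometSource)
}
```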


Summary

Severity | Issue
---------|------
Critical | Double-counting due to dual injection of CometExecRule (AQE + preColumnarTransitions)
Moderate | Redundant tree traversal when COMET_EXPLAIN_FALLBACK_ENABLED=true
Moderate | Unbounded counters with no documentation about cumulative vs. rate semantics
Minor | Inconsistent type annotations on public Counter fields
Minor | StringBuilder allocated and discarded in forPlan
Minor | Missing comment explaining why CometSource lives in the org.apache.spark package
Minor | Test does not cover SPARK_OPERATORS, TRANSITIONS, or the acceleration.ratio gauge

The critical double-counting issue should be resolved before merging. The remaining items are improvements but not blockers.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces Comet metrics through Spark's external monitoring system. It includes changes to the build script to allow the new CometSource class, adds the CometSource class itself, modifies the CometExecRule to record coverage stats, and updates the CometDriverPlugin to register the Comet metrics source. A new test case is also added to verify that CometSource metrics are recorded.

Comment on lines +394 to +395
val stats = CometCoverageStats.forPlan(newPlan)
CometSource.recordStats(stats)

high

It's important to ensure that CometSource.recordStats(stats) is called only when newPlan is actually a Comet plan. Otherwise, the metrics might be misleading, as they would include Spark operators as well.

Suggested change
val stats = CometCoverageStats.forPlan(newPlan)
CometSource.recordStats(stats)
if (newPlan.isInstanceOf[CometPlan]) {
val stats = CometCoverageStats.forPlan(newPlan)
CometSource.recordStats(stats)
}

allowed_expr+="|^org/apache/spark/sql/ExtendedExplainGenerator.*$"
allowed_expr+="|^org/apache/spark/CometPlugin.class$"
allowed_expr+="|^org/apache/spark/CometDriverPlugin.*$"
allowed_expr+="|^org/apache/spark/CometSource.*$"

medium

The regex CometSource.*$ allows any character after CometSource, but it seems like only the .class extension is expected. It is better to be more specific to avoid unexpected inclusions.

Suggested change
allowed_expr+="|^org/apache/spark/CometSource.*$"
allowed_expr+="|^org/apache/spark/CometSource.class$"

@augmentcode

augmentcode bot commented Mar 16, 2026

🤖 Augment PR Summary

Summary: Wires Comet plan-coverage/acceleration stats into Spark’s metrics system so external monitoring can track Comet usage.

Changes:

  • Adds org.apache.spark.CometSource (counters + acceleration.ratio gauge) and registers it from CometDriverPlugin.
  • Introduces CometCoverageStats.forPlan and invokes it from CometExecRule to update metrics during planning.
  • Updates the jar allowlist and adds a plugin-suite test verifying the counters increment after running a query.



@augmentcode augmentcode bot left a comment


Review completed. 4 suggestions posted.


var newPlan = transform(planWithJoinRewritten)

// Record coverage stats for metrics
val stats = CometCoverageStats.forPlan(newPlan)

Since CometExecRule is injected both as a QueryStagePrepRule and as a columnar preColumnarTransitions rule, this recordStats call may run multiple times per query/stage and inflate queries.planned/operator counters. Consider guarding so stats are recorded once per final plan (or only when the rule actually changes the plan).

Severity: medium


0,
Seq(),
0,
new StringBuilder(),

Despite the Scaladoc, this still builds the full tree output into a StringBuilder via generateTreeString, which could allocate a large buffer on every planning pass. If the intent is stats-only, consider a traversal that updates CometCoverageStats without appending to a string.

Severity: medium



/**
* Exposes following metrics (hooked from CometCoverageStats)
* - operators.native: Total operators executed natively

These metrics are derived from CometCoverageStats over the planned SparkPlan, so phrasing like “operators executed natively” / “fell back” reads like runtime execution and may be misleading. Consider clarifying the doc/metric semantics (planned operators vs executed operators).

Severity: low


}
}

test("CometSource metrics are recorded") {

This test only checks the in-process CometSource counters, so it can pass even if CometSource isn’t actually registered with sc.env.metricsSystem (i.e., external monitoring can’t see it). Consider asserting the metrics registry/source contains the expected names to validate the exposure path.

Severity: low



@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
spark/src/main/scala/org/apache/comet/ExtendedExplainInfo.scala (1)

200-209: Avoid building an unused explain string in stats-only path.

Line 203-209 computes coverage by calling generateTreeString, but the generated text is discarded. A stats-only traversal would remove unnecessary allocation/work.

♻️ Refactor sketch
 object CometCoverageStats {
   /**
    * Compute coverage stats for a plan without generating explain string.
    */
   def forPlan(plan: SparkPlan): CometCoverageStats = {
     val stats = new CometCoverageStats()
     val explainInfo = new ExtendedExplainInfo()
-    explainInfo.generateTreeString(
-      CometExplainInfo.getActualPlan(plan),
-      0,
-      Seq(),
-      0,
-      new StringBuilder(),
-      stats)
+    explainInfo.collectCoverageStats(
+      CometExplainInfo.getActualPlan(plan),
+      stats)
     stats
   }
 }
// Add in ExtendedExplainInfo (stats-only traversal, no string appends)
private[comet] def collectCoverageStats(node: TreeNode[_], planStats: CometCoverageStats): Unit = {
  // same node classification logic as generateTreeString(...)
  // recurse through innerChildren and children
}


📥 Commits

Reviewing files that changed from the base of the PR and between 20a9b40 and c0eb149.

📒 Files selected for processing (6)
  • dev/ensure-jars-have-correct-contents.sh
  • spark/src/main/scala/org/apache/comet/ExtendedExplainInfo.scala
  • spark/src/main/scala/org/apache/comet/rules/CometExecRule.scala
  • spark/src/main/scala/org/apache/spark/CometSource.scala
  • spark/src/main/scala/org/apache/spark/Plugins.scala
  • spark/src/test/scala/org/apache/spark/CometPluginsSuite.scala


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.



// Record coverage stats for metrics
val stats = CometCoverageStats.forPlan(newPlan)
CometSource.recordStats(stats)

Metrics inflated by multiple rule invocations with AQE

Medium Severity

CometExecRule is registered both via injectColumnar (as preColumnarTransitions) and via injectQueryStagePrepRule. With AQE enabled (the default), _apply is invoked multiple times per query — once from each injection point during re-optimization cycles. Each invocation calls CometSource.recordStats, inflating all counters including QUERIES_PLANNED, NATIVE_OPERATORS, and SPARK_OPERATORS. The acceleration.ratio gauge may also be skewed since operator counts from different plan stages have different compositions.



// Record coverage stats for metrics
val stats = CometCoverageStats.forPlan(newPlan)
CometSource.recordStats(stats)

Stats captured before temporary placeholder nodes are removed

Low Severity

CometCoverageStats.forPlan(newPlan) is called before CometScanWrapper and CometSinkPlaceHolder are removed from the plan. Both extend CometPlan, so they are counted as cometOperators. When removed, CometScanWrapper is replaced by its wrapped Spark originalPlan (which was never counted), and CometSinkPlaceHolder is replaced by its already-counted child. This inflates cometOperators and deflates sparkOperators relative to the final plan.
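A minimal sketch of the suggested ordering fix; `stripTransitions` is a hypothetical name for whatever existing pass removes `CometScanWrapper` and `CometSinkPlaceHolder`:

```scala
// Apply the placeholder-removal pass first, then measure the plan that
// will actually run, so CometPlan counts match the final plan shape.
val finalPlan = stripTransitions(newPlan) // hypothetical existing pass
val stats = CometCoverageStats.forPlan(finalPlan)
CometSource.recordStats(stats)
```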


