
3793: feat: add audit-comet-expression Claude Code skill#54

Open
martin-augment wants to merge 3 commits into main from pr-3793-2026-04-01-13-01-48

Conversation

@martin-augment (Owner)

3793: To review by AI


coderabbitai bot commented Apr 1, 2026

Walkthrough

This pull request introduces a new skill documentation file and updates AI agent configuration files. The new .claude/skills/audit-comet-expression/SKILL.md file defines a structured workflow for auditing the Comet $ARGUMENTS expression, covering steps from locating Spark implementations across multiple versions through creating test coverage matrices and implementation gap reviews. Three configuration files (AGENTS.md, CLAUDE.md, and .cursor/rules.md) were updated to add constraints for AI agents during code reviews, specifying that they must not link to GitHub issues or pull requests and must not review AI agent configuration files.



claude bot commented Apr 1, 2026

Review of audit-comet-expression skill

This is a useful addition. A structured audit workflow that covers multiple Spark versions, serde, Rust, and test gaps is exactly the kind of thing that takes time to do ad hoc. Having it codified as a skill should save reviewers a lot of repeated work.

A few things worth looking at:

Cloning three Spark versions is expensive. Steps 1 and 2 unconditionally try to clone v3.4.3, v3.5.8, and v4.0.1 (roughly 300-600 MB per shallow clone). On a fresh machine this can take several minutes and several GB of disk. The if [ ! -d "$dir" ] guard helps on repeated runs, but the first run of the skill will be slow. It might be worth mentioning this up front in the skill description, or suggesting the user pre-clone the repos once.

The $ARGUMENTS grep pattern may miss some expressions. The search pattern "case class $ARGUMENTS\b\|object $ARGUMENTS\b" works for the common case, but some Spark expressions are defined as abstract classes, traits, or with slightly different casing (e.g., MathExpression subtypes). If the first search returns nothing, the skill falls back to searching all of sql/, which is good, but it might be worth noting that users should also try searching for extends $ARGUMENTS or with $ARGUMENTS if the class name search turns up empty.
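A hedged sketch of that fallback, using a throwaway file in place of the real Spark tree (the MathExpression name here is only illustrative):

```shell
# Throwaway file standing in for a Spark source tree
tmp=$(mktemp -d)
cat > "$tmp/Expr.scala" <<'EOF'
abstract class MathExpression(name: String)
case class Sqrt(child: Any) extends MathExpression("SQRT")
EOF

EXPR="MathExpression"
# Primary pattern from the skill: misses abstract classes and traits
primary=$(grep -l "case class ${EXPR}\b\|object ${EXPR}\b" "$tmp"/*.scala || true)
# Broadened fallback: also catches definition and subclassing sites
fallback=$(grep -l "class ${EXPR}\b\|extends ${EXPR}\b\|with ${EXPR}\b" "$tmp"/*.scala || true)

echo "primary:  ${primary:-<none>}"
echo "fallback: ${fallback:-<none>}"
rm -rf "$tmp"
```

The fallback pattern finds the abstract class where the primary pattern comes up empty.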

The DataFusion source lookup is fragile. Step 3 searches ~/.cargo/registry/src/ for the DataFusion implementation, which assumes cargo fetch or a prior build has already populated the local registry. This may not be true in all environments. A more reliable approach would be to look inside native/ directly (e.g., native/spark-expr/src/ and native/core/src/), which are already suggested earlier in the same step. The registry search could be dropped or moved to a fallback note.

The find ... | xargs grep pattern without -print0 | xargs -0 can silently fail on file paths that contain spaces. This is unlikely in practice for Spark and Comet source trees, but it is a latent issue. Using grep -r directly (as the Comet-side steps already do) avoids this entirely.
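The failure mode is easy to reproduce (illustrative paths only):

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/dir with space"
echo 'case class Foo' > "$tmp/dir with space/Foo.scala"

# Unsafe: the path is word-split into three bogus arguments, so nothing matches
unsafe=$(find "$tmp" -name "*.scala" | xargs grep -l "case class Foo" 2>/dev/null || true)
# Safe: NUL-delimited pipeline, or grep -r which never round-trips paths as text
safe_xargs=$(find "$tmp" -name "*.scala" -print0 | xargs -0 grep -l "case class Foo")
safe_grep=$(grep -rl --include="*.scala" "case class Foo" "$tmp")

rm -rf "$tmp"
echo "unsafe: ${unsafe:-<none>}"
echo "safe:   $safe_grep"
```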

The coverage matrix table is useful but the column headers could be clearer. "Spark tests it" as a column name is a bit ambiguous. Something like "Covered in Spark tests" would read more naturally.

Minor: the SQL file template uses $ARGUMENTS inside a comment block. The CREATE TABLE test_$ARGUMENTS and INSERT INTO test_$ARGUMENTS lines look fine as templates, but when the skill substitutes a multi-word or hyphenated expression name, the SQL table name would be invalid. It might be worth noting that the user should replace $ARGUMENTS with a valid SQL identifier (e.g., snake_cased) in the generated file.
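One possible sanitization step (the hyphenated expression name below is hypothetical):

```shell
# Hypothetical expression name as it might arrive in $ARGUMENTS
name="Array-Insert"
# Lowercase it and replace anything outside [a-z0-9] with '_' to get a valid SQL identifier
ident=$(printf '%s' "$name" | tr '[:upper:]' '[:lower:]' | sed 's/[^a-z0-9]/_/g')
table="test_${ident}"
echo "$table"
```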

Overall the structure is solid. The 7-step workflow, gap table, and offer to auto-generate tests are all good ideas. The issues above are mostly edge cases and minor UX polish.


augmentcode bot commented Apr 1, 2026

🤖 Augment PR Summary

Summary: Adds a new Claude skill to audit Comet expression implementations for correctness and test coverage.

Changes:

  • Introduces .claude/skills/audit-comet-expression/SKILL.md with a step-by-step workflow to compare Spark behavior across 3.4.3, 3.5.8, and 4.0.1 against Comet’s Scala serde and Rust/DataFusion execution.
  • Adds instructions to locate upstream Spark tests, enumerate edge cases, and perform a structured coverage gap analysis with a coverage matrix.
  • Provides a template and guidance for adding missing Comet SQL-file tests and how to run the relevant suite locally.
  • Adds AI agent rule/config files (.cursor/rules.md, AGENTS.md, CLAUDE.md) to constrain automated review behavior (for example, no linking to issues/PRs, and avoiding review of agent config files).

Technical Notes: The skill emphasizes shallow Spark clones to /tmp and prefers SQL file tests following the repository’s SQL file test format guidance.



@augmentcode augmentcode bot left a comment


Review completed. 3 suggestions posted.


```bash
dir="/tmp/spark-${tag}"
echo "=== $tag ==="
find "$dir/sql/catalyst/src/main/scala" -name "*.scala" | \
  xargs grep -l "case class $ARGUMENTS\b\|object $ARGUMENTS\b" 2>/dev/null
```

.claude/skills/audit-comet-expression/SKILL.md:43: The grep pattern uses \b as a word-boundary, but in default grep regex syntax \b is not a word-boundary, so this search can silently miss the target expression definition and derail the audit steps.

Severity: medium

Other Locations
  • .claude/skills/audit-comet-expression/SKILL.md:54


@martin-augment (Owner Author)


value:incorrect-but-reasonable; category:bug; feedback: The Augment AI reviewer is not correct! GNU grep does support \b. man grep says:

> The symbol \b matches the empty string at the edge of a word, and \B matches the empty string provided it's not at the edge of a word.
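A quick check of that behavior with GNU grep (the \< and \> anchors, or -w, are alternatives where \b is unavailable):

```shell
# \b must land on a word edge: 'class Sqrt' matches, 'class SqrtKind' does not
input='case class Sqrt(x)
case class SqrtKind(x)'
b_matches=$(printf '%s\n' "$input" | grep -c 'class Sqrt\b')
alt_matches=$(printf '%s\n' "$input" | grep -c 'class Sqrt\>')
echo "with \\b: $b_matches  with \\>: $alt_matches"
```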

```bash
# Find the serde object
grep -r "$ARGUMENTS" spark/src/main/scala/org/apache/comet/serde/ --include="*.scala" -l
grep -r "$ARGUMENTS" spark/src/main/scala/org/apache/comet/ --include="*.scala" -l | grep -v test
```


.claude/skills/audit-comet-expression/SKILL.md:115: Piping to grep -v test will exclude any path containing the substring test, which can unintentionally drop non-test source files (for example, paths with latest/testing in the name) and reduce audit coverage.

Severity: low


@martin-augment (Owner Author)


value:useful; category:bug; feedback: The Augment AI reviewer is correct! There is no need for grep -v test: the first/main grep already looks in the src/main folder, where no tests are expected.
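To illustrate the substring pitfall the reviewer flagged ("latest" contains "test"; the paths below are illustrative):

```shell
paths='spark/src/main/scala/latest/Foo.scala
spark/src/test/scala/FooSuite.scala'
# Substring filter drops BOTH paths, since "latest" contains "test"
naive=$(printf '%s\n' "$paths" | grep -v test || true)
# Matching the path component keeps the non-test file
scoped=$(printf '%s\n' "$paths" | grep -v '/test/')
echo "naive:  ${naive:-<none>}"
echo "scoped: $scoped"
```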

### Shims

```bash
find spark/src/main -name "CometExprShim.scala" | xargs grep -l "$ARGUMENTS" 2>/dev/null
```


.claude/skills/audit-comet-expression/SKILL.md:128: If find returns no CometExprShim.scala files, xargs can end up invoking grep with no file arguments, which can cause the command to block waiting on stdin and make the skill appear to hang.

Severity: low


@martin-augment (Owner Author)


value:incorrect-but-reasonable; category:bug; feedback: The Augment AI reviewer is not correct! DataFusion Comet provides CometExprShim.scala for every supported version of Apache Spark. Without this file the build won't work at all.
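For what it's worth, GNU xargs redirects the child's stdin to /dev/null, so a hang is unlikely there either; the subtler default is that the command still runs once on empty input, which -r (--no-run-if-empty) suppresses:

```shell
# Default GNU xargs: the command runs once even with no input at all
ran=$(printf '' | xargs echo ran)
# With -r the command is skipped entirely on empty input
skipped=$(printf '' | xargs -r echo ran)
echo "default: ${ran:-<none>}  with -r: ${skipped:-<none>}"
```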


@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
.claude/skills/audit-comet-expression/SKILL.md (1)

26-31: Consider adding error handling for git clone failures.

The script checks if the directory exists before cloning, but if git clone fails for reasons other than the directory existing (network issues, authentication, etc.), the error will be silently ignored and subsequent steps may fail.

🛡️ Proposed enhancement for error handling

```diff
 for tag in v3.4.3 v3.5.8 v4.0.1; do
   dir="/tmp/spark-${tag}"
   if [ ! -d "$dir" ]; then
-    git clone --depth 1 --branch "$tag" https://github.com/apache/spark.git "$dir"
+    git clone --depth 1 --branch "$tag" https://github.com/apache/spark.git "$dir" || {
+      echo "Failed to clone Spark $tag"
+      exit 1
+    }
   fi
 done
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.claude/skills/audit-comet-expression/SKILL.md around lines 26 - 31, The
clone loop over tags (for tag in v3.4.3 v3.5.8 v4.0.1) currently runs git clone
into dir="/tmp/spark-${tag}" without checking for failures; update the git clone
invocation so that you test its exit status (e.g., if ! git clone --depth 1
--branch "$tag" https://github.com/apache/spark.git "$dir"; then ...) and on
failure log a descriptive error mentioning "$tag" and the exit code, remove any
partially created "$dir" (rm -rf "$dir") to avoid corrupted state, and exit or
retry as appropriate so downstream steps don't proceed with a failed clone.
.cursor/rules.md (2)

1-2: Minor grammar inconsistency in the phrasing.

The phrase "any issue or a pull request" mixes quantifiers inconsistently. Consider using "any issue or any pull request" or "any issues or pull requests" for grammatical consistency.

📝 Suggested fix
-The AI agents should never link to any issue or a pull request
-in any GitHub repository in the code reviews!
+The AI agents should never link to any issue or pull request
+from any GitHub repository in code reviews.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor/rules.md around lines 1 - 2, Replace the inconsistent phrase "any
issue or a pull request" in .cursor/rules.md (search for the exact phrase "any
issue or a pull request") with a grammatically consistent alternative such as
"any issue or any pull request" or "any issues or pull requests" so the
quantifiers match; update the sentence in place preserving the original meaning.

1-5: Consider whether the prohibition should be absolute.

The rules prohibit linking to any GitHub issues or pull requests without exception. While this may be intentional, there could be legitimate cases where referencing related issues or PRs would be helpful (e.g., linking to the issue that introduced a bug being fixed, or referencing a related PR for context).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.cursor/rules.md around lines 1 - 5, The rule "The AI agents should never
link to any issue or a pull request in any GitHub repository in the code
reviews!" is too absolute; change the text in .cursor/rules.md to allow narrowly
scoped exceptions by replacing the absolute prohibition with a conditional rule
such as permitting links when they directly reference the specific issue/PR that
introduced the bug or provides necessary context and only when approved by a
human reviewer, and add a required justification field ("reason for linking") to
be included in the review comment; update the second rule entry "The AI agents
should not review AI agents' config files like CLAUDE.md or AGENTS.md!"
similarly if intended to be conditional (e.g., allow review if explicitly
requested by a maintainer) and add a brief example or note describing the
approval workflow.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 099c51f9-99ac-4f16-b8d1-92603510c28b

📥 Commits

Reviewing files that changed from the base of the PR and between 93eca2d and cedc6db.

📒 Files selected for processing (4)
  • .claude/skills/audit-comet-expression/SKILL.md
  • .cursor/rules.md
  • AGENTS.md
  • CLAUDE.md


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new skill, audit-comet-expression, designed to automate the auditing of Comet expressions by comparing them against Spark implementations across multiple versions. It also adds several configuration files—.cursor/rules.md, AGENTS.md, and CLAUDE.md—to establish guidelines for AI agent behavior. The review feedback identifies opportunities to optimize source code searches by avoiding inefficient registry scans, clarifies the use of directory placeholders for test organization, and corrects the Maven command syntax for executing specific SQL file tests.

```bash
# Check if there's a DataFusion built-in function with this name
find native/ -name "Cargo.lock" -exec grep -A2 "datafusion" {} \; | grep "version" | head -5
grep -r "$ARGUMENTS" ~/.cargo/registry/src/ --include="*.rs" -l 2>/dev/null | head -10
```


medium

Searching the entire ~/.cargo/registry/src/ directory is highly inefficient and unreliable. The source code for dependencies may not be present in the registry cache, or multiple versions might exist. It is better to focus on the local native/ directory or use cargo metadata to find the exact source path of the datafusion dependency if needed.

Suggested change

```diff
-grep -r "$ARGUMENTS" ~/.cargo/registry/src/ --include="*.rs" -l 2>/dev/null | head -10
+grep -r "$ARGUMENTS" native/ --include="*.rs" -l | head -n 20
```

@martin-augment (Owner Author)


value:useful; category:bug; feedback: The Gemini AI reviewer is correct! There is no point in grepping in the local Cargo registry. This folder contains all dependencies ever fetched by all locally built projects. It may match totally unrelated things. It would be better to use an environment variable pointing to the local clone of DataFusion repo.

>
> - [list each missing test case]
>
> I can add them as SQL file tests in `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The path contains a <category> placeholder that is not defined. The AI should be instructed to determine the appropriate category (e.g., datetime, math, string) based on the expression being audited to ensure the test file is placed in the correct directory.

Suggested change

```diff
-> I can add them as SQL file tests in `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql`
+> I can add them as SQL file tests in `spark/src/test/resources/sql-tests/expressions/<category>/$ARGUMENTS.sql`
+> (where `<category>` is the appropriate group such as `datetime`, `math`, or `string`).
```

After implementing tests, tell the user how to run them:

```bash
./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
```


medium

The Maven command for running a specific SQL file test appears to be incorrect. org.apache.comet.CometSqlFileTestSuite $ARGUMENTS is not a valid class name because of the space and the appended argument. In Comet, you typically run the suite and filter tests using specific properties or test name patterns.

Suggested change

```diff
-./mvnw test -Dsuites="org.apache.comet.CometSqlFileTestSuite $ARGUMENTS" -Dtest=none
+./mvnw test -Dtest=CometSqlFileTestSuite -DwildcardSuites="*$ARGUMENTS*"
```

@martin-augment (Owner Author)


value:useful; category:bug; feedback: The Gemini AI reviewer is correct! Passing "$ARGUMENTS" as space separated argument to -Dsuites won't work. -DwildcardSuites=... is the correct way in ScalaTest.


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


```bash
# Check if there's a DataFusion built-in function with this name
find native/ -name "Cargo.lock" -exec grep -A2 "datafusion" {} \; | grep "version" | head -5
grep -r "$ARGUMENTS" ~/.cargo/registry/src/ --include="*.rs" -l 2>/dev/null | head -10
```


Searching home directory cargo cache is unreliable

Medium Severity

The skill searches ~/.cargo/registry/src/ to find the DataFusion implementation, but this path may not exist, may be empty, or may contain a different version of DataFusion than the project actually depends on (per Cargo.lock). This can cause the agent to analyze the wrong DataFusion code, producing a misleading audit. The correct approach is to resolve the exact DataFusion version from the project's Cargo.lock.


@martin-augment (Owner Author)


value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! There is no point in grepping in the local Cargo registry. This folder contains all dependencies ever fetched by all locally built projects. It may match totally unrelated things. It would be better to use an environment variable pointing to the local clone of DataFusion repo.
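One hedged sketch of resolving the pinned version first, so any registry lookup is at least scoped to the crate the project actually uses (the lockfile fragment below is illustrative; in the skill this would be native/Cargo.lock):

```shell
# Illustrative Cargo.lock fragment standing in for native/Cargo.lock
lock=$(mktemp)
cat > "$lock" <<'EOF'
[[package]]
name = "datafusion"
version = "43.0.0"
source = "registry+https://github.com/rust-lang/crates.io-index"
EOF

# Pull the exact pinned version out of the lockfile
ver=$(grep -A1 '^name = "datafusion"$' "$lock" | grep '^version' | head -1 | cut -d'"' -f2)
rm -f "$lock"
echo "datafusion $ver"
# A targeted registry lookup (if used at all) could then be scoped to:
#   ~/.cargo/registry/src/*/datafusion-$ver/
```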

@martin-augment (Owner Author)

> The DataFusion source lookup is fragile. Step 3 searches ~/.cargo/registry/src/ for the DataFusion implementation, which assumes cargo fetch or a prior build has already populated the local registry. This may not be true in all environments. A more reliable approach would be to look inside native/ directly (e.g., native/spark-expr/src/ and native/core/src/), which are already suggested earlier in the same step. The registry search could be dropped or moved to a fallback note.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! There is no point in grepping in the local Cargo registry. This folder contains all dependencies ever fetched by all locally built projects. It may match totally unrelated things. It would be better to use an environment variable pointing to the local clone of DataFusion repo.

@martin-augment (Owner Author)

> 26-31: Consider adding error handling for git clone failures.
>
> The script checks if the directory exists before cloning, but if git clone fails for reasons other than the directory existing (network issues, authentication, etc.), the error will be silently ignored and subsequent steps may fail.

value:good-to-have; category:bug; feedback: The CodeRabbit AI reviewer is correct! Something should tell ClaudeCode to stop if the cloning fails. The best would be to use Bash's set -e, so any error will lead to full stop of the session.
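A minimal demonstration of the set -e semantics being suggested:

```shell
# Without set -e the script continues past a failing command; with it, execution stops
out_plain=$(bash -c 'false; echo reached')
out_strict=$(bash -c 'set -e; false; echo reached' || true)
echo "plain:  ${out_plain:-<none>}"
echo "strict: ${out_strict:-<none>}"
```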
