3858: chore: fix native shuffle for thin batches #53

martin-augment wants to merge 5 commits into main
Conversation
Walkthrough: This pull request adds guidelines to three AI agent configuration files (.cursor/rules.md, AGENTS.md, CLAUDE.md).
Review: Fix native shuffle for thin batches

This PR fixes a crash in the native shuffle path when dealing with zero-column batches, which can arise during column-pruned COUNT queries.

What the fix does

The change in `partitioned_batch_iterator.rs` builds zero-column batches via `RecordBatch::try_new_with_options` with an explicit row count, since `interleave_record_batch` cannot be used when there are no columns. There is also a nice secondary fix here: in the original code,

Implementation looks solid

The comment explaining the invariant ( One small thought: should the schema be taken from

Test coverage

The test It might be worth adding a more direct test that explicitly verifies the zero-column batch case, for example a simple One other thing: after calling

Overall

This is a clean, targeted fix for a real crash. The logic is correct, the comment explaining the Arrow limitation is clear, and the secondary
Code Review
This pull request introduces support for zero-column batches in the native shuffle partitioner, specifically addressing issues with COUNT queries where interleave_record_batch cannot be used. It also adds a corresponding test case in CometNativeShuffleSuite to verify the fix and includes new AI agent guidelines in .cursor/rules.md, AGENTS.md, and CLAUDE.md. I have no feedback to provide.
🤖 Augment PR Summary

Summary: Prevents native shuffle crashes when shuffling “thin” Arrow batches that have 0 columns but a non-zero row count (e.g., column-pruned COUNT queries).

Changes:
Notes: The fix relies on explicitly setting the row count via `RecordBatchOptions` when constructing the zero-column batch.
```rust
    RecordBatch::try_new_with_options(schema, vec![], &options)
};

// ...

self.pos = indices_end;
```
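The reason the fallback needs `RecordBatchOptions` at all is that Arrow derives a batch's row count from its columns, so a zero-column batch has nothing to infer from. Below is a minimal dependency-free sketch of that invariant; `ThinBatch` and its constructors are hypothetical stand-ins, not the arrow crate's API:

```rust
// Stand-in for an Arrow RecordBatch: row count is normally derived
// from the columns, so a zero-column batch must carry it explicitly.
#[derive(Debug, PartialEq)]
struct ThinBatch {
    columns: Vec<Vec<i64>>,
    row_count: usize,
}

impl ThinBatch {
    // Mirrors the shape of RecordBatch::try_new: infers the row count
    // from the columns and fails when there are no columns to infer from.
    fn try_new(columns: Vec<Vec<i64>>) -> Result<Self, String> {
        if columns.is_empty() {
            return Err("cannot infer row count from zero columns".to_string());
        }
        let row_count = columns[0].len();
        Ok(ThinBatch { columns, row_count })
    }

    // Mirrors the shape of try_new_with_options with an explicit row
    // count, which is what the fix falls back to for thin batches.
    fn try_new_with_row_count(columns: Vec<Vec<i64>>, row_count: usize) -> Result<Self, String> {
        Ok(ThinBatch { columns, row_count })
    }
}

fn main() {
    // A column-pruned COUNT query produces 0 columns but e.g. 1000 rows.
    assert!(ThinBatch::try_new(vec![]).is_err());
    let batch = ThinBatch::try_new_with_row_count(vec![], 1000).unwrap();
    println!("rows = {}", batch.row_count); // rows = 1000
}
```

In the real fix, `RecordBatch::try_new_with_options(schema, vec![], &options)` plays the role of `try_new_with_row_count`, with the explicit row count carried in `options`.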
native/shuffle/src/partitioners/partitioned_batch_iterator.rs:116 — self.pos is advanced unconditionally before returning, so an Err from interleave_record_batch/try_new_with_options will still consume those indices and the iterator can continue yielding later batches. Is that behavior change from the previous implementation (which only advanced on Ok) intended?
Severity: low
value:good-but-wont-fix; category:bug; feedback: The Augment AI reviewer is correct! Yes, this is intentional. Without advancing the position, a re-attempt would hit the same error, and the iterator would never be able to proceed; advancing lets it skip the problematic data and continue.
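The intentional skip-on-error behavior discussed above can be sketched with a dependency-free iterator; `ChunkIter` and its failing validation step are hypothetical stand-ins for the partitioned batch iterator and the fallible `interleave_record_batch` call:

```rust
// A cursor over index ranges; advancing `pos` before the fallible step
// means an Err consumes the range and the next call moves on.
struct ChunkIter {
    data: Vec<i32>,
    pos: usize,
    chunk: usize,
}

impl Iterator for ChunkIter {
    type Item = Result<Vec<i32>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.pos >= self.data.len() {
            return None;
        }
        let end = (self.pos + self.chunk).min(self.data.len());
        let slice = &self.data[self.pos..end];
        // Advance unconditionally, mirroring `self.pos = indices_end;`
        // in partitioned_batch_iterator.rs.
        self.pos = end;
        // Stand-in for the fallible batch construction: fail on
        // chunks containing a negative value.
        if slice.iter().any(|v| *v < 0) {
            return Some(Err("bad chunk".to_string()));
        }
        Some(Ok(slice.to_vec()))
    }
}

fn main() {
    let it = ChunkIter { data: vec![1, 2, -3, 4, 5, 6], pos: 0, chunk: 2 };
    let results: Vec<_> = it.collect();
    // The middle chunk fails, but iteration still yields the final chunk.
    println!("{:?}", results);
}
```

Because `pos` moves before the fallible step, one bad chunk yields a single `Err` and the iterator still produces the chunks after it, rather than returning the same `Err` forever.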
```scala
    CometConf.COMET_EXEC_SHUFFLE_WITH_ROUND_ROBIN_PARTITIONING_ENABLED.key -> "true") {
    val testDF = spark.read.parquet(dir.toString).repartition(10)
    // Actual validation, no crash
    val count = testDF.count()
```
spark/src/test/scala/org/apache/comet/exec/CometNativeShuffleSuite.scala:493 — This test asserts the count() result and that testDF (without the count) runs with Comet, but it doesn’t assert that the count() plan used Comet native operators (it could potentially fall back and still pass). That could let the original zero-column native-shuffle crash slip through if fallback behavior changes.
Severity: medium
value:useful; category:bug; feedback: The Augment AI reviewer is correct! The call to checkSparkAnswerAndOperator() that tests the execution of the DataFrame on both Spark and Comet does not use count(). To keep the test focused, it would be good to use the same count-based DataFrame there too.
🧹 Nitpick comments (2)
.cursor/rules.md (1)
1-5: Optional maintainability improvement: avoid policy drift across duplicated rule files.

The same text is repeated in multiple files; consider defining one canonical source and generating or syncing the copies to prevent divergence over time.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.cursor/rules.md around lines 1-5: Multiple rule files contain duplicated policy text (e.g., the blocks in .cursor/rules.md, CLAUDE.md, AGENTS.md); consolidate by creating one canonical rules source (choose or create a single file such as RULES.md) and update the other files to be generated or to reference the canonical file during CI/commit using a simple sync/generation step; replace the duplicated blocks in the copies with a short pointer or include directive, and add a test or CI check to fail the build if the copies drift from the canonical source to prevent future divergence.

spark/src/test/scala/org/apache/comet/exec/CometNativeShuffleSuite.scala (1)
492-497: Strengthen coverage by validating the aggregate plan, not only the base DataFrame.

On line 496, `checkSparkAnswerAndOperator(testDF)` verifies the non-aggregate plan. The regression here is `repartition` + `count`; consider asserting operator usage on an explicit count DataFrame.

Proposed test adjustment:
```diff
- val testDF = spark.read.parquet(dir.toString).repartition(10)
- // Actual validation, no crash
- val count = testDF.count()
- assert(count == 1000)
- // Ensure test df evaluated by Comet
- checkSparkAnswerAndOperator(testDF)
+ val testDF = spark.read.parquet(dir.toString).repartition(10)
+ val countDF = testDF.selectExpr("count(*) as cnt")
+ val count = countDF.collect().head.getLong(0)
+ assert(count == 1000)
+ // Ensure repartition + count plan is evaluated by Comet
+ checkSparkAnswerAndOperator(countDF)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@spark/src/test/scala/org/apache/comet/exec/CometNativeShuffleSuite.scala` around lines 492 - 497, The test currently validates only the base DataFrame plan via checkSparkAnswerAndOperator(testDF); instead create an explicit aggregate DataFrame (e.g., val countDF = testDF.groupBy().count() or val countDF = testDF.agg(count("*").as("cnt"))), assert the aggregated result equals 1000, and call checkSparkAnswerAndOperator(countDF) so the operator usage for the aggregation (repartition+count) is verified; update references to testDF to use countDF for the operator check and include the aggregation import (org.apache.spark.sql.functions.count) as needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.cursor/rules.md:
- Around line 1-5: Multiple rule files contain duplicated policy text (e.g., the
blocks in .cursor/rules.md, CLAUDE.md, AGENTS.md); consolidate by creating one
canonical rules source (choose or create a single file such as RULES.md) and
update the other files to be generated or reference the canonical file during
CI/commit using a simple sync/generation step; replace the duplicated blocks in
the copies with a short pointer or include directive and add a test or CI check
to fail the build if the copies drift from the canonical source to prevent
future divergence.
In `@spark/src/test/scala/org/apache/comet/exec/CometNativeShuffleSuite.scala`:
- Around line 492-497: The test currently validates only the base DataFrame plan
via checkSparkAnswerAndOperator(testDF); instead create an explicit aggregate
DataFrame (e.g., val countDF = testDF.groupBy().count() or val countDF =
testDF.agg(count("*").as("cnt"))), assert the aggregated result equals 1000, and
call checkSparkAnswerAndOperator(countDF) so the operator usage for the
aggregation (repartition+count) is verified; update references to testDF to use
countDF for the operator check and include the aggregation import
(org.apache.spark.sql.functions.count) as needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: beb23cc0-f507-4490-905e-5a18b34fdbc1
📒 Files selected for processing (5)
- .cursor/rules.md
- AGENTS.md
- CLAUDE.md
- native/shuffle/src/partitioners/partitioned_batch_iterator.rs
- spark/src/test/scala/org/apache/comet/exec/CometNativeShuffleSuite.scala
value:useful; category:bug; feedback: The CodeRabbit AI reviewer is correct! The call to checkSparkAnswerAndOperator() should use the count-based DataFrame too.
To review by AI