21021: DataFrame API: allow aggregate functions in select() (#17874) #288

Open
martin-augment wants to merge 5 commits into main from pr-21021-2026-03-23-12-19-31

@martin-augment (Owner)

21021: To review by AI

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request lets the DataFusion DataFrame API accept aggregate functions directly in the select method. Users can express aggregate queries in a single select call, which makes the DataFrame API more concise for analytical workloads.

Highlights

  • Enabled Aggregate Functions in DataFrame::select: The select method now directly supports aggregate expressions, automatically transforming the logical plan to include an Aggregate node when needed.
  • Intelligent Plan Generation: The select method intelligently determines whether to generate an Aggregate plan or fall back to a Projection based on the presence and type of expressions (aggregates, non-aggregates, or mixed).
  • Alias Management for Aggregates: A mechanism was introduced to assign unique aliases to aggregate expressions within the select list, preventing naming conflicts and ensuring correct output column names.
  • Expression Rewriting: Expressions are rewritten to reference the output of the aggregate plan, ensuring that subsequent projections correctly use the aggregated results.
  • New Test Cases: Comprehensive test cases were added to validate the new functionality, covering scenarios with multiple aggregates, duplicate aliases, and combinations with wildcard selections.

@coderabbitai

coderabbitai bot commented Mar 23, 2026

Walkthrough

This change enhances the DataFrame::select method to support aggregate functions in the select list. The implementation adds logic to detect aggregate expressions, create an Aggregate logical plan when necessary, and rewrite selected expressions to reference the new aggregate output columns. When no aggregates are found or the plan is already aggregated, a simple projection fallback is applied. A test is added to verify the functionality with multiple aggregate expressions, duplicate aliases, and wildcard projections.

@claude

claude bot commented Mar 23, 2026

Code Review

This PR adds useful functionality — automatically detecting and handling aggregate functions in DataFrame::select() so users do not need to call .aggregate() explicitly. The approach is reasonable, but there are several issues to address.

Bug: Intra-expression mixing of aggregates and non-aggregates

The has_non_aggregate_expr check iterates over the top-level select expressions and returns true if any expression contains no aggregate anywhere in its subtree. However, it does not catch the case where a single expression mixes an aggregate and a non-aggregate column leaf.

For example: df.select(vec![col("a") + count(col("b"))])

For this expression, find_aggregate_exprs returns [count(col("b"))] which is non-empty, so has_non_aggregate_expr is false. The code proceeds to the aggregate path, builds an aggregate plan with only count(b) as output, and then the rewritten expression col("a") + col("count(b)") references col("a") which does not exist in the aggregate plan. This produces a confusing deep planning error rather than a clear user-facing message.

Consider adding an explicit check and returning a descriptive error: "Cannot mix aggregate and non-aggregate column references in select() — use aggregate() with explicit GROUP BY instead."
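
The check suggested above can be sketched on a toy expression tree. This is a minimal illustration, not DataFusion code: `ToyExpr`, `contains_aggregate`, `has_bare_column`, and `check_no_intra_expr_mixing` are hypothetical names, and a real implementation would walk DataFusion's `Expr` instead.

```rust
/// Toy stand-in for a logical expression tree (hypothetical, not DataFusion's `Expr`).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Column(String),
    Literal(i64),
    /// An aggregate call such as `count(arg)`.
    Aggregate { func: String, arg: Box<ToyExpr> },
    Add(Box<ToyExpr>, Box<ToyExpr>),
}

fn col(name: &str) -> ToyExpr {
    ToyExpr::Column(name.to_string())
}

fn count(arg: ToyExpr) -> ToyExpr {
    ToyExpr::Aggregate { func: "count".to_string(), arg: Box::new(arg) }
}

fn add(l: ToyExpr, r: ToyExpr) -> ToyExpr {
    ToyExpr::Add(Box::new(l), Box::new(r))
}

/// True if the tree contains at least one aggregate call.
fn contains_aggregate(e: &ToyExpr) -> bool {
    match e {
        ToyExpr::Aggregate { .. } => true,
        ToyExpr::Add(l, r) => contains_aggregate(l) || contains_aggregate(r),
        ToyExpr::Column(_) | ToyExpr::Literal(_) => false,
    }
}

/// True if a column is referenced outside of any aggregate call.
fn has_bare_column(e: &ToyExpr) -> bool {
    match e {
        ToyExpr::Column(_) => true,
        ToyExpr::Literal(_) => false,
        // Columns inside an aggregate's argument are consumed by it, so do not descend.
        ToyExpr::Aggregate { .. } => false,
        ToyExpr::Add(l, r) => has_bare_column(l) || has_bare_column(r),
    }
}

/// Reject a single expression that mixes an aggregate with a bare column.
fn check_no_intra_expr_mixing(e: &ToyExpr) -> Result<(), String> {
    if contains_aggregate(e) && has_bare_column(e) {
        return Err(
            "Cannot mix aggregate and non-aggregate column references in select(); \
             use aggregate() with explicit GROUP BY instead."
                .to_string(),
        );
    }
    Ok(())
}
```

With these helpers, `add(col("a"), count(col("b")))` is rejected, while `add(count(col("a")), ToyExpr::Literal(1))` passes because the only column sits inside the aggregate.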

Missing test cases

The test suite covers the happy path well, but several important edge cases are missing:

  1. Non-aggregate columns mixed with aggregates at top level (e.g. vec![col("c1"), count(col("c9"))]): currently falls back to projection, which likely fails at plan time with a low-level error. A test documenting the expected error message would be valuable.
  2. Intra-expression mixing (e.g. col("a") + count(col("b"))): as noted above, this hits the aggregate path and fails with a confusing error. Should have a test expecting a clear error.
  3. Window function combined with aggregates: the code processes window functions first, then aggregates on top of the window plan. There are no tests confirming this works correctly or fails with a clear error.
  4. select() after aggregate(): ensure the early-return path (matches!(plan, LogicalPlan::Aggregate(_))) correctly handles the case where self.plan is already an aggregate.

select() docstring not updated

The method docstring only shows a non-aggregate example. Since this PR adds a significant behavioral change, the docstring should include an aggregate usage example and note the restriction that mixing aggregate and non-aggregate expressions requires .aggregate() with explicit GROUP BY keys.

Minor: expressions iterator cloned multiple times

The expressions filter_map iterator is cloned 3+ times, and find_aggregate_exprs is called once per element inside the has_non_aggregate_expr check. Collecting to Vec<&Expr> once up front would be cleaner.

Minor: Intermediate column name collision

The used_names deduplication appends a counter suffix (e.g. count(c9)_1) for colliding aggregate names. If the user table has a column literally named count(c9)_1 this would silently collide. Acceptable for a first implementation but worth a comment noting the limitation.
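
One way to close this gap, sketched under assumptions: seed the set of used names with the input schema's column names before generating suffixes, and keep incrementing the counter until the candidate is actually free. `unique_name` is a hypothetical helper; the PR's `used_names` logic may be structured differently.

```rust
use std::collections::HashSet;

/// Pick a name for an intermediate aggregate column that collides neither with
/// names already handed out nor with names the caller pre-seeded into `used`
/// (for example, the columns of the input schema).
fn unique_name(base: &str, used: &mut HashSet<String>) -> String {
    let mut candidate = base.to_string();
    let mut counter = 0usize;
    // `insert` returns false when the name is already present, so keep trying.
    while !used.insert(candidate.clone()) {
        counter += 1;
        candidate = format!("{}_{}", base, counter);
    }
    candidate
}
```

If `used` already contains a table column literally named `count(c9)_1`, the second request for `count(c9)` skips that suffix and yields `count(c9)_2` instead of colliding.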

Nit: double-aliasing every rewritten expression

In the rewrite loop every SelectExpr::Expression unconditionally gets .alias(alias). If rewritten already carries an alias this creates a redundant Alias(Alias(...)). It is harmless since the planner strips extra aliases, but worth noting.

Overall

The core logic (build an intermediate aggregate plan, assign stable aliases, rewrite the projection to reference those aliases) is sound. The main asks before merging are:

  1. Add a clear error for intra-expression mixing of aggregate and non-aggregate column references
  2. Add test cases for the missing edge cases listed above
  3. Update the select() docstring to document the new aggregate support


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces the capability to use aggregate functions directly within the select() method of the DataFrame API. The implementation correctly handles the interplay between window functions, aggregate expressions, and non-aggregate expressions, ensuring that the logical plan is constructed appropriately. New test cases have been added to validate this functionality, covering various scenarios including multiple aggregates, duplicate aggregate aliases, and wildcards. The changes are well-implemented and enhance the flexibility of the DataFrame API.


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (4)
datafusion/core/src/dataframe/mod.rs (2)

502-504: Consider documenting wildcard behavior with aggregates.

When a SelectExpr::Wildcard is passed alongside aggregate expressions, it passes through unchanged and will expand to the aggregate plan's schema (which contains only the computed aggregate columns). This might be surprising to users.

For example: df.select(vec![sum(col("a")), Expr::Wildcard {...}]) would have the wildcard expand to the aggregate output columns, not the original table columns.

This edge case may be intentional, but it could benefit from documentation in the method's doc comment or a check that returns an error when wildcards are mixed with aggregate expressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/core/src/dataframe/mod.rs` around lines 502 - 504, The
SelectExpr::Wildcard branch currently passes through unchanged and later expands
to the aggregate plan's schema (only computed aggregates), which is surprising
when mixed with aggregate expressions; update the doc comment for the function
handling select expression rewriting to explicitly document this behavior
(mention SelectExpr::Wildcard and that it expands to aggregate output when
aggregates are present) and/or add a validation in the same function to detect
mixing wildcard with aggregate expressions and return a clear error instead;
reference the SelectExpr::Wildcard match arm and the rewritten_exprs collection
when implementing the documentation or the new check.

495-505: Redundant aliasing may occur for already-aliased expressions.

When the original expression already has an alias (e.g., sum(col("a")).alias("total")), the rewriting preserves the inner alias through transform, and then line 501 wraps it in another alias. This creates nested Alias nodes.

While DataFusion's planner may flatten these, consider avoiding the redundant alias:

♻️ Suggested improvement
 SelectExpr::Expression(expr) => {
-    let rewritten = rewrite_expr(expr.clone(), &aggr_map)?;
     let alias = expr.name_for_alias()?;
-    rewritten_exprs.push(SelectExpr::Expression(rewritten.alias(alias)));
+    let rewritten = rewrite_expr(expr, &aggr_map)?;
+    // Only add alias if the rewritten expression doesn't already have one
+    let final_expr = match &rewritten {
+        Expr::Alias(_) => rewritten,
+        _ => rewritten.alias(alias),
+    };
+    rewritten_exprs.push(SelectExpr::Expression(final_expr));
 }

This also removes the unnecessary expr.clone() since alias can be computed first.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/core/src/dataframe/mod.rs` around lines 495 - 505, In the
SelectExpr::Expression branch avoid producing nested Alias nodes by computing
alias = expr.name_for_alias() before rewriting, then call rewrite_expr with the
original expr (no clone) and only wrap with rewritten.alias(alias) if the
rewritten expression is not already an Alias (i.e., detect existing Alias on the
rewritten expression or if the original expression already carried an alias) so
rewritten_exprs.push either the rewritten as-is or rewritten.alias(alias); refer
to SelectExpr::Expression, rewrite_expr, expr.name_for_alias(), and
rewritten.alias().
datafusion/core/tests/dataframe/mod.rs (2)

6913-6916: Make wildcard assertions batch-agnostic.

Asserting on batches[0] assumes a single output batch. Safer checks should validate total rows and consistent column counts across all returned batches.

Batch-agnostic assertion pattern
-    let batches = res.collect().await?;
-    assert_eq!(batches[0].num_rows(), 100);
-    assert_eq!(batches[0].num_columns(), 14);
+    let batches = res.collect().await?;
+    assert!(!batches.is_empty());
+    assert_eq!(batches.iter().map(|b| b.num_rows()).sum::<usize>(), 100);
+    assert!(batches.iter().all(|b| b.num_columns() == 14));

@@
-    let batches = res.collect().await?;
-    assert_eq!(batches[0].num_rows(), 100);
-    assert_eq!(batches[0].num_columns(), 14);
+    let batches = res.collect().await?;
+    assert!(!batches.is_empty());
+    assert_eq!(batches.iter().map(|b| b.num_rows()).sum::<usize>(), 100);
+    assert!(batches.iter().all(|b| b.num_columns() == 14));

Also applies to: 6925-6927

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/core/tests/dataframe/mod.rs` around lines 6913 - 6916, The
assertions currently index into batches[0], which assumes a single output batch;
change the checks after let batches = res.collect().await? to be batch-agnostic
by summing rows across all batches (e.g., total_rows == 100) and verifying every
batch has the expected number of columns (batch.num_columns() == 14) rather than
only checking batches[0]; apply the same replacement for the second occurrence
around the lines referenced (6925-6927). Use the local variable name batches to
iterate and perform these aggregated/consistent checks.

6887-6891: Align the “duplicate aliases” case with actual behavior.

Line 6889 and Line 6890 use different aliases (count_c9 and count_c9_2), so this does not test alias-collision behavior despite the comment.

Suggested adjustment
-    // Test duplicate aggregate aliases
+    // Test repeated aggregate expressions with distinct aliases

If the intent is to validate duplicate alias handling, use the same alias on both expressions and assert the expected output naming behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/core/tests/dataframe/mod.rs` around lines 6887 - 6891, The test
comment says it should check duplicate aggregate aliases but the select uses
different aliases; update the test so both aggregate expressions use the same
alias (e.g., replace count(col("c9")).alias("count_c9_2") with
count(col("c9")).alias("count_c9")) and then assert the expected naming/behavior
for duplicate aliases (adjust assertions on the resulting DataFrame / variable
res accordingly) so the test actually validates alias-collision handling in the
df.select/count/alias flow.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 54a647e9-24d9-45d5-bb85-1a94cb1d3975

📥 Commits

Reviewing files that changed from the base of the PR and between ea6750b and e527e12.

📒 Files selected for processing (2)
  • datafusion/core/src/dataframe/mod.rs
  • datafusion/core/tests/dataframe/mod.rs

@augmentcode

augmentcode bot commented Mar 23, 2026

🤖 Augment PR Summary

Summary: This PR updates the DataFrame API so aggregate functions can be used directly in DataFrame::select().

Changes:

  • Extends DataFrame::select to detect window functions, apply a window plan when needed, and then (when appropriate) build a global Aggregate plan for aggregate expressions.
  • Collects aggregate sub-expressions, assigns stable/unique internal aliases, and rewrites the original select expressions to reference the aggregate outputs.
  • Adds tests covering multiple aggregates, repeated aggregates with different output aliases, and continued support for wildcard / qualified-wildcard selection.

Technical Notes: The implementation relies on find_aggregate_exprs for aggregate discovery and uses expression tree rewriting (TreeNode::transform) to replace aggregate nodes with column references from the generated aggregate schema.
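
The rewriting step described in the technical notes can be illustrated on a toy tree. This sketch does not use DataFusion's `TreeNode::transform`. It hand-rolls the recursion, and `Node`, `display_name`, and `rewrite` are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Column(String),
    Literal(i64),
    Aggregate { func: String, arg: Box<Node> },
    Add(Box<Node>, Box<Node>),
}

/// Display name used as the lookup key, e.g. `count(b)`.
fn display_name(e: &Node) -> String {
    match e {
        Node::Column(c) => c.clone(),
        Node::Literal(v) => v.to_string(),
        Node::Aggregate { func, arg } => format!("{}({})", func, display_name(arg)),
        Node::Add(l, r) => format!("{} + {}", display_name(l), display_name(r)),
    }
}

/// Replace each aggregate node with a column reference to its output alias.
/// `aggr_map` maps an aggregate's display name to the alias it received in
/// the intermediate aggregate plan.
fn rewrite(e: Node, aggr_map: &HashMap<String, String>) -> Result<Node, String> {
    match e {
        Node::Aggregate { .. } => {
            let key = display_name(&e);
            aggr_map
                .get(&key)
                .map(|alias| Node::Column(alias.clone()))
                .ok_or_else(|| format!("no aggregate output for {}", key))
        }
        Node::Add(l, r) => Ok(Node::Add(
            Box::new(rewrite(*l, aggr_map)?),
            Box::new(rewrite(*r, aggr_map)?),
        )),
        // Columns and literals are left untouched.
        other => Ok(other),
    }
}
```

Given a map from `count(b)` to an internal alias such as `__agg_0` (a made-up name), `count(b) + 1` becomes `__agg_0 + 1`, which a projection over the aggregate output can then evaluate.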

@augmentcode augmentcode bot left a comment

Review completed. 2 suggestions posted.

expr_list.into_iter().map(|e| e.into()).collect::<Vec<_>>();
expr_list.into_iter().map(|e| e.into()).collect();

// Extract expressions

expressions only includes SelectExpr::Expression, so SelectExpr::Wildcard / QualifiedWildcard aren’t considered when deciding to build the implicit global Aggregate. This can misclassify a select() that mixes a wildcard with an aggregate (e.g. [* , count(..)]) and change semantics by aggregating anyway.

Severity: medium

let aggr_exprs = find_aggregate_exprs(expressions.clone());

// Check for non-aggregate expressions
let has_non_aggregate_expr = expressions

The has_non_aggregate_expr check treats any expression with no aggregate sub-expr (including pure scalars like lit(1)) as “non-aggregate”, which prevents the new aggregate-in-select() path from running. This likely makes queries like select([count(..), lit(1)]) fall back to a projection containing an aggregate and then error later.
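
A possible refinement, sketched on a toy model: treat an item as blocking the aggregate path only when it references a column outside an aggregate, so pure scalars such as `lit(1)` do not force the projection fallback. `Item`, `Path`, and `choose_path` are hypothetical names, and the real decision in the PR also checks whether the plan is already an aggregate.

```rust
/// Toy select-list item (hypothetical, not DataFusion's `Expr`).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Item {
    Column(&'static str),
    Literal(i64),
    /// Stands in for any aggregate call, e.g. `count(arg)`.
    Count(Box<Item>),
}

#[derive(Debug, PartialEq)]
enum Path {
    Projection,
    GlobalAggregate,
}

fn is_aggregate(item: &Item) -> bool {
    matches!(item, Item::Count(_))
}

/// A literal references no column, so it is compatible with a global aggregate.
fn references_bare_column(item: &Item) -> bool {
    match item {
        Item::Column(_) => true,
        Item::Literal(_) => false,
        // Columns inside the aggregate's argument are consumed by it.
        Item::Count(_) => false,
    }
}

fn choose_path(items: &[Item]) -> Path {
    let any_aggregate = items.iter().any(is_aggregate);
    let any_bare_column = items.iter().any(references_bare_column);
    if any_aggregate && !any_bare_column {
        Path::GlobalAggregate
    } else {
        Path::Projection
    }
}
```

Under this rule a list like `[count(c9), lit(1)]` takes the aggregate path, while `[c1, count(c9)]` still falls back to the projection.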

Severity: low

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

// - no aggregates
if matches!(plan, LogicalPlan::Aggregate(_))
    || has_non_aggregate_expr
    || aggr_exprs.is_empty()

Wildcard with aggregates incorrectly enters aggregate path

Medium Severity

The has_non_aggregate_expr check only inspects SelectExpr::Expression items (via the expressions iterator), completely ignoring SelectExpr::Wildcard and SelectExpr::QualifiedWildcard. When a wildcard is combined with aggregate expressions (e.g., vec![wildcard(), count(col("a")).into()]), the check finds no non-aggregate expressions, aggr_exprs is non-empty, and the code incorrectly enters the aggregate path. The wildcard is then passed through unmodified at other => rewritten_exprs.push(other) and resolves against the aggregate output schema (which only contains aggregate columns) instead of the original table schema, producing incorrect results.
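
A guard for this case could look roughly like the following. It is a sketch on a toy enum: `ToySelect` and `reject_wildcard_with_aggregate` are hypothetical, and whether the right behavior is an error or a projection fallback is a design decision for the PR author.

```rust
/// Toy stand-in for a select-list entry (hypothetical, not DataFusion's `SelectExpr`).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ToySelect {
    Wildcard,
    QualifiedWildcard(String),
    Expression { name: String, has_aggregate: bool },
}

/// Return an error when a wildcard is combined with an aggregate expression,
/// instead of silently expanding the wildcard against the aggregate output.
fn reject_wildcard_with_aggregate(items: &[ToySelect]) -> Result<(), String> {
    let has_wildcard = items
        .iter()
        .any(|i| matches!(i, ToySelect::Wildcard | ToySelect::QualifiedWildcard(_)));
    let has_aggregate = items
        .iter()
        .any(|i| matches!(i, ToySelect::Expression { has_aggregate: true, .. }));
    if has_wildcard && has_aggregate {
        Err("select() cannot combine a wildcard with aggregate expressions".to_string())
    } else {
        Ok(())
    }
}
```

Running this check before choosing the aggregate path would keep wildcards from resolving against a schema that only contains the aggregate columns.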

SelectExpr::Expression(expr) => {
    let rewritten = rewrite_expr(expr.clone(), &aggr_map)?;
    let alias = expr.name_for_alias()?;
    rewritten_exprs.push(SelectExpr::Expression(rewritten.alias(alias)));

Unconditional alias creates redundant double-wrapping on aliased expressions

Low Severity

When the original expression already has an alias (e.g., count(col("c9")).alias("count_c9")), rewrite_expr preserves the inner Alias node, and then .alias(alias) unconditionally wraps it again, producing Alias(Alias(Column(...), "count_c9"), "count_c9"). The existing alias_if_changed method on Expr already handles this case by only adding an alias when the name differs, which would avoid the unnecessary double wrapping.
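
The "alias only when the name differs" idea can be shown on a toy type. This does not call DataFusion's `alias_if_changed`. `Aliased`, `output_name`, and `alias_if_name_differs` are hypothetical names that only mimic the behavior described above.

```rust
/// Toy expression with just enough structure to show aliasing (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Aliased {
    Column(String),
    Alias(Box<Aliased>, String),
}

/// The name this expression would have as an output column.
fn output_name(e: &Aliased) -> String {
    match e {
        Aliased::Column(c) => c.clone(),
        Aliased::Alias(_, name) => name.clone(),
    }
}

/// Wrap `expr` in an alias only if its output name is not already `name`.
fn alias_if_name_differs(expr: Aliased, name: &str) -> Aliased {
    if output_name(&expr) == name {
        expr
    } else {
        Aliased::Alias(Box::new(expr), name.to_string())
    }
}
```

An expression that is already `Alias(Column(..), "count_c9")` comes back unchanged when asked for the name `count_c9`, so no nested alias is produced. A bare column reference to an internal aggregate output still gets wrapped once.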
