Skip to content

Integrate HAVING with query optimization engine (predicate pushdown) #136

@paveg

Description

@paveg

Description

Integrate HAVING clause operations with the existing query optimization engine to enable predicate pushdown and other optimization strategies for improved performance.

Current State

HAVING operations are evaluated after GROUP BY aggregation is complete. The query optimizer doesn't currently analyze HAVING predicates for optimization opportunities.

Proposed Enhancement

1. Predicate Pushdown for HAVING

Identify predicates that can be pushed down to pre-aggregation filtering:

// Original query
df.GroupBy("department").
   Having(expr.And(
       expr.Sum(expr.Col("salary")).Gt(expr.Lit(100000)),
       expr.Col("department").Ne(expr.Lit("TEMP")), // Can be pushed down
   ))

// Optimized to:
df.Filter(expr.Col("department").Ne(expr.Lit("TEMP"))). // Pushed down
   GroupBy("department").
   Having(expr.Sum(expr.Col("salary")).Gt(expr.Lit(100000)))

2. Early Group Elimination

Skip aggregation for groups that can't possibly satisfy HAVING conditions:

// If we know min(salary) = 30000 across dataset
Having(expr.Sum(expr.Col("salary")).Gt(expr.Lit(1000000)))
// Skip groups with < 34 members (1000000/30000)

3. Expression Optimization

Optimize HAVING expressions using existing expression optimization:

// Constant folding
Having(expr.Col("count").Gt(expr.Lit(5).Add(expr.Lit(10))))
// Optimized to:
Having(expr.Col("count").Gt(expr.Lit(15)))

// Common subexpression elimination
Having(expr.And(
    expr.Sum(expr.Col("amount")).Gt(expr.Lit(1000)),
    expr.Sum(expr.Col("amount")).Lt(expr.Lit(5000)),
))
// Cache Sum(amount) computation

4. Join with HAVING Optimization

Optimize joins that include HAVING clauses:

// Push HAVING predicates to appropriate side of join
leftDf.Join(rightDf, ...).
    GroupBy("category").
    Having(expr.Count(expr.Col("left.id")).Gt(expr.Lit(10)))

Implementation Tasks

  • Extend QueryOptimizer to analyze HAVING predicates
  • Implement predicate classification (pushable vs non-pushable)
  • Add HAVING-aware optimization rules to the engine
  • Integrate with existing predicate pushdown infrastructure
  • Add cost-based optimization for HAVING operations
  • Create benchmarks demonstrating optimization benefits
  • Update execution plan visualization for HAVING optimizations

Performance Impact

Expected improvements:

  • 20-50% reduction in processing time for queries with pushable predicates
  • Reduced memory usage by filtering data earlier
  • Better cache utilization with optimized expressions

Priority

Low - Performance optimization that builds on existing infrastructure.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    area: core🏗️ Core DataFrame functionalitypriority: low🟢 Low priority / nice-to-have featurestype: performance⚡ Performance improvements and optimizations

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions