Combine child subtrees being counted with different filters using SUM(IFF(...))

Consider a PyDough query such as this:
```
nations.CALCULATE(
  name,
  c1 = COUNT(customers.WHERE(market_segment == "BUILDING")),
  c2 = COUNT(customers.WHERE((market_segment == "BUILDING") & MONOTONIC(500, account_balance, 600))),
)
```

Currently, this will be implemented by taking `nations` and joining it onto two filtered/aggregated copies of `customers`, one filtered with just the market_segment and one on the market_segment & account_balance. However, there is a more performant way to write this where only 1 copy of `customers` is scanned/filtered/aggregated:
```
child = customers.WHERE(market_segment == "BUILDING").CALCULATE(expr=MONOTONIC(500, account_balance, 600))
nations.CALCULATE(
  name,
  c1 = COUNT(child),
  c2 = COUNT(SUM(IFF(child.expr, 1, 0))),
)
```

The goal of this optimization is to add a new pass to the hybrid tree that automatically transforms the datastructures to make it so if the PyDough query is written in the first way, it is transformed as if it were written the second way and thus improve performance. In concrete terms, the algorithm is:
- Recursively traverse the entire hybrid tree and invoke the rest of the procedure on each level with multiple children
- Identify any children that are only used in a `COUNT` call that is not an `ONLY_MATCH` child; these children are "mergeable"
- For each child, identify the set of all filters (e.g. the conjunction) in the bottom-most layer of the child (only looking at the ones after any limits or window functions)
- Identify which of the children subtrees are **identical** except for these filters
- Find each pair of childA and childB where childA is "mergeable", the child subtree is identical to child B, and the set of filters from childA is a **superset** of the filters in childB; if multiple such childB exist for childA, only focus on the childB with the smallest set of filters
- Merge childA into childB by creating a new expression `SUM(IIF(cond, 1, 0))` in childB where `cond` are all of the excess filters in childA that are not in childB; replace all references to the count of childA with this new SUM call in childB

This optimization must be done after hybrid decorrelation to remove a lot of complicated unnecessary edge cases. If it is done before the dead child removal pass, then any redundant childA subtrees will automatically be cleaned up.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Combine child subtrees being counted with different filters using SUM(IFF(...)) #477

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Combine child subtrees being counted with different filters using SUM(IFF(...)) #477

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions