-
Notifications
You must be signed in to change notification settings - Fork 4
Description
Consider a PyDough query such as this:
nations.CALCULATE(
name,
c1 = COUNT(customers.WHERE(market_segment == "BUILDING")),
c2 = COUNT(customers.WHERE((market_segment == "BUILDING") & MONOTONIC(500, account_balance, 600))),
)
Currently, this will be implemented by taking nations and joining it onto two filtered/aggregated copies of customers, one filtered with just the market_segment and one on the market_segment & account_balance. However, there is a more performant way to write this where only 1 copy of customers is scanned/filtered/aggregated:
child = customers.WHERE(market_segment == "BUILDING").CALCULATE(expr=MONOTONIC(500, account_balance, 600))
nations.CALCULATE(
name,
c1 = COUNT(child),
c2 = COUNT(SUM(IFF(child.expr, 1, 0))),
)
The goal of this optimization is to add a new pass to the hybrid tree that automatically transforms the datastructures to make it so if the PyDough query is written in the first way, it is transformed as if it were written the second way and thus improve performance. In concrete terms, the algorithm is:
- Recursively traverse the entire hybrid tree and invoke the rest of the procedure on each level with multiple children
- Identify any children that are only used in a
COUNTcall that is not anONLY_MATCHchild; these children are "mergeable" - For each child, identify the set of all filters (e.g. the conjunction) in the bottom-most layer of the child (only looking at the ones after any limits or window functions)
- Identify which of the children subtrees are identical except for these filters
- Find each pair of childA and childB where childA is "mergeable", the child subtree is identical to child B, and the set of filters from childA is a superset of the filters in childB; if multiple such childB exist for childA, only focus on the childB with the smallest set of filters
- Merge childA into childB by creating a new expression
SUM(IIF(cond, 1, 0))in childB wherecondare all of the excess filters in childA that are not in childB; replace all references to the count of childA with this new SUM call in childB
This optimization must be done after hybrid decorrelation to remove a lot of complicated unnecessary edge cases. If it is done before the dead child removal pass, then any redundant childA subtrees will automatically be cleaned up.