Skip to content

Combine child subtrees being counted with different filters using SUM(IFF(...)) #477

@knassre-bodo

Description

@knassre-bodo

Consider a PyDough query such as this:

nations.CALCULATE(
  name,
  c1 = COUNT(customers.WHERE(market_segment == "BUILDING")),
  c2 = COUNT(customers.WHERE((market_segment == "BUILDING") & MONOTONIC(500, account_balance, 600))),
)

Currently, this will be implemented by taking nations and joining it onto two filtered/aggregated copies of customers, one filtered with just the market_segment and one on the market_segment & account_balance. However, there is a more performant way to write this where only 1 copy of customers is scanned/filtered/aggregated:

child = customers.WHERE(market_segment == "BUILDING").CALCULATE(expr=MONOTONIC(500, account_balance, 600))
nations.CALCULATE(
  name,
  c1 = COUNT(child),
  c2 = COUNT(SUM(IFF(child.expr, 1, 0))),
)

The goal of this optimization is to add a new pass to the hybrid tree that automatically transforms the datastructures to make it so if the PyDough query is written in the first way, it is transformed as if it were written the second way and thus improve performance. In concrete terms, the algorithm is:

  • Recursively traverse the entire hybrid tree and invoke the rest of the procedure on each level with multiple children
  • Identify any children that are only used in a COUNT call that is not an ONLY_MATCH child; these children are "mergeable"
  • For each child, identify the set of all filters (e.g. the conjunction) in the bottom-most layer of the child (only looking at the ones after any limits or window functions)
  • Identify which of the children subtrees are identical except for these filters
  • Find each pair of childA and childB where childA is "mergeable", the child subtree is identical to child B, and the set of filters from childA is a superset of the filters in childB; if multiple such childB exist for childA, only focus on the childB with the smallest set of filters
  • Merge childA into childB by creating a new expression SUM(IIF(cond, 1, 0)) in childB where cond are all of the excess filters in childA that are not in childB; replace all references to the count of childA with this new SUM call in childB

This optimization must be done after hybrid decorrelation to remove a lot of complicated unnecessary edge cases. If it is done before the dead child removal pass, then any redundant childA subtrees will automatically be cleaned up.

Metadata

Metadata

Assignees

Labels

effort - mediummid-sized issue with average implementation time/difficultyenhancementNew feature or requestoptimizationImproving the speed/quality of PyDough's outputs

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions