Optimize SEMI joins to INNER joins when HAS is used on singular relationships#480
Optimize SEMI joins to INNER joins when HAS is used on singular relationships#480
Conversation
knassre-bodo
left a comment
There was a problem hiding this comment.
Hi Hadia, mostly LGTM just a few testing thoughts
tests/test_pipeline_tpch_custom.py
Outdated
| ), | ||
| pytest.param( | ||
| PyDoughPandasTest( | ||
| "result = TPCH.CALCULATE(n=COUNT(customers.WHERE(HAS(nation.WHERE(region.name == 'ASIA')))))", |
There was a problem hiding this comment.
NIT: if need be for visual clarity, you can split this & the others up into multiple lines (put everything inside the HAS inside of a variable)
tests/test_pipeline_tpch_custom.py
Outdated
| # HAS on singular relationship with additional filter | ||
| pytest.param( | ||
| PyDoughPandasTest( | ||
| "result = TPCH.CALCULATE(n=COUNT(suppliers.WHERE(HAS(nation.WHERE(region.name == 'EUROPE')))))", |
There was a problem hiding this comment.
How is this different from the basic redundant_has?
result = TPCH.CALCULATE(n=COUNT(customers.WHERE(HAS(nation.WHERE(region.name == 'ASIA')))))
tests/test_pipeline_tpch_custom.py
Outdated
| ), | ||
| id="redundant_has_on_plural_lineitems", | ||
| ), | ||
| # HASNOT on singular relationship - should optimize to ANTI join or similar |
There was a problem hiding this comment.
No optimization here, it just stays as ANTI the entire time
tests/test_pipeline_tpch_custom.py
Outdated
| } | ||
| ), | ||
| "redundant_has_not_on_singular", | ||
| skip_relational=True, |
There was a problem hiding this comment.
I think all of these we should keep either the relational or sql just so we can see what is going on.
tests/test_pipeline_tpch_custom.py
Outdated
| ), | ||
| id="redundant_has_not_on_singular", | ||
| ), | ||
| # HAS without WHERE filter on singular - should optimize to INNER |
There was a problem hiding this comment.
This isn't really an optimization either; HAS outside of the conjunction of a WHERE always just becomes COUNT(x) != 0, which other rewrites interpret as an INNER join. The optimizations are that other passes will delete the aggregation, and should realize that since every customer has a nation + we aren't using anything from the nation, the nation join can get deleted so we just do SELECT COUNT(*) FROM CUSTOMERS)
tests/test_pipeline_tpch_custom.py
Outdated
| # HAS on singular within plural context - orders whose customer is from ASIA | ||
| pytest.param( | ||
| PyDoughPandasTest( | ||
| "result = TPCH.CALCULATE(n=COUNT(orders.WHERE(HAS(customer.WHERE(nation.region.name == 'ASIA')))))", |
There was a problem hiding this comment.
This is the same as filtering customers based on a filter involving region/nation, so it is not a distinct case from the original test.
| skip_sql=True, | ||
| ), | ||
| id="redundant_has_singular_in_plural_context", | ||
| ), |
There was a problem hiding this comment.
Some weird cases to consider with HAS:
- Contents of the
HASis aCROSSthat has a correlated filter back to the context outside theHAS - Same as (1), but ensure that the child data is singular with regards to the parent based on the filter then do
.SINGULAR()on the child (theoretically can get rewritten but I don't think it will be with the current version, that can be a later followup to deal with more advanced cases of determining when the child subtree is singular with regards to the parent)
INNER JOINinstead ofSEMI JOINwhenHASis used on a singular sub-collection, the existence check is redundant since singular relationships always have exactly one match.closes #476