Optimize SEMI joins to INNER joins when HAS is used on singular relationships #480

Merged
hadia206 merged 8 commits into main from Hadia/remove_has_singular on Feb 6, 2026

@hadia206

  • Uses an INNER JOIN instead of a SEMI JOIN when HAS is applied to a singular sub-collection. The existence check is redundant because a singular relationship always has exactly one match.
  • Fixes a qualification bug where queries failed when the graph name matched a collection name.

closes #476
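
The equivalence described in the first bullet can be illustrated at the SQL level. The sketch below is not PyDough's generated SQL. It uses Python's built-in `sqlite3` with a hypothetical two-table schema (`customer`, `nation`) and made-up rows, and it assumes each customer references exactly one nation through a unique key. Under that assumption, an EXISTS-style semi join and a plain inner join count the same customers.

```python
import sqlite3

# Hypothetical miniature schema; table names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nation (n_nationkey INTEGER PRIMARY KEY, n_name TEXT);
    CREATE TABLE customer (
        c_custkey INTEGER PRIMARY KEY,
        c_nationkey INTEGER NOT NULL REFERENCES nation (n_nationkey)
    );
    INSERT INTO nation VALUES (1, 'JAPAN'), (2, 'FRANCE');
    INSERT INTO customer VALUES (10, 1), (11, 1), (12, 2), (13, 2), (14, 1);
    """
)

# Semi join: keep a customer if a matching nation row exists.
semi_count = conn.execute(
    """
    SELECT COUNT(*) FROM customer c
    WHERE EXISTS (
        SELECT 1 FROM nation n
        WHERE n.n_nationkey = c.c_nationkey AND n.n_name = 'JAPAN'
    )
    """
).fetchone()[0]

# Inner join: safe here because n_nationkey is unique, so the join
# cannot duplicate any customer row.
inner_count = conn.execute(
    """
    SELECT COUNT(*) FROM customer c
    JOIN nation n ON n.n_nationkey = c.c_nationkey
    WHERE n.n_name = 'JAPAN'
    """
).fetchone()[0]

print(semi_count, inner_count)
```

Both counts agree only because the join key on the `nation` side is unique; with a plural relationship the inner join could return a parent row more than once.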

@hadia206 hadia206 marked this pull request as ready for review January 28, 2026 21:03
@hadia206 hadia206 requested review from a team, john-sanchez31, juankx-bodo and knassre-bodo and removed request for a team January 28, 2026 21:04

@knassre-bodo knassre-bodo left a comment

Hi Hadia, mostly LGTM, just a few testing thoughts.

),
pytest.param(
PyDoughPandasTest(
"result = TPCH.CALCULATE(n=COUNT(customers.WHERE(HAS(nation.WHERE(region.name == 'ASIA')))))",

NIT: if need be for visual clarity, you can split this & the others up into multiple lines (put everything inside the HAS inside of a variable)
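
One possible shape for that split, shown only as the PyDough source string the test would receive. It assumes the test harness accepts multi-statement source and that PyDough resolves the unqualified `nation` in `asian_nation` once the variable is used inside `customers.WHERE(...)`. Neither is confirmed in this thread, and the variable name `asian_nation` is made up for illustration.

```python
# Hypothetical multi-line version of the test's PyDough source string.
# `asian_nation` is an illustrative name, not one taken from the PR.
pydough_code = (
    "asian_nation = nation.WHERE(region.name == 'ASIA')\n"
    "result = TPCH.CALCULATE(n=COUNT(customers.WHERE(HAS(asian_nation))))"
)

print(pydough_code)
```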

# HAS on singular relationship with additional filter
pytest.param(
PyDoughPandasTest(
"result = TPCH.CALCULATE(n=COUNT(suppliers.WHERE(HAS(nation.WHERE(region.name == 'EUROPE')))))",

How is this different from the basic redundant_has?

result = TPCH.CALCULATE(n=COUNT(customers.WHERE(HAS(nation.WHERE(region.name == 'ASIA')))))

),
id="redundant_has_on_plural_lineitems",
),
# HASNOT on singular relationship - should optimize to ANTI join or similar

No optimization here, it just stays as ANTI the entire time

}
),
"redundant_has_not_on_singular",
skip_relational=True,

I think for all of these we should keep either the relational or the SQL output, just so we can see what is going on.

),
id="redundant_has_not_on_singular",
),
# HAS without WHERE filter on singular - should optimize to INNER

This isn't really an optimization either; a HAS outside of the conjunction of a WHERE always just becomes COUNT(x) != 0, which other rewrites interpret as an INNER join. The optimizations are that other passes will delete the aggregation and should realize that, since every customer has a nation and we aren't using anything from the nation, the nation join can be deleted, so we just do SELECT COUNT(*) FROM CUSTOMERS.
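
The chain of rewrites described in this comment can be sketched with three SQL queries that return the same count. This is a standalone illustration using Python's built-in `sqlite3`, not the relational plan or SQL that PyDough emits; the schema and rows are hypothetical, and it assumes every customer has a non-null key pointing at an existing nation.

```python
import sqlite3

# Hypothetical miniature schema; every customer points at an existing nation.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE nation (n_nationkey INTEGER PRIMARY KEY);
    CREATE TABLE customer (
        c_custkey INTEGER PRIMARY KEY,
        c_nationkey INTEGER NOT NULL REFERENCES nation (n_nationkey)
    );
    INSERT INTO nation VALUES (1), (2), (3);
    INSERT INTO customer VALUES (10, 1), (11, 1), (12, 2), (13, 3);
    """
)

queries = {
    # HAS expressed as COUNT(x) != 0 over the child rows.
    "count_nonzero": """
        SELECT COUNT(*) FROM customer c
        WHERE (SELECT COUNT(*) FROM nation n
               WHERE n.n_nationkey = c.c_nationkey) != 0
    """,
    # The same condition read as an inner join.
    "inner_join": """
        SELECT COUNT(*) FROM customer c
        JOIN nation n ON n.n_nationkey = c.c_nationkey
    """,
    # The join removed, since nothing from nation is used.
    "join_deleted": "SELECT COUNT(*) FROM customer",
}

counts = {name: conn.execute(sql).fetchone()[0] for name, sql in queries.items()}
print(counts)
```

The last form is only valid because each customer matches exactly one nation; if some customers had no nation, or a key matched several rows, the three counts would diverge.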

# HAS on singular within plural context - orders whose customer is from ASIA
pytest.param(
PyDoughPandasTest(
"result = TPCH.CALCULATE(n=COUNT(orders.WHERE(HAS(customer.WHERE(nation.region.name == 'ASIA')))))",

This is the same as filtering customers based on a filter involving region/nation, so it is not a distinct case from the original test.

skip_sql=True,
),
id="redundant_has_singular_in_plural_context",
),

Some weird cases to consider with HAS:

  1. The contents of the HAS are a CROSS that has a correlated filter back to the context outside the HAS.
  2. Same as (1), but ensure that the child data is singular with regards to the parent based on the filter, then call .SINGULAR() on the child. In theory this can get rewritten, but I don't think it will be with the current version; that can be a later follow-up to deal with more advanced cases of determining when the child subtree is singular with regards to the parent.
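
A SQL-level sketch of why these two cases differ. It does not use PyDough's CROSS or .SINGULAR(); it only mimics a correlated filter between two otherwise unrelated tables with Python's built-in `sqlite3`. The tables, rows, and the helper `semi_and_inner` are hypothetical. When the correlated filter can match several child rows per parent, an inner join overcounts relative to the semi join; when the filter is on a unique key, the child is singular with regards to the parent and the two agree.

```python
import sqlite3

# Hypothetical tables with no declared relationship between them.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (c_custkey INTEGER PRIMARY KEY, c_nationkey INTEGER);
    CREATE TABLE supplier (s_suppkey INTEGER PRIMARY KEY, s_nationkey INTEGER);
    INSERT INTO customer VALUES (1, 1), (2, 1), (3, 2);
    INSERT INTO supplier VALUES (1, 1), (2, 1), (5, 3);
    """
)


def semi_and_inner(condition):
    """Return (semi join count, inner join count) for a correlated condition."""
    semi = conn.execute(
        "SELECT COUNT(*) FROM customer c "
        f"WHERE EXISTS (SELECT 1 FROM supplier s WHERE {condition})"
    ).fetchone()[0]
    inner = conn.execute(
        f"SELECT COUNT(*) FROM customer c JOIN supplier s ON {condition}"
    ).fetchone()[0]
    return semi, inner


# Several suppliers can share a nation, so the child is plural per customer.
plural = semi_and_inner("s.s_nationkey = c.c_nationkey")

# s_suppkey is unique, so at most one supplier matches each customer.
singular = semi_and_inner("s.s_suppkey = c.c_custkey")

print(plural, singular)
```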

@john-sanchez31 john-sanchez31 left a comment

LGTM!

@hadia206 hadia206 merged commit ba5294c into main Feb 6, 2026
12 checks passed
@hadia206 hadia206 deleted the Hadia/remove_has_singular branch February 6, 2026 23:16

Development

Successfully merging this pull request may close these issues.

Ensure that SEMI joins on singular data are promoted to INNER joins