Transpile INTERSECTS to binned equi-join in pure SQL for full-table joins — Closes #78#79
Draft
conradbzura wants to merge 16 commits intomainfrom
Draft
Transpile INTERSECTS to binned equi-join in pure SQL for full-table joins — Closes #78#79conradbzura wants to merge 16 commits intomainfrom
conradbzura wants to merge 16 commits intomainfrom
Conversation
Column-to-column INTERSECTS joins (e.g., a.interval INTERSECTS b.interval) are now rewritten into binned equi-joins using CTEs with UNNEST(range(...)) bin assignments. This gives the query planner an equi-join key to work with instead of forcing a nested-loop or cross join. The bin size defaults to 10,000 and is configurable via the new bin_size parameter on transpile(). Literal-range INTERSECTS filters remain unchanged.
Needed for end-to-end correctness tests that validate the binned equi-join SQL against DataFusion's query engine.
The transformer now detects column-to-column INTERSECTS in WHERE clauses (FROM a, b WHERE a.interval INTERSECTS b.interval), not just in explicit JOIN ON conditions. Both patterns are rewritten to binned equi-joins for the same performance benefit.
Covers both explicit JOIN ON and implicit cross-join patterns, custom bin sizes, custom column mappings, self-joins, literal range passthrough, and end-to-end correctness against DataFusion including multi-bin deduplication and equivalence with naive joins.
Move the overlap predicate (start < end AND end > start) from WHERE into the JOIN ON clause so that LEFT/RIGHT/FULL JOIN semantics are preserved — a WHERE filter on the right-side columns silently converts outer joins into inner joins. Also refactor the transformer to rewrite all INTERSECTS joins in a query, not just the first. A new _ensure_table_binned helper tracks which aliases already have binned CTEs so that multi-join queries reuse CTEs instead of duplicating them. Add bin_size validation (must be positive) and remove dead code from _rewrite_where.
Cover three-way joins with CTE reuse, invalid bin_size rejection, and update assertions for the overlap-in-ON change. Remove unused pytest import from module level.
The binned CTE approach leaks __giql_bin into SELECT * results because CTEs expose all their columns. Revert implicit cross-join rewriting (FROM a, b WHERE INTERSECTS) so those queries fall through to the generator's naive overlap predicate, which produces clean column output. Explicit JOIN ON INTERSECTS continues to use the binned equi-join. Also add pytest.importorskip for datafusion so the DataFusion correctness tests are skipped when the module is not installed.
The CI workflow uses pixi, not uv, so the datafusion package must be listed under [tool.pixi.dependencies] for the DataFusion correctness tests to run. Remove the pytest.importorskip guard since the dependency is now always available.
The previous approach replaced FROM/JOIN table references with full
CTEs (SELECT *), causing __giql_bin to appear in SELECT a.* output.
The new approach keeps original table references and routes the equi-
join through key-only bridge CTEs (SELECT chrom, start, end, bin),
eliminating the leak entirely.
This also restores implicit cross-join rewriting (FROM a, b WHERE
INTERSECTS) which was reverted in the prior commit due to the leak.
CTEs are now named __giql_{table}_bins and deduplicated per underlying
table name rather than per alias, so self-joins share one CTE.
Queries with explicit column lists (SELECT a.chrom, b.start, ...) cannot expose __giql_bin in their output regardless of which CTE the table alias points to. Detecting this at transform time lets us skip the 3-join bridge pattern entirely for those queries and use the simpler, faster 1-join full-CTE approach. Queries with wildcards (SELECT a.*, SELECT *) still take the bridge path so __giql_bin never leaks into the output column set.
Drop section divider lines (`# --...--`) from `IntersectsBinnedJoinTransformer` to reduce visual clutter. Descriptive inline comments explaining code behavior are preserved.
Cover outer join semantics (LEFT/RIGHT/FULL preserved through both full-CTE and bridge paths), additional ON conditions surviving the rewrite alongside INTERSECTS, and unconditional DISTINCT collapsing legitimate duplicate rows. The DISTINCT tests are marked xfail since the correct behavior (preserving duplicates) is a known limitation. 7 tests fail against the current implementation, confirming the bugs. 2 tests are strict xfail documenting the DISTINCT limitation.
Two interrelated fixes for the binned equi-join rewrite: The bridge path was silently converting LEFT/RIGHT/FULL joins to INNER because sqlglot stores the join type as "side" not "kind", and only join3 received it. Propagate the side attribute to both join2 and join3. FULL OUTER with wildcards falls back to the full-CTE path because the three-join chain's bin fan-out creates spurious unmatched rows that DISTINCT cannot resolve. Both rewrite paths were replacing the entire ON clause with the binned equi-join and overlap predicate, silently dropping any user-supplied conditions alongside INTERSECTS. Extract non- INTERSECTS conditions from the original ON and AND them back into the rewritten clause.
DISTINCT is added unconditionally to column-to-column INTERSECTS joins to eliminate duplicates from the bin fan-out. This section explains the mechanism, the edge case where it can collapse genuinely identical source rows, and the mitigation of including any distinguishing column in the SELECT list.
Move DEFAULT_BIN_SIZE to constants module and export from __init__. Extract shared _build_bin_range helper to eliminate duplicate bin-computation logic between the two CTE builders. Replace the mutable-list connector counter with itertools.count. Add isinstance check for bin_size so floats are rejected early. Rewrite _remove_intersects_from_where to use _extract_non_intersects so deeply-nested AND trees are handled cleanly. Expand docstrings on the class, __init__, _find_column_intersects_in, and _build_join_back_joins to document assumptions and limitations.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Add
IntersectsBinnedJoinTransformerto rewrite column-to-columnINTERSECTSjoins into binned equi-joins usingUNNEST(range(...))CTEs. The generated SQL is portable across DuckDB, DataFusion, and PostgreSQL — no runtime extensions required. Two rewrite strategies are selected automatically based on the SELECT list: a simpler full-CTE path when no wildcards are present, and a key-only bridge CTE pattern when wildcards (a.*) would otherwise leak the internal__giql_bincolumn.SELECT DISTINCTis added unconditionally to deduplicate rows from multi-bin matches. Outer join semantics (LEFT, RIGHT, FULL) are preserved by placing both the equi-join and overlap predicates in the ON clause, with FULL OUTER routed to the full-CTE path to avoid spurious unmatched rows from the bridge chain. Extra ON conditions alongside INTERSECTS are extracted and re-attached to the rewritten join.Closes #78
Proposed changes
Binned equi-join transformer
Add
IntersectsBinnedJoinTransformerinsrc/giql/transformer.py. Each interval is assigned to bins viaUNNEST(range(CAST(start/B AS BIGINT), CAST((end-1)/B+1 AS BIGINT))). The transformer handles explicitJOIN ON, implicit cross-join (FROM a, b WHERE ...), self-joins, multi-table joins, and custom column mappings. Bin size defaults toDEFAULT_BIN_SIZE(10,000) and is configurable via thebin_sizeparameter ontranspile().Two rewrite strategies
SELECT *, __giql_binCTEs and rewrite the JOIN ON. Used when the SELECT list has no wildcards, or when a FULL OUTER JOIN is present.SELECT chrom, start, end, __giql_binCTEs and a three-join chain that keeps original table references intact, preventing__giql_binfrom appearing ina.*expansion.Outer join and extra ON condition handling
Propagate the join side (
LEFT,RIGHT) to the bridge path's join2 and join3. RouteFULL OUTER JOINqueries with wildcards to the full-CTE path, where the single-join structure avoids spurious unmatched rows from bin fan-out. Extract non-INTERSECTS siblings from AND trees in ON clauses via_extract_non_intersects()and re-attach them to the rewritten join.Code quality improvements
Move
DEFAULT_BIN_SIZEtoconstants.pyand export fromgiql.__init__. Extract shared_build_bin_range()helper to eliminate duplicate bin-computation logic. Replace mutable-list connector counter withitertools.count. Addisinstancecheck forbin_sizeto reject floats early. Rewrite_remove_intersects_from_whereto handle deeply-nested AND trees cleanly.Documentation
Document the DISTINCT deduplication behavior in
docs/dialect/spatial-operators.rstunder a new "Deduplication Behavior" subsection of INTERSECTS, explaining the mechanism, the edge case where genuinely identical rows are collapsed, and the mitigation of including a distinguishing column.Test cases
TestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinbin_size=NoneTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinbin_size=0or negativeTestTranspileBinnedJoinTestTranspileBinnedJoinTestTranspileBinnedJoinTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusionTestBinnedJoinDataFusion__giql_binin outputTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinOuterJoinSemanticsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinAdditionalOnConditionsTestBinnedJoinDistinctSemanticsTestBinnedJoinDistinctSemanticsTestBinnedJoinDistinctSemanticsTestBinnedJoinDistinctSemantics