Transpile INTERSECTS to binned equi-join in pure SQL for full-table joins — Closes #78 #79

Draft
conradbzura wants to merge 16 commits into main from 78-binned-equijoin-intersects


conradbzura commented Mar 31, 2026

Summary

Add IntersectsBinnedJoinTransformer to rewrite column-to-column INTERSECTS joins into binned equi-joins using UNNEST(range(...)) CTEs. The generated SQL is portable across DuckDB, DataFusion, and PostgreSQL — no runtime extensions required. Two rewrite strategies are selected automatically based on the SELECT list: a simpler full-CTE path when no wildcards are present, and a key-only bridge CTE pattern when wildcards (a.*) would otherwise leak the internal __giql_bin column. SELECT DISTINCT is added unconditionally to deduplicate rows from multi-bin matches. Outer join semantics (LEFT, RIGHT, FULL) are preserved by placing both the equi-join and overlap predicates in the ON clause, with FULL OUTER routed to the full-CTE path to avoid spurious unmatched rows from the bridge chain. Extra ON conditions alongside INTERSECTS are extracted and re-attached to the rewritten join.

Closes #78

Proposed changes

Binned equi-join transformer

Add IntersectsBinnedJoinTransformer in src/giql/transformer.py. Each interval is assigned to bins via UNNEST(range(CAST(start/B AS BIGINT), CAST((end-1)/B+1 AS BIGINT))), where B is the bin size. The transformer handles explicit JOIN ON, implicit cross-join (FROM a, b WHERE ...), self-joins, multi-table joins, and custom column mappings. Bin size defaults to DEFAULT_BIN_SIZE (10,000) and is configurable via the bin_size parameter on transpile().
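
A rough Python rendering of the bin-range expression may help when checking edge cases by hand. This is a sketch only: `bin_range` is a hypothetical helper rather than the PR's `_build_bin_range()`, and it assumes non-negative, half-open `[start, end)` coordinates with integer (floor) division.

```python
DEFAULT_BIN_SIZE = 10_000  # default stated in this PR


def bin_range(start: int, end: int, bin_size: int = DEFAULT_BIN_SIZE) -> range:
    """Return the bin indices touched by the half-open interval [start, end).

    Mirrors range(start / B, (end - 1) / B + 1), assuming the division
    truncates to an integer.
    """
    first_bin = start // bin_size
    last_bin_exclusive = (end - 1) // bin_size + 1
    return range(first_bin, last_bin_exclusive)


# An interval that ends exactly on a bin boundary stays in one bin.
print(list(bin_range(0, 10_000)))
# An interval that crosses the boundary lands in both neighbouring bins.
print(list(bin_range(9_999, 10_001)))
```

Using `end - 1` before dividing is what keeps an interval ending exactly on a boundary out of the next bin.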

Two rewrite strategies

  • Full-CTE path — replace table references with SELECT *, __giql_bin CTEs and rewrite the JOIN ON. Used when the SELECT list has no wildcards, or when a FULL OUTER JOIN is present.
  • Bridge path — create key-only SELECT chrom, start, end, __giql_bin CTEs and a three-join chain that keeps original table references intact, preventing __giql_bin from appearing in a.* expansion.
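
The selection rule described by the two bullets can be sketched as a small decision function. The names `Strategy` and `choose_strategy` and the string-based SELECT list are assumptions made for illustration; the actual transformer presumably inspects the parsed query instead.

```python
from enum import Enum


class Strategy(Enum):
    FULL_CTE = "full_cte"
    BRIDGE = "bridge"


def choose_strategy(select_items: list[str], join_side: str = "") -> Strategy:
    """Pick a rewrite path from the SELECT list and the join side.

    select_items holds rendered SELECT expressions such as "a.*" or "a.chrom";
    join_side is "", "LEFT", "RIGHT" or "FULL" (hypothetical encoding).
    """
    has_wildcard = any(
        item == "*" or item.endswith(".*") for item in select_items
    )
    # No wildcard means __giql_bin cannot leak, so the simpler path is safe.
    # FULL OUTER is always routed to the full-CTE path.
    if not has_wildcard or join_side.upper() == "FULL":
        return Strategy.FULL_CTE
    return Strategy.BRIDGE


print(choose_strategy(["a.chrom", "b.start"]))
print(choose_strategy(["a.*", "b.name"], "LEFT"))
print(choose_strategy(["*"], "FULL"))
```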

Outer join and extra ON condition handling

Propagate the join side (LEFT, RIGHT) to the second and third joins (join2 and join3) of the bridge path's three-join chain. Route FULL OUTER JOIN queries with wildcards to the full-CTE path, where the single-join structure avoids spurious unmatched rows from bin fan-out. Extract non-INTERSECTS siblings from AND trees in ON clauses via _extract_non_intersects() and re-attach them to the rewritten join.
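
To illustrate the idea of pulling non-INTERSECTS siblings out of a nested AND tree, here is a minimal sketch over a toy tuple-based expression tree. It does not use sqlglot and is not the PR's `_extract_non_intersects()`; the node shapes (`"and"`, `"intersects"`, `"eq"`, `"gt"`) are invented for the example.

```python
def split_on_clause(node: tuple) -> tuple[list, list]:
    """Split an AND tree into (INTERSECTS predicates, everything else).

    Nodes are tuples whose first element is the operator name; "and" nodes
    carry exactly two children (toy representation, an assumption).
    """
    if node[0] == "and":
        left_hits, left_rest = split_on_clause(node[1])
        right_hits, right_rest = split_on_clause(node[2])
        return left_hits + right_hits, left_rest + right_rest
    if node[0] == "intersects":
        return [node], []
    return [], [node]


on_clause = (
    "and",
    ("and", ("eq", "a.strand", "b.strand"), ("intersects", "a.interval", "b.interval")),
    ("gt", "a.score", "5"),
)

intersects, extras = split_on_clause(on_clause)
print(intersects)  # the predicate to rewrite into the binned equi-join
print(extras)      # conditions to AND back onto the rewritten ON clause
```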

Code quality improvements

Move DEFAULT_BIN_SIZE to constants.py and export from giql.__init__. Extract shared _build_bin_range() helper to eliminate duplicate bin-computation logic. Replace mutable-list connector counter with itertools.count. Add isinstance check for bin_size to reject floats early. Rewrite _remove_intersects_from_where to handle deeply-nested AND trees cleanly.
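
The validation and counter changes might look roughly like the sketch below. The function names, the alias format, and the choice of `TypeError` for non-integers are assumptions; the PR only states that non-positive values raise `ValueError` and that floats are rejected early.

```python
import itertools
from typing import Callable, Optional

DEFAULT_BIN_SIZE = 10_000  # value stated in this PR


def resolve_bin_size(bin_size: Optional[int] = None) -> int:
    """Return a validated bin size, falling back to the default for None."""
    if bin_size is None:
        return DEFAULT_BIN_SIZE
    # bool is a subclass of int, so it is rejected explicitly (assumption).
    if isinstance(bin_size, bool) or not isinstance(bin_size, int):
        raise TypeError(f"bin_size must be an int, got {type(bin_size).__name__}")
    if bin_size <= 0:
        raise ValueError("bin_size must be positive")
    return bin_size


def make_alias_factory(prefix: str = "__giql_c") -> Callable[[], str]:
    """Return a callable that yields unique aliases (hypothetical naming)."""
    counter = itertools.count()  # replaces a mutable one-element list
    return lambda: f"{prefix}{next(counter)}"


next_alias = make_alias_factory()
print(resolve_bin_size(), next_alias(), next_alias())
```

`itertools.count` keeps the counter state in one iterator object, so no closure has to mutate a shared list.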

Documentation

Document the DISTINCT deduplication behavior in docs/dialect/spatial-operators.rst under a new "Deduplication Behavior" subsection of INTERSECTS, explaining the mechanism, the edge case where genuinely identical rows are collapsed, and the mitigation of including a distinguishing column.
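
The deduplication behavior and its edge case can be reproduced in a few lines of plain Python. This is an illustrative simulation of the bin fan-out, not the generated SQL; rows are hypothetical tuples of `(chrom, start, end[, id])`.

```python
def bins(start: int, end: int, size: int) -> range:
    """Bins touched by the half-open interval [start, end)."""
    return range(start // size, (end - 1) // size + 1)


def binned_join_rows(left, right, size=10_000):
    """Emit one output row per shared bin, as an equi-join on bin would."""
    return [
        a + b
        for a in left
        for b in right
        for bin_a in bins(a[1], a[2], size)
        for bin_b in bins(b[1], b[2], size)
        if a[0] == b[0] and bin_a == bin_b and a[1] < b[2] and a[2] > b[1]
    ]


left = [("chr1", 5_000, 25_000)]
# Two genuinely identical source rows on the right-hand side.
right_plain = [("chr1", 8_000, 22_000), ("chr1", 8_000, 22_000)]
# The same rows with a distinguishing column added.
right_with_id = [("chr1", 8_000, 22_000, "r1"), ("chr1", 8_000, 22_000, "r2")]

raw = binned_join_rows(left, right_plain)
print(len(raw))                                         # fan-out duplicates
print(len(set(raw)))                                    # after DISTINCT
print(len(set(binned_join_rows(left, right_with_id))))  # with the extra column
```

Each pair shares three bins, so the raw join emits every match three times. DISTINCT removes that fan-out but also merges the two identical source rows unless a distinguishing column is selected.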

Test cases

| # | Test Suite | Test ID | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|---|
| 1 | TestTranspileBinnedJoin | BJ-001 | A GIQL query joining two tables with INTERSECTS | Transpiling with default settings | SQL contains CTEs with UNNEST/range, equi-join ON, and DISTINCT | Basic rewrite structure |
| 2 | TestTranspileBinnedJoin | BJ-002 | A GIQL query with custom bin_size=5000 | Transpiling | Bin size 5000 appears in the generated SQL | Custom bin size |
| 3 | TestTranspileBinnedJoin | BJ-003 | Tables with custom column mappings | Transpiling | Custom column names appear in the generated SQL | Column mapping |
| 4 | TestTranspileBinnedJoin | BJ-004 | An INTERSECTS with a literal range string | Transpiling | No binned CTEs are generated | Literal passthrough |
| 5 | TestTranspileBinnedJoin | BJ-005 | A query with no JOIN | Transpiling | Query passes through unchanged | No-join passthrough |
| 6 | TestTranspileBinnedJoin | BJ-006 | A query with WHERE filter alongside INTERSECTS | Transpiling | Original WHERE conditions are preserved | WHERE preservation |
| 7 | TestTranspileBinnedJoin | BJ-007 | bin_size=None | Transpiling | Default bin size 10000 is used | Default bin size |
| 8 | TestTranspileBinnedJoin | BJ-008 | Implicit cross-join with WHERE INTERSECTS | Transpiling | Binned optimization is applied | Implicit cross-join |
| 9 | TestTranspileBinnedJoin | BJ-009 | Self-join on the same table | Transpiling | Only one shared bin CTE is created | Self-join dedup |
| 10 | TestTranspileBinnedJoin | BJ-010 | bin_size=0 or negative | Transpiling | ValueError is raised | Validation |
| 11 | TestTranspileBinnedJoin | BJ-011 | Three tables with two INTERSECTS joins | Transpiling | Both joins are rewritten with separate CTEs | Multi-join |
| 12 | TestTranspileBinnedJoin | BJ-012 | SELECT with explicit column names | Transpiling | Full-CTE path is used (no bridge) | Strategy selection |
| 13 | TestTranspileBinnedJoin | BJ-013 | SELECT with wildcards | Transpiling | Bridge path is used (key-only CTEs) | Strategy selection |
| 14 | TestBinnedJoinDataFusion | DF-001 | Overlapping intervals across two tables | Executing binned join SQL | Correct rows returned with no duplicates | Correctness |
| 15 | TestBinnedJoinDataFusion | DF-002 | Non-overlapping intervals | Executing binned join SQL | Zero rows returned | Non-overlap |
| 16 | TestBinnedJoinDataFusion | DF-003 | Adjacent intervals (half-open coordinates) | Executing binned join SQL | Zero rows returned | Half-open semantics |
| 17 | TestBinnedJoinDataFusion | DF-004 | Intervals on different chromosomes | Executing binned join SQL | Only same-chromosome matches returned | Chromosome filter |
| 18 | TestBinnedJoinDataFusion | DF-005 | Intervals spanning multiple bins | Executing binned join SQL | Correct results with no duplicate rows | Multi-bin dedup |
| 19 | TestBinnedJoinDataFusion | DF-006 | Binned join vs naive cross-join | Executing both | Results are identical | Equivalence |
| 20 | TestBinnedJoinDataFusion | DF-007 | Implicit cross-join syntax | Executing binned join SQL | Correct rows, no __giql_bin in output | Column leak prevention |
| 21 | TestBinnedJoinOuterJoinSemantics | OJ-001 | LEFT JOIN with unmatched left rows (full-CTE) | Executing | Unmatched left rows appear with NULL right columns | LEFT JOIN full-CTE |
| 22 | TestBinnedJoinOuterJoinSemantics | OJ-002 | LEFT JOIN with unmatched left rows (bridge) | Executing | Unmatched left rows appear with NULL right columns | LEFT JOIN bridge |
| 23 | TestBinnedJoinOuterJoinSemantics | OJ-003 | RIGHT JOIN with unmatched right rows (full-CTE) | Executing | Unmatched right rows appear with NULL left columns | RIGHT JOIN full-CTE |
| 24 | TestBinnedJoinOuterJoinSemantics | OJ-004 | RIGHT JOIN with unmatched right rows (bridge) | Executing | Unmatched right rows appear with NULL left columns | RIGHT JOIN bridge |
| 25 | TestBinnedJoinOuterJoinSemantics | OJ-005 | FULL OUTER JOIN (full-CTE) | Executing | Both unmatched sides appear with NULLs | FULL OUTER full-CTE |
| 26 | TestBinnedJoinOuterJoinSemantics | OJ-006 | FULL OUTER JOIN (bridge fallback) | Executing | Both unmatched sides appear with NULLs | FULL OUTER bridge fallback |
| 27 | TestBinnedJoinOuterJoinSemantics | OJ-007 | LEFT JOIN where no rows match | Executing | All left rows returned with NULL right columns | All-unmatched LEFT |
| 28 | TestBinnedJoinAdditionalOnConditions | AC-001 | Extra equality in ON alongside INTERSECTS (full-CTE) | Executing | Extra condition filters results correctly | Extra ON full-CTE |
| 29 | TestBinnedJoinAdditionalOnConditions | AC-002 | Extra equality in ON alongside INTERSECTS (bridge) | Executing | Extra condition filters results correctly | Extra ON bridge |
| 30 | TestBinnedJoinAdditionalOnConditions | AC-003 | Extra ON condition with LEFT JOIN | Executing | Unmatched rows preserved, extra filter applied | Extra ON + LEFT |
| 31 | TestBinnedJoinAdditionalOnConditions | AC-004 | Multiple extra conditions in ON | Executing | All extra conditions preserved and applied | Multiple ON conditions |
| 32 | TestBinnedJoinAdditionalOnConditions | AC-005 | Extra WHERE condition with implicit cross-join | Executing | WHERE condition applied after INTERSECTS rewrite | Extra WHERE cross-join |
| 33 | TestBinnedJoinDistinctSemantics | DS-001 | Duplicate source rows with no unique column (full-CTE) | Executing | Duplicates collapsed (xfail, known limitation) | DISTINCT limitation |
| 34 | TestBinnedJoinDistinctSemantics | DS-002 | Duplicate source rows with no unique column (bridge) | Executing | Duplicates collapsed (xfail, known limitation) | DISTINCT limitation |
| 35 | TestBinnedJoinDistinctSemantics | DS-003 | Rows with distinguishing column | Executing | All distinct rows preserved | DISTINCT with unique col |
| 36 | TestBinnedJoinDistinctSemantics | DS-004 | User-specified DISTINCT already in query | Executing | Still works correctly | Idempotent DISTINCT |

Column-to-column INTERSECTS joins (e.g., a.interval INTERSECTS
b.interval) are now rewritten into binned equi-joins using CTEs with
UNNEST(range(...)) bin assignments. This gives the query planner an
equi-join key to work with instead of forcing a nested-loop or cross
join. The bin size defaults to 10,000 and is configurable via the
new bin_size parameter on transpile(). Literal-range INTERSECTS
filters remain unchanged.
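
As a sanity check on why an equi-join key on (chrom, bin) is enough, the following sketch compares a hash join keyed on bins against a naive nested-loop overlap join in plain Python. All function names and the random test data are hypothetical; this simulates the idea and is not the SQL the transformer emits.

```python
import random
from collections import defaultdict


def bins(start: int, end: int, size: int) -> range:
    """Bins touched by the half-open interval [start, end)."""
    return range(start // size, (end - 1) // size + 1)


def overlaps(a, b) -> bool:
    return a[0] == b[0] and a[1] < b[2] and a[2] > b[1]


def naive_join(left, right) -> set:
    """Nested-loop join: every left row is compared with every right row."""
    return {(a, b) for a in left for b in right if overlaps(a, b)}


def binned_hash_join(left, right, size: int = 10_000) -> set:
    """Hash join on (chrom, bin), then the exact overlap check."""
    index = defaultdict(list)
    for b in right:
        for k in bins(b[1], b[2], size):
            index[(b[0], k)].append(b)
    matches = set()  # the set plays the role of DISTINCT
    for a in left:
        for k in bins(a[1], a[2], size):
            for b in index.get((a[0], k), ()):
                if overlaps(a, b):
                    matches.add((a, b))
    return matches


def random_intervals(rng: random.Random, n: int) -> list:
    rows = []
    for _ in range(n):
        start = rng.randrange(0, 200_000)
        length = rng.randrange(1, 30_000)
        rows.append((rng.choice(["chr1", "chr2"]), start, start + length))
    return rows


rng = random.Random(0)
left, right = random_intervals(rng, 200), random_intervals(rng, 200)
print(binned_hash_join(left, right) == naive_join(left, right))
```

Two overlapping half-open intervals always contain a common position, and that position's bin belongs to both intervals, so the bin key never hides a true match.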

The datafusion package is needed for end-to-end correctness tests that validate the binned
equi-join SQL against DataFusion's query engine.

The transformer now detects column-to-column INTERSECTS in WHERE
clauses (FROM a, b WHERE a.interval INTERSECTS b.interval), not
just in explicit JOIN ON conditions. Both patterns are rewritten
to binned equi-joins for the same performance benefit.

The tests cover both explicit JOIN ON and implicit cross-join patterns,
custom bin sizes, custom column mappings, self-joins, literal
range passthrough, and end-to-end correctness against DataFusion
including multi-bin deduplication and equivalence with naive joins.

Move the overlap predicate (a.start < b.end AND a.end > b.start) from WHERE
into the JOIN ON clause so that LEFT/RIGHT/FULL JOIN semantics are
preserved — a WHERE filter on the right-side columns silently converts
outer joins into inner joins.
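
The effect is easy to reproduce with Python's built-in sqlite3 module. The tables and rows below are made up for the demonstration, and SQLite stands in for the engines named in this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (chrom TEXT, "start" INTEGER, "end" INTEGER);
    CREATE TABLE b (chrom TEXT, "start" INTEGER, "end" INTEGER);
    INSERT INTO a VALUES ('chr1', 0, 100), ('chr1', 500, 600);
    INSERT INTO b VALUES ('chr1', 50, 150);
    """
)

# Overlap predicate inside ON: the unmatched left row survives with NULLs.
on_rows = conn.execute(
    """
    SELECT a."start", b."start"
    FROM a LEFT JOIN b
      ON a.chrom = b.chrom AND a."start" < b."end" AND a."end" > b."start"
    ORDER BY a."start"
    """
).fetchall()

# Overlap predicate in WHERE: rows failing the predicate are filtered out
# after the join, so the LEFT JOIN behaves like an INNER JOIN.
where_rows = conn.execute(
    """
    SELECT a."start", b."start"
    FROM a LEFT JOIN b ON a.chrom = b.chrom
    WHERE a."start" < b."end" AND a."end" > b."start"
    ORDER BY a."start"
    """
).fetchall()

print(on_rows)
print(where_rows)
conn.close()
```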

Also refactor the transformer to rewrite all INTERSECTS joins in a
query, not just the first. A new _ensure_table_binned helper tracks
which aliases already have binned CTEs so that multi-join queries
reuse CTEs instead of duplicating them.

Add bin_size validation (must be positive) and remove dead code from
_rewrite_where.

Cover three-way joins with CTE reuse, invalid bin_size rejection,
and update assertions for the overlap-in-ON change. Remove unused
pytest import from module level.

The binned CTE approach leaks __giql_bin into SELECT * results because
CTEs expose all their columns. Revert implicit cross-join rewriting
(FROM a, b WHERE INTERSECTS) so those queries fall through to the
generator's naive overlap predicate, which produces clean column output.
Explicit JOIN ON INTERSECTS continues to use the binned equi-join.

Also add pytest.importorskip for datafusion so the DataFusion
correctness tests are skipped when the module is not installed.

The CI workflow uses pixi, not uv, so the datafusion package must
be listed under [tool.pixi.dependencies] for the DataFusion
correctness tests to run. Remove the pytest.importorskip guard
since the dependency is now always available.

The previous approach replaced FROM/JOIN table references with full
CTEs (SELECT *), causing __giql_bin to appear in SELECT a.* output.
The new approach keeps original table references and routes the equi-
join through key-only bridge CTEs (SELECT chrom, start, end, bin),
eliminating the leak entirely.

This also restores implicit cross-join rewriting (FROM a, b WHERE
INTERSECTS) which was reverted in the prior commit due to the leak.
CTEs are now named __giql_{table}_bins and deduplicated per underlying
table name rather than per alias, so self-joins share one CTE.
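
Keying the CTE registry on the underlying table name rather than the alias can be sketched as follows. `ensure_bin_cte` and the table name `peaks` are hypothetical; only the `__giql_{table}_bins` naming scheme comes from the commit message.

```python
def ensure_bin_cte(registry: dict[str, str], table: str) -> str:
    """Return the bins CTE name for a table, registering it on first use."""
    return registry.setdefault(table, f"__giql_{table}_bins")


registry: dict[str, str] = {}

# Self-join: aliases a and b both point at the same underlying table.
alias_to_cte = {alias: ensure_bin_cte(registry, "peaks") for alias in ("a", "b")}

print(alias_to_cte)
print(len(registry))  # number of CTEs that would be emitted
```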

Queries with explicit column lists (SELECT a.chrom, b.start, ...)
cannot expose __giql_bin in their output regardless of which CTE
the table alias points to. Detecting this at transform time lets
us skip the 3-join bridge pattern entirely for those queries and
use the simpler, faster 1-join full-CTE approach.

Queries with wildcards (SELECT a.*, SELECT *) still take the bridge
path so __giql_bin never leaks into the output column set.

Drop section divider lines (`# --...--`) from `IntersectsBinnedJoinTransformer`
to reduce visual clutter. Descriptive inline comments explaining code behavior
are preserved.

Cover outer join semantics (LEFT/RIGHT/FULL preserved through both
full-CTE and bridge paths), additional ON conditions surviving the
rewrite alongside INTERSECTS, and unconditional DISTINCT collapsing
legitimate duplicate rows. The DISTINCT tests are marked xfail because
preserving duplicates is the correct behavior and the rewrite does not yet do so (a known limitation).

7 tests fail against the current implementation, confirming the bugs.
2 tests are strict xfail documenting the DISTINCT limitation.

Two interrelated fixes for the binned equi-join rewrite:

The bridge path was silently converting LEFT/RIGHT/FULL joins to
INNER because sqlglot stores the join type as "side" not "kind",
and only join3 received it. Propagate the side attribute to both
join2 and join3. FULL OUTER with wildcards falls back to the
full-CTE path because the three-join chain's bin fan-out creates
spurious unmatched rows that DISTINCT cannot resolve.

Both rewrite paths were replacing the entire ON clause with the
binned equi-join and overlap predicate, silently dropping any
user-supplied conditions alongside INTERSECTS. Extract non-
INTERSECTS conditions from the original ON and AND them back into
the rewritten clause.

DISTINCT is added unconditionally to column-to-column INTERSECTS joins
to eliminate duplicates from the bin fan-out. This section explains the
mechanism, the edge case where it can collapse genuinely identical source
rows, and the mitigation of including any distinguishing column in the
SELECT list.

Move DEFAULT_BIN_SIZE to constants module and export from __init__.
Extract shared _build_bin_range helper to eliminate duplicate
bin-computation logic between the two CTE builders.  Replace the
mutable-list connector counter with itertools.count.  Add isinstance
check for bin_size so floats are rejected early.  Rewrite
_remove_intersects_from_where to use _extract_non_intersects so
deeply-nested AND trees are handled cleanly.  Expand docstrings on
the class, __init__, _find_column_intersects_in, and
_build_join_back_joins to document assumptions and limitations.