-
Notifications
You must be signed in to change notification settings - Fork 2
Transpile INTERSECTS to binned equi-join in pure SQL for full-table joins #78
Description
Description
When INTERSECTS is used in a full-table join (a.interval INTERSECTS b.interval), transpile it to a binned equi-join using pure SQL — UNNEST, range bins, and a canonical-bin dedup filter — rather than relying on a custom DataFusion optimizer or external execution backend.
The binned SQL pattern is well-understood and works on any SQL engine that supports UNNEST and range():
WITH a_binned AS (
SELECT *, UNNEST(range(CAST(start / B AS BIGINT), CAST((end - 1) / B + 1 AS BIGINT))) AS __bin
FROM a
),
b_binned AS (
SELECT *, UNNEST(range(CAST(start / B AS BIGINT), CAST((end - 1) / B + 1 AS BIGINT))) AS __bin
FROM b
)
SELECT DISTINCT a_binned.*, b_binned.*
FROM a_binned JOIN b_binned
ON a_binned.chrom = b_binned.chrom AND a_binned.__bin = b_binned.__bin
WHERE a_binned.start < b_binned.end AND a_binned.end > b_binned.startThe bin size B can be a fixed default (e.g., 10,000) or configurable via a transpiler option. This approach requires no Rust code, no custom DataFusion extensions, and no external dependencies — it runs on any DuckDB, DataFusion, or PostgreSQL instance.
Motivation
The datafusion-bio dialect (#77) provides optimized execution via COI trees but requires a specific runtime (DataFusion + datafusion-bio-functions). The datafusion dialect (#67) requires a custom Rust crate. For users who want portable SQL — or who run on DuckDB or PostgreSQL — a pure-SQL binned join is the simplest path to efficient INTERSECTS joins with no runtime dependencies.
This also serves as the default dialect's optimization for full-table joins, replacing the current naive overlap predicate (a.start < b.end AND a.end > b.start) which forces a nested-loop or cross join on most engines.
Expected outcome
- The
"default"dialect (or a new"binned"dialect) transpiles full-table INTERSECTS joins to the binned equi-join SQL pattern - Literal range queries (
interval INTERSECTS 'chr1:1000-2000') continue to emit simple filter predicates (no binning needed for single-row lookups) - Bin size is configurable via a transpiler parameter, defaulting to 10,000
- The canonical-bin dedup filter (
__bin = CASE WHEN ...) is used instead of DISTINCT where the engine supports CASE expressions, falling back to DISTINCT otherwise - The generated SQL is portable across DuckDB, DataFusion, and PostgreSQL (all support UNNEST + range or generate_series)