Skip to content

Transpile INTERSECTS to binned equi-join in pure SQL for full-table joins #78

@conradbzura

Description

@conradbzura

Description

When INTERSECTS is used in a full-table join (a.interval INTERSECTS b.interval), transpile it to a binned equi-join using pure SQL — UNNEST, range bins, and a canonical-bin dedup filter — rather than relying on a custom DataFusion optimizer or external execution backend.

The binned SQL pattern is well-understood and works on any SQL engine that supports UNNEST and range():

WITH a_binned AS (
  SELECT *, UNNEST(range(CAST(start / B AS BIGINT), CAST((end - 1) / B + 1 AS BIGINT))) AS __bin
  FROM a
),
b_binned AS (
  SELECT *, UNNEST(range(CAST(start / B AS BIGINT), CAST((end - 1) / B + 1 AS BIGINT))) AS __bin
  FROM b
)
SELECT DISTINCT a_binned.*, b_binned.*
FROM a_binned JOIN b_binned
  ON a_binned.chrom = b_binned.chrom AND a_binned.__bin = b_binned.__bin
WHERE a_binned.start < b_binned.end AND a_binned.end > b_binned.start

The bin size B can be a fixed default (e.g., 10,000) or configurable via a transpiler option. This approach requires no Rust code, no custom DataFusion extensions, and no external dependencies — it runs on any DuckDB, DataFusion, or PostgreSQL instance.

Motivation

The datafusion-bio dialect (#77) provides optimized execution via COI trees but requires a specific runtime (DataFusion + datafusion-bio-functions). The datafusion dialect (#67) requires a custom Rust crate. For users who want portable SQL — or who run on DuckDB or PostgreSQL — a pure-SQL binned join is the simplest path to efficient INTERSECTS joins with no runtime dependencies.

This also serves as the default dialect's optimization for full-table joins, replacing the current naive overlap predicate (a.start < b.end AND a.end > b.start) which forces a nested-loop or cross join on most engines.

Expected outcome

  • The "default" dialect (or a new "binned" dialect) transpiles full-table INTERSECTS joins to the binned equi-join SQL pattern
  • Literal range queries (interval INTERSECTS 'chr1:1000-2000') continue to emit simple filter predicates (no binning needed for single-row lookups)
  • Bin size is configurable via a transpiler parameter, defaulting to 10,000
  • The canonical-bin dedup filter (__bin = CASE WHEN ...) is used instead of DISTINCT where the engine supports CASE expressions, falling back to DISTINCT otherwise
  • The generated SQL is portable across DuckDB, DataFusion, and PostgreSQL (all support UNNEST + range or generate_series)

Metadata

Metadata

Assignees

Labels

enhancementNew feature or request

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions