-
Notifications
You must be signed in to change notification settings - Fork 2
Add datafusion-bio target dialect for GIQL transpilation #77
Description
Description
Add a "datafusion-bio" target dialect to the GIQL transpiler that emits SQL compatible with datafusion-bio-functions — a DataFusion extension crate providing optimized genomic interval operations via cache-oblivious interval trees (coitrees), with pluggable algorithm backends.
The datafusion-bio-functions crate provides two integration paths:
- Physical optimizer rule — intercepts standard SQL joins with range overlap predicates (
a.start <= b.end AND a.end >= b.start) and replaces them with a build/probeIntervalJoinExec - Table functions (UDTFs) —
overlap(),nearest(),merge(),cluster(),count_overlaps(),coverage(),complement(),subtract()
Benchmarking against our custom giql-datafusion COI tree implementation (PR #76) shows datafusion-bio-functions is 2-3x faster across all tested distributions, due to hash-partitioned chromosome lookup (AHashMap) and tighter physical-level integration.
Operator mapping
All existing GIQL operators have a natural mapping to datafusion-bio primitives:
| GIQL Operator | datafusion-bio Target | Strategy |
|---|---|---|
INTERSECTS (join) |
Standard SQL join with overlap predicates | Physical optimizer intercepts and replaces with COI tree join |
INTERSECTS (literal) |
Standard SQL filter predicates | No change from default dialect |
CONTAINS |
Standard SQL join with overlap predicates + post-filter | Overlap join finds candidates, post-filter a.start <= b.start AND a.end >= b.end narrows to containment |
WITHIN |
Standard SQL join with overlap predicates + post-filter | Overlap join finds candidates, post-filter a.start >= b.start AND a.end <= b.end narrows to enclosure |
NEAREST |
nearest('a', 'b') UDTF |
Direct mapping to table function |
MERGE |
merge('a') UDTF |
Direct mapping to table function |
CLUSTER |
cluster('a') UDTF |
Direct mapping to table function |
DISTANCE |
Standard SQL arithmetic | No equivalent UDTF; transpile as before |
CONTAINS/WITHIN post-filter trade-off
datafusion-bio's FilterOp only supports Weak (standard overlap) and Strict (boundary-exclusive) — no native containment or enclosure mode. CONTAINS and WITHIN must therefore be expressed as an overlap join plus a post-join filter.
This is correct but suboptimal when overlap is dense and containment is rare: the COI tree probe returns all overlapping pairs, output rows are materialized, and the containment filter discards the non-matching ones. The wasted work is in batch construction for discarded pairs, not in the tree probe itself (which is the same cost regardless).
The ideal fix is to contribute a CONTAINS/WITHIN FilterOp variant upstream to datafusion-bio-functions, so the containment check happens inside the probe loop before output materialization. This would be a follow-up contribution.
UDTF parameter coverage
GIQL operators accept parameters that the UDTFs may not fully support (e.g., NEAREST's k, max_distance, stranded; CLUSTER's distance, stranded). The generator should use the UDTF when parameter sets align and fall back to the current SQL rewrite (window functions, LATERAL subqueries) when they don't. This decision is made per-invocation by checking which arguments are present.
Additionally, datafusion-bio provides operations that GIQL could adopt as new operators in the future:
| datafusion-bio UDTF | Potential GIQL Operator | Description |
|---|---|---|
count_overlaps('a', 'b') |
COUNT_OVERLAPS |
Count overlapping intervals per row |
coverage('a', 'b') |
COVERAGE |
Per-base coverage depth |
complement('a') |
COMPLEMENT |
Gaps between intervals |
subtract('a', 'b') |
SUBTRACT |
Remove overlapping regions |
Motivation
The giql-datafusion crate (PR #76) implemented a full interval join optimizer from scratch — COI tree execution plan, logical optimizer rule, Parquet sampling, adaptive bin sizing — totaling ~1,500 lines of Rust. Benchmarking against datafusion-bio-functions (which also uses coitrees) revealed that their implementation is consistently 2-3x faster due to physical-level integration advantages that are difficult to replicate through DataFusion's Extension node mechanism.
By adopting datafusion-bio-functions as the execution backend, GIQL can:
- Drop ~1,500 lines of custom Rust execution code
- Gain access to multiple interval join algorithms (coitrees, lapper, superintervals) via SQL configuration (
SET bio.interval_join_algorithm = ...) - Leverage optimized UDTFs for NEAREST, MERGE, and CLUSTER instead of transpiling to complex SQL window functions
- Benefit from upstream performance improvements and bug fixes
- Focus development effort on the transpiler and query language features
Expected outcome
transpile()acceptsdialect="datafusion-bio"which emits SQL thatdatafusion-bio-functionsrecognizes and optimizes- Spatial joins (INTERSECTS, CONTAINS, WITHIN) emit standard SQL overlap predicates that the physical optimizer intercepts — no placeholder UDFs or custom logical nodes needed. CONTAINS/WITHIN add a post-join filter for the containment check.
- NEAREST, MERGE, CLUSTER emit
datafusion-bioUDTF calls when parameters are compatible, falling back to standard SQL rewrites when unsupported parameters are present - Users create a
SessionContextviacreate_bio_session()and execute the transpiled SQL directly - The existing
giql-datafusioncrate anddialect="datafusion"path are deprecated in favor of this approach - Document the CONTAINS/WITHIN post-filter limitation and plan upstream contribution of native containment FilterOp