Skip to content

Add datafusion-bio target dialect for GIQL transpilation #77

@conradbzura

Description

@conradbzura

Description

Add a "datafusion-bio" target dialect to the GIQL transpiler that emits SQL compatible with datafusion-bio-functions — a DataFusion extension crate providing optimized genomic interval operations via cache-oblivious interval trees (coitrees), with pluggable algorithm backends.

The datafusion-bio-functions crate provides two integration paths:

  1. Physical optimizer rule — intercepts standard SQL joins with range overlap predicates (a.start <= b.end AND a.end >= b.start) and replaces them with a build/probe IntervalJoinExec
  2. Table functions (UDTFs)overlap(), nearest(), merge(), cluster(), count_overlaps(), coverage(), complement(), subtract()

Benchmarking against our custom giql-datafusion COI tree implementation (PR #76) shows datafusion-bio-functions is 2-3x faster across all tested distributions, due to hash-partitioned chromosome lookup (AHashMap) and tighter physical-level integration.

Operator mapping

All existing GIQL operators have a natural mapping to datafusion-bio primitives:

GIQL Operator datafusion-bio Target Strategy
INTERSECTS (join) Standard SQL join with overlap predicates Physical optimizer intercepts and replaces with COI tree join
INTERSECTS (literal) Standard SQL filter predicates No change from default dialect
CONTAINS Standard SQL join with overlap predicates + post-filter Overlap join finds candidates, post-filter a.start <= b.start AND a.end >= b.end narrows to containment
WITHIN Standard SQL join with overlap predicates + post-filter Overlap join finds candidates, post-filter a.start >= b.start AND a.end <= b.end narrows to enclosure
NEAREST nearest('a', 'b') UDTF Direct mapping to table function
MERGE merge('a') UDTF Direct mapping to table function
CLUSTER cluster('a') UDTF Direct mapping to table function
DISTANCE Standard SQL arithmetic No equivalent UDTF; transpile as before

CONTAINS/WITHIN post-filter trade-off

datafusion-bio's FilterOp only supports Weak (standard overlap) and Strict (boundary-exclusive) — no native containment or enclosure mode. CONTAINS and WITHIN must therefore be expressed as an overlap join plus a post-join filter.

This is correct but suboptimal when overlap is dense and containment is rare: the COI tree probe returns all overlapping pairs, output rows are materialized, and the containment filter discards the non-matching ones. The wasted work is in batch construction for discarded pairs, not in the tree probe itself (which is the same cost regardless).

The ideal fix is to contribute a CONTAINS/WITHIN FilterOp variant upstream to datafusion-bio-functions, so the containment check happens inside the probe loop before output materialization. This would be a follow-up contribution.

UDTF parameter coverage

GIQL operators accept parameters that the UDTFs may not fully support (e.g., NEAREST's k, max_distance, stranded; CLUSTER's distance, stranded). The generator should use the UDTF when parameter sets align and fall back to the current SQL rewrite (window functions, LATERAL subqueries) when they don't. This decision is made per-invocation by checking which arguments are present.

Additionally, datafusion-bio provides operations that GIQL could adopt as new operators in the future:

datafusion-bio UDTF Potential GIQL Operator Description
count_overlaps('a', 'b') COUNT_OVERLAPS Count overlapping intervals per row
coverage('a', 'b') COVERAGE Per-base coverage depth
complement('a') COMPLEMENT Gaps between intervals
subtract('a', 'b') SUBTRACT Remove overlapping regions

Motivation

The giql-datafusion crate (PR #76) implemented a full interval join optimizer from scratch — COI tree execution plan, logical optimizer rule, Parquet sampling, adaptive bin sizing — totaling ~1,500 lines of Rust. Benchmarking against datafusion-bio-functions (which also uses coitrees) revealed that their implementation is consistently 2-3x faster due to physical-level integration advantages that are difficult to replicate through DataFusion's Extension node mechanism.

By adopting datafusion-bio-functions as the execution backend, GIQL can:

  • Drop ~1,500 lines of custom Rust execution code
  • Gain access to multiple interval join algorithms (coitrees, lapper, superintervals) via SQL configuration (SET bio.interval_join_algorithm = ...)
  • Leverage optimized UDTFs for NEAREST, MERGE, and CLUSTER instead of transpiling to complex SQL window functions
  • Benefit from upstream performance improvements and bug fixes
  • Focus development effort on the transpiler and query language features

Expected outcome

  • transpile() accepts dialect="datafusion-bio" which emits SQL that datafusion-bio-functions recognizes and optimizes
  • Spatial joins (INTERSECTS, CONTAINS, WITHIN) emit standard SQL overlap predicates that the physical optimizer intercepts — no placeholder UDFs or custom logical nodes needed. CONTAINS/WITHIN add a post-join filter for the containment check.
  • NEAREST, MERGE, CLUSTER emit datafusion-bio UDTF calls when parameters are compatible, falling back to standard SQL rewrites when unsupported parameters are present
  • Users create a SessionContext via create_bio_session() and execute the transpiled SQL directly
  • The existing giql-datafusion crate and dialect="datafusion" path are deprecated in favor of this approach
  • Document the CONTAINS/WITHIN post-filter limitation and plan upstream contribution of native containment FilterOp

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions