
Implement cost-based optimizer for INTERSECTS join algorithm selection #67

@conradbzura

Description


Summary

Add a DataFusion optimizer rule that uses Parquet metadata and lightweight sampling to choose between sweep line and binning algorithms for INTERSECTS joins, making the decision independently per partition. Sweep line is output-sensitive and distribution-agnostic — O(n log n + k) regardless of interval width variance. Binning is faster when widths are uniform and known, but suffers replication blowup on wide intervals and false-positive overhead on narrow ones. The optimizer should default to sweep line and select binning only when the partition's data profile confirms it will be cheaper.

Decision inputs

The optimizer gathers statistics per partition at two tiers, both before reading the full dataset:

  1. Parquet metadata (near-free, from file footer)

    • Row group column stats on start/end: min, max, null count, row count
    • Derived width upper bounds: max(end) - min(start) per row group
    • Domain span: global max(start) - min(start)
    • sorting_columns: whether a sort step can be skipped for sweep line
    • Page index (if present): finer-grained range estimates, crude width histogram
  2. Lightweight sampling (milliseconds)

    • Read only the start and end columns from 1–3 representative row groups (first, middle, last)
    • Compute width distribution statistics: median, p95, p99, coefficient of variation (CV)
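
The tier-2 sampling step could compute its statistics with a small helper along these lines (a sketch only; `width_stats` and its field names are illustrative, not existing code in this repo):

```rust
/// Width-distribution statistics for a sampled partition.
#[derive(Debug)]
struct WidthStats {
    median: f64,
    p95: f64,
    p99: f64,
    /// Coefficient of variation: stddev / mean.
    cv: f64,
}

/// Compute width stats from the sampled `start`/`end` columns of 1-3 row groups.
fn width_stats(starts: &[i64], ends: &[i64]) -> WidthStats {
    let mut widths: Vec<f64> = starts
        .iter()
        .zip(ends)
        .map(|(s, e)| (e - s) as f64)
        .collect();
    assert!(!widths.is_empty(), "sample must be non-empty");
    widths.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank percentile over the sorted widths.
    let pct = |p: f64| widths[((widths.len() - 1) as f64 * p).round() as usize];
    let mean = widths.iter().sum::<f64>() / widths.len() as f64;
    let var = widths.iter().map(|w| (w - mean).powi(2)).sum::<f64>() / widths.len() as f64;
    WidthStats {
        median: pct(0.50),
        p95: pct(0.95),
        p99: pct(0.99),
        cv: var.sqrt() / mean,
    }
}
```

Since only two columns from a handful of row groups are read, this stays in the milliseconds range regardless of dataset size.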

Per-partition decision function

When DataFusion repartitions by coordinate range, different partitions may have different width profiles (e.g., one partition covers a region of uniform short reads while another spans a region with large structural variants). The algorithm choice is made independently per partition using two key signals — the p99/median width ratio and the coefficient of variation:

  • p99/median > 10 → heavy-tailed distribution, binning will replicate wide intervals badly → sweep line
  • CV > 1.5 → high width variance, no single bin width works → sweep line
  • Otherwise, estimate binning cost (n * avg_replication * hash_cost) vs sweep line cost (n * log(n) * compare_cost) and pick the cheaper one, sizing bins around p95
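
The decision rules above can be sketched as a pure function of the sampled statistics (the thresholds are the initial heuristics from this issue; the cost constants, `avg_replication` model, and all names are placeholders to be calibrated by the planned benchmarks):

```rust
#[derive(Debug, PartialEq)]
enum JoinAlgo {
    SweepLine,
    Binning { bin_width: f64 },
}

const P99_MEDIAN_CUTOFF: f64 = 10.0; // starting heuristic, to be tuned
const CV_CUTOFF: f64 = 1.5; // starting heuristic, to be tuned

fn choose_algorithm(n: f64, median: f64, p95: f64, p99: f64, cv: f64) -> JoinAlgo {
    // Heavy tail: binning would replicate wide intervals badly.
    if median > 0.0 && p99 / median > P99_MEDIAN_CUTOFF {
        return JoinAlgo::SweepLine;
    }
    // High variance: no single bin width works.
    if cv > CV_CUTOFF {
        return JoinAlgo::SweepLine;
    }
    // Otherwise compare estimated costs, sizing bins around p95.
    let bin_width = p95.max(1.0);
    // Placeholder replication model: intervals spanning bin boundaries.
    let avg_replication = 1.0 + median / bin_width;
    let hash_cost = 1.0; // placeholder constant
    let compare_cost = 1.0; // placeholder constant
    let binning_cost = n * avg_replication * hash_cost;
    let sweep_cost = n * n.max(2.0).log2() * compare_cost;
    if binning_cost < sweep_cost {
        JoinAlgo::Binning { bin_width }
    } else {
        JoinAlgo::SweepLine
    }
}
```

Keeping the function pure over the sampled statistics makes the threshold-tuning work below straightforward: benchmarks can sweep the cutoffs without touching the join operators.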

The initial p99/median and CV thresholds (10.0 and 1.5) are starting heuristics. As part of this work, benchmark against synthetic and real-world interval datasets with varying width distributions to empirically tune these cutoffs. The goal is to identify tighter thresholds that short-circuit to sweep line earlier when the cost function would always agree, and conversely, short-circuit to binning when the distribution is tight enough that the cost comparison is a foregone conclusion. This reduces the optimizer's overhead for the common cases while reserving the full cost model for genuinely ambiguous profiles.

DataFusion integration

  • Row group pruning — Before algorithm selection, use row group min/max stats to prune row groups that cannot overlap the other side of the join. Extend DataFusion's PruningPredicate infrastructure.
  • Sort order exploitation — If either input is sorted by start (from sorting_columns metadata), sweep line skips the sort step. This should strongly bias the decision toward sweep line.
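
The row group pruning check reduces to an interval-overlap test against the other side's global range. A minimal sketch (illustrative types only, not DataFusion's actual `PruningPredicate` API):

```rust
/// Per-row-group min/max stats on the start/end columns, as read from the
/// Parquet footer (illustrative struct, not the arrow/parquet crate types).
struct RowGroupStats {
    min_start: i64,
    max_end: i64,
}

/// Return the indices of row groups whose [min_start, max_end] envelope can
/// overlap the other join side's global [other_min, other_max] range.
fn prune_row_groups(groups: &[RowGroupStats], other_min: i64, other_max: i64) -> Vec<usize> {
    groups
        .iter()
        .enumerate()
        // Intervals [a, b] and [c, d] intersect iff a <= d && c <= b.
        .filter(|(_, g)| g.min_start <= other_max && other_min <= g.max_end)
        .map(|(i, _)| i)
        .collect()
}
```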

Motivation

The current INTERSECTS join implementation does not adapt to the statistical profile of the input data. Sweep line is the robust default but leaves performance on the table when widths are uniform (e.g., fixed-length sequencing reads, uniform time windows). Binning can be significantly faster in those cases due to cache-friendliness and natural parallelism, but degrades catastrophically on heavy-tailed or multimodal width distributions. A cost-based optimizer eliminates the need for users to know which algorithm suits their data — the system reads cheap metadata and makes the right call automatically.

Expected outcome

  • INTERSECTS joins automatically select the optimal algorithm per partition based on the data's interval width distribution
  • The optimizer reads Parquet metadata and samples at most a few row groups — negligible cost relative to the join itself
  • Uniform-width workloads (e.g., fixed-length reads) benefit from binning's cache-friendliness and parallelism
  • Heavy-tailed or mixed-width workloads fall back to sweep line without replication blowup
  • p99/median and CV thresholds are empirically validated against benchmark datasets

Metadata


Labels

enhancement (New feature or request)
