Implement cost-based optimizer for INTERSECTS join algorithm selection #67
Description
Summary
Add a DataFusion optimizer rule that uses Parquet metadata and lightweight sampling to choose between sweep line and binning algorithms for INTERSECTS joins, making the decision independently per partition. Sweep line is output-sensitive and distribution-agnostic — O(n log n + k) regardless of interval width variance. Binning is faster when widths are uniform and known, but suffers replication blowup on wide intervals and false-positive overhead on narrow ones. The optimizer should default to sweep line and select binning only when the partition's data profile confirms it will be cheaper.
Decision inputs
The optimizer gathers statistics per partition at two tiers, both before reading the full dataset:
- Parquet metadata (near-free, from file footer)
  - Row group column stats on `start`/`end`: min, max, null count, row count
  - Derived width upper bounds: `max(end) - min(start)` per row group
  - Domain span: global `max(start) - min(start)`
  - `sorting_columns`: whether a sort step can be skipped for sweep line
  - Page index (if present): finer-grained range estimates, crude width histogram
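The derived quantities are trivial to compute once the footer stats are in hand. A minimal Rust sketch — the struct fields are illustrative stand-ins; a real implementation would read them from the `parquet` crate's row-group metadata:

```rust
/// Per-row-group column statistics, as read from the Parquet footer.
/// (Field names are illustrative, not the `parquet` crate's API.)
struct RowGroupStats {
    min_start: i64,
    max_start: i64,
    max_end: i64,
}

/// Width upper bound for one row group: no interval in the group
/// can be wider than max(end) - min(start).
fn width_upper_bound(rg: &RowGroupStats) -> i64 {
    rg.max_end - rg.min_start
}

/// Global domain span across all row groups: max(start) - min(start).
fn domain_span(groups: &[RowGroupStats]) -> i64 {
    let min_start = groups.iter().map(|g| g.min_start).min().unwrap_or(0);
    let max_start = groups.iter().map(|g| g.max_start).max().unwrap_or(0);
    max_start - min_start
}
```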
- Lightweight sampling (milliseconds)
  - Read only the `start` and `end` columns from 1–3 representative row groups (first, middle, last)
  - Compute width distribution statistics: median, p95, p99, coefficient of variation (CV)
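The sampled statistics might be computed along these lines — a sketch over plain slices, with column reading and row-group selection omitted; the nearest-rank quantile method is an assumption, not a requirement:

```rust
/// Width distribution statistics from sampled (start, end) pairs.
struct WidthProfile {
    median: f64,
    p95: f64,
    p99: f64,
    /// Coefficient of variation: stddev / mean.
    cv: f64,
}

fn profile_widths(starts: &[f64], ends: &[f64]) -> WidthProfile {
    let mut widths: Vec<f64> = starts.iter().zip(ends).map(|(s, e)| e - s).collect();
    widths.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank quantile over the sorted widths.
    let q = |p: f64| widths[((widths.len() as f64 - 1.0) * p).round() as usize];
    let mean = widths.iter().sum::<f64>() / widths.len() as f64;
    let var = widths.iter().map(|w| (w - mean).powi(2)).sum::<f64>() / widths.len() as f64;
    WidthProfile { median: q(0.50), p95: q(0.95), p99: q(0.99), cv: var.sqrt() / mean }
}
```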
Per-partition decision function
When DataFusion repartitions by coordinate range, different partitions may have different width profiles (e.g., one partition covers a region of uniform short reads while another spans a region with large structural variants). The algorithm choice is made independently per partition using two key signals — the p99/median width ratio and the coefficient of variation:
- p99/median > 10 → heavy-tailed distribution, binning will replicate wide intervals badly → sweep line
- CV > 1.5 → high width variance, no single bin width works → sweep line
- Otherwise, estimate binning cost (`n * avg_replication * hash_cost`) vs sweep line cost (`n * log(n) * compare_cost`) and pick the cheaper one, sizing bins around p95
The initial p99/median and CV thresholds (10.0 and 1.5) are starting heuristics. As part of this work, benchmark against synthetic and real-world interval datasets with varying width distributions to empirically tune these cutoffs. The goal is to identify tighter thresholds that short-circuit to sweep line earlier when the cost function would always agree, and conversely, short-circuit to binning when the distribution is tight enough that the cost comparison is a foregone conclusion. This reduces the optimizer's overhead for the common cases while reserving the full cost model for genuinely ambiguous profiles.
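Putting the thresholds and cost model together, the per-partition decision might look like the following sketch. The cost constants and the replication estimate are placeholders for the benchmarking work described above, not measured values:

```rust
#[derive(Debug, PartialEq)]
enum JoinAlgorithm {
    SweepLine,
    Binning,
}

// Hypothetical per-element cost constants, to be tuned empirically.
const HASH_COST: f64 = 1.0;
const COMPARE_COST: f64 = 1.0;

/// Decide the INTERSECTS join algorithm for one partition.
/// `n` is the partition row count; `median`, `p95`, `p99`, `cv`
/// come from the sampled width distribution.
fn choose_algorithm(n: u64, median: f64, p95: f64, p99: f64, cv: f64) -> JoinAlgorithm {
    // Heavy tail: binning would replicate wide intervals badly.
    if p99 / median > 10.0 {
        return JoinAlgorithm::SweepLine;
    }
    // High variance: no single bin width works well.
    if cv > 1.5 {
        return JoinAlgorithm::SweepLine;
    }
    // Cost comparison, sizing bins around p95. Each interval lands in
    // roughly ceil(width / bin_width) bins; estimate from the median width.
    let bin_width = p95.max(1.0);
    let avg_replication = (median / bin_width).ceil().max(1.0);
    let binning_cost = n as f64 * avg_replication * HASH_COST;
    let sweep_cost = n as f64 * (n as f64).log2() * COMPARE_COST;
    if binning_cost < sweep_cost {
        JoinAlgorithm::Binning
    } else {
        JoinAlgorithm::SweepLine
    }
}
```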
DataFusion integration
- Row group pruning — Before algorithm selection, use row group min/max stats to prune row groups that cannot overlap the other side of the join. Extend DataFusion's `PruningPredicate` infrastructure.
- Sort order exploitation — If either input is sorted by `start` (from `sorting_columns` metadata), sweep line skips the sort step. This should strongly bias the decision toward sweep line.
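The pruning condition itself reduces to a range-overlap test. A minimal sketch — a real implementation would express this through DataFusion's `PruningPredicate` rather than a hand-rolled check:

```rust
/// A row group can be skipped if its [min_start, max_end] envelope
/// cannot intersect the other join side's overall [min, max] range.
fn can_prune(rg_min_start: i64, rg_max_end: i64, other_min: i64, other_max: i64) -> bool {
    // Two closed ranges fail to overlap iff one ends before the other begins.
    rg_max_end < other_min || rg_min_start > other_max
}
```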
Motivation
The current INTERSECTS join implementation does not adapt to the statistical profile of the input data. Sweep line is the robust default but leaves performance on the table when widths are uniform (e.g., fixed-length sequencing reads, uniform time windows). Binning can be significantly faster in those cases due to cache-friendliness and natural parallelism, but degrades catastrophically on heavy-tailed or multimodal width distributions. A cost-based optimizer eliminates the need for users to know which algorithm suits their data — the system reads cheap metadata and makes the right call automatically.
Expected outcome
- INTERSECTS joins automatically select the optimal algorithm per partition based on the data's interval width distribution
- The optimizer reads Parquet metadata and samples at most a few row groups — negligible cost relative to the join itself
- Uniform-width workloads (e.g., fixed-length reads) benefit from binning's cache-friendliness and parallelism
- Heavy-tailed or mixed-width workloads fall back to sweep line without replication blowup
- p99/median and CV thresholds are empirically validated against benchmark datasets