Implement cost-based optimizer for INTERSECTS join algorithm selection #67
Description
Summary
Add a DataFusion optimizer rule that uses Parquet metadata and lightweight sampling to choose between sweep line and binning algorithms for INTERSECTS joins, making the decision independently per partition. Sweep line is output-sensitive and distribution-agnostic — O(n log n + k) regardless of interval width variance. Binning is faster when widths are uniform and known, but suffers replication blowup on wide intervals and false-positive overhead on narrow ones. The optimizer should default to sweep line and select binning only when the partition's data profile confirms it will be cheaper.
Decision inputs
The optimizer gathers statistics per partition at two tiers, both before reading the full dataset:
- Parquet metadata (near-free, from file footer)
  - Row group column stats on `start`/`end`: min, max, null count, row count
  - Derived width upper bounds: `max(end) - min(start)` per row group
  - Domain span: global `max(start) - min(start)`
  - `sorting_columns`: whether a sort step can be skipped for sweep line
  - Page index (if present): finer-grained range estimates, crude width histogram
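The derived quantities are trivial to compute once the footer stats are in hand. A minimal Rust sketch — the struct fields are illustrative stand-ins; a real implementation would read them from the `parquet` crate's row-group metadata:

```rust
/// Per-row-group column statistics, as read from the Parquet footer.
/// (Field names are illustrative, not the `parquet` crate's API.)
struct RowGroupStats {
    min_start: i64,
    max_start: i64,
    max_end: i64,
}

/// Width upper bound for one row group: no interval in the group
/// can be wider than max(end) - min(start).
fn width_upper_bound(rg: &RowGroupStats) -> i64 {
    rg.max_end - rg.min_start
}

/// Global domain span across all row groups: max(start) - min(start).
fn domain_span(groups: &[RowGroupStats]) -> i64 {
    let min_start = groups.iter().map(|g| g.min_start).min().unwrap_or(0);
    let max_start = groups.iter().map(|g| g.max_start).max().unwrap_or(0);
    max_start - min_start
}
```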
- Lightweight sampling (milliseconds)
  - Read only the `start` and `end` columns from 1–3 representative row groups (first, middle, last)
  - Compute width distribution statistics: median, p95, p99, coefficient of variation (CV)
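The sampled statistics might be computed along these lines — a sketch over plain slices, with column reading and row-group selection omitted; the nearest-rank quantile method is an assumption, not a requirement:

```rust
/// Width distribution statistics from sampled (start, end) pairs.
struct WidthProfile {
    median: f64,
    p95: f64,
    p99: f64,
    /// Coefficient of variation: stddev / mean.
    cv: f64,
}

fn profile_widths(starts: &[f64], ends: &[f64]) -> WidthProfile {
    let mut widths: Vec<f64> = starts.iter().zip(ends).map(|(s, e)| e - s).collect();
    widths.sort_by(|a, b| a.partial_cmp(b).unwrap());
    // Nearest-rank quantile over the sorted widths.
    let q = |p: f64| widths[((widths.len() as f64 - 1.0) * p).round() as usize];
    let mean = widths.iter().sum::<f64>() / widths.len() as f64;
    let var = widths.iter().map(|w| (w - mean).powi(2)).sum::<f64>() / widths.len() as f64;
    WidthProfile { median: q(0.50), p95: q(0.95), p99: q(0.99), cv: var.sqrt() / mean }
}
```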
Per-partition decision function
When DataFusion repartitions by coordinate range, different partitions may have different width profiles (e.g., one partition covers a region of uniform short reads while another spans a region with large structural variants). The algorithm choice is made independently per partition using two key signals — the p99/median width ratio and the coefficient of variation:
- p99/median > 10 → heavy-tailed distribution, binning will replicate wide intervals badly → sweep line
- CV > 1.5 → high width variance, no single bin width works → sweep line
- Otherwise, estimate binning cost (`n * avg_replication * hash_cost`) vs sweep line cost (`n * log(n) * compare_cost`) and pick the cheaper one, sizing bins around p95
The initial p99/median and CV thresholds (10.0 and 1.5) are starting heuristics. As part of this work, benchmark against synthetic and real-world interval datasets with varying width distributions to empirically tune these cutoffs. The goal is to identify tighter thresholds that short-circuit to sweep line earlier when the cost function would always agree, and conversely, short-circuit to binning when the distribution is tight enough that the cost comparison is a foregone conclusion. This reduces the optimizer's overhead for the common cases while reserving the full cost model for genuinely ambiguous profiles.
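Putting the thresholds and cost model together, the per-partition decision might look like the following sketch. The cost constants and the replication estimate are placeholders for the benchmarking work described above, not measured values:

```rust
#[derive(Debug, PartialEq)]
enum JoinAlgorithm {
    SweepLine,
    Binning,
}

// Hypothetical per-element cost constants, to be tuned empirically.
const HASH_COST: f64 = 1.0;
const COMPARE_COST: f64 = 1.0;

/// Decide the INTERSECTS join algorithm for one partition.
/// `n` is the partition row count; `median`, `p95`, `p99`, `cv`
/// come from the sampled width distribution.
fn choose_algorithm(n: u64, median: f64, p95: f64, p99: f64, cv: f64) -> JoinAlgorithm {
    // Heavy tail: binning would replicate wide intervals badly.
    if p99 / median > 10.0 {
        return JoinAlgorithm::SweepLine;
    }
    // High variance: no single bin width works well.
    if cv > 1.5 {
        return JoinAlgorithm::SweepLine;
    }
    // Cost comparison, sizing bins around p95. Each interval lands in
    // roughly ceil(width / bin_width) bins; estimate from the median width.
    let bin_width = p95.max(1.0);
    let avg_replication = (median / bin_width).ceil().max(1.0);
    let binning_cost = n as f64 * avg_replication * HASH_COST;
    let sweep_cost = n as f64 * (n as f64).log2() * COMPARE_COST;
    if binning_cost < sweep_cost {
        JoinAlgorithm::Binning
    } else {
        JoinAlgorithm::SweepLine
    }
}
```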
DataFusion integration
- Row group pruning — Before algorithm selection, use row group min/max stats to prune row groups that cannot overlap the other side of the join. Extend DataFusion's `PruningPredicate` infrastructure.
- Sort order exploitation — If either input is sorted by `start` (from `sorting_columns` metadata), sweep line skips the sort step. This should strongly bias the decision toward sweep line.
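The pruning condition itself reduces to a range-overlap test. A minimal sketch — a real implementation would express this through DataFusion's `PruningPredicate` rather than a hand-rolled check:

```rust
/// A row group can be skipped if its [min_start, max_end] envelope
/// cannot intersect the other join side's overall [min, max] range.
fn can_prune(rg_min_start: i64, rg_max_end: i64, other_min: i64, other_max: i64) -> bool {
    // Two closed ranges fail to overlap iff one ends before the other begins.
    rg_max_end < other_min || rg_min_start > other_max
}
```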
Motivation
The current INTERSECTS join implementation does not adapt to the statistical profile of the input data. Sweep line is the robust default but leaves performance on the table when widths are uniform (e.g., fixed-length sequencing reads, uniform time windows). Binning can be significantly faster in those cases due to cache-friendliness and natural parallelism, but degrades catastrophically on heavy-tailed or multimodal width distributions. A cost-based optimizer eliminates the need for users to know which algorithm suits their data — the system reads cheap metadata and makes the right call automatically.
Expected outcome
- INTERSECTS joins automatically select the optimal algorithm per partition based on the data's interval width distribution
- The optimizer reads Parquet metadata and samples at most a few row groups — negligible cost relative to the join itself
- Uniform-width workloads (e.g., fixed-length reads) benefit from binning's cache-friendliness and parallelism
- Heavy-tailed or mixed-width workloads fall back to sweep line without replication blowup
- p99/median and CV thresholds are empirically validated against benchmark datasets