-
Notifications
You must be signed in to change notification settings - Fork 2
Support optional full table schema in Table to enable explicit column projection for binned joins #80
Description
Background
When transpile() rewrites a column-to-column INTERSECTS join as a binned equi-join, it currently uses a bridge CTE pattern to avoid leaking the internal __giql_bin column into SELECT * output:
WITH __giql_peaks_bins AS (
SELECT "chrom", "start", "end", UNNEST(range(...)) AS __giql_bin FROM peaks
),
__giql_genes_bins AS (...)
SELECT DISTINCT a.*, b.*
FROM peaks a
JOIN __giql_peaks_bins __giql_c0 ON a."chrom" = __giql_c0."chrom" AND ...
JOIN __giql_genes_bins __giql_c1 ON __giql_c0."chrom" = __giql_c1."chrom" AND __giql_c0.__giql_bin = __giql_c1.__giql_bin
JOIN genes b ON b."chrom" = __giql_c1."chrom" AND ... AND overlapThis adds two extra equi-join stages per INTERSECTS pair compared to the ideal single-join approach. The overhead is bounded and still far faster than a naive cross-join, but it is avoidable.
Root Cause
Standard SQL has no portable SELECT * EXCEPT (col) syntax, so wildcard expansion in SELECT a.* unconditionally includes every column from the relation a aliases — including __giql_bin when a is backed by a full-select CTE.
The clean fix is explicit column projection: if the full column list is known at transpile time, the transformer can expand a.* → a.col1, a.col2, ..., use the original 1-join full-CTE approach, and emit no __giql_bin in the output at all.
Proposed Change
Add an optional columns field to the Table dataclass:
@dataclass
class Table:
name: str
genomic_col: str = "interval"
chrom_col: str = "chrom"
start_col: str = "start"
end_col: str = "end"
strand_col: str | None = "strand"
columns: list[str] | None = None # full ordered column list
...When columns is provided for all tables involved in a binned INTERSECTS join, the transformer:
- Reverts to the original 1-join full-CTE pattern (
SELECT *, UNNEST(...) AS __giql_bin FROM table) - Expands any
alias.*expression in the SELECT list to explicitalias.col1, alias.col2, ...— omitting__giql_bin
When columns is absent (the current default), the bridge pattern is used as a safe fallback.
Behaviour Matrix
columns provided |
SELECT has wildcards | Strategy | Extra joins |
|---|---|---|---|
| Yes | Yes | Full-CTE + explicit projection | 0 |
| Yes | No | Full-CTE (no wildcards to expand) | 0 |
| No | No | Full-CTE (no wildcards to expand) | 0 |
| No | Yes | Bridge CTEs (current fallback) | +2 per INTERSECTS |
Acceptance Criteria
Tableaccepts an optionalcolumns: list[str]parameter- When
columnsis provided for all tables in anINTERSECTSjoin, the 1-join full-CTE approach is used and no__giql_binappears in the output - When
columnsis absent, behaviour is identical to the current bridge approach - Existing tests continue to pass without modification
- New tests cover: explicit projection with schema, correct column order, schema absent falls back to bridge