Skip to content

Support optional full table schema in Table to enable explicit column projection for binned joins #80

@conradbzura

Description

@conradbzura

Background

When transpile() rewrites a column-to-column INTERSECTS join as a binned equi-join, it currently uses a bridge CTE pattern to avoid leaking the internal __giql_bin column into SELECT * output:

WITH __giql_peaks_bins AS (
  SELECT "chrom", "start", "end", UNNEST(range(...)) AS __giql_bin FROM peaks
),
__giql_genes_bins AS (...)
SELECT DISTINCT a.*, b.*
FROM peaks a
JOIN __giql_peaks_bins __giql_c0 ON a."chrom" = __giql_c0."chrom" AND ...
JOIN __giql_genes_bins __giql_c1 ON __giql_c0."chrom" = __giql_c1."chrom" AND __giql_c0.__giql_bin = __giql_c1.__giql_bin
JOIN genes b ON b."chrom" = __giql_c1."chrom" AND ... AND overlap

This adds two extra equi-join stages per INTERSECTS pair compared to the ideal single-join approach. The overhead is bounded and still far faster than a naive cross-join, but it is avoidable.

Root Cause

Standard SQL has no portable SELECT * EXCEPT (col) syntax, so wildcard expansion in SELECT a.* unconditionally includes every column from the relation a aliases — including __giql_bin when a is backed by a full-select CTE.

The clean fix is explicit column projection: if the full column list is known at transpile time, the transformer can expand a.*a.col1, a.col2, ..., use the original 1-join full-CTE approach, and emit no __giql_bin in the output at all.

Proposed Change

Add an optional columns field to the Table dataclass:

@dataclass
class Table:
    name: str
    genomic_col: str = "interval"
    chrom_col: str = "chrom"
    start_col: str = "start"
    end_col: str = "end"
    strand_col: str | None = "strand"
    columns: list[str] | None = None  # full ordered column list
    ...

When columns is provided for all tables involved in a binned INTERSECTS join, the transformer:

  1. Reverts to the original 1-join full-CTE pattern (SELECT *, UNNEST(...) AS __giql_bin FROM table)
  2. Expands any alias.* expression in the SELECT list to explicit alias.col1, alias.col2, ... — omitting __giql_bin

When columns is absent (the current default), the bridge pattern is used as a safe fallback.

Behaviour Matrix

columns provided SELECT has wildcards Strategy Extra joins
Yes Yes Full-CTE + explicit projection 0
Yes No Full-CTE (no wildcards to expand) 0
No No Full-CTE (no wildcards to expand) 0
No Yes Bridge CTEs (current fallback) +2 per INTERSECTS

Acceptance Criteria

  • Table accepts an optional columns: list[str] parameter
  • When columns is provided for all tables in an INTERSECTS join, the 1-join full-CTE approach is used and no __giql_bin appears in the output
  • When columns is absent, behaviour is identical to the current bridge approach
  • Existing tests continue to pass without modification
  • New tests cover: explicit projection with schema, correct column order, schema absent falls back to bridge

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions