Skip to content

Add binned summary statistic aggregation for genomic intervals — Closes #61#62

Open
conradbzura wants to merge 17 commits intomainfrom
61-binned-coverage-operator
Open

Add binned summary statistic aggregation for genomic intervals — Closes #61#62
conradbzura wants to merge 17 commits intomainfrom
61-binned-coverage-operator

Conversation

@conradbzura
Copy link
Copy Markdown
Collaborator

@conradbzura conradbzura commented Mar 10, 2026

Summary

Add the COVERAGE operator for computing binned genome coverage from interval data. The operator tiles the genome into fixed-width bins using generate_series, LEFT JOINs against the source table, and aggregates overlapping intervals per bin. It supports configurable statistics (count, mean, sum, min, max), an optional target column for aggregating a named field instead of interval length, and both := and => named parameter syntax. The generated SQL is compatible with both DuckDB and PostgreSQL.

Closes #61

Proposed changes

Expression node (GIQLCoverage)

Add a new GIQLCoverage expression node registered as a SQLGlot Func. The from_arg_list classmethod handles positional and named parameters via PropertyEQ (:=) and Kwarg (=>). Parameters: this (interval column), resolution (bin width), stat (aggregation function), and target (column to aggregate).

Parser registration

Register COVERAGE as a function in GIQLDialect so parse_one(..., dialect=GIQLDialect) produces a GIQLCoverage AST node. The = operator is explicitly excluded from named parameter syntax to avoid ambiguity with SQL equality.

Transformer (CoverageTransformer)

Rewrite queries containing COVERAGE(...) into a CTE-based plan:

  1. A __giql_chroms subquery computes per-chromosome MAX(end) from the source table
  2. A __giql_bins CTE uses CROSS JOIN LATERAL generate_series(0, max_end, resolution) to tile each chromosome
  3. A LEFT JOIN matches source intervals to bins, with any original WHERE conditions moved into the ON clause to preserve zero-coverage bins
  4. GROUP BY and the selected aggregate (COUNT, AVG, SUM, MIN, MAX) produce one row per bin
  5. When no explicit AS alias is provided, the aggregate defaults to value

The LATERAL syntax omits the Subquery wrapper to emit CROSS JOIN LATERAL generate_series(...) directly, which is compatible with both DuckDB and PostgreSQL.

Documentation

Add COVERAGE to the aggregation operators reference (docs/dialect/aggregation-operators.rst) with full syntax, parameter descriptions, return value documentation, and examples. Add a recipes page (docs/recipes/coverage.rst) with patterns for basic coverage, statistics, filtered coverage, strand-specific coverage, target column aggregation, and RPM normalisation. Include a sample output table illustrating the returned data structure and clarify that score is a source table column in the WHERE filter example.

Test cases

# Test Suite Given When Then Coverage Target
1 TestGIQLCoverage A COVERAGE expression with positional interval and resolution Parsed with GIQLDialect It should produce a GIQLCoverage node with resolution set and stat/target both None from_arg_list positional mapping
2 TestGIQLCoverage A COVERAGE expression with := named stat parameter Parsed with GIQLDialect It should produce a GIQLCoverage node with stat set to the given value from_arg_list named param via PropertyEQ
3 TestGIQLCoverage A COVERAGE expression with => named stat parameter Parsed with GIQLDialect It should produce a GIQLCoverage node with stat set to the given value from_arg_list named param via Kwarg
4 TestGIQLCoverage A COVERAGE expression with named resolution parameter Parsed with GIQLDialect It should produce a GIQLCoverage node with resolution set via named param from_arg_list named resolution
5 TestGIQLCoverage A COVERAGE expression with := named target parameter Parsed with GIQLDialect It should produce a GIQLCoverage node with target set from_arg_list target via PropertyEQ
6 TestGIQLCoverage A COVERAGE expression with => named target parameter Parsed with GIQLDialect It should produce a GIQLCoverage node with target set from_arg_list target via Kwarg
7 TestGIQLCoverage A COVERAGE expression with stat, target, and resolution all named Parsed with GIQLDialect It should produce a GIQLCoverage node with all three params set from_arg_list multiple named params
8 TestGIQLCoverage Any valid resolution (1-10M), stat (sampled from valid values), and syntax (:= or =>) Parsed with GIQLDialect It should produce a GIQLCoverage node with correct resolution and stat from_arg_list property: stat + resolution combinations
9 TestGIQLCoverage Any valid resolution (1-10M) with no stat or target Parsed with GIQLDialect It should produce a GIQLCoverage node with resolution set and stat/target None from_arg_list property: positional-only
10 TestGIQLCoverage Either := or => syntax for target parameter Parsed with GIQLDialect It should produce a GIQLCoverage node with target set from_arg_list property: target syntax
11 TestCoverageTransformer A Tables container with registered tables CoverageTransformer is instantiated It should store the tables reference __init__
12 TestCoverageTransformer A basic COVERAGE query with count (default stat) Transpiled It should produce SQL with __giql_bins CTE, GENERATE_SERIES, LEFT JOIN, GROUP BY, COUNT, and ORDER BY transform basic path
13 TestCoverageTransformer A COVERAGE query with no COVERAGE expression Transpiled It should return the query unchanged transform early return
14 TestCoverageTransformer A COVERAGE query with stat := 'mean' Transpiled It should use AVG aggregate, not COUNT transform stat=mean
15 TestCoverageTransformer A COVERAGE query with stat := 'sum' Transpiled It should use SUM aggregate transform stat=sum
16 TestCoverageTransformer A COVERAGE query with stat := 'min' Transpiled It should use MIN aggregate transform stat=min
17 TestCoverageTransformer A COVERAGE query with stat := 'max' Transpiled It should use MAX aggregate transform stat=max
18 TestCoverageTransformer A COVERAGE query with stat := 'mean' and target := 'score' Transpiled It should use AVG on the score column transform target with non-count stat
19 TestCoverageTransformer A COVERAGE query with target := 'score' (default count) Transpiled It should use COUNT on the score column, not COUNT(*) transform target with count stat
20 TestCoverageTransformer A COVERAGE query without an explicit AS alias Transpiled It should alias the aggregate as "value" transform default alias
21 TestCoverageTransformer A COVERAGE query with explicit AS alias Transpiled It should use the explicit alias, not "value" transform explicit alias preserved
22 TestCoverageTransformer A COVERAGE query with a WHERE clause Transpiled It should move the WHERE condition into the LEFT JOIN ON clause, not the outer WHERE transform WHERE to ON migration
23 TestCoverageTransformer A COVERAGE query with a WHERE clause Transpiled It should qualify unqualified column references in the JOIN ON with the source table transform WHERE column qualification (ON)
24 TestCoverageTransformer A COVERAGE query with a WHERE clause Transpiled It should also apply the WHERE to the chroms subquery with table-qualified columns transform WHERE in chroms subquery
25 TestCoverageTransformer A COVERAGE query with custom column mappings (chromosome, start_pos, end_pos) Transpiled It should use the mapped column names throughout transform custom columns
26 TestCoverageTransformer A COVERAGE query with additional columns alongside COVERAGE Transpiled It should include the extra columns in the output transform pass-through columns
27 TestCoverageTransformer A COVERAGE query with a table alias (FROM features f) Transpiled It should use the alias as the source reference in JOIN transform table alias
28 TestCoverageTransformer A COVERAGE query with resolution=500 Transpiled It should use 500 as the step in generate_series and bin width transform resolution propagation
29 TestCoverageTransformer A COVERAGE expression inside a WITH clause Transpiled It should correctly transform the CTE containing COVERAGE transform CTE recursion
30 TestCoverageTransformer A COVERAGE query with an invalid stat value Transpiled It should raise ValueError matching "Unknown COVERAGE stat" transform invalid stat
31 TestCoverageTransformer A query with two COVERAGE expressions Transpiled It should raise ValueError matching "Multiple COVERAGE" transform multiple COVERAGE
32 TestCoverageTransformer A DuckDB table with two intervals in the same 1000bp bin COVERAGE count is transpiled and executed It should return count=2 for that bin End-to-end count correctness
33 TestCoverageTransformer A DuckDB table with intervals covering only some bins COVERAGE count is transpiled and executed Bins beyond intervals should appear with count=0 Zero-coverage bins via LEFT JOIN
34 TestCoverageTransformer A DuckDB table with intervals and a WHERE filter excluding some COVERAGE count with WHERE is transpiled and executed Bins should not be dropped by WHERE WHERE to ON preserves zero bins
35 TestCoverageTransformer A DuckDB table with a score column and two intervals in one bin COVERAGE with stat='mean' and target='score' is transpiled and executed It should return the average of the score values Target + mean correctness
36 TestCoverageTransformer A DuckDB table with intervals of different lengths in one bin COVERAGE with stat='min' is transpiled and executed It should return the minimum interval length Min stat correctness
37 TestCoverageTransformer Any valid stat (count/mean/sum/min/max) and resolution (1-10M) Transpiled via transpile() The output SQL should contain the corresponding SQL aggregate function name and the resolution value transform property: stat maps to SQL aggregate
38 TestCoverageTransformer Any valid stat (count/mean/sum/min/max) and resolution (1-10M) Transpiled via transpile() The output SQL should always contain __GIQL_BINS, GENERATE_SERIES, LEFT JOIN, GROUP BY, and ORDER BY transform property: structural invariants

Implementation plan

    • Add GIQLCoverage expression node with from_arg_list and register in parser
    • Add CoverageTransformer with CTE-based generate_series + LEFT JOIN + GROUP BY rewrite
    • Add COVERAGE operator reference documentation and recipes
    • Add initial parsing and transpilation tests
    • Support => (standard SQL Kwarg) named parameter syntax alongside :=
    • Stop treating = as named parameter syntax to avoid ambiguity
    • Remove dead code and fix LATERAL syntax for DuckDB compatibility
    • Add target parameter for aggregating a named column and default value alias
    • Move WHERE clause into LEFT JOIN ON condition to preserve zero-coverage bins
    • Rewrite tests to spec with full API coverage (36 tests: 10 parsing, 26 transpilation/functional)
    • Clarify score column reference in docs and add sample output table to recipes
    • Add property-based tests for COVERAGE transpilation (38 tests total: 10 parsing + 3 parsing PBT, 21 transpilation + 2 transpilation PBT, 5 functional)

@conradbzura
Copy link
Copy Markdown
Collaborator Author

test_coverage.py has a number of tests that can be consolidated into a few property-based tests. We also need functional tests confirming query behavior — can use DuckDB for this.

@conradbzura conradbzura self-assigned this Mar 12, 2026
- ``'min'`` — minimum interval length of overlapping intervals
- ``'max'`` — maximum interval length of overlapping intervals

When ``target`` is specified, the stat is applied to that column instead of interval length.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvictus is interval length a sane default for target metric?

Define a new GIQLCoverage(exp.Func) AST node with this, resolution,
and stat arg_types. The from_arg_list classmethod handles both
positional and named parameters (EQ and PropertyEQ for := syntax).
Register COVERAGE in GIQLDialect.Parser.FUNCTIONS so the parser
recognises it.
CoverageTransformer rewrites SELECT COVERAGE(interval, N) queries
into a CTE-based plan: a __giql_bins CTE built from generate_series
via LATERAL, LEFT JOINed to the source table on overlap, with
GROUP BY and the appropriate aggregate (COUNT, AVG, SUM, MIN, MAX).
Wire the transformer into the transpile() pipeline before MERGE and
CLUSTER.
TestCoverageParsing (3 tests) verifies positional args, named stat
via :=, and named resolution. TestCoverageTranspile (11 tests) covers
basic transpilation, stat variants (mean/sum/max), custom column
mappings, WHERE preservation, additional SELECT columns, table alias
handling, resolution in generate_series, overlap join conditions, and
ORDER BY output.
Add a COVERAGE section to aggregation-operators.rst with description,
syntax, parameters, return value, examples, and related operators.
Create docs/recipes/coverage.rst with strand-specific coverage,
coverage statistics, filtered coverage, 5-prime end counting, and
RPM normalisation recipes. Add coverage to the recipe index.
Add exp.Kwarg handling alongside exp.PropertyEQ in from_arg_list so
that COVERAGE(interval, 1000, stat => 'mean') works identically to
the := form. Update the reference docs to show both syntaxes and add
a parsing test for the => form.
The = operator inside a function call is an equality comparison in
standard SQL, not parameter assignment. Only := (PropertyEQ) and
=> (Kwarg) are valid named parameter syntaxes. This makes COVERAGE
consistent with SQL semantics and allows = to be used as a boolean
expression argument.
Remove unused generate_series_sql variable and unwrap the redundant
exp.Subquery wrapper inside exp.Lateral. The old form emitted
CROSS JOIN LATERAL (GENERATE_SERIES(...)) which DuckDB rejects due
to the extra parentheses. The new form emits
CROSS JOIN LATERAL GENERATE_SERIES(...) which works on both DuckDB
and PostgreSQL.
Add optional target parameter to GIQLCoverage that specifies which
column to aggregate instead of defaulting to interval length
(end - start). When target is set, COUNT uses COUNT(target_col)
instead of COUNT(*), and other stats (mean, sum, min, max) aggregate
the named column.

Bare COVERAGE expressions without an explicit AS alias now default
to AS value.
The original query's WHERE was applied to the outer query, which
filtered out zero-coverage bins because source columns are NULL
for non-matching LEFT JOIN rows (NULL > threshold evaluates to
FALSE). Moving the WHERE into the JOIN's ON clause preserves all
bins while still filtering which source rows participate.

Also qualify unqualified column references with the source table
in both the JOIN ON condition and the chroms subquery WHERE to
avoid ambiguous column errors.
Replace the ad-hoc test classes with two spec-aligned classes:

- TestGIQLCoverage (10 tests): example-based parsing for positional
  args, :=/=> named params, target parameter, and all-named-params;
  property-based tests for stat+resolution combos, positional-only,
  and target syntax variants.

- TestCoverageTransformer (26 tests): instantiation, basic
  transpilation, all five stats, target with count/non-count,
  default and explicit aliases, WHERE-to-ON migration with column
  qualification, custom column mapping, table alias, resolution
  propagation, CTE nesting, error paths (invalid stat, multiple
  COVERAGE), and five DuckDB end-to-end functional tests.

Update docs to document the target parameter, default value alias,
and add a recipe for aggregating a specific column.
Cover bedtools_wrapper, comparison, data_models, and duckdb_loader
utility modules used by the integration test suite.
Cover dialect parser, expression nodes, BaseGIQLGenerator, table
metadata, ClusterTransformer, MergeTransformer, CoverageTransformer,
and the public transpile() API.
Compare GIQL INTERSECTS, MERGE, and NEAREST output against bedtools
results across edge cases, strand handling, scale, and multi-step
workflow pipelines.
The WHERE example in the COVERAGE reference now notes that score is a
column on the source table.  The coverage recipes page gains a sample
output table after the first example so readers can see the returned
data structure at a glance.
Two new Hypothesis PBTs verify that transpiled SQL contains the correct
aggregate function for every stat and that all structural elements
(__giql_bins, generate_series, LEFT JOIN, GROUP BY, ORDER BY) are
present across the full stat x resolution input space.
…ervation

Update all unit tests to use := syntax for named parameters instead of
= which is no longer treated as named parameter syntax after the fix
merged from main.

Fix MergeTransformer to preserve existing CTEs from the original query
so that WITH...SELECT MERGE(interval) FROM cte_name works correctly.

Relax bedtools closest distance assertions to tolerate version
differences in gap distance reporting (0-based vs 1-based).
MERGE outputs BED3 (chrom, start, end) while the bedtools intersect
wrapper pads to BED6. Trim bedtools results to coordinates before
comparing so the column count matches.
@conradbzura conradbzura force-pushed the 61-binned-coverage-operator branch from f7ad665 to 039baae Compare March 26, 2026 13:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add binned summary statistic aggregation for genomic intervals

1 participant