Add binned summary statistic aggregation for genomic intervals — Closes #61#62
Open
conradbzura wants to merge 17 commits into main
conradbzura commented Mar 11, 2026
test_coverage.py has a number of tests that can be consolidated into a few property-based tests. We also need functional tests confirming query behavior; we can use DuckDB for this.
conradbzura commented Mar 12, 2026
| - ``'min'`` — minimum interval length of overlapping intervals | ||
| - ``'max'`` — maximum interval length of overlapping intervals | ||
| When ``target`` is specified, the stat is applied to that column instead of interval length. |
@nvictus is interval length a sane default for target metric?
Define a new GIQLCoverage(exp.Func) AST node with this, resolution, and stat arg_types. The from_arg_list classmethod handles both positional and named parameters (EQ and PropertyEQ for := syntax). Register COVERAGE in GIQLDialect.Parser.FUNCTIONS so the parser recognises it.
CoverageTransformer rewrites SELECT COVERAGE(interval, N) queries into a CTE-based plan: a __giql_bins CTE built from generate_series via LATERAL, LEFT JOINed to the source table on overlap, with GROUP BY and the appropriate aggregate (COUNT, AVG, SUM, MIN, MAX). Wire the transformer into the transpile() pipeline before MERGE and CLUSTER.
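The shape of that plan can be tried out with a minimal sketch. It runs on SQLite from the Python standard library, so a recursive CTE stands in for generate_series via LATERAL. The features table, the sample rows, and the half-open overlap test are assumptions made for illustration, not the SQL the transformer emits. The sketch counts a source column so that bins with no match report 0.

```python
import sqlite3

RESOLUTION = 1000  # bin width, i.e. the N in COVERAGE(interval, N)

conn = sqlite3.connect(":memory:")
# Hypothetical source table with half-open [start, end) intervals.
conn.execute('CREATE TABLE features (chrom TEXT, "start" INTEGER, "end" INTEGER)')
conn.executemany(
    "INSERT INTO features VALUES (?, ?, ?)",
    [("chr1", 100, 300), ("chr1", 900, 1100), ("chr1", 3500, 3600)],
)

query = """
WITH RECURSIVE
__giql_chroms AS (
    -- Per-chromosome extent of the data.
    SELECT chrom, MAX("end") AS max_end FROM features GROUP BY chrom
),
__giql_bins(chrom, "start", max_end) AS (
    -- Recursive stand-in for generate_series(0, max_end, resolution).
    SELECT chrom, 0, max_end FROM __giql_chroms
    UNION ALL
    SELECT chrom, "start" + :res, max_end
    FROM __giql_bins
    WHERE "start" + :res <= max_end
)
SELECT b.chrom, b."start", b."start" + :res AS "end", COUNT(f.chrom) AS value
FROM __giql_bins AS b
LEFT JOIN features AS f
  ON f.chrom = b.chrom
 AND f."start" < b."start" + :res   -- interval begins before the bin ends
 AND f."end" > b."start"            -- interval ends after the bin begins
GROUP BY b.chrom, b."start"
ORDER BY b.chrom, b."start"
"""

rows = conn.execute(query, {"res": RESOLUTION}).fetchall()
for row in rows:
    print(row)
```

The bin starting at 2000 has no overlapping interval, yet it still appears in the output with a value of 0, because the LEFT JOIN keeps every bin.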
TestCoverageParsing (3 tests) verifies positional args, named stat via :=, and named resolution. TestCoverageTranspile (11 tests) covers basic transpilation, stat variants (mean/sum/max), custom column mappings, WHERE preservation, additional SELECT columns, table alias handling, resolution in generate_series, overlap join conditions, and ORDER BY output.
Add a COVERAGE section to aggregation-operators.rst with description, syntax, parameters, return value, examples, and related operators. Create docs/recipes/coverage.rst with strand-specific coverage, coverage statistics, filtered coverage, 5-prime end counting, and RPM normalisation recipes. Add coverage to the recipe index.
Add exp.Kwarg handling alongside exp.PropertyEQ in from_arg_list so that COVERAGE(interval, 1000, stat => 'mean') works identically to the := form. Update the reference docs to show both syntaxes and add a parsing test for the => form.
The = operator inside a function call is an equality comparison in standard SQL, not parameter assignment. Only := (PropertyEQ) and => (Kwarg) are valid named parameter syntaxes. This makes COVERAGE consistent with SQL semantics and allows = to be used as a boolean expression argument.
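The argument-binding rule described in these commits can be sketched without SQLGlot. The dataclasses below are hypothetical stand-ins for exp.PropertyEQ, exp.Kwarg, and exp.EQ, and bind_args and the assumed positional order are illustrative, not the PR's from_arg_list.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for SQLGlot's parsed argument nodes.
@dataclass
class PropertyEQ:  # name := value
    name: str
    value: object


@dataclass
class Kwarg:  # name => value
    name: str
    value: object


@dataclass
class EQ:  # left = right, an ordinary boolean expression
    left: object
    right: object


# Assumed positional order; the PR text lists these parameter names.
POSITIONAL_ORDER = ("this", "resolution", "stat", "target")


def bind_args(args):
    """Map a parsed argument list onto named COVERAGE parameters."""
    bound = {}
    position = 0
    for arg in args:
        if isinstance(arg, (PropertyEQ, Kwarg)):
            # Only := and => are treated as named-parameter assignment.
            bound[arg.name.lower()] = arg.value
        else:
            # Everything else, including an EQ comparison, is positional.
            if position >= len(POSITIONAL_ORDER):
                raise ValueError("too many positional arguments")
            bound[POSITIONAL_ORDER[position]] = arg
            position += 1
    return bound


via_kwarg = bind_args(["interval", 1000, Kwarg("stat", "mean")])
via_property = bind_args(["interval", 1000, PropertyEQ("stat", "mean")])
via_eq = bind_args(["interval", 1000, EQ("stat", "mean")])
```

Here via_kwarg and via_property bind stat to 'mean' identically, while via_eq keeps the whole comparison as a positional expression rather than treating it as an assignment.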
Remove unused generate_series_sql variable and unwrap the redundant exp.Subquery wrapper inside exp.Lateral. The old form emitted CROSS JOIN LATERAL (GENERATE_SERIES(...)) which DuckDB rejects due to the extra parentheses. The new form emits CROSS JOIN LATERAL GENERATE_SERIES(...) which works on both DuckDB and PostgreSQL.
Add optional target parameter to GIQLCoverage that specifies which column to aggregate instead of defaulting to interval length (end - start). When target is set, COUNT uses COUNT(target_col) instead of COUNT(*), and other stats (mean, sum, min, max) aggregate the named column. Bare COVERAGE expressions without an explicit AS alias now default to AS value.
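The stat and target rules in this commit reduce to a small lookup, sketched below. The function name aggregate_sql, the quoting of the start and end columns, and the error message are assumptions; only the stat-to-aggregate mapping, the COUNT(*) versus COUNT(target) rule, the interval-length default, and the value alias come from the commit messages.

```python
STAT_TO_SQL = {
    "count": "COUNT",
    "mean": "AVG",
    "sum": "SUM",
    "min": "MIN",
    "max": "MAX",
}


def aggregate_sql(stat="count", target=None, start_col="start",
                  end_col="end", alias="value"):
    """Build the aggregate select item for one COVERAGE call (illustrative)."""
    try:
        func = STAT_TO_SQL[stat.lower()]
    except KeyError:
        raise ValueError(f"unsupported stat: {stat!r}") from None

    if target is not None:
        operand = target  # aggregate the named column
    elif func == "COUNT":
        operand = "*"  # count without a target uses COUNT(*)
    else:
        operand = f'"{end_col}" - "{start_col}"'  # default: interval length

    return f"{func}({operand}) AS {alias}"


print(aggregate_sql())
print(aggregate_sql("mean"))
print(aggregate_sql("count", target="score"))
print(aggregate_sql("max", target="score", alias="peak"))
```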
The original query's WHERE was applied to the outer query, which filtered out zero-coverage bins because source columns are NULL for non-matching LEFT JOIN rows (NULL > threshold evaluates to NULL, which WHERE treats as not true). Moving the WHERE into the JOIN's ON clause preserves all bins while still filtering which source rows participate. Also qualify unqualified column references with the source table in both the JOIN ON condition and the chroms subquery WHERE to avoid ambiguous column errors.
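The difference between filtering in WHERE and filtering in ON can be reproduced with a small sketch. It uses SQLite from the Python standard library rather than DuckDB, and the bins and features tables, their columns, and the threshold of 10 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bins (bin_start INTEGER);
    INSERT INTO bins VALUES (0), (1000), (2000);
    CREATE TABLE features (pos INTEGER, score REAL);
    INSERT INTO features VALUES (100, 5.0), (1200, 50.0);
    """
)

join = (
    "FROM bins AS b LEFT JOIN features AS f "
    "ON f.pos >= b.bin_start AND f.pos < b.bin_start + 1000"
)
tail = "GROUP BY b.bin_start ORDER BY b.bin_start"

# Filter applied after the join: rows where f.score is NULL are dropped,
# so bins without a qualifying feature vanish from the result.
filter_in_where = conn.execute(
    f"SELECT b.bin_start, COUNT(f.pos) {join} WHERE f.score > 10 {tail}"
).fetchall()

# Filter applied inside the join condition: every bin survives, and only
# features passing the filter are counted.
filter_in_on = conn.execute(
    f"SELECT b.bin_start, COUNT(f.pos) {join} AND f.score > 10 {tail}"
).fetchall()

print(filter_in_where)
print(filter_in_on)
```

With the filter in WHERE only the bin at 1000 is returned; with the filter in ON all three bins are returned and the two without a qualifying feature report a count of 0.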
Replace the ad-hoc test classes with two spec-aligned classes:
- TestGIQLCoverage (10 tests): example-based parsing for positional args, :=/=> named params, target parameter, and all-named-params; property-based tests for stat+resolution combos, positional-only, and target syntax variants.
- TestCoverageTransformer (26 tests): instantiation, basic transpilation, all five stats, target with count/non-count, default and explicit aliases, WHERE-to-ON migration with column qualification, custom column mapping, table alias, resolution propagation, CTE nesting, error paths (invalid stat, multiple COVERAGE), and five DuckDB end-to-end functional tests.
Update docs to document the target parameter, default value alias, and add a recipe for aggregating a specific column.
Cover bedtools_wrapper, comparison, data_models, and duckdb_loader utility modules used by the integration test suite.
Cover dialect parser, expression nodes, BaseGIQLGenerator, table metadata, ClusterTransformer, MergeTransformer, CoverageTransformer, and the public transpile() API.
Compare GIQL INTERSECTS, MERGE, and NEAREST output against bedtools results across edge cases, strand handling, scale, and multi-step workflow pipelines.
The WHERE example in the COVERAGE reference now notes that score is a column on the source table. The coverage recipes page gains a sample output table after the first example so readers can see the returned data structure at a glance.
Two new Hypothesis PBTs verify that transpiled SQL contains the correct aggregate function for every stat and that all structural elements (__giql_bins, generate_series, LEFT JOIN, GROUP BY, ORDER BY) are present across the full stat x resolution input space.
…ervation Update all unit tests to use := syntax for named parameters instead of = which is no longer treated as named parameter syntax after the fix merged from main. Fix MergeTransformer to preserve existing CTEs from the original query so that WITH...SELECT MERGE(interval) FROM cte_name works correctly. Relax bedtools closest distance assertions to tolerate version differences in gap distance reporting (0-based vs 1-based).
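A relaxed distance comparison of the kind described here might look like the sketch below. The helper name distances_agree and the tolerance of 1 are assumptions about how an off-by-one difference between 0-based and 1-based gap reporting could be absorbed; the PR's actual assertion is not shown on this page.

```python
def distances_agree(giql_distance: int, bedtools_distance: int,
                    tolerance: int = 1) -> bool:
    """Return True when two reported gap distances differ by at most `tolerance`."""
    return abs(giql_distance - bedtools_distance) <= tolerance


# A gap reported as 99 by one tool and 100 by the other is accepted.
print(distances_agree(99, 100))
# A larger disagreement is still treated as a failure.
print(distances_agree(99, 105))
```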
MERGE outputs BED3 (chrom, start, end) while the bedtools intersect wrapper pads to BED6. Trim bedtools results to coordinates before comparing so the column count matches.
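The trimming step can be sketched in a few lines. The helper name to_bed3 and the sample rows, including the padding values in the BED6 row, are hypothetical; the point is only that both sides are reduced to (chrom, start, end) before comparison.

```python
def to_bed3(rows):
    """Keep only the first three fields (chrom, start, end) of each row."""
    return [tuple(row[:3]) for row in rows]


# Hypothetical results: MERGE output is already BED3, the wrapper pads to BED6.
giql_rows = [("chr1", 100, 500), ("chr2", 0, 50)]
bedtools_rows = [("chr2", 0, 50, ".", "0", "+"), ("chr1", 100, 500, ".", "0", "+")]

# Sort both sides so row order does not affect the comparison.
matches = sorted(to_bed3(bedtools_rows)) == sorted(to_bed3(giql_rows))
print(matches)
```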
Summary
Add the COVERAGE operator for computing binned genome coverage from interval data. The operator tiles the genome into fixed-width bins using generate_series, LEFT JOINs against the source table, and aggregates overlapping intervals per bin. It supports configurable statistics (count, mean, sum, min, max), an optional target column for aggregating a named field instead of interval length, and both := and => named parameter syntax. The generated SQL is compatible with both DuckDB and PostgreSQL.

Closes #61

Proposed changes

Expression node (GIQLCoverage)
Add a new GIQLCoverage expression node registered as a SQLGlot Func. The from_arg_list classmethod handles positional and named parameters via PropertyEQ (:=) and Kwarg (=>). Parameters: this (interval column), resolution (bin width), stat (aggregation function), and target (column to aggregate).

Parser registration
Register COVERAGE as a function in GIQLDialect so parse_one(..., dialect=GIQLDialect) produces a GIQLCoverage AST node. The = operator is explicitly excluded from named parameter syntax to avoid ambiguity with SQL equality.

Transformer (CoverageTransformer)
Rewrite queries containing COVERAGE(...) into a CTE-based plan:
- A __giql_chroms subquery computes per-chromosome MAX(end) from the source table
- A __giql_bins CTE uses CROSS JOIN LATERAL generate_series(0, max_end, resolution) to tile each chromosome
- A LEFT JOIN matches source intervals to bins, with any original WHERE conditions moved into the ON clause to preserve zero-coverage bins
- GROUP BY and the selected aggregate (COUNT, AVG, SUM, MIN, MAX) produce one row per bin
- When no explicit AS alias is provided, the aggregate defaults to value
The LATERAL syntax omits the Subquery wrapper to emit CROSS JOIN LATERAL generate_series(...) directly, which is compatible with both DuckDB and PostgreSQL.

Documentation
Add COVERAGE to the aggregation operators reference (docs/dialect/aggregation-operators.rst) with full syntax, parameter descriptions, return value documentation, and examples. Add a recipes page (docs/recipes/coverage.rst) with patterns for basic coverage, statistics, filtered coverage, strand-specific coverage, target column aggregation, and RPM normalisation. Include a sample output table illustrating the returned data structure and clarify that score is a source table column in the WHERE filter example.

Test cases
TestGIQLCoverage (from_arg_list):
- positional mapping
- named param via PropertyEQ (:= named stat parameter)
- named param via Kwarg (=> named stat parameter)
- named resolution
- target via PropertyEQ (:= named target parameter)
- target via Kwarg (=> named target parameter)
- multiple named params
- property: stat + resolution combinations (:= or =>)
- property: positional-only
- property: target syntax (:= or => syntax for target parameter)
TestCoverageTransformer:
- __init__
- transform: basic path
- transform: early return
- transform: stat=mean
- transform: stat=sum
- transform: stat=min
- transform: stat=max
- transform: target with non-count stat
- transform: target with count stat
- transform: default alias
- transform: explicit alias preserved
- transform: WHERE to ON migration
- transform: WHERE column qualification (ON)
- transform: WHERE in chroms subquery
- transform: custom columns
- transform: pass-through columns
- transform: table alias
- transform: resolution propagation
- transform: CTE recursion
- transform: invalid stat
- transform: multiple COVERAGE
- five entries with no coverage target listed
- transform: property: stat maps to SQL aggregate
- transform: property: structural invariants

Implementation plan
- Add GIQLCoverage expression node with from_arg_list and register in parser
- Add CoverageTransformer with CTE-based generate_series + LEFT JOIN + GROUP BY rewrite
- Add => (standard SQL Kwarg) named parameter syntax alongside :=
- Exclude = as named parameter syntax to avoid ambiguity
- Add target parameter for aggregating a named column and default value alias