Add binned summary statistic aggregation for genomic intervals #61
Description
Summary
Add a new GIQL operator that computes summary statistics over fixed-size genomic bins. Given a table of intervals (e.g., alignments from a BAM file), produce a coverage-like signal track at an arbitrary resolution — analogous to how bigWig files store pre-aggregated signal.
The core capabilities to support are:
- Fixed-bin coverage — tile the genome into bins of a given width and count (or sum/mean/min/max) overlapping intervals per bin
- Strand-specific aggregation — produce separate plus/minus strand tracks
- 5' end vs fragment mapping — count only the 5' endpoint of reads, or both 5' and 3' ends
- Read shifting — offset positive/negative strand reads by N bases before aggregation
- Normalization — scale factor and read-depth normalization (e.g., divide by total reads, multiply by 1,000,000 for CPM)
Proposed syntax:
SELECT *
FROM COVERAGE(reads, resolution := 1000, stat := 'mean')
This would return one row per bin (chrom, start, end, value) with the aggregated statistic computed over all intervals overlapping each bin.
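To make the intended output shape concrete, here is a minimal pure-Python reference sketch of the simplest case, a per-bin count of overlapping intervals. It is not the proposed implementation. The function name `binned_counts`, the `chrom_sizes` input, and the 0-based half-open coordinate convention are all assumptions for illustration; the issue does not yet specify where bin extents come from or how `stat := 'mean'` would be defined.

```python
from collections import defaultdict


def binned_counts(intervals, chrom_sizes, resolution):
    """Count the intervals overlapping each fixed-width bin.

    intervals:   iterable of (chrom, start, end), assumed 0-based half-open.
    chrom_sizes: dict of chrom -> length (hypothetical source of bin extents).
    resolution:  bin width in bases.
    Returns a list of (chrom, start, end, value) rows, one per bin.
    """
    counts = defaultdict(int)
    for chrom, start, end in intervals:
        if end <= start:
            continue  # skip empty or inverted intervals
        first_bin = start // resolution
        last_bin = (end - 1) // resolution  # end is exclusive
        for b in range(first_bin, last_bin + 1):
            counts[(chrom, b)] += 1

    rows = []
    for chrom, size in chrom_sizes.items():
        n_bins = -(-size // resolution)  # ceiling division
        for b in range(n_bins):
            bin_start = b * resolution
            bin_end = min(bin_start + resolution, size)  # clip the last bin
            rows.append((chrom, bin_start, bin_end, counts.get((chrom, b), 0)))
    return rows


reads = [("chr1", 100, 250), ("chr1", 900, 1100), ("chr1", 1500, 1600)]
rows = binned_counts(reads, {"chr1": 2500}, resolution=1000)
for row in rows:
    print(row)
```

An interval that spans a bin boundary (here 900 to 1100) is counted in both bins it touches, which matches the "all intervals overlapping each bin" wording above. Empty bins are still emitted with a value of 0.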
Strand-specific, shifting, and normalization features may be expressible as standard SQL around the core operator (WHERE clauses, CASE expressions, scalar subqueries) rather than requiring dedicated parameters — the recipes and documentation should demonstrate these patterns.
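As a rough illustration of the "standard SQL around the core operator" idea, the sketch below runs the surrounding pieces in SQLite through Python's `sqlite3` module: a CASE expression for strand-aware shifting, a WHERE clause for strand selection, and a scalar subquery for a CPM scale factor. The `reads` schema, the `strand` column, and the shift values are assumptions for illustration. `COVERAGE` itself is not used, since it does not exist yet; the shifted, filtered rows would be its input and the scale factor would multiply its `value` column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical reads table; "start" and "end" are quoted because END is an SQL keyword.
conn.execute('CREATE TABLE reads (chrom TEXT, "start" INTEGER, "end" INTEGER, strand TEXT)')
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?, ?)",
    [
        ("chr1", 100, 150, "+"),
        ("chr1", 120, 170, "-"),
        ("chr1", 900, 950, "+"),
        ("chr1", 1500, 1550, "-"),
    ],
)

# Read shifting via CASE, then strand selection via WHERE on the shifted rows.
shifted_plus_sql = """
SELECT chrom, "start", "end"
FROM (
    SELECT chrom,
           "start" + CASE strand WHEN '+' THEN :plus_shift ELSE :minus_shift END AS "start",
           "end"   + CASE strand WHEN '+' THEN :plus_shift ELSE :minus_shift END AS "end",
           strand
    FROM reads
)
WHERE strand = '+'
ORDER BY chrom, "start"
"""
# Illustrative shift values only; the issue does not fix any defaults.
plus_rows = conn.execute(shifted_plus_sql, {"plus_shift": 4, "minus_shift": -5}).fetchall()

# Read-depth normalization: counts-per-million factor from a scalar subquery.
cpm_factor = conn.execute(
    "SELECT 1000000.0 / (SELECT COUNT(*) FROM reads)"
).fetchone()[0]

print(plus_rows)
print(cpm_factor)
```

If these patterns hold up in GIQL, the operator would only need `resolution` and `stat` as dedicated parameters, with the rest documented as recipes.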
Motivation
Computing read depth or signal coverage over fixed-width windows is a fundamental genomics operation. GIQL currently supports CLUSTER and MERGE for combining overlapping intervals, but lacks the ability to tile the genome into fixed-width bins and aggregate across them. A first-class operator for this would enable coverage computation that composes with GIQL's filtering, joins, and spatial operators within a single query pipeline — replacing external tool invocations with declarative SQL.
Affected code
- src/giql/expressions.py: new AST node for the coverage/binning operator
- src/giql/transformer.py: transformer to rewrite into windowed SQL (generate_series + GROUP BY)
- src/giql/dialect.py: parser extensions for the new syntax
- src/giql/generators/base.py: SQL generation for the binned aggregation pattern
- docs/dialect/aggregation-operators.rst: operator documentation
- docs/recipes/: recipes for strand-specific coverage, normalization, shifting
- tests/: new test module for the operator
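One possible shape for the SQL that the transformer could emit is sketched below, run against SQLite through Python's `sqlite3` module. This is an assumption about the rewrite target, not the planned output: a recursive CTE stands in for `generate_series` (which is not reliably available in SQLite builds used from Python), the `chrom_sizes` table is hypothetical, and only the count statistic is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE reads (chrom TEXT, "start" INTEGER, "end" INTEGER)')
conn.execute("CREATE TABLE chrom_sizes (chrom TEXT, size INTEGER)")  # hypothetical
conn.executemany(
    "INSERT INTO reads VALUES (?, ?, ?)",
    [("chr1", 100, 250), ("chr1", 900, 1100), ("chr1", 1500, 1600)],
)
conn.execute("INSERT INTO chrom_sizes VALUES ('chr1', 2500)")

binned_sql = """
WITH RECURSIVE bins(chrom, bin_start) AS (
    -- Stand-in for generate_series(0, size - 1, :res) per chromosome
    SELECT chrom, 0 FROM chrom_sizes
    UNION ALL
    SELECT b.chrom, b.bin_start + :res
    FROM bins AS b
    JOIN chrom_sizes AS c ON c.chrom = b.chrom
    WHERE b.bin_start + :res < c.size
)
SELECT b.chrom,
       b.bin_start AS "start",
       MIN(b.bin_start + :res, c.size) AS "end",  -- clip the last bin
       COUNT(r.chrom) AS value                    -- 0 for bins with no reads
FROM bins AS b
JOIN chrom_sizes AS c ON c.chrom = b.chrom
LEFT JOIN reads AS r
       ON r.chrom = b.chrom
      AND r."start" < b.bin_start + :res          -- half-open overlap test
      AND r."end" > b.bin_start
GROUP BY b.chrom, b.bin_start, c.size
ORDER BY b.chrom, b.bin_start
"""

coverage_rows = conn.execute(binned_sql, {"res": 1000}).fetchall()
for row in coverage_rows:
    print(row)
```

The LEFT JOIN keeps empty bins in the result, and `COUNT(r.chrom)` ignores the NULLs they produce. Swapping the aggregate for SUM, AVG, MIN, or MAX would require deciding which per-interval value is being summarized, which the issue leaves open.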