Add binned summary statistic aggregation for genomic intervals #61

@conradbzura

Summary

Add a new GIQL operator that computes summary statistics over fixed-size genomic bins. Given a table of intervals (e.g., alignments from a BAM file), produce a coverage-like signal track at an arbitrary resolution — analogous to how bigWig files store pre-aggregated signal.

The core capabilities to support are:

  • Fixed-bin coverage — tile the genome into bins of a given width and count (or sum/mean/min/max) overlapping intervals per bin
  • Strand-specific aggregation — produce separate plus/minus strand tracks
  • 5' end vs fragment mapping — count only the 5' endpoint of reads, or both 5' and 3' ends
  • Read shifting — offset positive/negative strand reads by N bases before aggregation
  • Normalization — scale factor and read-depth normalization (e.g., divide by total reads, multiply by 1,000,000 for CPM)
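
The per-read adjustments in the list above reduce to a few lines of arithmetic. The following is a minimal sketch, not part of GIQL: the `Read` type, the helper names, the 0-based half-open coordinate convention, and the shift direction (plus-strand reads move right, minus-strand reads move left) are all assumptions made for illustration.

```python
from typing import NamedTuple


class Read(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # exclusive (assumed convention)
    strand: str  # "+" or "-"


def five_prime(read: Read) -> int:
    """Position of the 5' base: leftmost on '+', rightmost on '-'."""
    return read.start if read.strand == "+" else read.end - 1


def shift(read: Read, n: int) -> Read:
    """Offset a read by n bases: '+' moves right, '-' moves left (assumed)."""
    delta = n if read.strand == "+" else -n
    return read._replace(start=read.start + delta, end=read.end + delta)


def cpm(count: int, total_reads: int) -> float:
    """Counts per million: multiply by 1,000,000, divide by total reads."""
    return count * 1_000_000 / total_reads


minus_read = Read("chr1", 100, 150, "-")
print(five_prime(minus_read))   # 149
print(shift(minus_read, 5))     # start=95, end=145
print(cpm(25, 2_000_000))       # 12.5
```

Whether these adjustments end up as operator parameters or as plain SQL expressions around the operator is the open design question raised later in this issue.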

Proposed syntax:

SELECT *
FROM COVERAGE(reads, resolution := 1000, stat := 'mean')

This would return one row per bin (chrom, start, end, value) with the aggregated statistic computed over all intervals overlapping each bin.
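
To pin down the intended output shape, here is a reference sketch in plain Python of the count statistic only. It is not the proposed implementation. The function name `binned_count` is hypothetical, intervals are assumed to be half-open with `end > start`, and bins that no interval overlaps are assumed to be omitted; the real operator may well choose to emit them with a zero value.

```python
from collections import Counter


def binned_count(intervals, resolution):
    """Count intervals overlapping each fixed-width bin.

    intervals: iterable of (chrom, start, end), half-open (assumed).
    Returns sorted rows of (chrom, bin_start, bin_end, count).
    """
    counts = Counter()
    for chrom, start, end in intervals:
        first = start // resolution        # bin holding the first base
        last = (end - 1) // resolution     # bin holding the last base
        for b in range(first, last + 1):
            counts[(chrom, b)] += 1
    return [
        (chrom, b * resolution, (b + 1) * resolution, n)
        for (chrom, b), n in sorted(counts.items())
    ]


reads = [("chr1", 100, 250), ("chr1", 900, 1100), ("chr1", 1500, 1600)]
rows = binned_count(reads, resolution=1000)
print(rows)  # [('chr1', 0, 1000, 2), ('chr1', 1000, 2000, 2)]
```

The second read spans the boundary at 1000, so it contributes to both bins. What `sum`, `mean`, `min`, and `max` aggregate over (overlap length, a score column, per-base depth) is left unspecified here because the issue does not define it.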

Strand-specific, shifting, and normalization features may be expressible as standard SQL around the core operator (WHERE clauses, CASE expressions, scalar subqueries) rather than requiring dedicated parameters — the recipes and documentation should demonstrate these patterns.
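
As a rough feasibility check of that idea, the sketch below expresses strand filtering, read shifting, and CPM normalization as ordinary SQL, run through Python's built-in `sqlite3`. Since `COVERAGE` does not exist yet, integer division of the shifted 5' position stands in for the binning step. The table layout, the column names `start_pos` and `end_pos`, the `track` helper, and the choice to normalize by all reads rather than per strand are assumptions for illustration only.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE reads (chrom TEXT, start_pos INTEGER, end_pos INTEGER, strand TEXT)"
)
con.executemany(
    "INSERT INTO reads VALUES (?, ?, ?, ?)",
    [("chr1", 100, 150, "+"), ("chr1", 990, 1040, "+"),
     ("chr1", 1200, 1250, "-"), ("chr1", 2000, 2050, "-")],
)

SQL = """
SELECT chrom,
       (pos / :res) * :res        AS bin_start,
       (pos / :res) * :res + :res AS bin_end,
       -- scalar subquery supplies the read-depth denominator for CPM
       COUNT(*) * 1000000.0 / (SELECT COUNT(*) FROM reads) AS cpm
FROM (
    SELECT chrom, strand,
           -- CASE expression: shifted 5' position, direction depends on strand
           CASE strand WHEN '+' THEN start_pos + :shift
                       ELSE end_pos - 1 - :shift END AS pos
    FROM reads
)
WHERE strand = :strand            -- WHERE clause selects one strand's track
GROUP BY chrom, bin_start
ORDER BY chrom, bin_start
"""


def track(strand, res=1000, shift=10):
    """Return (chrom, bin_start, bin_end, cpm) rows for one strand."""
    params = {"res": res, "shift": shift, "strand": strand}
    return con.execute(SQL, params).fetchall()


plus, minus = track("+"), track("-")
print(plus)   # [('chr1', 0, 1000, 250000.0), ('chr1', 1000, 2000, 250000.0)]
print(minus)  # [('chr1', 1000, 2000, 250000.0), ('chr1', 2000, 3000, 250000.0)]
```

If the core operator accepts an arbitrary subquery as input, the inner `SELECT` and the `WHERE` filter could move inside it unchanged, which is the composition this issue is hoping for.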

Motivation

Computing read depth or signal coverage over fixed-width windows is a fundamental genomics operation. GIQL currently supports CLUSTER and MERGE for combining overlapping intervals, but lacks the ability to tile the genome into fixed-width bins and aggregate across them. A first-class operator for this would enable coverage computation that composes with GIQL's filtering, joins, and spatial operators within a single query pipeline — replacing external tool invocations with declarative SQL.

Affected code

  • src/giql/expressions.py — new AST node for the coverage/binning operator
  • src/giql/transformer.py — transformer to rewrite into windowed SQL (generate_series + GROUP BY)
  • src/giql/dialect.py — parser extensions for the new syntax
  • src/giql/generators/base.py — SQL generation for the binned aggregation pattern
  • docs/dialect/aggregation-operators.rst — operator documentation
  • docs/recipes/ — recipes for strand-specific coverage, normalization, shifting
  • tests/ — new test module for the operator
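
The rewrite target named for the transformer (generate bins, join on overlap, GROUP BY) can be prototyped as follows. This is a sketch of the general pattern, not the SQL that GIQL would emit. It runs on Python's built-in `sqlite3`, where a recursive CTE stands in for `generate_series`; the `chrom_sizes` table and the `start_pos`/`end_pos` column names are hypothetical. Unlike an inner join, the `LEFT JOIN` keeps empty bins and reports them with a value of 0.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE reads (chrom TEXT, start_pos INTEGER, end_pos INTEGER);
INSERT INTO reads VALUES
    ('chr1', 100, 250), ('chr1', 900, 1100), ('chr1', 1500, 1600);
-- hypothetical table of chromosome lengths used to bound the tiling
CREATE TABLE chrom_sizes (chrom TEXT, size INTEGER);
INSERT INTO chrom_sizes VALUES ('chr1', 3000);
""")

SQL = """
WITH RECURSIVE bins(chrom, bin_start, size) AS (
    -- stand-in for generate_series(0, size - 1, :res)
    SELECT chrom, 0, size FROM chrom_sizes
    UNION ALL
    SELECT chrom, bin_start + :res, size
    FROM bins
    WHERE bin_start + :res < size
)
SELECT b.chrom,
       b.bin_start,
       b.bin_start + :res AS bin_end,
       COUNT(r.chrom)     AS value      -- counts only matched reads
FROM bins AS b
LEFT JOIN reads AS r
  ON r.chrom = b.chrom
 AND r.start_pos < b.bin_start + :res   -- half-open overlap test
 AND r.end_pos > b.bin_start
GROUP BY b.chrom, b.bin_start
ORDER BY b.chrom, b.bin_start
"""

rows = con.execute(SQL, {"res": 1000}).fetchall()
print(rows)  # [('chr1', 0, 1000, 2), ('chr1', 1000, 2000, 2), ('chr1', 2000, 3000, 0)]
```

A test module for the operator could compare output like this against a plain-Python reference on small inputs, including reads that straddle a bin boundary.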

Labels

enhancement: New feature or request
