Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
17 commits
Select commit Hold shift + click to select a range
77df05a
feat: Add GIQLCoverage expression node and parser registration
conradbzura Mar 11, 2026
3ea9d81
feat: Add CoverageTransformer for binned genome coverage
conradbzura Mar 11, 2026
9403bc7
test: Add parsing and transpilation tests for COVERAGE operator
conradbzura Mar 11, 2026
77d7f28
docs: Add COVERAGE operator reference and recipes
conradbzura Mar 11, 2026
f4838ee
feat: Support => (standard SQL) named parameter syntax in COVERAGE
conradbzura Mar 11, 2026
0a79eb1
fix: Stop treating = as named parameter syntax in COVERAGE
conradbzura Mar 11, 2026
45710d8
refactor: Remove dead code and fix LATERAL syntax for DuckDB compat
conradbzura Mar 12, 2026
763885e
feat: Add target parameter and default alias to COVERAGE operator
conradbzura Mar 12, 2026
684ebc1
fix: Move COVERAGE WHERE clause into LEFT JOIN ON condition
conradbzura Mar 12, 2026
c5f1e9c
test: Rewrite COVERAGE tests to spec with full API coverage
conradbzura Mar 12, 2026
c7b1131
test: Add unit tests for bedtools test utilities
conradbzura Mar 25, 2026
b800625
test: Add unit tests for GIQL parsing, generation, and transpilation
conradbzura Mar 25, 2026
4898025
test: Add bedtools integration tests for operator correctness
conradbzura Mar 25, 2026
04311f0
docs: Clarify score column reference and add sample output table
conradbzura Mar 25, 2026
7577ef7
test: Add property-based tests for COVERAGE transpilation
conradbzura Mar 25, 2026
1db963e
fix: Align unit tests with := named parameter syntax and fix CTE pres…
conradbzura Mar 26, 2026
039baae
fix: Compare only coordinates in merge-then-intersect workflow test
conradbzura Mar 26, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
125 changes: 125 additions & 0 deletions docs/dialect/aggregation-operators.rst
Original file line number Diff line number Diff line change
Expand Up @@ -328,4 +328,129 @@ Related Operators
~~~~~~~~~~~~~~~~~

- :ref:`CLUSTER <cluster-operator>` - Assign cluster IDs without merging
- :ref:`COVERAGE <coverage-operator>` - Compute binned genome coverage
- :ref:`INTERSECTS <intersects-operator>` - Test for overlap between specific pairs

----

.. _coverage-operator:

COVERAGE
--------

Compute binned genome coverage by tiling the genome into fixed-width bins.

Description
~~~~~~~~~~~

The ``COVERAGE`` operator tiles the genome into fixed-width bins and aggregates overlapping intervals per bin. It generates a bin grid using ``generate_series`` and joins it against the source table to count (or otherwise aggregate) overlapping features in each bin.

This is useful for:

- Computing read depth or signal coverage across the genome
- Creating fixed-resolution coverage tracks from interval data
- Summarising feature density at a user-defined resolution

The operator works as an aggregate function, returning one row per bin with the bin coordinates and the computed statistic.

Syntax
~~~~~~

.. code-block:: sql

-- Basic coverage (count overlapping intervals per bin)
SELECT COVERAGE(interval, resolution) FROM features

-- With a named statistic (either := or => syntax)
SELECT COVERAGE(interval, 1000, stat := 'mean') FROM features
SELECT COVERAGE(interval, 1000, stat => 'mean') FROM features

-- Aggregate a specific column instead of interval length
SELECT COVERAGE(interval, 1000, stat := 'mean', target := 'score') FROM features

-- Named resolution parameter
SELECT COVERAGE(interval, resolution := 500) FROM features

Parameters
~~~~~~~~~~

**interval**
A genomic column.

**resolution**
Bin width in base pairs. Can be given as a positional or named parameter.

**stat** *(optional)*
Aggregation function applied to overlapping intervals per bin. One of:

- ``'count'`` — number of overlapping intervals (default)
- ``'mean'`` — average interval length of overlapping intervals
- ``'sum'`` — total interval length of overlapping intervals
- ``'min'`` — minimum interval length of overlapping intervals
- ``'max'`` — maximum interval length of overlapping intervals

When ``target`` is specified, the stat is applied to that column instead of interval length.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvictus is interval length a sane default for target metric?


**target** *(optional)*
Column name to aggregate. When omitted, non-count stats aggregate interval length (``end - start``). When specified, the stat is applied to the named column. For ``'count'``, specifying a target counts non-NULL values of that column instead of ``COUNT(*)``.

Return Value
~~~~~~~~~~~~

Returns one row per genomic bin:

- ``chrom`` — Chromosome of the bin
- ``start`` — Start position of the bin
- ``end`` — End position of the bin
- ``value`` — The computed aggregate (default alias; use ``AS`` to rename)

Examples
~~~~~~~~

**Basic Coverage:**

Count the number of features overlapping each 1 kb bin:

.. code-block:: sql

SELECT COVERAGE(interval, 1000)
FROM features

**Mean Coverage:**

Compute the average interval length per 500 bp bin:

.. code-block:: sql

SELECT COVERAGE(interval, 500, stat := 'mean')
FROM features

**Named Alias:**

.. code-block:: sql

SELECT COVERAGE(interval, 1000) AS depth
FROM reads

**With WHERE Filter:**

Assuming the source table includes a ``score`` column, compute coverage of high-scoring features only:

.. code-block:: sql

SELECT COVERAGE(interval, 1000) AS depth
FROM features
WHERE score > 10

Performance Notes
~~~~~~~~~~~~~~~~~

- The operator creates one bin per chromosome per step, so smaller resolutions produce more rows
- A ``LEFT JOIN`` ensures bins with zero coverage are included in the output
- For very large genomes, consider restricting the query with a ``WHERE`` clause on chromosome

Related Operators
~~~~~~~~~~~~~~~~~

- :ref:`MERGE <merge-operator>` - Combine overlapping intervals into single regions
- :ref:`CLUSTER <cluster-operator>` - Assign cluster IDs to overlapping intervals
173 changes: 173 additions & 0 deletions docs/recipes/coverage.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,173 @@
Coverage
========

This section covers patterns for computing genome-wide coverage and signal
summaries using GIQL's ``COVERAGE`` operator.

Basic Coverage
--------------

Count Overlapping Features
~~~~~~~~~~~~~~~~~~~~~~~~~~

Count the number of features overlapping each 1 kb bin across the genome:

.. code-block:: sql

SELECT COVERAGE(interval, 1000) AS depth
FROM features

**Sample output:**

.. code-block:: text

┌────────┬────────┬────────┬───────┐
│ chrom │ start │ end │ depth │
├────────┼────────┼────────┼───────┤
│ chr1 │ 0 │ 1000 │ 3 │
│ chr1 │ 1000 │ 2000 │ 1 │
│ chr1 │ 2000 │ 3000 │ 0 │
│ ... │ ... │ ... │ ... │
└────────┴────────┴────────┴───────┘

Each row represents one genomic bin. Bins with no overlapping features appear with a count of zero.

**Use case:** Compute read depth or feature density at a fixed resolution.

Custom Bin Size
~~~~~~~~~~~~~~~

Use a finer resolution of 100 bp:

.. code-block:: sql

SELECT COVERAGE(interval, 100) AS depth
FROM reads

**Use case:** High-resolution coverage tracks for visualisation.

Coverage Statistics
-------------------

Mean Interval Length per Bin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compute the average length of intervals overlapping each bin:

.. code-block:: sql

SELECT COVERAGE(interval, 1000, stat := 'mean') AS avg_len
FROM features

Sum of Interval Lengths per Bin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compute the total interval length in each bin:

.. code-block:: sql

SELECT COVERAGE(interval, 1000, stat := 'sum') AS total_len
FROM features

Maximum Interval Length per Bin
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Find the longest interval overlapping each bin:

.. code-block:: sql

SELECT COVERAGE(interval, 1000, stat := 'max') AS max_len
FROM features

Aggregating a Specific Column
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compute the mean score of overlapping features per bin instead of summarising interval length:

.. code-block:: sql

SELECT COVERAGE(interval, 1000, stat := 'mean', target := 'score') AS avg_score
FROM features

**Use case:** Signal tracks from a numeric column (e.g. ChIP-seq score, p-value).

Filtered Coverage
-----------------

Strand-Specific Coverage
~~~~~~~~~~~~~~~~~~~~~~~~

Compute coverage for each strand separately by filtering:

.. code-block:: sql

-- Plus strand
SELECT COVERAGE(interval, 1000) AS depth
FROM features
WHERE strand = '+'

.. code-block:: sql

-- Minus strand
SELECT COVERAGE(interval, 1000) AS depth
FROM features
WHERE strand = '-'

**Use case:** Strand-specific signal tracks for RNA-seq or stranded assays.

Coverage of High-Scoring Features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Restrict coverage to features above a quality threshold:

.. code-block:: sql

SELECT COVERAGE(interval, 1000) AS depth
FROM features
WHERE score > 10

5' End Counting
~~~~~~~~~~~~~~~

To count only the 5' ends of features (e.g. TSS or read starts), first
create a view or CTE that trims each interval to its 5' end, then apply
``COVERAGE``:

.. code-block:: sql

WITH five_prime AS (
SELECT chrom, start, start + 1 AS end
FROM features
WHERE strand = '+'
UNION ALL
SELECT chrom, end - 1 AS start, end
FROM features
WHERE strand = '-'
)
SELECT COVERAGE(interval, 1000) AS tss_count
FROM five_prime

Normalised Coverage
-------------------

RPM Normalisation
~~~~~~~~~~~~~~~~~

Normalise bin counts to reads per million (RPM) by dividing by the total
number of reads:

.. code-block:: sql

WITH bins AS (
SELECT COVERAGE(interval, 1000) AS depth
FROM reads
),
total AS (
SELECT COUNT(*) AS n FROM reads
)
SELECT
bins.chrom,
bins.start,
bins.end,
bins.depth * 1000000.0 / total.n AS rpm
FROM bins, total
4 changes: 4 additions & 0 deletions docs/recipes/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,10 @@ Recipe Categories
Clustering overlapping intervals, distance-based clustering,
merging intervals, and aggregating cluster statistics.

:doc:`coverage`
Binned genome coverage, coverage statistics, strand-specific coverage,
normalisation, and 5' end counting.

:doc:`advanced`
Multi-range matching, complex filtering with joins, aggregate statistics,
window expansions, and multi-table queries.
Expand Down
2 changes: 2 additions & 0 deletions src/giql/dialect.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
from giql.expressions import Contains
from giql.expressions import GIQLCluster
from giql.expressions import GIQLDistance
from giql.expressions import GIQLCoverage
from giql.expressions import GIQLMerge
from giql.expressions import GIQLNearest
from giql.expressions import Intersects
Expand Down Expand Up @@ -54,6 +55,7 @@ class Parser(Parser):
FUNCTIONS = {
**Parser.FUNCTIONS,
"CLUSTER": GIQLCluster.from_arg_list,
"COVERAGE": GIQLCoverage.from_arg_list,
"MERGE": GIQLMerge.from_arg_list,
"DISTANCE": GIQLDistance.from_arg_list,
"NEAREST": GIQLNearest.from_arg_list,
Expand Down
49 changes: 49 additions & 0 deletions src/giql/expressions.py
Original file line number Diff line number Diff line change
Expand Up @@ -142,6 +142,55 @@ def from_arg_list(cls, args):
return cls(**kwargs)


class GIQLCoverage(exp.Func):
"""COVERAGE aggregate function for binned genome coverage.

Tiles the genome into fixed-width bins and aggregates overlapping
intervals per bin using generate_series and JOIN + GROUP BY.

Examples:
COVERAGE(interval, 1000)
COVERAGE(interval, 500, stat := 'mean')
COVERAGE(interval, resolution := 1000)
COVERAGE(interval, 1000, stat := 'mean', target := 'score')
"""

arg_types = {
"this": True, # genomic column
"resolution": True, # bin width (positional or named)
"stat": False, # aggregation: 'count', 'mean', 'sum', 'min', 'max'
"target": False, # column to aggregate (default: interval length)
}

@classmethod
def from_arg_list(cls, args):
"""Parse argument list, handling named parameters.

:param args: List of arguments from parser
:return: GIQLCoverage instance with properly mapped arguments
"""
kwargs = {}
positional_args = []

# Separate named (PropertyEQ for :=, Kwarg for =>) and positional arguments
for arg in args:
if isinstance(arg, (exp.PropertyEQ, exp.Kwarg)):
param_name = (
arg.this.name if hasattr(arg.this, "name") else str(arg.this)
)
kwargs[param_name.lower()] = arg.expression
else:
positional_args.append(arg)

# Map positional arguments
if len(positional_args) > 0:
kwargs["this"] = positional_args[0]
if len(positional_args) > 1:
kwargs["resolution"] = positional_args[1]

return cls(**kwargs)


class GIQLDistance(exp.Func):
"""DISTANCE function for calculating genomic distances between intervals.

Expand Down
Loading
Loading