Add binned summary statistic aggregation for genomic intervals — Closes #61 #62
conradbzura wants to merge 17 commits into main from 61-binned-coverage-operator
- 77df05a feat: Add GIQLCoverage expression node and parser registration (conradbzura)
- 3ea9d81 feat: Add CoverageTransformer for binned genome coverage (conradbzura)
- 9403bc7 test: Add parsing and transpilation tests for COVERAGE operator (conradbzura)
- 77d7f28 docs: Add COVERAGE operator reference and recipes (conradbzura)
- f4838ee feat: Support => (standard SQL) named parameter syntax in COVERAGE (conradbzura)
- 0a79eb1 fix: Stop treating = as named parameter syntax in COVERAGE (conradbzura)
- 45710d8 refactor: Remove dead code and fix LATERAL syntax for DuckDB compat (conradbzura)
- 763885e feat: Add target parameter and default alias to COVERAGE operator (conradbzura)
- 684ebc1 fix: Move COVERAGE WHERE clause into LEFT JOIN ON condition (conradbzura)
- c5f1e9c test: Rewrite COVERAGE tests to spec with full API coverage (conradbzura)
- c7b1131 test: Add unit tests for bedtools test utilities (conradbzura)
- b800625 test: Add unit tests for GIQL parsing, generation, and transpilation (conradbzura)
- 4898025 test: Add bedtools integration tests for operator correctness (conradbzura)
- 04311f0 docs: Clarify score column reference and add sample output table (conradbzura)
- 7577ef7 test: Add property-based tests for COVERAGE transpilation (conradbzura)
- 1db963e fix: Align unit tests with := named parameter syntax and fix CTE pres… (conradbzura)
- 039baae fix: Compare only coordinates in merge-then-intersect workflow test (conradbzura)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -328,4 +328,129 @@ Related Operators | |
| ~~~~~~~~~~~~~~~~~ | ||
|
|
||
| - :ref:`CLUSTER <cluster-operator>` - Assign cluster IDs without merging | ||
| - :ref:`COVERAGE <coverage-operator>` - Compute binned genome coverage | ||
| - :ref:`INTERSECTS <intersects-operator>` - Test for overlap between specific pairs | ||
|
|
||
| ---- | ||
|
|
||
| .. _coverage-operator: | ||
|
|
||
| COVERAGE | ||
| -------- | ||
|
|
||
| Compute binned genome coverage by tiling the genome into fixed-width bins. | ||
|
|
||
| Description | ||
| ~~~~~~~~~~~ | ||
|
|
||
| The ``COVERAGE`` operator tiles the genome into fixed-width bins and aggregates overlapping intervals per bin. It generates a bin grid using ``generate_series`` and joins it against the source table to count (or otherwise aggregate) overlapping features in each bin. | ||
|
|
||
| This is useful for: | ||
|
|
||
| - Computing read depth or signal coverage across the genome | ||
| - Creating fixed-resolution coverage tracks from interval data | ||
| - Summarising feature density at a user-defined resolution | ||
|
|
||
| The operator works as an aggregate function, returning one row per bin with the bin coordinates and the computed statistic. | ||
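To make the bin-grid-plus-join description concrete, here is a minimal Python sketch of the same idea outside SQL. It is not the transpiler's implementation. It assumes half-open ``[start, end)`` coordinates, and it assumes the grid for each chromosome runs from 0 to the largest end coordinate seen on that chromosome; the PR text does not state how the grid extent is chosen. The function name ``binned_count`` is hypothetical.

```python
from collections import defaultdict


def binned_count(intervals, resolution):
    """Count intervals overlapping each fixed-width bin, per chromosome.

    intervals: iterable of (chrom, start, end), half-open [start, end).
    Assumption: each chromosome's grid spans 0 .. max(end) on that chromosome.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    rows = []
    for chrom in sorted(by_chrom):
        spans = by_chrom[chrom]
        max_end = max(end for _, end in spans)
        n_bins = -(-max_end // resolution)  # integer ceiling division
        for i in range(n_bins):
            bin_start, bin_end = i * resolution, (i + 1) * resolution
            # Half-open overlap test: interval and bin share at least one base.
            count = sum(1 for s, e in spans if s < bin_end and e > bin_start)
            # Bins with no overlap are still emitted, mirroring the LEFT JOIN.
            rows.append((chrom, bin_start, bin_end, count))
    return rows


features = [("chr1", 100, 400), ("chr1", 300, 1200), ("chr1", 3100, 3200)]
rows = binned_count(features, 1000)
for row in rows:
    print(row)
```

The interval ``(300, 1200)`` is counted in both the first and second bins, since an interval contributes to every bin it touches.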
|
|
||
| Syntax | ||
| ~~~~~~ | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| -- Basic coverage (count overlapping intervals per bin) | ||
| SELECT COVERAGE(interval, resolution) FROM features | ||
|
|
||
| -- With a named statistic (either := or => syntax) | ||
| SELECT COVERAGE(interval, 1000, stat := 'mean') FROM features | ||
| SELECT COVERAGE(interval, 1000, stat => 'mean') FROM features | ||
|
|
||
| -- Aggregate a specific column instead of interval length | ||
| SELECT COVERAGE(interval, 1000, stat := 'mean', target := 'score') FROM features | ||
|
|
||
| -- Named resolution parameter | ||
| SELECT COVERAGE(interval, resolution := 500) FROM features | ||
|
|
||
| Parameters | ||
| ~~~~~~~~~~ | ||
|
|
||
| **interval** | ||
| The genomic interval column of the source table. | ||
|
|
||
| **resolution** | ||
| Bin width in base pairs. Can be given as a positional or named parameter. | ||
|
|
||
| **stat** *(optional)* | ||
| Aggregation function applied to overlapping intervals per bin. One of: | ||
|
|
||
| - ``'count'`` — number of overlapping intervals (default) | ||
| - ``'mean'`` — average interval length of overlapping intervals | ||
| - ``'sum'`` — total interval length of overlapping intervals | ||
| - ``'min'`` — minimum interval length of overlapping intervals | ||
| - ``'max'`` — maximum interval length of overlapping intervals | ||
|
|
||
| When ``target`` is specified, the stat is applied to that column instead of interval length. | ||
|
Review comment from the PR author (Collaborator): @nvictus is interval length a sane default for target metric?
||
|
|
||
| **target** *(optional)* | ||
| Column name to aggregate. When omitted, non-count stats aggregate interval length (``end - start``). When specified, the stat is applied to the named column. For ``'count'``, specifying a target counts non-NULL values of that column instead of ``COUNT(*)``. | ||
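The interaction between ``stat`` and ``target`` described above can be sketched in Python for a single bin. This is an illustration of the documented rules, not the generated SQL. The helper name ``aggregate_bin`` is hypothetical, and returning ``None`` for an empty bin under a non-count stat is an assumption; the docs only state that empty bins report a count of zero.

```python
from statistics import mean

AGGREGATES = {"mean": mean, "sum": sum, "min": min, "max": max}


def aggregate_bin(rows, stat="count", target=None):
    """Aggregate the rows overlapping one bin.

    rows: list of dicts with "start" and "end" plus optional extra columns.
    """
    if stat == "count":
        if target is None:
            return len(rows)  # behaves like COUNT(*)
        # With a target, count only non-NULL values of that column.
        return sum(1 for r in rows if r.get(target) is not None)

    if target is None:
        # Default for non-count stats: aggregate interval length (end - start).
        values = [r["end"] - r["start"] for r in rows]
    else:
        values = [r[target] for r in rows if r.get(target) is not None]

    # Assumption: an empty bin yields None (SQL NULL) for non-count stats.
    return AGGREGATES[stat](values) if values else None


overlapping = [
    {"start": 0, "end": 100, "score": 4.0},
    {"start": 50, "end": 350, "score": None},
]
print(aggregate_bin(overlapping))                         # count of rows
print(aggregate_bin(overlapping, "count", "score"))       # non-NULL scores
print(aggregate_bin(overlapping, "mean"))                 # mean interval length
print(aggregate_bin(overlapping, "mean", "score"))        # mean score
```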
|
|
||
| Return Value | ||
| ~~~~~~~~~~~~ | ||
|
|
||
| Returns one row per genomic bin: | ||
|
|
||
| - ``chrom`` — Chromosome of the bin | ||
| - ``start`` — Start position of the bin | ||
| - ``end`` — End position of the bin | ||
conradbzura marked this conversation as resolved.
|
||
| - ``value`` — The computed aggregate (default alias; use ``AS`` to rename) | ||
|
|
||
| Examples | ||
| ~~~~~~~~ | ||
|
|
||
| **Basic Coverage:** | ||
|
|
||
| Count the number of features overlapping each 1 kb bin: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000) | ||
| FROM features | ||
|
|
||
| **Mean Coverage:** | ||
|
|
||
| Compute the average length of the intervals overlapping each 500 bp bin: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 500, stat := 'mean') | ||
| FROM features | ||
|
|
||
| **Named Alias:** | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000) AS depth | ||
| FROM reads | ||
|
|
||
| **With WHERE Filter:** | ||
|
|
||
| Assuming the source table includes a ``score`` column, compute coverage of high-scoring features only: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000) AS depth | ||
| FROM features | ||
| WHERE score > 10 | ||
conradbzura marked this conversation as resolved.
|
||
|
|
||
| Performance Notes | ||
| ~~~~~~~~~~~~~~~~~ | ||
|
|
||
| - The operator creates one bin for every ``resolution``-sized step along each chromosome, so smaller resolutions produce more rows | ||
| - A ``LEFT JOIN`` ensures bins with zero coverage are included in the output | ||
| - For very large genomes, consider restricting the query with a ``WHERE`` clause on chromosome | ||
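As a rough guide to how output size scales with resolution, the sketch below estimates the number of bins per chromosome. It assumes the grid spans each chromosome's full length; the chromosome names and lengths are made up for illustration, and ``expected_bins`` is a hypothetical helper, not part of GIQL.

```python
def expected_bins(chrom_lengths, resolution):
    """Number of fixed-width bins needed to tile each chromosome."""
    # Integer ceiling division: a partial trailing bin still counts as one row.
    return {
        chrom: -(-length // resolution)
        for chrom, length in chrom_lengths.items()
    }


# Hypothetical chromosome lengths in base pairs.
chrom_lengths = {"chrA": 2_500_000, "chrB": 16_569}

at_1kb = expected_bins(chrom_lengths, 1000)
at_100bp = expected_bins(chrom_lengths, 100)
print(at_1kb, sum(at_1kb.values()))
print(at_100bp, sum(at_100bp.values()))
```

Going from 1 kb to 100 bp bins multiplies the row count by roughly ten, which is why restricting the query to the chromosomes of interest can matter.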
|
|
||
| Related Operators | ||
| ~~~~~~~~~~~~~~~~~ | ||
|
|
||
| - :ref:`MERGE <merge-operator>` - Combine overlapping intervals into single regions | ||
| - :ref:`CLUSTER <cluster-operator>` - Assign cluster IDs to overlapping intervals | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,173 @@ | ||
| Coverage | ||
| ======== | ||
|
|
||
| This section covers patterns for computing genome-wide coverage and signal | ||
| summaries using GIQL's ``COVERAGE`` operator. | ||
|
|
||
| Basic Coverage | ||
| -------------- | ||
|
|
||
| Count Overlapping Features | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Count the number of features overlapping each 1 kb bin across the genome: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000) AS depth | ||
| FROM features | ||
|
|
||
| **Sample output:** | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| ┌────────┬────────┬────────┬───────┐ | ||
| │ chrom │ start │ end │ depth │ | ||
| ├────────┼────────┼────────┼───────┤ | ||
| │ chr1 │ 0 │ 1000 │ 3 │ | ||
| │ chr1 │ 1000 │ 2000 │ 1 │ | ||
| │ chr1 │ 2000 │ 3000 │ 0 │ | ||
| │ ... │ ... │ ... │ ... │ | ||
| └────────┴────────┴────────┴───────┘ | ||
|
|
||
| Each row represents one genomic bin. Bins with no overlapping features appear with a count of zero. | ||
|
|
||
| **Use case:** Compute read depth or feature density at a fixed resolution. | ||
|
|
||
| Custom Bin Size | ||
| ~~~~~~~~~~~~~~~ | ||
|
|
||
| Use a finer resolution of 100 bp: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 100) AS depth | ||
| FROM reads | ||
|
|
||
| **Use case:** High-resolution coverage tracks for visualisation. | ||
|
|
||
| Coverage Statistics | ||
| ------------------- | ||
|
|
||
| Mean Interval Length per Bin | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Compute the average length of intervals overlapping each bin: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000, stat := 'mean') AS avg_len | ||
| FROM features | ||
conradbzura marked this conversation as resolved.
|
||
|
|
||
| Sum of Interval Lengths per Bin | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Compute the total length of the intervals overlapping each bin (full interval lengths, not only the portion inside the bin): | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000, stat := 'sum') AS total_len | ||
| FROM features | ||
|
|
||
| Maximum Interval Length per Bin | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Find the longest interval overlapping each bin: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000, stat := 'max') AS max_len | ||
| FROM features | ||
|
|
||
| Aggregating a Specific Column | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Compute the mean score of overlapping features per bin instead of summarising interval length: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000, stat := 'mean', target := 'score') AS avg_score | ||
| FROM features | ||
|
|
||
| **Use case:** Signal tracks from a numeric column (e.g. ChIP-seq score, p-value). | ||
|
|
||
| Filtered Coverage | ||
| ----------------- | ||
|
|
||
| Strand-Specific Coverage | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Compute coverage for each strand separately by filtering: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| -- Plus strand | ||
| SELECT COVERAGE(interval, 1000) AS depth | ||
| FROM features | ||
| WHERE strand = '+' | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| -- Minus strand | ||
| SELECT COVERAGE(interval, 1000) AS depth | ||
| FROM features | ||
| WHERE strand = '-' | ||
|
|
||
| **Use case:** Strand-specific signal tracks for RNA-seq or stranded assays. | ||
|
|
||
| Coverage of High-Scoring Features | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Restrict coverage to features above a quality threshold: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| SELECT COVERAGE(interval, 1000) AS depth | ||
| FROM features | ||
| WHERE score > 10 | ||
|
|
||
| 5' End Counting | ||
| ~~~~~~~~~~~~~~~ | ||
|
|
||
| To count only the 5' ends of features (e.g. TSS or read starts), first | ||
| create a view or CTE that trims each interval to its 5' end, then apply | ||
| ``COVERAGE``: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| -- "end" is quoted because END is a reserved word in many SQL dialects | ||
| WITH five_prime AS ( | ||
| SELECT chrom, start, start + 1 AS "end" | ||
| FROM features | ||
| WHERE strand = '+' | ||
| UNION ALL | ||
| SELECT chrom, "end" - 1 AS start, "end" | ||
| FROM features | ||
| WHERE strand = '-' | ||
| ) | ||
| SELECT COVERAGE(interval, 1000) AS tss_count | ||
| FROM five_prime | ||
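The trimming step in the CTE above can be checked by hand with a small Python sketch. It assumes half-open ``[start, end)`` coordinates, so the 5' base of a minus-strand feature is ``end - 1``. The function name ``five_prime_end`` is hypothetical.

```python
def five_prime_end(chrom, start, end, strand):
    """Trim a half-open interval to the single base at its 5' end."""
    if strand == "+":
        # Plus strand: the 5' end is the first base of the interval.
        return (chrom, start, start + 1)
    if strand == "-":
        # Minus strand: the 5' end is the last base of the interval.
        return (chrom, end - 1, end)
    raise ValueError(f"unsupported strand: {strand!r}")


plus = five_prime_end("chr1", 100, 500, "+")
minus = five_prime_end("chr1", 100, 500, "-")
print(plus, minus)
```

Features with any other strand value are dropped by the CTE's ``WHERE`` clauses; the sketch raises an error instead so that such rows are not lost silently.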
|
|
||
| Normalised Coverage | ||
| ------------------- | ||
|
|
||
| RPM Normalisation | ||
| ~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Normalise bin counts to reads per million (RPM) by dividing each bin's count | ||
| by the total number of reads and multiplying by one million: | ||
|
|
||
| .. code-block:: sql | ||
|
|
||
| WITH bins AS ( | ||
| SELECT COVERAGE(interval, 1000) AS depth | ||
| FROM reads | ||
| ), | ||
| total AS ( | ||
| SELECT COUNT(*) AS n FROM reads | ||
| ) | ||
| SELECT | ||
| bins.chrom, | ||
| bins.start, | ||
| bins.end, | ||
| bins.depth * 1000000.0 / total.n AS rpm | ||
| FROM bins, total | ||
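The arithmetic in the final ``SELECT`` can be sanity-checked with a few lines of Python. This is only the formula from the query, ``depth * 1000000.0 / total.n``; the function name ``rpm`` and the guard against an empty table are additions for illustration.

```python
def rpm(depth, total_reads):
    """Scale a raw bin count to reads per million."""
    if total_reads <= 0:
        # The SQL version would divide by zero on an empty reads table.
        raise ValueError("total_reads must be positive")
    return depth * 1_000_000 / total_reads


# A bin with 3 reads in a library of 2 million reads.
print(rpm(3, 2_000_000))
```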