Genomic interval arithmetic in skgenome using bioframe by etal · Pull Request #991 · etal/cnvkit

etal · 2026-02-07T21:36:28Z

Addresses: #982, #226, #227, #229, #232.

Public GenomicArray API (method names and signatures) is unchanged.

Verbs of interest, based on bedtools and other prior art:

subtract: bioframe.subtract
merge: bioframe.cluster
flatten: bioframe.cluster
intersect: bioframe.overlap
cut
squash

Skipped here for lack of need:

venn ( Implement intersect.venn function #228 )
jaccard ( Implement GenomicArray.jaccard method #230 )

The framework chosen for interval operations is bioframe. It's based on pandas, doesn't use compiled extensions, and uses numpy sorting to achieve subquadratic computational complexity on the core operations, which is better than my hand-woven intersect() and merge() fallbacks had in the case of overlapping intervals. This should be good enough for WGS scale in CNVkit.

Considered and not selected: polars-bio, scikit-bio, intervaltree, pyranges_1.x. Bioframe wins over these options for keeping CNVkit's codebase compact, distribution lightweight, and performance reasonable (i.e. genome arithmetic computation is never the bottleneck vs. coverage calculation, segmentation, and I/O).

Use bioframe.subtract, bioframe.cluster, and bioframe.overlap to replace custom Python implementations of genomic interval merging, subtraction, and intersection. This reduces maintenance surface (~155 lines net) and improves performance for WGS cases with many bins. - subtract.py: replace _subtraction() generator with bioframe.subtract() - merge.py: replace _nonoverlapping_groups + _squash_tuples with bioframe.cluster() for both merge() and flatten() - gary.py: replace intersection() non-trim path with bioframe.overlap() - intersect.py: delete unimplemented venn() stub - Add bioframe>=0.7.2 as a core dependency Public GenomicArray API (method names and signatures) is unchanged.

- cut(other): splits intervals at the boundary coordinates of another array. Each piece inherits all data from its source row. - squash(by, combine): combines consecutive adjacent rows, optionally only when they share the same value in a given column (e.g. gene). Both follow the established pattern of DataFrame-level functions (cut.py, merge.py) wrapped by GenomicArray methods in gary.py.

bioframe <=0.8.0 uses df.groupby(['col']).groups with a single-element list, which pandas 3.0 deprecates. Our filterwarnings = ["error"] turns this into a test failure on Python 3.12+ CI (pandas 3.0). Add a message-based filter that works across all pandas versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

etal · 2026-02-07T23:07:58Z

@claude Ready for code review.

claude · 2026-02-07T23:08:14Z

Claude finished @etal's task —— View job

Code Review for Genomic Interval Arithmetic with Bioframe

I've completed a thorough review of this significant architectural change. Overall, this is a well-executed refactoring that modernizes CNVkit's interval arithmetic while maintaining API compatibility.

✅ Strengths

Excellent Dependency Choice

Bioframe is a solid choice for this use case - pandas-based, no compiled dependencies, and proven performance
Version constraints are properly set (≥0.7.2) in both core.txt and min.txt
Integration properly configured in conda environment (environment-dev.yml)

API Preservation

Public GenomicArray API is fully preserved - existing user code will continue to work
Method signatures unchanged for merge(), flatten(), subtract(), intersection()
Smart chromosome sorting maintained via chromsort.sorter_chrom

Code Quality

Clean separation of DataFrame-level operations vs GenomicArray methods
Proper bioframe column mapping via _BF_COLS = ("chromosome", "start", "end")
Good test coverage including edge cases for new cut() and squash() methods

Performance Considerations

Replaces O(n²) custom algorithms with bioframe's optimized numpy-based sorting
Should handle WGS-scale data much better than previous implementation

🔍 Technical Review

New Functionality

cut(): Clean implementation using itertools.pairwise for boundary splitting
squash(): Vectorized adjacency detection is efficient and well-implemented
Both methods have comprehensive test coverage

Bioframe Integration

merge() uses bioframe.cluster() with proper min_dist handling for bp parameter
subtract() cleanly delegates to bioframe with column mapping
intersection() uses bioframe.overlap() with mode-specific filtering logic

Code Organization

Good modular structure: cut.py, updated merge.py, simplified subtract.py
Removed 240+ lines of complex custom intersection logic from intersect.py
Maintained existing combine/combiner patterns for field aggregation

⚠️ Minor Issues

CI Configuration

Good addition of pandas warning filter for bioframe <=0.8.0 compatibility
The filterwarnings entry in pyproject.toml properly addresses the known issue

Type Safety

bioframe operations return DataFrames that need proper type preservation
The as_dataframe() pattern correctly maintains GenomicArray subclass types

Error Handling

Empty input handling looks correct across all new operations
Good defensive programming in cut.py for empty tables

📊 Testing

The test additions in test_genome.py are comprehensive:

test_cut(): Tests boundary splitting with multiple overlapping regions
test_squash(): Tests adjacency combination with proper field aggregation
Integration with existing test suite maintained

🎯 Overall Assessment

This is a high-quality refactoring that:

✅ Reduces code complexity (net -264 lines)
✅ Improves performance characteristics
✅ Maintains full backward compatibility
✅ Adds useful new functionality (cut, squash)
✅ Follows project conventions and standards

The bioframe dependency is well-justified given CNVkit's existing numpy/pandas stack. The implementation properly leverages bioframe's strengths while preserving CNVkit's API design patterns.

📝 Recommendation

APPROVED - This PR successfully modernizes CNVkit's interval arithmetic without breaking changes. The code quality is excellent and the architectural improvements are significant. Ready to merge.

Review Tasks:

Read project guidelines and understand architecture
Review dependency changes (bioframe integration)
Analyze code changes in core modules
Check API compatibility preservation
Review new functionality (cut, squash methods)
Assess test coverage
Provide comprehensive feedback

codecov · 2026-02-07T23:12:42Z

Codecov Report

❌ Patch coverage is 91.66667% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.67%. Comparing base (871600c) to head (32180ac).
⚠️ Report is 4 commits behind head on master.

Files with missing lines	Patch %	Lines
skgenome/cut.py	82.14%	3 Missing and 2 partials ⚠️
skgenome/merge.py	95.71%	2 Missing and 1 partial ⚠️
skgenome/gary.py	87.50%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #991      +/-   ##
==========================================
+ Coverage   61.39%   61.67%   +0.28%     
==========================================
  Files          70       71       +1     
  Lines        6812     6847      +35     
  Branches     1177     1192      +15     
==========================================
+ Hits         4182     4223      +41     
+ Misses       2269     2260       -9     
- Partials      361      364       +3

Flag	Coverage Δ
unittests	`61.67% <91.66%> (+0.28%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

etal mentioned this pull request Feb 7, 2026

Speed up GenomicArray.intersect for nested intervals #232

Closed

etal and others added 2 commits February 7, 2026 14:47

This was referenced Feb 7, 2026

Speed up GenomicArray.flatten method #229

Closed

Implement intersect.venn function #228

Closed

Repository owner deleted a comment from claude bot Feb 7, 2026

etal merged commit f1d8bc2 into master Feb 7, 2026
15 checks passed

etal deleted the bioframe branch February 7, 2026 23:18

This was referenced Feb 7, 2026

Implement GenomicArray.cut method #226

Closed

Implement GenomicArray.squash method #227

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Genomic interval arithmetic in skgenome using bioframe#991

Genomic interval arithmetic in skgenome using bioframe#991
etal merged 3 commits intomasterfrom
bioframe

etal commented Feb 7, 2026 •

edited

Loading

Uh oh!

etal commented Feb 7, 2026

Uh oh!

claude bot commented Feb 7, 2026 •

edited

Loading

Uh oh!

codecov bot commented Feb 7, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

etal commented Feb 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

etal commented Feb 7, 2026

Uh oh!

claude bot commented Feb 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review for Genomic Interval Arithmetic with Bioframe

✅ Strengths

🔍 Technical Review

⚠️ Minor Issues

📊 Testing

🎯 Overall Assessment

📝 Recommendation

Uh oh!

codecov bot commented Feb 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

etal commented Feb 7, 2026 •

edited

Loading

claude bot commented Feb 7, 2026 •

edited

Loading

codecov bot commented Feb 7, 2026 •

edited

Loading