Skip to content

Genomic interval arithmetic in skgenome using bioframe#991

Merged
etal merged 3 commits intomasterfrom
bioframe
Feb 7, 2026
Merged

Genomic interval arithmetic in skgenome using bioframe#991
etal merged 3 commits intomasterfrom
bioframe

Conversation

@etal
Copy link
Owner

@etal etal commented Feb 7, 2026

Addresses: #982, #226, #227, #229, #232.

Public GenomicArray API (method names and signatures) is unchanged.

Verbs of interest, based on bedtools and other prior art:

  • subtract: bioframe.subtract
  • merge: bioframe.cluster
  • flatten: bioframe.cluster
  • intersect: bioframe.overlap
  • cut
  • squash

Skipped here for lack of need:

The framework chosen for interval operations is bioframe. It's based on pandas, doesn't use compiled extensions, and uses numpy sorting to achieve subquadratic computational complexity on the core operations, which is better than my hand-woven intersect() and merge() fallbacks had in the case of overlapping intervals. This should be good enough for WGS scale in CNVkit.

Considered and not selected: polars-bio, scikit-bio, intervaltree, pyranges_1.x. Bioframe wins over these options for keeping CNVkit's codebase compact, distribution lightweight, and performance reasonable (i.e. genome arithmetic computation is never the bottleneck vs. coverage calculation, segmentation, and I/O).

Use bioframe.subtract, bioframe.cluster, and bioframe.overlap to replace
custom Python implementations of genomic interval merging, subtraction,
and intersection. This reduces maintenance surface (~155 lines net) and
improves performance for WGS cases with many bins.

- subtract.py: replace _subtraction() generator with bioframe.subtract()
- merge.py: replace _nonoverlapping_groups + _squash_tuples with
  bioframe.cluster() for both merge() and flatten()
- gary.py: replace intersection() non-trim path with bioframe.overlap()
- intersect.py: delete unimplemented venn() stub
- Add bioframe>=0.7.2 as a core dependency

Public GenomicArray API (method names and signatures) is unchanged.
etal and others added 2 commits February 7, 2026 14:47
- cut(other): splits intervals at the boundary coordinates of another
  array. Each piece inherits all data from its source row.
- squash(by, combine): combines consecutive adjacent rows, optionally
  only when they share the same value in a given column (e.g. gene).

Both follow the established pattern of DataFrame-level functions
(cut.py, merge.py) wrapped by GenomicArray methods in gary.py.
bioframe <=0.8.0 uses df.groupby(['col']).groups with a single-element
list, which pandas 3.0 deprecates. Our filterwarnings = ["error"] turns
this into a test failure on Python 3.12+ CI (pandas 3.0). Add a
message-based filter that works across all pandas versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Repository owner deleted a comment from claude bot Feb 7, 2026
@etal
Copy link
Owner Author

etal commented Feb 7, 2026

@claude Ready for code review.

@claude
Copy link

claude bot commented Feb 7, 2026

Claude finished @etal's task —— View job


Code Review for Genomic Interval Arithmetic with Bioframe

I've completed a thorough review of this significant architectural change. Overall, this is a well-executed refactoring that modernizes CNVkit's interval arithmetic while maintaining API compatibility.

Strengths

Excellent Dependency Choice

  • Bioframe is a solid choice for this use case - pandas-based, no compiled dependencies, and proven performance
  • Version constraints are properly set (≥0.7.2) in both core.txt and min.txt
  • Integration properly configured in conda environment (environment-dev.yml)

API Preservation

  • Public GenomicArray API is fully preserved - existing user code will continue to work
  • Method signatures unchanged for merge(), flatten(), subtract(), intersection()
  • Smart chromosome sorting maintained via chromsort.sorter_chrom

Code Quality

  • Clean separation of DataFrame-level operations vs GenomicArray methods
  • Proper bioframe column mapping via _BF_COLS = ("chromosome", "start", "end")
  • Good test coverage including edge cases for new cut() and squash() methods

Performance Considerations

  • Replaces O(n²) custom algorithms with bioframe's optimized numpy-based sorting
  • Should handle WGS-scale data much better than previous implementation

🔍 Technical Review

New Functionality

  • cut(): Clean implementation using itertools.pairwise for boundary splitting
  • squash(): Vectorized adjacency detection is efficient and well-implemented
  • Both methods have comprehensive test coverage

Bioframe Integration

  • merge() uses bioframe.cluster() with proper min_dist handling for bp parameter
  • subtract() cleanly delegates to bioframe with column mapping
  • intersection() uses bioframe.overlap() with mode-specific filtering logic

Code Organization

  • Good modular structure: cut.py, updated merge.py, simplified subtract.py
  • Removed 240+ lines of complex custom intersection logic from intersect.py
  • Maintained existing combine/combiner patterns for field aggregation

⚠️ Minor Issues

CI Configuration

  • Good addition of pandas warning filter for bioframe <=0.8.0 compatibility
  • The filterwarnings entry in pyproject.toml properly addresses the known issue

Type Safety

  • bioframe operations return DataFrames that need proper type preservation
  • The as_dataframe() pattern correctly maintains GenomicArray subclass types

Error Handling

  • Empty input handling looks correct across all new operations
  • Good defensive programming in cut.py for empty tables

📊 Testing

The test additions in test_genome.py are comprehensive:

  • test_cut(): Tests boundary splitting with multiple overlapping regions
  • test_squash(): Tests adjacency combination with proper field aggregation
  • Integration with existing test suite maintained

🎯 Overall Assessment

This is a high-quality refactoring that:

  • ✅ Reduces code complexity (net -264 lines)
  • ✅ Improves performance characteristics
  • ✅ Maintains full backward compatibility
  • ✅ Adds useful new functionality (cut, squash)
  • ✅ Follows project conventions and standards

The bioframe dependency is well-justified given CNVkit's existing numpy/pandas stack. The implementation properly leverages bioframe's strengths while preserving CNVkit's API design patterns.

📝 Recommendation

APPROVED - This PR successfully modernizes CNVkit's interval arithmetic without breaking changes. The code quality is excellent and the architectural improvements are significant. Ready to merge.


Review Tasks:

  • Read project guidelines and understand architecture
  • Review dependency changes (bioframe integration)
  • Analyze code changes in core modules
  • Check API compatibility preservation
  • Review new functionality (cut, squash methods)
  • Assess test coverage
  • Provide comprehensive feedback

@codecov
Copy link

codecov bot commented Feb 7, 2026

Codecov Report

❌ Patch coverage is 91.66667% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.67%. Comparing base (871600c) to head (32180ac).
⚠️ Report is 4 commits behind head on master.

Files with missing lines Patch % Lines
skgenome/cut.py 82.14% 3 Missing and 2 partials ⚠️
skgenome/merge.py 95.71% 2 Missing and 1 partial ⚠️
skgenome/gary.py 87.50% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #991      +/-   ##
==========================================
+ Coverage   61.39%   61.67%   +0.28%     
==========================================
  Files          70       71       +1     
  Lines        6812     6847      +35     
  Branches     1177     1192      +15     
==========================================
+ Hits         4182     4223      +41     
+ Misses       2269     2260       -9     
- Partials      361      364       +3     
Flag Coverage Δ
unittests 61.67% <91.66%> (+0.28%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@etal etal merged commit f1d8bc2 into master Feb 7, 2026
15 checks passed
@etal etal deleted the bioframe branch February 7, 2026 23:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant