Use bioframe.subtract, bioframe.cluster, and bioframe.overlap to replace custom Python implementations of genomic interval merging, subtraction, and intersection. This reduces maintenance surface (~155 lines net) and improves performance for WGS cases with many bins.

- subtract.py: replace _subtraction() generator with bioframe.subtract()
- merge.py: replace _nonoverlapping_groups + _squash_tuples with bioframe.cluster() for both merge() and flatten()
- gary.py: replace intersection() non-trim path with bioframe.overlap()
- intersect.py: delete unimplemented venn() stub
- Add bioframe>=0.7.2 as a core dependency

Public GenomicArray API (method names and signatures) is unchanged.
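To make the subtraction semantics concrete, here is a minimal plain-Python sketch. It is neither the removed `_subtraction()` generator nor bioframe's implementation; the `subtract` helper, the tuple layout, and the half-open coordinate convention are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed layout: (chromosome, start, end) with half-open coordinates.
Interval = Tuple[str, int, int]


def subtract(intervals: List[Interval], others: List[Interval]) -> List[Interval]:
    """Remove from each interval the portions covered by any interval in `others`."""
    result = []
    for chrom, start, end in intervals:
        # Collect overlapping intervals on the same chromosome, sorted by start.
        blockers = sorted(
            (o_start, o_end)
            for o_chrom, o_start, o_end in others
            if o_chrom == chrom and o_start < end and o_end > start
        )
        cursor = start
        for o_start, o_end in blockers:
            if o_start > cursor:
                # Keep the uncovered gap before this blocker.
                result.append((chrom, cursor, o_start))
            cursor = max(cursor, o_end)
        if cursor < end:
            # Keep whatever remains after the last blocker.
            result.append((chrom, cursor, end))
    return result


bins = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 50, 80)]
excluded = [("chr1", 150, 320), ("chr2", 0, 100)]
remaining = subtract(bins, excluded)
```

This naive version rescans `others` for every input interval, which is the kind of cost a sort-based library routine avoids on large bin sets.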
- cut(other): splits intervals at the boundary coordinates of another array. Each piece inherits all data from its source row.
- squash(by, combine): combines consecutive adjacent rows, optionally only when they share the same value in a given column (e.g. gene).

Both follow the established pattern of DataFrame-level functions (cut.py, merge.py) wrapped by GenomicArray methods in gary.py.
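The behavior of the two verbs can be sketched in plain Python as follows. This is an illustration, not the code in cut.py or merge.py: rows are plain dicts rather than DataFrame rows, "adjacent" is assumed to mean that one row ends exactly where the next begins, and the `combine` argument is omitted (the first row's data is simply kept).

```python
from typing import Dict, List, Optional

# Assumed row layout: "chromosome", "start", "end", plus arbitrary data columns.
Row = Dict[str, object]


def cut(rows: List[Row], other: List[Row]) -> List[Row]:
    """Split each row at the start/end coordinates of `other` that fall inside it."""
    pieces = []
    for row in rows:
        points = sorted({
            pos
            for o in other
            if o["chromosome"] == row["chromosome"]
            for pos in (o["start"], o["end"])
            if row["start"] < pos < row["end"]
        })
        edges = [row["start"], *points, row["end"]]
        for left, right in zip(edges, edges[1:]):
            # Each piece inherits all data from its source row.
            pieces.append({**row, "start": left, "end": right})
    return pieces


def squash(rows: List[Row], by: Optional[str] = None) -> List[Row]:
    """Combine consecutive adjacent rows, optionally only when `by` values match."""
    merged: List[Row] = []
    for row in rows:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["chromosome"] == row["chromosome"]
            and prev["end"] == row["start"]
            and (by is None or prev[by] == row[by])
        ):
            prev["end"] = row["end"]  # extend the previous row in place
        else:
            merged.append(dict(row))
    return merged


targets = [{"chromosome": "chr1", "start": 0, "end": 100, "gene": "A"}]
other = [{"chromosome": "chr1", "start": 40, "end": 60}]
pieces = cut(targets, other)
rejoined = squash(pieces, by="gene")
```

In this sketch, squashing the pieces by gene undoes the cut, since every piece carries the same gene value as its source row.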
bioframe <=0.8.0 uses df.groupby(['col']).groups with a single-element list, which pandas 3.0 deprecates. Our filterwarnings = ["error"] turns this into a test failure on Python 3.12+ CI (pandas 3.0). Add a message-based filter that works across all pandas versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
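A message-based filter matches on the warning text rather than on a warning class. The sketch below shows the mechanism with the standard library `warnings` module only; the message string is hypothetical, since the actual pandas deprecation text is not quoted here, and `run_with_filters` is an illustrative helper rather than anything in the test suite.

```python
import warnings

# Hypothetical message text, used only to demonstrate the mechanism.
HYPOTHETICAL_MESSAGE = "single-element list grouping is deprecated"


def run_with_filters() -> dict:
    outcomes = {}
    with warnings.catch_warnings():
        # Analogous to filterwarnings = ["error"]: every warning becomes an exception.
        warnings.simplefilter("error")
        # Filters added later are checked first, so this exemption takes precedence.
        # `message` is a regex matched against the start of the warning text.
        warnings.filterwarnings("ignore", message="single-element list grouping")

        warnings.warn(HYPOTHETICAL_MESSAGE, FutureWarning)  # silently ignored
        outcomes["matching"] = "ignored"

        try:
            warnings.warn("some other deprecation", FutureWarning)
            outcomes["other"] = "ignored"
        except FutureWarning:
            outcomes["other"] = "raised"
    return outcomes


outcomes = run_with_filters()
```

Because the exemption leaves the category at its default, it applies whichever warning class is used to emit the message, while unrelated warnings still fail the run.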
@claude Ready for code review.
Claude finished @etal's task

Code Review for Genomic Interval Arithmetic with Bioframe

I've completed a thorough review of this significant architectural change. Overall, this is a well-executed refactoring that modernizes CNVkit's interval arithmetic while maintaining API compatibility.

✅ Strengths

Excellent Dependency Choice
API Preservation
Code Quality
Performance Considerations
🔍 Technical Review

New Functionality
Bioframe Integration
Code Organization
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## master #991 +/- ##
==========================================
+ Coverage 61.39% 61.67% +0.28%
==========================================
Files 70 71 +1
Lines 6812 6847 +35
Branches 1177 1192 +15
==========================================
+ Hits 4182 4223 +41
+ Misses 2269 2260 -9
- Partials 361 364 +3
Addresses: #982, #226, #227, #229, #232.
Public GenomicArray API (method names and signatures) is unchanged.
Verbs of interest, based on bedtools and other prior art:
Skipped here for lack of need:
The framework chosen for interval operations is bioframe. It is based on pandas, uses no compiled extensions, and relies on numpy sorting to achieve subquadratic computational complexity on the core operations. That is better than my hand-rolled intersect() and merge() fallbacks managed when intervals overlap, and it should be good enough for WGS scale in CNVkit.
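As a rough illustration of why sorting gives subquadratic behavior, the sketch below merges overlapping intervals with one sort followed by a single linear sweep, for O(n log n) overall. It is plain Python and not bioframe's code; the `merge_sorted` name and the choice to merge intervals that merely touch are assumptions of this example.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end)


def merge_sorted(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals per chromosome via sort + sweep."""
    merged: List[list] = []
    # Sorting by (chromosome, start, end) means each interval only needs to be
    # compared against the most recently emitted one.
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)  # extend the current run
        else:
            merged.append([chrom, start, end])  # start a new run
    return [(c, s, e) for c, s, e in merged]


regions = [
    ("chr1", 5, 10),
    ("chr1", 1, 6),
    ("chr2", 0, 3),
    ("chr1", 10, 12),
    ("chr1", 20, 30),
]
merged_regions = merge_sorted(regions)
```

A pairwise comparison of all intervals would instead grow quadratically with the number of bins, which is what matters at WGS scale.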
Considered and not selected: polars-bio, scikit-bio, intervaltree, pyranges_1.x. Bioframe wins over these options for keeping CNVkit's codebase compact, distribution lightweight, and performance reasonable (i.e. genome arithmetic computation is never the bottleneck vs. coverage calculation, segmentation, and I/O).