Releases: abdenlab/oxbow
v0.7.0
What's changed
Declarative data model types for all format families
Each format family now has a standalone "model" type that encapsulates all schema-defining parameters independently of any file handle or header. A model defines how a file format maps to Arrow and declares the initial projection/schema of a table derived from a file.
- `AlignmentModel` (SAM/BAM/CRAM) — fields + tag definitions
- `GxfModel` (GFF/GTF) — fields + attribute definitions
- `SequenceModel` (FASTA/FASTQ) — fields
- `VariantModel` (VCF/BCF) — fields, info fields, genotype fields, samples, layout
- `BedModel` (BED/BigBed/BigWig) — wraps `BedSchema` with field projection
- `BBIZoomModel` — zoom summary schema

Each model produces an Arrow schema without opening a file, supports column projection, is exposed via a `model()` accessor on its scanner, and round-trips through `Display`/`FromStr`.
Scanners gain `with_model(header, model)` constructors alongside the existing `new()` constructors.
Selection semantics
A `Select<T>` enum replaces `Option<Vec<T>>` for all types of field selection. All scanner constructors and model types now take a ternary `Select` value instead of `Some`/`None`, expressing field-selection intent unambiguously:
- `Select::All` — include all fields (from defaults or the file header)
- `Select::Some(vec)` — include only the named fields
- `Select::Omit` — omit the column group entirely

This affects the `new()` constructors of all scanners across all format families (alignment, GXF, variant, sequence, BED, BBI).
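The ternary idea can be sketched outside of Rust; here is a minimal Python analogue of resolving a selection against header-derived defaults (illustrative only, not the oxbow API):

```python
from dataclasses import dataclass

# Three selection states, mirroring the ternary Select concept described above.
@dataclass(frozen=True)
class All:
    pass

@dataclass(frozen=True)
class Some:
    names: tuple

@dataclass(frozen=True)
class Omit:
    pass

def resolve(selection, defaults):
    """Resolve a selection against the default field names (e.g. from a header)."""
    if isinstance(selection, All):
        return list(defaults)                              # everything
    if isinstance(selection, Some):
        return [n for n in defaults if n in selection.names]  # named subset
    return []                                              # Omit: drop the group

# resolve(All(), ["chrom", "start", "end"]) -> ["chrom", "start", "end"]
# resolve(Some(("start",)), ["chrom", "start", "end"]) -> ["start"]
# resolve(Omit(), ["chrom", "start", "end"]) -> []
```

With a dedicated `Omit` state, "no fields" and "all fields" can no longer be confused, which was the ambiguity in the old `Option`-based API.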
Customizable BED schemas for BED and BBI files
New constructor `BedSchema::from_defs()` builds fully custom BED schemas (field name and type definitions) from a `Vec<FieldDef>`, enabling programmatic schema construction for formats like narrowPeak without going through a string specifier. The autoSql-based type system (`FieldDef`) is now shared and harmonized across the BED and BBI models for extended BED fields, while standard BED fields are interpreted using format-native (BigBed) or spec-compliant (BED) types.
Nested samples table in VariantModel
`VariantModel` gains a `samples_nested` boolean parameter. When true, all sample genotype data is emitted as a single "samples" struct column rather than N top-level per-sample or per-field columns. This makes it straightforward to treat genotype data as an atomic projection unit (e.g. `project(["samples"])`). The default is false, preserving existing behavior. Resolves #167
Full Changelog: v0.6.0...v0.7.0
py-oxbow@v0.7.0
New features
New selection semantics (`None`, list, `"*"`) in #172
All `DataSource` constructors now accept the value `"*"` for all field declaration parameters (referring to all standard fields, all info/format fields in a header, all samples in a header, etc.) in addition to a list or `None` (which now means "omit entirely"). Previously, `None` served as the "all fields" sentinel, which was ambiguous. Parameter defaults have been updated to reflect the new semantics; effective defaults are unchanged except for those listed below.
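A minimal sketch of how these sentinels map onto the ternary selection (illustrative, not the py-oxbow implementation):

```python
# "*" -> all fields declared in the header, a list -> that explicit subset,
# None -> omit the column group entirely.
def normalize(value, header_fields):
    if value == "*":
        return list(header_fields)
    if value is None:
        return []
    return [f for f in value if f in header_fields]

header = ["DP", "AF", "MQ"]
# normalize("*", header) -> ["DP", "AF", "MQ"]
# normalize(["AF"], header) -> ["AF"]
# normalize(None, header) -> []
```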
Customizable BED schemas for BED and BBI files in #169
Support for fully custom BED schemas (field name and type definitions) from a tuple of `(str, dict[str, str])`, where the first item is a "bed3" through "bed12" string specifier for the initial standard fields and the second item is a dictionary mapping field names to type names for the remaining fields, parsed with an autoSql-inspired type system plus additional Rust numeric type aliases. This enables programmatic schema construction for formats like narrowPeak. The autoSql-based type system is now shared and harmonized across the BED and BBI models for extended BED fields, while standard BED fields are interpreted using format-native (BigBed) or spec-compliant (BED) types.
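As a concrete illustration of the `(specifier, extra_fields)` shape, here is what a narrowPeak (BED6+4) schema could look like. The field names and semantics follow the UCSC/ENCODE narrowPeak definition; the exact type-name spellings accepted by py-oxbow are an assumption here:

```python
# Hypothetical narrowPeak schema tuple: a bed6 specifier for the standard
# fields plus four extended fields (names per the ENCODE narrowPeak format).
narrow_peak = (
    "bed6",  # chrom, chromStart, chromEnd, name, score, strand
    {
        "signalValue": "float",  # overall enrichment for the region
        "pValue": "float",       # -log10 p-value (-1 if unassigned)
        "qValue": "float",       # -log10 q-value (-1 if unassigned)
        "peak": "int",           # point-source offset from chromStart (-1 if none)
    },
)

specifier, extra = narrow_peak
n_fields = 6 + len(extra)  # BED6+4 -> 10 fields total
```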
Nested samples table in VCF/BCF DataSources in #170
`VcfFile`/`from_vcf` and `BcfFile`/`from_bcf` gain a `samples_nested` boolean parameter. When true, all sample genotype data is emitted as a single "samples" struct column rather than N top-level per-sample or per-field columns. This makes it straightforward to treat genotype data as an atomic projection unit. The default is false, preserving existing behavior. Resolves #167
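The difference between the flat and nested layouts can be sketched in plain Python (illustrative shapes, not the actual Arrow layout):

```python
# Fold flat "<sample>.<field>" entries into one nested "samples" value,
# mimicking what a single struct column gives you: the whole genotype block
# becomes one projectable unit.
def nest_samples(flat_row, samples, fields):
    return {
        s: {f: flat_row[f"{s}.{f}"] for f in fields}
        for s in samples
    }

flat = {"NA12878.GT": "0/1", "NA12878.DP": 35, "NA12891.GT": "0/0", "NA12891.DP": 28}
nested = nest_samples(flat, ["NA12878", "NA12891"], ["GT", "DP"])
# nested == {"NA12878": {"GT": "0/1", "DP": 35}, "NA12891": {"GT": "0/0", "DP": 28}}
```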
API changes
Tag and attribute discovery is no longer automatic (breaking)
Previously, alignment and annotation file constructors would scan an initial number of records to discover tag/attribute definitions and include them in the schema by default. This auto-discovery has been removed; tag and attribute definitions and discovery are now opt-in.
- `tag_scan_rows` parameter removed from `SamFile`/`from_sam`, `BamFile`/`from_bam`, `CramFile`/`from_cram`.
- `attribute_scan_rows` parameter removed from `GtfFile`/`from_gtf`, `GffFile`/`from_gff`.
- `tag_defs` and `attribute_defs` now default to `None`, which omits the "tags" / "attributes" column entirely.
- Use the new `with_tags()` and `with_attributes()` builder methods (below) to opt in. (Recommended)
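What opt-in discovery does, conceptually: scan a bounded prefix of records and take the union of observed (tag, type) pairs in first-seen order. A sketch (illustrative record shapes, not the oxbow internals):

```python
# Each record is a list of (tag, type_code) pairs; discovery scans the first
# scan_rows records and collects distinct tags, keeping the first type seen.
def discover_tags(records, scan_rows=1024):
    seen = {}
    for rec in records[: None if scan_rows == -1 else scan_rows]:
        for tag, type_code in rec:
            seen.setdefault(tag, type_code)
    return list(seen.items())

records = [
    [("NM", "i"), ("MD", "Z")],
    [("NM", "i"), ("AS", "i")],
]
# discover_tags(records) -> [("NM", "i"), ("MD", "Z"), ("AS", "i")]
```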
Sample genotype data is no longer projected by default (breaking)
- `from_vcf` and `from_bcf` previously defaulted to projecting all samples defined in the header, including all sample genotype columns. The default is now `samples=None`, omitting genotype data entirely.
- Use the new `with_samples()` builder method (below) to opt in. (Recommended)
New builder methods for tags, attributes and samples
with_tags() — opt-in tag discovery for alignment files
```python
df = ox.from_bam("sample.bam").with_tags().pl()
```

Call `with_tags()` on any `SamFile`, `BamFile`, or `CramFile` to discover tag definitions by scanning an initial number of records. Pass explicit definitions to skip discovery:

```python
ds = ox.from_bam("sample.bam").with_tags([("NM", "i"), ("MD", "Z")])
```

The `scan_rows` keyword argument controls how many records are scanned (default: 1024; pass -1 to scan the whole file).
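For reference, the single-letter codes used in definitions like `("NM", "i")` are the optional-field value types from the SAM specification:

```python
# SAM optional-field value types (per the SAM format specification).
SAM_TAG_TYPES = {
    "A": "printable character",
    "i": "32-bit signed integer",
    "f": "single-precision float",
    "Z": "printable string",
    "H": "byte array (hex string)",
    "B": "numeric array",
}
```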
with_attributes() — opt-in attribute discovery for annotation files
```python
df = ox.from_gff("sample.gff").with_attributes().pl()
```

Same pattern as `with_tags()`, for `GtfFile` and `GffFile`. The `scan_rows` keyword argument is also supported.
with_samples() — nested sample genotype data for variant files
Calling `with_samples()` on a `VcfFile` or `BcfFile` includes all sample genotype data nested under a single "samples" struct column. Accepts optional `samples`, `genotype_fields`, and `group_by` arguments:
```python
df = ox.from_vcf("sample.vcf.gz").with_samples().pl()
df.unnest("samples")
```

```python
ds = (
    ox.from_vcf("sample.vcf.gz")
    .with_samples(["NA12891", "NA12892"], genotype_fields=["GT", "DP"], group_by="field")
)
```

Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.6.0...py-oxbow@v0.7.0
py-oxbow@v0.6.0
New Features
Rust backtrace in Python exceptions: When RUST_BACKTRACE=1 is set, parsing and validation errors raised from the Rust core now include a Rust backtrace in the exception message, making it easier to diagnose issues. Errors also map to more appropriate Python exception types: KeyError for missing resources, IOError for I/O failures, ValueError for everything else. (#166)
External reference support in CRAM high-level API: from_cram() now accepts reference and reference_index keyword arguments for decoding CRAM files that store bases as diffs against an external reference. Also fixes a bug where tag discovery on reference-dependent CRAM files would fail. (#161)
Core version retrieval: `oxbow.__core_version__` exposes the version of the core oxbow Rust library. (#162)
Maintenance
Simplified DataSource internals: Schema-defining parameters (fields, tag_defs, attr_defs, etc.) are now passed to the Rust scanner at construction time rather than at each scan call. With this change, the Python DataSource classes are significantly simplified. Column projection is now handled entirely by the Rust scanner. User-facing API (from_*() constructors, .to_table(), .to_batches(), .to_reader(), .batches(), etc.) is unchanged. (#161)
Dependency Upgrades: PyO3 0.28, pyo3-arrow 0.17, Arrow 58, noodles 0.107. (#165)
Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.5.2...py-oxbow@v0.6.0
v0.6.0
New Features
Zero-column projection support — record batch builders now accept zero-column projections and preserve row counts when no columns are projected. (#160)
Multithreaded BGZF reader support — Widened trait bounds allow callers to pass bgzf::io::MultithreadedReader to scan methods. (#164)
Backtrace capture in errors — Six crate-level error variants (InvalidInput, InvalidData, NotFound, Io, Arrow, External) with backtrace captured at creation time (displayed when RUST_BACKTRACE=1). On the Python side, variants map to PyValueError, PyKeyError, or PyIOError. (#166)
API changes
Schema definition separate from projection: scanner schemas are now declared at construction time and scan methods project onto the declared schema. (#161)
- Schema-defining parameters (`fields`, `tag_defs`, `attr_defs`, `info_fields`, `genotype_fields`, `samples`, etc.) move from `scan()` arguments to `Scanner::new()`. Scanners validate and cache their Arrow schema at construction. Scan methods now accept only `columns` (projection), `batch_size`, and `limit`.
- Discovery methods like `tag_defs()` and `attribute_defs()` are now static/standalone rather than instance methods.
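The design change can be sketched conceptually (Python pseudocode of the construction-time-schema idea, not the actual Rust API):

```python
# The schema is declared and cached once at construction; scan() only
# projects onto the declared schema instead of re-deriving it per call.
class Scanner:
    def __init__(self, fields):
        self._schema = list(fields)   # validated and cached up front

    def schema(self):
        return self._schema           # served from cache, not recomputed

    def scan(self, columns=None, batch_size=1024, limit=None):
        if columns is None:
            return self._schema
        unknown = [c for c in columns if c not in self._schema]
        if unknown:
            raise ValueError(f"not in schema: {unknown}")
        # projection preserves declared schema order
        return [c for c in self._schema if c in columns]

s = Scanner(["chrom", "start", "end", "name"])
# s.scan(columns=["name", "start"]) -> ["start", "name"] (schema order)
```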
New traits: shared `RecordBatchBuilder` and `Push<T>` traits for record batch builders (#160)
`schema()` now returns the cached schema (not computed on the fly).
Widened BGZF trait bounds on scan methods (#164)
- `scan_query`, `scan_unmapped`, and `scan_virtual_ranges` accept a generic `R: bgzf::io::BufRead + bgzf::io::Seek` instead of the concrete `bgzf::io::Reader<R>`.
Custom crate error type: `OxbowError` replaces `io::Error` (#166)
- All public scanner methods and the `Push<T>` trait now return `crate::Result<T>` (an alias for `Result<T, OxbowError>`). `TryFrom`/`FromStr` impls use `type Error = OxbowError`.
Maintenance
- Module paths renamed: `format/` directories are now `scanner/`, with `batch_iterator` nested underneath. All public re-exports are preserved at the family level, but direct path imports will break. (#163)
- Unified BED schema model: shared `BedSchema`, `FieldDef`, and `FieldType` type system (37 autoSql variants) extracted into `bed/model/field_def.rs`, used by both BED and BBI formats. (#161)
- Export version variable for inspection in py-oxbow. (#162)
- Dependency Upgrades: noodles 0.107, Arrow 58, PyO3 0.28, pyo3-arrow 0.17, Rust toolchain 1.94. (#165)
Full Changelog: v0.5.2...v0.6.0
v0.5.2
Bug fixes
VCF: Recover INFO fields around malformed tokens.
Real-world VCFs (e.g., Ensembl variation files) can contain double semicolons (`;;`) in the INFO column. Previously, `info.get()` was called per field, which scanned from the beginning each time and aborted at the first tokenization error, silently nullifying all fields past the error. The parser now makes a single `info.iter()` pass that advances past malformed tokens, recovering fields on both sides of `;;`. This also improves performance by scanning the INFO string once instead of N times per record, at the cost of tokenizing and parsing all fields even when projecting a subset. (#156)
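The recovery strategy can be sketched as a single pass over `;`-separated tokens that skips the empty token produced by `;;` instead of aborting (illustrative Python, not the Rust implementation):

```python
# One pass over the INFO string; empty tokens (from ';;') are skipped so
# fields on both sides of the malformed separator survive.
def parse_info(info):
    out = {}
    for token in info.split(";"):
        if not token:                        # empty token, e.g. from ';;'
            continue
        key, _, value = token.partition("=")
        out[key] = value if value else True  # flag-style fields have no '='
    return out

# parse_info("DP=10;;AF=0.5") -> {"DP": "10", "AF": "0.5"}
# parse_info("DB;;DP=7") -> {"DB": True, "DP": "7"}
```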
New Contributors
Full Changelog: v0.5.1...v0.5.2
py-oxbow@v0.5.2
Bug fixes
- VCF: Recover INFO fields around malformed tokens (oxbow-rs). (#156)
- Streaming-compatible Polars lazy frames: `scan_pyarrow_dataset()` still does not support the Polars streaming engine, so it has been replaced with `register_io_source()`, which yields DataFrames batch-by-batch and integrates natively with streaming execution. This enables `sink_parquet()` and other streaming operations on oxbow `DataSource` objects. (#158, fixes #157)
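Why batch-by-batch yielding matters for streaming, in miniature: a generator source lets a sink consume one batch at a time instead of materializing everything first (illustrative sketch; `register_io_source()` wires oxbow record batches into Polars' streaming engine analogously):

```python
# A source that yields bounded batches instead of one giant table.
def batch_source(n_rows, batch_size):
    for start in range(0, n_rows, batch_size):
        yield list(range(start, min(start + batch_size, n_rows)))

# A sink that consumes batches incrementally; peak memory is one batch,
# not the whole dataset.
def streaming_sum(batches):
    total = 0
    for batch in batches:
        total += sum(batch)
    return total

# streaming_sum(batch_source(10, 4)) -> 45
```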
New Contributors
Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.5.1...py-oxbow@v0.5.2
v0.5.1
py-oxbow@v0.5.1
Bug fixes and improvements
- 🎉 All Rust panics during batch scanning now propagate to Python as exceptions and are no longer fatal #146
- Projections onto all types of BED schemas now work correctly #148
- Updated noodles, arrow, pyo3, and pyo3-arrow dependencies to latest versions #149
Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.5.0...py-oxbow@v0.5.1
v0.5.0
New Features
CRAM support: CramScanner for reading CRAM alignment files (#142)
- Support for external reference FASTA files for reference-compressed CRAM
- Genomic range queries using CRAI index files
- Tag definition discovery and projection
Byte Range and Virtual Position Range scan methods for SAM, BAM, GTF, GFF, VCF, and BED formats (#143)
- `scan_byte_ranges()`: read specific byte ranges from uncompressed files
- `scan_virtual_ranges()`: read specific virtual position ranges from BGZF-compressed files
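For context, a BGZF virtual position (the unit `scan_virtual_ranges()` consumes) packs two offsets into one 64-bit integer, per the SAM/BGZF specification:

```python
# A BGZF virtual offset is (compressed block offset << 16) | offset within
# the uncompressed block; the lower 16 bits hold the intra-block offset.
def make_voffset(coffset, uoffset):
    assert 0 <= uoffset < (1 << 16)
    return (coffset << 16) | uoffset

def split_voffset(voffset):
    return voffset >> 16, voffset & 0xFFFF

v = make_voffset(123456, 789)
# split_voffset(v) -> (123456, 789)
```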
Bug fixes and maintenance
- Fixed numerous docstring inconsistencies
- Fixed repository URL in Cargo.toml
- Updated Rust toolchain
Full Changelog: v0.4.1...v0.5.0
py-oxbow@v0.5.0
New Features
CRAM support: Low-level PyCramScanner and high-level data source from_cram() for reading CRAM alignment files (#142)
- The low-level scanner supports external reference FASTA files for reference-compressed CRAM; the high-level data source does not yet
- Genomic range queries using CRAI index files
- Tag definition discovery and projection
Low-level Byte Range and Virtual Position Range scan methods for SAM, BAM, GTF, GFF, VCF, and BED formats (#143)
- `scan_byte_ranges()`: read specific byte ranges from uncompressed files
- `scan_virtual_ranges()`: read specific virtual position ranges from BGZF-compressed files
Bug fixes and maintenance
- Made batch reader fragments pickleable (#124) and deterministically hashable (#122) for Dask.
- Fixed numerous docstring inconsistencies
- Updated Rust toolchain
Full Changelog: https://github.com/abdenlab/oxbow/compare/py-oxbow@v0.4.2...py-oxbow@v0.5.0