feat: Resolve coordinate system semantics for input and output positions by nvictus · Pull Request #173 · abdenlab/oxbow

nvictus · 2026-03-20T20:12:32Z

This PR adds the ability to honor genomic coordinate system semantics in the way input query ranges are interpreted, and the way output positions (namely, start positions) are returned. Resolves #114.

Coordinate systems

There are two "coordinate system" conventions in genomics:

Zero-based, half-open (i.e. left-inclusive, right-exclusive) -- aka "zero-based start, one-based end". [1000, 2000)
One-based, closed (i.e. left-inclusive, right-inclusive) -- aka "one-based start, one-based end". [1001, 2000]

We introduce shorthands for these in Python: "01" and "11". Every scanner is now coordinate-system aware in its output, defaulting to the system that is "native" to its format family.
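To make the two conventions concrete, here is a minimal sketch in plain Python. It is not oxbow's implementation, and the helper names `to_one_closed` and `to_zero_half_open` are hypothetical. It only illustrates that the same interval differs by one in its start and not at all in its end.

```python
def to_one_closed(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open interval [start, end) to 1-based closed."""
    if start < 0 or end < start:
        raise ValueError("expected 0 <= start <= end")
    # Only the start moves; the end is numerically identical in both systems.
    return start + 1, end


def to_zero_half_open(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based closed interval [start, end] to 0-based half-open."""
    if start < 1:
        raise ValueError("1-based start must be >= 1")
    return start - 1, end


# [1000, 2000) and [1001, 2000] describe the same 1000 bases.
print(to_one_closed(1000, 2000))      # (1001, 2000)
print(to_zero_half_open(1001, 2000))  # (1000, 2000)
```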

Range notation

We further support 3 range string notations:

Ambiguous UCSC notation, interpreted using the scanner's registered coordinate system. chr1:10,000-20,000
Explicit "bracket notation" for 01-semantics: chr1:[10_000,20_000)
Explicit "bracket notation" for 11-semantics: chr1:[10_001,20_000]

Thousands separators , and _ are supported (only the latter in bracket notation).
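The three notations can be sketched with a small standalone parser. This is an illustration in plain Python under stated assumptions, not the `oxbow::Region` parser: the function name `parse_region` and the regular expressions are hypothetical, reference names are assumed not to contain `:`, and whole-chromosome or open-ended regions are not handled. The result is normalized to 0-based half-open.

```python
import re

# Simplification: reference names are assumed not to contain ':'.
_BRACKET = re.compile(
    r"^(?P<name>[^:]+):\[(?P<start>[\d_]+),(?P<end>[\d_]+)(?P<close>[\)\]])$"
)
_UCSC = re.compile(r"^(?P<name>[^:]+):(?P<start>[\d,_]+)-(?P<end>[\d,_]+)$")


def parse_region(text: str, coords: str = "11") -> tuple[str, int, int]:
    """Return (name, start, end) normalized to 0-based half-open."""
    m = _BRACKET.match(text)
    if m:
        # Bracket notation is self-describing and overrides `coords`:
        # ')' closes a 01 interval, ']' closes a 11 interval.
        system = "01" if m["close"] == ")" else "11"
        start = int(m["start"].replace("_", ""))
        end = int(m["end"].replace("_", ""))
    else:
        m = _UCSC.match(text)
        if m is None:
            raise ValueError(f"unrecognized region: {text!r}")
        # UCSC notation is ambiguous, so the caller's system decides.
        system = coords
        start = int(m["start"].replace(",", "").replace("_", ""))
        end = int(m["end"].replace(",", "").replace("_", ""))

    if system == "11":
        if start < 1:
            raise ValueError("1-based position must be >= 1")
        start -= 1
    elif system != "01":
        raise ValueError(f"unknown coordinate system: {system!r}")
    return m["name"], start, end


print(parse_region("chr1:10,000-20,000", coords="01"))  # ('chr1', 10000, 20000)
print(parse_region("chr1:[10_001,20_000]"))             # ('chr1', 10000, 20000)
```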

Example

import oxbow as ox

# default for SAM/BAM/CRAM is 11
df = ox.from_bam("example.bam").regions("chr3:10,001-20,000").pl()

# return 01 output and interpret query regions as 01
df = ox.from_bam("example.bam", coords="01").regions("chr3:10,000-20,000").pl()

# return 01 output, pass in 11 input
df = ox.from_bam("example.bam", coords="01").regions("chr3:[10001,20000]").pl()

Malformatted files

This feature alone does not solve the problem of malformatted files, which is most often observed with text formats. BED is natively 01, but many BED-like files in the wild contain intervals with 1-based starts. Likewise, GTF is supposed to be 11, but many files use 0-based starts. If you know that the start positions in a file don't match what the format natively expects, you have to choose a coordinate system interpretation and increment or decrement the start positions accordingly to correct them.
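As a rough illustration of that manual correction, the sketch below shifts start positions in a list of plain `(chrom, start, end)` tuples. The helper `shift_starts` is hypothetical and independent of oxbow; with a dataframe, the same idea is a single column increment or decrement.

```python
def shift_starts(records, offset):
    """Return copies of (chrom, start, end) records with start shifted by offset."""
    fixed = []
    for chrom, start, end in records:
        new_start = start + offset
        if new_start < 0:
            raise ValueError(f"start {start} cannot be shifted by {offset}")
        fixed.append((chrom, new_start, end))
    return fixed


# A BED-like file whose starts were written 1-based:
# decrement them to recover true 0-based starts.
bed_like = [("chr1", 1001, 2000), ("chr1", 3001, 3500)]
print(shift_starts(bed_like, -1))
# [('chr1', 1000, 2000), ('chr1', 3000, 3500)]
```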

Changes

Add a CoordSystem enum (OneClosed / ZeroHalfOpen) that controls how start positions are represented in output Arrow batches and how user-supplied query regions are interpreted.
Add an oxbow::Region type with coordinate-system-aware parsing, supporting ambiguous UCSC notation (interpreted using a CoordSystem) and an explicit bracket notation.
Each format defaults to its native coordinate convention (1-based for SAM/BAM/CRAM, VCF/BCF, GFF/GTF, FASTA; 0-based for BED, BigBed, BigWig) but callers can request either system explicitly via a coords parameter.
All scanner scan_query methods now accept oxbow::Region instead of noodles::core::Region, converting internally.
Python API exposes coords as literals ("01", "11") on all DataSource classes and from_* factory functions, as well as pyo3 scanner classes.
pyo3 read_* functions and R bindings updated to use format-native coordinate systems.

Introduce a `CoordSystem` enum (`OneClosed` / `ZeroHalfOpen`) that controls how start positions are represented in output Arrow batches. Each format defaults to its native coordinate convention (1-based for SAM/BAM/CRAM, VCF/BCF, and GFF/GTF; 0-based for BED, BigBed, and BigWig), but callers can request either system explicitly.

Rust changes:
* `CoordSystem` enum in `oxbow::lib` with `Display`/`FromStr` and `start_offset_from()` for computing the adjustment between systems.
* Every Model now carries a `coord_system` field (alignment, variant, gxf, bed, bbi base, bbi zoom, sequence).
* Offset applied at the FieldBuilder level (alignment, variant, gxf, bed) or BatchBuilder level (bbi base, bbi zoom) during `push()`.
* All scanner constructors accept an explicit `CoordSystem` parameter.

Also added `Default` trait impl on alignment and gxf Models.

Python changes:
* All pyo3 scanner classes accept a `coords` keyword argument ("01" or "11") and serialize it through `__getnewargs_ex__`.
* All DataSource classes and from_* factory functions expose coords: `Literal["01", "11"]` with format-appropriate defaults.
* BBI zoom scanners inherit `coord_system` from their base scanner.
* All read_* functions pass the format-native CoordSystem.

R changes:
* All read_*_impl functions pass the format-native CoordSystem.

Introduce `oxbow::Region`: a coordinate-system-aware genomic region type that normalizes all coordinates to 0-based half-open internally. Supports two parsing styles:
* Ambiguous UCSC notation (chr1:10,000-20,000) interpreted using a provided CoordSystem. Accepts , and _ as thousands separators.
* Explicit bracket notation (chr1:[10_000,20_000) or chr1:[10_001,20_000]) that is self-describing and overrides any provided coordinate system. Only _ is accepted as a thousands separator (since , delimits start and end).

`Region::to_noodles()` converts to a noodles `Region` for index-based seeking. All `scan_query` methods now accept `oxbow::Region` instead of `noodles::core::Region`, performing the conversion internally.

`CoordSystem` and `Region` are extracted into a new `oxbow::coords` module and re-exported from the crate root.

py-oxbow scanner classes parse user region strings using the scanner's `model().coord_system()` when using ambiguous notation. Standalone `read_*` functions use the format-native default. r-oxbow follows the same convention.

conradbzura · 2026-03-23T12:42:36Z

oxbow/src/coords.rs

+        write!(f, "{}", self.name)?;
+        match (self.start, self.end) {
+            (0, None) => {}
+            (s, None) => write!(f, ":[{s},)")?,


Region { start: 5000, end: None } displays as chr1:[5000,), but try_parse_bracket can’t parse an empty end — it splits on , and fails "".parse::<u64>(). Because try_parse_bracket returns Some(Err(…)) instead of None, the UCSC fallback path is never reached, so FromStr roundtrips break for any open-ended region with an explicit start.

Either extend bracket parsing to accept [start,) as open-ended, or have Display emit UCSC notation for the unbounded case (e.g., chr1:5001- for a 0-based start of 5000, under the default OneClosed assumption).
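The first option can be sketched as follows. This is a Python model of the round trip, not the Rust code under review: `format_region` and `parse_bracket` are hypothetical stand-ins for `Display` and `try_parse_bracket`, and the formatting of bounded regions as `[start,end)` is an assumption, since only the unbounded arms appear in the snippet above.

```python
def format_region(name, start, end=None):
    """Format a 0-based half-open region; an end of None means unbounded."""
    if start == 0 and end is None:
        return name
    if end is None:
        return f"{name}:[{start},)"
    return f"{name}:[{start},{end})"  # assumed form for bounded regions


def parse_bracket(text):
    """Parse name, name:[start,end) or name:[start,); an empty end is unbounded."""
    name, sep, rest = text.partition(":")
    if not sep:
        return name, 0, None
    if not (rest.startswith("[") and rest.endswith(")")):
        raise ValueError(f"not 01 bracket notation: {text!r}")
    start_text, comma, end_text = rest[1:-1].partition(",")
    if not comma:
        raise ValueError(f"missing ',' in {text!r}")
    start = int(start_text)
    # Accept an empty end instead of failing on int("").
    end = int(end_text) if end_text else None
    return name, start, end


# The open-ended region now survives a format/parse round trip.
text = format_region("chr1", 5000)
print(text)                 # chr1:[5000,)
print(parse_bracket(text))  # ('chr1', 5000, None)
```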

conradbzura · 2026-03-23T12:42:36Z

oxbow/src/coords.rs

+        let (start, end) = match coord_system {
+            CoordSystem::OneClosed => {
+                // 1-based closed → 0-based half-open: start -= 1, end unchanged
+                (start.map(|s| s.saturating_sub(1)), end)


saturating_sub(1) correctly avoids wrapping to u64::MAX, but it also silently accepts start = 0, which isn’t a valid 1-based position. In UCSC mode with OneClosed, chr1:0-100 quietly normalizes to start = 0 (0-based) — the user gets a plausible-looking result from an invalid input. The same applies to bracket notation [0,100] at line 236.

A guard like if start == 0 { return Err(invalid_input("1-based position must be >= 1")) } before the subtraction would catch the mistake early.
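The effect of such a guard can be sketched in Python. The function `normalize_start` is a hypothetical model of the normalization step, not the Rust implementation; it shows a 1-based start of 0 being rejected rather than silently clamped.

```python
def normalize_start(start: int, coords: str) -> int:
    """Convert a user-supplied start to 0-based, rejecting 0 in 1-based mode."""
    if start < 0:
        raise ValueError("position must be non-negative")
    if coords == "11":
        # Without this check, a saturating subtraction would map 0 to 0
        # and hide the invalid input.
        if start == 0:
            raise ValueError("1-based position must be >= 1")
        return start - 1
    if coords == "01":
        return start
    raise ValueError(f"unknown coordinate system: {coords!r}")


print(normalize_start(1, "11"))  # 0
print(normalize_start(0, "01"))  # 0
try:
    normalize_start(0, "11")
except ValueError as err:
    print(err)  # 1-based position must be >= 1
```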

conradbzura · 2026-03-23T12:48:47Z

@nvictus sorry I'm a little late, but noted a couple of things.

nvictus added 2 commits March 20, 2026 12:56

nvictus requested a review from conradbzura March 20, 2026 20:29

Update docstrings

a7ced21

nvictus merged commit 56d894c into abdenlab:main Mar 21, 2026
8 checks passed

conradbzura reviewed Mar 23, 2026

nvictus merged 3 commits into abdenlab:main from nvictus:feat-coordsys
