
feat: Resolve coordinate system semantics for input and output positions#173

Merged
nvictus merged 3 commits into abdenlab:main from nvictus:feat-coordsys
Mar 21, 2026

Conversation

@nvictus
Member

@nvictus nvictus commented Mar 20, 2026

This PR adds the ability to honor genomic coordinate system semantics in the way input query ranges are interpreted, and the way output positions (namely, start positions) are returned. Resolves #114.

Coordinate systems

There are two "coordinate system" conventions in genomics:

  1. Zero-based, half-open (i.e. left-inclusive, right-exclusive) -- aka "zero-based start, one-based end". [1000, 2000)
  2. One-based, closed (i.e. left-inclusive, right-inclusive) -- aka "one-based start, one-based end". [1001, 2000]
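
To make the relationship between the two conventions concrete, here is a minimal sketch of the conversion. The function names are hypothetical and are not part of oxbow; only the start position shifts, while the end stays the same.

```python
def to_one_closed(start0: int, end0: int) -> tuple[int, int]:
    """Convert 0-based half-open [start, end) to 1-based closed [start, end]."""
    return start0 + 1, end0


def to_zero_half_open(start1: int, end1: int) -> tuple[int, int]:
    """Convert 1-based closed [start, end] to 0-based half-open [start, end)."""
    if start1 < 1:
        raise ValueError("1-based start must be >= 1")
    return start1 - 1, end1


# The two intervals from the list above describe the same 1000 bases.
print(to_one_closed(1000, 2000))      # (1001, 2000)
print(to_zero_half_open(1001, 2000))  # (1000, 2000)
```

Because the end coordinate is numerically identical in both systems, only start positions need adjusting, which is why the PR talks about start positions specifically.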

We introduce shorthands for these in Python: "01" and "11". Every scanner is now coordinate-system-aware in its output, defaulting to the system native to its format family.

Range notation

We further support 3 range string notations:

  1. Ambiguous UCSC notation, interpreted using the scanner's registered coordinate system. chr1:10,000-20,000
  2. Explicit "bracket notation" for 01-semantics: chr1:[10_000,20_000)
  3. Explicit "bracket notation" for 11-semantics: chr1:[10_001,20_000]

Thousands separators , and _ are supported (only the latter in bracket notation).
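
As an illustration of how the three notations resolve to the same interval, here is a rough Python sketch of a parser that normalizes everything to 0-based half-open. It is not the `oxbow::Region` implementation: `parse_region` and the regular expressions are assumptions made for this example, and contig names containing `:` are not handled.

```python
import re

# Bracket notation: only "_" is allowed as a thousands separator.
_BRACKET = re.compile(
    r"^(?P<name>[^:]+):\[(?P<start>[0-9_]+),(?P<end>[0-9_]+)(?P<close>[\])])$"
)
# UCSC notation: both "," and "_" are allowed as thousands separators.
_UCSC = re.compile(r"^(?P<name>[^:]+):(?P<start>[0-9,_]+)-(?P<end>[0-9,_]+)$")


def parse_region(text: str, coords: str = "11") -> tuple[str, int, int]:
    """Return (name, start, end) as a 0-based half-open interval."""
    m = _BRACKET.match(text)
    if m:
        # Self-describing: the closing bracket overrides `coords`.
        one_closed = m["close"] == "]"
    else:
        m = _UCSC.match(text)
        if m is None:
            raise ValueError(f"cannot parse region: {text!r}")
        # Ambiguous: fall back to the caller's coordinate system.
        one_closed = coords == "11"
    start = int(re.sub(r"[,_]", "", m["start"]))
    end = int(re.sub(r"[,_]", "", m["end"]))
    if one_closed:
        if start < 1:
            raise ValueError("1-based start must be >= 1")
        start -= 1
    return m["name"], start, end


print(parse_region("chr1:10,000-20,000", coords="01"))    # ('chr1', 10000, 20000)
print(parse_region("chr1:[10_001,20_000]", coords="01"))  # ('chr1', 10000, 20000)
```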

Example

import oxbow as ox

# default for SAM/BAM/CRAM is 11
df = ox.from_bam("example.bam").regions("chr3:10,001-20,000").pl()

# return 01 output and interpret query regions as 01
df = ox.from_bam("example.bam", coords="01").regions("chr3:10,000-20,000").pl()

# return 01 output, pass in 11 input
df = ox.from_bam("example.bam", coords="01").regions("chr3:[10001,20000]").pl()

Malformed files

This feature alone does not solve the problem of malformed files, which are most often text formats. BED is natively 01, but many BED-like files in the wild contain intervals with 1-based starts. Likewise, GTF is supposed to be 11, but many GTF files use 0-based starts. If you know that the start positions in a file don't match what the format natively expects, you have to choose a coordinate system interpretation and increment or decrement the start positions accordingly.
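
For example, the manual correction might look like the following sketch. It uses plain tuples rather than any oxbow output type, and `shift_starts` is a hypothetical helper; the records are made up for illustration.

```python
def shift_starts(records, offset):
    """Return (chrom, start, end) records with `offset` added to every start."""
    return [(chrom, start + offset, end) for chrom, start, end in records]


# A BED-like file that was written with 1-based starts, read as if it were 01.
bed_like = [("chr1", 1001, 2000), ("chr1", 3001, 3500)]

# Decrement the starts so the records are valid 0-based half-open intervals.
fixed = shift_starts(bed_like, -1)
print(fixed)  # [('chr1', 1000, 2000), ('chr1', 3000, 3500)]
```

A GTF file with 0-based starts would need the opposite correction, an offset of +1.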

Changes

  • Add a CoordSystem enum (OneClosed / ZeroHalfOpen) that controls how start positions are represented in output Arrow batches and how user-supplied query regions are interpreted.

  • Add an oxbow::Region type with coordinate-system-aware parsing, supporting ambiguous UCSC notation (interpreted using a CoordSystem) and an explicit bracket notation.

  • Each format defaults to its native coordinate convention (1-based for SAM/BAM/CRAM, VCF/BCF, GFF/GTF, FASTA; 0-based for BED, BigBed, BigWig) but callers can request either system explicitly via a coords parameter.

  • All scanner scan_query methods now accept oxbow::Region instead of noodles::core::Region, converting internally.

  • Python API exposes coords as literals ("01", "11") on all DataSource classes and from_* factory functions, as well as pyo3 scanner classes.

  • pyo3 read_* functions and R bindings updated to use format-native coordinate systems.
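
To illustrate how a native default and a requested system combine into a start adjustment, here is a small Python model. It only mirrors the idea described above: the real `CoordSystem` is a Rust enum, and `NATIVE`, its keys, and `start_offset` are hypothetical names chosen for this sketch.

```python
from enum import Enum


class CoordSystem(Enum):
    """Illustrative stand-in for the Rust CoordSystem enum."""
    ZERO_HALF_OPEN = "01"
    ONE_CLOSED = "11"


# Native defaults as listed in the bullets above (the keys are hypothetical labels).
NATIVE = {
    "bam": CoordSystem.ONE_CLOSED,
    "vcf": CoordSystem.ONE_CLOSED,
    "gtf": CoordSystem.ONE_CLOSED,
    "bed": CoordSystem.ZERO_HALF_OPEN,
    "bigwig": CoordSystem.ZERO_HALF_OPEN,
}


def start_offset(source: CoordSystem, target: CoordSystem) -> int:
    """Amount to add to a start stored in `source` to express it in `target`."""
    return int(target is CoordSystem.ONE_CLOSED) - int(source is CoordSystem.ONE_CLOSED)


# Requesting coords="01" on a BAM file shifts every start by -1.
print(start_offset(NATIVE["bam"], CoordSystem("01")))  # -1
# Requesting coords="11" on a BED file shifts every start by +1.
print(start_offset(NATIVE["bed"], CoordSystem("11")))  # 1
```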

nvictus added 2 commits March 20, 2026 12:56
Introduce a `CoordSystem` enum (`OneClosed` / `ZeroHalfOpen`) that
controls how start positions are represented in output Arrow batches.
Each format defaults to its native coordinate convention: 1-based
for SAM/BAM/CRAM, VCF/BCF, and GFF/GTF; 0-based for BED, BigBed,
and BigWig — but callers can request either system explicitly.

Rust changes:

* `CoordSystem` enum in `oxbow::lib` with `Display/FromStr` and `start_offset_from()` for computing the adjustment between systems.
* Every Model now carries a `coord_system` field (alignment, variant, gxf, bed, bbi base, bbi zoom, sequence).
* Offset applied at the FieldBuilder level (alignment, variant, gxf, bed) or BatchBuilder level (bbi base, bbi zoom) during `push()`.
* All scanner constructors accept an explicit `CoordSystem` parameter.

Also added a `Default` trait impl on the alignment and gxf Models.

Python changes:
* All pyo3 scanner classes accept a `coords` keyword argument ("01" or "11") and serialize it through `__getnewargs_ex__`.
* All DataSource classes and from_* factory functions expose coords: `Literal["01", "11"]` with format-appropriate defaults.
* BBI zoom scanners inherit `coord_system` from their base scanner.
* All read_* functions pass the format-native CoordSystem.

R changes:
* All read_*_impl functions pass the format-native CoordSystem.

Introduce `oxbow::Region`: a coordinate-system-aware genomic region
type that normalizes all coordinates to 0-based half-open internally.

Supports two parsing styles:

* Ambiguous UCSC notation (chr1:10,000-20,000) interpreted using a provided
CoordSystem. Accepts , and _ as thousands separators.
* Explicit bracket notation (chr1:[10_000,20_000) or chr1:[10_001,20_000])
that is self-describing and overrides any provided coordinate system.
Only _ is accepted as a thousands separator (since , delimits start and end).

`Region::to_noodles()` converts to a noodles `Region` for index-based
seeking. All `scan_query` methods now accept `oxbow::Region` instead
of `noodles::core::Region`, performing the conversion internally.

`CoordSystem` and `Region` are extracted into a new `oxbow::coords`
module and re-exported from the crate root.

py-oxbow scanner classes parse user region strings using the scanner's
`model().coord_system()` when using ambiguous notation. Standalone
`read_*` functions use the format-native default. r-oxbow follows the
same convention.
@nvictus nvictus requested a review from conradbzura March 20, 2026 20:29
@nvictus nvictus merged commit 56d894c into abdenlab:main Mar 21, 2026
8 checks passed
write!(f, "{}", self.name)?;
match (self.start, self.end) {
(0, None) => {}
(s, None) => write!(f, ":[{s},)")?,

Region { start: 5000, end: None } displays as chr1:[5000,), but try_parse_bracket can’t parse an empty end — it splits on , and fails "".parse::<u64>(). Because try_parse_bracket returns Some(Err(…)) instead of None, the UCSC fallback path is never reached, so FromStr roundtrips break for any open-ended region with an explicit start.

Either extend bracket parsing to accept [start,) as open-ended, or have Display emit UCSC notation for the unbounded case (e.g., chr1:5001- for a 0-based start of 5000, under the default OneClosed assumption).
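
A rough Python model of the first option, where bracket parsing accepts an empty end so that formatting and parsing round-trip. This is a sketch of the idea only, not the Rust code under review: `format_region` and `parse_region` are hypothetical, and the bounded form `[start,end)` is an assumption, since the Display snippet above shows only the open-ended arms.

```python
import re
from typing import Optional, Tuple

Region = Tuple[str, int, Optional[int]]

# The end group may be empty, which is what makes "[5000,)" parseable.
_BRACKET = re.compile(r"^(?P<name>[^:]+):\[(?P<start>[0-9]+),(?P<end>[0-9]*)\)$")


def format_region(name: str, start: int, end: Optional[int]) -> str:
    """Format a 0-based half-open region, mirroring the Display arms above."""
    if end is None:
        return name if start == 0 else f"{name}:[{start},)"
    return f"{name}:[{start},{end})"


def parse_region(text: str) -> Region:
    """Parse the output of format_region, including the open-ended form."""
    if ":" not in text:
        return text, 0, None
    m = _BRACKET.match(text)
    if m is None:
        raise ValueError(f"cannot parse region: {text!r}")
    end = int(m["end"]) if m["end"] else None
    return m["name"], int(m["start"]), end


print(parse_region(format_region("chr1", 5000, None)))  # ('chr1', 5000, None)
```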

let (start, end) = match coord_system {
CoordSystem::OneClosed => {
// 1-based closed → 0-based half-open: start -= 1, end unchanged
(start.map(|s| s.saturating_sub(1)), end)

saturating_sub(1) correctly avoids wrapping to u64::MAX, but it also silently accepts start = 0, which isn’t a valid 1-based position. In UCSC mode with OneClosed, chr1:0-100 quietly normalizes to start = 0 (0-based) — the user gets a plausible-looking result from an invalid input. The same applies to bracket notation [0,100] at line 236.

A guard like if start == 0 { return Err(invalid_input("1-based position must be >= 1")) } before the subtraction would catch the mistake early.
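
The same guard, modeled in Python for illustration. The function name is hypothetical; the point is that the check happens before the subtraction, so an invalid 1-based start of 0 is rejected instead of being clamped.

```python
from typing import Optional, Tuple


def one_closed_to_zero_half_open(
    start: Optional[int], end: Optional[int]
) -> Tuple[Optional[int], Optional[int]]:
    """Normalize a 1-based closed range to 0-based half-open, rejecting start == 0."""
    if start is not None:
        if start < 1:
            raise ValueError("1-based position must be >= 1")
        start -= 1
    # The end coordinate is numerically unchanged between the two systems.
    return start, end


print(one_closed_to_zero_half_open(1, 100))  # (0, 100)

try:
    one_closed_to_zero_half_open(0, 100)
except ValueError as err:
    print(err)  # 1-based position must be >= 1
```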

@conradbzura
Contributor

@nvictus sorry I'm a little late, but noted a couple of things.

Development

Successfully merging this pull request may close these issues.

Handle 0 vs 1-based input/output ranges in a sane manner

2 participants