
refactor: Replace Option<Vec<String>> with Select<String> for field selection #172

Merged
nvictus merged 16 commits into abdenlab:main from nvictus:feat-select-fields
Mar 18, 2026

nvictus commented Mar 16, 2026

Introduces a consistent ternary Select<String> enum across all format modules (alignment, GXF, sequence, variant, BED, BBI base, BBI zoom) to replace the ambiguous Option<Vec<String>> pattern.

oxbow crate

  • Select::All → include all fields (derive from defaults or header)
  • Select::Some → include only the named fields
  • Select::Omit → omit the column entirely (for composite columns)

pyo3

PyO3 bindings use resolve_fields() to map Python conventions:

  • "*" → Select::All
  • list → Select::Some
  • None → Select::Omit
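
The mapping above can be sketched in plain Python. This is only an illustration of the three-way convention, not the actual PyO3 code: the `All`, `Some`, and `Omit` classes and the `resolve_fields_sketch` function are hypothetical stand-ins for the Rust `Select<String>` enum and `resolve_fields()`. Rejecting a bare string other than "*" is an assumption made here for safety; the real binding may behave differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class All:
    """Stand-in for Select::All: include every field."""


@dataclass(frozen=True)
class Some:
    """Stand-in for Select::Some: include only the named fields."""
    names: Tuple[str, ...]


@dataclass(frozen=True)
class Omit:
    """Stand-in for Select::Omit: leave the column out entirely."""


def resolve_fields_sketch(value):
    """Map the Python-side convention onto the ternary selection.

    None -> Omit, "*" -> All, list of names -> Some.
    """
    if value is None:
        return Omit()
    if isinstance(value, str):
        if value == "*":
            return All()
        # Assumption: any other bare string is treated as a mistake here,
        # so that "chrom" is not silently split into characters.
        raise TypeError("expected None, '*', or a list of field names")
    return Some(tuple(value))
```

Note that an empty list resolves to `Some(())`, which stays distinct from `Omit()`; that distinction is what the old Option<Vec<String>> pattern could not express cleanly.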

Breaking API changes

PyO3 scanners now require explicit opt-in for * semantics, since the default values of the input parameters are still None.

Python

Python DataSource APIs now default all fields params to * (previously None, which now means "omit").

This keeps default behavior mostly unchanged.

Breaking API changes

Tag/attribute discovery and variant sample inclusion are now opt-in.

  • tag_scan_rows and attribute_scan_rows have been removed from data source constructors for alignment and annotation files, respectively. Tag/attribute discovery is now done through a with_tags() builder method.
  • In the variant data source constructors, the default is now samples=None, meaning "omit". A new with_samples() builder method can be used to project samples (all by default).

R

R functions now default all fields params to * (previously NULL, which now means "omit"), except vcf/bcf samples.

This keeps default behavior mostly unchanged.

Breaking API changes

Variant sample inclusion is now opt-in (default samples=NULL).

…election

Introduces a consistent ternary Select<String> enum across all format
modules (alignment, GXF, sequence, variant, BED, BBI base, BBI zoom)
to replace the ambiguous Option<Vec<String>> pattern.

Select::All  → include all fields (derive from defaults or header)
Select::Omit → omit the column entirely
Select::Some → include only the named fields

PyO3 bindings use resolve_fields() to map Python conventions:
  None → Select::Omit
  "*"  → Select::All
  list → Select::Some

Python DataSource APIs default all fields params to "*" (previously
None, which now means omit).
nvictus force-pushed the feat-select-fields branch from 0b1d946 to 7103a89 on March 16, 2026 16:42
nvictus added 15 commits March 16, 2026 13:27
Replace the `unnest_samples` parameter (true = top-level, false = nested)
with `samples_nested` (false = top-level, true = nested) across Rust,
PyO3, and Python. The new name is a positive predicate that reads more
naturally at call sites (e.g. `samples_nested=True`). Default behavior
is unchanged: sample columns are top-level by default.

Files changed: variant model, batch builder, vcf/bcf scanners,
PyO3 bindings, and Python DataSource classes.

Also re-ordered the parameters lists on the Rust side so that genotype_by
follows genotype_fields and samples_nested follows samples.
Add `resolve_r_fields` helper to bridge R's `Option<Vec<String>>` to
`Select<String>`: `NULL` → `Omit`, `"*"` → `All`, vector → `Some`.
Update all scanner call sites in lib.rs accordingly.
- None (Omit) → column absent; [] (Some([])) → column present as empty struct
- Remove empty-list-collapses-to-None from from_header; propagate Select faithfully
- Add has_info / has_genotype flags to BatchBuilder to distinguish None from Some([])
- Use StructArray::new_empty_fields(row_count) for empty-field struct columns
- Add row_count tracking to SampleStructBuilder and SeriesStructBuilder to
  avoid StructArray::from([]) panic when no builders are registered
- Remove unused info_defs field from BatchBuilder
- Add 14 PyScanner tests covering all null/empty Select edge cases for VCF and BCF
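
The difference between an omitted column and an empty struct column can be illustrated with a small, self-contained sketch. It does not use oxbow or Arrow; `sketch_schema` and its parameters are hypothetical names, and the schema is modeled as a plain list in which a struct column is a `(name, subfields)` tuple.

```python
def sketch_schema(fields, info, header_info):
    """Return a toy schema for a variant batch.

    fields      -- names of the fixed scalar columns
    info        -- None (omit), "*" (all), or a list of INFO keys
    header_info -- INFO keys declared in the (hypothetical) file header
    """
    schema = list(fields)
    if info is None:
        # Omit: the "info" column is absent from the schema.
        return schema
    if isinstance(info, str) and info == "*":
        subfields = list(header_info)
    else:
        # Some: may be empty, in which case the column is still
        # present but has no subfields (an empty struct).
        subfields = list(info)
    schema.append(("info", subfields))
    return schema


header = ["DP", "AF"]
print(sketch_schema(["chrom", "pos"], None, header))  # no "info" entry
print(sketch_schema(["chrom", "pos"], [], header))    # "info" with no subfields
print(sketch_schema(["chrom", "pos"], "*", header))   # "info" with all header keys
```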
Replace panic!/unwrap with proper OxbowError::not_found in by-field sample
lookup (both VCF and BCF push paths)
arro3-core 0.4.6 depends on arrow 54, which does not handle
empty struct arrays correctly.
…methods

- Remove automatic tag/attribute discovery from AlignmentFile and GxfFile
  __init__; tags and attributes are now omitted by default
- Add with_tags() to AlignmentFile and with_attributes() to GxfFile for
  explicit opt-in discovery or specification, mirroring Select semantics
- Change VcfFile/BcfFile default for samples from "*" (all) to None (omit);
  add with_samples() method that always produces a nested "samples" struct
…upport

Fix with_tags() to use **self._tag_discovery_kwargs(), so CramFile's override
correctly forwards `reference` and `reference_index` to the discovery scanner.

Also fix a stale docstring in from_bam() ("SAM file" → "BAM file") and
remove TestSamFile.test_malformed (SAM is a lenient text format that does
not error on malformed input).
- Update quickstart examples to use with_tags(), with_attributes(), and
  with_samples() opt-in builder methods replacing the old auto-discovery
  defaults
- Add important admonitions noting the opt-in behavior change for tags,
  attributes, and sample genotype data as of v0.7
- Rewrite VCF/BCF samples section around with_samples(), including nested
  group_by="field" example
- Add Custom BED schemas section with BED and BigBed tuple-schema examples
- Update BigBed AutoSql section intro to clarify default vs autosql parsing
- Expose with_tags, with_attributes, with_samples, and model methods in
  API reference files
nvictus merged commit 9e077ba into abdenlab:main on Mar 18, 2026
8 checks passed