refactor: Replace Option<Vec<String>> with Select<String> for field selection #172
nvictus merged 16 commits into abdenlab:main on Mar 18, 2026
…election

Introduces a consistent ternary Select<String> enum across all format modules (alignment, GXF, sequence, variant, BED, BBI base, BBI zoom) to replace the ambiguous Option<Vec<String>> pattern.

- Select::All → include all fields (derive from defaults or header)
- Select::Omit → omit the column entirely
- Select::Some → include only the named fields

PyO3 bindings use resolve_fields() to map Python conventions:

- None → Select::Omit
- "*" → Select::All
- list → Select::Some

Python DataSource APIs default all fields params to "*" (previously None, which now means omit).
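The three-way convention described in this commit can be sketched in pure Python. This is only an illustration of the mapping, not the PyO3 implementation: the `Select` dataclass, the `ALL`/`OMIT` constants, and this particular `resolve_fields` signature are stand-ins written for the example, and rejecting a bare string other than `"*"` is an assumption.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple, Union


@dataclass(frozen=True)
class Select:
    """Illustrative stand-in for the Rust Select<String> enum (not the real type)."""

    kind: str                    # one of "all", "omit", "some"
    names: Tuple[str, ...] = ()  # populated only when kind == "some"


ALL = Select("all")
OMIT = Select("omit")


def resolve_fields(fields: Union[None, str, Sequence[str]]) -> Select:
    """Map the Python-side convention onto the three Select cases."""
    if fields is None:
        return OMIT  # None -> omit the column entirely
    if isinstance(fields, str):
        if fields == "*":
            return ALL  # "*" -> include all fields
        # Assumption: a bare string other than "*" is rejected.
        raise ValueError(f"expected '*' or a list of names, got {fields!r}")
    # Any list, including an empty one, selects exactly the named fields.
    return Select("some", tuple(fields))
```

Keeping the empty list distinct from `None` is the point of the ternary type: with `Option<Vec<String>>` there is no third value left over to mean "all".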
Replace the `unnest_samples` parameter (true = top-level, false = nested) with `samples_nested` (false = top-level, true = nested) across Rust, PyO3, and Python. The new name is a positive predicate that reads more naturally at call sites (e.g. `samples_nested=True`). Default behavior is unchanged: sample columns are top-level by default. Files changed: variant model, batch builder, vcf/bcf scanners, PyO3 bindings, and Python DataSource classes. Also re-ordered the parameter lists on the Rust side so that genotype_by follows genotype_fields and samples_nested follows samples.
Add `resolve_r_fields` helper to bridge R's `Option<Vec<String>>` to `Select<String>`: `NULL` → `Omit`, `"*"` → `All`, vector → `Some`. Update all scanner call sites in lib.rs accordingly.
- None (Omit) → column absent; [] (Some([])) → column present as empty struct
- Remove empty-list-collapses-to-None from from_header; propagate Select faithfully
- Add has_info / has_genotype flags to BatchBuilder to distinguish None from Some([])
- Use StructArray::new_empty_fields(row_count) for empty-field struct columns
- Add row_count tracking to SampleStructBuilder and SeriesStructBuilder to avoid StructArray::from([]) panic when no builders are registered
- Remove unused info_defs field from BatchBuilder
- Add 14 PyScanner tests covering all null/empty Select edge cases for VCF and BCF
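The difference between an absent column and an empty struct column can be shown with a toy schema builder. This is a sketch under stated assumptions: plain dicts stand in for Arrow schemas, and `project_columns` is a hypothetical helper written for this example, not part of the codebase.

```python
from typing import Dict, Optional, Sequence


def project_columns(
    base_fields: Sequence[str],
    info_fields: Optional[Sequence[str]],
) -> Dict[str, object]:
    """Build a toy schema: column name -> "scalar" or a nested struct dict."""
    schema: Dict[str, object] = {name: "scalar" for name in base_fields}
    if info_fields is not None:
        # The column is present even when the list is empty:
        # [] yields a struct with no child fields, None yields no column.
        schema["info"] = {name: "scalar" for name in info_fields}
    return schema


absent = project_columns(["chrom", "pos"], None)   # no "info" key at all
empty = project_columns(["chrom", "pos"], [])      # "info" present, no children
```

A consumer can then tell "the caller asked for no INFO column" apart from "the caller asked for an INFO column with zero fields", which is the case the has_info / has_genotype flags are described as covering.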
Replace panic!/unwrap with proper OxbowError::not_found in by-field sample lookup (both VCF and BCF push paths)
arro3-core 0.4.6 depends on arrow 54, which does not handle empty struct arrays correctly.
…methods

- Remove automatic tag/attribute discovery from AlignmentFile and GxfFile __init__; tags and attributes are now omitted by default
- Add with_tags() to AlignmentFile and with_attributes() to GxfFile for explicit opt-in discovery or specification, mirroring Select semantics
- Change VcfFile/BcfFile default for samples from "*" (all) to None (omit); add with_samples() method that always produces a nested "samples" struct
…upport
Fix with_tags() to use **self._tag_discovery_kwargs(), so CramFile's override
correctly forwards `reference` and `reference_index` to the discovery scanner.
Also fix a stale docstring in from_bam() ("SAM file" → "BAM file") and
remove TestSamFile.test_malformed (SAM is a lenient text format that does
not error on malformed input).
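The with_tags() fix relies on a hook-method pattern, sketched below with toy classes. Everything here is hypothetical apart from the names `with_tags` and `_tag_discovery_kwargs`, which come from the commit message: the class names, the `_discover_tags` placeholder, the `"*"` default, and returning `self` are assumptions made for illustration.

```python
class AlignmentFileSketch:
    """Toy stand-in for an alignment data source (not the real class)."""

    def __init__(self, source):
        self.source = source
        self.tags = None  # tags are omitted unless with_tags() is called
        self.last_discovery_kwargs = None

    def _tag_discovery_kwargs(self):
        # Hook: subclasses override this to pass extra scanner arguments.
        return {}

    def _discover_tags(self, **kwargs):
        # Placeholder for a real discovery scan; records what it received.
        self.last_discovery_kwargs = kwargs
        return ["NM", "MD"]

    def with_tags(self, tags="*"):
        if tags == "*":
            # Calling the hook through self is what lets a subclass
            # override take effect.
            tags = self._discover_tags(**self._tag_discovery_kwargs())
        self.tags = list(tags)
        return self


class CramFileSketch(AlignmentFileSketch):
    """Toy CRAM source that needs a reference for discovery."""

    def __init__(self, source, reference=None, reference_index=None):
        super().__init__(source)
        self.reference = reference
        self.reference_index = reference_index

    def _tag_discovery_kwargs(self):
        return {
            "reference": self.reference,
            "reference_index": self.reference_index,
        }
```

If `with_tags()` built the discovery arguments itself instead of going through the hook, the subclass's `reference` and `reference_index` would never reach the scanner, which is the failure mode this commit describes.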
- Update quickstart examples to use with_tags(), with_attributes(), and with_samples() opt-in builder methods replacing the old auto-discovery defaults
- Add important admonitions noting the opt-in behavior change for tags, attributes, and sample genotype data as of v0.7
- Rewrite VCF/BCF samples section around with_samples(), including nested group_by="field" example
- Add Custom BED schemas section with BED and BigBed tuple-schema examples
- Update BigBed AutoSql section intro to clarify default vs autosql parsing
- Expose with_tags, with_attributes, with_samples, and model methods in API reference files
Introduces a consistent ternary `Select<String>` enum across all format modules (alignment, GXF, sequence, variant, BED, BBI base, BBI zoom) to replace the ambiguous `Option<Vec<String>>` pattern.

oxbow crate

- `Select::All` → include all fields (derive from defaults or header)
- `Select::Some` → include only the named fields
- `Select::Omit` → omit the column entirely (for composite columns)

pyo3

PyO3 bindings use `resolve_fields()` to map Python conventions:

- `"*"` → `Select::All`
- `list` → `Select::Some`
- `None` → `Select::Omit`

Breaking API changes

PyO3 scanners now require explicit opt-in for `*` semantics, since the default values of the input parameters are still `None`.

Python

Python DataSource APIs now default all fields params to `*` (previously `None`, which now means "omit"). This keeps default behavior mostly unchanged.
Breaking API changes

Tag/attribute discovery and variant sample inclusion are now opt-in.

- `tag_scan_rows` and `attribute_scan_rows` have been removed from data source constructors for alignment and annotation files, respectively. Tag/attribute discovery is now done through a `with_tags()` builder method (`with_attributes()` for annotation files).
- VCF/BCF data sources now default to `samples=None`, meaning "omit". A new `with_samples()` builder method can be used to project samples (all by default).

R

R functions now default all fields params to `*` (previously `NULL`, which now means "omit"), except vcf/bcf `samples`. This keeps default behavior mostly unchanged.

Breaking API changes

Variant sample inclusion is now opt-in (default `samples=NULL`).