Accept U as a base type and convert to T for BAM. #306

@jkbonfield

I see the BAM encoder disallows U.

We recently had a samtools/htslib issue where the input U characters in SAM were being converted to N due to the in-memory encoding (essentially BAM) treating any unrecognised character as unknown. Their input data came from ONT, which we know can produce U in FASTQ. In BAM the Dorado software converts those to T (so they've already chosen that as their solution), but I don't know where the SAM came from. Regardless, it exists and is genuine user data.

IUPAC does acknowledge U as a base type (the original text referred to V as not-T or not-U), but it evidently doesn't distinguish between T and U, as there are no separate ambiguity codes for A/T and A/U. I think it was probably an error made very early on in SAM/BAM and samtools not to add U to the lookup table, as its meaning is unambiguous, and the SAM spec even mentions DNA/RNA in the text, so it was clearly intended to work with both.
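To illustrate the lookup-table point, here is a minimal Python sketch of BAM-style nibble encoding. The 16-character code string is the one given in the SAM specification; the function names (`build_table`, `encode`, `decode`), the `accept_u` switch, and the lowercase handling are assumptions made for this example and are not taken from htslib or any other implementation.

```python
# 4-bit base codes as listed in the SAM specification's BAM section:
# the index of each character is its nibble value.
BAM_CODES = "=ACMGRSVTWYHKDBN"


def build_table(accept_u: bool) -> bytes:
    """Build a 256-entry ASCII -> nibble table (hypothetical helper)."""
    # Anything unrecognised falls back to N (code 15).
    table = bytearray([15] * 256)
    for code, base in enumerate(BAM_CODES):
        table[ord(base)] = code
        table[ord(base.lower())] = code  # assumption: accept lowercase too
    if accept_u:
        # The proposed change: treat U as a synonym for T.
        table[ord("U")] = table[ord("T")]
        table[ord("u")] = table[ord("T")]
    return bytes(table)


def encode(seq: str, table: bytes) -> bytes:
    """Pack two bases per byte, first base in the high nibble."""
    codes = [table[b] for b in seq.encode("ascii")]
    if len(codes) % 2:
        codes.append(0)  # pad the final low nibble
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2))


def decode(packed: bytes, length: int) -> str:
    """Unpack nibbles back to base characters."""
    bases = []
    for byte in packed:
        bases.append(BAM_CODES[byte >> 4])
        bases.append(BAM_CODES[byte & 0x0F])
    return "".join(bases[:length])


without_u = decode(encode("ACGU", build_table(accept_u=False)), 4)
with_u = decode(encode("ACGU", build_table(accept_u=True)), 4)
print(without_u, with_u)
```

With the table that lacks U, the round trip turns `ACGU` into `ACGN`; with U added as a synonym for T it becomes `ACGT`. Either way the original U cannot be recovered from the nibbles alone, which is why tracking the conversion elsewhere matters.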

I've already fixed this in htslib (not yet in a release), as users evidently need at least one efficient solution that converts Us to Ts in sequences and isn't a slow Perl or Python one-liner. :) However, I'd advocate for it being supported by all major implementations.

See samtools/hts-specs#800 and samtools/hts-specs#801 for more discussion, covering the SAM regexp description and how we should track whether this conversion was done.

PS. I've no idea what to do with CRAM! For us in htslib it's a moot point, as htslib converts to nibble encoding anyway, so we cannot present data containing U to the CRAM encoder, but technically CRAM could store it. It's not ideal though: with reference-based encoding every T vs U would be an edit, wasting a lot of space. My gut feeling is that CRAM should also just store U as T and use metadata somewhere to flag it.
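As a rough illustration of the wasted-space concern, the sketch below counts how many positions in a read differ from a DNA reference before and after converting U to T. The sequences and the `count_mismatches` helper are invented for this example; a real CRAM encoder records differences in a more involved way, so treat this only as a model of the idea.

```python
def count_mismatches(read: str, reference: str) -> int:
    """Count positions where the read differs from the aligned reference."""
    return sum(1 for r, ref in zip(read, reference) if r != ref)


# Hypothetical ungapped alignment of an RNA read against a DNA reference.
reference = "ACGTTTGCAT"
rna_read = "ACGUUUGCAU"

# Stored as-is, every U is a difference from the reference T.
edits_with_u = count_mismatches(rna_read, reference)

# Stored after U -> T conversion, the read matches the reference exactly.
converted = rna_read.replace("U", "T")
edits_with_t = count_mismatches(converted, reference)

print(edits_with_u, edits_with_t)
```

Here the unconverted read needs four edits and the converted read needs none, so storing U as T plus a single flag elsewhere would be far cheaper than recording each U as a substitution.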
