Skip to content

Conversation

@douglasgscofield
Copy link

@douglasgscofield douglasgscofield commented Jun 11, 2018

accuMUlate uses a homegrown solution to find RG tags attached to reads: using std::string::find() to search for the literal string "\000RGZ". This only works if the previous tag fortuitously ended with \000.

BAM tags containing, e.g., integer values are encoded using a type-sensitive fixed width. "XS:i:20" is encoded as "XSC\024", with the C indicating a single-byte unsigned int. In the problem case this was immediately followed by "RGZ..." encoding the tag for the read group. This is perfectly valid BAM tag encoding, however, the accuMUlate search for "\000RGZ" never found the RG tag for this read or similar reads.

The solution is to use BamTools' own API for finding tags and their values. It knows about these encodings, and searches for them robustly. A single call to BamTools::BamAlignment::GetTag() will return a tag's value and indicate an error. The interface to ReadDataVisitor::GetSampleIndex() is changed slightly and the function is simplified. The constants ReadDataVisitor::RG_TAG and ReadDataVisitor::ZERO_CHAR are removed as they are no longer required.

@dwinter
Copy link
Owner

dwinter commented Jun 11, 2018

Hi Douglas,

Many thanks for identifying this problem and proving a PR to address it. I'll make some time to check this on a few test datasets I have here, then merge this in and make a new release.

@douglasgscofield
Copy link
Author

Hi, I'm hoping this could get merged into the master at some point and be part of the new 0.2.2 that's been started?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants