Skip to content

FASTQ read name/description when computing hash #5

@guillermo-carrasco

Description

@guillermo-carrasco

Hi,

I have a question regarding how you use the FASTQ description field to calculate the hash. The question arises because I've been doing some tests and I only manage to get the same md5 using the -R option in both the BAM file and the original FASTQ files.

More details following:

I've trimmed a FASTQ sample for testing purposes, the reads look like this:

~> zcat ../data/NA12878_trimmed_1.fastq.gz | head -n 4
@ERR194147.1 HSQ1004:134:C0D8DACXX:1:1104:3874:86238/1
GGTTCCTACTTCAGGGTCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACG
+
CC@FFFFFHHHHHJJJFHIIJJJJJJIHJIIJJJJJJJJIIGIJJIJJJIJJJIJIJJJJJJJJJJIJHHHHFFFDEEEEEEEEDDDCDDEEDDDDDDDDD

When, after the analysis, I run bamhash_checksum_fastq in the FASTQ files and bamhash_checksum_bam in the resulting BAM file I get different md5's:

~> bamhash_checksum_fastq ../data/NA12878_trimmed_*
a05de49644a0fb5d        10000

~> bamhash_checksum_bam final/NA12878_trimmed/NA12878_trimmed-ready.bam
d4d5ece0f619d83d        20000

If I convert the BAM file back to FASTQ I realised that the FASTQ read description disappears, i.e:

~> samtools fastq final/NA12878_trimmed/NA12878_trimmed-ready.bam | head -n 4
@ERR194147.6389/2
CATCGGATTTTTGTTTTTTTTGTTTTGGGTGGGGGGGGTTGGTGGGGTTGTGTGTGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGTGGGGGGGGTGGTTGG
+
11++40+2)2,+)2:3AA8)))00)))((('''''&&&))&&(((&&&&&&(((++(&05;BB7@>B@BDBBDDB>BDD@@3>9<&5-&&&&&&)&&)+(&

Is that description after the readname used to calculate the hash? I'm pretty confident that this is the problem, since if I run BamHash with -R it does return the expected result:

~> bamhash_checksum_fastq -R ../data/NA12878_trimmed_1.fastq.gz ../data/NA12878_trimmed_2.fastq.gz
f4524c00c70e9b83        10000

~> bamhash_checksum_bam -R final/NA12878_trimmed/NA12878_trimmed-ready.bam
f4524c00c70e9b83        20000

Thanks for your help!

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions