A **duplicate** read pair is a pair in which each mapped read has the same *signature* as the corresponding read of an earlier read pair in the input SAM file. The *signature* consists of the sequence name, the strand, and the reference offset where the 5' end of the read would fall if the read were fully aligned (not clipped) at its 5' end. This 5' aligned reference position is calculated from the POS field, the strand, and the CIGAR string. This definition of *signature* matches that used by *Picard MarkDuplicates*.
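The calculation of a single read's signature can be sketched as follows. This is a minimal illustration rather than the tool's actual code: the function names `unclipped_five_prime` and `signature` are hypothetical, POS is taken to be 1-based as in the SAM format, and both soft clips (`S`) and hard clips (`H`) are assumed to count as clipping.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference
CLIPS = set("SH")             # assumption: soft and hard clips both count


def unclipped_five_prime(pos, cigar, reverse):
    """Reference position of the read's 5' end as if it were not clipped."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is the leftmost base, so subtract
        # any leading clipped bases from POS.
        leading = 0
        for n, op in ops:
            if op not in CLIPS:
                break
            leading += n
        return pos - leading
    # Reverse strand: the 5' end is the rightmost base, so walk to the
    # end of the alignment and add any trailing clipped bases.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trailing = 0
    for n, op in reversed(ops):
        if op not in CLIPS:
            break
        trailing += n
    return pos + ref_len - 1 + trailing


def signature(rname, pos, cigar, flag):
    """(sequence name, strand, unclipped 5' position) for one mapped read."""
    reverse = bool(flag & 0x10)  # SAM flag bit 0x10: read is on the reverse strand
    strand = "-" if reverse else "+"
    return (rname, strand, unclipped_five_prime(pos, cigar, reverse))
```

Under these assumptions, a forward read at POS 100 with CIGAR `5S45M` and a forward read at POS 95 with CIGAR `50M` on the same sequence get the same signature, because clipping is undone before the positions are compared.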