Commit 3fe403f

Merge branch 'dev'
2 parents 76ad6ec + 7f07ecf commit 3fe403f

5 files changed

+191
-64
lines changed

Makefile

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
11
# Determine the samblaster build number
2-
BUILDNUM = 21
2+
BUILDNUM = 22
33
# INTERNAL = TRUE
44

55
OBJS = samblaster.o sbhash.o

README.md

Lines changed: 32 additions & 14 deletions
@@ -11,9 +11,9 @@ Click the preceeding link or download the file from this repository.
1111

1212
---
1313

14-
**Current version:** 0.1.21
14+
**Current version:** 0.1.22
1515

16-
Support for Linux and OSX.
16+
Support for Linux and OSX (Version 10.7 or higher).
1717

1818
##Summary
1919
*samblaster* is a fast and flexible program for marking duplicates in __read-id grouped<sup>1</sup>__ paired-end SAM files.
@@ -34,10 +34,10 @@ cp samblaster /usr/local/bin/.
3434
##Usage
3535
See the [SAM File Format Specification](http://samtools.sourceforge.net/SAMv1.pdf) for details about the SAM alignment format.
3636

37-
By default, *samblaster* reads SAM input from **stdin** and writes SAM to **stdout**. Input SAM file usually contain paired end data (see [Duplicate Identification](#DupIdentification) below), must contain a sequence header, and must be __read-id grouped<sup>1<sup>__.
37+
By default, *samblaster* reads SAM input from **stdin** and writes SAM to **stdout**. Input SAM files usually contain paired end data (see [Duplicate Identification](#DupIdentification) below), must contain a sequence header, and must be __read-id grouped<sup>1</sup>__.
3838
By default, the output SAM file will contain all the alignments in the same order as the input, with duplicates marked with SAM FLAG 0x400. The **--removeDups** option will instead remove duplicate alignments from the output file.
3939

40-
__<sup>1</sup>A read-id grouped__ SAM file is one in which all alignments for a read-id are grouped together in adjacent lines.
40+
__<sup>1</sup>A read-id grouped__ SAM file is one in which all alignments for a read-id (QNAME) are grouped together in adjacent lines.
4141
Aligners naturally produce such files.
4242
They can also be created by sorting a SAM file by read-id.
4343
But as shown below, sorting the input to *samblaster* by read-id is not required if the alignments are already grouped.
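
The grouping requirement in the footnote above can be checked before a file is piped into *samblaster*. The following is a minimal sketch, not part of *samblaster* itself; the function name `is_read_id_grouped` is hypothetical. It assumes tab-separated SAM text lines with QNAME in the first column, and that header lines start with `@`.

```python
def is_read_id_grouped(sam_lines):
    """Return True if all alignments sharing a QNAME sit on adjacent lines."""
    seen = set()      # every QNAME whose run has started so far
    current = None    # QNAME of the run currently being read
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        qname = line.split("\t", 1)[0]
        if qname != current:
            if qname in seen:
                return False  # this QNAME reappeared after its run ended
            seen.add(qname)
            current = qname
    return True
```

Because it only compares neighbouring QNAMEs, the check accepts unsorted aligner output as long as each read-id forms a single contiguous run, which matches the statement that sorting is not required.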
@@ -46,12 +46,17 @@ But as shown below, sorting the input to *samblaster* by read-id is not required
4646

4747
To take input alignments directly from _bwa mem_ and output to _samtools view_ to compress SAM to BAM:
4848
```
49-
bwa mem index samp.r1.fq samp.r2.fq | samblaster | samtools view -Sb - > samp.out.bam
49+
bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster | samtools view -Sb - > samp.out.bam
50+
```
51+
52+
When using the *bwa mem* **-M** option, also use the *samblaster* **-M** option:
53+
```
54+
bwa mem -M <idxbase> samp.r1.fq samp.r2.fq | samblaster -M | samtools view -Sb - > samp.out.bam
5055
```
5156

5257
To additionally output discordant read pairs and split read alignments:
5358
```
54-
bwa mem index samp.r1.fq samp.r2.fq | samblaster -e -d samp.disc.sam -s samp.split.sam | samtools view -Sb - > samp.out.bam
59+
bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster -e -d samp.disc.sam -s samp.split.sam | samtools view -Sb - > samp.out.bam
5560
```
5661

5762
To pull split reads and discordant read pairs from a pre-existing BAM file with duplicates already marked:
@@ -62,7 +67,7 @@ samtools view -h samp.bam | samblaster -a -e -d samp.disc.sam -s samp.split.sam
6267
---
6368
**OPTIONS:**
6469
Default values enclosed in square brackets []
65-
```
70+
<pre>
6671
Input/Output Options:
6772
-i --input FILE Input sam file [stdin].
6873
-o --output FILE Output sam file for all input alignments [stdout].
@@ -76,17 +81,30 @@ Other Options:
7681
-e --excludeDups Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.
7782
-r --removeDups Remove duplicates reads from all output files. (Implies --excludeDups).
7883
--addMateTags Add MC and MQ tags to all output paired-end SAM lines.
84+
--ignoreUnmated Suppress abort on unmated alignments. Use only when sure input is read-id grouped and alignments have been filtered.
85+
<b>--ignoreUnmated is not recommended for general use. It disables checks that detect incorrectly sorted input.</b>
86+
-M Compatibility mode (details below); both FLAG 0x100 and 0x800 denote supplemental (chimeric). Similar to <i>bwa mem</i> <b>-M</b> option.
7987
--maxSplitCount INT Maximum number of split alignments for a read to be included in splitter file. [2]
8088
--maxUnmappedBases INT Maximum number of un-aligned bases between two alignments to be included in splitter file. [50]
8189
--minIndelSize INT Minimum structural variant feature size for split alignments to be included in splitter file. [50]
8290
--minNonOverlap INT Minimum non-overlaping base pairs between two alignments for a read to be included in splitter file. [20]
8391
--minClipSize INT Minumum number of bases a mapped read must be clipped to be included in unmapped file. [20]
8492

85-
8693
-h --help Print samblaster help to stderr.
8794
-q --quiet Output fewer statistics.
8895
--version Print samblaster version number to stderr.
89-
```
96+
</pre>
97+
98+
---
99+
**ALIGNMENT TYPE DEFINITIONS:<a name="Definitions"></a>**
100+
Below, we will use the following definitions for alignment types.
101+
Starting with *samblaster* release 0.1.22, these definitions are affected by the use of the **-M** option.
102+
By default, *samblaster* will use the current definitions of alignment types as specified in the [SAM Specification](http://samtools.sourceforge.net/SAMv1.pdf).
103+
Namely, alignments marked with FLAG 0x100 are considered *secondary*, while those marked with FLAG 0x800 are considered *supplemental*.
104+
If the **-M** option is specified, alignments marked with either FLAG 0x100 or 0x800 are considered *supplemental*, and no alignments are considered *secondary*.
105+
A *primary* alignment is always one that is neither *secondary* nor *supplemental*.
106+
Only *primary* and *supplemental* alignments are used to find chimeric (split-read) mappings.
107+
The **-M** flag is used for backward compatibility with older SAM/BAM files in which "chimeric" alignments were marked with FLAG 0x100, and should also be used with output from more recent runs of *bwa mem* using its **-M** option.
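
The definitions above reduce to a small decision on the SAM FLAG field. The sketch below illustrates them; it is not *samblaster*'s actual source. The function name `classify_alignment` and the parameter `compat_m` (standing in for the **-M** option) are hypothetical. Treating an alignment with both bits set as *secondary* in default mode is an assumption, since the text does not cover that case.

```python
SECONDARY_BIT = 0x100      # SAM FLAG: secondary alignment
SUPPLEMENTARY_BIT = 0x800  # SAM FLAG: supplementary alignment

def classify_alignment(flag, compat_m=False):
    """Classify one alignment as 'primary', 'secondary' or 'supplemental'."""
    if compat_m:
        # With -M, either bit marks a supplemental (chimeric) alignment
        # and nothing is considered secondary.
        if flag & (SECONDARY_BIT | SUPPLEMENTARY_BIT):
            return "supplemental"
        return "primary"
    if flag & SECONDARY_BIT:
        return "secondary"  # assumption: 0x100 wins if both bits are set
    if flag & SUPPLEMENTARY_BIT:
        return "supplemental"
    return "primary"
```

For example, an alignment with FLAG 0x100 is *secondary* by default and is ignored when looking for split reads, but becomes *supplemental* once `compat_m=True`.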
90108

91109
---
92110
**DUPLICATE IDENTIFICATION:<a name="DupIdentification"></a>**
@@ -95,22 +113,22 @@ A **duplicate** read pair is defined as a pair that has the same *signature* for
95113
1. For pairs in which both reads are mapped, both signatures must match.
96114
2. For pairs in which only one side is mapped (an "orphan"), the signature of the mapped read must match a previously seen orphan. In an orphan pair, the unmapped read need not appear in the input file. In addition, mapped non-paired single read alignments will be treated the same as an orphan pair with a missing unmapped read.
97115
3. No doubly unmapped pair will be marked as a duplicate.
98-
4. Any *secondary* alignment (FLAG 0x100 or 0x800) associated with a duplicate primary alignment will also be marked as a duplicate.
116+
4. Any *secondary* or *supplemental* alignment associated with a duplicate *primary* alignment will also be marked as a duplicate.
99117

100118
---
101119
**DISCORDANT READ PAIR IDENTIFICATION:**
102120
A **discordant** read pair is one which meets all of the following criteria:
103121

104122
1. Both sides of the read pair are mapped (neither FLAG 0x4 nor 0x8 is set).
105123
2. The *properly paired* FLAG (0x2) is not set.
106-
3. Secondary alignments (FLAG 0x100 or 0x800) are never output as discordant, although a discordant read pair can have secondary alignments associated with them.
124+
3. *Secondary* or *supplemental* alignments are never output as discordant, although a discordant read pair can have such alignments associated with them.
107125
4. Duplicate read pairs that meet the above criteria will be output as discordant unless the **-e** option is used.
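
The four criteria above can be expressed as a FLAG test on a single alignment. This is a hedged sketch, not *samblaster*'s implementation; `is_discordant` and `exclude_dups` (mirroring the **-e** option) are hypothetical names. It assumes the alignment comes from paired-end data, and it treats both 0x100 and 0x800 as non-primary, which holds with or without **-M**.

```python
def is_discordant(flag, exclude_dups=False):
    """Return True if an alignment with this FLAG would count as discordant."""
    if flag & (0x4 | 0x8):
        return False  # criterion 1: the read or its mate is unmapped
    if flag & 0x2:
        return False  # criterion 2: the pair is flagged as properly paired
    if flag & (0x100 | 0x800):
        return False  # criterion 3: secondary/supplemental are never output
    if exclude_dups and flag & 0x400:
        return False  # criterion 4: duplicates are dropped only with -e
    return True
```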
108126

109127
---
110128
**SPLIT READ IDENTIFICATION:**
111129
**Split Read** alignments are derived from a single read when one portion of the read aligns to a different region of the reference genome than another portion of the read. Such pairs of alignments often define a structural variant (SV) breakpoint, and are therefore useful input to SV detection algorithms such as [LUMPY](https://github.com/arq5x/lumpy-sv/). *samblaster* uses the following strategy to identify split read alignments.
112130

113-
1. Identify reads that have between two and **--maxSplitCount** alignments.
131+
1. Identify reads that have between two and **--maxSplitCount** *primary* and *supplemental* alignments.
114132
2. Sort these alignments by their strand-normalized position along the read.
115133
3. Two alignments are output as splitters if they are adjacent on the read, and meet these criteria:
116134
- each covers at least **--minNonOverlap** base pairs of the read that the other does not.
@@ -120,10 +138,10 @@ A **discordant** read pair is one which meets all of the following criteria:
120138

121139
---
122140
**UNMAPPED/CLIPPED READ IDENTIFICATION:**
123-
An **unmapped** or **clipped** read is one that is unaligned over all or part of its length respectively. The lack of a full alignment may be caused by a SV breakpoint that falls within the read. Therefore, *samblaster* will optionally output such reads to a FASTQ file for re-alignment by a tool, such as [YAHA](http://faculty.virginia.edu/irahall/yaha/), geared toward finding split-read mappings. *samblaster* applies the following strategy to identify and output unmapped/clipped reads:
141+
An **unmapped** or **clipped** read is a *primary* alignment that is unaligned over all or part of its length respectively. The lack of a full alignment may be caused by a SV breakpoint that falls within the read. Therefore, *samblaster* will optionally output such reads to a FASTQ file for re-alignment by a tool, such as [YAHA](https://github.com/GregoryFaust/yaha/), geared toward finding split-read mappings. *samblaster* applies the following strategy to identify and output unmapped/clipped reads:
124142

125143
1. An **unmapped** read has the *unmapped read* FLAG set (0x4).
126-
2. A **clipped** read is a mapped read with a CIGAR string that begins or ends with at least **--minClipSize** unaligned bases (CIGAR code S or H), and is not from a read that has one or more *secondary* alignments (FLAG 0x100).
144+
2. A **clipped** read is a mapped read with a CIGAR string that begins or ends with at least **--minClipSize** unaligned bases (CIGAR code S and/or H), and is not from a read that has one or more *supplemental* alignments.
127145
3. In order for *samblaster* to output the entire sequence for clipped reads, the input SAM file must have soft clipped primary alignments.
128146
4. *samblaster* will output unmapped/clipped reads into a FASTQ file if QUAL information is available in the input file, and a FASTA file if not.
129147
5. Unmapped/clipped reads that are part of a duplicate read pair will be output unless the **-e** option is used.
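
Rule 2 above depends on how many bases are clipped at either end of the CIGAR string. The sketch below shows one reading of that rule and is not *samblaster*'s actual code. The names `clipped_bases`, `is_clipped` and `has_supplemental` are hypothetical. Summing adjacent S and H operations at one end is an interpretation of "S and/or H", and the default of 20 is the **--minClipSize** default listed in the options.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def clipped_bases(cigar):
    """Return (leading, trailing) counts of clipped bases in a CIGAR string."""
    ops = [(int(n), op) for n, op in _CIGAR_OP.findall(cigar)]

    def clip_run(seq):
        total = 0
        for length, op in seq:
            if op not in "SH":
                break          # stop at the first non-clip operation
            total += length
        return total

    return clip_run(ops), clip_run(reversed(ops))

def is_clipped(flag, cigar, min_clip_size=20, has_supplemental=False):
    """Return True if a mapped primary alignment counts as a clipped read."""
    if flag & 0x4:
        return False  # unmapped reads fall under rule 1, not rule 2
    if flag & (0x100 | 0x800):
        return False  # only primary alignments are considered
    if has_supplemental:
        return False  # reads with supplemental alignments are excluded
    leading, trailing = clipped_bases(cigar)
    return max(leading, trailing) >= min_clip_size
```

With the default threshold, `is_clipped(0, "25S75M")` is true while `is_clipped(0, "10S90M")` is false. The caller has to supply `has_supplemental`, because that fact comes from the other alignments of the same read-id rather than from the line being tested.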
