59 changes: 27 additions & 32 deletions README.md
@@ -55,26 +55,27 @@ Then, copy or move "scythe" to a directory in your $PATH.

Scythe can be run minimally with:

-scythe -a adapter_file.fasta -o trimmed_sequences.fasta sequences.fastq
+scythe -a adap.fa -o trimmed_sequences.fasta sequences.fastq

-By default, the prior contamination rate is 0.05. This can be changed
+By default, the prior contamination rate is 0.3. This can be changed
(and one is encouraged to do so!) with:

-scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq
+scythe -a adap.fa -p 0.1 -o trimmed_sequences.fastq sequences.fastq

If you'd like to use standard out, it is recommended you use the
--quiet option:

-scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq
+scythe -a adap.fa --quiet sequences.fastq > trimmed_sequences.fastq

Also, more detailed output about matches can be obtained with:

-scythe -a adapter_file.fasta -o trimmed_sequences.fasta -m matches.txt sequences.fastq
+scythe -a adap.fa -o trimmed_sequences.fasta -m matches.txt sequences.fastq

-By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger
-or Solexa (pipeline < 1.3) qualities can be specified with -q:
+By default, the Sanger fastq quality encoding (phred+33; pipeline >= 1.8) is used.
+Illumina (phred+64; pipelines 1.3 - 1.7) or Solexa ("Solexa"+64; pipelines < 1.3)
+qualities can be specified with -q:

-scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fasta sequences.fastq
+scythe -a adap.fa -q solexa -o trimmed_sequences.fasta sequences.fastq
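
The three encodings named in this hunk differ in their ASCII offset and, for Solexa, in how the score itself is defined. The following is a minimal Python sketch of that difference, not code taken from Scythe. The function name `error_probability` and the dictionary keys `sanger` and `illumina` are labels chosen for this illustration; only `solexa` appears above as a value for `-q`.

```python
# Illustrative only: how one quality character maps to an error probability
# under the three FASTQ encodings. Names here are assumptions for the sketch.
OFFSETS = {
    "sanger": 33,    # phred+33
    "illumina": 64,  # phred+64
    "solexa": 64,    # Solexa score + 64
}


def error_probability(char, scheme="sanger"):
    """Return the base-call error probability encoded by one quality character."""
    q = ord(char) - OFFSETS[scheme]
    if scheme == "solexa":
        # Solexa scores are defined on the odds: Q = -10 * log10(p / (1 - p))
        return 1.0 / (1.0 + 10 ** (q / 10))
    # Phred scores are defined on the probability: Q = -10 * log10(p)
    return 10 ** (-q / 10)


if __name__ == "__main__":
    for ch, scheme in [("I", "sanger"), ("h", "illumina"), ("h", "solexa")]:
        print(ch, scheme, error_probability(ch, scheme))
```

Reading a phred+64 file as phred+33 makes every base look far more reliable than it is, which is why the quality scheme has to match the pipeline version that produced the reads.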

Lastly, a minimum match length argument can be specified with -n <integer>:

@@ -86,27 +87,21 @@ liberal trimming, i.e. of only a few bases.

## Notes

-Note that the two provided adapter sequence files contain non-FASTA
-characters to denote the locations of barcode sequences, which always
-appear in TruSeq adapters, and may or may not appear in forward and/or
-reverse reads using the original Solexa/Illumina adapter sequences,
-depending on library preparation. You'll need to modify the adapter
-sequence files in order to use them.
-
-In the case of the original Solexa/Illumina adapter sequences, we've seen
-barcodes "upstream" of forward reads (in which case the reverse complement
-of the barcode will appear before the adapter sequence at the 3'-end of
-reverse reads - replacing the [NNNNNN]). We've also seen barcodes upstream
-of reverse reads (in which case the reverse complement of the barcode will
-appear before the adapter sequence at the 3'-end of forward reads -
-replacing the [MMMMMM]). Your definition of the barcode may be someone
-else's reverse-complemented barcode, and the barcode may or may not be 6
-bases.
-
-In the case of TruSeq adapter sequences, there will always be a 6 bp
-barcode in place of the [NNNNNN] in sequence contaminating forward reads
-(if the fragment is short enough, of course). This barcode sequence should
-match the barcode included in the reads' FASTQ headers.
+Note that the provided adapter sequence files (*_adapters.fa) contain
+non-FASTA characters to denote the locations of barcode sequences,
+which always appear in TruSeq adapters, and may or may not appear in
+forward and/or reverse reads using the original ("Solexa") Illumina
+adapter sequences, depending on library preparation. You'll need to
+modify the adapter sequence files in order to use them.
+
+An example adapters file (adap.fa) is included for ease of use. It
+omits barcodes, so it can be used on all samples of an indexed pool or
+set of files (the first ~30 bp are sufficient to identify adapter
+contamination). However, Scythe will check reads against all adapter
+sequences in the file, so a file with 6 adapter sequences will cause
+roughly 6x runtime. Since your samples will never include adapters
+from several types of kits, you're encouraged to omit everything but
+the adapter(s) that will be found in your sequences.
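
Trimming adap.fa down to the adapters relevant to one library type can be done by hand, or with a few lines of Python. The sketch below is one possible approach and is not part of Scythe; `read_fasta` and `keep_adapters` are hypothetical helper names, and the record names are the ones used in adap.fa in this change.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) tuples."""
    records = []
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line[1:]
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def keep_adapters(text, wanted):
    """Return FASTA text containing only the records whose names are in `wanted`."""
    kept = [(n, s) for n, s in read_fasta(text) if n in wanted]
    return "".join(f">{n}\n{s}\n" for n, s in kept)


if __name__ == "__main__":
    example = (
        ">TruSeq_forward_contam\n"
        "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC\n"
        ">Nextera_forward_contam\n"
        "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC\n"
    )
    print(keep_adapters(example, {"TruSeq_forward_contam"}), end="")
```

Matching on exact record names rather than a prefix avoids accidentally keeping the TruSeq_SmallRNA entries when only the standard TruSeq adapters are wanted. The result can be written to a new file and passed to `-a`.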

Scythe only checks for 3'-end contaminants, up to the adapter's length
into the 3'-end. For reads with contamination in *any* position, the
@@ -138,9 +133,9 @@ Scythe adapter files that contain all possible barcodes concatenated
with possible adapters, so that both can be recognized and
removed. This has worked well and is recommended for cases when 3'-end
quality deteriorates and prevents barcode removal. Newer Illumina
-chemistry has the barcode separated from the fragment, so that it
-appears as an entirely separate read and is used to demultiplex sample
-reads by Illumina's CASAVA pipeline.
+chemistry (TruSeq) has the barcode separated from the fragment, so
+that it appears as an entirely separate read that is used to
+demultiplex sample reads by the Illumina pipeline.

### Does Scythe work on 5'-end or other contaminants?

16 changes: 16 additions & 0 deletions adap.fa
@@ -0,0 +1,16 @@
+>TruSeq_forward_contam
+AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
+>TruSeq_reverse_contam
+AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
+>Nextera_forward_contam
+CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
+>Nextera_reverse_contam
+CTGTCTCTTATACACATCTGACGCTGCCGACGA
+>TruSeq_SmallRNA_forward_contam
+TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC
+>TruSeq_SmallRNA_reverse_contam
+GATCGTCGGACTGTAGAACTCTGAACCTGTCG
+>Solexa_forward_contam
+AGATCGGAAGAGCGGTTCAGCAGGAATGCCGA
+>Solexa_reverse_contam
+AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTG
4 changes: 0 additions & 4 deletions illumina_adapters.fa

This file was deleted.

4 changes: 4 additions & 0 deletions nextera_adapters.fa
@@ -0,0 +1,4 @@
+>Nextera_forward_contam
+CTGTCTCTTATACACATCTCCGAGCCCACGAGAC[8bp index]ATCTCGTATGCCGTCTTCTGCTTG
+>Nextera_reverse_contam
+CTGTCTCTTATACACATCTGACGCTGCCGACGA[8bp index]GTGTAGATCTCGGTGGTCGCCGTATCATT
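
The bracketed `[8bp index]` placeholder is what makes these records invalid FASTA until they are edited. As a rough illustration of the two obvious ways to resolve it, the Python sketch below either truncates the sequence at the placeholder (which reproduces the corresponding adap.fa entry) or substitutes a concrete index. The function name `resolve_placeholder` is hypothetical, and the index `TAAGGCGA` in the usage example is an arbitrary 8 bp string chosen for illustration, not a value from this change.

```python
import re

# Matches one bracketed placeholder such as "[8bp index]" or "[NNNNNN]".
PLACEHOLDER = re.compile(r"\[[^\]]*\]")


def resolve_placeholder(seq, index=None):
    """Make an adapter sequence valid FASTA.

    With index=None, keep only the bases before the placeholder.
    Otherwise, replace the first placeholder with the given index sequence.
    """
    match = PLACEHOLDER.search(seq)
    if match is None:
        return seq  # nothing to resolve
    if index is None:
        return seq[:match.start()]
    return seq[:match.start()] + index + seq[match.end():]


if __name__ == "__main__":
    nextera_fwd = (
        "CTGTCTCTTATACACATCTCCGAGCCCACGAGAC[8bp index]ATCTCGTATGCCGTCTTCTGCTTG"
    )
    print(resolve_placeholder(nextera_fwd))
    print(resolve_placeholder(nextera_fwd, index="TAAGGCGA"))
```

Truncation gives one adapter record usable across every sample in an indexed pool, while substitution gives a longer, sample-specific record.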