From ff298fa09ea9b1f75127f765fed018aaf7f5a0fc Mon Sep 17 00:00:00 2001 From: Joseph Fass Date: Thu, 7 Jun 2012 09:46:03 -0700 Subject: [PATCH 1/9] some edits to paper .tex --- paper/scythe-paper.tex | 198 ++++++++++++++++++++++------------------- 1 file changed, 105 insertions(+), 93 deletions(-) diff --git a/paper/scythe-paper.tex b/paper/scythe-paper.tex index a2e1e31..130eae2 100644 --- a/paper/scythe-paper.tex +++ b/paper/scythe-paper.tex @@ -18,40 +18,54 @@ \begin{abstract} \section{Motivation:} -Modern sequencing technologies can leave artifactual contaminant -sequences at the 3'-end of reads. 3'-end regions also have the lowest -quality bases that are likely to be called incorrectly, which makes -identifying and removing 3'-end contaminants difficult. +Illumina's short read sequencing technology uses known adapter +sequences that can contaminate the 3'-ends of reads when individual +sample fragments are shorter than the read length. Unfortunately, the +3'-ends of these reads are enriched with low quality bases that are +more likely to be called incorrectly, which makes identifying and +removing 3'-end contaminants difficult. Removal of contaminating +sequence is a critical step in most bioinformatics pipelines. + \section{Results:} -Scythe is a program designed specifically to remove 3'-end -contaminants. It searches for 3'-end contaminants and uses a Bayesian -model that considers individual base qualities to decide whether a -given match is a contaminant or background sequence. With a variety of -contamination rates, Scythe outperforms other adapter removal software -tools. +We have written Scythe - a novel 3'-end adapter trimmer based on +Bayesian classification that incorporates both sequence and base +qualities. Scythe outperforms other adapter removal tools across a +wide range of contamination rates, and is more robust to parameter +settings than other tools. + \section{Availability:} Scythe is freely available under the MIT license at \href{http://github.com/vsbuffalo/scythe}{http://github.com/vsbuffalo/scythe}. + \section{Contact:} \href{mailto:vsbuffalo@gmail.com}{vsbuffalo@gmail.com} \end{abstract} + \section{Introduction} -Scythe is focused on trimming 3'-end contaminants, specifically those -due to adapters or barcodes. It embraces the Unix Philosophy of -``programs that do one thing well'' (\citealp{raymond2003}). Many -second-generation sequencing technologies such as Illumina's Genome -Analyzer II and HiSeq have lower-quality 3'-end bases. These -low-quality bases are more likely to have nucleotides called -incorrectly, making contaminant identification more -difficult. Futhermore, 3'-end quality deterioration is not uniform -across all reads (see Fig. 1 in Supplementary Materials). - -By using a Bayesian classification procedure, Scythe accurately trims -3'-end contaminants off reads. The underlying model differentially -weights mismatching bases according to their FASTQ quality. +Scythe embraces the Unix Philosophy of ``programs that do one thing +well'' (\citealp{raymond2003}), and focuses on recognition and +trimming of Illumina adapter contamination. Illumina's +second-generation sequencing technology (MiSeq, GA series, HiSeq) +often results in reads that have a known, constant adapter sequence +that starts at variable, 3'-biased positions in reads. Bases at the +3'-end are also more difficult to call, and accordingly have low base +qualities and high substitution error rates. Importantly, the 3'-end +quality deterioration is not uniform across all reads (see Fig. 1 in +Supplementary Materials). + +Low quality 3'-end bases are commonly trimmed as a first step, however +this discards some evidence, albeit noisy, for adapter +sequences. Scythe is meant to be run before quality trimming is +performed, and uses a Bayesian classification scheme that +differentially weights mismatches between the adapter sequence and +bases in a read according to their base quality. + +We tested Scythe using a combination of real and simulated data, along +with two other 3'-end adapter trimming tools: Btrim (ref) and Cutadapt +(ref). % A common step in read quality improvement procedures is to remove % these low-quality 3'-end sequences from reads. This is thought to @@ -74,26 +88,28 @@ \section{Introduction} \begin{methods} \section{String Matching in Scythe} -Scythe employs a simple string matching heuristic to find a best -match. Scythe scores each alignment of the contaminant sequence -against the read sequence, starting from the entire contaminant in the -3'-end to incrementally fewer bases. A minimum match length can be -specified via a parameter (by default, 5). Each of these alignments is -scored using a $1$ for match, $-1$ for mismatch approach. The top -scoring alignment is then passed to the probabilistic classification -procedure, which decides whether the match is background sequence or a -contaminant. The time complexity of Scythe's matching algorithm for a -single adapter of length $l_a$ is $O(l_a^2 R)$ for a FASTQ file with -$R$ entries. +Scythe employs a simple, string matching heuristic to find a best +match for a given adapter sequence within a given read. Scythe scores +ungapped alignments of the contaminant sequence against the read +sequence, in every position in which a base in the provided adapter +sequence overlaps with the 3'-most bases in the read - a minimum match +length can be specified via a parameter (by default, 5). Each of these +alignments is scored using a scheme of $1$ for a match, $-1$ for a +mismatch. The top scoring alignment is then passed to the +probabilistic classification procedure, described below. The time +complexity of Scythe's matching algorithm for a single adapter of +length $l_a$ is $O(l_a^2 R)$ for a FASTQ file with $R$ entries. \section{Bayesian Classification of Top-Scoring Matches} -There are two mutually exclusive and exhaustive events: a top scoring -match is a contaminant or it is random sequence (that happens to be -similar to the adapter contaminant). A likelihood for each of these -events, given probabilities of each base being called correctly and -which bases mismatch the contaminant sequence, can be calculated. +Considering the highest-scoring alignment of an adapter sequence to a +read, there are two mutually exclusive and exhaustive events: (1) +there is a contaminant at the specified position, or (2) there is no +contaminant at that position. A likelihood for each of these events, +given probabilities of each base being called correctly (derived from +the base qualities) and which bases mismatch the contaminant sequence, +can be calculated. These likelihood functions assume models of error for each event. If the top-scoring match is contaminant sequence (event $C$), the @@ -103,12 +119,10 @@ \section{Bayesian Classification of Top-Scoring Matches} Where $m_i \in \{0, 1\}$ indicating whether the $i$ base is a mismatch or match (respectively) and $q_i$ is the probability the $i$ base is -called correctly (from the ASCII-encoded quality). $l_t$ is the length -of the top-scoring match. - -If the top-scoring sequence is not a contaminant sequence (event -$C'$), it is assumed that the matches are just by chance. Thus, the -likelihood function is: +called correctly (from its base quality). $l_t$ is the length of the +top-scoring match. On the other hand, if the top-scoring sequence is +not a contaminant sequence (event $C'$), it is assumed that the +matches occur by chance, giving the likelihood function: $$ P(S | C') = \prod_{i=1}^{l_t} \left(\frac{1}{4}\right)^{m_i} \cdot \left(\frac{3}{4}\right)^{1 - m_i} $$ @@ -119,14 +133,16 @@ \section{Bayesian Classification of Top-Scoring Matches} Since these are mutually exclusive and exhaustive events, the \emph{maximum a posteriori} rule can be used to classify a top-scoring -sequence as either contaminated or non-contaminated (e.g. if $P(C'|S) -> P(C|S)$, the sequence is not contaminated and visa-versa). +alignment as due to contamination or chance (e.g. if $P(C'|S) > +P(C|S)$, the sequence is not contaminated at that position, and +visa-versa). Required in this Bayesian formulation is the prior of contamination, -$P(C)$. Specifying the prior may seem like a nuisance, but it is a -useful parameter that can be adjusted for a variety of different -contamination scenarios. More information on specifying a parameter is -in Supplementary Materials. +$P(C)$. This can be estimated by inspection of the sequences +(i.e. counting perfect matches to a substring of the adapter +sequence), though in practice Scythe's performance does not depend on +an accurate estimate. More information on specifying the prior is in +Supplementary Materials. \section{Results} \begin{centering} @@ -139,64 +155,60 @@ \section{Results} \end{centering} Scythe was tested against two similar program: Btrim -(\citealp{pmid21651976}) and Cutadapt (\citealp{EJ200}). Both -Btrim and Cutadapt can handle insertions/deletions, but in Illumina -data we've analzed, no indels in adapter contaminants were -witnessed. This problem is more associated with 454 reads, which can -have homopolymer repeats. Both Btrim and Cutadapt also include quality -trimmers, which Scythe intentionally does not. - -To test Scythe, Btrim, and Cutadapt at 3'-end adapter contaminant -removal, random reads were generated and contaminated at fixed -contamination rates of 40\% and 70\%. FASTQ qualities from an Illumina -HiSeq run were added to these read sequences. Real base qualities were -used because properly modeling and simulating 3'-end quality -degradation is an arduous task. For the two fixed contamination rates, -ten replicate FASTQ files were generated randomly to ensure +(\citealp{pmid21651976}) and Cutadapt (\citealp{EJ200}). Both Btrim +and Cutadapt can handle insertions/deletions, but these are rare in +Illumina sequences. Whereas Btrim and Cutadapt also perform quality +trimming, Scythe intentionally limits its scope to adapter trimming. + +To test these tools, random reads were generated and contaminated at +fixed contamination rates of 40\% and 70\%, and paired to base quality +profiles from an Illumina HiSeq run, followed by mutation of each base +at the rate designated by their base quality (i.e. a base with a +quality score of 30 had a 1 in 1000 chance of being changed to a +different, random base). For each contamination rate, ten replicate +FASTQ files of 10,000 sequences each were generated, to ensure that contaminated and non-contaminated reads were well dispersed across -different read qualities. Each simulated FASTQ reads file contains -10,000 entries. +different base quality profiles. Each program was run with varying parameters to see how true positive and false positive rates change. Btrim uses a fixed number of mismatches, so integer values from 0 to 10 were used. Cutadapt uses an error rate for a matched sequence, which was varied from 0 to 0.95 in -0.05 increments. Likewise, Scythe uses the same values for its prior -parameter. Each piece of software was run with all other options off -to ensure a fair comparison of 3'-end contaminant removal. +0.05 increments. Scythe uses the same values as Cutadapt, for the +prior. Each program was run with all other options off to ensure a +fair comparison of 3'-end contaminant removal. While Cutadapt and Scythe only trim reads, Btrim occasionally removes a read entirely from the sample. For comparison purposes, we count this as trimming the entire length of a read. We compared the length of the original simulated read to the trimmed read for all programs, -parameters, replicates, and contamination rates. This, combined with -information about whether a read was contaminated allows the -calculation of true positive and false positive rates. Also, we -investigate how many times Btrim, Cutadapt, and Scythe incorrectly -trim a read, e.g. whether the trimmed length is not equal to the -contaminate length for reads that were contaminated. - -In these tests, we find that Scythe outperforms Btrim and Cutadapt in -terms of true positive rates for a variety of false positive rates -across a variety of parameters (Fig. \ref{fig:02}). Furthermore, -Scythe also has fewer incorrectly trimmed reads across a variety of -parameters compared to Btrim and Cutadapt. Differences in -contamination rates (40\% and 70\%) do not adversely affect Scythe's -performance. Note that Cutadapt performs well too, but is very -sensitive to its error parameter. Changing Cutadapt's error parameter -from 35\% to 40\% slightly increases the true positive rate, but -drastically increases the false positive rate. By contrast, Scythe's -probabilistic approach is much less sensitive to different prior -parameters, illustrated by the overplotting of the 35\% to 95\% prior -parameters in the ROC curve (Fig. \ref{fig:02}). Furthermore, we argue -that Scythe's parameter is easier to intuitively choose than a -sequence error rate. +parameters, replicates, and contamination rates, in order to calculate +true positive and false positive rates. We also counted how many times +Btrim, Cutadapt, and Scythe incorrectly trimmed a read, e.g. whether +the trimmed length was not equal to the contaminated length for reads +that were contaminated. + +All three tools performed almost almost identically at the 40\% and +70\% contamination rates (Fig. \ref{fig:02}a). Scythe outperformed +Btrim and Cutadapt in terms of true positive rates for a variety of +false positive rates across a wide range of parameters +(Fig. \ref{fig:02}). In addition, Scythe's performance (TP/FP rates) +was relatively stable across a wide range of priors - most notably +between ~0.45 and 0.95, whereas small parameter changes had large +effects on the TP/FP rates for Cutadapt and Btrim. Scythe also has +fewer incorrectly trimmed reads across a variety of parameters +compared to Btrim and Cutadapt (Fig. \ref{fig:02}b). \end{methods} + \section{Conclusion} -When compared to Btrim and Cutadapt, Scythe is a very accurate 3'-end +When compared to Btrim and Cutadapt, Scythe outperforms both in terms +of True Positive (TP) rate across a large parameter range, and will not + + +is a very accurate 3'-end contaminant trimmer appropriate for a variety of pre-analysis quality pipelines. It is a modular and orthogonal program, allowing use with other read processing tools and diagnostics between each step. We From 460c1e3eb91b7a1556f4d25bc0434af8a52105e0 Mon Sep 17 00:00:00 2001 From: Joseph Fass Date: Wed, 13 Jun 2012 13:56:35 -0700 Subject: [PATCH 2/9] finished a round of paper edits - JNF --- paper/scythe-paper.tex | 85 +++++++++++++++++++++--------------------- 1 file changed, 42 insertions(+), 43 deletions(-) diff --git a/paper/scythe-paper.tex b/paper/scythe-paper.tex index 130eae2..2593812 100644 --- a/paper/scythe-paper.tex +++ b/paper/scythe-paper.tex @@ -23,16 +23,17 @@ \section{Motivation:} sample fragments are shorter than the read length. Unfortunately, the 3'-ends of these reads are enriched with low quality bases that are more likely to be called incorrectly, which makes identifying and -removing 3'-end contaminants difficult. Removal of contaminating -sequence is a critical step in most bioinformatics pipelines. +removing 3'-end contaminants difficult. However, removal of +contaminating sequence is a critical step in most bioinformatics +pipelines. \section{Results:} We have written Scythe - a novel 3'-end adapter trimmer based on -Bayesian classification that incorporates both sequence and base -qualities. Scythe outperforms other adapter removal tools across a -wide range of contamination rates, and is more robust to parameter -settings than other tools. +Bayesian classification that considers both sequence and base +qualities. In our tests, Scythe outperforms other adapter removal +tools across a wide range of contamination rates, and is more robust +to parameter settings than other tools. \section{Availability:} @@ -49,19 +50,19 @@ \section{Introduction} well'' (\citealp{raymond2003}), and focuses on recognition and trimming of Illumina adapter contamination. Illumina's second-generation sequencing technology (MiSeq, GA series, HiSeq) -often results in reads that have a known, constant adapter sequence -that starts at variable, 3'-biased positions in reads. Bases at the -3'-end are also more difficult to call, and accordingly have low base -qualities and high substitution error rates. Importantly, the 3'-end -quality deterioration is not uniform across all reads (see Fig. 1 in -Supplementary Materials). +often results in reads that have a known, largely unvarying adapter +sequence that starts at variable (but largely 3'-biased) positions in +reads. Bases at the 3'-end are also more difficult to call, and +accordingly have low base qualities and high substitution error +rates. Importantly, though, the 3'-end quality deterioration is not +uniform across all reads (see Fig. 1 in Supplementary Materials). Low quality 3'-end bases are commonly trimmed as a first step, however this discards some evidence, albeit noisy, for adapter sequences. Scythe is meant to be run before quality trimming is performed, and uses a Bayesian classification scheme that differentially weights mismatches between the adapter sequence and -bases in a read according to their base quality. +bases in a read according to their particular base quality. We tested Scythe using a combination of real and simulated data, along with two other 3'-end adapter trimming tools: Btrim (ref) and Cutadapt @@ -89,22 +90,21 @@ \section{Introduction} \section{String Matching in Scythe} Scythe employs a simple, string matching heuristic to find a best -match for a given adapter sequence within a given read. Scythe scores -ungapped alignments of the contaminant sequence against the read -sequence, in every position in which a base in the provided adapter -sequence overlaps with the 3'-most bases in the read - a minimum match -length can be specified via a parameter (by default, 5). Each of these -alignments is scored using a scheme of $1$ for a match, $-1$ for a -mismatch. The top scoring alignment is then passed to the -probabilistic classification procedure, described below. The time -complexity of Scythe's matching algorithm for a single adapter of +match for a given contaminant sequence within a given read. Scythe +scores ungapped alignments of the contaminant sequence against the +read sequence, from a minimum overlap at the 3'-end (by default, 5 bp) +to full overlap with right justification of the contaminant within the +read. Each of these alignments is scored using a scheme of $1$ for a +match, $-1$ for a mismatch. The top scoring alignment is then passed +to the probabilistic classification procedure, described below. The +time complexity of Scythe's matching algorithm for a single adapter of length $l_a$ is $O(l_a^2 R)$ for a FASTQ file with $R$ entries. \section{Bayesian Classification of Top-Scoring Matches} -Considering the highest-scoring alignment of an adapter sequence to a -read, there are two mutually exclusive and exhaustive events: (1) +Considering the highest-scoring alignment of a contaminant sequence to +a read, there are two mutually exclusive and exhaustive events: (1) there is a contaminant at the specified position, or (2) there is no contaminant at that position. A likelihood for each of these events, given probabilities of each base being called correctly (derived from @@ -112,7 +112,7 @@ \section{Bayesian Classification of Top-Scoring Matches} can be calculated. These likelihood functions assume models of error for each event. If -the top-scoring match is contaminant sequence (event $C$), the +the top-scoring match is contaminant-derived (event $C$), the likelihood $P(S | C)$ (where $S$ is the sequence match data) is: $$ P(S | C) = \prod_{i=1}^{l_t} q_i^{m_i} \cdot (1-q_i)^{1 - m_i} $$ @@ -120,9 +120,9 @@ \section{Bayesian Classification of Top-Scoring Matches} Where $m_i \in \{0, 1\}$ indicating whether the $i$ base is a mismatch or match (respectively) and $q_i$ is the probability the $i$ base is called correctly (from its base quality). $l_t$ is the length of the -top-scoring match. On the other hand, if the top-scoring sequence is -not a contaminant sequence (event $C'$), it is assumed that the -matches occur by chance, giving the likelihood function: +top-scoring match. On the other hand, if the top-scoring match is not +contaminant-derived (event $C'$), it is assumed that the match +occurred by chance, giving the likelihood function: $$ P(S | C') = \prod_{i=1}^{l_t} \left(\frac{1}{4}\right)^{m_i} \cdot \left(\frac{3}{4}\right)^{1 - m_i} $$ @@ -135,7 +135,7 @@ \section{Bayesian Classification of Top-Scoring Matches} \emph{maximum a posteriori} rule can be used to classify a top-scoring alignment as due to contamination or chance (e.g. if $P(C'|S) > P(C|S)$, the sequence is not contaminated at that position, and -visa-versa). +vice versa). Required in this Bayesian formulation is the prior of contamination, $P(C)$. This can be estimated by inspection of the sequences @@ -205,20 +205,19 @@ \section{Results} \section{Conclusion} When compared to Btrim and Cutadapt, Scythe outperforms both in terms -of True Positive (TP) rate across a large parameter range, and will not - - -is a very accurate 3'-end -contaminant trimmer appropriate for a variety of pre-analysis quality -pipelines. It is a modular and orthogonal program, allowing use with -other read processing tools and diagnostics between each step. We -suggest that Scythe be used for all Illumina read processing -pipelines. - -We suspect Scythe's superior performance is due to its (1) -consideration of base-level qualities and (2) Scythe's use of Bayesian -model rather than a simple fixed-number of mismatches or fixed-error -approach. +of the True Positive (TP) rate across a large range of input +parameters, and Scythe would not generate results with a False +Positive (FP) rate above ~0.4 in the data tested. Scythe's 'prior' +parameter can be set through straightforward inspection of the reads +(for perfect matches), and is more weakly correlated with the FP rate, +meaning inaccurate settings will not suddenly cause poor performance. + +In Scythe, we have implemented an approach to 3'-end contaminant +trimming that is novel in two respects: (1) consideration of +base-level qualities and (2) the use of a Bayesian model rather than a +simple fixed-number of mismatches or fixed-error approach. We believe +that this approach can improve contaminant removal in many pipelines +and applications. \section*{Acknowledgement} From 42c96a48ac995d8c4c966b4a4c6168a5687f89ff Mon Sep 17 00:00:00 2001 From: Joe Fass Date: Fri, 29 Jun 2012 16:15:39 -0700 Subject: [PATCH 3/9] fixed naming of original solexa adapters --- illumina_adapters.fa => solexa_adapters.fa | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) rename illumina_adapters.fa => solexa_adapters.fa (100%) diff --git a/illumina_adapters.fa b/solexa_adapters.fa similarity index 100% rename from illumina_adapters.fa rename to solexa_adapters.fa index c1838bf..e7e9c68 100644 --- a/illumina_adapters.fa +++ b/solexa_adapters.fa @@ -1,4 +1,4 @@ >Solexa_forward_contam -[MMMMMM]AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAA ->Solexa_reverse_contam [NNNNNN]AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAA +>Solexa_reverse_contam +[MMMMMM]AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAA From cbf89b50707840b1649235a75c5e7e8d36a24515 Mon Sep 17 00:00:00 2001 From: Joseph Fass Date: Wed, 11 Jul 2012 11:32:36 -0700 Subject: [PATCH 4/9] minor formula fix - reversing the contaminant / non-contaminant cases of maximum a priori rule application --- paper/scythe-paper.tex | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/paper/scythe-paper.tex b/paper/scythe-paper.tex index 2593812..e54ffdd 100644 --- a/paper/scythe-paper.tex +++ b/paper/scythe-paper.tex @@ -133,9 +133,8 @@ \section{Bayesian Classification of Top-Scoring Matches} Since these are mutually exclusive and exhaustive events, the \emph{maximum a posteriori} rule can be used to classify a top-scoring -alignment as due to contamination or chance (e.g. if $P(C'|S) > -P(C|S)$, the sequence is not contaminated at that position, and -vice versa). +alignment as due to contamination (if $P(C|S) > P(C'|S)$) or chance (if +$P(C'|S) > P(C|S)$). Required in this Bayesian formulation is the prior of contamination, $P(C)$. This can be estimated by inspection of the sequences From b294cecc367098e2c1c2f29f8765aa34c42580dc Mon Sep 17 00:00:00 2001 From: Joseph Fass Date: Mon, 20 Aug 2012 21:19:32 -0700 Subject: [PATCH 5/9] added helper script to mine reads for (TruSeq) sample IDs --- README.md | 13 +++++++--- profileTruSeqIDs.pl | 58 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 68 insertions(+), 3 deletions(-) create mode 100755 profileTruSeqIDs.pl diff --git a/README.md b/README.md index 781a357..15639f0 100644 --- a/README.md +++ b/README.md @@ -138,9 +138,16 @@ Scythe adapter files that contain all possible barcodes concatenated with possible adapters, so that both can be recognized and removed. This has worked well and is recommended for cases when 3'-end quality deteriorates and prevents barcode removal. Newer Illumina -chemistry has the barcode separated from the fragment, so that it -appears as an entirely separate read and is used to demultiplex sample -reads by Illumina's CASAVA pipeline. +chemistry (TruSeq) has the barcode separated from the fragment, so +that it appears as an entirely separate read that is used to +demultiplex sample reads by Illumina's CASAVA pipeline. In case the 6 +bp sample IDs are not readily available, we have included a helper +script to profile the IDs found in the forward reads of TruSeq data +set (reverse reads are subject to contamination by a different adapter +that doesn't contain variable sequence). Use this script as follows +(e.g.): + +cat reads.fq | head -4000000 | perl profileTruSeqIDs.pl > IDsFound.txt ### Does Scythe work on 5'-end or other contaminants? diff --git a/profileTruSeqIDs.pl b/profileTruSeqIDs.pl new file mode 100755 index 0000000..e570e3e --- /dev/null +++ b/profileTruSeqIDs.pl @@ -0,0 +1,58 @@ +#!/usr/bin/perl -w + +# AUTHOR: Joseph Fass +# LAST REVISED: August 2012 +# The Bioinformatics Core at UC Davis Genome Center +# http://bioinformatics.ucdavis.edu + +# profileTruSeqIDs.pl is the proprietary property of The Regents of +# the University of California (“The Regents.”) Copyright 2007-12 The +# Regents of the University of California, Davis campus. All Rights +# Reserved. Redistribution and use in source and binary forms, with +# or without modification, are permitted by nonprofit, research +# institutions for research use only, provided that the following +# conditions are met: Redistributions of source code must retain the +# above copyright notice, this list of conditions and the following +# disclaimer. Redistributions in binary form must reproduce the above +# copyright notice, this list of conditions and the following +# disclaimer in the documentation and/or other materials provided with +# the distribution. The name of The Regents may not be used to +# endorse or promote products derived from this software without +# specific prior written permission. The end-user understands that +# the program was developed for research purposes and is advised not +# to rely exclusively on the program for any reason. THE SOFTWARE +# PROVIDED IS ON AN "AS IS" BASIS, AND THE REGENTS HAVE NO OBLIGATION +# TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR +# MODIFICATIONS. THE REGENTS SPECIFICALLY DISCLAIM ANY EXPRESS OR +# IMPLIED WARRANTIES, INCLUDING BUT NOT LIMITED TO, THE IMPLIED +# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND +# NONINFRINGEMENT. IN NO EVENT SHALL THE REGENTS BE LIABLE TO ANY +# PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY OR +# CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO PROCUREMENT OF +# SUBSTITUTE GOODS OR SERVICES, LOSS OF USE, DATA OR PROFITS, OR +# BUSINESS INTERRUPTION, HOWEVER CAUSED AND UNDER ANY THEORY OF +# LIABILITY WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING +# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS +# SOFTWARE AND ITS DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY +# OF SUCH DAMAGE. If you do not agree to these terms, do not download +# or use the software. This license may be modified only in a writing +# signed by authorized signatory of both parties. + +# searches for TruSeq adapters in fastq, extracts and counts the 6 bp +# "barcodes," or multiplex identifiers + +while (<>) { + $seq = <>; + <>; + <>; + if ($seq =~ /AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC([ACGT]{6})/) { + $count{$1}++; + } +} +if (%count) { + foreach $ID (reverse sort {$count{$a}<=>$count{$b}} keys %count) { + print "$ID\t$count{$ID}\n"; + } +} else { + print "\nNo ID's found! Reads too short? Try visual inspection, with a shorter search string, like \"AGATCGGAAGAG\".\n\n"; +} From 4ec3f7dc5ae46020c548b44d4744e6ffe7f09ede Mon Sep 17 00:00:00 2001 From: Joseph Fass Date: Wed, 26 Sep 2012 14:05:59 -0700 Subject: [PATCH 6/9] added bib to tex, small edits --- paper/scythe-paper.tex | 65 +++++++++++++++++++++++------------------- 1 file changed, 35 insertions(+), 30 deletions(-) diff --git a/paper/scythe-paper.tex b/paper/scythe-paper.tex index e54ffdd..27a75dd 100644 --- a/paper/scythe-paper.tex +++ b/paper/scythe-paper.tex @@ -20,12 +20,11 @@ \section{Motivation:} Illumina's short read sequencing technology uses known adapter sequences that can contaminate the 3'-ends of reads when individual -sample fragments are shorter than the read length. Unfortunately, the -3'-ends of these reads are enriched with low quality bases that are -more likely to be called incorrectly, which makes identifying and -removing 3'-end contaminants difficult. However, removal of -contaminating sequence is a critical step in most bioinformatics -pipelines. +sample DNA fragments are shorter than the read length. Unfortunately, +the 3'-ends of these reads contain more substitution errors, which +makes identifying and removing 3'-end contaminants difficult. However, +removal of contaminating sequence is a critical step in most +bioinformatics pipelines. \section{Results:} @@ -47,22 +46,22 @@ \section{Contact:} \href{mailto:vsbuffalo@gmail.com}{vsbuffalo@gmail.com} \section{Introduction} Scythe embraces the Unix Philosophy of ``programs that do one thing -well'' (\citealp{raymond2003}), and focuses on recognition and -trimming of Illumina adapter contamination. Illumina's +well'' (\citealp{gancarz1995}), and focuses on recognition and removal +of Illumina adapter sequence from read 3'-ends. Illumina's second-generation sequencing technology (MiSeq, GA series, HiSeq) -often results in reads that have a known, largely unvarying adapter -sequence that starts at variable (but largely 3'-biased) positions in -reads. Bases at the 3'-end are also more difficult to call, and -accordingly have low base qualities and high substitution error -rates. Importantly, though, the 3'-end quality deterioration is not -uniform across all reads (see Fig. 1 in Supplementary Materials). +often results in reads that contain a known adapter sequence starting +at variable, but commonly 3'-biased positions in reads. Bases at the +3'-end are also more difficult to call, and consequently have low base +qualities and high substitution error rates. Importantly, though, the +3'-end quality deterioration is not uniform across all reads (see +Fig. 1 in Supplementary Materials). Low quality 3'-end bases are commonly trimmed as a first step, however this discards some evidence, albeit noisy, for adapter sequences. Scythe is meant to be run before quality trimming is performed, and uses a Bayesian classification scheme that differentially weights mismatches between the adapter sequence and -bases in a read according to their particular base quality. +bases in a read according to their base qualities. We tested Scythe using a combination of real and simulated data, along with two other 3'-end adapter trimming tools: Btrim (ref) and Cutadapt @@ -108,21 +107,22 @@ \section{Bayesian Classification of Top-Scoring Matches} there is a contaminant at the specified position, or (2) there is no contaminant at that position. A likelihood for each of these events, given probabilities of each base being called correctly (derived from -the base qualities) and which bases mismatch the contaminant sequence, -can be calculated. +their phred-scaled base qualities (\citealp{ewing1998})) and which +bases mismatch the contaminant sequence, can be calculated. -These likelihood functions assume models of error for each event. If -the top-scoring match is contaminant-derived (event $C$), the -likelihood $P(S | C)$ (where $S$ is the sequence match data) is: +The following likelihood functions assume independent models of error +for each event. If the top-scoring match is contaminant-derived (event +$C$), the likelihood $P(S | C)$ (where $S$ is the sequence match data) +is: $$ P(S | C) = \prod_{i=1}^{l_t} q_i^{m_i} \cdot (1-q_i)^{1 - m_i} $$ Where $m_i \in \{0, 1\}$ indicating whether the $i$ base is a mismatch or match (respectively) and $q_i$ is the probability the $i$ base is -called correctly (from its base quality). $l_t$ is the length of the -top-scoring match. On the other hand, if the top-scoring match is not -contaminant-derived (event $C'$), it is assumed that the match -occurred by chance, giving the likelihood function: +called correctly. $l_t$ is the length of the top-scoring match. On the +other hand, if the top-scoring match is not contaminant-derived (event +$C'$), it is assumed that the match occurred by chance, giving the +likelihood function: $$ P(S | C') = \prod_{i=1}^{l_t} \left(\frac{1}{4}\right)^{m_i} \cdot \left(\frac{3}{4}\right)^{1 - m_i} $$ @@ -227,17 +227,22 @@ \section*{Acknowledgement} %\paragraph{Funding\textcolon} None. -\bibliographystyle{natbib} +% \bibliographystyle{natbib} % \bibliographystyle{achemnat} % \bibliographystyle{plainnat} % \bibliographystyle{abbrv} % \bibliographystyle{bioinformatics} -\nocite{qrqc} -\nocite{ShortRead} -\nocite{Biostrings} -\nocite{ggplot2} +% \nocite{qrqc} +% \nocite{ShortRead} +% \nocite{Biostrings} +% \nocite{ggplot2} -\bibliography{references} +\begin{thebibliography}{} +\bibitem[Ewing \textit{et~al}., 1998]{ewing1998} +Ewing B, Hillier L, Wendl MC, Green P (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. \textit{Genome Res.} \textbf{8}(3):175--185 +\bibitem[Gancarz, 1995]{gancarz1995} +Gancarz M (1995) \textit{The UNIX Philosophy}. Digital Press, Boston. +\end{thebibliography} \end{document} From 5383b0d1ed605aced6f93462dba339b3fe369a57 Mon Sep 17 00:00:00 2001 From: Joseph Fass Date: Wed, 23 Sep 2015 15:22:19 -0700 Subject: [PATCH 7/9] added and edited correct adapter sequences and a better sample adapter file with no non-nt characters --- README.md | 53 +++++++++++++++++-------------------- adap.fa | 16 +++++++++++ nextera_adapters.fa | 4 +++ solexa_adapters.fa | 4 +-- truseq_adapters.fa | 4 +++ truseq_adapters.fasta | 4 --- truseq_smallRNA_adapters.fa | 4 +++ 7 files changed, 54 insertions(+), 35 deletions(-) create mode 100644 adap.fa create mode 100644 nextera_adapters.fa create mode 100644 truseq_adapters.fa delete mode 100644 truseq_adapters.fasta create mode 100644 truseq_smallRNA_adapters.fa diff --git a/README.md b/README.md index 15639f0..db4e5a6 100644 --- a/README.md +++ b/README.md @@ -55,26 +55,27 @@ Then, copy or move "scythe" to a directory in your $PATH. Scythe can be run minimally with: - scythe -a adapter_file.fasta -o trimmed_sequences.fasta sequences.fastq + scythe -a adap.fa -o trimmed_sequences.fasta sequences.fastq -By default, the prior contamination rate is 0.05. This can be changed +By default, the prior contamination rate is 0.3. This can be changed (and one is encouraged to do so!) with: - scythe -a adapter_file.fasta -p 0.1 -o trimmed_sequences.fastq sequences.fastq + scythe -a adap.fa -p 0.1 -o trimmed_sequences.fastq sequences.fastq If you'd like to use standard out, it is recommended you use the --quiet option: - scythe -a adapter_file.fasta --quiet sequences.fastq > trimmed_sequences.fastq + scythe -a adap.fa --quiet sequences.fastq > trimmed_sequences.fastq Also, more detailed output about matches can be obtained with: - scythe -a adapter_file.fasta -o trimmed_sequences.fasta -m matches.txt sequences.fastq + scythe -a adap.fa -o trimmed_sequences.fasta -m matches.txt sequences.fastq -By default, Illumina's quality scheme (pipeline > 1.3) is used. Sanger -or Solexa (pipeline < 1.3) qualities can be specified with -q: +By default, the Sanger fastq quality encoding (phred+33; pipeline >= 1.8) is used. +Illumina (phred+64; pipelines 1.3 - 1.7) or Solexa ("Solexa"+64; pipelines < 1.3) +qualities can be specified with -q: - scythe -a adapter_file.fasta -q solexa -o trimmed_sequences.fasta sequences.fastq + scythe -a adap.fa -q solexa -o trimmed_sequences.fasta sequences.fastq Lastly, a minimum match length argument can be specified with -n : @@ -86,27 +87,21 @@ liberal trimming, i.e. of only a few bases. ## Notes -Note that the two provided adapter sequence files contain non-FASTA -characters to denote the locations of barcode sequences, which always -appear in TruSeq adapters, and may or may not appear in forward and/or -reverse reads using the original Solexa/Illumina adapter sequences, -depending on library preparation. You'll need to modify the adapter -sequence files in order to use them. - -In the case of the original Solexa/Illumina adapter sequences, we've seen -barcodes "upstream" of forward reads (in which case the reverse complement -of the barcode will appear before the adapter sequence at the 3'-end of -reverse reads - replacing the [NNNNNN]). We've also seen barcodes upstream -of reverse reads (in which case the reverse complement of the barcode will -appear before the adapter sequence at the 3'-end of forward reads - -replacing the [MMMMMM]). Your definition of the barcode may be someone -else's reverse-complemented barcode, and the barcode may or may not be 6 -bases. - -In the case of TruSeq adapter sequences, there will always be a 6 bp -barcode in place of the [NNNNNN] in sequence contaminating forward reads -(if the fragment is short enough, of course). This barcode sequence should -match the barcode included in the reads' FASTQ headers. +Note that the provided adapter sequence files (*_adapters.fa) contain +non-FASTA characters to denote the locations of barcode sequences, +which always appear in TruSeq adapters, and may or may not appear in +forward and/or reverse reads using the original ("Solexa") Illumina +adapter sequences, depending on library preparation. You'll need to +modify the adapter sequence files in order to use them. + +An example adapters file (adap.fa) is included for ease of use. It +omits barcodes, so it can be used on all samples of an indexed pool or +set of files (the first ~30 bp are sufficient to identify adapter +contamination). However, Scythe will check reads against all adapter +sequences in the file, so a file with 6 adapter sequences will cause +roughly 6x runtime. Since your samples will never include adapters +from several types of kits, you're encouraged to omit everything but +the adapter(s) that will be found in your sequences. Scythe only checks for 3'-end contaminants, up to the adapter's length into the 3'-end. For reads with contamination in *any* position, the diff --git a/adap.fa b/adap.fa new file mode 100644 index 0000000..b298d2d --- /dev/null +++ b/adap.fa @@ -0,0 +1,16 @@ +>TruSeq_forward_contam +AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC +>TruSeq_reverse_contam +AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT +>Nextera_forward_contam +CTGTCTCTTATACACATCTCCGAGCCCACGAGAC +>Nextera_reverse_contam +CTGTCTCTTATACACATCTGACGCTGCCGACGA +>TruSeq_SmallRNA_forward_contam +TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC +>TruSeq_SmallRNA_reverse_contam +GATCGTCGGACTGTAGAACTCTGAACCTGTCG +>Solexa_forward_contam +AGATCGGAAGAGCGGTTCAGCAGGAATGCCGA +>Solexa_reverse_contam +AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTG diff --git a/nextera_adapters.fa b/nextera_adapters.fa new file mode 100644 index 0000000..a27728a --- /dev/null +++ b/nextera_adapters.fa @@ -0,0 +1,4 @@ +>Nextera_forward_contam +CTGTCTCTTATACACATCTCCGAGCCCACGAGAC[8bp index]ATCTCGTATGCCGTCTTCTGCTTG +>Nextera_reverse_contam +CTGTCTCTTATACACATCTGACGCTGCCGACGA[8bp index]GTGTAGATCTCGGTGGTCGCCGTATCATT diff --git a/solexa_adapters.fa b/solexa_adapters.fa index e7e9c68..d68e2cd 100644 --- a/solexa_adapters.fa +++ b/solexa_adapters.fa @@ -1,4 +1,4 @@ >Solexa_forward_contam -[NNNNNN]AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAA +[home-grown index?]AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTGAAAAA >Solexa_reverse_contam -[MMMMMM]AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAA +[home-grown index?]AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAA diff --git a/truseq_adapters.fa b/truseq_adapters.fa new file mode 100644 index 0000000..369866b --- /dev/null +++ b/truseq_adapters.fa @@ -0,0 +1,4 @@ +>TruSeq_forward_contam +AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC[8bp index]ATCTCGTATGCCGTCTTCTGCTTGAAAAA +>TruSeq_reverse_contam +AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT[8bp index]GTGGTCGCCGTATCATTAAAAA diff --git a/truseq_adapters.fasta b/truseq_adapters.fasta deleted file mode 100644 index 70bd06c..0000000 --- a/truseq_adapters.fasta +++ /dev/null @@ -1,4 +0,0 @@ ->TruSeq_forward_contam -AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC[NNNNNN]ATCTCGTATGCCGTCTTCTGCTTGAAAAA ->TruSeq_reverse_contam -AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAA diff --git a/truseq_smallRNA_adapters.fa b/truseq_smallRNA_adapters.fa new file mode 100644 index 0000000..0cb5aa9 --- /dev/null +++ b/truseq_smallRNA_adapters.fa @@ -0,0 +1,4 @@ +>TruSeq_SmallRNA_forward_contam +TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC[6bp adapter]ATCTCGTATGCCGTCTTCTGCTTG +>TruSeq_SmallRNA_reverse_contam +GATCGTCGGACTGTAGAACTCTGAACCTGTCG From 8dc0c8e1aeead88ef64dd0b3a0a6679e22f4247a Mon Sep 17 00:00:00 2001 From: Joseph Fass Date: Wed, 23 Sep 2015 15:25:37 -0700 Subject: [PATCH 8/9] removed adapter profiler perl script .. not useful --- profileTruSeqIDs.pl | 58 --------------------------------------------- 1 file changed, 58 deletions(-) delete mode 100755 profileTruSeqIDs.pl diff --git a/profileTruSeqIDs.pl b/profileTruSeqIDs.pl deleted file mode 100755 index e570e3e..0000000 --- a/profileTruSeqIDs.pl +++ /dev/null @@ -1,58 +0,0 @@ -#!/usr/bin/perl -w - -# AUTHOR: Joseph Fass -# LAST REVISED: August 2012 -# The Bioinformatics Core at UC Davis Genome Center -# http://bioinformatics.ucdavis.edu - -# profileTruSeqIDs.pl is the proprietary property of The Regents of -# the University of California (“The Regents.”) Copyright 2007-12 The -# Regents of the University of California, Davis campus. All Rights -# Reserved. Redistribution and use in source and binary forms, with -# or without modification, are permitted by nonprofit, research -# institutions for research use only, provided that the following -# conditions are met: Redistributions of source code must retain the -# above copyright notice, this list of conditions and the following -# disclaimer. Redistributions in binary form must reproduce the above -# copyright notice, this list of conditions and the following -# disclaimer in the documentation and/or other materials provided with -# the distribution. The name of The Regents may not be used to -# endorse or promote products derived from this software without -# specific prior written permission. The end-user understands that -# the program was developed for research purposes and is advised not -# to rely exclusively on the program for any reason. THE SOFTWARE -# PROVIDED IS ON AN "AS IS" BASIS, AND THE REGENTS HAVE NO OBLIGATION -# TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR -# MODIFICATIONS. THE REGENTS SPECIFICALLY DISCLAIM ANY EXPRESS OR -# IMPLIED WARRANTIES, INCLUDING BUT NOT LIMITED TO, THE IMPLIED -# WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND -# NONINFRINGEMENT. IN NO EVENT SHALL THE REGENTS BE LIABLE TO ANY -# PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, EXEMPLARY OR -# CONSEQUENTIAL DAMAGES, INCLUDING BUT NOT LIMITED TO PROCUREMENT OF -# SUBSTITUTE GOODS OR SERVICES, LOSS OF USE, DATA OR PROFITS, OR -# BUSINESS INTERRUPTION, HOWEVER CAUSED AND UNDER ANY THEORY OF -# LIABILITY WHETHER IN CONTRACT, STRICT LIABILITY OR TORT (INCLUDING -# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS -# SOFTWARE AND ITS DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY -# OF SUCH DAMAGE. If you do not agree to these terms, do not download -# or use the software. This license may be modified only in a writing -# signed by authorized signatory of both parties. - -# searches for TruSeq adapters in fastq, extracts and counts the 6 bp -# "barcodes," or multiplex identifiers - -while (<>) { - $seq = <>; - <>; - <>; - if ($seq =~ /AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC([ACGT]{6})/) { - $count{$1}++; - } -} -if (%count) { - foreach $ID (reverse sort {$count{$a}<=>$count{$b}} keys %count) { - print "$ID\t$count{$ID}\n"; - } -} else { - print "\nNo ID's found! Reads too short? Try visual inspection, with a shorter search string, like \"AGATCGGAAGAG\".\n\n"; -} From f39276dce8b529f1bf4fa016c43bbad2da9709cb Mon Sep 17 00:00:00 2001 From: Joseph Fass Date: Wed, 23 Sep 2015 15:28:26 -0700 Subject: [PATCH 9/9] removed profiler mentions in readme --- README.md | 9 +-------- 1 file changed, 1 insertion(+), 8 deletions(-) diff --git a/README.md b/README.md index db4e5a6..ed44e78 100644 --- a/README.md +++ b/README.md @@ -135,14 +135,7 @@ removed. This has worked well and is recommended for cases when 3'-end quality deteriorates and prevents barcode removal. Newer Illumina chemistry (TruSeq) has the barcode separated from the fragment, so that it appears as an entirely separate read that is used to -demultiplex sample reads by Illumina's CASAVA pipeline. In case the 6 -bp sample IDs are not readily available, we have included a helper -script to profile the IDs found in the forward reads of TruSeq data -set (reverse reads are subject to contamination by a different adapter -that doesn't contain variable sequence). Use this script as follows -(e.g.): - -cat reads.fq | head -4000000 | perl profileTruSeqIDs.pl > IDsFound.txt +demultiplex sample reads by the Illumina pipeline. ### Does Scythe work on 5'-end or other contaminants?