Contig Assessment with BLAST+
-
Return to the ASC documentation system https://hpcdocs.asc.edu/ and explore the README info for the command line version of BLAST+
a. Go to https://hpcdocs.asc.edu/ and log in with your ASC credentials.
b. Click on "Sequence analysis", then "Blast+". -
From the README.txt above, you will need a script to run BLAST+. An example script named
BLAST+_example.shis located at:/home/shared/biobootcamp/data/example_ASC_queue_scripts -
You will need to make a copy of the above script to your home directory. Let’s do that now:
cd(to take you to the top level of your home directory)mkdir blast+_working_example(create a directory to work in)cd blast+_working_example(to change into the directory)cp /home/shared/biobootcamp/data/example_ASC_queue_scripts/BLAST+_example.sh .(to copy the script to your working directory)
-
Now use your choice of text editor to open the script. For example to use nano:
nano BLAST+_example.sh -
Note that comments in the script indicate where you will be tweaking your command once you decide exactly what you want to do. Close nano with
Ctrl-x. -
Now make a copy of the contig file resulting from your Ray genomic assembly performed previously to the
blast+_working_exampledirectory. The file should be namedContigs.fastaand be located in the output directory you specified in theRay_example.shscript within yourray_working_exampledirectory (Ex. 2). If for some reason you need a copy of a Ray genomic assembly, a previously owned version can be obtained following the directions below (to be run from within yourblast+_working_exampledirectory).
cp /home/shared/biobootcamp/data/Lamellibrachia_luymesi_finished_genomic_transcriptomic_assemblies/Lamellibrachia_luymesi_all_genomic_RAY_05_2015.fasta .
-
Leaving your file named
Contigs.fastais not very informative, bad practice and will lead to confusion when you happen upon it 6 months from now. So rename the file to something more descriptive using the mv command:mv Contigs.fasta Lamellibrachia_luymesi_all_genomic_RAY_05_2018.fasta
-
Generate some basic and advanced statistics for your assembly using the script
get_fasta_stats.pl:get_fasta_stats.pl –hwill provide you with options that can be supplied to the script in order to tailor the output as desired. It is suggested that the–gand–Toptions be used at a minimum. Also recall from the homework that the output from the script can be redirected to a file using>
b. Draft your get_fasta_stats.pl command by hand on scratch paper with 1) included options and 2) a redirect to an appropriately named file.
-
Once you have drafted your
get_fasta_stats.plcommands, execute it in your linux shell within theblast+_working_exampledirectory where yourLamellibrachia_luymesi_all_genomic_RAY_05_2018.fastafile is located. Now do a long listing of theblast+_working_example directorycontents to ensure that 1) the new output file exists and 2) is has a file size > 0 (otherwise they are empty). Remember that a “long listing” isls –alfrom within the directory. -
View the contents of your output file using
catorlessand assess the assembly
Now that we have assessed some of the properties for our assemblies, let’s identify whether large mitochondrial contigs or potentially a complete mt genome is present in the assembly. QUESTION: why hypothesize that large pieces of mt genome would be present in the assembly in the first place?
-
The sequences in your
Lamellibrachia_luymesi_all_genomic_RAY_05_2018.fastafile will serve as subject (i.e., they will be the basis for the database that searches are conducted against). That means that we will need an appropriate query or queries to locate and “fish out” any potentially complete mitochondrial genome. While you can easily acquire such data from NCBI via a web browser, where is the fun in that? You can also do it from the command line in a number of ways. Here, we will be using a set of relatively new, powerful and unix-friendly utilities that are part of NCBI Edirect (https://www.ncbi.nlm.nih.gov/books/NBK179288).- Type
esearch -helpandefetch -helpto get an idea of the general usage – we can essentially perform an Entrez query from the command line. e.g.esearch -db pubmed -query "Mcclintock B [AUTH]" - Try the command
esearch -db nuccore -query "Lamellibrachia luymesi [organism]"Again the output is not particularly informative but note the ‘count’ field in the xml output. Does this value change if you leave out the query qualifier “organism”? What explains the difference? - The command
efetchis the key to retrieving and formatting your esearch results. Conveniently,esearchoutput can be "piped" into efetch:esearch -db nuccore -query "Lamellibrachia luymesi [organism]" | efetch -format accLook back at the efetch help to see what this format means. - That seems useful but what we want here is to repeat the command above with the format set to fasta. Redirect the output (hint: >) of the combined commands to a new file named something like
L_luymesi_genbank.fasta
- Type
-
Your fasta output is relatively easy to scroll through, but imagine if it was 100x the size. Recall the required format of a fasta file and use grep to isolate the sequence headers:
- Type
grep \> L_luymesi_genbank.fasta(or whatever filename you chose). It looks like there are numerous potential mitochondrial sequences in our retrieval including multiple COI sequences. - Repeat your grep search using “COI” as a search term note: you won’t need the escape character ‘\’, why not? Also try
less L_luymesi_sequences.fastaand search within for COI using ‘/’ and typing “COI”. - There are many elegant ways to pull out an individual COI sequence, let’s make it easy for now and simply copy one of the COI seqeunces returned in the output of the
lesscommand and paste it into a separate, new, clearly named fasta file usingnano.
- Type
-
We can now build our subject database from the
Lamellibrachia_luymesi_all_genomic_RAY_05_2018.fastafile usingmakeblastdb.- Type the command
module load blast+/2.3.0to load the BLAST+ suite into our current workspace in interactive mode. makeblastdb -help(provides you with options for constructing your BLAST+ DB) NOTE: you will see the there are both “required” and “optional” arguments to the makeblastdb program. As you might guess, the ones in the former category are mandatory and must be supplied while those in the latter are supplementary. Read through the list of “optional” arguments and consider using some in your command.- Draft your
makeblastdbcommand with 1) all “required” and 2) any “optional” arguments out ahead of time. Consult with members of your group on the arguments that you utilized and why you are using them.
- Type the command
-
Next, since we will be conducting a nucleotide query to nucleotide database search,
blastnwill be utilized.blastn -help(provides you with options for conducting your blastn search) NOTE: unlikemakeblastdb, all arguments toblastnare “optional”. Read through the list of “optional” arguments carefully and start to identify ones you might consider essential. Some bare minimum arguments you should consider are-query,-out, and-db.- Draft your
blastncommand with any and all “optional” arguments on paper or in a text editor on your own computer. Consult with members of the group on the arguments that you are utilized and again why you are using them. Make sure to name your output file with a.blastoutfile extension.
-
Once you have drafted your
makeblastdbandblastncommands, add them to the appropriate section of theBLAST+_example.shscript usingnano. Save the script and exit fromnano. -
Now submit your script to the ASC queue system using the directions at the bottom of the script.
-
Monitor this job using
squeue.
Once we have a report (i.e., the *.blastout file) from blastn, we will want to parse it into a format that is MUCH more human readable.
-
Once
squeuehas cleared (i.e., indicating that themakeblastdbcommand andblastnsearch have successfully completed), check the contents of theblast+_working_example directorywith a “long listing” (ls -l). You should see a few new files:*.oXXXXXX(the job summary; check this in case your job appears to have run into an error and died)*.blastout(the blastn report itself containing the actual results of the search)- files ending in
*.nhr,*.ninand*.nsq(these three (3) files represent the actual BLAST+ database itself. All three are required for our BLAST+ DB to be valid and functional. Keep this in mind if you construct BLAST+ DBs in one location and then move them to another).
-
Time to parse the
*.blastoutfile into an easily accessible table using a script calledblast2table2.pl:a.
blast2table2.pl -help(provides the options for parsing yourblastnreport withblast2table2.pl) b. Draft theblast2table2.plcommand with your selected options ahead of time. Discuss with members of your group the arguments that you are using and why you are using them. HINT: consider minimally–format 9,-verboseand a > redirect to send the results to an output file with a “.blastout.parsed” file extension. -
Once you have drafted your
blast2table2.plcommand, execute it in the CLI within theblast+_working_exampledirectory. Output from the parsing script should scroll by on the screen. A “long listing” (ls -l) in theblast+_working_exampledirectory afterward should reveal a*.blastout.parsedfile whose size is > 0. -
Carefully examine the first 10 lines of the
*.blastout.parsedfile using theheadcommand (remember this from the pre-Bootcamp assignment?). The header information for each of the columns can be found in the description of–format 9inblast2table2.pl -help. -
Of the 15 fields (columns) of the
-format 9output ofblast2table2.pl, #11 seems promising if we want to:- quantify how many contigs from our assembly have homology to the chosen queries (i.e., distinct mitochondrial COI gene copies from Lamellibrachia luymesi) and
- the specific identities of those contigs in the NCBI+ DB (and by extension the original *.fasta file that we used to create the DB).
-
Let’s extract field #11 of the report and identify how many unique entries occur (i.e., # of contigs from the assembly); be sure to provide your actual
*.blastout.parsedname to the following command when executed in bash:cat <FILENAME>.blastout.parsed | awk -F "\t" '{print $11}' | sort | uniq
--- What just happened? Determine as a group exactly what the above pipeline accomplished. Remember thatman cmd_nameprovides detailed information for any command as well as its options (another one from the pre-Bootcamp assignment). Don’t worry, you’ll get more exposure to commands and pipelines like the above in your very near future …
-
Repeat the above command, redirecting output to a file named
lure.txtand examine the results withcat. How many unique entries (i.e., contig names) are in it? Is this the same across other members of the group? Let’s use the contents of this file to fish out entire FASTA nucleotide entries from theLamellibrachia_luymesi_all_genomic_RAY_05_2018.fastafile for each and every descriptor in thelure.txtfile.select_contigs.pl -h(provides options for parsing a FASTA report with this particular script). For the exercise, consider ‘lure.txt’ as analogous to the ‘select_file’ mentioned in the help info ofselect_contigs.pl. Therefore the–nflag should be specified.- Draft the
select_contigs.plcommand with all selected options out in your own editor. Name your final output file with a<FILENAME>.potential_mito_genome.fastaextension.
-
Once you have crafted your
select_contigs.plcommand with appropriate options, execute it on ASC. Check the final "*.potential_mito_genome.fasta" file theget_fasta_stats.plscript in #8 above for its statistics. Only a single contig of <~15K bp should be included in the final output file.
KEEP THIS FILE SINCE IT WILL BE USED LATER AS PART OF AN EXERCISE IN MITOCHONDRIAL GENOME ANNOTATION.
Tired of scrolling back through pages of help screens or program output? Pipe your command to the page-viewer less
e.g. get_fasta_stats.pl -h | less – as usual, type q to exit this type of pager.
https://github.com/NCBI-Hackathons/EDirectCookbook is a good resource for NCBI edirect utilities. Also see https://dataguide.nlm.nih.gov/eutilities/utilities.html#esearch and