- Removed the hard limit on maximum sequence length. The hard limit caused segmentation faults when aligning ultra-long sequences.
- Merged mecat2pw and mecat2ref into a single mapping tool, mecat2map. mecat2map uses much less memory than mecat2pw.
- The candidate partition process now supports multiple CPU threads.
- Added support for multiple input formats (see Input Format).
MECAT2 is an improved version of MECAT. It is an ultra-fast and accurate Mapping, Error Correction and de novo Assembly Tool for single molecule sequencing (SMRT) reads.
MECAT2 consists of three modules:
- mecat2map, a fast and accurate alignment tool for SMRT reads.
- mecat2cns, which corrects noisy reads based on their pairwise overlaps.
- fsa, a string graph based assembly tool.
MECAT2 is written in C, C++, and Perl. It is open source and distributed under the GPLv3 license.
Please note that MECAT2 no longer supports Nanopore raw reads. We have developed NECAT, a mapping, error correction and de novo assembly pipeline designed specifically for Nanopore raw reads. See NECAT for details.
We have tested MECAT2 on CentOS release 7.3 and on Ubuntu 18.04.
- Step 1: Figure out where to install MECAT2. We will install MECAT2 and two auxiliary tools, HDF5 and dextract. We first choose the directory in which to install them. As an example, we install them in the directory /home/chenying/smrt_asm. We create this directory with the mkdir command and change into it. (The dollar sign $ that precedes each command is the prompt printed by the shell.)
$ mkdir -p /home/chenying/smrt_asm
$ cd /home/chenying/smrt_asm
$ pwd
/home/chenying/smrt_asm
For easy reference, we assign /home/chenying/smrt_asm to an environment variable MECAT_PATH:
$ export MECAT_PATH=/home/chenying/smrt_asm
$ echo ${MECAT_PATH}
/home/chenying/smrt_asm
- Step 2: Install MECAT2:
$ git clone https://github.com/xiaochuanle/MECAT2.git
$ cd MECAT2
$ make
$ cd ..
After installation, all the executables are found in ${MECAT_PATH}/MECAT2/Linux-amd64/bin. The folder name Linux-amd64 varies with the operating system.
- Step 3: Add the relevant paths to PATH:
$ export PATH=${MECAT_PATH}/MECAT2/Linux-amd64/bin:$PATH
Before running MECAT2, don't forget to add the binary path to PATH (Step 3 of Installation).
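A quick way to confirm that the PATH setting took effect is to look up the executables before starting a run. The sketch below is only an illustration: the tool names are the ones mentioned in this README, and the helper `missing_tools` is a hypothetical name, not part of MECAT2.

```python
import shutil

# Executable names taken from this README; adjust the list if your build differs.
REQUIRED_TOOLS = ["mecat.pl", "mecat2map", "mecat2cns", "fsa_ol_filter", "fsa_assemble"]


def missing_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the tools that cannot be found on PATH (or on `path` if given)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed MECAT2 executables were found on PATH.")
```

If any name is reported as missing, re-check the export command in Step 3 of the installation.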
Here we assemble the genome of E. coli as an example and go through each step in order. Details of each step are given in the next section.
- Step 1: Download the dataset. We download the raw reads ecoli_filtered.fastq.gz into the directory ${MECAT_PATH}/ecoli:
$ mkdir -p ${MECAT_PATH}/ecoli
$ cd ${MECAT_PATH}/ecoli
$ wget http://gembox.cbcb.umd.edu/mhap/raw/ecoli_filtered.fastq.gz
After that, we have the raw reads file ${MECAT_PATH}/ecoli/ecoli_filtered.fastq.gz:
$ ls
ecoli_filtered.fastq.gz
- Step 2: Prepare the config file. We create a config file template using the following command:
$ mecat.pl config ecoli_config_file.txt
This command creates a config file ecoli_config_file.txt, which looks like
PROJECT=
RAWREADS=
GENOME_SIZE=
THREADS=4
MIN_READ_LENGTH=2000
CNS_OVLP_OPTIONS="-kmer_size 13"
CNS_PCAN_OPTIONS="-p 100000 -k 100"
CNS_OPTIONS=""
CNS_OUTPUT_COVERAGE=30
TRIM_OVLP_OPTIONS="-skip_overhang"
TRIM_PM4_OPTIONS="-p 100000 -k 100"
TRIM_LCR_OPTIONS=""
TRIM_SR_OPTIONS=""
ASM_OVLP_OPTIONS=""
FSA_OL_FILTER_OPTIONS="--max_overhang=-1 --min_identity=-1"
FSA_ASSEMBLE_OPTIONS=""
CLEANUP=0
After filling in the relevant information, we have
PROJECT=ecoli
RAWREADS=/home/chenying/smrt_asm/ecoli/ecoli_filtered.fastq.gz
GENOME_SIZE=4800000
THREADS=4
MIN_READ_LENGTH=2000
CNS_OVLP_OPTIONS="-kmer_size 13"
CNS_PCAN_OPTIONS="-p 100000 -k 100"
CNS_OPTIONS=""
CNS_OUTPUT_COVERAGE=30
TRIM_OVLP_OPTIONS="-skip_overhang"
TRIM_PM4_OPTIONS="-p 100000 -k 100"
TRIM_LCR_OPTIONS=""
TRIM_SR_OPTIONS=""
ASM_OVLP_OPTIONS=""
FSA_OL_FILTER_OPTIONS="--max_overhang=-1 --min_identity=-1"
FSA_ASSEMBLE_OPTIONS=""
CLEANUP=0
- Step 3: Correct Raw Reads. Correct the noisy raw reads using the following command:
$ mecat.pl correct ecoli_config_file.txt
- Step 4: Trim Out Low Quality Subsequences in Corrected Reads.
$ mecat.pl trim ecoli_config_file.txt
- Step 5: Assemble Contigs Using the Trimmed Reads
$ mecat.pl assemble ecoli_config_file.txt
- Step 6: Where to Find Results
  - The file ${MECAT_PATH}/ecoli/ecoli/1-consensus/cns_reads_list.txt contains the full paths of all corrected reads files.
$ cat ${MECAT_PATH}/ecoli/ecoli/1-consensus/cns_reads_list.txt
/home/chenying/smrt_asm/ecoli/ecoli/1-consensus/cns_cns_dir/p00000000.cns.fasta
  - The longest 30x of corrected reads (the number 30 is set by the CNS_OUTPUT_COVERAGE option in the config file), which are used for trimming, are in ${MECAT_PATH}/ecoli/ecoli/1-consensus/cns_final.fasta.
  - The trimmed reads are in ${MECAT_PATH}/ecoli/ecoli/2-trim_bases/trimReads.fasta.
  - The assembled contigs are in ${MECAT_PATH}/ecoli/ecoli/4-fsa/contigs.fasta.
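The result locations follow one pattern: the working directory, then the project name, then a numbered stage folder. The sketch below collects them in one place, assuming the layout shown in this example holds for other projects too; `result_paths` and the dictionary keys are hypothetical names chosen for illustration.

```python
from pathlib import Path


def result_paths(work_dir, project):
    """Build the result file paths described in Step 6.

    `work_dir` is the directory in which mecat.pl was run and `project`
    is the PROJECT value from the config file.
    """
    base = Path(work_dir) / project
    return {
        "corrected_reads_list": base / "1-consensus" / "cns_reads_list.txt",
        "corrected_reads_for_trimming": base / "1-consensus" / "cns_final.fasta",
        "trimmed_reads": base / "2-trim_bases" / "trimReads.fasta",
        "contigs": base / "4-fsa" / "contigs.fasta",
    }


if __name__ == "__main__":
    # Report which of the expected result files are present.
    for name, path in result_paths("/home/chenying/smrt_asm/ecoli", "ecoli").items():
        status = "found" if path.is_file() else "missing"
        print(f"{name}: {path} ({status})")
```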
The input to MECAT2 is indicated by the RAWREADS option in the config file. It must be a full path. MECAT2 supports several different input formats:
- H5 format. Files in H5 format must first be converted to FASTA format with ${MECAT_PATH}/DEXTRACT/dextract. For example:
$ find pathto/raw_reads -name "*.bax.h5" -exec readlink -f {} \; > reads.fofn
$ while read line; do dextract -v "$line" >> reads.fasta ; done < reads.fofn
After the conversion, proceed with one of the following input cases.
- FASTA format.
RAWREADS=/Users/sysu/Desktop/files/programs/ecoli/pacbio/ecoli/raw_reads.fasta
Or FASTA format compressed with GNU Zip (gzip):
RAWREADS=/Users/sysu/Desktop/files/programs/ecoli/pacbio/ecoli/raw_reads.fasta.gz
- FASTQ format.
RAWREADS=/Users/sysu/Desktop/files/programs/ecoli/pacbio/ecoli/raw_reads.fastq
Or FASTQ format compressed with GNU Zip (gzip):
RAWREADS=/Users/sysu/Desktop/files/programs/ecoli/pacbio/ecoli/raw_reads.fastq.gz
- List format. A file that lists the full paths of all raw reads files.
RAWREADS=/Users/sysu/Desktop/files/programs/tomato/read_list.txt
$ cat /Users/sysu/Desktop/files/programs/tomato/read_list.txt
/share/home/chuanlex/xiaochuanle/data/testdata/tomato/20161027_Spenn_001_001_all.fastq
/share/home/chuanlex/xiaochuanle/data/testdata/tomato/20161101_Spenn_002_002_all.fastq
/share/home/chuanlex/xiaochuanle/data/testdata/tomato/20161103_Spenn_003_003_all.fastq
/share/home/chuanlex/xiaochuanle/data/testdata/tomato/20161108_Spenn_004_004_all.fastq
/share/home/chuanlex/xiaochuanle/data/testdata/tomato/20161108_Spenn_004_005_all.fastq
Please note that the files in read_list.txt need not all be in the same format. Each file can independently be either FASTA or FASTQ, and each can additionally be compressed with GNU Zip (gzip).
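A list file of this kind can be written by hand or generated. The following sketch writes one full path per line, as in the example above. The suffix check mirrors the file names used in this section and is an assumption made for illustration; MECAT2 itself may accept other file names. `write_read_list` is a hypothetical helper, not a MECAT2 command.

```python
from pathlib import Path

# Suffixes mirroring the examples in this section (an assumption, not a MECAT2 rule).
SUPPORTED_SUFFIXES = (".fasta", ".fasta.gz", ".fastq", ".fastq.gz")


def write_read_list(read_files, list_path):
    """Write the absolute path of each reads file to `list_path`, one per line."""
    lines = []
    for read_file in read_files:
        path = Path(read_file).resolve()  # the list must contain full paths
        if not path.name.endswith(SUPPORTED_SUFFIXES):
            raise ValueError(f"unexpected file name: {path}")
        if not path.is_file():
            raise FileNotFoundError(path)
        lines.append(str(path))
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines
```

The resulting file can then be given as the RAWREADS value in the config file, again as a full path.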
We describe each module of MECAT2 in detail, including its options and output formats.
MECAT2 reads all of its information, including the project name, the raw reads, and the various running parameters, from a config file. To create a config file template, just run
$ mecat.pl config config_file_name
The above command creates a config file named config_file_name. We saw a sample config file in the previous section:
PROJECT=ecoli
RAWREADS=/home/chenying/smrt_asm/ecoli/ecoli_filtered.fastq.gz
GENOME_SIZE=4800000
THREADS=4
MIN_READ_LENGTH=2000
CNS_OVLP_OPTIONS="-kmer_size 13"
CNS_PCAN_OPTIONS="-p 100000 -k 100"
CNS_OPTIONS=""
CNS_OUTPUT_COVERAGE=30
TRIM_OVLP_OPTIONS="-skip_overhang"
TRIM_PM4_OPTIONS="-p 100000 -k 100"
TRIM_LCR_OPTIONS=""
TRIM_SR_OPTIONS=""
ASM_OVLP_OPTIONS=""
FSA_OL_FILTER_OPTIONS="--max_overhang=-1 --min_identity=-1"
FSA_ASSEMBLE_OPTIONS=""
CLEANUP=0
The meaning of each option is given below.
- PROJECT=ecoli, the name of the project. In this example, a directory ecoli will be created in the current directory, and everything will take place in the directory ecoli.
- RAWREADS=, the raw reads (with full path) to be processed by MECAT2. See Input Format.
- GENOME_SIZE=, the size (in bp) of the underlying genome.
- THREADS=, the number of CPU threads used by MECAT2.
- MIN_READ_LENGTH=, the minimal length of corrected reads and trimmed reads.
- CNS_OVLP_OPTIONS="", options for detecting overlap candidates in the correction stage. Run mecat2map -help for details. Note that the output format is seqidx (-outfmt seqidx), which is set internally by mecat.pl.
- CNS_OPTIONS="", options for correcting raw reads. Run mecat2cns -help for details.
- TRIM_OVLP_OPTIONS="", options for detecting overlaps in the trimming stage. Run mecat2map -help for details. Note that the output format is m4x (-outfmt m4x), which is set internally by mecat.pl.
- ASM_OVLP_OPTIONS="", options for detecting overlaps in the assembly stage. Run mecat2map -help for details. The output format is m4 (-outfmt m4), which is set internally by mecat.pl.
- FSA_OL_FILTER_OPTIONS="", options for filtering overlaps. See below for details.
- FSA_ASSEMBLE_OPTIONS="", options for assembling trimmed reads. See below for details.
- USE_GRID=false, whether to use multiple computing nodes (true) or not (false).
- CLEANUP=0, whether to delete the intermediate data generated by MECAT2 (1) or not (0). Please note that when assembling large genomes, the intermediate data can be very large.
- CNS_OUTPUT_COVERAGE=30, the coverage of the longest corrected reads that are extracted to be trimmed and then assembled. In this example, 30x (specifically, 30 * 4800000 = 144 Mbp) of the longest corrected reads will be extracted.
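The arithmetic behind CNS_OUTPUT_COVERAGE can be checked directly from a config file. The sketch below assumes the config consists of plain KEY=VALUE lines with optional double quotes, as in the sample above; `parse_config` and `bases_to_extract` are hypothetical helper names and say nothing about how mecat.pl itself reads the file.

```python
def parse_config(text):
    """Parse KEY=VALUE lines into a dict, removing surrounding double quotes."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip().strip('"')
    return options


def bases_to_extract(options):
    """Number of bases covered by CNS_OUTPUT_COVERAGE x GENOME_SIZE."""
    return int(options["CNS_OUTPUT_COVERAGE"]) * int(options["GENOME_SIZE"])


if __name__ == "__main__":
    sample = 'PROJECT=ecoli\nGENOME_SIZE=4800000\nCNS_OUTPUT_COVERAGE=30\nCNS_OVLP_OPTIONS="-kmer_size 13"\n'
    config = parse_config(sample)
    print(bases_to_extract(config))  # 30 * 4800000 = 144000000
```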
For ease of use, we have integrated all the procedures into one Perl script, mecat.pl, which works in the following steps:
- mecat.pl config: as mentioned above, this command creates a config file.
- mecat.pl correct: correct raw reads, which consists of three steps:
  - detect overlap candidates using mecat2map.
  - partition the overlap candidates into several parts using mecat2pcan. Each part contains the overlap candidates needed for correcting 100000 raw reads.
  - correct raw reads based on the overlap candidates using mecat2cns.
- mecat.pl assemble: assemble corrected reads in three steps:
  - extract the longest 30x of corrected reads with mecat2extseqs.
  - trim out low quality subsequences in two steps:
    - detect overlaps of the extracted reads using mecat2map.
    - trim out low quality subsequences based on their overlaps using mecat2lcr, mecat2splitreads and mecat2trimbases.
  - assemble trimmed reads into contigs in three steps:
    - detect overlaps of the trimmed reads using mecat2map.
    - filter out low quality overlaps using fsa_ol_filter.
    - assemble trimmed reads into contigs based on high quality overlaps using fsa_assemble.
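For unattended runs, the mecat.pl stages can be chained from a small driver. The sketch below uses the three subcommands shown in the E. coli walkthrough (correct, trim, assemble) and stops at the first failing stage. It is a minimal illustration under those assumptions; `stage_commands` and `run_pipeline` are hypothetical names.

```python
import subprocess

# Subcommands as used in the walkthrough earlier in this README.
STAGES = ("correct", "trim", "assemble")


def stage_commands(config_file, stages=STAGES):
    """Return one argument list per stage, in the order they should run."""
    return [["mecat.pl", stage, config_file] for stage in stages]


def run_pipeline(config_file, dry_run=True):
    """Print the commands (dry run) or execute them one after another."""
    for command in stage_commands(config_file):
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError, so later stages are skipped
            # when an earlier stage fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline("ecoli_config_file.txt", dry_run=True)
```

Running with dry_run=True first makes it easy to confirm the command order before committing to a long run.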
The command for running mecat2map is
mecat2map [OPTIONS] reads reference > results.m4
fsa_ol_filter is used for filtering out low-quality overlaps. The usage of fsa_ol_filter is
fsa_ol_filter [options] overlaps filtered_overlaps
The options are
- --min_length=INT, minimum length of reads (default: 2500).
- --max_length=INT, maximum length of reads (default: INT_MAX).
- --min_identity=DOUBLE, minimum identity of overlaps (default: 90).
- --min_aligned_length=INT, minimum aligned length of overlaps (default: 2500).
- --max_overhang=INT, maximum overhang of overlaps (default: 10), negative number = determined by the program.
- --min_coverage=INT, minimum base coverage (default: -1), negative number = determined by the program.
- --max_coverage=INT, maximum base coverage (default: -1), negative number = determined by the program.
- --max_diff_coverage=INT, maximum difference of base coverage (default: -1), negative number = determined by the program.
- --coverage_discard=DOUBLE, discard ratio of base coverage (default: 0.01). If --max_coverage or --max_diff_coverage is negative, it will be reset to the (100-coverage_discard)th percentile.
- --overlap_file_type="|m4|paf|ovl", overlap file format (default: ""). "" = inferred from the filename extension, "m4" = M4 format, "paf" = PAF format generated by minimap2, "ovl" = OVL format generated by FALCON.
- --bestn=INT, output the best n overlaps on the 5' or 3' end of each read (default: 10).
- --genome_size=INT, genome size. Together with --coverage, it determines the maximum length of reads.
- --coverage=INT, coverage. Together with --genome_size, it determines the maximum length of reads.
- --output_directory=STRING, directory for output files (default: ".").
- --thread_size=INT, number of threads (default: 4).
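To make the percentile rule for --coverage_discard concrete, the sketch below computes a (100 - coverage_discard)th percentile with the nearest-rank method. This is only an illustration of the idea: the choice of nearest-rank, the function name `percentile_cutoff`, and the input (a plain list of coverage values) are assumptions, and fsa_ol_filter may compute its threshold differently.

```python
import math


def percentile_cutoff(coverages, coverage_discard=0.01):
    """Nearest-rank (100 - coverage_discard)th percentile of `coverages`."""
    if not coverages:
        raise ValueError("coverages must not be empty")
    ordered = sorted(coverages)
    percentile = 100.0 - coverage_discard
    # Nearest-rank: the smallest value with at least `percentile` percent
    # of the data at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


if __name__ == "__main__":
    values = list(range(1, 101))  # toy coverage values 1..100
    print(percentile_cutoff(values, coverage_discard=1))  # 99th percentile
```

With a small discard ratio such as the default 0.01, only the most extreme coverage values fall above the cutoff.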
fsa_assemble is a tool for constructing contigs from filtered overlaps and corrected reads. The algorithm is similar to FALCON. The usage of fsa_assemble is
fsa_assemble [options] filtered_overlaps
The options are
- --min_length=INT, minimum length of reads (default: 0).
- --min_identity=DOUBLE, minimum identity of overlaps (default: 0).
- --min_aligned_length=INT, minimum aligned length of overlaps (default: 0).
- --min_contig_length=INT, minimum length of contigs (default: 500).
- --read_file=STRING, reads file name in FASTA or FASTQ format.
- --overlap_file_type="|m4|paf|ovl", overlap file format (default: ""). "" = inferred from the filename extension, "m4" = M4 format, "paf" = PAF format generated by minimap2, "ovl" = OVL format generated by FALCON.
- --output_directory=STRING, directory for output files (default: ".").
- --select_branch="no|best", selection method when encountering branches in the graph, "no" = do not select any branch, "best" = select the most probable branch.
- --thread_size=INT, number of threads (default: 4).
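When the two fsa tools are run by hand rather than through mecat.pl, the command lines can be assembled from the usage strings and options listed above. The sketch below only builds the argument lists and does not execute anything. The file name filtered_overlaps.m4 and the function `fsa_commands` are hypothetical choices for illustration.

```python
import shlex


def fsa_commands(overlaps, reads, out_dir=".", threads=4):
    """Build argument lists for fsa_ol_filter followed by fsa_assemble."""
    filtered = f"{out_dir}/filtered_overlaps.m4"  # hypothetical output name
    ol_filter = [
        "fsa_ol_filter",
        f"--output_directory={out_dir}",
        f"--thread_size={threads}",
        overlaps,
        filtered,
    ]
    assemble = [
        "fsa_assemble",
        f"--read_file={reads}",
        f"--output_directory={out_dir}",
        f"--thread_size={threads}",
        filtered,
    ]
    return ol_filter, assemble


if __name__ == "__main__":
    for command in fsa_commands("overlaps.m4", "trimReads.fasta", out_dir="4-fsa"):
        print(shlex.join(command))  # shell-quoted form, ready to copy
```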
Chuan-Le Xiao, Ying Chen, Shang-Qian Xie, Kai-Ning Chen, Yan Wang, Yue Han, Feng Luo, Zhi Xie. MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nature Methods, 2017, 14: 1072-1074
- Chuan-Le Xiao, xiaochuanle@126.com
- Ying Chen, chenying2016@gmail.com
- Fan Nie, niefan@csu.edu.cn
- Feng Luo, luofeng@clemson.edu
Updates in MECAT2 (2019.3.14):
- Add some improvements to FSA.
- Optimize the installation method.
Updates in MECAT2 (2019.2):
- Fix many bugs in MECAT.
- Replace the assembly module mecat2canu with fsa.
Updates in MECAT V1.3 (2017.12.18):
- Correct a text error in HDF5 Installation.
- Update the makefile in dextract.
- Update citation.
Updates in MECAT V1.2 (2017.5.22):
- Add a trimming module in mecat2canu to improve the integrity of the assembly.
- Add support for Nanopore data.
- Improve the sensitivity of mecat2ref.
MECAT v1.1 replaced the old MECAT; some bugs were fixed and some new functions were added:
- We added extraction tools for raw H5 format files.
- Some bugs encountered when running mecat2canu were fixed.