DupLess is a duplication removal tool for heterozygous genomes based on read coverage and pairwise alignments.
Most of the currently available assemblers are designed to work on highly inbred, homozygous species and are treating differing haplotypes as separate contigs. However, for outbred plant genomes where inbreeding is not possible, attempts to assemble a highly heterozygous species often results in a heavily duplicated assembly and over-estimation of the genome length. For such cases, we created "DupLess", a tool capable of detecting and removing the duplicated regions resulting from heterozygosity in diploid genomes.
DupLess is supported on Linux and Mac OS.
You will need to have the following dependencies:
Note: The following python packages are already built-in from python2.7 and do not need to be installed:
getopt
subprocess
multiprocessing
sys
os
- Python v2 or v3
- The following python packages: numpy, pandas, biopython, matplotlib.pyplot, getopt, subprocess, multiprocessing, sys, os.
- samtools v1.9 or higher (DupLess will not work with version prior to 1.9, as it needs the "-o" parameter) To install samtools 1.9:
wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
tar -vxjf samtools-1.9.tar.bz2
cd samtools-1.9
make
To install samtools you may also have to install HTSlib (cf: https://www.biostars.org/p/328831/)
- bedtools v2.27.1 or higher (lower versions should also work now, but only v2.26 has been tested)
- blastn v2.6.0+ or higher
- awk and sed (Available by default on Linux Systems)
DupLess needs to have the dependencies (samtools ,bedtools, blastn, awk and sed) available in the $PATH to work. You can run the following command to add a tool to the PATH variable:
export PATH=/path/to/tool_folder/:$PATH
DupLess itself is a collection of python scripts, no installation is needed. You just have to clone the repository (or directly download the python files) and run "python DupLess.py" to use it.
git clone https://github.com/MCorentin/DupLess
cd DupLess
python DupLess.py --help
The test_data folder contains two files:
- AT_duplicated.fa: this is a subset of chr3 of Arabidopsis thaliana with artificially induced duplications (15 sequences, see below for explanations)
- AT_duplicated_simReads.sorted.coverage.gz: the coverage file for these 3 contigs, based on simulated reads you will need to unzip this file to run DupLess.
To test if your DupLess installation works you can run the following command (~30 min):
cd test_data/
gunzip AT_duplicated_simReads.sorted.coverage.gz
python DupLess.py -t 20 -o dupless_AT -b AT_duplicated_simReads.sorted.coverage -a AT_duplicated.fa -w 250 -c 50 -i 95 -l 500
The 15 duplicated sequences should be removed or heavily truncated, you can filter the fasta by length to remove remaining artifacts.
After this pipeline AT_duplicated.fa contained 15 duplicated regions, with 25x coverage. The non-duplicated regions had ~50x coverage. DupLess was run on this dataset and managed to remove 95% of the induced duplicated sequences.
The list of duplicated contigs is available in: "test_data/additional_data/mutated_list.txt".
The test dataset was created with the following pipeline (to simulate duplication due to heterozygosity):
- Extraction of chr3 of Arabidopsis thaliana assembly (ftp://ftp.ensemblgenomes.org/pub/plants/release-42/fasta/arabidopsis_thaliana/dna/).
- Creation of contig assembly by splitting chr3 everytime 2 or more consecutive "N" appeared.
- Creation of mutated version of contig assembly with BBmap's "mutate.sh" (mutation rate: 3%).
- Extraction of 15 regions from the mutated genome (~33 kbp in total) and concatenation of these 15 sequences with the contig assembly (to induce duplication).
- Simulation of reads with 25x coverage on the whole assembly.
- Simulation of reads with 25x coverage only on non-duplicated regions.
Required
-
The assembly in fasta format.
-
A bed file with the coverage value for each base of the assembly. (See below for instructions on how to create this file: "Running DupLess on your own assembly:").
Optional
- A bed file containing the gaps coordinates in the assembly. If provided, they will be represented as grey bars on the coverage graphs.
python DupLess.py -t [nb_threads] -b [coverage.bed] -a [assembly.fasta] -w [window_size] -c [expected_coverage] -i [min_blast_identity] -l [min_blast_length] -o [output_folder]
Required:
-a/--assembly The assembly corresponding to the bed coverage in fasta format.
-b/--bed_cov The bed file containing the coverage at each base (can be generated with 'bedtools genomecov').
/!\ If using paired end reads: make sure that you set the -w or -l option higher than the insert size,
to avoid false positives due to coverage drop at the ends of contigs (because of unaligned mates).
Optional:
-t/--nThreads The number of threads (default 10)
-o/--out_folder The output folder (default './DupLess_out/')
-c/--expected_cov The expected read coverage for the homozygous regions. The homozygosity / heterozygosity will be determined based on this value.
You can determine the value to use by plotting the coverage distribution. It should correspond to the homozygous peak
If no value is given, it will be based on the mode of the coverage distribution (not reliable if high heterozygosity).
-w/--window_size The size of the windows in basepairs (default: 1000)
The value of the coverage for each window will be the median of the coverage at each base.
All the windows classified as 'heterozygous' will be considered for the detection of duplication.
-g/--bed_gaps A bed file containing the gaps along the genome. If given, the graphs will contain a grey background where the gaps are.
-i/--blast_identity The minimum percentage of identity between the het regions to consider them duplicates (default: 90, range 0 to 100).
-l/--blast_length The blast alignments with a length lower than this threshold will be filtered (default=300).
-n/--no_plot Skip the creation of all the plots
Skipping part of pipeline:
-s/--skip_het_detection Skip the detection of the heterozygous regions. If so, you must provide a bed identifying the heterozygous regions:
python DupLess.py -s [het_regions_bed] -t [nb_threads] -a [assembly.fasta] -i [min_blast_identity] -l [min_blast_length] -o [new_output_folder]
-f/--filter_blast_only Skip the detection of the heterozygous regions AND the pairwise alignments. If so, you must provide a blast ouput with -oufmt 6:
python DupLess.py -f [blast_output] -t [nb_threads] -a [assembly.fasta] -i [min_blast_identity] -l [min_blast_length] -o [new_output_folder]
Other:
-h/--help Print the usage and help and exit.
-v/--version Print the version and exit.
The output folder will contains the following subfolders:
- individuals_beds/ contains the bed files describing the het. regions for each sequence.
- invidual_blasts/ contains the blast results for each het. region.
- graphs/ contains the coverage graphs for each sequence (see below).
- temp/ contains temp file for blast.
- deduplicated/ contains the results of DupLess: deduplicated.fasta and discarded.fasta
The output folder will also contain the following files:
- Heterozygous_regions_ALL.bed: A bed file with the identified heterozygous regions, useful if one wants to explore the regions in more details.
- Histogram_coverage.png: A histogram of the coverage distribution, to help the user decide the expected coverage value (see below).
- All_Blasts_scaffolds_coord.tab: The raw results of the blast between the heterozygous regions.
We recommand filtering the resulting deduplicated assembly by length as DupLess does not remove entire contigs, so some very small contigs may be present in the output
Coverage plots like the one below are produced for each sequence in the assembly (under the graphs/ folder), heterozygous regions are identified in red:

You need to generate a file with the coverage value at each position (format: "sequence_name position coverage"). You can use any pipeline you want to generate this file. We used the following pipeline to generate the coverage files durnig our testing (/!\ do not forget the "-d" option for "genomecov" as DupLess needs the coverage on every bases):
bwa index genome.fa
bwa mem -t 20 genome.fa read_1.fq read_2.fq | samtools view -b -@ 20 -o genome_reads.bam
samtools sort -@ 20 genome_reads.bam > genome_reads.sorted.bam
bedtools genomecov -ibam genome_reads.sorted.bam -d > genome_reads.coverage
You can then use "genome_reads.coverage" with the "-b/--bed_cov" parameter. We recommand using short reads with a high coverage. For paired end reads we recommand a short insert size (cf: How to choose the right value for the window length ("-w/--window_size" option): below).
The expected coverage should be the coverage corresponding to the homozygous regions. To choose it you can plot the coverage distribution from the coverage file.
You can use R:
cov <- read.table("illumina.coverage")
hist(cov$V3)
If your genome is heterozygous, you should obtain two peaks (see graph below):

The second peak corresponds to the coverage on the homozygous regions, and the value on the x-axis for the maximum of this peak corresponds to the homozygous coverage. One of the first step of DupLess is plotting the coverage histogram with a red bar representing the expected coverage to help you deciding if you chose the right value for this parameter (see figure above).
DupLess works with a window-based approach: each sequence is splitted into windows of a certain size. Each window will be catergorized as "homozygous", "heterozygous" or "outlier" depending on the coverage median for this window.
Larger window sizes will decrease the running time but may decrease the sensitivity of DupLess. Smaller window sizes will increase the sensitivity but also possibly the number false positives, this may be counterbalanced by choosing a higher value for the blast length threshold (-l/--blast_length). The optimal value depends on the fragmentation of your assembly and your objectives: removing as many duplications as possible (small window size) or reducing the number of false positives (large window size).
/!\ When using paired end reads (when creating the coverage bed file), a drop in coverage is expected at the extreme ends of the sequences. This is due to the fact that only one read of the pair align to these regions. We recommand setting -w or -l higher than the insert size when using paired end reads to avoid false positives.
We have implemented two options to allow the user to skip parts of the pipeline:
- -s/--skip_het_detection: to avoid the detection of the heterozygous regions. You need to give a bed file with the coordinates of the regions to blast (it can be from a previous run of DupLess: "Heterozygous_regions_ALL.bed").
python DupLess.py -s [het_regions_bed] -t [nb_threads] -a [assembly.fasta] -i [min_blast_identity] -l [min_blast_length] -o [new_output_folder]
- -f/--filter_blast_only": an option to try different blast thresholds, this will modify the sensitivity/specificity of DupLess. You need to give a blast output with "-outfmt 6" (it can be from a previous run of DupLess: "All_Blasts_scaffolds_coord.tab").
python DupLess.py -f [blast_output] -t [nb_threads] -a [assembly.fasta] -i [min_blast_identity] -l [min_blast_length] -o [new_output_folder]
DupLess workflow is composed of two main steps:
-
The detection, based on the coverage, of heterozygous regions in the assembly.
-
The detection of duplicates among the heterozygous regions, based on their sequences similarity (using megablast).
For the first step, DupLess processes each sequence by splitting it into windows of size defined by the "-w/--window_size" option. Then the median coverage of each window is calculated based on the coverage at each base. The window is classified in three categories depending on its median value (EC = Expected Coverage):
- Heterozygous if: "0 < median <= EC / 1.5"
- Homozygous if: "EC / 1.5 < median < EC * 1.5"
- Outlier if: "median = 0 OR median >= EC * 1.5"
Only the heterozygous regions are considered for later analysis and consecutives heterozygous windows are merged together.
The second step of the analysis is the pairwise megablast alignment of the heterozygous regions. Each region is aligned against all the others, only the best hit is retained for each region. After all the hits have been found, they are filtered by identity (-i/--blast_identity) and length (-l/--blast_length). The aligned pairs that are left after the filtering are the regions to remove from the assembly. DupLess does not remove the whole region, only the part that aligned.
For each aligned pair DupLess removes always the one on the smaller sequence (contig/scaffold) of the pair, this is done to reduce the possible misassemblies introduced by DupLess, indeed aligned pairs are mostly one large sequence that align to a much smaller and almost only heterozygous sequence, see figure below.
Output files are produced all along DupLess pipeline, so that the user can explore in more details how the duplications have been removed:
- The heterozygous regions coordinates are written to a bed file.
- The blast results are written to a tsv file.
- The classification of the windows for each sequence are plotted, with a color code for each category (green=homozygous, red=heterozygous, purple=outlier)
Q: deduplicated.fasta is the same as the assembly despite the "toDiscard.bed" being not empty.
A: Check if your assembly does not contain Windows new line characters: "\r". You can use "cat -v file", the hidden characters will appear as "^M". To resolve this issue you can run:
tr -d '\r' < assembly.fasta > assembly_newLineCorrected.fasta
- Support long reads to realign to duplicated regions and detect misassemblies
- Possibly: add Mummer to check how the regions around the duplications align to each other.
- Add an option to remove whole contigs if the blast hit span > threshold% of the total length.
- Improve speed.
- Flag regions with half the coverage but no blast hits.
For a genome of 1Gbp, DupLess requires ~50 Gb of RAM.
For running time, the bottleneck is the pairwise blasting of heterozygous regions, an assembly with a lot of regions will increase the running time significantly.
