Any contig metrics that assemblathon_stats.pl calculates from a scaffold file are heavily biased upwards, on the order of 100%. Scaffold metrics are not affected; only contig-related metrics are, and only when a scaffold file is used as input.
What is happening is that scaffolds are split into contigs by default at runs of N=25 bases, a value that is hard-coded on
L143, but other N break points are not split. This creates longer pseudo-contigs, which distort the contig Nx metrics upwards.
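To illustrate the effect, here is a minimal Python sketch of threshold-based splitting. It is not the code from assemblathon_stats.pl: the function name `split_into_contigs` is hypothetical, and it assumes the threshold means "a run of at least `min_gap` Ns", which may differ from the script's exact rule.

```python
import re

def split_into_contigs(scaffold, min_gap=25):
    """Split a scaffold at every run of at least `min_gap` Ns (assumed rule)."""
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Drop empty pieces that appear when a scaffold starts or ends with a gap.
    return [piece for piece in gap.split(scaffold) if piece]

# Toy scaffold: three 20 bp contigs joined by a 10 N gap and a 30 N gap.
scaffold = "ACGT" * 5 + "N" * 10 + "ACGT" * 5 + "N" * 30 + "ACGT" * 5

contigs_default = split_into_contigs(scaffold, min_gap=25)
contigs_any_gap = split_into_contigs(scaffold, min_gap=1)

print([len(c) for c in contigs_default])  # [50, 20]
print([len(c) for c in contigs_any_gap])  # [20, 20, 20]
```

With the 25 N threshold the 10 N gap is not treated as a break, so the first two contigs are reported as a single 50 bp pseudo-contig.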
Data for reproducing this example is available from:
HongYang Test
We have a FASTA file of scaffolds and a FASTA file of contigs, so we can run assemblathon_stats.pl against both
to compare the calculated metrics. Doing so gives an N50 contig length of 58864 for Kiwifruit_contig.fa.gz, but the N50 contig length calculated from Kiwifruit_scaffold.fa.gz is 117093, roughly twice as large.
We know that the correct contig N50 is 58864, as submitted to and verified by NCBI:
https://www.ncbi.nlm.nih.gov/assembly/GCA_000467755.1
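For reference, the sketch below shows one common definition of N50 and how fused contigs inflate it. The function `n50` and the toy lengths are illustrative assumptions; assemblathon_stats.pl may handle ties or rounding differently.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches half of the total size."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

true_contigs = [40, 30, 20, 10]   # four separate contigs
fused_contigs = [70, 20, 10]      # the 40 and 30 reported as one pseudo-contig

print(n50(true_contigs))   # 30
print(n50(fused_contigs))  # 70
```

The total size is the same in both cases, but merging two contigs into one pseudo-contig more than doubles the N50 in this toy example.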
## Contig stats
assemblathon_stats.pl Kiwifruit_contig.fa.gz > test1
## Scaffold stats
assemblathon_stats.pl Kiwifruit_scaffold.fa.gz > test2
Diffing the two output files to compare the results:
$ diff -u3 test1 test2
--- test1 2018-01-23 10:03:16.770015274 +1300
+++ test2 2018-01-23 10:43:51.605648499 +1300
@@ -1,48 +1,48 @@
----------------- Information for assembly 'Kiwifruit_contig.fa.gz' ----------------
+---------------- Information for assembly 'Kiwifruit_scaffold.fa.gz' ----------------
- Number of scaffolds 26721 ## The first part we expect to be different, contigs versus scaffolds
- Total size of scaffolds 604217145
- Longest scaffold 423496
- Shortest scaffold 200
- Number of scaffolds > 1K nt 26373 98.7%
- Number of scaffolds > 10K nt 12188 45.6%
- Number of scaffolds > 100K nt 1106 4.1%
- Number of scaffolds > 1M nt 0 0.0%
+ Number of scaffolds 7698
+ Total size of scaffolds 616114069
+ Longest scaffold 3410229
+ Shortest scaffold 896
+ Number of scaffolds > 1K nt 7620 99.0%
+ Number of scaffolds > 10K nt 2131 27.7%
+ Number of scaffolds > 100K nt 1152 15.0%
+ Number of scaffolds > 1M nt 129 1.7%
Number of scaffolds > 10M nt 0 0.0%
- Mean scaffold size 22612
- Median scaffold size 7933
- N50 scaffold length 58864
- L50 scaffold count 2977
- scaffold %A 32.54
- scaffold %C 17.59
- scaffold %G 17.60
- scaffold %T 32.27
- scaffold %N 0.00
+ Mean scaffold size 80036
+ Median scaffold size 3358
+ N50 scaffold length 646786
+ L50 scaffold count 280
+ scaffold %A 31.92
+ scaffold %C 17.25
+ scaffold %G 17.26
+ scaffold %T 31.65
+ scaffold %N 1.92
scaffold %non-ACGTN 0.00
Number of scaffold non-ACGTN nt 0
- Percentage of assembly in scaffolded contigs 0.0%
- Percentage of assembly in unscaffolded contigs 100.0%
- Average number of contigs per scaffold 1.0
-Average length of break (>25 Ns) between contigs in scaffold 0
+ Percentage of assembly in scaffolded contigs 93.7%
+ Percentage of assembly in unscaffolded contigs 6.3%
+ Average number of contigs per scaffold 2.0
+Average length of break (>25 Ns) between contigs in scaffold 1507
- Number of contigs 26721 ## The second part should be the same 26721 contigs, versus 9758 contigs (in scaffolds file)
- Number of contigs in scaffolds 0
- Number of contigs not in scaffolds 26721
- Total size of contigs 604217145
- Longest contig 423496
- Shortest contig 200
- Number of contigs > 1K nt 26373 98.7%
- Number of contigs > 10K nt 12188 45.6%
- Number of contigs > 100K nt 1106 4.1%
+ Number of contigs 15529
+ Number of contigs in scaffolds 9758
+ Number of contigs not in scaffolds 5771
+ Total size of contigs 604305128
+ Longest contig 830300
+ Shortest contig 65
+ Number of contigs > 1K nt 15348 98.8%
+ Number of contigs > 10K nt 7647 49.2%
+ Number of contigs > 100K nt 1895 12.2%
Number of contigs > 1M nt 0 0.0%
Number of contigs > 10M nt 0 0.0%
- Mean contig size 22612
- Median contig size 7933
- N50 contig length 58864 ## N50 is considerably different 58864 versus 117093
- L50 contig count 2977
+ Mean contig size 38915
+ Median contig size 9483
+ N50 contig length 117093
+ L50 contig count 1517
contig %A 32.54
contig %C 17.59
contig %G 17.60
The break size between HongYang scaffolds is n=2000. If we specify this explicitly in the call to assemblathon_stats.pl, we get even more spurious results:
$ assemblathon_stats.pl -n 2000 Kiwifruit_scaffold.fa.gz
---------------- Information for assembly 'Kiwifruit_scaffold.fa.gz' ----------------
Number of scaffolds 7698
Total size of scaffolds 616114069
Longest scaffold 3410229
Shortest scaffold 896
Number of scaffolds > 1K nt 7620 99.0%
Number of scaffolds > 10K nt 2131 27.7%
Number of scaffolds > 100K nt 1152 15.0%
Number of scaffolds > 1M nt 129 1.7%
Number of scaffolds > 10M nt 0 0.0%
Mean scaffold size 80036
Median scaffold size 3358
N50 scaffold length 646786
L50 scaffold count 280
scaffold %A 31.92
scaffold %C 17.25
scaffold %G 17.26
scaffold %T 31.65
scaffold %N 1.92
scaffold %non-ACGTN 0.00
Number of scaffold non-ACGTN nt 0
Percentage of assembly in scaffolded contigs 73.5%
Percentage of assembly in unscaffolded contigs 26.5%
Average number of contigs per scaffold 1.8
Average length of break (>25 Ns) between contigs in scaffold 1507
Number of contigs 13777
Number of contigs in scaffolds 7076
Number of contigs not in scaffolds 6701
Total size of contigs 605485125
Longest contig 1554749
Shortest contig 65
Number of contigs > 1K nt 13656 99.1%
Number of contigs > 10K nt 6737 48.9%
Number of contigs > 100K nt 1840 13.4%
Number of contigs > 1M nt 8 0.1%
Number of contigs > 10M nt 0 0.0%
Mean contig size 43949
Median contig size 9240
N50 contig length 140261
L50 contig count 1175
contig %A 32.48
contig %C 17.56
contig %G 17.56
contig %T 32.21
contig %N 0.20
contig %non-ACGTN 0.00
Number of contig non-ACGTN nt 0
Now the contig N50 has increased to 140,261; it should be 58,864.
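One way to check which break sizes are actually present in a scaffold file is to tally the lengths of all N runs. The sketch below is a hypothetical helper, not part of assemblathon_stats.pl; it takes sequences as plain strings and leaves reading the FASTA file to the caller.

```python
import re
from collections import Counter

def gap_length_counts(sequences):
    """Count how often each N-run length occurs across the given sequences."""
    counts = Counter()
    for seq in sequences:
        for match in re.finditer(r"[Nn]+", seq):
            counts[len(match.group())] += 1
    return counts

# Toy input: one 5 N gap and one 2000 N gap in the first sequence,
# and a leading 5 N run in the second.
toy = ["ACGT" + "N" * 5 + "ACGT" + "N" * 2000 + "AC", "NNNNNACGT"]
for gap_len, n in sorted(gap_length_counts(toy).items()):
    print(gap_len, n)
```

If the tally shows many runs shorter than the threshold passed with -n, those gaps are not being treated as contig breaks.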