Contig related Nx metrics from Scaffolds are biased upwards #2

@mdavy86

Contig metrics that assemblathon_stats.pl calculates from a scaffold file are biased upwards by roughly 100%. This does not affect the scaffold metrics, only the contig-related metrics reported when a scaffold file is used as input.

What is happening is that scaffolds are split into contigs at breaks of N=25 bases by default, a value hard-coded on
L143, but other N break points are not split. This creates longer pseudo-contigs, which distorts the contig Nx metrics upwards.
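
The effect of unsplit N runs can be sketched in a few lines of Python. This is only an illustration, not the code in assemblathon_stats.pl: the function name `split_scaffold`, the `min_gap` parameter, and the rule of splitting at runs of at least `min_gap` Ns are assumptions made for the example.

```python
import re


def split_scaffold(seq, min_gap=25):
    """Split a scaffold into contigs at runs of at least `min_gap` Ns.

    Hypothetical helper for illustration; shorter N runs stay inside a contig.
    """
    pattern = re.compile(r"N{%d,}" % min_gap, flags=re.IGNORECASE)
    # Drop empty pieces that appear when a scaffold starts or ends with Ns.
    return [piece for piece in pattern.split(seq) if piece]


# Toy scaffold: three real contigs joined by a 10 N gap and a 30 N gap.
scaffold = "A" * 100 + "N" * 10 + "C" * 100 + "N" * 30 + "G" * 100

# With a threshold of 25 the 10 N gap is not split, so the first two
# contigs are reported as a single 210 nt pseudo-contig.
print([len(c) for c in split_scaffold(scaffold, min_gap=25)])  # [210, 100]

# With a threshold of 10 all three contigs are recovered.
print([len(c) for c in split_scaffold(scaffold, min_gap=10)])  # [100, 100, 100]
```

In this toy case any gap shorter than the threshold silently merges its neighbouring contigs, which is the kind of pseudo-contig described above.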

The data needed to reproduce this example are available from:

HongYang Test

We have a fasta file of scaffolds and a fasta file of contigs, so we can run assemblathon_stats.pl against both
to compare the calculated metrics. Doing this gives an N50 contig length of 58864 for Kiwifruit_contig.fa.gz, but the N50 contig length calculated from Kiwifruit_scaffold.fa.gz is 117093, roughly twice as large.

We know that the correct contig N50 is 58864, the value submitted to and verified by NCBI:

https://www.ncbi.nlm.nih.gov/assembly/GCA_000467755.1
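
For readers unfamiliar with the metric, here is a minimal sketch of an N50 calculation and of why merged contigs push it upwards. The function `n50` and the toy lengths are made up for illustration and are not taken from assemblathon_stats.pl or from the kiwifruit data.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Five true contigs, 150 nt in total.
true_contigs = [50, 40, 30, 20, 10]

# The same bases, but the two longest contigs were never split apart
# and are counted as a single 90 nt pseudo-contig.
pseudo_contigs = [90, 30, 20, 10]

print(n50(true_contigs))    # 40
print(n50(pseudo_contigs))  # 90
```

The total number of bases is unchanged, yet the N50 more than doubles because fewer, longer sequences now cover half of the assembly.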

## Contig stats
assemblathon_stats.pl Kiwifruit_contig.fa.gz > test1

## Scaffold stats
assemblathon_stats.pl Kiwifruit_scaffold.fa.gz > test2

Diffing the two output files to compare the results:

$ diff -u3 test1 test2
--- test1       2018-01-23 10:03:16.770015274 +1300
+++ test2       2018-01-23 10:43:51.605648499 +1300
@@ -1,48 +1,48 @@

----------------- Information for assembly 'Kiwifruit_contig.fa.gz' ----------------
+---------------- Information for assembly 'Kiwifruit_scaffold.fa.gz' ----------------


-                                         Number of scaffolds      26721                 ## The first part we expect to be different, contigs versus scaffolds 
-                                     Total size of scaffolds  604217145
-                                            Longest scaffold     423496
-                                           Shortest scaffold        200
-                                 Number of scaffolds > 1K nt      26373  98.7%
-                                Number of scaffolds > 10K nt      12188  45.6%
-                               Number of scaffolds > 100K nt       1106   4.1%
-                                 Number of scaffolds > 1M nt          0   0.0%
+                                         Number of scaffolds       7698
+                                     Total size of scaffolds  616114069
+                                            Longest scaffold    3410229
+                                           Shortest scaffold        896
+                                 Number of scaffolds > 1K nt       7620  99.0%
+                                Number of scaffolds > 10K nt       2131  27.7%
+                               Number of scaffolds > 100K nt       1152  15.0%
+                                 Number of scaffolds > 1M nt        129   1.7%
                                 Number of scaffolds > 10M nt          0   0.0%
-                                          Mean scaffold size      22612
-                                        Median scaffold size       7933
-                                         N50 scaffold length      58864
-                                          L50 scaffold count       2977
-                                                 scaffold %A      32.54
-                                                 scaffold %C      17.59
-                                                 scaffold %G      17.60
-                                                 scaffold %T      32.27
-                                                 scaffold %N       0.00
+                                          Mean scaffold size      80036
+                                        Median scaffold size       3358
+                                         N50 scaffold length     646786
+                                          L50 scaffold count        280
+                                                 scaffold %A      31.92
+                                                 scaffold %C      17.25
+                                                 scaffold %G      17.26
+                                                 scaffold %T      31.65
+                                                 scaffold %N       1.92
                                          scaffold %non-ACGTN       0.00
                              Number of scaffold non-ACGTN nt          0

-                Percentage of assembly in scaffolded contigs       0.0%
-              Percentage of assembly in unscaffolded contigs     100.0%
-                      Average number of contigs per scaffold        1.0
-Average length of break (>25 Ns) between contigs in scaffold          0
+                Percentage of assembly in scaffolded contigs      93.7%
+              Percentage of assembly in unscaffolded contigs       6.3%
+                      Average number of contigs per scaffold        2.0
+Average length of break (>25 Ns) between contigs in scaffold       1507

-                                           Number of contigs      26721                ## The second part should be the same 26721 contigs, versus 9758 contigs (in scaffolds file)
-                              Number of contigs in scaffolds          0
-                          Number of contigs not in scaffolds      26721
-                                       Total size of contigs  604217145
-                                              Longest contig     423496
-                                             Shortest contig        200
-                                   Number of contigs > 1K nt      26373  98.7%
-                                  Number of contigs > 10K nt      12188  45.6%
-                                 Number of contigs > 100K nt       1106   4.1%
+                                           Number of contigs      15529
+                              Number of contigs in scaffolds       9758
+                          Number of contigs not in scaffolds       5771
+                                       Total size of contigs  604305128
+                                              Longest contig     830300
+                                             Shortest contig         65
+                                   Number of contigs > 1K nt      15348  98.8%
+                                  Number of contigs > 10K nt       7647  49.2%
+                                 Number of contigs > 100K nt       1895  12.2%
                                    Number of contigs > 1M nt          0   0.0%
                                   Number of contigs > 10M nt          0   0.0%
-                                            Mean contig size      22612
-                                          Median contig size       7933
-                                           N50 contig length      58864               ## N50 is considerably different 58864 versus 117093
-                                            L50 contig count       2977
+                                            Mean contig size      38915
+                                          Median contig size       9483
+                                           N50 contig length     117093
+                                            L50 contig count       1517
                                                    contig %A      32.54
                                                    contig %C      17.59
                                                    contig %G      17.60

The break size between HongYang scaffolds is n=2000. If we specify this explicitly in the call to assemblathon_stats.pl, we get even more spurious results:

$ assemblathon_stats.pl -n 2000 Kiwifruit_scaffold.fa.gz
---------------- Information for assembly 'Kiwifruit_scaffold.fa.gz' ----------------


                                         Number of scaffolds       7698
                                     Total size of scaffolds  616114069
                                            Longest scaffold    3410229
                                           Shortest scaffold        896
                                 Number of scaffolds > 1K nt       7620  99.0%
                                Number of scaffolds > 10K nt       2131  27.7%
                               Number of scaffolds > 100K nt       1152  15.0%
                                 Number of scaffolds > 1M nt        129   1.7%
                                Number of scaffolds > 10M nt          0   0.0%
                                          Mean scaffold size      80036
                                        Median scaffold size       3358
                                         N50 scaffold length     646786
                                          L50 scaffold count        280
                                                 scaffold %A      31.92
                                                 scaffold %C      17.25
                                                 scaffold %G      17.26
                                                 scaffold %T      31.65
                                                 scaffold %N       1.92
                                         scaffold %non-ACGTN       0.00
                             Number of scaffold non-ACGTN nt          0

                Percentage of assembly in scaffolded contigs      73.5%
              Percentage of assembly in unscaffolded contigs      26.5%
                      Average number of contigs per scaffold        1.8
Average length of break (>25 Ns) between contigs in scaffold       1507

                                           Number of contigs      13777
                              Number of contigs in scaffolds       7076
                          Number of contigs not in scaffolds       6701
                                       Total size of contigs  605485125
                                              Longest contig    1554749
                                             Shortest contig         65
                                   Number of contigs > 1K nt      13656  99.1%
                                  Number of contigs > 10K nt       6737  48.9%
                                 Number of contigs > 100K nt       1840  13.4%
                                   Number of contigs > 1M nt          8   0.1%
                                  Number of contigs > 10M nt          0   0.0%
                                            Mean contig size      43949
                                          Median contig size       9240
                                           N50 contig length     140261
                                            L50 contig count       1175
                                                   contig %A      32.48
                                                   contig %C      17.56
                                                   contig %G      17.56
                                                   contig %T      32.21
                                                   contig %N       0.20
                                           contig %non-ACGTN       0.00
                               Number of contig non-ACGTN nt          0

The contig N50 has now increased to 140,261; it should be 58,864.
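
To put numbers on the bias, the following sketch compares the two contig N50 values reported from the scaffold file with the NCBI-verified value. The helper name `inflation` is hypothetical; the three N50 values are the ones quoted in this issue.

```python
def inflation(reported, expected):
    """Return how many times larger the reported value is than the expected one."""
    return reported / expected


EXPECTED_N50 = 58864  # contig N50 from Kiwifruit_contig.fa.gz, as verified in NCBI

reported_n50 = {
    "scaffold file, default break size": 117093,
    "scaffold file, -n 2000": 140261,
}

for label, value in reported_n50.items():
    ratio = inflation(value, EXPECTED_N50)
    # (ratio - 1) is the relative overestimate of the contig N50.
    print(f"{label}: {ratio:.2f}x expected, {ratio - 1:.0%} too high")
```

With the default break size the contig N50 is about 99% too high, and with `-n 2000` it is about 138% too high.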
