Skip to content

median wrong in assembly_stats tool aka fasta_summary.pl #96

@peterjc

Description

@peterjc

Noticed on a real dataset where my Python code did not get the same median contig length, reduced to a minimal test case:

$ cat /tmp/median.fasta 
>one
A
>two
AA
>three
AAA
>four
AAAA

There are an even number of sequences, 1, 2, 3, 4, meaning the median should be the mean of the middle two values, which is 2.5 in this trivial case.

As can be seen below, fasta_summary.pl picks the larger of the middle two values instead (which in the context of sequences will ensure the answer is an integer), so in practise the error will often be larger:

$ perl fasta_summary.pl -i /tmp/median.fasta -o /tmp && more /tmp/stats.txt 
  Directory '/tmp' exists, so the existing fasta_summary.pl output files will be overwritten
Statistics for read lengths:
	Min read length:	1
	Max read length:	4
	Mean read length:	2.50
	Standard deviation of read length:	1.12
	Median read length:	3
	N50 read length:	3

Statistics for numbers of reads:
	Number of reads:	4
	Number of reads >=1kb:	0
	Number of reads in N50:	2

Statistics for bases in the reads:
	Number of bases in all reads:	10
	Number of bases in reads >=1kb:	0
	GC Content of reads:	0.00 %

Simple Dinucleotide repeats:
	Number of reads with over 70% dinucleotode repeats:	0.00 % (0 reads)
	AT:	0.00 % (0 reads)
	CG:	0.00 % (0 reads)
	AC:	0.00 % (0 reads)
	TG:	0.00 % (0 reads)
	AG:	0.00 % (0 reads)
	TC:	0.00 % (0 reads)

Simple mononucleotide repeats:
	Number of reads with over 50% mononucleotode repeats:	50.00 % (2 reads)
	AA:	50.00 % (2 reads)
	TT:	0.00 % (0 reads)
	CC:	0.00 % (0 reads)
	GG:	0.00 % (0 reads)


CC original author @sujaikumar - given the Perl script has been used widely it may be simpler just to document this behaviour? Is there an official repository for this script?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions