-
Notifications
You must be signed in to change notification settings - Fork 7
Description
Great programs. Spent a while debugging panHiTE for some genomes and thought this might be helpful. My investigations started with an error in the pan_recover_low_copy_TEs process, which produced a traceback in pan_recover_low_copy_TEs.log (in the nextflow work directory hash):
"""
Traceback (most recent call last):
File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/HiTE_conda/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/Util.py", line 10448, in run_find_members_v8
return is_TE_from_align_file(extend_member_file, query_name, cur_seq, temp_dir, subset_script_path, plant,
File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/Util.py", line 10419, in is_TE_from_align_file
align_file = remove_sparse_col_in_align_file(align_file)
File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/Util.py", line 10346, in remove_sparse_col_in_align_file
first_seq = align_contigs[align_names[0]]
IndexError: list index out of range
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/pan_recover_low_copy_TEs.py", line 676, in <module>
filter_true_TEs(batch_member_files, real_TEs, cur_temp_dir, subset_script_path, TE_type, plant, threads)
File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/pan_recover_low_copy_TEs.py", line 478, in filter_true_TEs
result_info = job.result()
File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/HiTE_conda/lib/python3.9/concurrent/futures/_base.py", line 439, in result
return self.__get_result()
File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/HiTE_conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
IndexError: list index out of rangeThe traceback for the error comes from an empty output from mafft, which is caused by a corrupted input resulting from tools/ready_for_MSA.sh. This is called by the function is_TE_from_align_file from Util.py, which is called by pan_recover_low_copy_TEs.py.
Fixing the following two bugs produced the desired output and allowed the workflow to continue through this process without errors:
Bug 1
Lines 1048-1049 of Util.py:
if len(member_names) > 100:
sub_command = 'cd ' + temp_dir + ' && sh ' + subset_script_path + ' ' + cur_align_file + ' 100 100 ' + ' > /dev/null 2>&1'If member_names is longer than 100, subset_script_path (a reference to tools/ready_for_MSA.sh) is called with parameters 100 and 100 for max and largest, respectively. In tools/ready_for_MSA.sh, there are a few bugs that cause a silent error that is piped to /dev/null.
On line 39, there is a simple bash arithmetic error, where r=$max-$largest should be r=$((max-largest)).
However, since 100-100 is always 0, even if the arithmetic is fixed this script always throws an error because if $r is 0, the shell command head receives the invalid flag -0 on line 42:
Full context:
sort -nk 2 -r $fasta.fai > $fasta.fai.sortedbylength
l=$largest #l is the number of largest sequences to be selected
r=$max-$largest #r is the number sequences to be randomly selected (MAKES THE STRING "100-100")
head -$l $fasta.fai.sortedbylength | awk '{print $1}' >> $fasta.headers #headears of "l" largest seqs
awk "NR> "$l" {print}" < $fasta.fai.sortedbylength > $fasta.temp
sort -R $fasta.temp | head -$r | awk '{print $1}' >> $fasta.headers #headers of "l" plus "r" random seqs (ALWAYS THROWS AN ERROR IF $max AND $largest ARE THE SAME OR $r IS THE STRING "100-100")
grep -A 1 -f $fasta.headers $fasta | sed '/^--/d' > $fasta.rdmSubset.fa
awk 'FNR==NR{a[$1]=$2;next} {print $1,a[$1]}' $fasta.fai $fasta.headers | sort -nk 2 -r >> $fasta.rdmSubset.len
rm $fasta.fai.sortedbylength $fasta.headers $fasta.temp You could fix this simply by using head -n $r if you want to pass a parameter other than 100 for largest and max. If only the first 100 sequences are required, there is no need for the random sampling. If there is a need to set this dynamically, you could use conditional logical so sort is not invoked before the pipe is broken by head, which would reduce unnecessary computation before throwing an error.
Bug 2
This is a minor bug related to the format of my input FASTA, which is in panSN spec.
Line 43 of ready_for_MSA.sh uses grep with the flag -A:
grep -A 1 -f $fasta.headers $fasta | sed '/^--/d' > $fasta.rdmSubset.fa Since panSN spec uses '#' to separate haplotypes and sequence IDs, grep interprets the headers as a binary file. Consider replacing line 43 with:
grep -a -A 1 -f $fasta.headers $fasta | sed '/^--/d' > $fasta.rdmSubset.fa The -a flag forces grep to treat $fasta.headers as strings.