Suggested bug fix for ready_for_MSA.sh

Great programs. Spent a while debugging panHiTE for some genomes and thought this might be helpful. My investigations started with an error in the pan_recover_low_copy_TEs process, which produced a traceback in `pan_recover_low_copy_TEs.log` (in the nextflow work directory hash):
```python
"""
Traceback (most recent call last):
  File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/HiTE_conda/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/Util.py", line 10448, in run_find_members_v8
    return is_TE_from_align_file(extend_member_file, query_name, cur_seq, temp_dir, subset_script_path, plant,
  File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/Util.py", line 10419, in is_TE_from_align_file
    align_file = remove_sparse_col_in_align_file(align_file)
  File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/Util.py", line 10346, in remove_sparse_col_in_align_file
    first_seq = align_contigs[align_names[0]]
IndexError: list index out of range
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/pan_recover_low_copy_TEs.py", line 676, in <module>
    filter_true_TEs(batch_member_files, real_TEs, cur_temp_dir, subset_script_path, TE_type, plant, threads)
  File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/module/pan_recover_low_copy_TEs.py", line 478, in filter_true_TEs
    result_info = job.result()
  File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/HiTE_conda/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/scratch/y95/mderbyshire/workflows/test_panhite/HiTE/HiTE_conda/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
IndexError: list index out of range
```

The traceback for the error comes from an empty output from `mafft`, which is caused by a corrupted input resulting from `tools/ready_for_MSA.sh`. This is called by the function  `is_TE_from_align_file` from `Util.py`, which is called by `pan_recover_low_copy_TEs.py`.

Fixing the following two bugs produced the desired output and allowed the workflow to continue through this process without errors:

## Bug 1
Lines 1048-1049 of `Util.py`:

```python
    if len(member_names) > 100:
        sub_command = 'cd ' + temp_dir + ' && sh ' + subset_script_path + ' ' + cur_align_file + ' 100 100 ' + ' > /dev/null 2>&1'
```

If `member_names` is longer than 100, `subset_script_path` (a reference to `tools/ready_for_MSA.sh`) is called with parameters `100` and `100` for max and largest, respectively. In `tools/ready_for_MSA.sh`, there are a few bugs that cause a silent error that is piped to `/dev/null`.
On line 39, there is a simple bash arithmetic error, where `r=$max-$largest` should be `r=$((max-largest))`. 
However, since `100-100` is always `0`, even if the arithmetic is fixed this script always throws an error because if `$r` is `0`, the shell command `head` receives the invalid flag `-0` on line 42:

Full context:
```bash
        		sort -nk 2 -r $fasta.fai > $fasta.fai.sortedbylength 
        		l=$largest  #l is the number of largest sequences to be selected
        		r=$max-$largest #r is the number sequences to be randomly selected (MAKES THE STRING "100-100")
        		head -$l $fasta.fai.sortedbylength | awk '{print $1}' >> $fasta.headers #headears of "l" largest seqs
        		awk "NR> "$l" {print}" < $fasta.fai.sortedbylength > $fasta.temp
        		sort -R $fasta.temp | head -$r | awk '{print $1}' >> $fasta.headers #headers of "l" plus "r" random seqs  (ALWAYS THROWS AN ERROR IF $max AND $largest ARE THE SAME OR $r IS THE STRING "100-100")
        		grep -A 1 -f $fasta.headers $fasta | sed '/^--/d' > $fasta.rdmSubset.fa  
        		awk 'FNR==NR{a[$1]=$2;next} {print $1,a[$1]}' $fasta.fai $fasta.headers | sort -nk 2 -r >> $fasta.rdmSubset.len
        		rm $fasta.fai.sortedbylength $fasta.headers $fasta.temp 
```
You could fix this simply by using `head -n $r` if you want to pass a parameter other than `100` for largest and max. If only the first 100 sequences are required, there is no need for the random sampling. If there is a need to set this dynamically, you could use conditional logical so `sort` is not invoked before the pipe is broken by `head`, which would reduce unnecessary computation before throwing an error.

## Bug 2

This is a minor bug related to the format of my input FASTA, which is in [panSN spec](https://github.com/pangenome/PanSN-spec). 

Line 43 of ready_for_MSA.sh uses `grep` with the flag `-A`:
```bash
        		grep -A 1 -f $fasta.headers $fasta | sed '/^--/d' > $fasta.rdmSubset.fa  
```

Since panSN spec uses '#' to separate haplotypes and sequence IDs, grep interprets the headers as a binary file. Consider replacing line 43 with:
```bash
        		grep -a -A 1 -f $fasta.headers $fasta | sed '/^--/d' > $fasta.rdmSubset.fa  
```
The `-a` flag forces grep to treat $fasta.headers as strings.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggested bug fix for ready_for_MSA.sh #92

Bug 1

Bug 2

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Suggested bug fix for ready_for_MSA.sh #92

Description

Bug 1

Bug 2

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions