---
title: "Metagenomics: Annotation tutorial"
project:
  type: website
execute:
  eval: false
format:
  html:
    number-sections: true
    fig-width: 5
    fig-height: 4
    embed-resources: true
    df-print: paged
    toc: true
    toc-location: left
    toc-depth: 4
    link-external-newwindow: true
    output-file: index.html
    code-overflow: wrap
    theme:
      light: cosmo
      dark: darkly
    code-block-bg: true
    code-block-border-left: "#31BAE9"
editor_options:
  chunk_output_type: inline
engine: knitr
---
# Introduction
The goal of this tutorial, which is part of the NIOZ bioinformatics workshop, is to teach students how to annotate proteins from genomes. If you are interested in the topic outside of the workshop, feel free to check out the code as well; all required tools and databases are referenced.
Specifically, we will:
- download two genomes from NCBI
- annotate these genomes using several databases such as the COGs, arCOGs, KOs, PFAM, ...
- combine the results into one comprehensive table
**General comments**
- A lot of programs in this workflow use for loops or GNU parallel to speed things up. While this might not make much sense for only two genomes, it becomes very useful when you work with thousands of files.
- Most annotation steps can make use of several CPUs. For this tutorial we only use a few CPUs, but this can be adjusted according to the resources you have available. More on how to check resources can be found in the [Unix tutorial](https://github.com/ndombrowski/Unix_tutorial)
- This script uses a lot of AWK for parsing. If you want some more insight into how to use AWK, we also have an [AWK tutorial](https://github.com/ndombrowski/AWK_tutorial) available.
- Once you have the annotation table, we provide an [R tutorial here](https://github.com/ndombrowski/R_tutorial) that gives an example of how to summarize the data.
- The individual database searches might take a while to run (especially the interproscan and the diamond search against the NCBI database), so if you run the full workflow, have patience.
:::{.callout-note}
Whenever we ask a question the code solution will be hidden.
You can show the code by pressing on \<Click me to see an answer\>.
:::
# General
Below you find a list of the used programs. If you want to run this pipeline outside of the bioinformatics workshop on your own servers, please install the necessary tools and download the custom scripts and databases from the [Zenodo repository](https://zenodo.org/deposit/6489670).
**General notice for people interested in eukaryotic genomes:**
This tutorial is designed specifically with prokaryotic genomes in mind but can be used for eukaryotic proteins as well. This is especially easy if you already have protein files for your genomes of interest. If you do not have such files, beware that the tool we use for protein calling, prokka, is specific to prokaryotic genomes. Tools for identifying eukaryotic proteins are, among others, [Augustus](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1160219/) for single genomes or [MetaEuk](https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-020-00808-x) for metagenomes.
Something to keep in mind is that the databases used here were often generated using only archaeal and/or bacterial genomes. Only a few of them, such as the KO database, also include eukaryotic genomes. A good alternative to check out are the [KOGs](https://pubmed.ncbi.nlm.nih.gov/14759257/), which are specifically designed for eukaryotes. The most recent profiles provided by EggNOG can be found [here](http://eggnog5.embl.de/#/app/downloads).
## Version programs
The versions might differ from what you have available on your system; this should not be problematic, but for consistency we list the version numbers that were used to generate this workflow as well as links with install instructions.
1. Python 2.7.5
2. perl v5.16.3
3. prokka 1.14-dev ; install instructions can be found [here](https://github.com/tseemann/prokka)
4. HMMER 3.1b2; install instructions can be found [here](http://hmmer.org/documentation.html)
5. blastp v2.7.1+; install instructions can be found [here](https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
6. diamond 0.9.22; install instructions can be found [here](https://github.com/bbuchfink/diamond/wiki)
7. interproscan: InterProScan-5.29-68.0; install instructions can be found [here](https://interproscan-docs.readthedocs.io/en/latest/)
8. GNU parallel 20181222; install instructions can be found [here](https://www.gnu.org/software/parallel/)
## Version databases
Databases (and mapping files) used in the current version can be downloaded from [here](https://zenodo.org/deposit/6489670).
1. KOhmms: downloaded 90419 from [here](https://www.genome.jp/tools/kofamkoala/)
2. PGAP (former TIGRs): downloaded Nov 2021 from [here](https://ftp.ncbi.nlm.nih.gov/hmm/current/)
3. Pfams: RELEASE 34.0, downloaded Nov 2021 from [here](http://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/)
4. CAZymes: dbCAN-HMMdb-V9 downloaded on 21 April 2020 from [here](https://bcb.unl.edu/dbCAN2/download/)
5. Merops: downloaded from Merops server in November 2018 from [here](https://www.ebi.ac.uk/merops/download_list.shtml)
6. HydDB: downloaded from the HydDB webserver in July 2020 from [here](https://services.birc.au.dk/hyddb/browser/)
7. COGs: downloaded from NCBI October 2020 from [here](ftp://ftp.ncbi.nlm.nih.gov/pub/COG/COG2020/)
8. TransporterDB: downloaded from TCDB in April 2021 from [here](http://www.tcdb.org)
9. ncbi_nr (maintained and downloaded by Alejandro Abdala Asbun). The last update was done in August 2021. For more details on downloading the data on your own, check out [this resource](https://www.ncbi.nlm.nih.gov/books/NBK569850/). Additionally, `prot.accession2taxid.gz` was downloaded from [here](https://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/). The nr database was converted into a diamond database with `blastdbcmd -entry all -db nr/nr -out all.nr.masqued.fasta -line_length 10000000 -ctrl_a` followed by `diamond makedb --in all.nr.masqued.fasta --db diamond/nr --taxonmap prot.accession2taxid.gz`.
10. ArCOGs: downloaded in 2019 and available [here](https://ftp.ncbi.nih.gov/pub/wolf/COGs/arCOG/) (update date: 2018)
## External scripts needed
Custom scripts used in the current version can be downloaded from [Zenodo](https://zenodo.org/deposit/6489670) and [Github](https://github.com/ndombrowski/Annotation_workflow).
**python**
- fasta_length_gc.py
- Parse_prokka_for_MAGs_from_gbk-file.py
- Split_Multifasta.py
- parse_IPRdomains_vs2_GO_2_ts_sigP.py
- parse_diamond_blast_results_id_taxid.py
- merge_df.py
# Setup working environment
## Define variables
Notice: If you have done the assembly and binning tutorials that we provide as well, you will see that we do things a bit differently here. When annotating genomes, we use a lot of different databases, and there are two ways to deal with many files that might be found in different locations:
(a) we can add the path for each database into the code itself
(b) we can define variables that store the paths at the beginning of our workflow
Especially if you work with collaborators it often is easier to use the second option, as you just have to adjust the paths in one spot.
**Instructions:**
- Change the first variable, the wdir, to the path you want to work from, and change the paths for the databases according to where you downloaded the databases and mapping files.
- The two last lines of code are needed for the software prokka and are specific to the system the script was written on. Adjust them according to how prokka is set up on your system.
For people following the tutorial outside of the NIOZ bioinformatics workshop on their own servers: the necessary databases and scripts can be downloaded from [zenodo](https://zenodo.org/record/7784978#.ZCVtc9BBxD8).
```{bash}
#set variable for the base directory and personal user directory
user_name="ndombrowski"
base_dir="/export/lv3/scratch/bioinformatics_workshop/Users/"
wdir=${base_dir}/${user_name}
#nr of cpus we generally want to use
cpus=2
#location of some custom python and perl scripts
export script_folder="/export/lv3/scratch/bioinformatics_workshop/S11b_annotations/scripts/"
#locations of the individual databases
cog_db="/export/data01/databases/cogs/hmms/NCBI_COGs_Oct2020.hmm"
arcog_db="/export/data01/databases/arCOG/arCOGs2019/All_Arcogs_2018.hmm"
ko_db="/export/data01/databases/ko/All_KO"
pfam_db="/export/data01/databases/pfam/Pfam-A.hmm"
tigr_db="/export/data01/databases/PGAP_ProtFam/hmm_PGAP.lib"
cazy_db="/export/data01/databases/cazy_spt/dbCAN-HMMdb-V7.txt"
transporter_db="/export/data01/databases/tcdb/tcdb"
hyd_db="/export/data01/databases/HydDB/HydDB_uniq"
ncbi_nr_db="/export/data01/databases/ncbi_nr/diamond/nr.dmnd"
#locations of the mapping files for the indiv. databases
#these files link the database ID, ie KO0001 to a more useful description
cog_mapping="/export/data01/databases/cogs/hmms/cog_mapping.txt"
ko_mapping="/export/data01/databases/ko/ko_list_for_mapping_clean"
pfam_mapping="/export/data01/databases/pfam/Pfam-A.clans.cleaned.tsv"
cazy_mapping="/export/data01/databases/cazy_spt/CAZY_mapping_2.txt"
hyd_mapping="/export/data01/databases/HydDB/HydDB_mapping"
taxon_db="/export/data01/databases/taxmapfiles/ncbi_nr/prot.accession2taxid.gz"
arcog_mapping="/export/data01/databases/arCOG/arCOGs2019/ar14.arCOGdef18.nina.tab"
transporter_mapping="/export/data01/databases/tcdb/tcdb_definition.tsv"
tigr_mapping="/export/data01/databases/tigr/hmm_PGAP_clean.tsv"
#mapping file for ncbi taxonomy
ncbi_tax_mapping="/export/lv3/scratch/bioinformatics_workshop/S12_annotations/databases/taxonomy5.txt"
#Fix PERL5LIB PATH (required for using prokka on the NIOZ servers)
export PERL5LIB=/export/lv1/public_perl/perl5/lib/perl5
```
## Set our working directory
Let's first go to our working directory and make a new folder for the Annotation tutorial.
:::{.callout-note}
Below you see how we use our variables, i.e. we use the path stored in the `wdir` variable to go to our working folder. If you want to use the code but are on another server or want to use a different folder, you don't have to change anything in the code below. Instead, there is only a single section where a user has to adapt the code, keeping the rest of the code stable.
Environment variables thus make it easier to adapt our workflow to changing circumstances. For example, if we need to change the location of our input data, we can update the environment variable for the input data path rather than having to update the path at different spots in the workflow individually.
:::
```{bash}
#go to your working directory
cd $wdir
#make a folder for the Annotations and move into this directory
mkdir Annotations
cd Annotations
```
# Prepare genomes of interest
In this workflow we will work with 2 genomes that we download from NCBI, but the steps can easily be adjusted for your own genomes. You could also work with the bins we assembled and binned in the previous workflows; if you feel comfortable with the command line, feel free to use those instead.
First, we will download two archaeal genomes from NCBI and rename them to avoid issues with the workflow. The bins from the previous workflows (inside `../Binning/4_FinalBins/renamed`) already have a good header and should be ready to use in case you want to repeat the workflow with those genomes.
**First a word on the importance of good file names and sequence headers**
After we have downloaded the 2 genomes from NCBI the file name and file header should look something like this:
*(screenshot: the original NCBI file name and sequence header)*
We can see that both the file name and file header are long and have a lot of symbols UNIX might have trouble with (brackets, spaces, dots...). Generally, things to watch out for when naming files are:
- Have short but descriptive file names and file headers. For example, if we look at the header of the NCBI genomes, you will see that it is rather long and has unnecessary information for our purposes (i.e. "genomic"). Therefore we want to make it shorter.
- Avoid extra symbols (especially spaces, commas, semicolons, quotes or brackets), as these can easily result in unwanted behavior. For example, a very simple naming scheme could use only two symbols:
- underscores, `_` , for everything where we normally would use a space
- minus symbol `-` for key information that we might want to separate later (i.e. genome name and protein ID)
Another useful thing is to add information into the headers that allow us to concatenate multiple genomes and afterwards still know from what genome a sequence came from. Therefore, we recommend to:
- Add the genome name into the fasta header and use a unique delimiter to separate it from the contig id. That way you can concatenate your genomes and afterwards still know from what genome a contig came from
- have a consistent way to separate the genome and protein ID across different analyses. In phylogenies it's very common to concatenate several genes/proteins from individual genomes, and for this it is useful to remove the protein ID from the fasta header.
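As a minimal illustration of this naming scheme, the genome name can be spliced into the header with AWK and recovered later with `cut`. The file `genomeA.fna` below is a made-up toy example:

```shell
# work in a throwaway directory so nothing in the project is touched
tmp=$(mktemp -d) && cd "$tmp"

# a made-up example genome with one contig
printf '>contig_1\nACGT\n' > genomeA.fna

# prepend the file name (minus the .fna suffix) plus "-" to each header
awk '/>/{sub(">","&"FILENAME"-");sub(/\.fna/,"")}1' genomeA.fna > genomeA_renamed.fna
head -1 genomeA_renamed.fna
# >genomeA-contig_1

# later, the genome name can be recovered by splitting on the "-" delimiter
head -1 genomeA_renamed.fna | cut -d "-" -f1
# >genomeA
```

This is the same AWK idiom used when we rename the downloaded NCBI genomes, just reduced to a single toy file.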
:::{.callout-note}
If you want to work with the files from the binning tutorial:
Your files already have a header that is good to go, so you can simply move them from the Binning tutorial to the `fna/renamed` folder.
```{bash}
mkdir -p fna/renamed
#create symbolic link
ln -sf $(pwd)/../Binning/4_FinalBins/renamed/NIOZ1* fna/renamed
#control that all bins are available
ll fna/renamed/* | wc -l
#control that the sequence header has the format: BinID-contigID
head $(ls -1 fna/renamed//*fna | head -1)
```
:::
## Download genomes from NCBI and cleanup
```{bash}
#make folders to store our genomes in
mkdir -p fna/renamed
#download genomes from NCBI
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/008/085/GCA_000008085.1_ASM808v1/GCA_000008085.1_ASM808v1_genomic.fna.gz
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/017/945/GCA_000017945.1_ASM1794v1/GCA_000017945.1_ASM1794v1_genomic.fna.gz
#unzip files
gzip -d *gz
#confirm that we have as many files as we expect: 2
ls -l *.fna
#shorten the file name by cutting everything away after the dot
for f in *.fna; do mv $f "${f/.*}".fna; done
#check the file name and see if it changed
ls -l *.fna
#check how the header looks
#--> we see that we have a long name and a lot of symbols that we do not want, lets remove these
head $(ls -1 *fna | head -1)
#shorten the fasta headers by removing everything after the space
#we also move the file with the changed header into the fna folder
for f in *.fna; do cut -d " " -f1 $f > fna/${f}; done
head $(ls -1 fna/*fna | head -1)
#add the genome name into the header (this is useful if we want to concatenate a lot of genomes into one file)
cd fna
for i in *fna; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.fna/,x)}1' $i > renamed/$i; done
head $(ls -1 renamed/*fna | head -1)
#remove intermediate files
rm -f *fna
cd ..
rm -f *fna
```
The cleaned up files can be found in `fna/renamed` and file names should look something like this afterwards:
*(screenshot: the cleaned file names and sequence headers)*
### The steps above work fine but what if we have hundreds of genomes that we want to download from NCBI?
:::{.callout-warning}
Below we have some hidden, alternative code.
This is useful if you deal a lot with genomes from NCBI and want a file that links an NCBI taxonomy ID to a taxonomic string. For example, the NCBI taxonomy ID 7 is connected to this taxonomic information: `Bacteria,Proteobacteria,Alphaproteobacteria,Rhizobiales,Xanthobacteraceae,Azorhizobium,Azorhizobium_caulinodans`. The comma-separated names refer to different taxonomic ranks: Domain, Phylum, Class, Order, Family, Genus and Species.
This information can be useful if you work with a genome of interest, such as E. coli, and want to include several closely related genomes as references in your analysis; NCBI is an excellent resource providing a multitude of genomes as well as metadata. The taxonomy string in this example can help you find related genomes more easily.
The code is not yet fully implemented for people working outside the NIOZ but the code below should give you enough information to get you started.
To explore the code, click on `Click me to look at the code` to see the code.
:::
<details>
<summary>Click me to look at the code</summary>
Notice:
- The file that links the taxonomy ID with the taxa string (lineages-2020-12-10.csv) was generated with [ncbitax2lin](https://github.com/zyxue/ncbitax2lin) (instructions added below)
- this part of the workflow was set up to run at servers at the NIOZ and the paths need to be adjusted to work on your system
The first part of this workflow is to get a list of all available archaeal and bacterial genomes from NCBI and add a taxonomic string into the file we downloaded.
```{bash}
#prepare some folders
mkdir ncbi_genomes
cd ncbi_genomes
#get a list of all avail. microbial genomes
wget https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt
```
This file includes a list of all genomes available at NCBI and relevant metadata such as the project ID, taxonomy ID, species name, download link for individual genome files etc.
Next, we will clean the file a bit and since the species name might not be informative enough (i.e. we might want to know to what phylum a species belongs), we want to add the full taxonomy string in.
```{bash}
#clean up the file by removing the first row and any `#` and move the tax id to the front
sed 1d assembly_summary_genbank.txt | sed 's/# //g' | awk -F"\t" -v OFS="\t" '{print $6, $0} ' | sed 's/ /_/g' > assembly_summary_edit.txt
#replace empty rows with a "none" and remove the header
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' assembly_summary_edit.txt | sed '1d' > temp1
#add in the tax string (to be able to better select genomes)
#the code for how to generate the csv needed here can be found below
#merge assembly info with tax string
LC_ALL=C join -a1 -j1 -e'-' -o 0,1.2,1.3,1.4,1.5,1.6,1.8,1.9,1.10,1.11,1.13,1.16,1.17,1.18,1.19,1.21,2.2 <(LC_ALL=C sort temp1) <(LC_ALL=C sort /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020/lineages-2020-12-10.csv) -t $'\t' > temp2
#add in headers
echo -e "taxid\taccession\tbioproject\tbiosample\twgs\trefseq_category\tspeciesID\torganism\tintraspecific_name\tisolate\tassembly\trel_date\tasm\tsubmitter\tgcf_name\tftp_path\ttax" | cat - temp2 > assembly_summary_archaea_Dec2020_tax.txt
cd ..
```
In `assembly_summary_archaea_Dec2020_tax.txt` you can easily search for the genomes you are interested in. For example, you can subset for lineages of interest by selecting specific groups via their taxonomy, and you might run CheckM as well to select them based on their genome stats.
Once you have selected the genomes you are interested in, you can copy the ftp_links into a document called `get_genomes.txt` via nano and then make some small changes to the file with sed. Afterwards you can download all the genomes from NCBI using wget.
```{bash}
# extend path to be able to download the genomes
sed -r 's|(ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)(GCA/)([0-9]{3}/)([0-9]{3}/)([0-9]{3}/)(GCA_.+)|\1\2\3\4\5\6/\6_genomic.fna.gz|' get_genomes.txt > get_genomes_v2.txt
# download genomes
for next in $(cat get_genomes_v2.txt); do wget "$next"; done
```
Once you have the genomes, you can clean the names as suggested above and continue with the steps below.
In case you want to generate the csv file that links the NCBI taxID to a taxonomic string, which includes the phylum, class, order, etc information, this is the workflow (you will need to download [ncbitax2lin](https://github.com/zyxue/ncbitax2lin)):
```{bash}
#change to the directory in which we installed ncbitax2lin
cd /export/lv1/user/spang_team/Scripts/NCBI_taxdump/ncbitax2lin
#start bashrc that allows to run conda
source ~/.bashrc.conda2
#setup virtual environment (see more instructions here: https://github.com/zyxue/ncbitax2lin)
#virtualenv venv/
#re-do taxonomy mapping file
make
#unzip file
gzip -d lineages-2020-12-10.csv.gz
#close virtual environment
#source deactivate venv/
#make a new directory for the most current tax
mkdir /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020
#change dir
cd /export/lv1/user/spang_team/Databases/NCBI_taxonomy/Dec2020
#cp the taxa file into our new dir and replace spaces with underscore
sed 's/ /_/g' ~/../spang_team/Scripts/NCBI_taxdump/ncbitax2lin/lineages-2020-12-10.csv > temp1
#only print the levels until species level
awk -F',' -v OFS="\t" '{print $1,$2,$3,$4,$5,$6,$7,$8}' temp1 > temp2
#add in "none" whenever a tax level is empty
awk 'BEGIN { FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i="none"}; 1' temp2 > temp3
#merge columns 2-8, thus we have 1 column with the tax number and one column with the tax string
awk ' BEGIN { FS = OFS = "\t" } {print $1,$2","$3","$4","$5","$6","$7","$8}' temp3 > temp4
#remove header
sed '1d' temp4 > lineages-2020-12-10.csv
#clean up
rm temp*
```
</details>
## Make a filelist for all genomes of interest
Let us start by making a file that lists all the genomes we are working with. Such a file list is useful when working with for loops.
```{bash}
#make a folder for all the secondary files we need
mkdir FileLists
#list our genomes of interest
ls fna/renamed/* | sed 's/\.fna//g' | sed 's/fna\/renamed\///g' > FileLists/GenomeList
#check content of file
head FileLists/GenomeList
```
The last command should print something like this:
*(screenshot: the contents of FileLists/GenomeList)*
## Call proteins with prokka and clean files
Now that we have the genomes organized, let's identify the coding regions and call the proteins with Prokka. Another common tool you might encounter for this step is prodigal. Prokka basically runs prodigal under the hood but adds some extra annotations, which we also want to look at.
Notice:
- Our genomes are from archaea, if you have other genomes change the **--kingdom Archaea** part below. This setting generates no major differences when calling proteins but the Prokka annotations might slightly change.
- Prokka is not designed to work with eukaryotic genomes, if you work with eukaryotic genomes an alternative tool to look into is for example Augustus.
- Prokka does very quick and dirty annotations, these are good for a first glimpse into what you have but in our experience there can be several issues, so use them with caution.
- Similarly, no database for annotating proteins is perfect. This is why we use multiple databases in our workflow and compare results, to be able to spot mis-annotated proteins more easily (more on that in the last section).
### Run prokka
[Prokka](https://github.com/tseemann/prokka) is a tool for rapid prokaryotic genome annotation [(Seemann, 2014)](https://pubmed.ncbi.nlm.nih.gov/24642063/).
```{bash}
#prepare new directories to store our data
mkdir faa
mkdir faa/Prokka
#run prokka
for sample in `cat FileLists/GenomeList`; do prokka fna/renamed/${sample}.fna --outdir faa/Prokka/${sample} --prefix $sample --kingdom Archaea --addgenes --force --increment 10 --compliant --centre UU --cpus $cpus --norrna --notrna ; done
#control number of faa files generated: 2
#this should be the same as the original number of genomes
ll faa/Prokka/*/*faa | wc -l
#check how the faa header looks like
head $(ls -1 faa/Prokka/*/*faa | head -1)
```
We have 2 genomes and the header of the first file should look something like this:
*(screenshot: the prokka faa sequence header)*
We can see that prokka gives each protein a unique ID and adds an annotation to the file header.
Other than the protein faa file, Prokka generates several additional output files. These include:
*(table: overview of the additional prokka output files)*
Additionally, prokka does not only call proteins but also identifies tRNAs and tries to annotate proteins. The whole workflow is summarized in the image below:
*(figure: overview of the prokka workflow)*
:::{.callout-note}
**A small piece of advice for running things more efficiently**
As you have seen above and in the following code we often use for loops to automatically run a tool multiple times on the same input files. Depending on the analysis done this can take a lot of time. If your system permits, a great alternative is GNU parallel.
More specifically, instead of running the two prokka runs after each other with:
```{bash}
for sample in `cat FileLists/GenomeList`; do prokka fna/renamed/${sample}.fna --outdir faa/Prokka/${sample} --prefix $sample --kingdom Archaea --addgenes --force --increment 10 --compliant --centre UU --cpus 2 --norrna --notrna ; done
```
we could run the 2 prokka jobs simultaneously with GNU parallel. If we had 1000 genomes and could run 40 analyses in parallel, that would make things even more efficient.
```{bash}
parallel -j2 'i={}; prokka fna/renamed/${i}.fna --outdir faa/Prokka/${i} --prefix ${i} --kingdom Archaea --addgenes --force --increment 10 --compliant --centre UU --cpus 2 --norrna --notrna' ::: `cat FileLists/GenomeList`
```
:::
### Clean faa header and combine genomes
If we look again at the protein sequences headers of the newly generated files, we see that the header is not ideal yet. For example, the header should look something like this:
``>CALJACFG_00010 ATP-dependent DNA helicase Hel308``.
This header is very long and uses a lot of extra symbols (a space, and a minus symbol, which we already use for other purposes). Therefore, we first want to clean things up a bit.
Additionally, since for the following workflow we want to concatenate the two genomes, i.e. merge the two files into one, it is useful to add the genome name into the fasta header before merging the files.
This is optional, but it is useful to use a special delimiter to later separate the genome name from the protein name. Therefore we add the file name and separate it from the old sequence header with a `-`.
```{bash}
#go into the prokka folder
cd faa/Prokka/
#make a new folder in which we want to store our renamed genomes
mkdir renamed
#organize our original files
cp */*faa .
#add the filename into the fasta header and store them in our new folder
for i in *faa; do awk '/>/{sub(">","&"FILENAME"-");sub(/\.faa/,x)}1' $i > renamed/$i; done
#control if this worked
head $(ls -1 renamed/*faa | head -1)
#remove old files
rm -f *faa
#concatenate files
cat renamed/*faa > All_Genomes.faa
#clean the header and shorten it by removing everything after the first space
cut -f1 -d " " All_Genomes.faa > All_Genomes_clean.faa
#check if that worked
head All_Genomes_clean.faa
#go back to the wdir
cd ../..
#make a blast db out of concatenated genomes (we need this whenever we want to run blast)
makeblastdb -in faa/Prokka/All_Genomes_clean.faa -out faa/All_Genomes_clean -dbtype prot -parse_seqids
```
Now our file header should look something like this:
*(screenshot: the cleaned faa sequence header)*
# If you work with your own genomes
This is a short reminder for people who already have their own protein files and want to start with the annotations right away:
- Ideally have not only the protein ID but also the genome ID in the sequence header
- Make sure the file header is concise, does not have ANY spaces, and ideally uses a '-' (or any other unique delimiter) to separate the genome ID from the protein ID
- If you have a concise header plus the bin ID in the header, it is easy to concatenate the protein sequences of your genomes into one single file and still know from what genome each sequence originally came
More specifically you want to:
(a) for the fna files
- generate a folder named `fna/renamed`
- add the individual genomes in there
(b) for the faa files:
- generate the file in the folder `faa/Prokka`
- concatenate the proteins from all your genomes of interest and name the file `All_Genomes_clean.faa`
- continue with the steps below
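The steps above can be sketched as follows; the file names `bin1.faa` and `bin2.faa` are hypothetical stand-ins for your own protein files:

```shell
# throwaway directory with two made-up protein files
tmp=$(mktemp -d) && cd "$tmp"
printf '>protein_01\nMKV\n' > bin1.faa
printf '>protein_01\nMSA\n' > bin2.faa

# target folder expected by the rest of the workflow
mkdir -p faa/Prokka

# add the bin ID to every header, "-" as delimiter, then concatenate
for f in *.faa; do
    awk '/>/{sub(">","&"FILENAME"-");sub(/\.faa/,"")}1' "$f"
done > faa/Prokka/All_Genomes_clean.faa

grep ">" faa/Prokka/All_Genomes_clean.faa
# >bin1-protein_01
# >bin2-protein_01
```

Note how the two identical protein IDs stay distinguishable after concatenation because each header carries its bin ID.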
# Make a mapping file with all relevant genome info {#sec-mapping_file}
Now that our files are ready, we first want to make a table with the basic stats (e.g. contig length, contig GC, protein length, ...).
## Get contig stats
Prokka has the habit of renaming the fasta headers. Therefore, it is not straightforward to say from what contig a protein originally came. For example, a header from the contigs (i.e. the fna file) might look like this `>GCA_000008085-AE017199.1`, while a protein on that contig might be named like this `GCA_000008085-CALJACFG_00010` in the faa file.
The workflow below generates a mapping file that links the original contig name with the prokka contig name. Beware, the workflow is a bit convoluted since the original contig ID `AE017199` is stored nowhere in the prokka output files, and we need to improvise a bit to generate this information.
The goal is to generate a file with this information:
- old contig name, binID, contig GC, contig length
**Notice**
- This part depends a lot on the files having short naming schemes and no extra symbols. So make sure that the final file has the 4 columns we would expect.
:::{.callout-note}
**Comment if working with your own genomes:**
If working with your own genomes, this will only apply to you if you called proteins with prokka. Additionally, this workflow depends on a consistent naming of the sequence headers. Specifically, this step will ONLY work if your fasta header includes the binID and the contig ID separated by a `-`. If you do not have both pieces of information, have extra symbols, or if that symbol appears somewhere else in the filename/header, this part of the workflow will very likely not work without adjustments.
If your file headers look different, it is easiest to skip this step and do the following:
(A) generate the file FileLists/1_Bins_to_protein_list.txt in the section **Prepare list that links proteins with binIDs**
(B) continue with the step **Do the annotations**.
(C) During the first annotation step with the COG database, ignore the step #combine with previous data frame and do `cp NCBI_COGs/A_NCBI_COGs.tsv merge/temp_A.tsv` to generate the first table.
:::
```{bash}
#make folder
mkdir contig_mapping
#calculate the length and gc for each contig
for i in `cat FileLists/GenomeList`; do python $script_folder/fasta_length_gc.py fna/renamed/${i}.fna; done > contig_mapping/temp1
#make a separate column for the genome ID and proteinID and give a numbering to the contigs
awk 'BEGIN{OFS="\t"}{split($1,a,"-"); print a[2], a[1], ++count[a[1]], $2, $3}' contig_mapping/temp1 > contig_mapping/Contig_Old_mapping.txt
#check file: the header should have --> contigID, binID, contig NR, GC, contig length
head contig_mapping/Contig_Old_mapping.txt
#cleanup temp files
rm contig_mapping/*temp*
#create file for merging with prokka contigIDs. Here, we add the genome ID with the contig number together to recreate the prokka contig name
awk 'BEGIN{OFS="\t"}{print $2"_contig_"$3,$0}' contig_mapping/Contig_Old_mapping.txt > contig_mapping/Contig_Old_mapping_for_merging.txt
#check file: prokka contig name, prokka genome ID, genome ID, contig NR, GC, contig length
head contig_mapping/Contig_Old_mapping_for_merging.txt
```
In the file we see the contig name, contig GC and contig length. Since the two genomes are complete genomes we only work with two contigs in total. If you work with MAGs this table will get much longer.
{width=100% fig-align="left"}
## Prepare list that links proteins with binIDs {#sec-protein_list}
Now, we want to generate a file that lists all proteins we are working with plus a second column with the bin ID. So we want to generate a file with two columns: proteinID (i.e. accession) and genome ID.
```{bash}
#get list of protein accession numbers
grep "^>" faa/Prokka/All_Genomes_clean.faa > temp
#remove the prokka annotations
cut -f1 -d " " temp > temp2
#remove the ``>``
sed 's/>//g' temp2 > FileLists/AllProteins_list.txt
#check file; we want one column with the proteinIDs
head FileLists/AllProteins_list.txt
#remove temp files
rm -f temp*
#Modify protein list to add in a column with genome ID
awk -F'\t' -v OFS='\t' '{split($1,a,"-"); print $1, a[1]}' FileLists/AllProteins_list.txt | LC_ALL=C sort > FileLists/1_Bins_to_protein_list.txt
#check file content: we now have the accession and the binID
head FileLists/1_Bins_to_protein_list.txt
#check with how many proteins we work: 2069
wc -l FileLists/1_Bins_to_protein_list.txt
```
We now have a small table that lists all our 2,069 proteins and the respective genome each protein belongs to. The first few lines of the file should look something like this:
{width=80% fig-align="left"}
:::{.callout-note}
Throughout the workflow (or whenever generating your own pipeline), check that the number of proteins stays consistent; this is a useful sanity check to ensure that the code does what you expect it to.
Additionally, while running things for the first time it is always good to check the generated file contents with `nano` or `less` to get a better grasp on:
- What is happening at specific points in the script
- Where potential bugs might hide
:::
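Such a sanity check is easy to script. The snippet below is a self-contained sketch that uses a tiny mock fasta so it can run anywhere (run it in a scratch directory); with the real workflow files, the same comparison should report the 2,069 proteins mentioned above:

```bash
# mock input: two proteins from one genome (replace with the real files)
mkdir -p faa/Prokka FileLists
printf ">g1-p1\nMA\n>g1-p2\nMK\n" > faa/Prokka/All_Genomes_clean.faa
grep "^>" faa/Prokka/All_Genomes_clean.faa | sed 's/^>//' | \
  awk -F'-' -v OFS='\t' '{print $0, $1}' > FileLists/1_Bins_to_protein_list.txt

# the protein count in the fasta and in the mapping file must agree
n_faa=$(grep -c "^>" faa/Prokka/All_Genomes_clean.faa)
n_list=$(wc -l < FileLists/1_Bins_to_protein_list.txt)
[ "$n_faa" -eq "$n_list" ] && echo "counts match: $n_faa proteins" || echo "MISMATCH: $n_faa vs $n_list"
```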
## Prepare a file with prokka annotations, lengths & other genome info
Now, we want to generate a file that links the contig info and all the protein info.
```{bash}
#compress gbk files (generated by prokka)
gzip faa/Prokka/*/*gbk
#make list for looping
for f in faa/Prokka/*/*gbk.gz; do echo $f >> FileLists/list_of_gzips; done
#control that the nr of files we have is correct: 2
wc -l FileLists/list_of_gzips
#create a summary file based on the gbk file
#lists the genome ID, contig name, contig length, proteinID, prokka annotation
python $script_folder/Parse_prokka_for_MAGs_from_gbk-file_v2.py -i FileLists/list_of_gzips -t FileLists/Contig_Protein_list.txt
#check output of the script: Prokka binID, Prokka contig ID, contig length, gene start, gene stop, strand, accession, prokka annotation
head FileLists/Contig_Protein_list.txt
```
The output should look something like this
{width=100% fig-align="left"}
Next, we want a list that maps the new ID prokka gives to each genome to the old genome name we got from NCBI:
```{bash}
#make prokka to genome I mapping file
awk 'BEGIN{OFS="\t"}{split($1,a,"-"); print a[2],$2}' FileLists/1_Bins_to_protein_list.txt | sed 's/_[0-9]*\t/\t/g' | sort | uniq > FileLists/Prokka_to_BinID.txt
#check file: Prokka bin ID, old bin ID
head FileLists/Prokka_to_BinID.txt
```
{width=80% fig-align="left"}
Finally, we can merge everything together.
```{bash}
#link old to new genome ID
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' FileLists/Prokka_to_BinID.txt FileLists/Contig_Protein_list.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$10,$2,$2,$3,$4,$5,$6,$7,$8}' > temp_Bin_Contig_Protein_list.txt
#add in extra column for contig nr
awk -F'\t' -v OFS='\t' 'gsub("[A-Za-z0-9]*_","",$4)' temp_Bin_Contig_Protein_list.txt | awk -F'\t' -v OFS='\t' '{print $2"_contig_"$4,$1,$2,$3,$4,$5,$6,$7,$8,$9,$10}' > FileLists/Bin_Contig_Protein_list.txt
#check file
head FileLists/Bin_Contig_Protein_list.txt
#merge with old contig IDs
#headers: accession, genome ID, new ContigID, old ContigID, merged ContigID,new Contig Length, old Contig Length, GC,ProteinID, protein start, protein end, protein strand, prokka annotation
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' contig_mapping/Contig_Old_mapping_for_merging.txt FileLists/Bin_Contig_Protein_list.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $3"-"$7,$3,$4,$13,$1,$6,$17,$16,$7,$8,$9,$10,$11 }' > FileLists/Bin_Contig_Protein_list_merged.txt
#check file and number of proteins in file
head FileLists/Bin_Contig_Protein_list_merged.txt
wc -l FileLists/Bin_Contig_Protein_list_merged.txt
#prepare a file with protein length and gc to add protein length into our file
python $script_folder/fasta_length_gc.py faa/Prokka/All_Genomes_clean.faa > temp
#merge with contig/protein info
awk 'BEGIN{FS="\t";OFS="\t"}FNR==NR{a[$1]=$0;next}{print $0,a[$1]}' temp FileLists/Bin_Contig_Protein_list_merged.txt | awk 'BEGIN{FS="\t";OFS="\t"}{print $1,$2,$3,$4,$6,$7,$8,$9,$10,$11,$12,$16,$13}' > temp2
#add a header
echo -e "accession\tBinID\tNewContigID\tOldContigId\tContigNewLength\tContigOldLength\tGC\tProteinID\tProteinStart\tProteinEnd\tStrand\tProteinLength\tProkka" | cat - temp2 > FileLists/B_GenomeInfo.txt
#check file
less -S FileLists/B_GenomeInfo.txt
#control that all is ok and has the expected number of rows: 2070
#notice, we have one row extra since we count the header
wc -l FileLists/B_GenomeInfo.txt
#remove temp files
rm -f temp*
```
The file stored in `FileLists/B_GenomeInfo.txt` should look something like this:
{width=100% fig-align="left"}
# Do the annotations {#sec-start_annotations}
All searches that follow below have a similar workflow:
- Run the search (with hmmer, diamond, blast, etc.)
- For each protein, make sure we have just one unique hit (usually we select the best hit for each protein using the best e-value or bit score). For example, a single protein might be similar to several dehydrogenases and we only keep the best hit.
- Clean up the output from the search.
- Merge the output with our protein list (that way we make sure we always have the same number of rows in all the files we generate).
- Add gene descriptions to our table. E.g. when we search against the COGs we initially only get the COG IDs, so we add a gene description here.
- Add an informative header.
For all searches, we keep the original input table and the final output table; the intermediate files are deleted.
Once we have run our searches against several different databases, we merge all the results in a step-wise manner (using the protein IDs) and then in the final step clean the final data frame by removing redundant information.
We will add basic information on the different databases and search algorithms in the last section of this tutorial, so if you want more details check there.
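The best-hit selection and the left join against the protein list are the same two tricks in every search below, so here is a self-contained toy example (the files `hits.tsv` and `proteins.tsv` are made up for illustration):

```bash
# mock search output: proteinID, hitID, bit score, e-value
printf 'p1\tcogA\t50\t1e-10\np1\tcogB\t80\t1e-20\np2\tcogC\t30\t1e-05\n' > hits.tsv
# mock protein list: proteinID, genomeID (p3 has no hit at all)
printf 'p1\tgenome1\np2\tgenome1\np3\tgenome1\n' > proteins.tsv

# best hit per protein: sort by bit score (descending), then e-value (ascending),
# then keep only the first line per protein ID
sort -t$'\t' -k3,3gr -k4,4g hits.tsv | sort -t$'\t' --stable -u -k1,1 > best.tsv

# left join onto the full protein list so every protein keeps a row;
# proteins without a hit get '-' placeholders
LC_ALL=C sort proteins.tsv > proteins.sorted
LC_ALL=C sort best.tsv > best.sorted
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 proteins.sorted best.sorted
# p1 keeps cogB (the higher bit score); p3 comes out as: p3 - -
```

This left-join step is also why the row count stays constant across all annotation tables, which is the sanity check we use throughout.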
:::{.callout-note}
Depending on the size of your dataset (and the database) these searches might take a while to run.
Therefore, when you run this on your own genomes adjust the number of CPUs accordingly and depending on what system you run this on consider submitting it as a batch job or inside a screen in case your server connection gets disrupted.
:::
## COG search
The COG (Clusters of Orthologous Genes) database is a collection of orthologous gene clusters generated from 1,187 bacterial and 122 archaeal genomes (2020 version) and is therefore useful if you work with organisms from these two groups.
Possible settings to change:
- number of CPUs
- e-value threshold
```{bash}
#Setup directories
mkdir NCBI_COGs
mkdir merge
#run hmmsearch against all COGs
hmmsearch --tblout NCBI_COGs/sequence_results.txt -o NCBI_COGs/results_all.txt --domtblout NCBI_COGs/domain_results.txt --notextw --cpu $cpus $cog_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select hits above a certain e-value
sed 's/ \+ /\t/g' NCBI_COGs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > NCBI_COGs/sequence_results_red_e_cutoff.txt
#get best hit/protein based on bit score, and e-value
sort -t$'\t' -k3,3gr -k4,4g NCBI_COGs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > NCBI_COGs/All_NCBI_COGs_hmm.txt
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort NCBI_COGs/All_NCBI_COGs_hmm.txt) | LC_ALL=C sort > NCBI_COGs/temp4
#merge with COG mapping file
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.3,2.2,2.5,1.3 <(LC_ALL=C sort -k2 NCBI_COGs/temp4) <(LC_ALL=C sort -k1 $cog_mapping) | LC_ALL=C sort > NCBI_COGs/temp5
#add headers
echo -e "accession\tCOG\tCOG_Description\tCOG_PathwayID\tCOG_Pathway\tCOG_evalue" | cat - NCBI_COGs/temp5 > NCBI_COGs/A_NCBI_COGs.tsv
#check file
head NCBI_COGs/A_NCBI_COGs.tsv
#count the number of lines we have
#this should be our number of proteins + 1 row for the header
wc -l NCBI_COGs/A_NCBI_COGs.tsv
#cleanup
rm -f NCBI_COGs/temp*
```
In step 2, hmmsearch will generate multiple outputs (for details on the individual data found in these tables [see chapter 5, page 45](http://eddylab.org/software/hmmer3/3.1b2/Userguide.pdf) of the HMMER manual):
1. The `--tblout` option produces the target hits table. It consists of one line for each query/target comparison that met the reporting thresholds, ranked by decreasing statistical significance (increasing e-value).
2. The `--domtblout` option produces the domain hits table. There is one line for each domain, and there may be more than one domain per sequence.
3. The `-o` option produces a file with the raw output, including the alignments. Notice: this file can get rather large; it is not essential, so if space is an issue you can delete it or simply omit the `-o` option.
After parsing, the table stored in `NCBI_COGs/A_NCBI_COGs.tsv` contains the protein accession, COG ID, COG description and e-value. If other relevant data is available, it is added as well; in the case of the COGs, the individual COGs are sorted into pathways:
{width=100% fig-align="left"}
**Exercise**
Compare the files sequence_results.txt and sequence_results_red_e_cutoff.txt. What is the difference between them?
<details>
<summary>Click me to see an answer</summary>
The commands to check this are:
```{bash}
head NCBI_COGs/sequence_results.txt
head NCBI_COGs/sequence_results_red_e_cutoff.txt
#we can also check the row numbers
#we see that we go from 17888 to 11066 rows
wc -l NCBI_COGs/sequence_results*
```
Additionally, we can see that some columns were removed in the second table and that any rows starting with a `#` were excluded.
</details>
## ArCOGs search
The arCOG (archaeal COG) database is a collection of orthologous groups built from archaeal genomes and is therefore especially useful when working with archaea.
Possible settings to change:
- number of CPUs (default = 2)
- e-value threshold (default = 1e-3)
```{bash}
#prepare folders
mkdir arcogs
#run hmmsearch
hmmsearch --tblout arcogs/sequence_results.txt -o arcogs/results_all.txt --domtblout arcogs/domain_results.txt --notextw --cpu $cpus $arcog_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' arcogs/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > arcogs/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g arcogs/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > arcogs/All_arcogs_hmm.txt
#separate header with arcog and ID
awk -F'\t' -v OFS='\t' '{split($2,a,"."); print $1, a[1], $3,$4}' arcogs/All_arcogs_hmm.txt | LC_ALL=C sort > arcogs/temp3
#merge with contig list
LC_ALL=C join -a1 -j1 -e'-' -t $'\t' -o 0,2.2,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort arcogs/temp3) | LC_ALL=C sort > arcogs/temp4
#merge with arcog names
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o1.1,0,2.3,2.4,2.2,1.3 <(LC_ALL=C sort -k2 arcogs/temp4) <(LC_ALL=C sort -k1 $arcog_mapping) | LC_ALL=C sort > arcogs/temp5
#add in headers
echo -e "accession\tarcogs\tarcogs_geneID\tarcogs_Description\tarcogs_Pathway\tarcogs_evalue" | cat - arcogs/temp5 > arcogs/B_arcogs.tsv
#control counts
wc -l arcogs/B_arcogs.tsv
#check what happened
head arcogs/B_arcogs.tsv
#clean-up
rm -f arcogs/temp*
```
Now we have a second table with information for the arCOGs. If we merged the two tables already here, we could check whether each protein is assigned the same annotation when using different databases.
{width=100% fig-align="left"}
Again, it is useful to use more than one database to analyse your genome: no database is perfect, and having several databases to compare allows us to judge how confident we are in a certain annotation.
Unfortunately, you also see that there is no consistent naming convention. For example, in the table above you can see for CALJACFG_00080 that the COG links this protein to a tRNA_A37_methylthiotransferase while the arCOG links it to a 2-methylthioadenine_synthetase, but despite the different names both refer to the same enzyme.
**Question**
We now have 3 annotations for each protein: from prokka, the arCOGs and the COGs. This is a good time to check whether we get consistent annotations across all these results.
As an exercise, look at the gene with ID 00260 for the genome GCA_000008085 and check whether all 3 databases give consistent annotations. If not, which annotation do you trust most? Since prokka generates different genome IDs for each run, I cannot give you the exact accession, but using grep to search for the genome and protein ID should be enough for you to find the protein.
<details>
<summary>Click me to see an answer</summary>
You can search for the term as follows:
```{bash}
#find the prokka annotation
grep 'GCA_000008085-.*00260' FileLists/B_GenomeInfo.txt
#find the hit in our arcog and cog annotations
grep 'GCA_000008085-.*00260' */[A-Z]_*.tsv
```
This should give you the following results:
- Prokka: GCA_000008085-CALJACFG_00260 Sulfide-quinone reductase
- A_NCBI_COGs.tsv:GCA_000008085-CALJACFG_00260 COG1252 NADH_dehydrogenase,_FAD-containing_subunit
- GCA_000008085-CALJACFG_00260 arCOG01065 HcaD NAD(FAD)-dependent_dehydrogenase
CALJACFG_00260 hits a sulfide-quinone reductase with prokka and a NADH_dehydrogenase with the arCOGs and the COGs. Generally, the e-values are pretty good with the last two databases, while prokka by default does not give us any score to judge the confidence of the hit. We should therefore be skeptical about the annotation as sulfide-quinone reductase.
</details>
## KO search using HMMs
The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs and was generated using archaeal, bacterial and eukaryotic genomes.
Possible settings to change:
- number of CPUs
- e-value threshold
```{bash}
#setup directories
mkdir KO_hmm
#run KO search
hmmsearch --tblout KO_hmm/sequence_results.txt -o KO_hmm/results_all.txt --domtblout KO_hmm/domain_results.txt --notextw --cpu $cpus $ko_db faa/Prokka/All_Genomes_clean.faa
#format the full table and only select sequences above a certain e-value
sed 's/ \+ /\t/g' KO_hmm/sequence_results.txt | sed '/^#/d'| sed 's/ /\t/g'| awk -F'\t' -v OFS='\t' '{print $1, $3, $6, $5}' | awk -F'\t' -v OFS='\t' '($4 + 0) <= 1E-3' > KO_hmm/sequence_results_red_e_cutoff.txt
#get best hit based on bit score, and then e-value
sort -t$'\t' -k3,3gr -k4,4g KO_hmm/sequence_results_red_e_cutoff.txt | sort -t$'\t' --stable -u -k1,1 | sort -t$'\t' -k3,3gr -k4,4g > KO_hmm/All_KO_hmm.txt
#merge with protein list
LC_ALL=C join -a1 -t $'\t' -j1 -e'-' -o 0,2.2,2.3,2.4 <(LC_ALL=C sort FileLists/1_Bins_to_protein_list.txt) <(LC_ALL=C sort KO_hmm/All_KO_hmm.txt) | sort -u -k1 > KO_hmm/temp
#merge with KO_hmm names
LC_ALL=C join -a1 -1 2 -2 1 -e'-' -t $'\t' -o 1.1,1.2,1.4,1.3,2.2,2.12 <(LC_ALL=C sort -k2 KO_hmm/temp) <(LC_ALL=C sort -k1 $ko_mapping) | LC_ALL=C sort > KO_hmm/temp3
#add in an extra column that lists whether hits have a high confidence score
awk -v OFS='\t' '{ if ($4 > $5){ $7="high_score" }else{ $7="-" } print } ' KO_hmm/temp3 > KO_hmm/temp4
#add header
echo -e "accession\tKO_hmm\tKO_e_value\tKO_bit_score\tKO_bit_score_cutoff\tKO_Definition\tKO_confidence" | cat - KO_hmm/temp4 > KO_hmm/C_KO_hmm.tsv
#control lines and content
wc -l KO_hmm/C_KO_hmm.tsv
head KO_hmm/C_KO_hmm.tsv
#clean up
rm -f KO_hmm/temp*
```
## Pfam search
The Pfam database is a large collection of protein families that are generated from [UniProt Reference Proteomes](https://www.uniprot.org/help/reference_proteome) and includes archaea, bacteria and eukaryotes.
Possible settings to change:
- number of CPUs
- e-value threshold