---
title: "General AWK notebook"
author: "Nina Dombrowski"
affiliation: "NIOZ"
date: "`r Sys.Date()`"
knit: (function(input_file, encoding) {out_dir <- 'docs';rmarkdown::render(input_file,encoding=encoding,output_file=file.path(dirname(input_file), out_dir, 'AWK_tutorial.html'))})
output:
rmdformats::readthedown:
highlight: kate
code_folding: hide
editor_options:
chunk_output_type: console
---
```{css, echo=FALSE}
.spoiler {
visibility: hidden;
}
.spoiler::before {
visibility: visible;
content: "Spoiler alert! Hover over me to see the answer.";
font-weight: bold
}
.spoiler:hover {
visibility: visible;
}
.spoiler:hover::before {
display: none;
}
```
```{r knitr setup, include=FALSE, eval=TRUE, echo=FALSE, warning=FALSE}
library(knitr)
library(kableExtra)
options(knitr.table.format = "html")
knitr::opts_chunk$set(eval=TRUE, cache=FALSE, message=FALSE, warning=FALSE,
comment = "", results="markup")
#https://bookdown.org/yihui/rmarkdown/html-document.html
#install.packages('knitr', dependencies = TRUE)
#install.packages("devtools", lib="~/R/lib")
#library(DT)
#devtools::session_info()
```
###################################################################################
###################################################################################
# 0. Introduction
###################################################################################
###################################################################################
This file collects code snippets and a small tutorial for working with AWK.
AWK is an excellent tool to filter and manipulate data that comes in rows and columns, which is very common with biological data.
AWK requires no compiling and lets you use variables, numeric functions, string functions, and logical operators.
Importantly, it can deal fast and effectively with very large data files.
Examples of things AWK can do:
- Scans a file line by line
- Splits each input line into fields
- Compares input line/fields to pattern
- Performs action(s) on matched lines
To work with AWK, basic knowledge of how to use the command line is required.
We have a small tutorial for this [here](https://github.com/ndombrowski/Unix_tutorial)
Here, we test basic awk operations with several input files that can be found in the 1_Inputfiles folder.
The files usually look like this:
<p align="left">
<img width=600, height=600, src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/input.png">
</p>
We have a file with
- 4 rows
- 5 columns
- The columns have information for one experiment with control and nitrogen measurements (two things were measured)
In this tutorial we learn how to:
- Subset dataframes by columns
- Subset dataframes based on patterns
- Merge dataframes
- etc.
**General info on this tutorial**
- `r text_spec("grey box.", background = "#BEBEBE", color = "black", bold = F)` : The code that we want to execute
- `r text_spec("red box.", background = "#D05A6E", color = "white", bold = T)` : Urgent comments ;-)
- `r text_spec("Code", background = "white", color = "black", bold = F)` : In some cases the code might be hidden and you only see a little box at the right-hand side. Hidden code means that from the previous sections you should have enough knowledge to do this step, or that it is a test of your skills. If you run into issues, you can check how to run your command by clicking on the `Code` button to reveal the code
- `r text_spec("hyperlink", background = "white", color = "red", bold = F)` : If the text is highlighted like this during a text section, then you can click on it and a hyperlink gives you more info (either the publication or an online tutorial)
- **Exercise:** : This will appear if you should have the required background to finish all the steps yourself. If you run into trouble, you can find the answers by clicking the `Code` button, which will reveal the answer.
- Sometimes we also just ask questions for you to think about the output of certain programs. The text is then hidden and you have to hover over the "spoiler" to see the answer to the question
###################################################################################
###################################################################################
# Basic introduction
###################################################################################
###################################################################################
An essential awk command looks like follows:
**awk 'instructions' files**
The input is read one line at a time. The instructions are enclosed in single quotes to protect them from the shell, since curly brackets and dollar signs would otherwise not be interpreted correctly.
Same as with sed, we can issue multiple commands by using a semicolon.
**awk pattern { action }**
- pattern = when the action is performed
- action = what we want to do
In case we want to run an awk script with more complex commands, we run:
**awk -f script file**
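As a minimal sketch of the script-file form (the file name `count_lines.awk` and the inline sample data are just illustrations, not part of the tutorial files):

```{bash class.source = 'fold-show', eval=FALSE}
#write a small awk program to a file (the name is just an example)
cat > count_lines.awk <<'EOF'
#count the input lines and print the total at the end
{ n = n + 1 }
END { print n }
EOF
#run the script on some inline sample data; this prints 3
printf 'a\nb\nc\n' | awk -f count_lines.awk
```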
###################################################################################
###################################################################################
# Subsetting dataframes
###################################################################################
###################################################################################
1. To test things you can use some files generated for this purpose in the 1_Inputfiles folder.
2. By default the examples below will print to the screen. In case you want to save the output into a new file, you can redirect it with a >. Example: **awk command 1_Inputfiles/Experiment1.txt > File_new.txt**
3. The file **1_Inputfiles/Experiment*.txt** contains five columns: the experiment ID, the conditions, two measurements and a column with some comments. The file **1_Inputfiles/list** contains a list of addresses, which uses the comma as a separator.
4. If you want to follow this tutorial do the following steps
- Make a new folder inside the workshop directory, name it `Awk_tutorial`
- Go inside that folder
- Copy the folder with the test data from `/export/lv3/scratch/workshop_2021/Sxx_Practice/Awk/1_Inputfiles/`
- familiarize yourself with the content of the individual files
```{bash class.source = 'fold-bash', eval=FALSE}
#make folder
mkdir Awk_tutorial
#go inside that new folder
cd Awk_tutorial
#get the test data
cp -r /export/lv3/scratch/workshop_2021/Sxx_Practice/Awk/1_Inputfiles/ .
#check files
head 1_Inputfiles/list
head 1_Inputfiles/Experiment1.txt
head 1_Inputfiles/Metadata
head 1_Inputfiles/nameState
```
## Print a column of interest
######################
First, let's learn how to print the columns we are interested in.
General info:
- **FS** = input field separator
- **OFS** = Output field separator
- **$0** prints everything
- A **BEGIN** rule is executed once only, before the first input record is read
We can print the entire input line with ``$0`` or specific input fields (or columns) with ``$1`` and ``$2`` and so on.
**Important:**
By default awk assumes that spaces and tabs are used as separators for columns/fields. We can change this behaviour, as we will discuss later.
```{bash class.source = 'fold-show', eval=FALSE}
#just print column 1
awk '{print $1}' 1_Inputfiles/list
```
**Exercise**: Print the 3rd input field.
```{bash class.source = 'fold-hide', eval=FALSE}
#just print column 3
awk '{print $3}' 1_Inputfiles/list
```
We can also print lines that match a specific pattern.
```{bash class.source = 'fold-show', eval=FALSE}
#just print addresses from MA
awk '$3~ /MA/' 1_Inputfiles/list
```
**Exercise**: Extract the data for Amy
```{bash class.source = 'fold-hide', eval=FALSE}
#print every line that matches Amy
awk '/Amy/' 1_Inputfiles/list
```
In the above example, we specify a pattern but no procedure. In that case the default action is to print each line that matches the pattern.
We can not only specify the pattern but also say exactly which fields we want to print.
```{bash class.source = 'fold-show', eval=FALSE}
#just print addresses from MA and just print the names
awk '/MA/ {print $1}' 1_Inputfiles/list
```
We can also specify the separator for the different fields using the ``-F`` option. For example, we can use the comma in our list.
```{bash class.source = 'fold-show', eval=FALSE}
#just print addresses from MA and just print the names
awk -F, '/MA/ {print $1}' 1_Inputfiles/list
```
If we use this command, not only the first name but the full name is printed, since the comma now separates the fields.
**Exercise**: Extract in what street Amy lives
```{bash class.source = 'fold-hide', eval=FALSE}
#print the street (column 2) for Amy
awk -F, '/Amy/{print $2}' 1_Inputfiles/list
```
If we want to specify in which column to look for the pattern, we do it like this.
```{bash class.source = 'fold-show', eval=FALSE}
#just print names from MA
awk -F, '$3 ~ /MA/ {print $1}' 1_Inputfiles/list
```
This is especially useful if MA might also occur in unrelated columns.
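To see the difference, here is a small made-up sample (not one of the tutorial files): a bare `/MA/` matches anywhere on the line, while `$3 ~ /MA/` only matches in the third field.

```{bash class.source = 'fold-show', eval=FALSE}
#MAx Corp in column 2 matches the bare pattern but not the field-anchored one
printf 'Amy,MAx Corp,NY\nJoe,Acme,MA\n' > demo_list
awk -F, '/MA/ {print $1}' demo_list       #prints Amy and Joe
awk -F, '$3 ~ /MA/ {print $1}' demo_list  #prints only Joe
rm demo_list
```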
If we want to search for two patterns:
```{bash class.source = 'fold-show', eval=FALSE}
#just print names from MA and PA
awk -F, '$3 ~ /MA|PA/ {print $1}' 1_Inputfiles/list
```
We can also print all names but exclude people from MA and PA
```{bash class.source = 'fold-show', eval=FALSE}
#print names and states, excluding people from MA and PA
awk -F, '!/MA|PA/ {print $1,$3}' 1_Inputfiles/list
awk -F, '$3 !~ /MA|PA/ {print $1,$3}' 1_Inputfiles/list
```
## Splitting data by categories
######################
Now imagine we want a separate table with the names of people for each state; with awk that is relatively easy to do.
```{bash class.source = 'fold-show', eval=FALSE}
#write each line into a file named after the state in column 3
awk -F, '{print > $3}' 1_Inputfiles/list2
#use ll to see what happened
ls -l
#check the content of a file
head VA
```
To make this a bit nicer, we can even add an extension to the new files that are generated.
```{bash class.source = 'fold-show', eval=FALSE}
#write each line into a file named after the state, adding a .txt extension
awk -F, '{print > $3".txt"}' 1_Inputfiles/list2
#use ll to see what happened
ls -l
#check the content of a file
head VA.txt
```
We can also just print certain data, i.e. just the names
```{bash class.source = 'fold-show', eval=FALSE}
#only write the names (column 1) into the per-state files
awk -F, '{print $1 > $3".txt"}' 1_Inputfiles/list2
#use ll to see what happened
ls -l
#check the content of a file
head VA.txt
```
## Combining several actions
######################
We can combine multiple actions by using the semicolon (same as in sed)
```{bash class.source = 'fold-show', eval=FALSE}
#print fields 1, 2 and 3, each on its own line
awk -F, '{print $1; print $2; print $3}' 1_Inputfiles/list
```
Options used:
- `-F`: we define the input field delimiter and say to only see the comma as a delimiter not the space
## Print several columns
######################
We can also print any other combination as shown below. Try this by yourself and see what is happening.
```{bash class.source = 'fold-show', eval=FALSE}
#print column 1 and 3
awk -F, '{print $1, $3}' 1_Inputfiles/list
#print column 1 and 3 and separate them with a minus symbol
awk -F, '{print $1"-"$3}' 1_Inputfiles/list
#print column 1 and 3 and tab-separate them
awk -F, '{print $1 "\t" $3}' 1_Inputfiles/list
awk -F, -v OFS="\t" '{print $1, $3}' 1_Inputfiles/list
```
Notice:
- The ``"\t"`` is an escape sequence that represents a ``tab``
Options used:
- `-v OFS="\t"`: we define the output field delimiter as a tab
**Exercise**: Extract in what street and state Amy lives. Separate the columns by a `!`
```{bash class.source = 'fold-hide', eval=FALSE}
#print the street and state for Amy, separated by !
awk -F',' '/Amy/{print $2"!"$3}' 1_Inputfiles/list
```
## Print everything
######################
We can also print everything using ``$0``.
```{bash class.source = 'fold-show', eval=FALSE}
awk '{print $0}' 1_Inputfiles/list
```
Another way to do this is:
```{bash class.source = 'fold-show', eval=FALSE}
awk '1' 1_Inputfiles/list
```
The above awk command has only the pattern (or condition) part and no action part. The '1' in the pattern evaluates to "true" for every line. As said above, when no action statement is given the default is to print the line, and hence the entire file contents get printed.
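A quick check with some inline sample data (made up here) shows that the pattern-only form behaves exactly like an explicit print:

```{bash class.source = 'fold-show', eval=FALSE}
#both commands print every input line unchanged
printf 'a b\nc d\n' | awk '1'
printf 'a b\nc d\n' | awk '{print $0}'
```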
## Print a column of interest and define the column separators
######################
Now, let's work with a file that uses a tab to separate columns, since this is the type of file we more commonly work with: `1_Inputfiles/Experiment1.txt`.
**Something to keep in mind about the default awk behaviour:**
By default awk uses BOTH tabs and spaces as separators, which sometimes can create issues.
Besides using -F and -v OFS to control that behaviour you might also see some different ways that basically do the same thing:
```{bash class.source = 'fold-show', eval=FALSE}
#Print column 1 and specify that the input separator (-F) is a tab
awk 'BEGIN{FS="\t"}{print $1}' 1_Inputfiles/Experiment1.txt
#Print column 1 and 2 and specify that the input separator (-F) is a tab and the output separator is a comma
awk 'BEGIN{FS="\t";OFS=","}{print $1,$2}' 1_Inputfiles/Experiment1.txt
#print column 1 and then all the columns
awk 'BEGIN{FS=OFS="\t"} {print $1,$0}' 1_Inputfiles/Experiment1.txt
```
If we look at the second example, we see that awk recognizes the tab of the input file but uses the comma for the generated output.
<p align="left">
<img width=100, height=100, src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/awk2.png">
</p>
For the rest of the tutorial we set both FS and OFS to tab by default.
## Add a column
######################
We can use this if we want to add some extra information into our tables.
Later we will do this in a more sophisticated way using **if/else** statements.
```{bash class.source = 'fold-show', eval=FALSE}
#add a new column in the beginning
awk 'BEGIN{FS=OFS="\t"}{print "redo",$0}' 1_Inputfiles/Experiment1.txt
```
If we run this, we see that we have a new first column followed by the original dataframe.
<p align="left">
<img width=600, height=600, src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/awk3.png">
</p>
We can exactly control where we add new information.
For example, with the code below, we add the extra column between the 2nd and 3rd column of the original dataframe.
```{bash class.source = 'fold-show', eval=FALSE}
#add a new column as new third column
awk 'BEGIN{FS=OFS="\t"}{print $1,$2,"redo",$3,$4,$5}' 1_Inputfiles/Experiment1.txt
```
**Exercise**:
Imagine we want to add a column with the temperature of the experiment. Add a new column at the end with 36C
```{bash class.source = 'fold-hide', eval=FALSE}
#add a new column at the end
awk 'BEGIN{FS=OFS="\t"}{print $0,"36C"}' 1_Inputfiles/Experiment1.txt
```
## Change the content of an existing column
######################
Let's say, we do not like to use the word Experiment in our table, but rather would say Repetition.
We can change the content of the first column like this:
```{bash class.source = 'fold-show', eval=FALSE}
#change the content in column 1 to **Repetition1**
awk 'BEGIN{FS=OFS="\t"}{$1="Repetition1"} {print $0 }' 1_Inputfiles/Experiment1.txt
```
<p align="left">
<img width=600, height=600, src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/awk4.png">
</p>
**Exercise**
After plotting the data, we realized we made a mistake with the measurements we initially labelled as control and N[0.4uM]: we never actually changed the N levels. Rename everything in column 2 to control.
```{bash class.source = 'fold-hide', eval=FALSE}
#rename everything in column 2 to control
awk 'BEGIN{FS=OFS="\t"}{$2="control"} {print $0 }' 1_Inputfiles/Experiment1.txt
```
## Split a column
######################
We can also split columns. For example, for the Ex1 in the first column, we might want to split before and after the x and print the results into new columns.
This is a toy example, but when dealing with proteins you often have something like GenomeName-ProteinID, and for some operations we might want to have that information in different columns.
Some new syntax for AWK:
- **split()** = the function 'split' divides a string into pieces
- **$1** = The column we want to split
- **"x"** = The pattern we want to use for splitting
- **a** = the name we give the ARRAY we generate. An ARRAY is similar to a variable, except that it can store several values; here it stores the different pieces that were split
- **a[1]** = retrieve the first piece of the split array = "E"
- **a[2]** = retrieve the second piece of the split array = "1"
Here, we see the output and what our arrays are:
<p align="left">
<img src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/AWK_arrays.png">
</p>
```{bash class.source = 'fold-show', eval=FALSE}
#split using **x** in the 1st column and print the two arrays generated in that step
awk 'BEGIN{FS=OFS="\t"}{split($1,a,"x")} {print $1,a[1],a[2],$2,$3}' 1_Inputfiles/Experiment1.txt
```
The output of this line of code should look like this:
<p align="left">
<img width=600, height=600, src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/awk5.png">
</p>
**Exercise**
- In column 5, split after the '_'. Print the first two columns and the first and second part of the array
```{bash class.source = 'fold-hide', eval=FALSE}
#split column 5 on '_' and print the first two columns plus both pieces
awk 'BEGIN{FS=OFS="\t"}{split($5,a,"_")} {print $1,$2,a[1],a[2]}' 1_Inputfiles/Experiment1.txt
```
## Dealing with more than one column
######################
In this example we work with the split function we used above but the syntax stays the same for other commands.
More specifically, we want to do the following:
1. Split the third column by the dot
2. Split the fourth column by the dot
3. Print the 1st and 2nd columns, the first element of the first split (a[1]) and the first element of the second split (b[1])
The curly brackets allow us to separate two different commands (version 1). We can also do this by using a semicolon (version 2).
```{bash class.source = 'fold-show', eval=FALSE}
#version 1
awk -F'\t' -v OFS='\t' '{split($3,a,".")} {split($4,b,".")} {print $1,$2,a[1],b[1]}' 1_Inputfiles/Experiment1.txt
#version 2
awk -F'\t' -v OFS='\t' '{split($3,a,"."); split($4,b,"."); print $1,$2,a[1],b[1]}' 1_Inputfiles/Experiment1.txt
```
## Use search and replace
######################
We can search for patterns and if a pattern is found we can replace it with something else.
For example, we can search for **Ex** and replace it with **Experiment**
We can also use wildcards here. The default behaviour of wildcards is explained in more detail in the [Unix tutorial](https://github.com/ndombrowski/Unix_tutorial).
```{bash class.source = 'fold-show', eval=FALSE}
#default
awk -F'\t' -v OFS='\t' '{gsub(/Ex/,"Experiment")}{print $0}' 1_Inputfiles/Experiment1.txt
#use gsub together with wildcards
#Here, we replace control1 + control2 with blank
awk -F'\t' -v OFS='\t' '{gsub(/control[12]/,"blank")}{print $0}' 1_Inputfiles/Experiment1.txt
```
The second example looks like this:
<p align="left">
<img width=600, height=600, src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/awk6.png">
</p>
We can see that since we use the wildcard ([12]) both control1 and control2 are replaced with blank.
**Exercise**
- replace the `N[0.4uM]` with `nitrogen`. Do not forget that brackets mean something else for awk and we need to tell it to use the brackets literally. If you do not know how to do this, check the general Unix tutorial
- replace the Ex1 with repetition and the control[12] with blank
```{bash class.source = 'fold-hide', eval=FALSE}
#question1
awk -F'\t' -v OFS='\t' '{gsub(/N\[0.4uM\]/,"nitrogen")}{print $0}' 1_Inputfiles/Experiment1.txt
#question2
awk -F'\t' -v OFS='\t' '{gsub(/Ex1/,"Experiment"); gsub(/control[12]/,"blank")}{print $0}' 1_Inputfiles/Experiment1.txt
```
## Only print rows that have a pattern of interest
######################
We can also print only the rows that we find relevant. E.g. we might only want to print the rows with our control measurements, or only the information from specific treatments, metadata for certain organisms, etc.
New syntax that we use here:
- **?** = in a regular expression, makes the preceding character optional (it matches zero or one occurrence; to match any single character use **.**)
- **==** = a logical operator/Boolean expression. This one means: find a value that is **equal** to the pattern we gave
- **!=** = means unequal to
- **~** = means: match against a regular expression, so wildcards are allowed
- **||** = the or operator
```{bash class.source = 'fold-show', eval=FALSE}
#only print rows with control treatments (search across all columns)
awk -F'\t' -v OFS='\t' '/control?/''{print $0}' 1_Inputfiles/Experiment1.txt
#only print rows if we find the term control2 in column2
awk -F'\t' -v OFS='\t' '$2=="control2"''{print $0}' 1_Inputfiles/Experiment1.txt
#only print rows with control in column 2 (AND use wildcards)
awk -F'\t' -v OFS='\t' '$2~"control?"''{print $0}' 1_Inputfiles/Experiment1.txt
#print everything EXCEPT control2
awk -F'\t' -v OFS='\t' '$2!="control2"' 1_Inputfiles/Experiment1.txt
#we can search for several patterns across different columns like this
awk -F'\t' -v OFS='\t' '$2 =="N" || $3>6''{print $0}' 1_Inputfiles/Experiment1.txt
```
**Exercise**
- Find the rows in which the experiment was ok and where we did not have issues with contamination (the extra info is stored in column 5)
```{bash class.source = 'fold-hide', eval=FALSE}
awk -F'\t' -v OFS='\t' '$5=="#all_good"''{print $0}' 1_Inputfiles/Experiment1.txt
#or
awk -F'\t' -v OFS='\t' '$5~"good"''{print $0}' 1_Inputfiles/Experiment1.txt
```
## Only print data if a condition is met
######################
Here: Only print rows if the second column has the pattern **N**
```{bash class.source = 'fold-show', eval=FALSE}
#search pattern across the whole file
awk -F'\t' -v OFS='\t' '/N/ {print $0}' 1_Inputfiles/Experiment1.txt
#search pattern in specific column
awk -F'\t' -v OFS='\t' '$2 ~ "N" { print $0 }' 1_Inputfiles/Experiment1.txt
```
## Subset dataframes by using cutoffs and using relational operators
######################
### Relational operators
```{bash class.source = 'fold-show', eval=FALSE}
awk 'BEGIN {a = 10; b = 10; if (a == b) print "a==b"}'
awk 'BEGIN {a = 10; b = 20; if (a != b) print "a = "a,"b = "b,"therefore", "a!=b"}'
awk 'BEGIN {a = 10; b = 20; if (a <= b) print "a<=b"}'
awk 'BEGIN {a = 10; b = 10; if (a < b) print "a==b"; else print "false"}'
```
We might also want to subset data by numeric values. E.g. we might only want to retain rows where values are larger than 10, or larger than a certain e-value, etc.
This can be very helpful when subsetting blast tables
We can combine two statements using **||** , which means OR
Alternatively, we can use **&&** , which means AND
```{bash class.source = 'fold-show', eval=FALSE}
#only print rows if the third column has values larger than 5
awk -F'\t' -v OFS='\t' '$3>5' 1_Inputfiles/Experiment1.txt
#if the value in the third column is equal to 5.2 do NOT print
awk -F'\t' -v OFS='\t' '$3!=5.2' 1_Inputfiles/Experiment1.txt
#if the third column is larger than 5 OR the fourth column is greater than 0.2 then print
awk -F'\t' -v OFS='\t' '($3>5)||($4>0.2)' 1_Inputfiles/Experiment1.txt
#only print if the values in the third column are equal to or greater than 5.2
awk -F'\t' -v OFS='\t' '$3>=5.2' 1_Inputfiles/Experiment1.txt
#only print if the third column is equal/larger than 5.2 AND the fourth column is larger than 4.5
awk -F'\t' -v OFS='\t' '($3>=5.2)&&($4>4.5)' 1_Inputfiles/Experiment1.txt
```
**Exercise**
- search for data, where we measured the N treatments and where the value in the third column is greater than 10
```{bash class.source = 'fold-hide', eval=FALSE}
awk -F'\t' -v OFS='\t' '($2~"N")&&($3>10)''{print $0}' 1_Inputfiles/Experiment1.txt
```
## Subsetting data and storing the results into different tables
######################
Remember when we split our list into separate files depending on what state people lived in? We can also do this with numbers, and store numbers of one or smaller in a new file.
```{bash class.source = 'fold-hide', eval=FALSE}
awk -F'\t' -v OFS='\t' '{if($3<=1)print > "small.txt";else print > "large.txt"}' 1_Inputfiles/Experiment1.txt
#check if that worked
head small.txt
```
Using a so-called `if-else statement` we say: if the value is less than or equal to 1, the record goes to "small.txt", else to "large.txt".
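The same if-else split can be tried on a small inline sample (the file names `small_demo.txt`/`large_demo.txt` are throwaway examples):

```{bash class.source = 'fold-show', eval=FALSE}
#route each number into one of two files depending on the cutoff
printf '0.5\n3\n0.9\n7\n' | awk '{if($1<=1) print > "small_demo.txt"; else print > "large_demo.txt"}'
cat small_demo.txt   #contains 0.5 and 0.9
cat large_demo.txt   #contains 3 and 7
rm small_demo.txt large_demo.txt
```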
## Changing specific cells
######################
This specific command can be quite useful when we for example want to change headers.
Here, we learn several new statements that can be used in AWK:
- **if** = Do the action only IF a specific condition is met
- **NR** = the number of the current record (i.e. the row number)
- **NR==3** = we are in the 3rd row
- **if(NR==3)** = if we are in row 3, do something
- **if(NR==3) $2="P"** = if we are in row 3, replace the value in the 2nd column with P
```{bash class.source = 'fold-show', eval=FALSE}
#change the value in the 2nd column to **P** if we are in the 3rd row
awk 'BEGIN{FS="\t"; OFS="\t"}{if(NR==3) $2="P"} {print $0 }' 1_Inputfiles/Experiment1.txt
```
**Exercise**
We made an error in our comment section (column 5). Instead of `#all_good` for our second measurements, we wanted to write `#stunted_growth`. Correct this error
```{bash class.source = 'fold-hide', eval=FALSE}
#change the value in column 5 to **#stunted_growth** if we are in the 2nd row
awk 'BEGIN{FS="\t"; OFS="\t"}{if(NR==2) $5="#stunted_growth"} {print $0 }' 1_Inputfiles/Experiment1.txt
```
## AWK and variables
######################
awk provides the `-v` option to pass arguments. Using this, we can pass a shell variable to it.
```{bash class.source = 'fold-show', eval=FALSE}
#define variable
x=3
#add 3 to our third column
awk -v val=$x '{print $0"-->"$3+val}' 1_Inputfiles/Experiment1.txt
```
###################################################################################
###################################################################################
# Math with awk
###################################################################################
###################################################################################
### use Arithmetic operators
######################
```{bash class.source = 'fold-show', eval=FALSE}
awk 'BEGIN {a = 50; b = 20; print "(a + b) = ", (a + b)}'
awk 'BEGIN {a = 50; b = 20; print "(a - b) = ", (a - b)}'
awk 'BEGIN {a = 50; b = 20; print "(a * b) = ", (a * b)}'
awk 'BEGIN {a = 50; b = 20; print "(a / b) = ", (a / b)}'
```
We can also use this on our data tables like this:
```{bash class.source = 'fold-show', eval=FALSE}
#add the values of the 3rd and 4th column
awk '{print $3+$4}' 1_Inputfiles/Experiment1.txt
#subtract
awk '{print $3-$4}' 1_Inputfiles/Experiment1.txt
#multiply
awk '{print $3*$4}' 1_Inputfiles/Experiment1.txt
```
## Using variables
######################
### Using one variable
Here, we create a variable named **SUM** that stores added values of the 3rd and 4th column. We can use this variable for printing.
```{bash class.source = 'fold-show', eval=FALSE}
awk '{SUM=$3+$4}{print $1,$2,SUM}' 1_Inputfiles/Experiment1.txt
```
**Exercise**
- Subtract column 3 from column 4 and store the result in a variable with the name 'diff'. Print the second column and diff
```{bash class.source = 'fold-hide', eval=FALSE}
awk '{diff=$4-$3}{print $2,diff}' 1_Inputfiles/Experiment1.txt
```
### Using several variables
Here, we again create the variable **SUM**. Then, we run a second command (using the semicolon) and add a 1 to the values stored in **SUM** and create a new variable named **SUM2**
```{bash class.source = 'fold-show', eval=FALSE}
awk '{SUM=$3+$4;SUM2=SUM+1}{print $1,$2,SUM,SUM2}' 1_Inputfiles/Experiment1.txt
```
More usefully, we can use this to build basic calculations, for example to compute the average and print it:
```{bash class.source = 'fold-show', eval=FALSE}
awk '{SUM = $3+$4; avg = SUM/2}{print $1,$2,avg}' 1_Inputfiles/Experiment1.txt
```
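If an experiment has more than two measurement columns, hard-coding `$3+$4` does not scale. Here is a sketch that loops over all fields from the 3rd to the last, assuming every field from column 3 onwards is numeric (the mock values are made up):
```{bash class.source = 'fold-show', eval=FALSE}
#average all fields from column 3 up to the last field (NF)
printf 'plantA t1 2 4 6\nplantB t1 3 6 9\n' | \
  awk '{s = 0; for (i = 3; i <= NF; i++) s += $i; print $1, $2, s/(NF-2)}'
```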
### Summarizing data across columns
<p align="left">
<img src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/Row_v_Column.png">
</p>
Now, we want to add all the values of the third column together:
- **SUM+=$3** = add the value of column 3 to the running total stored in the variable **SUM** (`+=` is shorthand for `SUM = SUM + $3`)
- **2x{}** = we execute two action blocks
- **END** = the second action block runs only once, after the last line of the file has been processed; the first action block runs for every line
```{bash class.source = 'fold-show', eval=FALSE}
#using the variable SUM
awk '{SUM+=$3}END{print SUM}' 1_Inputfiles/Experiment1.txt
#using the variable new_name
awk '{New_name+=$3}END{print New_name}' 1_Inputfiles/Experiment1.txt
#add sum text to the results
awk '{SUM+=$3}END{print "The result is",SUM}' 1_Inputfiles/Experiment1.txt
#also here we can combine two actions into one command, i.e. summarize the 3rd and 4th column
awk '{SUM_Col3+=$3;SUMCol4+=$4}END{print SUM_Col3,SUMCol4}' 1_Inputfiles/Experiment1.txt
```
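The same pattern in self-contained form, with three mock numbers piped in, makes the mechanics easy to see:
```{bash class.source = 'fold-show', eval=FALSE}
#sum a single column of numbers; the total is printed once, at the end
printf '1\n2\n3\n' | awk '{SUM += $1} END {print SUM}'
```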
If we do not use **END** this gets messed up, so be careful. Without this keyword, the running total is printed after every line instead of once at the end. You can check this unwanted behaviour with the code below.
```{bash class.source = 'fold-show', eval=FALSE}
#not using END
awk '{New_name+=$3}{print New_name}' 1_Inputfiles/Experiment1.txt
```
### Summarizing across columns by using categories
In this example, we want to only summarize the values across our three categories: control1, control2 and N
<p align="left">
<img src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/Categories.png">
</p>
New elements to keep in mind:
- **{SUM[$2]+=$3}** = sum column 3 based on the categories in column 2. The totals are stored in an array named SUM, with one index (i) per category: control1, control2 and N[0.4uM]
- **for(i in SUM)** = a for loop that visits every one of the 3 indices stored in the array
- **print i** = print each index stored in SUM
- **print SUM[i]** = print each calculated sum stored in the array
```{bash class.source = 'fold-show', eval=FALSE}
#using the default categories
awk '{SUM[$2]+=$3}END{for(i in SUM) print i,SUM[i]}' 1_Inputfiles/Experiment1.txt
#combining control1 and control2 into one category (using pipes again; note that pipes let us combine different programs)
sed 's/control[12]/control/g' 1_Inputfiles/Experiment1.txt | awk '{SUM[$2]+=$4}END{for(i in SUM)print i,SUM[i]}'
```
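A minimal self-contained version of the same idea, using two made-up categories `a` and `b`. Note that the order in which `for (i in SUM)` visits the indices is not guaranteed, so we pipe through `sort` for a stable output:
```{bash class.source = 'fold-show', eval=FALSE}
#sum column 2 grouped by the category in column 1
printf 'a 1\na 2\nb 5\n' | awk '{SUM[$1] += $2} END {for (i in SUM) print i, SUM[i]}' | sort
```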
### Using the built-in variables
#### The NF variable
The NF variable holds the number of fields in the current input record.
```{bash class.source = 'fold-show', eval=FALSE}
awk '{print $0"-->"NF;}' 1_Inputfiles/Experiment1.txt
```
**Question**
Why does the 3rd row have 6 fields?
[Remember that awk usually sees both space and tab as field separator. So without specifying the tab as unique separator, `#contamination yellow_discoloration?` is seen as two separate fields.]{.spoiler}
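You can verify the separator behaviour directly with a mock line that mixes a space and a tab:
```{bash class.source = 'fold-show', eval=FALSE}
#default: runs of spaces and tabs both split fields -> 3 fields
printf 'a b\tc\n' | awk '{print NF}'
#tab as the only separator -> 2 fields ("a b" and "c")
printf 'a b\tc\n' | awk -F'\t' '{print NF}'
```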
#### The NR variable
NR holds the number of the current record being processed, i.e. the line number.
```{bash class.source = 'fold-show', eval=FALSE}
awk '{print $0"-->"NR;}' 1_Inputfiles/Experiment1.txt
```
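Used inside an `END` block, NR gives a quick line count, similar to `wc -l`:
```{bash class.source = 'fold-show', eval=FALSE}
#count the number of input lines
printf 'x\ny\nz\n' | awk 'END {print NR}'
```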
#### The FILENAME variable
While this is not math, this variable is incredibly useful if you have thousands of files that you want to merge (as long as you can distinguish them by file name). For example, if we had 100 files with measurements from different experiments, we could combine them into one document, as long as they share the same layout in terms of number of columns and have a useful header.
```{bash class.source = 'fold-show', eval=FALSE}
awk '{print FILENAME,$0}' 1_Inputfiles/Experiment1.txt
```
With our example we would get something like this:
<p align="left">
<img src="/Users/ninadombrowski/Desktop/WorkingDir/Notebooks/Pictures/Filenames.png">
</p>
Since the **.txt** is not pretty, we can also remove it using gsub and a pipe. Pipes are explained in the general notebook, but basically they allow us to combine two commands into one.
```{bash class.source = 'fold-show', eval=FALSE}
awk '{print FILENAME,$0}' 1_Inputfiles/Experiment1.txt | awk '{gsub(/\.txt$/,"")}{print $0}'
```
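To see the merging in action without the course files, here is a sketch using two throwaway files in `/tmp` (the file names and contents are made up):
```{bash class.source = 'fold-show', eval=FALSE}
printf '1 2\n' > /tmp/expA.txt
printf '3 4\n' > /tmp/expB.txt
#each output line is prefixed with the file it came from
awk '{print FILENAME, $0}' /tmp/expA.txt /tmp/expB.txt
rm /tmp/expA.txt /tmp/expB.txt
```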
#### The FNR variable
When awk reads from multiple input files, the NR variable gives the record number counted across all input files combined, while FNR gives the record number within each individual input file (it restarts at 1 for every file).
```{bash class.source = 'fold-show', eval=FALSE}
awk '{print FILENAME, FNR;}' 1_Inputfiles/Experiment1.txt 1_Inputfiles/Metadata
```
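The comparison `FNR==NR` is the basis of a classic awk idiom for joining two files: while awk reads the first file, FNR and NR are equal, so we can load that file into an array and then look values up while reading the second file. A sketch with two made-up throwaway files:
```{bash class.source = 'fold-show', eval=FALSE}
printf 'k1 red\nk2 blue\n' > /tmp/meta.txt
printf 'k1 10\nk2 20\n' > /tmp/data.txt
#first file: store column 2 in an array keyed by column 1; second file: append the lookup
awk 'FNR==NR {color[$1] = $2; next} {print $0, color[$1]}' /tmp/meta.txt /tmp/data.txt
rm /tmp/meta.txt /tmp/data.txt
```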
### Dealing with headers
######################
Sometimes dealing with headers can be challenging, since they are an extra row we might want to exclude, for example when summarizing data. Now that we know about the special variables, we can deal with this, for example in the `Experiment2.txt` file, which comes with a header.
```{bash class.source = 'fold-show', eval=FALSE}
#check how the file looks like
head 1_Inputfiles/Experiment2.txt
#remove the header
awk 'NR!=1{print $0}' 1_Inputfiles/Experiment2.txt
```
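Two common header idioms in self-contained form (with a mock table): `NR > 1` as a bare pattern drops the header, while `NR == 1 {print; next}` keeps the header untouched and applies the calculation only to the data rows:
```{bash class.source = 'fold-show', eval=FALSE}
#drop the header
printf 'name v1 v2\na 1 2\nb 3 4\n' | awk 'NR > 1'
#keep the header as-is, compute on the remaining rows
printf 'name v1 v2\na 1 2\nb 3 4\n' | awk 'NR == 1 {print; next} {print $1, $2+$3}'
```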
**Exercise**
- For Experiment2.txt calculate the sum of the third plus the fourth column. Do this with and without removing the header
```{bash class.source = 'fold-hide', eval=FALSE}
#skipping the header row entirely
awk 'NR!=1{SUM=$3+$4; print $1,$2,SUM}' 1_Inputfiles/Experiment2.txt
#keeping the header (awk treats the header strings as 0, so the first SUM is empty/0)
awk '{SUM=$3+$4}{print $1,$2,SUM}' 1_Inputfiles/Experiment2.txt
```
### use Increment and decrement operators
######################
- ++a : first increments the value of the operand by 1, then returns the incremented value (pre-increment)
- a++ : first returns the current value of the operand, then increments it (post-increment; --a and a-- work analogously for decrementing)
- printf = allows more formatting options than print
```{bash class.source = 'fold-show', eval=FALSE}
awk 'BEGIN {a = 10; b = ++a; print b}'
awk 'BEGIN {a = 10; b = ++a; printf "a = %d, b = %d\n", a, b}'
awk 'BEGIN {a = 10; b = --a; printf "a = %d, b = %d\n", a, b}'
awk 'BEGIN {a = 10; b = a--; printf "a = %d, b = %d\n", a, b}'
```
We could, for example, tell awk to count the number of lines that match a specified pattern and then output that count. Let's count how often we did control measurements.