Intro_to_programming_for_data_sci/robjects.qmd at main · NUstat/Intro_to_programming_for_data_sci · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
---
title: "R: Objects"
format:
    html:
      toc: true
      self-contained: true
editor_options:
  chunk_output_type: console
---

## Atomic vectors

An atomic vector in R is a vector containing objects of the same datatype. If the objects are not of the same datatype, then they are coerced to be of the same datatype. It is defined using the keyword `c()`.

```{r}
numbers = c(1, 2, 67)
```

The in-built R function `length()` is used to find the length of an atomic vector.

```{r}
length(numbers)
```

### Slicing the atomic vector

#### Slicing using indices

An atomic vector can be sliced using the indices of the elements within  `[]` brackets.

For example, consider the vector:

```{r}
vec <- 1:40
```

Suppose, we wish to get the $3^{rd}$ element of the vector. We can get it using the index `3`:

```{r}
vec[3]
```

A sequence of consecutive elements can be sliced using the indices of the first element and the last element around the `:` operator. For example, let us slice elements from the $3^{rd}$ index to the $10^{th}$ element of the vector `vec`:

```{r}
vec[3:10]
```

We can slice elements at different indices by putting the indices in an atomic vector within the `[]` brackets. Let us slice the $4^{th}$, $7^{th}$, and $18^{th}$ elements of the vector `vec`:

```{r}
vec[c(4,7,18)]
```

We can slice consecutive elements, and non-consecutive elements simultaneously. Let us slice the elements from the $4^{th}$ index to the $9^{th}$ index and the $30^{th}$ and $36^{th}$ element.

```{r}
vec[c(4:9,30,36)]
```

#### Slicing using a logical atomic vector

An atomic vector can be sliced using a logical atomic vector of the same length. The logical atomic vector will have `TRUE` values corresponding to the indices where the element is to be selected, and `FALSE` where the element is to be discarded. See the example below.

```{r}
vec <- 1:5
vec[c(TRUE, FALSE, FALSE, TRUE, FALSE)]
```

### Removing elements from atomic vector

Elements can be removed from the vector using the negative sign within `[]` brackets.

Remove the 2nd element from the vector:

```{r}
vec <- 1:5
vec[-2]
```

If multiple elements need to be removed, the indices of the elements to be removed can be given as an atomic vector.

Remove elements `2` to `6` and element 10 from the vector:
```{r}
vec <- 1:20
vec[-c(2:6, 10)]
```

**Example:** USA’s GDP per capita from 1960 to 2021 is given by the vector `G` in the code chunk below. The values are arranged in ascending order of the year, i.e., the first value is for 1960, the second value is for 1961, and so on. Store the years in which the GDP per capita of the US increased by more than 10%, in a vector.

```{r}
G = c(3007, 3067, 3244, 3375,3574, 3828, 4146, 4336, 4696, 5032,5234,5609,6094,6726,7226,7801,8592,9453,10565,11674,12575,13976,14434,15544,17121,18237,19071,20039,21417,22857,23889,24342,25419,26387,27695,28691,29968,31459,32854,34515,36330,37134,37998,39490,41725,44123,46302,48050,48570,47195,48651,50066,51784,53291,55124,56763,57867,59915,62805,65095,63028,69288)
```

```{r}
years <- c()
for (i in 1:(length(G) - 1)) {
  diff = (G[i+1] - G[i]) / G[i]
  if (diff > 0.1) years <- c(years, 1960 + i)
}
print(years)
```

### Element-wise operations on atomic vectors

When we use arithmetic operators like `+`, `-`, `*`, etc., or comparison operators like `>`, `>=`, `==`, etc., between atomic vectors, then these operators are applied element-wise on the elements of the respective atomic vectors with the same index. Consider the examples below.

```{r}
vec1 <- 1:4
vec2 <- 1:4
vec1 + vec2
```

```{r}
vec1 > vec2
```

It is highly recommended that these operators be applied on atomic vectors of the same length. Otherwise, the vector of the smaller length will broadcast (or repeat itself) to match the length of the larger vector. A warning will be returned if the length of the longer vector is not a multiple of the length of the shorter vector. Broadcasting may be difficult to interpret, especially when arithmetic operators are being applied on more than 2 atomic vectors of different lengths.

If an operator is applied between an atomic vector and a scalar, then the operation is performed on each element of the atomic vector and the scalar. See the examples below.

```{r}
vec1*4
```

```{r}
vec1 > 2
```

Suppose, we wish to slice all elements from the object `vec` that are greater than 2. Here is one approach to do it. We will apply the `>` operator between `vec` and 2 to obtain a logical vector that is `TRUE` on indices where the condition is satisfied, and `FALSE` otherwise. We will then use this logical vector to slice `vec`. Below is the code.

```{r}
vec1[vec1 > 2]
```

Now, solve the previous **example** without using a `for` loop.


### The `seq()` function

The `seq()` function is used to generate an atomic vector consisting of a sequence of integers with a constant gap. For example, the code below generates a sequence of integers starting from 20 upto 60 with gaps of 5.

```{r}
seq(20, 60, 5)
```

### The `rep()` function

The `rep()` function is used to repeat an object a fixed number of times.

```{r}
rep(4, 10)
```

```{r}
rep(c(2, 3), 10)
```

### The `which()` function

The `which()` function is used to find the indices of `TRUE` elements in a logical  atomic vector.

```{r}
vec <- c(8, 3, 4, 7, 9, 7, 5)
```

```{r}
which(vec == 8)
```

In the above code, a logical vector is being created with `vec == 8`, and the `which()` function is returning the indices of the `TRUE` elements.

The index of the maximum and minimum values can be found using `which.max()` and `which.min()` respectively. In case of multple maximum or minimum elements, the smallest index is returned.

```{r}
which.max(vec)
```

```{r}
which.min(vec)
```


### Practice exercise 1

Below is a vector consisting of responses to the question: “At what age do you think you will marry?” from students of the STAT303-1 Fall 2022 class.

```{r}
exp_marriage_age <- c('24','30','28','29','30','27','26','28','30+','26','28','30','30','30','probably never','30','25','25','30','28','30+ ','30','25','28','28','25','25','27','28','30','30','35','26','28','27','27','30','25','30','26','32','27','26','27','26','28','37','28','28','28','35','28','27','28','26','28','26','30','27','30','28','25','26','28','35','29','27','27','30','24','25','29','27','33','30','30','25','26','30','32','26','30','30','I wont','25','27','27','25','27','27','32','26','25','never','28','33','28','35','25','30','29','30','31','28','28','30','40','30','28','30','27','by 30','28','27','28','30-35','35','30','30','never','30','35','28','31','30','27','33','32','27','27','26','N/A','25','26','29','28','34','26','24','28','30','120','25','33','27','28','32','30','26','30','30','28','27','27','27','27','27','27','28','30','30','30','28','30','28','30','30','28','28','30','27','30','28','25','never','69','28','28','33','30','28','28','26','30','26','27','30','25','Never','27','27','25')
```

#### Cleaning data
Remove the elements that are not integers - such as ‘probably never’, ‘30+’, etc. Convert the reamining elements to integer. What is the length of the new vector?

```{r}
new_vector <- as.integer(exp_marriage_age)
numeric_values <- new_vector[!is.na(new_vector)]
length(numeric_values)
```


#### Capping unreasonably high values
Cap the values greater than 80 to 80, in the clean vector obtained above. What is the mean age when people expect to marry in the new vector?

```{r}
numeric_values[numeric_values > 80] <- 80
mean(numeric_values)
```

#### People marrying at 30 or more
Determine the percentage of people who expect to marry at an age of 30 or more.
```{r}
sum(numeric_values >= 30) / length(numeric_values)
```

### The `sapply()` function

The `sapply()` function is used to apply a function on all the elements of a list, atomic vector or matrix.

For example, consider the vector below:

```{r}
vec <- 1:6
vec
```

Suppose, we wish to square each element of the vector. We can use the `sapply()` function as below:

```{r}
sapply(vec, FUN = function(x) x**2)
```


### Practice exercise 2

Write a function that identifies if a word is a [palindrome](https://en.wikipedia.org/wiki/Palindrome) *(A palindrome is a word that reads the same both backwards and forwards, for example, peep, rotator, madam, etc.)*. Apply the function to the vector of words below to count the number of palindrome words.

```{r}
words_vec <- c('fat', 'civic', 'radar', 'mountain', 'noon','papa')
```

```{r}
palindrome <- function(word) {
  for (i in 1:as.integer(nchar(word)/2)) {
    if (substr(word, i, i) != substr(word, nchar(word) - (i-1), nchar(word) - (i-1))) {
      return(FALSE)
    }
  }
  return(TRUE)
}
sum(sapply(words_vec, palindrome))
```


## Matrix

Matrices are two-dimensional arrays. The in-built function `matrix()` is used to define a matrix. An atomic vector can be organized as a matrix by specifying the number of rows and columns.

For example, let us define a 2x3 matrix (2 rows and 3 columns) consisting of consecutive integers fro1 1 to 6.

```{r}
mat <- matrix(1:6, 2, 3)
mat
```

Note that the integers fill up column-wise in the matrix. If we wish to fill-up the matrix by row, we can use the `byrow` argument.

```{r}
mat <- matrix(1:6, 2, 3, byrow = TRUE)
mat
```

The functions `nrow()` and `ncol()` can be used to get the number of rows and columns of the matrix respectively.

```{r}
nrow(mat)
```

```{r}
ncol(mat)
```

Matrices can be sliced using the indices of row and column separated by a `,` in box brackets. Suppose we wish to get the element in the $2^{nd}$ row and $3^{rd}$ column of the matrix:

```{r}
mat[2, 3]
```

For selecting all rows or columns of a matrix, the index for the row/column can be left blank. Suppose we wish to get all the elements of the $1^{st}$ of the matrix:

```{r}
mat[1, ]
```

Row and columns of the matrix can be sliced using the `:` operator. Suppose we want to select a sub-matrix that has elements in the first two rows and columns 2 and 3 of the matrix `mat`:

```{r}
mat[1:2, 2:3]
```

Element-wise arithmetic operations can be performed between 2 matrices of the same shape.

```{r}
mat1 <- matrix(1:6, 2, 3)
mat2 <- matrix(c(9, 2, 6, 5, 1, 0), 2, 3)
mat1 + mat2
```

```{r}
mat1 - mat2
```

Suppose we need to sum up all the rows of the matrix. We can do it using a `for` loop as follows:

```{r}
row_sum <- c(0,0)
for (i in 1:nrow(mat)) {
  for (j in 1:ncol(mat)) {
    row_sum[i] <- row_sum[i] + mat[i, j]
  }
}
row_sum
```

Observe that in the above `for` loop, elements of each row are added one at a time. We can add all the elements of a row simultaneously using the `sum()` function. This will reduce a `for` loop from the above code:

```{r}
row_sum <- c(0,0)
for (i in 1:nrow(mat)){
  row_sum[i] <- sum(mat[i,])
}
row_sum
```

In the above code, we sum up all the elements of the row simultaneously. However, we still need to sum up the elements of each row one at a time.

### The `apply()` function

The `apply()` function can be used to apply a function on each row or column of a matrix. Thus, this function helps avoid the user to write a `for()` loop in R to iterate over all the rows and columns of the matrix. Note that each row / column of a matrix is an atomic vector. Thus, vectorized computations can be performed within the function, resulting in efficient computations.

Note that the apply functions use a `for()` loop under-the-hood, and thus the function will be applied sequentially on each row / column of the matrix. However, as the implementation of the `for()` loop is in C, it is likely to be faster than a `for()` loop in R.

Let us use the `apply()` function to sum up all the rows of the matrix `mat`.

```{r}
apply(mat, 1, sum)
```

Let us compare the time taken to sum up rows of a matrix using a `for` loop with the time taken using the `apply()` function.

```{r}
options(digits.secs = 6)
start.time <- Sys.time()
row_sum<-c(0, 0)
for (i in 1:nrow(mat)){
  row_sum[i] <- sum(mat[i,])
}
row_sum
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```


```{r}
start.time <- Sys.time()
apply(mat, 1, sum)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```

Observe that the `apply()` function takes much lesser time to sum up all the rows of the matrix as compared to the `for` loop.

Recall the earlier example where we computed year's in which the increase in GDP per capita was more than 10%. Let us use matrices to solve the problem. We'll also compare the time it takes using a matrix with the time it takes using `for` loops.

```{r}
start.time <- Sys.time()

#Let the first column of the matrix be the GDP of all the years except 1960, and the second column be the GDP of all the years except 2021.
GDP_mat <- matrix(c(G[-1], G[-length(G)]), length(G) - 1, 2)

#The percent increase in GDP can be computed by performing computations using the 2 columns of the matrix
inc <- (GDP_mat[,1] - GDP_mat[,2]) / GDP_mat[,2]
years <- 1961:2021
years <- years[inc > 0.1]
years
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```

Without matrices, the time taken to perform the same computation is  measured with the code below.

```{r}
start.time <- Sys.time()
years <- c()
for (i in 1:(length(G) - 1)) {
  diff = (G[i+1] - G[i]) / G[i]
  if (diff > 0.1) years <- c(years, 1960 + i)
}
print(years)
#print(proc.time()[3]-start_time)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
```

Observe that matrices reduce the execution time of the code as computations are performed simultaneously, in contrast to a `for` loop where computations are performed one at a time.

Sometimes, the computations on rows / columns of a matrix are not straighforward and we may need to use the `apply()` function to apply a function on each row / column of a matrix.

**Example:** Find the maximum GDP per capita of the US in each of the 5 year periods starting from 1961-1965, and upto 2015-2020.

```{r}
GDP_5year <- matrix(G[-c(1, length(G))], 12, 5, byrow = TRUE)
GDP_max_5year <- apply(GDP_5year, 1, max)
```

In the above code, we applied the in-built function `max` on all the rows. Sometimes, an in-built function may not be available for the computations to be performed. In such as case, we can write our own user-defined function within the `apply()` function. See the example below.

**Example:** Find the range (max-min) of GDP per capita of the US in each of the 5 year periods starting from 1961-1965, and upto 2015-2020.

```{r}
GDP_5year <- matrix(G[-c(1, length(G))], 12, 5, byrow = TRUE)
GDP_range_5year <- apply(GDP_5year, 1, function(x) max(x) - min(x))
GDP_range_5year
```

In the code above we applied a user-defined function on each row of the matrix. However, if the function has multiple lines, it may be inconvenient to write the function within the `apply()` function. In that case, we can define the function outside the `apply()` function.

**Example:** Find the five year periods  starting from 1961-1965, and upto 2016-2020, during which the GDP per capita decreased as compared to the previous year.


```{r}
GDP_inc <- function (GDP_5yr) {
  dec <- 0
  for (i in 1:4) {
    if(GDP_5yr[i+1] < GDP_5yr[i]) dec <- 1
  }
  return(dec)
}

GDP_5year_mat <- matrix(G[-c(1,length(G))], 12, 5, byrow = TRUE)
years_inc_dec <- apply(GDP_5year_mat, 1, GDP_inc)
five_year_periods <- seq(1960, 2015, 5)
print("Five year periods in which the GDP per capita decreased are those starting from the years:")
print(five_year_periods[years_inc_dec == 1] + 1)
```

The 5 year periods during which the GDP per capita decreased as compared to the previous year are 2006-2010, and 2016-2020.

### Practice exercise 3

Find the 5 year period in which the difference of the maximum GDP per capita and the minimum GDP per capita as a percentage of the minimum GDP per capita was the highest.

**Solution:**

```{r}
five_year_periods[which.max(apply(GDP_5year_mat, 1, function(x) (max(x) - min(x)) / min(x)))] + 1
```

```{r}
print("During 1976-1980 the difference of the maximum GDP per capita and the minimum GDP per capita as a percentage of the minimum GDP per capita was the highest.")
```

### Practice exercise 4

The object `country_names` is an atomic vector consisting of country names. The object `coordinates_capital_cities` is a matrix consisting of the latitude-longitude pair of the capital city of the respective country. The order of countries in `country_names` is the same as the order in which their capital city coordinates (latitude-longitude) appear in the matrix `coordinates_capital_cities`.

```{r}
#| echo: false
#| eval: false
setwd("C:\\Users\\akl0407\\Desktop\\STAT303-1\\Quarto Book\\Intro_to_programming_for_data_sci\\Datasets")
```

Download the file *capital_cities.csv* from [here](https://nuwildcat-my.sharepoint.com/:f:/g/personal/akl0407_ads_northwestern_edu/EkQnSWXXMm5Ak8qCCWx2fVgBz3s-ksJ7xNgbzn-mmf0IFw?e=pebtNE
). Make sure the file is in your current working directory. Execute the following code to obtain the objects `coordinates_capital_cities` and `country_names`.

```{r}
#| eval: false
capital_cities <- read.csv('capital_cities.csv')
coordinates_capital_cities <- as.matrix(capital_cities[,c(3, 4)])
country_names <- capital_cities[,1]
```

#### Country with capital closest to DC
Print the name and coordinates of the country with the capital city closest to the US capital - Washington DC.

Note that:

1. The *Country Name* for US is given as *United States* in the data.
2. The 'closeness' of capital cities from the US capital is based on the Euclidean distance of their coordinates to those of the US capital.

**Hint:**

1. Get the coordinates of Washington DC from `coordinates_capital_cities`. The row that contains the coordinates of DC will have the same index as `United States` has in the vector `country_names`

2. Create a matrix that has coordinates of Washington DC in each row, and has the same number of rows as the matrix `coordinates_capital_cities`.

3. Subtract `coordinates_capital_cities` from the matrix created in (2). Element-wise subtraction will occur between the matrices.

4. Use the `apply()` function on the matrix obtained above to find the Euclidean distance of Washington DC from the rest of the capital cities.

5. Using the distances obtained above, find the country that has the closest capital to DC.


```{r}
#| eval: false
#| echo: false
US_index <- which(country_names == 'United States')
DC_coord <- coordinates_capital_cities[US_index,]
DC_coord_matrix <- matrix(rep(DC_coord, 245), 245, 2, byrow = TRUE)
DC_minus_capital_cities <- DC_coord_matrix - coordinates_capital_cities
distance_of_DC_from_other_cities <- apply(DC_minus_capital_cities**2, 1, sum)
distance_of_DC_from_other_cities[distance_of_DC_from_other_cities == 0] <- 999999
country_names[which.min(distance_of_DC_from_other_cities)]
coordinates_capital_cities[which.min(distance_of_DC_from_other_cities),]
top_10_coords <- coordinates_capital_cities[which.min(distance_of_DC_from_other_cities),]
```

#### Top 10 countries closest to DC

1. Print the names of the countries of the top 10 capital cities closest to the US capital - Washington DC.

2. Create and print a matrix containing the coordinates of the top 10 capital cities closest to Washington DC.


```{r}
#| eval: false
US_index = which(country_names == 'United States')
dc_coord <- coordinates_capital_cities[US_index,]
distances_to_DC <- apply(coordinates_capital_cities, 1,
                  function(city_coord) sqrt(sum((city_coord - dc_coord)**2)))
num_of_countries <- length(country_names)
distances_to_DC_matrix <- cbind(1:num_of_countries, distances_to_DC)
sorted <- distances_to_DC_matrix[order(distances_to_DC_matrix[,2]),]
```

Top 10 countries with capitals closest to Washington DC are the following:

```{r}
#| eval: false
country_names[sorted[3:12, 1]]
```

The coordinates of the top 10 capital cities closest to Washington DC are:

```{r}
#| eval: false
coordinates_capital_cities[sorted[3:12, 1],]
```


```{r}
#| eval: false
#| echo: false
for (i in 1:9) {
  distance_of_DC_from_other_cities[which.min(distance_of_DC_from_other_cities)] <- 999999
  print(country_names[which.min(distance_of_DC_from_other_cities)])
  coord_closest_to_DC <- coordinates_capital_cities[which.min(distance_of_DC_from_other_cities),]
  top_10_coords <- rbind(top_10_coords,coord_closest_to_DC)
}
```

## Lists

Atomic vectors and matrices are quite useful in R. However, a constraint with them is that they can only contain objects of the same datatype. For example, an atomic vector can contain all numeric objects, all character objects, or all logical objects, but not a mixture of multiple types of objects. Thus, there arises a need for a `list` data structure that can store objects of multiple datatypes.

A list can be defined using the `list()` function. For example, consider the list below:

```{r}
list_ex <- list(1, "apple", TRUE, list("another list", TRUE))
```

The list `list_ex` consists of objects of mutiple datatypes. The length of the list can be obtained using the `length()`function:

```{r}
length(list_ex)
```

A list is an ordered collection of objects. Each object of the list is associated with an index that corresponds to its order of occurrence in the list.

A single element can be sliced from the list by specifying its index within the `[[]]` operator. Let us slice the $2^{nd}$ element of the list `list_ex`:

```{r}
list_ex[[2]]
```

Multiple elements can be sliced from the list by specifying the indices as an atomic vector within the `[]` operator. Let us slice the $1^{st}$ and $3^{rd}$ elements from the list `list_ex`:

```{r}
list_ex[c(1,3)]
```

Elements of a list can be named using the `names()` function. Let us name the elements of `list_ex`:

```{r}
names(list_ex) <- c("Name1", "second_name", "3rd_element", "Number 4")
```

A single element can be sliced from the list using the name of the element with the `$` operator. Let us slice the element named as `second_name` from the list `list_ex`:

```{r}
list_ex$second_name
```

Note that if the name of the element does not begin with a letter or has special characters such as a space, then it should be specified within single quotes after the `$` operator. For example, let us slice the element named as `3rd_element` from the list `list_ex`:

```{r}
list_ex$`3rd_element`
```

Names of elements of a list can also be specified while defining the list, as in the example below:

```{r}
list_ex_with_names <- list(movie = 'The Dark Knight', IMDB_rating = 9)
```

A list can be converted to an atomic vector using the `unlist()` function. For example, let us convert the list `list_ex` to a vector:

```{r}
unlist(list_ex)
```

Since a vector can contain objects of a single datatype, note that all objects have been converted to the `character` datatype in the vector above.

### Practice exercise 5

Download the dataset *movies.json*. Execute the following code to read the data into the object `movies`:

```{r}
#| eval: false
library(rjson)
movies<-fromJSON(file = 'movies.json')
```

####
What is the datatype of the object `movies`?

```{r}
#| eval: false
class(movies)
```

The datatype of the object `movies` is list.

####
Count the number movies having a negative profit, i.e., their production budget is higher than their worldwide gross.

Ignore the movies that:

1. Have missing values of production budget or worldwide gross. Use the `is.null()` function to identify missing or NULL values.

2. Have a zero worldwide gross *(A zero worldwide gross is probably an incorrect value)*.

```{r}
#| eval: false
negative_profit <- c()
count <- 0
for (i in 1:length(movies)) {
  pb <- movies[[i]]$`Production Budget`
  wg <- movies[[i]]$`Worldwide Gross`
  if (!(is.null(pb) | is.null(wg))) {
    if (pb > wg & wg > 0) {
      count <- count + 1
    }
  }
}
print(paste("Number of movies with negative profit =", count))
```

### The `lapply()` function

The `lapply()` function is used to apply a function on each element of a list, and returns a list of the same length.

For example, consider the list below:

```{r}
list_ex <- list(1, "apple", TRUE, list("another list", TRUE))
```

Let us use the `lapply()` function to find the class of each element of the list `list_ex`:

```{r}
lapply(list_ex, function(x) class(x))
```

### Practice exercise 6

Solve [practice exercise 5](https://nustat.github.io/Intro_to_programming_for_data_sci/robjects.html#practice-exercise-5) without using a `for` loop. Use the `lapply()` function.

```{r}
#| eval: false
profit <- lapply(movies, function(x) x$`Worldwide Gross`-x$`Production Budget`)
positive_wg <- lapply(movies, function(x) x$`Worldwide Gross` > 0)
sum(profit < 0 & positive_wg > 0, na.rm = TRUE)
```