openSIMD_site/04-Functions.Rmd at master · TheDataLabScotland/openSIMD_site · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
output:
  pdf_document: default
  html_document: default
---
# Functions {#functions}

Below is the documentation of the functions developed for this project, which are used throughout the **openSIMD** procedure.

## normalScores {#normalScores}

This function calculates the normal scores for each indicator. The normal score is defined as follows:

\begin{equation*}
y_{i} = \phi^{-1}\left(\frac{r_{i}}{n + 1}\right)
\end{equation*}

where: $\phi^{-1}$ is the inverse cumulative normal (probit) function, $r_{i}$ is the rank of the i'th observation and $n$ is the number of non-missing observations for the ranking variable. This is the inverse cumulative normal probability density of proportional ranks. The resulting variable should appear normally distributed regardless of the input data. We translated this approach using the [SAS documentation](http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146840.htm) as a guide resulting in the following R function.

```{r}
normalScores <- function(
  v,                  # a numeric vector as the input variable
  ties = "average",   # passed to ties.method argument in rank()
  forwards = TRUE     # smallest numerical value on left? default is TRUE
) {

  r <- rank(v, ties.method = ties)
  n <- length(na.omit(v))

  rn <- r / (n + 1)

  y <- qnorm(rn, mean = 0, sd = 1, lower.tail = forwards)

  return(y)

}
```

The function takes a numeric vector as its input `v`. It first ranks this input `r` and then calculates the proportional rank `rn`. The final step is to apply the cumulative normal probability using the `qnorm` function. The return value is a numeric vector of the same length as the input `v`.

## replaceMissing {#replaceMissing}

This function replaces missing values, once normalised indicator scores have been calculated. The function finds missing values (as well as `Inf` and `-Inf`) in a vector and replaces them with `0`. This is used in **openSIMD** where a data zone has zero population, or has a missing value for an indicator, and we want it to sit in the centre of the distribution and so we assign a value of 0.

```{r}
replaceMissing <- function(v) replace(v, is.na(v) | v == Inf | v == -Inf, 0)
```

The function takes a numeric vector `v` and returns a vector of equal length with missing values replaced with 0.

## getFAWeights {#getFAWeights}

This function performs a factor analysis using the `psych::fa` function. It then extracts the loadings on the first resulting factor, converts them to proportions of the sum of loadings, the weights, and then returns them as individual elements of a list. This is designed to be equivalent to the SAS procedure in previous SIMD calculations, however the results are only comparable up to two decimal places, likely due to differences in the implementation of factor analysis in the two packages (see [section on Quality Assurance](#QA)).

```{r}
getFAWeights <- function(dat, ...) {

  fact <- psych::fa(dat, nfactors = 1, fm = "ml", rotate = "none", ...)

  f1_scores <- as.data.frame(fact$weights) %>% select(ML1)

  f1_weights <- f1_scores / sum(f1_scores)

  # This is just to make each weight an individual element of a list
  # For compatibility with purrr::map2() in the next step, combineWeightsAndNorms()
  return(lapply(seq_along(f1_weights$ML1), function(i) f1_weights$ML1[i]))

  }
```

The function takes a data.frame `dat` which contains all of the variables for factor analysis. The return value is a list with individual elements corresponding to the weights of variables in the same order as the collumns in the input data. A list is returned here for compatibility with the next function `combineWeightsAndNorms`; however, this can be easily converted to a vector with `unlist`.

## combineWeightsAndNorms {#combineWeightsAndNorms}

This function takes the normalised indicator scores and the weights derived from factor analysis, multiplies them out and then takes the sum of these weighted indicator scores to get the final score for that domain. Weights and indicators need to be in the same order.

```{r}
combineWeightsAndNorms <- function(weights, norms) {

  combined <- purrr::map2(weights, norms, ~ .x * .y)

  combined %>% data.frame %>% rowSums

}
```

The function takes a list of weights (generated by `getFAWeights`) and a data.frame of normalised scores. The function
returns a numeric vector containing the combined domain score.

## expoTransform {#expoTransform}

This function exponentially transforms the (inverted) domain ranks.

```{r}
expoTransform <- function(ranks) {

  prop_ranks <- ranks / max(ranks)

  expo <- -23 * log(1 - prop_ranks * (1 - exp( -100 / 23)))

  return(expo)
}
```

The function takes a numeric vector of the (inverted) ranks and returns a numeric vector of equal length containing the transformed values.

## reassignRank {#reassignRank}

This function manually reassigns ranks to individual data zones. It can be used, for example, when the domain ranking needs to reflect a population of zero in a data zone.

```{r}
reassignRank <- function(data, domain, data_zone, end = "max", offset = 0) {

  if(end == "max") {
    data[data$data_zone == data_zone, domain] <-
      max(data[, domain], na.rm = TRUE) - (offset + 0.1)
  }else
    if(end == "min") {
      data[data$data_zone == data_zone, domain] <-
        min(data[, domain], na.rm = TRUE) + (offset + 0.1)
    }

  data[, domain] <- rank(data[, domain])

  return(data)
}
```

The function takes a data.frame containing a column named 'data_zone' and a ranked variable. The function changes the rank of the data zone in question (with an optional offset), and it then re-ranks the whole variable. The function returns the corrected data.frame.