openSIMD_site/02-Calculating-Domains.Rmd at master · TheDataLabScotland/openSIMD_site · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
# Calculating Domains


```{r setup, include=FALSE}
knitr::opts_chunk$set(
	message = FALSE,
	warning = TRUE,
	echo = TRUE
)
```

In this section, I will explain the code used to calculate the individual domain scores.

## Setting up

Here is a complete list of the files in `openSIMD_analysis` that we need for **openSIMD**:

- `scripts/calculations/domains.R` to calculate the domains
- `scripts/calculations/openSIMD.R` to calculate SIMD
- `scripts/utils/helpers.R`, some utility functions (see [Functions](#functions) for documentation)
- `data/SIMD16 indicator data.xlsx`
- `data/SIMD16 ranks and domain ranks.xlsx`.

If you use RStudio you can open `openSIMD_analysis.Rproj` to open the project.

Starting in `domains.R`, the first things to do are:

- load a few packages
- source the utility functions
- read in the data

You will need to make sure that the file paths in your code correspond to where your files are. If you are using RStudio and you have opened the `openSIMD_analysis.Rproj` then the directories of interest will be `scripts/utils` and `data`.

You may experience an error when reading the .xlsx file files with `read_excel`. If this happens you have two options. Either you can open the file in Excel, save it, close it, and try again. The file should now read into R without an error. Alternatively, you can save the file as a .csv file (make sure you select the correct sheet) and then read it into R using the `read.csv` function.

```{r load_resources}
library(readxl)
library(dplyr)
source("../openSIMD_analysis/scripts/utils/helpers.R")
d <- read_excel("../openSIMD_analysis/data/SIMD16 indicator data.xlsx", sheet = 3, na = "*")
```

The data here contains the published indicators, in this case from SIMD 2016.

```{r}
str(d)
```

## The recipe

For each domain, the process has several steps, many of which are repeated across domains. To make this recipe easier to follow, I have defined some **openSIMD** utility functions in `scripts/utils/helpers.R`, see [Functions](#functions). The steps of the recipe and the associated functions are as follows:

- Select the indicators (columns) that are relevant to that domain, using the function `dplyr::select`
- Calculate normalised ranks for each indicator, using the function [`normalScores`](#normalScores)
- Replace missing values, using the function [`replaceMissing`](#replaceMissing)
- Derive the factor analysis weights for each indicator, using the function [`getFAWeights`](#getFAWeights)
- Combine the normalised ranks and weights to generate the indicator score, using the function [`combineWeightsAndNorms`](#combineWeightsAndNorms)
- Rank the indicator score, using the function `base::rank`.

Not all steps are used for every domain.

## dplyr

Throughout this project, I make use of the tools in the `dplyr` package for data manipulation including the `%>%` notation for forwards piping. I won't introduce these tools but if they are unfamiliar, I recommend reading [this article](https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html). See the section on chaining for an explanation of the `%>%` pipe.

## An example: Education

As an example of the process, here we will calculate the domain rank for the education domain. In this first chunk of code, I will create the normalised ranks for the education domain as follows:

```{r}
normalised_education <- d %>% # start with the raw data
  select(Attendance, Attainment, Noquals, NEET, HESA) %>% # select relevant columns
  mutate(Attendance = normalScores(Attendance, forwards = FALSE)) %>% # replace each column with its normalised values
  mutate(Attainment = normalScores(Attainment, forwards = FALSE)) %>%
  mutate(Noquals    = normalScores(Noquals, forwards = TRUE)) %>%
  mutate(NEET       = normalScores(NEET, forwards = TRUE)) %>%
  mutate(HESA       = normalScores(HESA, forwards = FALSE)) %>%
  mutate_all(funs(replaceMissing)) # replace missing values
```

**Note:** you may see a warning here because [`normalScores`](#normalScores) can generate missing values from missing values. This is fine, these will be replaced with 0, when you call [`replaceMissing`](#replaceMissing).

The only decisions to make are (a) which columns to select using `select` and (b) which orientation to rank (and then normalise) each indicator. The orientation is determined by the `forwards` argument to `normalScores`, see [`normalScores`](#normalScores) and [Variations](#Variations) for further information.

When combining the indicators, we need to apply a different weight to each. The weights are derived through factor analysis of the normalised indicator scores, and the proportional loadings on factor 1 serve as the weightings. We extract the loadings using the [`getFAWeights`](#getFAWeights) function as follows:

```{r}
education_weights <- getFAWeights(normalised_education)
```

Now that we have the normalised indicator scores and weights, we can combine them with the utility function [`combineWeightsAndNorms`](#combineWeightsAndNorms). Each normalised indicator variable is multiplied by its weight derived from factor analysis, as follows:

```{r}
education_score <- combineWeightsAndNorms(education_weights, normalised_education)
```

Finally we rank these weighted scores to generate the domain rank (1 = most deprived).

```{r}
education_rank <- rank(-education_score)
```

## Variations {#Variations}

The remaining domains are calculated in a similar way with some variations. Rather than explaining each domain, I will explain the possible variations.

For the housing domain, the sum of the overcrowding rate and non-central heating rate is ranked.

For the crime, income and employment domains, we simply use the published ranks. The reason for this is that the published indicator data for these domains is rounded, whereas the published domain ranks are based on unrounded data and therefore more precise.

In the education example above, when applying the [`normalScores`](#normalScores) function, we needed to pay attention to the `forwards` argument to orient the variables (decide whether a high value was good or bad). In the other domains this is not the case and each indicator can take the default value `forwards = TRUE` (high value = deprived). This means we can use the `dplyr::mutate_all()` function instead of mutating each variable independently. In addition, if indicator column names have something in common, we can select them with `dplyr::select(contains("some_common_text"))`.

The access domain is unique in that it has two sub-domains (drive time and public transport time) which are processed separately in the normal way (normalise -> weight -> rank). Then, each sub-domain is exponentially transformed (covered in the next section on [calculating SIMD](#simd)) and the resulting scores are summed in a 2:1 ratio before the final ranking for the domain.

## Re-assigning ranks

We have included some functionality to manually re-assign ranks to allow for certain exceptions. This is done via the [`reassignRank`](#reassignRank) function.