CARTO-SDSC2022/Workbook.Rmd at main · patrickballantyne/CARTO-SDSC2022 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
---
title: "Retail and Geodemographics: A Spatial Data Science Approach"
author: "Patrick Ballantyne"
date: '2022-09-23'
output:
  rmdformats::downcute:
    code_folding: show
    toc_depth: 2
    number_sections: true
    self_contained: true
    thumbnails: false
    lightbox: true
    default_style: "dark"
    downcute: "default"
---

# Libraries

Libraries are an essential part of the R workflow, giving you all the necessary tools to complete the analysis. Different people use different libraries for certain aspects of analysis, but these are my favourites:

```{r message = FALSE, warning = FALSE}
library(tidyverse) # Data wrangling
library(sf) # (Spatial) data wrangling and GIS
library(tidycensus) # Census API
library(tmap) # Interactive mapping
tmap_mode("view")
```

```{r eval=FALSE, include=FALSE}
library(prettydoc)
library(rmdformats)
library(ggbeeswarm)
```
---

# Workshop Data

Today we will be working with three main types of data:

- Store data, supplied by [SafeGraph](https://www.safegraph.com/).
- Geodemographic classification, from the published paper by [Seth Spielman and Alex Singleton](https://www.tandfonline.com/doi/full/10.1080/00045608.2015.1052335).
- Population data, from the US Census Bureau API, utilising the [tidycensus R package](https://walker-data.com/tidycensus/).

---

## Store data - SafeGraph places

We have prepared a small subset of the SafeGraph places database, specifically those located in the Atlanta Metropolitan Area, and those affiliated to larger brands, i.e. non-independent stores. Let's loads these in:

```{r message = FALSE }
places <- st_read("Data/places.gpkg", quiet = TRUE)
```

Look at the attributes:

```{r}
glimpse(places)
```

Let's take a look at the distribution of grocery stores:

```{r}
places %>%
  filter(top_category == "Grocery Stores") %>%
  tm_shape() +
    tm_dots(col = "brands", title = "Brand", size = 0.05) +
  tm_layout(legend.outside = TRUE)

```

There are lots of other categories of stores, non-branded stores and data for other areas of the U.S., Canada, U.K. and beyond - SafeGraph places really is a fantastic resource!!

---

## Geodemographic classification

We have prepared a short subset of the US Geodemographic Classification also for Atlanta Metropolitan area. Let's load it in:

```{r}
geodemo <- st_read("Data/geodemographic.gpkg", quiet = TRUE)
```

Look at the attributes:

```{r}
glimpse(geodemo)
```

The 'TractGroup' column is the one we are interested in; every census tract in the US is assigned to one of 10 groups, based on the geodemographic characteristics of the area in question.

Let's look at the distribution of the 10 groups within Atlanta:

```{r warning = FALSE}
tm_shape(geodemo) +
  tm_fill(col = "TractGroup", alpha = 0.5, palette = "Pastel1") +
  tm_borders(col = "black", lwd = 0.25) +
  tm_layout(legend.outside = TRUE)
```

So there are notable differences in the geographic distribution of the groups across Atlanta.

Reminder: if you want to check out what the geodemographics of your local neighbourhood area, you can do that [here](https://tinyurl.com/us-geo).

---

## Population - US Census Bureau (via tidycensus)

Finally, we are also going to be integrating some population data (also at Tract level). To do this we are going to be using a fantastic package called ['tidycensus'](https://walker-data.com/tidycensus/) which is used to directly query data from the US Census Bureau API.

You'll need a US Census Bureau API key if you haven't got one already, which you can get [HERE](https://api.census.gov/data/key_signup.html).

```{r warning = FALSE}
## Remember to get an API key for the US Census Bureau
census_api_key("e0f3cdfa0e1e13ab3a4d9dc32c67984d0be5523b")
```

So, tidycensus allows us to query data from the US Census Bureau, in this case we are going to use data from the 2010 census to obtain population counts at tract level:

```{r message= FALSE}
pop <- get_decennial(geography = "Tract", variables = "P001001",
                     state = "GA", year = 2010)
```

Look at the attributes:

```{r}
glimpse(pop)
```

Here, the 'value' column is the one we are interested in; total population in each census tract. Let's merge this onto the geodemographic classification:

```{r}
gd_pop <- merge(geodemo, pop, by = "GEOID")
```

```{r}
glimpse(gd_pop)
```

Now that we have the population data converted to a spatial form, let's take a look at the population distribution of Atlanta MSA:

```{r}
tm_shape(gd_pop) +
  tm_fill(col = "value", title = "TotalPopulation", alpha = 0.5, style = "jenks") +
  tm_borders(col = "black", lwd = 0.25) +
  tm_layout(legend.outside = TRUE)
```

---

# Store Catchments

Ok, now that we are familiar with the three main datasets, we can actually get to some 'spatial data science'. In this workshop we are going to be looking at the geodemographic profiles of two very different retailers:

**Whole Foods Market:**

*"The stores offer a wide variety of organic and natural foods, as well as a large selection of prepared foods... also has a number of unique features, such as a cheese shop, a wine bar and a coffee bar. Shopping at this luxury grocery store might cost shoppers more than they are willing to spend. According to a survey conducted by Business Insider, Whole Foods is 15% more expensive than general grocery stores such as Walmart and Kroger"* (source: [gobankingrates.com](https://www.gobankingrates.com/money/wealth/most-luxurious-grocery-stores-in-the-us/)).

**Ollie's Bargain Outlet:**

*"Ollie's Bargain Outlet is an American chain of discount closeout retailers...  Its selection of merchandise comprises a variety of discounted household goods, apparel, pet supplies, kitchen pantry staples, and seasonal products (holiday, gardening, patio, pool and beach supplies); a majority of these items are unsold or overstocked merchandise that is purchased in bulk from other retailers and sold at discounted prices"* (source: [wiki](https://en.wikipedia.org/wiki/Ollie%27s_Bargain_Outlet)).

Let's get the relevant places:

```{r}
wholefoods <- places %>%
  filter(brands == "Whole Foods Market")

ollies <- places %>%
  filter(brands == "Ollie's Bargain Outlet")
```


Let's visualise these:

```{r}
q1 <- tm_shape(wholefoods) +
  tm_dots(col = "blue", size = 0.1, alpha = 0.5) +
  tm_text("brands", size = 1, just = "top") +
  tm_shape(ollies) +
  tm_dots(col = "orange", size = 0.1, alpha = 0.5) +
  tm_text("brands", size = 1, just = "top")

q1
```

And what does this look like with the geodemographic classification underneath:

```{r warning = FALSE}
tm_shape(gd_pop) +
  tm_fill(col = "TractGroup", alpha = 0.75, palette = "Pastel1") +
  tm_borders(col = "black", lwd = 0.1) +
  tm_layout(legend.outside = TRUE) +
  q1
```


So there are some pretty obvious locational differences between them. But how can we assess this - **catchments**. Store catchments help us to better understand the characteristics of consumers that are likely to use these stores.

Let's build a 2.5km buffer for whole foods stores:

```{r}
wholefoods_catch <- st_buffer(wholefoods, 2500)
ollies_catch <- st_buffer(ollies, 2500)
```

What do these look like:

```{r}
q2 <-
  tm_shape(wholefoods_catch) +
  tm_polygons(col = "red", alpha = 0.3) +
  tm_shape(ollies_catch) +
  tm_polygons(col = "red", alpha = 0.3) +
  q1
q2
```

And with the geodemographic:

```{r warning = FALSE}
tm_shape(gd_pop) +
  tm_fill(col = "TractGroup", alpha = 0.5, palette = "Pastel1") +
  tm_borders(col = "black", lwd = 0.1) +
  tm_layout(legend.outside = TRUE) +
  q2
```

So there are some obvious differences between the general location and geodemographic profile of Whole Foods vs Ollie's Bargain Outlet. So, how do we quantify this: **catchment profiling**.

---

# Catchment Profiling

Catchment profiling is not a new approach, with retailers commonly using it to better understand the characteristics of people living in the catchments of their stores. However, what we want to introduce you to is a method through which to make this much simpler, accounting for a vast array of characteristics all at once: **catchment geodemographic profiling**.

## Catchment Geodemographics (Whole Foods Market)

Ok, so we have already constructed catchment for each of these stores, enabling us to identify people living nearby that are more likely to shop there, we can profile exactly who these people are.

Let's do a spatial join, allowing us to see what tracts (and geodemographics) fall within each store catchment:

```{r}
wholefoods_gd <- st_join(wholefoods_catch, gd_pop)
```

Let's take a look at the output:

```{r}
head(wholefoods_gd)
```
What we want to do for each store is calculate the total catchment population in the catchment, and the total population disaggregated by geodemographic group. This is easy with tidyverse.

First, get the total population in each catchment:

```{r warning = FALSE, message = FALSE}
wholefood_profile <- wholefoods_gd %>%
  as.data.frame() %>%
  select(-c(geom)) %>%
  group_by(location_name, placekey) %>%
  summarise(catch_pop = sum(value))
```

Then get the total population by geodemographic group in each catchment:

```{r warning = FALSE, message = FALSE}
wholefood_gd_profile <- wholefoods_gd %>%
  as.data.frame() %>%
  select(-c(geom)) %>%
  group_by(location_name, placekey, TractGroup) %>%
  summarise(catch_gd_pop = sum(value))
```

Bring the two values together:

```{r}
wholefood_profile <- merge(wholefood_profile, wholefood_gd_profile, by = c("location_name", "placekey"), all.x = TRUE)
```

```{r}
head(wholefood_profile)
```

Calculate proportion of catchment occupied by each geodemographic group:

```{r}
wholefood_profile_final <- wholefood_profile %>%
  mutate(gd_prop = (catch_gd_pop / catch_pop) * 100) %>%
  select(location_name, placekey, TractGroup, gd_prop)
```

Let's visualise the breakdown of different groups between stores:

```{r}
wholefood_profile_final %>%
  complete(TractGroup, location_name, placekey, fill=list(gd_prop=0)) %>%
  ggplot(aes(fill = TractGroup, x = placekey, y = gd_prop)) +
  xlab("Store") +
  ylab("Catchment Proportion (%)") +
  scale_fill_brewer(palette = "Pastel1") +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_blank())
```

## Catchment Geodemographics (Ollie's Bargain Outlet)

Let's repeat those steps for Ollie's Bargain Outlet:

```{r warning = FALSE}
ollies_gd <- st_join(ollies_catch, gd_pop)
ollies_profile <- ollies_gd %>%
  as.data.frame() %>%
  select(-c(geom)) %>%
  group_by(location_name, placekey) %>%
  summarise(catch_pop = sum(value))
ollies_gd_profile <- ollies_gd %>%
  as.data.frame() %>%
  select(-c(geom)) %>%
  group_by(location_name, placekey, TractGroup) %>%
  summarise(catch_gd_pop = sum(value))
ollies_profile <- merge(ollies_profile, ollies_gd_profile, by = c("location_name", "placekey"), all.x = TRUE)
ollies_profile_final <- ollies_profile %>%
  mutate(gd_prop = (catch_gd_pop / catch_pop) * 100) %>%
  select(location_name, placekey, TractGroup, gd_prop)
```

And visualise the output:

```{r}
ollies_profile_final %>%
  complete(TractGroup, location_name, placekey, fill=list(gd_prop=0)) %>%
  ggplot(aes(fill = TractGroup, x = placekey, y = gd_prop)) +
  xlab("Store") +
  ylab("Catchment Proportion (%)") +
  scale_fill_brewer(palette = "Pastel1") +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_blank())
```

---

## Catchment Income Profile

Whilst it is useful to visualise differences in geodemographic profiles of store catchments, it is perhaps more useful to provide an empirical measure that quantifies these differences.

Based on the methodology applied in [Singleton et al. (2016)](https://www.sciencedirect.com/science/article/pii/S0016718515301500), we calculate a **Catchment Income Profile** score for each store catchment, which is based on differences in income between the different groups, whilst still accounting for geodemographic differences between them.

Let's read in a file containing some information about the overall wealth of these different groups. For each geodemographic group, we have prepared a summary of the overall wealth taken straight from the pen portrait in [Spielman and Singleton (2015)](https://www.tandfonline.com/doi/full/10.1080/00045608.2015.1052335). Using these descriptions, we have assigned an arbitrary z-score (or weight) which can be used to weight these population proportions, and derive a composite index from them.

```{r}
zscores <- read.csv("Data/z-scores.csv")
```

```{r}
glimpse(zscores)
```

Merge the z-scores onto the profiles:

```{r}
wholefood_profile_final <- merge(wholefood_profile_final, zscores[, c("TractGroup", "z.score")], all.x = TRUE)
```

Weight the geodemographic population proportions with the z scores:

```{r}
wholefood_profile_final <- wholefood_profile_final %>%
  arrange(location_name, placekey) %>%
  mutate(w_gd_prop = gd_prop * z.score)
```

Take a look:

```{r}
head(wholefood_profile_final)
```

Finally, we want to calculate an aggregate score, which uses these weighted catchment proportions to derive a score that represents the *Catchment Income Profile* of the store catchments.

```{r}
wholefood_out <- wholefood_profile_final %>%
  select(location_name, placekey, w_gd_prop) %>%
  group_by(location_name, placekey) %>%
  summarise(exp = sum(w_gd_prop))
wholefood_out
```
And repeat for Ollie's:

```{r}
ollies_profile_final <- merge(ollies_profile_final, zscores[, c("TractGroup", "z.score")], all.x = TRUE)
ollies_profile_final <- ollies_profile_final %>%
  arrange(location_name, placekey) %>%
  mutate(w_gd_prop = gd_prop * z.score)
ollies_out <- ollies_profile_final %>%
  select(location_name, placekey, w_gd_prop) %>%
  group_by(location_name, placekey) %>%
  summarise(exp = sum(w_gd_prop))
ollies_out
```

Ok, let's compare the distribution of our score across the two types of stores:

```{r}
stat <- rbind(ollies_out, wholefood_out)
stat %>%
  group_by(location_name) %>%
  summarise(incomeProfile = median(exp))
```

So it's clear that in Atlanta, Whole Foods Stores are typically located in neighbourhoods with a greater exposure to higher **Catchment Income Profiles**, whereas Ollie's Bargain Outlet is typically found in neighbourhoods with lower values.

---

## Implications

So what does all of this mean? What can such insights be used for?

1) Use of Geodemographics to understand income patterns of stores, but also lots of other aspects. Say you're interested in wealthy, young, female consumers, you could re-run such analysis to account for these geodemographic differences, and quantify the exposure of your catchment.

2) Underperformance? Is your store exposure much lower than other stores in the portfolio?

3) Location analysis - using Geodemographics to support decision-making about new store locations.

---

## Extra fun - lapply and scaling analyses

What we have done in this workbook is calculate for two different brands their Income Profile and compared them directly, walking you through the code line by line as to how this is achieved. However, in practice i'm sure many of you are less interested in the actual code to do this, and more about the results and insights.

Thus, we can use a function to iterate these steps across multiple different brands, and generate faster comparisons of Income Profiles.

First, we need to build the function:

```{r}
getProfile <- function(brand) {

  ## Data
  x <- places %>%
  filter(brands == brand)
  x_catch <- st_buffer(x, 2500)

  ## Spatial Join
  x_gd <- st_join(x_catch, gd_pop)

  ## Calculate populations
  x_profile <- x_gd %>%
    as.data.frame() %>%
    select(-c(geom)) %>%
    group_by(location_name, placekey) %>%
    summarise(catch_pop = sum(value))
  x_gd_profile <- x_gd %>%
    as.data.frame() %>%
    select(-c(geom)) %>%
    group_by(location_name, placekey, TractGroup) %>%
    summarise(catch_gd_pop = sum(value))
  x_profile <- merge(x_profile, x_gd_profile, by = c("location_name", "placekey"), all.x = TRUE)
  x_profile_final <- x_profile %>%
  mutate(gd_prop = (catch_gd_pop / catch_pop) * 100) %>%
  select(location_name, placekey, TractGroup, gd_prop)

  ## Calculate Income Profile
  x_profile_final <- merge(x_profile_final, zscores[, c("TractGroup", "z.score")], all.x = TRUE)
  x_profile_final <- x_profile_final %>%
    arrange(location_name, placekey) %>%
    mutate(w_gd_prop = gd_prop * z.score)
  x_out <- x_profile_final %>%
    select(location_name, placekey, w_gd_prop) %>%
    group_by(location_name, placekey) %>%
    summarise(exp = sum(w_gd_prop))
  x_income_prof <- x_out %>%
    group_by(location_name) %>%
    summarise(incomeProfile = mean(exp))
  return(x_income_prof)

}
```

Let's identify a few different brands to extract profiles for

```{r}
brand_ls <- c("Banana Republic", "Michael Kors", "T.J. Maxx", "JCPenney")
```

Run the profiles together, using the lapply() function:

```{r echo=FALSE}
profiles <- lapply(brand_ls, getProfile)
do.call(rbind, profiles)
```

---

## Enjoyed today's workshop? Want to learn more?

Check out the free R tutorial on Retail applications in R; drive-time catchments, gravity models, profiling, available [HERE](https://data.cdrc.ac.uk/dataset/advanced-gis-methods-training-retail-centres-and-catchment-areas).

A new version of the US Geodemographic classification is under construction, keep an eye on Prof. Alex Singleton's [GitHub repo](https://github.com/alexsingleton) for updates.