67 changes: 43 additions & 24 deletions README.md
@@ -1,7 +1,8 @@

# crosswalk

An R interface to inter-geography and inter-temporal crosswalks.
An R package providing a simple interface to access and apply
crosswalks.

## Overview

@@ -27,23 +28,29 @@ The package sources crosswalks from:
- **Programmatic access**: No more manual downloads from web interfaces
- **Standardized output**: Consistent column names across all crosswalk
sources
- **Metadata tracking**: Full provenance stored as attributes
- **Multi-step handling**: Automatic chaining when both geography and
year change
- **Local caching**: Reproducible workflows with cached crosswalks
- **Metadata tracking**: Full provenance of crosswalks stored as
attributes
- **Crosswalk chaining**: Automatic chaining when multiple crosswalks
are required
- **Local caching**: Reproducible workflows with locally-cached
crosswalks for speed
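
The chaining and metadata features listed above can be pictured with a small, self-contained sketch. It is a toy illustration in base R, not the package's implementation: the helper `apply_crosswalk()`, the column names `source_geoid`, `target_geoid`, and `allocation_factor`, and all values are assumptions made for the example.

```r
# Toy illustration only: the helper, column names, and values below are
# assumptions for this sketch, not the package's actual internals or output.
apply_crosswalk <- function(data, crosswalk) {
  # Attach every target unit (and its weight) to each source record
  merged <- merge(data, crosswalk, by = "source_geoid")
  # Allocate the count to each target in proportion to the weight
  merged$count <- merged$count * merged$allocation_factor
  # Sum the allocated pieces within each target unit
  out <- aggregate(count ~ target_geoid, data = merged, FUN = sum)
  # Rename so the result can serve as the source of the next step
  names(out)[names(out) == "target_geoid"] <- "source_geoid"
  out
}

counts <- data.frame(source_geoid = c("A", "B"), count = c(100, 50))

# Step 1 (e.g. a change of year): A splits 60/40, B maps one-to-one
step_1 <- data.frame(
  source_geoid = c("A", "A", "B"),
  target_geoid = c("A1", "A2", "B1"),
  allocation_factor = c(0.6, 0.4, 1))

# Step 2 (e.g. a change of geography): B1 splits evenly across Z1 and Z2
step_2 <- data.frame(
  source_geoid = c("A1", "A2", "B1", "B1"),
  target_geoid = c("Z1", "Z2", "Z1", "Z2"),
  allocation_factor = c(1, 1, 0.5, 0.5))

# Chain the steps: the output of step 1 is the input to step 2
result <- Reduce(apply_crosswalk, list(step_1, step_2), counts)

# Store provenance alongside the data as an attribute
attr(result, "crosswalk_metadata") <- list(steps = 2, retrieved_at = Sys.time())
```

Because the allocation factors for each source unit sum to one, the total count is preserved across both steps, and the provenance remains readable with `attr(result, "crosswalk_metadata")`.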

## Installation

``` r
# Install from GitHub
renv::install("UI-Research/crosswalk")
#> # Downloading packages -------------------------------------------------------
#> - Downloading crosswalk 0.0.0.9001 from GitHub ... OK [95.7 Kb in 0.54s]
#> Successfully downloaded 1 package in 1 second.
#>
#> The following package(s) will be installed:
#> - crosswalk [UI-Research/crosswalk]
#> These packages will be installed into "C:/Users/wcurrangroome/AppData/Local/Temp/RtmpSkgo68/temp_libpathd7e02418a38".
#> These packages will be installed into "C:/Users/wcurrangroome/AppData/Local/Temp/RtmpkTRpSB/temp_libpath1dfc460a4888".
#>
#> # Installing packages --------------------------------------------------------
#> - Installing crosswalk 0.0.0.9001 ... OK [copied from cache in 0.24s]
#> Successfully installed 1 package in 0.26 seconds.
#> - Installing crosswalk 0.0.0.9001 ... OK [built from source and cached in 2.1s]
#> Successfully installed 1 package in 2.4 seconds.
```

## Overview
Expand Down Expand Up @@ -160,7 +167,7 @@ attr(crosswalked_data, "crosswalk_metadata")
#> NULL
#>
#> $retrieved_at
#> [1] "2026-02-01 00:09:21 EST"
#> [1] "2026-02-02 13:20:20 EST"
#>
#> $cached
#> [1] FALSE
Expand Down Expand Up @@ -289,7 +296,9 @@ combined_data %>%

## Core Functions

The package has two main functions:
The package has two main functions, though you can also specify the
needed crosswalk(s) directly from `crosswalk_data()` and omit the
intermediate `get_crosswalk()` call.

| Function | Purpose |
|----|----|
@@ -306,8 +315,7 @@ result <- get_crosswalk(
target_geography = "zcta",
source_year = 2010,
target_year = 2020,
weight = "population"
)
weight = "population")

names(result)
#> [1] "crosswalks" "plan" "message"
@@ -332,25 +340,25 @@ geography, different year):
result <- get_crosswalk(
source_geography = "tract",
target_geography = "zcta",
weight = "population"
)
weight = "population")
# result$crosswalks$step_1 contains one crosswalk

# Same geography, different year (NHGIS)
result <- get_crosswalk(
source_geography = "tract",
target_geography = "tract",
source_year = 2010,
target_year = 2020
)
target_year = 2020)
# result$crosswalks$step_1 contains one crosswalk
```

**Multi-step crosswalks** (different geography AND different year):
**Multi-step crosswalks** (when a single, direct crosswalk is not
available):

When both geography and year change, no single crosswalk source provides
this directly. The package automatically plans and fetches a two-step
chain:
Some source year/geography -\> target year/geography specifications
do not have a direct crosswalk. In such cases, two or more crosswalks
are needed. The package automatically plans and fetches the required
crosswalks:

1. **Step 1 (NHGIS)**: Change year, keep geography constant
2. **Step 2 (Geocorr)**: Change geography at target year
@@ -361,8 +369,7 @@ result <- get_crosswalk(
target_geography = "zcta",
source_year = 2010,
target_year = 2020,
weight = "population"
)
weight = "population")

# Two crosswalks are returned
names(result$crosswalks)
@@ -386,7 +393,7 @@ Each crosswalk contains standardized columns:

Additional columns may include `source_year`, `target_year`,
`population_2020`, `housing_2020`, and `land_area_sqmi` depending on the
source.
source of the crosswalk.

### Accessing Metadata

@@ -414,7 +421,10 @@ names(metadata)
## Using `crosswalk_data()` to Interpolate Data

`crosswalk_data()` applies crosswalk weights to transform your data. It
automatically handles multi-step crosswalks.
automatically handles multi-step crosswalks. If you’re in a hurry, you
can omit a call to `get_crosswalk()` and specify the needed crosswalk
parameters to `crosswalk_data()`, which will pass these to
`get_crosswalk()` behind the scenes.

### Column Naming Convention

@@ -503,3 +513,12 @@ original developers.
> Missouri Census Data Center, University of Missouri. (2022). Geocorr
> 2022: Geographic Correspondence Engine. Retrieved from:
> <https://mcdc.missouri.edu/applications/geocorr2022.html>

**For CTData**, a suggested citation (adjust for alternate source
geography):

> CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. Retrieved
> from: <https://github.com/CT-Data-Collaborative/2022-tract-crosswalk>.

**For this package**, refer here:
<https://ui-research.github.io/crosswalk/authors.html#citation>
24 changes: 10 additions & 14 deletions vignettes/standardizing-longitudinal-data.Rmd
@@ -73,7 +73,8 @@ names(hmda_data) = metadata$year %>% as.character()
Let's inspect the structure of the data:

```{r inspect-data}
glimpse(hmda_data[["2018"]])
## just view the first ten columns
glimpse(hmda_data[["2018"]] %>% select(1:10))
```

## Step 2: Prepare Data for Crosswalking
@@ -84,7 +85,7 @@ variables we want to crosswalk to the appropriate parameter (`count_columns` or
but it's easy (and nice practice) to prefix these variables with their unit types ("count" and "median",
respectively), and `crosswalk_data()` will crosswalk each appropriately by default.

```{r prepare-data, echo = FALSE}
```{r prepare-data}
prepare_hmda <- function(data) {
data |>
rename_with(.cols = matches("^geo20"), .fn = ~ "source_geoid") |>
@@ -127,7 +128,7 @@ fractions of records in our source data do not join to our crosswalk. When this
data is effectively lost because it has no associated target geography nor allocation factor
assigned to it.

```{r apply-crosswalk, echo = FALSE}
```{r apply-crosswalk}
# Years that need crosswalking (2010 vintage)
years_to_crosswalk <- c("2018", "2019", "2020", "2021")

@@ -158,7 +159,9 @@ hmda_crosswalked |>
attr("join_quality") |>
pluck("data_geoids_unmatched") |>
head(5))
```

```{r}
## how many source records are we unable to crosswalk each year, excluding
## those with "X" in their GEOIDs? under 30 each year.
hmda_crosswalked |>
@@ -178,20 +181,13 @@ unable to accurately compare neighborhood changes over time.
Now, we have apples-to-apples measurements for tracts from 2018 through 2023.

```{r final-summary}
hmda_combined <- bind_rows(hmda_crosswalked) |>
## data for years that are crosswalked have slightly different/additional columsn
mutate(
geoid = if_else(is.na(geoid), source_geoid, geoid)) |>
select(-c(geography_name, source_geoid, vintage)) |>
arrange(geoid, data_year) |>
mutate(
state = str_sub(geoid, 1, 2),
percent_race_white_purchase = count_race_white_purchase / count_owner_purchase_originations)

## there's a little bit of variation year-to-year in terms of which tracts have
## reported HMDA data, but for the majority, we have observations in each of the
## six years:
hmda_combined |>
hmda_combined <- bind_rows(hmda_crosswalked) |>
## data for years that are crosswalked have slightly different/additional columns
mutate(
geoid = if_else(is.na(geoid), source_geoid, geoid)) |>
count(geoid) |>
count(n)
```
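
The `count(geoid) |> count(n)` idiom in the chunk above tallies how many tracts appear in one year, two years, and so on. A minimal sketch of the same idea follows, written in base R so it runs without extra packages; the data frame `obs` and its values are hypothetical and are not HMDA data.

```r
# Hypothetical long-format data: one row per tract-year observation
obs <- data.frame(
  geoid = c("01", "01", "01", "02", "02", "02", "03"),
  data_year = c(2018, 2019, 2020, 2018, 2019, 2020, 2018))

# Number of yearly observations per tract (analogue of count(geoid))
years_per_tract <- table(obs$geoid)

# Number of tracts at each level of coverage (analogue of count(n))
coverage <- table(as.vector(years_per_tract))
coverage
```

Here two tracts are observed in all three years and one tract in a single year, so `coverage` has entries only for 1 and 3.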