Merged
1 change: 1 addition & 0 deletions .github/workflows/pkgdown.yaml
Original file line number Diff line number Diff line change
@@ -20,6 +20,7 @@ jobs:
group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }}
env:
GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }}
IPUMS_API_KEY: ${{ secrets.IPUMS_API_KEY }}
permissions:
contents: write
steps:
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -1,8 +1,8 @@
Package: crosswalk
Type: Package
Title: Simple interface to inter-temporal and inter-geography crosswalks
Title: Streamlining Inter-Temporal and Inter-Geography Crosswalking
Version: 0.0.0.9001
Description: An R package providing a simple interface to access geographic crosswalks.
Description: An R package providing a simple interface to access and apply crosswalks.
License: MIT + file LICENSE
Authors@R:
person(given = "Will", family = "Curran-Groome", email = "wcurrangroome@urban.org", role = c("aut", "cre"))
53 changes: 32 additions & 21 deletions README.Rmd
@@ -18,11 +18,12 @@ devtools::load_all()

# crosswalk

An R interface to inter-geography and inter-temporal crosswalks.
An R package providing a simple interface to access and apply crosswalks.

## Overview

This package provides a consistent API and standardized versions of crosswalks to enable consistent approaches that work across different geography and year combinations. The package also facilitates
This package provides a consistent API and standardized versions of crosswalks to enable consistent approaches
that work across different geography and year combinations. The package also facilitates
interpolation--that is, adjusting source geography/year values by their crosswalk weights and translating
these values to the desired target geography/year--including diagnostics of the joins between source data
and crosswalks.
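
As a rough illustration of what interpolation means here, the base R sketch below weights source values by an allocation factor and sums them by target geography. It does not call the package. The toy values and the `source_geoid` / `target_geoid` column names are assumptions made for illustration; only `allocation_factor_source_to_target` is taken from the column table in this README.

```r
# Toy crosswalk (made-up values): source "A" splits across targets "X" and "Y";
# source "B" maps entirely to "Y".
crosswalk <- data.frame(
  source_geoid = c("A", "A", "B"),   # assumed column name
  target_geoid = c("X", "Y", "Y"),   # assumed column name
  allocation_factor_source_to_target = c(0.25, 0.75, 1.0)
)

source_data <- data.frame(
  source_geoid = c("A", "B"),
  count_households = c(100, 40)
)

# Join source values to the crosswalk, weight each row, then sum by target.
joined <- merge(source_data, crosswalk, by = "source_geoid")
joined$weighted <- joined$count_households *
  joined$allocation_factor_source_to_target

target_data <- aggregate(weighted ~ target_geoid, data = joined, FUN = sum)
names(target_data)[2] <- "count_households"

target_data
```

Because each source row's allocation factors sum to 1 in this toy crosswalk, the total count is the same before and after interpolation.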
@@ -37,9 +38,9 @@ The package sources crosswalks from:

- **Programmatic access**: No more manual downloads from web interfaces
- **Standardized output**: Consistent column names across all crosswalk sources
- **Metadata tracking**: Full provenance stored as attributes
- **Multi-step handling**: Automatic chaining when both geography and year change
- **Local caching**: Reproducible workflows with cached crosswalks
- **Metadata tracking**: Full provenance of crosswalks stored as attributes
- **Crosswalk chaining**: Automatic chaining when multiple crosswalks are required
- **Local caching**: Reproducible workflows with locally-cached crosswalks for speed

## Installation

@@ -151,7 +152,8 @@ combined_data %>%

## Core Functions

The package has two main functions:
The package has two main functions, though you can also specify the needed crosswalk(s)
directly from `crosswalk_data()` and omit the intermediate `get_crosswalk()` call.

| Function | Purpose |
|--------------------------------------|----------------------------------|
@@ -168,8 +170,7 @@ result <- get_crosswalk(
target_geography = "zcta",
source_year = 2010,
target_year = 2020,
weight = "population"
)
weight = "population")

names(result)
#> [1] "crosswalks" "plan" "message"
@@ -192,23 +193,23 @@ The list contains three elements:
result <- get_crosswalk(
source_geography = "tract",
target_geography = "zcta",
weight = "population"
)
weight = "population")
# result$crosswalks$step_1 contains one crosswalk

# Same geography, different year (NHGIS)
result <- get_crosswalk(
source_geography = "tract",
target_geography = "tract",
source_year = 2010,
target_year = 2020
)
target_year = 2020)
# result$crosswalks$step_1 contains one crosswalk
```

**Multi-step crosswalks** (different geography AND different year):
**Multi-step crosswalks** (when a single, direct crosswalk is not available):

When both geography and year change, no single crosswalk source provides this directly. The package automatically plans and fetches a two-step chain:
Some source year/geography -> target year/geography specifications do not have a single, direct crosswalk.
In such cases, two or more crosswalks may be needed. The package automatically plans and fetches the
required crosswalks:

1. **Step 1 (NHGIS)**: Change year, keep geography constant
2. **Step 2 (Geocorr)**: Change geography at target year
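
To make the two-step idea concrete, here is a base R sketch that composes two toy crosswalks into one by joining on the intermediate geography and multiplying allocation factors. This illustrates the arithmetic only. Whether the package composes factors like this or applies each step in sequence is not stated in this README, and all identifiers and values below are hypothetical.

```r
# Step 1 (toy): 2010 tracts -> 2020 tracts (year changes, geography constant).
step_1 <- data.frame(
  source_geoid = c("T10_1", "T10_1", "T10_2"),
  mid_geoid    = c("T20_1", "T20_2", "T20_2"),
  factor_1     = c(0.6, 0.4, 1.0)
)

# Step 2 (toy): 2020 tracts -> 2020 ZCTAs (geography changes at target year).
step_2 <- data.frame(
  mid_geoid    = c("T20_1", "T20_2", "T20_2"),
  target_geoid = c("Z1", "Z1", "Z2"),
  factor_2     = c(1.0, 0.5, 0.5)
)

# Join on the intermediate geography and multiply the two factors.
chained <- merge(step_1, step_2, by = "mid_geoid")
chained$allocation_factor_source_to_target <- chained$factor_1 * chained$factor_2

# Collapse paths that share the same source/target pair.
composite <- aggregate(
  allocation_factor_source_to_target ~ source_geoid + target_geoid,
  data = chained,
  FUN = sum
)

composite
```

For count variables, applying the composite factors gives the same totals as applying step 1 and then step 2, since both amount to multiplying and summing along each source-to-target path.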
@@ -219,8 +220,7 @@ result <- get_crosswalk(
target_geography = "zcta",
source_year = 2010,
target_year = 2020,
weight = "population"
)
weight = "population")

# Two crosswalks are returned
names(result$crosswalks)
@@ -241,7 +241,8 @@ Each crosswalk contains standardized columns:
| `allocation_factor_source_to_target` | Weight for interpolating values |
| `weighting_factor` | What attribute was used (population, housing, land) |

Additional columns may include `source_year`, `target_year`, `population_2020`, `housing_2020`, and `land_area_sqmi` depending on the source.
Additional columns may include `source_year`, `target_year`, `population_2020`, `housing_2020`,
and `land_area_sqmi` depending on the source of the crosswalk.

### Accessing Metadata

@@ -257,6 +258,8 @@ names(metadata)
## Using `crosswalk_data()` to Interpolate Data

`crosswalk_data()` applies crosswalk weights to transform your data. It automatically handles multi-step crosswalks.
If you're in a hurry, you can omit a call to `get_crosswalk()` and specify the needed crosswalk parameters
to `crosswalk_data()`, which will pass these to `get_crosswalk()` behind the scenes.

### Column Naming Convention

@@ -270,7 +273,6 @@ The function auto-detects columns based on prefixes:
You can also specify columns explicitly via `count_columns` and `non_count_columns`.
All non-count variables are interpolated using weighted means, weighting by the allocation factor from the crosswalk.
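
The weighted-mean rule for non-count variables can be sketched in base R as follows. This is a minimal illustration under assumed column names (`source_geoid`, `target_geoid`) and made-up values, not the package's internal code.

```r
# Toy crosswalk: source "A" maps fully to "X"; source "B" splits evenly
# between "X" and "Y".
crosswalk <- data.frame(
  source_geoid = c("A", "B", "B"),   # assumed column name
  target_geoid = c("X", "X", "Y"),   # assumed column name
  allocation_factor_source_to_target = c(1.0, 0.5, 0.5)
)

source_data <- data.frame(
  source_geoid = c("A", "B"),
  median_loan_amount = c(200000, 300000)
)

joined <- merge(source_data, crosswalk, by = "source_geoid")

# For each target, take the mean of the source values weighted by the
# allocation factor (rather than summing, as for count variables).
median_by_target <- vapply(
  split(joined, joined$target_geoid),
  function(d) {
    weighted.mean(d$median_loan_amount, d$allocation_factor_source_to_target)
  },
  numeric(1)
)

median_by_target
```

Target "Y" receives only source "B", so it keeps B's value unchanged; target "X" blends A and B with weights 1.0 and 0.5.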


## Supported Geography and Year Combinations

### Inter-Geography Crosswalks (Geocorr)
@@ -300,11 +302,14 @@ NHGIS provides cross-decade crosswalks with the following structure:
**Notes:**
- Within-decade crosswalks (e.g., 2010→2014) are not available from NHGIS
- Block→ZCTA, Block→PUMA, etc. are only available for decennial years (1990, 2000, 2010, 2020)
- The package automatically uses direct NHGIS crosswalks when available (e.g., `get_crosswalk(source_geography = "block", target_geography = "zcta", source_year = 2010, target_year = 2020)` returns a single-step NHGIS crosswalk)
- The package automatically uses direct NHGIS crosswalks when available (e.g.,
`get_crosswalk(source_geography = "block", target_geography = "zcta", source_year = 2010, target_year = 2020)`
returns a single-step NHGIS crosswalk)

### 2020→2022 Crosswalks (CTData)

For 2020 to 2022 transformations, the package uses CT Data Collaborative crosswalks for Connecticut (where planning regions replaced counties) and identity mappings for other states (where no changes occurred).
For 2020 to 2022 transformations, the package uses CT Data Collaborative crosswalks for Connecticut
(where planning regions replaced counties) and identity mappings for other states (where no changes occurred).
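
An identity mapping is simply a crosswalk in which every geography maps to itself with an allocation factor of 1. The sketch below builds one by hand in base R for two example county GEOIDs. The `source_geoid` / `target_geoid` column names are assumptions, and this is not how the package necessarily constructs these mappings.

```r
# Two example county GEOIDs outside Connecticut, used purely for illustration.
county_geoids <- c("36061", "06037")

identity_crosswalk <- data.frame(
  source_geoid = county_geoids,   # assumed column name
  target_geoid = county_geoids,   # assumed column name
  allocation_factor_source_to_target = 1
)

# Applying an identity crosswalk leaves values unchanged.
values_2020 <- c(100, 250)
values_2022 <- values_2020 * identity_crosswalk$allocation_factor_source_to_target

identity_crosswalk
```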

## API Keys

@@ -336,4 +341,10 @@ The intellectual credit for the underlying crosswalks belongs to the original de

**For Geocorr**, a suggested citation:

> Missouri Census Data Center, University of Missouri. (2022). Geocorr 2022: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022.html
> Missouri Census Data Center, University of Missouri. (2022). Geocorr 2022: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022.html

**For CTData**, a suggested citation (adjust for alternate source geography):

> CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. Retrieved from: https://github.com/CT-Data-Collaborative/2022-tract-crosswalk.

**For this package**, refer here: https://ui-research.github.io/crosswalk/authors.html#citation
99 changes: 7 additions & 92 deletions renv/activate.R
@@ -3,7 +3,6 @@ local({

# the requested version of renv
version <- "1.1.7"
attr(version, "md5") <- "dd5d60f155dadff4c88c2fc6680504b4"
attr(version, "sha") <- NULL

# the project directory
@@ -169,16 +168,6 @@ local({
if (quiet)
return(invisible())

# also check for config environment variables that should suppress messages
# https://github.com/rstudio/renv/issues/2214
enabled <- Sys.getenv("RENV_CONFIG_STARTUP_QUIET", unset = NA)
if (!is.na(enabled) && tolower(enabled) %in% c("true", "1"))
return(invisible())

enabled <- Sys.getenv("RENV_CONFIG_SYNCHRONIZED_CHECK", unset = NA)
if (!is.na(enabled) && tolower(enabled) %in% c("false", "0"))
return(invisible())

msg <- sprintf(fmt, ...)
cat(msg, file = stdout(), sep = if (appendLF) "\n" else "")

@@ -226,16 +215,6 @@ local({
section <- header(sprintf("Bootstrapping renv %s", friendly))
catf(section)

# try to install renv from cache
md5 <- attr(version, "md5", exact = TRUE)
if (length(md5)) {
pkgpath <- renv_bootstrap_find(version)
if (length(pkgpath) && file.exists(pkgpath)) {
file.copy(pkgpath, library, recursive = TRUE)
return(invisible())
}
}

# attempt to download renv
catf("- Downloading renv ... ", appendLF = FALSE)
withCallingHandlers(
@@ -261,6 +240,7 @@ local({

# add empty line to break up bootstrapping from normal output
catf("")

return(invisible())
}

@@ -277,20 +257,12 @@ local({
repos <- Sys.getenv("RENV_CONFIG_REPOS_OVERRIDE", unset = NA)
if (!is.na(repos)) {

# split on ';' if present
parts <- strsplit(repos, ";", fixed = TRUE)[[1L]]

# split into named repositories if present
idx <- regexpr("=", parts, fixed = TRUE)
keys <- substring(parts, 1L, idx - 1L)
vals <- substring(parts, idx + 1L)
names(vals) <- keys
# check for RSPM; if set, use a fallback repository for renv
rspm <- Sys.getenv("RSPM", unset = NA)
if (identical(rspm, repos))
repos <- c(RSPM = rspm, CRAN = cran)

# if we have a single unnamed repository, call it CRAN
if (length(vals) == 1L && identical(keys, ""))
names(vals) <- "CRAN"

return(vals)
return(repos)

}

@@ -539,51 +511,6 @@ local({

}

renv_bootstrap_find <- function(version) {

path <- renv_bootstrap_find_cache(version)
if (length(path) && file.exists(path)) {
catf("- Using renv %s from global package cache", version)
return(path)
}

}

renv_bootstrap_find_cache <- function(version) {

md5 <- attr(version, "md5", exact = TRUE)
if (is.null(md5))
return()

# infer path to renv cache
cache <- Sys.getenv("RENV_PATHS_CACHE", unset = "")
if (!nzchar(cache)) {
root <- Sys.getenv("RENV_PATHS_ROOT", unset = NA)
if (!is.na(root))
cache <- file.path(root, "cache")
}

if (!nzchar(cache)) {
tools <- asNamespace("tools")
if (is.function(tools$R_user_dir)) {
root <- tools$R_user_dir("renv", "cache")
cache <- file.path(root, "cache")
}
}

# start completing path to cache
file.path(
cache,
renv_bootstrap_cache_version(),
renv_bootstrap_platform_prefix(),
"renv",
version,
md5,
"renv"
)

}

renv_bootstrap_download_tarball <- function(version) {

# if the user has provided the path to a tarball via
@@ -1052,7 +979,7 @@ local({

renv_bootstrap_validate_version_release <- function(version, description) {
expected <- description[["Version"]]
is.character(expected) && identical(c(expected), c(version))
is.character(expected) && identical(expected, version)
}

renv_bootstrap_hash_text <- function(text) {
@@ -1254,18 +1181,6 @@ local({

}

renv_bootstrap_cache_version <- function() {
# NOTE: users should normally not override the cache version;
# this is provided just to make testing easier
Sys.getenv("RENV_CACHE_VERSION", unset = "v5")
}

renv_bootstrap_cache_version_previous <- function() {
version <- renv_bootstrap_cache_version()
number <- as.integer(substring(version, 2L))
paste("v", number - 1L, sep = "")
}

renv_json_read <- function(file = NULL, text = NULL) {

jlerr <- NULL
17 changes: 9 additions & 8 deletions vignettes/standardizing-longitudinal-data.Rmd
@@ -8,12 +8,15 @@ vignette: >
---

```{r, include = FALSE}
# Only evaluate chunks if IPUMS API key is available
has_api_key <- nchar(Sys.getenv("IPUMS_API_KEY")) > 10

knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
message = FALSE,
echo = TRUE,
eval = TRUE)
eval = has_api_key)
```
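
The gating logic in the setup chunk above can be checked in isolation. The sketch below wraps the same length test in a small helper and exercises it against a throwaway environment variable. The helper name `has_env_key()` and the variable `CROSSWALK_DEMO_KEY` are hypothetical and exist only for this illustration.

```r
# TRUE when the named environment variable holds more than `min_chars`
# characters; an unset variable reads as "" and therefore returns FALSE.
has_env_key <- function(name, min_chars = 10) {
  nchar(Sys.getenv(name)) > min_chars
}

# Unset: chunks gated on this result would be skipped.
Sys.unsetenv("CROSSWALK_DEMO_KEY")
key_missing <- has_env_key("CROSSWALK_DEMO_KEY")

# Set to a 16-character placeholder: gated chunks would be evaluated.
Sys.setenv(CROSSWALK_DEMO_KEY = "abcdefghijklmnop")
key_present <- has_env_key("CROSSWALK_DEMO_KEY")

# Clean up the throwaway variable.
Sys.unsetenv("CROSSWALK_DEMO_KEY")

c(missing = key_missing, present = key_present)
```

Passing a logical like this to knitr's `eval` chunk option lets the vignette build without error on machines where no key is configured, at the cost of showing code without output.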

## Overview
@@ -75,13 +78,11 @@ glimpse(hmda_data[["2018"]])

## Step 2: Prepare Data for Crosswalking

The HMDA data includes a `tractid` column that contains the 11-digit tract GEOID.
Let's prepare a subset of variables for crosswalking. We'll focus on a subset of variables
We'll focus on a subset of variables for crosswalking
(total applications by race/ethnicity and median loan amounts). We could explicitly pass the
variables we want to crosswalk to the appropriate parameter (`count_columns` or `non_count_columns`),
but it's easy (and nice practice) to prefix these variables with their unit types ("count" and "median",
respectively), and `crosswalk_data()` will crosswalk each appropriately by default since they have these
standard unit prefixes in their names.
respectively), and `crosswalk_data()` will crosswalk each appropriately by default.

```{r prepare-data, echo = FALSE}
prepare_hmda <- function(data) {
@@ -121,7 +122,7 @@ tract_crosswalk$message
## Step 4: Apply the Crosswalk to 2018-2021 Data

Now we apply the crosswalk to the four years of data that use 2010 tract definitions.
We can see in the console-printed output that relatively small, though not insignificant,
We can see that relatively small, though not insignificant,
fractions of records in our source data do not join to our crosswalk. When this occurs, source
data is effectively lost because it has no associated target geography nor allocation factor
assigned to it.
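
One way to quantify this kind of loss by hand is sketched below in base R: compare the source identifiers against those present in the crosswalk, then report the share of records, and the share of a count variable, that fail to join. The data and the `source_geoid` / `target_geoid` column names are made up for illustration. The package's own diagnostics may be computed differently.

```r
# Toy source data: tract "D" has no row in the crosswalk.
source_data <- data.frame(
  source_geoid = c("A", "B", "C", "D"),
  count_applications = c(10, 20, 30, 40)
)

crosswalk <- data.frame(
  source_geoid = c("A", "B", "C"),   # assumed column name
  target_geoid = c("X", "X", "Y"),   # assumed column name
  allocation_factor_source_to_target = c(1, 1, 1)
)

# Flag source records with no matching crosswalk row.
unmatched <- !(source_data$source_geoid %in% crosswalk$source_geoid)

# Share of records lost, and share of the counted quantity lost.
share_unmatched_records <- mean(unmatched)
share_unmatched_count <- sum(source_data$count_applications[unmatched]) /
  sum(source_data$count_applications)

c(records = share_unmatched_records, count = share_unmatched_count)
```

The two shares can differ substantially: here one in four records is unmatched, but it carries 40 percent of the applications, so checking both gives a fuller picture of what is dropped.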
@@ -171,8 +172,8 @@ hmda_crosswalked |>

## Result: A Panel Dataset in 2020 Tract Definitions
We now have a single dataframe with all six years of HMDA data standardized to 2020
tract definitions. Due to changes in tract geographies between decades, we were unable
to accurately compare neighborhood changes over time.
tract definitions. Due to changes in tract geographies between decades, we were previously
unable to accurately compare neighborhood changes over time.

Now, we have apples-to-apples measurements for tracts from 2018 through 2023.
