diff --git a/.github/workflows/pkgdown.yaml b/.github/workflows/pkgdown.yaml index 057cae3..81597b6 100644 --- a/.github/workflows/pkgdown.yaml +++ b/.github/workflows/pkgdown.yaml @@ -20,6 +20,7 @@ jobs: group: pkgdown-${{ github.event_name != 'pull_request' || github.run_id }} env: GITHUB_PAT: ${{ secrets.GITHUB_TOKEN }} + IPUMS_API_KEY: ${{ secrets.IPUMS_API_KEY }} permissions: contents: write steps: diff --git a/DESCRIPTION b/DESCRIPTION index 932bdf9..94baada 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,8 +1,8 @@ Package: crosswalk Type: Package -Title: Simple interface to inter-temporal and inter-geography crosswalks +Title: Streamlining Inter-Temporal and Inter-Geography Crosswalking Version: 0.0.0.9001 -Description: An R package providing a simple interface to access geographic crosswalks. +Description: An R package providing a simple interface to access and apply crosswalks. License: MIT + file LICENSE Authors@R: person(given = "Will", family = "Curran-Groome", email = "wcurrangroome@urban.org", role = c("aut", "cre")) diff --git a/README.Rmd b/README.Rmd index 563c2b3..f5e0d9e 100644 --- a/README.Rmd +++ b/README.Rmd @@ -18,11 +18,12 @@ devtools::load_all() # crosswalk -An R interface to inter-geography and inter-temporal crosswalks. + An R package providing a simple interface to access and apply crosswalks. ## Overview -This package provides a consistent API and standardized versions of crosswalks to enable consistent approaches that work across different geography and year combinations. The package also facilitates +This package provides a consistent API and standardized versions of crosswalks to enable consistent approaches +that work across different geography and year combinations. The package also facilitates interpolation--that is, adjusting source geography/year values by their crosswalk weights and translating these values to the desired target geography/year--including diagnostics of the joins between source data and crosswalks. 
@@ -37,9 +38,9 @@ The package sources crosswalks from: - **Programmatic access**: No more manual downloads from web interfaces - **Standardized output**: Consistent column names across all crosswalk sources -- **Metadata tracking**: Full provenance stored as attributes -- **Multi-step handling**: Automatic chaining when both geography and year change -- **Local caching**: Reproducible workflows with cached crosswalks +- **Metadata tracking**: Full provenance of crosswalks stored as attributes +- **Crosswalk chaining**: Automatic chaining when multiple crosswalks are required +- **Local caching**: Reproducible workflows with locally cached crosswalks for speed ## Installation @@ -151,7 +152,8 @@ combined_data %>% ## Core Functions -The package has two main functions: +The package has two main functions, though you can also specify the needed crosswalk(s) +directly from `crosswalk_data()` and omit the intermediate `get_crosswalk()` call. | Function | Purpose | |--------------------------------------|----------------------------------| @@ -168,8 +170,7 @@ result <- get_crosswalk( target_geography = "zcta", source_year = 2010, target_year = 2020, - weight = "population" -) + weight = "population") names(result) #> [1] "crosswalks" "plan" "message" @@ -192,8 +193,7 @@ The list contains three elements: result <- get_crosswalk( source_geography = "tract", target_geography = "zcta", - weight = "population" -) + weight = "population") # result$crosswalks$step_1 contains one crosswalk # Same geography, different year (NHGIS) @@ -201,14 +201,15 @@ result <- get_crosswalk( source_geography = "tract", target_geography = "tract", source_year = 2010, - target_year = 2020 -) + target_year = 2020) # result$crosswalks$step_1 contains one crosswalk ``` -**Multi-step crosswalks** (different geography AND different year): +**Multi-step crosswalks** (when a single, direct crosswalk is not available): -When both geography and year change, no single crosswalk source provides this 
directly. The package automatically plans and fetches a two-step chain: +Some source year/geography -> target year/geography specifications do not have a direct crosswalk. +In such cases, two or more crosswalks may be needed. The package automatically plans and fetches the +required crosswalks: 1. **Step 1 (NHGIS)**: Change year, keep geography constant 2. **Step 2 (Geocorr)**: Change geography at target year @@ -219,8 +220,7 @@ result <- get_crosswalk( target_geography = "zcta", source_year = 2010, target_year = 2020, - weight = "population" -) + weight = "population") # Two crosswalks are returned names(result$crosswalks) @@ -241,7 +241,8 @@ Each crosswalk contains standardized columns: | `allocation_factor_source_to_target` | Weight for interpolating values | | `weighting_factor` | What attribute was used (population, housing, land) | -Additional columns may include `source_year`, `target_year`, `population_2020`, `housing_2020`, and `land_area_sqmi` depending on the source. +Additional columns may include `source_year`, `target_year`, `population_2020`, `housing_2020`, +and `land_area_sqmi` depending on the source of the crosswalk. ### Accessing Metadata @@ -257,6 +258,8 @@ names(metadata) ## Using `crosswalk_data()` to Interpolate Data `crosswalk_data()` applies crosswalk weights to transform your data. It automatically handles multi-step crosswalks. +If you're in a hurry, you can omit a call to `get_crosswalk()` and specify the needed crosswalk parameters +to `crosswalk_data()`, which will pass these to `get_crosswalk()` behind the scenes. ### Column Naming Convention @@ -270,7 +273,6 @@ The function auto-detects columns based on prefixes: You can also specify columns explicitly via `count_columns` and `non_count_columns`. All non-count variables are interpolated using weighted means, weighting by the allocation factor from the crosswalk. 
- ## Supported Geography and Year Combinations ### Inter-Geography Crosswalks (Geocorr) @@ -300,11 +302,14 @@ NHGIS provides cross-decade crosswalks with the following structure: **Notes:** - Within-decade crosswalks (e.g., 2010→2014) are not available from NHGIS - Block→ZCTA, Block→PUMA, etc. are only available for decennial years (1990, 2000, 2010, 2020) -- The package automatically uses direct NHGIS crosswalks when available (e.g., `get_crosswalk(source_geography = "block", target_geography = "zcta", source_year = 2010, target_year = 2020)` returns a single-step NHGIS crosswalk) +- The package automatically uses direct NHGIS crosswalks when available (e.g., +`get_crosswalk(source_geography = "block", target_geography = "zcta", source_year = 2010, target_year = 2020)` +returns a single-step NHGIS crosswalk) ### 2020→2022 Crosswalks (CTData) -For 2020 to 2022 transformations, the package uses CT Data Collaborative crosswalks for Connecticut (where planning regions replaced counties) and identity mappings for other states (where no changes occurred). +For 2020 to 2022 transformations, the package uses CT Data Collaborative crosswalks for Connecticut +(where planning regions replaced counties) and identity mappings for other states (where no changes occurred). ## API Keys @@ -336,4 +341,10 @@ The intellectual credit for the underlying crosswalks belongs to the original de **For Geocorr**, a suggested citation: -> Missouri Census Data Center, University of Missouri. (2022). Geocorr 2022: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022.html \ No newline at end of file +> Missouri Census Data Center, University of Missouri. (2022). Geocorr 2022: Geographic Correspondence Engine. Retrieved from: https://mcdc.missouri.edu/applications/geocorr2022.html + +**For CTData**, a suggested citation (adjust for alternate source geography): + +> CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. 
Retrieved from: https://github.com/CT-Data-Collaborative/2022-tract-crosswalk. + +**For this package**, refer here: https://ui-research.github.io/crosswalk/authors.html#citation \ No newline at end of file diff --git a/renv/activate.R b/renv/activate.R index 512fdc8..d5089b0 100644 --- a/renv/activate.R +++ b/renv/activate.R @@ -3,7 +3,6 @@ local({ # the requested version of renv version <- "1.1.7" - attr(version, "md5") <- "dd5d60f155dadff4c88c2fc6680504b4" attr(version, "sha") <- NULL # the project directory @@ -169,16 +168,6 @@ local({ if (quiet) return(invisible()) - # also check for config environment variables that should suppress messages - # https://github.com/rstudio/renv/issues/2214 - enabled <- Sys.getenv("RENV_CONFIG_STARTUP_QUIET", unset = NA) - if (!is.na(enabled) && tolower(enabled) %in% c("true", "1")) - return(invisible()) - - enabled <- Sys.getenv("RENV_CONFIG_SYNCHRONIZED_CHECK", unset = NA) - if (!is.na(enabled) && tolower(enabled) %in% c("false", "0")) - return(invisible()) - msg <- sprintf(fmt, ...) cat(msg, file = stdout(), sep = if (appendLF) "\n" else "") @@ -226,16 +215,6 @@ local({ section <- header(sprintf("Bootstrapping renv %s", friendly)) catf(section) - # try to install renv from cache - md5 <- attr(version, "md5", exact = TRUE) - if (length(md5)) { - pkgpath <- renv_bootstrap_find(version) - if (length(pkgpath) && file.exists(pkgpath)) { - file.copy(pkgpath, library, recursive = TRUE) - return(invisible()) - } - } - # attempt to download renv catf("- Downloading renv ... 
", appendLF = FALSE) withCallingHandlers( @@ -261,6 +240,7 @@ local({ # add empty line to break up bootstrapping from normal output catf("") + return(invisible()) } @@ -277,20 +257,12 @@ local({ repos <- Sys.getenv("RENV_CONFIG_REPOS_OVERRIDE", unset = NA) if (!is.na(repos)) { - # split on ';' if present - parts <- strsplit(repos, ";", fixed = TRUE)[[1L]] - - # split into named repositories if present - idx <- regexpr("=", parts, fixed = TRUE) - keys <- substring(parts, 1L, idx - 1L) - vals <- substring(parts, idx + 1L) - names(vals) <- keys + # check for RSPM; if set, use a fallback repository for renv + rspm <- Sys.getenv("RSPM", unset = NA) + if (identical(rspm, repos)) + repos <- c(RSPM = rspm, CRAN = cran) - # if we have a single unnamed repository, call it CRAN - if (length(vals) == 1L && identical(keys, "")) - names(vals) <- "CRAN" - - return(vals) + return(repos) } @@ -539,51 +511,6 @@ local({ } - renv_bootstrap_find <- function(version) { - - path <- renv_bootstrap_find_cache(version) - if (length(path) && file.exists(path)) { - catf("- Using renv %s from global package cache", version) - return(path) - } - - } - - renv_bootstrap_find_cache <- function(version) { - - md5 <- attr(version, "md5", exact = TRUE) - if (is.null(md5)) - return() - - # infer path to renv cache - cache <- Sys.getenv("RENV_PATHS_CACHE", unset = "") - if (!nzchar(cache)) { - root <- Sys.getenv("RENV_PATHS_ROOT", unset = NA) - if (!is.na(root)) - cache <- file.path(root, "cache") - } - - if (!nzchar(cache)) { - tools <- asNamespace("tools") - if (is.function(tools$R_user_dir)) { - root <- tools$R_user_dir("renv", "cache") - cache <- file.path(root, "cache") - } - } - - # start completing path to cache - file.path( - cache, - renv_bootstrap_cache_version(), - renv_bootstrap_platform_prefix(), - "renv", - version, - md5, - "renv" - ) - - } - renv_bootstrap_download_tarball <- function(version) { # if the user has provided the path to a tarball via @@ -1052,7 +979,7 @@ local({ 
renv_bootstrap_validate_version_release <- function(version, description) { expected <- description[["Version"]] - is.character(expected) && identical(c(expected), c(version)) + is.character(expected) && identical(expected, version) } renv_bootstrap_hash_text <- function(text) { @@ -1254,18 +1181,6 @@ local({ } - renv_bootstrap_cache_version <- function() { - # NOTE: users should normally not override the cache version; - # this is provided just to make testing easier - Sys.getenv("RENV_CACHE_VERSION", unset = "v5") - } - - renv_bootstrap_cache_version_previous <- function() { - version <- renv_bootstrap_cache_version() - number <- as.integer(substring(version, 2L)) - paste("v", number - 1L, sep = "") - } - renv_json_read <- function(file = NULL, text = NULL) { jlerr <- NULL diff --git a/vignettes/standardizing-longitudinal-data.Rmd b/vignettes/standardizing-longitudinal-data.Rmd index bd3603d..b0bf4c6 100644 --- a/vignettes/standardizing-longitudinal-data.Rmd +++ b/vignettes/standardizing-longitudinal-data.Rmd @@ -8,12 +8,15 @@ vignette: > --- ```{r, include = FALSE} +# Only evaluate chunks if IPUMS API key is available +has_api_key <- nchar(Sys.getenv("IPUMS_API_KEY")) > 10 + knitr::opts_chunk$set( collapse = TRUE, comment = "#>", message = FALSE, echo = TRUE, - eval = TRUE) + eval = has_api_key) ``` ## Overview @@ -75,13 +78,11 @@ glimpse(hmda_data[["2018"]]) ## Step 2: Prepare Data for Crosswalking -The HMDA data includes a `tractid` column that contains the 11-digit tract GEOID. -Let's prepare a subset of variables for crosswalking. We'll focus on a subset of variables +We'll focus on a subset of variables for crosswalking (total applications by race/ethnicity and median loan amounts). 
We could explicitly pass the variables we want to crosswalk to the appropriate parameter (`count_columns` or `non_count_columns`), but it's easy (and nice practice) to prefix these variables with their unit types ("count" and "median", -respectively), and `crosswalk_data()` will crosswalk each appropriately by default since they have these -standard unit prefixes in their names. +respectively), and `crosswalk_data()` will crosswalk each appropriately by default. ```{r prepare-data, echo = FALSE} prepare_hmda <- function(data) { @@ -121,7 +122,7 @@ tract_crosswalk$message ## Step 4: Apply the Crosswalk to 2018-2021 Data Now we apply the crosswalk to the four years of data that use 2010 tract definitions. -We can see in the console-printed output that relatively small, though not insignificant, +We can see that relatively small, though not insignificant, fractions of records in our source data do not join to our crosswalk. When this occurs, source data is effectively lost because it has no associated target geography nor allocation factor assigned to it. @@ -171,8 +172,8 @@ hmda_crosswalked |> ## Result: A Panel Dataset in 2020 Tract Definitions We now have a single dataframe with all six years of HMDA data standardized to 2020 -tract definitions. Due to changes in tract geographies between decades, we were unable -to accurately compare neighborhood changes over time. +tract definitions. Due to changes in tract geographies between decades, we were previously +unable to accurately compare neighborhood changes over time. Now, we have apples-to-apples measurements for tracts from 2018 through 2023.