diff --git a/README.md b/README.md index c419584..e64c46a 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,8 @@ # crosswalk -An R interface to inter-geography and inter-temporal crosswalks. +An R package providing a simple interface to access and apply +crosswalks. ## Overview @@ -27,23 +28,29 @@ The package sources crosswalks from: - **Programmatic access**: No more manual downloads from web interfaces - **Standardized output**: Consistent column names across all crosswalk sources -- **Metadata tracking**: Full provenance stored as attributes -- **Multi-step handling**: Automatic chaining when both geography and - year change -- **Local caching**: Reproducible workflows with cached crosswalks +- **Metadata tracking**: Full provenance of crosswalks stored as + attributes +- **Crosswalk chaining**: Automatic chaining when multiple crosswalks + are required +- **Local caching**: Reproducible workflows with locally-cached + crosswalks for speed ## Installation ``` r # Install from GitHub renv::install("UI-Research/crosswalk") +#> # Downloading packages ------------------------------------------------------- +#> - Downloading crosswalk 0.0.0.9001 from GitHub ... OK [95.7 Kb in 0.54s] +#> Successfully downloaded 1 package in 1 second. +#> #> The following package(s) will be installed: #> - crosswalk [UI-Research/crosswalk] -#> These packages will be installed into "C:/Users/wcurrangroome/AppData/Local/Temp/RtmpSkgo68/temp_libpathd7e02418a38". +#> These packages will be installed into "C:/Users/wcurrangroome/AppData/Local/Temp/RtmpkTRpSB/temp_libpath1dfc460a4888". #> #> # Installing packages -------------------------------------------------------- -#> - Installing crosswalk 0.0.0.9001 ... OK [copied from cache in 0.24s] -#> Successfully installed 1 package in 0.26 seconds. +#> - Installing crosswalk 0.0.0.9001 ... OK [built from source and cached in 2.1s] +#> Successfully installed 1 package in 2.4 seconds. 
``` ## Overview @@ -160,7 +167,7 @@ attr(crosswalked_data, "crosswalk_metadata") #> NULL #> #> $retrieved_at -#> [1] "2026-02-01 00:09:21 EST" +#> [1] "2026-02-02 13:20:20 EST" #> #> $cached #> [1] FALSE @@ -289,7 +296,9 @@ combined_data %>% ## Core Functions -The package has two main functions: +The package has two main functions, though you can also specify the +needed crosswalk(s) directly in `crosswalk_data()` and omit the +intermediate `get_crosswalk()` call. | Function | Purpose | |----|----| @@ -306,8 +315,7 @@ result <- get_crosswalk( target_geography = "zcta", source_year = 2010, target_year = 2020, - weight = "population" -) + weight = "population") names(result) #> [1] "crosswalks" "plan" "message" @@ -332,8 +340,7 @@ geography, different year): result <- get_crosswalk( source_geography = "tract", target_geography = "zcta", - weight = "population" -) + weight = "population") # result$crosswalks$step_1 contains one crosswalk # Same geography, different year (NHGIS) @@ -341,16 +348,17 @@ result <- get_crosswalk( source_geography = "tract", target_geography = "tract", source_year = 2010, - target_year = 2020 -) + target_year = 2020) # result$crosswalks$step_1 contains one crosswalk ``` -**Multi-step crosswalks** (different geography AND different year): +**Multi-step crosswalks** (when a single, direct crosswalk is not +available): -When both geography and year change, no single crosswalk source provides -this directly. The package automatically plans and fetches a two-step -chain: +Some source year/geography -\> target year/geography specifications +do not have a direct crosswalk. In such cases, two or more crosswalks may be +needed. The package automatically plans and fetches the required +crosswalks: 1. **Step 1 (NHGIS)**: Change year, keep geography constant 2.
**Step 2 (Geocorr)**: Change geography at target year @@ -361,8 +369,7 @@ result <- get_crosswalk( target_geography = "zcta", source_year = 2010, target_year = 2020, - weight = "population" -) + weight = "population") # Two crosswalks are returned names(result$crosswalks) @@ -386,7 +393,7 @@ Each crosswalk contains standardized columns: Additional columns may include `source_year`, `target_year`, `population_2020`, `housing_2020`, and `land_area_sqmi` depending on the -source. +source of the crosswalk. ### Accessing Metadata @@ -414,7 +421,10 @@ names(metadata) ## Using `crosswalk_data()` to Interpolate Data `crosswalk_data()` applies crosswalk weights to transform your data. It -automatically handles multi-step crosswalks. +automatically handles multi-step crosswalks. If you’re in a hurry, you +can omit a call to `get_crosswalk()` and specify the needed crosswalk +parameters to `crosswalk_data()`, which will pass these to +`get_crosswalk()` behind the scenes. ### Column Naming Convention @@ -503,3 +513,12 @@ original developers. > Missouri Census Data Center, University of Missouri. (2022). Geocorr > 2022: Geographic Correspondence Engine. Retrieved from: > + +**For CTData**, a suggested citation (adjust for alternate source +geography): + +> CT Data Collaborative. (2023). 2022 Census Tract Crosswalk. Retrieved +> from: . 
+ +**For this package**, refer here: + diff --git a/vignettes/standardizing-longitudinal-data.Rmd b/vignettes/standardizing-longitudinal-data.Rmd index b0bf4c6..a771a42 100644 --- a/vignettes/standardizing-longitudinal-data.Rmd +++ b/vignettes/standardizing-longitudinal-data.Rmd @@ -73,7 +73,8 @@ names(hmda_data) = metadata$year %>% as.character() Let's inspect the structure of the data: ```{r inspect-data} -glimpse(hmda_data[["2018"]]) +## just view the first ten columns +glimpse(hmda_data[["2018"]] %>% select(1:10)) ``` ## Step 2: Prepare Data for Crosswalking @@ -84,7 +85,7 @@ variables we want to crosswalk to the appropriate parameter (`count_columns` or but it's easy (and nice practice) to prefix these variables with their unit types ("count" and "median", respectively), and `crosswalk_data()` will crosswalk each appropriately by default. -```{r prepare-data, echo = FALSE} +```{r prepare-data} prepare_hmda <- function(data) { data |> rename_with(.cols = matches("^geo20"), .fn = ~ "source_geoid") |> @@ -127,7 +128,7 @@ fractions of records in our source data do not join to our crosswalk. When this data is effectively lost because it has no associated target geography nor allocation factor assigned to it. -```{r apply-crosswalk, echo = FALSE} +```{r apply-crosswalk} # Years that need crosswalking (2010 vintage) years_to_crosswalk <- c("2018", "2019", "2020", "2021") @@ -158,7 +159,9 @@ hmda_crosswalked |> attr("join_quality") |> pluck("data_geoids_unmatched") |> head(5)) +``` +```{r} ## how many source records are we unable to crosswalk each year, excluding ## those with "X" in their GEOIDs? under 30 each year. hmda_crosswalked |> @@ -178,20 +181,13 @@ unable to accurately compare neighborhood changes over time. Now, we have apples-to-apples measurements for tracts from 2018 through 2023. 
```{r final-summary} -hmda_combined <- bind_rows(hmda_crosswalked) |> - ## data for years that are crosswalked have slightly different/additional columsn - mutate( - geoid = if_else(is.na(geoid), source_geoid, geoid)) |> - select(-c(geography_name, source_geoid, vintage)) |> - arrange(geoid, data_year) |> - mutate( - state = str_sub(geoid, 1, 2), - percent_race_white_purchase = count_race_white_purchase / count_owner_purchase_originations) - ## there's a little bit of variation year-to-year in terms of which tracts have ## reported HMDA data, but for the majority, we have observations in each of the ## six years: -hmda_combined |> +hmda_combined <- bind_rows(hmda_crosswalked) |> + ## data for years that are crosswalked have slightly different/additional columns + mutate( + geoid = if_else(is.na(geoid), source_geoid, geoid)) |> count(geoid) |> count(n) ```