Merged
3 changes: 3 additions & 0 deletions .Rbuildignore
@@ -1,4 +1,6 @@
^renv$
^renv/
^\.renv$
^renv\.lock$
^climateapi\.Rproj$
^\.Rproj\.user$
@@ -8,3 +10,4 @@
^docs$
^pkgdown$
^\.github$
^temporary-scripts$
1 change: 0 additions & 1 deletion .Rprofile

This file was deleted.

27 changes: 27 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,27 @@
# climateapi Package Development Notes

## Performance-Critical Functions

The following functions rely on large input datasets and are slow. Speed optimizations using `duckplyr`, `arrow` (with parquet-formatted data), and `tidytable` are critical for these functions (a sketch follows the list):

- `get_ihp_registrations()` - IHP registration data can be very large (millions of records)
- `get_nfip_policies()` - NFIP policy data exceeds 80 million records nationally
- `get_nfip_claims()` - NFIP claims data exceeds 2 million records
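
A minimal sketch of the intended approach, assuming a local parquet extract; the file path and the `state`/`year_of_loss` columns are placeholders rather than the datasets' actual schema:

```r
library(dplyr)

# arrow::open_dataset() scans the parquet file lazily; the filter is pushed
# down so only matching rows ever reach memory at collect().
claims <- arrow::open_dataset("nfip_claims.parquet") |>
  filter(state == "LA", year_of_loss >= 2015) |>
  collect()
```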

### Testing Strategy for Large-Data Functions

Tests for these functions load data once at the top of the test file and reuse that object for all success tests. This avoids repeated I/O during test runs. Validation tests (expected to fail) call the function directly without using the cached data object.
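
A sketch of that layout; the arguments and the `geoid_county` check are illustrative, and the invalid-input call assumes the function validates its `api` argument:

```r
# Loaded once, reused by all success tests below.
ihp <- get_ihp_registrations(api = FALSE)

test_that("get_ihp_registrations() returns county identifiers", {
  expect_true("geoid_county" %in% colnames(ihp))
})

test_that("get_ihp_registrations() validates inputs", {
  # Validation tests call the function directly rather than reusing `ihp`.
  expect_error(get_ihp_registrations(api = "not-a-logical"))
})
```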

### Performance Considerations

When modifying these functions (see the sketch after this list):
- Prefer `arrow::read_parquet()` over CSV reads
- Use `tidytable` or `dtplyr` for grouped operations on large data
- Avoid loading full datasets into memory when filtering is possible
- Consider chunked processing for extremely large files
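
For the grouped-operation bullet, one way this can look with `dtplyr`; the file path and column names are placeholders:

```r
library(dplyr)

claims <- arrow::read_parquet("nfip_claims.parquet")

# data.table performs the grouping; as_tibble() materializes the result.
claims_by_county <- dtplyr::lazy_dt(claims) |>
  group_by(county_fips) |>
  summarise(n_claims = n(), total_paid = sum(amount_paid, na.rm = TRUE)) |>
  as_tibble()
```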

## Testing Philosophy

**Do not create skip functions for unavailable dependencies.** If a test requires a package (like `tidycensus`) or a resource (like Box), that dependency should be available when tests run. If something is missing, that's a real problem to fix, not one to work around with skip logic.

The only acceptable skip pattern is for tests that require external data sources that legitimately may not be configured in all environments (e.g., Box path for large data files). Even then, the validation and signature tests should still run.
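
A sketch of that one acceptable pattern — the `BOX_DATA_PATH` environment variable and the `file_path` argument are assumptions, not taken from the package:

```r
test_that("get_nfip_policies() reads the Box extract", {
  box_path <- Sys.getenv("BOX_DATA_PATH")
  skip_if(box_path == "", "Box path not configured in this environment")

  policies <- get_nfip_policies(file_path = file.path(box_path, "nfip_policies.parquet"))
  expect_s3_class(policies, "data.frame")
})
```

Validation and signature tests for the same function live in separate `test_that()` blocks without the skip, so they run in every environment.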
3 changes: 2 additions & 1 deletion DESCRIPTION
@@ -60,11 +60,12 @@ Remotes:
UI-Research/urbnindicators,
UrbanInstitute/urbnthemes
URL: https://ui-research.github.io/climateapi/
Suggests:
Suggests:
knitr,
qualtRics,
rmarkdown,
testthat (>= 3.0.0),
tidyverse
Config/testthat/edition: 3
VignetteBuilder: knitr
Config/testthat/edition: 3
7 changes: 7 additions & 0 deletions R/CLAUDE.md
@@ -0,0 +1,7 @@
<claude-mem-context>
# Recent Activity

<!-- This section is auto-generated by claude-mem. Edit content outside the tags. -->

*No recent activity*
</claude-mem-context>
100 changes: 74 additions & 26 deletions R/cache_it.R
@@ -10,30 +10,35 @@
#' case, file_name must be provided.
#' @param file_name File name (without extension). Optional when object is provided
#' (uses object's name). Required when object is missing and reading from cache.
#' @param path Directory path where the file should be saved/read. Defaults to /data.
#' If the path does not exist, the user will be prompted to create it (in
#' interactive sessions) or an error will be thrown (in non-interactive sessions).
#' Must not contain path separators or invalid filename characters.
#' @param path Directory path where the file should be saved/read. Defaults to
#' current directory ("."). If the path does not exist, the user will be prompted
#' to create it (in interactive sessions) or an error will be thrown (in
#' non-interactive sessions).
#' @param read Logical or character. TRUE by default.
#' - TRUE: Find and read the most recent cached version based on datestamp.
#' - FALSE: Skip reading, always write a new cached file
#' - Character: Read the specific file with this exact filename (including extension).
#' Defaults to TRUE.
#' @param keep_n Integer. Maximum number of cached versions to keep. When writing
#' a new file, older versions beyond this limit are deleted. Defaults to 5.
#' Set to NULL or Inf to keep all versions.
#'
#' @return The object that was cached (either written or read)
#'
#' @examples
#' \dontrun{
#' ## Note: datestamps in filenames are illustrative; user results will
#' ## vary depending on the the date at runtime
#' ## vary depending on the date at runtime
#'
#' # Regular data frames
#' my_data <- tibble(x = 1:10, y = letters[1:10])
#'
#' # Cache with automatic naming and datestamp
#' cache_it(my_data) # Creates: my_data_2025_12_07.parquet
#' # Cache with automatic naming and datestamp (writes to current directory)
#' cache_it(my_data) # Creates: ./my_data_2025_12_07.parquet
#'
#' # Cache with custom filename
#' cache_it(my_data, file_name = "custom_name")
#' # Cache with custom filename and path
#' cache_it(my_data, file_name = "custom_name", path = "data")
#'
#' # Read most recent cached version if exists, otherwise write
#' cached_data <- cache_it(my_data, read = TRUE)
@@ -56,9 +61,15 @@
#' # Read specific file when object doesn't exist
#' old_data <- cache_it(read = "my_data_2025_12_01.parquet")
#'
#' # Keep only the 3 most recent cached versions
#' cache_it(my_data, keep_n = 3)
#'
#' # Keep all cached versions (no cleanup)
#' cache_it(my_data, keep_n = NULL)
#'
#' # SF objects (automatically uses sfarrow)
#' my_sf <- sf::st_read(system.file("shape/nc.shp", package="sf"))
#' cache_it(my_sf) # Creates: my_sf_2025_12_07_sf.parquet
#' cache_it(my_sf) # Creates: ./my_sf_2025_12_07_sf.parquet
#'
#' # Read most recent sf cached file
#' cached_sf <- cache_it(my_sf, read = TRUE)
@@ -70,11 +81,12 @@
#' @export
cache_it <- function(object,
file_name = NULL,
path = "/data",
read = TRUE) {
path = ".",
read = TRUE,
keep_n = 5) {

# Determine if object parameter was provided
object_provided <- !missing(object)
object_provided <- !missing(object)

# Get the name to use for the file and check if we have an actual object value
is_string_literal <- FALSE
@@ -96,6 +108,12 @@ cache_it <- function(object,
}
}

# Validate file_name: no path separators or invalid filename characters
invalid_chars <- c("/", "\\", ":", "*", "?", "\"", "<", ">", "|")
if (any(stringr::str_detect(file_name, stringr::fixed(invalid_chars)))) {
stop("file_name contains invalid characters. Must not contain: / \\ : * ? \" < > |")
}

# Try to access the actual object value (if provided and not a string literal)
has_object_value <- FALSE
if (object_provided && !is_string_literal) {
@@ -104,6 +122,7 @@ cache_it <- function(object,
force(object)
TRUE
}, error = function(e) {
warning("Object '", file_name, "' could not be evaluated: ", conditionMessage(e))
FALSE
})
}
@@ -125,7 +144,7 @@ cache_it <- function(object,
# Construct full file path
full_path <- file.path(path, full_file_name)

# if the specified `path` does not exist, check with user about creating it
# If the specified `path` does not exist, check with user about creating it
if (!dir.exists(path)) {
if (interactive()) {
create_dir <- readline(prompt = stringr::str_c("The specified `path` does not exist. Do you want to create a directory at ", path, "? Y/N: "))
@@ -139,11 +158,22 @@
}
}

# Escape regex metacharacters in file_name for pattern matching
file_name_escaped <- stringr::str_replace_all(
file_name,
"([\\.\\^\\$\\*\\+\\?\\{\\}\\[\\]\\\\\\|\\(\\)])",
"\\\\\\1"
)

# Helper function to find cached files
find_cached_files <- function() {
pattern <- stringr::str_c("^", file_name_escaped, "_\\d{4}_\\d{2}_\\d{2}(_sf)?\\.parquet$")
list.files(path, pattern = pattern, full.names = TRUE)
}

# Handle reading based on read parameter
if (isTRUE(read)) {
# Find the most recent cached version (both regular and sf files)
pattern <- stringr::str_c("^", file_name, "_\\d{4}_\\d{2}_\\d{2}(_sf)?\\.parquet$")
cached_files <- list.files(path, pattern = pattern, full.names = TRUE)
cached_files <- find_cached_files()

if (length(cached_files) > 0) {
# Extract dates from filenames and find the most recent
@@ -159,17 +189,15 @@
# Check if file is an sf object based on filename
file_is_sf <- stringr::str_detect(most_recent_file, "_sf\\.parquet$")

message(stringr::str_c("Reading most recent cached file: ", basename(most_recent_file),
" (dated ", most_recent_date, ")"))
message("Reading cached file: ", basename(most_recent_file), " (dated ", most_recent_date, ")")

if (file_is_sf) {
return(sfarrow::st_read_parquet(most_recent_file))
} else {
return(arrow::read_parquet(most_recent_file))
}
} else {
message(stringr::str_c("No cached files found for '", file_name,
"'. Writing new file."))
message("No cached files found for '", file_name, "'. Writing new file.")
}

} else if (is.character(read)) {
@@ -180,7 +208,7 @@ cache_it <- function(object,
# Check if file is an sf object based on filename
file_is_sf <- stringr::str_detect(specific_path, "_sf\\.parquet$")

message(stringr::str_c("Reading specified cached file: ", read))
message("Reading cached file: ", read)

if (file_is_sf) {
return(sfarrow::st_read_parquet(specific_path))
@@ -192,8 +220,7 @@
}

} else if (isFALSE(read)) {
# Don't read, proceed to writing
message(stringr::str_c("Skipping read. Writing new cached file."))
message("Writing new cached file.")
}

# Write object to parquet file
@@ -203,10 +230,31 @@

if (is_sf) {
sfarrow::st_write_parquet(obj = object, dsn = full_path)
message(stringr::str_c("Cached sf object to: ", basename(full_path)))
} else {
arrow::write_parquet(object, full_path)
message(stringr::str_c("Cached object to: ", basename(full_path)))
arrow::write_parquet(object, full_path, compression = "snappy")
}
message("Cached to: ", basename(full_path))

# Clean up old versions if keep_n is set
if (!is.null(keep_n) && is.finite(keep_n) && keep_n > 0) {
cached_files <- find_cached_files()

if (length(cached_files) > keep_n) {
file_dates <- cached_files |>
basename() |>
stringr::str_extract("\\d{4}_\\d{2}_\\d{2}") |>
stringr::str_replace_all("_", "-") |>
as.Date()

# Sort by date (oldest first) and identify files to delete
date_order <- order(file_dates)
files_to_delete <- cached_files[date_order[seq_len(length(cached_files) - keep_n)]]

for (f in files_to_delete) {
file.remove(f)
}
message("Removed ", length(files_to_delete), " old cached file(s) (keeping ", keep_n, " most recent).")
}
}

return(object)
8 changes: 7 additions & 1 deletion R/convert_table_text_to_dataframe.R
@@ -9,7 +9,13 @@
#' @param short_document Boolean; default is FALSE. If TRUE, it is assumed that the document is short enough that it can be processed in a single API call. If FALSE and the inputted `text` is a single item, the function throws an error. Note that multi-page documents should be broken into multi-item vectors/lists before being passed to `text`.
#' @param required Boolean; default is FALSE. If TRUE, the LLM will be instructed to return values for all columns. If FALSE, `NULL` values are allowed. Generally, NULL values should be allowed unless you are certain that every value in the inputted text-table has a non-NULL value.
#'
#' @return A list of dataframes, with each item corresponding to one page of the inputted text. The dataframes have the same column names and types as specified in `column_types`. Use `purrr::bind_rows()` to consolidate results into a single dataframe, if needed.
#' @return A list of tibbles, where each list element corresponds to one item (typically one page) in the input `text` vector/list. Each tibble contains:
#' \describe{
#' \item{Structure}{Columns match the names and types defined in `column_types`. Each row represents one record extracted from the table text by the LLM.}
#' \item{NULL values}{When `required = FALSE` (default), columns may contain NULL/NA values if the LLM could not extract a value for that cell.}
#' \item{Empty dataframes}{If the LLM encounters an error processing a page, that list element will be an empty `data.frame()`.}
#' }
#' Use `purrr::list_rbind()` or `dplyr::bind_rows()` to consolidate results into a single dataframe. A warning is issued reminding users to review AI-generated results for accuracy.
#' @export
#' @examples
#' \dontrun{
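To illustrate the consolidation step described in the revised `@return` documentation, a minimal sketch; `page_texts` and the `column_types` format shown are assumptions for illustration, not the package's documented interface:

```r
results <- convert_table_text_to_dataframe(
  text = page_texts,
  column_types = c(project_name = "character", award_amount = "numeric")
)

# One tibble per page; pages that failed return empty data frames and
# contribute no rows to the bind.
combined <- purrr::list_rbind(results)
```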
14 changes: 13 additions & 1 deletion R/estimate_units_per_parcel.R
@@ -123,7 +123,19 @@ benchmark_units_to_census = function(data) {
#' @param zoning A spatial (polygon) zoning dataset.
#' @param acs Optionally, a non-spatial dataset, at the tract level, returned from `urbnindicators::compile_acs_data()`.
#'
#' @returns The inputted parcels datasets with attributes describing estimated unit counts by unit type.
#' @return An `sf` object (point geometry, representing parcel centroids) containing the input parcel data augmented with estimated residential unit information. The returned object includes:
#' \describe{
#' \item{parcel_id}{Character or numeric. The unique parcel identifier from the input data.}
#' \item{tract_geoid}{Character. The 11-digit Census tract GEOID containing the parcel centroid.}
#' \item{jurisdiction}{Character. The jurisdiction name associated with the parcel.}
#' \item{municipality_name}{Character. The municipality name associated with the parcel.}
#' \item{residential_unit_count}{Numeric. The estimated number of residential units on the parcel, benchmarked against ACS estimates at the tract level.}
#' \item{residential_unit_categories}{Factor (ordered). Categorical classification of unit counts: "0", "1", "2", "3-4", "5-9", "10-19", "20-49", "50+".}
#' \item{median_value_improvement_sf}{Numeric. Tract-level median improvement value for single-family parcels.}
#' \item{median_value_improvement_mh}{Numeric. Tract-level median improvement value for manufactured home parcels.}
#' \item{acs_units_*}{Numeric. ACS-reported housing unit counts by units-in-structure category for the tract.}
#' \item{zone, zoned_housing_type, far, setback_*, height_maximum, ...}{Various zoning attributes joined from the zoning dataset.}
#' }
#' @export
estimate_units_per_parcel = function(
structures,
10 changes: 9 additions & 1 deletion R/get_emergency_managerment_performance.R
@@ -3,7 +3,15 @@
#' @param file_path Path to the downloaded dataset on Box.
#' @param api Logical indicating whether to use the OpenFEMA API to retrieve the data. Default is TRUE.
#'
#' @return A data frame containing emergency management performance grant (EMPG) data.
#' @return A tibble containing Emergency Management Performance Grant (EMPG) data with the following columns:
#' \describe{
#' \item{state_name}{Character. The name of the state receiving the grant (renamed from original "state" column).}
#' \item{year_project_start}{Numeric. The year the project started, with corrections applied for known data entry errors in the source data.}
#' \item{state_code}{Character. Two-digit FIPS state code.}
#' \item{state_abbreviation}{Character. Two-letter USPS state abbreviation.}
#' \item{...}{Additional columns from the OpenFEMA EMPG dataset, cleaned via `janitor::clean_names()`.}
#' }
#' Data are filtered to records with `year_project_start > 2012`. A warning is issued noting data completeness concerns for 2024-2025.
#' @export

get_emergency_management_performance = function(
15 changes: 14 additions & 1 deletion R/get_government_finances.R
@@ -2,7 +2,20 @@
#'
#' @param year A four-digit year. The default is 2022.
#'
#' @return A dataframe containing government unit-level expenses for the specified year.
#' @return A tibble containing government unit-level financial data aggregated by unit, with the following columns:
#' \describe{
#' \item{unit_id}{Character. Unique identifier for the government unit.}
#' \item{year_data}{Numeric. The year of the financial data.}
#' \item{amount_thousands}{Numeric. Total expenditure amount in thousands of dollars.}
#' \item{government_type}{Character. Type of government unit: "State", "County", "City", "Township", "Special District", or "School District/Educational Service Agency".}
#' \item{data_quality}{Numeric. Proportion of records that were reported (vs. imputed or from alternative sources), ranging from 0 to 1.}
#' \item{unit_name}{Character. Name of the government unit.}
#' \item{county_name}{Character. County name where the unit is located.}
#' \item{state_code}{Character. Two-digit state FIPS code.}
#' \item{population}{Numeric. Population served by the government unit.}
#' \item{enrollment}{Numeric. Student enrollment (for school districts; NA for other unit types).}
#' \item{amount_per_capita}{Numeric. Expenditure per capita (or per enrolled student for school districts).}
#' }
#' @export

get_government_finances = function(year = 2022) {
14 changes: 13 additions & 1 deletion R/get_ihp_registrations.R
@@ -7,7 +7,19 @@
#' @param api If TRUE, query the API. If FALSE (default), read from disk.
#' @param outpath The path to save the parquet-formatted datafile. Applicable only when `api = FALSE`.
#'
#' @returns A dataframe comprising IHP registrations
#' @return A tibble containing Individual and Households Program (IHP) registration data at the household level, joined to county-level geography. Due to ZIP-to-county crosswalking, records may be duplicated across counties (see warning). The returned object includes:
#' \describe{
#' \item{unique_id}{Character. A UUID uniquely identifying each original IHP registration.}
#' \item{allocation_factor_zcta_to_county}{Numeric. The proportion of the ZCTA's population in this county (0-1). Used to apportion registrations when a ZIP spans multiple counties.}
#' \item{geoid_county}{Character. Five-digit FIPS county code.}
#' \item{zcta_code}{Character. Five-digit ZCTA (ZIP Code Tabulation Area) code.}
#' \item{geoid_tract}{Character. 11-digit Census tract GEOID (may have missingness).}
#' \item{geoid_block_group}{Character. 12-digit Census block group GEOID (may have missingness).}
#' \item{disaster_number}{Character. FEMA disaster number associated with the registration.}
#' \item{amount_individual_housing_program, amount_housing_assistance, amount_other_needs_assistance, amount_rental_assistance, amount_repairs, amount_replacement, amount_personal_property}{Numeric. Various IHP assistance amounts in dollars.}
#' \item{amount_flood_insurance_premium_paid_by_fema}{Numeric. Flood insurance premium paid by FEMA in dollars.}
#' \item{state_name, state_abbreviation, state_code}{Character. State identifiers.}
#' }
#' @export
#'
#' @examples
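Because the ZIP-to-county crosswalk can duplicate a registration across counties, county totals generally need to be weighted by `allocation_factor_zcta_to_county`. A hedged sketch using only the columns documented above (the `api = FALSE` call is illustrative):

```r
ihp <- get_ihp_registrations(api = FALSE)

# Each registration contributes its county share, so the weighted sum
# approximates the number of registrations per county.
county_counts <- ihp |>
  dplyr::group_by(geoid_county) |>
  dplyr::summarise(
    registrations = sum(allocation_factor_zcta_to_county, na.rm = TRUE),
    .groups = "drop"
  )
```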
2 changes: 0 additions & 2 deletions R/get_lodes.R
@@ -157,8 +157,6 @@ rename_lodes_variables = function(.df) {
#' \item{jobs_firm_age}{number of employees by the age of employing firm; only available in 'wac' datasets}
#' \item{jobs_firm_size}{number of employees for a given range in employer size; only available in 'wac' datasets}
#' }
#'
#'
#' @export
get_lodes = function(
lodes_type,