
Vs30 refactor #42

Draft
AndrewRidden-Harper wants to merge 131 commits into master from vs30_refactor


@AndrewRidden-Harper AndrewRidden-Harper commented Jan 27, 2026

This PR is a major refactor of the Vs30 package to improve readability, modularity, and testability. The mathematical expressions are equivalent to those in the original version, but they use clearer variable names and simplified forms. Regression tests confirm that the output from the refactored codebase is consistent with the original version.

Design changes

Centralized configuration

  • All parameters that influence the calculated Vs30 values are defined in a single config.yaml file, rather than being hardcoded across various source files as in the original codebase.
  • Parameters can optionally be overridden via command-line arguments.
  • Configuration is validated at load time using Pydantic.
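
As a rough illustration of the pattern described above (not the package's actual schema), the sketch below validates a handful of settings with Pydantic after merging command-line overrides over the file values. The model name `SpatialConfig`, the helper `load_config`, and the field names `cluster_subsample_step` and `n_proc` are hypothetical; only `max_points` is named in this PR. Parsing of config.yaml itself is omitted, so the file values are written out as a plain dict.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class SpatialConfig(BaseModel):
    """Hypothetical subset of the settings that config.yaml might hold."""

    max_points: int = Field(gt=0)  # cap on observations used per pixel update
    cluster_subsample_step: int = Field(ge=1)  # 1 means "use every observation"
    n_proc: int = Field(default=1, ge=1)  # number of worker processes


def load_config(file_values: dict, cli_overrides: Optional[dict] = None) -> SpatialConfig:
    """Merge CLI overrides over the file values, then validate the result once."""
    merged = {**file_values, **(cli_overrides or {})}
    return SpatialConfig(**merged)


# Values as they might look after parsing config.yaml (parsing step omitted).
file_values = {"max_points": 500, "cluster_subsample_step": 100}

# A command-line override wins over the file value (or the default).
config = load_config(file_values, cli_overrides={"n_proc": 4})

# An invalid override is rejected at load time rather than deep in the pipeline.
try:
    load_config(file_values, cli_overrides={"max_points": 0})
    rejected = False
except ValidationError:
    rejected = True
```

Validating after the merge means a bad command-line value fails in the same way as a bad value in the file.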

Modular CLI

  • A Typer-based CLI (vs30) exposes the full pipeline as well as individual stages (update-categorical-vs30-models, make-initial-vs30-raster, spatial-fit, etc.), making it possible to run or re-run specific steps independently.
  • A new compute-at-locations command computes Vs30 at specific latitude/longitude points without generating full raster grids, which is efficient for querying a small number of sites.

Input Vs30 data

  • The original version had multiple modes of operation that selected which observed Vs30 dataset to use (e.g., "original" combined three hardcoded sources with dataset-specific filtering and downsampling; "cpt" loaded CPT data), with different processing and Bayesian update paths for each mode.
  • In the refactored version, data preparation and filtering are the user's responsibility. All observed Vs30 values that are passed in are used directly.
  • Observations are provided in one or both of two categories:
    • Clustered observations (typically dense CPT-inferred Vs30 values) — processed with DBSCAN clustering so that spatially grouped measurements are not over-weighted in the Bayesian update.
    • Independent observations — treated as individual measurements.
  • Both kinds of datasets can be used in the same run, with at least one being required.
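
To make the over-weighting concern concrete, here is a minimal sketch of one possible down-weighting scheme, given cluster labels in the DBSCAN convention (-1 for unclustered points). The PR does not spell out the weighting the package uses, so the equal-total-weight-per-cluster rule and the function name `cluster_weights` are assumptions for illustration only.

```python
import numpy as np


def cluster_weights(labels):
    """Give each cluster a total weight of 1; unclustered points keep weight 1.

    `labels` follows the DBSCAN convention: -1 marks unclustered (noise)
    points and non-negative integers identify clusters.
    """
    labels = np.asarray(labels)
    weights = np.ones(labels.shape, dtype=float)
    clustered = labels >= 0
    if clustered.any():
        # counts[inverse] is the size of the cluster each observation belongs to.
        _, inverse, counts = np.unique(
            labels[clustered], return_inverse=True, return_counts=True
        )
        weights[clustered] = 1.0 / counts[inverse]
    return weights


# Four points in cluster 0, two in cluster 1, and two unclustered points.
labels = np.array([0, 0, 0, 0, 1, 1, -1, -1])
weights = cluster_weights(labels)
```

Under this assumed scheme, a dense group of CPT-inferred values contributes as much as a single independent measurement, however many points it contains.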

Vs30 map generation

  • When spatially adjusting the Vs30 map by fitting multivariate normal (MVN) distributions to observed Vs30 values, the original package looped over all pixels in the raster (processing them in blocks), computing the distance from each pixel to all observations to determine which (if any) observations would affect it.
  • The refactored version reverses the approach: it first loops over observations, using bounding boxes to identify which map pixels will be affected, then loops only over those affected pixels for the more expensive MVN conditioning calculations.
  • When the number of observations is much smaller than the number of pixels in the map, this is substantially faster because the majority of pixels (those far from any observation) are never processed.
  • For clustered observations, the bounding box search uses a subsampled set of observations (by default, every 100th observation within each cluster). This is a good approximation because all observations within a spatial cluster affect nearly the same set of map pixels. Unclustered (isolated) observations are always included in full. The subsampling step is configurable and can be set to 1 to use all observations, at the cost of a slower bounding box search.
  • Separately, the max_points parameter caps the number of nearest observations used for the per-pixel MVN update, limiting the cost of the matrix inversion at each pixel.
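
The observation-first search described above can be sketched roughly as follows. This is an illustrative NumPy version under simplifying assumptions (a regular grid with ascending pixel-centre coordinates, a square box of half-width `max_dist`), not the package's implementation; the function names and the `max_dist` parameter are hypothetical, while `max_points` is the parameter named in this PR.

```python
import numpy as np


def affected_pixel_mask(obs_xy, x_coords, y_coords, max_dist):
    """Mark pixels inside a square box of half-width max_dist around any observation.

    x_coords and y_coords are ascending 1-D arrays of pixel-centre coordinates.
    """
    mask = np.zeros((len(y_coords), len(x_coords)), dtype=bool)
    for x, y in obs_xy:
        # searchsorted converts coordinate bounds to index bounds without
        # computing any pixel-to-observation distances.
        c0 = np.searchsorted(x_coords, x - max_dist, side="left")
        c1 = np.searchsorted(x_coords, x + max_dist, side="right")
        r0 = np.searchsorted(y_coords, y - max_dist, side="left")
        r1 = np.searchsorted(y_coords, y + max_dist, side="right")
        mask[r0:r1, c0:c1] = True
    return mask


def nearest_observations(pixel_xy, obs_xy, max_points):
    """Indices of at most max_points observations closest to one pixel."""
    d2 = np.sum((obs_xy - pixel_xy) ** 2, axis=1)
    if len(d2) <= max_points:
        return np.arange(len(d2))
    # argpartition places the max_points smallest distances first (unsorted).
    return np.argpartition(d2, max_points)[:max_points]


# A 100 x 100 grid with unit spacing and two observations.
x_coords = np.arange(100.0)
y_coords = np.arange(100.0)
obs_xy = np.array([[10.0, 10.0], [80.0, 50.0]])

mask = affected_pixel_mask(obs_xy, x_coords, y_coords, max_dist=5.0)
rows, cols = np.nonzero(mask)  # only these pixels go on to MVN conditioning
n_affected = int(mask.sum())
closest = nearest_observations(np.array([10.0, 10.0]), obs_xy, max_points=1)
```

In this toy case only the pixels inside the two boxes are flagged and the remaining pixels are never visited. Thinning a dense cluster before the search (for example `cluster_xy[::100]`) would shrink the loop over observations while flagging nearly the same pixels.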

Multiprocessing

  • Both the raster pipeline and compute-at-locations support multiprocessing. The spatial adjustment work is divided into chunks of affected pixels distributed across worker processes.
  • BLAS thread oversubscription is managed at runtime using threadpoolctl, rather than requiring environment variables to be set before import.
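
A hedged sketch of how chunked work and runtime BLAS limits can fit together is shown below. It uses `threadpool_limits` from threadpoolctl inside each worker; the function names, the chunking helper, and the stand-in linear solve are all hypothetical and only illustrate the general pattern, not the package's code.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from threadpoolctl import threadpool_limits


def split_into_chunks(n_items, chunk_size):
    """Return (start, stop) index pairs that together cover range(n_items)."""
    return [
        (start, min(start + chunk_size, n_items))
        for start in range(0, n_items, chunk_size)
    ]


def process_chunk(job):
    """Stand-in for the per-pixel update: solve one small linear system per pixel."""
    matrices, vectors = job
    # Limit BLAS to one thread inside this worker so that n_proc workers
    # do not each start a full set of BLAS threads.
    with threadpool_limits(limits=1, user_api="blas"):
        return np.linalg.solve(matrices, vectors[..., None])[..., 0]


def run_parallel(matrices, vectors, n_proc, chunk_size):
    """Distribute chunks of pixels across worker processes and reassemble in order."""
    chunks = split_into_chunks(len(matrices), chunk_size)
    jobs = [(matrices[a:b], vectors[a:b]) for a, b in chunks]
    with ProcessPoolExecutor(max_workers=n_proc) as pool:
        results = list(pool.map(process_chunk, jobs))  # map preserves job order
    return np.concatenate(results)


# Sequential check of the two building blocks on toy data.
chunks = split_into_chunks(10, 4)
toy_matrices = np.broadcast_to(2.0 * np.eye(3), (5, 3, 3)).copy()
toy_vectors = np.ones((5, 3))
toy_solution = process_chunk((toy_matrices, toy_vectors))
```

Because the limit is applied with a context manager at call time, it does not depend on environment variables being set before NumPy is imported. A call such as `run_parallel(toy_matrices, toy_vectors, n_proc=2, chunk_size=2)` would be placed under an `if __name__ == "__main__":` guard in a script.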

Test suite

  • A comprehensive pytest suite covers all modules: configuration, category assignment, Bayesian updates, raster creation, hybrid modifications, spatial adjustment, the CLI, parallelism, and full-pipeline regression tests.


AndrewRidden-Harper commented Feb 25, 2026

I've addressed all suggestions. The main changes are:

  1. All orchestration/pipeline code has been moved into a new module pipeline.py
  2. There are now only three CLI commands: map (make a Vs30 map), points (Vs30 at provided points), and update-priors (Bayesian update of the assumed prior Vs30 for terrain and geology categories)
  3. The GitHub Actions workflows are now in line with the current approach in our other repos (uses uv etc.)
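
For readers unfamiliar with Typer, a three-command layout like the one listed above could look roughly like the sketch below. Only the command names map, points, and update-priors come from this PR; the function bodies and the options (`--config-path`, `--locations-csv`, `--n-proc`) are hypothetical placeholders. The option names are derived by Typer from the parameter names, with no explicit strings.

```python
import typer
from typer.testing import CliRunner

app = typer.Typer(help="Vs30 tools (illustrative sketch only).")


@app.command("map")
def make_map(config_path: str = "config.yaml", n_proc: int = 1):
    """Make a Vs30 map."""
    typer.echo(f"map: config={config_path} n_proc={n_proc}")


@app.command("points")
def points(
    locations_csv: str = typer.Option(..., help="CSV file of points (hypothetical)."),
    n_proc: int = 1,
):
    """Compute Vs30 at the provided points."""
    typer.echo(f"points: locations={locations_csv} n_proc={n_proc}")


@app.command("update-priors")
def update_priors(config_path: str = "config.yaml"):
    """Bayesian update of the prior Vs30 for terrain and geology categories."""
    typer.echo(f"update-priors: config={config_path}")


# Exercise the CLI in-process, without a shell.
runner = CliRunner()
result = runner.invoke(
    app, ["points", "--locations-csv", "sites.csv", "--n-proc", "2"]
)
```

In a real entry point the module would end with a call to `app()` instead of the in-process `CliRunner` check.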

AndrewRidden-Harper and others added 4 commits February 27, 2026 12:20
Typer only supports auto-generated --long-form options OR explicit
custom names, not both. Dropping short flags (-c, -v, -t, -i, -l,
-o, -d) lets Typer derive intuitive --option-names from parameter
names without cluttering the code with redundant explicit strings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Previously each worker process created its own tqdm bar inside
compute_spatial_adjustment_at_points(), causing rapid flickering
between competing progress displays. Now progress is controlled
by the caller via an optional progress_bar parameter.

Sequential path: one smooth bar covering geology + terrain.
Parallel path: splits into many small chunks instead of n_proc
large chunks, with a single parent bar updating as each chunk
completes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

… grid updates

Add Geology/Terrain prefix to all progress bars so users understand why
stages run twice. Use many small chunks for grid spatial adjustment so
the progress bar updates smoothly. Suppress progress bar for single-chunk
bbox search in favor of a simple status message. Split points sequential
path into separate geology and terrain progress bars.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

AndrewRidden-Harper commented

Getting Vs30 values at Cesar's list of locations has proven to be a very useful stress test. Actually using the CLI tool revealed that the CLI flags were confusing and the progress bars unhelpful, so I've improved both. Unfortunately (or perhaps fortunately), I also learned that when computing Vs30 at a list of points (rather than on a grid), the coastal distance adjustment was not being applied properly, so that will probably take a few days to fix. I'll mark this PR as a draft until it's fixed.

@AndrewRidden-Harper AndrewRidden-Harper marked this pull request as draft February 27, 2026 08:11

3 participants