Add velocity model compression#49

Open
lispandfound wants to merge 7 commits into main from vm_compression

Conversation

@lispandfound
Contributor

@lispandfound lispandfound commented Feb 23, 2026

This script applies the lessons from the successful xyts compression stage to the velocity model outputs, allowing us to save all velocity models in a compressed format for archival storage at high Rho/Vp/Vs resolution, with compression ratios above 45x possible.

I picked two reasonably small velocity models (9GB and 16GB) plus Cesar's Darfield VM (98GB) and compressed them with this code:

| VM Size | Compressed Size | Ratio | Vp resolution (km/s) | Vs resolution (km/s) | Rho resolution (g/cm³) |
|---------|-----------------|-------|----------------------|----------------------|------------------------|
| 9GB     | 225M            | 45x   | 0.03                 | 0.02                 | 0.007                  |
| 16GB    | 343M            | 48x   | 0.03                 | 0.02                 | 0.007                  |
| 98GB    | 1.2GB           | 81x   | 0.03                 | 0.02                 | 0.007                  |

Assuming this ratio scales to the larger models (which does appear to be true of the xyts compression code), we can expect to represent the full AlpineF2K VM from the Cybershake 100m run in less than 3GB at resolutions similar to those above. A resolution of 20-30 m/s is more than enough to produce spatial plots and debug VM issues when a result looks bad.

The Vp, Vs, and Rho resolutions are dynamically calculated using two variables:

  1. The compression datatype (uint8), which sets the degree of the quantisation,
  2. The range of the Vp, Vs, and Rho for the velocity model (inferred from the VM).

The range of each quantity divided by 255 then sets the scale for the quantisation. By aggressively picking a small datatype for the final model, we further reduce redundant bits in the output. I deemed the extra $\sim 2^8$ factor of resolution not worth at least doubling the size of the compressed velocity model, given this is used for exploratory work, but I'm open to changing this.
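For illustration, the quantisation scheme described above can be sketched as follows. This is a minimal sketch, not the PR's actual `compress_quality` implementation; the function names and toy values here are mine:

```python
import numpy as np

def quantise(values: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Quantise float data to uint8 with a scale inferred from the data range."""
    vmin, vmax = float(values.min()), float(values.max())
    scale = (vmax - vmin) / 255  # one quantisation step == the stated resolution
    quantised = np.round((values - vmin) / scale).astype(np.uint8)
    return quantised, vmin, scale

def dequantise(quantised: np.ndarray, offset: float, scale: float) -> np.ndarray:
    """Invert the quantisation (accurate to within half a step)."""
    return quantised.astype(np.float32) * scale + offset

vp = np.array([1.5, 3.2, 8.0], dtype=np.float32)  # toy Vp values in km/s
q, offset, scale = quantise(vp)
recovered = dequantise(q, offset, scale)  # within half a step of the originals
```

For a Vp range of 6.5 km/s this gives a step of about 0.025 km/s, matching the resolutions in the table above.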

The compressed model is saved as an xarray dataset, which provides compatibility with the rest of the workflow and transparently handles decompression without researchers writing any additional code, i.e.,

>>> import xarray as xr
>>> import xoak # provides lat/lon indexing using KD-Trees
>>> dset = xr.open_dataset('test_compress_large.h5') # The 16GB model.
>>> dset
<xarray.Dataset> Size: 17GB
Dimensions:  (z: 330, y: 1695, x: 2388)
Coordinates:
  * z        (z) int64 3kB 0 1 2 3 4 5 6 7 8 ... 322 323 324 325 326 327 328 329
  * y        (y) int64 14kB 0 1 2 3 4 5 6 ... 1688 1689 1690 1691 1692 1693 1694
  * x        (x) int64 19kB 0 1 2 3 4 5 6 ... 2381 2382 2383 2384 2385 2386 2387
    lon      (x, y) float64 32MB ...
    lat      (x, y) float64 32MB ...
    depth    (z) float64 3kB ...
Data variables:
    vp       (z, y, x) float32 5GB ... # NOTE: the real dtype is uint8 and the real size is ~100MB; xarray transparently represents it as float32
    vs       (z, y, x) float32 5GB ...
    rho      (z, y, x) float32 5GB ...
    inbasin  (z, y, x) uint8 1GB ...
Attributes: (12/18)
    call_type:      GENERATE_VELOCITY_MOD
    config_string:  CALL_TYPE=GENERATE_VELOCITY_MOD\nMODEL_VERSION=2.09\nOUTP...
    extent_x:       238.76803023221174
    extent_y:       169.53824916087385
    extent_zmax:    33.0
    extent_zmin:    0.0
    ...             ...
    nz:             330
    origin_lat:     -41.29451828364293
    origin_lon:     174.1625010466834
    origin_rot:     326.6863186192412
    output_dir:     out
    topo_type:      SQUASHED_TAPERED
>>> ds = dset.set_xindex(['lat', 'lon'], xr.indexes.NDPointIndex) # Set an ND-point index for lat and lon to make querying simple
>>> ds.sel(lat=-43.5381, lon=172.6474, method='nearest').vs.values # VS profile at the nearest VM point to CCCC
array([0.69843173, 1.9431396 , 2.2137284 , 2.3039246 , 2.3039246 ,
       2.3219638 , 2.3219638 , 2.340003  , 2.340003  , 2.3580422 ,
       4.2882414 , 4.2882414 , 4.3062806 , 4.3062806 , 4.32432   ,
       ...
       4.32432   , 4.32432   , 4.3423595 , 4.3423595 , 4.3603983 ,
       4.3603983 , 4.3603983 , 4.378438  , 4.378438  , 4.3964767 ],
      dtype=float32)

This works without having to decompress explicitly, making it easy for researchers to extract Vs profiles from the actual model they simulated with.
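The PR doesn't show here exactly how the transparent decompression is wired up, but one standard mechanism xarray supports is CF-convention `scale_factor`/`add_offset` decoding, where the integer data plus two attributes are rescaled to floats on read. A toy sketch assuming that convention (the values are mine, not from the PR):

```python
import numpy as np
import xarray as xr

# vp stored as uint8 alongside CF-style scale_factor/add_offset attributes;
# xr.decode_cf re-applies them, so readers see floats without extra code.
raw = xr.Variable(
    "z",
    np.array([0, 85, 170, 255], dtype=np.uint8),
    attrs={"scale_factor": 6.5 / 255, "add_offset": 1.5},
)
decoded = xr.decode_cf(xr.Dataset({"vp": raw}))
print(decoded.vp.values)  # floats spanning 1.5 to 8.0
```

This decoding happens automatically in `xr.open_dataset` (via `decode_cf=True`, the default), which is why no explicit decompression step is needed.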

@gemini-code-assist
Contributor

Summary of Changes

Hello @lispandfound, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust solution for compressing large velocity models, drastically reducing their storage footprint without compromising the resolution needed for exploratory analysis. By leveraging quantization and xarray's capabilities, the changes ensure that these models remain easily accessible and usable within the existing data ecosystem, providing substantial benefits for archival storage and data management.

Highlights

  • Velocity Model Compression: Implemented a new script to compress velocity model outputs, achieving significant storage reductions (45x-48x compression ratios) for archival purposes.
  • Quantization Method: Utilized a quantization method that converts vp, vs, and rho data to uint8 based on dynamically calculated resolutions, balancing precision with file size.
  • Xarray Integration: Compressed models are saved as xarray datasets, which transparently handle decompression and maintain compatibility with existing workflows, simplifying data access for researchers.
  • Command-Line Tool: Introduced a new command-line tool compress-vm for easy execution of the compression process with configurable compression levels and chunking options.
Changelog
  • pyproject.toml
    • Added xarray[io] as a new project dependency.
    • Registered compress-vm as a new command-line entry point.
  • velocity_modelling/tools/compress_vm.py
    • Added a new script compress_vm.py for velocity model compression.
    • Implemented get_extrema function to determine data ranges for quantization.
    • Developed compress_quality function to quantize vp, vs, and rho data to uint8.
    • Included read_inbasin function to handle inbasin data arrays.
    • Created compressed_vm_as_dataset to convert HDF5 velocity models into compressed xarray datasets.
    • Defined a compress_vm Typer command with options for compression level, chunking, and shuffling.
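The exact flags of the new `compress-vm` entry point aren't shown in this conversation, so the invocation below is a hypothetical sketch only (all flag names are assumed; check `compress-vm --help` for the real interface):

```
# Hypothetical usage -- flag names assumed, not taken from the PR.
compress-vm velocity_model.h5 compressed_vm.h5 \
    --compression-level 9 \
    --chunk-size 64 \
    --shuffle
```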


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a new utility to compress velocity models for archival storage, leveraging xarray and h5py. This is a valuable addition for managing large datasets efficiently. The implementation correctly handles quantization and uses appropriate compression settings. I've identified a few areas for improvement regarding error handling, attribute management, and potential performance optimizations in the get_extrema function.


@claudio525 claudio525 left a comment


Just some nitpicking, looks good!

claudio525 previously approved these changes Feb 24, 2026