Add velocity model compression#49

Open
lispandfound wants to merge 7 commits into main from vm_compression

Conversation

@lispandfound
Contributor

@lispandfound lispandfound commented Feb 23, 2026

This script applies the lessons from the successful xyts compression stage to the velocity model outputs, allowing us to save all velocity models in a compressed format for archival storage at high Rho/Vp/Vs resolution, with compression ratios above 45x possible.

I picked two reasonably small velocity models (9GB and 16GB) plus Cesar's Darfield VM (98GB) and compressed them with this code:

| VM Size | Compressed Size | Ratio | Vp resolution (km/s) | Vs resolution (km/s) | Rho resolution (g/cm³) |
|---------|-----------------|-------|----------------------|----------------------|------------------------|
| 9GB     | 225M            | 45x   | 0.03                 | 0.02                 | 0.007                  |
| 16GB    | 343M            | 48x   | 0.03                 | 0.02                 | 0.007                  |
| 98GB    | 1.2GB           | 81x   | 0.03                 | 0.02                 | 0.007                  |

Assuming this ratio scales to the larger models (which does appear to be true of the xyts compression code), we can expect to represent the full AlpineF2K VM from the Cybershake 100m run in less than 3GB at resolutions similar to those above. A resolution of 20-30 m/s is more than enough to produce spatial plots and debug VM issues when a result looks bad.

The Vp, Vs, and Rho resolutions are dynamically calculated using two variables:

  1. The compression datatype (uint8), which sets the degree of the quantisation,
  2. The range of the Vp, Vs, and Rho for the velocity model (inferred from the VM).

The range of each quantity divided by 255 then sets the scale for the quantisation. By aggressively picking a small datatype for the final model, we further reduce redundant bits in the output. I deemed the extra $\sim 2^8$ factor of resolution not worth at least doubling the size of the compressed velocity model, given this is used for exploratory work, but I'm open to changing this.
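For illustration, the quantisation scheme described above can be sketched as follows. This is a minimal sketch, not the PR's actual `compress_quality` implementation; the function names and toy values here are mine:

```python
import numpy as np

def quantise(values: np.ndarray) -> tuple[np.ndarray, float, float]:
    """Quantise float data to uint8 with a scale inferred from the data range."""
    vmin, vmax = float(values.min()), float(values.max())
    scale = (vmax - vmin) / 255  # one quantisation step == the stated resolution
    quantised = np.round((values - vmin) / scale).astype(np.uint8)
    return quantised, vmin, scale

def dequantise(quantised: np.ndarray, offset: float, scale: float) -> np.ndarray:
    """Invert the quantisation (accurate to within half a step)."""
    return quantised.astype(np.float32) * scale + offset

vp = np.array([1.5, 3.2, 8.0], dtype=np.float32)  # toy Vp values in km/s
q, offset, scale = quantise(vp)
recovered = dequantise(q, offset, scale)  # within half a step of the originals
```

For a Vp range of 6.5 km/s this gives a step of about 0.025 km/s, matching the resolutions in the table above.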

The compressed model is saved as an xarray dataset, which provides compatibility with the rest of the workflow and transparently handles decompression without researchers writing any additional code, i.e.,

>>> import xarray as xr
>>> import xoak # provides lat/lon indexing using KD-Trees
>>> dset = xr.open_dataset('test_compress_large.h5') # The 16GB model.
>>> dset
<xarray.Dataset> Size: 17GB
Dimensions:  (z: 330, y: 1695, x: 2388)
Coordinates:
  * z        (z) int64 3kB 0 1 2 3 4 5 6 7 8 ... 322 323 324 325 326 327 328 329
  * y        (y) int64 14kB 0 1 2 3 4 5 6 ... 1688 1689 1690 1691 1692 1693 1694
  * x        (x) int64 19kB 0 1 2 3 4 5 6 ... 2381 2382 2383 2384 2385 2386 2387
    lon      (x, y) float64 32MB ...
    lat      (x, y) float64 32MB ...
    depth    (z) float64 3kB ...
Data variables:
    vp       (z, y, x) float32 5GB ... # NOTE: the real dtype is uint8 and the real size is ~100MB; xarray transparently represents it as float32
    vs       (z, y, x) float32 5GB ...
    rho      (z, y, x) float32 5GB ...
    inbasin  (z, y, x) uint8 1GB ...
Attributes: (12/18)
    call_type:      GENERATE_VELOCITY_MOD
    config_string:  CALL_TYPE=GENERATE_VELOCITY_MOD\nMODEL_VERSION=2.09\nOUTP...
    extent_x:       238.76803023221174
    extent_y:       169.53824916087385
    extent_zmax:    33.0
    extent_zmin:    0.0
    ...             ...
    nz:             330
    origin_lat:     -41.29451828364293
    origin_lon:     174.1625010466834
    origin_rot:     326.6863186192412
    output_dir:     out
    topo_type:      SQUASHED_TAPERED
>>> ds = dset.set_xindex(['lat', 'lon'], xr.indexes.NDPointIndex) # Set an ND-point index for lat and lon to make querying simple
>>> ds.sel(lat=-43.5381, lon=172.6474, method='nearest').vs.values # VS profile at the nearest VM point to CCCC
array([0.69843173, 1.9431396 , 2.2137284 , 2.3039246 , 2.3039246 ,
       2.3219638 , 2.3219638 , 2.340003  , 2.340003  , 2.3580422 ,
       4.2882414 , 4.2882414 , 4.3062806 , 4.3062806 , 4.32432   ,
       ...
       4.32432   , 4.32432   , 4.3423595 , 4.3423595 , 4.3603983 ,
       4.3603983 , 4.3603983 , 4.378438  , 4.378438  , 4.3964767 ],
      dtype=float32)

This works without having to decompress explicitly, making it easy for researchers to extract Vs profiles from the actual model they simulated with.
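The PR doesn't show here exactly how the transparent decompression is wired up, but one standard mechanism xarray supports is CF-convention `scale_factor`/`add_offset` decoding, where the integer data plus two attributes are rescaled to floats on read. A toy sketch assuming that convention (the values are mine, not from the PR):

```python
import numpy as np
import xarray as xr

# vp stored as uint8 alongside CF-style scale_factor/add_offset attributes;
# xr.decode_cf re-applies them, so readers see floats without extra code.
raw = xr.Variable(
    "z",
    np.array([0, 85, 170, 255], dtype=np.uint8),
    attrs={"scale_factor": 6.5 / 255, "add_offset": 1.5},
)
decoded = xr.decode_cf(xr.Dataset({"vp": raw}))
print(decoded.vp.values)  # floats spanning 1.5 to 8.0
```

This decoding happens automatically in `xr.open_dataset` (via `decode_cf=True`, the default), which is why no explicit decompression step is needed.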

@gemini-code-assist
Contributor

Summary of Changes

Hello @lispandfound, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust solution for compressing large velocity models, drastically reducing their storage footprint without compromising the resolution needed for exploratory analysis. By leveraging quantization and xarray's capabilities, the changes ensure that these models remain easily accessible and usable within the existing data ecosystem, providing substantial benefits for archival storage and data management.

Highlights

  • Velocity Model Compression: Implemented a new script to compress velocity model outputs, achieving significant storage reductions (45x-48x compression ratios) for archival purposes.
  • Quantization Method: Utilized a quantization method that converts vp, vs, and rho data to uint8 based on dynamically calculated resolutions, balancing precision with file size.
  • Xarray Integration: Compressed models are saved as xarray datasets, which transparently handle decompression and maintain compatibility with existing workflows, simplifying data access for researchers.
  • Command-Line Tool: Introduced a new command-line tool compress-vm for easy execution of the compression process with configurable compression levels and chunking options.
Changelog
  • pyproject.toml
    • Added xarray[io] as a new project dependency.
    • Registered compress-vm as a new command-line entry point.
  • velocity_modelling/tools/compress_vm.py
    • Added a new script compress_vm.py for velocity model compression.
    • Implemented get_extrema function to determine data ranges for quantization.
    • Developed compress_quality function to quantize vp, vs, and rho data to uint8.
    • Included read_inbasin function to handle inbasin data arrays.
    • Created compressed_vm_as_dataset to convert HDF5 velocity models into compressed xarray datasets.
    • Defined a compress_vm Typer command with options for compression level, chunking, and shuffling.
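The exact flags of the new `compress-vm` entry point aren't shown in this conversation, so the invocation below is a hypothetical sketch only (all flag names are assumed; check `compress-vm --help` for the real interface):

```
# Hypothetical usage -- flag names assumed, not taken from the PR.
compress-vm velocity_model.h5 compressed_vm.h5 \
    --compression-level 9 \
    --chunk-size 64 \
    --shuffle
```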


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request introduces a new utility to compress velocity models for archival storage, leveraging xarray and h5py. This is a valuable addition for managing large datasets efficiently. The implementation correctly handles quantization and uses appropriate compression settings. I've identified a few areas for improvement regarding error handling, attribute management, and potential performance optimizations in the get_extrema function.


@claudio525 claudio525 left a comment


Just some nitpicking, looks good!

claudio525 previously approved these changes Feb 24, 2026