
Conversation

@lbesnard (Collaborator) commented Aug 19, 2025

Draft PR to showcase how CSIRO could create Parquet dataset to be ingested by AODN

@codecov-commenter commented Aug 19, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 66.81%. Comparing base (9e1e881) to head (9542303).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #193   +/-   ##
=======================================
  Coverage   66.81%   66.81%           
=======================================
  Files          29       29           
  Lines        4752     4752           
=======================================
  Hits         3175     3175           
  Misses       1577     1577           

☔ View full report in Codecov by Sentry.


@lbesnard (Collaborator, Author) commented Aug 19, 2025

@Rosspet
Steps:

  1. Convert the input CSIRO NetCDF files with the script
    csiro_underway_netcdf_conversion.py. It converts them to NetCDF4 and creates a TIME variable from the global attributes (an illustrative sketch is included after this list). Run it over every file, e.g. for f in `fd . -t f | grep nc`; do ./csiro_underway_netcdf_conversion.py $f; done
  2. Upload the modified NetCDF4 files to the local input MinIO, under s3://[INPUT_BUCKET_NAME]/[DATASET_NAME_LOCATION]/ (an upload sketch is included after this list). More info
  3. Create the dataset configuration, using one NetCDF file as the point of truth: cloud_optimised_create_dataset_config -f s3://[INPUT_BUCKET_NAME]/[DATASET_NAME_LOCATION]/in2017_v02uwy.nc -c parquet -d vessel_underway_csiro --s3fs-opts '{"key": "minioadmin", "secret": "minioadmin", "client_kwargs": {"endpoint_url": "http://localhost:9000"}}'. More info
  4. The generated config file should be quite similar to the one in this PR; however, the one in this PR is more complete and works
  5. What needs to be modified in the config file:
    • add_variables (variables created from global attributes, for example voyage and ship_name; see doc)
    • global attributes (acknowledgement, citations, project, author ...)
    • partition keys (similar to indexes in a database; they are ordered, see doc)
    • run_settings, with the input and output buckets, the number of files to process at once ...
    • drop_variables: some variables were multidimensional and were being flattened, causing memory to explode; these variables were not needed
  6. Run poetry install --with dev and restart the shell if needed
  7. Run cloud_optimised_vessel_underway_csiro. This runs the processing locally
  8. Check the associated notebook, which also points to the local MinIO bucket with the snippet below (a quick pandas check against the output bucket is also sketched after this list):
aodn = GetAodn(
    bucket_name="aodn-cloud-optimised",
    prefix="",
    s3_fs_opts={
        "key": "minioadmin",
        "secret": "minioadmin",
        "client_kwargs": {
            "endpoint_url": "http://127.0.0.1:9000"
        },
    },
)
  9. Speak to @mhidas and @craigrose from the pipeline core team to see how to implement the library so it processes one file at a time
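
For step 1, the conversion script in this PR is the point of truth. Purely as an illustration of the idea (write NetCDF4 and derive a TIME variable from global attributes), a minimal xarray sketch could look like this; the record dimension name and the time_coverage_* attribute names are assumptions, not necessarily what the real script uses:

import sys

import pandas as pd
import xarray as xr


def convert(path: str) -> None:
    ds = xr.open_dataset(path)
    n_obs = ds.sizes["obs"]  # assumed name of the record dimension
    # Derive a TIME coordinate from (assumed) global attributes.
    time = pd.date_range(
        ds.attrs["time_coverage_start"],
        ds.attrs["time_coverage_end"],
        periods=n_obs,
    )
    ds = ds.assign_coords(TIME=("obs", time))
    # Write the result back out as NetCDF4.
    ds.to_netcdf(path.replace(".nc", "_nc4.nc"), format="NETCDF4")


if __name__ == "__main__":
    convert(sys.argv[1])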
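
For step 2, any S3 client works against the local MinIO. One way, reusing the same s3fs options as above, is sketched here; the bucket and prefix values are placeholders for [INPUT_BUCKET_NAME] and [DATASET_NAME_LOCATION]:

import glob

import s3fs

# Placeholders: substitute the real [INPUT_BUCKET_NAME] and [DATASET_NAME_LOCATION].
INPUT_BUCKET = "imos-data"
DATASET_LOCATION = "vessel_underway_csiro"

fs = s3fs.S3FileSystem(
    key="minioadmin",
    secret="minioadmin",
    client_kwargs={"endpoint_url": "http://localhost:9000"},
)

# Upload every converted NetCDF4 file in the current directory.
for path in glob.glob("*.nc"):
    fs.put(path, f"{INPUT_BUCKET}/{DATASET_LOCATION}/{path}")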
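
For steps 7-8, besides the notebook, a quick sanity check of the Parquet output straight from the local MinIO bucket can be done with pandas. The exact prefix of the Parquet dataset under aodn-cloud-optimised is an assumption here; check the run_settings in the config for the real one:

import pandas as pd

# Assumed output location inside the bucket used by the notebook above.
df = pd.read_parquet(
    "s3://aodn-cloud-optimised/vessel_underway_csiro.parquet/",
    storage_options={
        "key": "minioadmin",
        "secret": "minioadmin",
        "client_kwargs": {"endpoint_url": "http://127.0.0.1:9000"},
    },
)
print(df.head())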

I won't merge this PR, but this is a good working example

@lbesnard self-assigned this Aug 19, 2025