Skip to content

Test and document using GDAL virtual file system handlers#444

Merged
dcdenu4 merged 8 commits intonatcap:mainfrom
emlys:feature/441
Jul 28, 2025
Merged

Test and document using GDAL virtual file system handlers#444
dcdenu4 merged 8 commits intonatcap:mainfrom
emlys:feature/441

Conversation

@emlys
Copy link
Copy Markdown
Member

@emlys emlys commented Jul 16, 2025

Fixes #441

  • Add a few tests using vsicurl (testing against new test data included in this repo, so that we can reference it at the github hosted URL)
  • Add a section to the documentation about using virtual file system handlers
  • Update all docstrings to indicate that GDAL vsi paths are supported for inputs

@emlys emlys marked this pull request as ready for review July 17, 2025 16:56
@emlys emlys requested a review from dcdenu4 July 17, 2025 16:56
Copy link
Copy Markdown
Member

@dcdenu4 dcdenu4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emlys , I found a couple other places we might want to add the vsi blurb too. One I couldn't really comment on is under the def build_overviews functions in geoprocessing.py, line 4670.

When thinking about testing a few questions came to mind that I was curious if you'd come across or had thought about.

  1. Where does GDAL save the needed downloaded data from URLs and how / when does GDAL garbage collect it? Is that something we need to be mindful of?
  2. I'd love to have some more comprehensive testing that could help us spot issues further in advance, but I know we don't want to burden our GHA runners all the time. It might be nice to start thinking about having a more comprehensive test runner / suite that we can manually run every once in awhile or that runs periodically during off hours.

lambda block: block * 2, [input_path], target_path)
numpy.testing.assert_array_equal(
pygeoprocessing.raster_to_numpy_array(target_path),
numpy.ones((9, 9), dtype=numpy.float32))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This got me thinking that it'd be nice to document what test_data/raster.tif is and it's properties. Maybe a more descriptive name for that test file could help at the very least. But if I wanted to use it in another test function I wouldn't know what to expect for values.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I'm not sure exactly how to handle that. I renamed the files to small_raster.tif and small_vector.gpkg so hopefully that helps a bit? But I don't want to be too specific e.g. small_raster_for_vsi_tests.tif would then not make sense if we did want to use it in another test.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be overkill to have a small README in the folder that describes the data? I'm mostly thinking that I don't know what the values are and therefore what I'd expect for some kind of output. Maybe we can address this in the future if we do more of this.

def test_raster_map_vsicurl(self):
"""PGP: raster_map with vsicurl."""
# Access test data hosted on github
input_path = '/vsicurl/https://raw.githubusercontent.com/emlys/pygeoprocessing/feature/441/tests/test_data/raster.tif'
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In playing around with this, do you have a sense for where GDAL downloads the data it needs from a hosted file? Are we sure we're cleaning that up in our tests?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know that it downloads it to anywhere - it may just be in memory. I wouldn't expect it to leave any extra files around afterward

@emlys
Copy link
Copy Markdown
Member Author

emlys commented Jul 25, 2025

Thanks @dcdenu4! Here are my thoughts -

Where does GDAL save the needed downloaded data from URLs and how / when does GDAL garbage collect it? Is that something we need to be mindful of?

In the docs I do not see any mention of downloaded data being permanently stored. I think it's safe to assume that this is abstracted away by the drivers. The documentation suggests that everything is cached and there is a reasonable limit on the cache size:

Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading, it will progressively increase the chunk size up to 128 times CPL_VSIL_CURL_CHUNK_SIZE (so 2 MB by default) to improve download performance.
In addition, a global least-recently-used cache of 16 MB shared among all downloaded content is used, and content in it may be reused after a file handle has been closed and reopen, during the life-time of the process or until VSICurlClearCache() is called. Starting with GDAL 2.3, the size of this global LRU cache can be modified by setting the configuration option CPL_VSIL_CURL_CACHE_SIZE (in bytes).

I'd love to have some more comprehensive testing that could help us spot issues further in advance, but I know we don't want to burden our GHA runners all the time. It might be nice to start thinking about having a more comprehensive test runner / suite that we can manually run every once in awhile or that runs periodically during off hours.

I think this could be addressed separately... I don't want to fall into the trap of testing GDAL functionality rather than pygeoprocessing functionality. VSI handlers are new to us, but not new to GDAL, so I think we can assume that they work as described and not need to write our own comprehensive tests using that feature.

I did also test locally running raster_map on a huge global dataset from the data hub, and it worked as expected and I observed that memory use did not climb during the run time.

@emlys emlys requested a review from dcdenu4 July 25, 2025 18:49
Copy link
Copy Markdown
Member

@dcdenu4 dcdenu4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emlys for taking the time to reply to my questions.

@dcdenu4 dcdenu4 merged commit 27ccf7a into natcap:main Jul 28, 2025
93 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Officially support GDAL virtual file system inputs

2 participants