Test and document using GDAL virtual file system handlers by emlys · Pull Request #444 · natcap/pygeoprocessing

emlys · 2025-07-16T16:07:59Z

Fixes #441

Add a few tests using vsicurl (testing against new test data included in this repo, so that we can reference it at the github hosted URL)
Add a section to the documentation about using virtual file system handlers
Update all docstrings to indicate that GDAL vsi paths are supported for inputs
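
For readers unfamiliar with the handlers, a GDAL virtual file system path is an ordinary path with a handler prefix such as `/vsicurl/` in front of it. The sketch below only shows the string construction; `to_vsicurl` is a hypothetical helper name and the URL is a placeholder, neither is part of this PR.

```python
def to_vsicurl(url):
    """Prefix a remote URL with GDAL's /vsicurl/ handler (hypothetical helper)."""
    # /vsicurl/ reads files served over HTTP(S) or FTP.
    if not url.startswith(('http://', 'https://', 'ftp://')):
        raise ValueError(f'expected an http(s) or ftp URL, got {url!r}')
    return '/vsicurl/' + url


# Placeholder URL, for illustration only.
path = to_vsicurl('https://example.com/data/small_raster.tif')
print(path)

# With GDAL and pygeoprocessing installed, the resulting string can be passed
# wherever an input raster path is accepted, e.g. (not run here):
#   pygeoprocessing.get_raster_info(path)
```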

dcdenu4

Thanks @emlys, I found a couple of other places where we might want to add the vsi blurb too. One I couldn't comment on directly is under the build_overviews function in geoprocessing.py, line 4670.

While thinking about testing, a few questions came to mind, and I'm curious whether you've come across them or thought about them.

Where does GDAL save the needed downloaded data from URLs and how / when does GDAL garbage collect it? Is that something we need to be mindful of?
I'd love to have some more comprehensive testing that could help us spot issues further in advance, but I know we don't want to burden our GHA runners all the time. It might be nice to start thinking about having a more comprehensive test runner / suite that we can manually run every once in a while or that runs periodically during off hours.

src/pygeoprocessing/geoprocessing.py

dcdenu4 · 2025-07-18T17:11:50Z

tests/test_geoprocessing.py

+            lambda block: block * 2, [input_path], target_path)
+        numpy.testing.assert_array_equal(
+            pygeoprocessing.raster_to_numpy_array(target_path),
+            numpy.ones((9, 9), dtype=numpy.float32))


This got me thinking that it'd be nice to document what test_data/raster.tif is and its properties. A more descriptive name for that test file could help at the very least. As it stands, if I wanted to use it in another test function I wouldn't know what values to expect.

Hmm, I'm not sure exactly how to handle that. I renamed the files to small_raster.tif and small_vector.gpkg so hopefully that helps a bit? But I don't want to be too specific e.g. small_raster_for_vsi_tests.tif would then not make sense if we did want to use it in another test.

Would it be overkill to have a small README in the folder that describes the data? I'm mostly thinking that I don't know what the values are and therefore what I'd expect for some kind of output. Maybe we can address this in the future if we do more of this.

dcdenu4 · 2025-07-18T17:15:06Z

tests/test_geoprocessing.py

+    def test_raster_map_vsicurl(self):
+        """PGP: raster_map with vsicurl."""
+        # Access test data hosted on github
+        input_path = '/vsicurl/https://raw.githubusercontent.com/emlys/pygeoprocessing/feature/441/tests/test_data/raster.tif'


In playing around with this, do you have a sense for where GDAL downloads the data it needs from a hosted file? Are we sure we're cleaning that up in our tests?

I don't know that it downloads it to anywhere - it may just be in memory. I wouldn't expect it to leave any extra files around afterward


emlys · 2025-07-25T18:47:37Z

Thanks @dcdenu4! Here are my thoughts -

Where does GDAL save the needed downloaded data from URLs and how / when does GDAL garbage collect it? Is that something we need to be mindful of?

In the docs I do not see any mention of downloaded data being permanently stored. I think it's safe to assume that this is abstracted away by the drivers. The documentation suggests that everything is cached and there is a reasonable limit on the cache size:

Partial downloads (requires the HTTP server to support random reading) are done with a 16 KB granularity by default. Starting with GDAL 2.3, the chunk size can be configured with the CPL_VSIL_CURL_CHUNK_SIZE configuration option, with a value in bytes. If the driver detects sequential reading, it will progressively increase the chunk size up to 128 times CPL_VSIL_CURL_CHUNK_SIZE (so 2 MB by default) to improve download performance.
In addition, a global least-recently-used cache of 16 MB shared among all downloaded content is used, and content in it may be reused after a file handle has been closed and reopen, during the life-time of the process or until VSICurlClearCache() is called. Starting with GDAL 2.3, the size of this global LRU cache can be modified by setting the configuration option CPL_VSIL_CURL_CACHE_SIZE (in bytes).
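
To make the numbers in the quoted documentation concrete, here is a minimal sketch of the arithmetic, plus one possible way to set the two options from Python. `configure_vsicurl` is a hypothetical helper, not something in pygeoprocessing, and it assumes the options are set before the first `/vsicurl/` read so that they take effect.

```python
DEFAULT_CHUNK_SIZE = 16 * 1024        # 16 KB, per the quoted GDAL docs
MAX_CHUNK_MULTIPLIER = 128            # sequential reads grow up to 128x
DEFAULT_CACHE_SIZE = 16 * 1024 ** 2   # 16 MB global LRU cache


def max_chunk_size(chunk_size=DEFAULT_CHUNK_SIZE):
    """Largest chunk GDAL grows to when it detects sequential reading."""
    return MAX_CHUNK_MULTIPLIER * chunk_size


print(max_chunk_size() // 1024 ** 2)  # prints 2 (MB), matching the docs


def configure_vsicurl(chunk_size=DEFAULT_CHUNK_SIZE,
                      cache_size=DEFAULT_CACHE_SIZE):
    """Set both options (in bytes), then clear the cache.

    Hypothetical helper; requires the GDAL Python bindings.
    """
    # Imported lazily so the arithmetic above runs without GDAL installed.
    from osgeo import gdal
    gdal.SetConfigOption('CPL_VSIL_CURL_CHUNK_SIZE', str(chunk_size))
    gdal.SetConfigOption('CPL_VSIL_CURL_CACHE_SIZE', str(cache_size))
    gdal.VSICurlClearCache()
```

Since the cache lives in process memory and is bounded, a test that reads over `/vsicurl/` should not need any file cleanup of its own, which is consistent with the reading of the docs above.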

I'd love to have some more comprehensive testing that could help us spot issues further in advance, but I know we don't want to burden our GHA runners all the time. It might be nice to start thinking about having a more comprehensive test runner / suite that we can manually run every once in a while or that runs periodically during off hours.

I think this could be addressed separately... I don't want to fall into the trap of testing GDAL functionality rather than pygeoprocessing functionality. VSI handlers are new to us, but not new to GDAL, so I think we can assume that they work as described and not need to write our own comprehensive tests using that feature.

I also tested raster_map locally on a huge global dataset from the data hub. It worked as expected, and I observed that memory use did not climb during the run.
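
One way to make a memory observation like this repeatable is to compare the process's peak resident set size before and after a call. This is only a sketch under stated assumptions: it is not how the check above was done, the helper names are hypothetical, and the `resource` module is Unix-only.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process so far, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 ** 2 if sys.platform == 'darwin' else 1024
    return peak / divisor


def peak_rss_growth_mb(func, *args, **kwargs):
    """Run func and return how much the peak RSS grew while it ran."""
    before = peak_rss_mb()
    func(*args, **kwargs)
    return peak_rss_mb() - before


# Example with a stand-in workload; a real check would wrap the
# raster_map call on the remote dataset instead.
growth = peak_rss_growth_mb(lambda: sum(range(100_000)))
print(f'peak RSS grew by {growth:.1f} MB')
```

Because this measures the whole process, it also captures memory allocated inside GDAL's C library, which Python-level tools such as `tracemalloc` would not see.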

dcdenu4

Thanks @emlys for taking the time to reply to my questions.

emlys added 5 commits July 16, 2025 09:07

add test using VSICurl with get_raster_info natcap#441 (ea8dd88)
add test and test data for get_vector_info with vsicurl natcap#441 (3fea440)
add documentation on GDAL vsi to api docs (db58400)
add test using vsicurl with raster_map natcap#441 (7e61c5f)
update docstrings to mention gdal VSI paths for all inputs natcap#441 (24c346c)

emlys marked this pull request as ready for review July 17, 2025 16:56

emlys requested a review from dcdenu4 July 17, 2025 16:56

emlys assigned emlys and dcdenu4 Jul 17, 2025

dcdenu4 requested changes Jul 18, 2025


emlys and others added 3 commits July 25, 2025 10:09

Update src/pygeoprocessing/geoprocessing.py (43ae3f6), Co-authored-by: Doug <dcdenu4@gmail.com>
add vsi blurb to a few more docstrings natcap#441 (4deabcd)
rename test data files (68706f7)

emlys requested a review from dcdenu4 July 25, 2025 18:49

dcdenu4 approved these changes Jul 28, 2025


dcdenu4 merged commit 27ccf7a into natcap:main Jul 28, 2025
93 checks passed


dcdenu4 merged 8 commits into natcap:main from emlys:feature/441
