catch a RuntimeError from reproject_vector by davemfish · Pull Request #427 · natcap/pygeoprocessing

davemfish · 2025-04-02T19:18:20Z

Fixes #418

I wasn't sure if/how to synthesize an offending dataset and write a test. I have a shapefile that triggers this RuntimeError, but I didn't create it myself.

I think the OGR behavior varies by driver. I tried to write a geojson vector with the same characters as present in my shapefile, but reprojecting that to a shapefile, OGR already handles it by issuing a warning.

phargogh · 2025-04-02T23:49:29Z

src/pygeoprocessing/geoprocessing.py

+            try:
+                target_feature.SetField(
+                    target_index, base_feature.GetField(base_index))
+            except RuntimeError as runtime_error:
+                LOGGER.debug(
+                    f"Skipping copy field value for feature {feature_index}, "
+                    f"field {layer_dfn.GetFieldDefn(base_index).GetName()} "
+                    f"due to: {runtime_error}")


This makes sense to me, and it also makes sense that this error might be specific to certain vector types that support alternate encodings.

When OGR raises a warning, does it skip the field entirely as we do here?

And, as a possible solution, I see that there is a field.GetFieldAsBinary (link) and field.setField() which takes a bytesteing that I'm wondering if we might be able to use here to bypass the encoding issue. It would only work on string and binary fields. Since you have the data for this, might this work as a solution?

SetFieldAsBinary works. Thanks for that idea.

I thought that might allow me to synthesize some data for a test too. But I could not figure out how to write a field value that would decode into a string (GetField) but then fail to encode('utf-8') back during SetField.

I tried something like this, based on the data I have, and it's likely encoding, but could never raise the exception.

field_string = 'República Dominicana (la)' field_bytes = field_string.encode('MacRoman') feature.SetFieldBinary(field_name, field_bytes) layer.CreateFeature(feature) ... try: pygeoprocessing.reproject_vector( base_vector_path=base_vector_path, target_projection_wkt=base_srs.ExportToWkt(), target_path=target_vector_path) except Exception as e: self.fail(str(e))

While poking around with encodings, I came across an interesting question: If we are reprojecting from a Latin-1-encoded Shapefile to a Latin-1-encoded Shapefile, should we expect the field values of strings to stay the same?

While fiddling with things, I ended up writing this test that is showing that the field values appear to actually be converted from latin-1 to UTF-8 in the process of calling reproject_vector:

def test_reproject_vector_nonstandard_encodings(self): # Create polygon shapefile to reproject base_srs = osr.SpatialReference() # NAD83(CSRS) / UTM zone 10N base_srs.ImportFromEPSG(3157) extents = [ 443723.1273278, 4956276.905980, 443993.1273278, 4956546.905980] polygon_a = shapely.geometry.box(*extents) base_vector_path = os.path.join( self.workspace_dir, 'base_vector.shp') field_name = 'id' test_driver = ogr.GetDriverByName('ESRI Shapefile') test_vector = test_driver.CreateDataSource(base_vector_path) test_layer = test_vector.CreateLayer('base_layer', srs=base_srs) field_defn = ogr.FieldDefn(field_name, ogr.OFTString) test_layer.CreateField(field_defn) layer_defn = test_layer.GetLayerDefn() test_feature = ogr.Feature(layer_defn) test_geometry = ogr.CreateGeometryFromWkb(polygon_a.wkb) test_feature.SetGeometry(test_geometry) # Default encoding for ESRI Shapefile is ISO-8859-1 test_feature.SetFieldBinary( field_name, b'Rep\xfablica Dominicana (la)') test_layer.CreateFeature(test_feature) # Confirm this layer is encoded as Latin-1 self.assertEqual( test_layer.GetMetadata_Dict("SHAPEFILE")['ENCODING_FROM_LDID'], 'ISO-8859-1') test_layer = None test_vector = None test_driver = None target_reference = osr.SpatialReference() # UTM zone 18N target_reference.ImportFromEPSG(26918) target_vector_path = os.path.join( self.workspace_dir, 'target_vector.shp') # create the file first so the model needs to deal with that target_file = open(target_vector_path, 'w') target_file.close() pygeoprocessing_logger = logging.getLogger('pygeoprocessing') layer_name = 'my_fun_layer' pygeoprocessing.reproject_vector( base_vector_path, target_reference.ExportToWkt(), target_vector_path, layer_id=0, target_layer_name=layer_name) vector = ogr.Open(target_vector_path) layer = vector.GetLayer() self.assertEqual(layer.GetFeatureCount(), 1) # Confirm the result layer is also Latin-1 self.assertEqual( layer.GetMetadata_Dict("SHAPEFILE")['ENCODING_FROM_LDID'], 'ISO-8859-1') # Make sure GDAL can automatically convert the known latin-1 string to # UTF-8 self.assertEqual( layer.GetFeature(0).GetField(field_name), "República Dominicana (la)") # Make sure the binary value is still set correctly in the reprojected # vector, encoded as Latin-1 self.assertEqual( layer.GetFeature(0).GetFieldAsBinary(field_name), b'Rep\xfablica Dominicana (la)')

Dave and I chatted about this some more on the team slack group and, since GDAL is the best authority to determine 1) whether there is a known encoding in the vector and 2) how to translate the text to/from any encodings that are in place, the best, most reasonable approach for us right now seems to be what we have already in the PR, which is:

Attempt to use GetField / SetField, which relies on GDAL doing the conversion to/from bytestrings according to the native encodings.

If that fails, just copy over the binary from the source vector's field to the new vector's field.

phargogh · 2025-04-13T03:18:33Z

This has been a really fun deep dive into how GDAL handles encodings! Forgive me if this is old news, but I just learned that ESRI Shapefiles have an encoding metadata attribute (which can be set at vector creation time and at vector open), and when this encoding is set (which it is, by default), then GDAL/python will do the translation between bytestrings and strings for us, which is great for usability. So it seems like the key to re-raising this exception is to use a file format that doesn't support encodings and isn't ASCII/UTF-8 by definition. GeoJSON is out of the question, as it's UTF-8 per the JSON spec.

Will CSV do the trick? I see that ogr.GetDriverByName('CSV').TestCapability(ogr.OLCStringsAsUTF8) = False

davemfish · 2025-04-14T14:52:26Z

This has been a really fun deep dive into how GDAL handles encodings! Forgive me if this is old news, but I just learned that ESRI Shapefiles have an encoding metadata attribute (which can be set at vector creation time and at vector open), and when this encoding is set (which it is, by default), then GDAL/python will do the translation between bytestrings and strings for us, which is great for usability. So it seems like the key to re-raising this exception is to use a file format that doesn't support encodings and isn't ASCII/UTF-8 by definition. GeoJSON is out of the question, as it's UTF-8 per the JSON spec.

Will CSV do the trick? I see that ogr.GetDriverByName('CSV').TestCapability(ogr.OLCStringsAsUTF8) = False

More good ideas! Unfortunately CSV are not usable in reproject_vector because the layer does not have a spatial reference. layer.GetSpatialRef() is None.

For some reason all ogr drivers return TestCapability(ogr.OLCStringsAsUTF8) = False.

phargogh · 2025-04-23T17:08:01Z

OK great, I'm sold on not providing test data for this for now since we can't quite reproduce the issue.

phargogh

Thanks @davemfish !

catch a RuntimeError from reproject_vector. natcap#418

f21d851

davemfish requested a review from phargogh April 2, 2025 19:53

phargogh reviewed Apr 2, 2025

View reviewed changes

try setting field as binary before skipping. natcap#418

ae85cc2

davemfish requested a review from phargogh April 8, 2025 20:51

remove unused variable; fix comment. natcap#418

361ac62

davemfish added this to the 2.4.8 milestone Apr 22, 2025

phargogh approved these changes Apr 24, 2025

View reviewed changes

phargogh merged commit ce587a0 into natcap:main Apr 24, 2025
63 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

catch a RuntimeError from reproject_vector#427

catch a RuntimeError from reproject_vector#427
phargogh merged 3 commits intonatcap:mainfrom
davemfish:bugfix/418-non-utf8-strings

davemfish commented Apr 2, 2025

Uh oh!

phargogh Apr 2, 2025

Uh oh!

davemfish Apr 8, 2025 •

edited

Loading

Uh oh!

phargogh Apr 23, 2025

Uh oh!

phargogh Apr 24, 2025

Uh oh!

phargogh commented Apr 13, 2025 •

edited

Loading

Uh oh!

davemfish commented Apr 14, 2025

Uh oh!

phargogh commented Apr 23, 2025

Uh oh!

phargogh left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

davemfish commented Apr 2, 2025

Uh oh!

phargogh Apr 2, 2025

Choose a reason for hiding this comment

Uh oh!

davemfish Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

phargogh Apr 23, 2025

Choose a reason for hiding this comment

Uh oh!

phargogh Apr 24, 2025

Choose a reason for hiding this comment

Uh oh!

phargogh commented Apr 13, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

davemfish commented Apr 14, 2025

Uh oh!

phargogh commented Apr 23, 2025

Uh oh!

phargogh left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

davemfish Apr 8, 2025 •

edited

Loading

phargogh commented Apr 13, 2025 •

edited

Loading