Skip to content

catch a RuntimeError from reproject_vector#427

Merged
phargogh merged 3 commits intonatcap:mainfrom
davemfish:bugfix/418-non-utf8-strings
Apr 24, 2025
Merged

catch a RuntimeError from reproject_vector#427
phargogh merged 3 commits intonatcap:mainfrom
davemfish:bugfix/418-non-utf8-strings

Conversation

@davemfish
Copy link
Copy Markdown
Contributor

Fixes #418

I wasn't sure if/how to synthesize an offending dataset and write a test. I have a shapefile that triggers this RuntimeError, but I didn't create it myself.

I think the OGR behavior varies by driver. I tried to write a geojson vector with the same characters as present in my shapefile, but reprojecting that to a shapefile, OGR already handles it by issuing a warning.

@davemfish davemfish requested a review from phargogh April 2, 2025 19:53
Comment on lines +2280 to +2287
try:
target_feature.SetField(
target_index, base_feature.GetField(base_index))
except RuntimeError as runtime_error:
LOGGER.debug(
f"Skipping copy field value for feature {feature_index}, "
f"field {layer_dfn.GetFieldDefn(base_index).GetName()} "
f"due to: {runtime_error}")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, and it also makes sense that this error might be specific to certain vector types that support alternate encodings.

When OGR raises a warning, does it skip the field entirely as we do here?

And, as a possible solution, I see that there is a field.GetFieldAsBinary (link) and field.setField() which takes a bytesteing that I'm wondering if we might be able to use here to bypass the encoding issue. It would only work on string and binary fields. Since you have the data for this, might this work as a solution?

Copy link
Copy Markdown
Contributor Author

@davemfish davemfish Apr 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SetFieldAsBinary works. Thanks for that idea.

I thought that might allow me to synthesize some data for a test too. But I could not figure out how to write a field value that would decode into a string (GetField) but then fail to encode('utf-8') back during SetField.

I tried something like this, based on the data I have, and it's likely encoding, but could never raise the exception.

        field_string = 'República Dominicana (la)'
        field_bytes = field_string.encode('MacRoman')
        feature.SetFieldBinary(field_name, field_bytes)
        layer.CreateFeature(feature)
        ...
        try:
            pygeoprocessing.reproject_vector(
                base_vector_path=base_vector_path,
                target_projection_wkt=base_srs.ExportToWkt(),
                target_path=target_vector_path)
        except Exception as e:
            self.fail(str(e))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While poking around with encodings, I came across an interesting question: If we are reprojecting from a Latin-1-encoded Shapefile to a Latin-1-encoded Shapefile, should we expect the field values of strings to stay the same?

While fiddling with things, I ended up writing this test that is showing that the field values appear to actually be converted from latin-1 to UTF-8 in the process of calling reproject_vector:

    def test_reproject_vector_nonstandard_encodings(self):
        # Create polygon shapefile to reproject
        base_srs = osr.SpatialReference()
        # NAD83(CSRS) / UTM zone 10N
        base_srs.ImportFromEPSG(3157)
        extents = [
            443723.1273278, 4956276.905980, 443993.1273278, 4956546.905980]

        polygon_a = shapely.geometry.box(*extents)

        base_vector_path = os.path.join(
            self.workspace_dir, 'base_vector.shp')
        field_name = 'id'
        test_driver = ogr.GetDriverByName('ESRI Shapefile')
        test_vector = test_driver.CreateDataSource(base_vector_path)
        test_layer = test_vector.CreateLayer('base_layer', srs=base_srs)

        field_defn = ogr.FieldDefn(field_name, ogr.OFTString)
        test_layer.CreateField(field_defn)
        layer_defn = test_layer.GetLayerDefn()

        test_feature = ogr.Feature(layer_defn)
        test_geometry = ogr.CreateGeometryFromWkb(polygon_a.wkb)
        test_feature.SetGeometry(test_geometry)

        # Default encoding for ESRI Shapefile is ISO-8859-1
        test_feature.SetFieldBinary(
            field_name, b'Rep\xfablica Dominicana (la)')
        test_layer.CreateFeature(test_feature)

        # Confirm this layer is encoded as Latin-1
        self.assertEqual(
            test_layer.GetMetadata_Dict("SHAPEFILE")['ENCODING_FROM_LDID'],
            'ISO-8859-1')

        test_layer = None
        test_vector = None
        test_driver = None

        target_reference = osr.SpatialReference()
        # UTM zone 18N
        target_reference.ImportFromEPSG(26918)

        target_vector_path = os.path.join(
            self.workspace_dir, 'target_vector.shp')
        # create the file first so the model needs to deal with that
        target_file = open(target_vector_path, 'w')
        target_file.close()

        pygeoprocessing_logger = logging.getLogger('pygeoprocessing')
        layer_name = 'my_fun_layer'
        pygeoprocessing.reproject_vector(
            base_vector_path, target_reference.ExportToWkt(),
            target_vector_path, layer_id=0, target_layer_name=layer_name)

        vector = ogr.Open(target_vector_path)
        layer = vector.GetLayer()
        self.assertEqual(layer.GetFeatureCount(), 1)

        # Confirm the result layer is also Latin-1
        self.assertEqual(
            layer.GetMetadata_Dict("SHAPEFILE")['ENCODING_FROM_LDID'],
            'ISO-8859-1')

        # Make sure GDAL can automatically convert the known latin-1 string to
        # UTF-8
        self.assertEqual(
            layer.GetFeature(0).GetField(field_name),
            "República Dominicana (la)")

        # Make sure the binary value is still set correctly in the reprojected
        # vector, encoded as Latin-1
        self.assertEqual(
            layer.GetFeature(0).GetFieldAsBinary(field_name),
            b'Rep\xfablica Dominicana (la)')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dave and I chatted about this some more on the team slack group and, since GDAL is the best authority to determine 1) whether there is a known encoding in the vector and 2) how to translate the text to/from any encodings that are in place, the best, most reasonable approach for us right now seems to be what we have already in the PR, which is:

  1. Attempt to use GetField / SetField, which relies on GDAL doing the conversion to/from bytestrings according to the native encodings.
  2. If that fails, just copy over the binary from the source vector's field to the new vector's field.

@davemfish davemfish requested a review from phargogh April 8, 2025 20:51
@phargogh
Copy link
Copy Markdown
Member

phargogh commented Apr 13, 2025

This has been a really fun deep dive into how GDAL handles encodings! Forgive me if this is old news, but I just learned that ESRI Shapefiles have an encoding metadata attribute (which can be set at vector creation time and at vector open), and when this encoding is set (which it is, by default), then GDAL/python will do the translation between bytestrings and strings for us, which is great for usability. So it seems like the key to re-raising this exception is to use a file format that doesn't support encodings and isn't ASCII/UTF-8 by definition. GeoJSON is out of the question, as it's UTF-8 per the JSON spec.

Will CSV do the trick? I see that ogr.GetDriverByName('CSV').TestCapability(ogr.OLCStringsAsUTF8) = False

@davemfish
Copy link
Copy Markdown
Contributor Author

This has been a really fun deep dive into how GDAL handles encodings! Forgive me if this is old news, but I just learned that ESRI Shapefiles have an encoding metadata attribute (which can be set at vector creation time and at vector open), and when this encoding is set (which it is, by default), then GDAL/python will do the translation between bytestrings and strings for us, which is great for usability. So it seems like the key to re-raising this exception is to use a file format that doesn't support encodings and isn't ASCII/UTF-8 by definition. GeoJSON is out of the question, as it's UTF-8 per the JSON spec.

Will CSV do the trick? I see that ogr.GetDriverByName('CSV').TestCapability(ogr.OLCStringsAsUTF8) = False

More good ideas! Unfortunately CSV are not usable in reproject_vector because the layer does not have a spatial reference. layer.GetSpatialRef() is None.

For some reason all ogr drivers return TestCapability(ogr.OLCStringsAsUTF8) = False.

@davemfish davemfish added this to the 2.4.8 milestone Apr 22, 2025
@phargogh
Copy link
Copy Markdown
Member

OK great, I'm sold on not providing test data for this for now since we can't quite reproduce the issue.

Copy link
Copy Markdown
Member

@phargogh phargogh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @davemfish !

@phargogh phargogh merged commit ce587a0 into natcap:main Apr 24, 2025
63 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

reproject_vector fails on non-UTF-8 strings

2 participants