catch a RuntimeError from reproject_vector #427
Conversation
src/pygeoprocessing/geoprocessing.py
Outdated
try:
    target_feature.SetField(
        target_index, base_feature.GetField(base_index))
except RuntimeError as runtime_error:
    LOGGER.debug(
        f"Skipping copy field value for feature {feature_index}, "
        f"field {layer_dfn.GetFieldDefn(base_index).GetName()} "
        f"due to: {runtime_error}")
This makes sense to me, and it also makes sense that this error might be specific to certain vector types that support alternate encodings.
When OGR raises a warning, does it skip the field entirely as we do here?
And, as a possible solution, I see that there is a field.GetFieldAsBinary() and a field.SetField() that takes a bytestring, which I'm wondering if we might be able to use here to bypass the encoding issue. It would only work on string and binary fields. Since you have the data for this, might this work as a solution?
SetFieldBinary works. Thanks for that idea.
I thought that might allow me to synthesize some data for a test too. But I could not figure out how to write a field value that would decode into a string (GetField) but then fail to encode('utf-8') back during SetField.
I tried something like this, based on the data I have and its likely encoding, but could never raise the exception.
field_string = 'República Dominicana (la)'
field_bytes = field_string.encode('MacRoman')
feature.SetFieldBinary(field_name, field_bytes)
layer.CreateFeature(feature)
...
try:
    pygeoprocessing.reproject_vector(
        base_vector_path=base_vector_path,
        target_projection_wkt=base_srs.ExportToWkt(),
        target_path=target_vector_path)
except Exception as e:
    self.fail(str(e))
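For what it's worth, here is a stdlib-only sketch (no GDAL involved) of why that round trip is so hard to break: Latin-1 assigns a character to every one of the 256 byte values, so bytes written in any other 8-bit encoding (MacRoman here) still decode without error — they just come out as mojibake — and any Python str then re-encodes to UTF-8 cleanly, so no exception is ever raised.

```python
# Bytes written as MacRoman still *decode* under Latin-1, since every
# byte value is valid Latin-1; the result is mojibake, not an error.
field_string = 'República Dominicana (la)'
macroman_bytes = field_string.encode('MacRoman')

misread = macroman_bytes.decode('ISO-8859-1')
assert misread != field_string  # wrong text, but no exception

# And the mojibake string still encodes to UTF-8 without raising.
utf8_bytes = misread.encode('utf-8')
```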
While poking around with encodings, I came across an interesting question: If we are reprojecting from a Latin-1-encoded Shapefile to a Latin-1-encoded Shapefile, should we expect the field values of strings to stay the same?
While fiddling with things, I ended up writing this test, which shows that the field values actually are converted from Latin-1 to UTF-8 in the process of calling reproject_vector:
def test_reproject_vector_nonstandard_encodings(self):
    # Create polygon shapefile to reproject
    base_srs = osr.SpatialReference()
    # NAD83(CSRS) / UTM zone 10N
    base_srs.ImportFromEPSG(3157)
    extents = [
        443723.1273278, 4956276.905980, 443993.1273278, 4956546.905980]
    polygon_a = shapely.geometry.box(*extents)
    base_vector_path = os.path.join(
        self.workspace_dir, 'base_vector.shp')
    field_name = 'id'
    test_driver = ogr.GetDriverByName('ESRI Shapefile')
    test_vector = test_driver.CreateDataSource(base_vector_path)
    test_layer = test_vector.CreateLayer('base_layer', srs=base_srs)
    field_defn = ogr.FieldDefn(field_name, ogr.OFTString)
    test_layer.CreateField(field_defn)
    layer_defn = test_layer.GetLayerDefn()
    test_feature = ogr.Feature(layer_defn)
    test_geometry = ogr.CreateGeometryFromWkb(polygon_a.wkb)
    test_feature.SetGeometry(test_geometry)
    # Default encoding for ESRI Shapefile is ISO-8859-1
    test_feature.SetFieldBinary(
        field_name, b'Rep\xfablica Dominicana (la)')
    test_layer.CreateFeature(test_feature)
    # Confirm this layer is encoded as Latin-1
    self.assertEqual(
        test_layer.GetMetadata_Dict("SHAPEFILE")['ENCODING_FROM_LDID'],
        'ISO-8859-1')
    test_layer = None
    test_vector = None
    test_driver = None
    target_reference = osr.SpatialReference()
    # UTM zone 18N
    target_reference.ImportFromEPSG(26918)
    target_vector_path = os.path.join(
        self.workspace_dir, 'target_vector.shp')
    # create the file first so the model needs to deal with that
    target_file = open(target_vector_path, 'w')
    target_file.close()
    pygeoprocessing_logger = logging.getLogger('pygeoprocessing')
    layer_name = 'my_fun_layer'
    pygeoprocessing.reproject_vector(
        base_vector_path, target_reference.ExportToWkt(),
        target_vector_path, layer_id=0, target_layer_name=layer_name)
    vector = ogr.Open(target_vector_path)
    layer = vector.GetLayer()
    self.assertEqual(layer.GetFeatureCount(), 1)
    # Confirm the result layer is also Latin-1
    self.assertEqual(
        layer.GetMetadata_Dict("SHAPEFILE")['ENCODING_FROM_LDID'],
        'ISO-8859-1')
    # Make sure GDAL can automatically convert the known latin-1 string
    # to UTF-8
    self.assertEqual(
        layer.GetFeature(0).GetField(field_name),
        "República Dominicana (la)")
    # Make sure the binary value is still set correctly in the
    # reprojected vector, encoded as Latin-1
    self.assertEqual(
        layer.GetFeature(0).GetFieldAsBinary(field_name),
        b'Rep\xfablica Dominicana (la)')
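The last two assertions come down to a plain Python encoding fact, sketched here without GDAL: the on-disk Latin-1 byte 0xFA decodes to 'ú', and re-encoding that string as UTF-8 yields different bytes — which is why GetField (a UTF-8 str) and GetFieldAsBinary (the raw Latin-1 bytes) disagree at the byte level while agreeing on the text.

```python
# Latin-1 byte 0xFA is 'ú'; the same character in UTF-8 is the
# two-byte sequence 0xC3 0xBA.
latin1_bytes = b'Rep\xfablica Dominicana (la)'
decoded = latin1_bytes.decode('ISO-8859-1')

assert decoded == 'República Dominicana (la)'
assert decoded.encode('utf-8') == b'Rep\xc3\xbablica Dominicana (la)'
```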
Dave and I chatted about this some more on the team slack group and, since GDAL is the best authority to determine 1) whether there is a known encoding in the vector and 2) how to translate the text to/from any encodings that are in place, the best, most reasonable approach for us right now seems to be what we have already in the PR, which is:
- Attempt to use GetField/SetField, which relies on GDAL doing the conversion to/from bytestrings according to the native encodings.
- If that fails, just copy over the binary from the source vector's field to the new vector's field.
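That two-step flow can be sketched with duck-typed stand-ins. The method names mirror OGR's Feature API, but this is an illustration of the control flow, not the PR's exact code; copy_field_value and its arguments are hypothetical names.

```python
def copy_field_value(base_feature, target_feature, base_index, target_index):
    """Sketch of the two-step copy: GetField/SetField, binary fallback."""
    try:
        # Preferred path: let GDAL translate to/from the native encodings.
        target_feature.SetField(
            target_index, base_feature.GetField(base_index))
    except RuntimeError:
        # Fallback: copy the raw bytes, bypassing re-encoding entirely.
        target_feature.SetFieldBinary(
            target_index, base_feature.GetFieldAsBinary(base_index))
```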
This has been a really fun deep dive into how GDAL handles encodings! Forgive me if this is old news, but I just learned that ESRI Shapefiles have an encoding metadata attribute (which can be set at vector creation time and at vector open), and when this encoding is set (which it is, by default), GDAL/Python will do the translation between bytestrings and strings for us, which is great for usability. So it seems like the key to re-raising this exception is to use a file format that doesn't support encodings and isn't ASCII/UTF-8 by definition. GeoJSON is out of the question, as it's UTF-8 per the JSON spec. Will CSV do the trick? I see that
More good ideas! Unfortunately CSVs are not usable in For some reason all
OK great, I'm sold on not providing test data for this for now since we can't quite reproduce the issue.
Fixes #418
I wasn't sure if/how to synthesize an offending dataset and write a test. I have a shapefile that triggers this RuntimeError, but I didn't create it myself. I think the OGR behavior varies by driver. I tried to write a GeoJSON vector with the same characters as present in my shapefile, but when reprojecting that to a shapefile, OGR already handles it by issuing a warning.