Motivation
To convert legacy Parquet files that store geometry as a BINARY column whose payload is WKB into GeoParquet, the snippet below can be used. It explicitly converts the binary WKB payload into a geometry value (and sets the SRID), so that SedonaDB recognizes the column as geometry and to_parquet() can write GeoParquet metadata correctly.
# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32
import sedona.db

sd = sedona.db.connect()
df = sd.read_parquet("/data/geo_legacy.parquet")
# Register a view name so the DataFrame can be referenced from SQL
df.to_view("t", overwrite=True)
df = sd.sql("""
    SELECT
        ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
        * EXCLUDE (geo_bin)
    FROM t
""")
df.to_parquet("geo_geoparquet.parquet")
Proposed new API
It would be helpful to have an easier API for this. Using a dedicated method (instead of fusing the cast into read_parquet() or to_parquet()) makes the conversion more flexible, especially when “logically geometry, physically WKB-in-binary” columns come from other sources or are produced mid-query.
def with_geometry(...):
    """
    Convert one or more binary WKB columns into geometry columns.

    Args:
        columns: A column name or list of column names containing WKB payloads.
        crs: Optional CRS identifier (e.g., 4326 or "EPSG:4326").
        validate: If True, validate WKB payloads while converting.
        primary: Optional name to mark as the primary geometry column.

    Examples:
        >>> sd = sedona.db.connect()
        >>> df = sd.read_parquet("geo_legacy.parquet").with_geometry(
        ...     columns=["geo_bin"],
        ...     crs="EPSG:4326",
        ...     validate=True,
        ...     primary="geo_bin",
        ... )
    """
Example usage
# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32
df = sd.read_parquet("/data/geo_legacy.parquet")
df = df.with_geometry(
    columns="geo_bin",
    crs=4326,
    validate=True,
    primary="geo_bin",
)
df.to_parquet("geo_geoparquet.parquet")
Implementation
Internally, this simply adds a projection expression for each geometry column, e.g. ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326).
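As a rough illustration of that projection, here is a minimal sketch of how the SQL could be generated. The helper names (`geometry_projection_sql`, `_srid_from_crs`, `_quote`) are hypothetical and not part of SedonaDB. The sketch assumes that only integer or "EPSG:<code>" CRS values are accepted and that each converted column keeps its original name. It does not cover `validate` or `primary`.

```python
from typing import Optional, Sequence, Union


def _quote(identifier: str) -> str:
    """Double-quote a SQL identifier, escaping embedded double quotes."""
    return '"' + identifier.replace('"', '""') + '"'


def _srid_from_crs(crs: Union[int, str, None]) -> Optional[int]:
    """Turn 4326 or "EPSG:4326" into an integer SRID (assumption: EPSG only)."""
    if crs is None:
        return None
    if isinstance(crs, int):
        return crs
    authority, _, code = crs.partition(":")
    if authority.upper() != "EPSG" or not code.isdigit():
        raise ValueError(f"unsupported CRS identifier: {crs!r}")
    return int(code)


def geometry_projection_sql(
    view: str,
    columns: Union[str, Sequence[str]],
    crs: Union[int, str, None] = None,
) -> str:
    """Build a SELECT that replaces each WKB binary column with a geometry."""
    names = [columns] if isinstance(columns, str) else list(columns)
    if not names:
        raise ValueError("at least one column is required")
    srid = _srid_from_crs(crs)

    projections = []
    for name in names:
        expr = f"ST_GeomFromWKB({_quote(name)})"
        if srid is not None:
            # Only wrap in ST_SetSRID when a CRS was actually given.
            expr = f"ST_SetSRID({expr}, {srid})"
        projections.append(f"{expr} AS {_quote(name)}")

    # Drop the original binary columns from the wildcard so names do not clash.
    excluded = ", ".join(_quote(n) for n in names)
    return (
        f"SELECT {', '.join(projections)}, * EXCLUDE ({excluded}) "
        f"FROM {_quote(view)}"
    )
```

With the DataFrame registered as view t, the string returned by geometry_projection_sql("t", "geo_bin", crs=4326) could be passed to sd.sql(), which mirrors the manual snippet under Motivation except that the geometry column keeps the name geo_bin.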