Python API to cast binary columns to WKB columns #530

@2010YOUY01

Motivation

Some legacy Parquet files store geometry as a BINARY column whose payload is WKB. The snippet below converts such a file into GeoParquet: it explicitly converts the binary WKB payload into a geometry value (and sets the SRID), so that SedonaDB recognizes the column as geometry and to_parquet() can write the GeoParquet metadata correctly.

# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32

import sedona.db

sd = sedona.db.connect()
df = sd.read_parquet("/data/geo_legacy.parquet")

# Register the DataFrame as view "t" so the SQL below can reference it
df.to_view("t", overwrite=True)

df = sd.sql("""
  SELECT
    ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
    * EXCLUDE (geo_bin)
  FROM t
""")

df.to_parquet("geo_geoparquet.parquet")

Proposed new API

It would be helpful to have an easier API for this. Using a dedicated method (instead of fusing the cast into read_parquet() or to_parquet()) makes the conversion more flexible, especially when “logically geometry, physically WKB-in-binary” columns come from other sources or are produced mid-query.

def with_geometry(...):
    """
    Convert one or more binary WKB columns into geometry columns.

    Args:
        columns: A column name or list of column names containing WKB payloads.
        crs: Optional CRS identifier (e.g., 4326 or "EPSG:4326").
        validate: If True, validate WKB payloads while converting.
        primary: Optional name to mark as the primary geometry column.

    Examples:
        >>> sd = sedona.db.connect()
        >>> df = sd.read_parquet("geo_legacy.parquet").with_geometry(
        ...     columns=["geo_bin"],
        ...     crs="EPSG:4326",
        ...     validate=True,
        ...     primary="geo_bin",
        ... )
    """

Example usage

# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32

df = sd.read_parquet("/data/geo_legacy.parquet")

df = df.with_geometry(
    columns="geo_bin",
    crs=4326,
    validate=True,
    primary="geo_bin",
)

df.to_parquet("geo_geoparquet.parquet")

Implementation

Internally, this simply adds an expression projection over the selected columns, e.g. ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) for geo_bin.
