Python API to cast binary columns to WKB columns #530

# Motivation

Converting a legacy Parquet file that stores geometry as a `BINARY` column (with a WKB payload) into GeoParquet currently requires a snippet like the one below. It explicitly parses the WKB payload into a geometry value and sets the SRID, so that SedonaDB recognizes the column as geometry and `to_parquet()` can write the GeoParquet metadata correctly.

```python
# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32

import sedona.db

sd = sedona.db.connect()
df = sd.read_parquet("/data/geo_legacy.parquet")

# Register the DataFrame as a view so SQL can refer to it as "t"
df.to_view("t", overwrite=True)

df = sd.sql("""
  SELECT
    ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326) AS geometry,
    * EXCLUDE (geo_bin)
  FROM t
""")

df.to_parquet("geo_geoparquet.parquet")
```

# Proposed new API

It would be helpful to have an easier API for this. Using a dedicated method (instead of fusing the cast into `read_parquet()` or `to_parquet()`) makes the conversion more flexible, especially when “logically geometry, physically WKB-in-binary” columns come from other sources or are produced mid-query.

```python
def with_geometry(...):
    """
    Convert one or more binary WKB columns into geometry columns.

    Args:
        columns: A column name or list of column names containing WKB payloads.
        crs: Optional CRS identifier (e.g., 4326 or "EPSG:4326").
        validate: If True, validate WKB payloads while converting.
        primary: Optional name to mark as the primary geometry column.

    Examples:
        >>> sd = sedona.db.connect()
        >>> df = sd.read_parquet("geo_legacy.parquet").with_geometry(
        ...     columns=["geo_bin"],
        ...     crs="EPSG:4326",
        ...     validate=True,
        ...     primary="geo_bin",
        ... )
    """
```
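
To make the `validate` flag more concrete, here is a minimal sketch of the cheapest kind of check it could perform: inspecting only the WKB header. The function name `looks_like_wkb` is hypothetical and not part of SedonaDB. The sketch assumes ISO WKB type codes and does not handle EWKB flag bits; a real implementation would more likely rely on the engine's own WKB parser.

```python
import struct

# ISO WKB base geometry type codes: 1 (Point) through 7 (GeometryCollection).
_BASE_TYPES = range(1, 8)


def looks_like_wkb(payload: bytes) -> bool:
    """Hypothetical header-only check; it does not parse coordinates."""
    if len(payload) < 5:
        return False  # need 1 byte-order byte + 4 type-code bytes
    byte_order = payload[0]
    if byte_order not in (0, 1):  # 0 = big-endian, 1 = little-endian
        return False
    fmt = "<I" if byte_order == 1 else ">I"
    (type_code,) = struct.unpack_from(fmt, payload, 1)
    # ISO WKB marks Z/M/ZM geometries by adding 1000/2000/3000 to the base code.
    dimension, base = divmod(type_code, 1000)
    return dimension in (0, 1, 2, 3) and base in _BASE_TYPES


# A little-endian 2D point, POINT (1 2), encoded by hand.
point_wkb = b"\x01" + struct.pack("<I", 1) + struct.pack("<dd", 1.0, 2.0)
```

A check like this only rejects payloads that are obviously not WKB; truncated or otherwise malformed geometries would still need a full parse to detect.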

## Example usage

```python
# geo_legacy.parquet schema
# - geo_bin: Binary (payload is WKB)
# - c1: Int32
# - c2: Int32

df = sd.read_parquet("/data/geo_legacy.parquet")

df = df.with_geometry(
    columns="geo_bin",
    crs=4326,
    validate=True,
    primary="geo_bin",
)

df.to_parquet("geo_geoparquet.parquet")
```

## Implementation

Internally, this simply adds a projection that replaces each listed binary column with a geometry expression (e.g., `ST_SetSRID(ST_GeomFromWKB(geo_bin), 4326)`).

