@Imbruced
Member

@Imbruced Imbruced commented Jan 18, 2026

Did you read the Contributor Guide?

Is this PR related to a ticket?

  • Yes, and the PR name follows the format [SEDONA-738] my subject.

What changes were proposed in this PR?

Sedona vectorized UDFs that exchange data through Apache Arrow and run on SedonaDB. The feature supports:

  • scalar functions
  • daemon mode

How was this patch tested?

Unit tests.

Did this PR include necessary documentation updates?

TODO

@github-actions bot added the sedona-python, sedona-spark, github_actions, and root labels Jan 18, 2026
@Imbruced
Member Author

image

@Imbruced
Member Author

Working on proper benchmarking

@Imbruced
Member Author

Planned extensions to the SedonaDB vectorized UDFs:

  • support for Spark 4.0
  • vectorized table functions (in SedonaDB, a table object would be the input to the function)
  • an additional serialization method in SedonaDB to reduce the number of transformations needed for table functions
  • aggregate functions?
  • GeoPandas as an alternative method

@Imbruced
Member Author

@jiayuasu @paleolimbot I think we can start reviewing the changes and the ideas I am proposing in this MR. What I observed is that a UDF written this way can be even faster than native Sedona functions such as ST_Buffer. ST_Area, on the other hand, is three times slower, so I guess it depends on the specific function. More importantly, the performance is better than that of the previous UDFs in Sedona. I would mark this functionality as experimental.

Also, I haven't included a documentation update, as we might decide during the review that this MR needs adjustment.

@Imbruced Imbruced marked this pull request as ready for review January 25, 2026 21:27
@Imbruced Imbruced requested a review from jiayuasu as a code owner January 25, 2026 21:27
@Imbruced
Member Author

This code currently works only with Spark 3.5, but I plan to extend it to Spark 4.0.

@Imbruced
Member Author

I would like to extend it to include user-defined table functions, which will allow us to operate on the entire SedonaDB dataframe.

cd python
uv add apache-flink==1.20.1
uv sync
# uv sync --extra flink
Member Author

To be removed.

matrix:
os: ['ubuntu-latest', 'windows-latest', 'macos-15']
- python: ['3.11', '3.10', '3.9', '3.8']
+ python: ['3.11', '3.10', '3.9']
Member Author

I had trouble integrating with Python 3.8. It has already been a year since it reached EOL, so what do you think about removing it, and maybe starting to support Python 3.12 and 3.13?

# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

Member Author

This file is almost a copy-paste of what is in Apache Spark. The only difference is the imported worker function:

from sedona.spark.worker.worker import main as worker_main

I'm not sure what a better approach would be. Perhaps importing the functions, such as manager?
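One possible alternative, as a rough sketch only: resolve the worker entry point by module name at runtime instead of hard-coding the import, so the copied daemon code would not need to differ from Spark's beyond this lookup. The helper name `load_worker_main` and its `attr` parameter are hypothetical and not part of this PR or of Spark; only `importlib.import_module` from the standard library is used.

```python
import importlib
from typing import Callable


def load_worker_main(module_name: str, attr: str = "main") -> Callable:
    """Import `module_name` and return its callable entry point `attr`.

    Hypothetical helper: the module name could come from a setting such as
    sedona.python.worker.udf.module rather than a fixed import statement.
    """
    module = importlib.import_module(module_name)
    entry = getattr(module, attr)
    if not callable(entry):
        raise TypeError(f"{module_name}.{attr} is not callable")
    return entry


# Intended use (requires the Sedona worker package to be importable):
# worker_main = load_worker_main("sedona.spark.worker.worker")
```

Whether this is preferable to importing individual functions depends on how closely the file needs to track the upstream Spark daemon.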

crs = self.geom_offsets[arg]
fields.append(
f"ST_GeomFromSedonaSpark(_{arg}, 'EPSG:{crs}') AS _{arg}"
) # nosec
Member Author

This is a theoretical SQL injection, but it causes no harm here.
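If we wanted to rule it out entirely, a minimal sketch would be to coerce both interpolated values to integers before building the expression. The function name `geom_field_expr` is hypothetical, and this assumes the CRS is always a numeric EPSG code, which may not hold for every input.

```python
def geom_field_expr(arg, crs) -> str:
    """Build the projection for one geometry argument.

    Hypothetical helper: int() raises ValueError for strings that are not
    plain integers, so arbitrary text cannot reach the SQL string.
    """
    arg_index = int(arg)
    srid = int(crs)
    if arg_index < 0 or srid < 0:
        raise ValueError("argument index and SRID must be non-negative")
    return f"ST_GeomFromSedonaSpark(_{arg_index}, 'EPSG:{srid}') AS _{arg_index}"
```

With this guard in place, a malformed CRS fails early with a ValueError instead of ending up inside the generated SQL.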

return Py_BuildValue("(Kibi)", geom, geom_type_id, has_z, length);
}

static PyObject *to_sedona_func(PyObject *self, PyObject *args) {
Member Author

If the Sedona speedup is available, we can create Shapely objects directly instead of translating to WKB and then loading from WKB with Shapely, which speeds up vectorized UDFs.

"20",
)
# Pandas on PySpark doesn't work with ANSI mode, which is enabled by default
.config("spark.executor.memory", "10G")
Member Author

To be removed; I forgot to remove it after testing.

from setuptools import setup
import numpy

setup(
Member Author

This is needed to make the NumPy C wrappers available.

val sedonaArrowStrategy = Try(
Class
- .forName("org.apache.spark.sql.udf.SedonaArrowStrategy")
+ .forName("org.apache.spark.sql.execution.python.SedonaArrowStrategy")
Member Author

The class needs access to some private methods in Spark's execution.python package.

case _ => None
}

schema
Member Author

Infer the schema for geometry fields by taking the first value.

.config("sedona.join.autoBroadcastJoinThreshold", "-1")
.config("spark.sql.extensions", "org.apache.sedona.sql.SedonaSqlExtensions")
.config("sedona.python.worker.udf.module", "sedona.spark.worker.worker")
.config("sedona.python.worker.udf.daemon.module", "sedonaworker.daemon")
Member Author

This works only because sedona.python.worker.daemon.enabled is false. We need to either remove this parameter from the test (the default, sedona.spark.worker.daemon, is then used) or change it to sedona.spark.worker.daemon.
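For illustration, the second option could look roughly like the sketch below on the Python side. The setting names and module paths are the ones mentioned in this thread; the `apply_conf` helper is hypothetical and only relies on the builder's `config(key, value)` method, as `SparkSession.builder` provides.

```python
# Worker settings with the daemon module pointed at the default path.
# Values are strings because Spark configuration values are strings.
sedona_worker_conf = {
    "sedona.python.worker.udf.module": "sedona.spark.worker.worker",
    "sedona.python.worker.udf.daemon.module": "sedona.spark.worker.daemon",
    "sedona.python.worker.daemon.enabled": "false",
}


def apply_conf(builder, conf):
    """Apply every key/value pair to a builder exposing config(key, value)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Intended use (requires pyspark):
# spark = apply_conf(SparkSession.builder, sedona_worker_conf).getOrCreate()
```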

<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<!-- <version>3.12.0</version>-->
Member Author

To be removed.

Labels

github_actions, root, sedona-python, sedona-spark
