SEDONA-738 Add sedonadb worker #2593
Conversation
Working on proper benchmarking.
The items on the list are the planned extensions to the SedonaDB vectorized UDFs:
@jiayuasu @paleolimbot I think we can start reviewing the changes and the ideas I am proposing in this MR. What I observed is that, this way, a UDF can be even faster than native Sedona functions like ST_Buffer; ST_Area, for instance, is three times slower, so I guess it depends on the specific function. More importantly, the performance is better than the previous UDFs in Sedona. I would mark this functionality as experimental. I also haven't included a documentation update, as we might decide during the review that this MR needs adjustment.
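For context, here is a minimal sketch of a vectorized buffer UDF of the kind being compared above, assuming Shapely 2.x and the stock pandas UDF API; it is illustrative only, not the benchmark code from this PR.

```python
import pandas as pd
import shapely
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import BinaryType

@pandas_udf(BinaryType())
def buffer_wkb(wkb: pd.Series) -> pd.Series:
    # One vectorized Shapely call per Arrow batch instead of a Python loop
    # per row; this is where the gap to row-wise UDFs comes from.
    geoms = shapely.from_wkb(wkb)
    return pd.Series(shapely.to_wkb(shapely.buffer(geoms, 1.0)))
```

Comparing `df.select(buffer_wkb("geom_wkb"))` against `ST_Buffer(geom, 1.0)` on the same data is the shape of the measurement described above.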
This piece of code works only with Spark 3.5, but I plan to extend it to Spark 4.0.
I would like to extend it to include table-defined user functions, which will allow us to operate on the entire SedonaDB dataframe; a rough sketch of that direction follows below.
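As an illustration, Spark 3.5 already ships a Python UDTF API for table-valued functions; wiring something similar to SedonaDB is the open idea here. The class below is a hypothetical example, not part of this PR.

```python
from pyspark.sql.functions import udtf

@udtf(returnType="id: int, wkt: string")
class PassThroughGeoms:
    # A trivial table function: it consumes rows and yields rows, which is
    # the shape a SedonaDB-backed table UDF would take.
    def eval(self, id: int, wkt: str):
        yield id, wkt
```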
cd python
uv add apache-flink==1.20.1
uv sync
# uv sync --extra flink
to remove
matrix:
  os: ['ubuntu-latest', 'windows-latest', 'macos-15']
- python: ['3.11', '3.10', '3.9', '3.8']
+ python: ['3.11', '3.10', '3.9']
I had trouble integrating with Python 3.8, and it has already been a year since it reached EOL. What would you think about removing it, and maybe starting to support Python 3.12 and 3.13?
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
This one is almost a copy-paste of what is in Apache Spark. The only difference is the imported worker function:
from sedona.spark.worker.worker import main as worker_main
I don't know whether there is a better approach, such as importing functions like manager instead?
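One possible middle ground, sketched below on the assumption that only the entry point differs from Spark's file: resolve the worker main from a configurable module name instead of duplicating the module. The function name and default are illustrative.

```python
import importlib

def load_worker_main(module_name: str = "sedona.spark.worker.worker"):
    # Import the configured worker module and hand back its main(), so the
    # surrounding daemon loop never needs to be copy-pasted.
    module = importlib.import_module(module_name)
    return module.main
```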
crs = self.geom_offsets[arg]
fields.append(
    f"ST_GeomFromSedonaSpark(_{arg}, 'EPSG:{crs}') AS _{arg}"
)  # nosec
Theoretical SQL injection, which is not causing any harm here.
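If we want to silence the finding anyway, coercing the CRS to an integer before interpolation closes even the theoretical hole. A small sketch reusing the names from the diff above:

```python
def geom_field_expr(arg: int, crs_value) -> str:
    # int() raises on anything that is not a plain EPSG code, so the
    # f-string below can only ever receive digits.
    crs = int(crs_value)
    return f"ST_GeomFromSedonaSpark(_{arg}, 'EPSG:{crs}') AS _{arg}"
```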
  return Py_BuildValue("(Kibi)", geom, geom_type_id, has_z, length);
}

static PyObject *to_sedona_func(PyObject *self, PyObject *args) {
If the Sedona speedup is available, instead of translating to WKB and then loading from WKB with Shapely, we can create Shapely objects directly to speed up vectorized UDFs.
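For comparison, the baseline path this avoids looks roughly like the sketch below (Shapely 2.x); the C function above would construct the Shapely objects directly, without the intermediate WKB buffers.

```python
import numpy as np
import shapely

def geoms_via_wkb(wkb_values) -> np.ndarray:
    # Baseline: every geometry is serialized to WKB in the worker and then
    # parsed again here, a full encode/decode round-trip per batch.
    return shapely.from_wkb(np.asarray(wkb_values, dtype=object))
```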
| "20", | ||
| ) | ||
| # Pandas on PySpark doesn't work with ANSI mode, which is enabled by default | ||
| .config("spark.executor.memory", "10G") |
To remove; I forgot to remove it after testing.
from setuptools import setup
import numpy

setup(
This is needed to make the NumPy C API wrappers available to the extension build.
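Concretely, the wiring implied here is passing numpy.get_include() to the C extension build; a sketch with placeholder extension and source names:

```python
from setuptools import setup, Extension
import numpy

setup(
    ext_modules=[
        Extension(
            "sedona._geom_speedups",             # placeholder extension name
            sources=["src/geom_speedups.c"],     # placeholder source path
            include_dirs=[numpy.get_include()],  # exposes numpy/arrayobject.h
        )
    ]
)
```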
val sedonaArrowStrategy = Try(
  Class
-   .forName("org.apache.spark.sql.udf.SedonaArrowStrategy")
+   .forName("org.apache.spark.sql.execution.python.SedonaArrowStrategy")
We need some private methods from Spark's execution.python package.
  case _ => None
}

schema
Infer the type of geometry fields by taking the first value.
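In Python terms the rule amounts to the sketch below: peek at the first non-null value of a column and treat the field as geometry if that value is one. Names are illustrative.

```python
from shapely.geometry.base import BaseGeometry

def is_geometry_field(values) -> bool:
    # The first non-null value stands in for the whole column.
    first = next((v for v in values if v is not None), None)
    return isinstance(first, BaseGeometry)
```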
| .config("sedona.join.autoBroadcastJoinThreshold", "-1") | ||
| .config("spark.sql.extensions", "org.apache.sedona.sql.SedonaSqlExtensions") | ||
| .config("sedona.python.worker.udf.module", "sedona.spark.worker.worker") | ||
| .config("sedona.python.worker.udf.daemon.module", "sedonaworker.daemon") |
This works because sedona.python.worker.daemon.enabled is false. We need to either remove this param from the test (by default sedona.spark.worker.daemon is used) or change it to sedona.spark.worker.daemon.
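A sketch of the corrected test setup, assuming the default daemon module name quoted above:

```python
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder
    .config("sedona.python.worker.udf.module", "sedona.spark.worker.worker")
    # Either drop the daemon override entirely (the default already points
    # at sedona.spark.worker.daemon), or make it explicit and correct:
    .config("sedona.python.worker.udf.daemon.module", "sedona.spark.worker.daemon")
)
```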
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <!-- <version>3.12.0</version> -->
to remove

Did you read the Contributor Guide?
Is this PR related to a ticket?
Yes: [SEDONA-738] Add sedonadb worker.
What changes were proposed in this PR?
A Sedona vectorized UDF (Apache Arrow exchange) that utilizes SedonaDB. It supports:
How was this patch tested?
Unit tests.
Did this PR include necessary documentation updates?
TODO