fix: update Spark plugin for compatibility with Spark 3.x and 4.x #3350
Conversation
Signed-off-by: Kevin Liao <q85292542000@gmail.com>
Thank you for opening this pull request! 🙌 These tips will help get your PR across the finish line:
Sorry, I'm not familiar with this part of the code. I think the code path you are modifying will only be executed when the user sets the parameter to a specific type. If so, could you also try running a workflow with the parameter set to the other type?
Hi Nary, thanks for the review. As I understand it, Spark introduced Spark Connect in Spark 3.4+, and Spark 4 moves toward making it the primary execution backend. To support both the classic JVM engine and the Connect engine under a unified API, pyspark.sql.DataFrame now acts as a high-level abstract entrypoint rather than a concrete implementation. When a DataFrame is created, Spark redirects the construction to the appropriate backend.

I also tested this behavior in PySpark 4, and the resulting DataFrame type is indeed the backend-specific class. This is the reference you may be interested in: apache/spark@393a84f

Additionally, in PySpark 3.x there is actually no such type.
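To illustrate the point above, here is a minimal sketch (not part of the PR; it assumes a local PySpark 4.x installation with the classic, non-Connect backend): the concrete DataFrame class differs per backend, but it remains a subclass of the abstract `pyspark.sql.DataFrame`, so isinstance checks keep working.

```python
# Minimal sketch, assuming PySpark 4.x with the classic (non-Connect) backend.
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.master("local[1]").getOrCreate()
df = spark.createDataFrame([(1, "a")], ["id", "value"])

# On the classic backend this prints pyspark.sql.classic.dataframe.DataFrame;
# under Spark Connect it would be pyspark.sql.connect.dataframe.DataFrame.
print(type(df))

# Either way the object is still an instance of the abstract entrypoint.
print(isinstance(df, DataFrame))  # True

spark.stop()
```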
machichima left a comment:
LGTM! Thank you for the detailed explanation!
cc @pingsutw
Congrats on merging your first pull request! 🎉
Bito Automatic Review Skipped – PR Already Merged
Tracking issue
Why are the changes needed?
The current Spark plugin imports and uses Spark 4-specific data types that do not exist in Spark 3.x (including Spark 3.4).
This results in runtime import errors as soon as the plugin module is loaded under Spark 3.x.
Because Flyte users run Spark workloads across mixed versions (Spark 3.x or Spark 4.x), the plugin must not assume Spark 4 APIs exist at runtime.
Without this patch, Spark 3.x tasks fail immediately, even if their logic does not depend on Spark-4-only features.
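The exact diff is not reproduced here, but as an illustrative sketch only (the guarded module path is an assumption, not necessarily what the patch touches), the general pattern for not assuming Spark 4 APIs at import time looks like this:

```python
# Illustrative sketch only -- not the actual patch. It shows the general
# pattern of guarding imports that may be missing on older Spark versions,
# so the plugin module still imports cleanly under Spark 3.x.
from pyspark.sql import DataFrame

try:
    # Present only when the Spark Connect extras are installed (3.4+ / 4.x).
    from pyspark.sql.connect.dataframe import DataFrame as ConnectDataFrame
except ImportError:
    ConnectDataFrame = None  # Plain Spark 3.x: fall back gracefully.


def is_spark_dataframe(obj) -> bool:
    """Match DataFrames from whichever backends are importable."""
    candidates = (DataFrame,) if ConnectDataFrame is None else (DataFrame, ConnectDataFrame)
    return isinstance(obj, candidates)
```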
What changes were proposed in this pull request?
How was this patch tested?
Tested PySpark locally with Spark 3.4 and Spark 4.x.
Built a Flyte sandbox image with the updated transformer and schema.
Ran Spark tasks in Flyte (a minimal example task is sketched after this list) using:
Spark 3.4 base image → passed
Spark 4.x base image → passed
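A minimal smoke-test task along these lines might look like the sketch below (illustrative only; the container image name and Spark configuration are assumptions, not the exact ones used for the runs above):

```python
# Hypothetical Flyte Spark smoke test; image name and spark_conf are placeholders.
import flytekit
from flytekit import task, workflow
from flytekitplugins.spark import Spark


@task(
    task_config=Spark(spark_conf={"spark.driver.memory": "1g"}),
    # Swap this for a Spark 3.4 or Spark 4.x based image to exercise each runtime.
    container_image="ghcr.io/example/flytekit-spark:spark3.4",  # hypothetical image
)
def spark_row_count(n: int) -> int:
    # The Spark plugin injects a SparkSession into the execution context.
    sess = flytekit.current_context().spark_session
    return sess.range(n).count()


@workflow
def wf(n: int = 100) -> int:
    return spark_row_count(n=n)
```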
Setup process
Screenshots
Check all the applicable boxes
Related PRs
Docs link
Summary by Bito