
Conversation

@novatechflow (Member)

Port object-file IO onto a safe codec

  • add ObjectFileSerializationMode + helper to switch between legacy Java serialization and a JSON-based format, with unit tests covering both paths
  • rework Java/Flink/Spark object-file sources/sinks (and Flink’s output format) to use the shared serializer, log deprecation warnings for the legacy mode, and stop calling raw ObjectInputStream
  • update Spark’s SequenceFile readers/writers to chunk records via RDD transformations and share logic with the new serializer

Motivation: the legacy path deserializes untrusted SequenceFile payloads via ObjectInputStream.readObject, letting an attacker ship a gadget chain that runs arbitrary bytecode inside the Wayang JVM. That RCE can be used to execute system commands, exfiltrate data, or tamper with jobs, so we move to JSON by default and require explicit opt-in for the old codec.
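To illustrate the switch described above, here is a minimal sketch. It is not Wayang's actual implementation (the real patch defines a Java enum, `ObjectFileSerializationMode`); all names below are made up, and Python's `pickle` stands in for Java serialization. The codec defaults to the safe JSON format and warns loudly on the legacy opt-in path:

```python
import json
import pickle
import warnings

# Hypothetical mode names; stand-ins for the Java enum in the patch.
JSON_MODE = "json"
LEGACY_MODE = "legacy"

def encode(record, mode=JSON_MODE):
    """Serialize a record; JSON is the safe default."""
    if mode == JSON_MODE:
        return json.dumps(record).encode("utf-8")
    # Legacy path: arbitrary-object serialization, explicit opt-in only.
    # (pickle plays the role of Java's ObjectOutputStream here.)
    return pickle.dumps(record)

def decode(payload, mode=JSON_MODE):
    """Deserialize a record, warning on the unsafe legacy path."""
    if mode == JSON_MODE:
        return json.loads(payload.decode("utf-8"))
    warnings.warn(
        "legacy serialization can execute attacker-controlled code on "
        "untrusted input; prefer the JSON mode",
        DeprecationWarning,
    )
    return pickle.loads(payload)
```

The key property is that a caller who never mentions `LEGACY_MODE` can only ever hit the data-only JSON codec, so a malicious payload cannot trigger object construction.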

Replaces PR #638

@juripetersen (Contributor) left a comment:

Minor stylistic changes could be done here to match the style we have in our codebase and improve readability a bit.

<spark.version>2.4.0</spark.version>
<scala.mayor.version>2.12</scala.mayor.version>
<giraph.version>1.2.0-hadoop2</giraph.version>
<python.worker.tests.skip>true</python.worker.tests.skip>
Contributor

Are you adding this because the tests fail locally for you?
I don't think this should be addressed in this PR; it has nothing to do with the Python API.

Member Author

It fails in the Git workflow ... maybe you can fix that.

Contributor

Our other recent PRs don't seem to have that problem. @mspruc, what is your take on this?

Contributor

This PR was opened with it already in the initial commit; it is hard for me to comment on it without seeing the stack trace.

Member Author

I'll remove it and run again; maybe the pipeline had an issue.

@mspruc (Contributor) commented Dec 10, 2025

@juripetersen (Contributor)

One general question that remains for me:
@2pk03 What is the overhead needed for users to enable the JSON serialization, if any?

If there is none, why do we even keep offering the legacy serialization? Given that it poses a potential security threat, it should be removed if there is no overhead.

@novatechflow (Member Author)

Tested with Apache Flink: there is no overhead on the source and sink side, it just uses the serializer. In terms of computation and memory, it does add some burden to Wayang's processing when filtering, but I think it's marginal, to be honest.

@mspruc (Contributor) commented Dec 10, 2025

What about our usage of pickle.loads in:

https://github.com/apache/incubator-wayang/blob/d4fa09229e8c7f063212df509cea94dace84bb5e/python/src/pywy/execution/worker.py#L105
?

@2pk03 Do you know if this also needs to be changed?

@novatechflow (Member Author)

> What about our usage of pickle.loads in:
> https://github.com/apache/incubator-wayang/blob/d4fa09229e8c7f063212df509cea94dace84bb5e/python/src/pywy/execution/worker.py#L105
> ?
>
> @2pk03 Do you know if this also needs to be changed?

The pickle.loads(decoded_udf) call mirrors how PySpark workers hydrate the UDF sent from the JVM. That payload comes from our own driver, right?

…ationMode and SparkObjectFileSink.encodeBuffer to inline the BytesWritable construction for readability
@mspruc (Contributor) commented Dec 10, 2025

Yeah, this is from our own driver; by that reasoning it should be fine.
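Even granting that the payload only ever comes from Wayang's own driver, a cheap defense-in-depth option (a sketch only, not part of this PR) would be to replace the bare pickle.loads with a restricted unpickler that allow-lists the globals the worker actually expects, following the pattern from the Python pickle documentation:

```python
import builtins
import io
import pickle

# Illustrative allow-list; a real worker would list exactly the
# globals its UDF payloads legitimately reference.
SAFE_BUILTINS = {"range", "complex", "set", "frozenset", "tuple", "list", "dict"}

class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Only permit allow-listed builtins; any other global lookup
        # (e.g. an os.system gadget chain) is rejected outright.
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")

def restricted_loads(data: bytes):
    """Drop-in replacement for pickle.loads with restricted globals."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain data structures round-trip unchanged, since they never trigger a global lookup, while a payload that tries to import an executable callable fails to unpickle.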

@juripetersen merged commit b893e2c into apache:main Dec 11, 2025
4 checks passed
3 participants