@novatechflow
Member

Port object-file IO onto a safe codec

  • add ObjectFileSerializationMode + helper to switch between legacy Java serialization and a JSON-based format, with unit tests covering both paths
  • rework Java/Flink/Spark object-file sources/sinks (and Flink’s output format) to use the shared serializer, log deprecation warnings for the legacy mode, and stop calling raw ObjectInputStream
  • update Spark’s SequenceFile readers/writers to chunk records via RDD transformations and share logic with the new serializer

Motivation: the legacy path deserializes untrusted SequenceFile payloads via ObjectInputStream.readObject, letting an attacker ship a gadget chain that runs arbitrary bytecode inside the Wayang JVM. That RCE can be used to execute system commands, exfiltrate data, or tamper with jobs, so we move to JSON by default and require explicit opt-in for the old codec.
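
The opt-in rule described in the motivation could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: only `ObjectFileSerializationMode` and `LEGACY_JAVA_SERIALIZATION` appear in the diff, while the `JSON` constant name, the `resolve` helper, and the `"legacy"` / `"json"` setting values are hypothetical.

```java
import java.util.Locale;

// Hypothetical stand-in for the enum added by this PR.
enum ObjectFileSerializationMode {
    LEGACY_JAVA_SERIALIZATION,
    JSON; // assumed constant name for the JSON-based format

    // Hypothetical helper: maps a user-supplied setting to a mode.
    // A missing or blank setting yields the safe JSON format; the legacy
    // codec is only selected when the user asks for it explicitly.
    static ObjectFileSerializationMode resolve(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return JSON;
        }
        String value = configured.trim().toLowerCase(Locale.ROOT);
        switch (value) {
            case "legacy":
                return LEGACY_JAVA_SERIALIZATION;
            case "json":
                return JSON;
            default:
                // Fail loudly instead of silently falling back to either codec.
                throw new IllegalArgumentException(
                        "Unknown object-file serialization mode: " + configured);
        }
    }
}
```

With this shape, `ObjectFileSerializationMode.resolve(null)` selects the JSON format, and the legacy codec is reachable only through an explicit `"legacy"` setting.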

========================== IMPORTANT =========================
Before we merge, please test the patch, ideally in a setup with source and sink platforms. Compilation and tests went through in my setup.

mvn -pl wayang-commons/wayang-basic -Dtest=org.apache.wayang.basic.operators.ObjectFileSerializationTest test
[INFO] Building Wayang Basic 1.1.1-SNAPSHOT
[WARNING] 2 problems were encountered while building the effective model for org.apache.yetus:audience-annotations:jar:0.5.0 during dependency collection step for project (use -X to see details)
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
[INFO] 
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.683 s
[INFO] Finished at: 2025-12-08T19:47:45+01:00
[INFO] ------------------------------------------------------------------------

@zkaoudi
Contributor

zkaoudi commented Dec 8, 2025

There are individual operator tests for sources and sinks in each platform. See, e.g.:
https://github.com/apache/incubator-wayang/blob/afb8e413c13b20c5094c818ec2d0b9cf92d499f0/wayang-platforms/wayang-java/src/test/java/org/apache/wayang/java/operators/JavaObjectFileSourceTest.java

Which setup would you suggest for testing besides these tests?

@novatechflow
Member Author

@zkaoudi - did a local test with Flink, no issues - let’s merge if no objections. Can you please check if we might have a collision with #639?


protected final Class<T> tClass;

private ObjectFileSerializationMode serializationMode = ObjectFileSerializationMode.LEGACY_JAVA_SERIALIZATION;
Contributor

If deprecated, shouldn't this utilize the JSON variant by default?


private final Class<T> tClass;

private ObjectFileSerializationMode serializationMode = ObjectFileSerializationMode.LEGACY_JAVA_SERIALIZATION;
Contributor

If deprecated, shouldn't this utilize the JSON variant by default?


private transient DataOutputViewStreamWrapper outView;

private ObjectFileSerializationMode serializationMode = ObjectFileSerializationMode.LEGACY_JAVA_SERIALIZATION;
Contributor

If deprecated, shouldn't this utilize the JSON variant by default?

Member Author

We should not switch off functionality immediately; instead, for backwards compatibility, we should give users a certain amount of time to move to the new feature. Thoughts?

Contributor

I think it depends on the consequences for the user. If they have to implement further things to make sure their data types implement Serializable, I agree. If this is a change without any overhead for the user, fixing a potential security threat, then it should be the new default.

//TODO: remove the set parallelism 1
DataSetChannel.Instance input = (DataSetChannel.Instance) inputs[0];
ObjectFileSerializationMode serializationMode = this.getSerializationMode();
if (serializationMode == ObjectFileSerializationMode.LEGACY_JAVA_SERIALIZATION) {
Contributor

This is repeated in every implementation extending ObjectFileSink, so it should just be moved there and called here for checking the serializationMode
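
One way the suggested refactoring might look is sketched below. It is a simplified, hypothetical illustration rather than the actual Wayang classes: `ObjectFileSinkSketch`, `FlinkObjectFileSinkSketch`, `warnIfLegacy`, and the `JSON` constant are assumed names, and `System.err` stands in for the real logger. Only `LEGACY_JAVA_SERIALIZATION`, `getSerializationMode`, and the `LEGACY_WARNING_EMITTED` flag are taken from the diff.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the enum added by this PR; JSON is an assumed name.
enum ObjectFileSerializationMode { LEGACY_JAVA_SERIALIZATION, JSON }

// Simplified stand-in for the shared sink base class.
abstract class ObjectFileSinkSketch {

    // Shared by all platform sinks, so the warning is printed once per JVM.
    private static final AtomicBoolean LEGACY_WARNING_EMITTED = new AtomicBoolean(false);

    private ObjectFileSerializationMode serializationMode =
            ObjectFileSerializationMode.LEGACY_JAVA_SERIALIZATION;

    ObjectFileSerializationMode getSerializationMode() {
        return this.serializationMode;
    }

    void setSerializationMode(ObjectFileSerializationMode mode) {
        this.serializationMode = mode;
    }

    // The check that each platform sink currently repeats, hoisted into the
    // base class. Returns true only if this call printed the warning.
    protected boolean warnIfLegacy() {
        if (this.serializationMode != ObjectFileSerializationMode.LEGACY_JAVA_SERIALIZATION) {
            return false;
        }
        if (!LEGACY_WARNING_EMITTED.compareAndSet(false, true)) {
            return false; // already warned earlier in this JVM
        }
        System.err.println("Legacy Java serialization is deprecated; consider the JSON mode.");
        return true;
    }
}

// A platform sink then only calls the shared helper.
class FlinkObjectFileSinkSketch extends ObjectFileSinkSketch {
    boolean evaluate() {
        return this.warnIfLegacy();
    }
}
```

If warning on every call is preferred over warning once, the `AtomicBoolean` guard can simply be dropped from `warnIfLegacy` without touching the platform sinks.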


private static final Logger LOGGER = LogManager.getLogger(FlinkObjectFileSink.class);

private static final AtomicBoolean LEGACY_WARNING_EMITTED = new AtomicBoolean(false);
Contributor

This field could also be moved to the ObjectFileSink to make it less redundant.

Member Author

Yeah, but (I know ...) from past experience it's better to warn every time a soon-to-be-deprecated function is called. We can refactor it later; the main goal is to close the already known security bug.

@novatechflow
Member Author

Created #640 - my fork got messy with merges.
