
[RFC] RayDP 2.0: Migration to Spark 4.1 & Java 17 #459

@rexminnis


1. Executive Summary

This proposal outlines the roadmap for RayDP 2.0, a major architectural upgrade designed to support the next generation of big data infrastructure. The primary goal is to migrate the core runtime to Apache Spark 4.1.1, enforcing Java 17 and Scala 2.13. This modernization addresses the deprecation of older JVMs in the Spark ecosystem and aligns RayDP with the performance and security standards of 2025+.

2. Motivation

  • Spark 4.0 Paradigm Shift: Apache Spark 4.0 has officially dropped support for Scala 2.12 and Java 8/11. To ensure RayDP remains compatible with modern data stacks, we must align with these upstream requirements.
  • Performance & Security: Moving to Java 17 (LTS) unlocks significant garbage collection improvements (ZGC), better container awareness, and enhanced security features required by enterprise deployments.
  • Technical Debt Resolution: The previous build system relied on legacy setup.py behaviors and Maven plugins incompatible with the Java module system (JPMS). This overhaul modernizes the entire build chain.

3. Technical Specification

3.1 Core Dependency Matrix

RayDP 2.0 introduces a strict upgrade to the dependency matrix.

| Component | RayDP 1.x (Legacy) | RayDP 2.0 (New) | Impact |
|-----------|--------------------|----------------|--------|
| Java      | 8 / 11             | 17 (Strict)    | Requires environment updates on all Ray nodes. |
| Spark     | 3.x                | 4.1.1          | Major internal API drift (Scheduler, SQL, RPC). |
| Scala     | 2.12               | 2.13           | Incompatible binary interface; requires recompilation. |
| Ray       | 2.x                | 2.53.0+        | Aligned with recent Ray releases. |
| Python    | 3.6+               | 3.10+          | Dropped support for EOL Python versions. |
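The Python row of this matrix can be enforced at import time. The sketch below is a hypothetical guard (the function name and error message are illustrative; RayDP's actual check may differ):

```python
import sys

def check_raydp2_python(version_info=sys.version_info):
    """Illustrative guard for the RayDP 2.0 matrix: Python must be 3.10+."""
    if tuple(version_info[:2]) < (3, 10):
        raise RuntimeError(
            "RayDP 2.0 requires Python 3.10+, found "
            f"{version_info[0]}.{version_info[1]}"
        )
    return True
```

A guard like this fails fast with a clear message instead of surfacing later as an obscure syntax or wheel-resolution error.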

3.2 Architectural Changes

A. The Shim Layer Strategy

Spark 4.1 introduces aggressive encapsulation and API removals. To handle this, we have restructured the core/shims module:

  • Legacy Shims Disabled: Modules for Spark 3.2–3.5 are disabled in the build reactor.
  • New Shim (raydp-shims-spark411): A dedicated module implementing the SparkShims interface for the 4.1.x line.
  • Package-Private Accessors: Due to stricter visibility in Spark 4.x, we introduced helper classes in the org.apache.spark namespace:
    • Spark411Helper: Handles TaskContextImpl instantiation and CoarseGrainedExecutorBackend creation.
    • Spark411SQLHelper: Bridges ArrowConverters for toDataFrame operations, handling the new signature requirements for Arrow batch conversion.

B. Scheduler Backend Refactoring

The RayCoarseGrainedSchedulerBackend has been significantly refactored:

  • Removed Deprecated APIs: Calls to Utils.sparkJavaOpts (removed in Spark 4.x) have been replaced with direct configuration handling.
  • Constructor Updates: Adapted to the new HadoopDelegationTokenManager and CoarseGrainedSchedulerBackend signatures.
  • Scala 2.13 Compliance: Migrated all collection conversions from JavaConverters to CollectionConverters and handled explicit boxing for primitive types.

3.3 Build System Overhaul

  • PEP 517 Compliance: The Python build system now uses pyproject.toml as the source of truth for dependencies and metadata.
  • Editable Installs: setup.py was rewritten to support pip install -e . by correctly copying compiled JARs from core/ into the raydp/jars source tree during development. This fixes runtime IndexError and ClassNotFoundException issues during local testing.
  • Maven Modernization: Updated maven-compiler-plugin, scala-maven-plugin, and scalatest to versions compatible with Java 17 modules.
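The editable-install JAR copy described above can be sketched roughly as follows; the exact glob pattern and directory layout used by setup.py are assumptions here:

```python
import shutil
from pathlib import Path

def copy_jars(core_dir, dest_dir):
    """Copy compiled JARs from the Maven build output into the Python
    source tree so that `pip install -e .` can find them at runtime.
    (Sketch only; the real setup.py layout may differ.)"""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for jar in Path(core_dir).glob("**/target/*.jar"):
        shutil.copy2(jar, dest / jar.name)
        copied.append(jar.name)
    if not copied:
        raise FileNotFoundError(
            f"No JARs found under {core_dir}; run `mvn clean package` first"
        )
    return sorted(copied)
```

Failing loudly when no JARs exist is what turns the old silent `ClassNotFoundException` at runtime into an actionable build-time error.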

4. Breaking Changes

Users upgrading to RayDP 2.0 must be aware of the following:

  1. Java 8/11 is no longer supported. Attempting to run RayDP 2.0 on older JVMs will result in UnsupportedClassVersionError.
  2. Spark 3.x is no longer supported. The codebase is not backward compatible.
  3. Scala 2.12 is no longer supported. Custom UDFs or libraries compiled against Scala 2.12 will cause binary incompatibility errors.
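To see why item 1 fails hard: classes compiled for Java 17 carry class-file major version 61, and older JVMs refuse to load anything above their own ceiling (Java 8 reads up to major 52, Java 11 up to 55), raising UnsupportedClassVersionError. A small sketch for inspecting a .class file's version (helper names are illustrative, not RayDP API):

```python
import struct

# Class-file major versions: Java 8 -> 52, Java 11 -> 55, Java 17 -> 61.
JAVA_17_MAJOR = 61

def class_file_major(data: bytes) -> int:
    """Return the class-file major version from raw .class bytes.
    Layout: 4-byte magic 0xCAFEBABE, 2-byte minor, 2-byte major."""
    magic, _minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major

def requires_java17(data: bytes) -> bool:
    """True if loading this class on Java 8/11 would raise
    UnsupportedClassVersionError."""
    return class_file_major(data) >= JAVA_17_MAJOR
```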

5. Verification Plan

We have established a rigorous verification suite (examples/test_raydp_sanity.py) to validate the migration:

  • [x] Cluster Initialization: Verifies that RayDPSparkMaster and RayDPExecutor actors start successfully on Ray.
  • [x] Network Binding: Validates that Spark Driver and Executors can communicate, specifically addressing macOS split-brain networking (localhost vs. LAN IP) via explicit binding.
  • [x] Job Execution: Runs standard Spark actions (count, range) to verify the RPC layer.
  • [x] Data Transfer (Spark → Ray): Validates ray.data.from_spark() using the updated ObjectStoreWriter.
  • [x] Data Transfer (Ray → Spark): Validates ds.to_spark() using the new Spark411SQLHelper to reconstruct DataFrames from Arrow batches.

6. Migration Guide

To upgrade to RayDP 2.0:

  1. Upgrade Java: Install OpenJDK 17 on your local machine and all cluster nodes.
  2. Set JAVA_HOME: Ensure JAVA_HOME points to the JDK 17 installation.
  3. Update Python Deps: pip install "pyspark>=4.1.1" "ray>=2.53.0" (quote the specifiers so the shell does not interpret >= as a redirect).
  4. Rebuild: Run mvn clean package in core/ and pip install . in python/.
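Steps 1–2 can be sanity-checked with a small script before rebuilding. The parsing helper below is illustrative, not part of RayDP; it handles both the legacy ("1.8.0_392") and modern ("17.0.9") version schemes printed by `java -version`:

```python
import re
import subprocess

def java_major_version(banner_line: str) -> int:
    """Parse the major version out of a `java -version` banner line."""
    m = re.search(r'version "([^"]+)"', banner_line)
    if not m:
        raise ValueError(f"unrecognized banner: {banner_line!r}")
    parts = m.group(1).split(".")
    # Legacy scheme "1.8.0_x" -> 8; modern scheme "17.0.9" / "21" -> 17 / 21.
    return int(parts[1]) if parts[0] == "1" else int(parts[0].split("-")[0])

def check_java17() -> int:
    """Run `java -version` and fail fast unless it reports Java 17+."""
    banner = subprocess.run(
        ["java", "-version"], capture_output=True, text=True
    ).stderr.splitlines()[0]
    major = java_major_version(banner)
    if major < 17:
        raise RuntimeError(f"RayDP 2.0 needs Java 17+, found Java {major}")
    return major
```

Note that `java -version` prints its banner to stderr, not stdout, which is a common source of confusion when scripting this check.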
