## 1. Executive Summary
This proposal outlines the roadmap for RayDP 2.0, a major architectural upgrade designed to support the next generation of big data infrastructure. The primary goal is to migrate the core runtime to Apache Spark 4.1.1, enforcing Java 17 and Scala 2.13. This modernization addresses the deprecation of older JVMs in the Spark ecosystem and aligns RayDP with the performance and security standards of 2025+.
## 2. Motivation
- Spark 4.0 Paradigm Shift: Apache Spark 4.0 has officially dropped support for Scala 2.12 and Java 8/11. To ensure RayDP remains compatible with modern data stacks, we must align with these upstream requirements.
- Performance & Security: Moving to Java 17 (LTS) unlocks significant garbage collection improvements (ZGC), better container awareness, and enhanced security features required by enterprise deployments.
- Technical Debt Resolution: The previous build system relied on legacy `setup.py` behaviors and Maven plugins incompatible with the Java Platform Module System (JPMS). This overhaul modernizes the entire build chain.
## 3. Technical Specification
### 3.1 Core Dependency Matrix
RayDP 2.0 introduces a strict upgrade to the dependency matrix.
| Component | RayDP 1.x (Legacy) | RayDP 2.0 (New) | Impact |
|---|---|---|---|
| Java | 8 / 11 | 17 (Strict) | Requires environment updates on all Ray nodes. |
| Spark | 3.x | 4.1.1 | Major internal API drift (Scheduler, SQL, RPC). |
| Scala | 2.12 | 2.13 | Incompatible binary interface; requires recompilation. |
| Ray | 2.x | 2.53.0+ | Aligned with recent Ray releases. |
| Python | 3.6+ | 3.10+ | Dropped support for EOL Python versions. |
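The matrix above can be enforced with a pre-flight check before initializing a cluster. The sketch below is illustrative (the function name and signature are assumptions, not part of RayDP's API); the thresholds come directly from the table:

```python
# Hypothetical pre-flight check mirroring the RayDP 2.0 dependency matrix.
# Version thresholds are taken from the table above.

def check_prerequisites(java_major: int, python_version: tuple,
                        spark_version: tuple, ray_version: tuple) -> list:
    """Return a list of human-readable problems; an empty list means the
    environment satisfies the RayDP 2.0 matrix."""
    problems = []
    if java_major != 17:  # Java 17 is a strict requirement
        problems.append(f"Java {java_major} found; RayDP 2.0 requires Java 17")
    if python_version < (3, 10):
        problems.append("Python 3.10+ is required")
    if spark_version < (4, 1, 1):
        problems.append("Spark 4.1.1+ is required")
    if ray_version < (2, 53, 0):
        problems.append("Ray 2.53.0+ is required")
    return problems
```

Running such a check on every Ray node before `raydp.init_spark` turns an opaque JVM or pickle failure into an actionable error message.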
### 3.2 Architectural Changes
#### A. The Shim Layer Strategy
Spark 4.1 introduces aggressive encapsulation and API removals. To handle this, we have restructured the `core/shims` module:
- Legacy Shims Disabled: Modules for Spark 3.2–3.5 are disabled in the build reactor.
- New Shim (`raydp-shims-spark411`): A dedicated module implementing the `SparkShims` interface for the 4.1.x line.
- Package-Private Accessors: Due to stricter visibility in Spark 4.x, we introduced helper classes in the `org.apache.spark` namespace:
  - `Spark411Helper`: Handles `TaskContextImpl` instantiation and `CoarseGrainedExecutorBackend` creation.
  - `Spark411SQLHelper`: Bridges `ArrowConverters` for `toDataFrame` operations, handling the new signature requirements for Arrow batch conversion.
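The essence of the shim strategy is dispatching on the Spark version at runtime. The actual dispatch lives on the JVM side, but the selection logic can be sketched in a few lines (the function and table below are illustrative; only the module name `raydp-shims-spark411` comes from this proposal):

```python
# Illustrative shim selection: map a Spark version to the shim module
# implementing the SparkShims interface for that release line. In
# RayDP 2.0 only the 4.1.x shim is active; the Spark 3.x shims are
# disabled in the build reactor.

_SHIMS = {
    (4, 1): "raydp-shims-spark411",  # new module introduced by this proposal
}

def pick_shim(spark_version: str) -> str:
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    try:
        return _SHIMS[(major, minor)]
    except KeyError:
        raise RuntimeError(
            f"Spark {spark_version} is not supported by RayDP 2.0; "
            "only the 4.1.x line has an active shim")
```

Keeping version-specific code behind a single lookup like this is what lets future Spark lines (4.2, 4.3, ...) be added as new modules without touching the core.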
#### B. Scheduler Backend Refactoring
The `RayCoarseGrainedSchedulerBackend` has been significantly refactored:
- Removed Deprecated APIs: Calls to `Utils.sparkJavaOpts` (removed in Spark 4.x) have been replaced with direct configuration handling.
- Constructor Updates: Adapted to the new `HadoopDelegationTokenManager` and `CoarseGrainedSchedulerBackend` signatures.
- Scala 2.13 Compliance: Migrated all collection conversions from `JavaConverters` to `CollectionConverters` and handled explicit boxing for primitive types.
### 3.3 Build System Overhaul
- PEP 517 Compliance: The Python build system now uses `pyproject.toml` as the source of truth for dependencies and metadata.
- Editable Installs: `setup.py` was rewritten to support `pip install -e .` by correctly copying compiled JARs from `core/` into the `raydp/jars` source tree during development. This fixes runtime `IndexError` and `ClassNotFoundException` issues during local testing.
- Maven Modernization: Updated `maven-compiler-plugin`, `scala-maven-plugin`, and `scalatest` to versions compatible with Java 17 modules.
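A PEP 517 layout of this shape is what the bullet points above imply. The fragment below is a minimal sketch, not the exact file shipped with RayDP; only the `pyspark`/`ray` pins and the Python floor come from this proposal:

```toml
# Illustrative pyproject.toml sketch (backend choice and setuptools pin
# are assumptions; dependency pins come from the migration guide below).
[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "raydp"
requires-python = ">=3.10"
dependencies = [
    "pyspark>=4.1.1",
    "ray>=2.53.0",
]
```

With metadata in `pyproject.toml`, `setup.py` shrinks to the custom build steps only (the JAR-copying logic for editable installs), which is exactly the separation PEP 517 intends.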
## 4. Breaking Changes
Users upgrading to RayDP 2.0 must be aware of the following:
- Java 8/11 is no longer supported. Attempting to run RayDP 2.0 on older JVMs will result in `UnsupportedClassVersionError`.
- Spark 3.x is no longer supported. The codebase is not backward compatible.
- Scala 2.12 is no longer supported. Custom UDFs or libraries compiled against Scala 2.12 will cause binary incompatibility errors.
## 5. Verification Plan
We have established a rigorous verification suite (`examples/test_raydp_sanity.py`) to validate the migration:
- [x] Cluster Initialization: Verifies that `RayDPSparkMaster` and `RayDPExecutor` actors start successfully on Ray.
- [x] Network Binding: Validates that Spark Driver and Executors can communicate, specifically addressing macOS split-brain networking (localhost vs. LAN IP) via explicit binding.
- [x] Job Execution: Runs standard Spark actions (`count`, `range`) to verify the RPC layer.
- [x] Data Transfer (Spark → Ray): Validates `ray.data.from_spark()` using the updated `ObjectStoreWriter`.
- [x] Data Transfer (Ray → Spark): Validates `ds.to_spark()` using the new `Spark411SQLHelper` to reconstruct DataFrames from Arrow batches.
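The network-binding check guards against a macOS failure mode where the driver advertises its LAN IP while executors connect via localhost (or vice versa). The fix is to pin both the bind address and the advertised address to one value. `spark.driver.bindAddress` and `spark.driver.host` are standard Spark configuration keys; the helper wrapping them is illustrative:

```python
def explicit_binding_conf(host_ip: str) -> dict:
    """Build Spark configs that pin driver networking to one address,
    avoiding the localhost-vs-LAN-IP split-brain described above."""
    return {
        "spark.driver.bindAddress": host_ip,  # interface the driver listens on
        "spark.driver.host": host_ip,         # address advertised to executors
    }

# Illustrative call site: these would be passed as extra Spark configs
# when initializing RayDP (e.g., via raydp.init_spark's configs argument).
```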
## 6. Migration Guide
To upgrade to RayDP 2.0:
- Upgrade Java: Install OpenJDK 17 on your local machine and all cluster nodes.
- Set JAVA_HOME: Ensure `JAVA_HOME` points to the JDK 17 installation.
- Update Python Deps: `pip install "pyspark>=4.1.1" "ray>=2.53.0"` (quote the specifiers so the shell does not treat `>=` as a redirection).
- Rebuild: Run `mvn clean package` in `core/` and `pip install .` in `python/`.