[FEA] Add support for Spark 4.1.1 [databricks] #14120
gerashegalov merged 63 commits into NVIDIA:release/26.02 from
Conversation
…rk 410 shim. Part of NVIDIA#14036
Use delta-stub instead of delta-40x for Spark 4.1.0 because io.delta:delta-spark is not yet compatible with Spark 4.1.0. CheckpointFileManager moved packages in Spark 4.1.0. Contributes to NVIDIA#14119
…change In Spark 4.1.0, the AtomicReplaceTableAsSelectExec.invalidateCache callback signature changed from (TableCatalog, Identifier) => Unit to (TableCatalog, Table, Identifier) => Unit. Create shims to handle this API change:
- spark400/InvalidateCacheShims.scala for Spark 4.0.x (2-arg callback)
- spark410/InvalidateCacheShims.scala for Spark 4.1.0+ (3-arg callback)
- spark410/GpuAtomicReplaceTableAsSelectExec.scala for the 4.1.0+ exec
Contributes to NVIDIA#14119
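The shim pattern above can be sketched in a self-contained way. Note that TableCatalog, Table, and Identifier below are stand-in types (the real Spark classes are not available here), and the class names only mirror the files listed above; this is an illustration of adapting a 2-arg vs. 3-arg callback behind one interface, not the plugin's actual code.

```scala
object InvalidateCacheShimSketch {
  // Stand-ins for the real Spark catalog types.
  case class TableCatalog(name: String)
  case class Table(name: String)
  case class Identifier(id: String)

  // Common interface the GPU exec codes against, using the widest (3-arg) shape.
  trait InvalidateCacheShims {
    def invalidate(catalog: TableCatalog, table: Table, ident: Identifier): Unit
  }

  // Spark 4.0.x: the underlying callback takes (TableCatalog, Identifier),
  // so the Table argument is simply dropped when delegating.
  class Spark400Shims(cb: (TableCatalog, Identifier) => Unit)
      extends InvalidateCacheShims {
    def invalidate(catalog: TableCatalog, table: Table, ident: Identifier): Unit =
      cb(catalog, ident)
  }

  // Spark 4.1.0+: the callback takes all three arguments.
  class Spark410Shims(cb: (TableCatalog, Table, Identifier) => Unit)
      extends InvalidateCacheShims {
    def invalidate(catalog: TableCatalog, table: Table, ident: Identifier): Unit =
      cb(catalog, table, ident)
  }
}
```

Callers always pass all three values; each per-version shim forwards only what its Spark release's callback accepts.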
Contributes to NVIDIA#14056
This draft PR aims to make the Spark 410 build pass to unblock tasks for other co-workers. Notes:
TODOs:
Are we going to change to support the spark-4.1.1 shim instead? @res-life Uploaded the spark-4.1.1 bin to the internal Artifactory; feel free to trigger the CI for testing when your change is ready, thanks!
Signed-off-by: Chong Gao <res_life@163.com>
sql-plugin/src/main/java/com/nvidia/spark/rapids/HashedPriorityQueue.java
Thanks @res-life for putting up the PR. Looking into it. Scala 2.12 builds are failing. Could you please fix these?
Yes, Scala 2.12 has some regressions; they are minor. Next steps:
Current status: unit tests passed for Spark 4.1 & Scala 2.13
Signed-off-by: Chong Gao <res_life@163.com>
- Modified the buildall script to ensure the MVN variable is correctly exported with options.
- Moved the user-facing ParquetCachedBatchSerializer class to sql-plugin-api.
- Updated integration test requirements to include pytz.
…ntShims setup in ParquetCachedBatchSerializer
build |
Signed-off-by: Chong Gao <res_life@163.com>
build |
closes #14056
closes #14105
closes #14104
closes #14107
closes #14150
closes #14036
closes #14111
closes #14103
closes #14112
closes #14113
closes #14114
closes #14115
Description
Adds initial support for the Spark 4.1.1 shim, with the following API changes handled:
API Changes in Spark 4.1.1
- StoragePartitionJoinParams package change - moved from org.apache.spark.sql.execution.datasources.v2 to org.apache.spark.sql.execution.joins
- MAX_BROADCAST_TABLE_BYTES removal - constant removed from BroadcastExchangeExec, now configurable via conf.maxBroadcastTableSizeInBytes
- WindowInPandasExec renamed - renamed to ArrowWindowPythonExec
- TimeAdd renamed - renamed to TimestampAddInterval
- FileStreamSink/MetadataLogFileIndex package change - moved to org.apache.spark.sql.execution.streaming.sinks and org.apache.spark.sql.execution.streaming.runtime
- ParquetColumnVector constructor change - removed the memoryMode parameter
- SQLConf.getConf return type change - changed from String to Enum for certain configurations
- ExpressionWithRandomSeed trait addition - added the withShiftedSeed method requirement
- SpecializedGetters trait additions - added getGeography and getGeometry methods
- AtomicReplaceTableAsSelectExec.invalidateCache callback change - changed from a 2-arg to a 3-arg signature
- OneRowRelationExec - new exec; added a GPU version
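Several of the changes above are plain class renames (e.g. TimeAdd becoming TimestampAddInterval). A common way to keep version-specific names out of shared code is to have each shim expose the concrete class to match on. The sketch below is self-contained and hypothetical: the two local classes only stand in for the real Spark expressions, and the shim objects are illustrative, not the plugin's actual shim layer.

```scala
object RenameShimSketch {
  // Stand-ins for the renamed Spark expression class.
  class TimeAdd              // name used through Spark 4.0.x
  class TimestampAddInterval // name used from Spark 4.1.1

  // Each per-version shim reports which concrete class shared code should
  // key its replacement rules on.
  trait SparkShims {
    def timeAddClass: Class[_]
  }
  object Spark400Shims extends SparkShims {
    val timeAddClass: Class[_] = classOf[TimeAdd]
  }
  object Spark411Shims extends SparkShims {
    val timeAddClass: Class[_] = classOf[TimestampAddInterval]
  }
}
```

Shared code then asks the active shim for `timeAddClass` instead of naming either class directly, so only the shim source sets need to change when Spark renames the expression.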
Delta Lake Status
Delta Lake support is excluded for Spark 4.1.1 because io.delta:delta-spark is not yet compatible with Spark 4.1.1 (CheckpointFileManager moved packages). Using delta-stub instead.
See: #14119
Testing
mvn clean package -f scala2.13/ -DskipTests -Dbuildver=410 -T18
Checklists