Description
Backend
VL (Velox)
Bug description
We want to schedule I/O-bound tasks, such as stages that contain a table scan, on CPU nodes, and computation-intensive tasks on GPU nodes.
Spark can already schedule stages by resource profile, as described in https://spark.apache.org/docs/latest/configuration.html#custom-resource-scheduling-and-configuration-overview. In Gluten, off-heap/on-heap memory allocation is already adjusted through ResourceProfile.
This script describes how to set up the GPU host environment. It has been run on the IBM internal AMI Linux image, so if you use the IBM pipeline pipeline-create-dev-vm and select a GPU node such as g4dn.xlarge, the environment is already prepared and you do not need to run the script.
https://raw.githubusercontent.com/jinchengchenghh/gluten/cudf_script/dev/start_cudf_amazon.sh
Note: The environment has been upgraded to CUDA 13.1 because of a cudf build issue, but the script still installs CUDA 12.8, so it is outdated.
These documents describe how to set up YARN on a GPU node.
https://docs.nvidia.com/spark-rapids/user-guide/23.10/getting-started/yarn-gpu.html
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html
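For reference, the Spark-side settings that go with a GPU-enabled YARN cluster can be sketched as below. This is a minimal sketch, not the configuration used in this issue: the `spark.executor.resource.gpu.*` and `spark.task.resource.gpu.amount` keys are standard Spark custom-resource settings, while the object name `GpuConfSketch` and the discovery script path are assumptions made for illustration.

```scala
import org.apache.spark.SparkConf

object GpuConfSketch {
  // Placeholder path (assumption): replace with a script that prints the
  // GPU addresses of the host in the JSON format Spark expects.
  val discoveryScript = "/opt/spark/getGpusResources.sh"

  // Builds a SparkConf that asks for one GPU per executor and one per task.
  def gpuConf(): SparkConf =
    new SparkConf(false) // false: do not load spark.* system properties
      .set("spark.executor.resource.gpu.amount", "1")
      .set("spark.task.resource.gpu.amount", "1")
      .set("spark.executor.resource.gpu.discoveryScript", discoveryScript)

  def main(args: Array[String]): Unit =
    gpuConf().getAll.sorted.foreach { case (k, v) => println(s"$k=$v") }
}
```

The same keys can be passed with `--conf` on `spark-submit`; the values here apply to the whole application rather than to a single stage.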
The GPU document describes how to build Gluten with GPU support.
https://github.com/apache/incubator-gluten/blob/main/docs/get-started/VeloxGPU.md
The existing off-heap/on-heap memory allocation uses ResourceProfile; we should set the profile in a similar way to require 1 GPU. Spark currently cannot set the number of cores through a resource profile; this feature is under development.
#8209
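A resource profile that requires 1 GPU could look roughly like the following. This is a minimal sketch using Spark's public `org.apache.spark.resource` API, not the Gluten implementation: the object name `GpuProfileSketch` and the function `gpuStageProfile` are hypothetical, and the resource name `"gpu"` is assumed to match the name configured on the cluster. Cores and memory are deliberately left out.

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfile, ResourceProfileBuilder, TaskResourceRequests}

object GpuProfileSketch {
  // Builds a profile that requests one GPU per executor and one GPU per task.
  // Hypothetical helper; how Gluten would attach it to a stage is not shown.
  def gpuStageProfile(): ResourceProfile = {
    val execReqs = new ExecutorResourceRequests().resource("gpu", 1)
    val taskReqs = new TaskResourceRequests().resource("gpu", 1.0)
    new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()
  }

  def main(args: Array[String]): Unit = {
    val rp = gpuStageProfile()
    // Print the requested amounts for inspection.
    rp.executorResources.foreach { case (name, req) => println(s"executor $name: ${req.amount}") }
    rp.taskResources.foreach { case (name, req) => println(s"task $name: ${req.amount}") }
  }
}
```

With plain Spark, an RDD that belongs to a GPU stage can be tagged with `rdd.withResources(GpuProfileSketch.gpuStageProfile())`; stages without the tag keep the default profile.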
We could use TPCDS q95 to test.
The query runs successfully on YARN, but if we set up the environment according to https://docs.nvidia.com/spark-rapids/user-guide/23.10/getting-started/yarn-gpu.html, the query hangs. I also tried standalone mode earlier, and it hangs there as well.
Gluten version
No response
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs