[VL] GPU and CPU mixed cluster schedule #11524

@jinchengchenghh

Description

Backend

VL (Velox)

Bug description

We want to schedule IO-bound tasks, such as stages that contain a table scan, on CPU nodes, and computation-intensive tasks on GPU nodes.
Spark can already schedule stages by resource profile, as described in https://spark.apache.org/docs/latest/configuration.html#custom-resource-scheduling-and-configuration-overview. In Gluten, off-heap/on-heap memory allocation is already adjusted through ResourceProfile.
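
To make the intended placement rule concrete, here is a minimal sketch in plain Python. It is not Gluten code: the `Stage` type, the `pick_node_type` helper, and the "operator name contains `Scan`" heuristic are all assumptions used only for illustration, and the operator names are sample strings.

```python
from dataclasses import dataclass
from typing import List

# Assumption: an operator whose name contains one of these fragments
# marks the stage as IO bound. This heuristic is illustrative only.
IO_BOUND_MARKERS = ("Scan",)


@dataclass
class Stage:
    """Hypothetical, simplified view of a stage: an id plus operator names."""
    stage_id: int
    operators: List[str]


def pick_node_type(stage: Stage) -> str:
    """Return "cpu" if the stage contains a scan-like operator, else "gpu"."""
    for op in stage.operators:
        if any(marker in op for marker in IO_BOUND_MARKERS):
            return "cpu"
    return "gpu"


# Sample stages: one with a table scan, one that is compute only.
stages = [
    Stage(0, ["FileSourceScanExec", "FilterExec"]),
    Stage(1, ["ShuffledHashJoinExec", "HashAggregateExec"]),
]
placement = {s.stage_id: pick_node_type(s) for s in stages}
print(placement)
```

In this sketch the scan stage is mapped to a CPU node and the join/aggregate stage to a GPU node; a real implementation would make this decision on the physical plan and attach a matching resource profile to each stage.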

This script describes how to set up the GPU host environment. It has been run on the IBM internal AMI Linux image, so if you use the IBM pipeline pipeline-create-dev-vm and select a GPU node such as g4dn.xlarge, the environment is ready and there is no need to run the script.
https://raw.githubusercontent.com/jinchengchenghh/gluten/cudf_script/dev/start_cudf_amazon.sh
Note: The environment has since been upgraded to CUDA 13.1 because of a cuDF build issue, but the script still installs CUDA 12.8, so it is outdated.

These documents describe how to set up YARN on a GPU node.
https://docs.nvidia.com/spark-rapids/user-guide/23.10/getting-started/yarn-gpu.html
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html

The GPU document describes how to build with GPU support:
https://github.com/apache/incubator-gluten/blob/main/docs/get-started/VeloxGPU.md

There is an existing ResourceProfile-based allocation for off-heap/on-heap memory. We should set the profile in a similar way to require 1 GPU. Spark currently cannot set the core number by resource profile; this feature is under development.
#8209
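
As a rough illustration of a profile that requires 1 GPU, the sketch below uses Spark's public stage-level scheduling API from PySpark together with the standard `spark.executor.resource.gpu.*` and `spark.task.resource.gpu.amount` settings. It is not the Gluten implementation, which adjusts the profile internally. The helper names `gpu_resource_confs` and `build_gpu_profile` and the discovery script path are hypothetical.

```python
from typing import Dict, Optional


def gpu_resource_confs(gpus_per_executor: int = 1,
                       gpus_per_task: float = 1.0,
                       discovery_script: Optional[str] = None) -> Dict[str, str]:
    """Cluster-wide Spark settings that request GPUs for executors and tasks."""
    confs = {
        "spark.executor.resource.gpu.amount": str(gpus_per_executor),
        "spark.task.resource.gpu.amount": str(gpus_per_task),
    }
    if discovery_script is not None:
        confs["spark.executor.resource.gpu.discoveryScript"] = discovery_script
    return confs


def build_gpu_profile(discovery_script: str):
    """Build a stage-level ResourceProfile that requires 1 GPU.

    PySpark is imported lazily so this module still loads where PySpark
    is not installed. Call this only inside a running Spark application.
    """
    from pyspark.resource import (ExecutorResourceRequests,
                                  ResourceProfileBuilder,
                                  TaskResourceRequests)

    executor_reqs = ExecutorResourceRequests().resource(
        "gpu", 1, discoveryScript=discovery_script)
    task_reqs = TaskResourceRequests().resource("gpu", 1.0)
    # In PySpark, `build` is a property rather than a method.
    return ResourceProfileBuilder().require(executor_reqs).require(task_reqs).build


# Hypothetical script path; replace it with the one deployed on your nodes.
confs = gpu_resource_confs(discovery_script="/opt/spark/getGpusResources.sh")
for key, value in sorted(confs.items()):
    print(f"--conf {key}={value}")
```

In a Spark application the returned profile could be attached to an RDD with `rdd.withResources(profile)`, so that the stages computing that RDD run on executors that satisfy it.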

We could use TPCDS q95 to test.

The query runs successfully on YARN, but if the environment is set up according to https://docs.nvidia.com/spark-rapids/user-guide/23.10/getting-started/yarn-gpu.html, the query hangs. I also tried standalone mode earlier, and it hangs there as well.

Gluten version

No response

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs
