[GLUTEN-11524][VL][WIP] Support adjusting stage execution mode #11543

Closed
marin-ma wants to merge 10 commits into apache:main from marin-ma:adjust-execution-mode

marin-ma commented Feb 2, 2026

wip

Related issue: #11524

github-actions bot added the CORE and VELOX labels Feb 2, 2026

github-actions bot commented Feb 2, 2026

Run Gluten Clickhouse CI on x86

marin-ma commented Feb 2, 2026

Steps to test this feature on a local standalone Spark cluster with mocked GPU resources. No actual GPU hardware is required.

  1. Create resource configuration files and discovery scripts for the CPU workers and GPU workers.
  • CPU workers

    cpu.conf

    spark.worker.resource.cpu.amount	1
    spark.worker.resource.cpu.discoveryScript /path/to/cpu.sh
    

    cpu.sh

    #!/usr/bin/env bash
    
    echo '{"name": "cpu", "addresses": ["0"]}'
    
  • GPU workers

    gpu.conf

    spark.worker.resource.gpu.amount	1
    spark.worker.resource.gpu.discoveryScript /path/to/gpu.sh
    

    gpu.sh

    #!/usr/bin/env bash
    
    echo '{"name": "gpu", "addresses": ["1"]}'
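The discovery scripts above only need to print one JSON object with a `name` and an array of `addresses`, which is the format Spark's resource discovery expects. As a quick local check before starting any workers, the sketch below generates both scripts into a scratch directory and prints their output. The scratch directory and the helper name `write_discovery_script` are assumptions made for illustration; the thread itself uses `/path/to/cpu.sh` and `/path/to/gpu.sh`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch location; replace with the real script directory.
conf_dir="$(mktemp -d)"
trap 'rm -rf "${conf_dir}"' EXIT

# Hypothetical helper: writes a mock discovery script for one resource.
#   $1: resource name (cpu or gpu), $2: the single mocked address
write_discovery_script() {
  local name="$1" address="$2"
  cat > "${conf_dir}/${name}.sh" <<EOF
#!/usr/bin/env bash
echo '{"name": "${name}", "addresses": ["${address}"]}'
EOF
  # Assumption: the worker executes the script directly, so it needs the
  # executable bit.
  chmod +x "${conf_dir}/${name}.sh"
}

write_discovery_script cpu 0
write_discovery_script gpu 1

# Run each script once and capture what a worker would read from stdout.
cpu_json="$(bash "${conf_dir}/cpu.sh")"
gpu_json="$(bash "${conf_dir}/gpu.sh")"
echo "${cpu_json}"
echo "${gpu_json}"
```

Single-quoting the JSON keeps the shell from treating the braces and brackets as expansion or glob patterns, so the output does not depend on the working directory or shell options.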
    
  2. Start the Spark master, two CPU workers, and one GPU worker. Each CPU worker has one "cpu" resource, and the GPU worker has one "gpu" resource.
# Start the master.
./sbin/start-master.sh

# Worker 1: GPU worker (mocked "gpu" resource).
export SPARK_WORKER_DIR=/tmp/spark-worker-1
export SPARK_PID_DIR=/tmp/spark-pid-1
./sbin/start-worker.sh spark://localhost:7077 \
  --webui-port 8081 --properties-file /path/to/gpu.conf

# Worker 2: CPU worker.
export SPARK_WORKER_DIR=/tmp/spark-worker-2
export SPARK_PID_DIR=/tmp/spark-pid-2
./sbin/start-worker.sh spark://localhost:7077 \
  --webui-port 8082 --properties-file /path/to/cpu.conf

# Worker 3: CPU worker.
export SPARK_WORKER_DIR=/tmp/spark-worker-3
export SPARK_PID_DIR=/tmp/spark-pid-3
./sbin/start-worker.sh spark://localhost:7077 \
  --webui-port 8083 --properties-file /path/to/cpu.conf
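The three worker launches above differ only in the worker index and the properties file, so they can be wrapped in a small helper. The sketch below is one possible way to do that, not part of the PR. The function name `start_worker` and the `DRY_RUN` and `MASTER_URL` variables are assumptions introduced here; by default it only prints the command it would run, so it can be tried without a Spark installation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed defaults; both variables are hypothetical conveniences.
MASTER_URL="${MASTER_URL:-spark://localhost:7077}"
DRY_RUN="${DRY_RUN:-1}"   # 1: print the command, 0: execute it

# Start one worker with its own work and PID directories.
#   $1: worker index (1, 2, 3, ...), $2: properties file for that worker
start_worker() {
  local index="$1" conf="$2"
  local cmd=(env
    "SPARK_WORKER_DIR=/tmp/spark-worker-${index}"
    "SPARK_PID_DIR=/tmp/spark-pid-${index}"
    ./sbin/start-worker.sh "${MASTER_URL}"
    --webui-port "$((8080 + index))"
    --properties-file "${conf}")
  if [[ "${DRY_RUN}" == "1" ]]; then
    echo "${cmd[*]}"
  else
    # Assumes the current directory is the Spark installation root.
    "${cmd[@]}"
  fi
}

start_worker 1 /path/to/gpu.conf
start_worker 2 /path/to/cpu.conf
start_worker 3 /path/to/cpu.conf
```

Giving each worker a separate `SPARK_PID_DIR` presumably keeps the PID files of several workers on one host from colliding, which is why the original commands export a new value before each launch.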
  3. Run the Spark application with the configurations below added. This starts one executor on each worker node. By default, execution is scheduled onto the CPU worker nodes, and the selected GPU stages (enabled by spark.gluten.auto.adjustStageExecutionMode=true; currently only the join stages) are scheduled onto the GPU worker nodes. If spark.gluten.auto.adjustStageExecutionMode=false is set, all stages that can use cuDF are scheduled onto the GPU worker nodes.
spark.driver.extraJavaOptions "-Dspark.testing=true -Dio.netty.tryReflectionSetAccessible=true"
spark.dynamicAllocation.enabled true
spark.executor.resource.cpu.amount 1
spark.executor.resource.cpu.discoveryScript=/path/to/cpu.sh
spark.gluten.sql.columnar.backend.velox.cudf.enableValidation=false
spark.gluten.sql.columnar.cudf=true
spark.gluten.auto.adjustStageExecutionMode=true
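To make the placement rule in the last step concrete, here is a minimal model of it as a shell function. It only restates the behaviour described in this comment and is not how Gluten implements the decision. The function name `target_resource` and its three boolean arguments are hypothetical, and the handling of stages that cannot use cuDF is an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative model only; NOT Gluten's implementation.
#   $1: value of spark.gluten.auto.adjustStageExecutionMode (true|false)
#   $2: whether the stage is a join stage (true|false)
#   $3: whether the stage can use cuDF at all (true|false)
target_resource() {
  local adjust_mode="$1" is_join="$2" cudf_capable="$3"
  if [[ "${cudf_capable}" != "true" ]]; then
    echo cpu   # assumption: stages that cannot use cuDF stay on CPU workers
  elif [[ "${adjust_mode}" == "true" && "${is_join}" != "true" ]]; then
    echo cpu   # adjust mode on: only join stages move to GPU workers
  else
    echo gpu   # join stage, or adjust mode off with a cuDF-capable stage
  fi
}

target_resource true  true  true    # prints: gpu
target_resource true  false true    # prints: cpu
target_resource false false true    # prints: gpu
target_resource false true  false   # prints: cpu
```

With the setting enabled, only the second argument decides placement for cuDF-capable stages; with it disabled, every cuDF-capable stage lands on a GPU worker.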

marin-ma changed the title from "[WIP] Support adjusting stage execution mode" to "[VL][WIP] Support adjusting stage execution mode" Feb 2, 2026

marin-ma changed the title from "[VL][WIP] Support adjusting stage execution mode" to "[GLUTEN-11524][VL][WIP] Support adjusting stage execution mode" Feb 3, 2026
github-actions bot added the DOCS label Feb 3, 2026
marin-ma force-pushed the adjust-execution-mode branch from 46c83a5 to 9f69f16 February 3, 2026 20:02

marin-ma commented Feb 3, 2026

Verified with TPC-DS q95 on a GPU node with the configurations below:

cat tpcds_parquet.scala | ${SPARK_HOME}/bin/spark-shell \
  --master spark://5da02429f127:7077 --deploy-mode client \
  --executor-cores 4 \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.driver.extraClassPath=${GLUTEN_JAR} \
  --conf spark.executor.extraClassPath=${GLUTEN_JAR} \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=2g \
  --conf spark.gluten.sql.columnar.forceShuffledHashJoin=true \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.driver.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005" \
  --conf spark.executor.extraJavaOptions="-Dio.netty.tryReflectionSetAccessible=true" \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.executor.resource.cpu.amount=1 \
  --conf spark.executor.resource.cpu.discoveryScript=/opt/spark/conf/cpu.sh \
  --conf spark.gluten.sql.columnar.backend.velox.cudf.enableValidation=true \
  --conf spark.gluten.sql.columnar.cudf=true \
  --conf spark.gluten.sql.debug=true \
  --conf spark.gluten.sql.debug.cudf=true \
  --conf spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleOutput=true \
  --conf spark.gluten.sql.columnar.backend.velox.resizeBatches.shuffleInput=false \
  --conf spark.gluten.auto.adjustStageExecutionMode=true \
  --conf spark.log.level=INFO

Started two CPU workers and one GPU worker using the scripts above. At runtime, one executor was started on each worker. Only the 40 tasks of stage 40 were scheduled to the executor on the GPU worker; all other tasks were scheduled to the executors on the two CPU workers. Next, I will try it on a real mixed CPU/GPU cluster.

marin-ma force-pushed the adjust-execution-mode branch from 9f69f16 to 0ca71a7 February 3, 2026 20:32

marin-ma force-pushed the adjust-execution-mode branch from 0ca71a7 to dd922ea February 4, 2026 09:27

marin-ma force-pushed the adjust-execution-mode branch from dd922ea to 227dead February 4, 2026 20:35

marin-ma force-pushed the adjust-execution-mode branch from 227dead to 4d14cd2 February 5, 2026 11:17

marin-ma force-pushed the adjust-execution-mode branch from 4d14cd2 to 3174a67 February 6, 2026 09:36

marin-ma force-pushed the adjust-execution-mode branch from d61a8d6 to c30c76b February 9, 2026 14:30

marin-ma force-pushed the adjust-execution-mode branch from 549a926 to 0cef0ad February 11, 2026 16:06

marin-ma closed this Feb 11, 2026