
New Cluster Scripts #273 (Draft)

misiugodfrey wants to merge 5 commits into main from misiug/SpaceMicePOC

Conversation

@misiugodfrey (Contributor):

Cluster Benchmarking Infrastructure

This PR adds multi-node TPC-H benchmarking support on a new cluster, including CPU and GPU variants, result validation, automated result posting, and sweep tooling.

New scripts

  • run-sweep.sh — Automates running launch-run.sh + post_results.py across multiple node/scale-factor combinations
  • run-presto-benchmarks.sh — Orchestrates the full benchmark lifecycle (setup, coordinator, workers, queries, results collection)
  • pull_ghcr_image.sh / enroot-decompress.sh — Pull container images from ghcr.io and save as .sqsh files, with transparent gzip/zstd decompression support
  • launch-gen-data.sh / gen-tpch-data.slurm — TPC-H data generation jobs
  • launch-analyze-tables.sh / run-analyze-tables.sh / run-analyze-tables.slurm — Hive table analysis jobs
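
The sweep flow described above can be sketched as a nested loop over node counts and scale factors. This is only an illustration: the launcher flags, the log file name, and the parameter lists are assumptions, not the actual run-sweep.sh interface.

```shell
#!/usr/bin/env bash
# Hypothetical sweep loop in the spirit of run-sweep.sh: iterate node
# counts and TPC-H scale factors, launching a run and posting results
# for each combination. Flag names and values below are assumptions.
set -euo pipefail

NODE_COUNTS=(1 2 4)
SCALE_FACTORS=(100 1000)

: > sweep.log   # start a fresh log for this sweep

for nodes in "${NODE_COUNTS[@]}"; do
  for sf in "${SCALE_FACTORS[@]}"; do
    echo "sweep: nodes=${nodes} sf=${sf}" | tee -a sweep.log
    # ./launch-run.sh --nodes "$nodes" --scale-factor "$sf"
    # python3 post_results.py --result-dir "results_${nodes}n_sf${sf}"
  done
done
```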

Benchmark execution (functions.sh, launch-run.sh, run-presto-benchmarks.slurm)

  • NUMA-aware worker placement with optional --no-numa flag for older images
  • CPU benchmark mode (--cpu) — disables cuDF, one worker per node, no GPU allocation
  • Conditional CUDA_VISIBLE_DEVICES and NVIDIA_VISIBLE_DEVICES for GPU vs CPU runs
  • Writable coord_data and per-worker worker_data_N directories to avoid EROFS on read-only squashfs
  • Miniforge bind-mount so Python shebangs resolve inside the coordinator container
  • inject_benchmark_metadata — injects run context (image digest, engine, node/GPU counts, timestamp) into benchmark_result.json on exit
  • collect_results — copies configs and logs into result_dir for archival
  • Configurable nodelist, image names, and output path via CLI flags
  • Stale result prevention: result_dir and OUTPUT_DIR are fully removed (rm -rf) before each run so cancelled jobs cannot post old data
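
A minimal sketch of the exit-time metadata injection, assuming environment variables such as IMAGE_DIGEST and SLURM_NNODES carry the run context; the real inject_benchmark_metadata in functions.sh may differ.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of inject_benchmark_metadata: merge run context
# into benchmark_result.json so even cancelled runs stay identifiable.
set -euo pipefail

RESULT_FILE="benchmark_result.json"
echo '{"queries": []}' > "$RESULT_FILE"   # stand-in result for the demo

inject_benchmark_metadata() {
  [[ -f "$RESULT_FILE" ]] || return 0
  python3 - "$RESULT_FILE" <<'PYEOF'
import datetime, json, os, sys

path = sys.argv[1]
with open(path) as f:
    data = json.load(f)
# The environment variable names here are assumptions.
data["context"] = {
    "image_digest": os.environ.get("IMAGE_DIGEST", "unknown"),
    "engine": os.environ.get("ENGINE", "presto"),
    "node_count": int(os.environ.get("SLURM_NNODES", "1")),
    "gpu_count": int(os.environ.get("NUM_WORKERS", "0")),
    "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
}
with open(path, "w") as f:
    json.dump(data, f, indent=2)
PYEOF
}

# In the real script this would be wired to an exit trap:
#   trap inject_benchmark_metadata EXIT
inject_benchmark_metadata
```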

Config generation (generate_presto_config.sh, templates)

  • Per-worker config directories (etc_worker_N/) are now generated for both GPU and CPU variants
  • CPU variant explicitly sets cudf.enabled=false
  • Worker config template additions: async-data-cache-enabled=false, cudf.jit_expression_enabled=false, cudf.intra_node_exchange=true, cudf.concat_optimization_enabled, cudf.batch_size_min_threshold
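
The per-worker config generation might look roughly like the following; the etc_worker_N/ layout and property names come from this PR, while the loop itself is an illustrative assumption about generate_presto_config.sh.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of per-worker config generation for the CPU
# variant: one etc_worker_N/ directory per worker, each with cuDF
# explicitly disabled.
set -euo pipefail

NUM_WORKERS="${NUM_WORKERS:-2}"

for (( i = 0; i < NUM_WORKERS; i++ )); do
  dir="etc_worker_${i}"
  mkdir -p "$dir"
  cat > "${dir}/config.properties" <<EOF
coordinator=false
# CPU variant: cuDF is disabled explicitly, since the cluster does not
# gate GPU access through docker.
cudf.enabled=false
cudf.jit_expression_enabled=false
async-data-cache-enabled=false
EOF
done
```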

Result reporting (post_results.py, validate_results.py)

  • Metadata is now read directly from benchmark_result.json context (no separate benchmark.json required)
  • Added node_count (cluster nodes) distinct from gpu_count (GPU workers/total workers)
  • CPU runs automatically report gpu_count=0 and gpu_name="N/A"
  • Per-query validation results attached to each query log entry; xfail status mapped to expected-failure for API compatibility
  • --velox-branch, --velox-repo, --presto-branch, --presto-repo args added to engine_config payload
  • identifier_hash falls back to image_digest from context if not provided on CLI
  • New validate_results.py for comparing query output against expected results
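
The CPU-run reporting rule can be illustrated with a small sketch; the JSON layout, the engine key, and the cpu-substring check are assumptions based on the description above, not the actual post_results.py logic.

```shell
#!/usr/bin/env bash
# Sketch: detect a CPU run from the benchmark_result.json context and
# map it to gpu_count=0 / gpu_name="N/A".
set -euo pipefail

cat > benchmark_result.json <<'EOF'
{"context": {"engine": "presto-cpp-cpu", "node_count": 4}}
EOF

engine=$(python3 -c 'import json; print(json.load(open("benchmark_result.json"))["context"]["engine"])')

if [[ "$engine" == *cpu* ]]; then
  gpu_count=0
  gpu_name="N/A"
else
  gpu_count="${NUM_WORKERS:-8}"
  gpu_name="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1)"
fi

echo "gpu_count=${gpu_count} gpu_name=${gpu_name}" | tee gpu_report.txt
```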

# We want to propagate any changes from the original worker config to the new
# worker configs even if we did not re-generate the configs.
-if [[ -n "$NUM_WORKERS" && "$VARIANT_TYPE" == "gpu" ]]; then
+if [[ -n "$NUM_WORKERS" && ( "$VARIANT_TYPE" == "gpu" || "$VARIANT_TYPE" == "cpu" ) ]]; then
misiugodfrey (Contributor Author):

Needed because the cluster replicates worker configs for the CPU variant in the same way as the GPU configs.

# Add a cluster tag for the CPU variant
echo "cluster-tag=native-cpu" >> "${COORD_CONFIG}"
# Disable cuDF for CPU mode
sed -i 's/^cudf\.enabled=true/cudf.enabled=false/' "${WORKER_CONFIG}"
misiugodfrey (Contributor Author):

On the cluster we aren't using docker to control GPU access, so we need to actively disable cuDF rather than letting it be disabled via the docker environment.

@misiugodfrey misiugodfrey marked this pull request as ready for review March 13, 2026 18:32
@misiugodfrey misiugodfrey marked this pull request as draft March 13, 2026 20:52
@misiugodfrey (Contributor Author):

Making this a draft, as many portions of it are not Slurm-specific and should be pulled out (validation, benchmark posting, etc.).
