Fix executor placement in the same cluster in Dynamic client mode.#104

Merged
sudiptob2 merged 24 commits into armadaproject:master from sudiptob2:feat/dynamic-client-mode-gang-env
Mar 17, 2026

Conversation


@sudiptob2 sudiptob2 commented Feb 24, 2026

closes G-Research/spark#152

Why: In dynamic client mode, the driver runs outside the Kubernetes cluster and lacks the ARMADA_GANG_* environment variables that Armada injects into pods. Without these, scale-up executors have no way to target the same cluster as the initial gang, causing them to land on arbitrary nodes. This change lets the driver learn gang attributes at runtime from the first executor and apply them as node selectors to all subsequent allocations.

Changes:

  • Add captureGangAttributes in ArmadaClusterManagerBackend to intercept RegisterExecutor RPCs and extract gang node selector attributes from the first executor
  • Add seedGangAttributesFromEnv for cluster mode, where the driver pod already has ARMADA_GANG_* env vars at startup
  • Gate scale-up allocation in ArmadaExecutorAllocator via isReadyToAllocateMore, blocking until gang attributes are captured in client mode with nodeUniformity
  • Use sys.env.contains("ARMADA_JOB_SET_ID") for runtime cluster-mode detection in start() to prevent double executor submission
  • Set gang cardinality to 0 for scale-up batches in ArmadaClientApplication, relying on node selectors instead of full gang scheduling
  • Add armada-entrypoint.sh Docker wrapper that forwards ARMADA_GANG_* env vars as SPARK_EXECUTOR_ATTRIBUTE_* so executors relay them during registration
  • Add internal config keys spark.armada.internal.gangNodeLabelName and spark.armada.internal.gangNodeLabelValue to store captured attributes in SparkConf
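The capture mechanism in the first bullet can be sketched as follows. This is a simplified, hypothetical illustration rather than the PR's actual code: the env var names follow the ones discussed later in this thread, and a mutable map stands in for SparkConf.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.mutable

object GangAttributeCapture {
  private val gangAttributesCaptured = new AtomicBoolean(false)
  private val conf = mutable.Map.empty[String, String] // stand-in for SparkConf

  // Called for each RegisterExecutor RPC; only the first executor that
  // carries gang attributes wins the CAS and publishes them.
  def captureGangAttributes(executorAttributes: Map[String, String]): Unit = {
    val name  = executorAttributes.get("ARMADA_GANG_NODE_UNIFORMITY_LABEL_NAME")
    val value = executorAttributes.get("ARMADA_GANG_NODE_UNIFORMITY_LABEL_VALUE")
    (name, value) match {
      case (Some(n), Some(v)) if v.nonEmpty =>
        // CAS is the sole guard: only the winning thread writes the conf keys.
        if (gangAttributesCaptured.compareAndSet(false, true)) {
          conf("spark.armada.internal.gangNodeLabelName") = n
          conf("spark.armada.internal.gangNodeLabelValue") = v
        }
      case _ => () // no gang attributes on this executor; keep waiting
    }
  }

  def isCaptured: Boolean = gangAttributesCaptured.get()
  def capturedConf: Map[String, String] = conf.toMap
}
```

Once `isCaptured` flips to true, scale-up allocation can proceed and read the node selector back from the stored internal keys.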

Tests:

  • Unit tests for captureGangAttributes, seedGangAttributesFromEnv, and isReadyToAllocateMore in ArmadaClusterManagerBackendSuite
  • Unit tests for gang cardinality override in ArmadaClientApplicationSuite
  • E2E tests for dynamicCluster and dynamicClient gang scheduling with scale-up assertions

Manual tests:

  1. Run mvn test — all 139 unit tests pass
  2. Run e2e tests: mvn clean package, ./scripts/CreateImage.sh, ./scripts/dev-e2e.sh
  3. Verify initial gang executors have gang annotations and env vars
  4. Verify scale-up executors have node selectors but no gang annotations
  5. Confirm no duplicate executor submissions in cluster-mode static tests

…mic client mode

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
…xecutors

Scale-up executors submitted via validateAndSubmitExecutorJobs were using
getGangCardinality (minExecutors) instead of the actual batch count. This
caused Armada to wait for pods that would never arrive when batch size
differed from minExecutors. Pass gangCardinalityOverride=Some(executorCount)
only in the scale-up path so initial submissions still use the mode default.

Also split assertGangJobForDynamic into two assertions: env vars for
initial gang executors and annotations for all executors including scale-up.
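The override described above reduces, in essence, to an Option fallback. A minimal sketch (the mode default is a plain parameter here, not the real DeploymentModeHelper method):

```scala
// Scale-up path passes Some(executorCount); the initial submission passes
// None and keeps the mode default (e.g. minExecutors).
def resolveGangCardinality(
    modeDefaultCardinality: Int,
    gangCardinalityOverride: Option[Int]
): Int =
  gangCardinalityOverride.getOrElse(modeDefaultCardinality)
```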

Skip gang annotations for scale-up batches once gang attributes are
captured, add isReadyToAllocateMore guard to prevent scale-up executors
from landing on a different cluster, and tune dynamic allocation defaults.

Rename parameters for clarity, assert gang annotations only on initial
batch, and verify scale-up pods have node selectors instead.

- Use CAS as sole guard in captureGangAttributes so only the winning
  thread writes to SparkConf
- Replace sys.env.contains("ARMADA_JOB_SET_ID") with
  DeploymentModeHelper.isDriverInCluster in isReadyToAllocateMore
  and start()
- Check nodeValue.nonEmpty in getGangNodeSelector to prevent empty
  node selector values
- Add default for ARMADA_NODE_UNIFORMITY_LABEL in submit script
- DRY cleanup in seedGangAttributesFromEnv
- Add unit tests for isReadyToAllocateMore and assume() guard for
  env-dependent test

Revert modeHelper.isDriverInCluster back to sys.env.contains("ARMADA_JOB_SET_ID")
for cluster mode detection in start() and isReadyToAllocateMore. The env var is
only present in cluster-mode driver pods, making it a reliable runtime indicator.
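The detection this commit reverts to is a one-liner; sketched here over an injectable env map so it can be exercised without a real cluster-mode pod (the default-argument wrapper is illustrative, not the PR's code):

```scala
// ARMADA_JOB_SET_ID is injected only into cluster-mode driver pods, so its
// presence is a reliable runtime indicator of cluster mode.
def isClusterModeDriver(env: Map[String, String] = sys.env): Boolean =
  env.contains("ARMADA_JOB_SET_ID")
```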

- --conf spark.dynamicAllocation.initialExecutors=1
  --conf spark.dynamicAllocation.minExecutors=2
  --conf spark.dynamicAllocation.maxExecutors=10
+ --conf spark.dynamicAllocation.initialExecutors=2
Collaborator Author


This needs to be at least 2, otherwise Armada won't gang schedule them.

Collaborator


You don't have to address it in this ticket, but one of our requirements is that dynamic client mode supports 0 min executors, so that it puts no load on Armada when none is needed.

Collaborator Author


Tracking in G-Research/spark#186


Copilot AI left a comment


Pull request overview

This pull request fixes dynamic client mode in Spark-on-Armada by enabling the driver to discover gang node selector attributes at runtime from executors. In dynamic client mode, the driver runs outside the Kubernetes cluster and lacks the ARMADA_GANG_* environment variables that Armada injects into pods, which previously caused scale-up executors to land on arbitrary nodes rather than the same cluster as the initial gang.

Changes:

  • Implements runtime gang attribute capture via executor registration RPC interception, storing discovered node selector labels in SparkConf for use in subsequent executor allocations
  • Adds entrypoint wrapper that forwards ARMADA_GANG_* environment variables as SPARK_EXECUTOR_ATTRIBUTE_* so executors can relay them to the driver
  • Sets gang cardinality to 0 for scale-up batches while relying on node selectors instead of full gang scheduling to reduce overhead
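The entrypoint wrapper's renaming step amounts to a prefix rewrite over the environment. A hedged sketch of that transformation as a pure function (the real wrapper is a shell script; this reproduces only its mapping logic):

```scala
// Re-export each ARMADA_GANG_* variable under the SPARK_EXECUTOR_ATTRIBUTE_
// prefix; Spark forwards such attributes to the driver when the executor
// registers, which is what lets the driver capture them at runtime.
def relayGangEnv(env: Map[String, String]): Map[String, String] =
  env.collect {
    case (key, value) if key.startsWith("ARMADA_GANG_") =>
      s"SPARK_EXECUTOR_ATTRIBUTE_$key" -> value
  }
```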

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.

File Description
src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala Adds gang attribute capture from executor registration and environment seeding; implements allocation gating via isReadyToAllocateMore
src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaExecutorAllocator.scala Gates scale-up allocation until gang attributes are captured in client mode with nodeUniformity
src/main/scala/org/apache/spark/deploy/armada/submit/ArmadaClientApplication.scala Implements gang cardinality override for scale-up batches and gang node selector retrieval from SparkConf
src/main/scala/org/apache/spark/deploy/armada/Config.scala Adds internal config keys for storing captured gang node label name and value
docker/armada-entrypoint.sh New wrapper script that forwards ARMADA_GANG_* env vars as SPARK_EXECUTOR_ATTRIBUTE_* for executor-to-driver communication
docker/Dockerfile Updates entrypoint to use armada-entrypoint.sh wrapper
src/test/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackendSuite.scala Unit tests for captureGangAttributes, seedGangAttributesFromEnv, and isReadyToAllocateMore
src/test/scala/org/apache/spark/deploy/armada/submit/ArmadaClientApplicationSuite.scala Unit tests for getGangNodeSelector
src/test/scala/org/apache/spark/deploy/armada/e2e/E2ETestBuilder.scala Updates gang assertion logic for dynamic allocation with separate checks for env vars, annotations, and node selectors
src/test/scala/org/apache/spark/deploy/armada/e2e/ArmadaSparkE2E.scala Adds dynamicClient e2e test and updates dynamicCluster test with new assertion parameters
scripts/submitArmadaSpark.sh Adjusts dynamic allocation parameters and adds gang scheduling configuration
CLAUDE.md Updates ArmadaClusterManagerBackend documentation to reflect gang attribute capture feature



Copilot AI left a comment


Pull request overview

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.



@sudiptob2 sudiptob2 changed the title from "Fix Dynamic client mode by injecting env from executor -> driver" to "Fix executor placement in the same cluster in Dynamic client mode." Feb 27, 2026
@sudiptob2 sudiptob2 marked this pull request as ready for review February 27, 2026 21:03
@sudiptob2 sudiptob2 requested a review from Copilot March 9, 2026 13:41

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Notebook cell source:

# Spark Configuration
conf = SparkConf()
if auth_token:
    conf.set("spark.armada.auth.token", auth_token)
if auth_script_path:
    conf.set("spark.armada.auth.script.path", auth_script_path)
if not driver_host:
    raise ValueError(
        "SPARK_DRIVER_HOST environment variable is required. "
    )
conf.set("spark.master", armada_master)
conf.set("spark.submit.deployMode", "client")
conf.set("spark.app.id", app_id)
conf.set("spark.app.name", "jupyter-spark-pi")
conf.set("spark.driver.bindAddress", "0.0.0.0")
conf.set("spark.driver.host", driver_host)
conf.set("spark.driver.port", driver_port)
conf.set("spark.driver.blockManager.port", block_manager_port)
conf.set("spark.home", "/opt/spark")
conf.set("spark.armada.container.image", image_name)
conf.set("spark.armada.queue", armada_queue)
conf.set("spark.armada.scheduling.namespace", armada_namespace)
conf.set("spark.armada.eventWatcher.useTls", event_watcher_use_tls)
conf.set("spark.kubernetes.file.upload.path", "/tmp")
conf.set("spark.kubernetes.executor.disableConfigMap", "true")
conf.set("spark.local.dir", "/tmp")
conf.set("spark.jars", armada_jar)

# Network timeouts
conf.set("spark.network.timeout", "800s")
conf.set("spark.executor.heartbeatInterval", "60s")

# Resource limits
conf.set("spark.armada.driver.limit.memory", "1Gi")
conf.set("spark.armada.driver.request.memory", "1Gi")
conf.set("spark.armada.executor.limit.memory", "1Gi")
conf.set("spark.armada.executor.request.memory", "1Gi")

# Allocation mode configuration
if allocation_mode == 'dynamic':
    # Dynamic allocation - executors scale based on workload
    # Gang scheduling ensures all executors land on the same Armada cluster
    conf.set("spark.armada.scheduling.nodeUniformity", node_uniformity_label)
    conf.set("spark.dynamicAllocation.enabled", "true")
    conf.set("spark.dynamicAllocation.minExecutors", "2")
    conf.set("spark.dynamicAllocation.maxExecutors", "10")
    conf.set("spark.dynamicAllocation.initialExecutors", "2")  # Must be >= 2 for gang scheduling
    conf.set("spark.dynamicAllocation.executorIdleTimeout", "60s")
    conf.set("spark.dynamicAllocation.schedulerBacklogTimeout", "5s")
    print(f"Using dynamic allocation (min=2, max=10, initial=2)")
else:
    # Static allocation - fixed number of executors
    conf.set("spark.executor.instances", "2")
    print(f"Using static allocation (instances=2)")
Collaborator


The previous version had one setting per line, making them much easier to read. Is there any reason not to maintain that convention?

Collaborator Author


fixed

@GeorgeJahad
Collaborator

In the "how to verify" section, I'm not sure what steps 3, 4 and 5 are telling us to do?

3. Verify initial gang executors have gang annotations and env vars
4. Verify scale-up executors have node selectors but no gang annotations
5. Confirm no duplicate executor submissions in cluster-mode static tests

@GeorgeJahad
Collaborator

The DeploymentModeHelper is meant to hide a lot of the implementation differences between client and cluster mode. It feels like the latest diffs allow some of that abstraction to leak out of the helper. I've entered Claude's opinion below. What do you think?

The core issue

DeploymentModeHelper has no gang scheduling abstractions at all. The gang attribute capture/propagation logic is spread across three classes that each perform their own deploy-mode branching:

1. ArmadaClusterManagerBackend — the biggest offender

  • Line 112: Owns gangAttributesCaptured: AtomicBoolean state directly
  • Lines 586-607: isReadyToAllocateMore does explicit if (isClusterMode) ... else ... branching — exactly the kind of thing DeploymentModeHelper exists to hide
  • Lines 651-658: seedGangAttributesFromEnv() is cluster-mode-specific logic living in the Backend
  • Lines 667-680: captureGangAttributes() writes directly to SparkConf internal keys — client-mode-specific
  • Lines 718-721: Custom DriverEndpoint intercepts RegisterExecutor to capture gang attributes from executor attributes — client-mode-specific behavior embedded in the RPC layer

2. ArmadaClientApplication — cross-component coupling

  • Lines 1306-1312: getGangNodeSelector() reads the internal config keys that the Backend wrote, creating a hidden coupling between submission logic and the backend's capture mechanism
  • Lines 427-440: Performs its own gang-cardinality-override logic based on whether gang attributes have been captured
  • Lines 1271-1275: Directly merges gang node selectors into executor pod specs

3. ArmadaExecutorAllocator — indirect leakage

  • Lines 113-121: Calls backend.isReadyToAllocateMore, which itself contains deploy-mode branching

4. Config.scala — the glue

  • Lines 236-248: ARMADA_INTERNAL_GANG_NODE_LABEL_NAME and ARMADA_INTERNAL_GANG_NODE_LABEL_VALUE are internal config keys used as a side-channel between Backend and ClientApplication, rather than a proper abstraction boundary.

What's missing from DeploymentModeHelper

The helper currently has methods like getGangCardinality, isDriverInCluster, and getJobSetIdSource that properly abstract mode differences. But it lacks gang attribute lifecycle methods such as:

  • Whether the allocator is ready to allocate (encapsulating the cluster-vs-client readiness logic)
  • Seeding/capturing gang attributes (encapsulating the env-var vs executor-registration paths)
  • Retrieving the gang node selector for subsequent allocations
  • Determining gang cardinality for scale-up batches

The pattern of if (modeHelper.isDriverInCluster) ... else ... appearing in the Backend rather than being behind a single helper method is the clearest sign of the abstraction leak. The whole point of DeploymentModeHelper is to make those conditionals disappear from calling code.
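One possible shape for the missing abstraction, sketched under the assumptions above (names and signatures are hypothetical, not from the PR): fold the gang-attribute lifecycle behind mode-specific implementations so callers never branch on deploy mode.

```scala
trait GangAttributeLifecycle {
  /** Encapsulates the cluster-vs-client readiness check. */
  def isReadyToAllocateMore: Boolean
  /** Env-var seeding (cluster mode) or registration capture (client mode). */
  def observeExecutorAttributes(attrs: Map[String, String]): Unit
  /** Node selector to merge into subsequent executor pod specs. */
  def gangNodeSelector: Map[String, String]
  /** Cardinality to use for a scale-up batch. */
  def scaleUpGangCardinality(batchSize: Int): Int
}

// Client-mode implementation: not ready until the first executor has
// relayed its gang attributes; scale-up batches then rely on node
// selectors rather than full gang scheduling (cardinality 0).
final class ClientModeGangLifecycle extends GangAttributeLifecycle {
  private var captured: Option[(String, String)] = None

  def isReadyToAllocateMore: Boolean = captured.isDefined

  def observeExecutorAttributes(attrs: Map[String, String]): Unit =
    if (captured.isEmpty)
      for {
        name  <- attrs.get("ARMADA_GANG_NODE_UNIFORMITY_LABEL_NAME")
        value <- attrs.get("ARMADA_GANG_NODE_UNIFORMITY_LABEL_VALUE")
        if value.nonEmpty
      } captured = Some((name, value))

  def gangNodeSelector: Map[String, String] =
    captured.map { case (n, v) => Map(n -> v) }.getOrElse(Map.empty)

  def scaleUpGangCardinality(batchSize: Int): Int = 0
}
```

A cluster-mode implementation would read the env vars once at construction and report ready immediately, which removes the `if (isClusterMode)` branches from the Backend and Allocator.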

@sudiptob2
Collaborator Author

In the "how to verify" section, I'm not sure what steps 3,4 and 5 are telling us to do:?

3. Verify initial gang executors have gang annotations and env vars
4. Verify scale-up executors have node selectors but no gang annotations
5. Confirm no duplicate executor submissions in cluster-mode static tests

This is actually how I verified manually. Updated the PR description

@sudiptob2 sudiptob2 requested a review from GeorgeJahad March 13, 2026 19:03
.assertGangJobForDynamic(
  "armada-spark",
  3
) // at least 3 executor pods (2 min + 1 scaled) with gang annotations seen
Collaborator


I find this comment helpful. Does it need to be removed?

Collaborator


Is it no longer correct?

Collaborator Author


The name of the assertion is more descriptive now, but yeah, better to have the comment, adding it back.

  ): Map[String, String] = {
    val modeHelper = DeploymentModeHelper(conf)
-   val gangCardinality = modeHelper.getGangCardinality
+   val gangCardinality = gangCardinalityOverride.getOrElse(modeHelper.getGangCardinality)
Collaborator


Shouldn't this be handled in the DeploymentModeHelper as well?

(All the rest of the changes to the DeploymentModeHelper look real good to me, but it seems like this should go as well.)

Collaborator Author


Improved

- `dynamic`: Executors scale based on workload (recommended for interactive use)

**Important for dynamic mode:**
- `initialExecutors` must be >= 2 for gang scheduling to work properly
Collaborator


Does an error get generated if initialExecutors is < 2?

Collaborator Author


Armada only injects ARMADA_GANG_NODE_UNIFORMITY_LABEL_NAME/VALUE env vars when the gang has >= 2 members.
As a result, the scale-up does not work. Should we do some sort of config validation to enforce initialExecutors>=2 for dynamic allocation?
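If such validation were added, it could be as simple as a fail-fast check at submission time (purely a sketch of the idea raised above, not code from this PR):

```scala
// Armada injects the gang uniformity env vars only for gangs of 2+ members,
// so a smaller initial gang silently breaks scale-up in dynamic mode.
def validateDynamicAllocation(initialExecutors: Int): Unit =
  require(
    initialExecutors >= 2,
    s"spark.dynamicAllocation.initialExecutors must be >= 2 for gang " +
      s"scheduling to work (got $initialExecutors)"
  )
```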

@sudiptob2 sudiptob2 force-pushed the feat/dynamic-client-mode-gang-env branch from 7d6ad0f to d71c8c2 on March 16, 2026
Collaborator

@GeorgeJahad GeorgeJahad left a comment


lgtm, thanks!

@sudiptob2 sudiptob2 merged commit 368fcf1 into armadaproject:master Mar 17, 2026
24 of 34 checks passed
@sudiptob2 sudiptob2 deleted the feat/dynamic-client-mode-gang-env branch March 17, 2026 21:06

Development

Successfully merging this pull request may close these issues.

Finish dynamic client mode

3 participants