support minExecutors=0 in dynamic mode#107
Conversation
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Force-pushed 01a4d9c to ce980c4
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Pull request overview
Updates Armada’s dynamic allocation support to allow spark.dynamicAllocation.minExecutors=0 by separating “bootstrap” executor count (initialExecutors) from the scale-down floor, and adjusts tests/examples accordingly.
Changes:
- Replace "executor count" semantics with `getInitialExecutorCount`, using `spark.dynamicAllocation.initialExecutors` as the one-time gang bootstrap batch.
- Enforce/propagate `spark.armada.scheduling.nodeUniformity` for dynamic modes and add scale-from-zero coverage in unit/E2E tests.
- Update example Jupyter notebook guidance and dynamic allocation configuration.
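The bootstrap/floor split described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code; the function signature and the defaulting of `initialExecutors` to `minExecutors` are assumptions.

```scala
// Hypothetical sketch: derive the one-time gang bootstrap size from
// initialExecutors rather than from minExecutors (the scale-down floor).
def getInitialExecutorCount(conf: Map[String, String]): Int = {
  val min = conf.getOrElse("spark.dynamicAllocation.minExecutors", "0").toInt
  // Assume initialExecutors defaults to minExecutors when unset.
  val initial =
    conf.getOrElse("spark.dynamicAllocation.initialExecutors", min.toString).toInt
  // The gang bootstrap always needs at least one executor, even when
  // minExecutors is 0, so scale-from-zero can still form an initial gang.
  math.max(initial, 1)
}
```

With this separation, `minExecutors=0` only controls how far Spark may later scale down; it no longer forces an empty bootstrap batch.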
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
Summary per file:
| File | Description |
|---|---|
| src/main/scala/org/apache/spark/deploy/armada/DeploymentModeHelper.scala | Introduces getInitialExecutorCount; dynamic modes now use initialExecutors for bootstrap and add stricter validation (nodeUniformity + minimum initialExecutors). |
| src/main/scala/org/apache/spark/deploy/armada/submit/ArmadaClientApplication.scala | Submits initial executor batch using getInitialExecutorCount and relaxes static-only validation to allow dynamic min=0. |
| src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala | Uses getInitialExecutorCount when proactively requesting executors. |
| src/test/scala/org/apache/spark/deploy/armada/DeploymentModeHelperSuite.scala | Updates expectations and adds new tests for scale-from-zero and initialExecutors/minExecutors behavior. |
| src/test/scala/org/apache/spark/deploy/armada/submit/ArmadaClientApplicationSuite.scala | Adjusts test config to include nodeUniformity + initialExecutors for dynamic min=0 scenario. |
| src/test/scala/org/apache/spark/deploy/armada/e2e/E2ETestBuilder.scala | Changes dynamic allocation helper to accept initialExecutors, forces minExecutors=0, and sets nodeUniformity. |
| src/test/scala/org/apache/spark/deploy/armada/e2e/ArmadaSparkE2E.scala | Updates node selector test values (currently to armada-worker2). |
| example/jupyter/notebooks/jupyter_armada_spark.ipynb | Updates dynamic allocation notes and example config for minExecutors=0. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
```scala
.withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
.assertNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
```
The node selector value was changed from "armada-worker" to "armada-worker2", but the E2E job template in this repo still uses "armada-worker" (src/test/resources/e2e/templates/spark-pi-job-template.yaml). Unless the test environment actually has a node named armada-worker2, this will likely make the node selector E2E tests fail. Please keep this consistent with the template/environment or make the selector configurable via test config.
Suggested change:

```diff
- .withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
- .assertNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
+ .withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker"))
+ .assertNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker"))
```
```scala
// Armada is multi-cluster: nodeUniformity must be configured so Armada knows
// which label to use for co-locating the gang and for subsequent scale-up node selectors.
if (!nodeUniformityConfigured) {
  throw new IllegalArgumentException(
    "spark.armada.scheduling.nodeUniformity must be configured in " +
    "dynamic cluster mode. Armada is multi-cluster and requires " +
    "a node uniformity label to co-locate all executors on the " +
    "same cluster."
```
DynamicCluster now throws if spark.armada.scheduling.nodeUniformity is not configured. Since Armada also supports pinning placement via spark.armada.scheduling.nodeSelectors (which are applied to submitted pods), consider allowing dynamic mode without nodeUniformity when nodeSelectors are explicitly set (or document clearly why nodeSelectors aren’t sufficient). As-is, this is a behavior change that can break existing dynamic cluster deployments that relied on nodeSelectors or single-cluster setups.
Suggested change:

```diff
- // Armada is multi-cluster: nodeUniformity must be configured so Armada knows
- // which label to use for co-locating the gang and for subsequent scale-up node selectors.
- if (!nodeUniformityConfigured) {
-   throw new IllegalArgumentException(
-     "spark.armada.scheduling.nodeUniformity must be configured in " +
-     "dynamic cluster mode. Armada is multi-cluster and requires " +
-     "a node uniformity label to co-locate all executors on the " +
-     "same cluster."
+ // Detect whether explicit node selectors have been configured for scheduling.
+ // When nodeSelectors are set, they can be used to pin placement even without
+ // a nodeUniformity label.
+ private val nodeSelectorsConfigured: Boolean =
+   conf.getOption(ARMADA_SCHEDULING_NODE_SELECTORS).exists(_.trim.nonEmpty)
+ // Armada is multi-cluster: nodeUniformity must be configured so Armada knows
+ // which label to use for co-locating the gang and for subsequent scale-up node selectors.
+ // However, if explicit nodeSelectors are configured, allow dynamic cluster mode
+ // to proceed without nodeUniformity for backward compatibility.
+ if (!nodeUniformityConfigured && !nodeSelectorsConfigured) {
+   throw new IllegalArgumentException(
+     "spark.armada.scheduling.nodeUniformity must be configured in " +
+     "dynamic cluster mode unless explicit spark.armada.scheduling.nodeSelectors " +
+     "are set. Armada is multi-cluster and requires a node uniformity label to " +
+     "co-locate all executors on the same cluster when no node selectors are used."
```
nodeSelectors are per-pod constraints applied at submission time; they can't replace nodeUniformity, which instructs Armada's gang scheduler to co-locate the initial batch on the same cluster before any node selector is known.
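Under the new rules, a valid scale-from-zero dynamic configuration would combine `minExecutors=0` with a bootstrap batch and a mandatory uniformity label. The sketch below uses the config key names discussed in this PR; the values are illustrative examples only.

```scala
// Example dynamic-allocation settings for scale-from-zero (values illustrative):
val dynamicConf = Map(
  "spark.dynamicAllocation.enabled"          -> "true",
  "spark.dynamicAllocation.minExecutors"     -> "0",           // now allowed
  "spark.dynamicAllocation.initialExecutors" -> "2",           // one-time gang bootstrap
  "spark.armada.scheduling.nodeUniformity"   -> "armada-spark" // required in dynamic mode
)
```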
```diff
  // Armada is multi-cluster: We use gang scheduling to co-locate the initial batch on one cluster,
  // then capture the cluster label for all subsequent submissions.
  // nodeUniformity must be configured so Armada knows which label to use.
  if (!nodeUniformityConfigured) {
    throw new IllegalArgumentException(
-     s"spark.dynamicAllocation.minExecutors must be >= 2 in " +
-     s"dynamic client mode, but got: ${getExecutorCount}. " +
-     s"Armada requires gang cardinality >= 2 to co-locate " +
-     s"executors in the same cluster."
+     "spark.armada.scheduling.nodeUniformity must be configured in " +
+     "dynamic client mode. Armada is multi-cluster and requires " +
+     "a node uniformity label to co-locate all executors on the " +
+     "same cluster."
    )
  }
```
DynamicClient now throws if spark.armada.scheduling.nodeUniformity is not configured. If the intent is specifically to support scale-from-zero in multi-cluster environments, consider permitting dynamic client mode to run without nodeUniformity when spark.armada.scheduling.nodeSelectors is explicitly set (or when running in a known single-cluster environment). Otherwise this becomes a breaking configuration requirement unrelated to minExecutors=0 itself.
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
```scala
// In cluster mode, driver is part of the gang, so 1 executor + driver = cardinality 2.
if (initialExecutorCount < 1) {
  throw new IllegalArgumentException(
```
Instead of throwing, should we just force the count to 1 and warn the user?
Not sure that is a good approach; it would also create configuration drift.
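The two options under discussion can be sketched side by side. These helper names are hypothetical, introduced only to illustrate the trade-off; neither is the PR's actual code.

```scala
// Option 1: fail fast (the PR's approach) - surface the misconfiguration.
def validateStrict(initialExecutorCount: Int): Int = {
  if (initialExecutorCount < 1)
    throw new IllegalArgumentException(
      s"initialExecutors must be >= 1 in cluster mode, got $initialExecutorCount")
  initialExecutorCount
}

// Option 2: clamp and warn (the suggestion above) - the job starts, but the
// effective value silently drifts from what the user configured.
def validateLenient(initialExecutorCount: Int): Int = {
  if (initialExecutorCount < 1) {
    System.err.println(
      s"WARN: forcing initialExecutors from $initialExecutorCount to 1")
    1
  } else initialExecutorCount
}
```

The drift concern is that with the lenient variant, the running job's gang size no longer matches the submitted configuration, which can be confusing to debug.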
```diff
  )
- .withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker"))
- .assertNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker"))
+ .withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
```
what is the reason for this change?
I made nodeUniformity mandatory for dynamic allocation. In the E2E tests we globally set nodeUniformity=armada-spark, and in our test kind cluster setup only armada-worker2 has that label.
So if we try to select a node that doesn't have the label, Armada won't schedule the job. That's why the node selector test is pinned to armada-worker2.
> Now, in our test kind cluster setup, only armada-worker2 has that label.
where does this get set?
Here it is:
armada-spark/e2e/armada-operator.patch, lines 38 to 41 in e38e857
(line 38 - first worker, line 39 - second worker)
I had a couple of minor questions, but everything looks good; I'll approve after I see the answers.
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
GeorgeJahad left a comment:
lgtm, thanks @sudiptob2 !
Closes G-Research-Forks/spark#186
Closes G-Research-Forks/spark#189