
support minExecutors=0 in dynamic mode#107

Merged
sudiptob2 merged 4 commits into armadaproject:master from sudiptob2:feat/186/0-min-executor
Mar 31, 2026

Conversation

@sudiptob2 (Collaborator) commented Mar 20, 2026

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
@sudiptob2 sudiptob2 force-pushed the feat/186/0-min-executor branch from 01a4d9c to ce980c4 Compare March 20, 2026 18:32
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
@sudiptob2 sudiptob2 marked this pull request as ready for review March 24, 2026 17:21
@sudiptob2 sudiptob2 requested a review from Copilot March 24, 2026 17:21

Copilot AI left a comment


Pull request overview

Updates Armada’s dynamic allocation support to allow spark.dynamicAllocation.minExecutors=0 by separating “bootstrap” executor count (initialExecutors) from the scale-down floor, and adjusts tests/examples accordingly.

Changes:

  • Replace “executor count” semantics with getInitialExecutorCount, using spark.dynamicAllocation.initialExecutors as the one-time gang bootstrap batch.
  • Enforce/propagate spark.armada.scheduling.nodeUniformity for dynamic modes and add scale-from-zero coverage in unit/E2E tests.
  • Update example Jupyter notebook guidance and dynamic allocation configuration.
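
The core idea can be sketched as follows. This is an illustrative sketch only, with hypothetical helper names rather than the PR's exact code in DeploymentModeHelper.scala: it shows the separation of the one-time gang bootstrap size (initialExecutors) from the scale-down floor (minExecutors), so that minExecutors=0 no longer forces an empty, unschedulable initial gang.

```scala
// Sketch, not the PR's actual implementation. Config keys are real Spark
// properties; the object and method names here are illustrative.
object InitialExecutorSketch {
  def getInitialExecutorCount(conf: Map[String, String]): Int = {
    // The scale-down floor may legitimately be 0 under this change.
    val min = conf.get("spark.dynamicAllocation.minExecutors")
      .map(_.toInt).getOrElse(0)
    // initialExecutors falls back to minExecutors when unset, but the
    // bootstrap batch is clamped to at least 1 so the gang is non-empty.
    val initial = conf.get("spark.dynamicAllocation.initialExecutors")
      .map(_.toInt).getOrElse(min)
    math.max(initial, 1)
  }
}
```

With this separation, a job configured with minExecutors=0 still submits a non-empty bootstrap gang and can later scale back down to zero executors.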

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

File Description
src/main/scala/org/apache/spark/deploy/armada/DeploymentModeHelper.scala Introduces getInitialExecutorCount; dynamic modes now use initialExecutors for bootstrap and add stricter validation (nodeUniformity + minimum initialExecutors).
src/main/scala/org/apache/spark/deploy/armada/submit/ArmadaClientApplication.scala Submits initial executor batch using getInitialExecutorCount and relaxes static-only validation to allow dynamic min=0.
src/main/scala/org/apache/spark/scheduler/cluster/armada/ArmadaClusterManagerBackend.scala Uses getInitialExecutorCount when proactively requesting executors.
src/test/scala/org/apache/spark/deploy/armada/DeploymentModeHelperSuite.scala Updates expectations and adds new tests for scale-from-zero and initialExecutors/minExecutors behavior.
src/test/scala/org/apache/spark/deploy/armada/submit/ArmadaClientApplicationSuite.scala Adjusts test config to include nodeUniformity + initialExecutors for dynamic min=0 scenario.
src/test/scala/org/apache/spark/deploy/armada/e2e/E2ETestBuilder.scala Changes dynamic allocation helper to accept initialExecutors, forces minExecutors=0, and sets nodeUniformity.
src/test/scala/org/apache/spark/deploy/armada/e2e/ArmadaSparkE2E.scala Updates node selector test values (currently to armada-worker2).
example/jupyter/notebooks/jupyter_armada_spark.ipynb Updates dynamic allocation notes and example config for minExecutors=0.


Comment on lines +296 to +297
.withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
.assertNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))

Copilot AI Mar 24, 2026


The node selector value was changed from "armada-worker" to "armada-worker2", but the E2E job template in this repo still uses "armada-worker" (src/test/resources/e2e/templates/spark-pi-job-template.yaml). Unless the test environment actually has a node named armada-worker2, this will likely make the node selector E2E tests fail. Please keep this consistent with the template/environment or make the selector configurable via test config.

Suggested change:
- .withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
- .assertNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
+ .withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker"))
+ .assertNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker"))

Comment on lines +259 to +266
// Armada is multi-cluster: nodeUniformity must be configured so Armada knows
// which label to use for co-locating the gang and for subsequent scale-up node selectors.
if (!nodeUniformityConfigured) {
throw new IllegalArgumentException(
"spark.armada.scheduling.nodeUniformity must be configured in " +
"dynamic cluster mode. Armada is multi-cluster and requires " +
"a node uniformity label to co-locate all executors on the " +
"same cluster."

Copilot AI Mar 24, 2026


DynamicCluster now throws if spark.armada.scheduling.nodeUniformity is not configured. Since Armada also supports pinning placement via spark.armada.scheduling.nodeSelectors (which are applied to submitted pods), consider allowing dynamic mode without nodeUniformity when nodeSelectors are explicitly set (or document clearly why nodeSelectors aren’t sufficient). As-is, this is a behavior change that can break existing dynamic cluster deployments that relied on nodeSelectors or single-cluster setups.

Suggested change:
- // Armada is multi-cluster: nodeUniformity must be configured so Armada knows
- // which label to use for co-locating the gang and for subsequent scale-up node selectors.
- if (!nodeUniformityConfigured) {
-   throw new IllegalArgumentException(
-     "spark.armada.scheduling.nodeUniformity must be configured in " +
-     "dynamic cluster mode. Armada is multi-cluster and requires " +
-     "a node uniformity label to co-locate all executors on the " +
-     "same cluster."
+ // Detect whether explicit node selectors have been configured for scheduling.
+ // When nodeSelectors are set, they can be used to pin placement even without
+ // a nodeUniformity label.
+ private val nodeSelectorsConfigured: Boolean =
+   conf.getOption(ARMADA_SCHEDULING_NODE_SELECTORS).exists(_.trim.nonEmpty)
+ // Armada is multi-cluster: nodeUniformity must be configured so Armada knows
+ // which label to use for co-locating the gang and for subsequent scale-up node selectors.
+ // However, if explicit nodeSelectors are configured, allow dynamic cluster mode
+ // to proceed without nodeUniformity for backward compatibility.
+ if (!nodeUniformityConfigured && !nodeSelectorsConfigured) {
+   throw new IllegalArgumentException(
+     "spark.armada.scheduling.nodeUniformity must be configured in " +
+     "dynamic cluster mode unless explicit spark.armada.scheduling.nodeSelectors " +
+     "are set. Armada is multi-cluster and requires a node uniformity label to " +
+     "co-locate all executors on the same cluster when no node selectors are used."

Collaborator Author


nodeSelectors are per-pod constraints applied at submission time; they can't replace nodeUniformity, which instructs Armada's gang scheduler to co-locate the initial batch on the same cluster before any node selector is known.
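
For context, a scale-from-zero submission under this change would pair settings along these lines. Property names come from the PR; the nodeUniformity value matches the E2E setup mentioned later in this conversation, and the other values are illustrative:

```
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=0
spark.dynamicAllocation.initialExecutors=2
spark.armada.scheduling.nodeUniformity=armada-spark
```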

Comment on lines +325 to 335
  // Armada is multi-cluster: We use gang scheduling to co-locate the initial batch on one cluster,
  // then capture the cluster label for all subsequent submissions.
  // nodeUniformity must be configured so Armada knows which label to use.
  if (!nodeUniformityConfigured) {
    throw new IllegalArgumentException(
-     s"spark.dynamicAllocation.minExecutors must be >= 2 in " +
-     s"dynamic client mode, but got: ${getExecutorCount}. " +
-     s"Armada requires gang cardinality >= 2 to co-locate " +
-     s"executors in the same cluster."
+     "spark.armada.scheduling.nodeUniformity must be configured in " +
+     "dynamic client mode. Armada is multi-cluster and requires " +
+     "a node uniformity label to co-locate all executors on the " +
+     "same cluster."
    )
  }

Copilot AI Mar 24, 2026


DynamicClient now throws if spark.armada.scheduling.nodeUniformity is not configured. If the intent is specifically to support scale-from-zero in multi-cluster environments, consider permitting dynamic client mode to run without nodeUniformity when spark.armada.scheduling.nodeSelectors is explicitly set (or when running in a known single-cluster environment). Otherwise this becomes a breaking configuration requirement unrelated to minExecutors=0 itself.

Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>

// In cluster mode, driver is part of the gang, so 1 executor + driver = cardinality 2.
if (initialExecutorCount < 1) {
throw new IllegalArgumentException(
Collaborator


Instead of throwing, should we just force the count to 1 and warn the user?

Collaborator Author


Not sure that is a good approach; it would also create configuration drift.

)
- .withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker"))
- .assertNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker"))
+ .withNodeSelectors(Map("kubernetes.io/hostname" -> "armada-worker2"))
Collaborator


what is the reason for this change?

Collaborator Author


I made nodeUniformity mandatory for dynamic allocation. In the E2E tests we set nodeUniformity=armada-spark globally, and in our test kind cluster setup only armada-worker2 carries that label.
So if we try to select a node that doesn't have the label, Armada won't schedule the job. That's why the node selector test is pinned to armada-worker2.

Collaborator


Now, in our test kind cluster setup, only armada-worker2 has that label.

where does this get set?

Collaborator Author


Here it is:

 - role: worker
+- role: worker
+  labels:
+    armada-spark: true

Line 38 is the first worker; line 39 is the second worker.

@GeorgeJahad
Collaborator

I had a couple of minor questions, but everything looks good; I'll approve after I see the answers.

@sudiptob2 sudiptob2 requested a review from GeorgeJahad March 30, 2026 14:53
Signed-off-by: Sudipto Baral <sudiptobaral.me@gmail.com>
Collaborator

@GeorgeJahad GeorgeJahad left a comment


lgtm, thanks @sudiptob2 !

@sudiptob2 sudiptob2 merged commit 555ff5d into armadaproject:master Mar 31, 2026
12 checks passed
@sudiptob2 sudiptob2 deleted the feat/186/0-min-executor branch March 31, 2026 18:33


Development

Successfully merging this pull request may close these issues.

  • Can not set initialExecutor > minExecutor count
  • Dynamic client mode should work with 0 minimum executor