Skip to content

Fix: Errors deploying jobs of type parameterized since 0.4.2#855

Open
ritesh-harihar wants to merge 3 commits intomainfrom
fix-failing-parameterized-jobs
Open

Fix: Errors deploying jobs of type parameterized since 0.4.2#855
ritesh-harihar wants to merge 3 commits intomainfrom
fix-failing-parameterized-jobs

Conversation

@ritesh-harihar
Copy link
Copy Markdown
Collaborator

@ritesh-harihar ritesh-harihar commented Mar 30, 2026

Description
This PR fixes a regression where nomad-pack run could stop active child jobs (for example, dispatched or periodic runs) during reconciliation. The issue appeared after stale-job cleanup was introduced and affected deployments where child jobs inherit pack metadata.

Root Cause
Reconciliation in reconcile.go lists jobs by pack.deployment_name and treats jobs not in the current rendered root job set as stale. Child jobs created from a root pack job were included in that list and misclassified as stale, which caused unintended Deregister calls. The previous logic did not robustly distinguish root jobs from generated child runs.

Fix

  • Tightened stale-job selection to skip child jobs before stop actions.
  • Child detection now uses ordered signals:
    - ParentID (preferred),
    - pack.job metadata consistency (pack.job != job ID means child),
    - slash-shape fallback only when pack.job metadata is missing.(Currently slash-based child suffixes are hardcoded and used in nomad)
  • Added nil-safety in the reconcile loop to avoid dereference risk on unexpected list entries.

Why this approach

  • Avoids hardcoding only periodic-* / dispatch-* assumptions.
  • Keeps behavior future-tolerant without adding extra Nomad API lookups.
  • Preserves original stale-job cleanup intent for true removed root jobs.

ISSUE

Testing:

  • Before fix:
Screenshot 2026-03-30 at 4 03 07 PM
  • After fix:
Screenshot 2026-03-30 at 7 18 38 PM Screenshot 2026-03-30 at 5 09 19 PM

Reminders

  • Add CHANGELOG.md entry
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@ritesh-harihar ritesh-harihar requested review from a team as code owners March 30, 2026 13:38
@ritesh-harihar ritesh-harihar linked an issue Mar 30, 2026 that may be closed by this pull request
@ritesh-harihar ritesh-harihar self-assigned this Mar 30, 2026
Copy link
Copy Markdown
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @ritesh-harihar and it's looking good. I have a couple of comments I think are worth talking through before merging. The PR will also need a changelog.


// isChildStyleJobID is a conservative fallback: child jobs are represented as
// "<parent>/<child-run-id>", so they always contain a non-leading slash.
func isChildStyleJobID(jobID string) bool {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible Nomad job IDs are configured with a number of slashes irregardless of whether they are dispatched. I am curious of your thoughts on removing this fallback altogether and relying on the ParentID and metadata identification? Are there cases where this wouldn't be enough?

$ nomad status
ID                      Type     Priority  Status   Submit Date
example/self/namespace  service  50        pending  2026-03-31T10:07:51+01:00

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

“ParentID + pack.job " covers normal Nomad/nomad-pack behavior and removes slash-based false positives.

The remaining edge case is when both signals are absent/inconsistent in stubs; if we observe that in practice, we can add a targeted Jobs().Info check only for ambiguous entries to keep reconciliation safe.

So instead of calling Jobs().Info for every job, call it only for uncertain ones.

Pseudo flow:

if stub.ParentID != "" → child
else if pack.job exists:
pack.job != stub.ID → child
pack.job == stub.ID → root
else (ambiguous) → call Jobs().Info:
full ParentID != "" → child
else → root

But not sure if we really need this API call?

Copy link
Copy Markdown
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; I think we just need a changelog entry before merging.

@ritesh-harihar
Copy link
Copy Markdown
Collaborator Author

LGTM; I think we just need a changelog entry before merging.

Updated the changelog.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Errors deploying jobs of type parameterized since 0.4.2

2 participants