Fix: Errors deploying jobs of type parameterized since 0.4.2#855
Fix: Errors deploying jobs of type parameterized since 0.4.2#855ritesh-harihar wants to merge 3 commits intomainfrom
Conversation
jrasell
left a comment
There was a problem hiding this comment.
Thanks for this @ritesh-harihar and it's looking good. I have a couple of comments I think are worth talking through before merging. The PR will also need a changelog.
internal/runner/job/reconcile.go
Outdated
|
|
||
| // isChildStyleJobID is a conservative fallback: child jobs are represented as | ||
| // "<parent>/<child-run-id>", so they always contain a non-leading slash. | ||
| func isChildStyleJobID(jobID string) bool { |
There was a problem hiding this comment.
It is possible Nomad job IDs are configured with a number of slashes irregardless of whether they are dispatched. I am curious of your thoughts on removing this fallback altogether and relying on the ParentID and metadata identification? Are there cases where this wouldn't be enough?
$ nomad status
ID Type Priority Status Submit Date
example/self/namespace service 50 pending 2026-03-31T10:07:51+01:00There was a problem hiding this comment.
“ParentID + pack.job " covers normal Nomad/nomad-pack behavior and removes slash-based false positives.
The remaining edge case is when both signals are absent/inconsistent in stubs; if we observe that in practice, we can add a targeted Jobs().Info check only for ambiguous entries to keep reconciliation safe.
So instead of calling Jobs().Info for every job, call it only for uncertain ones.
Pseudo flow:
if stub.ParentID != "" → child
else if pack.job exists:
pack.job != stub.ID → child
pack.job == stub.ID → root
else (ambiguous) → call Jobs().Info:
full ParentID != "" → child
else → root
But not sure if we really need this API call?
jrasell
left a comment
There was a problem hiding this comment.
LGTM; I think we just need a changelog entry before merging.
Updated the changelog. |
Description
This PR fixes a regression where nomad-pack run could stop active child jobs (for example, dispatched or periodic runs) during reconciliation. The issue appeared after stale-job cleanup was introduced and affected deployments where child jobs inherit pack metadata.
Root Cause
Reconciliation in reconcile.go lists jobs by pack.deployment_name and treats jobs not in the current rendered root job set as stale. Child jobs created from a root pack job were included in that list and misclassified as stale, which caused unintended Deregister calls. The previous logic did not robustly distinguish root jobs from generated child runs.
Fix
- ParentID (preferred),
- pack.job metadata consistency (pack.job != job ID means child),
- slash-shape fallback only when pack.job metadata is missing.(Currently slash-based child suffixes are hardcoded and used in nomad)
Why this approach
ISSUE
Testing:
Reminders
CHANGELOG.mdentryChanges to Security Controls
Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.