
All builds stuck on "discovering any new versions", seemingly due to NATS cert expiry; recovery instructions did not work #334

@RichardBradley

Description


Summary

All my builds got stuck on "discovering any new versions" for a long time. I spent a while looking at concourse/#844 and its linked issues.

While chasing that red herring, I found the following issue:

sh-4.2$ fly -t xxx check-resource -r x/x
checking x/x in build 42946363
initializing check: x
resource config creds evaluation: Get "https://xxx:8844/info": x509: certificate has expired or is not yet valid: current time 2023-05-23T15:09:56Z is after 2023-05-23T10:03:25Z
errored

This looks a lot like https://github.com/EngineerBetter/control-tower/blob/master/docs/troubleshooting.md#bosh-director-certificate-has-expired

We have had similar issues before and had followed the NATS cert renewal instructions last week in an attempt to avoid this.

I followed those instructions, but then got:

Deploying:
  Creating instance 'bosh/0':
    Waiting until instance is ready:
      Post https://mbus/:<redacted>@54.77.80.216:6868/agent: x509: certificate has expired or is not yet valid
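Both x509 errors above come from a certificate whose notAfter date has passed. As a diagnostic sketch (not part of the Control Tower troubleshooting steps), the Python below fetches whatever certificate an endpoint presents, without verifying it, and reads its notAfter date. It assumes the `openssl` CLI is installed, since the Python standard library has no public X.509 parser; the helper names are hypothetical.

```python
import ssl
import subprocess
from datetime import datetime, timezone
from typing import Optional


def fetch_leaf_cert_pem(host: str, port: int) -> str:
    """Return the PEM certificate presented by host:port.

    No CA bundle is passed, so the certificate is fetched without being
    verified; that is what allows inspecting a cert that has already expired.
    """
    return ssl.get_server_certificate((host, port))


def parse_enddate(openssl_output: str) -> datetime:
    """Parse `openssl x509 -noout -enddate` output such as
    'notAfter=May 23 10:03:25 2023 GMT' into a timezone-aware UTC datetime."""
    value = openssl_output.strip().split("=", 1)[1]
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def not_after(pem: str) -> datetime:
    """Read notAfter from a PEM certificate via the openssl CLI (assumed installed)."""
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate"],
        input=pem,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_enddate(result.stdout)


def is_expired(expiry: datetime, now: Optional[datetime] = None) -> bool:
    """True if the current time is after the certificate's notAfter date."""
    now = now or datetime.now(timezone.utc)
    return now > expiry
```

For example, `is_expired(not_after(fetch_leaf_cert_pem("54.77.80.216", 6868)))` would check the agent endpoint from the error above; the same call against the redacted host on port 8844 would check the endpoint from the `fly check-resource` error.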

I then tried to follow https://github.com/EngineerBetter/control-tower/blob/master/docs/troubleshooting.md#nats-certificate-is-expired but am currently getting:

Task 8269 | 16:00:19 | Error: Failed to acquire lock for lock:deployment:concourse uid: 04075eac-579a-4839-98d9-2b4d840de459. Locking taskid is 8264, description: 'scan and fix'

Task 8269 Started  Tue May 23 16:00:19 UTC 2023
Task 8269 Finished Tue May 23 16:00:19 UTC 2023
Task 8269 Duration 00:00:00
Task 8269 error

Updating deployment:
  Expected task '8269' to succeed but state is 'error'

Exit code 1

In step 6 of those instructions, which says "Run bosh deploy --recreate --fix <(bosh manifest)", what should I use for "bosh manifest"?
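For reference, `<(bosh manifest)` is bash process substitution: the output of `bosh manifest` (the manifest the director currently holds for the deployment) is handed to `bosh deploy` as if it were a file path. A minimal Python sketch of the same steps follows, assuming the BOSH CLI is already targeted and logged in to the director and that the deployment is named `concourse` (as in the lock error above). The function name is hypothetical, and `run` is injectable only so the sketch can be exercised without a real director.

```python
import subprocess
import tempfile


def redeploy_with_current_manifest(deployment: str, run=subprocess.run) -> str:
    """Two-step equivalent of
    `bosh -d <deployment> deploy --recreate --fix <(bosh -d <deployment> manifest)`.

    Returns the path of the saved manifest so it can be inspected afterwards.
    """
    # Download the manifest the director last deployed for this deployment.
    fetched = run(
        ["bosh", "-d", deployment, "manifest"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Save it to a real file; bash's <(...) does this implicitly via /dev/fd.
    with tempfile.NamedTemporaryFile("w", suffix=".yml", delete=False) as fh:
        fh.write(fetched.stdout)
        manifest_path = fh.name
    # Redeploy from that file: --recreate recreates every VM, and --fix
    # recreates instances with unresponsive agents instead of erroring.
    run(
        ["bosh", "-d", deployment, "deploy", "--recreate", "--fix", manifest_path],
        check=True,
    )
    return manifest_path
```

Writing the manifest to a file first, rather than streaming it, also leaves a copy on disk to inspect if the deploy fails.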

Steps to reproduce

Run Concourse for more than one year.

Expected results

Concourse should continue to work, or be easily recoverable.

If there are any errors they should be clear and suggest fixes.

Actual results

Concourse fails with all builds stuck on "discovering any new versions"

Additional context

Triaging info

  • Concourse version: current
  • Browser (if applicable):
  • Did this used to work? No, this happens every year.

Any help or advice would be gratefully received.
