This repository was archived by the owner on Jan 30, 2020. It is now read-only.

Running units on part of the cluster stopped and started after master disconnect #1690

@simonvanderveldt

I believe that fleet should no longer stop units if it loses its connection to the cluster, but that is what seems to have happened to us.
We run a cluster of 17 machines, of which we've dedicated 3 to master duty.

We're running CoreOS 899.13.0 (because we had stability issues with the 1000 series).
It's using the following versions of fleetd and etcd2:

$ fleetd --version
fleetd version 0.11.5
$ etcd2 --version
etcd Version: 2.2.3
Git SHA: 05b564a
Go Version: go1.4.3
Go OS/Arch: linux/amd64

It started with a single non-master node having etcd2 connectivity issues:

proxy: client 127.0.0.1:58758 closed request prematurely
proxy: client 127.0.0.1:58878 closed request prematurely
proxy: client 127.0.0.1:58876 closed request prematurely
etc.

This is something that eventually all regular nodes showed.

We then got a new etcd2 leader election:

2016-10-19 10:34:07.697 infra-prod-master-10.9.8.200    e109f17bc0f23c3f is starting a new election at term 28
2016-10-19 10:34:07.701 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 28] received a MsgVote message with higher term from e109f17bc0f23c3f [term: 29]
2016-10-19 10:34:07.701 infra-prod-master-10.9.8.200    e109f17bc0f23c3f became candidate at term 29
2016-10-19 10:34:07.705 infra-prod-master-10.9.8.200    e109f17bc0f23c3f received vote from e109f17bc0f23c3f at term 29
2016-10-19 10:34:07.705 infra-prod-master-10.9.8.150    fc9295b9360ddd69 became follower at term 29
2016-10-19 10:34:07.710 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [logterm: 28, index: 1760268102, vote: 0] voted for e109f17bc0f23c3f [logterm: 28, index: 1760268102] at term 29
2016-10-19 10:34:07.711 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [logterm: 28, index: 1760268102] sent vote request to d2703a25587493c9 at term 29
2016-10-19 10:34:07.715 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [logterm: 28, index: 1760268102] sent vote request to fc9295b9360ddd69 at term 29
2016-10-19 10:34:07.716 infra-prod-master-10.9.8.150    raft.node: fc9295b9360ddd69 lost leader d2703a25587493c9 at term 29
2016-10-19 10:34:07.720 infra-prod-master-10.9.8.150    raft.node: fc9295b9360ddd69 elected leader e109f17bc0f23c3f at term 29
2016-10-19 10:34:07.720 infra-prod-master-10.9.8.200    raft.node: e109f17bc0f23c3f lost leader d2703a25587493c9 at term 29
2016-10-19 10:34:07.726 infra-prod-master-10.9.8.200    e109f17bc0f23c3f received vote from fc9295b9360ddd69 at term 29
2016-10-19 10:34:07.730 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [q:2] has received 2 votes and 0 vote rejections
2016-10-19 10:34:07.734 infra-prod-master-10.9.8.200    e109f17bc0f23c3f became leader at term 29
2016-10-19 10:34:07.738 infra-prod-master-10.9.8.200    raft.node: e109f17bc0f23c3f elected leader e109f17bc0f23c3f at term 29

Then the following repeats 300+ times on the other 2 master nodes that weren't disconnected:

2016-10-19 10:34:12.601 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.602 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.607 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.612 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.613 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
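
To put a number on the repeats, here is a rough Python sketch that tallies the "ignored a MsgApp" lines per receiving host and stale sender. It assumes the line layout is exactly as pasted above (date, time, host, raft ID, message). The names `IGNORED`, `count_stale_appends` and `SAMPLE` are made up for illustration and are not part of etcd.

```python
import re
from collections import Counter

# Assumed layout, copied from the excerpt above:
# <date> <time> <host>    <raft id> [term: N] ignored a MsgApp message
# with lower term from <raft id> [term: M]
IGNORED = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<host>\S+)\s+"
    r"(?P<node>[0-9a-f]+) \[term: (?P<term>\d+)\] "
    r"ignored a MsgApp message with lower term from "
    r"(?P<sender>[0-9a-f]+) \[term: (?P<sender_term>\d+)\]"
)


def count_stale_appends(lines):
    """Count ignored MsgApp lines, keyed by (sender, sender term, receiving host)."""
    counts = Counter()
    for line in lines:
        match = IGNORED.match(line.strip())
        if match:
            key = (match["sender"], int(match["sender_term"]), match["host"])
            counts[key] += 1
    return counts


# The five lines quoted above, used as sample input.
SAMPLE = [
    "2016-10-19 10:34:12.601 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]",
    "2016-10-19 10:34:12.602 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]",
    "2016-10-19 10:34:12.607 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]",
    "2016-10-19 10:34:12.612 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]",
    "2016-10-19 10:34:12.613 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]",
]

if __name__ == "__main__":
    for (sender, term, host), n in sorted(count_stale_appends(SAMPLE).items()):
        print(f"{host} ignored {n} MsgApp message(s) from {sender} at term {term}")
```

Run against the full journal export, this would show whether every stale message comes from the old leader d2703a25587493c9, as the excerpt suggests.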

And finally we see the following in fleet:

2016-10-19 10:34:08.884 infra-prod-dev-10.9.8.131   ERROR engine.go:217: Engine leadership lost, renewal failed: context deadline exceeded
2016-10-19 10:34:12.213 infra-prod-dev-10.9.8.103   ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:12.217 infra-prod-dev-10.9.8.103   INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:12.250 infra-prod-dev-10.9.8.103   INFO server.go:168: Starting server components
2016-10-19 10:34:40.498 infra-prod-tst-10.9.8.246   ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:40.503 infra-prod-tst-10.9.8.246   INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:40.648 infra-prod-master-10.9.8.191    ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:40.652 infra-prod-master-10.9.8.191    INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:46.573 infra-prod-tst-10.9.8.246   INFO server.go:168: Starting server components
2016-10-19 10:34:47.915 infra-prod-dev-10.9.8.22    ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:47.919 infra-prod-dev-10.9.8.22    INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:50.744 infra-prod-tools-10.9.8.118 ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:50.747 infra-prod-tools-10.9.8.118 INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:55.312 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-quote-service-master@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.318 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-quote-service-master@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.360 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-admin-add-snackbar-service@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.366 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-admin-add-snackbar-service@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.415 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-adds-rev-css-images@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.421 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-adds-rev-css-images@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.470 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-mtf-125-close-menu-on-select@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.476 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-mtf-125-close-menu-on-select@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.529 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-mock-loads-widget-sass-from-shared@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.536 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-mock-loads-widget-sass-from-shared@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.623 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-admin-cs-number@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
etc...

It seems like the reconciler was triggered, though IMHO it shouldn't have been. What could be the cause of this?
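
To see how many units were affected and on which machines, here is a rough Python sketch that groups the "Unscheduled Job" lines by machine ID. It assumes the fleet log layout is exactly as pasted above. `UNSCHEDULED`, `unscheduled_by_machine` and `SAMPLE` are illustrative names, not part of fleet.

```python
import re
from collections import defaultdict

# Assumed layout, copied from the fleet excerpt above:
# <date> <time> <host>   INFO engine.go:<line>: Unscheduled Job(<unit>) from Machine(<id>)
UNSCHEDULED = re.compile(
    r"^(?P<date>\S+) (?P<time>\S+) (?P<host>\S+)\s+INFO engine\.go:\d+: "
    r"Unscheduled Job\((?P<unit>[^)]+)\) from Machine\((?P<machine>[0-9a-f]+)\)"
)


def unscheduled_by_machine(lines):
    """Map each machine ID to the units the engine unscheduled from it."""
    result = defaultdict(list)
    for line in lines:
        match = UNSCHEDULED.match(line.strip())
        if match:
            result[match["machine"]].append(match["unit"])
    return dict(result)


# A few lines quoted from the log above; the first one is not an unschedule event.
SAMPLE = [
    "2016-10-19 10:34:08.884 infra-prod-dev-10.9.8.131   ERROR engine.go:217: Engine leadership lost, renewal failed: context deadline exceeded",
    "2016-10-19 10:34:55.312 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-quote-service-master@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)",
    "2016-10-19 10:34:55.360 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-admin-add-snackbar-service@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)",
    "2016-10-19 10:34:55.415 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job(dev-adds-rev-css-images@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)",
]

if __name__ == "__main__":
    for machine, units in unscheduled_by_machine(SAMPLE).items():
        print(f"{machine}: {len(units)} unit(s) unscheduled")
        for unit in units:
            print(f"  {unit}")
```

Comparing the machine IDs in this summary with the hosts that logged "Monitor timed out before successful heartbeat" should show whether the unscheduled units all belonged to agents that had just lost etcd connectivity.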
