Running units on part of the cluster stopped and started after master disconnect #1690
Description
I believe that fleet should no longer stop units if it loses its connection to the cluster, but that is what seems to have happened to us.
We run a cluster of 17 machines, of which we've dedicated 3 to master duty.
We're running CoreOS 899.13.0 (because we had stability issues with the 1000 series).
The cluster is running the following versions of fleetd and etcd2:
$ fleetd --version
fleetd version 0.11.5
$ etcd2 --version
etcd Version: 2.2.3
Git SHA: 05b564a
Go Version: go1.4.3
Go OS/Arch: linux/amd64
It started with a single non-master node having etcd2 connectivity issues:
proxy: client 127.0.0.1:58758 closed request prematurely
proxy: client 127.0.0.1:58878 closed request prematurely
proxy: client 127.0.0.1:58876 closed request prematurely
etc.
Eventually all regular (non-master) nodes showed the same errors.
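For anyone trying to reproduce this, cluster health as seen from one of the proxy nodes can be checked with the standard etcd v2 tooling (assuming the default local client endpoint); we did not capture this output at the time:
$ etcdctl cluster-health
$ etcdctl member list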
We then got a new etcd2 leader election:
2016-10-19 10:34:07.697 infra-prod-master-10.9.8.200 e109f17bc0f23c3f is starting a new election at term 28
2016-10-19 10:34:07.701 infra-prod-master-10.9.8.150 fc9295b9360ddd69 [term: 28] received a MsgVote message with higher term from e109f17bc0f23c3f [term: 29]
2016-10-19 10:34:07.701 infra-prod-master-10.9.8.200 e109f17bc0f23c3f became candidate at term 29
2016-10-19 10:34:07.705 infra-prod-master-10.9.8.200 e109f17bc0f23c3f received vote from e109f17bc0f23c3f at term 29
2016-10-19 10:34:07.705 infra-prod-master-10.9.8.150 fc9295b9360ddd69 became follower at term 29
2016-10-19 10:34:07.710 infra-prod-master-10.9.8.150 fc9295b9360ddd69 [logterm: 28, index: 1760268102, vote: 0] voted for e109f17bc0f23c3f [logterm: 28, index: 1760268102] at term 29
2016-10-19 10:34:07.711 infra-prod-master-10.9.8.200 e109f17bc0f23c3f [logterm: 28, index: 1760268102] sent vote request to d2703a25587493c9 at term 29
2016-10-19 10:34:07.715 infra-prod-master-10.9.8.200 e109f17bc0f23c3f [logterm: 28, index: 1760268102] sent vote request to fc9295b9360ddd69 at term 29
2016-10-19 10:34:07.716 infra-prod-master-10.9.8.150 raft.node: fc9295b9360ddd69 lost leader d2703a25587493c9 at term 29
2016-10-19 10:34:07.720 infra-prod-master-10.9.8.150 raft.node: fc9295b9360ddd69 elected leader e109f17bc0f23c3f at term 29
2016-10-19 10:34:07.720 infra-prod-master-10.9.8.200 raft.node: e109f17bc0f23c3f lost leader d2703a25587493c9 at term 29
2016-10-19 10:34:07.726 infra-prod-master-10.9.8.200 e109f17bc0f23c3f received vote from fc9295b9360ddd69 at term 29
2016-10-19 10:34:07.730 infra-prod-master-10.9.8.200 e109f17bc0f23c3f [q:2] has received 2 votes and 0 vote rejections
2016-10-19 10:34:07.734 infra-prod-master-10.9.8.200 e109f17bc0f23c3f became leader at term 29
2016-10-19 10:34:07.738 infra-prod-master-10.9.8.200 raft.node: e109f17bc0f23c3f elected leader e109f17bc0f23c3f at term 29
Then the following repeats 300+ times on the other two master nodes, which were not disconnected:
2016-10-19 10:34:12.601 infra-prod-master-10.9.8.150 fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.602 infra-prod-master-10.9.8.200 e109f17bc0f23c3f [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.607 infra-prod-master-10.9.8.150 fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.612 infra-prod-master-10.9.8.150 fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.613 infra-prod-master-10.9.8.200 e109f17bc0f23c3f [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
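So the old leader d2703a25587493c9 apparently kept replaying AppendEntries at the stale term 28 for a while. If this happens again, whether the disconnected master still believes it is the leader can be checked locally through the etcd v2 stats API (default client endpoint assumed):
$ curl -s http://127.0.0.1:2379/v2/stats/self     # this member's raft state and leaderInfo
$ curl -s http://127.0.0.1:2379/v2/stats/leader   # only succeeds on the member that considers itself leader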
And finally we see the following in fleet:
2016-10-19 10:34:08.884 infra-prod-dev-10.9.8.131 ERROR engine.go:217: Engine leadership lost, renewal failed: context deadline exceeded
2016-10-19 10:34:12.213 infra-prod-dev-10.9.8.103 ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:12.217 infra-prod-dev-10.9.8.103 INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:12.250 infra-prod-dev-10.9.8.103 INFO server.go:168: Starting server components
2016-10-19 10:34:40.498 infra-prod-tst-10.9.8.246 ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:40.503 infra-prod-tst-10.9.8.246 INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:40.648 infra-prod-master-10.9.8.191 ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:40.652 infra-prod-master-10.9.8.191 INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:46.573 infra-prod-tst-10.9.8.246 INFO server.go:168: Starting server components
2016-10-19 10:34:47.915 infra-prod-dev-10.9.8.22 ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:47.919 infra-prod-dev-10.9.8.22 INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:50.744 infra-prod-tools-10.9.8.118 ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:50.747 infra-prod-tools-10.9.8.118 INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:55.312 infra-prod-dev-10.9.8.131 INFO engine.go:256: Unscheduled Job(dev-quote-service-master@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.318 infra-prod-dev-10.9.8.131 INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-quote-service-master@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.360 infra-prod-dev-10.9.8.131 INFO engine.go:256: Unscheduled Job(dev-admin-add-snackbar-service@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.366 infra-prod-dev-10.9.8.131 INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-admin-add-snackbar-service@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.415 infra-prod-dev-10.9.8.131 INFO engine.go:256: Unscheduled Job(dev-adds-rev-css-images@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.421 infra-prod-dev-10.9.8.131 INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-adds-rev-css-images@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.470 infra-prod-dev-10.9.8.131 INFO engine.go:256: Unscheduled Job(dev-mtf-125-close-menu-on-select@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.476 infra-prod-dev-10.9.8.131 INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-mtf-125-close-menu-on-select@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.529 infra-prod-dev-10.9.8.131 INFO engine.go:256: Unscheduled Job(dev-mock-loads-widget-sass-from-shared@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.536 infra-prod-dev-10.9.8.131 INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: dev-mock-loads-widget-sass-from-shared@1.service, MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.623 infra-prod-dev-10.9.8.131 INFO engine.go:256: Unscheduled Job(dev-admin-cs-number@1.service) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
etc...
It looks like the engine reconciler was triggered, though IMHO it shouldn't have been. What could be the cause of this?
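For reference, my understanding is that the engine unschedules a unit when the target machine's presence key in etcd expires, which matches the "target Machine(...) went away" reason above, so the relevant knobs would be the agent TTL and the etcd request timeout. We run with what is essentially the default fleet.conf; a rough sketch (values are the upstream defaults as I understand them, not a verbatim copy of our config):
# /etc/fleet/fleet.conf (illustrative)
etcd_servers=["http://127.0.0.1:2379"]
# How long an agent's presence key lives in etcd; if it expires,
# the engine considers the machine gone and unschedules its units.
agent_ttl="30s"
# Timeout (seconds) for individual etcd requests made by fleetd.
etcd_request_timeout=1.0
# How often (seconds) the engine reconciler runs.
engine_reconcile_interval=2.0
If the presence keys expired during the election window, that would explain the unscheduling, though I'd still expect fleet to ride out a brief leader change rather than stop running units.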