-
Notifications
You must be signed in to change notification settings - Fork 484
NullPointerException caused zkClient to close socket with zookeeper and kafka broker not functioning #71
Description
Hello,
We use kafka version 0.10.2.1 which uses zkclient 0.10 (zkclient-0.10.jar). We use a setup of 3 kafka brokers and 3 zookeeper nodes. Both kafka and zookeeper run in docker containers. We have seen a few occasions where one kafka broker remains running, but there is no connection with zookeeper (netstat reports no socket connected to port 2181). This is probably caused by a transient network issue or transient DNS resolution issue, however it seems that the transient issue becomes permanent and the broker never recovers. Instead the broker remains out of sync with the other 2 brokers. I can recover it only by restarting the broker.
I can see the following from the broker logs (unfortunately no debug logs are available):
2018-06-16T00:12:14.357000000Z org.apache.zookeeper.ClientCnxn$SendThread: Client session timed out, have not heard from server in 26679ms for sessionid 0x16405b171770003
2018-06-16T00:12:30.795000000Z org.apache.zookeeper.ClientCnxn$SendThread: Unable to reconnect to ZooKeeper service, session 0x16405b171770003 has expired
2018-06-16T00:12:33.811000000Z org.I0Itec.zkclient.ZkEventThread: Could not establish session with zookeeperorg.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper.tas01.local:2181/eventqueue
org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:72)
org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1289)
org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:721)
org.I0Itec.zkclient.ZkClient.process(ZkClient.java:632)
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
2018-06-16T00:12:33.816000000Z org.I0Itec.zkclient.ZkEventThread: Could not establish session with zookeeperorg.I0Itec.zkclient.exception.ZkException: Unable to connect to zookeeper.tas01.local:2181/eventqueue
org.I0Itec.zkclient.ZkConnection.connect(ZkConnection.java:72)
org.I0Itec.zkclient.ZkClient.reconnect(ZkClient.java:1289)
org.I0Itec.zkclient.ZkClient.processStateChanged(ZkClient.java:721)
org.I0Itec.zkclient.ZkClient.process(ZkClient.java:632)
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530)
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:505)
2018-06-16T00:12:33.817000000Z org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Children of /config/changes changed sent to kafka.common.ZkNodeChangeNotificationListener$NodeChangeListener$@75be6c7a]java.lang.NullPointerException
org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:115)
org.I0Itec.zkclient.ZkClient$5.call(ZkClient.java:703)
org.I0Itec.zkclient.ZkClient$5.call(ZkClient.java:700)
org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:991)
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:700)
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:709)
org.I0Itec.zkclient.ZkClient$10.run(ZkClient.java:842)
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)
2018-06-16T00:12:33.818000000Z org.I0Itec.zkclient.ZkEventThread: Error handling event ZkEvent[Data of /controller changed sent to kafka.server.ZookeeperLeaderElector$LeaderChangeListener@57f56c84]java.lang.NullPointerException
org.I0Itec.zkclient.ZkConnection.exists(ZkConnection.java:115)
org.I0Itec.zkclient.ZkClient$5.call(ZkClient.java:703)
org.I0Itec.zkclient.ZkClient$5.call(ZkClient.java:700)
org.I0Itec.zkclient.ZkClient.retryUntilConnected(ZkClient.java:991)
org.I0Itec.zkclient.ZkClient.exists(ZkClient.java:700)
org.I0Itec.zkclient.ZkClient$9.run(ZkClient.java:820)
org.I0Itec.zkclient.ZkEventThread.run(ZkEventThread.java:72)
I checked the last log ("Error handling event ..."). Looking at the code it seems that the NullPointerException is caught here: link
However after the exception is caught the while loops seems to continue, and no more events arrive, possibly. There does not seem to be any re-connection attempts or even a process restart. This assumption matches with what I saw in the particular environment. The kafka broker remained running but there was no connection with zookeeper (no connected socket reported by netstat) and the broker did not respond to any events. Also the broker was out of sync for all topics/partitions. A broker restart fixed the problem.
Is the above assumption valid, or did I miss something? Do you think this is a bug of the zkClient code? How can we ensure that the broker will reconnect to zookeeper?
I found a similar issue already reported: KAFKA-2182. This issue appears to be resolved, which is puzzling.
Best regards,
Klearchos