We have a Production cluster in which one node on the Stream tab of the "Decisioning: Services" landing page remains in JOINING_FAILED status (see attached screenshot). We traced it to this snippet in the Kafka server.log file:
[2019-11-25 11:13:23,513] INFO Creating /brokers/ids/6 (is it secure? false) (kafka.zk.KafkaZkClient)
[2019-11-25 11:13:23,541] ERROR Error while creating ephemeral at /brokers/ids/6, node already exists and owner '31139484958261249' does not match current session '31140168996945922' (kafka.zk.KafkaZkClient$CheckedEphemeral)
[2019-11-25 11:13:23,541] INFO Result of znode creation at /brokers/ids/6 is: NODEEXISTS (kafka.zk.KafkaZkClient)
[2019-11-25 11:13:23,550] ERROR [KafkaServer id=6] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
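For anyone hitting the same error: the ERROR line says the ephemeral znode /brokers/ids/6 is still owned by a different ZooKeeper session than the one the broker is starting with, so registration fails and the broker shuts down. The sketch below is only an illustration of how those two session IDs could be pulled out of server.log for comparison. The function name `find_ephemeral_conflicts` and the returned field names are made up for this example and are not part of Kafka or Pega. The hex form is included because ZooKeeper tooling usually prints session IDs in hexadecimal, while this log line shows them in decimal.

```python
import re

# Matches the Kafka error quoted in the log above. The pattern is written
# against that one message format and is an assumption for other versions.
EPHEMERAL_CONFLICT = re.compile(
    r"Error while creating ephemeral at (?P<path>[^\s,]+), node already exists "
    r"and owner '(?P<owner>\d+)' does not match current session '(?P<current>\d+)'"
)


def find_ephemeral_conflicts(lines):
    """Return one dict per log line that reports an ephemeral znode owner mismatch."""
    conflicts = []
    for line in lines:
        match = EPHEMERAL_CONFLICT.search(line)
        if match is None:
            continue
        owner = int(match.group("owner"))
        current = int(match.group("current"))
        conflicts.append(
            {
                "path": match.group("path"),
                "owner_session": owner,
                "current_session": current,
                # Hex form, for comparing with ZooKeeper's own output.
                "owner_session_hex": hex(owner),
                "current_session_hex": hex(current),
            }
        )
    return conflicts


if __name__ == "__main__":
    sample = (
        "[2019-11-25 11:13:23,541] ERROR Error while creating ephemeral at "
        "/brokers/ids/6, node already exists and owner '31139484958261249' "
        "does not match current session '31140168996945922' "
        "(kafka.zk.KafkaZkClient$CheckedEphemeral)"
    )
    for conflict in find_ephemeral_conflicts([sample]):
        print(conflict)
```

If the owner session reported here is still alive in ZooKeeper, something else is holding the registration for broker id 6, which would be consistent with the behaviour described in this thread.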
Thank you for your response, Anandh. We've been having this problem since last Friday. We've been in production since June with twice-monthly deployments, and there was no deployment last week. I didn't clean the pr_sys_statusnodes table since we have not upgraded. Should I delete the row related to the machine with the problem? I tried your suggestion of stopping the JVM and killing any other Java processes after the JVM stopped. There was one, but killing it and restarting didn't help. On the other machines there are two other Java processes, which makes sense since Kafka and Cassandra are both running fine over there.
I hope all the nodes are able to communicate with each other. Please try to ping from one node to the others and check the response. If they can reach each other, stop all the nodes, clear the pr_sys_statusnodes table, and restart the servers node by node: first the util nodes, then the stream node, and then the web nodes.
Please let me know if the issue still persists after that.
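A note on the connectivity check: ping only shows that ICMP gets through, so it can succeed while the service ports themselves are blocked. A minimal sketch of a TCP-level check follows. The host names are placeholders, and the ports (9092 for Kafka, 2181 for ZooKeeper) are the usual defaults and only an assumption here, so substitute whatever your stream nodes are actually configured with.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host names and assumed default ports; replace with real values.
    targets = [
        ("stream-node-1.example.com", 9092),
        ("stream-node-1.example.com", 2181),
    ]
    for host, port in targets:
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        print(f"{host}:{port} is {status}")
```

Running this from each node against every other node gives a stricter answer than ping about whether the nodes can really talk to each other.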
Sorry it took so long to get back to this. Hectic weekend with Black Friday and all. We have permission to restart all the nodes tonight. I will let you know if it helped. All the nodes can ping each other.
The restart was postponed to last night to coincide with other downtime. We took the nodes down and when we looked at the table it was empty. We brought the nodes back up and the Kafka server is still down. The four records in the table are back. So no, it didn't work.
As far as I could gather, this is supposed to force the broker id. It wasn't sufficient, though: the node kept starting up with the wrong id. I eventually traced it to a db table that still had an entry for the node with the wrong broker id. After I deleted that entry, the Kafka node started functioning correctly again, which resolved a LOT of stability issues. Unfortunately, I do not recall the name of the db table and I no longer have contact with the client. I don't think it was pr_sys_statusnodes, though.