Posted: 18 Feb 2016 22:51 EST Last activity: 1 Mar 2016 15:06 EST
Why don't all the nodes in a PEGA PRPC cluster remain synchronised?
I have instances where multiple nodes in a PEGA cluster no longer remain connected. This often happens when rebooting all nodes in the cluster, but it's not the only time it occurs. To fix it, I usually have to restart the isolated PEGA PRPC node and hope that it connects (otherwise I just go through this process again). So my questions are:
a. Why does this occur?
b. Is there a way of getting the isolated PEGA PRPC node to try to connect to the other nodes without having to reboot the node?
Sorry for not responding earlier. I was unavailable due to illness last week. However, to follow up on your questions:
1. "No longer remain connected" means that in the catalina.out log one of the nodes is no longer listed as a member of the cluster. This happens on a random basis even though the nodes themselves are still accessible. It can also occur when restarting members of the cluster.
2. I am running PRPC 7.1.8. I have checked subsequent release notes to see if any problem has been identified but couldn't see anything in particular.
I wonder if this is related to the race condition in the Hazelcast cluster consistency check (I know there are a couple of prconfig settings to minimize the potential failure), which is likely to happen when you start multiple nodes simultaneously. Genesis team, any comments on this?
Can you tell me a little more about your cluster? Are the nodes located on the same machine or the same sub-cluster? Are there any exceptions in the logs when the node leaves the cluster (both on the node leaving and on one that is still in the cluster)? More details about the scenario of restarting all nodes would also be helpful, so we could try to recreate the issue internally.
In terms of the node not reconnecting to the cluster, are there any exceptions or messages that could give us a clue as to what is happening at that point in time? Do you see a message on the other nodes in the cluster indicating that the node is attempting to connect? Does the connecting node indicate any problems?
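To help gather the log details asked about above, here is a rough sketch of a script that reads a node's catalina.out and reports which expected nodes are absent from the most recent cluster member list. It assumes the membership is logged as a Hazelcast 3.x-style block (`Members [N] {` followed by `Member [host]:port` lines and a closing `}`); that format, the function names and the addresses in the usage note are assumptions for illustration and may need adjusting to what your logs actually contain.

```python
import re

# Assumed log format (not confirmed for PRPC 7.1.8):
#   Members [2] {
#       Member [10.0.0.1]:5701 this
#       Member [10.0.0.2]:5701
#   }
BLOCK_START_RE = re.compile(r"Members \[(\d+)\] \{")
MEMBER_RE = re.compile(r"Member \[([^\]]+)\]:(\d+)")


def latest_members(log_text):
    """Return the 'host:port' entries from the last complete member block."""
    members = None   # last completed block seen so far
    current = None   # block currently being read, if any
    for line in log_text.splitlines():
        if BLOCK_START_RE.search(line):
            current = set()
        elif current is not None:
            match = MEMBER_RE.search(line)
            if match:
                current.add(f"{match.group(1)}:{match.group(2)}")
            elif line.strip().startswith("}"):
                members = current
                current = None
    return members if members is not None else set()


def missing_nodes(log_text, expected):
    """Return expected 'host:port' entries absent from the latest member list."""
    return sorted(set(expected) - latest_members(log_text))
```

For example, reading the file with `open("catalina.out").read()` and calling `missing_nodes(text, ["10.0.0.1:5701", "10.0.0.2:5701"])` (hypothetical addresses) on each node would show which node has dropped out from that node's point of view. Comparing the results across nodes should indicate whether the isolated node still sees itself as the only member.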