In our project we have Pega 7.1.7 deployed on a WebLogic cluster with 9 managed servers (nodes).
Is a sequential restart recommended, or can the servers be restarted in parallel without any issues?
Past observation: In a few instances, node-level data pages, which load at startup on each server, were corrupted. We assumed that this was caused by the parallel restart, so we have been doing sequential restarts since then. We have not observed any issues since we switched to sequential restarts.
Pega uses Hazelcast, which is a clustering technology. Whenever the servers restart, they form a cluster: one main node forms the cluster and the other nodes try to join it. This helps not only with performance but also with Elasticsearch. You can check the startup logs to get a better understanding of how the cluster forms.
Thanks, Abhinav, for your insight. Can you please tell us whether there will be any issue with node-level data pages loading if we perform the following steps:
1. Remove all nodes from the load balancer (LB) before the restart.
2. Restart all nodes in parallel.
3. Put all nodes back on the LB.
We are seeking this information because our project demands a very high availability environment, and there is a plan to scale horizontally from 9 to 13 nodes in the coming days. A sequential restart will take a lot of time, so we wanted to check the impact of a parallel restart on the node-level data pages that load at startup.
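To make the two options concrete, the sequential and parallel procedures described above can be sketched as follows. This is only an illustration of the ordering of steps: `drain`, `restart`, and `restore` are hypothetical callables standing in for whatever load balancer and WebLogic tooling is actually in use; they are not Pega or WebLogic APIs.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable

# Hypothetical signature: each action takes a node name and returns nothing.
NodeAction = Callable[[str], None]


def sequential_restart(nodes: Iterable[str], drain: NodeAction,
                       restart: NodeAction, restore: NodeAction) -> None:
    """Restart one node at a time; the other nodes stay on the LB."""
    for node in nodes:
        drain(node)    # take this node off the load balancer
        restart(node)  # restart its application server
        restore(node)  # put it back on the load balancer


def parallel_restart(nodes: Iterable[str], drain: NodeAction,
                     restart: NodeAction, restore: NodeAction) -> None:
    """Drain every node, restart them all concurrently, then restore them."""
    nodes = list(nodes)
    for node in nodes:
        drain(node)
    with ThreadPoolExecutor(max_workers=len(nodes) or 1) as pool:
        # list() waits for every restart and re-raises the first failure.
        list(pool.map(restart, nodes))
    for node in nodes:
        restore(node)
```

Note that in the parallel variant no node is serving traffic while the restarts run, whereas the sequential variant keeps all but one node available at any time.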
One more question regarding a particular scenario: if one of the nodes has a stuck thread or memory issue, could we follow this process?
1. Take the problematic node off the LB, keeping the rest of the nodes active on the LB so they are available to serve user requests.
2. Restart the application server on the problematic node (which includes refreshing node-level pages on that node).
3. Put the node back on the LB.
What we would like to understand is whether Step 2 in the process above would corrupt the node-level cache on the other active nodes on the LB.
I would like to know whether Hazelcast in Pega has anything to do with node-level data pages. As stated in the original query, it was a general observation: right after parallel restarts, the node-level data pages were getting corrupted at startup of the server node, and the risk was seemingly reduced when the nodes were restarted sequentially. We understand that Pega uses Hazelcast for search, but we are not aware of any direct impact of Hazelcast on node-level data pages. Hence, the question about how data pages behave after a parallel restart still needs to be clarified. If the data page refresh at node startup also depends on the Hazelcast cluster, then we can understand the suggestion of a sequential restart. Kindly confirm.
I don't think node-level data pages are related to Hazelcast. Hazelcast is a clustering technology, whereas, as far as I know, a node-level data page is a normal data page that is accessible only to the requestors of a particular node.
Did you check the startup logs from the parallel restart that corrupted the data pages? Please compare the startup logs of both restarts. You will find a difference.
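One way to do that comparison is to strip the timestamps and diff the two logs, so that only lines that really differ stand out. The following is a minimal sketch using Python's standard `difflib`. The timestamp pattern is an assumption about the log layout and will need adjusting to the actual format, and `diff_startup_logs` is a name made up for this example.

```python
import difflib
import re

# Assumption: each log line starts with a timestamp such as
# "2016-03-01 10:15:42,123". Adjust the pattern to the real log format.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}[,.]\d{3}\s*")


def normalize(lines):
    """Drop the leading timestamp and trailing newline from each line."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_startup_logs(sequential_lines, parallel_lines):
    """Return unified-diff lines between the two startup logs.

    Timestamps are removed first so identical messages logged at
    different times are not reported as differences.
    """
    return list(difflib.unified_diff(
        normalize(sequential_lines),
        normalize(parallel_lines),
        fromfile="sequential",
        tofile="parallel",
        lineterm="",
        n=0,  # no context lines: show only what differs
    ))
```

The function takes the lines of each file (for example from `open(path).readlines()`). Lines prefixed with `-` appear only in the sequential-restart log and lines prefixed with `+` only in the parallel-restart log. Thread names or node IDs embedded in the messages may still produce noise and might need to be normalized in the same way.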