To start with, the URL below has details on how to check whether a node is running or unavailable. Based on the response, we have to take appropriate action to remove the instance from the LB. Pega has HTTP service types that can be used for this.
Thanks for the reply, but it's not exactly what I'm looking for. I need the ability to change the returned status of the monitored URI (probe) on the fly when we want to, say, quiesce the app server for maintenance. That way the LB will direct new users to the other app servers, and quiesced users should be sent back to the LB for redirection.
Thank you! Adding more details on the Quiesce process in general: this is how it works, and how it should work with Azure. Hope it helps!
1. The system admin identifies a system maintenance activity that requires a system restart.
2. The admin disables the node in the load balancer, so that no new sessions are established on that node; it continues serving existing sessions because session affinity is in place.
3. The admin initiates the Quiesce of the node from the landing page, AES, MBeans, or SMA.
4. When Quiesce is initiated, Pega 7 immediately passivates all inactive sessions into the shared storage.
5. For all active user sessions, passivation starts after the passivation timeout, which is configured with the session/ha/quiesce/PassivationTimeout DASS.
6. The default value of the passivation timeout is 5 seconds.
7. The system maintains a passivation queue, and sessions that are ready to be passivated are added to it. The UI state and Clipboard pages of the requestors are persisted in the shared storage or database.
8. The system also stops all non-essential agents and all listeners except MDB listeners.
9. Among active sessions, quiesce administrators are the exception: all user sessions are passivated except those of operators who hold the “PegaRULES:HighAvailabilityQuiesceInvestigator” role in their access group. This lets users with this role still log in to the system and troubleshoot or administer the quiesce operation.
10. When the active user count drops to 0, the system updates the node status to “Quiesce Complete”. This count does not include users with the quiesce administrator role.
11. Once the status is updated to “Quiesce Complete”, the admin can bring down the node for system maintenance.
12. Once a user session is passivated and a new request arrives for it, the node invalidates the session and sends a redirect to the load balancer. The load balancer handles it by forwarding the request to another node that is active in the load balancer pool.
13. The new node accesses the shared storage to get the UI state and Clipboard threads of the passivated user, redraws the UI, and responds to the client. From then on, all requests for that session are sent to the new node.
14. After the system maintenance activity, the admin can CancelQuiesce the node from the landing page, AES, MBeans, or SMA.
15. The load balancer is also reconfigured to put this node back in the pool so that it accepts new session requests.
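To make the passivation rules in steps 4 to 10 concrete, here is a small Python sketch of them. It is only a model of the description above, not Pega code: the `Session` class, the `quiesce_status` function, and the "Quiesce In Progress" label are made-up names for illustration, and the real timing of the passivation timeout may differ from this simplification.

```python
from dataclasses import dataclass

# Role name and default timeout come from the steps above.
# Everything else in this sketch is hypothetical.
INVESTIGATOR_ROLE = "PegaRULES:HighAvailabilityQuiesceInvestigator"
DEFAULT_PASSIVATION_TIMEOUT_S = 5


@dataclass
class Session:
    user: str
    active: bool
    roles: frozenset = frozenset()
    passivated: bool = False


def quiesce_status(sessions, seconds_since_quiesce,
                   timeout=DEFAULT_PASSIVATION_TIMEOUT_S):
    """Passivate eligible sessions and return the resulting node status."""
    for s in sessions:
        if INVESTIGATOR_ROLE in s.roles:
            continue  # quiesce administrators are never passivated
        # Inactive sessions go immediately; active ones after the timeout.
        if not s.active or seconds_since_quiesce >= timeout:
            s.passivated = True
    # Administrators are not counted towards the remaining users.
    remaining = [s for s in sessions
                 if not s.passivated and INVESTIGATOR_ROLE not in s.roles]
    # "Quiesce In Progress" is an invented label for the intermediate state.
    return "Quiesce Complete" if not remaining else "Quiesce In Progress"
```

Calling `quiesce_status(sessions, 0)` right after quiesce starts passivates only the inactive sessions; calling it again with a value at or above the timeout passivates the rest, apart from investigators.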
Yes, that was my understanding, but step 2 is the problem. There is nothing within the App Gateway to temporarily disable a node in the LB. Hence my requirement to trick it into thinking the node isn't available any longer by setting the URI to return a non-200 HTTP status, thereby getting the LB to switch to the other app servers automatically.
Perhaps I'm not being very clear. I want/need the ability within Pega to tell the LB that this particular node is not available. This could be for any reason, such as CPU at capacity or RAM reaching a set limit, not purely for maintenance. The problem is that the Azure App Gateway LB only supports RR (Round Robin) distribution, so it would still send new users to a node that might not be able to handle the load. So I would need a service that we control per node that can tell the LB this node is not available even if it is technically available.
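As a rough illustration of what such a per-node service could look like, here is a minimal Python sketch using only the standard library. It is not a Pega or Azure feature: the `/probe` path, the `node_available` flag, and the function names are assumptions made up for this example. The LB probe would be pointed at this endpoint, and clearing the flag makes it answer 503 even though the node itself is still up.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical per-node switch: set = accept new users, cleared = drain.
node_available = threading.Event()
node_available.set()


def status_for_path(path, available):
    """Return the HTTP status the probe should answer with."""
    if path != "/probe":  # "/probe" is an assumed path, not a Pega URL
        return 404
    return 200 if available else 503


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status = status_for_path(self.path, node_available.is_set())
        body = f"{status}\n".encode("ascii")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet; a real service would log probes


def start_probe_server(host="127.0.0.1", port=8089):
    """Serve the probe in a background thread and return the server."""
    server = HTTPServer((host, port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

With the server running, `node_available.clear()` takes the node out of rotation from the LB's point of view and `node_available.set()` puts it back, without touching the application itself.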
Perhaps I'll need to develop a JSP page within Tomcat that does what I require and get the LB to monitor that page instead.
I think I have solved my issue by creating a JSP page that monitors the local /prweb application within Tomcat and returns a non-200 HTTP status when CPU usage is above a certain percentage or when free RAM is below a certain percentage. It also returns a non-200 status if Pega is not responding in a timely fashion. This page is then used as the monitoring probe within the Azure App Gateway. It also allows us to turn this JSP application off, forcing the LB to receive a non-200 status when we want to push users onto the other Pega node.
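For anyone wanting to do something similar, the decision logic is roughly the following, shown here as a Python sketch rather than the actual JSP. The threshold values, field names, and the `evaluate_node` function are all hypothetical placeholders; how the CPU, RAM, and response-time readings are collected is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeHealth:
    cpu_percent: float                 # current CPU usage, 0-100
    free_ram_percent: float            # free memory, 0-100
    pega_response_s: Optional[float]   # time for /prweb to answer; None = no answer
    probe_enabled: bool = True         # False models switching the probe app off


def evaluate_node(health, max_cpu=90.0, min_free_ram=10.0, max_response_s=2.0):
    """Return 200 if the node should receive new users, otherwise 503.

    All three thresholds are illustrative defaults, not recommended values.
    """
    if not health.probe_enabled:
        return 503  # manual switch: force users onto the other node
    if health.pega_response_s is None or health.pega_response_s > max_response_s:
        return 503  # Pega is not responding in a timely fashion
    if health.cpu_percent > max_cpu:
        return 503  # CPU above the configured percentage
    if health.free_ram_percent < min_free_ram:
        return 503  # free RAM below the configured percentage
    return 200
```

For example, `evaluate_node(NodeHealth(95.0, 40.0, 0.3))` returns 503 because the CPU reading exceeds the assumed 90% limit, while the same node at 50% CPU returns 200.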