System processing is becoming very slow in production
On an average, we are creating 5-7 million cases every day. Most of the day, this process works fine but sometime in a week the processing becomes too slow. I do not see anything wrong happening in the pega logs, also it is not an issue with the load as we are processing more than 8 million data without any issues.However, the issue gets resolved after RDS restart.
During this time the overall connections with postgres seems to be around 1k (20-90 active connections+800-900 idle connections ) however, the connection pool size is set to 5k for postgres it seems. I am not really getting how RDS restart is helping here.
I understand that, this is not that straight-forward but any guidance or clue will be highly appreciated.
Note: We have a process which runs every hour to delete idle connections older than 20 minutes.
After looking into DB logs we found that, the issue was with the pr_sys_queues table having high number of dead rows. This table is being heavily used in our application.
Though auto vacuum is configured on this table, it doesn't help completely as 5k-8k records per minute are getting inserted and are getting processed as well. Due to the high number of dead rows, the size of pr_sys_queues table was grown significantly (~6GB), wherein the actual size of this table is reduced to 20MB after running full vacuum.
I had many encounters with pr_sys_queues, but all of those situations, the underlying issue is something else and pr_sys_queues is the victim.
Based on what you have posted above, it seems you are running Agents very frequently in your production system and running the agents more number of times, which might be causing additional overhead to the system, especially with the Database resources.
Since you mentioned, you are seeing this issue in Production, I suspect, some of the OOTB Agent Scedulers as well as your application agents must have Agent wide setting turned on. As a result, when a new Node ID is added to the system, it must be enabling the Agent Scheduler for which you may or may not needed to enable.
First, I would recommend, Identify and clean up unwanted Node ID's for the System Name that you are using in your Production.
Then disbale Agent-wide settings for your application level agents (if you have already enabled, if not ignore) then perform the same thing Pega OOTB Agent schedules, in case if you can not turv off Agent wide setting for any of the instance, at lest try to uncheck the start up where ever applicable.
Restart the system to reflect the changes above.
Monitor pr_sys_queues table with it's status and based on the results, take necessary actions as needed.
You can use below query to better understand what is being store in that table with its status:
Select pyAgentName,pyItemStatus,Count (*) from pr_sys_queues
Group by pyAgentName,pyItemStatus
Note: I have attached Pega 7.3.1 Recommended settings for Agent Scheduler (use it as reference). I dont have the similar document for latest Pega Version.