Our application runs on 4 nodes. A user reported that they were not receiving emails, and we found that the SLA events were not being processed. When we checked SMA, no Last Run Finished time was shown for the Pega-ProCom.ServiceLevelEvents agent. So we followed the article below, deleted the data agent instances for the Pega-ProCom agent from all 4 nodes, and restarted the server. Now the Last Run Finished time is shown.
But when we query the pr_sys_queue_sla table for items with status 'Scheduled', the count goes down, then up again, and this keeps repeating. For example, at one point it is 428567, a little later 428562, then 428569, then 428564, and so on, with no uniform pattern. The pending items are dated from 1/4 to 1/15, and there are around 400K+ records in that table.
Is there a way to determine whether the agent is running and processing the items correctly? The max attempts value is set to 1.
It seems the agent has gone into a loop: for several cases, the SLA goal and deadline times are updated with the same value (1/3/2020) once the agent processes them. So in the next run it picks up the same cases and tries to process them again.
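One way to make sense of the fluctuating count would be to compare two snapshots of the 'Scheduled' rows rather than two totals. The sketch below is only an illustration: it uses an in-memory SQLite table in place of the real database, and it assumes the columns are exposed as pzinskey, pyitemstatus and pyminimumdatetimeforprocessing, which would need to be checked against the actual schema.

```python
import sqlite3


def snapshot(conn):
    """Map each 'Scheduled' item key to its minimum processing time."""
    rows = conn.execute(
        "SELECT pzinskey, pyminimumdatetimeforprocessing "
        "FROM pr_sys_queue_sla WHERE pyitemstatus = 'Scheduled'"
    )
    return dict(rows.fetchall())


def compare(before, after):
    """Split two snapshots into processed, newly arrived and unchanged keys."""
    processed = before.keys() - after.keys()
    arrived = after.keys() - before.keys()
    unchanged = {k for k in before.keys() & after.keys() if before[k] == after[k]}
    return processed, arrived, unchanged


# Stand-in table; the column names are assumptions, not the verified schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pr_sys_queue_sla ("
    "pzinskey TEXT PRIMARY KEY, pyitemstatus TEXT, "
    "pyminimumdatetimeforprocessing TEXT)"
)
conn.executemany(
    "INSERT INTO pr_sys_queue_sla VALUES (?, 'Scheduled', ?)",
    [("A", "2020-01-03"), ("B", "2020-01-04"), ("C", "2020-01-05")],
)

before = snapshot(conn)
# Simulate one agent run: B is processed while a new item D is queued.
conn.execute("DELETE FROM pr_sys_queue_sla WHERE pzinskey = 'B'")
conn.execute("INSERT INTO pr_sys_queue_sla VALUES ('D', 'Scheduled', '2020-01-06')")
after = snapshot(conn)

processed, arrived, unchanged = compare(before, after)
print(len(processed), "processed,", len(arrived), "arrived,", len(unchanged), "unchanged")
```

In this toy run the total stays at 3 even though one item was processed, which is the same pattern as a count that moves between 428562 and 428569. Items that stay unchanged across many snapshots with a time in the past would be the candidates for the loop, although this sketch alone cannot tell whether they were re-queued or simply never reached.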
If we update pyMinimumDateTimeForProcessing to a future date (say 1/3/2024), will these items be ignored for the time being so that the agent starts processing the other items, i.e. the items whose pyMinimumDateTimeForProcessing is in the past but is not 1/3/2020?
When we had issues with our SLA agents, we also had a huge number of records in this table, and the agent was not doing anything.
We actually manipulated the status of some of these items.
Before starting, stop all SLA agents.
You could also delete all the broken items if you have a lot of them (this cleanup is not mandatory).
Then, directly in the DB, select the oldest 'Scheduled' records and update their status to 'Scheduled-Pending' (or anything you'll recognize); this parks them so they are not picked up by the agent.
Keep only a decent number of 'Scheduled' records to be handled by SLA agents.
Start 1 SLA agent and you'll see whether the 'Scheduled' count you kept is decreasing.
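The parking step above could be sketched roughly as follows. This is a minimal illustration against an in-memory SQLite table, not the statement we actually ran: the column names (pzinskey, pyitemstatus, pyminimumdatetimeforprocessing) are assumptions to verify against your schema, the dates are stored as text for simplicity, and the LIMIT syntax would need adapting for databases such as Oracle.

```python
import sqlite3


def park_oldest(conn, keep, parked_status="Scheduled-Pending"):
    """Park every 'Scheduled' row except the `keep` most recent ones."""
    cur = conn.execute(
        "UPDATE pr_sys_queue_sla SET pyitemstatus = ? "
        "WHERE pyitemstatus = 'Scheduled' AND pzinskey NOT IN ("
        "SELECT pzinskey FROM pr_sys_queue_sla "
        "WHERE pyitemstatus = 'Scheduled' "
        "ORDER BY pyminimumdatetimeforprocessing DESC, pzinskey DESC LIMIT ?)",
        (parked_status, keep),
    )
    conn.commit()
    return cur.rowcount  # number of rows that were parked


def count_by_status(conn):
    """Return {status: row count}, to re-check after the agent has run."""
    rows = conn.execute(
        "SELECT pyitemstatus, COUNT(*) FROM pr_sys_queue_sla GROUP BY pyitemstatus"
    )
    return dict(rows.fetchall())


# Stand-in table with ten 'Scheduled' items dated 2020-01-04 .. 2020-01-13.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pr_sys_queue_sla ("
    "pzinskey TEXT PRIMARY KEY, pyitemstatus TEXT, "
    "pyminimumdatetimeforprocessing TEXT)"
)
conn.executemany(
    "INSERT INTO pr_sys_queue_sla VALUES (?, 'Scheduled', ?)",
    [(f"ITEM-{i}", f"2020-01-{4 + i:02d}") for i in range(10)],
)

parked = park_oldest(conn, keep=3)
counts = count_by_status(conn)
print(parked, counts)
```

Running count_by_status again after starting the single agent should show the 'Scheduled' number going down if the agent is healthy. To release the parked items later, the same kind of UPDATE can set the status back to 'Scheduled' in batches.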
You might also check for items that have been stuck in 'Now-Processing' status for a while. These are lost records anyway and could be deleted (or have their status updated to something else to ease investigation; see the second point).
The second investigation is to check whether the stuck 'Now-Processing' items are all linked to the same kind of data. We also had issues with one kind of data, and we parked all SLA records linked to it.
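A rough sketch of that second check is to group the stuck 'Now-Processing' rows by whatever identifies the kind of data in your setup. Here item_kind and status_since are hypothetical columns invented for the illustration (the real table may expose this information differently, for example inside the item key), and the table is again an in-memory SQLite stand-in.

```python
import sqlite3


def stuck_by_kind(conn, cutoff):
    """Count 'Now-Processing' rows older than `cutoff`, grouped by kind of data."""
    rows = conn.execute(
        "SELECT item_kind, COUNT(*) FROM pr_sys_queue_sla "
        "WHERE pyitemstatus = 'Now-Processing' AND status_since < ? "
        "GROUP BY item_kind ORDER BY COUNT(*) DESC, item_kind",
        (cutoff,),
    )
    return rows.fetchall()


# Stand-in table; item_kind and status_since are hypothetical columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pr_sys_queue_sla ("
    "pzinskey TEXT PRIMARY KEY, pyitemstatus TEXT, "
    "item_kind TEXT, status_since TEXT)"
)
conn.executemany(
    "INSERT INTO pr_sys_queue_sla VALUES (?, ?, ?, ?)",
    [
        ("1", "Now-Processing", "Order", "2020-01-04 10:00:00"),
        ("2", "Now-Processing", "Order", "2020-01-04 11:00:00"),
        ("3", "Now-Processing", "Invoice", "2020-01-05 09:00:00"),
        ("4", "Now-Processing", "Order", "2020-01-15 12:00:00"),  # recent, not stuck
        ("5", "Scheduled", "Invoice", "2020-01-04 10:00:00"),  # different status
    ],
)

stuck = stuck_by_kind(conn, cutoff="2020-01-15 00:00:00")
for kind, count in stuck:
    print(kind, count)
```

If one kind clearly dominates the result, that is the group worth parking and investigating separately, as we did.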
On our side, a combination of the above fixed our SLA agent issue.