Enabled service tracing breaks REST services when Hazelcast has trouble
Hello!
My team ran into a strange issue related to Hazelcast and integrations.
Prod Env:
Pega Platform 7.3.0
WebSphere Proprietary information hidden + Oracle 12c
Production Level = 5
14 nodes in cluster.
dss trace/cluster/ServiceRuleWatchMaxProductionLevel = 5
One of the nodes had problems: it became unhealthy and logged a couple of times that it had left the cluster and returned. There were also Hazelcast log entries like this:
java.util.concurrent.TimeoutException: MemberCallableTaskOperation failed to complete within 30 SECONDS.
After that, all our REST services logged errors like this:
java.lang.IllegalMonitorStateException: Current thread is not owner of the lock! -> Owner: 0a311d34-35d4-48cc-aad3-1e54d47d0e8f, thread-id: 862961
at com.hazelcast.concurrent.lock.operations.UnlockOperation.unlock(UnlockOperation.java:74)
at com.hazelcast.concurrent.lock.operations.UnlockOperation.run(UnlockOperation.java:63)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:186)
at com.hazelcast.spi.impl.operationservice.impl.OperationRunnerImpl.run(OperationRunnerImpl.java:401)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.process(OperationThread.java:117)
at com.hazelcast.spi.impl.operationexecutor.impl.OperationThread.run(OperationThread.java:102)
at ------ submitted from ------.(Unknown Source)
at java.lang.Thread.getStackTrace(Thread.java:1117)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:114)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrowIfException(InvocationFuture.java:75)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:155)
at com.hazelcast.spi.impl.AbstractInvocationFuture.join(AbstractInvocationFuture.java:136)
at com.hazelcast.concurrent.lock.LockProxySupport.unlock(LockProxySupport.java:149)
at com.hazelcast.map.impl.proxy.MapProxyImpl.unlock(MapProxyImpl.java:280)
at com.pega.pegarules.cluster.internal.PRHazelcastDistributedMapImpl.unlock(PRHazelcastDistributedMapImpl.java:430)
at com.pega.pegarules.monitor.internal.tracer.DistributedRuleWatchImpl.getClientRequestorID(DistributedRuleWatchImpl.java:413)
at com.pega.pegarules.monitor.internal.tracer.TracerSessionRackImpl.doINeedDibsFlag(TracerSessionRackImpl.java:436)
at com.pega.pegarules.monitor.internal.tracer.TracerSessionRackImpl.getTracerSessionIfEnabled(TracerSessionRackImpl.java:414)
at com.pega.pegarules.integration.engine.internal.services.ServiceAPI.getTracerSession(ServiceAPI.java:2651)
at com.pega.pegarules.integration.engine.internal.services.ServiceAPI.initializeThreadContext(ServiceAPI.java:2635)
at com.pega.pegarules.integration.engine.internal.services.ServiceAPI.withLockSetup(ServiceAPI.java:1318)
at com.pega.pegarules.session.external.engineinterface.service.EngineAPI.processRequestInner(EngineAPI.java:379)
at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
at java.lang.reflect.Method.invoke(Method.java:508)
at com.pega.pegarules.session.internal.PRSessionProviderImpl.performTargetActionWithLock(PRSessionProviderImpl.java:1315)
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:1052)
at com.pega.pegarules.session.internal.PRSessionProviderImpl.doWithRequestorLocked(PRSessionProviderImpl.java:907)
at com.pega.pegarules.session.external.engineinterface.service.EngineAPI.processRequest(EngineAPI.java:334)
at com.pega.pegarules.integration.engine.internal.services.StatelessServiceAPI.processRequest(StatelessServiceAPI.java:46)
at com.pega.pegarules.integration.engine.internal.services.http.HTTPService.invoke(HTTPService.java:463)
After that, on most of the nodes, almost all WebContainer threads became parked with a Hazelcast operation at the top of their stacks.
The incident was resolved by setting the DSS to 4 and restarting the nodes in the cluster.
So, are there any hotfixes for the Hazelcast functionality, or any configuration tips? Or maybe some recommendations for debugging Hazelcast issues and for recovery?
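For readers less familiar with the exception in the trace above: IllegalMonitorStateException is what a lock throws when unlock() is called by something other than its current owner. The sketch below is only a local analogy using the JDK's ReentrantLock, not Hazelcast or Pega code; the class and method names are hypothetical. It is my assumption, not something confirmed, that the node leaving and rejoining the cluster left the distributed lock recorded under a different owner than the thread that later tried to release it.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration class; a local analogy, not Hazelcast or Pega code.
public class UnlockByNonOwner {

    // Locks on the calling thread, then tries to unlock from another thread.
    // Returns the class name of the exception the other thread caught.
    static String unlockFromOtherThread() throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        lock.lock(); // the calling thread is now the owner

        final String[] result = {"no exception"};
        Thread other = new Thread(() -> {
            try {
                lock.unlock(); // this thread does not own the lock
            } catch (IllegalMonitorStateException e) {
                result[0] = e.getClass().getName();
            }
        });
        other.start();
        other.join(); // join() makes the write to result visible here

        lock.unlock(); // the real owner can still release normally
        return result[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(unlockFromOtherThread());
    }
}
```

In the trace, the equivalent unlock() call is made from DistributedRuleWatchImpl.getClientRequestorID while the service checks whether a tracer session is enabled, which would explain why lowering the DSS below the production level made the errors stop.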