Using a real-time data flow with a Kafka data set to 'only read new records'
We are exploring how Pega works with a Kafka data set as the source of a real-time data flow. How does it keep track of which records have been processed/read from the data set? Here is an example with our observations.
1] Configure a real-time data flow with a Kafka data set as its source.
2] Set the read option to 'Only read new records'.
3] Using a Kafka producer, post a few messages to the topic configured in the Kafka data set; say you have posted 3 messages (a minimal producer sketch follows this list).
4] Review the component statistics (data flow run stats): they show 3 successful records.
5] Stop the data flow and post another message (the 4th message).
6] Restart the data flow and post another message (the 5th message).
7] Review the component statistics (data flow run stats): you will see it has processed only the 5th message, which means the 4th message is lost or not processed. Is this expected behaviour? What is the definition of a 'new record'? Is it anything posted after the data flow is started/restarted, or everything posted since the last processed record?
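For reproducibility, here is a minimal sketch of the producer side of the test above, using the plain Apache Kafka Java client. The broker address, topic name, and payloads are assumptions for illustration; in practice, post to the topic configured on the Kafka data set rule.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TestProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; use the brokers from your Kafka instance record.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Steps 3, 5 and 6 of the scenario: each send() posts one message
            // to the topic configured in the Kafka data set ("customer-events" is assumed).
            for (int i = 1; i <= 3; i++) {
                producer.send(new ProducerRecord<>("customer-events", "key-" + i,
                        "{\"msg\":" + i + "}"));
            }
            producer.flush();
        }
    }
}
```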
We have raised a support request with Pega for the messages lost in the scenario above. The GCS team suggested raising it on the Support Community, hence this post.
As per my understanding, for a data flow configured with a Kafka data set as its source, any record queued to the Kafka stream should get processed.
The best way to confirm that the request has been queued is to open the Kafka stream data set and go to Actions -> Run -> Browse, to see the records that have been queued to the stream for data flow processing.
The browse window presents the list of queued items similar to the clipboard representation of a list of pages (Results(1), Results(2), and so on; PFA). The queued items are retained in the queue for the number of days configured on the Kafka data set rule under "Retention period".
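Outside of Pega, you can also confirm that the records are still retained on the topic by reading it from the beginning with a plain Kafka consumer. A sketch follows; the broker, topic, and throwaway group id are assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class VerifyQueuedRecords {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // assumed broker
        props.put("group.id", "verify-" + System.currentTimeMillis());  // throwaway group
        props.put("auto.offset.reset", "earliest");                     // read from the start of retention
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customer-events")); // assumed topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> r : records) {
                // Records still within the retention period show up here, whether
                // or not the data flow has processed them (one poll suffices for
                // a small test topic).
                System.out.printf("offset=%d timestamp=%d value=%s%n",
                        r.offset(), r.timestamp(), r.value());
            }
        }
    }
}
```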
So ideally, if the record is visible in the queue, it should get processed. But I have also observed once that if we queue a request while the associated real-time DF is in "Stopped" status, on the next restart the processing does not resume from the last queued record; rather, it processes only records queued after the restart.
I looked at the help topics for the Kafka data set and learned that records in the Kafka stream are stored in chunks of data referred to as topics (which are split into partitions), and each incoming record is pushed to a topic with a key, a value (the request sent), and a timestamp.
So my understanding is that if the status of the real-time data flow object (available from the Configure->Decisioning->Decisions->Dataflow->Realtime landing page; PFA) is Stopped, then on the next restart it compares the start time of the real-time data flow with the timestamp of each record in the queue, and ignores any records older than the start time of the DF.
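How Pega implements 'only read new records' internally is not documented in this thread, so the timestamp comparison above is one hypothesis. For comparison, the behaviour observed in the scenario (4th message skipped, 5th processed) is exactly what a plain Kafka consumer does when it starts with no committed offsets and auto.offset.reset=latest: it seeks to the end of the log rather than comparing timestamps. A sketch of the two candidate interpretations of 'new record', with assumed names:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class NewRecordSemantics {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Interpretation A: "new" = anything posted after this (re)start.
        // A group with no committed offsets plus auto.offset.reset=latest starts
        // at the end of the log, so the 4th message (posted while the data flow
        // was stopped) is never seen -- matching the behaviour observed above.
        props.put("group.id", "df-run-" + System.currentTimeMillis()); // fresh group each start
        props.put("auto.offset.reset", "latest");

        // Interpretation B: "new" = everything since the last processed record.
        // A stable group id with committed offsets resumes at the offset after
        // the last commit, so the 4th message would be processed on restart:
        //   props.put("group.id", "my-data-flow");      // same group across restarts
        //   props.put("enable.auto.commit", "true");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customer-events")); // assumed topic
            // poll as usual ...
        }
    }
}
```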
Ideally, the scenario you mention, stopping and restarting while queuing records in parallel, should be rare. In the case of a node failure or similar, resumable DFs (real-time, here) internally have the capability to resume from just after the last successfully processed record: the DF run is paused on the failed node and processing of the partition's records resumes on another available healthy node, so this issue may not arise there (sketched below). The rest of the time, since it is real-time processing, the data flow is expected to be "always ON", i.e. running, so that any queued records are processed immediately.
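To make that resume-from-last-processed behaviour concrete in plain Kafka terms (whether the resumable DF engine uses this exact mechanism is an assumption), a consumer group that commits offsets only after processing gives the same guarantee: on node failure the partition is reassigned and another consumer in the group resumes from the last committed offset. Broker, topic, and group names are illustrative:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ResumableConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker
        props.put("group.id", "my-data-flow");             // stable group across nodes/restarts
        props.put("enable.auto.commit", "false");          // commit manually, after processing
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("customer-events")); // assumed topic
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : batch) {
                    process(r); // hypothetical processing step
                }
                // Commit only after the batch has been processed: if this node dies,
                // the partition is reassigned and another consumer in the group
                // resumes from the last committed offset, so nothing committed is
                // redone and nothing uncommitted is silently skipped.
                if (!batch.isEmpty()) {
                    consumer.commitSync();
                }
            }
        }
    }

    static void process(ConsumerRecord<String, String> r) {
        System.out.println("processed offset " + r.offset());
    }
}
```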