Using Real time Data flow with Kafka Data set to only read new records

Question

AmitB441

Member since 2019

1 post

Nationwide

Posted: 22 Nov 2019 8:38 EST
Last activity: 26 Feb 2020 14:06 EST

Closed

We are exploring how PEGA works with a Kafka data set in a real-time data flow. How does it keep track of which records have been processed/read from the data set? Here is an example with our observations.

1] Configure a real time data flow with Kafka data set.

2] Set the Read options as 'only read new records'.

3] Using a Kafka producer, post a few messages to the topic that is configured in the Kafka data set. Say you have posted 3 messages.

4] Review the component statistics – data flow run stats – which show 3 successful records.

5] Stop the data flow and post another message – the 4th message.

6] Start the data flow and post another message – the 5th message.

7] Review the component statistics – data flow run stats – and you will see that it has processed only the 5th message, which means the 4th message is lost or not processed. Is this expected behaviour? What is the definition of a 'new record'? Is it anything posted after the data flow is started/restarted, or everything posted since the last processed record?
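The two possible readings of 'new record' in step 7 can be sketched with a toy model. This is only an illustration under stated assumptions: the topic is a plain Python list, an offset is an index into it, and `simulate` and its `resume_from_last_processed` flag are hypothetical names; nothing here calls PEGA or Kafka.

```python
# Toy model only: a topic is a list, an offset is an index into that list.
# All names are hypothetical; this does not call PEGA or Kafka.

def simulate(events, resume_from_last_processed):
    topic = []      # messages in the order they were posted
    processed = []  # messages the data flow has handled
    position = 0    # next offset the data flow will read
    running = False
    for kind, payload in events:
        if kind == "start":
            running = True
            if not resume_from_last_processed:
                # Reading 1: "new" means posted after this (re)start,
                # so skip everything already sitting in the topic.
                position = len(topic)
        elif kind == "stop":
            running = False
        elif kind == "post":
            topic.append(payload)
        if running:
            # Consume everything from the current position onwards.
            processed.extend(topic[position:])
            position = len(topic)
    return processed

# Steps 1-7 from the post: three messages, stop, 4th, start, 5th.
events = [
    ("start", None), ("post", 1), ("post", 2), ("post", 3),
    ("stop", None), ("post", 4),
    ("start", None), ("post", 5),
]

after_restart = simulate(events, resume_from_last_processed=False)
since_last = simulate(events, resume_from_last_processed=True)
print(after_restart)  # [1, 2, 3, 5]
print(since_last)     # [1, 2, 3, 4, 5]
```

The first result matches what we observed in the run statistics; the second is what we would expect if 'new' meant everything since the last processed record.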

We have raised a support request with PEGA about messages getting lost in the above scenario. The GCS team suggested raising this on the support community, hence this post.

Data Integration

Posted: 30 Dec 2019 11:17 EST

chalr1 replied to AmitB441

Hi Amit

As per my understanding, for a data flow configured with a Kafka data set as its source, any record that gets queued to the Kafka stream should get processed.

The best way to confirm that a request has been queued is to open the Kafka stream data set and go to Actions->Run->Browse to see the records that have been queued to the stream for data flow processing.

The browse window presents the list of queued items much like the clipboard representation of a list of pages (Results(1), Results(2) and so on, PFA). The queued items are retained in the queue for the number of days configured on the Kafka data set rule under "Retention period".

So ideally, if the record is visible in the queue, it should get processed. But I have also observed once that if we queue a request while the associated real-time data flow is in "Stopped" status, then on the next restart processing does not start from the last queued record; rather, it processes only the records queued after the restart.

I looked at the help topics for the Kafka data set and learned that Kafka stores records in categories referred to as topics (which are divided into partitions), and that each record pushed to a topic consists of a key, a value (the request sent) and a timestamp.

https://community.pega.com/sites/default/files/help_v82/procomhelpmain.htm#rule-/rule-decision-/rule-decision-dataset/dsm-creating-kafka-data-set-tsk.htm

So my understanding is that if the status of the real-time data flow (available from the Configure->Decisioning->Decisions->Dataflow->Realtime landing page, PFA) is Stopped, then on the next restart it compares the start time of the real-time data flow with the timestamp of each record in the queue, and ignores any records that are older than the start time of the data flow.
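As a rough sketch of the comparison I am guessing at (this is my assumption about the mechanism, not confirmed PEGA behaviour), the filter would look something like the following. `Record`, `records_to_process` and the timestamp values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

@dataclass
class Record:
    # Mirrors the key / value / timestamp shape of a Kafka record.
    key: str
    value: str
    timestamp: float  # hypothetical units, e.g. seconds

def records_to_process(queued, run_start_time):
    """Keep only records stamped at or after the data flow's start time."""
    return [r for r in queued if r.timestamp >= run_start_time]

queued = [
    Record("k4", "4th message", 100.0),  # queued while the data flow was stopped
    Record("k5", "5th message", 130.0),  # queued after the restart
]

# Assume the data flow was restarted at t=120.
kept = records_to_process(queued, run_start_time=120.0)
print([r.value for r in kept])  # ['5th message']
```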

Ideally, the scenario you mention, stopping and restarting while records are being queued in parallel, should be rare. In case of a node failure or similar, resumable data flows (real-time ones, here) can continue from just after the last successfully processed record: the run is paused on the failed node and processing of the partition's records resumes on another available healthy node, so this issue may not arise. The rest of the time, since this is real-time processing, the data flow is expected to be "always ON", or running, so that any queued records get processed immediately.

Regards

Renukavalli

Posted: 30 Dec 2019 17:20 EST

pawann

Swedbank AB

replied to AmitB441

This seems expected, since the real-time data flow is down and would not track data for that duration. To avoid any data loss, you can set the read option to read existing and new records.

Posted: 26 Feb 2020 14:06 EST

mahar2 replied to AmitB441

Hi Amit ,

With the read option set to 'only read new records':

If you stop the data flow and queue some events to the Kafka queue (Stream data set), it will not read those events after the data flow starts. It will only read the events that arrive after the data flow start.

But if you PAUSE the data flow and queue some events to the Kafka queue, then when you resume the data flow it will pick up the unprocessed events.
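The difference between stop/start and pause/resume can be pictured with a small toy class. This is a sketch of the behaviour described above, not PEGA's implementation: `ToyDataFlowRun` and its methods are hypothetical, and the topic is just a Python list.

```python
class ToyDataFlowRun:
    """Hypothetical model of a run with 'only read new records' selected."""

    def __init__(self, topic):
        self.topic = topic    # shared list standing in for the Kafka topic
        self.position = 0     # next index to read
        self.processed = []
        self.running = False

    def start(self):
        # A fresh start begins at the current end of the topic,
        # so anything queued while stopped is skipped.
        self.position = len(self.topic)
        self.running = True

    def stop(self):
        self.running = False

    def pause(self):
        # Same as stop in this toy; the difference is that resume()
        # does not move the read position.
        self.running = False

    def resume(self):
        self.running = True

    def poll(self):
        # Process everything queued since the current position.
        if self.running:
            self.processed.extend(self.topic[self.position:])
            self.position = len(self.topic)


topic = []
run = ToyDataFlowRun(topic)

run.start()
topic.append("A")
run.poll()

run.stop()
topic.append("B")   # queued while stopped
run.start()
run.poll()          # B is skipped

run.pause()
topic.append("C")   # queued while paused
run.resume()
run.poll()          # C is picked up

print(run.processed)  # ['A', 'C']
```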

Thanks,

Rakesh
