We want to understand how messages are delivered when using data sets of type Kafka. Are the messages pushed exactly-once, at-least-once, or at-most-once by Pega? Also, is it possible to override this behavior via some settings in a config file?
I assume you are referring to having a Pega Data Flow consume Kafka messages that already exist on a Topic?
The Data Set itself identifies which Kafka instance to connect to, what Topic to work with, and how to convert between the Topic data and a Pega clipboard instance.
The Kafka server & topic configuration determine how long a message is retained on a Topic. Consuming a message from the Topic doesn't remove the message (there may be many consumers of that Topic also interested in that message).
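To make the retention point concrete, here is a minimal sketch using the plain Kafka Java AdminClient (nothing Pega-specific) to read a topic's retention.ms; the broker address and topic name are placeholders assumed purely for illustration:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class TopicRetentionCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "orders"); // assumed topic
            Config config = admin.describeConfigs(Collections.singleton(topic))
                                 .all().get().get(topic);
            // retention.ms controls how long the broker keeps messages,
            // regardless of whether any consumer has already read them.
            System.out.println("retention.ms = " + config.get("retention.ms").value());
        }
    }
}
```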
A Data Flow which references the Kafka Data Set controls which messages are consumed from the Topic via the Source configuration of that Data Set in the Data Flow (the "Read options" setting).
I haven't played extensively with this, but here are my observations...
Starting a new Data Flow Run with "Read existing and new records" selected will consume all messages still on the Topic, even if they have been consumed before. The Data Flow Run - once started - will track what has been consumed and only deliver new messages that subsequently arrive.
If you Pause and Resume the Data Flow Run - or restart the real-time nodes on which the Data Flow Run is running - this state is retained.
If you Stop the Data Flow Run, this tracking is lost. If a Stopped Data Flow Run is Started again, and the Source Data Set is "Read existing and new records", all messages still on the Topic are consumed.
If "Only read new records" is selected, I expect that only the records published to the Topic after the Data Flow Run is Started will be consumed. If you Pause the Data Flow Run, new messages are published to the Topic, and you then Resume it, I'm not sure whether the messages published whilst the Run was paused are consumed. I would like to think they are, and that the "Read options" configuration only drives the behaviour when the Data Flow Run first starts.
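For what it's worth, my working assumption is that these options map onto standard Kafka consumer-group mechanics: committed offsets plus the auto.offset.reset setting. I can't confirm that this is how Pega implements it internally, so the sketch below is plain Kafka client code illustrating the Kafka-side behaviour only; the group id, topic, and broker are placeholder assumptions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadOptionsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-dataflow-run");          // a fresh group id ~ a new Data Flow Run
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // "earliest" ~ "Read existing and new records": a group with no committed
        //              offsets starts from the beginning of the topic.
        // "latest"   ~ "Only read new records": a group with no committed offsets
        //              starts at the end and only sees messages published afterwards.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));           // assumed topic
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
            // Once offsets are committed for this group, a restart resumes from them
            // (analogous to Pause/Resume); losing or changing the group id loses that
            // position (analogous to Stop, where the tracking is discarded).
        }
    }
}
```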
We are trying to see how to configure the data flows to achieve each of these guarantees (exactly-once, at-most-once, at-least-once). I did explore the two data flow configurations you highlighted; however, I couldn't find any detailed documentation explaining how these settings behave under failure scenarios.
I am particularly interested in how the 'Only read new records' setting behaves under failures. I see the following info:
Only read new records - When the data flow has started, the data flow receives real-time data records from the streaming data
Based on this, I assume that if we pause the data flow with this configuration, we will lose the messages published while the data flow wasn't running, which is what we want to avoid.
However, publishing messages via the Data Set is a different matter. I couldn't find any setting to configure the delivery semantics (maybe this link can help explain what I am referring to).
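To pin down what I mean by delivery semantics on the publish side: in plain Kafka they are governed by producer settings such as acks, retries, and enable.idempotence (plus transactions for full exactly-once). What I'm trying to find out is whether the Pega Kafka Data Set exposes these or an equivalent. A sketch with placeholder broker and topic names:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DeliverySemanticsSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // acks=0                  -> fire-and-forget, effectively at-most-once
        // acks=all + retries      -> at-least-once (duplicates possible on retry)
        // enable.idempotence=true -> removes duplicates from producer retries;
        //                            transactions extend this towards exactly-once
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "key-1", "payload")); // assumed topic
            producer.flush();
        }
    }
}
```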