I have a cluster of two nodes, and both are configured to run data flows. But when I start a data flow with Dataflow-Execute, it only runs on one node. How can I distribute the data flow so that it runs on more than one node?
Background: I want to test the Kafka interface and check whether two nodes process the same Kafka messages twice, which is not wanted.
I'm currently prototyping, so I defined a simple integer as the partitioning key. What I figured out in the meantime is that the partitioning key is only relevant when Pega sends the Kafka message. I also ran a test in a cluster: I defined two data flow nodes and then configured a real-time data flow, but this flow only runs on the first node and not on the second. My main goal is to prevent duplicate messages when the data flow is started on several nodes.
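To illustrate what the partitioning key does on the producing side, here is a toy sketch (not Pega or Kafka client code; the function name and modulo hash are my own illustration). Kafka's default partitioner hashes the message key (murmur2 in the Java client) to pick a partition, so all messages with the same key land on the same partition:

```python
# Toy illustration of key-based partition assignment.
# NOT the real Kafka partitioner: Kafka's Java client uses murmur2,
# a simple modulo over a stable hash is used here to show the idea.

NUM_PARTITIONS = 2  # assumption: a topic with two partitions

def partition_for(key: int) -> int:
    """Same key -> same partition, so per-key ordering is preserved."""
    return key % NUM_PARTITIONS  # stand-in for Kafka's key hash

keys = [1, 2, 3, 1, 2, 3]
assignments = [partition_for(k) for k in keys]
print(assignments)  # [1, 0, 1, 1, 0, 1]

# Equal keys always map to the same partition:
assert partition_for(1) == partition_for(1)
```

This is also why the key only matters when producing: the consumer side just reads whatever partitions it has been assigned, regardless of how the keys were hashed.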
I also ran an additional test: I implemented two data flows with the same Kafka data set listening to the same topic, and here both data flows processed each message. That makes sense, as it is a publish/subscribe pattern rather than a queue-based message exchange pattern. Still, for a single data flow distributed across different nodes, I expect each message to be processed only once. With the Kafka producer tool I was able to set the partitioning key, and in Pega I can see that value in pzPartition.
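The behavior described above matches Kafka's consumer-group semantics. A toy simulation (not the Kafka client API; the `assign_partitions` helper and the node/flow names are hypothetical) of how partitions are divided within one group versus duplicated across groups:

```python
# Toy simulation of Kafka consumer-group assignment.
# NOT the Kafka client API: just round-robin assignment to show why
# two groups see every message twice, while two consumers in ONE
# group split the partitions and process each message once.

def assign_partitions(partitions, consumers):
    """Round-robin one topic's partitions over the consumers of a single group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]  # assumption: a topic with four partitions

# Two separate data flows -> two separate groups: each group gets ALL
# partitions, so every message is processed twice (the duplicate case).
group_a = assign_partitions(partitions, ["flowA"])
group_b = assign_partitions(partitions, ["flowB"])
print(group_a)  # {'flowA': [0, 1, 2, 3]}
print(group_b)  # {'flowB': [0, 1, 2, 3]}

# One data flow running on two nodes -> one group with two consumers:
# the partitions are split, so each message is processed exactly once.
shared = assign_partitions(partitions, ["node1", "node2"])
print(shared)  # {'node1': [0, 2], 'node2': [1, 3]}
```

So if the distributed data flow's consumers on both nodes joined the same consumer group, each message would be delivered to only one of them; duplicates would indicate the nodes are consuming as independent groups.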
Andre, hi, thanks for the context. I was originally answering from a pure data flow/data set perspective; the Kafka data set is outside my experience, but I'll check with others on how the partitioning should be working.