Question
Data Flow Partitions
How does the system derive the number of partitions to execute a data flow? Say I have a node (10 cores) configured as a data flow node, and the thread count is set to 1 when we create a Batch processing data flow work object. How many partitions would the system create? Does that depend on the data? If so, how would the data affect the partitions created by the data flow?
Assume that we are processing 100 records with partition keys distributed across 0 to 9.
The number of partitions depends on the source data set you are using in the data flow.
For example, if the source is a database table, the data is distributed across the partitions based on the partition key, with one partition created per distinct key value. If I take age as the partition key, each age group gets its own partition. In your scenario, 100 records with partition keys distributed across 0 to 9 would therefore produce 10 partitions, as sketched below.
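As a minimal illustration of that grouping logic (not Pega's actual implementation; the `DataRecord` type and class name are made up for this sketch), here is how 100 records with keys 0 to 9 collapse into 10 partitions:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class KeyPartitioningSketch {

    // Hypothetical record type; only the partition key matters for this sketch.
    record DataRecord(int partitionKey, String payload) {}

    public static void main(String[] args) {
        // 100 records with partition keys distributed across 0 to 9,
        // matching the scenario in the question.
        List<DataRecord> records = IntStream.range(0, 100)
                .mapToObj(i -> new DataRecord(i % 10, "record-" + i))
                .collect(Collectors.toList());

        // One partition per distinct key value: grouping by the key
        // yields 10 partitions of 10 records each.
        Map<Integer, List<DataRecord>> partitions = records.stream()
                .collect(Collectors.groupingBy(DataRecord::partitionKey));

        System.out.println("Partitions created: " + partitions.size()); // prints 10
        partitions.forEach((key, group) ->
                System.out.println("Partition " + key + " -> " + group.size() + " records"));
    }
}
```

The same reasoning answers the original question: the thread count controls how many partitions a node processes concurrently, while the number of partitions itself comes from the distinct key values in the data.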
In the case of a real-time data flow with a Stream data set as the source, 20 partitions are created by default; however, that number can be changed.
With a Stream data set, I can choose customer ID as the partition key. In this case (Stream data sets only), the Kafka topic behind the data set assigns records to partitions using a hash algorithm.
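A simplified sketch of that hash-based assignment is below. Kafka's default partitioner actually computes a murmur2 hash of the serialized key; plain `hashCode()` is used here purely for illustration, and the class and variable names are made up:

```java
import java.util.HashMap;
import java.util.Map;

public class HashPartitioningSketch {

    // Simplified stand-in for Kafka's default partitioner. The real one
    // takes murmur2(serializedKey) modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        return Math.abs(key.hashCode() % numPartitions);
    }

    public static void main(String[] args) {
        int numPartitions = 20; // the default partition count mentioned above

        // Count how many sample customer IDs land on each partition.
        Map<Integer, Integer> recordsPerPartition = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            String customerId = "CUST-" + i;
            int partition = partitionFor(customerId, numPartitions);
            recordsPerPartition.merge(partition, 1, Integer::sum);
        }
        System.out.println("Records per partition: " + recordsPerPartition);
    }
}
```

Unlike the database-table case, the number of partitions here is fixed in advance (20 by default), and the hash spreads keys across them rather than creating one partition per key value.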
In the case of a File data set as the source, only one partition is created.