Posted: 22 Jun 2018 1:39 EDT Last activity: 23 Jul 2018 14:21 EDT
Does the number of assignments created during data flow execution depend on the number of partitions in the source data, or on the thread count and batch scalability factor?
I am a bit confused about how data flow processing works, specifically how many assignments are created during execution.
The data flow help says the following:
"Specify the number of the Pega 7 Platform threads that are assigned to process running the data flows and the batch scalability factor to use idle threads for running the data flows.
For example, when the source of a data flow is divided into five partitions, the data flow run is divided into five assignments that can be processed simultaneously on separate threads if there are enough threads.
The number of available threads is calculated by multiplying the thread count by the number of nodes. With two nodes and five threads in the system, the data flow run uses five threads and five threads remain idle. After you set the batch scalability factor to two, all 10 threads are used to process five assignments.
Enter the number of threads.
Note: The number of threads for running data flows is the same across all decision data nodes that are configured for the Data Flow service.
Enter the batch scalability factor."
The italicized lines in the data flow help above suggest that the number of assignments depends on the number of partitions.
However, the attached PNG file showing the data flow settings states that Number of assignments = No of nodes * Thread Count * Batch scalability factor.
So the question is: which one is correct, how does data flow parallel processing actually happen, and what are the roles of partitions, node count, thread count and batch scalability factor?
The way I understand it, the number of assignments is determined by the number of partitions. That is not going to change, regardless of the number of nodes and/or threads.
The number of available threads, however, can change. If it is higher than the number of assignments, you have to increase the batch scalability factor in order to take advantage of the idle threads. This tweak will not change the number of assignments.
This is correct; however, I would like to add to it:
When looking at partitions from the point of view of assignments, you can see the total partitions as the total assignments that need to be picked up by threads during the execution of a data flow run. This is distinct from the number of simultaneous assignments.
The number of partitions is defined by the source; the number of simultaneous assignments is defined by number of threads * number of nodes (and the batch scalability factor in the case of batch runs).
However, some sources set the number of partitions based on the number of threads. For example, the 'Monte Carlo data set' partitions its data non-deterministically over all threads. It does this by setting the number of partitions to 'nodes * threads', so all partitions are executed at the same time (as opposed to one after another, which is what happens when you have more partitions than threads).
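As a rough illustration of the difference between total and simultaneous assignments described above, here is a minimal Python sketch. It is not Pega code: the function `plan_run` and its parameter names are hypothetical, and it assumes that each thread works on exactly one partition at a time and that all partitions take equally long. The batch scalability factor is deliberately left out of the thread capacity in this toy model.

```python
import math


def plan_run(partitions: int, nodes: int, threads_per_node: int) -> dict:
    """Toy model of a data flow run (hypothetical helper, not a Pega API)."""
    if min(partitions, nodes, threads_per_node) < 1:
        raise ValueError("all inputs must be at least 1")

    # Available threads = thread count * number of nodes.
    total_threads = nodes * threads_per_node

    # One assignment per partition; this never depends on the thread settings.
    total_assignments = partitions

    # A thread picks up one assignment at a time, so concurrency is capped
    # by whichever is smaller: the partitions or the available threads.
    simultaneous = min(partitions, total_threads)

    return {
        "total_assignments": total_assignments,
        "simultaneous_assignments": simultaneous,
        "idle_threads": total_threads - simultaneous,
        # Rounds of work needed if every partition takes the same time.
        "waves": math.ceil(partitions / total_threads),
    }


five_partitions = plan_run(partitions=5, nodes=2, threads_per_node=5)
twelve_partitions = plan_run(partitions=12, nodes=2, threads_per_node=5)
# A source that sets partitions to nodes * threads, as described for Monte Carlo.
matched = plan_run(partitions=2 * 5, nodes=2, threads_per_node=5)
```

Under these assumptions, five partitions on two nodes with five threads each keep five threads busy and leave five idle, twelve partitions need two rounds, and a source that sets its partitions to nodes * threads uses every thread in a single round.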
Thanks for clarifying. So that means the hover text "Number of assignments = No of nodes * Thread Count * Batch scalability factor." in the attached screenshot is wrong and probably needs to be corrected by Pega.
It is still not very clear to me. For example, if the source has ten partitions, then there will be ten assignments.
Now if no of nodes = 6, Thread count = 5 and batch scalability factor = 6, then as per your understanding the no. of simultaneous assignments will be 6*5*6 = 180. But obviously that is much more than the No of assignments, 10, so how will it work?
Based on what is mentioned in the data flow help and what PaulGentile_GCS has said, I believe the formula is probably the following:
No of assignments = No of partitions in source (no argument on this).
No of parallel threads working on the assignments = (batch scalability factor/No of nodes) * (No of nodes * Thread count) = batch scalability factor * Thread count.
Now if "No of parallel threads working on the assignments" < No of assignments, then not all assignments will be processed simultaneously;
if "No of parallel threads working on the assignments" = No of assignments, then all assignments will be processed simultaneously, one per thread;
if "No of parallel threads working on the assignments" > No of assignments, then more than one thread will be working on each assignment.
I am not sure whether it is possible to set the batch scalability factor higher than the no of nodes and, if so, how that would work.
Do you think I am arriving at the right conclusion?
Multiple threads cannot work on the same assignment.
If for example you have a source that does not support partitioning, there will only be one partition. Multiple threads/nodes cannot divide the work among them (this is what partitions are used for) so only one thread will be executing this data flow.
So: if "No of parallel threads working on the assignments" > No of assignments, then all assignments will be processed simultaneously, each by a single thread.
Btw, it would be better to use the term "partitions" instead of "No of assignments" in the above statement. It can then be simplified to: if the number of parallel threads > the number of partitions, then all partitions will be processed simultaneously, each by one thread.
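The rule that a partition is handled by exactly one thread, even when more threads are available, can be mimicked with an ordinary Python thread pool. This is only an analogy using the standard library, not how the Pega engine is implemented; `run_partitions` and the partition names are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_partitions(partition_ids, thread_count):
    """Process every partition exactly once on a pool of worker threads."""
    handled_by = {}  # partition id -> name of the single thread that took it

    def process(partition_id):
        # Each partition is submitted as one unit of work, so only one
        # thread ever records itself as the handler of that partition.
        handled_by[partition_id] = threading.current_thread().name

    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # Consume the iterator so that every task has finished.
        list(pool.map(process, partition_ids))
    return handled_by


# Three partitions, ten threads allowed: at most three threads do any work.
handled = run_partitions(["p0", "p1", "p2"], thread_count=10)
busy_threads = set(handled.values())
```

However large `thread_count` is, `busy_threads` never contains more entries than there are partitions, because there is no smaller unit of work to hand to the remaining threads.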
OK, but in the case of "No of parallel threads" > "No of partitions", if each partition is processed by only one thread, then there will be idle threads. Doesn't that defeat the purpose of the batch scalability factor as stated by the data flow help text, which I have quoted below?
"The number of available threads is calculated by multiplying the thread count by the number of nodes. With two nodes and five threads in the system, the data flow run uses five threads and five threads remain idle. After you set the batch scalability factor to two, all 10 threads are used to process five assignments."
You are absolutely right, the help documentation is incorrect in implying that multiple threads will process the same partition/assignment.
We will amend the help text documentation with the following description on Batch Scalability:
"It is used to calculate the suggested number of partitions to be used in a data flow run, that number is calculated using this formula: numOfNodes * threadCount * scalabilityFactor. Keep in mind that this calculation will only suggest the number of partitions, it's up to the dataset implementation to decide how many partitions will actually be used."