What is the purpose of 'Thread Count' & 'Batch scalability factor' in the Edit Settings of Data Flow configuration screen
We are trying to optimize our data flow to handle around 14 million records. In Designer Studio --> Infrastructure --> Services --> Data Flow, clicking 'Edit settings' shows two configuration items: 'Thread Count' and 'Batch Scalability Factor'. What is the purpose of the 'Batch Scalability Factor' setting? If we set the Thread Count to 3 and the Batch Scalability Factor to 2, what exactly does that imply?
The batch scalability factor is used to calculate the suggested number of partitions for a data flow run, using the formula numOfNodes * threadCount * scalabilityFactor. Keep in mind that this calculation only suggests a number of partitions; it is up to the dataset implementation to decide how many partitions are actually used.
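The formula above can be sketched as a small helper. This is only an illustration of the arithmetic, not Pega's internal implementation; the function name and the single-node example values are assumptions:

```python
def suggested_partitions(num_nodes: int, thread_count: int, scalability_factor: int) -> int:
    """Suggested partition count for a data flow run.

    This is only a suggestion: the dataset implementation decides
    how many partitions are actually used.
    """
    return num_nodes * thread_count * scalability_factor

# Values from the question, assuming a single node:
# Thread Count = 3, Batch Scalability Factor = 2
print(suggested_partitions(1, 3, 2))  # 6
```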
Thread count: by default, nodes are configured to run with 5 threads. Each node that will take part in the data flow execution needs to be included in the service cluster. Note that setting a large thread count won't necessarily improve data flow execution speed; it is important to take the number of available cores into consideration when deciding on this value.
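One way to act on the "available cores" advice is to cap the per-node thread count at the core count. This is a hedged sketch of that sizing rule, not a Pega API; `reasonable_thread_count` is a hypothetical helper:

```python
import os

def reasonable_thread_count(requested: int) -> int:
    # Cap the per-node thread count at the number of available cores;
    # a larger value won't necessarily speed up data flow execution.
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return min(requested, cores)
```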
Does this mean that the scalabilityFactor has no effect on the number of partitions processed in parallel? If we have 5 nodes, set threadCount to 2 and batchScalabilityFactor to 2, and have 20 partitions, is the number of partitions processed in parallel 10 (5 nodes * 2 threads), and not 20 (5 nodes * 2 threads * 2 scalabilityFactor)?
From what we have seen in our testing, the batchScalabilityFactor does not have any effect on the number of partitions processed in parallel. Only the 'Thread Count' determines how many partitions are processed in parallel. This is the case for an RDBMS database (Oracle). We heard that the batchScalabilityFactor comes into the picture when dealing with a Cassandra database, but I am not sure how that works.
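The distinction observed above can be summarized with a short sketch, assuming the behavior described in this thread for an RDBMS source: the scalability factor feeds only the suggested partition count, while actual concurrency is bounded by nodes times threads. The function names are illustrative, not Pega APIs:

```python
def suggested_partitions(num_nodes, thread_count, scalability_factor):
    # Suggestion only; the dataset implementation decides the actual count.
    return num_nodes * thread_count * scalability_factor

def concurrent_partitions(num_nodes, thread_count, total_partitions):
    # Observed behavior with an RDBMS (Oracle) source: partitions processed
    # in parallel are bounded by nodes * threads, not by the scalability factor.
    return min(num_nodes * thread_count, total_partitions)

nodes, threads, factor, partitions = 5, 2, 2, 20
print(suggested_partitions(nodes, threads, factor))       # 20 suggested
print(concurrent_partitions(nodes, threads, partitions))  # 10 in parallel
```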