Posted: 26 Oct 2017 0:10 EDT Last activity: 26 Jun 2018 1:27 EDT
Defining a partition key and using it in a report definition (data flow)
Are there any teams out there that could speak to how they identified their partition key (what the value should be) and to the behavior of using a partition key in the report definition source of a data flow? We currently have a batch that will take a very long time because of the number of records involved, and we are looking for ways to optimize the batch data flow execution.
Also, to note: the table in context which I want to partition is going to be truncated and reloaded on a nightly basis.
We will define a partition key column (e.g. SequenceNumber) and evenly distribute the records in our data table across a number of partitions equal to 2x our JVMs (so if we had 16 JVMs, we would set the partition key to a value from 1 to 32 and ensure the records are distributed evenly across those values).
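To make the plan above concrete, here is a rough sketch of how the SequenceNumber values could be assigned during the nightly reload. It is only an illustration: the function name `assign_partition_keys` and the round-robin rule are my own assumptions, not anything Pega provides, and in practice the same logic would probably live in the load script or in SQL.

```python
from collections import Counter


def assign_partition_keys(record_ids, jvm_count, partitions_per_jvm=2):
    """Assign each record a partition key in 1..(jvm_count * partitions_per_jvm).

    Hypothetical helper: records are dealt out round-robin, so partition
    sizes never differ by more than one record.
    """
    partition_count = jvm_count * partitions_per_jvm
    return {
        record_id: (index % partition_count) + 1
        for index, record_id in enumerate(record_ids)
    }


# Example: 100,000 records and 16 JVMs give 32 partitions.
keys = assign_partition_keys(range(1, 100_001), jvm_count=16)
sizes = Counter(keys.values())
print(len(sizes), min(sizes.values()), max(sizes.values()))
```

Because the table is truncated and reloaded every night, the keys can simply be recomputed on each load rather than maintained incrementally.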
The question is, how does Pega's data flow rule (with a report definition source) handle the distribution of load across our JVMs if we pick this sequence number as the partition key for the source component? Will Pega know to use the SequenceNumber (partition key) defined as an integer to evenly distribute the record processing within the data flow to each node/JVM?
In our dev and test environments we have a single node, which makes it harder to validate the performance improvement from defining a partition key. We would need to move to our non-functional testing environment (multi-node) in order to test this, and we want to know as many details as possible before proceeding.
The table columns will remain the same; we will just truncate and reload the columns (and rebuild the partition key/sequence number).
For each record returned by 'select distinct partitionKey from table', Pega creates an assignment that is scheduled to process the set of customer records for that key ('select * from table where partitionkey = ?'). Each thread takes an assignment and processes it. Once a thread completes an assignment, it checks for the next assignment, until no assignments remain.
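The behavior described above can be simulated with a small sketch. This is not Pega's implementation, only a model of the description: one assignment per distinct partition key, and worker threads that keep pulling assignments until none are left. The function `run_partitioned` and the in-memory record list are hypothetical stand-ins for the data flow and the database table.

```python
import queue
import threading
from collections import defaultdict


def run_partitioned(records, worker_count):
    """Process (partition_key, payload) records with one assignment per key."""
    # Stand-in for the table, grouped by partition key.
    by_key = defaultdict(list)
    for key, payload in records:
        by_key[key].append(payload)

    # Equivalent of 'select distinct partitionKey from table':
    # one assignment is queued per distinct key.
    assignments = queue.Queue()
    for key in sorted(by_key):
        assignments.put(key)

    processed = defaultdict(list)  # worker name -> [(key, record count), ...]
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                key = assignments.get_nowait()
            except queue.Empty:
                return  # no more assignments, so the thread finishes
            # Equivalent of 'select * from table where partitionkey = ?'.
            batch = by_key[key]
            with lock:
                processed[name].append((key, len(batch)))

    threads = [
        threading.Thread(target=worker, args=(f"thread-{i}",))
        for i in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return dict(processed)


# 80 records spread over 8 partition keys, processed by 3 threads.
records = [(i % 8 + 1, i) for i in range(80)]
result = run_partitioned(records, worker_count=3)
```

Under this model the partition key does not map a partition to a particular node. It only defines the unit of work, so the number and evenness of partitions determine how well the available threads stay busy.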
As we previously discussed, the partition key functions as you noted. However, the number of partitions to define should be investigated. What should the range of numbers be, based on the server hardware (e.g. does it depend on the number of JVMs/nodes, CPU, memory, etc.)?