In the project I'm working on, we are trying to perform large-scale what-if simulations. We have almost 1,000 propositions and 1 million customers.
If we run a single-node simulation for 100,000 customers to see what would happen if some changes go live, the full simulation takes close to 10 hours.
We have tried to configure our pre-prod environment to run multi-node simulations following the instructions from the Pega training course "Decisioning Simulations for System Architects 7.1" and also the "DSM Reference Guide 7.1.7", but something must be missing or wrong, because run time has increased dramatically (a full run would currently take close to 2 or 3 days).
Our pre-prod environment has 2 servers with 2 nodes each.
First of all, I created a new "ProcessBatchJob" agent in my application (as in the image below) and verified in SMA that it was running:
After that, I modified the Topology settings as shown in the first image (note: I tried with 2 threads per node and there was no meaningful performance difference).
Then, following the training course instructions, I created a new numeric column named "PartitionKey" in the customer database table and set it to random values between 1 and 10. Then I created the property in the customer Pega class and re-mapped the class to the database table.
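For reference, the column and its random values can be created with SQL along these lines (a sketch assuming Oracle, since AWR reports are mentioned later in this thread, and a hypothetical table name CUSTOMER_DATA; substitute your actual customer table and column sizing):

```sql
-- Hypothetical table name; replace with your actual customer table.
ALTER TABLE CUSTOMER_DATA ADD (PARTITIONKEY NUMBER(2));

-- Assign each customer a random partition between 1 and 10.
-- DBMS_RANDOM.VALUE(1, 11) returns a value in [1, 11), so TRUNC
-- yields integers 1 through 10.
UPDATE CUSTOMER_DATA
   SET PARTITIONKEY = TRUNC(DBMS_RANDOM.VALUE(1, 11));
COMMIT;
```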
And lastly, I modified the Input Definition, setting the customer's PartitionKey property in the Partition Key field under Distributed Runs, and also modified the Report Definition to add the "Partition Key Parameter".
Do you have any idea what could be happening? Or let me know if you need more information to clarify anything.
The setup looks fine to me. Can you please send the Batch Progress screenshot from the Simulation History?
Run the simulation for a smaller data set of 5,000 records (the default batch size is 250 records × 4 nodes, so we should get stats for 5 fetches), then inspect the simulation history. Refer to the screenshot below.
Here it shows the speed at various stages:
1. IH Read Speed - the speed of fetching records from IH fact responses.
2. Read Speed - the speed of fetching records from the customer table. This also includes the time to fetch "Additional Embedded Pages" from associated tables; this is enabled in the Input Definition.
3. Execution Speed - the speed of executing the strategy.
4. Write Speed - the speed of writing the strategy output to the output table defined in the Output Definition.
Try comparing these statistics between the single-node and multi-node setups to check where the slowness is occurring.
You can switch between single-node and multi-node by using an Input Definition and Report Definition that don't use the partition key.
Apologies, but I couldn't repeat the test and reply to you sooner.
Find below the screenshot of the multi-node test results (with 5,000 records):
And here are the test results in single-node:
At the moment I'm not able to tell whether something is wrong, or whether those are appropriate values.
Let me share new data we have gathered. We ran, once again, a simulation for 100,000 records and noticed that execution proceeds at roughly 250 records every 15 minutes. That is, we click the "Re-execute" button, and no matter how many times we refresh the landing page or the database output table, nothing happens for about 15 minutes; then 250 records appear on the landing page and 250 results in the table, and then it takes another 15 minutes until new data appears. Does that explanation make sense? I'm not sure I'm describing what is happening properly.
Find here another screenshot, related to this execution with 100,000 records, stopped after approximately two and a half hours.
In the customer database table we have a unique index on the customer ID property (the primary key), and no other indexes.
Before starting the simulation, the process performs a delete operation for that particular simulation work ID to clear existing output. For 100,000 records, that may take some time. If this table was created by the Output Definition, it will have an index on the pyWorkID column; please cross-check that.
If you have only one simulation using the underlying output table, then you can manually truncate the table before starting a simulation run; this eliminates the time taken by the delete operation.
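Both checks can be done in SQL; a sketch assuming Oracle and a hypothetical output table name SIM_OUTPUT (substitute the actual table your Output Definition writes to):

```sql
-- Verify that an index covering PYWORKID exists on the output table,
-- using the Oracle data dictionary.
SELECT index_name, column_name
  FROM user_ind_columns
 WHERE table_name = 'SIM_OUTPUT'       -- hypothetical output table name
   AND column_name = 'PYWORKID';

-- If only one simulation uses this table, clear it up front instead of
-- letting the run perform a slow per-work-ID delete.
TRUNCATE TABLE SIM_OUTPUT;
```

Only use the truncate if no other simulation's results live in that table, since it removes all rows.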
However, this delete is a one-time operation at the beginning. From the statistics, the read speed for both IH and customer data (which also includes fetching additional embedded pages) has dropped drastically compared with the 5,000-record run. Have you checked and compared the AWR reports (assuming this is Oracle)? Was the database performing poorly when this test was run?
On a general note, in all of the tests (single-node and multi-node) I noticed that the execution speed is slow, about 1 rec/sec, which brings the average speed down. The execution speed denotes the speed at which the strategy is executed. You will need to profile the strategy by running it in an Interaction rule and see whether you can identify which component of the strategy is taking the most time.