How to effectively load Customer Spine data with Cassandra/DDS in Pega Marketing (Cloud)
We are using Pega Platform 8.1.5 (PegaCloud) with Pega Marketing 8.1.5
In our current setup we are using the following repositories in order to store Customer data:
Customer DDS table (internal Cassandra) - contains 100% of the Customer data
Customer DB table (Postgress) - simply contains the CustomerID field (we do this due the limitation on Pega Marketing Segments which require a relational database table). We might add new fields if required for segmentation.
The (simplified) load process is as follows:
We get a Customer file in JSON format as our input. It contains a flat structure of customer attributes.
Using a combination of a dataflow and several datasets, we:
truncate the Customer DB table
truncate the Customer DDS table
Load the JSON file into Pega
We then write the Customer properties into:
the Customer DB table
the Customer DDS table
So, basically, we do a truncate and an insert.
This works for us because is a simple data load process using dataflows/datasets which provides a flexible integration point (the input JSON file can be modified to contain new attributes which are transparently loaded into Pega).
At this point, the file is getting pretty big (and it will get bigger) and it takes about 40 mins to load.
The new requirement is to have 24/7 data availability.
We are considering 2 options in order to achieve it:
OPTION 1: Changing the load process from "full" to "incremental".
This is the preferred option right now since we believe it would achieve 24/7 availability.
Basically, instead of doing a truncate/insert, we would to never truncate and would do an insert/update.
We believe this is easily achievable by configuring the datasets appropriately:
The DB dataset as "Insert new and overwrite existing records"
The DDS dataset does an insert/update by default.
The question for this option is:
Assuming we are loading several million records to DDS/Cassandra, taking approx 1 hour: What is the impact if records are being retrieved from Cassandra to serve the inbound channel whilst the data load is in progress.
any other issues to watch for?
[Note: As suggested by Kevin in the comments below, we will investigate the option of using partitions and splitting the input file into several to take advantage of parallel data flow runs.]
OPTION 2: Introducing the concept of an staging area.
We think this option would get us close to 24/7 availability, but not to 100%.
We could optimize the process by introducing the concept of a typical "staging area".
If we were doing this with old fashion ETL, we would load the data into a Customer DB staging table, and once the load is complete, we would simply rename the tables.
We are trying to do something similar with datasets, something like:
Load the input JSON file into Pega using a File DataSet - no changes here
Write the data (Customer IDs only) into a staging Customer DB table - change, but no issues here
Write the data (100% of the data) into a staging Customer DDS - change, but no issues here
Instead of renaming the Customer DB table (and considering that we are on PegaCloud and don’t have full DB control), we are thinking about creating a dataflow which as follows:
Source component: the staging Customer DB table
Destination component: the "final" Customer DB table
Instead of renaming the Customer DDS, we are thinking about creating a dataflow which as follows:
Source component: the staging Customer DDS
Destination component: the "final" Customer DDS
So, essentially, we would achieve the same as in a typical ETL world, but moving the data from the staging area to the "final" area still takes a few minutes.
Do you guys have any suggestions? Do these approaches make sense? Any suggestions to improve the process are highly appreciated.
one thought regarding your initial file dataset loading process, you should consider using partitions if possible (versus using one giant file) as distinct files can be processed in parallel in dataflow distribution runs. This potentially can shorten the initial loading time.
I've done some tests as you suggested. I split the one giant file into chunks of 100,000 records each so that each of the files can be processed in parallel in dataflow distribution runs. Since we have 3 dataflow nodes, the load time is now around one third of the original time.
Thanks a lot for this suggestion. It really helps.