Data Analytics Blog: Parallelization of ETL Workflows

One of the most often overlooked ETL architecture design techniques is to ensure that all available hardware resources, in particular CPUs, are utilized. Consider the sample ETL workflow in the figure below. In this workflow, all ETL tasks are scheduled to be executed in series, i.e. one after another. In general, execution tasks only utilize one CPU (with some exceptions). Thus, executing one task after another would only utilize a single CPU, no matter how many CPUs are available.

Thus, in order to utilize all available CPUs on a server, ETL tasks should be scheduled to execute in parallel. Scheduling a highly parallelized ETL workflow requires some planning as all ETL task dependencies need to be understood first. For example, if DimTable1 and DimTable2 are dimension tables for FactTable1, then DimTable1 and DimTable2 need to be loaded first before FactTable1 can be loaded. In addition, if there are no dependencies between DimTable1 and DimTable2, then we can execute their specific ETL processes in parallel.

The output of this dependency analysis is an ETL Task Execution Order Map, as outlined in the figure below. The ETL Task Execution Order Map will then be the template for the ETL workflow layout. As each task will utilize an available CPU, this approach will be able to utilize multiple (if not all) CPUs and provide better performance.

One word of caution, though: The number of parallel scheduled tasks should not significantly exceed the number of CPUs on your server. If there are many more concurrent tasks than CPUs, then the overhead of the OS scheduler switching jobs in and out of CPU time may actually decrease performance.

To avoid this problem, many ETL tools (such as Informatica Power Center) provide a global setting that limits the number of concurrently processed tasks regardless of how many tasks are scheduled to run in parallel. Let us call this setting N. As a general rule, N should be set to the number of CPU (cores) of the server. In this case, one should schedule as many tasks in parallel as outlined in the ETL Task Execution Order Map. This option is very useful in environments with different number of CPUs in the development, test, and production hardware as we do not have to create workflows based on the underlying hardware constraints. In this case, we just set N in each environment to the respective number of CPUs.

2 comments:

rnm1978May 6, 2011 at 5:59 AM
Hi,

Database parallelism could also be used to improve resource usage, particularly on later stages of ETL such as stats gathering.
As with parallel workflows, you have to be careful not to flood the server's resources by using too high a degree of database parallelism.
GuidentMay 27, 2011 at 7:30 AM
That is correct. However, if the ETL server is hosted on its own physical server, database parallelism will not help with increasing the CPU utilization on the ETL server (even though it will increase CPU utilization on the DB server, of course).

Thank you for your feedback.
Best regards,

Guident

Friday, April 29, 2011

Parallelization of ETL Workflows

2 comments: