Data Analytics Blog: Oracle Data Integrator

Showing posts with label Oracle Data Integrator. Show all posts

Friday, April 29, 2011

Parallelization of ETL Workflows

One of the most often overlooked ETL architecture design techniques is to ensure that all available hardware resources, in particular CPUs, are utilized. Consider the sample ETL workflow in the figure below. In this workflow, all ETL tasks are scheduled to be executed in series, i.e. one after another. In general, execution tasks only utilize one CPU (with some exceptions). Thus, executing one task after another would only utilize a single CPU, no matter how many CPUs are available.

Thus, in order to utilize all available CPUs on a server, ETL tasks should be scheduled to execute in parallel. Scheduling a highly parallelized ETL workflow requires some planning as all ETL task dependencies need to be understood first. For example, if DimTable1 and DimTable2 are dimension tables for FactTable1, then DimTable1 and DimTable2 need to be loaded first before FactTable1 can be loaded. In addition, if there are no dependencies between DimTable1 and DimTable2, then we can execute their specific ETL processes in parallel.

The output of this dependency analysis is an ETL Task Execution Order Map, as outlined in the figure below. The ETL Task Execution Order Map will then be the template for the ETL workflow layout. As each task will utilize an available CPU, this approach will be able to utilize multiple (if not all) CPUs and provide better performance.

One word of caution, though: The number of parallel scheduled tasks should not significantly exceed the number of CPUs on your server. If there are many more concurrent tasks than CPUs, then the overhead of the OS scheduler switching jobs in and out of CPU time may actually decrease performance.

To avoid this problem, many ETL tools (such as Informatica Power Center) provide a global setting that limits the number of concurrently processed tasks regardless of how many tasks are scheduled to run in parallel. Let us call this setting N. As a general rule, N should be set to the number of CPU (cores) of the server. In this case, one should schedule as many tasks in parallel as outlined in the ETL Task Execution Order Map. This option is very useful in environments with different number of CPUs in the development, test, and production hardware as we do not have to create workflows based on the underlying hardware constraints. In this case, we just set N in each environment to the respective number of CPUs.

Tuesday, August 3, 2010

Oracle Analytic Functions in ODI

Oracle Analytic functions are a great way to write efficient, complex SQL statements. Instead of having to write multiple joins and subqueries you can write a similar statement in just one line. This is a great time saver especially when using a tool such as Oracle Data Integrator (ODI), which makes it difficult to do subqueries. Unfortunately, ODI’s knowledge modules do not support all analytic function out of the box. The problem is when ODI sees the SUM keyword it automatically triggers the use of the GROUP BY and HAVING clause regardless if is a regular SUM or analytical query SUM. If you ever tried using such a function in ODI you probably received “ORA-00979: not a GROUP BY expression”.

With just a few lines of code you can easily implement a solution to fix this issue:

1) Navigate to the KM you wish to customize to use analytic functions (can be either LKM or IKM).

2) Create a new KM Option

3) Open the knowledge module and navigate to the Details tab, “Load Data” step (or “Insert flow into I$ table” step for IKM).

4) In the Definition tab look for the lines of code that contains (either Command on Target or Source):

<%=snpRef.getGrpBy()%>
<%=snpRef.getHaving()%>

And replace it with the following

<% if (odiRef.getOption("USE_ANALYTIC_FUNCTION").equals("0")){
out.print(odiRef.getGrpBy());
out.print(odiRef.getHaving());
} else
{
out.print("--Group by functions are suppresses by KM");
}
%>

5) Click the option tab and be sure to check your option name (USE_ANALYTIC_FUNCTION). Click Okay to complete.

6) When creating your interface and choosing your KM, in the flow tab you now have the ability to select the user defined USE_ANALYTYIC_FUNCTION.

The USE_ANALYIC_FUNTION option works by suppressing the GROUP BY and HAVING clause of the query when the value of Yes is selected. Because the GROUP BY will not be used you can use any analytic function you like.

Friday, February 19, 2010

Creating an ODI (Oracle Data Integrator) Dashboard with OBIEE

One of our clients needed a simple way to view ETL success and failures statistics in an easy to navigate monitoring tool. The ODI Operator is great for technical users who have ODI installed but can be complex for business users. Operator can also become slow due to the large number of interfaces that run each night. Exposing the ODI work schema to OBIEE was a logical solution to this problem. We used OBIEE as a way to report and graph ODI ETL executions.

It is also worth noting that having good naming conventions can really help with navigating the ETL repository. For example all of our interfaces are prefixed with INF_%, and Packages with PKG_%. We also have naming conventions for loading Staging, Dimensions, and Facts, etc.

Below are the steps to bring ODI ETL statistics into OBIEE

Metadata

Connect to ODI Work Repository to an OBIEE Subject Area and Expose SNP_STEP_REPORT and SNP_EXP_TXT tables to metadata.

Once the physical schema is brought in you will want to rename the columns to something more meaningful. The only column that needs further definition is the decode statement for STEP_STATUS (Status). These can also be found in the ODI Documentation Substitution Methods Reference.

'D' = Success'
'E' = 'Error'
'Q' = 'Queued'
'W' = 'Waiting'
'M' = 'Warning'

Join SNP_STEP_REPORT and SNP_EXP_TEXT on I_TXT_STEP_MESS = I_TXT. This is the join for the error message text and will display the corresponding error message with the user clicks on the error id.

Dashboard

We created a dashboard for our client with the following elements.

Last ETL Run date
Past 5 Days execution time
Number of Errors
Past 24 hour execution log

Reports

Past 5 Days Execution Time (in Hours) - High level ETL execution time.

Sum Duration / 3600 and group by date.

All Executions in Past 24 Hours (with drill-through)

View every scenario that ran in the past 24 hours.
Conditional Formatting

WHEN Status = Success then Green Background
WHEN Status <> Success then Red Background

Drill Through

Interface name to display all executions for this Interface – Drills to All Executions Per Interface
Error ID to display any error message for this interface - Drills to Interfaces Error Message

All executions per Interface

This is used so you can see all the executions over its life for a specific scenario, particularly useful to compare with previous runtimes.
Drills from Past 24 hour log based on scenario name