Friday, November 12, 2010

Identifying Source System Data Changes for Incremental ETL Processes

Problem: When designing incremental ETL processes, ETL architects face the challenge of choosing a method for identifying data changes in the source system between ETL runs. Some of the options that might be available are (from best case to worst):

  1. Database Log Readers: This approach utilizes an ETL tool that is capable of reading the source database log files to identify inserted and updated records. For example, Informatica Power Center supports this through its Change Data Capture (CDC) functionality. However, source system owners may not be willing to grant read access to the database logs or an ETL tool that supports this functionality may not be available.
  2. Timestamp columns in the source database: If the source system maintains an insert and update timestamp column for each table of interest, then the ETL process can use these columns to identify source system changes since the last ETL execution timestamp (a minimal sketch of this pattern follows the list). Chances are, however, that the source system does not provide this functionality.
  3. Triggers to populate log tables: This is by far the worst option since it adds a significant resource utilization burden to the source system. In this case, triggers are created for all tables of interest. The purpose of these triggers is to capture all inserts, updates, and deletes into log tables. The ETL process then reads the data changes from the log tables and removes all records that it has successfully processed. Again, source system owners will most likely be very hesitant to support this approach.
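
For illustration only, a timestamp-based extraction (option 2) might look like the sketch below. The last_update_ts column and the etl_control table are hypothetical; the actual source system would need to provide equivalent structures.

-- Sketch of option 2, assuming a hypothetical last_update_ts column on the
-- source table and a hypothetical etl_control table that stores the timestamp
-- of the last successful run for each ETL process.
select *
from source_table_a
where last_update_ts > (select last_successful_run_ts
                        from etl_control
                        where process_name = 'load_source_table_a')
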
What to do if none of these options are available?

Solution: We propose the following checksum-based approach. In this blog, we will use SQL Server’s CHECKSUM function; however, Oracle’s ORA_HASH can be used in a similar fashion.

This approach requires that the entire source table (only columns and rows of interest, of course) be loaded into a staging table. During the staging load, the ETL process will assign a checksum value to each record. For example, when loading data from SOURCE_TABLE_A into STAGING_TABLE_A, the SQL would look something like this:
insert into staging_table_a (col1, col2, col3, business_key, check_sum)
select col1, col2, col3, business_key, checksum(col1, col2, col3, business_key)
from source_table_a
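
For completeness, a minimal definition of the staging table might look like the sketch below. The column data types are assumptions (they are not specified above); note that SQL Server’s CHECKSUM returns an int.

-- Hypothetical staging table definition; column types are assumptions
create table staging_table_a (
    col1         varchar(100),
    col2         varchar(100),
    col3         varchar(100),
    business_key int not null,
    check_sum    int not null  -- CHECKSUM() returns an int
)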

Let us further assume that business_key is the primary key of the source system record. In other words, business_key uniquely identifies a record in the source system.

Both business_key and check_sum must be stored in the corresponding dimension table. In our example, the dimension table for source_table_a would include a surrogate dimension key (dim_key), the business_key, and the check_sum. For performance reasons, we recommend creating a composite index on business_key and check_sum.
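
A minimal sketch of such a dimension table and the recommended composite index is shown below; the column data types, the identity surrogate key, and the index name are assumptions.

-- Hypothetical dimension table; types and names are assumptions
create table dimension_table_a (
    dim_key      int identity(1,1) primary key,  -- surrogate dimension key
    business_key int not null,                   -- natural key from the source system
    check_sum    int not null,                   -- checksum captured during the staging load
    col1         varchar(100),
    col2         varchar(100),
    col3         varchar(100)
)

-- Composite index to support the lookups below
create index ix_dim_a_business_key_check_sum
    on dimension_table_a (business_key, check_sum)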

In order to identify new records that were inserted into the source system since the last ETL run, we have to find all business_keys in the staging table that have no corresponding business_key in the dimension table. The SQL code would look something like this:

select s.*
from staging_table_a s
where not exists (select *
                  from dimension_table_a d
                  where d.business_key = s.business_key)

To identify updated records since the last ETL run, we have to find all records in the staging table that have a matching business_key in the dimension table with a different check_sum value. Here is the SQL code for this:

select s.*
from staging_table_a s
inner join dimension_table_a d
    on d.business_key = s.business_key
   and d.check_sum <> s.check_sum

In all cases, the joins against the dimension table are resolved through index lookups because of the composite index on business_key and check_sum. Therefore, identifying new or updated records is quite efficient. The drawback of this approach is that it requires a full load of the source data into the staging area, which may not be feasible for very large source systems.
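
How the detected changes are applied back to the dimension is not covered above; as one illustration, assuming a simple Type 1 dimension where changed attributes are overwritten, the apply step could look like the following sketch. It also refreshes the stored check_sum so that the next run compares against the current source state.

-- Sketch only: apply detected changes, assuming a Type 1 (overwrite) dimension
-- Insert rows whose business_key is not yet present in the dimension
insert into dimension_table_a (business_key, check_sum, col1, col2, col3)
select s.business_key, s.check_sum, s.col1, s.col2, s.col3
from staging_table_a s
where not exists (select *
                  from dimension_table_a d
                  where d.business_key = s.business_key)

-- Overwrite rows whose checksum has changed and refresh the stored check_sum
update d
set d.check_sum = s.check_sum,
    d.col1 = s.col1,
    d.col2 = s.col2,
    d.col3 = s.col3
from dimension_table_a d
inner join staging_table_a s
    on d.business_key = s.business_key
   and d.check_sum <> s.check_sum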

One of the major benefits of this approach is its immunity against getting out of sync with the source system due to aborted or failed ETL processes. No matter at what point the previous ETL run failed, this approach will always correctly identify source system changes and re-sync without any additional human intervention.

In summary, the checksum-based approach may be a feasible alternative for environments that offer no other means of identifying data changes in the source system.

Please contact us if you have any questions.