Friday, June 11, 2010

Redundancy in the BI Data Model

Recently, an experienced database professional who had just started his first business intelligence (BI) project asked me two questions:
  1. Is data redundancy allowed in a BI data model?
  2. How much normalization is standard practice in BI, if any?
I had no hesitation answering the first question. Yes, absolutely: data redundancy is not only allowed but recommended in many situations in BI data models. Redundancy is the key to simple BI data models and fast query response. The rules of normalization, which minimize data redundancy, were designed with transaction processing systems in mind, at a time when computer resources were scarce and expensive and data storage devices had limited capacity and slow I/O speeds.

One of the primary goals of normalizing to eliminate redundancy was to ensure data consistency. You didn't want to capture the same data at multiple entry points, since that meant people retyping what should be identical data; in practice it often wasn't identical, because of typos and variations in abbreviations, nicknames, and so on. Second, if the data changed and the model contained redundancy, you had to go back and update multiple records in many tables, which is not necessarily easy to program and manage. Third normal form data models eliminate these problems and store data efficiently, but not without a price: the proliferation of tables means queries have to join many of them. That is no big deal for transaction processing, because individual transactions insert or update only a handful of rows in each table and typically use procedural code to do so.
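The update anomaly described above can be sketched in a few lines of Python. This is a toy illustration, not from the original post; the customer and order records are hypothetical.

```python
# Denormalized: the customer's name is repeated on every order row,
# so renaming the customer means touching every redundant copy.
orders_denorm = [
    {"order_id": 1, "customer": "Acme Corp", "amount": 100},
    {"order_id": 2, "customer": "Acme Corp", "amount": 250},
]
for row in orders_denorm:
    if row["customer"] == "Acme Corp":
        row["customer"] = "Acme Corporation"

# Normalized (third-normal-form style): the name lives in exactly
# one place, so a rename is a single update.
customers = {42: {"name": "Acme Corp"}}
orders_norm = [
    {"order_id": 1, "customer_id": 42, "amount": 100},
    {"order_id": 2, "customer_id": 42, "amount": 250},
]
customers[42]["name"] = "Acme Corporation"
```

The normalized version trades the single-record update for an extra join at query time, which is exactly the trade-off the paragraph describes.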

With BI data models we don't care about capturing data; that is the job of the source application. So long as the source did a good job of normalizing and capturing the data properly, the BI model does not need to repeat the normalization process to ensure good source data. Second, we are not supposed to update records in BI models, because data warehouses are meant to be static. We preserve point-in-time history, so we typically don't have to go back and update multiple occurrences of redundant data.

BI queries are very different from source transactions. Having to join many tables in a non-procedural SQL query carries a huge cost when queries touch hundreds of thousands or even millions of records, which is common in BI. Therefore, redundancy that eliminates table joins for runtime queries is a recommended practice in BI. Fewer tables in the model also make it easier for end users to understand the model and to write ad hoc queries. Dimensional data modeling, featuring the use of star schemas which may include redundancy, is the technique most frequently used to reduce the number of tables in the model.
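A minimal star-schema sketch in Python may make the join reduction concrete. The table and column names here are hypothetical, not from the post. The store dimension is deliberately denormalized: region is repeated for every store (redundancy), so a query needs only one fact-to-dimension lookup instead of a chain of joins through separate store, city, and region tables.

```python
# Dimension table: one denormalized row per store, keyed by surrogate key.
dim_store = {
    1: {"store": "Downtown", "city": "Boston", "region": "East"},
    2: {"store": "Airport", "city": "Denver", "region": "West"},
}

# Fact table: one row per sale, a foreign key plus a measure.
fact_sales = [
    {"store_key": 1, "amount": 120.0},
    {"store_key": 2, "amount": 75.0},
    {"store_key": 1, "amount": 60.0},
]

# Total sales by region: a single lookup per fact row, no snowflake joins.
totals = {}
for row in fact_sales:
    region = dim_store[row["store_key"]]["region"]
    totals[region] = totals.get(region, 0.0) + row["amount"]
# totals -> {"East": 180.0, "West": 75.0}
```

In a fully normalized model the region would live in its own table, and the same query would need two extra joins; the redundant region column removes them.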

Other examples of acceptable redundancy in BI databases include storing the same data in staging tables as well as production tables, and storing variations of the same data in summary tables at different levels of aggregation so that standard reports that frequently use the aggregated data run faster.
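The summary-table idea can be sketched as a one-time aggregation step, typically run during ETL. Again, the detail rows and the month granularity here are hypothetical examples, not from the post.

```python
from collections import defaultdict

# Detail fact rows (the large table reports would otherwise scan).
detail = [
    {"month": "2010-05", "amount": 10.0},
    {"month": "2010-05", "amount": 15.0},
    {"month": "2010-06", "amount": 20.0},
]

# Build the redundant monthly summary table once, during the load.
summary = defaultdict(float)
for row in detail:
    summary[row["month"]] += row["amount"]
summary = dict(summary)

# Standard reports now read the small summary table instead of
# re-aggregating the detail rows on every run.
```

The summary is pure redundancy (it can always be recomputed from detail), which is exactly why it is acceptable in a BI model and unacceptable in a transactional one.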

The answer to the second question is not so easy. There are two diametrically opposed schools of thought on data modeling for data warehousing. One school, associated with Bill Inmon, who is often called the father of data warehousing, believes that data warehouses should first acquire and store all data in non-redundant third normal form. Its followers hold that this is still required for good data management practices and do not believe that dimensional data models are robust enough for large data warehouses. However, since BI tools like Business Objects and MicroStrategy run best with dimensional models, once the data is safely stored in the third normal form warehouse, it is re-extracted and reloaded into redundant downstream dimensional data marts.

The other school of thought, associated with Ralph Kimball, who is one of the pioneers of dimensional data modeling, believes that dimensional models are perfectly capable of managing data of any size and complexity and are suitable for both data warehouses and data marts. Followers of this school avoid the extra effort of designing and maintaining two models (one third normal form and one downstream dimensional) and two sets of ETL jobs to load them. Consequently, they also typically deliver new BI projects with shorter development cycles.
