Tuesday, June 22, 2010

Enterprise Search vs. a Centralized Electronic Information Repository

Enterprise Findability: Leveraging Synergies between the Common Electronic Repository and Enterprise Search

This paper describes synergies that organizations can achieve if Enterprise Content Management (ECM) and Enterprise Search technologies are considered and implemented together.

Many organizations are required to identify, retain, and share mission-critical information efficiently. Historically, individual departments in an organization took responsibility for assuring appropriate retention and access. However, the increasing complexity of regulation means that this mission-critical information increasingly applies across entire organizations. Information in one department may be relevant to another department, as together they work to provide consolidated information to the external world.

A key challenge for an organization is to achieve the best use of the information assets in its repositories. The goals are to eliminate duplicate content, maximize its reuse, and assure that information is protected and accessible. These goals summarize the concept of findability. In essence, findability is the art and science of locating information in or about electronic documents. People want to find answers, not search for them. AIIM, the Association for Information and Image Management, says in a 2008 report that “effective Findability retrieves content in context. Therein lies the crux of Findability. It cannot be attained simply by search, even a powerful search.” (AIIM MarketIQ, 2008). Improving findability requires a cooperative strategy, achieved by combining complementary technologies and systems. Findability is critical to effective use of information at many organizations.



No organization today can afford to duplicate assets or investments, whether in enterprise software or knowledge assets developed by its workers. Savvy organizations instead are adopting Information Lifecycle Management (ILM) practices. These ILM practices are “based on aligning the business value of information to the most appropriate and cost effective infrastructure.” (SNIA, 2004). ILM practices recognize that multiple technologies are critical to attaining desired organizational outcomes.

Findability Efforts at a Large Government Agency: An Overview

In 2006, a large federal government agency recognized the critical role of its information and resources by creating a board to better coordinate IT investment. The board also initiated a set of enterprise-wide initiatives aimed at modernizing the agency's IT systems. Among these initiatives is the creation of a common electronic document repository, whose objective is to integrate individual repositories and contain the vast majority of documents created or received by the agency. This would:

1. Improve access to the content and its associated metadata, and


2. Facilitate reviewers’ and others’ ability to do their jobs effectively and efficiently.

More recently, the agency launched an Enterprise Search initiative to provide agency-wide searches of its information repositories, one of which would be the Common Electronic Repository.

These two unfolding projects position the agency to plan and implement them in concert to meet Information Lifecycle Management best practices: increase findability, with maximum effectiveness and minimum cost.

Here is how both support findability.

Findability: Concepts and Technologies

Because of their interlocking components, a variety of technologies can enhance findability. Organizations seeking to enhance findability should select the combination of technologies that best meets their needs. The agency has already determined that two critical enterprise technologies are needed to attain findability: a Common Electronic Repository and Enterprise Search. Together these can overcome an agency's findability challenges:

1. Multiple silos of information that segregate potentially useful content into individual repositories,


2. Multiple sets of metadata and terminology, making it challenging to identify all potentially relevant content,


3. Rapid growth in content that burdens storage and hinders implementing electronic record policies.

Attributes, properties, and metadata all refer to the same thing: information about, not inside, the content. Both a Centralized Electronic Repository and Enterprise Search will use metadata. By identifying the synergies of these two systems, organizations will enable employees to find and use what they need when they need it.


How the Centralized Electronic Repository Enhances Findability


A Common Electronic Repository increases findability by:


1. Providing a hierarchical folder structure that shows content groupings and relationships


2. Associating metadata with content, providing document context and enhancing internal search of the content


3. Supporting the setting of security levels and other access controls

As an ECM system, the Common Electronic Repository provides a hierarchical folder structure (or taxonomy) for content storage. By merely looking at this taxonomy, users can understand important content groupings and their relationships.

Another key feature of ECM systems for a Common Electronic Repository is their capability to associate metadata with content. Metadata adds context to the content, helping users better understand how, when, and why the content was created. For example, each piece of content in a Common Electronic Repository will have several common metadata attributes such as “Document Authors,” indicating whom to contact for more information.

Metadata can be designed to use controlled, predefined lists of keywords. A specific attribute such as "drug additive" could contain only one of a small set of values. When an attribute is constrained to a predefined list containing values like “Drug Evaluation and Research,” Enterprise Search will return more relevant results, and it will not need lists of synonymous names for that attribute.
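As a minimal sketch of how a controlled list constrains a metadata attribute, consider the following. The attribute name `responsible_office`, the placeholder values, and the function `validate_metadata` are hypothetical and exist only for illustration; only the value “Drug Evaluation and Research” comes from the text above.

```python
# Hypothetical controlled vocabulary for one metadata attribute.
# Only "Drug Evaluation and Research" comes from the text; the other
# values are placeholders, not real list entries.
ALLOWED_VALUES = {
    "responsible_office": {
        "Drug Evaluation and Research",
        "Placeholder Office B",
        "Placeholder Office C",
    }
}


def validate_metadata(attribute: str, value: str) -> str:
    """Return the value if it is in the controlled list, else raise ValueError."""
    allowed = ALLOWED_VALUES.get(attribute)
    if allowed is None:
        raise ValueError(f"Unknown attribute: {attribute!r}")
    if value not in allowed:
        raise ValueError(f"{value!r} is not an allowed value for {attribute!r}")
    return value
```

Because free-text variants are rejected when content is stored, a later search on the attribute can match on one exact value rather than guessing at abbreviations or alternate spellings.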

A Common Electronic Repository also supports the setting of security levels and other access controls. These also can provide context for the content. Content might be considered available for limited release, such as within a specific research group, or have constrained usage based on specific time periods. Access controls also reduce visual clutter, since users see only what they have rights to see, and they can change content only as policy permits.

In summary, a Common Electronic Repository will enhance findability. The system’s folder structure and metadata are shared. Folders provide additional relevant context. The system allows content to cross organizational boundaries, enhancing findability. The organization also establishes a shared understanding of the domain and its content.


How Enterprise Search Enhances Findability


Enterprise Search will also play an important role in findability. That is why enterprise search systems are among the first technologies organizations consider as they wrestle with findability challenges. The most basic enterprise search function is to generate indexes for content items. For example, search systems generate indexes of key words to search content. Search systems also provide relevance ranking. However, credible relevance ranking requires advanced Enterprise Search features. Incorporating these advanced features adds value to findability:

• Create and manage organization-specific thesauri. This helps a user searching for a specific word that is missing from documents of interest. Thesauri help the search system return all documents of interest by finding those containing words that mean the same as, but are spelled differently from, what the user searched for.


• Support term weighting. This identifies those terms that users might find more important than others, when all have similar meanings. Term weighting, combined with thesaurus support, enhances findability.


• Provide natural language processing. This allows Search to analyze content beyond merely identifying key words. For example, a document that contains the word “bush” could be analyzed to determine whether it was about a United States president or a type of vegetation.
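The thesaurus and term-weighting features above can be sketched roughly as follows. This is an illustration under stated assumptions, not how any particular search product works: the thesaurus entries, the weights, the sample documents, and the function names are all hypothetical.

```python
from collections import Counter

# Hypothetical organization-specific thesaurus: each term maps to synonyms.
THESAURUS = {
    "physician": {"doctor", "clinician"},
}

# Hypothetical term weights: the organization's preferred term counts most.
TERM_WEIGHTS = {"physician": 2.0, "doctor": 1.0, "clinician": 1.0}


def expand_query(term: str) -> set[str]:
    """Return the search term plus any synonyms listed in the thesaurus."""
    return {term} | THESAURUS.get(term, set())


def score(document: str, term: str) -> float:
    """Sum weighted occurrence counts of the term and its synonyms."""
    words = Counter(document.lower().split())
    return sum(words[t] * TERM_WEIGHTS.get(t, 1.0) for t in expand_query(term))


# Hypothetical documents; "a" never uses the word the user searched for.
docs = {
    "a": "the doctor reviewed the report",
    "b": "the physician and the clinician signed",
    "c": "no relevant words here",
}

# Rank matching documents for the query "physician", best match first.
ranked = sorted(
    (name for name in docs if score(docs[name], "physician") > 0),
    key=lambda name: score(docs[name], "physician"),
    reverse=True,
)
```

Document "a" is still returned even though it lacks the searched word, because the thesaurus supplies "doctor" as a synonym; the weights then decide the order among matching documents.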

Since a Common Electronic Repository will contain both internal and external content from large numbers of sources, the Enterprise Search system’s natural language support will help searchers sift through these different kinds of information.

Because Search systems work with indexes created from content throughout the enterprise, they can find relevant content no matter where it is stored. No navigation through a pre-set folder structure is needed. Such navigation requires choices which may not be intuitive when a user is not familiar with the domain.

In summary, Enterprise Search will play an important role in meeting an organization’s findability needs. Because an organization cannot pre-determine all relevant organizational structures, or other context for content, Enterprise Search will provide the opportunity to avoid dealing with specific folder structures, such as those in a Common Electronic Repository, and still find useful content.


A Common Electronic Repository Provides Value to Enterprise Search

One of the limitations of any enterprise search system is its brute-force nature. Search systems operate primarily on individual words, which by themselves are isolated from context. The result is that users often have to wade through large lists of search results to find what they really are looking for. An Enterprise Content Management system is a good source of context to add value to an Enterprise Search engine and can also reduce the length of those lists. A Common Electronic Repository can help organize search results by providing groups (“facets”) of Enterprise Search results. A good source of those facets is the Common Electronic Repository folder taxonomy.

Enterprise search systems can also use folder names to refine search results by allowing a search restricted to a particular branch in a folder hierarchy. Many search engines also allow advanced use of dictionaries and thesauri. Since every organization is unique, these dictionaries are generally not available “out-of-the-box” but instead must be built to reflect the organization’s vocabularies. However, a Common Electronic Repository folder structure could serve as an initial set of preferred terminology for Enterprise Search dictionaries, rather than requiring an organization to create that starter dictionary from scratch.
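A rough sketch of both ideas, faceting results by folder and restricting a search to one branch of the hierarchy, might look like this. The folder paths, document titles, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical search hits: (document title, repository folder path).
hits = [
    ("Inspection report 1", "/Inspections/Site Inspections"),
    ("Inspection report 2", "/Inspections/Site Inspections"),
    ("Label review", "/Reviews/Labeling"),
]


def facet_by_top_folder(results):
    """Group result titles by the first folder in their path."""
    facets = defaultdict(list)
    for title, path in results:
        top = path.strip("/").split("/")[0]
        facets[top].append(title)
    return dict(facets)


def restrict_to_branch(results, branch):
    """Keep only results stored at or below the given folder branch."""
    prefix = branch.rstrip("/") + "/"
    return [(t, p) for t, p in results if (p + "/").startswith(prefix)]
```

Calling `facet_by_top_folder(hits)` groups the three hits under "Inspections" and "Reviews", while `restrict_to_branch(hits, "/Reviews")` keeps only the labeling review.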

Enterprise Search can index metadata in a Common Electronic Repository to focus the types of searches available, again providing context to the content. The investment made adding rich metadata values to a Common Electronic Repository becomes immediately available to Enterprise Search. For example, a user might want to see content related to a specific drug, Lisinopril, but only when that document was written as part of a site inspection.

By making use of Common Electronic Repository metadata, an Enterprise Search query could say in effect “show me only those documents containing the word ‘Lisinopril’ which also have been tagged as a ‘site inspection’.”
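The query described above can be sketched as a keyword match combined with a metadata filter. This is a simplified illustration, not a real search engine API: the `Document` class, the `search` function, the attribute name `document_type`, and the sample content are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    text: str
    metadata: dict = field(default_factory=dict)


def search(documents, keyword, **required_metadata):
    """Return documents whose text contains the keyword and whose
    metadata matches every required attribute value."""
    keyword = keyword.lower()
    return [
        doc
        for doc in documents
        if keyword in doc.text.lower()
        and all(doc.metadata.get(k) == v for k, v in required_metadata.items())
    ]


# Hypothetical repository content; "document_type" is an assumed attribute name.
repository = [
    Document("Lisinopril stability findings", {"document_type": "site inspection"}),
    Document("Lisinopril labeling memo", {"document_type": "correspondence"}),
    Document("Facility sanitation notes", {"document_type": "site inspection"}),
]

# "Show me only documents containing 'Lisinopril' tagged as a 'site inspection'."
matches = search(repository, "Lisinopril", document_type="site inspection")
```

Only the first document satisfies both conditions; the keyword alone would have returned two documents, and the tag alone would have returned two different ones.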


Search provides value to ECM

Just as the Common Electronic Repository will add value to Enterprise Search, Enterprise Search can greatly enhance the value of a Common Electronic Repository. Like all ECM systems, a Common Electronic Repository provides structures to store and process content according to an organization’s business rules. However, a Common Electronic Repository can provide only rudimentary searching.

• Enterprise Search will provide richer searching than the basic search that is part of the Common Electronic Repository. By reusing metadata already describing content in the Common Electronic Repository, Enterprise Search can provide more relevant search results.

• By supporting dictionaries (such as lists of synonyms), Enterprise Search can provide additional ways to find content when the Common Electronic Repository folder names don’t match a user’s search query.

Enterprise Search will also provide a findability alternative to navigating Common Electronic Repository folders. Rich Enterprise Search features can even allow a searcher to influence the search process to create his or her own context, as opposed to the one represented by the single Common Electronic Repository folder structure.

Lastly, Enterprise Search will provide another important feature: search logs. Search logs provide a record of what search queries users ran. Search administrators can analyze these logs to show how content is used, and logs can even suggest possible changes to the Common Electronic Repository folder structure, metadata elements and values.


Leveraging the Synergies

To repeat, neither Enterprise Search nor a Common Electronic Repository alone can provide a complete findability solution. Implemented together, they not only support richer findability, they do so more efficiently than either by itself.

A Common Electronic Repository, with its pre-set folder structure, and Enterprise Search with its ability to cross storage locations, provide two different approaches to finding content. Both approaches will be valuable depending on each user’s particular needs. One person familiar with the Common Electronic Repository folders may find navigating its folders faster and more effective than using Enterprise Search, which might seem more “scattershot.” Another person, unfamiliar with the Common Electronic Repository, could prefer Enterprise Search for rapidly finding relevant content. For that user, navigating through unfamiliar folders and reviewing content within each folder might be cumbersome.

A key operational challenge for deploying any enterprise search system is building connections to various ECM systems and translating their metadata elements to those used in the Search system. Integrating most content into one repository, the Common Electronic Repository, reduces the number of bridges and maps for Enterprise Search. This in turn reduces initial implementation cost as well as ongoing maintenance costs. Failure to consolidate content into the Common Electronic Repository would increase costs as the number and size of island repositories increase. Enterprise Search system administrators would have to spend ever-increasing resources to maintain those ECM system bridges and maps. Over time, the result would be a babble of inconsistencies, reduced relevancy, and decreased confidence in the Enterprise Search system’s results.

Deploying both a Common Electronic Repository and an Enterprise Search system also reduces the costs of governance for each. A single set of centralized governance processes applied to Common Electronic Repository content and folder structures minimizes costs, since only one folder structure needs to be reviewed, updated, and managed. Enterprise Search governance costs also decrease, since metadata and the meaning of taxonomy nodes in the Common Electronic Repository are stable, predictable, and understood by Enterprise Search users.


Conclusions

Working together, a Common Electronic Repository and Enterprise Search achieve findability levels unavailable to either alone. Each system brings unique advantages to enhancing findability. Implementing both Enterprise Search and a Common Electronic Repository is critical to reducing costs, getting best use from technology investments, and achieving the level of findability that an organization's mission requires.

References

AIIM MarketIQ. (Q2 2008). “Findability: The Art and Science of Making Content Easy to Find.” http://www.aiim.org/Research/MarketIQ/Findability-7-16-08.aspx

SNIA: Storage Networking Industry Association. (2004). Information Lifecycle Management: A Vision for the Future. http://www.snia.org/forums/dmf/programs/ilmi/ilm_docs/SRC-Profile_ILM_Vision_3-29-04.pdf (accessed March 10, 2010).

For more information on these topics, go to http://www.guident.com/ or contact the author directly at rweiner@guident.com.

Friday, June 11, 2010

Redundancy in the BI Data Model

Recently, an experienced database professional who had just started his first business intelligence (BI) project asked me two questions:
  1. Is data redundancy allowed in a BI data model?
  2. How much normalization, if any, is industry standard in BI?
I had no hesitation answering the first question. Yes, absolutely, data redundancy is not only allowed but is recommended in many situations in BI data models. Redundancy is the key to simple BI data models and fast query response. The rules of normalization, which minimize data redundancy, were designed with transaction processing systems in mind and were also designed at a time when computer resources were scarce and expensive and data storage devices had limited capacity and slow I/O speeds.




One of the primary goals of normalizing to eliminate redundancy was to ensure data consistency. First, you didn't want to capture the same data at multiple entry points, since this meant extra effort by people typing in what should be the same data but often wasn't because of typos and variations in usage of abbreviations, nicknames, etc. Second, if the data changed and you had redundancy in the data model, you had to go back to update multiple records in many tables - not necessarily easy to program and manage. Data models in third normal form eliminate these problems and store data efficiently, but not without a price. The proliferation of tables in third normal form means queries have to join many tables. This is no big deal for transaction processing activity because individual transactions only insert or update a handful of rows in each table and typically use procedural code to do this.

With BI data models, the situation is different. First, we don't care about capturing data. That is the job of the source application. So long as the source did a good job of normalizing and capturing the data properly, the BI model does not need to repeat the normalization process to ensure good source data. Second, we are not supposed to update records in BI models - data warehouses are supposed to be static. We preserve point-in-time history, so we typically don't have to go back and make updates to multiple occurrences of redundant data.

BI queries are very different from source transactions. Having to join many tables in a non-procedural SQL query has a huge cost when you are talking about queries that touch hundreds of thousands or even millions of records, which is common for BI. Therefore, redundancy that eliminates table joins for runtime queries is a recommended practice in BI. Fewer tables in the model also make it easier for end users to understand the model and to write ad hoc queries. Dimensional data modeling, featuring the use of star schemas that may include redundancy, is the technique most frequently used to reduce the number of tables in the model.
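To make the trade-off concrete, here is a small sketch using Python's built-in sqlite3 module. All table names, column names, and rows are hypothetical. The same report is produced twice: once against normalized tables, which needs two joins, and once against a denormalized product dimension that stores the category name redundantly, which needs one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized source-style tables (hypothetical).
    CREATE TABLE category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, product_name TEXT,
                          category_id INTEGER REFERENCES category);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);

    INSERT INTO category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO product VALUES (10, 'Router', 1), (11, 'Switch', 1), (12, 'Editor', 2);
    INSERT INTO sale VALUES (100, 10, 50.0), (101, 11, 30.0), (102, 12, 20.0);

    -- Denormalized product dimension: category_name is stored redundantly
    -- on every product row, so report queries need one join instead of two.
    CREATE TABLE dim_product AS
        SELECT p.product_id, p.product_name, c.category_name
        FROM product p JOIN category c ON c.category_id = p.category_id;
    """
)

# Report against the normalized model: two joins.
normalized = conn.execute(
    """
    SELECT c.category_name, SUM(s.amount)
    FROM sale s
    JOIN product p ON p.product_id = s.product_id
    JOIN category c ON c.category_id = p.category_id
    GROUP BY c.category_name ORDER BY c.category_name
    """
).fetchall()

# Same report against the dimensional model: one join.
dimensional = conn.execute(
    """
    SELECT d.category_name, SUM(s.amount)
    FROM sale s
    JOIN dim_product d ON d.product_id = s.product_id
    GROUP BY d.category_name ORDER BY d.category_name
    """
).fetchall()
```

Both queries return the same totals per category. With three rows the difference is invisible, but each join removed matters more as the fact table grows, and the dimensional query is also the easier one for an end user to write.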

Other examples of acceptable redundancy in BI databases include having the same data stored in staging tables as well as production tables, and having variations of the same data stored in summary tables with different levels of aggregation so that standard reports that frequently use the aggregated data run faster.
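A summary table of this kind might be sketched as follows, again with Python's sqlite3 module. The table names, columns, and rows are hypothetical, and a real warehouse would refresh the summary as part of its load process rather than build it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical detail-level fact table.
    CREATE TABLE fact_sale (sale_date TEXT, region TEXT, amount REAL);
    INSERT INTO fact_sale VALUES
        ('2010-05-03', 'East', 10.0),
        ('2010-05-17', 'East', 15.0),
        ('2010-05-20', 'West', 5.0),
        ('2010-06-01', 'East', 20.0);

    -- Redundant summary table, pre-aggregated by month and region.
    CREATE TABLE agg_sale_month AS
        SELECT substr(sale_date, 1, 7) AS sale_month, region,
               SUM(amount) AS total_amount
        FROM fact_sale
        GROUP BY sale_month, region;
    """
)

# A standard monthly report reads the small summary table
# instead of re-aggregating the detail rows on every run.
monthly_report = conn.execute(
    "SELECT sale_month, region, total_amount FROM agg_sale_month "
    "ORDER BY sale_month, region"
).fetchall()
```

The summary table duplicates information already present in the detail table, which is exactly the kind of redundancy that normalization would forbid and that BI models accept in exchange for faster reports.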

The answer to the second question is not so easy. There are two diametrically opposed schools of thought on data modeling for data warehousing. One school, associated with Bill Inmon, who is often called the father of data warehousing, believes that data warehouses should first acquire and store all data in non-redundant third normal form. Its followers believe this is still required for good data management practices and do not believe that dimensional data models are robust enough for large data warehouses. However, since BI tools like Business Objects and MicroStrategy run best with dimensional models, once the data is safely stored in a third normal form warehouse, the model is extended with redundant downstream dimensional data marts that re-extract and reload data out of the data warehouse model into the data mart models.

The other school of thought, associated with Ralph Kimball, who is one of the pioneers of dimensional data modeling, believes that dimensional models are perfectly capable of managing data of any size and complexity and are suitable for data warehouses or data marts no matter their size. Followers of this school avoid the extra effort of designing and maintaining two models (one in third normal form and one downstream dimensional) and two ETL jobs to load the two models. Consequently, they also typically deliver new BI projects with shorter development cycles.