Data Analytics Blog: April 2011

One of the most often overlooked ETL architecture design techniques is to ensure that all available hardware resources, in particular CPUs, are utilized. Consider the sample ETL workflow in the figure below. In this workflow, all ETL tasks are scheduled to be executed in series, i.e. one after another. In general, execution tasks only utilize one CPU (with some exceptions). Thus, executing one task after another would only utilize a single CPU, no matter how many CPUs are available.

Thus, in order to utilize all available CPUs on a server, ETL tasks should be scheduled to execute in parallel. Scheduling a highly parallelized ETL workflow requires some planning as all ETL task dependencies need to be understood first. For example, if DimTable1 and DimTable2 are dimension tables for FactTable1, then DimTable1 and DimTable2 need to be loaded first before FactTable1 can be loaded. In addition, if there are no dependencies between DimTable1 and DimTable2, then we can execute their specific ETL processes in parallel.

The output of this dependency analysis is an ETL Task Execution Order Map, as outlined in the figure below. The ETL Task Execution Order Map will then be the template for the ETL workflow layout. As each task will utilize an available CPU, this approach will be able to utilize multiple (if not all) CPUs and provide better performance.

One word of caution, though: The number of parallel scheduled tasks should not significantly exceed the number of CPUs on your server. If there are many more concurrent tasks than CPUs, then the overhead of the OS scheduler switching jobs in and out of CPU time may actually decrease performance.

To avoid this problem, many ETL tools (such as Informatica Power Center) provide a global setting that limits the number of concurrently processed tasks regardless of how many tasks are scheduled to run in parallel. Let us call this setting N. As a general rule, N should be set to the number of CPU (cores) of the server. In this case, one should schedule as many tasks in parallel as outlined in the ETL Task Execution Order Map. This option is very useful in environments with different number of CPUs in the development, test, and production hardware as we do not have to create workflows based on the underlying hardware constraints. In this case, we just set N in each environment to the respective number of CPUs.

Introduction

Google Web Toolkit (GWT) is an open source development toolkit for building web applications. GWT allows developers to write sophisticated AJAX applications in Java and then compile the source in optimized JavaScript that runs across all browsers. While Java delivers the benefits of “write once, run anywhere”, GWT creates the possibility of “write once, integrate anywhere”.

GWT can be used to develop fully functional UI widgets that interact with back-end services. Most often a GWT widget needs to be integrated into an application. If the application is also developed in Java, the integration is a simple matter of importing the widget's source code or JAR in class path. But what if the application is built with a non-Java platform, such as .NET? It is important to keep in mind that GWT compiles Java code into JavaScript and JavaScript can run in any browser. Therefore GWT widgets, as compiled in JavaScript, can be integrated into any browser-based web application.

StockWatcher [1] is likely to be the first example for most GWT developers. It is used here to demonstrate how GWT widgets can be integrated into web applications regardless of platform.

Local Mode
It is assumed that the steps in [1] are followed to create StockWatcher with client-side only Java code. Since there is no server-side code, the "WEB-INF" folder can be skipped or safely removed after GWT compiler compiles. The remaining component only consists of a set of static files, including JavaScript, CSS and other HTML resources. A copy of these files can be included as local resources in any web application to integrate StockWatcher. This is defined as "local" integration mode.

For example, Application ABC just needs to include three lines in one of its HTML files to integrate StockWatcher. In fact, the host page can be any typical web file such as JSP, ASP or ColdFusion, as long as its source is rendered as HTML in the end.

The integration is self-contained and non-intrusive. When opened in any browser, StockWatcher becomes a fully functional section within the host page. Apart from the three lines above to integrate the StockWatcher widget, the rest of the host page can include any content. The example here only has the text "My Application". No matter what it is, GWT widgets will co-exist with the rest of the page side by side.

When the application is running, server output shows that StockWatcher communicates with the quote server to receive updates. The callback parameter indicates that the communication is in JavaScript Object Notation, with padding (JSONP) format. See the following server log.

Remote Mode In "Local" integration mode, applications obtain a copy of StockWatcher and refer to it as local resources. Unfortunately, this means that each application must update its local copy for every new release of StockWatcher. Just imagine if Google Maps were integrated in this way. An alternative is to host StockWatcher on a server and let all applications refer to it from this single remote entry point. This technique is defined as "Remote" integration mode. This mode centralizes and simplifies management of the widget.

The updated host page in My Application will refer to StockWatcher deployed on locahost at port 8080. Note that it uses the same three lines but different “href” and “src” values as compared to those of “Local” mode.

However, StockWatcher does not show up correctly (as expected) when opening the host page in the browser. A blank section is displayed instead. Close inspection finds the following lines in Firefox Error Console.

It turns out that cross-site behavior must be enabled for the communication to pass through. This is done by adding xs-linker in StockWatcher.gwt.xml.

Once StockWatcher is recompiled and redeployed, the host page opens up successfully in the browser and StockWatcher behaves the same as in "Local" mode. The quote server also shows the same output.

Summary
As shown, GWT widgets can be integrated in either "local" or "remote" mode into any web application. This can be achieved by developing client-side only Java code and using JSONP to overcome the Same Origin Policy (SOP) security constraint. In the case of "remote" mode, xs-linker needs to be added to enable cross-site behavior. Altogether, this proves to be an efficient way to reuse existing investments and opens the possibility to create some really interesting applications.

References

[1] StockWatcher, http://code.google.com/webtoolkit/doc/latest/tutorial/Xsite.html

[2] JSONP, http://en.wikipedia.org/wiki/JSON#JSONP

[3] SOP, JSONP and XS Linker, http://code.google.com/webtoolkit/doc/latest/FAQ_Server.html

Data Analytics Blog

Friday, April 29, 2011

Parallelization of ETL Workflows

Friday, April 8, 2011

Efficiently Integrate Google Web Toolkit (GWT) Widgets Across Multiple Platforms