Jan 6, 2014

ETL Application Comparison, Part 2

In this part of the ETL Application Comparison series, we are going to take a look at Pentaho Data Integration (PDI) 5.0.1, an ETL application that I have used several times in the past.

PDI is a workflow-based ETL application that is fairly simple and intuitive to use. It features a wide variety of steps, allowing one to read and write data from many different data sources, and when setting up database connections the user can choose from a long list of supported databases. There are many steps available for manipulating data and controlling the flow of rows between steps. One consequence, however, is that there are often several ways of accomplishing the same task using different steps.

There are a number of scripting steps that add a lot of power to the tool. The Java Expression step allows one to evaluate an arbitrary Java expression and assign its value to a field. The JavaScript step allows one to define a JavaScript program that runs once for every data row passing through the step. The most powerful is the Java Class step, which allows one to process the rows using Java code. Rows are read and written explicitly with Java statements, so incoming and outgoing rows need not correspond one-to-one: one can implement the step to output many rows for each incoming row, or to combine a number of incoming rows into one.
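To illustrate the per-row model, here is a minimal sketch in plain JavaScript of the kind of logic one might put in the JavaScript step. In PDI the script runs once per row with each incoming field exposed as a variable; the sketch simulates one such invocation per row, and all field names here are hypothetical.

```javascript
// Hedged sketch: simulating the per-row semantics of PDI's JavaScript
// step in plain JavaScript. processRow stands in for one invocation of
// the script on one data row; the field names are hypothetical.
function processRow(row) {
  return {
    // Derive new output fields from the incoming ones.
    full_name: row.first_name + " " + row.last_name,
    is_overdrawn: row.balance < 0
  };
}

var rows = [
  { first_name: "Ada", last_name: "Lovelace", balance: -12.5 },
  { first_name: "Alan", last_name: "Turing", balance: 100 }
];
var out = rows.map(processRow); // one output row per input row
```

In the real step, new fields such as `full_name` would also have to be declared in the step's output-fields grid before downstream steps can see them.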

Another useful feature of PDI is the Step Metrics view that is displayed while a transformation is running (or after it has finished). The Step Metrics report displays the number of rows passing into and out of each step, as well as each step's speed in rows/second. This helps to identify bottlenecks when a transformation takes too long to run.

I have built the data loading problem defined in the previous post of this series in PDI, and PDI handled the scenario with relative ease. Here is a screenshot of the PDI “transformation” that I built, followed by a very brief description.

The icons above are known as steps. Every step in a so-called “transformation” receives data rows from a previous step, one at a time, processes them, and passes the altered rows on to the following step(s). The Read Accounts step in the top-left corner reads all the CSV files, using a regular expression to find the file names in the input folder that it should read. The Account Found? step splits the stream of incoming rows into two branches, depending on whether the account already exists in the database. The Remove values step removes the fields that were created in the account-not-found branch, so that the rows of the two branches can be merged again at the Build file paths step. Finally, the Group by File Name step groups the incoming rows by their source file names, outputting only one row per file, and the Move Files step moves each file to the processed folder.
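The branch-and-merge flow described above can be sketched in plain JavaScript. This is only a simulation of the row flow, not PDI code: the account lookup, the temporary field, and the field names are all hypothetical stand-ins for the actual steps.

```javascript
// Hedged sketch of the transformation's row flow: branch on account
// existence, drop temporary fields before merging, then group rows by
// source file name. Lookup contents and field names are hypothetical.
var existingAccounts = { "ACC-1": true }; // stands in for Account Found?

function process(rows) {
  var merged = rows.map(function (row) {
    if (!existingAccounts[row.account]) {
      // Account-not-found branch: temporary fields are created here...
      row.tempId = "generated";
      // ...and must be removed (Remove values) before the merge point.
      delete row.tempId;
    }
    return row; // merged stream (Build file paths)
  });
  // Group by File Name: one output row per source file.
  var byFile = {};
  merged.forEach(function (r) { byFile[r.fileName] = r; });
  return Object.keys(byFile); // files to move to the processed folder
}

var files = process([
  { account: "ACC-1", fileName: "a.csv" },
  { account: "ACC-2", fileName: "a.csv" },
  { account: "ACC-3", fileName: "b.csv" }
]);
```

The key point the sketch mirrors is that both branches must carry identical field layouts at the merge point, which is why the Remove values step exists at all.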

Scheduling a PDI job or transformation to run on a periodic basis is not entirely trivial. You have to use the command-line tool Pan.bat (pan.sh on Unix-based systems) for transformations or Kitchen.bat (kitchen.sh) for jobs to achieve that (what we have built above is a transformation; a job consists of a number of entries that execute in sequence, and may include entries that invoke transformations). The recommended approach to scheduling is to write a simple .bat file that calls Kitchen with the necessary parameters and redirects the console output to a log file, and then schedule that .bat file to run using the Windows Task Scheduler.
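A minimal wrapper along those lines might look as follows (shown here as the Unix shell variant; the .bat equivalent calls Kitchen.bat instead). Every path and the job file name are hypothetical, and the flags should be checked against your PDI version's Kitchen documentation.

```shell
#!/bin/sh
# Hedged sketch of a wrapper script for scheduling a PDI job.
# All paths and the job file name below are hypothetical.
PDI_HOME=/opt/pentaho/data-integration
JOB=/etc/etl/load_accounts.kjb
LOG=/var/log/etl/load_accounts.log

# -file names the job to run; -level controls logging verbosity.
# Console output (stdout and stderr) is appended to the log file.
"$PDI_HOME/kitchen.sh" -file="$JOB" -level=Basic >> "$LOG" 2>&1
```

The script could then be registered with cron on Unix-like systems, or the .bat variant with the Windows Task Scheduler, to run at the desired interval.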

In conclusion, implementing the data loading problem in PDI was fairly straightforward. The biggest catch for an inexperienced user of the application is probably that any new, temporary fields defined in a branch need to be removed before the streams can be merged together again. Scheduling the process is doable, but could have been made easier.
