Jan 27, 2014

ETL Application Comparison, Part 5


The application that we are evaluating in this post is Jaspersoft ETL 5.2. This application provides a workflow-based designer similar to PDI, offering a large variety of components to use. The interesting aspect of Jaspersoft ETL is that the application dynamically generates procedural Java code from the visual design. One can view the Java code, but not edit it directly. The generated code is verbose and ugly, but viewing it provides transparency and can help one understand what is happening when problems arise or when something is not working as expected.


Most of the properties that can be configured on the components take their values from Java expressions. This is similar to the way that Linx works with C# expressions, and the consistency of this approach is appealing. The expressions are substituted directly into the generated code. There is no expression editor, which is fine for users familiar with Java but unforgiving for users with little or no Java experience.

It seems that Jaspersoft ETL does not feature specialized data-manipulation components the way PDI does. Data manipulation tasks are typically done through the Java-Row component, which allows one to write any block of Java code.
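To give an idea of what that involves, here is a rough, standalone Java sketch of the kind of per-row logic one might type into such a component. The AccountRow class and its field names are invented for illustration; in the real component you work with the input and output row variables that the generated code exposes.

import java.util.UUID;

// Standalone sketch of per-row logic for a Java-Row style component.
// The AccountRow class and its fields are invented for illustration only.
public class JavaRowSketch {

    static class AccountRow {
        String accountNumber;
        String accountName;
        String id;
    }

    // Transform one incoming row into one outgoing row.
    static AccountRow processRow(AccountRow in) {
        AccountRow out = new AccountRow();
        out.accountNumber = in.accountNumber.trim();
        out.accountName = in.accountName.trim().toUpperCase();
        out.id = UUID.randomUUID().toString(); // assign a new surrogate key
        return out;
    }

    public static void main(String[] args) {
        AccountRow in = new AccountRow();
        in.accountNumber = " 12345 ";
        in.accountName = " Acme Trading ";
        AccountRow out = processRow(in);
        System.out.println(out.accountNumber + " | " + out.accountName + " | " + out.id);
    }
}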

This is what the data loading job that I’ve built in Jaspersoft ETL looks like:
The two light-blue rectangles above represent two sub-jobs. The topmost sub-job reads all the account files in the input folder and stores the data in the database. Once this sub-job has finished running, the second sub-job starts and moves all the files in the input folder to the processed folder.

While the resulting job above looks fairly simple, I experienced a number of problems while building it. Some components do not indicate what values they output, which means that the user has to dig into the generated Java source code to find the names of the variables to use in subsequent components. Furthermore, many components absorb incoming rows instead of passing them on to the components that follow. Compared to PDI, Jaspersoft ETL is far less flexible in the way that rows are allowed to flow between components (which actually makes sense given the application’s code-generation approach). As a result, Jaspersoft ETL can be terrible to use for new users who are not yet familiar with all of its quirks.

Jaspersoft ETL facilitates scheduling a job through its Scheduler. After selecting the days and times when the job should run, the scheduler generates a crontab file that can be used with cron on Linux and Unix, or with a program like cronw on Windows. So, while Jaspersoft ETL offers some help with scheduling, it is not as seamless as Linx’s built-in scheduling support.

Jan 20, 2014

ETL Application Comparison, Part 4


In this part of our evaluation series we are having a look at SSIS (SQL Server Integration Services). SSIS is part of the Microsoft SQL Server software package and supports a broad range of data migration tasks. Like PDI, SSIS is a workflow-based application, but it features separate sections for your project’s Control Flow and Data Flow. The Control Flow section contains the high-level steps to be executed in sequence, and the Data Flow section specifies how the data is read and processed.



Figuring out how to read the data from the CSV files and move it into the database, with the help of a tutorial, was not too hard. At first, however, I simply could not find a way to dynamically select all of the files in the input folder; the application allowed me to select one file only. Eventually, Google came to the rescue and taught me that I needed to loop through the files using a Foreach Loop Container in the Control Flow section. The Foreach Loop Container looks at a fixed folder and loops through all files matching a specified wildcard expression. During each iteration a variable containing the file path is set, which can then be used in the Data Flow section. It is conceptually simple but a bit tricky to get right in practice.
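Conceptually, the container boils down to a loop like the one below. This is only a plain-Java illustration of the idea, not anything SSIS actually generates, and the folder path and file pattern are made-up examples.

import java.io.IOException;
import java.nio.file.*;

// Plain-Java illustration of what the Foreach Loop Container does conceptually:
// enumerate the files that match a wildcard and expose each file path, one at
// a time, through a variable that the rest of the flow can use.
public class ForeachLoopSketch {
    public static void main(String[] args) throws IOException {
        Path inputFolder = Paths.get("C:/data/input"); // assumed input folder
        try (DirectoryStream<Path> files =
                Files.newDirectoryStream(inputFolder, "*.csv")) {
            for (Path file : files) {
                // In SSIS this is the loop's variable mapping; the Data Flow
                // then processes the file that the variable points to.
                String currentFilePath = file.toString();
                System.out.println("Processing " + currentFilePath);
            }
        }
    }
}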

Here is a screenshot of the Control Flow that I have built:
 
And here is what the Data Flow section looks like:
 

All the string manipulation, as well as the generation of the GUIDs, is done by the Process Data component. It is a Script Component, which allows the user to manipulate the incoming data rows using custom C# code.

For the development of our test scenario, SSIS was found to be very developer-oriented and not particularly user-friendly. One can easily get lost among all the different sections and properties that are available to configure. On the other hand, SSIS features a number of very powerful components, such as the Script Component, which allows one to write a block of C# code that is executed for each data row that passes through. Scheduling the SSIS package can be done fairly easily by adding a job to SQL Server Agent using SQL Server Management Studio.

Jan 13, 2014

ETL Application Comparison, Part 3


The next application we are evaluating in this series is Linx 5. This article describes my experiences with the current beta version of Linx 5 (hereafter referred to as just “Linx”) while using it for the first time to implement the data loading problem defined in the first post of this series. Linx distinguishes itself from the application of the previous post, PDI, in that Linx programs are procedural in structure rather than workflow-based. This can make the application immediately intuitive to use for anyone with some prior programming experience.


Instead of featuring a large array of different components like PDI does, Linx offers a fairly minimal set. The number of components is smaller, but the ones that are available tend to be more versatile. The ExecuteSQL component, for instance, can be used to read data from a database, to perform lookups, or to update or insert into a database. This means there are fewer components one needs to know about in order to get the task done.

Many properties of the components support expressions, and all expressions are written in the same language, namely C#. This consistency makes things easier for Linx users, as there is no need to use different languages within a Linx program. In PDI, on the other hand, a complex transformation with many steps often uses several types of expressions. Furthermore, most fields in PDI do not support expressions, forcing one to introduce additional steps just to fetch variables or make calculations. The simplicity and consistency of Linx contribute to a more positive user experience when building a Linx program.

Unlike PDI, Linx currently lacks a component for writing a block of C# code. This did not cause any trouble with the implementation of our simple data loading scenario, but such a component would come in handy when there is a need to do some involved task that is not directly supported by any of the existing components.

Implementing the data loading problem felt fairly natural and I was satisfied with the result. Here is a screenshot of the program that I’ve built. As I went along, I gave the components concise but descriptive names in order to keep the program readable.

 

As you can see, the program turned out nice and simple.

Linx offers a very simple and clean way to schedule and monitor processes. A Windows Service that runs in the background periodically checks the schedules and automatically runs the processes when they are due. With the help of this service, a process can be scheduled to run on a periodic basis, or whenever changes in a folder are detected.

In this exercise, building the data loading process in Linx produced a very concise program, and scheduling it to run periodically was quick and easy. Even though it did not hamper us in this example, one shortcoming of Linx that presented itself was the lack of a component that lets the user handle more advanced tasks that cannot be done with the existing components. Perhaps Linx 5 will soon add a component to address that gap.

Jan 6, 2014

ETL Application Comparison, Part 2


In this part of the ETL Application Comparison series, we are going to take a look at PDI 5.0.1 (Pentaho Data Integration), an ETL application that I have used several times in the past.

PDI is a workflow-based ETL application that is fairly simple and intuitive to use. PDI features a large variety of steps, allowing one to read and write data from a wide array of data sources. When setting up database connections, the user is presented with a long list of different databases to choose from. There are many different steps available for manipulating data and controlling the flow of rows between steps. The implication, however, is that there are often several ways of doing the same thing using different steps.

There are a number of scripting steps that add a lot of power to the tool. There is a Java Expression step that allows one to evaluate any Java expression and assign its value to a field. There is a JavaScript step that allows one to define a JavaScript program that runs once for every data row that passes through the step. The most powerful is the Java Class step, which allows one to process the rows using Java code. Rows are read and written using Java statements, which means that incoming and outgoing rows need not be one-to-one: one can implement the step to output many rows for each incoming row, or to group a number of incoming rows into one row.
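As an illustration of that one-to-many idea, here is a small standalone Java sketch. It deliberately does not use PDI’s actual step API; the row types and field names are invented for the example.

import java.util.ArrayList;
import java.util.List;

// Standalone sketch of the "one row in, many rows out" idea behind the Java
// Class step. This does not use PDI's real step API; the row representation
// and field names are invented for illustration.
public class OneToManySketch {

    // One incoming row: an account with a semicolon-separated list of tags.
    record InRow(String accountNumber, String tags) {}

    // Outgoing rows: one row per individual tag.
    record OutRow(String accountNumber, String tag) {}

    static List<OutRow> processRow(InRow in) {
        List<OutRow> out = new ArrayList<>();
        for (String tag : in.tags().split(";")) {
            out.add(new OutRow(in.accountNumber(), tag.trim()));
        }
        return out;
    }

    public static void main(String[] args) {
        InRow in = new InRow("12345", "retail; priority; new");
        processRow(in).forEach(r ->
            System.out.println(r.accountNumber() + " -> " + r.tag()));
    }
}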

Another useful feature of PDI is the Step Metrics view that is displayed while a transformation is running (or after it has finished). The Step Metrics report shows the number of rows passing into and out of each step, as well as the speed of each step in rows per second. This helps to identify bottlenecks when a transformation is taking too long to run.

I implemented the data loading scenario as defined in the previous post of this series in PDI, and PDI handled it with relative ease. Here is a screenshot of the PDI “transformation” that I built. A very brief description of the transformation follows.
 

The icons above are known as steps. Every step in a so-called “transformation” receives data rows from a previous step, one after the other, processes them, and outputs the altered rows to the following step(s). The Read Accounts step at the top-left corner reads all the CSV files in the input folder, using a regular expression to match the file names it should pick up. The Account Found? step splits the stream of incoming rows into two branches depending on whether the account already exists in the database or not. The Remove values step removes the fields that were created in the account-not-found branch, so that the rows of the two branches can be merged at the Build file paths step. The Group by File Name step then groups the incoming rows by their source file names, outputting only one row per file. Finally, the Move Files step moves each file to the processed folder.
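The effect of those last two steps is roughly what the standalone Java sketch below does. The folder paths and the hard-coded list of file-name values are made up for illustration; in the transformation the file names arrive as a field on the data rows.

import java.io.IOException;
import java.nio.file.*;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Rough standalone sketch of what Group by File Name and Move Files accomplish
// together: reduce the stream of rows to the distinct source file names, then
// move each of those files to the processed folder. Paths and the sample file
// names are invented for illustration.
public class MoveProcessedFilesSketch {
    public static void main(String[] args) throws IOException {
        // Pretend these are the file-name values carried on the data rows.
        List<String> fileNameField = List.of(
            "accounts_001.csv", "accounts_001.csv", "accounts_002.csv");

        // "Group by File Name": keep one entry per distinct file.
        Set<String> distinctFiles = new LinkedHashSet<>(fileNameField);

        Path inputFolder = Paths.get("C:/data/input");          // assumed
        Path processedFolder = Paths.get("C:/data/processed");  // assumed
        Files.createDirectories(processedFolder);

        // "Move Files": move each distinct file to the processed folder.
        for (String name : distinctFiles) {
            Path source = inputFolder.resolve(name);
            if (Files.exists(source)) {
                Files.move(source, processedFolder.resolve(name),
                           StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}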

Scheduling a PDI job or transformation to run on a periodic basis is not entirely trivial. You have to use the command-line tool Pan.bat (or pan.sh on Unix-based systems) for transformations, or Kitchen.bat (kitchen.sh) for jobs. (What we have built above is a transformation; a job consists of a number of entries that execute in sequence, and may include entries that invoke transformations.) The recommended approach to scheduling is to write a simple .bat file that calls Kitchen with the necessary parameters and redirects the console output to a log file, and then schedule that .bat file to run using the Windows Task Scheduler.

In conclusion, we have seen that implementing the data loading problem in PDI was fairly straightforward. The biggest catch for an inexperienced user is probably that any new, temporary fields defined in a branch need to be removed before the streams can be merged together again. Scheduling the process is doable, but could have been made easier.