Dec 30, 2013

ETL Application Comparison, Part 1

We are starting a series of articles in which we are evaluating and comparing a number of ETL/automation applications to see how Linx matches up to the competition. We have set up a simple data loading scenario that we are going to solve using different products to see how things go. In our evaluations we will focus on how easy and intuitive it is to build the program, and we will also take a look at how easy it is to schedule a task to run on a periodical basis.

The data loading problem that we have defined involves reading a number of CSV files containing account information, and storing and updating the information in a number of database tables. Each input CSV file looks something like this:

911528,970,Mr.,William,Velez,3/20/1988,,083 741 0515,1115 Rissik St,Nasaret
470206,392,Mr.,Matthew,Seeley,6/30/1978,,084 926 4283,2428 Mandela Dr,...
725923,663,Ms.,Alta,Stewart,1/29/1942,,084 201 3307,435 Visser St,...
067808,224,Mr.,William,Gilbert,8/11/1942,,084 445 8760,2241 Visser St,...
179263,308,Mr.,Michael,Herrmann,3/25/1934,,082 318 0187,1473 Bath Rd,...
The contents of the CSV files have been composed with the help of the Fake Name Generator, so all the information is random and does not represent real life data.

This information must be stored in a SQL Server database into three tables: Account, Person, and PersonContact. The first value of each row in the CSV file represents a reference number for the account that will be used to do a lookup to see if the account already exists. If the account does not yet exist, new entries will be made into all three tables. Otherwise, only the balance in the Account table will be updated. The Title, GivenName, and Surname fields must be concatenated together with spaces in between and stored in the FullName column of the Person table. The spaces in the TelephoneNumber must be removed before it is stored in PersonContact. The full name and the email address fields must have their lengths checked and be truncated at the end in case they would be too long to fit into the corresponding table columns. Finally, each input file must be moved from the input folder to a processed folder as soon as it is processed.

I am a new employee at Digiata and have been assigned to work with the Linx development team. I have a fair amount of experience working with PDI (Pentaho Data Integration), and in the next post of this series I will be building the data loading problem using PDI. Later in the series I will be using Linx 5 for the first time to implement the data loading scenario, and other ETL applications will follow after that.

No comments:

Post a Comment