Dec 30, 2013

ETL Application Comparison, Part 1


We are starting a series of articles in which we are evaluating and comparing a number of ETL/automation applications to see how Linx matches up to the competition. We have set up a simple data loading scenario that we are going to solve using different products to see how things go. In our evaluations we will focus on how easy and intuitive it is to build the program, and we will also take a look at how easy it is to schedule a task to run on a periodical basis.

The data loading problem that we have defined involves reading a number of CSV files containing account information, and storing and updating the information in a number of database tables. Each input CSV file looks something like this:


Ref,Balance,Title,GivenName,Surname,Birthday,EmailAddress,TelephoneNumber,StreetAddress,City
911528,970,Mr.,William,Velez,3/20/1988,WilliamLVelez@einrot.com,083 741 0515,1115 Rissik St,Nasaret
470206,392,Mr.,Matthew,Seeley,6/30/1978,MatthewBSeeley@fleckens.hu,084 926 4283,2428 Mandela Dr,...
725923,663,Ms.,Alta,Stewart,1/29/1942,AltaAStewart@jourrapide.com,084 201 3307,435 Visser St,...
067808,224,Mr.,William,Gilbert,8/11/1942,WilliamRGilbert@fleckens.hu,084 445 8760,2241 Visser St,...
179263,308,Mr.,Michael,Herrmann,3/25/1934,MichaelJHerrmann@einrot.com,082 318 0187,1473 Bath Rd,...
.
.
The contents of the CSV files have been composed with the help of the Fake Name Generator, so all the information is random and does not represent real life data.

This information must be stored in a SQL Server database into three tables: Account, Person, and PersonContact. The first value of each row in the CSV file represents a reference number for the account that will be used to do a lookup to see if the account already exists. If the account does not yet exist, new entries will be made into all three tables. Otherwise, only the balance in the Account table will be updated. The Title, GivenName, and Surname fields must be concatenated together with spaces in between and stored in the FullName column of the Person table. The spaces in the TelephoneNumber must be removed before it is stored in PersonContact. The full name and the email address fields must have their lengths checked and be truncated at the end in case they would be too long to fit into the corresponding table columns. Finally, each input file must be moved from the input folder to a processed folder as soon as it is processed.

I am a new employee at Digiata and have been assigned to work with the Linx development team. I have a fair amount of experience working with PDI (Pentaho Data Integration), and in the next post of this series I will be building the data loading problem using PDI. Later in the series I will be using Linx 5 for the first time to implement the data loading scenario, and other ETL applications will follow after that.

Dec 2, 2013

Linx 4: Writing large files

We recently had an issue reported that Linx 4 takes a long time to build strings. The problem showed up when trying to write a large file in Linx 4 took a lot longer than it used to in Linx 2. The process to prove it was


  1. Create a string variable.
  2. Read a database table with 1 million records.
  3. Loop through the records and add to the string variable i.e. for each record string = string + record data.
  4. Write the string to file.


The problem turned out to be the creation of a very large string by appending to the same string in a loop. We have improved the performance to Linx 2 levels by using a different string concatenation mechanism when we detect the process wants to append to a string. Even though this solves the speed problem it does not mean that building large strings in memory is a good idea.

To illustrate the problem I built a small application that writes 500 character long rows to build 3 different file sizes. One method uses the string concatenation method and the other writes directly to file. There was not much difference in speed but memory usage was drastically different:


So be careful when building very large strings in memory, especially if they can happen simultaneously. At least now you won't have to wait very long for the crash...

Linx 4.0.693.3441 released

Linx 4.0.693.3441 is available for download. It includes a mix of bug fixes, performance improvements and new components. Please see the release notes for all the gory details.

There is a very good reason to upgrade:



Remember to give Murray a high five when you see him again.

The search looks for the complete string, including spaces, in the solution. Double click on a result to open the relevant tab. Please let us know how we can improve it.