ETL – Extract Transform and Load

There are several common use case scenarios where we need to EXTRACT data from a given source and TRANSFORM the data that we extracted so that it can be LOADED into a database or another system. In this article, I am going to talk about a CSV upload and a few of the considerations one needs to make to Extract, Transform and Load the CSV data into a database.

There are several common use case scenarios where we need to EXTRACT data from a given source and TRANSFORM the data that we extracted so that it can be LOADED into a database or another system. In this article, I am going to talk about a CSV upload and a few of the considerations one needs to make to Extract, Transform and Load the CSV data into a database.

Extract

To put it in simple words, a CSV file is a plain text file with data separated by commas, tabs, or semi-colons. When a spreadsheet application opens a CSV, it displays the data as a table. Independently of your choice of language (JavaPythonJavaScriptPHP, etc), you will find a CSV library code to deal with the data.

Below an example of a comma-separated CSV data:

"Date","Pupil","Grade"
"25 May","Bloggs, Fred","C"
"25 May","Doe, Jane","B"
"15 July","Bloggs, Fred","A"
"15 April","Muniz, Alvin ""Hank""","A"

Transform

Usually, data will have to be transformed, for example by combining two columns into a single one (first name + middle name), or some data needs to be split into several individual pieces, i.e. when addresses come as a single cell record but displayed into several lines. The two tables below show an example of this case.

Some data example

Load

To keep performance high, the CSV is read line by line and the data in every line is inserted into a database or printed onto the screen so that the user can select one or more roles to be imported. You want to process the single row straight away and discard it from memory as soon as you finish.

In many cases, the first row contains or displays column names, which can then be used to match pre-existent data in the system. This allows flexibility when parsing the data in the order it is presented.

A more complex case is described in this case study, where the data is requested from the source system, then transformed before finally loaded into the destination system.