ETL (Extract, Transform, and Load) Process & Conceptby Applied Informatics
According to Wikepedia:
Extract, Transform and Load (ETL) refers to a process in database usage and especially in data warehousing that:
- Extracts data from homogeneous or heterogeneous data sources
- Transforms the data for storing it in proper format or structure for querying and analysis purpose
- Loads it into the final target (database, more specifically, operational data store, data mart, or data warehouse)
ETL helps organizations to make meaningful, data-driven decisions by interpreting and transforming enormous amounts of structured and unstructured data. The need for ETL has increased considerably, with upsurge in data volumes.
ETL processes have been the way to move and prepare data for data analysis. ETL process involves the following tasks:
1. Extracting the data from different sources – the data sources can be files (like CSV, JSON, XML) or RDBMS etc.
This is the first step in ETL process. It covers data extraction from the source system and makes it accessible for further processing. The main objective of the extraction step is to retrieve all required data from source system with as little resources as possible. The extraction step should be designed in a way that it does not negatively affect the source system. Most data projects consolidate data from different source systems. Each separate source uses a different format. Common data-source formats include RDBMS, XML (like CSV, JSON). Thus the extraction process must convert the data into a format suitable for further transformation.
2. Transforming the data – this may involve cleaning, filtering, validating and applying business rules.
In this step, certain rules are applied on the extracted data. The main aim of this step is to load the data to the target database in a cleaned and general format (depending on the organization’s requirement). This is because when the data is collected from different sources each source will have their own standards like –
For example if we have two different data sources A and B. In source A, date format is like dd/mm/yyyy, and in source B, it is yyyy-mm-dd. In the transforming step we convert these dates to a general format. The other things that are carried out in this step are:
- Cleaning (e.g. “Male” to “M” and “Female” to “F” etc.)
- Filtering (e.g. selecting only certain columns to load)
- Enriching (e.g. Full name to First Name , Middle Name , Last Name)
- Splitting a column into multiple columns and vice versa
- Joining together data from multiple sources
In some cases data does not need any transformations and here the data is said to be “rich data” or “direct move” or “pass through” data.
3. Loading – data is loaded into a data warehouse or any other database or application that houses data.
This is the final step in the ETL process. In this step, the extracted data and transformed data is loaded to the target database. In order to make data load efficient, it is necessary to index the database and disable constraints before loading the data.
All the three steps in the ETL process can be run parallel. Data extraction takes time and so the second step of transformation process is executed simultaneously. This prepares data for the third step of loading. As soon as some data is ready it is loaded without waiting for completion of the previous steps.