Hi Friends, so am back with a very generic topic , but having a lot of information in store for you.
For last many decades, the petabytes and much more data is collected and processed for Data Analytics. This helps in business growth, as the raw data collected is now available in meaningful form.
For this above requirement, we had ETL tools in market since year 2000, and companies have adopted the technology handedly, but then a new change appear which looked like ELT. Nothin gmuch of a difference. Just a switch of ‘T’ and ‘L’. Doesnt it??
The answer is No. ETL (extract, transform, load) has been the old approach for data warehousing, but ELT ie (extract, load , transform) added a new flavor to working modules. Want to know how ? Read further !
First lets understand what these 3 stages of ‘E’, ‘T’, ‘L’ mean for both ETL and ELT:
- E for Extract: Pulling the data from Database or data source. It can be same or heterogeneous. It is the raw data which is unstructured and needs cleansing, enrichment. With ETL, the data goes into a temporary staging data repository. With ELT it goes immediately into a data lake storage system.
- T for Transform: Structuring,and converting the raw data as per the business requirements and architecture.Its same for both ETL and ELT.
- L for Load: Loading the transformed and structure data into a data warehouse or data mart etc. This switch happens to make lot of difference.
So for understanding the importance of ‘L for Load‘,
a) we need to analyze the difference between loading the data ‘before or after’ the transformation step
If our system uses the typical OLAP data warehouses, which can only handle SQL based structures, and cannot handle Unstructured Data (mentioned in point B below), then we need to do the transformation first and then load the structured data, in the relational table format. But if we consider the Data lakes (Big data, hadoop), which can handle the unstructured data, we do the loading directly to Data repository skipping the Staging area step. This saves lot of space concerns and is faster.
b) we need to understand the ‘type of data’ available in market now a days and how technology is evolving.
Type of data refers to one of the 5V’s we generally talk about in today’s technology market consisting of Big Data, Hadoop, Data lake and much more. 5V’s are:
- Volume: the volume of (raw) data
- Variety: the variety (e.g. structured, unstructured, semi-structured) of data
- Velocity: the speed of data processing, consummation or analytics of data
- Veracity: the level of trust in the data
- (Value): the value behind the data
So the old legacy systems which needs to be handled in a process, uses ETL tools, that help them maintain the integrity of data. But if you seem to talk about volume of data in today’s world, data is increasing at unknown pace every minute. It can be in any form, structured (relational), unstructured (containing text and multimedia contents, videos etc) and much more, so to handle this we need new flexible ways is ELT tools, which help us to load this data into data lakes for further analysis and processing.
c) We need to understand the ‘benefits and cons’ of ETL vs ELT wrt to various modes like:
i) Load time : ETL take more time, ass it loads first in staging area , then transformation happens and then to Target system so extra time to load data. whereas in ELT, everything is loaded into one target system and then transformation whichever required happens so load only once mechanism, so its faster.
ii) Time – Transformation :In ETL, transformation time is more, as data grows every time. so especially for big data chunks, transformation time increases. In ELT, everything is already present in target system, so speed is not dependent on data size.
iii) Usability: This feature can be thought in many ways. Like for data size, if smaller legacy data, ETL is best, whereas if massive amount of data, ELT is best. If agile technology, with flexibility the project follows then ELT is best, and not ETL.If full data is required for audit and retaining process , ELT is best. From maintenance perspective, ELT is better as it supports Cloud based tools, so less maintenance is required, wheres ETL is legacy and requires high maintenance, as and when new data arrives.
iv) Cost Effective: ETL is not so much cost-effective as every time, data size grows, the same process needs to be repeated. Plus is not scalable also. Hence its good for smaller data companies . whereas ELTÂ is Scalable and available to all project sizes using On-Prem or cloud Solutions.
v) Speed Analysis: Transforms only the data you need: In ETL, we need to transform the data loaded in Staging area and then load to Target system. If any extra calculation needs to be done, it has to be processed as Append column or modify existing, so more analysis time required. Whereas in ELT , all data is available in one place. You transform only the data required for a particular analysis. Secondly, if any process or calculation changes later in ETL, then whole ETL pipeline needs to be modified and process re runs. In ELT, we just need to modify the calculation logic and done.
vii) Data Compliance and Security: Often companies regulated by GDPR or other Security bonds, need to mask or encrypt specific data fields of production data to protect the privacy of their clients. Since in ELT, we load full data in start, and retain it, its less secure wrt to ETL, where, we load data in Staging and then transform it before sending further to Target. So compliance issues are less to happen in ETL wrt to ELT.
So my final verdict is ETL tools are outdated and used only for projects with limited amount of structured data in more security and fixed timeline environment projects. For today’s time, we need to be agile, flexible & quirky to handle any type and amount of data, so use latest ELT tools, which give in handy information in a way required.
Stay tuned for more tutorials !!
Click here for Beginner tutorial for ODI 11g & 12c