Should ETL Become Obsolete? Why a Business-Rules-Driven “E-LT” Architecture May Be Better

DESCRIPTION

This presentation compares the traditional ETL approach with the newer Business Rules-driven E-LT paradigm and asks whether conventional ETL tools should be considered obsolete and phased out of the Enterprise Architecture, with tools based on Business Rules and E-LT taking their place.


Page 1: Should ETL Become Obsolete

Should ETL Become Obsolete?

Why a Business-Rules-Driven “E-LT” Architecture May Be Better

Introduction

In today’s information-based economy, organizations must be able to integrate vast amounts of data from disparate sources in order to support strategic IT initiatives such as:

• Business Intelligence

• Corporate Performance Management, including Master Data Management, Data Warehousing and Data Marts.

At the same time, IT organizations are under constant pressure to get more done with fewer resources.

The only way to satisfy these conflicting goals is to adopt a cost-effective integration solution that enhances the productivity of the IT organization and helps to streamline a broad range of integration initiatives.

Over the past several decades, many organizations have turned to commercial ETL (Extract, Transform, and Load) tools as a means to reduce the effort associated with the most common integration approach: Manual Coding.

Using a centralized ETL “engine” as an integration hub, and powered by a single, platform-independent language, most ETL tools do reduce the effort associated with point-to-point manual integration.

However, many of the users of these tools have also encountered a number of issues that are a consequence of the traditional ETL architecture.

This begs the question:

Is the conventional approach to ETL obsolete?

The Problems with the Traditional ETL Approach

Traditional ETL tools operate by:

First, Extracting the data from various sources

Second, Transforming the data on a proprietary, middle-tier ETL engine

Third, Loading the transformed data into the target Data Warehouse or integration server.

Hence the term “ETL” represents both the names and the order of the operations performed, as shown in the next slide.

[Figure: the Extract, Transform, and Load sequence of the traditional ETL architecture]

Most ETL tools use a graphical programming model to shield the user from the complexity of coding the transformations and make the tools easier to learn and use.

While in theory ETL tools should improve Developer productivity, this is not always the case.

The primary issues with the traditional ETL approach fall into the following three categories:

1. Productivity/maintainability
2. Performance
3. Cost.

Productivity/Maintainability Issues

Virtually all ETL tools use some kind of graphical programming model as an alternative to manual coding. While at first glance many of these data-flow oriented GUIs look similar, there are significant differences that impact the number of intermediate steps that must be defined, and the maintainability and reuse of the job and its Components over time.

The conventional ETL approach first requires the user to describe what they want to do in plain English. That means describing the Business Rules as text and then detailing precisely how to implement these rules, step by step. This phase of development is often referred to as “designing the Data Flow”.

This requires a strong understanding of the structure of the data and the architecture of the IT system.

As illustrated in the next figure, the Data Flow needs to be defined manually, and repeated for each individual process.

Users must not only understand what the overall transformation is supposed to do, but also define each incremental step needed to perform it.

This usually ends up as a long chain of multiple mapping operations tied to a number of temporary outputs.

Defining all of these intermediate stages requires additional analysis and development work, and hence has a negative impact on productivity.

[Figure: a manually defined Data Flow with its intermediate stages]

What’s worse, when there is a need to change a Business Rule or accommodate additional data sources or targets, significant rework may be required to the existing Data Flow since the transformations are highly fragmented, requiring edits to many sub-blocks.

This can make maintenance even more challenging – especially when the resource maintaining the processes is not the same as the resource who created them.

Performance Issues with Traditional ETL

The data transformation step of the ETL process is by far the most compute-intensive, and is performed entirely by the proprietary ETL engine on a dedicated server.

The ETL engine performs data transformations (and sometimes data quality checks) on a row-by-row basis, and hence can easily become the bottleneck in the overall process.

In addition, the data must be moved over the network twice – once between the sources and the ETL server, and again between the ETL server and the target Data Warehouse.

Moreover, if one wants to ensure referential integrity by comparing Data Flow references against values from the target Data Warehouse, the referenced data must be downloaded from the target to the engine, further increasing network traffic and download time and leading to additional performance issues.

Let’s consider, for example, how a traditional ETL job would look up values from the target database to enrich data coming from source systems.

To perform such a job, a traditional ETL tool could be used in one of the following three ways:

1. Load look-up tables into memory:

• The entire look-up table is retrieved from the target server and loaded into the engine’s memory.

• Matching (or joining) this look-up data with source records is done in memory before the resulting transformed data is written back to the target server.

• If the look-up table is large, the operation will require a large amount of memory and a long time to download its data and re-index it in the engine.

2. Perform row-by-row look-ups “on the fly”:

• For every row, the ETL engine sends a query to the look-up table located on the target server.

• The query returns a single row that is matched (or joined) to the current row of the flow. If the flow contains, for example, 500,000 rows, the ETL engine will send 500,000 queries.

• This will dramatically slow down the data integration process and add significant overhead to your target system.

3. Use manual coding within the ETL job:

• Use the ETL engine only for loading source data to the target RDBMS and manually write SQL code to join this data to the target look-up table.

• This raises the question: why would you buy a tool that requires manual coding on the target server, knowing that you lose all the benefits of metadata management and development productivity by doing so?

• Unfortunately, this is what many users end up doing once they notice a ten-fold (10x) degradation in the overall performance of the integration process (when compared to the same operations executed by manual code).
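The difference between the second and third options can be sketched with Python's built-in sqlite3 module standing in for the target RDBMS. This is only an illustration under stated assumptions: the tables `src_orders` and `dim_customer`, their columns, and both helper functions are hypothetical names that do not come from any particular ETL tool.

```python
import sqlite3

# In-memory database standing in for the target server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER, cust_code TEXT);
    CREATE TABLE dim_customer (cust_code TEXT PRIMARY KEY, region TEXT);
    INSERT INTO src_orders VALUES (1, 'A'), (2, 'B'), (3, 'A'), (4, 'Z');
    INSERT INTO dim_customer VALUES ('A', 'North'), ('B', 'South');
""")


def enrich_row_by_row(conn):
    """Option 2: send one look-up query per row of the flow."""
    source_rows = conn.execute(
        "SELECT order_id, cust_code FROM src_orders ORDER BY order_id"
    ).fetchall()
    lookups = 0
    result = []
    for order_id, cust_code in source_rows:
        match = conn.execute(
            "SELECT region FROM dim_customer WHERE cust_code = ?", (cust_code,)
        ).fetchone()
        lookups += 1  # one round trip to the look-up table per source row
        result.append((order_id, match[0] if match else None))
    return result, lookups


def enrich_set_based(conn):
    """Option 3: let the database join both tables in one statement."""
    result = conn.execute("""
        SELECT o.order_id, c.region
        FROM src_orders AS o
        LEFT JOIN dim_customer AS c ON c.cust_code = o.cust_code
        ORDER BY o.order_id
    """).fetchall()
    return result, 1


row_result, row_lookups = enrich_row_by_row(conn)
set_result, set_lookups = enrich_set_based(conn)
```

Both functions return the same enriched rows, but the number of look-up queries in the first grows with the number of rows in the flow, while the second always issues a single statement.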

Cost Issues

Most ETL tool purchases are justified based on potential labor savings.

Unfortunately, there are other up-front and recurring costs that must be considered in the ROI analysis.

The most obvious initial cost is that of the dedicated server and proprietary ETL engine software. Because these middle-tier components carry out all the compute-intensive transformation operations, a powerful server is required, and in some cases multiple servers and run-time engines are necessary to meet the throughput requirements.

There are also ongoing hardware and software maintenance costs associated with these assets.

This can result in hundreds of thousands of dollars in additional hardware, software and maintenance expenses.

In addition, as the Data Warehouse grows to accommodate higher throughput demands, the ETL hub server will need to scale up with it, necessitating additional hardware and software purchases in the future.

Conventional ETL tools also have a number of hidden costs, including the consulting expenses required for setup and tuning, and the rip-up and re-write of code as integration requirements evolve over time.

A Better Approach: “E-LT” Architecture + Business Rules

In response to the issues described before, a new architecture has emerged, which in many ways incorporates the best aspects of both manual coding and ETL approaches in the same solution.

Known as “E-LT”, this new approach changes where and how data transformation takes place, and leverages the existing Developer skills, RDBMS engines and server hardware to the greatest extent possible.

In essence, E-LT moves the data transformation step to the target RDBMS, changing the order of operations to: Extract the data from the source tables, Load the tables into the destination server, and then Transform the data on the target RDBMS using native SQL operators.

Note, with E-LT there is no need for a middle-tier engine or server as shown in the figure below.
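The Extract, Load, Transform order can be sketched as follows, using two in-memory SQLite databases as stand-ins for a source system and the target RDBMS. All table and column names (`sales`, `stg_sales`, `fact_sales_by_country`) are hypothetical, and a real deployment would normally use the target database's bulk loader rather than `executemany`.

```python
import sqlite3

# Stand-in for a source system (assumption: a small sales table).
source = sqlite3.connect(":memory:")
source.executescript("""
    CREATE TABLE sales (sale_id INTEGER, amount_cents INTEGER, country TEXT);
    INSERT INTO sales VALUES (1, 1250, 'fr'), (2, 1000, 'us'), (3, 4000, 'fr');
""")

# Stand-in for the target RDBMS, with a staging table and a final table.
target = sqlite3.connect(":memory:")
target.executescript("""
    CREATE TABLE stg_sales (sale_id INTEGER, amount_cents INTEGER, country TEXT);
    CREATE TABLE fact_sales_by_country (country TEXT PRIMARY KEY, total_amount REAL);
""")

# Extract: read the raw rows from the source, untransformed.
rows = source.execute(
    "SELECT sale_id, amount_cents, country FROM sales"
).fetchall()

# Load: copy the raw rows straight into a staging table on the target.
target.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)

# Transform: one set-based SQL statement executed by the target engine itself.
target.execute("""
    INSERT INTO fact_sales_by_country (country, total_amount)
    SELECT UPPER(country), SUM(amount_cents) / 100.0
    FROM stg_sales
    GROUP BY UPPER(country)
""")
target.commit()

totals = target.execute(
    "SELECT country, total_amount FROM fact_sales_by_country ORDER BY country"
).fetchall()
```

The data crosses the network once, from source to target, and the clean-up and aggregation run where the data already sits.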

Why the ETL Market is Changing

When commercial ETL tools first appeared in the 1990s, the most widely used RDBMSs such as Oracle, DB2, Teradata and Sybase did not support a rich enough set of SQL operators to handle the complex data transformation tasks required for Data Warehouse applications.

Hence the dedicated ETL engine and proprietary transformation language emerged as the best alternative to laborious manual coding at the time.

However, over the past decade the RDBMS vendors have increased the functionality of the SQL provided to Developers by an order of magnitude, while improving the performance and reliability of their engines at the same time.

For example:

The CASE…WHEN expression (equivalent to an IF… THEN… ELSE…) can be used for complex transformation rules.

Outer joins (LEFT OUTER, RIGHT OUTER, or FULL OUTER JOIN) can be used to easily join data sets in a variety of different manners.

Ranking and windowing functions (MIN and MAX with OVER (PARTITION BY …), LEAD, LAG, and RANK) allow for more effective handling of complex aggregations of large volumes of data.
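As a small illustration of these three features in one query, the sketch below runs a CASE expression, a LEFT OUTER JOIN, and a RANK window function through Python's sqlite3 module. The `orders` and `customers` tables are hypothetical, and the window function assumes the bundled SQLite library is version 3.25 or later.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, cust_id INTEGER, amount REAL);
    CREATE TABLE customers (cust_id INTEGER, name TEXT);
    INSERT INTO orders VALUES
        (1, 10, 50.0), (2, 10, 500.0), (3, 20, 75.0), (4, 99, 20.0);
    INSERT INTO customers VALUES (10, 'Acme'), (20, 'Globex');
""")

rows = conn.execute("""
    SELECT o.order_id,
           -- the outer join keeps orders with no matching customer
           COALESCE(c.name, 'UNKNOWN') AS customer,
           -- CASE expresses an IF/THEN/ELSE transformation rule
           CASE WHEN o.amount >= 100 THEN 'LARGE' ELSE 'SMALL' END AS size_band,
           -- RANK orders each customer's purchases by amount
           RANK() OVER (PARTITION BY o.cust_id ORDER BY o.amount DESC)
               AS rank_in_customer
    FROM orders AS o
    LEFT OUTER JOIN customers AS c ON c.cust_id = o.cust_id
    ORDER BY o.order_id
""").fetchall()
```

Each of these would need its own mapping step in a row-oriented data flow; here they are three clauses of a single statement.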

Complementing the richer language support, RDBMS vendors now provide a long list of out-of-the-box features and utilities that enable impressive performance when executing ETL-type operations:

Some features are dedicated to efficiently loading data from sources to targets

Others directly process various data formats such as XML files.

These are just a few examples of what can be done with the native SQL solutions provided with RDBMS packages today.

E-LT Architecture Offers Better Performance with Bulk Processing

By generating efficient, native SQL code, the E-LT approach leverages the powerful bulk data transformation capabilities of the RDBMS, and the power of the server(s) that hosts it.

In addition, since the data is loaded directly from source systems to the target server, only one set of network transfers is required, not two or more as with the traditional ETL approach.

Only relational DBMS engines can perform set operations and bulk data transformations, enabling processes to achieve higher performance. Inserts and updates are handled as “Bulk” operations and no longer performed row-by-row.

Thanks to “Set processing” logic, the E-LT approach can achieve exceptional performance, carrying out data transformations ten to twenty times more efficiently than traditional ETL tools.

Business-Rules-Driven Approach Brings Better Productivity and Maintainability

With a business-rules-driven paradigm, the developer only defines what they want to do, and the data integration tool automatically generates the data flow, including whatever intermediate steps are required, based on a library of “knowledge modules”.

The “what to do”, i.e. the Business Rules, are specified using expressions that would make sense to Business Analysts, and are stored in a central metadata repository where they can be easily reused.

The implementation details, specifying “how to do it”, are stored in a separate knowledge module library, and can be shared between multiple Business Rules within multiple ETL processes.

The key advantage of this approach is that it is very easy to make incremental changes either to the rules or to the implementation details, as they are, in essence, independent.

When a change needs to be applied to operations logic (e.g. creating a backup copy of every target table before loading the new records) it is simply implemented in the appropriate knowledge module.

That change is then automatically reflected in the hundreds of processes that reference it, without having to touch the Business Rules.

With a traditional ETL approach, such a change would require opening every process to manually add new steps, increasing the risks of errors and inconsistencies.

This makes a huge difference in Developer productivity, especially in long-term program maintenance.
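The separation between rules and knowledge modules can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical model, not the format of any real product: `BusinessRule`, `km_append`, `km_append_with_backup`, and `generate` are invented names, and the generated SQL is only returned as text, not executed.

```python
from dataclasses import dataclass


@dataclass
class BusinessRule:
    """The 'what to do': a target, a source, and column expressions."""
    target: str
    source: str
    mappings: dict  # target column name -> SQL expression


def km_append(rule):
    """Knowledge module (the 'how'): append the mapped rows to the target."""
    columns = ", ".join(rule.mappings)
    expressions = ", ".join(rule.mappings.values())
    return [
        f"INSERT INTO {rule.target} ({columns}) "
        f"SELECT {expressions} FROM {rule.source}"
    ]


def km_append_with_backup(rule):
    """Same module after an operations change: back up the target first."""
    backup = f"CREATE TABLE {rule.target}_bak AS SELECT * FROM {rule.target}"
    return [backup] + km_append(rule)


def generate(rules, knowledge_module):
    """Generate the SQL steps for every rule from one shared module."""
    return {rule.target: knowledge_module(rule) for rule in rules}


rules = [
    BusinessRule(
        "dw_customer",
        "stg_customer",
        {"name": "UPPER(cust_name)", "country": "COALESCE(country, 'N/A')"},
    ),
    BusinessRule("dw_product", "stg_product", {"label": "TRIM(prod_label)"}),
]

plain = generate(rules, km_append)
with_backup = generate(rules, km_append_with_backup)
```

Swapping the knowledge module adds the backup step to every generated process, while the `rules` list is left untouched.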

Combine E-LT with Business Rules to Lower the TCO

With no middle-tier ETL engine and no dedicated ETL server hardware required, the initial hardware and software capital costs are significantly lower, as are the ongoing software and hardware maintenance expenses.

E-LT software also tends to be less expensive because it does not require the development of a proprietary engine for transformations.

It uses any standard RDBMS to execute the ETL jobs.

This means significant savings for both the software provider and the customer.

These savings are on top of the Developer productivity improvement, which enables Information Technology organizations to dramatically reduce the cost of developing and maintaining comprehensive Data Warehouses.

Conclusion: Traditional ETL is Indeed Becoming Obsolete

Now that we’ve compared the traditional ETL approach to the newer Business Rules-driven E-LT paradigm, the answer to the original question is somewhat clearer:

Conventional ETL tools should be considered obsolete and phased out of the Enterprise Architecture, and tools based on Business Rules and E-LT should begin to take their place.