Presentation on the methods applied in the ETL process, with a closer look at the CDC methods used in ETL for real-time data warehouses.
1Adriano Patrick Cunha
ETL in Real-Time DW
Adriano Patrick do N. Cunha
Data Warehouse (DW)
Concepts
Concepts
Data Warehouse (DW)
“is a prominent approach to materialized data integration. Data of interest, scattered across multiple heterogeneous sources is integrated into a central database system.” (Jörg and Dessloch)
“provides information for analytical processing, decision making and data mining tools. A DW collects data from multiple heterogeneous operational source systems OLTP and stores summarized integrated business data in a central repository used by analytical applications OLAP” (Bernardino and Santos)
ETL – Extraction, Transformation and Loading
“Is a process that extracts the data from the source system, transforms the data according to business rules, and loads the results into the target data warehouse.”
Actions:
1) The identification of relevant information at the source side.
2) The extraction of this information.
3) The customization and integration of the information coming from multiple sources into a common format.
4) The cleaning of the resulting data set on the basis of database and business rules.
5) The propagation of the data to the DW and DM.
(Kakish and Kraft)
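The five actions above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the two in-memory sources, the `fact_sales` table and the rule that a missing amount invalidates a row are all hypothetical and do not come from the cited paper.

```python
import sqlite3

# Hypothetical source rows from two systems with different layouts.
SOURCE_A = [{"id": 1, "name": " Ana ", "amount": "10.50"},
            {"id": 2, "name": "Bruno", "amount": None}]
SOURCE_B = [("3", "CARLA", 7.25)]


def extract():
    """Actions 1-2: pick the relevant fields and pull them from each source."""
    rows_a = [(r["id"], r["name"], r["amount"]) for r in SOURCE_A]
    rows_b = list(SOURCE_B)
    return rows_a + rows_b


def transform(rows):
    """Actions 3-4: bring rows to a common format and drop invalid ones."""
    cleaned = []
    for raw_id, name, amount in rows:
        if amount is None:  # assumed business rule: amount is mandatory
            continue
        cleaned.append((int(raw_id), name.strip().title(), float(amount)))
    return cleaned


def load(conn, rows):
    """Action 5: propagate the cleaned rows to the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales "
        "(id INTEGER PRIMARY KEY, name TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stands in for the DW
load(conn, transform(extract()))
loaded = conn.execute(
    "SELECT id, name, amount FROM fact_sales ORDER BY id"
).fetchall()
print(loaded)
```

Keeping the three phases as separate functions mirrors the slide: each phase can be replaced (for example, a different source or target) without touching the others.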
Concepts
Data Warehouse (DW) – Data Quality Dimensions
Completeness
Conformity
Consistency
Accuracy
Duplication
Integrity
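Two of these dimensions, completeness and duplication, are easy to measure mechanically. The sketch below is a minimal illustration; the `quality_report` helper, the field names and the sample rows are hypothetical.

```python
def quality_report(rows, required, key):
    """Return simple completeness and duplication counts for a list of dicts."""
    # Completeness: a row is incomplete if any required field is missing or empty.
    incomplete = sum(
        1 for r in rows if any(r.get(f) in (None, "") for f in required)
    )
    # Duplication: count rows whose key value was already seen.
    seen, duplicates = set(), 0
    for r in rows:
        k = r.get(key)
        if k in seen:
            duplicates += 1
        seen.add(k)
    return {"rows": len(rows), "incomplete": incomplete, "duplicates": duplicates}


customers = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": "", "email": "b@example.com"},
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
]
report = quality_report(customers, required=("name", "email"), key="id")
print(report)
```

The other dimensions (conformity, consistency, accuracy, integrity) usually need reference data or business rules and are not covered by this sketch.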
Concepts
ETL Process
Extract
“Taking out the data from a variety of disparate source system correctly is often the most challenging aspect of ETL ...”
“The goal of the extraction phase is to convert the data into a single format which is appropriate for transformation process...”
Sources: relational DB, flat files, IMS, VSAM, ISAM etc.
“Most of the time the data in source system is very complex, thus determining which data is relevant is very difficult...”
(Kakish and Kraft)
ETL Process
Extract
Logical Methods for extraction:
Full extraction: no need to keep track of changes.
Incremental extraction: relies on a CDC mechanism.
Staging Area
ETL Process
Extract
Physical Methods for extraction:
Online extraction: connects to the source system and extracts the data in a preconfigured format.
Offline extraction: the extracted data is staged outside the source system.
ETL Process
Transform
Types of Transformation
1. Selecting only certain columns to load;
2. Translating coded values (e.g. the source stores 1 for male and 2 for female, but the DW stores M and F);
3. Encoding free-form values (e.g. mapping “Male” to “1”);
4. Deriving a new calculated value;
5. Sorting;
6. Joining data from multiple sources and removing duplicate data;
7. Aggregation;
8. Generating surrogate-key values;
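Several of the transformation types listed above can be combined in one small function. This is a sketch under assumptions: the column names (`cust_id`, `qty`, `unit_price`), the `"U"` fallback for unknown gender codes and the in-memory surrogate-key counter are all hypothetical.

```python
from itertools import count

GENDER_CODES = {1: "M", 2: "F"}  # source code -> DW value (type 2)
_next_key = count(1)             # simple in-memory surrogate-key generator
_surrogate_keys = {}


def surrogate_key(natural_key):
    """Return a stable warehouse key for a source (natural) key (type 8)."""
    if natural_key not in _surrogate_keys:
        _surrogate_keys[natural_key] = next(_next_key)
    return _surrogate_keys[natural_key]


def transform(rows):
    out, seen = [], set()
    for r in rows:
        if r["cust_id"] in seen:  # remove duplicate rows (type 6)
            continue
        seen.add(r["cust_id"])
        out.append({
            # Only the columns built here are kept (type 1).
            "customer_sk": surrogate_key(r["cust_id"]),
            "gender": GENDER_CODES.get(r["gender"], "U"),  # "U" is an assumed default
            "total": r["qty"] * r["unit_price"],           # derived value (type 4)
        })
    return sorted(out, key=lambda x: x["total"])           # sorting (type 5)


rows = [
    {"cust_id": "C-7", "gender": 2, "qty": 3, "unit_price": 2.0, "notes": "x"},
    {"cust_id": "C-2", "gender": 1, "qty": 1, "unit_price": 4.0, "notes": "y"},
    {"cust_id": "C-7", "gender": 2, "qty": 3, "unit_price": 2.0, "notes": "x"},
]
result = transform(rows)
print(result)
```

In a real warehouse the surrogate keys would normally come from a database sequence or a key-mapping table rather than a process-local counter.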
ETL Process
Transform
Types of Transformation
1. Transposing or pivoting (turning multiple columns into multiple rows or vice versa);
2. Splitting a column into multiple columns;
3. Disaggregation of repeating columns into a separate detail table;
4. Looking up and validating the relevant data from tables or referential files for slowly changing dimensions; and
5. Applying any form of simple or complex data validation.
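Pivoting and column splitting, the first two items above, can be illustrated with plain Python. The helper names, the quarterly columns and the rule of splitting a name at the first space are assumptions made for this example only.

```python
def unpivot(row, id_field, value_fields):
    """Turn several columns of one row into multiple (id, field, value) rows."""
    return [(row[id_field], f, row[f]) for f in value_fields]


def split_name(full_name):
    """Split one column into two; everything after the first space is the surname."""
    first, _, last = full_name.partition(" ")
    return first, last


wide = {"product": "P1", "q1": 10, "q2": 12, "q3": 9}
long_rows = unpivot(wide, "product", ["q1", "q2", "q3"])
name_parts = split_name("Ana Maria Silva")
print(long_rows)
print(name_parts)
```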
ETL Process
Load
Mechanisms to load include:
1. SQL*Loader: used to load flat files into the DW;
2. External tables: expose external data as a virtual table that can be queried and joined;
3. Oracle Call Interface (OCI): an API used when the transformation process is done outside the database;
4. Export/Import.
Types of ETL
CDC - Change Data Capture
Snapshot Sources - The source data is extracted to a file, which is then compared with the previous version of the file to find the changes.
Logged Sources - Changes are recorded in change logs, usually written by triggers. The logs may also be written by the business logic of the applications, or obtained with DBMS-specific utilities, such as database log scraping or log sniffing, that read the logged transactions.
Timestamped Sources - The tables have audit attributes that indicate when each record was created or changed.
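The timestamped approach can be sketched with an audit column and a stored watermark. This is a minimal illustration using an in-memory SQLite database; the `customer` table, its `updated_at` column and the `extract_changes` helper are hypothetical.

```python
import sqlite3

src = sqlite3.connect(":memory:")  # stands in for the operational source
src.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)"
)
src.executemany("INSERT INTO customer VALUES (?, ?, ?)", [
    (1, "Ana", "2024-01-01T08:00:00"),
    (2, "Bruno", "2024-01-02T09:30:00"),
    (3, "Carla", "2024-01-03T10:15:00"),
])


def extract_changes(conn, last_watermark):
    """Return rows changed after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM customer "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # ISO-8601 strings sort chronologically, so the last row holds the newest timestamp.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


changes, watermark = extract_changes(src, "2024-01-01T23:59:59")
print(changes, watermark)
```

Note that audit timestamps alone do not reveal deleted rows, since a deleted row no longer carries a timestamp to query.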
CDC - Change Data Capture
Snapshot Sources
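A snapshot comparison can be sketched as a diff between two extracts keyed by primary key. The sample data and the `diff_snapshots` helper are hypothetical; a real implementation would typically read the two snapshot files instead of using in-memory dictionaries.

```python
def diff_snapshots(previous, current):
    """Compare two snapshots keyed by primary key and classify the changes."""
    inserted = {k: v for k, v in current.items() if k not in previous}
    deleted = {k: v for k, v in previous.items() if k not in current}
    updated = {
        k: v for k, v in current.items()
        if k in previous and previous[k] != v
    }
    return inserted, updated, deleted


previous = {1: ("Ana", "SP"), 2: ("Bruno", "RJ"), 3: ("Carla", "CE")}
current = {1: ("Ana", "SP"), 2: ("Bruno", "MG"), 4: ("Davi", "BA")}
inserted, updated, deleted = diff_snapshots(previous, current)
print(inserted, updated, deleted)
```

The comparison needs no support from the source system, but it only sees the net difference between two extracts, not intermediate changes made between them.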
CDC - Change Data Capture
Logged Sources
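A trigger-based change log, one variant of a logged source, can be sketched with SQLite. The `customer` and `customer_log` tables, the one-letter operation codes and the `read_log` helper are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer_log (
    seq  INTEGER PRIMARY KEY AUTOINCREMENT,  -- order in which changes happened
    op   TEXT,                               -- 'I', 'U' or 'D'
    id   INTEGER,
    name TEXT
);
CREATE TRIGGER customer_ins AFTER INSERT ON customer BEGIN
    INSERT INTO customer_log (op, id, name) VALUES ('I', NEW.id, NEW.name);
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer BEGIN
    INSERT INTO customer_log (op, id, name) VALUES ('U', NEW.id, NEW.name);
END;
CREATE TRIGGER customer_del AFTER DELETE ON customer BEGIN
    INSERT INTO customer_log (op, id, name) VALUES ('D', OLD.id, OLD.name);
END;
""")

# Ordinary application activity; the triggers fill the log as a side effect.
conn.execute("INSERT INTO customer VALUES (1, 'Ana')")
conn.execute("UPDATE customer SET name = 'Ana Lima' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 1")


def read_log(conn, after_seq=0):
    """Return log entries newer than the last sequence number already processed."""
    return conn.execute(
        "SELECT seq, op, id, name FROM customer_log WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()


log = read_log(conn)
print(log)
```

Unlike the snapshot and timestamp approaches, the log keeps every intermediate change, including deletes, at the cost of extra writes on the source system.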
Bibliography
Near real-time data warehousing using state-of-the-art ETL tools. Thomas Jörg, Stefan Dessloch (2010). Lecture Notes in Business Information Processing (LNBIP) 41.
Real-time data warehouse loading methodology. Ricardo Jorge Santos, Jorge Bernardino (2008). Proceedings of the 2008 International Symposium on Database Engineering & Applications - IDEAS '08. http://portal.acm.org/citation.cfm?doid=1451940.1451949
Near real-time data warehousing with multi-stage trickle and flip. Janis Zuters (2011). Lecture Notes in Business Information Processing (LNBIP) 90.
A triggering and scheduling approach for ETL in a real-time data warehouse. Jie Song, Yubin Bao, Jingang Shi (2010). Proceedings - 10th IEEE International Conference on Computer and Information Technology, CIT-2010, 7th IEEE International Conference on Embedded Software and Systems, ICESS-2010, ScalCom-2010.
Creating a Real Time Data Warehouse. Joseph Guerra, David A Andrews (2011). Andrews Consulting Group.
ETL Evolution for Real-Time Data Warehousing. Kamal Kakish, Theresa A Kraft (2012). Proceedings of the Conference on Information Systems Applied Research, p. 1-12. www.aitp-edsig.org
All text and image content in this document is licensed under the Creative Commons Attribution-Share Alike 3.0 License (unless otherwise specified). "LibreOffice" and "The Document Foundation" are registered trademarks. Their respective logos and icons are subject to international copyright laws. The use of these therefore is subject to the trademark policy.
Thank you …