

    Unit 6 Data Extraction

    Structure

    6.1 Introduction

    Objectives

    6.2 ETL Overview

    6.2.1 Significance of ETL Processes

    6.2.2 ETL Requirements and Steps

    Self Assessment Question(s) (SAQs)

    6.3 Overview of the Data Extraction Process

    Self Assessment Question(s) (SAQs)

    6.4 Types of Data in Operational Systems

    6.4.1 Current Value

    6.4.2 Periodic Status

    Self Assessment Question(s) (SAQs)

    6.5 Source Identification

    Self Assessment Question(s) (SAQs)

    6.6 Data Extraction Techniques

    6.6.1 Immediate Data Extraction

    6.6.2 Deferred Data Extraction

    Self Assessment Question(s) (SAQs)

    6.7 Evaluation of the Techniques

    Self Assessment Question(s) (SAQs)

    6.8 Summary

    6.9 Terminal Questions (TQs)

    6.10 Multiple Choice Questions (MCQs)

    6.11 Answers to SAQs, TQs, and MCQs

    6.11.1 Answers to Self Assessment Questions (SAQs)

    6.11.2 Answers to Terminal Questions (TQs)

    6.11.3 Answers to Multiple Choice Questions (MCQs)


    6.1 Introduction

    In this Unit, we discuss the data extraction process that is carried out from

    the source systems into data warehouses or data marts. As we have

    already discussed, data extraction is the first step in the execution of the

    ETL (Extraction, Transformation, and Loading) functions to build a data

    warehouse. This extraction can be done from an OLTP database and non-

    OLTP systems, such as text files, legacy systems, and spreadsheets. The

    data extraction process is complex in nature because of the tremendous

    diversity that exists among the source systems in practice.

    Objectives:

    The objectives of the Unit are to make you understand:

    • Source identification for extraction of the data

    • Various methods being used for data extraction

    • Evaluation of the extraction techniques

    • Exception handling in case some data has not been extracted properly

    6.2 ETL Overview

    Mostly the information contained in a data warehouse comes from the

    operational systems. But, as we all know, the operational systems cannot

    be used to provide strategic information. So you need to carefully

    understand what constitutes the difference between the data in the source

    operational systems and the information in the data warehouse. It is the ETL

    functions that reshape the relevant data from the source systems into useful

    information to be stored in the data warehouse. There would be no strategic

    information in a data warehouse in the absence of these functions.

    6.2.1 Significance of ETL Processes

    The ETL functions act as the back-end processes that cover the extraction

    of the data from the source systems. Also, they include all the functions and

    procedures for changing the source data into the exact formats and

    structures appropriate for storage in the data warehouse database. After the

    transformation of the data, they also include all the processes that


    physically move the data into the data warehouse repository. After capturing

    the information, you cannot simply dump the data into the data warehouse. You

    have to carefully subject the extracted data to all manner of transformations

    so that the data will be fit to convert into information.

    Data Extraction: Extraction from heterogeneous internal and external source systems

    Data Transformation: Conversion and restructuring of data as per the transformation rules

    Data Integration: Combining the entire data on the basis of source-to-target mapping

    Data Cleansing: Scrubbing and enriching as per the cleansing rules

    Data Summarization: Creating aggregate datasets on the basis of predefined procedures

    Initial Data Loading: Applying initial data in large volumes to the warehouse

    Meta Data Updating: Maintaining and using metadata for ETL functions

    Ongoing Data Loading: Applying ongoing incremental loads and periodic refreshes to the warehouse

    Fig 6.1: ETL Processes Summary


    Let us try to understand the significance of the ETL function by taking an

    example. For instance, suppose you want to analyze and compare sales by

    store, product, and month. But the sales data is available in various

    applications of your organization. Therefore, you have to bring the entire

    sales details into the data warehouse database. You can do this by providing

    the sales and price in a fact table, the products in a product dimension table,

    the stores in a store dimension table, and the months in a time dimension table.

    To do this, you need to extract the data from the respective source systems,

    reconcile the variations in the data representations among the source

    systems, transform the entire sales details, and load the sales data into the fact and

    dimension tables. Thus the execution of ETL functions is challenging

    because of the nature of the source systems.
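    To make this flow concrete, here is a minimal sketch in Python, using SQLite as a stand-in for the warehouse database. The table names, column names, and data values are invented for the illustration, not taken from the text:

    import sqlite3

    dw = sqlite3.connect(":memory:")  # stand-in for the data warehouse database
    dw.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT UNIQUE);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name   TEXT UNIQUE);
    CREATE TABLE time_dim    (time_key    INTEGER PRIMARY KEY, month        TEXT UNIQUE);
    CREATE TABLE sales_fact  (product_key INT, store_key INT, time_key INT, sales REAL);
    """)

    def dim_key(table, col, value):
        """Look up, or create, the surrogate key for a dimension value."""
        dw.execute(f"INSERT OR IGNORE INTO {table} ({col}) VALUES (?)", (value,))
        return dw.execute(f"SELECT rowid FROM {table} WHERE {col} = ?", (value,)).fetchone()[0]

    # Sales rows already extracted and reconciled from the disparate source systems
    for product, store, month, sales in [("Widget", "Store A", "2009-01", 120.0),
                                         ("Widget", "Store B", "2009-01", 80.0)]:
        dw.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                   (dim_key("product_dim", "product_name", product),
                    dim_key("store_dim", "store_name", store),
                    dim_key("time_dim", "month", month),
                    sales))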

    Also, the effort spent on performing the ETL functions can be as

    much as 50-70% of the total effort of building a data warehouse.

    To extract the data, you have to know the time window during each day to

    extract the data from a specific source system without impacting the usage

    of that system. Also, you need to determine the mechanism for capturing the

    changes in the data in each of the relevant systems. Apart from the ETL

    functions, the building of a data warehouse includes the functions like data

    integration, data summarization and metadata updating. Figure 6.1 details

    the processes involved to execute the ETL functions in building a data

    warehouse.

    6.2.2 ETL Requirements and Steps

    Ideally, you should follow the steps provided in

    Fig. 6.2 for the execution of the ETL functions.


    1. To determine the target data of the data warehouse

    2. To identify the internal and external data sources

    3. To map the sources with the target data elements

    4. To establish comprehensive data extraction rules

    5. To prepare data transformation and cleansing rules

    6. To plan for aggregate tables

    7. To organize the data staging area and test tools

    8. To write the procedures for all data loads

    9. To execute ETL functions for dimension tables

    10. To execute ETL functions for fact tables

    Fig 6.2: Steps involved in executing the ETL functions

    Self Assessment Question(s) (SAQs)

    For Section 6.2

    1. Mention the ETL functions and explain how they are important in

    building a data warehouse.

    2. The implementation of ETL function includes various steps from

    determining the source systems to preparation of fact and dimension

    tables. List out the major steps involved in the ETL process in a

    sequence.

    6.3 Overview of the Data Extraction Process

    In general, data extraction is performed within the source system itself,

    especially if it is a relational database to which the extraction procedures

    can easily be added. Also, it is possible for the extraction logic to exist in the



    data warehouse staging area and query the source system for data using

    ODBC (Open Database Connectivity), OLE DB (Object Linking and

    Embedding, Database), or other APIs (Application Programming Interfaces).

    The most common method of data extraction for the legacy systems is to

    produce text files, although many newer systems offer direct query APIs or

    accommodate access through ODBC or OLE DB.
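    For example, extraction logic running in the staging area could query a relational source through ODBC. A minimal sketch using the pyodbc library follows; the DSN, credentials, and table are hypothetical:

    import pyodbc

    # Connect to the source system through an ODBC data source name (DSN)
    conn = pyodbc.connect("DSN=SourceSystem;UID=etl_user;PWD=secret")
    cursor = conn.cursor()

    # Pull the rows identified for extraction; in practice the results
    # would be written to a staging file or staging table, not printed
    cursor.execute("SELECT customer_id, name, city FROM customers")
    for row in cursor.fetchall():
        print(row)

    conn.close()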

    Data extraction in the context of data warehousing is different from that of

    building an operational system. For a data warehouse, you have to extract

    data from many disparate sources. Also, you have to extract data on the

    changes for ongoing incremental loads as well as for a one-time initial full

    load. But for operational systems, all you need is just one-time extractions

    and data conversions. These two factors increase the complexity of data

    extraction for a data warehouse and so an effective data extraction is a key

    to the success of your data warehouse. Therefore, you need to pay special

    attention to the issues and formulate an exclusive data extraction strategy

    for your data warehouse.

    Some of the important issues that you need to take care of while extracting the

    data from various source systems are as follows:

    • Source identification, which involves identification of the source applications and source structures

    • Method of data extraction from each of the source systems, and to define whether the extraction process is to be carried out manually or by using a tool

    • Preparation of a time window for each data source to denote the time for the data extraction

    • Determination of the extraction frequency for each data source, in order to establish how frequently the data extraction must be done (daily, weekly, quarterly, etc.)


    • Job sequencing, to determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully

    • Exception handling, to determine an approach to handle input records that cannot be extracted

    Self Assessment Question(s) (SAQs)

    For Section 6.3

    1. What do you mean by Data Extraction? Discuss various issues involved

    while extracting the data in building a data warehouse.

    6.4 Types of Data in Operational Systems

    As the data in the data warehouse is extracted from the operational system,

    we now discuss the characteristics of the data in an operational system. The

    data in the source operational systems is said to be time-dependent or

    temporal, as the source data changes with time. As business transactions

    keep changing the data in the source systems, the value of an attribute in a

    source system is the value of that attribute at the current time. For instance,

    when a customer moves to another location, the data about customer

    address changes in the customer table in the source system.

    Similarly, the product data gets changed in the source system when a new

    feature has been added to the existing product line. In case of change of

    address, the current address of the customer is important in the operational

    systems. But in the data warehouse the entire history is to be maintained. If

    you want to make sales analysis based on the location for a given time

    interval, the sales of the customer prior to the change of address are taken for

    the previous period, and the current address is considered for the current

    period.


    Now we discuss how the data is stored in the source systems. There are

    two ways of storing the data. The type of data extraction method you have

    to use depends on the nature of each of these two categories.

    6.4.1 Current Value

    The current value is the stored value of an attribute at that moment of time.

    These values are transient and change as and when business transactions

    happen. You cannot forecast when the value changes next. For instance,

    the current balance amount in a bank account and the stock price of a

    company are some examples of this category. Thus the value of an attribute

    remains constant until a business transaction changes it and so data

    extraction for preserving the history of the changes in the data warehouse

    gets quite involved.

    6.4.2 Periodic Status

    Here, the value of the attributes is stored as the status every time a change

    occurs. It means the status value is stored with reference to the time. For

    example, the data about an insurance policy is stored as the status data of

    the policy at each point of time. So the history of the changes is preserved in

    the source systems themselves. Thus the data extraction becomes relatively

    easier.
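    The two storage styles can be pictured with a small sketch. Below, a current-value record is overwritten in place (the history is lost at the source), while periodic status keeps every status with its effective date, so the history survives in the source itself. The record layouts are invented for the example:

    # Current value: one record per customer; an update overwrites the old value
    customer = {101: {"name": "Asha", "city": "Pune"}}
    customer[101]["city"] = "Mumbai"   # the earlier value 'Pune' is now gone

    # Periodic status: one record per status change, stamped with its time
    policy_status = [
        {"policy": "P-9", "status": "created",  "as_of": "2009-01-01"},
        {"policy": "P-9", "status": "approved", "as_of": "2009-02-15"},
    ]
    policy_status.append(
        {"policy": "P-9", "status": "lapsed", "as_of": "2009-09-30"})
    # The full history of the policy remains available in the source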

    Self Assessment Question(s) (SAQs)

    For Section 6.4

    1. What are the two ways of storing the data in the source systems?

    6.5 Source Identification

    Source identification is a critical step in the data extraction process. For

    instance, if you intend to design a data warehouse to provide

    strategic information on the fulfillment of orders, you need to store

    information on both fulfilled and pending orders. If you deliver the orders through


    multiple channels, you also need to capture the data about the delivery

    channels.

    If you provide the customers an option to get the status of the order, you are

    required to extract the data on the status of each of these orders. To

    achieve this, you need to maintain the attributes like total order amount,

    discounts, and expected delivery time in the fact table for order fulfillment.

    Also, you need to prepare dimension tables for product, customers, order

    disposition, delivery channel, etc.

    You may take up the following step-wise approach for the source

    identification:

    • Prepare a list of all data items or facts needed in the fact tables

    • Prepare each dimension attribute from all dimensions

    • Find out the source system and source data item for every target data item

    • Identify the preferred source in case there are multiple sources for a specific data item

    • Identify the single source field for multiple target fields and establish splitting rules

    • Ascertain the default values and inspect the source data for missing values
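    The outcome of these steps is essentially a source-to-target mapping. The sketch below shows one simple way to record such a mapping; the systems, fields, and rules are invented for the illustration:

    # Each target element records its preferred source, plus any splitting
    # rule and default value identified during source identification
    source_to_target = {
        "sales_fact.order_amount": {
            "source_system": "ORDER_ENTRY",   # preferred among multiple sources
            "source_field": "ORD_TOTAL",
            "default": 0.0,                   # used when the source value is missing
        },
        # One source field feeding two target fields, with splitting rules
        "customer_dim.first_name": {
            "source_system": "CRM",
            "source_field": "FULL_NAME",
            "rule": "take text before the first space",
        },
        "customer_dim.last_name": {
            "source_system": "CRM",
            "source_field": "FULL_NAME",
            "rule": "take text after the last space",
        },
    }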

    Self Assessment Question(s) (SAQs)

    For Section 6.5

    1. Discuss various steps involved in the source identification process of

    extracting the data.


    6.6 Data Extraction Techniques

    The initial data that you might have moved to the data warehouse to get it

    started is called the initial load. After the initial load, you need to update the

    data warehouse so as to maintain the history of the changes.

    Broadly, there are two types of data extraction from the source operational

    systems. They are 'as is' (or static) data and data of revisions.

    • Static data is the capture of data at a specific point of time. For current data, this capture includes all transient data identified for extraction. In the case of data categorized as periodic, the capture has to include each status at each point of time as available in the source systems. Generally, the static data capture is used for the initial load of the data warehouse, and you may want a full refresh of a dimension table in certain cases. For instance, you may do a full refresh of the product dimension table if the product master of the source application is completely revamped.

    • Data of revisions (also known as incremental data capture) includes the revisions since the last time data was captured. If the source data is transient, the capture of the revisions is a difficult exercise. But for periodic status data, the incremental data capture includes the values of attributes at specific times. Here, you need to extract the status that has been recorded since the last date of extract. This incremental data capture can be divided into immediate data extraction and deferred data extraction.

    6.6.1 Immediate Data Extraction

    Immediate data extraction is real-time data extraction. There are three

    methods of immediate data extraction, as discussed

    below.


    6.6.1.1 Capture through Transaction Logs

    In this method, the transaction logs of the DBMSs are used to capture the

    data. Whenever there is an update to a database table, the DBMS writes

    entries on the log file. The method reads the transaction log and selects all

    the committed transactions. Since logging is already a part of the

    transaction processing, there will not be any additional overheads in this

    method.

    However, you need to ensure extraction of all the transactions before the log

    file gets refreshed. This method works well if all of the source systems are

    database applications. In case of non-database applications, you need to

    use some other data extraction method as there are no log files for these

    applications.
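    The selection logic can be sketched as follows. Real DBMSs expose their logs through product-specific interfaces (log reader utilities, replication APIs, and so on), so a small in-memory list stands in for the transaction log here; the entries are invented:

    # Hypothetical log entries: (txn_id, committed?, table, operation, record key)
    transaction_log = [
        ("t1", "Y", "customer", "UPDATE", 101),
        ("t2", "N", "customer", "INSERT", 102),   # not committed: skipped
        ("t3", "Y", "orders",   "DELETE", 77),
    ]

    # Read the log and select only the committed transactions for extraction
    for txn_id, committed, table, op, key in transaction_log:
        if committed == "Y":
            print(f"extract {op} on {table} for record {key}")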

    6.6.1.2 Capture through Database Triggers

    Database triggers are special stored procedures or programs stored

    on the database and are fired when a pre-defined event occurs. Here, you

    can create the trigger programs for all events for which you need to capture

    the data. The output of these trigger programs is written on a separate file

    and will be used to extract data for the data warehouse. For instance, you

    can write a trigger program to capture all updates and deletes for a specific

    table if you intend to capture all changes that happened to the records of the

    table. But this method is also applicable only if the source operational systems

    are database applications.

    This method is quite reliable as the data capturing occurs right at the source

    and you can capture both before and after images. However, you need to

    put in additional effort up front to build and maintain the trigger programs.

    Also, the execution of trigger procedures during the transaction processing

    leads to additional overhead on the source systems.
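    As an illustration, the sketch below uses SQLite to capture the before and after images of updates into a separate change table, which an extract job could read later. The table and column names are invented for the example:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE customer_changes (        -- capture table read by the extract job
        change_time TEXT DEFAULT CURRENT_TIMESTAMP,
        operation TEXT, id INTEGER, old_city TEXT, new_city TEXT);
    -- The trigger fires on every update and writes both images to the capture table
    CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_changes (operation, id, old_city, new_city)
        VALUES ('UPDATE', OLD.id, OLD.city, NEW.city);
    END;
    """)

    conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Pune')")
    conn.execute("UPDATE customer SET city = 'Mumbai' WHERE id = 1")
    print(conn.execute("SELECT * FROM customer_changes").fetchall())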


    6.6.1.3 Capture in Source Applications

    In this method, the source application is made to assist in the data capture

    for the data warehouse. Here, you need to modify the relevant application

    programs that write to the source files and databases. You may revise the

    programs to record all additions, updates, and deletions on a separate file as they write to the source files and

    database tables; the extract programs can then use this separate file

    containing the changes to the source data.

    The advantage of this method is that the method can be used for all types of

    source data irrespective of the type of the source systems, say, databases,

    indexed files or other flat files. But the difficulty lies with revising the

    programs in the source operational systems and maintaining them properly.

    Also, the method may reduce the performance of the source applications as

    additional processing is required to capture the changes on separate files.

    This method is also referred to as application-assisted data capture.
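    A minimal sketch of the idea, with an invented routine and file name: the revised application routine performs its normal database write and additionally appends the change to a separate file for the extract program to pick up.

    import json
    import sqlite3
    import time

    CHANGE_FILE = "customer_changes.jsonl"   # read later by the extract program

    def update_customer_city(conn, customer_id, new_city):
        # The application's normal write to the source database
        conn.execute("UPDATE customer SET city = ? WHERE id = ?",
                     (new_city, customer_id))
        conn.commit()
        # Application-assisted capture: also record the change on a separate file
        with open(CHANGE_FILE, "a") as f:
            f.write(json.dumps({"ts": time.time(), "op": "UPDATE",
                                "id": customer_id, "new_city": new_city}) + "\n")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Pune')")
    update_customer_city(conn, 1, "Mumbai")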

    6.6.2 Deferred Data Extraction

    All the methods in the immediate data extraction involve the real-time data

    capture. In contrast, these deferred data extraction methods do not capture

    the changes in real time but capture them at a later time.

    6.6.2.1 Capture based on Date and Time Stamp

    In this method, every source record created or updated is marked with a

    stamp that shows the date and time. The data capture occurs at a later time

    after the creation or updating of a source record and the time stamp

    provides the basis for selecting the records for data extraction. However, the

    limitation of the method is that it presumes that all the relevant source

    records contain date and time stamps. Also, the technique works well in

    case the number of revised records is small.
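    A sketch of this capture, assuming, as the method requires, that every source record carries an updated_at time stamp; the table and values are invented:

    import sqlite3

    conn = sqlite3.connect(":memory:")   # stand-in for the source database
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT, updated_at TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Mumbai', '2009-02-10 09:30:00')")

    def extract_changes_since(last_extract_time):
        """Select records created or updated after the previous extract run."""
        return conn.execute(
            "SELECT id, name, city, updated_at FROM customer "
            "WHERE updated_at > ?", (last_extract_time,)).fetchall()

    # The time of each successful run becomes the cut-off for the next run
    print(extract_changes_since("2009-01-31 00:00:00"))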


    6.6.2.2 Capture by Comparing Files

    If none of the above techniques is feasible for extracting the data in

    your environment, then you may consider the following method. According

    to this method, you capture the changes to your source data by comparing

    the current source data with the previous captured data. Though the method

    seems simple, the comparison of the data becomes difficult in case of a

    large file. But this method is the only option for dealing with legacy data

    sources that do not have any transaction logs or time stamps on the source

    records. This method is also called the snapshot differential method as it

    compares two snapshots of the source data.
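    The comparison itself is straightforward once each record has a key. A minimal sketch with two small in-memory snapshots (real snapshots would be large files kept from the previous extract):

    def snapshot_differential(previous, current):
        """Compare two snapshots keyed by record id and classify the changes."""
        inserts = {k: current[k] for k in current.keys() - previous.keys()}
        deletes = {k: previous[k] for k in previous.keys() - current.keys()}
        updates = {k: (previous[k], current[k])
                   for k in previous.keys() & current.keys()
                   if previous[k] != current[k]}
        return inserts, updates, deletes

    prev = {1: ("Asha", "Pune"), 2: ("Ravi", "Delhi")}
    curr = {1: ("Asha", "Mumbai"), 3: ("Mina", "Chennai")}
    print(snapshot_differential(prev, curr))
    # inserts: {3: ...}; updates: {1: (old, new)}; deletes: {2: ...}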

    Self Assessment Question(s) (SAQs)

    For Section 6.6

    1. Differentiate between static data capture and incremental data

    capture for extracting the data from the source systems.

    2. Discuss the various methods involved in the immediate data capture of

    data extraction.

    3. What is deferred data extraction? Describe the methods of deferred

    data extraction.

    6.7 Evaluation of the Techniques

    As discussed in the above section, you may use any of the following

    methods based on their application in your environment:

    A. Capture of static data

    B. Capture through transaction logs

    C. Capture through database triggers

    D. Capture in source applications

    E. Capture based on date and time stamp

    F. Capture by comparing files


    Among all these methods, the methods of using the transaction logs and

    database triggers are readily available in most database products.

    Also, these methods are comparatively easy to implement and economical.

    The technique based on transaction logs is perhaps the least expensive.

    The method of data capture in source systems may be the most expensive

    in terms of deployment and maintenance as it requires substantial revisions

    to the existing source systems. But the method of file comparison is time-

    consuming for the data extraction and so can be used only when none of the

    other methods is applicable.

    The features of various data capture methods and their advantages and

    limitations in implementation are as follows:

    A. Capture of static data

    • Flexible to capture specifications

    • Does not affect the performance of source systems

    • No revisions required to existing applications

    • Can be used on legacy systems and file-oriented systems

    B. Capture through transaction logs

    • Not flexible to capture specifications

    • Does not affect the performance of source systems

    • No revisions to existing applications

    • Can be used on most legacy systems, but cannot be used on file-oriented systems

    C. Capture through database triggers

    • Not flexible to capture specifications

    • Does not seriously affect the performance of source systems

    • No revisions to existing applications

    • Cannot be used on most legacy systems or file-oriented systems


    D. Capture in source applications

    • Flexible to capture specifications

    • Affects the performance of source systems, as additional processing is required

    • Major revisions to existing applications

    • Can be used on most legacy systems and file-oriented systems

    • High internal costs because of in-house work

    E. Capture based on date and time stamp

    • Flexible to capture specifications

    • Does not affect the performance of source systems

    • Likely revisions to existing applications

    • Cannot be used on legacy systems; can be used on file-oriented systems

    F. Capture by comparing files

    • Flexible to capture specifications

    • Does not affect the performance of source systems

    • No revisions to existing applications

    • May be used on legacy systems and file-oriented systems

    Self Assessment Question(s) (SAQs)

    For Section 6.7

    1. List out the advantages and limitations of various data extraction

    methods that are under practice.

    6.8 Summary

    ETL (Extraction, Transformation, and Loading) functions are crucial in building

    a data warehouse. These ETL functions act as the back-end processes and

    the effort spent on performing these functions can be as much as

    50-70% of the total effort of building a data warehouse. These

    functions include the following processes: Data Extraction, Data


    Transformation, Data Integration, Data Cleansing, Data Summarization,

    Initial Data Loading, Meta Data Updating, Ongoing Data Loading. Here, data

    extraction is the first and a crucial process.

    Following are some of the challenges one may face while implementing the

    data extraction process: identification of all source systems, choosing an

    appropriate method of data extraction, preparation of time window,

    determination of the extraction frequency, job sequencing, and exception

    handling. There are two ways of storing the data in the source systems.

    They are current value and periodic status. The current value is the stored

    value of an attribute at that moment of time. Periodic status is the status

    every time a change occurs. So the status value is stored with reference to

    a specific time frame. The type of data extraction method you have to use

    depends on the nature of each of these two categories.

    Broadly, there are two types of data extraction: 'as is' (or static) data and

    data of revisions. Static data is the capture of data at a specific point of

    time. For current data, this capture includes all transient data identified for

    extraction. Data of revisions (also known as incremental data capture)

    includes the revisions since the last time data was captured. There are

    three types of methods in the immediate data extraction process. They are:

    capture through transaction logs, capture through database triggers, and

    capture in source applications. All the methods in the immediate data

    extraction involve the real-time data capture. But there are two methods in

    practice under deferred data extraction, where the data capture is done at a

    later time, not in real time. These methods are capture based on date

    and time stamp and capture by comparing files. But one has to be careful

    in choosing an appropriate method to implement the data extraction process

    and it has to be done after understanding the advantages and limitations of

    these methods.


    6.9 Terminal Questions (TQs)

    1. Give the appropriate reasons why you think ETL functions are most

    challenging in a data warehouse environment.

    2. Briefly describe the major issues you may face in the process of data extraction

    and explain how to deal with those issues.

    3. Suppose you have chosen an ETL tool to perform the extraction,

    transformation, and loading functions. Discuss the activities carried out

    by an ETL tool in executing the above-mentioned functions.

    6.10 Multiple Choice Questions (MCQs)

    1. What do you mean by 'exception handling' in extracting the data?

    a. to determine the ways to handle the input records that cannot be

    extracted

    b. to identify the source systems for extracting the data into a

    warehouse

    c. to arrange the alternative system for a data warehouse to work in

    contingency situations

    d. to re-design the architecture of a data warehouse if the performance

    of the warehouse is not up to the mark

    2. The time period set for repetition of the data extraction from a specific

    source system is called ______.

    a. Extraction period

    b. Extraction level

    c. Extraction frequency

    d. None of the above

    3. The process of data extraction for a data warehouse is complex

    compared to that of an operational system, because ________.

    a. For a data warehouse, you have to extract data from many disparate

    sources


    b. For a data warehouse, you have to extract the data on the changes

    for ongoing incremental loads as well as for a one-time initial full load

    c. Both (a) and (b)

    d. None of the above

    4. Why is the data in a source system said to be time-dependent or

    temporal?

    a. The data in the source system changes with the time

    b. The data in the source system does not change with the time

    c. The attributes of the data in source system changes with the time

    d. The data in the source system gets erased or deleted after a specific

    period of time

    5. Which of the following statements is false?

    a. The history of data records is ignored in the case of OLTP systems

    b. Operational data in the source systems can be categorized into

    'current value' and 'periodic status'

    c. The information required for OLTP systems is drawn from the data in

    the data warehouse

    d. None of the above

    6. If the data capture is not done immediately or not done in real-time, the

    type of extraction is called _________.

    a. Delayed extraction

    b. Deferred extraction

    c. Delimit extraction

    d. Deteriorated extraction

    7. Which of the following is not an option for immediate data extraction?

    a. Capture through transaction logs

    b. Capture in source applications

    c. Capture through the source code

    d. Capture through database triggers


    8. The ETL functions in a data warehouse environment are challenging in

    nature because ________.

    a. Source systems are very diverse and disparate

    b. Many source systems are older legacy applications running on

    obsolete database technologies.

    c. Generally, historical data on changes in values are not preserved

    in source operational systems. Historical information is critical in a

    data warehouse.

    d. All the above

    9. Which of the following represents the duration in which the data can be

    extracted from a specific source system so that the system does not

    get impacted by the extraction process?

    a. Source window

    b. Extraction window

    c. Time window

    d. Impact window

    10. The creation of aggregate datasets on the basis of the pre-determined

    procedures is called ___________.

    a. Data cleansing

    b. Data integration

    c. Data setting

    d. Data summarization

    11. Which of the following is a type of Immediate Data Extraction method to

    capture the data from the source systems?

    a. Capture through Transaction Logs

    b. Capture through Database Triggers

    c. Capture in Source Applications

    d. All the above


    12. Which of the following is a type of Deferred Data Extraction method for

    the data capture?

    a. Capture based on date and time stamp

    b. Capture by comparing files

    c. Both of (a) and (b)

    d. None of the above

    13. Which of the following methods is the least preferred method and is

    implemented only when any of the other methods cannot be

    implemented?

    a. Capture based on date and time stamp

    b. Capture by comparing files

    c. Capture in Source Applications

    d. Capture through Database Triggers

    6.11 Answers to SAQs, TQs, and MCQs

    6.11.1 Answers to Self Assessment Questions (SAQs)

    Section 6.2

    1. ETL stands for Extraction, Transformation and Loading. To implement

    these functions, you may be required to undergo the following

    processes: data extraction, data transformation, data integration, data

    cleansing, data summarization, initial data loading, metadata updating,

    and ongoing data loading. Also write the significance of these ETL

    processes as provided in the Section 6.2.1.

    2. The steps to be followed in the implementation of ETL functions are

    listed in the Figure 6.2. You may list out these steps with a small

    description of each of these steps.


    Section 6.3

    1. Data extraction is the process of retrieving and collecting the data from

    the appropriate source systems. The various issues one may face in the

    data extraction process include identification of all source systems,

    choosing an appropriate method of data extraction, preparation of time

    window, determination of the extraction frequency, job sequencing, and

    exception handling. You may describe these processes as discussed in

    the section 6.3.

    Section 6.4

    1. There are two ways of storing the data in the source systems; current

    value and periodic status. The current value is the stored value of an

    attribute at that moment of time. Periodic status is the status every time

    a change occurs. So the status value is stored with reference to a

    specific time frame. You may describe these methods as discussed in

    the sections 6.4.1 and 6.4.2.

    Section 6.5

    1. The source identification process includes the following steps;

    preparation of a list of all data items or facts needed in fact tables,

    preparation of each dimension attribute from all dimensions, identifying

    the source system and source data item for every target data item,

    identifying the preferred source in case there are multiple sources for a

    specific data item, identifying the single source field for multiple target

    fields and establishing splitting rules, and ascertaining the default values

    and inspecting the source data for missing values.

    Section 6.6

    1. Static data is the capture of data at a specific point of time. For current

    data, this capture includes all transient data identified for extraction.

    Data of revisions (also known as incremental data capture) includes the


    revisions since the last time the data was captured. You may include the

    other points as discussed in the Section 6.6.

    2. There are three types of methods in the immediate data extraction

    process. They are: capture through transaction logs, capture through

    database triggers, and capture in source applications. You can describe

    each of these methods as discussed in the Section 6.6.1.

    3. The deferred data extraction methods do not capture the changes in real

    time. They capture the data at a later time. The methods used in

    deferred data extraction include capture based on date and time stamp and

    capture by comparing files. You can describe each of these methods as

    discussed in the Section 6.6.2.

    Section 6.7

    1. One can adopt any of the following methods for the data extraction:

    Capture of static data, Capture through transaction logs, Capture

    through database triggers, Capture in source applications, Capture

    based on date and time stamp, and Capture by comparing files. The

    advantages and limitations of each of these methods are discussed in

    the Section 6.7.

    6.11.2 Answers to Terminal Questions (TQs)

    1. For the following reasons, you can say that the ETL functions

    are challenging in the context of building a data warehouse:

    • Source systems are very diverse and disparate. You may have to deal with source systems on multiple platforms and different operating systems.

    • Many source systems are legacy applications running on obsolete database technologies. The quality of data is also dubious in many of the old source systems.


    • Source system structures keep changing over time because of new business conditions. Therefore, the ETL functions need to be modified accordingly.

    • Historical data on changes in values is generally not preserved in source operational systems, although such history is critical in a data warehouse.

    • Even when inconsistent data is detected among disparate source systems, the lack of a means for resolving mismatches escalates the problem of inconsistency.

    • Many of the source systems do not represent the data in types or formats that are meaningful to the users.

    2. The major issues involved in the data extraction process are discussed

    in the Section 6.3. Accordingly, you may discuss the issues viz., source

    identification, selection of an appropriate method of data extraction,

    preparation of time window for each of the data source systems,

    determination of the extraction frequency, sequencing of the jobs,

    and exception handling to deal with the missing records during the extraction

    process.

    3. Before the selection of an ETL tool, you need to understand whether the

    tool has the following capabilities:

    • Data extraction from the various relational databases of leading vendors, from old legacy databases, indexed files, and flat files

    • Data transformation from one format to another, with variations in source and target fields

    • Performing of standard conversions, key reformatting, and structural changes

    • Provision of audit trails from source to target

    • Application of business rules for extraction and transformation

    • Combining of several records from the source systems into one integrated target record

    • Recording and management of metadata


    6.11.3 Answers to Multiple Choice Questions (MCQs)

    1. Ans: a

    2. Ans: c

    3. Ans: c

    4. Ans: a

    5. Ans: c

    6. Ans: b

    7. Ans: c

    8. Ans: d

    9. Ans: c

    10. Ans: d

    11. Ans: d

    12. Ans: c

    13. Ans: b