

    Unit 6 Data Extraction

    Structure

    6.1 Introduction

    Objectives

    6.2 ETL Overview

    6.2.1 Significance of ETL Processes

    6.2.2 ETL Requirements and Steps

    Self Assessment Question(s) (SAQs)

    6.3 Overview of the Data Extraction Process

    Self Assessment Question(s) (SAQs)

    6.4 Types of Data in Operational Systems

    6.4.1 Current Value

    6.4.2 Periodic Status

    Self Assessment Question(s) (SAQs)

    6.5 Source Identification

    Self Assessment Question(s) (SAQs)

    6.6 Data Extraction Techniques

    6.6.1 Immediate Data Extraction

    6.6.2 Deferred Data Extraction

    Self Assessment Question(s) (SAQs)

    6.7 Evaluation of the Techniques

    Self Assessment Question(s) (SAQs)

    6.8 Summary

    6.9 Terminal Questions (TQs)

    6.10 Multiple Choice Questions (MCQs)

    6.11 Answers to SAQs, TQs, and MCQs

    6.11.1 Answers to Self Assessment Questions (SAQs)

    6.11.2 Answers to Terminal Questions (TQs)

    6.11.3 Answers to Multiple Choice Questions (MCQs)


    6.1 Introduction

    In this Unit, we discuss the data extraction process that is carried out from

    the source systems into data warehouses or data marts. As we have

    already discussed, data extraction is the first step in the execution of the

    ETL (Extraction, Transformation, and Loading) functions to build a data

    warehouse. This extraction can be done from an OLTP database and non-

    OLTP systems, such as text files, legacy systems, and spreadsheets. The

    data extraction process is complex in nature because of the tremendous

    diversity that exists among the source systems in practice.

    Objectives:

    The objectives of the Unit are to make you understand:

    • Source identification for extraction of the data

    • Various methods being used for data extraction

    • Evaluation of the extraction techniques

    • Exception handling in case some data has not been extracted properly

    6.2 ETL Overview

    Mostly the information contained in a data warehouse comes from the

    operational systems. But, as we all know, the operational systems cannot

    be used to provide strategic information. So you need to carefully

    understand what constitutes the difference between the data in the source

    operational systems and the information in the data warehouse. It is the ETL

    functions that reshape the relevant data from the source systems into useful

    information to be stored in the data warehouse. There would be no strategic

    information in a data warehouse in the absence of these functions.

    6.2.1 Significance of ETL Processes

    The ETL functions act as the back-end processes that cover the extraction

    of the data from the source systems. Also, they include all the functions and

    procedures for changing the source data into the exact formats and

    structures appropriate for storage in the data warehouse database. After the

    transformation of the data, they also include all the processes that


    physically move the data into the data warehouse repository. After capturing

    the information, you cannot simply dump the data into the data warehouse. You

    have to carefully subject the extracted data to all manner of transformations

    so that the data will be fit to convert into information.

    Data Extraction: Extraction from heterogeneous internal and external source systems

    Data Transformation: Conversion and restructuring of data as per the transformation rules

    Data Integration: Combining the entire data on the basis of source-to-target mapping

    Data Cleansing: Scrubbing and enriching as per the cleansing rules

    Data Summarization: Creating aggregate datasets on the basis of predefined procedures

    Initial Data Loading: Applying initial data in large volumes to the warehouse

    Meta Data Updating: Maintaining and using metadata for ETL functions

    Ongoing Data Loading: Applying ongoing incremental loads and periodic refreshes to the warehouse

    Fig 6.1: ETL Processes Summary


    Let us try to understand the significance of the ETL function by taking an

    example. For instance, suppose you want to analyze and compare sales by

    store, product, and month. But the sales data is available in various

    applications of your organization. Therefore, you have to bring the entire

    sales details into the data warehouse database. You can do this by providing

    the sales and price in a fact table, the products in a product dimension table,

    the stores in a store dimension table, and the months in a time dimension table.

    To do this, you need to extract the data from the respective source systems,

    reconcile the variations in the data representations among the source

    systems, transform the entire sales details, and load the sales data into the fact and

    dimension tables. Thus the execution of ETL functions is challenging

    because of the nature of the source systems.
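    To make this flow concrete, here is a minimal sketch in Python, using SQLite as a stand-in for the warehouse database. The table names, column names, and data values are invented for the illustration, not taken from the text:

    import sqlite3

    dw = sqlite3.connect(":memory:")  # stand-in for the data warehouse database
    dw.executescript("""
    CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_name TEXT UNIQUE);
    CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name   TEXT UNIQUE);
    CREATE TABLE time_dim    (time_key    INTEGER PRIMARY KEY, month        TEXT UNIQUE);
    CREATE TABLE sales_fact  (product_key INT, store_key INT, time_key INT, sales REAL);
    """)

    def dim_key(table, col, value):
        """Look up, or create, the surrogate key for a dimension value."""
        dw.execute(f"INSERT OR IGNORE INTO {table} ({col}) VALUES (?)", (value,))
        return dw.execute(f"SELECT rowid FROM {table} WHERE {col} = ?", (value,)).fetchone()[0]

    # Sales rows already extracted and reconciled from the disparate source systems
    for product, store, month, sales in [("Widget", "Store A", "2009-01", 120.0),
                                         ("Widget", "Store B", "2009-01", 80.0)]:
        dw.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
                   (dim_key("product_dim", "product_name", product),
                    dim_key("store_dim", "store_name", store),
                    dim_key("time_dim", "month", month),
                    sales))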

    Also, the effort spent on performing the ETL functions can be as

    much as 50-70% of the total effort of building a data warehouse.

    To extract the data, you have to know the time window during each day to

    extract the data from a specific source system without impacting the usage

    of that system. Also, you need to determine the mechanism for capturing the

    changes in the data in each of the relevant systems. Apart from the ETL

    functions, the building of a data warehouse includes the functions like data

    integration, data summarization and metadata updating. Figure 6.1 details

    the processes involved to execute the ETL functions in building a data

    warehouse.

    6.2.2 ETL Requirements and Steps

    Ideally, you should follow the steps provided in

    Fig. 6.2 for the execution of the ETL functions.


    1. To determine the target data of the data warehouse

    2. To identify the internal and external data sources

    3. To map the sources with the target data elements

    4. To establish comprehensive data extraction rules

    5. To prepare data transformation and cleansing rules

    6. To plan for aggregate tables

    7. To organize the data staging area and test tools

    8. To write the procedures for all data loads

    9. To execute ETL functions for dimension tables

    10. To execute ETL functions for fact tables

    Fig 6.2: Steps involved in executing the ETL functions

    Self Assessment Question(s) (SAQs)

    For Section 6.2

    1. Mention the ETL functions and explain how they are important in

    building a data warehouse.

    2. The implementation of ETL function includes various steps from

    determining the source systems to preparation of fact and dimension

    tables. List out the major steps involved in the ETL process in a

    sequence.

    6.3 Overview of the Data Extraction Process

    In general, data extraction is performed within the source system itself,

    especially if it is a relational database to which the extraction procedures

    can easily be added. Also, it is possible for the extraction logic to exist in the



    data warehouse staging area and query the source system for data using

    ODBC (Open Database Connectivity), OLE DB (Object Linking and

    Embedding, Database), or other APIs (Application Programming Interfaces).

    The most common method of data extraction for the legacy systems is to

    produce text files, although many newer systems offer direct query APIs or

    accommodate access through ODBC or OLE DB.
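    For example, extraction logic running in the staging area could query a relational source through ODBC. A minimal sketch using the pyodbc library follows; the DSN, credentials, and table are hypothetical:

    import pyodbc

    # Connect to the source system through an ODBC data source name (DSN)
    conn = pyodbc.connect("DSN=SourceSystem;UID=etl_user;PWD=secret")
    cursor = conn.cursor()

    # Pull the rows identified for extraction; in practice the results
    # would be written to a staging file or staging table, not printed
    cursor.execute("SELECT customer_id, name, city FROM customers")
    for row in cursor.fetchall():
        print(row)

    conn.close()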

    Data extraction in the context of data warehousing is different from that of

    building an operational system. For a data warehouse, you have to extract

    data from many disparate sources. Also, you have to extract data on the

    changes for ongoing incremental loads as well as for a one-time initial full

    load. But for operational systems, all you need is just one-time extractions

    and data conversions. These two factors increase the complexity of data

    extraction for a data warehouse and so an effective data extraction is a key

    to the success of your data warehouse. Therefore, you need to pay special

    attention to the issues and formulate an exclusive data extraction strategy

    for your data warehouse.

    Some of the important issues that you need to take care of while extracting the

    data from various source systems are as follows:

    • Source identification, which involves identification of the source applications and source structures

    • Method of data extraction from each of the source systems, and to define whether the extraction process is to be carried out manually or by using a tool

    • Preparation of a time window for each data source to denote the time for the data extraction

    • Determination of the extraction frequency for each data source, in order to establish how frequently the data extraction must be done (daily, weekly, quarterly, etc.)


    • Job sequencing, to determine whether the beginning of one job in an extraction job stream has to wait until the previous job has finished successfully

    • Exception handling, to determine an approach to handle input records that cannot be extracted

    Self Assessment Question(s) (SAQs)

    For Section 6.3

    1. What do you mean by Data Extraction? Discuss various issues involved

    while extracting the data in building a data warehouse.

    6.4 Types of Data in Operational Systems

    As the data in the data warehouse is extracted from the operational system,

    we now discuss the characteristics of the data in an operational system. The

    data in the source operational systems is said to be time-dependent or

    temporal, as the source data changes with time. As business transactions

    keep changing the data in the source systems, the value of an attribute in a

    source system is the value of that attribute at the current time. For instance,

    when a customer moves to another location, the data about customer

    address changes in the customer table in the source system.

    Similarly, the product data gets changed in the source system when a new

    feature has been added to the existing product line. In case of change of

    address, the current address of the customer is important in the operational

    systems. But in the data warehouse the entire history is to be maintained. If

    you want to make sales analysis based on the location for a given time

    interval, the sales of the customer prior to the change of address are taken for

    the previous period, and the current address is considered for the current

    period.


    Now we discuss how the data is stored in the source systems. There are

    two ways of storing the data. The type of data extraction method you have

    to use depends on the nature of each of these two categories.

    6.4.1 Current Value

    The current value is the stored value of an attribute at that moment of time.

    These values are transient and change as and when business transactions

    happen. You cannot forecast when the value changes next. For instance,

    the current balance amount in a bank account and the stock price of a

    company are some examples of this category. Thus the value of an attribute

    remains constant until a business transaction changes it and so data

    extraction for preserving the history of the changes in the data warehouse

    gets quite involved.

    6.4.2 Periodic Status

    Here, the value of the attributes is stored as the status every time a change

    occurs. It means the status value is stored with reference to the time. For

    example, the data about an insurance policy is stored as the status data of

    the policy at each point of time. So the history of the changes is preserved in

    the source systems themselves. Thus the data extraction becomes relatively

    easier.
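    The two storage styles can be pictured with a small sketch. Below, a current-value record is overwritten in place (the history is lost at the source), while periodic status keeps every status with its effective date, so the history survives in the source itself. The record layouts are invented for the example:

    # Current value: one record per customer; an update overwrites the old value
    customer = {101: {"name": "Asha", "city": "Pune"}}
    customer[101]["city"] = "Mumbai"   # the earlier value 'Pune' is now gone

    # Periodic status: one record per status change, stamped with its time
    policy_status = [
        {"policy": "P-9", "status": "created",  "as_of": "2009-01-01"},
        {"policy": "P-9", "status": "approved", "as_of": "2009-02-15"},
    ]
    policy_status.append(
        {"policy": "P-9", "status": "lapsed", "as_of": "2009-09-30"})
    # The full history of the policy remains available in the source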

    Self Assessment Question(s) (SAQs)

    For Section 6.4

    1. What are the two ways of storing the data in the source systems?

    6.5 Source Identification

    Source identification is a critical step in the data extraction process. For

    instance, if you intend to design a data warehouse to provide

    strategic information on the fulfillment of orders, you need to store

    information on both fulfilled and pending orders. If you deliver the orders through


    multiple channels, you also need to capture the data about the delivery

    channels.

    If you provide the customers an option to get the status of the order, you are

    required to extract the data on the status of each of these orders. To

    achieve this, you need to maintain the attributes like total order amount,

    discounts, and expected delivery time in the fact table for order fulfillment.

    Also, you need to prepare dimension tables for product, customers, order

    disposition, delivery channel, etc.

    You may take up the following step-wise approach for the source

    identification:

    • Prepare a list of all data items or facts needed in the fact tables

    • Prepare each dimension attribute from all dimensions

    • Find out the source system and source data item for every target data item

    • Identify the preferred source in case there are multiple sources for a specific data item

    • Identify the single source field for multiple target fields and establish splitting rules

    • Ascertain the default values and inspect the source data for missing values
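    The outcome of these steps is essentially a source-to-target mapping. The sketch below shows one simple way to record such a mapping; the systems, fields, and rules are invented for the illustration:

    # Each target element records its preferred source, plus any splitting
    # rule and default value identified during source identification
    source_to_target = {
        "sales_fact.order_amount": {
            "source_system": "ORDER_ENTRY",   # preferred among multiple sources
            "source_field": "ORD_TOTAL",
            "default": 0.0,                   # used when the source value is missing
        },
        # One source field feeding two target fields, with splitting rules
        "customer_dim.first_name": {
            "source_system": "CRM",
            "source_field": "FULL_NAME",
            "rule": "take text before the first space",
        },
        "customer_dim.last_name": {
            "source_system": "CRM",
            "source_field": "FULL_NAME",
            "rule": "take text after the last space",
        },
    }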

    Self Assessment Question(s) (SAQs)

    For Section 6.5

    1. Discuss various steps involved in the source identification process of

    extracting the data.


    6.6 Data Extraction Techniques

    The initial data that you might have moved to the data warehouse to get it

    started is called the initial load. After the initial load, you need to update the

    data warehouse so as to maintain the history of the changes.

    Broadly, there are two types of data extraction from the source operational

    systems. They are 'as is' (or static) data and data of revisions.

    • Static data is the capture of data at a specific point of time. For current data, this capture includes all transient data identified for extraction. In the case of data categorized as periodic, the capture has to include each status at each point of time as available in the source systems. Generally, the static data capture is used for the initial load of the data warehouse, and you may want a full refresh of a dimension table in certain cases. For instance, you may do a full refresh of the product dimension table if the product master of the source application is completely revamped.

    • Data of revisions (also known as incremental data capture) includes the revisions since the last time data was captured. If the source data is transient, the capture of the revisions is a difficult exercise. But for periodic status data, the incremental data capture includes the values of attributes at specific times. Here, you need to extract the status that has been recorded since the last date of extract. This incremental data capture can be divided into immediate data extraction and deferred data extraction.

    6.6.1 Immediate Data Extraction

    Immediate data extraction is real-time data extraction. There are three

    methods of immediate data extraction, as discussed

    below.


    6.6.1.1 Capture through Transaction Logs

    In this method, the transaction logs of the DBMSs are used to capture the

    data. Whenever there is an update to a database table, the DBMS writes

    entries on the log file. The method reads the transaction log and selects all

    the committed transactions. Since logging is already a part of the

    transaction processing, there will not be any additional overheads in this

    method.

    However, you need to ensure extraction of all the transactions before the log

    file gets refreshed. This method works well if all of the source systems are

    database applications. In case of non-database applications, you need to

    use some other data extraction method as there are no log files for these

    applications.
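    The selection logic can be sketched as follows. Real DBMSs expose their logs through product-specific interfaces (log reader utilities, replication APIs, and so on), so a small in-memory list stands in for the transaction log here; the entries are invented:

    # Hypothetical log entries: (txn_id, committed?, table, operation, record key)
    transaction_log = [
        ("t1", "Y", "customer", "UPDATE", 101),
        ("t2", "N", "customer", "INSERT", 102),   # not committed: skipped
        ("t3", "Y", "orders",   "DELETE", 77),
    ]

    # Read the log and select only the committed transactions for extraction
    for txn_id, committed, table, op, key in transaction_log:
        if committed == "Y":
            print(f"extract {op} on {table} for record {key}")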

    6.6.1.2 Capture through Database Triggers

    Database triggers are special stored procedures or programs stored

    on the database and are fired when a pre-defined event occurs. Here, you

    can create the trigger programs for all events for which you need to capture

    the data. The output of these trigger programs is written on a separate file

    and will be used to extract data for the data warehouse. For instance, you

    can write a trigger program to capture all updates and deletes for a specific

    table if you intend to capture all changes that happened to the records of the

    table. But this method is also applicable only if the source operational systems

    are database applications.

    This method is quite reliable as the data capturing occurs right at the source

    and you can capture both before and after images. However, you need to

    put in additional effort up front to build and maintain the trigger programs.

    Also, the execution of trigger procedures during the transaction processing

    leads to additional overhead on the source systems.
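    As an illustration, the sketch below uses SQLite to capture the before and after images of updates into a separate change table, which an extract job could read later. The table and column names are invented for the example:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE customer_changes (        -- capture table read by the extract job
        change_time TEXT DEFAULT CURRENT_TIMESTAMP,
        operation TEXT, id INTEGER, old_city TEXT, new_city TEXT);
    -- The trigger fires on every update and writes both images to the capture table
    CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_changes (operation, id, old_city, new_city)
        VALUES ('UPDATE', OLD.id, OLD.city, NEW.city);
    END;
    """)

    conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Pune')")
    conn.execute("UPDATE customer SET city = 'Mumbai' WHERE id = 1")
    print(conn.execute("SELECT * FROM customer_changes").fetchall())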


    6.6.1.3 Capture in Source Applications

    In this method, the source application is made to assist in the data capture

    for the data warehouse. Here, you need to modify the relevant application

    programs that write to the source files and databases. You may revise the

    programs to record all additions, updates, and deletions on a separate file as they write to the source files and

    database tables; the extract programs can then use this separate file

    containing the changes to the source data.

    The advantage of this method is that the method can be used for all types of

    source data irrespective of the type of the source systems, say, databases,

    indexed files or other flat files. But the difficulty lies with revising the

    programs in the source operational systems and maintaining them properly.

    Also, the method may reduce the performance of the source applications as

    additional processing is required to capture the changes on separate files.

    This method is also referred to as application-assisted data capture.
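    A minimal sketch of the idea, with an invented routine and file name: the revised application routine performs its normal database write and additionally appends the change to a separate file for the extract program to pick up.

    import json
    import sqlite3
    import time

    CHANGE_FILE = "customer_changes.jsonl"   # read later by the extract program

    def update_customer_city(conn, customer_id, new_city):
        # The application's normal write to the source database
        conn.execute("UPDATE customer SET city = ? WHERE id = ?",
                     (new_city, customer_id))
        conn.commit()
        # Application-assisted capture: also record the change on a separate file
        with open(CHANGE_FILE, "a") as f:
            f.write(json.dumps({"ts": time.time(), "op": "UPDATE",
                                "id": customer_id, "new_city": new_city}) + "\n")

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Pune')")
    update_customer_city(conn, 1, "Mumbai")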

    6.6.2 Deferred Data Extraction

    All the methods in the immediate data extraction involve the real-time data

    capture. In contrast, these deferred data extraction methods do not capture

    the changes in real time but capture them at a later time.

    6.6.2.1 Capture based on Date and Time Stamp

    In this method, every source record created or updated is marked with a

    stamp that shows the date and time. The data capture occurs at a later time

    after the creation or updating of a source record and the time stamp

    provides the basis for selecting the records for data extraction. However, the

    limitation of the method is that it presumes that all the relevant source

    records contain date and time stamps. Also, the technique works well in

    case the number of revised records is small.
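    A sketch of this capture, assuming, as the method requires, that every source record carries an updated_at time stamp; the table and values are invented:

    import sqlite3

    conn = sqlite3.connect(":memory:")   # stand-in for the source database
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, city TEXT, updated_at TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Asha', 'Mumbai', '2009-02-10 09:30:00')")

    def extract_changes_since(last_extract_time):
        """Select records created or updated after the previous extract run."""
        return conn.execute(
            "SELECT id, name, city, updated_at FROM customer "
            "WHERE updated_at > ?", (last_extract_time,)).fetchall()

    # The time of each successful run becomes the cut-off for the next run
    print(extract_changes_since("2009-01-31 00:00:00"))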


    6.6.2.2 Capture by Comparing Files

    If none of the above techniques is feasible for extracting the data in

    your environment, then you may consider the following method. According

    to this method, you capture the changes to your source data by comparing

    the current source data with the previous captured data. Though the method

    seems simple, the comparison of the data becomes difficult in case of a

    large file. But this method is the only option for dealing with legacy data

    sources that do not have any transaction logs or time stamps on the source

    records. This method is also called the snapshot differential method as it

    compares two snapshots of the source data.
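    The comparison itself is straightforward once each record has a key. A minimal sketch with two small in-memory snapshots (real snapshots would be large files kept from the previous extract):

    def snapshot_differential(previous, current):
        """Compare two snapshots keyed by record id and classify the changes."""
        inserts = {k: current[k] for k in current.keys() - previous.keys()}
        deletes = {k: previous[k] for k in previous.keys() - current.keys()}
        updates = {k: (previous[k], current[k])
                   for k in previous.keys() & current.keys()
                   if previous[k] != current[k]}
        return inserts, updates, deletes

    prev = {1: ("Asha", "Pune"), 2: ("Ravi", "Delhi")}
    curr = {1: ("Asha", "Mumbai"), 3: ("Mina", "Chennai")}
    print(snapshot_differential(prev, curr))
    # inserts: {3: ...}; updates: {1: (old, new)}; deletes: {2: ...}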

    Self Assessment Question(s) (SAQs)

    For Section 6.6

    1. Differentiate between static data capture and incremental data

    capture for extracting the data from the source systems.

    2. Discuss the various methods involved in the immediate data capture of

    data extraction.

    3. What is deferred data extraction? Describe the methods of deferred

    data extraction.

    6.7 Evaluation of the Techniques

    As discussed in the above section, you may use any of the following

    methods based on their application in your environment:

    A. Capture of static data

    B. Capture through transaction logs

    C. Capture through database triggers

    D. Capture in source applications

    E. Capture based on date and time stamp

    F. Capture by comparing files


    Among all these methods, the methods of using the transaction logs and

    database triggers are readily available in most database products.

    Also, these methods are comparatively easy to implement and economical.

    The technique based on transaction logs is perhaps the least expensive.

    The method of data capture in source systems may be the most expensive

    in terms of deployment and maintenance as it requires substantial revisions

    to the existing source systems. But the method of file comparison is time-

    consuming for the data extraction and so can be used only when none of the

    other methods is applicable.

    The features of various data capture methods and their advantages and

    limitations in implementation are as follows:

    A. Capture of static data

    • Flexible to capture specifications

    • Does not affect the performance of source systems

    • No revisions required to existing applications

    • Can be used on legacy systems and file-oriented systems

    B. Capture through transaction logs

    • Not flexible to capture specifications

    • Does not affect the performance of source systems

    • No revisions to existing applications

    • Can be used on most legacy systems, but cannot be used on file-oriented systems

    C. Capture through database triggers

    • Not flexible to capture specifications

    • Does not seriously affect the performance of source systems

    • No revisions to existing applications

    • Cannot be used on most legacy systems or file-oriented systems


    D. Capture in source applications

    • Flexible to capture specifications

    • Affects the performance of source systems, as additional processing is required

    • Major revisions to existing applications

    • Can be used on most legacy systems and file-oriented systems

    • High internal costs because of in-house work

    E. Capture based on date and time stamp

    • Flexible to capture specifications

    • Does not affect the performance of source systems

    • Likely revisions to existing applications

    • Cannot be used on legacy systems; can be used on file-oriented systems

    F. Capture by comparing files

    • Flexible to capture specifications

    • Does not affect the performance of source systems

    • No revisions to existing applications

    • May be used on legacy systems and file-oriented systems

    Self Assessment Question(s) (SAQs)

    For Section 6.7

    1. List out the advantages and limitations of various data extraction

    methods that are under practice.

    6.8 Summary

    ETL (Extraction, Transformation, and Loading) functions are crucial in building

    a data warehouse. These ETL functions act as the back-end processes and

    the effort spent on performing these functions can be as much as

    50-70% of the total effort of building a data warehouse. These

    functions include the following processes: Data Extraction, Data


    Transformation, Data Integration, Data Cleansing, Data Summarization,

    Initial Data Loading, Meta Data Updating, Ongoing Data Loading. Here, data

    extraction is the first and a crucial process.

    Following are some of the challenges one may face while implementing the

    data extraction process: identification of all source systems, choosing an

    appropriate method of data extraction, preparation of time window,

    determination of the extraction frequency, job sequencing, and exception

    handling. There are two ways of storing the data in the source systems.

    They are current value and periodic status. The current value is the stored

    value of an attribute at that moment of time. Periodic status is the status

    every time a change occurs. So the status value is stored with reference to

    a specific time frame. The type of data extraction method you have to use

    depends on the nature of each of these two categories.

    Broadly, there are two types of data extraction: 'as is' (or static) data and

    data of revisions. Static data is the capture of data at a specific point of

    time. For current data, this capture includes all transient data identified for

    extraction. Data of revisions (also known as incremental data capture)

    includes the revisions since the last time data was captured. There are

    three types of methods in the immediate data extraction process. They are:

    capture through transaction logs, capture through database triggers, and

    capture in source applications. All the methods in the immediate data

    extraction involve the real-time data capture. But there are two methods in

    practice under deferred data extraction, where the data capture is done at a

    later time, not in real time. These methods are capture based on date

    and time stamp and capture by comparing files. But one has to be careful

    in choosing an appropriate method to implement the data extraction process

    and it has to be done after understanding the advantages and limitations of

    these methods.


    6.9 Terminal Questions (TQs)

    1. Give the appropriate reasons why you think ETL functions are most

    challenging in a data warehouse environment.

    2. Briefly describe the major issues you may face in the process of data extraction

    and explain how to deal with those issues.

    3. Suppose you have chosen an ETL tool to perform the extraction,

    transformation, and loading functions. Discuss the activities carried out

    by an ETL tool in executing the above-mentioned functions.

    6.10 Multiple Choice Questions (MCQs)

    1. What do you mean by 'exception handling' in extracting the data?

    a. to determine the ways to handle the input records that cannot be

    extracted

    b. to identify the source systems for extracting the data into a

    warehouse

    c. to arrange the alternative system for a data warehouse to work in

    contingency situations

    d. to re-design the architecture of a data warehouse if the performance

    of the warehouse is not up to the mark

    2. The time period set for repetition of the data extraction from a specific

    source system is called ______.

    a. Extraction period

    b. Extraction level

    c. Extraction frequency

    d. None of the above

    3. The process of data extraction for a data warehouse is complex

    compared to that of an operational system, because ________.

    a. For a data warehouse, you have to extract data from many disparate

    sources


    b. For a data warehouse, you have to extract the data on the changes

    for ongoing incremental loads as well as for a one-time initial full load

    c. Both (a) and (b)

    d. None of the above

    4. Why is the data in a source system said to be time-dependent or

    temporal?

    a. The data in the source system changes with the time

    b. The data in the source system does not change with the time

    c. The attributes of the data in source system changes with the time

    d. The data in the source system gets erased or deleted after a specific

    period of time

    5. Which of the following statements is false?

    a. The history of data records is ignored in the case of OLTP systems

    b. Operational data in the source systems can be categorized into

    'current value' and 'periodic status'

    c. The information required for OLTP systems is drawn from the data in

    the data warehouse

    d. None of the above

    6. If the data capture is not done immediately or not done in real-time, the

    type of extraction is called _________.

    a. Delayed extraction

    b. Deferred extraction

    c. Delimit extraction

    d. Deteriorated extraction

    7. Which of the following is not an option for immediate data extraction?

    a. Capture through transaction logs

    b. Capture in source applications

    c. Capture through the source code

    d. Capture through database triggers


    8. The ETL functions in a data warehouse environment are challenging in

    nature because ________.

    a. Source systems are very diverse and disparate

    b. Many source systems are older legacy applications running on

    obsolete database technologies.

    c. Generally, historical data on changes in values are not preserved

    in source operational systems. Historical information is critical in a

    data warehouse.

    d. All the above

    9. Which of the following represents the duration in which the data can be

    extracted from a specific source system so that the system does not

    get impacted by the extraction process?

    a. Source window

    b. Extraction window

    c. Time window

    d. Impact window

    10. The creation of aggregate datasets on the basis of the pre-determined

    procedures is called ___________.

    a. Data cleansing

    b. Data integration

    c. Data setting

    d. Data summarization

    11. Which of the following is a type of Immediate Data Extraction method to

    capture the data from the source systems?

    a. Capture through Transaction Logs

    b. Capture through Database Triggers

    c. Capture in Source Applications

    d. All the above


    12. Which of the following is a type of Deferred Data Extraction method for

    the data capture?

    a. Capture based on date and time stamp

    b. Capture by comparing files

    c. Both of (a) and (b)

    d. None of the above

    13. Which of the following methods is the least preferred method and is

    implemented only when any of the other methods cannot be

    implemented?

    a. Capture based on date and time stamp

    b. Capture by comparing files

    c. Capture in Source Applications

    d. Capture through Database Triggers

    6.11 Answers to SAQs, TQs, and MCQs

    6.11.1 Answers to Self Assessment Questions (SAQs)

    Section 6.2

    1. ETL stands for Extraction, Transformation and Loading. To implement

    these functions, you may be required to undergo the following

    processes: data extraction, data transformation, data integration, data

    cleansing, data summarization, initial data loading, metadata updating,

    and ongoing data loading. Also write the significance of these ETL

    processes as provided in the Section 6.2.1.

    2. The steps to be followed in the implementation of ETL functions are

    listed in the Figure 6.2. You may list out these steps with a small

    description of each of these steps.


    Section 6.3

    1. Data extraction is the process of retrieving and collecting the data from

    the appropriate source systems. The various issues one may face in the

    data extraction process include identification of all source systems,

    choosing an appropriate method of data extraction, preparation of time

    window, determination of the extraction frequency, job sequencing, and

    exception handling. You may describe these processes as discussed in

    the section 6.3.

    Section 6.4

    1. There are two ways of storing the data in the source systems; current

    value and periodic status. The current value is the stored value of an

    attribute at that moment of time. Periodic status is the status every time

    a change occurs. So the status value is stored with reference to a

    specific time frame. You may describe these methods as discussed in

    the sections 6.4.1 and 6.4.2.

    Section 6.5

    1. The source identification process includes the following steps;

    preparation of a list of all data items or facts needed in fact tables,

    preparation of each dimension attribute from all dimensions, identifying

    the source system and source data item for every target data item,

    identifying the preferred source in case there are multiple sources for a

    specific data item, identifying the single source field for multiple target

    fields and establishing splitting rules, and ascertaining the default values

    and inspecting the source data for missing values.

    Section 6.6

    1. Static data is the capture of data at a specific point of time. For current

    data, this capture includes all transient data identified for extraction.

    Data of revisions (also known as incremental data capture) includes the


    revisions since the last time the data was captured. You may include the

    other points as discussed in the Section 6.6.

    2. There are three types of methods in the immediate data extraction

    process. They are: capture through transaction logs, capture through

    database triggers, and capture in source applications. You can describe

    each of these methods as discussed in the Section 6.6.1.

    3. The deferred data extraction methods do not capture the changes in real

    time. They capture the data at a later time. The methods used in

    deferred data extraction include capture based on date and time stamp and

    capture by comparing files. You can describe each of these methods as

    discussed in the Section 6.6.2.

    Section 6.7

    1. One can adopt any of the following methods for the data extraction:

    Capture of static data, Capture through transaction logs, Capture

    through database triggers, Capture in source applications, Capture

    based on date and time stamp, and Capture by comparing files. The

    advantages and limitations of each of these methods are discussed in

    the Section 6.7.

    6.11.2 Answers to Terminal Questions (TQs)

    1. For the following reasons, you can say that the ETL functions

    are challenging in the context of building a data warehouse:

    • Source systems are very diverse and disparate. You may have to deal with source systems on multiple platforms and different operating systems.

    • Many source systems are legacy applications running on obsolete database technologies. The quality of data is also dubious in many of the old source systems.


    • Source system structures keep changing over time because of new business conditions. Therefore, the ETL functions need to be modified accordingly.

    • Historical data on changes in values is generally not preserved in source operational systems, although such history is critical in a data warehouse.

    • Even when inconsistent data is detected among disparate source systems, the lack of a means for resolving mismatches escalates the problem of inconsistency.

    • Many of the source systems do not represent the data in types or formats that are meaningful to the users.

    2. The major issues involved in the data extraction process are discussed

    in the Section 6.3. Accordingly, you may discuss the issues viz., source

    identification, selection of an appropriate method of data extraction,

    preparation of time window for each of the data source systems,

    determination of the extraction frequency, sequencing of the jobs,

    and exception handling to deal with the missing records during the extraction

    process.

    3. Before the selection of an ETL tool, you need to understand whether the

    tool has the following capabilities:

    • Data extraction from the various relational databases of leading vendors, from old legacy databases, indexed files, and flat files

    • Data transformation from one format to another, with variations in source and target fields

    • Performing of standard conversions, key reformatting, and structural changes

    • Provision of audit trails from source to target

    • Application of business rules for extraction and transformation

    • Combining of several records from the source systems into one integrated target record

    • Recording and management of metadata


    6.11.3 Answers to Multiple Choice Questions (MCQs)

    1. Ans: a

    2. Ans: c

    3. Ans: c

    4. Ans: a

    5. Ans: c

    6. Ans: b

    7. Ans: c

    8. Ans: d

    9. Ans: c

    10. Ans: d

    11. Ans: d

    12. Ans: c

    13. Ans: b