Data Warehousing & Mining - onkarsule · 2012-08-22


Data Warehousing & Mining

Data Warehouse Architecture:

Architecture, in the context of an organization's data warehousing efforts, is a conceptualization of how

the data warehouse is built. There is no right or wrong architecture. The worthiness of the architecture

can be judged in how the conceptualization aids in the building, maintenance, and usage of the data

warehouse. One possible simple conceptualization of a data warehouse architecture consists of the following interconnected layers:

Operational database layer

The source data for the data warehouse. An organization's ERP systems fall into this layer.

Informational access layer

The data accessed for reporting and analysis, and the tools used for reporting and analyzing data. BI tools fall into this layer, and the Inmon-Kimball differences about design methodology have to do with this layer.

Data access layer

The interface between the operational and informational access layers. Tools to extract, transform, and load (ETL) data into the warehouse fall into this layer.

Metadata layer

The data directory. This is usually more detailed than an operational system's data directory. There are dictionaries for the entire warehouse and sometimes dictionaries for the data that can be accessed by a particular reporting and analysis tool.

Normalized versus dimensional approach for storage of data

There are two leading approaches to storing data in a data warehouse - the dimensional approach and

the normalized approach.

In the dimensional approach, transaction data are partitioned into "facts", which are generally

numeric transaction data, and "dimensions", which are the reference information that gives context to

the facts. For example, a sales transaction can be broken up into facts such as the number of products

ordered and the price paid for the products, and into dimensions such as order date, customer name,

product number, order ship-to and bill-to locations, and salesperson responsible for receiving the order.
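
As a rough illustration of the split described above, the following Python sketch separates one sales transaction into its facts and its dimensions. The class names, field names, and the sample record are hypothetical and chosen only to mirror the example in the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple


@dataclass(frozen=True)
class SaleFacts:
    """Numeric measurements of the transaction."""
    units_ordered: int
    price_paid: float


@dataclass(frozen=True)
class SaleDimensions:
    """Reference information that gives the facts their context."""
    order_date: date
    customer_name: str
    product_number: str
    ship_to: str
    bill_to: str
    salesperson: str


def split_transaction(txn: dict) -> Tuple[SaleFacts, SaleDimensions]:
    """Partition a raw transaction record into facts and dimensions."""
    facts = SaleFacts(txn["units_ordered"], txn["price_paid"])
    dims = SaleDimensions(
        txn["order_date"], txn["customer_name"], txn["product_number"],
        txn["ship_to"], txn["bill_to"], txn["salesperson"],
    )
    return facts, dims


# Hypothetical sample record, not taken from any real system.
sample = {
    "units_ordered": 3,
    "price_paid": 59.97,
    "order_date": date(2012, 8, 22),
    "customer_name": "A. Customer",
    "product_number": "P-100",
    "ship_to": "Mumbai",
    "bill_to": "Pune",
    "salesperson": "S. Seller",
}
facts, dims = split_transaction(sample)
```

In a warehouse the dimension values would normally be stored once in dimension tables and referenced from the fact row by key; the sketch only shows which attributes fall on which side of the split.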

A key advantage of a dimensional approach is that the data warehouse is easier for the user to

understand and to use. Also, the retrieval of data from the data warehouse tends to operate very quickly. The main disadvantages of the dimensional approach are:

1) In order to maintain the integrity of facts and dimensions, loading the data warehouse with data

from different operational systems is complicated, and

2) It is difficult to modify the data warehouse structure if the organization adopting the dimensional

approach changes the way in which it does business.

In the normalized approach, the data in the data warehouse are stored following, to a degree, Codd's

normalization rules. Tables are grouped together by subject areas that reflect general data categories

(e.g., data on customers, products, and finance). The main advantage of this approach is that it is

straightforward to add information into the database. A disadvantage of this approach is that, because

of the number of tables involved, it can be difficult for users both to

1) join data from different sources into meaningful information and then

2) access the information without a precise understanding of the sources of data and of the data

structure of the data warehouse.

These approaches are not exact opposites of each other. Dimensional approaches can involve normalizing data to a degree.

Datawarehousing & Mining - www.viplavkambli.com

Evolution in organizational use of data warehouses

Organizations generally start off with relatively simple use of data warehousing. Over time, more

sophisticated use of data warehousing evolves. The following general stages of use of the data

warehouse can be distinguished:

Offline Operational Databases

Data warehouses in this initial stage are developed by simply copying the data of an operational system to another server, where the processing load of reporting against the copied data does not impact the operational system's performance.

Offline Data Warehouse

Data warehouses at this stage are updated from data in the operational systems on a regular basis and

the data warehouse data is stored in a data structure designed to facilitate reporting.

Real Time Data Warehouse

Data warehouses at this stage are updated every time an operational system performs a transaction

(e.g., an order, a delivery, or a booking).

Integrated Data Warehouse

Data warehouses at this stage are updated every time an operational system performs a transaction.

The data warehouses then generate transactions that are passed back into the operational systems.

Fact table:

In data warehousing, a fact table consists of the measurements, metrics or facts of a business process.

It is often located at the centre of a star schema, surrounded by dimension tables. Fact tables provide the (usually) additive values which act as independent variables by which dimensional attributes are

analyzed. Fact tables are often defined by their grain. The grain of a fact table represents the most

atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as

"Sales volume by Day by Product by Store". Each record in this fact table is therefore uniquely defined

by a day, product and store. Other dimensions might be members of this fact table (such as

location/region) but these add nothing to the uniqueness of the fact records. These "affiliate

dimensions" allow for additional slices of the independent facts but generally provide insights at a higher level of aggregation (a region is made up of many stores).
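
The grain described above can be sketched with Python's built-in sqlite3 module. The table and column names below are assumptions made for illustration. The composite primary key encodes the "by Day by Product by Store" grain, while region is reached through the store dimension and only supports a coarser roll-up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (
    day          TEXT,
    product_id   INTEGER,
    store_id     INTEGER REFERENCES dim_store(store_id),
    sales_volume INTEGER,
    PRIMARY KEY (day, product_id, store_id)  -- the grain of the fact table
);
""")
conn.executemany("INSERT INTO dim_store VALUES (?, ?)",
                 [(1, "North"), (2, "North"), (3, "South")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", [
    ("2012-08-22", 10, 1, 5),
    ("2012-08-22", 10, 2, 7),
    ("2012-08-22", 10, 3, 4),
])


def insert_is_rejected(row):
    """Return True if the row violates the grain (duplicate day/product/store)."""
    try:
        conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?, ?)", row)
    except sqlite3.IntegrityError:
        return True
    return False


# A second record for the same day, product and store breaks the grain.
duplicate_rejected = insert_is_rejected(("2012-08-22", 10, 1, 99))

# Region adds nothing to uniqueness; it only offers a higher-level slice.
region_totals = dict(conn.execute("""
    SELECT s.region, SUM(f.sales_volume)
    FROM fact_sales f JOIN dim_store s ON f.store_id = s.store_id
    GROUP BY s.region
"""))
```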

A data warehouse dimension provides the means to "slice and dice" data in a data warehouse.

Dimensions provide structured labeling information to otherwise unordered numeric measures. For

example, "Customer", "Date", and "Product" are all dimensions that could be applied meaningfully to a

sales receipt. A dimensional data element is similar to a categorical variable in statistics.

The primary function of dimensions is threefold: to provide filtering, grouping and labeling. For example, in a data warehouse where each person is categorized as having a gender of male, female or unknown, a user of the data warehouse could filter a report on the gender dimension or display results broken out by gender.

Star Schema:

The star schema (sometimes referenced as star join schema) is the simplest style of data warehouse schema. The star schema consists of a few "fact tables" (possibly only one, justifying the name)

referencing any number of "dimension tables". The star schema is considered an important special case

of the snowflake schema.


Star schema used by example query.

Consider a database of sales, perhaps from a store chain, classified by date, store and product. The image of the schema to the right is a star schema version of the sample schema provided in the

snowflake schema article.

Fact_Sales is the fact table, and there are three dimension tables: Dim_Date, Dim_Store and Dim_Product.


Each dimension table has a primary key on its Id column, relating to one of the columns of the

Fact_Sales table's three-column primary key (Date_Id, Store_Id, Product_Id). The non-primary key

Units_Sold column of the fact table in this example represents a measure or metric that can be used in

calculations and analysis. The non-primary key columns of the dimension tables represent additional

attributes of the dimensions (such as the Year of the Dim_Date dimension).

The following query extracts how many TV sets have been sold, for each brand and country, in 1997.
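
A query of this kind might look like the sketch below, run here through Python's built-in sqlite3 module so that it is self-contained. Fact_Sales, Units_Sold, the three key columns, and Dim_Date.Year come from the description above. The Brand, Country, and Product_Category columns and all sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Date    (Id INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Store   (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand TEXT, Product_Category TEXT);
CREATE TABLE Fact_Sales (
    Date_Id    INTEGER REFERENCES Dim_Date(Id),
    Store_Id   INTEGER REFERENCES Dim_Store(Id),
    Product_Id INTEGER REFERENCES Dim_Product(Id),
    Units_Sold INTEGER,
    PRIMARY KEY (Date_Id, Store_Id, Product_Id)
);
""")
conn.executemany("INSERT INTO Dim_Date VALUES (?, ?)",
                 [(1, 1997), (2, 1998), (3, 1997)])
conn.executemany("INSERT INTO Dim_Store VALUES (?, ?)",
                 [(1, "USA"), (2, "Canada")])
conn.executemany("INSERT INTO Dim_Product VALUES (?, ?, ?)",
                 [(1, "Acme", "tv"), (2, "Zenit", "tv"), (3, "Acme", "radio")])
conn.executemany("INSERT INTO Fact_Sales VALUES (?, ?, ?, ?)", [
    (1, 1, 1, 10),   # 1997, USA, Acme tv
    (3, 1, 1, 5),    # 1997, USA, Acme tv (another date)
    (1, 2, 1, 4),    # 1997, Canada, Acme tv
    (1, 1, 2, 3),    # 1997, USA, Zenit tv
    (2, 1, 1, 100),  # 1998: filtered out by year
    (1, 1, 3, 50),   # radio: filtered out by category
])

# Every dimension joins directly to the fact table, as in any star schema.
query = """
SELECT P.Brand, S.Country, SUM(F.Units_Sold) AS Units
FROM Fact_Sales F
JOIN Dim_Date    D ON F.Date_Id    = D.Id
JOIN Dim_Store   S ON F.Store_Id   = S.Id
JOIN Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 1997 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country
ORDER BY P.Brand, S.Country
"""
rows = conn.execute(query).fetchall()
```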

Normalization:

Database normalization, sometimes referred to as canonical synthesis, is a technique for designing

relational database tables to minimize duplication of information and, in so doing, to safeguard the

database against certain types of logical or structural problems, namely data anomalies. For example, when multiple instances of a given piece of information occur in a table, the possibility exists that these

instances will not be kept consistent when the data within the table is updated, leading to a loss of data

integrity. A table that is sufficiently normalized is less vulnerable to problems of this kind, because its

structure reflects the basic assumptions for when multiple instances of the same information should be represented by a single instance only.
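
The update anomaly described above can be reproduced in a few lines. This is a minimal sketch using Python's built-in sqlite3 module; the tables, columns, and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Unnormalized: the customer's city is repeated on every order row.
CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY, customer TEXT, city TEXT);
INSERT INTO orders_flat VALUES (1, 'Asha', 'Pune'), (2, 'Asha', 'Pune');

-- Normalized: the city is stored once and referenced by key.
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'Asha', 'Pune');
INSERT INTO orders VALUES (1, 1), (2, 1);
""")

# An update that touches only one of the repeated copies leaves the
# unnormalized table contradicting itself.
conn.execute("UPDATE orders_flat SET city = 'Mumbai' WHERE order_id = 1")
flat_cities = {row[0] for row in conn.execute(
    "SELECT city FROM orders_flat WHERE customer = 'Asha'")}

# In the normalized design there is a single copy to update.
conn.execute("UPDATE customers SET city = 'Mumbai' WHERE customer_id = 1")
normalized_cities = {row[0] for row in conn.execute(
    "SELECT c.city FROM orders o JOIN customers c ON o.customer_id = c.customer_id")}
```

After the updates, the unnormalized table reports two different cities for the same customer, while the normalized tables report one. The price is the extra join needed to read an order together with its city.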

Higher degrees of normalization typically involve more tables and create the need for a larger number

of joins, which can reduce performance. Accordingly, more highly normalized tables are typically used

in database applications involving many isolated transactions (e.g. an automated teller machine), while less normalized tables tend to be used in database applications that need to map complex relationships

between data entities and data attributes (e.g. a reporting application, or a full-text search application).

Database theory describes a table's degree of normalization in terms of normal forms of successively

higher degrees of strictness. A table in third normal form (3NF), for example, is consequently in second

normal form (2NF) as well; but the reverse is not necessarily the case.

Although the normal forms are often defined informally in terms of the characteristics of tables, rigorous definitions of the normal forms are concerned with the characteristics of mathematical

constructs known as relations. Whenever information is represented relationally, it is meaningful to consider the extent to which the representation is normalized.

Materialized view:

In a database management system following the relational model, a view is a virtual table representing

the result of a database query. Whenever an ordinary view's table is queried or updated, the DBMS

converts these into queries or updates against the underlying base tables. A materialized view takes a

different approach in which the query result is cached as a concrete table that may be updated from the

original base tables from time to time. This enables much more efficient access, at the cost of some

data being potentially out-of-date. It is most useful in data warehousing scenarios, where frequent

queries of the actual base tables can be extremely expensive.

In addition, because the view is manifested as a real table, anything that can be done to a real table

can be done to it, most importantly building indexes on any column, enabling drastic speedups in query

time. In a normal view, it's typically only possible to exploit indexes on columns that come directly from (or have a mapping to) indexed columns in the base tables; often this functionality is not offered at all.
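
The trade-off between an ordinary view and a materialized one can be sketched as follows. SQLite has no materialized views, so this example emulates one with a plain table that is rebuilt on demand; it is an illustration of the idea, not how Oracle or any other DBMS implements the feature. All table, view, and function names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, units INTEGER);
INSERT INTO sales VALUES ('tv', 2), ('tv', 3), ('radio', 1);

-- Ordinary view: recomputed from the base table on every query.
CREATE VIEW v_totals AS
    SELECT product, SUM(units) AS total FROM sales GROUP BY product;
""")


def refresh_materialized_totals(conn):
    """Rebuild the cached copy of v_totals as a real, indexable table."""
    conn.executescript("""
    DROP TABLE IF EXISTS mv_totals;
    CREATE TABLE mv_totals AS SELECT product, total FROM v_totals;
    CREATE INDEX idx_mv_totals_total ON mv_totals(total);
    """)


refresh_materialized_totals(conn)

# New base data arrives after the last refresh.
conn.execute("INSERT INTO sales VALUES ('tv', 10)")

live = dict(conn.execute("SELECT product, total FROM v_totals"))
stale = dict(conn.execute("SELECT product, total FROM mv_totals"))

refresh_materialized_totals(conn)
fresh = dict(conn.execute("SELECT product, total FROM mv_totals"))
```

Between refreshes the cached table is cheap to read and can carry its own index, but it lags the base table; after a refresh it matches the ordinary view again.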

Materialized views were implemented first by the Oracle database.

There are three types of materialized views:

1) Read only

Cannot be updated. Complex materialized views are supported.

2) Updatable

Can be updated even when disconnected from the master site. Are refreshed on demand.

Consumes fewer resources.

Requires Advanced Replication option to be installed.

3) Writeable

Created with the for update clause. Changes are lost when view is refreshed.

Requires Advanced Replication option to be installed.

Data Warehouses, OLTP, OLAP, and Data Mining

A relational database is designed for a specific purpose. Because the purpose of a data warehouse

differs from that of an OLTP database, the design characteristics of a relational database that supports a data

warehouse differ from the design characteristics of an OLTP database.

Data warehouse database | OLTP database

Designed for analysis of business measures by categories and attributes | Designed for real-time business operations

Optimized for bulk loads and large, complex, unpredictable queries that access many rows per table | Optimized for a common set of transactions, usually adding or retrieving a single row at a time per table

Loaded with consistent, valid data; requires no real time validation | Optimized for validation of incoming data during transactions; uses validation data

Supports few concurrent users relative to OLTP | Supports thousands of concurrent users

A Data Warehouse Supports OLTP

A data warehouse supports an OLTP system by providing a place for the OLTP database to offload data

as it accumulates, and by providing services that would complicate and degrade OLTP operations if they

were performed in the OLTP database.

Without a data warehouse to hold historical information, data is archived to static media such as

magnetic tape, or allowed to accumulate in the OLTP database.

If data is simply archived for preservation, it is not available or organized for use by analysts and

decision makers. If data is allowed to accumulate in the OLTP so it can be used for analysis, the OLTP

database continues to grow in size and requires more indexes to service analytical and report queries.

These queries access and process large portions of the continually growing historical data and add a

substantial load to the database. The large indexes needed to support these queries also tax the OLTP

transactions with additional index maintenance. These queries can also be complicated to develop due

to the typically complex OLTP database schema.

A data warehouse offloads the historical data from the OLTP, allowing the OLTP to operate at peak

transaction efficiency. High volume analytical and reporting queries are handled by the data warehouse

and do not load the OLTP, which does not need additional indexes for their support. As data is moved to

the data warehouse, it is also reorganized and consolidated so that analytical queries are simpler and

more efficient.

OLAP is a Data Warehouse Tool

Online analytical processing (OLAP) is a technology designed to provide superior performance for ad

hoc business intelligence queries. OLAP is designed to operate efficiently with data organized in

accordance with the common dimensional model used in data warehouses.

A data warehouse provides a multidimensional view of data in an intuitive model designed to match the

types of queries posed by analysts and decision makers. OLAP organizes data warehouse data into

multidimensional cubes based on this dimensional model, and then preprocesses these cubes to provide

maximum performance for queries that summarize data in various ways. For example, a query that

requests the total sales income and quantity sold for a range of products in a specific geographical

region for a specific time period can typically be answered in a few seconds or less regardless of how

many hundreds of millions of rows of data are stored in the data warehouse database.
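
The idea of preprocessing a cube so that summary queries become lookups can be sketched in plain Python. This is a toy model under stated assumptions: the dimension names, sample rows, and the `build_cube` and `lookup` helpers are hypothetical, and real OLAP servers use far more compact storage and do not necessarily precompute every combination.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("product", "region", "year")

# Hypothetical detail rows from a sales fact table.
rows = [
    {"product": "tv",    "region": "North", "year": 1997, "sales": 100.0, "quantity": 2},
    {"product": "tv",    "region": "South", "year": 1997, "sales": 50.0,  "quantity": 1},
    {"product": "radio", "region": "North", "year": 1997, "sales": 20.0,  "quantity": 4},
    {"product": "tv",    "region": "North", "year": 1998, "sales": 150.0, "quantity": 3},
]


def build_cube(rows):
    """Pre-aggregate sales and quantity for every subset of the dimensions."""
    cube = defaultdict(lambda: [0.0, 0])
    for row in rows:
        for size in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, size):
                key = tuple((d, row[d]) for d in dims)
                cell = cube[key]
                cell[0] += row["sales"]
                cell[1] += row["quantity"]
    return cube


def lookup(cube, **filters):
    """Answer a summary query with one dictionary lookup instead of a scan."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return tuple(cube.get(key, (0.0, 0)))


cube = build_cube(rows)
tv_1997 = lookup(cube, product="tv", year=1997)
north_all = lookup(cube, region="North")
grand_total = lookup(cube)
```

The work is paid once when the cube is built, which is practical because historical warehouse data changes rarely; each later query costs the same regardless of how many detail rows were loaded.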

OLAP is not designed to store large volumes of text or binary data, nor is it designed to support high

volume update transactions. The inherent stability and consistency of historical data in a data

warehouse enables OLAP to provide its remarkable performance in rapidly summarizing information for

analytical queries.

In SQL Server 2000, Analysis Services provides tools for developing OLAP applications and a server

specifically designed to service OLAP queries.

Data Mining is a Data Warehouse Tool

Data mining is a technology that applies sophisticated and complex algorithms to analyze data and

expose interesting information for analysis by decision makers. Whereas OLAP organizes data in a

model suited for exploration by analysts, data mining performs analysis on data and provides the

results to decision makers. Thus, OLAP supports model-driven analysis and data mining supports data-

driven analysis.

Data mining has traditionally operated only on raw data in the data warehouse database or, more

commonly, text files of data extracted from the data warehouse database. In SQL Server 2000, Analysis

Services provides data mining technology that can analyze data in OLAP cubes, as well as data in the

relational data warehouse database. In addition, data mining results can be incorporated into OLAP

cubes to further enhance model-driven analysis by providing an additional dimensional viewpoint into the OLAP model. For example, data mining can be used to analyze sales data against customer

attributes and create a new cube dimension to assist the analyst in the discovery of the information

embedded in the cube data.

For more information and details about data mining in SQL Server 2000, see Chapter 24, "Effective

Strategies for Data Mining," in the SQL Server 2000 Resource Kit.

Designing a Data Warehouse: Prerequisites

Before embarking on the design of a data warehouse, it is imperative that the architectural goals of the

data warehouse be clear and well understood. Because the purpose of a data warehouse is to serve

users, it is also critical to understand the various types of users, their needs, and the characteristics of

their interactions with the data warehouse.

Data Warehouse Architecture Goals

A data warehouse exists to serve its users—analysts and decision makers. A data warehouse must be

designed to satisfy the following requirements:

Deliver a great user experience—user acceptance is the measure of success

Function without interfering with OLTP systems

Provide a central repository of consistent data

Answer complex queries quickly

Provide a variety of powerful analytical tools, such as OLAP and data mining

Most successful data warehouses that meet these requirements have these common characteristics:

Are based on a dimensional model

Contain historical data

Include both detailed and summarized data

Consolidate disparate data from multiple sources while retaining consistency

Focus on a single subject, such as sales, inventory, or finance

Data warehouses are often quite large. However, size is not an architectural goal—it is a characteristic

driven by the amount of data needed to serve the users.

Data Warehouse Users

The success of a data warehouse is measured solely by its acceptance by users. Without users,

historical data might as well be archived to magnetic tape and stored in the basement. Successful data

warehouse design starts with understanding the users and their needs.

Data warehouse users can be divided into four categories: Statisticians, Knowledge Workers,

Information Consumers, and Executives. Each type makes up a portion of the user population as

illustrated in this diagram.

Figure 1. The User Pyramid

Statisticians: There are typically only a handful of sophisticated analysts—Statisticians and operations

research types—in any organization. Though few in number, they are some of the best users of the

data warehouse; those whose work can contribute to closed loop systems that deeply influence the

operations and profitability of the company. It is vital that these users come to love the data

warehouse. Usually that is not difficult; these people are often very self-sufficient and need only to be

pointed to the database and given some simple instructions about how to get to the data and what

times of the day are best for performing large queries to retrieve data to analyze using their own

sophisticated tools. They can take it from there.

Knowledge Workers: A relatively small number of analysts perform the bulk of new queries and

analyses against the data warehouse. These are the users who get the "Designer" or "Analyst" versions

of user access tools. They will figure out how to quantify a subject area. After a few iterations, their

queries and reports typically get published for the benefit of the Information Consumers. Knowledge

Workers are often deeply engaged with the data warehouse design and place the greatest demands on

the ongoing data warehouse operations team for training and support.

Information Consumers: Most users of the data warehouse are Information Consumers; they will

probably never compose a true ad hoc query. They use static or simple interactive reports that others

have developed. It is easy to forget about these users, because they usually interact with the data

warehouse only through the work product of others. Do not neglect these users! This group includes a

large number of people, and published reports are highly visible. Set up a great communication

infrastructure for distributing information widely, and gather feedback from these users to improve the

information sites over time.

Executives: Executives are a special case of the Information Consumers group. Few executives

actually issue their own queries, but an executive's slightest musing can generate a flurry of activity

among the other types of users. A wise data warehouse designer/implementer/owner will develop a

very cool digital dashboard for executives, assuming it is easy and economical to do so. Usually this

should follow other data warehouse work, but it never hurts to impress the bosses.

How Users Query the Data Warehouse

Information for users can be extracted from the data warehouse relational database or from the output

of analytical services such as OLAP or data mining. Direct queries to the data warehouse relational

database should be limited to those that cannot be accomplished through existing tools, which are often

more efficient than direct queries and impose less load on the relational database.

Reporting tools and custom applications often access the database directly. Statisticians frequently extract data for use by special analytical tools. Analysts may write complex queries to extract and

compile specific information not readily accessible through existing tools. Information consumers do not

interact directly with the relational database but may receive e-mail reports or access web pages that

expose data from the relational database. Executives use standard reports or ask others to create

specialized reports for them.

When using the Analysis Services tools in SQL Server 2000, Statisticians will often perform data mining,

Analysts will write MDX queries against OLAP cubes and use data mining, and Information Consumers

will use interactive reports designed by others.

Developing a Data Warehouse: Details

The phases of a data warehouse project listed below are similar to those of most database projects,

starting with identifying requirements and ending with deploying the system:

Identify and gather requirements

Design the dimensional model

Develop the architecture, including the Operational Data Store (ODS)

Design the relational database and OLAP cubes

Develop the data maintenance applications

Develop analysis applications

Test and deploy the system

Identify and Gather Requirements

Identify sponsors. A successful data warehouse project needs a sponsor in the business organization

and usually a second sponsor in the Information Technology group. Sponsors must understand and

support the business value of the project.

Understand the business before entering into discussions with users. Then interview and work with the

users, not the data—learn the needs of the users and turn these needs into project requirements. Find

out what information they need to be more successful at their jobs, not what data they think should be

in the data warehouse; it is the data warehouse designer's job to determine what data is necessary to

provide the information. Topics for discussion are the users' objectives and challenges and how they go

about making business decisions. Business users should be closely tied to the design team during the

logical design process; they are the people who understand the meaning of existing data. Many

successful projects include several business users on the design team to act as data experts and

"sounding boards" for design concepts. Whatever the structure of the team, it is important that

business users feel ownership for the resulting system.

Interview data experts after interviewing several users. Find out from the experts what data exists and

where it resides, but only after you understand the basic business needs of the end users. Information

about available data is needed early in the process, before you complete the analysis of the business

needs, but the physical design of existing data should not be allowed to have much influence on

discussions about business needs.

Communicate with users often and thoroughly—continue discussions as requirements continue to

solidify so that everyone participates in the progress of the requirements definition.

Design the Dimensional Model

User requirements and data realities drive the design of the dimensional model, which must address

business needs, grain of detail, and what dimensions and facts to include.

The dimensional model must suit the requirements of the users and support ease of use for direct

access. The model must also be designed so that it is easy to maintain and can adapt to future

changes. The model design must result in a relational database that supports OLAP cubes to provide

"instantaneous" query results for analysts.

An OLTP system requires a normalized structure to minimize redundancy, provide validation of input

data, and support a high volume of fast transactions. A transaction usually involves a single business

event, such as placing an order or posting an invoice payment. An OLTP model often looks like a spider

web of hundreds or even thousands of related tables.

In contrast, a typical dimensional model uses a star or snowflake design that is easy to understand and

relate to business needs, supports simplified business queries, and provides superior query

performance by minimizing table joins.

For example, contrast the very simplified OLTP data model in the first diagram below with the data warehouse dimensional model in the second diagram. Which one better supports the ease of developing

reports and simple, efficient summarization queries?

Figure 2. Flow Chart

Figure 3. Star Diagram

Dimensional Model Schemas

The principal characteristic of a dimensional model is a set of detailed business facts surrounded by

multiple dimensions that describe those facts. When realized in a database, the schema for a

dimensional model contains a central fact table and multiple dimension tables. A dimensional model

may produce a star schema or a snowflake schema.

Star Schemas

A schema is called a star schema if all dimension tables can be joined directly to the fact table. The

following diagram shows a classic star schema.

Figure 4. Classic star schema, sales

The following diagram shows a clickstream star schema.

Figure 5. Clickstream star schema
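
The star layout described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the schema shown in the figures: the table names (fact_sales, dim_date, dim_product, dim_customer), their columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical star schema: one fact table, three dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, year INTEGER, month_name TEXT);
CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, region TEXT);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    quantity     INTEGER,
    amount       REAL
);
INSERT INTO dim_date VALUES (20000115, 2000, 'January'), (20000220, 2000, 'February');
INSERT INTO dim_product VALUES (1, 'Keyboard', 'Hardware'), (2, 'Editor', 'Software');
INSERT INTO dim_customer VALUES (1, 'Acme', 'East'), (2, 'Globex', 'West');
INSERT INTO fact_sales VALUES
    (20000115, 1, 1, 2, 40.0),
    (20000115, 2, 2, 1, 99.0),
    (20000220, 1, 2, 5, 100.0);
""")

# Every dimension joins directly to the fact table: one join per dimension used.
sales_by_category = conn.execute("""
    SELECT p.category, d.month_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.month_name
    ORDER BY p.category, d.month_name
""").fetchall()
print(sales_by_category)
```

The summarization query needs only as many joins as it has dimensions, which is the property that makes star schemas convenient for reporting.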

Snowflake Schemas

A schema is called a snowflake schema if one or more dimension tables do not join directly to the fact

table but must join through other dimension tables. For example, a dimension that describes products

may be separated into three tables (snowflaked) as illustrated in the following diagram.

Figure 6. Snowflake, three tables

A snowflake schema with multiple heavily snowflaked dimensions is illustrated in the following diagram.

Figure 7. Many dimension snowflake

Star or Snowflake

Both star and snowflake schemas are dimensional models; the difference is in their physical

implementations. Snowflake schemas support ease of dimension maintenance because they are more

normalized. Star schemas are easier for direct user access and often support simpler and more efficient

queries. The decision to model a dimension as a star or snowflake depends on the nature of the

dimension itself, such as how frequently it changes and which of its elements change, and often

involves evaluating tradeoffs between ease of use and ease of maintenance. It is often easiest to

maintain a complex dimension by snowflaking the dimension. By pulling hierarchical levels into

separate tables, referential integrity between the levels of the hierarchy is guaranteed. Analysis

Services reads from a snowflaked dimension as well as, or better than, from a star dimension.

However, it is important to present a simple and appealing user interface to business users who are developing ad hoc queries on the dimensional database. It may be better to create a star version of the

snowflaked dimension for presentation to the users. Often, this is best accomplished by creating an

indexed view across the snowflaked dimension, collapsing it to a virtual star.
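
Collapsing a snowflaked dimension into a virtual star can be sketched as follows. The example uses SQLite through Python's sqlite3 module, which has no indexed views, so a plain view stands in for the indexed view recommended above. The three product tables and the view name dim_product_star are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Product dimension snowflaked into three hypothetical tables.
CREATE TABLE product_line     (line_key INTEGER PRIMARY KEY, line_name TEXT);
CREATE TABLE product_category (category_key INTEGER PRIMARY KEY, category_name TEXT,
                               line_key INTEGER REFERENCES product_line(line_key));
CREATE TABLE product          (product_key INTEGER PRIMARY KEY, product_name TEXT,
                               category_key INTEGER REFERENCES product_category(category_key));
INSERT INTO product_line VALUES (1, 'Hardware');
INSERT INTO product_category VALUES (10, 'Input devices', 1);
INSERT INTO product VALUES (100, 'Keyboard', 10), (101, 'Mouse', 10);

-- A plain view flattens the three levels into one star-style dimension.
CREATE VIEW dim_product_star AS
SELECT p.product_key, p.product_name, c.category_name, l.line_name
FROM product AS p
JOIN product_category AS c ON c.category_key = p.category_key
JOIN product_line AS l ON l.line_key = c.line_key;
""")

star_rows = conn.execute(
    "SELECT * FROM dim_product_star ORDER BY product_key"
).fetchall()
print(star_rows)
```

Maintenance still happens in the normalized tables, while ad hoc queries see a single wide dimension.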

Dimension Tables

Dimension tables encapsulate the attributes associated with facts and separate these attributes into

logically distinct groupings, such as time, geography, products, customers, and so forth.

A dimension table may be used in multiple places if the data warehouse contains multiple fact tables or

contributes data to data marts. For example, a product dimension may be used with a sales fact table

and an inventory fact table in the data warehouse, and also in one or more departmental data marts. A

dimension such as customer, time, or product that is used in multiple schemas is called a conforming

dimension if all copies of the dimension are the same. Summarization data and reports will not

correspond if different schemas use different versions of a dimension table. Using conforming

dimensions is critical to successful data warehouse design.

User input and evaluation of existing business reports help define the dimensions to include in the data

warehouse. A user who wants to see data "by sales region" and "by product" has just identified two

dimensions (geography and product). Business reports that group sales by salesperson or sales by

customer identify two more dimensions (salesforce and customer). Almost every data warehouse

includes a time dimension.

In contrast to a fact table, dimension tables are usually small and change relatively slowly. Dimension

tables are seldom keyed to date.

The records in a dimension table establish one-to-many relationships with the fact table. For example,

there may be a number of sales to a single customer, or a number of sales of a single product. The

dimension table contains attributes associated with the dimension entry; these attributes are rich and

user-oriented textual details, such as product name or customer name and address. Attributes serve as

report labels and query constraints. Attributes that are coded in an OLTP database should be decoded

into descriptions. For example, product category may exist as a simple integer in the OLTP database,

but the dimension table should contain the actual text for the category. The code may also be carried in

the dimension table if needed for maintenance. This denormalization simplifies and improves the

efficiency of queries and simplifies user query tools. However, if a dimension attribute changes frequently, maintenance may be easier if the attribute is assigned to its own table to create a snowflake dimension.


It is often useful to have a pre-established "no such member" or "unknown member" record in each

dimension to which orphan fact records can be tied during the update process. Business needs and the

reliability of consistent source data will drive the decision as to whether such placeholder dimension

records are required.
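
One way to tie orphan fact records to a placeholder is sketched below, assuming a reserved key of 0 for the "unknown member" row. The key value, the table layout, and the helper customer_key_for are illustrative choices, not part of the original design.

```python
import sqlite3

UNKNOWN_KEY = 0  # hypothetical reserved key for the "unknown member" row

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                           source_id TEXT UNIQUE, customer_name TEXT);
CREATE TABLE fact_sales (customer_key INTEGER REFERENCES dim_customer(customer_key),
                         amount REAL);
INSERT INTO dim_customer VALUES (0, NULL, 'Unknown customer'), (1, 'C-17', 'Acme');
""")

def customer_key_for(source_id):
    """Return the dimension key for a source id, or the placeholder key."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE source_id = ?", (source_id,)
    ).fetchone()
    return row[0] if row else UNKNOWN_KEY

incoming = [("C-17", 25.0), ("C-99", 10.0)]  # "C-99" has no dimension row
for source_id, amount in incoming:
    conn.execute("INSERT INTO fact_sales VALUES (?, ?)",
                 (customer_key_for(source_id), amount))

totals = conn.execute("""
    SELECT c.customer_name, SUM(f.amount)
    FROM fact_sales AS f JOIN dim_customer AS c USING (customer_key)
    GROUP BY c.customer_name ORDER BY c.customer_name
""").fetchall()
print(totals)
```

The orphan sale is neither rejected nor lost: it is reported under the placeholder until the real dimension row arrives.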


The data in a dimension is usually hierarchical in nature. Hierarchies are determined by the business

need to group and summarize data into usable information. For example, a time dimension often

contains the hierarchy elements: (all time), Year, Quarter, Month, Day, or (all time), Year, Quarter,

Week, Day. A dimension may contain multiple hierarchies—a time dimension often contains both

calendar and fiscal year hierarchies. Geography is seldom a dimension of its own; it is usually a

hierarchy that imposes a structure on sales points, customers, or other geographically distributed

dimensions. An example geography hierarchy for sales points is: (all), Country or Region, Sales-region,

State or Province, City, Store.

Note that each hierarchy example has an "(all)" entry such as (all time), (all stores), (all customers),

and so forth. This top-level entry is an artificial category used for grouping the first-level categories of a

dimension and permits summarization of fact data to a single number for a dimension. For example, if

the first level of a product hierarchy includes product line categories for hardware, software, peripherals, and services, the question "What was the total amount for sales of all products last year?"

is equivalent to "What was the total amount for the combined sales of hardware, software, peripherals,

and services last year?" The concept of an "(all)" node at the top of each hierarchy helps reflect the way

users want to phrase their questions. OLAP tools depend on hierarchies to categorize data—Analysis

Services will create by default an "(all)" entry for a hierarchy used in a cube if none is specified.

A hierarchy may be balanced, unbalanced, ragged, or composed of parent-child relationships such as an

organizational structure. For more information about hierarchies in OLAP cubes, see SQL Server Books Online.


Surrogate Keys

A critical part of data warehouse design is the creation and use of surrogate keys in dimension tables. A

surrogate key is the primary key for a dimension table and is independent of any keys provided by

source data systems. Surrogate keys are created and maintained in the data warehouse and should not

encode any information about the contents of records; automatically increasing integers make good

surrogate keys. The original key for each record is carried in the dimension table but is not used as the

primary key. Surrogate keys provide the means to maintain data warehouse information when

dimensions change. Special keys are used for date and time dimensions, but these keys differ from

surrogate keys used for other dimension tables.


Avoid using GUIDs (globally unique identifiers) as keys in the data warehouse database. GUIDs may be

used in data from distributed source systems, but they are difficult to use as table keys. GUIDs use a

significant amount of storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans

to read. Indexes on GUID columns may be relatively slower than indexes on integer keys because

GUIDs are four times larger. The Transact-SQL NEWID function can be used to create GUIDs for a

column of uniqueidentifier data type, and the ROWGUIDCOL property can be set for such a column to

indicate that the GUID values in the column uniquely identify rows in the table, but uniqueness is not enforced.


Because a uniqueidentifier data type cannot be sorted, the GUID cannot be used in a GROUP BY

statement, nor can the occurrences of the uniqueidentifier GUID be distinctly counted; both GROUP

BY and COUNT DISTINCT operations are very common in data warehouses. The uniqueidentifier GUID

cannot be used as a measure in an Analysis Services cube.

The IDENTITY property and IDENTITY function can be used to create identity columns in tables and to

manage series of generated numeric keys. IDENTITY functionality is more useful in surrogate key

management than uniqueidentifier GUIDs.
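
The surrogate key pattern can be sketched as follows. SQLite's INTEGER PRIMARY KEY AUTOINCREMENT is used here as a stand-in for the Transact-SQL IDENTITY property, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        source_system TEXT NOT NULL,
        source_id     TEXT NOT NULL,                      -- original key, carried along
        customer_name TEXT NOT NULL,
        UNIQUE (source_system, source_id)
    )
""")

def add_customer(source_system, source_id, name):
    """Insert a dimension row and return the generated surrogate key."""
    cur = conn.execute(
        "INSERT INTO dim_customer (source_system, source_id, customer_name) "
        "VALUES (?, ?, ?)",
        (source_system, source_id, name),
    )
    return cur.lastrowid

# The same source id from two systems receives two distinct surrogate keys.
key_a = add_customer("billing", "1001", "Acme")
key_b = add_customer("crm", "1001", "Acme Corp.")
print(key_a, key_b)
```

Because the generated integers carry no meaning, colliding or changing source keys do not disturb the fact tables that reference the dimension.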

Date and Time Dimensions

Each event in a data warehouse occurs at a specific date and time, and data is often summarized by a

specified time period for analysis. Although the date and time of a business fact is usually recorded in

the source data, special date and time dimensions provide more effective and efficient mechanisms for

time-oriented analysis than the raw event time stamp. Date and time dimensions are designed to meet

the needs of the data warehouse users and are created within the data warehouse.

A date dimension often contains two hierarchies: one for calendar year and another for fiscal year.

Time Granularity

A date dimension with one record per day will suffice if users do not need time granularity finer than a

single day. A date by day dimension table will contain 365 records per year (366 in leap years).

A separate time dimension table should be constructed if a fine time granularity, such as minute or

second, is needed. A time dimension table of one-minute granularity will contain 1,440 rows for a day,

and a table of seconds will contain 86,400 rows for a day. If exact event time is needed, it should be

stored in the fact table.

When a separate time dimension is used, the fact table contains one foreign key for the date dimension

and another for the time dimension. Separate date and time dimensions simplify many filtering

operations. For example, summarizing data for a range of days requires joining only the date dimension

table to the fact table. Analyzing cyclical data by time period within a day requires joining just the time

dimension table. The date and time dimension tables can both be joined to the fact table when a

specific time range is needed.
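
A minimal sketch of how one event timestamp might be split into the two foreign keys follows. The yyyymmdd form for the date key matches the date table described later; using minutes since midnight as the time key is an assumption made for this example.

```python
from datetime import datetime

def date_key(ts: datetime) -> int:
    """Date dimension key in yyyymmdd form."""
    return ts.year * 10000 + ts.month * 100 + ts.day

def time_key(ts: datetime) -> int:
    """Time dimension key at one-minute grain: minutes since midnight (0..1439)."""
    return ts.hour * 60 + ts.minute

event = datetime(2000, 11, 24, 22, 5, 30)
fact_row = {
    "date_key": date_key(event),
    "time_key": time_key(event),
    "event_time": event,  # exact timestamp kept in the fact table
}
print(fact_row["date_key"], fact_row["time_key"])
```

A query over a range of days filters on date_key alone, while a query about a recurring time of day filters on time_key alone.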

For hourly time granularity, the hour breakdown can be incorporated into the date dimension or placed

in a separate dimension. Business needs influence this design decision. If the main use is to extract

contiguous chunks of time that cross day boundaries (for example 11/24/2000 10 p.m. to 11/25/2000

6 a.m.), then it is easier if the hour and day are in the same dimension. However, it is easier to analyze cyclical and recurring daily events if they are in separate dimensions. Unless there is a clear reason to

combine date and hour in a single dimension, it is generally better to keep them in separate dimensions.

Date and Time Dimension Attributes

It is often useful to maintain attribute columns in a date dimension to provide additional convenience or

business information that supports analysis. For example, one or more columns in the time-by-hour

dimension table can indicate peak periods in a daily cycle, such as meal times for a restaurant chain or

heavy usage hours for an Internet service provider. Peak period columns may be Boolean, but it is

better to "decode" the Boolean yes/no into a brief description, such as "peak"/"offpeak". In a report,

the decoded values will be easier for business users to read than multiple columns of "yes" and "no".

These are some possible attribute columns that may be used in a date table. Fiscal year versions are

the same, although values such as quarter numbers may differ.



Column name      Data type      Format or example  Comment
date_key         int            yyyymmdd
day_date         smalldatetime
day_of_week      char           Monday
...e             smalldatetime
week_num         tinyint        1 to 52 or 53      Week 1 defined by business
month_num        tinyint        1 to 12
month_name       char           January
...me            char           Jan
month_end_date   smalldatetime                     Useful for days in the month
days_in_month    tinyint                           Alternative for, or in addition to, month_end_date
yearmo           int            yyyymm
quarter_num      tinyint        1 to 4
quarter_name     char           1Q2000
year             smallint
weekend_ind      bit                               Indicates weekend
workday_ind      bit                               Indicates work day
...ay            char           weekend            Alternative for weekend_ind and weekday_ind. Can be used to make reports more readable.
holiday_ind      bit
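
Rows for such a date table are usually generated rather than extracted. The sketch below produces one row per day for a subset of the columns listed above. Columns that depend on business rules (week numbering, holidays, work days) are left out, and the function name date_dimension_rows is hypothetical.

```python
import calendar
from datetime import date, timedelta

def date_dimension_rows(year):
    """Yield one dict per calendar day of `year` with a subset of the columns above."""
    day = date(year, 1, 1)
    while day.year == year:
        days_in_month = calendar.monthrange(day.year, day.month)[1]
        quarter = (day.month - 1) // 3 + 1
        is_weekend = day.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
        yield {
            "date_key": day.year * 10000 + day.month * 100 + day.day,
            "day_date": day,
            "day_of_week": day.strftime("%A"),
            "month_num": day.month,
            "month_name": day.strftime("%B"),
            "month_end_date": date(day.year, day.month, days_in_month),
            "days_in_month": days_in_month,
            "yearmo": day.year * 100 + day.month,
            "quarter_num": quarter,
            "quarter_name": f"{quarter}Q{day.year}",
            "year": day.year,
            "weekend_ind": int(is_weekend),
        }
        day += timedelta(days=1)

rows_2000 = list(date_dimension_rows(2000))
print(len(rows_2000), rows_2000[0]["date_key"], rows_2000[0]["quarter_name"])
```

The day and month names come from strftime and therefore follow the process locale, so a production load would probably take them from a fixed lookup instead.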

Hardware & I/O considerations:

Overview of Hardware and I/O Considerations in Data Warehouses

I/O performance should always be a key consideration for data warehouse designers and

administrators. The typical workload in a data warehouse is especially I/O intensive, with operations

such as large data loads and index builds, creation of materialized views, and queries over large

volumes of data. The underlying I/O system for a data warehouse should be designed to meet these

heavy requirements.

In fact, one of the leading causes of performance issues in a data warehouse is poor I/O configuration.

Database administrators who have previously managed other systems will likely need to pay more

careful attention to the I/O configuration for a data warehouse than they may have previously done for

other environments.

This chapter provides the following five high-level guidelines for data-warehouse I/O configurations:

Configure I/O for Bandwidth not Capacity

Stripe Far and Wide

Use Redundancy

Test the I/O System Before Building the Database

Plan for Growth

The I/O configuration used by a data warehouse will depend on the characteristics of the specific

storage and server capabilities, so the material in this chapter is only intended to provide guidelines for

designing and tuning an I/O system.

Configure I/O for Bandwidth not Capacity

Storage configurations for a data warehouse should be chosen based on the I/O bandwidth that they

can provide, and not necessarily on their overall storage capacity. Buying storage based solely on

capacity can be a mistake, especially for systems smaller than 500GB in total size.

The capacity of individual disk drives is growing faster than the I/O throughput rates provided by those

disks, leading to a situation in which a small number of disks can store a large volume of data, but

cannot provide the same I/O throughput as a larger number of small disks.

As an example, consider a 200GB data mart. Using 72GB drives, this data mart could be built with as

few as six drives in a fully-mirrored environment. However, six drives might not provide enough I/O

bandwidth to handle a medium number of concurrent users on a 4-CPU server. Thus, even though six

drives provide sufficient storage, a larger number of drives may be required to provide acceptable

performance for this system.

While it may not be practical to estimate the I/O bandwidth that will be required by a data warehouse

before a system is built, it is generally practical with the guidance of the hardware manufacturer to

estimate how much I/O bandwidth a given server can potentially utilize, and ensure that the selected

I/O configuration will be able to successfully feed the server. There are many variables in sizing the I/O

systems, but one basic rule of thumb is that your data warehouse system should have multiple disks for

each CPU (at least two disks for each CPU at a bare minimum) in order to achieve optimal performance.
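
The arithmetic behind the 200GB example and the disks-per-CPU rule of thumb can be checked with a short sketch. The function names are hypothetical, and the second function encodes only the bare-minimum rule quoted above, not a real bandwidth model.

```python
import math

def drives_for_capacity(data_gb, drive_gb, mirrored=True):
    """Minimum drive count that merely holds the data."""
    data_drives = math.ceil(data_gb / drive_gb)
    return data_drives * 2 if mirrored else data_drives

def drives_for_cpus(cpu_count, disks_per_cpu=2):
    """Rule-of-thumb floor: at least `disks_per_cpu` disks for each CPU."""
    return cpu_count * disks_per_cpu

capacity_drives = drives_for_capacity(200, 72)  # the 200GB data mart on 72GB drives
cpu_floor = drives_for_cpus(4)                  # the 4-CPU server
print(capacity_drives, cpu_floor, max(capacity_drives, cpu_floor))
```

Sizing by capacity alone yields six drives, which is below even the bare-minimum floor of eight that the rule of thumb gives for a 4-CPU server.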

Stripe Far and Wide

The guiding principle in configuring an I/O system for a data warehouse is to maximize I/O bandwidth

by having multiple disks and channels access each database object. You can do this by striping the

datafiles of the Oracle Database. A striped file is a file distributed across multiple disks. This striping can

be managed by software (such as a logical volume manager), or within the storage hardware. The goal

is to ensure that each tablespace is striped across a large number of disks (ideally, all of the disks) so

that any database object can be accessed with the highest possible I/O bandwidth.
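
How striping spreads consecutive blocks of a file over the disks can be illustrated with a simple round-robin mapping. This is a conceptual sketch only: real volume managers and storage arrays use their own layouts, and the function stripe_location is hypothetical.

```python
def stripe_location(logical_block, disk_count, stripe_blocks=1):
    """Map a logical block to (disk index, block offset on that disk).

    Blocks are dealt out round-robin in units of `stripe_blocks`.
    """
    stripe_index, within = divmod(logical_block, stripe_blocks)
    disk = stripe_index % disk_count
    offset = (stripe_index // disk_count) * stripe_blocks + within
    return disk, offset

layout = [stripe_location(b, disk_count=4) for b in range(8)]
print(layout)
```

A sequential scan of eight blocks touches all four disks twice, so every disk contributes bandwidth to the same read.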

Use Redundancy

Because data warehouses are often the largest database systems in a company, they have the most

disks and thus are also the most susceptible to the failure of a single disk. Therefore, disk redundancy

is a requirement for data warehouses to protect against a hardware failure. Like disk-striping,

redundancy can be achieved in many ways using software or hardware.

A key consideration is that redundancy must sometimes be balanced against performance.

For example, a storage system in a RAID-5 configuration may be less expensive than a RAID-0+1

configuration, but it may not perform as well, either. Redundancy is necessary for any data warehouse,

but the approach to redundancy may vary depending upon the performance and cost constraints of

each data warehouse.

Test the I/O System Before Building the Database

The most important time to examine and tune the I/O system is before the database is even created.

Once the database files are created, it is more difficult to reconfigure the files. Some logical volume

managers may support dynamic reconfiguration of files, while other storage configurations may require

that files be entirely rebuilt in order to reconfigure their I/O layout. In both cases, considerable system

resources must be devoted to this reconfiguration.

When creating a data warehouse on a new system, the I/O bandwidth should be tested before creating

all of the database datafiles to validate that the expected I/O levels are being achieved. On most

operating systems, this can be done with simple scripts to measure the performance of reading and

writing large test files.
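
A simple script of the kind mentioned above might look like the following sketch. The file name, the 8 MB default size, and the function measure_throughput are illustrative. A meaningful test would use files much larger than system memory, because the operating system cache otherwise inflates the read figure.

```python
import os
import tempfile
import time

def measure_throughput(directory, size_mb=8, chunk_mb=1):
    """Write then read a test file; return (write_mb_per_s, read_mb_per_s)."""
    chunk = os.urandom(chunk_mb * 1024 * 1024)
    path = os.path.join(directory, "io_test.bin")
    try:
        start = time.perf_counter()
        with open(path, "wb") as f:
            for _ in range(size_mb // chunk_mb):
                f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # force the data to the device before stopping the clock
        write_s = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, "rb") as f:
            while f.read(len(chunk)):
                pass
        read_s = time.perf_counter() - start
    finally:
        if os.path.exists(path):
            os.remove(path)
    return size_mb / write_s, size_mb / read_s

with tempfile.TemporaryDirectory() as scratch:
    write_rate, read_rate = measure_throughput(scratch)
print(f"write {write_rate:.0f} MB/s, read {read_rate:.0f} MB/s")
```

Pointing the directory argument at each candidate volume in turn gives a rough comparison before any datafiles are created.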

Plan for Growth

A data warehouse designer should plan for future growth of a data warehouse. There are many

approaches to handling the growth in a system, and the key consideration is to be able to grow the I/O system without compromising on the I/O bandwidth. You cannot, for example, add four disks to an

existing system of 20 disks, and grow the database by adding a new tablespace striped across only the

four new disks. A better solution would be to add new tablespaces striped across all 24 disks, and over

time also convert the existing tablespaces striped across 20 disks to be striped across all 24 disks.

Storage Management

Two features to consider for managing disks are Oracle Managed Files and Automatic Storage

Management. Without these features, a database administrator must manage the database files, which,

in a data warehouse, can be hundreds or even thousands of files. Oracle Managed Files simplifies the

administration of a database by providing functionality to automatically create and manage files, so the

database administrator no longer needs to manage each database file. Automatic Storage Management

provides additional functionality for managing not only files but also the disks. With Automatic Storage Management, the database administrator would administer a small number of disk groups. Automatic

Storage Management handles the tasks of striping and providing disk redundancy, including rebalancing

the database files when new disks are added to the system.

Data parallelism:

Data parallelism (also known as loop-level parallelism) is a form of parallelization of computing across

multiple processors in parallel computing environments. Data parallelism focuses on distributing the

data across different parallel computing nodes. It contrasts with task parallelism, another form of parallelism.


In a multiprocessor system executing a single set of instructions (SIMD), data parallelism is achieved

when each processor performs the same task on different pieces of distributed data. In some situations,

a single execution thread controls operations on all pieces of data. In others, different threads control

the operation, but they execute the same code.

For instance, if we are running code on a 2-processor system (CPUs A and B) in a parallel environment,

and we wish to do a task on some data D, it is possible to tell CPU A to do that task on one part of D

and CPU B on another part simultaneously, thereby reducing the runtime of the execution. The data can

be assigned using conditional statements. As a specific example, consider adding two matrices. In a

data parallel implementation, CPU A could add all elements from the top half of the matrices, while CPU

B could add all elements from the bottom half of the matrices. Since the two processors work in parallel, the matrix addition would ideally take about half the time of performing the same

operation in serial using one CPU alone, ignoring the overhead of dividing the work and combining the results.
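
The matrix example can be sketched with two workers that run the same function on different row slices. Threads are used here only to show the decomposition. In CPython they do not execute Python bytecode simultaneously, so this sketch would not show the speedup described above, which would need processes or native code.

```python
from concurrent.futures import ThreadPoolExecutor

def add_rows(a_rows, b_rows):
    """The single task every worker runs: element-wise addition of its rows."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a_rows, b_rows)]

def parallel_matrix_add(a, b, workers=2):
    """Split the rows into `workers` contiguous slices and add them concurrently."""
    n = len(a)
    bounds = [(i * n // workers, (i + 1) * n // workers) for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(add_rows, a[lo:hi], b[lo:hi]) for lo, hi in bounds]
        # Collect the slices in submission order to rebuild the full matrix.
        return [row for fut in futures for row in fut.result()]

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[10, 20], [30, 40], [50, 60], [70, 80]]
result = parallel_matrix_add(a, b)
print(result)
```

With two workers, the first handles the top half of both matrices and the second the bottom half, matching the roles of CPU A and CPU B in the text.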

Data parallelism emphasizes the distributed (parallelized) nature of the data, as opposed to the

processing (task parallelism). Most real programs fall somewhere on a continuum between task

parallelism and data parallelism.

Data Extraction, Transformation, and Loading Techniques

"Data Warehouse Design Considerations," discussed the use of dimensional modeling to design

databases for data warehousing. In contrast to the complex, highly normalized, entity-relationship

schemas of online transaction processing (OLTP) databases, data warehouse schemas are simple and

denormalized. Regardless of the specific design or technology used in a data warehouse, its

implementation must include mechanisms to migrate data into the data warehouse database. This

process of data migration is generally referred to as the extraction, transformation, and loading (ETL) process.


Some data warehouse experts add an additional term—management—to ETL, expanding it to ETLM.

Others use the M to mean meta data. Both refer to the management of the data as it flows into the

data warehouse and is used in the data warehouse. The information used to manage data consists of

data about data, which is the definition of meta data.

The topics in this chapter describe the elements of the ETL process and provide examples of procedures

that address common ETL issues such as managing surrogate keys, slowly changing dimensions, and

meta data.

The code examples in this chapter are also available on the SQL Server 2000 Resource Kit CD-ROM, in

the file \Docs\ChapterCode\CH19Code.txt. For more information, see Chapter 39, "Tools, Samples,

eBooks, and More."


During the ETL process, data is extracted from an OLTP database, transformed to match the data

warehouse schema, and loaded into the data warehouse database. Many data warehouses also

incorporate data from non-OLTP systems, such as text files, legacy systems, and spreadsheets; such

data also requires extraction, transformation, and loading.

In its simplest form, ETL is the process of copying data from one database to another. This simplicity is

rarely, if ever, found in data warehouse implementations; in reality, ETL is often a complex combination

of process and technology that consumes a significant portion of the data warehouse development

efforts and requires the skills of business analysts, database designers, and application developers.

When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical

implementation. ETL systems vary from data warehouse to data warehouse and even between

department data marts within a data warehouse. A monolithic application, regardless of whether it is

implemented in Transact-SQL or a traditional programming language, does not provide the flexibility for

change necessary in ETL systems. A mixture of tools and technologies should be used to develop

applications that each perform a specific ETL task.

The ETL process is not a one-time event; new data is added to a data warehouse periodically. Typical

periodicity may be monthly, weekly, daily, or even hourly, depending on the purpose of the data

warehouse and the type of business it serves. Because ETL is an integral, ongoing, and recurring part of

a data warehouse, ETL processes must be automated and operational procedures documented. ETL also changes and evolves as the data warehouse evolves, so ETL processes must be designed for ease of

modification. A solid, well-designed, and documented ETL system is necessary for the success of a data

warehouse project.

Data warehouses evolve to improve their service to the business and to adapt to changes in business

processes and requirements. Business rules change as the business reacts to market influences—the data warehouse must respond in order to maintain its value as a tool for decision makers. The ETL

implementation must adapt as the data warehouse evolves.

Microsoft® SQL Server™ 2000 provides significant enhancements to existing performance and

capabilities, and introduces new features that make the development, deployment, and maintenance of ETL processes easier and simpler, and their performance faster.

ETL Functional Elements

Regardless of how they are implemented, all ETL systems have a common purpose: they move data

from one database to another. Generally, ETL systems move data from OLTP systems to a data

warehouse, but they can also be used to move data from one data warehouse to another. An ETL

system consists of four distinct functional elements:

• Extraction

• Transformation

• Loading

• Meta data


Extraction

The ETL extraction element is responsible for extracting data from the source system. During extraction, data may be removed from the source system, or a copy may be made and the original data retained

in the source system. It is common to move historical data that accumulates in an operational OLTP

system to a data warehouse to maintain OLTP performance and efficiency. Legacy systems may require too much effort

to implement such offload processes, so legacy data is often copied into the data warehouse, leaving

the original data in place. Extracted data is loaded into the data warehouse staging area (a relational

database usually separate from the data warehouse database), for manipulation by the remaining ETL processes.


Data extraction is generally performed within the source system itself, especially if it is a relational

database to which extraction procedures can easily be added. It is also possible for the extraction logic

to exist in the data warehouse staging area and query the source system for data using ODBC, OLE DB,

or other APIs. For legacy systems, the most common method of data extraction is for the legacy system

to produce text files, although many newer systems offer direct query APIs or accommodate access

through ODBC or OLE DB.

Data extraction processes can be implemented using Transact-SQL stored procedures, Data

Transformation Services (DTS) tasks, or custom applications developed in programming or scripting languages.



Transformation

The ETL transformation element is responsible for data validation, data accuracy, data type conversion,

and business rule application. It is the most complicated of the ETL elements. It may appear to be more

efficient to perform some transformations as the data is being extracted (inline transformation);

however, an ETL system that uses inline transformations during extraction is less robust and flexible

than one that confines transformations to the transformation element. Transformations performed in

the OLTP system impose a performance burden on the OLTP database. They also split the

transformation logic between two ETL elements and add maintenance complexity when the ETL logic changes.


Tools used in the transformation element vary. Some data validation and data accuracy checking can be

accomplished with straightforward Transact-SQL code. More complicated transformations can be

implemented using DTS packages. The application of complex business rules often requires the

development of sophisticated custom applications in various programming languages. You can use DTS

packages to encapsulate multi-step transformations into a single task.

Listed below are some basic examples that illustrate the types of transformations performed by this element:


Data Validation

Check that all rows in the fact table match rows in dimension tables to enforce data integrity.

Data Accuracy

Ensure that fields contain appropriate values, such as only "off" or "on" in a status field.

Data Type Conversion

Ensure that all values for a specified field are stored the same way in the data warehouse regardless of

how they were stored in the source system. For example, if one source system stores "off" or "on" in its

status field and another source system stores "0" or "1" in its status field, then a data type conversion

transformation converts the content of one or both of the fields to a specified common value such as

"off" or "on".

Business Rule Application

Ensure that the rules of the business are enforced on the data stored in the warehouse. For example,

check that all customer records contain values for both FirstName and LastName fields.
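
The four kinds of transformation can be sketched as small, separate functions applied to staged rows. The field names FirstName, LastName, and the status values come from the examples above. The dictionary row format, the customer_key field, and the function names are assumptions made for illustration.

```python
STATUS_MAP = {"0": "off", "1": "on", "off": "off", "on": "on"}  # assumed encodings

def convert_status(raw):
    """Data type conversion: map every source encoding to 'off' or 'on'."""
    return STATUS_MAP[str(raw).strip().lower()]

def check_accuracy(row):
    """Data accuracy: the status field may hold only 'off' or 'on'."""
    return row["status"] in ("off", "on")

def check_business_rule(row):
    """Business rule: customer records need both FirstName and LastName."""
    return bool(row.get("FirstName")) and bool(row.get("LastName"))

def orphan_facts(fact_rows, dimension_keys):
    """Data validation: fact rows whose key matches no dimension row."""
    return [f for f in fact_rows if f["customer_key"] not in dimension_keys]

staged = [
    {"customer_key": 1, "FirstName": "Ada", "LastName": "Lovelace", "status": 1},
    {"customer_key": 2, "FirstName": "Alan", "LastName": "", "status": "ON"},
]
for row in staged:
    row["status"] = convert_status(row["status"])

rejected = [r for r in staged if not (check_accuracy(r) and check_business_rule(r))]
orphans = orphan_facts([{"customer_key": 1}, {"customer_key": 7}], {1, 2})
print(len(rejected), orphans)
```

Keeping each check in its own function reflects the preference for confining transformation logic to one element, where it can be changed without touching extraction or loading.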


Loading

The ETL loading element is responsible for loading transformed data into the data warehouse database.

Data warehouses are usually updated periodically rather than continuously, and large numbers of

records are often loaded to multiple tables in a single data load. The data warehouse is often taken

offline during update operations so that data can be loaded faster and SQL Server 2000 Analysis

Services can update OLAP cubes to incorporate the new data. BULK INSERT, bcp, and the Bulk Copy

API are the best tools for data loading operations. The design of the loading element should focus on

efficiency and performance to minimize the data warehouse offline time. For more information and

details about performance tuning, see Chapter 20, "RDBMS Performance Tuning Guide for Data


Meta Data

The ETL meta data functional element is responsible for maintaining information (meta data) about the

movement and transformation of data, and the operation of the data warehouse. It also documents the

data mappings used during the transformations. Meta data logging provides possibilities for automated

administration, trend prediction, and code reuse.

Examples of data warehouse meta data that can be recorded and used to analyze the activity and

performance of a data warehouse include:

• Data Lineage, such as the time that a particular set of records was loaded into the data warehouse.

• Schema Changes, such as changes to table definitions.

• Data Type Usage, such as identifying all tables that use the "Birthdate" user-defined data type.

• Transformation Statistics, such as the execution time of each stage of a

transformation, the number of rows processed by the transformation, the last time the transformation was executed, and so on.

• DTS Package Versioning, which can be used to view, branch, or retrieve any historical version of a particular DTS package.

• Data Warehouse Usage Statistics, such as query times for reports.

ETL Design Considerations

Regardless of their implementation, a number of design considerations are common to all ETL systems:


ETL systems should contain modular elements that perform discrete tasks. This encourages reuse and

makes them easy to modify when implementing changes in response to business and data warehouse

changes. Monolithic systems should be avoided.


ETL systems should guarantee consistency of data when it is loaded into the data warehouse. An entire

data load should be treated as a single logical transaction—either the entire data load is successful or



the entire load is rolled back. In some systems, the load is a single physical transaction, whereas in

others it is a series of transactions. Regardless of the physical implementation, the data load should be

treated as a single logical transaction.
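
The all-or-nothing behavior described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table sales_fact, its columns, and the helper load_batch are hypothetical names, and a production warehouse load would use the bulk tools of its own database rather than SQLite.

```python
import sqlite3

def load_batch(conn, rows):
    """Insert every row or none: the whole batch is one transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO sales_fact (sku, qty) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.Error:
        return False

# Hypothetical target table used only for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sku TEXT NOT NULL, qty INTEGER NOT NULL)")

ok = load_batch(conn, [("A1", 3), ("B2", 5)])
# The second row violates NOT NULL, so the first row of this batch is rolled back too.
bad = load_batch(conn, [("C3", 1), ("D4", None)])
row_count = conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0]
```

After both calls only the two rows of the first batch remain in the table; the partially inserted second batch leaves no trace.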


ETL systems should be developed to meet the needs of the data warehouse and to accommodate the

source data environments. It may be appropriate to accomplish some transformations in text files and

some on the source data system; others may require the development of custom applications. A variety

of technologies and techniques can be applied, using the tool most appropriate to the individual task of

each ETL functional element.


ETL systems should be as fast as possible. Ultimately, the time window available for ETL processing is

governed by data warehouse and source system schedules. Some data warehouse elements may have

a huge processing window (days), while others may have a very limited processing window (hours).

Regardless of the time available, it is important that the ETL system execute as rapidly as possible.


ETL systems should be able to work with a wide variety of data in different formats. An ETL system that

only works with a single type of source data is of limited use.

Meta Data Management

ETL systems are arguably the single most important source of meta data about both the data in the

data warehouse and data in the source system. Finally, the ETL process itself generates useful meta

data that should be retained and analyzed regularly. Meta data is discussed in greater detail later in this


ETL Architectures

Before discussing the physical implementation of ETL systems, it is important to understand the

different ETL architectures and how they relate to each other. Essentially, ETL systems can be classified

in two architectures: the homogenous architecture and the heterogeneous architecture.

Homogenous Architecture

A homogenous architecture for an ETL system is one that involves only a single source system and a

single target system. Data flows from the single source of data through the ETL processes and is loaded

into the data warehouse, as shown in the following diagram.



Most homogenous ETL architectures have the following characteristics:

• Single data source: Data is extracted from a single source system, such as an OLTP system.

• Rapid development: The development effort required to extract the data is straightforward because there is only one data format for each record type.

• Light data transformation: No data transformations are required to achieve consistency among disparate data formats, and the incoming data is often in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs and other formatting transformations.

• Light structural transformation: Because the data comes from a single source, the amount of structural changes such as table alteration is also very light. The structural changes typically involve denormalization efforts to meet data warehouse schema requirements.

• Simple research requirements: The research efforts to locate data are generally simple: if the data is in the source system, it can be used. If it is not, it cannot.

The homogeneous ETL architecture is generally applicable to data marts, especially those focused on a single subject matter.

Heterogeneous Architecture

A heterogeneous architecture for an ETL system is one that extracts data from multiple sources, as

shown in the following diagram. The complexity of this architecture arises from the fact that data from

more than one source must be merged, rather than from the fact that data may be formatted

differently in the different sources. However, significantly different storage formats and database

schemas do provide additional complications.



Most heterogeneous ETL architectures have the following characteristics:

• Multiple data sources.

• More complex development: The development effort required to extract the data is increased because there are multiple source data formats for each record type.

• Significant data transformation: Data transformations are required to achieve consistency among disparate data formats, and the incoming data is often not in a format usable in the data warehouse. Transformations in this architecture typically involve replacing NULLs, additional data formatting, data conversions, lookups, computations, and referential integrity verification. Precomputed calculations may require combining data from multiple sources, or data that has multiple degrees of granularity, such as allocating shipping costs to individual line items.

• Significant structural transformation: Because the data comes from multiple sources, the amount of structural changes, such as table alteration, is significant.

• Substantial research requirements to identify and match data elements.

Heterogeneous ETL architectures are found more often in data warehouses than in data marts.

ETL Development

ETL development consists of two general phases: identifying and mapping data, and developing functional element implementations. Both phases should be carefully documented and stored in a

central, easily accessible location, preferably in electronic form.



Identify and Map Data

This phase of the development process identifies sources of data elements, the targets for those data

elements in the data warehouse, and the transformations that must be applied to each data element as

it is migrated from its source to its destination. High level data maps should be developed during the

requirements gathering and data modeling phases of the data warehouse project. During the ETL system design and development process, these high level data maps are extended to thoroughly specify system details.

Identify Source Data

For some systems, identifying the source data may be as simple as identifying the server where the

data is stored in an OLTP database and the storage type (SQL Server database, Microsoft Excel

spreadsheet, or text file, among others). In other systems, identifying the source may mean preparing

a detailed definition of the meaning of the data, such as a business rule, a definition of the data itself,

such as decoding rules (O = On, for example), or even detailed documentation of a source system for

which the system documentation has been lost or is not current.

Identify Target Data

Each data element is destined for a target in the data warehouse. A target for a data element may be

an attribute in a dimension table, a numeric measure in a fact table, or a summarized total in an

aggregation table. There may not be a one-to-one correspondence between a source data element and

a data element in the data warehouse because the destination system may not contain the data at the

same granularity as the source system. For example, a retail client may decide to roll data up to the

SKU level by day rather than track individual line item data. The level of item detail that is stored in the

fact table of the data warehouse is called the grain of the data. If the grain of the target does not match the grain of the source, the data must be summarized as it moves from the source to the target.
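
As a rough illustration of summarizing to a coarser grain, the following Python sketch rolls individual line items up to one row per SKU per day. The tuple layout (order id, date, SKU, quantity, amount) and the function name roll_up_to_sku_day are assumptions made for this example only.

```python
from collections import defaultdict

def roll_up_to_sku_day(line_items):
    """Summarize (order_id, date, sku, qty, amount) rows to one row per (date, sku)."""
    totals = defaultdict(lambda: [0, 0.0])
    for _order_id, date, sku, qty, amount in line_items:
        totals[(date, sku)][0] += qty
        totals[(date, sku)][1] += amount
    return {key: (qty, round(amount, 2)) for key, (qty, amount) in sorted(totals.items())}

# Hypothetical source rows at line-item grain.
line_items = [
    (1001, "2024-01-05", "SKU-1", 2, 20.00),
    (1002, "2024-01-05", "SKU-1", 1, 10.00),
    (1002, "2024-01-05", "SKU-2", 4, 8.00),
    (1003, "2024-01-06", "SKU-1", 3, 30.00),
]
daily = roll_up_to_sku_day(line_items)
```

Four source rows become three target rows; the order identifier is dropped because it has no meaning at the SKU-by-day grain.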

Map Source Data to Target Data

A data map defines the source fields of the data, the destination fields in the data warehouse and any

data modifications that need to be accomplished to transform the data into the desired format for the data warehouse. Some transformations require aggregating the source data to a coarser granularity,

such as summarizing individual item sales into daily sales by SKU. Other transformations involve

altering the source data itself as it moves from the source to the target. Some transformations decode

data into human readable form, such as replacing "1" with "on" and "0" with "off" in a status field. If

two source systems encode data destined for the same target differently (for example, a second source

system uses Yes and No for status), a separate transformation for each source system must be defined. Transformations must be documented and maintained in the data maps. The relationship between the

source and target systems is maintained in a map that is referenced to execute the transformation of the data before it is loaded in the data warehouse.
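
One possible shape for such a data map is sketched below in Python. The source system names (oltp_a, oltp_b), the field names, and the transform function are hypothetical; the point is only that each source carries its own decoding rule for the same target field.

```python
# Hypothetical data map: one entry per (source system, source field).
DATA_MAP = {
    ("oltp_a", "status_cd"): {"target": "status", "decode": {"1": "on", "0": "off"}},
    ("oltp_b", "status"): {"target": "status", "decode": {"Yes": "on", "No": "off"}},
}

def transform(source, record):
    """Apply the mapped decoding rules; unmapped fields are not loaded."""
    out = {}
    for field, value in record.items():
        rule = DATA_MAP.get((source, field))
        if rule is None:
            continue  # field has no target in the warehouse
        decode = rule["decode"]
        if value not in decode:
            raise ValueError(f"unmapped value {value!r} for {source}.{field}")
        out[rule["target"]] = decode[value]
    return out

from_a = transform("oltp_a", {"status_cd": "1"})
from_b = transform("oltp_b", {"status": "No", "internal_flag": "x"})
```

Raising an error on an unknown code, rather than passing it through, is a design choice of this sketch: it surfaces gaps in the data map before they reach the warehouse.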

Develop Functional Elements

Design and implementation of the four ETL functional elements, Extraction, Transformation, Loading,



and meta data logging, vary from system to system. There will often be multiple versions of each

functional element.

Each functional element contains steps that perform individual tasks, which may execute on one of

several systems, such as the OLTP or legacy systems that contain the source data, the staging area

database, or the data warehouse database. Various tools and techniques may be used to implement the

steps in a single functional area, such as Transact-SQL, DTS packages, or custom applications

developed in a programming language such as Microsoft Visual Basic®. Steps that are discrete in one functional element may be combined in another.


The extraction element may have one version to extract data from one OLTP data source, a different

version for a different OLTP data source, and multiple versions for legacy systems and other sources of

data. This element may include tasks that execute SELECT queries from the ETL staging database

against a source OLTP system, or it may execute some tasks on the source system directly and others in the staging database, as in the case of generating a flat file from a legacy system and then importing

it into tables in the ETL database. Regardless of methods or number of steps, the extraction element is

responsible for extracting the required data from the source system and making it available for

processing by the next element.


Frequently a number of different transformations, implemented with various tools or techniques, are

required to prepare data for loading into the data warehouse. Some transformations may be performed

as data is extracted, such as an application on a legacy system that collects data from various internal

files as it produces a text file of data to be further transformed. However, transformations are best

accomplished in the ETL staging database, where data from several data sources may require varying

transformations specific to the incoming data organization and format.

Data from a single data source usually requires different transformations for different portions of the

incoming data. Fact table data transformations may include summarization, and will always require

surrogate dimension keys to be added to the fact records. Data destined for dimension tables in the

data warehouse may require one process to accomplish one type of update to a changing dimension

and a different process for another type of update.
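
The addition of surrogate dimension keys to fact records might look roughly like the following Python sketch. The field names customer_no and customer_key are hypothetical, and the sketch assumes that existing surrogate keys are the contiguous integers 1..n, so that the next key is simply the current count plus one.

```python
def assign_surrogate_keys(fact_rows, customer_dim):
    """Replace the business key customer_no with a surrogate customer_key.

    customer_dim maps business key -> surrogate key and is extended in place
    (assumption: keys are contiguous integers starting at 1).
    """
    out = []
    for row in fact_rows:
        business_key = row["customer_no"]
        if business_key not in customer_dim:
            customer_dim[business_key] = len(customer_dim) + 1  # next surrogate key
        new_row = {k: v for k, v in row.items() if k != "customer_no"}
        new_row["customer_key"] = customer_dim[business_key]
        out.append(new_row)
    return out

customer_dim = {"C-100": 1}
facts = [
    {"customer_no": "C-100", "amount": 25.0},
    {"customer_no": "C-200", "amount": 40.0},
    {"customer_no": "C-200", "amount": 5.0},
]
keyed_facts = assign_surrogate_keys(facts, customer_dim)
```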

Transformations may be implemented using Transact-SQL, as is demonstrated in the code examples

later in this chapter, DTS packages, or custom applications.

Regardless of the number and variety of transformations and their implementations, the transformation

element is responsible for preparing data for loading into the data warehouse.


The loading element typically has the least variety of task implementations. After the data from the

various data sources has been extracted, transformed, and combined, the loading operation consists of

inserting records into the various data warehouse database dimension and fact tables. Implementation



may vary in the loading tasks, such as using BULK INSERT, bcp, or the Bulk Copy API. The loading

element is responsible for loading data into the data warehouse database tables.

Meta Data Logging

Meta data is collected from a number of the ETL operations. The meta data logging implementation for

a particular ETL task will depend on how the task is implemented. For a task implemented by using a

custom application, the application code may produce the meta data. For tasks implemented by using

Transact-SQL, meta data can be captured with Transact-SQL statements in the task processes. The

meta data logging element is responsible for capturing and recording meta data that documents the

operation of the ETL functional areas and tasks, which includes identification of data that moves

through the ETL system as well as the efficiency of ETL tasks.
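
For a task implemented as a custom application, meta data capture could be as simple as the Python sketch below. The wrapper run_with_logging and the keys of the log record are illustrative assumptions, not part of any particular ETL product; the record holds the kind of transformation statistics discussed earlier (rows processed, execution time, success or failure).

```python
import time

def run_with_logging(task_name, task, rows, log):
    """Run one ETL task and append a meta data record describing the run."""
    start = time.perf_counter()
    try:
        result = task(rows)
        status = "success"
    except Exception as exc:
        result, status = None, f"failure: {exc}"
    log.append({
        "task": task_name,
        "status": status,
        "rows_in": len(rows),
        "rows_out": len(result) if result is not None else 0,
        "seconds": time.perf_counter() - start,
    })
    return result

log = []
cleaned = run_with_logging(
    "drop_nulls", lambda rows: [r for r in rows if r is not None], [1, None, 3], log
)
failed = run_with_logging("divide", lambda rows: 1 / 0, [1], log)
```

In a real system the log list would be replaced by a meta data table, but the recorded fields would be much the same.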

Common Tasks

Each ETL functional element should contain tasks that perform the following functions, in addition to

tasks specific to the functional area itself:

Confirm Success or Failure. A confirmation should be generated on the success or failure of the

execution of the ETL processes. Ideally, this mechanism should exist for each task so that rollback

mechanisms can be implemented to allow for incremental responses to errors.

Scheduling. ETL tasks should include the ability to be scheduled for execution. Scheduling mechanisms reduce repetitive manual operations and allow for maximum use of system resources during recurring periods of low activity.

Data Mining

Data Mining is an analytic process designed to explore data (usually large amounts of data - typically business or market related) in search of consistent patterns and/or systematic relationships between

variables, and then to validate the findings by applying the detected patterns to new subsets of data.

The ultimate goal of data mining is prediction - and predictive data mining is the most common

type of data mining and one that has the most direct business applications. The process of data mining

consists of three stages: (1) the initial exploration, (2) model building or pattern identification with

validation/verification, and (3) deployment (i.e., the application of the model to new data in

order to generate predictions).

Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning data, data transformations, selecting subsets of records and - in case of data sets with large numbers of variables ("fields") - performing some preliminary feature selection operations to bring the number of variables to a manageable range (depending on the statistical methods which are being considered). Then, depending on the nature of the analytic problem, this first stage of the process of data mining may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or the general nature of models that can be taken into account in the next stage.

Stage 2: Model building and validation. This stage involves considering various models and

choosing the best one based on their predictive performance (i.e., explaining the variability in question

and producing stable results across samples). This may sound like a simple operation, but in fact, it

sometimes involves a very elaborate process. There are a variety of techniques developed to achieve

that goal - many of which are based on so-called "competitive evaluation of models," that is, applying

different models to the same data set and then comparing their performance to choose the best. These

techniques - which are often considered the core of predictive data mining - include: Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.

Stage 3: Deployment. That final stage involves using the model selected as best in the previous stage

and applying it to new data in order to generate predictions or estimates of the expected outcome.

The concept of Data Mining is becoming increasingly popular as a business information management

tool where it is expected to reveal knowledge structures that can guide decisions in conditions of limited

certainty. Recently, there has been increased interest in developing new analytic techniques specifically

designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but

Data Mining is still based on the conceptual principles of statistics including the traditional Exploratory

Data Analysis (EDA) and modeling and it shares with them both some components of its general

approaches and specific techniques.

However, an important general difference in the focus and purpose between Data Mining and the

traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards

applications than the basic nature of the underlying phenomena. In other words, Data Mining is

relatively less concerned with identifying the specific relations between the involved variables. For

example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables are not the main goal of Data Mining. Instead, the focus is

on producing a solution that can generate useful predictions. Therefore, Data Mining accepts among

others a "black box" approach to data exploration or knowledge discovery and uses not only the

traditional Exploratory Data Analysis (EDA) techniques, but also such techniques as Neural Networks which can generate valid predictions but are not capable of identifying the specific nature of

the interrelations between the variables on which the predictions are based.

Data Mining is often considered to be "a blend of statistics, AI [artificial intelligence], and data base research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of

interest for statisticians, and was even considered by some "a dirty word in Statistics" (Pregibon, 1997,

p. 8). Due to its applied importance, however, the field emerges as a rapidly growing and major area

(also in statistics) where important theoretical advances are being made (see, for example, the recent

annual International Conferences on Knowledge Discovery and Data Mining, co-hosted by the American

Statistical Association).

For information on Data Mining techniques, please review the summary topics included below in this

chapter of the Electronic Statistics Textbook. There are numerous books that review the theory and

practice of data mining; the following books offer a representative sample of recent general books on



data mining, representing a variety of approaches and perspectives:

Berry, M. J. A., & Linoff, G. S. (2000). Mastering data mining. New York: Wiley.

Edelstein, H. A. (1999). Introduction to data mining and knowledge discovery (3rd ed.). Potomac, MD: Two Crows Corp.

Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery & data mining. Cambridge, MA: MIT Press.

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. New York: Morgan Kaufmann.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.

Pregibon, D. (1997). Data mining. Statistical Computing and Graphics, 7, 8.

Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: Morgan Kaufmann.

Westphal, C., & Blaxton, T. (1998). Data mining solutions. New York: Wiley.

Witten, I. H., & Frank, E. (2000). Data mining. New York: Morgan Kaufmann.

Crucial Concepts in Data Mining

Bagging (Voting, Averaging)

The concept of bagging (voting for classification, averaging for regression-type problems with

continuous dependent variables of interest) applies to the area of predictive data mining, to

combine the predicted classifications (prediction) from multiple models, or from the same type of model

for different learning data. It is also used to address the inherent instability of results when applying

complex models to relatively small data sets. Suppose your data mining task is to build a model for

predictive classification, and the dataset from which to train the model (learning data set, which

contains observed classifications) is relatively small. You could repeatedly sub-sample (with

replacement) from the dataset, and apply, for example, a tree classifier (e.g., C&RT and CHAID) to

the successive samples. In practice, very different trees will often be grown for the different samples,

illustrating the instability of models often evident with small datasets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples, and to apply some

simple voting: The final classification is the one most often predicted by the different trees. Note that

some weighted combination of predictions (weighted vote, weighted average) is also possible, and

commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted

prediction or voting is the Boosting procedure.
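
A minimal sketch of bagging with simple voting follows, in Python. To keep it self-contained, a one-split "stump" on a single numeric predictor stands in for a full tree classifier such as C&RT or CHAID; the function names and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter

def fit_stump(sample):
    """Fit a one-split classifier on (x, label) pairs: predict 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in sample}):
        errors = sum((x > threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagged_stumps(data, n_models, seed=0):
    """Fit one stump per bootstrap sample (drawn with replacement, same size as data)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]

def bagged_predict(thresholds, x):
    """Simple voting: return the class predicted by most of the stumps."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]

# Small hypothetical learning data set: label is 1 for the larger x values.
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 1), (6, 1)]
models = bagged_stumps(data, n_models=25)
```

The individual thresholds in models differ from one bootstrap sample to the next, which is the instability the text describes; the vote over an odd number of stumps smooths it out and cannot tie.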


Boosting

The concept of boosting applies to the area of predictive data mining, to generate multiple models

or classifiers (for prediction or classification), and to derive weights to combine the predictions from

those models into a single prediction or predicted classification (see also Bagging).

A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier

such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight.



Compute the predicted classifications, and apply weights to the observations in the learning sample that

are inversely proportional to the accuracy of the classification. In other words, assign greater weight to

those observations that were difficult to classify (where the misclassification rate was high), and lower

weights to those that were easy to classify (where the misclassification rate was low). In the context of

C&RT for example, different misclassification costs (for the different classes) can be applied, inversely

proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted

data (or with different misclassification costs), and continue with the next iteration (application of the

analysis method for classification to the re-weighted data).

Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an

"expert" in classifying observations that were not well classified by those preceding it. During

deployment (for prediction or classification of new cases), the predictions from the different classifiers

can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best

prediction or classification.

Note that boosting can also be applied to learning methods that do not explicitly support weights or

misclassification costs. In that case, random sub-sampling can be applied to the learning data in the

successive steps of the iterative boosting procedure, where the probability for selection of an

observation into the subsample is inversely proportional to the accuracy of the prediction for that

observation in the previous iteration (in the sequence of iterations of the boosting procedure).
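
The iterative re-weighting can be sketched as follows in Python. The text does not fix a particular weighting formula; this sketch swaps in the exponential update of the well-known AdaBoost algorithm as one concrete choice, again uses one-split stumps in place of C&RT or CHAID, and codes the two classes as +1 and -1. All function names and the toy data are assumptions.

```python
import math

def fit_weighted_stump(xs, ys, weights):
    """Return (threshold, sign, weighted_error) with the lowest weighted error.

    The stump predicts `sign` when x > threshold and `-sign` otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if (sign if x > threshold else -sign) != y)
            if best is None or error < best[2]:
                best = (threshold, sign, error)
    return best

def boost(xs, ys, rounds):
    n = len(xs)
    weights = [1.0 / n] * n  # every observation starts with equal weight
    ensemble = []
    for _ in range(rounds):
        threshold, sign, error = fit_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # voting weight of this stump
        ensemble.append((alpha, threshold, sign))
        # Misclassified observations gain weight, correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * (sign if x > threshold else -sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x):
    """Weighted vote of all stumps in the sequence."""
    score = sum(alpha * (sign if x > t else -sign) for alpha, t, sign in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical data that no single stump can classify correctly.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]
ensemble = boost(xs, ys, rounds=3)
```

On this toy data a single stump misclassifies three observations, whereas the weighted vote of three successive stumps classifies all nine correctly, because each later stump concentrates on the cases its predecessors got wrong.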


See Models for Data Mining.

Data Preparation (in Data Mining)

Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage-in-garbage-out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic methods (e.g., via the Web) serve as the input into the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that has not been carefully screened for such problems can produce highly misleading results, in particular in predictive data mining.
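
A screening step for the two kinds of problems just mentioned might be sketched in Python as follows. The record layout (income, gender, pregnant) mirrors the examples in the text; the function name screen and the reason strings are assumptions.

```python
def screen(records):
    """Split records into clean rows and rejected rows with a reason for each."""
    clean, rejected = [], []
    for rec in records:
        if rec["income"] < 0:
            rejected.append((rec, "income out of range"))
        elif rec["gender"] == "Male" and rec["pregnant"] == "Yes":
            rejected.append((rec, "impossible combination"))
        else:
            clean.append(rec)
    return clean, rejected

records = [
    {"income": 52000, "gender": "Female", "pregnant": "Yes"},
    {"income": -100, "gender": "Female", "pregnant": "No"},
    {"income": 41000, "gender": "Male", "pregnant": "Yes"},
]
clean, rejected = screen(records)
```

Keeping the rejected rows together with a reason, instead of silently dropping them, makes it possible to trace data quality problems back to the collection method.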

Data Reduction (for Data Mining)

The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large datasets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics) or more sophisticated techniques like clustering, principal components analysis, etc.

See also predictive data mining, drill-down analysis.


Deployment

The concept of deployment in predictive data mining refers to the application of a model for

prediction or classification to new data. After a satisfactory model or set of models has been identified

(trained) for a particular application, one usually wants to deploy those models so that predictions or

predicted classifications can quickly be obtained for new data. For example, a credit card company may

want to deploy a trained model or set of models (e.g., neural networks, meta-learner) to quickly

identify transactions which have a high probability of being fraudulent.



Drill-Down Analysis

The concept of drill-down analysis applies to the area of data mining, to denote the interactive

exploration of data, in particular of large databases. The process of drill-down analyses begins by considering some simple break-downs of the data by a few variables of interest (e.g., Gender,

geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next, one may want to "drill down" to expose and further analyze the data

"underneath" one of the categorizations; for example, one might want to further review the data for

males from the mid-west. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further break-downs by other variables (e.g., income, age, etc.). At

the lowest ("bottom") level are the raw data: For example, you may want to review the addresses of

male customers from one region, for a certain income group, etc., and to offer to those customers

some particular services of particular utility to that group.
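
The successive break-downs can be illustrated with a short Python sketch. The customer records, field names, and the helper break_down are hypothetical; the summary computed for each group is just a count and a mean.

```python
from collections import defaultdict
from statistics import mean

def break_down(rows, by, measure):
    """Group rows by one variable and return (count, mean of measure) per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[by]].append(row[measure])
    return {key: (len(values), mean(values)) for key, values in groups.items()}

customers = [
    {"gender": "Male", "region": "Midwest", "income": 50},
    {"gender": "Male", "region": "Midwest", "income": 70},
    {"gender": "Male", "region": "South", "income": 40},
    {"gender": "Female", "region": "Midwest", "income": 60},
]

# Top level: a simple break-down by one variable of interest.
by_gender = break_down(customers, "gender", "income")

# Drill down "underneath" one category, then break down by another variable.
males = [r for r in customers if r["gender"] == "Male"]
males_by_region = break_down(males, "region", "income")

# Bottom level: the raw records for one combination of categories.
midwest_males = [r for r in males if r["region"] == "Midwest"]
```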

Feature Selection

One of the preliminary stages in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerized) methods, it is not uncommon that measurements are recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalized linear models, or general linear models become impractical when the number of predictors exceeds a few hundred variables.

Feature selection selects a subset of predictors from a large list of candidate predictors without

assuming that the relationships between the predictors and the dependent or outcome variables of

interest are linear, or even monotone. Therefore, this is used as a pre-processor for predictive data

mining, to select manageable sets of predictors that are likely related to the dependent (outcome)

variables of interest, for further analyses with any of the other methods for regression and


Machine Learning

Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining, to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques that are used to generate the prediction are interpretable or open to simple explanation. Good examples of this type of technique often applied to predictive data mining are neural networks or meta-learning techniques such as boosting, etc. These methods usually involve the fitting of very complex "generic" models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classification in cross validation samples.


Meta-Learning

The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).




Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear

discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications

for a cross validation sample, from which overall goodness-of-fit statistics (e.g., misclassification

rates) can be computed. Experience has shown that combining the predictions from multiple methods

often yields more accurate predictions than can be derived from any one method (e.g., see Witten and

Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which

will attempt to combine the predictions to create a final best predicted classification. So, for example,

the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s)

can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification


One can apply meta-learners to the results from different meta-learners to create "meta-meta"-

learners, and so on; however, in practice such exponential increase in the amount of data processing,

in order to derive an accurate prediction, will yield less and less marginal utility.
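
The idea of feeding base-model predictions into a meta-learner can be sketched in Python as follows. A simple lookup table stands in for the neural network meta-classifier mentioned above: for every combination of base predictions seen in a validation sample, it remembers the most frequent true class. The base models, data, and names are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_meta_learner(base_models, validation):
    """Learn, for each tuple of base predictions, the most frequent true class."""
    table = defaultdict(Counter)
    for x, label in validation:
        table[tuple(m(x) for m in base_models)][label] += 1
    # Fallback for combinations never seen in the validation sample.
    overall = Counter(label for _, label in validation).most_common(1)[0][0]
    lookup = {combo: counts.most_common(1)[0][0] for combo, counts in table.items()}

    def meta_predict(x):
        return lookup.get(tuple(m(x) for m in base_models), overall)

    return meta_predict

# Two hypothetical base classifiers, each only partly right on its own.
def model_a(x):
    return int(x > 3)

def model_b(x):
    return int(x > 6)

# Validation sample: the true class is 1 only for the middle range of x.
validation = [(1, 0), (2, 0), (4, 1), (5, 1), (7, 0), (8, 0)]
meta_predict = fit_meta_learner([model_a, model_b], validation)
```

Neither base model alone can describe the middle range, but the meta-learner learns from the validation sample that the combination "model_a says 1, model_b says 0" indicates class 1.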

Models for Data Mining

In the business environment, complex data mining projects may require the coordinated efforts of

various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize

the process of gathering data, analyzing data, disseminating results, implementing results, and

monitoring improvements.

One such model, CRISP (Cross-Industry Standard Process for data mining) was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Another approach - the Six Sigma methodology - is a well-structured, data-driven methodology for

eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery,

management, and other business activities. This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control) that grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries).


Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute called SEMMA (Sample, Explore, Modify, Model, Assess), which focuses more on the technical activities typically involved in a data mining project.

All of these models are concerned with the process of how to integrate data mining methodology into an

organization, how to "convert data into information," how to involve important stakeholders, and how to disseminate the information in a form that can easily be converted by stakeholders into resources for strategic decision making.

Some software tools for data mining are specifically designed and documented to fit into one of these

specific frameworks.

The general underlying philosophy of StatSoft's STATISTICA Data Miner is to provide a flexible data

mining workbench that can be integrated into any organization, industry, or organizational culture,

regardless of the general data mining process-model that the organization chooses to adopt. For

example, STATISTICA Data Miner can include the complete set of (specific) necessary tools for ongoing

company-wide Six Sigma quality control efforts, and users can take advantage of its (still optional)

DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into

ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either

the CRISP or SEMMA approach - it fits both of them perfectly well without favoring either one. Also,

STATISTICA Data Miner offers all the advantages of a general data-mining-oriented "development kit" that includes easy-to-use tools for incorporating into your projects not only such components as custom

database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems

of access privileges, workgroup management, and other collaborative work tools that allow you to design large scale, enterprise-wide systems (e.g., following the CRISP, SEMMA, or a combination of

both models) that involve your entire organization.

Predictive Data Mining

The term Predictive Data Mining is usually applied to identify data mining projects with the goal to

identify a statistical or neural network model or set of models that can be used to predict some

response of interest. For example, a credit card company may want to engage in predictive data

mining, to derive a (trained) model or set of models (e.g., neural networks, meta-learner) that can

quickly identify transactions which have a high probability of being fraudulent. Other types of data

mining projects may be more exploratory in nature (e.g., to identify clusters or segments of customers),

in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks).


See Models for Data Mining.

Stacked Generalization

See Stacking.



Stacking (Stacked Generalization)

The concept of stacking (short for Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of

models included in the project are very different.

Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear

discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications

for a cross validation sample, from which overall goodness-of-fit statistics (e.g., misclassification

rates) can be computed. Experience has shown that combining the predictions from multiple methods

often yields more accurate predictions than can be derived from any one method (e.g., see Witten and

Frank, 2000). In stacking, the predictions from different classifiers are used as input into a meta-learner, which attempts to combine the predictions to create a final best predicted classification. So,

for example, the predicted classifications from the tree classifiers, linear model, and the neural network

classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to

"learn" from the data how to combine the predictions from the different models to yield maximum

classification accuracy.
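The stacking idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three base classifiers are hypothetical threshold rules standing in for the tree, linear, and neural network models, the hold-out data is invented, and the meta-learner is a simple lookup table rather than the neural network meta-classifier mentioned in the text.

```python
from collections import Counter, defaultdict

# Hypothetical base classifiers (stand-ins for tree, linear, and neural models).
def tree_like(x):
    return "fraud" if x[0] > 5 else "ok"

def linear_like(x):
    return "fraud" if x[0] + x[1] > 8 else "ok"

def net_like(x):
    return "fraud" if x[1] > 4 else "ok"

BASE_MODELS = [tree_like, linear_like, net_like]

def base_predictions(x):
    """Level 0: collect one predicted label per base model."""
    return tuple(model(x) for model in BASE_MODELS)

def fit_meta_learner(holdout):
    """Level 1: learn from held-out (x, label) pairs which true label most
    often accompanies each combination of base predictions."""
    votes = defaultdict(Counter)
    for x, label in holdout:
        votes[base_predictions(x)][label] += 1
    table = {combo: counts.most_common(1)[0][0] for combo, counts in votes.items()}

    def meta_predict(x):
        combo = base_predictions(x)
        if combo in table:
            return table[combo]
        # Combination never seen in the hold-out data: fall back to majority vote.
        return Counter(combo).most_common(1)[0][0]

    return meta_predict

# Invented hold-out (cross-validation) sample: (features, true class).
HOLDOUT = [
    ((6, 5), "fraud"),
    ((6, 1), "ok"),
    ((9, 1), "ok"),
    ((1, 5), "ok"),
    ((1, 9), "ok"),
    ((2, 2), "ok"),
]

meta_predict = fit_meta_learner(HOLDOUT)
print(meta_predict((8, 2)))  # two base models say "fraud", the meta-learner has learned otherwise
```

In this toy data a plain majority vote would label (8, 2) as fraud, whereas the meta-learner has learned from the hold-out sample that this particular combination of base predictions usually accompanies a legitimate case.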

Other methods for combining the predictions from multiple models or methods (e.g., from multiple

datasets used for learning) are Boosting and Bagging (Voting).

Text Mining

While Data Mining is typically concerned with the detection of patterns in numeric data, very often

important (e.g., critical to business) information is stored in the form of text. Unlike numeric data, text is often amorphous, and difficult to deal with. Text mining generally consists of the analysis of

(multiple) text documents by extracting key phrases, concepts, etc. and the preparation of the text

processed in that manner for further analyses with numeric data mining techniques (e.g., to determine

co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
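As a rough illustration of the last step, the following Python sketch counts how often pairs of key terms appear in the same document. The key-term list and the sample documents are invented for the example, and real text mining tools use far richer extraction (stemming, phrase detection, and so on) than the simple word matching assumed here.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical key terms of interest; a real project would derive these from the text.
KEY_TERMS = {"invoice", "refund", "delay", "warranty"}

def extract_terms(document):
    """Return the key terms that occur in one document (case-insensitive)."""
    words = set(re.findall(r"[a-z]+", document.lower()))
    return words & KEY_TERMS

def cooccurrence_counts(documents):
    """Count, for every pair of key terms, the number of documents containing both."""
    counts = Counter()
    for doc in documents:
        terms = sorted(extract_terms(doc))
        counts.update(combinations(terms, 2))
    return counts

DOCS = [
    "Customer requests a refund because of a shipping delay.",
    "Refund issued; the invoice was corrected.",
    "Delay in delivery, refund requested again.",
]

PAIRS = cooccurrence_counts(DOCS)
for pair, n in PAIRS.most_common():
    print(pair, n)
```

The resulting pair counts are ordinary numeric data, so they can be passed on to the numeric data mining techniques discussed elsewhere in these notes.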

Data Transformation Services (DTS) in SQL Server 2000

Most organizations have multiple formats and locations in which data is stored. To support decision-making, improve system performance, or upgrade existing systems, data often must be moved from one data storage location to another.

Microsoft® SQL Server™ 2000 Data Transformation Services (DTS) provides a set of tools that lets you

extract, transform, and consolidate data from disparate sources into single or multiple destinations. By

using DTS tools, you can create custom data movement solutions tailored to the specialized needs of

your organization, as shown in the following scenarios:

• You have deployed a database application on an older version of SQL Server or another platform, such as Microsoft Access. A new version of your application requires SQL Server 2000, and requires you to change your database schema and convert some data types.

To copy and transform your data, you can build a DTS solution that copies database objects from the original data source into a SQL Server 2000 database, while at the same time remapping columns and changing data types. You can run this solution using DTS tools, or you can embed the solution within your application.

• You must consolidate several key Microsoft Excel spreadsheets into a SQL Server database. Several departments create the spreadsheets at the end of the month, but there is no set schedule for completion of all the spreadsheets.

To consolidate the spreadsheet data, you can build a DTS solution that runs when a message is sent to a message queue. The message triggers DTS to extract data from the spreadsheet, perform any defined transformations, and load the data into a SQL Server database.

• Your data warehouse contains historical data about your business operations, and you use Microsoft SQL Server 2000 Analysis Services to summarize the data. Your data warehouse needs to be updated nightly from your Online Transaction Processing (OLTP) database. Your OLTP system is in use 24 hours a day, and performance is critical.

You can build a DTS solution that uses the file transfer protocol (FTP) to move data files onto a local drive, loads the data into a fact table, and aggregates the data using Analysis Services. You can schedule the DTS solution to run every night, and you can use the new DTS logging options to track how long this process takes, allowing you to analyze performance over time.

What Is DTS?

DTS is a set of tools you can use to import, export, and transform heterogeneous data between one or

more data sources, such as Microsoft SQL Server, Microsoft Excel, or Microsoft Access. Connectivity is

provided through OLE DB, an open standard for data access. ODBC (Open Database Connectivity) data

sources are supported through the OLE DB Provider for ODBC.

You create a DTS solution as one or more packages. Each package may contain an organized set of

tasks that define work to be performed, transformations on data and objects, workflow constraints that

define task execution, and connections to data sources and destinations. DTS packages also provide

services, such as logging package execution details, controlling transactions, and handling global variables.


These tools are available for creating and executing DTS packages:

• The Import/Export Wizard is for building relatively simple DTS packages, and supports data migration and simple transformations.

• The DTS Designer graphically implements the DTS object model, allowing you to create DTS packages with a wide range of functionality.

• DTSRun is a command-prompt utility used to execute existing DTS packages.



• DTSRunUI is a graphical interface to DTSRun, which also allows the passing of global variables and the generation of command lines.

• SQLAgent is not a DTS application; however, it is used by DTS to schedule package execution.

Using the DTS object model, you also can create and run packages programmatically, build custom

tasks, and build custom transformations.

What's New in DTS?

Microsoft SQL Server 2000 introduces several DTS enhancements and new features:

• New DTS tasks include the FTP task, the Execute Package task, the Dynamic Properties task, and the Message Queue task.

• Enhanced logging saves information for each package execution, allowing you to

maintain a complete execution history and view information for each process within a task. You can generate exception files, which contain rows of data that could not be processed due to errors.

• You can save DTS packages as Microsoft Visual Basic® files.

• A new multiphase data pump allows advanced users to customize the operation of

data transformations at various stages. Also, you can use global variables as input parameters for queries.

• You can use parameterized source queries in DTS transformation tasks and the Execute SQL task.

• You can use the Execute Package task to dynamically assign the values of global variables from a parent package to a child package.

Using DTS Designer

DTS Designer graphically implements the DTS object model, allowing you to graphically create DTS

packages. You can use DTS Designer to:

• Create a simple package containing one or more steps.

• Create a package that includes complex workflows that include multiple steps using conditional logic, event-driven code, or multiple connections to data sources.

• Edit an existing package.

The DTS Designer interface consists of a work area for building packages, toolbars containing package

elements that you can drag onto the design sheet, and menus containing workflows and package



management commands.

Figure 1: DTS Designer interface

By dragging connections and tasks onto the design sheet, and specifying the order of execution with

workflows, you can easily build powerful DTS packages using DTS Designer. The following sections

define tasks, workflows, connections, and transformations, and illustrate the ease of using DTS

Designer to implement a DTS solution.

Tasks: Defining Steps in a Package

A DTS package usually includes one or more tasks. Each task defines a work item that may be

performed during package execution. You can use tasks to:

• Transform data



• Copy and manage data

• Run tasks as jobs from within a package



1 New in SQL Server 2000.

2 Available only when SQL Server 2000 Analysis Services is installed.

You also can create custom tasks programmatically, and then integrate them into DTS Designer using

the Register Custom Task command.

To illustrate the use of tasks, here is a simple DTS Package with two tasks: a Microsoft ActiveX® Script

task and a Send Mail task:

Figure 2: DTS Package with two tasks

The ActiveX Script task can host any ActiveX Scripting engine including Microsoft Visual Basic Scripting

Edition (VBScript), Microsoft JScript®, or ActiveState ActivePerl, which you can download from

http://www.activestate.com. The Send Mail task may send a message indicating that the package

has run. Note that there is no order to these tasks yet. When the package executes, the ActiveX Script

task and the Send Mail task run concurrently.

Workflows: Setting Task Precedence

When you define a group of tasks, there is usually an order in which the tasks should be performed.

When tasks have an order, each task becomes a step of a process. In DTS Designer, you manipulate

tasks on the DTS Designer design sheet and use precedence constraints to control the sequence in

which the tasks execute.

Precedence constraints sequentially link tasks in a package. The following table shows the types of precedence constraints you can use in DTS.

Precedence constraint: Description

On Completion (blue arrow): If you want Task 2 to wait until Task 1 completes, regardless of the outcome, link Task 1 to Task 2 with an On Completion precedence constraint.

On Success (green arrow): If you want Task 2 to wait until Task 1 has successfully completed, link Task 1 to Task 2 with an On Success precedence constraint.

On Failure (red arrow): If you want Task 2 to begin execution only if Task 1 fails to execute successfully, link Task 1 to Task 2 with an On Failure precedence constraint.

The following illustration shows the ActiveX Script task and the Send Mail task with an On Completion



precedence constraint. When the ActiveX Script task completes, with either success or failure, the Send

Mail task runs.

Figure 3: ActiveX Script task and the Send Mail task with an On Completion precedence constraint

You can configure separate Send Mail tasks, one for an On Success constraint and one for an On Failure

constraint. The two Send Mail tasks can send different messages based on the success or failure of the

ActiveX script.

Figure 4: Mail tasks

You also can issue multiple precedence constraints on a task. For example, the Send Mail task "Admin

Notification" could have both an On Success constraint from Script #1 and an On Failure constraint

from Script #2. In these situations, DTS assumes a logical "AND" relationship. Therefore, Script #1

must successfully execute and Script #2 must fail for the Admin Notification message to be sent.
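The logical "AND" behavior described above can be modeled with a small Python sketch. This is not DTS code or the DTS object model; the function names and the dictionary of task outcomes are hypothetical and serve only to show how several precedence constraints on one task combine.

```python
SUCCESS, FAILURE = "success", "failure"

def constraint_met(kind, outcome):
    """Check one precedence constraint against the outcome of its predecessor."""
    if kind == "completion":
        return outcome in (SUCCESS, FAILURE)  # any finished state qualifies
    if kind == "success":
        return outcome == SUCCESS
    if kind == "failure":
        return outcome == FAILURE
    raise ValueError(f"unknown constraint kind: {kind}")

def may_run(constraints, outcomes):
    """A task may run only if ALL of its constraints hold (logical AND).

    constraints: list of (predecessor name, constraint kind)
    outcomes:    outcomes of the predecessors that have finished so far
    """
    return all(
        pred in outcomes and constraint_met(kind, outcomes[pred])
        for pred, kind in constraints
    )

# The "Admin Notification" example from the text.
ADMIN_NOTIFICATION = [("Script #1", "success"), ("Script #2", "failure")]

print(may_run(ADMIN_NOTIFICATION, {"Script #1": SUCCESS, "Script #2": FAILURE}))
print(may_run(ADMIN_NOTIFICATION, {"Script #1": SUCCESS, "Script #2": SUCCESS}))
```

A task with an empty constraint list is always allowed to run, which mirrors the earlier observation that unordered tasks execute concurrently.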

Figure 5: Example of multiple precedence constraints on a task

Connections: Accessing and Moving Data

To successfully execute DTS tasks that copy and transform data, a DTS package must establish valid

connections to its source and destination data and to any additional data sources, such as lookup tables.




When creating a package, you configure connections by selecting a connection type from a list of

available OLE DB providers and ODBC drivers. The types of connections that are available are:

• Microsoft Data Access Components (MDAC) drivers

• Microsoft Jet drivers

• Other drivers

DTS allows you to use any OLE DB connection. The icons on the Connections toolbar provide easy

access to common connections.

The following illustration shows a package with two connections. Data is being copied from an Access

database (the source connection) into a SQL Server production database (the destination connection).

Figure 6: Example of a package with two connections

The first step in this package is an Execute SQL task, which checks to see if the destination table

already exists. If so, the table is dropped and re-created. On the success of the Execute SQL task, data

is copied to the SQL Server database in Step 2. If the copy operation fails, an e-mail is sent in Step 3.

The Data Pump: Transforming Data

The DTS data pump is a DTS object that drives the import, export, and transformation of data. The

data pump is used during the execution of the Transform Data, Data Driven Query, and Parallel Data

Pump tasks. These tasks work by creating rowsets on the source and destination connections, then

creating an instance of the data pump to move rows between the source and destination.



Transformations occur on each row as the row is copied.

In the following illustration, a Transform Data task is used between the Access DB task and the SQL

Production DB task in Step 2. The Transform Data task is the gray arrow between the connections.

Figure 7: Example of a Transform Data task

To define the data gathered from the source connection, you can build a query for the transformation

tasks. DTS supports parameterized queries, which allow you to define query values when the query is executed.


You can type a query into the task's Properties dialog box, or use the Data Transformation Services

Query Designer, a tool for graphically building queries for DTS tasks. In the following illustration, the

Query Designer is used to build a query that joins three tables in the pubs database.

Figure 8: Data Transformation Services Query Designer interface

In the transformation tasks, you also define any changes to be made to data. The following table

describes the built-in transformations that DTS provides.

Transformation: Description

Copy Column: Use to copy data directly from source to destination columns, without any transformations applied to the data.

ActiveX Script: Use to build custom transformations. Note that since the transformation occurs on a row-by-row basis, an ActiveX script can affect the execution speed of a DTS package.

DateTime String: Use to convert a date or time in a source column to a different format in the destination column.

Lowercase String: Use to convert a source column to lowercase characters and, if necessary, to the destination data type.

Uppercase String: Use to convert a source column to all uppercase characters and, if necessary, to the destination data type.

Middle of String: Use to extract a substring from the source column, transform it, and copy the result to the destination column.

Trim String: Use to remove leading, trailing, and embedded white space from a string in the source column and copy the result to the destination column.

Read File: Use to open the contents of a file, whose name is specified in a source column, and copy the contents into a destination column.

Write File: Use to copy the contents of a source column (data column) to a file whose path is specified by a second source column (file name column).
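To make the row-by-row nature of these transformations concrete, here is a Python analogue of a few of them. It is only a sketch: DTS itself scripts transformations in ActiveX languages, and the function names, the 1-based start position, and the sample row below are assumptions made for illustration, not part of the DTS API.

```python
from datetime import datetime

def copy_column(value):
    """Analogue of Copy Column: pass the value through unchanged."""
    return value

def uppercase_string(value):
    return value.upper()

def lowercase_string(value):
    return value.lower()

def middle_of_string(value, start, length):
    """Extract a substring; 'start' is assumed to be 1-based."""
    return value[start - 1:start - 1 + length]

def trim_string(value, embedded=False):
    """Strip leading/trailing white space; optionally remove embedded white space too."""
    return "".join(value.split()) if embedded else value.strip()

def datetime_string(value, source_format, destination_format):
    """Re-format a date string, e.g. from month/day/year to ISO order."""
    return datetime.strptime(value, source_format).strftime(destination_format)

def transform_row(row, column_map):
    """Apply one transformation per destination column to a single source row."""
    return {dest: func(row[src]) for dest, (src, func) in column_map.items()}

# Hypothetical source row and column mappings.
ROW = {"name": "  ada lovelace ", "joined": "12/10/1999", "code": "ab-1234-x"}
COLUMN_MAP = {
    "Name": ("name", lambda v: uppercase_string(trim_string(v))),
    "Joined": ("joined", lambda v: datetime_string(v, "%m/%d/%Y", "%Y-%m-%d")),
    "Code": ("code", lambda v: middle_of_string(v, 4, 4)),
}

RESULT = transform_row(ROW, COLUMN_MAP)
print(RESULT)
```

Because every mapping runs once per row, an expensive function in the column map slows the whole copy, which is the same caution the table gives for ActiveX Script transformations.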

You can also create your own custom transformations programmatically. The quickest way to build

custom transformations is to use the Active Template Library (ATL) custom transformation template,

which is included in the SQL Server 2000 DTS sample programs.

Data Pump Error Logging

A new method of logging transformation errors is available in SQL Server 2000. You can define three

exception log files for use during package execution: an error text file, a source error rows file, and a

destination error rows file.

• General error information is written to the error text file.

• If a transformation fails, then the source row is in error, and that row is written to the source error rows file.



• If an insert fails, then the destination row is in error, and that row is written to the destination error rows file.

The exception log files are defined in the tasks that transform data. Each transformation task has its

own log files.
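The division of labor between the three exception logs can be sketched as follows. This Python model is an assumption-laden illustration, not DTS behavior in detail: it keeps the logs as in-memory lists instead of files, and the transformation and insert functions are hypothetical.

```python
def run_data_pump(source_rows, transform, insert):
    """Copy rows one at a time, routing failures to the matching exception log."""
    error_text = []          # general error information
    source_errors = []       # rows whose transformation failed
    destination_errors = []  # transformed rows whose insert failed
    copied = 0
    for number, row in enumerate(source_rows, start=1):
        try:
            new_row = transform(row)
        except Exception as exc:
            error_text.append(f"row {number}: transformation failed: {exc}")
            source_errors.append(row)
            continue
        try:
            insert(new_row)
        except Exception as exc:
            error_text.append(f"row {number}: insert failed: {exc}")
            destination_errors.append(new_row)
            continue
        copied += 1
    return copied, error_text, source_errors, destination_errors

# Hypothetical destination with a unique key on "id".
DESTINATION = {}

def to_destination_row(row):
    return {"id": int(row["id"]), "name": row["name"].upper()}

def insert_row(row):
    if row["id"] in DESTINATION:
        raise KeyError(f"duplicate key {row['id']}")
    DESTINATION[row["id"]] = row

SOURCE = [
    {"id": "1", "name": "ann"},
    {"id": "x", "name": "bob"},  # cannot be converted: source row error
    {"id": "1", "name": "cy"},   # duplicate key: destination row error
]

COPIED, ERROR_TEXT, SOURCE_ERRORS, DESTINATION_ERRORS = run_data_pump(
    SOURCE, to_destination_row, insert_row
)
print(COPIED, len(ERROR_TEXT), SOURCE_ERRORS, DESTINATION_ERRORS)
```

Keeping failed source rows and failed destination rows apart makes it possible to correct and resubmit each kind separately.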

Data pump phases

By default, the data pump has one phase: row transformation. That phase is what you configure when

mapping column-level transformations in the Transform Data task, Data Driven Query task, and Parallel

Data Pump task, without selecting a phase.

Multiple data pump phases are new in SQL Server 2000. By selecting the multiphase data pump option

in SQL Server Enterprise Manager, you can access the data pump at several points during its operation

and add functionality.

When copying a row of data from source to a destination, the data pump follows the basic process

shown in the following illustration.

Figure 9: Data pump process

After the data pump processes the last row of data, the task is finished and the data pump operation terminates.


Advanced users who want to add functionality to a package so that it supports any data pump phase

can do so by:

• Writing an ActiveX script phase function for each data pump phase to be

customized. If you use ActiveX script functions to customize data pump phases, no additional code outside of the package is required.



• Creating a COM object in Microsoft Visual C++® to customize selected data pump

phases. You develop this program external to the package, and the program is called for each selected phase of the transformation. Unlike the ActiveX script method of accessing data pump phases, which uses a different function and entry point for each selected phase, this method provides a single entry point that is called by multiple data pump phases, while the data pump task executes.

Options for Saving DTS Packages

These options are available for saving DTS packages:

• Microsoft SQL Server

Save your DTS package to Microsoft SQL Server if you want to store packages on

any instance of SQL Server on your network, keep a convenient inventory of those

packages, and add and delete package versions during the package development process.


• SQL Server 2000 Meta Data Services

Save your DTS package to Meta Data Services if you plan to track package version,

meta data, and data lineage information.

• Structured storage file

Save your DTS package to a structured storage file if you want to copy, move, and

send a package across the network without having to store the package in a

Microsoft SQL Server database.

• Microsoft Visual Basic

Save your DTS package that has been created by DTS Designer or the DTS

Import/Export Wizard to a Microsoft Visual Basic file if you want to incorporate it

into Visual Basic programs or use it as a prototype for DTS application development.

DTS as an Application Development Platform

The DTS Designer provides a wide variety of solutions to data movement tasks. DTS extends the

number of solutions available by providing programmatic access to the DTS object model. Using

Microsoft Visual Basic, Microsoft Visual C++, or any other application development system that

supports COM, you can develop a custom DTS solution for your environment using functionality

unsupported in the graphical tools.

DTS offers support for the developer in several different ways:



• Building packages

You can develop extremely complex packages and access the full range of

functionality in the object model, without using the DTS Designer or DTS

Import/Export Wizard.

• Extending packages

You can add new functionality through the construction of custom tasks and

transforms, customized for your business and reusable within DTS.

• Executing packages

DTS packages do not have to be executed from any of the tools provided. It is possible to execute DTS packages programmatically and display progress through COM events, allowing the construction of embedded or custom DTS execution environments.


Sample DTS programs are available to help you get started with DTS programming. The samples can be

installed with SQL Server 2000.

If you develop a DTS application, you can redistribute the DTS files. For more information, see

Redist.txt on the SQL Server 2000 compact disc.

Association Rule:

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Piatetsky-Shapiro describes analyzing and presenting strong rules discovered in databases using different measures of interestingness. Based on the concept of strong rules, association rules were introduced for discovering regularities between products in large-scale transaction data, such as supermarket point-of-sale records. For example, the rule {onions, potatoes} => {beef} found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, he or she is likely to also buy beef.
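The text does not say which measures of interestingness are meant; two commonly used ones are support and confidence, and the sketch below assumes those. The shopping baskets are invented purely for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Invented supermarket baskets.
BASKETS = [
    {"onions", "potatoes", "beef"},
    {"onions", "potatoes", "beef", "milk"},
    {"onions", "potatoes"},
    {"milk", "bread"},
    {"potatoes", "beef"},
]

RULE_SUPPORT = support(BASKETS, {"onions", "potatoes", "beef"})
RULE_CONFIDENCE = confidence(BASKETS, {"onions", "potatoes"}, {"beef"})
print(f"support={RULE_SUPPORT:.2f} confidence={RULE_CONFIDENCE:.2f}")
```

In these baskets the rule holds in two of five transactions, and two of the three baskets containing both onions and potatoes also contain beef. A rule is usually reported as "strong" only when both values exceed thresholds chosen by the analyst.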


Clustering is the classification of objects into different groups, or more precisely, the partitioning of a

data set into subsets (clusters), so that the data in each subset (ideally) share some common trait -

often proximity according to some defined distance measure. Data clustering is a common technique for

statistical data analysis, which is used in many fields, including machine learning, data mining, pattern

recognition, image analysis and bioinformatics. The computational task of classifying the data set into k

clusters is often referred to as k-clustering.

Types of clustering

Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find successive clusters using

previously established clusters. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive

("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them



into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it

into successively smaller clusters.
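A minimal sketch of the agglomerative ("bottom-up") procedure follows. It assumes one-dimensional points, absolute difference as the distance measure, and single linkage (the distance between two clusters is the distance between their closest members); none of these choices is prescribed by the text.

```python
def single_linkage(a, b):
    """Distance between two clusters: the distance of their closest pair."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, k):
    """Start with one cluster per point and merge the two closest clusters
    repeatedly until only k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, index i, index j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

CLUSTERS = agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], 3)
print(CLUSTERS)
```

Recording the sequence of merges instead of stopping at k clusters would give the full hierarchy; a divisive algorithm would run the same idea in reverse, starting from a single cluster and splitting it.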

Partitional algorithms typically determine all clusters at once, but can also be used as divisive

algorithms in the hierarchical clustering.

Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are

clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously.

Another important distinction is whether the clustering uses symmetric or asymmetric distances. A

property of Euclidean space is that distances are symmetric (the distance from object A to B is the

same as the distance from B to A). In other applications (e.g., sequence-alignment methods, see

Prinzie & Van den Poel (2006)), this is not the case.

Data classification is the determination of class intervals and class boundaries in the data to be mapped, and it depends in part on the number of observations. Most maps are designed with 4-6 classes. With more observations you may need a larger number of classes, but too many classes make the map difficult to interpret. There are four classification methods for making a graduated color or graduated symbol map, and each reveals different patterns in the map display.

Natural Breaks Classification

Natural breaks is a manual data classification method that divides data into classes based on the natural groups in the data distribution. It uses a statistical formula (Jenks optimization) that calculates groupings of data values based on the data distribution, seeking to reduce variance within groups and maximize variance between groups.

This method relies on subjective decisions and is the best choice for combining similar values. Because the class ranges are specific to an individual dataset, it is difficult to compare one map with another and to choose the optimum number of classes, especially if the data is evenly distributed.

Quantile Classification

The quantile classification method distributes a set of values into groups that contain an equal number of values. Because it places the same number of data values in each class, it never produces empty classes or classes with too few or too many values. It is attractive in that it always produces distinct map patterns.
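A quantile classification can be sketched in Python as follows. The function name and sample values are hypothetical, and the sketch makes one simplifying assumption: when the values do not divide evenly, the first classes receive one extra value each.

```python
def quantile_classes(values, n_classes):
    """Split the sorted values into n_classes groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_classes)
    classes, start = [], 0
    for i in range(n_classes):
        # The first 'extra' classes take one additional value each.
        end = start + size + (1 if i < extra else 0)
        classes.append(ordered[start:end])
        start = end
    return classes

QUANTILE_CLASSES = quantile_classes([3, 1, 4, 1, 5, 9, 2, 6, 5], 3)
print(QUANTILE_CLASSES)
```

Note that in this simple version tied values can end up on both sides of a class boundary (the value 5 above), so a mapping tool would need an additional rule for ties.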

Equal Interval Classification

The equal interval classification method divides a set of attribute values into groups that each cover an equal range of values. This method works best with continuous data, and maps designed with it are easy to produce and read. It is, however, not well suited to clustered data, because many features may fall into one or two classes while other classes contain no features at all.
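The weakness with clustered data is easy to reproduce. The sketch below uses a hypothetical function name and invented values, and it assigns the maximum value to the last class; it is meant only to illustrate the idea.

```python
def equal_interval_classes(values, n_classes):
    """Divide the range [min, max] into n_classes intervals of equal width
    and assign every value to its interval."""
    low, high = min(values), max(values)
    width = (high - low) / n_classes
    classes = [[] for _ in range(n_classes)]
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything lands in the first class
        else:
            # The maximum would compute to n_classes, so clamp it to the last class.
            index = min(int((v - low) / width), n_classes - 1)
        classes[index].append(v)
    return classes

EQUAL_CLASSES = equal_interval_classes([1, 2, 2, 3, 3, 4, 100], 4)
print(EQUAL_CLASSES)
```

With one outlier at 100, six of the seven values fall into the first class and the two middle classes stay empty, which is exactly the situation the text warns about.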

Standard Deviation Classification

The standard deviation classification method finds the mean value and then places class breaks above and below the mean at intervals of 0.25, 0.5, or 1 standard deviation until all the data values are contained within the classes. Values more than three standard deviations from the mean are aggregated into two classes: greater than three standard deviations above the mean, and less than three standard deviations below the mean.
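The break computation can be sketched as below. Two details are assumptions rather than statements from the text: the population standard deviation is used, and the mean itself is treated as a class break. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def std_dev_breaks(values, step=1.0, limit=3.0):
    """Class breaks at mean + k * step * sd, for k covering -limit..+limit sd."""
    m, sd = mean(values), pstdev(values)  # population standard deviation (assumption)
    n = int(limit / step)
    return [m + k * step * sd for k in range(-n, n + 1)]

def std_dev_class(value, breaks):
    """Class index of a value: 0 means below the lowest break (more than
    'limit' standard deviations below the mean), len(breaks) means above
    the highest break."""
    return sum(1 for b in breaks if value >= b)

VALUES = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, standard deviation 2
BREAKS = std_dev_breaks(VALUES)
print(BREAKS)
print(std_dev_class(4.5, BREAKS), std_dev_class(12, BREAKS))
```

Passing step=0.5 or step=0.25 produces the finer intervals mentioned in the text, while the two open-ended classes at either end collect the values beyond three standard deviations.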
