78
12 The Data Warehouse and Data Mining MIS 304 Winter 2006

12 The Data Warehouse and Data Mining MIS 304 Winter 2006

Embed Size (px)

Citation preview

Page 1: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

12

The Data Warehouseand Data Mining

MIS 304 Winter 2006

Page 2: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

2

12

Class Goals

• Be able to understand and apply the concept of a Data Warehouse to database environments.

• Be able to describe the concept of On Line Analytical Processing (OLAP) and show how it might be used.

• Understand the seminaries and differences between DDS, OLAP, and Data Mining

Page 3: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

3

12

Class Goals

• Develop and understanding of the Star Schema and its importance in the design of the data warehouse.

Page 4: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

4

12

How much data is there?

• Remember the reading from the first night’s class.

• We are swimming in data.• What can we do with it?

Page 5: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

5

12

The Need for Data AnalysisThe Need for Data Analysis

• Constant pressure from external and internal forces requires prompt tactical and strategic decisions.

• The decision-making cycle time is reduced, while problems are increasingly complex with a growing number of internal and external variables.

• Managers need support systems for facilitating quick decision making in a complex environment.

• One solution many people use is a Decision support systems (DSS).

Page 6: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

6

12

Data Warehouse Examples

• Center for Disease Control http://www.cdc.gov/nchs/datawh.htm

• United States Geological Survey http://orxddwimdn.er.usgs.gov

• Pharmacia – Upjohn• AT&T• Meijer

Page 7: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

7

12

DSS

• Came about in the 1980’s based on work done at AT&T, IBM, Boston Consulting Group and others.

• Bonczek, Holsapple and Winston 1981, Foundations of Decision Support Systems

Page 8: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

8

12Decision Support SystemsDecision Support Systems

• Decision Support is a methodology (or a series of methodologies) designed to extract information from data and to use such information as a basis for decision making.

• A decision support system (DSS) is an arrangement of computerized tools used to assist managerial decision making within a business. – A DSS usually requires extensive data “massaging” to

produce information.– The DSS is used at all levels within an organization and is

often tailored to focus on specific business areas or problems.– The DSS is interactive and provides ad hoc query tools to

retrieve data and to display data in different formats.

Page 9: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

9

12

Four Components of a DSSFour Components of a DSS

• The data store component is basically a DSS database.

• The data extraction and filtering component is used to extract and validate the data taken from the operational database and the external data sources.

• The end user query tool is used by the data analyst to create the queries that access the database.

• The end user presentation tool is used by the data analyst to organize and present the data.

Page 10: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

10

12Main Components Of A Decision Support System (DSS)

Page 11: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

11

12

Operational Data vs. Decision Support DataOperational Data vs. Decision Support Data

• Most operational data are stored in a relational database in which the structures tend to be highly normalized.

• The operational data storage is optimized to support transactions that represent daily operations.

• Whereas operational data capture daily business transactions, DSS data give tactical and strategic business meaning to the operational data.

Page 12: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

12

12

Operational Data vs. Decision Support Data

• The design needs of the DSS and the Supporting data are often at odds with one another!

• You have two choices:– Modify the structure of your operational

data– Build a separate system.

What is the deciding factor?

Page 13: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

13

12

Operational Data vs. Decision Support Data

Operational Data vs. Decision Support Data

• Time span

– Operational data represent current (atomic) transactions.

– DSS data tend to cover a longer time frame.

• Granularity

– Operational data represent specific transactions that occur at a given time.

– DSS data must be presented at different levels of aggregation.

• Dimensionality

– Operational data focus on representing atomic transactions.

– DSS data can be analyzed from multiple dimensions.

Page 14: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

14

12

Page 15: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

15

12Contrasting Operational And DSS Data Characteristics

Page 16: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

16

12

Decision Support SystemsDecision Support Systems

• The DSS Database Requirements

– Database Schema

• The DSS database schema must support complex (non-normalized) data representations.

• The queries must be able to extract multidimensional time slices.

What is the Impact on the Database?

Page 17: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

17

12

Impacts on the Database

• Less normalization typically means more speed.

• Less normalization typically means bigger tables

• Less normalization typically means “easier” queries (SQL)

Page 18: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

18

12

Data Extraction and LoadingData Extraction and Loading

• The DBMS must support advanced data extracting and filtering tools.

• The data extraction capabilities should support different data sources and multiple vendors.

• Data filtering capabilities must include the ability to check for inconsistent data or data validation rules.

• The DBMS must support advanced data integration, aggregation, and classification capabilities.

Page 19: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

19

12Decision Support SystemsDecision Support Systems

• End-User Analytical Interface

– The DSS DBMS must support advanced data modeling and data presentation tools, data analysis tools, and query generation and optimization components.

– The end user analytical interface is one of the most critical components.

• Database Size Requirements

– DSS databases tend to be very large.

– The DBMS must be capable of supporting very large databases (VLDB).

– The DBMS may be required to use advanced hardware, such as multiple disk arrays and multiple-processor technologies.

Page 20: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

20

12The Data WarehouseThe Data Warehouse

• The Data Warehouse is an integrated, subject-oriented, time-variant, non-volatile database that provides support for decision making.

– Integrated• The Data Warehouse is a centralized, consolidated

database that integrates data retrieved from the entire organization.

– Subject-Oriented

• The Data Warehouse data is arranged and optimized to provide answers to questions coming from diverse functional areas within a company.

Page 21: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

21

12

The Data WarehouseThe Data Warehouse

– Time Variant

• The Warehouse data represent the flow of data through time. It can even contain projected data.

– Non-Volatile

• Once data enter the Data Warehouse, they are never removed.

• The Data Warehouse is always growing.

Page 22: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

22

12Comparison Of Data Warehouse And Operational Database Characteristics

Page 23: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

23

12Creating A Data Warehouse

Page 24: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

24

12

The Other Option

• Find places where this filtering occurs naturally– Accounting systems– Distribution systems– Engineering systems

• Extract the data from the “filtering” system

Page 25: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

25

12

Data MartData Mart• A data mart is a small, single-subject data

warehouse subset that provides decision support to a small group of people.

• Data Marts can serve as a test vehicle for companies exploring the potential benefits of Data Warehouses.

• Data Marts address local or departmental problems, while a Data Warehouse involves a company-wide effort to support decision making at all levels in the organization.

Page 26: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

26

12Twelve Rules That Define a

Data WarehouseTwelve Rules That Define a

Data Warehouse

1.The Data Warehouse and operational environments are separated.

2.The Data Warehouse data are integrated.

3.The Data Warehouse contains historical data over a long time horizon.

4.The Data Warehouse data are snapshot data captured at a given point in time.

5.The Data Warehouse data are subject-oriented.

Page 27: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

27

12

12 Rules cont.

6. The Data Warehouse data are mainly read-only with periodic batch updates from operational data. No online updates are allowed.

7. The Data Warehouse development life cycle differs from classical systems development. The Data Warehouse development is data driven; the classical approach is process driven.

Page 28: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

28

12

8.The Data Warehouse contains data with several levels of detail; current detail data, old detail data, lightly summarized, and highly summarized data.

9.The Data Warehouse environment is characterized by read-only transactions to very large data sets. The operational environment is characterized by numerous update transactions to a few data entities at the time.

10. The Data Warehouse environment has a system that traces data resources, transformation, and storage.

12 Rules cont.12 Rules cont.

Page 29: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

29

12

12 Rules cont.

11. The Data Warehouse’s metadata are a critical component of this environment. The metadata identify and define all data elements. The metadata provide the source, transformation, integration, storage, usage, relationships, and history of each data element.

12. The Data Warehouse contains a charge-back mechanism for resource usage that enforces optimal use of the data by end users.

Page 30: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

30

12

10 Mistakes to Avoid

• Mistake 1:Not gathering business requirements • Mistake 2: Saving time by not creating a subject

area model • Mistake 3: Delivering normalized tables to drive

out mart design • Mistake 4: Designing the data staging process for

the ease of the developers at the expense of the end users

• Mistake 5: Denormalizing without starting with a fully normalized data model

Page 31: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

31

12

10 Mistakes to Avoid

• Mistake 6: Allowing users to drive the level of detail • Mistake 7: Not having a multi-tiered warehousing

environment • Mistake 8: Developing a data model from a list of required

data elements • Mistake 9: Thinking that you must choose between

relational and dimensional models • Mistake 10: Jumping straight into data mart design

Karolyn Duncan, Steve Hoberman, and Laura Reeves, 10 Mistakes to Avoid When Modeling Data Warehouses

Page 32: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

32

12

OnLine Analytical ProcessingOLAP

Page 33: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

33

12

On-Line Analytical ProcessingOn-Line Analytical Processing

• On-Line Analytical Processing (OLAP) is an advanced data analysis environment that supports decision making, business modeling, and operations research activities.

• Four Main Characteristics of OLAP

– Use multidimensional data analysis techniques

– Provide advanced database support

– Provide easy-to-use end user interfaces

– Support client/server architecture

Page 34: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

34

12Multidimensional Data Analysis

TechniquesMultidimensional Data Analysis

Techniques

• The processing of data in which data are viewed as part of a multidimensional structure.

• Multidimensional view allows end users to consolidate or aggregate data at different levels.

• Multidimensional view allows a business analyst to easily switch business perspectives.

Page 35: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

35

12Operational Vs. Multidimensional View Of Sales

Page 36: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

36

12

On-Line Analytical ProcessingOn-Line Analytical Processing

• Additional Functions of Multidimensional Data Analysis Techniques

– Advanced data presentation functions

– Advanced data aggregation, consolidation, and classification functions

– Advanced computational functions

– Advanced data modeling functions

Page 37: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

37

12Integration Of OLAP With A Spreadsheet Program

Page 38: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

38

12Advanced Database SupportAdvanced Database Support

• Access to many different kinds of DBMSs, flat files, and internal and external data sources.

• Access to aggregated Data Warehouse data as well as to the detail data found in operational databases.

• Advanced data navigation features such as drill-down and roll-up.

• Rapid and consistent query response times.

• The ability to map end user requests, expressed in either business or model terms, to the appropriate data source and then to the proper data access language.

• Support for very large databases.

Page 39: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

39

12

On-Line Analytical ProcessingOn-Line Analytical Processing

• Easy-to-Use End User Interface– Easy-to-use graphical user interfaces make

sophisticated data extraction and analysis tools easily accepted and readily used.

• Client/Server Architecture– The client/server environment enables us to

divide an OLAP system into several components that define its architecture.

Page 40: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

40

12

On-Line Analytical ProcessingOn-Line Analytical Processing

• OLAP Architecture

– Three Main Modules

• OLAP Graphical User Interface (GUI)

• OLAP Analytical Processing Logic

• OLAP Data Processing Logic

– OLAP systems are designed to use both operational and Data Warehouse data.

Page 41: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

41

12OLAP Server Arrangement

Page 42: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

42

12OLAP Server With Multidimensional Data Store Arrangement

Page 43: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

43

12OLAP Server With Local Mini Data-Marts

Page 44: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

44

12

Relational OLAPRelational OLAP

• Relational On-Line Analytical Processing (ROLAP) provides OLAP functionality by using relational database and familiar relational query tools.

• Extensions to RDBMS

– Multidimensional data schema support within the RDBMS

– Data access language and query performance optimized for multidimensional data

– Support for very large databases

Page 45: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

45

12

Multidimensional Data Schema Support

• Normalization of tables in relational technology is seen as a stumbling block to its use in OLAP systems.

• DSS data tend to be non-normalized, duplicated, and pre- aggregated.

• ROLAP uses a special design technique to enable RDBMS technology to support multidimensional data representations, known as star schema.

• Star schema creates the near equivalent of a multidimensional database schema from the existing relational database.

Page 46: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

46

12

On-Line Analytical Processing• Data Access Language and Query Performance Optimized for

Multidimensional Data

– Most decision support data requests require the use of multiple-pass SQL queries or multiple nested SQL statements.

– ROLAP extends SQL so that it can differentiate between access requirements for data warehouse data and operational data.

• Support for Very Large Databases

– Decision support data are normally loaded in bulk (batch) mode from the operational data.

– RDBMS must have the proper tools to import, integrate, and populate the data warehouse with operational data.

– The speed of the data-loading operations is important.

Page 47: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

47

12A Typical ROLAP Client/Server Architecture

Page 48: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

48

12Multidimensional OLAP

(MOLAP)Multidimensional OLAP

(MOLAP)

• MOLAP extends OLAP functionality to multidimensional databases (MDBMS).

• MDBMS end users visualize the stored data as a multidimensional cube known as a data cube.

• Data cubes are created by extracting data from the operational databases or from the data warehouse.

• Data cubes are static and require front-end design work.• To speed data access, data cubes are normally held in memory, called

cube cache.• MOLAP is generally faster than their ROLAP counterparts. It is also more

resource-intensive.• MDBMS is best suited for small and medium data sets.• Multidimensional data analysis is also affected by how the database

system handles sparsity.

Page 49: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

49

12MOLAP Client/Server Architecture

Page 50: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

50

12Relational Vs. Multidimensional OLAP

Page 51: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

51

12

Star SchemaStar Schema

• The star schema is a data-modeling technique used to map multidimensional decision support into a relational database.

• Star schemas yield an easily implemented model for multidimensional data analysis while still preserving the relational structure of the operational database.

• Four Components:– Facts– Dimensions– Attributes– Attribute hierarchies

Page 52: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

52

12A Simple Star Schema

Page 53: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

53

12

Star SchemaStar Schema• Facts

– Facts are numeric measurements (values) that represent a specific business aspect or activity.

– The fact table contains facts that are linked through their dimensions.

– Facts can be computed or derived at run-time (metrics).

• Dimensions

– Dimensions are qualifying characteristics that provide additional perspectives to a given fact.

– Dimensions are stored in dimension tables.

Page 54: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

54

12

• Attributes

– Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.

– Dimensions provide descriptive characteristics about the facts through their attributes.

Possible Attributes For Sales Dimensions

Page 55: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

55

12Three Dimensional View Of Sales

Page 56: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

56

12Slice And Dice View Of Sales

Page 57: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

57

12

Attribute HierarchiesAttribute Hierarchies

• Attributes within dimensions can be ordered in a well-defined attribute hierarchy.

• The attribute hierarchy provides a top-down data organization that is used for two main purposes:

– Aggregation

– Drill-down/roll-up data analysis

Page 58: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

58

12A Location Attribute Hierarchy

Page 59: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

59

12Attribute Hierarchies In Multidimensional Analysis

Page 60: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

60

12

Star Schema RepresentationStar Schema Representation

• Facts and dimensions are normally represented by physical tables in the data warehouse database.

• The fact table is related to each dimension table in a many-to-one (M:1) relationship.

• Fact and dimension tables are related by foreign keys and are subject to the primary/foreign key constraints.

Page 61: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

61

12Star Schema For Sales

Page 62: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

62

12Orders Star Schema

Page 63: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

63

12

Example

Page 64: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

64

12

Star SchemaStar Schema

• Performance-Improving Techniques

– Normalization of dimensional tables

– Multiple fact tables representing different aggregation levels

– Denormalization of fact tables

– Table partitioning and replication

Page 65: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

65

12Normalized Dimension Tables

Page 66: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

66

12Multiple Fact Tables

Page 67: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

67

12Data Warehouse ImplementationData Warehouse Implementation

• The Data Warehouse as an Active Decision Support Network

– A Data Warehouse is a dynamic support framework.

– Implementation of a Data Warehouse is part of a complete database-system-development infrastructure for company-wide decision support.

– Its design and implementation must be examined in the light of the entire infrastructure.

Page 68: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

68

12

Data Warehouse ImplementationData Warehouse Implementation

• A Company-Wide Effort that Requires User Involvement and Commitment at All Levels

– For a successful design and implementation, the designer must:

• Involve end users in the process.

• Secure end users’ commitment from the beginning.

• Create continuous end user feedback.

• Manage end user expectations.

• Establish procedures for conflict resolution.

Page 69: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

69

12

Data Warehouse ImplementationData Warehouse Implementation

• Satisfy the Trilogy: Data, Analysis, and Users

– For a successful design and implementation, the designer must satisfy:

• Data integration and loading criteria.

• Data analysis capabilities with acceptable query performance.

• End user data analysis needs.

Page 70: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

70

12Apply Database Design ProceduresApply Database Design Procedures

– Data Warehouse development is a company-wide effort and requires many resources: people, financial, and technical.

• The sheer and often mind-boggling quantity of decision support data is likely to require the latest hardware and software.

• It is also imperative to have very detailed procedures to orchestrate the flow of data from the operational databases to the Dare Warehouse.

• To implement and support the Data Warehouse architecture, we also need people with advanced database design, software integration, and management skills.

Page 71: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

71

12Data Warehouse Implementation Road Map

Page 72: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

72

12

Data MiningData Mining

• In contrast to the traditional (reactive) DSS tools, the data mining premise is proactive.

• Data mining tools automatically search the data for anomalies and possible relationships, thereby identifying problems that have not yet been identified by the end user.

• Data mining tools -- based on algorithms that form the building blocks for artificial intelligence, neural networks, inductive rules, and predicate logic -- initiate analysis to create knowledge.

Page 73: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

73

12

Data Mining Techniques

• Neural Nets• Regression Analysis• Time Series Analysis• Multidimensional Analysis• Factor Analysis• Nearest Neighbor methods• Cluster Analysis

Page 74: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

74

12Extraction Of Knowledge From Data

Page 75: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

75

12

Data MiningData Mining

• Four Phases of Data Mining

1.Data Preparation

• Identify and cleanse data sets.

• Data Warehouse is usually used for data mining operations.

2.Data Analysis and Classification

• Identify common data characteristics or patterns using

– Data groupings, classifications, clusters, or sequences.

– Data dependencies, links, or relationships.

– Data patterns, trends, and deviations.

Page 76: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

76

12

Data MiningData Mining

3. Knowledge Acquisition

• Select the appropriate modeling or knowledge acquisition algorithms.

• Examples: neural networks, decision trees, rules induction, genetic algorithms, classification and regression tree, memory-based reasoning, or nearest neighbor and data visualization.

4. Prognosis

• Predict future behavior and forecast business outcomes using the data mining findings.

Page 77: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

77

12Data-Mining Phases

Page 78: 12 The Data Warehouse and Data Mining MIS 304 Winter 2006

78

12

Examples

• Credit Card usage (who buys what)• Airline travel (where do people go,

when)• Social Services (who is not paying the

tab)