16
1 Dr. Chen, Data Base Management Chapter 4: Data Mining Jason C. H. Chen, Ph.D. Professor of MIS School of Business Administration Gonzaga University Spokane, WA 99258 [email protected] Dr. Chen, Business Intelligence 2 Covers the following pages pp. 145 171 p. 176 (Decision Trees only) pp. 186 195 AC 4.6 (p.189-192) NO ECC Dr. Chen, Business Intelligence 3 Learning Objectives Define data mining as an enabling technology for business intelligence Understand the objectives and benefits of business analytics and data mining Recognize the wide range of applications of data mining Learn the standardized data mining processes CRISP-DM SEMMA KDD (Continued…) Dr. Chen, Business Intelligence 4 Learning Objectives Understand the steps involved in data preprocessing for data mining Learn different methods and algorithms of data mining Build awareness of the existing data mining software tools Commercial versus free/open source Understand the pitfalls and myths of data mining Dr. Chen, Business Intelligence 5 Questions for the Opening Vignette 1. Why should retailers, especially omni-channel retailers, pay extra attention to advanced analytics and data mining? 2. What are the top challenges for multi-channel retailers? Can you think of other industry segments that face similar problems/challenges? 3. What are the sources of data that retailers such as Cabela’s use for their data mining projects? 4. What does it mean to have a “single view of the customer”? How can it be accomplished? 5. What type of analytics help did Cabela’s get from their efforts? Can you think of any other potential benefits of analytics for large-scale retailers like Cabela’s? 6. What was the reason for Cabela’s to bring together SAS and Teradata, the two leading vendors in the analytics marketplace? 7. What is in-database analytics, and why would you need it? Dr. Chen, Business Intelligence 6 1. Why should retailers, especially omni-channel retailers, pay extra attention to advanced analytics and data mining? Utilizing large and information-rich transactional and customer data (that they collect on a daily basis) to optimize their business processes is not a choice for large-scale retailers anymore, but a necessity to stay competitive. 2. What are the top challenges for multi-channel retailers? Can you think of other industry segments that face similar problems? The retail industry is amongst the most challenging because of the change that they have to deal with constantly. Understanding customer needs, wants, likes, and dislikes is an ongoing challenge. As the volume and complexity of data increase, so does the time spent on preparing and analyzing it. Prior to the integration of SAS and Teradata, data for modeling and scoring customers was stored in a data mart. This process required a large amount of time to construct, bringing together disparate data sources and keeping statisticians from working on analytics.

Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

Embed Size (px)

Citation preview

Page 1: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

1

Dr. Chen, Data Base Management

Chapter 4:

Data Mining

Jason C. H. Chen, Ph.D.

Professor of MIS

School of Business Administration

Gonzaga University

Spokane, WA 99258

[email protected]

Dr. Chen, Business Intelligence

2

Covers the following pages

• pp. 145 – 171

• p. 176 (Decision Trees only)

• pp. 186 – 195

• AC 4.6 (p.189-192)

• NO ECC

Dr. Chen, Business Intelligence 3

Learning Objectives

• Define data mining as an enabling technology for

business intelligence

• Understand the objectives and benefits of business

analytics and data mining

• Recognize the wide range of applications of data

mining

• Learn the standardized data mining processes

CRISP-DM

SEMMA

KDD (Continued…) Dr. Chen, Business Intelligence

4

Learning Objectives

• Understand the steps involved in data

preprocessing for data mining

• Learn different methods and algorithms of data

mining

• Build awareness of the existing data mining

software tools

Commercial versus free/open source

• Understand the pitfalls and myths of data mining

Dr. Chen, Business Intelligence 5

Questions for the Opening Vignette

1. Why should retailers, especially omni-channel retailers, pay extra

attention to advanced analytics and data mining?

2. What are the top challenges for multi-channel retailers? Can you think

of other industry segments that face similar problems/challenges?

3. What are the sources of data that retailers such as Cabela’s use for

their data mining projects?

4. What does it mean to have a “single view of the customer”? How can

it be accomplished?

5. What type of analytics help did Cabela’s get from their efforts? Can

you think of any other potential benefits of analytics for large-scale

retailers like Cabela’s?

6. What was the reason for Cabela’s to bring together SAS and

Teradata, the two leading vendors in the analytics marketplace?

7. What is in-database analytics, and why would you need it?

Dr. Chen, Business Intelligence 6

• 1. Why should retailers, especially omni-channel retailers, pay extra

attention to advanced analytics and data mining?

• Utilizing large and information-rich transactional and customer data

(that they collect on a daily basis) to optimize their business processes is

not a choice for large-scale retailers anymore, but a necessity to stay

competitive.

• 2. What are the top challenges for multi-channel retailers? Can you

think of other industry segments that face similar problems?

• The retail industry is amongst the most challenging because of the

change that they have to deal with constantly. Understanding customer

needs, wants, likes, and dislikes is an ongoing challenge. As the volume

and complexity of data increase, so does the time spent on preparing and

analyzing it.

• Prior to the integration of SAS and Teradata, data for modeling and

scoring customers was stored in a data mart. This process required a

large amount of time to construct, bringing together disparate data

sources and keeping statisticians from working on analytics.

Page 2: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

2

Dr. Chen, Business Intelligence 7

• 3. What are the sources of data that retailers such as

Cabela’s use for their data mining projects?

• Cabela’s uses large and information-rich transactional and

customer data (that they collect on a daily basis). In

addition, through Web mining they track clickstream

patterns of customers shopping online.

• 4. What does it mean to have “a single view of the

customer”? How can it be accomplished?

• Having a single view of the customer means treating the

customer as a single entity across whichever channels the

customer utilizes. Shopping channels include brick-and-

mortar, television, catalog, and e-commerce (through

computers and mobile devices). Achieving this single view

helps to better focus marketing efforts and drive increased

sales.

Dr. Chen, Business Intelligence

8

• 5. What type of analytics help did Cabela’s get from their efforts?

Can you think of any other potential benefits of analytics for large-

scale retailers like Cabela’s?

• Cabela’s has long relied on SAS statistics and data mining tools to

help analyze the data it gathers from sales transactions, market

research, and demographic data associated with its large database of

customers. Using SAS data mining tools, Cabela’s analysts create

predictive models to optimize customer selection for all customer

contacts. Cabela’s uses these prediction scores to maximize

marketing spending across channels and within each customer’s

personal contact strategy.

• These efforts have allowed Cabela’s to continue its growth in a

profitable manner. In addition, dismantling the information silos, and

integration of SAS and Teradata, enabled them to create “a holistic

view of the customer.” Since this works so well for the

sales/customer side of the business, it could also work in other areas

as well. Supply chain is one example. Analytics could help produce a

“holistic view of the vendor” as well.

Dr. Chen, Business Intelligence 9

• 6. What was the reason for Cabela’s to bring

together SAS and Teradata, the two leading

vendors in the analytics marketplace?

• Cabela’s was already using both for different

elements of their business. Each of the two systems

was producing actionable analysis of data.

• But by being separate, too much time was required

to construct data marts, bringing together disparate

data sources and keeping statisticians from working

on analytics. Now, with the integration of the two

systems, statisticians can leverage the power of

SAS using the Teradata warehouse as one source of

information.

Dr. Chen, Business Intelligence 10

• 7. What is in-database analytics, and why would

you need it?

• In-database analytics refers to the practice of

applying analytics directly to a database or data

warehouse rather than the traditional practice of

first transforming into the analytics application’s

data format. The time it takes to transform

production data into a data warehouse format can

be very long. In-database analytics eliminates this

need.

Dr. Chen, Business Intelligence 11

Event Driven Alert - A Scenario

• Three transactions were posted in a credit

account while the account holder was

traveling in summer 2010:

Ranch 99 San Jose, California $102.33 Aug. 1, 2010

Exxon Austin, Texas $99.12 August 3, 2010

Exxon Houston, Texas $120.44 August 5, 2010

• What action will you (the credit company)

take and why?

• How the scenario can be detected?

Dr. Chen, Business Intelligence 12

Q/Quotes

• “What will be the killer applications in the

corporation?” – an “age-old” question

• “Data Mining”

replied by “Dr. Penzias” (Nobel laureate and former chief

scientist of Bell Labs

• “Data mining will become much more important and

companies will throw away nothing about their

customers because it will be so valuable. If you’re

not doing this, you’re out of business”

• “The latest strategic weapon for companies is

analytical decision making (e.g., data mining) …”

Thomas Davenport (2006) in Harvard Business Review.

Page 3: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

3

Dr. Chen, Business Intelligence 13

Database vs. Datawarehouse

DBMS Database

Datawarehouse ???

Dr. Chen, Business Intelligence 14

Database vs. Datawarehouse

DBMS Database

Datawarehouse

Data Mining

Dr. Chen, Business Intelligence 15

What is Data Mining?

• Data mining – the process of analyzing data to extract

information (unknown patterns) not offered by the raw

data alone

• To perform data mining users need data-mining tools

Data-mining tool – uses a variety of techniques to find patterns

and relationships in large volumes of information and infers

rules that predict future behavior and guide decision making

A wide range of data mining techniques are being used by

organization to gain a better understanding of their customers

and their operations and to solve complex organizational

problems.

• An example

Grocery Store in UK (see next slide) Dr. Chen, Business Intelligence

16

CRM and Data Mining (BI)Example • A Grocery store in U.K. with the following “patterns” found:

• Every Thursday afternoon

• Young Fathers (why?) shopping at store

• Two of the followings are always included in their shopping list

Diapers and

Beers

• What other decisions should be made as a store manager (in terms

of store layout)?

• Short term vs. Long term

This is an example of cross-selling

Other types of promotion: up-sell, bundled-sell

• IT (e.g., BI) helps to find valuable information then decision

makers make a timely/right decision for improving/creating

competitive advantages.

Dr. Chen, Business Intelligence 17

Q/A

• Can the “patterns” in the grocery store

example be produced from its Database?

• Y/N

• Why?

• It only can be produced from its “Data

Warehouse” using a kind of “data mining”

software.

17

Dr. Chen, Business Intelligence 18

Why Data Mining?

• More intense competition at the global scale

• Recognition of the value in data sources

• Availability of quality data on customers, vendors,

transactions, Web, etc.

• Consolidation and integration of data repositories

into data warehouses

• The exponential increase in data processing and

storage capabilities; and decrease in cost

• Movement toward conversion of information

resources into nonphysical form

Page 4: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

4

Dr. Chen, Business Intelligence 19

Definitions of Data Mining –

Summary • Technical def.: a process that uses statistical, mathematical,

and artificial intelligence techniques to extract and identify

useful information and subsequent knowledge (or patterns)

from large sets of data.

• General def.: the nontrivial process of identifying valid,

novel, potentially useful, and ultimately understandable

patterns in data stored in structured databases - Fayyad et al., (1996)

• Keywords in this definition: Process, nontrivial, valid, novel,

potentially useful, understandable

• Other names: knowledge extraction, pattern analysis,

knowledge discovery, information harvesting, pattern

searching, data dredging.

Dr. Chen, Business Intelligence 20

Sta

tistic

s

Management Science &

Information Systems

Artificial Intelligence

Databases

Pattern

Recognition

Machine

Learning

Mathematical

Modeling

DATA

MINING

Data Mining at the Intersection of Many Disciplines

Fig 4. 1 Data Mining as a Blend of Multiple Disciplines

Dr. Chen, Business Intelligence 21

• Source of data for DM is often a consolidated data

warehouse (not always!).

• DM environment is usually a client-server or a Web-

based information systems architecture.

• Data is the most critical ingredient for DM which may

include soft/unstructured data.

• The miner is often an end user.

• Striking it rich requires creative thinking.

• Data mining tools’ capabilities and ease of use are

essential (Web, Parallel processing, etc.).

Data Mining

Characteristics/Objectives

Dr. Chen, Business Intelligence 22

Data in Data Mining

• Data: a collection of facts usually obtained as the result of

experiences, observations, or experiments.

• Data may consist of numbers, words, images, …

• Data: lowest level of abstraction (from which information and

knowledge are derived). - DM with different

data types?

- Other data types? Data

Categorical Numerical

Nominal Ordinal Interval Ratio

StructuredUnstructured or

Semi-Structured

MultimediaTextual HTML/XML

martial status: (1)

single, (2)

married, (3)

divorced

rank order: credit

score: (1) low, (2)

medium, (3) high

continuous data

temperature on the

Celsius scale: 1/100 ratio between a

magnitude of a

continuous quantity

discrete data

Fig 4.3 A Simple Taxonomy of Data in

Data Mining

Dr. Chen, Business Intelligence 23

What Does DM Do?

How Does it Work?

• DM builds models to extract/identify patterns

from data

Pattern?

A mathematical (numeric and/or symbolic) relationship

among data items

• Types of patterns

1. Prediction

2. Association

3. Cluster (segmentation)

4. Sequential (or time series) relationships Dr. Chen, Business Intelligence

24

Fig 4.3 A Simple Taxonomy for Data Mining Tasks

Data Mining

Prediction

Classification

Regression

Clustering

Association

Link analysis

Sequence analysis

Learning Method Popular Algorithms

Supervised

Supervised

Supervised

Unsupervised

Unsupervised

Unsupervised

Unsupervised

Decision trees, ANN/MLP, SVM, Rough

sets, Genetic Algorithms

Linear/Nonlinear Regression, Regression

trees, ANN/MLP, SVM

Expectation Maximization, Apriory

Algorithm, Graph-based Matching

Apriory Algorithm, FP-Growth technique

K-means, ANN/SOM

Outlier analysis Unsupervised K-means, Expectation Maximization (EM)

Apriory, OneR, ZeroR, Eclat

Classification and Regression Trees,

ANN, SVM, Genetic Algorithms

No models or hypothesis exists (e.g., with training data)

With models or hypothesis

e.g., decision trees

e.g., Market-Basket Analysis

Page 5: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

5

Dr. Chen, Business Intelligence 25

How do BI Tools Obtain Data?

On

lin

e

Da

taS

tore

BS

tore

A

Operational/

external DB

Database

in Data

Warehouse

Metadata

Data warehouse

Extraction

Transformation

Cleansing

Loading

Summarization

Data reconciliation

process

Basic Reporting

RFM Analysis

OLAP

Data mining

Web clickstream

Web analytics

Decision support

Data

mart

Data

mart

Data

mart

Data marts

Dr. Chen, Business Intelligence 26 Fig Extra Convergence Disciplines for Data Mining

Businesses use statistical techniques to find patterns and relationships

among data and use it for classification and prediction. Data mining

techniques are a blend of statistics and mathematics, artificial intelligence

(AI) and machine-learning.

What are typical data-mining applications?

Data Warehouse

Dr. Chen, Business Intelligence 27

What are typical data-mining applications?

• Data mining is an automated process of discovery and extraction of hidden

and/or unexpected patterns of collected data in order to create models for decision

making that predict future behavior based on analyses of past activity.

• There are two types of data-mining techniques:

Unsupervised data-mining characteristics: No model or hypothesis exists before running the analysis (with “training data” set)

Analysts apply data-mining techniques and then observe the results

Analysts create a hypotheses after analysis is completed

Cluster analysis, a common technique in this category groups entities together that

have similar characteristics

Supervised data-mining characteristics:

Analysts develop a model prior to their analysis

Classification and Decision tree

Apply statistical techniques to estimate parameters of a model

Regression analysis is a technique in this category that measures the impact of a set

of variables on another variable

Neural networks predict values and make classifications.

Used for making predictions

Dr. Chen, Business Intelligence 28

Decision Tree Example for MIS Classes (hypothetical data)

• A decision tree is a hierarchical arrangement of criteria that predicts a

classification or value. It’s an unsupervised data-mining technique that

selects the most useful attributes for classifying entities on some criterion.

It uses if…then rules in the decision process. Here are two examples.

Fig Decision Tree Examples for MIS Class (Hypothetical Data)

If student is a junior and works in a

restaurant, then predict grade 3.0

If student is a senior and is a nonbusiness

major, then predict grade 3.0

If student is a junior and does not work in

a restaurant, then predict grade 3.0

If student is a senior and is a business

major, then make _____ prediction

>

< ---

< ---

no

Dr. Chen, Business Intelligence 29

What are typical data-mining applications?

• A decision tree is a hierarchical arrangement of criteria that predicts a

classification or value. It’s a supervised data-mining technique that selects

the most useful attributes for classifying entities on some criterion. It uses

if…then rules in the decision process. Here are two examples.

Extra Figure: Grades of Students from Past MIS Class (Hypothetical Data)

Fig Credit Score Decision Tree

>

Dr. Chen, Business Intelligence 30

Market-Basket Analysis is a unsupervised data-mining tool for determining sales

patterns. It helps businesses create cross-selling opportunities (i.e., buying relevant

products together). Two terms used with this type of analysis are:

Support: the probability that two items will be purchased together (e.g., Fins and Mask will be

purchased together)

Confidence: a conditional probability estimate (e.g., proportion of the customers who bought a

mask also bought fins)

Lift: ratio of confidence to the base probability (e.g., ratio between customers of buying fins

after buying mask and those buying fins of walking into the store) The more the lift value is

close to 1 the more independent between these item (i.e., unrelated).

A: Fins

B: Mask

Lift is almost double

Extra Figure: Market-Basket Example

P(Fins)=280/1000=.28

(customers walking into

the store and buying fins)

You will use

RapidMiner to

explore this

scenario in the

HW.

Page 6: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

6

Dr. Chen, Business Intelligence 31

Other Market Basket Application

• 75% of customers who buy Coke also by

corn chips

Good for CRM analysis

Dr. Chen, Business Intelligence 32

DM Capabilities Description Example

Associations/Affinity

(Unsupervised):

Association between items

Discover rules that

correlate one set of events

or items with another set

of events or items.

Market Basket Analysis:

75% of customers who buy Coke also buy corn

chips (good for CRM analysis)

Clustering:

Grouping items according to

statistical similarities

(Unsupervised)

Create partitions so that

all members of each set

are similar according to

some metric or set of

metrics.

Customer Segmentation:

Meals charged on a business-issued gold card are

typically purchased on weekdays and have a

mean value of greater than $250, whereas meals

purchased using a personal platinum card occur

predominately on weekends, have a mean value

of $175 and include a bottle of wine more than

65% of the time.

Sequence/Temporal

Patterns (Supervised):

Time-based Affinity

(Statistical Analysis)

Relate events in time

based on a series of

preceding events.

Time-Based Analysis:

60% of customers buy TVs followed by digital

camcorders

Classification:

Assigns new records to

existing classes

(Supervised)

Discover rules that define

whether an item or event

belongs to a particular

subset or class of data

Decision Tree Analysis (Customer

Segmentation):

Customers with excellent credit history have a

debt/equity ratio of less than 10%

What are typical data-mining applications?

Dr. Chen, Business Intelligence 33

Other Data Mining Tasks

• These are in addition to the primary DM tasks (prediction,

association, clustering)

Time-series forecasting

Data can be used to develop models to extrapolate the future values

of the same phenomenon.

Part of sequence or link analysis?

Visualization

Another data mining task that can be used with other DM

techniques to gain a clearer understanding of underlying

relationships.

• Types of DM

Hypothesis-driven data mining

– begin with a proposition by the user.

Discovery-driven data mining

– Uncover hidden patterns, associations, relationships by different

means Dr. Chen, Business Intelligence

34

Data Mining Applications

• Customer Relationship Management (CRM)

Grocery example in UK

• Banking & Other Financial

• Retailing and Logistics

• Manufacturing and Maintenance

• Brokerage and Securities Trading

• Insurance

Dr. Chen, Business Intelligence 35

Data Mining Applications (cont.)

• Computer hardware and software

• Science and engineering

• Government and defense

• Homeland security and law enforcement

• Travel industry

• Healthcare

• Medicine

• Entertainment industry

• Sports

• Etc.

Highly popular application

areas for data mining

Dr. Chen, Business Intelligence 36

Data Mining Process

• A manifestation of best practices

• A systematic way to conduct DM projects

• Different groups have different versions

• Most common standard processes:

1. CRISP-DM (Cross-Industry Standard Process for Data

Mining)

2. SEMMA (Sample, Explore, Modify, Model, and

Assess)

3. KDD (Knowledge Discovery in Databases)

Page 7: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

7

Dr. Chen, Business Intelligence 37

Data Mining Process

Fig. 4.7 Ranking of Data Mining Methodologies/Processes. Source: KDNuggets.com

Dr. Chen, Business Intelligence 38

1. Data Mining Process: CRISP-DM

Data Sources

Business

Understanding

Data

Preparation

Model

Building

Testing and

Evaluation

Deployment

Data

Understanding

6

1 2

3

5

4

Fig 4.4 The Six Step CRISP-DM Data Mining Process

Dr. Chen, Business Intelligence 39

Data Mining Process: CRISP-DM

Step 1: Business Understanding

Step 2: Data Understanding

Step 3: Data Preparation (!)

Step 4: Model Building

Step 5: Testing and Evaluation

Step 6: Deployment

• The process is highly repetitive and

experimental (DM: art versus science?)

Accounts for

~85% of total

project time

How is it

related to

SDLC?

Dr. Chen, Business Intelligence 40

Data Preparation – A Critical DM Task

Data Consolidation

Data Cleaning

Data Transformation

Data Reduction

Well-formed

Data

Real-world

Data

· Collect data

· Select data

· Integrate data

· Impute missing values

· Reduce noise in data

· Eliminate inconsistencies

· Normalize data

· Discretize/aggregate data

· Construct new attributes

· Reduce number of variables

· Reduce number of cases

· Balance skewed data

See Table 4.4 for more details (p.153).

Fig 4.5 Data Preprocessing Steps

Dr. Chen, Business Intelligence 41

2. Data Mining Process: SEMMA

Sample(Generate a representative

sample of the data)

Modify(Select variables, transform

variable representations)

Explore(Visualization and basic

description of the data)

Model(Use variety of statistical and

machine learning models )

Assess(Evaluate the accuracy and

usefulness of the models)

SEMMA

Fig 4.6 SEMMA Data Mining Process Dr. Chen, Business Intelligence

42

3. Knowledge Discovery in Databases (KDD)

• KDD is a process of using data mining methods to find

useful information and patterns in the data, as opposed

to data mining, which involves using algorithms to

identify patterns in data derived through the KDDS

PROCESS.

It is a comprehensive process that encompasses data mining.

• KDD consists of the following steps (Dunham (2003):

1) data selection,

2) data preprocessing,

3) data transformation,

4) data mining, and

5) interpretation/evaluation.

Page 8: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

8

Dr. Chen, Business Intelligence 43

Data Mining Methods: Classification

• Most frequently used DM method

• Part of the machine-learning family

• Employ supervised learning

• Learn from past data, classify new data

• The output variable is categorical (nominal or

ordinal) in nature

• Classification versus regression?

• Classification versus clustering?

Dr. Chen, Business Intelligence 44

• Predictive accuracy

Hit rate

• Speed

Model building; predicting

• Robustness

• Scalability

• Interpretability

Transparency, explainability

Assessment Methods for Classification

Dr. Chen, Business Intelligence 45

Accuracy of Classification Models

• In classification problems, the primary source for

accuracy estimation is the confusion matrix

True

Positive

Count (TP)

False

Positive

Count (FP)

True

Negative

Count (TN)

False

Negative

Count (FN)

True Class

Positive Negative

Po

sitiv

eN

eg

ativ

e

Pre

dic

ted

Cla

ss

FNTP

TPRatePositiveTrue

FPTN

TNRateNegativeTrue

FNFPTNTP

TNTPAccuracy

FPTP

TPrecision

P

FNTP

TPcallRe

Fig 4.8 A simple Confusion Matrix for Tabulation of Two-Class Classification Results

Dr. Chen, Business Intelligence 46

Estimation Methodologies for Classification

• Simple split (or holdout or test sample estimation)

Split the data into 2 mutually exclusive sets training

(~70%) and testing (30%)

For ANN, the data is split into three sub-sets (training

[~60%], validation [~20%], testing [~20%])

Preprocessed

Data

Training Data

Testing Data

Model

Development

Model

Assessment

(scoring)

2/3

1/3

Classifier

Prediction

Accuracy

Fig 4.9 Simple Random Data Splitting

Dr. Chen, Business Intelligence 47

Estimation Methodologies for Classification

• k-Fold Cross Validation (rotation estimation)

Split the data into k mutually exclusive subsets

Use each subset as testing while using the rest of the

subsets as training

Repeat the experimentation for k times

Aggregate the test results for true estimation of

prediction accuracy training

• Other estimation methodologies

Leave-one-out, bootstrapping, jackknifing

Area under the ROC curve

Dr. Chen, Business Intelligence 48

Estimation Methodologies for

Classification – ROC Curve

10.90.80.70.60.50.40.30.20.10

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

1

0.9

0.8

False Positive Rate (1 - Specificity)

Tru

e P

ositi

ve R

ate

(Sen

sitiv

ity) A

B

C

Page 9: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

9

Dr. Chen, Business Intelligence 49

Classification Techniques

• Decision tree analysis

• Statistical analysis

• Neural networks

• Support vector machines

• Case-based reasoning

• Bayesian classifiers

• Genetic algorithms

• Rough sets

Dr. Chen, Business Intelligence 50

Decision Trees

• Employs the and method

• Recursively divides a training set until each

division consists of examples from one class

1. Create a root node and assign all of the training data

to it.

2. Select the best splitting attribute.

3. Add a branch to the root node for each value of the

split. Split the data into mutually exclusive subsets

along the lines of the specific split.

4. Repeat the steps 2 and 3 for each and every leaf

node until the stopping criteria is reached.

A general

algorithm

for

decision

tree

building

conquer divide

Dr. Chen, Business Intelligence 51

Decision Trees

• DT algorithms mainly differ on

Splitting criteria

Which variable to split first?

What values to use to split?

How many splits to form for each node?

Stopping criteria

When to stop building the tree

Pruning (generalization method)

Pre-pruning versus post-pruning

• Most popular DT algorithms include

ID3, C4.5, C5; CART; CHAID; M5

Dr. Chen, Business Intelligence 52

Decision Trees

• Alternative splitting criteria

Gini index determines the purity of a specific

class as a result of a decision to branch along a

particular attribute/value

Used in CART

Information gain uses entropy to measure the

extent of uncertainty or randomness of a particular

attribute/value split

Used in ID3, C4.5, C5

Chi-square statistics (used in CHAID)

Dr. Chen, Business Intelligence 53

Cluster Analysis for Data Mining

• Used for automatic identification of natural groupings of things

• Part of the machine-learning family

• Employs unsupervised learning

• Learns the clusters of things from past data, then assigns new instances

• There is not an output variable

• Also known as segmentation

Dr. Chen, Business Intelligence 54

Cluster Analysis for Data Mining

• Clustering results may be used to

Identify natural groupings of customers

Identify rules for assigning new cases to classes for targeting/diagnostic purposes

Provide characterization, definition, labeling of populations

Decrease the size and complexity of problems for other data mining methods

Identify outliers in a specific domain (e.g., rare-event detection)

Page 10: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

10

Dr. Chen, Business Intelligence 55

Cluster Analysis for Data Mining

• Analysis methods

Statistical methods (including both hierarchical and

nonhierarchical), such as k-means, k-modes, and so

on

Neural networks (adaptive resonance theory [ART],

self-organizing map [SOM])

Fuzzy logic (e.g., fuzzy c-means algorithm)

Genetic algorithms

Dr. Chen, Business Intelligence 56

Cluster Analysis for Data Mining

• How many clusters?

There is no “truly optimal” way to calculate it

Heuristics are often used

Look at the sparseness of clusters

Number of clusters = (n/2)1/2 (n: no of data points)

Use Akaike information criterion (AIC)

Use Bayesian information criterion (BIC)

• Most cluster analysis methods involve the use of a

distance measure to calculate the closeness

between pairs of items.

Euclidian versus Manhattan (rectilinear) distance

Dr. Chen, Business Intelligence 57

Cluster Analysis for Data Mining

• k-Means Clustering Algorithm

k : pre-determined number of clusters

Algorithm (Step 0: determine value of k)

Step 1: Randomly generate k random points as initial

cluster centers.

Step 2: Assign each point to the nearest cluster center.

Step 3: Re-compute the new cluster centers.

Repetition step: Repeat steps 3 and 4 until some

convergence criterion is met (usually that the

assignment of points to clusters becomes stable).

Dr. Chen, Business Intelligence 58

Cluster Analysis for Data Mining -

k-Means Clustering Algorithm

Step 1 Step 2 Step 3

Fig 4.11 A Graphical illustration of the Steps in k-Means Algorithm

Dr. Chen, Business Intelligence 59

Association Rule Mining

• A very popular DM method in business

• Finds interesting relationships (affinities) between variables (items or events)

• Part of machine learning family

• Employs unsupervised learning

• There is no output variable

• Also known as market basket analysis

• Often used as an example to describe DM to ordinary people, such as the famous “relationship between diapers and beers!”

Dr. Chen, Business Intelligence 60

Association Rule Mining

• Input: the simple point-of-sale transaction data

• Output: Most frequent affinities among items

• Example: according to the transaction data…

“Customer who bought a laptop computer and a virus

protection software, also bought extended service plan 70

percent of the time"

• How do you use such a pattern/knowledge?

Put the items next to each other for ease of finding

Promote the items as a package (do not put one on sale if the other(s)

are on sale)

Place items far apart from each other so that the customer has to

walk the aisles to search for it, and by doing so potentially see and

buy other items

Page 11: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

11

Dr. Chen, Business Intelligence 61

Association Rule Mining

• A representative application of association rule mining

includes

In business: cross-marketing, cross-selling, store design,

catalog design, e-commerce site design, optimization of

online advertising, product pricing, and sales/promotion

configuration

In medicine: relationships between symptoms and

illnesses; diagnosis and patient characteristics and

treatments (to be used in medical DSS); and genes and

their functions (to be used in genomics projects)

Dr. Chen, Business Intelligence 62

Association Rule Mining

• Are all association rules interesting and useful?

A Generic Rule: X Y [S%, C%]

X, Y: products and/or services

X: Left-hand-side (LHS)

Y: Right-hand-side (RHS)

S: Support: how often X and Y go together

C: Confidence: how often Y goes together with X

Example: {Laptop Computer, Antivirus Software}

{Extended Service Plan} [30%, 70%]

Dr. Chen, Business Intelligence 63

Association Rule Mining

• Algorithms are available for generating association rules

Apriori

Eclat

FP-Growth

+ Derivatives and hybrids of the three

• The algorithms help identify the frequent item sets, which are then converted to association rules

Dr. Chen, Business Intelligence 64

Association Rule Mining

• Apriori Algorithm

Finds subsets that are common to at least a minimum number of the itemsets

Uses a bottom-up approach

frequent subsets are extended one item at a time (the size of frequent subsets increases from one-item subsets to two-item subsets, then three-item subsets, and so on), and

groups of candidates at each level are tested against the data for minimum support.

(see the figure) --

Dr. Chen, Business Intelligence 65

Association Rule Mining

Apriori Algorithm

Itemset

(SKUs)Support

Transaction

No

SKUs

(Item No)

1

1

1

1

1

1

1, 2, 3, 4

2, 3, 4

2, 3

1, 2, 4

1, 2, 3, 4

2, 4

Raw Transaction Data

1

2

3

4

3

6

4

5

Itemset

(SKUs)Support

1, 2

1, 3

1, 4

2, 3

3

2

3

4

3, 4

5

3

2, 4

Itemset

(SKUs)Support

1, 2, 4

2, 3, 4

3

3

One-item Itemsets Two-item Itemsets Three-item Itemsets

1001

1002

1003

1004

1005

1006

Fig 4.12 Identification of Frequent Itemsets in Apriori Algorithm

Dr. Chen, Business Intelligence 66

Artificial Neural Networks

for Data Mining

• Artificial neural networks (ANN or NN) are a brain

metaphor for information processing

• a.k.a. Neural Computing

• Very good at capturing highly complex non-linear

functions!

• Many uses – prediction (regression, classification),

clustering/segmentation

• Many application areas - finance, medicine, marketing,

manufacturing, service operations, information systems, …

Page 12: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

12

Dr. Chen, Business Intelligence 67

Biological

versus

Artificial

Neural

Networks

Neuron

Axon

Axon

SynapseSynapse

Dendrites

Dendrites Neuron

w1

w2

wn

x1

x2

xn

.

.

.

Y

Y1

Yn

Y2

Inputs

Weights

Outputs

.

.

.

Processing

Element (PE)

n

i

iiWXS1

)( Sf

Summation

Transfer

Function

Biological Artificial

Neuron

Dendrites

Axon

Synapse

Slow

Many (109)

Node (or PE)

Input

Output

Weight

Fast

Few (102)

Biological NN

Artificial NN

Dr. Chen, Business Intelligence 68

Elements/Concepts of ANN

• Processing element (PE)

• Information processing

• Network structure

Feedforward vs. recurrent vs. multi-layer…

• Learning parameters

Supervised/unsupervised, backpropagation, learning rate, momentum

• ANN Software – NN shells, integrated modules in comprehensive DM software, …

Dr. Chen, Business Intelligence 69

Data Mining

Software

• Commercial

IBM SPSS Modeler (formerly Clementine)

SAS - Enterprise Miner

IBM - Intelligent Miner

StatSoft – Statistica Data Miner

… many more

• Free and/or Open Source

R

RapidMiner

Weka…

0 50 100 150 200 250 300

Predixion Software (3)

WordStat (3)

11 Ants Analytics (4)

Teradata Miner (4)

RapidInsight/Veera (5)

Angoss (7)

SAP (BusinessObjects/Sybase/Hana)(7)

XLSTAT (7)

Salford SPM/CART/MARS/TreeNet/RF (9)

Revolution Computing (11)

C4.5/C5.0/See5 (13)

Bayesia (14)

KXEN (14)

Zementis (14)

Stata (15)

IBM Cognos (16)

Miner3D (19)

Mathematica (23)

JMP (32)

Other commercial software (32)

Oracle Data Miner (35)

Tableau (35)

TIBCO Spotfire / S+ / Miner (37)

Other free software (39)

Microsoft SQL Server (40)

Orange (42)

SAS Enterprise Miner (46)

IBM SPSS Modeler (54)

IBM SPSS Statistics (62)

MATLAB (80)

Rapid-I RapidAnalytics (83)

SAS (101)

StatSoft Statistica (112)

Weka / Pentaho (118)

KNIME (174)

Rapid-I RapidMiner (213)

Excel (238)

R (245)

Fig. 4.13 Popular Data Mining Software Tools (Poll Results). Source: KDNuggets.com Dr. Chen, Business Intelligence 70

Big Data Software Tools and Platforms

0 10 20 30 40 50 60 70 80

Other Hadoop-based tools (10)

Other Big Data software (21)

NoSQL databases (33)

Amazon Web Services (AWS) (36)

Apache Hadoop/Hbase/Pig/Hive (67)

0 50 100 150 200 250 300

F# (5)

Awk/Gawk/Shell (31)

Perl (37)

Other languages (57)

C/C++ (66)

Python (119)

Java (138)

SQL (185)

R (245)

Dr. Chen, Business Intelligence 71

Application Case 4.6

Data Mining Goes to Hollywood:

Predicting Financial Success of Movies

Questions for Discussion

• Decision situation

• Problem

• Proposed solution

• Results

• Answer & discuss the case questions. Dr. Chen, Business Intelligence

72

Application Case 4.6

Data Mining Goes to Hollywood!

Independent Variable Number of

Values Possible Values

MPAA Rating 5 G, PG, PG-13, R, NR

Competition 3 High, Medium, Low

Star value 3 High, Medium, Low

Genre 10

Sci-Fi, Historic Epic Drama,

Modern Drama, Politically

Related, Thriller, Horror,

Comedy, Cartoon, Action,

Documentary

Special effects 3 High, Medium, Low

Sequel 1 Yes, No

Number of screens 1 Positive integer

Class No. 1 2 3 4 5 6 7 8 9

Range

(in $Millions)

< 1

(Flop)

> 1

< 10

> 10

< 20

> 20

< 40

> 40

< 65

> 65

< 100

> 100

< 150

> 150

< 200

> 200

(Blockbuster)

Dependent

Variable

Independent

Variables

A Typical

Classification

Problem

Page 13: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

13

Dr. Chen, Business Intelligence 73

Application Case 4.6

Data Mining Goes to Hollywood!

Model

Development

process

Model

Assessment

process

The DM

Process

Map in

IBM

SPSS

Modeler

Dr. Chen, Business Intelligence 74

Application Case 4.6

Data Mining Goes to Hollywood!

Prediction Models

Individual Models Ensemble Models

Performance

Measure SVM ANN C&RT

Random

Forest

Boosted

Tree

Fusion

(Average)

Count (Bingo) 192 182 140 189 187 194

Count (1-Away) 104 120 126 121 104 120

Accuracy (% Bingo) 55.49% 52.60% 40.46% 54.62% 54.05% 56.07%

Accuracy (% 1-Away) 85.55% 87.28% 76.88% 89.60% 84.10% 90.75%

Standard deviation 0.93 0.87 1.05 0.76 0.84 0.63

* Training set: 1998 – 2005 movies; Test set: 2006 movies

Dr. Chen, Business Intelligence 75

• 1. Why is it important for many Hollywood

professionals to predict the financial success of

movies?

• The movie industry is the “land of hunches and

wild guesses” due to the difficulty associated with

forecasting product demand, making the movie

business in Hollywood a risky endeavor.

• If Hollywood could better predict financial

success, this would mitigate some of the financial

risk.

Dr. Chen, Business Intelligence 76

• 2. How can data mining be used for predicting financial

success of movies before the start of their production

process?

• The way Sharda and Delen did it was to use data from movies

between 1998 and 2005 as training data, and movies of 2006

as test data. They applied individual and ensemble prediction

models, and were able to identify significant variables

impacting financial success.

• They also showed that by using sensitivity analysis, decision

makers can predict with fairly high accuracy how much value

a specific actor (or a specific release date, or the addition of

more technical effects, etc.) brings to the financial success of a

film, making the underlying system an invaluable decision aid.

Dr. Chen, Business Intelligence 77

• 3. How do you think Hollywood did, and perhaps

still is performing, this task without the help of

data mining tools and techniques?

• Most is done by gut feel and trial-and-error. This

may keep the movie business as a financially risky

endeavor, but also allows for creativity.

Sometimes uncertainty is a good thing.

Dr. Chen, Business Intelligence 78

Data Mining Myths

• Data mining …

provides instant solutions/predictions

is not yet viable for business applications

requires a separate, dedicated database

can only be done by those with advanced degrees

is only for large firms that have lots of customer data

is another name for the good-old statistics

Page 14: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

14

Dr. Chen, Business Intelligence 79

Data Mining Myths

Myth Reality

DM provides instant, crystal-ball-like

solutions/predictions

DM is a multi-step process that requires

deliberate, proactive design and use.

DM is not yet viable for business

applications

The current state of the art is ready to go for

almost any business.

DM requires a separate, dedicated

database.

Because of advances in database technology,

a dedicated database is not required, even

though it may be desirable.

DM is only for large firms that have lots

of customer data.

Newer Web-based tools enable managers at

all educational levels to do data mining.

DM can only be done by those with

advanced degrees.

If the data accurately reflect the business or

its customers, a company can use data

mining.

DM is another name for the good-old

statistics.

DM is now available for easy use.

Data mining is a powerful analytical tool that enables business executives to

advance from describing the nature of the past to predicting the future. DM’s

myths and realities are listed below:

Dr. Chen, Business Intelligence 80

Common Data Mining Blunders

1. Selecting the wrong problem for data mining

2. Ignoring what your sponsor thinks data mining is and

what it really can/cannot do

3. Not leaving sufficient time for data acquisition,

selection, and preparation

4. Looking only at aggregated results and not at

individual records/predictions

5. Being sloppy about keeping track of the data mining

procedure and results

6. …more in the book

Dr. Chen, Business Intelligence 81

Extra Example on Correlation

Business Scenario:

Sarah is a regional sales manager for a nationwide supplier of

fossil fuels for home heating. Recent volatility in market

prices for heating oil specifically, coupled with wide

variability in the size of each order for home heating oil, has

Sarah concerned. She feels a need to understand the types of

behaviors and other factors that may influence the demand for

heating oil in the domestic market. What factors are related to

heating oil usage, and how might she use a knowledge of

such factors to better manage her inventory, and anticipate

demand? Sarah believes that data mining can help her begin

to formulate an understanding of these factors and

interactions.

Dr. Chen, Business Intelligence 82

The following attributes are included in the data set:

Insulation: This is a density rating, ranging from one to ten, indicating the

thickness of each home’s insulation. A home with a density rating of one is poorly

insulated, while a home with a density of ten has excellent insulation.

Temperature: This is the average outdoor ambient temperature at each home for

the most recent year, measure in degree Fahrenheit.

Heating_Oil: This is the total number of units of heating oil purchased by the

owner of each home in the most recent year.

Num_Occupants: This is the total number of occupants living in each home.

Avg_Age: This is the average age of those occupants.

Home_Size: This is a rating, on a scale of one to eight, of the home’s overall size.

The higher the number, the larger the home.

Dr. Chen, Business Intelligence 83

Step 1. File New Process “Read CSV” then select

“Chapter04DataSet”

Dr. Chen, Business Intelligence 84

Step 2. Change from “Semicolon” to “Comma”

Page 15: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

15

Dr. Chen, Business Intelligence 85

Dr. Chen, Business Intelligence 86

Dr. Chen, Business Intelligence 87

The “Read CSV” with input data has been added to the process.

Dr. Chen, Business Intelligence 88

Next, add “correlation matrix” operator and connect all ports

as shown. Click “Run”

Dr. Chen, Business Intelligence 89

The result is obtained (with “Meta Data View” shown.)

Dr. Chen, Business Intelligence 90

Result:

What do we obtain from this result?

Page 16: Covers the following pages Chapter 4: Data Miningbarney.gonzaga.edu/~chen/mbus673/chap_ppt/BI_ch04.pdfChapter 4: Data Mining ... Why should retailers, especially omni-channel retailers,

16

Dr. Chen, Business Intelligence 91

• We learned through our investigation, that the two most strongly

correlated attributes in our data set are Heating_Oil (or

Insulation) and Avg_Age, with a coefficient of 0.848 (or 0.643).

Thus, we know that in this data set, as the average age of the

occupants in a home increases, so too does the heating oil usage

in that home.

Illustration of positive correlation Dr. Chen, Business Intelligence

92

• We learned through our investigation, that the two most strongly

correlated attributes in our data set are Heating_Oil and

Avg_Age, with a coefficient of 0.848. Thus, we know that in this

data set, as the average age of the occupants in a home increases,

so too does the heating oil usage in that home.

• What we do not know is why that occurs. Data analysts often

make the mistake of interpreting correlation as causation.

• The assumption that correlation proves causation is

dangerous and often false.

Dr. Chen, Business Intelligence 93

Correlation Strengths

Fig. Correlation strengths between -1 and 1

Illustration of positive correlation

Illustration of negative correlation Dr. Chen, Business Intelligence 94

Summary on Correlation

• Correlation can be a quick and easy way to see how elements of

a given problem may be interacting with one another. Whenever

you find yourself asking how certain factors in a problem

you’re trying to solve interact with one another, consider

building a correlation matrix to find out.

• For example,

1) does customer satisfaction change based on time of year?

2) Does the amount of rainfall change the price of a crop?

3) Does household income influence which restaurants a

person patronizes?

• The answer to each of these questions is probably ‘yes’, but

correlation can not only help us know if that’s true, it can also

help us learn how strongly the interactions are when, and if,

they occur.

Dr. Chen, Business Intelligence 95

End of the Chapter

• Questions, comments