Data Mining: A Closer Look Typical Problems. Data Mining: Typical Problems Classification Estimation Prediction

Data Mining: A Closer Look

Typical Problems

Data Mining: Typical Problems

• Classification

• Estimation

• Prediction

Classification & Estimation

• Classification deals with discrete outcomes: yes or no; big or small; strange or not strange; sick or healthy; yellow, green or red; etc. It determines the class membership of a given object.

• Estimation deals with numeric outcomes, such as the number of children in a family or a family’s total household income. It is often used to perform a classification task.

• Neural networks and regression models are well suited to classification and estimation tasks.


Prediction

• Prediction is the same as classification or estimation, except that the records are classified according to some predicted future behavior or estimated future value.

• Any of the techniques used for classification and estimation can be used in prediction.


Classification and Prediction: Implementation

• To implement both classification and prediction, we need training examples: data instances for which the value of the variable to be predicted, or the class membership, is already known.
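As a minimal sketch of learning from training examples, the snippet below classifies a new instance by copying the class of its closest training example (a 1-nearest-neighbour classifier). The slides do not prescribe this technique; it is chosen here only because it is short. The attribute values and the names `training_examples`, `squared_distance` and `classify` are invented for illustration.

```python
# Hypothetical training examples: (age, maximum heart rate) -> known class.
# All values are invented for illustration only.
training_examples = [
    ((50, 170), "Healthy"),
    ((45, 175), "Healthy"),
    ((61, 140), "Sick"),
    ((64, 150), "Sick"),
]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(instance, examples):
    """Return the known class of the training example closest to `instance`."""
    _, nearest_label = min(
        examples, key=lambda ex: squared_distance(ex[0], instance)
    )
    return nearest_label


print(classify((48, 172), training_examples))
print(classify((62, 143), training_examples))
```

The classifier can only answer because every training example already carries its class label, which is the point made in the bullet above.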


Is Data Mining Appropriate for My Problem?


Will Data Mining help me?

• Can we clearly define the problem?

• Do potentially meaningful data exist?

• Do the data contain hidden knowledge, or are they useful for reporting purposes only?

• Will the cost of processing the data be less than the likely increase in profit seen by applying any potential knowledge gained from the data mining?


Data Mining vs. Data Query

• Shallow Knowledge

• Multidimensional Knowledge

• Hidden Knowledge

• Deep Knowledge


Shallow Knowledge

Shallow knowledge is factual. It can be easily stored and manipulated in a database.


Multidimensional Knowledge

Multidimensional knowledge is also factual. On-line Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.


Hidden Knowledge

Hidden knowledge represents patterns or regularities in data that cannot be easily found using a database query. Data mining algorithms, however, are designed to find such patterns.


Deep Knowledge

Deep knowledge is knowledge stored in a database that can only be found if we are given some direction about what we are looking for.


Data Mining vs. Data Query

• Shallow Knowledge can be extracted by a database query language such as SQL.

• Multidimensional Knowledge can be extracted by On-line Analytical Processing (OLAP) tools.

• Hidden Knowledge represents patterns and regularities in data that cannot be easily found; data mining tools can be used.

• Deep Knowledge can be found only if we are given some direction about what we are looking for; data mining tools can be used.


Data Mining vs. Data Query:

• Use data query if you already know, at least roughly, what you are looking for.

• Use data mining to find regularities in data that are not obvious or that are hidden.
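The contrast can be sketched in a few lines. In the query, the question is fixed in advance; in the (deliberately crude) mining step, every attribute-value test is scored to see which one best identifies responders. The records, the attribute names and the function `best_single_test` are hypothetical, and real data mining tools use far more capable algorithms.

```python
# Hypothetical customer records, invented for illustration.
records = [
    {"sex": "Female", "income": "30-40K", "responded": "Yes"},
    {"sex": "Female", "income": "20-30K", "responded": "Yes"},
    {"sex": "Male",   "income": "30-40K", "responded": "No"},
    {"sex": "Male",   "income": "40-50K", "responded": "No"},
    {"sex": "Female", "income": "40-50K", "responded": "Yes"},
    {"sex": "Male",   "income": "20-30K", "responded": "Yes"},
]

# Data query: we already know the question ("which women responded?").
women_who_responded = [
    r for r in records if r["sex"] == "Female" and r["responded"] == "Yes"
]


# Data mining (very crude): we do not know which attribute matters, so we
# score every attribute=value test by how accurately it picks out responders.
def best_single_test(rows, target="responded", positive="Yes"):
    best = None
    for attribute in rows[0]:
        if attribute == target:
            continue
        for value in sorted({r[attribute] for r in rows}):
            matched = [r for r in rows if r[attribute] == value]
            hits = sum(r[target] == positive for r in matched)
            score = (hits / len(matched), hits)  # accuracy first, then hit count
            if best is None or score > best[0]:
                best = (score, attribute, value)
    return best[1], best[2], best[0][0]


print(len(women_who_responded))
print(best_single_test(records))
```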


A Simple Data Mining Process Model


Data Mining: A KDD Process

– Data mining: the core of the knowledge discovery process.

[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
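The stages named in the KDD figure can be sketched as a chain of small functions. This is a toy illustration under stated assumptions, not a real KDD system: the two source lists, the attribute names and the threshold-search "mining" step are all invented for the example.

```python
# Two hypothetical source "databases"; one row has a missing value.
sales_db = [
    {"id": 1, "age": 45, "bought": "yes"},
    {"id": 2, "age": None, "bought": "no"},   # missing age
    {"id": 3, "age": 38, "bought": "yes"},
]
web_db = [
    {"id": 4, "age": 27, "bought": "no"},
    {"id": 5, "age": 41, "bought": "yes"},
]


def clean(rows):
    """Data cleaning: drop rows that contain missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: merge the cleaned sources into one 'warehouse' list."""
    return [row for source in sources for row in source]


def select(rows, attributes):
    """Selection: keep only the task-relevant attributes."""
    return [{a: r[a] for a in attributes} for r in rows]


def mine(rows):
    """Data mining: find the age threshold that best separates buyers."""
    best_threshold, best_correct = None, -1
    for threshold in sorted(r["age"] for r in rows):
        correct = sum(
            (r["age"] >= threshold) == (r["bought"] == "yes") for r in rows
        )
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold, best_correct


def evaluate(correct, rows):
    """Pattern evaluation: fraction of rows the pattern gets right."""
    return correct / len(rows)


warehouse = integrate(clean(sales_db), clean(web_db))
task_data = select(warehouse, ["age", "bought"])
threshold, correct = mine(task_data)
accuracy = evaluate(correct, task_data)
print(threshold, accuracy)
```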


The Data Warehouse

The data warehouse is a historical database designed for decision support.


Data Mining Strategies

A hierarchy of data mining strategies:

• Data Mining Strategies

– Supervised Learning: Classification, Estimation, Prediction

– Unsupervised Clustering

– Market Basket Analysis

Supervised Data Mining Algorithms:

• A single output attribute/multiple output attributes

• Output attributes are also called dependent variables because they depend on the values of input attributes (variables):

• Input attributes are also known as independent variables

Single output: $y = f(x_1, \ldots, x_n)$

Multiple outputs: $(y_1, \ldots, y_k) = f(x_1, \ldots, x_n)$

Data Mining Strategies: Classification

• Learning is supervised.

• The dependent variable(s) (output) is categorical or numeric.

• Well-defined classes.

• Current rather than future behavior.

Examples:

• Classify a loan applicant as a good or poor credit risk

• Develop a customer profile

• Classify a patient as sick or healthy

Data Mining Strategies: Estimation

• Learning is supervised.

• The dependent variable(s) (output) is numeric.

• Well-defined classes.

• Current rather than future behavior.

Examples:

• Estimate the number of minutes before a thunderstorm will reach a given location

• Estimate the amount of credit card purchases

• Estimate the salary of an individual

Data Mining Strategies: Prediction

• The emphasis is on predicting future rather than current outcomes.

• The output attribute may be categorical or numeric.

Examples:

• Predict next week’s (year’s) currency exchange rate

• Predict next week’s (year’s) Dow Jones Industrial closing value

• Predict the level of power consumption for some period of time

Classification, Estimation or Prediction?

The nature of the data determines whether a model is suitable for classification, estimation, or prediction.

The Cardiology Patient Dataset

This dataset contains 303 instances. Each instance holds information about a patient who either has or does not have a heart condition.

• 138 instances represent patients with heart disease.

• 165 instances contain information about patients free of heart disease.

Table 2.1 • Cardiology Patient Data

Attribute Name | Mixed Values | Numeric Values | Comments
---------------|--------------|----------------|---------
Age | Numeric | Numeric | Age in years
Sex | Male, Female | 1, 0 | Patient gender
Chest Pain Type | Angina, Abnormal Angina, NoTang, Asymptomatic | 1–4 | NoTang = Nonanginal pain
Blood Pressure | Numeric | Numeric | Resting blood pressure upon hospital admission
Cholesterol | Numeric | Numeric | Serum cholesterol
Fasting Blood Sugar < 120 | True, False | 1, 0 | Is fasting blood sugar less than 120?
Resting ECG | Normal, Abnormal, Hyp | 0, 1, 2 | Hyp = Left ventricular hypertrophy
Maximum Heart Rate | Numeric | Numeric | Maximum heart rate achieved
Induced Angina? | True, False | 1, 0 | Does the patient experience angina as a result of exercise?
Old Peak | Numeric | Numeric | ST depression induced by exercise relative to rest
Slope | Up, Flat, Down | 1–3 | Slope of the peak exercise ST segment
Number of Colored Vessels | 0, 1, 2, 3 | 0, 1, 2, 3 | Number of major vessels colored by fluoroscopy
Thal | Normal, Fix, Rev | 3, 6, 7 | Normal, fixed defect, reversible defect
Concept Class | Healthy, Sick | 1, 0 | Angiographic disease status

• Most and Least Typical Instances from the Cardiology Domain

Attribute Name | Most Typical Healthy Class | Least Typical Healthy Class | Most Typical Sick Class | Least Typical Sick Class
---------------|----------------------------|-----------------------------|-------------------------|-------------------------
Age | 52 | 63 | 60 | 62
Sex | Male | Male | Male | Female
Chest Pain Type | NoTang | Angina | Asymptomatic | Asymptomatic
Blood Pressure | 138 | 145 | 125 | 160
Cholesterol | 223 | 233 | 258 | 164
Fasting Blood Sugar < 120 | False | True | False | False
Resting ECG | Normal | Hyp | Hyp | Hyp
Maximum Heart Rate | 169 | 150 | 141 | 145
Induced Angina? | False | False | True | False
Old Peak | 0 | 2.3 | 2.8 | 6.2
Slope | Up | Down | Flat | Down
Number of Colored Vessels | 0 | 0 | 1 | 3
Thal | Normal | Fix | Rev | Rev

Classification, Estimation or Prediction?

The next two slides each contain a rule generated from this dataset. Is either of these rules predictive?

A Healthy Class Rule for the Cardiology Patient Dataset

IF 169 <= Maximum Heart Rate <= 202

THEN Concept Class = Healthy

Rule accuracy: 85.07%

Rule coverage: 34.55%

A Sick Class Rule for the Cardiology Patient Dataset

IF Thal = Rev & Chest Pain Type = Asymptomatic

THEN Concept Class = Sick

Rule accuracy: 91.14%

Rule coverage: 52.17%
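Combining the reported percentages with the class sizes given earlier (165 healthy, 138 sick) suggests the instance counts behind each rule. The sketch below infers those counts by rounding; the counts are an inference from the slide figures, not values stated in the source, and the function name `implied_counts` is made up for this example.

```python
# Class sizes stated for the Cardiology Patient Dataset.
healthy_total = 165
sick_total = 138


def implied_counts(class_total, coverage_pct, accuracy_pct):
    """Infer (correct, applicable) instance counts from rounded percentages.

    correct    = instances of the class that satisfy the rule
    applicable = all instances that satisfy the rule, from any class
    These counts are inferred from the slide's figures, not source data.
    """
    correct = round(class_total * coverage_pct / 100)
    applicable = round(correct / (accuracy_pct / 100))
    return correct, applicable


healthy_rule = implied_counts(healthy_total, 34.55, 85.07)
sick_rule = implied_counts(sick_total, 52.17, 91.14)

for name, (correct, applicable), total in [
    ("Healthy rule", healthy_rule, healthy_total),
    ("Sick rule", sick_rule, sick_total),
]:
    print(f"{name}: {correct} of {applicable} matches are correct "
          f"({100 * correct / applicable:.2f}% accuracy), "
          f"covering {100 * correct / total:.2f}% of the class")
```

Under this reading, the healthy rule would match 67 patients of whom 57 are healthy, and the sick rule would match 79 patients of whom 72 are sick; both pairs reproduce the percentages shown on the slides.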

Is the rule appropriate for classification or prediction?

• Prediction: If one’s maximum heart rate, checked on a regular basis, is low, he or she may be at risk of having a heart attack.

• Classification: If one has had a heart attack, expect the maximum heart rate to decrease.

Data Mining Strategies: Unsupervised Clustering

Unsupervised Clustering can be used to:

• determine if relationships can be found in the data.

• evaluate the likely performance of a supervised model.

• find a best set of input attributes for supervised learning.

• detect outliers.
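As a rough sketch of unsupervised clustering and outlier detection, the snippet below runs a tiny k-means on a single numeric attribute and then reports the value that lies farthest from its cluster centre. K-means is used here only as a familiar example of a clustering algorithm; the slides do not name one. The values and the function `k_means_1d` are invented for illustration.

```python
def k_means_1d(values, centroids, iterations=10):
    """Very small k-means for one numeric attribute.

    `centroids` holds the initial cluster centres; no class labels are used.
    """
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(v - centroids[i])
            )
            clusters[nearest].append(v)
        # Recompute each centre as the mean of its members (keep it if empty).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical ages, invented for illustration.
ages = [20, 22, 24, 50, 52, 54, 95]
centres, groups = k_means_1d(ages, centroids=[20.0, 50.0])

# Outlier candidate: the value farthest from its own cluster centre.
farthest = max(
    (abs(v - centres[i]), v) for i, g in enumerate(groups) for v in g
)
print(centres, groups, farthest)
```

With these values the two groups settle around the younger and older ages, and 95 stands out as the candidate outlier.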

Data Mining Strategies: Market Basket Analysis

• Find interesting relationships among retail products.

• Uses association rule algorithms.
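An association rule such as "IF bread THEN milk" is commonly scored by its support and confidence. These two measures are standard in association rule mining but are not named in the slides, so treat the sketch below as background. The baskets are hypothetical.

```python
# Hypothetical shopping baskets, invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "bread"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing `antecedent`, the fraction that also
    contain `consequent`."""
    with_antecedent = [t for t in transactions if antecedent <= t]
    return sum(consequent <= t for t in with_antecedent) / len(with_antecedent)


rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(rule_support, rule_confidence)
```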

Supervised Data Mining Techniques

Generation of Production Rules

A Hypothesis for the Credit Card Promotion Database

A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

• The Credit Card Promotion Database

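Income Range ($) | Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex | Age
-----------------|--------------------|-----------------|--------------------------|-----------------------|-----|----
40–50K | Yes | No | No | No | Male | 45
30–40K | Yes | Yes | Yes | No | Female | 40
40–50K | No | No | No | No | Male | 42
30–40K | Yes | Yes | Yes | Yes | Male | 43
50–60K | Yes | No | Yes | No | Female | 38
20–30K | No | No | No | No | Female | 55
30–40K | Yes | No | Yes | Yes | Male | 35
20–30K | No | Yes | No | No | Male | 27
30–40K | Yes | No | No | No | Male | 43
30–40K | Yes | Yes | Yes | No | Female | 41
40–50K | No | Yes | Yes | No | Female | 43
20–30K | No | Yes | Yes | No | Male | 29
50–60K | Yes | Yes | Yes | No | Female | 39
40–50K | No | Yes | No | No | Male | 55
20–30K | No | No | Yes | Yes | Female | 19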

Rule Accuracy and Rule Coverage

• Rule accuracy is the percentage of the instances satisfying the rule’s conditions that actually belong to the class the rule assigns. For example, if the rule holds for 9 of the 10 instances to which it is applicable, its accuracy is 90%.

• Rule coverage is the percentage of the instances of the target class that the rule covers. For example, if the rule covers 10 of the 20 instances of the class, its coverage is 50%.


• Rule accuracy is a between-class measure.

• Rule coverage is a within-class measure.
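Written as formulas, for a rule of the form IF A THEN class = C, the two measures can be expressed as follows. The symbols $N_A$, $N_C$ and $N_{A \wedge C}$ are notation introduced here for convenience, not taken from the slides.

```latex
% N_A        : number of instances that satisfy the rule's conditions A
% N_C        : number of instances that belong to class C
% N_{A \wedge C} : number of instances that satisfy A and belong to C
\[
\text{accuracy} = \frac{N_{A \wedge C}}{N_A} \times 100\%,
\qquad
\text{coverage} = \frac{N_{A \wedge C}}{N_C} \times 100\%
\]
% Examples from the text:
%   accuracy = 9 / 10  x 100% = 90%
%   coverage = 10 / 20 x 100% = 50%
```

The denominator of accuracy may include instances from other classes, which is why it is described as a between-class measure; the denominator of coverage counts only instances of class C, which makes it a within-class measure.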

Production Rules for the Credit Card Promotion Database

1. IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00% Rule Coverage: 66.67%

2. IF Sex = Male & 40K <= Income Range <= 50K
THEN Life Insurance Promotion = No
Rule Accuracy: 100.00% Rule Coverage: 50.00%

3. IF Credit Card Insurance = Yes
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00% Rule Coverage: 33.33%

4. IF 30K <= Income Range <= 40K & Watch Promotion = Yes
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00% Rule Coverage: 33.33%

• Rules 1-3 are predictive for new card holders

• Rule 4 might be used for the classification of the existing card holders
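The accuracy and coverage figures of the four production rules can be recomputed from the 15 rows of the Credit Card Promotion Database. The sketch below does so; the field names (`income`, `life_ins`, and so on) and the function `accuracy_and_coverage` are made up for this example, and each income condition is read as membership in the matching income band (for instance, 40K <= Income Range <= 50K is taken to mean the 40–50K band).

```python
# The 15 rows of the Credit Card Promotion Database:
# (income range, magazine promo, watch promo, life insurance promo,
#  credit card insurance, sex, age)
ROWS = [
    ("40-50K", "Yes", "No",  "No",  "No",  "Male",   45),
    ("30-40K", "Yes", "Yes", "Yes", "No",  "Female", 40),
    ("40-50K", "No",  "No",  "No",  "No",  "Male",   42),
    ("30-40K", "Yes", "Yes", "Yes", "Yes", "Male",   43),
    ("50-60K", "Yes", "No",  "Yes", "No",  "Female", 38),
    ("20-30K", "No",  "No",  "No",  "No",  "Female", 55),
    ("30-40K", "Yes", "No",  "Yes", "Yes", "Male",   35),
    ("20-30K", "No",  "Yes", "No",  "No",  "Male",   27),
    ("30-40K", "Yes", "No",  "No",  "No",  "Male",   43),
    ("30-40K", "Yes", "Yes", "Yes", "No",  "Female", 41),
    ("40-50K", "No",  "Yes", "Yes", "No",  "Female", 43),
    ("20-30K", "No",  "Yes", "Yes", "No",  "Male",   29),
    ("50-60K", "Yes", "Yes", "Yes", "No",  "Female", 39),
    ("40-50K", "No",  "Yes", "No",  "No",  "Male",   55),
    ("20-30K", "No",  "No",  "Yes", "Yes", "Female", 19),
]
# Field names are hypothetical shorthand for the table's column headings.
FIELDS = ("income", "magazine", "watch", "life_ins", "cc_ins", "sex", "age")
data = [dict(zip(FIELDS, row)) for row in ROWS]


def accuracy_and_coverage(rows, condition, target_value):
    """Return (accuracy %, coverage %) for IF condition THEN life_ins = target."""
    applicable = [r for r in rows if condition(r)]
    correct = [r for r in applicable if r["life_ins"] == target_value]
    in_class = [r for r in rows if r["life_ins"] == target_value]
    return (round(100 * len(correct) / len(applicable), 2),
            round(100 * len(correct) / len(in_class), 2))


# The four production rules, each as (condition, predicted value).
rules = [
    (lambda r: r["sex"] == "Female" and 19 <= r["age"] <= 43, "Yes"),
    (lambda r: r["sex"] == "Male" and r["income"] == "40-50K", "No"),
    (lambda r: r["cc_ins"] == "Yes", "Yes"),
    (lambda r: r["income"] == "30-40K" and r["watch"] == "Yes", "Yes"),
]
results = [accuracy_and_coverage(data, cond, target) for cond, target in rules]
for number, (acc, cov) in enumerate(results, start=1):
    print(f"Rule {number}: accuracy {acc}%, coverage {cov}%")
```

Nine of the 15 card holders took the life insurance promotion and six did not, so a rule that correctly covers six of the nine "Yes" instances has 66.67% coverage, and one that covers three of the six "No" instances has 50% coverage.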
