2011 Data Mining, Industrial & Information Systems Engineering
Chapter 2: Overview of Data Mining Process
Pilsung Kang, Industrial & Information Systems Engineering
Seoul National University of Science & Technology
2011 Data Mining, IISE, SNUT
Data Mining Definition Revisited
Extracting useful information from large datasets. (Hand et al., 2001)
Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. (Berry and Linoff, 1997, 2000)
Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. (Gartner Group, 2004)
Descriptive vs. Predictive (purpose)
Descriptive Modeling
Look back to the past.
To extract compact and easily understood information from large, sometimes gigantic databases.
OLAP (online analytical processing), SQL (structured query language).
Predictive Modeling
Predict the future.
Identify strong links between variables of data.
To predict the unknown consequence (dependent variable) based on the information provided (independent variables).
y = f(x1, x2, ..., xn) + ε
Supervised vs. Unsupervised (methods)
Supervised Learning
Goal: predict a single “target” or “outcome” variable.
Finds relations between X and Y.
Train (learn) on data where the target value is known.
Score data where the target value is not known.
Unsupervised Learning
Explores intrinsic characteristics.
Estimates underlying distribution.
Segments data into meaningful groups or detects patterns.
There is no target (outcome) variable to predict or classify.
Data Mining Techniques
Data Visualization
Graphs and plots of data.
Histograms, boxplots, bar charts, scatterplots.
Especially useful to examine relationships between pairs of
variables.
Descriptive & Unsupervised
Data Mining Techniques
Data Reduction
Distillation of complex/large data into simpler/smaller data.
Reducing the number of variables/columns.
Also called dimensionality reduction (variable selection, variable extraction, e.g., principal component analysis).
Reducing the number of records/rows.
Also called data compression (e.g., sampling and clustering).
Descriptive & Unsupervised
Data Visualization + Data Reduction = Data Exploration
Data Mining Techniques
Segmentation/Clustering
Goal: divide the entire data into a small number of subgroups.
Homogeneous within groups while heterogeneous between groups.
Examples: Market segmentation, social network analysis.
Descriptive & Unsupervised
Data Mining Techniques
Segmentation/Clustering example: hierarchical clustering
Data Mining Techniques
Classification
Goal: predict categorical target (outcome) variable.
Examples: purchase/no purchase, fraud/no fraud, creditworthy/not creditworthy.
Each row is a case/record/instance.
Each column is a variable.
Target variable is often binary (yes/no).
Predictive & Supervised
Data Mining Techniques
Classification Example: Decision Tree
Data Mining Techniques
Classification Example: Logistic Regression
[Figure: logistic (sigmoid) curve, x-axis from -5 to 5, y-axis from 0 to 1]
Play if 1/(1+exp(-0.2*outlook+0.4*humidity+0.8*windy)) > 0.5
Else, do not play
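The decision rule above can be sketched in a few lines of Python. The coefficients and signs are copied from the slide as written; the function names and the assumption that outlook, humidity and windy are already numeric are illustrative only.

```python
import math

def play_probability(outlook: float, humidity: float, windy: float) -> float:
    """Logistic score from the slide's rule: 1 / (1 + exp(z))."""
    # z uses the slide's coefficients; the numeric encoding of the
    # three inputs is an assumption made for this sketch.
    z = -0.2 * outlook + 0.4 * humidity + 0.8 * windy
    return 1.0 / (1.0 + math.exp(z))

def should_play(outlook: float, humidity: float, windy: float,
                threshold: float = 0.5) -> bool:
    """Classify as 'play' when the logistic score exceeds the threshold."""
    return play_probability(outlook, humidity, windy) > threshold
```

With all inputs at zero the score is exactly 0.5, so the strict inequality classifies that case as "do not play".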
Data Mining Techniques
Classification Examples
“Separate the riding mower buyers (●) from non-buyers (○)”
(x-axis: income (x $1,000), y-axis: lot size (x 1,000 sqft))
Data Mining Techniques
Prediction
Goal: predict numerical target (outcome) variable.
Examples: sales, revenue, performance.
As in classification:
Each row is a case/record/instance.
Each column is a variable.
Taken together, classification and prediction
constitute “predictive analytics”
Predictive & Supervised
Data Mining Techniques
Prediction Example: Neural Networks
Data Mining Techniques
Association Rule
Goal: produce rules that define “what goes with what”
Example: “If X was purchased, Y was also purchased”
Rows are transactions.
Used in recommender systems – “Our records show you
bought X, you may also like Y”
Also called “affinity analysis,” or “market basket analysis”
Predictive & Unsupervised
Data Mining Techniques
Association Rule Example: Market Basket Analysis
Walmart (USA), E-Mart (Korea)
Data Mining Techniques
Novelty Detection
Goal: identify if a new case is similar to the given ‘normal’
cases.
Example: medical diagnosis, fault detection, identity verification.
Each row is a case/record/instance.
Each column is a variable.
No explicit target variable; all given records are assumed to have the same target (the ‘normal’ class).
Also called “outlier detection,” or “one-class classification”
Predictive & Unsupervised
Data Mining Techniques
Novelty Detection Example: Keystroke Dynamics-based User Authentication
http://ksd.snu.ac.kr
Data Mining Techniques
Supervised Learning, Descriptive Modeling: …
Supervised Learning, Predictive Modeling: Classification, Prediction
Unsupervised Learning, Descriptive Modeling: Data Visualization, Data Reduction, Segmentation/Clustering
Unsupervised Learning, Predictive Modeling: Association Rules, Novelty Detection
Steps in Data Mining
1. Define and understand the purpose of the data mining project
2. Formulate the data mining problem
3. Obtain/verify/modify the data
4. Explore and customize the data
5. Build data mining models
6. Evaluate and interpret the results
7. Deploy and monitor the model
Steps in Data Mining
Define and understand the purpose of the data mining project
Why do we have to conduct this project?
What would be achieved if the project succeeds?
(Jun, 2010: http://www.kdnuggets.com)
Steps in Data Mining
Formulate the data mining problem
What is the purpose?
Increase sales.
Detect cancer patients.
What data mining task is appropriate?
Classification.
Prediction.
Association rules, …
Steps in Data Mining
Obtain/verify/modify the data: Data acquisition
Data source
Data warehouse,
Data mart, …
Define input variables and target variable if necessary
Ex: Churn prediction for credit card service
• Inputs: age, sex, tenure, amount of spending, risk grade, …
• Target: whether he/she leaves the company.
Steps in Data Mining
Obtain/verify/modify the data: Outlier detection
Outlier
“A value that the variable cannot have” or “an extremely rare value” (ex: age 990, height -150cm, …)
Real databases contain outliers for many reasons.
How to deal with outliers?
Ignore the records with outliers if the total number of records is sufficient.
Replace with another value (mean, median, estimate from a certain pdf, etc.) if the total number of records is insufficient.
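The two options for outliers can be sketched as follows. The valid range, the `min_records` cutoff for what counts as "sufficient", and the choice of the median as the replacement are all assumptions made for illustration; the slide does not fix any of them.

```python
from statistics import median

def handle_outliers(values, low, high, min_records=100):
    """Drop out-of-range values if enough records remain, else replace them.

    low/high define the range a value may take (hypothetical bounds);
    min_records is a hypothetical cutoff for 'sufficient' data.
    """
    valid = [v for v in values if low <= v <= high]
    if len(valid) >= min_records:
        # Enough data: ignore the records that contain outliers.
        return valid
    # Too little data: keep every record, replace outliers by the median
    # of the valid values (raises StatisticsError if nothing is valid).
    fill = median(valid)
    return [v if low <= v <= high else fill for v in values]
```

For example, `handle_outliers([25, 31, 990, 47, 38], low=0, high=120)` keeps all five records and replaces the impossible age 990 with 34.5.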
Steps in Data Mining
Obtain/verify/modify the data: Missing Value Imputation
Missing value
A variable is missing when it has a null value in the database although it should have a certain real value.
Causes: operational errors, human errors.
How to deal with missing values?
Ignore the records with missing values if the total number of records is sufficient.
Replace with another value (mean, median, estimate from a certain pdf, etc.) if the total number of records is insufficient.
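A minimal sketch of the same two options for missing values, assuming records are dictionaries and a missing value is stored as `None`. The `min_complete` cutoff and the use of the mean are illustrative assumptions.

```python
from statistics import mean

def impute_missing(records, column, min_complete=100):
    """Drop records with a missing value in `column`, or fill them in.

    records: list of dicts; a missing value is represented by None.
    min_complete: hypothetical cutoff for 'sufficient' complete records.
    """
    complete = [r for r in records if r[column] is not None]
    if len(complete) >= min_complete:
        # Enough complete records: ignore the incomplete ones.
        return complete
    # Otherwise replace each missing value with the column mean.
    fill = mean(r[column] for r in complete)
    return [{**r, column: fill} if r[column] is None else r for r in records]
```

The mean is only one of the replacement values listed above; swapping in `statistics.median` would implement the median variant.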
Steps in Data Mining
Obtain/verify/modify the data: Variable handling
Type of variables
Binary: 0/1 (ex: benign/malignant in medical diagnosis).
Categorical: more than two values, ordered (high, middle, low) or not ordered (ex: color, job).
Ordinal: continuous, differences between two consecutive values are not identical (ex: rank in the final exam).
Interval: continuous, differences between two consecutive values are identical (ex: age, height, weight).
Steps in Data Mining
Obtain/verify/modify the data: Variable handling
Variable transformation
Binning: interval → binary or ordered categorical (ex: Low, Mid, High).
1-of-C coding: unordered categorical → binary.
Ex: “Color: yellow, red, blue, green”
Color   d1  d2  d3
yellow  1   0   0
red     0   1   0
blue    0   0   1
green   0   0   0
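Both transformations are easy to sketch. The color coding below reproduces the table above, where the last category (green) is the all-zero reference; the cut points used for binning in the usage note are hypothetical, since the slide gives none.

```python
def bin_value(x, cut_points, labels):
    """Map a numeric value to an ordered category.

    cut_points must be sorted; labels needs one more entry than cut_points.
    """
    for cut, label in zip(cut_points, labels):
        if x < cut:
            return label
    return labels[-1]

def one_of_c(value, categories):
    """Encode a category with C-1 binary dummies.

    The last entry of `categories` is the reference level and is
    encoded as all zeros, as in the color table.
    """
    return [1 if value == c else 0 for c in categories[:-1]]

colors = ["yellow", "red", "blue", "green"]
```

For example, `one_of_c("red", colors)` gives `[0, 1, 0]`, and `bin_value(45, [30, 60], ["Low", "Mid", "High"])` gives `"Mid"` under the assumed cut points 30 and 60.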
Steps in Data Mining
Explore and customize the data: Data Visualization
Single variable
Histogram:
• shows the distribution of a single variable.
• possible to check the normality.
[Figure: histogram of MEDV, x-axis: MEDV (5 to 50), y-axis: frequency (0 to 180)]
Box plot
[Figure: box plot annotated with “max”, quartile 3, median, mean, quartile 1, “min”, and outliers]
Steps in Data Mining
Explore and customize the data: Data Visualization
Multiple variables
Correlation table:
• Indicates which variables are highly (positively or negatively) correlated.
• Helps to remove irrelevant variables or select representative variables.
        CRIM      ZN        INDUS     CHAS      NOX       RM
CRIM    1
ZN      -0.20047  1
INDUS   0.406583  -0.53383  1
CHAS    -0.05589  -0.0427   0.062938  1
NOX     0.420972  -0.5166   0.763651  0.091203  1
RM      -0.21925  0.311991  -0.39168  0.091251  -0.30219  1
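A lower-triangular correlation table like the one above can be computed with the Pearson correlation coefficient. The sketch below uses only the standard library; it assumes the table holds Pearson correlations (the slide does not say which coefficient was used), and the function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_table(columns):
    """Lower-triangular table: {(row, col): r} for each pair of variables.

    columns: dict mapping a variable name to its list of values.
    """
    names = list(columns)
    return {
        (a, b): pearson(columns[a], columns[b])
        for i, a in enumerate(names)
        for b in names[: i + 1]
    }
```

Pairs whose absolute correlation is close to 1 are candidates for keeping only one representative variable.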
Steps in Data Mining
Explore and customize the data: Data Visualization
Multiple variables
Scatter plot matrix:
• Shows the relation between each pair of variables.
[Figure: scatter plot matrix of Var. 1, Var. 2, Var. 3, and Var. 4]
Steps in Data Mining
Explore and customize the data: Dimensionality Reduction
Curse of dimensionality
The number of records required to sustain the same explanatory power increases exponentially as the number of variables increases.
“If there are various logical ways to explain a certain phenomenon, the simplest is the best” - Occam’s Razor
2^1 = 2, 2^2 = 4, 2^3 = 8
Steps in Data Mining
Explore and customize the data: Dimensionality Reduction
Variable reduction
Select a small set of relevant variables.
Correlation analysis, Kolmogorov-Smirnov test, …
      V1    V2    V3    V4    V5    V6
V1    1     0.9   -0.8  0.1   0.2   0
V2          1     -0.7  0.2   0.1   0.1
V3                1     -0.1  0.1   -0.1
V4                      1     0.9   0.3
V5                            1     -0.9
V6                                  1
Select V1 & V4
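One possible reading of how the correlation matrix leads to "V1 & V4" is to group variables that are linked by strong correlations and keep the first variable of each group. The sketch below does this with a small union-find; the 0.7 threshold and the grouping rule are assumptions, not something stated on the slide.

```python
def select_representatives(names, corr, threshold=0.7):
    """Keep one variable per group of strongly correlated variables.

    corr: {(a, b): r} for each pair of variables.
    Two variables share a group when a chain of pairs with
    |r| >= threshold connects them (threshold is an assumed value).
    """
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group root, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), r in corr.items():
        if abs(r) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    selected, seen = [], set()
    for n in names:  # the first variable of each group is its representative
        root = find(n)
        if root not in seen:
            seen.add(root)
            selected.append(n)
    return selected

slide_corr = {
    ("V1", "V2"): 0.9, ("V1", "V3"): -0.8, ("V1", "V4"): 0.1,
    ("V1", "V5"): 0.2, ("V1", "V6"): 0.0, ("V2", "V3"): -0.7,
    ("V2", "V4"): 0.2, ("V2", "V5"): 0.1, ("V2", "V6"): 0.1,
    ("V3", "V4"): -0.1, ("V3", "V5"): 0.1, ("V3", "V6"): -0.1,
    ("V4", "V5"): 0.9, ("V4", "V6"): 0.3, ("V5", "V6"): -0.9,
}
```

With the slide's values, V1, V2, V3 form one group and V4, V5, V6 another (V6 joins through V5), so the function returns V1 and V4.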
Steps in Data Mining
Explore and customize the data: Dimensionality Reduction
Variable extraction
Construct a new variable that contains more condensed information than the original variables.
Principal component analysis (PCA), …
Example:
Original variables:
• Age, sex, height, weight
• Income, property, tax paid
Constructed variables:
• Var1: age+3*I(sex = female)+0.2*height-0.3*weight
• Var2: Income + 0.1*property + 2*tax paid
Steps in Data Mining
Explore and customize the data: Instance Reduction
Random sampling
Select a small set of records, with every record equally likely to be selected.
In classification, class ratios are preserved.
Stratified sampling
Select a set of records such that rare events have a higher probability of being selected.
In classification, class ratios are modified.
• Under-sampling: preserve minority, reduce majority.
• Over-sampling: preserve majority, increase minority.
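Under-sampling and over-sampling can be sketched as below. Balancing every class to exactly the minority (or majority) size is an assumption made for simplicity; the slide only says that the majority is reduced or the minority increased. The `label_of` callback and the `seed` parameter are illustrative names.

```python
import random

def _group_by_class(records, label_of):
    groups = {}
    for r in records:
        groups.setdefault(label_of(r), []).append(r)
    return groups

def under_sample(records, label_of, seed=0):
    """Keep all minority records, randomly reduce larger classes to match."""
    rng = random.Random(seed)
    groups = _group_by_class(records, label_of)
    n_min = min(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g if len(g) == n_min else rng.sample(g, n_min))
    return out

def over_sample(records, label_of, seed=0):
    """Keep all records, add random duplicates of smaller classes."""
    rng = random.Random(seed)
    groups = _group_by_class(records, label_of)
    n_max = max(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g)
        out.extend(rng.choices(g, k=n_max - len(g)))  # sampling with replacement
    return out
```

Starting from 90 majority and 10 minority records, under-sampling returns 20 records and over-sampling returns 180, each with a 1:1 class ratio.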
Steps in Data Mining
Explore and customize the data: Data separation
Over-fitting
Occurs when data mining algorithms ‘memorize’ the given data, including unnecessary details (noise, outliers, etc.).
[Figure: two plots, both axes from 0 to 10]
Steps in Data Mining
Explore and customize the data: Data partition
Training Data
Used to build a model or train a data mining algorithm.
Validation Data
Used to select the best parameters for the model.
Test Data
Used to select the best model among the algorithms considered.
[Figure: the Training Data, Validation Data, and Test Data partitions, each applied to Algorithms A-1, A-2, A-3, B-1, B-2, and B-3]
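A random three-way partition can be sketched as follows. The 60/20/20 split and the fixed seed are assumptions for illustration; the slide does not prescribe proportions.

```python
import random

def partition(records, train=0.6, valid=0.2, seed=0):
    """Shuffle the records and split them into training, validation, test.

    train and valid are fractions of the data (assumed values);
    whatever remains becomes the test partition.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_valid],
        shuffled[n_train + n_valid:],
    )
```

Each record lands in exactly one partition, so the held-out partitions are never seen while the model is being fit on the training data.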
Steps in Data Mining
Explore and customize the data: Data normalization
Normalization (Standardization)
Eliminate the effect caused by different measurement
scale or unit.
z-score: (value-mean)/(standard deviation).
Original data
Id      Age   Income
1       25    1,000,000
2       35    2,000,000
3       45    3,000,000
…       …     …
Mean    35    2,000,000
Stdev   5     1,000,000
Normalized data
Id      Age   Income
1       -2    -1
2       0     0
3       2     1
…       …     …
Mean    0     0
Stdev   1     1
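The z-score transformation can be checked against the example values. The means and standard deviations below are the ones stated in the table (they describe the full data set, of which only three rows are shown), so they are passed in rather than recomputed from the three visible rows.

```python
def z_score(value, mean, stdev):
    """Standardize a value: (value - mean) / (standard deviation)."""
    return (value - mean) / stdev

ages = [25, 35, 45]
incomes = [1_000_000, 2_000_000, 3_000_000]

# Mean and standard deviation are taken from the table, not estimated here.
normalized_ages = [z_score(a, mean=35, stdev=5) for a in ages]
normalized_incomes = [z_score(x, mean=2_000_000, stdev=1_000_000) for x in incomes]
```

After the transformation, age and income are on the same unit-free scale even though the raw values differ by several orders of magnitude.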
Steps in Data Mining
Build data mining models
Data mining algorithm
Classification
• Logistic regression, k-nearest neighbor, naïve Bayes, classification trees, neural networks, linear discriminant analysis.
Prediction
• Linear regression, k-nearest neighbor, regression trees, neural networks.
Association rules: Apriori algorithm.
Clustering: hierarchical clustering, K-Means clustering.
Steps in Data Mining
Evaluate and interpret the results
Classification performance
Confusion matrix
              Predicted 1(+)                      Predicted 0(-)
Actual 1(+)   True positive, Sensitivity (A)      False negative, Type II error (B)
Actual 0(-)   False positive, Type I error (C)    True negative, Specificity (D)
Simple accuracy: (A+D)/(A+B+C+D)
Balanced correction rate: $\sqrt{\frac{A}{A+B} \cdot \frac{D}{C+D}}$
Lift charts, receiver operating characteristic (ROC) curve, etc.
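The classification measures can be computed directly from the four cells of the confusion matrix, with A = true positives, B = false negatives, C = false positives, and D = true negatives. The sketch assumes the balanced correction rate is the geometric mean of sensitivity A/(A+B) and specificity D/(C+D).

```python
import math

def classification_metrics(a, b, c, d):
    """Metrics from confusion-matrix counts.

    a: true positives, b: false negatives,
    c: false positives, d: true negatives.
    """
    sensitivity = a / (a + b)  # share of actual positives found
    specificity = d / (c + d)  # share of actual negatives found
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumed definition: geometric mean of the two rates.
        "bcr": math.sqrt(sensitivity * specificity),
    }
```

Unlike simple accuracy, the balanced correction rate drops to zero when one class is never predicted correctly, which matters when the classes are imbalanced.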
Steps in Data Mining
Evaluate and interpret the results
Prediction performance
y: actual target value, y’: predicted target value
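• Mean squared error: $\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2$
• Root mean squared error: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2}$
• Mean absolute error: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - y'_i|$
• Mean absolute percentage error: $\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{|y_i - y'_i|}{|y_i|}$
[Figure: plot with both axes from 0 to 10]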
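The four prediction error measures can be sketched in plain Python, with `y` holding the actual target values and `y_pred` the predicted ones. MAPE is returned as a fraction rather than a percentage, and it is undefined when an actual value is zero; both points are choices of this sketch.

```python
import math

def mse(y, y_pred):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(y, y_pred)) / len(y)

def rmse(y, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(mse(y, y_pred))

def mae(y, y_pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(y, y_pred)) / len(y)

def mape(y, y_pred):
    """Mean absolute percentage error as a fraction (fails if any actual is 0)."""
    return sum(abs(a - p) / abs(a) for a, p in zip(y, y_pred)) / len(y)
```

For actual values `[2, 4, 5]` and predictions `[1, 4, 7]`, the errors are 1, 0, and -2, so MAE is 1 and MSE is 5/3.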
Steps in Data Mining
Evaluate and interpret the results
Clustering
Within variance: variance among records in a single cluster.
Between variance: variance between clusters.
Good clustering: high between variance and low within
variance.
Association rules
Support: $P(A, B)$
Confidence: $P(A \mid B) = \frac{P(A, B)}{P(B)}$
Lift: $\frac{P(A \mid B)}{P(A)} = \frac{P(A, B)}{P(A)\,P(B)}$
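Support, confidence, and lift can be estimated from a list of transactions by counting. The sketch reads confidence as the conditional probability of the consequent given the antecedent, and lift as that confidence divided by the support of the consequent; the function names and the toy baskets in the usage note are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) = P(both) / P(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to the baseline frequency of the consequent."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))
```

For the five baskets {bread, milk}, {bread, butter}, {milk}, {bread, milk, butter}, {butter}, the rule "bread → milk" has support 2/5, confidence 2/3, and lift 10/9. A lift above 1 means the two items occur together more often than they would if they were independent.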
Steps in Data Mining
Deploy and monitor the model
Deployment
Integrate the data mining model into the operational system.
Run the model on real data to produce decisions or actions.
• “Send Mr. Kang a coupon because his likelihood of leaving the company next month is 80%”
Monitoring
Evaluate the performance of the model after deployment.
Update or redevelop if necessary.