Predictive Analytics using Machine Learning · 2017-03-18


Predictive Analytics using Machine Learning
Praisan Padungweang, Ph.D.

Model evaluation


The Confusion Matrix

A confusion matrix shows the number of correct and incorrect decisions made by the model compared to the actual labels (targets) in the data.

For a problem involving n classes, it is an n × n matrix with the rows labeled with actual classes and the columns labeled with predicted classes.

[Layout: an empty 2 × 2 confusion matrix with actual classes T and F as rows and predicted classes T and F as columns, and an empty 3 × 3 confusion matrix for classes a, b, and c.]

The Confusion Matrix

The relationship between classes can be depicted as a 2 × 2 confusion matrix:

◦ True Positive (TP): Correctly classified as the class of interest

◦ True Negative (TN): Correctly classified as not the class of interest

◦ False Positive (FP): Incorrectly classified as the class of interest

◦ False Negative (FN): Incorrectly classified as not the class of interest

                 Predicted
                 T                                 F
Actual    T      True Positive                     False Negative (Type II error)
          F      False Positive (Type I error)    True Negative

Accuracy = (TP + TN) / (TP + FN + FP + TN)
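As a small illustration (not from the slides), the four counts and the accuracy can be computed directly in Python; the toy label lists below are made up:

```python
# Confusion-matrix counts for a binary problem: 1 = positive class, 0 = negative.
def confusion_counts(actual, predicted):
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts([1, 0, 1, 0, 0], [1, 0, 0, 1, 0])
print((tp + tn) / (tp + fn + fp + tn))   # 0.6
```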

Model evaluation for multiple classes

Confusion matrix for k classes (rows: actual status A1..Ak; columns: predicted status P1..Pk):

                       Predicted status
                 P1      P2      P3     ...    Pk
Actual    A1     A1P1    A1P2    A1P3   ...    A1Pk
status    A2     A2P1    A2P2    A2P3   ...    A2Pk
          A3     A3P1    A3P2    A3P3   ...    A3Pk
          ...
          Ak     AkP1    AkP2    AkP3   ...    AkPk

Accuracy = (A1P1 + A2P2 + A3P3 + ... + AkPk) / n

where AiPj counts the instances of actual class Ai predicted as class Pj; the diagonal cells AiPi are the correct classifications, and n is the total number of instances.
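A minimal sketch of the same computation: multi-class accuracy is the sum of the diagonal divided by n. The 3 × 3 matrix below is a made-up example:

```python
import numpy as np

cm = np.array([[50, 3, 2],    # rows: actual classes
               [4, 60, 6],    # columns: predicted classes
               [1, 5, 69]])
accuracy = np.trace(cm) / cm.sum()   # diagonal = correct classifications
print(accuracy)   # 0.895
```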


Model evaluation for binary classes

          Churn   Predicted
1 John     Yes     0.72
2 Sophie   No      0.56
3 David    Yes     0.44
4 Emma     No      0.18
5 Bob      No      0.36

With a cutoff of 0.5 on the predicted churn probabilities, the confusion matrix is:

                       Predicted status
                       churn          no churn
Actual    churn        1 (John)       1 (David)
status    no churn     1 (Sophie)     2 (Emma, Bob)

Accuracy = (TP + TN) / n = (1 + 2) / 5 = 0.6
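A quick check of this example in Python; the 0.5 cutoff is the one implied by the slide's matrix:

```python
# (actual churn, predicted probability of churn) for the five customers
customers = {"John": (1, 0.72), "Sophie": (0, 0.56), "David": (1, 0.44),
             "Emma": (0, 0.18), "Bob": (0, 0.36)}

correct = sum((p >= 0.5) == bool(y) for y, p in customers.values())
print(correct / len(customers))   # 0.6
```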


Accuracy = (TP + TN) / N

Records sorted by the predicted probability of class "1" (24 records, shown in two columns):

Actual class   Prob. of "1"      Actual class   Prob. of "1"
1              0.996             1              0.506
1              0.988             0              0.471
1              0.984             0              0.337
1              0.980             1              0.218
1              0.948             0              0.199
1              0.889             0              0.149
1              0.848             0              0.048
0              0.762             0              0.038
1              0.707             0              0.025
1              0.681             0              0.022
1              0.656             0              0.016
0              0.622             0              0.004

[Slide exercise: the 2 × 2 confusion matrix (actual 1/0 vs. predicted 1/0) is left blank, to be filled in for a chosen probability cutoff.]
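A sketch of how the cutoff choice plays out on these 24 records; the three cutoff values below are arbitrary:

```python
data = [(1, .996), (1, .988), (1, .984), (1, .980), (1, .948), (1, .889),
        (1, .848), (0, .762), (1, .707), (1, .681), (1, .656), (0, .622),
        (1, .506), (0, .471), (0, .337), (1, .218), (0, .199), (0, .149),
        (0, .048), (0, .038), (0, .025), (0, .022), (0, .016), (0, .004)]

for cutoff in (0.25, 0.50, 0.75):
    acc = sum((p >= cutoff) == bool(y) for y, p in data) / len(data)
    print(cutoff, acc)   # e.g. a 0.50 cutoff gives 21/24 = 0.875
```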

Other Evaluation Metrics

There are other evaluation metrics that can be calculated from the confusion matrix:

◦ Sensitivity and specificity

◦ Precision and Recall

◦ F-measure


Sensitivity and specificity


                 Predicted
                 T                                 F
Actual    T      True Positive                     False Negative (Type II error)
          F      False Positive (Type I error)    True Negative

True positive rate, Sensitivity, Recall = TP / (TP + FN)

True negative rate, Specificity = TN / (FP + TN)

Positive predictive value, Precision = TP / (TP + FP)

Accuracy = (TP + TN) / (TP + FN + FP + TN)

F-score = (2 × Precision × Recall) / (Precision + Recall)
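The formulas above, collected into one small helper (a sketch; divisions by zero are not guarded):

```python
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)        # true positive rate, recall
    specificity = tn / (fp + tn)        # true negative rate
    precision = tp / (tp + fp)          # positive predictive value
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, accuracy, f_score
```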

For example, in a spam filtering problem:
◦ a sensitivity of 0.842 implies that 84.2 percent of spam messages were correctly classified.
◦ a specificity of 0.996 implies that 99.6 percent of non-spam messages were correctly classified, or alternatively, that 0.4 percent of valid messages were rejected as spam.

The idea of rejecting 0.4 percent of valid email messages may be unacceptable.


Precision and recall

Positive predictive value, Precision = TP / (TP + FP)
◦ When a model predicts the positive class, how often is it correct?
◦ A precise model will only predict the positive class in cases very likely to be positive; its positive predictions are very trustworthy.

True positive rate, Sensitivity, Recall = TP / (TP + FN)
◦ A model with high recall captures a large portion of the positive examples. For example, a search engine with high recall returns a large number of documents pertinent to the search query.

Having both high precision and high recall at the same time is very challenging.

F-measure

A measure of model performance that combines precision and recall into a single number is known as the F-measure (also sometimes called the F1 score or the F-score):

F1 = (2 × Precision × Recall) / (Precision + Recall)

Since the F-measure reduces model performance to a single number, it provides a convenient way to compare several models side by side.

Problems with Unequal Costs and Benefits

Accuracy makes no distinction between false positive and false negative errors.

◦ It makes the tacit assumption that both errors are equally important.

◦ In real-world domains this is rarely the case.

These two errors are very different, should be counted separately, and should have different costs.

For example, in cancer screening:
◦ False positive (model: cancer; actual: not): the patient would be given further tests, which are expensive, inconvenient, and stressful.
◦ False negative (model: not; actual: cancer): do nothing, and the disease goes untreated!

A Key Analytical Framework: Expected Value

The general form of an expected value calculation:

EV = p(o1)·v(o1) + p(o2)·v(o2) + ... = Σi p(oi)·v(oi)

◦ oi is a possible decision outcome
◦ p(oi) is its probability
◦ v(oi) is its business value

The probabilities often can be estimated from the data.

The business values often need to be acquired from other sources:
◦ usually the values must come from external domain knowledge.
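A minimal sketch of the calculation, with made-up probabilities and values for a hypothetical two-outcome decision:

```python
# (p(o_i), v(o_i)) pairs for a hypothetical decision with two outcomes
outcomes = [(0.3, 50.0), (0.7, -2.0)]
ev = sum(p * v for p, v in outcomes)
print(ev)   # 0.3*50 - 0.7*2 = 13.6
```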

Expected Value for Model Evaluation

In targeted marketing, for example, each consumer needs to be classified as a likely responder versus not a likely responder, so that we can target the likely responders.

Cost/profit:
◦ A consumer buys the product for $200, and our product-related costs are $100.
◦ We mail some marketing materials, and the overall cost including postage is $1.

Yielding:
◦ a value (profit) of $99 if the consumer responds (buys the product);
◦ a cost of $1, or equivalently a benefit of -$1, if the consumer does not respond.

Cost-benefit matrix:

                  Predicted
                  R        N
Actual    R       99       0
          N       -1       0

Expected Value for Model Evaluation: targeted marketing

Model confusion matrix (counts, n = 2,000):

                  Predicted
                  R        N
Actual    R       150      150
          N       200      1,500

Accuracy = (150 + 1,500) / 2,000 = 82.5%

Dividing the counts by 2,000 gives the outcome probabilities:

                  Predicted
                  R        N
Actual    R       0.075    0.075
          N       0.100    0.750

Multiplying each probability by the corresponding cost-benefit entry:

                  Predicted
                  R        N
Actual    R       7.425    0
          N       -0.100   0

Expected value = 7.425 - 0.100 = 7.325

Expected Value for Model Evaluation: targeted marketing (a model that targets no one)

Model confusion matrix (counts, n = 2,000):

                  Predicted
                  R        N
Actual    R       0        300
          N       0        1,700

Accuracy = 1,700 / 2,000 = 85%

Probabilities:

                  Predicted
                  R        N
Actual    R       0        0.15
          N       0        0.85

With the same cost-benefit matrix, every product term is 0:

Expected value = 0

Despite its higher accuracy (85% vs. 82.5%), this model has an expected value of 0, while the first model's is 7.325.
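The whole calculation fits in a few lines; a sketch using NumPy with the matrices from the two slides above:

```python
import numpy as np

def expected_value(confusion, cost_benefit):
    probs = confusion / confusion.sum()          # p(o_i) for each cell
    return float((probs * cost_benefit).sum())   # sum of p(o_i) * v(o_i)

cb = np.array([[99, 0], [-1, 0]])
model = np.array([[150, 150], [200, 1500]])
target_no_one = np.array([[0, 300], [0, 1700]])
print(expected_value(model, cb))          # 7.325
print(expected_value(target_no_one, cb))  # 0.0
```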

Expected Value for Model Evaluation: churn prediction

Cost-benefit matrix:

                    Predicted
                    churn    not
Actual    churn     -10      -100
          not       -10      0

Model 1 confusion matrix (counts, n = 10,000):

                    Predicted
                    churn    not
Actual    churn     100      50
          not       150      9,700

Accuracy = 98%; Expected value = -0.75

Model 2 confusion matrix:

                    Predicted
                    churn    not
Actual    churn     0        150
          not       0        9,850

Accuracy = 98.5%; Expected value = -1.5

The more accurate model has the worse expected value.
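The same expected-value pattern applied to the churn numbers reproduces the slide's figures:

```python
import numpy as np

cb = np.array([[-10, -100], [-10, 0]])           # churn cost-benefit matrix
model1 = np.array([[100, 50], [150, 9700]])
model2 = np.array([[0, 150], [0, 9850]])
for m in (model1, model2):
    print(((m / m.sum()) * cb).sum())            # -0.75, then -1.5
```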

Problems with Unbalanced Classes

Consider a domain where the classes appear in a 999:1 ratio.

◦ A simple rule—always choose the most prevalent class—gives 99.9% accuracy.

Skews of 1:100 are common in fraud detection.

In churn data, the baseline churn rate is approximately 10% per month.
◦ If we simply classify everyone as negative, we could achieve an accuracy of 90%!

Problems with Unbalanced Classes

Model 1 confusion matrix:

                    Predicted
                    churn    not
Actual    churn     100      50
          not       150      9,700

Accuracy = 98%

Model 2 confusion matrix:

                    Predicted
                    churn    not
Actual    churn     0        150
          not       0        9,850

Accuracy = 98.5%

Model 2 never predicts churn at all, yet it has the higher accuracy.

Other Machine Learning Models

Decision trees

Decision trees

Decision trees are recursive partitioning algorithms (RPAs) that build a tree-like structure representing the patterns in an underlying data set.

[Figure: an example decision tree.]

The top node is the root node:
◦ it specifies a test condition, each outcome of which corresponds to a branch leading to an internal node.

The terminal nodes of the tree assign the classifications and are also referred to as leaf nodes.

[Figure: a tree showing a parent node, child nodes, and leaf nodes labeled "Not Respond" and "Respond".]

Decision trees

Many algorithms have been suggested for constructing decision trees. Amongst the most popular are C4.5, CART, and CHAID.

These algorithms differ in how they answer the key decisions in building a tree:

Splitting decision:
◦ Which variable to split on, and at what value (e.g., age < 30 or not; income < 1,000 or not; marital status = married or not)?

Stopping decision:
◦ When to stop growing the tree?

Assignment decision:
◦ What class (e.g., good or bad customer) to assign to a leaf node?

Decision trees: splitting decision

The splitting decision uses the concept of impurity. Consider three nodes containing good (unfilled circles) and bad (filled circles) customers:

◦ Minimal impurity occurs when all customers are either good or bad.
◦ Maximal impurity occurs when a node has the same number of good and bad customers.

[Figure: three example nodes split on features X1, X2, and X3.]

Decision trees: splitting decision

Decision trees aim at minimizing the impurity in the data. The most popular measures are:

Entropy (C4.5): E(S) = -pG log2(pG) - pB log2(pB)

Gini (CART): Gini(S) = 2 pG pB

with pG and pB being the proportions of class G (good) and class B (bad), respectively.
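A sketch of the two impurity measures as functions of the proportion of good customers in a node:

```python
import math

def entropy(p_g):
    p_b = 1 - p_g
    if p_g in (0, 1):            # a pure node has zero impurity
        return 0.0
    return -p_g * math.log2(p_g) - p_b * math.log2(p_b)

def gini(p_g):
    return 2 * p_g * (1 - p_g)

print(entropy(0.5), gini(0.5))   # 1.0 0.5  (maximal impurity)
print(entropy(1.0), gini(1.0))   # 0.0 0.0  (minimal impurity)
```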

Decision trees: stopping criterion

The tree can learn to fit the specificities or noise in the data, which is referred to as overfitting.

The data should be split into a training sample and a validation sample:
◦ The training sample is used to make the splitting decisions.
◦ The validation sample is an independent sample used to monitor the misclassification error.

Stopping criteria: Spark parameters

o maxDepth
  o Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.

o minInstancesPerNode
  o For a node to be split further, each of its children must receive at least this number of training instances.

o minInfoGain
  o For a node to be split further, the split must improve at least this much (in terms of information gain).
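A hedged sketch of setting these parameters with Spark MLlib's Python API; the values chosen here are arbitrary, and `training_df` stands in for a DataFrame prepared upstream:

```python
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    maxDepth=5,              # cap tree depth to limit overfitting
    minInstancesPerNode=20,  # each child must get at least 20 training rows
    minInfoGain=0.01,        # require at least this information gain to split
)
# model = dt.fit(training_df)   # training_df: a labeled Spark DataFrame
```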

Decision trees: assignment decision

The assignment decision typically looks at the majority class within the leaf node.

[Figure: a leaf node dominated by "Bad" cases is labeled Bad; one dominated by "Good" cases is labeled Good.]

Decision trees: decision boundaries

Decision trees essentially model decision boundaries orthogonal to the axes.

[Figure: the decision boundary of a decision tree, axis-parallel splits partitioning the feature space.]

Decision trees can be used for various purposes in analytics:

Input selection:
◦ attributes that occur at the top of the tree are more predictive of the target.

Initial segmentation:
◦ build a tree two or three levels deep as the segmentation scheme, then use second-stage machine learning models for further refinement.

Final analytical model to be used directly in production:
◦ it gives a white-box model with a clear explanation of how it reaches its classifications.

Model decision boundaries

[Figure: decision boundaries of three model types side by side: decision trees, logistic regression, and neural networks.]

Neural networks

Neural networks are mathematical representations inspired by the functioning of the human brain.

Another, more realistic perspective sees neural networks as generalizations of existing machine learning models.

[Figure: a single neuron with inputs weighted by w0, w1, w2 feeding a transformation function f(.).]

Neural networks vs. linear regression

A single neuron with the identity activation computes a linear regression:

z = θ0 + θ1 Age + θ2 Income
f(z) = z

[Figure: inputs x0, x1 with weights θ0, θ1 feeding f(.); a plot of y against x.]

Neural networks vs. logistic regression

The same neuron with a sigmoid activation computes a logistic regression:

z = θ0 + θ1 Age + θ2 Income
f(z) = 1 / (1 + e^(-z))
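A sketch of this correspondence: the same neuron is linear regression with the identity activation and logistic regression with the sigmoid. The theta values are made up:

```python
import math

def neuron(age, income, thetas, activation):
    z = thetas[0] + thetas[1] * age + thetas[2] * income
    return activation(z)

sigmoid = lambda z: 1 / (1 + math.exp(-z))
identity = lambda z: z

thetas = (-3.0, 0.05, 0.001)                 # hypothetical θ0, θ1, θ2
print(neuron(40, 1500, thetas, sigmoid))     # a probability in (0, 1)
print(neuron(40, 1500, thetas, identity))    # an unbounded linear score
```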

Neural networks: Single Layer Perceptron

A single neuron with inputs x0 = 1, Age (x1), Income (x2), and Gender (x3), and weights:

w0 (bias / intercept) = 1.64252
w1 (Age) = 77.09677
w2 (Income) = -1.69512
w3 (Gender) = -2.99575

Customer   Age (x1)   Income (x2)   Gender (x3)   Response (y)   Output
John       30         1,500         M             No             0
Sarah      31         800           F             Yes            1
Sophie     52         1,800         F             Yes            1
David      48         2,000         M             No             1
Peter      34         1,800         M             Yes            0

Read this way, the final column is the perceptron's output: David and Peter are misclassified.
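A sketch that reproduces the outputs above, assuming gender is encoded as M = 1, F = 0 (an assumption; the slide does not state the encoding) and a threshold activation:

```python
w0, w1, w2, w3 = 1.643, 77.097, -1.695, -2.996   # weights from the slide

def perceptron(age, income, gender):
    z = w0 + w1 * age + w2 * income + w3 * gender
    return 1 if z >= 0 else 0                    # threshold activation

customers = [("John", 30, 1500, 1), ("Sarah", 31, 800, 0),
             ("Sophie", 52, 1800, 0), ("David", 48, 2000, 1),
             ("Peter", 34, 1800, 1)]
for name, age, income, gender in customers:
    print(name, perceptron(age, income, gender))  # 0, 1, 1, 1, 0
```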

Neural networks: Multi-Layer Perceptron (MLP)

[Figure: a network with Layer 1 (input layer), Layer 2 (hidden layer), and Layer 3 (output layer).]

Neural networks: activation functions

Each node has a transformation function f(.), also called an activation function. The most popular activation functions are:

Linear, ranging between -∞ and +∞:
f(z) = z

Sigmoid (logistic), ranging between 0 and 1:
f(z) = 1 / (1 + e^(-z))

Hyperbolic tangent, ranging between -1 and +1:
f(z) = (e^z - e^(-z)) / (e^z + e^(-z))
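The three functions in a few lines, with their ranges noted:

```python
import math

def linear(z):
    return z                                   # range: (-inf, +inf)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))              # range: (0, 1)

def tanh(z):                                   # range: (-1, 1)
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

print(linear(0.5), sigmoid(0.5), tanh(0.5))    # 0.5 0.622... 0.462...
```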

Selecting an activation function

Hidden layer: logistic, hyperbolic tangent, or linear.

Output layer:
◦ For classification targets (e.g., churn, response, fraud), it is common practice to adopt a logistic transformation in the output layer, since the outputs can then be interpreted as probabilities.
◦ For regression targets: linear; or linear, logistic, or hyperbolic tangent for a normalized target.

Model Comparison

Held-out test data: the data is divided into a training set and a test set.
◦ The training set is used for model creation (training and validation).
◦ The test set is held out for model selection.

[Figure: models are created on the training set; their test-set performance determines the selected model.]
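A minimal sketch of such a split; the 80/20 proportion is a common but arbitrary choice:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    rows = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]        # training set, held-out test set

train, test = train_test_split(list(range(100)))
print(len(train), len(test))             # 80 20
```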

Model Comparison

Cross-validation for model comparison:
◦ k-fold cross-validation: the data is split into k folds; each fold serves once as the validation set while the remaining k - 1 folds are used for training, as sketched below.
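A sketch of generating the k-fold train/validation index splits by hand:

```python
def k_fold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin assignment
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in k_fold_indices(10, 5):
    print(sorted(val))   # each index appears in exactly one validation fold
```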

Demo

◦ Data preprocessing
◦ Model training
◦ Model evaluation
◦ Model deployment

Hands-on machine learning using Spark, in class.