Predictive Analytics using Machine Learning · 2017-03-18


Predictive Analytics using Machine Learning
Praisan Padungweang, Ph.D.

Model evaluation


The Confusion Matrix

A confusion matrix shows the number of correct and incorrect decisions made by the model compared to the actual labels (targets) in the data.

For a problem involving n classes, it is an n × n matrix with the rows labeled with actual classes and the columns labeled with predicted classes.

[Layout: an empty 2 × 2 confusion matrix with actual classes T and F as rows and predicted classes T and F as columns, and an empty 3 × 3 confusion matrix for classes a, b, and c.]

The Confusion Matrix

The relationship between classes can be depicted as a 2 × 2 confusion matrix:

◦ True Positive (TP): Correctly classified as the class of interest

◦ True Negative (TN): Correctly classified as not the class of interest

◦ False Positive (FP): Incorrectly classified as the class of interest

◦ False Negative (FN): Incorrectly classified as not the class of interest

                 Predicted
                 T                                 F
Actual    T      True Positive                     False Negative (Type II error)
          F      False Positive (Type I error)    True Negative

Accuracy = (TP + TN) / (TP + FN + FP + TN)
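As a small illustration (not from the slides), the four counts and the accuracy can be computed directly in Python; the toy label lists below are made up:

```python
# Confusion-matrix counts for a binary problem: 1 = positive class, 0 = negative.
def confusion_counts(actual, predicted):
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return tp, tn, fp, fn

tp, tn, fp, fn = confusion_counts([1, 0, 1, 0, 0], [1, 0, 0, 1, 0])
print((tp + tn) / (tp + fn + fp + tn))   # 0.6
```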

Model evaluation for multiple classes

Confusion matrix for k classes (rows: actual status A1..Ak; columns: predicted status P1..Pk):

                       Predicted status
                 P1      P2      P3     ...    Pk
Actual    A1     A1P1    A1P2    A1P3   ...    A1Pk
status    A2     A2P1    A2P2    A2P3   ...    A2Pk
          A3     A3P1    A3P2    A3P3   ...    A3Pk
          ...
          Ak     AkP1    AkP2    AkP3   ...    AkPk

Accuracy = (A1P1 + A2P2 + A3P3 + ... + AkPk) / n

where AiPj counts the instances of actual class Ai predicted as class Pj; the diagonal cells AiPi are the correct classifications, and n is the total number of instances.
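A minimal sketch of the same computation: multi-class accuracy is the sum of the diagonal divided by n. The 3 × 3 matrix below is a made-up example:

```python
import numpy as np

cm = np.array([[50, 3, 2],    # rows: actual classes
               [4, 60, 6],    # columns: predicted classes
               [1, 5, 69]])
accuracy = np.trace(cm) / cm.sum()   # diagonal = correct classifications
print(accuracy)   # 0.895
```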


Model evaluation for binary classes

          Churn   Predicted
1 John     Yes     0.72
2 Sophie   No      0.56
3 David    Yes     0.44
4 Emma     No      0.18
5 Bob      No      0.36

With a cutoff of 0.5 on the predicted churn probabilities, the confusion matrix is:

                       Predicted status
                       churn          no churn
Actual    churn        1 (John)       1 (David)
status    no churn     1 (Sophie)     2 (Emma, Bob)

Accuracy = (TP + TN) / n = (1 + 2) / 5 = 0.6
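A quick check of this example in Python; the 0.5 cutoff is the one implied by the slide's matrix:

```python
# (actual churn, predicted probability of churn) for the five customers
customers = {"John": (1, 0.72), "Sophie": (0, 0.56), "David": (1, 0.44),
             "Emma": (0, 0.18), "Bob": (0, 0.36)}

correct = sum((p >= 0.5) == bool(y) for y, p in customers.values())
print(correct / len(customers))   # 0.6
```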


Accuracy = (TP + TN) / N

Records sorted by the predicted probability of class "1" (24 records, shown in two columns):

Actual class   Prob. of "1"      Actual class   Prob. of "1"
1              0.996             1              0.506
1              0.988             0              0.471
1              0.984             0              0.337
1              0.980             1              0.218
1              0.948             0              0.199
1              0.889             0              0.149
1              0.848             0              0.048
0              0.762             0              0.038
1              0.707             0              0.025
1              0.681             0              0.022
1              0.656             0              0.016
0              0.622             0              0.004

[Slide exercise: the 2 × 2 confusion matrix (actual 1/0 vs. predicted 1/0) is left blank, to be filled in for a chosen probability cutoff.]
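A sketch of how the cutoff choice plays out on these 24 records; the three cutoff values below are arbitrary:

```python
data = [(1, .996), (1, .988), (1, .984), (1, .980), (1, .948), (1, .889),
        (1, .848), (0, .762), (1, .707), (1, .681), (1, .656), (0, .622),
        (1, .506), (0, .471), (0, .337), (1, .218), (0, .199), (0, .149),
        (0, .048), (0, .038), (0, .025), (0, .022), (0, .016), (0, .004)]

for cutoff in (0.25, 0.50, 0.75):
    acc = sum((p >= cutoff) == bool(y) for y, p in data) / len(data)
    print(cutoff, acc)   # e.g. a 0.50 cutoff gives 21/24 = 0.875
```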

Other Evaluation Metrics

There are other evaluation metrics that can be calculated from the confusion matrix:

◦ Sensitivity and specificity

◦ Precision and Recall

◦ F-measure


Sensitivity and specificity


                 Predicted
                 T                                 F
Actual    T      True Positive                     False Negative (Type II error)
          F      False Positive (Type I error)    True Negative

True positive rate, Sensitivity, Recall = TP / (TP + FN)

True negative rate, Specificity = TN / (FP + TN)

Positive predictive value, Precision = TP / (TP + FP)

Accuracy = (TP + TN) / (TP + FN + FP + TN)

F-score = (2 × Precision × Recall) / (Precision + Recall)
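The formulas above, collected into one small helper (a sketch; divisions by zero are not guarded):

```python
def metrics(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)        # true positive rate, recall
    specificity = tn / (fp + tn)        # true negative rate
    precision = tp / (tp + fp)          # positive predictive value
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, precision, accuracy, f_score
```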

For example, in a spam filtering problem:
◦ a sensitivity of 0.842 implies that 84.2 percent of spam messages were correctly classified.
◦ a specificity of 0.996 implies that 99.6 percent of non-spam messages were correctly classified, or alternatively, that 0.4 percent of valid messages were rejected as spam.

The idea of rejecting 0.4 percent of valid email messages may be unacceptable.


Precision and recall

Positive predictive value, Precision = TP / (TP + FP)
◦ When a model predicts the positive class, how often is it correct?
◦ A precise model will only predict the positive class in cases very likely to be positive; its positive predictions are very trustworthy.

True positive rate, Sensitivity, Recall = TP / (TP + FN)
◦ A model with high recall captures a large portion of the positive examples. For example, a search engine with high recall returns a large number of documents pertinent to the search query.

Having both high precision and high recall at the same time is very challenging.

F-measure

A measure of model performance that combines precision and recall into a single number is known as the F-measure (also sometimes called the F1 score or the F-score):

F1 = (2 × Precision × Recall) / (Precision + Recall)

Since the F-measure reduces model performance to a single number, it provides a convenient way to compare several models side by side.

Problems with Unequal Costs and Benefits

Accuracy makes no distinction between false positive and false negative errors.

◦ It makes the tacit assumption that both errors are equally important.

◦ In real-world domains this is rarely the case.

These two errors are very different, should be counted separately, and should have different costs.

For example, in cancer screening:
◦ False positive (model: cancer; actual: not): the patient would be given further tests, which are expensive, inconvenient, and stressful.
◦ False negative (model: not; actual: cancer): do nothing, and the disease goes untreated!

A Key Analytical Framework: Expected Value

The general form of an expected value calculation:

EV = p(o1)·v(o1) + p(o2)·v(o2) + ... = Σi p(oi)·v(oi)

◦ oi is a possible decision outcome
◦ p(oi) is its probability
◦ v(oi) is its business value

The probabilities often can be estimated from the data.

The business values often need to be acquired from other sources:
◦ usually the values must come from external domain knowledge.
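A minimal sketch of the calculation, with made-up probabilities and values for a hypothetical two-outcome decision:

```python
# (p(o_i), v(o_i)) pairs for a hypothetical decision with two outcomes
outcomes = [(0.3, 50.0), (0.7, -2.0)]
ev = sum(p * v for p, v in outcomes)
print(ev)   # 0.3*50 - 0.7*2 = 13.6
```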

Expected Value for Model Evaluation

In targeted marketing, for example, each consumer needs to be classified as a likely responder versus not a likely responder, so that we can target the likely responders.

Cost/profit:
◦ A consumer buys the product for $200, and our product-related costs are $100.
◦ We mail some marketing materials, and the overall cost including postage is $1.

Yielding:
◦ a value (profit) of $99 if the consumer responds (buys the product);
◦ a cost of $1, or equivalently a benefit of -$1, if the consumer does not respond.

Cost-benefit matrix:

                  Predicted
                  R        N
Actual    R       99       0
          N       -1       0

Expected Value for Model Evaluation: targeted marketing

Model confusion matrix (counts, n = 2,000):

                  Predicted
                  R        N
Actual    R       150      150
          N       200      1,500

Accuracy = (150 + 1,500) / 2,000 = 82.5%

Dividing the counts by 2,000 gives the outcome probabilities:

                  Predicted
                  R        N
Actual    R       0.075    0.075
          N       0.100    0.750

Multiplying each probability by the corresponding cost-benefit entry:

                  Predicted
                  R        N
Actual    R       7.425    0
          N       -0.100   0

Expected value = 7.425 - 0.100 = 7.325

Expected Value for Model Evaluation: targeted marketing (a model that targets no one)

Model confusion matrix (counts, n = 2,000):

                  Predicted
                  R        N
Actual    R       0        300
          N       0        1,700

Accuracy = 1,700 / 2,000 = 85%

Probabilities:

                  Predicted
                  R        N
Actual    R       0        0.15
          N       0        0.85

With the same cost-benefit matrix, every product term is 0:

Expected value = 0

Despite its higher accuracy (85% vs. 82.5%), this model has an expected value of 0, while the first model's is 7.325.
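The whole calculation fits in a few lines; a sketch using NumPy with the matrices from the two slides above:

```python
import numpy as np

def expected_value(confusion, cost_benefit):
    probs = confusion / confusion.sum()          # p(o_i) for each cell
    return float((probs * cost_benefit).sum())   # sum of p(o_i) * v(o_i)

cb = np.array([[99, 0], [-1, 0]])
model = np.array([[150, 150], [200, 1500]])
target_no_one = np.array([[0, 300], [0, 1700]])
print(expected_value(model, cb))          # 7.325
print(expected_value(target_no_one, cb))  # 0.0
```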

Expected Value for Model Evaluation: churn prediction

Cost-benefit matrix:

                    Predicted
                    churn    not
Actual    churn     -10      -100
          not       -10      0

Model 1 confusion matrix (counts, n = 10,000):

                    Predicted
                    churn    not
Actual    churn     100      50
          not       150      9,700

Accuracy = 98%; Expected value = -0.75

Model 2 confusion matrix:

                    Predicted
                    churn    not
Actual    churn     0        150
          not       0        9,850

Accuracy = 98.5%; Expected value = -1.5

The more accurate model has the worse expected value.
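The same expected-value pattern applied to the churn numbers reproduces the slide's figures:

```python
import numpy as np

cb = np.array([[-10, -100], [-10, 0]])           # churn cost-benefit matrix
model1 = np.array([[100, 50], [150, 9700]])
model2 = np.array([[0, 150], [0, 9850]])
for m in (model1, model2):
    print(((m / m.sum()) * cb).sum())            # -0.75, then -1.5
```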

Problems with Unbalanced Classes

Consider a domain where the classes appear in a 999:1 ratio.

◦ A simple rule—always choose the most prevalent class—gives 99.9% accuracy.

Skews of 1:100 are common in fraud detection.

In churn data, the baseline churn rate is approximately 10% per month.
◦ If we simply classify everyone as negative, we could achieve an accuracy of 90%!

Problems with Unbalanced Classes

Model 1 confusion matrix:

                    Predicted
                    churn    not
Actual    churn     100      50
          not       150      9,700

Accuracy = 98%

Model 2 confusion matrix:

                    Predicted
                    churn    not
Actual    churn     0        150
          not       0        9,850

Accuracy = 98.5%

Model 2 never predicts churn at all, yet it has the higher accuracy.

Other Machine Learning Models

Decision trees

Decision trees

Decision trees are recursive partitioning algorithms (RPAs) that build a tree-like structure representing the patterns in an underlying data set.

[Figure: an example decision tree.]

The top node is the root node:
◦ it specifies a test condition, each outcome of which corresponds to a branch leading to an internal node.

The terminal nodes of the tree assign the classifications and are also referred to as leaf nodes.

[Figure: a tree showing a parent node, child nodes, and leaf nodes labeled "Not Respond" and "Respond".]

Decision trees

Many algorithms have been suggested for constructing decision trees. Amongst the most popular are C4.5, CART, and CHAID.

These algorithms differ in how they answer the key decisions in building a tree:

Splitting decision:
◦ Which variable to split on, and at what value (e.g., age < 30 or not; income < 1,000 or not; marital status = married or not)?

Stopping decision:
◦ When to stop growing the tree?

Assignment decision:
◦ What class (e.g., good or bad customer) to assign to a leaf node?

Decision trees: splitting decision

The splitting decision uses the concept of impurity. Consider three nodes containing good (unfilled circles) and bad (filled circles) customers:

◦ Minimal impurity occurs when all customers are either good or bad.
◦ Maximal impurity occurs when a node has the same number of good and bad customers.

[Figure: three example nodes split on features X1, X2, and X3.]

Decision trees: splitting decision

Decision trees aim at minimizing the impurity in the data. The most popular measures are:

Entropy (C4.5): E(S) = -pG log2(pG) - pB log2(pB)

Gini (CART): Gini(S) = 2 pG pB

with pG and pB being the proportions of class G (good) and class B (bad), respectively.
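A sketch of the two impurity measures as functions of the proportion of good customers in a node:

```python
import math

def entropy(p_g):
    p_b = 1 - p_g
    if p_g in (0, 1):            # a pure node has zero impurity
        return 0.0
    return -p_g * math.log2(p_g) - p_b * math.log2(p_b)

def gini(p_g):
    return 2 * p_g * (1 - p_g)

print(entropy(0.5), gini(0.5))   # 1.0 0.5  (maximal impurity)
print(entropy(1.0), gini(1.0))   # 0.0 0.0  (minimal impurity)
```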

Decision trees: stopping criterion

The tree can learn to fit the specificities or noise in the data, which is referred to as overfitting.

The data should be split into a training sample and a validation sample:
◦ The training sample is used to make the splitting decisions.
◦ The validation sample is an independent sample used to monitor the misclassification error.

Stopping criteria: Spark parameters

o maxDepth
  o Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.

o minInstancesPerNode
  o For a node to be split further, each of its children must receive at least this number of training instances.

o minInfoGain
  o For a node to be split further, the split must improve at least this much (in terms of information gain).
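A hedged sketch of setting these parameters with Spark MLlib's Python API; the values chosen here are arbitrary, and `training_df` stands in for a DataFrame prepared upstream:

```python
from pyspark.ml.classification import DecisionTreeClassifier

dt = DecisionTreeClassifier(
    maxDepth=5,              # cap tree depth to limit overfitting
    minInstancesPerNode=20,  # each child must get at least 20 training rows
    minInfoGain=0.01,        # require at least this information gain to split
)
# model = dt.fit(training_df)   # training_df: a labeled Spark DataFrame
```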

Decision trees: assignment decision

The assignment decision typically looks at the majority class within the leaf node.

[Figure: a leaf node dominated by "Bad" cases is labeled Bad; one dominated by "Good" cases is labeled Good.]

Decision trees: decision boundaries

Decision trees essentially model decision boundaries orthogonal to the axes.

[Figure: the decision boundary of a decision tree, axis-parallel splits partitioning the feature space.]

Decision trees can be used for various purposes in analytics:

Input selection:
◦ attributes that occur at the top of the tree are more predictive of the target.

Initial segmentation:
◦ build a tree two or three levels deep as the segmentation scheme, then use second-stage machine learning models for further refinement.

Final analytical model to be used directly in production:
◦ it gives a white-box model with a clear explanation of how it reaches its classifications.

Model decision boundaries

[Figure: decision boundaries of three model types side by side: decision trees, logistic regression, and neural networks.]

Neural networks

Neural networks are mathematical representations inspired by the functioning of the human brain.

Another, more realistic perspective sees neural networks as generalizations of existing machine learning models.

[Figure: a single neuron with inputs weighted by w0, w1, w2 feeding a transformation function f(.).]

Neural networks vs. linear regression

A single neuron with the identity activation computes a linear regression:

z = θ0 + θ1 Age + θ2 Income
f(z) = z

[Figure: inputs x0, x1 with weights θ0, θ1 feeding f(.); a plot of y against x.]

Neural networks vs. logistic regression

The same neuron with a sigmoid activation computes a logistic regression:

z = θ0 + θ1 Age + θ2 Income
f(z) = 1 / (1 + e^(-z))
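A sketch of this correspondence: the same neuron is linear regression with the identity activation and logistic regression with the sigmoid. The theta values are made up:

```python
import math

def neuron(age, income, thetas, activation):
    z = thetas[0] + thetas[1] * age + thetas[2] * income
    return activation(z)

sigmoid = lambda z: 1 / (1 + math.exp(-z))
identity = lambda z: z

thetas = (-3.0, 0.05, 0.001)                 # hypothetical θ0, θ1, θ2
print(neuron(40, 1500, thetas, sigmoid))     # a probability in (0, 1)
print(neuron(40, 1500, thetas, identity))    # an unbounded linear score
```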

Neural networks: Single Layer Perceptron

A single neuron with inputs x0 = 1, Age (x1), Income (x2), and Gender (x3), and weights:

w0 (bias / intercept) = 1.64252
w1 (Age) = 77.09677
w2 (Income) = -1.69512
w3 (Gender) = -2.99575

Customer   Age (x1)   Income (x2)   Gender (x3)   Response (y)   Output
John       30         1,500         M             No             0
Sarah      31         800           F             Yes            1
Sophie     52         1,800         F             Yes            1
David      48         2,000         M             No             1
Peter      34         1,800         M             Yes            0

Read this way, the final column is the perceptron's output: David and Peter are misclassified.
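A sketch that reproduces the outputs above, assuming gender is encoded as M = 1, F = 0 (an assumption; the slide does not state the encoding) and a threshold activation:

```python
w0, w1, w2, w3 = 1.643, 77.097, -1.695, -2.996   # weights from the slide

def perceptron(age, income, gender):
    z = w0 + w1 * age + w2 * income + w3 * gender
    return 1 if z >= 0 else 0                    # threshold activation

customers = [("John", 30, 1500, 1), ("Sarah", 31, 800, 0),
             ("Sophie", 52, 1800, 0), ("David", 48, 2000, 1),
             ("Peter", 34, 1800, 1)]
for name, age, income, gender in customers:
    print(name, perceptron(age, income, gender))  # 0, 1, 1, 1, 0
```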

Neural networks: Multi-Layer Perceptron (MLP)

[Figure: a network with Layer 1 (input layer), Layer 2 (hidden layer), and Layer 3 (output layer).]

Neural networks: activation functions

Each node has a transformation function f(.), also called an activation function. The most popular activation functions are:

Linear, ranging between -∞ and +∞:
f(z) = z

Sigmoid (logistic), ranging between 0 and 1:
f(z) = 1 / (1 + e^(-z))

Hyperbolic tangent, ranging between -1 and +1:
f(z) = (e^z - e^(-z)) / (e^z + e^(-z))
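The three functions in a few lines, with their ranges noted:

```python
import math

def linear(z):
    return z                                   # range: (-inf, +inf)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))              # range: (0, 1)

def tanh(z):                                   # range: (-1, 1)
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

print(linear(0.5), sigmoid(0.5), tanh(0.5))    # 0.5 0.622... 0.462...
```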

Selecting an activation function

Hidden layer: logistic, hyperbolic tangent, or linear.

Output layer:
◦ For classification targets (e.g., churn, response, fraud), it is common practice to adopt a logistic transformation in the output layer, since the outputs can then be interpreted as probabilities.
◦ For regression targets: linear; or linear, logistic, or hyperbolic tangent for a normalized target.

Model Comparison

Held-out test data: the data is divided into a training set and a test set.
◦ The training set is used for model creation (training and validation).
◦ The test set is held out for model selection.

[Figure: models are created on the training set; their test-set performance determines the selected model.]
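A minimal sketch of such a split; the 80/20 proportion is a common but arbitrary choice:

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    rows = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]        # training set, held-out test set

train, test = train_test_split(list(range(100)))
print(len(train), len(test))             # 80 20
```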

Model Comparison

Cross-validation for model comparison:
◦ k-fold cross-validation: the data is split into k folds; each fold serves once as the validation set while the remaining k - 1 folds are used for training, as sketched below.
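A sketch of generating the k-fold train/validation index splits by hand:

```python
def k_fold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin assignment
    for i in range(k):
        val = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

for train, val in k_fold_indices(10, 5):
    print(sorted(val))   # each index appears in exactly one validation fold
```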

Demo

◦ Data preprocessing
◦ Model training
◦ Model evaluation
◦ Model deployment

Hands-on machine learning using Spark, in class.