
INTRO TO DATA SCIENCE
LECTURE 4: INTRO TO ML & KNN CLASSIFICATION
Francesco Mosconi
DAT10 SF // October 15, 2014

DATA SCIENCE IN THE NEWS

Source: http://f1metrics.wordpress.com/2014/10/03/building-a-race-simulator/

Source: http://www.pyimagesearch.com/2014/10/13/deep-learning-amazon-ec2-gpu-python-nolearn/

LAST TIME: RECAP

‣ Cleaning data
‣ Dealing with missing data
‣ Setting up GitHub for homework

QUESTIONS?

AGENDA

I. WHAT IS MACHINE LEARNING?
II. CLASSIFICATION PROBLEMS
III. BUILDING EFFECTIVE CLASSIFIERS
IV. THE KNN CLASSIFICATION MODEL

EXERCISES:
V. LAB: KNN CLASSIFICATION IN PYTHON
VI. BONUS LAB: VISUALIZATION WITH MATPLOTLIB (IF TIME ALLOWS)

I. WHAT IS MACHINE LEARNING?


WHAT IS MACHINE LEARNING?

from Wikipedia:
“Machine learning, a branch of artificial intelligence, is about the construction and study of systems that can learn from data.”

“The core of machine learning deals with representation and generalization…”
‣ representation – extracting structure from data
‣ generalization – making predictions from data

source: http://en.wikipedia.org/wiki/Machine_learning

REMEMBER THIS?

source: http://www.dataists.com/2010/09/the-data-science-venn-diagram/

WE ARE NOW HERE

WE WANT TO GO HERE

QUESTION: What does it take to make this jump?

ANSWER: PROBLEM SOLVING!

NOTE: Implementing solutions to ML problems is the focus of this course!

THE STRUCTURE OF MACHINE LEARNING PROBLEMS


REMEMBER WHAT WE SAID BEFORE?

‣ Supervised learning – making predictions (generalization)
‣ Unsupervised learning – extracting structure (representation)

TYPES OF LEARNING PROBLEMS - SUPERVISED EXAMPLE

TYPES OF LEARNING PROBLEMS - UNSUPERVISED EXAMPLE

TYPES OF DATA

‣ Continuous (quantitative)
‣ Categorical (qualitative)

NOTE: The space where data live is called the feature space. Each point in this space is called a record.
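As a small illustration (the columns and values here are hypothetical), each row of a DataFrame is one record, and its columns are the continuous or categorical features that define the feature space:

```python
import pandas as pd

# a hypothetical dataset: each row is one record in the feature space
df = pd.DataFrame({
    "height_cm": [150.2, 167.5, 181.0],   # continuous (quantitative)
    "weight_kg": [52.1, 66.0, 80.3],      # continuous (quantitative)
    "species":   ["cat", "dog", "dog"],   # categorical (qualitative)
})
print(df.dtypes)  # float columns are continuous; the string column is categorical
```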

TYPES OF ML SOLUTIONS

                 Continuous            Categorical
Supervised       regression            classification
Unsupervised     dimension reduction   clustering

NOTE: We will implement solutions using models and algorithms. Each will fall into one of these four buckets.

QUESTION: What is the goal of machine learning?

ANSWER: The goal is determined by the type of problem: supervised learning is about making predictions; unsupervised learning is about extracting structure.

QUESTION: How do you determine the right approach?

ANSWER: The right approach is determined by the desired solution: regression, classification, dimension reduction, or clustering.

NOTE: All of this depends on your data!

QUESTION: What do you do with your results?

THE DATA SCIENCE WORKFLOW

source: http://benfry.com/phd/dissertation-110323c.pdf

ANSWER: Interpret them and react accordingly.

NOTE: This also relies on your problem solving skills!

II. CLASSIFICATION PROBLEMS


TYPES OF ML SOLUTIONS

                 Continuous            Categorical
Supervised       regression            classification
Unsupervised     dimension reduction   clustering

CLASSIFICATION PROBLEMS

Here’s (part of) an example dataset:
‣ independent variables
‣ class labels (qualitative)

Q: What does “supervised” mean?
A: We know the class labels.

Q: How does a classification problem work?
A: Data in, predicted labels out.

source: http://www-users.cs.umn.edu/~kumar/dmbook/ch4.pdf

Q: What steps does a classification problem require?
1) split dataset
2) train model
3) test model
4) make predictions

(diagram: dataset → training set + test set → model → predictions on new data)

NOTE: This new data is called out of sample data. We don’t know the labels for these OOS records!
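The four steps above can be sketched with scikit-learn, as in the lab (the iris toy dataset and the choice of classifier are illustrative assumptions, not prescribed by the slides):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# a toy dataset: 150 records, 4 continuous features, 3 class labels
X, y = load_iris(return_X_y=True)

# 1) split dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2) train model
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# 3) test model on the held-out test set
print("test accuracy:", clf.score(X_test, y_test))

# 4) make predictions on a new (out-of-sample) record
new_record = [[5.0, 3.5, 1.5, 0.2]]
print("predicted label:", clf.predict(new_record))
```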

III. BUILDING EFFECTIVE CLASSIFIERS


BUILDING EFFECTIVE CLASSIFIERS

Q: What types of prediction error will we run into?
1) training error
2) generalization error
3) OOS error

NOTE: We want to estimate OOS prediction error so we know what to expect from our model.

TRAINING ERROR

Q: Why should we use training & test sets?

Thought experiment: Suppose instead we train our model using the entire dataset.
Q: How low can we push the training error?
- We can make the model arbitrarily complex (effectively “memorizing” the entire training set).
A: Down to zero!

NOTE: This phenomenon is called overfitting.

OVERFITTING

source: Data Analysis with Open Source Tools, by Philipp K. Janert. O’Reilly Media, 2011.

OVERFITTING - EXAMPLE

source: http://www.dtreg.com

TRAINING ERROR

A: Training error is not a good estimate of OOS accuracy.
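The memorization point can be demonstrated directly: a 1-nearest-neighbor classifier fit on the full dataset effectively stores every record, so its training error drops to zero (a sketch with scikit-learn; the dataset is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 1-NN "memorizes" the data: each training point is its own nearest neighbor,
# so evaluating on the training set just reads the stored labels back out
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("training accuracy:", clf.score(X, y))  # 1.0, i.e. zero training error
```

This perfect score says nothing about how the model will do on out-of-sample records.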

GENERALIZATION ERROR

Suppose we do the train/test split.

Q: How well does generalization error predict OOS accuracy?
Thought experiment: Suppose we had done a different train/test split.
Q: Would the generalization error remain the same?
A: Of course not!

A: On its own, not very well.

NOTE: The generalization error gives a high-variance estimate of OOS accuracy.
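The high variance can be seen by repeating the split with different random seeds (a sketch with scikit-learn; the dataset and seeds are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

scores = []
for seed in range(5):
    # a different random split each time
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    clf = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))

print(scores)  # the estimate typically varies from split to split
```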

Something is still missing!

Q: How can we do better?
Thought experiment: Different train/test splits will give us different generalization errors.
Q: What if we did a bunch of these and took the average?
A: Now you’re talking!

A: Cross-validation.

CROSS-VALIDATION

Steps for n-fold cross-validation:
1) Randomly split the dataset into n equal partitions.
2) Use partition 1 as the test set & the union of the other partitions as the training set.
3) Find the generalization error.
4) Repeat steps 2-3 using a different partition as the test set at each iteration.
5) Take the average generalization error as the estimate of OOS accuracy.

Features of n-fold cross-validation:
1) More accurate estimate of OOS prediction error.
2) More efficient use of data than a single train/test split.
   - Each record in our dataset is used for both training and testing.
3) Presents a tradeoff between efficiency and computational expense.
   - 10-fold CV is 10x more expensive than a single train/test split.
4) Can be used for model selection.
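Feature 4 can be sketched as follows: choose k for a KNN classifier by comparing cross-validated accuracy across candidates (the candidate values and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# model selection: score each candidate k with 10-fold cross-validation
results = {}
for k in [1, 3, 5, 7, 9, 11]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    results[k] = scores.mean()

# keep the k with the best average generalization error
best_k = max(results, key=results.get)
print("best k:", best_k, "with CV accuracy:", results[best_k])
```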

IV. KNN CLASSIFICATION


KNN CLASSIFICATION - BASICS

Suppose we want to predict the color of the grey dot.
1) Pick a value for k (here, k = 3).
2) Find the colors of the k nearest neighbors.
3) Assign the most common color to the grey dot.

OPTIONAL NOTE: Our definition of “nearest” implicitly uses the Euclidean distance function.
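The three steps above can be sketched from scratch in plain Python (the point coordinates, labels, and the knn_predict helper are hypothetical, for illustration only):

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, new_point, k=3):
    """Label new_point by majority vote among its k nearest training points."""
    # Euclidean distance: the implicit definition of "nearest"
    def dist(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    # steps 1-2: sort records by distance to the new point, keep the k closest
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pl: dist(pl[0], new_point))[:k]

    # step 3: assign the most common label among those neighbors
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["blue", "blue", "blue", "orange", "orange", "orange"]
print(knn_predict(points, labels, (2, 2)))    # → blue
print(knn_predict(points, labels, (8.5, 8)))  # → orange
```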

Another example with k = 3: will our new example be blue or orange?

LABS
