Machine Learning 101 - Amazon Web Services (aws-de-media.s3.amazonaws.com/images/Webinar/2016...)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Michael Brückner

Manager Machine Learning

25/02/2016

Machine Learning 101


Agenda

• What is Machine Learning and why do we need it?

• Model Building

• Model Evaluation & Tuning


What is Machine Learning?

Methods and Systems that …

• Adapt based on recorded data

• Predict new data based on recorded data

• Optimize an action given a utility function

• Extract hidden structure from the data

• Summarize data into concise descriptions


What is Machine Learning NOT?

Methods and Systems that …

• can yield Garbage-In Knowledge-Out

• perform well without data modeling & feature engineering

• avoid the curse of dimensionality

• are a replacement for business rules


Infer-Predict-Decide Cycle

Inference

Build & evaluate Predictor

Prediction

Apply the learned Predictor

Decision Making

Adjust Business loss and get new/more data


What for?

Automate tasks that typically require humans, in order to

• scale

• improve over humans (non-experts)

• preserve privacy

or solve tasks that are impossible for humans


Examples: Personalized Recommendation

• Input:


Examples: Personalized Recommendation

• Output:


Examples: Face Detection & Recognition

Face detection

• Input: image

• Output: face position

Face recognition

• Input: face (image & face position)

• Output: person’s name


Examples: Full-Text Translation

• Input: text in one language

• Output: text in another language


Examples: Spam Filtering

• Input: email (text, images, …)

• Output: spam/non-spam flag

• Challenges:

• extremely high precision for legitimate emails

• spam changes constantly

• noisy ground truth


Supervised Machine Learning

1. Model problem in terms of input data and output data

2. Collect sample of input-output pairs

3. Learn a mapping that produces the output given the input

4. Apply this mapping to new inputs to make predictions


A Programmer’s Perspective

Traditional Programming (Predicting): Input Data + Mapping → Computer → Output Data

Supervised Machine Learning: Input Data + Output Data → Computer → Mapping


Advantages

• Use data instead of intuition to derive the mapping

• Can solve very complex tasks

• Can adapt to new situations (collect more data)

• Does not require much expert knowledge


Input Data

Description Type Cost Actual Cost Diff In Catalogue

Movies Entertainment $50 $28 $22 Yes

Music (CDs, MP3s, etc.) $500 $30 $470 No

Sporting Events Entertainment $0 $40 ($40) No

Dining Out Food $1,000 $1,200 ($200) Yes

Groceries $100 $0 $100 Yes

Charity 1 Gifts and Charity $200 $200 $0 No

Charity 2 $500 $500 $0 No

Cable/Satellite Housing $100 $100 $0 Yes

Electric Housing $45 $40 $5 Yes

Mortgage or Rent $700 $700 $0 Yes

Health Insurance $400 $400 $0 Yes

Home Insurance $400 $400 $0 No

Credit Card 1 $0 Yes

(Figure labels on the table above: Dataset; Attribute; Attribute Name; Attribute Value; Text Data; Categorical Data; Numerical Data; Binary Data; Missing Data)


Description Type Cost Actual Cost Diff In Catalogue

Movies Entertainment $50 $28 $22 Yes

Music (CDs, MP3s, etc.) ? $500 $30 $470 No

Sporting Events Entertainment $0 $40 ($40) No

Dining Out Food $1,000 $1,200 ($200) Yes

Groceries ? $100 $0 $100 Yes

Charity 1 Gifts and Charity $200 $200 $0 No

Charity 2 ? $500 $500 $0 No

Cable/Satellite Housing $100 $100 $0 Yes

Electric Housing $45 $40 $5 Yes

Mortgage or Rent ? $700 $700 $0 Yes

Health Insurance $400 $400 $0 Yes

Home Insurance $400 $400 $0 No

Credit Card 1 ? $0 Yes

Output Data

Target Attribute Values

Target Attribute


Agenda


• Model Building



Problem Setting

• Input: vector of observable attributes, x

• Output: target attribute value, y

• Training data: pairs of input and corresponding output, $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$

• Application data: inputs only

• Goal: learn mapping $f_w: x \mapsto y$ (the Predictor)


Challenges in Model Building

• Which function class for Predictor (data modeling)?

• How to pre-process the data (feature engineering)?

• How to learn this Predictor from our training data?

• How to generalize to new data?


Which function class for Predictor?

Types of prediction tasks (output type):

• Binary Classification ⇒ binary target $y \in \{-1, +1\}$

• Multinomial Classification ⇒ categorical target $y \in \{1, \dots, K\}$

• Regression ⇒ numeric target $y \in [l, u] \subset \mathbb{R}$


Which function class for Binary Classification?

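• Decision Tree

(Figure: decision tree with the tests x2 > 7?, x1 < 3?, x2 < 5? and x1 < 1?, each with a no and a yes branch, and leaves labeled + or -; next to it, the (x1, x2) plane with + and - points and thresholds at x1 = 1, 3 and x2 = 5, 7.)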



• Decision Tree

(Figure: decision tree with the tests x2 > 7?, x1 < 3?, x2 < 5? and x1 < 1?, each with a no and a yes branch, and leaves labeled + or -; axes x1 and x2.)
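A decision tree Predictor can be read as nested threshold tests on single attributes. The sketch below is only an illustration: it reuses the four tests named on the slide, but the way they are nested and the leaf labels are assumptions, since the original figure layout is not available here.

```python
def tree_predict(x1, x2):
    """Return +1 or -1 by walking a small, hand-written decision tree.

    The nesting order and leaf labels are hypothetical; only the four
    threshold tests themselves are taken from the slide.
    """
    if x2 > 7:
        return 1
    if x1 < 3:
        # Left part of the plane: split once more on x1.
        return 1 if x1 < 1 else -1
    # Right part of the plane: split on x2.
    return 1 if x2 < 5 else -1
```

Each leaf corresponds to an axis-parallel rectangle in the (x1, x2) plane, which is why decision trees produce box-shaped decision regions.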



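• Linear function

• binary target attribute values $y \in \{-1, +1\}$

$\hat{y}(x) = \operatorname{sign}(f_w(x))$

$H_w = \{x \mid f_w(x) = x^T w + w_0 = 0\}$

(Figure: hyperplane $H_w$ separating + from - in the (x1, x2) plane.)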
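A minimal sketch of the linear Predictor, assuming inputs and weights are plain Python lists of equal length. The function names are illustrative, and mapping a score of exactly zero to +1 is an arbitrary choice made here.

```python
def linear_score(w, w0, x):
    """Compute f_w(x) = x^T w + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def linear_predict(w, w0, x):
    """Return sign(f_w(x)) as +1 or -1 (a zero score is mapped to +1 here)."""
    return 1 if linear_score(w, w0, x) >= 0 else -1


# Example weights chosen by hand, not learned.
w, w0 = [1.0, -1.0], 0.5
```

Points with `linear_score(w, w0, x) == 0` lie exactly on the separating hyperplane.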



• Generalized linear function

(Kernel methods)

• Layered Generalized linear

function (Neural Networks)

• Ensemble of functions

• …

x2

x1

+

- +

-


How to pre-process the data?

• Predictor’s function class defined for limited input domain

⇒ transform/extract attributes first (pre-processing)

• Number to (normalized) Number:

• z-standardization, min-max normalization

• Number to Category:

• Binning (quantile, equidistant)

• Category to (numeric) Vector:

• One-hot encoding
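The transformations listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not a production implementation: the function names are made up for this example, and the numeric helpers assume the input values are not all identical (otherwise they would divide by zero).

```python
from statistics import mean, pstdev


def z_standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_normalize(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def equidistant_bin(values, n_bins):
    """Map each number to the index of one of n_bins equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum lies on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def one_hot(categories):
    """Encode each category as a 0/1 vector with a single 1."""
    vocab = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocab)}
    vectors = [[1 if index[c] == j else 0 for j in range(len(vocab))]
               for c in categories]
    return vocab, vectors


costs = [50.0, 500.0, 0.0, 1000.0, 100.0]
types = ["Entertainment", "Food", "Housing", "Food"]
```

For example, `equidistant_bin(costs, 4)` uses bins of width 250, and `one_hot(types)` returns the sorted category list together with one vector per input value.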


How to pre-process the data?

• Predictor’s function class defined for limited input domain

⇒ transform/extract attributes first (pre-processing)

• Text to (numeric) Vector:

• Normalization, tokenization, stemming

• Bag-of-Words, Bag-of-NGrams, TF-IDF ⇒ sparse vector

• Latent word embedding (LSI, word2vec, LDA) ⇒ dense vector

• Image to (numeric) Vector:

• HoG, DAISY, color histogram
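For the text case, a rough sketch of tokenization, Bag-of-Words and TF-IDF follows. It assumes one common TF-IDF variant (raw term count times log(N / document frequency)); other weightings exist, and stemming is left out. Sparse vectors are represented as dictionaries from term to weight, and all names are illustrative.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case the text, replace punctuation by spaces, split on whitespace."""
    cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in text)
    return cleaned.split()


def bag_of_words(docs):
    """Return the vocabulary and one sparse term-count mapping per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    return vocab, counts


def tf_idf(docs):
    """Weight each term count by log(N / number of documents containing the term)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once per document.
    doc_freq = Counter(t for c in counts for t in c)
    idf = {t: math.log(n_docs / doc_freq[t]) for t in vocab}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


docs = ["Cheap pills, cheap offers!", "Meeting notes for the offers team"]
weights = tf_idf(docs)
```

A term that occurs in every document (here "offers") receives weight 0, while a term repeated within a single document (here "cheap") is weighted up.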


How to learn a Predictor?

• Loss of Predictor $f_w: x \mapsto y$ for a given input-output pair:

$L(y, f_w(x))$, where $L$ is the loss function, $y$ the ground truth and $f_w(x)$ the prediction



Loss functions for binary classification (target $y \in \{-1, +1\}$):



Function Class                         Loss Function     Learning Algorithm
Decision Trees                         0/1 loss          ID3
Decision Trees                         Quadratic loss    CART
Linear function                        Quadratic loss    Least-squares regression
Linear function                        Logistic loss     Logistic regression
Linear function                        Hinge loss        Support Vector Machines
Layered Generalized Linear function    Logistic loss     Neural Networks (Binary Classification)
Layered Generalized Linear function    Quadratic loss    Neural Networks (Regression)
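The four loss functions named in the table can be written down directly for a target y in {-1, +1} and a real-valued score f_w(x). The sketch below uses their standard textbook definitions; the function names are chosen for this example.

```python
import math


def zero_one_loss(y, score):
    """1 if the sign of the score disagrees with y (or the score is 0), else 0."""
    return 0.0 if y * score > 0 else 1.0


def quadratic_loss(y, score):
    """Squared difference between target and score."""
    return (y - score) ** 2


def logistic_loss(y, score):
    """log(1 + exp(-y * score)); may overflow for very negative y * score."""
    return math.log(1.0 + math.exp(-y * score))


def hinge_loss(y, score):
    """Zero once the margin y * score reaches 1, linear below that."""
    return max(0.0, 1.0 - y * score)
```

The 0/1 loss counts mistakes but is not differentiable, which is one reason the smooth or convex surrogates (quadratic, logistic, hinge) are used for learning.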



• Theoretical Risk: average over all possible data

• Empirical Risk: average over training data
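In symbols, the two risks are commonly written as follows. This is the standard textbook form rather than a transcription of the slide; the names $R$ and $\hat{R}$ are chosen here, while $L$, $f_w$ and the training pairs $(x_i, y_i)$, $i = 1, \dots, N$, follow the notation used earlier.

```latex
% Theoretical risk: expected loss over the (unknown) data distribution
R(w) = \mathbb{E}_{(x, y)}\left[ L\big(y, f_w(x)\big) \right]

% Empirical risk: average loss over the N training pairs
\hat{R}(w) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f_w(x_i)\big)
```

The theoretical risk cannot be computed because the data distribution is unknown, so learning minimizes the empirical risk as a proxy.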



• Prediction $\hat{y}(x)$ depends on Predictor $f_w$ with model parameters $w$

• Minimize Risk w.r.t. those model parameters $w$ ⇒ mathematical Optimisation Problem

• Gradient-based first or second-order methods

• Coordinate-descent methods

• (Greedy) Search
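As an illustration of a gradient-based first-order method, the sketch below minimizes the empirical logistic loss of a linear Predictor with plain batch gradient descent. The toy data, learning rate and step count are assumptions made for this example, not values from the talk.

```python
import math


def score(w, w0, x):
    """Linear Predictor f_w(x) = x^T w + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def empirical_risk(w, w0, data):
    """Average logistic loss over all (x, y) pairs, with y in {-1, +1}."""
    return sum(math.log(1.0 + math.exp(-y * score(w, w0, x)))
               for x, y in data) / len(data)


def gradient_descent(data, learning_rate=0.5, n_steps=200):
    """Start at zero and repeatedly step against the gradient of the risk."""
    dim = len(data[0][0])
    w, w0 = [0.0] * dim, 0.0
    for _ in range(n_steps):
        grad_w, grad_w0 = [0.0] * dim, 0.0
        for x, y in data:
            # Derivative of log(1 + exp(-y s)) with respect to the score s.
            coeff = -y / (1.0 + math.exp(y * score(w, w0, x)))
            for j in range(dim):
                grad_w[j] += coeff * x[j] / len(data)
            grad_w0 += coeff / len(data)
        w = [wj - learning_rate * gj for wj, gj in zip(w, grad_w)]
        w0 -= learning_rate * grad_w0
    return w, w0


# Hypothetical, linearly separable toy data.
data = [([1.0, 2.0], 1), ([2.0, 3.0], 1), ([-1.0, -1.5], -1), ([-2.0, -1.0], -1)]
w, w0 = gradient_descent(data)
```

Comparing `empirical_risk(w, w0, data)` with the value at the all-zero start (log 2) shows how far the optimisation has reduced the risk.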


How to generalize to new data?

(Figure: Error plotted against Model Complexity.)


How to generalize to new data?

• Empirical Risk: average loss over the training data

• Structural Risk: Empirical Risk plus a Regularizer
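One common way to write the structural (regularized) risk is shown below. The trade-off weight $\lambda$ and the regularizer symbol $\Omega$ are names introduced here for illustration, and the squared L2 norm is just one frequently used choice of regularizer.

```latex
\hat{R}_{\mathrm{reg}}(w)
  = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, f_w(x_i)\big) + \lambda \, \Omega(w),
\qquad \text{for example } \Omega(w) = \lVert w \rVert_2^2
```

A larger $\lambda$ penalizes complex models more strongly, which is why the amount of regularization is typically treated as a hyper-parameter to be tuned.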


Agenda


• Model Evaluation & Tuning



Performance for Binary Classification

Total number of data points (N)    True Target: positive    True Target: negative
Predicted Target: positive         True Positive            False Positive
Predicted Target: negative         False Negative           True Negative



• Accuracy: $\frac{TP + TN}{N}$

• Recall (true positive rate): $\frac{TP}{TP + FN}$

• Precision: $\frac{TP}{TP + FP}$

• Fall-out (false positive rate): $\frac{FP}{TN + FP}$
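The four metrics follow directly from the confusion-matrix counts. A small sketch, assuming labels are encoded as +1 and -1 and that no denominator is zero; the function names and example labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for labels in {-1, +1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, recall, precision and fall-out as a dictionary."""
    n = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / n,
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "fall_out": fp / (tn + fp),
    }


y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
result = classification_metrics(*confusion_counts(y_true, y_pred))
```

With these example labels there are 2 true positives, 1 false positive, 1 false negative and 4 true negatives.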



Decision function: $\hat{y}(x) = \operatorname{sign}(f_w(x) + b)$, with Predictor $f_w$ and decision threshold $b$

AUC (Area Under ROC Curve)
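A short sketch of both ideas, assuming labels in {-1, +1} and real-valued Predictor scores. The AUC is computed through its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as one half. This quadratic-time version is meant for illustration only, and the names are invented for this example.

```python
def decide(score, b):
    """Decision function sign(f_w(x) + b), returning +1 or -1."""
    return 1 if score + b > 0 else -1


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, -1, 1, -1, -1]
```

Unlike accuracy or precision, the AUC does not depend on a particular choice of `b`; it summarizes the ranking quality over all thresholds.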


Training vs. Test Performance

How do we know that a Predictor works well on new data?

Small error on training data ≠ small error on new data (test data)!


Hold-out Evaluation

• Put some data aside before training = test data

• Use this hold-out data for evaluation

• Disadvantages:

• What if we were (un)lucky when choosing the hold-out data?

• We do NOT use all the data for model training!
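A minimal hold-out split might look as follows. The 20% test fraction and the fixed random seed are arbitrary choices for this sketch, and the function name is illustrative.

```python
import random


def hold_out_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and set a fraction aside as test data."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    # The first n_test shuffled items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


examples = list(range(10))
train_part, test_part = hold_out_split(examples)
```

Changing `seed` gives a different split, which makes the "(un)lucky" disadvantage visible: the measured performance can vary noticeably from one split to the next.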


K-Fold Cross Validation-based Evaluation

• Split data into K partitions (folds)

• Take all but one partition to train a Predictor

• Evaluate Predictor on the left-out partition

• Repeat this for all partitions

• Average performance for all K evaluations

• Finally train a Predictor on all data
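The steps above can be sketched as a generic helper that takes a training function and an evaluation function. Everything here is illustrative: the fold assignment (index modulo K) assumes the data is already shuffled, and the "learner" is a hypothetical one-dimensional threshold rule used only to keep the example self-contained.

```python
def k_fold_indices(n, k):
    """Assign example i to fold i % k."""
    return [[i for i in range(n) if i % k == fold] for fold in range(k)]


def cross_validate(data, k, train, evaluate):
    """Train on K-1 folds, evaluate on the left-out fold, average the K scores."""
    scores = []
    for held_out in k_fold_indices(len(data), k):
        held = set(held_out)
        train_part = [d for i, d in enumerate(data) if i not in held]
        test_part = [data[i] for i in held_out]
        scores.append(evaluate(train(train_part), test_part))
    return sum(scores) / k


def train_threshold(pairs):
    """Toy learner: threshold halfway between the two class means."""
    pos = [x for x, y in pairs if y == 1]
    neg = [x for x, y in pairs if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0


def accuracy(threshold, pairs):
    """Fraction of pairs whose label matches the thresholded prediction."""
    return sum(1 for x, y in pairs
               if (1 if x > threshold else -1) == y) / len(pairs)


data = [(0.1, -1), (0.9, 1), (0.2, -1), (0.8, 1), (0.3, -1), (0.7, 1)]
cv_accuracy = cross_validate(data, 3, train_threshold, accuracy)
final_model = train_threshold(data)  # last step: train on all data
```

The cross-validated score estimates how well the learning procedure generalizes; the Predictor that is actually deployed is the one trained on all data at the end.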


Model Tuning

Learning methods and Predictors have hyper-parameters

• Amount of regularization

• Choice of loss function

• Decision threshold score

• Learning rate

• …


Example: Decision threshold

Decision threshold
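To illustrate the effect of the decision threshold, the sketch below computes precision and recall for the same scores at different thresholds. A threshold t on the score corresponds to b = -t in a decision function of the form sign(f_w(x) + b). Reporting a precision of 1.0 when nothing is predicted positive is a convention assumed here, and the scores are made up.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict +1 when score > threshold, then compute (precision, recall)."""
    predicted = [1 if s > threshold else -1 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == -1)
    fn = sum(1 for p, y in zip(predicted, labels) if p == -1 and y == 1)
    # Convention assumed for this sketch: no positive predictions means precision 1.0.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn)
    return precision, recall


scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, -1, 1, -1, -1]
```

Raising the threshold from 0.2 to 0.85 on this toy data increases precision from 0.5 to 1.0 while recall drops from 1.0 to 0.5.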


How to choose hyper-parameters?

Grid Search:

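• Evaluate Predictor for all grid points (hyper-parameter combinations)

• Take best grid point

Very expensive!

(Figure: grid of hyper-parameter combinations; the axis ticks are powers of 10 on one axis and powers of 2 on the other.)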
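A bare-bones grid search can be sketched as below. The hyper-parameter names, the grid values and the scoring function are all hypothetical stand-ins; in practice `evaluate` would train a Predictor and return its validation or cross-validation performance, with higher values assumed to be better.

```python
import itertools
import math


def grid_search(grid, evaluate):
    """Evaluate every combination in the grid and return the best one."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = evaluate(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Logarithmically spaced grid (illustrative values).
grid = {
    "regularization": [10.0 ** e for e in (-2, 0, 2)],
    "learning_rate": [2.0 ** e for e in (-1, 0, 1)],
}


def toy_validation_score(params):
    """Hypothetical stand-in for a real validation score."""
    return (-(math.log10(params["regularization"]) ** 2)
            - (params["learning_rate"] - 0.5) ** 2)


best_params, best_score = grid_search(grid, toy_validation_score)
```

The cost grows multiplicatively with every added hyper-parameter (here 3 × 3 = 9 evaluations, each of which would normally include a full training run).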


How to choose hyper-parameters?

Bayesian Optimisation:

• Learn model to predict evaluation outcomes

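• Evaluate Predictor only for promising grid points

• Take best grid point after fixed number of evaluations

(Figure: grid of hyper-parameter combinations; the axis ticks are powers of 10 on one axis and powers of 2 on the other.)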


Common Pitfalls

• Model tuning is part of training

⇒ Do NOT use test data or test CV partitions!

• Use proper grid resolution and axis scaling

• Use same metric for tuning as for evaluation

Thank you!

