© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Michael Brückner Manager Machine Learning 25/02/2016 Machine Learning 101


  • © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

    Michael Brückner

    Manager Machine Learning

    25/02/2016

    Machine Learning 101

    Agenda

    • What is Machine Learning and why do we need it?

    • Model Building

    • Model Evaluation & Tuning

    What is Machine Learning?

    Methods and Systems that …

    • Adapt based on recorded data

    • Predict new data based on recorded data

    • Optimize an action given a utility function

    • Extract hidden structure from the data

    • Summarize data into concise descriptions

    What is Machine Learning NOT?

    Methods and Systems that …

    • can yield Garbage-In Knowledge-Out

    • perform well without data modeling & feature engineering

    • avoid the curse-of-dimensionality

    • are a replacement for business rules

    Infer-Predict-Decide Cycle

    • Inference: Build & evaluate Predictor

    • Prediction: Apply the learned Predictor

    • Decision Making: Adjust Business loss and get new/more data

    What for?

    Automate tasks that typically require humans, in order to

    • scale

    • improve over humans (non-experts)

    • preserve privacy

    or solve tasks that are impossible for humans

    Examples: Personalized Recommendation

    • Input:

    Examples: Personalized Recommendation

    • Output:

    Examples: Face Detection & Recognition

    Face detection

    • Input: image

    • Output: face position

    Face recognition

    • Input: face (image & face position)

    • Output: person’s name

    Examples: Full-Text Translation

    • Input: text in one language

    • Output: text in another language

    Examples: Spam Filtering

    • Input: email (text, images, …)

    • Output: spam/non-spam flag

    • Challenges:

      • extremely high precision for legitimate emails

      • spam changes constantly

      • noisy ground truth

    Supervised Machine Learning

    1. Model problem in terms of input data and output data

    2. Collect sample of input-output pairs

    3. Learn a mapping that produces the output given the input

    4. Apply this mapping to new inputs to make predictions
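
The four steps above can be sketched in a few lines of Python. This is only an illustration: the task (flagging messages by length), the sample values, and the helper names `learn_threshold` and `predict` are assumptions, not taken from the slides.

```python
# Step 1: model the problem. Input: message length (a number);
# output: +1 (spam) or -1 (not spam). Both choices are illustrative.
# Step 2: collect a sample of input-output pairs (made-up values).
samples = [(12, -1), (25, -1), (40, -1), (180, +1), (220, +1), (310, +1)]

def learn_threshold(pairs):
    """Step 3: learn a mapping, here the threshold with the fewest training errors."""
    xs = sorted(x for x, _ in pairs)
    # Candidate thresholds lie halfway between neighbouring inputs.
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]

    def errors(t):
        return sum(1 for x, y in pairs if (1 if x > t else -1) != y)

    return min(candidates, key=errors)

def predict(threshold, x):
    """Step 4: apply the learned mapping to a new input."""
    return 1 if x > threshold else -1

threshold = learn_threshold(samples)
print(threshold, predict(threshold, 30), predict(threshold, 250))
```

The learned mapping here is a single threshold; richer function classes replace `learn_threshold` but keep the same collect, learn, apply structure.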

    A Programmer’s Perspective

    Traditional Programming (Predicting): Input Data + Mapping → Computer → Output Data

    Supervised Machine Learning: Input Data + Output Data → Computer → Mapping

    Advantages

    • Use data instead of intuition to derive the mapping

    • Can solve very complex tasks

    • Can adapt to new situations (collect more data)

    • Does not require much expert knowledge

    Input Data

    Description              Type               Cost    Actual Cost  Diff    In Catalogue
    Movies                   Entertainment      $50     $28          $22     Yes
    Music (CDs, MP3s, etc.)                     $500    $30          $470    No
    Sporting Events          Entertainment      $0      $40          ($40)   No
    Dining Out               Food               $1,000  $1,200       ($200)  Yes
    Groceries                                   $100    $0           $100    Yes
    Charity 1                Gifts and Charity  $200    $200         $0      No
    Charity 2                                   $500    $500         $0      No
    Cable/Satellite          Housing            $100    $100         $0      Yes
    Electric                 Housing            $45     $40          $5      Yes
    Mortgage or Rent                            $700    $700         $0      Yes
    Health                   Insurance          $400    $400         $0      Yes
    Home                     Insurance          $400    $400         $0      No
    Credit Card 1                                                    $0      Yes

    Callouts: Dataset (the whole table); Attribute (a column); Attribute Name (a column header); Attribute Value (a cell); Text Data (Description); Categorical Data (Type); Numerical Data (Cost, Actual Cost, Diff); Binary Data (In Catalogue); Missing Data (empty cells)

    Output Data

    Description              Type               Cost    Actual Cost  Diff    In Catalogue
    Movies                   Entertainment      $50     $28          $22     Yes
    Music (CDs, MP3s, etc.)  ?                  $500    $30          $470    No
    Sporting Events          Entertainment      $0      $40          ($40)   No
    Dining Out               Food               $1,000  $1,200       ($200)  Yes
    Groceries                ?                  $100    $0           $100    Yes
    Charity 1                Gifts and Charity  $200    $200         $0      No
    Charity 2                ?                  $500    $500         $0      No
    Cable/Satellite          Housing            $100    $100         $0      Yes
    Electric                 Housing            $45     $40          $5      Yes
    Mortgage or Rent         ?                  $700    $700         $0      Yes
    Health                   Insurance          $400    $400         $0      Yes
    Home                     Insurance          $400    $400         $0      No
    Credit Card 1            ?                                       $0      Yes

    Callouts: Target Attribute (the Type column); Target Attribute Values (its entries; ? marks a value to be predicted)

    Agenda

    • What is Machine Learning and why do we need it?

    • Model Building

    • Model Evaluation & Tuning

    Problem Setting

    • Input: vector of observable attributes, x

    • Output: target attribute value, y

    • Training data: pairs of input and corresponding output, $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$

    • Application data: inputs only

    • Goal: learn mapping $f_w: x \mapsto y$ (the Predictor)

    Challenges in Model Building

    • Which function class for Predictor (data modeling)?

    • How to pre-process the data (feature engineering)?

    • How to learn this Predictor from our training data?

    • How to generalize to new data?

    Which function class for Predictor?

    Types of prediction tasks (output type):

    • Binary Classification ⇒ binary target $y \in \{-1, +1\}$

    • Multinomial Classification ⇒ categorical target $y \in \{1, \dots, K\}$

    • Regression ⇒ numeric target $y \in [l, u] \subset \mathbb{R}$

    Which function class for Binary Classification?

    • Decision Tree

    [Figure: decision tree with the splits x2 > 7?, x1 < 3?, x2 < 5?, x1 < 1? (each with no/yes branches), next to the (x1, x2) plane with + and - points and axis ticks at x1 = 1, 3 and x2 = 5, 7]

    Which function class for Binary Classification?

    • Decision Tree

    [Figure: the same decision tree (splits x2 > 7?, x1 < 3?, x2 < 5?, x1 < 1?), with the (x1, x2) plane divided into + and - regions]
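
A decision tree of this kind is just a chain of nested threshold tests. The sketch below uses the four split questions from the slide; how they nest and which class each leaf predicts cannot be read from the figure, so both are assumptions made for illustration, as is the name `tree_predict`.

```python
def tree_predict(x1, x2):
    """Classify a point (x1, x2) as +1 or -1 with axis-parallel splits.

    Split questions come from the slide; nesting and leaf labels are assumed.
    """
    if x2 > 7:
        return +1
    if x1 < 3:
        # Left part of the plane: one more split on x1.
        return +1 if x1 < 1 else -1
    # Right part of the plane: one more split on x2.
    return -1 if x2 < 5 else +1

print(tree_predict(0.5, 2.0), tree_predict(2.0, 2.0), tree_predict(4.0, 6.0))
```

Each path from the root to a leaf corresponds to one rectangular region of the (x1, x2) plane.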

    Which function class for Binary Classification?

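    • Linear function

    • binary target attribute values $y \in \{-1, +1\}$

    • decision function: $\hat{y}(x) = \operatorname{sign}(f_w(x))$

    • separating hyperplane: $H_w = \{x \mid f_w(x) = x^T w + w_0 = 0\}$

    [Figure: hyperplane $H_w$ separating + and - points in the (x1, x2) plane]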
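
As a minimal sketch of the linear decision function, the code below computes the score $x^T w + w_0$ and takes its sign. The parameter values are hypothetical, and mapping a score of exactly 0 to +1 is a convention chosen here, not something the slide specifies.

```python
def f(w, w0, x):
    """Linear score f_w(x) = x^T w + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def predict(w, w0, x):
    """Decision function: sign of the score (a score of 0 is mapped to +1)."""
    return 1 if f(w, w0, x) >= 0 else -1

w, w0 = [1.0, -2.0], 0.5            # hypothetical model parameters
print(predict(w, w0, [3.0, 1.0]))   # score 3 - 2 + 0.5 = 1.5
print(predict(w, w0, [0.0, 1.0]))   # score 0 - 2 + 0.5 = -1.5
```

Points with score 0 lie exactly on the hyperplane; everything on one side is predicted +1, everything on the other side -1.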

    Which function class for Binary Classification?

    • Generalized linear function (Kernel methods)

    • Layered Generalized linear function (Neural Networks)

    • Ensemble of functions

    • …

    [Figure: + and - points in the (x1, x2) plane]

    How to pre-process the data?

    • Predictor’s function class defined for limited input domain

    ⇒ transform/extract attributes first (pre-processing)

    • Number to (normalized) Number:

      • z-standardization, min-max normalization

    • Number to Category:

      • Binning (quantile, equidistant)

    • Category to (numeric) Vector:

      • One-hot encoding
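
The numeric and categorical transformations listed above can be sketched with the standard library alone. The function names are assumptions; `costs` reuses the first five Cost values from the Input Data slide, and every function assumes its input values are not all identical (otherwise the divisions fail).

```python
from statistics import mean, pstdev

def z_standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def equidistant_bin(values, n_bins):
    """Map each number to the index of one of n_bins equally wide bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def one_hot(values):
    """Encode each category as a 0/1 vector with a single 1."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

costs = [50, 500, 0, 1000, 100]
print(min_max(costs))
print(equidistant_bin(costs, 4))
print(one_hot(["Food", "Housing", "Food"]))
```

Quantile binning differs only in how the bin edges are chosen: they are placed so that each bin receives roughly the same number of values.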

    How to pre-process the data?

    • Predictor’s function class defined for limited input domain

    ⇒ transform/extract attributes first (pre-processing)

    • Text to (numeric) Vector:

      • Normalization, tokenization, stemming

      • Bag-of-Words, Bag-of-NGrams, TF-IDF ⇒ sparse vector

      • Latent word embedding (LSI, word2vec, LDA) ⇒ dense vector

    • Image to (numeric) Vector:

      • HoG, DAISY, color histogram
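
A rough sketch of the text pipeline, assuming a very simple normalization (lowercasing, stripping a few punctuation marks) and the plain tf × log(N/df) variant of TF-IDF; other weightings are common. Stemming is left out, and the two documents are made up.

```python
import math
from collections import Counter

def tokenize(text):
    """Normalize and tokenize: lowercase, split on whitespace, strip punctuation."""
    tokens = (t.strip(".,!?") for t in text.lower().split())
    return [t for t in tokens if t]

def bag_of_words(docs):
    """Return the vocabulary and one term-count vector per document."""
    vocab = sorted({t for d in docs for t in tokenize(d)})
    counts = [Counter(tokenize(d)) for d in docs]
    return vocab, [[c[t] for t in vocab] for c in counts]

def tf_idf(docs):
    """Weight term counts by log(N / document frequency)."""
    vocab, bows = bag_of_words(docs)
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = [sum(1 for row in bows if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n / d) for d in df]
    return vocab, [[tf * w for tf, w in zip(row, idf)] for row in bows]

docs = ["Cheap pills, cheap!", "Meeting at noon."]
vocab, bows = bag_of_words(docs)
print(vocab)
print(bows)
```

With a realistic vocabulary most entries of each vector are zero, which is why these representations are stored as sparse vectors in practice.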

    How to learn a Predictor?

    • Loss of Predictor $f_w: x \mapsto y$ for a given input-output pair: $L(y, f_w(x))$, where $L$ is the loss function, $y$ the ground truth, and $f_w(x)$ the prediction

    How to learn a Predictor?

    Loss functions for binary classification (target $y \in \{-1, +1\}$):
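
The plot for this slide is not part of the text. As a sketch, the four losses named in the table on the following slide (0/1, quadratic, logistic, hinge) can be written as functions of the target `y` and the real-valued score `score` = $f_w(x)$; the function names and these particular standard forms are assumptions.

```python
import math

def zero_one_loss(y, score):
    """1 if the sign of the score disagrees with y (or the score is 0), else 0."""
    return 0.0 if y * score > 0 else 1.0

def quadratic_loss(y, score):
    """Squared distance between the target and the score."""
    return (y - score) ** 2

def logistic_loss(y, score):
    """Smooth loss that decays towards 0 as the margin y * score grows."""
    return math.log(1.0 + math.exp(-y * score))

def hinge_loss(y, score):
    """Zero once the margin y * score reaches 1, linear below that."""
    return max(0.0, 1.0 - y * score)

for s in (-1.0, 0.0, 0.5, 2.0):
    print(s, zero_one_loss(1, s), quadratic_loss(1, s),
          logistic_loss(1, s), hinge_loss(1, s))
```

Logistic and hinge loss depend on `y` and `score` only through the margin `y * score`, which is why they are usually plotted against that product.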

    How to learn a Predictor?

    Function Class                       Loss Function   Learning Algorithm
    Decision Trees                       0/1 loss        ID3
    Decision Trees                       Quadratic loss  CART
    Linear function                      Quadratic loss  Least-squares regression
    Linear function                      Logistic loss   Logistic regression
    Linear function                      Hinge loss      Support Vector Machines
    Layered Generalized Linear function  Logistic loss   Neural Networks (Binary Classification)
    Layered Generalized Linear function  Quadratic loss  Neural Networks (Regression)

    How to learn a Predictor?

    • Theoretical Risk: average over all possible data

    • Empirical Risk: average over training data
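
The two risks are commonly written as follows. This is the standard textbook form and not necessarily the slide's exact notation; it uses the loss $L$, the Predictor $f_w$, and the $N$ training pairs from the Problem Setting slide, and $p$ for the unknown data distribution, a symbol assumed here.

```latex
R(w) = \mathbb{E}_{(x, y) \sim p}\left[ L(y, f_w(x)) \right],
\qquad
\hat{R}(w) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_w(x_i))
```

Only $\hat{R}(w)$ can actually be computed, because $p$ is unknown; it serves as an estimate of $R(w)$.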

    How to learn a Predictor?

    • Prediction $\hat{y}(x)$ depends on Predictor $f_w$ with model parameters $w$

    • Minimize Risk w.r.t. those model parameters $w$ ⇒ mathematical Optimisation Problem

      • Gradient-based first- or second-order methods

      • Coordinate-descent methods

      • (Greedy) Search
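
As one example of a gradient-based first-order method, the sketch below minimizes the empirical risk of a one-dimensional linear Predictor under logistic loss by plain gradient descent. The training set, learning rate, and number of steps are made-up values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def empirical_risk(w, w0, data):
    """Average logistic loss over the training pairs (x, y), y in {-1, +1}."""
    return sum(math.log(1.0 + math.exp(-y * (w * x + w0))) for x, y in data) / len(data)

def gradient_step(w, w0, data, lr):
    """One gradient descent update of the parameters (w, w0)."""
    gw = gw0 = 0.0
    for x, y in data:
        # d/dz log(1 + exp(-y z)) = -y * sigmoid(-y z), with z = w x + w0
        g = -y * sigmoid(-y * (w * x + w0))
        gw += g * x
        gw0 += g
    n = len(data)
    return w - lr * gw / n, w0 - lr * gw0 / n

data = [(-2.0, -1), (-1.0, -1), (1.0, +1), (2.0, +1)]   # made-up 1-D training set
w, w0 = 0.0, 0.0
risks = [empirical_risk(w, w0, data)]
for _ in range(50):
    w, w0 = gradient_step(w, w0, data, lr=0.5)
    risks.append(empirical_risk(w, w0, data))
print(risks[0], risks[-1])
```

Second-order methods use curvature information in addition to the gradient, and coordinate descent updates one parameter at a time; the objective being minimized stays the same.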

    How to generalize to new data?

    [Figure: Error plotted against Model Complexity]

    How to generalize to new data?

    • Empirical Risk: average loss over the training data

    • Structural Risk: Empirical Risk plus a Regularizer
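
A common way to write the structural risk, again in standard textbook notation rather than the slide's own: the regularizer $\Omega$, its weight $\lambda \ge 0$, and the squared-norm example are assumptions.

```latex
\hat{R}_{\lambda}(w) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_w(x_i)) + \lambda \, \Omega(w),
\qquad \text{e.g. } \Omega(w) = \lVert w \rVert_2^2
```

A larger $\lambda$ penalizes complex models more strongly; $\lambda = 0$ recovers the empirical risk.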

    Agenda

    • What is Machine Learning and why do we need it?

    • Model Building

    • Model Evaluation & Tuning

    Performance for Binary Classification

    Total number of data points (N)

                                  True Target: positive   True Target: negative
    Predicted Target: positive    True Positive           False Positive
    Predicted Target: negative    False Negative          True Negative

    Performance for Binary Classification

    • Accuracy: $\frac{TP + TN}{N}$

    • Recall (true positive rate): $\frac{TP}{TP + FN}$

    • Precision: $\frac{TP}{TP + FP}$

    • Fall-out (false positive rate): $\frac{FP}{TN + FP}$
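
The four measures can be computed directly from the confusion-matrix counts. In this sketch the labels are +1/-1 as elsewhere in the slides, the function names are assumptions, and the example labels are made up; the code assumes both classes occur among the true and the predicted labels, so no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for labels in {-1, +1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == -1 and p == -1)
    fp = sum(1 for t, p in pairs if t == -1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == -1)
    return tp, tn, fp, fn

def metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "fall_out": fp / (tn + fp),
    }

y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1, -1, -1, -1, -1]
print(confusion_counts(y_true, y_pred))
print(metrics(y_true, y_pred))
```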

    Performance for Binary Classification

    • Decision function: $\hat{y}(x) = \operatorname{sign}(f_w(x) + b)$, with Predictor $f_w$ and decision threshold $b$

    • AUC (Area Under ROC Curve)
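
Sweeping the decision threshold over the Predictor's scores traces out the ROC curve (recall against fall-out), and the AUC is the area under it. The sketch below predicts +1 whenever a score reaches the threshold, which corresponds to the decision function with $b$ set to minus the threshold; the scores and labels are made up and the function names are assumptions.

```python
def roc_points(scores, labels):
    """(fall-out, recall) pairs obtained by lowering the decision threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]   # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == -1)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # made-up Predictor outputs f_w(x)
labels = [1, 1, -1, 1, -1, -1]
print(auc(roc_points(scores, labels)))
```

The AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, so it does not depend on any single threshold.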

    Training vs. Test Performance

    How do we know that a Predictor works well on new data?

    Small error on training data ≠ small error on new data (test data)!

    Hold-out Evaluation

    • Put some data aside before training = test data

    • Use this hold-out data for evaluation

    • Disadvantages:

      • What if we were (un)lucky when choosing the hold-out data?

      • We do NOT use all the data for model training!
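
A hold-out split can be sketched as a seeded shuffle followed by a cut. The function name, the 20% test fraction, and the seed are assumptions chosen for illustration.

```python
import random

def hold_out_split(data, test_fraction=0.2, seed=0):
    """Shuffle the data and set a fraction aside as test data."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

train, test = hold_out_split(range(10))
print(len(train), len(test))
```

Changing `seed` changes which points end up in the test set, which is exactly the luck factor named in the first disadvantage.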

    K-Fold Cross Validation-based Evaluation

    • Split data into K partitions (folds)

    • Take all but one partition to train a Predictor

    • Evaluate Predictor on the left-out partition

    • Repeat this for all partitions

    • Average performance for all K evaluations

    • Finally train a Predictor on all data
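
The procedure above can be sketched as follows. The interleaved fold assignment, the function names, and the toy Predictor (the mean of the training values, scored by mean squared error) are illustrative assumptions; real data would normally be shuffled before being split into folds.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]   # interleaved partitions
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

def cross_validate(data, k, train_fn, eval_fn):
    """Average the evaluation score over all k left-out partitions."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(data), k):
        predictor = train_fn([data[j] for j in train_idx])
        scores.append(eval_fn(predictor, [data[j] for j in test_idx]))
    return sum(scores) / k

def train_mean(rows):
    """Toy 'training': the Predictor is the mean of the training values."""
    return sum(rows) / len(rows)

def mse(predictor, rows):
    """Toy evaluation: mean squared error of the constant prediction."""
    return sum((r - predictor) ** 2 for r in rows) / len(rows)

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_score = cross_validate(data, 3, train_mean, mse)
final_predictor = train_mean(data)   # finally train a Predictor on all data
print(cv_score, final_predictor)
```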

    Model Tuning

    Learning methods and Predictors have hyper-parameters

    • Amount of regularization

    • Choice of loss function

    • Decision threshold score

    • Learning rate

    • …

    Example: Decision threshold

    [Figure: Decision threshold]

    How to choose hyper-parameters?

    Grid Search:

    • Evaluate Predictor for all grid points (hyper-parameter combinations)

    • Take best grid point

    Very expensive!

    [Figure: two-dimensional grid of hyper-parameter combinations]
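
Grid search is an exhaustive loop over all hyper-parameter combinations. In this sketch the hyper-parameter names, the logarithmically spaced ranges, and `toy_evaluate` (a stand-in for training a Predictor and scoring it on validation data) are all hypothetical.

```python
from itertools import product

def grid_search(grid, evaluate):
    """Evaluate every combination in `grid`; return (best_score, best_params)."""
    names = sorted(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)          # higher is better
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Logarithmically spaced grid (assumed names and ranges).
grid = {
    "regularization": [10.0 ** e for e in (-2, -1, 0, 1, 2)],
    "learning_rate": [2.0 ** e for e in (-1, 0, 1)],
}

def toy_evaluate(params):
    """Stand-in for a validation score; peaks at regularization 1.0, learning rate 0.5."""
    return -abs(params["regularization"] - 1.0) - abs(params["learning_rate"] - 0.5)

best_score, best_params = grid_search(grid, toy_evaluate)
print(best_params)
```

This grid already needs 5 × 3 = 15 evaluations, and the count multiplies with every additional hyper-parameter, which is why the approach is so expensive.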

    How to choose hyper-parameters?

    Bayesian Optimisation:

    • Learn model to predict evaluation outcomes

    • Evaluate Predictor only for promising grid points

    • Take best grid point after fixed number of evaluations

    [Figure: two-dimensional grid of hyper-parameter combinations]

    Common Pitfalls

    • Model tuning is part of training

    ⇒ Do NOT use test data or test CV partitions!

    • Use proper grid resolution and axis scaling

    • Use same metric for tuning as for evaluation

  • Thank you!
