Carlos Guestrin - Neural Networks



    2005-2007 Carlos Guestrin 1

    Neural Networks

    Machine Learning 10701/15781

    Carlos Guestrin

    Carnegie Mellon University

    October 10th, 2007

    2005-2007 Carlos Guestrin 2

    Perceptron as a graph

    [Figure: the sigmoid output plotted against the net input, x-axis from -6 to 6, y-axis from 0 to 1]


    2005-2007 Carlos Guestrin 3

    The perceptron learning rule

    Compare to MLE:
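    The update equation on this slide is an image in the original deck. As an assumption about what it shows, the standard form of the rule is, per training example j:

    w_i  <-  w_i + η [ y^j - ŷ(x^j) ] x_i^j

    where ŷ is the unit's output. When ŷ is the sigmoid output g(w_0 + Σ_i w_i x_i), this is exactly the stochastic gradient step on the conditional log-likelihood, which is the comparison to MLE the slide is making.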

    2005-2007 Carlos Guestrin 4

    Hidden layer

    Perceptron:

    1-hidden layer:
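    The two output equations are images in the original deck; under the assumption that they take the standard forms, they are:

    Perceptron:       out(x) = g( w_0 + Σ_i w_i x_i )

    1-hidden layer:   out(x) = g( w_0 + Σ_k w_k g( w_0^k + Σ_i w_i^k x_i ) )

    i.e., a 1-hidden-layer network feeds the outputs of several sigmoid units into one more sigmoid unit.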


    2005-2007 Carlos Guestrin 5

    Example data for NN with hidden layer

    2005-2007 Carlos Guestrin 6

    Learned weights for hidden layer


    2005-2007 Carlos Guestrin 7

    NN for images

    2005-2007 Carlos Guestrin 8

    Weights in NN for images


    2005-2007 Carlos Guestrin 9

    Gradient descent for 1-hidden layer

    Back-propagation: computing the gradient (dropped w_0 to make the derivation simpler)

    2005-2007 Carlos Guestrin 10

    Gradient descent for 1-hidden layer

    Back-propagation: computing the gradient, continued (dropped w_0 to make the derivation simpler)
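    The derivation itself lives in the slide images. As a hedged sketch of the quantities being computed, assuming a squared-error objective (with w_0 dropped, as the slide notes):

    l(w) = Σ_j [ y^j - out(x^j) ]²,   out(x) = g( Σ_k w_k v_k ),   v_k = g( Σ_i w_i^k x_i )

    ∂l/∂w_k   = -2 Σ_j [ y^j - out(x^j) ] g'(·) v_k^j                        (output-layer weights)

    ∂l/∂w_i^k = -2 Σ_j [ y^j - out(x^j) ] g'(·) w_k g'(·) x_i^j              (hidden-layer weights)

    The hidden-layer gradient reuses the output-layer error term, which is the back-propagation idea.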


    2005-2007 Carlos Guestrin 11

    Multilayer neural networks

    2005-2007 Carlos Guestrin 12

    Forward propagation / prediction

    Recursive algorithm

    Start from the input layer

    Output of node V_k with parents U_1, U_2, ...:
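    The node-output formula is an image on the slide; the standard form, stated here as an assumption, is:

    V_k = g( w_0^k + Σ_i w_i^k U_i ),   with g the node's response function (e.g., sigmoid)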


    2005-2007 Carlos Guestrin 13

    Back-propagation learning

    Just gradient descent!!!

    Recursive algorithm for computing gradient

    For each example

    Perform forward propagation

    Start from output layer

    Compute gradient of node V_k with parents U_1, U_2, ...

    Update weight w_ik (a code sketch of the full recursion follows below)
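    As referenced above, a minimal NumPy sketch of this recursion for a 1-hidden-layer sigmoid network trained on squared error (all names are illustrative, not from the lecture; bias terms are omitted, as in the derivation):

    import numpy as np

    def g(z):
        # sigmoid response function
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, W1, w2):
        # forward propagation: hidden outputs v_k = g(sum_i W1[k, i] * x_i), then the output unit
        v = g(W1 @ x)
        return v, g(w2 @ v)

    def backprop_step(x, y, W1, w2, eta=0.5):
        # one gradient-descent step on the squared error (y - out)^2 for a single example
        v, out = forward(x, W1, w2)
        delta_out = (out - y) * out * (1 - out)      # error signal at the output unit
        delta_hid = delta_out * w2 * v * (1 - v)     # error propagated back to the hidden units
        w2 -= eta * delta_out * v                    # update output-layer weights w_k
        W1 -= eta * np.outer(delta_hid, x)           # update hidden-layer weights w_ik
        return W1, w2

    # illustrative usage: repeated steps on one example drive the output toward its target
    rng = np.random.default_rng(0)
    W1, w2 = rng.normal(size=(3, 2)), rng.normal(size=3)
    x, y = np.array([1.0, 0.5]), 1.0
    for _ in range(200):
        W1, w2 = backprop_step(x, y, W1, w2)
    print(forward(x, W1, w2)[1])   # output has moved toward the target 1.0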

    2005-2007 Carlos Guestrin 14

    Many possible response functions

    Sigmoid

    Linear

    Exponential

    Gaussian
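    For concreteness, common forms of these response functions (my notation; the slide shows them as plots):

    Sigmoid:      g(z) = 1 / (1 + e^{-z})

    Linear:       g(z) = z

    Exponential:  g(z) = e^{z}

    Gaussian:     g(z) = e^{-z² / (2σ²)}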


    2005-2007 Carlos Guestrin 15

    Convergence of backprop

    Perceptron leads to convex optimization

    Gradient descent reaches the global minimum

    Multilayer neural nets not convex

    Gradient descent gets stuck in local minima

    Hard to set learning rate

    Selecting number of hidden units and layers = fuzzy process

    NNs have fallen somewhat out of favor in the last few years

    We'll see later in the semester that the kernel trick is a good alternative

    Nonetheless, neural nets are one of the most widely used ML approaches

    2005-2007 Carlos Guestrin 16

    Overfitting?

    Neural nets represent complex functions

    Output becomes more complex with gradient steps


    2005-2007 Carlos Guestrin 17

    Overfitting

    Output fits training data too well

    Poor test set accuracy

    Overfitting the training data

    Related to bias-variance tradeoff

    One of the central problems of ML

    Avoiding overfitting?

    More training data

    Regularization

    Early stopping

    2005-2007 Carlos Guestrin 18

    What you need to know about neural networks

    Perceptron:

    Representation

    Perceptron learning rule

    Derivation

    Multilayer neural nets

    Representation

    Derivation of backprop

    Learning rule

    Overfitting

    Definition

    Training set versus test set

    Learning curve


    2005-2007 Carlos Guestrin 19

    Announcements

    Recitation this week: Neural networks

    Project proposals due next Wednesday

    Exciting data:

    Swivel.com - user generated graphs

    Recognizing Captchas

    Election contributions

    Activity recognition

    2005-2007 Carlos Guestrin 20

    Instance-Based Learning

    Machine Learning 10701/15781

    Carlos Guestrin

    Carnegie Mellon University

    October 10th, 2007


    2005-2007 Carlos Guestrin 21

    Why not just use Linear Regression?

    2005-2007 Carlos Guestrin 22

    Using data to predict new data


    2005-2007 Carlos Guestrin 23

    Nearest neighbor

    2005-2007 Carlos Guestrin 24

    Univariate 1-Nearest Neighbor

    Given datapoints (x_1, y_1), (x_2, y_2), ..., (x_N, y_N), where we assume y_i = f(x_i) for some unknown function f.

    Given a query point x_q, your job is to predict ŷ ≈ f(x_q).

    Nearest Neighbor:

    1. Find the closest x_i in our set of datapoints:  nn(q) = argmin_i | x_i - x_q |

    2. Predict  ŷ = y_{nn(q)}

    [Figure: a dataset with one input, one output, and four datapoints; each region of the x-axis is annotated "Here, this is the closest datapoint", pointing to its nearest neighbor.]


    2005-2007 Carlos Guestrin 25

    1-Nearest Neighbor is an example of... Instance-based learning

    A function approximator that has been around since about 1910: store the labeled datapoints

    x_1 y_1
    x_2 y_2
    x_3 y_3
    ...
    x_n y_n

    and, to make a prediction, search the database for similar datapoints and fit with the local points.

    Four things make a memory based learner:

    A distance metric

    How many nearby neighbors to look at?

    A weighting function (optional)

    How to fit with the local points?

    2005-2007 Carlos Guestrin 26

    1-Nearest Neighbor

    Four things make a memory based learner:

    1. A distance metric

    Euclidean (and many more)

    2. How many nearby neighbors to look at?

    One

    3. A weighting function (optional)

    Unused

    4. How to fit with the local points?

    Just predict the same output as the nearest neighbor.
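    A minimal NumPy sketch of exactly this recipe (illustrative names, not from the lecture):

    import numpy as np

    def nn_predict(X_train, y_train, x_query):
        # 1-NN: Euclidean metric, k = 1, no weighting; predict the output
        # of the single closest training point.
        d2 = np.sum((X_train - x_query) ** 2, axis=1)   # squared distances to the query
        return y_train[np.argmin(d2)]

    # usage
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0.0, 1.0, 4.0, 9.0])
    print(nn_predict(X, y, np.array([1.8])))   # -> 4.0, since x = 2 is closest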


    2005-2007 Carlos Guestrin 27

    Multivariate 1-NN examples

    [Figure: multivariate 1-NN regions, one regression example and one classification example]

    2005-2007 Carlos Guestrin 28

    Multivariate distance metrics

    Suppose the input vectors x_1, x_2, ..., x_N are two-dimensional:
    x_1 = (x_11, x_12), x_2 = (x_21, x_22), ..., x_N = (x_N1, x_N2).

    One can draw the nearest-neighbor regions in input space.

    Dist(x_i, x_j) = (x_i1 - x_j1)² + (3 x_i2 - 3 x_j2)²

    The relative scalings in the distance metric affect region shapes

    Dist(x_i, x_j) = (x_i1 - x_j1)² + (x_i2 - x_j2)²


    2005-2007 Carlos Guestrin 29

    Euclidean distance metric

    D(x, x') = sqrt( Σ_i σ_i² (x_i - x'_i)² )

    Or equivalently,  D(x, x') = sqrt( (x - x')^T Σ (x - x') ),  where  Σ = diag( σ_1², ..., σ_N² )

    Other Metrics

    Mahalanobis, Rank-based, Correlation-based, ...

    2005-2007 Carlos Guestrin 30

    Notable distance metrics (and their level sets)

    L1 norm (absolute)

    L∞ (max) norm

    Scaled Euclidean (L2)

    Mahalanobis (here, Σ from the previous slide is not necessarily diagonal, but is symmetric)
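    For reference, the usual definitions behind these pictures (my notation):

    L1 (absolute):          D(x, x') = Σ_i | x_i - x'_i |

    L∞ (max):               D(x, x') = max_i | x_i - x'_i |

    Scaled Euclidean (L2):  D(x, x') = sqrt( Σ_i σ_i² (x_i - x'_i)² )

    Mahalanobis:            D(x, x') = sqrt( (x - x')^T Σ (x - x') ),  with Σ symmetric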


    2005-2007 Carlos Guestrin 31

    Consistency of 1-NN

    Consider an estimator f_n trained on n examples

    e.g., 1-NN, neural nets, regression, ...

    Estimator is consistent if the true error goes to zero as the amount of data increases

    e.g., for noise-free data, consistent if the condition written out below holds

    Regression is not consistent!

    Representation bias

    1-NN is consistent (under some mild fine print)

    What about variance???
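    The condition itself is an image on the slide; for noise-free data it is the standard requirement (stated here as an assumption about the slide):

    lim_{n→∞}  error_true(f_n) = 0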

    2005-2007 Carlos Guestrin 32

    1-NN overfits?


    2005-2007 Carlos Guestrin 33

    k-Nearest Neighbor

    Four things make a memory based learner:

    1. A distance metric

    Euclidean (and many more)

    2. How many nearby neighbors to look at?

    k

    3. A weighting function (optional)

    Unused

    4. How to fit with the local points?

    Just predict the average output among the k nearest neighbors.
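    In symbols (my notation): with N_k(x_q) the indices of the k training points nearest to the query,

    ŷ(x_q) = (1/k) Σ_{i ∈ N_k(x_q)} y_i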

    2005-2007 Carlos Guestrin 34

    k-Nearest Neighbor (here k=9)

    K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies.

    What can we do about all the discontinuities that k-NN gives us?


    2005-2007 Carlos Guestrin 35

    Weighted k-NNs

    Neighbors are not all the same

    2005-2007 Carlos Guestrin 36

    Kernel regression

    Four things make a memory based learner:

    1. A distance metric

    Euclidean (and many more)

    2. How many nearby neighbors to look at?

    All of them

    3. A weighting function (optional)

    w_i = exp( -D(x_i, query)² / K_w² )

    Nearby points to the query are weighted strongly, far points weakly. The K_w parameter is the kernel width. Very important.

    4. How to fit with the local points?

    Predict the weighted average of the outputs:

    predict = Σ_i w_i y_i / Σ_i w_i
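    A minimal NumPy sketch of this predictor with the Gaussian weighting above (illustrative names):

    import numpy as np

    def kernel_regress(X_train, y_train, x_query, kw=1.0):
        # weight every training point by a kernel of its distance to the query,
        # then predict the weighted average of the outputs
        d2 = np.sum((X_train - x_query) ** 2, axis=1)   # squared distances D(x_i, query)^2
        w = np.exp(-d2 / kw ** 2)                       # w_i = exp(-D^2 / Kw^2)
        return np.sum(w * y_train) / np.sum(w)          # sum_i w_i y_i / sum_i w_i

    # usage: a small kernel width tracks nearby points, a huge one tends to the global average
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0.0, 1.0, 4.0, 9.0])
    print(kernel_regress(X, y, np.array([1.8]), kw=0.5))
    print(kernel_regress(X, y, np.array([1.8]), kw=100.0))   # ~ mean(y) = 3.5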


    2005-2007 Carlos Guestrin 37

    Weighting functions

    w_i = exp( -D(x_i, query)² / K_w² )

    Typically optimize K_w using gradient descent

    (Our examples use a Gaussian weighting function)

    2005-2007 Carlos Guestrin 38

    Kernel regression predictions

    Increasing the kernel width K_w means further-away points get an opportunity to influence you.

    As K_w → ∞, the prediction tends to the global average.

    [Figure: predictions with K_w = 10, 20, and 80]


    2005-2007 Carlos Guestrin 39

    Kernel regression on our test cases

    [Figure: three test cases, with K_w = 1/16 of x-axis width, 1/32 of x-axis width, and 1/32 of x-axis width]

    Choosing a good K_w is important, not just for kernel regression, but for all the locally weighted learners we're about to see.

    2005-2007 Carlos Guestrin 40

    Kernel regression can look bad

    [Figure: three test cases, each with K_w set to its best value]

    Time to try something more powerful...


    2005-2007 Carlos Guestrin 41

    Locally weighted regression

    Kernel regression:

    Take a very, very conservative function approximator called AVERAGING. Locally weight it.

    Locally weighted regression:

    Take a conservative function approximator called LINEAR REGRESSION. Locally weight it.

    2005-2007 Carlos Guestrin 42

    Locally weighted regression

    Four things make a memory based learner:

    1. A distance metric

    Any

    2. How many nearby neighbors to look at?

    All of them

    3. A weighting function (optional)

    Kernels:  w_i = exp( -D(x_i, query)² / K_w² )

    4. How to fit with the local points?

    General weighted regression:

    β̂ = argmin_β  Σ_{k=1}^{N} w_k² ( y_k - β^T x_k )²


    2005-2007 Carlos Guestrin 43

    How LWR works

    Query

    Linear regression: same parameters for all queries

    β̂ = ( X^T X )^{-1} X^T Y

    Locally weighted regression: solve a weighted linear regression for each query

    β̂ = ( (WX)^T (WX) )^{-1} (WX)^T (WY),   where  W = diag( w_1, w_2, ..., w_n )
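    A minimal NumPy sketch of the per-query weighted solve above (illustrative names; a real implementation would handle the intercept and regularization explicitly):

    import numpy as np

    def lwr_predict(X, y, x_query, kw=1.0):
        # locally weighted regression: re-solve a weighted least-squares problem
        # for every query, with weights centered on that query
        d2 = np.sum((X - x_query) ** 2, axis=1)
        w = np.exp(-d2 / kw ** 2)                        # w_i = exp(-D^2 / Kw^2)
        WX, Wy = w[:, None] * X, w * y                   # rows scaled by w_i, i.e. W = diag(w)
        beta = np.linalg.lstsq(WX, Wy, rcond=None)[0]    # beta = ((WX)^T WX)^{-1} (WX)^T (WY)
        return x_query @ beta

    # usage: a column of 1s in X plays the role of the intercept
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([0.0, 1.0, 4.0, 9.0])
    print(lwr_predict(X, y, np.array([1.0, 1.8]), kw=1.0))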

    2005-2007 Carlos Guestrin 44

    Another view of LWR

    Image from Cohn, D.A., Ghahramani, Z., and Jordan, M.I. (1996) "Active Learning with Statistical Models", JAIR Volume 4, pages 129-145.


    2005-2007 Carlos Guestrin 45

    LWR on our test cases

    [Figure: K_w = 1/8 of x-axis width, K_w = 1/16 of x-axis width, K_w = 1/32 of x-axis width]

    2005-2007 Carlos Guestrin 46

    Locally weighted polynomial regression

    [Figure, three panels with kernel width K_w at its optimal level: LW quadratic regression (K_w = 1/15 of x-axis), LW linear regression (K_w = 1/40 of x-axis), kernel regression (K_w = 1/100 of x-axis)]

    Local quadratic regression is easy: just add quadratic terms to the WX^T WX matrix. As the regression degree increases, the kernel width can increase without introducing bias.
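    In code, "adding quadratic terms" just means expanding the inputs before the weighted solve; a brief sketch under that reading (hypothetical helper, usable with the lwr_predict sketch above):

    import numpy as np

    def quad_features(x):
        # expand scalar inputs into [1, x, x^2] so the same weighted least-squares
        # solve fits a local quadratic instead of a local line
        return np.stack([np.ones_like(x), x, x ** 2], axis=1)

    x = np.array([0.0, 1.0, 2.0, 3.0])
    print(quad_features(x).shape)   # (4, 3): columns are 1, x, x^2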


    2005-2007 Carlos Guestrin 47

    Curse of dimensionality for

    instance-based learning

    Must store and retrieve all the data!

    Most real work done during testing

    For every test sample, must search through the whole dataset: very slow!

    We'll see fast methods for dealing with large datasets

    Instance-based learning often does poorly with noisy or irrelevant features

    2005-2007 Carlos Guestrin 48

    Curse of the irrelevant feature


    2005-2007 Carlos Guestrin 49

    What you need to know about instance-based learning

    k-NN

    Simplest learning algorithm

    With sufficient data, a strawman approach that is very hard to beat

    Picking k?

    Kernel regression

    Set k to n (the number of data points) and optimize the weights by gradient descent

    Smoother than k-NN

    Locally weighted regression

    Generalizes kernel regression: not just a local average

    Curse of dimensionality

    Must remember the (very large) dataset for prediction

    Irrelevant features are often killers for instance-based approaches

    2005-2007 Carlos Guestrin 50

    Acknowledgment

    This lecture contains some material from Andrew Moore's excellent collection of ML tutorials:

    http://www.cs.cmu.edu/~awm/tutorials