Machine Learning Tutorial


  • 7/30/2019 Machine Learning Tutorial

    1/33

    CB GS REC

    Machine Learning basic concepts

    Machine Learning Tutorial for the UKP Lab

    SS 2011 | Computer Science Department | UKP Lab | György Szarvas

  • 7/30/2019 Machine Learning Tutorial

    2/33

    This ppt includes some slides / slide parts / text taken from online materials created by the following authors:

    - Greg Grudic
    - Alexander Vezhnevets
    - Hal Daumé III

  • 7/30/2019 Machine Learning Tutorial

    3/33

    The goal of machine learning is to build computer

    systems that can adapt and learn from their experience.

    Tom Dietterich


  • 7/30/2019 Machine Learning Tutorial

    4/33

    A system transforms inputs into outputs, possibly via hidden (internal) variables:

    Input variables:  x = (x_1, x_2, ..., x_N)
    Hidden variables: h = (h_1, h_2, ..., h_M)
    Output variables: y = (y_1, y_2, ..., y_K)


  • 7/30/2019 Machine Learning Tutorial

    5/33

    When the relationships between all system variables

    (input, output, and hidden) are completely understood!

    This is NOT the case for almost any real system!


  • 7/30/2019 Machine Learning Tutorial

    6/33


    Supervised Learning

    Unsupervised Learning


  • 7/30/2019 Machine Learning Tutorial

    7/33

    Given: training examples (x_1, f(x_1)), (x_2, f(x_2)), ..., (x_P, f(x_P)) for some unknown function f.

    Find: a good approximation of f.

    Predict: y = f(x'), where x' is not in the training set.
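    To make this setting concrete, here is a minimal sketch (not from the original slides), assuming scikit-learn and using the bundled iris data and a decision tree purely as stand-ins:

    ```python
    # Minimal supervised-learning sketch: fit an approximation of f on training
    # pairs (x, f(x)) and predict labels for x' outside the training set.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)                  # examples x and labels f(x)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(model.predict(X_test[:5]))                   # y = f(x') for unseen x'
    print("held-out accuracy:", model.score(X_test, y_test))
    ```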


  • 7/30/2019 Machine Learning Tutorial

    8/33

    Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

    Learned hypothesis: model of problem/task T
    Model quality: accuracy/performance measured by P


  • 7/30/2019 Machine Learning Tutorial

    9/33

    Data: experience E in the form of examples / instances

    characteristic of the whole input space

    independent and identically distributed (no bias in selection / observations)

    Good example: 1000 abstracts chosen randomly out of 20M PubMed entries (abstracts): probably i.i.d.

    representative? if annotation is involved it is always a question of compromises

    Definitely bad example: all abstracts that have John Smith as an author

    Instances have to be comparable to each other


  • 7/30/2019 Machine Learning Tutorial

    10/33

    Example: a set of queries and a set of top retrieved documents (characterized via tf, idf, tf*idf, PRank, BM25 scores) for each query

    the top retrieved set is dependent on the underlying IR system!
      issues with representativeness, but for reranking this is fine

    the characterization is dependent on the query (exc. PRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (cf. independent examples for the same Q)
      we have to normalize the features per query to have the same mean/variance (see the sketch below)
      we have to form pairs and compare e.g. the difference of the feature values

    Toy example: Q = learning,    rank 1: tf = 15, rank 100: tf = 2
                 Q = overfitting, rank 1: tf = 2,  rank 10:  tf = 0
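    A small sketch (not part of the slides) of the per-query normalization mentioned above; `normalize_per_query` is a hypothetical helper, and the tf values come from the toy example:

    ```python
    import numpy as np

    def normalize_per_query(features, query_ids):
        """Z-score normalize each feature column within every query group."""
        features = np.asarray(features, dtype=float)
        query_ids = np.asarray(query_ids)
        out = np.empty_like(features)
        for q in np.unique(query_ids):
            mask = query_ids == q
            block = features[mask]
            mean = block.mean(axis=0)
            std = block.std(axis=0)
            std[std == 0.0] = 1.0          # avoid division by zero for constant features
            out[mask] = (block - mean) / std
        return out

    # Toy data mirroring the slide: tf values for two queries at different ranks.
    tf = [[15.0], [2.0], [2.0], [0.0]]
    qid = ["learning", "learning", "overfitting", "overfitting"]
    print(normalize_per_query(tf, qid))
    ```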


  • 7/30/2019 Machine Learning Tutorial

    11/33

    The available examples (experience) have to be described to the algorithm in a consumable format. Here: examples are represented as vectors of pre-defined features.

    E.g. for credit risk assessment, typical features can be: income range, city of residence, etc.

    Common feature types:
      binary (criminal record, Y/N)
      nominal (city of residence)
      ordinal (income range: 0-10K, 10-20K, ...)
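    As an illustration (not from the slides) of turning such features into a vector representation, here is a plain-Python sketch; the feature names and values are made-up credit-risk examples, and `to_vector` is a hypothetical helper:

    ```python
    # Hypothetical credit-risk examples; feature names are illustrative assumptions.
    examples = [
        {"income_range": 1, "city": "Darmstadt", "criminal_record": False},
        {"income_range": 3, "city": "Berlin",    "criminal_record": True},
    ]

    cities = sorted({e["city"] for e in examples})   # vocabulary for the nominal feature

    def to_vector(e):
        """Encode ordinal as integer, binary as 0/1, nominal as one-hot."""
        one_hot = [1.0 if e["city"] == c else 0.0 for c in cities]
        return [float(e["income_range"]), float(e["criminal_record"])] + one_hot

    print([to_vector(e) for e in examples])
    ```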


  • 7/30/2019 Machine Learning Tutorial

    12/33

    CB GS REC

    Experimental practice

    by now you've learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features;

    the algorithm then learns a model of the problem based on the examples (usually, improvement is observed in terms of some performance measure)

    June 10, 2011

  • 7/30/2019 Machine Learning Tutorial

    13/33

    There are 2 kinds of parameters:

    one that the user sets for the training procedure in advance (hyperparameter)
      the degree of the polynomial to fit in regression
      the number/size of hidden layers in a neural network
      the number of instances per leaf in a decision tree

    one that actually gets optimized through the training (parameter)
      regression coefficients
      network weights
      size/depth of the decision tree (in Weka; other implementations might allow you to control that)

    we usually do not talk about the latter, but refer to hyperparameters as parameters

    Hyperparameters: the fewer the algorithm has, the better
      Naive Bayes the best? No parameters! Usually algorithms with better discriminative power are not parameter-free
      typically they are set to optimize performance (on a validation set, or through cross-validation)
      manual search, grid search, simulated annealing, gradient descent, etc.

    Common pitfall: selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same cross-validation results (a sketch of a safer setup follows below)
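    A sketch of the safer setup (assuming scikit-learn; the dataset, learner and parameter grid are arbitrary stand-ins): the hyperparameter is selected by 10-fold CV on the training part only, and the reported figure comes from a separate blind test set.

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},  # hyperparameter grid
        cv=10,                                               # 10-fold CV on the training part only
    )
    search.fit(X_train, y_train)
    print("best hyperparameter:", search.best_params_)
    print("CV score (used for selection, do NOT report as final):", search.best_score_)
    print("blind test accuracy:", search.score(X_test, y_test))
    ```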

  • 7/30/2019 Machine Learning Tutorial

    14/33

    n-fold cross-validation: split the data X = {x_1, ..., x_k} into n equal parts (here X_1, ..., X_5); in each iteration one part is used as the Test set and the remaining parts as the Train set. The result is an average over all iterations.

    [Figure: the five folds X_1 ... X_5; each fold serves once as the test set while the others form the training set.]
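    A minimal sketch (not from the slides) of the procedure in the figure, assuming scikit-learn; the data and learner are stand-ins:

    ```python
    # Sketch of n-fold cross-validation (n = 5): average the per-fold test scores.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on the held-out fold
    print("per-fold accuracy:", scores)
    print("cross-validated accuracy:", sum(scores) / len(scores))
    ```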

  • 7/30/2019 Machine Learning Tutorial

    15/33

    -

    n-fold CV: common practice for making hyperparameter estimation more robust
      round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
      typical: random splits, without replacement (each instance is tested exactly once)
      the other way: random subsampling cross-validation

    "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (Bengio and Grandvalet 2004)
      bad practice? problem: the training sets largely overlap, and the test errors are also dependent (treat such estimates with caution)
      5x2 CV is a better option: do 2-fold CV and repeat it 5 times, then calculate the average: less overlap in the training sets

    Folding via natural units of processing for the given task
      typically, document boundaries; best practice is doing it yourself!
      an ML package / CSV representation is not aware of e.g. document boundaries!
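    Folding along natural units can be sketched as follows (not from the slides); it assumes scikit-learn's GroupKFold and uses made-up document ids:

    ```python
    # Sketch: fold along natural units (here: hypothetical document ids), so that
    # all instances of one document end up in the same fold.
    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.arange(20).reshape(10, 2)                     # 10 toy instances
    y = np.array([0, 1] * 5)
    doc_ids = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])   # document boundary information

    for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=doc_ids):
        # No document id appears in both the training and the test fold.
        print("test documents:", sorted(set(doc_ids[test_idx])))
    ```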

  • 7/30/2019 Machine Learning Tutorial

    16/33

    -

    Ideally, the valid settings are:

    take off-the-shelf algorithms, avoid parameter tuning, and compare (e.g. via cross-validation)
      n.b. you probably do the folding yourself, trying to minimize biases!

    do parameter tuning (n.b. selecting/tuning your features is also tuning!)
      but then you normally have to have a blind set (from the beginning)

    e.g. have a look at shared tasks, e.g. CoNLL: a practical way to learn experimental best practice, to align with the predefined standards (you might even benefit from comparative results, etc.)

    You might want to do something different
      be aware of these & the consequences


  • 7/30/2019 Machine Learning Tutorial

    17/33

    1. define the task
       instance, target variable/labels, collect and label/annotate data
       credit risk assessment: instance = a credit request, label = good / bad credit (e.g. the loan ran out in the previous year)

    2. split the data into training / validation (development) / test (evaluation) data

    3. pick a learning algorithm (e.g. decision tree), train the model
       train on the training set
       optimize/set the model hyperparameters (e.g. number of instances per leaf, use of pruning, ...) according to performance on the validation data
       test the model accuracy on the (blind) test set

    4. ready to use the model to predict unseen instances, with an expected accuracy similar to that seen on the test set
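    The four steps can be sketched like this (assuming scikit-learn; the dataset, learner and tuned hyperparameter are illustrative stand-ins, not the credit-risk task itself):

    ```python
    # Sketch of steps 2-4: split the data, tune one hyperparameter on the validation
    # part, then estimate the expected accuracy on the blind test part.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    best_model, best_val = None, -1.0
    for min_leaf in (1, 5, 10, 20):                    # step 3: hyperparameter candidates
        model = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0).fit(X_train, y_train)
        val_acc = model.score(X_val, y_val)            # selection on validation data
        if val_acc > best_val:
            best_model, best_val = model, val_acc

    print("expected accuracy (blind test):", best_model.score(X_test, y_test))   # step 4
    ```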


  • 7/30/2019 Machine Learning Tutorial

    18/33

    Relation: segment
    Instances: 1500
    Attributes: 20

    Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
    Correctly Classified Instances    290    96.6667 %
    Incorrectly Classified Instances   10     3.3333 %

    Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
    Correctly Classified Instances    281    93.6667 %
    Incorrectly Classified Instances   19     6.3333 %
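    The effect of the minimum-instances-per-leaf setting can be reproduced roughly in scikit-learn (a sketch, not the original Weka run; the UCI 'segment' data is not bundled, so the wine data serves as a stand-in, and min_samples_leaf plays the role of J48's -M option):

    ```python
    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)
    for min_leaf in (2, 12):
        tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
        acc = cross_val_score(tree, X, y, cv=10).mean()        # 10-fold CV accuracy
        print(f"min instances per leaf = {min_leaf}: accuracy = {acc:.3f}")
    ```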


  • 7/30/2019 Machine Learning Tutorial

    19/33

    Fitting a polynomial regression:

    y(x) = sum_{n=0}^{M} a_n x^n

    By, for instance, least squares:

    a* = argmin_a sum_l ( sum_{n=0}^{M} a_n x_l^n - y_l )^2

    [Figure: polynomial fits of degree M = 0, 1, 3, 9 to the same data points (x in [0.0, 1.0], target t).]
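    A quick numerical sketch of the effect of the degree M (not from the slides; the underlying curve and noise level are assumptions):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)        # noisy training targets
    x_new = rng.uniform(0.0, 1.0, 100)                                    # unseen inputs
    t_new = np.sin(2 * np.pi * x_new) + rng.normal(scale=0.2, size=x_new.size)

    for M in (0, 1, 3, 9):
        coeffs = np.polyfit(x, t, deg=M)                                  # least-squares fit
        train_mse = np.mean((np.polyval(coeffs, x) - t) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_new) - t_new) ** 2)
        print(f"M = {M}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
    ```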


  • 7/30/2019 Machine Learning Tutorial

    20/33

    Important concept: the discriminative power of the algorithm
      linear vs. nonlinear model

    some theoretical aspects:
      a 1-hidden-layer NN with unlimited hidden nodes can approximate any smooth function/surface arbitrarily well


  • 7/30/2019 Machine Learning Tutorial

    21/33

    Overfitting: the model fits the training data too closely, has no (bad) generalization ability, and results in a high test error (useless model)

    Underfitting: the model is not capable of learning the (complex) patterns in the training set

    Reasons for underfitting and overfitting:
      lack of discriminative power
      small sample size
      noise in the data (labels or features)

    The generalization ability of the algorithm has to be chosen w.r.t. the sample size
    The size (complexity) of the learnt model grows with the data size


  • 7/30/2019 Machine Learning Tutorial

    22/33

    TP: p classified as p
    FP: n classified as p
    TN: n classified as n
    FN: p classified as n

    Good prediction: TP + TN
    Error: FP (false alarm) + FN (miss)


  • 7/30/2019 Machine Learning Tutorial

    23/33

    Accuracy: the rate of correct predictions made by the model over a data set (cf. coverage): (TP+TN) / (TP+FN+FP+TN)

    Error rate: the rate of incorrect predictions made by the model over a data set: (FP+FN) / (TP+FN+FP+TN)

    [Root]?[Mean|Absolute][Squared]?Error: the difference between the predicted and actual values, e.g. RMSE = sqrt( (1/n) * sum_x (f(x) - y)^2 )

    Algorithms (e.g. those in Weka) typically optimize these; there might be a mismatch between the optimization objective and the actual evaluation measure. Optimizing other measures directly is research on its own (e.g. in ML for IR, a.k.a. learning to rank).
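    For completeness, a tiny plain-Python sketch of the two kinds of measures (illustrative numbers only):

    ```python
    import math

    def accuracy(tp, fp, tn, fn):
        return (tp + tn) / (tp + fp + tn + fn)

    def rmse(predictions, targets):
        return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))

    print(accuracy(tp=40, fp=5, tn=50, fn=5))        # made-up confusion counts
    print(rmse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))   # made-up regression outputs
    ```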


  • 7/30/2019 Machine Learning Tutorial

    24/33

    Precision: fraction of correctly predicted positives among all predicted positives: TP / (TP+FP)

    Recall: fraction of correctly predicted positives among all actual positives: TP / (TP+FN)

    F measure: weighted harmonic mean of Precision and Recall (usually equally weighted, beta = 1):
      F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)

    Only makes sense for a subset of classes (usually measured for a single class).
    For all classes together, it equals the accuracy.
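    A small sketch of these formulas in plain Python (not from the slides); the example counts correspond to the PER entities of the NER example on the following slides:

    ```python
    def precision_recall_f(tp, fp, fn, beta=1.0):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0.0:
            return precision, recall, 0.0
        f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f_beta

    print(precision_recall_f(tp=1, fp=1, fn=0))   # -> P = 0.5, R = 1.0, F = 0.67
    ```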


  • 7/30/2019 Machine Learning Tutorial

    25/33

    Phrase-level evaluation, e.g. in NER (named entity recognition): a sequence of tokens with the same label is treated as a single instance.

    Gold: John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG .

    Why? We need complete phrases to be identified correctly.
    How? With an external evaluation script, e.g. conlleval for NER.

    Example tagging: John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG .

    Multiple penalty:
      2 FPs: Johns Hopkins (PER) and University (ORG)
      1 FN: Johns Hopkins University (ORG)
      F(PER) = 0.67, F(ORG) = 0.5


  • 7/30/2019 Machine Learning Tutorial

    26/33

    1. The real-world evaluation function, e.g. time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.

    2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance judgments, etc. These require humans in the loop.

    3. Automatic correlation-driving functions. Typical examples are Bleu, Rouge, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.

    4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

    Measures can become dysfunctional when you are optimizing them!
      phrase P/R/F, e.g. in NER
      readability measures


  • 7/30/2019 Machine Learning Tutorial

    27/33

    Phrase-level evaluation, continued.

    Gold: John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG .
      3 gold positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)

    Example tagging 1: John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG .
      2 FPs: Johns Hopkins (PER) and University (ORG)
      F(PER) = 0.67, F(ORG) = 0.5

    Example tagging 2: John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG .
      0 FP
      1 FN: Johns Hopkins University (ORG)
      F(PER) = 1.0, F(ORG) = 0.67

    Optimizing phrase-F can encourage / prefer systems that do not mark entities!
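    A sketch of the phrase-level matching logic (not the actual conlleval script; `entities` is a hypothetical helper), applied to example tagging 1 above:

    ```python
    # A predicted entity counts as TP only if both its span and its label match
    # a gold entity exactly.
    def entities(tokens):
        """Collapse consecutive tokens with the same non-O label into (start, end, label) spans."""
        spans, start = [], None
        for i, (_, label) in enumerate(tokens + [("", "O")]):
            if start is not None and (label == "O" or label != tokens[start][1]):
                spans.append((start, i, tokens[start][1]))
                start = None
            if label != "O" and start is None:
                start = i
        return set(spans)

    gold = [("John", "PER"), ("studied", "O"), ("at", "O"), ("the", "O"),
            ("Johns", "ORG"), ("Hopkins", "ORG"), ("University", "ORG"),
            ("before", "O"), ("joining", "O"), ("IBM", "ORG")]
    pred = [("John", "PER"), ("studied", "O"), ("at", "O"), ("the", "O"),
            ("Johns", "PER"), ("Hopkins", "PER"), ("University", "ORG"),
            ("before", "O"), ("joining", "O"), ("IBM", "ORG")]

    tp = entities(gold) & entities(pred)
    print("TP:", len(tp), "FP:", len(entities(pred)) - len(tp), "FN:", len(entities(gold)) - len(tp))
    ```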


  • 7/30/2019 Machine Learning Tutorial

    28/33

    ROC: Receiver Operating Characteristic curve
    Curve that depicts the relation between recall (sensitivity) and the false positive rate.

    [Figure: ROC curves; y-axis: Sensitivity (Recall), x-axis: False Positives FP / (FP+TN); the best-case and worst-case curves are indicated.]


  • 7/30/2019 Machine Learning Tutorial

    29/33

    AUC: area under the (ROC) curve

    As you vary the decision threshold, you can plot the recall vs. the false positive rate.
    The area under the curve measures how accurately your model separates positives from negatives:
      perfect ranking: AUC = 1.0
      random decision: AUC = 0.5

    Similarly (e.g. in IR): the area under the P/R curve, used when there are too many true negatives, since correctly identifying negatives is not interesting anyway.
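    A minimal sketch (not from the slides) that computes AUC via its ranking interpretation, with made-up scores:

    ```python
    # AUC as the probability that a randomly drawn positive is ranked above a
    # randomly drawn negative (ties count 0.5).
    def auc(scores, labels):
        pos = [s for s, l in zip(scores, labels) if l == 1]
        neg = [s for s, l in zip(scores, labels) if l == 0]
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    print(auc([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))   # imperfect ranking -> AUC < 1.0
    ```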


  • 7/30/2019 Machine Learning Tutorial

    30/33

    Precision@K
      the number of true positives in the top K predictions / ranks

    MAP (mean average precision)
      the average of the precisions computed at the position of each of the positives in the ranked list (P = 0 for positives not ranked at all)

    (N)DCG, for graded relevance / ranking
      highly relevant documents appearing lower in a search result list are penalized, as the graded relevance value is reduced logarithmically, proportionally to the position of the result
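    A sketch of average precision for a single ranked list (plain Python, made-up relevance judgments); MAP is the mean of this value over queries:

    ```python
    def average_precision(ranked_relevance, n_relevant_total):
        """ranked_relevance: 1/0 per rank; positives missing from the ranking contribute P = 0."""
        score, hits = 0.0, 0
        for k, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                score += hits / k          # precision at the position of this positive
        return score / n_relevant_total

    print(average_precision([1, 0, 1, 0, 0], n_relevant_total=3))   # third positive not retrieved
    ```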


  • 7/30/2019 Machine Learning Tutorial

    31/33

    The learning curve measures how the accuracy / error of the model changes with
      sample size
      iteration number

    A smaller sample means worse accuracy:
      a more likely bias in the estimate (is the sample representative?)
      more variance in the estimate

    If it looks different, check whether:
      you are plotting error vs. size/iteration
      you are overfitting (iteration, not sample size)!


  • 7/30/2019 Machine Learning Tutorial

    32/33

    Learning curves with a varying amount of training data (Banko & Brill, 2001); learners compared:
      Winnow
      naive Bayes
      memory-based learner

    Features: bag of words
      words within a window of the target word
      collocations containing specific words and/or parts of speech

    Training corpus: 1 billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)


  • 7/30/2019 Machine Learning Tutorial

    33/33

    Supervised learning: based on a set of labeled examples (x, f(x)), learn the input-output mapping, i.e. f(x)

    3 factors of successful machine learning models:
      much data
      good features
      a well-suited learning algorithm

    ML workflow:
      1. problem definition
      2. data collection and representation (features), train / validation / test splits
      3. selection of the learning algorithm, (hyper)parameter tuning, training a final model
      4. predicting unseen examples & filling tables / drawing figures for the paper (test)

    Be careful with:
      data representation (i.i.d., comparability, ...)
      experimental setup (cross-validation, blind testing, ...)
      data size and algorithm selection (overfitting, underfitting, ...)
      evaluation measures
