Machine Learning Tutorial


  • 7/30/2019 Machine Learning Tutorial

    1/33

    CB GS REC

    Machine Learning basic concepts

    Machine Learning Tutorial for the UKP Lab

    SS 2011 | Computer Science Department | UKP Lab | György Szarvas

  • 7/30/2019 Machine Learning Tutorial

    2/33

    This ppt includes some slides / slide parts / text taken from online materials created by the following authors:

    - Greg Grudic
    - Alexander Vezhnevets
    - Hal Daumé III

  • 7/30/2019 Machine Learning Tutorial

    3/33

    The goal of machine learning is to build computer

    systems that can adapt and learn from their experience.

    Tom Dietterich


  • 7/30/2019 Machine Learning Tutorial

    4/33

    A system transforms inputs into outputs, possibly via hidden (internal) variables:

    Input variables:  x = (x_1, x_2, ..., x_N)
    Hidden variables: h = (h_1, h_2, ..., h_M)
    Output variables: y = (y_1, y_2, ..., y_K)


  • 7/30/2019 Machine Learning Tutorial

    5/33

    When the relationships between all system variables

    (input, output, and hidden) are completely understood!

    This is NOT the case for almost any real system!


  • 7/30/2019 Machine Learning Tutorial

    6/33


    Supervised Learning

    Unsupervised Learning


  • 7/30/2019 Machine Learning Tutorial

    7/33

    Given: training examples (x_1, f(x_1)), (x_2, f(x_2)), ..., (x_P, f(x_P)) for some unknown function f.

    Find: a good approximation of f.

    Predict: y = f(x'), where x' is not in the training set.
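    To make this setting concrete, here is a minimal sketch (not from the original slides), assuming scikit-learn and using the bundled iris data and a decision tree purely as stand-ins:

    ```python
    # Minimal supervised-learning sketch: fit an approximation of f on training
    # pairs (x, f(x)) and predict labels for x' outside the training set.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)                  # examples x and labels f(x)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    print(model.predict(X_test[:5]))                   # y = f(x') for unseen x'
    print("held-out accuracy:", model.score(X_test, y_test))
    ```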


  • 7/30/2019 Machine Learning Tutorial

    8/33

    Definition: A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

    Learned hypothesis: model of problem/task T
    Model quality: accuracy/performance measured by P


  • 7/30/2019 Machine Learning Tutorial

    9/33

    Data: experience E in the form of examples / instances

    characteristic of the whole input space

    independent and identically distributed (no bias in selection / observations)

    Good example: 1000 abstracts chosen randomly out of 20M PubMed entries (abstracts): probably i.i.d.

    representative? if annotation is involved it is always a question of compromises

    Definitely bad example: all abstracts that have John Smith as an author

    Instances have to be comparable to each other


  • 7/30/2019 Machine Learning Tutorial

    10/33

    Example: a set of queries and a set of top retrieved documents (characterized via tf, idf, tf*idf, PRank, BM25 scores) for each query

    the top retrieved set is dependent on the underlying IR system!
      issues with representativeness, but for reranking this is fine

    the characterization is dependent on the query (exc. PRank), i.e. only certain pairs (for the same Q) are meaningfully comparable (cf. independent examples for the same Q)
      we have to normalize the features per query to have the same mean/variance (see the sketch below)
      we have to form pairs and compare e.g. the difference of the feature values

    Toy example: Q = learning,    rank 1: tf = 15, rank 100: tf = 2
                 Q = overfitting, rank 1: tf = 2,  rank 10:  tf = 0
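    A small sketch (not part of the slides) of the per-query normalization mentioned above; `normalize_per_query` is a hypothetical helper, and the tf values come from the toy example:

    ```python
    import numpy as np

    def normalize_per_query(features, query_ids):
        """Z-score normalize each feature column within every query group."""
        features = np.asarray(features, dtype=float)
        query_ids = np.asarray(query_ids)
        out = np.empty_like(features)
        for q in np.unique(query_ids):
            mask = query_ids == q
            block = features[mask]
            mean = block.mean(axis=0)
            std = block.std(axis=0)
            std[std == 0.0] = 1.0          # avoid division by zero for constant features
            out[mask] = (block - mean) / std
        return out

    # Toy data mirroring the slide: tf values for two queries at different ranks.
    tf = [[15.0], [2.0], [2.0], [0.0]]
    qid = ["learning", "learning", "overfitting", "overfitting"]
    print(normalize_per_query(tf, qid))
    ```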


  • 7/30/2019 Machine Learning Tutorial

    11/33

    The available examples (experience) have to be described to the algorithm in a consumable format. Here: examples are represented as vectors of pre-defined features.

    E.g. for credit risk assessment, typical features can be: income range, city of residence, etc.

    Common feature types:
      binary (criminal record, Y/N)
      nominal (city of residence)
      ordinal (income range: 0-10K, 10-20K, ...)
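    As an illustration (not from the slides) of turning such features into a vector representation, here is a plain-Python sketch; the feature names and values are made-up credit-risk examples, and `to_vector` is a hypothetical helper:

    ```python
    # Hypothetical credit-risk examples; feature names are illustrative assumptions.
    examples = [
        {"income_range": 1, "city": "Darmstadt", "criminal_record": False},
        {"income_range": 3, "city": "Berlin",    "criminal_record": True},
    ]

    cities = sorted({e["city"] for e in examples})   # vocabulary for the nominal feature

    def to_vector(e):
        """Encode ordinal as integer, binary as 0/1, nominal as one-hot."""
        one_hot = [1.0 if e["city"] == c else 0.0 for c in cities]
        return [float(e["income_range"]), float(e["criminal_record"])] + one_hot

    print([to_vector(e) for e in examples])
    ```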


  • 7/30/2019 Machine Learning Tutorial

    12/33

    CB GS REC

    Experimental practice

    by now you've learned what machine learning is; in the supervised approach you need (carefully selected / prepared) examples that you describe through features;

    the algorithm then learns a model of the problem based on the examples (usually, improvement is observed in terms of some performance measure)

    June 10, 2011

  • 7/30/2019 Machine Learning Tutorial

    13/33

    There are 2 kinds of parameters:

    one that the user sets for the training procedure in advance (hyperparameter)
      the degree of the polynomial to fit in regression
      the number/size of hidden layers in a neural network
      the number of instances per leaf in a decision tree

    one that actually gets optimized through the training (parameter)
      regression coefficients
      network weights
      size/depth of the decision tree (in Weka; other implementations might allow you to control that)

    we usually do not talk about the latter, but refer to hyperparameters as parameters

    Hyperparameters: the fewer the algorithm has, the better
      Naive Bayes the best? No parameters! Usually algorithms with better discriminative power are not parameter-free
      typically they are set to optimize performance (on a validation set, or through cross-validation)
      manual search, grid search, simulated annealing, gradient descent, etc.

    Common pitfall: selecting the hyperparameters via CV (e.g. 10-fold) and then reporting the same cross-validation results (a sketch of a safer setup follows below)
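    A sketch of the safer setup (assuming scikit-learn; the dataset, learner and parameter grid are arbitrary stand-ins): the hyperparameter is selected by 10-fold CV on the training part only, and the reported figure comes from a separate blind test set.

    ```python
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    search = GridSearchCV(
        DecisionTreeClassifier(random_state=0),
        param_grid={"min_samples_leaf": [1, 2, 5, 10, 20]},  # hyperparameter grid
        cv=10,                                               # 10-fold CV on the training part only
    )
    search.fit(X_train, y_train)
    print("best hyperparameter:", search.best_params_)
    print("CV score (used for selection, do NOT report as final):", search.best_score_)
    print("blind test accuracy:", search.score(X_test, y_test))
    ```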

  • 7/30/2019 Machine Learning Tutorial

    14/33

    n-fold cross-validation: split the data X = {x_1, ..., x_k} into n equal parts (here X_1, ..., X_5); in each iteration one part is used as the Test set and the remaining parts as the Train set. The result is an average over all iterations.

    [Figure: the five folds X_1 ... X_5; each fold serves once as the test set while the others form the training set.]
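    A minimal sketch (not from the slides) of the procedure in the figure, assuming scikit-learn; the data and learner are stand-ins:

    ```python
    # Sketch of n-fold cross-validation (n = 5): average the per-fold test scores.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
        scores.append(model.score(X[test_idx], y[test_idx]))   # evaluate on the held-out fold
    print("per-fold accuracy:", scores)
    print("cross-validated accuracy:", sum(scores) / len(scores))
    ```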

  • 7/30/2019 Machine Learning Tutorial

    15/33

    -

    n-fold CV: common practice for making hyperparameter estimation more robust
      round-robin training/testing n times, with (n-1)/n of the data to train and 1/n of the data to evaluate the model
      typical: random splits, without replacement (each instance is tested exactly once)
      the other way: random subsampling cross-validation

    "No Unbiased Estimator of the Variance of K-Fold Cross-Validation" (Bengio and Grandvalet 2004)
      bad practice? problem: the training sets largely overlap, and the test errors are also dependent (treat such estimates with caution)
      5x2 CV is a better option: do 2-fold CV and repeat it 5 times, then calculate the average: less overlap in the training sets

    Folding via natural units of processing for the given task
      typically, document boundaries; best practice is doing it yourself!
      an ML package / CSV representation is not aware of e.g. document boundaries!
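    Folding along natural units can be sketched as follows (not from the slides); it assumes scikit-learn's GroupKFold and uses made-up document ids:

    ```python
    # Sketch: fold along natural units (here: hypothetical document ids), so that
    # all instances of one document end up in the same fold.
    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.arange(20).reshape(10, 2)                     # 10 toy instances
    y = np.array([0, 1] * 5)
    doc_ids = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])   # document boundary information

    for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups=doc_ids):
        # No document id appears in both the training and the test fold.
        print("test documents:", sorted(set(doc_ids[test_idx])))
    ```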

  • 7/30/2019 Machine Learning Tutorial

    16/33

    -

    Ideally, the valid settings are:

    take off-the-shelf algorithms, avoid parameter tuning, and compare (e.g. via cross-validation)
      n.b. you probably do the folding yourself, trying to minimize biases!

    do parameter tuning (n.b. selecting/tuning your features is also tuning!)
      but then you normally have to have a blind set (from the beginning)

    e.g. have a look at shared tasks, e.g. CoNLL: a practical way to learn experimental best practice, to align with the predefined standards (you might even benefit from comparative results, etc.)

    You might want to do something different
      be aware of these & the consequences


  • 7/30/2019 Machine Learning Tutorial

    17/33

    1. define the task
       instance, target variable/labels, collect and label/annotate data
       credit risk assessment: instance = a credit request, label = good / bad credit (e.g. the loan ran out in the previous year)

    2. split the data into training / validation (development) / test (evaluation) data

    3. pick a learning algorithm (e.g. decision tree), train the model
       train on the training set
       optimize/set the model hyperparameters (e.g. number of instances per leaf, use of pruning, ...) according to performance on the validation data
       test the model accuracy on the (blind) test set

    4. ready to use the model to predict unseen instances, with an expected accuracy similar to that seen on the test set
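    The four steps can be sketched like this (assuming scikit-learn; the dataset, learner and tuned hyperparameter are illustrative stand-ins, not the credit-risk task itself):

    ```python
    # Sketch of steps 2-4: split the data, tune one hyperparameter on the validation
    # part, then estimate the expected accuracy on the blind test part.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    best_model, best_val = None, -1.0
    for min_leaf in (1, 5, 10, 20):                    # step 3: hyperparameter candidates
        model = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0).fit(X_train, y_train)
        val_acc = model.score(X_val, y_val)            # selection on validation data
        if val_acc > best_val:
            best_model, best_val = model, val_acc

    print("expected accuracy (blind test):", best_model.score(X_test, y_test))   # step 4
    ```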


  • 7/30/2019 Machine Learning Tutorial

    18/33

    Relation: segment
    Instances: 1500
    Attributes: 20

    Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
    Correctly Classified Instances    290    96.6667 %
    Incorrectly Classified Instances   10     3.3333 %

    Scheme: weka.classifiers.trees.J48 -C 0.25 -M 12
    Correctly Classified Instances    281    93.6667 %
    Incorrectly Classified Instances   19     6.3333 %
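    The effect of the minimum-instances-per-leaf setting can be reproduced roughly in scikit-learn (a sketch, not the original Weka run; the UCI 'segment' data is not bundled, so the wine data serves as a stand-in, and min_samples_leaf plays the role of J48's -M option):

    ```python
    from sklearn.datasets import load_wine
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_wine(return_X_y=True)
    for min_leaf in (2, 12):
        tree = DecisionTreeClassifier(min_samples_leaf=min_leaf, random_state=0)
        acc = cross_val_score(tree, X, y, cv=10).mean()        # 10-fold CV accuracy
        print(f"min instances per leaf = {min_leaf}: accuracy = {acc:.3f}")
    ```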


  • 7/30/2019 Machine Learning Tutorial

    19/33

    Fitting a polynomial regression:

    y(x) = sum_{n=0}^{M} a_n x^n

    By, for instance, least squares:

    a* = argmin_a sum_l ( sum_{n=0}^{M} a_n x_l^n - y_l )^2

    [Figure: polynomial fits of degree M = 0, 1, 3, 9 to the same data points (x in [0.0, 1.0], target t).]
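    A quick numerical sketch of the effect of the degree M (not from the slides; the underlying curve and noise level are assumptions):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0.0, 1.0, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)        # noisy training targets
    x_new = rng.uniform(0.0, 1.0, 100)                                    # unseen inputs
    t_new = np.sin(2 * np.pi * x_new) + rng.normal(scale=0.2, size=x_new.size)

    for M in (0, 1, 3, 9):
        coeffs = np.polyfit(x, t, deg=M)                                  # least-squares fit
        train_mse = np.mean((np.polyval(coeffs, x) - t) ** 2)
        test_mse = np.mean((np.polyval(coeffs, x_new) - t_new) ** 2)
        print(f"M = {M}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
    ```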


  • 7/30/2019 Machine Learning Tutorial

    20/33

    Important concept: the discriminative power of the algorithm
      linear vs. nonlinear model

    some theoretical aspects:
      a 1-hidden-layer NN with unlimited hidden nodes can approximate any smooth function/surface arbitrarily well


  • 7/30/2019 Machine Learning Tutorial

    21/33

    Overfitting: the model fits the training data too closely, has no (bad) generalization ability, and results in a high test error (useless model)

    Underfitting: the model is not capable of learning the (complex) patterns in the training set

    Reasons for underfitting and overfitting:
      lack of discriminative power
      small sample size
      noise in the data (labels or features)

    The generalization ability of the algorithm has to be chosen w.r.t. the sample size
    The size (complexity) of the learnt model grows with the data size


  • 7/30/2019 Machine Learning Tutorial

    22/33

    TP: p classified as p
    FP: n classified as p
    TN: n classified as n
    FN: p classified as n

    Good prediction: TP + TN
    Error: FP (false alarm) + FN (miss)


  • 7/30/2019 Machine Learning Tutorial

    23/33

    Accuracy: the rate of correct predictions made by the model over a data set (cf. coverage): (TP+TN) / (TP+FN+FP+TN)

    Error rate: the rate of incorrect predictions made by the model over a data set: (FP+FN) / (TP+FN+FP+TN)

    [Root]?[Mean|Absolute][Squared]?Error: the difference between the predicted and actual values, e.g. RMSE = sqrt( (1/n) * sum_x (f(x) - y)^2 )

    Algorithms (e.g. those in Weka) typically optimize these; there might be a mismatch between the optimization objective and the actual evaluation measure. Optimizing other measures directly is research on its own (e.g. in ML for IR, a.k.a. learning to rank).
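    For completeness, a tiny plain-Python sketch of the two kinds of measures (illustrative numbers only):

    ```python
    import math

    def accuracy(tp, fp, tn, fn):
        return (tp + tn) / (tp + fp + tn + fn)

    def rmse(predictions, targets):
        return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))

    print(accuracy(tp=40, fp=5, tn=50, fn=5))        # made-up confusion counts
    print(rmse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))   # made-up regression outputs
    ```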


  • 7/30/2019 Machine Learning Tutorial

    24/33

    Precision: fraction of correctly predicted positives among all predicted positives: TP / (TP+FP)

    Recall: fraction of correctly predicted positives among all actual positives: TP / (TP+FN)

    F measure: weighted harmonic mean of Precision and Recall (usually equally weighted, beta = 1):
      F_beta = (1 + beta^2) * precision * recall / (beta^2 * precision + recall)

    Only makes sense for a subset of classes (usually measured for a single class).
    For all classes together, it equals the accuracy.
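    A small sketch of these formulas in plain Python (not from the slides); the example counts correspond to the PER entities of the NER example on the following slides:

    ```python
    def precision_recall_f(tp, fp, fn, beta=1.0):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall == 0.0:
            return precision, recall, 0.0
        f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
        return precision, recall, f_beta

    print(precision_recall_f(tp=1, fp=1, fn=0))   # -> P = 0.5, R = 1.0, F = 0.67
    ```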


  • 7/30/2019 Machine Learning Tutorial

    25/33

    Phrase-level evaluation, e.g. in NER (named entity recognition): a sequence of tokens with the same label is treated as a single instance.

    Gold: John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG .

    Why? We need complete phrases to be identified correctly.
    How? With an external evaluation script, e.g. conlleval for NER.

    Example tagging: John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG .

    Multiple penalty:
      2 FPs: Johns Hopkins (PER) and University (ORG)
      1 FN: Johns Hopkins University (ORG)
      F(PER) = 0.67, F(ORG) = 0.5


  • 7/30/2019 Machine Learning Tutorial

    26/33

    1. The real-world evaluation function, e.g. time saved, lives saved, hopes of tenure saved, etc. We rarely have any access to this function.

    2. The human-evaluation function. Typical examples are fluency/adequacy judgments, relevance judgments, etc. These require humans in the loop.

    3. Automatic correlation-driving functions. Typical examples are Bleu, Rouge, word error rate, mean average precision. These require humans at the front of the loop, but after that are cheap and quick. Typically some effort has been put into showing correlation between these and something higher up.

    4. Automatic intuition-driven functions. Typical examples are accuracy (for anything), f-score (for parsing, chunking and named-entity recognition), alignment error rate (for word alignment) and perplexity (for language modeling). These also require humans at the front of the loop, but differ from (3) in that they are not actually compared with higher-up tasks.

    Measures can become dysfunctional when you are optimizing them!
      phrase P/R/F, e.g. in NER
      readability measures


  • 7/30/2019 Machine Learning Tutorial

    27/33

    Phrase-level evaluation, continued.

    Gold: John_PER studied_O at_O the_O Johns_ORG Hopkins_ORG University_ORG before_O joining_O IBM_ORG .
      3 gold positives: John (PER), Johns Hopkins University (ORG), IBM (ORG)

    Example tagging 1: John_PER studied_O at_O the_O Johns_PER Hopkins_PER University_ORG before_O joining_O IBM_ORG .
      2 FPs: Johns Hopkins (PER) and University (ORG)
      F(PER) = 0.67, F(ORG) = 0.5

    Example tagging 2: John_PER studied_O at_O the_O Johns_O Hopkins_O University_O before_O joining_O IBM_ORG .
      0 FP
      1 FN: Johns Hopkins University (ORG)
      F(PER) = 1.0, F(ORG) = 0.67

    Optimizing phrase-F can encourage / prefer systems that do not mark entities!
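    A sketch of the phrase-level matching logic (not the actual conlleval script; `entities` is a hypothetical helper), applied to example tagging 1 above:

    ```python
    # A predicted entity counts as TP only if both its span and its label match
    # a gold entity exactly.
    def entities(tokens):
        """Collapse consecutive tokens with the same non-O label into (start, end, label) spans."""
        spans, start = [], None
        for i, (_, label) in enumerate(tokens + [("", "O")]):
            if start is not None and (label == "O" or label != tokens[start][1]):
                spans.append((start, i, tokens[start][1]))
                start = None
            if label != "O" and start is None:
                start = i
        return set(spans)

    gold = [("John", "PER"), ("studied", "O"), ("at", "O"), ("the", "O"),
            ("Johns", "ORG"), ("Hopkins", "ORG"), ("University", "ORG"),
            ("before", "O"), ("joining", "O"), ("IBM", "ORG")]
    pred = [("John", "PER"), ("studied", "O"), ("at", "O"), ("the", "O"),
            ("Johns", "PER"), ("Hopkins", "PER"), ("University", "ORG"),
            ("before", "O"), ("joining", "O"), ("IBM", "ORG")]

    tp = entities(gold) & entities(pred)
    print("TP:", len(tp), "FP:", len(entities(pred)) - len(tp), "FN:", len(entities(gold)) - len(tp))
    ```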


  • 7/30/2019 Machine Learning Tutorial

    28/33

    ROC: Receiver Operating Characteristic curve
    Curve that depicts the relation between recall (sensitivity) and the false positive rate.

    [Figure: ROC curves; y-axis: Sensitivity (Recall), x-axis: False Positives FP / (FP+TN); the best-case and worst-case curves are indicated.]


  • 7/30/2019 Machine Learning Tutorial

    29/33

    AUC: area under the (ROC) curve

    As you vary the decision threshold, you can plot the recall vs. the false positive rate.
    The area under the curve measures how accurately your model separates positives from negatives:
      perfect ranking: AUC = 1.0
      random decision: AUC = 0.5

    Similarly (e.g. in IR): the area under the P/R curve, used when there are too many true negatives, since correctly identifying negatives is not interesting anyway.
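    A minimal sketch (not from the slides) that computes AUC via its ranking interpretation, with made-up scores:

    ```python
    # AUC as the probability that a randomly drawn positive is ranked above a
    # randomly drawn negative (ties count 0.5).
    def auc(scores, labels):
        pos = [s for s, l in zip(scores, labels) if l == 1]
        neg = [s for s, l in zip(scores, labels) if l == 0]
        wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    print(auc([0.9, 0.8, 0.6, 0.4, 0.2], [1, 1, 0, 1, 0]))   # imperfect ranking -> AUC < 1.0
    ```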


  • 7/30/2019 Machine Learning Tutorial

    30/33

    Precision@K
      the number of true positives in the top K predictions / ranks

    MAP (mean average precision)
      the average of the precisions computed at the position of each of the positives in the ranked list (P = 0 for positives not ranked at all)

    (N)DCG, for graded relevance / ranking
      highly relevant documents appearing lower in a search result list are penalized, as the graded relevance value is reduced logarithmically, proportionally to the position of the result
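    A sketch of average precision for a single ranked list (plain Python, made-up relevance judgments); MAP is the mean of this value over queries:

    ```python
    def average_precision(ranked_relevance, n_relevant_total):
        """ranked_relevance: 1/0 per rank; positives missing from the ranking contribute P = 0."""
        score, hits = 0.0, 0
        for k, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                score += hits / k          # precision at the position of this positive
        return score / n_relevant_total

    print(average_precision([1, 0, 1, 0, 0], n_relevant_total=3))   # third positive not retrieved
    ```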


  • 7/30/2019 Machine Learning Tutorial

    31/33

    The learning curve measures how the accuracy / error of the model changes with
      sample size
      iteration number

    A smaller sample means worse accuracy:
      a more likely bias in the estimate (is the sample representative?)
      more variance in the estimate

    If it looks different, check whether:
      you are plotting error vs. size/iteration
      you are overfitting (iteration, not sample size)!


  • 7/30/2019 Machine Learning Tutorial

    32/33

    Learning curves with a varying amount of training data (Banko & Brill, 2001); learners compared:
      Winnow
      naive Bayes
      memory-based learner

    Features: bag of words
      words within a window of the target word
      collocations containing specific words and/or parts of speech

    Training corpus: 1 billion words from a variety of English texts (news articles, literature, scientific abstracts, etc.)


  • 7/30/2019 Machine Learning Tutorial

    33/33

    Supervised learning: based on a set of labeled examples (x, f(x)), learn the input-output mapping, i.e. f(x)

    3 factors of successful machine learning models:
      much data
      good features
      a well-suited learning algorithm

    ML workflow:
      1. problem definition
      2. data collection and representation (features), train / validation / test splits
      3. selection of the learning algorithm, (hyper)parameter tuning, training a final model
      4. predicting unseen examples & filling tables / drawing figures for the paper (test)

    Be careful with:
      data representation (i.i.d., comparability, ...)
      experimental setup (cross-validation, blind testing, ...)
      data size and algorithm selection (overfitting, underfitting, ...)
      evaluation measures
