
WISS 2015 - Machine Learning lecture by Ludovic Samper


Page 1: WISS 2015 - Machine Learning lecture by Ludovic Samper

Machine Learning

Ludovic Samper

Antidot

September 1st, 2015

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 1 / 77

Page 2: WISS 2015 - Machine Learning lecture by Ludovic Samper

Antidot

Software vendor since 1999

Paris, Lyon, Aix-en-Provence

45 employees

Founders: Fabrice Lacroix (CEO), Stephane Loesel (CTO), Jerome Mainka (Chief Scientist Officer)

Software products and solutions

Antidot Finder Suite (AFS) search engine

Antidot Information Factory (AIF) a pipe & filters framework

SaaS, Hosted License, On-site License

50% of the revenue invested in R&D

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 2 / 77

Page 3: WISS 2015 - Machine Learning lecture by Ludovic Samper

Antidot

Machine Learning

Automatic text document classification

Named Entity Extraction

Compound Splitter (for german words)

Clustering algorithm (for news aggregation)

Open Data, Semantic Web

http://www.rechercheisidore.fr/ Social Sciences and Humanities research platform, enriched with open resources

https://github.com/antidot/db2triples/ open source library to export a database to RDF

Antidot is a Partner organization in WDAqua project

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 3 / 77

Page 4: WISS 2015 - Machine Learning lecture by Ludovic Samper

Tutorial

Study a classical task in Machine Learning: text classification

Show scikit-learn (scikit-learn.org), a Python machine learning library

Follow the “Working with text data” tutorial: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

Additional material on http://blog.antidot.net/

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 4 / 77

Page 5: WISS 2015 - Machine Learning lecture by Ludovic Samper

Summary of the tutorial

1 Problem definition
    Supervised classification
    Evaluation metrics

2 Extracting features from text files
    Bag of words model
    Term frequency inverse document frequency (tfidf)

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion
    Methodology

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 5 / 77

Page 6: WISS 2015 - Machine Learning lecture by Ludovic Samper

Outline

1 Problem definition
    Supervised classification
    Evaluation metrics

2 Extracting features from text files

3 Algorithms for classification

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 6 / 77

Page 7: WISS 2015 - Machine Learning lecture by Ludovic Samper

20 newsgroups dataset

http://qwone.com/~jason/20Newsgroups/

20 newsgroups

Documents from 20 Usenet newsgroups, collected in the 1990s

The label is the newsgroup the document belongs to

A popular collection

18846 documents : 11314 in train, 7532 in test

wiss-ml.ipynb#The-20-newsgroups-dataset
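
A minimal sketch of loading the dataset with scikit-learn (the loader downloads the data on first use; variable names are illustrative):

    from sklearn.datasets import fetch_20newsgroups

    # the "train" and "test" subsets correspond to the 11314 / 7532 split above
    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')
    print(len(train.data), len(test.data))   # 11314 7532
    print(train.target_names)                # the 20 newsgroup labels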

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 7 / 77

Page 8: WISS 2015 - Machine Learning lecture by Ludovic Samper

Classification

Problem statement

One label per document

Automatically determine the label of an unseen document, given a set of documents and their labels

A supervised classification problem

Training

Set of documents and their labels

Build a model

Inference

Given a new document, use the model to predict its label

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 8 / 77

Page 9: WISS 2015 - Machine Learning lecture by Ludovic Samper

Precision and Recall I

Binary classification

                   e ∈ C                  e ∉ C
Labeled C          TP (True Positive)     FP (False Positive)
Not labeled C      FN (False Negative)    TN (True Negative)

Precision

Precision = TP / (TP + FP) = Proba(e ∈ C | e labeled C)

Recall

Recall = TP / (TP + FN) = Proba(e labeled C | e ∈ C)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 9 / 77

Page 10: WISS 2015 - Machine Learning lecture by Ludovic Samper

Precision and Recall II

F1

F1 = 2 × P × R / (P + R)

Harmonic mean of Precision and Recall

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 10 / 77

Page 11: WISS 2015 - Machine Learning lecture by Ludovic Samper

Multiclass I

N_C = number of classes

Macro Average

B_macro = (1 / N_C) Σ_{k=1}^{N_C} B_binary(TP_k, FP_k, TN_k, FN_k)

Average measure by class. Large classes count as much as small ones.

Micro Average

B_micro = B_binary(Σ_{k=1}^{N_C} TP_k, Σ_{k=1}^{N_C} FP_k, Σ_{k=1}^{N_C} TN_k, Σ_{k=1}^{N_C} FN_k)

Average measure by instance

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 11 / 77

Page 12: WISS 2015 - Machine Learning lecture by Ludovic Samper

Multiclass II

Micro average in single-label multiclass

Each misclassified document counts once as a false negative (for its true class) and once as a false positive (for the predicted class), so

Σ_{k=1}^{N_C} FN_k = Σ_{k=1}^{N_C} FP_k

and each document receives exactly one predicted label, so Σ_{k=1}^{N_C} (TP_k + FP_k) = Nbdoc.

Then,

Precision_micro = Recall_micro = Accuracy = (Σ_{k=1}^{N_C} TP_k) / Nbdoc
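
A small sketch of these averages with sklearn.metrics; the toy labels are hypothetical, chosen only to show that the micro-averaged scores coincide with accuracy in the single-label case:

    from sklearn import metrics

    y_true = [0, 0, 1, 1, 2, 2]   # hypothetical gold labels (3 classes)
    y_pred = [0, 1, 1, 1, 2, 0]   # hypothetical predictions

    print(metrics.precision_recall_fscore_support(y_true, y_pred, average='macro'))
    print(metrics.precision_recall_fscore_support(y_true, y_pred, average='micro'))
    print(metrics.accuracy_score(y_true, y_pred))  # equals the micro-averaged scores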

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 12 / 77

Page 13: WISS 2015 - Machine Learning lecture by Ludovic Samper

Outline

1 Problem definition

2 Extracting features from text files
    Bag of words model
    Term frequency inverse document frequency (tfidf)

3 Algorithms for classification

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 13 / 77

Page 14: WISS 2015 - Machine Learning lecture by Ludovic Samper

Bag of words

From text to features

Count the number of occurrences of words in text

“bag” because position isn’t taken into account

Extensions

Remove stop words

Remove too frequent words (max_df)

lowercase

N-grams (ngram_range): tokenize n-grams instead of single words. Useful to take word positions into account

wiss-ml.ipynb#Bag-of-words
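
A minimal sketch of the bag-of-words step with CountVectorizer; the two-sentence corpus and the parameter values are purely illustrative:

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["The cat sat on the mat.", "The dog sat on the log."]  # toy corpus
    vec = CountVectorizer(lowercase=True, stop_words='english',
                          max_df=1.0, ngram_range=(1, 1))
    X = vec.fit_transform(docs)   # sparse document-term matrix of counts
    print(vec.vocabulary_)        # term -> column index
    print(X.toarray())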

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 14 / 77

Page 15: WISS 2015 - Machine Learning lecture by Ludovic Samper

Term frequency inverse document frequency (tfidf) I

Intuition

Take into account the relative importance of each word with respect to the whole dataset. If a word occurs in every document, it doesn't carry any information

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 15 / 77

Page 16: WISS 2015 - Machine Learning lecture by Ludovic Samper

Term frequency inverse document frequency (tfidf) II

Definition

Term frequency × inverse document frequency

tfidf(w, d) = tf(w, d) × idf(w)

tf(w, d) = term frequency of word w in document d

idf(w) = log(N_doc / doc_freq(w))

In scikit-learn:

tfidf(w, d) = tf(w, d) × (idf(w) + 1)

so terms that occur in all documents (idf = 0) are not completely ignored

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 16 / 77

Page 17: WISS 2015 - Machine Learning lecture by Ludovic Samper

Term frequency inverse document frequency (tfidf) III

Options

Normalisation: ||doc|| = 1. E.g., for the L2 norm, Σ_{w ∈ d} tfidf(w, d)² = 1

Smoothing: add one to document frequencies, as if an extra document contained every term in the collection exactly once:

idf(w) = log((N_doc + 1) / (doc_freq(w) + 1))

Example

Show the most significant words of a document: wiss-ml.ipynb#Tfidf
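
A sketch of the same idea with TfidfVectorizer; the corpus is a toy example, and the “most significant words” are simply the terms with the largest tf-idf weight in the chosen document:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog chased the cat",
            "the bird flew away"]              # toy corpus
    vec = TfidfVectorizer(norm='l2', smooth_idf=True)
    X = vec.fit_transform(docs)

    terms = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
    weights = X[0].toarray().ravel()           # tf-idf weights of document 0
    print(terms[weights.argsort()[::-1]][:5])  # top terms of document 0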

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 17 / 77

Page 18: WISS 2015 - Machine Learning lecture by Ludovic Samper

Outline

1 Problem definition

2 Extracting features from text files

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 18 / 77

Page 19: WISS 2015 - Machine Learning lecture by Ludovic Samper

Supervised classification problem I

Notations

x = (x_1, ..., x_n), a feature vector

{(x_d, y_d)}_{0 ≤ d < D}, the training set

∀d, x_d ∈ R^n : x_d is the feature vector of document d, n the dimension of the feature space

∀d, y_d ∈ {1, ..., N_C} : N_C is the number of classes, y_d the class of document d

ŷ : class prediction. For a new vector x, ŷ is the predicted class of x.

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 19 / 77

Page 20: WISS 2015 - Machine Learning lecture by Ludovic Samper

Supervised classification problem II

Goal

Find a function F : R^n → {1, ..., N_C}, x ↦ ŷ

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 20 / 77

Page 21: WISS 2015 - Machine Learning lecture by Ludovic Samper

In 20newsgroups I

Values in 20 newsgroups

n = 130107 features (the number of unique terms)

D = 11314 training samples

N_C = 20 different classes

Goal

Find a function F that given a new document predicts its class

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 21 / 77

Page 22: WISS 2015 - Machine Learning lecture by Ludovic Samper

Naïve Bayes Algorithm I

Bayes’ theorem

P(A | B) = P(B | A) P(A) / P(B)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 22 / 77

Page 23: WISS 2015 - Machine Learning lecture by Ludovic Samper

Naïve Bayes Algorithm II

Posterior probability of class C

P(C | x) = P(x | C) P(C) / P(x)

P(x) does not depend on C, so

P(C | x) ∝ P(x | C) P(C)

Naïve Bayes independence assumption: each feature i is conditionally independent of every other feature j. Then

P(C | x) ∝ P(C) × Π_{i=1}^{n} P(x_i | C)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 23 / 77

Page 24: WISS 2015 - Machine Learning lecture by Ludovic Samper

Naïve Bayes Algorithm III

Classifier from the probability model

ŷ = argmax_{k ∈ {1, ..., N_C}} P(y = k) × Π_{i=1}^{n} P(x_i | y = k)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 24 / 77

Page 25: WISS 2015 - Machine Learning lecture by Ludovic Samper

Parameter estimation in Naïve Bayes' classifier

Prior of a class

P(y = k) = (number of samples in class k) / (total number of samples)

Can also be uniform: P(y = k) = 1 / N_C

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 25 / 77

Page 26: WISS 2015 - Machine Learning lecture by Ludovic Samper

Multinomial Naïve Bayes I

Naïve Bayes

P(x | y = k) = Π_{i=1}^{n} P(x_i | y = k)

Multinomial distribution

The word counts follow a multinomial distribution with parameters (p_1, ..., p_n), where p_i = P(word = i):

P(x_1, ..., x_n) ∝ Π_{i=1}^{n} p_i^{x_i}

with Σ_i p_i = 1 (the multinomial coefficient is omitted since it does not depend on the class). One distribution for each class y.

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 26 / 77

Page 27: WISS 2015 - Machine Learning lecture by Ludovic Samper

Multinomial Naïve Bayes II

Multinomial Naïve Bayes

One multinomial distribution for each class

P(i | y = k) = (number of occurrences of word i in class k) / (total number of words in class k)
             = (Σ_{d ∈ k} x_{d,i}) / (Σ_{0 ≤ j < n} Σ_{d ∈ k} x_{d,j})

With smoothing,

P(i | y = k) = (Σ_{d ∈ k} x_{d,i} + α) / (Σ_{0 ≤ j < n} Σ_{d ∈ k} x_{d,j} + α n)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 27 / 77

Page 28: WISS 2015 - Machine Learning lecture by Ludovic Samper

Multinomial Naïve Bayes III

Inference in Multinomial Naïve Bayes

ŷ = argmax_k P(y = k | x)
  = argmax_k P(y = k) Π_{0 ≤ i < n} P(i | y = k)^{x_i}
  = argmax_k ( log P(y = k) + Σ_{0 ≤ i < n} x_i log P(i | y = k) )

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 28 / 77

Page 29: WISS 2015 - Machine Learning lecture by Ludovic Samper

Multinomial Naïve Bayes IV

A linear model

In the log space,

(log P(y = k | x))_k ∝ W_0 + Wᵀ x

W_0 is the vector of priors: W_0 = (log P(y = k))_k

W is the matrix of log conditional probabilities: W = (w_ik), i ∈ [1, n], k ∈ [1, N_C], with w_ik = log P(i | y = k)
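
This linear form can be checked on a fitted scikit-learn model: MultinomialNB exposes the log priors and the log conditional probabilities as class_log_prior_ and feature_log_prob_. A minimal sketch, with a purely illustrative two-document corpus:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["good game great match", "bad game poor match"]   # toy corpus
    y = [1, 0]
    X = CountVectorizer().fit_transform(docs)
    nb = MultinomialNB(alpha=1.0).fit(X, y)

    W0 = nb.class_log_prior_            # log P(y = k)
    W = nb.feature_log_prob_            # log P(i | y = k), shape (N_C, n)
    scores = W0 + X.toarray() @ W.T     # proportional to log P(y = k | x)
    print(scores.argmax(axis=1))        # same decisions as nb.predict(X)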

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 29 / 77

Page 30: WISS 2015 - Machine Learning lecture by Ludovic Samper

Multinomial Naïve Bayes V

Example step-by-step

http://www.antidot.net/wiss2015/wiss-ml.html#Naive-Bayes
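
A sketch of the full classification pipeline on 20 newsgroups; the alpha value is an arbitrary illustrative choice, not necessarily the one used in the linked notebook:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn import metrics

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    clf = make_pipeline(TfidfVectorizer(), MultinomialNB(alpha=0.01))
    clf.fit(train.data, train.target)
    pred = clf.predict(test.data)
    print(metrics.accuracy_score(test.target, pred))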

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 30 / 77

Page 31: WISS 2015 - Machine Learning lecture by Ludovic Samper

Outline

1 Problem definition

2 Extracting features from text files

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 31 / 77

Page 32: WISS 2015 - Machine Learning lecture by Ludovic Samper

A linear classifier

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 32 / 77

(Pages 33-36: further figure-only slides, also titled "A linear classifier".)

Page 37: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vector Machine, notations

Problem

S, the training set: {(x_i, y_i), x_i ∈ R^n, y_i ∈ {−1, 1}}, i ∈ 1..D

Find a linear function 〈w, x〉 + b such that:

sign(〈w, x_i〉 + b) = y_i

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 37 / 77

Page 38: WISS 2015 - Machine Learning lecture by Ludovic Samper

SVM, maximum margin classifier

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 38 / 77

Page 39: WISS 2015 - Machine Learning lecture by Ludovic Samper

Margin

Let x+ and x− be points on the two margins, i.e. 〈w, x+〉 + b = 1 and 〈w, x−〉 + b = −1. The margin is their distance along the direction of w:

distance(x+, x−) = 〈w / ||w||, x+ − x−〉
                 = (1 / ||w||) (〈w, x+〉 − 〈w, x−〉)
                 = (1 / ||w||) ((〈w, x+〉 + b) − (〈w, x−〉 + b))
                 = (1 / ||w||) (1 − (−1))
                 = 2 / ||w||

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 39 / 77

Page 40: WISS 2015 - Machine Learning lecture by Ludovic Samper

SVM, maximum margin classifier

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 40 / 77

Page 41: WISS 2015 - Machine Learning lecture by Ludovic Samper

Solving an optimization problem using the Lagrangian

Primal problem

minimize_{w,b} f(w, b)

under the constraints h_i(w, b) ≥ 0

Lagrange function

L(w, b, α) = f(w, b) − Σ_i α_i h_i(w, b)

Let g(α) = inf_{w,b} L(w, b, α).
For all w, b: g(α) ≤ L(w, b, α).
Moreover, for any feasible (w, b) and α_i ≥ 0, L(w, b, α) ≤ f(w, b).
Thus, for all α ≥ 0, g(α) ≤ min_{w,b} f(w, b).
With the Karush-Kuhn-Tucker (KKT) optimality condition,

max_α g(α) = min_{w,b} f(w, b) ⇔ α_i h_i(w, b) = 0

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 41 / 77

Page 42: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vector Machine, problem

Primal problem

minimize_{w,b} ||w||² / 2

under the constraints, ∀ 0 < i ≤ D, y_i (〈w, x_i〉 + b) ≥ 1

Lagrange function

L(w, b, α) = (1/2) ||w||² − Σ_i α_i (y_i (〈w, x_i〉 + b) − 1)

Dual problem: maximize_α min_{w,b} L(w, b, α), with α_i ≥ 0
The optimum in (w, b) is a saddle point with α

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 42 / 77

Page 43: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vector Machine, problem

Derivatives in w, b need to vanish

∂_w L(w, b, α) = w − Σ_i α_i y_i x_i = 0

∂_b L(w, b, α) = Σ_i α_i y_i = 0

Dual problem

maximize_α − (1/2) Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 + Σ_i α_i

under the constraints, Σ_i α_i y_i = 0 and α_i ≥ 0

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 43 / 77

Page 44: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vectors

Support vectors

w = Σ_i y_i α_i x_i

Karush-Kuhn-Tucker (KKT) optimality condition

Lagrange multiplier times constraint equals zero:

α_i (y_i (〈w, x_i〉 + b) − 1) = 0

Thus, either α_i = 0, or α_i > 0 ⇒ y_i (〈w, x_i〉 + b) = 1

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 44 / 77

Page 45: WISS 2015 - Machine Learning lecture by Ludovic Samper

Experiments with separable space

SVMvaryingC.ipynb
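
A possible sketch of such an experiment (not the content of the linked notebook, which is not reproduced here): two well-separated Gaussian blobs, a linear SVC with a large C (nearly hard margin), and a look at the resulting support vectors:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) - 3, rng.randn(20, 2) + 3])  # separable blobs
    y = np.array([-1] * 20 + [1] * 20)

    clf = SVC(kernel='linear', C=1e3)    # large C: (almost) hard margin
    clf.fit(X, y)
    print(clf.support_vectors_)          # the x_i with alpha_i > 0
    print(clf.coef_, clf.intercept_)     # w and b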

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 45 / 77

Page 46: WISS 2015 - Machine Learning lecture by Ludovic Samper

What happens if the space is not separable?

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 46 / 77

Page 47: WISS 2015 - Machine Learning lecture by Ludovic Samper

Adding slack variable

Problem was

minimize_{w,b} ||w||² / 2

with y_i (〈w, x_i〉 + b) ≥ 1

With slack

minimize_{w,b} ||w||² / 2 + C Σ_i ξ_i

with y_i (〈w, x_i〉 + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 47 / 77

Page 48: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vector Machine, without slack

Primal problem

minimize_{w,b} ||w||² / 2

with y_i (〈w, x_i〉 + b) ≥ 1

Lagrange function

L(w, b, α) = (1/2) ||w||² − Σ_i α_i (y_i (〈w, x_i〉 + b) − 1)

Dual problem: maximize_α min_{w,b} L(w, b, α)
The optimum in (w, b) is a saddle point with α

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 48 / 77

Page 49: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vector Machine, with slack

Primal problem

minimize_{w,b} ||w||² / 2 + C Σ_i ξ_i

with y_i (〈w, x_i〉 + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Lagrange function

L(w, b, ξ, α, η) = (1/2) ||w||² + C Σ_i ξ_i − Σ_i α_i (y_i (〈x_i, w〉 + b) + ξ_i − 1) − Σ_i η_i ξ_i

Dual problem: maximize_{α,η} min_{w,b,ξ} L(w, b, ξ, α, η)
The optimum in (w, b, ξ) is a saddle point with (α, η)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 49 / 77

Page 50: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vector Machine, problem

Derivatives in w, b, ξ need to vanish

∂_w L(w, b, ξ, α, η) = w − Σ_i α_i y_i x_i = 0

∂_b L(w, b, ξ, α, η) = Σ_i α_i y_i = 0

∂_{ξ_i} L(w, b, ξ, α, η) = C − α_i − η_i = 0 ⇒ η_i = C − α_i

Dual problem

maximize_α − (1/2) Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 + Σ_i α_i

under the constraints, Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 50 / 77

Page 51: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vectors

Support vectors

w = Σ_i y_i α_i x_i

Karush-Kuhn-Tucker (KKT) optimality condition

Lagrange multiplier times constraint equals zero:

α_i (y_i (〈w, x_i〉 + b) + ξ_i − 1) = 0

η_i ξ_i = 0 ⇔ (C − α_i) ξ_i = 0

Thus,
α_i = 0 ⇒ y_i (〈w, x_i〉 + b) ≥ 1
0 < α_i < C ⇒ y_i (〈w, x_i〉 + b) = 1
α_i = C ⇒ y_i (〈w, x_i〉 + b) ≤ 1

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 51 / 77

Page 52: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vector Machine, Loss functions

Primal problem

minimize_{w,b} ||w||² / 2 + C Σ_i ξ_i

with y_i (〈w, x_i〉 + b) ≥ 1 − ξ_i and ξ_i ≥ 0

With loss function

minimize_{w,b} ||w||² / 2 + C Σ_i max(0, 1 − y_i (〈w, x_i〉 + b))

here, loss(x_i, y_i) = max(0, 1 − y_i (〈w, x_i〉 + b)) = max(0, 1 − y_i f(x_i))

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 52 / 77

Page 53: WISS 2015 - Machine Learning lecture by Ludovic Samper

Support Vector Machine, Common loss functions

Common loss functions

hinge loss (L1 loss): max(0, 1 − y_i (〈w, x_i〉 + b))

squared hinge (L2 loss): max(0, 1 − y_i (〈w, x_i〉 + b))²

logistic loss: log(1 + exp(−y_i (〈w, x_i〉 + b)))

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 53 / 77

Page 54: WISS 2015 - Machine Learning lecture by Ludovic Samper

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 54 / 77

Page 55: WISS 2015 - Machine Learning lecture by Ludovic Samper

Experiments with different values of C

SVMvaryingC.ipynb#Varying-C-parameter

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 55 / 77

Page 56: WISS 2015 - Machine Learning lecture by Ludovic Samper

Non linearly separable data

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 56 / 77

Page 57: WISS 2015 - Machine Learning lecture by Ludovic Samper

Non linearly separable data, Φ(x) = (x, x²)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 57 / 77

Page 58: WISS 2015 - Machine Learning lecture by Ludovic Samper

Non linearly separable data, Φ(x) = (x, x²)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 58 / 77

Page 59: WISS 2015 - Machine Learning lecture by Ludovic Samper

Linear case

Primal Problem

minimize_{w,b} (1/2) ||w||² + C Σ_i ξ_i

subject to, y_i (〈w, x_i〉 + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Dual Problem

maximize_α − (1/2) Σ_{i,j} α_i α_j y_i y_j 〈x_i, x_j〉 + Σ_i α_i

subject to, Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = Σ_i α_i y_i 〈x_i, x〉 + b

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 59 / 77

Page 60: WISS 2015 - Machine Learning lecture by Ludovic Samper

With a transformation Φ : x ↦ Φ(x)

Primal Problem

minimize_{w,b} (1/2) ||w||² + C Σ_i ξ_i

subject to, y_i (〈w, Φ(x_i)〉 + b) ≥ 1 − ξ_i and ξ_i ≥ 0

Dual Problem

maximize_α − (1/2) Σ_{i,j} α_i α_j y_i y_j 〈Φ(x_i), Φ(x_j)〉 + Σ_i α_i

subject to, Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = Σ_i α_i y_i 〈Φ(x_i), Φ(x)〉 + b

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 60 / 77

Page 61: WISS 2015 - Machine Learning lecture by Ludovic Samper

The kernel trick

Kernel function

k(x, x′) = 〈Φ(x), Φ(x′)〉

We just need to compute the dot product in the new space

Dual Problem

maximize_α − (1/2) Σ_{i,j} α_i α_j y_i y_j k(x_i, x_j) + Σ_i α_i

subject to, Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Support vector expansion

f(x) = Σ_i α_i y_i k(x_i, x) + b

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 61 / 77

Page 62: WISS 2015 - Machine Learning lecture by Ludovic Samper

Kernels

Kernel functions

linear: k(x, x′) = 〈x, x′〉

polynomial: k(x, x′) = (γ 〈x, x′〉 + r)^d

rbf: k(x, x′) = exp(−γ ||x − x′||²)

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 62 / 77

Page 63: WISS 2015 - Machine Learning lecture by Ludovic Samper

RBF Kernel implies an infinite-dimensional space

Here we are in dimension 1, x ∈ R:

k(x, x′) = exp(−(x − x′)²)
         = exp(−x²) exp(−x′²) exp(2 x x′)

With the Taylor expansion of exp(2 x x′),

k(x, x′) = exp(−x²) exp(−x′²) Σ_{k=0}^{∞} (2^k x^k x′^k) / k!
         = 〈(..., √(2^k / k!) exp(−x²) x^k, ...), (..., √(2^k / k!) exp(−x′²) x′^k, ...)〉

so Φ maps x to an infinite-dimensional feature vector.

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 63 / 77

Page 64: WISS 2015 - Machine Learning lecture by Ludovic Samper

Experiments with different kernels

www.antidot.net/wiss2015/SVMvaryingC.html#Non-linear-kernels
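
A possible sketch along the same lines as the linked notebook (the dataset and parameter values here are illustrative, not taken from it): fit SVC with different kernels on data that is not linearly separable:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.15, random_state=0)  # not linearly separable
    for kernel in ['linear', 'poly', 'rbf']:
        clf = SVC(kernel=kernel, C=1.0, gamma=0.5, degree=3)
        clf.fit(X, y)
        print(kernel, clf.score(X, y))   # training accuracy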

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 64 / 77

Page 65: WISS 2015 - Machine Learning lecture by Ludovic Samper

SVM in multiclass

one-vs-the-rest

N_C binary classifiers (but each is trained on the whole dataset)

At prediction time, choose the class with the maximum decision value

one-vs-one

N_C (N_C − 1) / 2 binary classifiers

At prediction time, vote

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 65 / 77

Page 66: WISS 2015 - Machine Learning lecture by Ludovic Samper

SVM in scikit-learn

SVC : Support Vector Classification

sklearn.svm.LinearSVC

based on the liblinear library

multiclass strategy: one-vs-the-rest

only linear kernel

loss can be: 'hinge' or 'squared_hinge'

sklearn.svm.SVC

based on libSVM

multiclass strategy : one-vs-one

kernel can be : linear, polynomial, RBF, sigmoid, precomputed

only hinge loss
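
A minimal sketch with LinearSVC on the 20 newsgroups data (C is left at its default; this is illustrative, not the tuned setting from the tutorial):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    svm = make_pipeline(TfidfVectorizer(), LinearSVC())   # one-vs-the-rest internally
    svm.fit(train.data, train.target)
    print(svm.score(test.data, test.target))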

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 66 / 77

Page 67: WISS 2015 - Machine Learning lecture by Ludovic Samper

Outline

1 Problem definition

2 Extracting features from text files

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 67 / 77

Page 68: WISS 2015 - Machine Learning lecture by Ludovic Samper

Cross validation I

http://scikit-learn.org/stable/modules/cross_validation.html

Overfitting

Estimating the parameters on the test set can lead to overfitting: the parameters are the best for this particular test set but not in the general case.

Train, test and validation datasets

A solution:

tune the parameters on the test set

validate on a separate validation dataset

drawback: only few data are left for training

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 68 / 77

Page 69: WISS 2015 - Machine Learning lecture by Ludovic Samper

Cross validation II

Cross validation

k-fold cross validation

Split the training data into k partitions of the same size

train the model on k − 1 partitions

then evaluate on the remaining partition; repeat for each of the k partitions
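
A sketch of k-fold cross validation in scikit-learn (cross_val_score lives in sklearn.model_selection in current versions, sklearn.cross_validation at the time of the talk); the pipeline and k = 5 are illustrative:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    train = fetch_20newsgroups(subset='train')
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    scores = cross_val_score(clf, train.data, train.target, cv=5)  # 5-fold CV
    print(scores.mean(), scores.std())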

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 69 / 77

Page 70: WISS 2015 - Machine Learning lecture by Ludovic Samper

Cross validation III

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 70 / 77

Page 71: WISS 2015 - Machine Learning lecture by Ludovic Samper

Grid Search

http://scikit-learn.org/stable/modules/grid_search.html

Grid search

Test each value of each parameter

a brute-force algorithm to find the best value for each parameter

In scikit-learn

Automatically runs k × (number of parameter combinations) trainings

Keeps the best model

Demo with scikit-learn: http://www.antidot.net/wiss2015/grid_search_20newsgroups.html
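
A sketch of a grid search over a few hyperparameters (the parameter grid below is illustrative, not the one from the demo):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import GridSearchCV

    train = fetch_20newsgroups(subset='train')
    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
    grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
            'nb__alpha': [1.0, 0.1, 0.01]}
    gs = GridSearchCV(pipe, grid, cv=3)    # k x (number of combinations) fits
    gs.fit(train.data, train.target)
    print(gs.best_params_, gs.best_score_)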

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 71 / 77

Page 72: WISS 2015 - Machine Learning lecture by Ludovic Samper

Outline

1 Problem definition

2 Extracting features from text files

3 Algorithms for classification

4 Conclusion
    Methodology

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 72 / 77

Page 73: WISS 2015 - Machine Learning lecture by Ludovic Samper

1 Problem definition
    Supervised classification
    Evaluation metrics

2 Extracting features from text files
    Bag of words model
    Term frequency inverse document frequency (tfidf)

3 Algorithms for classification
    Naïve Bayes
    Support Vector Machine (SVM)
    Tuning parameters
        Cross validation
        Grid search

4 Conclusion
    Methodology

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 73 / 77

Page 74: WISS 2015 - Machine Learning lecture by Ludovic Samper

Methodology

To solve a problem using Machine Learning, you have to :

1 Understand the data

2 Choose an evaluation measure

3 Be able to test the model

4 Find the main features

5 Try the algorithms, with different parameters

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 73 / 77

Page 75: WISS 2015 - Machine Learning lecture by Ludovic Samper

Conclusion

Machine Learning has a lot of applications

With libraries like scikit-learn, there is no need to implement the algorithms yourself

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 74 / 77

Page 76: WISS 2015 - Machine Learning lecture by Ludovic Samper

Questions ?

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 75 / 77

Page 77: WISS 2015 - Machine Learning lecture by Ludovic Samper

References

Machine Learning in Python :

http://scikit-learn.org

Alex Smola's very good lecture on Machine Learning at CMU:

http://alex.smola.org/teaching/10-701-15/

Kernels : https://www.youtube.com/watch?v=0Nis-oMLbDs

SVM : https://www.youtube.com/watch?v=bsbpqNIKQzU

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 76 / 77

Page 78: WISS 2015 - Machine Learning lecture by Ludovic Samper

Bernoulli Naïve Bayes

Features

x_i = 1 if word i is present in the document, else x_i = 0.
The number of occurrences of word i doesn't matter.

Bernoulli

For each feature i, P(x_i | y = k) = P(i | y = k) x_i + (1 − P(i | y = k)) (1 − x_i)
Absence of a feature is explicitly taken into account.

Estimation of P(i | y = k)

P(i | y = k) = (1 + number of documents in class k that contain word i) / (2 + number of documents in class k)
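
A minimal sketch of the Bernoulli variant in scikit-learn; binary=True in the vectorizer keeps only presence/absence of each word, and alpha=1.0 corresponds to add-one smoothing:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import make_pipeline

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')

    clf = make_pipeline(CountVectorizer(binary=True), BernoulliNB(alpha=1.0))
    clf.fit(train.data, train.target)
    print(clf.score(test.data, test.target))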

Ludovic Samper (Antidot) Machine Learning September 1st, 2015 77 / 77