COMP90051 Statistical Machine Learning Semester 2, 2016 Lecturer: Trevor Cohn 18. Bayesian classification & model selection



Page 1

COMP90051 Statistical Machine Learning, Semester 2, 2016

Lecturer: Trevor Cohn

18. Bayesian classification & model selection

Page 2

Statistical Machine Learning (S2 2017) Lecture 18

Recap: Bayesian inference

• Uncertainty not captured by MLE, MAP, etc.

• Bayesian approach preserves uncertainty
  * care about predictions NOT parameters
  * choose prior over parameters, then model posterior
  * integrate out parameters for prediction (today)

• Requires computing an integral for the 'evidence' term
  * conjugate prior makes this possible


Page 3

Stages of Training

1. Decide on model formulation & prior

2. Compute posterior over parameters, p(w|X,y)

Then the approaches diverge:

MAP:
3. Find mode for w
4. Use to make prediction on test

approx. Bayes:
3. Sample many w
4. Use to make ensemble average prediction on test

exact Bayes:
3. Use all w to make expected prediction on test

Page 4

Prediction with uncertain w

• Could predict using sampled regression curves
  * sample S parameters, w^(s), s ∈ [1, S]
  * for each sample compute prediction y*^(s) at test point x*
  * compute the mean (and var.) over these predictions
  * this process is known as Monte Carlo integration

• For Bayesian regression there's a simpler solution
  * integration can be done analytically, giving

$$p(y_* \mid \boldsymbol{X}, \boldsymbol{y}, \boldsymbol{x}_*, \sigma^2) = \int p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}, \sigma^2)\, p(y_* \mid \boldsymbol{x}_*, \boldsymbol{w}, \sigma^2)\, d\boldsymbol{w}$$
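The Monte Carlo route described in the bullets can be sketched as follows. All numbers here (posterior mean and covariance, noise variance, test point) are hypothetical placeholders, not values from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior Normal(w_N, V_N) over weights, as arises in
# Bayesian linear regression (placeholder values, not lecture numbers)
w_N = np.array([0.5, -1.0])
V_N = np.array([[0.10, 0.02],
                [0.02, 0.20]])
sigma2 = 0.25                      # assumed known noise variance
x_star = np.array([1.0, 2.0])      # test point

S = 10000
w_s = rng.multivariate_normal(w_N, V_N, size=S)   # sample S parameters w^(s)
y_s = w_s @ x_star + rng.normal(0.0, np.sqrt(sigma2), size=S)

mc_mean, mc_var = y_s.mean(), y_s.var()           # Monte Carlo estimates
```

With these placeholder values the estimates converge to the analytic answers x*'w_N = -1.5 and σ² + x*'V_N x* = 1.23 as S grows.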


Page 5

Prediction (cont.)

• Pleasant properties of the Gaussian distribution mean the integration is tractable
  * additive variance based on x*'s match to training data
  * cf. MLE/MAP estimate, where variance is a fixed constant

$$p(y_* \mid \boldsymbol{x}_*, \boldsymbol{X}, \boldsymbol{y}, \sigma^2) = \int p(\boldsymbol{w} \mid \boldsymbol{X}, \boldsymbol{y}, \sigma^2)\, p(y_* \mid \boldsymbol{x}_*, \boldsymbol{w}, \sigma^2)\, d\boldsymbol{w}$$

$$= \int \mathrm{Normal}(\boldsymbol{w} \mid \boldsymbol{w}_N, \boldsymbol{V}_N)\, \mathrm{Normal}(y_* \mid \boldsymbol{x}_*' \boldsymbol{w}, \sigma^2)\, d\boldsymbol{w}$$

$$= \mathrm{Normal}(y_* \mid \boldsymbol{x}_*' \boldsymbol{w}_N, \sigma^2_N(\boldsymbol{x}_*))$$

$$\sigma^2_N(\boldsymbol{x}_*) = \sigma^2 + \boldsymbol{x}_*' \boldsymbol{V}_N \boldsymbol{x}_*$$

(w_N and V_N defined in lecture 17)
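A sketch of the analytic predictive on a y = x sin(x) example like the one on the following slide. The formulas V_N = (Φ'Φ/σ² + I/γ²)⁻¹ and w_N = V_N Φ'y/σ² are the standard Bayesian linear regression results (assumed here, since the lecture defers their definition to lecture 17), with an assumed Normal(0, γ²I) prior:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: y = x*sin(x) with Gaussian noise; cubic polynomial model
x = rng.uniform(0.0, 6.0, size=20)
y = x * np.sin(x) + rng.normal(0.0, 0.1, size=20)

def features(x):
    # cubic basis with bias term
    return np.vstack([np.ones_like(x), x, x**2, x**3]).T

sigma2, gamma2 = 0.1**2, 100.0     # assumed noise and prior variances
Phi = features(x)
V_N = np.linalg.inv(Phi.T @ Phi / sigma2 + np.eye(4) / gamma2)
w_N = V_N @ Phi.T @ y / sigma2

def predictive(x_star):
    phi = features(np.atleast_1d(x_star))[0]
    mean = phi @ w_N
    var = sigma2 + phi @ V_N @ phi   # sigma^2_N(x_*)
    return mean, var

# Predictive variance is additive and grows away from the data (x in [0, 6])
_, var_inside = predictive(3.0)
_, var_outside = predictive(12.0)
```

This reproduces the point made on the slide: σ²_N(x*) is never below the noise floor σ², and it grows as x* moves away from the training data.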

Page 6

Bayesian Prediction example

[Figure. Data: y = x sin(x); model = cubic. Left: MLE (blue) and MAP (green) point estimates, with fixed variance (MLE, MAP fit). Right: samples from the posterior; variance is higher further from the data points.]

Page 7

Caveats

• Assumptions
  * known data noise parameter, σ²
  * data was drawn from the model distribution

• In real settings, σ² is unknown
  * has its own conjugate prior: Normal likelihood × Inverse Gamma prior results in an Inverse Gamma posterior
  * closed-form predictive distribution, with Student-t likelihood (see Murphy, 7.6.3)

Page 8

Bayesian Classification

How can we apply Bayesian ideas to discrete settings?

Page 9

Generative scenario

• First off, consider models which generate the input
  * cf. discriminative models, which condition on the input
  * i.e., p(y|x) vs p(x,y); Naïve Bayes vs Logistic Regression

• For simplicity, start with the most basic setting
  * n coin tosses, of which k were heads
  * only have x (sequence of outcomes), but no 'classes' y

• Methods apply to generative models over discrete data
  * e.g., topic models, generative classifiers (Naïve Bayes, mixture of multinomials)

Page 10

Discrete Conjugate prior: Beta-Binomial

• Conjugate priors also exist for discrete spaces

• Consider n coin tosses, of which k were heads
  * let p(head) = q for a single toss (Bernoulli dist.)
  * inference question: is the coin biased, i.e., is q ≈ 0.5?

• Several draws: use the Binomial dist.
  * and its conjugate prior, the Beta dist.

$$p(k \mid n, q) = \binom{n}{k} q^k (1-q)^{n-k}$$

$$p(q) = \mathrm{Beta}(q; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, q^{\alpha-1} (1-q)^{\beta-1}$$
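Both distributions are available in `scipy.stats`; the counts and pseudo-counts below are hypothetical, purely to show the pieces:

```python
from scipy import stats

n, k = 10, 7                 # hypothetical data: 10 tosses, 7 heads
alpha, beta = 2.0, 2.0       # Beta prior pseudo-counts

# Binomial likelihood of the observed count under a fair coin, p(k | n, q=0.5)
lik_fair = stats.binom.pmf(k, n, 0.5)

# Beta prior p(q); with alpha = beta it is centred on q = 0.5
prior_mean = stats.beta(alpha, beta).mean()
```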

Page 11

Beta distribution

[Figure: Beta densities for various (α, β). Sourced from https://en.wikipedia.org/wiki/Beta_distribution]

Page 12

Beta-Binomial conjugacy

$$p(k \mid n, q) = \binom{n}{k} q^k (1-q)^{n-k}$$

$$p(q) = \mathrm{Beta}(q; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, q^{\alpha-1} (1-q)^{\beta-1}$$

Bayesian posterior (trick: ignore constant factors, i.e., the normaliser):

$$p(q \mid k, n) \propto p(k \mid n, q)\, p(q) \propto q^k (1-q)^{n-k}\, q^{\alpha-1} (1-q)^{\beta-1} = q^{k+\alpha-1} (1-q)^{n-k+\beta-1} \propto \mathrm{Beta}(q; k+\alpha, n-k+\beta)$$

Sweet! We know the normaliser for the Beta.
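The conjugacy result can be checked numerically: normalise likelihood × prior on a grid and compare against Beta(q; k+α, n−k+β). The data and hyperparameters are hypothetical:

```python
import numpy as np
from scipy import stats

n, k = 10, 7                 # hypothetical toss data
alpha, beta = 2.0, 2.0

q = np.linspace(1e-6, 1 - 1e-6, 200001)
dq = q[1] - q[0]

# Unnormalised posterior: likelihood x prior, normalised numerically
unnorm = stats.binom.pmf(k, n, q) * stats.beta.pdf(q, alpha, beta)
grid_post = unnorm / (unnorm.sum() * dq)

# Conjugate result derived above: Beta(q; k + alpha, n - k + beta)
exact_post = stats.beta.pdf(q, k + alpha, n - k + beta)

max_abs_err = np.abs(grid_post - exact_post).max()
```

The grid posterior and the closed-form Beta agree to within numerical error, confirming the derivation.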

Page 13

Laplace's Sunrise Problem

Every morning you observe the sun rising. Based solely on this fact, what's the probability that the sun will rise tomorrow?

• Use beta-binomial, where q is Pr(sun rises in the morning)
  * posterior: p(q|k,n) = Beta(q; k+α, n−k+β)
  * n = k = age in days
  * let α = β = 1 (uniform prior)

• Under these assumptions

$$p(q \mid k) = \mathrm{Beta}(q; k+1, 1)$$

$$\mathbb{E}_{p(q|k)}[q] = \frac{k+1}{k+2}$$

('smoothed' count of days where the sun rose / did not)
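The posterior mean above is Laplace's rule of succession, trivial to compute exactly:

```python
from fractions import Fraction

def p_sun_rises_tomorrow(k):
    # Posterior mean of Beta(q; k + 1, 1): Laplace's rule of succession
    return Fraction(k + 1, k + 2)

p0 = p_sun_rises_tomorrow(0)               # 1/2: uniform prior, no data yet
p365 = float(p_sun_rises_tomorrow(365))    # after one year of sunrises
```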

Page 14

Sunrise Problem (cont.)

Day (n = k)        k + α    n − k + β   E[q]
0                  1        1           0.5
1                  2        1           0.667
2                  3        1           0.75
…
365                366      1           0.997
29200 (80 years)   29201    1           0.99997

Consider a human life-span: the effect of the prior diminishes with data, but never disappears completely.

Page 15

Suite of useful conjugate priors

likelihood    conjugate prior                                                 used for
Normal        Normal (for mean)                                               regression
Normal        Inverse Gamma (for variance) or Inverse Wishart (covariance)    regression
Binomial      Beta                                                            classification
Multinomial   Dirichlet                                                       classification
Poisson       Gamma                                                           counts

Page 16

Bayesian Logistic Regression

Discriminative classifier, which conditions on inputs. How can we do Bayesian inference in this setting?

Page 17

Now for Logistic Regression…

• Similar problems with parameter uncertainty compared to regression
  * although predictive uncertainty is in-built to the model outputs

[Murphy Fig 8.5, p. 257: (a) Two-class data in 2d. (b) Log-likelihood for a logistic regression model. The line is drawn from the origin in the direction of the MLE (which is at infinity). The numbers correspond to 4 points in parameter space, corresponding to the lines in (a). (c) Unnormalized log posterior (assuming vague spherical prior). (d) Laplace approximation to posterior. Based on a figure by Mark Girolami. Figure generated by logregLaplaceGirolamiDemo.]

Unfortunately this integral is intractable. The simplest approximation is the plug-in approximation, which, in the binary case, takes the form

$$p(y = 1 \mid x, \mathcal{D}) \approx p(y = 1 \mid x, \mathbb{E}[w]) \qquad (8.60)$$

where E[w] is the posterior mean. In this context, E[w] is called the Bayes point. Of course, such a plug-in estimate underestimates the uncertainty. We discuss some better approximations below.

(Murphy Fig 8.5 & 8.6, pp. 257–8)

[Murphy Fig 8.6, p. 258: Posterior predictive distribution for a logistic regression model in 2d. Top left: contours of p(y = 1|x, w_MAP). Top right: samples from the posterior predictive distribution (decision boundaries for sampled w). Bottom left: MC approximation of p(y = 1|x), averaging over these samples. Bottom right: numerical approximation of p(y = 1|x) (moderated output, probit approximation). Based on a figure by Mark Girolami. Figure generated by logregLaplaceGirolamiDemo.]

8.4.4.1 Monte Carlo approximation

A better approach is to use a Monte Carlo approximation, as follows:

$$p(y = 1 \mid x, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} \mathrm{sigm}\big((w^s)^T x\big) \qquad (8.61)$$

where w^s ∼ p(w|D) are samples from the posterior. (This technique can be trivially extended to the multi-class case.) If we have approximated the posterior using Monte Carlo, we can reuse these samples for prediction. If we made a Gaussian approximation to the posterior, we can draw independent samples from the Gaussian using standard methods.

Figure 8.6(b) shows samples from the posterior predictive for our 2d example.
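The plug-in (8.60) and Monte Carlo (8.61) approximations are easy to compare in code. The Gaussian posterior approximation below is hypothetical (placeholder mean and covariance):

```python
import numpy as np

rng = np.random.default_rng(1)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical Gaussian approximation to the posterior p(w | D)
w_mean = np.array([1.0, -0.5])     # posterior mean (the Bayes point)
w_cov = np.array([[0.5, 0.1],
                  [0.1, 0.5]])

x = np.array([2.0, 1.0])           # test input

# Plug-in approximation (8.60): condition on the posterior mean only
plug_in = sigm(w_mean @ x)

# Monte Carlo approximation (8.61): average over posterior samples
S = 20000
w_s = rng.multivariate_normal(w_mean, w_cov, size=S)
mc = sigm(w_s @ x).mean()
```

The Monte Carlo average lies between the plug-in value and 0.5: integrating over parameter uncertainty moderates the prediction, exactly the effect shown in Fig 8.6(c).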

Page 18

Now for Logistic Regression…

• Can we use a conjugate prior? E.g.,
  * Beta-Binomial for generative binary models
  * Dirichlet-Multinomial for multiclass (similar formulation)

• Model is discriminative, with parameters defined using the logistic sigmoid*

$$p(y \mid q, x) = q^y (1-q)^{1-y}, \qquad q = \sigma(x'w)$$

  * need prior over w, not q
  * no known conjugate prior(!), thus use a Gaussian prior

*Or softmax for multiclass; same problems arise and similar solution
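The likelihood above translates directly; the input and weights below are illustrative values, not lecture numbers:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_likelihood(y, x, w):
    # p(y | q, x) = q^y * (1 - q)^(1 - y), with q = sigmoid(x'w)
    q = sigm(x @ w)
    return q ** y * (1 - q) ** (1 - y)

# Illustrative values chosen so x'w = 0, hence q = 0.5
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.25])
p1 = bernoulli_likelihood(1, x, w)   # = q
p0 = bernoulli_likelihood(0, x, w)   # = 1 - q
```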

Page 19

Non-conjugacy

• No known solution for the normalising constant

• Resolve by approximation

$$p(w \mid X, y) \propto p(w)\, p(y \mid X, w) = \mathrm{Normal}(0, \sigma^2 I) \prod_{i=1}^{n} \sigma(x_i'w)^{y_i} \big(1 - \sigma(x_i'w)\big)^{1-y_i}$$

Laplace approximation:
• assume posterior ≃ Normal about the mode
• can compute normalisation constant, draw samples, etc.

(Murphy Fig 8.6, p. 258)
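A minimal sketch of the Laplace approximation for this posterior: find the mode by gradient ascent, then fit a Gaussian using the Hessian at the mode. The data, prior variance, step size, and iteration count are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy two-class data in 2d (synthetic, for illustration only)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

prior_var = 10.0   # assumed Gaussian prior: w ~ Normal(0, prior_var * I)

# 1. Find the posterior mode w_MAP by gradient ascent on the log posterior
w = np.zeros(2)
for _ in range(2000):
    mu = sigm(X @ w)
    grad = X.T @ (y - mu) - w / prior_var
    w += 0.01 * grad

# 2. Hessian of the negative log posterior at the mode
mu = sigm(X @ w)
H = X.T @ (X * (mu * (1 - mu))[:, None]) + np.eye(2) / prior_var

# 3. Laplace approximation: p(w | X, y) ~= Normal(w_MAP, H^{-1})
w_map, w_cov = w, np.linalg.inv(H)

accuracy = np.mean((sigm(X @ w_map) > 0.5) == y)
```

With the Gaussian in hand one can compute the normalisation constant or draw the posterior samples used in the Monte Carlo prediction above.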

Page 20

Bayesian Model Selection

Using the evidence to select the best class of model.

Page 21

Model Selection

• Choosing the best model
  * linear, polynomial order, RBF basis/kernel
  * setting model hyperparameters
  * optimiser settings
  * type of model (e.g., decision tree vs SVM)

Complex models:
• better ability to fit the training data
• may fit it too well

Simple models:
• more constrained, poorer fit to training data
• might be insufficient

Page 22

Model Selection (frequentist)

• Hold out some data for validation (fixed set, leave-one-out, 10-fold cross-validation, etc.)
  * treat held-out error as an estimate of generalisation error
  * model with lowest error is chosen
  * might retrain chosen model on the full dataset

• However, this is
  * data inefficient: must hold aside evaluation data
  * computationally inefficient: repeated rounds of training and evaluating
  * ineffective when selecting many parameters at once (can overfit the held-out set)
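The hold-out procedure can be sketched generically; the least-squares model and data here are stand-ins, not anything from the lecture:

```python
import numpy as np

rng = np.random.default_rng(3)

def kfold_mse(X, y, fit, predict, k=10):
    # k-fold cross-validation estimate of generalisation error (MSE)
    idx = rng.permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)          # all points not in this fold
        model = fit(X[train], y[train])
        errs.append(np.mean((predict(model, X[fold]) - y[fold]) ** 2))
    return float(np.mean(errs))

# Least-squares fit/predict as a stand-in for any model family
fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda w, X: X @ w

X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, 0.1, size=50)

heldout_error = kfold_mse(X, y, fit, predict)
```

Running `kfold_mse` once per candidate model and picking the lowest error is exactly the repeated train-and-evaluate loop the slide criticises as computationally inefficient.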


Page 23

Bayesian Model Selection

• Model selection using Bayes rule, to select between competing model classes
  * with M_i as model i and D the dataset, e.g., X or y|X; so for regression p(D) = p(y|X)
  * let p(M_i) be uniform, i.e., the term can be dropped

$$p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{p(D)}$$

• Decision between two model classes boils down to a test (known as the Bayes factor)

$$\frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)}{p(D \mid M_2)} > 1$$

Page 24

The Evidence: p(D|M) = p(y|X,M)

• Imagine we're considering whether to use a linear basis or cubic basis for supervised regression
  * what is p(y|X, M=linear) or p(y|X, M=cubic)?
  * what happened to the parameters w?

• These are integrated out, i.e.,

$$p(y \mid X, M = \text{linear}) = \int p(y \mid X, w, M = \text{linear})\, p(w)\, dw$$

  * seen before: the denominator from the posterior, aka 'marginal likelihood'

$$p(w \mid X, y, M) = \frac{p(y \mid X, w, M)\, p(w)}{p(y \mid X, M)}$$
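The integral can be estimated crudely by Monte Carlo: sample parameters from the prior and average the likelihood. Everything below (the quadratic data, noise and prior variances) is an assumption for illustration, and prior sampling is a noisy estimator, but the gap between model classes is large enough to show the effect:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic quadratic data (assumed curve and noise level)
x = np.linspace(-3.0, 3.0, 30)
y = 1.0 - 0.5 * x ** 2 + rng.normal(0.0, 0.3, size=30)

sigma2, prior_var = 0.3 ** 2, 4.0

def log_evidence_mc(Phi, S=50000):
    # p(y | X, M) = E_{w ~ p(w)}[ p(y | X, w) ], estimated by sampling
    # parameters from the prior (crude but illustrative)
    w = rng.normal(0.0, np.sqrt(prior_var), size=(S, Phi.shape[1]))
    resid = y[None, :] - w @ Phi.T
    log_lik = (-0.5 * np.sum(resid ** 2, axis=1) / sigma2
               - 0.5 * len(y) * np.log(2 * np.pi * sigma2))
    return np.logaddexp.reduce(log_lik) - np.log(S)

Phi_lin = np.column_stack([np.ones_like(x), x])
Phi_quad = np.column_stack([np.ones_like(x), x, x ** 2])

ev_lin = log_evidence_mc(Phi_lin)
ev_quad = log_evidence_mc(Phi_quad)
# ev_quad > ev_lin: no linear parameter setting explains this data well
```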

Page 25

The Evidence: Bayesian Occam's Razor

• How well does the model fit the data, under any (all) parameter settings?

• Flexible (complex) models
  * able to fit many different datasets, by selecting specific parameters
  * most other parameter settings will lead to a poor fit

• Simpler models
  * fit few datasets well
  * less sensitive to parameter values
  * many parameter settings will give a similar fit

Page 26

Evidence Cartoon (under uniform prior)

• Space of models: linear < quadratic < cubic

• Assuming quadratic data, this is best fit by a quadratic or higher-order model

• As the complexity class grows, the space of models grows too
  * the fraction of parameters offering a 'good' fit to the data will shrink

• Ideally, would select the quadratic model, as this fraction is greatest

[Figure: nested model spaces (linear ⊂ quadratic ⊂ cubic), with the region of good fit to the given dataset highlighted.]

Page 27

Summary

• Conjugate prior relationships
  * Normal-Normal, Beta-Binomial

• Bayesian inference
  * parameters are 'nuisance' variables
  * integrated out during inference

• Bayesian classification
  * non-conjugacy necessitates approximation

• Bayesian model selection