COMP90051 Statistical Machine Learning
Semester 2, 2017
Lecturer: Trevor Cohn

18. Bayesian classification & model selection
Recap: Bayesian inference

• Uncertainty not captured by MLE, MAP, etc.
• Bayesian approach preserves uncertainty
  * care about predictions NOT parameters
  * choose prior over parameters, then model posterior
  * integrate out parameters for prediction (today)
• Requires computing an integral for the 'evidence' term
  * conjugate prior makes this possible
Stages of Training

1. Decide on model formulation & prior
2. Compute posterior over parameters, p(w|X,y)

MAP:
3. Find mode for w
4. Use to make prediction on test

approx. Bayes:
3. Sample many w
4. Use to make ensemble average prediction on test

exact Bayes:
3. Use all w to make expected prediction on test
Prediction with uncertain w

• Could predict using sampled regression curves
  * sample S parameters, $\mathbf{w}^{(s)}$, $s \in [1, S]$
  * for each sample, compute the prediction $y_*^{(s)}$ at test point $\mathbf{x}_*$
  * compute the mean (and var.) over these predictions
  * this process is known as Monte Carlo integration (see the sketch below)
• For Bayesian regression there's a simpler solution
  * integration can be done analytically, giving

$$p(y_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_*, \sigma^2) = \int p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, \sigma^2)\, p(y_* \mid \mathbf{x}_*, \mathbf{w}, \sigma^2)\, d\mathbf{w}$$
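As a concrete illustration, here is a minimal sketch of this Monte Carlo procedure for Bayesian linear regression, assuming the posterior Normal(w_N, V_N) from lecture 17 is already known; the numerical values of w_N, V_N, sigma2 and x_star are hypothetical placeholders.

```python
# Monte Carlo prediction: sample parameters from the posterior, then sample
# predictions at the test point and summarise them.
import numpy as np

rng = np.random.default_rng(0)

w_N = np.array([0.5, -1.2])      # posterior mean (placeholder)
V_N = np.array([[0.10, 0.02],
                [0.02, 0.20]])   # posterior covariance (placeholder)
sigma2 = 0.25                    # known noise variance (placeholder)
x_star = np.array([1.0, 3.0])    # test point, after basis expansion

S = 100_000
w_samples = rng.multivariate_normal(w_N, V_N, size=S)   # w^(s) ~ p(w|X,y)
y_samples = w_samples @ x_star + rng.normal(0.0, np.sqrt(sigma2), size=S)

print("MC mean:", y_samples.mean())   # approximates x*' w_N
print("MC var :", y_samples.var())    # approximates sigma2 + x*' V_N x*
```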
Prediction (cont.)

• Pleasant properties of the Gaussian distribution mean the integration is tractable
  * additive variance, based on how well x* matches the training data
  * cf. the MLE/MAP estimate, where the variance is a fixed constant
$$
\begin{aligned}
p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}, \sigma^2) &= \int p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, \sigma^2)\, p(y_* \mid \mathbf{x}_*, \mathbf{w}, \sigma^2)\, d\mathbf{w} \\
&= \int \mathrm{Normal}(\mathbf{w} \mid \mathbf{w}_N, \mathbf{V}_N)\, \mathrm{Normal}(y_* \mid \mathbf{x}_*'\mathbf{w}, \sigma^2)\, d\mathbf{w} \\
&= \mathrm{Normal}(y_* \mid \mathbf{x}_*'\mathbf{w}_N, \sigma^2_N(\mathbf{x}_*)) \\
\sigma^2_N(\mathbf{x}_*) &= \sigma^2 + \mathbf{x}_*'\mathbf{V}_N\mathbf{x}_*
\end{aligned}
$$

($\mathbf{w}_N$ and $\mathbf{V}_N$ defined in lecture 17)
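The closed form makes the Monte Carlo loop unnecessary; a minimal sketch, reusing the same placeholder values as above:

```python
# Closed-form Bayesian predictive distribution for linear regression.
import numpy as np

w_N = np.array([0.5, -1.2])      # posterior mean (placeholder)
V_N = np.array([[0.10, 0.02],
                [0.02, 0.20]])   # posterior covariance (placeholder)
sigma2 = 0.25                    # known noise variance (placeholder)
x_star = np.array([1.0, 3.0])    # test point

pred_mean = x_star @ w_N                    # x*' w_N
pred_var = sigma2 + x_star @ V_N @ x_star   # sigma2_N(x*)

print("predictive mean:", pred_mean)
print("predictive var :", pred_var)   # larger for x* far from the training data
```

The Monte Carlo estimates above converge to exactly these values as S grows.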
Bayesian Prediction example

[Figure: data y = x sin(x), model = cubic. MLE (blue) and MAP (green) point estimates have fixed variance; samples from the posterior show variance higher further from the data points.]
Caveats

• Assumptions
  * known data-noise parameter, σ²
  * data was drawn from the model distribution
• In real settings, σ² is unknown
  * it has its own conjugate prior: a Normal likelihood ⨉ Inverse Gamma prior results in an Inverse Gamma posterior
  * closed-form predictive distribution, with a Student-t likelihood (see Murphy, 7.6.3)
Bayesian Classification

How can we apply Bayesian ideas to discrete settings?
Generative scenario

• First off, consider models which generate the input
  * cf. discriminative models, which condition on the input
  * i.e., p(x,y) vs p(y|x): Naïve Bayes vs Logistic Regression
• For simplicity, start with the most basic setting
  * n coin tosses, of which k were heads
  * only have x (sequence of outcomes), but no 'classes' y
• Methods apply to generative models over discrete data
  * e.g., topic models, generative classifiers (Naïve Bayes, mixture of multinomials)
Discrete Conjugate prior: Beta-Binomial

• Conjugate priors also exist for discrete spaces
• Consider n coin tosses, of which k were heads
  * let p(head) = q for a single toss (Bernoulli dist.)
  * inference question: is the coin biased, i.e., is q ≈ 0.5?
• For several draws, use the Binomial dist.
  * and its conjugate prior, the Beta dist.
$$p(k \mid n, q) = \binom{n}{k} q^k (1-q)^{n-k}$$

$$p(q) = \mathrm{Beta}(q; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, q^{\alpha-1}(1-q)^{\beta-1}$$
Beta distribution

[Figure: Beta distribution densities. Sourced from https://en.wikipedia.org/wiki/Beta_distribution]
Beta-Binomial conjugacy

Recall the likelihood and prior:

$$p(k \mid n, q) = \binom{n}{k} q^k (1-q)^{n-k}, \qquad p(q) = \mathrm{Beta}(q; \alpha, \beta)$$

Bayesian posterior, using the trick of ignoring constant factors (the normaliser):

$$
\begin{aligned}
p(q \mid k, n) &\propto p(k \mid n, q)\, p(q) \\
&\propto q^k (1-q)^{n-k}\, q^{\alpha-1}(1-q)^{\beta-1} \\
&= q^{k+\alpha-1}(1-q)^{n-k+\beta-1} \\
&\propto \mathrm{Beta}(q; k+\alpha, n-k+\beta)
\end{aligned}
$$

Sweet! We know the normaliser for the Beta.
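A minimal sketch of this conjugate update, using scipy's Beta distribution; the counts n and k are toy values:

```python
# Beta-Binomial conjugate update: the posterior over the coin bias q is
# Beta(k + alpha, n - k + beta).
from scipy import stats

alpha, beta = 1.0, 1.0   # Beta(1,1) = uniform prior over q
n, k = 10, 7             # toy observation: 7 heads in 10 tosses

posterior = stats.beta(k + alpha, n - k + beta)

print("posterior mean       :", posterior.mean())   # (k+alpha)/(n+alpha+beta)
print("95% credible interval:", posterior.interval(0.95))
```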
Laplace's Sunrise Problem

Every morning you observe the sun rising. Based solely on this fact, what's the probability that the sun will rise tomorrow?

• Use the Beta-Binomial, where q is Pr(sun rises in the morning)
  * posterior: $p(q \mid k, n) = \mathrm{Beta}(q; k + \alpha, n - k + \beta)$
  * n = k = age in days
  * let α = β = 1 (uniform prior)
• Under these assumptions

$$p(q \mid k) = \mathrm{Beta}(q; k + 1, 1), \qquad \mathbb{E}_{p(q|k)}[q] = \frac{k+1}{k+2}$$

('smoothed' count of days where the sun rose / did not)
Sunrise Problem (cont.)

Day (n = k)        k+α     n−k+β   E[q]
0                  1       1       0.5
1                  2       1       0.667
2                  3       1       0.75
…
365                366     1       0.997
29200 (80 years)   29201   1       0.99997

Consider a human life-span: the effect of the prior diminishes with data, but never disappears completely.
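The table's final column can be checked in one line per row, since E[q] = (k+1)/(k+2) under these assumptions:

```python
# Verify the sunrise-problem table: alpha = beta = 1 and n = k observed sunrises.
for k in [0, 1, 2, 365, 29200]:
    print(f"day {k}: E[q] = {(k + 1) / (k + 2):.5f}")
```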
Suite of useful conjugate priors

likelihood    conjugate prior                                                    used for
Normal        Normal (for mean)                                                  regression
Normal        Inverse Gamma (for variance) or Inverse Wishart (for covariance)   regression
Binomial      Beta                                                               classification
Multinomial   Dirichlet                                                          classification
Poisson       Gamma                                                              counts
Bayesian Logistic Regression

Discriminative classifier, which conditions on the inputs. How can we do Bayesian inference in this setting?
Now for Logistic Regression…

• Similar problems with parameter uncertainty as for regression
  * although predictive uncertainty is built in to the model outputs
[Murphy Figure 8.5, p. 257: (a) Two-class data in 2d. (b) Log-likelihood for a logistic regression model. The line is drawn from the origin in the direction of the MLE (which is at infinity). The numbers correspond to 4 points in parameter space, corresponding to the lines in (a). (c) Unnormalized log posterior (assuming vague spherical prior). (d) Laplace approximation to posterior. Based on a figure by Mark Girolami. Figure generated by logregLaplaceGirolamiDemo.]
Unfortunately this integral is intractable. The simplest approximation is the plug-in approximation, which, in the binary case, takes the form

$$p(y = 1 \mid \mathbf{x}, \mathcal{D}) \approx p(y = 1 \mid \mathbf{x}, \mathbb{E}[\mathbf{w}]) \qquad (8.60)$$

where $\mathbb{E}[\mathbf{w}]$ is the posterior mean. In this context, $\mathbb{E}[\mathbf{w}]$ is called the Bayes point. Of course, such a plug-in estimate underestimates the uncertainty. We discuss some better approximations below.

(Murphy Figs. 8.5 & 8.6, pp. 257-8)
[Murphy Figure 8.6, p. 258: Posterior predictive distribution for a logistic regression model in 2d. Top left: contours of p(y = 1|x, w_MAP). Top right: samples from the posterior predictive distribution (decision boundaries for sampled w). Bottom left: Monte Carlo averaging over these samples, approximating p(y = 1|x). Bottom right: moderated output (probit approximation). Based on a figure by Mark Girolami. Figure generated by logregLaplaceGirolamiDemo.]
8.4.4.1 Monte Carlo approximation

A better approach is to use a Monte Carlo approximation, as follows:

$$p(y = 1 \mid \mathbf{x}, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} \mathrm{sigm}\left((\mathbf{w}^s)^T \mathbf{x}\right) \qquad (8.61)$$

where $\mathbf{w}^s \sim p(\mathbf{w} \mid \mathcal{D})$ are samples from the posterior. (This technique can be trivially extended to the multi-class case.) If we have approximated the posterior using Monte Carlo, we can reuse these samples for prediction. If we made a Gaussian approximation to the posterior, we can draw independent samples from the Gaussian using standard methods. Figure 8.6(b) shows samples from the posterior predictive for our 2d example; Figure 8.6(c) averages over these samples.
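A minimal sketch of this Monte Carlo approximation, assuming a Gaussian (e.g., Laplace) approximation to the posterior; w_map, its covariance, and the test input are hypothetical placeholders:

```python
# Eq. 8.61: average the sigmoid outputs over posterior samples of w.
import numpy as np

rng = np.random.default_rng(0)

w_map = np.array([2.0, 1.0])     # posterior mode (placeholder)
V = np.array([[0.3, 0.0],
              [0.0, 0.3]])       # Gaussian approx. covariance (placeholder)
x = np.array([1.0, 1.0])         # test input (placeholder)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

S = 10_000
w_samples = rng.multivariate_normal(w_map, V, size=S)   # w^s ~ p(w|D)
p_mc = sigm(w_samples @ x).mean()   # (1/S) sum_s sigm((w^s)' x)
p_plugin = sigm(w_map @ x)          # plug-in approximation (Eq. 8.60)

print("MC approx p(y=1|x,D):", p_mc)
print("plug-in   p(y=1|x)  :", p_plugin)   # more extreme than the MC average
```

Averaging over samples pulls the probability towards 0.5 relative to the plug-in estimate, reflecting the parameter uncertainty.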
Now for Logistic Regression…

• Can we use a conjugate prior? E.g.,
  * Beta-Binomial for generative binary models
  * Dirichlet-Multinomial for multiclass (similar formulation)
• Model is discriminative, with parameters defined using the logistic sigmoid*

$$p(y \mid q, \mathbf{x}) = q^y (1-q)^{1-y}, \qquad q = \sigma(\mathbf{x}'\mathbf{w})$$

  * need a prior over w, not q
  * no known conjugate prior (!), thus use a Gaussian prior

* Or softmax for multiclass; the same problems arise, with a similar solution.
Non-conjugacy

• No known solution for the normalising constant

$$
\begin{aligned}
p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) &\propto p(\mathbf{w})\, p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \\
&= \mathrm{Normal}(\mathbf{0}, \sigma^2 \mathbf{I}) \prod_{i=1}^{n} \sigma(\mathbf{x}_i'\mathbf{w})^{y_i} \left(1 - \sigma(\mathbf{x}_i'\mathbf{w})\right)^{1-y_i}
\end{aligned}
$$

• Resolve by approximation

Laplace approximation:
• assume the posterior ≃ Normal about its mode
• can compute the normalisation constant, draw samples, etc. (see the sketch below)

(Murphy Fig 8.6, p. 258)
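A minimal sketch of the Laplace approximation on toy data: find the posterior mode by Newton's method (one standard choice), then take the inverse Hessian of the negative log posterior as the Gaussian's covariance. The data, prior variance and iteration count are all illustrative.

```python
# Laplace approximation for Bayesian logistic regression:
# p(w|X,y) ~= Normal(w_map, H^{-1}).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                        # toy inputs
t = X @ np.array([1.0, -1.0]) + rng.normal(0.0, 0.5, size=20)
y = (t > 0).astype(float)                           # toy binary labels
sigma2 = 10.0                                       # prior variance on w

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)
for _ in range(20):                                 # Newton's method
    mu = sigm(X @ w)                                # predicted probabilities
    grad = X.T @ (mu - y) + w / sigma2              # grad of -log p(w|X,y)
    s = mu * (1.0 - mu)                             # Bernoulli variances
    H = (X * s[:, None]).T @ X + np.eye(2) / sigma2 # Hessian of -log p(w|X,y)
    w = w - np.linalg.solve(H, grad)

w_map, V = w, np.linalg.inv(H)
print("w_map:", w_map)
print("approx. posterior covariance:\n", V)   # can feed the MC prediction above
```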
Bayesian Model Selection

Using the evidence to select the best class of model.
Model Selection

• Choosing the best model
  * linear, polynomial order, RBF basis/kernel
  * setting model hyperparameters
  * optimiser settings
  * type of model (e.g., decision tree vs SVM)

Complex models:
• better able to fit the training data
• may fit it too well

Simple models:
• more constrained, poorer fit to the training data
• might be insufficient
Model Selection (frequentist)

• Hold out some data for validation (fixed set, leave-one-out, 10-fold cross-validation, etc.; see the sketch below)
  * treat the held-out error as an estimate of generalisation error
  * the model with the lowest error is chosen
  * might retrain the chosen model on the full dataset
• However, this is
  * data inefficient: must hold aside evaluation data
  * computationally inefficient: repeated rounds of training and evaluating
  * ineffective when selecting many parameters at once (can overfit the held-out set)
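A minimal sketch of this frequentist procedure, choosing a polynomial degree by 10-fold cross-validation on toy data (all settings illustrative):

```python
# k-fold cross-validation for selecting a polynomial degree.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x * np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)   # toy data

def cv_error(degree, folds=10):
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, folds):          # held-out indices
        train = np.setdiff1d(idx, fold)
        w = np.polyfit(x[train], y[train], degree)   # least-squares fit
        errs.append(np.mean((np.polyval(w, x[fold]) - y[fold]) ** 2))
    return np.mean(errs)                             # generalisation estimate

for d in [1, 2, 3, 5, 9]:
    print(f"degree {d}: CV error = {cv_error(d):.4f}")
```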
Bayesian Model Selection

• Model selection using Bayes rule, to select between competing model classes
  * with $M_i$ as model i and D the dataset, e.g., X or y|X; so for regression p(D) = p(y|X)

$$p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{p(D)}$$

  * let $p(M_i)$ be uniform, i.e., the term is dropped
• Decision between two model classes boils down to a test (known as the Bayes factor)

$$\frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)}{p(D \mid M_2)} > 1$$
The Evidence: p(D|M) = p(y|X, M)

• Imagine we're considering whether to use a linear or a cubic basis for supervised regression
  * what is p(y|X, M=linear) or p(y|X, M=cubic)?
  * what happened to the parameters w?
• These are integrated out, i.e.,

$$p(\mathbf{y} \mid \mathbf{X}, M = \mathrm{linear}) = \int p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, M = \mathrm{linear})\, p(\mathbf{w})\, d\mathbf{w}$$

  * seen before as the denominator of the posterior, aka the 'marginal likelihood':

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, M) = \frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, M)\, p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X}, M)}$$
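A minimal sketch of computing this evidence in closed form for Bayesian linear regression with polynomial bases, where marginalising w ~ Normal(0, tau2 I) gives y | X, M ~ Normal(0, tau2 Φ Φ' + sigma2 I); the data follow the lecture's y = x sin(x) example, and the hyperparameters are assumed values:

```python
# Log evidence log p(y|X, M) for polynomial models of different degree.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x * np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)   # toy data
sigma2, tau2 = 0.1**2, 1.0   # assumed noise and prior variances

def log_evidence(degree):
    Phi = np.vander(x, degree + 1)                      # polynomial basis
    cov = tau2 * Phi @ Phi.T + sigma2 * np.eye(len(x))  # marginal covariance
    return stats.multivariate_normal(np.zeros(len(x)), cov).logpdf(y)

for d in [1, 2, 3, 5, 9]:
    print(f"degree {d}: log evidence = {log_evidence(d):.1f}")
```

The difference in log evidence between two models is the log Bayes factor from the previous slide, so the same computation drives the model-selection test.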
The Evidence: Bayesian Occam's Razor

• How well does the model fit the data, under any (all) parameter settings?
• Flexible (complex) models
  * able to fit many different datasets, by selecting specific parameters
  * most other parameter settings will lead to a poor fit
• Simpler models
  * fit few datasets well
  * less sensitive to parameter values
  * many parameter settings will give a similar fit
Evidence Cartoon (under uniform prior)

• Space of models: linear < quadratic < cubic
• Assuming quadratic data, this is best fit by a quadratic or higher-order model
• As the complexity class grows, the space of models grows too
  * the fraction of parameters offering a 'good' fit to the data will shrink
• Ideally, we would select the quadratic model, as this fraction is greatest

[Figure: nested model spaces, linear ⊂ quadratic ⊂ cubic, with the region of good fit to the given dataset marked inside.]
Summary

• Conjugate prior relationships
  * Normal-Normal, Beta-Binomial
• Bayesian inference
  * parameters are 'nuisance' variables
  * integrated out during inference
• Bayesian classification
  * non-conjugacy necessitates approximation
• Bayesian model selection