COMP90051 Statistical Machine Learning
Semester 2, 2017
Lecturer: Trevor Cohn

18. Bayesian classification & model selection
Recap: Bayesian inference

• Uncertainty not captured by MLE, MAP, etc.
• Bayesian approach preserves uncertainty
  * care about predictions NOT parameters
  * choose prior over parameters, then model posterior
  * integrate out parameters for prediction (today)
• Requires computing an integral for the 'evidence' term
  * conjugate prior makes this possible
Stages of Training

1. Decide on model formulation & prior
2. Compute posterior over parameters, p(w|X,y)

MAP:
3. Find mode for w
4. Use to make prediction on test

approx. Bayes:
3. Sample many w
4. Use to make ensemble average prediction on test

exact Bayes:
3. Use all w to make expected prediction on test
Prediction with uncertain w

• Could predict using sampled regression curves
  * sample S parameters, $\mathbf{w}^{(s)}$, $s \in [1, S]$
  * for each sample, compute the prediction $y_*^{(s)}$ at test point $\mathbf{x}_*$
  * compute the mean (and var.) over these predictions
  * this process is known as Monte Carlo integration (see the sketch below)
• For Bayesian regression there's a simpler solution
  * integration can be done analytically, giving

$$p(y_* \mid \mathbf{X}, \mathbf{y}, \mathbf{x}_*, \sigma^2) = \int p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, \sigma^2)\, p(y_* \mid \mathbf{x}_*, \mathbf{w}, \sigma^2)\, d\mathbf{w}$$
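As a concrete illustration, here is a minimal sketch of this Monte Carlo procedure for Bayesian linear regression, assuming the posterior Normal(w_N, V_N) from lecture 17 is already known; the numerical values of w_N, V_N, sigma2 and x_star are hypothetical placeholders.

```python
# Monte Carlo prediction: sample parameters from the posterior, then sample
# predictions at the test point and summarise them.
import numpy as np

rng = np.random.default_rng(0)

w_N = np.array([0.5, -1.2])      # posterior mean (placeholder)
V_N = np.array([[0.10, 0.02],
                [0.02, 0.20]])   # posterior covariance (placeholder)
sigma2 = 0.25                    # known noise variance (placeholder)
x_star = np.array([1.0, 3.0])    # test point, after basis expansion

S = 100_000
w_samples = rng.multivariate_normal(w_N, V_N, size=S)   # w^(s) ~ p(w|X,y)
y_samples = w_samples @ x_star + rng.normal(0.0, np.sqrt(sigma2), size=S)

print("MC mean:", y_samples.mean())   # approximates x*' w_N
print("MC var :", y_samples.var())    # approximates sigma2 + x*' V_N x*
```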
Prediction (cont.)

• Pleasant properties of the Gaussian distribution mean the integration is tractable
  * additive variance, based on how well x* matches the training data
  * cf. the MLE/MAP estimate, where the variance is a fixed constant
$$
\begin{aligned}
p(y_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}, \sigma^2) &= \int p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, \sigma^2)\, p(y_* \mid \mathbf{x}_*, \mathbf{w}, \sigma^2)\, d\mathbf{w} \\
&= \int \mathrm{Normal}(\mathbf{w} \mid \mathbf{w}_N, \mathbf{V}_N)\, \mathrm{Normal}(y_* \mid \mathbf{x}_*'\mathbf{w}, \sigma^2)\, d\mathbf{w} \\
&= \mathrm{Normal}(y_* \mid \mathbf{x}_*'\mathbf{w}_N, \sigma^2_N(\mathbf{x}_*)) \\
\sigma^2_N(\mathbf{x}_*) &= \sigma^2 + \mathbf{x}_*'\mathbf{V}_N\mathbf{x}_*
\end{aligned}
$$

($\mathbf{w}_N$ and $\mathbf{V}_N$ defined in lecture 17)
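The closed form makes the Monte Carlo loop unnecessary; a minimal sketch, reusing the same placeholder values as above:

```python
# Closed-form Bayesian predictive distribution for linear regression.
import numpy as np

w_N = np.array([0.5, -1.2])      # posterior mean (placeholder)
V_N = np.array([[0.10, 0.02],
                [0.02, 0.20]])   # posterior covariance (placeholder)
sigma2 = 0.25                    # known noise variance (placeholder)
x_star = np.array([1.0, 3.0])    # test point

pred_mean = x_star @ w_N                    # x*' w_N
pred_var = sigma2 + x_star @ V_N @ x_star   # sigma2_N(x*)

print("predictive mean:", pred_mean)
print("predictive var :", pred_var)   # larger for x* far from the training data
```

The Monte Carlo estimates above converge to exactly these values as S grows.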
Bayesian Prediction example

[Figure: data y = x sin(x), model = cubic. MLE (blue) and MAP (green) point estimates have fixed variance; samples from the posterior show variance higher further from the data points.]
Caveats

• Assumptions
  * known data-noise parameter, σ²
  * data was drawn from the model distribution
• In real settings, σ² is unknown
  * it has its own conjugate prior: a Normal likelihood ⨉ Inverse Gamma prior results in an Inverse Gamma posterior
  * closed-form predictive distribution, with a Student-t likelihood (see Murphy, 7.6.3)
Bayesian Classification

How can we apply Bayesian ideas to discrete settings?
Generative scenario

• First off, consider models which generate the input
  * cf. discriminative models, which condition on the input
  * i.e., p(x,y) vs p(y|x): Naïve Bayes vs Logistic Regression
• For simplicity, start with the most basic setting
  * n coin tosses, of which k were heads
  * only have x (sequence of outcomes), but no 'classes' y
• Methods apply to generative models over discrete data
  * e.g., topic models, generative classifiers (Naïve Bayes, mixture of multinomials)
Discrete Conjugate prior: Beta-Binomial

• Conjugate priors also exist for discrete spaces
• Consider n coin tosses, of which k were heads
  * let p(head) = q for a single toss (Bernoulli dist.)
  * inference question: is the coin biased, i.e., is q ≈ 0.5?
• For several draws, use the Binomial dist.
  * and its conjugate prior, the Beta dist.
$$p(k \mid n, q) = \binom{n}{k} q^k (1-q)^{n-k}$$

$$p(q) = \mathrm{Beta}(q; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)}\, q^{\alpha-1}(1-q)^{\beta-1}$$
Beta distribution

[Figure: Beta distribution densities. Sourced from https://en.wikipedia.org/wiki/Beta_distribution]
Beta-Binomial conjugacy

Recall the likelihood and prior:

$$p(k \mid n, q) = \binom{n}{k} q^k (1-q)^{n-k}, \qquad p(q) = \mathrm{Beta}(q; \alpha, \beta)$$

Bayesian posterior, using the trick of ignoring constant factors (the normaliser):

$$
\begin{aligned}
p(q \mid k, n) &\propto p(k \mid n, q)\, p(q) \\
&\propto q^k (1-q)^{n-k}\, q^{\alpha-1}(1-q)^{\beta-1} \\
&= q^{k+\alpha-1}(1-q)^{n-k+\beta-1} \\
&\propto \mathrm{Beta}(q; k+\alpha, n-k+\beta)
\end{aligned}
$$

Sweet! We know the normaliser for the Beta.
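A minimal sketch of this conjugate update, using scipy's Beta distribution; the counts n and k are toy values:

```python
# Beta-Binomial conjugate update: the posterior over the coin bias q is
# Beta(k + alpha, n - k + beta).
from scipy import stats

alpha, beta = 1.0, 1.0   # Beta(1,1) = uniform prior over q
n, k = 10, 7             # toy observation: 7 heads in 10 tosses

posterior = stats.beta(k + alpha, n - k + beta)

print("posterior mean       :", posterior.mean())   # (k+alpha)/(n+alpha+beta)
print("95% credible interval:", posterior.interval(0.95))
```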
Laplace's Sunrise Problem

Every morning you observe the sun rising. Based solely on this fact, what's the probability that the sun will rise tomorrow?

• Use the Beta-Binomial, where q is Pr(sun rises in the morning)
  * posterior: $p(q \mid k, n) = \mathrm{Beta}(q; k + \alpha, n - k + \beta)$
  * n = k = age in days
  * let α = β = 1 (uniform prior)
• Under these assumptions

$$p(q \mid k) = \mathrm{Beta}(q; k + 1, 1), \qquad \mathbb{E}_{p(q|k)}[q] = \frac{k+1}{k+2}$$

('smoothed' count of days where the sun rose / did not)
Sunrise Problem (cont.)

Day (n = k)        k+α     n−k+β   E[q]
0                  1       1       0.5
1                  2       1       0.667
2                  3       1       0.75
…
365                366     1       0.997
29200 (80 years)   29201   1       0.99997

Consider a human life-span: the effect of the prior diminishes with data, but never disappears completely.
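The table's final column can be checked in one line per row, since E[q] = (k+1)/(k+2) under these assumptions:

```python
# Verify the sunrise-problem table: alpha = beta = 1 and n = k observed sunrises.
for k in [0, 1, 2, 365, 29200]:
    print(f"day {k}: E[q] = {(k + 1) / (k + 2):.5f}")
```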
Suite of useful conjugate priors

likelihood    conjugate prior                                                    used for
Normal        Normal (for mean)                                                  regression
Normal        Inverse Gamma (for variance) or Inverse Wishart (for covariance)   regression
Binomial      Beta                                                               classification
Multinomial   Dirichlet                                                          classification
Poisson       Gamma                                                              counts
Bayesian Logistic Regression

Discriminative classifier, which conditions on the inputs. How can we do Bayesian inference in this setting?
Now for Logistic Regression…

• Similar problems with parameter uncertainty as for regression
  * although predictive uncertainty is built in to the model outputs
[Murphy Figure 8.5, p. 257: (a) Two-class data in 2d. (b) Log-likelihood for a logistic regression model. The line is drawn from the origin in the direction of the MLE (which is at infinity). The numbers correspond to 4 points in parameter space, corresponding to the lines in (a). (c) Unnormalized log posterior (assuming vague spherical prior). (d) Laplace approximation to posterior. Based on a figure by Mark Girolami. Figure generated by logregLaplaceGirolamiDemo.]
Unfortunately this integral is intractable. The simplest approximation is the plug-in approximation, which, in the binary case, takes the form

$$p(y = 1 \mid \mathbf{x}, \mathcal{D}) \approx p(y = 1 \mid \mathbf{x}, \mathbb{E}[\mathbf{w}]) \qquad (8.60)$$

where $\mathbb{E}[\mathbf{w}]$ is the posterior mean. In this context, $\mathbb{E}[\mathbf{w}]$ is called the Bayes point. Of course, such a plug-in estimate underestimates the uncertainty. We discuss some better approximations below.

(Murphy Figs. 8.5 & 8.6, pp. 257-8)
[Murphy Figure 8.6, p. 258: Posterior predictive distribution for a logistic regression model in 2d. Top left: contours of p(y = 1|x, w_MAP). Top right: samples from the posterior predictive distribution (decision boundaries for sampled w). Bottom left: Monte Carlo averaging over these samples, approximating p(y = 1|x). Bottom right: moderated output (probit approximation). Based on a figure by Mark Girolami. Figure generated by logregLaplaceGirolamiDemo.]
8.4.4.1 Monte Carlo approximation

A better approach is to use a Monte Carlo approximation, as follows:

$$p(y = 1 \mid \mathbf{x}, \mathcal{D}) \approx \frac{1}{S} \sum_{s=1}^{S} \mathrm{sigm}\left((\mathbf{w}^s)^T \mathbf{x}\right) \qquad (8.61)$$

where $\mathbf{w}^s \sim p(\mathbf{w} \mid \mathcal{D})$ are samples from the posterior. (This technique can be trivially extended to the multi-class case.) If we have approximated the posterior using Monte Carlo, we can reuse these samples for prediction. If we made a Gaussian approximation to the posterior, we can draw independent samples from the Gaussian using standard methods. Figure 8.6(b) shows samples from the posterior predictive for our 2d example; Figure 8.6(c) averages over these samples.
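A minimal sketch of this Monte Carlo approximation, assuming a Gaussian (e.g., Laplace) approximation to the posterior; w_map, its covariance, and the test input are hypothetical placeholders:

```python
# Eq. 8.61: average the sigmoid outputs over posterior samples of w.
import numpy as np

rng = np.random.default_rng(0)

w_map = np.array([2.0, 1.0])     # posterior mode (placeholder)
V = np.array([[0.3, 0.0],
              [0.0, 0.3]])       # Gaussian approx. covariance (placeholder)
x = np.array([1.0, 1.0])         # test input (placeholder)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

S = 10_000
w_samples = rng.multivariate_normal(w_map, V, size=S)   # w^s ~ p(w|D)
p_mc = sigm(w_samples @ x).mean()   # (1/S) sum_s sigm((w^s)' x)
p_plugin = sigm(w_map @ x)          # plug-in approximation (Eq. 8.60)

print("MC approx p(y=1|x,D):", p_mc)
print("plug-in   p(y=1|x)  :", p_plugin)   # more extreme than the MC average
```

Averaging over samples pulls the probability towards 0.5 relative to the plug-in estimate, reflecting the parameter uncertainty.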
Now for Logistic Regression…

• Can we use a conjugate prior? E.g.,
  * Beta-Binomial for generative binary models
  * Dirichlet-Multinomial for multiclass (similar formulation)
• Model is discriminative, with parameters defined using the logistic sigmoid*

$$p(y \mid q, \mathbf{x}) = q^y (1-q)^{1-y}, \qquad q = \sigma(\mathbf{x}'\mathbf{w})$$

  * need a prior over w, not q
  * no known conjugate prior (!), thus use a Gaussian prior

* Or softmax for multiclass; the same problems arise, with a similar solution.
Non-conjugacy

• No known solution for the normalising constant

$$
\begin{aligned}
p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}) &\propto p(\mathbf{w})\, p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) \\
&= \mathrm{Normal}(\mathbf{0}, \sigma^2 \mathbf{I}) \prod_{i=1}^{n} \sigma(\mathbf{x}_i'\mathbf{w})^{y_i} \left(1 - \sigma(\mathbf{x}_i'\mathbf{w})\right)^{1-y_i}
\end{aligned}
$$

• Resolve by approximation

Laplace approximation:
• assume the posterior ≃ Normal about its mode
• can compute the normalisation constant, draw samples, etc. (see the sketch below)

(Murphy Fig 8.6, p. 258)
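A minimal sketch of the Laplace approximation on toy data: find the posterior mode by Newton's method (one standard choice), then take the inverse Hessian of the negative log posterior as the Gaussian's covariance. The data, prior variance and iteration count are all illustrative.

```python
# Laplace approximation for Bayesian logistic regression:
# p(w|X,y) ~= Normal(w_map, H^{-1}).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                        # toy inputs
t = X @ np.array([1.0, -1.0]) + rng.normal(0.0, 0.5, size=20)
y = (t > 0).astype(float)                           # toy binary labels
sigma2 = 10.0                                       # prior variance on w

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)
for _ in range(20):                                 # Newton's method
    mu = sigm(X @ w)                                # predicted probabilities
    grad = X.T @ (mu - y) + w / sigma2              # grad of -log p(w|X,y)
    s = mu * (1.0 - mu)                             # Bernoulli variances
    H = (X * s[:, None]).T @ X + np.eye(2) / sigma2 # Hessian of -log p(w|X,y)
    w = w - np.linalg.solve(H, grad)

w_map, V = w, np.linalg.inv(H)
print("w_map:", w_map)
print("approx. posterior covariance:\n", V)   # can feed the MC prediction above
```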
Bayesian Model Selection

Using the evidence to select the best class of model.
Model Selection

• Choosing the best model
  * linear, polynomial order, RBF basis/kernel
  * setting model hyperparameters
  * optimiser settings
  * type of model (e.g., decision tree vs SVM)

Complex models:
• better able to fit the training data
• may fit it too well

Simple models:
• more constrained, poorer fit to the training data
• might be insufficient
Model Selection (frequentist)

• Hold out some data for validation (fixed set, leave-one-out, 10-fold cross-validation, etc.; see the sketch below)
  * treat the held-out error as an estimate of generalisation error
  * the model with the lowest error is chosen
  * might retrain the chosen model on the full dataset
• However, this is
  * data inefficient: must hold aside evaluation data
  * computationally inefficient: repeated rounds of training and evaluating
  * ineffective when selecting many parameters at once (can overfit the held-out set)
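A minimal sketch of this frequentist procedure, choosing a polynomial degree by 10-fold cross-validation on toy data (all settings illustrative):

```python
# k-fold cross-validation for selecting a polynomial degree.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x * np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)   # toy data

def cv_error(degree, folds=10):
    idx = rng.permutation(len(x))
    errs = []
    for fold in np.array_split(idx, folds):          # held-out indices
        train = np.setdiff1d(idx, fold)
        w = np.polyfit(x[train], y[train], degree)   # least-squares fit
        errs.append(np.mean((np.polyval(w, x[fold]) - y[fold]) ** 2))
    return np.mean(errs)                             # generalisation estimate

for d in [1, 2, 3, 5, 9]:
    print(f"degree {d}: CV error = {cv_error(d):.4f}")
```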
Bayesian Model Selection

• Model selection using Bayes rule, to select between competing model classes
  * with $M_i$ as model i and D the dataset, e.g., X or y|X; so for regression p(D) = p(y|X)

$$p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{p(D)}$$

  * let $p(M_i)$ be uniform, i.e., the term is dropped
• Decision between two model classes boils down to a test (known as the Bayes factor)

$$\frac{p(M_1 \mid D)}{p(M_2 \mid D)} = \frac{p(D \mid M_1)}{p(D \mid M_2)} > 1$$
The Evidence: p(D|M) = p(y|X, M)

• Imagine we're considering whether to use a linear or a cubic basis for supervised regression
  * what is p(y|X, M=linear) or p(y|X, M=cubic)?
  * what happened to the parameters w?
• These are integrated out, i.e.,

$$p(\mathbf{y} \mid \mathbf{X}, M = \mathrm{linear}) = \int p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, M = \mathrm{linear})\, p(\mathbf{w})\, d\mathbf{w}$$

  * seen before as the denominator of the posterior, aka the 'marginal likelihood':

$$p(\mathbf{w} \mid \mathbf{X}, \mathbf{y}, M) = \frac{p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, M)\, p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X}, M)}$$
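A minimal sketch of computing this evidence in closed form for Bayesian linear regression with polynomial bases, where marginalising w ~ Normal(0, tau2 I) gives y | X, M ~ Normal(0, tau2 Φ Φ' + sigma2 I); the data follow the lecture's y = x sin(x) example, and the hyperparameters are assumed values:

```python
# Log evidence log p(y|X, M) for polynomial models of different degree.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = x * np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)   # toy data
sigma2, tau2 = 0.1**2, 1.0   # assumed noise and prior variances

def log_evidence(degree):
    Phi = np.vander(x, degree + 1)                      # polynomial basis
    cov = tau2 * Phi @ Phi.T + sigma2 * np.eye(len(x))  # marginal covariance
    return stats.multivariate_normal(np.zeros(len(x)), cov).logpdf(y)

for d in [1, 2, 3, 5, 9]:
    print(f"degree {d}: log evidence = {log_evidence(d):.1f}")
```

The difference in log evidence between two models is the log Bayes factor from the previous slide, so the same computation drives the model-selection test.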
The Evidence: Bayesian Occam's Razor

• How well does the model fit the data, under any (all) parameter settings?
• Flexible (complex) models
  * able to fit many different datasets, by selecting specific parameters
  * most other parameter settings will lead to a poor fit
• Simpler models
  * fit few datasets well
  * less sensitive to parameter values
  * many parameter settings will give a similar fit
Evidence Cartoon (under uniform prior)

• Space of models: linear < quadratic < cubic
• Assuming quadratic data, this is best fit by a quadratic or higher-order model
• As the complexity class grows, the space of models grows too
  * the fraction of parameters offering a 'good' fit to the data will shrink
• Ideally, we would select the quadratic model, as this fraction is greatest

[Figure: nested model spaces, linear ⊂ quadratic ⊂ cubic, with the region of good fit to the given dataset marked inside.]
Summary

• Conjugate prior relationships
  * Normal-Normal, Beta-Binomial
• Bayesian inference
  * parameters are 'nuisance' variables
  * integrated out during inference
• Bayesian classification
  * non-conjugacy necessitates approximation
• Bayesian model selection