Lecture 4: Supervised Learning

Quick Recap
• Graphical models
• Inference
• Decoding for models of structure
• Finally, we get to learning.
– Today, assume a collection of N pairs (x, y); supervised learning with complete data.
Loss
• Let h be a hypothesis (an instantiated, predictive model).
• loss(x, y; h) = a measure of how badly h performs on input x if y is the correct output.
• How to decide what "loss" should be?
1. computational expense
2. knowledge of actual costs of errors
3. formal foundations enabling theoretical guarantees
Risk
• There is some true distribution p* over input, output pairs (X, Y).
• Under that distribution, what do we expect h's loss to be? That expectation is the risk:
risk(h) = E_{p*(X,Y)}[loss(X, Y; h)]
• We don't have p*, but we have the empirical distribution over the N training pairs, giving empirical risk:
(1/N) Σ_i loss(x_i, y_i; h)
Empirical Risk Minimization
• Provides a criterion to decide on h: choose the hypothesis that minimizes empirical risk,
h* = argmin_h (1/N) Σ_i loss(x_i, y_i; h)
• Background preferences over h can be included in regularized empirical risk minimization:
h* = argmin_h (1/N) Σ_i loss(x_i, y_i; h) + R(h)
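To make the objective concrete, here is a minimal Python sketch of regularized empirical risk for a linear hypothesis; the loss function, the data format, and the choice of squared-L2 regularizer are illustrative assumptions, not part of the lecture:

    import numpy as np

    def regularized_empirical_risk(w, data, loss_fn, lam=0.1):
        # (1/N) * sum of per-example losses, plus R(w) = lam * ||w||_2^2.
        empirical_risk = np.mean([loss_fn(x, y, w) for x, y in data])
        return empirical_risk + lam * np.dot(w, w)

    # Hypothetical example: zero-one loss for a linear classifier h_w(x) = sign(w . x).
    def zero_one_loss(x, y, w):
        return 0.0 if np.sign(np.dot(w, x)) == y else 1.0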
Parametric Assumptions
• Typically we do not move in "h-space," but rather in the space of continuously parameterized predictors h_w, indexed by a weight vector w.
Three Kinds of Loss Functions
• Error
– Could be zero-one, or task-specific.
– Mean squared error makes sense for continuous predictions and is used in regression.
• Log loss
– Probabilistic interpretation ("likelihood")
• Hinge loss
– Geometric interpretation ("margin")
Log Loss (First Version)
loss(x, y; h_w) = −log p_w(x, y)
• Maximum likelihood estimation: R(w) is 0 for models in the family, +∞ for other models.
• Maximum a posteriori (MAP) estimation: R(w) is −log p(w).
• Often called generative modeling.
Log Loss (First Version)
Examples:
• N-gram language models
• Supervised HMM taggers
• Charniak, Collins, and Stanford parsers
Log Loss (First Version)
Computationally…
• Convex and differentiable.
• Closed form for directed, multinomial-based models p_w.
– Count and normalize!
• In other cases, requires posterior inference, which can be expensive depending on the model's structure.
• Linear decoding (for some parameterizations).
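A minimal sketch of "count and normalize," here as hypothetical relative-frequency estimation of tag-bigram probabilities (the corpus format and names are assumptions):

    from collections import Counter

    def count_and_normalize(tagged_corpus):
        # MLE for p(tag_t | tag_{t-1}) by relative frequency.
        bigram_counts, prev_counts = Counter(), Counter()
        for tags in tagged_corpus:  # each sentence is a list of tags
            for prev, cur in zip(tags, tags[1:]):
                bigram_counts[(prev, cur)] += 1
                prev_counts[prev] += 1
        return {(p, c): n / prev_counts[p] for (p, c), n in bigram_counts.items()}

    probs = count_and_normalize([["D", "N", "V"], ["D", "N", "N"]])
    # probs[("D", "N")] == 1.0; probs[("N", "V")] == probs[("N", "N")] == 0.5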
Log Loss (First Version)
Error…
• No notion of error.
• Learner wins by moving as much probability mass as possible to training examples.
Log Loss (First Version)
Guarantees…
• Consistency: if the true model is in the right family, enough data will lead you to it.
Log Loss (First Version)
Different parameterizations…
• Multinomials (BN-like): p_w(x, y) is a product of conditional probability tables.
• Global log-linear (MN-like): p_w(x, y) ∝ exp(w⊤g(x, y)), normalized over all (x′, y′).
• Locally normalized log-linear: a product of log-linear factors, each normalized locally.
Reflections on Generative Models
• Most early solutions are generative.
• Most unsupervised approaches are generative.
• Some people only believe in generative models.
• Sometimes estimators are not as easy as they seem ("deficiency").
• Start here if there's a sensible generative story.
– You can always use a "better" loss function with the same linear model later on.
A Diatribe on "Deficiency"
• We use the term "deficiency" to refer to the tendency of generative models to leak probability mass onto ill-formed outcomes.
– Words on top of each other in IBM MT models.
• If our models were unable to generate ill-formed outcomes, we'd have solved NLP!
– "Ill-formed" is in the eye of the beholder.
• The problem is when you add "filtering" steps to the generative story and don't account for them in estimation.
– It's your estimator that is "deficient," not your model.
Zero-One Loss
loss(x, y; h) = 0 if h(x) = y, and 1 otherwise

Zero-One Loss
Computationally:
• Piecewise constant (zero gradient almost everywhere).
Error:
• Zero-one loss is itself a (default) notion of error.
Guarantees: none
Error as Loss
Generalizes zero-one, with the same difficulties. Example: BLEU-score maximization in machine translation, with "MERT" line search.
Comparison
                | Generative (log loss) | Error as loss
Computation     | Convex optimization.  | Optimizing a piecewise constant function.
Error-awareness | None.                 | Built in.
Guarantees      | Consistency.          | None.
Discriminative Learning
• Various loss functions between log loss and error.
• Three commonly used in NLP:
– Conditional log loss ("maxent," CRFs)
– Hinge loss (structural SVMs)
– Perceptron's loss
• We'll discuss each, compare, and unify.
Log Loss (Second Version)
• Can be understood as a generative model over Y, but does not model X.
– "Conditional" model
loss(x, y; h_w) = −log p_w(y | x)
Log Loss (Second Version)
Examples:
• Logistic regression (for classification)
• MEMMs
• CRFs
loss(x, y; h_w) = −log p_w(y | x)
Log Loss (Second Version)
Computationally…
• Convex and differentiable.
• Requires posterior inference, which can be expensive depending on the model's structure.
• Linear decoding (for some parameterizations).
loss(x, y; h_w) = −log p_w(y | x)
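A minimal sketch of this loss and its gradient for a log-linear model over a small, enumerable label set (in structured models the sum over y′ is computed by posterior inference instead); the names and the dense-feature setup are assumptions:

    import numpy as np

    def conditional_log_loss(w, g, x, y, labels):
        # -log p_w(y | x), where p_w(y | x) ∝ exp(w . g(x, y)).
        scores = np.array([w @ g(x, yp) for yp in labels])
        log_Z = np.logaddexp.reduce(scores)  # log Σ_y' exp(w . g(x, y'))
        return log_Z - w @ g(x, y)

    def conditional_log_loss_grad(w, g, x, y, labels):
        # Gradient = E_{p_w(y'|x)}[g(x, y')] − g(x, y).
        scores = np.array([w @ g(x, yp) for yp in labels])
        p = np.exp(scores - np.logaddexp.reduce(scores))  # posterior over labels
        return sum(pi * g(x, yp) for pi, yp in zip(p, labels)) - g(x, y)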
Log Loss (Second Version)
Error…
• No notion of error.
• Learner wins by moving as much probability mass as possible to training examples' correct outputs.
loss(x, y; h_w) = −log p_w(y | x)
Log Loss (Second Version)
Guarantees…
• Consistency: if the true conditional model is in the right family, enough data will lead you to it.
loss(x, y; h_w) = −log p_w(y | x)
Log Loss (Second Version)
Different parameterizations…
• Global log-linear (CRF):
loss(x, y; h_w) = −w⊤g(x, y) + log Σ_{y′} exp(w⊤g(x, y′))
• Locally normalized log-linear (MEMM): p_w(y | x) is a product of locally normalized log-linear factors.
loss(x, y; h_w) = −log p_w(y | x)
Comparing the Two Log Losses

                 | −log p_w(x, y)                  | −log p_w(y | x)
Parameterization | Usually multinomials (BN-like). | Almost always log-linear (MN-like).

Under the usual parameterization…
Computation      | Count and normalize.            | Convex optimization.
Error-awareness  | None.                           | Aware of the Y-prediction task (approximates zero-one).
Guarantees       | Consistency of joint.           | Consistency of conditional.
Hinge Loss
• Penalizes the model for letting competitors get close to the correct answer y.
– Can penalize them in proportion to their error.
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
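A minimal sketch of this loss over an explicitly enumerable candidate set; in structured problems the inner max is done by cost-augmented MAP inference rather than enumeration. Names are illustrative:

    import numpy as np

    def hinge_loss(w, g, error, x, y, candidates):
        # -w.g(x,y) + max_{y'} [ w.g(x,y') + error(y', y) ]
        # Cost-augmented decoding: each competitor's score includes its error.
        best = max(w @ g(x, yp) + error(yp, y) for yp in candidates)
        return best - w @ g(x, y)

Note that with error ≡ 0 this reduces to the perceptron's loss, and the loss is nonnegative whenever y itself is among the candidates (take y′ = y).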
Hinge Loss
Examples…
• Perceptron (including Collins' structured version)
– Classic version ignores the error term
• SVM and some structured variants:
– Max-margin Markov networks (Taskar et al.)
– MIRA (1-best, k-best)
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
Hinge Loss
Computationally…
• Convex, not everywhere differentiable.
– Many specialized techniques now available.
• Requires MAP or "cost-augmented" MAP inference.
• Linear decoding.
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
Hinge Loss
Error…
• Built in.
• Most convenient when the error function factors similarly to the features g.
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
Hinge Loss
Guarantees…
• Generalization bounds.
– Not clear how seriously to take these in NLP; they may not be tight enough to be meaningful.
• Often you will find convergence guarantees for optimization techniques.
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
They Are All Related

                                        | β | γ
Conditional log loss                    | 1 | 0
Perceptron's hinge loss                 | ∞ | 0
Structural SVM's hinge loss             | ∞ | > 0
Softmax-margin (Gimpel and Smith, 2010) | 1 | 1

loss(x, y; h_w) = (1/β) log Σ_{y′} exp( β [ w⊤(g(x, y′) − g(x, y)) + γ · error(y′, y) ] )
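A minimal sketch of this unified loss over an enumerable candidate set (β → ∞ is approximated by a large value; names are illustrative):

    import numpy as np

    def unified_loss(w, g, error, x, y, candidates, beta=1.0, gamma=0.0):
        # (1/β) log Σ_{y'} exp( β [ w.(g(x,y') − g(x,y)) + γ error(y',y) ] )
        a = np.array([w @ (g(x, yp) - g(x, y)) + gamma * error(yp, y)
                      for yp in candidates])
        return np.logaddexp.reduce(beta * a) / beta

    # beta=1, gamma=0: conditional log loss.    beta large, gamma=0: perceptron.
    # beta large, gamma>0: structural SVM hinge. beta=1, gamma=1: softmax-margin.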
CRFs, Max Margin, or Perceptron?
• For supervised problems, we do not expect large differences.
• Perceptron is easiest to implement (a sketch follows below).
– With cost-augmented inference, it should get better and begin to approach MIRA and M3Ns.
• CRFs are best for probability fetishists.
– Probably most appropriate if you are extending with latent variables; the jury is out.
• Not yet "plug and play."
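A minimal sketch of the Collins-style structured perceptron update, assuming numpy feature vectors, outputs comparable by equality (e.g., tag tuples), and a decode function returning argmax_y w⊤g(x, y); all names are illustrative:

    def perceptron_epoch(w, data, g, decode, lr=1.0):
        # One pass of structured perceptron updates.
        for x, y in data:
            y_hat = decode(w, x)  # MAP inference under the current weights
            if y_hat != y:
                # Move toward the gold features, away from the prediction's.
                w = w + lr * (g(x, y) - g(x, y_hat))
        return w

Swapping in cost-augmented decoding for decode gives the "better" variant mentioned above.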
R(w)
• Regularization term – avoid overfitting
– Usually means "avoid large magnitudes in w"
• (Log) prior – respect background beliefs about the predictor h_w
R(w)
• Usual starting point: squared L2 norm
– Computationally convenient (it's strongly convex, it is its own Fenchel conjugate, …)
– Probabilistic view: Gaussian prior on weights (Chen and Rosenfeld, 2000)
– Geometric view: Euclidean distance (the original regularization method in SVMs)
– Only one hyperparameter
R(w) = λ‖w‖₂² = λ Σ_j w_j²
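The probabilistic view is a one-line derivation (standard, though not spelled out here): for a zero-mean Gaussian prior with variance σ² on each weight, −log p(w) = Σ_j w_j² / (2σ²) + constant, which is exactly R(w) = λ‖w‖₂² with λ = 1/(2σ²); a larger prior variance means weaker regularization.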
R(w)
• Another option: the L1 norm
– Computationally less convenient (not everywhere differentiable)
– Probabilistic view: Laplacian prior on weights (originally proposed as the "lasso" in regression)
– Sparsity inducing ("free" feature selection)
R(w) = λ‖w‖₁ = λ Σ_j |w_j|
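One standard way optimizers handle the non-differentiability (a sketch of the proximal-gradient idea, not something prescribed by the lecture): after each gradient step on the loss, apply the L1 proximal operator, soft-thresholding, which is what produces exact zeros:

    import numpy as np

    def soft_threshold(w, t):
        # Proximal operator of t * ||w||_1: shrink each weight toward 0 by t,
        # zeroing any weight whose magnitude is below t.
        return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

    # soft_threshold(np.array([0.3, -1.2, 0.05]), 0.1) -> [0.2, -1.1, 0.0]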
R(w)
• Lots of attention to this in machine learning.
• "Structured sparsity"
– Want groups of features to go to zero, or group-internal sparsity, or …
• Interpolation between L1 and L2 – the "elastic net"
– Sparsity, but maybe better behaved
• This is not yet "plug and play."
– The optimization algorithm is heavily affected.
MAP Learning is Inference
• Seeking the "most probable explanation" of the data, in terms of w.
– Explain the data: p(x, y | w)
– Not too surprising: p(w)
• If we think of W as another random variable, MAP learning is MAP inference.
– Looks very different from decoding!
– But at a high level of abstraction, it is the same.
MAP Learning as a Graphical Model
[Figure: a graphical model over w, Y, and X, with a factor R on w such that exp(−R(w)) = p(w), and factors p_w(Y) and p_w(X | Y).]
• This is a view of learning a "noisy channel" model.
MAP Learning as a Graphical Model
[Figure: a graphical model over w, X, and Y, with the factor R on w, exp(−R(w)) = p(w), and a single factor p_w(Y | X); X is not modeled.]
• This is a view of learning in a CRF.
MAP Estimation for CRFs
• max_w p(w | x, y), which is MAP inference.
• Iterate to obtain the gradient: sufficient statistics from p(y | x, w), obtained by posterior inference.
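A minimal sketch of that loop as gradient descent (illustrative names; grad_loss is assumed to return expected-minus-observed sufficient statistics, with posterior inference happening inside it):

    def map_estimate(w, data, grad_loss, lam=0.1, lr=0.1, epochs=20):
        # Gradient descent on Σ_i −log p(y_i | x_i, w) + lam * ||w||_2^2.
        for _ in range(epochs):
            grad = 2.0 * lam * w  # gradient of the regularizer (−log prior)
            for x, y in data:
                # E_{p(y'|x,w)}[g(x, y')] − g(x, y), via posterior inference.
                grad = grad + grad_loss(w, x, y)
            w = w - lr * grad
        return w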
How To Think About Optimization
• Depending on your choice of loss and R, different approaches become available.
– Learning algorithms can interact with inference/decoding algorithms, too.
• In NLP today, it is probably more important to focus on the features, the error function, and prior knowledge.
– Decide what you want, and then use the best available optimization technique.
Key Techniques
• Quasi-Newton – batch methods for differentiable loss functions
– L-BFGS; OWL-QN when using L1 regularization
• Stochastic subgradient ascent – online
– Generalizes the perceptron, MIRA, and stochastic gradient ascent
– Sometimes sensitive to step size
– Can often use "mini-batches" to speed up convergence
• For error minimization: randomization
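A minimal sketch of the stochastic subgradient method with mini-batches (written as descent on the loss, equivalent to ascent on its negation; the decaying step size and names are assumptions):

    import numpy as np

    def sgd(w, data, subgrad, lr0=0.1, batch_size=16, epochs=10, seed=0):
        # Stochastic subgradient descent on (1/N) Σ_i loss(x_i, y_i; w).
        rng = np.random.default_rng(seed)
        t = 0
        for _ in range(epochs):
            order = rng.permutation(len(data))
            for start in range(0, len(data), batch_size):
                batch = [data[i] for i in order[start:start + batch_size]]
                gbar = sum(subgrad(w, x, y) for x, y in batch) / len(batch)
                t += 1
                w = w - (lr0 / np.sqrt(t)) * gbar  # decaying step size
        return w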
Pitfalls
• Engineering online learning procedures is tempting and may help you get better performance.
– Without at least some analysis in terms of loss, error, and regularization, it's unlikely to be useful outside your problem/dataset.
• When randomization is involved, look at variance across runs (Clark et al., 2011).
• Always tune hyperparameters (e.g., regularization strength) on development data!
Major Topics in Current Work
• Coping with approximate inference
• Exploiting incomplete data
– Semisupervised learning
– Creating features from raw text
– Latent variable models (discussed tomorrow)
• Feature management
– Structured sparsity (R)