Lecture 4: Supervised Learning

Quick Recap
• Graphical models
• Inference
• Decoding for models of structure
• Finally, we get to learning.
– Today, assume a collection of N pairs (x, y); supervised learning with complete data.
Loss
• Let h be a hypothesis (an instantiated, predictive model).
• loss(x, y; h) = a measure of how badly h performs on input x if y is the correct output.
• How to decide what "loss" should be?
1. computational expense
2. knowledge of actual costs of errors
3. formal foundations enabling theoretical guarantees
Risk
• There is some true distribution p* over input, output pairs (X, Y).
• Under that distribution, what do we expect h's loss to be? That expectation is the risk:
risk(h) = E_{p*(X,Y)}[loss(X, Y; h)]
• We don't have p*, but we have the empirical distribution over the N training pairs, giving empirical risk:
(1/N) Σ_i loss(x_i, y_i; h)
Empirical Risk Minimization
• Provides a criterion to decide on h: choose the hypothesis that minimizes empirical risk,
h* = argmin_h (1/N) Σ_i loss(x_i, y_i; h)
• Background preferences over h can be included in regularized empirical risk minimization:
h* = argmin_h (1/N) Σ_i loss(x_i, y_i; h) + R(h)
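To make the objective concrete, here is a minimal Python sketch of regularized empirical risk for a linear hypothesis; the loss function, the data format, and the choice of squared-L2 regularizer are illustrative assumptions, not part of the lecture:

    import numpy as np

    def regularized_empirical_risk(w, data, loss_fn, lam=0.1):
        # (1/N) * sum of per-example losses, plus R(w) = lam * ||w||_2^2.
        empirical_risk = np.mean([loss_fn(x, y, w) for x, y in data])
        return empirical_risk + lam * np.dot(w, w)

    # Hypothetical example: zero-one loss for a linear classifier h_w(x) = sign(w . x).
    def zero_one_loss(x, y, w):
        return 0.0 if np.sign(np.dot(w, x)) == y else 1.0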
Parametric Assumptions
• Typically we do not move in "h-space," but rather in the space of continuously parameterized predictors h_w, indexed by a weight vector w.
Three Kinds of Loss Functions
• Error
– Could be zero-one, or task-specific.
– Mean squared error makes sense for continuous predictions and is used in regression.
• Log loss
– Probabilistic interpretation ("likelihood")
• Hinge loss
– Geometric interpretation ("margin")
Log Loss (First Version)
loss(x, y; h_w) = −log p_w(x, y)
• Maximum likelihood estimation: R(w) is 0 for models in the family, +∞ for other models.
• Maximum a posteriori (MAP) estimation: R(w) is −log p(w).
• Often called generative modeling.
Log Loss (First Version)
Examples:
• N-gram language models
• Supervised HMM taggers
• Charniak, Collins, and Stanford parsers
Log Loss (First Version)
Computationally…
• Convex and differentiable.
• Closed form for directed, multinomial-based models p_w.
– Count and normalize!
• In other cases, requires posterior inference, which can be expensive depending on the model's structure.
• Linear decoding (for some parameterizations).
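A minimal sketch of "count and normalize," here as hypothetical relative-frequency estimation of tag-bigram probabilities (the corpus format and names are assumptions):

    from collections import Counter

    def count_and_normalize(tagged_corpus):
        # MLE for p(tag_t | tag_{t-1}) by relative frequency.
        bigram_counts, prev_counts = Counter(), Counter()
        for tags in tagged_corpus:  # each sentence is a list of tags
            for prev, cur in zip(tags, tags[1:]):
                bigram_counts[(prev, cur)] += 1
                prev_counts[prev] += 1
        return {(p, c): n / prev_counts[p] for (p, c), n in bigram_counts.items()}

    probs = count_and_normalize([["D", "N", "V"], ["D", "N", "N"]])
    # probs[("D", "N")] == 1.0; probs[("N", "V")] == probs[("N", "N")] == 0.5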
Log Loss (First Version)
Error…
• No notion of error.
• Learner wins by moving as much probability mass as possible to training examples.
Log Loss (First Version)
Guarantees…
• Consistency: if the true model is in the right family, enough data will lead you to it.
Log Loss (First Version)
Different parameterizations…
• Multinomials (BN-like): p_w(x, y) is a product of conditional probability tables.
• Global log-linear (MN-like): p_w(x, y) ∝ exp(w⊤g(x, y)), normalized over all (x′, y′).
• Locally normalized log-linear: a product of log-linear factors, each normalized locally.
Reflections on Generative Models
• Most early solutions are generative.
• Most unsupervised approaches are generative.
• Some people only believe in generative models.
• Sometimes estimators are not as easy as they seem ("deficiency").
• Start here if there's a sensible generative story.
– You can always use a "better" loss function with the same linear model later on.
A Diatribe on "Deficiency"
• We use the term "deficiency" to refer to the tendency of generative models to leak probability mass onto ill-formed outcomes.
– Words on top of each other in IBM MT models.
• If our models were unable to generate ill-formed outcomes, we'd have solved NLP!
– "Ill-formed" is in the eye of the beholder.
• The problem is when you add "filtering" steps to the generative story and don't account for them in estimation.
– It's your estimator that is "deficient," not your model.
Zero-One Loss
loss(x, y; h) = 0 if h(x) = y, and 1 otherwise

Zero-One Loss
Computationally:
• Piecewise constant (zero gradient almost everywhere).
Error:
• Zero-one loss is itself a (default) notion of error.
Guarantees: none
Error as Loss
Generalizes zero-one, with the same difficulties. Example: BLEU-score maximization in machine translation, with "MERT" line search.
Comparison
                | Generative (log loss) | Error as loss
Computation     | Convex optimization.  | Optimizing a piecewise constant function.
Error-awareness | None.                 | Built in.
Guarantees      | Consistency.          | None.
Discriminative Learning
• Various loss functions between log loss and error.
• Three commonly used in NLP:
– Conditional log loss ("maxent," CRFs)
– Hinge loss (structural SVMs)
– Perceptron's loss
• We'll discuss each, compare, and unify.
Log Loss (Second Version)
• Can be understood as a generative model over Y, but does not model X.
– "Conditional" model
loss(x, y; h_w) = −log p_w(y | x)
Log Loss (Second Version)
Examples:
• Logistic regression (for classification)
• MEMMs
• CRFs
loss(x, y; h_w) = −log p_w(y | x)
Log Loss (Second Version)
Computationally…
• Convex and differentiable.
• Requires posterior inference, which can be expensive depending on the model's structure.
• Linear decoding (for some parameterizations).
loss(x, y; h_w) = −log p_w(y | x)
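A minimal sketch of this loss and its gradient for a log-linear model over a small, enumerable label set (in structured models the sum over y′ is computed by posterior inference instead); the names and the dense-feature setup are assumptions:

    import numpy as np

    def conditional_log_loss(w, g, x, y, labels):
        # -log p_w(y | x), where p_w(y | x) ∝ exp(w . g(x, y)).
        scores = np.array([w @ g(x, yp) for yp in labels])
        log_Z = np.logaddexp.reduce(scores)  # log Σ_y' exp(w . g(x, y'))
        return log_Z - w @ g(x, y)

    def conditional_log_loss_grad(w, g, x, y, labels):
        # Gradient = E_{p_w(y'|x)}[g(x, y')] − g(x, y).
        scores = np.array([w @ g(x, yp) for yp in labels])
        p = np.exp(scores - np.logaddexp.reduce(scores))  # posterior over labels
        return sum(pi * g(x, yp) for pi, yp in zip(p, labels)) - g(x, y)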
Log Loss (Second Version)
Error…
• No notion of error.
• Learner wins by moving as much probability mass as possible to training examples' correct outputs.
loss(x, y; h_w) = −log p_w(y | x)
Log Loss (Second Version)
Guarantees…
• Consistency: if the true conditional model is in the right family, enough data will lead you to it.
loss(x, y; h_w) = −log p_w(y | x)
Log Loss (Second Version)
Different parameterizations…
• Global log-linear (CRF):
loss(x, y; h_w) = −w⊤g(x, y) + log Σ_{y′} exp(w⊤g(x, y′))
• Locally normalized log-linear (MEMM): p_w(y | x) is a product of locally normalized log-linear factors.
loss(x, y; h_w) = −log p_w(y | x)
Comparing the Two Log Losses

                 | −log p_w(x, y)                  | −log p_w(y | x)
Parameterization | Usually multinomials (BN-like). | Almost always log-linear (MN-like).

Under the usual parameterization…
Computation      | Count and normalize.            | Convex optimization.
Error-awareness  | None.                           | Aware of the Y-prediction task (approximates zero-one).
Guarantees       | Consistency of joint.           | Consistency of conditional.
Hinge Loss
• Penalizes the model for letting competitors get close to the correct answer y.
– Can penalize them in proportion to their error.
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
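A minimal sketch of this loss over an explicitly enumerable candidate set; in structured problems the inner max is done by cost-augmented MAP inference rather than enumeration. Names are illustrative:

    import numpy as np

    def hinge_loss(w, g, error, x, y, candidates):
        # -w.g(x,y) + max_{y'} [ w.g(x,y') + error(y', y) ]
        # Cost-augmented decoding: each competitor's score includes its error.
        best = max(w @ g(x, yp) + error(yp, y) for yp in candidates)
        return best - w @ g(x, y)

Note that with error ≡ 0 this reduces to the perceptron's loss, and the loss is nonnegative whenever y itself is among the candidates (take y′ = y).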
Hinge Loss
Examples…
• Perceptron (including Collins' structured version)
– Classic version ignores the error term
• SVM and some structured variants:
– Max-margin Markov networks (Taskar et al.)
– MIRA (1-best, k-best)
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
Hinge Loss
Computationally…
• Convex, not everywhere differentiable.
– Many specialized techniques now available.
• Requires MAP or "cost-augmented" MAP inference.
• Linear decoding.
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
Hinge Loss
Error…
• Built in.
• Most convenient when the error function factors similarly to the features g.
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
Hinge Loss
Guarantees…
• Generalization bounds.
– Not clear how seriously to take these in NLP; they may not be tight enough to be meaningful.
• Often you will find convergence guarantees for optimization techniques.
loss(x, y; h_w) = −w⊤g(x, y) + max_{y′} [ w⊤g(x, y′) + error(y′, y) ]
They Are All Related

                                        | β | γ
Conditional log loss                    | 1 | 0
Perceptron's hinge loss                 | ∞ | 0
Structural SVM's hinge loss             | ∞ | > 0
Softmax-margin (Gimpel and Smith, 2010) | 1 | 1

loss(x, y; h_w) = (1/β) log Σ_{y′} exp( β [ w⊤(g(x, y′) − g(x, y)) + γ · error(y′, y) ] )
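A minimal sketch of this unified loss over an enumerable candidate set (β → ∞ is approximated by a large value; names are illustrative):

    import numpy as np

    def unified_loss(w, g, error, x, y, candidates, beta=1.0, gamma=0.0):
        # (1/β) log Σ_{y'} exp( β [ w.(g(x,y') − g(x,y)) + γ error(y',y) ] )
        a = np.array([w @ (g(x, yp) - g(x, y)) + gamma * error(yp, y)
                      for yp in candidates])
        return np.logaddexp.reduce(beta * a) / beta

    # beta=1, gamma=0: conditional log loss.    beta large, gamma=0: perceptron.
    # beta large, gamma>0: structural SVM hinge. beta=1, gamma=1: softmax-margin.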
CRFs, Max Margin, or Perceptron?
• For supervised problems, we do not expect large differences.
• Perceptron is easiest to implement (a sketch follows below).
– With cost-augmented inference, it should get better and begin to approach MIRA and M3Ns.
• CRFs are best for probability fetishists.
– Probably most appropriate if you are extending with latent variables; the jury is out.
• Not yet "plug and play."
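A minimal sketch of the Collins-style structured perceptron update, assuming numpy feature vectors, outputs comparable by equality (e.g., tag tuples), and a decode function returning argmax_y w⊤g(x, y); all names are illustrative:

    def perceptron_epoch(w, data, g, decode, lr=1.0):
        # One pass of structured perceptron updates.
        for x, y in data:
            y_hat = decode(w, x)  # MAP inference under the current weights
            if y_hat != y:
                # Move toward the gold features, away from the prediction's.
                w = w + lr * (g(x, y) - g(x, y_hat))
        return w

Swapping in cost-augmented decoding for decode gives the "better" variant mentioned above.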
R(w)
• Regularization term – avoid overfitting
– Usually means "avoid large magnitudes in w"
• (Log) prior – respect background beliefs about the predictor h_w
R(w)
• Usual starting point: squared L2 norm
– Computationally convenient (it's strongly convex, it is its own Fenchel conjugate, …)
– Probabilistic view: Gaussian prior on weights (Chen and Rosenfeld, 2000)
– Geometric view: Euclidean distance (the original regularization method in SVMs)
– Only one hyperparameter
R(w) = λ‖w‖₂² = λ Σ_j w_j²
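The probabilistic view is a one-line derivation (standard, though not spelled out here): for a zero-mean Gaussian prior with variance σ² on each weight, −log p(w) = Σ_j w_j² / (2σ²) + constant, which is exactly R(w) = λ‖w‖₂² with λ = 1/(2σ²); a larger prior variance means weaker regularization.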
R(w)
• Another option: the L1 norm
– Computationally less convenient (not everywhere differentiable)
– Probabilistic view: Laplacian prior on weights (originally proposed as the "lasso" in regression)
– Sparsity inducing ("free" feature selection)
R(w) = λ‖w‖₁ = λ Σ_j |w_j|
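One standard way optimizers handle the non-differentiability (a sketch of the proximal-gradient idea, not something prescribed by the lecture): after each gradient step on the loss, apply the L1 proximal operator, soft-thresholding, which is what produces exact zeros:

    import numpy as np

    def soft_threshold(w, t):
        # Proximal operator of t * ||w||_1: shrink each weight toward 0 by t,
        # zeroing any weight whose magnitude is below t.
        return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

    # soft_threshold(np.array([0.3, -1.2, 0.05]), 0.1) -> [0.2, -1.1, 0.0]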
R(w)
• Lots of attention to this in machine learning.
• "Structured sparsity"
– Want groups of features to go to zero, or group-internal sparsity, or …
• Interpolation between L1 and L2 – the "elastic net"
– Sparsity, but maybe better behaved
• This is not yet "plug and play."
– The optimization algorithm is heavily affected.
MAP Learning is Inference
• Seeking the "most probable explanation" of the data, in terms of w.
– Explain the data: p(x, y | w)
– Not too surprising: p(w)
• If we think of W as another random variable, MAP learning is MAP inference.
– Looks very different from decoding!
– But at a high level of abstraction, it is the same.
MAP Learning as a Graphical Model
[Figure: a graphical model over w, Y, and X, with a factor R on w such that exp(−R(w)) = p(w), and factors p_w(Y) and p_w(X | Y).]
• This is a view of learning a "noisy channel" model.
MAP Learning as a Graphical Model
[Figure: a graphical model over w, X, and Y, with the factor R on w, exp(−R(w)) = p(w), and a single factor p_w(Y | X); X is not modeled.]
• This is a view of learning in a CRF.
MAP Estimation for CRFs
• max_w p(w | x, y), which is MAP inference.
• Iterate to obtain the gradient: sufficient statistics from p(y | x, w), obtained by posterior inference.
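A minimal sketch of that loop as gradient descent (illustrative names; grad_loss is assumed to return expected-minus-observed sufficient statistics, with posterior inference happening inside it):

    def map_estimate(w, data, grad_loss, lam=0.1, lr=0.1, epochs=20):
        # Gradient descent on Σ_i −log p(y_i | x_i, w) + lam * ||w||_2^2.
        for _ in range(epochs):
            grad = 2.0 * lam * w  # gradient of the regularizer (−log prior)
            for x, y in data:
                # E_{p(y'|x,w)}[g(x, y')] − g(x, y), via posterior inference.
                grad = grad + grad_loss(w, x, y)
            w = w - lr * grad
        return w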
How To Think About Optimization
• Depending on your choice of loss and R, different approaches become available.
– Learning algorithms can interact with inference/decoding algorithms, too.
• In NLP today, it is probably more important to focus on the features, the error function, and prior knowledge.
– Decide what you want, and then use the best available optimization technique.
Key Techniques
• Quasi-Newton – batch methods for differentiable loss functions
– L-BFGS; OWL-QN when using L1 regularization
• Stochastic subgradient ascent – online
– Generalizes the perceptron, MIRA, and stochastic gradient ascent
– Sometimes sensitive to step size
– Can often use "mini-batches" to speed up convergence
• For error minimization: randomization
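A minimal sketch of the stochastic subgradient method with mini-batches (written as descent on the loss, equivalent to ascent on its negation; the decaying step size and names are assumptions):

    import numpy as np

    def sgd(w, data, subgrad, lr0=0.1, batch_size=16, epochs=10, seed=0):
        # Stochastic subgradient descent on (1/N) Σ_i loss(x_i, y_i; w).
        rng = np.random.default_rng(seed)
        t = 0
        for _ in range(epochs):
            order = rng.permutation(len(data))
            for start in range(0, len(data), batch_size):
                batch = [data[i] for i in order[start:start + batch_size]]
                gbar = sum(subgrad(w, x, y) for x, y in batch) / len(batch)
                t += 1
                w = w - (lr0 / np.sqrt(t)) * gbar  # decaying step size
        return w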
Pitfalls
• Engineering online learning procedures is tempting and may help you get better performance.
– Without at least some analysis in terms of loss, error, and regularization, it's unlikely to be useful outside your problem/dataset.
• When randomization is involved, look at variance across runs (Clark et al., 2011).
• Always tune hyperparameters (e.g., regularization strength) on development data!
Major Topics in Current Work
• Coping with approximate inference
• Exploiting incomplete data
– Semisupervised learning
– Creating features from raw text
– Latent variable models (discussed tomorrow)
• Feature management
– Structured sparsity (R)