Plan for Today
1. Tree kernels (Collins and Duffy, 2002)
2. Why (input, output) and output kernels aren't really available
3. Reranking
4. Kernelizing CRFs
5. Rational kernels
6. Kernel dependency estimation
Kernels on Structures
• Last time, William talked about kernels on factorial objects (tree paths), and also about string kernels.
  – I did not mention it in September, but the M3N paper (generalizing SVMs to structured outputs) uses kernels as well, on inputs.
• The idea generalizes nicely to trees.
• Key assumption: learning and inference can be accomplished if we can efficiently calculate f(x)ᵀf(x′), where f is our implied feature space.
A Bit of History: “DOP”
• Data-oriented parsing: a bad name for an interesting idea (Bod, 1998).
  – Every contiguous subtree is a feature.
  – Lots of papers on how to do this efficiently.
  – Most closely related to memory-based or instance-based learning (along the lines of KNN).
  – Goodman (1996) approximated it with a PCFG.
• The part to remember: every tree fragment is a feature.
• Related to tree substitution grammar.
All Tree Fragments Feature Vector
• Every possible fragment corresponds to a dimension in the vector f(x).
• fᵢ(x) = the number of times the ith fragment occurs in x.
• f(x)ᵀf(x′) = the number of exactly matching fragment tokens in x and x′ (see the counting sketch below).
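To see why this feature space is too big to enumerate, here is a minimal sketch (my illustration, not from the slides) that counts fragments rooted at a node, using a toy tree encoding: an internal node is (label, (children,)), and a preterminal's only child is a word string.

```python
def num_fragments(node):
    """Number of fragments rooted at this node: a preterminal contributes 1;
    otherwise each child subtree can be cut off or expanded in any of its ways."""
    label, children = node
    if all(isinstance(c, str) for c in children):   # preterminal (tag -> word)
        return 1
    total = 1
    for child in children:
        total *= 1 + num_fragments(child)           # cut here, or recurse
    return total

# (S (NP (D the) (N cat)) (VP (V sleeps))) has 15 fragments rooted at S.
tree = ("S", (("NP", (("D", ("the",)), ("N", ("cat",)))),
              ("VP", (("V", ("sleeps",)),))))
print(num_fragments(tree))   # -> 15; the count grows exponentially with depth
```

The (1 + …) product is exactly the structure that reappears in the kernel recursion on the next slide.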
Tree Kernel (Collins and Duffy, 2002)

$$
\begin{aligned}
f(x)^\top f(x') &= \sum_i f_i(x)\, f_i(x') \\
&= \sum_i \Big(\sum_{n \in x} \underbrace{[\text{$i$th fragment matches at } n]}_{I_i(n)}\Big)
          \Big(\sum_{n' \in x'} \underbrace{[\text{$i$th fragment matches at } n']}_{I_i(n')}\Big) \\
&= \sum_i \sum_{n \in x} I_i(n) \sum_{n' \in x'} I_i(n') \\
&= \sum_{n \in x} \sum_{n' \in x'} \underbrace{\sum_i I_i(n)\, I_i(n')}_{\Delta(n, n')}
\end{aligned}
$$

$$
\Delta(n, n') =
\begin{cases}
0 & \text{if the productions at } n \text{ and } n' \text{ differ} \\
1 & \text{if they match and } n, n' \text{ are preterminals} \\
\displaystyle\prod_{j=1}^{\#\text{kids}(n)} \big(1 + \Delta(j\text{th child of } n,\ j\text{th child of } n')\big) & \text{otherwise}
\end{cases}
$$
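A compact sketch of this recursion, reusing the tuple trees from the fragment-counting example above; with memoization the work is proportional to the number of node pairs.

```python
from functools import cache

def nodes(tree):
    """Yield every node (internal and preterminal) of a tuple tree."""
    yield tree
    for child in tree[1]:
        if not isinstance(child, str):
            yield from nodes(child)

def production(node):
    """A node's production: its label plus the sequence of child labels."""
    label, children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

@cache
def delta(n, n2):
    if production(n) != production(n2):
        return 0                     # productions differ
    if all(isinstance(c, str) for c in n[1]):
        return 1                     # matching preterminals
    result = 1
    for c, c2 in zip(n[1], n2[1]):
        result *= 1 + delta(c, c2)   # product over aligned children
    return result

def tree_kernel(x, x2):
    """f(x)^T f(x') = sum of Delta over all node pairs."""
    return sum(delta(n, n2) for n in nodes(x) for n2 in nodes(x2))
```

As a sanity check, tree_kernel(tree, tree) for the example tree above returns 24, the total number of fragments rooted anywhere in it.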
Notes
• O(|x||x′|) runtime, where |x| and |x′| are the numbers of nodes in the two trees.
  – Collins and Duffy claim it's closer to linear in practice.
• Labeled sequences are a kind of tree.
• You can use word similarity functions instead of 0/1 for matching words.
• Collins and Duffy used the Collins parser (model 2) to:
  – provide a likelihood to use alongside the kernel as a feature
  – provide "multiple hypotheses" for use in the voted perceptron algorithm
• Parsing gains on the WSJ Penn Treebank task.
“Multiple Hypotheses”?
• Structured perceptron as we learned it (and also CRF, SSVM, etc.) assumes we reason about the entire set of possible outputs y for each input x.
  – Decoding, summing, cost-augmented decoding.
• Here, a reranking approach is assumed.
  – Use some other model to provide candidates.
  – The discriminative, kernelized model (here, a perceptron in the dual) only gets to rerank candidates; see the sketch below.
  – Charniak and Johnson (2005) ran with the reranking idea but went back to log-linear models, and by engineering good features did quite well.
• Reranking: a popular idea in the early 2000s, regardless of whether you use kernels.
• Understudied challenge: diversity of the n-best list.
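A minimal sketch of a perceptron reranker in the dual. The setup is my assumption, not Collins and Duffy's code: `train` is a list of (candidates, gold_index) pairs, where candidates is an n-best list of hashable structures (e.g., the tuple trees above), and `K` is a kernel such as tree_kernel.

```python
def dual_score(alpha, K, cand):
    # Primal: w = sum of (f(gold) - f(pred)) updates; so in the dual,
    # score(cand) = sum over updates of count * (K(gold, cand) - K(pred, cand)).
    return sum(a * (K(g, cand) - K(p, cand)) for (g, p), a in alpha.items())

def train_reranker(train, K, epochs=5):
    alpha = {}                                   # (gold, pred) -> update count
    for _ in range(epochs):
        for candidates, gold in train:
            pred = max(range(len(candidates)),
                       key=lambda j: dual_score(alpha, K, candidates[j]))
            if pred != gold:
                key = (candidates[gold], candidates[pred])
                alpha[key] = alpha.get(key, 0) + 1
    return alpha
```

Note that the model never decodes: it only scores the handful of candidates the base parser supplies.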
Grumpy Aside: Know Thy Kernel
• Kernel = set of features.
• You're pretty much always using a kernel.
• Empirically, it seems that:
  – knowing your problem and designing good features to add to your "kernel" is a win;
  – trying all the different kernels implemented in SVMlight (without understanding the differences) may help a little, but nobody cares.
• For language, anything beyond a linear kernel usually needs some justification.
Kernels and Decoding
• Ideally, we would like kernels on entire inputs and outputs, as in Collins and Duffy, but learned directly from the data, not in a secondary reranking stage.
• Why won't this work?
$$
\begin{aligned}
\text{decode}(x) &= \arg\max_{y} \; \mathbf{w}^\top f(x, y) \\
&= \arg\max_{y} \; \sum_{i=1}^{N} \sum_{y' \in \mathcal{Y}(x_i)} \alpha_{i, y'} \, K((x_i, y'), (x, y))
\end{aligned}
$$

• The argmax ranges over an exponential set of outputs y, and a general kernelized score does not decompose over parts of y, so dynamic-programming decoding no longer applies.
Kernels on Outputs
• In practice, apart from reranking, this is not done yet.
• There are a few interesting papers that explore various possibilities, and I want to discuss some of them.
  – Kernel CRFs
  – Rational kernels
  – Kernel dependency estimation
Kernel CRFs (Lafferty et al., 2004)
• Don't try for an arbitrary K((x, y), (x′, y′)).
• Instead, define your structure y as an assignment of values to variables Y in a Markov network.
• Kernels are now on cliques: K((x, y_c), (x′, y′_{c′})).
  – Any two clique assignments in any two graphs.
• Representer theorem: in the model that maximizes the regularized log-likelihood:
$$
\text{score}(x, y) = \sum_{i=1}^{N} \;\; \sum_{c \,\in\, \text{cliques}(\text{graph}(x_i))} \;\; \sum_{y'_c \in \mathcal{Y}_c} \alpha_{i, c, y'_c} \, K_c\big((x_i, y'_c), (x, y_c)\big)
$$
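An illustrative rendering of this score, specialized to a linear chain where cliques are adjacent label pairs. The names and the exact signature of `Kc` are my assumptions, not Lafferty et al.'s implementation.

```python
def chain_cliques(x):
    """Cliques of a linear-chain graph over positions of x."""
    return [(t, t + 1) for t in range(len(x) - 1)]

def kcrf_score(x, y, active, Kc):
    """score(x, y) = sum over selected training cliques (i, c_i, y'_c) of
    alpha * Kc((x_i, c_i, y'_c), (x, c, y_c))."""
    total = 0.0
    for (xi, ci, yci), a in active:              # [(triple, alpha), ...]
        for c in chain_cliques(x):
            yc = tuple(y[t] for t in c)          # this clique's labeling in y
            total += a * Kc((xi, ci, yci), (x, c, yc))
    return total
```

Because the kernel lives on cliques rather than whole outputs, the score still decomposes, so standard dynamic-programming decoding remains possible.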
Learning Algorithm
• Too many cliques!
• Greedy forward selection (much like older feature selection algorithms, e.g., Della Pietra et al., 1997).
• Basic idea is to iterate:
  – For every labeled clique in the training data, calculate the first derivative of the objective (regularized log-likelihood) with respect to the clique.
    • This is done approximately, for efficiency.
  – Add the clique with the largest gradient to the active set.
  – Optimize likelihood for the current active set of cliques; this is done in the dual.
But…
• This technique is not widely used.
• In NLP, most reported results stick with linear kernels; lots of results include some "feature engineering."
  – Some researchers see "feature engineering" as good, honest work.
  – Others see it as a distraction from "general" methods.
  – What do you think?
Rational Kernels (Cortes et al., 2004)
• Under some conditions, you can use WFSTs to define a kernel between strings.
  – Or between sets of strings represented as FSAs.
• The kernel function is defined by doing weighted composition x ∘ T ∘ y, and then taking the semiring path sum (a concrete instance follows below).
  – Edit distance uses min-plus.
  – String kernels use plus-times.
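One concrete instance, added here for illustration: the bigram kernel, a rational kernel where the transducer T emits each bigram of its input. The plus-times path sum of x ∘ T ∘ T⁻¹ ∘ y reduces to a dot product of bigram counts, which we can compute directly.

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def bigram_kernel(x, y):
    bx, by = bigrams(x), bigrams(y)
    return sum(bx[g] * by[g] for g in bx)    # plus-times: sum of products

print(bigram_kernel("the cat sat".split(), "the cat ran".split()))  # -> 1
```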
PDS Kernels
• Not all kernels are positive definite and symmetric.
  – Those are necessary conditions for learning algorithms to "work" with a kernel (a quick empirical check is sketched below).
• Cortes et al. define some formal properties (closure under various operations).
• They characterize some existing kernels as PDS.
• Experiments are included, but not for structured outputs.
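A quick empirical sanity check (my addition): build a Gram matrix on a few samples and confirm symmetry and nonnegative eigenvalues. Passing this is necessary but not sufficient for a kernel to be PDS in general.

```python
import numpy as np

def looks_pds(kernel, samples, tol=1e-8):
    K = np.array([[kernel(a, b) for b in samples] for a in samples], float)
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

sents = [s.split() for s in ("a b c", "a b a", "c b a")]
print(looks_pds(bigram_kernel, sents))   # -> True
```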
PCA and Kernel PCA
• Principal component analysis (Pearson, 1901): transform multi-dimensional data into uncorrelated dimensions.
  – Eigenvalue decomposition of the covariance matrix
  – Singular value decomposition of the data matrix
• Kernel PCA (Schoelkopf et al., 1998): do it in an RKHS!
  – Only inner products are needed (see the sketch below).
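A minimal kernel PCA sketch (my simplification of Schoelkopf et al., 1998): center the Gram matrix in the implicit feature space, eigendecompose, and read off coordinates. Only inner products, i.e., entries of K, ever appear.

```python
import numpy as np

def kernel_pca(K, num_components):
    """K: n x n Gram matrix; returns n x num_components coordinates."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Kc = H @ K @ H                            # center in the implicit feature space
    vals, vecs = np.linalg.eigh(Kc)           # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # sort descending
    d = num_components
    # scale eigenvectors so each component's RKHS expansion has unit norm
    alphas = vecs[:, :d] / np.sqrt(np.maximum(vals[:d], 1e-12))
    return Kc @ alphas                        # projections of the n points
```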
Kernel Dependency Estimation (Weston et al., 2003)
For now, imagine just kernels on outputs, K(y, y′).
[Figure: schematic of kernel dependency estimation. Inputs X map to outputs Y; outputs are embedded in an "output feature space" via the kernel PCA map (principal axes in the RKHS feature space); multivariate regression predicts output-feature coordinates from inputs; mapping a predicted point back to an actual output is the "pre-image" problem.]
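A rough sketch of that pipeline, reusing kernel_pca above, under strong simplifications that are mine, not Weston et al.'s: plain ridge regression from raw input vectors to the output-kernel-PCA coordinates, and a pre-image step that just searches the training outputs (Weston et al. use kernel ridge regression and more careful pre-image solutions).

```python
import numpy as np

def kde_fit(X, Ky, num_components, ridge=1e-3):
    """X: n x p input vectors; Ky: n x n Gram matrix over training outputs."""
    Z = kernel_pca(Ky, num_components)        # output feature coordinates
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Z)
    return W, Z

def kde_predict(x, W, Z, outputs):
    """Regress into the output feature space, then take the nearest training
    output as an (approximate) pre-image."""
    z_hat = x @ W
    return outputs[int(np.linalg.norm(Z - z_hat, axis=1).argmin())]
```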
Punchline
• You should understand that kernels are a formalization of the notion of features.
• Abstracting features into a kernel can open up the possibility of using some cool learning algorithms.
• But you run the risk of getting too far from the data and application.
• Kernels on the output side create significant computational challenges that remain to be solved for practical use.
Recommended