Plan for Today
1. Tree kernels (Collins and Duffy, 2002)
2. Why (input, output) and output kernels aren't really available
3. Reranking
4. Kernelizing CRFs
5. Rational kernels
6. Kernel dependency estimation
Kernels on Structures
• Last time, William talked about kernels on factorial objects (tree paths), and also about string kernels.
  – I did not mention it in September, but the M3N paper (generalizing SVMs to structured outputs) uses kernels as well, on inputs.
• The idea generalizes nicely to trees.
• Key assumption: learning and inference can be accomplished if we can efficiently calculate f(x)ᵀf(x′), where f is our implied feature space.
A Bit of History: “DOP”
• Data-oriented parsing: a bad name for an interesting idea (Bod, 1998).
  – Every contiguous subtree is a feature.
  – Lots of papers on how to do this efficiently.
  – Most closely related to memory-based or instance-based learning (along the lines of KNN).
  – Goodman (1996) approximated it with a PCFG.
• The part to remember: every tree fragment is a feature.
• Related to tree substitution grammar.
All Tree Fragments Feature Vector
• Every possible fragment corresponds to a dimension in the vector f(x).
• fᵢ(x) = the number of times the ith fragment occurs in x.
• f(x)ᵀf(x′) = the number of exactly matching fragment tokens in x and x′ (see the counting sketch below).
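To see why this feature space is too big to enumerate, here is a minimal sketch (my illustration, not from the slides) that counts fragments rooted at a node, using a toy tree encoding: an internal node is (label, (children,)), and a preterminal's only child is a word string.

```python
def num_fragments(node):
    """Number of fragments rooted at this node: a preterminal contributes 1;
    otherwise each child subtree can be cut off or expanded in any of its ways."""
    label, children = node
    if all(isinstance(c, str) for c in children):   # preterminal (tag -> word)
        return 1
    total = 1
    for child in children:
        total *= 1 + num_fragments(child)           # cut here, or recurse
    return total

# (S (NP (D the) (N cat)) (VP (V sleeps))) has 15 fragments rooted at S.
tree = ("S", (("NP", (("D", ("the",)), ("N", ("cat",)))),
              ("VP", (("V", ("sleeps",)),))))
print(num_fragments(tree))   # -> 15; the count grows exponentially with depth
```

The (1 + …) product is exactly the structure that reappears in the kernel recursion on the next slide.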
Tree Kernel (Collins and Duffy, 2002)

$$
\begin{aligned}
f(x)^\top f(x') &= \sum_i f_i(x)\, f_i(x') \\
&= \sum_i \Big(\sum_{n \in x} \underbrace{[\text{$i$th fragment matches at } n]}_{I_i(n)}\Big)
          \Big(\sum_{n' \in x'} \underbrace{[\text{$i$th fragment matches at } n']}_{I_i(n')}\Big) \\
&= \sum_i \sum_{n \in x} I_i(n) \sum_{n' \in x'} I_i(n') \\
&= \sum_{n \in x} \sum_{n' \in x'} \underbrace{\sum_i I_i(n)\, I_i(n')}_{\Delta(n, n')}
\end{aligned}
$$

$$
\Delta(n, n') =
\begin{cases}
0 & \text{if the productions at } n \text{ and } n' \text{ differ} \\
1 & \text{if they match and } n, n' \text{ are preterminals} \\
\displaystyle\prod_{j=1}^{\#\text{kids}(n)} \big(1 + \Delta(j\text{th child of } n,\ j\text{th child of } n')\big) & \text{otherwise}
\end{cases}
$$
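A compact sketch of this recursion, reusing the tuple trees from the fragment-counting example above; with memoization the work is proportional to the number of node pairs.

```python
from functools import cache

def nodes(tree):
    """Yield every node (internal and preterminal) of a tuple tree."""
    yield tree
    for child in tree[1]:
        if not isinstance(child, str):
            yield from nodes(child)

def production(node):
    """A node's production: its label plus the sequence of child labels."""
    label, children = node
    return (label, tuple(c if isinstance(c, str) else c[0] for c in children))

@cache
def delta(n, n2):
    if production(n) != production(n2):
        return 0                     # productions differ
    if all(isinstance(c, str) for c in n[1]):
        return 1                     # matching preterminals
    result = 1
    for c, c2 in zip(n[1], n2[1]):
        result *= 1 + delta(c, c2)   # product over aligned children
    return result

def tree_kernel(x, x2):
    """f(x)^T f(x') = sum of Delta over all node pairs."""
    return sum(delta(n, n2) for n in nodes(x) for n2 in nodes(x2))
```

As a sanity check, tree_kernel(tree, tree) for the example tree above returns 24, the total number of fragments rooted anywhere in it.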
Notes
• O(|x||x′|) runtime, where |x| and |x′| are the numbers of nodes in the two trees.
  – Collins and Duffy claim it's closer to linear in practice.
• Labeled sequences are a kind of tree.
• You can use word similarity functions instead of 0/1 for matching words.
• Collins and Duffy used the Collins parser (model 2) to:
  – provide a likelihood to use alongside the kernel as a feature
  – provide "multiple hypotheses" for use in the voted perceptron algorithm
• Parsing gains on the WSJ Penn Treebank task.
“Multiple Hypotheses”?
• Structured perceptron as we learned it (and also CRF, SSVM, etc.) assumes we reason about the entire set of possible outputs y for each input x.
  – Decoding, summing, cost-augmented decoding.
• Here, a reranking approach is assumed.
  – Use some other model to provide candidates.
  – The discriminative, kernelized model (here, a perceptron in the dual) only gets to rerank candidates; see the sketch below.
  – Charniak and Johnson (2005) ran with the reranking idea but went back to log-linear models, and by engineering good features did quite well.
• Reranking: a popular idea in the early 2000s, regardless of whether you use kernels.
• Understudied challenge: diversity of the n-best list.
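A minimal sketch of a perceptron reranker in the dual. The setup is my assumption, not Collins and Duffy's code: `train` is a list of (candidates, gold_index) pairs, where candidates is an n-best list of hashable structures (e.g., the tuple trees above), and `K` is a kernel such as tree_kernel.

```python
def dual_score(alpha, K, cand):
    # Primal: w = sum of (f(gold) - f(pred)) updates; so in the dual,
    # score(cand) = sum over updates of count * (K(gold, cand) - K(pred, cand)).
    return sum(a * (K(g, cand) - K(p, cand)) for (g, p), a in alpha.items())

def train_reranker(train, K, epochs=5):
    alpha = {}                                   # (gold, pred) -> update count
    for _ in range(epochs):
        for candidates, gold in train:
            pred = max(range(len(candidates)),
                       key=lambda j: dual_score(alpha, K, candidates[j]))
            if pred != gold:
                key = (candidates[gold], candidates[pred])
                alpha[key] = alpha.get(key, 0) + 1
    return alpha
```

Note that the model never decodes: it only scores the handful of candidates the base parser supplies.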
Grumpy Aside: Know Thy Kernel
• Kernel = set of features.
• You're pretty much always using a kernel.
• Empirically, it seems that:
  – knowing your problem and designing good features to add to your "kernel" is a win;
  – trying all the different kernels implemented in SVMlight (without understanding the differences) may help a little, but nobody cares.
• For language, anything beyond a linear kernel usually needs some justification.
Kernels and Decoding
• Ideally, we would like kernels on entire inputs and outputs, as in Collins and Duffy, but learned directly from the data, not in a secondary reranking stage.
• Why won't this work?
$$
\begin{aligned}
\text{decode}(x) &= \arg\max_{y} \; \mathbf{w}^\top f(x, y) \\
&= \arg\max_{y} \; \sum_{i=1}^{N} \sum_{y' \in \mathcal{Y}(x_i)} \alpha_{i, y'} \, K((x_i, y'), (x, y))
\end{aligned}
$$

• The argmax ranges over an exponential set of outputs y, and a general kernelized score does not decompose over parts of y, so dynamic-programming decoding no longer applies.
Kernels on Outputs
• In practice, apart from reranking, this is not done yet.
• There are a few interesting papers that explore various possibilities, and I want to discuss some of them.
  – Kernel CRFs
  – Rational kernels
  – Kernel dependency estimation
Kernel CRFs (Lafferty et al., 2004)
• Don't try for an arbitrary K((x, y), (x′, y′)).
• Instead, define your structure y as an assignment of values to variables Y in a Markov network.
• Kernels are now on cliques: K((x, y_c), (x′, y′_{c′})).
  – Any two clique assignments in any two graphs.
• Representer theorem: in the model that maximizes the regularized log-likelihood:
$$
\text{score}(x, y) = \sum_{i=1}^{N} \;\; \sum_{c \,\in\, \text{cliques}(\text{graph}(x_i))} \;\; \sum_{y'_c \in \mathcal{Y}_c} \alpha_{i, c, y'_c} \, K_c\big((x_i, y'_c), (x, y_c)\big)
$$
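An illustrative rendering of this score, specialized to a linear chain where cliques are adjacent label pairs. The names and the exact signature of `Kc` are my assumptions, not Lafferty et al.'s implementation.

```python
def chain_cliques(x):
    """Cliques of a linear-chain graph over positions of x."""
    return [(t, t + 1) for t in range(len(x) - 1)]

def kcrf_score(x, y, active, Kc):
    """score(x, y) = sum over selected training cliques (i, c_i, y'_c) of
    alpha * Kc((x_i, c_i, y'_c), (x, c, y_c))."""
    total = 0.0
    for (xi, ci, yci), a in active:              # [(triple, alpha), ...]
        for c in chain_cliques(x):
            yc = tuple(y[t] for t in c)          # this clique's labeling in y
            total += a * Kc((xi, ci, yci), (x, c, yc))
    return total
```

Because the kernel lives on cliques rather than whole outputs, the score still decomposes, so standard dynamic-programming decoding remains possible.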
Learning Algorithm
• Too many cliques!
• Greedy forward selection (much like older feature selection algorithms, e.g., Della Pietra et al., 1997).
• Basic idea is to iterate:
  – For every labeled clique in the training data, calculate the first derivative of the objective (regularized log-likelihood) with respect to the clique.
    • This is done approximately, for efficiency.
  – Add the clique with the largest gradient to the active set.
  – Optimize likelihood for the current active set of cliques; this is done in the dual.
But…
• This technique is not widely used.
• In NLP, most reported results stick with linear kernels; lots of results include some "feature engineering."
  – Some researchers see "feature engineering" as good, honest work.
  – Others see it as a distraction from "general" methods.
  – What do you think?
Rational Kernels (Cortes et al., 2004)
• Under some conditions, you can use WFSTs to define a kernel between strings.
  – Or between sets of strings represented as FSAs.
• The kernel function is defined by doing weighted composition x ∘ T ∘ y, and then taking the semiring path sum (a concrete instance follows below).
  – Edit distance uses min-plus.
  – String kernels use plus-times.
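One concrete instance, added here for illustration: the bigram kernel, a rational kernel where the transducer T emits each bigram of its input. The plus-times path sum of x ∘ T ∘ T⁻¹ ∘ y reduces to a dot product of bigram counts, which we can compute directly.

```python
from collections import Counter

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

def bigram_kernel(x, y):
    bx, by = bigrams(x), bigrams(y)
    return sum(bx[g] * by[g] for g in bx)    # plus-times: sum of products

print(bigram_kernel("the cat sat".split(), "the cat ran".split()))  # -> 1
```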
PDS Kernels
• Not all kernels are positive definite and symmetric.
  – Those are necessary conditions for learning algorithms to "work" with a kernel (a quick empirical check is sketched below).
• Cortes et al. define some formal properties (closure under various operations).
• They characterize some existing kernels as PDS.
• Experiments are included, but not for structured outputs.
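A quick empirical sanity check (my addition): build a Gram matrix on a few samples and confirm symmetry and nonnegative eigenvalues. Passing this is necessary but not sufficient for a kernel to be PDS in general.

```python
import numpy as np

def looks_pds(kernel, samples, tol=1e-8):
    K = np.array([[kernel(a, b) for b in samples] for a in samples], float)
    return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

sents = [s.split() for s in ("a b c", "a b a", "c b a")]
print(looks_pds(bigram_kernel, sents))   # -> True
```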
PCA and Kernel PCA
• Principal component analysis (Pearson, 1901): transform multi-dimensional data into uncorrelated dimensions.
  – Eigenvalue decomposition of the covariance matrix
  – Singular value decomposition of the data matrix
• Kernel PCA (Schoelkopf et al., 1998): do it in an RKHS!
  – Only inner products are needed (see the sketch below).
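A minimal kernel PCA sketch (my simplification of Schoelkopf et al., 1998): center the Gram matrix in the implicit feature space, eigendecompose, and read off coordinates. Only inner products, i.e., entries of K, ever appear.

```python
import numpy as np

def kernel_pca(K, num_components):
    """K: n x n Gram matrix; returns n x num_components coordinates."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    Kc = H @ K @ H                            # center in the implicit feature space
    vals, vecs = np.linalg.eigh(Kc)           # eigenvalues in ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]    # sort descending
    d = num_components
    # scale eigenvectors so each component's RKHS expansion has unit norm
    alphas = vecs[:, :d] / np.sqrt(np.maximum(vals[:d], 1e-12))
    return Kc @ alphas                        # projections of the n points
```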
Kernel Dependency Estimation (Weston et al., 2003)
For now, imagine just kernels on outputs, K(y, y′).
[Figure: schematic of kernel dependency estimation. Inputs X map to outputs Y; outputs are embedded in an "output feature space" via the kernel PCA map (principal axes in the RKHS feature space); multivariate regression predicts output-feature coordinates from inputs; mapping a predicted point back to an actual output is the "pre-image" problem.]
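A rough sketch of that pipeline, reusing kernel_pca above, under strong simplifications that are mine, not Weston et al.'s: plain ridge regression from raw input vectors to the output-kernel-PCA coordinates, and a pre-image step that just searches the training outputs (Weston et al. use kernel ridge regression and more careful pre-image solutions).

```python
import numpy as np

def kde_fit(X, Ky, num_components, ridge=1e-3):
    """X: n x p input vectors; Ky: n x n Gram matrix over training outputs."""
    Z = kernel_pca(Ky, num_components)        # output feature coordinates
    W = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ Z)
    return W, Z

def kde_predict(x, W, Z, outputs):
    """Regress into the output feature space, then take the nearest training
    output as an (approximate) pre-image."""
    z_hat = x @ W
    return outputs[int(np.linalg.norm(Z - z_hat, axis=1).argmin())]
```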
Punchline
• You should understand that kernels are a formalization of the notion of features.
• Abstracting features into a kernel can open up the possibility of using some cool learning algorithms.
• But you run the risk of getting too far from the data and application.
• Kernels on the output side create significant computational challenges that remain to be solved for practical use.
Recommended