Announcements
• Midterm returned at end of class today
• Only exams that were taken on Thursday
• Today: moving into neural nets via word embeddings
• Tuesday: Introduction to basic neural net architecture. Chris Kedzie to lecture.
• Homework out on Tuesday
• Language applications using different architectures
Methods so far
• Bag of words
• Simple and interpretable
• In vector space, represent a sentence: John likes milk → [0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0], a "one-hot" vector; values could be frequency or TF*IDF (see the sketch below)
• Sparse representation
• Dimensionality: 50K unigrams, 500K bigrams
• Curse of dimensionality!
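To make the count-based representation concrete, here is a minimal Python sketch; the toy vocabulary and the sentence are invented for illustration, not taken from the lecture:

```python
# Sketch: a bag-of-words count vector over a toy vocabulary.
vocab = ["cat", "dog", "john", "likes", "milk", "tree"]
index = {w: i for i, w in enumerate(vocab)}

def bow_vector(sentence):
    """Count-based bag-of-words vector; most entries stay 0 (sparse)."""
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in index:
            vec[index[word]] += 1
    return vec

print(bow_vector("John likes milk"))  # [0, 0, 1, 1, 1, 0]
```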
From Symbolic to Distributed Representations
• It's a problem, e.g., for web search:
• If a user searches for [Dell notebook battery], it should match documents with "Dell laptop battery"
• If a user searches for [Seattle motel], it should match documents containing "Seattle hotel"
• But:
Motel [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
Hotel [0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
• Our query and document vectors are orthogonal; there is no natural notion of similarity in a set of one-hot vectors
• → Explore a direct approach where the vectors themselves encode similarity
Slide from Chris Manning
Distributional Semantics
• "You shall know a word by the company it keeps" [J. R. Firth 1957]
• Marco saw a hairy little wampunuk hiding behind a tree
• Words that occur in similar contexts have similar meaning
• Record word co-occurrence within a window over a large corpus (sketch below)
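A small sketch of recording co-occurrence within a window, as the slide describes; the whitespace tokenization and window size of 2 are assumptions for illustration:

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (word, context) pairs within +/-window positions of each other."""
    counts = Counter()
    for t, word in enumerate(tokens):
        lo, hi = max(0, t - window), min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                counts[(word, tokens[j])] += 1
    return counts

tokens = "marco saw a hairy little wampunuk hiding behind a tree".split()
print(cooccurrence_counts(tokens)[("wampunuk", "little")])  # 1
```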
Word Context Matrices
• Each row i represents a word
• Each column j represents a linguistic context
• Matrix entry (i, j) represents the strength of association
• M^f ∈ R^(|V_W| × |V_C|), with M^f_{i,j} = f(w_i, c_j), where f is an association measure of the strength between a word and a context
        I     hamburger  book  gift  spoon
ate     .45   .56        .02   .03   .3
gave    .46   .13        .67   .7    .25
took    .46   .1         .7    .5    .3
Associations and Similarity
• Effective association measure: Pointwise Mutual Information (PMI)
PMI(w, c) = log P(w, c) / (P(w) P(c)) = log (#(w, c) · |D|) / (#(w) · #(c))
• Compute similarity between words and texts
• Cosine similarity: cos(u, v) = Σ_i u_i v_i / (√(Σ_i u_i²) √(Σ_i v_i²)) (both measures are sketched below)
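A minimal sketch of both measures, following the formulas above; the counts passed in at the end are made-up toy numbers:

```python
import math

def pmi(count_wc, count_w, count_c, total_pairs):
    """PMI(w, c) = log( #(w,c) * |D| / (#(w) * #(c)) )."""
    return math.log(count_wc * total_pairs / (count_w * count_c))

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy numbers, illustrative only:
print(pmi(count_wc=10, count_w=100, count_c=50, total_pairs=10000))  # log(20) ≈ 3.0
print(cosine([0.45, 0.56, 0.02], [0.46, 0.13, 0.67]))
```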
Dimensionality Reduction
• Captures context, but still has sparseness issues
• Singular value decomposition (SVD)
• Factors matrix M into two narrow matrices: W, a word matrix, and C, a context matrix, such that WC^T = M' is the best rank-d approximation of M
• A "smoothed" version of M
• Adds words to contexts if other words in this context seem to co-locate with each other
• Represents each word as a dense d-dimensional vector instead of a sparse |V_C|-dimensional one (sketch below)
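A minimal numpy sketch of the rank-d truncation described above; M is random here just so the snippet runs, standing in for a (e.g., PMI-weighted) word-context matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((500, 2000))       # |V_W| x |V_C|, normally very sparse

d = 50
U, S, Vt = np.linalg.svd(M, full_matrices=False)
W = U[:, :d] * S[:d]              # dense d-dimensional word vectors
C = Vt[:d].T                      # context matrix; W @ C.T approximates M

print(W.shape)                    # (500, 50): one dense vector per word
```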
Neural Nets
• A family of models within deep learning
• The machine learning approaches we have seen to date rely on "feature engineering"
• With neural nets, we instead learn by optimizing a set of parameters
Why "Deep Learning"?
• Representation learning attempts to automatically learn good features or representations
• Deep learning algorithms attempt to learn (multiple levels of) representation and an output
• From "raw" inputs x (e.g., sound, characters, words)
Slide adapted from Chris Manning
Reasons for Exploring Deep Learning
• Manually designed features can be over-specific or take a long time to design
• … but they can provide an intuition about the solution
• Learned features are easy to adapt
• Deep learning provides a very flexible framework for representing word, visual, and linguistic information
• Both supervised and unsupervised methods
Slide adapted from Chris Manning
Progress with deep learning
• Huge leaps forward with:
• Speech
• Vision
• Machine Translation
• More modest advances in other areas
[Krizhevsky et al. 2012]
From Distributional Semantics to Neural Networks
• Instead of count-based methods, distributed representations of word meaning
• Each word is associated with a vector; its meaning is captured across that vector's dimensions and across the dimensions of other words' vectors
• Dimensions in a distributed representation are not interpretable
• Specific dimensions do not correspond to specific concepts
Basic Idea of Learning Neural Network Embeddings
• Define a model that aims to predict between a center word w_t and context words, in terms of word vectors:
p(context | w_t) = …
which has a loss function, e.g.,
J = 1 − p(w_{−t} | w_t)
• We look at many positions t in a large corpus
• We keep adjusting the vector representations of words to minimize this loss
Slide adapted from Chris Manning
Embeddings Are Magic
vector('king') − vector('man') + vector('woman') ≈ vector('queen')
Slide from Dragomir Radev, image courtesy of Jurafsky & Martin
Relevant approaches: Yoav Goldberg
• Chapter 9: A neural probabilistic language model (Bengio et al. 2003)
• Chapter 10, p. 113: NLP (almost) from Scratch (Collobert & Weston 2008)
• Chapter 10, p. 114: word2vec (Mikolov et al. 2013)
Main Idea of word2vec
• Predict between every word and its context
• Two algorithms:
• Skip-gram (SG): predict context words given the target (position independent)
• Continuous Bag of Words (CBOW): predict the target word from the bag-of-words context (see the sketch below)
Slide adapted from Chris Manning
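A sketch of the two pair-extraction schemes, assuming a ±2-word window; the function names are my own for illustration, not word2vec's API:

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: (center word -> one context word) pairs, position independent."""
    for t, center in enumerate(tokens):
        for j in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if j != t:
                yield center, tokens[j]

def cbow_pairs(tokens, window=2):
    """CBOW: (bag of context words -> center word) pairs."""
    for t, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, t - window), min(len(tokens), t + window + 1))
                   if j != t]
        yield context, center

tokens = "the river will shrink".split()
print(list(skipgram_pairs(tokens))[:3])
# [('the', 'river'), ('the', 'will'), ('river', 'the')]
```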
Training Methods
• Two (moderately efficient) training methods:
• Hierarchical softmax
• Negative sampling
• Today: naïve softmax
Slide adapted from Chris Manning
Example: a center word and its context words in a ±2-word window:
Instead, a bank can hold the investments in a custodial account
But as agriculture burgeons on the east bank, the river will shrink
[context words in a 2-word window | center word w_t | context words in a 2-word window]
Objective Function
• Maximize the probability of context words given the center word:
J′(Θ) = Π_{t=1}^{T} Π_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; Θ)
• Negative log likelihood:
J(Θ) = −(1/T) Σ_{t=1}^{T} Σ_{−m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
where Θ represents all variables to be optimized (sketch below)
Slide adapted from Chris Manning
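A toy numeric sketch of the averaged negative log likelihood J(Θ) above; the `prob` argument is a stand-in for the model's P(w_{t+j} | w_t), here a made-up uniform distribution just so the snippet runs:

```python
import math

def neg_log_likelihood(tokens, prob, window=2):
    """J(Theta) = -(1/T) * sum over positions t and offsets j != 0
    of log P(w_{t+j} | w_t); `prob` stands in for the model."""
    T, total = len(tokens), 0.0
    for t in range(T):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < T:
                total += math.log(prob(tokens[t + j], tokens[t]))
    return -total / T

# Placeholder model: uniform probability 0.1 regardless of the words.
print(neg_log_likelihood("the river will shrink".split(),
                         prob=lambda o, c: 0.1))  # ≈ 5.76
```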
Softmax: using word c to obtain the probability of word o
• Convert P(w_{t+j} | w_t) into:
P(o | c) = exp(u_o · v_c) / Σ_{w=1}^{V} exp(u_w · v_c)
• Exponentiate to make everything positive, then normalize to get a probability
where o is the outside (or output) word index, c is the center word index, and v_c and u_o are the center and outside vectors for indices c and o (sketch below)
Slide adapted from Chris Manning
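A minimal sketch of the naïve softmax above; the max-subtraction line is a standard numerical-stability trick not shown on the slide, and the random U, V matrices are placeholders for trained parameters:

```python
import numpy as np

def softmax_prob(o, c, U, V):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c).
    U holds outside vectors, V center vectors, one row per word."""
    scores = U @ V[c]          # dot product of v_c with every u_w
    scores -= scores.max()     # numerical-stability shift (not on the slide)
    exp_scores = np.exp(scores)
    return exp_scores[o] / exp_scores.sum()

rng = np.random.default_rng(0)
U = rng.normal(size=(10, 4))   # toy vocabulary of 10 words, d = 4
V = rng.normal(size=(10, 4))
print(softmax_prob(o=3, c=7, U=U, V=V))
```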
Evaluating Embeddings
• Nearest Neighbors
• Analogies: (A : B) :: (C : ?) (sketch below)
• Information Retrieval
• Semantic Hashing
Slide from Dragomir Radev
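A sketch of analogy evaluation by vector arithmetic, matching the (A : B) :: (C : ?) pattern above; `vectors` is assumed to be a dict mapping each word to a numpy array of trained embeddings:

```python
import numpy as np

def analogy(a, b, c, vectors):
    """(A : B) :: (C : ?): return the word whose vector is most
    cosine-similar to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = vectors[b] - vectors[a] + vectors[c]
    target = target / np.linalg.norm(target)
    best, best_sim = None, -1.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = (vec @ target) / np.linalg.norm(vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# With well-trained vectors one would hope that
# analogy("man", "king", "woman", vectors) == "queen".
```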
How are word embeddings used?
• As features in supervised systems (sketch below)
• As the main representation for a neural net application/task
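One common way to use embeddings as features, sketched under the same assumption that `vectors` maps words to d-dimensional numpy arrays: average a sentence's word vectors and feed the result to any supervised classifier.

```python
import numpy as np

def sentence_features(tokens, vectors, d=100):
    """Average the embeddings of a sentence's words into a single
    d-dimensional feature vector for a downstream supervised classifier;
    words missing from `vectors` are simply skipped."""
    vecs = [vectors[w] for w in tokens if w in vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)
```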