Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Modelli Geometrici per l’apprendimentoautomatico e l’Information Retrieval
R. Basili
Corso diWeb MiningeRetrievala.a. 2008-9
June 18, 2009
Overview SVDxLSA LSA and kernels Beyond References
1 OverviewLinear TransformationsEmbeddings
2 SVD for LSALSA: semantic interpretationLSA, second-order relations and clusteringLSA and MLAn example: LSA and Frame Semantics
3 LSA and kernels
4 Beyond LSAIsomapLocally Linear EmbeddingLocality Preserving Projections
5 References
Overview SVDxLSA LSA and kernels Beyond References
Linear Transformations
Change of Basis
Change of Basis
Given two alternative basisB = {b1, ...,bn} andB′ = {b′1, ...,b′n}, such that
the square matrixC = (cik) describe the change of the basis, i.e.
b′k = c1kb1 +c2kb2 + ...cnkbn ∀k = 1, ...,n
Overview SVDxLSA LSA and kernels Beyond References
Linear Transformations
Matrix and Change of Basis
Matrix and Change of Basis
The effect of the matrixC on a generic vectorx allows to compute thechange of basis according only to the involved basisB andB′. For everyx = ∑n
k=1xkbk such that in the new basisB′, x can be expressed byx = ∑n
k=1x′kb′k, then it follows that:
x =n
∑k=1
x′kb′k = ∑
k
x′k
(
∑i
cikbi
)
=n
∑i,k=1
x′kcikbi
from which it follows that:
xi =n
∑k=1
x′kcik ∀i = 1, ...,n
Overview SVDxLSA LSA and kernels Beyond References
Linear Transformations
Matrix and Change of Basis
Matrix and Change of Basis
The effect of the matrixC on a generic vectorx allows to compute thechange of basis according only to the involved basisB andB′. For everyx = ∑n
k=1xkbk such that in the new basisB′, x can be expressed byx = ∑n
k=1x′kb′k, then it follows that:
x =n
∑k=1
x′kb′k = ∑
k
x′k
(
∑i
cikbi
)
=n
∑i,k=1
x′kcikbi
from which it follows that:
xi =n
∑k=1
x′kcik ∀i = 1, ...,n
The above condition suggests thatC is sufficient to describe any change ofbasis through the matrix vector mutliplication operations:
x = Cx′
Overview SVDxLSA LSA and kernels Beyond References
Linear Transformations
Matrix and Change of Basis
Matrix and Change of Basis
The effect of the matrixC on a matrixA can be seen by studying the casewherex,y are the expression of two vectors in a baseB while theircounterpart onB′ arex′,y′, respectively. Now ifA andB are such thaty = Ax andy′ = Bx′, then it follows that:
y = Cy′ = Ax = A(Cx′) = ACx′
(this means that)
y′ = C−1ACx′
from which it follows that:
B = C−1AC
The transformation of basisC is asimilarity transformationand matricesAandC are saidsimilar.
Overview SVDxLSA LSA and kernels Beyond References
Linear Transformations
Matrix and Change of Basis
Matrix and Change of Basis
The effect of the matrixC on a matrixA can be seen by studying the casewherex,y are the expression of two vectors in a baseB while theircounterpart onB′ arex′,y′, respectively. Now ifA andB are such thaty = Ax andy′ = Bx′, then it follows that:
y = Cy′ = Ax = A(Cx′) = ACx′
(this means that)
y′ = C−1ACx′
from which it follows that:
B = C−1AC
The transformation of basisC is asimilarity transformationand matricesAandC are saidsimilar.
Overview SVDxLSA LSA and kernels Beyond References
Linear Transformations
Overview
Analisi alle Componenti Principali
La Singular Value DecompositionDefinizioneEsempiRiduzione della dimensionalita’ del problema
Latent Semantic Analysise SVDInterpretazioneLatent Semantic Indexing
Applicazioni di LSA:termedocument clustering
Latent Semantic kernels
Embedding non lineari:Isomap, LNN ed LPP
Overview SVDxLSA LSA and kernels Beyond References
LSA e Singular Value Decomposition
Dati V (|V |= M) il vocabolario dei termini eT (|T |= N), l’insieme deitesti di una collezione (training), applico alla matriceW degliM termini(righe) per gliN documenti (colonne) la decomposizione in valori singolariseguente (Golub and Kahan, 1965):
W = USVT
dove:
U (M×R) conM vettori rigaui e’ singolare (UUT = I )
S(R×R) e’ diagonale, consij tc. sij = 0 ∀i = 1, ...,R ed i valorisingolari nella diagonale principali tali chesi = sii es1 ≥ s2≥ ...≥ sR > 0
V (N×R) conN vettori rigavi e’ singolare (VVT = I )
(OSS:Re’ il rangodi W).
Overview SVDxLSA LSA and kernels Beyond References
Le proprieta’ di SVD
La decomposizione in valori singolari traduce la matriceW di partenza in:
W = USVT
dove:
U eV sono le matrici deivettori singolari sinistroedestrodi W (cioe’gli autovettorio eigenvectorsdi WWT e WTW, rispettivamente)
le colonne diU e le righe diV definiscono uno spazioortonormale,cioe’ UUT = I eVVT = I
Se’ la matrice diagonale dei valori singolari diW. I valori singolarisono le radici degli autovalori diWWT o WTW (sono uguali)
Le trasformazioni lineari ottenute ora sono due (come vedremo tra poco):WV e WTU.
Overview SVDxLSA LSA and kernels Beyond References
LSA: proprieta’ di SVD
La decomposizione in valori singolari
W = USVT
si puo’ approssimare con
W∼W′ = UkSkVTk
trascurando le trasformazioni lineari di ordine superioreak << R in modoche:
Uk (M×k) conM vettori rigaui e’ singolare ed ortonormale(UkUT
k = I )
Sk (k×k) e’ diagonale, consij tc. sij = 0 ∀i = 1, ...,k ed i valorisingolari nella diagonale principali tali chesi = sii es1 ≥ s2≥ ...≥ sk > 0
Vk (N×k) conN vettori rigavi e’ singolare ed ortonormale (VkVTk = I )
Overview SVDxLSA LSA and kernels Beyond References
Riduzione del rango
k e’ il numero di valori singolari scelti per rappresentare i concettinellŠinsieme dei documenti.In genere,k << R (rango iniziale).
Overview SVDxLSA LSA and kernels Beyond References
LSA: proprieta’ di SVD
La decomposizione in valori singolari
W∼W′ = UkSkVTk
ha un certo insieme di proprieta’
la matriceSk e’ unica (anche seU e V non lo sono)
per costuzioneW′ e’ la matrice che soddisfa la decomposizione diordinek piu’ vicina a W (in norma euclidea)
gli si sono le radici degli autovalori diWWT
le componenti principalidel problema sono in relazione aUk eVk
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: interpretazione semantica
Si dice che inW = USVT, Scattura lastruttura semantica latentedellospazio di partenza diW, V ×T e che la riduzione aW′ non ha effettosignificativo su questa proprieta’.... Vediamo perche’:
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: interpretazione semantica
Si dice che inW = USVT, Scattura lastruttura semantica latentedellospazio di partenza diW, V ×T e che la riduzione aW′ non ha effettosignificativo su questa proprieta’.... Vediamo perche’:
gli autovalori ed autovettori sono direzioni particolari dello spazio dipartenzaV ×T , determinate dalla trasformazioneW.
USsi ottiene daW = USVT: infatti WV= USVTV = US, cioe’ per ogniriga (termine)i-esima diW (o U), uiS= wiV.
QUINDI: rappresentare i vettori dei termini (righewi della matriceW)medianteuiS, SIGNIFICA: combinare linearmente gli elementi (idocumenti,vj) della base ortonormale data daV (dopo il troncamento ak)
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: interpretazione semantica (cont’d)
Inoltre (per idocumenti):
VSsi ottiene daW = (USVT): infatti WT = (USVT)T = VSUT, da cuiWTU = VS. Le colonne (documenti)wj di W (o righe diV) sono talichevjS= wjU.
QUINDI: rappresentare i vettori dei testi (colonne della matriceW)mediantevjS, significa combinare linearmente (medianteS) le righe (iterminiui) della base ortonormale data daU
USe VSrappresentano le mappature di termini inV e documenti inTnello spazio di dimensionek generato dalla SVD
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: interpretazione semantica (cont’d)
Le trasformazioni sono date dalle combinazioni lineari sulle kdimensioni che corrispondono a concetti (temi). Infatti:
Le matriciU eV sono basi ortonormali per lo spazio LS di dimensionek generato. Esso sono le direzioni privilegiate della trasformazioneiniziale e quindi sono combinazioni lineari inWWT (o WTW) cioe’concetti determinati dal ricorrere degli stessi termini nei documenti (eviceversa).
Ogni vettore di terminewi dunque e’ rappresentato in tale spaziotramiteWV cioŁ come combinazione dei documenti inziali, il cheequivale a calcolareUS
Analogamente per i documentiwj medianteWTU=VS
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: un esempio da una distribuzione artificiale
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: un esempio calcolato
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: Approssimazione del rango, k= 2
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA
Calcolo della similarita’ Query-Doc
PerN documenti, la matriceV contieneN righe, ognuna delle qualirappresenta le coordinate del documentodi proiettato nello spazio LSA
Una queryq puo’ essere trattata come uno pseudo-documento eproiettata anch’essa nello spazio LSA
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: W= UΣVT
Uso della trasformazione SVD
SeW = UΣVT si ha anche che
V = WTUΣ−1
d = dTUΣ−1
q = qTUΣ−1 (pseudo documento)
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: W= UΣVT
Uso della trasformazione SVD
SeW = UΣVT si ha anche che
V = WTUΣ−1
d = dTUΣ−1
q = qTUΣ−1 (pseudo documento)
A valle della riduzione dimensionale di ordine k
d = dTUkΣ−1k
q = qTUkΣ−1k (pseudo documento)
Ne segue che:sim(q,d) = sim(qTUkΣ−1k ,dTUkΣ−1
k )
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: Calcolo del query vector
Overview SVDxLSA LSA and kernels Beyond References
LSA: semantic interpretation
LSA: Vettori della query e dei documenti
Overview SVDxLSA LSA and kernels Beyond References
2ndOrd
LSA: Relazioni del secondo ordine
LSA e significato
La rappresentazione in LSA dei termini coglie un numero maggiore diaspetti del significato dei termini, poiche’ contribuiscono alla nuovarappresentazione tutte le co-occorrenze nei diversi contesti descrittidalla matriceW iniziale
La rappresentazione di un terminet nel nuovo spazio non è piu’ ilversore unitario~t ortogonale a (e quindi indipendente da) tutti gli altri
La similitudine tra due terminiti e tj dipende dalla trasformazioneUΣ edipende quindi da tutte le co-occorrenze comuni con altri termini tk(contk 6= ti , tj). Queste dipendenze costituiscono relazioni delsecondoordine.
Overview SVDxLSA LSA and kernels Beyond References
2ndOrd
LSA: Metriche di pesatura
In LSA i modelli di pesatura possono variare per migliorare la ricerca nellospazio delle trasformazioni lineari possibili
Frequenza.cij (o le sue varianti normalizzatecij|dj |
,cij
maxlkclk)
(Landauer)wij =log(cij +1)
1+∑Nj=1
cijti
logcijti
logN
=log(cij +1)
1+∑Nj=1 Pij logPij
logN
(Bellegarda, LM modeling)wij = (1− εi)cijnj
con
εi =− 1log2N ∑N
j=1cijti
logcijti
Overview SVDxLSA LSA and kernels Beyond References
2ndOrd
LSA: Metriche tra termini
In LSA la similarita’ tra i termini e’ descritta dal prodotto
WWT
e puo’ essere calcolata come
USVT(USVT)T = (USVT)(VSTUT) = USSTUT = US(US)T
cioe’ moltiplicando la matrice delle (ui) perSe poi applicando il prodottoscalare.
Applicazioni: Indexing(ass dei termini ai docs),word/term clustering(raggruppamento di termini).
Overview SVDxLSA LSA and kernels Beyond References
2ndOrd
LSA: Metriche tra documenti
In LSA la similarita’ tra i documenti e’ descritta dal prodotto
WTW
e puo’ essere calcolata come
(USVT)TUSVT = (VSTUT)(USVT) = VSSVT = VS(VS)T
cioe’ moltiplicando la matrice delle (vi) perSe poi applicando il prodottoscalare.
Applicazioni: document clustering(raggruppamento di docs) ecategorizzazione dei testi(associazione di categorie ai documenti).
Overview SVDxLSA LSA and kernels Beyond References
2ndOrd
LSA: Osservazioni
LSA e’ una tecnica per ottenere sistematicamente le rotazioninecessarie nello spazioV ×T ad allineare i nuovi assi con ledimensioni lungo le quali sussite la piu’ ampia variazione tra idocumenti
Il primo asse (s1) fornisce la variazione massima, il secondo quellamassima tra le rimanenti e cosi’ via
Vengono trascurate (scelta dik) le dimensioni le cui variazioni sonotrascurabili
Overview SVDxLSA LSA and kernels Beyond References
2ndOrd
LSA: Osservazioni (2)
LSA e’ una tecnica per ottenere sistematicamente le rotazioninecessarie nello spazioV ×T ad allineare i nuovi assi con ledimensioni lungo le quali sussite la piu’ ampia variazione tra idocumenti
Ipotesi sottostanti:misura la variazione attraverso la norma euclidea (minimizza lo scartoquadratico)la distribuzione dei pesi legati ai termini nei documenti e’normale(isistemi di pesatura tendono a consolidare questa proprieta’)
Overview SVDxLSA LSA and kernels Beyond References
2ndOrd
LSA: Altre Applicazioni
Inferenza Semantica nelAutomatic Call Routing
Il taske’ quello di mappare domande (per unCall Center) in unaprocedura (azione) direply.
Domande di training:(T1) What is the time, (T2) What is the day, (T3) What time is themeeting, (T4) Cancel the meeting
thee’ irrilevante,timee’ ambiguo (tra 1 e 3)
Domanda di ingresso:when is the meeting(classe correttaT3)
Overview SVDxLSA LSA and kernels Beyond References
2ndOrd
Automatic Call Routing su uno spazio bidimensionale
Overview SVDxLSA LSA and kernels Beyond References
LSA and ML
LSA vs. Machine Learning
Esiste una relazione tra LSA ed i modelli dilearning?
Induction in LSA
LSA fornisce un metodo generale per lastima di similarita’quindi e’utilizzabile in tutti gli algoritmi induttivi per guidare lageneralizzazione (ad es. iperpiano di separazione)
LSA fornisce una trasformazione lineare dei vettori di attributi originaliispirata dalle caratteristiche del data set, quindi produce una nuovametrica guidata dai datistessi. Questo meccanismo e’ particolarmenteimportante quando non esistono modelli analitici (generativi) precisidelle distribuzioni dei dati iniziali (ad es. dati linguistici)
LSA puo’ essere applicato a dati non controllati (per esempio collezionidi documenti tematiche non annotate) e quindi estende il potere digeneralizzazione dei metodisupervisedsfruttando informazioni esterneal task. Questa proprieta’ consente di indurre una conoscenzacomplementare al task stesso
Overview SVDxLSA LSA and kernels Beyond References
LSA and ML
LSA: Machine Learning tasks
LSA in Relevance Feedback
Automatic Global Analysis
Stimatore di pseudo-rilevanzaprimadella espansione.
Overview SVDxLSA LSA and kernels Beyond References
LSA and ML
LSA: Machine Learning tasks
LSA in Relevance Feedback
Automatic Global Analysis
Stimatore di pseudo-rilevanzaprimadella espansione.
LSA and language semantics
SVD in distributional analysis: semantic spaces (see (Padoand Lapata,2007))
Word Sense Discrimination as clustering in LSA-like spaces(seeSchutze, 1998)
Word Sense Disambiguation in LSA spaces (see (Gliozzo et al., 2005),(Basili et al., 2006)
Framenet predicate induction (see (Basili et al., 2008), (Pennacchiotti etal., 2008))
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Frames as Conceptual Patterns
An example: theKILLING frame
Frame: KILLING
A K ILLER or CAUSE causes the death of the VICTIM .
CAUSE The rockslide killed nearly half of the climbers.INSTRUMENT It’s difficult to suicidewith only a pocketknife.K ILLER John drownedMartha.MEANS The flood exterminatedthe ratsby cutting off access to food.MEDIUM John drownedMartha.
Fra
me
Ele
me
nts
Pre
dica
tes annihilate.v, annihilation.n, asphyxiate.v,assassin.n, assassinate.v, assassina-
tion.n, behead.v, beheading.n, blood-bath.n, butcher.v,butchery.n, carnage.n,crucifixion.n, crucify.v, deadly.a, decapitate.v, decapitation.n, destroy.v, dis-patch.v, drown.v, eliminate.v, euthanasia.n, euthanize.v, . . .
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting frames in semantic fields
Semantic Spaces (Pado and Lapata, 2007)A Semantic Space for a set ofN targets is 4-tuple< B,A,S,V > where:
B is the set of basic features (e.g. words co-occurring with the targets)
A is a lexical association function that weights the correlations betweenb∈ B and the targets
S is a similarity function between targets (i.e. inℜ|B|×ℜ|B|)V is a linear transformation over the originalN×|B|matrix
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting frames in semantic fields
Semantic Spaces: a definition
A Semantic Space for a set ofN targets is 4-tuple< B,A,S,V > where:
B is the set of basic features (e.g. words co-occurring with the targets)
A is a lexical association function that weights the correlations betweenb∈ B and thetargets
S is a similarity function between targets (i.e. inℜ|B|×ℜ|B|)
V is a linear transformation over the originalN×|B|matrix
Examples
In IR systems targets are documents,B is the term vocabulary,A is thetf · idf score. TheSfunction is usually the cosine similarity, i.e.sim(~t1,~t2) = ∑i t1i ·t2i
||~t1||·||~t2||
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting frames in semantic fields
Semantic Spaces: a definition
A Semantic Space for a set ofN targets is 4-tuple< B,A,S,V > where:
B is the set of basic features (e.g. words co-occurring with the targets)
A is a lexical association function that weights the correlations betweenb∈ B and thetargets
S is a similarity function between targets (i.e. inℜ|B|×ℜ|B|)
V is a linear transformation over the originalN×|B|matrix
Examples
In IR systems targets are documents,B is the term vocabulary,A is thetf · idf score. TheSfunction is usually the cosine similarity, i.e.sim(~t1,~t2) = ∑i t1i ·t2i
||~t1||·||~t2||
In Latent Semantic Analysis (Berry et al. 94) targets are documents (or dually words), and theSVD transformation is used asV
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting ontologies in semantic fields
Semantic Spaces and Frame semantics
These lexicalized models corresponds to useful generalizations regardingsinonimy, class membershipor topical similarity
As frames are rich linguistic structures it is clear that more than one ofsuch properties hold among members (i.e. LUs) of the same frame
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting ontologies in semantic fields
Semantic Spaces and Frame semantics
These lexicalized models corresponds to useful generalizations regardingsinonimy, class membershipor topical similarity
As frames are rich linguistic structures it is clear that more than one ofsuch properties hold among members (i.e. LUs) of the same frame
Topical similarityplays a role as frames evoke events in very similartopical situations (e.g. KILLING vs. ARREST)
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting ontologies in semantic fields
Semantic Spaces and Frame semantics
These lexicalized models corresponds to useful generalizations regardingsinonimy, class membershipor topical similarity
As frames are rich linguistic structures it is clear that more than one ofsuch properties hold among members (i.e. LUs) of the same frame
Topical similarityplays a role as frames evoke events in very similartopical situations (e.g. KILLING vs. ARREST)
Sinonimyis also informative as LU’s in a frame can be synonyms (suchaskid, child), quasi-sinonyms (such asmothervs. father) andco-hyponims
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting ontologies in semantic fields
Semantic Spaces and Frame semantics
These lexicalized models corresponds to useful generalizations regardingsinonimy, class membershipor topical similarity
As frames are rich linguistic structures it is clear that more than one ofsuch properties hold among members (i.e. LUs) of the same frame
Topical similarityplays a role as frames evoke events in very similartopical situations (e.g. KILLING vs. ARREST)
Sinonimyis also informative as LU’s in a frame can be synonyms (suchaskid, child), quasi-sinonyms (such asmothervs. father) andco-hyponims
Which feature models and metrics correspond to a suitable geometricalnotion of framehood?
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Latent Semantic Spaces
LSA and Frame semantics
In our approach SVD is applied to source co-occurrence matrices in order toReduce the original dimensionalityCapturetopical similarity latent in the original documents, i.e. secondorder relations among targets
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting ontologies in semantic fields
Framehood in a semantic space
Frames are rich polymorphic classes and clustering is applied for detecting multiplecentroids
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting ontologies in semantic fields
Framehood in a semantic space
Frames are rich polymorphic classes and clustering is applied for detecting multiplecentroidsRegions of the space where LU’s manifest are also useful for detecting the sentences thatexpress the intended predicate semantics
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting ontologies in semantic fields
Framehood in a semantic space
Frames are rich polymorphic classes and clustering is applied for detecting multiplecentroidsRegions of the space where LU’s manifest are also useful for detecting the sentences thatexpress the intended predicate semantics
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
The clustering Algorithm
k-means
Hard clustering algorithm fed with a fixed number ofk randomly chosen seeds (centroids)
Sensitive to the choice ofk, and the seeding
Quality Threshold clustering ((Heyer et al., 1999))
Aggregative clustering similar tok-means with thresholds to increase flexibility
Minimal infracluster similarity (activatenew seeds)
Minimal intra-cluster dissimilarity (activatemerge)
Maximal number of cluster members (activatesplits)
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
The clustering Algorithm
QT-clustering
Require: QT {Quality Threshold for clusters}repeat
for all slot fillersw doSelectCx as the best cluster forw.if sim(w,cx) > QT then
Generate new clusterCw for welse
Acceptw into clusterCx
end ifend for
until No other shift is necessary or the maximum number of iteration is reached
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
The clustering Algorithm: Different settings
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Harvesting ontologies in semantic fields
Framehood in a semantic space
Frames are rich polymorphic classes and clustering is applied for detecting multiplecentroidsRegions of the space where LU’s manifest are also useful for detecting the sentences thatexpress the intended predicate semantics
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Evaluation: Current experimental set-up
The Corpus
TREC 2005 vol. 2
# of docs: about 230,000
# of tokens: about 110,000,000 (more than 70,000types)
Source Dimensionality: 230,000× 49,000
LSA Dimensionality reduction: 7,700× 100
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Evaluation: Current experimental set-up
The Corpus
TREC 2005 vol. 2
# of docs: about 230,000
# of tokens: about 110,000,000 (more than 70,000types)
Source Dimensionality: 230,000× 49,000
LSA Dimensionality reduction: 7,700× 100
Syntactic and Semantic Analysis
Parsing: Minpar (Lin,1998)
Synonimy, hyponimy info: Wordnet 1.7
Semantic Similarity Estimation: CD library (Basili et al„ 2004)
Framenet: 2.0 version
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Evaluating the LU Classification in English
English Number of frames: 220Number of LUs: 5042Most likely frames:Self_Motion(p=0.015),Clothing(p=0.014)
Table: The Gold Standard for the test over English
The measure is theAccuracy: the percentage of correctly classifiedlu’s byat least one of the proposedk choices.
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
The impact of LU Classification on the English data set
Nouns Verbs
Targeted Frames 220 220Involved LUs 2,200 2,180Average LUs per frame 10.0 9.91LUs covered by WordNet 2,187 2,169Number of Evoked Senses 7,443 11,489Average Polysemy 3.62 5.97Represented words (i.e.∑F WF) 2,145 1,270Average represented LUs per frame 9.94 9.85Active Lexical Senses (LF) 3,095 2,282Average Active Lexical Senses(|LF |/|WF |) per word over frames 1.27 1.79Active synsets (SF) 3,512 2,718Average Active synsets (|SF |/|WF |)per word over frames 1.51 2.19
Table: Statistics on nominal and verbal senses in the paradigmaticmodel of theEnglish FrameNet
Overview SVDxLSA LSA and kernels Beyond References
LSAvsFN
Comparative evaluation for English: accuracy
Overview SVDxLSA LSA and kernels Beyond References
LSA e metodi Kernel
Le funzioni kernelK(~oi ,~o) possono essere usate per la stima (implicita)della similarita’ tra due termini o documenti,oi eo, in spazi complessi alfine di addestrare una SVM secondo l’equazione:
h(~x) = sgn
(
m
∑i=1
αiK(oi,o)+b
)
dove~x = φ(o)Alcuni modelli che utilizzano LSA per la definizione diK(oi ,o) sono statidefiniti per problemi di
Text Categorization (categorizzazione testi)
Word Sense Disambiguation (classificazione occorrenze di una parolain sensi)
Overview SVDxLSA LSA and kernels Beyond References
LSA-based Domain Kernels
Assunzioni:
oi rappresenta un "termine"~tiLo spazio di rappresentazione di partenza per i~ti e’ quello di un VSMtradizionale (pesatura dei termin tramite la loro pesaturaTF× IDF neidocumenti)
La matriceT di partenza e’ quindi termini per doc
Overview SVDxLSA LSA and kernels Beyond References
LSA-based Domain Kernels (2)
Processo:
Applico la trasformazione LSA di dimensionek, oi ←~ti (OSS: glioi
sono vettori in LSA)
Assumo la metrica tra termini~ti come la similarita’ tra oggetti di tipooi
(cioe’ nelLatent Semantic Space)
Addestro un categorizzatore definendo una funzione kernelK(., .) nelseguente modo:K(~ti ,~t) = K(φ−1(oi),φ−1(o))
.= KLSA(o,oi) dove:
KLSA(o,oi) = oi⊗LSAo
OSS:⊗LSA indica il prodotto scalare tra gli oggettioi nello spazio LSA.
Overview SVDxLSA LSA and kernels Beyond References
LSA-based Domain Kernels (3)
LSA determina la trasformazione SVD e la approssimazione diordinek.k e’ la dimensione nello spazio trasformato, cioe’ il numero di componentiprincipali del problema inziale (VSM semplice)Interpretando tali componenti comedomini:
oi risulta la descrizione di un termine~ti nei diversi domini
termini simili secondoKLSA(oi ,o) condividono il maggior numero didomini
il kernel risultanteKLSA(oi ,o) e’ dettoLatent Semantic Kernel(Cristianini&Shawe-Taylor,2004) odomain kernel(Gliozzo &Strapparava,2005).
Overview SVDxLSA LSA and kernels Beyond References
LSA-based Domain Kernels:
Applicazione allaText Categorization
un documentoxi (come un termine) puo’ essere mappato in uno spaziodi tipo LSA, oi ←~xi
le k componenti dioi rappresentano la descrizione dixi nel nuovospazio
il kernelKLSA(oi,o) risultante cattura le similarita’ mediante unadescrizione di dominio di~xi
l’addestramento della SVM genera l’equazione dell’iperpiano secondoLSA:
f (~x) =
(
m
∑i=1
αiKLSA(oi ,oj)+b
)
OSS: LSA puo’ essere computato in una collezioneesternaal data set ditraining.
Overview SVDxLSA LSA and kernels Beyond References
Embeddings in ML
Embedding
Dimensionality reduction is a way to reduce the original spacedimension to a more manageable size that is still highly informative: itaims to prune out irrelevant information and noise from the initial set ofobservations
Overview SVDxLSA LSA and kernels Beyond References
Embeddings in ML
Embedding
Dimensionality reduction is a way to reduce the original spacedimension to a more manageable size that is still highly informative: itaims to prune out irrelevant information and noise from the initial set ofobservations
The idea is to embed the sourceN dimensional space into a smallerddimensional space: this operation is calledembedding
Overview SVDxLSA LSA and kernels Beyond References
Embeddings in ML
Embedding
Dimensionality reduction is a way to reduce the original spacedimension to a more manageable size that is still highly informative: itaims to prune out irrelevant information and noise from the initial set ofobservations
The idea is to embed the sourceN dimensional space into a smallerddimensional space: this operation is calledembedding
a transformation is suitable for a given feature representation and aspecific task if it optimizes one or more principles or aspects of the task(utility function)
Overview SVDxLSA LSA and kernels Beyond References
Embeddings in ML
Embedding
Dimensionality reduction is a way to reduce the original spacedimension to a more manageable size that is still highly informative: itaims to prune out irrelevant information and noise from the initial set ofobservations
The idea is to embed the sourceN dimensional space into a smallerddimensional space: this operation is calledembedding
a transformation is suitable for a given feature representation and aspecific task if it optimizes one or more principles or aspects of the task(utility function)
The DR techniques employed make different assumptions on (1) thenature of the original space (domain, range and distributions of attributevalues) and on (2) the optimality criterion to be applied (e.g. manifoldstructure, efficiency, resulting accuracy or cognitive plausibility)
Overview SVDxLSA LSA and kernels Beyond References
Non-Linear Embeddings
Embedding and Unfolding
An embedding define a process aiming to unfold the underlyingstructure ofthe data distribution on which linear (such Euclidean) distances are not valid.
Overview SVDxLSA LSA and kernels Beyond References
Non-Linear Embeddings and manifolds
A manifold is a topological space which is locally Euclidean. In general, anyobject which is nearlyflat on small scales is amanifold. Euclidean space is asimplest example of a manifold.
Overview SVDxLSA LSA and kernels Beyond References
Non-Linear Embeddings and manifolds
A manifold is a topological space which is locally Euclidean. In general, anyobject which is nearlyflat on small scales is amanifold. Euclidean space is asimplest example of a manifold.
Concept of submanifold
Manifolds arise naturally whenever there is a smooth variation of parameters(like pose of the face in previous example)The dimension of a manifold is the minimum integer number of co-ordinatesnecessary to identify each point in that manifold.
Overview SVDxLSA LSA and kernels Beyond References
Non-Linear Embeddings and manifolds
A manifold is a topological space which is locally Euclidean. In general, anyobject which is nearlyflat on small scales is amanifold. Euclidean space is asimplest example of a manifold.
Concept of submanifold
Manifolds arise naturally whenever there is a smooth variation of parameters(like pose of the face in previous example)The dimension of a manifold is the minimum integer number of co-ordinatesnecessary to identify each point in that manifold.
Overview SVDxLSA LSA and kernels Beyond References
Non-Linear Embeddings and manifolds
In general, the transformation should be make dependent on the local ratherthan on the global structure of the data distributions in theoriginal space inorder to recreate the manifold structure with the smallest number ofdimensions:Unfolding manifolds.
Overview SVDxLSA LSA and kernels Beyond References
Isomap
Isomap
Isomap Algorithm
Determine the neighbors. Two methods: (1) All points in a fixedradius; (2)K nearest neighbors
Overview SVDxLSA LSA and kernels Beyond References
Isomap
Isomap
Isomap Algorithm
Determine the neighbors. Two methods: (1) All points in a fixedradius; (2)K nearest neighbors
Construct a neighborhood graph. Each point is connected to the otherif it is a K nearest neighbor. Edge Length equals the Euclidean distance
Overview SVDxLSA LSA and kernels Beyond References
Isomap
Isomap
Isomap Algorithm
Determine the neighbors. Two methods: (1) All points in a fixedradius; (2)K nearest neighbors
Construct a neighborhood graph. Each point is connected to the otherif it is a K nearest neighbor. Edge Length equals the Euclidean distance
Compute the shortest paths between two nodes throughFloyd’s AlgorithmDjkstra’s Algorithm
Overview SVDxLSA LSA and kernels Beyond References
Isomap
Isomap
Isomap Algorithm
Determine the neighbors. Two methods: (1) All points in a fixedradius; (2)K nearest neighbors
Construct a neighborhood graph. Each point is connected to the otherif it is a K nearest neighbor. Edge Length equals the Euclidean distance
Compute the shortest paths between two nodes throughFloyd’s AlgorithmDjkstra’s Algorithm
Construct a lower dimensional embedding through ClassicalMultidimensional Scaling (or SVD)
Overview SVDxLSA LSA and kernels Beyond References
Isomap
Isomap vs. MDS: Residuals in different problems
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
Fit Locally , Think Globally
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
Local weights
In the first phase weights are chosen such that
ε(W) = minW||~Xi−∑j
Wij~Xj ||2
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
Local weights
In the first phase weights are chosen such that
ε(W) = minW||~Xi−∑j
Wij~Xj ||2
The weights that minimize the reconstruction errors are invariant to rotation,rescaling and translation of the data points
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
Local weights
In the first phase weights are chosen such that
ε(W) = minW||~Xi−∑j
Wij~Xj ||2
The weights that minimize the reconstruction errors are invariant to rotation,rescaling and translation of the data points
Invariance to translation is enforced by adding the constraint that the weightssum to one, i.e.∑ij Wij = 1
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
Local weights
In the first phase weights are chosen such that
ε(W) = minW||~Xi−∑j
Wij~Xj ||2
The weights that minimize the reconstruction errors are invariant to rotation,rescaling and translation of the data points
Invariance to translation is enforced by adding the constraint that the weightssum to one, i.e.∑ij Wij = 1
The same weights that reconstruct the data points inD dimensions shouldreconstruct it in the manifold ind dimensions.
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
Reconstruction weights
The weights derived in the first phase
ε(W) = minW||Xi−∑j
Wij Xj ||2
are used for the reconstruction ind dimensions.
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
Reconstruction weights
The weights derived in the first phase
ε(W) = minW||Xi−∑j
Wij Xj ||2
are used for the reconstruction ind dimensions.
Each high-dimensional representation~Xi is mapped to a low-dimensionalvector~Yj representing the global internal coordinates on the manifold.
This is done by choosing thed-dimensional coordinates~Yi such to minimizethe embedding cost function, i.e.
Ψ(Y) = ∑i||~Yi−∑
jWij~Yj ||
2
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
Reconstruction weights
The weights derived in the first phase
ε(W) = minW||Xi−∑j
Wij Xj ||2
are used for the reconstruction ind dimensions.
Each high-dimensional representation~Xi is mapped to a low-dimensionalvector~Yj representing the global internal coordinates on the manifold.
This is done by choosing thed-dimensional coordinates~Yi such to minimizethe embedding cost function, i.e.
Ψ(Y) = ∑i||~Yi−∑
jWij~Yj ||
2
Fix the weights and optimize the coordinates
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
Locally Linear Embedding
How to compute the reconstruction weights
The solution to the following problem
Ψ(Y) = ∑i||~Yi−∑
jWij~Yj ||
2
consists in
Add constraints to ask for the solutions~Yi to be centered on the origin(∑i
~Yi =~0
Minimize theΨ by solving the corresponding sparseN×N eigenvalueproblem, whose bottomd non zero eigenvalues are the solution
Such setA of the (smallestd non-zero) eigenvalues provide an orderedset of orthogonal coordinates centered in the origin (i.e. the new basisfor the low-dimensional space), i.e.
~Yi = A~Xi
.
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
LLE
The analytical setting
minWΨ(Y) = ∑i ||~Yi−∑j Wij~Yj ||2
∑i~Yi =~0
1N ∑i
~Yi×~Yi = 1
that is equivalent to
minWΨ(Y) = ∑ij M(~Yi ,~Yj)
∑i~Yi =~0
1N ∑i
~Yi×~Yi = 1
whereMij = δij −Wij −Wji + ∑k WkiWkj
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
LLE Results on a Face recognition problem
Data set
NUMB. OF PHOTOGRAPHS 2000SIZE OF PHOTOGRAPHS 20x28
DATA VECTOR D=560NUMB OF NEIGHBORS K=12
DISTANCE Euclidean
Table: Experiments on the face recognition problem (source (Roweis and Saul,2000))
Overview SVDxLSA LSA and kernels Beyond References
Locally Linear Embedding
LLE Results on Word clustering
Data set
SOURCE Grolier EncyclopediaNUMB. OF DOCUMENTS 31,000SIZE OF VOCABULARY 5,000NUMB OF NEIGHBORS K=20
WEIGHTING word countsDISTANCE normalized inner product
Table: Experiments on word clustering (source (Roweis and Saul, 2000))
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
Locality Preserving Projections (LPP)
LPP Embedding
LPP (Cai et al., IEEE on DKE 2005) is the last of this class of non linear embeddingmethods
It is in the class of laplacian methods (such as LLE)
Application to image and document clustering, topic modeling (such asLDA)
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
General Idea
Determine thebestlinear transformationA that preserves thelocalproperties of the space, without making global assumptions(as in LSA)
An adjecency graphG is adopted, based on internal metrics (i.e. thespace inner product) or external ones (e.g. dictionaries)
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
General Idea
Determine thebestlinear transformationA that preserves thelocalproperties of the space, without making global assumptions(as in LSA)
An adjecency graphG is adopted, based on internal metrics (i.e. thespace inner product) or external ones (e.g. dictionaries)
Formally:
arg min~a∑ij (~a~xi−~a~xj)2Wij
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
General Idea
Determine thebestlinear transformationA that preserves thelocalproperties of the space, without making global assumptions(as in LSA)
An adjecency graphG is adopted, based on internal metrics (i.e. thespace inner product) or external ones (e.g. dictionaries)
Formally:
arg min~a∑ij (~a~xi−~a~xj)2Wij
Dii = ∑j Wij and ...
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
General Idea
Determine thebestlinear transformationA that preserves thelocalproperties of the space, without making global assumptions(as in LSA)
An adjecency graphG is adopted, based on internal metrics (i.e. thespace inner product) or external ones (e.g. dictionaries)
Formally:
arg min~a∑ij (~a~xi−~a~xj)2Wij
Dii = ∑j Wij and ...L = D−W (Laplacian matrix)
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
General Idea
Determine thebestlinear transformationA that preserves thelocalproperties of the space, without making global assumptions(as in LSA)
An adjecency graphG is adopted, based on internal metrics (i.e. thespace inner product) or external ones (e.g. dictionaries)
Formally:
arg min~a∑ij (~a~xi−~a~xj)2Wij
Dii = ∑j Wij and ...L = D−W (Laplacian matrix)
Solve the eingenvector problem:XLXT~a = λXDXT~a
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
General Idea
Determine thebestlinear transformationA that preserves thelocalproperties of the space, without making global assumptions(as in LSA)
An adjecency graphG is adopted, based on internal metrics (i.e. thespace inner product) or external ones (e.g. dictionaries)
Formally:
arg min~a∑ij (~a~xi−~a~xj)2Wij
Dii = ∑j Wij and ...L = D−W (Laplacian matrix)
Solve the eingenvector problem:XLXT~a = λXDXT~a
Final projection intoℜd: yi = ATxi
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
Main advantages
LPP (Cai et al., IEEE on DKE 2005) determines a trade-off between differentDR methods, that optimize
the discriminative power of the resulting space (as in the LDA that issupervised, in fact)
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
Main advantages
LPP (Cai et al., IEEE on DKE 2005) determines a trade-off between differentDR methods, that optimize
the discriminative power of the resulting space (as in the LDA that issupervised, in fact)
and methods that optimize wrt to the recontruction of the original datadistribution (as in LSA)
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
LPP as a data-driven metrics
Main advantages
LPP (Cai et al., IEEE on DKE 2005) determines a trade-off between differentDR methods, that optimize
the discriminative power of the resulting space (as in the LDA that issupervised, in fact)
and methods that optimize wrt to the recontruction of the original datadistribution (as in LSA)
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
The Adjacency Graph,G
Given two vectorsxi andxj , G defines weightswij , as:
cosinegraph: wij = max{0,cos(xi ,xj)−τ|cos(xi ,xj)−τ| ·cos(xi ,xj)}.
ε-neighborhoodsgraph (gaussian kernel):
wij = max{0,ε−||xi−xj ||
2
|ε−||xi−xj ||2|·e−
||xi−xj ||2
t },
thetopicgraph:
wij = δ (i, j) ·cos(xi ,xj)
whereδ (i, j) = 1 only if a corpus categoryC can be found such thatxi ∈ C andxj ∈C and 0 otherwise.
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
Evaluation Metrics
NMI
Normalized mutual information, defined as follows:
NMI(T,C) =∑t∈T,c∈C p(t,c)log2
p(t,c)p(t)·p(c)
min(H(T),H(C))(1)
Accuracy
The accuracyAC is given by:
AC=∑N
i=1 δ (Ai ,Oi)
N(2)
whereN is the total number of documents andδ (Ai ,Oi) is 1 only if Ai = Oi
and 0 otherwise
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
Results
Topic Graph
REUTERS
METHOD ACC NMI
LSA 0.82 0.79LPP 0.94 0.99
Table: Best LSA vs. upper bound LPP results based on the "topic" graph on Reuters.
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
Results
Dimensionality reduction effect: LSA vs. LPP
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
Results
Clustering effect: LSA vs. LPP
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
Results
LSA vs. LSA+LPP (Reuters)
LSA (700)THR ACC NMI CLUSTERS
-1 0.72 0.61 300.2 0.82 0.79 5070.4 0.86 0.84 1298
LSA+LPP(LSA 700, LPP 680,ε=0.05)
THR ACC NMI CLUSTERS
-1 0.77 0.66 300.2 0.81 0.78 4910.4 0.86 0.84 1253
Table: Performances on Reuters
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
Results
LSA vs. LSA+LPP (20Newsgroup)
LSA (500)THR ACC NMI CLUSTERS
-1 0.58 0.57 200.2 0.59 0.59 4300.3 0.63 0.64 720
LSA+LPP(LSA 500, LPP 480,ε=0.05)
THR ACC NMI CLUSTERS
-1 0.54 0.55 200.2 0.59 0.60 4380.3 0.62 0.64 724
Table: Performances on 20Newsgroups
Overview SVDxLSA LSA and kernels Beyond References
Locality Preserving Projections
Non linear embedding: summary
ISOMAP LLEDo MDS on the geodesicdistance matrix.
Model local neighborhoods as lin-ear a patches and then embed in alower dimensional manifold
Global approach Local approachDynamic programming ap-proaches
Computationally efficient... sparsematrices
Convergence limited bythe manifold curvature andnumber of points
Good representational capacity
No clustering effect Clustering effect due to the local-ity preserving property of the eigen-value approach
Overview SVDxLSA LSA and kernels Beyond References
References
SVD and LSA
Susan T. Dumais, Michael Berry, Using Linear Algebra for IntelligentInformation Retrieval, SIAM Review, 1995, 37, 573–595
G. W. Furnas, S. Deerwester, S. T. Dumais, T. K. Landauer, R. A. Harshman, L.A. Streeter, K. E. Lochbaum, Information retrieval using a singular valuedecomposition model of latent semantic structure, SIGIR ’88: Proc. of theACM SIGIR conference on Research and development in InformationRetrieval, 1988
Non linear embeddings
S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locallylinear embedding. Science, 2000.
B. J. Tenenbaum, V. Silva, and J. Langford. A Global Geometric Frameworkfor Nonlinear Dimensionality Reduction. Science, pp.2319-2323, 2000
Xiaofei He, Partha Niyogi, Locality Preserving Projections. Proceedings ofAdvances in Neural Information Processing Systems, Vancouver, Canada,2003.
Overview SVDxLSA LSA and kernels Beyond References
References
LSA in language learning
Hinrich Schutze, Automatic word sense discrimination, ComputationalLinguistics, 24(1), 1998.
Beate Dorow and Dominic Widdows, Discovering Corpus-Specific WordSenses. EACL 2003, Budapest, Hungary. Conference, pages 79-82
Basili Roberto, Marco Cammisa, Alfio Gliozzo, Integrating Domain andParadigmatic Similarity for Unsupervised Sense Tagging, 17th EuropeanConference on Artificial Intelligence (ECAI06), Riva del Garda, Italy, 2006.
Sebastian Pado and Mirella Lapata. Dependency-based Construction ofSemantic Space Models. In Computational Linquistics, Vol.33(2):161-199,2007.
Basili R. Pennacchiotti M., Proceedings of GEMS "Geometrical Models ofNatural Language Semantics",http://www.aclweb.org/anthology/W/W09/#0200