
10.0 Latent Semantic Analysis for Linguistic Processing

References:
1. “Exploiting Latent Semantic Information in Statistical Language Modeling”, Proceedings of the IEEE, Aug. 2000
2. “Latent Semantic Mapping”, IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication
3. Golub & Van Loan, “Matrix Computations”, 1989
4. “Probabilistic Latent Semantic Indexing”, ACM Special Interest Group on Information Retrieval (ACM SIGIR), 1999
5. “Probabilistic Latent Semantic Analysis”, Proc. of Uncertainty in Artificial Intelligence, 1999
6. “Spoken Document Understanding and Organization”, IEEE Signal Processing Magazine, Sept. 2005, Special Issue on Speech Technology in Human-Machine Communication


Word-Document Matrix Representation

• Vocabulary V of size M and Corpus T of size N
  – V = {w1, w2, ..., wi, ..., wM}, wi: the i-th word, e.g. M = 2×10^4
  – T = {d1, d2, ..., dj, ..., dN}, dj: the j-th document, e.g. N = 10^5
  – cij: number of times wi occurs in dj
  – nj: total number of words present in dj
  – ti = Σj cij: total number of times wi occurs in T
• Word-Document Matrix W = [wij]
  – each row of W is an N-dim “feature vector” for a word wi with respect to all documents dj
  – each column of W is an M-dim “feature vector” for a document dj with respect to all words wi

  – wij = (1 - εi) cij / nj : word frequencies in documents, but normalized by document length and word entropy
  – εi = -(1/log N) Σj (cij/ti) log(cij/ti) : normalized entropy (indexing power) of wi in T
  – 0 ≤ εi ≤ 1; εi ≈ 0 if cij ≠ 0 for only some j and cij = 0 for the other j; εi = 1 if cij = ti/N for all j

[Figure: the M×N word-document matrix W, rows w1, w2, ..., wi, ..., wM, columns d1, d2, ..., dj, ..., dN, entry wij]
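As a concrete illustration of this weighting, the following numpy sketch builds W from raw counts. It is not from the slides; the function name is illustrative, and it assumes every word occurs at least once in the corpus.

```python
import numpy as np

def build_word_document_matrix(counts):
    """Entropy-weighted word-document matrix: w_ij = (1 - eps_i) * c_ij / n_j.

    counts: M x N array with counts[i, j] = c_ij (word w_i in document d_j).
    Assumes every word occurs at least once in the corpus (t_i > 0).
    """
    C = np.asarray(counts, dtype=float)
    M, N = C.shape
    n_j = C.sum(axis=0)                     # total words in each document d_j
    t_i = C.sum(axis=1)                     # total occurrences of each word w_i in T
    p = C / t_i[:, None]                    # p_ij = c_ij / t_i
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)   # convention: 0 log 0 = 0
    eps = -plogp.sum(axis=1) / np.log(N)    # normalized entropy eps_i in [0, 1]
    W = (1.0 - eps)[:, None] * C / n_j[None, :]
    return W, eps

# toy usage: 2 words, 3 documents
W, eps = build_word_document_matrix([[3, 0, 1],
                                     [1, 1, 1]])
```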


Dimensionality Reduction

• W^T W (N×N)
  – (i, j) element of W^T W : inner product of the i-th and j-th columns of W, “similarity” between di and dj
  – W^T W e'i = si^2 e'i , e'i: orthonormal eigenvectors, V V^T = I (N×N)
  – W^T W = V S2^2 V^T , V = [e'1, e'2, ..., e'N] , S2^2 = diag[si^2] (N×N)
    si^2: eigenvalues of W^T W, s1^2 ≥ s2^2 ≥ ... , si^2 = 0 for i > min(M, N)
  – si^2: weights (significance of the “component matrices” e'i e'i^T)
  – dimensionality reduction: selection of the R largest eigenvalues (R = 800 for example)
    W^T W ≈ V S2^2 V^T with V now N×R = [e'1, e'2, ..., e'R] and S2^2 now R×R
    R “concepts” or “latent semantic concepts”
• W W^T (M×M)
  – (i, j) element of W W^T : inner product of the i-th and j-th rows of W, “similarity” between wi and wj
  – W W^T ei = si^2 ei , ei: orthonormal eigenvectors, U U^T = I (M×M)
  – W W^T = U S1^2 U^T , U = [e1, e2, ..., eM] , S1^2 = diag[si^2] (M×M)
    si^2: eigenvalues of W W^T
  – si^2: weights (significance of the “component matrices” ei ei^T)
  – dimensionality reduction: selection of the R largest eigenvalues
    W W^T ≈ U S1^2 U^T with U now M×R = [e1, e2, ..., eR] and S1^2 now R×R
    R “concepts” or “latent semantic concepts”
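As a quick numerical check of the relationship used here (the R largest eigenvalues of W W^T and of W^T W are the squared singular values si^2 of W), here is a small numpy sketch; the toy sizes and random matrix are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, R = 50, 80, 10                      # toy sizes; the slides use M ~ 2e4, N ~ 1e5, R ~ 800
W = rng.random((M, N))

s = np.linalg.svd(W, compute_uv=False)                  # singular values s_1 >= s_2 >= ...
eig_WWt = np.sort(np.linalg.eigvalsh(W @ W.T))[::-1]    # eigenvalues of W W^T (M x M)
eig_WtW = np.sort(np.linalg.eigvalsh(W.T @ W))[::-1]    # eigenvalues of W^T W (N x N)

print(np.allclose(eig_WWt[:R], s[:R] ** 2))   # True
print(np.allclose(eig_WtW[:R], s[:R] ** 2))   # True
```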


Singular Value Decomposition (SVD)

• Singular Value Decomposition (SVD)
  – si: singular values, s1 ≥ s2 ≥ ... ≥ sR; U: left singular matrix, V: right singular matrix
• Vectors for words wi: uiS (the i-th row of US)
  – a vector with dimensionality N reduced to a vector uiS with dimensionality R
  – the N-dimensional space defined by the N documents reduced to an R-dimensional space defined by the R “concepts”
  – the R row vectors of V^T, or column vectors of V, or eigenvectors {e'1, ..., e'R}, are the R orthonormal basis vectors of the “latent semantic space” with dimensionality R, in which uiS is represented
  – words with similar “semantic concepts” have “closer” locations in the “latent semantic space”
    • they tend to appear in similar “types” of documents, although not necessarily in exactly the same documents

  W (M×N) ≈ Ŵ (M×N) = U S V^T , U: M×R, S: R×R, V^T: R×N

[Figure: the M×N matrix W = [wij] (rows w1, w2, ..., wi, ..., wM; columns d1, d2, ..., dj, ..., dN) factored into U (M×R, rows ui), the diagonal matrix S (R×R, entries s1, ..., sR), and V^T (R×N, columns vj^T indexed by d1, ..., dN)]
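A minimal sketch (not from the slides; names are illustrative) of the rank-R truncation with numpy, where the word vectors uiS are the rows of U_R S_R:

```python
import numpy as np

def truncated_svd(W, R):
    """Rank-R SVD: W ~= W_hat = U_R S_R V_R^T (U_R: M x R, S_R: R x R, V_R^T: R x N)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_R, s_R, Vt_R = U[:, :R], s[:R], Vt[:R, :]
    word_vectors = U_R * s_R                # row i is u_i S, the R-dim vector for word w_i
    W_hat = word_vectors @ Vt_R             # the rank-R approximation of W
    return U_R, s_R, Vt_R, word_vectors, W_hat
```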


Singular Value Decomposition (SVD) (cont.)

• Vectors for documents dj: vjS (the j-th row of VS; equivalently S vj^T as a column)
  – a vector with dimensionality M reduced to a vector vjS with dimensionality R
  – the M-dimensional space defined by the M words reduced to an R-dimensional space defined by the R “concepts”
  – the R columns of U, or eigenvectors {e1, ..., eR}, are the R orthonormal basis vectors of the “latent semantic space” with dimensionality R, in which vjS is represented
  – documents with similar “semantic concepts” have “closer” locations in the “latent semantic space”
    • they tend to include similar “types” of words, although not necessarily exactly the same words
• The association structure between words wi and documents dj is preserved while noisy information is deleted, and the dimensionality is reduced to a common set of R “concepts”

[Figure: the same W = U S V^T diagram as above, here highlighting how the j-th column of W corresponds to the column vj^T of V^T]
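For symmetry with the word side, the document vectors vjS are the rows of V_R S_R from the same truncated SVD; a brief numpy sketch (illustrative, not from the slides):

```python
import numpy as np

def document_vectors(W, R):
    """Rows of V_R S_R: row j is v_j S, the R-dim vector for document d_j."""
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    return Vt[:R, :].T * s[:R]              # shape N x R
```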


Example Applications in Linguistic Processing

• Word Clustering
  – example applications: class-based language modeling, information retrieval, etc.
  – words with similar “semantic concepts” have “closer” locations in the “latent semantic space”
    • they tend to appear in similar “types” of documents, although not necessarily in exactly the same documents
  – each component of the reduced word vector uiS is the “association” of the word with the corresponding “concept”
  – example similarity measure between two words (given below)
• Document Clustering
  – example applications: clustered language modeling, language model adaptation, information retrieval, etc.
  – documents with similar “semantic concepts” have “closer” locations in the “latent semantic space”
    • they tend to include similar “types” of words, although not necessarily exactly the same words
  – each component of the reduced document vector vjS is the “association” of the document with the corresponding “concept”
  – example “similarity” measure between two documents (given below)

  sim(wi, wj) = cos(uiS, ujS) = ui S^2 uj^T / (|uiS| |ujS|)

  sim(di, dj) = cos(viS, vjS) = vi S^2 vj^T / (|viS| |vjS|)
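A small sketch of these cosine measures, applied to the scaled vectors uiS (words) or vjS (documents) produced by the earlier sketches; the function name is illustrative.

```python
import numpy as np

def lsa_similarity(a, b):
    """Cosine of two scaled latent-space vectors, e.g. a = u_i S and b = u_j S,
    which equals u_i S^2 u_j^T / (|u_i S| |u_j S|) as on the slide."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# e.g. sim(w_i, w_j) = lsa_similarity(word_vectors[i], word_vectors[j])
#      sim(d_i, d_j) = lsa_similarity(doc_vectors[i],  doc_vectors[j])
```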


Example Applications in Linguistic Processing (cont.)

• Information Retrieval
  – “concept matching” vs. “lexical matching”: relevant documents are associated with similar “concepts”, but may not include exactly the same words
  – example approach: treat the query as a new document (by “folding-in”) and evaluate its “similarity” with all possible documents
• Fold-in
  – consider a new document outside the training corpus T, but with similar language patterns or “concepts”
  – construct a new column dp, p > N, with respect to the M words
  – assuming U and S remain unchanged:
    dp = U S vp^T (just as a column in W = U S V^T)
    vpS = dp^T U as an R-dim representation of the new document
    (i.e. obtaining the projection of dp onto the basis vectors ei of U by inner products)
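A sketch of the fold-in step, assuming the new document’s raw counts are weighted with the same (1 - εi)/np scheme used when building W (an assumption consistent with the earlier weighting, not stated explicitly on this slide):

```python
import numpy as np

def fold_in(counts_p, U_R, eps):
    """Fold a new document (term-count column of length M) into the latent space.

    Returns v_p S = d_p^T U_R, the R-dim representation of the new document,
    where d_p is the entropy/length-weighted column built like a column of W.
    """
    c = np.asarray(counts_p, dtype=float)
    n_p = c.sum()                            # total words in the new document
    d_p = (1.0 - eps) * c / n_p              # weighted column d_p, p > N
    return d_p @ U_R                         # projection onto the basis e_i of U
```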


Integration with N-gram Language Models

• Language Modeling for Speech Recognition
  – Prob(wq|dq-1)
    wq: the q-th word in the current document to be recognized (q: sequence index)
    dq-1: the recognized history in the current document
    vq-1 S = dq-1^T U : representation of dq-1 by vq-1 (folded-in)
  – Prob(wq|dq-1) can be estimated from uq and vq-1 in the R-dim space
  – integration with the N-gram:
    Prob(wq|Hq-1) = Prob(wq|hq-1, dq-1)
    Hq-1: history up to wq-1
    hq-1: <wq-n+1, wq-n+2, ..., wq-1>
  – the N-gram gives local relationships, while dq-1 gives semantic concepts
  – dq-1 emphasizes the key content words more, while the N-gram counts all words similarly, including function words
• vq-1 for dq-1 can be estimated iteratively
  – assuming the q-th word in the current document is wi:

    dq = ((q-1)/q) dq-1 + (1/q)(1 - εi) [0 ... 0 1 0 ... 0]^T , updated word by word
    (the single 1 at the i-th dimension out of M)
    vq S = dq^T U
  – vq moves in the R-dim space initially, and eventually settles down somewhere
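The following sketch implements one word-by-word update as reconstructed above (my own reading; the exact decay schedule in ref. 1 may differ in detail), assuming εi and U_R come from the trained LSA space:

```python
import numpy as np

def update_history(d_prev, q, i, eps, U_R):
    """Update the pseudo-document after observing word w_i as the q-th word (q >= 1).

    d_q = ((q-1)/q) d_{q-1} + (1/q)(1 - eps_i) e_i, with e_i the i-th unit vector
    of length M; returns d_q and its R-dim representation v_q S = d_q^T U_R.
    """
    d_q = ((q - 1) / q) * d_prev             # decay the previous pseudo-document
    d_q[i] += (1.0 / q) * (1.0 - eps[i])     # add the weighted count of the new word
    return d_q, d_q @ U_R
```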


Probabilistic Latent Semantic Analysis (PLSA)

• The same goal as LSA: using a set of latent topics {T1, T2, ..., TK} to construct a new relationship between the documents and terms, but within a probabilistic framework

• Trained with EM by maximizing the total likelihood LT

  P(tj|Di) = Σk=1..K P(tj|Tk) P(Tk|Di)

  LT = Σi=1..N Σj=1..n c(tj, Di) log P(tj|Di)

  c(tj, Di): frequency count of term tj in document Di

[Figure: PLSA structure — documents D1, D2, ..., Di, ..., DN connected to terms t1, t2, ..., tj, ..., tn through latent topics T1, T2, ..., Tk, ..., TK, with probabilities P(Tk|Di) and P(tj|Tk); Di: documents, Tk: latent topics, tj: terms]
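To make the training procedure concrete, here is a minimal EM sketch that maximizes the likelihood LT above (an illustration under standard PLSA assumptions, not the exact recipe of refs. 4-5; array names are my own):

```python
import numpy as np

def plsa_em(C, K, iters=50, seed=0):
    """Minimal EM for PLSA. C[i, j] = c(t_j, D_i), shape N x n (documents x terms)."""
    rng = np.random.default_rng(seed)
    N, n = C.shape
    P_t_given_T = rng.random((K, n)); P_t_given_T /= P_t_given_T.sum(axis=1, keepdims=True)
    P_T_given_D = rng.random((N, K)); P_T_given_D /= P_T_given_D.sum(axis=1, keepdims=True)
    for _ in range(iters):
        # E-step: posterior P(T_k | D_i, t_j), shape N x n x K
        joint = P_T_given_D[:, None, :] * P_t_given_T.T[None, :, :]
        post = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
        # M-step: re-estimate both conditional distributions from expected counts
        weighted = C[:, :, None] * post                       # c(t_j, D_i) * posterior
        P_t_given_T = weighted.sum(axis=0).T                  # K x n
        P_t_given_T /= P_t_given_T.sum(axis=1, keepdims=True) + 1e-12
        P_T_given_D = weighted.sum(axis=1)                    # N x K
        P_T_given_D /= P_T_given_D.sum(axis=1, keepdims=True) + 1e-12
    # total likelihood L_T = sum_i sum_j c(t_j, D_i) log P(t_j | D_i)
    P_t_given_D = P_T_given_D @ P_t_given_T                   # N x n
    L_T = float((C * np.log(P_t_given_D + 1e-12)).sum())
    return P_t_given_T, P_T_given_D, L_T
```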