Latent Semantic Indexing: A Probabilistic Analysis
Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala

Page 1:

Latent Semantic Indexing: A Probabilistic Analysis

Christos Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala

Page 2:

Motivation

• Applications in several areas:
  – querying
  – clustering, identifying topics
  – other:
    • synonym recognition (e.g., TOEFL)
    • psychology tests
    • essay scoring

Page 3:

Motivation

• Latent Semantic Indexing is:
  – Latent: captures associations which are not explicit
  – Semantic: represents meaning as a function of similarity to other entities
  – Cool: lots of spiffy applications, and the potential for some good theory too

Page 4:

Overview

• IR and two classical problems

• How LSI works

• Why LSI is effective: A probabilistic analysis

Page 5:

Information Retrieval

• Text corpus with many documents (docs)

• Given a query, find relevant docs

• Classical problems:
  – synonymy: missing docs that refer to "automobile" when querying on "car"
  – polysemy: retrieving docs about the Internet when querying on "surfing"

• Solution: Represent docs (and queries) by their underlying latent concepts

Page 6:

Information Retrieval

• Represent each document as a word vector

• Represent corpus as term-document matrix (T-D matrix)

• A classical method (see the sketch below):
  – create a new vector from the query terms
  – find the documents with the highest dot product
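A minimal sketch of this dot-product method in numpy; the tiny corpus, vocabulary, and query are made up for illustration:

```python
import numpy as np

# Hypothetical toy corpus over a 4-term vocabulary.
terms = ["car", "automobile", "surfing", "internet"]
docs = [
    "car automobile car",
    "internet surfing internet",
    "automobile car internet",
]

# Term-document matrix A: A[i, j] = count of term i in doc j.
A = np.array([[d.split().count(t) for d in docs] for t in terms])

# Represent the query as a term vector, then rank docs by dot product.
q = np.array([int(t == "car") for t in terms])
scores = q @ A                 # one score per document
print(np.argsort(-scores))     # indices of best-matching docs first
```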

Page 7:

Document vector space

[Figure: documents and a query drawn as vectors in a 2-D term space, with axes "Word 1" and "Word 2"]

Page 8:

Latent Semantic Indexing (LSI)

• Process term-document (T-D) matrix to expose statistical structure

• Convert high-dimensional space to lower-dimensional space, throw out noise, keep the good stuff

• Related to principal component analysis (PCA) and multidimensional scaling (MDS)

Page 9:

Parameters

• U = universe of terms

• n = number of terms

• m = number of docs

• A = n × m matrix with rank r
  – columns represent docs
  – rows represent terms

Page 10:

Singular Value Decomposition (SVD)

• LSI uses SVD, a linear analysis method:

  A = U D V^T

  where A is the n × m term-document matrix (rows: terms, columns: documents), U is n × r, D is r × r, and V^T is r × m
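A quick numpy illustration of the factorization above, with a random matrix standing in for a real term-document matrix:

```python
import numpy as np

n, m = 6, 4                        # number of terms, number of docs
A = np.random.rand(n, m)           # stand-in for a real term-document matrix

# Reduced SVD: U is n x r, s holds the singular values, Vt is r x m.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
D = np.diag(s)

# The factorization reconstructs A up to floating-point error.
assert np.allclose(A, U @ D @ Vt)
print(U.shape, D.shape, Vt.shape)  # (6, 4) (4, 4) (4, 4)
```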

Page 11:

SVD

• r is the rank of A

• D is a diagonal matrix of the r singular values

• U and V have orthonormal columns

• SVD is always possible

• numerical methods for SVD exist

• run time: O(m n c), where c denotes the average number of words per document

Page 12:

T-D Matrix Approximation

• Keep only the k largest singular values:

  A_k = U_k D_k V_k^T

  where A_k is n × m (rows: terms, columns: documents), U_k is n × k, D_k is k × k, and V_k^T is k × m
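A small numpy sketch of this truncation (the function name and random test matrix are mine):

```python
import numpy as np

def lsi_rank_k(A, k):
    """Rank-k approximation A_k = U_k D_k V_k^T: keep the k largest singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.rand(6, 4)
A_2 = lsi_rank_k(A, k=2)
# By the Eckart-Young theorem, A_k is the closest rank-k matrix to A
# in the Frobenius (and spectral) norm.
print(np.linalg.norm(A - A_2))
```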

Page 13:

Synonymy

• LSI is used in several ways, e.g., detecting synonymy

• A measure of similarity for two terms i and j:
  – in the original space: the dot product of rows i and j of A (the (i, j) entry of A A^T)
  – better: the dot product of rows i and j of A_k = U_k D_k V_k^T (the (i, j) entry of A_k A_k^T)
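A sketch of this similarity measure in numpy; the toy matrix, where two terms share an occurrence pattern, is made up for illustration:

```python
import numpy as np

def term_similarities(A, k):
    """All pairwise term similarities: the matrix A_k A_k^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return A_k @ A_k.T

# Toy data: terms 0 and 1 co-occur across docs, term 2 lives elsewhere.
A = np.array([[1., 1., 0., 1.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
S = term_similarities(A, k=2)
print(S[0, 1], S[0, 2])   # co-occurring pair scores high, other pair ~0
```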

Page 14:

“Semantic” Space

[Figure: terms plotted in "semantic" space; "House", "Home", and "Domicile" cluster together, while "Kumquat", "Apple", "Orange", and "Pear" form a separate cluster]

Page 15:

Synonymy (intuition)

• Consider the term-term autocorrelation matrix A A^T

• If two terms co-occur (e.g., supply and demand), A has two nearly identical rows

• The difference of those rows yields a small eigenvalue of A A^T

• The corresponding eigenvector will likely be projected out in A_k A_k^T, as it carries a weak eigenvalue
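A tiny numpy demonstration of this intuition; the occurrence counts are invented:

```python
import numpy as np

# Two co-occurring terms ("supply", "demand") make two rows of A nearly identical.
A = np.array([[1., 2., 0., 1.00],
              [1., 2., 0., 1.01],   # almost the same occurrence pattern
              [0., 0., 3., 0.00]])

# Eigen-decomposition of the term-term autocorrelation matrix A A^T.
eigenvalues, eigenvectors = np.linalg.eigh(A @ A.T)
print(eigenvalues)   # one eigenvalue is tiny: the "supply minus demand" direction
# A rank-k truncation drops that weak direction, effectively merging the synonyms.
```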

Page 16:

A Performance Evaluation

• Landauer & Dumais:
  – performed LSI on 30,000 encyclopedia articles
  – took the synonym test from TOEFL
  – chose the most similar word
  – LSI: 64.4% (52.2% corrected for guessing)
  – people: 64.5% (52.7% corrected for guessing)
  – correlated .44 with incorrect alternatives

Page 17:

A Probabilistic Analysis: Overview

• The model:
  – topics are sufficiently disjoint
  – each doc is drawn from a single (random) topic

• Result, with high probability (whp):
  – docs from the same topic will be similar
  – docs from different topics will be dissimilar

Page 18:

The Probabilistic Model

• k topics, each corresponding to a set of words

• The sets are mutually disjoint

• Below, all random choices are made uniformly at random

• A corpus of m docs, each created as follows:

Page 19:

The Probabilistic Model (cont.)

• Choosing a doc:
  – choose the length ℓ of the doc
  – choose a topic T
  – repeat ℓ times:
    • with probability 1 − ε, choose a word from topic T
    • with probability ε, choose a word from the other topics
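A minimal sketch of this generative process; the topic word sets and the eps value are made up, and all random choices are uniform, per the previous slide:

```python
import random

def sample_doc(topics, length, eps):
    """Draw one doc: pick a topic T uniformly; each word comes from T with
    probability 1 - eps, otherwise uniformly from the other topics."""
    T = random.choice(list(topics))
    other_words = [w for t in topics if t != T for w in topics[t]]
    doc = [random.choice(topics[T]) if random.random() < 1 - eps
           else random.choice(other_words)
           for _ in range(length)]
    return T, doc

# Hypothetical disjoint topics.
topics = {"cars": ["car", "automobile", "engine"],
          "web":  ["internet", "surfing", "browser"]}
print(sample_doc(topics, length=10, eps=0.1))
```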

Page 20:

Setup

• Let v_d be the vector assigned to doc d by the rank-k LSI performed on the corpus

• The rank-k LSI is ε-skewed if, for all docs d, d′:
  – v_d · v_d′ ≥ (1 − ε) |v_d| |v_d′| when d and d′ are from the same topic
  – |v_d · v_d′| ≤ ε |v_d| |v_d′| when d ∈ T_i, d′ ∈ T_j with i ≠ j

• (intuition) Docs from the same topic should be similar (high dot product), and docs from different topics should be dissimilar
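The definition transcribes directly into a check; a sketch, assuming the LSI doc vectors v_d are available as the columns of a matrix (e.g., D_k V_k^T) and that topic labels are known:

```python
import numpy as np

def is_eps_skewed(V, topic_of, eps):
    """Check the eps-skew condition; column d of V is v_d, and
    topic_of[d] is doc d's topic label."""
    m = V.shape[1]
    for d in range(m):
        for e in range(d + 1, m):
            dot = V[:, d] @ V[:, e]
            bound = np.linalg.norm(V[:, d]) * np.linalg.norm(V[:, e])
            if topic_of[d] == topic_of[e]:
                if dot < (1 - eps) * bound:      # same topic: must be similar
                    return False
            elif abs(dot) > eps * bound:         # different topics: near-orthogonal
                return False
    return True
```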

Page 21:

The Result

• Theorem: Assume the corpus is created from the model just described (k topics, etc.). Then the rank-k LSI is O(ε)-skewed with probability 1 − O(1/m)

Page 22:

Proof Sketch

• Show with k topics, we obtain k orthogonal subspaces– Assume strictly disjoint topics ( )

• show that whp the k highest eigenvalues of indeed correspond to the k topics (are not intra-topic)

– ( ) relax by using a matrix perturbation analysis

0k

Tk AA

0

Page 23:

Extensions

• Theory should go beyond explaining (ideally)

• Potential for speed-up:
  – project the doc vectors onto a suitably small space
  – perform LSI on this space

• Yields O(m(log² n + c log n)) compared to O(mnc)
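The slides don't spell out the construction, so here is a minimal sketch of the two-step idea under standard Johnson-Lindenstrauss assumptions; the function name, target_dim value, and stand-in corpus are mine:

```python
import numpy as np

def lsi_after_random_projection(A, k, target_dim):
    """Randomly project the n-dimensional doc vectors down to target_dim,
    then run the rank-k SVD on the much smaller projected matrix."""
    n, m = A.shape
    R = np.random.randn(target_dim, n) / np.sqrt(target_dim)  # random projection
    B = R @ A                              # target_dim x m, far cheaper to decompose
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return np.diag(s[:k]) @ Vt[:k, :]      # k-dimensional doc representations

A = np.random.rand(1000, 200)              # stand-in corpus: 1000 terms, 200 docs
print(lsi_after_random_projection(A, k=5, target_dim=60).shape)  # (5, 200)
```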

Page 24:

Future work

• Learn more abstract algebra (math)!

• Extensions:
  – docs spanning multiple topics?
  – polysemy?
  – other positive properties?

• Another important role of theory:
  – unify and generalize: spectral analysis has found applications elsewhere in IR