
Page 1

Latent Semantic Analysis

Summer 2014

Jelena Markovic and Paulo Orenstein

STATS300D, Stanford

Word cloud: text mining, missing data, correspondence analysis, SVD, contingency table, perfume, wine tasting, happiness, principal component analysis, Brittany, Stanford, ozoneNA, vanilla, raccoons, sauvignon, log-linear models, statistics, Miss Dior, number of components, multivariate analysis, dimensionality, multiple factor analysis, active variables, observations cloud, FactoMineR, res.pca, inertia, supplementary information, eigenvalues


Page 4

http://www.jstor.org/


Page 5

1: Introduction


Excerpt from Chapter 1, "Vectors and Matrices in Data Mining and Pattern Recognition":

1.2 Vectors and Matrices

The following examples illustrate the use of vectors and matrices in data mining. These examples present the main data mining areas discussed in the book, and they will be described in more detail in Part II.

In many applications a matrix is just a rectangular array of data, and the elements are scalars, real numbers:

A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \in \mathbb{R}^{m \times n}.

To treat the data by mathematical methods, some mathematical structure must be added. In the simplest case, the columns of the matrix are considered as vectors in R^m.

Example 1.1. Term-document matrices are used in information retrieval. Consider the following selection of five documents. [1] Key words, which we call terms, are marked in boldface. [2]

Document 1: The Google matrix P is a model of the Internet.
Document 2: P_ij is nonzero if there is a link from Web page j to i.
Document 3: The Google matrix is used to rank all Web pages.
Document 4: The ranking is done by solving a matrix eigenvalue problem.
Document 5: England dropped out of the top 10 in the FIFA ranking.

If we count the frequency of terms in each document we get the following result:

Term        Doc 1  Doc 2  Doc 3  Doc 4  Doc 5
eigenvalue    0      0      0      1      0
England       0      0      0      0      1
FIFA          0      0      0      0      1
Google        1      0      1      0      0
Internet      1      0      0      0      0
link          0      1      0      0      0
matrix        1      0      1      1      0
page          0      1      1      0      0
rank          0      0      1      1      1
Web           0      1      1      0      0

[1] In Document 5, FIFA is the Federation Internationale de Football Association. This document is clearly concerned with football (soccer). The document is a newspaper headline from 2005. After the 2006 World Cup, England came back into the top 10.

[2] To avoid making the example too large, we have ignored some words that would normally be considered as terms (key words). Note also that only the stem of the word is significant: "ranking" is considered the same as "rank."
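- A minimal Python (numpy) sketch of this construction, assuming plain whitespace tokenization and a crude prefix-based stemming (both assumptions for illustration, not part of the example); it counts term occurrences in the five documents and should reproduce the 10 x 5 matrix above.

import numpy as np

# Terms (rows) and documents (columns) from Example 1.1.
terms = ["eigenvalue", "england", "fifa", "google", "internet",
         "link", "matrix", "page", "rank", "web"]
docs = [
    "the google matrix p is a model of the internet",
    "pij is nonzero if there is a link from web page j to i",
    "the google matrix is used to rank all web pages",
    "the ranking is done by solving a matrix eigenvalue problem",
    "england dropped out of the top 10 in the fifa ranking",
]

def stem(word):
    # Crude prefix stemming for this example only: "pages" -> "page", "ranking" -> "rank".
    for t in terms:
        if word.startswith(t):
            return t
    return word

# A[i, j] = number of times term i occurs in document j.
A = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        w = stem(word)
        if w in terms:
            A[terms.index(w), j] += 1

print(A.astype(int))   # should match the 10 x 5 term-document matrix above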



Page 7

1: Introduction (continued)

Excerpt, Section 1.2 (Vectors and Matrices), continued:

Thus each document is represented by a vector, or a point, in R^10, and we can organize all documents into a term-document matrix:

A = \begin{pmatrix}
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 \\
1 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 1 & 0 \\
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 \\
0 & 1 & 1 & 0 & 0
\end{pmatrix}.

Now assume that we want to find all documents that are relevant to the query "ranking of Web pages." This is represented by a query vector, constructed in a way analogous to the term-document matrix:

q = \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 1 \end{pmatrix} \in \mathbb{R}^{10}.

Thus the query itself is considered as a document. The information retrieval task can now be formulated as a mathematical problem: find the columns of A that are close to the vector q. To solve this problem we must use some distance measure in R^10.

In the information retrieval application it is common that the dimension m is large, of the order 10^6, say. Also, as most of the documents contain only a small fraction of the terms, most of the elements in the matrix are equal to zero. Such a matrix is called sparse.

Some methods for information retrieval use linear algebra techniques (e.g., singular value decomposition (SVD)) for data compression and retrieval enhancement. Vector space methods for information retrieval are presented in Chapter 11.

Often it is useful to consider the matrix not just as an array of numbers, or as a set of vectors, but also as a linear operator. Denote the columns of A

a_{\cdot j} = \begin{pmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{mj} \end{pmatrix}, \quad j = 1, 2, \ldots, n,



Page 8

2: Vector Space Model

- For realistic problems, first use a text parser; then create the term-document matrix.


- Query matching: given q (the query vector), find the best a_j (document).

- This is often done using the cosine distance measure: a_j is retrieved if, for a given tolerance,

\cos(\theta(q, a_j)) = \frac{q^T a_j}{\|q\|_2 \, \|a_j\|_2} > \text{tolerance}.

Page 9

2: Vector Space Model

- Besides counting, we can use a term weighting scheme. For example,

a_{ij} = f_{ij} \log(n/n_i),

where f_{ij} is the number of times term i appears in document j, and n_i is the number of documents that contain term i.


- Hence a_{ij} will be bigger the more 'important' the term is, and the cosine distance measure takes that into account.

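- A minimal sketch of this weighting scheme applied to the counts of Example 1.1 (all ten terms occur in at least one document, so n_i > 0 here; a real implementation would guard against n_i = 0).

import numpy as np

# Raw counts f_ij from Example 1.1 (10 terms x 5 documents).
F = np.array([
    [0,0,0,1,0], [0,0,0,0,1], [0,0,0,0,1], [1,0,1,0,0], [1,0,0,0,0],
    [0,1,0,0,0], [1,0,1,1,0], [0,1,1,0,0], [0,0,1,1,1], [0,1,1,0,0],
], dtype=float)

n = F.shape[1]                  # total number of documents
n_i = (F > 0).sum(axis=1)       # number of documents containing term i
# a_ij = f_ij * log(n / n_i): terms occurring in every document get weight 0,
# rarer terms are weighted up.
A = F * np.log(n / n_i)[:, None]
print(np.round(A, 3))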

Page 10

3: Latent Semantic Indexing

- Latent semantic indexing is "based on the assumption that there is some underlying latent semantic structure in the data ... that is corrupted by the wide variety of words used".

- SVD:

Excerpt from Chapter 11, Text Mining:

[Figure 11.3. Query matching for Q9 using the vector space method. Recall versus precision.]

11.3 Latent Semantic Indexing

Latent semantic indexing* (LSI) [28, 9] "is based on the assumption that there is some underlying latent semantic structure in the data ... that is corrupted by the wide variety of words used" [76] and that this semantic structure can be discovered and enhanced by projecting the data (the term-document matrix and the queries) onto a lower-dimensional space using the SVD.

Let A = U \Sigma V^T be the SVD of the term-document matrix and approximate it by a matrix of rank k:

A \approx U_k \Sigma_k V_k^T =: U_k H_k.

The columns of U_k live in the document space and are an orthogonal basis that we use to approximate the documents. Write H_k in terms of its column vectors, H_k = (h_1, h_2, \ldots, h_n). From A \approx U_k H_k we have a_j \approx U_k h_j, which means that column j of H_k holds the coordinates of document j in terms of the orthogonal basis.

* Sometimes also called latent semantic analysis (LSA) [52].

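- A minimal numpy sketch of the rank-k truncation described above, applied to the term-document matrix of Example 1.1 (taking H_k = Sigma_k V_k^T as in the excerpt; k = 2 is the choice used later in Example 11.8).

import numpy as np

A = np.array([
    [0,0,0,1,0], [0,0,0,0,1], [0,0,0,0,1], [1,0,1,0,0], [1,0,0,0,0],
    [0,1,0,0,0], [1,0,1,1,0], [0,1,1,0,0], [0,0,1,1,1], [0,1,1,0,0],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) V^T
Uk = U[:, :k]                    # orthonormal basis used to approximate the documents
Hk = np.diag(s[:k]) @ Vt[:k]     # H_k = Sigma_k V_k^T: document coordinates in that basis

print(np.round(s, 4))                             # singular values of A
print(np.round(np.linalg.norm(A - Uk @ Hk), 4))   # rank-2 approximation error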

Page 11

3: Latent Semantic Indexing

- A \approx U_k H_k.

- H_k = (h_1, h_2, \ldots, h_n).

- A \approx U_k H_k \;\Rightarrow\; a_j \approx U_k h_j.

- \cos \theta_j = \dfrac{q_k^T h_j}{\|q_k\|_2 \, \|h_j\|_2}, \qquad q_k = U_k^T q.
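- Continuing the SVD sketch above, a hedged illustration of this query matching: with k = 2 and the query "ranking of Web pages", the projected cosines should roughly reproduce the values reported in Example 11.8 on the following pages.

import numpy as np

A = np.array([
    [0,0,0,1,0], [0,0,0,0,1], [0,0,0,0,1], [1,0,1,0,0], [1,0,0,0,0],
    [0,1,0,0,0], [1,0,1,1,0], [0,1,1,0,0], [0,0,1,1,1], [0,1,1,0,0],
], dtype=float)
q = np.zeros(10); q[[7, 8, 9]] = 1.0      # query "ranking of Web pages"

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Hk = U[:, :k], np.diag(s[:k]) @ Vt[:k]

qk = Uk.T @ q                              # project the query: q_k = U_k^T q
# cos(theta_j) = q_k^T h_j / (||q_k||_2 ||h_j||_2) for every column h_j of H_k.
cos_lsi = (Hk.T @ qk) / (np.linalg.norm(Hk, axis=0) * np.linalg.norm(qk))
print(np.round(cos_lsi, 4))   # expected approx. (0.7857, 0.8332, 0.9670, 0.4873, 0.1819)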

Page 12

3: Latent Semantic Indexing

- Artificial example: query “ranking of web pages”.

Excerpt, Chapter 11 (Text Mining), page 138:

... is often the average performance that matters. A systematic study of different aspects of LSI was done in [52]. It was shown that LSI improves retrieval performance for surprisingly small values of the reduced rank k. At the same time, the relative matrix approximation errors are large. It is probably not possible to prove any general results that explain in what way and for which data LSI can improve retrieval performance. Instead we give an artificial example (constructed using similar ideas as a corresponding example in [12]) that gives a partial explanation.

Example 11.8. Consider the term-document matrix from Example 1.1 and the query "ranking of Web pages." Obviously, Documents 1-4 are relevant with respect to the query, while Document 5 is totally irrelevant. However, we obtain cosines for the query and the original data as

(0, \ 0.6667, \ 0.7746, \ 0.3333, \ 0.3333),

which indicates that Document 5 (the football document) is as relevant to the query as Document 4. Further, since none of the words of the query occurs in Document 1, this document is orthogonal to the query.

We then compute the SVD of the term-document matrix and use a rank-2 approximation. After projection to the two-dimensional subspace, the cosines, computed according to (11.4), are

(0.7857, \ 0.8332, \ 0.9670, \ 0.4873, \ 0.1819).

It turns out that Document 1, which was deemed totally irrelevant for the query in the original representation, is now highly relevant. In addition, the cosines for the relevant Documents 2-4 have been reinforced. At the same time, the cosine for Document 5 has been significantly reduced. Thus, in this artificial example, the dimension reduction enhanced the retrieval performance.

In Figure 11.6 we plot the five documents and the query in the coordinate system of the first two left singular vectors. Obviously, in this representation, the first document is closer to the query than Document 5. The first two left singular vectors are

u_1 = (0.1425, 0.0787, 0.0787, 0.3924, 0.1297, 0.1020, 0.5348, 0.3647, 0.4838, 0.3647)^T,
u_2 = (0.2430, 0.2607, 0.2607, -0.0274, 0.0740, -0.3735, 0.2156, -0.4749, 0.4023, -0.4749)^T,

and the singular values are \Sigma = \mathrm{diag}(2.8546, 1.8823, 1.7321, 1.2603, 0.8483). The first four columns in A are strongly coupled via the words Google, matrix, etc., and ...


- Cosines for the query and the data are shown above, first for the original data and then after the rank-2 SVD projection.

- Documents 1-4 seem relevant, while Document 5 is totally irrelevant.

Page 13

3: Latent Semantic Indexing

- The first two left singular vectors:

u_1 = (0.1425, 0.0787, 0.0787, 0.3924, 0.1297, 0.1020, 0.5348, 0.3647, 0.4838, 0.3647)^T,
u_2 = (0.2430, 0.2607, 0.2607, -0.0274, 0.0740, -0.3735, 0.2156, -0.4749, 0.4023, -0.4749)^T.

- The singular values:

\Sigma = \mathrm{diag}(2.8546, 1.8823, 1.7321, 1.2603, 0.8483).


Page 14

3: Latent Semantic Indexing

Excerpt, Chapter 11, page 139 (Section 11.4, Clustering):

[Figure 11.6. The five documents and the query projected to the coordinate system of the first two left singular vectors.]

... those words are the dominating contents of the document collection (cf. the singular values). This shows in the composition of u_1. So even if none of the words in the query are matched by Document 1, that document is so strongly correlated to the dominating direction that it becomes relevant in the reduced representation.

11.4 Clustering

In the case of document collections, it is natural to assume that there are groups of documents with similar contents. If we think of the documents as points in R^m, we may be able to visualize the groups as clusters. Representing each cluster by its mean value, the centroid, [26] we can compress the data in terms of the centroids. Thus clustering, using the k-means algorithm, for instance, is another method for low-rank approximation of the term-document matrix. The application of clustering to information retrieval is described in [30, 76, 77].

In analogy to LSI, the matrix C_k \in \mathbb{R}^{m \times k} of (normalized but not orthogonal) centroids can be used as an approximate basis in the "document space." For query matching we then need to determine the coordinates of all the documents in this basis. This can be done by solving the matrix least squares problem

\min_{G_k} \| A - C_k G_k \|_F.

However, it is more convenient first to orthogonalize the columns of C, i.e., compute ...

[26] Closely related to the concept vector [30].

The five documents and the query projected to the coordinate system of the first two left singular vectors.
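- A hedged numpy sketch of the centroid idea for this tiny example (the two-cluster split of the documents is a hand-picked assumption, and the orthogonalization step mentioned in the excerpt is omitted): build normalized centroids C_k and solve min_{G_k} ||A - C_k G_k||_F for the document coordinates.

import numpy as np

A = np.array([
    [0,0,0,1,0], [0,0,0,0,1], [0,0,0,0,1], [1,0,1,0,0], [1,0,0,0,0],
    [0,1,0,0,0], [1,0,1,1,0], [0,1,1,0,0], [0,0,1,1,1], [0,1,1,0,0],
], dtype=float)

# Hand-picked clusters (an assumption): Documents 1-4 (Google/matrix/rank)
# and Document 5 (football).
clusters = [[0, 1, 2, 3], [4]]

# Normalized (but not orthogonal) centroids, one column per cluster.
C = np.column_stack([A[:, idx].mean(axis=1) for idx in clusters])
C /= np.linalg.norm(C, axis=0)

# G_k minimizes ||A - C G_k||_F; lstsq solves the least squares problem column-wise.
G, *_ = np.linalg.lstsq(C, A, rcond=None)
print(np.round(G, 3))   # column j: coordinates of document j in the centroid basis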
