
Knowledge Discovery and Data Mining 1 (VO) (706.701)

Singular Value Decomposition and Latent Semantic Analysis

Denis Helic

ISDS, TU Graz

November 29, 2018

Denis Helic (ISDS, TU Graz) KDDM1 November 29, 2018 1 / 85


Big picture: KDDM

[Figure: the Knowledge Discovery Process: Preprocessing, Transformation, and Data Mining, built on Mathematical Tools (Linear Algebra, Probability Theory, Information Theory, Statistical Inference) and on Infrastructure (Hardware & Programming Model)]


Outline

1 Introduction
2 Singular Value Decomposition
3 SVD Example and Interpretation
4 Mathematics of SVD
5 Dimensionality Reduction with SVD
6 Advanced: SVD minimizes the approximation error
7 Latent Semantic Indexing

Slides
Slides are partially based on "Mining Massive Datasets" Chapter 11, "Introduction to Information Retrieval" by Manning, Raghavan and Schütze, and Melanie Martin's AI Seminar


Introduction

Recap

Review of Principal Component Analysis


Introduction

Recap — PCA: Algorithm

Organize data as an 𝑛 × 𝑚 matrix, with 𝑛 data points and 𝑚 features
Subtract the average for each feature to obtain the centered data matrix 𝐗
Calculate the sample covariance matrix 𝐒 = (1/𝑛)𝐗^𝑇𝐗
Calculate the eigenvalues and the eigenvectors of 𝐒
Select the top 𝑘 eigenvectors
Project the data to the new space spanned by those 𝑘 eigenvectors: 𝐗𝐐_𝑘 ∈ ℝ^(𝑛×𝑘), where 𝐐_𝑘 ∈ ℝ^(𝑚×𝑘)
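The steps above can be sketched in NumPy (a minimal illustration on synthetic data; the array names are ours, not part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 5))      # n = 100 data points, m = 5 features

X = data - data.mean(axis=0)          # center each feature
S = (X.T @ X) / X.shape[0]            # sample covariance S = (1/n) X^T X

eigvals, eigvecs = np.linalg.eigh(S)  # eigh, since S is symmetric
order = np.argsort(eigvals)[::-1]     # sort eigenvalues in decreasing order
Q = eigvecs[:, order]

k = 2                                 # keep the top-k eigenvectors
Z = X @ Q[:, :k]                      # projected data, shape (n, k)
```

The projected data Z is decorrelated: its sample covariance is (numerically) diagonal.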


Introduction

Recap — PCA: Interpretation

You can interpret the first couple of principal components to learn something about the dataset
Data mining
However, be very careful: you cannot generalize from a single dataset
PCA transforms the set of correlated observations into a set of linearly uncorrelated observations
I.e. the goal of the analysis is to decorrelate the data


Singular Value Decomposition

SVD

We investigate now a second form of matrix analysis called Singular Value Decomposition
It allows an exact representation of any matrix, i.e., the matrix need not be a square matrix
It decomposes a matrix into a product of three matrices
It also provides an elegant way of dimensionality reduction
By eliminating the less important parts of the data representation


Singular Value Decomposition

SVD — Interpretation

SVD is based on the idea that there exist a small number of hidden (latent) "concepts" that connect the rows and columns of the matrix
We can rank the "concepts" from the most to the least important
We use the ranking to remove the less important "concepts" from the matrix
Such that the new reduced matrix closely approximates the original matrix (according to a certain criterion)


Singular Value Decomposition

SVD: Definition

Singular Value Decomposition
Let 𝐌 ∈ ℝ^(𝑛×𝑚) be a matrix and let 𝑟 be the rank of 𝐌 (the rank of a matrix is the largest number of linearly independent rows or columns). Then we can find matrices 𝐔, 𝐕, and 𝚺 with the following properties:

𝐔 ∈ ℝ^(𝑛×𝑟) is a column-orthonormal matrix
𝐕 ∈ ℝ^(𝑚×𝑟) is a column-orthonormal matrix
𝚺 ∈ ℝ^(𝑟×𝑟) is a diagonal matrix

The matrix 𝐌 can then be written as:

𝐌 = 𝐔𝚺𝐕^𝑇
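These properties are easy to check numerically; a minimal sketch using NumPy's np.linalg.svd (which returns the diagonal of 𝚺 as a vector of singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(7, 5))                 # any n x m matrix works

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# U and V are column-orthonormal
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))

# exact representation: M = U Sigma V^T
assert np.allclose(M, U @ np.diag(s) @ Vt)
```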


Singular Value Decomposition

SVD form

Figure: the SVD form (from "Mining Massive Datasets")


SVD Example and Interpretation

SVD: Simple example

We will decompose a utility matrix of a movie recommender system
We have users who rate movies
Assume we have only two "concepts", which underlie the movies and steer the rating process
E.g. the concepts represent two movie genres: science fiction and romance
Users from group A rate only science fiction and users from group B only romance


SVD Example and Interpretation

SVD: Simple example

User   Matrix   Alien   Star Wars   Casablanca   Titanic
𝐴1     1        1       1           0            0
𝐴2     3        3       3           0            0
𝐴3     4        4       4           0            0
𝐴4     5        5       5           0            0
𝐵1     0        0       0           4            4
𝐵2     0        0       0           5            5
𝐵3     0        0       0           2            2


SVD Example and Interpretation

SVD: Simple example

Strict adherence to those two concepts gives the matrix a rank of 𝑟 = 2
We may pick one of the first four rows and one of the last three rows and we cannot find a nonzero linear combination that gives 𝟎
But we cannot pick three independent rows
If we pick rows 1, 2 and 7, then three times the first minus the second plus zero times the seventh gives 𝟎


SVD Example and Interpretation

SVD: Simple example

Similarly for columns
We may pick one of the first three and one of the last two and they will be independent
But we cannot pick three independent columns
If we pick columns 1, 2, and 5, then the first minus the second plus zero times the fifth gives 𝟎
Thus, the rank is indeed 𝑟 = 2 and 𝚺 ∈ ℝ^(2×2)
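The rank argument can be verified with a quick NumPy check (a sketch using the utility matrix above):

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

assert np.linalg.matrix_rank(M) == 2

# rows 1, 2, 7: three times the first minus the second plus zero times
# the seventh gives the zero vector
assert np.allclose(3 * M[0] - M[1] + 0 * M[6], 0)

# columns 1, 2, 5: the first minus the second plus zero times the fifth
assert np.allclose(M[:, 0] - M[:, 1] + 0 * M[:, 4], 0)
```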


SVD Example and Interpretation

SVD: Simple example

We will see later how to calculate the decomposition:

𝐔 =
  [ 0.14  0    ]
  [ 0.42  0    ]
  [ 0.56  0    ]
  [ 0.70  0    ]
  [ 0     0.60 ]
  [ 0     0.75 ]
  [ 0     0.30 ]


SVD Example and Interpretation

SVD: Interpretation

The key to understanding SVD is in viewing the 𝑟 columns of 𝐔, 𝚺, and 𝐕 as representing concepts that are hidden or latent in the original matrix 𝐌
In our example these concepts are clear
One is science fiction
The other one is romance


SVD Example and Interpretation

SVD: Interpretation

The rows of 𝐌 are people
The columns of 𝐌 are movies
Then the rows of 𝐔 are people
The columns of 𝐔 are concepts
𝐔 connects people to concepts


SVD Example and Interpretation

SVD: Interpretation

For example, the user 𝐴1 likes only science fiction:

𝐔 =
  [ 0.14  0    ]
  [ 0.42  0    ]
  [ 0.56  0    ]
  [ 0.70  0    ]
  [ 0     0.60 ]
  [ 0     0.75 ]
  [ 0     0.30 ]


SVD Example and Interpretation

SVD: Interpretation

The value of 0.14 in the first row and first column of 𝐔 indicates this fact
However, this value is smaller than some of the other values in the first column of 𝐔

𝐔 =
  [ 0.14  0    ]
  [ 0.42  0    ]
  [ 0.56  0    ]
  [ 0.70  0    ]
  [ 0     0.60 ]
  [ 0     0.75 ]
  [ 0     0.30 ]


SVD Example and Interpretation

SVD: Interpretation

While 𝐴1 watches only science fiction she does not rate these movies highly
𝐴1 contributes to the concept of science fiction, but not as much as e.g. 𝐴4, who rated these movies highly

𝐔 =
  [ 0.14  0    ]
  [ 0.42  0    ]
  [ 0.56  0    ]
  [ 0.70  0    ]
  [ 0     0.60 ]
  [ 0     0.75 ]
  [ 0     0.30 ]


SVD Example and Interpretation

SVD: Interpretation

On the other hand, the second column of the first row of 𝐔 is zero:

𝐔 =
  [ 0.14  0    ]
  [ 0.42  0    ]
  [ 0.56  0    ]
  [ 0.70  0    ]
  [ 0     0.60 ]
  [ 0     0.75 ]
  [ 0     0.30 ]


SVD Example and Interpretation

SVD: Interpretation

𝐴1 does not rate romance movies at all and does not contribute to that concept:

𝐔 =
  [ 0.14  0    ]
  [ 0.42  0    ]
  [ 0.56  0    ]
  [ 0.70  0    ]
  [ 0     0.60 ]
  [ 0     0.75 ]
  [ 0     0.30 ]


SVD Example and Interpretation

SVD: Simple example

𝐕^𝑇 =
  [ 0.58  0.58  0.58  0     0    ]
  [ 0     0     0     0.71  0.71 ]


SVD Example and Interpretation

SVD: Interpretation

The rows of 𝐌 are people
The columns of 𝐌 are movies
Then the rows of 𝐕^𝑇 are concepts
The columns of 𝐕^𝑇 are movies
𝐕 connects movies to concepts


SVD Example and Interpretation

SVD: Interpretation

For example, the 0.58 in the first three columns of the first row of 𝐕^𝑇 indicates that the first three movies are of the science fiction genre
Matrix, Alien and Star Wars

𝐕^𝑇 =
  [ 0.58  0.58  0.58  0     0    ]
  [ 0     0     0     0.71  0.71 ]


SVD Example and Interpretation

SVD: Interpretation

On the other hand, the last two movies have nothing to do with science fiction
Casablanca and Titanic

𝐕^𝑇 =
  [ 0.58  0.58  0.58  0     0    ]
  [ 0     0     0     0.71  0.71 ]


SVD Example and Interpretation

SVD: Interpretation

Also, Matrix, Alien and Star Wars do not partake of the concept of romance at all
As indicated by the 0's in the first three columns of the second row

𝐕^𝑇 =
  [ 0.58  0.58  0.58  0     0    ]
  [ 0     0     0     0.71  0.71 ]


SVD Example and Interpretation

SVD: Interpretation

Whereas Casablanca and Titanic are romance movies
As indicated by the 0.71 in the last two columns of the second row

𝐕^𝑇 =
  [ 0.58  0.58  0.58  0     0    ]
  [ 0     0     0     0.71  0.71 ]


SVD Example and Interpretation

SVD: Simple example

𝚺 =
  [ 12.4  0   ]
  [ 0     9.5 ]


SVD Example and Interpretation

SVD: Interpretation

Finally, the matrix 𝚺 gives the strength of each concept
In our example the strength of science fiction is 12.4
The strength of romance is 9.5
Intuitively, science fiction is a stronger concept because we have more data on movies of that genre and more people who rate these movies


SVD Example and Interpretation

SVD: Simple example

  [ 1 1 1 0 0 ]   [ 0.14  0    ]
  [ 3 3 3 0 0 ]   [ 0.42  0    ]
  [ 4 4 4 0 0 ]   [ 0.56  0    ]   [ 12.4  0   ]   [ 0.58  0.58  0.58  0     0    ]
  [ 5 5 5 0 0 ] = [ 0.70  0    ] × [ 0     9.5 ] × [ 0     0     0     0.71  0.71 ]
  [ 0 0 0 4 4 ]   [ 0     0.60 ]
  [ 0 0 0 5 5 ]   [ 0     0.75 ]
  [ 0 0 0 2 2 ]   [ 0     0.30 ]
        𝐌               𝐔               𝚺                       𝐕^𝑇
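The decomposition can be reproduced numerically (a sketch; np.linalg.svd may flip the signs of singular vectors, so we compare the singular values and the reconstruction rather than individual entries):

```python
import numpy as np

M = np.array([[1, 1, 1, 0, 0],
              [3, 3, 3, 0, 0],
              [4, 4, 4, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 4, 4],
              [0, 0, 0, 5, 5],
              [0, 0, 0, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# only the first r = 2 singular values are non-zero: 12.4 and 9.5
assert np.allclose(np.round(s, 1), [12.4, 9.5, 0.0, 0.0, 0.0])

# the two non-zero concepts reconstruct M exactly
r = 2
assert np.allclose(M, U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :])
```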


SVD Example and Interpretation

SVD: Interpretation

In general, the concepts will not be so clearly composed
There will be fewer 0's in 𝐔 and 𝐕
𝚺 is always diagonal
Typically the entities represented by the rows and columns of 𝐌 will contribute to several different concepts to varying degrees


SVD Example and Interpretation

SVD: Interpretation

In our simple example the rank of the matrix 𝐌 was equal to the number of concepts
In practice that is rarely the case
The rank 𝑟 is often greater than the number of the concepts and some of the columns in 𝐔 and 𝐕 are harder to interpret


SVD Example and Interpretation

SVD: Another example

User   Matrix   Alien   Star Wars   Casablanca   Titanic
𝐴1     1        1       1           0            0
𝐴2     3        3       3           0            0
𝐴3     4        4       4           0            0
𝐴4     5        5       5           0            0
𝐵1     0        2       0           4            4
𝐵2     0        0       0           5            5
𝐵3     0        1       0           2            2


SVD Example and Interpretation

SVD: Another example

In this (more realistic) example 𝐵1 and 𝐵3 also rated "Alien"
Neither liked it much, but nevertheless they rated it
This gives the matrix a rank of 3
E.g. we may pick the first, sixth, and seventh rows and check that they are independent
However, no four rows are independent


SVD Example and Interpretation

SVD: Another example

Now, in our decomposition we have 𝑟 = 3
We will have three columns in 𝐔, 𝐕, and 𝚺
The first column corresponds to science fiction
The second column corresponds to romance
The interpretation of the third column is not easy (it is a linear combination of the users)
Nice property: the third column is the least important one


SVD Example and Interpretation

SVD: Another example

𝐔 =
  [ 0.13   0.02  −0.01 ]
  [ 0.41   0.07  −0.03 ]
  [ 0.55   0.09  −0.04 ]
  [ 0.68   0.11  −0.05 ]
  [ 0.15  −0.59   0.65 ]
  [ 0.07  −0.73  −0.67 ]
  [ 0.07  −0.29   0.32 ]


SVD Example and Interpretation

SVD: Another example

𝐕^𝑇 =
  [ 0.56   0.59  0.56   0.09   0.09 ]
  [ 0.12  −0.02  0.12  −0.69  −0.69 ]
  [ 0.40  −0.80  0.40   0.09   0.09 ]


SVD Example and Interpretation

SVD: Another example

𝚺 =
  [ 12.4  0    0   ]
  [ 0     9.5  0   ]
  [ 0     0    1.3 ]


SVD Example and Interpretation

SVD: Another example

  [ 1 1 1 0 0 ]   [ 0.13   0.02  −0.01 ]
  [ 3 3 3 0 0 ]   [ 0.41   0.07  −0.03 ]   [ 12.4  0    0   ]   [ 0.56   0.59  0.56   0.09   0.09 ]
  [ 4 4 4 0 0 ] = [ 0.55   0.09  −0.04 ] × [ 0     9.5  0   ] × [ 0.12  −0.02  0.12  −0.69  −0.69 ]
  [ 5 5 5 0 0 ]   [ 0.68   0.11  −0.05 ]   [ 0     0    1.3 ]   [ 0.40  −0.80  0.40   0.09   0.09 ]
  [ 0 2 0 4 4 ]   [ 0.15  −0.59   0.65 ]
  [ 0 0 0 5 5 ]   [ 0.07  −0.73  −0.67 ]
  [ 0 1 0 2 2 ]   [ 0.07  −0.29   0.32 ]
        𝐌                   𝐔                     𝚺                           𝐕^𝑇


Mathematics of SVD

Matrix diagonalization theorem

Theorem
Let 𝐒 ∈ ℝ^(𝑛×𝑛) be a square matrix with 𝑛 linearly independent eigenvectors. Then there exists an eigendecomposition

𝐒 = 𝐔𝚲𝐔^(−1)

where the columns of 𝐔 are the eigenvectors of 𝐒 and 𝚲 is a diagonal matrix whose diagonal entries are the eigenvalues of 𝐒 in decreasing order:

𝚲 = diag(𝜆1, 𝜆2, …, 𝜆𝑛),  𝜆𝑖 ≥ 𝜆𝑖+1

If the eigenvalues are distinct, then this decomposition is unique.
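A minimal numerical illustration of the theorem (a sketch; a generic random matrix is almost surely diagonalizable, though its eigenvalues may be complex):

```python
import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(4, 4))

eigvals, U = np.linalg.eig(S)    # columns of U are eigenvectors of S
Lam = np.diag(eigvals)

# S = U Lambda U^{-1}
assert np.allclose(S, U @ Lam @ np.linalg.inv(U))
```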


Mathematics of SVD

Matrix diagonalization theorem

How does this theorem work?
𝐔 has the eigenvectors of 𝐒 as its columns

𝐔 = (𝐮1 𝐮2 … 𝐮𝑛)


Mathematics of SVD

Matrix diagonalization theorem

Then we have

𝐒𝐔 = 𝐒 × (𝐮1 𝐮2 … 𝐮𝑛)
    = (𝜆1𝐮1 𝜆2𝐮2 … 𝜆𝑛𝐮𝑛)
    = (𝐮1 𝐮2 … 𝐮𝑛) diag(𝜆1, 𝜆2, …, 𝜆𝑛)
    = 𝐔𝚲


Mathematics of SVD

Matrix diagonalization theorem

Thus we have:

𝐒𝐔 = 𝐔𝚲
𝐒 = 𝐔𝚲𝐔^(−1)


Mathematics of SVD

Symmetric diagonalization theorem

Theorem
Let 𝐒 ∈ ℝ^(𝑛×𝑛) be a square symmetric matrix with 𝑛 linearly independent eigenvectors. Then there exists a symmetric diagonal decomposition

𝐒 = 𝐐𝚲𝐐^(−1)

where the columns of 𝐐 are the orthogonal and normalized eigenvectors of 𝐒 (i.e. 𝐐 is an orthonormal matrix) and 𝚲 is a diagonal matrix whose diagonal entries are the eigenvalues of 𝐒. Further, all entries of 𝚲 and of 𝐐 are real and we have 𝐐^(−1) = 𝐐^𝑇.


Mathematics of SVD

SVD

𝐌 = 𝐔𝚺𝐕^𝑇

Let us calculate 𝐌^𝑇:

𝐌^𝑇 = (𝐔𝚺𝐕^𝑇)^𝑇
    = (𝐕^𝑇)^𝑇 𝚺^𝑇 𝐔^𝑇
    = 𝐕𝚺^𝑇 𝐔^𝑇
    = 𝐕𝚺𝐔^𝑇

The last equality holds since 𝚺 is diagonal and thus 𝚺^𝑇 = 𝚺


Mathematics of SVD

SVD

𝐌 = 𝐔𝚺𝐕^𝑇
𝐌^𝑇 = 𝐕𝚺𝐔^𝑇

Let us calculate 𝐌𝐌^𝑇:

𝐌𝐌^𝑇 = 𝐔𝚺𝐕^𝑇 𝐕𝚺𝐔^𝑇
      = 𝐔𝚺²𝐔^𝑇

(using 𝐕^𝑇𝐕 = 𝐈, since 𝐕 is column-orthonormal)


Mathematics of SVD

SVD

𝐌𝐌^𝑇 = 𝐔𝚺²𝐔^𝑇

Thus, we have 𝐌𝐌^𝑇 = 𝐒 and 𝚺² = 𝚲
That is, 𝐔 is the matrix of eigenvectors of 𝐌𝐌^𝑇
𝚺 = 𝚲^(1/2) is the matrix of square roots of the eigenvalues of 𝐌𝐌^𝑇
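This relationship can be checked numerically (a sketch comparing the singular values of 𝐌 with the square roots of the eigenvalues of 𝐌𝐌^𝑇):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.normal(size=(5, 3))

_, s, _ = np.linalg.svd(M)                    # 3 singular values

# eigenvalues of M M^T (5 x 5): the squared singular values plus zeros
eigvals = np.linalg.eigvalsh(M @ M.T)[::-1]   # decreasing order
assert np.allclose(np.sqrt(np.clip(eigvals[:3], 0, None)), s)
assert np.allclose(eigvals[3:], 0)
```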


Mathematics of SVD

SVD

𝐌 = 𝐔𝚺𝐕^𝑇
𝐌^𝑇 = 𝐕𝚺𝐔^𝑇

Let us calculate 𝐌^𝑇𝐌:

𝐌^𝑇𝐌 = 𝐕𝚺𝐔^𝑇 𝐔𝚺𝐕^𝑇
      = 𝐕𝚺²𝐕^𝑇

(using 𝐔^𝑇𝐔 = 𝐈, since 𝐔 is column-orthonormal)


Mathematics of SVD

SVD

𝐌^𝑇𝐌 = 𝐕𝚺²𝐕^𝑇

Thus, we have 𝐌^𝑇𝐌 = 𝐒 and 𝚺² = 𝚲
That is, 𝐕 is the matrix of eigenvectors of 𝐌^𝑇𝐌
𝚺 = 𝚲^(1/2) is the matrix of square roots of the eigenvalues of 𝐌^𝑇𝐌


Mathematics of SVD

SVD

What is the relationship between the eigenvalues of 𝐌𝐌^𝑇 and 𝐌^𝑇𝐌?
Suppose 𝐮 is an eigenvector of 𝐌𝐌^𝑇:

𝐌𝐌^𝑇𝐮 = 𝜆𝐮


Mathematics of SVD

SVD

We multiply both sides of the equation by 𝐌^𝑇 on the left:

𝐌^𝑇𝐌𝐌^𝑇𝐮 = 𝐌^𝑇𝜆𝐮
𝐌^𝑇𝐌(𝐌^𝑇𝐮) = 𝜆(𝐌^𝑇𝐮)

As long as 𝐌^𝑇𝐮 is not the zero vector 𝟎 it will be an eigenvector of 𝐌^𝑇𝐌, i.e. it is one of the 𝐯 vectors


Mathematics of SVD

SVD

The converse holds as well
Suppose 𝐯 is an eigenvector of 𝐌^𝑇𝐌:

𝐌^𝑇𝐌𝐯 = 𝜆𝐯


Mathematics of SVD

SVD

We multiply both sides of the equation by 𝐌 on the left:

𝐌𝐌^𝑇𝐌𝐯 = 𝐌𝜆𝐯
𝐌𝐌^𝑇(𝐌𝐯) = 𝜆(𝐌𝐯)

As long as 𝐌𝐯 is not the zero vector 𝟎 it will be an eigenvector of 𝐌𝐌^𝑇, i.e. it is one of the 𝐮 vectors


Mathematics of SVD

SVD

What happens when e.g. 𝐌𝐯 = 𝟎?

𝐌^𝑇𝐌𝐯 = 𝟎
𝐌^𝑇𝐌𝐯 = 𝜆𝐯
𝜆𝐯 = 𝟎

Since 𝐯 is not 𝟎 it must be that 𝜆 = 0


Mathematics of SVD

SVD

If the dimension of 𝐌^𝑇𝐌 (𝑚) is less than the dimension of 𝐌𝐌^𝑇 (𝑛): the eigenvalues of 𝐌𝐌^𝑇 are the eigenvalues of 𝐌^𝑇𝐌 plus additional zeros.
If the dimension of 𝐌^𝑇𝐌 (𝑚) is greater than the dimension of 𝐌𝐌^𝑇 (𝑛): the eigenvalues of 𝐌^𝑇𝐌 are the eigenvalues of 𝐌𝐌^𝑇 plus additional zeros.
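A quick numeric check of this relationship; the small matrix is an illustrative assumption:

```python
import numpy as np

# Example matrix with n = 3 rows and m = 2 columns (an illustrative assumption)
M = np.array([[1.0, 1.0],
              [0.0, 1.0],
              [1.0, 0.0]])

ev_small = np.sort(np.linalg.eigvalsh(M.T @ M))   # m = 2 eigenvalues
ev_big   = np.sort(np.linalg.eigvalsh(M @ M.T))   # n = 3 eigenvalues

# The larger matrix has the same non-zero eigenvalues plus extra zeros
assert np.allclose(ev_big[-2:], ev_small)
assert abs(ev_big[0]) < 1e-9
```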

𝐔 is the matrix of eigenvectors of 𝐌𝐌^𝑇.
𝚺 is the matrix of the square roots of the non-zero eigenvalues of 𝐌𝐌^𝑇 and 𝐌^𝑇𝐌 (their eigenvalues are equal, up to additional zeros).
𝐕 is the matrix of eigenvectors of 𝐌^𝑇𝐌.
This also gives an algorithm for calculating the decomposition: compute the eigenvalues and eigenvectors of 𝐌^𝑇𝐌 and 𝐌𝐌^𝑇.
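This eigenvalue route can be cross-checked against a library SVD on the ratings matrix of the running example; `np.linalg.svd` is used here only for comparison:

```python
import numpy as np

# The user-movie ratings matrix from the running example
M = np.array([[1,1,1,0,0],
              [3,3,3,0,0],
              [4,4,4,0,0],
              [5,5,5,0,0],
              [0,2,0,4,4],
              [0,0,0,5,5],
              [0,1,0,2,2]], dtype=float)

# Squared singular values equal the eigenvalues of M^T M (sorted descending)
sigma = np.linalg.svd(M, compute_uv=False)
lam = np.sort(np.linalg.eigvalsh(M.T @ M))[::-1]

# clip guards against tiny negative round-off in the zero eigenvalues
assert np.allclose(sigma**2, np.clip(lam, 0, None), atol=1e-6)
```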

Mathematics of SVD

SVD Interpretation

What does 𝐌𝐌^𝑇 represent?
It is a square matrix with rows and columns corresponding to users.
Each element measures the overlap between two people based on their co-ratings of the movies: it is the sum of the products of their movie ratings.
It is a co-occurrence matrix.

𝐌 =
[ 1 1 1 0 0 ]
[ 3 3 3 0 0 ]
[ 4 4 4 0 0 ]
[ 5 5 5 0 0 ]
[ 0 2 0 4 4 ]
[ 0 0 0 5 5 ]
[ 0 1 0 2 2 ]

𝐌𝐌^𝑇 =
[  3   9  12  15   2   0   1 ]
[  9  27  36  45   6   0   3 ]
[ 12  36  48  60   8   0   4 ]
[ 15  45  60  75  10   0   5 ]
[  2   6   8  10  36  40  18 ]
[  0   0   0   0  40  50  20 ]
[  1   3   4   5  18  20   9 ]
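The co-occurrence entries can be reproduced directly; a minimal pure-Python sketch for the ratings example:

```python
# Computing the user-user co-occurrence matrix M M^T for the ratings example
M = [[1,1,1,0,0],
     [3,3,3,0,0],
     [4,4,4,0,0],
     [5,5,5,0,0],
     [0,2,0,4,4],
     [0,0,0,5,5],
     [0,1,0,2,2]]

def mmT(M):
    # Entry (i, j) is the dot product of user i's and user j's rating rows
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in M] for ri in M]

C = mmT(M)
assert C[0][1] == 9     # users 1 and 2 overlap on the first three movies
assert C[4][5] == 40    # users 5 and 6 overlap on the last two movies
assert C[6][6] == 9     # diagonal: sum of squared ratings of user 7
```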

What does 𝐌^𝑇𝐌 represent?
It is a square matrix with rows and columns corresponding to movies.
Each element measures the overlap between two movies based on their co-ratings by users: it is the sum of the products of the ratings that they got from different people.
It is a co-occurrence matrix.

𝐌 =
[ 1 1 1 0 0 ]
[ 3 3 3 0 0 ]
[ 4 4 4 0 0 ]
[ 5 5 5 0 0 ]
[ 0 2 0 4 4 ]
[ 0 0 0 5 5 ]
[ 0 1 0 2 2 ]

𝐌^𝑇𝐌 =
[ 51  51  51   0   0 ]
[ 51  56  51  10  10 ]
[ 51  51  51   0   0 ]
[  0  10   0  45  45 ]
[  0  10   0  45  45 ]

Dimensionality Reduction with SVD

SVD dimensionality reduction

How do we use SVD for dimensionality reduction?
We want to represent a very large matrix 𝐌 by its SVD components 𝐔, 𝚺, and 𝐕.
We interpreted the entries in 𝚺 as measures of the importance of concepts.
We might set the 𝑟 − 𝑘 smallest entries in 𝚺 to zero.
With this we eliminate the corresponding 𝑟 − 𝑘 columns of 𝐔 and 𝐕.

[ 1 1 1 0 0 ]       [ 0.13   0.02  −0.01 ]
[ 3 3 3 0 0 ]       [ 0.41   0.07  −0.03 ]
[ 4 4 4 0 0 ]       [ 0.55   0.09  −0.04 ]     [ 12.4  0    0   ]     [ 0.56   0.59  0.56   0.09   0.09 ]
[ 5 5 5 0 0 ]   =   [ 0.68   0.11  −0.05 ]  ×  [ 0     9.5  0   ]  ×  [ 0.12  −0.02  0.12  −0.69  −0.69 ]
[ 0 2 0 4 4 ]       [ 0.15  −0.59   0.65 ]     [ 0     0    1.3 ]     [ 0.40  −0.80  0.40   0.09   0.09 ]
[ 0 0 0 5 5 ]       [ 0.07  −0.73  −0.67 ]
[ 0 1 0 2 2 ]       [ 0.07  −0.29   0.32 ]

Now we set the smallest value in 𝚺 to 0 and eliminate the corresponding column of 𝐔 and row of 𝐕^𝑇:

[ 0.13   0.02 ]
[ 0.41   0.07 ]
[ 0.55   0.09 ]     [ 12.4  0   ]     [ 0.56   0.59  0.56   0.09   0.09 ]
[ 0.68   0.11 ]  ×  [ 0     9.5 ]  ×  [ 0.12  −0.02  0.12  −0.69  −0.69 ]
[ 0.15  −0.59 ]
[ 0.07  −0.73 ]
[ 0.07  −0.29 ]

[ 0.13   0.02 ]
[ 0.41   0.07 ]
[ 0.55   0.09 ]     [ 12.4  0   ]     [ 0.56   0.59  0.56   0.09   0.09 ]
[ 0.68   0.11 ]  ×  [ 0     9.5 ]  ×  [ 0.12  −0.02  0.12  −0.69  −0.69 ]  =
[ 0.15  −0.59 ]
[ 0.07  −0.73 ]
[ 0.07  −0.29 ]

[ 0.93  0.95  0.93  0.014  0.014 ]
[ 2.93  2.99  2.93  0.000  0.000 ]
[ 3.92  4.01  3.92  0.026  0.026 ]
[ 4.84  4.96  4.84  0.040  0.040 ]
[ 0.37  1.21  0.37  4.04   4.04  ]
[ 0.35  0.65  0.35  4.87   4.87  ]
[ 0.16  0.57  0.16  1.98   1.98  ]

Advanced: SVD minimizes the approximation error

Advanced: SVD dimensionality reduction

The resulting matrix is quite close to the original matrix.
Thus, this approach to dimensionality reduction seems to work quite well.
However, since we approximate 𝐌, we need to measure the approximation error.
We can pick among several measures for this error.
For the SVD we pick the Frobenius norm.

The Frobenius norm ||𝐌|| of a matrix 𝐌 is the square root of the sum of the squares of the elements of 𝐌:

||𝐌|| = √( ∑_{i=1}^{n} ∑_{j=1}^{m} m_{ij}² )
      = √tr(𝐌^𝑇𝐌)
      = √( ∑_{i=1}^{min(n,m)} 𝜆_i )
      = √( ∑_{i=1}^{min(n,m)} 𝜎_i² )
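These identities are easy to verify numerically on the example matrix:

```python
import numpy as np

M = np.array([[1,1,1,0,0],
              [3,3,3,0,0],
              [4,4,4,0,0],
              [5,5,5,0,0],
              [0,2,0,4,4],
              [0,0,0,5,5],
              [0,1,0,2,2]], dtype=float)

frob_direct = np.sqrt((M ** 2).sum())                               # entry-wise
frob_trace  = np.sqrt(np.trace(M.T @ M))                            # via trace
frob_sigma  = np.sqrt((np.linalg.svd(M, compute_uv=False)**2).sum())  # via sigmas

assert np.isclose(frob_direct, frob_trace)
assert np.isclose(frob_direct, frob_sigma)
assert np.isclose(frob_direct ** 2, 248.0)   # sum of the squared ratings
```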

Now suppose we want to approximate 𝐌 with a matrix 𝐌′ of rank 𝑘 < 𝑟 such that ||𝐌 − 𝐌′|| is minimal.
Thus, we minimize the Frobenius norm of the difference between the original matrix and its approximation.
The Eckart-Young theorem states that the solution to this problem is of the form (the proof is complicated):

𝐌′ = 𝐔𝚺′𝐕^𝑇

𝚺′ is the same matrix as 𝚺 except that it contains only the 𝑘 largest singular values; the 𝑟 − 𝑘 smallest values are replaced by zero.

𝐌 − 𝐌′ = 𝐔(𝚺 − 𝚺′)𝐕^𝑇

||𝐌 − 𝐌′|| = √( ∑_{i=1}^{min(n,m)} (𝜎_i − 𝜎′_i)² )

The singular values of this difference matrix are kept in 𝚺 − 𝚺′.
These differences are zero for the 𝑘 singular values that we choose to keep.
They are non-zero for the 𝑟 − 𝑘 singular values that we set to zero.
Thus, to minimize the Frobenius norm we should set the smallest values to zero.
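This identity is easy to confirm numerically for the running example with k = 2:

```python
import numpy as np

M = np.array([[1,1,1,0,0],
              [3,3,3,0,0],
              [4,4,4,0,0],
              [5,5,5,0,0],
              [0,2,0,4,4],
              [0,0,0,5,5],
              [0,1,0,2,2]], dtype=float)

U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
M2 = (U[:, :k] * s[:k]) @ Vt[:k]

# ||M - M'||_F equals the root of the sum of squares of the dropped sigmas
err = np.linalg.norm(M - M2)          # Frobenius norm by default for matrices
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```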

Advanced: SVD minimizes the approximation error

SVD dimensionality reduction

How many singular values should we keep?
A useful rule of thumb is to keep enough singular values to make up 90% of the energy in 𝚺.
We define energy as the sum of the squares of the singular values:

∑_{i=1}^{min(n,m)} 𝜎_i²

In the example the total energy is:

12.4² + 9.5² + 1.3² = 245.7

By removing the smallest singular value (12.4² + 9.5² = 244.01) we keep 99% of the energy.
By also removing the second smallest singular value (12.4² = 153.76) we would keep only 63% of the energy.
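The bookkeeping can be sketched directly from the rounded singular values on the slide:

```python
# Energy retained when truncating the singular values from the example
sigmas = [12.4, 9.5, 1.3]          # values as rounded on the slide
energy = sum(s ** 2 for s in sigmas)

keep_two = (12.4 ** 2 + 9.5 ** 2) / energy   # drop only sigma_3
keep_one = (12.4 ** 2) / energy              # drop sigma_2 and sigma_3

assert round(energy, 1) == 245.7
assert keep_two > 0.99             # dropping sigma_3 keeps ~99% of the energy
assert 0.62 < keep_one < 0.63      # dropping sigma_2 as well keeps only ~63%
```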

Latent Semantic Indexing

SVD of a term-document matrix

The Vector Space Model cannot cope with two classic problems arising in natural languages.
Synonymy: two words having the same meaning, e.g. "car" and "automobile".
Synonymous words get separate dimensions in the VSM.
The model would underestimate the similarity of a document containing both "car" and "automobile" to a query containing only "car".
Polysemy: one word having multiple meanings, e.g. "bank" may mean a financial institution or a river bank.

Another problem is the dimension of the term-document matrix.
In latent semantic analysis (LSA), or latent semantic indexing (LSI), we use SVD to create a low-rank approximation of the term-document matrix.
We select the 𝑘 largest singular values and create an approximation 𝐌_𝑘 to the original matrix.

With SVD we map each term/document to latent topics.
In practice, however, the interpretation is rather difficult.
By computing a low-rank approximation of the original term-document matrix, the SVD brings together terms with similar co-occurrences.
Retrieval quality may actually be improved by the approximation!
Projecting onto a smaller number of dimensions brings similar documents closer (signal-to-noise ratio).
Retrieval works by folding the query into the low-rank space:

𝐪_𝑘 = 𝚺⁻¹𝐔^𝑇𝐪
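A sketch of query fold-in, reading the formula with the rank-k truncated factors; the tiny term-document matrix and the query are illustrative assumptions:

```python
import numpy as np

# A tiny term-document matrix (terms x documents); an illustrative assumption
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

# Fold a query (a term-count vector) into the k-dimensional concept space:
# q_k = Sigma_k^-1 U_k^T q
q = np.array([1, 0, 0, 1], dtype=float)   # hypothetical query terms
qk = np.diag(1.0 / sk) @ Uk.T @ q

# The folded query is a k-dimensional vector in the same space as the documents
assert qk.shape == (k,)
```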

As we reduce 𝑘, recall improves.
A value of 𝑘 in the low hundreds tends to increase precision as well (this suggests that a suitable 𝑘 addresses some of the challenges of synonymy).
LSI works best in applications where there is little overlap between documents and the query.
LSI can be viewed as a soft clustering method.
Each concept is a cluster, and the value that a document has at that concept is its fractional membership in that concept.

Latent Semantic Indexing

LSA: Example

Technical memo titles from two different collections: the first about HCI, the second about graph theory.

Example from Melanie Martin's AI Seminar.

c1: Human machine interface for ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user perceived response time to error measurement

m1: The generation of random, binary, ordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey

Term       c1  c2  c3  c4  c5  m1  m2  m3  m4
human       1   0   0   1   0   0   0   0   0
interface   1   0   1   0   0   0   0   0   0
computer    1   1   0   0   0   0   0   0   0
user        0   1   1   0   1   0   0   0   0
system      0   1   1   2   0   0   0   0   0
response    0   1   0   0   1   0   0   0   0
time        0   1   0   0   1   0   0   0   0
EPS         0   0   1   1   0   0   0   0   0
survey      0   1   0   0   0   0   0   0   1
trees       0   0   0   0   0   1   1   1   0
graph       0   0   0   0   0   0   1   1   1
minors      0   0   0   0   0   0   0   1   1

We would expect that human is similar to user but not to minors in this context.
Correlation coefficient (covariance normalized to the interval [−1, 1]):

r(human, user) = −0.37796
r(human, minors) = −0.28571
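These correlations can be reproduced from the raw term rows of the term-document matrix; a pure-Python sketch:

```python
# Pearson correlation between term rows of the raw term-document matrix
human  = [1, 0, 0, 1, 0, 0, 0, 0, 0]
user   = [0, 1, 1, 0, 1, 0, 0, 0, 0]
minors = [0, 0, 0, 0, 0, 0, 0, 1, 1]

def pearson(x, y):
    # Covariance of x and y normalized by the two standard deviations
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

assert round(pearson(human, user), 5) == -0.37796
assert round(pearson(human, minors), 5) == -0.28571
```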

               1      2      3      4      5      6      7      8      9
human      -0.22  -0.11   0.29  -0.41  -0.11  -0.34  -0.52   0.06   0.41
interface  -0.20  -0.07   0.14  -0.55   0.28   0.50   0.07   0.01   0.11
computer   -0.24   0.04  -0.16  -0.60  -0.11  -0.26   0.30  -0.06  -0.49
user       -0.40   0.06  -0.34   0.10   0.33   0.38  -0.00   0.00  -0.01
system     -0.64  -0.17   0.36   0.33  -0.16  -0.21   0.17  -0.03  -0.27
response   -0.27   0.11  -0.43   0.07   0.08  -0.17  -0.28   0.02   0.05
time       -0.27   0.11  -0.43   0.07   0.08  -0.17  -0.28   0.02   0.05
EPS        -0.30  -0.14   0.33   0.19   0.11   0.27  -0.03   0.02   0.17
survey     -0.21   0.27  -0.18  -0.03  -0.54   0.08   0.47   0.04   0.58
trees      -0.01   0.49   0.23   0.02   0.59  -0.39   0.29  -0.25   0.23
graph      -0.04   0.62   0.22   0.00  -0.07   0.11  -0.16   0.68  -0.23
minors     -0.03   0.45   0.14  -0.01  -0.30   0.28  -0.34  -0.68  -0.18

Table: 𝐔

𝚺 = diag(3.34, 2.54, 2.35, 1.64, 1.50, 1.31, 0.85, 0.56, 0.36)

Table: 𝚺

        1      2      3      4      5      6      7      8      9
c1  -0.20  -0.61  -0.46  -0.54  -0.28  -0.00  -0.01  -0.02  -0.08
c2  -0.06   0.17  -0.13  -0.23   0.11   0.19   0.44   0.62   0.53
c3   0.11  -0.50   0.21   0.57  -0.51   0.10   0.19   0.25   0.08
c4  -0.95  -0.03   0.04   0.27   0.15   0.02   0.02   0.01  -0.02
c5   0.05  -0.21   0.38  -0.21   0.33   0.39   0.35   0.15  -0.60
m1  -0.08  -0.26   0.72  -0.37   0.03  -0.30  -0.21   0.00   0.36
m2  -0.18   0.43   0.24  -0.26  -0.67   0.34   0.15  -0.25  -0.04
m3   0.01  -0.05  -0.01   0.02   0.06  -0.45   0.76  -0.45   0.07
m4   0.06  -0.24  -0.02   0.08   0.26   0.62  -0.02  -0.52   0.45

Table: 𝐕

Rank-2 approximation of the term-document matrix:

Term         c1    c2     c3     c4    c5     m1     m2     m3     m4
human      0.16  0.40   0.38   0.47  0.18  -0.05  -0.12  -0.16  -0.09
interface  0.14  0.37   0.33   0.40  0.16  -0.03  -0.07  -0.10  -0.04
computer   0.15  0.51   0.36   0.41  0.24   0.02   0.06   0.09   0.12
user       0.26  0.84   0.61   0.70  0.39   0.03   0.08   0.12   0.19
system     0.45  1.23   1.05   1.27  0.56  -0.07  -0.15  -0.21  -0.05
response   0.16  0.58   0.38   0.42  0.28   0.06   0.13   0.19   0.22
time       0.16  0.58   0.38   0.42  0.28   0.06   0.13   0.19   0.22
EPS        0.22  0.55   0.51   0.63  0.24  -0.07  -0.14  -0.20  -0.11
survey     0.10  0.53   0.23   0.21  0.27   0.14   0.31   0.44   0.42
trees     -0.06  0.23  -0.14  -0.27  0.14   0.24   0.55   0.77   0.66
graph     -0.06  0.34  -0.15  -0.30  0.20   0.31   0.69   0.98   0.85
minors    -0.04  0.25  -0.10  -0.21  0.15   0.22   0.50   0.71   0.62

r(human, user) = 0.9385
r(human, minors) = −0.8309

LSA brought together human and user through co-occurrences.
Also, the dissimilarity between human and minors is now stronger.
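The whole pipeline can be sketched with NumPy; the loose bounds in the assertions reflect the values reported above:

```python
import numpy as np

# The 12 x 9 term-document matrix from the example (terms in the order listed)
A = np.array([
    [1,0,0,1,0,0,0,0,0],  # human
    [1,0,1,0,0,0,0,0,0],  # interface
    [1,1,0,0,0,0,0,0,0],  # computer
    [0,1,1,0,1,0,0,0,0],  # user
    [0,1,1,2,0,0,0,0,0],  # system
    [0,1,0,0,1,0,0,0,0],  # response
    [0,1,0,0,1,0,0,0,0],  # time
    [0,0,1,1,0,0,0,0,0],  # EPS
    [0,1,0,0,0,0,0,0,1],  # survey
    [0,0,0,0,0,1,1,1,0],  # trees
    [0,0,0,0,0,0,1,1,1],  # graph
    [0,0,0,0,0,0,0,1,1],  # minors
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A2 = (U[:, :2] * s[:2]) @ Vt[:2]       # rank-2 LSA approximation

r = np.corrcoef(A2)                    # correlations between term rows
human, user, minors = 0, 3, 11
assert r[human, user] > 0.9            # reported above as 0.9385
assert r[human, minors] < -0.8         # reported above as -0.8309
```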
