Discriminative Feature Extraction and Dimension Reduction
- PCA, LDA and LSA
References: 1. Introduction to Machine Learning, Chapter 6; 2. Data Mining: Concepts, Models, Methods, and Algorithms, Chapter 3
Berlin Chen, Graduate Institute of Computer Science & Information Engineering
National Taiwan Normal University

Introduction (1/3)
• Goal: discover significant patterns or features from the input data
  - Salient feature selection or dimensionality reduction
  - Compute an input-output mapping based on some desirable properties
[Figure: a network W maps x in the input space to y in the feature space]

Introduction (2/3)
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Latent Semantic Analysis (LSA)
• Heteroscedastic Discriminant Analysis (HDA)

Introduction (3/3)
• Formulation for discriminative feature extraction
  - Model-free (nonparametric)
    • Without prior information, e.g., PCA
    • With prior information, e.g., LDA
  - Model-dependent (parametric)
    • E.g., PLSA (Probabilistic Latent Semantic Analysis) with EM (Expectation-Maximization), MCE (Minimum Classification Error) training

Principal Component Analysis (PCA) (1/2)
• Known as the Karhunen-Loève Transform (1947, 1963)
  - Or the Hotelling Transform (1933)
• A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
• A transform by which the data set can be represented by a reduced number of effective features while still retaining most of the intrinsic information content
  - A small set of features is to be found that represents the data samples accurately
• Also called "Subspace Decomposition" or "Factor Analysis"
Pearson, 1901

Principal Component Analysis (PCA) (2/2)
[Figure: the patterns show a significant difference from each other in one of the transformed axes]
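
PCA Derivations (1/13)
• Suppose $\mathbf{x}$ is an n-dimensional zero-mean random vector, $\boldsymbol{\mu}_x = E\{\mathbf{x}\} = \mathbf{0}$
  - If $\mathbf{x}$ is not zero mean, we can subtract the mean before carrying out the following analysis
  - $\mathbf{x}$ can be represented without error by the summation of n linearly independent vectors
$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\phi}_i = \boldsymbol{\Phi}\mathbf{y}, \quad \text{where } \mathbf{y} = [y_1 \; \dots \; y_n]^T, \quad \boldsymbol{\Phi} = [\boldsymbol{\phi}_1 \; \dots \; \boldsymbol{\phi}_n]$$
  - $\boldsymbol{\phi}_i$: the basis vectors; $y_i$: the i-th component in the feature (mapped) space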
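
PCA Derivations (2/13)
[Figure: the point (2, 3) expressed in different orthogonal basis sets: the standard basis {(1, 0), (0, 1)} and the basis {(1, 1), (-1, 1)}, in which its coordinates are (5/2, 1/2)]
$$\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{5}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{1}{2}\begin{bmatrix} -1 \\ 1 \end{bmatrix}$$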
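
The change of basis on this slide can be checked numerically. The following is a minimal sketch in Python (the function name `coordinates_in_basis` is made up for illustration and is not part of the lecture); it solves x = c1·b1 + c2·b2 for a 2-D vector with Cramer's rule.

```python
def coordinates_in_basis(x, b1, b2):
    """Solve x = c1*b1 + c2*b2 for (c1, c2) with Cramer's rule (2-D only)."""
    det = b1[0] * b2[1] - b2[0] * b1[1]
    if det == 0:
        raise ValueError("basis vectors are linearly dependent")
    c1 = (x[0] * b2[1] - b2[0] * x[1]) / det
    c2 = (b1[0] * x[1] - x[0] * b1[1]) / det
    return c1, c2


x = (2.0, 3.0)
standard = coordinates_in_basis(x, (1.0, 0.0), (0.0, 1.0))
rotated = coordinates_in_basis(x, (1.0, 1.0), (-1.0, 1.0))
print(standard)  # (2.0, 3.0)
print(rotated)   # (2.5, 0.5)
```

The same point has different coordinates in each basis; PCA asks which orthonormal basis makes a few coordinates carry most of the variance.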
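
PCA Derivations (3/13)
  - Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set
$$\boldsymbol{\phi}_i^T \boldsymbol{\phi}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$$
    • Such that $y_i$ is equal to the projection of $\mathbf{x}$ on $\boldsymbol{\phi}_i$
$$\forall i: \quad y_i = \boldsymbol{\phi}_i^T \mathbf{x} = \mathbf{x}^T \boldsymbol{\phi}_i$$
[Figure: the projections $y_1$ and $y_2$ of $\mathbf{x}$ onto $\boldsymbol{\phi}_1$ and $\boldsymbol{\phi}_2$, with $\theta$ the angle between $\mathbf{x}$ and $\boldsymbol{\phi}_1$]
$$y_1 = \boldsymbol{\phi}_1^T \mathbf{x} = \|\boldsymbol{\phi}_1\| \, \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \cos\theta, \quad \text{where } \|\boldsymbol{\phi}_1\| = 1$$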
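
PCA Derivations (4/13)
  - Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set
    • $y_i$ also has the following properties
      - Its mean is zero, too
$$E\{y_i\} = E\{\boldsymbol{\phi}_i^T \mathbf{x}\} = \boldsymbol{\phi}_i^T E\{\mathbf{x}\} = \boldsymbol{\phi}_i^T \mathbf{0} = 0$$
      - Its variance is
$$\sigma_i^2 = E\{y_i^2\} - \left(E\{y_i\}\right)^2 = E\{y_i^2\} = E\{\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_i\} = \boldsymbol{\phi}_i^T E\{\mathbf{x}\mathbf{x}^T\} \boldsymbol{\phi}_i = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_i$$
        where $\mathbf{R}$ is the (auto-)correlation matrix of $\mathbf{x}$
$$\mathbf{R} = E\{\mathbf{x}\mathbf{x}^T\} \approx \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \boldsymbol{\Sigma} = E\{(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\} = E\{\mathbf{x}\mathbf{x}^T\} - \boldsymbol{\mu}\boldsymbol{\mu}^T = \mathbf{R} \text{ when } \boldsymbol{\mu} = \mathbf{0}$$
    • The correlation between two projections $y_i$ and $y_j$ is
$$E\{y_i y_j\} = E\{(\boldsymbol{\phi}_i^T \mathbf{x})(\boldsymbol{\phi}_j^T \mathbf{x})^T\} = E\{\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_j\} = \boldsymbol{\phi}_i^T E\{\mathbf{x}\mathbf{x}^T\} \boldsymbol{\phi}_j = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_j$$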
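
PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  - We want to choose only m of the $\boldsymbol{\phi}_i$'s such that we can still approximate $\mathbf{x}$ well under the mean-squared error criterion
$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\phi}_i = \sum_{i=1}^{m} y_i \boldsymbol{\phi}_i + \sum_{j=m+1}^{n} y_j \boldsymbol{\phi}_j \quad \text{(original vector)}$$
$$\hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i \boldsymbol{\phi}_i \quad \text{(reconstructed vector)}$$
$$\varepsilon(m) = E\{\|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2\} = E\left\{\Big(\sum_{j=m+1}^{n} y_j \boldsymbol{\phi}_j\Big)^T \Big(\sum_{k=m+1}^{n} y_k \boldsymbol{\phi}_k\Big)\right\} = E\left\{\sum_{j=m+1}^{n}\sum_{k=m+1}^{n} y_j y_k \boldsymbol{\phi}_j^T \boldsymbol{\phi}_k\right\} = \sum_{j=m+1}^{n} E\{y_j^2\} = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j$$
  - The double sum collapses because
$$\boldsymbol{\phi}_j^T \boldsymbol{\phi}_k = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
  - and, since $E\{y_j\} = 0$,
$$\sigma_j^2 = E\{y_j^2\} - \left(E\{y_j\}\right)^2 = E\{y_j^2\}$$
  - We should discard the bases on which the projections have lower variances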
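
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  - If the orthonormal (basis) set $\boldsymbol{\phi}_i$'s is selected to be the eigenvectors of the correlation matrix $\mathbf{R}$, associated with eigenvalues $\lambda_i$'s
    • They will have the property that
$$\mathbf{R}\boldsymbol{\phi}_j = \lambda_j \boldsymbol{\phi}_j$$
  - Such that the mean-squared error mentioned above will be
$$\varepsilon(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \lambda_j \boldsymbol{\phi}_j = \sum_{j=m+1}^{n} \lambda_j$$
  - $\mathbf{R}$ is real and symmetric, therefore its eigenvectors form an orthonormal set
  - If $\mathbf{R}$ is positive definite ($\mathbf{x}^T \mathbf{R} \mathbf{x} > 0$ for all nonzero $\mathbf{x}$), all eigenvalues are positive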
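
The claim that the reconstruction error equals the sum of the discarded eigenvalues can be checked on a toy data set. The sketch below is an illustration only: the four 2-D samples and all function names are made up, and the eigen-decomposition uses the closed form for a symmetric 2×2 matrix rather than a general solver.

```python
import math


def correlation_matrix(samples):
    """R = (1/N) * sum_i x_i x_i^T for 2-D samples (assumed zero mean)."""
    n = len(samples)
    rxx = sum(x * x for x, _ in samples) / n
    rxy = sum(x * y for x, y in samples) / n
    ryy = sum(y * y for _, y in samples) / n
    return rxx, rxy, ryy


def eig_sym_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]], largest eigenvalue first (b != 0 assumed)."""
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        vx, vy = b, lam - a  # solves (R - lam*I) v = 0 when b != 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


def project(x, phi):
    """y = phi^T x, the component of x along the unit vector phi."""
    return x[0] * phi[0] + x[1] * phi[1]


# Hypothetical zero-mean samples.
samples = [(2.0, 1.0), (-2.0, -1.0), (1.0, 2.0), (-1.0, -2.0)]
(lam1, phi1), (lam2, phi2) = eig_sym_2x2(*correlation_matrix(samples))

# Keep m = 1 basis vector and measure the mean-squared reconstruction error.
mse = 0.0
for x in samples:
    y1 = project(x, phi1)
    x_hat = (y1 * phi1[0], y1 * phi1[1])
    mse += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
mse /= len(samples)

# Sample correlation between the two projections.
cross = sum(project(x, phi1) * project(x, phi2) for x in samples) / len(samples)

print(lam1, lam2)  # eigenvalues of R, largest first
print(mse)         # should match the discarded eigenvalue lam2
print(cross)       # should be (numerically) zero
```

With these samples R = [[2.5, 2], [2, 2.5]], so the retained direction is along (1, 1) and the error equals the smaller eigenvalue.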
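
PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  - If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be
$$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \ge \dots \ge \lambda_m \ge \dots \ge \lambda_n \ge 0$$
  - Any two projections $y_i$ and $y_j$ ($i \ne j$) will be mutually uncorrelated
$$E\{y_i y_j\} = E\{(\boldsymbol{\phi}_i^T \mathbf{x})(\boldsymbol{\phi}_j^T \mathbf{x})^T\} = E\{\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_j\} = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_j = \lambda_j \boldsymbol{\phi}_i^T \boldsymbol{\phi}_j = 0$$
    • Good news for most statistical modeling approaches
      - Gaussians and diagonal matrices, in place of the full covariance matrix
$$\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
$$\mathbf{R} = E\{\mathbf{x}\mathbf{x}^T\} \approx \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \boldsymbol{\Sigma} = E\{(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\} = E\{\mathbf{x}\mathbf{x}^T\} - \boldsymbol{\mu}\boldsymbol{\mu}^T$$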

PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
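
PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  - It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
  - Objective function: the error is to be minimized subject to the orthonormality constraints, with Lagrange multipliers $\mu_{jk}$ (taken symmetric, $\mu_{jk} = \mu_{kj}$). Define
$$J = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n} \mu_{jk}\left(\boldsymbol{\phi}_j^T \boldsymbol{\phi}_k - \delta_{jk}\right), \quad \text{where } \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
  - Take the derivative, using $\partial(\boldsymbol{\phi}^T \mathbf{R} \boldsymbol{\phi}) / \partial\boldsymbol{\phi} = 2\mathbf{R}\boldsymbol{\phi}$
$$\forall\, m+1 \le j \le n: \quad \frac{\partial J}{\partial \boldsymbol{\phi}_j} = 2\mathbf{R}\boldsymbol{\phi}_j - 2\sum_{k=m+1}^{n} \mu_{kj} \boldsymbol{\phi}_k = \mathbf{0}$$
$$\Rightarrow \mathbf{R}\boldsymbol{\phi}_j = \boldsymbol{\Phi}_{n-m}\boldsymbol{\mu}_j, \quad \text{where } \boldsymbol{\Phi}_{n-m} = [\boldsymbol{\phi}_{m+1} \; \dots \; \boldsymbol{\phi}_n], \quad \boldsymbol{\mu}_j = [\mu_{m+1,j} \; \dots \; \mu_{n,j}]^T$$
$$\Rightarrow \mathbf{R}\boldsymbol{\Phi}_{n-m} = \boldsymbol{\Phi}_{n-m}\mathbf{U}_{n-m}, \quad \text{where } \mathbf{U}_{n-m} = [\boldsymbol{\mu}_{m+1} \; \dots \; \boldsymbol{\mu}_n]$$
  - This has a particular solution if $\mathbf{U}_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \dots, \lambda_n$ of $\mathbf{R}$, and $\boldsymbol{\phi}_{m+1}, \dots, \boldsymbol{\phi}_n$ are their corresponding eigenvectors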
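
PCA Derivations (10/13)
• Given an input vector $\mathbf{x}$ with dimension n
  - Try to construct a linear transform $\boldsymbol{\Phi}'$ ($\boldsymbol{\Phi}'$ is an n × m matrix, m < n) such that the truncation result $\boldsymbol{\Phi}'^T \mathbf{x}$ is optimal under the mean-squared error criterion
  - Encoder: $\mathbf{x} = [x_1 \; x_2 \; \dots \; x_n]^T$ is mapped to $\mathbf{y} = [y_1 \; y_2 \; \dots \; y_m]^T$
$$\mathbf{y} = \boldsymbol{\Phi}'^T \mathbf{x}, \quad \text{where } \boldsymbol{\Phi}' = [\mathbf{e}_1 \; \mathbf{e}_2 \; \dots \; \mathbf{e}_m]$$
  - Decoder: $\mathbf{y}$ is mapped back to $\hat{\mathbf{x}} = [\hat{x}_1 \; \hat{x}_2 \; \dots \; \hat{x}_n]^T$
$$\hat{\mathbf{x}} = \boldsymbol{\Phi}' \mathbf{y}$$
  - Criterion
$$\text{minimize } E_{\mathbf{x}}\{(\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}})\}$$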
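
The encoder/decoder pair can be written in a few lines. This is a minimal sketch under stated assumptions: `encode` and `decode` are illustrative names, and the two basis vectors below are simply some orthonormal pair in 3-D chosen by hand, not eigenvectors estimated from data.

```python
import math


def encode(x, basis):
    """y = Phi'^T x: one inner product per retained basis vector."""
    return [sum(e_k * x_k for e_k, x_k in zip(e, x)) for e in basis]


def decode(y, basis):
    """x_hat = Phi' y: weighted sum of the retained basis vectors."""
    n = len(basis[0])
    return [sum(y_i * e[k] for y_i, e in zip(y, basis)) for k in range(n)]


# Hypothetical orthonormal basis with m = 2 vectors in n = 3 dimensions.
s = 1.0 / math.sqrt(2.0)
basis = [(s, s, 0.0), (0.0, 0.0, 1.0)]

x = (2.0, 0.0, 1.0)
y = encode(x, basis)       # m-dimensional code
x_hat = decode(y, basis)   # n-dimensional reconstruction
squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))

print(y)
print(x_hat)
print(squared_error)
```

The decoder can only return points in the span of the retained vectors, so whatever part of x lies outside that span shows up as squared error.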

PCA Derivations (11/13)
• Data compression in communication
  - PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
  - PCA needs no prior information (e.g., class distributions or output information) about the sample patterns

PCA Derivations (12/13)
• Scree graph
  - The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
$$\frac{\lambda_1 + \lambda_2 + \dots + \lambda_m}{\lambda_1 + \lambda_2 + \dots + \lambda_m + \dots + \lambda_n} \ge \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n} \lambda_i$$
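
Both selection rules are easy to state in code. The sketch below assumes the eigenvalues are already available as a list; the function names and the example eigenvalues are made up for illustration.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues reach `threshold` of the total variance."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(lams)


def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam >= mean)


# Hypothetical eigenvalue spectrum.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
m_ratio = choose_m_by_ratio(eigenvalues, 0.85)
m_average = choose_m_by_average(eigenvalues)
print(m_ratio)    # 3
print(m_average)  # 2
```

Here the cumulative ratios are 0.5, 0.75, 0.875, so a threshold of 0.85 keeps three eigenvectors, while the average eigenvalue (1.6) keeps two.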
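
PCA Derivations (13/13)
• PCA finds a linear transform $\mathbf{W}$ such that the sum of the average between-class variation and the average within-class variation is maximal
$$J(\mathbf{W}) = \tilde{\mathbf{S}} = \tilde{\mathbf{S}}_w + \tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_w \mathbf{W} + \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
$$\tilde{\mathbf{S}}_w = \mathbf{W}^T \mathbf{S}_w \mathbf{W}, \qquad \tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
$$\mathbf{S}_w = \frac{1}{N}\sum_{j} N_j \boldsymbol{\Sigma}_j, \qquad \mathbf{S}_b = \frac{1}{N}\sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T, \qquad \mathbf{S} = \frac{1}{N}\sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
  - i: sample index; j: class index
  - Try to show that $\mathbf{S} = \mathbf{S}_w + \mathbf{S}_b$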
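
One possible way to show $\mathbf{S} = \mathbf{S}_w + \mathbf{S}_b$ is sketched below. It assumes the notation used in the LDA derivation: $g(\mathbf{x}_i) = j$ means sample i belongs to class j, $N_j$ is the size of class j, $\bar{\mathbf{x}}_j$ its mean and $\boldsymbol{\Sigma}_j$ its covariance with a $1/N_j$ normalisation.

```latex
\begin{align*}
\mathbf{S}
 &= \frac{1}{N}\sum_{i}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T
  = \frac{1}{N}\sum_{j}\sum_{g(\mathbf{x}_i)=j}
    \big[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\big]
    \big[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\big]^T \\
 &= \frac{1}{N}\sum_{j}\sum_{g(\mathbf{x}_i)=j}
    (\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T
  + \frac{1}{N}\sum_{j} N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T \\
 &\quad + \frac{1}{N}\sum_{j}\Big[\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)\Big]
    (\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T
  + \frac{1}{N}\sum_{j}(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})
    \Big[\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)\Big]^T \\
 &= \frac{1}{N}\sum_{j} N_j\boldsymbol{\Sigma}_j + \mathbf{S}_b
  = \mathbf{S}_w + \mathbf{S}_b
\end{align*}
```

The two cross terms vanish because $\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j) = \mathbf{0}$ for every class j, by the definition of the class mean.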

PCA Examples: Data Analysis
• Example 1: principal components of some data points

PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions, the new feature dimensions, and the threshold for the information content reserved]

PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: a 256 × 256 image partitioned into 8 × 8 blocks]

PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure: coded images, labelled (value reduction) and (feature reduction)]

PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  - Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  - Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
$$\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}, \quad \mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix}, \quad \dots, \quad \mathbf{x}_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face

PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  - Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i,1}\mathbf{e}(1) + w_{i,2}\mathbf{e}(2) + \dots + w_{i,K}\mathbf{e}(K) \;\Rightarrow\; \mathbf{y}_i = [w_{i,1} \; w_{i,2} \; \dots \; w_{i,K}]$$
      where $\mathbf{y}_i$ is the feature vector of person i
  - The Eigenface approach follows the minimum mean-squared error criterion
  - Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces

PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]

PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]

PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  - Steps
    • Concatenate the relevant parameters for each speaker r to form a huge vector a(r) (a supervector)
      - SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Starting from the SI HMM, model training on Speaker 1 data through Speaker R data yields Speaker 1 HMM through Speaker R HMM; the resulting supervectors, each of dimension D = (M·n) × 1, are passed to Principal Component Analysis]
    • Each new speaker S is represented by a point P in K-space
$$\mathbf{P}_i = \mathbf{e}(0) + w_{i,1}\mathbf{e}(1) + w_{i,2}\mathbf{e}(2) + \dots + w_{i,K}\mathbf{e}(K)$$
[Figure label: SI HMM model]
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)

PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that Eigenface operates on the feature space, while eigenvoice operates on the model space

Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification

Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
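
LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes: $g(\mathbf{x}_i) = j$, $j \in \{1, 2, \dots, J\}$, where $g(\cdot)$ is the class index
  - The sample mean is
$$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$$
  - The class sample means are
$$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \mathbf{x}_i$$
  - The class sample covariances are
$$\boldsymbol{\Sigma}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$
  - The average within-class variation before the transform
$$\mathbf{S}_w = \frac{1}{N}\sum_{j} N_j \boldsymbol{\Sigma}_j$$
  - The average between-class variation before the transform
$$\mathbf{S}_b = \frac{1}{N}\sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$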
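
The definitions above translate directly into code. The following sketch uses plain Python lists; the helper names (`mean`, `scatter`, `within_between`) and the two small classes of 2-D points are hypothetical and chosen only to make the example self-contained. It also computes the total scatter S, so the identity S = S_w + S_b can be checked numerically.

```python
def mean(vectors):
    """Component-wise sample mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def scatter(vectors, center):
    """(1/len(vectors)) * sum (v - center)(v - center)^T as a nested list."""
    d, n = len(center), len(vectors)
    return [[sum((v[r] - center[r]) * (v[c] - center[c]) for v in vectors) / n
             for c in range(d)] for r in range(d)]


def within_between(classes):
    """classes: dict mapping a class label to its list of sample vectors."""
    all_x = [x for xs in classes.values() for x in xs]
    n_total, d = len(all_x), len(all_x[0])
    grand = mean(all_x)
    s_w = [[0.0] * d for _ in range(d)]
    s_b = [[0.0] * d for _ in range(d)]
    for xs in classes.values():
        m_j = mean(xs)
        cov_j = scatter(xs, m_j)      # class covariance Sigma_j
        weight = len(xs) / n_total    # N_j / N
        for r in range(d):
            for c in range(d):
                s_w[r][c] += weight * cov_j[r][c]
                s_b[r][c] += weight * (m_j[r] - grand[r]) * (m_j[c] - grand[c])
    return s_w, s_b, scatter(all_x, grand)


# Hypothetical two-class data.
classes = {
    "class 1": [(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)],
    "class 2": [(4.0, 4.0), (6.0, 4.0), (5.0, 7.0)],
}
s_w, s_b, s_total = within_between(classes)
print(s_w)
print(s_b)
print(s_total)
```

Adding `s_w` and `s_b` entry by entry should reproduce `s_total` up to rounding.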
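
LDA Derivations (2/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1 \; \mathbf{w}_2 \; \dots \; \mathbf{w}_m]$ is applied
  - The sample vectors will be
$$\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$$
  - The sample mean will be
$$\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \Big(\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\Big) = \mathbf{W}^T \bar{\mathbf{x}}$$
  - The class sample means will be
$$\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \bar{\mathbf{x}}_j$$
  - The average within-class variation will be
$$\tilde{\mathbf{S}}_w = \frac{1}{N}\sum_{j} N_j \left\{ \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \big(\mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j\big)\big(\mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j\big)^T \right\} = \mathbf{W}^T \left\{ \frac{1}{N}\sum_{j} N_j \boldsymbol{\Sigma}_j \right\} \mathbf{W} = \mathbf{W}^T \mathbf{S}_w \mathbf{W}$$
[Figure: the transform W maps x to y, $\mathbf{S}_w$ to $\tilde{\mathbf{S}}_w$ and $\mathbf{S}_b$ to $\tilde{\mathbf{S}}_b$]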
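
LDA Derivations (3/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1 \; \mathbf{w}_2 \; \dots \; \mathbf{w}_m]$ is applied
  - Similarly, the average between-class variation will be
$$\tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
  - Try to find the optimal $\mathbf{W}$ such that the following objective function is maximized
$$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|}$$
    • A closed-form solution: the column vectors $\mathbf{w}_i$'s of an optimal matrix $\mathbf{W}$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i$$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$
$$\mathbf{S}_w^{-1}\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$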
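
For two classes the generalized eigenproblem has a well-known shortcut: $\mathbf{S}_b$ has rank one, and the only direction with a nonzero eigenvalue is proportional to $\mathbf{S}_w^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$. The sketch below checks this on made-up numbers; the within-class matrix, the mean difference and the equal class proportions are all assumptions for illustration, and the helper names are hypothetical.

```python
def solve_2x2(m, rhs):
    """Solve m @ w = rhs for a 2x2 matrix m via the explicit inverse."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def matvec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(m_rk * v_k for m_rk, v_k in zip(row, v)) for row in m]


s_w = [[2.0, 1.0], [1.0, 3.0]]   # hypothetical within-class matrix S_w
diff = [2.0, 1.0]                # hypothetical mean difference x1_bar - x2_bar
p1 = p2 = 0.5                    # class proportions N_1/N and N_2/N

# For two classes, S_b = p1 * p2 * diff diff^T.
s_b = [[p1 * p2 * diff[r] * diff[c] for c in range(2)] for r in range(2)]

w = solve_2x2(s_w, diff)                                # w = S_w^{-1} diff
lam = p1 * p2 * sum(d_k * w_k for d_k, w_k in zip(diff, w))

lhs = matvec(s_b, w)                                    # S_b w
rhs = [lam * v for v in matvec(s_w, w)]                 # lambda S_w w
print(w, lam)
print(lhs, rhs)
```

With more than two classes this shortcut no longer applies and the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$ have to be computed with a general eigen-solver.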
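
LDA Derivations (4/4)
• Proof
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|} \quad (|\cdot| \text{ denotes the determinant})$$
  - Or, for each column vector $\mathbf{w}_i$ of $\mathbf{W}$, we want to find
$$\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}, \quad \text{with } \lambda_i = \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}$$
  - The quadratic form has an optimal solution where the derivative vanishes, using
$$\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}, \qquad \frac{d(\mathbf{x}^T \mathbf{C} \mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}$$
$$\frac{\partial}{\partial \mathbf{w}_i}\left(\frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}\right) = \mathbf{0}$$
$$\Rightarrow \frac{2\mathbf{S}_b \mathbf{w}_i \,(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - 2\mathbf{S}_w \mathbf{w}_i \,(\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i)}{(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i)^2} = \mathbf{0}$$
$$\Rightarrow \frac{\mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} - \frac{\mathbf{S}_w \mathbf{w}_i \,(\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i)}{(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i)^2} = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i - \lambda_i \mathbf{S}_w \mathbf{w}_i = \mathbf{0} \;\Rightarrow\; \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i \;\Rightarrow\; \mathbf{S}_w^{-1}\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$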

LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}_x = \frac{1}{N}\sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
[Figure: after the cosine transform, covariance matrix of the 18-cepstral vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}_y = \frac{1}{N}\sum_{i} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$

LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: after the PCA transform, covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99's 5471 files]
[Figure: after the LDA transform, covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99's 5471 files]
Character Error Rate
             TC      WG
    MFCC     26.32   22.71
    LDA-1    23.12   20.17
    LDA-2    23.11   20.11

PCA vs. LDA (1/2)
[Figure: the projection found by PCA compared with the projection found by LDA]

Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices

HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformed data, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed

HW-3: Feature Transformation (2/4)

HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];
    WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI*BE;
    % PCA
    A = BE+WI;          % why? (Prove it)
    [V,D] = eig(A);
    [V,D] = eigs(A,3);
    fid = fopen('Basis','w');
    for i = 1:3         % feature vector length
      for j = 1:3       % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
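
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; horizontal axis: feature 1 (about -10 to 4), vertical axis: feature 2 (about -2 to 6)]
[Figure: distribution of the 2000 original data samples after the LDA transformation; horizontal axis: feature 1 (about -0.9 to -0.2), vertical axis: feature 2 (about -1.2 to 0.2)]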

Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
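
LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: project X (n-dimensional, feature space) onto an orthonormal basis $\boldsymbol{\phi}_1, \dots, \boldsymbol{\phi}_k$ to obtain Y (k-dimensional)
$$y_i = \boldsymbol{\phi}_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \boldsymbol{\phi}_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$
  - SVD (in LSA): approximate A (m × n) by A' (m × n) of rank k, whose factors span the latent semantic space
$$A_{m \times n} = U_{m \times r}\,\Sigma_{r \times r}\,V^T_{r \times n}, \quad r \le \min(m, n)$$
$$A'_{m \times n} = U'_{m \times k}\,\Sigma'_{k \times k}\,V'^T_{k \times n}, \qquad \min \|A - A'\|_F^2 \text{ for a given } k$$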
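
LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: projection of a vector x onto $\boldsymbol{\phi}_1$ and $\boldsymbol{\phi}_2$, giving $y_1$ and $y_2$, with $\theta$ the angle between x and $\boldsymbol{\phi}_1$]
$$y_1 = \boldsymbol{\phi}_1^T \mathbf{x} = \|\boldsymbol{\phi}_1\| \, \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \cos\theta, \quad \text{where } \|\boldsymbol{\phi}_1\| = 1$$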

LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each represented by its terms, can be matched by literal term matching, after doc expansion or query expansion, or through structure models of the doc and of the query (latent semantic structure retrieval)]

LSA (5/7)

LSA (6/7)
[Figure: example with the query "human computer interaction", which contains an OOV (out-of-vocabulary) word]
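
LSA (7/7)
• Singular Value Decomposition (SVD)
  - Full decomposition of the m × n word-document matrix A (rows $w_1, w_2, \dots, w_m$; columns $d_1, d_2, \dots, d_n$)
$$A_{m \times n} = U_{m \times r}\,\Sigma_{r \times r}\,V^T_{r \times n}, \quad r \le \min(m, n)$$
  - Truncated decomposition
$$A'_{m \times n} = U'_{m \times k}\,\Sigma_{k \times k}\,V'^T_{k \times n}, \quad k \le r, \quad \|A\|_F^2 \ge \|A'\|_F^2$$
  - Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  - Frobenius norm
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$
  - Row A ⊆ $R^n$, Col A ⊆ $R^m$
  - Docs and queries are represented in a k-dimensional space. The quantities on the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$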
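
LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric n × n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$)
$$V_{n \times n} = [v_1 \; v_2 \; \dots \; v_n], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \dots, n$
  - As the square roots of the eigenvalues of $A^T A$
  - As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$
$$\sigma_1 = \|Av_1\|, \quad \sigma_2 = \|Av_2\|, \quad \dots$$
$$\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$$
  - For $\lambda_i \ne 0$, $i = 1, \dots, r$, $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A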
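
LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A
$$Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$$
  - Suppose that A (or $A^T A$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$$
  - Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col A
$$u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1 \; u_2 \; \dots \; u_r]\,\Sigma_r = A\,[v_1 \; v_2 \; \dots \; v_r]$$
    Here V (n × r) is an orthonormal matrix known in advance, and U (m × r) is also an orthonormal matrix
    • Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$
$$[u_1 \; \dots \; u_r \; \dots \; u_m]\,\Sigma = A\,[v_1 \; \dots \; v_r \; \dots \; v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \quad (VV^T = I_{n \times n})$$
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2$$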
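
The relation between singular values, the eigenvalues of $A^T A$ and the Frobenius norm can be verified on a small matrix. The sketch below uses a hypothetical 2 × 2 matrix and the closed-form eigenvalues of a symmetric 2 × 2 matrix; the function names are made up for illustration.

```python
import math


def gram(a):
    """A^T A for a matrix given as a list of rows."""
    cols = list(zip(*a))
    return [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]


def eigvals_sym_2x2(m):
    """Eigenvalues of a symmetric 2x2 matrix, largest first."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mid + rad, mid - rad


A = [[3.0, 0.0], [4.0, 5.0]]        # hypothetical matrix
lam1, lam2 = eigvals_sym_2x2(gram(A))
sigma1, sigma2 = math.sqrt(lam1), math.sqrt(lam2)
frob_sq = sum(v * v for row in A for v in row)

# Length of A v_1, where v_1 = (1, 1)/sqrt(2) is the leading eigenvector of A^T A.
v1 = (1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0))
av1 = [row[0] * v1[0] + row[1] * v1[1] for row in A]
len_av1 = math.hypot(av1[0], av1[1])

print(lam1, lam2)               # eigenvalues of A^T A
print(sigma1, sigma2, len_av1)  # singular values; len_av1 should equal sigma1
print(frob_sq)                  # should equal sigma1**2 + sigma2**2
```

For this matrix $A^T A$ = [[25, 20], [20, 25]], so the eigenvalues are 45 and 5 and their sum matches the squared Frobenius norm 9 + 16 + 25 = 50.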
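
LSA Derivations (3/7)
[Figure: A (m × n) maps $R^n$ to $R^m$; $V = (V_1 \; V_2)$ spans $R^n$ and $U = (U_1 \; U_2)$ spans $R^m$]
$$A = U\Sigma V^T = (U_1 \; U_2)\begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
  - The $v_i$'s (columns of $V_1$) span the row space of A
  - The $u_i$'s (columns of $U_1$) span the row space of $A^T$
  - The columns of $V_2$ satisfy $AX = 0$
  - $U\Sigma = AV$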

LSA Derivations (4/7)
• Additional Explanations
  - Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
$$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
    • the i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  - Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
$$A = U\Sigma V^T \;\Rightarrow\; A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$$
    • the i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U

LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), m × n, with rows $w_1, w_2, \dots, w_m$ and columns $d_1, d_2, \dots, d_n$
    • compare two terms ($w_i$, $w_j$) → dot product of two rows of A, or an entry in $AA^T$
    • compare two docs ($d_k$, $d_s$) → dot product of two columns of A, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of A
  - The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U'\Sigma'$
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
    • compare two docs → dot product of two rows of $V'\Sigma'$
$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • compare a query word and a doc → each individual entry of A'
  - $V'^T V' = U'^T U' = I_{k \times k}$; $\Sigma'$ is for stretching or shrinking

LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold in a new m × 1 query (or doc) vector
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
      - The query is represented by the weighted sum of its constituent term vectors
      - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
      - $\hat{q}$ is just like a row of V
  - Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
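
The fold-in and cosine formulas can be exercised on a tiny hand-built decomposition. In the sketch below the factors U, Σ and V are made up so that A = UΣVᵀ holds exactly; the function names `fold_in` and `lsa_cosine` are illustrative only. Folding in a column of A should return the corresponding row of V.

```python
import math


def fold_in(q, u_k, sigma_k):
    """q_hat = q^T U_k Sigma_k^{-1}; q has m entries, u_k is m x k, sigma_k has k entries."""
    k = len(sigma_k)
    return [sum(q[i] * u_k[i][j] for i in range(len(q))) / sigma_k[j]
            for j in range(k)]


def lsa_cosine(q_hat, d_hat, sigma_k):
    """Cosine between q_hat*Sigma and d_hat*Sigma (both given as row vectors)."""
    qs = [a * s for a, s in zip(q_hat, sigma_k)]
    ds = [b * s for b, s in zip(d_hat, sigma_k)]
    num = sum(a * b for a, b in zip(qs, ds))
    den = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return num / den


# Hypothetical rank-2 decomposition with m = 3 terms and n = 2 docs.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # m x k, orthonormal columns
sigma = [2.0, 1.0]                          # diagonal of Sigma_k
V = [[0.6, 0.8], [0.8, -0.6]]               # n x k, one row per doc

# Column 0 of A = U Sigma V^T, treated as a "new" document and folded in.
doc0 = [1.2, 0.8, 0.0]
q_hat = fold_in(doc0, U, sigma)
print(q_hat)                             # close to row 0 of V
print(lsa_cosine(q_hat, V[0], sigma))    # close to 1.0
print(lsa_cosine(q_hat, V[1], sigma))    # similarity to the other doc
```

Scaling both vectors by Σ before taking the cosine is what makes the comparison equivalent to comparing columns of A'.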

LSA Derivations (7/7)
• Fold in a new 1 × n term vector
$$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$

LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]

LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …

LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr

LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (rows: terms; columns: docs)
        2.3 0.0 4.2
        0.0 1.3 2.2
        3.8 0.0 0.5
        0.0 0.0 0.0
  - Each entry is weighted by its TF×IDF score
  - Sparse text representation of the matrix above
        4 3 6        (no. of rows/terms, no. of columns/docs, no. of nonzero entries)
        2            (2 nonzero entries at Col 0)
        0 2.3        (Col 0, Row 0)
        2 3.8        (Col 0, Row 2)
        1            (1 nonzero entry at Col 1)
        1 1.3        (Col 1, Row 1)
        3            (3 nonzero entries at Col 2)
        0 4.2        (Col 2, Row 0)
        1 2.2        (Col 2, Row 1)
        2 0.5        (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, ..., 600, etc.) of LSA dimensionality
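
Writing this sparse text layout by hand is error-prone, so a small converter may help. The sketch below is not part of SVDLIBC; `to_sparse_text` is a hypothetical helper that emits the column-by-column layout shown on the slide (header line, then for each column its nonzero count followed by "row value" pairs).

```python
def to_sparse_text(dense):
    """Serialise a dense matrix (list of rows) into the column-wise sparse text layout."""
    n_rows, n_cols = len(dense), len(dense[0])
    columns = []
    for c in range(n_cols):
        # Collect (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines)


term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
print(sparse_text)
```

For the 4 × 3 example the output starts with `4 3 6`, followed by the three column blocks listed on the slide.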

LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (sparse text format)
        51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of the sparse matrix input
• Output files: LSA100-Ut, LSA100-S, LSA100-Vt

LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut: 51253 words; each word vector ($u^T$) is 1 × 100
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
• LSA100-S: 100 eigenvalues
        100
        26861882994155959… (digits as extracted; the decimal points and line breaks of the individual values are lost)
• LSA100-Vt: 2265 docs; each doc vector ($v^T$) is 1 × 100
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
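
To use these output files, the matrices have to be read back in. The sketch below assumes the dense text layout suggested by the listings above, namely a header line "rows cols" followed by rows × cols whitespace-separated values; that layout is an assumption inferred from the slide, and `parse_dense_text` and `term_vector` are hypothetical helper names. It parses from a string so that it runs without the actual files.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def term_vector(ut, term_index):
    """Column `term_index` of U^T, i.e. the k-dimensional vector of one term."""
    return [row[term_index] for row in ut]


# Tiny stand-in for an LSA100-Ut style file: k = 2 dimensions, 3 terms.
sample = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.005\n"
ut = parse_dense_text(sample)
print(term_vector(ut, 0))  # [0.003, 0.002]
```

Because the file stores the transposed matrix, each term (or doc) vector is a column of the parsed array rather than a row.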

LSA Toolkit: SVDLIBC (5/5)
• Fold in a new m × 1 query vector (TF×IDF weighted beforehand)
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  - $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-2
Introduction (1/3)
• Goal: discover significant patterns or features from the input data
  – Salient feature selection or dimensionality reduction
  – Compute an input-output mapping based on some desirable properties
[Figure: a network with weights W maps x in the input space to y in the feature space]
ML-3
Introduction (2/3)
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Latent Semantic Analysis (LSA)
• Heteroscedastic Discriminant Analysis (HDA)
ML-4
Introduction (3/3)
• Formulation for discriminative feature extraction
  – Model-free (nonparametric)
    • Without prior information, e.g., PCA
    • With prior information, e.g., LDA
  – Model-dependent (parametric)
    • E.g., PLSA (Probabilistic Latent Semantic Analysis) with EM (Expectation-Maximization) or MCE (Minimum Classification Error) training
ML-5
Principal Component Analysis (PCA) (1/2)
• Known as the Karhunen-Loève Transform (1947, 1963)
  – Or the Hotelling Transform (1933)
• A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
• A transform by which the data set can be represented by a reduced number of effective features while still retaining most of the intrinsic information content
  – A small set of features is to be found that represents the data samples accurately
• Also called “Subspace Decomposition” or “Factor Analysis”
• Pearson, 1901
ML-6
Principal Component Analysis (PCA) (2/2)
[Figure] The patterns show a significant difference from each other along one of the transformed axes
ML-7
PCA Derivations (1/13)
• Suppose x is an n-dimensional zero-mean random vector, $\mu_x = E[x] = 0$
  – If x is not zero mean, we can subtract the mean before carrying out the following analysis
  – x can be represented without error by the summation of n linearly independent vectors
      $x = \sum_{i=1}^{n} y_i \phi_i = \Phi y$, where $y = [y_1 \; \cdots \; y_n]^T$ and $\Phi = [\phi_1 \; \cdots \; \phi_n]$
    • $\phi_i$: the basis vectors
    • $y_i$: the i-th component in the feature (mapped) space
ML-8
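PCA Derivations (2/13)
• Example: the point (2, 3) expressed in two orthogonal basis sets, {(1, 0), (0, 1)} and {(1, 1), (-1, 1)}
    $\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 2 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3 \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{5}{2} \begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{1}{2} \begin{bmatrix} -1 \\ 1 \end{bmatrix}$
  – In the second basis the coordinates of the point are (5/2, 1/2)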
ML-9
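PCA Derivations (3/13)
  – Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
      $\phi_i^T \phi_j = 1$ if $i = j$, and $0$ if $i \neq j$
    • Such that $y_i$ is equal to the projection of x on $\phi_i$
      $\forall i: \; y_i = \phi_i^T x = x^T \phi_i$
[Figure: x projected onto $\phi_1$ and $\phi_2$, giving $y_1$ and $y_2$; $\theta$ is the angle between x and $\phi_1$]
      $y_1 = \|x\| \cos\theta = \|x\| \frac{\phi_1^T x}{\|\phi_1\| \, \|x\|} = \phi_1^T x$, where $\|\phi_1\| = 1$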
ML-10
PCA Derivations (4/13)
  – Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
    • $y_i$ also has the following properties
      – Its mean is zero too
          $E[y_i] = E[\phi_i^T x] = \phi_i^T E[x] = \phi_i^T 0 = 0$
      – Its variance is
          $\sigma_i^2 = E[y_i^2] - (E[y_i])^2 = E[y_i^2] = E[\phi_i^T x x^T \phi_i] = \phi_i^T E[x x^T] \phi_i = \phi_i^T R \phi_i$
        where R is the (auto-)correlation matrix of x
          $R = E[x x^T] \approx \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T$
          $\Sigma = E[(x - \mu)(x - \mu)^T] = E[x x^T] - \mu \mu^T \approx \left( \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T \right) - \mu \mu^T$
    • The correlation between two projections $y_i$ and $y_j$ is
          $E[y_i y_j] = E[(\phi_i^T x)(\phi_j^T x)^T] = E[\phi_i^T x x^T \phi_j] = \phi_i^T E[x x^T] \phi_j = \phi_i^T R \phi_j$
ML-11
PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  – We want to choose only m of the $\phi_i$'s such that we can still approximate x well under the mean-squared error criterion
      original vector: $x = \sum_{i=1}^{n} y_i \phi_i = \sum_{i=1}^{m} y_i \phi_i + \sum_{j=m+1}^{n} y_j \phi_j$
      reconstructed vector: $\hat{x}(m) = \sum_{i=1}^{m} y_i \phi_i$
      $\varepsilon(m) = E\left[ \| x - \hat{x}(m) \|^2 \right] = E\left\{ \left( \sum_{j=m+1}^{n} y_j \phi_j \right)^T \left( \sum_{k=m+1}^{n} y_k \phi_k \right) \right\}$
      $\quad = E\left\{ \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} y_j y_k \phi_j^T \phi_k \right\} = \sum_{j=m+1}^{n} E[y_j^2] = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \phi_j^T R \phi_j$
    • Because $\phi_j^T \phi_k = 1$ if $j = k$, and $0$ if $j \neq k$
    • And $\sigma_j^2 = E[y_j^2] - (E[y_j])^2 = E[y_j^2]$, since $E[y_j] = 0$
  – We should discard the bases where the projections have lower variances
ML-12
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  – If the orthonormal (basis) set $\phi_i$'s is selected to be the eigenvectors of the correlation matrix R, associated with eigenvalues $\lambda_i$'s
    • They will have the property that $R \phi_j = \lambda_j \phi_j$
      – R is real and symmetric, therefore its eigenvectors form an orthonormal set
      – R is positive definite ($x^T R x > 0$) $\Rightarrow$ all eigenvalues are positive
  – Such that the mean-squared error mentioned above will be
      $\varepsilon(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \phi_j^T R \phi_j = \sum_{j=m+1}^{n} \phi_j^T \lambda_j \phi_j = \sum_{j=m+1}^{n} \lambda_j$
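The identity $\varepsilon(m) = \sum_{j=m+1}^{n} \lambda_j$ can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic zero-mean data set; the data, the seed, and the choice m = 2 are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 samples (rows) of a 4-dimensional vector x, made zero mean.
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X -= X.mean(axis=0)

R = X.T @ X / len(X)                  # sample estimate of R = E[x x^T]
lam, Phi = np.linalg.eigh(R)          # eigh: eigen-decomposition of a symmetric matrix
order = np.argsort(lam)[::-1]         # sort so that lambda_1 >= ... >= lambda_n
lam, Phi = lam[order], Phi[:, order]

m = 2                                 # number of retained basis vectors (assumed)
Y = X @ Phi[:, :m]                    # y_i = phi_i^T x for i = 1..m
X_hat = Y @ Phi[:, :m].T              # x_hat(m) = sum_{i<=m} y_i phi_i
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # empirical eps(m)
discarded = lam[m:].sum()             # sum of the discarded eigenvalues
print(mse, discarded)
```

On the sample itself the two printed numbers agree up to floating-point error, because R is estimated from the same samples used to measure the reconstruction error.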
ML-13
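PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  – If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be
      $\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j$, where $\lambda_1 \geq \cdots \geq \lambda_m \geq \cdots \geq \lambda_n \geq 0$
  – Any two projections $y_i$ and $y_j$ ($i \neq j$) will be mutually uncorrelated
      $E[y_i y_j] = E[(\phi_i^T x)(\phi_j^T x)^T] = E[\phi_i^T x x^T \phi_j] = \phi_i^T R \phi_j = \lambda_j \phi_i^T \phi_j = 0$
    • Good news for most statistical modeling approaches
      – Gaussians and diagonal matrices, instead of a full covariance matrix
          $\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$
  – Recall
      $R = E[x x^T] \approx \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T$
      $\Sigma = E[(x - \mu)(x - \mu)^T] = E[x x^T] - \mu \mu^T \approx \left( \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T \right) - \mu \mu^T$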
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
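PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
    • Define the objective function (the quantity to be minimized, minus the constraints weighted by multipliers $\mu_{jk}$)
        $J = \sum_{j=m+1}^{n} \phi_j^T R \phi_j - \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} (\phi_j^T \phi_k - \delta_{jk})$, where $\delta_{jk} = 1$ if $j = k$ and $0$ if $j \neq k$
    • Take the derivative, using $\frac{\partial (\phi^T R \phi)}{\partial \phi} = 2 R \phi$ and taking $\mu_{jk} = \mu_{kj}$
        $\forall \, m+1 \leq j \leq n: \quad \frac{\partial J}{\partial \phi_j} = 2 R \phi_j - 2 \sum_{k=m+1}^{n} \mu_{jk} \phi_k = 0$
        $\Rightarrow R \phi_j = \Phi_{n-m} \mu_j$, where $\Phi_{n-m} = [\phi_{m+1} \; \cdots \; \phi_n]$ and $\mu_j = [\mu_{j,m+1} \; \cdots \; \mu_{j,n}]^T$
        $\Rightarrow R \Phi_{n-m} = \Phi_{n-m} U_{n-m}$, where $U_{n-m} = [\mu_{m+1} \; \cdots \; \mu_n]$
        $\Rightarrow \Phi_{n-m}^T R \Phi_{n-m} = U_{n-m}$
    • There is a particular solution if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of R, and $\phi_{m+1}, \ldots, \phi_n$ are their corresponding eigenvectors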
ML-16
PCA Derivations (10/13)
• Given an input vector x with dimension n
  – Try to construct a linear transform $\Phi'$ ($\Phi'$ is an n×m matrix, m < n) such that the truncation result $\Phi'^T x$ is optimal under the mean-squared error criterion
      Encoder: $y = \Phi'^T x$, mapping $x = [x_1 \; x_2 \; \cdots \; x_n]^T$ to $y = [y_1 \; y_2 \; \cdots \; y_m]^T$
      Decoder: $\hat{x} = \Phi' y$, mapping $y = [y_1 \; y_2 \; \cdots \; y_m]^T$ to $\hat{x} = [\hat{x}_1 \; \hat{x}_2 \; \cdots \; \hat{x}_n]^T$
      where $\Phi' = [e_1 \; e_2 \; \cdots \; e_m]$
      minimize $E\left[ (x - \hat{x})^T (x - \hat{x}) \right]$
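The encoder/decoder pair can be sketched in a few lines. This is an illustrative sketch only, assuming NumPy; the function names `pca_fit`, `encode`, and `decode` are hypothetical, and the mean is subtracted explicitly because the slides assume zero-mean x.

```python
import numpy as np

def pca_fit(X, m):
    """Return the sample mean and the n x m matrix Phi' of the top-m eigenvectors."""
    mean = X.mean(axis=0)
    Xc = X - mean                          # make the data zero mean
    R = Xc.T @ Xc / len(Xc)                # sample correlation matrix of the centered data
    lam, vecs = np.linalg.eigh(R)          # eigenvalues come back in ascending order
    top = np.argsort(lam)[::-1][:m]        # indices of the m largest eigenvalues
    return mean, vecs[:, top]

def encode(x, mean, Phi_p):
    return Phi_p.T @ (x - mean)            # y = Phi'^T x  (m-dimensional)

def decode(y, mean, Phi_p):
    return Phi_p @ y + mean                # x_hat = Phi' y  (n-dimensional)

# Hypothetical data lying exactly on a line in 3-D, so m = 1 loses nothing.
t = np.linspace(-1.0, 1.0, 21)
X = np.outer(t, [1.0, 2.0, 2.0])
mean, Phi_p = pca_fit(X, m=1)
y = encode(X[3], mean, Phi_p)
x_hat = decode(y, mean, Phi_p)
print(y, x_hat, X[3])
```

For data that does not lie in an m-dimensional subspace, `x_hat` is only the minimum mean-squared-error approximation of x, not an exact copy.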
ML-17
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition
  – PCA needs no prior information (e.g., class distributions of output information) about the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
        $\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \geq \text{Threshold}$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
        $\lambda_m \geq \frac{1}{n} \sum_{i=1}^{n} \lambda_i$
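Both selection rules are easy to apply once the eigenvalues are known. A minimal sketch, assuming NumPy; the function names and the example eigenvalues are hypothetical.

```python
import numpy as np

def choose_m_by_ratio(eigvals, threshold):
    """Smallest m whose leading eigenvalues hold at least `threshold` of the total variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]   # lambda_1 >= ... >= lambda_n
    ratio = np.cumsum(lam) / lam.sum()                       # (lambda_1+...+lambda_m) / total
    m = int(np.searchsorted(ratio, threshold)) + 1           # first index with ratio >= threshold
    return min(m, len(lam))

def choose_m_by_average(eigvals):
    """Number of eigenvalues larger than the average eigenvalue."""
    lam = np.asarray(eigvals, dtype=float)
    return int(np.sum(lam > lam.mean()))

eigvals = [4.0, 3.0, 2.0, 1.0]            # hypothetical eigenvalues
m_ratio = choose_m_by_ratio(eigvals, 0.8)
m_avg = choose_m_by_average(eigvals)
print(m_ratio, m_avg)
```

The two rules generally give different values of m; the threshold rule is controlled by the chosen fraction, while the average rule has no free parameter.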
ML-19
PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
    $J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W$
    $\tilde{S}_w = W^T S_w W$
    $\tilde{S}_b = W^T S_b W$
    $S_w = \frac{1}{N} \sum_j N_j \Sigma_j$  (j: class index)
    $S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$  (j: class index)
    $S = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$  (i: sample index)
  – Try to show that $S = S_w + S_b$
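Before proving $S = S_w + S_b$, it can help to confirm it numerically. The sketch below assumes NumPy and uses random samples with arbitrary class labels; the data and class sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))                    # hypothetical samples, one per row
labels = np.repeat([0, 1, 2], [10, 20, 30])     # hypothetical class index of each sample
N = len(X)
mean = X.mean(axis=0)                           # overall sample mean

S = (X - mean).T @ (X - mean) / N               # total variation S
S_w = np.zeros((3, 3))
S_b = np.zeros((3, 3))
for j in np.unique(labels):
    Xj = X[labels == j]
    Nj = len(Xj)
    mj = Xj.mean(axis=0)                        # class sample mean
    Sigma_j = (Xj - mj).T @ (Xj - mj) / Nj      # class sample covariance
    S_w += Nj * Sigma_j / N                     # average within-class variation
    S_b += Nj * np.outer(mj - mean, mj - mean) / N   # average between-class variation

print(np.abs(S - (S_w + S_b)).max())            # difference is at floating-point level
```

The identity holds for any labelling of the samples, which is why PCA, which maximizes the total variation, ignores class information.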
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; new feature dimensions; threshold for the information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: a 256×256 image and an 8×8 block]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure panels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or “eigenfaces”, derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
        $x_1 = [x_{11} \; x_{21} \; \cdots \; x_{n1}]^T, \quad x_2 = [x_{12} \; x_{22} \; \cdots \; x_{n2}]^T, \quad \ldots, \quad x_L = [x_{1L} \; x_{2L} \; \cdots \; x_{nL}]^T$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • In addition, the vector obtained by averaging all images is called “eigenface 0”. The other eigenfaces, from “eigenface 1” onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
        $\hat{x}_i = \bar{x} + w_{i,1} e(1) + w_{i,2} e(2) + \cdots + w_{i,K} e(K) \;\Rightarrow\; y_i = [w_{i,1} \; w_{i,2} \; \cdots \; w_{i,K}]$  (feature vector of a person i)
  – The eigenface approach adheres to the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
      – SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 data, ..., Speaker R data, together with the SI HMM model, go through model training to give Speaker 1 HMM, ..., Speaker R HMM; the resulting supervectors, of dimension D = (M·n)×1, are fed to Principal Component Analysis]
    • Each new speaker S is represented by a point P in K-space
        $P = e(0) + w_1 e(1) + w_2 e(2) + \cdots + w_K e(K)$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
• Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher’s Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
[Figure] Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j$, $j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is
      $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
  – The class sample means are
      $\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i) = j} x_i$
  – The class sample covariances are
      $\Sigma_j = \frac{1}{N_j} \sum_{g(x_i) = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$
  – The average within-class variation before transform
      $S_w = \frac{1}{N} \sum_j N_j \Sigma_j$
  – The average between-class variation before transform
      $S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1 \; w_2 \; \cdots \; w_m]$ is applied
  – The sample vectors will be
      $y_i = W^T x_i$
  – The sample mean will be
      $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$
  – The class sample means will be
      $\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i) = j} W^T x_i = W^T \bar{x}_j$
  – The average within-class variation will be
      $\tilde{S}_w = \frac{1}{N} \sum_j N_j \left\{ \frac{1}{N_j} \sum_{g(x_i) = j} (W^T x_i - W^T \bar{x}_j)(W^T x_i - W^T \bar{x}_j)^T \right\}$
      $\quad = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W$
[Figure: W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1 \; w_2 \; \cdots \; w_m]$ is applied
  – Similarly, the average between-class variation will be
      $\tilde{S}_b = W^T S_b W$
  – Try to find the optimal W such that the following objective function is maximized
      $J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$
    • A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
        $S_b w_i = \lambda_i S_w w_i$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
        $S_w^{-1} S_b w_i = \lambda_i w_i$
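The closed-form solution can be sketched as an eigen-decomposition of $S_w^{-1} S_b$. This is a minimal illustration, assuming NumPy and an invertible $S_w$; the function name `lda_transform` and the toy data are hypothetical.

```python
import numpy as np

def lda_transform(X, labels, m):
    """Return the m largest eigenvalues of S_w^{-1} S_b and the matching columns of W."""
    labels = np.asarray(labels)
    N, n = X.shape
    mean = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mj = Xj.mean(axis=0)
        S_w += (Xj - mj).T @ (Xj - mj) / N                 # (1/N) sum_j N_j Sigma_j
        S_b += len(Xj) * np.outer(mj - mean, mj - mean) / N
    # Solve S_w^{-1} S_b w = lambda w; solve() avoids forming the inverse explicitly.
    lam, vecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(lam.real)[::-1][:m]                 # keep the m largest eigenvalues
    return lam.real[order], vecs.real[:, order]

# Hypothetical two-class data: classes differ along axis 0, most spread is along axis 1.
X0 = np.array([[-0.1, -1.0], [-0.1, 1.0], [0.1, -1.0], [0.1, 1.0]])
X = np.vstack([X0, X0 + [1.0, 0.0]])
labels = [0, 0, 0, 0, 1, 1, 1, 1]
lam, W = lda_transform(X, labels, m=1)
print(lam, W.ravel())
```

In this toy set the largest variance lies along axis 1, yet the returned direction lies along axis 0, where the class means differ.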
ML-36
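LDA Derivations (4/4)
• Proof
    $\hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|}$  ($|\cdot|$: determinant)
  – Or, for each column vector $w_i$ of W, we want to find that
      $\hat{w}_i = \arg\max_{w_i} \lambda_i$, where $\lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$
  – The quadratic form has the optimal solution
      $\frac{\partial \lambda_i}{\partial w_i} = \frac{2 S_b w_i (w_i^T S_w w_i) - 2 S_w w_i (w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$
      $\Rightarrow S_b w_i (w_i^T S_w w_i) - S_w w_i (w_i^T S_b w_i) = 0$
      $\Rightarrow S_b w_i - S_w w_i \left( \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \right) = 0$
      $\Rightarrow S_b w_i - \lambda_i S_w w_i = 0 \;\Rightarrow\; S_b w_i = \lambda_i S_w w_i$
      $\Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$
  – Rules used: $\left( \frac{F}{G} \right)' = \frac{F' G - F G'}{G^2}$ and $\frac{d (x^T C x)}{dx} = (C + C^T) x$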
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, $\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$, calculated using Year-99’s 5471 files]
[Figure, after cosine transform: covariance matrix of the 18-cepstral vectors, $\Sigma'_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$, calculated using Year-99’s 5471 files]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure, after PCA transform: covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99’s 5471 files]
[Figure, after LDA transform: covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99’s 5471 files]
Character Error Rate
            TC       WG
  MFCC      26.32    22.71
  LDA-1     23.12    20.17
  LDA-2     23.11    20.11
ML-39
PCA vs. LDA (1/2)
[Figure panels: PCA, LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix

    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);

• Eigen Decomposition

    BE = [3.0 3.5 1.4;
          1.9 6.3 2.2;
          2.4 0.4 4.2];

    WI = [4.0 4.1 2.1;
          2.9 8.7 3.5;
          4.4 3.2 4.3];

    % LDA
    IWI = inv(WI);
    A = IWI*BE;
    % PCA
    A = BE+WI;          % why? (Prove it)

    [V,D] = eig(A);
    [V,D] = eigs(A,3);

    fid = fopen('Basis','w');
    for i = 1:3         % feature vector length
      for j = 1:3       % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; axes: feature 1, feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transformation; axes: feature 1, feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with “latent” semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
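LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: X in the n-dimensional feature space is projected onto an orthonormal basis $\phi_1, \ldots, \phi_k$
      $y_i = \phi_i^T X$, $\quad \hat{X} = \sum_{i=1}^{k} y_i \phi_i$
      $\min \| X - \hat{X} \|^2$ for a given k
  – SVD (in LSA): $A_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}$ with $r \leq \min(m, n)$; keeping the leading k×k block of $\Sigma$ gives $A'_{m \times n} = U' \Sigma' V'^T$, with both terms and docs represented in a k-dimensional latent semantic space
      $\min \| A - A' \|_F^2$ for a given k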
ML-47
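LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: projection of a vector x onto $\phi_1$ and $\phi_2$, giving $y_1$ and $y_2$; $\theta$ is the angle between x and $\phi_1$]
      $y_1 = \|x\| \cos\theta = \|x\| \frac{\phi_1^T x}{\|\phi_1\| \, \|x\|} = \phi_1^T x$, where $\|\phi_1\| = 1$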
ML-48
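LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: doc terms and query terms are compared by literal term matching, optionally after doc expansion or query expansion; alternatively each side is mapped to a structure model, and the two structure models are compared by latent semantic structure retrieval]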
ML-49
LSA (5/7)
ML-50
LSA (6/7)
[Figure] Query: “human computer interaction”
[Figure label] An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
    $A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n}$  (rows of A: words $w_1, w_2, \ldots, w_m$; columns of A: docs $d_1, d_2, \ldots, d_n$)
    $A'_{m \times n} = U'_{m \times k} \, \Sigma_{k \times k} \, V'^T_{k \times n}$, with $k \leq r$ and $\|A\|_F^2 \geq \|A'\|_F^2$
  – $r \leq \min(m, n)$
  – Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row A $\in R^n$, Col A $\in R^m$
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
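The rank-k truncation and the norm inequality can be tried out directly. A minimal sketch, assuming NumPy's `numpy.linalg.svd`; the 4×3 matrix is the small term-doc example used later in the SVDLIBC slides, and k = 2 is an arbitrary choice.

```python
import numpy as np

# 4 terms x 3 docs (values taken from the SVDLIBC example slide).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt, s in descending order
k = 2                                              # number of latent dimensions kept (assumed)
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                  # A' = U' Sigma' V'^T

fro_A = np.linalg.norm(A) ** 2                     # ||A||_F^2
fro_A_k = np.linalg.norm(A_k) ** 2                 # ||A'||_F^2
err = np.linalg.norm(A - A_k) ** 2                 # ||A - A'||_F^2
print(fro_A, fro_A_k, err, np.sum(s[k:] ** 2))
```

The squared Frobenius norm of A equals the sum of the squared singular values, so dropping the smallest ones both guarantees $\|A\|_F^2 \geq \|A'\|_F^2$ and makes the approximation error equal to the sum of the dropped $\sigma_i^2$.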
ML-52
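LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
        $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \geq 0$, $\quad \Sigma^2 = diag(\lambda_1, \lambda_2, \ldots, \lambda_n)$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$)
        $V_{n \times n} = [v_1 \; v_2 \; \cdots \; v_n]$, $\quad v_j^T v_j = 1$, $\quad V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$
        $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
        $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$
  – For $\lambda_i \neq 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A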
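Both definitions of the singular values can be compared numerically. A minimal sketch, assuming NumPy; the 3×2 matrix is a hypothetical example.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [1.0, 0.0]])                     # hypothetical 3x2 matrix

lam, V = np.linalg.eigh(A.T @ A)               # A^T A is symmetric, so eigh applies
order = np.argsort(lam)[::-1]                  # lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]

sigma = np.sqrt(np.clip(lam, 0.0, None))       # sigma_j = sqrt(lambda_j); clip guards tiny negatives
lengths = np.linalg.norm(A @ V, axis=0)        # ||A v_j|| for each eigenvector v_j
sv = np.linalg.svd(A, compute_uv=False)        # singular values reported by the SVD routine
print(sigma, lengths, sv)
```

All three printed vectors agree up to floating-point error, and the columns of `A @ V` are mutually orthogonal.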
ML-53
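LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
    $Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \neq j$
  – Suppose that A (or $A^T A$) has rank $r \leq n$
      $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_r > 0$, $\quad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
      $u_i = \frac{Av_i}{\|Av_i\|} = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i$
      $\Rightarrow [u_1 \; u_2 \; \cdots \; u_r] \, \Sigma_r = A \, [v_1 \; v_2 \; \cdots \; v_r]$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $\Rightarrow [u_1 \; \cdots \; u_r \; \cdots \; u_m] \, \Sigma = A \, [v_1 \; \cdots \; v_r \; \cdots \; v_n]$
      $\Rightarrow U \Sigma = A V \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T$  (since $V V^T = I_{n \times n}$)
      where $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
  – $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$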
ML-54
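LSA Derivations (3/7)
[Figure: A (m×n) maps $R^n$ to $R^m$; the columns of V are split into $V_1$ and $V_2$, the columns of U into $U_1$ and $U_2$]
    $A = U \Sigma V^T = (U_1 \; U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$
    $U \Sigma = A V$
  – The $v_i$'s span the row space of $A$
  – The $u_i$'s span the row space of $A^T$
  – $AX = 0$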
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
      $A = U \Sigma V^T \;\Rightarrow\; AV = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = AV$
    • The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
      $A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U$
    • The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A), m×n, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$
    • compare two terms ($w_i$, $w_j$) → dot product of two rows of A
      – or an entry in $A A^T$
    • compare two docs ($d_k$, $d_s$) → dot product of two columns of A
      – or an entry in $A^T A$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A′), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U' \Sigma'$
        $A' A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U' \Sigma')(U' \Sigma')^T$
    • compare two docs → dot product of two rows of $V' \Sigma'$
        $A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T$
    • compare a query word and a doc → each individual entry of A′
  – $\Sigma'$ is for stretching or shrinking; $V'^T V' = U'^T U' = I$
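The term-term and doc-doc comparisons in the reduced space can be verified against $A' A'^T$ and $A'^T A'$. A minimal sketch, assuming NumPy; the matrix is the small term-doc example from the SVDLIBC slides and k = 2 is an assumed setting.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])                # 4 terms x 3 docs
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T      # U' (m x k), diag of Sigma' (k), V' (n x k)
A_k = (U_k * s_k) @ V_k.T                      # A' = U' Sigma' V'^T

term_vecs = U_k * s_k                          # rows of U' Sigma': one k-dim vector per term
doc_vecs = V_k * s_k                           # rows of V' Sigma': one k-dim vector per doc
term_sim = term_vecs @ term_vecs.T             # term-term dot products
doc_sim = doc_vecs @ doc_vecs.T                # doc-doc dot products
print(term_sim.shape, doc_sim.shape)
```

Working with the k-dimensional rows of $U' \Sigma'$ and $V' \Sigma'$ gives the same dot products as the full m×n matrix A′, at a much lower cost when k is small.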
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
        $\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$
      – The query is represented by the weighted sum of its constituent term vectors
      – The separate dimensions are differentially weighted
      – $\hat{q}$ is just like a row of V
  – Cosine measure between the query and doc vectors in the latent semantic space
        $sim(\hat{q}, \hat{d}) = cosine(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q} \Sigma| \, |\hat{d} \Sigma|}$  ($\hat{q}$, $\hat{d}$: row vectors)
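Fold-in and the cosine measure can be sketched as follows. This is an illustration under stated assumptions: NumPy, the small term-doc matrix from the SVDLIBC slides, k = 2, and a made-up query vector; `fold_in` and `sim` are hypothetical helper names.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])                    # 4 terms x 3 docs
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T

def fold_in(q):
    """q_hat = q^T U_k Sigma_k^{-1}: a 1 x k row vector, comparable to a row of V_k."""
    return q @ U_k @ np.linalg.inv(S_k)

def sim(q_hat, d_hat):
    """Cosine between q_hat Sigma and d_hat Sigma."""
    a, b = q_hat @ S_k, d_hat @ S_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d1_hat = fold_in(A[:, 1])                          # folding in an existing doc column
q_hat = fold_in(np.array([1.0, 0.0, 1.0, 0.0]))    # hypothetical query using terms 0 and 2
scores = [sim(q_hat, V_k[j]) for j in range(A.shape[1])]
print(d1_hat, V_k[1], scores)
```

Folding in a column that was already part of A reproduces the corresponding row of V′, which is a quick sanity check on the formula.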
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
    $\hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k}$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems (“heterogeneous vocabulary”)
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde’s SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (Row: Term, Col: Doc)
               Doc 0   Doc 1   Doc 2
      Term 0    2.3     0.0     4.2
      Term 1    0.0     1.3     2.2
      Term 2    3.8     0.0     0.5
      Term 3    0.0     0.0     0.0
  – Each entry is weighted by its TF×IDF score
  – The same matrix in sparse form
      4 3 6      (4 rows, 3 cols, 6 nonzero entries)
      2          (2 nonzero entries at Col 0)
      0 2.3      (Col 0, Row 0)
      2 3.8      (Col 0, Row 2)
      1          (1 nonzero entry at Col 1)
      1 1.3      (Col 1, Row 1)
      3          (3 nonzero entries at Col 2)
      0 4.2      (Col 2, Row 0)
      1 2.2      (Col 2, Row 1)
      2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
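The column-by-column layout shown on this slide can be produced from a dense matrix with a few lines of code. A minimal sketch in plain Python; `to_sparse_text` is a hypothetical helper that mirrors only the layout shown here (header line, then per-column counts and row/value pairs), not a full specification of the SVDLIBC file format.

```python
def to_sparse_text(rows):
    """Serialize a dense matrix (list of rows) into the column-wise sparse layout."""
    n_rows, n_cols = len(rows), len(rows[0])
    # For each column, collect (row index, value) pairs of the nonzero entries.
    cols = [[(r, rows[r][c]) for r in range(n_rows) if rows[r][c] != 0.0]
            for c in range(n_cols)]
    nnz = sum(len(col) for col in cols)
    lines = [f"{n_rows} {n_cols} {nnz}"]          # header: rows, cols, nonzero entries
    for col in cols:
        lines.append(str(len(col)))               # number of nonzero entries in this column
        lines.extend(f"{r} {v:g}" for r, v in col)
    return "\n".join(lines)

term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
text = to_sparse_text(term_doc)
print(text)
```

Only nonzero entries are stored, which matters for realistic term-doc matrices where almost all entries are zero.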
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example: term-doc matrix
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip
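  – First line of the matrix file: no. of indexing terms, no. of docs, no. of nonzero entries
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output files: LSA100-Ut, LSA100-S, LSA100-Vt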
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
  – Each word vector ($u^T$) is 1×100; 51253 words
• LSA100-S
10026861882994155959hellip
  – 100 eigenvalues
• LSA100-Vt
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
  – Each doc vector ($v^T$) is 1×100; 2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
    $\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q} \Sigma| \, |\hat{d} \Sigma|}$
ML-3
Introduction (2/3)
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Latent Semantic Analysis (LSA)
• Heteroscedastic Discriminant Analysis (HDA)
ML-4
Introduction (3/3)
• Formulation for discriminative feature extraction
  – Model-free (nonparametric)
    • Without prior information, e.g., PCA
    • With prior information, e.g., LDA
  – Model-dependent (parametric)
    • E.g., PLSA (Probabilistic Latent Semantic Analysis) with EM (Expectation-Maximization), MCE (Minimum Classification Error) training
ML-5
Principal Component Analysis (PCA) (1/2)
• Known as the Karhunen-Loève Transform (1947, 1963)
  – Or the Hotelling Transform (1933)
• A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
• A transform by which the data set can be represented by a reduced number of effective features while still retaining most of the intrinsic information content
  – A small set of features is to be found that represents the data samples accurately
• Also called "Subspace Decomposition" or "Factor Analysis"
Pearson, 1901
ML-6
Principal Component Analysis (PCA) (2/2)
The patterns show a significant difference from each other along one of the transformed axes
ML-7
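PCA Derivations (1/13)
• Suppose $\mathbf{x}$ is an n-dimensional zero-mean random vector, $\mu_x = E\{\mathbf{x}\} = \mathbf{0}$
  – If $\mathbf{x}$ is not zero mean, we can subtract the mean before carrying out the following analysis
  – $\mathbf{x}$ can be represented without error by the summation of n linearly independent vectors

$$\mathbf{x} = \sum_{i=1}^{n} y_i \phi_i = \Phi \mathbf{y}, \quad \text{where } \mathbf{y} = [y_1 \; \cdots \; y_i \; \cdots \; y_n]^T, \quad \Phi = [\phi_1 \; \cdots \; \phi_i \; \cdots \; \phi_n]$$

  – $\phi_i$: the basis vectors
  – $y_i$: the i-th component in the feature (mapped) space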
ML-8
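PCA Derivations (2/13)
[Figure: the point (2, 3) expressed in two orthogonal basis sets, {(1, 0), (0, 1)} and {(1, 1), (-1, 1)}; its coordinates in the second basis are (5/2, 1/2)]

$$\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 2 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3 \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{5}{2} \begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{1}{2} \begin{bmatrix} -1 \\ 1 \end{bmatrix}$$

Orthogonal basis sets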
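The change of basis for the point (2, 3) can be checked with a few lines of code. The sketch below is only an illustration: the helper name `coords_in_orthogonal_basis` is hypothetical, and it assumes the basis vectors are mutually orthogonal (not necessarily of unit length), so each coordinate is the dot product with a basis vector divided by that vector's squared length.

```python
def coords_in_orthogonal_basis(x, basis):
    """Coordinates of x in an orthogonal (not necessarily unit-length) basis."""
    coords = []
    for b in basis:
        dot_xb = sum(xi * bi for xi, bi in zip(x, b))
        dot_bb = sum(bi * bi for bi in b)
        coords.append(dot_xb / dot_bb)  # projection coefficient onto b
    return coords

x = (2.0, 3.0)
standard = [(1.0, 0.0), (0.0, 1.0)]
rotated = [(1.0, 1.0), (-1.0, 1.0)]

print(coords_in_orthogonal_basis(x, standard))  # [2.0, 3.0]
print(coords_in_orthogonal_basis(x, rotated))   # [2.5, 0.5]
```

The same vector therefore has coordinates (2, 3) in the standard basis and (5/2, 1/2) in the rotated one; only the description changes, not the point.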
ML-9
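PCA Derivations (3/13)
  – Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set

$$\phi_i^T \phi_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$

    • Such that $y_i$ is equal to the projection of $\mathbf{x}$ on $\phi_i$

$$\forall i: \quad y_i = \phi_i^T \mathbf{x} = \mathbf{x}^T \phi_i$$

[Figure: the projections $y_1$ and $y_2$ of $\mathbf{x}$ onto $\phi_1$ and $\phi_2$, with $\theta$ the angle between $\mathbf{x}$ and $\phi_1$]

$$y_1 = \phi_1^T \mathbf{x} = \|\phi_1\| \, \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \cos\theta, \quad \text{where } \|\phi_1\| = 1$$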
ML-10
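PCA Derivations (4/13)
  – Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
    • $y_i$ also has the following properties
      – Its mean is zero too

$$E\{y_i\} = E\{\phi_i^T \mathbf{x}\} = \phi_i^T E\{\mathbf{x}\} = \phi_i^T \mathbf{0} = 0$$

      – Its variance is

$$\sigma_i^2 = E\{y_i^2\} - \left(E\{y_i\}\right)^2 = E\{y_i^2\} = E\{\phi_i^T \mathbf{x} \mathbf{x}^T \phi_i\} = \phi_i^T E\{\mathbf{x}\mathbf{x}^T\} \phi_i = \phi_i^T R \phi_i$$

        where $R$ is the (auto-)correlation matrix of $\mathbf{x}$

$$R = E\{\mathbf{x}\mathbf{x}^T\} \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \Sigma = E\{(\mathbf{x} - \mu)(\mathbf{x} - \mu)^T\} = E\{\mathbf{x}\mathbf{x}^T\} - \mu\mu^T \quad (= R \text{ when } \mu = \mathbf{0})$$

    • The correlation between two projections $y_i$ and $y_j$ is

$$E\{y_i y_j\} = E\{(\phi_i^T \mathbf{x})(\phi_j^T \mathbf{x})^T\} = E\{\phi_i^T \mathbf{x}\mathbf{x}^T \phi_j\} = \phi_i^T E\{\mathbf{x}\mathbf{x}^T\} \phi_j = \phi_i^T R \phi_j$$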
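The identity "variance of a projection equals $\phi^T R \phi$" can be checked numerically. The sketch below assumes NumPy is available; the data, the mixing matrix, and the direction `phi` are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 5000 samples in 3 dimensions (rows are samples).
X = rng.normal(size=(5000, 3)) @ np.array([[2.0, 0.0, 0.0],
                                           [0.5, 1.0, 0.0],
                                           [0.0, 0.3, 0.2]])
X -= X.mean(axis=0)               # enforce the zero-mean assumption

R = X.T @ X / len(X)              # sample correlation matrix, (1/N) sum x_i x_i^T

phi = np.array([1.0, 2.0, -1.0])
phi /= np.linalg.norm(phi)        # unit-length basis vector

y = X @ phi                       # projections y = phi^T x for every sample
var_direct = np.mean(y ** 2)      # E{y^2}; the mean of y is zero
var_formula = phi @ R @ phi       # phi^T R phi

print(np.isclose(var_direct, var_formula))
```

With the sample estimate of $R$, the two quantities agree up to floating-point rounding, because both are the same quadratic form written in two ways.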
ML-11
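PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  – We want to choose only m of the $\phi_i$'s such that we can still approximate $\mathbf{x}$ well under the mean-squared error criterion

$$\mathbf{x} = \sum_{i=1}^{n} y_i \phi_i = \sum_{i=1}^{m} y_i \phi_i + \sum_{j=m+1}^{n} y_j \phi_j \quad \text{(original vector)}$$

$$\hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i \phi_i \quad \text{(reconstructed vector)}$$

$$\begin{aligned}
\varepsilon(m) &= E\left\{ \|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2 \right\}
= E\left\{ \Big( \sum_{j=m+1}^{n} y_j \phi_j \Big)^T \Big( \sum_{k=m+1}^{n} y_k \phi_k \Big) \right\} \\
&= E\left\{ \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} y_j y_k \phi_j^T \phi_k \right\}
= \sum_{j=m+1}^{n} E\{y_j^2\}
= \sum_{j=m+1}^{n} \sigma_j^2
= \sum_{j=m+1}^{n} \phi_j^T R \phi_j
\end{aligned}$$

    • Because $\phi_j^T \phi_k = 1$ if $j = k$ and $0$ if $j \neq k$
    • Because $E\{y_j\} = 0$, so that $\sigma_j^2 = E\{y_j^2\} - \left(E\{y_j\}\right)^2 = E\{y_j^2\}$
  – We should discard the bases where the projections have lower variances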
ML-12
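PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  – If the orthonormal (basis) set $\phi_i$'s is selected to be the eigenvectors of the correlation matrix $R$, associated with eigenvalues $\lambda_i$'s
    • They will have the property that

$$R \phi_j = \lambda_j \phi_j$$

  – Such that the mean-squared error mentioned above will be

$$\varepsilon(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \phi_j^T R \phi_j = \sum_{j=m+1}^{n} \phi_j^T \lambda_j \phi_j = \sum_{j=m+1}^{n} \lambda_j$$

  – $R$ is real and symmetric, therefore its eigenvectors can be chosen to form an orthonormal set
  – $R$ is positive semidefinite ($\mathbf{x}^T R \mathbf{x} \ge 0$), so all eigenvalues are nonnegative; when $R$ is positive definite ($\mathbf{x}^T R \mathbf{x} > 0$ for all $\mathbf{x} \neq \mathbf{0}$), all eigenvalues are positive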
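The claim that the truncation error equals the sum of the discarded eigenvalues can be verified numerically. The following is a minimal sketch, assuming NumPy and synthetic data; the sample size, dimensions, and scale factors are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])  # hypothetical data
X -= X.mean(axis=0)                           # zero-mean samples, one per row
R = X.T @ X / len(X)                          # sample correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)          # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]             # re-sort: largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

m = 2
Phi_m = eigvecs[:, :m]                        # keep the m leading eigenvectors
X_hat = (X @ Phi_m) @ Phi_m.T                 # reconstruct from m components
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(mse, eigvals[m:].sum())                 # the two numbers should agree
```

Because $R$ here is estimated from the same samples that are reconstructed, the agreement is exact up to rounding rather than only approximate.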
ML-13
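PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  – If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be

$$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$$

  – Any two projections $y_i$ and $y_j$ ($i \neq j$) will be mutually uncorrelated

$$E\{y_i y_j\} = E\{(\phi_i^T \mathbf{x})(\phi_j^T \mathbf{x})^T\} = E\{\phi_i^T \mathbf{x}\mathbf{x}^T \phi_j\} = \phi_i^T R \phi_j = \lambda_j \phi_i^T \phi_j = 0$$

    • Good news for most statistical modeling approaches
      – Gaussians and diagonal matrices

$$\Sigma = \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\
\sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}
\end{bmatrix}$$

$$R = E\{\mathbf{x}\mathbf{x}^T\} \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \Sigma = E\{(\mathbf{x} - \mu)(\mathbf{x} - \mu)^T\} = E\{\mathbf{x}\mathbf{x}^T\} - \mu\mu^T$$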
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
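PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
  – Define the objective function (the first term is to be minimized, the second term carries the orthonormality constraints)

$$J = \sum_{j=m+1}^{n} \phi_j^T R \phi_j - \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} \left( \phi_j^T \phi_k - \delta_{jk} \right), \quad \text{where } \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases}$$

  – Take the derivative, using $\dfrac{\partial\, \phi^T R \phi}{\partial \phi} = 2 R \phi$

$$\begin{aligned}
& \forall\, m+1 \le j \le n: \quad \frac{\partial J}{\partial \phi_j} = 2 R \phi_j - 2 \sum_{k=m+1}^{n} \mu_{jk} \phi_k = \mathbf{0} \\
\Rightarrow\; & R \phi_j = \Phi_{n-m}\, \mu_j, \quad \text{where } \Phi_{n-m} = [\phi_{m+1} \; \cdots \; \phi_n], \;\; \mu_j = [\mu_{j,m+1} \; \cdots \; \mu_{j,n}]^T \\
\Rightarrow\; & R\, [\phi_{m+1} \; \cdots \; \phi_n] = \Phi_{n-m}\, [\mu_{m+1} \; \cdots \; \mu_n] \\
\Rightarrow\; & R\, \Phi_{n-m} = \Phi_{n-m} U_{n-m}, \quad \text{where } U_{n-m} = [\mu_{m+1} \; \cdots \; \mu_n]
\end{aligned}$$

  – There is a particular solution if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $R$, and $\phi_{m+1}, \ldots, \phi_n$ are their corresponding eigenvectors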
ML-16
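PCA Derivations (10/13)
• Given an input vector $\mathbf{x}$ with dimension n
  – Try to construct a linear transform $\Phi'$ ($\Phi'$ is an n×m matrix, m < n) such that the truncation result $\Phi'^T \mathbf{x}$ is optimal under the mean-squared error criterion

$$\text{Encoder:} \quad \mathbf{y} = \Phi'^T \mathbf{x}, \quad \mathbf{x} = [x_1 \; x_2 \; \cdots \; x_n]^T, \quad \mathbf{y} = [y_1 \; y_2 \; \cdots \; y_m]^T, \quad \text{where } \Phi' = [\mathbf{e}_1 \; \mathbf{e}_2 \; \cdots \; \mathbf{e}_m]$$

$$\text{Decoder:} \quad \hat{\mathbf{x}} = \Phi' \mathbf{y}, \quad \hat{\mathbf{x}} = [\hat{x}_1 \; \hat{x}_2 \; \cdots \; \hat{x}_n]^T$$

$$\text{minimize } E\left\{ (\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}}) \right\}$$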
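The encoder/decoder view of PCA can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not a reference implementation: the function names `fit_pca`, `encode`, and `decode` are hypothetical, NumPy is assumed, the data are synthetic, and the mean is subtracted explicitly because the slides assume zero-mean input.

```python
import numpy as np

def fit_pca(X, m):
    """Return (mean, Phi); the columns of Phi are the m leading eigenvectors."""
    mean = X.mean(axis=0)
    Xc = X - mean
    R = Xc.T @ Xc / len(Xc)                 # correlation matrix of centered data
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1][:m]   # indices of the m largest eigenvalues
    return mean, eigvecs[:, order]

def encode(x, mean, Phi):
    return Phi.T @ (x - mean)               # y = Phi'^T x, an m-dimensional code

def decode(y, mean, Phi):
    return Phi @ y + mean                   # x_hat = Phi' y

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))   # hypothetical data
mean, Phi = fit_pca(X, m=2)
y = encode(X[0], mean, Phi)
x_hat = decode(y, mean, Phi)
print(y.shape, x_hat.shape)                 # (2,) (5,)
```

Keeping all n eigenvectors (m = n) makes the decoder reproduce the input exactly; the reconstruction error grows only as trailing eigenvectors are dropped.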
ML-17
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition
  – PCA needs no prior information (e.g., class distributions or output information) about the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select m such that

$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$

    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)

$$\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i$$
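Both selection rules for m are easy to state in code. The sketch below uses plain Python; the function names and the eigenvalues are hypothetical and serve only to illustrate the two criteria.

```python
def choose_m_by_threshold(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues cover at least `threshold` of the total."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(lams)

def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    avg = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam >= avg)

lams = [4.0, 3.0, 2.0, 1.0]                 # hypothetical eigenvalues
print(choose_m_by_threshold(lams, 0.8))     # 3
print(choose_m_by_average(lams))            # 2
```

With these values the cumulative proportions are 0.4, 0.7, 0.9, 1.0, so a threshold of 0.8 is first reached at m = 3, while only two eigenvalues reach the average of 2.5.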
ML-19
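PCA Derivations (13/13)
• PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal

$$J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W$$

$$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W$$

$$S_w = \frac{1}{N} \sum_{j} N_j \Sigma_j, \qquad S_b = \frac{1}{N} \sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T, \qquad S = \frac{1}{N} \sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$

  – i: sample index; j: class index
  – Try to show that $S = S_w + S_b$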
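One way to approach the exercise is sketched below. It assumes the class covariance is $\Sigma_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$, with $g(\cdot)$ the class index and $N_j$ the number of samples in class j; the argument adds and subtracts the class mean inside the total scatter.

```latex
\begin{aligned}
S &= \frac{1}{N} \sum_{j} \sum_{g(\mathbf{x}_i) = j}
     \big[ (\mathbf{x}_i - \bar{\mathbf{x}}_j) + (\bar{\mathbf{x}}_j - \bar{\mathbf{x}}) \big]
     \big[ (\mathbf{x}_i - \bar{\mathbf{x}}_j) + (\bar{\mathbf{x}}_j - \bar{\mathbf{x}}) \big]^T \\
  &= \frac{1}{N} \sum_{j} \Big[ \sum_{g(\mathbf{x}_i) = j}
     (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T
     + N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T \Big] \\
  &= \frac{1}{N} \sum_{j} N_j \Sigma_j
     + \frac{1}{N} \sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T
   = S_w + S_b
\end{aligned}
```

The cross terms vanish in the second line because $\sum_{g(\mathbf{x}_i) = j} (\mathbf{x}_i - \bar{\mathbf{x}}_j) = \mathbf{0}$ within every class.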
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; the new feature dimensions; the threshold for the information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: an image of size 256 × 256 and blocks of size 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure panels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity at each pixel
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face

$$\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}, \quad
\mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix}, \quad \ldots, \quad
\mathbf{x}_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$$
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces

$$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i,1}\, \mathbf{e}(1) + w_{i,2}\, \mathbf{e}(2) + \cdots + w_{i,K}\, \mathbf{e}(K) \;\Rightarrow\; \mathbf{y}_i = [w_{i,1} \; w_{i,2} \; \cdots \; w_{i,K}] \quad \text{(feature vector of person } i\text{)}$$

  – The eigenface approach follows the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
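The eigenface steps can be sketched on synthetic data. Everything below is an assumption made for illustration: random arrays stand in for real face images, the sizes are arbitrary, NumPy is assumed, and the eigenvectors are obtained from an SVD of the centered image matrix (its right singular vectors are the eigenvectors of the covariance matrix), which is one common way to do it and not necessarily the procedure used by Turk and Pentland.

```python
import numpy as np

rng = np.random.default_rng(3)
L, n_pixels, K = 20, 64, 5                  # hypothetical sizes: 20 images of 8x8 pixels
faces = rng.random((L, n_pixels))           # stand-in for real reference images

mean_face = faces.mean(axis=0)              # "eigenface 0"
centered = faces - mean_face

# Rows of Vt are eigenvectors of the covariance matrix of the centered images.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = Vt[:K]                         # eigenface 1 .. K

weights = centered @ eigenfaces.T           # feature vector [w_i1 .. w_iK] per image
approx = mean_face + weights @ eigenfaces   # x_hat_i = mean + sum_k w_ik e(k)

print(weights.shape, approx.shape)          # (20, 5) (20, 64)
```

Each image is thus summarized by K weights; recognition would then compare these weight vectors instead of raw pixels.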
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the parameters of interest for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Starting from the SI HMM model, model training on the data of speakers 1 to R yields the speaker 1 HMM to the speaker R HMM; the resulting supervectors, each of dimension D = (M·n) × 1, are passed to Principal Component Analysis]
  – Each new speaker S is represented by a point P in K-space

$$P_i = \mathbf{e}(0) + w_{i,1}\, \mathbf{e}(1) + w_{i,2}\, \mathbf{e}(2) + \cdots + w_{i,K}\, \mathbf{e}(K)$$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates in the feature space, while eigenvoice operates in the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
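LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes: $g(\mathbf{x}_i) = j, \; j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is

$$\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$$

  – The class sample means are

$$\bar{\mathbf{x}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \mathbf{x}_i$$

  – The class sample covariances are

$$\Sigma_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$

  – The average within-class variation before transform

$$S_w = \frac{1}{N} \sum_{j} N_j \Sigma_j$$

  – The average between-class variation before transform

$$S_b = \frac{1}{N} \sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$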
ML-34
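LDA Derivations (2/4)
• If the transform $W = [\mathbf{w}_1 \; \mathbf{w}_2 \; \cdots \; \mathbf{w}_m]$ is applied
  – The sample vectors will be

$$\mathbf{y}_i = W^T \mathbf{x}_i$$

  – The sample mean will be

$$\bar{\mathbf{y}} = \frac{1}{N} \sum_{i=1}^{N} W^T \mathbf{x}_i = W^T \Big( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \Big) = W^T \bar{\mathbf{x}}$$

  – The class sample means will be

$$\bar{\mathbf{y}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} W^T \mathbf{x}_i = W^T \bar{\mathbf{x}}_j$$

  – The average within-class variation will be

$$\tilde{S}_w = \frac{1}{N} \sum_{j} N_j \Big\{ \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \big( W^T \mathbf{x}_i - W^T \bar{\mathbf{x}}_j \big) \big( W^T \mathbf{x}_i - W^T \bar{\mathbf{x}}_j \big)^T \Big\}
= W^T \Big\{ \frac{1}{N} \sum_{j} N_j \Sigma_j \Big\} W = W^T S_w W$$

[Figure: the transform W maps $\mathbf{x}$ (with $S_w$, $S_b$) to $\mathbf{y}$ (with $\tilde{S}_w$, $\tilde{S}_b$)]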
ML-35
LDA Derivations (3/4)
• If the transform $W = [\mathbf{w}_1 \; \mathbf{w}_2 \; \cdots \; \mathbf{w}_m]$ is applied
  – Similarly, the average between-class variation will be

$$\tilde{S}_b = W^T S_b W$$

  – Try to find the optimal $W$ such that the following objective function is maximized

$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$

    • A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in

$$S_b \mathbf{w}_i = \lambda_i S_w \mathbf{w}_i$$

    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$

$$S_w^{-1} S_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$
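The closed-form solution can be tried on synthetic data. The sketch below is an illustration under stated assumptions rather than a library routine: `lda_directions` is a hypothetical helper, NumPy is assumed, the two-class data are made up, and $S_w$ is assumed invertible so that the eigenvectors of $S_w^{-1} S_b$ can be computed directly.

```python
import numpy as np

def lda_directions(X, labels, m):
    """Columns: the m leading eigenvectors of inv(S_w) @ S_b (illustrative sketch)."""
    N, n = X.shape
    overall_mean = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mean_j = Xj.mean(axis=0)
        S_w += (Xj - mean_j).T @ (Xj - mean_j) / N     # (N_j / N) * Sigma_j
        d = (mean_j - overall_mean)[:, None]
        S_b += len(Xj) / N * (d @ d.T)                 # (N_j / N) * outer product
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:m]         # largest eigenvalues first
    return eigvecs[:, order].real, S_w, S_b

rng = np.random.default_rng(4)
# Hypothetical two-class data: the class means differ along the first axis only.
X = np.vstack([rng.normal([0.0, 0.0], [1.0, 3.0], size=(300, 2)),
               rng.normal([4.0, 0.0], [1.0, 3.0], size=(300, 2))])
labels = np.repeat([0, 1], 300)

W, S_w, S_b = lda_directions(X, labels, m=1)
print(W[:, 0])    # should point, up to sign, mostly along the first axis
```

Note that the second axis carries the larger variance in this data, so PCA would favor it, whereas the LDA direction follows the axis that separates the class means.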
ML-36
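LDA Derivations (4/4)
• Proof

$$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \quad (|\cdot| \text{ denotes the determinant})$$

  – Or, for each column vector $\mathbf{w}_i$ of $W$, we want to find the $\mathbf{w}_i$ at which the quadratic form below has its optimal solution

$$\lambda_i = \frac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i}$$

$$\begin{aligned}
\frac{\partial \lambda_i}{\partial \mathbf{w}_i} = \mathbf{0}
\;&\Rightarrow\; \frac{2 S_b \mathbf{w}_i \left( \mathbf{w}_i^T S_w \mathbf{w}_i \right) - 2 S_w \mathbf{w}_i \left( \mathbf{w}_i^T S_b \mathbf{w}_i \right)}{\left( \mathbf{w}_i^T S_w \mathbf{w}_i \right)^2} = \mathbf{0} \\
&\Rightarrow\; S_b \mathbf{w}_i \left( \mathbf{w}_i^T S_w \mathbf{w}_i \right) - S_w \mathbf{w}_i \left( \mathbf{w}_i^T S_b \mathbf{w}_i \right) = \mathbf{0} \\
&\Rightarrow\; S_b \mathbf{w}_i - S_w \mathbf{w}_i \, \frac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i} = \mathbf{0} \\
&\Rightarrow\; S_b \mathbf{w}_i - \lambda_i S_w \mathbf{w}_i = \mathbf{0}
\;\Rightarrow\; S_b \mathbf{w}_i = \lambda_i S_w \mathbf{w}_i \\
&\Rightarrow\; S_w^{-1} S_b \mathbf{w}_i = \lambda_i \mathbf{w}_i
\end{aligned}$$

  – The steps use the quotient rule and the derivative of a quadratic form

$$\Big( \frac{F}{G} \Big)' = \frac{F' G - F G'}{G^2}, \qquad \frac{d\, (\mathbf{x}^T C \mathbf{x})}{d \mathbf{x}} = (C + C^T)\, \mathbf{x}$$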
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), calculated using Year-99's 5471 files]

$$\Sigma_x = \frac{1}{N} \sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T, \qquad \Sigma_y = \frac{1}{N} \sum_{i} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]

Character Error Rate
          TC      WG
MFCC      26.32   22.71
LDA-1     23.12   20.17
LDA-2     23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure panels: PCA, LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA theoretically provides a much lower classification error than LDA
  – However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformed data, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
ML-42
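HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix

    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);

• Eigen Decomposition

    BE = [3.0 3.5 1.4;
          1.9 6.3 2.2;
          2.4 0.4 4.2];      % between-class matrix (example values)

    WI = [4.0 4.1 2.1;
          2.9 8.7 3.5;
          4.4 3.2 4.3];      % within-class matrix (example values)

    % LDA
    IWI = inv(WI);
    A = IWI*BE;

    % PCA
    A = BE + WI;             % why? (Prove it)

    [V,D] = eig(A);          % all eigenvectors and eigenvalues
    [V,D] = eigs(A,3);       % the 3 leading eigenvectors and eigenvalues

    fid = fopen('Basis','w');
    for i = 1:3              % feature vector length
        for j = 1:3          % basis number
            fprintf(fid,'%10.10f ',V(i,j));
        end
        fprintf(fid,'\n');
    end
    fclose(fid);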
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; horizontal axis: feature 1, vertical axis: feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transformation; horizontal axis: feature 1, vertical axis: feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
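LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: the n-dimensional vector $X$ in the feature space is projected onto an orthonormal basis $\phi_1, \ldots, \phi_k$, giving the k-dimensional vector $Y$

$$y_i = \phi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \phi_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$

  – SVD (in LSA): the m×n matrix $A$ is approximated by $A'$ in the k-dimensional latent semantic space

$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n)$$

$$A'_{m \times n} = U'_{m \times k}\, \Sigma'_{k \times k}\, V'^T_{k \times n}, \qquad \min \|A - A'\|_F^2 \text{ for a given } k$$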
ML-47
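LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: projection of a vector $\mathbf{x}$ onto $\phi_1$ and $\phi_2$, giving $y_1$ and $y_2$; $\theta$ is the angle between $\mathbf{x}$ and $\phi_1$]

$$y_1 = \phi_1^T \mathbf{x} = \|\phi_1\| \, \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \cos\theta, \quad \text{where } \|\phi_1\| = 1$$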
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each represented by its terms and by a structure model; labeled links: doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD), with rows of $A$ indexed by words $w_1, w_2, \ldots, w_m$ and columns by docs $d_1, d_2, \ldots, d_n$

$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n)$$

$$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}, \quad k \le r$$

  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – $\text{Row } A \subseteq R^n$, $\text{Col } A \subseteq R^m$
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
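The rank-k truncation and the Frobenius-norm relation can be checked with NumPy's SVD. The sketch below uses a small made-up term-document matrix; the matrix entries and the choice k = 2 are assumptions for illustration only.

```python
import numpy as np

# Hypothetical 5-term x 4-document count matrix (rows: terms, columns: docs).
A = np.array([[1., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 1.],
              [1., 0., 0., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation A'

fro_A = np.linalg.norm(A, 'fro') ** 2              # squared Frobenius norm of A
fro_Ak = np.linalg.norm(A_k, 'fro') ** 2           # squared Frobenius norm of A'
print(fro_A >= fro_Ak, np.isclose(fro_A, np.sum(s ** 2)))
```

The squared Frobenius norm of A equals the sum of all squared singular values, and the truncation keeps only the k largest of them, which is why the norm cannot increase.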
ML-52
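LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$

    • All eigenvectors $\mathbf{v}_j$ are orthonormal ($\mathbf{v}_j \in R^n$)

$$V_{n \times n} = [\mathbf{v}_1 \; \mathbf{v}_2 \; \cdots \; \mathbf{v}_n], \qquad \mathbf{v}_j^T \mathbf{v}_j = 1, \qquad V^T V = I_{n \times n}$$

• Define singular values $\sigma_j = \sqrt{\lambda_j}, \; j = 1, \ldots, n$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_n$: $\sigma_1 = \|A\mathbf{v}_1\|, \; \sigma_2 = \|A\mathbf{v}_2\|, \ldots$

$$\|A\mathbf{v}_i\|^2 = \mathbf{v}_i^T A^T A \mathbf{v}_i = \mathbf{v}_i^T \lambda_i \mathbf{v}_i = \lambda_i \;\Rightarrow\; \|A\mathbf{v}_i\| = \sigma_i$$

  – For $\lambda_i \neq 0$, $i = 1, \ldots, r$, $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A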
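The two definitions of the singular values can be compared numerically against a library SVD. The sketch assumes NumPy and a random full-rank matrix; the matrix size is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 4))                     # hypothetical m x n matrix, m=6, n=4

lam, V = np.linalg.eigh(A.T @ A)                # eigen-decomposition of symmetric A^T A
order = np.argsort(lam)[::-1]                   # sort so that lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]

sigma_from_eig = np.sqrt(lam)                   # sigma_j = sqrt(lambda_j)
sigma_from_len = np.linalg.norm(A @ V, axis=0)  # sigma_j = ||A v_j||
sigma_from_svd = np.linalg.svd(A, compute_uv=False)

print(np.allclose(sigma_from_eig, sigma_from_len),
      np.allclose(sigma_from_eig, sigma_from_svd))
```

All three computations describe the same numbers; in practice a dedicated SVD routine is usually preferred over forming $A^T A$ explicitly, since squaring the matrix also squares its condition number.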
ML-53
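LSA Derivations (2/7)
• $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A

$$A\mathbf{v}_i \cdot A\mathbf{v}_j = (A\mathbf{v}_i)^T A\mathbf{v}_j = \mathbf{v}_i^T A^T A \mathbf{v}_j = \lambda_j \mathbf{v}_i^T \mathbf{v}_j = 0 \quad (i \neq j)$$

  – Suppose that $A$ (or $A^T A$) has rank r ≤ n

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$

  – Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A

$$\mathbf{u}_i = \frac{1}{\|A\mathbf{v}_i\|} A\mathbf{v}_i = \frac{1}{\sigma_i} A\mathbf{v}_i \;\Rightarrow\; \sigma_i \mathbf{u}_i = A\mathbf{v}_i \;\Rightarrow\; [\mathbf{u}_1 \; \mathbf{u}_2 \; \cdots \; \mathbf{u}_r]\, \Sigma_r = A\, [\mathbf{v}_1 \; \mathbf{v}_2 \; \cdots \; \mathbf{v}_r]$$

    • $V$: an orthonormal matrix (n×r), known in advance; $U$: also an orthonormal matrix (m×r)
    • Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of $R^m$

$$[\mathbf{u}_1 \; \cdots \; \mathbf{u}_r \; \cdots \; \mathbf{u}_m]\, \Sigma = A\, [\mathbf{v}_1 \; \cdots \; \mathbf{v}_r \; \cdots \; \mathbf{v}_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T \quad (V V^T = I_{n \times n})$$

$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$

$$\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2, \quad \text{where } \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$$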
ML-54
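LSA Derivations (3/7)
[Figure: the m×n matrix $A$ maps $R^n$ to $R^m$. In $R^n$, the vectors $\mathbf{v}_i$ (the columns of $V_1$) span the row space of $A$, and $V_2$ corresponds to the solutions of $AX = 0$; in $R^m$, the vectors $\mathbf{u}_i$ (the columns of $U_1$) span the row space of $A^T$, with $U_2$ the remaining columns of $U$]

$$A = U \Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$$

$$U \Sigma = A V, \qquad U_1 \Sigma_1 = A V_1$$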
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$

$$A = U \Sigma V^T \;\Rightarrow\; A V = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = A V$$

    • The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$

$$A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U$$

    • The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix ($A$, $m \times n$; rows $w_1, w_2, \ldots, w_m$, columns $d_1, d_2, \ldots, d_n$)
    • compare two terms ($w_i$, $w_j$) → dot product of two rows of $A$, or an entry in $AA^{T}$
    • compare two docs ($d_k$, $d_s$) → dot product of two columns of $A$, or an entry in $A^{T}A$
    • compare a term and a doc → each individual entry of $A$
  - The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U'\Sigma'$
      $$A'A'^{T} = (U'\Sigma'V'^{T})(U'\Sigma'V'^{T})^{T} = U'\Sigma'V'^{T}V'\Sigma'^{T}U'^{T} = (U'\Sigma')(U'\Sigma')^{T}$$
    • compare two docs → dot product of two rows of $V'\Sigma'$
      $$A'^{T}A' = (U'\Sigma'V'^{T})^{T}(U'\Sigma'V'^{T}) = V'\Sigma'^{T}U'^{T}U'\Sigma'V'^{T} = (V'\Sigma')(V'\Sigma')^{T}$$
    • compare a query word and a doc → each individual entry of $A'$
  - In both products the inner factor reduces to the identity ($V'^{T}V' = U'^{T}U' = I_{k \times k}$), and $\Sigma'$ acts for stretching or shrinking the axes
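The comparisons in the reduced space can be sketched as below. The toy matrix and the variable names (`term_coords`, `doc_coords`) are assumptions for illustration; the point is only that dot products of rows of $U'\Sigma'$ and $V'\Sigma'$ reproduce the entries of $A'A'^{T}$ and $A'^{T}A'$.

```python
import numpy as np

# Toy word-document matrix: 4 words x 3 docs (illustrative values).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

A_k = U_k @ S_k @ V_k.T            # the new word-document matrix A'

term_coords = U_k @ S_k            # one row per term
doc_coords = V_k @ S_k             # one row per doc

term_sims = term_coords @ term_coords.T   # term-term comparisons
doc_sims = doc_coords @ doc_coords.T      # doc-doc comparisons
```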
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new $m \times 1$ query (or doc) vector
      $$\hat{q}_{1 \times k} = \left(q^{T}\right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
    • The query is represented by the weighted sum of its constituent term vectors
    • The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
    • $\hat{q}$ is just like a row of $V$
  - Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
    $$\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^{T}}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
  $$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
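A minimal sketch of the fold-in formulas for a new query/doc vector and a new term vector, together with the cosine measure. The function names and the toy data are assumptions; folding in a column (or row) that is already part of $A$ should recover the corresponding row of $V$ (or $U$).

```python
import numpy as np

def fold_in_doc(q, U_k, s_k):
    """q_hat = q^T U_k Sigma_k^{-1} for an m-dimensional query/doc vector q."""
    return (q @ U_k) / s_k

def fold_in_term(t, V_k, s_k):
    """t_hat = t V_k Sigma_k^{-1} for an n-dimensional term vector t."""
    return (t @ V_k) / s_k

def latent_cosine(q_hat, d_hat, s_k):
    """Cosine between q_hat*Sigma and d_hat*Sigma (both are row vectors)."""
    a, b = q_hat * s_k, d_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy word-document matrix: 4 words x 3 docs (illustrative values).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

doc0_hat = fold_in_doc(A[:, 0], U_k, s_k)      # an existing doc
term0_hat = fold_in_term(A[0, :], V_k, s_k)    # an existing term

query = np.array([1.0, 0.0, 1.0, 0.0])         # hypothetical query using words 0 and 2
q_hat = fold_in_doc(query, U_k, s_k)
scores = [latent_cosine(q_hat, V_k[j], s_k) for j in range(A.shape[1])]
```

Ranking the docs by `scores` gives the retrieval order for the folded-in query.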
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g. bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g. 4 terms and 3 docs (row: term, column: doc)

        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0

  - Each entry is weighted by its TFxIDF score
  - The same matrix in sparse text form

        4 3 6      (no. of terms, no. of docs, no. of nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, ..., 600, etc.) of LSA dimensionality
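The column-by-column sparse layout shown on this slide can be produced from a dense matrix with a few lines of Python. The helper name `to_sparse_text` is hypothetical and the layout simply follows the example above; check the toolkit's own documentation before relying on it.

```python
import numpy as np

def to_sparse_text(A):
    """Serialize A as: 'rows cols nonzeros', then for each column its
    nonzero count followed by one 'row_index value' line per nonzero entry."""
    A = np.asarray(A, dtype=float)
    n_rows, n_cols = A.shape
    lines = [f"{n_rows} {n_cols} {np.count_nonzero(A)}"]
    for j in range(n_cols):
        rows = np.flatnonzero(A[:, j])       # row indices of nonzeros in column j
        lines.append(str(len(rows)))
        for i in rows:
            lines.append(f"{i} {A[i, j]:g}")
    return "\n".join(lines)

term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
text = to_sparse_text(term_doc)
```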
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (first line: no. of indexing terms, no. of docs, no. of nonzero entries)

        51253 2265 218852
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……

• SVD command (IR_svd.bat)

        svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^{T}$ is 1x100)

        100 51253
        0.003 0.001 ……
        0.002 0.002 ……

• LSA100-S (100 eigenvalues)

        100
        2686.18
        829.941
        559.59
        …

• LSA100-Vt (2265 docs; each doc vector $v^{T}$ is 1x100)

        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
ML-65
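LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector ($q$ is TFxIDF weighted beforehand)
  $$\hat{q}_{1 \times k} = \left(q^{T}\right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted
  - $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
  $$\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^{T}}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$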
ML-4
Introduction (3/3)
• Formulation for discriminative feature extraction
  - Model-free (nonparametric)
    • Without prior information, e.g. PCA
    • With prior information, e.g. LDA
  - Model-dependent (parametric)
    • E.g. PLSA (Probabilistic Latent Semantic Analysis) with EM (Expectation-Maximization), MCE (Minimum Classification Error) Training
ML-5
Principal Component Analysis (PCA) (1/2)
• Known as the Karhunen-Loève Transform (1947, 1963)
  - Or the Hotelling Transform (1933)
• A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
• A transform by which the data set can be represented by a reduced number of effective features while still retaining most of the intrinsic information content
  - A small set of features is to be found that represents the data samples accurately
• Also called "Subspace Decomposition" or "Factor Analysis" (Pearson, 1901)
ML-6
Principal Component Analysis (PCA) (2/2)
Figure: the patterns show a significant difference from each other in one of the transformed axes
ML-7
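PCA Derivations (1/13)
• Suppose $\mathbf{x}$ is an $n$-dimensional zero-mean random vector, $\mu_x = E[\mathbf{x}] = \mathbf{0}$
  - If $\mathbf{x}$ is not zero mean, we can subtract the mean before carrying out the following analysis
  - $\mathbf{x}$ can be represented without error by the summation of $n$ linearly independent vectors
    $$\mathbf{x} = \sum_{i=1}^{n} y_i \varphi_i = \Phi\mathbf{y}, \quad \text{where } \mathbf{y} = [\,y_1\ \cdots\ y_i\ \cdots\ y_n\,]^{T},\ \Phi = [\,\varphi_1\ \cdots\ \varphi_i\ \cdots\ \varphi_n\,]$$
    • $\varphi_i$: the basis vectors
    • $y_i$: the $i$-th component in the feature (mapped) space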
ML-8
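PCA Derivations (2/13)
• Example: the same vector expressed in different orthogonal basis sets
  $$\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{5}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{1}{2}\begin{bmatrix} -1 \\ 1 \end{bmatrix}$$
  - The point $(2, 3)$ in the basis $\{(1,0), (0,1)\}$ has the coordinates $(5/2, 1/2)$ in the basis $\{(1,1), (-1,1)\}$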
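The change of basis in this example can be reproduced numerically. The sketch below (variable names are illustrative) obtains the coordinates both by solving $B\mathbf{c} = \mathbf{x}$ and, because the basis is orthogonal, by projecting onto each basis vector.

```python
import numpy as np

x = np.array([2.0, 3.0])

# Columns of B are the basis vectors (1, 1) and (-1, 1).
B = np.array([[1.0, -1.0],
              [1.0,  1.0]])

# General case: solve B c = x for the coordinates c.
coords = np.linalg.solve(B, x)

# Orthogonal basis: each coordinate is a projection, c_i = (x . b_i) / (b_i . b_i).
coords_proj = np.array([x @ B[:, i] / (B[:, i] @ B[:, i]) for i in range(2)])
```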
ML-9
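PCA Derivations (3/13)
  - Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
    $$\varphi_i^{T}\varphi_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$$
    • Such that $y_i$ is equal to the projection of $\mathbf{x}$ on $\varphi_i$
      $$\forall i: \quad y_i = \mathbf{x}^{T}\varphi_i = \varphi_i^{T}\mathbf{x}$$
    • E.g. for $\varphi_1$, with $\theta$ the angle between $\mathbf{x}$ and $\varphi_1$
      $$y_1 = \|\mathbf{x}\|\cos\theta = \|\mathbf{x}\|\frac{\varphi_1^{T}\mathbf{x}}{\|\varphi_1\|\,\|\mathbf{x}\|} = \varphi_1^{T}\mathbf{x}, \quad \text{where } \|\varphi_1\| = 1$$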
ML-10
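PCA Derivations (4/13)
  - Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
    • $y_i$ also has the following properties
      - Its mean is zero too
        $$E[y_i] = E[\varphi_i^{T}\mathbf{x}] = \varphi_i^{T}E[\mathbf{x}] = \varphi_i^{T}\mathbf{0} = 0$$
      - Its variance is
        $$\sigma_i^2 = E[y_i^2] - (E[y_i])^2 = E[y_i^2] = E[\varphi_i^{T}\mathbf{x}\mathbf{x}^{T}\varphi_i] = \varphi_i^{T}E[\mathbf{x}\mathbf{x}^{T}]\varphi_i = \varphi_i^{T}R\,\varphi_i$$
        where $R$ is the (auto-)correlation matrix of $\mathbf{x}$
    • The correlation between two projections $y_i$ and $y_j$ is
      $$E[y_i y_j] = E[(\varphi_i^{T}\mathbf{x})(\varphi_j^{T}\mathbf{x})^{T}] = E[\varphi_i^{T}\mathbf{x}\mathbf{x}^{T}\varphi_j] = \varphi_i^{T}E[\mathbf{x}\mathbf{x}^{T}]\varphi_j = \varphi_i^{T}R\,\varphi_j$$
    • Correlation and covariance matrices (they coincide when $\mu = \mathbf{0}$)
      $$R = E[\mathbf{x}\mathbf{x}^{T}] \approx \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^{T}, \qquad \Sigma = E[(\mathbf{x}-\mu)(\mathbf{x}-\mu)^{T}] \approx \left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^{T}\right) - \mu\mu^{T}$$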
ML-11
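PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  - We want to choose only $m$ of the $\varphi_i$'s such that we can still approximate $\mathbf{x}$ well under the mean-squared error criterion
    • Original vector
      $$\mathbf{x} = \sum_{i=1}^{n} y_i\varphi_i = \sum_{i=1}^{m} y_i\varphi_i + \sum_{j=m+1}^{n} y_j\varphi_j$$
    • Reconstructed vector
      $$\hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i\varphi_i$$
    • Mean-squared error
      $$\varepsilon(m) = E\left[\|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2\right] = E\left[\Big(\sum_{j=m+1}^{n} y_j\varphi_j\Big)^{T}\Big(\sum_{k=m+1}^{n} y_k\varphi_k\Big)\right] = E\left[\sum_{j=m+1}^{n}\sum_{k=m+1}^{n} y_j y_k\,\varphi_j^{T}\varphi_k\right]$$
      $$= \sum_{j=m+1}^{n} E[y_j^2] = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \varphi_j^{T}R\,\varphi_j$$
    • because $\varphi_j^{T}\varphi_k = 1$ if $j = k$ and $0$ if $j \ne k$, and because $E[y_j] = 0$ gives $\sigma_j^2 = E[y_j^2] - (E[y_j])^2 = E[y_j^2]$
  - We should discard the bases where the projections have lower variances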
ML-12
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  - If the orthonormal (basis) set $\varphi_i$'s is selected to be the eigenvectors of the correlation matrix $R$, associated with eigenvalues $\lambda_i$'s
    • They will have the property that
      $$R\,\varphi_j = \lambda_j\varphi_j$$
    • $R$ is real and symmetric, therefore its eigenvectors form an orthonormal set
    • $R$ is positive definite ($\mathbf{x}^{T}R\,\mathbf{x} > 0$) ⇒ all eigenvalues are positive
  - Such that the mean-squared error mentioned above will be
    $$\varepsilon(m) = \sum_{j=m+1}^{n}\sigma_j^2 = \sum_{j=m+1}^{n}\varphi_j^{T}R\,\varphi_j = \sum_{j=m+1}^{n}\varphi_j^{T}\lambda_j\varphi_j = \sum_{j=m+1}^{n}\lambda_j$$
ML-13
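PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  - If the eigenvectors associated with the $m$ largest eigenvalues are retained, the mean-squared error will be
    $$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n}\lambda_j, \quad \text{where } \lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$$
  - Any two projections $y_i$ and $y_j$ ($i \ne j$) will be mutually uncorrelated
    $$E[y_i y_j] = E[(\varphi_i^{T}\mathbf{x})(\varphi_j^{T}\mathbf{x})^{T}] = E[\varphi_i^{T}\mathbf{x}\mathbf{x}^{T}\varphi_j] = \varphi_i^{T}R\,\varphi_j = \lambda_j\varphi_i^{T}\varphi_j = 0$$
    • Good news for most statistical modeling approaches
      - Gaussians and diagonal matrices, i.e. the off-diagonal entries of the full covariance matrix below vanish after the transform
        $$\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
  - Recall
    $$R = E[\mathbf{x}\mathbf{x}^{T}] \approx \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^{T}, \qquad \Sigma = E[(\mathbf{x}-\mu)(\mathbf{x}-\mu)^{T}] \approx \left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^{T}\right) - \mu\mu^{T}$$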
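The two properties stated for the eigenvector basis can be checked on synthetic data. The sketch below assumes zero-mean samples stored as rows of `X` and uses the sample correlation matrix in place of the expectation; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 2000, 5, 2

# Synthetic correlated samples (rows), made zero mean as the derivation assumes.
X = rng.standard_normal((N, n)) @ rng.standard_normal((n, n))
X -= X.mean(axis=0)

R = X.T @ X / N                          # sample estimate of R = E[x x^T]
lam, Phi = np.linalg.eigh(R)             # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, Phi = lam[order], Phi[:, order]     # lambda_1 >= ... >= lambda_n

Y = X @ Phi                              # projections y_i = phi_i^T x for every sample
X_hat = Y[:, :m] @ Phi[:, :m].T          # reconstruction from the first m components

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # sample version of E||x - x_hat||^2
proj_corr = Y.T @ Y / N                           # sample version of E[y_i y_j]
```

Here `mse` matches the sum of the discarded eigenvalues and `proj_corr` is diagonal, i.e. the projections are uncorrelated.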
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
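PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  - It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
  - Define the objective function (the first term is to be minimized; the second term carries the constraints $\varphi_j^{T}\varphi_k = \delta_{jk}$)
    $$J = \sum_{j=m+1}^{n}\varphi_j^{T}R\,\varphi_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n}\mu_{jk}\left(\varphi_j^{T}\varphi_k - \delta_{jk}\right), \quad \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
  - Take the derivative, using $\partial(\varphi^{T}R\,\varphi)/\partial\varphi = 2R\,\varphi$
    $$\forall\, m+1 \le j \le n: \quad \frac{\partial J}{\partial\varphi_j} = 2R\,\varphi_j - 2\sum_{k=m+1}^{n}\mu_{jk}\varphi_k = \mathbf{0}$$
    $$\Rightarrow R\,\varphi_j = \Phi_{n-m}\,\mu_j, \quad \text{where } \Phi_{n-m} = [\,\varphi_{m+1}\ \cdots\ \varphi_n\,],\ \mu_j = [\,\mu_{j,m+1}\ \cdots\ \mu_{j,n}\,]^{T}$$
    $$\Rightarrow R\,[\,\varphi_{m+1}\ \cdots\ \varphi_n\,] = \Phi_{n-m}\,[\,\mu_{m+1}\ \cdots\ \mu_n\,]$$
    $$\Rightarrow R\,\Phi_{n-m} = \Phi_{n-m}\,U_{n-m}, \quad \text{where } U_{n-m} = [\,\mu_{m+1}\ \cdots\ \mu_n\,]$$
  - A particular solution is obtained if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $R$, and $\varphi_{m+1}, \ldots, \varphi_n$ are their corresponding eigenvectors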
ML-16
PCA Derivations (10/13)
• Given an input vector $\mathbf{x}$ with dimension $n$
  - Try to construct a linear transform $\Phi'$ ($\Phi'$ is an $n \times m$ matrix, $m < n$) such that the truncation result $\Phi'^{T}\mathbf{x}$ is optimal under the mean-squared error criterion
    • Encoder: $\mathbf{y} = \Phi'^{T}\mathbf{x}$, where $\Phi' = [\,\mathbf{e}_1\ \mathbf{e}_2\ \cdots\ \mathbf{e}_m\,]$, $\mathbf{x} = [\,x_1\ x_2\ \cdots\ x_n\,]^{T}$, $\mathbf{y} = [\,y_1\ y_2\ \cdots\ y_m\,]^{T}$
    • Decoder: $\hat{\mathbf{x}} = \Phi'\mathbf{y}$, where $\hat{\mathbf{x}} = [\,\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n\,]^{T}$
    • Criterion: minimize $E\left[(\mathbf{x} - \hat{\mathbf{x}})^{T}(\mathbf{x} - \hat{\mathbf{x}})\right]$
ML-17
PCA Derivations (11/13)
• Data compression in communication
  - PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
  - PCA needs no prior information (e.g. class distributions of output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
  - The plot of variance as a function of the number of eigenvectors kept
    • Select $m$ such that
      $$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
      $$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n}\lambda_i$$
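Both selection rules are easy to state in code. The sketch below uses hypothetical helper names and a made-up eigenvalue list; it returns the number $m$ of eigenvectors to keep under each rule.

```python
import numpy as np

def choose_m_by_ratio(eigvals, threshold):
    """Smallest m whose leading eigenvalues explain at least `threshold` of the variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratio = np.cumsum(lam) / lam.sum()
    m = int(np.searchsorted(ratio, threshold)) + 1
    return min(m, lam.size)

def choose_m_by_average(eigvals):
    """Number of eigenvalues that are at least the average eigenvalue."""
    lam = np.asarray(eigvals, dtype=float)
    return int(np.sum(lam >= lam.mean()))

eigvals = [4.0, 3.0, 2.0, 1.0]                   # illustrative values, sum = 10
m_ratio = choose_m_by_ratio(eigvals, 0.8)        # cumulative ratios: 0.4, 0.7, 0.9, 1.0
m_average = choose_m_by_average(eigvals)         # average eigenvalue is 2.5
```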
ML-19
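PCA Derivations (13/13)
• PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal
  $$J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^{T}S_wW + W^{T}S_bW$$
  - $\tilde{S}_w = W^{T}S_wW$ and $\tilde{S}_b = W^{T}S_bW$
  - With sample index $i$ and class index $j$
    $$S = \frac{1}{N}\sum_{i}(\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^{T}, \quad S_w = \frac{1}{N}\sum_{j}N_j\Sigma_j, \quad S_b = \frac{1}{N}\sum_{j}N_j(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^{T}$$
  - Try to show that $S = S_w + S_b$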
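Before proving $S = S_w + S_b$ it can be reassuring to check it numerically. The sketch below assumes samples stored as rows of `X` with integer class labels; the function name `scatter_matrices` is illustrative.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S, S_w, S_b): total, within-class and between-class scatter, each scaled by 1/N."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N, n = X.shape
    mean = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        D = Xc - class_mean
        S_w += D.T @ D / N                        # equals (N_j / N) * Sigma_j
        d = (class_mean - mean)[:, None]
        S_b += Xc.shape[0] * (d @ d.T) / N        # equals (N_j / N) * (xbar_j - xbar)(xbar_j - xbar)^T
    D = X - mean
    S = D.T @ D / N
    return S, S_w, S_b

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 3)) + np.repeat(np.array([[0, 0, 0], [3, 0, 1], [0, 4, 2]]), 20, axis=0)
labels = np.repeat([0, 1, 2], 20)
S, S_w, S_b = scatter_matrices(X, labels)
```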
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
Figure: correlation matrix for the old feature dimensions; the new feature dimensions; the threshold for the information content reserved
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
Figure labels: image size 256 x 256, block size 8 x 8
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
Figure labels: (value reduction), (feature reduction)
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  - Consider an individual image to be a linear combination of a small number of face components or "eigenfaces" derived from a set of reference images
  - Steps
    • Convert each of the $L$ reference images into a vector of floating point numbers representing the light intensity in each pixel
      $$\mathbf{x}_1 = [\,x_{11}\ x_{21}\ \cdots\ x_{n1}\,]^{T}, \quad \mathbf{x}_2 = [\,x_{12}\ x_{22}\ \cdots\ x_{n2}\,]^{T}, \quad \ldots, \quad \mathbf{x}_L = [\,x_{1L}\ x_{2L}\ \cdots\ x_{nL}\,]^{T}$$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  - Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining $K$ ($K \le L$) eigenfaces
      $$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i,1}\mathbf{e}_1 + w_{i,2}\mathbf{e}_2 + \cdots + w_{i,K}\mathbf{e}_K \;\Rightarrow\; \mathbf{y}_i = [\,w_{i,1}\ w_{i,2}\ \cdots\ w_{i,K}\,]^{T}$$
      where $\mathbf{y}_i$ is the feature vector of person $i$
  - The Eigenface approach follows the minimum mean-squared error criterion
  - Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
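The eigenface steps can be sketched on synthetic data as follows. This is an illustration only: random vectors stand in for flattened face images, the function names are hypothetical, and the eigenvectors are obtained from an SVD of the mean-removed data rather than from an explicit covariance matrix.

```python
import numpy as np

def fit_eigenfaces(images, K):
    """images: L x n array (one flattened image per row). Returns (mean_face, K x n eigenfaces)."""
    X = np.asarray(images, dtype=float)
    mean_face = X.mean(axis=0)                       # "eigenface 0"
    # Rows of Vt are eigenvectors of the covariance of the mean-removed images.
    _, _, Vt = np.linalg.svd(X - mean_face, full_matrices=False)
    return mean_face, Vt[:K]

def encode(image, mean_face, eigenfaces):
    """Feature vector y = [w_1 ... w_K]: projections of the mean-removed image."""
    return eigenfaces @ (image - mean_face)

def decode(weights, mean_face, eigenfaces):
    """x_hat = mean face + sum_k w_k e_k."""
    return mean_face + weights @ eigenfaces

rng = np.random.default_rng(0)
faces = rng.random((6, 16))                          # L = 6 synthetic "images" of 16 pixels
mean_face, eigenfaces = fit_eigenfaces(faces, K=5)   # at most L - 1 = 5 useful directions
w = encode(faces[0], mean_face, eigenfaces)
x_hat = decode(w, mean_face, eigenfaces)
```

With all $L - 1$ directions kept, a training image is reconstructed exactly; a smaller $K$ gives the minimum mean-squared error approximation.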
ML-26
PCA Examples: Eigenface (3/4)
Figure: face images as the training set, and the averaged face
ML-27
PCA Examples: Eigenface (4/4)
Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces), and a projected face image
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  - Steps
    • Concatenate the regarded parameters for each speaker $r$ to form a huge vector $a(r)$ (a supervector)
      - SD HMM model mean parameters ($\mu$)
  - Eigenvoice space construction
    • Speaker 1 Data, ..., Speaker R Data, together with the SI HMM model → Model Training → Speaker 1 HMM, ..., Speaker R HMM
    • Supervectors of dimension $D = (M \cdot n) \times 1$ → Principal Component Analysis
  - Each new speaker $S$ is represented by a point $P$ in $K$-space
    $$P = \mathbf{e}(0) + w(1)\,\mathbf{e}(1) + w(2)\,\mathbf{e}(2) + \cdots + w(K)\,\mathbf{e}(K)$$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936): introduced it for two-class classification
    • Rao (1965): extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform $W$ such that the ratio of the average between-class variation over the average within-class variation is maximal
Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space
ML-33
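LDA Derivations (1/4)
• Suppose there are $N$ sample vectors $\mathbf{x}_i$ with dimensionality $n$, each of which belongs to one of the $J$ classes: $g(\mathbf{x}_i) = j$, $j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  - The sample mean is
    $$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$$
  - The class sample means are
    $$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i) = j}\mathbf{x}_i$$
  - The class sample covariances are
    $$\Sigma_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i) = j}(\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^{T}$$
  - The average within-class variation before transform
    $$S_w = \frac{1}{N}\sum_{j}N_j\Sigma_j$$
  - The average between-class variation before transform
    $$S_b = \frac{1}{N}\sum_{j}N_j(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^{T}$$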
ML-34
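LDA Derivations (2/4)
• If the transform $W = [\,\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m\,]$ is applied (mapping $\mathbf{x} \to \mathbf{y}$, $S_w \to \tilde{S}_w$, $S_b \to \tilde{S}_b$)
  - The sample vectors will be
    $$\mathbf{y}_i = W^{T}\mathbf{x}_i$$
  - The sample mean will be
    $$\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N}W^{T}\mathbf{x}_i = W^{T}\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\right) = W^{T}\bar{\mathbf{x}}$$
  - The class sample means will be
    $$\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i) = j}W^{T}\mathbf{x}_i = W^{T}\bar{\mathbf{x}}_j$$
  - The average within-class variation will be
    $$\tilde{S}_w = \frac{1}{N}\sum_{j}N_j\left\{\frac{1}{N_j}\sum_{g(\mathbf{x}_i) = j}\left(W^{T}\mathbf{x}_i - W^{T}\bar{\mathbf{x}}_j\right)\left(W^{T}\mathbf{x}_i - W^{T}\bar{\mathbf{x}}_j\right)^{T}\right\} = W^{T}\left\{\frac{1}{N}\sum_{j}N_j\Sigma_j\right\}W = W^{T}S_wW$$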
ML-35
LDA Derivations (3/4)
• If the transform $W = [\,\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m\,]$ is applied
  - Similarly, the average between-class variation will be
    $$\tilde{S}_b = W^{T}S_bW$$
  - Try to find the optimal $W$ such that the following objective function is maximized
    $$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^{T}S_bW|}{|W^{T}S_wW|}$$
    • A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
      $$S_b\mathbf{w}_i = \lambda_iS_w\mathbf{w}_i$$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1}S_b$
      $$S_w^{-1}S_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$
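The closed-form solution can be sketched with NumPy as below. This is a minimal illustration under assumptions: samples are rows of `X`, $S_w$ is invertible, and the eigenvectors of $S_w^{-1}S_b$ are obtained with the general-purpose `numpy.linalg.eig` (a dedicated generalized symmetric eigensolver would be the more robust choice in practice). The function name `lda_transform` is hypothetical.

```python
import numpy as np

def lda_transform(X, labels, m):
    """Return (eigenvalues, W, S_w, S_b); columns of W are the top-m eigenvectors of S_w^{-1} S_b."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N, n = X.shape
    mean = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        S_w += (Xc - class_mean).T @ (Xc - class_mean) / N
        d = (class_mean - mean)[:, None]
        S_b += Xc.shape[0] * (d @ d.T) / N
    # Eigenvectors of S_w^{-1} S_b; solve() avoids forming the inverse explicitly.
    lam, vecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(lam.real)[::-1][:m]
    return lam.real[order], vecs.real[:, order], S_w, S_b

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [4.0, 1.0, 0.0], [1.0, 5.0, 2.0]])
X = rng.standard_normal((150, 3)) + np.repeat(centers, 50, axis=0)
labels = np.repeat([0, 1, 2], 50)

lam, W, S_w, S_b = lda_transform(X, labels, m=2)   # J = 3 classes give at most 2 useful directions
Y = X @ W                                          # transformed samples, rows y_i^T = x_i^T W
```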
ML-36
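LDA Derivations (4/4)
• Proof
  $$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W}\frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W}\frac{|W^{T}S_bW|}{|W^{T}S_wW|} \quad (|\cdot| \text{ denotes the determinant})$$
  - Or, for each column vector $\mathbf{w}_i$ of $W$, we want to find that the quadratic form has the optimal solution
    $$\lambda_i = \frac{\mathbf{w}_i^{T}S_b\mathbf{w}_i}{\mathbf{w}_i^{T}S_w\mathbf{w}_i}$$
  - Set the derivative to zero, using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^{T}C\mathbf{x})}{d\mathbf{x}} = (C + C^{T})\mathbf{x}$
    $$\frac{\partial}{\partial\mathbf{w}_i}\left(\frac{\mathbf{w}_i^{T}S_b\mathbf{w}_i}{\mathbf{w}_i^{T}S_w\mathbf{w}_i}\right) = \frac{2S_b\mathbf{w}_i\left(\mathbf{w}_i^{T}S_w\mathbf{w}_i\right) - \left(\mathbf{w}_i^{T}S_b\mathbf{w}_i\right)2S_w\mathbf{w}_i}{\left(\mathbf{w}_i^{T}S_w\mathbf{w}_i\right)^2} = \mathbf{0}$$
    $$\Rightarrow \frac{S_b\mathbf{w}_i}{\mathbf{w}_i^{T}S_w\mathbf{w}_i} - \frac{\mathbf{w}_i^{T}S_b\mathbf{w}_i}{\mathbf{w}_i^{T}S_w\mathbf{w}_i}\cdot\frac{S_w\mathbf{w}_i}{\mathbf{w}_i^{T}S_w\mathbf{w}_i} = \mathbf{0}$$
    $$\Rightarrow S_b\mathbf{w}_i - \lambda_iS_w\mathbf{w}_i = \mathbf{0} \;\Rightarrow\; S_b\mathbf{w}_i = \lambda_iS_w\mathbf{w}_i \;\Rightarrow\; S_w^{-1}S_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$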
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
  $\Sigma_x = \frac{1}{N}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), calculated using Year-99's 5471 files]
  $\Sigma'_y = \frac{1}{N}\sum_i (y_i - \bar{y})(y_i - \bar{y})^T$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]
Character Error Rate
         WG      TC
MFCC     22.71   26.32
LDA-1    20.17   23.12
LDA-2    20.11   23.11
ML-39
PCA vs. LDA (1/2)
[Figure: projections obtained by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
– In theory, HDA provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
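HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default');
surf(CoVar);
• Eigen Decomposition
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % between-class matrix
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % within-class matrix
% LDA
IWI = inv(WI);
A = IWI*BE;
% PCA (use instead of the two LDA lines above)
% A = BE+WI;   % why? (Prove it)
[V,D] = eig(A);      % all eigenvectors and eigenvalues
[V,D] = eigs(A,3);   % or: only the 3 largest
fid = fopen('Basis','w');
for i = 1:3          % feature vector length
  for j = 1:3        % basis number
    fprintf(fid,'%10.10f ',V(i,j));
  end
  fprintf(fid,'\n');
end
fclose(fid);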
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after PCA transformation; axes: feature 1 (horizontal) vs. feature 2 (vertical)]
[Figure: distribution of the 2000 original data samples after LDA transformation; axes: feature 1 (horizontal) vs. feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
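LSA (2/7)
• Dimension Reduction and Feature Extraction
– PCA: project the n-dimensional feature-space vector $X$ onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$
  $y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i$
  $\min \|X - \hat{X}\|^2$ for a given k
– SVD (in LSA): approximate the m×n matrix $A$ by $A'$ through a k-dimensional latent semantic space
  $A' = U' \Sigma' V'^T$, with $U'$ (m×r), $\Sigma'$ (r×r, of which the leading k×k block is kept), $V'^T$ (r×n), $r \le \min(m, n)$
  $\min \|A - A'\|_F^2$ for a given k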
ML-47
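LSA (3/7)
– Singular Value Decomposition (SVD) used for the word-document matrix
• A least-squares method for dimension reduction
Projection of a vector $x$:
  $y_1 = \varphi_1^T x = \|\varphi_1\|\,\|x\|\cos\theta = \|x\|\cos\theta$, where $\|\varphi_1\| = 1$
[Figure: vector $x$ projected onto the axes $\varphi_1$ and $\varphi_2$, giving $y_1$ and $y_2$]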
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Diagram: doc terms and query terms are compared by literal term matching; doc expansion and query expansion enrich the respective term sets; the structure models of doc and query are compared by latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $A_{m\times n} = U_{m\times r}\, \Sigma_{r\times r}\, V^T_{r\times n}$  (rows of $A$: words $w_1, \ldots, w_m$; columns: docs $d_1, \ldots, d_n$)
  $A'_{m\times n} = U'_{m\times k}\, \Sigma_{k\times k}\, V'^T_{k\times n}$
– Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r\times r}$, $V^T V = I_{r\times r}$
– $k \le r$, $\|A\|_F^2 \ge \|A'\|_F^2$, $r \le \min(m, n)$
– $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$
– Row $A \subseteq R^n$, Col $A \subseteq R^m$
– Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
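The rank-k truncation and the Frobenius-norm relations can be illustrated with NumPy's `numpy.linalg.svd`. This is a minimal sketch on an assumed random 6×4 matrix standing in for a word-document matrix; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                               # toy m x n word-document matrix

# Thin SVD: A = U diag(s) Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation A'

fro_A = np.linalg.norm(A, "fro") ** 2                # ||A||_F^2
fro_Ak = np.linalg.norm(A_k, "fro") ** 2             # ||A'||_F^2
err = np.linalg.norm(A - A_k, "fro") ** 2            # ||A - A'||_F^2
```

The squared Frobenius norm of `A` equals the sum of all squared singular values, and the squared approximation error equals the sum of the squared singular values that were dropped.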
ML-52
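LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
– $A^T A$ is a symmetric n×n matrix
• All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
• All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n\times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n\times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$ (sigma), with $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
– As the square roots of the eigenvalues of $A^T A$
– As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|,\ \sigma_2 = \|Av_2\|,\ \ldots$
  $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$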
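Both definitions of the singular values can be compared numerically. The sketch below assumes a random 5×3 matrix and uses `numpy.linalg.eigh` on $A^T A$; it is an illustration, not part of the original slides.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((5, 3))

# Eigen-decomposition of the symmetric matrix A^T A (eigh returns ascending order).
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]                     # descending eigenvalues

sigma_from_eig = np.sqrt(np.clip(lam, 0.0, None))    # clip guards tiny negative round-off
sigma_from_len = np.linalg.norm(A @ V, axis=0)       # ||A v_j|| for each column v_j
sigma_svd = np.linalg.svd(A, compute_uv=False)       # reference values

gram = (A @ V).T @ (A @ V)                           # inner products (A v_i) . (A v_j)
```

The off-diagonal entries of `gram` vanish, which is the orthogonality of the vectors $Av_i$ used to build the basis of Col $A$.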
ML-53
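LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
  $Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$
– Suppose that $A$ (or $A^T A$) has rank $r \le n$
  $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0,\quad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
– Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
  $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
  $\Rightarrow [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$
  ($V$: an orthonormal matrix (n×r), known in advance; $U$: also an orthonormal matrix (m×r))
• Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
  $[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n]$
  $\Rightarrow U\Sigma = AV \Rightarrow U\Sigma V^T = AVV^T \Rightarrow A = U\Sigma V^T$  (since $VV^T = I_{n\times n}$)
  where $\Sigma_{m\times n} = \begin{pmatrix} \Sigma_{r\times r} & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}$
• $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$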
ML-54
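LSA Derivations (3/7)
  $A_{m\times n} = U\Sigma V^T = (U_1\ U_2)\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$
  $U\Sigma = AV$
[Diagram: $A$ maps $R^n$ to $R^m$. The $v_i$ in $V_1$ span the row space of $A$; the $u_i$ in $U_1$ span the row space of $A^T$; $V_2$ spans the solutions of $AX = 0$]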
ML-55
LSA Derivations (4/7)
• Additional Explanations
– Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
  $A = U\Sigma V^T \Rightarrow AV = U\Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$
• The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
– Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
  $A = U\Sigma V^T \Rightarrow A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$
• The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
– The original word-document matrix ($A$, m×n; rows $w_1, \ldots, w_m$, columns $d_1, \ldots, d_n$)
• compare two terms → dot product of two rows of $A$ (or an entry in $AA^T$)
• compare two docs → dot product of two columns of $A$ (or an entry in $A^T A$)
• compare a term and a doc → each individual entry of $A$
– The new word-document matrix ($A'$), with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$
• compare two terms → dot product of two rows of $U'\Sigma'$
  $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$
• compare two docs → dot product of two rows of $V'\Sigma'$
  $A'^T A' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
• compare a query word and a doc → each individual entry of $A'$
($\Sigma'$ acts as a stretching or shrinking of the axes; $V'^T V'$ and $U'^T U'$ reduce to the identity)
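The two identities for term-term and doc-doc comparisons can be verified on a small example. The following NumPy sketch uses an assumed random 5×4 matrix; `term_vecs` and `doc_vecs` are illustrative names for the rows of $U'\Sigma'$ and $V'\Sigma'$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((5, 4))                          # toy word-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = Uk @ Sk @ Vk.T                            # A' = U' Sigma' V'^T

term_vecs = Uk @ Sk                             # one k-dimensional row per term
doc_vecs = Vk @ Sk                              # one k-dimensional row per doc

term_sim = term_vecs @ term_vecs.T              # all term-term dot products
doc_sim = doc_vecs @ doc_vecs.T                 # all doc-doc dot products
```

Working with the k-dimensional rows gives the same dot products as the full m×n matrix $A'$, at a lower cost when k is much smaller than m and n.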
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
– For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m×1 query (or doc) vector
  $\hat{q}_{1\times k} = (q^T)_{1\times m}\, U_{m\times k}\, \Sigma^{-1}_{k\times k}$
  (the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted by $\Sigma^{-1}$; $\hat{q}$ is just like a row of $V$)
– Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$  ($\hat{q}$ and $\hat{d}$ are row vectors)
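Fold-in and the cosine measure can be sketched as follows. The helper names `fold_in_query` and `latent_cosine` and the random matrix are assumptions made for illustration. As a consistency check, the sketch folds in a column of the original matrix, which should land on the corresponding row of $V$.

```python
import numpy as np

def fold_in_query(q, U_k, s_k):
    """q_hat = q^T U_k Sigma_k^{-1} for an m-dimensional term vector q."""
    return (q @ U_k) / s_k                      # dividing by s_k applies Sigma^{-1}

def latent_cosine(q_hat, d_hat, s_k):
    """Cosine between q_hat Sigma and d_hat Sigma (both row vectors)."""
    a, b = q_hat * s_k, d_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(3)
A = rng.random((6, 4))                          # toy word-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

q = A[:, 1]                                     # reuse doc 1 as the "query"
q_hat = fold_in_query(q, U_k, s_k)
sim = latent_cosine(q_hat, V_k[1], s_k)         # compare with doc 1 itself
```

Because the query is an existing document, `q_hat` coincides with row 1 of `V_k` and the similarity is 1; a genuinely new query would generally give a smaller value.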
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $\hat{t}_{1\times k} = t_{1\times n}\, V_{n\times k}\, \Sigma^{-1}_{k\times k}$
ML-59
LSA Example
• Experimental results
– HMM is consistently better than VSM at all recall levels
– LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
– A clean formal framework and a clearly defined optimization criterion (least-squares)
• Conceptual simplicity and clarity
– Handles synonymy problems ("heterogeneous vocabulary")
– Good results for high-recall search
• Takes term co-occurrence into account
• Disadvantages
– High computational complexity
– LSA offers only a partial solution to polysemy
• E.g. bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
– E.g. 4 terms and 3 docs (rows: terms; columns: docs)
  2.3 0.0 4.2
  0.0 1.3 2.2
  3.8 0.0 0.5
  0.0 0.0 0.0
– Each entry is weighted by its TF×IDF score
– Sparse text input for this matrix:
  4 3 6      (no. of rows/terms, no. of columns/docs, no. of nonzero entries)
  2          (2 nonzero entries at Col 0)
  0 2.3      (Col 0, Row 0)
  2 3.8      (Col 0, Row 2)
  1          (1 nonzero entry at Col 1)
  1 1.3      (Col 1, Row 1)
  3          (3 nonzero entries at Col 2)
  0 4.2      (Col 2, Row 0)
  1 2.2      (Col 2, Row 1)
  2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, 600, etc.) of LSA dimensionality
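The column-by-column layout of the example can be produced from a dense matrix with a few lines of plain Python. This is a sketch that mirrors the 4×3 example only: the function name `to_sparse_text` is hypothetical, and the exact whitespace and number formatting that the toolkit accepts is an assumption here.

```python
def to_sparse_text(rows):
    """Serialise a dense matrix (list of rows) column by column.

    Layout, following the slide's example: a header "rows cols nonzeros",
    then for each column its nonzero count followed by "row value" lines.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    columns = []
    for c in range(n_cols):
        entries = [(r, rows[r][c]) for r in range(n_rows) if rows[r][c] != 0]
        columns.append(entries)
    nnz = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {nnz}"]
    for col in columns:
        lines.append(str(len(col)))                  # nonzero count of this column
        lines.extend(f"{r} {v}" for r, v in col)     # row index and value
    return "\n".join(lines)

matrix = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(matrix)
print(text)
```

Storing only the nonzero entries matters for realistic term-doc matrices, where nearly all entries are zero.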
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
  51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
  77
  508 7.725771
  596 16.213399
  612 13.080868
  709 7.725771
  713 7.725771
  744 7.725771
  1190 7.725771
  1200 16.213399
  1259 7.725771
  ……
• SVD command (IR_svd.bat)
  svd -r st -o LSA100 -d 100 Term-Doc-Matrix
– -r st: sparse matrix input
– -o LSA100: prefix of output files
– -d 100: no. of reserved eigenvectors
– Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
  100 51253
  0.003 0.001 ……
  0.002 0.002 ……
– word vector ($u^T$): 1×100; 51253 words
• LSA100-S
  100
  2686.18
  829.941
  559.59
  …
– 100 eigenvalues
• LSA100-Vt
  100 2265
  0.021 0.035 ……
  0.012 0.022 ……
– doc vector ($v^T$): 1×100; 2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
  $\hat{q}_{1\times k} = (q^T)_{1\times m}\, U_{m\times k}\, \Sigma^{-1}_{k\times k}$
  (the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted by $\Sigma^{-1}$; $\hat{q}$ is just like a row of $V$)
• Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$
ML-5
Principal Component Analysis (PCA) (1/2)
• Known as Karhunen-Loève Transform (1947, 1963)
– Or Hotelling Transform (1933)
• A standard technique commonly used for data reduction in statistical pattern recognition and signal processing
• A transform by which the data set can be represented by a reduced number of effective features and still retain the most intrinsic information content
– A small set of features to be found to represent the data samples accurately
• Also called "Subspace Decomposition", "Factor Analysis"
(Pearson, 1901)
ML-6
Principal Component Analysis (PCA) (2/2)
The patterns show a significant difference from each other in one of the transformed axes
ML-7
PCA Derivations (1/13)
• Suppose $x$ is an n-dimensional zero-mean random vector, $\mu_x = E[x] = 0$
– If $x$ is not zero mean, we can subtract the mean before carrying out the following analysis
– $x$ can be represented without error by the summation of n linearly independent vectors
  $x = \sum_{i=1}^{n} y_i \varphi_i = \Phi y$, where $y = [y_1\ \cdots\ y_i\ \cdots\ y_n]^T$ and $\Phi = [\varphi_1\ \cdots\ \varphi_i\ \cdots\ \varphi_n]$
  ($\varphi_i$: the basis vectors; $y_i$: the i-th component in the feature (mapped) space)
ML-8
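PCA Derivations (2/13)
The same vector expressed in two orthogonal basis sets:
  $\begin{bmatrix}2\\3\end{bmatrix} = 2\begin{bmatrix}1\\0\end{bmatrix} + 3\begin{bmatrix}0\\1\end{bmatrix} = \frac{5}{2}\begin{bmatrix}1\\1\end{bmatrix} + \frac{1}{2}\begin{bmatrix}-1\\1\end{bmatrix}$
[Figure: the point (2, 3) on the axes (1, 0), (0, 1) has coordinates (5/2, 1/2) on the axes (1, 1), (−1, 1)]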
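The coordinates of a vector in a given basis can be obtained by solving a small linear system. The NumPy sketch below reproduces the slide's example for the vector (2, 3); the variable names are illustrative.

```python
import numpy as np

x = np.array([2.0, 3.0])

# Each basis is stored with its vectors as the columns of a matrix.
standard = np.array([[1.0, 0.0], [0.0, 1.0]]).T      # columns (1, 0) and (0, 1)
rotated = np.array([[1.0, 1.0], [-1.0, 1.0]]).T      # columns (1, 1) and (-1, 1)

# Solve B c = x for the coordinate vector c in each basis.
coords_standard = np.linalg.solve(standard, x)
coords_rotated = np.linalg.solve(rotated, x)
```

Here `coords_rotated` holds the weights 5/2 and 1/2 of the basis vectors (1, 1) and (−1, 1).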
ML-9
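PCA Derivations (3/13)
– Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
  $\varphi_i^T \varphi_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$
• Such that $y_i$ is equal to the projection of $x$ on $\varphi_i$
  $\forall i:\ y_i = \varphi_i^T x = x^T \varphi_i$
  $y_1 = \varphi_1^T x = \|\varphi_1\|\,\|x\|\cos\theta = \|x\|\cos\theta$, where $\|\varphi_1\| = 1$
[Figure: vector $x$ projected onto $\varphi_1$ and $\varphi_2$, giving $y_1$ and $y_2$]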
ML-10
PCA Derivations (4/13)
– Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
• $y_i$ also has the following properties
– Its mean is zero, too
  $E[y_i] = E[\varphi_i^T x] = \varphi_i^T E[x] = \varphi_i^T 0 = 0$
– Its variance is
  $\sigma_i^2 = E[y_i^2] - (E[y_i])^2 = E[y_i^2] = E[\varphi_i^T x x^T \varphi_i] = \varphi_i^T E[xx^T] \varphi_i = \varphi_i^T R \varphi_i$
  where $R$ is the (auto-)correlation matrix of $x$:
  $R = E[xx^T] \approx \frac{1}{N}\sum_i x_i x_i^T$
  $\Sigma = E[(x-\mu)(x-\mu)^T] = E[xx^T] - \mu\mu^T$, which equals $R$ when $\mu = 0$
• The correlation between two projections $y_i$ and $y_j$ is
  $E[y_i y_j] = E[(\varphi_i^T x)(\varphi_j^T x)^T] = E[\varphi_i^T x x^T \varphi_j] = \varphi_i^T E[xx^T] \varphi_j = \varphi_i^T R \varphi_j$
ML-11
PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
– We want to choose only m of the $\varphi_i$'s such that we can still approximate $x$ well under the mean-squared error criterion
  $x = \sum_{i=1}^{n} y_i\varphi_i = \sum_{i=1}^{m} y_i\varphi_i + \sum_{j=m+1}^{n} y_j\varphi_j$  (original vector)
  $\hat{x}(m) = \sum_{i=1}^{m} y_i\varphi_i$  (reconstructed vector)
  $\varepsilon(m) = E[\|x - \hat{x}(m)\|^2] = E\left[\left(\sum_{j=m+1}^{n} y_j\varphi_j^T\right)\left(\sum_{k=m+1}^{n} y_k\varphi_k\right)\right]$
  $= E\left[\sum_{j=m+1}^{n}\sum_{k=m+1}^{n} y_j y_k \varphi_j^T\varphi_k\right] = \sum_{j=m+1}^{n} E[y_j^2] = \sum_{j=m+1}^{n}\sigma_j^2 = \sum_{j=m+1}^{n}\varphi_j^T R \varphi_j$
  using $\varphi_j^T\varphi_k = 1$ if $j = k$, 0 if $j \ne k$, and $E[y_j] = 0 \Rightarrow \sigma_j^2 = E[y_j^2] - (E[y_j])^2 = E[y_j^2]$
– We should discard the bases where the projections have lower variances
ML-12
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
– If the orthonormal (basis) set $\varphi_i$'s is selected to be the eigenvectors of the correlation matrix $R$, associated with eigenvalues $\lambda_i$'s
• They will have the property that
  $R\varphi_j = \lambda_j\varphi_j$
– Such that the mean-squared error mentioned above will be
  $\varepsilon(m) = \sum_{j=m+1}^{n}\sigma_j^2 = \sum_{j=m+1}^{n}\varphi_j^T R\varphi_j = \sum_{j=m+1}^{n}\varphi_j^T\lambda_j\varphi_j = \sum_{j=m+1}^{n}\lambda_j$
– $R$ is real and symmetric, therefore its eigenvectors form an orthonormal set
– $R$ is positive definite ($x^T R x > 0$) ⇒ all eigenvalues are positive
ML-13
PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
– If the eigenvectors are retained associated with the m largest eigenvalues, the mean-squared error will be
  $\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n}\lambda_j$, where $\lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$
– Any two projections $y_i$ and $y_j$ will be mutually uncorrelated
  $E[y_i y_j] = E[(\varphi_i^T x)(\varphi_j^T x)^T] = E[\varphi_i^T xx^T\varphi_j] = \varphi_i^T E[xx^T]\varphi_j = \varphi_i^T R\varphi_j = \lambda_j\varphi_i^T\varphi_j = 0 \quad (i \ne j)$
• Good news for most statistical modeling approaches
– Gaussians and diagonal matrices, in place of a full covariance matrix
  $\begin{bmatrix}\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n}\\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n}\\ \vdots & \vdots & & \vdots\\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}\end{bmatrix}$
  $R = E[xx^T] \approx \frac{1}{N}\sum_i x_i x_i^T$, $\quad \Sigma = E[(x-\mu)(x-\mu)^T] = E[xx^T] - \mu\mu^T$
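The two claims of this part of the derivation (the truncation error equals the sum of the discarded eigenvalues, and the projections are mutually uncorrelated) can be checked on sample data. The following NumPy sketch uses assumed synthetic, zero-mean data and the sample estimate of $R$; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
mixing = np.array([[2.0, 0.0, 0.0],
                   [0.5, 1.0, 0.0],
                   [0.2, 0.3, 0.4]])
X = rng.normal(size=(500, 3)) @ mixing       # correlated toy samples, one per row
X = X - X.mean(axis=0)                       # zero mean, as the derivation assumes
N = X.shape[0]

R = X.T @ X / N                              # sample correlation matrix
lam, Phi = np.linalg.eigh(R)                 # eigh: ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, Phi = lam[order], Phi[:, order]         # descending eigenvalues

m = 2
Y = X @ Phi                                  # each row is y = Phi^T x
X_hat = Y[:, :m] @ Phi[:, :m].T              # reconstruction from the first m components
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
cov_Y = Y.T @ Y / N                          # correlation matrix of the projections
```

For the sample estimates these relations hold exactly up to round-off: `mse` matches `lam[m:].sum()` and `cov_Y` is diagonal with the eigenvalues on its diagonal.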
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
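PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
– It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
  Define the objective function (the term to be minimized, plus the constraints):
  $J = \sum_{j=m+1}^{n}\varphi_j^T R\varphi_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n}\mu_{jk}(\varphi_j^T\varphi_k - \delta_{jk})$, where $\delta_{jk} = 1$ if $j = k$, 0 if $j \ne k$
  Take the derivative (using $\frac{\partial\,\varphi^T R\varphi}{\partial\varphi} = 2R\varphi$):
  $\forall\, m+1 \le j \le n:\ \frac{\partial J}{\partial\varphi_j} = 2R\varphi_j - 2\sum_{k=m+1}^{n}\mu_{jk}\varphi_k = 0$
  $\Rightarrow \forall\, m+1 \le j \le n:\ R\varphi_j = \Phi_{n-m}\,\mu_j$, where $\Phi_{n-m} = [\varphi_{m+1}\ \cdots\ \varphi_n]$ and $\mu_j = [\mu_{j\,m+1}\ \cdots\ \mu_{j\,n}]^T$
  $\Rightarrow R\,[\varphi_{m+1}\ \cdots\ \varphi_n] = \Phi_{n-m}\,[\mu_{m+1}\ \cdots\ \mu_n]$
  $\Rightarrow R\,\Phi_{n-m} = \Phi_{n-m} U_{n-m}$, where $U_{n-m} = [\mu_{m+1}\ \cdots\ \mu_n]$
– A particular solution is obtained if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $R$, and $\varphi_{m+1}, \ldots, \varphi_n$ are their corresponding eigenvectors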
ML-16
PCA Derivations (10/13)
• Given an input vector $x$ with dimension n
– Try to construct a linear transform $\Phi'$ ($\Phi'$ is an n×m matrix, m<n) such that the truncation result $\Phi'^T x$ is optimal under the mean-squared error criterion
  Encoder: $y = \Phi'^T x$, where $\Phi' = [e_1\ e_2\ \cdots\ e_m]$, $x = [x_1\ x_2\ \cdots\ x_n]^T$, $y = [y_1\ y_2\ \cdots\ y_m]^T$
  Decoder: $\hat{x} = \Phi' y$, where $\hat{x} = [\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n]^T$
  minimize $E[(x - \hat{x})^T(x - \hat{x})]$
ML-17
PCA Derivations (11/13)
• Data compression in communication
– PCA is an optimal transform for signal representation and dimensional reduction, but not necessarily for classification tasks such as speech recognition
– PCA needs no prior information (e.g. class distributions of output information) of the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (12/13)
• Scree Graph
  - The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n} \lambda_i$$
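Both selection rules above are simple to state in code. The sketch below assumes the eigenvalues are already available as a list; the function names, the example eigenvalues and the threshold 0.85 are illustrative choices, not values from the slides.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues hold at least `threshold` of the total variance."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(lams)

def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam >= average)

# Illustrative eigenvalues (an assumption, not taken from the slides)
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
m_ratio = choose_m_by_ratio(eigenvalues, 0.85)
m_average = choose_m_by_average(eigenvalues)
print(m_ratio, m_average)
```

The two rules need not agree: the ratio rule depends on the chosen threshold, while the average rule is parameter-free.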
ML-19
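PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
$$J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W$$
$$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W$$
$$S_w = \frac{1}{N}\sum_{j} N_j \Sigma_j \qquad (j:\ \text{class index})$$
$$S_b = \frac{1}{N}\sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
$$S = \frac{1}{N}\sum_{i} (x_i - \bar{x})(x_i - \bar{x})^T \qquad (i:\ \text{sample index})$$
Try to show that $S = S_w + S_b$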
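The identity S = S_w + S_b can be checked numerically before attempting the proof. The following is a small sketch in plain Python; the two toy classes and the helper names (`mean`, `scatter`) are assumptions made for illustration, and a numerical check is of course not a substitute for the derivation.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def scatter(vectors, center):
    """(1/len(vectors)) * sum (x - center)(x - center)^T as a nested list."""
    dim = len(center)
    s = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        diff = [v[d] - center[d] for d in range(dim)]
        for r in range(dim):
            for c in range(dim):
                s[r][c] += diff[r] * diff[c] / len(vectors)
    return s

# Toy labeled data (an assumption, not taken from the slides)
classes = {
    "a": [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],
    "b": [[6.0, 5.0], [7.0, 8.0]],
}
samples = [x for members in classes.values() for x in members]
N, dim = len(samples), 2
x_bar = mean(samples)

S = scatter(samples, x_bar)                       # total variation
S_w = [[0.0] * dim for _ in range(dim)]           # average within-class variation
S_b = [[0.0] * dim for _ in range(dim)]           # average between-class variation
for members in classes.values():
    weight = len(members) / N                     # N_j / N
    x_bar_j = mean(members)
    sigma_j = scatter(members, x_bar_j)           # class sample covariance
    d_j = [x_bar_j[d] - x_bar[d] for d in range(dim)]
    for r in range(dim):
        for c in range(dim):
            S_w[r][c] += weight * sigma_j[r][c]
            S_b[r][c] += weight * d_j[r] * d_j[c]

max_gap = max(abs(S[r][c] - S_w[r][c] - S_b[r][c]) for r in range(dim) for c in range(dim))
print(max_gap)
```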
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for old feature dimensions; new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: image of size 256 × 256, blocks of size 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.) (value reduction) (feature reduction)
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  - Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  - Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity at each pixel
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
$$x_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}, \quad x_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix}, \quad \ldots, \quad x_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$$
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  - Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{x}_i = \bar{x} + w_{i1} e_1 + w_{i2} e_2 + \cdots + w_{iK} e_K \quad\Rightarrow\quad y_i = [w_{i1}\ w_{i2}\ \cdots\ w_{iK}]$$
      ($y_i$: feature vector of a person i)
  - The Eigenface approach adheres to the minimum mean-squared error criterion
  - Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
ML-26
PCA Examples: Eigenface (3/4)
The averaged face
Face images as the training set
ML-27
PCA Examples: Eigenface (4/4)
A projected face image
Seven eigenfaces derived from the training set (indicating directions of variation between faces)
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  - Steps
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 data ... Speaker R data and the SI HMM go through model training to give the Speaker 1 HMM ... Speaker R HMM; the supervectors, of dimension D = (Mn)×1, go through Principal Component Analysis. Label: SI HMM model]
Each new speaker S is represented by a point P in K-space:
$$P_i = e(0) + w_i(1)\,e(1) + w_i(2)\,e(2) + \cdots + w_i(K)\,e(K)$$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  - The sample mean is
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
  - The class sample means are
$$\bar{x}_j = \frac{1}{N_j}\sum_{g(x_i) = j} x_i$$
  - The class sample covariances are
$$\Sigma_j = \frac{1}{N_j}\sum_{g(x_i) = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
  - The average within-class variation before transform
$$S_w = \frac{1}{N}\sum_{j} N_j \Sigma_j$$
  - The average between-class variation before transform
$$S_b = \frac{1}{N}\sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  - The sample vectors will be
$$y_i = W^T x_i$$
  - The sample mean will be
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} W^T x_i = W^T\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = W^T \bar{x}$$
  - The class sample means will be
$$\bar{y}_j = \frac{1}{N_j}\sum_{g(x_i) = j} W^T x_i = W^T \bar{x}_j$$
  - The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N}\sum_{j} N_j \left\{ \frac{1}{N_j}\sum_{g(x_i) = j} \left(W^T x_i - W^T \bar{x}_j\right)\left(W^T x_i - W^T \bar{x}_j\right)^T \right\} = W^T \left\{ \frac{1}{N}\sum_{j} N_j \Sigma_j \right\} W = W^T S_w W$$
[Figure: W maps x to y, $S_w$ to $\tilde{S}_w$ and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  - Similarly, the average between-class variation will be
$$\tilde{S}_b = W^T S_b W$$
  - Try to find the optimal $W$ such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$
    • A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_b w_i = \lambda_i S_w w_i$$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
$$S_w^{-1} S_b w_i = \lambda_i w_i$$
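For two classes, S_b has rank one, so the only eigenvector of S_w⁻¹S_b with a nonzero eigenvalue is proportional to S_w⁻¹(x̄_a − x̄_b). The sketch below checks this on toy two-dimensional data in plain Python; the data, the class names and the helpers (`mean`, `covariance`, `matvec`) are illustrative assumptions rather than anything given in the slides.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(2)]

def covariance(vectors, center):
    s = [[0.0, 0.0], [0.0, 0.0]]
    for v in vectors:
        diff = [v[0] - center[0], v[1] - center[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += diff[r] * diff[c] / len(vectors)
    return s

def matvec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

# Toy two-class data (an assumption, not taken from the slides)
class_a = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class_b = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 7.0]]
samples = class_a + class_b
N = len(samples)
x_bar = mean(samples)

S_w = [[0.0, 0.0], [0.0, 0.0]]
S_b = [[0.0, 0.0], [0.0, 0.0]]
for members in (class_a, class_b):
    weight = len(members) / N                       # N_j / N
    m_j = mean(members)
    cov_j = covariance(members, m_j)
    d_j = [m_j[0] - x_bar[0], m_j[1] - x_bar[1]]
    for r in range(2):
        for c in range(2):
            S_w[r][c] += weight * cov_j[r][c]
            S_b[r][c] += weight * d_j[r] * d_j[c]

# Inverse of the 2x2 matrix S_w (assumes S_w is nonsingular for this data)
det = S_w[0][0] * S_w[1][1] - S_w[0][1] * S_w[1][0]
S_w_inv = [[S_w[1][1] / det, -S_w[0][1] / det],
           [-S_w[1][0] / det, S_w[0][0] / det]]

m_a, m_b = mean(class_a), mean(class_b)
w = matvec(S_w_inv, [m_a[0] - m_b[0], m_a[1] - m_b[1]])   # candidate LDA direction

# Check the eigen-equation S_w^{-1} S_b w = lambda w
Aw = matvec(S_w_inv, matvec(S_b, w))
lam = (w[0] * Aw[0] + w[1] * Aw[1]) / (w[0] * w[0] + w[1] * w[1])
residual = max(abs(Aw[k] - lam * w[k]) for k in range(2))
print(w, lam, residual)
```

For more than two classes the closed-form direction no longer applies and a generalized eigen-solver is needed; the sketch only covers the two-class case.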
ML-36
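LDA Derivations (4/4)
• Proof
$$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \qquad (|\cdot|:\ \text{determinant})$$
Or, for each column vector $w_i$ of $W$, we want to find
$$\hat{w}_i = \arg\max_{w_i} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}, \qquad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$$
The quadratic form has an optimal solution:
$$\frac{\partial}{\partial w_i}\left(\frac{w_i^T S_b w_i}{w_i^T S_w w_i}\right) = 0$$
$$\Rightarrow \frac{2 S_b w_i\,(w_i^T S_w w_i) - 2 S_w w_i\,(w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$$
$$\Rightarrow \frac{S_b w_i}{w_i^T S_w w_i} - \frac{w_i^T S_b w_i}{w_i^T S_w w_i}\cdot\frac{S_w w_i}{w_i^T S_w w_i} = 0$$
$$\Rightarrow S_b w_i - \lambda_i S_w w_i = 0 \quad\Rightarrow\quad S_b w_i = \lambda_i S_w w_i \quad\Rightarrow\quad S_w^{-1} S_b w_i = \lambda_i w_i$$
using
$$\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}, \qquad \frac{d\,(x^T C x)}{dx} = (C + C^T)\,x$$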
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files; after the cosine transform, covariance matrix of the 18-cepstral vectors, calculated using Year-99's 5471 files]
$$\Sigma_x = \frac{1}{N}\sum_{i} (x_i - \bar{x})(x_i - \bar{x})^T, \qquad \Sigma'_y = \frac{1}{N}\sum_{i} (y_i - \bar{y})(y_i - \bar{y})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: after the PCA transform, covariance matrix of the 18-PCA-cepstral vectors; after the LDA transform, covariance matrix of the 18-LDA-cepstral vectors; both calculated using Year-99's 5471 files]
Character Error Rate
          TC      WG
MFCC      26.32   22.71
LDA-1     23.12   20.17
LDA-2     23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: two panels, PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - Clearly, HDA provides a much lower classification error than LDA, theoretically
    • However, most statistical modeling approaches assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
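HW-3 Feature Transformation (2/4)
ML-43
HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar = [3.0 0.5 0.4;
         0.9 6.3 0.2;
         0.4 0.4 4.2];
colormap('default');
surf(CoVar);              % surface plot of the matrix entries

• Eigen Decomposition
BE = [3.0 3.5 1.4;        % between-class matrix
      1.9 6.3 2.2;
      2.4 0.4 4.2];
WI = [4.0 4.1 2.1;        % within-class matrix
      2.9 8.7 3.5;
      4.4 3.2 4.3];

% LDA
IWI = inv(WI);
A = IWI * BE;
% PCA
A = BE + WI;              % why? (Prove it!)

[V, D] = eig(A);          % all eigenvectors/eigenvalues
[V, D] = eigs(A, 3);      % the 3 largest-magnitude eigenpairs

fid = fopen('Basis', 'w');
for i = 1:3               % feature vector length
    for j = 1:3           % basis number
        fprintf(fid, '%10.10f ', V(i, j));
    end
    fprintf(fid, '\n');
end
fclose(fid);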
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original data samples after the LDA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
    • Closely related to Principal Component Analysis (PCA)
ML-46
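LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: project X from the n-dimensional feature space onto an orthonormal basis $\phi_1, \ldots, \phi_k$
$$y_i = \phi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \phi_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$
  - SVD (in LSA): map between the latent semantic spaces of dimension k
$$A_{m \times n} = U_{m \times r}\,\Sigma_{r \times r}\,V^T_{r \times n}, \qquad A'_{m \times n} = U'_{m \times k}\,\Sigma'_{k \times k}\,V'^T_{k \times n}, \qquad r \le \min(m, n)$$
$$\min \|A - A'\|_F^2 \text{ for a given } k$$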
ML-47
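LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector x (onto $\phi_1$, giving $y_1$; likewise $y_2$ for $\phi_2$):
$$y_1 = \|x\| \cos\theta_1 = \|x\|\,\frac{\phi_1^T x}{\|\phi_1\|\,\|x\|} = \phi_1^T x, \qquad \text{where } \|\phi_1\| = 1$$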
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: doc terms and query terms are linked by literal term matching; doc expansion and query expansion act on the terms; the structure models of doc and query are linked by latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
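LSA (7/7)
• Singular Value Decomposition (SVD)
$$A_{m \times n} = U_{m \times r}\,\Sigma_r\,V^T_{r \times n} \qquad (\text{rows of } A:\ w_1, w_2, \ldots, w_m;\ \text{columns of } A:\ d_1, d_2, \ldots, d_n;\ \Sigma_r \text{ is } r \times r)$$
$$A'_{m \times n} = U'_{m \times k}\,\Sigma_k\,V'^T_{k \times n} \qquad (\Sigma_k \text{ is } k \times k)$$
Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
$$k \le r, \qquad \|A\|_F^2 \ge \|A'\|_F^2, \qquad r \le \min(m, n)$$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$
$$\text{Row } A \in R^n, \qquad \text{Col } A \in R^m$$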
ML-52
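LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$)
$$V_{n \times n} = [v_1\ v_2\ \cdots\ v_n], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$
  - As the square roots of the eigenvalues of $A^T A$
  - As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
$$\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \quad\Rightarrow\quad \|Av_i\| = \sigma_i$$
For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A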
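LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
$$Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \qquad (i \ne j)$$
  - Suppose that A (or $A^T A$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
  - Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
$$u_i = \frac{Av_i}{\|Av_i\|} = \frac{1}{\sigma_i} Av_i \quad\Rightarrow\quad \sigma_i u_i = Av_i \quad\Rightarrow\quad [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
$$[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n] \quad\Rightarrow\quad U\Sigma = AV \quad\Rightarrow\quad U\Sigma V^T = AVV^T \quad\Rightarrow\quad A = U\Sigma V^T \qquad (VV^T = I_{n \times n})$$
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$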
ML-54
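LSA Derivations (3/7)
[Figure: A (m×n) maps $R^n$ to $R^m$; V = (V_1 V_2) on the $R^n$ side, U = (U_1 U_2) on the $R^m$ side; labels: the $v_i$'s span the row space of A, the $u_i$'s span the row space of $A^T$, AX = 0]
$$A = U\Sigma V^T = (U_1\ U_2)\begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
$$U\Sigma = AV$$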
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
$$A = U\Sigma V^T \quad\Rightarrow\quad AV = U\Sigma V^T V = U\Sigma \quad\Rightarrow\quad U\Sigma = AV$$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  - Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by $U$
$$A = U\Sigma V^T \quad\Rightarrow\quad A^T U = (U\Sigma V^T)^T U = V\Sigma U^T U = V\Sigma \quad\Rightarrow\quad V\Sigma = A^T U$$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$ (m×n)
    • compare two terms → dot product of two rows of A, or an entry in $AA^T$
    • compare two docs → dot product of two columns of A, or an entry in $A^TA$
    • compare a term and a doc → each individual entry of A
  - The new word-document matrix (A′), with U′ = U (m×k), Σ′ = Σ_k, V′ = V (n×k)
    • compare two terms → dot product of two rows of U′Σ′
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$$
    • compare two docs → dot product of two rows of V′Σ′
$$A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • compare a query word and a doc → each individual entry of A′
  (Σ′ serves for stretching or shrinking; $V'^TV'$ and $U'^TU'$ reduce to the identity matrix)
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
      The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; $\hat{q}$ is just like a row of V
  - Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
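The fold-in and cosine formulas translate directly into a few lines of code. In the sketch below, the matrix `U_k` (orthonormal columns) and the singular values `sigma_k` are hand-picked toy values standing in for the output of an SVD; they and the function names are assumptions for illustration only.

```python
import math

def fold_in(q, U_k, sigma_k):
    """q_hat = q^T U_k Sigma_k^{-1}: map an m-dimensional term vector into the k-dimensional space."""
    k = len(sigma_k)
    return [sum(q[i] * U_k[i][j] for i in range(len(q))) / sigma_k[j] for j in range(k)]

def latent_cosine(q_hat, d_hat, sigma_k):
    """Cosine between q_hat*Sigma and d_hat*Sigma (each dimension weighted by its singular value)."""
    qs = [a * s for a, s in zip(q_hat, sigma_k)]
    ds = [b * s for b, s in zip(d_hat, sigma_k)]
    dot = sum(a * b for a, b in zip(qs, ds))
    return dot / (math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds)))

# Toy values standing in for an SVD result (m = 3 terms, k = 2 dimensions)
U_k = [[0.6, -0.8],
       [0.8, 0.6],
       [0.0, 0.0]]
sigma_k = [2.0, 1.0]

query = [1.0, 0.0, 0.0]        # contains only term 0
doc_same = [3.0, 0.0, 0.0]     # same term, different length
doc_other = [0.0, 1.0, 0.0]    # contains only term 1

q_hat = fold_in(query, U_k, sigma_k)
sim_same = latent_cosine(q_hat, fold_in(doc_same, U_k, sigma_k), sigma_k)
sim_other = latent_cosine(q_hat, fold_in(doc_other, U_k, sigma_k), sigma_k)
print(q_hat, sim_same, sim_other)
```

In a real system the doc vectors would normally be rows of V taken from the decomposition rather than folded in; folding docs in here just keeps the sketch self-contained.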
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g. bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g. 4 terms and 3 docs (rows: terms, columns: docs)
      2.3 0.0 4.2
      0.0 1.3 2.2
      3.8 0.0 0.5
      0.0 0.0 0.0
  - Each entry is weighted by its TF×IDF score
  - The corresponding sparse input file:
      4 3 6      (4 rows, 3 columns, 6 nonzero entries)
      2          (2 nonzero entries at Col 0)
      0 2.3      (Col 0, Row 0)
      2 3.8      (Col 0, Row 2)
      1          (1 nonzero entry at Col 1)
      1 1.3      (Col 1, Row 1)
      3          (3 nonzero entries at Col 2)
      0 4.2      (Col 2, Row 0)
      1 2.2      (Col 2, Row 1)
      2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, 600, etc.) of LSA dimensionality
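The column-by-column layout described above can be produced from a dense matrix with a short helper. The sketch below follows the layout shown on this slide (header line, then for each column its nonzero count followed by "row value" pairs); the function name `to_sparse_text` is a hypothetical helper, not part of SVDLIBC.

```python
def to_sparse_text(dense):
    """Serialize a dense row-major matrix into the column-wise sparse text layout shown above."""
    rows, cols = len(dense), len(dense[0])
    columns = []
    for c in range(cols):
        # Keep (row index, value) pairs for the nonzero entries of column c
        entries = [(r, dense[r][c]) for r in range(rows) if dense[r][c] != 0.0]
        columns.append(entries)
    nonzero = sum(len(entries) for entries in columns)
    lines = [f"{rows} {cols} {nonzero}"]            # header: rows, columns, nonzero entries
    for entries in columns:
        lines.append(str(len(entries)))             # number of nonzero entries in this column
        lines.extend(f"{r} {value}" for r, value in entries)
    return "\n".join(lines) + "\n"

# The 4-term, 3-doc example from the slide
term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
text = to_sparse_text(term_doc)
print(text, end="")
```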
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
      51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
      77
      508 7.725771
      596 16.213399
      612 13.080868
      709 7.725771
      713 7.725771
      744 7.725771
      1190 7.725771
      1200 16.213399
      1259 7.725771
      ......
• SVD command (IR_svd.bat)
      svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
  - Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; word vector (uᵀ): 1×100)
      100 51253
      0.003 0.001 ......
      0.002 0.002 ......
• LSA100-S (100 eigenvalues)
      100
      2686.18
      829.941
      559.59
      ...
• LSA100-Vt (2265 docs; doc vector (vᵀ): 1×100)
      100 2265
      0.021 0.035 ......
      0.012 0.022 ......
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
ML-6
Principal Component Analysis (PCA) (2/2)
The patterns show a significant difference from each other in one of the transformed axes
ML-7
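PCA Derivations (1/13)
• Suppose x is an n-dimensional zero-mean random vector, i.e. $\boldsymbol{\mu}_x = E[\mathbf{x}] = \mathbf{0}$
  – If x is not zero mean, we can subtract the mean before carrying out the following analysis
  – x can be represented without error as the sum of n linearly independent vectors
$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\phi}_i = \boldsymbol{\Phi}\mathbf{y}, \quad \text{where } \mathbf{y} = [y_1 \ \cdots \ y_n]^T, \ \boldsymbol{\Phi} = [\boldsymbol{\phi}_1 \ \cdots \ \boldsymbol{\phi}_n]$$
  – $\boldsymbol{\phi}_i$: the basis vectors; $y_i$: the i-th component in the feature (mapped) space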
ML-8
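PCA Derivations (2/13)
• Example: the point (2, 3) expressed in different orthogonal basis sets
$$\begin{bmatrix}2\\3\end{bmatrix} = 2\begin{bmatrix}1\\0\end{bmatrix} + 3\begin{bmatrix}0\\1\end{bmatrix} = \frac{5}{2}\begin{bmatrix}1\\1\end{bmatrix} + \frac{1}{2}\begin{bmatrix}-1\\1\end{bmatrix} = \frac{5}{\sqrt{2}}\begin{bmatrix}1/\sqrt{2}\\1/\sqrt{2}\end{bmatrix} + \frac{1}{\sqrt{2}}\begin{bmatrix}-1/\sqrt{2}\\1/\sqrt{2}\end{bmatrix}$$
  – In the basis {(1,0), (0,1)} the coordinates are (2, 3); in the rotated orthonormal basis they are (5/√2, 1/√2)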
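The change of basis in this example can be checked numerically. The following is a minimal sketch in plain Python; the helper name `coords_in_orthogonal_basis` is hypothetical and assumes the basis vectors are mutually orthogonal (not necessarily unit length).

```python
from math import sqrt

def coords_in_orthogonal_basis(x, basis):
    """Coefficients c_i such that x = sum_i c_i * b_i, for mutually orthogonal b_i."""
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    # For an orthogonal basis, each coefficient is an independent projection.
    return [dot(x, b) / dot(b, b) for b in basis]

x = (2.0, 3.0)
s = 1.0 / sqrt(2.0)

c_std = coords_in_orthogonal_basis(x, [(1.0, 0.0), (0.0, 1.0)])   # standard basis
c_rot = coords_in_orthogonal_basis(x, [(1.0, 1.0), (-1.0, 1.0)])  # rotated, not normalized
c_unit = coords_in_orthogonal_basis(x, [(s, s), (-s, s)])         # rotated, orthonormal

print(c_std, c_rot, c_unit)
```

With an orthonormal basis the division by `dot(b, b)` is a division by 1, which is why the later slides can write the coefficient simply as an inner product.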
ML-9
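PCA Derivations (3/13)
– Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set
$$\boldsymbol{\phi}_i^T \boldsymbol{\phi}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
• Such that $y_i$ is equal to the projection of x on $\boldsymbol{\phi}_i$
$$\forall i: \quad y_i = \boldsymbol{\phi}_i^T \mathbf{x} = \mathbf{x}^T \boldsymbol{\phi}_i$$
• For example, with $\theta_1$ the angle between x and $\boldsymbol{\phi}_1$
$$y_1 = \|\mathbf{x}\| \cos\theta_1 = \|\mathbf{x}\| \, \frac{\boldsymbol{\phi}_1^T \mathbf{x}}{\|\boldsymbol{\phi}_1\| \, \|\mathbf{x}\|} = \boldsymbol{\phi}_1^T \mathbf{x}, \quad \text{where } \|\boldsymbol{\phi}_1\| = 1$$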
ML-10
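PCA Derivations (4/13)
– Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set
• $y_i$ also has the following properties
  – Its mean is zero too
$$E[y_i] = E[\boldsymbol{\phi}_i^T \mathbf{x}] = \boldsymbol{\phi}_i^T E[\mathbf{x}] = \boldsymbol{\phi}_i^T \mathbf{0} = 0$$
  – Its variance is
$$\sigma_i^2 = E[y_i^2] - (E[y_i])^2 = E[y_i^2] = E[\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_i] = \boldsymbol{\phi}_i^T E[\mathbf{x}\mathbf{x}^T] \boldsymbol{\phi}_i = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_i$$
    where $\mathbf{R}$ is the (auto-)correlation matrix of x
$$\mathbf{R} = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T] = E[\mathbf{x}\mathbf{x}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T$$
• The correlation between two projections $y_i$ and $y_j$ is
$$E[y_i y_j] = E[(\boldsymbol{\phi}_i^T \mathbf{x})(\boldsymbol{\phi}_j^T \mathbf{x})^T] = E[\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_j] = \boldsymbol{\phi}_i^T E[\mathbf{x}\mathbf{x}^T] \boldsymbol{\phi}_j = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_j$$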
ML-11
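PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  – We want to choose only m of the $\boldsymbol{\phi}_i$'s such that we can still approximate x well under the mean-squared error criterion
$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\phi}_i = \sum_{i=1}^{m} y_i \boldsymbol{\phi}_i + \sum_{j=m+1}^{n} y_j \boldsymbol{\phi}_j \quad \text{(original vector)}$$
$$\hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i \boldsymbol{\phi}_i \quad \text{(reconstructed vector)}$$
$$\varepsilon(m) = E\left[\|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2\right] = E\left[\Big(\sum_{j=m+1}^{n} y_j \boldsymbol{\phi}_j\Big)^T \Big(\sum_{k=m+1}^{n} y_k \boldsymbol{\phi}_k\Big)\right] = \sum_{j=m+1}^{n}\sum_{k=m+1}^{n} E[y_j y_k]\, \boldsymbol{\phi}_j^T \boldsymbol{\phi}_k$$
$$= \sum_{j=m+1}^{n} E[y_j^2] = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j$$
  – The double sum collapses because $\boldsymbol{\phi}_j^T \boldsymbol{\phi}_k = 1$ if $j = k$ and 0 otherwise
  – Since $E[y_j] = 0$, $\sigma_j^2 = E[y_j^2] - (E[y_j])^2 = E[y_j^2]$
  – We should discard the bases where the projections have lower variances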
ML-12
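PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  – If the orthonormal (basis) set $\boldsymbol{\phi}_i$'s is selected to be the eigenvectors of the correlation matrix $\mathbf{R}$, associated with eigenvalues $\lambda_i$'s
    • They will have the property that
$$\mathbf{R}\boldsymbol{\phi}_j = \lambda_j \boldsymbol{\phi}_j$$
  – Such that the mean-squared error mentioned above will be
$$\varepsilon(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \lambda_j \boldsymbol{\phi}_j = \sum_{j=m+1}^{n} \lambda_j$$
  – $\mathbf{R}$ is real and symmetric, therefore its eigenvectors form an orthonormal set
  – $\mathbf{R}$ is positive semidefinite ($\mathbf{x}^T \mathbf{R} \mathbf{x} \ge 0$), so all eigenvalues are nonnegative; when $\mathbf{R}$ is positive definite ($\mathbf{x}^T \mathbf{R} \mathbf{x} > 0$ for every nonzero x), all eigenvalues are positive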
ML-13
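PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  – If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be
$$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$$
  – Any two projections $y_i$ and $y_j$ ($i \neq j$) will be mutually uncorrelated
$$E[y_i y_j] = E[(\boldsymbol{\phi}_i^T \mathbf{x})(\boldsymbol{\phi}_j^T \mathbf{x})^T] = E[\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_j] = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_j = \lambda_j \boldsymbol{\phi}_i^T \boldsymbol{\phi}_j = 0$$
• Good news for most statistical modeling approaches
  – Gaussians and diagonal matrices: the full matrix of the original features is replaced by a diagonal one for the projections
$$\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
$$\mathbf{R} = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \boldsymbol{\Sigma} = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T] = E[\mathbf{x}\mathbf{x}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T$$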
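The two claims on this slide (the error equals the sum of the discarded eigenvalues, and the projections are uncorrelated) can be checked on synthetic data. This is a minimal sketch assuming NumPy is available; the data, the sizes `N`, `n`, `m`, and all variable names are illustrative and not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, m = 500, 5, 2                       # samples, input dimension, kept dimension
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, n))
X = X - X.mean(axis=0)                    # subtract the mean so the data are zero mean

R = X.T @ X / N                           # sample estimate of R = E[x x^T]
lam, Phi = np.linalg.eigh(R)              # symmetric solver: orthonormal eigenvectors
order = np.argsort(lam)[::-1]             # sort eigenvalues in descending order
lam, Phi = lam[order], Phi[:, order]

Y = X @ Phi                               # y_i = phi_i^T x for every sample
X_hat = Y[:, :m] @ Phi[:, :m].T           # reconstruction from the m leading components

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # empirical epsilon(m)
discarded = lam[m:].sum()                          # sum of the discarded eigenvalues
cov_y = Y.T @ Y / N                                # correlation matrix of the projections

print(mse, discarded)
```

Because `R` here is the sample estimate, the identity holds exactly (up to rounding) for the same samples; on held-out data it would hold only approximately.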
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
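PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
  – Define the objective function (first term: to be minimized; second term: the orthonormality constraints)
$$J = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n} \mu_{jk}\left(\boldsymbol{\phi}_j^T \boldsymbol{\phi}_k - \delta_{jk}\right), \quad \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases}$$
  – Take the derivative, using $\partial(\boldsymbol{\phi}^T \mathbf{R} \boldsymbol{\phi})/\partial\boldsymbol{\phi} = 2\mathbf{R}\boldsymbol{\phi}$ and symmetric multipliers $\mu_{jk} = \mu_{kj}$
$$\forall\, m+1 \le j \le n: \quad \frac{\partial J}{\partial \boldsymbol{\phi}_j} = 2\mathbf{R}\boldsymbol{\phi}_j - 2\sum_{k=m+1}^{n} \mu_{jk} \boldsymbol{\phi}_k = \mathbf{0}$$
$$\Rightarrow \mathbf{R}\boldsymbol{\phi}_j = \boldsymbol{\Phi}_{n-m}\boldsymbol{\mu}_j, \quad \text{where } \boldsymbol{\Phi}_{n-m} = [\boldsymbol{\phi}_{m+1} \ \cdots \ \boldsymbol{\phi}_n], \ \boldsymbol{\mu}_j = [\mu_{j,m+1} \ \cdots \ \mu_{j,n}]^T$$
$$\Rightarrow \mathbf{R}\boldsymbol{\Phi}_{n-m} = \boldsymbol{\Phi}_{n-m}\mathbf{U}_{n-m}, \quad \text{where } \mathbf{U}_{n-m} = [\boldsymbol{\mu}_{m+1} \ \cdots \ \boldsymbol{\mu}_n]$$
$$\Rightarrow \boldsymbol{\Phi}_{n-m}^T \mathbf{R} \boldsymbol{\Phi}_{n-m} = \mathbf{U}_{n-m}$$
  – There is a particular solution if $\mathbf{U}_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $\mathbf{R}$, and $\boldsymbol{\phi}_{m+1}, \ldots, \boldsymbol{\phi}_n$ are their corresponding eigenvectors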
ML-16
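PCA Derivations (10/13)
• Given an input vector x with dimension n
  – Try to construct a linear transform $\boldsymbol{\Phi}'$ ($\boldsymbol{\Phi}'$ is an n×m matrix, m < n) such that the truncation result $\boldsymbol{\Phi}'^T \mathbf{x}$ is optimal under the mean-squared error criterion
$$\text{Encoder: } \mathbf{y} = \boldsymbol{\Phi}'^T \mathbf{x}, \quad \mathbf{x} = [x_1 \ x_2 \ \cdots \ x_n]^T, \ \mathbf{y} = [y_1 \ y_2 \ \cdots \ y_m]^T$$
$$\text{Decoder: } \hat{\mathbf{x}} = \boldsymbol{\Phi}' \mathbf{y}, \quad \hat{\mathbf{x}} = [\hat{x}_1 \ \hat{x}_2 \ \cdots \ \hat{x}_n]^T$$
$$\text{minimize } E\left[(\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}})\right], \quad \text{where } \boldsymbol{\Phi}' = [\mathbf{e}_1 \ \mathbf{e}_2 \ \cdots \ \mathbf{e}_m]$$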
ML-17
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
  – PCA needs no prior information (e.g. class distributions or output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n} \lambda_i$$
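Both selection rules are easy to state in code. The sketch below uses plain Python; the function names and the example eigenvalues are hypothetical, and the second rule uses a strict "larger than" comparison as worded in the bullet above.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues reach `threshold` of the total variance."""
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    running = 0.0
    for m, value in enumerate(lam, start=1):
        running += value
        if running / total >= threshold:
            return m
    return len(lam)

def choose_m_by_average(eigenvalues):
    """Number of eigenvalues strictly larger than the average eigenvalue."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for value in eigenvalues if value > mean)

lam = [4.0, 3.0, 2.0, 0.5, 0.5]            # illustrative eigenvalues, total 10
m_ratio = choose_m_by_ratio(lam, 0.85)     # cumulative ratios: 0.4, 0.7, 0.9, ...
m_average = choose_m_by_average(lam)       # average eigenvalue is 2.0

print(m_ratio, m_average)
```

The two rules need not agree: the ratio rule depends on a chosen threshold, while the average rule has no free parameter.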
ML-19
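PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
$$J(\mathbf{W}) = \tilde{\mathbf{S}} = \tilde{\mathbf{S}}_w + \tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_w \mathbf{W} + \mathbf{W}^T \mathbf{S}_b \mathbf{W}, \quad \tilde{\mathbf{S}}_w = \mathbf{W}^T \mathbf{S}_w \mathbf{W}, \ \tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
$$\mathbf{S}_w = \frac{1}{N}\sum_j N_j \boldsymbol{\Sigma}_j, \qquad \mathbf{S}_b = \frac{1}{N}\sum_j N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T, \qquad \mathbf{S} = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
  (i: sample index, j: class index)
• Try to show that $\mathbf{S} = \mathbf{S}_w + \mathbf{S}_b$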
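One way to approach the exercise is to split each deviation from the global mean into a within-class part and a between-class part. The following is a sketch of that argument, writing $g(\mathbf{x}_i) = j$ for "sample i belongs to class j" and $\boldsymbol{\Sigma}_j$ for the class sample covariance normalized by $N_j$.

```latex
\begin{aligned}
\mathbf{S}
&= \frac{1}{N}\sum_{j}\sum_{g(\mathbf{x}_i)=j}
   \big[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\big]
   \big[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\big]^T \\
&= \frac{1}{N}\sum_{j}\Big[\sum_{g(\mathbf{x}_i)=j}
   (\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T
   + N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T\Big] \\
&= \frac{1}{N}\sum_{j} N_j\boldsymbol{\Sigma}_j
   + \frac{1}{N}\sum_{j} N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T
 = \mathbf{S}_w + \mathbf{S}_b
\end{aligned}
```

The cross terms vanish in the second line because $\sum_{g(\mathbf{x}_i)=j} (\mathbf{x}_i - \bar{\mathbf{x}}_j) = \mathbf{0}$ within every class.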
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure labels: correlation matrix for old feature dimensions; new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: image coding example; dimensions shown are 256 × 256 and 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure panels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components or "eigenfaces" derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating point numbers representing light intensity in each pixel
$$\mathbf{x}_1 = [x_{11} \ x_{21} \ \cdots \ x_{n1}]^T, \quad \mathbf{x}_2 = [x_{12} \ x_{22} \ \cdots \ x_{n2}]^T, \quad \ldots, \quad \mathbf{x}_L = [x_{1L} \ x_{2L} \ \cdots \ x_{nL}]^T$$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i,1}\mathbf{e}_1 + w_{i,2}\mathbf{e}_2 + \cdots + w_{i,K}\mathbf{e}_K \ \Rightarrow \ \mathbf{y}_i = [w_{i,1} \ w_{i,2} \ \cdots \ w_{i,K}]^T \quad \text{(feature vector of a person } i\text{)}$$
  – The Eigenface approach follows the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set; the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces); a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
    • Eigenvoice space construction
[Diagram labels: Speaker 1 Data … Speaker R Data; SI HMM; Model Training; Speaker 1 HMM … Speaker R HMM; D = (M·n) × 1; Principal Component Analysis; SI HMM model]
    • Each new speaker S is represented by a point P in K-space
$$\mathbf{P} = \mathbf{e}(0) + w(1)\,\mathbf{e}(1) + w(2)\,\mathbf{e}(2) + \cdots + w(K)\,\mathbf{e}(K)$$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that Eigenface operates on the feature space while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
[Figure note: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
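LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes
$$g(\mathbf{x}_i) = j, \quad j \in \{1, 2, \ldots, J\}, \quad g(\cdot) \text{ is the class index}$$
  – The sample mean is
$$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$$
  – The class sample means are
$$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \mathbf{x}_i$$
  – The class sample covariances are
$$\boldsymbol{\Sigma}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$
  – The average within-class variation before transform
$$\mathbf{S}_w = \frac{1}{N}\sum_j N_j \boldsymbol{\Sigma}_j$$
  – The average between-class variation before transform
$$\mathbf{S}_b = \frac{1}{N}\sum_j N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$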
ML-34
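LDA Derivations (2/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1 \ \mathbf{w}_2 \ \cdots \ \mathbf{w}_m]$ is applied (mapping x to y, $\mathbf{S}_w$ to $\tilde{\mathbf{S}}_w$ and $\mathbf{S}_b$ to $\tilde{\mathbf{S}}_b$)
  – The sample vectors will be
$$\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$$
  – The sample mean will be
$$\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \Big(\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\Big) = \mathbf{W}^T \bar{\mathbf{x}}$$
  – The class sample means will be
$$\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \bar{\mathbf{x}}_j$$
  – The average within-class variation will be
$$\tilde{\mathbf{S}}_w = \frac{1}{N}\sum_j N_j \Big\{ \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} (\mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j)(\mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j)^T \Big\} = \mathbf{W}^T \Big\{ \frac{1}{N}\sum_j N_j \boldsymbol{\Sigma}_j \Big\} \mathbf{W} = \mathbf{W}^T \mathbf{S}_w \mathbf{W}$$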
ML-35
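LDA Derivations (3/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1 \ \mathbf{w}_2 \ \cdots \ \mathbf{w}_m]$ is applied
  – Similarly, the average between-class variation will be
$$\tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
  – Try to find the optimal W such that the following objective function is maximized
$$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|}$$
    • A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
$$\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i$$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$
$$\mathbf{S}_w^{-1}\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$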
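The closed-form solution can be sketched numerically as follows, assuming NumPy is available. The function name `lda_directions`, the synthetic two-class data, and the choice of `np.linalg.eig` on $\mathbf{S}_w^{-1}\mathbf{S}_b$ are illustrative assumptions rather than the course's reference implementation.

```python
import numpy as np

def lda_directions(X, labels, m):
    """Return the m leading solutions of S_b w = lambda S_w w, plus S_w and S_b."""
    N, n = X.shape
    mean = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj = Xj.shape[0]
        mean_j = Xj.mean(axis=0)
        centered = Xj - mean_j
        S_w += centered.T @ centered / N            # equals (N_j / N) * Sigma_j
        diff = (mean_j - mean)[:, None]
        S_b += (Nj / N) * (diff @ diff.T)
    # Eigenvectors of S_w^{-1} S_b, computed without forming the inverse explicitly.
    lam, W = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(lam.real)[::-1][:m]
    return lam.real[order], W.real[:, order], S_w, S_b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=[0.0, 0.0, 0.0], size=(100, 3)),
               rng.normal(loc=[3.0, 1.0, 0.0], size=(100, 3))])
labels = np.array([0] * 100 + [1] * 100)

lam, W, S_w, S_b = lda_directions(X, labels, m=1)
Y = X @ W                                           # y_i = W^T x_i for every sample
print(lam, W[:, 0])
```

With J classes, $\mathbf{S}_b$ has rank at most J − 1, so at most J − 1 eigenvalues are nonzero; in this two-class sketch only one direction is informative.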
ML-36
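LDA Derivations (4/4)
• Proof
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|} \quad (|\cdot| \text{: determinant})$$
  – Or, for each column vector $\mathbf{w}_i$ of W, we want to find that
$$\lambda_i = \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} \text{ is maximal}$$
  – The quadratic form has an optimal solution where the derivative vanishes, using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^T \mathbf{C} \mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}$
$$\frac{\partial}{\partial \mathbf{w}_i}\left(\frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}\right) = \frac{2\mathbf{S}_b \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - 2\mathbf{S}_w \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i)}{(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i)^2} = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - \mathbf{S}_w \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i) = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i - \mathbf{S}_w \mathbf{w}_i \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i - \lambda_i \mathbf{S}_w \mathbf{w}_i = \mathbf{0} \ \Rightarrow \ \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i \ \Rightarrow \ \mathbf{S}_w^{-1}\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$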
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}_x = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
[Figure: after the cosine transform, covariance matrix of the 18-cepstral vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}'_y = \frac{1}{N}\sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: after the PCA transform, covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99's 5471 files]
[Figure: after the LDA transform, covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99's 5471 files]
Character Error Rate
          TC       WG
MFCC      26.32    22.71
LDA-1     23.12    20.17
LDA-2     23.11    20.11
ML-39
PCA vs. LDA (1/2)
[Figure: projections found by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
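HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4;        % between-class matrix
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;        % within-class matrix
          2.9 8.7 3.5;
          4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI * BE;
    % PCA (use instead of the LDA definition of A)
    A = BE + WI;              % why? (Prove it)
    [V, D] = eig(A);          % all eigenvectors and eigenvalues
    [V, D] = eigs(A, 3);      % or only the 3 largest
    fid = fopen('Basis', 'w');
    for i = 1:3               % feature vector length
        for j = 1:3           % basis number
            fprintf(fid, '%10.10f ', V(i, j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);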
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform; axes: feature 1 vs. feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transform; axes: feature 1 vs. feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
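LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: project x from the feature space onto an orthonormal basis $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_k$
$$y_i = \boldsymbol{\phi}_i^T \mathbf{x}, \qquad \hat{\mathbf{x}} = \sum_{i=1}^{k} y_i \boldsymbol{\phi}_i, \qquad \min \|\mathbf{x} - \hat{\mathbf{x}}\|^2 \text{ for a given } k$$
  – SVD (in LSA): map A into the latent semantic space by keeping the leading k×k block of Σ'
$$A_{m \times n} \approx A'_{m \times n} = U'_{m \times r}\, \Sigma'_{r \times r}\, V'^T_{r \times n}, \quad r \le \min(m, n), \qquad \min \|A - A'\|_F^2 \text{ for a given } k$$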
ML-47
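LSA (3/7)
– Singular Value Decomposition (SVD) used for the word-document matrix
  • A least-squares method for dimension reduction
• Projection of a vector x
$$y_1 = \|\mathbf{x}\| \cos\theta_1 = \|\mathbf{x}\| \, \frac{\boldsymbol{\phi}_1^T \mathbf{x}}{\|\boldsymbol{\phi}_1\| \, \|\mathbf{x}\|} = \boldsymbol{\phi}_1^T \mathbf{x}, \quad \text{where } \|\boldsymbol{\phi}_1\| = 1$$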
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Diagram labels: Doc, Query, terms, doc expansion, query expansion, literal term matching, structure model, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD), with rows of A indexed by words $w_1, \ldots, w_m$ and columns by docs $d_1, \ldots, d_n$
$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}$$
$$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}$$
  – Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $r \le \min(m, n)$, $k \le r$, and $\|A\|_F^2 \ge \|A'\|_F^2$, where
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$
  – Row A ∈ $R^n$, Col A ∈ $R^m$
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
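The rank-k truncation and the Frobenius-norm relations can be illustrated on a small matrix. This is a minimal sketch assuming NumPy is available; the 4 × 3 matrix and `k = 2` are arbitrary illustrative choices.

```python
import numpy as np

# Toy 4-term x 3-doc matrix (illustrative values only).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2

# Thin SVD: U is m x r, s holds the singular values, Vt is r x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep the k largest singular values and their vectors.
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

frob_A = np.sum(A ** 2)                 # ||A||_F^2
frob_Ak = np.sum(A_k ** 2)              # ||A'||_F^2
frob_err = np.sum((A - A_k) ** 2)       # ||A - A'||_F^2

print(frob_A, frob_Ak, frob_err)
```

The squared Frobenius norm equals the sum of squared singular values, so dropping the smallest ones removes exactly their squared sum from the total.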
ML-52
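LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$)
$$V_{n \times n} = [v_1 \ v_2 \ \cdots \ v_n], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$$
• Define singular values $\sigma_j = \sqrt{\lambda_j}, \ j = 1, \ldots, n$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$
$$\sigma_1 = \|Av_1\|, \quad \sigma_2 = \|Av_2\|, \quad \ldots$$
$$\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \ \Rightarrow \ \|Av_i\| = \sigma_i$$
  – For $\lambda_i \neq 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A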
ML-53
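LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
$$(Av_i) \cdot (Av_j) = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \neq j)$$
  – Suppose that A (or $A^T A$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
$$u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \ \Rightarrow \ \sigma_i u_i = Av_i \ \Rightarrow \ [u_1 \ u_2 \ \cdots \ u_r]\, \Sigma_r = A\, [v_1 \ v_2 \ \cdots \ v_r]$$
    Here V is known in advance and is an orthonormal matrix (n×r); U is also an orthonormal matrix (m×r)
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
$$[u_1 \ \cdots \ u_r \ \cdots \ u_m]\, \Sigma = A\, [v_1 \ \cdots \ v_r \ \cdots \ v_n] \ \Rightarrow \ U\Sigma = AV \ \Rightarrow \ U\Sigma V^T = AVV^T \ \Rightarrow \ A = U\Sigma V^T \quad (VV^T = I_{n \times n})$$
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$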
ML-54
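LSA Derivations (3/7)
[Figure: A (m×n) maps $R^n$ to $R^m$; the $v_i$ (columns of $V_1$) span the row space of A, the $u_i$ (columns of $U_1$) span the row space of $A^T$, and the $V_2$ part corresponds to AX = 0]
$$A = U\Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T, \qquad AV = U\Sigma$$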
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of U is related to the projection of a corresponding row of A onto the basis formed by columns of V
$$A = U\Sigma V^T \ \Rightarrow \ AV = U\Sigma V^T V = U\Sigma \ \Rightarrow \ U\Sigma = AV$$
    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  – Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by columns of U
$$A = U\Sigma V^T \ \Rightarrow \ A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \ \Rightarrow \ V\Sigma = A^T U$$
    • The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A, m×n; rows are words $w_1, \ldots, w_m$, columns are docs $d_1, \ldots, d_n$)
    • compare two terms → dot product of two rows of A, or an entry in $AA^T$
    • compare two docs → dot product of two columns of A, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of U'Σ'
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
    • compare two docs → dot product of two rows of V'Σ'
$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • compare a query word and a doc → each individual entry of A'
  – In both products the inner factor ($V'^T V'$ or $U'^T U'$) is an identity matrix, and Σ' acts as a stretching or shrinking of the axes
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
      – The query is represented by the weighted sum of its constituent term vectors
      – The separate dimensions are differentially weighted
      – The result is just like a row of V
  – Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\| \, \|\hat{d}\Sigma\|}$$
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
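The fold-in formulas and the cosine measure can be sketched as follows, assuming NumPy is available. The helper names `fold_in` and `latent_cosine`, the toy matrix, and the query vector are hypothetical; a useful sanity check is that folding in an existing doc (or term) reproduces its row of V (or U).

```python
import numpy as np

def fold_in(vec, basis_k, sigma_k):
    """Map a raw vector into the k-dim latent space: vec^T * basis_k * Sigma_k^{-1}."""
    return (vec @ basis_k) / sigma_k

def latent_cosine(q_hat, d_hat, sigma_k):
    """Cosine between q_hat * Sigma and d_hat * Sigma (both row vectors)."""
    q, d = q_hat * sigma_k, d_hat * sigma_k
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

# Toy 4-term x 3-doc matrix (illustrative values only).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

d0_hat = fold_in(A[:, 0], U_k, s_k)        # existing doc 0, folded in as an m x 1 vector
t1_hat = fold_in(A[1, :], V_k, s_k)        # existing term 1, folded in as a 1 x n vector

query = np.array([1.0, 0.0, 1.0, 0.0])     # hypothetical query containing terms 0 and 2
q_hat = fold_in(query, U_k, s_k)
sims = [latent_cosine(q_hat, V_k[j], s_k) for j in range(V_k.shape[0])]
print(sims)
```

Dividing by `sigma_k` in `fold_in` and multiplying by it again in `latent_cosine` mirrors the slide: the folded-in vector is comparable to a row of V, and the similarity is computed after re-weighting the axes by Σ.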
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g. bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g. 4 terms and 3 docs (Row: Term, Col: Doc)
        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0
  – Each entry is weighted by its TFxIDF score
  – The matrix is stored column by column in sparse form
        4 3 6      (4 rows, 3 columns, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, 600, etc.) of LSA dimensionality
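The column-wise sparse layout shown on this slide can be produced and read back with a few lines of code. The sketch below is plain Python; `to_sparse_text` and `from_sparse_text` are hypothetical helper names that follow only the layout of the example above (header, then per column a count followed by "row value" pairs), not an official reader or writer for the toolkit.

```python
def to_sparse_text(dense):
    """Serialize a dense row-major matrix into the column-wise sparse text layout."""
    n_rows, n_cols = len(dense), len(dense[0])
    columns = [[(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0.0]
               for c in range(n_cols)]
    nnz = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {nnz}"]          # header: rows, columns, nonzeros
    for col in columns:
        lines.append(str(len(col)))               # number of nonzeros in this column
        lines.extend(f"{r} {v}" for r, v in col)  # one "row value" pair per line
    return "\n".join(lines) + "\n"

def from_sparse_text(text):
    """Parse the sparse text layout back into a dense row-major matrix."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    pos = 3                                       # skip the three header fields
    for c in range(n_cols):
        count = int(tokens[pos])
        pos += 1
        for _ in range(count):
            dense[int(tokens[pos])][c] = float(tokens[pos + 1])
            pos += 2
    return dense

term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
text = to_sparse_text(term_doc)
print(text)
```

A round trip through both helpers is a quick way to confirm that a hand-written matrix file has consistent counts before passing it to the SVD tool.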
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (the header gives the number of indexing terms, the number of docs, and the number of nonzero entries)
51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: number of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
  – word vector (u^T): 1×100; 51253 words
• LSA100-S
    10026861882994155959…
  – 100 eigenvalues
• LSA100-Vt
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
  – doc vector (v^T): 1×100; 2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – The result is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\| \, \|\hat{d}\Sigma\|}$$
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-7
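PCA Derivations (1/13)
bull Suppose $\mathbf{x}$ is an n-dimensional zero-mean random vector, $\boldsymbol{\mu}_x = E[\mathbf{x}] = \mathbf{0}$
ndash If $\mathbf{x}$ is not zero mean, we can subtract the mean before carrying out the following analysis
ndash $\mathbf{x}$ can be represented without error by the summation of n linearly independent vectors
$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\varphi}_i = \boldsymbol{\Phi}\mathbf{y}, \quad \text{where } \mathbf{y} = [y_1 \; \cdots \; y_n]^T, \quad \boldsymbol{\Phi} = [\boldsymbol{\varphi}_1 \; \cdots \; \boldsymbol{\varphi}_n]$$
ndash The $\boldsymbol{\varphi}_i$ are the basis vectors, and $y_i$ is the i-th component in the feature (mapped) space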
ML-8
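PCA Derivations (2/13)
bull Example: the same vector expressed in two orthogonal basis sets. The point (2, 3) in the basis {(1, 0), (0, 1)} has coordinates (5/2, 1/2)' in the basis {(1, 1), (-1, 1)}
$$\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{5}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{1}{2}\begin{bmatrix} -1 \\ 1 \end{bmatrix}$$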
ML-9
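PCA Derivations (3/13)
ndash Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set
$$\boldsymbol{\varphi}_i^T \boldsymbol{\varphi}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$$
bull Such that $y_i$ is equal to the projection of $\mathbf{x}$ on $\boldsymbol{\varphi}_i$
$$\forall i: \quad y_i = \boldsymbol{\varphi}_i^T \mathbf{x} = \mathbf{x}^T \boldsymbol{\varphi}_i$$
bull In two dimensions, with $\theta$ the angle between $\mathbf{x}$ and $\boldsymbol{\varphi}_1$
$$y_1 = \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \frac{\boldsymbol{\varphi}_1^T \mathbf{x}}{\|\boldsymbol{\varphi}_1\| \, \|\mathbf{x}\|} = \boldsymbol{\varphi}_1^T \mathbf{x}, \quad \text{where } \|\boldsymbol{\varphi}_1\| = 1$$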
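The projection onto an orthonormal basis can be checked numerically. The following is a minimal sketch, assuming a two-dimensional vector (2, 3) and the basis {(1, 1), (-1, 1)} normalized to unit length; the helper names `project` and `reconstruct` are illustrative only.

```python
import math

def project(x, basis):
    """Return the coefficients y_i = phi_i^T x for an orthonormal basis."""
    return [sum(p * v for p, v in zip(phi, x)) for phi in basis]

def reconstruct(y, basis):
    """Return the vector sum_i y_i * phi_i."""
    n = len(basis[0])
    return [sum(y_i * phi[d] for y_i, phi in zip(y, basis)) for d in range(n)]

s = 1.0 / math.sqrt(2.0)
basis = [(s, s), (-s, s)]   # unit length and mutually orthogonal
x = (2.0, 3.0)

y = project(x, basis)           # coefficients in the rotated basis
x_back = reconstruct(y, basis)  # summing y_i * phi_i recovers x
print(y, x_back)
```

Because the basis is orthonormal, no matrix inversion is needed: the coefficients are plain inner products, and using all n of them reproduces x without error.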
ML-10
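PCA Derivations (4/13)
ndash Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set
bull $y_i$ also has the following properties
ndash Its mean is zero too
$$E[y_i] = E[\boldsymbol{\varphi}_i^T \mathbf{x}] = \boldsymbol{\varphi}_i^T E[\mathbf{x}] = \boldsymbol{\varphi}_i^T \mathbf{0} = 0$$
ndash Its variance is
$$\sigma_i^2 = E[y_i^2] - (E[y_i])^2 = E[y_i^2] = E[\boldsymbol{\varphi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\varphi}_i] = \boldsymbol{\varphi}_i^T E[\mathbf{x}\mathbf{x}^T] \boldsymbol{\varphi}_i = \boldsymbol{\varphi}_i^T \mathbf{R} \boldsymbol{\varphi}_i$$
where $\mathbf{R}$ is the (auto-)correlation matrix of $\mathbf{x}$
$$\mathbf{R} = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] \approx \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \right) - \boldsymbol{\mu}\boldsymbol{\mu}^T$$
bull The correlation between two projections $y_i$ and $y_j$ is
$$E[y_i y_j] = E[(\boldsymbol{\varphi}_i^T \mathbf{x})(\boldsymbol{\varphi}_j^T \mathbf{x})^T] = E[\boldsymbol{\varphi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\varphi}_j] = \boldsymbol{\varphi}_i^T E[\mathbf{x}\mathbf{x}^T] \boldsymbol{\varphi}_j = \boldsymbol{\varphi}_i^T \mathbf{R} \boldsymbol{\varphi}_j$$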
ML-11
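PCA Derivations (5/13)
bull Minimum Mean-Squared Error Criterion
ndash We want to choose only m of the $\boldsymbol{\varphi}_i$'s such that we can still approximate $\mathbf{x}$ well under the mean-squared error criterion
$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\varphi}_i = \sum_{i=1}^{m} y_i \boldsymbol{\varphi}_i + \sum_{j=m+1}^{n} y_j \boldsymbol{\varphi}_j \quad \text{(original vector)}$$
$$\hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i \boldsymbol{\varphi}_i \quad \text{(reconstructed vector)}$$
$$\bar{\varepsilon}(m) = E\left[ \|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2 \right] = E\left[ \left( \sum_{j=m+1}^{n} y_j \boldsymbol{\varphi}_j^T \right) \left( \sum_{k=m+1}^{n} y_k \boldsymbol{\varphi}_k \right) \right] = E\left[ \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} y_j y_k \boldsymbol{\varphi}_j^T \boldsymbol{\varphi}_k \right]$$
$$= \sum_{j=m+1}^{n} E[y_j^2] = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T \mathbf{R} \boldsymbol{\varphi}_j$$
because $\boldsymbol{\varphi}_j^T \boldsymbol{\varphi}_k = 1$ if $j = k$ and $0$ if $j \ne k$, and because $E[y_j] = 0$ gives $\sigma_j^2 = E[y_j^2] - (E[y_j])^2 = E[y_j^2]$
bull We should discard the bases where the projections have lower variances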
ML-12
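PCA Derivations (6/13)
bull Minimum Mean-Squared Error Criterion
ndash If the orthonormal (basis) set $\boldsymbol{\varphi}_i$'s is selected to be the eigenvectors of the correlation matrix $\mathbf{R}$, associated with eigenvalues $\lambda_i$'s
bull They will have the property that
$$\mathbf{R} \boldsymbol{\varphi}_j = \lambda_j \boldsymbol{\varphi}_j$$
ndash Such that the mean-squared error mentioned above will be
$$\bar{\varepsilon}(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T \mathbf{R} \boldsymbol{\varphi}_j = \sum_{j=m+1}^{n} \lambda_j \boldsymbol{\varphi}_j^T \boldsymbol{\varphi}_j = \sum_{j=m+1}^{n} \lambda_j$$
bull $\mathbf{R}$ is real and symmetric, therefore its eigenvectors can be chosen to form an orthonormal set
bull $\mathbf{R}$ is positive definite ($\mathbf{x}^T \mathbf{R} \mathbf{x} > 0$ for all $\mathbf{x} \ne \mathbf{0}$) => all eigenvalues are positive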
ML-13
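PCA Derivations (7/13)
bull Minimum Mean-Squared Error Criterion
ndash If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be
$$\bar{\varepsilon}_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$$
ndash Any two projections $y_i$ and $y_j$ ($i \ne j$) will be mutually uncorrelated
$$E[y_i y_j] = E[(\boldsymbol{\varphi}_i^T \mathbf{x})(\boldsymbol{\varphi}_j^T \mathbf{x})^T] = E[\boldsymbol{\varphi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\varphi}_j] = \boldsymbol{\varphi}_i^T \mathbf{R} \boldsymbol{\varphi}_j = \lambda_j \boldsymbol{\varphi}_i^T \boldsymbol{\varphi}_j = 0$$
bull Good news for most statistical modeling approaches
ndash Gaussians and diagonal matrices: the full covariance matrix
$$\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
becomes diagonal for the projected features
$$\mathbf{R} = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] \approx \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \right) - \boldsymbol{\mu}\boldsymbol{\mu}^T$$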
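The two claims of this derivation (the truncation error equals the sum of the discarded eigenvalues, and the retained projections are uncorrelated) can be verified on sample data. The sketch below assumes NumPy is available and uses randomly generated toy data; `pca_basis` is a hypothetical helper name, and the sample correlation matrix stands in for the expectation.

```python
import numpy as np

def pca_basis(X):
    """Eigenvalues (descending) and eigenvectors (columns) of R = X^T X / N.

    X holds one zero-mean sample per row.
    """
    R = X.T @ X / X.shape[0]
    eigvals, eigvecs = np.linalg.eigh(R)   # eigh: symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated toy data
X = X - X.mean(axis=0)                                   # enforce zero mean

lam, Phi = pca_basis(X)
m = 2
Y = X @ Phi[:, :m]            # encoder: y = Phi'^T x for every sample
X_hat = Y @ Phi[:, :m].T      # decoder: x_hat = Phi' y

mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
discarded = lam[m:].sum()
corr_Y = Y.T @ Y / X.shape[0]  # correlation matrix of the projections
print(mse, discarded)
```

With the sample correlation matrix used for both the basis and the error, the equality is exact up to rounding, and `corr_Y` is diagonal with the retained eigenvalues on its diagonal.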
ML-14
PCA Derivations (8/13)
bull A two-dimensional example of Principal Component Analysis
ML-15
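PCA Derivations (9/13)
bull Minimum Mean-Squared Error Criterion
ndash It can be proved that $\bar{\varepsilon}_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
bull Define the objective function (the first term is to be minimized, the second term carries the orthonormality constraints, with symmetric multipliers $\mu_{jk} = \mu_{kj}$)
$$J = \sum_{j=m+1}^{n} \boldsymbol{\varphi}_j^T \mathbf{R} \boldsymbol{\varphi}_j - \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} \left( \boldsymbol{\varphi}_j^T \boldsymbol{\varphi}_k - \delta_{jk} \right), \quad \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
bull Take the derivative, using $\partial (\boldsymbol{\varphi}^T \mathbf{R} \boldsymbol{\varphi}) / \partial \boldsymbol{\varphi} = 2 \mathbf{R} \boldsymbol{\varphi}$
$$\forall \, m+1 \le j \le n: \quad \frac{\partial J}{\partial \boldsymbol{\varphi}_j} = 2 \mathbf{R} \boldsymbol{\varphi}_j - 2 \sum_{k=m+1}^{n} \mu_{jk} \boldsymbol{\varphi}_k = \mathbf{0}$$
$$\Rightarrow \mathbf{R} \boldsymbol{\varphi}_j = \boldsymbol{\Phi}_{n-m} \boldsymbol{\mu}_j, \quad \text{where } \boldsymbol{\Phi}_{n-m} = [\boldsymbol{\varphi}_{m+1} \; \cdots \; \boldsymbol{\varphi}_n], \; \boldsymbol{\mu}_j = [\mu_{j,m+1} \; \cdots \; \mu_{j,n}]^T$$
$$\Rightarrow \mathbf{R} \boldsymbol{\Phi}_{n-m} = \boldsymbol{\Phi}_{n-m} \mathbf{U}_{n-m}, \quad \text{where } \mathbf{U}_{n-m} = [\boldsymbol{\mu}_{m+1} \; \cdots \; \boldsymbol{\mu}_n]$$
bull There is a particular solution if $\mathbf{U}_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $\mathbf{R}$ and $\boldsymbol{\varphi}_{m+1}, \ldots, \boldsymbol{\varphi}_n$ are their corresponding eigenvectors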
ML-16
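PCA Derivations (10/13)
bull Given an input vector $\mathbf{x}$ with dimension n
ndash Try to construct a linear transform $\boldsymbol{\Phi}'$ ($\boldsymbol{\Phi}'$ is an n x m matrix, m < n) such that the truncation result $\boldsymbol{\Phi}'^T \mathbf{x}$ is optimal under the mean-squared error criterion
bull Encoder: maps $\mathbf{x} = [x_1 \; x_2 \; \cdots \; x_n]^T$ to $\mathbf{y} = [y_1 \; y_2 \; \cdots \; y_m]^T$
$$\mathbf{y} = \boldsymbol{\Phi}'^T \mathbf{x}, \quad \text{where } \boldsymbol{\Phi}' = [\mathbf{e}_1 \; \mathbf{e}_2 \; \cdots \; \mathbf{e}_m]$$
bull Decoder: maps $\mathbf{y}$ back to $\hat{\mathbf{x}} = [\hat{x}_1 \; \hat{x}_2 \; \cdots \; \hat{x}_n]^T$
$$\hat{\mathbf{x}} = \boldsymbol{\Phi}' \mathbf{y}$$
bull Criterion
$$\text{minimize } E_{\mathbf{x}}\left[ (\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}}) \right]$$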
ML-17
PCA Derivations (11/13)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
ndash PCA needs no prior information (e.g. class distributions or output information) about the sample patterns
ML-18
PCA Derivations (12/13)
bull Scree Graph
ndash The plot of variance as a function of the number of eigenvectors kept
bull Select m such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i$$
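Both selection rules are easy to state in code. The following is a small sketch with made-up eigenvalues; the function names `choose_m_by_ratio` and `choose_m_by_average` are illustrative and not part of any library.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues reach the given share of total variance."""
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    running = 0.0
    for m, value in enumerate(lam, start=1):
        running += value
        if running / total >= threshold:
            return m
    return len(lam)

def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for value in eigenvalues if value >= mean)

eigs = [4.0, 3.0, 2.0, 1.0]                  # toy eigenvalues, total variance 10
m_ratio = choose_m_by_ratio(eigs, 0.85)      # cumulative shares: 0.4, 0.7, 0.9
m_average = choose_m_by_average(eigs)        # average eigenvalue is 2.5
print(m_ratio, m_average)
```

The ratio rule needs a user-chosen threshold, whereas the average rule has no free parameter; the two can give different values of m on the same spectrum.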
ML-19
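PCA Derivations (13/13)
bull PCA finds a linear transform $\mathbf{W}$ such that the sum of average between-class variation and average within-class variation is maximal
$$J(\mathbf{W}) = \tilde{\mathbf{S}} = \tilde{\mathbf{S}}_w + \tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_w \mathbf{W} + \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
$$\tilde{\mathbf{S}}_w = \mathbf{W}^T \mathbf{S}_w \mathbf{W}, \qquad \tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
$$\mathbf{S}_w = \frac{1}{N} \sum_j N_j \boldsymbol{\Sigma}_j, \qquad \mathbf{S}_b = \frac{1}{N} \sum_j N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T, \qquad \mathbf{S} = \frac{1}{N} \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
where i is the sample index and j is the class index
bull Try to show that $\mathbf{S} = \mathbf{S}_w + \mathbf{S}_b$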
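Before proving the identity S = S_w + S_b, it can be checked numerically. This is a minimal sketch assuming NumPy and randomly generated three-class toy data; `scatter_matrices` is a hypothetical helper, not a library function.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S, S_w, S_b) with the 1/N normalization used on the slide."""
    N = X.shape[0]
    mean_all = X.mean(axis=0)
    centered = X - mean_all
    S = centered.T @ centered / N                 # total variation
    S_w = np.zeros_like(S)
    S_b = np.zeros_like(S)
    for j in np.unique(labels):
        Xj = X[labels == j]
        mean_j = Xj.mean(axis=0)
        dj = Xj - mean_j
        S_w += dj.T @ dj / N                      # equals N_j * Sigma_j / N
        diff = (mean_j - mean_all)[:, None]
        S_b += Xj.shape[0] * (diff @ diff.T) / N  # N_j * outer product / N
    return S, S_w, S_b

rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], [30, 50, 20])
X = rng.normal(size=(100, 3)) + labels[:, None] * np.array([2.0, -1.0, 0.5])

S, S_w, S_b = scatter_matrices(X, labels)
print(np.abs(S - (S_w + S_b)).max())
```

The check works for unequal class sizes because each class is weighted by N_j in both S_w and S_b.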
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (1/2)
bull Example 3 Image Coding
[Figure: a 256 x 256 image and an 8 x 8 block]
ML-23
PCA Examples Image Coding (2/2)
bull Example 3 Image Coding (cont.)
[Figure panels: value reduction; feature reduction]
ML-24
PCA Examples Eigenface (1/4)
bull Example 4 Eigenface in face recognition (Turk and Pentland 1991)
ndash Consider an individual image to be a linear combination of a small number of face components or "eigenfaces" derived from a set of reference images
ndash Steps
bull Convert each of the L reference images into a vector of floating point numbers representing the light intensity in each pixel
$$\mathbf{x}_1 = [x_{11} \; x_{21} \; \cdots \; x_{n1}]^T, \quad \mathbf{x}_2 = [x_{12} \; x_{22} \; \cdots \; x_{n2}]^T, \quad \ldots, \quad \mathbf{x}_L = [x_{1L} \; x_{2L} \; \cdots \; x_{nL}]^T$$
bull Calculate the covariance/correlation matrix between these reference vectors
bull Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
bull In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples Eigenface (2/4)
bull Example 4 Eigenface in face recognition (cont.)
ndash Steps
bull The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i1} \mathbf{e}(1) + w_{i2} \mathbf{e}(2) + \cdots + w_{iK} \mathbf{e}(K) \;\Rightarrow\; \mathbf{y}_i = [w_{i1} \; w_{i2} \; \cdots \; w_{iK}]$$
where $\mathbf{y}_i$ is the feature vector of person i
ndash The Eigenface approach adheres to the minimum mean-squared error criterion
ndash Incidentally, the eigenfaces are not only themselves usually plausible faces but also directions of variation between faces
ML-26
PCA Examples Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces), and a projected face image]
ML-28
PCA Examples Eigenvoice (1/3)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Steps
bull Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
bull SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 Data ... Speaker R Data and the SI HMM go through model training to give Speaker 1 HMM ... Speaker R HMM; the resulting supervectors, each of dimension D = (M·n) x 1, are passed to Principal Component Analysis]
bull Each new speaker S is represented by a point P in K-space
$$\mathbf{P}_i = \mathbf{e}(0) + w_{i1} \mathbf{e}(1) + w_{i2} \mathbf{e}(2) + \cdots + w_{iK} \mathbf{e}(K)$$
where $\mathbf{e}(0)$ is the SI HMM model
ML-29
PCA Examples Eigenvoice (2/3)
bull Example 5 Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples Eigenvoice (3/3)
bull Example 5 Eigenvoice in speaker adaptation (cont.)
ndash Dimension 1 (eigenvoice 1)
bull Correlates with pitch or sex
ndash Dimension 2 (eigenvoice 2)
bull Correlates with amplitude
ndash Dimension 3 (eigenvoice 3)
bull Correlates with second-formant movement
bull Note that Eigenface operates in the feature space while eigenvoice operates in the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
bull Also called
ndash Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
bull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
bull Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
bull Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space
ML-33
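LDA Derivations (1/4)
bull Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes
$$g(\mathbf{x}_i) = j, \quad j \in \{1, 2, \ldots, J\}, \quad g(\cdot) \text{ is the class index}$$
ndash The sample mean is
$$\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$$
ndash The class sample means are
$$\bar{\mathbf{x}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \mathbf{x}_i$$
ndash The class sample covariances are
$$\boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$
ndash The average within-class variation before transform
$$\mathbf{S}_w = \frac{1}{N} \sum_j N_j \boldsymbol{\Sigma}_j$$
ndash The average between-class variation before transform
$$\mathbf{S}_b = \frac{1}{N} \sum_j N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$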
ML-34
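LDA Derivations (2/4)
bull If the transform $\mathbf{W} = [\mathbf{w}_1 \; \mathbf{w}_2 \; \cdots \; \mathbf{w}_m]$ is applied (it maps $\mathbf{x}$ to $\mathbf{y}$, $\mathbf{S}_w$ to $\tilde{\mathbf{S}}_w$ and $\mathbf{S}_b$ to $\tilde{\mathbf{S}}_b$)
ndash The sample vectors will be
$$\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$$
ndash The sample mean will be
$$\bar{\mathbf{y}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \right) = \mathbf{W}^T \bar{\mathbf{x}}$$
ndash The class sample means will be
$$\bar{\mathbf{y}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \bar{\mathbf{x}}_j$$
ndash The average within-class variation will be
$$\tilde{\mathbf{S}}_w = \frac{1}{N} \sum_j N_j \cdot \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \left( \mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j \right) \left( \mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j \right)^T = \mathbf{W}^T \left\{ \frac{1}{N} \sum_j N_j \boldsymbol{\Sigma}_j \right\} \mathbf{W} = \mathbf{W}^T \mathbf{S}_w \mathbf{W}$$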
ML-35
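LDA Derivations (3/4)
bull If the transform $\mathbf{W} = [\mathbf{w}_1 \; \mathbf{w}_2 \; \cdots \; \mathbf{w}_m]$ is applied
ndash Similarly, the average between-class variation will be
$$\tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
ndash Try to find the optimal $\mathbf{W}$ such that the following objective function is maximized
$$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|}$$
bull A closed-form solution: the column vectors of an optimal matrix $\mathbf{W}$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i$$
bull That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1} \mathbf{S}_b$
$$\mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$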
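The closed-form solution can be sketched directly from the eigenproblem of S_w^{-1} S_b. The code below assumes NumPy, an invertible S_w, and randomly generated two-class toy data; `lda_directions` is a hypothetical helper name. For two classes the leading direction should be parallel to S_w^{-1}(mean_1 - mean_0), which the last lines check.

```python
import numpy as np

def lda_directions(X, labels, m):
    """Top-m eigenpairs of S_w^{-1} S_b, plus S_w and S_b (1/N normalization)."""
    N, n = X.shape
    mean_all = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mean_j = Xj.mean(axis=0)
        dj = Xj - mean_j
        S_w += dj.T @ dj / N
        diff = (mean_j - mean_all)[:, None]
        S_b += Xj.shape[0] * (diff @ diff.T) / N
    # Solve S_w^{-1} S_b w = lambda w without forming the inverse explicitly.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:m]
    return eigvals.real[order], eigvecs.real[:, order], S_w, S_b

rng = np.random.default_rng(2)
labels = np.repeat([0, 1], [60, 40])
X = rng.normal(size=(100, 2)) @ np.array([[2.0, 0.6], [0.0, 0.5]])
X[labels == 1] += np.array([1.0, 1.5])

lam, W, S_w, S_b = lda_directions(X, labels, m=1)
w = W[:, 0]

# Two-class sanity check: w should be parallel to S_w^{-1} (mean_1 - mean_0).
d = np.linalg.solve(S_w, X[labels == 1].mean(axis=0) - X[labels == 0].mean(axis=0))
cosine = abs(w @ d) / (np.linalg.norm(w) * np.linalg.norm(d))
print(lam[0], cosine)
```

With J classes, S_b has rank at most J - 1, so at most J - 1 eigenvalues are nonzero and only that many discriminant directions carry information.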
ML-36
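LDA Derivations (4/4)
bull Proof
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|} \quad (|\cdot| \text{ denotes the determinant})$$
Or, for each column vector $\mathbf{w}_i$ of $\mathbf{W}$, we want to find
$$\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}$$
The quadratic form has an optimal solution where the derivative vanishes. Using $\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^T \mathbf{C} \mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}$:
$$\frac{\partial}{\partial \mathbf{w}_i} \left( \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} \right) = \frac{2 \mathbf{S}_b \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - 2 \mathbf{S}_w \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i)}{(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i)^2} = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - \mathbf{S}_w \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i) = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i - \lambda_i \mathbf{S}_w \mathbf{w}_i = \mathbf{0}, \quad \text{where } \lambda_i = \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i \;\Rightarrow\; \mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$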
ML-37
LDA Examples Feature Transformation (1/2)
bull Example 1 Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}_x = \frac{1}{N} \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
[Figure: covariance matrix of the 18-cepstral vectors after the Cosine Transform, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}'_y = \frac{1}{N} \sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$
ML-38
LDA Examples Feature Transformation (2/2)
bull Example 1 Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors after the PCA Transform, calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors after the LDA Transform, calculated using Year-99's 5471 files]
Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs LDA (1/2)
[Figure panels: PCA; LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA: Heteroscedastic Discriminant Analysis
bull The difference in the projections obtained from LDA and HDA for the 2-class case
ndash Clearly, HDA theoretically provides a much lower classification error than LDA
bull However, most statistical modeling approaches assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
bull Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations
1 Merge these two data sets and find/plot the covariance matrix for the merged data set
2 Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
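HW-3 Feature Transformation (3/4)
bull Plot Covariance Matrix
CoVar=[3.0 0.5 0.4;
       0.9 6.3 0.2;
       0.4 0.4 4.2];
colormap('default');
surf(CoVar);
bull Eigen Decomposition
BE=[3.0 3.5 1.4;
    1.9 6.3 2.2;
    2.4 0.4 4.2];
WI=[4.0 4.1 2.1;
    2.9 8.7 3.5;
    4.4 3.2 4.3];
% LDA case
IWI=inv(WI);
A=IWI*BE;
% PCA case (use one of the two assignments to A)
A=BE+WI;          % why? (Prove it)
[V,D]=eig(A);     % all eigenvalues and eigenvectors
[V,D]=eigs(A,3);  % the 3 largest eigenvalues and their eigenvectors
fid=fopen('Basis','w');
for i=1:3         % feature vector length
  for j=1:3       % basis number
    fprintf(fid,'%10.10f ',V(i,j));
  end
  fprintf(fid,'\n');
end
fclose(fid);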
ML-44
HW-3 Feature Transformation (4/4)
bull Examples
[Figure: distribution of 2,000 original data samples after the PCA transformation; axes: feature 1, feature 2]
[Figure: distribution of 2,000 original data samples after the LDA transformation; axes: feature 1, feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
bull Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
ndash Co-occurring terms are projected onto the same dimensions
ndash In the latent semantic space (with fewer dimensions), a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
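LSA (2/7)
bull Dimension Reduction and Feature Extraction
ndash PCA: project the n-dimensional X onto an orthonormal basis $\boldsymbol{\varphi}_1, \ldots, \boldsymbol{\varphi}_k$ of the feature space
$$y_i = \boldsymbol{\varphi}_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \boldsymbol{\varphi}_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$
ndash SVD (in LSA): approximate the m x n matrix A by A' in a k-dimensional latent semantic space
$$A_{m \times n} \approx A'_{m \times n} = U'_{m \times k} \, \Sigma'_{k \times k} \, V'^T_{k \times n}, \qquad \min \|A - A'\|_F^2 \text{ for a given } k, \qquad k \le r \le \min(m, n)$$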
ML-47
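LSA (3/7)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
bull Projection of a vector $\mathbf{x}$ on $\boldsymbol{\varphi}_1$, with $\theta$ the angle between them
$$y_1 = \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \frac{\boldsymbol{\varphi}_1^T \mathbf{x}}{\|\boldsymbol{\varphi}_1\| \, \|\mathbf{x}\|} = \boldsymbol{\varphi}_1^T \mathbf{x}, \quad \text{where } \|\boldsymbol{\varphi}_1\| = 1$$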
ML-48
LSA (4/7)
bull Frameworks to circumvent vocabulary mismatch
[Figure: Doc terms and Query terms can be compared by literal term matching, by doc expansion or query expansion, or through a structure model on each side (latent semantic structure retrieval)]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
bull Query: "human computer interaction"
ndash An OOV word
ML-51
LSA (7/7)
bull Singular Value Decomposition (SVD)
$$A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n}$$
$$A'_{m \times n} = U'_{m \times k} \, \Sigma_{k \times k} \, V'^T_{k \times n}$$
ndash The rows of A and A' are indexed by the words w1, w2, ..., wm and the columns by the docs d1, d2, ..., dn
bull Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of Σk
bull Both U and V have orthonormal column vectors
$$U^T U = I_{r \times r}, \qquad V^T V = I_{r \times r}$$
bull Ranks and norms
$$k \le r \le \min(m, n), \qquad \|A\|_F^2 \ge \|A'\|_F^2, \qquad \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$$
bull Row A ∈ R^n, Col A ∈ R^m
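The properties listed above can be checked with a small truncated SVD. The sketch below assumes NumPy and a made-up 4 x 3 term-doc matrix; `truncated_svd` is a hypothetical helper built on `numpy.linalg.svd`.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U' (m x k), the k largest singular values, and V'^T (k x n)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # s is sorted descending
    return U[:, :k], s[:k], Vt[:k, :]

# Toy matrix: 4 terms (rows) by 3 docs (columns); values are illustrative.
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

k = 2
U_k, s_k, Vt_k = truncated_svd(A, k)
A_k = U_k @ np.diag(s_k) @ Vt_k            # rank-k approximation A'

s_all = np.linalg.svd(A, compute_uv=False)
norm_A = np.linalg.norm(A, "fro") ** 2
norm_A_k = np.linalg.norm(A_k, "fro") ** 2
error = np.linalg.norm(A - A_k, "fro") ** 2
print(norm_A, norm_A_k, error)
```

The squared Frobenius norm of A' is the sum of the k retained squared singular values, and the approximation error is the sum of the discarded ones, which mirrors the eigenvalue argument used for PCA.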
ML-52
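LSA Derivations (1/7)
bull Singular Value Decomposition (SVD)
ndash $A^T A$ is a symmetric n x n matrix
bull All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$$
bull All eigenvectors $\mathbf{v}_j$ are orthonormal ($\mathbf{v}_j \in R^n$)
$$V_{n \times n} = [\mathbf{v}_1 \; \mathbf{v}_2 \; \cdots \; \mathbf{v}_n], \qquad \mathbf{v}_j^T \mathbf{v}_j = 1, \qquad V^T V = I_{n \times n}$$
bull Define singular values
ndash As the square roots of the eigenvalues of $A^T A$
$$\sigma_j = \sqrt{\lambda_j}, \quad j = 1, \ldots, n, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$$
ndash As the lengths of the vectors $A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_n$
$$\|A\mathbf{v}_i\|^2 = (A\mathbf{v}_i)^T A\mathbf{v}_i = \mathbf{v}_i^T A^T A \mathbf{v}_i = \lambda_i \mathbf{v}_i^T \mathbf{v}_i = \lambda_i \;\Rightarrow\; \|A\mathbf{v}_i\| = \sigma_i, \quad \text{e.g. } \sigma_1 = \|A\mathbf{v}_1\|, \; \sigma_2 = \|A\mathbf{v}_2\|$$
bull For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A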
ML-53
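LSA Derivations (2/7)
bull $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A
$$A\mathbf{v}_i \cdot A\mathbf{v}_j = (A\mathbf{v}_i)^T A\mathbf{v}_j = \mathbf{v}_i^T A^T A \mathbf{v}_j = \lambda_j \mathbf{v}_i^T \mathbf{v}_j = 0 \quad (i \ne j)$$
ndash Suppose that A (or $A^T A$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
ndash Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A
$$\mathbf{u}_i = \frac{A\mathbf{v}_i}{\|A\mathbf{v}_i\|} = \frac{1}{\sigma_i} A\mathbf{v}_i \;\Rightarrow\; \sigma_i \mathbf{u}_i = A\mathbf{v}_i \;\Rightarrow\; [\mathbf{u}_1 \; \mathbf{u}_2 \; \cdots \; \mathbf{u}_r] \, \Sigma_r = A \, [\mathbf{v}_1 \; \mathbf{v}_2 \; \cdots \; \mathbf{v}_r]$$
Here V (n x r, known in advance) is an orthonormal matrix, and U (m x r) is also an orthonormal matrix
bull Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of $R^m$
$$[\mathbf{u}_1 \; \cdots \; \mathbf{u}_r \; \cdots \; \mathbf{u}_m] \, \Sigma = A \, [\mathbf{v}_1 \; \cdots \; \mathbf{v}_r \; \cdots \; \mathbf{v}_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \quad (VV^T = I_{n \times n})$$
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$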
ML-54
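LSA Derivations (3/7)
$$A_{m \times n} = U \Sigma V^T = (U_1 \; U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
bull A maps $R^n$ to $R^m$, with $AV = U\Sigma$
bull The $\mathbf{v}_i$ in $V_1$ span the row space of A
bull The $\mathbf{u}_i$ in $U_1$ span the row space of $A^T$
bull $V_2$ corresponds to the solutions of AX = 0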
ML-55
LSA Derivations (4/7)
bull Additional Explanations
ndash Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
$$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
bull The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
ndash Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
$$A = U\Sigma V^T \;\Rightarrow\; A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$$
bull The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
bull Fundamental comparisons based on SVD
ndash The original word-document matrix (A, m x n, rows w1, w2, ..., wm and columns d1, d2, ..., dn)
bull compare two terms → dot product of two rows of A, or an entry in $AA^T$
bull compare two docs → dot product of two columns of A, or an entry in $A^T A$
bull compare a term and a doc → each individual entry of A
ndash The new word-document matrix (A'), with U' = U (m x k), Σ' = Σk, V' = V (n x k)
bull compare two terms → dot product of two rows of U'Σ'
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
bull compare two docs → dot product of two rows of V'Σ'
$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
bull compare a query word and a doc → each individual entry of A'
ndash Σ' is for stretching or shrinking the axes; $V'^T V'$ and $U'^T U'$ reduce to the identity matrix
ML-57
LSA Derivations (6/7)
bull Fold-in: find representations for pseudo-docs q
ndash For objects (new queries or docs) that did not appear in the original analysis
bull Fold-in a new m x 1 query (or doc) vector
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$$
ndash The query is represented by the weighted sum of its constituent term vectors
ndash The separate dimensions are differentially weighted
ndash The result is just like a row of V
bull Cosine measure between the query and doc vectors (row vectors) in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \Sigma^2 \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$$
LSA Derivations (77)
bull Fold-in a new 1 X n term vector 1
11ˆminustimestimestimestimes Σ= kkknnk Vtt
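The fold-in formulas can be sanity-checked by folding in a doc or term that was already part of the analysis: the result should reproduce the corresponding row of V or U. The sketch below assumes NumPy and a made-up 4 x 3 term-doc matrix; the function names are illustrative only.

```python
import numpy as np

def fold_in_doc(q, U_k, s_k):
    """q_hat (1 x k) = q^T U_k Sigma_k^{-1} for an m-dimensional query or doc q."""
    return (q @ U_k) / s_k

def fold_in_term(t, V_k, s_k):
    """t_hat (1 x k) = t V_k Sigma_k^{-1} for an n-dimensional term row t."""
    return (t @ V_k) / s_k

def latent_cosine(a_hat, b_hat, s_k):
    """Cosine between a_hat * Sigma and b_hat * Sigma (row vectors)."""
    a, b = a_hat * s_k, b_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy matrix: 4 terms (rows) by 3 docs (columns); values are illustrative.
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

d0_hat = fold_in_doc(A[:, 0], U_k, s_k)    # doc 0 folded in as if it were new
t1_hat = fold_in_term(A[1, :], V_k, s_k)   # term 1 folded in as if it were new
query = np.array([1.0, 0.0, 1.0, 0.0])     # hypothetical query using terms 0 and 2
q_hat = fold_in_doc(query, U_k, s_k)
sims = [latent_cosine(q_hat, V_k[j], s_k) for j in range(A.shape[1])]
print(sims)
```

Dividing by the singular values is an elementwise operation here because Σ is diagonal, so no explicit matrix inverse is formed.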
ML-59
LSA Example
bull Experimental results
ndash HMM is consistently better than VSM at all recall levels
ndash LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
bull Advantages
ndash A clean formal framework and a clearly defined optimization criterion (least-squares)
bull Conceptual simplicity and clarity
ndash Handle synonymy problems ("heterogeneous vocabulary")
ndash Good results for high-recall search
bull Take term co-occurrence into account
bull Disadvantages
ndash High computational complexity
ndash LSA offers only a partial solution to polysemy
bull E.g. bank, bass, ...
ML-61
LSA Toolkit SVDLIBC (1/5)
bull Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
bull Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit SVDLIBC (2/5)
bull Given a sparse term-doc matrix
ndash E.g. 4 terms and 3 docs (Row = Term, Col = Doc)
2.3 0.0 4.2
0.0 1.3 2.2
3.8 0.0 0.5
0.0 0.0 0.0
ndash Each entry is weighted by TFxIDF score
ndash The sparse text representation of this matrix is
4 3 6      (4 terms, 3 docs, 6 nonzero entries)
2          (2 nonzero entries at Col 0)
0 2.3      (Col 0, Row 0)
2 3.8      (Col 0, Row 2)
1          (1 nonzero entry at Col 1)
1 1.3      (Col 1, Row 1)
3          (3 nonzero entries at Col 2)
0 4.2      (Col 2, Row 0)
1 2.2      (Col 2, Row 1)
2 0.5      (Col 2, Row 2)
bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, 600 etc.) of LSA dimensionality
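A dense matrix can be converted to this column-by-column sparse layout with a few lines of code. The sketch below follows only the layout illustrated on this slide (header line, then per column a count followed by row/value pairs); `to_sparse_text` is a hypothetical helper and is not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense row-major matrix into the column-wise sparse text layout."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Collect (row index, value) pairs of the nonzero entries in column c.
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    nonzeros = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {nonzeros}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines)

term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]

text = to_sparse_text(term_doc)
print(text)
```

Row and column indices are zero-based in this layout, and an all-zero row (such as the fourth term here) simply never appears in the entry list.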
ML-63
LSA Toolkit SVDLIBC (3/5)
bull Example term-doc matrix
51253 2265 218852      (Indexing Term no., Doc no., Nonzero entries)
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
......
bull SVD command (IR_svd.bat)
svd -r st -o LSA100 -d 100 Term-Doc-Matrix
ndash -r st: sparse matrix input
ndash -o LSA100: prefix of output files
ndash -d 100: No. of reserved eigenvectors
ndash Term-Doc-Matrix: name of sparse matrix input
bull Output files: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit SVDLIBC (4/5)
bull LSA100-Ut (word vector (u^T) 1x100; 51253 words)
100 51253
0.003 0.001 ......
0.002 0.002 ......
bull LSA100-S (100 eigenvalues)
100
2686.18
829.941
559.59
...
bull LSA100-Vt (doc vector (v^T) 1x100; 2265 docs)
100 2265
0.021 0.035 ......
0.012 0.022 ......
ML-65
LSA Toolkit SVDLIBC (5/5)
bull Fold-in a new m x 1 query vector (TFxIDF weighted beforehand)
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$$
ndash The query is represented by the weighted sum of its constituent term vectors
ndash The separate dimensions are differentially weighted
ndash The result is just like a row of V
bull Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \Sigma^2 \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$$
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-8
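PCA Derivations (2/13)
• Example: the vector (2, 3) expressed in two orthogonal basis sets
  - In the basis {(1, 0), (0, 1)} its coordinates are (2, 3)
  - In the basis {(1, 1), (-1, 1)} its coordinates are (5/2, 1/2)
$$\begin{bmatrix} 2 \\ 3 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 3\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \frac{5}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{1}{2}\begin{bmatrix} -1 \\ 1 \end{bmatrix}$$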
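As a quick check of the change of basis above, the following minimal sketch computes the coordinates of (2, 3) in both basis sets. The helper names `coordinates` and `reconstruct` are illustrative and not part of the slides; the projection formula assumes the basis vectors are mutually orthogonal but not necessarily of unit length.

```python
def coordinates(x, basis):
    """Coordinates of x in an orthogonal (not necessarily unit-length) basis."""
    coords = []
    for b in basis:
        dot_xb = sum(xi * bi for xi, bi in zip(x, b))
        dot_bb = sum(bi * bi for bi in b)
        coords.append(dot_xb / dot_bb)  # projection coefficient onto b
    return coords


def reconstruct(coords, basis):
    """Rebuild x as the weighted sum of the basis vectors."""
    dim = len(basis[0])
    return [sum(c * b[d] for c, b in zip(coords, basis)) for d in range(dim)]


x = [2.0, 3.0]
standard_basis = [[1.0, 0.0], [0.0, 1.0]]
rotated_basis = [[1.0, 1.0], [-1.0, 1.0]]

c_std = coordinates(x, standard_basis)
c_rot = coordinates(x, rotated_basis)
print(c_std, c_rot)
```

Both coordinate vectors describe the same point; only the basis differs, which is the property PCA exploits when it picks a more convenient basis.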
ML-9
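PCA Derivations (3/13)
  - Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set
$$\boldsymbol{\phi}_i^T \boldsymbol{\phi}_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \ne j \end{cases}$$
    • Such that $y_i$ is equal to the projection of $\mathbf{x}$ on $\boldsymbol{\phi}_i$
$$\forall i: \quad y_i = \mathbf{x}^T \boldsymbol{\phi}_i = \boldsymbol{\phi}_i^T \mathbf{x}$$
[Figure: the vector x and its projections y_1, y_2 onto the basis vectors φ_1, φ_2; θ is the angle between x and φ_1]
$$y_1 = \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \frac{\boldsymbol{\phi}_1^T \mathbf{x}}{\|\boldsymbol{\phi}_1\| \, \|\mathbf{x}\|} = \boldsymbol{\phi}_1^T \mathbf{x}, \quad \text{where } \|\boldsymbol{\phi}_1\| = 1$$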
ML-10
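PCA Derivations (4/13)
  - Further assume the column (basis) vectors of the matrix $\boldsymbol{\Phi}$ form an orthonormal set
    • $y_i$ also has the following properties
      - Its mean is zero too
$$E[y_i] = E[\boldsymbol{\phi}_i^T \mathbf{x}] = \boldsymbol{\phi}_i^T E[\mathbf{x}] = \boldsymbol{\phi}_i^T \mathbf{0} = 0$$
      - Its variance is
$$\sigma_i^2 = E[y_i^2] - (E[y_i])^2 = E[y_i^2] = E[\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_i] = \boldsymbol{\phi}_i^T E[\mathbf{x}\mathbf{x}^T] \boldsymbol{\phi}_i = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_i$$
        where $\mathbf{R}$ is the (auto-)correlation matrix of $\mathbf{x}$
$$\mathbf{R} = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T$$
$$\boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = E[\mathbf{x}\mathbf{x}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T \approx \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \right) - \boldsymbol{\mu}\boldsymbol{\mu}^T$$
    • The correlation between two projections $y_i$ and $y_j$ is
$$E[y_i y_j] = E[(\boldsymbol{\phi}_i^T \mathbf{x})(\boldsymbol{\phi}_j^T \mathbf{x})^T] = E[\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_j] = \boldsymbol{\phi}_i^T E[\mathbf{x}\mathbf{x}^T] \boldsymbol{\phi}_j = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_j$$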
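The sample estimates of R and Σ above can be sketched in a few lines. The three sample vectors are made-up illustration data, and the function names are assumptions rather than anything defined in the slides. The sketch also checks that R and Σ coincide once the mean has been subtracted, which is why the zero-mean assumption lets the derivation work with R alone.

```python
def mean_vector(samples):
    n = len(samples)
    return [sum(x[d] for x in samples) / n for d in range(len(samples[0]))]


def correlation_matrix(samples):
    """R ~ (1/N) * sum_i x_i x_i^T."""
    n = len(samples)
    dim = len(samples[0])
    return [[sum(x[a] * x[b] for x in samples) / n for b in range(dim)]
            for a in range(dim)]


def covariance_matrix(samples):
    """Sigma ~ R - mu mu^T."""
    mu = mean_vector(samples)
    r = correlation_matrix(samples)
    dim = len(mu)
    return [[r[a][b] - mu[a] * mu[b] for b in range(dim)] for a in range(dim)]


# Made-up 2-dimensional samples (illustration only).
samples = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
mu = mean_vector(samples)
centered = [[x[d] - mu[d] for d in range(len(mu))] for x in samples]

sigma = covariance_matrix(samples)
r_centered = correlation_matrix(centered)  # equals sigma after centering
print(sigma)
```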
ML-11
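PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  - We want to choose only m of the $\boldsymbol{\phi}_i$'s such that we can still approximate $\mathbf{x}$ well in the mean-squared error criterion
$$\mathbf{x} = \sum_{i=1}^{n} y_i \boldsymbol{\phi}_i = \sum_{i=1}^{m} y_i \boldsymbol{\phi}_i + \sum_{j=m+1}^{n} y_j \boldsymbol{\phi}_j \quad \text{(original vector)}$$
$$\hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i \boldsymbol{\phi}_i \quad \text{(reconstructed vector)}$$
$$\varepsilon(m) = E\left[ \|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2 \right] = E\left\{ \left( \sum_{j=m+1}^{n} y_j \boldsymbol{\phi}_j \right)^T \left( \sum_{k=m+1}^{n} y_k \boldsymbol{\phi}_k \right) \right\} = E\left\{ \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} y_j y_k \boldsymbol{\phi}_j^T \boldsymbol{\phi}_k \right\}$$
$$= \sum_{j=m+1}^{n} E[y_j^2] = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j$$
    because
$$\boldsymbol{\phi}_j^T \boldsymbol{\phi}_k = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases} \qquad \text{and} \qquad E[y_j] = 0 \;\Rightarrow\; \sigma_j^2 = E[y_j^2] - (E[y_j])^2 = E[y_j^2]$$
  - We should discard the bases where the projections have lower variances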
ML-12
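PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  - If the orthonormal (basis) set $\boldsymbol{\phi}_i$'s is selected to be the eigenvectors of the correlation matrix $\mathbf{R}$, associated with eigenvalues $\lambda_i$'s
    • They will have the property that
$$\mathbf{R} \boldsymbol{\phi}_j = \lambda_j \boldsymbol{\phi}_j$$
  - Such that the mean-squared error mentioned above will be
$$\varepsilon(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j = \sum_{j=m+1}^{n} \lambda_j \boldsymbol{\phi}_j^T \boldsymbol{\phi}_j = \sum_{j=m+1}^{n} \lambda_j$$
  - $\mathbf{R}$ is real and symmetric, therefore its eigenvectors form an orthonormal set
  - $\mathbf{R}$ is positive definite ($\mathbf{x}^T \mathbf{R} \mathbf{x} > 0$ for every $\mathbf{x} \ne \mathbf{0}$) => all eigenvalues are positive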
ML-13
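PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  - If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be
$$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$$
  - Any two projections $y_i$ and $y_j$ will be mutually uncorrelated
$$E[y_i y_j] = E[(\boldsymbol{\phi}_i^T \mathbf{x})(\boldsymbol{\phi}_j^T \mathbf{x})^T] = E[\boldsymbol{\phi}_i^T \mathbf{x}\mathbf{x}^T \boldsymbol{\phi}_j] = \boldsymbol{\phi}_i^T \mathbf{R} \boldsymbol{\phi}_j = \lambda_j \boldsymbol{\phi}_i^T \boldsymbol{\phi}_j = 0 \quad (i \ne j)$$
    • Good news for most statistical modeling approaches
      - Gaussians and diagonal matrices
$$\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
$$\mathbf{R} = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T, \qquad \boldsymbol{\Sigma} = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = E[\mathbf{x}\mathbf{x}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T \approx \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \mathbf{x}_i^T \right) - \boldsymbol{\mu}\boldsymbol{\mu}^T$$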
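The two claims above (the truncation error equals the sum of the discarded eigenvalues, and the projections are uncorrelated) can be checked numerically on a tiny case. The sketch below assumes four made-up zero-mean samples in two dimensions and uses the closed-form eigendecomposition of a symmetric 2x2 matrix; `eig_sym_2x2` and the other helper names are illustrative and not from the slides.

```python
import math


def eig_sym_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]], largest eigenvalue first."""
    if abs(b) < 1e-12:  # already diagonal
        pairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        v = (b, lam - a)  # solves (R - lam I) v = 0 when b != 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


def project(x, phi):
    return x[0] * phi[0] + x[1] * phi[1]


# Made-up zero-mean samples (illustration only).
samples = [(2.0, 1.0), (-2.0, -1.0), (1.0, 2.0), (-1.0, -2.0)]
n = len(samples)

# Correlation matrix R = (1/N) sum x x^T, stored as entries a, b, c.
a = sum(x[0] * x[0] for x in samples) / n
b = sum(x[0] * x[1] for x in samples) / n
c = sum(x[1] * x[1] for x in samples) / n
(lam1, phi1), (lam2, phi2) = eig_sym_2x2(a, b, c)

# Keep m = 1 basis vector and measure the mean-squared reconstruction error.
mse = 0.0
for x in samples:
    y1 = project(x, phi1)
    x_hat = (y1 * phi1[0], y1 * phi1[1])
    mse += (x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2
mse /= n

# Sample correlation between the two projections.
cross = sum(project(x, phi1) * project(x, phi2) for x in samples) / n
print(lam1, lam2, mse, cross)
```

For this data R = [[2.5, 2], [2, 2.5]], so the eigenvalues are 4.5 and 0.5, and discarding the second direction costs exactly the smaller eigenvalue in mean-squared error.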
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
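PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  - It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
  - Define the objective function (first term: to be minimized; second term: constraints)
$$J = \sum_{j=m+1}^{n} \boldsymbol{\phi}_j^T \mathbf{R} \boldsymbol{\phi}_j - \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} \left( \boldsymbol{\phi}_j^T \boldsymbol{\phi}_k - \delta_{jk} \right), \qquad \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
  - Take the derivative, using $\partial(\boldsymbol{\phi}^T \mathbf{R} \boldsymbol{\phi}) / \partial \boldsymbol{\phi} = 2\mathbf{R}\boldsymbol{\phi}$
$$\forall\, m+1 \le j \le n: \quad \frac{\partial J}{\partial \boldsymbol{\phi}_j} = 2\mathbf{R}\boldsymbol{\phi}_j - 2\sum_{k=m+1}^{n} \mu_{jk} \boldsymbol{\phi}_k = \mathbf{0}$$
$$\Rightarrow \mathbf{R}\boldsymbol{\phi}_j = \boldsymbol{\Phi}_{n-m} \boldsymbol{\mu}_j, \quad \text{where } \boldsymbol{\Phi}_{n-m} = [\boldsymbol{\phi}_{m+1} \cdots \boldsymbol{\phi}_n], \; \boldsymbol{\mu}_j = [\mu_{j,m+1} \cdots \mu_{j,n}]^T$$
$$\Rightarrow \mathbf{R}[\boldsymbol{\phi}_{m+1} \cdots \boldsymbol{\phi}_n] = \boldsymbol{\Phi}_{n-m} [\boldsymbol{\mu}_{m+1} \cdots \boldsymbol{\mu}_n]$$
$$\Rightarrow \mathbf{R}\boldsymbol{\Phi}_{n-m} = \boldsymbol{\Phi}_{n-m} \mathbf{U}_{n-m}, \quad \text{where } \mathbf{U}_{n-m} = [\boldsymbol{\mu}_{m+1} \cdots \boldsymbol{\mu}_n]$$
  - A particular solution exists if $\mathbf{U}_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $\mathbf{R}$, and $\boldsymbol{\phi}_{m+1}, \ldots, \boldsymbol{\phi}_n$ are their corresponding eigenvectors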
ML-16
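PCA Derivations (10/13)
• Given an input vector $\mathbf{x}$ with dimension n
  - Try to construct a linear transform $\boldsymbol{\Phi}'$ ($\boldsymbol{\Phi}'$ is an n x m matrix, m < n) such that the truncation result $\boldsymbol{\Phi}'^T \mathbf{x}$ is optimal in the mean-squared error criterion
  - Encoder: maps $\mathbf{x} = [x_1 \; x_2 \; \cdots \; x_n]^T$ to $\mathbf{y} = [y_1 \; y_2 \; \cdots \; y_m]^T$
$$\mathbf{y} = \boldsymbol{\Phi}'^T \mathbf{x}, \quad \text{where } \boldsymbol{\Phi}' = [\mathbf{e}_1 \; \mathbf{e}_2 \; \cdots \; \mathbf{e}_m]$$
  - Decoder: maps $\mathbf{y}$ back to $\hat{\mathbf{x}} = [\hat{x}_1 \; \hat{x}_2 \; \cdots \; \hat{x}_n]^T$
$$\hat{\mathbf{x}} = \boldsymbol{\Phi}' \mathbf{y}$$
  - Criterion
$$\text{minimize } E\left[ (\mathbf{x} - \hat{\mathbf{x}})^T (\mathbf{x} - \hat{\mathbf{x}}) \right]$$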
ML-17
PCA Derivations (11/13)
• Data compression in communication
  - PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
  - PCA needs no prior information (e.g., class distributions of the output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
  - The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i$$
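Both selection rules are easy to state in code. The following is a minimal sketch: the function names, the eigenvalue list and the 0.9 threshold are made up for illustration and are not taken from the slides.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues reach the given share of the total."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(lams)


def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam >= average)


# Made-up eigenvalue spectrum (illustration only).
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
m_ratio = choose_m_by_ratio(eigenvalues, 0.9)
m_average = choose_m_by_average(eigenvalues)
print(m_ratio, m_average)
```

With this spectrum the cumulative shares are 0.5, 0.75, 0.875, 0.9375 and 1.0, so a 0.9 threshold keeps four eigenvectors, while the average rule (average eigenvalue 1.6) keeps only two. The two criteria can therefore disagree noticeably.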
ML-19
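PCA Derivations (13/13)
• PCA finds a linear transform $\mathbf{W}$ such that the sum of the average between-class variation and the average within-class variation is maximal
$$J(\mathbf{W}) = |\tilde{\mathbf{S}}| = |\tilde{\mathbf{S}}_w + \tilde{\mathbf{S}}_b| = |\mathbf{W}^T \mathbf{S}_w \mathbf{W} + \mathbf{W}^T \mathbf{S}_b \mathbf{W}|$$
$$\tilde{\mathbf{S}}_w = \mathbf{W}^T \mathbf{S}_w \mathbf{W}, \qquad \tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
$$\mathbf{S}_w = \frac{1}{N} \sum_{j} N_j \boldsymbol{\Sigma}_j, \qquad \mathbf{S}_b = \frac{1}{N} \sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T, \qquad \mathbf{S} = \frac{1}{N} \sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
  - i: sample index; j: class index
  - Try to show that $\mathbf{S} = \mathbf{S}_w + \mathbf{S}_b$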
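One possible route to the identity S = S_w + S_b is sketched below. It assumes that each sample belongs to exactly one class, that the inner sum runs over the N_j samples of class j, and that Σ_j is the class sample covariance normalised by N_j. The notation "i ∈ class j" is introduced here for brevity.

```latex
\begin{align*}
\mathbf{S}
 &= \frac{1}{N}\sum_{j}\sum_{i \in \text{class } j}
    \bigl[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\bigr]
    \bigl[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\bigr]^T \\
 &= \frac{1}{N}\sum_{j}\Bigl[\sum_{i \in \text{class } j}
    (\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T
    + N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T\Bigr]
    && \text{since } \sum_{i \in \text{class } j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)=\mathbf{0} \\
 &= \frac{1}{N}\sum_{j} N_j\boldsymbol{\Sigma}_j
    + \frac{1}{N}\sum_{j} N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T
  = \mathbf{S}_w+\mathbf{S}_b
\end{align*}
```

The cross terms vanish because the deviations from a class mean sum to zero within that class. This is the only step that uses the class structure.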
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; new feature dimensions; threshold for the information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: image coding example; dimension labels 256 x 256 and 8 x 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure: panels labelled "value reduction" and "feature reduction"]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  - Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  - Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
$$\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}, \quad \mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix}, \quad \ldots, \quad \mathbf{x}_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  - Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i,1}\mathbf{e}_1 + w_{i,2}\mathbf{e}_2 + \cdots + w_{i,K}\mathbf{e}_K \;\Rightarrow\; \mathbf{y}_i = [w_{i,1}, w_{i,2}, \ldots, w_{i,K}] \quad \text{(feature vector of a person } i\text{)}$$
  - The Eigenface approach follows the minimum mean-squared error criterion
  - Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL 2000)
  - Steps
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
      - SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 Data ... Speaker R Data and the SI HMM model feed Model Training, giving Speaker 1 HMM ... Speaker R HMM; the resulting supervectors of dimension D = (M·n) x 1 go through Principal Component Analysis]
    • Each new speaker S is represented by a point P in K-space
$$P = e(0) + w(1)\,e(1) + w(2)\,e(2) + \cdots + w(K)\,e(K)$$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that Eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation over the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
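LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes
$$g(\mathbf{x}_i) = j, \quad j \in \{1, 2, \ldots, J\}, \quad g(\cdot) \text{ is the class index}$$
  - The sample mean is
$$\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$$
  - The class sample means are
$$\bar{\mathbf{x}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \mathbf{x}_i$$
  - The class sample covariances are
$$\boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$
  - The average within-class variation before transform
$$\mathbf{S}_w = \frac{1}{N} \sum_{j} N_j \boldsymbol{\Sigma}_j$$
  - The average between-class variation before transform
$$\mathbf{S}_b = \frac{1}{N} \sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$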
ML-34
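LDA Derivations (2/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1 \; \mathbf{w}_2 \; \cdots \; \mathbf{w}_m]$ is applied
  - The sample vectors will be
$$\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$$
  - The sample mean will be
$$\bar{\mathbf{y}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \right) = \mathbf{W}^T \bar{\mathbf{x}}$$
  - The class sample means will be
$$\bar{\mathbf{y}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \bar{\mathbf{x}}_j$$
  - The average within-class variation will be
$$\tilde{\mathbf{S}}_w = \frac{1}{N} \sum_{j} N_j \left\{ \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \left( \mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j \right) \left( \mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j \right)^T \right\} = \mathbf{W}^T \left\{ \frac{1}{N} \sum_{j} N_j \boldsymbol{\Sigma}_j \right\} \mathbf{W} = \mathbf{W}^T \mathbf{S}_w \mathbf{W}$$
[Figure: the transform W maps x to y, S_w to S̃_w, and S_b to S̃_b]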
ML-35
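LDA Derivations (3/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1 \; \mathbf{w}_2 \; \cdots \; \mathbf{w}_m]$ is applied
  - Similarly, the average between-class variation will be
$$\tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$$
  - Try to find the optimal $\mathbf{W}$ such that the following objective function is maximized
$$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|}$$
    • A closed-form solution: the column vectors of an optimal matrix $\mathbf{W}$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i$$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1} \mathbf{S}_b$
$$\mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$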
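For two classes the generalized eigenproblem above has a simple closed form, because S_b then has rank one and its only non-trivial eigenvector of S_w^{-1} S_b points along S_w^{-1}(x̄_1 − x̄_2). The sketch below illustrates this on made-up two-dimensional data. The helper names are assumptions, and the 2x2 inverse assumes S_w is non-singular.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def scatter(vectors, center):
    """Sum of outer products (v - center)(v - center)^T."""
    dim = len(center)
    s = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        diff = [v[d] - center[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                s[a][b] += diff[a] * diff[b]
    return s


def within_between(classes):
    """Average within-class (S_w) and between-class (S_b) variation."""
    everything = [v for c in classes for v in c]
    n_total = len(everything)
    grand = mean(everything)
    dim = len(grand)
    s_w = [[0.0] * dim for _ in range(dim)]
    s_b = [[0.0] * dim for _ in range(dim)]
    for c in classes:
        m_j = mean(c)
        sc = scatter(c, m_j)        # equals N_j * Sigma_j
        sb = scatter([m_j], grand)  # (m_j - grand)(m_j - grand)^T
        for a in range(dim):
            for b in range(dim):
                s_w[a][b] += sc[a][b] / n_total
                s_b[a][b] += len(c) * sb[a][b] / n_total
    return s_w, s_b


def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]


def matvec(m, v):
    return [sum(m[a][b] * v[b] for b in range(len(v))) for a in range(len(m))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Made-up two-class data (illustration only).
class_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
class_b = [[5.0, 4.0], [6.0, 6.0], [7.0, 5.0]]

s_w, s_b = within_between([class_a, class_b])
diff = [a - b for a, b in zip(mean(class_a), mean(class_b))]
w = matvec(inverse_2x2(s_w), diff)  # LDA direction for two classes
lam = dot(w, matvec(s_b, w)) / dot(w, matvec(s_w, w))
residual = [p - lam * q for p, q in zip(matvec(s_b, w), matvec(s_w, w))]
print(w, lam, residual)
```

The residual S_b w − λ S_w w is numerically zero, so w satisfies the generalized eigen-equation with λ equal to the ratio of between-class to within-class variation along w.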
ML-36
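LDA Derivations (4/4)
• Proof
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|} \qquad (|\cdot| \text{ is the determinant})$$
  - Or, for each column vector $\mathbf{w}_i$ of $\mathbf{W}$, we want to find
$$\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \lambda_i, \qquad \lambda_i = \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}$$
  - The quadratic form has an optimal solution where the derivative vanishes. Using $\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^T \mathbf{C} \mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}$
$$\frac{\partial \lambda_i}{\partial \mathbf{w}_i} = \frac{2\mathbf{S}_b \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - 2\mathbf{S}_w \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i)}{(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i)^2} = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - \mathbf{S}_w \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i) = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i - \mathbf{S}_w \mathbf{w}_i \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} = \mathbf{0}$$
$$\Rightarrow \mathbf{S}_b \mathbf{w}_i - \lambda_i \mathbf{S}_w \mathbf{w}_i = \mathbf{0} \;\Rightarrow\; \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i$$
$$\Rightarrow \mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$$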
ML-37
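LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}_{\mathbf{x}} = \frac{1}{N} \sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
[Figure: after the cosine transform, covariance matrix of the 18-cepstral vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}'_{\mathbf{y}} = \frac{1}{N} \sum_{i} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$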
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: after the PCA transform, covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99's 5471 files]
[Figure: after the LDA transform, covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99's 5471 files]
Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: projections obtained by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - Clearly, HDA theoretically provides a much lower classification error than LDA
    • However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
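HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default');
surf(CoVar);

• Eigen Decomposition
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % between-class variation matrix
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % within-class variation matrix

% Use one of the two definitions of A
% LDA
IWI = inv(WI);
A = IWI*BE;
% PCA
A = BE + WI;   % why? (Prove it)

[V,D] = eig(A);      % all eigenvectors and eigenvalues
[V,D] = eigs(A,3);   % or only the 3 largest

fid = fopen('Basis','w');
for i = 1:3          % feature vector length
  for j = 1:3        % basis number
    fprintf(fid, '%10.10f ', V(i,j));
  end
  fprintf(fid, '\n');
end
fclose(fid);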
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2,000 original samples after the PCA transformation; axes: feature 1 vs. feature 2]
[Figure: distribution of the 2,000 original samples after the LDA transformation; axes: feature 1 vs. feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
    • Closely related to Principal Component Analysis (PCA)
ML-46
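LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: the feature space is projected onto an orthonormal basis $\boldsymbol{\phi}_1, \ldots, \boldsymbol{\phi}_k$
$$y_i = \boldsymbol{\phi}_i^T \mathbf{X}, \qquad \hat{\mathbf{X}} = \sum_{i=1}^{k} y_i \boldsymbol{\phi}_i, \qquad \min \|\mathbf{X} - \hat{\mathbf{X}}\|^2 \text{ for a given } k$$
  - SVD (in LSA): the word-document matrix is mapped into a k-dimensional latent semantic space
$$\mathbf{A}_{m \times n} = \mathbf{U}_{m \times r} \boldsymbol{\Sigma}_{r \times r} \mathbf{V}^T_{r \times n}, \quad r \le \min(m, n)$$
$$\mathbf{A}'_{m \times n} = \mathbf{U}'_{m \times k} \boldsymbol{\Sigma}'_{k \times k} \mathbf{V}'^T_{k \times n}, \qquad \min \|\mathbf{A} - \mathbf{A}'\|_F^2 \text{ for a given } k$$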
ML-47
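LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: projection of a vector x onto the basis vectors φ_1, φ_2, giving y_1, y_2; θ is the angle between x and φ_1]
$$y_1 = \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \frac{\boldsymbol{\phi}_1^T \mathbf{x}}{\|\boldsymbol{\phi}_1\| \, \|\mathbf{x}\|} = \boldsymbol{\phi}_1^T \mathbf{x}, \quad \text{where } \|\boldsymbol{\phi}_1\| = 1$$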
ML-48
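LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a Doc and a Query are each represented by terms and by a structure model. Labelled paths: doc expansion, query expansion, literal term matching (between the terms), and latent semantic structure retrieval (between the structure models)]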
ML-49
LSA (5/7)
ML-50
LSA (6/7)
[Figure: example for the query "human computer interaction"; one term is marked as an OOV word]
ML-51
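LSA (7/7)
• Singular Value Decomposition (SVD)
$$\mathbf{A}_{m \times n} = \mathbf{U}_{m \times r} \boldsymbol{\Sigma}_{r \times r} \mathbf{V}^T_{r \times n}$$
$$\mathbf{A}'_{m \times n} = \mathbf{U}'_{m \times k} \boldsymbol{\Sigma}_{k \times k} \mathbf{V}'^T_{k \times n}$$
  - The rows of A and A' are indexed by the words w_1, w_2, ..., w_m and the columns by the docs d_1, d_2, ..., d_n
  - Both U and V have orthonormal column vectors: $\mathbf{U}^T\mathbf{U} = \mathbf{I}_{r \times r}$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}_{r \times r}$
  - $r \le \min(m, n)$ and $k \le r$
  - $\|\mathbf{A}\|_F^2 \ge \|\mathbf{A}'\|_F^2$, where $\|\mathbf{A}\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  - $\text{Row } \mathbf{A} \subseteq R^n$, $\text{Col } \mathbf{A} \subseteq R^m$
  - Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\boldsymbol{\Sigma}_k$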
ML-52
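LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $\mathbf{A}^T\mathbf{A}$ is a symmetric n x n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \boldsymbol{\Sigma}^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$
    • All eigenvectors $\mathbf{v}_j$ are orthonormal ($\in R^n$)
$$\mathbf{V}_{n \times n} = [\mathbf{v}_1 \; \mathbf{v}_2 \; \cdots \; \mathbf{v}_n], \qquad \mathbf{v}_j^T \mathbf{v}_j = 1, \qquad \mathbf{V}^T\mathbf{V} = \mathbf{I}_{n \times n}$$
• Define singular values
$$\sigma_j = \sqrt{\lambda_j}, \quad j = 1, \ldots, n$$
  - As the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$
  - As the lengths of the vectors $\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_n$: $\sigma_1 = \|\mathbf{A}\mathbf{v}_1\|$, $\sigma_2 = \|\mathbf{A}\mathbf{v}_2\|$, ...
$$\|\mathbf{A}\mathbf{v}_i\|^2 = (\mathbf{A}\mathbf{v}_i)^T (\mathbf{A}\mathbf{v}_i) = \mathbf{v}_i^T \mathbf{A}^T \mathbf{A} \mathbf{v}_i = \mathbf{v}_i^T \lambda_i \mathbf{v}_i = \lambda_i \;\Rightarrow\; \|\mathbf{A}\mathbf{v}_i\| = \sigma_i$$
  - For $\lambda_i \ne 0$, i = 1, ..., r, $\{\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_r\}$ is an orthogonal basis of Col A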
ML-53
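LSA Derivations (2/7)
• $\{\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_r\}$ is an orthogonal basis of Col A
$$(\mathbf{A}\mathbf{v}_i) \cdot (\mathbf{A}\mathbf{v}_j) = (\mathbf{A}\mathbf{v}_i)^T (\mathbf{A}\mathbf{v}_j) = \mathbf{v}_i^T \mathbf{A}^T \mathbf{A} \mathbf{v}_j = \lambda_j \mathbf{v}_i^T \mathbf{v}_j = 0 \quad (i \ne j)$$
  - Suppose that A (or $\mathbf{A}^T\mathbf{A}$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
  - Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A
$$\mathbf{u}_i = \frac{\mathbf{A}\mathbf{v}_i}{\|\mathbf{A}\mathbf{v}_i\|} = \frac{1}{\sigma_i} \mathbf{A}\mathbf{v}_i \;\Rightarrow\; \sigma_i \mathbf{u}_i = \mathbf{A}\mathbf{v}_i \;\Rightarrow\; [\mathbf{u}_1 \; \mathbf{u}_2 \; \cdots \; \mathbf{u}_r] \, \boldsymbol{\Sigma}_r = \mathbf{A} \, [\mathbf{v}_1 \; \mathbf{v}_2 \; \cdots \; \mathbf{v}_r]$$
    where $[\mathbf{v}_1 \cdots \mathbf{v}_r]$ is an orthonormal matrix (n x r) known in advance, and $[\mathbf{u}_1 \cdots \mathbf{u}_r]$ is also an orthonormal matrix (m x r)
    • Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of $R^m$
$$[\mathbf{u}_1 \cdots \mathbf{u}_r \cdots \mathbf{u}_m] \, \boldsymbol{\Sigma} = \mathbf{A} \, [\mathbf{v}_1 \cdots \mathbf{v}_r \cdots \mathbf{v}_n] \;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V} \;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T = \mathbf{A}\mathbf{V}\mathbf{V}^T \;\Rightarrow\; \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \quad (\mathbf{V}\mathbf{V}^T = \mathbf{I}_{n \times n})$$
$$\boldsymbol{\Sigma}_{m \times n} = \begin{pmatrix} \boldsymbol{\Sigma}_{r \times r} & \mathbf{0}_{r \times (n-r)} \\ \mathbf{0}_{(m-r) \times r} & \mathbf{0}_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\|\mathbf{A}\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$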
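The construction above (eigenvectors of A^T A give V, their images under A give the singular values and U) can be traced on a very small matrix. The sketch below assumes a made-up 3x2 matrix of full column rank, so that A^T A is 2x2 and its eigendecomposition has a closed form. All helper names are illustrative.

```python
import math


def eig_sym_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]], largest eigenvalue first."""
    if abs(b) < 1e-12:  # already diagonal
        pairs = [(a, (1.0, 0.0)), (c, (0.0, 1.0))]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    mid = (a + c) / 2.0
    rad = math.hypot((a - c) / 2.0, b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        v = (b, lam - a)
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs


def gram(matrix):
    """Entries (a, b, c) of the symmetric matrix A^T A = [[a, b], [b, c]]."""
    a = sum(row[0] * row[0] for row in matrix)
    b = sum(row[0] * row[1] for row in matrix)
    c = sum(row[1] * row[1] for row in matrix)
    return a, b, c


def matvec(matrix, v):
    return [row[0] * v[0] + row[1] * v[1] for row in matrix]


# Made-up 3 x 2 matrix with rank 2 (illustration only).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

pairs = eig_sym_2x2(*gram(A))                    # eigenpairs of A^T A
sigmas = [math.sqrt(lam) for lam, _ in pairs]    # sigma_i = sqrt(lambda_i)
lengths = [math.sqrt(sum(x * x for x in matvec(A, v))) for _, v in pairs]

# u_i = A v_i / sigma_i (valid because both sigma_i are non-zero here).
lefts = [[x / s for x in matvec(A, v)] for (_, v), s in zip(pairs, sigmas)]

# Rebuild A as sum_i sigma_i * u_i * v_i^T.
rebuilt = [[sum(sigmas[i] * lefts[i][r] * pairs[i][1][c] for i in range(2))
            for c in range(2)] for r in range(3)]
frobenius_sq = sum(x * x for row in A for x in row)
print(sigmas, lengths, frobenius_sq)
```

For this matrix A^T A = [[2, 1], [1, 2]], so the singular values are sqrt(3) and 1, and the squared Frobenius norm 4 equals 3 + 1.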
ML-54
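LSA Derivations (3/7)
[Figure: A (m x n) maps R^n to R^m. V = [V_1 V_2] with V^T acting on R^n, U = [U_1 U_2] acting on R^m; the v_i's (columns of V_1) span the row space of A, the u_i's (columns of U_1) span the row space of A^T, and V_2 spans the solutions of AX = 0]
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T = \begin{pmatrix} \mathbf{U}_1 & \mathbf{U}_2 \end{pmatrix} \begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{V}_1^T \\ \mathbf{V}_2^T \end{pmatrix} = \mathbf{U}_1 \boldsymbol{\Sigma}_1 \mathbf{V}_1^T = \mathbf{A}\mathbf{V}_1\mathbf{V}_1^T$$
$$\mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}$$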
ML-55
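LSA Derivations (4/7)
• Additional Explanations
  - Each row of $\mathbf{U}$ is related to the projection of a corresponding row of $\mathbf{A}$ onto the basis formed by the columns of $\mathbf{V}$
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \;\Rightarrow\; \mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma} \;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V}$$
    • The i-th entry of a row of $\mathbf{U}$ is related to the projection of the corresponding row of $\mathbf{A}$ onto the i-th column of $\mathbf{V}$
  - Each row of $\mathbf{V}$ is related to the projection of a corresponding row of $\mathbf{A}^T$ onto the basis formed by the columns of $\mathbf{U}$
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \;\Rightarrow\; \mathbf{A}^T\mathbf{U} = (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T)^T\mathbf{U} = \mathbf{V}\boldsymbol{\Sigma}^T\mathbf{U}^T\mathbf{U} = \mathbf{V}\boldsymbol{\Sigma} \;\Rightarrow\; \mathbf{V}\boldsymbol{\Sigma} = \mathbf{A}^T\mathbf{U}$$
    • The i-th entry of a row of $\mathbf{V}$ is related to the projection of the corresponding row of $\mathbf{A}^T$ onto the i-th column of $\mathbf{U}$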
ML-56
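LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), with rows w_1, ..., w_m and columns d_1, ..., d_n
    • Compare two terms (w_i, w_j) → dot product of two rows of A, or an entry in $\mathbf{A}\mathbf{A}^T$
    • Compare two docs (d_k, d_s) → dot product of two columns of A, or an entry in $\mathbf{A}^T\mathbf{A}$
    • Compare a term and a doc → each individual entry of A
  - The new word-document matrix (A'), with $\mathbf{U}' = \mathbf{U}_{m \times k}$, $\boldsymbol{\Sigma}' = \boldsymbol{\Sigma}_k$, $\mathbf{V}' = \mathbf{V}_{n \times k}$
    • Compare two terms → dot product of two rows of $\mathbf{U}'\boldsymbol{\Sigma}'$
$$\mathbf{A}'\mathbf{A}'^T = (\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)(\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)^T = \mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T\mathbf{V}'\boldsymbol{\Sigma}'^T\mathbf{U}'^T = (\mathbf{U}'\boldsymbol{\Sigma}')(\mathbf{U}'\boldsymbol{\Sigma}')^T$$
    • Compare two docs → dot product of two rows of $\mathbf{V}'\boldsymbol{\Sigma}'$
$$\mathbf{A}'^T\mathbf{A}' = (\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)^T(\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T) = \mathbf{V}'\boldsymbol{\Sigma}'^T\mathbf{U}'^T\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T = (\mathbf{V}'\boldsymbol{\Sigma}')(\mathbf{V}'\boldsymbol{\Sigma}')^T$$
    • Compare a query word and a doc → each individual entry of A'
  - In both products the inner factor ($\mathbf{V}'^T\mathbf{V}'$ or $\mathbf{U}'^T\mathbf{U}'$) is an identity matrix, and $\boldsymbol{\Sigma}'$ is for stretching or shrinking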
ML-57
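LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m x 1 query (or doc) vector
$$\hat{\mathbf{q}}_{1 \times k} = (\mathbf{q}^T)_{1 \times m} \, \mathbf{U}_{m \times k} \, \boldsymbol{\Sigma}^{-1}_{k \times k}$$
      - The query is represented by the weighted sum of its constituent term vectors
      - The separate dimensions are differentially weighted
      - The result is just like a row of V
  - Cosine measure between the query and doc vectors (row vectors) in the latent semantic space
$$sim(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = cosine(\hat{\mathbf{q}}\boldsymbol{\Sigma}, \hat{\mathbf{d}}\boldsymbol{\Sigma}) = \frac{\hat{\mathbf{q}} \boldsymbol{\Sigma}^2 \hat{\mathbf{d}}^T}{|\hat{\mathbf{q}}\boldsymbol{\Sigma}| \, |\hat{\mathbf{d}}\boldsymbol{\Sigma}|}$$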
ML-58
LSA Derivations (7/7)
• Fold-in a new 1 x n term vector
$$\hat{\mathbf{t}}_{1 \times k} = \mathbf{t}_{1 \times n} \, \mathbf{V}_{n \times k} \, \boldsymbol{\Sigma}^{-1}_{k \times k}$$
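The fold-in formulas and the cosine measure can be sketched on a toy example. The factors U, S and V below are hard-coded as the SVD of the made-up 3-term, 2-doc matrix A = [[1, 0], [0, 1], [1, 1]] with k = 2; they are assumptions for illustration only, as are the function names.

```python
import math

# Hypothetical toy factors: the SVD of A = [[1, 0], [0, 1], [1, 1]].
s6, s2 = math.sqrt(6.0), math.sqrt(2.0)
U = [[1 / s6, 1 / s2], [1 / s6, -1 / s2], [2 / s6, 0.0]]  # row t = term t
S = [math.sqrt(3.0), 1.0]                                 # singular values
V = [[1 / s2, 1 / s2], [1 / s2, -1 / s2]]                 # row j = doc j


def fold_in(vec, basis, singular_values):
    """Row vector vec^T * basis * Sigma^{-1}."""
    return [sum(vec[r] * basis[r][c] for r in range(len(vec))) / singular_values[c]
            for c in range(len(singular_values))]


def latent_cosine(q_hat, d_hat, singular_values):
    """Cosine between q_hat * Sigma and d_hat * Sigma."""
    qs = [q * s for q, s in zip(q_hat, singular_values)]
    ds = [d * s for d, s in zip(d_hat, singular_values)]
    dot = sum(a * b for a, b in zip(qs, ds))
    return dot / (math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds)))


# Folding in an existing doc (column 0 of A) reproduces its row of V.
doc0_hat = fold_in([1.0, 0.0, 1.0], U, S)

# Folding in an existing term (row 2 of A) reproduces its row of U.
term2_hat = fold_in([1.0, 1.0], V, S)

# A query containing only term 2, compared with both docs.
q_hat = fold_in([0.0, 0.0, 1.0], U, S)
similarities = [latent_cosine(q_hat, V[j], S) for j in range(2)]
print(doc0_hat, term2_hat, similarities)
```

Because term 2 occurs once in each doc, the query is equally similar to both docs, which gives a quick sanity check on the weighting by Σ.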
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (row: term; column: doc)
        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0
  - Each entry is weighted by its TFxIDF score
  - The same matrix in the sparse text input format
        4 3 6      (no. of rows, no. of columns, no. of nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, ..., 600, etc.) of LSA dimensionality
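The column-by-column layout shown above can be produced from a dense matrix with a few lines of code. This is a minimal sketch that only mirrors the layout on the slide; the function name `to_sparse_text` is made up, and the sketch is not part of SVDLIBC itself.

```python
def to_sparse_text(dense):
    """Serialize a dense matrix in the column-wise sparse text layout."""
    n_rows = len(dense)
    n_cols = len(dense[0])
    columns = []
    for c in range(n_cols):
        # (row index, value) pairs of the nonzero entries in column c
        entries = [(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0.0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)

    lines = ["%d %d %d" % (n_rows, n_cols, total)]  # header: rows cols nonzeros
    for entries in columns:
        lines.append(str(len(entries)))             # nonzeros in this column
        for r, value in entries:
            lines.append("%d %s" % (r, value))      # row index and value
    return "\n".join(lines) + "\n"


term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(term_doc)
print(text)
```

Writing `text` to a file gives an input in the same layout as the 4-term, 3-doc example.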
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852     (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ......
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of the sparse matrix input
  - Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
        100 51253
        0.003 0.001 ......
        0.002 0.002 ......
  - Word vector (u^T): 1 x 100; 51253 words
• LSA100-S
        100
        2686.18
        829.941
        559.59
        ...
  - 100 singular values
• LSA100-Vt
        100 2265
        0.021 0.035 ......
        0.012 0.022 ......
  - Doc vector (v^T): 1 x 100; 2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold in a new m×1 query vector $q$ (TFxIDF weighted beforehand)
$$\hat{q}_{1\times k} = \left(q^{T}\right)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
– The query is represented by the weighted sum of its constituent term vectors
– The separate dimensions are differentially weighted
– $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q},\hat{d}) = cosine(\hat{q}\Sigma,\hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^{2}\,\hat{d}^{T}}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
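The fold-in and cosine formulas can be checked numerically. The sketch below uses NumPy's SVD rather than the SVDLIBC output files, reuses the small 4×3 term-doc matrix from the toolkit example, and picks k = 2; the helper names `fold_in` and `latent_cosine` are hypothetical. Folding in a column of the original matrix should give back the corresponding row of V, which is what "just like a row of V" means.

```python
import numpy as np


def fold_in(q, U_k, s_k):
    """q_hat = q^T U_k Sigma_k^{-1}: map an m-dim term vector into the k-dim space."""
    return (q @ U_k) / s_k


def latent_cosine(q_hat, d_hat, s_k):
    """Cosine between q_hat * Sigma_k and d_hat * Sigma_k."""
    a, b = q_hat * s_k, d_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                            # assumed latent dimensionality
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T     # rows of V_k are doc vectors

q = A[:, 0]                                      # reuse doc 0 as the query
q_hat = fold_in(q, U_k, s_k)
sims = [latent_cosine(q_hat, V_k[j], s_k) for j in range(A.shape[1])]
print(q_hat, sims)
```

Because the query is doc 0 itself, its folded-in vector coincides with row 0 of V_k and its similarity to doc 0 is 1.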
ML-9
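PCA Derivations (3/13)
– Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
$$\varphi_i^{T}\varphi_j = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases}$$
• Such that $y_i$ is equal to the projection of $x$ on $\varphi_i$
$$\forall i:\quad y_i = x^{T}\varphi_i = \varphi_i^{T}x$$
[Figure: a vector $x$ projected onto the axes $\varphi_1$ and $\varphi_2$, giving $y_1$ and $y_2$; $\theta_1$ is the angle between $x$ and $\varphi_1$]
$$y_1 = \|x\|\cos\theta_1 = \|x\|\,\frac{\varphi_1^{T}x}{\|\varphi_1\|\,\|x\|} = \varphi_1^{T}x, \quad \text{where } \|\varphi_1\| = 1$$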
ML-10
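PCA Derivations (4/13)
– Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
• $y_i$ also has the following properties
– Its mean is zero too
$$E[y_i] = E[\varphi_i^{T}x] = \varphi_i^{T}E[x] = \varphi_i^{T}0 = 0$$
– Its variance is
$$\sigma_i^{2} = E[y_i^{2}] - \left(E[y_i]\right)^{2} = E[y_i^{2}] = E[\varphi_i^{T}xx^{T}\varphi_i] = \varphi_i^{T}E[xx^{T}]\varphi_i = \varphi_i^{T}R\varphi_i$$
where $R$ is the (auto-)correlation matrix of $x$
• The correlation between two projections $y_i$ and $y_j$ is
$$E[y_iy_j] = E\left[(\varphi_i^{T}x)(\varphi_j^{T}x)^{T}\right] = E[\varphi_i^{T}xx^{T}\varphi_j] = \varphi_i^{T}E[xx^{T}]\varphi_j = \varphi_i^{T}R\varphi_j$$
Note:
$$R = E[xx^{T}] \approx \frac{1}{N}\sum_{i=1}^{N} x_ix_i^{T}$$
$$\Sigma = E\left[(x-\mu)(x-\mu)^{T}\right] = E[xx^{T}] - \mu\mu^{T} \approx \left(\frac{1}{N}\sum_{i=1}^{N} x_ix_i^{T}\right) - \mu\mu^{T}$$
so $R = \Sigma$ when $\mu = 0$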
ML-11
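PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
– We want to choose only m of the $\varphi_i$'s such that we can still approximate $x$ well under the mean-squared error criterion
$$x = \sum_{i=1}^{n} y_i\varphi_i = \sum_{i=1}^{m} y_i\varphi_i + \sum_{j=m+1}^{n} y_j\varphi_j \quad \text{(original vector)}$$
$$\hat{x}(m) = \sum_{i=1}^{m} y_i\varphi_i \quad \text{(reconstructed vector)}$$
$$\varepsilon(m) = E\left[\|x-\hat{x}(m)\|^{2}\right] = E\left\{\left(\sum_{j=m+1}^{n} y_j\varphi_j\right)^{T}\left(\sum_{k=m+1}^{n} y_k\varphi_k\right)\right\} = E\left\{\sum_{j=m+1}^{n}\sum_{k=m+1}^{n} y_jy_k\,\varphi_j^{T}\varphi_k\right\} = \sum_{j=m+1}^{n} E[y_j^{2}] = \sum_{j=m+1}^{n}\sigma_j^{2} = \sum_{j=m+1}^{n}\varphi_j^{T}R\varphi_j$$
– The double sum collapses because $\varphi_j^{T}\varphi_k = 1$ if $j = k$ and $0$ if $j \neq k$
– $E[y_j] = 0$, so $\sigma_j^{2} = E[y_j^{2}] - \left(E[y_j]\right)^{2} = E[y_j^{2}]$
– We should discard the bases where the projections have lower variances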
ML-12
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
– If the orthonormal (basis) set $\varphi_i$'s is selected to be the eigenvectors of the correlation matrix $R$, associated with eigenvalues $\lambda_i$'s
• They will have the property that
$$R\varphi_j = \lambda_j\varphi_j$$
– Such that the mean-squared error mentioned above will be
$$\varepsilon(m) = \sum_{j=m+1}^{n}\sigma_j^{2} = \sum_{j=m+1}^{n}\varphi_j^{T}R\varphi_j = \sum_{j=m+1}^{n}\varphi_j^{T}\lambda_j\varphi_j = \sum_{j=m+1}^{n}\lambda_j$$
– $R$ is real and symmetric, therefore its eigenvectors form an orthonormal set
– $R$ is positive definite ($x^{T}Rx > 0$) => all eigenvalues are positive
ML-13
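PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
– If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be
$$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n}\lambda_j, \quad \text{where } \lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$$
– Any two projections $y_i$ and $y_j$ ($i \neq j$) will be mutually uncorrelated
$$E[y_iy_j] = E\left[(\varphi_i^{T}x)(\varphi_j^{T}x)^{T}\right] = E[\varphi_i^{T}xx^{T}\varphi_j] = \varphi_i^{T}R\varphi_j = \lambda_j\varphi_i^{T}\varphi_j = 0$$
• Good news for most statistical modeling approaches
– Gaussians and diagonal matrices: the full covariance matrix
$$\begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
becomes diagonal after the transform
Note:
$$R = E[xx^{T}] \approx \frac{1}{N}\sum_{i=1}^{N} x_ix_i^{T}, \qquad \Sigma = E\left[(x-\mu)(x-\mu)^{T}\right] = E[xx^{T}] - \mu\mu^{T} \approx \left(\frac{1}{N}\sum_{i=1}^{N} x_ix_i^{T}\right) - \mu\mu^{T}$$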
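The two claims of the minimum mean-squared error slides, that the reconstruction error equals the sum of the discarded eigenvalues and that the projections are mutually uncorrelated, can be verified on synthetic data. This is a minimal NumPy sketch; the data, the sizes n = 5, N = 2000 and the choice m = 2 are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, m = 5, 2000, 2
# Synthetic correlated samples, one sample per row.
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, n))
X -= X.mean(axis=0)                       # subtract the mean: x is zero mean

R = X.T @ X / N                           # sample (auto-)correlation matrix
lam, Phi = np.linalg.eigh(R)              # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]
lam, Phi = lam[order], Phi[:, order]      # sort so that lam[0] >= lam[1] >= ...

Y = X @ Phi                               # projections y_i = phi_i^T x
X_hat = Y[:, :m] @ Phi[:, :m].T           # reconstruction from the first m bases
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
corr_Y = Y.T @ Y / N                      # correlation matrix of the projections

print(mse, lam[m:].sum())
```

Since R is estimated from the same samples, the identity holds exactly (up to rounding) rather than only in expectation, and `corr_Y` is the diagonal matrix of eigenvalues.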
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
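PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
– It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
Define the objective function (the first term is to be minimized, the second term carries the constraints):
$$J = \sum_{j=m+1}^{n}\varphi_j^{T}R\varphi_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n}\mu_{jk}\left(\varphi_j^{T}\varphi_k - \delta_{jk}\right), \qquad \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases}$$
Take the derivative, using $\frac{\partial\,\varphi^{T}R\varphi}{\partial\varphi} = 2R\varphi$:
$$\forall\, m+1 \le j \le n:\quad \frac{\partial J}{\partial\varphi_j} = 2R\varphi_j - 2\sum_{k=m+1}^{n}\mu_{jk}\varphi_k = 0$$
$$\Rightarrow\ \forall\, m+1 \le j \le n:\quad R\varphi_j = \Phi_{n-m}\,\mu_j, \quad \text{where } \Phi_{n-m} = [\varphi_{m+1}\ \cdots\ \varphi_n],\ \mu_j = [\mu_{j,m+1}\ \cdots\ \mu_{jn}]^{T}$$
$$\Rightarrow\ R\,\Phi_{n-m} = \Phi_{n-m}U_{n-m}, \quad \text{where } U_{n-m} = [\mu_{m+1}\ \cdots\ \mu_n]$$
$$\Rightarrow\ \Phi_{n-m}^{T}R\,\Phi_{n-m} = U_{n-m}$$
– A particular solution is obtained if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $R$, and $\varphi_{m+1}, \ldots, \varphi_n$ are their corresponding eigenvectors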
ML-16
PCA Derivations (10/13)
• Given an input vector $x$ with dimension n
– Try to construct a linear transform $\Phi'$ ($\Phi'$ is an n×m matrix, m<n) such that the truncation result $\Phi'^{T}x$ is optimal under the mean-squared error criterion
$$\text{minimize } E_x\left[(x-\hat{x})^{T}(x-\hat{x})\right]$$
– Encoder: $x = [x_1\ x_2\ \cdots\ x_n]^{T}\ \rightarrow\ y = [y_1\ y_2\ \cdots\ y_m]^{T}$
$$y = \Phi'^{T}x, \quad \text{where } \Phi' = [e_1\ e_2\ \cdots\ e_m]$$
– Decoder: $y = [y_1\ y_2\ \cdots\ y_m]^{T}\ \rightarrow\ \hat{x} = [\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n]^{T}$
$$\hat{x} = \Phi'y$$
ML-17
PCA Derivations (11/13)
• Data compression in communication
– PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
– PCA needs no prior information (e.g., class distributions or output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
– The plot of variance as a function of the number of eigenvectors kept
• Select $m$ such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
• Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n}\lambda_i$$
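Both selection rules are simple to state in code. The sketch below is plain Python; the function names and the eigenvalue spectrum are made up for illustration, and the 0.85 threshold is an arbitrary assumption.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose m largest eigenvalues hold at least `threshold`
    of the total variance (hypothetical helper)."""
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    running = 0.0
    for m, value in enumerate(lam, start=1):
        running += value
        if running / total >= threshold:
            return m
    return len(lam)


def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for value in eigenvalues if value >= mean)


eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]       # hypothetical spectrum, total 8.0
m_ratio = choose_m_by_ratio(eigenvalues, 0.85)
m_avg = choose_m_by_average(eigenvalues)
print(m_ratio, m_avg)
```

For this spectrum the cumulative ratios are 0.5, 0.75 and 0.875, so the ratio rule keeps three eigenvectors, while the average rule (average 1.6) keeps two. The two rules need not agree.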
ML-19
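PCA Derivations (13/13)
• PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal
$$J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^{T}S_wW + W^{T}S_bW$$
$$\tilde{S}_w = W^{T}S_wW, \qquad \tilde{S}_b = W^{T}S_bW$$
$$S_w = \frac{1}{N}\sum_{j} N_j\Sigma_j, \qquad S_b = \frac{1}{N}\sum_{j} N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^{T}, \qquad S = \frac{1}{N}\sum_{i}(x_i-\bar{x})(x_i-\bar{x})^{T}$$
– $i$: sample index; $j$: class index
• Try to show that $S = S_w + S_b$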
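Before proving $S = S_w + S_b$ by hand, the identity can be checked numerically. This is a minimal NumPy sketch on synthetic two-class data; the function name `scatter_matrices` and the data are assumptions for illustration, and the matrices follow the definitions on this slide (all normalized by N).

```python
import numpy as np


def scatter_matrices(X, labels):
    """Return (S, S_w, S_b) for samples in the rows of X (hypothetical helper)."""
    N = len(X)
    mean = X.mean(axis=0)
    S = (X - mean).T @ (X - mean) / N                # total variation
    S_w = np.zeros_like(S)
    S_b = np.zeros_like(S)
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj = len(Xj)
        mean_j = Xj.mean(axis=0)
        cov_j = (Xj - mean_j).T @ (Xj - mean_j) / Nj  # class covariance Sigma_j
        S_w += Nj * cov_j / N                         # within-class variation
        diff = (mean_j - mean)[:, None]
        S_b += Nj * (diff @ diff.T) / N               # between-class variation
    return S, S_w, S_b


rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(40, 3)),
               rng.normal(2.0, 1.0, size=(60, 3))])
labels = np.array([0] * 40 + [1] * 60)
S, S_w, S_b = scatter_matrices(X, labels)
print(np.abs(S - (S_w + S_b)).max())
```

The printed difference is at rounding-error level, which is consistent with the identity holding exactly for sample statistics.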
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (1/2)
• Example 3: Image Coding
[Figure: image coding example; dimension labels 256 × 256 and 8 × 8]
ML-23
PCA Examples Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure labels: (value reduction), (feature reduction)]
ML-24
PCA Examples Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland 1991)
– Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
– Steps
• Convert each of the L reference images into a vector of floating point numbers representing the light intensity in each pixel
$$x_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix},\quad x_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix},\quad \ldots,\quad x_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$$
• Calculate the covariance/correlation matrix between these reference vectors
• Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
• Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
– Steps
• The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{x}_i = \bar{x} + w_{i1}e(1) + w_{i2}e(2) + \cdots + w_{iK}e(K) \ \Rightarrow\ y_i = [w_{i1}\ w_{i2}\ \cdots\ w_{iK}]$$
where $y_i$ is the feature vector of a person $i$
– The Eigenface approach follows the minimum mean-squared error criterion
– Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variations between faces
ML-26
PCA Examples Eigenface (3/4)
[Figure: the averaged face]
[Figure: face images as the training set]
ML-27
PCA Examples Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variations between faces)]
[Figure: a projected face image]
ML-28
PCA Examples Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL 2000)
– Steps
• Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
• The regarded parameters are the SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 data … Speaker R data, together with the SI HMM model, go through model training to give Speaker 1 HMM … Speaker R HMM; each speaker model yields a supervector of dimension D = (M·n)×1, and Principal Component Analysis is applied to the supervectors]
• Each new speaker S is represented by a point P in K-space
$$P_i = e(0) + w_{i1}e(1) + w_{i2}e(2) + \cdots + w_{iK}e(K)$$
ML-29
PCA Examples Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
– Dimension 1 (eigenvoice 1)
• Correlates with pitch or sex
– Dimension 2 (eigenvoice 2)
• Correlates with amplitude
– Dimension 3 (eigenvoice 3)
• Correlates with second-formant movement
Note that Eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
– Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation over the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, and each of them belongs to one of the J classes
$$g(x_i) = j,\quad j \in \{1, 2, \ldots, J\},\quad g(\cdot) \text{ is the class index}$$
– The sample mean is
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
– The class sample means are
$$\bar{x}_j = \frac{1}{N_j}\sum_{g(x_i)=j} x_i$$
– The class sample covariances are
$$\Sigma_j = \frac{1}{N_j}\sum_{g(x_i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^{T}$$
– The average within-class variation before transform
$$S_w = \frac{1}{N}\sum_{j} N_j\Sigma_j$$
– The average between-class variation before transform
$$S_b = \frac{1}{N}\sum_{j} N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^{T}$$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
– The sample vectors will be
$$y_i = W^{T}x_i$$
– The sample mean will be
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} W^{T}x_i = W^{T}\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = W^{T}\bar{x}$$
– The class sample means will be
$$\bar{y}_j = \frac{1}{N_j}\sum_{g(x_i)=j} W^{T}x_i = W^{T}\bar{x}_j$$
– The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N}\sum_{j} N_j\left\{\frac{1}{N_j}\sum_{g(x_i)=j}\left(W^{T}x_i - W^{T}\bar{x}_j\right)\left(W^{T}x_i - W^{T}\bar{x}_j\right)^{T}\right\} = W^{T}\left\{\frac{1}{N}\sum_{j} N_j\Sigma_j\right\}W = W^{T}S_wW$$
[Figure: the transform $W$ maps $x$ to $y$, $S_w$ to $\tilde{S}_w$ and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
– Similarly, the average between-class variation will be
$$\tilde{S}_b = W^{T}S_bW$$
– Try to find the optimal $W$ such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^{T}S_bW|}{|W^{T}S_wW|}$$
• A closed-form solution: the column vectors $w_i$'s of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_bw_i = \lambda_iS_ww_i$$
• That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1}S_b$
$$S_w^{-1}S_bw_i = \lambda_iw_i$$
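The closed-form solution can be sketched with a plain eigendecomposition of $S_w^{-1}S_b$. The NumPy code below is an illustration on synthetic two-class data, not the procedure used for the experiments in these slides; the function name `lda_transform` and the data are assumptions, and $S_w$ is assumed to be invertible.

```python
import numpy as np


def lda_transform(X, labels, m):
    """Return the m leading eigenvalues and eigenvectors of S_w^{-1} S_b,
    together with S_w and S_b (hypothetical helper)."""
    N, n = X.shape
    mean = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mean_j = Xj.mean(axis=0)
        S_w += (Xj - mean_j).T @ (Xj - mean_j) / N     # equals N_j * Sigma_j / N
        d = (mean_j - mean)[:, None]
        S_b += len(Xj) * (d @ d.T) / N
    # Solve S_w M = S_b instead of forming the inverse explicitly.
    lam, vecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(lam.real)[::-1][:m]             # largest eigenvalues first
    return lam.real[order], vecs.real[:, order], S_w, S_b


rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0, 0], 1.0, size=(50, 3)),
               rng.normal([3, 1, 0], 1.0, size=(50, 3))])
labels = np.array([0] * 50 + [1] * 50)
lam, W, S_w, S_b = lda_transform(X, labels, m=1)
Y = X @ W                                              # y_i = W^T x_i, one row per sample
print(lam, W.ravel())
```

With J classes, $S_b$ has rank at most J − 1, so only J − 1 eigenvalues can be nonzero; for the two classes here a single direction carries all of the discriminative information.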
ML-36
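LDA Derivations (4/4)
• Proof
$$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W}\frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W}\frac{|W^{T}S_bW|}{|W^{T}S_wW|} \quad (|\cdot| \text{ denotes the determinant})$$
Or, for each column vector $w_i$ of $W$, we want to find that
$$\hat{w}_i = \arg\max_{w_i}\frac{w_i^{T}S_bw_i}{w_i^{T}S_ww_i}$$
The quadratic form has an optimal solution where the derivative vanishes. Using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^{2}}$ and $\frac{d\,(x^{T}Cx)}{dx} = (C + C^{T})x$:
$$\frac{\partial}{\partial w_i}\left(\frac{w_i^{T}S_bw_i}{w_i^{T}S_ww_i}\right) = 0$$
$$\Rightarrow\ \frac{2S_bw_i\left(w_i^{T}S_ww_i\right) - 2S_ww_i\left(w_i^{T}S_bw_i\right)}{\left(w_i^{T}S_ww_i\right)^{2}} = 0$$
$$\Rightarrow\ \frac{S_bw_i}{w_i^{T}S_ww_i} - \frac{S_ww_i\left(w_i^{T}S_bw_i\right)}{\left(w_i^{T}S_ww_i\right)^{2}} = 0$$
$$\Rightarrow\ S_bw_i - \lambda_iS_ww_i = 0, \quad \text{where } \lambda_i = \frac{w_i^{T}S_bw_i}{w_i^{T}S_ww_i}$$
$$\Rightarrow\ S_bw_i = \lambda_iS_ww_i\ \Rightarrow\ S_w^{-1}S_bw_i = \lambda_iw_i$$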
ML-37
LDA Examples Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\Sigma_x = \frac{1}{N}\sum_{i}(x_i-\bar{x})(x_i-\bar{x})^{T}$$
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), calculated using Year-99's 5471 files]
$$\Sigma'_y = \frac{1}{N}\sum_{i}(y_i-\bar{y})(y_i-\bar{y})^{T}$$
ML-38
LDA Examples Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]
Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs LDA (1/2)
[Figure: projections obtained by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
– In theory, HDA clearly provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
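HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar = [3.0 0.5 0.4;
         0.9 6.3 0.2;
         0.4 0.4 4.2];
colormap('default');
surf(CoVar);
• Eigen Decomposition
BE = [3.0 3.5 1.4;
      1.9 6.3 2.2;
      2.4 0.4 4.2];        % between-class variation
WI = [4.0 4.1 2.1;
      2.9 8.7 3.5;
      4.4 3.2 4.3];        % within-class variation

% LDA
IWI = inv(WI);
A = IWI*BE;
% PCA
A = BE+WI;                 % why? (Prove it)

[V,D] = eig(A);            % all eigenvectors and eigenvalues
[V,D] = eigs(A,3);         % or: the 3 largest-magnitude eigenvalues only

fid = fopen('Basis','w');
for i = 1:3                % feature vector length
    for j = 1:3            % basis number
        fprintf(fid,'%10.10f ',V(i,j));
    end
    fprintf(fid,'\n');
end
fclose(fid);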
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original data samples after the LDA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
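LSA (2/7)
• Dimension Reduction and Feature Extraction
– PCA: project $X$ from the n-dimensional feature space onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$
$$y_i = \varphi_i^{T}X, \qquad \hat{X} = \sum_{i=1}^{k} y_i\varphi_i, \qquad \min\|X-\hat{X}\|^{2} \text{ for a given } k$$
– SVD (in LSA): approximate $A_{m\times n}$ by $A'_{m\times n} = U'\Sigma'V'^{T}$, where $U'$ is m×r, $\Sigma'$ is r×r, $V'^{T}$ is r×n and $r \le \min(m,n)$; only the leading k×k block of $\Sigma'$ is kept, which defines the k-dimensional latent semantic space
$$\min\|A-A'\|_F^{2} \text{ for a given } k$$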
ML-47
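LSA (3/7)
– Singular Value Decomposition (SVD) used for the word-document matrix
• A least-squares method for dimension reduction
[Figure: projection of a vector $x$ onto the axes $\varphi_1$ and $\varphi_2$, giving $y_1$ and $y_2$]
$$y_1 = \|x\|\cos\theta_1 = \|x\|\,\frac{\varphi_1^{T}x}{\|\varphi_1\|\,\|x\|} = \varphi_1^{T}x, \quad \text{where } \|\varphi_1\| = 1$$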
ML-48
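LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each represented by its terms and by a structure model. Literal term matching links doc terms to query terms; doc expansion and query expansion enrich the terms on either side; latent semantic structure retrieval matches the two structure models]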
ML-49
LSA (5/7)
ML-50
LSA (6/7)
[Figure: example with the query "human computer interaction", which contains an OOV word]
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^{T}_{r\times n}, \qquad r \le \min(m,n)$$
– Rows of $A$ correspond to the words $w_1, w_2, \ldots, w_m$; columns correspond to the docs $d_1, d_2, \ldots, d_n$
– Row $A \in R^{n}$, Col $A \in R^{m}$
– Both $U$ and $V$ have orthonormal column vectors: $U^{T}U = I_{r\times r}$, $V^{T}V = I_{r\times r}$
$$A'_{m\times n} = U'_{m\times k}\,\Sigma_{k\times k}\,V'^{T}_{k\times n}, \qquad k \le r$$
$$\|A\|_F^{2} \ge \|A'\|_F^{2}, \quad \text{where } \|A\|_F^{2} = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^{2}$$
– Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
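LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
– $A^{T}A$ is a symmetric n×n matrix
• All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^{2} = diag(\lambda_1, \lambda_2, \ldots, \lambda_n)$$
• All eigenvectors $v_j$ are orthonormal ($\in R^{n}$)
$$V_{n\times n} = [v_1\ v_2\ \cdots\ v_n], \qquad v_j^{T}v_j = 1, \qquad V^{T}V = I_{n\times n}$$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$
– As the square roots of the eigenvalues of $A^{T}A$
– As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
$$\|Av_i\|^{2} = v_i^{T}A^{T}Av_i = v_i^{T}\lambda_iv_i = \lambda_i\ \Rightarrow\ \|Av_i\| = \sigma_i$$
• For $\lambda_i \neq 0$, $i = 1, \ldots, r$: $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$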
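The two equivalent definitions of the singular values can be confirmed numerically. The NumPy sketch below reuses the 4×3 term-doc matrix of the toolkit example purely as an illustration; any real matrix would do.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

lam, V = np.linalg.eigh(A.T @ A)            # A^T A is symmetric, so eigh applies
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]            # lambda_1 >= lambda_2 >= ... >= lambda_n

sigma = np.sqrt(np.clip(lam, 0.0, None))    # clip guards against tiny negative rounding
lengths = np.linalg.norm(A @ V, axis=0)     # ||A v_1||, ||A v_2||, ..., ||A v_n||

print(sigma, lengths, np.linalg.svd(A, compute_uv=False))
```

All three printed vectors agree: the square roots of the eigenvalues of $A^{T}A$, the lengths $\|Av_j\|$, and the singular values returned by a library SVD.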
ML-53
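LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
$$Av_i \cdot Av_j = (Av_i)^{T}Av_j = v_i^{T}A^{T}Av_j = \lambda_jv_i^{T}v_j = 0 \quad (i \neq j)$$
– Suppose that $A$ (or $A^{T}A$) has rank $r \le n$
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
– Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
$$u_i = \frac{1}{\|Av_i\|}Av_i = \frac{1}{\sigma_i}Av_i\ \Rightarrow\ \sigma_iu_i = Av_i\ \Rightarrow\ [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$$
where $V = [v_1\ \cdots\ v_r]$ is an orthonormal matrix (n×r) known in advance, and $U = [u_1\ \cdots\ u_r]$ is also an orthonormal matrix (m×r)
• Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^{m}$
$$[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n]\ \Rightarrow\ U\Sigma = AV\ \Rightarrow\ U\Sigma V^{T} = AVV^{T}\ \Rightarrow\ A = U\Sigma V^{T} \quad (VV^{T} = I_{n\times n})$$
$$\Sigma_{m\times n} = \begin{pmatrix} \Sigma_{r\times r} & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}$$
$$\|A\|_F^{2} = \sigma_1^{2} + \sigma_2^{2} + \cdots + \sigma_r^{2}, \quad \text{where } \|A\|_F^{2} = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^{2}$$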
ML-54
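LSA Derivations (3/7)
$$A_{m\times n} = U\Sigma V^T = \begin{pmatrix}U_1 & U_2\end{pmatrix}\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} = U_1\Sigma_1V_1^T = AV_1V_1^T$$
[Figure: $A$ maps $R^n$ to $R^m$, with $AV = U\Sigma$. The vectors $v_i$ (columns of $V_1$) span the row space of $A$; the vectors $u_i$ (columns of $U_1$) span the row space of $A^T$; $V_2$ corresponds to the solutions of $AX = 0$.]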
ML-55
LSA Derivations (4/7)
• Additional Explanations
– Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
$$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^TV = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
  • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
– Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
$$A = U\Sigma V^T \;\Rightarrow\; A^TU = (U\Sigma V^T)^TU = V\Sigma^TU^TU = V\Sigma \;\Rightarrow\; V\Sigma = A^TU$$
  • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
– The original word-document matrix ($A$, $m\times n$; rows $w_1, \ldots, w_m$, columns $d_1, \ldots, d_n$)
  • compare two terms → dot product of two rows of $A$ (or an entry in $AA^T$)
  • compare two docs → dot product of two columns of $A$ (or an entry in $A^TA$)
  • compare a term and a doc → each individual entry of $A$
– The new word-document matrix ($A'$)
  • compare two terms → dot product of two rows of $U'\Sigma'$
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$$
  • compare two docs → dot product of two rows of $V'\Sigma'$
$$A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
  • compare a query word and a doc → each individual entry of $A'$
– Here $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$ (for stretching or shrinking), $V' = V_{n\times k}$, and $V'^TV' = U'^TU' = I$
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
– For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m\times 1$ query (or doc) vector
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
– The query is represented by the weighted sum of its constituent term vectors
– The separate dimensions are differentially weighted
– $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
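The fold-in and cosine formulas can be sketched in a few lines of plain Python. The matrices below are a hand-made toy decomposition (a $3\times 2$ matrix $U$ with orthonormal columns, two singular values, and two doc rows of $V$), chosen only so that the arithmetic is easy to follow; they are not output of any real SVD run, and the function names are illustrative.

```python
import math

def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}; q has m entries, U is m x k, sigma holds k singular values."""
    k = len(sigma)
    return [sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j] for j in range(k)]

def latent_cosine(q_hat, d_hat, sigma):
    """Cosine between the row vectors q_hat * Sigma and d_hat * Sigma."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [b * s for b, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    return dot / (math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds)))

# Toy decomposition A = U Sigma V^T (assumed values, for illustration only)
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # m = 3 terms, k = 2
sigma = [2.0, 1.0]
V = [[0.6, 0.8], [0.8, -0.6]]              # one row per doc

# Fold in a query that equals the first column of A = U Sigma V^T
q = [1.2, 0.8, 0.0]
q_hat = fold_in(q, U, sigma)               # lands on the first row of V
sim_doc0 = latent_cosine(q_hat, V[0], sigma)
sim_doc1 = latent_cosine(q_hat, V[1], sigma)
```

Folding in a column that was already part of $A$ returns its own row of $V$, which is a quick sanity check on the formula.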
ML-58
LSA Derivations (7/7)
• Fold-in a new $1\times n$ term vector
$$\hat{t}_{1\times k} = t_{1\times n}\,V_{n\times k}\,\Sigma^{-1}_{k\times k}$$
ML-59
LSA Example
• Experimental results
– HMM is consistently better than VSM at all recall levels
– LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
– A clean formal framework and a clearly defined optimization criterion (least-squares)
  • Conceptual simplicity and clarity
– Handles synonymy problems ("heterogeneous vocabulary")
– Good results for high-recall search
  • Takes term co-occurrence into account
• Disadvantages
– High computational complexity
– LSA offers only a partial solution to polysemy
  • E.g. bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
– E.g. 4 terms and 3 docs (row = term, column = doc)
      2.3  0.0  4.2
      0.0  1.3  2.2
      3.8  0.0  0.5
      0.0  0.0  0.0
– Each entry is weighted by its TFxIDF score
– The same matrix in sparse text form:
      4 3 6      (4 terms, 3 docs, 6 nonzero entries)
      2          (2 nonzero entries at Col 0)
      0 2.3      (Col 0, Row 0)
      2 3.8      (Col 0, Row 2)
      1          (1 nonzero entry at Col 1)
      1 1.3      (Col 1, Row 1)
      3          (3 nonzero entries at Col 2)
      0 4.2      (Col 2, Row 0)
      1 2.2      (Col 2, Row 1)
      2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, …, 600, etc.) of LSA dimensionality
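A minimal sketch of how a dense term-doc matrix could be written out in the column-oriented sparse text layout shown on this slide (header line, then for each column a count followed by "row value" pairs). The function name `to_sparse_text` is made up for this example and is not part of SVDLIBC.

```python
def to_sparse_text(dense):
    """Serialise a dense term-by-doc matrix (list of rows) column by column."""
    n_rows, n_cols = len(dense), len(dense[0])
    columns = []
    for c in range(n_cols):
        # Keep only the nonzero entries of this column, as (row index, value)
        entries = [(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0.0]
        columns.append(entries)
    total = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {total}"]          # header: rows, cols, nonzeros
    for entries in columns:
        lines.append(str(len(entries)))             # nonzero count for this column
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines)

# The 4-term, 3-doc example matrix from the slide
matrix = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(matrix)
```

Writing `text` to a file would give an input in the same layout as the slide's example; whether a given SVDLIBC build accepts it unchanged should be checked against the library's own documentation.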
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
      51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
      77
      508 7.725771
      596 16.213399
      612 13.080868
      709 7.725771
      713 7.725771
      744 7.725771
      1190 7.725771
      1200 16.213399
      1259 7.725771
      ……
• SVD command (IR_svd.bat)
      svd -r st -o LSA100 -d 100 Term-Doc-Matrix
– -r st: sparse matrix input
– -o LSA100: prefix of output files
– -d 100: no. of reserved eigenvectors
– Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; word vector ($u^T$): 1×100)
      100 51253
      0.003 0.001 ……
      0.002 0.002 ……
• LSA100-S (100 eigenvalues)
      10026861882994155959…
• LSA100-Vt (2265 docs; doc vector ($v^T$): 1×100)
      100 2265
      0.021 0.035 ……
      0.012 0.022 ……
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m\times 1$ query vector ($q$ is TFxIDF weighted beforehand)
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
– The query is represented by the weighted sum of its constituent term vectors
– The separate dimensions are differentially weighted
– $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
ML-10
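PCA Derivations (4/13)
– Further assume the column (basis) vectors of the matrix $\Phi$ form an orthonormal set
• $y_i$ also has the following properties
– Its mean is zero, too
$$E[y_i] = E[\varphi_i^T\mathbf{x}] = \varphi_i^TE[\mathbf{x}] = \varphi_i^T\mathbf{0} = 0$$
– Its variance is
$$\sigma_i^2 = E[y_i^2] - (E[y_i])^2 = E[y_i^2] = E[\varphi_i^T\mathbf{x}\mathbf{x}^T\varphi_i] = \varphi_i^TE[\mathbf{x}\mathbf{x}^T]\varphi_i = \varphi_i^TR\,\varphi_i$$
  where $R$ is the (auto-)correlation matrix of $\mathbf{x}$:
$$R = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N}\sum_i \mathbf{x}_i\mathbf{x}_i^T, \qquad \Sigma = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T] = E[\mathbf{x}\mathbf{x}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T \approx \Big(\frac{1}{N}\sum_i \mathbf{x}_i\mathbf{x}_i^T\Big) - \boldsymbol{\mu}\boldsymbol{\mu}^T$$
• The correlation between two projections $y_i$ and $y_j$ is
$$E[y_iy_j] = E[(\varphi_i^T\mathbf{x})(\varphi_j^T\mathbf{x})^T] = E[\varphi_i^T\mathbf{x}\mathbf{x}^T\varphi_j] = \varphi_i^TE[\mathbf{x}\mathbf{x}^T]\varphi_j = \varphi_i^TR\,\varphi_j$$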
ML-11
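PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
– We want to choose only $m$ of the $\varphi_i$'s such that we can still approximate $\mathbf{x}$ well under the mean-squared error criterion
$$\mathbf{x} = \sum_{i=1}^{n} y_i\varphi_i = \sum_{i=1}^{m} y_i\varphi_i + \sum_{j=m+1}^{n} y_j\varphi_j \quad \text{(original vector)}$$
$$\hat{\mathbf{x}}(m) = \sum_{i=1}^{m} y_i\varphi_i \quad \text{(reconstructed vector)}$$
$$\bar{\varepsilon}(m) = E\big[\|\mathbf{x} - \hat{\mathbf{x}}(m)\|^2\big] = E\Big[\Big(\sum_{j=m+1}^{n} y_j\varphi_j^T\Big)\Big(\sum_{k=m+1}^{n} y_k\varphi_k\Big)\Big] = E\Big[\sum_{j=m+1}^{n}\sum_{k=m+1}^{n} y_jy_k\,\varphi_j^T\varphi_k\Big]$$
$$= \sum_{j=m+1}^{n} E[y_j^2] = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \varphi_j^TR\,\varphi_j$$
  because $\varphi_j^T\varphi_k = 1$ if $j = k$ and $0$ if $j \neq k$, and because $E[y_j] = 0$ gives $\sigma_j^2 = E[y_j^2] - (E[y_j])^2 = E[y_j^2]$
– We should discard the bases where the projections have lower variances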
ML-12
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
– If the orthonormal (basis) set $\varphi_i$'s is selected to be the eigenvectors of the correlation matrix $R$, associated with eigenvalues $\lambda_i$'s
  • They will have the property that $R\,\varphi_j = \lambda_j\varphi_j$
  • $R$ is real and symmetric, therefore its eigenvectors form an orthonormal set
  • $R$ is positive definite ($\mathbf{x}^TR\,\mathbf{x} > 0$) ⇒ all eigenvalues are positive
– Such that the mean-squared error mentioned above will be
$$\bar{\varepsilon}(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \varphi_j^TR\,\varphi_j = \sum_{j=m+1}^{n} \varphi_j^T\lambda_j\varphi_j = \sum_{j=m+1}^{n} \lambda_j$$
ML-13
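PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
– If the eigenvectors associated with the $m$ largest eigenvalues are retained, the mean-squared error will be
$$\bar{\varepsilon}_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$$
– Any two projections $y_i$ and $y_j$ ($i \neq j$) will be mutually uncorrelated
$$E[y_iy_j] = E[(\varphi_i^T\mathbf{x})(\varphi_j^T\mathbf{x})^T] = E[\varphi_i^T\mathbf{x}\mathbf{x}^T\varphi_j] = \varphi_i^TE[\mathbf{x}\mathbf{x}^T]\varphi_j = \varphi_i^TR\,\varphi_j = \lambda_j\varphi_i^T\varphi_j = 0$$
• Good news for most statistical modeling approaches
– Gaussians and diagonal matrices
  General (full) matrix:
$$\begin{bmatrix}\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n}\\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n}\\ \vdots & \vdots & & \vdots\\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}\end{bmatrix}$$
  with $R = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N}\sum_i \mathbf{x}_i\mathbf{x}_i^T$ and $\Sigma = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T] = E[\mathbf{x}\mathbf{x}^T] - \boldsymbol{\mu}\boldsymbol{\mu}^T$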
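The two claims on this slide (the error of keeping $m$ components equals the sum of the discarded eigenvalues, and the projections are uncorrelated) can be checked numerically. The sketch below does this for two-dimensional data in plain Python, using the closed-form eigen-decomposition of a symmetric 2×2 matrix. The data points and the function name `pca_2d` are made up for the illustration, and the mean is subtracted first so that the correlation and covariance matrices coincide.

```python
import math

def pca_2d(points):
    """Eigen-decomposition of the 2x2 covariance matrix of mean-centered points."""
    n = len(points)
    mean = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in points]
    # Entries of R = (1/n) * sum x x^T for the centered data: [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / n
    b = sum(x * y for x, y in centered) / n
    c = sum(y * y for _, y in centered) / n
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    lam1, lam2 = half_trace + radius, half_trace - radius
    if abs(b) > 1e-12:
        v1 = (b, lam1 - a)                 # solves (R - lam1 I) v = 0
    else:
        v1 = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    phi1 = (v1[0] / norm, v1[1] / norm)
    phi2 = (-phi1[1], phi1[0])             # orthogonal unit vector
    return centered, (lam1, lam2), (phi1, phi2)

points = [(2.0, 1.9), (0.5, 0.7), (-1.0, -0.8), (1.5, 1.2), (-2.0, -2.3), (-1.0, -0.7)]
centered, (lam1, lam2), (phi1, phi2) = pca_2d(points)
n = len(points)

y1 = [x * phi1[0] + y * phi1[1] for x, y in centered]   # projections onto phi1
y2 = [x * phi2[0] + y * phi2[1] for x, y in centered]   # projections onto phi2

cross = sum(u * v for u, v in zip(y1, y2)) / n          # estimate of E[y1 y2]
# Mean-squared error when only the first component is kept (m = 1)
mse = sum((x - u * phi1[0]) ** 2 + (y - u * phi1[1]) ** 2
          for (x, y), u in zip(centered, y1)) / n
```

With $m = 1$ and $n = 2$, `mse` should match the discarded eigenvalue `lam2`, and `cross` should be zero up to rounding.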
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
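PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
– It can be proved that $\bar{\varepsilon}_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
– Objective function: the term to be minimized, $\sum_j \varphi_j^TR\,\varphi_j$, together with the orthonormality constraints $\varphi_j^T\varphi_k = \delta_{jk}$. Define
$$J = \sum_{j=m+1}^{n} \varphi_j^TR\,\varphi_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n} \mu_{jk}\big(\varphi_j^T\varphi_k - \delta_{jk}\big), \qquad \delta_{jk} = \begin{cases}1 & \text{if } j = k\\ 0 & \text{if } j \neq k\end{cases}$$
– Take the derivative, using $\frac{\partial\,\varphi^TR\,\varphi}{\partial\varphi} = 2R\,\varphi$:
$$\forall\, m+1 \le j \le n:\quad \frac{\partial J}{\partial\varphi_j} = 2R\,\varphi_j - 2\sum_{k=m+1}^{n} \mu_{jk}\varphi_k = \mathbf{0}$$
$$\Rightarrow R\,\varphi_j = \Phi_{n-m}\boldsymbol{\mu}_j, \quad \text{where } \Phi_{n-m} = [\varphi_{m+1} \cdots \varphi_n],\ \boldsymbol{\mu}_j = [\mu_{j,m+1} \cdots \mu_{j,n}]^T$$
$$\Rightarrow R\,\Phi_{n-m} = \Phi_{n-m}U_{n-m}, \quad \text{where } U_{n-m} = [\boldsymbol{\mu}_{m+1} \cdots \boldsymbol{\mu}_n]$$
– There is a particular solution if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $R$, and $\varphi_{m+1}, \ldots, \varphi_n$ are their corresponding eigenvectors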
ML-16
PCA Derivations (10/13)
• Given an input vector $\mathbf{x}$ with dimension $n$
– Try to construct a linear transform $\Phi'$ ($\Phi'$ is an $n\times m$ matrix, $m < n$) such that the truncation result $\Phi'^T\mathbf{x}$ is optimal under the mean-squared error criterion
– Encoder: $\mathbf{y} = \Phi'^T\mathbf{x}$, where $\Phi' = [\mathbf{e}_1\ \mathbf{e}_2 \cdots \mathbf{e}_m]$, $\mathbf{x} = [x_1\ x_2 \cdots x_n]^T$ and $\mathbf{y} = [y_1\ y_2 \cdots y_m]^T$
– Decoder: $\hat{\mathbf{x}} = \Phi'\mathbf{y}$, where $\hat{\mathbf{x}} = [\hat{x}_1\ \hat{x}_2 \cdots \hat{x}_n]^T$
– Criterion: minimize $E_{\mathbf{x}}\big[(\mathbf{x} - \hat{\mathbf{x}})^T(\mathbf{x} - \hat{\mathbf{x}})\big]$
ML-17
PCA Derivations (11/13)
• Data compression in communication
– PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
– PCA needs no prior information (e.g. class distributions or output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
– The plot of variance as a function of the number of eigenvectors kept
  • Select $m$ such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
  • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n} \lambda_i$$
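Both selection rules are easy to express in code. The following sketch uses made-up eigenvalues and hypothetical function names; it returns the smallest $m$ whose retained-variance ratio reaches a threshold, and the number of eigenvalues at or above the average.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m with (lam_1 + ... + lam_m) / (lam_1 + ... + lam_n) >= threshold."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(lams)

def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam >= average)

eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]   # hypothetical spectrum, sums to 8
m_ratio = choose_m_by_ratio(eigenvalues, 0.85)
m_average = choose_m_by_average(eigenvalues)
```

For this spectrum the cumulative ratios are 0.5, 0.75, 0.875, …, so the 0.85 threshold is first met at three components, while only two eigenvalues exceed the average of 1.6.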
ML-19
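PCA Derivations (13/13)
• PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal
$$J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^TS_wW + W^TS_bW$$
$$\tilde{S}_w = W^TS_wW, \qquad \tilde{S}_b = W^TS_bW$$
$$S_w = \frac{1}{N}\sum_j N_j\Sigma_j, \qquad S_b = \frac{1}{N}\sum_j N_j(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T, \qquad S = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
  ($i$: sample index; $j$: class index)
• Try to show that $S = S_w + S_b$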
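One possible way to carry out the suggested proof is sketched below. It writes $g(\mathbf{x}_i) = j$ for "sample $i$ belongs to class $j$" and uses $\Sigma_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$ for the class sample covariance; the key step is that the cross terms vanish because $\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i - \bar{\mathbf{x}}_j) = \mathbf{0}$.

```latex
\begin{aligned}
S &= \frac{1}{N}\sum_{j}\sum_{g(\mathbf{x}_i)=j}
     \big[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\big]
     \big[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\big]^T \\
  &= \frac{1}{N}\sum_{j}\Big[\sum_{g(\mathbf{x}_i)=j}
     (\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T
     + N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T\Big] \\
  &= \frac{1}{N}\sum_{j} N_j\Sigma_j
     + \frac{1}{N}\sum_{j} N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T
   = S_w + S_b
\end{aligned}
```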
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; the new feature dimensions; threshold for the information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: image of size 256 × 256, with blocks of size 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure panels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
– Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
– Steps
  • Convert each of the $L$ reference images into a vector of floating point numbers representing the light intensity in each pixel
$$\mathbf{x}_1 = [x_{11}\ x_{21} \cdots x_{n1}]^T, \quad \mathbf{x}_2 = [x_{12}\ x_{22} \cdots x_{n2}]^T, \quad \ldots, \quad \mathbf{x}_L = [x_{1L}\ x_{2L} \cdots x_{nL}]^T$$
  • Calculate the covariance/correlation matrix between these reference vectors
  • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
  • In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
– Steps
  • The faces are then represented as eigenface 0 plus a linear combination of the remaining $K$ ($K \le L$) eigenfaces
$$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_i(1)\,\mathbf{e}_1 + w_i(2)\,\mathbf{e}_2 + \cdots + w_i(K)\,\mathbf{e}_K \;\Rightarrow\; \mathbf{y}_i = [w_i(1)\ w_i(2) \cdots w_i(K)]$$
    ($\mathbf{y}_i$: feature vector of a person $i$)
– The Eigenface approach follows the minimum mean-squared error criterion
– Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL 2000)
– Steps
  • Concatenate the regarded parameters for each speaker $r$ to form a huge vector $a(r)$ (a supervector)
    • SD HMM model mean parameters ($\mu$)
[Figure: eigenvoice space construction. Speaker 1 data … Speaker R data, together with the SI HMM model, go through model training to give Speaker 1 HMM … Speaker R HMM; each supervector has dimension $D = (M\cdot n)\times 1$; Principal Component Analysis is then applied.]
• Each new speaker $S$ is represented by a point $P$ in $K$-space
$$P_i = \mathbf{e}_0 + w_i(1)\,\mathbf{e}_1 + w_i(2)\,\mathbf{e}_2 + \cdots + w_i(K)\,\mathbf{e}_K$$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
– Dimension 1 (eigenvoice 1)
  • Correlates with pitch or sex
– Dimension 2 (eigenvoice 2)
  • Correlates with amplitude
– Dimension 3 (eigenvoice 3)
  • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
– Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
  • Fisher (1936) introduced it for two-class classification
  • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform $W$ such that the ratio of the average between-class variation over the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
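LDA Derivations (1/4)
• Suppose there are $N$ sample vectors $\mathbf{x}_i$ with dimensionality $n$, and each of them belongs to one of the $J$ classes: $g(\mathbf{x}_i) = j$, $j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
– The sample mean is
$$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$$
– The class sample means are
$$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \mathbf{x}_i$$
– The class sample covariances are
$$\Sigma_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$
– The average within-class variation before transform
$$S_w = \frac{1}{N}\sum_j N_j\Sigma_j$$
– The average between-class variation before transform
$$S_b = \frac{1}{N}\sum_j N_j(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$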
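The definitions above translate directly into code. The sketch below, in plain Python with made-up two-dimensional samples and illustrative helper names, computes $S_w$ and $S_b$ from labeled data and also the total scatter $S = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$, so that the identity $S = S_w + S_b$ can be checked numerically.

```python
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def add_scaled(A, B, scale):
    """Return A + scale * B for two matrices given as nested lists."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scatter(vectors, center):
    """(1/len(vectors)) * sum of (x - center)(x - center)^T."""
    dim = len(center)
    S = [[0.0] * dim for _ in range(dim)]
    for x in vectors:
        d = [xi - ci for xi, ci in zip(x, center)]
        S = add_scaled(S, outer(d, d), 1.0 / len(vectors))
    return S

def within_between(classes):
    """classes: one list of sample vectors per class. Returns (S_w, S_b, S)."""
    all_x = [x for cls in classes for x in cls]
    N, dim = len(all_x), len(all_x[0])
    grand_mean = mean_vec(all_x)
    S_w = [[0.0] * dim for _ in range(dim)]
    S_b = [[0.0] * dim for _ in range(dim)]
    for cls in classes:
        class_mean = mean_vec(cls)
        # N_j / N times the class covariance Sigma_j
        S_w = add_scaled(S_w, scatter(cls, class_mean), len(cls) / N)
        d = [a - b for a, b in zip(class_mean, grand_mean)]
        S_b = add_scaled(S_b, outer(d, d), len(cls) / N)
    return S_w, S_b, scatter(all_x, grand_mean)

# Hypothetical two-class data
classes = [
    [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],
    [[6.0, 5.0], [7.0, 8.0]],
]
S_w, S_b, S_total = within_between(classes)
```

Comparing `S_total` with the element-wise sum of `S_w` and `S_b` gives a quick numerical confirmation of the decomposition.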
ML-34
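LDA Derivations (2/4)
• If the transform $W = [\mathbf{w}_1\ \mathbf{w}_2 \cdots \mathbf{w}_m]$ is applied ($\mathbf{x} \to \mathbf{y}$, $S_w \to \tilde{S}_w$, $S_b \to \tilde{S}_b$)
– The sample vectors will be
$$\mathbf{y}_i = W^T\mathbf{x}_i$$
– The sample mean will be
$$\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N} W^T\mathbf{x}_i = W^T\Big(\frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i\Big) = W^T\bar{\mathbf{x}}$$
– The class sample means will be
$$\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} W^T\mathbf{x}_i = W^T\bar{\mathbf{x}}_j$$
– The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N}\sum_j N_j\cdot\Big\{\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \big(W^T\mathbf{x}_i - W^T\bar{\mathbf{x}}_j\big)\big(W^T\mathbf{x}_i - W^T\bar{\mathbf{x}}_j\big)^T\Big\} = W^T\Big\{\frac{1}{N}\sum_j N_j\Sigma_j\Big\}W = W^TS_wW$$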
ML-35
LDA Derivations (3/4)
• If the transform $W = [\mathbf{w}_1\ \mathbf{w}_2 \cdots \mathbf{w}_m]$ is applied
– Similarly, the average between-class variation will be
$$\tilde{S}_b = W^TS_bW$$
– Try to find the optimal $W$ such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^TS_bW|}{|W^TS_wW|}$$
  • A closed-form solution: the column vectors $\mathbf{w}_i$'s of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_b\mathbf{w}_i = \lambda_iS_w\mathbf{w}_i$$
  • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1}S_b$
$$S_w^{-1}S_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$
ML-36
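LDA Derivations (4/4)
• Proof
$$\hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^TS_bW|}{|W^TS_wW|} \quad (|\cdot|: \text{determinant})$$
– Or, for each column vector $\mathbf{w}_i$ of $W$, we want to find
$$\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}$$
– The quadratic form has an optimal solution where the derivative vanishes, using $\big(\frac{F}{G}\big)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^TC\mathbf{x})}{d\mathbf{x}} = (C + C^T)\mathbf{x}$:
$$\frac{\partial}{\partial\mathbf{w}_i}\Big(\frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}\Big) = 0 \;\Rightarrow\; \frac{2S_b\mathbf{w}_i(\mathbf{w}_i^TS_w\mathbf{w}_i) - (\mathbf{w}_i^TS_b\mathbf{w}_i)\,2S_w\mathbf{w}_i}{(\mathbf{w}_i^TS_w\mathbf{w}_i)^2} = 0$$
$$\Rightarrow\; S_b\mathbf{w}_i(\mathbf{w}_i^TS_w\mathbf{w}_i) - (\mathbf{w}_i^TS_b\mathbf{w}_i)S_w\mathbf{w}_i = 0 \;\Rightarrow\; S_b\mathbf{w}_i - \lambda_iS_w\mathbf{w}_i = 0 \quad \Big(\because\ \lambda_i = \frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}\Big)$$
$$\Rightarrow\; S_b\mathbf{w}_i = \lambda_iS_w\mathbf{w}_i \;\Rightarrow\; S_w^{-1}S_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$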
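For two classes the generalized eigenproblem has a well-known closed form: $S_b$ has rank one, so the only direction with a nonzero eigenvalue is proportional to $S_w^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$. The sketch below illustrates this special case in plain Python for two-dimensional, made-up data (it is not the general multi-class solver, and the function names are illustrative) and then checks that the resulting vector satisfies $S_b\mathbf{w} = \lambda S_w\mathbf{w}$.

```python
def mean2(pts):
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def scatter2(pts, c):
    """Sum of (x - c)(x - c)^T over pts, as a 2x2 nested tuple (not normalised)."""
    sxx = sum((p[0] - c[0]) ** 2 for p in pts)
    sxy = sum((p[0] - c[0]) * (p[1] - c[1]) for p in pts)
    syy = sum((p[1] - c[1]) ** 2 for p in pts)
    return ((sxx, sxy), (sxy, syy))

def matvec(M, v):
    return (M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1])

def lda_two_class(c1, c2):
    N = len(c1) + len(c2)
    m1, m2, m = mean2(c1), mean2(c2), mean2(c1 + c2)
    s1, s2 = scatter2(c1, m1), scatter2(c2, m2)
    # S_w = (1/N) sum_j N_j Sigma_j = (1/N) * (sum of unnormalised class scatters)
    S_w = tuple(tuple((s1[i][j] + s2[i][j]) / N for j in range(2)) for i in range(2))
    # S_b = (1/N) sum_j N_j (mean_j - mean)(mean_j - mean)^T
    S_b = tuple(tuple(sum(len(c) * (mu[i] - m[i]) * (mu[j] - m[j])
                          for c, mu in ((c1, m1), (c2, m2))) / N
                      for j in range(2)) for i in range(2))
    det = S_w[0][0] * S_w[1][1] - S_w[0][1] * S_w[1][0]
    S_w_inv = ((S_w[1][1] / det, -S_w[0][1] / det),
               (-S_w[1][0] / det, S_w[0][0] / det))
    w = matvec(S_w_inv, (m1[0] - m2[0], m1[1] - m2[1]))   # S_w^{-1} (mean1 - mean2)
    num = sum(a * b for a, b in zip(w, matvec(S_b, w)))
    den = sum(a * b for a, b in zip(w, matvec(S_w, w)))
    return w, num / den, S_w, S_b      # lambda = w^T S_b w / w^T S_w w

# Hypothetical two-class samples
class1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class2 = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0)]
w, lam, S_w, S_b = lda_two_class(class1, class2)
lhs = matvec(S_b, w)                                   # S_b w
rhs = tuple(lam * v for v in matvec(S_w, w))           # lambda * S_w w
```

The agreement of `lhs` and `rhs` shows that `w` is a generalized eigenvector, and `lam` is the value of the objective $\mathbf{w}^TS_b\mathbf{w} / \mathbf{w}^TS_w\mathbf{w}$ along that direction.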
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
– Covariance matrix of the 18-Mel-filter-bank vectors (calculated using Year-99's 5471 files)
$$\Sigma_x = \frac{1}{N}\sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
– After the cosine transform: covariance matrix of the 18-cepstral vectors (calculated using Year-99's 5471 files)
$$\Sigma_y = \frac{1}{N}\sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
– After the PCA transform: covariance matrix of the 18-PCA-cepstral vectors (calculated using Year-99's 5471 files)
– After the LDA transform: covariance matrix of the 18-LDA-cepstral vectors (calculated using Year-99's 5471 files)
– Character Error Rate
             TC       WG
    MFCC     26.32    22.71
    LDA-1    23.12    20.17
    LDA-2    23.11    20.11
ML-39
PCA vs LDA (1/2)
[Figure: projection directions found by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
– Clearly, HDA theoretically provides a much lower classification error than LDA
  • However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
ML-42
HW-3: Feature Transformation (2/4)
ML-43
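HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);              % surface plot of the matrix entries
• Eigen Decomposition
    % BE: between-class matrix, WI: within-class matrix
    BE = [3.0 3.5 1.4;
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;
          2.9 8.7 3.5;
          4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI * BE;
    % PCA
    A = BE + WI;              % why? (Prove it)
    [V, D] = eig(A);          % all eigenvectors and eigenvalues
    [V, D] = eigs(A, 3);      % or: the 3 largest only
    fid = fopen('Basis', 'w');
    for i = 1:3               % feature vector length
      for j = 1:3             % basis number
        fprintf(fid, '%10.10f ', V(i, j));
      end
      fprintf(fid, '\n');
    end
    fclose(fid);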
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original data samples after the LDA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
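LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: project $X$ from the $n$-dimensional feature space onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$
    $$y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$
  – SVD (in LSA): map the $m \times n$ matrix $A$ to $A'$ through a $k$-dimensional latent semantic space
    $$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n)$$
    $$A'_{m \times n} = U'_{m \times k}\, \Sigma'_{k \times k}\, V'^T_{k \times n}, \qquad \min \|A - A'\|_F^2 \text{ for a given } k$$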
ML-47
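LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a Vector $x$
[Figure: a vector $x$ and its projections $y_1$, $y_2$ onto the basis vectors $\varphi_1$, $\varphi_2$]
  $$y_1 = \|x\| \cos\theta_1 = \|x\| \frac{\varphi_1^T x}{\|\varphi_1\|\,\|x\|} = \varphi_1^T x, \qquad \text{where } \|\varphi_1\| = 1$$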
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a Doc and a Query, each described by its terms and by a structure model. Labeled links: literal term matching, doc expansion, query expansion, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}$$
  $$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}$$
  (rows of $A$ and $A'$: words $w_1, w_2, \ldots, w_m$; columns: docs $d_1, d_2, \ldots, d_n$)
  – $r \le \min(m, n)$
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $k \le r$ and $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row $A \subseteq R^n$, Col $A \subseteq R^m$
  – Docs and queries are represented in a $k$-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
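LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$): $V_{n \times n} = [\,v_1\ v_2\ \cdots\ v_n\,]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$, with $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
    $$\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$$
  – For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$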
ML-53
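LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
  $$Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$:
    $$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$:
    $$u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [\,u_1\ u_2\ \cdots\ u_r\,]\,\Sigma_r = A\,[\,v_1\ v_2\ \cdots\ v_r\,]$$
    ($V$ is known in advance and is an orthonormal matrix, $n \times r$; $U$ is also an orthonormal matrix, $m \times r$)
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$:
      $$[\,u_1\ \cdots\ u_r\ \cdots\ u_m\,]\,\Sigma = A\,[\,v_1\ \cdots\ v_r\ \cdots\ v_n\,]$$
      $$\Rightarrow U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \qquad (VV^T = I_{n \times n})$$
      $$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
  – Frobenius norm:
    $$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$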
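The definition of the singular values as square roots of the eigenvalues of $A^T A$, and the Frobenius-norm identity, can be checked on a small case. This is a minimal sketch restricted to matrices with two columns, so that $A^T A$ is 2×2 and its eigenvalues have a closed form; the example matrix and function names are illustrative assumptions only.

```python
import math

def gram_matrix(a):
    """Entries of the symmetric 2x2 matrix A^T A for a list of two-column rows."""
    g00 = sum(row[0] * row[0] for row in a)
    g01 = sum(row[0] * row[1] for row in a)
    g11 = sum(row[1] * row[1] for row in a)
    return g00, g01, g11

def singular_values_two_columns(a):
    """Singular values sigma_1 >= sigma_2 as sqrt of the eigenvalues of A^T A."""
    g00, g01, g11 = gram_matrix(a)
    mean = (g00 + g11) / 2.0
    radius = math.hypot((g00 - g11) / 2.0, g01)
    lam1 = mean + radius
    lam2 = max(mean - radius, 0.0)  # clamp tiny negative round-off
    return math.sqrt(lam1), math.sqrt(lam2)

A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # hypothetical 3 x 2 matrix
sigma1, sigma2 = singular_values_two_columns(A)
frobenius_sq = sum(v * v for row in A for v in row)
```

For this matrix the squared singular values should add up to the sum of squared entries, which is what the identity $\|A\|_F^2 = \sigma_1^2 + \cdots + \sigma_r^2$ states.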
ML-54
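LSA Derivations (3/7)
  $$A_{m \times n} = U\Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
[Figure: $A$ maps $R^n$ (partitioned by $V_1$, $V_2$) to $R^m$ (partitioned by $U_1$, $U_2$)]
  – The $v_i$'s (columns of $V_1$) span the row space of $A$
  – The $u_i$'s (columns of $U_1$) span the row space of $A^T$
  – $V_2$ corresponds to the solutions of $AX = 0$
  – $U\Sigma = AV$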
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
    $$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
    • The $i$-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the $i$-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
    $$A = U\Sigma V^T \;\Rightarrow\; A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$$
    • The $i$-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the $i$-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A_{m \times n}$; rows $w_1, \ldots, w_m$, columns $d_1, \ldots, d_n$)
    • Compare two terms → dot product of two rows of $A$, or an entry in $AA^T$
    • Compare two docs → dot product of two columns of $A$, or an entry in $A^T A$
    • Compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • Compare two terms → dot product of two rows of $U'\Sigma'$
      $$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
    • Compare two docs → dot product of two rows of $V'\Sigma'$
      $$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • Compare a query word and a doc → each individual entry of $A'$
  – $\Sigma'$ is for stretching or shrinking
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m \times 1$ query (or doc) vector
  $$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
  $$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
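The fold-in and the weighted cosine measure can be sketched directly from the two formulas on this slide. This is a minimal illustration, assuming $U$ is given as a list of rows and $\Sigma$ as a list of its diagonal values; the tiny matrices and the function names are made up and do not come from any toolkit.

```python
import math

def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}: q has length m, U is m x k, sigma holds k diagonal values."""
    k = len(sigma)
    return [sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j] for j in range(k)]

def latent_cosine(q_hat, d_hat, sigma):
    """Cosine between q_hat * Sigma and d_hat * Sigma (both treated as row vectors)."""
    qs = [x * s for x, s in zip(q_hat, sigma)]
    ds = [x * s for x, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    norm = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return dot / norm if norm else 0.0

# Hypothetical 3-term, 2-dimensional latent space with orthonormal columns in U.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma = [2.0, 1.0]
q_hat = fold_in([2.0, 0.0, 0.0], U, sigma)      # query that uses only the first term
same_axis = latent_cosine(q_hat, [0.5, 0.0], sigma)
other_axis = latent_cosine(q_hat, [0.0, 1.0], sigma)
```

A doc lying on the same latent axis as the folded-in query scores 1, and a doc on an orthogonal axis scores 0, regardless of how the axes are scaled by $\Sigma$.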
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
  $$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library (version 1.3) is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms (rows) and 3 docs (columns)
        2.3 0.0 4.2
        0.0 1.3 2.2
        3.8 0.0 0.5
        0.0 0.0 0.0
  – Each entry is weighted by its TFxIDF score
  – The corresponding sparse text input:
        4 3 6      4 rows (terms), 3 columns (docs), 6 nonzero entries
        2          2 nonzero entries at Col 0
        0 2.3      Col 0, Row 0
        2 3.8      Col 0, Row 2
        1          1 nonzero entry at Col 1
        1 1.3      Col 1, Row 1
        3          3 nonzero entries at Col 2
        0 4.2      Col 2, Row 0
        1 2.2      Col 2, Row 1
        2 0.5      Col 2, Row 2
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
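The column-by-column layout above can be produced from a dense matrix with a few lines of code. The following is a minimal sketch, assuming the layout is exactly as annotated on this slide (a header line, then for each column its nonzero count followed by "row value" pairs); `to_sparse_text` is a hypothetical helper name, not part of SVDLIBC, and the decimal points in the example entries are assumed.

```python
def to_sparse_text(dense):
    """Serialize a dense matrix (list of rows) into the column-oriented
    sparse text layout described on the slide."""
    n_rows = len(dense)
    n_cols = len(dense[0]) if n_rows else 0
    columns = []
    for c in range(n_cols):
        # Keep (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0]
        columns.append(entries)
    total = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {total}"]       # header: rows, columns, nonzeros
    for entries in columns:
        lines.append(str(len(entries)))          # nonzero count of this column
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines)

# The 4-term, 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
```

For a real collection the matrix would normally be built column by column from an inverted index rather than from a dense array, since a dense 51253 × 2265 matrix would mostly hold zeros.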
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852      number of indexing terms, number of docs, number of nonzero entries
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of the output files
  – -d 100: number of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
  – Word vector ($u^T$): 1×100; 51253 words
• LSA100-S
        100
        2686.18
        829.941
        559.59
        …
  – 100 eigenvalues
• LSA100-Vt
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
  – Doc vector ($v^T$): 1×100; 2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector (TFxIDF weighted beforehand)
  $$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
  $$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
ML-11
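PCA Derivations (5/13)
• Minimum Mean-Squared Error Criterion
  – We want to choose only $m$ of the $\varphi_i$'s such that we can still approximate $x$ well under the mean-squared error criterion
    $$x = \sum_{i=1}^{n} y_i \varphi_i = \sum_{i=1}^{m} y_i \varphi_i + \sum_{j=m+1}^{n} y_j \varphi_j \qquad \text{(original vector)}$$
    $$\hat{x}(m) = \sum_{i=1}^{m} y_i \varphi_i \qquad \text{(reconstructed vector)}$$
    $$\varepsilon(m) = E\{\|x - \hat{x}(m)\|^2\} = E\left\{\left(\sum_{j=m+1}^{n} y_j \varphi_j^T\right)\left(\sum_{k=m+1}^{n} y_k \varphi_k\right)\right\} = E\left\{\sum_{j=m+1}^{n} \sum_{k=m+1}^{n} y_j y_k \varphi_j^T \varphi_k\right\}$$
    $$= \sum_{j=m+1}^{n} E\{y_j^2\} = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \varphi_j^T R \varphi_j$$
    because
    $$\varphi_j^T \varphi_k = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
    and
    $$E[y_j] = 0, \qquad \sigma_j^2 = E[y_j^2] - (E[y_j])^2 = E[y_j^2]$$
  – We should discard the bases where the projections have lower variances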
ML-12
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
  – If the orthonormal (basis) set of $\varphi_i$'s is selected to be the eigenvectors of the correlation matrix $R$, associated with the eigenvalues $\lambda_i$'s
    • They will have the property that
      $$R\varphi_j = \lambda_j \varphi_j$$
    • $R$ is real and symmetric, therefore its eigenvectors form an orthonormal set
    • $R$ is positive definite ($x^T R x > 0$) ⇒ all eigenvalues are positive
  – Such that the mean-squared error mentioned above will be
    $$\varepsilon(m) = \sum_{j=m+1}^{n} \sigma_j^2 = \sum_{j=m+1}^{n} \varphi_j^T R \varphi_j = \sum_{j=m+1}^{n} \varphi_j^T \lambda_j \varphi_j = \sum_{j=m+1}^{n} \lambda_j$$
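The claim that the truncation error equals the sum of the discarded eigenvalues can be checked numerically in two dimensions, where the eigen-decomposition of $R$ has a closed form. This is a minimal sketch with made-up zero-mean samples and hypothetical function names; it keeps $m = 1$ eigenvector and compares the reconstruction error with $\lambda_2$.

```python
import math

def correlation_matrix(samples):
    """Entries of R = (1/N) * sum of x x^T for 2-D samples (assumed zero mean)."""
    n = len(samples)
    r00 = sum(x[0] * x[0] for x in samples) / n
    r01 = sum(x[0] * x[1] for x in samples) / n
    r11 = sum(x[1] * x[1] for x in samples) / n
    return r00, r01, r11

def eigen_symmetric_2x2(r00, r01, r11):
    """Eigenvalues (descending) and unit eigenvectors of [[r00, r01], [r01, r11]]."""
    mean = (r00 + r11) / 2.0
    radius = math.hypot((r00 - r11) / 2.0, r01)
    lam1, lam2 = mean + radius, mean - radius
    if r01 != 0.0:
        v1 = (lam1 - r11, r01)
    else:
        v1 = (1.0, 0.0) if r00 >= r11 else (0.0, 1.0)
    norm = math.hypot(v1[0], v1[1])
    phi1 = (v1[0] / norm, v1[1] / norm)
    phi2 = (-phi1[1], phi1[0])  # orthogonal complement in 2-D
    return (lam1, lam2), (phi1, phi2)

def truncation_mse(samples, phi1):
    """Mean squared error when each x is replaced by y_1 * phi_1."""
    total = 0.0
    for x in samples:
        y1 = phi1[0] * x[0] + phi1[1] * x[1]
        e0, e1 = x[0] - y1 * phi1[0], x[1] - y1 * phi1[1]
        total += e0 * e0 + e1 * e1
    return total / len(samples)

samples = [(2.0, 1.0), (-2.0, -1.0), (1.0, 1.5), (-1.0, -1.5), (0.5, -0.5), (-0.5, 0.5)]
(lam1, lam2), (phi1, phi2) = eigen_symmetric_2x2(*correlation_matrix(samples))
mse = truncation_mse(samples, phi1)
```

Here `mse` and `lam2` should agree up to round-off, which is the $n = 2$, $m = 1$ instance of $\varepsilon(m) = \sum_{j=m+1}^{n} \lambda_j$.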
ML-13
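PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  – If the eigenvectors are retained that are associated with the $m$ largest eigenvalues, the mean-squared error will be
    $$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \qquad \text{where } \lambda_1 \ge \cdots \ge \lambda_m \ge \cdots \ge \lambda_n \ge 0$$
  – Any two projections $y_i$ and $y_j$ will be mutually uncorrelated
    $$E\{y_i y_j\} = E\{(\varphi_i^T x)(\varphi_j^T x)^T\} = E\{\varphi_i^T x x^T \varphi_j\} = \varphi_i^T E\{x x^T\} \varphi_j = \varphi_i^T R \varphi_j = \lambda_j \varphi_i^T \varphi_j = 0 \quad (i \ne j)$$
    • Good news for most statistical modeling approaches
      – Gaussians and diagonal matrices
  – Covariance and correlation matrices:
    $$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
    $$\Sigma = E[(x - \mu)(x - \mu)^T] \approx \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T - \mu\mu^T, \qquad R = E[x x^T] \approx \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T$$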
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
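PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
    • Define the objective function (the first term is to be minimized, the second term carries the constraints):
      $$J = \sum_{j=m+1}^{n} \varphi_j^T R \varphi_j - \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} (\varphi_j^T \varphi_k - \delta_{jk}), \qquad \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
    • Take the derivative, using $\frac{\partial\, \varphi^T R \varphi}{\partial \varphi} = 2R\varphi$:
      $$\forall\, m+1 \le j \le n: \quad \frac{\partial J}{\partial \varphi_j} = 2R\varphi_j - 2\sum_{k=m+1}^{n} \mu_{jk} \varphi_k = 0$$
      $$\Rightarrow R\varphi_j = \Phi_{n-m}\, \mu_j, \quad \text{where } \Phi_{n-m} = [\,\varphi_{m+1} \cdots \varphi_n\,],\ \mu_j = [\,\mu_{j,m+1} \cdots \mu_{jn}\,]^T$$
      $$\Rightarrow R\,\Phi_{n-m} = \Phi_{n-m}\, U_{n-m}, \quad \text{where } U_{n-m} = [\,\mu_{m+1} \cdots \mu_n\,]$$
      $$\Rightarrow \Phi_{n-m}^T R\, \Phi_{n-m} = U_{n-m}$$
    • There is a particular solution if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $R$, and $\varphi_{m+1}, \ldots, \varphi_n$ are their corresponding eigenvectors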
ML-16
PCA Derivations (10/13)
• Given an input vector $x$ with dimension $n$
  – Try to construct a linear transform $\Phi'$ ($\Phi'$ is an $n \times m$ matrix, $m < n$) such that the truncation result $\Phi'^T x$ is optimal under the mean-squared error criterion
  – Encoder: $y = \Phi'^T x$, where $\Phi' = [\,e_1\ e_2\ \cdots\ e_m\,]$, $x = [\,x_1\ x_2\ \cdots\ x_n\,]^T$, $y = [\,y_1\ y_2\ \cdots\ y_m\,]^T$
  – Decoder: $\hat{x} = \Phi' y$, where $\hat{x} = [\,\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n\,]^T$
  – Criterion: minimize $E_x\left((x - \hat{x})^T (x - \hat{x})\right)$
ML-17
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
  – PCA needs no prior information (e.g., class distributions or output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select $m$ such that
      $$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
      $$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n} \lambda_i$$
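Both selection rules are easy to state in code. The following is a minimal sketch with a made-up eigenvalue list; the function names and the threshold of 0.85 are illustrative assumptions, not values from the slides.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues hold at least `threshold` of the total variance."""
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    running = 0.0
    for m, value in enumerate(lam, start=1):
        running += value
        if running / total >= threshold:
            return m
    return len(lam)

def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least as large as the average eigenvalue."""
    lam = sorted(eigenvalues, reverse=True)
    mean = sum(lam) / len(lam)
    return sum(1 for value in lam if value >= mean)

example_eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]  # hypothetical spectrum
m_ratio = choose_m_by_ratio(example_eigenvalues, 0.85)
m_avg = choose_m_by_average(example_eigenvalues)
```

For this spectrum the first rule keeps 3 eigenvectors (cumulative ratios 0.5, 0.75, 0.875) and the second keeps 2 (the average eigenvalue is 1.6), so the two rules need not agree.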
ML-19
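PCA Derivations (13/13)
• PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal
  $$J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W$$
  $$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W$$
  $$S_w = \frac{1}{N}\sum_j N_j \Sigma_j, \qquad S_b = \frac{1}{N}\sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T, \qquad S = \frac{1}{N}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$$
  ($i$: sample index; $j$: class index)
  – Try to show that $S = S_w + S_b$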
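One possible way to work this exercise is sketched here. It assumes the class sample covariance $\Sigma_j = \frac{1}{N_j}\sum_{g(x_i)=j}(x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$ and the class index function $g(\cdot)$ as defined in the LDA derivation slides of this deck, and splits each deviation as $x_i - \bar{x} = (x_i - \bar{x}_j) + (\bar{x}_j - \bar{x})$.

```latex
\begin{align*}
S &= \frac{1}{N}\sum_{j}\sum_{g(x_i)=j}
     \bigl[(x_i-\bar{x}_j)+(\bar{x}_j-\bar{x})\bigr]
     \bigl[(x_i-\bar{x}_j)+(\bar{x}_j-\bar{x})\bigr]^T \\
  &= \frac{1}{N}\sum_{j}\sum_{g(x_i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^T
   + \frac{1}{N}\sum_{j} N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^T \\
  &\quad + \frac{1}{N}\sum_{j}\Bigl[\sum_{g(x_i)=j}(x_i-\bar{x}_j)\Bigr](\bar{x}_j-\bar{x})^T
   + \frac{1}{N}\sum_{j}(\bar{x}_j-\bar{x})\Bigl[\sum_{g(x_i)=j}(x_i-\bar{x}_j)\Bigr]^T \\
  &= \frac{1}{N}\sum_{j} N_j\Sigma_j
   + \frac{1}{N}\sum_{j} N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^T
   \qquad\text{since } \sum_{g(x_i)=j}(x_i-\bar{x}_j)=0 \\
  &= S_w + S_b
\end{align*}
```

The cross terms vanish because the deviations from a class mean sum to zero within that class, which is the only fact needed beyond the definitions.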
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure labels: correlation matrix for the old feature dimensions; new feature dimensions; threshold for the information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure labels: 256, 256, 8, 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure labels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps
    • Convert each of the $L$ reference images into a vector of floating point numbers representing the light intensity in each pixel
      $$x_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}, \quad x_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix}, \quad \ldots, \quad x_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining $K$ ($K \le L$) eigenfaces
      $$\hat{x}_i = \bar{x} + w_{i1} e_1 + w_{i2} e_2 + \cdots + w_{iK} e_K \;\Rightarrow\; y_i = [\,w_{i1}\ w_{i2}\ \cdots\ w_{iK}\,] \quad \text{(feature vector of a person } i\text{)}$$
  – The Eigenface approach adheres to the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variations between faces
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set]
[Figure: the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variations between faces)]
[Figure: a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the regarded parameters for each speaker $r$ to form a huge vector $a(r)$ (a supervector)
      – SD HMM model mean parameters ($\mu$)
    • Each new speaker $S$ is represented by a point $P$ in $K$-space
      $$P_i = e(0) + w_i(1)\,e(1) + w_i(2)\,e(2) + \cdots + w_i(K)\,e(K)$$
[Figure: eigenvoice space construction. Labels: Speaker 1 Data ... Speaker R Data; SI HMM; Model Training; Speaker 1 HMM ... Speaker R HMM; supervector dimension D = (M·n)×1; Principal Component Analysis; SI HMM model]
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that Eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
– Fisher’s Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
Figure: Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space
ML-33
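LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes: $g(\mathbf{x}_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
– The sample mean is
$$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$$
– The class sample means are
$$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\mathbf{x}_i$$
– The class sample covariances are
$$\Sigma_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T$$
– The average within-class variation before the transform is
$$S_w = \frac{1}{N}\sum_{j}N_j\,\Sigma_j$$
– The average between-class variation before the transform is
$$S_b = \frac{1}{N}\sum_{j}N_j\,(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T$$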
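The definitions above can be checked numerically with a small sketch. The data points and labels below are made up for illustration, and `scatter_matrices` is a hypothetical helper name. It computes $S_w$, $S_b$, and the total covariance $S = \frac{1}{N}\sum_i(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T$ with plain Python lists.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def add(m1, m2):
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def scale(m, c):
    return [[c * v for v in row] for row in m]

def zeros(n):
    return [[0.0] * n for _ in range(n)]

def scatter_matrices(samples, labels):
    """Return (S_w, S_b, S), each normalized by the number of samples N."""
    N, n = len(samples), len(samples[0])
    x_bar = mean(samples)
    S_w, S_b, S = zeros(n), zeros(n), zeros(n)
    for j in sorted(set(labels)):
        cls = [x for x, g in zip(samples, labels) if g == j]
        x_bar_j = mean(cls)
        for x in cls:                       # accumulates N_j * Sigma_j
            d = [a - b for a, b in zip(x, x_bar_j)]
            S_w = add(S_w, outer(d, d))
        db = [a - b for a, b in zip(x_bar_j, x_bar)]
        S_b = add(S_b, scale(outer(db, db), len(cls)))
    for x in samples:
        d = [a - b for a, b in zip(x, x_bar)]
        S = add(S, outer(d, d))
    return scale(S_w, 1 / N), scale(S_b, 1 / N), scale(S, 1 / N)

# Hypothetical 2-D samples in two classes.
samples = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0]]
labels = [1, 1, 1, 2, 2]
S_w, S_b, S = scatter_matrices(samples, labels)
```

On this toy data the within-class and between-class matrices add up to the total covariance, entry by entry, which is the identity the PCA derivation slides ask the reader to show.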
ML-34
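LDA Derivations (2/4)
• If the transform $W = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
– The sample vectors will be
$$\mathbf{y}_i = W^T\mathbf{x}_i$$
– The sample mean will be
$$\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N}W^T\mathbf{x}_i = W^T\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\right) = W^T\bar{\mathbf{x}}$$
– The class sample means will be
$$\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}W^T\mathbf{x}_i = W^T\bar{\mathbf{x}}_j$$
– The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N}\sum_j N_j\left\{\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\left(W^T\mathbf{x}_i-W^T\bar{\mathbf{x}}_j\right)\left(W^T\mathbf{x}_i-W^T\bar{\mathbf{x}}_j\right)^T\right\} = W^T\left\{\frac{1}{N}\sum_j N_j\,\Sigma_j\right\}W = W^TS_wW$$
Figure: the transform W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$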
ML-35
LDA Derivations (3/4)
• If the transform $W = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
– Similarly, the average between-class variation will be
$$\tilde{S}_b = W^TS_bW$$
– Try to find the optimal W such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^TS_bW|}{|W^TS_wW|}$$
• A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_b\mathbf{w}_i = \lambda_iS_w\mathbf{w}_i$$
• That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1}S_b$
$$S_w^{-1}S_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$
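For the two-class case the eigenproblem above has a well-known shortcut: because $S_b$ then has rank one, the leading generalized eigenvector is proportional to $S_w^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)$. The sketch below checks this on made-up 2-D statistics. All numbers and the helper names `matvec` and `inv2` are assumptions for illustration, not values from the slides.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def inv2(m):
    """Inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical class statistics.
S_w = [[2.0, 0.5], [0.5, 1.0]]              # average within-class variation
means = {1: [1.0, 1.0], 2: [3.0, 2.0]}      # class sample means
counts = {1: 3, 2: 2}                       # N_j
N = sum(counts.values())
x_bar = [sum(counts[j] * means[j][d] for j in means) / N for d in range(2)]

# S_b = (1/N) * sum_j N_j (x_bar_j - x_bar)(x_bar_j - x_bar)^T
S_b = [[0.0, 0.0], [0.0, 0.0]]
for j in means:
    diff = [a - b for a, b in zip(means[j], x_bar)]
    for r in range(2):
        for c in range(2):
            S_b[r][c] += counts[j] * diff[r] * diff[c] / N

# Candidate LDA direction for two classes: w = S_w^{-1} (x_bar_1 - x_bar_2)
d12 = [a - b for a, b in zip(means[1], means[2])]
w = matvec(inv2(S_w), d12)

# Its Rayleigh quotient is the corresponding generalized eigenvalue.
Sb_w = matvec(S_b, w)
Sw_w = matvec(S_w, w)
lam = sum(a * b for a, b in zip(w, Sb_w)) / sum(a * b for a, b in zip(w, Sw_w))
```

With these values `Sb_w` equals `lam * Sw_w` up to rounding, so `w` satisfies $S_b\mathbf{w} = \lambda S_w\mathbf{w}$. For more than two classes a general eigen-solver would be needed instead of this shortcut.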
ML-36
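LDA Derivations (4/4)
• Proof
$$\hat{W} = \arg\max_W J(W) = \arg\max_W\frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W\frac{|W^TS_bW|}{|W^TS_wW|}\qquad(|\cdot|\ \text{denotes the determinant})$$
Or, for each column vector $\mathbf{w}_i$ of W, we want to find
$$\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i}\lambda_i,\qquad \lambda_i = \frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}$$
The quadratic form has an optimal solution where the derivative vanishes. Using
$$\left(\frac{F}{G}\right)' = \frac{F'G-FG'}{G^2},\qquad \frac{d(\mathbf{x}^TC\mathbf{x})}{d\mathbf{x}} = (C+C^T)\mathbf{x}$$
we obtain
$$\frac{\partial\lambda_i}{\partial\mathbf{w}_i} = \frac{2S_b\mathbf{w}_i\left(\mathbf{w}_i^TS_w\mathbf{w}_i\right)-2S_w\mathbf{w}_i\left(\mathbf{w}_i^TS_b\mathbf{w}_i\right)}{\left(\mathbf{w}_i^TS_w\mathbf{w}_i\right)^2} = 0$$
$$\Rightarrow\ \frac{S_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}-\frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}\cdot\frac{S_w\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i} = 0$$
$$\Rightarrow\ S_b\mathbf{w}_i-\lambda_iS_w\mathbf{w}_i = 0\ \Rightarrow\ S_b\mathbf{w}_i = \lambda_iS_w\mathbf{w}_i\ \Rightarrow\ S_w^{-1}S_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$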
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
Figure: Covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99’s 5471 files
$$\Sigma_x = \frac{1}{N}\sum_i(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T$$
Figure: Covariance matrix of the 18-cepstral vectors (after the cosine transform), calculated using Year-99’s 5471 files
$$\Sigma_y = \frac{1}{N}\sum_i(\mathbf{y}_i-\bar{\mathbf{y}})(\mathbf{y}_i-\bar{\mathbf{y}})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
Figure: Covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform), calculated using Year-99’s 5471 files
Figure: Covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform), calculated using Year-99’s 5471 files
Character Error Rate
          TC       WG
MFCC      26.32    22.71
LDA-1     23.12    20.17
LDA-2     23.11    20.11
ML-39
PCA vs. LDA (1/2)
Figure panels: PCA, LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• Figure: the difference in the projections obtained from LDA and HDA for the 2-class case
– In theory, HDA clearly provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
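HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar = [3.0 0.5 0.4;
         0.9 6.3 0.2;
         0.4 0.4 4.2];
colormap('default');
surf(CoVar);
• Eigen Decomposition
% BE and WI: example between-class and within-class matrices (as the names suggest)
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];
% Use one of the following two definitions of A
% LDA
IWI = inv(WI);
A = IWI * BE;
% PCA
A = BE + WI;            % why? (Prove it)
[V, D] = eig(A);        % all eigenvectors and eigenvalues
[V, D] = eigs(A, 3);    % or only the 3 largest
% Write the basis vectors to a file, one row per feature dimension
fid = fopen('Basis', 'w');
for i = 1:3             % feature vector length
  for j = 1:3           % basis number
    fprintf(fid, '%10.10f ', V(i, j));
  end
  fprintf(fid, '\n');
end
fclose(fid);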
ML-44
HW-3: Feature Transformation (4/4)
• Examples
Figure: Distribution of the 2000 original data samples after the PCA transform (axes: feature 1 vs. feature 2)
Figure: Distribution of the 2000 original data samples after the LDA transform (axes: feature 1 vs. feature 2)
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with “latent” semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
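LSA (2/7)
• Dimension Reduction and Feature Extraction
– PCA: X in the n-dimensional feature space is projected onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$, giving Y in a k-dimensional space
$$y_i = \varphi_i^TX,\qquad \hat{X} = \sum_{i=1}^{k}y_i\varphi_i,\qquad \min\|X-\hat{X}\|^2\ \text{for a given } k$$
– SVD (in LSA): the m×n matrix A is approximated by A′ in a k-dimensional latent semantic space
$$A'_{m\times n} = U'_{m\times r}\,\Sigma'_{r\times r}\,V'^T_{r\times n}\ \text{(keeping the leading } k\times k \text{ block of } \Sigma'),\quad r\le\min(m,n),\qquad \min\|A-A'\|_F^2\ \text{for a given } k$$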
ML-47
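LSA (3/7)
– Singular Value Decomposition (SVD) used for the word-document matrix
• A least-squares method for dimension reduction
Figure: Projection of a vector x onto the basis vectors $\varphi_1, \varphi_2$, giving the coordinates $y_1, y_2$
$$y_1 = \|\mathbf{x}\|\cos\theta_1 = \|\mathbf{x}\|\,\frac{\varphi_1^T\mathbf{x}}{\|\varphi_1\|\,\|\mathbf{x}\|} = \varphi_1^T\mathbf{x},\quad\text{where } \|\varphi_1\| = 1$$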
ML-48
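LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
Figure: the terms of a doc and the terms of a query are compared by literal term matching; doc expansion and query expansion enrich the terms on either side; alternatively, a structure model is built on each side and the two are compared by latent semantic structure retrieval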
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: “human computer interaction”
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^T_{r\times n}$$
$$A'_{m\times n} = U'_{m\times k}\,\Sigma_{k\times k}\,V'^T_{k\times n}$$
(rows of A and A′: terms $w_1, w_2, \ldots, w_m$; columns: docs $d_1, d_2, \ldots, d_n$)
– Both U and V have orthonormal column vectors: $U^TU = I_{r\times r}$, $V^TV = I_{r\times r}$
– $r\le\min(m,n)$, $k\le r$, $\|A\|_F^2\ge\|A'\|_F^2$
– $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2$
– Row A ⊆ R^n, Col A ⊆ R^m
– Docs and queries are represented in a k-dimensional space. The axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
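LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
– $A^TA$ is a symmetric n×n matrix
• All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n\ge 0$
• All eigenvectors $\mathbf{v}_j$ are orthonormal ($\in R^n$): $V_{n\times n} = [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_n]$, $\mathbf{v}_j^T\mathbf{v}_j = 1$, $V^TV = I_{n\times n}$
• Define the singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$, so that $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
– As the square roots of the eigenvalues of $A^TA$
– As the lengths of the vectors $A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_n$: $\sigma_1 = \|A\mathbf{v}_1\|$, $\sigma_2 = \|A\mathbf{v}_2\|$, …
$$\|A\mathbf{v}_i\|^2 = \mathbf{v}_i^TA^TA\mathbf{v}_i = \lambda_i\mathbf{v}_i^T\mathbf{v}_i = \lambda_i\ \Rightarrow\ \|A\mathbf{v}_i\| = \sigma_i$$
For $\lambda_i\ne 0$, i = 1, …, r, $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A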
ML-53
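LSA Derivations (2/7)
• $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A
$$A\mathbf{v}_i\cdot A\mathbf{v}_j = (A\mathbf{v}_i)^TA\mathbf{v}_j = \mathbf{v}_i^TA^TA\mathbf{v}_j = \lambda_j\mathbf{v}_i^T\mathbf{v}_j = 0\quad(i\ne j)$$
– Suppose that A (or $A^TA$) has rank r ≤ n
$$\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_r>0,\qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
– Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A
$$\mathbf{u}_i = \frac{A\mathbf{v}_i}{\|A\mathbf{v}_i\|} = \frac{1}{\sigma_i}A\mathbf{v}_i\ \Rightarrow\ \sigma_i\mathbf{u}_i = A\mathbf{v}_i\ \Rightarrow\ [\mathbf{u}_1\ \mathbf{u}_2\ \cdots\ \mathbf{u}_r]\,\Sigma_r = A\,[\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_r]$$
(V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
• Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of R^m
$$[\mathbf{u}_1\ \cdots\ \mathbf{u}_r\ \cdots\ \mathbf{u}_m]\,\Sigma = A\,[\mathbf{v}_1\ \cdots\ \mathbf{v}_r\ \cdots\ \mathbf{v}_n]\ \Rightarrow\ U\Sigma = AV\ \Rightarrow\ U\Sigma V^T = AVV^T\ \Rightarrow\ A = U\Sigma V^T\quad(VV^T = I_{n\times n})$$
$$\Sigma_{m\times n} = \begin{pmatrix}\Sigma_{r\times r} & 0_{r\times(n-r)}\\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)}\end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2 = \sigma_1^2+\sigma_2^2+\cdots+\sigma_r^2$$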
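A small worked example may help make the construction above concrete. The 2×2 matrix below is chosen only for illustration (it is not from the slides); every step follows the definitions of $\lambda_j$, $\mathbf{v}_j$, $\sigma_j$ and $\mathbf{u}_j$ given in the derivation.

```latex
A = \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix},
\qquad
A^T A = \begin{pmatrix} 25 & 20 \\ 20 & 25 \end{pmatrix}

% Eigenvalues and orthonormal eigenvectors of A^T A
\lambda_1 = 45,\ \mathbf{v}_1 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ 1 \end{pmatrix};
\qquad
\lambda_2 = 5,\ \mathbf{v}_2 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 1 \\ -1 \end{pmatrix}

% Singular values as square roots of the eigenvalues ...
\sigma_1 = \sqrt{45} = 3\sqrt{5}, \qquad \sigma_2 = \sqrt{5}

% ... and as the lengths of A v_i
A\mathbf{v}_1 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 3 \\ 9 \end{pmatrix},\
\|A\mathbf{v}_1\|^2 = \tfrac{9 + 81}{2} = 45;
\qquad
A\mathbf{v}_2 = \tfrac{1}{\sqrt{2}}\begin{pmatrix} 3 \\ -1 \end{pmatrix},\
\|A\mathbf{v}_2\|^2 = \tfrac{9 + 1}{2} = 5

% Orthogonality of A v_1 and A v_2, and the resulting orthonormal u_i
A\mathbf{v}_1 \cdot A\mathbf{v}_2 = \tfrac{9 - 9}{2} = 0,
\qquad
\mathbf{u}_1 = \tfrac{1}{\sqrt{10}}\begin{pmatrix} 1 \\ 3 \end{pmatrix},\
\mathbf{u}_2 = \tfrac{1}{\sqrt{10}}\begin{pmatrix} 3 \\ -1 \end{pmatrix}

% Frobenius norm check
\|A\|_F^2 = 3^2 + 0^2 + 4^2 + 5^2 = 50 = \sigma_1^2 + \sigma_2^2
```

Multiplying out $U\Sigma V^T$ with these factors returns the original matrix A, as the derivation predicts.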
ML-54
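LSA Derivations (3/7)
Figure: A (m×n) maps R^n to R^m, with $V = (V_1\ V_2)$ on the R^n side and $U = (U_1\ U_2)$ on the R^m side
– The $\mathbf{v}_i$ in $V_1$ span the row space of A; $V_2$ corresponds to the solutions of AX = 0
– The $\mathbf{u}_i$ in $U_1$ span the row space of $A^T$
$$A = U\Sigma V^T = (U_1\ U_2)\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} = U_1\Sigma_1V_1^T = AV_1V_1^T$$
$$U\Sigma = AV$$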
ML-55
LSA Derivations (4/7)
• Additional Explanations
– Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
$$A = U\Sigma V^T\ \Rightarrow\ AV = U\Sigma V^TV = U\Sigma\ \Rightarrow\ U\Sigma = AV$$
• The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
– Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
$$A = U\Sigma V^T\ \Rightarrow\ A^TU = (U\Sigma V^T)^TU = V\Sigma^TU^TU = V\Sigma\ \Rightarrow\ V\Sigma = A^TU$$
• The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
– The original word-document matrix A (m×n; rows $w_1, \ldots, w_m$, columns $d_1, \ldots, d_n$)
• compare two terms → dot product of two rows of A, or an entry in $AA^T$
• compare two docs → dot product of two columns of A, or an entry in $A^TA$
• compare a term and a doc → each individual entry of A
– The new word-document matrix A′, with U′ = U (m×k), Σ′ = Σ_k, V′ = V (n×k)
• compare two terms → dot product of two rows of U′Σ′
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$$
• compare two docs → dot product of two rows of V′Σ′
$$A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
• compare a query word and a doc → each individual entry of A′
(In both products the inner factor $V'^TV'$ or $U'^TU'$ reduces to the identity; Σ′ is for stretching or shrinking)
ML-57
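LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
– For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m×1 query (or doc) vector
$$\hat{\mathbf{q}}_{1\times k} = \mathbf{q}^T_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
(the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V)
– Cosine measure between the query and doc vectors in the latent semantic space
$$\mathrm{sim}(\hat{\mathbf{q}},\hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\Sigma,\hat{\mathbf{d}}\Sigma) = \frac{\hat{\mathbf{q}}\,\Sigma^2\,\hat{\mathbf{d}}^T}{\|\hat{\mathbf{q}}\Sigma\|\,\|\hat{\mathbf{d}}\Sigma\|}$$
($\hat{\mathbf{q}}$ and $\hat{\mathbf{d}}$ are row vectors)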
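The fold-in and cosine formulas above translate directly into code. The sketch below uses made-up values: a 3-term, 2-dimensional latent space with a hand-picked `U` (orthonormal columns) and singular values `sigma`, plus two hypothetical doc vectors. The function names `fold_in` and `latent_cosine` are illustrative assumptions, not part of any toolkit.

```python
import math

def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}; q has length m, U is m x k, sigma has length k."""
    k = len(sigma)
    return [sum(q[i] * U[i][c] for i in range(len(q))) / sigma[c] for c in range(k)]

def latent_cosine(q_hat, d_hat, sigma):
    """Cosine between q_hat*Sigma and d_hat*Sigma (both treated as row vectors)."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [a * s for a, s in zip(d_hat, sigma)]
    num = sum(a * b for a, b in zip(qs, ds))
    den = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return num / den

# Hypothetical latent space: m = 3 terms, k = 2 dimensions.
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
sigma = [2.0, 1.0]

q = [1.0, 1.0, 0.0]              # term weights of a new query
q_hat = fold_in(q, U, sigma)     # its representation in the latent space

d_same = [0.5, 1.0]              # hypothetical doc vector (a row of V)
d_other = [0.5, -1.0]            # another hypothetical doc vector
sim_same = latent_cosine(q_hat, d_same, sigma)
sim_other = latent_cosine(q_hat, d_other, sigma)
```

Here the first doc vector coincides with the folded-in query, so its similarity is 1, while the second is orthogonal to the query after scaling by Σ and scores 0.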
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{\mathbf{t}}_{1\times k} = \mathbf{t}_{1\times n}\,V_{n\times k}\,\Sigma^{-1}_{k\times k}$$
ML-59
LSA Example
• Experimental results
– HMM is consistently better than VSM at all recall levels
– LSA is better than VSM at higher recall levels
Figure: Recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)
ML-60
LSA Conclusions
• Advantages
– A clean formal framework and a clearly defined optimization criterion (least-squares)
• Conceptual simplicity and clarity
– Handles synonymy problems (“heterogeneous vocabulary”)
– Good results for high-recall search
• Takes term co-occurrence into account
• Disadvantages
– High computational complexity
– LSA offers only a partial solution to polysemy
• E.g. bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde’s SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
– E.g. 4 terms and 3 docs (rows: terms, columns: docs)
2.3 0.0 4.2
0.0 1.3 2.2
3.8 0.0 0.5
0.0 0.0 0.0
– Each entry is weighted by its TFxIDF score
– The same matrix in sparse form (first line: number of rows (terms), number of columns (docs), number of nonzero entries; then, for each column, the number of nonzero entries followed by one "row value" pair per entry)
4 3 6
2
0 2.3
2 3.8
1
1 1.3
3
0 4.2
1 2.2
2 0.5
(2 nonzero entries at Col 0: rows 0 and 2; 1 nonzero entry at Col 1: row 1; 3 nonzero entries at Col 2: rows 0, 1 and 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, …, 600) of LSA dimensionality
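The column-by-column layout described on this slide can be produced from a dense matrix with a short helper. This is a sketch that follows the layout as shown above; `to_sparse_text` is a hypothetical function name, and zero entries are detected by exact comparison with 0.

```python
def to_sparse_text(dense):
    """Convert a dense matrix (list of rows: terms x docs) to the sparse text layout.

    Output: a header line "rows cols nonzeros", then for each column the number
    of nonzero entries followed by one "row value" line per nonzero entry.
    """
    n_rows, n_cols = len(dense), len(dense[0])
    columns = []
    for c in range(n_cols):
        columns.append([(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0])
    nonzeros = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {nonzeros}"]
    for col in columns:
        lines.append(str(len(col)))
        lines.extend(f"{r} {v}" for r, v in col)
    return "\n".join(lines)

# The 4-term, 3-doc example matrix from the slide.
matrix = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(matrix)
print(text)
```

Storing only the nonzero entries matters for real term-doc matrices, where most terms do not occur in most docs.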
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (first line: number of indexing terms, number of docs, number of nonzero entries)
51253 2265 218852
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
……
• SVD command (IR_svd.bat)
svd -r st -o LSA100 -d 100 Term-Doc-Matrix
– -r st: sparse matrix input
– -o LSA100: prefix of output files
– -d 100: number of reserved eigenvectors
– Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (word vector (u^T): 1×100; 51253 words)
100 51253
0.003 0.001 ……
0.002 0.002 ……
• LSA100-S (100 eigenvalues)
100
2686.18
829.941
559.59
…
• LSA100-Vt (doc vector (v^T): 1×100; 2265 docs)
100 2265
0.021 0.035 ……
0.012 0.022 ……
ML-65
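LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
$$\hat{\mathbf{q}}_{1\times k} = \mathbf{q}^T_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
(the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V)
• Cosine measure between the query and doc vectors in the latent semantic space
$$\mathrm{sim}(\hat{\mathbf{q}},\hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\Sigma,\hat{\mathbf{d}}\Sigma) = \frac{\hat{\mathbf{q}}\,\Sigma^2\,\hat{\mathbf{d}}^T}{\|\hat{\mathbf{q}}\Sigma\|\,\|\hat{\mathbf{d}}\Sigma\|}$$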
ML-12
PCA Derivations (6/13)
• Minimum Mean-Squared Error Criterion
– If the orthonormal (basis) set $\varphi_i$'s is selected to be the eigenvectors of the correlation matrix R, associated with eigenvalues $\lambda_i$'s
• They will have the property that
$$R\varphi_j = \lambda_j\varphi_j$$
(R is real and symmetric, therefore its eigenvectors form an orthonormal set; R is positive definite, $\mathbf{x}^TR\mathbf{x}>0$, so all eigenvalues are positive)
– Such that the mean-squared error mentioned above will be
$$\varepsilon(m) = \sum_{j=m+1}^{n}\sigma_j^2 = \sum_{j=m+1}^{n}\varphi_j^TR\varphi_j = \sum_{j=m+1}^{n}\lambda_j\varphi_j^T\varphi_j = \sum_{j=m+1}^{n}\lambda_j$$
ML-13
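PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
– If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be
$$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n}\lambda_j,\quad\text{where } \lambda_1\ge\cdots\ge\lambda_m\ge\cdots\ge\lambda_n\ge 0$$
– Any two projections $y_i$ and $y_j$ will be mutually uncorrelated
$$E\{y_iy_j\} = E\{(\varphi_i^T\mathbf{x})(\varphi_j^T\mathbf{x})^T\} = E\{\varphi_i^T\mathbf{x}\mathbf{x}^T\varphi_j\} = \varphi_i^TE\{\mathbf{x}\mathbf{x}^T\}\varphi_j = \varphi_i^TR\varphi_j = \lambda_j\varphi_i^T\varphi_j = 0\quad(i\ne j)$$
• Good news for most statistical modeling approaches
– Gaussians and diagonal matrices
$$\begin{bmatrix}\sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n}\\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n}\\ \vdots & \vdots & \ddots & \vdots\\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn}\end{bmatrix}$$
Note:
$$\Sigma = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T] \approx \left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T\right)-\boldsymbol{\mu}\boldsymbol{\mu}^T,\qquad R = E[\mathbf{x}\mathbf{x}^T] \approx \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\mathbf{x}_i^T$$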
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
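PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
– It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
Objective function to be minimized: $\sum_{j=m+1}^{n}\varphi_j^TR\varphi_j$, subject to the constraints
$$\varphi_j^T\varphi_k = \delta_{jk},\qquad \delta_{jk} = \begin{cases}1 & \text{if } j = k\\ 0 & \text{if } j\ne k\end{cases}$$
Define
$$J = \sum_{j=m+1}^{n}\varphi_j^TR\varphi_j-\sum_{j=m+1}^{n}\sum_{k=m+1}^{n}\mu_{jk}\left(\varphi_j^T\varphi_k-\delta_{jk}\right)$$
Take the derivative, using $\frac{\partial(\varphi^TR\varphi)}{\partial\varphi} = 2R\varphi$ and taking $\mu_{jk} = \mu_{kj}$:
$$\forall\ m+1\le j\le n:\quad \frac{\partial J}{\partial\varphi_j} = 2R\varphi_j-2\sum_{k=m+1}^{n}\mu_{jk}\varphi_k = 0$$
$$\Rightarrow\ R\varphi_j = \Phi_{n-m}\boldsymbol{\mu}_j\quad\forall\ m+1\le j\le n,\quad\text{where } \Phi_{n-m} = [\varphi_{m+1}\ \cdots\ \varphi_n],\ \boldsymbol{\mu}_j = [\mu_{j,m+1}\ \cdots\ \mu_{j,n}]^T$$
$$\Rightarrow\ R\,\Phi_{n-m} = \Phi_{n-m}U_{n-m},\quad\text{where } U_{n-m} = [\boldsymbol{\mu}_{m+1}\ \cdots\ \boldsymbol{\mu}_n]$$
This has a particular solution if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of R, and $\varphi_{m+1}, \ldots, \varphi_n$ are their corresponding eigenvectors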
ML-16
PCA Derivations (10/13)
• Given an input vector x with dimension n
– Try to construct a linear transform Φ′ (Φ′ is an n×m matrix, m < n) such that the truncation result Φ′^T x is optimal under the mean-squared error criterion
Encoder: $\mathbf{x} = [x_1\ x_2\ \cdots\ x_n]^T\ \rightarrow\ \mathbf{y} = [y_1\ y_2\ \cdots\ y_m]^T$
$$\mathbf{y} = \Phi'^T\mathbf{x},\quad\text{where } \Phi' = [\mathbf{e}_1\ \mathbf{e}_2\ \cdots\ \mathbf{e}_m]$$
Decoder: $\mathbf{y} = [y_1\ y_2\ \cdots\ y_m]^T\ \rightarrow\ \hat{\mathbf{x}} = [\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n]^T$
$$\hat{\mathbf{x}} = \Phi'\mathbf{y}$$
$$\text{minimize } E\left[(\mathbf{x}-\hat{\mathbf{x}})^T(\mathbf{x}-\hat{\mathbf{x}})\right]$$
ML-17
PCA Derivations (11/13)
• Data compression in communication
– PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition
– PCA needs no prior information (e.g. class distributions of output information) about the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (12/13)
• Scree Graph
– The plot of variance as a function of the number of eigenvectors kept
• Select m such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
• Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n} \lambda_i$$
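Both selection rules are short to write down. This is a small sketch, assuming NumPy; the function names and the eigenvalues 4, 3, 2, 1 are made up for illustration.

```python
import numpy as np

def choose_m_by_ratio(eigvals, threshold):
    """Smallest m whose leading eigenvalues reach the given fraction of total variance."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    ratio = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(ratio, threshold) + 1)

def choose_m_by_average(eigvals):
    """Number of eigenvalues that are at least the average eigenvalue."""
    lam = np.asarray(eigvals, dtype=float)
    return int(np.sum(lam >= lam.mean()))

eigvals = [4.0, 3.0, 2.0, 1.0]                  # hypothetical spectrum
m_ratio = choose_m_by_ratio(eigvals, 0.85)      # cumulative ratios: 0.4, 0.7, 0.9, 1.0
m_avg = choose_m_by_average(eigvals)            # average eigenvalue is 2.5
```

With this spectrum the ratio rule keeps three eigenvectors and the average rule keeps two, so the two criteria need not agree.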
ML-19
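PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
$$J(W) = |\tilde{S}| = |\tilde{S}_w + \tilde{S}_b| = |W^T S_w W + W^T S_b W|$$
$$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W$$
$$S_w = \frac{1}{N}\sum_{j} N_j \Sigma_j$$
$$S_b = \frac{1}{N}\sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
$$S = \frac{1}{N}\sum_{i} (x_i - \bar{x})(x_i - \bar{x})^T$$
(i: sample index, j: class index)
• Try to show that $S = S_w + S_b$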
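Before proving the identity, it can be checked numerically. The sketch below assumes NumPy and two synthetic classes; `scatter_matrices` is an illustrative helper name, not something from the slides.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return total (S), within-class (S_w) and between-class (S_b) variation."""
    N, n = X.shape
    mean = X.mean(axis=0)
    S = (X - mean).T @ (X - mean) / N
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj = Xj.shape[0]
        mean_j = Xj.mean(axis=0)
        Sigma_j = (Xj - mean_j).T @ (Xj - mean_j) / Nj   # class sample covariance
        S_w += Nj * Sigma_j / N
        d = (mean_j - mean)[:, None]
        S_b += Nj * (d @ d.T) / N
    return S, S_w, S_b

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(2, 1, (40, 3))])
labels = np.array([0] * 60 + [1] * 40)
S, S_w, S_b = scatter_matrices(X, labels)
```

For any labeling, `S` matches `S_w + S_b` up to floating-point error, which is the identity the slide asks to show.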
ML-20
PCA Examples Data Analysis
bull Example 1 principal components of some data points
ML-21
PCA Examples Feature Transformation
bull Example 2 feature transformation and selection
threshold for information content reserved
New feature dimensions
Correlation matrix for old feature dimensions
ML-22
PCA Examples Image Coding (1/2)
• Example 3 Image Coding
[Figure: a 256×256 image divided into 8×8 blocks]
ML-23
PCA Examples Image Coding (2/2)
• Example 3 Image Coding (cont)
[Figure: coded images, with panels labeled (value reduction) and (feature reduction)]
ML-24
PCA Examples Eigenface (1/4)
• Example 4 Eigenface in face recognition (Turk and Pentland 1991)
– Consider an individual image to be a linear combination of a small number of face components or “eigenfaces” derived from a set of reference images
– Steps
• Convert each of the L reference images into a vector of floating point numbers representing light intensity in each pixel
$$x_1 = [x_{11}\; x_{21}\; \cdots\; x_{n1}]^T, \quad x_2 = [x_{12}\; x_{22}\; \cdots\; x_{n2}]^T, \quad \ldots, \quad x_L = [x_{1L}\; x_{2L}\; \cdots\; x_{nL}]^T$$
• Calculate the covariance/correlation matrix between these reference vectors
• Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
• Besides, the vector obtained by averaging all images is called “eigenface 0”. The other eigenfaces from “eigenface 1” onwards model the variations from this average face
ML-25
PCA Examples Eigenface (2/4)
• Example 4 Eigenface in face recognition (cont)
– Steps
• The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{x}_i = \bar{x} + w_{i,1}e(1) + w_{i,2}e(2) + \cdots + w_{i,K}e(K) \;\Rightarrow\; y_i = [w_{i,1}\; w_{i,2}\; \cdots\; w_{i,K}]$$
($y_i$: feature vector of a person i)
– The Eigenface approach follows the minimum mean-squared error criterion
– Incidentally, the eigenfaces are not only themselves usually plausible faces but also directions of variation between faces
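The eigenface steps can be sketched on stand-in data. The example assumes NumPy and replaces real face images with random 8×8 "images"; the sizes L = 20, n = 64 and K = 5 are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
L, n, K = 20, 64, 5                       # 20 reference images, 64 pixels each, 5 eigenfaces
faces = rng.random((L, n))                # stand-in for vectorized face images

mean_face = faces.mean(axis=0)            # "eigenface 0"
centered = faces - mean_face
cov = centered.T @ centered / L           # covariance matrix of the reference vectors
eigvals, eigvecs = np.linalg.eigh(cov)    # ascending eigenvalues
E = eigvecs[:, ::-1][:, :K]               # columns are eigenfaces e(1) ... e(K)

w = centered @ E                          # row i is the feature vector y_i of image i
approx = mean_face + w @ E.T              # eigenface 0 plus the weighted eigenfaces
```

Each image is then summarized by K weights instead of n pixel values, at the cost of the residual `faces - approx`.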
ML-26
PCA Examples Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples Eigenvoice (1/3)
• Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
– Steps
• Concatenating the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
• SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 ... Speaker R data, together with an SI HMM, go through model training to give Speaker 1 ... Speaker R HMMs; the resulting supervectors, each of dimension D = (M·n)×1, go through Principal Component Analysis]
• Each new speaker S is represented by a point P in K-space
$$P = e(0) + w_1 e(1) + w_2 e(2) + \cdots + w_K e(K)$$
(e(0): SI HMM model)
ML-29
PCA Examples Eigenvoice (2/3)
• Example 5 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (3/3)
• Example 5 Eigenvoice in speaker adaptation (cont)
– Dimension 1 (eigenvoice 1)
• Correlates with pitch or sex
– Dimension 2 (eigenvoice 2)
• Correlates with amplitude
– Dimension 3 (eigenvoice 3)
• Correlates with second-formant movement
Note that Eigenface operates on the feature space while Eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
– Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j,\; j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
– The sample mean is
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
– The class sample means are
$$\bar{x}_j = \frac{1}{N_j}\sum_{g(x_i) = j} x_i$$
– The class sample covariances are
$$\Sigma_j = \frac{1}{N_j}\sum_{g(x_i) = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
– The average within-class variation before transform
$$S_w = \frac{1}{N}\sum_{j} N_j \Sigma_j$$
– The average between-class variation before transform
$$S_b = \frac{1}{N}\sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\; w_2\; \cdots\; w_m]$ is applied
– The sample vectors will be
$$y_i = W^T x_i$$
– The sample mean will be
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} W^T x_i = W^T\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = W^T \bar{x}$$
– The class sample means will be
$$\bar{y}_j = \frac{1}{N_j}\sum_{g(x_i) = j} W^T x_i = W^T \bar{x}_j$$
– The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N}\sum_{j} N_j \cdot \left\{ \frac{1}{N_j}\sum_{g(x_i) = j} \left(W^T x_i - W^T \bar{x}_j\right)\left(W^T x_i - W^T \bar{x}_j\right)^T \right\} = W^T \left\{ \frac{1}{N}\sum_{j} N_j \Sigma_j \right\} W = W^T S_w W$$
[Figure: W maps x to y, taking $S_w$ to $\tilde{S}_w$ and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\; w_2\; \cdots\; w_m]$ is applied
– Similarly, the average between-class variation will be
$$\tilde{S}_b = W^T S_b W$$
– Try to find the optimal $W$ such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$
• A closed-form solution: the column vectors $w_i$ of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_b w_i = \lambda_i S_w w_i$$
• That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
$$S_w^{-1} S_b w_i = \lambda_i w_i$$
ML-36
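LDA Derivations (4/4)
• Proof
$$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \qquad (|\cdot|: \text{determinant})$$
Or, for each column vector $w_i$ of $W$, we want to find $w_i$ that maximizes
$$\lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$$
The quadratic form has an optimal solution where $\partial \lambda_i / \partial w_i = 0$. Since $\left(\dfrac{F}{G}\right)' = \dfrac{F'G - FG'}{G^2}$ and $\dfrac{d(x^T C x)}{dx} = (C + C^T)x$:
$$\frac{2 S_b w_i \left(w_i^T S_w w_i\right) - 2 S_w w_i \left(w_i^T S_b w_i\right)}{\left(w_i^T S_w w_i\right)^2} = 0$$
$$\Rightarrow \frac{S_b w_i}{w_i^T S_w w_i} - \frac{S_w w_i}{w_i^T S_w w_i}\cdot\frac{w_i^T S_b w_i}{w_i^T S_w w_i} = 0$$
$$\Rightarrow S_b w_i - \lambda_i S_w w_i = 0 \;\Rightarrow\; S_b w_i = \lambda_i S_w w_i$$
$$\Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$$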
LDA Examples Feature Transformation (1/2)
• Example 1 Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\Sigma_x = \frac{1}{N}\sum_{i} (x_i - \bar{x})(x_i - \bar{x})^T$$
After Cosine Transform
[Figure: covariance matrix of the 18-cepstral vectors, calculated using Year-99's 5471 files]
$$\Sigma_y = \frac{1}{N}\sum_{i} (y_i - \bar{y})(y_i - \bar{y})^T$$
ML-38
LDA Examples Feature Transformation (2/2)
• Example 1 Experiments on Speech Signal Processing (cont)
After PCA Transform
[Figure: covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99's 5471 files]
After LDA Transform
[Figure: covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99's 5471 files]
Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs LDA (1/2)
[Figure: side-by-side projections obtained with PCA and with LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
– Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar=[3.0 0.5 0.4;
       0.9 6.3 0.2;
       0.4 0.4 4.2];
colormap('default');
surf(CoVar);
• Eigen Decomposition
BE=[3.0 3.5 1.4;            % between-class variation
    1.9 6.3 2.2;
    2.4 0.4 4.2];
WI=[4.0 4.1 2.1;            % within-class variation
    2.9 8.7 3.5;
    4.4 3.2 4.3];
% LDA
IWI=inv(WI);
A=IWI*BE;
% PCA
A=BE+WI;                    % why? (Prove it)
[V,D]=eig(A);               % all eigenvectors and eigenvalues
[V,D]=eigs(A,3);            % or only the 3 largest
fid=fopen('Basis','w');
for i=1:3                   % feature vector length
    for j=1:3               % basis number
        fprintf(fid,'%10.10f ',V(i,j));
    end
    fprintf(fid,'\n');
end
fclose(fid);
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after PCA transformation (axes: feature 1, feature 2)]
[Figure: distribution of the 2000 original data samples after LDA transformation (axes: feature 1, feature 2)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with “latent” semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
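LSA (2/7)
• Dimension Reduction and Feature Extraction
– PCA: project X from the n-dimensional feature space onto an orthonormal basis $\phi_1, \ldots, \phi_k$
$$y_i = \phi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \phi_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$
– SVD (in LSA): approximate the m×n matrix A by A′ in a k-dimensional latent semantic space
$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n)$$
$$A'_{m \times n} = U'\, \Sigma'_{k \times k}\, V'^T, \qquad \min \|A - A'\|_F^2 \text{ for a given } k$$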
ML-47
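LSA (3/7)
– Singular Value Decomposition (SVD) used for the word-document matrix
• A least-squares method for dimension reduction
Projection of a Vector x
$$y_1 = |x| \cos\theta_1 = |x|\, \frac{\phi_1^T x}{|\phi_1|\,|x|} = \phi_1^T x, \quad \text{where } |\phi_1| = 1$$
[Figure: vector x projected onto the axes $\varphi_1$ and $\varphi_2$, giving coordinates $y_1$ and $y_2$]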
ML-48
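LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each mapped to terms and then to a structure model. The doc side can apply doc expansion and the query side query expansion; the terms are compared by literal term matching, and the structure models by latent semantic structure retrieval]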
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: “human computer interaction”
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n)$$
(rows of A: words w1, w2, …, wm; columns of A: docs d1, d2, …, dn)
– Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
– Row A ∈ Rⁿ, Col A ∈ Rᵐ
$$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}, \qquad k \le r$$
$$\|A\|_F^2 \ge \|A'\|_F^2, \qquad \|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$
– Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
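The properties listed for the full and truncated SVD can be verified on a small matrix. The sketch assumes NumPy and a random 6×4 matrix standing in for a word-document matrix; k = 2 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((6, 4))                                # toy matrix: 6 words x 4 docs
U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U diag(s) Vt

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # rank-k approximation A'

fro_A = np.linalg.norm(A, "fro") ** 2                # squared Frobenius norm of A
fro_Ak = np.linalg.norm(A_k, "fro") ** 2             # squared Frobenius norm of A'
residual = np.linalg.norm(A - A_k, "fro") ** 2       # what the truncation discards
```

Here `fro_A` equals the sum of all squared singular values, `fro_Ak` the sum of the k largest, and `residual` the sum of the rest, which is why $\|A\|_F^2 \ge \|A'\|_F^2$.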
ML-52
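LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
– $A^T A$ is a symmetric n×n matrix
• All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$
• All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$)
$$V_{n \times n} = [v_1\; v_2\; \cdots\; v_n], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$$
• Define singular values ("sigma")
– As the square roots of the eigenvalues of $A^T A$
$$\sigma_j = \sqrt{\lambda_j}, \quad j = 1, \ldots, n$$
– As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$
$$\sigma_1 = \|Av_1\|, \quad \sigma_2 = \|Av_2\|, \quad \ldots$$
$$\|Av_i\|^2 = (Av_i)^T Av_i = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$$
For $\lambda_i \ne 0$, i = 1, …, r, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A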
ML-53
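LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
$$(Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$$
– Suppose that A (or $A^T A$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
– Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
$$u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i$$
$$\Rightarrow [u_1\; u_2\; \cdots\; u_r]\, \Sigma_r = A\, [v_1\; v_2\; \cdots\; v_r]$$
(V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
• Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
$$[u_1\; u_2\; \cdots\; u_r\; \cdots\; u_m]\, \Sigma = A\, [v_1\; v_2\; \cdots\; v_r\; \cdots\; v_n]$$
$$\Rightarrow U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \qquad (VV^T = I_{n \times n})$$
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$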
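The construction above can be followed step by step in code. This sketch assumes NumPy and a random 5×3 matrix, which has full column rank with probability 1, so the division by the singular values is safe here; a rank-deficient A would need the zero singular values handled separately.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((5, 3))                       # assumed full column rank (r = n = 3)

lam, V = np.linalg.eigh(A.T @ A)             # eigen-decomposition of the symmetric A^T A
order = np.argsort(lam)[::-1]                # sort so that lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0.0, None))     # singular values sigma_j = sqrt(lambda_j)

lengths = np.linalg.norm(A @ V, axis=0)      # ||A v_j|| for each column v_j
U_r = (A @ V) / sigma                        # u_i = A v_i / sigma_i, column by column
A_rebuilt = U_r @ np.diag(sigma) @ V.T       # U Sigma V^T
```

The columns of `U_r` come out orthonormal, `lengths` matches `sigma`, and `A_rebuilt` recovers `A`, mirroring $A = U\Sigma V^T$.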
ML-54
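LSA Derivations (3/7)
$$A = U\Sigma V^T = [U_1\; U_2] \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = AV_1 V_1^T$$
$$U\Sigma = AV$$
[Figure: A (m×n) maps Rⁿ to Rᵐ through $V^T$ and $U$. In Rⁿ, the $v_i$ (columns of $V_1$) span the row space of A, and $V_2$ covers the solutions of $AX = 0$. In Rᵐ, the $u_i$ (columns of $U_1$) span the row space of $A^T$, with $U_2$ covering the remainder]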
ML-55
LSA Derivations (4/7)
• Additional Explanations
– Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by columns of $V$
$$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
• The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
– Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by $U$
$$A = U\Sigma V^T \;\Rightarrow\; A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$$
• The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
– The original word-document matrix (A), with rows w1, w2, …, wm and columns d1, d2, …, dn
• compare two terms → dot product of two rows of A, or an entry in $AA^T$
• compare two docs → dot product of two columns of A, or an entry in $A^T A$
• compare a term and a doc → each individual entry of A
– The new word-document matrix (A′), with U′ = $U_{m \times k}$, Σ′ = $\Sigma_k$, V′ = $V_{n \times k}$
• compare two terms → dot product of two rows of U′Σ′
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
• compare two docs → dot product of two rows of V′Σ′
$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
• compare a query word and a doc → each individual entry of A′
(Σ′ is for stretching or shrinking; $V'^T V' = I$ and $U'^T U' = I$)
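The term-term and doc-doc comparisons in the reduced space can be reproduced directly. This is a sketch assuming NumPy and a random 5×4 matrix as a stand-in word-document matrix; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.random((5, 4))                                # toy matrix: 5 terms x 4 docs
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = U_k @ S_k @ V_k.T                               # the new word-document matrix A'

term_vecs = U_k @ S_k                                 # one k-dimensional row per term
doc_vecs = V_k @ S_k                                  # one k-dimensional row per doc

term_sims = term_vecs @ term_vecs.T                   # dot products between term rows
doc_sims = doc_vecs @ doc_vecs.T                      # dot products between doc rows
```

`term_sims` coincides with $A'A'^T$ and `doc_sims` with $A'^T A'$, so the comparisons can be done on k-dimensional vectors without forming A′ at all.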
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
– For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m×1 query (or doc) vector
$$\hat{q}_{1 \times k} = \left(q^T\right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
(The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted by $\Sigma^{-1}$; the result is just like a row of V)
– Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{|\hat{q}\Sigma|\, |\hat{d}\Sigma|}$$
($\hat{q}$ and $\hat{d}$ are row vectors)
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
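Fold-in and the cosine measure are both one-liners once the truncated SVD is available. The sketch assumes NumPy; `fold_in` and `latent_cosine` are illustrative names. As a sanity check it folds in a column that was already part of the matrix, which should land on that doc's own row of V.

```python
import numpy as np

def fold_in(q, U_k, s_k):
    """q_hat = q^T U_k Sigma_k^{-1}, for an m-dimensional query or doc vector q."""
    return (q @ U_k) / s_k

def latent_cosine(q_hat, d_hat, s_k):
    """Cosine between q_hat*Sigma and d_hat*Sigma (both treated as row vectors)."""
    a, b = q_hat * s_k, d_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(7)
A = rng.random((6, 4))                                # toy matrix: 6 terms x 4 docs
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

d_hat = fold_in(A[:, 1], U_k, s_k)                    # fold in existing doc 1 as if new
sim = latent_cosine(d_hat, V_k[1], s_k)               # compare with its row of V
```

A new term vector t of length n is folded in the same way with `fold_in(t, V_k, s_k)`, following the term fold-in formula.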
ML-59
LSA Example
• Experimental results
– HMM is consistently better than VSM at all recall levels
– LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
– A clean formal framework and a clearly defined optimization criterion (least-squares)
• Conceptual simplicity and clarity
– Handles synonymy problems (“heterogeneous vocabulary”)
– Good results for high-recall search
• Takes term co-occurrence into account
• Disadvantages
– High computational complexity
– LSA offers only a partial solution to polysemy
• E.g. bank, bass, …
ML-61
LSA Toolkit SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit SVDLIBC (2/5)
• Given a sparse term-doc matrix
– E.g. 4 terms and 3 docs (rows: terms, columns: docs)
2.3 0.0 4.2
0.0 1.3 2.2
3.8 0.0 0.5
0.0 0.0 0.0
– Each entry is weighted by TFxIDF score
– The same matrix in sparse text form
4 3 6      (4 rows/terms, 3 columns/docs, 6 nonzero entries)
2          (2 nonzero entries at Col 0)
0 2.3      (Col 0, Row 0)
2 3.8      (Col 0, Row 2)
1          (1 nonzero entry at Col 1)
1 1.3      (Col 1, Row 1)
3          (3 nonzero entries at Col 2)
0 4.2      (Col 2, Row 0)
1 2.2      (Col 2, Row 1)
2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, …, 600 etc.) of LSA dimensionality
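The conversion from a dense matrix to the column-wise sparse text layout shown on this slide can be sketched as follows. The function name `to_sparse_text` is hypothetical, and the layout is inferred from the slide's example only: a header line with rows, columns and nonzero count, then for each column its nonzero count followed by "row value" pairs.

```python
def to_sparse_text(matrix):
    """Serialize a dense list-of-rows matrix into the column-wise sparse text layout."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = []
    for c in range(n_cols):
        # collect (row index, value) pairs of the nonzero entries in column c
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines)

matrix = [[2.3, 0.0, 4.2],
          [0.0, 1.3, 2.2],
          [3.8, 0.0, 0.5],
          [0.0, 0.0, 0.0]]
text = to_sparse_text(matrix)
```

For the 4×3 example, `text` starts with the header `4 3 6` and then lists column 0 (two entries), column 1 (one entry) and column 2 (three entries).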
ML-63
LSA Toolkit SVDLIBC (3/5)
• Example term-doc matrix
51253 2265 218852      (indexing term no., doc no., nonzero entries)
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
……
• SVD command (IR_svd.bat)
svd -r st -o LSA100 -d 100 Term-Doc-Matrix
– -r st: sparse matrix input
– -o LSA100: prefix of output files
– -d 100: no. of reserved eigenvectors
– Term-Doc-Matrix: name of sparse matrix input
• Output
– LSA100-Ut
– LSA100-S
– LSA100-Vt
ML-64
LSA Toolkit SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1×100)
100 51253
0.003 0.001 ……
0.002 0.002 ……
• LSA100-S (100 eigenvalues)
100
2686.18
829.941
559.59
…
• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1×100)
100 2265
0.021 0.035 ……
0.012 0.022 ……
ML-65
LSA Toolkit SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
$$\hat{q}_{1 \times k} = \left(q^T\right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
(The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V)
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{|\hat{q}\Sigma|\, |\hat{d}\Sigma|}$$
ML-13
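PCA Derivations (7/13)
• Minimum Mean-Squared Error Criterion
  – If the eigenvectors associated with the m largest eigenvalues are retained, the mean-squared error will be
    $$\varepsilon_{eigen}(m) = \sum_{j=m+1}^{n} \lambda_j, \quad \text{where } \lambda_1 \ge \dots \ge \lambda_m \ge \dots \ge \lambda_n \ge 0$$
  – Any two projections $y_i$ and $y_j$ ($i \ne j$) will be mutually uncorrelated
    $$E[y_i y_j] = E[(\phi_i^T x)(\phi_j^T x)^T] = E[\phi_i^T x x^T \phi_j] = \phi_i^T E[x x^T] \phi_j = \phi_i^T R \phi_j = \lambda_j \phi_i^T \phi_j = 0$$
    • Good news for most statistical modeling approaches
      – Gaussians and diagonal matrices
  – Covariance matrix and correlation matrix
    $$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{bmatrix}$$
    $$\Sigma = E[(x-\mu)(x-\mu)^T] = E[x x^T] - \mu\mu^T \approx \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T - \mu\mu^T$$
    $$R = E[x x^T] \approx \frac{1}{N}\sum_{i=1}^{N} x_i x_i^T$$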
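The two claims on this slide can be checked numerically. The following is a minimal sketch, assuming NumPy and synthetic zero-mean data (the sizes `N`, `n`, `m` and the mixing matrix are made-up illustration values, not from the slides): the projections onto the eigenvectors of R come out uncorrelated, and the reconstruction error after keeping m components equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): N samples with n correlated features.
N, n, m = 2000, 4, 2
mixing = rng.normal(size=(n, n))
X = rng.normal(size=(N, n)) @ mixing.T
X -= X.mean(axis=0)                          # enforce the zero-mean assumption

R = X.T @ X / N                              # sample estimate of R = E[x x^T]
eigvals, Phi = np.linalg.eigh(R)             # eigh returns ascending eigenvalues
eigvals, Phi = eigvals[::-1], Phi[:, ::-1]   # reorder: lambda_1 >= ... >= lambda_n

Y = X @ Phi                                  # row i holds the projections Phi^T x_i
cov_Y = Y.T @ Y / N                          # estimate of E[y_i y_j]

X_hat = Y[:, :m] @ Phi[:, :m].T              # reconstruct from the m leading components
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(np.round(cov_Y, 6))                    # diagonal matrix of eigenvalues
print(mse, eigvals[m:].sum())                # the two numbers agree
```

Because R is estimated from the same samples that are projected, both identities hold up to floating-point error rather than only approximately.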
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
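PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
  – Define the objective function (the first term is to be minimized, the second term carries the constraints)
    $$J = \sum_{j=m+1}^{n} \phi_j^T R \phi_j - \sum_{j=m+1}^{n}\sum_{k=m+1}^{n} \mu_{jk}\left(\phi_j^T \phi_k - \delta_{jk}\right), \quad \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \ne k \end{cases}$$
  – Take the derivative, using $\partial(\phi^T R \phi)/\partial\phi = 2R\phi$
    $$\forall\, m+1 \le j \le n: \quad \frac{\partial J}{\partial \phi_j} = 2R\phi_j - 2\sum_{k=m+1}^{n} \mu_{jk}\phi_k = 0$$
    $$\Rightarrow R\phi_j = \Phi_{n-m}\mu_j, \quad \text{where } \Phi_{n-m} = [\phi_{m+1} \cdots \phi_n],\ \mu_j = [\mu_{j,m+1} \cdots \mu_{jn}]^T$$
    $$\Rightarrow R[\phi_{m+1} \cdots \phi_n] = \Phi_{n-m}[\mu_{m+1} \cdots \mu_n]$$
    $$\Rightarrow R\Phi_{n-m} = \Phi_{n-m}U_{n-m}, \quad \text{where } U_{n-m} = [\mu_{m+1} \cdots \mu_n]$$
  – A particular solution exists if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \dots, \lambda_n$ of $R$, and $\phi_{m+1}, \dots, \phi_n$ are their corresponding eigenvectors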
ML-16
PCA Derivations (10/13)
• Given an input vector x with dimension n
  – Try to construct a linear transform Φ' (Φ' is an n×m matrix, m < n) such that the truncation result Φ'^T x is optimal under the mean-squared error criterion
    $$\text{minimize } E_x\left[(x - \hat{x})^T (x - \hat{x})\right]$$
  – Encoder: maps $x = [x_1, x_2, \dots, x_n]^T$ to $y = [y_1, y_2, \dots, y_m]^T$
    $$y = \Phi'^T x, \quad \text{where } \Phi' = [e_1\ e_2\ \cdots\ e_m]$$
  – Decoder: maps $y = [y_1, y_2, \dots, y_m]^T$ to $\hat{x} = [\hat{x}_1, \hat{x}_2, \dots, \hat{x}_n]^T$
    $$\hat{x} = \Phi' y$$
ML-17
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition
  – PCA needs no prior information (e.g., class distributions or output information) about the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
      $$\frac{\lambda_1 + \lambda_2 + \dots + \lambda_m}{\lambda_1 + \lambda_2 + \dots + \lambda_m + \dots + \lambda_n} \ge \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
      $$\lambda_m \ge \frac{1}{n}\sum_{i=1}^{n} \lambda_i$$
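Both selection rules for m are simple to state in code. This is a small sketch in plain Python; the function names and the eigenvalue list are hypothetical illustration values, not part of the slides.

```python
def select_by_variance_ratio(eigvals, threshold):
    """Smallest m whose leading eigenvalues reach `threshold` of the total variance."""
    lams = sorted(eigvals, reverse=True)
    total = sum(lams)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(lams)


def select_above_average(eigvals):
    """Number of eigenvalues that are at least the average eigenvalue."""
    average = sum(eigvals) / len(eigvals)
    return sum(1 for lam in eigvals if lam >= average)


eigvals = [4.0, 3.0, 2.0, 1.0]                      # made-up spectrum
m_ratio = select_by_variance_ratio(eigvals, 0.85)   # cumulative ratios: 0.4, 0.7, 0.9, 1.0
m_avg = select_above_average(eigvals)               # average eigenvalue is 2.5
print(m_ratio, m_avg)
```

The two rules need not agree: the ratio rule depends on a user-chosen threshold, while the average rule is parameter-free.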
ML-19
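PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of the average between-class variation and the average within-class variation is maximal
  $$J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W$$
  $$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W$$
  $$S_w = \frac{1}{N}\sum_{j} N_j \Sigma_j, \qquad S_b = \frac{1}{N}\sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T, \qquad S = \frac{1}{N}\sum_{i} (x_i - \bar{x})(x_i - \bar{x})^T$$
  (i: sample index, j: class index)
• Try to show that $S = S_w + S_b$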
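One possible route to the requested identity is sketched below. It assumes the definitions used on the LDA slides (class mean $\bar{x}_j$, class size $N_j$, class covariance $\Sigma_j$) and writes "$i \in j$" as shorthand for "sample i belongs to class j"; the shorthand is introduced here and is not the slides' notation.

```latex
\begin{aligned}
S &= \frac{1}{N}\sum_{j}\sum_{i \in j} (x_i - \bar{x})(x_i - \bar{x})^T \\
  &= \frac{1}{N}\sum_{j}\sum_{i \in j}
     \bigl[(x_i - \bar{x}_j) + (\bar{x}_j - \bar{x})\bigr]
     \bigl[(x_i - \bar{x}_j) + (\bar{x}_j - \bar{x})\bigr]^T \\
  &= \frac{1}{N}\sum_{j}\sum_{i \in j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T
   + \frac{1}{N}\sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T \\
  &= \frac{1}{N}\sum_{j} N_j \Sigma_j + S_b \;=\; S_w + S_b
\end{aligned}
```

The cross terms drop out in the third line because $\sum_{i \in j} (x_i - \bar{x}_j) = 0$ for every class j.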
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
  [Figure labels: correlation matrix for old feature dimensions; new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
  [Figure: image of size 256 × 256, with 8 × 8 blocks marked]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
  [Figure panels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating point numbers representing the light intensity in each pixel
      $$x_1 = [x_{11}, x_{21}, \dots, x_{n1}]^T, \quad x_2 = [x_{12}, x_{22}, \dots, x_{n2}]^T, \quad \dots, \quad x_L = [x_{1L}, x_{2L}, \dots, x_{nL}]^T$$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
      $$\hat{x}_i = \bar{x} + w_{i,1} e_1 + w_{i,2} e_2 + \dots + w_{i,K} e_K \;\Rightarrow\; y_i = [w_{i,1}, w_{i,2}, \dots, w_{i,K}]$$
      ($y_i$: feature vector of a person i)
  – The eigenface approach follows the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
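The eigenface steps can be sketched as follows. This is a minimal illustration, assuming NumPy and random arrays standing in for real face images; the sizes `L`, `h`, `w`, `K` are hypothetical. The eigenvectors of the covariance matrix are taken from an SVD of the centered data, which is a common shortcut and not necessarily what the original authors did.

```python
import numpy as np

rng = np.random.default_rng(0)

L, h, w, K = 20, 8, 8, 5                 # hypothetical: 20 images of 8x8 pixels, keep 5 eigenfaces
images = rng.random((L, h, w))           # random stand-in for the reference face images

X = images.reshape(L, -1)                # each image becomes a vector of n = h*w intensities
mean_face = X.mean(axis=0)               # "eigenface 0": the averaged face
Xc = X - mean_face                       # variations from the average face

# Right singular vectors of the centered data are the eigenvectors of its covariance matrix.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenfaces = Vt[:K]                      # K x n, rows are eigenface 1 .. eigenface K

weights = Xc @ eigenfaces.T              # row i is the feature vector y_i = [w_i1, ..., w_iK]
X_hat = mean_face + weights @ eigenfaces # eigenface 0 plus a linear combination of K eigenfaces

residual = np.sum((X - X_hat) ** 2)      # total squared reconstruction error
print(weights.shape, residual, np.sum(s[K:] ** 2))
```

The residual equals the sum of the squared discarded singular values, which is the minimum mean-squared error property carried over from PCA.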
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images as the training set]
[Figure: the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces)]
[Figure: a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL 2000)
  – Steps
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
    • Eigenvoice space construction
      [Figure: Speaker 1 data … Speaker R data → model training (starting from the SI HMM) → Speaker 1 HMM … Speaker R HMM → supervectors of dimension D = (M·n) × 1 → Principal Component Analysis]
    • Each new speaker S is represented by a point P in K-space
      $$P = e(0) + w(1)\,e(1) + w(2)\,e(2) + \dots + w(K)\,e(K)$$
      (e(0): SI HMM model)
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation over the average within-class variation is maximal
[Figure note: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes
  $$g(x_i) = j, \quad j \in \{1, 2, \dots, J\}, \quad g(\cdot) \text{ is the class index}$$
  – The sample mean is
    $$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
  – The class sample means are
    $$\bar{x}_j = \frac{1}{N_j}\sum_{g(x_i)=j} x_i$$
  – The class sample covariances are
    $$\Sigma_j = \frac{1}{N_j}\sum_{g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
  – The average within-class variation before transform
    $$S_w = \frac{1}{N}\sum_{j} N_j \Sigma_j$$
  – The average between-class variation before transform
    $$S_b = \frac{1}{N}\sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – The sample vectors will be
    $$y_i = W^T x_i$$
  – The sample mean will be
    $$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} W^T x_i = W^T\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = W^T \bar{x}$$
  – The class sample means will be
    $$\bar{y}_j = \frac{1}{N_j}\sum_{g(x_i)=j} W^T x_i = W^T \bar{x}_j$$
  – The average within-class variation will be
    $$\tilde{S}_w = \frac{1}{N}\sum_{j} N_j \left\{ \frac{1}{N_j}\sum_{g(x_i)=j} \left(W^T x_i - W^T \bar{x}_j\right)\left(W^T x_i - W^T \bar{x}_j\right)^T \right\} = W^T \left\{ \frac{1}{N}\sum_{j} N_j \Sigma_j \right\} W = W^T S_w W$$
[Figure: the transform W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – Similarly, the average between-class variation will be
    $$\tilde{S}_b = W^T S_b W$$
  – Try to find the optimal W such that the following objective function is maximized
    $$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$
    • A closed-form solution: the column vectors $w_i$ of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
      $$S_b w_i = \lambda_i S_w w_i$$
    • That is, the $w_i$ are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
      $$S_w^{-1} S_b w_i = \lambda_i w_i$$
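The closed-form solution can be tried on toy data. The sketch below assumes NumPy and three synthetic Gaussian classes; the helper names `scatter_matrices` and `lda_transform` are hypothetical. It builds $S_w$ and $S_b$ as defined on the slides and takes the leading eigenvectors of $S_w^{-1} S_b$.

```python
import numpy as np


def scatter_matrices(X, labels):
    """Average within-class (S_w) and between-class (S_b) variation, both divided by N."""
    N, n = X.shape
    mean = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mean_j = Xj.mean(axis=0)
        centered = Xj - mean_j
        S_w += centered.T @ centered / N            # equals N_j * Sigma_j / N
        diff = (mean_j - mean)[:, None]
        S_b += len(Xj) * (diff @ diff.T) / N
    return S_w, S_b


def lda_transform(X, labels, m):
    """Eigenvalues and eigenvectors of S_w^{-1} S_b for the m largest eigenvalues."""
    S_w, S_b = scatter_matrices(X, labels)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:m]
    return eigvals.real[order], eigvecs.real[:, order]


rng = np.random.default_rng(0)
centers = ([0, 0, 0], [3, 1, 0], [0, 2, 4])          # made-up class centers
X = np.vstack([rng.normal(loc=c, size=(100, 3)) for c in centers])
labels = np.repeat([0, 1, 2], 100)

lam, W = lda_transform(X, labels, m=2)
Y = X @ W                                            # rows are y_i = W^T x_i
print(lam, Y.shape)
```

With J classes, $S_b$ has rank at most J − 1, so at most J − 1 eigenvalues are nonzero; that is why m = 2 is the natural choice for this three-class example.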
ML-36
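LDA Derivations (4/4)
• Proof
  $$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \qquad (|\cdot|: \text{determinant})$$
  Or, for each column vector $w_i$ of W, we want to find that
  $$\hat{w}_i = \arg\max_{w_i} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}, \qquad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$$
  The quadratic form has an optimal solution where the derivative vanishes. Using
  $$\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}, \qquad \frac{d(x^T C x)}{dx} = (C + C^T)x$$
  we obtain
  $$\frac{\partial}{\partial w_i}\left(\frac{w_i^T S_b w_i}{w_i^T S_w w_i}\right) = 0$$
  $$\Rightarrow \frac{2 S_b w_i \left(w_i^T S_w w_i\right) - 2 S_w w_i \left(w_i^T S_b w_i\right)}{\left(w_i^T S_w w_i\right)^2} = 0$$
  $$\Rightarrow \frac{S_b w_i}{w_i^T S_w w_i} - \frac{S_w w_i \left(w_i^T S_b w_i\right)}{\left(w_i^T S_w w_i\right)^2} = 0$$
  $$\Rightarrow S_b w_i - \lambda_i S_w w_i = 0 \qquad \left(\because\ \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}\right)$$
  $$\Rightarrow S_b w_i = \lambda_i S_w w_i \;\Rightarrow\; S_w^{-1} S_b w_i = \lambda_i w_i$$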
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
  [Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
  $$\Sigma_x = \frac{1}{N}\sum_{i} (x_i - \bar{x})(x_i - \bar{x})^T$$
  After Cosine Transform
  [Figure: covariance matrix of the 18-cepstral vectors, calculated using Year-99's 5471 files]
  $$\Sigma'_y = \frac{1}{N}\sum_{i} (y_i - \bar{y})(y_i - \bar{y})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
  [Figure, after PCA transform: covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99's 5471 files]
  [Figure, after LDA transform: covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99's 5471 files]
  Character Error Rate
            TC       WG
  MFCC     26.32    22.71
  LDA-1    23.12    20.17
  LDA-2    23.11    20.11
ML-39
PCA vs LDA (1/2)
[Figure panels: PCA, LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA provides a much lower classification error than LDA, theoretically
    • However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
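HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    % between-class (BE) and within-class (WI) matrices
    BE = [3.0 3.5 1.4;
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;
          2.9 8.7 3.5;
          4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI * BE;
    % PCA
    A = BE + WI;               % why? (Prove it)
    [V, D] = eig(A);           % all eigenvectors and eigenvalues
    [V, D] = eigs(A, 3);       % or: the 3 leading eigenvectors
    % write the basis vectors to a file
    fid = fopen('Basis', 'w');
    for i = 1:3                % feature vector length
        for j = 1:3            % basis number
            fprintf(fid, '%10.10f ', V(i, j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);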
ML-44
HW-3 Feature Transformation (4/4)
• Examples
  [Figure: distribution of the 2000 original data samples after the PCA transform; axes: feature 1, feature 2]
  [Figure: distribution of the 2000 original data samples after the LDA transform; axes: feature 1, feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
    • Closely related to Principal Component Analysis (PCA)
ML-46
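LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: project X from the n-dimensional feature space onto an orthonormal basis $\phi_1, \dots, \phi_k$
    $$y_i = \phi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \phi_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$
  – SVD (in LSA): map A into a k-dimensional latent semantic space
    $$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n)$$
    $$A'_{m \times n} = U'_{m \times k}\, \Sigma'_{k \times k}\, V'^T_{k \times n}, \qquad \min \|A - A'\|_F^2 \text{ for a given } k$$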
ML-47
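LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: projection of a vector x onto the axes $\phi_1$ and $\phi_2$, giving $y_1$ and $y_2$]
  $$y_1 = \|x\| \cos\theta_1 = \|x\| \frac{\phi_1^T x}{\|\phi_1\|\,\|x\|} = \phi_1^T x, \quad \text{where } \|\phi_1\| = 1$$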
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
  [Figure labels: Doc, Query, terms, structure model, doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  – Rows of A correspond to words $w_1, w_2, \dots, w_m$; columns correspond to docs $d_1, d_2, \dots, d_n$
    $$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n)$$
    $$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}, \quad k \le r$$
  – Both U and V have orthonormal column vectors
    $$U^T U = I_{r \times r}, \qquad V^T V = I_{r \times r}$$
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
  – Frobenius norm
    $$\|A\|_F^2 \ge \|A'\|_F^2, \qquad \|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$
  – Row A ∈ R^n, Col A ∈ R^m
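The properties listed on this slide can be checked with a few lines of NumPy. This is a sketch on a random matrix standing in for a weighted term-document matrix; the sizes `m`, `n`, `k` and the helper `fro2` are illustration choices, not from the slides.

```python
import numpy as np


def fro2(M):
    """Squared Frobenius norm: sum of squared entries."""
    return float(np.sum(M ** 2))


rng = np.random.default_rng(0)
m, n, k = 6, 4, 2                          # hypothetical: 6 terms, 4 docs, 2 latent dimensions
A = rng.random((m, n))                     # random stand-in for the term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt, s in descending order
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k]
A_k = U_k @ S_k @ Vt_k                     # rank-k approximation A'

print(fro2(A), fro2(A_k))                  # ||A||_F^2 >= ||A'||_F^2
print(fro2(A - A_k), np.sum(s[k:] ** 2))   # approximation error = discarded sigma^2
```

The last line shows why truncation is a least-squares method: the squared error of the rank-k approximation is exactly the sum of the squared singular values that were dropped.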
ML-52
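LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
      $$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$)
      $$V_{n \times n} = [v_1\ v_2\ \cdots\ v_n], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \dots, n$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$
    $$\sigma_1 = \|Av_1\|, \quad \sigma_2 = \|Av_2\|, \quad \dots$$
    $$\|Av_i\|^2 = (Av_i)^T (Av_i) = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$$
  – For $\lambda_i \ne 0$, i = 1, …, r, {$Av_1, Av_2, \dots, Av_r$} is an orthogonal basis of Col A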
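Both definitions of the singular values can be compared numerically. The sketch below assumes NumPy and a random 5×3 matrix (an arbitrary illustration size): the square roots of the eigenvalues of $A^T A$, the lengths of the vectors $Av_j$, and the values returned by an SVD routine all coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((5, 3))                         # arbitrary example matrix (m = 5, n = 3)

lam, V = np.linalg.eigh(A.T @ A)               # eigen-decomposition of the symmetric A^T A
lam, V = lam[::-1], V[:, ::-1]                 # reorder: lambda_1 >= ... >= lambda_n

sigma = np.sqrt(np.clip(lam, 0.0, None))       # singular values as sqrt(lambda_j)
lengths = np.linalg.norm(A @ V, axis=0)        # singular values as ||A v_j||
gram = (A @ V).T @ (A @ V)                     # (A v_i)^T (A v_j): zero off the diagonal

print(sigma)
print(lengths)
print(np.linalg.svd(A, compute_uv=False))
```

The Gram matrix being diagonal is the numerical counterpart of the statement that the vectors $Av_j$ are mutually orthogonal.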
ML-53
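LSA Derivations (2/7)
• {$Av_1, Av_2, \dots, Av_r$} is an orthogonal basis of Col A
  $$(Av_i) \cdot (Av_j) = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$$
  – Suppose that A (or $A^T A$) has rank r ≤ n
    $$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$$
  – Define an orthonormal basis {$u_1, u_2, \dots, u_r$} for Col A
    $$u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis {$u_1, u_2, \dots, u_m$} of $R^m$
      $$[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n]$$
      $$\Rightarrow U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \qquad (VV^T = I_{n \times n})$$
      $$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
  – Frobenius norm
    $$\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2, \qquad \|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$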
ML-54
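LSA Derivations (3/7)
$$A_{m \times n} = U\Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T, \qquad U\Sigma = AV$$
[Figure: A maps $R^n$ to $R^m$ through $V^T$ and U. The $v_i$ in $V_1$ span the row space of A; $V_2$ corresponds to the solutions of AX = 0; the $u_i$ in $U_1$ span the row space of $A^T$]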
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
    $$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  – Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
    $$A = U\Sigma V^T \;\Rightarrow\; A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$$
    • The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A), m×n, with rows $w_1, \dots, w_m$ and columns $d_1, \dots, d_n$
    • compare two terms → dot product of two rows of A
      – or an entry in $AA^T$
    • compare two docs → dot product of two columns of A
      – or an entry in $A^T A$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of U'Σ'
      $$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
    • compare two docs → dot product of two rows of V'Σ'
      $$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • compare a query word and a doc → each individual entry of A'
  (Σ': for stretching or shrinking; $V'^T V' = I$ and $U'^T U' = I$)
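The term-term and doc-doc comparisons in the reduced space can be written directly from the rows of U'Σ' and V'Σ'. A minimal sketch, assuming NumPy and a random stand-in matrix (sizes and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                      # random stand-in: 6 terms x 4 docs
k = 2                                       # number of latent dimensions kept

U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vecs = U[:, :k] * s[:k]                # rows of U' Sigma' (one per term)
doc_vecs = Vt[:k].T * s[:k]                 # rows of V' Sigma' (one per doc)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]    # the new word-document matrix A'

term_term = term_vecs @ term_vecs.T         # entry (i, j): compare term i with term j
doc_doc = doc_vecs @ doc_vecs.T             # entry (k, s): compare doc k with doc s

print(np.round(term_term, 3))
print(np.round(doc_doc, 3))
```

Working with the k-dimensional rows avoids forming the full m×m or n×n products of A' while giving the same dot products.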
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
      $$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
      – The query is represented by the weighted sum of its constituent term vectors
      – The separate dimensions are differentially weighted
      – $\hat{q}$ is just like a row of V
  – Cosine measure between the query and doc vectors in the latent semantic space
      $$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|} \qquad (\hat{q}, \hat{d}: \text{row vectors})$$
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
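The fold-in formulas for queries and terms, and the cosine measure, can be sketched as below. This assumes NumPy and a random stand-in for the term-document matrix; the function names `fold_in_query`, `fold_in_term`, and `sim` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 4, 2                            # hypothetical: 6 terms, 4 docs, 2 latent dimensions
A = rng.random((m, n))                       # random stand-in for the term-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T
S_k_inv = np.linalg.inv(S_k)


def fold_in_query(q):
    """q_hat = q^T U_k Sigma_k^{-1} for a length-m query (or doc) vector."""
    return q @ U_k @ S_k_inv


def fold_in_term(t):
    """t_hat = t V_k Sigma_k^{-1} for a length-n term vector."""
    return t @ V_k @ S_k_inv


def sim(q_hat, d_hat):
    """Cosine between q_hat Sigma and d_hat Sigma."""
    a, b = q_hat @ S_k, d_hat @ S_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


q_hat = fold_in_query(A[:, 1])               # fold in an existing doc as if it were a query
scores = [sim(q_hat, V_k[j]) for j in range(n)]
print(np.round(scores, 3))
```

Folding in a column of A that was part of the original analysis reproduces the corresponding row of V, which is a convenient sanity check before folding in genuinely new queries.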
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (Row: Term, Col: Doc)
      2.3  0.0  4.2
      0.0  1.3  2.2
      3.8  0.0  0.5
      0.0  0.0  0.0
  – Sparse text representation of the same matrix
      4 3 6      (no. of terms, no. of docs, no. of nonzero entries)
      2          (2 nonzero entries at Col 0)
      0 2.3      (Col 0, Row 0)
      2 3.8      (Col 0, Row 2)
      1          (1 nonzero entry at Col 1)
      1 1.3      (Col 1, Row 1)
      3          (3 nonzero entries at Col 2)
      0 4.2      (Col 2, Row 0)
      1 2.2      (Col 2, Row 1)
      2 0.5      (Col 2, Row 2)
  – Each entry is weighted by its TFxIDF score
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
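The column-by-column layout shown on this slide can be produced from a dense matrix with a few lines of plain Python. This sketch only mirrors the layout of the slide's example (header line, then for each column its nonzero count followed by "row value" pairs); the function name `to_sparse_text` is hypothetical and is not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense list-of-rows matrix in the column-oriented layout of the slide."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Collect (row index, value) for every nonzero entry of column c.
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)

    nonzeros = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {nonzeros}"]      # header: rows, cols, nonzero entries
    for col in columns:
        lines.append(str(len(col)))                # number of nonzero entries in this column
        lines.extend(f"{r} {v}" for r, v in col)   # one "row value" pair per line
    return "\n".join(lines)


term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(term_doc)
print(text)
```

For the 4×3 example this prints the same ten lines as the slide, starting with the header `4 3 6`.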
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
    51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1×100)
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
• LSA100-S (100 eigenvalues)
    100
    2686.18
    829.941
    559.59
    …
• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1×100)
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
  $$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
  $$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
ML-14
PCA Derivations (8/13)
• A two-dimensional example of Principal Component Analysis
ML-15
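PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
  – It can be proved that $\varepsilon_{\text{eigen}}(m)$ is the optimal solution under the mean-squared error criterion

Define the objective function (the first term is to be minimized; the second term carries the constraints):
$$J = \sum_{j=m+1}^{n} \varphi_j^T R \varphi_j - \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} \left( \varphi_j^T \varphi_k - \delta_{jk} \right), \qquad \delta_{jk} = \begin{cases} 1 & \text{if } j = k \\ 0 & \text{if } j \neq k \end{cases}$$

Take the derivative, using $\frac{\partial\, \varphi^T R \varphi}{\partial \varphi} = 2 R \varphi$:
$$\frac{\partial J}{\partial \varphi_j} = 2 R \varphi_j - 2 \sum_{k=m+1}^{n} \mu_{jk} \varphi_k = 0, \quad \forall\, m+1 \le j \le n$$
$$\Rightarrow R \varphi_j = \Phi_{n-m}\, \mu_j, \quad \text{where } \Phi_{n-m} = [\varphi_{m+1}\ \cdots\ \varphi_n],\ \mu_j = [\mu_{j,m+1}\ \cdots\ \mu_{j,n}]^T$$
$$\Rightarrow R\, \Phi_{n-m} = \Phi_{n-m} U_{n-m}, \quad \text{where } U_{n-m} = [\mu_{m+1}\ \cdots\ \mu_n]$$

This has a particular solution if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $R$, and $\varphi_{m+1}, \ldots, \varphi_n$ are their corresponding eigenvectors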
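The minimum mean-squared error claim can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic zero-mean data set (the data, the helper `truncation_mse`, and the choice m = 2 are illustrative assumptions, not part of the original slides). It compares the truncation error of the leading eigenvectors with the sum of the discarded eigenvalues and with the error of an arbitrary orthonormal basis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical zero-mean data: 5000 samples in n = 4 dimensions with unequal spread.
X = rng.normal(size=(5000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X -= X.mean(axis=0)

R = X.T @ X / len(X)                        # sample correlation matrix R = E[x x^T]
eigvals, eigvecs = np.linalg.eigh(R)        # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # re-sort in descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

def truncation_mse(basis):
    """Mean squared error after keeping only the coordinates along `basis`
    (an n x m matrix with orthonormal columns)."""
    X_hat = X @ basis @ basis.T
    return float(np.mean(np.sum((X - X_hat) ** 2, axis=1)))

m = 2
mse_eigen = truncation_mse(eigvecs[:, :m])  # keep the m leading eigenvectors
discarded = float(eigvals[m:].sum())        # lambda_{m+1} + ... + lambda_n

# Any other orthonormal basis of the same size should do no better.
Q, _ = np.linalg.qr(rng.normal(size=(4, m)))
mse_random = truncation_mse(Q)

print(mse_eigen, discarded, mse_random)
```

In this sketch the first two printed values agree up to rounding, and the third is never smaller than the first, which is what the criterion predicts.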
ML-16
PCA Derivations (10/13)
• Given an input vector $x$ with dimension n
  – Try to construct a linear transform $\Phi'$ ($\Phi'$ is an n×m matrix, m < n) such that the truncation result $\Phi'^T x$ is optimal under the mean-squared error criterion

Encoder: $y = \Phi'^T x$, mapping $x = [x_1\ x_2\ \cdots\ x_n]^T$ to $y = [y_1\ y_2\ \cdots\ y_m]^T$
Decoder: $\hat{x} = \Phi' y$, mapping $y = [y_1\ y_2\ \cdots\ y_m]^T$ to $\hat{x} = [\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n]^T$
where $\Phi' = [e_1\ e_2\ \cdots\ e_m]$
Criterion: minimize $E_x\left[ (x - \hat{x})^T (x - \hat{x}) \right]$
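The encoder/decoder pair can be sketched in a few lines. This is an illustrative example only, assuming NumPy; the function names `fit_pca`, `encode`, and `decode` and the synthetic data are hypothetical. Unlike the slide, which assumes zero-mean x, the sketch subtracts and restores the sample mean explicitly.

```python
import numpy as np

def fit_pca(X, m):
    """Return the sample mean and an n x m matrix whose columns are the m leading eigenvectors."""
    mean = X.mean(axis=0)
    cov = np.cov(X - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    keep = np.argsort(eigvals)[::-1][:m]        # indices of the m largest
    return mean, eigvecs[:, keep]

def encode(x, mean, Phi):
    """Encoder: y = Phi'^T x (after removing the mean)."""
    return Phi.T @ (x - mean)

def decode(y, mean, Phi):
    """Decoder: x_hat = Phi' y (adding the mean back)."""
    return Phi @ y + mean

# Hypothetical data: 3-dimensional points lying close to a single direction.
rng = np.random.default_rng(1)
t = rng.normal(size=(200, 1))
direction = np.array([[1.0, 2.0, 2.0]]) / 3.0
X = t @ direction + 0.01 * rng.normal(size=(200, 3))

mean, Phi = fit_pca(X, m=1)
y = encode(X[0], mean, Phi)          # n = 3 values compressed to m = 1
x_hat = decode(y, mean, Phi)
print(y.shape, np.linalg.norm(X[0] - x_hat))
```

Because the data vary almost entirely along one direction, a single coefficient per sample is enough to reconstruct each point with a small error.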
ML-17
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition (to be discussed later on)
  – PCA needs no prior information (e.g., class distributions of output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
      $$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
      $$\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i$$
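Both selection rules are easy to express in code. The sketch below uses only the Python standard library; the function names and the example eigenvalue spectrum are hypothetical.

```python
def choose_m_by_ratio(eigvals, threshold):
    """Smallest m whose leading eigenvalues account for at least `threshold` of the total variance."""
    lam = sorted(eigvals, reverse=True)
    total = sum(lam)
    running = 0.0
    for m, value in enumerate(lam, start=1):
        running += value
        if running / total >= threshold:
            return m
    return len(lam)

def choose_m_by_average(eigvals):
    """Number of eigenvalues that are at least as large as the average eigenvalue."""
    average = sum(eigvals) / len(eigvals)
    return sum(1 for value in eigvals if value >= average)

eigvals = [4.0, 2.5, 1.0, 0.3, 0.2]          # hypothetical spectrum, total = 8.0
print(choose_m_by_ratio(eigvals, 0.9))       # 3, since (4.0 + 2.5 + 1.0) / 8.0 = 0.9375
print(choose_m_by_average(eigvals))          # 2, since the average eigenvalue is 1.6
```

The two rules need not agree; the ratio rule depends on the chosen threshold, while the average rule has no free parameter.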
ML-19
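PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of the average between-class variation and the average within-class variation is maximal
$$J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W$$
$$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W$$
$$S_w = \frac{1}{N} \sum_j N_j \Sigma_j$$
$$S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
$$S = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$$
(i: sample index; j: class index)
Try to show that $S = S_w + S_b$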
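Before attempting the proof, the identity $S = S_w + S_b$ can be confirmed numerically. This is a minimal sketch assuming NumPy and hypothetical three-class data; it is a sanity check, not a substitute for the derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labelled data: three classes in two dimensions with different sizes.
class_specs = [((0.0, 0.0), 40), ((3.0, 1.0), 60), ((-2.0, 4.0), 50)]
X = np.vstack([rng.normal(loc=center, size=(count, 2)) for center, count in class_specs])
labels = np.repeat([0, 1, 2], [40, 60, 50])

N = len(X)
mean = X.mean(axis=0)
S = (X - mean).T @ (X - mean) / N                      # total variation

S_w = np.zeros((2, 2))
S_b = np.zeros((2, 2))
for j in np.unique(labels):
    Xj = X[labels == j]
    Nj = len(Xj)
    mean_j = Xj.mean(axis=0)
    Sigma_j = (Xj - mean_j).T @ (Xj - mean_j) / Nj     # class sample covariance
    S_w += Nj * Sigma_j / N                            # average within-class variation
    d = (mean_j - mean)[:, None]
    S_b += Nj * (d @ d.T) / N                          # average between-class variation

print(np.allclose(S, S_w + S_b))
```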
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: image coding example; dimension labels 256 × 256 and 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure panels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
      $$x_1 = [x_{11}\ x_{21}\ \cdots\ x_{n1}]^T,\quad x_2 = [x_{12}\ x_{22}\ \cdots\ x_{n2}]^T,\quad \ldots,\quad x_L = [x_{1L}\ x_{2L}\ \cdots\ x_{nL}]^T$$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
      $$\hat{x}_i = \bar{x} + w_{i,1}\, e(1) + w_{i,2}\, e(2) + \cdots + w_{i,K}\, e(K) \;\Rightarrow\; y_i = [w_{i,1}\ w_{i,2}\ \cdots\ w_{i,K}]$$
      ($y_i$: feature vector of a person i)
  – The eigenface approach adheres to the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
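The eigenface steps can be sketched as follows, assuming NumPy. Random arrays stand in for real face images, and the sizes L, n, and K are hypothetical. The sketch obtains the eigenvectors from an SVD of the centered data rather than by forming the covariance matrix explicitly; the right singular vectors of the centered data are the eigenvectors of that covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n, K = 20, 64, 5               # hypothetical: 20 reference images of 8 x 8 pixels, keep 5 eigenfaces
faces = rng.random((L, n))        # random stand-ins for pixel intensities, one image per row

mean_face = faces.mean(axis=0)    # "eigenface 0": the averaged face
centered = faces - mean_face

# Rows of Vt are eigenvectors of the covariance matrix of the reference vectors,
# ordered by decreasing singular value.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = Vt[:K]               # e(1) ... e(K), shape (K, n)

weights = centered @ eigenfaces.T             # row i is y_i = [w_i1 ... w_iK]
approx = mean_face + weights @ eigenfaces     # row i is x_hat_i

print(weights.shape, float(np.mean((faces - approx) ** 2)))
```

Each image is now described by K weights instead of n pixel values; recognition can then compare the weight vectors $y_i$ rather than the raw images.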
ML-26
PCA Examples: Eigenface (3/4)
The averaged face
Face images as the training set
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces); a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Labels: Speaker 1 Data … Speaker R Data; SI HMM; Model Training; Speaker 1 HMM … Speaker R HMM; D = (M·n)×1; Principal Component Analysis]
Each new speaker S is represented by a point P in K-space:
$$P_i = e(0) + w_{i,1}\, e(1) + w_{i,2}\, e(2) + \cdots + w_{i,K}\, e(K)$$
(e(0): SI HMM model)
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is
    $$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
  – The class sample means are
    $$\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i) = j} x_i$$
  – The class sample covariances are
    $$\Sigma_j = \frac{1}{N_j} \sum_{g(x_i) = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
  – The average within-class variation before transform
    $$S_w = \frac{1}{N} \sum_j N_j \Sigma_j$$
  – The average between-class variation before transform
    $$S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – The sample vectors will be
    $$y_i = W^T x_i$$
  – The sample mean will be
    $$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$$
  – The class sample means will be
    $$\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i) = j} W^T x_i = W^T \bar{x}_j$$
  – The average within-class variation will be
    $$\tilde{S}_w = \frac{1}{N} \sum_j N_j \left\{ \frac{1}{N_j} \sum_{g(x_i) = j} \left( W^T x_i - W^T \bar{x}_j \right) \left( W^T x_i - W^T \bar{x}_j \right)^T \right\} = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W$$
[Figure: the transform W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – Similarly, the average between-class variation will be
    $$\tilde{S}_b = W^T S_b W$$
  – Try to find the optimal $W$ such that the following objective function is maximized
    $$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$
    • A closed-form solution: the column vectors $w_i$ of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
      $$S_b w_i = \lambda_i S_w w_i$$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
      $$S_w^{-1} S_b w_i = \lambda_i w_i$$
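The closed-form solution translates directly into an eigen-decomposition of $S_w^{-1} S_b$. The following is a minimal sketch assuming NumPy; the helper names `scatter_matrices` and `lda` and the two-class synthetic data are hypothetical. It solves the linear system instead of inverting $S_w$ explicitly, which is a numerical convenience rather than part of the derivation.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Average within-class (S_w) and between-class (S_b) variation."""
    N, n = X.shape
    mean = X.mean(axis=0)
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        mean_j = Xj.mean(axis=0)
        S_w += (Xj - mean_j).T @ (Xj - mean_j) / N     # equals (N_j / N) * Sigma_j
        d = (mean_j - mean)[:, None]
        S_b += len(Xj) * (d @ d.T) / N
    return S_w, S_b

def lda(S_w, S_b, m):
    """Eigenvalues and eigenvectors (as columns) of S_w^{-1} S_b for the m largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    keep = np.argsort(eigvals.real)[::-1][:m]
    return eigvals.real[keep], eigvecs.real[:, keep]

# Hypothetical two-class data sharing one covariance matrix.
rng = np.random.default_rng(0)
shared_cov = np.array([[4.0, 1.5], [1.5, 1.0]])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], shared_cov, size=100),
               rng.multivariate_normal([2.0, 3.0], shared_cov, size=100)])
labels = np.repeat([0, 1], 100)

S_w, S_b = scatter_matrices(X, labels)
lam, W = lda(S_w, S_b, m=1)
w = W[:, 0]
ratio = (w @ S_b @ w) / (w @ S_w @ w)   # between/within ratio along w
print(lam[0], ratio)
```

The printed ratio along the leading direction coincides with the largest eigenvalue, consistent with $\lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$.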
ML-36
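LDA Derivations (4/4)
• Proof
$$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \qquad (|\cdot|: \text{determinant})$$
Or, for each column vector $w_i$ of $W$, we want to find
$$\hat{w}_i = \arg\max_{w_i} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}, \qquad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$$
The quadratic form has an optimal solution where
$$\frac{\partial}{\partial w_i} \left( \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \right) = 0$$
Because $\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d (x^T C x)}{d x} = (C + C^T)\, x$:
$$\Rightarrow \frac{2 S_b w_i \left( w_i^T S_w w_i \right) - \left( w_i^T S_b w_i \right) 2 S_w w_i}{\left( w_i^T S_w w_i \right)^2} = 0$$
$$\Rightarrow \frac{S_b w_i}{w_i^T S_w w_i} - \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \cdot \frac{S_w w_i}{w_i^T S_w w_i} = 0$$
$$\Rightarrow S_b w_i - \lambda_i S_w w_i = 0 \;\Rightarrow\; S_b w_i = \lambda_i S_w w_i$$
$$\Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$$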
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$$
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), calculated using Year-99's 5471 files]
$$\Sigma_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]

Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure panels: PCA, LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
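HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix

    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);

• Eigen Decomposition

    BE = [3.0 3.5 1.4;
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;
          2.9 8.7 3.5;
          4.4 3.2 4.3];

    % LDA
    IWI = inv(WI);
    A = IWI * BE;
    % PCA (an alternative to the two LDA lines above)
    A = BE + WI;              % why? (Prove it)

    [V, D] = eig(A);          % all eigenvectors and eigenvalues
    [V, D] = eigs(A, 3);      % or only the 3 largest

    fid = fopen('Basis', 'w');
    for i = 1:3               % feature vector length
        for j = 1:3           % basis number
            fprintf(fid, '%10.10f ', V(i, j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);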
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original samples after the PCA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original samples after the LDA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
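LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA (feature space): project the n-dimensional X onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$
    $$y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \lVert X - \hat{X} \rVert^2 \text{ for a given } k$$
  – SVD (in LSA) (latent semantic space): the m×n matrix A is approximated by the m×n matrix A', built from U' (m×r), Σ' (r×r), and V'^T (r×n) with r ≤ min(m, n), of which only the leading k dimensions (a k×k block of Σ') are kept
    $$\min \lVert A - A' \rVert_F^2 \text{ for a given } k$$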
ML-47
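LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: projection of a vector x onto the axes $\varphi_1$ and $\varphi_2$, giving $y_1$ and $y_2$]
$$y_1 = \varphi_1^T x = \lVert x \rVert \, \frac{\varphi_1^T x}{\lVert \varphi_1 \rVert \lVert x \rVert} = \lVert x \rVert \cos\theta_1, \quad \text{where } \lVert \varphi_1 \rVert = 1$$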
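The projection relation can be verified with a tiny worked example. The sketch below uses only the Python standard library; the vector x and the two unit axes are hypothetical.

```python
import math

def project(x, phi):
    """Coordinate of x along the unit vector phi: y = phi^T x."""
    return sum(p * v for p, v in zip(phi, x))

x = [3.0, 4.0]
phi1 = [1.0, 0.0]                   # unit-length axes
phi2 = [0.0, 1.0]

y1 = project(x, phi1)
y2 = project(x, phi2)
norm_x = math.hypot(*x)             # |x|
theta1 = math.atan2(x[1], x[0])     # angle between x and phi1

print(y1, y2, norm_x * math.cos(theta1))
```

Here $y_1$ equals $\lVert x \rVert \cos\theta_1$ up to rounding, because $\varphi_1$ has unit length.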
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure labels: Doc; Query; terms; structure model; doc expansion; query expansion; literal term matching; latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n)$$
  (rows of A: words $w_1, w_2, \ldots, w_m$; columns of A: docs $d_1, d_2, \ldots, d_n$)
  – Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – Row A ⊆ $R^n$, Col A ⊆ $R^m$
  $$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}, \qquad k \le r, \qquad \lVert A \rVert_F^2 \ge \lVert A' \rVert_F^2$$
  $$\lVert A \rVert_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$$
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
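The relation between A, its rank-k approximation A', and the Frobenius norm can be illustrated with NumPy's SVD routine. This is a minimal sketch; the small word-document matrix and the choice k = 2 are illustrative assumptions.

```python
import numpy as np

# A small illustrative word-document matrix (m = 4 words, n = 3 docs).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt, s sorted in descending order

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]           # A' = U' Sigma_k V'^T

fro_A = np.linalg.norm(A, "fro") ** 2              # sum of squared entries of A
fro_A_k = np.linalg.norm(A_k, "fro") ** 2

print(fro_A, fro_A_k, float((s[k:] ** 2).sum()))
```

The squared Frobenius norm of A equals the sum of its squared singular values, so dropping the smallest ones reduces the norm by exactly the sum of their squares.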
ML-52
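LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
    • Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$, with $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
      – As the square roots of the eigenvalues of $A^T A$
      – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \lVert Av_1 \rVert,\ \sigma_2 = \lVert Av_2 \rVert,\ \ldots$
$$\lVert Av_i \rVert^2 = (Av_i)^T (Av_i) = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \lVert Av_i \rVert = \sigma_i$$
For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A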
ML-53
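LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
  $$Av_i \cdot Av_j = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$$
  – Suppose that A (or $A^T A$) has rank r ≤ n
    $$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
    $$u_i = \frac{1}{\lVert Av_i \rVert} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r]$$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $$[u_1\ \cdots\ u_r\ \cdots\ u_m]\, \Sigma = A\, [v_1\ \cdots\ v_r\ \cdots\ v_n] \;\Rightarrow\; U \Sigma = A V \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T$$
      since $V V^T = I_{n \times n}$, where
      $$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\lVert A \rVert_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2, \qquad \lVert A \rVert_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$$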
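The construction above can be followed step by step in code: eigen-decompose $A^T A$, take square roots to get the singular values, and normalize $Av_i$ to get $u_i$. The sketch below assumes NumPy and a hypothetical 4×3 matrix of full column rank, so that every $\sigma_i$ is positive and the division is safe.

```python
import numpy as np

# Hypothetical m x n matrix (m = 4, n = 3) with full column rank, so r = n.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

lam, V = np.linalg.eigh(A.T @ A)             # eigenvalues (ascending) and orthonormal eigenvectors
order = np.argsort(lam)[::-1]                # sort so that lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]

sigma = np.sqrt(np.clip(lam, 0.0, None))     # singular values sigma_j = sqrt(lambda_j)
lengths = np.linalg.norm(A @ V, axis=0)      # |A v_j| for every column v_j

U_r = (A @ V) / sigma                        # u_i = A v_i / sigma_i (column-wise division)
A_rebuilt = U_r @ np.diag(sigma) @ V.T       # U Sigma V^T

print(sigma, np.allclose(A_rebuilt, A))
```

The sketch stops at the r columns of U that span Col A; extending them to a full orthonormal basis of $R^m$ only adds columns that are multiplied by zero blocks of Σ.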
ML-54
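LSA Derivations (3/7)
[Figure: A (m×n) maps $R^n$ to $R^m$; V is split into $V_1$ and $V_2$, and U into $U_1$ and $U_2$]
$$A = U \Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
$$U \Sigma = A V$$
– $v_i$ spans the row space of $A$
– $u_i$ spans the row space of $A^T$
– $AX = 0$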
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
    $$A = U \Sigma V^T \;\Rightarrow\; A V = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = A V$$
    • The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by $U$
    $$A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U$$
    • The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A), m×n, with rows $w_1, w_2, \ldots, w_m$ (terms) and columns $d_1, d_2, \ldots, d_n$ (docs)
    • Compare two terms (e.g., $w_i$ and $w_j$) → dot product of two rows of A, or an entry in $A A^T$
    • Compare two docs (e.g., $d_k$ and $d_s$) → dot product of two columns of A, or an entry in $A^T A$
    • Compare a term and a doc → each individual entry of A
  – The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • Compare two terms → dot product of two rows of $U' \Sigma'$
      $$A' A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U' \Sigma')(U' \Sigma')^T$$
    • Compare two docs → dot product of two rows of $V' \Sigma'$
      $$A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T$$
    • Compare a query word and a doc → each individual entry of A'
  ($\Sigma'$ is for stretching or shrinking; $V'^T V'$ and $U'^T U'$ reduce to the identity matrix)
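The two identities for term-term and doc-doc comparison can be checked directly. The following sketch assumes NumPy; the 5×4 count matrix and k = 2 are hypothetical.

```python
import numpy as np

# Hypothetical 5-term x 4-doc count matrix (rows: terms, columns: docs).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k].T      # U' (m x k), Sigma' (k x k), V' (n x k)
A_k = Uk @ Sk @ Vk.T                                 # the new word-document matrix A'

term_vectors = Uk @ Sk                               # one row per term
doc_vectors = Vk @ Sk                                # one row per doc

term_term = term_vectors @ term_vectors.T            # all pairwise term comparisons
doc_doc = doc_vectors @ doc_vectors.T                # all pairwise doc comparisons

print(np.allclose(term_term, A_k @ A_k.T), np.allclose(doc_doc, A_k.T @ A_k))
```

Working with the k-dimensional rows of $U'\Sigma'$ and $V'\Sigma'$ gives the same comparisons as the full matrix A' while storing far fewer numbers per term or doc.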
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
      $$\hat{q}_{1 \times k} = \left( q^T \right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
      – The query is represented by the weighted sum of its constituent term vectors
      – The separate dimensions are differentially weighted
      – The result is just like a row of V
  – Cosine measure between the query and doc vectors in the latent semantic space
    $$\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\lVert \hat{q} \Sigma \rVert\, \lVert \hat{d} \Sigma \rVert}$$
    ($\hat{q}$ and $\hat{d}$ are row vectors)
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $$\hat{t}_{1 \times k} = t_{1 \times n} V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
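Fold-in and the cosine measure can be sketched together, assuming NumPy. The function names `fold_in_query`, `fold_in_term`, and `latent_cosine`, the count matrix, and the example query are all hypothetical. A useful sanity check is that folding in a column of A that was part of the original analysis reproduces the corresponding row of V.

```python
import numpy as np

# Hypothetical 5-term x 4-doc count matrix (rows: terms, columns: docs).
A = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0]])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T       # rows of Vk are the doc vectors d_hat

def fold_in_query(q, Uk, sk):
    """q_hat (1 x k) = q^T U_k Sigma_k^{-1} for an m-dimensional query (or doc) vector q."""
    return (q @ Uk) / sk

def fold_in_term(t, Vk, sk):
    """t_hat (1 x k) = t V_k Sigma_k^{-1} for an n-dimensional term vector t."""
    return (t @ Vk) / sk

def latent_cosine(a_hat, b_hat, sk):
    """Cosine between two folded-in row vectors after scaling each axis by Sigma_k."""
    a, b = a_hat * sk, b_hat * sk
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 1.0, 0.0, 0.0, 0.0])  # a query containing terms 0 and 1
q_hat = fold_in_query(query, Uk, sk)
scores = [latent_cosine(q_hat, d_hat, sk) for d_hat in Vk]
print(scores)                                 # one similarity score per doc
```

Ranking the docs by these scores gives the retrieval order in the latent semantic space.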
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (rows: terms; columns: docs)

        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0

  – Each entry is weighted by its TF×IDF score
  – The sparse input file lists the nonzero entries column by column:

        4 3 6     (no. of rows, no. of columns, no. of nonzero entries)
        2         (2 nonzero entries at Col 0)
        0 2.3     (Col 0, Row 0)
        2 3.8     (Col 0, Row 2)
        1         (1 nonzero entry at Col 1)
        1 1.3     (Col 1, Row 1)
        3         (3 nonzero entries at Col 2)
        0 4.2     (Col 2, Row 0)
        1 2.2     (Col 2, Row 1)
        2 0.5     (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
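The column-wise sparse layout shown on this slide can be produced from a dense matrix with a short helper. This is a sketch using only the Python standard library; the function name `to_sparse_text` is hypothetical, and the layout follows the slide's example rather than any format documentation.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) into the column-wise sparse text layout:
    a header 'rows cols nonzeros', then for each column its nonzero count
    followed by one 'row value' line per nonzero entry."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0.0]
        columns.append(entries)
    nonzeros = sum(len(entries) for entries in columns)

    lines = [f"{n_rows} {n_cols} {nonzeros}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {value}" for r, value in entries)
    return "\n".join(lines) + "\n"

# The 4-term x 3-doc example from the slide.
term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]

print(to_sparse_text(term_doc), end="")
```

Running the sketch on the slide's matrix reproduces the ten lines listed above, starting with the header `4 3 6`.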
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (sparse input)

        51253 2265 218852     (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……

• SVD command (IR_svd.bat)

        svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output files: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1×100)

        100 51253
        0.003 0.001 ……
        0.002 0.002 ……

• LSA100-S (100 eigenvalues)

        100
        2686.18
        829.941
        559.59
        …

• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1×100)

        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
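Reading the output files back is straightforward if they follow the layout shown on this slide: a header with the number of rows and columns, followed by the values row by row. The sketch below uses only the Python standard library and makes that layout assumption explicit; the function names and the tiny sample text are hypothetical.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows * cols whitespace-separated values
    (the layout assumed here for the -Ut and -Vt files)."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(token) for token in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def item_vector(matrix, index):
    """Vector of one word (from Ut) or one doc (from Vt): a column of the parsed matrix."""
    return [row[index] for row in matrix]

# Hypothetical miniature Ut file: 2 latent dimensions, 3 words.
sample_ut = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.005\n"

Ut = parse_dense_text(sample_ut)
print(item_vector(Ut, 0))    # latent vector of word 0
```

Because the files store the transposed matrices, each word or doc vector is a column of the parsed array, which is why the helper reads across rows.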
ML-65
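LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
  $$\hat{q}_{1 \times k} = \left( q^T \right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – The result is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
  $$\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\lVert \hat{q} \Sigma \rVert\, \lVert \hat{d} \Sigma \rVert}$$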
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-15
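PCA Derivations (9/13)
• Minimum Mean-Squared Error Criterion
- It can be proved that $\varepsilon_{eigen}(m)$ is the optimal solution under the mean-squared error criterion
- To be minimized: $\varepsilon(m) = \sum_{j=m+1}^{n} \varphi_j^T R \varphi_j$
- Constraints: $\varphi_j^T \varphi_k = \delta_{jk}$, where $\delta_{jk} = 1$ if $j = k$ and $0$ if $j \ne k$
- Define the objective function
$$J = \sum_{j=m+1}^{n} \varphi_j^T R \varphi_j - \sum_{j=m+1}^{n} \sum_{k=m+1}^{n} \mu_{jk} \left( \varphi_j^T \varphi_k - \delta_{jk} \right)$$
- Take the derivative, using $\partial (\varphi^T R \varphi) / \partial \varphi = 2 R \varphi$
$$\frac{\partial J}{\partial \varphi_j} = 2 R \varphi_j - 2 \sum_{k=m+1}^{n} \mu_{jk} \varphi_k = 0, \quad \forall\, m+1 \le j \le n$$
$$\Rightarrow R \varphi_j = \Phi_{n-m} \mu_j, \quad \text{where } \Phi_{n-m} = [\varphi_{m+1} \cdots \varphi_n],\ \mu_j = [\mu_{j,m+1} \cdots \mu_{j,n}]^T$$
$$\Rightarrow R \Phi_{n-m} = \Phi_{n-m} U_{n-m}, \quad \text{where } U_{n-m} = [\mu_{m+1} \cdots \mu_n]$$
- There is a particular solution if $U_{n-m}$ is a diagonal matrix whose diagonal elements are the eigenvalues $\lambda_{m+1}, \ldots, \lambda_n$ of $R$, and $\varphi_{m+1}, \ldots, \varphi_n$ are their corresponding eigenvectors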
ML-16
PCA Derivations (10/13)
• Given an input vector x with dimension n
- Try to construct a linear transform Φ' (Φ' is an n×m matrix, m < n) such that the truncation result Φ'^T x is optimal under the mean-squared error criterion
- Encoder: $y = \Phi'^T x$, where $\Phi' = [e_1\ e_2\ \cdots\ e_m]$, $x = [x_1\ x_2\ \cdots\ x_n]^T$ and $y = [y_1\ y_2\ \cdots\ y_m]^T$
- Decoder: $\hat{x} = \Phi' y$, where $\hat{x} = [\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n]^T$
- Criterion: minimize $E\left[ (x - \hat{x})^T (x - \hat{x}) \right]$
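The encoder/decoder pair on this slide can be sketched in plain Python for a tiny case with m = 1. This is only an illustration under stated assumptions: the helper names (`covariance`, `top_eigenvector`, `encode`, `decode`) and the toy data are hypothetical, and power iteration is swapped in here as a simple way to obtain the leading eigenvector e_1 of R; the slides do not prescribe any particular eigen-solver.

```python
import math

def covariance(samples):
    """Covariance matrix R of zero-mean samples (a list of n-dim lists)."""
    n, N = len(samples[0]), len(samples)
    return [[sum(x[a] * x[b] for x in samples) / N for b in range(n)]
            for a in range(n)]

def top_eigenvector(R, iters=200):
    """Leading unit eigenvector of a symmetric PSD matrix (power iteration)."""
    n = len(R)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(R[a][b] * v[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def encode(basis, x):
    """y = Phi'^T x, where basis holds the columns e_1..e_m of Phi'."""
    return [sum(e[a] * x[a] for a in range(len(x))) for e in basis]

def decode(basis, y):
    """x_hat = Phi' y."""
    n = len(basis[0])
    return [sum(y[k] * basis[k][a] for k in range(len(basis)))
            for a in range(n)]

# Hypothetical zero-mean data lying close to the direction (1, 1).
data = [[2.0, 2.1], [-2.0, -1.9], [1.0, 0.9], [-1.0, -1.1]]
R = covariance(data)
e1 = top_eigenvector(R)
y = encode([e1], data[0])      # 1-dimensional code for the first sample
x_hat = decode([e1], y)        # reconstruction from the code
err = sum((a - b) ** 2 for a, b in zip(data[0], x_hat))
```

Because the toy data are almost one-dimensional, keeping a single eigenvector already gives a small squared reconstruction error `err`.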
ML-17
PCA Derivations (11/13)
• Data compression in communication
- PCA is an optimal transform for signal representation and dimensionality reduction, but it is not necessarily optimal for classification tasks such as speech recognition (to be discussed later on)
- PCA needs no prior information (e.g., class distributions of output information) about the sample patterns
ML-18
PCA Derivations (12/13)
• Scree Graph
- The plot of variance as a function of the number of eigenvectors kept
• Select m such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
• Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i$$
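Both selection rules are easy to state in code. The sketch below is a hedged illustration: the function names and the eigenvalues are hypothetical, and the threshold value is an arbitrary choice, not one recommended by the slides.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues reach the given variance ratio."""
    lam = sorted(eigenvalues, reverse=True)
    total = sum(lam)
    running = 0.0
    for m, value in enumerate(lam, start=1):
        running += value
        if running / total >= threshold:
            return m
    return len(lam)

def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    mean = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for value in eigenvalues if value >= mean)

lam = [4.0, 3.0, 2.0, 1.0]              # hypothetical eigenvalues
m_ratio = choose_m_by_ratio(lam, 0.85)  # cumulative ratios: 0.4, 0.7, 0.9, 1.0
m_avg = choose_m_by_average(lam)        # average eigenvalue is 2.5
```

The two rules need not agree: here the ratio rule keeps three eigenvectors and the average rule keeps two.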
ML-19
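PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of average between-class variation and average within-class variation is maximal
$$J(W) = |\tilde{S}| = |\tilde{S}_w + \tilde{S}_b| = |W^T S_w W + W^T S_b W|$$
$$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W$$
$$S_w = \frac{1}{N} \sum_j N_j \Sigma_j$$
$$S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
$$S = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$$
- i: sample index; j: class index
- Try to show that $S = S_w + S_b$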
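Before proving the identity S = S_w + S_b, it can be checked numerically. The following sketch does this on made-up 2-D data; the function names and the sample values are assumptions for illustration, and a numerical check is of course not a substitute for the proof the slide asks for.

```python
def mean(vectors):
    n = len(vectors[0])
    return [sum(v[a] for v in vectors) / len(vectors) for a in range(n)]

def scatter(vectors, center):
    """Sum over vectors of (x - center)(x - center)^T, not divided by N."""
    n = len(center)
    S = [[0.0] * n for _ in range(n)]
    for x in vectors:
        d = [x[a] - center[a] for a in range(n)]
        for a in range(n):
            for b in range(n):
                S[a][b] += d[a] * d[b]
    return S

def scatter_matrices(classes):
    """Return (S, S_w, S_b) for a dict {label: list of sample vectors}."""
    everything = [x for xs in classes.values() for x in xs]
    N, n = len(everything), len(everything[0])
    xbar = mean(everything)
    S = [[v / N for v in row] for row in scatter(everything, xbar)]
    Sw = [[0.0] * n for _ in range(n)]
    Sb = [[0.0] * n for _ in range(n)]
    for xs in classes.values():
        xj = mean(xs)
        within = scatter(xs, xj)        # equals N_j * Sigma_j
        between = scatter([xj], xbar)   # (xbar_j - xbar)(xbar_j - xbar)^T
        for a in range(n):
            for b in range(n):
                Sw[a][b] += within[a][b] / N
                Sb[a][b] += len(xs) * between[a][b] / N
    return S, Sw, Sb

# Hypothetical two-class data.
classes = {"a": [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],
           "b": [[6.0, 5.0], [7.0, 8.0]]}
S, Sw, Sb = scatter_matrices(classes)
max_gap = max(abs(S[a][b] - (Sw[a][b] + Sb[a][b]))
              for a in range(2) for b in range(2))
```

`max_gap` should be zero up to floating-point rounding.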
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for old feature dimensions; new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: a 256×256 image divided into 8×8 blocks]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure panels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland 1991)
- Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
- Steps
• Convert each of the L reference images into a vector of floating point numbers representing the light intensity in each pixel
$$x_1 = [x_{11}\ x_{21}\ \cdots\ x_{n1}]^T, \quad x_2 = [x_{12}\ x_{22}\ \cdots\ x_{n2}]^T, \quad \ldots, \quad x_L = [x_{1L}\ x_{2L}\ \cdots\ x_{nL}]^T$$
• Calculate the covariance/correlation matrix between these reference vectors
• Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
• Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
- Steps
• The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{x}_i = \bar{x} + w_{i,1} e(1) + w_{i,2} e(2) + \cdots + w_{i,K} e(K) \ \Rightarrow\ y_i = [w_{i,1}\ w_{i,2}\ \cdots\ w_{i,K}]$$
where $y_i$ is the feature vector of person i
- The Eigenface approach satisfies the minimum mean-squared error criterion
- Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images as the training set; the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces); a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL 2000)
- Steps
• Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
• SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 data ... Speaker R data, together with the SI HMM, go through model training to give Speaker 1 HMM ... Speaker R HMM; the resulting supervectors of dimension D = (M·n)×1 are passed to Principal Component Analysis]
- Each new speaker S is represented by a point P in K-space
$$P_i = e(0) + w_{i,1} e(1) + w_{i,2} e(2) + \cdots + w_{i,K} e(K)$$
where $e(0)$ is the SI HMM model
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
- Dimension 1 (eigenvoice 1)
• Correlates with pitch or sex
- Dimension 2 (eigenvoice 2)
• Correlates with amplitude
- Dimension 3 (eigenvoice 3)
• Correlates with second-formant movement
Note that Eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
- Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
- The sample mean is
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
- The class sample means are
$$\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i) = j} x_i$$
- The class sample covariances are
$$\Sigma_j = \frac{1}{N_j} \sum_{g(x_i) = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
- The average within-class variation before transform
$$S_w = \frac{1}{N} \sum_j N_j \Sigma_j$$
- The average between-class variation before transform
$$S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied (x → y, $S_w \to \tilde{S}_w$, $S_b \to \tilde{S}_b$)
- The sample vectors will be
$$y_i = W^T x_i$$
- The sample mean will be
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$$
- The class sample means will be
$$\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i) = j} W^T x_i = W^T \bar{x}_j$$
- The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N} \sum_j N_j \cdot \left\{ \frac{1}{N_j} \sum_{g(x_i) = j} (W^T x_i - W^T \bar{x}_j)(W^T x_i - W^T \bar{x}_j)^T \right\} = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W$$
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
- Similarly, the average between-class variation will be
$$\tilde{S}_b = W^T S_b W$$
- Try to find the optimal $W$ such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$
• A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_b w_i = \lambda_i S_w w_i$$
• That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
$$S_w^{-1} S_b w_i = \lambda_i w_i$$
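For the two-class case the closed-form solution simplifies: S_b is then proportional to the rank-one matrix (x̄_1 − x̄_2)(x̄_1 − x̄_2)^T, so the only eigenvector of S_w^{-1} S_b with a nonzero eigenvalue is proportional to S_w^{-1}(x̄_1 − x̄_2). The sketch below uses that fact for 2-D samples; the function names and the data are hypothetical, and the explicit 2×2 inverse is just a convenience for this small example.

```python
import math

def mean(vectors):
    return [sum(v[a] for v in vectors) / len(vectors) for a in range(2)]

def fisher_direction(class_a, class_b):
    """Unit vector proportional to S_w^{-1} (mean_a - mean_b), 2-D samples."""
    N = len(class_a) + len(class_b)
    Sw = [[0.0, 0.0], [0.0, 0.0]]
    for xs in (class_a, class_b):
        m = mean(xs)
        for x in xs:
            d = (x[0] - m[0], x[1] - m[1])
            for a in range(2):
                for b in range(2):
                    Sw[a][b] += d[a] * d[b] / N   # S_w = (1/N) sum_j N_j Sigma_j
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    inv = [[Sw[1][1] / det, -Sw[0][1] / det],
           [-Sw[1][0] / det, Sw[0][0] / det]]
    ma, mb = mean(class_a), mean(class_b)
    diff = (ma[0] - mb[0], ma[1] - mb[1])
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    norm = math.hypot(w[0], w[1])
    return [w[0] / norm, w[1] / norm]

# Hypothetical classes: large spread along axis 0, class difference on axis 1.
class_a = [[1.0, 1.0], [2.0, 1.2], [3.0, 0.8], [4.0, 1.0]]
class_b = [[1.0, 2.0], [2.0, 2.2], [3.0, 1.8], [4.0, 2.0]]
w = fisher_direction(class_a, class_b)
proj_a = [w[0] * x[0] + w[1] * x[1] for x in class_a]
proj_b = [w[0] * x[0] + w[1] * x[1] for x in class_b]
```

On this data the direction of largest variance is axis 0, which PCA would favor, whereas the LDA direction is almost entirely along axis 1 and separates the two projected classes.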
ML-36
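LDA Derivations (4/4)
• Proof
$$\hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|} \quad (|\cdot| \text{ denotes the determinant})$$
Or, for each column vector $w_i$ of $W$, we want to find
$$\hat{w}_i = \arg\max_{w_i} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}, \qquad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$$
The quadratic form has an optimal solution where the derivative vanishes. Using $\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(x^T C x)}{dx} = (C + C^T) x$:
$$\frac{\partial}{\partial w_i} \left( \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \right) = 0$$
$$\Rightarrow \frac{2 S_b w_i (w_i^T S_w w_i) - 2 S_w w_i (w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$$
$$\Rightarrow \frac{S_b w_i}{w_i^T S_w w_i} - \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \cdot \frac{S_w w_i}{w_i^T S_w w_i} = 0$$
$$\Rightarrow S_b w_i - \lambda_i S_w w_i = 0$$
$$\Rightarrow S_b w_i = \lambda_i S_w w_i \ \Rightarrow\ S_w^{-1} S_b w_i = \lambda_i w_i$$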
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$$
[Figure: covariance matrix of the 18-cepstral vectors (after Cosine Transform), calculated using Year-99's 5471 files]
$$\Sigma_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA Transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA Transform), calculated using Year-99's 5471 files]
Character Error Rate
        TC      WG
MFCC    26.32   22.71
LDA-1   23.12   20.17
LDA-2   23.11   20.11
ML-39
PCA vs LDA (1/2)
[Figure panels: PCA, LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
- Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
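HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default');
surf(CoVar);
• Eigen Decomposition
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % between-class matrix
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % within-class matrix
% LDA
IWI = inv(WI);
A = IWI * BE;
% PCA
A = BE + WI;            % why? (Prove it)
[V, D] = eig(A);        % all eigenvectors and eigenvalues
[V, D] = eigs(A, 3);    % the 3 largest-magnitude eigenvalues
fid = fopen('Basis', 'w');
for i = 1:3             % feature vector length
    for j = 1:3         % basis number
        fprintf(fid, '%10.10f ', V(i, j));
    end
    fprintf(fid, '\n');
end
fclose(fid);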
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: scatter plot, feature 1 vs. feature 2. Distribution of the 2000 original data samples after the PCA transform]
[Figure: scatter plot, feature 1 vs. feature 2. Distribution of the 2000 original data samples after the LDA transform]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
- Co-occurring terms are projected onto the same dimensions
- In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
- Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
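LSA (2/7)
• Dimension Reduction and Feature Extraction
- PCA: project $X$ from the n-dimensional feature space onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$
$$y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \| X - \hat{X} \|^2 \text{ for a given } k$$
- SVD (in LSA): approximate $A$ in a k-dimensional latent semantic space
$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n},\ r \le \min(m, n); \qquad A'_{m \times n} = U'_{m \times k}\, \Sigma'_{k \times k}\, V'^T_{k \times n}; \qquad \min \| A - A' \|_F^2 \text{ for a given } k$$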
ML-47
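LSA (3/7)
- Singular Value Decomposition (SVD) used for the word-document matrix
• A least-squares method for dimension reduction
Projection of a vector x onto $\varphi_1$ (θ_1 is the angle between x and $\varphi_1$):
$$y_1 = \| x \| \cos\theta_1 = \| x \| \frac{\varphi_1^T x}{\| \varphi_1 \| \, \| x \|} = \varphi_1^T x, \quad \text{where } \| \varphi_1 \| = 1$$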
ML-48
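LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each mapped to terms and to a structure model. Doc terms and query terms are compared by literal term matching, optionally after doc expansion or query expansion; the two structure models are compared by latent semantic structure retrieval]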
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n)$$
$$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}, \qquad k \le r, \quad \| A \|_F^2 \ge \| A' \|_F^2$$
- The rows of A and A' correspond to the words $w_1, w_2, \ldots, w_m$ and the columns to the docs $d_1, d_2, \ldots, d_n$
- Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
- $\| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
- Row A ∈ $R^n$, Col A ∈ $R^m$
- Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
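LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
- $A^T A$ is a symmetric n×n matrix
• All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
• All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$, so that $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
- As the square roots of the eigenvalues of $A^T A$
- As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \| Av_1 \|$, $\sigma_2 = \| Av_2 \|$, ...
$$\| Av_i \|^2 = (Av_i)^T Av_i = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i \ \Rightarrow\ \| Av_i \| = \sigma_i$$
- For $\lambda_i \ne 0$, i = 1, ..., r, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A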
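The two characterizations of the singular values can be checked on a small matrix. The sketch below uses a hypothetical 2×2 matrix A and a closed-form eigen-decomposition for symmetric 2×2 matrices (the helper names are assumptions for illustration); it compares sqrt(λ_i) with ||A v_i|| and also checks that the squared Frobenius norm equals the sum of squared singular values.

```python
import math

def gram(A):
    """A^T A for a matrix given as a list of rows."""
    n = len(A[0])
    return [[sum(row[a] * row[b] for row in A) for b in range(n)]
            for a in range(n)]

def eig_sym_2x2(M):
    """(eigenvalue, unit eigenvector) pairs of a symmetric 2x2 matrix,
    sorted by descending eigenvalue."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    if b == 0:
        pairs = [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    center = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    pairs = []
    for lam in (center + radius, center - radius):
        # (b, lam - a) solves (M - lam I) v = 0 when b != 0.
        norm = math.hypot(b, lam - a)
        pairs.append((lam, (b / norm, (lam - a) / norm)))
    return pairs

A = [[3.0, 0.0], [4.0, 5.0]]                 # hypothetical matrix
pairs = eig_sym_2x2(gram(A))                 # eigen-pairs of A^T A
singular_values = [math.sqrt(lam) for lam, _ in pairs]
lengths = []
for _, v in pairs:
    Av = [sum(row[k] * v[k] for k in range(2)) for row in A]
    lengths.append(math.hypot(Av[0], Av[1]))  # ||A v_i||
frobenius_sq = sum(x * x for row in A for x in row)
```

For this A, A^T A = [[25, 20], [20, 25]] has eigenvalues 45 and 5, so the singular values are sqrt(45) and sqrt(5), and 45 + 5 matches the squared Frobenius norm 9 + 16 + 25.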
ML-53
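LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
$$Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$$
- Suppose that A (or $A^T A$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
- Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
$$u_i = \frac{1}{\| Av_i \|} Av_i = \frac{1}{\sigma_i} Av_i \ \Rightarrow\ \sigma_i u_i = Av_i \ \Rightarrow\ [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r]$$
Here V is an orthonormal matrix (n×r) that is known in advance, and U is also an orthonormal matrix (m×r)
• Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
$$[u_1\ u_2\ \cdots\ u_r\ \cdots\ u_m]\, \Sigma = A\, [v_1\ v_2\ \cdots\ v_r\ \cdots\ v_n]$$
$$\Rightarrow U \Sigma = AV \ \Rightarrow\ U \Sigma V^T = A V V^T \ \Rightarrow\ A = U \Sigma V^T \quad (V V^T = I_{n \times n})$$
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$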
ML-54
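LSA Derivations (3/7)
[Figure: $A_{m \times n}$ maps $R^n$ to $R^m$. $R^n$ is split into the subspaces spanned by $V_1$ and $V_2$, and $R^m$ into those spanned by $U_1$ and $U_2$; $V_2$ spans the solutions of $AX = 0$; the $v_i$ span the row space of $A$; the $u_i$ span the row space of $A^T$; $U \Sigma = AV$]
$$A = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$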
ML-55
LSA Derivations (4/7)
• Additional Explanations
- Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
$$A = U \Sigma V^T \ \Rightarrow\ AV = U \Sigma V^T V = U \Sigma \ \Rightarrow\ U \Sigma = AV$$
• The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
- Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
$$A = U \Sigma V^T \ \Rightarrow\ A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma \ \Rightarrow\ V \Sigma = A^T U$$
• The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
- The original word-document matrix (A), with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$
• compare two terms → dot product of two rows of A, or an entry in $AA^T$
• compare two docs → dot product of two columns of A, or an entry in $A^T A$
• compare a term and a doc → each individual entry of A
- The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
• compare two terms → dot product of two rows of U'Σ'
$$A'A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U' \Sigma')(U' \Sigma')^T$$
• compare two docs → dot product of two rows of V'Σ'
$$A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T$$
• compare a query word and a doc → each individual entry of A'
- In both products the inner factor ($V'^T V'$ or $U'^T U'$) is an identity matrix, and Σ' acts as a stretching or shrinking of the axes
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
- For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m×1 query (or doc) vector
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
- The query is represented by the weighted sum of its constituent term vectors
- The separate dimensions are differentially weighted
- $\hat{q}$ is just like a row of V
- Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{| \hat{q} \Sigma | \, | \hat{d} \Sigma |}$$
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
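The fold-in and cosine formulas can be traced by hand on a very small decomposition. The sketch below is only an illustration: the function names are hypothetical, and the matrices U, Σ and V are chosen (not computed from real data) so that A = UΣV^T has 3 terms, 2 docs and k = 2.

```python
import math

def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}; q has m entries, U is m x k (list of rows),
    sigma lists the k singular values on the diagonal of Sigma."""
    return [sum(q[i] * U[i][c] for i in range(len(q))) / sigma[c]
            for c in range(len(sigma))]

def similarity(q_hat, d_hat, sigma):
    """cosine(q_hat Sigma, d_hat Sigma)."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [a * s for a, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    return dot / (math.sqrt(sum(a * a for a in qs)) *
                  math.sqrt(sum(b * b for b in ds)))

# Hypothetical decomposition: A = U Sigma V^T = [[2, 0], [0, 1], [0, 0]].
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma = [2.0, 1.0]
V = [[1.0, 0.0], [0.0, 1.0]]

d1_hat = fold_in([2.0, 0.0, 0.0], U, sigma)   # first column of A
q_hat = fold_in([1.0, 1.0, 0.0], U, sigma)    # query using terms 1 and 2
score = similarity(q_hat, d1_hat, sigma)
```

Folding in a column of A gives back the corresponding row of V (`d1_hat` equals `V[0]`), which is the sense in which a folded-in vector is "just like a row of V".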
ML-59
LSA Example
• Experimental results
- HMM is consistently better than VSM at all recall levels
- LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
- A clean formal framework and a clearly defined optimization criterion (least-squares)
• Conceptual simplicity and clarity
- Handles synonymy problems ("heterogeneous vocabulary")
- Good results for high-recall search
• Takes term co-occurrence into account
• Disadvantages
- High computational complexity
- LSA offers only a partial solution to polysemy
• E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
- E.g., 4 terms and 3 docs (rows: terms; columns: docs)
2.3 0.0 4.2
0.0 1.3 2.2
3.8 0.0 0.5
0.0 0.0 0.0
- Each entry is weighted by its TFxIDF score
- The corresponding sparse input file (annotations in parentheses are not part of the file)
4 3 6      (no. of rows/terms, no. of columns/docs, no. of nonzero entries)
2          (2 nonzero entries at Col 0)
0 2.3      (Col 0, Row 0)
2 3.8      (Col 0, Row 2)
1          (1 nonzero entry at Col 1)
1 1.3      (Col 1, Row 1)
3          (3 nonzero entries at Col 2)
0 4.2      (Col 2, Row 0)
1 2.2      (Col 2, Row 1)
2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, ..., 600, etc.) of LSA dimensionality
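The column-by-column layout shown in the example can be produced from a dense matrix with a few lines of Python. This is a minimal sketch that follows the layout of the slide's example only; the function name `to_sparse_text` is hypothetical, and the authoritative description of the input format is the SVDLIBC documentation.

```python
def to_sparse_text(matrix):
    """Serialize a dense term-by-doc matrix (list of rows) column by column:
    a header 'rows cols nonzeros', then for each column its number of
    nonzero entries followed by one 'row value' line per entry."""
    rows, cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(cols):
        entries = [(r, matrix[r][c]) for r in range(rows) if matrix[r][c] != 0]
        columns.append(entries)
    nonzeros = sum(len(entries) for entries in columns)
    lines = ["%d %d %d" % (rows, cols, nonzeros)]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend("%d %s" % (r, value) for r, value in entries)
    return "\n".join(lines)

# The 4-term, 3-doc example from the slide.
example = [[2.3, 0.0, 4.2],
           [0.0, 1.3, 2.2],
           [3.8, 0.0, 0.5],
           [0.0, 0.0, 0.0]]
text = to_sparse_text(example)
```

The resulting `text` has ten lines, starting with the header `4 3 6`, and reproduces the file listed on the slide.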
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (first line: no. of indexing terms, no. of docs, no. of nonzero entries)
51253 2265 218852
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
......
• SVD command (IR_svd.bat)
svd -r st -o LSA100 -d 100 Term-Doc-Matrix
- -r st: sparse matrix input
- -o LSA100: prefix of output files
- -d 100: no. of reserved eigenvectors
- Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
100 51253
0.003 0.001 ......
0.002 0.002 ......
- word vector (u^T): 1×100; 51253 words
• LSA100-S
100
2686.18
829.941
559.59
...
- 100 eigenvalues
• LSA100-Vt
100 2265
0.021 0.035 ......
0.012 0.022 ......
- doc vector (v^T): 1×100; 2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
- The query is represented by the weighted sum of its constituent term vectors
- The separate dimensions are differentially weighted
- $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{| \hat{q} \Sigma | \, | \hat{d} \Sigma |}$$
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-16
PCA Derivations (10/13)
• Given an input vector x with dimension n
  – Try to construct a linear transform Φ′ (Φ′ is an n×m matrix, m < n) such that the truncation result Φ′ᵀx is optimal under the mean-squared error criterion
Encoder: $y = \Phi'^T x$, where $x = [x_1\ x_2\ \cdots\ x_n]^T$, $y = [y_1\ y_2\ \cdots\ y_m]^T$, and $\Phi' = [e_1\ e_2\ \cdots\ e_m]$
Decoder: $\hat{x} = \Phi' y$, where $\hat{x} = [\hat{x}_1\ \hat{x}_2\ \cdots\ \hat{x}_n]^T$
Criterion: minimize $E\left[(x - \hat{x})^T (x - \hat{x})\right]$
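The encoder/decoder pair can be illustrated with a minimal sketch for the smallest possible case (n = 2, m = 1). Everything here is an assumption made for illustration: the sample points are made up, the function names are hypothetical, and the closed-form eigen decomposition only works for a symmetric 2×2 covariance matrix. The sketch also checks that the mean-squared reconstruction error equals the discarded eigenvalue.

```python
import math

def top_eigpair_2x2(a, b, d):
    """Largest eigenvalue and unit eigenvector of the symmetric matrix [[a, b], [b, d]]."""
    center = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    lam = center + radius
    if b == 0.0:
        # Already diagonal: the eigenvectors are the coordinate axes.
        vec = (1.0, 0.0) if a >= d else (0.0, 1.0)
    else:
        length = math.hypot(b, lam - a)
        vec = (b / length, (lam - a) / length)
    return lam, vec

def pca_encode_decode(samples):
    """Project 2-D samples onto the first principal axis and reconstruct them."""
    n = len(samples)
    mean = (sum(p[0] for p in samples) / n, sum(p[1] for p in samples) / n)
    centered = [(p[0] - mean[0], p[1] - mean[1]) for p in samples]
    # Covariance entries (1/N normalization, zero-mean data).
    a = sum(c[0] * c[0] for c in centered) / n
    b = sum(c[0] * c[1] for c in centered) / n
    d = sum(c[1] * c[1] for c in centered) / n
    lam1, e1 = top_eigpair_2x2(a, b, d)
    codes = [c[0] * e1[0] + c[1] * e1[1] for c in centered]   # encoder: y = e1^T x
    recon = [(y * e1[0], y * e1[1]) for y in codes]           # decoder: x_hat = e1 y
    mse = sum((c[0] - r[0]) ** 2 + (c[1] - r[1]) ** 2
              for c, r in zip(centered, recon)) / n
    discarded = (a + d) - lam1                                # the eigenvalue that was dropped
    return codes, mse, discarded

samples = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2), (3.0, 3.1), (-2.5, -2.0)]
codes, mse, discarded = pca_encode_decode(samples)
```

With m = 1 the encoder keeps one number per sample, and `mse` matches `discarded` up to rounding, which is the sense in which the truncation is optimal.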
ML-17
PCA Derivations (11/13)
• Data compression in communication
  – PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition
  – PCA needs no prior information (e.g., class distributions of output information) about the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (12/13)
• Scree Graph
  – The plot of variance as a function of the number of eigenvectors kept
    • Select m such that
      $\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$
    • Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
      $\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i$
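The two selection rules can be sketched in a few lines. The eigenvalues and the 0.85 threshold below are made-up values, and the function names are hypothetical; this is only one straightforward reading of the criteria.

```python
def choose_by_threshold(eigenvalues, threshold):
    """Smallest m whose leading eigenvalues cover at least `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, lam in enumerate(ordered, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(ordered)

def choose_by_average(eigenvalues):
    """Number of eigenvalues that are at least as large as the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam >= average)

eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]   # illustrative values only
m_threshold = choose_by_threshold(eigenvalues, 0.85)
m_average = choose_by_average(eigenvalues)
```

For these values the cumulative ratios are 0.5, 0.75 and 0.875, so the threshold rule keeps three eigenvectors, while the average rule (average eigenvalue 1.6) keeps two.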
ML-19
PCA Derivations (13/13)
• PCA finds a linear transform W such that the sum of the average between-class variation and the average within-class variation is maximal
  $J(W) = |\tilde{S}| = |\tilde{S}_w + \tilde{S}_b| = |W^T S_w W + W^T S_b W|$
  $\tilde{S}_w = W^T S_w W$
  $\tilde{S}_b = W^T S_b W$
  $S_w = \frac{1}{N} \sum_j N_j \Sigma_j$
  $S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
  $S = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$
  (i: sample index; j: class index)
• Try to show that $S = S_w + S_b$
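Before attempting the proof, the identity S = S_w + S_b can be checked numerically. The sketch below uses a made-up two-class, two-dimensional data set and hypothetical helper names; it is a sanity check under those assumptions, not a proof.

```python
def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mat_scale(a, s):
    return [[x * s for x in row] for row in a]

def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def scatter(vectors, center):
    """Sum of outer products (x - center)(x - center)^T over the given vectors."""
    dim = len(center)
    total = [[0.0] * dim for _ in range(dim)]
    for x in vectors:
        diff = [xk - ck for xk, ck in zip(x, center)]
        total = mat_add(total, outer(diff, diff))
    return total

classes = {
    "a": [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],
    "b": [[6.0, 5.0], [7.0, 8.0]],
}
all_samples = [x for xs in classes.values() for x in xs]
N = len(all_samples)
x_bar = mean(all_samples)

S = mat_scale(scatter(all_samples, x_bar), 1.0 / N)
S_w = [[0.0, 0.0], [0.0, 0.0]]
S_b = [[0.0, 0.0], [0.0, 0.0]]
for xs in classes.values():
    x_bar_j = mean(xs)
    # N_j * Sigma_j is the class scatter around the class mean.
    S_w = mat_add(S_w, scatter(xs, x_bar_j))
    diff = [a - b for a, b in zip(x_bar_j, x_bar)]
    S_b = mat_add(S_b, mat_scale(outer(diff, diff), len(xs)))
S_w = mat_scale(S_w, 1.0 / N)
S_b = mat_scale(S_b, 1.0 / N)

total = mat_add(S_w, S_b)
max_gap = max(abs(S[r][c] - total[r][c]) for r in range(2) for c in range(2))
```

`max_gap` stays at rounding-error level, which is consistent with the total scatter splitting into within-class and between-class parts.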
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure labels: correlation matrix for old feature dimensions; new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: image coding example; dimension labels 256 × 256 and 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure labels: value reduction; feature reduction]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or “eigenfaces”, derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity at each pixel
      $x_1 = [x_{11}\ x_{21}\ \cdots\ x_{n1}]^T,\quad x_2 = [x_{12}\ x_{22}\ \cdots\ x_{n2}]^T,\quad \ldots,\quad x_L = [x_{1L}\ x_{2L}\ \cdots\ x_{nL}]^T$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • In addition, the vector obtained by averaging all images is called “eigenface 0”. The other eigenfaces, from “eigenface 1” onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
      $\hat{x}_i = \bar{x} + w_{i,1}\, e(1) + w_{i,2}\, e(2) + \cdots + w_{i,K}\, e(K) \;\Rightarrow\; y_i = [w_{i,1}\ w_{i,2}\ \cdots\ w_{i,K}]$
      ($y_i$: feature vector of person i)
  – The eigenface approach adheres to the minimum mean-squared error criterion
  – Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the relevant parameters for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
    • Each new speaker S is represented by a point P in K-space
      $P_i = e(0) + w_{i,1}\, e(1) + w_{i,2}\, e(2) + \cdots + w_{i,K}\, e(K)$
[Diagram: eigenvoice space construction; labels: Speaker 1 data … Speaker R data, model training, SI HMM model, Speaker 1 HMM … Speaker R HMM, supervector dimension D = (M·n) × 1, Principal Component Analysis]
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates in the feature space, while eigenvoice operates in the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher’s Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, and each of them belongs to one of the J classes: $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is
    $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
  – The class sample means are
    $\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i) = j} x_i$
  – The class sample covariances are
    $\Sigma_j = \frac{1}{N_j} \sum_{g(x_i) = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$
  – The average within-class variation before the transform
    $S_w = \frac{1}{N} \sum_j N_j \Sigma_j$
  – The average between-class variation before the transform
    $S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – The sample vectors will be
    $y_i = W^T x_i$
  – The sample mean will be
    $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$
  – The class sample means will be
    $\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i) = j} W^T x_i = W^T \bar{x}_j$
  – The average within-class variation will be
    $\tilde{S}_w = \frac{1}{N} \sum_j N_j \left\{ \frac{1}{N_j} \sum_{g(x_i) = j} (W^T x_i - W^T \bar{x}_j)(W^T x_i - W^T \bar{x}_j)^T \right\} = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W$
[Diagram: x → W → y; $S_w \to \tilde{S}_w$; $S_b \to \tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – Similarly, the average between-class variation will be
    $\tilde{S}_b = W^T S_b W$
  – Try to find the optimal W such that the following objective function is maximized
    $J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$
    • A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
      $S_b w_i = \lambda_i S_w w_i$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
      $S_w^{-1} S_b w_i = \lambda_i w_i$
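For the two-class case, the closed-form solution can be sketched without an eigen solver. S_b then has rank one, so the single useful direction is proportional to S_w^{-1}(x̄_1 − x̄_2). The data and helper names below are made up for illustration, and the last lines only verify that this direction satisfies the eigen relation S_w^{-1} S_b w = λ w.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[0] for v in vectors) / n, sum(v[1] for v in vectors) / n]

def scatter(vectors, center):
    """Sum of outer products (x - center)(x - center)^T for 2-D vectors."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for x in vectors:
        d = [x[0] - center[0], x[1] - center[1]]
        for r in range(2):
            for c in range(2):
                s[r][c] += d[r] * d[c]
    return s

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]

def mat_vec(m, v):
    return [m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1]]

class_1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class_2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]]
samples = class_1 + class_2
N = len(samples)
x_bar, m1, m2 = mean(samples), mean(class_1), mean(class_2)

# Average within-class and between-class variation, as defined in the derivation.
sc1, sc2 = scatter(class_1, m1), scatter(class_2, m2)
S_w = [[(sc1[r][c] + sc2[r][c]) / N for c in range(2)] for r in range(2)]
S_b = [[0.0, 0.0], [0.0, 0.0]]
for xs, m in ((class_1, m1), (class_2, m2)):
    d = [m[0] - x_bar[0], m[1] - x_bar[1]]
    for r in range(2):
        for c in range(2):
            S_b[r][c] += len(xs) * d[r] * d[c] / N

S_w_inv = inverse_2x2(S_w)
w = mat_vec(S_w_inv, [m1[0] - m2[0], m1[1] - m2[1]])   # two-class Fisher direction

# Verify the eigen relation S_w^{-1} S_b w = lambda w.
lhs = mat_vec(S_w_inv, mat_vec(S_b, w))
lam = lhs[0] / w[0]
residual = max(abs(lhs[k] - lam * w[k]) for k in range(2))
```

With more than two classes, a general eigen solver applied to S_w^{-1} S_b would be needed instead of this shortcut.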
ML-36
LDA Derivations (4/4)
• Proof
  $\hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|}$ (|·|: determinant)
  Or, for each column vector $w_i$ of W, we want to find that
  $\hat{w}_i = \arg\max_{w_i} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$, where $\lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$
  The quadratic form has the optimal solution
  $\frac{\partial}{\partial w_i} \left( \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \right) = 0$
  $\Rightarrow \frac{2 S_b w_i (w_i^T S_w w_i) - 2 S_w w_i (w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$
  $\Rightarrow S_b w_i (w_i^T S_w w_i) - S_w w_i (w_i^T S_b w_i) = 0$
  $\Rightarrow S_b w_i - \lambda_i S_w w_i = 0$
  $\Rightarrow S_b w_i = \lambda_i S_w w_i$
  $\Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$
  Rules used: $\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(x^T C x)}{dx} = (C + C^T) x$
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99’s 5471 files: $\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$]
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), calculated using Year-99’s 5471 files: $\Sigma_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99’s 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99’s 5471 files]
Character Error Rate (%)
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: projections obtained by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA provides a much lower classification error than LDA, theoretically
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
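HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default');
surf(CoVar);
• Eigen Decomposition
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % between-class variation
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % within-class variation
% LDA
IWI = inv(WI);
A = IWI*BE;
% PCA
A = BE+WI;   % why? (Prove it)
[V,D] = eig(A);
[V,D] = eigs(A,3);
% Write the basis vectors to a file, one row per line
fid = fopen('Basis','w');
for i = 1:3   % feature vector length
    for j = 1:3   % basis number
        fprintf(fid,'%10.10f ',V(i,j));
    end
    fprintf(fid,'\n');
end
fclose(fid);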
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; axes: feature 1 vs. feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transformation; axes: feature 1 vs. feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with “latent” semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
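LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: from the feature space to a latent semantic space of dimension k, spanned by the orthonormal basis $\phi_1, \ldots, \phi_k$
    $y_i = \phi_i^T X$, $\hat{X} = \sum_{i=1}^{k} y_i \phi_i$
    $\min \|X - \hat{X}\|^2$ for a given k
  – SVD (in LSA): a latent semantic space of dimension k
    $A' = U' \Sigma' V'^T$, where A and A′ are m×n, U′ is m×r, Σ′ is r×r, V′ᵀ is r×n, r ≤ min(m, n), and the k×k part of Σ′ is kept
    $\min \|A - A'\|_F^2$ for a given k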
ML-47
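LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector x (onto the axes $\varphi_1$ and $\varphi_2$, giving $y_1$ and $y_2$):
$y_1 = \|x\| \cos\theta_1 = \|x\| \frac{\varphi_1^T x}{\|\varphi_1\|\, \|x\|} = \varphi_1^T x$, where $\|\varphi_1\| = 1$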
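The projection of a vector onto a unit-length axis can be computed either as an inner product or geometrically as the length of x times the cosine of the angle. The short sketch below, with made-up values for x and the axis (named `phi_1` here as an assumption), shows that the two agree.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

x = [3.0, 4.0]
phi_1 = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]   # unit-length axis

y_1 = dot(phi_1, x)                                     # projection as an inner product
cos_theta = dot(phi_1, x) / (norm(phi_1) * norm(x))
y_1_geometric = norm(x) * cos_theta                     # projection as |x| cos(theta)
```

The equality relies on the axis having unit length; for a non-normalized axis the inner product would be scaled by the axis length.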
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Diagram labels: Doc and Query, each with its terms and its structure model; doc expansion; query expansion; literal term matching (between terms); latent semantic structure retrieval (between structure models)]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: “human computer interaction”
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $A_{m \times n} = U_{m \times r}\, \Sigma_r\, V^T_{r \times n}$, with $\Sigma_r$ of size r×r (rows of A: words $w_1, w_2, \ldots, w_m$; columns of A: docs $d_1, d_2, \ldots, d_n$)
  $A'_{m \times n} = U'_{m \times k}\, \Sigma_k\, V'^T_{k \times n}$, with $\Sigma_k$ of size k×k
  – Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $k \le r$, $r \le \min(m, n)$, and $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row A ⊆ Rⁿ, Col A ⊆ Rᵐ
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
      $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, $\Sigma^2 = diag(\lambda_1, \lambda_2, \ldots, \lambda_n)$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$)
      $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$
    $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
    $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
ML-53
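LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
  $Av_i \cdot Av_j = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  – Suppose that A (or $A^T A$) has rank r ≤ n
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
    $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
    $\Rightarrow [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r]$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $[u_1\ u_2\ \cdots\ u_r\ \cdots\ u_m]\, \Sigma = A\, [v_1\ v_2\ \cdots\ v_r\ \cdots\ v_n]$
      $\Rightarrow U\Sigma = AV \Rightarrow U\Sigma V^T = AVV^T \Rightarrow A = U\Sigma V^T$ (using $VV^T = I_{n \times n}$)
      $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
  – $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$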
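The definition of singular values as square roots of the eigenvalues of AᵀA, and the identity between the squared Frobenius norm and the sum of squared singular values, can be checked on a small case. The 2×2 matrix below is a made-up example, and the closed-form eigenvalue formula is valid only for symmetric 2×2 matrices; a real term-document matrix would need a proper SVD routine.

```python
import math

A = [[3.0, 0.0], [4.0, 5.0]]   # illustrative matrix only

# Entries of the symmetric matrix A^T A = [[a, b], [b, d]].
a = A[0][0] ** 2 + A[1][0] ** 2
b = A[0][0] * A[0][1] + A[1][0] * A[1][1]
d = A[0][1] ** 2 + A[1][1] ** 2

# Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
center = (a + d) / 2.0
radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
eigenvalues = [center + radius, center - radius]

singular_values = [math.sqrt(lam) for lam in eigenvalues]
frobenius_sq = sum(entry ** 2 for row in A for entry in row)
sum_sigma_sq = sum(s ** 2 for s in singular_values)
```

Here AᵀA has eigenvalues 45 and 5, so the squared singular values add up to 50, the same as the sum of the squared entries of A.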
ML-54
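LSA Derivations (3/7)
$A = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$
$U\Sigma = AV$
[Diagram: the m×n matrix A maps Rⁿ to Rᵐ. The $v_i$ in $V_1$ span the row space of A, and the $u_i$ in $U_1$ span the row space of $A^T$; $V_2$ is associated with $AX = 0$]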
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
    $A = U\Sigma V^T \Rightarrow AV = U\Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$
    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  – Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
    $A = U\Sigma V^T \Rightarrow A^T U = (U\Sigma V^T)^T U = V\Sigma U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$
    • The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A), of size m×n, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$
    • compare two terms ($w_i$, $w_j$) → dot product of two rows of A
      – or an entry in $AA^T$
    • compare two docs ($d_k$, $d_s$) → dot product of two columns of A
      – or an entry in $A^T A$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A′), with U′ = $U_{m \times k}$, Σ′ = $\Sigma_k$, V′ = $V_{n \times k}$
    • compare two terms → dot product of two rows of U′Σ′
      $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$
    • compare two docs → dot product of two rows of V′Σ′
      $A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
    • compare a query word and a doc → each individual entry of A′
  (Σ′ is for stretching or shrinking; $V'^T V' = I$ and $U'^T U' = I$)
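The two identities behind the term-term and doc-doc comparisons can be verified on a toy factorization. The matrices `U`, `S` and `V` below are hand-picked (hypothetical) factors with orthonormal columns, not the output of an actual SVD run; the sketch only confirms that rows of U′Σ′ and V′Σ′ reproduce the entries of A′A′ᵀ and A′ᵀA′.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

U = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]   # 3 terms x k=2, orthonormal columns
S = [[3.0, 0.0], [0.0, 1.0]]               # k x k diagonal matrix
V = [[0.6, -0.8], [0.8, 0.6]]              # 2 docs x k=2, orthonormal columns

A = matmul(matmul(U, S), transpose(V))     # A' = U' S' V'^T
US = matmul(U, S)                          # rows are scaled term vectors
VS = matmul(V, S)                          # rows are scaled doc vectors

term_gap = max_abs_diff(matmul(A, transpose(A)), matmul(US, transpose(US)))
doc_gap = max_abs_diff(matmul(transpose(A), A), matmul(VS, transpose(VS)))
```

Both gaps are at rounding-error level, so comparing two terms (or two docs) never requires forming A′ explicitly.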
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
      $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
      – The query is represented by the weighted sum of its constituent term vectors
      – The separate dimensions are differentially weighted
      – $\hat{q}$ is just like a row of V
  – Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$ ($\hat{q}$ and $\hat{d}$ are row vectors)
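A minimal sketch of the fold-in and cosine formulas follows. The factors `U`, `sigma` and `V` are small hand-picked (hypothetical) matrices with orthonormal columns rather than real SVD output, and the function names are illustrative. As a consistency check, folding in a column of A = UΣVᵀ should return the corresponding row of V.

```python
import math

U = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]   # m=3 terms, k=2 (hypothetical values)
sigma = [3.0, 1.0]                         # diagonal of Sigma (k x k)
V = [[0.6, -0.8], [0.8, 0.6]]              # n=2 docs, k=2 (hypothetical values)

def fold_in(q):
    """q_hat = q^T U Sigma^{-1} for an m-dimensional term vector q."""
    return [sum(q[i] * U[i][c] for i in range(len(q))) / sigma[c]
            for c in range(len(sigma))]

def similarity(q_hat, d_hat):
    """Cosine between q_hat * Sigma and d_hat * Sigma (both row vectors)."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [a * s for a, s in zip(d_hat, sigma)]
    numerator = sum(a * b for a, b in zip(qs, ds))
    return numerator / (math.sqrt(sum(a * a for a in qs)) *
                        math.sqrt(sum(b * b for b in ds)))

# Column 0 of A = U Sigma V^T, i.e. the first doc in term space.
doc_0 = [sum(U[i][c] * sigma[c] * V[0][c] for c in range(2)) for i in range(3)]
refolded = fold_in(doc_0)                  # should match V[0]

query_hat = fold_in([1.0, 0.0, 1.0])       # a query containing terms 0 and 2
score = similarity(query_hat, V[0])
```

In practice the query vector would be weighted the same way as the matrix entries before folding in.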
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems (“heterogeneous vocabulary”)
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde’s SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs
         Doc 0  Doc 1  Doc 2
Term 0   2.3    0.0    4.2
Term 1   0.0    1.3    2.2
Term 2   3.8    0.0    0.5
Term 3   0.0    0.0    0.0
  – Each entry is weighted by its TF×IDF score
  – Sparse text representation of the matrix:
4 3 6      (rows = no. of terms, cols = no. of docs, no. of nonzero entries)
2          (2 nonzero entries at Col 0)
0 2.3      (Col 0, Row 0)
2 3.8      (Col 0, Row 2)
1          (1 nonzero entry at Col 1)
1 1.3      (Col 1, Row 1)
3          (3 nonzero entries at Col 2)
0 4.2      (Col 2, Row 0)
1 2.2      (Col 2, Row 1)
2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
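The column-by-column layout shown on this slide can be produced from a dense matrix with a few lines of code. The function name `to_sparse_text` is hypothetical, and the sketch follows only the layout visible in the example (a header line, then for each column a count followed by row/value pairs); it is not taken from the SVDLIBC sources.

```python
def to_sparse_text(dense):
    """Serialize a dense term-by-doc matrix in the column-wise sparse layout."""
    n_rows, n_cols = len(dense), len(dense[0])
    columns = []
    for c in range(n_cols):
        # Keep only the nonzero entries of this column, with their row indices.
        entries = [(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0.0]
        columns.append(entries)
    nonzeros = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {nonzeros}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {value}" for r, value in entries)
    return "\n".join(lines)

matrix = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(matrix)
```

For the 4×3 example, `text` starts with the header line `4 3 6`, followed by the three column blocks.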
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
……
• SVD command (IR_svd.bat)
svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (word vector (uᵀ): 1×100; 51253 words)
100 51253
0.003 0.001 ……
0.002 0.002 ……
• LSA100-S (100 eigenvalues)
100
2686.18
829.941
559.59
…
• LSA100-Vt (doc vector (vᵀ): 1×100; 2265 docs)
100 2265
0.021 0.035 ……
0.012 0.022 ……
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
  $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-17
PCA Derivations (11/13)
bull Data compression in communication
ndash PCA is an optimal transform for signal representation and dimensionality reduction, but not necessarily for classification tasks such as speech recognition
ndash PCA needs no prior information (e.g., class distributions of output information) about the sample patterns
(To be discussed later on)
ML-18
PCA Derivations (12/13)
bull Scree graph
ndash The plot of variance as a function of the number of eigenvectors kept
bull Select $m$ such that
$$\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$$
bull Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
$$\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i$$
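The two selection rules on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function names and the example spectrum are hypothetical, and the eigenvalues are assumed to come from an earlier PCA step.

```python
def choose_m_by_ratio(eigenvalues, threshold):
    """Smallest m whose m largest eigenvalues reach `threshold` of the total variance."""
    lams = sorted(eigenvalues, reverse=True)
    total = sum(lams)
    running = 0.0
    for m, lam in enumerate(lams, start=1):
        running += lam
        if running / total >= threshold:
            return m
    return len(lams)


def choose_m_by_average(eigenvalues):
    """Number of eigenvalues that are at least the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for lam in eigenvalues if lam >= average)


# Hypothetical spectrum (total variance = 8.0).
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
m_ratio = choose_m_by_ratio(eigenvalues, 0.85)   # cumulative ratios: 0.5, 0.75, 0.875, ...
m_avg = choose_m_by_average(eigenvalues)         # average eigenvalue is 1.6
print(m_ratio, m_avg)
```

The ratio rule depends on a user-chosen threshold, while the average-eigenvalue rule needs no tuning; the two can give different values of m on the same spectrum, as they do here.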
ML-19
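PCA Derivations (13/13)
bull PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal
$$J(W) = |\tilde{S}| = |\tilde{S}_w + \tilde{S}_b| = |W^T S_w W + W^T S_b W|$$
$$\tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W$$
$$S_w = \frac{1}{N} \sum_j N_j \Sigma_j, \qquad S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
$$S = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$$
($i$: sample index; $j$: class index)
bull Try to show that $S = S_w + S_b$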
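Before attempting the proof, the identity S = S_w + S_b can be checked numerically. The sketch below is a sanity check on hypothetical toy data, not a proof; the helper names are illustrative, and the class covariances are assumed to be normalized by N_j as in the LDA derivations.

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def add(A, B, scale=1.0):
    """Return A + scale * B for equally sized matrices."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scatter(vectors, center):
    """Sum over the vectors of (x - center)(x - center)^T."""
    dim = len(center)
    S = [[0.0] * dim for _ in range(dim)]
    for x in vectors:
        d = [xi - ci for xi, ci in zip(x, center)]
        S = add(S, outer(d, d))
    return S


# Hypothetical toy data: two classes in two dimensions.
classes = {
    1: [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],
    2: [[6.0, 5.0], [7.0, 8.0]],
}
all_x = [x for xs in classes.values() for x in xs]
N = len(all_x)
x_bar = mean(all_x)

# Total covariance S.
S = [[v / N for v in row] for row in scatter(all_x, x_bar)]

S_w = [[0.0, 0.0], [0.0, 0.0]]
S_b = [[0.0, 0.0], [0.0, 0.0]]
for xs in classes.values():
    x_bar_j = mean(xs)
    # N_j * Sigma_j is the class scatter around the class mean.
    S_w = add(S_w, scatter(xs, x_bar_j), scale=1.0 / N)
    d = [a - b for a, b in zip(x_bar_j, x_bar)]
    S_b = add(S_b, outer(d, d), scale=len(xs) / N)

S_sum = add(S_w, S_b)
max_gap = max(abs(S[r][c] - S_sum[r][c]) for r in range(2) for c in range(2))
print(max_gap)  # should be zero up to rounding error
```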
ML-20
PCA Examples: Data Analysis
bull Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
bull Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
bull Example 3: image coding
[Figure: image coding example, with dimension labels 256 × 256 and 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
bull Example 3: image coding (cont.)
[Figure panels: value reduction; feature reduction]
ML-24
PCA Examples: Eigenface (1/4)
bull Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
ndash Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
ndash Steps
bull Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
$$x_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}, \quad x_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix}, \quad \ldots, \quad x_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$$
bull Calculate the covariance/correlation matrix between these reference vectors
bull Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
bull In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
bull Example 4: Eigenface in face recognition (cont.)
ndash Steps
bull The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$\hat{x}_i = \bar{x} + w_{i,1}\, e(1) + w_{i,2}\, e(2) + \cdots + w_{i,K}\, e(K) \;\Rightarrow\; y_i = [\, w_{i,1},\ w_{i,2},\ \ldots,\ w_{i,K} \,]$$
($y_i$: feature vector of person $i$)
ndash The Eigenface approach follows the minimum mean-squared error criterion
ndash Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variation between faces
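The representation step can be sketched as follows. The slide does not say how the weights are obtained; the sketch assumes the eigenfaces are orthonormal, so that each weight is the inner product of the mean-subtracted image with the corresponding eigenface. The 4-pixel "images" and all names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(x, x_bar, eigenfaces):
    """Weights w_k = e(k)^T (x - x_bar), assuming orthonormal eigenfaces e(k)."""
    centered = [a - b for a, b in zip(x, x_bar)]
    return [dot(e, centered) for e in eigenfaces]


def reconstruct(weights, x_bar, eigenfaces):
    """x_hat = x_bar + sum_k w_k e(k)."""
    x_hat = list(x_bar)
    for w, e in zip(weights, eigenfaces):
        x_hat = [xh + w * ek for xh, ek in zip(x_hat, e)]
    return x_hat


# Hypothetical data: the average face ("eigenface 0") and K = 2 orthonormal eigenfaces.
x_bar = [10.0, 10.0, 10.0, 10.0]
eigenfaces = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]
x = [13.0, 11.0, 12.0, 12.0]

y = project(x, x_bar, eigenfaces)           # feature vector of this face
x_hat = reconstruct(y, x_bar, eigenfaces)   # approximation built from K eigenfaces
print(y, x_hat)
```

With K smaller than the image dimension, x_hat is only an approximation of x; the K weights in y are what a recognizer would compare between faces.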
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
bull Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
ndash Steps
bull Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
bull SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Labels: Speaker 1 data, ..., Speaker R data; SI HMM; model training; Speaker 1 HMM, ..., Speaker R HMM; supervector dimension D = (M·n) × 1; Principal Component Analysis]
bull Each new speaker S is represented by a point P in K-space
$$P = e(0) + w_1\, e(1) + w_2\, e(2) + \cdots + w_K\, e(K)$$
($e(0)$: SI HMM model)
ML-29
PCA Examples: Eigenvoice (2/3)
bull Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
bull Example 5: Eigenvoice in speaker adaptation (cont.)
ndash Dimension 1 (eigenvoice 1)
bull Correlates with pitch or sex
ndash Dimension 2 (eigenvoice 2)
bull Correlates with amplitude
ndash Dimension 3 (eigenvoice 3)
bull Correlates with second-formant movement
Note that the eigenface approach operates in the feature space, while the eigenvoice approach operates in the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
bull Also called
ndash Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
bull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
bull Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure note: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
bull Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j$, $j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
ndash The sample mean is
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
ndash The class sample means are
$$\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i) = j} x_i$$
ndash The class sample covariances are
$$\Sigma_j = \frac{1}{N_j} \sum_{g(x_i) = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
ndash The average within-class variation before transform
$$S_w = \frac{1}{N} \sum_j N_j \Sigma_j$$
ndash The average between-class variation before transform
$$S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
ML-34
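LDA Derivations (2/4)
bull If the transform $W = [\, w_1 \; w_2 \; \cdots \; w_m \,]$ is applied
ndash The sample vectors will be
$$y_i = W^T x_i$$
ndash The sample mean will be
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$$
ndash The class sample means will be
$$\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i) = j} W^T x_i = W^T \bar{x}_j$$
ndash The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N} \sum_j N_j \left\{ \frac{1}{N_j} \sum_{g(x_i) = j} \left( W^T x_i - W^T \bar{x}_j \right) \left( W^T x_i - W^T \bar{x}_j \right)^T \right\} = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W$$
[Diagram: the transform W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]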
ML-35
LDA Derivations (3/4)
bull If the transform $W = [\, w_1 \; w_2 \; \cdots \; w_m \,]$ is applied
ndash Similarly, the average between-class variation will be
$$\tilde{S}_b = W^T S_b W$$
ndash Try to find the optimal $W$ such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$
bull A closed-form solution: the column vectors $w_i$ of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_b w_i = \lambda_i S_w w_i$$
bull That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
$$S_w^{-1} S_b w_i = \lambda_i w_i$$
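For two classes, S_b has rank one, so the generalized eigenproblem has a single nonzero eigenvalue and its eigenvector is proportional to S_w^{-1}(x̄_1 − x̄_2). The sketch below uses this special case to check the closed-form solution numerically; the 2-D toy data and the helper names are hypothetical, and the general multi-class case would need a full generalized eigensolver instead.

```python
def mean(vs):
    return [sum(v[d] for v in vs) / len(vs) for d in range(2)]


def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def inv2(M):
    """Inverse of a 2 x 2 matrix (assumes a nonzero determinant)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]


# Hypothetical two-class data in two dimensions.
class_1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class_2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 6.0], [7.0, 7.0]]
N = len(class_1) + len(class_2)
m1, m2 = mean(class_1), mean(class_2)
m = mean(class_1 + class_2)

# Within-class and between-class variation, as defined in the derivation.
S_w = [[0.0, 0.0], [0.0, 0.0]]
S_b = [[0.0, 0.0], [0.0, 0.0]]
for xs, mj in ((class_1, m1), (class_2, m2)):
    for r in range(2):
        for c in range(2):
            S_w[r][c] += sum((x[r] - mj[r]) * (x[c] - mj[c]) for x in xs) / N
            S_b[r][c] += len(xs) * (mj[r] - m[r]) * (mj[c] - m[c]) / N

# Two-class shortcut: w is proportional to S_w^{-1} (m1 - m2).
w = matvec(inv2(S_w), [m1[0] - m2[0], m1[1] - m2[1]])
lam = dot(w, matvec(S_b, w)) / dot(w, matvec(S_w, w))

# Check the generalized eigen-equation S_b w = lam * S_w w.
lhs = matvec(S_b, w)
rhs = [lam * v for v in matvec(S_w, w)]
residual = max(abs(a - b) for a, b in zip(lhs, rhs))
print(w, lam, residual)
```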
ML-36
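LDA Derivations (4/4)
bull Proof
$$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \qquad (|\cdot|: \text{determinant})$$
Or, for each column vector $w_i$ of $W$, we want to find
$$\hat{w}_i = \arg\max_{w_i} \lambda_i, \qquad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$$
The quadratic form has an optimal solution:
$$\frac{\partial \lambda_i}{\partial w_i} = 0$$
$$\Rightarrow \frac{2 S_b w_i \left( w_i^T S_w w_i \right) - 2 S_w w_i \left( w_i^T S_b w_i \right)}{\left( w_i^T S_w w_i \right)^2} = 0$$
$$\Rightarrow \frac{S_b w_i \left( w_i^T S_w w_i \right)}{w_i^T S_w w_i} - \frac{S_w w_i \left( w_i^T S_b w_i \right)}{w_i^T S_w w_i} = 0$$
$$\Rightarrow S_b w_i - \lambda_i S_w w_i = 0$$
$$\Rightarrow S_b w_i = \lambda_i S_w w_i \;\Rightarrow\; S_w^{-1} S_b w_i = \lambda_i w_i$$
Rules used:
$$\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}, \qquad \frac{d\,(x^T C x)}{dx} = (C + C^T)\, x$$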
ML-37
LDA Examples: Feature Transformation (1/2)
bull Example 1: experiments on speech signal processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$$
[Figure: covariance matrix of the 18-cepstral vectors (after the cosine transform), calculated using Year-99's 5471 files]
$$\Sigma'_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
bull Example 1: experiments on speech signal processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform), calculated using Year-99's 5471 files]
Character Error Rate (%)
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: panels labeled PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA: Heteroscedastic Discriminant Analysis
bull The difference between the projections obtained from LDA and HDA for the 2-class case
ndash Clearly, HDA theoretically provides a much lower classification error than LDA
bull However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
bull Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
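HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
bull Plot covariance matrix
    CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
bull Eigen decomposition
    BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % between-class matrix
    WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % within-class matrix
    % LDA
    IWI = inv(WI);
    A = IWI * BE;
    % PCA
    A = BE + WI;             % why? (Prove it)
    [V, D] = eig(A);         % all eigenvectors and eigenvalues
    [V, D] = eigs(A, 3);     % or only the 3 largest
    % Write the basis vectors to a file, one row per feature dimension
    fid = fopen('Basis', 'w');
    for i = 1:3              % feature vector length
        for j = 1:3          % basis number
            fprintf(fid, '%10.10f ', V(i, j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);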
ML-44
HW-3: Feature Transformation (4/4)
bull Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original data samples after the LDA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
bull Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
ndash Co-occurring terms are projected onto the same dimensions
ndash In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
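LSA (2/7)
bull Dimension reduction and feature extraction
ndash PCA: project $X$ from the n-dimensional feature space onto an orthonormal basis $\phi_1, \ldots, \phi_k$
$$y_i = \phi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \phi_i, \qquad \min \| X - \hat{X} \|^2 \ \text{for a given } k$$
ndash SVD (in LSA): approximate the m × n matrix $A$ by the m × n matrix $A'$ in the k-dimensional latent semantic space
$$A' = U' \Sigma' V'^T, \qquad \min \| A - A' \|_F^2 \ \text{for a given } k$$
(dimension labels in the figure: $U'$ is m × r, $\Sigma'$ is r × r, $V'^T$ is r × n, with r ≤ min(m, n); the retained block is k × k)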
ML-47
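LSA (3/7)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
[Figure: projection of a vector $x$ onto the basis vectors $\phi_1$ and $\phi_2$, giving $y_1$ and $y_2$]
$$y_1 = \|x\| \cos\theta_1 = \|x\| \, \frac{\phi_1^T x}{\|\phi_1\| \, \|x\|} = \phi_1^T x, \qquad \text{where } \|\phi_1\| = 1$$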
ML-48
LSA (4/7)
bull Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each shown with its terms and its structure model; labels: doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
[Figure: example with the query "human computer interaction"; annotation: an OOV word]
ML-51
LSA (7/7)
bull Singular Value Decomposition (SVD)
$$A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n}$$
$$A'_{m \times n} = U'_{m \times k} \, \Sigma_{k \times k} \, V'^T_{k \times n}$$
(rows of $A$ and $A'$ correspond to the words $w_1, w_2, \ldots, w_m$; columns correspond to the docs $d_1, d_2, \ldots, d_n$)
ndash Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
ndash $k \le r \le \min(m, n)$ and $\|A\|_F^2 \ge \|A'\|_F^2$, where
$$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$$
ndash Row A ⊆ $R^n$, Col A ⊆ $R^m$
ndash Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
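LSA Derivations (1/7)
bull Singular Value Decomposition (SVD)
ndash $A^T A$ is a symmetric n × n matrix
bull All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$$
bull The eigenvectors $v_j$ can be chosen to be orthonormal ($v_j \in R^n$)
$$V_{n \times n} = [\, v_1 \; v_2 \; \cdots \; v_n \,], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$$
bull Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$, with $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
ndash As the square roots of the eigenvalues of $A^T A$
ndash As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$
$$\sigma_1 = \|Av_1\|, \quad \sigma_2 = \|Av_2\|, \quad \ldots$$
$$\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$$
bull For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A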
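The two definitions of the singular values can be compared on a small matrix. The sketch below takes a hypothetical 3 × 2 matrix A, computes the eigenpairs of the 2 × 2 matrix A^T A in closed form, and compares the square roots of the eigenvalues with the lengths of the vectors A v_j. The closed-form eigensolver only handles symmetric 2 × 2 matrices and is an illustrative helper, not a general method.

```python
import math


def eig_sym_2x2(M):
    """Eigenvalues (descending) with unit eigenvectors of a symmetric 2 x 2 matrix."""
    a, b, c = M[0][0], M[0][1], M[1][1]
    if b == 0.0:
        pairs = [(a, [1.0, 0.0]), (c, [0.0, 1.0])]
        return sorted(pairs, key=lambda p: p[0], reverse=True)
    half_trace = (a + c) / 2.0
    radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    pairs = []
    for lam in (half_trace + radius, half_trace - radius):
        v = [b, lam - a]                      # solves (M - lam I) v = 0 when b != 0
        norm = math.sqrt(v[0] ** 2 + v[1] ** 2)
        pairs.append((lam, [v[0] / norm, v[1] / norm]))
    return pairs


# Hypothetical 3 x 2 matrix (m = 3, n = 2).
A = [[2.0, 0.0], [1.0, 2.0], [0.0, 1.0]]
AtA = [[sum(row[i] * row[j] for row in A) for j in range(2)] for i in range(2)]

pairs = eig_sym_2x2(AtA)
sigmas = [math.sqrt(lam) for lam, _ in pairs]   # square roots of the eigenvalues
lengths = [
    math.sqrt(sum((row[0] * v[0] + row[1] * v[1]) ** 2 for row in A))
    for _, v in pairs
]                                               # lengths of A v_j
frob_sq = sum(x * x for row in A for x in row)  # squared Frobenius norm of A
print(sigmas, lengths, frob_sq)
```

For this matrix A^T A = [[5, 2], [2, 5]], so the eigenvalues are 7 and 3, and their sum equals the squared Frobenius norm of A (10).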
ML-53
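LSA Derivations (2/7)
bull $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
$$Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \qquad (i \ne j)$$
ndash Suppose that A (or $A^T A$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
ndash Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
$$u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [\, u_1 \; u_2 \; \cdots \; u_r \,] \, \Sigma_r = A \, [\, v_1 \; v_2 \; \cdots \; v_r \,]$$
(V: an orthonormal matrix (n × r), known in advance; U: also an orthonormal matrix (m × r))
bull Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
$$[\, u_1 \cdots u_r \cdots u_m \,] \, \Sigma = A \, [\, v_1 \cdots v_r \cdots v_n \,] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T \qquad (V V^T = I_{n \times n})$$
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_r & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$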
ML-54
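LSA Derivations (3/7)
$$A = U \Sigma V^T = (\, U_1 \;\; U_2 \,) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
[Figure: the m × n matrix A maps $R^n$ to $R^m$, with $U\Sigma = AV$. Labels in $R^n$: $V_1$ (the $v_i$ span the row space of $A$), $V_2$, and $AX = 0$. Labels in $R^m$: $U_1$ (the $u_i$ span the row space of $A^T$) and $U_2$]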
ML-55
LSA Derivations (4/7)
bull Additional explanations
ndash Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
$$A = U \Sigma V^T \;\Rightarrow\; AV = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = AV$$
bull The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
ndash Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
$$A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U$$
bull The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
bull Fundamental comparisons based on SVD
ndash The original word-document matrix ($A$, m × n), with rows $w_1, w_2, \ldots, w_m$ (words) and columns $d_1, d_2, \ldots, d_n$ (docs)
bull compare two terms → dot product of two rows of $A$ (or an entry in $AA^T$)
bull compare two docs → dot product of two columns of $A$ (or an entry in $A^TA$)
bull compare a term and a doc → each individual entry of $A$
ndash The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
bull compare two terms → dot product of two rows of $U'\Sigma'$
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
bull compare two docs → dot product of two rows of $V'\Sigma'$
$$A'^TA' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
bull compare a query word and a doc → each individual entry of $A'$
($\Sigma'$ is for stretching or shrinking; $V'^T V' = I$ and $U'^T U' = I$)
ML-57
LSA Derivations (6/7)
bull Fold-in: find representations for pseudo-docs q
ndash For objects (new queries or docs) that did not appear in the original analysis
bull Fold-in a new m × 1 query (or doc) vector
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$$
(the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted by $\Sigma^{-1}$; $\hat{q}$ is just like a row of V)
ndash Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|} \qquad (\hat{q}, \hat{d}: \text{row vectors})$$
ML-58
LSA Derivations (7/7)
bull Fold-in a new 1 × n term vector
$$\hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k}$$
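A minimal sketch of folding in a query and scoring it against a doc, written in plain Python. The matrix U_k, the diagonal of Σ_k and the doc vector below are hypothetical values chosen for illustration (in practice they come from the truncated SVD of the term-doc matrix), and the function names are not part of any toolkit.

```python
import math


def fold_in(q, U_k, sigma_k):
    """q_hat = q^T U_k Sigma_k^{-1}; q has m entries, U_k is m x k, sigma_k is the diagonal."""
    k = len(sigma_k)
    return [sum(q[t] * U_k[t][d] for t in range(len(q))) / sigma_k[d] for d in range(k)]


def latent_cosine(a_hat, b_hat, sigma_k):
    """Cosine between a_hat * Sigma and b_hat * Sigma (row vectors in the k-dim space)."""
    a = [x * s for x, s in zip(a_hat, sigma_k)]
    b = [x * s for x, s in zip(b_hat, sigma_k)]
    num = sum(x * y for x, y in zip(a, b))
    return num / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hypothetical factors for m = 3 terms and k = 2 latent dimensions.
U_k = [
    [1.0, 0.0],
    [0.0, 0.6],
    [0.0, 0.8],
]                       # orthonormal columns
sigma_k = [2.0, 1.0]    # diagonal of Sigma_k
d_hat = [0.5, 0.5]      # a doc vector, i.e. one row of V_k (hypothetical)

q = [1.0, 1.0, 0.0]     # query containing terms 0 and 1
q_hat = fold_in(q, U_k, sigma_k)
sim = latent_cosine(q_hat, d_hat, sigma_k)
print(q_hat, sim)
```

Because the cosine is taken after multiplying by Σ, the Σ^{-1} applied during fold-in cancels for the query side: q̂Σ is simply q^T U_k. A new term vector can be folded in the same way with V_k in place of U_k.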
ML-59
LSA Example
bull Experimental results
ndash HMM is consistently better than VSM at all recall levels
ndash LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
bull Advantages
ndash A clean formal framework and a clearly defined optimization criterion (least-squares)
bull Conceptual simplicity and clarity
ndash Handles synonymy problems ("heterogeneous vocabulary")
ndash Good results for high-recall search
bull Takes term co-occurrence into account
bull Disadvantages
ndash High computational complexity
ndash LSA offers only a partial solution to polysemy
bull E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
bull Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
bull Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
bull Given a sparse term-doc matrix
ndash E.g., 4 terms and 3 docs
         Doc 0   Doc 1   Doc 2
Term 0   2.3     0.0     4.2
Term 1   0.0     1.3     2.2
Term 2   3.8     0.0     0.5
Term 3   0.0     0.0     0.0
ndash Each entry is weighted by its TFxIDF score
ndash Sparse text representation (Row: Term, Col: Doc)
4 3 6      (4 terms, 3 docs, 6 nonzero entries)
2          (2 nonzero entries at Col 0)
0 2.3      (Col 0, Row 0)
2 3.8      (Col 0, Row 2)
1          (1 nonzero entry at Col 1)
1 1.3      (Col 1, Row 1)
3          (3 nonzero entries at Col 2)
0 4.2      (Col 2, Row 0)
1 2.2      (Col 2, Row 1)
2 0.5      (Col 2, Row 2)
bull Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
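The column-by-column layout of the example can be produced from a dense matrix with a few lines of Python. This is a sketch of the layout shown on the slide only; the function name is hypothetical and the code is not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense row-major matrix column by column.

    Header: "rows cols nonzeros"; then, for each column, the number of nonzero
    entries followed by one "row value" line per nonzero entry.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [
        [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0.0]
        for c in range(n_cols)
    ]
    lines = ["%d %d %d" % (n_rows, n_cols, sum(len(col) for col in columns))]
    for col in columns:
        lines.append(str(len(col)))
        lines.extend("%d %g" % (r, v) for r, v in col)
    return "\n".join(lines) + "\n"


# The 4-term, 3-doc example matrix from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
print(sparse_text, end="")
```

A row of zeros (Term 3 here) contributes no lines at all; it is only reflected in the row count of the header.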
ML-63
LSA Toolkit: SVDLIBC (3/5)
bull Example term-doc matrix
51253 2265 218852     (no. of indexing terms, no. of docs, no. of nonzero entries)
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
......
bull SVD command (IR_svd.bat)
svd -r st -o LSA100 -d 100 Term-Doc-Matrix
ndash -r st: sparse matrix input
ndash -o LSA100: prefix of the output files
ndash -d 100: no. of reserved eigenvectors
ndash Term-Doc-Matrix: name of the sparse matrix input
bull Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
bull LSA100-Ut
100 51253
0.003 0.001 ......
0.002 0.002 ......
(word vector ($u^T$): 1 × 100; 51253 words)
bull LSA100-S
100
2686.18
829.941
559.59
...
(100 eigenvalues)
bull LSA100-Vt
100 2265
0.021 0.035 ......
0.012 0.022 ......
(doc vector ($v^T$): 1 × 100; 2265 docs)
ML-65
LSA Toolkit: SVDLIBC (5/5)
bull Fold-in a new m × 1 query vector (TFxIDF weighted beforehand)
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$$
(the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted by $\Sigma^{-1}$; $\hat{q}$ is just like a row of V)
bull Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$$
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-18
PCA Derivations (12/13)
• Scree Graph
  - The plot of variance as a function of the number of eigenvectors kept
• Select $m$ such that
  $\frac{\lambda_1 + \lambda_2 + \cdots + \lambda_m}{\lambda_1 + \lambda_2 + \cdots + \lambda_m + \cdots + \lambda_n} \ge \text{Threshold}$
• Or select those eigenvectors with eigenvalues larger than the average input variance (average eigenvalue)
  $\lambda_m \ge \frac{1}{n} \sum_{i=1}^{n} \lambda_i$
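The two selection rules on this slide can be sketched in a few lines. This is a minimal illustration, not part of the original slides: the function names and the eigenvalues are hypothetical, and the eigenvalues are assumed to come from a covariance matrix that has already been decomposed.

```python
def choose_by_threshold(eigenvalues, threshold):
    """Return the smallest m whose top-m eigenvalues reach `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for m, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return m
    return len(ordered)


def choose_by_average(eigenvalues):
    """Return how many eigenvalues are at least as large as the average eigenvalue."""
    average = sum(eigenvalues) / len(eigenvalues)
    return sum(1 for value in eigenvalues if value >= average)


# Hypothetical eigenvalues of a 5-dimensional covariance matrix.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]
m_threshold = choose_by_threshold(eigenvalues, 0.85)  # cumulative ratios: 0.5, 0.75, 0.875
m_average = choose_by_average(eigenvalues)            # the average eigenvalue is 1.6
```

With these numbers the threshold rule keeps three eigenvectors and the average-eigenvalue rule keeps two, which shows that the two criteria need not agree.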
ML-19
PCA Derivations (13/13)
• PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal
  $J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W$
  $\tilde{S}_w = W^T S_w W$
  $\tilde{S}_b = W^T S_b W$
  $S_w = \frac{1}{N} \sum_j N_j \Sigma_j$
  $S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
  $S = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$
  ($i$: sample index, $j$: class index)
• Try to show that $S = S_w + S_b$
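One way to carry out the exercise is sketched below. It assumes the class covariance $\Sigma_j = \frac{1}{N_j}\sum_{g(x_i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^T$ and the class-index function $g(\cdot)$ as they are defined in the LDA derivation slides; the grouping of the sum by class is the only step added here.

```latex
\begin{aligned}
S &= \frac{1}{N}\sum_{i}(x_i-\bar{x})(x_i-\bar{x})^T \\
  &= \frac{1}{N}\sum_{j}\sum_{g(x_i)=j}
     \bigl[(x_i-\bar{x}_j)+(\bar{x}_j-\bar{x})\bigr]
     \bigl[(x_i-\bar{x}_j)+(\bar{x}_j-\bar{x})\bigr]^T \\
  &= \frac{1}{N}\sum_{j}\sum_{g(x_i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^T
   + \frac{1}{N}\sum_{j}N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^T \\
  &= \frac{1}{N}\sum_{j}N_j\Sigma_j + S_b \;=\; S_w + S_b
\end{aligned}
```

The two cross terms drop out between the second and third lines because $\sum_{g(x_i)=j}(x_i-\bar{x}_j)=0$ for every class $j$.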
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; the new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: image with dimension labels 256 × 256 and 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure labels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  - Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  - Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
      $x_1 = [x_{11}, x_{21}, \ldots, x_{n1}]^T,\quad x_2 = [x_{12}, x_{22}, \ldots, x_{n2}]^T,\quad \ldots,\quad x_L = [x_{1L}, x_{2L}, \ldots, x_{nL}]^T$
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  - Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
      $\hat{x}_i = \bar{x} + w_{i1}\,e(1) + w_{i2}\,e(2) + \cdots + w_{iK}\,e(K)$
      $\Rightarrow y_i = [w_{i1}, w_{i2}, \ldots, w_{iK}]$ (feature vector of a person $i$)
  - The Eigenface approach follows the minimum mean-squared error criterion
  - Incidentally, the eigenfaces are not only themselves usually plausible faces, but also directions of variations between faces
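The representation of a face as the average face plus K weighted eigenfaces can be sketched as follows. This is an illustrative toy, not the Turk and Pentland implementation: the three-pixel "images", the mean face and the two eigenfaces are hypothetical, and the weights are assumed to be obtained by projecting the mean-removed image onto orthonormal eigenfaces.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(face, mean_face, eigenfaces):
    """Weights w_k = e(k)^T (x - mean), assuming the eigenfaces e(k) are orthonormal."""
    centered = [x - m for x, m in zip(face, mean_face)]
    return [dot(e, centered) for e in eigenfaces]


def reconstruct(weights, mean_face, eigenfaces):
    """x_hat = mean + sum_k w_k e(k)."""
    approx = list(mean_face)
    for w, e in zip(weights, eigenfaces):
        approx = [a + w * c for a, c in zip(approx, e)]
    return approx


# Hypothetical 3-pixel images; K = 2 orthonormal eigenfaces.
mean_face = [2.0, 2.0, 2.0]                       # "eigenface 0"
eigenfaces = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # e(1), e(2)
face = [5.0, 1.0, 2.5]

y = project(face, mean_face, eigenfaces)       # feature vector of this face
x_hat = reconstruct(y, mean_face, eigenfaces)  # approximation using K eigenfaces
```

The third pixel is not recovered exactly (2.0 instead of 2.5) because only two eigenfaces are kept; that residual is the error the minimum mean-squared error criterion refers to.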
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images as the training set; the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variations between faces); a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  - Steps
    • Concatenate the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 data, …, Speaker R data and the SI HMM go through model training to give Speaker 1 HMM, …, Speaker R HMM; the resulting supervectors of dimension D = (M·n) × 1 are passed to Principal Component Analysis]
Each new speaker S is represented by a point P in K-space:
  $P_i = e(0) + w_{i1}\,e(1) + w_{i2}\,e(2) + \cdots + w_{iK}\,e(K)$
  (figure annotation: SI HMM model)
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that Eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation over the average within-class variation is maximal
[Figure note: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  - The sample mean is
    $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
  - The class sample means are
    $\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i)=j} x_i$
  - The class sample covariances are
    $\Sigma_j = \frac{1}{N_j} \sum_{g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$
  - The average within-class variation before transform
    $S_w = \frac{1}{N} \sum_j N_j \Sigma_j$
  - The average between-class variation before transform
    $S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
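The quantities defined on this slide can be computed directly from labeled samples. The sketch below uses plain Python lists and a hypothetical two-class, two-dimensional data set; the helper names (`mean`, `outer`, `axpy`, `covariance`) are illustrative and not taken from the slides. As a sanity check it also forms the total covariance, which should equal the sum of the within-class and between-class matrices.

```python
def mean(vectors):
    """Component-wise mean of a list of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]


def axpy(A, B, scale):
    """Return A + scale * B for equally sized matrices."""
    return [[a + scale * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]


def covariance(vectors, center):
    """(1 / len(vectors)) * sum_i (x_i - center)(x_i - center)^T."""
    dim = len(center)
    total = [[0.0] * dim for _ in range(dim)]
    for x in vectors:
        diff = [xi - ci for xi, ci in zip(x, center)]
        total = axpy(total, outer(diff, diff), 1.0 / len(vectors))
    return total


# Hypothetical labeled samples: class index -> sample vectors.
samples = {
    0: [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],
    1: [[6.0, 5.0], [7.0, 8.0]],
}
all_x = [x for xs in samples.values() for x in xs]
N = len(all_x)
x_bar = mean(all_x)
dim = len(x_bar)

S_w = [[0.0] * dim for _ in range(dim)]
S_b = [[0.0] * dim for _ in range(dim)]
for xs in samples.values():
    N_j = len(xs)
    x_bar_j = mean(xs)
    S_w = axpy(S_w, covariance(xs, x_bar_j), N_j / N)  # (N_j / N) * Sigma_j
    diff = [a - b for a, b in zip(x_bar_j, x_bar)]
    S_b = axpy(S_b, outer(diff, diff), N_j / N)        # (N_j / N) * (x_bar_j - x_bar)(...)^T

S_total = covariance(all_x, x_bar)
S_sum = axpy(S_w, S_b, 1.0)
```

Comparing `S_total` with `S_sum` entry by entry gives agreement up to floating-point rounding.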
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  - The sample vectors will be
    $y_i = W^T x_i$
  - The sample mean will be
    $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$
  - The class sample means will be
    $\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i)=j} W^T x_i = W^T \bar{x}_j$
  - The average within-class variation will be
    $\tilde{S}_w = \frac{1}{N} \sum_j N_j \cdot \left\{ \frac{1}{N_j} \sum_{g(x_i)=j} \left( W^T x_i - W^T \bar{x}_j \right) \left( W^T x_i - W^T \bar{x}_j \right)^T \right\} = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W$
[Figure: the transform $W$ maps $x$ (with $S_w$, $S_b$) to $y$ (with $\tilde{S}_w$, $\tilde{S}_b$)]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  - Similarly, the average between-class variation will be
    $\tilde{S}_b = W^T S_b W$
  - Try to find the optimal $W$ such that the following objective function is maximized
    $J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$
    • A closed-form solution: the column vectors $w_i$ of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
      $S_b w_i = \lambda_i S_w w_i$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
      $S_w^{-1} S_b w_i = \lambda_i w_i$
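For the two-class case the generalized eigenproblem can be checked by hand, because $S_b$ then has rank one and the leading eigenvector of $S_w^{-1} S_b$ is proportional to $S_w^{-1}(\bar{x}_1 - \bar{x}_0)$ (Fisher's classical result). The sketch below verifies this numerically; the matrix `S_w`, the class means, and the equal class sizes are hypothetical values chosen for illustration.

```python
def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


# Hypothetical within-class matrix and class means; both classes have the same size.
S_w = [[2.0, 0.5], [0.5, 1.0]]
mean_0, mean_1 = [0.0, 0.0], [2.0, 1.0]
x_bar = [(a + b) / 2.0 for a, b in zip(mean_0, mean_1)]

# S_b = sum_j (N_j / N) (x_bar_j - x_bar)(x_bar_j - x_bar)^T with N_j / N = 0.5.
S_b = [[0.0, 0.0], [0.0, 0.0]]
for class_mean in (mean_0, mean_1):
    diff = [m - g for m, g in zip(class_mean, x_bar)]
    S_b = [[S_b[r][c] + 0.5 * diff[r] * diff[c] for c in range(2)] for r in range(2)]

S_w_inv = inverse_2x2(S_w)
A = mat_mul(S_w_inv, S_b)  # S_w^{-1} S_b

# Candidate discriminant direction and its eigenvalue.
w = mat_vec(S_w_inv, [b - a for a, b in zip(mean_0, mean_1)])
lam = A[0][0] + A[1][1]    # rank-one matrix: the trace is the only nonzero eigenvalue
A_w = mat_vec(A, w)        # should equal lam * w
```

With more than two classes this shortcut no longer applies, and a general eigen solver is needed to obtain the leading eigenvectors of $S_w^{-1} S_b$.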
ML-36
LDA Derivations (4/4)
• Proof:
  $\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|}$, where $|\cdot|$ denotes the determinant
  Or, for each column vector $w_i$ of $W$, we want to find that
  $\hat{w}_i = \arg\max_{w_i} \lambda_i$, with $\lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$
  The quadratic form has an optimal solution where
  $\frac{\partial \lambda_i}{\partial w_i} = 0$
  $\Rightarrow \frac{2 S_b w_i \,(w_i^T S_w w_i) - 2 S_w w_i \,(w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$
  $\Rightarrow S_b w_i \,(w_i^T S_w w_i) - S_w w_i \,(w_i^T S_b w_i) = 0$
  $\Rightarrow S_b w_i - \lambda_i S_w w_i = 0$ (dividing by $w_i^T S_w w_i$)
  $\Rightarrow S_b w_i = \lambda_i S_w w_i \Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$
  ∵ $\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(x^T C x)}{dx} = (C + C^T)\,x$
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, $\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$, calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), $\Sigma'_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$, calculated using Year-99's 5471 files]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]
Character Error Rate
          TC      WG
MFCC      26.32   22.71
LDA-1     23.12   20.17
LDA-2     23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: two panels, labeled PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - Clearly, HDA provides a much lower classification error than LDA, theoretically
    • However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
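HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix

CoVar=[
3.0 0.5 0.4;
0.9 6.3 0.2;
0.4 0.4 4.2;
];
colormap('default');
surf(CoVar);

• Eigen Decomposition

BE=[
3.0 3.5 1.4;
1.9 6.3 2.2;
2.4 0.4 4.2;
];

WI=[
4.0 4.1 2.1;
2.9 8.7 3.5;
4.4 3.2 4.3;
];

% Use one of the two definitions of A below
% LDA
IWI=inv(WI);
A=IWI*BE;
% PCA
A=BE+WI;   % why? (Prove it)

[V,D]=eig(A);      % all eigenvectors and eigenvalues
[V,D]=eigs(A,3);   % the 3 largest eigenvalues and their eigenvectors

fid=fopen('Basis','w');
for i=1:3          % feature vector length
  for j=1:3        % basis number
    fprintf(fid,'%10.10f ',V(i,j));
  end
  fprintf(fid,'\n');
end
fclose(fid);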
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of 2000 original data samples after the PCA transformation; horizontal axis: feature 1, vertical axis: feature 2]
[Figure: distribution of 2000 original data samples after the LDA transformation; horizontal axis: feature 1, vertical axis: feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
    • Closely related to Principal Component Analysis (PCA)
ML-46
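LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: project $X$ from the n-dimensional feature space onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$
    $y_i = \varphi_i^T X$, $\hat{X} = \sum_{i=1}^{k} y_i \varphi_i$
    $\min \|X - \hat{X}\|^2$ for a given $k$
  - SVD (in LSA): $A_{m \times n} = U_{m \times r}\,\Sigma_{r \times r}\,V^T_{r \times n}$ with $r \le \min(m, n)$, approximated in the k-dimensional latent semantic space by $A'_{m \times n} = U'\,\Sigma'_{k \times k}\,V'^T$
    $\min \|A - A'\|_F^2$ for a given $k$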
ML-47
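LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector $x$:
  $y_1 = \|x\| \cos\theta_1 = \|x\| \, \frac{\varphi_1^T x}{\|\varphi_1\| \, \|x\|} = \varphi_1^T x$, where $\|\varphi_1\| = 1$
[Figure: a vector $x$ projected onto the axes $\varphi_1$ and $\varphi_2$, giving $y_1$ and $y_2$]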
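The projection formula can be checked numerically: the inner product with a unit-length basis vector gives the same value as the length of x times the cosine of the angle between them. The vectors below are hypothetical, and the angle is computed independently with `atan2` so that the comparison is not circular.

```python
import math


def projection(x, phi):
    """y = phi^T x, valid as a projection length when phi has unit length."""
    return sum(p * xi for p, xi in zip(phi, x))


x = [3.0, 4.0]
phi_1 = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]  # unit-length basis vector

y_1 = projection(x, phi_1)

# The same quantity as ||x|| cos(theta_1), with theta_1 the angle between x and phi_1.
theta_1 = math.atan2(x[1], x[0]) - math.atan2(phi_1[1], phi_1[0])
y_1_from_angle = math.hypot(x[0], x[1]) * math.cos(theta_1)
```

Both expressions evaluate to 7/√2, roughly 4.95.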
ML-48
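LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each represented by its terms and by a structure model. The terms are compared by literal term matching, optionally after doc expansion or query expansion; the structure models are compared by latent semantic structure retrieval]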
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
[Figure annotation: an OOV word]
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $A_{m \times n} = U_{m \times r}\,\Sigma_{r \times r}\,V^T_{r \times n}$
  $A'_{m \times n} = U'_{m \times k}\,\Sigma_{k \times k}\,V'^T_{k \times n}$
  (rows of $A$ and $A'$: words $w_1, w_2, \ldots, w_m$; columns: docs $d_1, d_2, \ldots, d_n$)
  - Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  - $k \le r$, $\|A\|_F^2 \ge \|A'\|_F^2$, $r \le \min(m, n)$
  - $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  - Row $A \in R^n$, Col $A \in R^m$
Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
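LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
      $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$)
      $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
    • Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$
      - As the square roots of the eigenvalues of $A^T A$
      - As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
        $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
  - For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$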
ML-53
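LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
  $Av_i \bullet Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  - Suppose that $A$ (or $A^T A$) has rank $r \le n$
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0,\quad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  - Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
    $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
    $\Rightarrow [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$
    ($V$: an orthonormal matrix (n×r), known in advance; $U$: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n]$
      $\Rightarrow U \Sigma = AV \Rightarrow U \Sigma V^T = A V V^T \Rightarrow A = U \Sigma V^T$ (since $V V^T = I_{n \times n}$)
      $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
  - $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$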
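The relations between the eigenvalues of $A^T A$, the singular values, and the Frobenius norm can be verified on a small case without any linear-algebra library. The 2×2 matrix below is a hypothetical example; its eigenvalues are obtained from the closed form for symmetric 2×2 matrices, which is an assumption of this sketch rather than something the slides use.

```python
import math

# Hypothetical 2 x 2 matrix.
A = [[3.0, 0.0],
     [4.0, 5.0]]

# A^T A is symmetric: [[a, b], [b, d]].
a = A[0][0] ** 2 + A[1][0] ** 2
b = A[0][0] * A[0][1] + A[1][0] * A[1][1]
d = A[0][1] ** 2 + A[1][1] ** 2

# Eigenvalues of a symmetric 2 x 2 matrix in closed form (lambda_1 >= lambda_2 >= 0).
half_trace = (a + d) / 2.0
radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
lambdas = [half_trace + radius, half_trace - radius]
sigmas = [math.sqrt(value) for value in lambdas]  # singular values

# Unit eigenvector for lambda_1 (this form is valid because b != 0 here).
v1 = [b, lambdas[0] - a]
norm_v1 = math.hypot(v1[0], v1[1])
v1 = [component / norm_v1 for component in v1]

# The length of A v_1 should equal sigma_1.
Av1 = [A[0][0] * v1[0] + A[0][1] * v1[1],
       A[1][0] * v1[0] + A[1][1] * v1[1]]
length_Av1 = math.hypot(Av1[0], Av1[1])

# Squared Frobenius norm: the sum of all squared entries.
frobenius_sq = sum(entry ** 2 for row in A for entry in row)
```

For this matrix the eigenvalues of $A^T A$ are 45 and 5, so the squared singular values sum to 50, which is also the sum of the squared entries of $A$.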
ML-54
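LSA Derivations (3/7)
  $A = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$
  $AV = U \Sigma$
[Figure: $A$ (m×n) maps $R^n$ to $R^m$. In $R^n$, the $v_i$ (columns of $V_1$) span the row space of $A$, and $V_2$ is labeled with $AX = 0$; in $R^m$, the $u_i$ (columns of $U_1$) span the row space of $A^T$, alongside $U_2$]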
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by columns of $V$
    $A = U \Sigma V^T \Rightarrow AV = U \Sigma V^T V = U \Sigma \Rightarrow U \Sigma = AV$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  - Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by columns of $U$
    $A = U \Sigma V^T \Rightarrow A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma^T \Rightarrow V \Sigma = A^T U$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix ($A$, m×n; rows $w_1, \ldots, w_m$, columns $d_1, \ldots, d_n$)
    • compare two terms → dot product of two rows of $A$
      - or an entry in $A A^T$
    • compare two docs → dot product of two columns of $A$
      - or an entry in $A^T A$
    • compare a term and a doc → each individual entry of $A$
  - The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U' \Sigma'$
      $A' A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U' \Sigma')(U' \Sigma')^T$
    • compare two docs → dot product of two rows of $V' \Sigma'$
      $A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T$
    • compare a query word and a doc → each individual entry of $A'$
  ($\Sigma'$ is for stretching or shrinking; $V'^T V'$ and $U'^T U'$ reduce to the identity matrix)
[Figure: the matrix $A$ with two terms $w_i$, $w_j$ and two docs $d_k$, $d_s$ highlighted]
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
      $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\,U_{m \times k}\,\Sigma^{-1}_{k \times k}$
      - Just like a row of $V$
      - The query is represented by the weighted sum of its constituent term vectors
      - The separate dimensions are differentially weighted
  - Cosine measure between the query and doc vectors in the latent semantic space
      $sim(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{\|\hat{q} \Sigma\| \, \|\hat{d} \Sigma\|}$ ($\hat{q}$ and $\hat{d}$ are row vectors)
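The fold-in and the cosine measure can be sketched with plain lists. Everything numeric below is hypothetical: `U` stands for a tiny m × k term matrix with orthonormal columns, `sigma` holds the k diagonal values of Σ, and `d_hat` plays the role of one row of V. The function names are illustrative only.

```python
import math


def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}: q has m entries, U is m x k, sigma holds the k diagonal values."""
    return [
        sum(q[i] * U[i][c] for i in range(len(q))) / sigma[c]
        for c in range(len(sigma))
    ]


def similarity(q_hat, d_hat, sigma):
    """cosine(q_hat Sigma, d_hat Sigma) for row vectors q_hat and d_hat."""
    qs = [value * s for value, s in zip(q_hat, sigma)]
    ds = [value * s for value, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    norm_q = math.sqrt(sum(a * a for a in qs))
    norm_d = math.sqrt(sum(b * b for b in ds))
    return dot / (norm_q * norm_d)


# Hypothetical latent space with m = 3 terms and k = 2 dimensions.
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
sigma = [2.0, 1.0]

q = [1.0, 1.0, 0.0]   # query containing terms 0 and 1
d_hat = [0.5, 0.0]    # a doc already represented in the latent space

q_hat = fold_in(q, U, sigma)
sim = similarity(q_hat, d_hat, sigma)
```

Scaling both vectors by Σ before taking the cosine is what weights the separate dimensions differently; without it, every latent dimension would count equally.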
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $\hat{t}_{1 \times k} = t_{1 \times n}\,V_{n \times k}\,\Sigma^{-1}_{k \times k}$
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (rows: terms, columns: docs)
      2.3  0.0  4.2
      0.0  1.3  2.2
      3.8  0.0  0.5
      0.0  0.0  0.0
  - Each entry is weighted by a TF×IDF score
  - The same matrix in sparse text form
      4 3 6      (Row: Term, Col: Doc, nonzero entries)
      2          (2 nonzero entries at Col 0)
      0 2.3      (Col 0, Row 0)
      2 3.8      (Col 0, Row 2)
      1          (1 nonzero entry at Col 1)
      1 1.3      (Col 1, Row 1)
      3          (3 nonzero entries at Col 2)
      0 4.2      (Col 2, Row 0)
      1 2.2      (Col 2, Row 1)
      2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
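The column-by-column listing on this slide can be produced from a dense matrix with a short helper. This is a sketch of the layout exactly as the slide shows it (a header with rows, columns and nonzero count, then for each column its nonzero count followed by row-index and value pairs); the function name `to_sparse_text` is hypothetical, and writing the lines to a file is left out.

```python
def to_sparse_text(matrix):
    """Return the lines of a column-oriented sparse text listing of a dense matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Keep (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0.0]
        columns.append(entries)

    nonzeros = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {nonzeros}"]  # header: rows, columns, nonzero entries
    for entries in columns:
        lines.append(str(len(entries)))        # number of nonzero entries in this column
        lines.extend(f"{r} {value:g}" for r, value in entries)
    return lines


# The 4-term, 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
lines = to_sparse_text(term_doc)
```

Joining `lines` with newlines gives the ten-line listing shown on the slide, starting with `4 3 6`.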
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example: term-doc matrix
    51253 2265 218852      (indexing term no., doc no., nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
  - Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
  - word vector ($u^T$): 1×100; 51253 words
• LSA100-S
    100
    26861882994155959…
  - 100 eigenvalues
• LSA100-Vt
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
  - doc vector ($v^T$): 1×100; 2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
  $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\,U_{m \times k}\,\Sigma^{-1}_{k \times k}$
  - Just like a row of $V$
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted
• Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{\|\hat{q} \Sigma\| \, \|\hat{d} \Sigma\|}$
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-19
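PCA Derivations (13/13)
bull PCA finds a linear transform $W$ such that the sum of the average between-class variation and the average within-class variation is maximal
\[ J(W) = \tilde{S} = \tilde{S}_w + \tilde{S}_b = W^T S_w W + W^T S_b W \]
\[ \tilde{S}_w = W^T S_w W, \qquad \tilde{S}_b = W^T S_b W \]
\[ S_w = \frac{1}{N} \sum_j N_j \Sigma_j \]
\[ S_b = \frac{1}{N} \sum_j N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T \]
\[ S = \frac{1}{N} \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T \]
where $i$ is the sample index and $j$ is the class index
bull Try to show that $S = S_w + S_b$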
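The identity $S = S_w + S_b$ can be checked numerically before proving it. The following is a minimal sketch in plain Python; the five 2-D samples, their labels, and all helper names (`mean`, `scatter`, `scaled_sum`) are illustrative assumptions, not part of the slides.

```python
# Toy 2-D samples with class labels (illustrative values, not from the slides).
samples = [([1.0, 2.0], 0), ([2.0, 1.0], 0), ([3.0, 3.0], 0),
           ([6.0, 5.0], 1), ([7.0, 8.0], 1)]

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def scatter(vectors, center):
    """Sum over the vectors of (x - center)(x - center)^T."""
    dim = len(center)
    total = [[0.0] * dim for _ in range(dim)]
    for x in vectors:
        d = [a - b for a, b in zip(x, center)]
        for r in range(dim):
            for c in range(dim):
                total[r][c] += d[r] * d[c]
    return total

def scaled_sum(A, B, scale):
    """Return A + scale * B for two matrices of the same shape."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

xs = [x for x, _ in samples]
N, dim = len(xs), len(xs[0])
zero = [[0.0] * dim for _ in range(dim)]
x_bar = mean(xs)

S = scaled_sum(zero, scatter(xs, x_bar), 1.0 / N)      # total variation
S_w, S_b = zero, zero
for j in sorted({label for _, label in samples}):
    members = [x for x, label in samples if label == j]
    x_bar_j = mean(members)
    # scatter(members, x_bar_j) equals N_j * Sigma_j
    S_w = scaled_sum(S_w, scatter(members, x_bar_j), 1.0 / N)
    # scatter([x_bar_j], x_bar) is the outer product of (x_bar_j - x_bar)
    S_b = scaled_sum(S_b, scatter([x_bar_j], x_bar), len(members) / N)

S_sum = scaled_sum(S_w, S_b, 1.0)
print("S         =", S)
print("S_w + S_b =", S_sum)
```

The two printed matrices should agree up to floating-point rounding, which is what the requested proof establishes in general.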
ML-20
PCA Examples: Data Analysis
bull Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
bull Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; the new feature dimensions; a threshold for the information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
bull Example 3: image coding
[Figure: an image with sides labeled 256 and 256, and a block with sides labeled 8 and 8]
ML-23
PCA Examples: Image Coding (2/2)
bull Example 3: image coding (cont.)
[Figure labels: value reduction; feature reduction]
ML-24
PCA Examples: Eigenface (1/4)
bull Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
ndash Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
ndash Steps
bull Convert each of the L reference images into a vector of floating-point numbers representing the light intensity at each pixel
bull Calculate the covariance/correlation matrix between these reference vectors
bull Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
bull In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
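\[ \mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}, \quad \mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix}, \quad \ldots, \quad \mathbf{x}_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix} \]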
ML-25
PCA Examples: Eigenface (2/4)
bull Example 4: Eigenface in face recognition (cont.)
ndash Steps
bull The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
\[ \hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i1}\,\mathbf{e}(1) + w_{i2}\,\mathbf{e}(2) + \cdots + w_{iK}\,\mathbf{e}(K) \;\Rightarrow\; \mathbf{y}_i = [\,w_{i1}, w_{i2}, \ldots, w_{iK}\,] \]
where $\mathbf{y}_i$ is the feature vector of person $i$
ndash The Eigenface approach follows the minimum mean-squared error criterion
ndash Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
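The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three 3-pixel "images" are made up, only one eigenface is kept (K = 1), and the leading eigenvector is found with a simple power iteration rather than a full eigen decomposition. All function names are hypothetical.

```python
import math

# Three hypothetical 3-pixel "images" (illustrative values only).
images = [[9.0, 18.0, 28.0], [10.0, 20.0, 30.0], [11.0, 22.0, 32.0]]

def mean_vector(vectors):
    """Component-wise mean; plays the role of "eigenface 0"."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def covariance(vectors, center):
    """Covariance matrix (1/L) * sum (x - center)(x - center)^T."""
    dim, n = len(center), len(vectors)
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        d = [a - b for a, b in zip(v, center)]
        for r in range(dim):
            for c in range(dim):
                cov[r][c] += d[r] * d[c] / n
    return cov

def leading_eigenvector(matrix, iterations=100):
    """Power iteration: unit vector along the dominant eigenvector."""
    vec = [1.0] * len(matrix)
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

mean_face = mean_vector(images)
eigenface_1 = leading_eigenvector(covariance(images, mean_face))

def encode(image):
    """Weight w_i1: projection of (x_i - mean) onto eigenface 1."""
    return sum(e * (x - m) for e, x, m in zip(eigenface_1, image, mean_face))

def decode(weight):
    """Reconstruction x_hat = mean + w * eigenface 1 (K = 1)."""
    return [m + weight * e for m, e in zip(mean_face, eigenface_1)]

weights = [encode(img) for img in images]
reconstructions = [decode(w) for w in weights]
for img, w, rec in zip(images, weights, reconstructions):
    print(img, "-> weight", w, "-> reconstruction", rec)
```

Because these toy images lie on a single line through their mean, one eigenface reconstructs them exactly; with real face images the reconstruction is only the best approximation in the mean-squared-error sense.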
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
bull Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
ndash Steps
bull Concatenate the relevant parameters for each speaker r to form a huge vector a(r) (a supervector)
bull The parameters used are the SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. The data of Speaker 1 through Speaker R, together with the SI HMM, go through model training to give the Speaker 1 HMM through the Speaker R HMM; the resulting supervectors, each of dimension D = (M·n) × 1, are passed to Principal Component Analysis]
bull Each new speaker S is represented by a point P in K-space
\[ P = \mathbf{e}(0) + w_1\,\mathbf{e}(1) + w_2\,\mathbf{e}(2) + \cdots + w_K\,\mathbf{e}(K) \]
[Label on the equation: SI HMM model]
ML-29
PCA Examples: Eigenvoice (2/3)
bull Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
bull Example 5: Eigenvoice in speaker adaptation (cont.)
ndash Dimension 1 (eigenvoice 1)
bull Correlates with pitch or sex
ndash Dimension 2 (eigenvoice 2)
bull Correlates with amplitude
ndash Dimension 3 (eigenvoice 3)
bull Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
bull Also called
ndash Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
bull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
bull Given a set of sample vectors with labeled (class) information, try to find a linear transform $W$ such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure note: the within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
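LDA Derivations (1/4)
bull Suppose there are $N$ sample vectors $\mathbf{x}_i$ with dimensionality $n$, each of which belongs to one of the $J$ classes: $g(\mathbf{x}_i) = j$, $j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
ndash The sample mean is
\[ \bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \]
ndash The class sample means are
\[ \bar{\mathbf{x}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} \mathbf{x}_i \]
ndash The class sample covariances are
\[ \Sigma_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T \]
ndash The average within-class variation before the transform is
\[ S_w = \frac{1}{N} \sum_j N_j \Sigma_j \]
ndash The average between-class variation before the transform is
\[ S_b = \frac{1}{N} \sum_j N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T \]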
ML-34
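LDA Derivations (2/4)
bull If the transform $W = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
ndash The sample vectors will be
\[ \mathbf{y}_i = W^T \mathbf{x}_i \]
ndash The sample mean will be
\[ \bar{\mathbf{y}} = \frac{1}{N} \sum_{i=1}^{N} W^T \mathbf{x}_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \right) = W^T \bar{\mathbf{x}} \]
ndash The class sample means will be
\[ \bar{\mathbf{y}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} W^T \mathbf{x}_i = W^T \bar{\mathbf{x}}_j \]
ndash The average within-class variation will be
\[ \tilde{S}_w = \frac{1}{N} \sum_j N_j \left\{ \frac{1}{N_j} \sum_{g(\mathbf{x}_i) = j} (W^T \mathbf{x}_i - W^T \bar{\mathbf{x}}_j)(W^T \mathbf{x}_i - W^T \bar{\mathbf{x}}_j)^T \right\} = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W \]
[Figure: the transform $W$ maps $\mathbf{x}$ to $\mathbf{y}$, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]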
ML-35
LDA Derivations (3/4)
bull If the transform $W = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
ndash Similarly, the average between-class variation will be
\[ \tilde{S}_b = W^T S_b W \]
ndash Try to find the optimal $W$ such that the following objective function is maximized
\[ J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|} \]
bull A closed-form solution: the column vectors $\mathbf{w}_i$ of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
\[ S_b \mathbf{w}_i = \lambda_i S_w \mathbf{w}_i \]
bull That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
\[ S_w^{-1} S_b \mathbf{w}_i = \lambda_i \mathbf{w}_i \]
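For two classes the closed-form solution can be checked by hand, because $S_b$ then has rank one and the only useful direction is proportional to $S_w^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$. The sketch below verifies that this direction satisfies $S_w^{-1} S_b \mathbf{w} = \lambda \mathbf{w}$ on made-up 2-D data; the data, the two-class restriction, and the helper names are assumptions for illustration, not the slides' experiment.

```python
# Two hypothetical classes of 2-D samples (illustrative values only).
samples = {0: [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [4.0, 5.0]],
           1: [[4.0, 1.0], [5.0, 2.0], [6.0, 2.0], [7.0, 4.0]]}

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def mat_add(A, B, scale):
    """Return A + scale * B."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def inverse_2x2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

all_x = [x for xs in samples.values() for x in xs]
N = len(all_x)
x_bar = mean(all_x)
class_means = {j: mean(xs) for j, xs in samples.items()}

S_w = [[0.0, 0.0], [0.0, 0.0]]
S_b = [[0.0, 0.0], [0.0, 0.0]]
for j, xs in samples.items():
    for x in xs:
        d = [a - b for a, b in zip(x, class_means[j])]
        S_w = mat_add(S_w, outer(d, d), 1.0 / N)       # (1/N) sum_j N_j Sigma_j
    d = [a - b for a, b in zip(class_means[j], x_bar)]
    S_b = mat_add(S_b, outer(d, d), len(xs) / N)       # (1/N) sum_j N_j (...)(...)^T

S_w_inv = inverse_2x2(S_w)
delta = [a - b for a, b in zip(class_means[0], class_means[1])]
w = mat_vec(S_w_inv, delta)                  # candidate LDA direction
image = mat_vec(S_w_inv, mat_vec(S_b, w))    # S_w^{-1} S_b w
lam = sum(a * b for a, b in zip(image, w)) / sum(a * a for a in w)
print("w =", w)
print("lambda =", lam)
```

If `image` equals `lam * w` up to rounding, then `w` is an eigenvector of $S_w^{-1} S_b$ with eigenvalue `lam`, as the closed-form solution requires. For more than two classes a general eigen solver is needed instead of this shortcut.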
ML-36
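LDA Derivations (4/4)
bull Proof
\[ \hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|} \]
where $|\cdot|$ denotes the determinant
Or, for each column vector $\mathbf{w}_i$ of $W$, we want to find
\[ \hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \frac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i} \]
The quadratic form has its optimal solution where the derivative vanishes. Using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^T C \mathbf{x})}{d\mathbf{x}} = (C + C^T)\mathbf{x}$:
\[ \frac{\partial}{\partial \mathbf{w}_i} \left( \frac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i} \right) = 0 \]
\[ \Rightarrow \frac{2 S_b \mathbf{w}_i (\mathbf{w}_i^T S_w \mathbf{w}_i) - 2 S_w \mathbf{w}_i (\mathbf{w}_i^T S_b \mathbf{w}_i)}{(\mathbf{w}_i^T S_w \mathbf{w}_i)^2} = 0 \]
\[ \Rightarrow S_b \mathbf{w}_i (\mathbf{w}_i^T S_w \mathbf{w}_i) - S_w \mathbf{w}_i (\mathbf{w}_i^T S_b \mathbf{w}_i) = 0 \]
\[ \Rightarrow S_b \mathbf{w}_i - S_w \mathbf{w}_i \frac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i} = 0 \]
\[ \Rightarrow S_b \mathbf{w}_i - \lambda_i S_w \mathbf{w}_i = 0 \qquad \left( \lambda_i = \frac{\mathbf{w}_i^T S_b \mathbf{w}_i}{\mathbf{w}_i^T S_w \mathbf{w}_i} \right) \]
\[ \Rightarrow S_b \mathbf{w}_i = \lambda_i S_w \mathbf{w}_i \;\Rightarrow\; S_w^{-1} S_b \mathbf{w}_i = \lambda_i \mathbf{w}_i \]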
ML-37
LDA Examples: Feature Transformation (1/2)
bull Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
\[ \Sigma_{\mathbf{x}} = \frac{1}{N} \sum_i (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T \]
[Figure: covariance matrix of the 18-cepstral vectors (after the cosine transform), calculated using Year-99's 5471 files]
\[ \Sigma_{\mathbf{y}} = \frac{1}{N} \sum_i (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T \]
ML-38
LDA Examples: Feature Transformation (2/2)
bull Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform), calculated using Year-99's 5471 files]
Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: two panels, labeled PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA: Heteroscedastic Discriminant Analysis
bull The difference in the projections obtained from LDA and HDA for the 2-class case
ndash Clearly, HDA theoretically provides a much lower classification error than LDA
bull However, most statistical modeling approaches assume that the data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
bull Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
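HW-3: Feature Transformation (3/4)
bull Plot Covariance Matrix
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default');
surf(CoVar);
bull Eigen Decomposition
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % between-class matrix
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % within-class matrix
% LDA
IWI = inv(WI);
A = IWI*BE;
% PCA (use this line instead of the two LDA lines above)
A = BE+WI;   % why? (Prove it)
[V,D] = eig(A);      % all eigenvectors, or
[V,D] = eigs(A,3);   % only the 3 leading ones
fid = fopen('Basis','w');
for i = 1:3   % feature vector length
  for j = 1:3   % basis number
    fprintf(fid,'%10.10f ',V(i,j));
  end
  fprintf(fid,'\n');
end
fclose(fid);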
ML-44
HW-3: Feature Transformation (4/4)
bull Examples
[Figure: distribution of 2000 original data samples after the PCA transform; horizontal axis: feature 1, vertical axis: feature 2]
[Figure: distribution of 2000 original data samples after the LDA transform; horizontal axis: feature 1, vertical axis: feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
bull Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
ndash Co-occurring terms are projected onto the same dimensions
ndash In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
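LSA (2/7)
bull Dimension Reduction and Feature Extraction
ndash PCA: an $n$-dimensional vector $X$ in the feature space is projected onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$, giving the $k$-dimensional vector $Y$
\[ y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \|X - \hat{X}\|^2 \ \text{for a given } k \]
ndash SVD (in LSA): the $m \times n$ matrix $A$ is approximated by $A'$ in a $k$-dimensional latent semantic space
\[ A'_{m \times n} = U'_{m \times r}\, \Sigma'_{r \times r}\, V'^T_{r \times n}, \qquad r \le \min(m, n), \qquad \min \|A - A'\|_F^2 \ \text{for a given } k \]
where only the leading $k \times k$ block of $\Sigma'$ is kept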
ML-47
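LSA (3/7)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
[Figure: projection of a vector $\mathbf{x}$ onto the basis vectors $\varphi_1$ and $\varphi_2$, giving the coordinates $y_1$ and $y_2$]
\[ y_1 = \|\mathbf{x}\| \cos\theta_1 = \|\mathbf{x}\| \frac{\varphi_1^T \mathbf{x}}{\|\varphi_1\|\,\|\mathbf{x}\|} = \varphi_1^T \mathbf{x}, \qquad \text{where } \|\varphi_1\| = 1 \]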
ML-48
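LSA (4/7)
bull Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each represented by terms and by a structure model. The terms of the doc and of the query are compared by literal term matching, optionally after doc expansion or query expansion; the two structure models are compared by latent semantic structure retrieval]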
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
[Figure label: an OOV word]
ML-51
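LSA (7/7)
bull Singular Value Decomposition (SVD)
\[ A_{m \times n} = U_{m \times r}\, \Sigma_r\, V^T_{r \times n} \]
\[ A'_{m \times n} = U'_{m \times k}\, \Sigma_k\, V'^T_{k \times n} \]
ndash The rows of $A$ and $A'$ correspond to the words $w_1, w_2, \ldots, w_m$ and the columns to the docs $d_1, d_2, \ldots, d_n$; $\Sigma_r$ is $r \times r$ and $\Sigma_k$ is $k \times k$
ndash Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
ndash $r \le \min(m, n)$ and $k \le r$, so $\|A\|_F^2 \ge \|A'\|_F^2$, where
\[ \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \]
ndash $\mathrm{Row}\,A \subseteq \mathbb{R}^n$ and $\mathrm{Col}\,A \subseteq \mathbb{R}^m$
ndash Docs and queries are represented in a $k$-dimensional space. The axes can be properly weighted according to the associated diagonal values of $\Sigma_k$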
ML-52
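LSA Derivations (1/7)
bull Singular Value Decomposition (SVD)
ndash $A^T A$ is a symmetric $n \times n$ matrix
bull All eigenvalues $\lambda_j$ are nonnegative real numbers
\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \]
bull All eigenvectors $\mathbf{v}_j$ are orthonormal ($\mathbf{v}_j \in \mathbb{R}^n$)
\[ V_{n \times n} = [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_n], \qquad \mathbf{v}_j^T \mathbf{v}_j = 1, \qquad V^T V = I_{n \times n} \]
bull Define the singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$
ndash As the square roots of the eigenvalues of $A^T A$
ndash As the lengths of the vectors $A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_n$
\[ \sigma_1 = \|A\mathbf{v}_1\|, \quad \sigma_2 = \|A\mathbf{v}_2\|, \quad \ldots \]
\[ \|A\mathbf{v}_i\|^2 = \mathbf{v}_i^T A^T A \mathbf{v}_i = \mathbf{v}_i^T \lambda_i \mathbf{v}_i = \lambda_i \;\Rightarrow\; \|A\mathbf{v}_i\| = \sigma_i \]
bull For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of $\mathrm{Col}\,A$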
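The two definitions of the singular values (square roots of the eigenvalues of $A^T A$, and lengths of the vectors $A\mathbf{v}_i$) can be compared on a small matrix. The sketch below uses a hypothetical $2 \times 2$ matrix so that the eigen decomposition of the symmetric matrix $A^T A$ has a closed form; the matrix and the helper names are assumptions for illustration only.

```python
import math

# Hypothetical 2 x 2 matrix (illustrative values only).
A = [[3.0, 0.0],
     [4.0, 5.0]]

def gram(A):
    """Return A^T A."""
    cols = list(zip(*A))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]

def symmetric_eig_2x2(M):
    """Eigenpairs of [[a, b], [b, d]] with b != 0, largest eigenvalue first."""
    a, b, d = M[0][0], M[0][1], M[1][1]
    half_gap = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
    pairs = []
    for lam in ((a + d) / 2.0 + half_gap, (a + d) / 2.0 - half_gap):
        vec = [b, lam - a]                      # solves (a - lam) x + b y = 0
        norm = math.sqrt(vec[0] ** 2 + vec[1] ** 2)
        pairs.append((lam, [vec[0] / norm, vec[1] / norm]))
    return pairs

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

eigenpairs = symmetric_eig_2x2(gram(A))
results = []
for lam, v in eigenpairs:
    sigma = math.sqrt(lam)                                   # definition 1
    length = math.sqrt(sum(x * x for x in mat_vec(A, v)))    # definition 2
    results.append((sigma, length))
    print("sigma =", sigma, " |A v| =", length)

frobenius_sq = sum(a * a for row in A for a in row)
print("||A||_F^2 =", frobenius_sq, " sum of eigenvalues =", sum(l for l, _ in eigenpairs))
```

Both columns of the printout should agree, and the squared Frobenius norm should equal the sum of the eigenvalues of $A^T A$, that is, the sum of the squared singular values.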
ML-53
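LSA Derivations (2/7)
bull $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of $\mathrm{Col}\,A$
\[ A\mathbf{v}_i \cdot A\mathbf{v}_j = (A\mathbf{v}_i)^T A\mathbf{v}_j = \mathbf{v}_i^T A^T A \mathbf{v}_j = \lambda_j \mathbf{v}_i^T \mathbf{v}_j = 0 \qquad (i \ne j) \]
ndash Suppose that $A$ (or $A^T A$) has rank $r \le n$
\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0 \]
ndash Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for $\mathrm{Col}\,A$
\[ \mathbf{u}_i = \frac{A\mathbf{v}_i}{\|A\mathbf{v}_i\|} = \frac{1}{\sigma_i} A\mathbf{v}_i \;\Rightarrow\; \sigma_i \mathbf{u}_i = A\mathbf{v}_i \;\Rightarrow\; [\mathbf{u}_1\ \mathbf{u}_2\ \cdots\ \mathbf{u}_r]\, \Sigma_r = A\, [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_r] \]
Here $[\mathbf{v}_1 \cdots \mathbf{v}_r]$ is an orthonormal matrix ($n \times r$), known in advance, and $[\mathbf{u}_1 \cdots \mathbf{u}_r]$ is also an orthonormal matrix ($m \times r$)
bull Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of $\mathbb{R}^m$
\[ [\mathbf{u}_1 \cdots \mathbf{u}_r \cdots \mathbf{u}_m]\, \Sigma = A\, [\mathbf{v}_1 \cdots \mathbf{v}_r \cdots \mathbf{v}_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T \]
using $V V^T = I_{n \times n}$, with
\[ \Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix} \]
\[ \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2 \]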
ML-54
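LSA Derivations (3/7)
\[ A = U \Sigma V^T = (U_1\ \ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T \]
\[ U \Sigma = A V \]
[Figure: the $m \times n$ matrix $A$ maps $\mathbb{R}^n$ to $\mathbb{R}^m$ through $V^T$ and $U$. The $\mathbf{v}_i$ in $V_1$ span the row space of $A$, the $\mathbf{u}_i$ in $U_1$ span the row space of $A^T$, and $V_2$ corresponds to the solutions of $AX = 0$]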
ML-55
LSA Derivations (4/7)
bull Additional Explanations
ndash Each row of $U$ is related to the projection of the corresponding row of $A$ onto the basis formed by the columns of $V$
\[ A = U \Sigma V^T \;\Rightarrow\; AV = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = AV \]
bull The $i$-th entry of a row of $U$ is related to the projection of the corresponding row of $A$ onto the $i$-th column of $V$
ndash Each row of $V$ is related to the projection of the corresponding row of $A^T$ onto the basis formed by the columns of $U$
\[ A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U \]
bull The $i$-th entry of a row of $V$ is related to the projection of the corresponding row of $A^T$ onto the $i$-th column of $U$
ML-56
LSA Derivations (5/7)
bull Fundamental comparisons based on SVD
ndash The original word-document matrix ($A$, $m \times n$, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$)
bull compare two terms → dot product of two rows of $A$ (or an entry in $AA^T$)
bull compare two docs → dot product of two columns of $A$ (or an entry in $A^T A$)
bull compare a term and a doc → each individual entry of $A$
ndash The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
bull compare two terms → dot product of two rows of $U'\Sigma'$
\[ A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T \]
bull compare two docs → dot product of two rows of $V'\Sigma'$
\[ A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T \]
bull compare a query word and a doc → each individual entry of $A'$
ndash In both products the inner factor ($V'^T V'$ or $U'^T U'$) is the identity $I_{k \times k}$, and $\Sigma'$ acts as a stretching or shrinking of the axes
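The two identities above can be checked on a toy factorization. The sketch below builds a hypothetical 3-term by 2-doc matrix from hand-picked orthonormal factors (so here k = r and $A' = A$), then compares the dot products of rows of $U\Sigma$ and $V\Sigma$ with the entries of $AA^T$ and $A^T A$. All values and helper names are assumptions for illustration.

```python
# Hypothetical factors of a 3-term x 2-doc matrix A = U * diag(sigma) * V^T.
U = [[0.6, -0.8], [0.8, 0.6], [0.0, 0.0]]   # rows correspond to terms
sigma = [2.0, 1.0]                          # singular values
V = [[0.8, -0.6], [0.6, 0.8]]               # rows correspond to docs

def scale_columns(M, sigma):
    """Multiply column c of M by sigma[c], i.e. compute M * diag(sigma)."""
    return [[value * s for value, s in zip(row, sigma)] for row in M]

def gram_of_rows(M):
    """Matrix of dot products between every pair of rows of M."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in M] for ri in M]

US = scale_columns(U, sigma)
VS = scale_columns(V, sigma)

# A[i][j] = sum_c U[i][c] * sigma[c] * V[j][c]
A = [[sum(us * v for us, v in zip(u_row, v_row)) for v_row in V] for u_row in US]
A_T = [list(col) for col in zip(*A)]

term_term = gram_of_rows(US)    # (U Sigma)(U Sigma)^T
doc_doc = gram_of_rows(VS)      # (V Sigma)(V Sigma)^T
print("term-term:", term_term)
print("A A^T    :", gram_of_rows(A))
print("doc-doc  :", doc_doc)
print("A^T A    :", gram_of_rows(A_T))
```

With a truncated rank k < r the same code applies to the truncated factors, and the products then match $A'A'^T$ and $A'^T A'$ rather than the products of the original $A$.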
ML-57
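LSA Derivations (6/7)
bull Fold-in: find representations for pseudo-docs $\mathbf{q}$
ndash For objects (new queries or docs) that did not appear in the original analysis
bull Fold-in a new $m \times 1$ query (or doc) vector
\[ \hat{\mathbf{q}}_{1 \times k} = (\mathbf{q}^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k} \]
bull The query is represented by the weighted sum of its constituent term vectors
bull The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
bull The result $\hat{\mathbf{q}}$ is just like a row of $V$
ndash Cosine measure between the query and doc vectors (row vectors) in the latent semantic space
\[ \mathrm{sim}(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\Sigma, \hat{\mathbf{d}}\Sigma) = \frac{\hat{\mathbf{q}}\, \Sigma^2\, \hat{\mathbf{d}}^T}{|\hat{\mathbf{q}}\Sigma|\,|\hat{\mathbf{d}}\Sigma|} \]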
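The fold-in formula and the cosine measure can be sketched as follows. The factors below are hypothetical hand-picked values for a 3-term by 2-doc collection, and `fold_in` and `similarity` are illustrative names, not functions of any toolkit.

```python
import math

# Hypothetical rank-2 factors of a 3-term x 2-doc matrix A = U_k * diag(sigma_k) * V_k^T.
U_k = [[0.6, -0.8], [0.8, 0.6], [0.0, 0.0]]   # rows: terms
sigma_k = [2.0, 1.0]                          # diagonal of Sigma_k
V_k = [[0.8, -0.6], [0.6, 0.8]]               # row j: latent vector of doc j

def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1} for an m-dimensional term vector q."""
    return [sum(q[t] * U[t][c] for t in range(len(q))) / sigma[c]
            for c in range(len(sigma))]

def similarity(q_hat, d_hat, sigma):
    """Cosine between q_hat * Sigma and d_hat * Sigma."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [a * s for a, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    norm_q = math.sqrt(sum(a * a for a in qs))
    norm_d = math.sqrt(sum(a * a for a in ds))
    return dot / (norm_q * norm_d)

# Sanity check: folding in the term vector of doc 0 should reproduce row 0 of V_k.
doc0_terms = [sum(U_k[t][c] * sigma_k[c] * V_k[0][c] for c in range(2)) for t in range(3)]
doc0_folded = fold_in(doc0_terms, U_k, sigma_k)
print("doc 0 folded in:", doc0_folded, " row 0 of V_k:", V_k[0])

# A new query that uses terms 0 and 1 once each.
q_hat = fold_in([1.0, 1.0, 0.0], U_k, sigma_k)
scores = [similarity(q_hat, d_hat, sigma_k) for d_hat in V_k]
print("query in latent space:", q_hat)
print("similarity to each doc:", scores)
```

Folding in a document that was part of the original analysis gives back its own row of $V$, which is the sense in which a folded-in query is "just like a row of V".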
ML-58
LSA Derivations (7/7)
bull Fold-in a new $1 \times n$ term vector
\[ \hat{\mathbf{t}}_{1 \times k} = \mathbf{t}_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k} \]
ML-59
LSA Example
bull Experimental results
ndash HMM is consistently better than VSM at all recall levels
ndash LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
bull Advantages
ndash A clean formal framework and a clearly defined optimization criterion (least-squares)
bull Conceptual simplicity and clarity
ndash Handles synonymy problems ("heterogeneous vocabulary")
ndash Good results for high-recall search
bull Takes term co-occurrence into account
bull Disadvantages
ndash High computational complexity
ndash LSA offers only a partial solution to polysemy
bull E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
bull Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
bull Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
bull Given a sparse term-doc matrix
ndash E.g., 4 terms and 3 docs (rows: terms, columns: docs)
2.3 0.0 4.2
0.0 1.3 2.2
3.8 0.0 0.5
0.0 0.0 0.0
ndash Each entry is weighted by its TFxIDF score
ndash The same matrix in the sparse input format
4 3 6     (4 rows/terms, 3 columns/docs, 6 nonzero entries)
2         (2 nonzero entries at Col 0)
0 2.3     (Col 0, Row 0)
2 3.8     (Col 0, Row 2)
1         (1 nonzero entry at Col 1)
1 1.3     (Col 1, Row 1)
3         (3 nonzero entries at Col 2)
0 4.2     (Col 2, Row 0)
1 2.2     (Col 2, Row 1)
2 0.5     (Col 2, Row 2)
bull Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, ..., 600, etc.) of LSA dimensionality
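The column-by-column layout shown in this example can be produced with a few lines of Python. This is a sketch of the layout as inferred from the slide's 4 x 3 example only (a header line, then for each column its nonzero count followed by "row value" pairs); `to_sparse_text` is a hypothetical helper, not part of SVDLIBC.

```python
def to_sparse_text(dense):
    """Serialize a dense term-doc matrix in the column-wise sparse text layout."""
    rows, cols = len(dense), len(dense[0])
    columns = []
    for c in range(cols):
        # Keep (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, dense[r][c]) for r in range(rows) if dense[r][c] != 0.0]
        columns.append(entries)
    nonzeros = sum(len(entries) for entries in columns)
    lines = ["%d %d %d" % (rows, cols, nonzeros)]    # header: rows cols nonzeros
    for entries in columns:
        lines.append(str(len(entries)))              # nonzero count of this column
        lines.extend("%d %s" % (r, value) for r, value in entries)
    return "\n".join(lines)

# The 4-term x 3-doc example matrix from the slide.
dense = [[2.3, 0.0, 4.2],
         [0.0, 1.3, 2.2],
         [3.8, 0.0, 0.5],
         [0.0, 0.0, 0.0]]

text = to_sparse_text(dense)
print(text)
```

For the example matrix, the printed lines reproduce the ten-line sparse listing on the slide, without the parenthetical annotations.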
ML-63
LSA Toolkit: SVDLIBC (3/5)
bull Example term-doc matrix (sparse input format; the first line gives the number of indexing terms, the number of docs, and the number of nonzero entries)
51253 2265 218852
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
......
bull SVD command (IR_svd.bat)
svd -r st -o LSA100 -d 100 Term-Doc-Matrix
ndash -r st: sparse matrix input
ndash -o LSA100: prefix of the output files
ndash -d 100: number of reserved eigenvectors
ndash Term-Doc-Matrix: name of the sparse matrix input
bull Output files: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
bull LSA100-Ut (51253 words; each word vector $\mathbf{u}^T$ is $1 \times 100$)
100 51253
0.003 0.001 ......
0.002 0.002 ......
bull LSA100-S (100 eigenvalues)
100
26861882994155959...
bull LSA100-Vt (2265 docs; each doc vector $\mathbf{v}^T$ is $1 \times 100$)
100 2265
0.021 0.035 ......
0.012 0.022 ......
ML-65
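LSA Toolkit: SVDLIBC (5/5)
bull Fold-in a new $m \times 1$ query vector (TFxIDF weighted beforehand)
\[ \hat{\mathbf{q}}_{1 \times k} = (\mathbf{q}^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k} \]
ndash The query is represented by the weighted sum of its constituent term vectors
ndash The separate dimensions are differentially weighted
ndash The result $\hat{\mathbf{q}}$ is just like a row of $V$
bull Cosine measure between the query and doc vectors in the latent semantic space
\[ \mathrm{sim}(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\Sigma, \hat{\mathbf{d}}\Sigma) = \frac{\hat{\mathbf{q}}\, \Sigma^2\, \hat{\mathbf{d}}^T}{|\hat{\mathbf{q}}\Sigma|\,|\hat{\mathbf{d}}\Sigma|} \]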
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-20
PCA Examples: Data Analysis
• Example 1: principal components of some data points
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; the new feature dimensions; threshold for the information content reserved]
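The selection step in Example 2 can be sketched as follows, assuming the eigenvalues of the correlation matrix are already available: keep the smallest number of new feature dimensions whose eigenvalues reserve a chosen fraction of the total variance. The function name `choose_num_components`, the sample eigenvalues, and the 0.85 threshold are illustrative assumptions, not values from the slide.

```python
def choose_num_components(eigenvalues, threshold):
    """Return the smallest k such that the k largest eigenvalues
    account for at least `threshold` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(ordered)


# Hypothetical eigenvalues of a 5 x 5 correlation matrix.
eigenvalues = [2.5, 1.5, 0.5, 0.3, 0.2]
k = choose_num_components(eigenvalues, threshold=0.85)
print("number of new feature dimensions kept:", k)
```

With these made-up eigenvalues the first two components reserve about 80% of the variance and the first three about 90%, so three dimensions are kept.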
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: a 256 × 256 image divided into 8 × 8 blocks]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure labels: value reduction; feature reduction]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  – Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  – Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA): find the eigenvectors of the matrix, the eigenfaces
    • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
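$$\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix},\quad \mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix},\quad \ldots,\quad \mathbf{x}_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}$$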
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  – Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
      $$\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i,1}\mathbf{e}_1 + w_{i,2}\mathbf{e}_2 + \cdots + w_{i,K}\mathbf{e}_K \;\Rightarrow\; \mathbf{y}_i = [\,w_{i,1},\ w_{i,2},\ \ldots,\ w_{i,K}\,]$$
      where $\mathbf{y}_i$ is the feature vector of a person i
      – The Eigenface approach adheres to the minimum mean-squared error criterion
      – Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
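The eigenface steps can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the procedure of Turk and Pentland: the "images" are random 16-pixel vectors, the names `mean_face`, `eigenfaces` and `weights` are hypothetical, and the eigenvectors of the covariance matrix are obtained through an SVD of the centered data rather than by forming the covariance matrix explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: L = 6 reference "images" of n = 16 pixels each,
# stored as the columns of an n x L matrix.
n_pixels, n_images, K = 16, 6, 3
X = rng.random((n_pixels, n_images))

# "Eigenface 0": the average of all reference images.
mean_face = X.mean(axis=1, keepdims=True)
centered = X - mean_face

# The left singular vectors of the centered data are the eigenvectors of
# its covariance matrix, ordered by decreasing eigenvalue.
U, s, _ = np.linalg.svd(centered, full_matrices=False)
eigenfaces = U[:, :K]                      # e_1 ... e_K

# Feature vector y_i = [w_i1, ..., w_iK] of every reference image.
weights = eigenfaces.T @ centered          # K x L

# Reconstruction: eigenface 0 plus a weighted sum of the K eigenfaces.
X_hat = mean_face + eigenfaces @ weights

mse = np.mean((X - X_hat) ** 2)
print("feature vector of image 0:", weights[:, 0])
print("mean-squared reconstruction error:", mse)
```

The reconstruction error equals the energy in the discarded singular values, which is why keeping the leading eigenfaces gives the smallest mean-squared error among K-dimensional linear representations.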
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  – Steps
    • Concatenate the relevant parameters of each speaker r to form a huge vector a(r) (a supervector)
      • E.g., the SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Starting from the SI HMM model, model training on Speaker 1 data, ..., Speaker R data yields Speaker 1 HMM, ..., Speaker R HMM; their supervectors, each of dimension D = (M·n) × 1, are passed to Principal Component Analysis]
Each new speaker S is represented by a point P in K-space:
$$\mathbf{P}_i = \mathbf{e}(0) + w_i(1)\,\mathbf{e}(1) + w_i(2)\,\mathbf{e}(2) + \cdots + w_i(K)\,\mathbf{e}(K)$$
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation over the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
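LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, and each of them belongs to one of the J classes: $g(\mathbf{x}_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is
    $$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$$
  – The class sample means are
    $$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\mathbf{x}_i$$
  – The class sample covariances are
    $$\boldsymbol{\Sigma}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T$$
  – The average within-class variation before the transform is
    $$\mathbf{S}_w = \frac{1}{N}\sum_{j} N_j\,\boldsymbol{\Sigma}_j$$
  – The average between-class variation before the transform is
    $$\mathbf{S}_b = \frac{1}{N}\sum_{j} N_j\,(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T$$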
ML-34
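LDA Derivations (2/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  – The sample vectors will be
    $$\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i$$
  – The sample mean will be
    $$\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{W}^T\mathbf{x}_i = \mathbf{W}^T\!\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\right) = \mathbf{W}^T\bar{\mathbf{x}}$$
  – The class sample means will be
    $$\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\mathbf{W}^T\mathbf{x}_i = \mathbf{W}^T\bar{\mathbf{x}}_j$$
  – The average within-class variation will be
    $$\tilde{\mathbf{S}}_w = \frac{1}{N}\sum_{j} N_j\left\{\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\left(\mathbf{W}^T\mathbf{x}_i-\mathbf{W}^T\bar{\mathbf{x}}_j\right)\left(\mathbf{W}^T\mathbf{x}_i-\mathbf{W}^T\bar{\mathbf{x}}_j\right)^T\right\} = \mathbf{W}^T\left\{\frac{1}{N}\sum_{j} N_j\,\boldsymbol{\Sigma}_j\right\}\mathbf{W} = \mathbf{W}^T\mathbf{S}_w\mathbf{W}$$
[Diagram: the transform W maps x to y, $\mathbf{S}_w$ to $\tilde{\mathbf{S}}_w$, and $\mathbf{S}_b$ to $\tilde{\mathbf{S}}_b$]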
ML-35
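LDA Derivations (3/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  – Similarly, the average between-class variation will be
    $$\tilde{\mathbf{S}}_b = \mathbf{W}^T\mathbf{S}_b\mathbf{W}$$
  – Try to find the optimal $\mathbf{W}$ such that the following objective function is maximized
    $$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \frac{|\mathbf{W}^T\mathbf{S}_b\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_w\mathbf{W}|}$$
    • A closed-form solution: the column vectors of an optimal matrix $\mathbf{W}$ are the generalized eigenvectors corresponding to the largest eigenvalues in
      $$\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{S}_w\mathbf{w}_i$$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$
      $$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$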
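The closed-form solution can be sketched with NumPy as follows. The three-class, two-dimensional data set is synthetic, and every variable name is an assumption made for illustration; the code only follows the definitions of $\mathbf{S}_w$ and $\mathbf{S}_b$ given in the derivation and then takes the eigenvectors of $\mathbf{S}_w^{-1}\mathbf{S}_b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: J = 3 classes of 2-dimensional samples, 50 per class.
class_means = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])
X = np.vstack([m + rng.normal(size=(50, 2)) for m in class_means])
labels = np.repeat(np.arange(3), 50)

N, n = X.shape
overall_mean = X.mean(axis=0)

S_w = np.zeros((n, n))   # average within-class variation
S_b = np.zeros((n, n))   # average between-class variation
for j in np.unique(labels):
    X_j = X[labels == j]
    N_j = len(X_j)
    mean_j = X_j.mean(axis=0)
    diff = X_j - mean_j
    S_w += diff.T @ diff / N             # (N_j / N) * Sigma_j
    d = (mean_j - overall_mean)[:, None]
    S_b += N_j * (d @ d.T) / N

# Eigenvectors of S_w^{-1} S_b, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_w) @ S_b)
order = np.argsort(eigvals.real)[::-1]
lambdas = eigvals.real[order]
W = eigvecs.real[:, order]

Y = X @ W   # row i holds (W^T x_i)^T
print("eigenvalues:", lambdas)
```

For larger or ill-conditioned problems a dedicated generalized symmetric eigen-solver is usually preferable to forming the inverse of $\mathbf{S}_w$ explicitly; the explicit inverse is kept here only to mirror the slide.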
ML-36
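LDA Derivations (4/4)
• Proof
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T\mathbf{S}_b\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_w\mathbf{W}|} \qquad (|\cdot| \text{ denotes the determinant})$$
Or, for each column vector $\mathbf{w}_i$ of $\mathbf{W}$, we want to find the $\mathbf{w}_i$ for which the quadratic form
$$\lambda_i = \frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i}$$
has its optimal solution. Because $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^T\mathbf{C}\mathbf{x})}{d\mathbf{x}} = (\mathbf{C}+\mathbf{C}^T)\mathbf{x}$,
$$\frac{\partial\lambda_i}{\partial\mathbf{w}_i} = 0 \;\Rightarrow\; \frac{2\mathbf{S}_b\mathbf{w}_i\,(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i) - 2\mathbf{S}_w\mathbf{w}_i\,(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i)}{(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i)^2} = 0$$
$$\Rightarrow\; \mathbf{S}_b\mathbf{w}_i\,(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i) - \mathbf{S}_w\mathbf{w}_i\,(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i) = 0$$
$$\Rightarrow\; \mathbf{S}_b\mathbf{w}_i - \lambda_i\mathbf{S}_w\mathbf{w}_i = 0 \qquad \left(\lambda_i = \frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i}\right)$$
$$\Rightarrow\; \mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{S}_w\mathbf{w}_i \;\Rightarrow\; \mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$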
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}_{\mathbf{x}} = \frac{1}{N}\sum_{i}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T$$
[Figure: covariance matrix of the 18-cepstral vectors (after the cosine transform), calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}'_{\mathbf{y}} = \frac{1}{N}\sum_{i}(\mathbf{y}_i-\bar{\mathbf{y}})(\mathbf{y}_i-\bar{\mathbf{y}})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform), calculated using Year-99's 5471 files]
Character Error Rate
            TC       WG
  MFCC      26.32    22.71
  LDA-1     23.12    20.17
  LDA-2     23.11    20.11
ML-39
PCA vs. LDA (1/2)
[Figure panels: PCA; LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar=[3.0 0.5 0.4;
           0.9 6.3 0.2;
           0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE=[3.0 3.5 1.4;
        1.9 6.3 2.2;
        2.4 0.4 4.2];
    WI=[4.0 4.1 2.1;
        2.9 8.7 3.5;
        4.4 3.2 4.3];
    % LDA
    IWI=inv(WI);
    A=IWI*BE;
    % PCA
    A=BE+WI;            % why? (Prove it)
    [V,D]=eig(A);
    [V,D]=eigs(A,3);
    fid=fopen('Basis','w');
    for i=1:3           % feature vector length
      for j=1:3         % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of 2000 original data samples after the PCA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of 2000 original data samples after the LDA transformation; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
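LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA (feature space, with orthonormal basis $\varphi_1, \ldots, \varphi_k$)
    $$y_i = \varphi_i^T\mathbf{X}, \qquad \hat{\mathbf{X}} = \sum_{i=1}^{k} y_i\,\varphi_i, \qquad \min \|\mathbf{X}-\hat{\mathbf{X}}\|^2 \text{ for a given } k$$
  – SVD (in LSA) (latent semantic space)
    $$\mathbf{A}_{m\times n} \approx \mathbf{A}'_{m\times n} = \mathbf{U}'_{m\times r}\,\boldsymbol{\Sigma}'_{r\times r}\,\mathbf{V}'^{T}_{r\times n}, \qquad r \le \min(m,n), \qquad \min \|\mathbf{A}-\mathbf{A}'\|_F^2 \text{ for a given } k$$
    where only the leading $k \times k$ block of $\boldsymbol{\Sigma}'$ is kept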
ML-47
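LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: projection of a vector x onto the axes $\varphi_1$ and $\varphi_2$, giving the coordinates $y_1$ and $y_2$]
$$y_1 = \|\mathbf{x}\|\cos\theta_1 = \|\mathbf{x}\|\,\frac{\varphi_1^T\mathbf{x}}{\|\varphi_1\|\,\|\mathbf{x}\|} = \varphi_1^T\mathbf{x}, \qquad \text{where } \|\varphi_1\| = 1$$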
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure labels: Doc → terms → structure model; Query → terms → structure model; doc expansion; query expansion; literal term matching; latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
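LSA (7/7)
• Singular Value Decomposition (SVD)
  – Rows of A are the words $w_1, w_2, \ldots, w_m$; columns are the docs $d_1, d_2, \ldots, d_n$
    $$\mathbf{A}_{m\times n} = \mathbf{U}_{m\times r}\,\boldsymbol{\Sigma}_{r\times r}\,\mathbf{V}^T_{r\times n}, \qquad r \le \min(m,n)$$
    $$\mathbf{A}'_{m\times n} = \mathbf{U}'_{m\times k}\,\boldsymbol{\Sigma}_{k\times k}\,\mathbf{V}'^{T}_{k\times n}, \qquad k \le r, \qquad \|\mathbf{A}\|_F^2 \ge \|\mathbf{A}'\|_F^2$$
  – Both U and V have orthonormal column vectors: $\mathbf{U}^T\mathbf{U} = \mathbf{I}_{r\times r}$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}_{r\times r}$
  – $\|\mathbf{A}\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$
  – Row A $\in R^n$, Col A $\in R^m$
  – Docs and queries are represented in a k-dimensional space. The quantities on the axes can be properly weighted according to the associated diagonal values of $\boldsymbol{\Sigma}_k$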
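The relations on this slide can be checked numerically with a short NumPy sketch. The 5-word × 4-doc count matrix and the choice k = 2 are made up for illustration; `numpy.linalg.svd` returns the singular values in decreasing order, so truncation is a matter of slicing.

```python
import numpy as np


def fro2(M):
    """Squared Frobenius norm: the sum of all squared entries."""
    return float(np.sum(M ** 2))


# Hypothetical word-document matrix A (m = 5 words, n = 4 docs).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
A_k = U_k @ S_k @ Vt_k            # rank-k approximation A'

print("||A||_F^2      =", fro2(A))
print("sum sigma_i^2  =", float(np.sum(s ** 2)))
print("||A'||_F^2     =", fro2(A_k))
print("||A - A'||_F^2 =", fro2(A - A_k))
```

The squared Frobenius norm of A equals the sum of all squared singular values, and the approximation error equals the sum of the discarded ones, which is the least-squares sense in which A' is the best rank-k approximation.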
ML-52
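LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $\mathbf{A}^T\mathbf{A}$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
      $$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \boldsymbol{\Sigma}^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$
    • All eigenvectors $\mathbf{v}_j$ are orthonormal ($\in R^n$)
      $$\mathbf{V}_{n\times n} = [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_n], \qquad \mathbf{v}_j^T\mathbf{v}_j = 1, \qquad \mathbf{V}^T\mathbf{V} = \mathbf{I}_{n\times n}$$
• Define the singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$
  – As the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$
  – As the lengths of the vectors $\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_n$
    $$\sigma_1 = \|\mathbf{A}\mathbf{v}_1\|, \quad \sigma_2 = \|\mathbf{A}\mathbf{v}_2\|, \quad \ldots$$
    $$\|\mathbf{A}\mathbf{v}_i\|^2 = \mathbf{v}_i^T\mathbf{A}^T\mathbf{A}\mathbf{v}_i = \lambda_i\,\mathbf{v}_i^T\mathbf{v}_i = \lambda_i \;\Rightarrow\; \|\mathbf{A}\mathbf{v}_i\| = \sigma_i$$
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_r\}$ is an orthogonal basis of Col A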
ML-53
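LSA Derivations (2/7)
• $\{\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_r\}$ is an orthogonal basis of Col A
  $$\mathbf{A}\mathbf{v}_i \bullet \mathbf{A}\mathbf{v}_j = (\mathbf{A}\mathbf{v}_i)^T\mathbf{A}\mathbf{v}_j = \mathbf{v}_i^T\mathbf{A}^T\mathbf{A}\mathbf{v}_j = \lambda_j\,\mathbf{v}_i^T\mathbf{v}_j = 0 \qquad (i \ne j)$$
  – Suppose that A (or $\mathbf{A}^T\mathbf{A}$) has rank r ≤ n
    $$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
  – Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A
    $$\mathbf{u}_i = \frac{\mathbf{A}\mathbf{v}_i}{\|\mathbf{A}\mathbf{v}_i\|} = \frac{1}{\sigma_i}\mathbf{A}\mathbf{v}_i \;\Rightarrow\; \sigma_i\mathbf{u}_i = \mathbf{A}\mathbf{v}_i \;\Rightarrow\; [\mathbf{u}_1\ \mathbf{u}_2\ \cdots\ \mathbf{u}_r]\,\boldsymbol{\Sigma}_r = \mathbf{A}\,[\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_r]$$
    Here $[\mathbf{v}_1 \cdots \mathbf{v}_r]$ is an orthonormal matrix (n×r) that is known in advance, and $[\mathbf{u}_1 \cdots \mathbf{u}_r]$ is also an orthonormal matrix (m×r)
    • Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of $R^m$
      $$[\mathbf{u}_1 \cdots \mathbf{u}_r \cdots \mathbf{u}_m]\,\boldsymbol{\Sigma} = \mathbf{A}\,[\mathbf{v}_1 \cdots \mathbf{v}_r \cdots \mathbf{v}_n] \;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V} \;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T = \mathbf{A}\mathbf{V}\mathbf{V}^T \;\Rightarrow\; \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \qquad (\mathbf{V}\mathbf{V}^T = \mathbf{I}_{n\times n})$$
      $$\boldsymbol{\Sigma}_{m\times n} = \begin{pmatrix} \boldsymbol{\Sigma}_{r\times r} & \mathbf{0}_{r\times(n-r)} \\ \mathbf{0}_{(m-r)\times r} & \mathbf{0}_{(m-r)\times(n-r)} \end{pmatrix}$$
  $$\|\mathbf{A}\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2, \qquad \text{where } \|\mathbf{A}\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$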
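The construction of the SVD from the eigen-decomposition of $\mathbf{A}^T\mathbf{A}$ can be traced step by step in NumPy. This is a sketch under assumptions: the 4 × 3 matrix is made up and has full column rank (so every $\sigma_i$ is nonzero and the division below is safe), and `numpy.linalg.eigh` is used because $\mathbf{A}^T\mathbf{A}$ is symmetric.

```python
import numpy as np

# Hypothetical matrix A with m = 4, n = 3 and rank 3.
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])

# Eigen-decomposition of the symmetric matrix A^T A. eigh returns the
# eigenvalues in ascending order, so reverse them to get
# lambda_1 >= lambda_2 >= ... >= lambda_n.
lam, V = np.linalg.eigh(A.T @ A)
lam, V = lam[::-1], V[:, ::-1]

sigma = np.sqrt(np.clip(lam, 0.0, None))   # sigma_j = sqrt(lambda_j)
lengths = np.linalg.norm(A @ V, axis=0)    # ||A v_j|| for every j

# u_i = A v_i / sigma_i gives an orthonormal basis of Col A.
U = (A @ V) / sigma
A_rebuilt = U @ np.diag(sigma) @ V.T       # U Sigma V^T

print("singular values:", sigma)
print("lengths of A v_j:", lengths)
```

The singular values obtained this way agree with those returned by `numpy.linalg.svd`, and the product U Σ Vᵀ reproduces A.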
ML-54
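LSA Derivations (3/7)
[Figure: A (m×n) maps $R^n$ to $R^m$. In $R^n$, $\mathbf{V}_1$ (the $\mathbf{v}_i$'s) spans the row space of A and $\mathbf{V}_2$ spans the solutions of $\mathbf{A}\mathbf{X} = \mathbf{0}$; in $R^m$, $\mathbf{U}_1$ (the $\mathbf{u}_i$'s) spans the row space of $\mathbf{A}^T$, with $\mathbf{U}_2$ as its complement]
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T = (\mathbf{U}_1\ \mathbf{U}_2)\begin{pmatrix} \boldsymbol{\Sigma}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix}\begin{pmatrix} \mathbf{V}_1^T \\ \mathbf{V}_2^T \end{pmatrix} = \mathbf{U}_1\boldsymbol{\Sigma}_1\mathbf{V}_1^T = \mathbf{A}\mathbf{V}_1\mathbf{V}_1^T$$
$$\mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V}$$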
ML-55
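LSA Derivations (4/7)
• Additional Explanations
  – Each row of $\mathbf{U}$ is related to the projection of a corresponding row of $\mathbf{A}$ onto the basis formed by the columns of $\mathbf{V}$
    $$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \;\Rightarrow\; \mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma} \;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V}$$
    • The i-th entry of a row of $\mathbf{U}$ is related to the projection of a corresponding row of $\mathbf{A}$ onto the i-th column of $\mathbf{V}$
  – Each row of $\mathbf{V}$ is related to the projection of a corresponding row of $\mathbf{A}^T$ onto the basis formed by the columns of $\mathbf{U}$
    $$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T \;\Rightarrow\; \mathbf{A}^T\mathbf{U} = (\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T)^T\mathbf{U} = \mathbf{V}\boldsymbol{\Sigma}^T\mathbf{U}^T\mathbf{U} = \mathbf{V}\boldsymbol{\Sigma} \;\Rightarrow\; \mathbf{V}\boldsymbol{\Sigma} = \mathbf{A}^T\mathbf{U}$$
    • The i-th entry of a row of $\mathbf{V}$ is related to the projection of a corresponding row of $\mathbf{A}^T$ onto the i-th column of $\mathbf{U}$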
ML-56
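LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A), m×n, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$
    • compare two terms → dot product of two rows of A
      – or an entry in $\mathbf{A}\mathbf{A}^T$
    • compare two docs → dot product of two columns of A
      – or an entry in $\mathbf{A}^T\mathbf{A}$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A'), with $\mathbf{U}' = \mathbf{U}_{m\times k}$, $\boldsymbol{\Sigma}' = \boldsymbol{\Sigma}_k$, $\mathbf{V}' = \mathbf{V}_{n\times k}$
    • compare two terms → dot product of two rows of $\mathbf{U}'\boldsymbol{\Sigma}'$
      $$\mathbf{A}'\mathbf{A}'^T = (\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)(\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)^T = \mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T\mathbf{V}'\boldsymbol{\Sigma}'^T\mathbf{U}'^T = (\mathbf{U}'\boldsymbol{\Sigma}')(\mathbf{U}'\boldsymbol{\Sigma}')^T$$
    • compare two docs → dot product of two rows of $\mathbf{V}'\boldsymbol{\Sigma}'$
      $$\mathbf{A}'^T\mathbf{A}' = (\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)^T(\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T) = \mathbf{V}'\boldsymbol{\Sigma}'^T\mathbf{U}'^T\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T = (\mathbf{V}'\boldsymbol{\Sigma}')(\mathbf{V}'\boldsymbol{\Sigma}')^T$$
    • compare a query word and a doc → each individual entry of A'
  – In both products the inner factor ($\mathbf{V}'^T\mathbf{V}'$ or $\mathbf{U}'^T\mathbf{U}'$) is the identity $\mathbf{I}_{k\times k}$, and $\boldsymbol{\Sigma}'$ is for stretching or shrinking
[Figure labels: rows $w_i$, $w_j$; columns $d_k$, $d_s$]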
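The two identities for the reduced matrix can be verified numerically. The sketch below assumes a made-up 5-word × 4-doc matrix and k = 2; `term_vectors` and `doc_vectors` are hypothetical names for the rows of U'Σ' and V'Σ'.

```python
import numpy as np

# Hypothetical word-document matrix A (m = 5 words, n = 4 docs).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = U_k @ S_k @ V_k.T                    # new word-document matrix A'

term_vectors = U_k @ S_k                   # one row per word
doc_vectors = V_k @ S_k                    # one row per doc

term_sim = term_vectors @ term_vectors.T   # compare two terms
doc_sim = doc_vectors @ doc_vectors.T      # compare two docs

print("term-term comparisons match A'A'^T:", np.allclose(term_sim, A_k @ A_k.T))
print("doc-doc comparisons match A'^T A':", np.allclose(doc_sim, A_k.T @ A_k))
```

Working with the k-dimensional rows of U'Σ' and V'Σ' avoids ever forming the full m×m or n×n product matrices.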
ML-57
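LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
      $$\hat{\mathbf{q}}_{1\times k} = (\mathbf{q}^T)_{1\times m}\,\mathbf{U}_{m\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}$$
      – The query is represented by the weighted sum of its constituent term vectors
      – The separate dimensions are differentially weighted
      – $\hat{\mathbf{q}}$ is just like a row of V
  – Cosine measure between the query and doc vectors in the latent semantic space ($\hat{\mathbf{q}}$ and $\hat{\mathbf{d}}$ are row vectors)
    $$sim(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = cosine(\hat{\mathbf{q}}\boldsymbol{\Sigma}, \hat{\mathbf{d}}\boldsymbol{\Sigma}) = \frac{\hat{\mathbf{q}}\,\boldsymbol{\Sigma}^2\,\hat{\mathbf{d}}^T}{|\hat{\mathbf{q}}\boldsymbol{\Sigma}|\,|\hat{\mathbf{d}}\boldsymbol{\Sigma}|}$$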
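Fold-in and the cosine measure can be sketched as follows. The matrix, the query, and the function names `fold_in` and `sim` are assumptions for illustration. A useful sanity check is that folding in a document that was part of the original analysis recovers its row of V.

```python
import numpy as np

# Hypothetical word-document matrix A (m = 5 words, n = 4 docs).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T


def fold_in(q, U_k, S_k):
    """Map an m-dimensional term vector q to a 1 x k row: q^T U_k S_k^{-1}."""
    return q @ U_k @ np.linalg.inv(S_k)


def sim(q_hat, d_hat, S_k):
    """Cosine between the weighted vectors q_hat S_k and d_hat S_k."""
    a, b = q_hat @ S_k, d_hat @ S_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Folding in an existing document (column 0 of A) recovers row 0 of V_k.
d0_hat = fold_in(A[:, 0], U_k, S_k)

# A hypothetical query that contains words 0 and 1.
q = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
q_hat = fold_in(q, U_k, S_k)
scores = [sim(q_hat, V_k[j], S_k) for j in range(A.shape[1])]
print("similarity of the query to each doc:", scores)
```

Ranking the docs by these scores gives the retrieval order in the latent semantic space.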
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $$\hat{\mathbf{t}}_{1\times k} = \mathbf{t}_{1\times n}\,\mathbf{V}_{n\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}$$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (rows: terms; columns: docs)
        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0
  – The corresponding sparse input file
        4 3 6       (no. of rows (terms), no. of columns (docs), no. of nonzero entries)
        2           (2 nonzero entries at Col 0)
        0 2.3       (Col 0, Row 0)
        2 3.8       (Col 0, Row 2)
        1           (1 nonzero entry at Col 1)
        1 1.3       (Col 1, Row 1)
        3           (3 nonzero entries at Col 2)
        0 4.2       (Col 2, Row 0)
        1 2.2       (Col 2, Row 1)
        2 0.5       (Col 2, Row 2)
  – Each entry is weighted by its TF×IDF score
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
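The column-by-column layout shown on this slide can be produced from a dense matrix with a few lines of Python. The function name `to_sparse_text` is hypothetical, and the decimal values of the 4-term × 3-doc example are an assumed reading of the slide; the sketch only illustrates the layout (header line, then per column a nonzero count followed by "row value" pairs).

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (a list of rows) column by column:
    a header 'rows cols nonzeros', then for each column its nonzero
    count followed by one 'row value' line per nonzero entry."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    columns = []
    for c in range(n_cols):
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines)


# The 4-term x 3-doc example (rows: terms, columns: docs).
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
print(to_sparse_text(term_doc))
```

Row and column indices are zero-based, matching the "Col 0, Row 0" annotations on the slide.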
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; word vector $\mathbf{u}^T$: 1×100)
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
• LSA100-S (100 eigenvalues)
        100
        2686.18
        829.941
        559.59
        …
• LSA100-Vt (2265 docs; doc vector $\mathbf{v}^T$: 1×100)
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
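The output files can be read back with a small parser. The sketch assumes the layout suggested by the slide: a header "rows cols" followed by the matrix values for the -Ut and -Vt files, and a count followed by one value per line for the -S file. The function names and the tiny in-memory stand-ins for the files are hypothetical.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows * cols whitespace-separated values."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(t) for t in tokens[1:]]
    if len(values) != count:
        raise ValueError("value count does not match the header")
    return values


# Tiny stand-ins for LSA100-Ut (2 dimensions x 3 words) and LSA100-S.
ut_text = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.001\n"
s_text = "2\n2686.18\n829.941\n"

Ut = parse_dense_text(ut_text)
S = parse_singular_values(s_text)

# Column j of Ut is the k-dimensional vector of word j.
word_vectors = [list(col) for col in zip(*Ut)]
print(word_vectors[0], S)
```

Transposing the -Ut and -Vt matrices in this way gives one k-dimensional row per word or doc, which is the form needed for fold-in and cosine scoring.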
ML-65
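LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
  $$\hat{\mathbf{q}}_{1\times k} = (\mathbf{q}^T)_{1\times m}\,\mathbf{U}_{m\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{\mathbf{q}}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
  $$sim(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = cosine(\hat{\mathbf{q}}\boldsymbol{\Sigma}, \hat{\mathbf{d}}\boldsymbol{\Sigma}) = \frac{\hat{\mathbf{q}}\,\boldsymbol{\Sigma}^2\,\hat{\mathbf{d}}^T}{|\hat{\mathbf{q}}\boldsymbol{\Sigma}|\,|\hat{\mathbf{d}}\boldsymbol{\Sigma}|}$$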
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-21
PCA Examples: Feature Transformation
• Example 2: feature transformation and selection
[Figure: correlation matrix for the old feature dimensions; the new feature dimensions; threshold for information content reserved]
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: an image of size 256 x 256 with blocks of size 8 x 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure labels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  - Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  - Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
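$$
\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix},\quad
\mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix},\quad \ldots,\quad
\mathbf{x}_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}
$$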
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  - Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$$
\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i,1}\,\mathbf{e}(1) + w_{i,2}\,\mathbf{e}(2) + \cdots + w_{i,K}\,\mathbf{e}(K)
\;\Rightarrow\;
\mathbf{y}_i = [\, w_{i,1} \;\; w_{i,2} \;\; \cdots \;\; w_{i,K} \,]
$$
      where $\bar{\mathbf{x}}$ is the averaged face (eigenface 0) and $\mathbf{y}_i$ is the feature vector of person i
  - The eigenface approach adheres to the minimum mean-squared error criterion
  - Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
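The representation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 4-pixel "images", the mean face, and the two orthonormal eigenfaces are made-up toy values, and the helper names `project` and `reconstruct` are hypothetical rather than part of the original slides.

```python
def project(x, mean_face, eigenfaces):
    """Return the weights w_k = e(k)^T (x - mean_face), one per eigenface."""
    centered = [xi - mi for xi, mi in zip(x, mean_face)]
    return [sum(ej * cj for ej, cj in zip(e, centered)) for e in eigenfaces]


def reconstruct(weights, mean_face, eigenfaces):
    """Return x_hat = mean_face + sum_k w_k * e(k)."""
    x_hat = list(mean_face)
    for w, e in zip(weights, eigenfaces):
        x_hat = [xh + w * ej for xh, ej in zip(x_hat, e)]
    return x_hat


# Toy data (assumed): 4-pixel images and K = 2 orthonormal eigenfaces.
mean_face = [10.0, 10.0, 10.0, 10.0]          # "eigenface 0"
eigenfaces = [
    [0.5, 0.5, 0.5, 0.5],                     # "eigenface 1"
    [0.5, -0.5, 0.5, -0.5],                   # "eigenface 2"
]
x = [13.0, 9.0, 13.0, 9.0]

y = project(x, mean_face, eigenfaces)         # feature vector of this image
x_hat = reconstruct(y, mean_face, eigenfaces)
```

In this toy case the centered image lies in the span of the two eigenfaces, so the reconstruction is exact; with real images and a small K it is in general only an approximation.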
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  - Steps
    • Concatenate the parameters of interest for each speaker r to form a huge vector a(r) (a supervector)
    • The parameters used are the SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 ... Speaker R data, together with the SI HMM, go through model training to give the Speaker 1 ... Speaker R HMMs; each supervector has dimension D = (M·n) × 1; Principal Component Analysis is applied to the supervectors]
  - Each new speaker S is represented by a point P in K-space
$$
\mathbf{P} = \mathbf{e}(0) + w_1\,\mathbf{e}(1) + w_2\,\mathbf{e}(2) + \cdots + w_K\,\mathbf{e}(K)
$$
    Annotation on the slide: SI HMM model
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure note: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
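LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ of dimensionality n, each of which belongs to one of the J classes: $g(\mathbf{x}_i) = j,\; j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index and $N_j$ is the number of samples in class j
  - The sample mean is
$$
\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i
$$
  - The class sample means are
$$
\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\mathbf{x}_i
$$
  - The class sample covariances are
$$
\boldsymbol{\Sigma}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T
$$
  - The average within-class variation before the transform is
$$
\mathbf{S}_w = \frac{1}{N}\sum_{j} N_j\,\boldsymbol{\Sigma}_j
$$
  - The average between-class variation before the transform is
$$
\mathbf{S}_b = \frac{1}{N}\sum_{j} N_j\,(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T
$$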
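As a rough illustration of the definitions above, the following Python sketch computes the within-class and between-class variation for a tiny labeled data set. The four 2-dimensional samples, their labels, and the helper names (`mean`, `outer`, `add_scaled`, `scatter_matrices`) are assumptions made for this example only.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[ai * bj for bj in b] for ai in a]


def add_scaled(A, B, scale):
    """Return A + scale * B for two matrices of the same shape."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def scatter_matrices(samples, labels):
    """Return (S_w, S_b): average within-class and between-class variation."""
    N, n = len(samples), len(samples[0])
    x_bar = mean(samples)
    S_w = [[0.0] * n for _ in range(n)]
    S_b = [[0.0] * n for _ in range(n)]
    for j in sorted(set(labels)):
        members = [x for x, g in zip(samples, labels) if g == j]
        x_bar_j = mean(members)
        # N_j * Sigma_j is the sum of outer products of centered class members.
        for x in members:
            d = [xi - mi for xi, mi in zip(x, x_bar_j)]
            S_w = add_scaled(S_w, outer(d, d), 1.0 / N)
        d = [mj - m for mj, m in zip(x_bar_j, x_bar)]
        S_b = add_scaled(S_b, outer(d, d), len(members) / N)
    return S_w, S_b


# Toy data (assumed): two classes with two samples each.
samples = [[1.0, 2.0], [3.0, 2.0], [6.0, 5.0], [8.0, 7.0]]
labels = [1, 1, 2, 2]
S_w, S_b = scatter_matrices(samples, labels)
```

For this data set the sum S_w + S_b equals the covariance of the merged samples around the overall mean, which is a convenient sanity check on any implementation.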
ML-34
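LDA Derivations (2/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  - The sample vectors will be
$$
\mathbf{y}_i = \mathbf{W}^T\mathbf{x}_i
$$
  - The sample mean will be
$$
\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{W}^T\mathbf{x}_i
= \mathbf{W}^T\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\right)
= \mathbf{W}^T\bar{\mathbf{x}}
$$
  - The class sample means will be
$$
\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\mathbf{W}^T\mathbf{x}_i = \mathbf{W}^T\bar{\mathbf{x}}_j
$$
  - The average within-class variation will be
$$
\tilde{\mathbf{S}}_w
= \frac{1}{N}\sum_{j} N_j\left\{\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}
\left(\mathbf{W}^T\mathbf{x}_i-\mathbf{W}^T\bar{\mathbf{x}}_j\right)
\left(\mathbf{W}^T\mathbf{x}_i-\mathbf{W}^T\bar{\mathbf{x}}_j\right)^T\right\}
= \mathbf{W}^T\left\{\frac{1}{N}\sum_{j} N_j\,\boldsymbol{\Sigma}_j\right\}\mathbf{W}
= \mathbf{W}^T\mathbf{S}_w\mathbf{W}
$$
[Figure: W maps x to y, $\mathbf{S}_w$ to $\tilde{\mathbf{S}}_w$, and $\mathbf{S}_b$ to $\tilde{\mathbf{S}}_b$]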
ML-35
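LDA Derivations (3/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  - Similarly, the average between-class variation will be
$$
\tilde{\mathbf{S}}_b = \mathbf{W}^T\mathbf{S}_b\mathbf{W}
$$
  - Try to find the optimal W such that the following objective function is maximized
$$
J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|}
= \frac{|\mathbf{W}^T\mathbf{S}_b\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_w\mathbf{W}|}
$$
    • A closed-form solution: the column vectors $\mathbf{w}_i$ of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
$$
\mathbf{S}_b\mathbf{w}_i = \lambda_i\,\mathbf{S}_w\mathbf{w}_i
$$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$
$$
\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i = \lambda_i\,\mathbf{w}_i
$$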
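For the two-class case the between-class matrix has rank one, so the leading generalized eigenvector is proportional to the inverse within-class matrix applied to the difference of the class means. The Python sketch below checks this on toy 2 x 2 matrices; the numeric values and the helper names `inv2` and `matvec` are assumptions for illustration, not data from the slides.

```python
def inv2(M):
    """Inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


# Toy two-class statistics (assumed values).
S_w = [[1.0, 0.5], [0.5, 0.5]]       # within-class variation
S_b = [[6.25, 5.0], [5.0, 4.0]]      # between-class variation (rank one)
mean_diff = [5.0, 4.0]               # difference of the two class means

w = matvec(inv2(S_w), mean_diff)     # candidate LDA direction
Sb_w = matvec(S_b, w)
Sw_w = matvec(S_w, w)
lam = Sb_w[0] / Sw_w[0]              # eigenvalue in S_b w = lam * S_w w
```

Both components of `Sb_w` should equal `lam` times the matching component of `Sw_w`, which is the generalized eigenvalue equation from the slide.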
ML-36
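LDA Derivations (4/4)
• Proof
$$
\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W})
= \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|}
= \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T\mathbf{S}_b\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_w\mathbf{W}|}
\qquad (|\cdot| \text{ denotes the determinant})
$$
Or, for each column vector $\mathbf{w}_i$ of W, we want to find
$$
\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \lambda_i,
\qquad
\lambda_i = \frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i}
$$
The quadratic form has its optimal solution where the derivative vanishes:
$$
\frac{\partial \lambda_i}{\partial \mathbf{w}_i}
= \frac{2\,\mathbf{S}_b\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right)
- 2\,\mathbf{S}_w\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i\right)}
{\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right)^2} = 0
$$
$$
\Rightarrow\; \mathbf{S}_b\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right)
- \mathbf{S}_w\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i\right) = 0
$$
$$
\Rightarrow\; \mathbf{S}_b\mathbf{w}_i
- \mathbf{S}_w\mathbf{w}_i\,\frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i} = 0
\;\Rightarrow\; \mathbf{S}_b\mathbf{w}_i - \lambda_i\,\mathbf{S}_w\mathbf{w}_i = 0
$$
$$
\Rightarrow\; \mathbf{S}_b\mathbf{w}_i = \lambda_i\,\mathbf{S}_w\mathbf{w}_i
\;\Rightarrow\; \mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i = \lambda_i\,\mathbf{w}_i
$$
The derivation uses the following two rules:
$$
\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2},
\qquad
\frac{d\left(\mathbf{x}^T\mathbf{C}\mathbf{x}\right)}{d\mathbf{x}} = \left(\mathbf{C}+\mathbf{C}^T\right)\mathbf{x}
$$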
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors and, after the cosine transform, covariance matrix of the 18-cepstral vectors; both calculated using Year-99's 5471 files]
$$
\boldsymbol{\Sigma}_x = \frac{1}{N}\sum_{i}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T,
\qquad
\boldsymbol{\Sigma}'_y = \frac{1}{N}\sum_{i}(\mathbf{y}_i-\bar{\mathbf{y}})(\mathbf{y}_i-\bar{\mathbf{y}})^T
$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform) and covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform); both calculated using Year-99's 5471 files]
Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: projections obtained by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - In theory, HDA clearly provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar=[3.0 0.5 0.4;
           0.9 6.3 0.2;
           0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE=[3.0 3.5 1.4;
        1.9 6.3 2.2;
        2.4 0.4 4.2];
    WI=[4.0 4.1 2.1;
        2.9 8.7 3.5;
        4.4 3.2 4.3];
    % LDA
    IWI=inv(WI);
    A=IWI*BE;
    % PCA (use this definition of A instead of the LDA one)
    A=BE+WI;              % why? (Prove it)
    [V,D]=eig(A);
    [V,D]=eigs(A,3);
    fid=fopen('Basis','w');
    for i=1:3             % feature vector length
      for j=1:3           % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original samples after the PCA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original samples after the LDA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
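LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: the n-dimensional X in the feature space is mapped to the k-dimensional Y through the orthonormal basis $\boldsymbol{\varphi}_1, \ldots, \boldsymbol{\varphi}_k$
$$
y_i = \boldsymbol{\varphi}_i^T\mathbf{X},
\qquad
\hat{\mathbf{X}} = \sum_{i=1}^{k} y_i\,\boldsymbol{\varphi}_i,
\qquad
\min \|\mathbf{X}-\hat{\mathbf{X}}\|^2 \ \text{for a given } k
$$
  - SVD (in LSA): the m×n matrix A is approximated by A' in the latent semantic space
$$
\mathbf{A}_{m\times n} \approx \mathbf{A}'_{m\times n} = \mathbf{U}'\,\boldsymbol{\Sigma}'\,\mathbf{V}'^T,
\qquad
\min \|\mathbf{A}-\mathbf{A}'\|_F^2 \ \text{for a given } k
$$
    where U' is m×r, Σ' is r×r, V'^T is r×n, r ≤ min(m, n), and only the leading k×k block of Σ' (k dimensions) is kept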
ML-47
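LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector x (onto the basis vectors $\boldsymbol{\varphi}_1, \boldsymbol{\varphi}_2$, giving $y_1, y_2$):
$$
y_1 = \|\mathbf{x}\|\cos\theta_1
= \|\mathbf{x}\|\,\frac{\boldsymbol{\varphi}_1^T\mathbf{x}}{\|\boldsymbol{\varphi}_1\|\,\|\mathbf{x}\|}
= \boldsymbol{\varphi}_1^T\mathbf{x},
\qquad \text{where } \|\boldsymbol{\varphi}_1\| = 1
$$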
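A small numeric check of the projection formula, written as a Python sketch; the vector `x`, the unit-length basis vector `phi_1`, and the helper names `dot` and `norm` are example choices, not values from the slides.

```python
import math


def dot(a, b):
    """Inner product of two vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))


x = [5.0, 0.0]                 # vector to project (assumed)
phi_1 = [0.6, 0.8]             # unit-length basis vector (assumed)

y_1 = dot(phi_1, x)                              # phi_1^T x
cos_theta = y_1 / (norm(phi_1) * norm(x))        # cosine of the angle between them
y_1_from_angle = norm(x) * cos_theta             # |x| cos(theta)
```

The two ways of computing the coordinate agree only because `phi_1` has unit length; for a basis vector of another length the inner product would be scaled accordingly.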
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: doc terms and query terms can be matched by literal term matching, by doc expansion or query expansion, or, through the structure models of the doc and the query, by latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
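LSA (7/7)
• Singular Value Decomposition (SVD)
$$
\mathbf{A}_{m\times n} = \mathbf{U}_{m\times r}\,\boldsymbol{\Sigma}_{r\times r}\,\mathbf{V}^T_{r\times n},
\qquad r \le \min(m, n)
$$
$$
\mathbf{A}'_{m\times n} = \mathbf{U}'_{m\times k}\,\boldsymbol{\Sigma}_{k\times k}\,\mathbf{V}'^T_{k\times n},
\qquad k \le r
$$
  - The rows of A and A' correspond to the words $w_1, w_2, \ldots, w_m$ and the columns to the docs $d_1, d_2, \ldots, d_n$
  - Both U and V have orthonormal column vectors: $\mathbf{U}^T\mathbf{U} = \mathbf{I}_{r\times r}$, $\mathbf{V}^T\mathbf{V} = \mathbf{I}_{r\times r}$
  - Row A ∈ R^n, Col A ∈ R^m
  - $\|\mathbf{A}\|_F^2 \ge \|\mathbf{A}'\|_F^2$, where
$$
\|\mathbf{A}\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2
$$
  - Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\boldsymbol{\Sigma}_k$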
ML-52
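LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $\mathbf{A}^T\mathbf{A}$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
$$
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0,
\qquad
\boldsymbol{\Sigma}^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)
$$
    • All eigenvectors $\mathbf{v}_j$ are orthonormal ($\mathbf{v}_j \in R^n$)
$$
\mathbf{V}_{n\times n} = [\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_n],
\qquad \mathbf{v}_j^T\mathbf{v}_j = 1,
\qquad \mathbf{V}^T\mathbf{V} = \mathbf{I}_{n\times n}
$$
• Define the singular values $\sigma_j = \sqrt{\lambda_j},\; j = 1, \ldots, n$
  - As the square roots of the eigenvalues of $\mathbf{A}^T\mathbf{A}$
  - As the lengths of the vectors $\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_n$
$$
\sigma_1 = \|\mathbf{A}\mathbf{v}_1\|,\quad \sigma_2 = \|\mathbf{A}\mathbf{v}_2\|,\quad \ldots
$$
$$
\|\mathbf{A}\mathbf{v}_i\|^2 = \mathbf{v}_i^T\mathbf{A}^T\mathbf{A}\mathbf{v}_i = \mathbf{v}_i^T\lambda_i\mathbf{v}_i = \lambda_i
\;\Rightarrow\; \|\mathbf{A}\mathbf{v}_i\| = \sigma_i
$$
• For $\lambda_i \ne 0$, i = 1, ..., r, $\{\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_r\}$ is an orthogonal basis of Col A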
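The two definitions of the singular values can be compared numerically. The Python sketch below does so for a 2 x 2 toy matrix, using the closed-form eigen-decomposition of a symmetric 2 x 2 matrix rather than a numerical library; the matrix `A` and the helper names `gram`, `eig_sym_2x2`, and `matvec` are assumptions for illustration.

```python
import math


def gram(A):
    """Return A^T A for a matrix A given as a list of rows."""
    n = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]


def eig_sym_2x2(M):
    """Eigenpairs of a symmetric 2x2 matrix, largest eigenvalue first.

    Closed form; assumes the off-diagonal entry M[0][1] is nonzero.
    """
    a, b, d = M[0][0], M[0][1], M[1][1]
    mid, rad = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        v = [b, lam - a]                      # solves (M - lam * I) v = 0
        length = math.hypot(v[0], v[1])
        pairs.append((lam, [v[0] / length, v[1] / length]))
    return pairs


def matvec(A, v):
    """Matrix-vector product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


A = [[3.0, 0.0], [4.0, 5.0]]                  # toy matrix (assumed)
pairs = eig_sym_2x2(gram(A))
sigmas = [math.sqrt(lam) for lam, _ in pairs]              # sqrt of eigenvalues of A^T A
lengths = [math.hypot(*matvec(A, v)) for _, v in pairs]    # lengths of A v_i
frobenius_sq = sum(a * a for row in A for a in row)        # sum of squared entries
```

Here `sigmas` and `lengths` should agree entry by entry, and the eigenvalues should add up to the squared Frobenius norm of `A`.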
ML-53
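LSA Derivations (2/7)
• $\{\mathbf{A}\mathbf{v}_1, \mathbf{A}\mathbf{v}_2, \ldots, \mathbf{A}\mathbf{v}_r\}$ is an orthogonal basis of Col A
$$
\mathbf{A}\mathbf{v}_i \cdot \mathbf{A}\mathbf{v}_j = (\mathbf{A}\mathbf{v}_i)^T\mathbf{A}\mathbf{v}_j
= \mathbf{v}_i^T\mathbf{A}^T\mathbf{A}\mathbf{v}_j = \lambda_j\,\mathbf{v}_i^T\mathbf{v}_j = 0
\qquad (i \ne j)
$$
  - Suppose that A (or $\mathbf{A}^T\mathbf{A}$) has rank r ≤ n
$$
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0,
\qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0
$$
  - Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A
$$
\mathbf{u}_i = \frac{1}{\|\mathbf{A}\mathbf{v}_i\|}\mathbf{A}\mathbf{v}_i = \frac{1}{\sigma_i}\mathbf{A}\mathbf{v}_i
\;\Rightarrow\; \sigma_i\mathbf{u}_i = \mathbf{A}\mathbf{v}_i
\;\Rightarrow\; [\mathbf{u}_1\ \mathbf{u}_2\ \cdots\ \mathbf{u}_r]\,\boldsymbol{\Sigma}_r = \mathbf{A}\,[\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_r]
$$
    Here $[\mathbf{v}_1 \cdots \mathbf{v}_r]$ is an orthonormal matrix (n×r) that is known in advance, and $[\mathbf{u}_1 \cdots \mathbf{u}_r]$ is also an orthonormal matrix (m×r)
    • Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of R^m
$$
[\mathbf{u}_1 \cdots \mathbf{u}_r \cdots \mathbf{u}_m]\,\boldsymbol{\Sigma} = \mathbf{A}\,[\mathbf{v}_1 \cdots \mathbf{v}_r \cdots \mathbf{v}_n]
\;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V}
\;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T = \mathbf{A}\mathbf{V}\mathbf{V}^T
\;\Rightarrow\; \mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T
$$
      since $\mathbf{V}\mathbf{V}^T = \mathbf{I}_{n\times n}$, where
$$
\boldsymbol{\Sigma}_{m\times n} =
\begin{pmatrix}
\boldsymbol{\Sigma}_{r\times r} & \mathbf{0}_{r\times(n-r)} \\
\mathbf{0}_{(m-r)\times r} & \mathbf{0}_{(m-r)\times(n-r)}
\end{pmatrix}
$$
  - Frobenius norm
$$
\|\mathbf{A}\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2,
\qquad
\|\mathbf{A}\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2
$$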
ML-54
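LSA Derivations (3/7)
[Figure: A (m×n) maps R^n to R^m. In R^n, the $\mathbf{v}_i$ (columns of $\mathbf{V}_1$) span the row space of A, and $\mathbf{V}_2$ corresponds to the solutions of AX = 0. In R^m, the $\mathbf{u}_i$ (columns of $\mathbf{U}_1$) span the row space of $\mathbf{A}^T$, and $\mathbf{U}_2$ holds the remaining columns of U.]
$$
\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T
= \begin{pmatrix}\mathbf{U}_1 & \mathbf{U}_2\end{pmatrix}
\begin{pmatrix}\boldsymbol{\Sigma}_1 & \mathbf{0} \\ \mathbf{0} & \mathbf{0}\end{pmatrix}
\begin{pmatrix}\mathbf{V}_1^T \\ \mathbf{V}_2^T\end{pmatrix}
= \mathbf{U}_1\boldsymbol{\Sigma}_1\mathbf{V}_1^T
= \mathbf{A}\mathbf{V}_1\mathbf{V}_1^T
$$
$$
\mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V}
$$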
ML-55
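LSA Derivations (4/7)
• Additional Explanations
  - Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
$$
\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T
\;\Rightarrow\; \mathbf{A}\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\mathbf{V} = \mathbf{U}\boldsymbol{\Sigma}
\;\Rightarrow\; \mathbf{U}\boldsymbol{\Sigma} = \mathbf{A}\mathbf{V}
$$
    • The i-th entry of a row of U is related to the projection of the corresponding row of A onto the i-th column of V
  - Each row of V is related to the projection of a corresponding row of $\mathbf{A}^T$ onto the basis formed by the columns of U
$$
\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T
\;\Rightarrow\; \mathbf{A}^T\mathbf{U} = \left(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T\right)^T\mathbf{U}
= \mathbf{V}\boldsymbol{\Sigma}^T\mathbf{U}^T\mathbf{U} = \mathbf{V}\boldsymbol{\Sigma}
\;\Rightarrow\; \mathbf{V}\boldsymbol{\Sigma} = \mathbf{A}^T\mathbf{U}
$$
    • The i-th entry of a row of V is related to the projection of the corresponding row of $\mathbf{A}^T$ onto the i-th column of U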
ML-56
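LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), of size m×n, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$
    • compare two terms (e.g., $w_i$ and $w_j$) → dot product of two rows of A
      - or an entry in $\mathbf{A}\mathbf{A}^T$
    • compare two docs (e.g., $d_k$ and $d_s$) → dot product of two columns of A
      - or an entry in $\mathbf{A}^T\mathbf{A}$
    • compare a term and a doc → each individual entry of A
  - The new word-document matrix (A'), with U' = U (m×k), Σ' = Σ_k, V' = V (n×k)
    • compare two terms → dot product of two rows of U'Σ'
$$
\mathbf{A}'\mathbf{A}'^T = (\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)(\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)^T
= \mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T\mathbf{V}'\boldsymbol{\Sigma}'^T\mathbf{U}'^T
= (\mathbf{U}'\boldsymbol{\Sigma}')(\mathbf{U}'\boldsymbol{\Sigma}')^T
$$
    • compare two docs → dot product of two rows of V'Σ'
$$
\mathbf{A}'^T\mathbf{A}' = (\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)^T(\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T)
= \mathbf{V}'\boldsymbol{\Sigma}'^T\mathbf{U}'^T\mathbf{U}'\boldsymbol{\Sigma}'\mathbf{V}'^T
= (\mathbf{V}'\boldsymbol{\Sigma}')(\mathbf{V}'\boldsymbol{\Sigma}')^T
$$
    • compare a query word and a doc → each individual entry of A'
  - In both products the inner factor ($\mathbf{V}'^T\mathbf{V}'$ or $\mathbf{U}'^T\mathbf{U}'$) is the k×k identity matrix, and Σ' serves for stretching or shrinking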
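The two identities used for term-term and doc-doc comparison can be verified on a hand-built rank-2 factorization. In the Python sketch below the factors `U`, `S`, and `V` are toy values chosen to have orthonormal columns (an assumption for illustration), and `matmul` and `transpose` are small hypothetical helpers.

```python
def matmul(A, B):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


# Toy factors (assumed): 3 terms, 2 docs, k = 2 latent dimensions.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]      # orthonormal columns
S = [[3.0, 0.0], [0.0, 1.0]]                  # diagonal matrix of singular values
V = [[0.6, -0.8], [0.8, 0.6]]                 # orthonormal columns

A_k = matmul(matmul(U, S), transpose(V))      # reduced word-document matrix A'
US = matmul(U, S)                             # one row per term
VS = matmul(V, S)                             # one row per doc

term_sims = matmul(US, transpose(US))         # dot products between term rows
doc_sims = matmul(VS, transpose(VS))          # dot products between doc rows
term_sims_direct = matmul(A_k, transpose(A_k))    # A' A'^T
doc_sims_direct = matmul(transpose(A_k), A_k)     # A'^T A'
```

Comparing rows of `US` (or `VS`) works in k dimensions, so it avoids forming the full m x n matrix A' when only pairwise similarities are needed.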
ML-57
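LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
$$
\hat{\mathbf{q}}_{1\times k} = \left(\mathbf{q}^T\right)_{1\times m}\,\mathbf{U}_{m\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}
$$
      - The query is represented by the weighted sum of its constituent term vectors
      - The separate dimensions are differentially weighted
      - The result is just like a row of V
  - Cosine measure between the query and doc vectors in the latent semantic space
$$
\mathrm{sim}(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\boldsymbol{\Sigma}, \hat{\mathbf{d}}\boldsymbol{\Sigma})
= \frac{\hat{\mathbf{q}}\,\boldsymbol{\Sigma}^2\,\hat{\mathbf{d}}^T}{|\hat{\mathbf{q}}\boldsymbol{\Sigma}|\;|\hat{\mathbf{d}}\boldsymbol{\Sigma}|}
$$
    where $\hat{\mathbf{q}}$ and $\hat{\mathbf{d}}$ are row vectors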
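A minimal Python sketch of fold-in followed by the cosine measure, assuming the same kind of toy factors as a real SVD would produce (here `U`, `sigma`, and `V` are made-up values with orthonormal columns, and `fold_in` and `latent_cosine` are hypothetical helper names). The query is deliberately set equal to the first document's column, so that fold-in should return that document's row of V.

```python
import math


def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}; q holds one weight per term, sigma the k diagonal values."""
    k = len(sigma)
    return [sum(q[t] * U[t][c] for t in range(len(q))) / sigma[c] for c in range(k)]


def latent_cosine(q_hat, d_hat, sigma):
    """Cosine between q_hat * Sigma and d_hat * Sigma (both row vectors)."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [b * s for b, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    return dot / (math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds)))


# Toy factors (assumed): 3 terms, 2 docs, k = 2.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma = [3.0, 1.0]
V = [[0.6, -0.8], [0.8, 0.6]]          # row j is the latent vector of doc j

q = [1.8, -0.8, 0.0]                   # same as column 0 of U * diag(sigma) * V^T
q_hat = fold_in(q, U, sigma)
sims = [latent_cosine(q_hat, d_hat, sigma) for d_hat in V]
```

With these values `q_hat` coincides with the first row of `V`, and the first similarity is 1, which matches the slide's remark that a folded-in query is "just like a row of V".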
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$
\hat{\mathbf{t}}_{1\times k} = \mathbf{t}_{1\times n}\,\mathbf{V}_{n\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}
$$
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (row = term, column = doc)
             Doc 0   Doc 1   Doc 2
    Term 0   2.3     0.0     4.2
    Term 1   0.0     1.3     2.2
    Term 2   3.8     0.0     0.5
    Term 3   0.0     0.0     0.0
  - Each entry is weighted by its TFxIDF score
  - The corresponding sparse input file (annotations in parentheses are not part of the file):
    4 3 6      (4 rows/terms, 3 columns/docs, 6 nonzero entries)
    2          (2 nonzero entries at Col 0)
    0 2.3      (Col 0, Row 0)
    2 3.8      (Col 0, Row 2)
    1          (1 nonzero entry at Col 1)
    1 1.3      (Col 1, Row 1)
    3          (3 nonzero entries at Col 2)
    0 4.2      (Col 2, Row 0)
    1 2.2      (Col 2, Row 1)
    2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
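The column-by-column layout shown above is easy to generate from a dense matrix. The following Python sketch is one possible way to do it; the function name `to_sparse_text` is hypothetical, and the matrix is the 4 x 3 example from this slide.

```python
def to_sparse_text(matrix):
    """Serialize a dense term-by-doc matrix column by column.

    Output layout: a header "rows cols nonzeros", then for every column its
    nonzero count followed by one "row value" line per nonzero entry.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [
        [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        for c in range(n_cols)
    ]
    total = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {total}"]
    for col in columns:
        lines.append(str(len(col)))
        lines.extend(f"{r} {value}" for r, value in col)
    return "\n".join(lines)


# The 4-term, 3-doc example matrix from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(term_doc)
```

Writing `text` to a file gives an input of the kind the `svd` command on the next slide reads with its sparse-text option.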
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (first line: number of indexing terms, number of docs, number of nonzero entries)
    51253 2265 218852
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ......
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: number of reserved eigenvectors
  - Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (word vectors (u^T), each 1x100; 51253 words)
    100 51253
    0.003 0.001 ......
    0.002 0.002 ......
• LSA100-S (100 eigenvalues)
    100
    2686.18
    829.941
    559.59
    ...
• LSA100-Vt (doc vectors (v^T), each 1x100; 2265 docs)
    100 2265
    0.021 0.035 ......
    0.012 0.022 ......
ML-65
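LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
$$
\hat{\mathbf{q}}_{1\times k} = \left(\mathbf{q}^T\right)_{1\times m}\,\mathbf{U}_{m\times k}\,\boldsymbol{\Sigma}^{-1}_{k\times k}
$$
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted
  - The result is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$
\mathrm{sim}(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\boldsymbol{\Sigma}, \hat{\mathbf{d}}\boldsymbol{\Sigma})
= \frac{\hat{\mathbf{q}}\,\boldsymbol{\Sigma}^2\,\hat{\mathbf{d}}^T}{|\hat{\mathbf{q}}\boldsymbol{\Sigma}|\;|\hat{\mathbf{d}}\boldsymbol{\Sigma}|}
$$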
ML-22
PCA Examples: Image Coding (1/2)
• Example 3: Image Coding
[Figure: image coding example; dimension labels 256 × 256 and 8 × 8]
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure panels: (value reduction), (feature reduction)]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
- Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
- Steps
  • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity of each pixel
  • Calculate the covariance/correlation matrix between these reference vectors
  • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
  • Besides, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
$x_1 = [x_{11}, x_{21}, \ldots, x_{n1}]^T,\quad x_2 = [x_{12}, x_{22}, \ldots, x_{n2}]^T,\quad \ldots,\quad x_L = [x_{1L}, x_{2L}, \ldots, x_{nL}]^T$
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
- Steps
  • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
$\hat{x}_i = \bar{x} + w_i(1)\,e_1 + w_i(2)\,e_2 + \cdots + w_i(K)\,e_K \;\Rightarrow\; y_i = [w_i(1), w_i(2), \ldots, w_i(K)]$
    ($y_i$: feature vector of person i)
- The eigenface approach follows the minimum mean-squared error criterion
- Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
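The eigenface steps can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available and using random arrays in place of real face images; the function names `eigenfaces` and `reconstruct` are hypothetical, and the SVD of the centered data is used as one common way to obtain the eigenvectors of the covariance matrix.

```python
import numpy as np

def eigenfaces(images, K):
    """images: (L, n) array, one flattened reference image per row."""
    X = np.asarray(images, dtype=float)
    mean_face = X.mean(axis=0)              # the average face ("eigenface 0")
    centered = X - mean_face
    # The right singular vectors of the centered data are the eigenvectors
    # of its covariance matrix, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    E = Vt[:K]                              # (K, n): eigenfaces 1..K as rows
    weights = centered @ E.T                # (L, K): one feature vector per image
    return mean_face, E, weights

def reconstruct(mean_face, E, w):
    """x_hat = mean face + sum_k w(k) * e_k"""
    return mean_face + w @ E

# Hypothetical data: L = 6 "images" of n = 16 pixels each.
rng = np.random.default_rng(0)
faces = rng.random((6, 16))
mean_face, E, W = eigenfaces(faces, K=5)
recon = reconstruct(mean_face, E, W)
```

With L images the centered data has rank at most L - 1, so keeping K = L - 1 eigenfaces reproduces the training images; a smaller K gives the minimum mean-squared error approximation.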
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL 2000)
- Steps
  • Concatenate the relevant parameters of each speaker r to form a huge vector a(r) (a supervector)
  • SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Speaker 1 data ... Speaker R data and the SI HMM → model training → Speaker 1 HMM ... Speaker R HMM → supervectors of dimension D = (M·n) × 1 → Principal Component Analysis]
Each new speaker S is represented by a point P in K-space:
$P = e(0) + w(1)\,e(1) + w(2)\,e(2) + \cdots + w(K)\,e(K)$
(figure label on this equation: SI HMM model)
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
- Dimension 1 (eigenvoice 1)
  • Correlates with pitch or sex
- Dimension 2 (eigenvoice 2)
  • Correlates with amplitude
- Dimension 3 (eigenvoice 3)
  • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
- Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
  • Fisher (1936) introduced it for two-class classification
  • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure note: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
- The sample mean is
  $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
- The class sample means are
  $\bar{x}_j = \frac{1}{N_j}\sum_{g(x_i)=j} x_i$
- The class sample covariances are
  $\Sigma_j = \frac{1}{N_j}\sum_{g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$
- The average within-class variation before the transform
  $S_w = \frac{1}{N}\sum_{j} N_j \Sigma_j$
- The average between-class variation before the transform
  $S_b = \frac{1}{N}\sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
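The definitions of the within-class and between-class variation translate directly into code. The following is a minimal sketch, assuming NumPy and a small made-up two-class data set; `scatter_matrices` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_w, S_b) for samples X of shape (N, n) with class labels g(x_i)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N, n = X.shape
    mean = X.mean(axis=0)                       # sample mean
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj = len(Xj)
        mean_j = Xj.mean(axis=0)                # class sample mean
        cov_j = (Xj - mean_j).T @ (Xj - mean_j) / Nj   # class sample covariance
        S_w += Nj * cov_j / N
        d = (mean_j - mean)[:, None]
        S_b += Nj * (d @ d.T) / N
    return S_w, S_b

# Hypothetical data: six 2-dimensional samples in two classes.
X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0],
              [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])
labels = np.array([0, 0, 0, 1, 1, 1])
S_w, S_b = scatter_matrices(X, labels)

# Total covariance of the merged data, normalized by N like S_w and S_b.
S_t = (X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) / len(X)
```

With these normalizations the total covariance splits as S_t = S_w + S_b, which is a convenient sanity check on an implementation.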
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
- The sample vectors will be
  $y_i = W^T x_i$
- The sample mean will be
  $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} W^T x_i = W^T\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = W^T \bar{x}$
- The class sample means will be
  $\bar{y}_j = \frac{1}{N_j}\sum_{g(x_i)=j} W^T x_i = W^T \bar{x}_j$
- The average within-class variation will be
  $\tilde{S}_w = \frac{1}{N}\sum_{j} N_j \left\{ \frac{1}{N_j}\sum_{g(x_i)=j} \left(W^T x_i - W^T \bar{x}_j\right)\left(W^T x_i - W^T \bar{x}_j\right)^T \right\} = W^T \left\{ \frac{1}{N}\sum_{j} N_j \Sigma_j \right\} W = W^T S_w W$
[Figure: W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
- Similarly, the average between-class variation will be
  $\tilde{S}_b = W^T S_b W$
- Try to find the optimal $W$ such that the following objective function is maximized
  $J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$
  • A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
    $S_b w_i = \lambda_i S_w w_i$
  • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
    $S_w^{-1} S_b w_i = \lambda_i w_i$
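The closed-form solution can be checked numerically. The sketch below assumes NumPy and two small made-up symmetric matrices standing in for S_w and S_b; `lda_directions` is a hypothetical name. It solves the ordinary eigenproblem of S_w^{-1} S_b and keeps the m leading eigenvectors.

```python
import numpy as np

def lda_directions(S_w, S_b, m):
    """Eigenvectors of S_w^{-1} S_b for the m largest eigenvalues."""
    # solve(S_w, S_b) computes S_w^{-1} S_b without forming the inverse explicitly.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    evals, evecs = evals.real, evecs.real   # real when S_w is positive definite and S_b symmetric
    order = np.argsort(evals)[::-1][:m]     # indices of the m largest eigenvalues
    return evals[order], evecs[:, order]

# Hypothetical within-class and between-class matrices (2 x 2).
S_w = np.array([[2.0, 0.5], [0.5, 1.0]])
S_b = np.array([[3.0, 1.0], [1.0, 0.5]])
lam, W = lda_directions(S_w, S_b, m=1)

# Value of the single-direction objective w^T S_b w / w^T S_w w at the solution.
w = W[:, 0]
J = (w @ S_b @ w) / (w @ S_w @ w)
```

For a single direction the objective value at the solution equals the corresponding eigenvalue, which is why the largest eigenvalues are the ones kept.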
ML-36
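LDA Derivations (4/4)
• Proof
$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|}$   ($|\cdot|$: determinant)
Or, for each column vector $w_i$ of $W$, we want to find that
$\hat{w}_i = \arg\max_{w_i} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$, with $\lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$
The quadratic form has an optimal solution where
$\frac{\partial}{\partial w_i}\left(\frac{w_i^T S_b w_i}{w_i^T S_w w_i}\right) = 0$
$\Rightarrow \frac{2 S_b w_i \left(w_i^T S_w w_i\right) - 2 S_w w_i \left(w_i^T S_b w_i\right)}{\left(w_i^T S_w w_i\right)^2} = 0$
$\Rightarrow S_b w_i \left(w_i^T S_w w_i\right) - S_w w_i \left(w_i^T S_b w_i\right) = 0$
$\Rightarrow S_b w_i - \lambda_i S_w w_i = 0$
$\Rightarrow S_b w_i = \lambda_i S_w w_i \;\Rightarrow\; S_w^{-1} S_b w_i = \lambda_i w_i$
Rules used:
$\left(\frac{F}{G}\right)' = \frac{F'G - G'F}{G^2}$
$\frac{d\,(x^T C x)}{dx} = (C + C^T)\,x$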
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure, left: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$\Sigma_x = \frac{1}{N}\sum_{i} (x_i - \bar{x})(x_i - \bar{x})^T$
[Figure, right (after cosine transform): covariance matrix of the 18-cepstral vectors, calculated using Year-99's 5471 files]
$\Sigma'_y = \frac{1}{N}\sum_{i} (y_i - \bar{y})(y_i - \bar{y})^T$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure, after PCA transform: covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99's 5471 files]
[Figure, after LDA transform: covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99's 5471 files]
Character Error Rate
          TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure with two panels: PCA, LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
- Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
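HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4
             0.9 6.3 0.2
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4
          1.9 6.3 2.2
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1
          2.9 8.7 3.5
          4.4 3.2 4.3];
    % LDA: use A = inv(WI)*BE
    IWI = inv(WI);
    A = IWI*BE;
    % PCA: use A = BE+WI instead. Why? (Prove it)
    % A = BE+WI;
    [V,D] = eig(A);        % full eigen decomposition
    % [V,D] = eigs(A,3);   % alternative: the 3 largest eigenvalues only
    fid = fopen('Basis','w');
    for i = 1:3            % feature vector length
        for j = 1:3        % basis number
            fprintf(fid,'%10.10f ',V(i,j));
        end
        fprintf(fid,'\n');
    end
    fclose(fid);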
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original samples after the PCA transform; axes: feature 1 vs. feature 2]
[Figure: distribution of the 2000 original samples after the LDA transform; axes: feature 1 vs. feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
- Co-occurring terms are projected onto the same dimensions
- In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
- Dimensions of the reduced space correspond to the axes of greatest variation
  • Closely related to Principal Component Analysis (PCA)
ML-46
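LSA (2/7)
• Dimension Reduction and Feature Extraction
- PCA (feature space, orthonormal basis $\varphi_1, \ldots, \varphi_k$ with $k \le n$)
  $y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i$
  $\min \|X - \hat{X}\|^2$ for a given k
- SVD (in LSA) (latent semantic space of dimension k)
  $A_{m \times n} = U_{m \times r}\,\Sigma_{r \times r}\,V^T_{r \times n}, \qquad r \le \min(m, n)$
  $A'_{m \times n} = U'_{m \times k}\,\Sigma'_{k \times k}\,V'^T_{k \times n}$
  $\min \|A - A'\|_F^2$ for a given k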
ML-47
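LSA (3/7)
- Singular Value Decomposition (SVD) used for the word-document matrix
  • A least-squares method for dimension reduction
Projection of a vector x (figure: x projected onto the axes $\varphi_1$ and $\varphi_2$, giving $y_1$ and $y_2$):
$y_1 = \|x\| \cos\theta_1 = \|x\|\,\frac{\varphi_1^T x}{\|\varphi_1\|\,\|x\|} = \varphi_1^T x, \quad \text{where } \|\varphi_1\| = 1$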
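As a small worked example of the projection of a vector, the snippet below (standard library only, with a made-up vector x and unit axis phi1) checks that the inner product with a unit-length axis equals the length of x times the cosine of the angle between them.

```python
import math

def project(x, phi):
    """Inner product phi^T x; equals the projection length when phi has unit norm."""
    return sum(p * a for p, a in zip(phi, x))

# Hypothetical 2-D vector and unit-length axis (the first coordinate axis).
x = [3.0, 4.0]
phi1 = [1.0, 0.0]

y1 = project(x, phi1)                 # coordinate of x along phi1
norm_x = math.hypot(x[0], x[1])       # |x|
theta = math.atan2(x[1], x[0])        # angle between x and phi1
```

Here y1 is 3.0, |x| is 5.0, and |x| cos(theta) recovers the same value.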
ML-48
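LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: the terms of a doc and the terms of a query are compared by literal term matching; doc expansion and query expansion act on the term side; each side also has a structure model, and the two structure models are compared by latent semantic structure retrieval]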
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  (rows of A: words $w_1, w_2, \ldots, w_m$; columns of A: docs $d_1, d_2, \ldots, d_n$)
  $A_{m \times n} = U_{m \times r}\,\Sigma_{r}\,V^T_{r \times n}$, with $\Sigma_r$ of size $r \times r$ and $r \le \min(m, n)$
  $A'_{m \times n} = U'_{m \times k}\,\Sigma_{k}\,V'^T_{k \times n}$, with $\Sigma_k$ of size $k \times k$ and $k \le r$
- Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
- $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$
- Row A $\in R^n$, Col A $\in R^m$
- Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
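The rank-k approximation described here can be tried out on a toy matrix. This is a minimal sketch assuming NumPy; the 4 × 3 matrix is a made-up term-document matrix and k = 2 is an arbitrary choice.

```python
import numpy as np

# Hypothetical term-document matrix: m = 4 terms (rows), n = 3 docs (columns).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

# Thin SVD: U is m x r, s holds the singular values, Vt is r x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]        # rank-k approximation A'

err = np.linalg.norm(A - A_k, 'fro') ** 2       # squared Frobenius error
norm_A = np.linalg.norm(A, 'fro') ** 2
norm_A_k = np.linalg.norm(A_k, 'fro') ** 2
```

The squared Frobenius error equals the sum of the squared singular values that were dropped, and the squared Frobenius norm of A' never exceeds that of A.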
ML-52
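LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
- $A^T A$ is a symmetric $n \times n$ matrix
  • All eigenvalues $\lambda_j$ are nonnegative real numbers
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
  • All eigenvectors $v_j$ are orthonormal ($\in R^n$)
    $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$, with $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
- As the square roots of the eigenvalues of $A^T A$
- As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$
  $\sigma_1 = \|Av_1\|, \quad \sigma_2 = \|Av_2\|, \quad \ldots$
  $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$
For $\lambda_i \ne 0$, $i = 1, \ldots, r$: $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A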
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
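LSA Derivations (3/7)
$A_{m \times n} = U\Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$
[Figure: A maps $R^n$ to $R^m$. In $R^n$, the $v_i$ in $V_1$ span the row space of A, and $V_2$ corresponds to the solutions of $AX = 0$. In $R^m$, the $u_i$ in $U_1$ span the row space of $A^T$, with $U_2$ as the remaining part. The two sides are related by $U\Sigma = AV$]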
ML-55
LSA Derivations (4/7)
• Additional Explanations
- Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
  $A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$
  • The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
- Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
  $A = U\Sigma V^T \;\Rightarrow\; A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$
  • The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
- The original word-document matrix (A), $m \times n$, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$
  • compare two terms ($w_i$, $w_j$) → dot product of two rows of A
    - or an entry in $AA^T$
  • compare two docs ($d_k$, $d_s$) → dot product of two columns of A
    - or an entry in $A^T A$
  • compare a term and a doc → each individual entry of A
- The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
  • compare two terms → dot product of two rows of $U'\Sigma'$
    $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$
  • compare two docs → dot product of two rows of $V'\Sigma'$
    $A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
  • compare a query word and a doc → each individual entry of A'
  ($\Sigma'$: for stretching or shrinking; $V'^T V'$ and $U'^T U'$ reduce to the identity)
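The term-term and doc-doc comparisons can be verified on a toy example. This sketch assumes NumPy, a made-up 4 × 3 term-document matrix, and k = 2; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical term-document matrix: 4 terms (rows) x 3 docs (columns).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk = U[:, :k]            # U'  (m x k)
Sk = np.diag(s[:k])      # Sigma' (k x k)
Vk = Vt[:k].T            # V'  (n x k)
A_k = Uk @ Sk @ Vk.T     # A'

term_vectors = Uk @ Sk   # one row per term
doc_vectors = Vk @ Sk    # one row per doc

term_sims = term_vectors @ term_vectors.T   # term-term dot products
doc_sims = doc_vectors @ doc_vectors.T      # doc-doc dot products
```

Working with the k-dimensional rows of U'Σ' and V'Σ' gives the same dot products as the full matrices A'A'^T and A'^T A', without ever forming A'.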
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
- For objects (new queries or docs) that did not appear in the original analysis
  • Fold-in a new $m \times 1$ query (or doc) vector
    $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
    - $q^T U$: the query is represented by the weighted sum of its constituent term vectors
    - $\Sigma^{-1}$: the separate dimensions are differentially weighted
    - $\hat{q}$: just like a row of V
- Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\;|\hat{d}\Sigma|}$
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
  $\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$
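The two fold-in formulas and the cosine measure can be sketched as follows. The example assumes NumPy, a made-up term-document matrix, and k = 2; the function names are hypothetical. As a sanity check it folds in a column and a row that were already part of the analysis.

```python
import numpy as np

# Hypothetical term-document matrix: 4 terms (rows) x 3 docs (columns).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k].T
Sk_inv = np.diag(1.0 / s[:k])

def fold_in_query(q):
    """q: length-m term-weight vector -> 1 x k vector, like a row of V."""
    return q @ Uk @ Sk_inv

def fold_in_term(t):
    """t: length-n doc-weight vector -> 1 x k vector, like a row of U."""
    return t @ Vk @ Sk_inv

def latent_cosine(q_hat, d_hat):
    """Cosine between the Sigma-weighted latent vectors."""
    a, b = q_hat @ Sk, d_hat @ Sk
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q_hat = fold_in_query(A[:, 0])     # fold in doc 0 as if it were a new query
t_hat = fold_in_term(A[0, :])      # fold in term 0 as if it were a new term
sim = latent_cosine(q_hat, Vk[0])  # compare with doc 0's own latent vector
```

Folding in an existing doc or term recovers its row of V' or U', so the similarity of doc 0 with itself comes out as 1.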
ML-59
LSA Example
• Experimental results
- HMM is consistently better than VSM at all recall levels
- LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
- A clean formal framework and a clearly defined optimization criterion (least-squares)
  • Conceptual simplicity and clarity
- Handles synonymy problems ("heterogeneous vocabulary")
- Good results for high-recall search
  • Takes term co-occurrence into account
• Disadvantages
- High computational complexity
- LSA offers only a partial solution to polysemy
  • E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
- E.g., 4 terms and 3 docs (rows: terms; columns: docs)
    2.3  0.0  4.2
    0.0  1.3  2.2
    3.8  0.0  0.5
    0.0  0.0  0.0
  The corresponding sparse-matrix input file:
    4 3 6      (rows/terms, columns/docs, nonzero entries)
    2          (2 nonzero entries at Col 0)
    0 2.3      (Col 0, Row 0)
    2 3.8      (Col 0, Row 2)
    1          (1 nonzero entry at Col 1)
    1 1.3      (Col 1, Row 1)
    3          (3 nonzero entries at Col 2)
    0 4.2      (Col 2, Row 0)
    1 2.2      (Col 2, Row 1)
    2 0.5      (Col 2, Row 2)
- Each entry is weighted by a TFxIDF score
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
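The column-by-column layout of the sparse input file can be produced with a short helper. This is a minimal sketch using only the standard library; `to_sparse_text` is a hypothetical name, and the layout simply follows the annotated example on this slide (header line, then for each column its nonzero count followed by row-index/value pairs).

```python
def to_sparse_text(matrix):
    """Serialize a dense term-doc matrix (list of rows) column by column."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    # Collect the (row index, value) pairs of the nonzero entries of each column.
    columns = [
        [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        for c in range(n_cols)
    ]
    nnz = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {nnz}"]       # header: rows, columns, nonzeros
    for col in columns:
        lines.append(str(len(col)))            # number of nonzeros in this column
        lines.extend(f"{r} {v}" for r, v in col)
    return "\n".join(lines)

# The 4-term x 3-doc example matrix from this slide.
matrix = [[2.3, 0.0, 4.2],
          [0.0, 1.3, 2.2],
          [3.8, 0.0, 0.5],
          [0.0, 0.0, 0.0]]
text = to_sparse_text(matrix)
```

Printing `text` reproduces the ten lines of the example file, starting with the header `4 3 6`.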
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
    51253 2265 218852      (indexing term no., doc no., nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ......
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
- -r st: sparse matrix input
- -o LSA100: prefix of output files
- -d 100: no. of reserved eigenvectors
- Term-Doc-Matrix: name of sparse matrix input
- Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; word vector ($u^T$): 1 × 100)
    100 51253
    0.003 0.001 ......
    0.002 0.002 ......
• LSA100-S (100 eigenvalues)
    100
    2686.18
    829.941
    559.59
    ...
• LSA100-Vt (2265 docs; doc vector ($v^T$): 1 × 100)
    100 2265
    0.021 0.035 ......
    0.012 0.022 ......
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector (q is TFxIDF weighted beforehand)
  $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
  - $q^T U$: the query is represented by the weighted sum of its constituent term vectors
  - $\Sigma^{-1}$: the separate dimensions are differentially weighted
  - $\hat{q}$: just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\;|\hat{d}\Sigma|}$
ML-23
PCA Examples: Image Coding (2/2)
• Example 3: Image Coding (cont.)
[Figure: image coding results; panels labeled "value reduction" and "feature reduction"]
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
– Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
– Steps
  • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity in each pixel
  • Calculate the covariance/correlation matrix between these reference vectors
  • Apply Principal Component Analysis (PCA): find the eigenvectors of the matrix; these are the eigenfaces
  • In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
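$$
\mathbf{x}_1=\begin{bmatrix}x_{11}\\ x_{21}\\ \vdots\\ x_{n1}\end{bmatrix},\quad
\mathbf{x}_2=\begin{bmatrix}x_{12}\\ x_{22}\\ \vdots\\ x_{n2}\end{bmatrix},\quad \ldots,\quad
\mathbf{x}_L=\begin{bmatrix}x_{1L}\\ x_{2L}\\ \vdots\\ x_{nL}\end{bmatrix}
$$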
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
– Steps
  • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
– The eigenface approach satisfies the minimum mean-squared error criterion
– Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
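$$
\hat{\mathbf{x}}_i=\bar{\mathbf{x}}+w_{i1}\,\mathbf{e}(1)+w_{i2}\,\mathbf{e}(2)+\cdots+w_{iK}\,\mathbf{e}(K)
\;\Rightarrow\;
\mathbf{y}_i=\left[w_{i1},\,w_{i2},\,\ldots,\,w_{iK}\right]
$$
Here $\bar{\mathbf{x}}$ is the averaged face (eigenface 0), and $\mathbf{y}_i$ is the feature vector of person i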
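The eigenface steps (average the images, form the covariance matrix, take its leading eigenvectors, then project and reconstruct) can be sketched in a few lines. This is a minimal illustration only, assuming pure Python, power iteration with deflation as the eigen-solver, and made-up 4-pixel "images"; the function names are hypothetical and not part of the original slides.

```python
# Minimal PCA sketch following the eigenface steps (standard library only).
# Power iteration with deflation is an illustrative choice of eigen-solver.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def mean_vector(vectors):
    n = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(n)]

def covariance(vectors, mean):
    n = len(mean)
    cov = [[0.0] * n for _ in range(n)]
    for v in vectors:
        c = [v[d] - mean[d] for d in range(n)]
        for a in range(n):
            for b in range(n):
                cov[a][b] += c[a] * c[b] / len(vectors)
    return cov

def top_eigenvector(m, iterations=500):
    # Repeatedly apply m and renormalize; converges to the leading eigenvector.
    v = [1.0 / (i + 1) for i in range(len(m))]  # arbitrary non-zero start
    for _ in range(iterations):
        w = matvec(m, v)
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

def principal_components(vectors, k):
    """Return the average vector ("eigenface 0") and k leading eigenvectors."""
    mean = mean_vector(vectors)
    cov = covariance(vectors, mean)
    components = []
    for _ in range(k):
        e = top_eigenvector(cov)
        lam = dot(e, matvec(cov, e))
        components.append(e)
        # Deflation: remove the direction just found before searching again.
        cov = [[cov[a][b] - lam * e[a] * e[b] for b in range(len(e))]
               for a in range(len(e))]
    return mean, components

def project(x, mean, components):
    """Weights w_i1..w_iK: the feature vector y_i of one image."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [dot(centered, e) for e in components]

def reconstruct(weights, mean, components):
    """x_hat = mean + sum_k w_k * e(k)."""
    x_hat = list(mean)
    for w, e in zip(weights, components):
        x_hat = [xh + w * ei for xh, ei in zip(x_hat, e)]
    return x_hat

# Made-up reference "images" (4 pixels each) that vary along one direction.
images = [[2.0, 0.0, 1.0, 0.0], [4.0, 0.0, 2.0, 0.0],
          [6.0, 0.0, 3.0, 0.0], [8.0, 0.0, 4.0, 0.0]]
mean, eigenfaces = principal_components(images, 1)
weights = project(images[0], mean, eigenfaces)
approx = reconstruct(weights, mean, eigenfaces)
```

Because the toy images vary along a single direction, one eigenface is enough to reconstruct them; real face images would need a larger K and a numerically robust eigen-solver.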
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (they indicate directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
– Steps
  • Concatenate the parameters of interest for each speaker r to form a huge vector a(r) (a "supervector")
    • SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. Diagram elements: Speaker 1 Data ... Speaker R Data; SI HMM; Model Training; Speaker 1 HMM ... Speaker R HMM; supervectors of dimension D = (M·n)×1; Principal Component Analysis]
• Each new speaker S is represented by a point P in K-space
$$
P = \mathbf{e}(0) + w_1\,\mathbf{e}(1) + w_2\,\mathbf{e}(2) + \cdots + w_K\,\mathbf{e}(K)
$$
where $\mathbf{e}(0)$ is labeled "SI HMM model"
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
– Dimension 1 (eigenvoice 1)
  • Correlates with pitch or sex
– Dimension 2 (eigenvoice 2)
  • Correlates with amplitude
– Dimension 3 (eigenvoice 3)
  • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
– Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
  • Fisher (1936) introduced it for two-class classification
  • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
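LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes: $g(\mathbf{x}_i)=j,\ j\in\{1,2,\ldots,J\}$, where $g(\cdot)$ is the class index
– The sample mean is
$$\bar{\mathbf{x}}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$$
– The class sample means are
$$\bar{\mathbf{x}}_j=\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\mathbf{x}_i$$
– The class sample covariances are
$$\boldsymbol{\Sigma}_j=\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T$$
– The average within-class variation before the transform is
$$\mathbf{S}_w=\frac{1}{N}\sum_{j}N_j\boldsymbol{\Sigma}_j$$
– The average between-class variation before the transform is
$$\mathbf{S}_b=\frac{1}{N}\sum_{j}N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T$$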
ML-34
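LDA Derivations (2/4)
• If the transform $\mathbf{W}=[\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
– The sample vectors will be
$$\mathbf{y}_i=\mathbf{W}^T\mathbf{x}_i$$
– The sample mean will be
$$\bar{\mathbf{y}}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{W}^T\mathbf{x}_i=\mathbf{W}^T\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\right)=\mathbf{W}^T\bar{\mathbf{x}}$$
– The class sample means will be
$$\bar{\mathbf{y}}_j=\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\mathbf{W}^T\mathbf{x}_i=\mathbf{W}^T\bar{\mathbf{x}}_j$$
– The average within-class variation will be
$$
\begin{aligned}
\tilde{\mathbf{S}}_w&=\frac{1}{N}\sum_{j}N_j\left\{\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\left(\mathbf{W}^T\mathbf{x}_i-\mathbf{W}^T\bar{\mathbf{x}}_j\right)\left(\mathbf{W}^T\mathbf{x}_i-\mathbf{W}^T\bar{\mathbf{x}}_j\right)^T\right\}\\
&=\mathbf{W}^T\left\{\frac{1}{N}\sum_{j}N_j\boldsymbol{\Sigma}_j\right\}\mathbf{W}\\
&=\mathbf{W}^T\mathbf{S}_w\mathbf{W}
\end{aligned}
$$
[Diagram: W maps x to y, $\mathbf{S}_w$ to $\tilde{\mathbf{S}}_w$, and $\mathbf{S}_b$ to $\tilde{\mathbf{S}}_b$]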
ML-35
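LDA Derivations (3/4)
• If the transform $\mathbf{W}=[\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
– Similarly, the average between-class variation will be
$$\tilde{\mathbf{S}}_b=\mathbf{W}^T\mathbf{S}_b\mathbf{W}$$
– Try to find the optimal $\mathbf{W}$ such that the following objective function is maximized
$$J(\mathbf{W})=\frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|}=\frac{|\mathbf{W}^T\mathbf{S}_b\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_w\mathbf{W}|}$$
  • A closed-form solution: the column vectors of an optimal matrix $\mathbf{W}$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$\mathbf{S}_b\mathbf{w}_i=\lambda_i\mathbf{S}_w\mathbf{w}_i$$
  • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$
$$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i=\lambda_i\mathbf{w}_i$$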
ML-36
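LDA Derivations (4/4)
• Proof
$$\hat{\mathbf{W}}=\arg\max_{\mathbf{W}}J(\mathbf{W})=\arg\max_{\mathbf{W}}\frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|}=\arg\max_{\mathbf{W}}\frac{|\mathbf{W}^T\mathbf{S}_b\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_w\mathbf{W}|}\qquad(|\cdot|\text{: determinant})$$
Or, for each column vector $\mathbf{w}_i$ of $\mathbf{W}$, we want to find the $\mathbf{w}_i$ for which the quadratic form has its optimal solution:
$$
\begin{aligned}
\lambda_i&=\frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i},\qquad \frac{\partial\lambda_i}{\partial\mathbf{w}_i}=0\\
&\Rightarrow\ \frac{2\mathbf{S}_b\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right)-2\mathbf{S}_w\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i\right)}{\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right)^2}=0\\
&\Rightarrow\ \mathbf{S}_b\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right)-\mathbf{S}_w\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i\right)=0\\
&\Rightarrow\ \mathbf{S}_b\mathbf{w}_i-\mathbf{S}_w\mathbf{w}_i\left(\frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i}\right)=0\\
&\Rightarrow\ \mathbf{S}_b\mathbf{w}_i-\lambda_i\mathbf{S}_w\mathbf{w}_i=0\ \Rightarrow\ \mathbf{S}_b\mathbf{w}_i=\lambda_i\mathbf{S}_w\mathbf{w}_i\\
&\Rightarrow\ \mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i=\lambda_i\mathbf{w}_i
\end{aligned}
$$
Identities used:
$$\left(\frac{F}{G}\right)'=\frac{F'G-FG'}{G^2},\qquad \frac{d\left(\mathbf{x}^T\mathbf{C}\mathbf{x}\right)}{d\mathbf{x}}=\left(\mathbf{C}+\mathbf{C}^T\right)\mathbf{x}$$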
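The derivation above can be checked numerically on a tiny two-class problem. The sketch below is an illustration only, under these assumptions: pure Python, 2-dimensional made-up samples, and the two-class shortcut $\mathbf{w}\propto\mathbf{S}_w^{-1}(\bar{\mathbf{x}}_1-\bar{\mathbf{x}}_2)$ (valid because $\mathbf{S}_b$ has rank one when J = 2). All function names are hypothetical.

```python
# Two-class LDA sketch in 2-D (standard library only).
# S_w and S_b follow the definitions of the LDA derivation slides.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def mean_vector(vectors):
    n = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(n)]

def scatter_matrices(classes):
    """Return (S_w, S_b): average within- and between-class variation."""
    samples = [x for cls in classes for x in cls]
    total = len(samples)
    n = len(samples[0])
    global_mean = mean_vector(samples)
    s_w = [[0.0] * n for _ in range(n)]
    s_b = [[0.0] * n for _ in range(n)]
    for cls in classes:
        m = mean_vector(cls)
        diff = [m[d] - global_mean[d] for d in range(n)]
        for a in range(n):
            for b in range(n):
                # (N_j / N) * Sigma_j equals the sum over the class divided by N
                s_w[a][b] += sum((x[a] - m[a]) * (x[b] - m[b]) for x in cls) / total
                s_b[a][b] += len(cls) * diff[a] * diff[b] / total
    return s_w, s_b

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    if abs(det) < 1e-12:
        raise ValueError("S_w is singular; LDA direction is undefined")
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def two_class_lda_direction(classes):
    """w proportional to S_w^{-1} (mean_1 - mean_2), scaled to unit length."""
    s_w, _ = scatter_matrices(classes)
    m1, m2 = mean_vector(classes[0]), mean_vector(classes[1])
    w = matvec(inverse_2x2(s_w), [a - b for a, b in zip(m1, m2)])
    norm = dot(w, w) ** 0.5
    return [x / norm for x in w]

def fisher_ratio(w, s_w, s_b):
    """lambda = (w^T S_b w) / (w^T S_w w), the objective for a single column."""
    return dot(w, matvec(s_b, w)) / dot(w, matvec(s_w, w))

# Made-up samples for two classes.
class_1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [2.0, 1.0]]
class_2 = [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [7.0, 6.0]]
w = two_class_lda_direction([class_1, class_2])
```

For more than two classes, the general route is the eigen-decomposition of $\mathbf{S}_w^{-1}\mathbf{S}_b$, keeping the eigenvectors with the largest eigenvalues.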
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}_x=\frac{1}{N}\sum_{i}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T$$
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), calculated using Year-99's 5471 files]
$$\boldsymbol{\Sigma}'_y=\frac{1}{N}\sum_{i}(\mathbf{y}_i-\bar{\mathbf{y}})(\mathbf{y}_i-\bar{\mathbf{y}})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]

Character Error Rate
          TC      WG
MFCC      26.32   22.71
LDA-1     23.12   20.17
LDA-2     23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: projections obtained by PCA and by LDA, side by side]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
[Figure: LDA and HDA projections for two classes]
– Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix

CoVar = [3.0 0.5 0.4;
         0.9 6.3 0.2;
         0.4 0.4 4.2];
colormap('default');
surf(CoVar);

• Eigen Decomposition

BE = [3.0 3.5 1.4;
      1.9 6.3 2.2;
      2.4 0.4 4.2];
WI = [4.0 4.1 2.1;
      2.9 8.7 3.5;
      4.4 3.2 4.3];

% LDA
IWI = inv(WI);
A = IWI * BE;

% PCA
A = BE + WI;        % why? (Prove it)

[V, D] = eig(A);
[V, D] = eigs(A, 3);

% Write the basis vectors to a file
fid = fopen('Basis', 'w');
for i = 1:3          % feature vector length
    for j = 1:3      % basis number
        fprintf(fid, '%10.10f ', V(i, j));
    end
    fprintf(fid, '\n');
end
fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original data samples after the LDA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation
  • Closely related to Principal Component Analysis (PCA)
ML-46
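LSA (2/7)
• Dimension Reduction and Feature Extraction
– PCA: project X (n-dimensional feature space) onto an orthonormal basis $\boldsymbol{\varphi}_1,\ldots,\boldsymbol{\varphi}_k$ (k dimensions)
$$y_i=\boldsymbol{\varphi}_i^T X,\qquad \hat{X}=\sum_{i=1}^{k}y_i\boldsymbol{\varphi}_i,\qquad \min\|X-\hat{X}\|^2\ \text{for a given } k$$
– SVD (in LSA): approximate A (m×n) by A' (m×n) in a k-dimensional latent semantic space
$$A'=U'\,\Sigma'\,V'^T,\qquad \min\|A-A'\|_F^2\ \text{for a given } k,\qquad k\le r\le\min(m,n)$$
with U' of size m×k, Σ' of size k×k, and V'^T of size k×n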
ML-47
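LSA (3/7)
– Singular Value Decomposition (SVD) used for the word-document matrix
  • A least-squares method for dimension reduction
[Figure: projection of a vector x onto the basis vectors $\boldsymbol{\varphi}_1$ and $\boldsymbol{\varphi}_2$, giving coordinates $y_1$ and $y_2$]
$$y_1=\|\mathbf{x}\|\cos\theta_1=\|\mathbf{x}\|\,\frac{\boldsymbol{\varphi}_1^T\mathbf{x}}{\|\boldsymbol{\varphi}_1\|\,\|\mathbf{x}\|}=\boldsymbol{\varphi}_1^T\mathbf{x},\qquad\text{where } \|\boldsymbol{\varphi}_1\|=1$$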
ML-48
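LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each represented by their terms. Paths shown: literal term matching between the two sets of terms; doc expansion; query expansion; and a structure model on each side, connected by latent semantic structure retrieval]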
ML-49
LSA (5/7)
ML-50
LSA (6/7)
[Figure: example with the query "human computer interaction"; one item is marked as an OOV word]
ML-51
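LSA (7/7)
• Singular Value Decomposition (SVD)
– Full decomposition (rows of A: terms $w_1, w_2, \ldots, w_m$; columns of A: docs $d_1, d_2, \ldots, d_n$)
$$A_{m\times n}=U_{m\times r}\,\Sigma_{r}\,V^T_{r\times n},\qquad \Sigma_r\ \text{of size } r\times r,\quad r\le\min(m,n)$$
– Reduced (rank-k) approximation
$$A'_{m\times n}=U'_{m\times k}\,\Sigma_{k}\,V'^T_{k\times n},\qquad \Sigma_k\ \text{of size } k\times k,\quad k\le r$$
– Both U and V have orthonormal column vectors
$$U^TU=I_{r\times r},\qquad V^TV=I_{r\times r}$$
– Frobenius norm
$$\|A\|_F^2=\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2,\qquad \|A\|_F^2\ge\|A'\|_F^2$$
– Row A ∈ R^n, Col A ∈ R^m
Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of Σ_k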
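The relation between A, its rank-k approximation A', and the Frobenius norm can be illustrated with a small sketch. It assumes pure Python and uses power iteration on $A^TA$ with deflation to obtain the leading singular triplets; this is an illustrative choice, not the algorithm used by any particular SVD package, and the function names are hypothetical. The 4×3 matrix is the small term-doc example that appears later in these slides.

```python
# Rank-k approximation via repeated extraction of the leading singular
# triplet (standard library only).

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[dot(row, col) for col in bt] for row in a]

def frobenius_sq(m):
    """Squared Frobenius norm: sum of all squared entries."""
    return sum(x * x for row in m for x in row)

def top_singular_triplet(a, iterations=500):
    """Return (u, sigma, v) for the largest singular value of a."""
    ata = matmul(transpose(a), a)
    n = len(ata)
    v = [1.0 / (i + 1) for i in range(n)]          # arbitrary non-zero start
    norm = dot(v, v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = matvec(ata, v)
        norm = dot(w, w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    av = matvec(a, v)
    sigma = dot(av, av) ** 0.5                      # sigma = ||A v||
    u = [x / sigma for x in av] if sigma > 0.0 else av
    return u, sigma, v

def rank_k_approximation(a, k):
    """Return (A', singular values) with A' = sum of k terms sigma * u * v^T."""
    rows, cols = len(a), len(a[0])
    residual = [row[:] for row in a]
    approx = [[0.0] * cols for _ in range(rows)]
    sigmas = []
    for _ in range(k):
        u, sigma, v = top_singular_triplet(residual)
        sigmas.append(sigma)
        for i in range(rows):
            for j in range(cols):
                term = sigma * u[i] * v[j]
                approx[i][j] += term
                residual[i][j] -= term              # deflate the found component
    return approx, sigmas

# 4 terms x 3 docs.
A = [[2.3, 0.0, 4.2],
     [0.0, 1.3, 2.2],
     [3.8, 0.0, 0.5],
     [0.0, 0.0, 0.0]]
A1, sigma_1 = rank_k_approximation(A, 1)
print(frobenius_sq(A) >= frobenius_sq(A1))
```

For realistic term-doc matrices, a dedicated sparse SVD routine is the practical choice; the sketch only makes the inequality $\|A\|_F^2\ge\|A'\|_F^2$ concrete.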
ML-52
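LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
– $A^TA$ is a symmetric n×n matrix
  • All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_n\ge 0,\qquad \Sigma^2=\mathrm{diag}(\lambda_1,\lambda_2,\ldots,\lambda_n)$$
  • All eigenvectors $\mathbf{v}_j$ are orthonormal ($\mathbf{v}_j\in R^n$)
$$V_{n\times n}=[\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_n],\qquad \mathbf{v}_j^T\mathbf{v}_j=1,\qquad V^TV=I_{n\times n}$$
  • Define the singular values $\sigma_j=\sqrt{\lambda_j},\ j=1,\ldots,n$
    – As the square roots of the eigenvalues of $A^TA$
    – As the lengths of the vectors $A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_n$
$$\sigma_1=\|A\mathbf{v}_1\|,\quad \sigma_2=\|A\mathbf{v}_2\|,\ \ldots$$
$$\|A\mathbf{v}_i\|^2=\mathbf{v}_i^TA^TA\mathbf{v}_i=\mathbf{v}_i^T\lambda_i\mathbf{v}_i=\lambda_i\ \Rightarrow\ \|A\mathbf{v}_i\|=\sigma_i$$
For $\lambda_i\ne 0$, i = 1, …, r, $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A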
ML-53
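LSA Derivations (2/7)
• $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A
$$A\mathbf{v}_i\cdot A\mathbf{v}_j=(A\mathbf{v}_i)^TA\mathbf{v}_j=\mathbf{v}_i^TA^TA\mathbf{v}_j=\lambda_j\mathbf{v}_i^T\mathbf{v}_j=0\qquad(i\ne j)$$
– Suppose that A (or $A^TA$) has rank r ≤ n
$$\lambda_1\ge\lambda_2\ge\cdots\ge\lambda_r>0,\qquad \lambda_{r+1}=\lambda_{r+2}=\cdots=\lambda_n=0$$
– Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A
$$\mathbf{u}_i=\frac{1}{\|A\mathbf{v}_i\|}A\mathbf{v}_i=\frac{1}{\sigma_i}A\mathbf{v}_i\ \Rightarrow\ \sigma_i\mathbf{u}_i=A\mathbf{v}_i\ \Rightarrow\ [\mathbf{u}_1\ \mathbf{u}_2\ \cdots\ \mathbf{u}_r]\,\Sigma_r=A\,[\mathbf{v}_1\ \mathbf{v}_2\ \cdots\ \mathbf{v}_r]$$
Here $[\mathbf{v}_1\ \cdots\ \mathbf{v}_r]$ is an orthonormal matrix (n×r), known in advance, and $[\mathbf{u}_1\ \cdots\ \mathbf{u}_r]$ is also an orthonormal matrix (m×r)
  • Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of $R^m$
$$
\begin{aligned}
&[\mathbf{u}_1\ \cdots\ \mathbf{u}_r\ \cdots\ \mathbf{u}_m]\,\Sigma=A\,[\mathbf{v}_1\ \cdots\ \mathbf{v}_r\ \cdots\ \mathbf{v}_n]\\
&\Rightarrow\ U\Sigma=AV\ \Rightarrow\ U\Sigma V^T=AVV^T\qquad(VV^T=I_{n\times n})\\
&\Rightarrow\ A=U\Sigma V^T
\end{aligned}
$$
$$\Sigma_{m\times n}=\begin{pmatrix}\Sigma_{r\times r}&0_{r\times(n-r)}\\ 0_{(m-r)\times r}&0_{(m-r)\times(n-r)}\end{pmatrix}$$
$$\|A\|_F^2=\sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2=\sigma_1^2+\sigma_2^2+\cdots+\sigma_r^2$$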
ML-54
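LSA Derivations (3/7)
$$A_{m\times n}=U\Sigma V^T=\begin{pmatrix}U_1&U_2\end{pmatrix}\begin{pmatrix}\Sigma_1&0\\0&0\end{pmatrix}\begin{pmatrix}V_1^T\\V_2^T\end{pmatrix}=U_1\Sigma_1V_1^T=AV_1V_1^T$$
$$AV=U\Sigma$$
• The vectors $\mathbf{v}_i$ (columns of $V_1$) span the row space of A, in $R^n$
• The columns of $V_2$ span the null space of A (the solutions of AX = 0)
• The vectors $\mathbf{u}_i$ (columns of $U_1$) span the row space of $A^T$, in $R^m$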
ML-55
LSA Derivations (4/7)
• Additional Explanations
– Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
$$A=U\Sigma V^T\ \Rightarrow\ AV=U\Sigma V^TV=U\Sigma\ \Rightarrow\ U\Sigma=AV$$
  • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
– Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
$$A=U\Sigma V^T\ \Rightarrow\ A^TU=(U\Sigma V^T)^TU=V\Sigma^TU^TU=V\Sigma\ \Rightarrow\ V\Sigma=A^TU$$
  • The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
– The original word-document matrix (A), of size m×n, with rows $w_1, w_2, \ldots, w_m$ (terms) and columns $d_1, d_2, \ldots, d_n$ (docs)
  • compare two terms → dot product of two rows of A (or an entry in $AA^T$)
  • compare two docs → dot product of two columns of A (or an entry in $A^TA$)
  • compare a term and a doc → each individual entry of A
– The new word-document matrix (A'), with $U'=U_{m\times k}$, $\Sigma'=\Sigma_k$, $V'=V_{n\times k}$
  • compare two terms → dot product of two rows of U'Σ'
  • compare two docs → dot product of two rows of V'Σ'
  • compare a query word and a doc → each individual entry of A'
$$A'A'^T=(U'\Sigma'V'^T)(U'\Sigma'V'^T)^T=U'\Sigma'V'^TV'\Sigma'^TU'^T=(U'\Sigma')(U'\Sigma')^T\qquad(V'^TV'=I)$$
$$A'^TA'=(U'\Sigma'V'^T)^T(U'\Sigma'V'^T)=V'\Sigma'^TU'^TU'\Sigma'V'^T=(V'\Sigma')(V'\Sigma')^T\qquad(U'^TU'=I)$$
Here Σ' serves for stretching or shrinking the axes
ML-57
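LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
– For objects (new queries or docs) that did not appear in the original analysis
  • Fold-in a new m×1 query (or doc) vector
$$\hat{\mathbf{q}}_{1\times k}=\left(\mathbf{q}^T\right)_{1\times m}U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
    – The query is represented by the weighted sum of its constituent term vectors
    – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
    – The result $\hat{\mathbf{q}}$ is just like a row of V
– Cosine measure between the query and doc vectors in the latent semantic space ($\hat{\mathbf{q}}$ and $\hat{\mathbf{d}}$ are row vectors)
$$\mathrm{sim}(\hat{\mathbf{q}},\hat{\mathbf{d}})=\mathrm{cosine}(\hat{\mathbf{q}}\Sigma,\hat{\mathbf{d}}\Sigma)=\frac{\hat{\mathbf{q}}\,\Sigma^2\,\hat{\mathbf{d}}^T}{\|\hat{\mathbf{q}}\Sigma\|\,\|\hat{\mathbf{d}}\Sigma\|}$$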
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{\mathbf{t}}_{1\times k}=\mathbf{t}_{1\times n}V_{n\times k}\,\Sigma^{-1}_{k\times k}$$
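The fold-in formula $\hat{\mathbf{q}}=\mathbf{q}^TU\Sigma^{-1}$ and the cosine measure in the latent semantic space can be sketched as follows. This is a minimal illustration assuming pure Python and a hand-built decomposition whose SVD is known by inspection (A = [[3, 0], [0, 2], [0, 0]], so U holds the first two unit vectors, Σ = diag(3, 2), and V is the identity); the function names are hypothetical.

```python
# Fold-in of a new query/doc vector and cosine similarity in the latent
# semantic space (standard library only).

def fold_in(q, u, sigma):
    """q_hat (1 x k) = q^T (1 x m) * U (m x k) * Sigma^{-1} (k x k).

    `sigma` holds the k diagonal entries of Sigma.
    """
    k = len(sigma)
    return [sum(q[i] * u[i][j] for i in range(len(q))) / sigma[j]
            for j in range(k)]

def latent_cosine(q_hat, d_hat, sigma):
    """cosine(q_hat * Sigma, d_hat * Sigma) for two row vectors."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [a * s for a, s in zip(d_hat, sigma)]
    numerator = sum(a * b for a, b in zip(qs, ds))
    denominator = (sum(a * a for a in qs) ** 0.5) * (sum(b * b for b in ds) ** 0.5)
    return numerator / denominator

# Hand-built SVD of A = [[3, 0], [0, 2], [0, 0]] (3 terms x 2 docs).
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
SIGMA = [3.0, 2.0]
V = [[1.0, 0.0],      # row 0: latent vector of doc 1
     [0.0, 1.0]]      # row 1: latent vector of doc 2

# Folding in the first column of A gives back the first row of V.
doc_1_hat = fold_in([3.0, 0.0, 0.0], U, SIGMA)

# A query containing terms 1 and 2 once each.
query_hat = fold_in([1.0, 1.0, 0.0], U, SIGMA)
similarity_to_doc_1 = latent_cosine(query_hat, V[0], SIGMA)
```

Folding in a column of the original matrix reproduces the matching row of V, which is the sense in which a folded-in query is "just like a row of V".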
ML-59
LSA Example
• Experimental results
– HMM is consistently better than VSM at all recall levels
– LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
– A clean formal framework and a clearly defined optimization criterion (least-squares)
  • Conceptual simplicity and clarity
– Handles synonymy problems ("heterogeneous vocabulary")
– Good results for high-recall search
  • Takes term co-occurrence into account
• Disadvantages
– High computational complexity
– LSA offers only a partial solution to polysemy
  • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
– E.g., 4 terms and 3 docs (row: term, column: doc)

2.3 0.0 4.2
0.0 1.3 2.2
3.8 0.0 0.5
0.0 0.0 0.0

– Each entry is weighted by its TF×IDF score
– The same matrix in sparse text form:

4 3 6      (4 terms, 3 docs, 6 nonzero entries)
2          (2 nonzero entries at Col 0)
0 2.3      (Col 0, Row 0)
2 3.8      (Col 0, Row 2)
1          (1 nonzero entry at Col 1)
1 1.3      (Col 1, Row 1)
3          (3 nonzero entries at Col 2)
0 4.2      (Col 2, Row 0)
1 2.2      (Col 2, Row 1)
2 0.5      (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
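The sparse text layout shown for the 4×3 example (a header with rows, columns, and nonzero count, then for each column its nonzero count followed by "row value" pairs) can be produced with a short helper. This is a sketch assuming pure Python and a dense list-of-lists input; the function name is hypothetical and the layout is taken from the example on this slide.

```python
# Convert a dense term-doc matrix into the column-oriented sparse text
# layout illustrated on the slide (standard library only).

def to_sparse_text(matrix):
    rows, cols = len(matrix), len(matrix[0])
    # For each column, collect (row index, value) pairs of nonzero entries.
    columns = [[(r, matrix[r][c]) for r in range(rows) if matrix[r][c] != 0]
               for c in range(cols)]
    nonzero = sum(len(entries) for entries in columns)
    lines = [f"{rows} {cols} {nonzero}"]
    for entries in columns:
        lines.append(str(len(entries)))          # nonzero count of this column
        for r, value in entries:
            lines.append(f"{r} {value}")         # row index and entry value
    return "\n".join(lines) + "\n"

term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
print(to_sparse_text(term_doc), end="")
```

Row and column indices are zero-based, matching the "Col 0, Row 0" annotations in the example.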
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (header: number of indexing terms, number of docs, number of nonzero entries)

51253 2265 218852
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
……

• SVD command (IR_svd.bat)

svd -r st -o LSA100 -d 100 Term-Doc-Matrix

– -r st: sparse matrix input
– -o LSA100: prefix of the output files
– -d 100: number of reserved eigenvectors
– Term-Doc-Matrix: name of the sparse matrix input
• Output files: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $\mathbf{u}^T$ is 1×100)

100 51253
0.003 0.001 ……
0.002 0.002 ……

• LSA100-S (100 eigenvalues)

100
2686.18
829.941
559.59
…

• LSA100-Vt (2265 docs; each doc vector $\mathbf{v}^T$ is 1×100)

100 2265
0.021 0.035 ……
0.012 0.022 ……
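The output files can be loaded with a small parser. The sketch below is an assumption-laden illustration in pure Python: it assumes the layout suggested by the examples on this slide (a header line with the two dimensions followed by whitespace-separated values for the -Ut and -Vt files, and a count followed by one value per line for the -S file), and that the -Ut and -Vt matrices are stored transposed, so that each word or doc vector is a column. The function names are hypothetical.

```python
# Parse dense text output in the layout suggested by the slide examples
# (standard library only).

def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols values into a list of rows."""
    tokens = text.split()
    rows, cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != rows * cols:
        raise ValueError("value count does not match the header")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(t) for t in tokens[1:]]
    if len(values) != count:
        raise ValueError("value count does not match the header")
    return values

def vectors_from_transposed(matrix_t):
    """Turn a k x N transposed matrix into N vectors of length k."""
    return [list(column) for column in zip(*matrix_t)]

# Tiny made-up stand-ins for the real files (k = 2 dimensions, 3 words).
ut_text = "2 3\n0.1 0.2 0.3\n0.4 0.5 0.6\n"
s_text = "2\n3.0\n2.0\n"
word_vectors = vectors_from_transposed(parse_dense_text(ut_text))
singular_values = parse_singular_values(s_text)
```

With real files, the text would come from reading LSA100-Ut, LSA100-S, and LSA100-Vt; the resulting word and doc vectors can then be used for fold-in and the cosine measure.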
ML-65
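LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
$$\hat{\mathbf{q}}_{1\times k}=\left(\mathbf{q}^T\right)_{1\times m}U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
– The query is represented by the weighted sum of its constituent term vectors
– The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
– The result $\hat{\mathbf{q}}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$\mathrm{sim}(\hat{\mathbf{q}},\hat{\mathbf{d}})=\mathrm{cosine}(\hat{\mathbf{q}}\Sigma,\hat{\mathbf{d}}\Sigma)=\frac{\hat{\mathbf{q}}\,\Sigma^2\,\hat{\mathbf{d}}^T}{\|\hat{\mathbf{q}}\Sigma\|\,\|\hat{\mathbf{d}}\Sigma\|}$$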
ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK 
DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT 
ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU 
ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP 
ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA 
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-24
PCA Examples: Eigenface (1/4)
• Example 4: Eigenface in face recognition (Turk and Pentland, 1991)
  - Consider an individual image to be a linear combination of a small number of face components, or "eigenfaces", derived from a set of reference images
  - Steps
    • Convert each of the L reference images into a vector of floating-point numbers representing the light intensity at each pixel
    • Calculate the covariance/correlation matrix between these reference vectors
    • Apply Principal Component Analysis (PCA) to find the eigenvectors of the matrix: the eigenfaces
    • In addition, the vector obtained by averaging all images is called "eigenface 0". The other eigenfaces, from "eigenface 1" onwards, model the variations from this average face
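$$
\mathbf{x}_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}, \quad
\mathbf{x}_2 = \begin{bmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{n2} \end{bmatrix}, \quad \ldots, \quad
\mathbf{x}_L = \begin{bmatrix} x_{1L} \\ x_{2L} \\ \vdots \\ x_{nL} \end{bmatrix}
$$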
ML-25
PCA Examples: Eigenface (2/4)
• Example 4: Eigenface in face recognition (cont.)
  - Steps
    • The faces are then represented as eigenface 0 plus a linear combination of the remaining K (K ≤ L) eigenfaces
  - The eigenface approach satisfies the minimum mean-squared error criterion
  - Incidentally, the eigenfaces are not only usually plausible faces themselves, but also directions of variation between faces
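$$
\hat{\mathbf{x}}_i = \bar{\mathbf{x}} + w_{i1}\mathbf{e}_1 + w_{i2}\mathbf{e}_2 + \cdots + w_{iK}\mathbf{e}_K
\;\Rightarrow\;
\mathbf{y}_i = [\, w_{i1},\ w_{i2},\ \ldots,\ w_{iK} \,]
$$
(y_i: feature vector of person i)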
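The eigenface steps can be sketched on toy data as follows. This is only an illustrative sketch under stated assumptions: the three 3-pixel "images", all function names, and the use of power iteration to find the leading eigenvector are hypothetical choices, not the procedure of Turk and Pentland, and only one eigenface (K = 1) is kept.

```python
# Minimal eigenface-style sketch on toy data (pure Python, no external libraries).
# The toy images and all helper names are hypothetical illustrations.

def mean_vector(vectors):
    """Average of the reference vectors ("eigenface 0")."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]

def covariance_matrix(vectors, mean):
    """Covariance of the mean-centered reference vectors."""
    dim, count = len(mean), len(vectors)
    cov = [[0.0] * dim for _ in range(dim)]
    for v in vectors:
        centered = [v[d] - mean[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += centered[a] * centered[b] / count
    return cov

def leading_eigenvector(matrix, iterations=100):
    """Power iteration: approaches the eigenvector of the largest eigenvalue."""
    dim = len(matrix)
    vec = [1.0 / (d + 1) for d in range(dim)]  # arbitrary non-zero start
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(dim)) for a in range(dim)]
        norm = sum(x * x for x in nxt) ** 0.5
        vec = [x / norm for x in nxt]
    return vec

def project(image, mean, eigenface):
    """Weight w_i1 = e_1^T (x_i - mean)."""
    return sum(e * (x - m) for e, x, m in zip(eigenface, image, mean))

def reconstruct(weight, mean, eigenface):
    """x_hat = mean + w * e_1 (K = 1 in this sketch)."""
    return [m + weight * e for m, e in zip(mean, eigenface)]

faces = [[1.0, 2.0, 5.0], [2.0, 3.0, 5.0], [3.0, 4.0, 5.0]]  # L = 3 images, n = 3 pixels
eigenface_0 = mean_vector(faces)
eigenface_1 = leading_eigenvector(covariance_matrix(faces, eigenface_0))
weights = [project(f, eigenface_0, eigenface_1) for f in faces]
approx = [reconstruct(w, eigenface_0, eigenface_1) for w in weights]
```

Because the toy images differ along a single direction, one eigenface is enough to reconstruct them; with real images the reconstruction is only the best K-term approximation in the mean-squared-error sense.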
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL, 2000)
  - Steps
    • Concatenate the parameters of interest for each speaker r to form a huge vector a(r) (a supervector)
      • E.g. the mean parameters (μ) of the speaker-dependent (SD) HMM
[Figure: eigenvoice space construction. Speaker 1 data ... Speaker R data, together with the SI HMM, go through model training to give Speaker 1 HMM ... Speaker R HMM; the resulting supervectors, of dimension D = (M·n)×1, are fed to Principal Component Analysis]
• Each new speaker S is represented by a point P in K-space
$$
P = e(0) + w(1)\,e(1) + w(2)\,e(2) + \cdots + w(K)\,e(K)
$$
(figure label: SI HMM model)
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates in the feature space, while eigenvoice operates in the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
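LDA Derivations (1/4)
• Suppose there are N sample vectors x_i with dimensionality n, each of which belongs to one of the J classes: g(x_i) = j, j ∈ {1, 2, ..., J}, where g(·) is the class index
  - The sample mean is
$$
\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i
$$
  - The class sample means are
$$
\bar{\mathbf{x}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i)=j} \mathbf{x}_i
$$
  - The class sample covariances are
$$
\boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i)=j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T
$$
  - The average within-class variation before the transform is
$$
\mathbf{S}_w = \frac{1}{N} \sum_{j} N_j \boldsymbol{\Sigma}_j
$$
  - The average between-class variation before the transform is
$$
\mathbf{S}_b = \frac{1}{N} \sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T
$$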
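The definitions of S_w and S_b above can be checked numerically with a small sketch. The eight 2-D samples, the class labels, and every function name below are hypothetical, chosen only so that the results are easy to verify by hand.

```python
# Within-class (S_w) and between-class (S_b) variation for labeled toy data.
# Data and helper names are hypothetical illustrations.

def outer(u, v):
    """Outer product u v^T as a list of rows."""
    return [[a * b for b in v] for a in u]

def add_scaled(target, matrix, scale):
    """target += scale * matrix, in place."""
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            target[r][c] += scale * value

def scatter_matrices(samples, labels):
    total, dim = len(samples), len(samples[0])
    overall = [sum(x[d] for x in samples) / total for d in range(dim)]
    # Class sample means and class sizes N_j.
    counts, sums = {}, {}
    for x, j in zip(samples, labels):
        counts[j] = counts.get(j, 0) + 1
        acc = sums.setdefault(j, [0.0] * dim)
        for d in range(dim):
            acc[d] += x[d]
    means = {j: [s / counts[j] for s in sums[j]] for j in sums}
    s_w = [[0.0] * dim for _ in range(dim)]
    s_b = [[0.0] * dim for _ in range(dim)]
    # S_w = (1/N) sum_j N_j Sigma_j = (1/N) sum_i (x_i - mean_g(i))(x_i - mean_g(i))^T
    for x, j in zip(samples, labels):
        diff = [a - m for a, m in zip(x, means[j])]
        add_scaled(s_w, outer(diff, diff), 1.0 / total)
    # S_b = (1/N) sum_j N_j (mean_j - mean)(mean_j - mean)^T
    for j, mean_j in means.items():
        diff = [m - o for m, o in zip(mean_j, overall)]
        add_scaled(s_b, outer(diff, diff), counts[j] / total)
    return s_w, s_b

samples = [[1.0, 1.0], [3.0, 1.0], [1.0, 3.0], [3.0, 3.0],
           [5.0, 5.0], [7.0, 5.0], [5.0, 7.0], [7.0, 7.0]]
labels = [1, 1, 1, 1, 2, 2, 2, 2]
s_w, s_b = scatter_matrices(samples, labels)
```

Here each class is a symmetric square around its mean, so S_w comes out as the identity, while S_b has all entries equal to 4 because both class means lie on the diagonal, two units away from the overall mean in each coordinate.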
ML-34
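LDA Derivations (2/4)
• If the transform W = [w_1 w_2 ... w_m] is applied
  - The sample vectors will be
$$
\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i
$$
  - The sample mean will be
$$
\bar{\mathbf{y}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \right) = \mathbf{W}^T \bar{\mathbf{x}}
$$
  - The class sample means will be
$$
\bar{\mathbf{y}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i)=j} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \bar{\mathbf{x}}_j
$$
  - The average within-class variation will be
$$
\tilde{\mathbf{S}}_w = \frac{1}{N} \sum_{j} N_j \left\{ \frac{1}{N_j} \sum_{g(\mathbf{x}_i)=j} (\mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j)(\mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j)^T \right\}
= \frac{1}{N} \sum_{j} N_j \left\{ \mathbf{W}^T \boldsymbol{\Sigma}_j \mathbf{W} \right\}
= \mathbf{W}^T \mathbf{S}_w \mathbf{W}
$$
[Diagram: the transform W maps x to y, S_w to \tilde{S}_w, and S_b to \tilde{S}_b]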
ML-35
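LDA Derivations (3/4)
• If the transform W = [w_1 w_2 ... w_m] is applied
  - Similarly, the average between-class variation will be
$$
\tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}
$$
  - Try to find the optimal W such that the following objective function is maximized
$$
J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|}
$$
    • A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
$$
\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i
$$
    • That is, the w_i's are the eigenvectors corresponding to the largest eigenvalues of S_w^{-1} S_b
$$
\mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i
$$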
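For the two-class case, the eigenvector condition can be checked with a small sketch. The within-class matrix and class means below are hypothetical values, the two classes are assumed to be the same size, and the closed form w ∝ S_w^{-1}(x̄_2 − x̄_1) is the standard two-class Fisher direction rather than something stated on the slide.

```python
# Two-class check that w = S_w^{-1} (mean_2 - mean_1) satisfies
# S_w^{-1} S_b w = lambda w. All numbers are hypothetical toy values.

def inverse_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_vec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]

def quad_form(v, m):
    """v^T M v."""
    return sum(x * y for x, y in zip(v, mat_vec(m, v)))

s_w = [[2.0, 1.0], [1.0, 2.0]]            # assumed within-class variation
mean_1, mean_2 = [0.0, 0.0], [2.0, 0.0]   # assumed class means, N_1 = N_2
overall = [(a + b) / 2 for a, b in zip(mean_1, mean_2)]

# S_b = (1/N) sum_j N_j (mean_j - overall)(mean_j - overall)^T, with N_j / N = 1/2
s_b = [[0.0, 0.0], [0.0, 0.0]]
for mean_j in (mean_1, mean_2):
    diff = [m - o for m, o in zip(mean_j, overall)]
    for r in range(2):
        for c in range(2):
            s_b[r][c] += 0.5 * diff[r] * diff[c]

s_w_inv = inverse_2x2(s_w)
w = mat_vec(s_w_inv, [a - b for a, b in zip(mean_2, mean_1)])
lam = quad_form(w, s_b) / quad_form(w, s_w)   # ratio of between to within variation along w
lhs = mat_vec(s_w_inv, mat_vec(s_b, w))       # S_w^{-1} S_b w
rhs = [lam * x for x in w]                    # lambda w
```

With these values w is proportional to (2, −1) and the eigenvalue equals the between-to-within ratio along w, which is 2/3.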
ML-36
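LDA Derivations (4/4)
• Proof
$$
\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|}
\qquad (|\cdot| \text{ denotes the determinant})
$$
Or, for each column vector w_i of W, we want to find
$$
\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \lambda_i, \qquad \lambda_i = \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}
$$
The quadratic form has its optimal solution where the derivative vanishes:
$$
\frac{\partial \lambda_i}{\partial \mathbf{w}_i} = 0
\;\Rightarrow\;
\frac{2\mathbf{S}_b \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - 2\mathbf{S}_w \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i)}{(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i)^2} = 0
$$
$$
\Rightarrow\; \frac{\mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} - \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} \cdot \frac{\mathbf{S}_w \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} = 0
$$
$$
\Rightarrow\; \mathbf{S}_b \mathbf{w}_i - \lambda_i \mathbf{S}_w \mathbf{w}_i = 0
\;\Rightarrow\; \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i
\;\Rightarrow\; \mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i
$$
Rules used:
$$
\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}, \qquad \frac{d(\mathbf{x}^T \mathbf{C} \mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}
$$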
ML-37
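LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure, left: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
$$
\boldsymbol{\Sigma}_{\mathbf{x}} = \frac{1}{N} \sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T
$$
[Figure, right: covariance matrix of the 18-cepstral vectors (after the cosine transform), calculated using Year-99's 5471 files]
$$
\boldsymbol{\Sigma}'_{\mathbf{y}} = \frac{1}{N} \sum_{i} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T
$$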
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure, left: covariance matrix of the 18-PCA-cepstral vectors (after the PCA transform), calculated using Year-99's 5471 files]
[Figure, right: covariance matrix of the 18-LDA-cepstral vectors (after the LDA transform), calculated using Year-99's 5471 files]

Character Error Rate
           TC      WG
  MFCC     26.32   22.71
  LDA-1    23.12   20.17
  LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: two panels labeled PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - In theory, HDA clearly provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformed data, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar=[
    3.0 0.5 0.4;
    0.9 6.3 0.2;
    0.4 0.4 4.2;
    ];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE=[
    3.0 3.5 1.4;
    1.9 6.3 2.2;
    2.4 0.4 4.2;
    ];
    WI=[
    4.0 4.1 2.1;
    2.9 8.7 3.5;
    4.4 3.2 4.3;
    ];
    % LDA
    IWI=inv(WI);
    A=IWI*BE;
    % PCA
    A=BE+WI;   % why? (Prove it)
    [V,D]=eig(A);
    [V,D]=eigs(A,3);
    fid=fopen('Basis','w');
    for i=1:3   % feature vector length
      for j=1:3   % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform (axes: feature 1 vs. feature 2)]
[Figure: distribution of the 2000 original data samples after the LDA transform (axes: feature 1 vs. feature 2)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
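LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: an n-dimensional vector X in the feature space is projected onto an orthonormal basis φ_1, ..., φ_k
$$
y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \| X - \hat{X} \|^2 \ \text{for a given } k
$$
  - SVD (in LSA): the m×n matrix A is approximated by A' in a k-dimensional latent semantic space
$$
A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n); \qquad
A'_{m \times n} = U'_{m \times k}\, \Sigma'_{k \times k}\, V'^T_{k \times n}, \qquad \min \| A - A' \|_F^2 \ \text{for a given } k
$$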
ML-47
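LSA (3/7)
  - Singular Value Decomposition (SVD) is used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector x:
$$
y_1 = \|x\| \cos\theta = \|x\| \frac{\varphi_1^T x}{\|\varphi_1\| \, \|x\|} = \varphi_1^T x, \qquad \text{where } \|\varphi_1\| = 1
$$
[Figure: a vector x at angle θ to the axis φ_1, with projections y_1 and y_2 onto the axes φ_1 and φ_2]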
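The projection formula can be checked numerically. In this sketch the basis vectors and x are hypothetical values; the angle θ is computed independently with atan2 so that y_1 = φ_1ᵀx can be compared against ‖x‖ cos θ.

```python
import math

# Projection of x onto an orthonormal basis (hypothetical 2-D values).

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

phi_1 = [0.6, 0.8]    # unit vector
phi_2 = [-0.8, 0.6]   # unit vector orthogonal to phi_1
x = [2.0, 1.0]

y_1 = dot(phi_1, x)   # coordinate of x along phi_1
y_2 = dot(phi_2, x)   # coordinate of x along phi_2

# Angle between x and phi_1, computed independently of the dot product.
theta = math.atan2(x[1], x[0]) - math.atan2(phi_1[1], phi_1[0])
length_times_cos = norm(x) * math.cos(theta)

# Using both coordinates reproduces x without error.
x_rebuilt = [y_1 * a + y_2 * b for a, b in zip(phi_1, phi_2)]
```

Keeping only y_1 (dropping y_2) would give the best one-dimensional approximation of x along φ_1 in the least-squares sense, which is the idea that PCA and SVD extend to k dimensions.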
ML-48
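LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Diagram: the terms of a doc and the terms of a query can be compared by literal term matching, optionally after doc expansion or query expansion; alternatively, each side is mapped to a structure model and the two are compared by latent semantic structure retrieval]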
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  - The word-document matrix A has rows w_1, w_2, ..., w_m (words) and columns d_1, d_2, ..., d_n (docs)
$$
A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n)
$$
$$
A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}, \qquad k \le r
$$
  - Both U and V have orthonormal column vectors: U^T U = I_{r×r}, V^T V = I_{r×r}
  - Row A ⊆ R^n, Col A ⊆ R^m
  - ||A||_F^2 ≥ ||A'||_F^2, where
$$
\| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2
$$
  - Docs and queries are represented in a k-dimensional space. The axes can be properly weighted according to the associated diagonal values of Σ_k
ML-52
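LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - A^T A is a symmetric n×n matrix
    • All eigenvalues λ_j are nonnegative real numbers
$$
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)
$$
    • The eigenvectors v_j can be chosen to be orthonormal (v_j ∈ R^n)
$$
V_{n \times n} = [\, v_1 \ v_2 \ \cdots \ v_n \,], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}
$$
• Define the singular values
$$
\sigma_j = \sqrt{\lambda_j}, \qquad j = 1, \ldots, n
$$
  - As the square roots of the eigenvalues of A^T A
  - As the lengths of the vectors Av_1, Av_2, ..., Av_n: σ_1 = ||Av_1||, σ_2 = ||Av_2||, ...
$$
\| A v_i \|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \| A v_i \| = \sigma_i
$$
  - For λ_i ≠ 0, i = 1, ..., r, {Av_1, Av_2, ..., Av_r} is an orthogonal basis of Col A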
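The two definitions of the singular values can be compared on a small example. The 2×2 matrix A below is a hypothetical choice, and the closed-form eigen-solver is a shortcut that only works for symmetric 2×2 matrices with a non-zero off-diagonal entry; it stands in for a general eigenvalue routine.

```python
import math

# Singular values of a toy matrix A, computed two ways:
# (1) square roots of the eigenvalues of A^T A, (2) lengths of A v_i.

def gram_matrix(a):
    """Return A^T A for a matrix given as a list of rows."""
    cols = len(a[0])
    return [[sum(row[i] * row[j] for row in a) for j in range(cols)] for i in range(cols)]

def symmetric_eig_2x2(s):
    """Eigenvalues (descending) with unit eigenvectors of a symmetric 2x2 matrix.
    Assumes a non-zero off-diagonal entry to keep the sketch short."""
    a, b, d = s[0][0], s[0][1], s[1][1]
    if b == 0:
        raise ValueError("sketch assumes a non-zero off-diagonal entry")
    mid, rad = (a + d) / 2, math.sqrt(((a - d) / 2) ** 2 + b * b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        v = [b, lam - a]                      # solves (S - lam I) v = 0
        length = math.hypot(v[0], v[1])
        pairs.append((lam, [v[0] / length, v[1] / length]))
    return pairs

def mat_vec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

A = [[3.0, 0.0], [4.0, 5.0]]                  # hypothetical example matrix
pairs = symmetric_eig_2x2(gram_matrix(A))
singular_values = [math.sqrt(lam) for lam, _ in pairs]
image_lengths = [math.hypot(*mat_vec(A, v)) for _, v in pairs]
images = [mat_vec(A, v) for _, v in pairs]
frobenius_sq = sum(x * x for row in A for x in row)
```

For this A, the eigenvalues of AᵀA are 45 and 5, so the singular values are √45 and √5; the vectors Av_1 and Av_2 have exactly these lengths and are orthogonal, and the squared singular values add up to the squared Frobenius norm (50).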
ML-53
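LSA Derivations (2/7)
• {Av_1, Av_2, ..., Av_r} is an orthogonal basis of Col A
$$
A v_i \cdot A v_j = (A v_i)^T A v_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \qquad (i \ne j)
$$
  - Suppose that A (or A^T A) has rank r ≤ n
$$
\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0
$$
  - Define an orthonormal basis {u_1, u_2, ..., u_r} for Col A
$$
u_i = \frac{A v_i}{\| A v_i \|} = \frac{1}{\sigma_i} A v_i \;\Rightarrow\; \sigma_i u_i = A v_i
\;\Rightarrow\; [\, u_1 \ u_2 \ \cdots \ u_r \,]\, \Sigma_r = A\, [\, v_1 \ v_2 \ \cdots \ v_r \,]
$$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis {u_1, u_2, ..., u_m} of R^m
$$
[\, u_1 \ \cdots \ u_r \ \cdots \ u_m \,]\, \Sigma = A\, [\, v_1 \ \cdots \ v_r \ \cdots \ v_n \,]
\;\Rightarrow\; U \Sigma = A V \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T
\qquad (V V^T = I_{n \times n})
$$
$$
\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}
$$
$$
\| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2
$$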
ML-54
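LSA Derivations (3/7)
$$
A_{m \times n} = U \Sigma V^T
= (\, U_1 \ \ U_2 \,) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix}
= U_1 \Sigma_1 V_1^T = A V_1 V_1^T
$$
$$
A V = U \Sigma
$$
[Diagram: A maps R^n to R^m. The columns v_i of V_1 span the row space of A, and the columns of V_2 span the null space of A (AX = 0). The columns u_i of U_1 span the row space of A^T, i.e. the column space of A.]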
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
$$
A = U \Sigma V^T \;\Rightarrow\; A V = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = A V
$$
    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  - Each row of V is related to the projection of a corresponding row of A^T onto the basis formed by the columns of U
$$
A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma^T \;\Rightarrow\; V \Sigma^T = A^T U
$$
    • The i-th entry of a row of V is related to the projection of a corresponding row of A^T onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), of size m×n, with rows w_1, ..., w_m and columns d_1, ..., d_n
    • Compare two terms → dot product of two rows of A (or an entry in AA^T)
    • Compare two docs → dot product of two columns of A (or an entry in A^T A)
    • Compare a term and a doc → each individual entry of A
  - The new word-document matrix (A'), with U' = U_{m×k}, Σ' = Σ_k, V' = V_{n×k}
    • Compare two terms → dot product of two rows of U'Σ'
$$
A' A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U' \Sigma')(U' \Sigma')^T
$$
    • Compare two docs → dot product of two rows of V'Σ'
$$
A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T
$$
    • Compare a query word and a doc → each individual entry of A'
  - In both products the inner factor V'^T V' (or U'^T U') is the identity I_{k×k}; Σ' serves to stretch or shrink the axes
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold in a new m×1 query (or doc) vector
$$
\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}
$$
      The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted by Σ^{-1}; the result is just like a row of V
  - Cosine measure between the query and doc vectors in the latent semantic space (q̂ and d̂ are row vectors)
$$
\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\| \hat{q}\Sigma \| \, \| \hat{d}\Sigma \|}
$$
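The fold-in and cosine formulas above can be sketched on toy values. Everything numeric here is hypothetical: a 3-term vocabulary, k = 2, a matrix U_k with orthonormal columns, two singular values, and one document vector d̂ standing in for a row of V.

```python
import math

# Fold a new query into a k-dimensional latent space and score it against a doc.
# u_k, sigma_k, d_hat and the query are hypothetical toy values.

def fold_in(query, u_k, sigma_k):
    """q_hat = q^T U_k Sigma_k^{-1}; sigma_k holds the k diagonal values."""
    k = len(sigma_k)
    return [sum(query[t] * u_k[t][c] for t in range(len(query))) / sigma_k[c]
            for c in range(k)]

def latent_cosine(q_hat, d_hat, sigma_k):
    """cosine(q_hat Sigma, d_hat Sigma) for row vectors q_hat and d_hat."""
    q_s = [q * s for q, s in zip(q_hat, sigma_k)]
    d_s = [d * s for d, s in zip(d_hat, sigma_k)]
    dot = sum(a * b for a, b in zip(q_s, d_s))
    return dot / (math.sqrt(sum(a * a for a in q_s)) * math.sqrt(sum(b * b for b in d_s)))

u_k = [[0.6, 0.0],      # term 0
       [0.8, 0.0],      # term 1
       [0.0, 1.0]]      # term 2 (columns are orthonormal)
sigma_k = [2.0, 1.0]
query = [1.0, 0.0, 1.0]           # the query contains terms 0 and 2
d_hat = [0.6, 0.8]                # assumed latent vector of one doc (a row of V)

q_hat = fold_in(query, u_k, sigma_k)
similarity = latent_cosine(q_hat, d_hat, sigma_k)
```

Note that multiplying by Σ inside the cosine undoes the Σ⁻¹ applied during fold-in, so the query side of the comparison is effectively qᵀU_k.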
ML-58
LSA Derivations (7/7)
• Fold in a new 1×n term vector
$$
\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}
$$
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g. bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g. 4 terms and 3 docs
  - Each entry is weighted by its TF×IDF score
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, 600, etc.) of LSA dimensionality

Dense matrix (rows: terms, columns: docs):
    2.3  0.0  4.2
    0.0  1.3  2.2
    3.8  0.0  0.5
    0.0  0.0  0.0

Sparse representation of the same matrix:
    4 3 6      (4 rows/terms, 3 columns/docs, 6 nonzero entries)
    2          (2 nonzero entries at Col 0)
    0 2.3      (Col 0, Row 0)
    2 3.8      (Col 0, Row 2)
    1          (1 nonzero entry at Col 1)
    1 1.3      (Col 1, Row 1)
    3          (3 nonzero entries at Col 2)
    0 4.2      (Col 2, Row 0)
    1 2.2      (Col 2, Row 1)
    2 0.5      (Col 2, Row 2)
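The column-by-column sparse layout shown above can be produced from a dense matrix with a few lines of code. This is a sketch that follows the slide's 4×3 example; the function name `to_sparse_text` is hypothetical and is not part of SVDLIBC.

```python
# Write a dense term-doc matrix in the column-oriented sparse text layout:
# a header "rows cols nonzeros", then for each column its nonzero count
# followed by one "row value" line per nonzero entry.
# The function name is a hypothetical helper, not an SVDLIBC API.

def to_sparse_text(dense):
    rows, cols = len(dense), len(dense[0])
    nonzero = sum(1 for row in dense for value in row if value != 0)
    lines = [f"{rows} {cols} {nonzero}"]
    for c in range(cols):
        entries = [(r, dense[r][c]) for r in range(rows) if dense[r][c] != 0]
        lines.append(str(len(entries)))
        lines.extend(f"{r} {value}" for r, value in entries)
    return lines

term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_lines = to_sparse_text(term_doc)
print("\n".join(sparse_lines))
```

An all-zero row, such as the fourth term here, simply never appears in any column's entry list, which is where the storage saving for large term-doc matrices comes from.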
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
    51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of the sparse matrix input
  - Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
  (word vector (u^T): 1×100; 51253 words)
• LSA100-S
    100
    26861882994155959…
  (100 eigenvalues)
• LSA100-Vt
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
  (doc vector (v^T): 1×100; 2265 docs)
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold in a new m×1 query vector (TF×IDF weighted beforehand)
$$
\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}
$$
  The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted by Σ^{-1}; the result is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$
\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\| \hat{q}\Sigma \| \, \| \hat{d}\Sigma \|}
$$
ML-25
PCA Examples Eigenface (24)
bull Example 4 Eigenface in face recognition (cont) ndash Steps
bull Then the faces are then represented as eigenvoice 0 plus a linear combination of the remain K (K le L) eigenfaces
ndash The Eigenface approach persists the minimum mean-squared error criterion
ndash Incidentally the eigenfaces are not only themselves usually plausible faces but also directions of variations between faces
( ) ( ) ( )[ ]Kiiii
Kiiii
wwwKwww
21
21
121ˆ
=rArr
++++=
yeeexx
Feature vector of a person i
ML-26
PCA Examples: Eigenface (3/4)
[Figure: face images used as the training set, and the averaged face]
ML-27
PCA Examples: Eigenface (4/4)
[Figure: seven eigenfaces derived from the training set (indicating directions of variation between faces), and a projected face image]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL 2000)
  - Steps
    • Concatenate the relevant parameters for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
[Figure: eigenvoice space construction. The data of Speaker 1 ... Speaker R, together with the SI HMM, go through model training to give Speaker 1 HMM ... Speaker R HMM; the resulting supervectors, each of dimension D = (M·n)×1, are passed to Principal Component Analysis]
• Each new speaker S is represented by a point P in K-space:
  $P_i = \mathbf{e}(0) + w_{i,1}\,\mathbf{e}(1) + w_{i,2}\,\mathbf{e}(2) + \cdots + w_{i,K}\,\mathbf{e}(K)$
  (e(0): SI HMM model)
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates in the feature space, while eigenvoice operates in the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with class labels, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
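LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes: $g(\mathbf{x}_i) = j,\; j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index and $N_j$ is the number of samples in class j
  - The sample mean is
    $\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i$
  - The class sample means are
    $\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\mathbf{x}_i$
  - The class sample covariances are
    $\Sigma_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)(\mathbf{x}_i-\bar{\mathbf{x}}_j)^T$
  - The average within-class variation before the transform
    $S_w = \frac{1}{N}\sum_{j} N_j\,\Sigma_j$
  - The average between-class variation before the transform
    $S_b = \frac{1}{N}\sum_{j} N_j\,(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T$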
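The definitions of $S_w$ and $S_b$ above translate almost directly into code. The following is a minimal sketch in plain Python, assuming a tiny hypothetical two-class, two-dimensional data set; the function name `scatter_matrices` and the sample values are illustrative only.

```python
# Within-class (S_w) and between-class (S_b) variation for labeled samples.
# Data and function names are hypothetical, chosen only to illustrate the formulas.

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def add_scaled(A, B, scale):
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scatter_matrices(samples, labels):
    N, n = len(samples), len(samples[0])
    total_mean = mean(samples)
    S_w = [[0.0] * n for _ in range(n)]
    S_b = [[0.0] * n for _ in range(n)]
    for j in sorted(set(labels)):
        members = [x for x, g in zip(samples, labels) if g == j]
        m_j = mean(members)
        # (1/N) * N_j * Sigma_j equals (1/N) * sum of outer products within class j
        for x in members:
            d = [xi - mi for xi, mi in zip(x, m_j)]
            S_w = add_scaled(S_w, outer(d, d), 1.0 / N)
        db = [mi - ti for mi, ti in zip(m_j, total_mean)]
        S_b = add_scaled(S_b, outer(db, db), len(members) / N)
    return S_w, S_b

samples = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0],   # class 0
           [4.0, 1.0], [6.0, 1.0], [4.0, 3.0], [6.0, 3.0]]   # class 1
labels = [0, 0, 0, 0, 1, 1, 1, 1]
S_w, S_b = scatter_matrices(samples, labels)
```

For this toy set the two classes have the same spread, so $S_w$ is the identity, while $S_b$ has rank one because there are only two class means.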
ML-34
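LDA Derivations (2/4)
• If the transform $W = [\mathbf{w}_1\; \mathbf{w}_2\; \cdots\; \mathbf{w}_m]$ is applied (mapping $\mathbf{x} \to \mathbf{y}$, $S_w \to \tilde{S}_w$, $S_b \to \tilde{S}_b$)
  - The sample vectors will be
    $\mathbf{y}_i = W^T\mathbf{x}_i$
  - The sample mean will be
    $\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N}W^T\mathbf{x}_i = W^T\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\right) = W^T\bar{\mathbf{x}}$
  - The class sample means will be
    $\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}W^T\mathbf{x}_i = W^T\bar{\mathbf{x}}_j$
  - The average within-class variation will be
    $\tilde{S}_w = \frac{1}{N}\sum_{j}N_j\left\{\frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j}\left(W^T\mathbf{x}_i - W^T\bar{\mathbf{x}}_j\right)\left(W^T\mathbf{x}_i - W^T\bar{\mathbf{x}}_j\right)^T\right\} = W^T\left\{\frac{1}{N}\sum_{j}N_j\,\Sigma_j\right\}W = W^TS_wW$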
ML-35
LDA Derivations (3/4)
• If the transform $W = [\mathbf{w}_1\; \mathbf{w}_2\; \cdots\; \mathbf{w}_m]$ is applied
  - Similarly, the average between-class variation will be
    $\tilde{S}_b = W^TS_bW$
  - Try to find the optimal $W$ such that the following objective function is maximized:
    $J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^TS_bW|}{|W^TS_wW|}$
    • A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
      $S_b\mathbf{w}_i = \lambda_i S_w\mathbf{w}_i$
    • That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1}S_b$:
      $S_w^{-1}S_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$
ML-36
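LDA Derivations (4/4)
• Proof:
  $\hat{W} = \arg\max_{W} J(W) = \arg\max_{W}\frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W}\frac{|W^TS_bW|}{|W^TS_wW|}$   ($|\cdot|$: determinant)
  Or, for each column vector $\mathbf{w}_i$ of $W$, we want to find
  $\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i}\frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}$
  The quadratic form has an optimal solution where the derivative vanishes. Using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^TC\mathbf{x})}{d\mathbf{x}} = (C + C^T)\mathbf{x}$:
  $\frac{\partial}{\partial\mathbf{w}_i}\left(\frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}\right) = 0$
  $\Rightarrow \frac{2S_b\mathbf{w}_i\left(\mathbf{w}_i^TS_w\mathbf{w}_i\right) - 2S_w\mathbf{w}_i\left(\mathbf{w}_i^TS_b\mathbf{w}_i\right)}{\left(\mathbf{w}_i^TS_w\mathbf{w}_i\right)^2} = 0$
  $\Rightarrow \frac{S_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i} - \frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}\cdot\frac{S_w\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i} = 0$
  $\Rightarrow S_b\mathbf{w}_i - \lambda_iS_w\mathbf{w}_i = 0 \quad\left(\lambda_i = \frac{\mathbf{w}_i^TS_b\mathbf{w}_i}{\mathbf{w}_i^TS_w\mathbf{w}_i}\right)$
  $\Rightarrow S_b\mathbf{w}_i = \lambda_iS_w\mathbf{w}_i$
  $\Rightarrow S_w^{-1}S_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$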
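The eigenvector condition derived above can be checked numerically in the two-class case, where $S_b$ has rank one and the optimal direction is proportional to $S_w^{-1}(\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$. The sketch below uses plain Python with 2×2 matrices; the matrices, the mean difference, and the helper names are hypothetical values picked for illustration, not taken from the slides.

```python
# Two-class check that w = S_w^{-1} (m1 - m2) satisfies S_w^{-1} S_b w = lambda * w.
# All numbers are hypothetical; only 2x2 matrices are handled.

def inv2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def quad(v, M):
    """Quadratic form v^T M v."""
    return sum(vi * mi for vi, mi in zip(v, matvec(M, v)))

S_w = [[2.0, 1.0], [1.0, 2.0]]            # assumed within-class variation
diff = [2.0, 0.0]                         # assumed difference of class means
c = 0.25                                  # N1*N2/N^2 for two equally sized classes
S_b = [[c * a * b for b in diff] for a in diff]

w = matvec(inv2(S_w), diff)               # candidate discriminant direction
Mw = matvec(matmul2(inv2(S_w), S_b), w)   # S_w^{-1} S_b w
lam = Mw[0] / w[0]                        # eigenvalue implied by the first component
rayleigh = quad(w, S_b) / quad(w, S_w)    # lambda as defined in the proof
```

Both ways of computing λ agree here, which is the relation $\lambda_i = \mathbf{w}_i^TS_b\mathbf{w}_i / \mathbf{w}_i^TS_w\mathbf{w}_i$ used in the proof.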
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, $\Sigma_x = \frac{1}{N}\sum_i(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T$, calculated using Year-99's 5471 files]
[Figure: after the cosine transform, covariance matrix of the 18-cepstral vectors, $\Sigma_y = \frac{1}{N}\sum_i(\mathbf{y}_i-\bar{\mathbf{y}})(\mathbf{y}_i-\bar{\mathbf{y}})^T$, calculated using Year-99's 5471 files]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: after the PCA transform, covariance matrix of the 18-PCA-cepstral vectors; after the LDA transform, covariance matrix of the 18-LDA-cepstral vectors; both calculated using Year-99's 5471 files]
Character Error Rate
          TC       WG
MFCC      26.32    22.71
LDA-1     23.12    20.17
LDA-2     23.11    20.11
ML-39
PCA vs. LDA (1/2)
[Figure: projection directions obtained by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformed data, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
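HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4;       % between-class variation
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;       % within-class variation
          2.9 8.7 3.5;
          4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI*BE;
    % PCA (use this line instead of the two LDA lines)
    A = BE + WI;             % why? (Prove it)
    [V,D] = eig(A);          % full eigen decomposition
    [V,D] = eigs(A,3);       % or: the 3 largest eigenvalues/eigenvectors
    fid = fopen('Basis','w');
    for i = 1:3              % feature vector length
        for j = 1:3          % basis number
            fprintf(fid, '%10.10f ', V(i,j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);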
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation (axes: feature 1, feature 2)]
[Figure: distribution of the 2000 original data samples after the LDA transformation (axes: feature 1, feature 2)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
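LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: the n-dimensional X (feature space) is projected onto an orthonormal basis $\phi_1, \ldots, \phi_k$
    $y_i = \phi_i^TX, \qquad \hat{X} = \sum_{i=1}^{k} y_i\,\phi_i, \qquad \min\|X - \hat{X}\|^2 \text{ for a given } k$
  - SVD (in LSA): the m×n matrix A is approximated by the m×n matrix A' in the latent semantic space
    $A_{m\times n} = U'_{m\times r}\,\Sigma'_{r\times r}\,V'^T_{r\times n},\; r \le \min(m,n); \qquad A' \text{ keeps the leading } k\times k \text{ block of } \Sigma'; \qquad \min\|A - A'\|_F^2 \text{ for a given } k$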
ML-47
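LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: projection of a vector x onto the basis vectors $\varphi_1$ and $\varphi_2$, giving the coordinates $y_1$ and $y_2$]
  $y_1 = \|\mathbf{x}\|\cos\theta_1 = \|\mathbf{x}\|\,\frac{\varphi_1^T\mathbf{x}}{\|\varphi_1\|\,\|\mathbf{x}\|} = \varphi_1^T\mathbf{x}, \quad \text{where } \|\varphi_1\| = 1$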
ML-48
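LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each represented by their terms and by a structure model. The terms can be compared by literal term matching, optionally after doc expansion or query expansion; the structure models are compared by latent semantic structure retrieval]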
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
(An OOV word)
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD) of the word-document matrix (rows: words $w_1, \ldots, w_m$; columns: docs $d_1, \ldots, d_n$)
  $A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^T_{r\times n}, \qquad r \le \min(m,n)$
  - Both U and V have orthonormal column vectors: $U^TU = I_{r\times r}$, $V^TV = I_{r\times r}$
  - Row A ∈ R^n, Col A ∈ R^m
• Rank-k approximation
  $A'_{m\times n} = U'_{m\times k}\,\Sigma_{k\times k}\,V'^T_{k\times n}, \qquad k \le r, \qquad \|A\|_F^2 \ge \|A'\|_F^2$
  where $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2$
• Docs and queries are represented in a k-dimensional space. The axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
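LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^TA$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
      $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
    • All eigenvectors $\mathbf{v}_j$ are orthonormal (∈ R^n)
      $V_{n\times n} = [\mathbf{v}_1\; \mathbf{v}_2\; \cdots\; \mathbf{v}_n], \qquad \mathbf{v}_j^T\mathbf{v}_j = 1, \qquad V^TV = I_{n\times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\; j = 1, \ldots, n$
  - As the square roots of the eigenvalues of $A^TA$
  - As the lengths of the vectors $A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_n$: $\sigma_1 = \|A\mathbf{v}_1\|$, $\sigma_2 = \|A\mathbf{v}_2\|$, ...
    $\|A\mathbf{v}_i\|^2 = \mathbf{v}_i^TA^TA\mathbf{v}_i = \mathbf{v}_i^T\lambda_i\mathbf{v}_i = \lambda_i \;\Rightarrow\; \|A\mathbf{v}_i\| = \sigma_i$
• For $\lambda_i \ne 0$, i = 1, ..., r, $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A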
ML-53
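LSA Derivations (2/7)
• $\{A\mathbf{v}_1, A\mathbf{v}_2, \ldots, A\mathbf{v}_r\}$ is an orthogonal basis of Col A
  $A\mathbf{v}_i \bullet A\mathbf{v}_j = (A\mathbf{v}_i)^TA\mathbf{v}_j = \mathbf{v}_i^TA^TA\mathbf{v}_j = \lambda_j\mathbf{v}_i^T\mathbf{v}_j = 0 \quad (i \ne j)$
  - Suppose that A (or $A^TA$) has rank r ≤ n
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  - Define an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_r\}$ for Col A
    $\mathbf{u}_i = \frac{1}{\|A\mathbf{v}_i\|}A\mathbf{v}_i = \frac{1}{\sigma_i}A\mathbf{v}_i \;\Rightarrow\; \sigma_i\mathbf{u}_i = A\mathbf{v}_i \;\Rightarrow\; [\mathbf{u}_1\; \mathbf{u}_2\; \cdots\; \mathbf{u}_r]\,\Sigma_r = A\,[\mathbf{v}_1\; \mathbf{v}_2\; \cdots\; \mathbf{v}_r]$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_m\}$ of R^m
      $[\mathbf{u}_1\; \cdots\; \mathbf{u}_r\; \cdots\; \mathbf{u}_m]\,\Sigma = A\,[\mathbf{v}_1\; \cdots\; \mathbf{v}_r\; \cdots\; \mathbf{v}_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \quad (VV^T = I_{n\times n})$
      $\Sigma_{m\times n} = \begin{pmatrix}\Sigma_{r\times r} & 0_{r\times(n-r)}\\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)}\end{pmatrix}$
  - Frobenius norm
    $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$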
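The relations above (singular values as square roots of the eigenvalues of $A^TA$, as lengths of $A\mathbf{v}_i$, and the Frobenius identity) can be verified on a small matrix. The sketch below is plain Python under stated assumptions: the 3×2 matrix is a made-up example, and the closed-form eigen-solver `sym2_eig` is a hypothetical helper that only handles symmetric 2×2 matrices with a nonzero off-diagonal entry.

```python
import math

def sym2_eig(M):
    """Eigenpairs (descending eigenvalue, unit eigenvector) of a symmetric
    2x2 matrix whose off-diagonal entry is nonzero."""
    (a, b), (_, d) = M
    mid, rad = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        v = [b, lam - a]                       # solves (a - lam) v0 + b v1 = 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, [v[0] / norm, v[1] / norm]))
    return pairs

A = [[1.0, 1.0],
     [0.0, 1.0],
     [1.0, 0.0]]                               # hypothetical 3 x 2 matrix

# A^T A is a symmetric 2 x 2 matrix
AtA = [[sum(row[i] * row[j] for row in A) for j in range(2)] for i in range(2)]
eig = sym2_eig(AtA)

sigmas = [math.sqrt(lam) for lam, _ in eig]                # sigma_j = sqrt(lambda_j)
Av = [[sum(a * x for a, x in zip(row, v)) for row in A] for _, v in eig]
lengths = [math.sqrt(sum(c * c for c in col)) for col in Av]   # ||A v_j||
cross = sum(p * q for p, q in zip(Av[0], Av[1]))           # A v_1 . A v_2
frob_sq = sum(a * a for row in A for a in row)             # ||A||_F^2
```

Here `lengths` matches `sigmas`, `cross` is zero, and `frob_sq` equals the sum of the squared singular values, mirroring the three statements on the slides.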
ML-54
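LSA Derivations (3/7)
[Figure: A (m×n) maps R^n to R^m. In R^n, the columns $\mathbf{v}_i$ of $V_1$ span the row space of A, and $V_2$ spans the null space (AX = 0). In R^m, the columns $\mathbf{u}_i$ of $U_1$ span the row space of $A^T$, and $U_2$ spans its orthogonal complement]
  $A = U\Sigma V^T = (U_1\; U_2)\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} = U_1\Sigma_1V_1^T$
  $U\Sigma = AV$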
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
    $A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^TV = U\Sigma \;\Rightarrow\; U\Sigma = AV$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  - Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
    $A = U\Sigma V^T \;\Rightarrow\; A^TU = (U\Sigma V^T)^TU = V\Sigma^TU^TU = V\Sigma \;\Rightarrow\; V\Sigma = A^TU$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), m×n, with rows $w_1, \ldots, w_m$ and columns $d_1, \ldots, d_n$
    • compare two terms → dot product of two rows of A
      - or an entry in $AA^T$
    • compare two docs → dot product of two columns of A
      - or an entry in $A^TA$
    • compare a term and a doc → each individual entry of A
  - The new word-document matrix (A'), with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$
    • compare two terms → dot product of two rows of U'Σ'
      $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$
    • compare two docs → dot product of two rows of V'Σ'
      $A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
    • compare a query word and a doc → each individual entry of A'
    (In both products the inner factor $V'^TV'$ or $U'^TU'$ is an identity matrix; Σ' is for stretching or shrinking the axes)
ML-57
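LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
      $\hat{\mathbf{q}}_{1\times k} = (\mathbf{q}^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$
      - The result is just like a row of V
      - The query is represented by the weighted sum of its constituent term vectors
      - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  - Cosine measure between the query and doc vectors in the latent semantic space ($\hat{\mathbf{q}}$ and $\hat{\mathbf{d}}$ are row vectors)
    $sim(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\Sigma,\, \hat{\mathbf{d}}\Sigma) = \frac{\hat{\mathbf{q}}\,\Sigma^2\,\hat{\mathbf{d}}^T}{\|\hat{\mathbf{q}}\Sigma\|\;\|\hat{\mathbf{d}}\Sigma\|}$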
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $\hat{\mathbf{t}}_{1\times k} = \mathbf{t}_{1\times n}\,V_{n\times k}\,\Sigma^{-1}_{k\times k}$
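The fold-in and cosine formulas can be illustrated with a small sketch in plain Python. Everything numeric here is an assumption for demonstration: the 3×2 matrix `U`, the singular values `sigma`, the query `q`, and the doc vector `d_hat` are made-up, and the function names `fold_in` and `sim` are hypothetical.

```python
import math

def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}, with U given as m rows of k entries and
    sigma holding the k diagonal values of Sigma."""
    k = len(sigma)
    return [sum(q[i] * U[i][c] for i in range(len(q))) / sigma[c]
            for c in range(k)]

def sim(q_hat, d_hat, sigma):
    """Cosine between the rescaled row vectors q_hat*Sigma and d_hat*Sigma."""
    qs = [x * s for x, s in zip(q_hat, sigma)]
    ds = [x * s for x, s in zip(d_hat, sigma)]
    num = sum(a * b for a, b in zip(qs, ds))
    den = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return num / den

U = [[1.0, 0.0],          # hypothetical term vectors (orthonormal columns), m = 3, k = 2
     [0.0, 1.0],
     [0.0, 0.0]]
sigma = [2.0, 1.0]        # hypothetical diagonal of Sigma_k
q = [2.0, 0.0, 5.0]       # hypothetical m x 1 query (term weights)

q_hat = fold_in(q, U, sigma)      # query folded into the k-dimensional space
d_hat = [0.6, 0.8]                # hypothetical row of V for some doc
score = sim(q_hat, d_hat, sigma)
```

In this toy setup the third term has an all-zero row in `U`, so its weight in the query has no effect on `q_hat`; only terms with a latent representation contribute to the folded-in query.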
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms (rows) and 3 docs (columns)
        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0
  - Each entry is weighted by its TF×IDF score
  - The corresponding sparse text input:
        4 3 6      (no. of rows/terms, no. of columns/docs, no. of nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, ..., 600, etc.) of LSA dimensionality
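The column-by-column layout shown on this slide can be produced from a dense matrix with a short helper. This is a sketch in plain Python that only mirrors the layout of the slide's example (header line, then per column a count followed by "row value" pairs); the function name `to_sparse_text` is hypothetical and is not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) into the column-wise sparse
    text layout of the slide: 'rows cols nonzeros', then for each column
    its nonzero count followed by one 'row value' line per nonzero entry."""
    rows, cols = len(matrix), len(matrix[0])
    columns = [[(r, matrix[r][c]) for r in range(rows) if matrix[r][c] != 0]
               for c in range(cols)]
    lines = ["%d %d %d" % (rows, cols, sum(len(col) for col in columns))]
    for col in columns:
        lines.append(str(len(col)))
        lines.extend("%d %s" % (r, v) for r, v in col)
    return "\n".join(lines)

# The 4-term, 3-doc example from the slide
dense = [[2.3, 0.0, 4.2],
         [0.0, 1.3, 2.2],
         [3.8, 0.0, 0.5],
         [0.0, 0.0, 0.0]]
text = to_sparse_text(dense)
```

Row and column indices are zero-based, as in the slide's annotations (Col 0, Row 2, and so on).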
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (sparse text input)
      51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
      77
      508 7.725771
      596 16.213399
      612 13.080868
      709 7.725771
      713 7.725771
      744 7.725771
      1190 7.725771
      1200 16.213399
      1259 7.725771
      ......
• SVD command (IR_svd.bat)
      svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
• Output
  - LSA100-Ut
  - LSA100-S
  - LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $\mathbf{u}^T$ is 1×100)
      100 51253
      0.003 0.001 ......
      0.002 0.002 ......
• LSA100-S (100 eigenvalues)
      100
      2686.18
      829.941
      559.59
      ...
• LSA100-Vt (2265 docs; each doc vector $\mathbf{v}^T$ is 1×100)
      100 2265
      0.021 0.035 ......
      0.012 0.022 ......
ML-65
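LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
  $\hat{\mathbf{q}}_{1\times k} = (\mathbf{q}^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$
  - The result is just like a row of V
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
• Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{\mathbf{q}}, \hat{\mathbf{d}}) = \mathrm{cosine}(\hat{\mathbf{q}}\Sigma,\, \hat{\mathbf{d}}\Sigma) = \frac{\hat{\mathbf{q}}\,\Sigma^2\,\hat{\mathbf{d}}^T}{\|\hat{\mathbf{q}}\Sigma\|\;\|\hat{\mathbf{d}}\Sigma\|}$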
ML-26
PCA Examples Eigenface (34)
The averaged face
Face images as the training set
ML-27
PCA Examples Eigenface (44)
A projected Face imageSeven eigenfaces derived from the training set
(Indicate directions of variations between faces )
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index and $N_j$ is the number of samples in class j
– The sample mean is
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
– The class sample means are
$$\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i)=j} x_i$$
– The class sample covariances are
$$\Sigma_j = \frac{1}{N_j} \sum_{g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
– The average within-class variation before transform
$$S_w = \frac{1}{N} \sum_{j} N_j \Sigma_j$$
– The average between-class variation before transform
$$S_b = \frac{1}{N} \sum_{j} N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
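The within-class and between-class matrices of this slide, $S_w = \frac{1}{N}\sum_j N_j \Sigma_j$ and $S_b = \frac{1}{N}\sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$, can be computed directly from labeled samples. The following is a minimal sketch in plain Python; the function names and the four-sample toy data set are assumptions made for illustration and are not part of the slides.

```python
def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]


def add_outer(acc, diff, weight):
    """In place: acc += weight * diff * diff^T."""
    for r, a in enumerate(diff):
        for c, b in enumerate(diff):
            acc[r][c] += weight * a * b


def scatter_matrices(samples, labels):
    """Return (S_w, S_b) for labeled samples."""
    total = len(samples)
    dim = len(samples[0])
    overall_mean = mean(samples)
    s_w = [[0.0] * dim for _ in range(dim)]
    s_b = [[0.0] * dim for _ in range(dim)]
    for label in sorted(set(labels)):
        members = [x for x, g in zip(samples, labels) if g == label]
        class_mean = mean(members)
        # (N_j / N) * Sigma_j, accumulated one sample at a time
        for x in members:
            add_outer(s_w, [a - b for a, b in zip(x, class_mean)], 1.0 / total)
        # (N_j / N) * (class mean - overall mean)(class mean - overall mean)^T
        add_outer(s_b, [a - b for a, b in zip(class_mean, overall_mean)],
                  len(members) / total)
    return s_w, s_b


# Hypothetical toy data: two classes of two 2-dimensional samples each.
samples = [[1.0, 2.0], [3.0, 2.0], [7.0, 6.0], [9.0, 6.0]]
labels = [1, 1, 2, 2]
S_w, S_b = scatter_matrices(samples, labels)
```

For this toy set the classes only spread along the first axis, so S_w has a single nonzero entry, while S_b reflects the offset between the two class means.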
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \ldots\ w_m]$ is applied
– The sample vectors will be
$$y_i = W^T x_i$$
– The sample mean will be
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$$
– The class sample means will be
$$\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i)=j} W^T x_i = W^T \bar{x}_j$$
– The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N} \sum_{j} N_j \left\{ \frac{1}{N_j} \sum_{g(x_i)=j} (W^T x_i - W^T \bar{x}_j)(W^T x_i - W^T \bar{x}_j)^T \right\} = W^T \left\{ \frac{1}{N} \sum_{j} N_j \Sigma_j \right\} W = W^T S_w W$$
[Figure: W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$.]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \ldots\ w_m]$ is applied
– Similarly, the average between-class variation will be
$$\tilde{S}_b = W^T S_b W$$
– Try to find the optimal W such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$
• A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_b w_i = \lambda_i S_w w_i$$
• That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
$$S_w^{-1} S_b w_i = \lambda_i w_i$$
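To make the closed-form solution concrete, the generalized eigenproblem $S_b w = \lambda S_w w$ can be solved by hand in two dimensions by forming $S_w^{-1} S_b$ and taking its leading eigenpair. The sketch below uses only the standard library and is restricted to 2×2 matrices; the helper names and the numeric values of S_w and S_b are assumptions chosen for illustration.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix (assumed nonsingular)."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[r][k] * q[k][c] for k in range(2)) for c in range(2)]
            for r in range(2)]


def top_eigenpair2(m):
    """Largest eigenvalue and a unit eigenvector of a 2x2 matrix with real eigenvalues."""
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    lam = (tr + math.sqrt(max(tr * tr - 4.0 * det, 0.0))) / 2.0
    if abs(b) > 1e-12:
        v = [b, lam - a]
    elif abs(c) > 1e-12:
        v = [lam - d, c]
    else:  # diagonal matrix: the eigenvectors are the coordinate axes
        v = [1.0, 0.0] if a >= d else [0.0, 1.0]
    norm = math.hypot(v[0], v[1])
    return lam, [v[0] / norm, v[1] / norm]


# Hypothetical within-class and between-class matrices.
S_w = [[2.0, 1.0], [1.0, 2.0]]
S_b = [[9.0, 6.0], [6.0, 4.0]]   # rank one: built from the mean offset (3, 2)

lam, w = top_eigenpair2(matmul2(inv2(S_w), S_b))
```

Because this S_b has rank one (a two-class situation), the leading direction w is proportional to $S_w^{-1}$ applied to the mean offset (3, 2), which is the direction (4, 1) here. For higher dimensions a numerical eigen-solver would be used instead of the 2×2 formulas.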
ML-36
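LDA Derivations (4/4)
• Proof:
$$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|}$$
where $|\cdot|$ denotes the determinant.
Or, for each column vector $w_i$ of W, we want to find
$$\hat{w}_i = \arg\max_{w_i} \lambda_i, \qquad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$$
The quadratic form has its optimal solution where the derivative vanishes:
$$\frac{\partial \lambda_i}{\partial w_i} = 0$$
$$\Rightarrow \frac{2 S_b w_i \,(w_i^T S_w w_i) - 2 S_w w_i \,(w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$$
$$\Rightarrow \frac{S_b w_i}{w_i^T S_w w_i} - \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \cdot \frac{S_w w_i}{w_i^T S_w w_i} = 0$$
$$\Rightarrow S_b w_i - \lambda_i S_w w_i = 0$$
$$\Rightarrow S_b w_i = \lambda_i S_w w_i \Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$$
Rules used: $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(x^T C x)}{dx} = (C + C^T)\,x$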
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure, before the cosine transform: covariance matrix of the 18-Mel-filter-bank vectors, $\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$, calculated using Year-99's 5471 files.]
[Figure, after the cosine transform: covariance matrix of the 18-cepstral vectors, $\Sigma_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$, calculated using Year-99's 5471 files.]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure, after PCA transform: covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99's 5471 files.]
[Figure, after LDA transform: covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99's 5471 files.]
Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: two panels labeled PCA and LDA.]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference between the projections obtained from LDA and HDA for the 2-class case
– In theory, HDA provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that the data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices of the transformed data, respectively. Describe the phenomena that you have observed.
3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar=[3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default');
surf(CoVar);
• Eigen Decomposition
BE=[3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];   % presumably the between-class matrix
WI=[4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];   % presumably the within-class matrix
% LDA
IWI=inv(WI);
A=IWI*BE;
% PCA (alternative definition of A)
A=BE+WI;   % why? (Prove it)
[V,D]=eig(A);
[V,D]=eigs(A,3);
fid=fopen('Basis','w');
for i=1:3   % feature vector length
  for j=1:3   % basis number
    fprintf(fid,'%10.10f ',V(i,j));
  end
  fprintf(fid,'\n');
end
fclose(fid);
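One way to approach the "why? (Prove it)" question is sketched below, under the assumption that BE and WI stand for the between-class matrix $S_b$ and the within-class matrix $S_w$ of the LDA derivation, and that PCA works on the total covariance $S_t$ of the merged data. The symbol $S_t$ is introduced here for illustration only.

```latex
\begin{align*}
S_t &= \frac{1}{N}\sum_{i=1}^{N}(x_i-\bar{x})(x_i-\bar{x})^T
     = \frac{1}{N}\sum_{j}\sum_{g(x_i)=j}
       \bigl[(x_i-\bar{x}_j)+(\bar{x}_j-\bar{x})\bigr]
       \bigl[(x_i-\bar{x}_j)+(\bar{x}_j-\bar{x})\bigr]^T \\
    &= \frac{1}{N}\sum_{j}\sum_{g(x_i)=j}(x_i-\bar{x}_j)(x_i-\bar{x}_j)^T
     + \frac{1}{N}\sum_{j}N_j(\bar{x}_j-\bar{x})(\bar{x}_j-\bar{x})^T \\
    &= \frac{1}{N}\sum_{j}N_j\Sigma_j + S_b = S_w + S_b
\end{align*}
```

The cross terms drop out because $\sum_{g(x_i)=j}(x_i-\bar{x}_j)=0$ within every class j, so the total covariance splits exactly into the within-class and between-class parts.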
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform (axes: feature 1, feature 2).]
[Figure: distribution of the 2000 original data samples after the LDA transform (axes: feature 1, feature 2).]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
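LSA (2/7)
• Dimension Reduction and Feature Extraction
– PCA (feature space): the n-dimensional X is projected onto an orthonormal basis $\phi_1, \ldots, \phi_k$ to give the k-dimensional Y
$$y_i = \phi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \phi_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$
– SVD (in LSA; latent semantic space)
$$A_{m \times n} = U'_{m \times r}\, \Sigma'_{r \times r}\, V'^T_{r \times n}, \qquad r \le \min(m, n)$$
Keeping only the leading $k \times k$ block of $\Sigma'$ gives $A'_{m \times n}$, with
$$\min \|A - A'\|_F^2 \text{ for a given } k$$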
ML-47
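LSA (3/7)
– Singular Value Decomposition (SVD) used for the word-document matrix
• A least-squares method for dimension reduction
Projection of a vector x
[Figure: x is projected onto the axes $\varphi_1$ and $\varphi_2$, giving the coordinates $y_1$ and $y_2$.]
$$y_1 = \|x\| \cos\theta_1 = \|x\| \frac{\varphi_1^T x}{\|\varphi_1\|\,\|x\|} = \varphi_1^T x, \qquad \text{where } \|\varphi_1\| = 1$$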
ML-48
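LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each represented by terms. The terms can be compared directly (literal term matching), enriched first (doc expansion, query expansion), or each mapped to a structure model and compared there (latent semantic structure retrieval).]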
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
[Figure annotation: an OOV word]
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}$$
$$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}$$
(rows of A and A' are the words $w_1, w_2, \ldots, w_m$; columns are the docs $d_1, d_2, \ldots, d_n$)
– Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
– $r \le \min(m, n)$, $k \le r$, and $\|A\|_F^2 \ge \|A'\|_F^2$, where
$$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$$
– Row A $\in R^n$, Col A $\in R^m$
– Docs and queries are represented in a k-dimensional space. The axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
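LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
– $A^T A$ is a symmetric n×n matrix
• All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
• All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n \times n} = [v_1\ v_2\ \ldots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$, with $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
– As the square roots of the eigenvalues of $A^T A$
– As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
$$\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$$
For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A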
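The two characterizations of the singular values, as square roots of the eigenvalues of $A^T A$ and as the lengths $\|Av_i\|$, can be checked numerically on a small case. The sketch below is limited to a 2×2 matrix so that the eigenvalues of the symmetric matrix $A^T A$ have a closed form; the matrix A and the helper names are assumptions for illustration.

```python
import math


def gram(a):
    """A^T A for a matrix given as a list of rows."""
    cols = list(zip(*a))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


def sym_eig2(s):
    """(eigenvalue, unit eigenvector) pairs of a symmetric 2x2 matrix, largest first."""
    (a, b), (_, d) = s
    if abs(b) < 1e-12:  # already diagonal
        pairs = [(a, [1.0, 0.0]), (d, [0.0, 1.0])]
        return sorted(pairs, key=lambda p: -p[0])
    mid, rad = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        v = [b, lam - a]
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, [v[0] / norm, v[1] / norm]))
    return pairs


def apply(a, v):
    """Matrix-vector product A v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


A = [[3.0, 0.0], [4.0, 5.0]]          # hypothetical 2x2 example
pairs = sym_eig2(gram(A))

sigmas = [math.sqrt(lam) for lam, _ in pairs]                     # sqrt of eigenvalues
lengths = [math.hypot(*apply(A, v)) for _, v in pairs]            # ||A v_i||
frobenius_sq = sum(x * x for row in A for x in row)               # sum of a_ij^2
```

For this A the eigenvalues of $A^T A$ are 45 and 5, so both lists hold $\sqrt{45}$ and $\sqrt{5}$, and the squared singular values add up to the squared Frobenius norm, 50.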
ML-53
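LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
$$Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$$
– Suppose that A (or $A^T A$) has rank $r \le n$
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
– Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
$$u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$$
$$\Rightarrow [u_1\ u_2\ \ldots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \ldots\ v_r]$$
(V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
• Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
$$[u_1\ u_2\ \ldots\ u_r\ \ldots\ u_m]\,\Sigma = A\,[v_1\ v_2\ \ldots\ v_r\ \ldots\ v_n]$$
$$\Rightarrow U\Sigma = AV \Rightarrow U\Sigma V^T = AVV^T \Rightarrow A = U\Sigma V^T \qquad (VV^T = I_{n \times n})$$
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$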
ML-54
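LSA Derivations (3/7)
$$A_{m \times n} = U\Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
[Figure: A maps $R^n$ to $R^m$, with $U\Sigma = AV$. In $R^n$, the $v_i$ in $V_1$ span the row space of A, and $V_2$ covers the solutions of $AX = 0$. In $R^m$, the $u_i$ in $U_1$ span the row space of $A^T$, and $U_2$ covers the remaining directions.]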
ML-55
LSA Derivations (4/7)
• Additional Explanations
– Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
$$A = U\Sigma V^T \Rightarrow AV = U\Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$$
• The i-th entry of a row of U is related to the projection of the corresponding row of A onto the i-th column of V
– Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
$$A = U\Sigma V^T \Rightarrow A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$$
• The i-th entry of a row of V is related to the projection of the corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
– The original word-document matrix (A, m×n; rows are the words $w_1, \ldots, w_m$, columns are the docs $d_1, \ldots, d_n$)
• compare two terms → dot product of two rows of A (or an entry in $AA^T$)
• compare two docs → dot product of two columns of A (or an entry in $A^T A$)
• compare a term and a doc → each individual entry of A
– The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
• compare two terms → dot product of two rows of $U'\Sigma'$
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
• compare two docs → dot product of two rows of $V'\Sigma'$
$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
• compare a query word and a doc → each individual entry of A'
($\Sigma'$ is for stretching or shrinking the axes; $V'^T V'$ and $U'^T U'$ are identity matrices.)
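The comparisons listed on this slide can be tried on a tiny matrix whose SVD is known. The sketch below hard-codes a hand-derived SVD of a hypothetical 2-term by 2-doc matrix A = [[3, 0], [4, 5]] (an assumption made for illustration, not data from the slides) and compares terms through rows of $U\Sigma$ and docs through rows of $V\Sigma$.

```python
import math

# Hand-derived SVD of the toy matrix A = [[3, 0], [4, 5]]: A = U diag(SIGMA) V^T.
T = 1.0 / math.sqrt(10.0)
S = 1.0 / math.sqrt(2.0)
U = [[T, 3 * T], [3 * T, -T]]                 # one row per term
SIGMA = [math.sqrt(45.0), math.sqrt(5.0)]     # singular values, largest first
V = [[S, S], [S, -S]]                         # one row per doc


def scaled_rows(matrix, sigma, k):
    """Rows of matrix' * Sigma' when only the first k latent dimensions are kept."""
    return [[row[c] * sigma[c] for c in range(k)] for row in matrix]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


term_rows = scaled_rows(U, SIGMA, 2)
doc_rows = scaled_rows(V, SIGMA, 2)

term_sim_full = dot(term_rows[0], term_rows[1])    # entry (0, 1) of A A^T
doc_sim_full = dot(doc_rows[0], doc_rows[1])       # entry (0, 1) of A^T A
term_sim_rank1 = dot(*scaled_rows(U, SIGMA, 1))    # same term pair, k = 1
```

With all dimensions kept (k = 2) the values match the corresponding entries of $AA^T$ and $A^T A$ for this A (12 and 20); with k = 1 the term comparison is made in the reduced space and changes to 13.5.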
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
– For objects (new queries or docs) that did not appear in the original analysis
• Fold in a new m×1 query (or doc) vector
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
(The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V.)
– Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|} \qquad (\hat{q}, \hat{d}: \text{row vectors})$$
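The fold-in formula $\hat{q} = q^T U \Sigma^{-1}$ and the cosine measure $cosine(\hat{q}\Sigma, \hat{d}\Sigma)$ can be illustrated with a small sketch. It hard-codes a hand-derived SVD of a hypothetical 2-term by 2-doc matrix A = [[3, 0], [4, 5]]; the matrix, the function names, and the test vectors are assumptions for illustration only.

```python
import math

# Hand-derived SVD of the toy term-document matrix A = [[3, 0], [4, 5]].
T = 1.0 / math.sqrt(10.0)
S = 1.0 / math.sqrt(2.0)
U = [[T, 3 * T], [3 * T, -T]]                 # one row per term
SIGMA = [math.sqrt(45.0), math.sqrt(5.0)]     # singular values, largest first
V = [[S, S], [S, -S]]                         # row j: latent vector of doc j


def fold_in(q, u, sigma, k):
    """q_hat = q^T U_k Sigma_k^{-1}: map an m-dim term vector into k latent dims."""
    return [sum(q[t] * u[t][c] for t in range(len(q))) / sigma[c]
            for c in range(k)]


def latent_cosine(q_hat, d_hat, sigma):
    """cosine(q_hat * Sigma, d_hat * Sigma) for row vectors q_hat and d_hat."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [a * s for a, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    norm_q = math.sqrt(sum(a * a for a in qs))
    norm_d = math.sqrt(sum(a * a for a in ds))
    return dot / (norm_q * norm_d)


# Folding in a column of A (doc 0) should reproduce the matching row of V.
doc0_hat = fold_in([3.0, 4.0], U, SIGMA, 2)
sim_doc0_doc1 = latent_cosine(doc0_hat, V[1], SIGMA)

# A new query that contains only term 0.
query_hat = fold_in([1.0, 0.0], U, SIGMA, 2)
sim_query_doc0 = latent_cosine(query_hat, V[0], SIGMA)
```

Folding in a document that was part of the original matrix returns its row of V, which is a quick sanity check on an implementation. With all dimensions kept, the latent cosine between doc 0 and doc 1 equals the ordinary cosine between the two columns of A (0.8 here); truncating to k = 1 would change it.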
ML-58
LSA Derivations (7/7)
• Fold in a new 1×n term vector
$$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
ML-59
LSA Example
• Experimental results
– HMM is consistently better than VSM at all recall levels
– LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms).]
ML-60
LSA Conclusions
• Advantages
– A clean formal framework and a clearly defined optimization criterion (least-squares)
• Conceptual simplicity and clarity
– Handles synonymy problems ("heterogeneous vocabulary")
– Good results for high-recall search
• Takes term co-occurrence into account
• Disadvantages
– High computational complexity
– LSA offers only a partial solution to polysemy
• E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
– E.g., 4 terms (rows) and 3 docs (columns)
2.3  0.0  4.2
0.0  1.3  2.2
3.8  0.0  0.5
0.0  0.0  0.0
– Each entry is weighted by a TF×IDF score
– The same matrix in sparse text form:
4 3 6      (4 rows/terms, 3 columns/docs, 6 nonzero entries)
2          (2 nonzero entries at Col 0)
0 2.3      (Col 0, Row 0)
2 3.8      (Col 0, Row 2)
1          (1 nonzero entry at Col 1)
1 1.3      (Col 1, Row 1)
3          (3 nonzero entries at Col 2)
0 4.2      (Col 2, Row 0)
1 2.2      (Col 2, Row 1)
2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
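The sparse text layout on this slide (a header line with the number of rows, columns, and nonzero entries, then for each column its nonzero count followed by one "row value" line per entry) is easy to generate from a dense matrix. The helper below is a minimal sketch; the function name is hypothetical and the code only mirrors the layout shown on the slide, it is not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) column by column in sparse text form."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # For every column, keep (row index, value) pairs of the nonzero entries.
    columns = [[(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
               for c in range(n_cols)]
    total_nonzero = sum(len(col) for col in columns)
    lines = ["%d %d %d" % (n_rows, n_cols, total_nonzero)]
    for col in columns:
        lines.append(str(len(col)))
        lines.extend("%d %s" % (row, value) for row, value in col)
    return "\n".join(lines) + "\n"


# The 4-term by 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
```

Writing `sparse_text` to a file gives an input of the same shape as the slide's example; the all-zero last term simply contributes no lines.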
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
……
• SVD command (IR_svd.bat)
svd -r st -o LSA100 -d 100 Term-Doc-Matrix
– -r st: sparse matrix input
– -o LSA100: prefix of output files
– -d 100: no. of reserved eigenvectors
– Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; word vector (u^T): 1×100)
100 51253
0.003 0.001 ……
0.002 0.002 ……
• LSA100-S (100 eigenvalues)
100
2686.18
829.941
559.59
…
• LSA100-Vt (2265 docs; doc vector (v^T): 1×100)
100 2265
0.021 0.035 ……
0.012 0.022 ……
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold in a new m×1 query vector (TF×IDF weighted beforehand)
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
(The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V.)
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$$
ML-27
PCA Examples: Eigenface (4/4)
[Figure: a projected face image, and seven eigenfaces derived from the training set (indicating directions of variation between faces).]
ML-28
PCA Examples Eigenvoice (13)
bull Example 5 Eigenvoice in speaker adaptation (PSTL 2000)
ndash Stepsbull Concatenating the regarded parameters for each speaker r to
form a huge vector a(r) (a supervectors)bull SD HMM model mean parameters (μ)
Eigenvoice Eigenvoice space space
constructionconstruction
Speaker 1 Data
SI HMM
Speaker R Data
Model Training Model Training
Speaker 1 HMM Speaker R HMM
D = (Mn)times1 Principal Component
Analysis
Each new speaker S is representedEach new speaker S is representedby a point by a point PP in in KK--spacespace
( ) ( ) ( ) ( )Kwww Kiiii eeeeP 21 210 ++++=
SI HMM model
ML-29
PCA Examples Eigenvoice (23)
bull Example 4 Eigenvoice in speaker adaptation (cont)
ML-30
PCA Examples Eigenvoice (33)
bull Example 5 Eigenvoice in speaker adaptation (cont)ndash Dimension 1 (eigenvoice 1)
bull Correlate with pitch or sexndash Dimension 2 (eigenvoice 2)
bull Correlate with amplitudendash Dimension 3 (eigenvoice 3)
bull Correlate with second-formantmovement
Note thatEigenface performs on feature spacewhile eigenvoice performs on model space
ML-31
Linear Discriminant Analysis (LDA) (12)
bull Also called ndash Fisherrsquos Linear Discriminant Analysis Fisher-Rao Linear
Discriminant Analysisbull Fisher (1936) introduced it for two-class classification
bull Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (22)
bull Given a set of sample vectors with labeled (class) information try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
Within-class distributions are assumed here to be GaussiansWith equal variance in the two-dimensional sample space
ML-33
LDA Derivations (14)
bull Suppose there are N sample vectors with dimensionality n each of them is belongs to one of the Jclassesndash The sample mean is
ndash The class sample means are
ndash The class sample covariances are
ndash The average within-class variation before transform
ndash The average between-class variation before transform
ix
( ) ( ) index class is 21 sdotisin= gJjjg ix
sum==
N
iiN 1
1 xx
( )sum
==
jgi
jj
iN xxx 1
( )( )( )sum
=minusminus=
jg
Tjiji
jj
iN xxxxxΣ 1
sum=j
jjw NN
ΣS 1
( )( )sum minusminus=j
Tjjjb N
NxxxxS 1
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
\[ \Sigma = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T \]
[Figure: after the cosine transform, covariance matrix of the 18-cepstral vectors, calculated using Year-99's 5471 files]
\[ \Sigma' = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T \]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: after the PCA transform, covariance matrix of the 18-PCA-cepstral vectors; after the LDA transform, covariance matrix of the 18-LDA-cepstral vectors; both calculated using Year-99's 5471 files]
Character Error Rate
            TC       WG
    MFCC    26.32    22.71
    LDA-1   23.12    20.17
    LDA-2   23.11    20.11
ML-39
PCA vs LDA (1/2)
[Figure: two panels comparing the projection found by PCA with the projection found by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
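HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4;            % between-class matrix
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;            % within-class matrix
          2.9 8.7 3.5;
          4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI * BE;
    % PCA (use this line instead of the two LDA lines above)
    A = BE + WI;                  % why? (Prove it)
    [V, D] = eig(A);              % all eigenvectors and eigenvalues
    [V, D] = eigs(A, 3);          % or: the 3 largest-magnitude eigenvalues
    % Write the basis vectors to a text file
    fid = fopen('Basis', 'w');
    for i = 1:3                   % feature vector length
        for j = 1:3               % basis number
            fprintf(fid, '%10.10f ', V(i, j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);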
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original samples after the PCA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original samples after the LDA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
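A small numerical illustration of the claim that a query and a doc can be similar in the latent space without sharing terms. This is a sketch under assumptions: the three-term vocabulary and the two docs are hypothetical, NumPy's SVD stands in for whatever toolkit is used in practice, and a single latent dimension (k = 1) is kept only to make the effect easy to see.

```python
import numpy as np

# Hypothetical toy vocabulary and docs (not from the slides).
terms = ["car", "automobile", "engine"]
A = np.array([[1.0, 0.0],    # "car"        occurs in d1 only
              [0.0, 1.0],    # "automobile" occurs in d2 only
              [1.0, 1.0]])   # "engine"     occurs in both docs

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0, 0.0])             # query consisting of "car"
raw = cosine(q, A[:, 1])                  # literal term matching against d2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 1
q_hat = (q @ U[:, :k]) / s[:k]            # query mapped into the latent space
d2_hat = Vt[:k, 1]                        # d2 in the latent space
latent = cosine(q_hat * s[:k], d2_hat * s[:k])
```

Here `raw` is 0 because the query and d2 share no term, while `latent` is 1 because "car" and "automobile" both co-occur with "engine" and are projected onto the same dimension.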
ML-46
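LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: the n-dimensional vector X in the feature space is projected onto an orthonormal basis \( \varphi_1, \ldots, \varphi_k \) (k ≤ n)
\[ y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \|X - \hat{X}\|^2 \ \text{for a given } k \]
  - SVD (in LSA): the m×n matrix A is approximated by the m×n matrix A' in a k-dimensional latent semantic space
\[ A' = U' \Sigma' V'^{T}, \qquad \min \|A - A'\|_F^2 \ \text{for a given } k \]
    with U' of size m×r, Σ' of size r×r, V'^T of size r×n, and r ≤ min(m, n); only the leading k dimensions (the k×k block of Σ') are kept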
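The least-squares criterion for a given k can be checked numerically: the squared Frobenius error of the rank-k truncation equals the sum of the discarded squared singular values. The sketch below assumes NumPy; the 4×3 matrix is an illustrative toy, not data from the lecture.

```python
import numpy as np

# Illustrative 4-term x 3-doc matrix (toy values).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # s is sorted, largest first
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation A'

err = np.linalg.norm(A - A_k, "fro") ** 2           # ||A - A'||_F^2
discarded = float(np.sum(s[k:] ** 2))               # energy of dropped dimensions
```

The same relation explains why \( \|A\|_F^2 \ge \|A'\|_F^2 \): truncation can only remove nonnegative terms \( \sigma_i^2 \).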
ML-47
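LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: Projection of a vector x onto the basis vectors \( \varphi_1 \) and \( \varphi_2 \), giving the coordinates \( y_1 \) and \( y_2 \)]
\[ y_1 = \|x\| \cos\theta_1 = \|x\| \, \frac{\varphi_1^T x}{\|\varphi_1\| \, \|x\|} = \varphi_1^T x, \qquad \text{where } \|\varphi_1\| = 1 \]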
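The projection formula can be tried on a concrete vector. A minimal sketch, assuming NumPy; the basis vectors and x are arbitrary illustrative values chosen so that the basis is orthonormal.

```python
import numpy as np

phi1 = np.array([0.6, 0.8])      # unit-length basis vector
phi2 = np.array([-0.8, 0.6])     # unit-length, orthogonal to phi1
x = np.array([2.0, 1.0])

y1 = phi1 @ x                    # coordinate of x along phi1
y2 = phi2 @ x                    # coordinate of x along phi2

cos_theta = y1 / (np.linalg.norm(phi1) * np.linalg.norm(x))
x_rebuilt = y1 * phi1 + y2 * phi2   # exact, because both basis vectors are kept
```

Keeping only `y1 * phi1` would give the one-dimensional least-squares approximation of x along that axis.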
ML-48
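LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: doc terms and query terms are compared by literal term matching; doc expansion and query expansion operate on the terms; a structure model on the doc side and a structure model on the query side are compared by latent semantic structure retrieval]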
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
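LSA (7/7)
• Singular Value Decomposition (SVD)
\[ A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n} \]
\[ A'_{m \times n} = U'_{m \times k} \, \Sigma_{k \times k} \, V'^{T}_{k \times n} \]
  (rows of A and A' correspond to words \( w_1, \ldots, w_m \); columns correspond to docs \( d_1, \ldots, d_n \))
  - Both U and V have orthonormal column vectors: \( U^T U = I_{r \times r} \), \( V^T V = I_{r \times r} \)
  - \( r \le \min(m, n) \) and \( k \le r \), so \( \|A\|_F^2 \ge \|A'\|_F^2 \), where
\[ \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \]
  - Row A \( \subseteq R^n \), Col A \( \subseteq R^m \)
  - Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of \( \Sigma_k \)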
ML-52
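LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - \( A^T A \) is a symmetric n×n matrix
    • All eigenvalues \( \lambda_j \) are nonnegative real numbers
\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \]
    • All eigenvectors \( v_j \) are orthonormal (\( v_j \in R^n \))
\[ V_{n \times n} = [v_1\ v_2\ \cdots\ v_n], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n} \]
• Define singular values \( \sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n \)
  - As the square roots of the eigenvalues of \( A^T A \)
  - As the lengths of the vectors \( Av_1, Av_2, \ldots, Av_n \): \( \sigma_1 = \|Av_1\|,\ \sigma_2 = \|Av_2\|, \ldots \)
\[ \|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i \]
For \( \lambda_i \ne 0,\ i = 1, \ldots, r \), \( \{Av_1, Av_2, \ldots, Av_r\} \) is an orthogonal basis of Col A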
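Both definitions of the singular values can be compared against a library SVD. A minimal sketch, assuming NumPy; the 3×2 matrix is an arbitrary illustrative example.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Eigen-decomposition of the symmetric matrix A^T A (eigh returns ascending order).
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]                 # reorder so lambda_1 >= lambda_2 >= ...
lam, V = lam[order], V[:, order]

sigma = np.sqrt(np.clip(lam, 0.0, None))      # definition 1: sqrt of eigenvalues
lengths = np.linalg.norm(A @ V, axis=0)       # definition 2: lengths of A v_j

reference = np.linalg.svd(A, compute_uv=False)
```

The `np.clip` call only guards against tiny negative eigenvalues caused by rounding; in exact arithmetic all \( \lambda_j \) are nonnegative.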
ML-53
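LSA Derivations (2/7)
• \( \{Av_1, Av_2, \ldots, Av_r\} \) is an orthogonal basis of Col A
\[ Av_i \cdot Av_j = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j) \]
  - Suppose that A (or \( A^T A \)) has rank r ≤ n
\[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0 \]
  - Define an orthonormal basis \( \{u_1, u_2, \ldots, u_r\} \) for Col A
\[ u_i = \frac{Av_i}{\|Av_i\|} = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r] \]
    (V is an orthonormal matrix (n×r), known in advance; U is also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis \( \{u_1, u_2, \ldots, u_m\} \) of \( R^m \)
\[ [u_1\ \cdots\ u_r\ \cdots\ u_m]\, \Sigma = A\, [v_1\ \cdots\ v_r\ \cdots\ v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T \]
      (using \( V V^T = I_{n \times n} \))
\[ \Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix} \]
\[ \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2 \]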
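The construction on this slide (eigenvectors of \( A^T A \) first, then \( u_i = Av_i / \sigma_i \)) can be followed step by step. This is a sketch assuming NumPy, with an arbitrary full-column-rank toy matrix; the rank threshold `1e-10` is an illustrative choice.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

lam, V = np.linalg.eigh(A.T @ A)              # v_j: eigenvectors of A^T A
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0.0, None))

r = int(np.sum(sigma > 1e-10))                # numerical rank of A
V_r = V[:, :r]
U_r = (A @ V_r) / sigma[:r]                   # column i is A v_i / sigma_i

A_rebuilt = U_r @ np.diag(sigma[:r]) @ V_r.T  # U_r Sigma_r V_r^T
frob_sq = float(np.sum(A ** 2))               # sum of squared entries of A
```

Dividing the m×r matrix `A @ V_r` by the length-r vector `sigma[:r]` scales each column by its own \( 1/\sigma_i \) through broadcasting.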
ML-54
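LSA Derivations (3/7)
\[ A = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T \]
\[ U \Sigma = A V \]
[Figure: the m×n matrix A maps \( R^n \) to \( R^m \). The \( v_i \) (columns of \( V_1 \)) span the row space of A, \( V_2 \) corresponds to the solutions of AX = 0, and the \( u_i \) (columns of \( U_1 \)) span the row space of \( A^T \)]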
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
\[ A = U \Sigma V^T \;\Rightarrow\; AV = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = AV \]
    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  - Each row of V is related to the projection of a corresponding row of \( A^T \) onto the basis formed by U
\[ A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U \]
    • The i-th entry of a row of V is related to the projection of a corresponding row of \( A^T \) onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), of size m×n, with rows \( w_1, \ldots, w_m \) and columns \( d_1, \ldots, d_n \)
    • compare two terms (\( w_i, w_j \)) → dot product of two rows of A, or an entry in \( A A^T \)
    • compare two docs (\( d_k, d_s \)) → dot product of two columns of A, or an entry in \( A^T A \)
    • compare a term and a doc → each individual entry of A
  - The new word-document matrix (A'), with \( U' = U_{m \times k} \), \( \Sigma' = \Sigma_k \), \( V' = V_{n \times k} \)
    • compare two terms → dot product of two rows of \( U' \Sigma' \)
\[ A' A'^{T} = (U' \Sigma' V'^{T})(U' \Sigma' V'^{T})^T = U' \Sigma' V'^{T} V' \Sigma'^{T} U'^{T} = (U' \Sigma')(U' \Sigma')^T \]
    • compare two docs → dot product of two rows of \( V' \Sigma' \)
\[ A'^{T} A' = (U' \Sigma' V'^{T})^T (U' \Sigma' V'^{T}) = V' \Sigma'^{T} U'^{T} U' \Sigma' V'^{T} = (V' \Sigma')(V' \Sigma')^T \]
    • compare a query word and a doc → each individual entry of A'
  (\( V'^{T} V' \) and \( U'^{T} U' \) are identity matrices, so they cancel; \( \Sigma' \) is for stretching or shrinking)
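The two identities can be verified on a small matrix: dot products between rows of U'Σ' reproduce A'A'^T, and dot products between rows of V'Σ' reproduce A'^T A'. A minimal sketch assuming NumPy; the matrix values and k = 2 are illustrative.

```python
import numpy as np

# Illustrative 4-term x 3-doc matrix (toy values).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

A_k = U_k @ S_k @ V_k.T          # the new word-document matrix A'
term_vecs = U_k @ S_k            # one k-dimensional row per term
doc_vecs = V_k @ S_k             # one k-dimensional row per doc

term_term = term_vecs @ term_vecs.T   # all term-term comparisons
doc_doc = doc_vecs @ doc_vecs.T       # all doc-doc comparisons
```

Working with the k-dimensional rows avoids forming the m×m or n×n products of A' explicitly, which matters when m and n are in the tens of thousands.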
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
\[ \hat{q}_{1 \times k} = (q_{m \times 1})^T \, U_{m \times k} \, \Sigma^{-1}_{k \times k} \]
      The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V
  - Cosine measure between the query and doc vectors in the latent semantic space (\( \hat{q} \) and \( \hat{d} \) are row vectors)
\[ \mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{\|\hat{q}\Sigma\| \, \|\hat{d}\Sigma\|} \]
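The fold-in step and the weighted cosine measure can be sketched as below. Assumptions: NumPy is used, the helper names `fold_in_query` and `latent_cosine` are hypothetical, and the toy matrix is illustrative. Folding in a column of A that was part of the original analysis should reproduce the corresponding row of V, which gives a convenient self-check.

```python
import numpy as np

def fold_in_query(q, U_k, s_k):
    """Map an m-dimensional term vector q to q_hat = q^T U_k Sigma_k^{-1}."""
    return (q @ U_k) / s_k

def latent_cosine(q_hat, d_hat, s_k):
    """Cosine between q_hat * Sigma and d_hat * Sigma."""
    a, b = q_hat * s_k, d_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-term x 3-doc matrix (toy values).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0],
              [0.0, 2.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

d0_folded = fold_in_query(A[:, 0], U_k, s_k)     # fold in an existing doc
self_sim = latent_cosine(d0_folded, V_k[0], s_k)

q = np.array([1.0, 0.0, 1.0, 0.0])               # hypothetical new query
scores = [latent_cosine(fold_in_query(q, U_k, s_k), V_k[j], s_k)
          for j in range(V_k.shape[0])]
```

Ranking the docs by `scores` gives the retrieval order for the query in the latent semantic space.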
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
\[ \hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k} \]
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (rows: terms, columns: docs)
        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0
  - Each entry is weighted by its TF×IDF score
  - The same matrix in the sparse input format:
        4 3 6      (no. of rows/terms, no. of columns/docs, no. of nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
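The column-by-column sparse layout on this slide can be generated from a dense matrix with a few lines of code. This is a sketch under assumptions: the function name `to_sparse_text` is hypothetical, and the matrix entries are read as 2.3, 0.0, 4.2 and so on, i.e. with the decimal points that the extracted slide text dropped.

```python
def to_sparse_text(rows):
    """Serialise a dense matrix (list of row lists) in column-major sparse text form."""
    n_rows, n_cols = len(rows), len(rows[0])
    columns = []
    for c in range(n_cols):
        # Keep (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, rows[r][c]) for r in range(n_rows) if rows[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]          # header: rows, cols, nonzeros
    for entries in columns:
        lines.append(str(len(entries)))             # nonzero count of this column
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines) + "\n"

# 4 terms x 3 docs, values assumed from the slide.
matrix = [[2.3, 0.0, 4.2],
          [0.0, 1.3, 2.2],
          [3.8, 0.0, 0.5],
          [0.0, 0.0, 0.0]]
text = to_sparse_text(matrix)
```

Writing `text` to a file would give an input of the kind the slide describes; row and column indices start at 0.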
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of the sparse matrix input
  - Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector \( u^T \) is 1×100)
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
• LSA100-S (100 eigenvalues)
        100
        26861882994155959…
• LSA100-Vt (2265 docs; each doc vector \( v^T \) is 1×100)
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
\[ \hat{q}_{1 \times k} = (q_{m \times 1})^T \, U_{m \times k} \, \Sigma^{-1}_{k \times k} \]
  The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
\[ \mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{\|\hat{q}\Sigma\| \, \|\hat{d}\Sigma\|} \]
ML-28
PCA Examples: Eigenvoice (1/3)
• Example 5: Eigenvoice in speaker adaptation (PSTL 2000)
  - Steps
    • Concatenating the regarded parameters for each speaker r to form a huge vector a(r) (a supervector)
      • SD HMM model mean parameters (μ)
[Figure: Eigenvoice space construction. Speaker 1 data, …, Speaker R data and the SI HMM go through model training to give the Speaker 1 HMM, …, Speaker R HMM; the resulting supervectors, each of dimension D = (M·n)×1, are passed to Principal Component Analysis]
Each new speaker S is represented by a point P in K-space:
\[ P = e(0) + w(1)\,e(1) + w(2)\,e(2) + \cdots + w(K)\,e(K) \]
  (e(0) is labeled "SI HMM model" on the slide)
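The construction of an eigenvoice space can be sketched with PCA on speaker supervectors. Everything in this example is an assumption made for illustration: the supervectors are random numbers rather than HMM mean parameters, the mean supervector stands in for e(0), and the weights w(k) are obtained by orthogonal projection, whereas a real adaptation system would estimate them from the new speaker's adaptation data.

```python
import numpy as np

rng = np.random.default_rng(0)
R, D, K = 6, 10, 2                         # speakers, supervector size, eigenvoices

supervectors = rng.normal(size=(R, D))     # stand-in for R speaker supervectors
e0 = supervectors.mean(axis=0)             # stand-in for e(0)

# PCA via SVD of the centred supervectors; rows of Vt are principal directions.
_, _, Vt = np.linalg.svd(supervectors - e0, full_matrices=False)
eigenvoices = Vt[:K]                       # K x D, orthonormal rows e(1) … e(K)

new_speaker = rng.normal(size=D)           # stand-in for a new speaker S
w = eigenvoices @ (new_speaker - e0)       # weights w(1) … w(K) by projection
P = e0 + w @ eigenvoices                   # point P in the K-dimensional space
```

Only the K weights in `w` need to be determined for a new speaker, which is far fewer than the D parameters of a full supervector.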
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  - Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  - Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  - Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  - Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-34
LDA Derivations (24)
bull If the transform is appliedndash The sample vectors will be
ndash The sample mean will be
ndash The class sample means will be
ndash The average within-class variation will be
[ ]mwwwW 1 2=
iT
i xWy =
xWxWxWy TN
ii
TN
ii
T
NN=⎟
⎠⎞
⎜⎝⎛== sumsum
== 11
11
( ) jT
jgi
T
jj
iNxWxWy
x== sum
=
1
( )( )( )
( )( )
WSW
WΣW
xWxWxWxWSxx x
wT
jj
jT
T
jgi
T
ji
T
jg jgi
T
ji
T
jjjw
NN
NNNN
N ii i
=⎭⎬⎫
⎩⎨⎧=
⎪⎭
⎪⎬⎫
⎪⎩
⎪⎨⎧
⎟⎟⎠
⎞⎜⎜⎝
⎛minus⎟
⎟⎠
⎞⎜⎜⎝
⎛minussdot=
sum
sumsum sumsum== =
1
1111~
x y
wS wS~
bS bS~
W
ML-35
LDA Derivations (34)
bull If the transform is appliedndash Similarly the average between-class variation will be
ndash Try to find optimal such that the following objective function is maximized
bull A closed-form solution the column vectors of an optimal matrix are the generalized eigenvectors corresponding to the
largest eigenvalues in
bull That is are the eigenvectors corresponding to thelargest eigenvalues of
[ ]mwwwW 1 2=
WSWS bT
b =~
W
( )WSW
WSW
S
SW
wT
bT
w
bJ == ~
~
iwiib wSwS λ=W
siw
iiibw wwSS λ=minus1bw SS 1minus
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform (axes: feature 1, feature 2)]
[Figure: distribution of the 2000 original data samples after the LDA transform (axes: feature 1, feature 2)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
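LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: the n-dimensional vector $X$ in the feature space is projected onto an orthonormal basis $\varphi_1, \dots, \varphi_k$
$$y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \lVert X - \hat{X} \rVert^2 \ \text{for a given } k$$
  – SVD (in LSA): the $m \times n$ matrix $A$ is approximated by $A'$, whose $k$ retained dimensions ($k \times k$ block of $\Sigma'$) form the latent semantic space
$$A_{m \times n} \approx A'_{m \times n} = U'_{m \times r}\, \Sigma'_{r \times r}\, V'^{T}_{r \times n}, \quad r \le \min(m, n), \qquad \min \lVert A - A' \rVert_F^2 \ \text{for a given } k$$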
ML-47
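LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector $x$ onto the unit basis vector $\varphi_1$:
$$y_1 = \lVert x \rVert \cos\theta_1 = \lVert x \rVert \frac{\varphi_1^T x}{\lVert \varphi_1 \rVert\, \lVert x \rVert} = \varphi_1^T x, \qquad \text{where } \lVert \varphi_1 \rVert = 1$$
[Figure: the vector $x$ and its projections $y_1$, $y_2$ onto the axes $\varphi_1$, $\varphi_2$]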
ML-48
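LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: doc terms and query terms are compared by literal term matching; doc expansion and query expansion act on the terms of each side; each side also maps to a structure model, and the two structure models are compared by latent semantic structure retrieval]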
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  – Rows of $A$ correspond to words $w_1, w_2, \dots, w_m$; columns correspond to docs $d_1, d_2, \dots, d_n$
$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n)$$
$$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}, \qquad k \le r$$
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $\lVert A \rVert_F^2 \ge \lVert A' \rVert_F^2$, where $\lVert A \rVert_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row $A \in R^n$, Col $A \in R^m$
  – Docs and queries are represented in a k-dimensional space. The axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
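LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$): $V_{n \times n} = [v_1\ v_2\ \dots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \dots, n$
  – As the square roots of the eigenvalues of $A^T A$: $\Sigma^2 = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$
  – As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$: $\sigma_1 = \lVert Av_1 \rVert$, $\sigma_2 = \lVert Av_2 \rVert$, ...
$$\lVert Av_i \rVert^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \ \Rightarrow\ \lVert Av_i \rVert = \sigma_i$$
• For $\lambda_i \ne 0$, $i = 1, \dots, r$, $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col $A$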
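The two definitions of the singular values can be checked numerically on a tiny matrix. The following is a minimal sketch in plain Python, restricted to a 2x2 example so that the eigenvalues of $A^T A$ have a closed form; the matrix and the helper names (`gram`, `symmetric_2x2_eigen`, `apply`) are illustrative assumptions, not part of the slides.

```python
import math

def gram(a):
    """Return A^T A for a matrix A given as a list of rows."""
    cols = list(zip(*a))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]

def symmetric_2x2_eigen(s):
    """Eigenvalues (descending) and unit eigenvectors of a symmetric 2x2 matrix."""
    a, b, d = s[0][0], s[0][1], s[1][1]
    mean, root = (a + d) / 2.0, math.hypot((a - d) / 2.0, b)
    eigenvalues = [mean + root, mean - root]
    if abs(b) < 1e-12:
        # Already diagonal: the eigenvectors are the coordinate axes.
        vectors = [(1.0, 0.0), (0.0, 1.0)] if a >= d else [(0.0, 1.0), (1.0, 0.0)]
    else:
        vectors = []
        for lam in eigenvalues:
            vx, vy = b, lam - a  # solves (S - lam*I) v = 0 when b != 0
            norm = math.hypot(vx, vy)
            vectors.append((vx / norm, vy / norm))
    return eigenvalues, vectors

def apply(a, v):
    """Matrix-vector product A v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]

# Hypothetical example matrix.
A = [[3.0, 0.0],
     [4.0, 5.0]]

eigenvalues, V = symmetric_2x2_eigen(gram(A))
sigmas = [math.sqrt(max(lam, 0.0)) for lam in eigenvalues]   # sqrt of eigenvalues of A^T A
lengths = [math.hypot(*apply(A, v)) for v in V]              # lengths of A v_i
frobenius_sq = sum(x * x for row in A for x in row)

print("singular values:", sigmas)
print("lengths of A v_i:", lengths)
print("||A||_F^2:", frobenius_sq, "sum of sigma_i^2:", sum(s * s for s in sigmas))
```

Both lists should agree up to rounding, and the squared singular values should add up to the squared Frobenius norm of $A$.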
ML-53
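LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col $A$
$$Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0, \qquad i \ne j$$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$
$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$$
  – Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col $A$
$$u_i = \frac{1}{\lVert Av_i \rVert} Av_i = \frac{1}{\sigma_i} Av_i \ \Rightarrow\ \sigma_i u_i = Av_i \ \Rightarrow\ [u_1\ u_2\ \dots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \dots\ v_r]$$
    ($V$: an orthonormal matrix ($n \times r$), known in advance; $U$: also an orthonormal matrix ($m \times r$))
    • Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$
$$[u_1\ \dots\ u_r\ \dots\ u_m]\, \Sigma = A\, [v_1\ \dots\ v_r\ \dots\ v_n] \ \Rightarrow\ U\Sigma = AV \ \Rightarrow\ U\Sigma V^T = AVV^T \ \Rightarrow\ A = U\Sigma V^T$$
    where $VV^T = I_{n \times n}$ and
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
$$\lVert A \rVert_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2$$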
ML-54
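LSA Derivations (3/7)
$$A_{m \times n} = U\Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
  – $U\Sigma = AV$
  – The vectors $v_i$ (columns of $V_1$) span the row space of $A$ (in $R^n$)
  – The vectors $u_i$ (columns of $U_1$) span the row space of $A^T$ (in $R^m$)
  – The columns of $V_2$ satisfy $AX = 0$
[Figure: $A$ maps $R^n$ to $R^m$; $V^T$ and $U$ relate the two spaces]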
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of the corresponding row of $A$ onto the basis formed by the columns of $V$
$$A = U\Sigma V^T \ \Rightarrow\ AV = U\Sigma V^T V = U\Sigma \ \Rightarrow\ U\Sigma = AV$$
    • The i-th entry of a row of $U$ is related to the projection of the corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of the corresponding row of $A^T$ onto the basis formed by the columns of $U$
$$A = U\Sigma V^T \ \Rightarrow\ A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \ \Rightarrow\ V\Sigma = A^T U$$
    • The i-th entry of a row of $V$ is related to the projection of the corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$, $m \times n$; rows $w_1, \dots, w_m$, columns $d_1, \dots, d_n$)
    • compare two terms → dot product of two rows of $A$, or an entry in $AA^T$
    • compare two docs → dot product of two columns of $A$, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U'\Sigma'$
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
    • compare two docs → dot product of two rows of $V'\Sigma'$
$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • compare a query word and a doc → each individual entry of $A'$
  – In both products the inner factor ($V'^T V'$ or $U'^T U'$) is the identity, and $\Sigma'$ serves for stretching or shrinking each dimension
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m \times 1$ query (or doc) vector
$$\hat{q}_{1 \times k} = \left(q^T\right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\lVert \hat{q}\Sigma \rVert\, \lVert \hat{d}\Sigma \rVert}$$
    ($\hat{q}$ and $\hat{d}$ are row vectors)
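The fold-in and cosine formulas can be sketched in a few lines of plain Python. The factors `U`, `sigma` and `V` below are small hand-made matrices with orthonormal columns, chosen only for illustration (they are assumptions, not output of any toolkit), and the function names `fold_in` and `latent_cosine` are hypothetical.

```python
import math

def fold_in(q, U, sigma):
    """Return q_hat = q^T U Sigma^{-1} for a length-m vector q, an m x k matrix U
    (list of rows) and the k singular values in sigma."""
    return [sum(q_t * row[j] for q_t, row in zip(q, U)) / sigma[j]
            for j in range(len(sigma))]

def latent_cosine(q_hat, d_hat, sigma):
    """Cosine between q_hat * Sigma and d_hat * Sigma (both are row vectors)."""
    qs = [x * s for x, s in zip(q_hat, sigma)]
    ds = [x * s for x, s in zip(d_hat, sigma)]
    norm = math.sqrt(sum(x * x for x in qs)) * math.sqrt(sum(x * x for x in ds))
    return sum(x * y for x, y in zip(qs, ds)) / norm if norm else 0.0

# Hypothetical rank-2 factors for 3 terms and 2 docs.
r = 1.0 / math.sqrt(2.0)
U = [[r, 0.0], [r, 0.0], [0.0, 1.0]]   # terms x dims, orthonormal columns
sigma = [2.0, 1.0]                     # singular values
V = [[0.6, 0.8], [0.8, -0.6]]          # docs x dims, orthonormal columns

# Rebuild A = U Sigma V^T so that an existing doc column can be folded in again.
A = [[sum(U[t][j] * sigma[j] * V[d][j] for j in range(2)) for d in range(2)]
     for t in range(3)]

doc0 = [A[t][0] for t in range(3)]
doc0_hat = fold_in(doc0, U, sigma)     # should reproduce row 0 of V

query = [1.0, 0.0, 0.0]                # a query containing only term 0
query_hat = fold_in(query, U, sigma)
scores = [latent_cosine(query_hat, d_hat, sigma) for d_hat in V]

print("folded-in doc 0:", doc0_hat)
print("query scores   :", scores)
```

Folding in a column that was already part of $A$ returns the corresponding row of $V$, which is a convenient sanity check before scoring unseen queries.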
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
$$\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs
  – Each entry is weighted by its TFxIDF score

            Doc 0   Doc 1   Doc 2
  Term 0    2.3     0.0     4.2
  Term 1    0.0     1.3     2.2
  Term 2    3.8     0.0     0.5
  Term 3    0.0     0.0     0.0

  – The same matrix in the sparse input format (Row: term, Col: doc)

  4 3 6       4 rows (terms), 3 columns (docs), 6 nonzero entries
  2           2 nonzero entries at Col 0
  0 2.3       Col 0, Row 0
  2 3.8       Col 0, Row 2
  1           1 nonzero entry at Col 1
  1 1.3       Col 1, Row 1
  3           3 nonzero entries at Col 2
  0 4.2       Col 2, Row 0
  1 2.2       Col 2, Row 1
  2 0.5       Col 2, Row 2

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
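The column-by-column layout shown on this slide can be produced from a dense matrix with a few lines of Python. This is a minimal sketch that follows the layout exactly as the slide describes it (header line, then per column a count followed by "row value" pairs); the function name `to_sparse_text` is hypothetical, and the toolkit's own documentation remains the authority on the file format.

```python
def to_sparse_text(matrix):
    """Serialize a dense term-doc matrix (list of rows) column by column:
    'rows cols nonzeros', then for each column its nonzero count followed by
    one 'row value' line per nonzero entry."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    columns = [[(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
               for c in range(n_cols)]
    nonzeros = sum(len(entries) for entries in columns)
    lines = ["{} {} {}".format(n_rows, n_cols, nonzeros)]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend("{} {:g}".format(r, v) for r, v in entries)
    return "\n".join(lines) + "\n"

# The 4-term, 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]

text = to_sparse_text(term_doc)
print(text, end="")
```

Writing `text` to a file would give an input in the same shape as the slide's example; an all-zero row (Term 3 here) simply never appears in any column.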
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (first line: number of indexing terms, number of docs, number of nonzero entries)
  51253 2265 218852
  77
  508 7.725771
  596 16.213399
  612 13.080868
  709 7.725771
  713 7.725771
  744 7.725771
  1190 7.725771
  1200 16.213399
  1259 7.725771
  ……
• SVD command (IR_svd.bat)
  svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: number of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; word vector $u^T$: 1x100)
  100 51253
  0.003 0.001 ……
  0.002 0.002 ……
• LSA100-S
10026861882994155959hellip
  (100 eigenvalues)
• LSA100-Vt (2265 docs; doc vector $v^T$: 1x100)
  100 2265
  0.021 0.035 ……
  0.012 0.022 ……
ML-65
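LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector $q$ (TFxIDF weighted beforehand)
$$\hat{q}_{1 \times k} = \left(q^T\right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\lVert \hat{q}\Sigma \rVert\, \lVert \hat{d}\Sigma \rVert}$$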
ML-29
PCA Examples: Eigenvoice (2/3)
• Example 4: Eigenvoice in speaker adaptation (cont.)
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space, while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform $W$ such that the ratio of the average between-class variation to the average within-class variation is maximal
Within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space
ML-33
LDA Derivations (1/4)
• Suppose there are $N$ sample vectors $x_i$ with dimensionality $n$, each of which belongs to one of the $J$ classes: $g(x_i) = j$, $j \in \{1, 2, \dots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
  – The class sample means are
$$\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i) = j} x_i$$
  – The class sample covariances are
$$\Sigma_j = \frac{1}{N_j} \sum_{g(x_i) = j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
  – The average within-class variation before transform
$$S_w = \frac{1}{N} \sum_j N_j \Sigma_j$$
  – The average between-class variation before transform
$$S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
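The definitions above translate directly into code. The sketch below uses plain Python lists and a tiny made-up two-class data set (the sample values and the helper names `mean`, `outer`, `add_scaled`, `scatter_matrices` are assumptions for illustration); it also computes the covariance of the pooled data so that the within-class and between-class parts can be compared against it.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[x * y for y in b] for x in a]

def add_scaled(acc, m, scale):
    """In place: acc += scale * m."""
    for r in range(len(acc)):
        for c in range(len(acc[0])):
            acc[r][c] += scale * m[r][c]

def scatter_matrices(samples, labels):
    """Return (S_w, S_b, S_t): within-class, between-class and total
    covariance, each normalized by the total sample count N."""
    N, dim = len(samples), len(samples[0])
    x_bar = mean(samples)
    S_w = [[0.0] * dim for _ in range(dim)]
    S_b = [[0.0] * dim for _ in range(dim)]
    S_t = [[0.0] * dim for _ in range(dim)]
    for j in sorted(set(labels)):
        members = [x for x, g in zip(samples, labels) if g == j]
        x_bar_j = mean(members)
        for x in members:
            d = [a - b for a, b in zip(x, x_bar_j)]
            add_scaled(S_w, outer(d, d), 1.0 / N)        # (1/N) * N_j * Sigma_j
        d = [a - b for a, b in zip(x_bar_j, x_bar)]
        add_scaled(S_b, outer(d, d), len(members) / N)   # (N_j/N) * (mean gap)(mean gap)^T
    for x in samples:
        d = [a - b for a, b in zip(x, x_bar)]
        add_scaled(S_t, outer(d, d), 1.0 / N)
    return S_w, S_b, S_t

# Hypothetical two-class data in two dimensions.
samples = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0], [8.0, 7.0]]
labels = [0, 0, 0, 1, 1, 1]

S_w, S_b, S_t = scatter_matrices(samples, labels)
print("S_w =", S_w)
print("S_b =", S_b)
print("S_t =", S_t)
```

For any labeling, `S_w` plus `S_b` should match `S_t` up to rounding, which is a quick way to catch indexing mistakes when these matrices are computed on real data.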
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \dots\ w_m]$ is applied
  – The sample vectors will be
$$y_i = W^T x_i$$
  – The sample mean will be
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$$
  – The class sample means will be
$$\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i) = j} W^T x_i = W^T \bar{x}_j$$
  – The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N} \sum_j N_j \left\{ \frac{1}{N_j} \sum_{g(x_i) = j} \left( W^T x_i - W^T \bar{x}_j \right) \left( W^T x_i - W^T \bar{x}_j \right)^T \right\} = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W$$
[Figure: the transform $W$ maps $x$ to $y$, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \dots\ w_m]$ is applied
  – Similarly, the average between-class variation will be
$$\tilde{S}_b = W^T S_b W$$
  – Try to find the optimal $W$ such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$$
    • A closed-form solution: the column vectors $w_i$ of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_b w_i = \lambda_i S_w w_i$$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
$$S_w^{-1} S_b w_i = \lambda_i w_i$$
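For the two-class case the generalized eigenproblem has a well-known shortcut: $S_b$ is then proportional to $(\bar{x}_1 - \bar{x}_2)(\bar{x}_1 - \bar{x}_2)^T$, so the single useful eigenvector of $S_w^{-1} S_b$ is proportional to $S_w^{-1}(\bar{x}_1 - \bar{x}_2)$. The sketch below applies this in two dimensions with plain Python; the data points and the function names are made up for illustration, and the multi-class case would need a general eigen-solver instead.

```python
def class_mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def within_class_scatter(classes):
    """S_w = (1/N) * sum over classes of sum_i (x_i - mean_j)(x_i - mean_j)^T."""
    N = sum(len(c) for c in classes)
    s = [[0.0, 0.0], [0.0, 0.0]]
    for points in classes:
        mx, my = class_mean(points)
        for x, y in points:
            dx, dy = x - mx, y - my
            s[0][0] += dx * dx / N
            s[0][1] += dx * dy / N
            s[1][0] += dy * dx / N
            s[1][1] += dy * dy / N
    return s

def fisher_direction(class_a, class_b):
    """Return (w, S_w, d) with w = S_w^{-1} d and d = mean_a - mean_b."""
    s = within_class_scatter([class_a, class_b])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if abs(det) < 1e-12:
        raise ValueError("S_w is singular; the closed form does not apply")
    ma, mb = class_mean(class_a), class_mean(class_b)
    d = (ma[0] - mb[0], ma[1] - mb[1])
    # Closed-form inverse of a 2x2 matrix applied to d.
    w = ((s[1][1] * d[0] - s[0][1] * d[1]) / det,
         (-s[1][0] * d[0] + s[0][0] * d[1]) / det)
    return w, s, d

# Hypothetical two-class data.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 7.0)]

w, S_w, d = fisher_direction(class_a, class_b)
proj_a = [w[0] * x + w[1] * y for x, y in class_a]   # y = w^T x for class a
proj_b = [w[0] * x + w[1] * y for x, y in class_b]   # y = w^T x for class b

print("w =", w)
print("projections of class a:", proj_a)
print("projections of class b:", proj_b)
```

On this toy data the one-dimensional projections of the two classes do not overlap, which is the behavior the objective function is designed to reward.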
ML-36
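LDA Derivations (4/4)
• Proof:
$$\hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \qquad (|\cdot| \text{ denotes the determinant})$$
Or, for each column vector $w_i$ of $W$, we want to find
$$\hat{w}_i = \arg\max_{w_i} \lambda_i, \qquad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$$
The quadratic form has an optimal solution where the derivative vanishes. Using $\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(x^T C x)}{dx} = (C + C^T)x$:
$$\frac{\partial \lambda_i}{\partial w_i} = 0 \ \Rightarrow\ \frac{2 S_b w_i \left( w_i^T S_w w_i \right) - 2 S_w w_i \left( w_i^T S_b w_i \right)}{\left( w_i^T S_w w_i \right)^2} = 0$$
$$\Rightarrow\ S_b w_i \left( w_i^T S_w w_i \right) - S_w w_i \left( w_i^T S_b w_i \right) = 0$$
$$\Rightarrow\ S_b w_i - \lambda_i S_w w_i = 0 \qquad \left( \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \right)$$
$$\Rightarrow\ S_b w_i = \lambda_i S_w w_i \ \Rightarrow\ S_w^{-1} S_b w_i = \lambda_i w_i$$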
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-cepstral vectors (after the cosine transform), calculated using Year-99's 5471 files]
$$\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T, \qquad \Sigma_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
[Figures: covariance matrix of the 18-PCA-cepstral vectors; covariance matrix of the 18-LDA-cepstral vectors]
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
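LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A
$$Av_i \cdot Av_j = (Av_i)^TAv_j = v_i^TA^TAv_j = \lambda_j v_i^Tv_j = 0 \quad (i \ne j)$$
  – Suppose that A (or $A^TA$) has rank r ≤ n
$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$$
  – Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col A
$$u_i = \frac{1}{\|Av_i\|}Av_i = \frac{1}{\sigma_i}Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\ u_2\ \dots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \dots\ v_r]$$
    • V is an orthonormal matrix (n×r), known in advance; U is also an orthonormal matrix (m×r)
  • Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$
$$[u_1\ \dots\ u_r\ \dots\ u_m]\,\Sigma = A\,[v_1\ \dots\ v_r\ \dots\ v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \quad (VV^T = I_{n\times n})$$
$$\Sigma_{m\times n} = \begin{pmatrix} \Sigma_{r\times r} & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2$$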
ML-54
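LSA Derivations (3/7)
$$A_{m\times n} = U\Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1\Sigma_1V_1^T = AV_1V_1^T$$
[Figure: A maps $R^n$ to $R^m$, with $U\Sigma = AV$. $V_1$ (the vectors $v_i$) spans the row space of A; $V_2$ spans the null space, AX = 0; $U_1$ (the vectors $u_i$) spans the row space of $A^T$]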
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
$$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^TV = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  – Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by U
$$A = U\Sigma V^T \;\Rightarrow\; A^TU = (U\Sigma V^T)^TU = V\Sigma^TU^TU = V\Sigma \;\Rightarrow\; V\Sigma = A^TU$$
    • The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A), m×n, with rows w1, w2, …, wm and columns d1, d2, …, dn
    • Compare two terms → dot product of two rows of A, or an entry in $AA^T$
    • Compare two docs → dot product of two columns of A, or an entry in $A^TA$
    • Compare a term and a doc → each individual entry of A
  – The new word-document matrix (A′), with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$
    • Compare two terms (wi, wj) → dot product of two rows of U′Σ′
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$$
    • Compare two docs (dk, ds) → dot product of two rows of V′Σ′
$$A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • Compare a query word and a doc → each individual entry of A′
  – Σ′ acts as a stretching or shrinking factor; $V'^TV'$ and $U'^TU'$ reduce to the identity matrix
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m×1 query (or doc) vector
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – The result is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
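The fold-in and cosine formulas on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `fold_in_query` and `latent_cosine` are made up here, and `U`, `sigma`, and `V` are toy values picked so the algebra is easy to verify, not a realistic term-doc decomposition.

```python
import math

def fold_in_query(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}; q has m entries, U is m x k, sigma holds k singular values."""
    return [
        sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j]
        for j in range(len(sigma))
    ]

def latent_cosine(q_hat, d_hat, sigma):
    """Cosine between the row vectors q_hat * Sigma and d_hat * Sigma."""
    qs = [x * s for x, s in zip(q_hat, sigma)]
    ds = [x * s for x, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    norm = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return dot / norm if norm > 0 else 0.0

# Toy factors (assumed): 3 terms, 2 docs, k = 2.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # orthonormal columns
sigma = [3.0, 2.0]                          # diagonal of Sigma
V = [[0.6, -0.8], [0.8, 0.6]]               # one row per doc, orthonormal columns

# Column 0 of A = U Sigma V^T is doc 0 in term space; use it as the "query".
doc0_terms = [sum(U[i][j] * sigma[j] * V[0][j] for j in range(2)) for i in range(3)]
q_hat = fold_in_query(doc0_terms, U, sigma)

sim_same = latent_cosine(q_hat, V[0], sigma)
sim_other = latent_cosine(q_hat, V[1], sigma)
```

Folding in a document that was part of the original matrix should return its own row of V, so `sim_same` comes out as 1 while `sim_other` is smaller.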
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{t}_{1\times k} = t_{1\times n}\,V_{n\times k}\,\Sigma^{-1}_{k\times k}$$
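One way to see why the fold-in formulas are reasonable is to apply them to rows and columns that were already in A. The derivation below is a sketch that uses only the relations $U\Sigma = AV$ and $V\Sigma = A^TU$ from the earlier slides; the symbols $a_i$ (i-th row of A) and $d_j$ (j-th column of A) are notation introduced here.

```latex
\begin{align*}
AV = U\Sigma \;&\Rightarrow\; AV\Sigma^{-1} = U
  \;\Rightarrow\; a_i\,V\,\Sigma^{-1} = (\text{row } i \text{ of } U) \\
A^TU = V\Sigma \;&\Rightarrow\; A^TU\Sigma^{-1} = V
  \;\Rightarrow\; d_j^T\,U\,\Sigma^{-1} = (\text{row } j \text{ of } V)
\end{align*}
```

So an existing term folds in to its own row of U and an existing doc to its own row of V; a new term or query is placed by the same mapping.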
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems (“heterogeneous vocabulary”)
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (rows: terms; columns: docs)
        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0
  – Each entry is weighted by TFxIDF score
  – The same matrix in sparse text format
        4 3 6      (no. of rows/terms, no. of columns/docs, no. of nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
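The column-by-column layout shown on this slide can be produced from a dense matrix with a small helper. The function name `to_sparse_text` is hypothetical, and the layout it writes is inferred from the slide's 4×3 example only (a header line, then for each column its nonzero count followed by `row value` pairs).

```python
def to_sparse_text(dense):
    """Serialise a dense matrix (list of rows) in the column-wise sparse text layout."""
    n_rows, n_cols = len(dense), len(dense[0])
    # For every column, collect (row index, value) pairs of the nonzero entries.
    columns = [
        [(r, dense[r][c]) for r in range(n_rows) if dense[r][c] != 0]
        for c in range(n_cols)
    ]
    nonzeros = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {nonzeros}"]           # header: rows cols nonzeros
    for col in columns:
        lines.append(str(len(col)))                      # nonzero count of this column
        lines.extend(f"{r} {value}" for r, value in col) # one "row value" pair per line
    return "\n".join(lines) + "\n"

# The 4-term, 3-doc example matrix from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
```

For the slide's matrix, `sparse_text` reproduces the ten lines listed above, starting with the header `4 3 6`.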
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
  – Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
  – Word vector ($u^T$): 1×100; 51253 words
• LSA100-S
        100
        2686.18
        829.941
        559.59
        …
  – 100 singular values
• LSA100-Vt
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
  – Doc vector ($v^T$): 1×100; 2265 docs
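Reading these output files back is mostly a matter of splitting on whitespace. The sketch below assumes the layout suggested by the slide (a `rows cols` header followed by the values for `-Ut` and `-Vt`, and a count followed by one value per line for `-S`); the helper names are hypothetical and the two strings are tiny stand-ins for the real files.

```python
def parse_dense_text(text):
    """Parse a 'rows cols' header followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    rows, cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:2 + rows * cols]]
    if len(values) != rows * cols:
        raise ValueError("fewer values than the header announces")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(t) for t in tokens[1:1 + count]]
    if len(values) != count:
        raise ValueError("fewer values than the header announces")
    return values

def column(matrix, j):
    """Column j of a k x N matrix, e.g. the k-dimensional vector of word j in Ut."""
    return [row[j] for row in matrix]

# Stand-in contents (assumed): k = 2 dimensions, 3 words.
ut_text = "2 3\n0.003 0.001 0.5\n0.002 0.002 0.25\n"
s_text = "2\n2686.18\n829.941\n"

Ut = parse_dense_text(ut_text)
singular_values = parse_singular_values(s_text)
word1_vector = column(Ut, 1)
```

Because `Ut` is stored with one row per latent dimension, a single word's 1×k vector is a column of the parsed matrix, which is what `column` extracts.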
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – The result is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
ML-30
PCA Examples: Eigenvoice (3/3)
• Example 5: Eigenvoice in speaker adaptation (cont.)
  – Dimension 1 (eigenvoice 1)
    • Correlates with pitch or sex
  – Dimension 2 (eigenvoice 2)
    • Correlates with amplitude
  – Dimension 3 (eigenvoice 3)
    • Correlates with second-formant movement
Note that eigenface operates on the feature space while eigenvoice operates on the model space
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
    • Fisher (1936) introduced it for two-class classification
    • Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform W such that the ratio of average between-class variation over average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j$, $j \in \{1, 2, \dots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
  – The class sample means are
$$\bar{x}_j = \frac{1}{N_j}\sum_{g(x_i)=j} x_i$$
  – The class sample covariances are
$$\Sigma_j = \frac{1}{N_j}\sum_{g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$$
  – The average within-class variation before transform
$$S_w = \frac{1}{N}\sum_j N_j\,\Sigma_j$$
  – The average between-class variation before transform
$$S_b = \frac{1}{N}\sum_j N_j\,(\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$$
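The within-class and between-class matrices defined above are straightforward to compute directly from their sums. The sketch below is an illustration under assumptions: the six 2-D samples and the helper names are made up, and the total covariance `S_t` is an extra quantity not defined on the slide, included only as a sanity check.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[d] for v in vectors) / count for d in range(len(vectors[0]))]

def add_scaled_outer(matrix, diff, scale):
    """In place: matrix += scale * diff diff^T."""
    for r, a in enumerate(diff):
        for c, b in enumerate(diff):
            matrix[r][c] += scale * a * b

def scatter_matrices(samples, labels):
    """Return (S_w, S_b, S_t), each normalised by the total sample count N."""
    n_total, dim = len(samples), len(samples[0])
    grand_mean = mean_vector(samples)
    s_w = [[0.0] * dim for _ in range(dim)]
    s_b = [[0.0] * dim for _ in range(dim)]
    s_t = [[0.0] * dim for _ in range(dim)]
    for j in sorted(set(labels)):
        members = [x for x, g in zip(samples, labels) if g == j]
        class_mean = mean_vector(members)
        for x in members:  # (1/N) * sum over the class of (x - mean_j)(x - mean_j)^T
            add_scaled_outer(s_w, [a - b for a, b in zip(x, class_mean)], 1.0 / n_total)
        # (N_j / N) * (mean_j - mean)(mean_j - mean)^T
        add_scaled_outer(s_b, [a - b for a, b in zip(class_mean, grand_mean)],
                         len(members) / n_total)
    for x in samples:      # total covariance around the grand mean
        add_scaled_outer(s_t, [a - b for a, b in zip(x, grand_mean)], 1.0 / n_total)
    return s_w, s_b, s_t

# Hypothetical two-class data in two dimensions.
samples = [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0], [6.0, 5.0], [7.0, 8.0], [8.0, 8.0]]
labels = [1, 1, 1, 2, 2, 2]
S_w, S_b, S_t = scatter_matrices(samples, labels)
```

With this 1/N normalisation the three matrices satisfy `S_t = S_w + S_b`, which is a convenient check that the class bookkeeping is right.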
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \dots\ w_m]$ is applied
  – The sample vectors will be
$$y_i = W^Tx_i$$
  – The sample mean will be
$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} W^Tx_i = W^T\left(\frac{1}{N}\sum_{i=1}^{N} x_i\right) = W^T\bar{x}$$
  – The class sample means will be
$$\bar{y}_j = \frac{1}{N_j}\sum_{g(x_i)=j} W^Tx_i = W^T\bar{x}_j$$
  – The average within-class variation will be
$$\tilde{S}_w = \frac{1}{N}\sum_j N_j\left\{\frac{1}{N_j}\sum_{g(x_i)=j}\left(W^Tx_i - W^T\bar{x}_j\right)\left(W^Tx_i - W^T\bar{x}_j\right)^T\right\} = W^T\left\{\frac{1}{N}\sum_j N_j\,\Sigma_j\right\}W = W^TS_wW$$
[Figure: W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \dots\ w_m]$ is applied
  – Similarly, the average between-class variation will be
$$\tilde{S}_b = W^TS_bW$$
  – Try to find the optimal W such that the following objective function is maximized
$$J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^TS_bW|}{|W^TS_wW|}$$
• A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
$$S_bw_i = \lambda_iS_ww_i$$
• That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1}S_b$
$$S_w^{-1}S_bw_i = \lambda_iw_i$$
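The closed-form solution can be tried out on a 2×2 case. The sketch below finds the leading eigenvector of $S_w^{-1}S_b$ by power iteration, which is a technique chosen here for simplicity and not something the slides prescribe; the matrices `S_w` and `S_b` are assumed toy values, and the method presumes a unique dominant eigenvalue and a starting vector not orthogonal to its eigenvector.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inverse_2x2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]

def leading_lda_direction(S_w, S_b, iterations=200):
    """Leading eigenvector of S_w^{-1} S_b (power iteration) and its eigenvalue."""
    M = matmul(inverse_2x2(S_w), S_b)
    w = [1.0, 1.0]
    for _ in range(iterations):
        w = matvec(M, w)
        norm = math.sqrt(dot(w, w))
        w = [x / norm for x in w]
    # At the optimum the ratio w^T S_b w / w^T S_w w equals the eigenvalue.
    lam = dot(w, matvec(S_b, w)) / dot(w, matvec(S_w, w))
    return w, lam

# Assumed toy matrices: S_w is positive definite, S_b = (2, 1)(2, 1)^T has rank one.
S_w = [[2.0, 0.5], [0.5, 1.0]]
S_b = [[4.0, 2.0], [2.0, 1.0]]
w, lam = leading_lda_direction(S_w, S_b)
```

The returned pair should satisfy the generalized eigenvalue equation `S_b w = lam * S_w w` up to rounding.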
ML-36
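LDA Derivations (4/4)
• Proof
$$\hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^TS_bW|}{|W^TS_wW|} \quad (|\cdot| \text{ is the determinant})$$
Or, for each column vector $w_i$ of W, we want to find that
$$\hat{w}_i = \arg\max_{w_i} \frac{w_i^TS_bw_i}{w_i^TS_ww_i}$$
The quadratic form has an optimal solution where the derivative vanishes, using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(x^TCx)}{dx} = (C + C^T)x$:
$$\frac{\partial}{\partial w_i}\left(\frac{w_i^TS_bw_i}{w_i^TS_ww_i}\right) = 0$$
$$\Rightarrow\; \frac{2S_bw_i\,(w_i^TS_ww_i) - 2S_ww_i\,(w_i^TS_bw_i)}{(w_i^TS_ww_i)^2} = 0$$
$$\Rightarrow\; S_bw_i\,(w_i^TS_ww_i) - S_ww_i\,(w_i^TS_bw_i) = 0$$
$$\Rightarrow\; S_bw_i - \lambda_iS_ww_i = 0, \qquad \lambda_i = \frac{w_i^TS_bw_i}{w_i^TS_ww_i}$$
$$\Rightarrow\; S_bw_i = \lambda_iS_ww_i \;\Rightarrow\; S_w^{-1}S_bw_i = \lambda_iw_i$$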
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure, before the cosine transform: covariance matrix of the 18-Mel-filter-bank vectors, $\Sigma_x = \frac{1}{N}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$, calculated using Year-99's 5471 files]
[Figure, after the cosine transform: covariance matrix of the 18-cepstral vectors, $\Sigma_y = \frac{1}{N}\sum_i (y_i - \bar{y})(y_i - \bar{y})^T$, calculated using Year-99's 5471 files]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure, after PCA transform: covariance matrix of the 18-PCA-cepstral vectors, calculated using Year-99's 5471 files]
[Figure, after LDA transform: covariance matrix of the 18-LDA-cepstral vectors, calculated using Year-99's 5471 files]
Character Error Rate
            TC       WG
  MFCC      26.32    22.71
  LDA-1     23.12    20.17
  LDA-2     23.11    20.11
ML-39
PCA vs LDA (1/2)
[Figure: two panels, PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA provides a much lower classification error than LDA theoretically
  • However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4
             0.9 6.3 0.2
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4
          1.9 6.3 2.2
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1
          2.9 8.7 3.5
          4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI*BE;
    % PCA
    A = BE+WI;              % why? (Prove it)
    [V,D] = eig(A);
    [V,D] = eigs(A,3);
    fid = fopen('Basis','w');
    for i = 1:3             % feature vector length
        for j = 1:3         % basis number
            fprintf(fid,'%10.10f ',V(i,j));
        end
        fprintf(fid,'\n');
    end
    fclose(fid);
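For the "why" attached to the PCA line `A=BE+WI`, one possible line of argument is sketched below. It assumes that BE and WI stand for the between-class and within-class matrices $S_b$ and $S_w$ from the LDA derivation, and it introduces $S_t$ as a name for the covariance of the merged data; neither assumption is stated on the slide.

```latex
\begin{align*}
S_t &= \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T,
\qquad x_i - \bar{x} = (x_i - \bar{x}_j) + (\bar{x}_j - \bar{x}) \ \text{ for } g(x_i) = j \\
S_t &= \frac{1}{N}\sum_j \sum_{g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T
     + \frac{1}{N}\sum_j N_j\,(\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T \\
    &= S_w + S_b
\end{align*}
```

The cross terms drop out because $\sum_{g(x_i)=j} (x_i - \bar{x}_j) = 0$ within every class, so the eigenvectors of BE+WI are the principal components of the merged data set.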
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform; axes: feature 1 (about −10 to 4) and feature 2 (about −2 to 6)]
[Figure: distribution of the 2000 original data samples after the LDA transform; axes: feature 1 (about −0.9 to −0.2) and feature 2 (about −1.2 to 0.2)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with “latent” semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
    • Closely related to Principal Component Analysis (PCA)
ML-46
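LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: an n-dimensional X is mapped to a k-dimensional Y in the feature space spanned by the orthonormal basis $\varphi_1, \dots, \varphi_k$
$$y_i = \varphi_i^TX, \qquad \hat{X} = \sum_{i=1}^{k} y_i\varphi_i, \qquad \min \|X - \hat{X}\|^2 \ \text{for a given } k$$
  – SVD (in LSA): A is mapped to A′ in the k-dimensional latent semantic space
$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^T_{r\times n},\ \ r \le \min(m,n); \qquad A'_{m\times n} = U'_{m\times k}\,\Sigma'_{k\times k}\,V'^T_{k\times n}, \qquad \min \|A - A'\|_F^2 \ \text{for a given } k$$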
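The PCA side of this slide (project onto an orthonormal basis, then reconstruct from the first k coordinates) can be illustrated in two dimensions. The basis vectors, the sample `x`, and the helper names below are assumptions chosen for the example; a real PCA basis would come from the eigenvectors of the data covariance.

```python
import math

def project(x, basis):
    """y_i = phi_i^T x for every orthonormal basis vector phi_i."""
    return [sum(p * xi for p, xi in zip(phi, x)) for phi in basis]

def reconstruct(y, basis):
    """x_hat = sum_i y_i phi_i."""
    dim = len(basis[0])
    return [sum(yi * phi[d] for yi, phi in zip(y, basis)) for d in range(dim)]

# An assumed orthonormal basis of R^2: the standard axes rotated by 30 degrees.
angle = math.pi / 6
phi1 = [math.cos(angle), math.sin(angle)]
phi2 = [-math.sin(angle), math.cos(angle)]
x = [2.0, 1.0]

y_full = project(x, [phi1, phi2])
x_full = reconstruct(y_full, [phi1, phi2])    # all n coordinates: no error

y_k = project(x, [phi1])                      # keep only k = 1 coordinate
x_hat = reconstruct(y_k, [phi1])
error_sq = sum((a - b) ** 2 for a, b in zip(x, x_hat))
```

With the full basis the reconstruction is exact, and with k = 1 the squared error equals the square of the discarded coordinate `y_full[1]`, which is the quantity the least-squares criterion minimizes.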
bull Cosine measure between the query and doc vectors in the latent semantic space
( ) 111ˆ minus
timestimestimestimes Σ= kkkmmT
k UqqQuery represented by the weightedsum of it constituent term vectors
The separate dimensions are differentially weighted
Just like a row of V
( )ΣΣ
Σ=ΣΣ=
dq
dqdqcoinedqsimT
ˆˆ
ˆˆ)ˆˆ(ˆˆ
2
TFxIDF weighted beforehand
ML-31
Linear Discriminant Analysis (LDA) (1/2)
• Also called
  – Fisher's Linear Discriminant Analysis, Fisher-Rao Linear Discriminant Analysis
• Fisher (1936) introduced it for two-class classification
• Rao (1965) extended it to handle multiple-class classification
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with class labels, try to find a linear transform W such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
LDA Derivations (1/4)
• Suppose there are N sample vectors $x_i$ with dimensionality n, each of which belongs to one of the J classes: $g(x_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index
  – The sample mean is
    $\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
  – The class sample means are
    $\bar{x}_j = \frac{1}{N_j} \sum_{g(x_i)=j} x_i$
  – The class sample covariances are
    $\Sigma_j = \frac{1}{N_j} \sum_{g(x_i)=j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$
  – The average within-class variation before transform
    $S_w = \frac{1}{N} \sum_j N_j \Sigma_j$
  – The average between-class variation before transform
    $S_b = \frac{1}{N} \sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$
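The within-class and between-class variation matrices defined on this slide can be computed directly from labeled samples. The following is a minimal NumPy sketch, not part of the original slides; the function name `scatter_matrices` and the two-class toy data are assumptions made for illustration.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_w, S_b) for samples stored in the rows of X."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    N, n = X.shape
    mean_all = X.mean(axis=0)                      # overall sample mean
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        N_j = Xj.shape[0]
        mean_j = Xj.mean(axis=0)                   # class sample mean
        centered = Xj - mean_j
        Sigma_j = centered.T @ centered / N_j      # class sample covariance (1/N_j)
        S_w += N_j * Sigma_j / N                   # weighted average of class covariances
        diff = (mean_j - mean_all).reshape(-1, 1)
        S_b += N_j * (diff @ diff.T) / N           # weighted spread of the class means
    return S_w, S_b

# Hypothetical toy data: two Gaussian classes in three dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(2.0, 1.0, (70, 3))])
labels = np.array([0] * 50 + [1] * 70)
S_w, S_b = scatter_matrices(X, labels)
```

With these 1/N normalizations, the two matrices add up to the covariance matrix of the pooled data, which is a convenient sanity check.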
ML-34
LDA Derivations (2/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – The sample vectors will be
    $y_i = W^T x_i$
  – The sample mean will be
    $\bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x}$
  – The class sample means will be
    $\bar{y}_j = \frac{1}{N_j} \sum_{g(x_i)=j} W^T x_i = W^T \bar{x}_j$
  – The average within-class variation will be
    $\tilde{S}_w = \frac{1}{N} \sum_j N_j \left\{ \frac{1}{N_j} \sum_{g(x_i)=j} (W^T x_i - W^T \bar{x}_j)(W^T x_i - W^T \bar{x}_j)^T \right\} = W^T \left\{ \frac{1}{N} \sum_j N_j \Sigma_j \right\} W = W^T S_w W$
[Figure: the transform W maps x to y, $S_w$ to $\tilde{S}_w$, and $S_b$ to $\tilde{S}_b$]
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  – Similarly, the average between-class variation will be
    $\tilde{S}_b = W^T S_b W$
  – Try to find the optimal W such that the following objective function is maximized
    $J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$
    • A closed-form solution: the column vectors of an optimal matrix W are the generalized eigenvectors corresponding to the largest eigenvalues in
      $S_b w_i = \lambda_i S_w w_i$
    • That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$
      $S_w^{-1} S_b w_i = \lambda_i w_i$
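The closed-form solution above amounts to an eigen-decomposition of $S_w^{-1} S_b$. The sketch below shows one way to do this with NumPy; the helper name `lda_transform` and the small 2×2 matrices are made up for illustration, and a production implementation might prefer a dedicated generalized symmetric eigen-solver.

```python
import numpy as np

def lda_transform(S_w, S_b, m):
    """Return (W, lam): the top-m eigenvectors (columns) and eigenvalues of S_w^{-1} S_b."""
    # Solve S_w A = S_b rather than forming the inverse explicitly.
    A = np.linalg.solve(S_w, S_b)
    eigvals, eigvecs = np.linalg.eig(A)
    # With S_w positive definite and S_b symmetric the eigenvalues are real in theory;
    # taking the real part only discards round-off.
    eigvals, eigvecs = eigvals.real, eigvecs.real
    order = np.argsort(eigvals)[::-1][:m]          # indices of the m largest eigenvalues
    return eigvecs[:, order], eigvals[order]

# Hypothetical within-class and between-class matrices.
S_w = np.array([[2.0, 0.3], [0.3, 1.0]])
S_b = np.array([[1.0, 0.8], [0.8, 0.9]])
W, lam = lda_transform(S_w, S_b, m=1)

# Ratio of between-class to within-class variation along the chosen direction.
w = W[:, 0]
ratio = (w @ S_b @ w) / (w @ S_w @ w)
```

The ratio computed at the end equals the selected eigenvalue, which is the quantity the objective function maximizes for a single projection direction.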
ML-36
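LDA Derivations (4/4)
• Proof
  $\hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|}$  (|·| denotes the determinant)
  Or, for each column vector $w_i$ of W, we want to find
  $\hat{w}_i = \arg\max_{w_i} \lambda_i, \quad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$
  The quadratic form has an optimal solution:
  $\frac{\partial \lambda_i}{\partial w_i} = 0$
  $\Rightarrow \frac{2 S_b w_i (w_i^T S_w w_i) - 2 S_w w_i (w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$
  $\Rightarrow S_b w_i (w_i^T S_w w_i) - S_w w_i (w_i^T S_b w_i) = 0$
  $\Rightarrow S_b w_i - S_w w_i \frac{w_i^T S_b w_i}{w_i^T S_w w_i} = 0$
  $\Rightarrow S_b w_i - \lambda_i S_w w_i = 0 \Rightarrow S_b w_i = \lambda_i S_w w_i$
  $\Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$
  Rules used: $\left( \frac{F}{G} \right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(x^T C x)}{dx} = (C + C^T) x$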
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
  $\Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$
[Figure: covariance matrix of the 18-cepstral vectors after the cosine transform, calculated using Year-99's 5471 files]
  $\Sigma_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors after the PCA transform, calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors after the LDA transform, calculated using Year-99's 5471 files]
Character Error Rate
          TC      WG
  MFCC    26.32   22.71
  LDA-1   23.12   20.17
  LDA-2   23.11   20.11
ML-39
PCA vs LDA (1/2)
[Figure: projections obtained with PCA and with LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];
    WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];
    % for LDA
    IWI = inv(WI);
    A = IWI*BE;
    % for PCA
    A = BE + WI;        % why? (Prove it)
    [V,D] = eig(A);     % or: [V,D] = eigs(A,3);
    fid = fopen('Basis','w');
    for i = 1:3         % feature vector length
        for j = 1:3     % basis number
            fprintf(fid,'%10.10f ',V(i,j));
        end
        fprintf(fid,'\n');
    end
    fclose(fid);
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform; axes: feature 1 vs. feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transform; axes: feature 1 vs. feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA (feature space, orthonormal basis $\phi_1, \ldots, \phi_k$ chosen from n dimensions)
    $y_i = \phi_i^T X$,  $\hat{X} = \sum_{i=1}^{k} y_i \phi_i$
    $\min \|X - \hat{X}\|^2$ for a given k
  – SVD (in LSA) (latent semantic space)
    $A_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}$,  $r \le \min(m, n)$
    $A'_{m \times n} = U'_{m \times k} \Sigma'_{k \times k} V'^T_{k \times n}$
    $\min \|A - A'\|_F^2$ for a given k
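The PCA half of this comparison (project onto the first k basis vectors, then reconstruct) can be illustrated numerically. The sketch below is an illustration under assumptions, not the lecture's own code: the toy data, the helper name `reconstruct` and the choice of `numpy.linalg.eigh` on the sample covariance are all choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated toy data, one sample per row
X = X - X.mean(axis=0)                                     # PCA assumes zero-mean data

cov = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)       # ascending eigenvalues, orthonormal eigenvectors
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]       # largest variance first

def reconstruct(X, Phi_k):
    """Project onto the columns of Phi_k (y_i = phi_i^T x) and map back (sum_i y_i phi_i)."""
    Y = X @ Phi_k
    return Y @ Phi_k.T

# Mean squared reconstruction error for k = 1..5 retained components.
errors = [float(np.mean(np.sum((X - reconstruct(X, eigvecs[:, :k])) ** 2, axis=1)))
          for k in range(1, 6)]
```

The error for a given k equals the sum of the discarded eigenvalues, so it shrinks as k grows and vanishes when all five components are kept.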
ML-47
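LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: Projection of a Vector x onto the basis vectors $\phi_1, \phi_2$, giving the coordinates $y_1, y_2$]
  $y_1 = \|x\| \cos\theta_1 = \|x\| \frac{\phi_1^T x}{\|\phi_1\| \|x\|} = \phi_1^T x$, where $\|\phi_1\| = 1$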
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: doc terms and query terms are linked by literal term matching; doc expansion and query expansion act on the terms; the structure models of doc and query are linked by latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $A_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}$  (rows: words $w_1, w_2, \ldots, w_m$; columns: docs $d_1, d_2, \ldots, d_n$)
  $A'_{m \times n} = U'_{m \times k} \Sigma_{k \times k} V'^T_{k \times n}$
  – Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $k \le r \le \min(m, n)$
  – $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row $A \in R^n$, Col $A \in R^m$
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
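The rank-k truncation and the Frobenius-norm relation on this slide can be checked on a small matrix. This is a minimal sketch using `numpy.linalg.svd`; the 6×4 random matrix and the choice k = 2 are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((6, 4))                              # toy m x n word-document matrix

# Reduced SVD: U is m x r, s holds the singular values in descending order, Vt is r x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]         # rank-k approximation A'

frob_A = float(np.sum(A ** 2))                      # ||A||_F^2
frob_A_k = float(np.sum(A_k ** 2))                  # ||A'||_F^2
residual = float(np.sum((A - A_k) ** 2))            # ||A - A'||_F^2
```

The squared Frobenius norm of A is the sum of all squared singular values, that of A′ is the sum of the k retained ones, and the residual is the sum of the discarded ones.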
ML-52
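LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
      $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$,  $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$)
      $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$,  $v_j^T v_j = 1$,  $V^T V = I_{n \times n}$
• Define singular values (sigma) $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$
    $\sigma_1 = \|Av_1\|,\ \sigma_2 = \|Av_2\|,\ \ldots$
    $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
• For $\lambda_i \ne 0$, i = 1, …, r, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A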
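Both definitions of the singular values given here (square roots of the eigenvalues of $A^T A$, and lengths of the vectors $Av_j$) can be compared against a library SVD. The following NumPy sketch uses an arbitrary 5×3 random matrix as an assumed example.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.random((5, 3))

lam, V = np.linalg.eigh(A.T @ A)                 # eigen-decomposition of the symmetric A^T A
lam, V = lam[::-1], V[:, ::-1]                   # reorder so that lambda_1 >= ... >= lambda_n

sigma_from_eig = np.sqrt(np.clip(lam, 0.0, None))     # clip guards against tiny negative round-off
sigma_from_len = np.linalg.norm(A @ V, axis=0)         # ||A v_j|| for every column v_j
sigma_from_svd = np.linalg.svd(A, compute_uv=False)    # reference values, descending
```

All three arrays agree up to floating-point error, and the columns of V are orthonormal.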
ML-53
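LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
  $Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$  (for $i \ne j$)
  – Suppose that A (or $A^T A$) has rank $r \le n$
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$,  $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
    $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
    $\Rightarrow [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r]$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $[u_1\ \cdots\ u_r\ \cdots\ u_m]\, \Sigma = A\, [v_1\ \cdots\ v_r\ \cdots\ v_n]$
      $\Rightarrow U\Sigma = AV \Rightarrow U\Sigma V^T = AVV^T \Rightarrow A = U\Sigma V^T$  (using $VV^T = I_{n \times n}$)
      $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
  – $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$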
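The construction on this slide (take the eigenvectors of $A^T A$ as V, then normalize each $Av_i$ to obtain $u_i$) can be followed step by step in code. The sketch below assumes a random 5×3 matrix, which has full column rank with probability 1, so that no singular value is zero and the division is safe.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.random((5, 3))                       # assumed full column rank, so r = n = 3

lam, V = np.linalg.eigh(A.T @ A)             # V is "known in advance" from A^T A
lam, V = lam[::-1], V[:, ::-1]               # descending eigenvalues
sigma = np.sqrt(lam)                         # sigma_i = sqrt(lambda_i)

U_r = (A @ V) / sigma                        # column i is u_i = A v_i / sigma_i
A_rebuilt = U_r @ np.diag(sigma) @ V.T       # A = U Sigma V^T
```

The columns of `U_r` come out orthonormal without any explicit orthogonalization, because the vectors $Av_i$ are already mutually orthogonal.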
ML-54
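LSA Derivations (3/7)
  $A_{m \times n} = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$
  $U\Sigma = AV$
[Figure: A maps $R^n$ to $R^m$ through $V^T$ and U; the $v_i$ (columns of $V_1$) span the row space of A, the $u_i$ (columns of $U_1$) span the row space of $A^T$, and the columns of $V_2$ satisfy $AX = 0$]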
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
    $A = U\Sigma V^T \Rightarrow AV = U\Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$
    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  – Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
    $A = U\Sigma V^T \Rightarrow A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$
    • The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A), m×n, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$
    • compare two terms → dot product of two rows of A
      – or an entry in $AA^T$
    • compare two docs → dot product of two columns of A
      – or an entry in $A^T A$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A′), with U′ = $U_{m \times k}$, Σ′ = $\Sigma_k$, V′ = $V_{n \times k}$
    • compare two terms → dot product of two rows of U′Σ′
      $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$
    • compare two docs → dot product of two rows of V′Σ′
      $A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
    • compare a query word and a doc → each individual entry of A′
  – Σ′ is for stretching or shrinking; $V'^T V'$ and $U'^T U'$ are identity matrices
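The term-term and doc-doc comparisons in the reduced space can be computed without ever forming A′ explicitly. The sketch below is an illustration on an assumed 6×5 random matrix with k = 2; the variable names are chosen here and do not come from the slides.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((6, 5))                       # toy matrix: 6 terms x 5 docs
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = U_k @ S_k @ V_k.T                      # the new word-document matrix A'

term_vecs = U_k @ S_k                        # row i represents term w_i
doc_vecs = V_k @ S_k                         # row j represents doc d_j

term_term = term_vecs @ term_vecs.T          # all pairwise term comparisons
doc_doc = doc_vecs @ doc_vecs.T              # all pairwise doc comparisons
```

Working with the k-dimensional rows of U′Σ′ and V′Σ′ gives the same dot products as the rows and columns of A′, at a much lower cost when k is small relative to m and n.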
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m×1 query (or doc) vector
  $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{|\hat{q}\Sigma|\, |\hat{d}\Sigma|}$
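Fold-in and the cosine measure translate into a few lines of NumPy. The following is a hedged sketch: the helper names `fold_in` and `sim`, the random 6×5 matrix and k = 2 are assumptions for illustration, and singular values are kept as a vector so that multiplying by Σ or Σ⁻¹ becomes element-wise scaling.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.random((6, 5))                       # toy matrix: 6 terms x 5 docs
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

def fold_in(q):
    """Map an m-dimensional term vector to a 1 x k row vector: q^T U_k Sigma_k^{-1}."""
    return (q @ U_k) / s_k

def sim(q_hat, d_hat):
    """Cosine between the two row vectors after weighting each dimension by Sigma_k."""
    a, b = q_hat * s_k, d_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d_hat = fold_in(A[:, 0])                     # fold in a doc that was part of the analysis
score = sim(d_hat, V_k[1])                   # compare it with the second doc
```

Folding in a column of A that was part of the original analysis reproduces the corresponding row of V, which is the sense in which a folded-in query is "just like a row of V".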
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (rows: terms; columns: docs)
      2.3  0.0  4.2
      0.0  1.3  2.2
      3.8  0.0  0.5
      0.0  0.0  0.0
  – Each entry is weighted by its TFxIDF score
  – The same matrix as sparse input:
      4 3 6      (rows = terms, columns = docs, nonzero entries)
      2          (2 nonzero entries at Col 0)
      0 2.3      (Col 0, Row 0)
      2 3.8      (Col 0, Row 2)
      1          (1 nonzero entry at Col 1)
      1 1.3      (Col 1, Row 1)
      3          (3 nonzero entries at Col 2)
      0 4.2      (Col 2, Row 0)
      1 2.2      (Col 2, Row 1)
      2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
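The column-by-column sparse layout shown on this slide is easy to generate from a dense matrix. The sketch below uses only the Python standard library; the function name `to_sparse_text` is hypothetical, and the layout it writes is the one inferred from the 4×3 example on this slide rather than taken from the toolkit's documentation.

```python
def to_sparse_text(matrix):
    """Serialise a dense matrix (list of rows) column by column.

    Output layout: a header "rows cols nonzeros", then for every column
    the number of nonzero entries followed by one "row value" line each.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines) + "\n"

# The 4-term x 3-doc example matrix from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(term_doc)
```

Storing only the nonzero entries matters for realistic term-doc matrices, where the vast majority of entries are zero.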
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
    51253 2265 218852     (indexing term no., doc no., nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
  – word vector ($u^T$): 1×100; 51253 words
• LSA100-S
    100
    2686.18
    829.941
    559.59
    …
  – 100 eigenvalues
• LSA100-Vt
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
  – doc vector ($v^T$): 1×100; 2265 docs
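Output files such as LSA100-Ut can be read back with a short parser. The sketch below assumes the layout suggested by this slide, namely a header line "rows cols" followed by whitespace-separated values in row order; that layout, the function names and the tiny 2×3 stand-in text are assumptions made here, so the real files should be checked before relying on it.

```python
def parse_dense_text(text):
    """Parse "rows cols" followed by rows*cols whitespace-separated numbers."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("matrix body does not match the declared shape")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def column(matrix, j):
    """Column j of a k x m matrix such as Ut, i.e. the k-dimensional vector of item j."""
    return [row[j] for row in matrix]

# Made-up 2 x 3 stand-in for a file like LSA100-Ut (2 dimensions, 3 words).
toy_ut = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.005\n"
Ut = parse_dense_text(toy_ut)
word_1 = column(Ut, 1)                 # latent-space vector of the second word
```

Because the transposed matrices are stored with one latent dimension per row, the vector of a particular word or doc is a column of the parsed matrix.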
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
  $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{|\hat{q}\Sigma|\, |\hat{d}\Sigma|}$
ML-32
Linear Discriminant Analysis (LDA) (2/2)
• Given a set of sample vectors with labeled (class) information, try to find a linear transform $\mathbf{W}$ such that the ratio of the average between-class variation to the average within-class variation is maximal
[Figure: within-class distributions are assumed here to be Gaussians with equal variance in the two-dimensional sample space]
ML-33
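LDA Derivations (1/4)
• Suppose there are N sample vectors $\mathbf{x}_i$ with dimensionality n, each of which belongs to one of the J classes: $g(\mathbf{x}_i) = j,\ j \in \{1, 2, \ldots, J\}$, where $g(\cdot)$ is the class index and $N_j$ is the number of samples in class j
  - The sample mean is
    $\bar{\mathbf{x}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i$
  - The class sample means are
    $\bar{\mathbf{x}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i)=j} \mathbf{x}_i$
  - The class sample covariances are
    $\boldsymbol{\Sigma}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i)=j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$
  - The average within-class variation before transform
    $\mathbf{S}_w = \frac{1}{N} \sum_{j} N_j \boldsymbol{\Sigma}_j$
  - The average between-class variation before transform
    $\mathbf{S}_b = \frac{1}{N} \sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$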
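The within-class and between-class variation matrices defined on this slide can be computed directly from labeled samples. The following is a minimal NumPy sketch; the function name `scatter_matrices` and the synthetic two-class data are assumptions made for illustration and do not come from the slides.

```python
import numpy as np

def scatter_matrices(X, labels):
    """Return (S_w, S_b) for samples X (N x n) and integer class labels (N,)."""
    N, n = X.shape
    mean_all = X.mean(axis=0)                      # sample mean
    S_w = np.zeros((n, n))
    S_b = np.zeros((n, n))
    for j in np.unique(labels):
        Xj = X[labels == j]
        Nj = Xj.shape[0]
        mean_j = Xj.mean(axis=0)                   # class sample mean
        centered = Xj - mean_j
        Sigma_j = centered.T @ centered / Nj       # class sample covariance
        S_w += Nj * Sigma_j / N                    # average within-class variation
        diff = (mean_j - mean_all)[:, None]
        S_b += Nj * (diff @ diff.T) / N            # average between-class variation
    return S_w, S_b

# Hypothetical data: two Gaussian classes in three dimensions.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(2.0, 1.0, (70, 3))])
labels = np.array([0] * 50 + [1] * 70)
S_w, S_b = scatter_matrices(X, labels)

# Total covariance of the pooled data, used only as a sanity check.
centered_all = X - X.mean(axis=0)
S_t = centered_all.T @ centered_all / len(X)
```

As a sanity check, `S_w + S_b` should match the total covariance `S_t` of the pooled data up to floating-point error.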
ML-34
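LDA Derivations (2/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  - The sample vectors will be
    $\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$
  - The sample mean will be
    $\bar{\mathbf{y}} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \left( \frac{1}{N} \sum_{i=1}^{N} \mathbf{x}_i \right) = \mathbf{W}^T \bar{\mathbf{x}}$
  - The class sample means will be
    $\bar{\mathbf{y}}_j = \frac{1}{N_j} \sum_{g(\mathbf{x}_i)=j} \mathbf{W}^T \mathbf{x}_i = \mathbf{W}^T \bar{\mathbf{x}}_j$
  - The average within-class variation will be
    $\tilde{\mathbf{S}}_w = \frac{1}{N} \sum_{j} N_j \left\{ \frac{1}{N_j} \sum_{g(\mathbf{x}_i)=j} (\mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j)(\mathbf{W}^T \mathbf{x}_i - \mathbf{W}^T \bar{\mathbf{x}}_j)^T \right\} = \mathbf{W}^T \left\{ \frac{1}{N} \sum_{j} N_j \boldsymbol{\Sigma}_j \right\} \mathbf{W} = \mathbf{W}^T \mathbf{S}_w \mathbf{W}$
[Diagram: $\mathbf{W}$ maps $\mathbf{x}$ to $\mathbf{y}$, $\mathbf{S}_w$ to $\tilde{\mathbf{S}}_w$, and $\mathbf{S}_b$ to $\tilde{\mathbf{S}}_b$]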
ML-35
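LDA Derivations (3/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  - Similarly, the average between-class variation will be
    $\tilde{\mathbf{S}}_b = \mathbf{W}^T \mathbf{S}_b \mathbf{W}$
  - Try to find the optimal $\mathbf{W}$ such that the following objective function is maximized
    $J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|}$
• A closed-form solution: the column vectors $\mathbf{w}_i$ of an optimal matrix $\mathbf{W}$ are the generalized eigenvectors corresponding to the largest eigenvalues in
    $\mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i$
• That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1} \mathbf{S}_b$
    $\mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$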
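The closed-form solution can be sketched numerically as an eigen decomposition of $\mathbf{S}_w^{-1}\mathbf{S}_b$. This is a minimal NumPy illustration, assuming $\mathbf{S}_w$ is invertible; the function name `lda_transform` and the two small matrices are hypothetical values chosen only for the example.

```python
import numpy as np

def lda_transform(S_w, S_b, m):
    """Return the m largest eigenvalues of S_w^{-1} S_b and their eigenvectors (as columns)."""
    A = np.linalg.solve(S_w, S_b)            # S_w^{-1} S_b without forming the inverse
    eigvals, eigvecs = np.linalg.eig(A)
    eigvals, eigvecs = eigvals.real, eigvecs.real   # real for symmetric S_b and positive definite S_w
    order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest eigenvalues
    return eigvals[order], eigvecs[:, order]

# Hypothetical within-class and between-class variation matrices.
S_w = np.array([[2.0, 0.3], [0.3, 1.0]])
S_b = np.array([[1.0, 0.5], [0.5, 0.8]])
lam, W = lda_transform(S_w, S_b, m=1)
w = W[:, 0]
objective = (w @ S_b @ w) / (w @ S_w @ w)    # J for a single projection direction
```

For a single direction, the objective value equals the leading eigenvalue, which is what the generalized eigenvector relation predicts.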
ML-36
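LDA Derivations (4/4)
• Proof
    $\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T \mathbf{S}_b \mathbf{W}|}{|\mathbf{W}^T \mathbf{S}_w \mathbf{W}|}$  (where $|\cdot|$ denotes the determinant)
  Or, for each column vector $\mathbf{w}_i$ of $\mathbf{W}$, we want to find
    $\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i}$
  The quadratic form has an optimal solution where the derivative vanishes. Using $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(\mathbf{x}^T \mathbf{C} \mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}$, with $\mathbf{S}_b$ and $\mathbf{S}_w$ symmetric:
    $\frac{\partial}{\partial \mathbf{w}_i} \left( \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} \right) = 0$
    $\Rightarrow \frac{2\mathbf{S}_b \mathbf{w}_i (\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i) - (\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i)\, 2\mathbf{S}_w \mathbf{w}_i}{(\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i)^2} = 0$
    $\Rightarrow \frac{\mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} - \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} \cdot \frac{\mathbf{S}_w \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} = 0$
    $\Rightarrow \mathbf{S}_b \mathbf{w}_i - \lambda_i \mathbf{S}_w \mathbf{w}_i = 0 \quad \left( \lambda_i = \frac{\mathbf{w}_i^T \mathbf{S}_b \mathbf{w}_i}{\mathbf{w}_i^T \mathbf{S}_w \mathbf{w}_i} \right)$
    $\Rightarrow \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{S}_w \mathbf{w}_i$
    $\Rightarrow \mathbf{S}_w^{-1} \mathbf{S}_b \mathbf{w}_i = \lambda_i \mathbf{w}_i$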
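The stationarity argument in the proof can be checked numerically. The sketch below assumes a symmetric $\mathbf{S}_b$ and a positive definite $\mathbf{S}_w$; both matrices are hypothetical. It evaluates the gradient of the quotient at the leading eigenvector and compares the quotient there with its value at random directions.

```python
import numpy as np

def rayleigh(w, S_b, S_w):
    """The per-column objective (w^T S_b w) / (w^T S_w w)."""
    return (w @ S_b @ w) / (w @ S_w @ w)

# Hypothetical matrices: S_w is symmetric positive definite, S_b is symmetric.
S_w = np.array([[2.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.5]])
S_b = np.array([[1.0, 0.5, 0.1], [0.5, 0.8, 0.0], [0.1, 0.0, 0.3]])

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
top = np.argmax(eigvals.real)
lam_max = eigvals.real[top]
w_hat = eigvecs[:, top].real

# Gradient of the quotient, from the quotient rule used in the proof.
num = w_hat @ S_b @ w_hat
den = w_hat @ S_w @ w_hat
grad = (2 * S_b @ w_hat * den - num * 2 * S_w @ w_hat) / den ** 2

# Best quotient value over 1000 random directions, for comparison.
rng = np.random.default_rng(1)
random_best = max(rayleigh(w, S_b, S_w) for w in rng.normal(size=(1000, 3)))
```

The gradient vanishes at `w_hat`, the quotient there equals `lam_max`, and no random direction exceeds it.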
ML-37
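LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
    $\boldsymbol{\Sigma}_{\mathbf{x}} = \frac{1}{N} \sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), calculated using Year-99's 5471 files]
    $\boldsymbol{\Sigma}'_{\mathbf{y}} = \frac{1}{N} \sum_{i} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$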
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]
Character Error Rate
            TC       WG
  MFCC      26.32    22.71
  LDA-1     23.12    20.17
  LDA-2     23.11    20.11
ML-39
PCA vs LDA (1/2)
[Figure: two panels, PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar=[3.0 0.5 0.4
           0.9 6.3 0.2
           0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE=[3.0 3.5 1.4
        1.9 6.3 2.2
        2.4 0.4 4.2];
    WI=[4.0 4.1 2.1
        2.9 8.7 3.5
        4.4 3.2 4.3];
    % LDA (use either this definition of A or the PCA one below)
    IWI=inv(WI);
    A=IWI*BE;
    % PCA
    A=BE+WI;          % why? (Prove it)
    [V,D]=eig(A);
    [V,D]=eigs(A,3);
    fid=fopen('Basis','w');
    for i=1:3         % feature vector length
      for j=1:3       % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
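One way to approach the "why?" on this slide is sketched below. It assumes that BE and WI stand for the between-class and within-class variation matrices $\mathbf{S}_b$ and $\mathbf{S}_w$ (the slide does not say so explicitly). $\mathbf{S}_t$ is a symbol introduced here for the total covariance of the merged data, which is the matrix that PCA decomposes.

```latex
\begin{aligned}
\mathbf{S}_t
  &= \frac{1}{N}\sum_{i=1}^{N}(\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^T \\
  &= \frac{1}{N}\sum_{j}\sum_{g(\mathbf{x}_i)=j}
     \big[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\big]
     \big[(\mathbf{x}_i-\bar{\mathbf{x}}_j)+(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})\big]^T \\
  &= \frac{1}{N}\sum_{j}N_j\boldsymbol{\Sigma}_j
   + \frac{1}{N}\sum_{j}N_j(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})(\bar{\mathbf{x}}_j-\bar{\mathbf{x}})^T \\
  &= \mathbf{S}_w+\mathbf{S}_b
\end{aligned}
```

The cross terms drop out in the third line because $\sum_{g(\mathbf{x}_i)=j}(\mathbf{x}_i-\bar{\mathbf{x}}_j)=\mathbf{0}$ for every class j.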
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; axes: feature 1, feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transformation; axes: feature 1, feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
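The claim that a query and a doc can be similar in the latent space without sharing any term can be illustrated on a toy matrix. The sketch below uses NumPy; the four terms, three docs, and counts are invented for the example. The query is folded in as $\hat{q} = q^T U_k \Sigma_k^{-1}$ and compared after scaling by $\Sigma_k$.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical term-doc counts.
# Rows: car, automobile, engine, flower; columns: d0, d1, d2.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 2.0]])
q = np.array([1.0, 0.0, 0.0, 0.0])        # query containing only "car"

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

q_hat = q @ U_k / s_k                      # fold the query into the latent space
raw_sim = cosine(q, A[:, 1])               # d1 = {automobile, engine}: no shared term
latent_sim = cosine(q_hat * s_k, V_k[1] * s_k)
flower_sim = cosine(q_hat * s_k, V_k[2] * s_k)
```

In the original term space the query and d1 are orthogonal. In the two-dimensional latent space they point in the same direction, because "car" and "automobile" both co-occur with "engine". The unrelated doc d2 stays orthogonal to the query.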
ML-46
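LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: a vector $X$ in the n-dimensional feature space is projected onto an orthonormal basis $\phi_1, \ldots, \phi_k$, giving a k-dimensional vector $Y$
      $y_i = \phi_i^T X$
      $\hat{X} = \sum_{i=1}^{k} y_i \phi_i$
      $\min \|X - \hat{X}\|^2$ for a given k
  - SVD (in LSA): the m×n matrix $A$ is mapped into a k-dimensional latent semantic space
      $A_{m \times n} = U'_{m \times r}\, \Sigma'_{r \times r}\, V'^T_{r \times n}$, with $r \le \min(m, n)$
      $A'_{m \times n}$ keeps only the leading k×k block of $\Sigma'$
      $\min \|A - A'\|_F^2$ for a given k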
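The least-squares criterion for the truncated SVD can be checked on a small example. This NumPy sketch uses an arbitrary random matrix (an assumption, not data from the slides) and compares the squared Frobenius error of the rank-k approximation with the sum of the discarded squared singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                       # hypothetical m x n matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]          # rank-k approximation A'

err = np.linalg.norm(A - A_k, "fro") ** 2         # ||A - A'||_F^2
discarded = np.sum(s[k:] ** 2)                    # sum of squared singular values dropped
```

The two quantities agree, so the error of the rank-k approximation depends only on the singular values that were dropped.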
ML-47
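LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a Vector $x$
    $y_1 = \|x\| \cos\theta_1 = \|x\| \frac{\phi_1^T x}{\|\phi_1\| \|x\|} = \phi_1^T x$, where $\|\phi_1\| = 1$
[Figure: vector $x$ projected onto the basis vectors $\phi_1$ and $\phi_2$, giving the coordinates $y_1$ and $y_2$]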
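A short numeric illustration of the projection formula, with a vector and a unit-length basis vector chosen arbitrarily for the example:

```python
import numpy as np

x = np.array([3.0, 4.0])
phi1 = np.array([1.0, 1.0]) / np.sqrt(2.0)     # unit-length basis vector

y1 = phi1 @ x                                   # projection coefficient phi_1^T x

# Angle between x and phi1, computed independently from the vector directions.
theta1 = np.arctan2(x[1], x[0]) - np.arctan2(phi1[1], phi1[0])
y1_from_angle = np.linalg.norm(x) * np.cos(theta1)
```

Both routes give the same coefficient, which is why the inner product with a unit-length basis vector can be read as the length of the projection.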
ML-48
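LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Diagram: a doc and a query are each represented by their terms and by a structure model. Doc terms and query terms are compared by literal term matching, supported by doc expansion and query expansion. The two structure models are compared by latent semantic structure retrieval]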
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
    $A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}$
    $A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}$
  (the rows of $A$ and $A'$ are the words $w_1, w_2, \ldots, w_m$; the columns are the docs $d_1, d_2, \ldots, d_n$)
  - Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  - $r \le \min(m, n)$
  - $k \le r$, $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  - Row $A \in R^n$, Col $A \in R^m$
  - Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
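LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
        $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$)
        $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$
  - As the square roots of the eigenvalues of $A^T A$
  - As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$
        $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
        $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$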
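Both definitions of the singular values can be compared numerically. This NumPy sketch uses a random matrix as a stand-in for a word-document matrix (an assumption made for illustration) and relies on `numpy.linalg.eigh`, which returns the eigenvalues of a symmetric matrix in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                     # hypothetical m x n matrix

lam, V = np.linalg.eigh(A.T @ A)                # eigenvalues ascending, eigenvectors orthonormal
order = np.argsort(lam)[::-1]                   # reorder so that lambda_1 >= lambda_2 >= ...
lam, V = lam[order], V[:, order]

sigma = np.sqrt(np.clip(lam, 0.0, None))        # square roots of the eigenvalues of A^T A
lengths = np.linalg.norm(A @ V, axis=0)         # ||A v_j|| for each eigenvector v_j
reference = np.linalg.svd(A, compute_uv=False)  # singular values from a library SVD
```

The arrays `sigma`, `lengths`, and `reference` agree up to floating-point error.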
ML-53
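LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
    $Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$
  - Suppose that $A$ (or $A^T A$) has rank $r \le n$
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  - Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
    $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
    $\Rightarrow [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r]$
    ($[v_1 \cdots v_r]$ is an orthonormal matrix (n×r), known in advance; $[u_1 \cdots u_r]$ is also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
    $[u_1\ \cdots\ u_r\ \cdots\ u_m]\, \Sigma = A\, [v_1\ \cdots\ v_r\ \cdots\ v_n]$
    $\Rightarrow U \Sigma = AV \Rightarrow U \Sigma V^T = A V V^T \Rightarrow A = U \Sigma V^T$  (since $V V^T = I_{n \times n}$)
    $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
    $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$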
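The construction of $U$ from $A v_i / \sigma_i$ can be reproduced step by step. The NumPy sketch below assumes a full-column-rank random matrix, so that every $\sigma_i$ is nonzero. It stops at the thin m×r factor and does not extend $U$ to a basis of $R^m$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))                 # hypothetical matrix with rank r = n = 3

lam, V = np.linalg.eigh(A.T @ A)            # eigen decomposition of A^T A
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(lam)

U = (A @ V) / sigma                         # column i is u_i = A v_i / sigma_i
A_rebuilt = U @ np.diag(sigma) @ V.T        # U Sigma V^T
frob_sq = np.sum(A ** 2)                    # ||A||_F^2
```

The columns of `U` come out orthonormal, `A_rebuilt` reproduces `A`, and `frob_sq` equals the sum of the squared singular values.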
ML-54
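LSA Derivations (3/7)
    $A_{m \times n} = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$
    $U \Sigma = AV$
  - The $v_i$ (columns of $V_1$) span the row space of $A$, in $R^n$
  - The $u_i$ (columns of $U_1$) span the row space of $A^T$, in $R^m$
  - The columns of $V_2$ satisfy $AX = 0$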
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
      $A = U \Sigma V^T \Rightarrow AV = U \Sigma V^T V = U \Sigma \Rightarrow U \Sigma = AV$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  - Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
      $A = U \Sigma V^T \Rightarrow A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma \Rightarrow V \Sigma = A^T U$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
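The two projection identities, $U\Sigma = AV$ and $V\Sigma = A^T U$, can be verified on a small matrix. The matrix below is a made-up example, and the thin SVD from NumPy is used.

```python
import numpy as np

# Hypothetical 4-term x 3-doc matrix.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

row_proj = A @ V        # each row: a row of A projected onto the columns of V
col_proj = A.T @ U      # each row: a row of A^T projected onto the columns of U
```

Here `row_proj` equals `U * s` (the rows of $U$ scaled by the singular values) and `col_proj` equals `V * s`.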
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix ($A$, m×n, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$)
    • compare two terms ($w_i$, $w_j$) → dot product of two rows of $A$, or an entry in $AA^T$
    • compare two docs ($d_k$, $d_s$) → dot product of two columns of $A$, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of $A$
  - The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U'\Sigma'$
        $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$
    • compare two docs → dot product of two rows of $V'\Sigma'$
        $A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
    • compare a query word and a doc → each individual entry of $A'$
  ($\Sigma'$ acts for stretching or shrinking; $V'^T V'$ and $U'^T U'$ reduce to the identity matrix)
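The term-term and doc-doc comparisons in the reduced space can be checked against the products $A'A'^T$ and $A'^T A'$. This is a NumPy sketch on a made-up matrix with k = 2; the variable names are illustrative.

```python
import numpy as np

# Hypothetical 4-term x 3-doc matrix.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k].T

A_k = U_k @ S_k @ V_k.T       # the new word-document matrix A'
term_vecs = U_k @ S_k         # one row per term
doc_vecs = V_k @ S_k          # one row per doc

term_sims = term_vecs @ term_vecs.T   # all pairwise term dot products
doc_sims = doc_vecs @ doc_vecs.T      # all pairwise doc dot products
```

The k-dimensional rows of `term_vecs` and `doc_vecs` are enough to reproduce every entry of $A'A'^T$ and $A'^T A'$, so the full m×n matrix $A'$ never needs to be formed for these comparisons.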
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  - For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m×1 query (or doc) vector
    $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  - The result is just like a row of $V$
  - Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$
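The fold-in formula and the cosine measure can be sketched as two small functions. The names `fold_in_query` and `latent_similarity` are hypothetical, as is the example matrix. Folding in a column of $A$ itself is used as a check, since it should reproduce that doc's row of $V_k$.

```python
import numpy as np

def fold_in_query(q, U_k, s_k):
    """q_hat = q^T U_k Sigma_k^{-1}, returned as a length-k row vector."""
    return (q @ U_k) / s_k

def latent_similarity(q_hat, d_hat, s_k):
    """Cosine between q_hat * Sigma and d_hat * Sigma."""
    a, b = q_hat * s_k, d_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 4-term x 3-doc matrix.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

q = A[:, 0]                              # treat an existing doc as if it were a new query
q_hat = fold_in_query(q, U_k, s_k)
self_sim = latent_similarity(q_hat, V_k[0], s_k)
```

Folding in doc 0 returns row 0 of `V_k`, consistent with the remark that a folded-in vector is just like a row of $V$, and its similarity to itself is 1.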
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
    $\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$
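The term fold-in mirrors the query fold-in, with $V$ taking the place of $U$. In this NumPy sketch (made-up matrix, hypothetical variable names), an existing row of $A$ is folded in and compared with the corresponding row of $U_k$.

```python
import numpy as np

# Hypothetical 4-term x 3-doc matrix.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k].T

t = A[1, :]                    # 1 x n term vector (here an existing term)
t_hat = (t @ V_k) / s_k        # t_hat = t V_k Sigma_k^{-1}
```

For a term that was part of the original analysis, `t_hat` coincides with that term's row of `U_k`.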
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (Row: Term, Col: Doc)
        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0
  - Each entry is weighted by TFxIDF score
  - The corresponding sparse input file:
        4 3 6      (4 terms, 3 docs, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
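The column-by-column sparse layout shown on this slide can be generated from a dense matrix with a few lines of plain Python. The function name `to_sparse_text` is hypothetical, and the sketch follows only the layout visible in the slide's example (a header line, then per column a count followed by "row value" pairs).

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) column by column in the slide's layout."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Keep (row index, value) pairs for the nonzero entries of column c.
        columns.append([(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0])
    nnz = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {nnz}"]          # header: rows, columns, nonzero entries
    for col in columns:
        lines.append(str(len(col)))               # number of nonzero entries in this column
        lines.extend(f"{r} {v}" for r, v in col)  # one "row value" pair per line
    return "\n".join(lines) + "\n"

# The 4-term x 3-doc example from the slide.
term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
text = to_sparse_text(term_doc)
```

Writing `text` to a file gives an input in the same shape as the slide's example. Real TFxIDF weights would normally be computed before this step.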
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1×100)
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
• LSA100-S (100 eigenvalues)
        100
        2686.18
        829.941
        559.59
        …
• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1×100)
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
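Output files laid out like the ones on this slide can be read back with a small parser. The sketch below assumes the file starts with a "rows cols" header followed by whitespace-separated values, as the slide suggests. The function name `parse_dense_text` and the tiny sample string are hypothetical.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols numbers into a list of rows."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("header does not match the number of values")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

# Hypothetical miniature stand-in for a -Ut file: 2 latent dimensions x 3 words.
sample = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.005\n"
Ut = parse_dense_text(sample)

# Transpose so that each entry is one word's vector over the latent dimensions.
word_vectors = [list(col) for col in zip(*Ut)]
```

With a real 100×51253 file, the same transpose would yield 51253 word vectors of length 100.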
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
    $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  - The result is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-33
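LDA Derivations (1/4)
• Suppose there are $N$ sample vectors $\mathbf{x}_i$ of dimensionality $n$, each of which belongs to one of the $J$ classes, where $g(\cdot)$ is the class index and $N_j$ is the number of samples in class $j$:
$$g(\mathbf{x}_i) = j, \quad j \in \{1, 2, \ldots, J\}$$
  – The sample mean is
$$\bar{\mathbf{x}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{x}_i$$
  – The class sample means are
$$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \mathbf{x}_i$$
  – The class sample covariances are
$$\boldsymbol{\Sigma}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} (\mathbf{x}_i - \bar{\mathbf{x}}_j)(\mathbf{x}_i - \bar{\mathbf{x}}_j)^T$$
  – The average within-class variation before transform
$$\mathbf{S}_w = \frac{1}{N}\sum_{j} N_j \boldsymbol{\Sigma}_j$$
  – The average between-class variation before transform
$$\mathbf{S}_b = \frac{1}{N}\sum_{j} N_j (\bar{\mathbf{x}}_j - \bar{\mathbf{x}})(\bar{\mathbf{x}}_j - \bar{\mathbf{x}})^T$$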
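The within-class and between-class variations defined on this slide can be checked numerically. The following is a minimal sketch in plain Python; the toy samples, the 0/1 class labels, and the helper names (`mean`, `outer`, `add`, `scatter_matrices`) are illustrative assumptions, not part of the original slides. It also computes the total covariance so that the standard identity S_t = S_w + S_b can be observed.

```python
# Toy data: 2-D samples with class labels g(x_i) in {0, 1} (illustrative values).
samples = [([1.0, 2.0], 0), ([2.0, 3.0], 0), ([3.0, 3.0], 0),
           ([6.0, 5.0], 1), ([7.0, 8.0], 1), ([8.0, 7.0], 1)]

def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def outer(a, b):
    """Outer product a b^T as a nested list."""
    return [[ai * bj for bj in b] for ai in a]

def add(A, B, scale=1.0):
    """Return A + scale * B for two matrices of the same shape."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scatter_matrices(samples):
    """Return (S_w, S_b, S_t): within-class, between-class, total variation."""
    xs = [x for x, _ in samples]
    N, dim = len(xs), len(xs[0])
    x_bar = mean(xs)
    zero = [[0.0] * dim for _ in range(dim)]
    S_w, S_b, S_t = zero, zero, zero
    for j in sorted({label for _, label in samples}):
        members = [x for x, label in samples if label == j]
        x_bar_j = mean(members)
        # (1/N) * N_j * Sigma_j, accumulated one sample at a time
        for x in members:
            d = [a - b for a, b in zip(x, x_bar_j)]
            S_w = add(S_w, outer(d, d), 1.0 / N)
        # (1/N) * N_j * (x_bar_j - x_bar)(x_bar_j - x_bar)^T
        d = [a - b for a, b in zip(x_bar_j, x_bar)]
        S_b = add(S_b, outer(d, d), len(members) / N)
    for x in xs:
        d = [a - b for a, b in zip(x, x_bar)]
        S_t = add(S_t, outer(d, d), 1.0 / N)
    return S_w, S_b, S_t

S_w, S_b, S_t = scatter_matrices(samples)
print("S_w =", S_w)
print("S_b =", S_b)
print("S_t =", S_t)
```

On this toy set the two classes are far apart relative to their spread, so the entries of S_b dominate those of S_w.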
ML-34
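LDA Derivations (2/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  – The sample vectors will be
$$\mathbf{y}_i = \mathbf{W}^T \mathbf{x}_i$$
  – The sample mean will be
$$\bar{\mathbf{y}} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{W}^T\mathbf{x}_i = \mathbf{W}^T\left(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\right) = \mathbf{W}^T\bar{\mathbf{x}}$$
  – The class sample means will be
$$\bar{\mathbf{y}}_j = \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \mathbf{W}^T\mathbf{x}_i = \mathbf{W}^T\bar{\mathbf{x}}_j$$
  – The average within-class variation will be
$$\tilde{\mathbf{S}}_w = \frac{1}{N}\sum_{j} N_j \left\{ \frac{1}{N_j}\sum_{g(\mathbf{x}_i)=j} \left(\mathbf{W}^T\mathbf{x}_i - \mathbf{W}^T\bar{\mathbf{x}}_j\right)\left(\mathbf{W}^T\mathbf{x}_i - \mathbf{W}^T\bar{\mathbf{x}}_j\right)^T \right\} = \mathbf{W}^T\left\{\frac{1}{N}\sum_{j} N_j \boldsymbol{\Sigma}_j\right\}\mathbf{W} = \mathbf{W}^T\mathbf{S}_w\mathbf{W}$$
[Figure: the transform $\mathbf{W}$ maps $\mathbf{x}$ to $\mathbf{y}$, $\mathbf{S}_w$ to $\tilde{\mathbf{S}}_w$, and $\mathbf{S}_b$ to $\tilde{\mathbf{S}}_b$]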
ML-35
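LDA Derivations (3/4)
• If the transform $\mathbf{W} = [\mathbf{w}_1\ \mathbf{w}_2\ \cdots\ \mathbf{w}_m]$ is applied
  – Similarly, the average between-class variation will be
$$\tilde{\mathbf{S}}_b = \mathbf{W}^T\mathbf{S}_b\mathbf{W}$$
  – Try to find the optimal $\mathbf{W}$ such that the following objective function is maximized
$$J(\mathbf{W}) = \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \frac{|\mathbf{W}^T\mathbf{S}_b\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_w\mathbf{W}|}$$
• A closed-form solution: the column vectors of an optimal matrix $\mathbf{W}$ are the generalized eigenvectors corresponding to the largest eigenvalues in
$$\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{S}_w\mathbf{w}_i$$
• That is, the $\mathbf{w}_i$'s are the eigenvectors corresponding to the largest eigenvalues of $\mathbf{S}_w^{-1}\mathbf{S}_b$
$$\mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$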
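For the two-class case the generalized eigenproblem has a well-known closed form: S_b has rank one, so the only direction with a nonzero eigenvalue is w = S_w^{-1}(x̄_1 − x̄_2). The sketch below checks S_b w = λ S_w w in plain Python. The matrix `S_w`, the mean difference `diff`, the equal class sizes, and the helper names are all assumed for illustration.

```python
def inv2(M):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Illustrative (assumed) statistics for two equally sized classes in 2-D.
S_w = [[2.0, 0.5], [0.5, 1.0]]      # within-class variation
diff = [3.0, 1.0]                   # x_bar_1 - x_bar_2
prior = 0.5 * 0.5                   # N_1 * N_2 / N^2 for equal class sizes

# For J = 2 the between-class variation reduces to a rank-one matrix:
# S_b = (N_1 N_2 / N^2) * diff diff^T
S_b = [[prior * a * b for b in diff] for a in diff]

# Candidate LDA direction and its generalized eigenvalue
w = matvec(inv2(S_w), diff)
lam = prior * sum(d * wi for d, wi in zip(diff, w))

lhs = matvec(S_b, w)                     # S_b w
rhs = [lam * v for v in matvec(S_w, w)]  # lambda * S_w w
print("w =", w)
print("lambda =", lam)
print("S_b w =", lhs)
print("lambda S_w w =", rhs)
```

With more than two classes this shortcut no longer applies, and a general eigen-solver on S_w^{-1} S_b is needed instead.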
ML-36
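LDA Derivations (4/4)
• Proof ($|\cdot|$ denotes the determinant)
$$\hat{\mathbf{W}} = \arg\max_{\mathbf{W}} J(\mathbf{W}) = \arg\max_{\mathbf{W}} \frac{|\tilde{\mathbf{S}}_b|}{|\tilde{\mathbf{S}}_w|} = \arg\max_{\mathbf{W}} \frac{|\mathbf{W}^T\mathbf{S}_b\mathbf{W}|}{|\mathbf{W}^T\mathbf{S}_w\mathbf{W}|}$$
Or, for each column vector $\mathbf{w}_i$ of $\mathbf{W}$, we want to find
$$\hat{\mathbf{w}}_i = \arg\max_{\mathbf{w}_i} \lambda_i, \quad \lambda_i = \frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i}$$
The quadratic form has an optimal solution where the derivative vanishes. Using the rules
$$\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}, \qquad \frac{d(\mathbf{x}^T\mathbf{C}\mathbf{x})}{d\mathbf{x}} = (\mathbf{C} + \mathbf{C}^T)\mathbf{x}$$
we obtain
$$\frac{\partial \lambda_i}{\partial \mathbf{w}_i} = \frac{2\mathbf{S}_b\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right) - 2\mathbf{S}_w\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i\right)}{\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right)^2} = 0$$
$$\Rightarrow \mathbf{S}_b\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i\right) - \mathbf{S}_w\mathbf{w}_i\left(\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i\right) = 0$$
$$\Rightarrow \mathbf{S}_b\mathbf{w}_i - \mathbf{S}_w\mathbf{w}_i\,\frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i} = 0$$
$$\Rightarrow \mathbf{S}_b\mathbf{w}_i - \lambda_i\mathbf{S}_w\mathbf{w}_i = 0 \qquad \left(\lambda_i = \frac{\mathbf{w}_i^T\mathbf{S}_b\mathbf{w}_i}{\mathbf{w}_i^T\mathbf{S}_w\mathbf{w}_i}\right)$$
$$\Rightarrow \mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{S}_w\mathbf{w}_i$$
$$\Rightarrow \mathbf{S}_w^{-1}\mathbf{S}_b\mathbf{w}_i = \lambda_i\mathbf{w}_i$$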
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
  – Covariance matrix of the 18-Mel-filter-bank vectors (calculated using Year-99's 5471 files)
$$\boldsymbol{\Sigma}_x = \frac{1}{N}\sum_{i} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^T$$
  – After the cosine transform: covariance matrix of the 18-cepstral vectors (calculated using Year-99's 5471 files)
$$\boldsymbol{\Sigma}'_y = \frac{1}{N}\sum_{i} (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})^T$$
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
  – After PCA transform: covariance matrix of the 18-PCA-cepstral vectors (calculated using Year-99's 5471 files)
  – After LDA transform: covariance matrix of the 18-LDA-cepstral vectors (calculated using Year-99's 5471 files)

Character Error Rate
         TC      WG
MFCC     26.32   22.71
LDA-1    23.12   20.17
LDA-2    23.11   20.11
ML-39
PCA vs. LDA (1/2)
[Figure: projections found by PCA and by LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformed data, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
ML-42
HW-3: Feature Transformation (2/4)
ML-43
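HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix

    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);              % surface plot of the matrix entries

• Eigen Decomposition

    BE = [3.0 3.5 1.4;        % between-class variation
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;        % within-class variation
          2.9 8.7 3.5;
          4.4 3.2 4.3];

    % LDA
    IWI = inv(WI);
    A = IWI*BE;
    % PCA (use instead of the two LDA lines above)
    A = BE+WI;                % why? (Prove it)

    [V,D] = eig(A);           % full eigen decomposition
    [V,D] = eigs(A,3);        % or: the 3 largest-magnitude eigenpairs

    fid = fopen('Basis','w');
    for i = 1:3               % feature vector length
      for j = 1:3             % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);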
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform; horizontal axis: feature 1, vertical axis: feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transform; horizontal axis: feature 1, vertical axis: feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
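LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA (feature space): project $X$ onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$
$$y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i\varphi_i, \qquad \min \|X - \hat{X}\|^2 \ \text{for a given } k$$
  – SVD (in LSA) (latent semantic space): decompose $A$ and keep the $k$ largest singular values
$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^T_{r\times n}, \quad r \le \min(m, n)$$
$$A'_{m\times n} = U'\,\Sigma'_{k\times k}\,V'^T, \qquad \min \|A - A'\|_F^2 \ \text{for a given } k$$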
ML-47
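LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction

Projection of a vector $x$ (components $y_1, y_2$ along the basis vectors $\varphi_1, \varphi_2$):
$$y_1 = \|x\|\cos\theta_1 = \|x\|\,\frac{\varphi_1^T x}{\|\varphi_1\|\,\|x\|} = \varphi_1^T x, \quad \text{where } \|\varphi_1\| = 1$$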
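As a small numeric illustration of the projection formula, the sketch below projects a 2-D vector onto an orthonormal pair of basis vectors and confirms that the component along a unit vector equals ||x|| cos θ. The vector `x` and the basis `phi_1`, `phi_2` are assumed values chosen only for this example.

```python
import math

def dot(a, b):
    """Inner product of two vectors."""
    return sum(p * q for p, q in zip(a, b))

x = [2.0, 1.0]
phi_1 = [0.6, 0.8]     # unit length
phi_2 = [-0.8, 0.6]    # unit length and orthogonal to phi_1

norm_x = math.sqrt(dot(x, x))
y_1 = dot(phi_1, x)    # component of x along phi_1
y_2 = dot(phi_2, x)    # component of x along phi_2

# Because ||phi_1|| = 1, the cosine of the angle between x and phi_1 is y_1 / ||x||
cos_theta_1 = y_1 / norm_x

print("y_1 =", y_1, " y_2 =", y_2)
print("||x|| * cos(theta_1) =", norm_x * cos_theta_1)
print("y_1^2 + y_2^2 =", y_1 ** 2 + y_2 ** 2, " ||x||^2 =", dot(x, x))
```

Since the basis is orthonormal, the squared components sum to the squared length of x, which is why dropping low-variance components gives a least-squares approximation.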
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each shown with its terms and a structure model; labeled paths: doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
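LSA (7/7)
• Singular Value Decomposition (SVD) of the word-document matrix $A$ (rows: words $w_1, w_2, \ldots, w_m$; columns: docs $d_1, d_2, \ldots, d_n$)
$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^T_{r\times n}, \quad r \le \min(m, n)$$
  – Both $U$ and $V$ have orthonormal column vectors: $U^TU = I_{r\times r}$, $V^TV = I_{r\times r}$
  – $\text{Row } A \subseteq R^n$, $\text{Col } A \subseteq R^m$
• Keeping only the $k$ largest singular values
$$A'_{m\times n} = U'_{m\times k}\,\Sigma_{k\times k}\,V'^T_{k\times n}, \quad k \le r$$
$$\|A\|_F^2 \ge \|A'\|_F^2, \qquad \|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$
  – Docs and queries are represented in a $k$-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$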
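The relation between A, its truncated version A', and the Frobenius norm can be illustrated for k = 1 without any numerical library. The sketch below uses power iteration on A^T A to estimate the top singular triplet; this is a swapped-in technique chosen for brevity, not the algorithm used by any particular SVD package, and the matrix values and helper names are assumptions.

```python
import math

# 4 terms x 3 docs (illustrative values)
A = [[2.3, 0.0, 4.2],
     [0.0, 1.3, 2.2],
     [3.8, 0.0, 0.5],
     [0.0, 0.0, 0.0]]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def frob2(M):
    """Squared Frobenius norm: the sum of all squared entries."""
    return sum(x * x for row in M for x in row)

# Power iteration on A^T A converges to the top right singular vector v_1.
At = transpose(A)
v = [1.0] * len(A[0])
for _ in range(200):
    v = matvec(At, matvec(A, v))
    n = norm(v)
    v = [x / n for x in v]

Av = matvec(A, v)
sigma_1 = norm(Av)                 # largest singular value, ||A v_1||
u = [x / sigma_1 for x in Av]      # matching left singular vector
A1 = [[sigma_1 * ui * vj for vj in v] for ui in u]   # rank-1 approximation A'

residual = frob2([[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, A1)])
print("sigma_1 =", sigma_1)
print("||A||_F^2 =", frob2(A))
print("||A'||_F^2 =", frob2(A1))
print("||A - A'||_F^2 =", residual)
```

The printed values show ||A'||_F^2 = σ_1^2 and ||A − A'||_F^2 = ||A||_F^2 − σ_1^2, so the truncated matrix never has a larger Frobenius norm than A.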
ML-52
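LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^TA$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$)
$$V_{n\times n} = [v_1\ v_2\ \cdots\ v_n], \qquad v_j^Tv_j = 1, \qquad V^TV = I_{n\times n}$$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$
  – As the square roots of the eigenvalues of $A^TA$
  – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
$$\|Av_i\|^2 = v_i^TA^TAv_i = v_i^T\lambda_iv_i = \lambda_i \ \Rightarrow\ \|Av_i\| = \sigma_i$$
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$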
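Both definitions of the singular values can be compared on a 2x2 example, where the eigenvalues of the symmetric matrix A^T A have a closed form. This is a minimal sketch; the matrix `A` and the helper names are assumed, and the eigenvector formula used here requires a nonzero off-diagonal entry in A^T A.

```python
import math

A = [[3.0, 0.0],
     [4.0, 5.0]]

def matvec(M, v):
    """Matrix-vector product."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# A^T A is symmetric: entry (i, j) is the inner product of columns i and j of A.
cols = [list(c) for c in zip(*A)]
AtA = [[dot(ci, cj) for cj in cols] for ci in cols]
a, b, c = AtA[0][0], AtA[0][1], AtA[1][1]

# Closed-form eigenvalues of [[a, b], [b, c]], ordered lambda_1 >= lambda_2 >= 0.
half_trace = (a + c) / 2.0
radius = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
lambdas = [half_trace + radius, half_trace - radius]

def unit_eigvec(lam):
    """Unit eigenvector of [[a, b], [b, c]] for eigenvalue lam (assumes b != 0)."""
    v = [b, lam - a]
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

V = [unit_eigvec(lam) for lam in lambdas]       # v_1, v_2
sigmas = [math.sqrt(lam) for lam in lambdas]    # sigma_j = sqrt(lambda_j)
AV = [matvec(A, v) for v in V]                  # A v_1, A v_2
lengths = [math.sqrt(dot(w, w)) for w in AV]    # ||A v_j||

print("sqrt of eigenvalues of A^T A:", sigmas)
print("lengths of A v_j:           ", lengths)
print("(A v_1) . (A v_2) =", dot(AV[0], AV[1]))
```

The two lists agree, and the inner product of A v_1 and A v_2 is zero up to rounding, matching the claim that the A v_i form an orthogonal basis of Col A.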
ML-53
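LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
$$Av_i \cdot Av_j = (Av_i)^T(Av_j) = v_i^TA^TAv_j = \lambda_j v_i^Tv_j = 0 \quad (i \ne j)$$
  – Suppose that $A$ (or $A^TA$) has rank $r \le n$
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
$$u_i = \frac{1}{\|Av_i\|}Av_i = \frac{1}{\sigma_i}Av_i \ \Rightarrow\ \sigma_iu_i = Av_i \ \Rightarrow\ [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$$
    Here $V = [v_1 \cdots v_r]$ is an orthonormal matrix ($n \times r$), known in advance, and $U = [u_1 \cdots u_r]$ is also an orthonormal matrix ($m \times r$)
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
$$[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n]$$
$$\Rightarrow U\Sigma = AV \ \Rightarrow\ U\Sigma V^T = AVV^T \ \Rightarrow\ A = U\Sigma V^T \qquad (VV^T = I_{n\times n})$$
$$\Sigma_{m\times n} = \begin{pmatrix} \Sigma_{r\times r} & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}$$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$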
ML-54
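LSA Derivations (3/7)
[Figure: $A_{m\times n}$ maps $R^n$ to $R^m$. In $R^n$ (basis $V$), the $v_i$ in $V_1$ span the row space of $A$, and $V_2$ covers the solutions of $AX = 0$. In $R^m$ (basis $U$), the $u_i$ in $U_1$ span the row space of $A^T$, and $U_2$ covers the remaining directions.]
$$A = U\Sigma V^T = (U_1\ U_2)\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} = U_1\Sigma_1V_1^T = AV_1V_1^T$$
$$U\Sigma = AV$$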
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
$$A = U\Sigma V^T \ \Rightarrow\ AV = U\Sigma V^TV = U\Sigma \ \Rightarrow\ U\Sigma = AV$$
    • The $i$-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the $i$-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
$$A = U\Sigma V^T \ \Rightarrow\ A^TU = (U\Sigma V^T)^TU = V\Sigma U^TU = V\Sigma \ \Rightarrow\ V\Sigma = A^TU$$
    • The $i$-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the $i$-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A_{m\times n}$; rows: words $w_1, \ldots, w_m$; columns: docs $d_1, \ldots, d_n$)
    • compare two terms ($w_i$, $w_j$) → dot product of two rows of $A$, or an entry in $AA^T$
    • compare two docs ($d_k$, $d_s$) → dot product of two columns of $A$, or an entry in $A^TA$
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$
    • compare two terms → dot product of two rows of $U'\Sigma'$
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$$
    • compare two docs → dot product of two rows of $V'\Sigma'$
$$A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • compare a query word and a doc → each individual entry of $A'$
  – In both products the inner factor ($V'^TV'$ or $U'^TU'$) is the identity, and $\Sigma'$ is for stretching or shrinking
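The two identities for comparing terms and docs in the reduced space can be verified on a tiny hand-built factorization. In the sketch below the factors `U`, `S`, `V` (standing for U', Σ', V' with k = 2) are assumed toy values whose columns are orthonormal by construction; `matmul` and `transpose` are small helpers written for this example.

```python
def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Assumed toy factors with k = 2: U' (3 terms x 2), Sigma' (2 x 2), V' (3 docs x 2).
U = [[1.0, 0.0], [0.0, 0.6], [0.0, 0.8]]
S = [[3.0, 0.0], [0.0, 2.0]]
V = [[0.6, 0.0], [0.8, 0.0], [0.0, 1.0]]

A_k = matmul(matmul(U, S), transpose(V))    # A' = U' Sigma' V'^T

US = matmul(U, S)    # each row: a term vector scaled by Sigma'
VS = matmul(V, S)    # each row: a doc vector scaled by Sigma'

term_term = matmul(US, transpose(US))       # (U'S')(U'S')^T
doc_doc = matmul(VS, transpose(VS))         # (V'S')(V'S')^T

print("A'A'^T         =", matmul(A_k, transpose(A_k)))
print("(U'S')(U'S')^T =", term_term)
print("A'^TA'         =", matmul(transpose(A_k), A_k))
print("(V'S')(V'S')^T =", doc_doc)
```

The practical consequence is that term-term and doc-doc similarities can be computed from the k-dimensional rows of U'Σ' and V'Σ' without ever forming A'.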
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new $m \times 1$ query (or doc) vector
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
    • The query is represented by the weighted sum of its constituent term vectors
    • The separate dimensions are differentially weighted
    • The result is just like a row of $V$
  – Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{\|\hat{q}\Sigma\|\,\|\hat{d}\Sigma\|}$$
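The fold-in and cosine steps can be sketched in a few lines. The model below (three terms, k = 2) uses assumed toy values for `U_k`, `sigma_k`, and `d_hat`; the function names `fold_in` and `sim` are hypothetical and only mirror the two formulas on this slide.

```python
import math

# Assumed toy model with m = 3 terms and k = 2 latent dimensions.
U_k = [[1.0, 0.0], [0.0, 0.6], [0.0, 0.8]]   # rows: term vectors; columns orthonormal
sigma_k = [3.0, 2.0]                         # diagonal of Sigma_k
d_hat = [0.6, 0.0]                           # a row of V_k (an already indexed doc)

def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1} for an m-dimensional term-weight vector q."""
    return [sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j]
            for j in range(len(sigma))]

def sim(q_hat, d_hat, sigma):
    """Cosine between q_hat*Sigma and d_hat*Sigma (both treated as row vectors)."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [a * s for a, s in zip(d_hat, sigma)]
    num = sum(a * b for a, b in zip(qs, ds))
    den = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return num / den if den else 0.0

q = [1.0, 1.0, 0.0]                  # query containing terms 0 and 1
q_hat = fold_in(q, U_k, sigma_k)
print("q_hat =", q_hat)
print("sim(q_hat, d_hat) =", sim(q_hat, d_hat, sigma_k))

# Folding in a column of A' = U Sigma V^T recovers the matching row of V.
doc_column = [sum(U_k[i][j] * sigma_k[j] * d_hat[j] for j in range(2)) for i in range(3)]
print("fold-in of an indexed doc:", fold_in(doc_column, U_k, sigma_k))
```

The last line illustrates the "just like a row of V" remark: a document that was part of the original decomposition folds back onto its own latent vector.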
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
$$\hat{t}_{1\times k} = t_{1\times n}\,V_{n\times k}\,\Sigma^{-1}_{k\times k}$$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
Recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handle synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Take term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms (rows) and 3 docs (columns)

        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0

  – Each entry is weighted by its TFxIDF score
  – The same matrix in sparse text form (Row: Term, Col: Doc)

        4 3 6      (4 rows, 3 cols, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
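The column-oriented sparse text layout described on this slide is easy to generate from a dense matrix. The following is a minimal sketch; the function name `to_sparse_text` is a hypothetical helper, and it assumes the layout is exactly a header line (rows, columns, nonzeros) followed, for each column, by a count line and one "row value" line per nonzero entry.

```python
def to_sparse_text(matrix):
    """Serialize a dense row-major matrix in a column-oriented sparse text layout:
    header 'rows cols nonzeros', then per column a count line followed by
    one 'row value' line for each nonzero entry."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {value}" for r, value in entries)
    return "\n".join(lines)

# The 4-term x 3-doc example from this slide
term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
print(to_sparse_text(term_doc))
```

Writing the returned string to a file gives an input of the kind the `svd` command reads with its sparse text option.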
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (header line: no. of indexing terms, no. of docs, no. of nonzero entries)

        51253 2265 218852
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……

• SVD command (IR_svd.bat)

        svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1x100)

        100 51253
        0.003 0.001 ……
        0.002 0.002 ……

• LSA100-S (100 eigenvalues)

        100
        2686.18
        829.941
        559.59
        …

• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1x100)

        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector ($q$ is TFxIDF weighted beforehand)
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – The result is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{\|\hat{q}\Sigma\|\,\|\hat{d}\Sigma\|}$$
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-34
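LDA Derivations (2/4)
• If the transform $W = [\, w_1 \; w_2 \; \cdots \; w_m \,]$ is applied:
  - The sample vectors will be
    \[ y_i = W^T x_i \]
  - The sample mean will be
    \[ \bar{y} = \frac{1}{N} \sum_{i=1}^{N} W^T x_i = W^T \left( \frac{1}{N} \sum_{i=1}^{N} x_i \right) = W^T \bar{x} \]
  - The class sample means will be
    \[ \bar{y}_j = \frac{1}{N_j} \sum_{x_i \in g_j} W^T x_i = W^T \bar{x}_j \]
  - The average within-class variation will be
    \[ \tilde{S}_w = \frac{1}{N} \sum_{j} N_j \left\{ \frac{1}{N_j} \sum_{x_i \in g_j} \left( W^T x_i - W^T \bar{x}_j \right) \left( W^T x_i - W^T \bar{x}_j \right)^T \right\} = W^T \left\{ \frac{1}{N} \sum_{j} N_j \Sigma_j \right\} W = W^T S_w W \]
    where $\Sigma_j$ is the covariance matrix of class $g_j$.
[Figure: the transform W maps x to y, S_w to \tilde{S}_w, and S_b to \tilde{S}_b]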
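The identity $\tilde{S}_w = W^T S_w W$ can be checked numerically. The sketch below is only an illustration under stated assumptions: the two small classes and the single projection column `w` (the case m = 1) are made-up values, and the helper names are hypothetical. It computes the within-class variation once in the original space and once directly on the projected samples.

```python
# Sketch: the within-class variation of the projected samples y = w^T x
# should equal w^T S_w w. Data and w are made-up illustration values.

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def within_class_scatter(classes):
    """S_w = (1/N) * sum_j sum_{x in g_j} (x - mean_j)(x - mean_j)^T."""
    dim = len(classes[0][0])
    total = sum(len(g) for g in classes)
    s_w = [[0.0] * dim for _ in range(dim)]
    for g in classes:
        m = mean(g)
        for x in g:
            diff = [x[d] - m[d] for d in range(dim)]
            for a in range(dim):
                for b in range(dim):
                    s_w[a][b] += diff[a] * diff[b] / total
    return s_w

def project(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

classes = [
    [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],  # class g_1
    [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]],  # class g_2
]
w = [0.6, 0.8]  # a single column of W (m = 1)

# w^T S_w w, computed from the scatter matrix of the original samples
s_w = within_class_scatter(classes)
quad = sum(w[a] * s_w[a][b] * w[b] for a in range(2) for b in range(2))

# Within-class variation computed directly on the projected (1-D) samples
projected = [[[project(w, x)] for x in g] for g in classes]
s_w_tilde = within_class_scatter(projected)[0][0]

print(quad, s_w_tilde)
```

Both printed numbers agree up to rounding, which is the m = 1 case of the matrix identity.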
ML-35
LDA Derivations (3/4)
• If the transform $W = [\, w_1 \; w_2 \; \cdots \; w_m \,]$ is applied:
  - Similarly, the average between-class variation will be
    \[ \tilde{S}_b = W^T S_b W \]
  - Try to find the optimal $W$ such that the following objective function is maximized:
    \[ J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|} \]
• A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
    \[ S_b w_i = \lambda_i S_w w_i \]
• That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$:
    \[ S_w^{-1} S_b w_i = \lambda_i w_i \]
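A small two-class case makes the closed-form solution concrete. The sketch below is an assumption-laden illustration, not the slides' procedure: the sample values are made up, and it relies on the standard fact that for two classes $S_b$ has rank one, so the leading generalized eigenvector is proportional to $S_w^{-1}(\bar{x}_1 - \bar{x}_2)$. It then checks $S_b w = \lambda S_w w$ with $\lambda = J(w)$.

```python
# Two-class sketch of the closed-form LDA solution in 2-D.
# All sample values are made up for illustration.

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(2)]

def outer(u, v):
    return [[u[a] * v[b] for b in range(2)] for a in range(2)]

def add_scaled(m, s, scale):
    return [[m[a][b] + scale * s[a][b] for b in range(2)] for a in range(2)]

def matvec(m, v):
    return [m[a][0] * v[0] + m[a][1] * v[1] for a in range(2)]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

classes = [
    [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],
    [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]],
]
n_total = sum(len(g) for g in classes)
overall = mean([x for g in classes for x in g])

# Within-class (S_w) and between-class (S_b) scatter matrices
s_w = [[0.0, 0.0], [0.0, 0.0]]
s_b = [[0.0, 0.0], [0.0, 0.0]]
for g in classes:
    m_j = mean(g)
    for x in g:
        d = [x[0] - m_j[0], x[1] - m_j[1]]
        s_w = add_scaled(s_w, outer(d, d), 1.0 / n_total)
    d = [m_j[0] - overall[0], m_j[1] - overall[1]]
    s_b = add_scaled(s_b, outer(d, d), len(g) / n_total)

def criterion(w):
    """J(w) = (w^T S_b w) / (w^T S_w w) for a single column w."""
    return dot(w, matvec(s_b, w)) / dot(w, matvec(s_w, w))

# Leading direction for two classes: w proportional to S_w^{-1} (mean_1 - mean_2)
det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
s_w_inv = [[s_w[1][1] / det, -s_w[0][1] / det],
           [-s_w[1][0] / det, s_w[0][0] / det]]
m1, m2 = mean(classes[0]), mean(classes[1])
w = matvec(s_w_inv, [m1[0] - m2[0], m1[1] - m2[1]])
lam = criterion(w)

# Residual of the generalized eigenproblem S_b w = lambda S_w w
lhs, rhs = matvec(s_b, w), matvec(s_w, w)
residual = max(abs(lhs[a] - lam * rhs[a]) for a in range(2))
print(lam, residual)
```

The residual is zero up to rounding, and `criterion(w)` is at least as large as the criterion for any other direction tried, for example the coordinate axes.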
ML-36
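LDA Derivations (4/4)
• Proof:
    \[ \hat{W} = \arg\max_{W} J(W) = \arg\max_{W} \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} \qquad (|\cdot| \text{ denotes the determinant}) \]
  Or, for each column vector $w_i$ of $W$, we want to find
    \[ \hat{w}_i = \arg\max_{w_i} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}, \qquad \lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \]
  The quadratic form has the optimal solution where the derivative vanishes:
    \[ \frac{\partial}{\partial w_i} \left( \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \right) = 0 \]
    \[ \Rightarrow \frac{2 S_b w_i \left( w_i^T S_w w_i \right) - 2 S_w w_i \left( w_i^T S_b w_i \right)}{\left( w_i^T S_w w_i \right)^2} = 0 \]
    \[ \Rightarrow S_b w_i - \left( \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \right) S_w w_i = 0 \]
    \[ \Rightarrow S_b w_i - \lambda_i S_w w_i = 0 \;\Rightarrow\; S_b w_i = \lambda_i S_w w_i \;\Rightarrow\; S_w^{-1} S_b w_i = \lambda_i w_i \]
  The derivation uses
    \[ \left( \frac{F}{G} \right)' = \frac{F' G - F G'}{G^2}, \qquad \frac{d \left( x^T C x \right)}{d x} = \left( C + C^T \right) x \]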
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), calculated using Year-99's 5471 files]
    \[ \Sigma_x = \frac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T, \qquad \Sigma'_y = \frac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T \]
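The covariance formula above can be sketched in a few lines. This is a minimal illustration with made-up 2-dimensional samples rather than the 18-dimensional speech features of the slide; the function name `covariance` is hypothetical.

```python
# Sample covariance (1/N) * sum_i (x_i - mean)(x_i - mean)^T for a list of vectors.

def covariance(samples):
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(x[d] for x in samples) / n for d in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for x in samples:
        diff = [x[d] - mean[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += diff[a] * diff[b] / n
    return cov

samples = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]  # made-up data
cov = covariance(samples)
print(cov)
```

Off-diagonal entries that are large relative to the diagonal indicate correlated features, which is what the covariance plots on this slide visualize.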
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]

Character Error Rate
             TC       WG
    MFCC     26.32    22.71
    LDA-1    23.12    20.17
    LDA-2    23.11    20.11
ML-39
PCA vs LDA (1/2)
[Figure panels: PCA, LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - In theory, HDA yields a much lower classification error than LDA
• However, most statistical modeling approaches assume that the data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
ML-42
HW-3 Feature Transformation (2/4)
ML-43
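HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix

    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);

• Eigen Decomposition

    % BE: between-class matrix, WI: within-class matrix
    BE = [3.0 3.5 1.4;
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;
          2.9 8.7 3.5;
          4.4 3.2 4.3];

    % LDA
    IWI = inv(WI);
    A = IWI * BE;
    % PCA
    A = BE + WI;          % why? (Prove it)

    [V, D] = eig(A);      % all eigenvectors and eigenvalues
    [V, D] = eigs(A, 3);  % or only the 3 largest ones

    % Write the basis vectors to a file, one row per feature dimension
    fid = fopen('Basis', 'w');
    for i = 1:3           % feature vector length
        for j = 1:3       % basis number
            fprintf(fid, '%10.10f ', V(i, j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);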
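The PCA line of the snippet adds the between-class and within-class matrices. A numerical check (not a proof) of why that sum is the total covariance can be sketched as follows, assuming the usual definitions $S_w = \frac{1}{N}\sum_j \sum_{x_i \in g_j} (x_i - \bar{x}_j)(x_i - \bar{x}_j)^T$ and $S_b = \frac{1}{N}\sum_j N_j (\bar{x}_j - \bar{x})(\bar{x}_j - \bar{x})^T$. The data and helper names are made up for illustration.

```python
# Numerical check that total covariance = within-class + between-class scatter.

def mean(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def scatter(vectors, center, weight):
    """weight * sum_x (x - center)(x - center)^T as a nested list."""
    dim = len(center)
    out = [[0.0] * dim for _ in range(dim)]
    for x in vectors:
        diff = [x[d] - center[d] for d in range(dim)]
        for a in range(dim):
            for b in range(dim):
                out[a][b] += weight * diff[a] * diff[b]
    return out

def add(m1, m2):
    return [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

classes = [
    [[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]],               # made-up class 1
    [[6.0, 5.0], [7.0, 8.0], [8.0, 7.0], [9.0, 9.0]],   # made-up class 2
]
merged = [x for g in classes for x in g]
n_total = len(merged)
overall = mean(merged)

total = scatter(merged, overall, 1.0 / n_total)

within = [[0.0, 0.0], [0.0, 0.0]]
between = [[0.0, 0.0], [0.0, 0.0]]
for g in classes:
    within = add(within, scatter(g, mean(g), 1.0 / n_total))
    # One copy of the class mean, weighted by N_j / N
    between = add(between, scatter([mean(g)], overall, len(g) / n_total))

recombined = add(within, between)
max_gap = max(abs(total[a][b] - recombined[a][b]) for a in range(2) for b in range(2))
print(max_gap)
```

The printed gap is at rounding level, so the eigenvectors of `BE + WI` are the principal components of the merged data.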
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2000 original data samples after the LDA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
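LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: the n-dimensional vector $X$ in the feature space is projected onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$, giving the k-dimensional vector $Y$
    \[ y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \| X - \hat{X} \|^2 \text{ for a given } k \]
  - SVD (in LSA): the $m \times n$ matrix $A$ is factored as
    \[ A_{m \times n} = U'_{m \times r} \, \Sigma'_{r \times r} \, V'^T_{r \times n}, \qquad r \le \min(m, n) \]
    and retaining only the leading $k \times k$ block of $\Sigma'$ gives the $m \times n$ approximation $A'$ in the k-dimensional latent semantic space
    \[ \min \| A - A' \|_F^2 \text{ for a given } k \]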
ML-47
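LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: a vector x and its projections y_1 and y_2 onto the axes \varphi_1 and \varphi_2; \theta is the angle between x and \varphi_1]
    \[ y_1 = \| x \| \cos\theta = \| x \| \, \frac{\varphi_1^T x}{\| \varphi_1 \| \, \| x \|} = \varphi_1^T x, \qquad \text{where } \| \varphi_1 \| = 1 \]
Projection of a Vector x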
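As a quick numerical illustration of the projection of a vector onto a unit-length axis, the sketch below computes $y_1$ both as an inner product and as $\|x\| \cos\theta$. The vector `x` and the axis `phi1` are made-up values.

```python
import math

# Projection of x onto a unit-length axis phi1, computed two ways.
x = [4.0, 3.0]
phi1 = [0.6, 0.8]  # unit length: 0.36 + 0.64 = 1

# 1) inner product phi1^T x
y1_dot = sum(p * xi for p, xi in zip(phi1, x))

# 2) |x| * cos(theta), with theta the angle between x and phi1
theta = math.atan2(x[1], x[0]) - math.atan2(phi1[1], phi1[0])
y1_angle = math.hypot(x[0], x[1]) * math.cos(theta)

print(y1_dot, y1_angle)
```

Both routes give the same coordinate, which is why the transformed features can be computed with plain inner products once the basis is orthonormal.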
ML-48
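LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each represented by its terms and by a structure model. The terms are linked by literal term matching, with doc expansion on the doc side and query expansion on the query side; the two structure models are linked by latent semantic structure retrieval]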
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
    \[ A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n} \]
  (rows of $A$: words $w_1, w_2, \ldots, w_m$; columns of $A$: docs $d_1, d_2, \ldots, d_n$)
  - Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  - $r \le \min(m, n)$
  - $\text{Row } A \in R^n$, $\text{Col } A \in R^m$
• Rank-k approximation
    \[ A'_{m \times n} = U'_{m \times k} \, \Sigma_k \, V'^T_{k \times n} \]
  - $k \le r$, and $\| A \|_F^2 \ge \| A' \|_F^2$, where
    \[ \| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \]
• Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
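The inequality $\| A \|_F^2 \ge \| A' \|_F^2$ can be illustrated for the simplest case k = 1. The sketch below is not how an SVD package computes the decomposition: it uses plain power iteration on $A^T A$ as a stand-in to obtain the leading singular triplet, and the 4 x 3 matrix is just a small example. All function names are hypothetical.

```python
import math

def matvec(a, v):
    return [sum(row[j] * v[j] for j in range(len(v))) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def frobenius_sq(a):
    return sum(x * x for row in a for x in row)

def leading_triplet(a, iterations=200):
    """Power iteration on A^T A; returns (u_1, sigma_1, v_1)."""
    a_t = transpose(a)
    v = [1.0] * len(a[0])
    for _ in range(iterations):
        v = matvec(a_t, matvec(a, v))
        length = norm(v)
        v = [x / length for x in v]
    av = matvec(a, v)
    sigma = norm(av)          # sigma_1 = |A v_1|
    return [x / sigma for x in av], sigma, v

a = [[2.3, 0.0, 4.2],
     [0.0, 1.3, 2.2],
     [3.8, 0.0, 0.5],
     [0.0, 0.0, 0.0]]

u1, sigma1, v1 = leading_triplet(a)
# Rank-1 approximation A' = sigma_1 * u_1 * v_1^T
a_prime = [[sigma1 * u1[i] * v1[j] for j in range(3)] for i in range(4)]
residual = [[a[i][j] - a_prime[i][j] for j in range(3)] for i in range(4)]

print(frobenius_sq(a), frobenius_sq(a_prime), frobenius_sq(residual))
```

The squared norm of the approximation equals $\sigma_1^2$, and the squared norm of the residual equals what is left of $\| A \|_F^2$ after removing $\sigma_1^2$.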
ML-52
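LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
      \[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \]
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$)
      \[ V_{n \times n} = [\, v_1 \; v_2 \; \cdots \; v_n \,], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n} \]
• Define singular values
  - As the square roots of the eigenvalues of $A^T A$
      \[ \sigma_j = \sqrt{\lambda_j}, \qquad j = 1, \ldots, n \]
  - As the lengths of the vectors $A v_1, A v_2, \ldots, A v_n$
      \[ \sigma_1 = \| A v_1 \|, \quad \sigma_2 = \| A v_2 \|, \quad \ldots \]
    since
      \[ \| A v_i \|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \| A v_i \| = \sigma_i \]
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{ A v_1, A v_2, \ldots, A v_r \}$ is an orthogonal basis of Col A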
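The two definitions of the singular values can be compared on a tiny example. The sketch below is an illustration under stated assumptions: the 2 x 2 matrix is made up, and the closed-form eigen-decomposition used here only applies to symmetric 2 x 2 matrices with a nonzero off-diagonal entry.

```python
import math

def eig_sym_2x2(a, b, c):
    """Eigenpairs of [[a, b], [b, c]] with b != 0, largest eigenvalue first."""
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        vx, vy = b, lam - a            # solves (a - lam) * vx + b * vy = 0
        length = math.hypot(vx, vy)
        pairs.append((lam, (vx / length, vy / length)))
    return pairs

A = [[3.0, 0.0],
     [4.0, 5.0]]  # made-up matrix

# Entries of the symmetric matrix A^T A
ata_00 = A[0][0] ** 2 + A[1][0] ** 2
ata_01 = A[0][0] * A[0][1] + A[1][0] * A[1][1]
ata_11 = A[0][1] ** 2 + A[1][1] ** 2

images = []
checks = []
for lam, v in eig_sym_2x2(ata_00, ata_01, ata_11):
    av = (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])
    images.append(av)
    # sigma as sqrt(eigenvalue) versus sigma as the length of A v
    checks.append((math.sqrt(lam), math.hypot(av[0], av[1])))

# A v_1 and A v_2 should be orthogonal
dot_images = images[0][0] * images[1][0] + images[0][1] * images[1][1]
print(checks, dot_images)
```

For this matrix the eigenvalues of $A^T A$ are 45 and 5, so both definitions give singular values $\sqrt{45}$ and $\sqrt{5}$.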
ML-53
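LSA Derivations (2/7)
• $\{ A v_1, A v_2, \ldots, A v_r \}$ is an orthogonal basis of Col A, since for $i \ne j$
    \[ A v_i \cdot A v_j = (A v_i)^T A v_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \]
  - Suppose that $A$ (or $A^T A$) has rank $r \le n$
    \[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0 \]
  - Define an orthonormal basis $\{ u_1, u_2, \ldots, u_r \}$ for Col A
    \[ u_i = \frac{A v_i}{\| A v_i \|} = \frac{1}{\sigma_i} A v_i \;\Rightarrow\; \sigma_i u_i = A v_i \;\Rightarrow\; [\, u_1 \; u_2 \; \cdots \; u_r \,] \, \Sigma_r = A \, [\, v_1 \; v_2 \; \cdots \; v_r \,] \]
    Here $V = [\, v_1 \cdots v_r \,]$ is an orthonormal matrix ($n \times r$) that is known in advance, and $U = [\, u_1 \cdots u_r \,]$ is also an orthonormal matrix ($m \times r$).
    • Extend to an orthonormal basis $\{ u_1, u_2, \ldots, u_m \}$ of $R^m$
    \[ [\, u_1 \; \cdots \; u_r \; \cdots \; u_m \,] \, \Sigma = A \, [\, v_1 \; \cdots \; v_r \; \cdots \; v_n \,] \;\Rightarrow\; U \Sigma = A V \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T \]
    where $V V^T = I_{n \times n}$ and
    \[ \Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix} \]
  - Frobenius norm
    \[ \| A \|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2 \]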
ML-54
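LSA Derivations (3/7)
    \[ A_{m \times n} = U \Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T \]
    \[ A V = U \Sigma \]
• The $v_i$ (columns of $V_1$) span the row space of $A$, in $R^n$
• The $u_i$ (columns of $U_1$) span the row space of $A^T$, in $R^m$
• The columns of $V_2$ satisfy $A X = 0$
[Figure: A maps R^n to R^m, with V_1 and V_2 on the R^n side and U_1 and U_2 on the R^m side]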
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
    \[ A = U \Sigma V^T \;\Rightarrow\; A V = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = A V \]
    • The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  - Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
    \[ A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U \]
    • The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix ($A$, $m \times n$, with rows $w_1, \ldots, w_m$ and columns $d_1, \ldots, d_n$)
    • compare two terms ($w_i$, $w_j$) → dot product of two rows of $A$, or an entry in $A A^T$
    • compare two docs ($d_k$, $d_s$) → dot product of two columns of $A$, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of $A$
  - The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U' \Sigma'$
      \[ A' A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U' \Sigma')(U' \Sigma')^T \]
    • compare two docs → dot product of two rows of $V' \Sigma'$
      \[ A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T \]
    • compare a query word and a doc → each individual entry of $A'$
  - In both products the inner factor ($V'^T V'$ or $U'^T U'$) is the identity, and $\Sigma'$ serves for stretching or shrinking
ML-57
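LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new $m \times 1$ query (or doc) vector
      \[ \hat{q}_{1 \times k} = \left( q^T \right)_{1 \times m} U_{m \times k} \, \Sigma^{-1}_{k \times k} \]
      - The query is represented by the weighted sum of its constituent term vectors
      - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
      - $\hat{q}$ is just like a row of $V$
  - Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
      \[ \mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{\| \hat{q} \Sigma \| \, \| \hat{d} \Sigma \|} \]
row vectors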
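The fold-in and cosine formulas can be traced on a toy decomposition. The sketch below is an illustration only: `u`, `sigma`, and `v` are hand-picked so that $A = U \Sigma V^T$ is trivially valid (3 terms, 2 docs, k = 2), and the function names are hypothetical.

```python
import math

# Toy SVD factors: A = U * diag(sigma) * V^T = [[2, 0], [0, 1], [0, 0]]
u = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]          # U, m x k (orthonormal columns)
sigma = [2.0, 1.0]        # diagonal of Sigma, k x k
v = [[1.0, 0.0],
     [0.0, 1.0]]          # V, n x k; each row is a doc vector

def fold_in_query(q):
    """q_hat = q^T U Sigma^{-1} for an m x 1 query vector q."""
    k = len(sigma)
    return [sum(q[i] * u[i][c] for i in range(len(q))) / sigma[c] for c in range(k)]

def sim(q_hat, d_hat):
    """cosine(q_hat * Sigma, d_hat * Sigma)."""
    qs = [q_hat[c] * sigma[c] for c in range(len(sigma))]
    ds = [d_hat[c] * sigma[c] for c in range(len(sigma))]
    dot = sum(a * b for a, b in zip(qs, ds))
    return dot / (math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds)))

q = [1.0, 1.0, 0.0]       # query containing term 0 and term 1
q_hat = fold_in_query(q)
scores = [sim(q_hat, d_hat) for d_hat in v]
print(q_hat, scores)
```

Here the query folds in as `[0.5, 1.0]`, and after re-weighting by $\Sigma$ it is equally similar to both docs, since each doc contains exactly one of the two query terms.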
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
    \[ \hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k} \]
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (rows: terms; columns: docs)

        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0

  - Each entry is weighted by its TFxIDF score
  - The corresponding sparse text representation, listed column by column:

        4 3 6      (4 terms, 3 docs, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
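Producing the sparse text layout from a dense matrix is mechanical. The following is a minimal sketch that assumes the column-by-column layout shown on this slide (header line, then per column a count followed by "row value" pairs); the function name `to_sparse_text` is hypothetical and is not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense term-doc matrix (rows: terms, columns: docs)
    into the column-oriented sparse text layout."""
    rows, cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(cols):
        # Keep only the nonzero entries of this column as (row index, value)
        entries = [(r, matrix[r][c]) for r in range(rows) if matrix[r][c] != 0.0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{rows} {cols} {total}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {value}" for r, value in entries)
    return "\n".join(lines)

term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
text = to_sparse_text(term_doc)
print(text)
```

The all-zero fourth term contributes no lines at all, which is where the saving comes from on a real, very sparse term-doc matrix.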
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
• SVD command (IR_svd.bat)

    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
    51253 2265 218852
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……
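  - First line of the matrix file: number of indexing terms, number of docs, number of nonzero entries
  - Command options:
    • -r st: sparse matrix input
    • -o LSA100: prefix of the output files
    • -d 100: number of reserved eigenvectors
    • Term-Doc-Matrix: name of the sparse matrix input file
  - Output files: LSA100-Ut, LSA100-S, LSA100-Vt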
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector u^T is 1x100)

    100 51253
    0.003 0.001 ……
    0.002 0.002 ……

• LSA100-S (100 eigenvalues)

    100
    26861882994155959…

• LSA100-Vt (2265 docs; each doc vector v^T is 1x100)

    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
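Reading such an output file back is straightforward. The sketch below assumes the dense text layout suggested by this slide (a header with the number of rows and columns, followed by the values row by row) and works on a tiny made-up string rather than a real LSA100-Ut file; `parse_dense_text` is a hypothetical helper name.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows * cols whitespace-separated values."""
    tokens = text.split()
    rows, cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:2 + rows * cols]]
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

# Made-up stand-in for a "-Ut" file with k = 2 dimensions and 3 words
sample = """2 3
0.003 0.001 0.004
0.002 0.002 0.005
"""
ut = parse_dense_text(sample)

# The file holds U transposed (k x words), so each word vector is a column
word_vectors = [list(col) for col in zip(*ut)]
print(word_vectors)
```

With the real files the same transposition would yield one 100-dimensional vector per word (from -Ut) or per doc (from -Vt).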
ML-65
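LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector (TFxIDF weighted beforehand)
    \[ \hat{q}_{1 \times k} = \left( q^T \right)_{1 \times m} U_{m \times k} \, \Sigma^{-1}_{k \times k} \]
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  - $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
    \[ \mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{\| \hat{q} \Sigma \| \, \| \hat{d} \Sigma \|} \]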
ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK 
DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT 
ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU 
ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP 
ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA 
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-35
LDA Derivations (3/4)
• If the transform $W = [w_1\ w_2\ \cdots\ w_m]$ is applied
  - Similarly, the average between-class variation will be $\tilde{S}_b = W^T S_b W$
  - Try to find the optimal $W$ such that the following objective function is maximized:
      $J(W) = \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \frac{|W^T S_b W|}{|W^T S_w W|}$
• A closed-form solution: the column vectors of an optimal matrix $W$ are the generalized eigenvectors corresponding to the largest eigenvalues in
      $S_b w_i = \lambda_i S_w w_i$
• That is, the $w_i$'s are the eigenvectors corresponding to the largest eigenvalues of $S_w^{-1} S_b$:
      $S_w^{-1} S_b w_i = \lambda_i w_i$
ML-36
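LDA Derivations (4/4)
• Proof:
      $\hat{W} = \arg\max_W J(W) = \arg\max_W \frac{|\tilde{S}_b|}{|\tilde{S}_w|} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|}$   ($|\cdot|$: determinant)
  Or, for each column vector $w_i$ of $W$, we want to find that
      $\hat{w}_i = \arg\max_{w_i} \frac{w_i^T S_b w_i}{w_i^T S_w w_i}$
  The quadratic form has the optimal solution
      $\frac{\partial}{\partial w_i}\left(\frac{w_i^T S_b w_i}{w_i^T S_w w_i}\right) = 0$
      $\Rightarrow \frac{2 S_b w_i\,(w_i^T S_w w_i) - 2 S_w w_i\,(w_i^T S_b w_i)}{(w_i^T S_w w_i)^2} = 0$
      $\Rightarrow \frac{S_b w_i}{w_i^T S_w w_i} - \frac{w_i^T S_b w_i}{w_i^T S_w w_i} \cdot \frac{S_w w_i}{w_i^T S_w w_i} = 0$
      $\Rightarrow S_b w_i - \lambda_i S_w w_i = 0 \quad \left(\lambda_i = \frac{w_i^T S_b w_i}{w_i^T S_w w_i}\right)$
      $\Rightarrow S_b w_i = \lambda_i S_w w_i \Rightarrow S_w^{-1} S_b w_i = \lambda_i w_i$
  Rules used: $\left(\frac{F}{G}\right)' = \frac{F'G - FG'}{G^2}$ and $\frac{d(x^T C x)}{dx} = (C + C^T)\,x$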
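The eigenvector condition derived on this slide can be checked numerically on a toy problem. The following is a minimal sketch, assuming 2-D features, two classes, and the usual scatter definitions ($S_w$ as the sum of per-class scatters, $S_b = \sum_j N_j (\mu_j - \mu)(\mu_j - \mu)^T$); all function names and data points are illustrative and not taken from the slides. For two classes $S_b$ has rank one, so the leading eigenvector of $S_w^{-1} S_b$ is proportional to $S_w^{-1}(\mu_1 - \mu_2)$.

```python
# Two-class LDA sketch in pure Python (2-D features only).
# Function names and toy data are illustrative assumptions.

def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]

def scatter(points, center):
    # Sum of outer products (x - center)(x - center)^T as a 2x2 matrix.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - center[0], p[1] - center[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def mat_add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def mat_vec(a, v):
    return [a[0][0] * v[0] + a[0][1] * v[1], a[1][0] * v[0] + a[1][1] * v[1]]

def inv2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def lda_two_class(class1, class2):
    m1, m2 = mean(class1), mean(class2)
    m = mean(class1 + class2)
    # Within-class scatter: sum of the two per-class scatter matrices.
    s_w = mat_add(scatter(class1, m1), scatter(class2, m2))
    # Between-class scatter: sum_j N_j (m_j - m)(m_j - m)^T.
    s_b = [[0.0, 0.0], [0.0, 0.0]]
    for pts, mj in ((class1, m1), (class2, m2)):
        d = [mj[0] - m[0], mj[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s_b[i][j] += len(pts) * d[i] * d[j]
    # Rank-one S_b: the leading eigenvector is proportional to S_w^{-1}(m1 - m2).
    w = mat_vec(inv2(s_w), [m1[0] - m2[0], m1[1] - m2[1]])
    # lambda is the ratio w^T S_b w / w^T S_w w, as in the proof.
    lam = dot(w, mat_vec(s_b, w)) / dot(w, mat_vec(s_w, w))
    return w, lam, s_w, s_b

c1 = [(1, 2), (2, 3), (3, 3), (2, 1)]
c2 = [(6, 5), (7, 8), (8, 7), (7, 6)]
w, lam, s_w, s_b = lda_two_class(c1, c2)
lhs = mat_vec(s_b, w)                        # S_b w
rhs = [lam * x for x in mat_vec(s_w, w)]     # lambda S_w w
```

Comparing `lhs` and `rhs` element by element confirms that the returned `w` satisfies $S_b w = \lambda S_w w$ up to rounding.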
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors, $\Sigma_x = \frac{1}{N}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$, calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-cepstral vectors (after cosine transform), $\Sigma'_y = \frac{1}{N}\sum_i (y_i - \bar{y})(y_i - \bar{y})^T$, calculated using Year-99's 5471 files]
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform), calculated using Year-99's 5471 files]
[Figure: covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), calculated using Year-99's 5471 files]
Character Error Rate
            TC      WG
    MFCC    26.32   22.71
    LDA-1   23.12   20.17
    LDA-2   23.11   20.11
ML-39
PCA vs LDA (1/2)
[Figure: side-by-side panels labeled PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  - Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3: Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3: Feature Transformation (2/4)
ML-43
HW-3: Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar=[
    3.0 0.5 0.4;
    0.9 6.3 0.2;
    0.4 0.4 4.2;
    ];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE=[
    3.0 3.5 1.4;
    1.9 6.3 2.2;
    2.4 0.4 4.2;
    ];
    WI=[
    4.0 4.1 2.1;
    2.9 8.7 3.5;
    4.4 3.2 4.3;
    ];
    %LDA
    IWI=inv(WI);
    A=IWI*BE;
    %PCA
    A=BE+WI;            % why? (Prove it)
    [V,D]=eig(A);       % or: [V,D]=eigs(A,3);
    fid=fopen('Basis','w');
    for i=1:3           % feature vector length
      for j=1:3         % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
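One way to approach the "why" attached to the PCA line `A=BE+WI` is sketched below. It assumes that `WI` and `BE` stand for the within-class and between-class scatter matrices $S_w$ and $S_b$ under the usual definitions (class $C_j$ with $N_j$ samples and mean $\mu_j$, global mean $\mu$); the deck's own definitions may differ by a constant normalization, which does not change the argument. Under these assumptions the sum is the total scatter of the merged data, which is the matrix PCA diagonalizes.

```latex
\begin{aligned}
S_t &= \sum_{j}\sum_{x \in C_j} (x-\mu)(x-\mu)^T \\
    &= \sum_{j}\sum_{x \in C_j} \big[(x-\mu_j)+(\mu_j-\mu)\big]\big[(x-\mu_j)+(\mu_j-\mu)\big]^T \\
    &= \underbrace{\sum_{j}\sum_{x \in C_j} (x-\mu_j)(x-\mu_j)^T}_{S_w}
     + \underbrace{\sum_{j} N_j\,(\mu_j-\mu)(\mu_j-\mu)^T}_{S_b}
\end{aligned}
```

The cross terms vanish because $\sum_{x \in C_j} (x - \mu_j) = 0$ for every class.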
ML-44
HW-3: Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; axes: feature 1, feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transformation; axes: feature 1, feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI), Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
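LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA (feature space): project an $n$-dimensional $X$ onto an orthonormal basis $\phi_1, \ldots, \phi_k$
      $y_i = \phi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \phi_i, \qquad \min \|X - \hat{X}\|^2$ for a given $k$
  - SVD (in LSA) (latent semantic space): approximate $A_{m \times n} = U_{m \times r}\,\Sigma_{r \times r}\,V^T_{r \times n}$, $r \le \min(m, n)$, by $A'_{m \times n} = U'_{m \times k}\,\Sigma'_{k \times k}\,V'^T_{k \times n}$
      $\min \|A - A'\|_F^2$ for a given $k$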
ML-47
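LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a Vector $x$
      $y_1 = \varphi_1^T x = \|x\| \cos\theta_1$, where $\|\varphi_1\| = 1$
[Figure: a vector $x$ projected onto the basis vectors $\varphi_1$ and $\varphi_2$, giving the coordinates $y_1$ and $y_2$]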
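The projection on this slide, and the least-squares reading of keeping only the first $k$ coordinates, can be illustrated with a small sketch. It assumes an orthonormal 2-D basis chosen by hand; the vector, the basis, and the helper names are illustrative only.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def project(x, basis):
    # Coordinates y_i = phi_i^T x for each (assumed orthonormal) basis vector.
    return [dot(phi, x) for phi in basis]

def reconstruct(y, basis):
    # x_hat = sum_i y_i * phi_i over the basis vectors that were kept.
    n = len(basis[0])
    return [sum(y_i * phi[d] for y_i, phi in zip(y, basis)) for d in range(n)]

x = [3.0, 4.0]
phi1 = [1 / math.sqrt(2), 1 / math.sqrt(2)]
phi2 = [-1 / math.sqrt(2), 1 / math.sqrt(2)]

y = project(x, [phi1, phi2])
cos_theta1 = dot(x, phi1) / (norm(x) * norm(phi1))

# Keep only the first coordinate (k = 1) and measure the squared error.
x_hat_1 = reconstruct(y[:1], [phi1])
err_1 = sum((a - b) ** 2 for a, b in zip(x, x_hat_1))
```

Here `y[0]` equals `norm(x) * cos_theta1`, and the squared reconstruction error `err_1` equals the square of the discarded coordinate `y[1]`, which is why dropping the directions with the smallest coordinates minimizes the error.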
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each represented by their terms and by a structure model; the connections shown are literal term matching, doc expansion, query expansion, and latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
[Figure annotation: an OOV word]
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
      $A_{m \times n} = U_{m \times r}\,\Sigma_{r \times r}\,V^T_{r \times n}$
      $A'_{m \times n} = U'_{m \times k}\,\Sigma_{k}\,V'^T_{k \times n}$   ($\Sigma_k$ is $k \times k$)
  (rows of $A$ and $A'$: words $w_1, w_2, \ldots, w_m$; columns: docs $d_1, d_2, \ldots, d_n$)
  - Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  - $r \le \min(m, n)$ and $k \le r$
  - $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$
  - Row $A \subseteq R^n$, Col $A \subseteq R^m$
  - Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
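The inequality between the two Frobenius norms follows from writing each norm in terms of singular values. A short worked form, assuming $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ denote the diagonal entries of $\Sigma_r$ and that $A'$ keeps the $k$ largest of them:

```latex
\|A\|_F^2 = \sum_{i=1}^{r} \sigma_i^2, \qquad
\|A'\|_F^2 = \sum_{i=1}^{k} \sigma_i^2, \qquad
\|A - A'\|_F^2 = \sum_{i=k+1}^{r} \sigma_i^2
```

The discarded singular values therefore account exactly for the approximation error, and the error shrinks to zero as $k$ approaches $r$.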
ML-52
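LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$, with $\Sigma^2 = diag(\lambda_1, \lambda_2, \ldots, \lambda_n)$
  - As the square roots of the eigenvalues of $A^T A$
  - As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
      $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$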
ML-53
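LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
      $Av_i \bullet Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  - Suppose that $A$ (or $A^T A$) has rank $r \le n$
      $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  - Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
      $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
      $\Rightarrow [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$
      ($V$: an orthonormal matrix ($n \times r$), known in advance; $U$: also an orthonormal matrix ($m \times r$))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n]$
      $\Rightarrow U\Sigma = AV \Rightarrow U\Sigma V^T = AVV^T \Rightarrow A = U\Sigma V^T$   (since $VV^T = I_{n \times n}$)
      $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
• $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$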
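The construction on these two slides (eigen-decompose $A^T A$, take $\sigma_i = \sqrt{\lambda_i}$, set $u_i = Av_i / \sigma_i$) can be traced on a tiny matrix. This is a minimal sketch, assuming an $m \times 2$ matrix of rank 2 so that the $2 \times 2$ eigenproblem has a closed form; the function names and the example matrix are illustrative, and a real application would call a numerical SVD routine instead.

```python
import math

def eig_sym_2x2(s):
    # Eigen-decomposition of a symmetric 2x2 matrix [[p, q], [q, r]]:
    # eigenvalues in descending order, each with a unit eigenvector.
    p, q, r = s[0][0], s[0][1], s[1][1]
    half_tr = (p + r) / 2.0
    disc = math.sqrt(((p - r) / 2.0) ** 2 + q * q)
    lams = [half_tr + disc, half_tr - disc]
    if abs(q) < 1e-15:
        vecs = [[1.0, 0.0], [0.0, 1.0]] if p >= r else [[0.0, 1.0], [1.0, 0.0]]
    else:
        vecs = []
        for lam in lams:
            v = [q, lam - p]              # solves (S - lam I) v = 0 when q != 0
            n = math.hypot(v[0], v[1])
            vecs.append([v[0] / n, v[1] / n])
    return lams, vecs

def svd_from_ata(a):
    m = len(a)
    # A^T A is 2x2 and symmetric.
    ata = [[sum(a[i][r] * a[i][c] for i in range(m)) for c in range(2)]
           for r in range(2)]
    lams, vs = eig_sym_2x2(ata)
    sigmas = [math.sqrt(max(lam, 0.0)) for lam in lams]
    us = []
    for sigma, v in zip(sigmas, vs):
        av = [a[i][0] * v[0] + a[i][1] * v[1] for i in range(m)]
        us.append([x / sigma for x in av])   # u_i = A v_i / sigma_i (needs sigma_i > 0)
    return us, sigmas, vs

A = [[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
us, sigmas, vs = svd_from_ata(A)

# Rebuild A as sum_i sigma_i * u_i * v_i^T and compare Frobenius norms.
A_rebuilt = [[sum(s * u[i] * v[j] for s, u, v in zip(sigmas, us, vs))
              for j in range(2)] for i in range(3)]
frob_sq = sum(x * x for row in A for x in row)
```

For this matrix $A^T A$ has eigenvalues 6 and 4, so `frob_sq` (10) matches $\sigma_1^2 + \sigma_2^2$, and `A_rebuilt` reproduces `A` up to rounding.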
ML-54
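LSA Derivations (3/7)
      $A = U\Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$
      $AV = U\Sigma$
[Figure: $A_{m \times n}$ maps $R^n$ to $R^m$; the $v_i$ (columns of $V_1$) span the row space of $A$, $V_2$ spans the solutions of $AX = 0$, and the $u_i$ (columns of $U_1$) span the row space of $A^T$, with $U_2$ completing the basis of $R^m$]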
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by columns of $V$
      $A = U\Sigma V^T \Rightarrow AV = U\Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  - Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by $U$
      $A = U\Sigma V^T \Rightarrow A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
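Read entry by entry, the two identities on this slide say the following. The notation here is introduced only for illustration: $a_j$ is row $j$ of $A$, $b_j$ is row $j$ of $A^T$ (that is, column $j$ of $A$ written as a row), $v_i$ and $u_i$ are the i-th columns of $V$ and $U$, and $\sigma_i > 0$ is assumed.

```latex
(U\Sigma)_{ji} = \sigma_i\,u_{ji} = a_j\,v_i
\;\Longrightarrow\;
u_{ji} = \frac{a_j\,v_i}{\sigma_i},
\qquad
(V\Sigma)_{ji} = \sigma_i\,v_{ji} = b_j\,u_i
\;\Longrightarrow\;
v_{ji} = \frac{b_j\,u_i}{\sigma_i}
```

Each entry of $U$ (or $V$) is thus a projection coefficient rescaled by the corresponding singular value, which is the sense of "related to the projection" in the bullets above.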
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix ($A$, $m \times n$; rows: words $w_1, \ldots, w_m$; columns: docs $d_1, \ldots, d_n$)
    • compare two terms → dot product of two rows of $A$
      - or an entry in $AA^T$
    • compare two docs → dot product of two columns of $A$
      - or an entry in $A^T A$
    • compare a term and a doc → each individual entry of $A$
  - The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U'\Sigma'$
        $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$
    • compare two docs → dot product of two rows of $V'\Sigma'$
        $A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
    • compare a query word and a doc → each individual entry of $A'$
  - In both products the inner factor ($V'^T V'$ or $U'^T U'$) is the identity, and $\Sigma'$ acts for stretching or shrinking
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
        $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\,U_{m \times k}\,\Sigma^{-1}_{k \times k}$
      - Query represented by the weighted sum of its constituent term vectors
      - The separate dimensions are differentially weighted
      - $\hat{q}$ is just like a row of $V$
  - Cosine measure between the query and doc vectors in the latent semantic space
        $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$   ($\hat{q}$, $\hat{d}$: row vectors)
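The fold-in formula and the cosine measure can be exercised on hand-built factors. The sketch below assumes a toy rank-2 decomposition with 3 terms and 2 docs whose $U$, $\Sigma$ and $V$ are chosen by hand to have orthonormal columns; the helper names and all numbers are illustrative and do not come from the slides.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def fold_in(q, u, sigma):
    # q_hat (1 x k) = q^T (1 x m) * U (m x k) * Sigma^{-1} (k x k)
    k = len(sigma)
    return [sum(q[i] * u[i][j] for i in range(len(q))) / sigma[j]
            for j in range(k)]

def sim(q_hat, d_hat, sigma):
    # cosine(q_hat * Sigma, d_hat * Sigma) for row vectors q_hat and d_hat
    qs = [x * s for x, s in zip(q_hat, sigma)]
    ds = [x * s for x, s in zip(d_hat, sigma)]
    num = sum(a * b for a, b in zip(qs, ds))
    den = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return num / den

# Hand-built factors (toy values): 3 terms, 2 docs, k = 2.
U = [[1.0, 0.0], [0.0, 0.6], [0.0, 0.8]]      # orthonormal columns
sigma = [2.0, 1.0]                            # diagonal of Sigma
V = [[0.6, 0.8], [-0.8, 0.6]]                 # row j: latent vector of doc j
S = [[sigma[0], 0.0], [0.0, sigma[1]]]
A = matmul(matmul(U, S), transpose(V))        # 3 x 2 term-doc matrix

doc0 = [row[0] for row in A]                  # column 0 of A as an m x 1 vector
d_hat = fold_in(doc0, U, sigma)               # should reproduce row 0 of V

query = [1.0, 0.0, 0.0]                       # a query containing only term 0
q_hat = fold_in(query, U, sigma)
score = sim(q_hat, V[0], sigma)
```

Folding in a document that was part of the original matrix returns its own row of `V`, which is the sense in which a folded-in vector is "just like a row of V".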
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
      $\hat{t}_{1 \times k} = t_{1 \times n}\,V_{n \times k}\,\Sigma^{-1}_{k \times k}$
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handle synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Take term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (rows: terms; columns: docs)
        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0
  - Each entry is weighted by TFxIDF score
  - Sparse text representation of the same matrix
        4 3 6      (4 terms, 3 docs, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
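The column-by-column listing on this slide can be produced from a dense matrix with a few lines of code. This is a minimal sketch, assuming the slide's entries read 2.3, 0.0, 4.2 and so on (the decimal points are inferred) and that the layout is a header line followed, for each column, by a count and then "row value" pairs; the function name is hypothetical and not part of SVDLIBC.

```python
def to_sparse_text_lines(matrix):
    """Return the lines of a column-major sparse text listing of a dense matrix."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Collect (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]           # header: rows, cols, nonzeros
    for entries in columns:
        lines.append(str(len(entries)))              # nonzero count for this column
        lines.extend(f"{r} {value}" for r, value in entries)
    return lines

term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
lines = to_sparse_text_lines(term_doc)
```

Joining `lines` with newlines and writing them to a file gives an input in the same shape as the listing above; the all-zero fourth term contributes no lines but still counts toward the row total in the header.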
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (sparse text format)
        51253 2265 218852    (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ......
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
        100 51253
        0.003 0.001 ......
        0.002 0.002 ......
  - word vector ($u^T$): 1x100; 51253 words
• LSA100-S
        100
        26861882994155959...
  - 100 eigenvalues
• LSA100-Vt
        100 2265
        0.021 0.035 ......
        0.012 0.022 ......
  - doc vector ($v^T$): 1x100; 2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
      $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\,U_{m \times k}\,\Sigma^{-1}_{k \times k}$
  - Query represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted
  - $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
      $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$
ML-36
LDA Derivations (44)
bull Proof ( )
( ) ( )( )
( )( )
( )( )
iiib
iwiibiwiib
iwT
ibT
iiiw
Tiw
iwT
ib
iwT
ibT
iw
iwT
iwT
ib
iwT
ibT
iwiwT
ib
i
i
iwT
ibT
i
i
wT
bT
w
b
w
i
i
ii
i
i
i
i
i
ii
i
i
J
wwSSwSwSwSwS
wSwwSw
wSwwS
wSwwS
wSw
wSwwS
wSw
wSwwS
wSw
wSwwSwSwwSw
wSwwSw
Ww
WSW
WSW
S
SWW
WWW
λ
λλ
λλ
λ
λ
=rArr
=rArr=minusrArr
⎟⎟⎠
⎞⎜⎜⎝
⎛==minus
=minusrArr
=minus
=partpart
rArr
=
===
minus1
22
2
ˆˆˆ
0
0
0
022
solution optimal has form qradtic The
thatfind want towe of tor column veceach for Or
maxarg~
~maxargmaxargˆ
Q
Q
2GFGGF
GF primeminusprime
=prime⎟⎠⎞
⎜⎝⎛
( )xCCxCxx T
T
+=d
d )(
determinant
ML-37
LDA Examples Feature Transformation (12)
bull Example1 Experiments on Speech Signal Processing
Covariance Matrix of the 18-Mel-filter-bank vectors
Calculated using Year-99rsquos 5471 files
Covariance Matrix of the 18-cepstral vectors
Calculated using Year-99rsquos 5471 files
After Cosine Transform
( )( )sum minusminus=i
TiiN x
xxxxΣ 1 ( )( )sum minusminus=primei
TiiN y
yyyyΣ 1
ML-38
LDA Examples Feature Transformation (22)
Covariance Matrix of the 18-PCA-cepstral vectors Covariance Matrix of the 18-LDA-cepstral vectors
Calculated using Year-99rsquos 5471 filesCalculated using Year-99rsquos 5471 files
20112311LDA-2
20172312LDA-1
22712632MFCC
WGTC
Character Error Rate
bull Example1 Experiments on Speech Signal Processing (cont)
After PCA Transform After LDA Transform
ML-39
PCA vs LDA (12)
PCA LDA
ML-40
Heteroscedastic Discriminant Analysis (HDA)
bull HDA Heteroscedastic Discriminant Analysis bull The difference in the projections obtained from LDA and
HDA for 2-class case
ndash Clearly the HDA provides a much lower classification error thanLDA theoretically
bull However most statistical modeling assume data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (14)
bull Given two data sets (MaleData FemaleData) in which each row is a sample with 39 features please perform the following operations1 Merge these two data sets and findplot the covariance matrix for
the merged data set2 Apply PCA and LDA transformations to the merged data set
respectively Also findplot the covariance matrices for transformations respectively Describe the phenomena that you have observed
3 Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set Selectively plot portions of samples from MaleData and FemaleData respectively Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (24)
ML-43
HW-3 Feature Transformation (34)
bull Plot Covariance Matrix
bull Eigen Decomposition
CoVar=[30 05 0409 63 0204 04 42
]colormap(default)surf(CoVar)
BE=[30 35 1419 63 2224 04 42
]
WI=[40 41 2129 87 3544 32 43
]
LDAIWI=inv(WI)A=IWIBEPCAA=BE+WI why ( Prove it )
[VD]=eig(A)[VD]=eigs(A3)
fid=fopen(Basisw)for i=13 feature vector length
for j=13 basis numberfprintf(fid1010f V(ij))
endfprintf(fidn)
end fclose(fid)
ML-44
HW-3 Feature Transformation (44)
bull Examples
-10 -8 -6 -4 -2 0 2 4-2
-1
0
1
2
3
4
5
6
feature 1
feat
ure
2
2000筆原始資料經 PCA轉換後分布圖
-09 -08 -07 -06 -05 -04 -03 -02-12
-1
-08
-06
-04
-02
0
02
feature 1
feat
ure
2
2000筆原始資料經LDA轉換後分布圖
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  – Rows of $A$ are the words $w_1, w_2, \dots, w_m$; columns are the docs $d_1, d_2, \dots, d_n$
      $A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n}$
      $A'_{m \times n} = U'_{m \times k} \, \Sigma_{k \times k} \, V'^T_{k \times n}$
  – Docs and queries are represented in a $k$-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $k \le r \le \min(m, n)$ and $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row $A \in R^n$, Col $A \in R^m$
ML-52
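LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n \times n} = [v_1 \; v_2 \; \dots \; v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \dots, n$, with $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
      $\|Av_i\|^2 = (Av_i)^T (Av_i) = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
  – For $\lambda_i \ne 0$, $i = 1, \dots, r$, $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col $A$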
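Both definitions of the singular values can be compared on a small case. This is a sketch only, assuming a hand-picked 2×2 matrix (not from the slides) and using the closed-form eigenvalues of a symmetric 2×2 matrix instead of a general eigensolver.

```python
import math

A = [[3.0, 0.0],
     [4.0, 5.0]]

# G = A^T A, a symmetric 2x2 matrix [[a, b], [b, d]]
G = [[sum(A[t][i] * A[t][j] for t in range(2)) for j in range(2)]
     for i in range(2)]
a, b, d = G[0][0], G[0][1], G[1][1]

# Closed-form eigenvalues of a symmetric 2x2 matrix, largest first
mean = (a + d) / 2.0
radius = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
eigenvalues = [mean + radius, mean - radius]

def unit_eigenvector(lam):
    """Unit eigenvector of G for eigenvalue lam (valid when b != 0)."""
    v = [b, lam - a]
    n = math.hypot(v[0], v[1])
    return [v[0] / n, v[1] / n]

singular_values = []   # sigma_j = sqrt(lambda_j)
lengths = []           # |A v_j|
for lam in eigenvalues:
    v = unit_eigenvector(lam)
    Av = [A[0][0] * v[0] + A[0][1] * v[1],
          A[1][0] * v[0] + A[1][1] * v[1]]
    singular_values.append(math.sqrt(lam))
    lengths.append(math.hypot(Av[0], Av[1]))

print(eigenvalues)       # eigenvalues of A^T A
print(singular_values)   # their square roots
print(lengths)           # agrees with the line above up to rounding
```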
ML-53
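LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col $A$
      $(Av_i) \cdot (Av_j) = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$
      $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col $A$
      $u_i = \dfrac{1}{\|Av_i\|} Av_i = \dfrac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
      $\Rightarrow [u_1 \; u_2 \; \dots \; u_r] \, \Sigma_r = A \, [v_1 \; v_2 \; \dots \; v_r]$
      ($V$: an orthonormal matrix ($n \times r$), known in advance; $U$: also an orthonormal matrix ($m \times r$))
    • Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$
      $\Rightarrow [u_1 \; u_2 \; \dots \; u_r \; \dots \; u_m] \, \Sigma = A \, [v_1 \; v_2 \; \dots \; v_r \; \dots \; v_n]$
      $\Rightarrow U\Sigma = AV \Rightarrow U \Sigma V^T = A V V^T \Rightarrow A = U \Sigma V^T$ (using $V V^T = I_{n \times n}$)
      $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
  – $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$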
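The construction $u_i = Av_i / \sigma_i$ can be traced step by step. A minimal sketch, assuming the same kind of hand-picked 2×2 matrix with its eigenpairs of $A^T A$ supplied as known values (an assumption made so that no eigensolver is needed):

```python
import math

A = [[3.0, 0.0],
     [4.0, 5.0]]

# Eigenpairs of A^T A = [[25, 20], [20, 25]], taken as known in advance.
s = 1.0 / math.sqrt(2.0)
V_cols = [[s, s], [s, -s]]          # v_1, v_2 (orthonormal)
lambdas = [45.0, 5.0]
sigmas = [math.sqrt(lam) for lam in lambdas]

def apply(M, v):
    """Matrix-vector product M v."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

# u_i = A v_i / sigma_i
U_cols = [[c / sig for c in apply(A, v)] for v, sig in zip(V_cols, sigmas)]

# A = U Sigma V^T written as the sum of sigma_i * u_i * v_i^T
A_rebuilt = [[sum(sig * u[i] * v[j]
                  for u, v, sig in zip(U_cols, V_cols, sigmas))
              for j in range(2)] for i in range(2)]

frob_sq = sum(x * x for row in A for x in row)

print(U_cols)                  # orthonormal u_1, u_2
print(A_rebuilt)               # matches A up to rounding
print(frob_sq, sum(lambdas))   # ||A||_F^2 versus sigma_1^2 + sigma_2^2
```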
ML-54
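LSA Derivations (3/7)
$A_{m \times n} = U \Sigma V^T = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$
$AV = U\Sigma$
[Figure: $A$ maps $R^n$ (basis $V$) to $R^m$ (basis $U$). The $v_i$ in $V_1$ span the row space of $A$; $V_2$ corresponds to the solutions of $AX = 0$; the $u_i$ in $U_1$ span the row space of $A^T$]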
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
      $A = U \Sigma V^T \Rightarrow AV = U \Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
      $A = U \Sigma V^T \Rightarrow A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$, $m \times n$; rows $w_1, w_2, \dots, w_m$, columns $d_1, d_2, \dots, d_n$)
    • compare two terms → dot product of two rows of $A$ (or an entry in $A A^T$)
    • compare two docs → dot product of two columns of $A$ (or an entry in $A^T A$)
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms (e.g. $w_i$, $w_j$) → dot product of two rows of $U'\Sigma'$
        $A' A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$ (using $V'^T V' = I$)
    • compare two docs (e.g. $d_k$, $d_s$) → dot product of two rows of $V'\Sigma'$
        $A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V'\Sigma')(V'\Sigma')^T$ (using $U'^T U' = I$)
    • compare a query word and a doc → each individual entry of $A'$
  – $\Sigma'$ is for stretching or shrinking
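The two identities for the reduced matrix can be checked numerically. A minimal sketch, assuming hand-picked orthonormal factors for a 3-term, 2-doc matrix (the numbers are illustrative, not taken from any corpus) and a truncation to $k = 1$:

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

# Hypothetical SVD factors: columns of U and of V are orthonormal.
U = [[2 / 3, 1 / 3], [2 / 3, -2 / 3], [1 / 3, 2 / 3]]   # one row per term
S = [5.0, 2.0]                                          # singular values
V = [[0.6, 0.8], [0.8, -0.6]]                           # one row per doc

k = 1
Uk = [row[:k] for row in U]
Vk = [row[:k] for row in V]
US = [[row[i] * S[i] for i in range(k)] for row in Uk]  # rows of U' Sigma'
VS = [[row[i] * S[i] for i in range(k)] for row in Vk]  # rows of V' Sigma'

A_k = matmul(US, transpose(Vk))              # A' = U' Sigma' V'^T

term_sims = matmul(US, transpose(US))        # (U'Sigma')(U'Sigma')^T
doc_sims = matmul(VS, transpose(VS))         # (V'Sigma')(V'Sigma')^T
AAt = matmul(A_k, transpose(A_k))            # A' A'^T
AtA = matmul(transpose(A_k), A_k)            # A'^T A'

print(term_sims)   # term-term comparisons, equal to A' A'^T up to rounding
print(doc_sims)    # doc-doc comparisons, equal to A'^T A' up to rounding
```

Working with the $m \times k$ and $n \times k$ row vectors avoids ever forming the full $m \times n$ matrix $A'$, which matters when $m$ and $n$ are large.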
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m \times 1$ query (or doc) vector
      $\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – The result is just like a row of $V$
  – Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
      $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \dfrac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$
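A minimal sketch of query fold-in and the weighted cosine, assuming small hand-picked factors with $k = r = 2$ (illustrative values; the function names `fold_in_query` and `latent_cosine` are hypothetical helpers, not part of any toolkit). Folding in a column that was already part of $A$ should reproduce the matching row of $V$, which gives a simple sanity check.

```python
import math

# Hypothetical SVD factors: columns of U and of V are orthonormal.
U = [[2 / 3, 1 / 3], [2 / 3, -2 / 3], [1 / 3, 2 / 3]]   # one row per term
S = [5.0, 2.0]                                          # singular values
V = [[0.6, 0.8], [0.8, -0.6]]                           # one row per doc

def fold_in_query(q, U, S):
    """q_hat = q^T U_k Sigma_k^{-1}; q holds one weight per term."""
    return [sum(q[t] * U[t][i] for t in range(len(q))) / S[i]
            for i in range(len(S))]

def latent_cosine(q_hat, d_hat, S):
    """cosine(q_hat Sigma, d_hat Sigma) for row vectors q_hat and d_hat."""
    qs = [x * s for x, s in zip(q_hat, S)]
    ds = [x * s for x, s in zip(d_hat, S)]
    num = sum(p * r for p, r in zip(qs, ds))
    den = math.sqrt(sum(p * p for p in qs)) * math.sqrt(sum(r * r for r in ds))
    return num / den

# Column 0 of A = U Sigma V^T, folded back in as if it were a new doc.
doc0 = [sum(U[t][i] * S[i] * V[0][i] for i in range(2)) for t in range(3)]
doc0_hat = fold_in_query(doc0, U, S)

query = [1.0, 0.0, 0.0]            # hypothetical query containing only term 0
q_hat = fold_in_query(query, U, S)
scores = [latent_cosine(q_hat, d_hat, S) for d_hat in V]

print(doc0_hat)   # matches row 0 of V up to rounding
print(scores)     # one similarity per doc
```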
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
      $\hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k}$
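Term fold-in mirrors the query case with $V$ in place of $U$. A minimal sketch under the same assumptions (hand-picked factors with $k = r = 2$; `fold_in_term` is a hypothetical helper name): a row that was already part of $A$ should fold back onto the matching row of $U$.

```python
# Hypothetical SVD factors: columns of U and of V are orthonormal.
U = [[2 / 3, 1 / 3], [2 / 3, -2 / 3], [1 / 3, 2 / 3]]   # one row per term
S = [5.0, 2.0]                                          # singular values
V = [[0.6, 0.8], [0.8, -0.6]]                           # one row per doc

def fold_in_term(t, V, S):
    """t_hat = t V_k Sigma_k^{-1}; t holds one weight per doc."""
    return [sum(t[d] * V[d][i] for d in range(len(t))) / S[i]
            for i in range(len(S))]

# Row 1 of A = U Sigma V^T, folded back in as if it were a new term.
term1 = [sum(U[1][i] * S[i] * V[d][i] for i in range(2)) for d in range(2)]
term1_hat = fold_in_term(term1, V, S)

new_term = [0.0, 2.0]     # hypothetical term that occurs only in doc 1
new_term_hat = fold_in_term(new_term, V, S)

print(term1_hat)      # matches row 1 of U up to rounding
print(new_term_hat)   # latent-space coordinates of the unseen term
```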
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g. bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g. 4 terms and 3 docs (rows: terms, columns: docs)
        2.3 0.0 4.2
        0.0 1.3 2.2
        3.8 0.0 0.5
        0.0 0.0 0.0
  – Each entry is weighted by TFxIDF score
  – The same matrix in sparse text form, listed column by column
        4 3 6      (no. of rows/terms, no. of cols/docs, no. of nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, 600, etc.) of LSA dimensionality
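The column-by-column layout shown on this slide can be produced from a dense matrix with a few lines of Python. This is a sketch of the layout as it appears in the slide's example only; `to_sparse_text` is a hypothetical helper, not a function of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) in column-major sparse text form."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = []
    for c in range(n_cols):
        # (row index, value) pairs for the nonzero entries of column c
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    nonzeros = sum(len(entries) for entries in columns)

    lines = ["%d %d %d" % (n_rows, n_cols, nonzeros)]   # header line
    for entries in columns:
        lines.append(str(len(entries)))                 # nonzero count of the column
        lines.extend("%d %s" % (r, v) for r, v in entries)
    return "\n".join(lines) + "\n"

# The 4-term, 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(term_doc)
print(text)
```

A term that occurs in no doc (the last row here) contributes no lines at all; it is accounted for only by the row count in the header.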
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (the first line gives the no. of indexing terms, the no. of docs, and the no. of nonzero entries)
        51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771 ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector ($u^T$) is 1x100)
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
• LSA100-S (100 eigenvalues)
        10026861882994155959…
• LSA100-Vt (2265 docs; each doc vector ($v^T$) is 1x100)
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
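A minimal sketch of reading these outputs back, assuming the layout suggested by the slide: a header with the two dimensions followed by the values row by row for `-Ut` and `-Vt`, and a count followed by one value per entry for `-S`. The helper names and the tiny sample strings are hypothetical stand-ins, not real toolkit output.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    rows, cols = int(tokens[0]), int(tokens[1])
    values = [float(tok) for tok in tokens[2:]]
    if len(values) != rows * cols:
        raise ValueError("expected %d values, found %d" % (rows * cols, len(values)))
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

def parse_value_list(text):
    """Parse a count followed by that many values (assumed -S layout)."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(tok) for tok in tokens[1:]]
    if len(values) != count:
        raise ValueError("expected %d values, found %d" % (count, len(values)))
    return values

def column(matrix, j):
    """Column j of a matrix: with a k x m Ut, this is the k-dim vector of word j."""
    return [row[j] for row in matrix]

# Tiny made-up stand-ins for LSA100-Ut and LSA100-S (2 dimensions, 3 words).
ut_text = "2 3\n0.5 0.1 0.2\n0.3 0.4 0.6\n"
s_text = "2\n4.0\n1.5\n"

Ut = parse_dense_text(ut_text)
S = parse_value_list(s_text)
word1 = column(Ut, 1)
print(Ut, S, word1)
```

Since the file stores the transposed matrix, a word (or doc) vector is read down a column rather than along a row.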
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector (TFxIDF weighted beforehand)
      $\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – The result is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
      $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \dfrac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$
ML-37
LDA Examples: Feature Transformation (1/2)
• Example 1: Experiments on Speech Signal Processing
[Figure: covariance matrix of the 18-Mel-filter-bank vectors (calculated using Year-99's 5471 files) and, after the cosine transform, covariance matrix of the 18-cepstral vectors (calculated using Year-99's 5471 files)]
$\Sigma_x = \dfrac{1}{N} \sum_i (x_i - \bar{x})(x_i - \bar{x})^T$, $\Sigma'_y = \dfrac{1}{N} \sum_i (y_i - \bar{y})(y_i - \bar{y})^T$
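A minimal sketch of the covariance estimate above in plain Python, assuming three made-up 2-dimensional samples; it divides by $N$, as in the formula, rather than by $N - 1$.

```python
def covariance(samples):
    """Return (mean, covariance) with the 1/N normalization."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(s[d] for s in samples) / n for d in range(dim)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
            for j in range(dim)] for i in range(dim)]
    return mean, cov

# Hypothetical samples, one row per sample.
samples = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
mean, cov = covariance(samples)
print(mean)
print(cov)
```

Large off-diagonal entries, as in this perfectly correlated toy set, are what a decorrelating transform such as PCA is meant to remove.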
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: after the PCA transform, covariance matrix of the 18-PCA-cepstral vectors; after the LDA transform, covariance matrix of the 18-LDA-cepstral vectors (both calculated using Year-99's 5471 files)]
Character Error Rate
          TC      WG
MFCC      26.32   22.71
LDA-1     23.12   20.17
LDA-2     23.11   20.11
ML-39
PCA vs LDA (1/2)
[Figure: two panels, PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference in the projections obtained from LDA and HDA for the 2-class case
  – Clearly, HDA theoretically provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
ML-42
HW-3 Feature Transformation (2/4)
ML-43
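HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
CoVar=[3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default');
surf(CoVar);
• Eigen Decomposition
BE=[3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];
WI=[4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];
% LDA
IWI=inv(WI);
A=IWI*BE;
% PCA
A=BE+WI; % why? (Prove it)
[V,D]=eig(A);     % all eigenvectors (columns of V) and eigenvalues (diagonal of D)
[V,D]=eigs(A,3);  % the 3 largest-magnitude eigenpairs
% write the basis vectors to the file 'Basis'
fid=fopen('Basis','w');
for i=1:3 % feature vector length
  for j=1:3 % basis number
    fprintf(fid,'%10.10f ',V(i,j));
  end
  fprintf(fid,'\n');
end
fclose(fid);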
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2,000 original samples after the PCA transform; axes: feature 1, feature 2]
[Figure: distribution of the 2,000 original samples after the LDA transform; axes: feature 1, feature 2]
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T
ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T
For stretching or shrinking
Irxr
Ursquo=Umxk
Σrsquo=Σk
Vrsquo=Vnxk
wjwi
dk
ds
ML-57
LSA Derivations (67)
bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the
original analysisbull Fold-in a new mx1 query (or doc) vector
ndash Cosine measure between the query and doc vectors in the latent semantic space
( ) 111ˆ minus
timestimestimestimes Σ= kkkmmT
k UqqQuery represented by the weightedsum of it constituent term vectors
The separate dimensions are differentially weighted
Just like a row of V
( )ΣΣ
Σ=ΣΣ=
dq
dqdqcoinedqsimT
ˆˆ
ˆˆ)ˆˆ(ˆˆ
2
row vectors
ML-58
LSA Derivations (77)
bull Fold-in a new 1 X n term vector 1
11ˆminustimestimestimestimes Σ= kkknnk Vtt
ML-59
LSA Example
bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels
Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)
ML-60
LSA Conclusions
bull Advantagesndash A clean formal framework and a clearly defined optimization
criterion (least-squares)bull Conceptual simplicity and clarity
ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)
ndash Good results for high-recall searchbull Take term co-occurrence into account
bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy
bull Eg bank basshellip
ML-38
LDA Examples: Feature Transformation (2/2)
• Example 1: Experiments on Speech Signal Processing (cont.)
[Figure: covariance matrix of the 18-PCA-cepstral vectors (after PCA transform) and covariance matrix of the 18-LDA-cepstral vectors (after LDA transform), each calculated using Year-99's 5471 files]

Character Error Rate
             TC       WG
    MFCC     26.32    22.71
    LDA-1    23.12    20.17
    LDA-2    23.11    20.11
ML-39
PCA vs. LDA (1/2)
[Figure: two panels labeled PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference between the projections obtained from LDA and HDA for the 2-class case
  – In theory, HDA clearly provides a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set.
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the two transformations, respectively. Describe the phenomena that you have observed.
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed.
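Step 1 of the assignment can be sketched in a few lines. The snippet below is only an illustration under stated assumptions: the two tiny 2-feature lists stand in for MaleData and FemaleData (the real sets have 39 features per row), the helper name `covariance_matrix` is hypothetical, and the covariance is normalized by N - 1.

```python
def covariance_matrix(samples):
    """Sample covariance (normalized by N - 1) of row-wise samples."""
    n = len(samples)
    dim = len(samples[0])
    mean = [sum(row[d] for row in samples) / n for d in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for row in samples:
        centered = [row[d] - mean[d] for d in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += centered[i] * centered[j] / (n - 1)
    return cov

# Hypothetical toy stand-ins for MaleData and FemaleData (2 features, not 39).
male_data = [[1.0, 2.0], [3.0, 4.0]]
female_data = [[5.0, 6.0], [7.0, 12.0]]

merged = male_data + female_data      # merge: stack the rows of both sets
cov = covariance_matrix(merged)       # covariance of the merged data set
```

The resulting matrix is symmetric, and its diagonal holds the per-feature variances of the merged set.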
ML-42
HW-3 Feature Transformation (2/4)
ML-43
HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4;
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;
          2.9 8.7 3.5;
          4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI*BE;
    % PCA
    A = BE + WI;            % why? (Prove it)
    [V,D] = eig(A);         % full eigen decomposition
    [V,D] = eigs(A,3);      % or: only the 3 largest eigenvalues/eigenvectors
    fid = fopen('Basis','w');
    for i = 1:3             % feature vector length
      for j = 1:3           % basis number
        fprintf(fid, '%10.10f ', V(i,j));
      end
      fprintf(fid, '\n');
    end
    fclose(fid);
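One way to approach the "why" question for PCA is the scatter decomposition below. It is a sketch under an assumption the slide does not state: BE and WI are taken to be the unnormalized between-class and within-class scatter matrices, written here as $S_B$ and $S_W$. The symbols $C_i$, $N_i$, $\mu_i$, and $\mu$ (class $i$, its sample count, its mean, and the global mean) are introduced only for this illustration.

```latex
\begin{align*}
S_T &= \sum_i \sum_{x \in C_i} (x - \mu)(x - \mu)^T \\
    &= \sum_i \sum_{x \in C_i}
       \bigl[(x - \mu_i) + (\mu_i - \mu)\bigr]
       \bigl[(x - \mu_i) + (\mu_i - \mu)\bigr]^T \\
    &= \underbrace{\sum_i \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T}_{S_W}
     + \underbrace{\sum_i N_i\, (\mu_i - \mu)(\mu_i - \mu)^T}_{S_B}
\end{align*}
```

The cross terms vanish because $\sum_{x \in C_i} (x - \mu_i) = 0$ for every class. Under the stated assumption, BE + WI is therefore the total scatter of the merged data, which is the matrix PCA decomposes (up to a normalization constant).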
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after PCA transformation (axes: feature 1 vs. feature 2)]
[Figure: distribution of the 2000 original data samples after LDA transformation (axes: feature 1 vs. feature 2)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
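LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: the n-dimensional $X$ is projected onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$ of the feature space, giving the k-dimensional $Y$
    \[ y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \varphi_i, \qquad \min \|X - \hat{X}\|^2 \ \text{for a given } k \]
  – SVD (in LSA): $A$ is approximated by $A'$ in the k-dimensional latent semantic space
    \[ A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n) \]
    \[ A'_{m \times n} = U'_{m \times k}\, \Sigma'_{k \times k}\, {V'}^T_{k \times n}, \qquad \min \|A - A'\|_F^2 \ \text{for a given } k \]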
ML-47
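LSA (3/7)
  – Singular Value Decomposition (SVD) is used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a Vector $x$
\[ y_1 = \|x\| \cos\theta = \|x\|\, \frac{\varphi_1^T x}{\|\varphi_1\|\, \|x\|} = \varphi_1^T x, \qquad \text{where } \|\varphi_1\| = 1 \]
[Figure: a vector $x$ with projections $y_1$ and $y_2$ onto the axes $\varphi_1$ and $\varphi_2$; $\theta$ is the angle between $x$ and $\varphi_1$]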
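The projection of a vector can be checked numerically. The following is a minimal sketch with a hypothetical 2-D orthonormal basis (the names `phi1`, `phi2`, and the sample vector `x` are illustrative choices, not from the slides); it compares the inner-product form with the geometric form $\|x\| \cos\theta$.

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# Hypothetical orthonormal basis: the standard axes rotated by 45 degrees.
phi1 = [1 / math.sqrt(2), 1 / math.sqrt(2)]
phi2 = [-1 / math.sqrt(2), 1 / math.sqrt(2)]
x = [3.0, 1.0]

y1 = dot(phi1, x)   # coordinate of x along phi1
y2 = dot(phi2, x)   # coordinate of x along phi2

# The same coordinate from the geometric definition |x| * cos(theta).
cos_theta = dot(phi1, x) / (norm(phi1) * norm(x))
y1_geometric = norm(x) * cos_theta

# With an orthonormal basis, x is recovered exactly from (y1, y2).
x_rebuilt = [y1 * phi1[i] + y2 * phi2[i] for i in range(2)]
```

Keeping only `y1` and dropping `y2` is the one-dimensional version of the dimension reduction discussed here.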
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: diagram relating Doc and Query through their terms and structure models; labeled links: doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  \[ A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n) \]
  \[ A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, {V'}^T_{k \times n}, \qquad k \le r \]
  – The rows of $A$ and $A'$ are indexed by the words $w_1, w_2, \ldots, w_m$ and the columns by the docs $d_1, d_2, \ldots, d_n$
  – Row $A \in R^n$, Col $A \in R^m$
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Docs and queries are represented in a k-dimensional space. The axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
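LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n \times n} = [v_1\ v_2\ \ldots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$, so that $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
    \[ \|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \sigma_i = \|Av_i\| \]
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$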
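The two definitions of a singular value (square root of an eigenvalue of $A^T A$, and length of $Av_j$) can be compared on a small case. This is a sketch only: the 2x2 matrix `A` is a hypothetical example, and the closed-form eigen solution used here works for symmetric 2x2 matrices with a nonzero off-diagonal entry, not in general.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

A = [[3.0, 0.0],
     [4.0, 5.0]]                       # hypothetical example matrix

AtA = matmul(transpose(A), A)          # symmetric: [[25, 20], [20, 25]]

# Closed-form eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]].
a, b, c = AtA[0][0], AtA[0][1], AtA[1][1]
half_trace = (a + c) / 2
radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
eigenvalues = [half_trace + radius, half_trace - radius]   # lambda_1 >= lambda_2

def unit_eigenvector(lam):
    # (b, lam - a) solves (AtA - lam * I) v = 0 whenever b != 0.
    v = [b, lam - a]
    length = math.hypot(v[0], v[1])
    return [v[0] / length, v[1] / length]

checks = []
for lam in eigenvalues:
    v = unit_eigenvector(lam)
    Av = [A[0][0] * v[0] + A[0][1] * v[1],
          A[1][0] * v[0] + A[1][1] * v[1]]
    # (sqrt of eigenvalue, length of A v): the two definitions of sigma
    checks.append((math.sqrt(lam), math.hypot(Av[0], Av[1])))
```

Each pair in `checks` holds the same number twice, up to rounding, which is the identity $\sigma_i = \|Av_i\|$.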
ML-53
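LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
  \[ Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \qquad (i \ne j) \]
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$
    \[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0 \]
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
    \[ u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\ u_2\ \ldots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \ldots\ v_r] \]
    • $V$: an orthonormal matrix ($n \times r$), known in advance
    • $U$: also an orthonormal matrix ($m \times r$)
  – Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
    \[ [u_1\ \ldots\ u_r\ \ldots\ u_m]\, \Sigma = A\, [v_1\ \ldots\ v_r\ \ldots\ v_n] \;\Rightarrow\; U \Sigma = AV \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T \]
    where $V V^T = I_{n \times n}$ and
    \[ \Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix} \]
• $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$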
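The link between the Frobenius norm and the singular values follows in one line from the definitions on this slide. The step through the trace is added here as a sketch of the missing reasoning; it uses only the facts that the trace of $A^T A$ is the sum of the squared entries of $A$ and also the sum of its eigenvalues.

```latex
\begin{align*}
\|A\|_F^2
  = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2
  = \mathrm{tr}(A^T A)
  = \sum_{j=1}^{n} \lambda_j
  = \sum_{j=1}^{r} \sigma_j^2
\end{align*}
```

The last equality holds because $\lambda_{r+1} = \cdots = \lambda_n = 0$ when $A$ has rank $r$.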
ML-54
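LSA Derivations (3/7)
\[ A = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T \]
[Figure: $A$ ($m \times n$) maps $R^n$ to $R^m$ through $V^T$ and $U$, with $AV = U\Sigma$; the $v_i$ (columns of $V_1$) span the row space of $A$, the $u_i$ (columns of $U_1$) span the row space of $A^T$, and $V_2$ is associated with $AX = 0$]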
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of the corresponding row of $A$ onto the basis formed by the columns of $V$
    \[ A = U \Sigma V^T \;\Rightarrow\; AV = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = AV \]
    • The i-th entry of a row of $U$ is related to the projection of the corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of the corresponding row of $A^T$ onto the basis formed by the columns of $U$
    \[ A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U \]
    • The i-th entry of a row of $V$ is related to the projection of the corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$, $m \times n$; rows $w_1, w_2, \ldots, w_m$, columns $d_1, d_2, \ldots, d_n$)
    • Compare two terms → dot product of two rows of $A$ (or an entry in $A A^T$)
    • Compare two docs → dot product of two columns of $A$ (or an entry in $A^T A$)
    • Compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • Compare two terms ($w_i$, $w_j$) → dot product of two rows of $U' \Sigma'$
    • Compare two docs ($d_k$, $d_s$) → dot product of two rows of $V' \Sigma'$
    • Compare a query word and a doc → each individual entry of $A'$
    \[ A' {A'}^T = (U' \Sigma' {V'}^T)(U' \Sigma' {V'}^T)^T = U' \Sigma' {V'}^T V' {\Sigma'}^T {U'}^T = (U' \Sigma')(U' \Sigma')^T \]
    \[ {A'}^T A' = (U' \Sigma' {V'}^T)^T (U' \Sigma' {V'}^T) = V' {\Sigma'}^T {U'}^T U' \Sigma' {V'}^T = (V' \Sigma')(V' \Sigma')^T \]
    • ${V'}^T V' = I_{k \times k}$ and ${U'}^T U' = I_{k \times k}$; $\Sigma'$ acts as a stretching or shrinking factor
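The term-term and doc-doc comparisons can be verified on a toy factorization. The sketch below assumes a hand-built rank-2 example (3 terms, 2 docs) whose factors `U`, `S`, and `V` are hypothetical values chosen to have orthonormal columns; it does not compute an SVD itself.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

s = 1 / math.sqrt(2)
U = [[s, 0.0], [s, 0.0], [0.0, 1.0]]    # 3 terms x k=2, orthonormal columns
S = [[3.0, 0.0], [0.0, 1.0]]            # k x k diagonal matrix of singular values
V = [[0.6, -0.8], [0.8, 0.6]]           # 2 docs x k=2, orthonormal columns

A_k = matmul(matmul(U, S), transpose(V))   # the rank-k matrix A'

US = matmul(U, S)    # rows: scaled term vectors
VS = matmul(V, S)    # rows: scaled doc vectors

term_sim = matmul(US, transpose(US))   # dot products between term rows
doc_sim = matmul(VS, transpose(VS))    # dot products between doc rows
```

`term_sim` matches $A' {A'}^T$ and `doc_sim` matches ${A'}^T A'$ entry by entry, so comparisons can be made in the k-dimensional space without forming $A'$.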
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m \times 1$ query (or doc) vector
  \[ \hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k} \]
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
  \[ sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|} \]
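Folding in a query and scoring it against the docs can be sketched as follows. The factors `U`, `sigma`, and `V` are hypothetical hand-built values with orthonormal columns (3 terms, 2 docs, k = 2), and the function names `fold_in` and `scaled_cosine` are illustrative. The query is deliberately set to doc 0's own term column, so its folded-in vector should coincide with row 0 of `V`.

```python
import math

def fold_in(q, U, sigma):
    """q_hat = q^T U_k Sigma_k^{-1}, returned as a list of length k."""
    return [sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j]
            for j in range(len(sigma))]

def scaled_cosine(q_hat, d_hat, sigma):
    """cosine(q_hat * Sigma, d_hat * Sigma) for row vectors q_hat and d_hat."""
    qs = [q * s for q, s in zip(q_hat, sigma)]
    ds = [d * s for d, s in zip(d_hat, sigma)]
    num = sum(a * b for a, b in zip(qs, ds))
    den = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return num / den

r = 1 / math.sqrt(2)
U = [[r, 0.0], [r, 0.0], [0.0, 1.0]]   # term vectors (rows), orthonormal columns
sigma = [3.0, 1.0]                     # diagonal of Sigma_k
V = [[0.6, -0.8], [0.8, 0.6]]          # doc vectors (rows), orthonormal columns

# A' = U Sigma V^T is built only to obtain a query: the term column of doc 0.
A_k = [[sum(U[i][k] * sigma[k] * V[j][k] for k in range(2)) for j in range(2)]
       for i in range(3)]
q = [A_k[i][0] for i in range(3)]

q_hat = fold_in(q, U, sigma)                                # approx. V[0]
sims = [scaled_cosine(q_hat, d_hat, sigma) for d_hat in V]  # one score per doc
```

In a retrieval setting, the docs would then be ranked by the values in `sims`.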
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
  \[ \hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k} \]
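A short justification of the term fold-in formula, written as a sketch: it assumes the rank-k factorization $A' = U_k \Sigma_k V_k^T$ with $V_k^T V_k = I$, where the subscript-k names are shorthand introduced here for $U_{m \times k}$, $\Sigma_{k \times k}$, and $V_{n \times k}$.

```latex
\begin{align*}
A' V_k \Sigma_k^{-1}
  = U_k \Sigma_k V_k^T V_k \Sigma_k^{-1}
  = U_k \Sigma_k \Sigma_k^{-1}
  = U_k
\end{align*}
```

Row $i$ of $A'$ is thus mapped to row $i$ of $U_k$. A new $1 \times n$ term vector $t$ is treated as one more row, giving $\hat{t} = t\, V_k \Sigma_k^{-1}$, which is just like a row of $U$.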
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library (version 1.3) is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs
  – Each entry is weighted by its TFxIDF score

    Term-doc matrix (rows: terms, columns: docs)
    2.3  0.0  4.2
    0.0  1.3  2.2
    3.8  0.0  0.5
    0.0  0.0  0.0

    Sparse text input
    4 3 6      (Row: 4 terms; Col: 3 docs; 6 nonzero entries)
    2          (2 nonzero entries at Col 0)
    0 2.3      (Col 0, Row 0)
    2 3.8      (Col 0, Row 2)
    1          (1 nonzero entry at Col 1)
    1 1.3      (Col 1, Row 1)
    3          (3 nonzero entries at Col 2)
    0 4.2      (Col 2, Row 0)
    1 2.2      (Col 2, Row 1)
    2 0.5      (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
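The column-wise sparse layout shown on this slide can be produced from a dense matrix with a short routine. This is a sketch of the layout as the slide presents it (header line, then for each column its nonzero count followed by "row value" pairs); the function name `to_sparse_text` is hypothetical and not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense row-major matrix into the column-wise sparse text layout."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Keep (row index, value) for every nonzero entry of column c.
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    n_nonzero = sum(len(entries) for entries in columns)

    lines = [f"{n_rows} {n_cols} {n_nonzero}"]      # header: rows, cols, nonzeros
    for entries in columns:
        lines.append(str(len(entries)))             # nonzero count of this column
        lines.extend(f"{r} {value}" for r, value in entries)
    return "\n".join(lines)

# The 4-term x 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
```

For the slide's matrix, `sparse_text` begins with the header line `4 3 6`, followed by the three column blocks.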
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example: term-doc matrix
    51253 2265 218852      (indexing term no., doc no., nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1x100)
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
• LSA100-S (100 eigenvalues)
    100
    26861882994155959…
• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1x100)
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
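Reading such an output file back can be sketched as below. The layout is an assumption inferred from the header shown on this slide ("100 51253": a row count and a column count, followed by that many values); the function name `parse_dense_text` and the miniature sample text are hypothetical. Under this reading each row of the -Ut file is one latent dimension, so a word vector is a column.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows * cols whitespace-separated values."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

# Hypothetical miniature of a -Ut file: 2 latent dimensions x 3 words.
ut_text = """2 3
0.003 0.001 0.004
0.002 0.002 0.005
"""
Ut = parse_dense_text(ut_text)

# Transpose so that each entry is the k-dimensional vector of one word.
word_vectors = [list(col) for col in zip(*Ut)]
```

The same routine would apply to a -Vt file, giving one k-dimensional vector per doc.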
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector (TFxIDF weighted beforehand)
  \[ \hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k} \]
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
  \[ sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|} \]
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-39
PCA vs. LDA (1/2)
[Figure: two panels, labeled PCA and LDA]
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference between the projections obtained from LDA and HDA for the 2-class case
  – In theory, HDA yields a much lower classification error than LDA
• However, most statistical modeling assumes that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData), in which each row is a sample with 39 features, please perform the following operations:
  1. Merge the two data sets and find/plot the covariance matrix of the merged data set
  2. Apply the PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrix of each transformed data set. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA, as well as the first two eigenvectors of LDA, to represent the merged data set. Selectively plot portions of the samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
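Step 1 of the assignment (merge, then compute the covariance matrix) can be sketched in plain Python. This is only an illustration: the tiny 2-feature rows below are hypothetical stand-ins for MaleData and FemaleData (the real sets have 39 features), the function name is made up, and the unbiased 1/(N-1) normalization is an assumption, since the slide does not say which normalization to use.

```python
def covariance_matrix(samples):
    """Sample covariance of row-wise samples (a list of equal-length lists)."""
    n = len(samples)
    dim = len(samples[0])
    # Per-feature mean over all samples
    mean = [sum(row[j] for row in samples) / n for j in range(dim)]
    # Subtract the mean from every sample
    centered = [[row[j] - mean[j] for j in range(dim)] for row in samples]
    # Assumption: unbiased estimate, i.e. divide by (n - 1)
    return [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(dim)]
            for a in range(dim)]


# Hypothetical toy data; each row is one sample
male_data = [[1.0, 2.0], [3.0, 6.0]]
female_data = [[2.0, 1.0], [6.0, 3.0]]

merged = male_data + female_data          # merge the two sets row-wise
cov = covariance_matrix(merged)
```

The same function could be applied to the PCA- or LDA-transformed samples in step 2 to compare the covariance structure before and after the transformation.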
ML-42
HW-3 Feature Transformation (2/4)
ML-43
HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar=[3.0 0.5 0.4
           0.9 6.3 0.2
           0.4 0.4 4.2
    ];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE=[3.0 3.5 1.4
        1.9 6.3 2.2
        2.4 0.4 4.2
    ];
    WI=[4.0 4.1 2.1
        2.9 8.7 3.5
        4.4 3.2 4.3
    ];
    % LDA
    IWI=inv(WI);
    A=IWI*BE;
    % PCA
    A=BE+WI;              % why? (Prove it)
    [V,D]=eig(A);
    [V,D]=eigs(A,3);
    fid=fopen('Basis','w');
    for i=1:3             % feature vector length
      for j=1:3           % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: scatter plot, feature 1 vs. feature 2: distribution of the 2,000 original samples after the PCA transformation]
[Figure: scatter plot, feature 1 vs. feature 2: distribution of the 2,000 original samples after the LDA transformation]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with “latent” semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
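LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA (feature space): project $x \in R^n$ onto an orthonormal basis $\phi_1, \ldots, \phi_k$
      $y_i = \phi_i^T x$, $\hat{x} = \sum_{i=1}^{k} y_i \phi_i$, minimizing $\|x - \hat{x}\|^2$ for a given $k$
  – SVD (in LSA) (latent semantic space):
      $A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^T_{r\times n}$, $r \le \min(m,n)$
      $A'_{m\times n} = U'_{m\times k}\,\Sigma'_{k\times k}\,V'^T_{k\times n}$, minimizing $\|A - A'\|_F^2$ for a given $k$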
ML-47
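LSA (3/7)
  – Singular Value Decomposition (SVD) is used for the word-document matrix
    • A least-squares method for dimension reduction
[Figure: Projection of a Vector x: x is projected onto the axes $\phi_1$ and $\phi_2$, giving the coordinates $y_1$ and $y_2$; $\theta_1$ is the angle between x and $\phi_1$]
    $y_1 = \|x\|\cos\theta_1 = \|x\|\,\frac{\phi_1^T x}{\|\phi_1\|\,\|x\|} = \phi_1^T x$, where $\|\phi_1\| = 1$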
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: Doc and Query are each mapped to terms and then to a structure model. Labeled links: doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: “human computer interaction”
An OOV (out-of-vocabulary) word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
    $A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^T_{r\times n}$
    $A'_{m\times n} = U'_{m\times k}\,\Sigma_{k\times k}\,V'^T_{k\times n}$
    (the rows of $A$ and $A'$ are the words $w_1, w_2, \ldots, w_m$; the columns are the docs $d_1, d_2, \ldots, d_n$)
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r\times r}$, $V^T V = I_{r\times r}$
  – $r \le \min(m,n)$, $k \le r$, $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$
  – Row $A \subseteq R^n$, Col $A \subseteq R^m$
  – Docs and queries are represented in a k-dimensional space. The coordinates along each axis can be weighted according to the associated diagonal values of $\Sigma_k$
ML-52
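LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n\times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$): $V_{n\times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n\times n}$
• Define the singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$
  – As the square roots of the eigenvalues of $A^T A$: $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
  – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
      $\|Av_i\|^2 = (Av_i)^T (Av_i) = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
  – For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$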
ML-53
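LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
    $Av_i \cdot Av_j = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$
      $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
      $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i \Rightarrow [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$
      ($V = [v_1\ \cdots\ v_r]$ is an $n\times r$ matrix with orthonormal columns, known in advance; $U = [u_1\ \cdots\ u_r]$ is also an $m\times r$ matrix with orthonormal columns)
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n] \Rightarrow U\Sigma = AV \Rightarrow U\Sigma V^T = AVV^T \Rightarrow A = U\Sigma V^T$ (using $VV^T = I_{n\times n}$)
      $\Sigma_{m\times n} = \begin{pmatrix} \Sigma_r & 0_{r\times(n-r)} \\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)} \end{pmatrix}$
  – $\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$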
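The construction on this slide can be checked numerically on a small case. The sketch below is an illustration only: the 2x2 matrix and all helper names are chosen here, not taken from the slides, and the eigenvectors of $A^T A$ are written down by hand rather than computed by a solver. For $A = \begin{pmatrix} 3 & 0 \\ 4 & 5 \end{pmatrix}$, $A^T A = \begin{pmatrix} 25 & 20 \\ 20 & 25 \end{pmatrix}$ has eigenvalues 45 and 5 with eigenvectors $(1,1)/\sqrt{2}$ and $(1,-1)/\sqrt{2}$.

```python
import math


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


# Hypothetical example matrix (not from the slides)
A = [[3.0, 0.0], [4.0, 5.0]]

# Hand-derived orthonormal eigenvectors of A^T A (eigenvalues 45 and 5)
s = 1.0 / math.sqrt(2.0)
v1, v2 = [s, s], [s, -s]

# Singular values as the lengths of A v_i
Av1, Av2 = matvec(A, v1), matvec(A, v2)
sigma1, sigma2 = norm(Av1), norm(Av2)

# Orthonormal basis of Col A: u_i = A v_i / sigma_i
u1 = [x / sigma1 for x in Av1]
u2 = [x / sigma2 for x in Av2]

# Rebuild A = U Sigma V^T as a sum of rank-one terms sigma_i * u_i * v_i^T
A_rebuilt = [[sigma1 * u1[i] * v1[j] + sigma2 * u2[i] * v2[j] for j in range(2)]
             for i in range(2)]

# Squared Frobenius norm, to compare with sigma1^2 + sigma2^2
frobenius_sq = sum(a * a for row in A for a in row)
```

In this example the squared singular values should match the eigenvalues 45 and 5, $Av_1$ and $Av_2$ should be orthogonal, and the squared Frobenius norm (9 + 16 + 25 = 50) should equal $\sigma_1^2 + \sigma_2^2$.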
ML-54
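LSA Derivations (3/7)
[Figure: $A_{m\times n}$ maps $R^n$ to $R^m$; $R^n$ is split into $V_1$ and $V_2$, and $R^m$ into $U_1$ and $U_2$. Labels: $v_i$ spans the row space of $A$; $u_i$ spans the row space of $A^T$; $AX = 0$; $V^T$; $U$]
    $A = U\Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$
    $U\Sigma = AV$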
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
      $A = U\Sigma V^T \Rightarrow AV = U\Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$
    • The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
      $A = U\Sigma V^T \Rightarrow A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$
    • The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A_{m\times n}$; rows are the words $w_1, \ldots, w_m$, columns are the docs $d_1, \ldots, d_n$)
    • Compare two terms → dot product of two rows of $A$ (or an entry in $AA^T$)
    • Compare two docs → dot product of two columns of $A$ (or an entry in $A^T A$)
    • Compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$
    • Compare two terms → dot product of two rows of $U'\Sigma'$
    • Compare two docs → dot product of two rows of $V'\Sigma'$
    • Compare a query word and a doc → each individual entry of $A'$
    $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$ (since $V'^T V' = I$)
    $A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$ (since $U'^T U' = I$)
  – $\Sigma'$ serves for stretching or shrinking the axes
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m\times 1$ query (or doc) vector
    $\hat{q}_{1\times k} = (q^T)_{1\times m}\, U_{m\times k}\, \Sigma^{-1}_{k\times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$ ($\hat{q}$ and $\hat{d}$ are row vectors)
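The fold-in and cosine formulas on this slide can be sketched in a few lines of plain Python. Everything concrete here is assumed for illustration: the function names, the toy $U$ with $m = 3$ terms and $k = 2$ latent dimensions, the diagonal of $\Sigma$, and the doc vector $\hat{d}$ are all made up, and $\Sigma$ is stored as a list of its diagonal values.

```python
import math


def fold_in(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}; q has length m, U is m x k, sigma is the diagonal of Sigma."""
    k = len(sigma)
    return [sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j] for j in range(k)]


def latent_cosine(q_hat, d_hat, sigma):
    """cosine(q_hat Sigma, d_hat Sigma) for row vectors q_hat and d_hat."""
    qs = [x * s for x, s in zip(q_hat, sigma)]   # q_hat scaled by Sigma
    ds = [x * s for x, s in zip(d_hat, sigma)]   # d_hat scaled by Sigma
    numerator = sum(a * b for a, b in zip(qs, ds))
    denominator = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return numerator / denominator


# Hypothetical toy model: 3 terms, 2 latent dimensions
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # orthonormal columns
sigma = [2.0, 1.0]                          # diagonal of Sigma
q = [1.0, 1.0, 0.0]                         # new query over the 3 terms
d_hat = [0.6, 0.8]                          # a doc vector, i.e. a row of V

q_hat = fold_in(q, U, sigma)
similarity = latent_cosine(q_hat, d_hat, sigma)
```

Note that $\hat{q}\Sigma = q^T U$, so the $\Sigma^{-1}$ applied during fold-in cancels inside the cosine; the fold-in form is still useful because it places $\hat{q}$ in the same coordinates as the rows of $V$.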
ML-58
LSA Derivations (7/7)
• Fold-in a new $1\times n$ term vector
    $\hat{t}_{1\times k} = t_{1\times n}\, V_{n\times k}\, \Sigma^{-1}_{k\times k}$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems (“heterogeneous vocabulary”)
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (Row = Term, Col = Doc)
        2.3 0.0 4.2
        0.0 1.3 2.2
        3.8 0.0 0.5
        0.0 0.0 0.0
  – Each entry is weighted by its TFxIDF score
  – The same matrix in sparse text form:
        4 3 6      (4 terms, 3 docs, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
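The column-by-column sparse layout shown on this slide can be produced from a dense matrix with a short helper. This is a sketch under the assumption that the layout is exactly as the slide depicts it (a header with rows, columns, and nonzero count, then for each column its nonzero count followed by `row value` pairs); the function name is hypothetical and is not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense list-of-rows matrix column by column, skipping zeros."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    # For every column, collect (row index, value) pairs of the nonzero entries
    columns = [[(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
               for c in range(n_cols)]
    total_nonzero = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total_nonzero}"]
    for entries in columns:
        lines.append(str(len(entries)))                 # nonzero count of this column
        lines.extend(f"{r} {v}" for r, v in entries)    # one "row value" pair per line
    return "\n".join(lines)


# The 4-term x 3-doc example from the slide
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
```

Writing `sparse_text` to a file would give an input in the same shape as the slide's example; whether a particular SVDLIBC build accepts it unchanged should be checked against that build's own documentation.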
ML-63
LSA Toolkit SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse text matrix input
  – -o LSA100: prefix of the output files
  – -d 100: number of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit SVDLIBC (4/5)
• LSA100-Ut
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
  – 51253 words; each word vector ($u^T$) is 1×100
• LSA100-S
        100
        2686.18
        829.941
        559.59
        …
  – 100 eigenvalues
• LSA100-Vt
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
  – 2265 docs; each doc vector ($v^T$) is 1×100
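Output files shaped like LSA100-Ut and LSA100-Vt could be read back with a small parser. The sketch below assumes the layout suggested by the slide: a `rows cols` header followed by the values row by row. Treating column j of the k x m matrix as the k-dimensional vector of word j is also an assumption, based on the header `100 51253`. The function names and the toy input string are hypothetical.

```python
def parse_dense_text(text):
    """Parse a 'rows cols' header followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    # Cut the flat value list into n_rows rows of n_cols values each
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def columns_as_vectors(matrix):
    """Transpose, so that each column of a k x m matrix becomes one k-dim vector."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical miniature stand-in for LSA100-Ut: k = 2 dimensions, m = 3 words
toy_ut = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.005\n"

ut = parse_dense_text(toy_ut)
word_vectors = columns_as_vectors(ut)   # one 2-dim vector per word
```

The same two functions would apply to a file shaped like LSA100-Vt, yielding one vector per doc.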
ML-65
LSA Toolkit SVDLIBC (5/5)
• Fold-in a new $m\times 1$ query vector (TFxIDF weighted beforehand)
    $\hat{q}_{1\times k} = (q^T)_{1\times m}\, U_{m\times k}\, \Sigma^{-1}_{k\times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$
ML-40
Heteroscedastic Discriminant Analysis (HDA)
• HDA: Heteroscedastic Discriminant Analysis
• The difference between the projections obtained from LDA and HDA for the 2-class case
  – In theory, HDA yields a much lower classification error than LDA
• However, most statistical modeling approaches assume that data samples are Gaussian and have diagonal covariance matrices
ML-41
HW-3 Feature Transformation (1/4)
• Given two data sets (MaleData, FemaleData) in which each row is a sample with 39 features, please perform the following operations:
  1. Merge these two data sets and find/plot the covariance matrix for the merged data set
  2. Apply PCA and LDA transformations to the merged data set, respectively. Also find/plot the covariance matrices for the transformations, respectively. Describe the phenomena that you have observed
  3. Use the first two principal components of PCA as well as the first two eigenvectors of LDA to represent the merged data set. Selectively plot portions of samples from MaleData and FemaleData, respectively. Describe the phenomena that you have observed
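Step 1 (merge, then compute the covariance matrix) can be sketched as follows. This is a minimal illustration in plain Python, not the assignment's reference solution: the two tiny lists are hypothetical stand-ins for MaleData and FemaleData (2 features instead of 39), and the sample covariance with divisor N - 1 is an assumed convention.

```python
def covariance_matrix(rows):
    """Sample covariance matrix (divisor N - 1) of a list of equal-length rows."""
    n = len(rows)
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(dim)]
    cov = [[0.0] * dim for _ in range(dim)]
    for r in rows:
        centered = [r[j] - mean[j] for j in range(dim)]
        for a in range(dim):
            for b in range(dim):
                cov[a][b] += centered[a] * centered[b]
    return [[c / (n - 1) for c in row] for row in cov]

# Hypothetical stand-ins for MaleData and FemaleData (each row is one sample).
male_data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
female_data = [[4.0, 1.0], [5.0, 3.0]]

merged = male_data + female_data      # merging = stacking the rows
cov = covariance_matrix(merged)
for row in cov:
    print(row)
```

The diagonal holds the per-feature variances and the off-diagonal entries hold the covariances between features; after a PCA transformation the off-diagonal entries of the transformed data's covariance matrix should be close to zero.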
ML-42
HW-3 Feature Transformation (2/4)
ML-43
HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar = [3.0 0.5 0.4;
             0.9 6.3 0.2;
             0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE = [3.0 3.5 1.4;
          1.9 6.3 2.2;
          2.4 0.4 4.2];
    WI = [4.0 4.1 2.1;
          2.9 8.7 3.5;
          4.4 3.2 4.3];
    % LDA
    IWI = inv(WI);
    A = IWI*BE;
    % PCA
    A = BE + WI;           % why? (Prove it)
    [V,D] = eig(A);        % all eigenvectors and eigenvalues
    [V,D] = eigs(A,3);     % or: only the 3 leading ones
    fid = fopen('Basis','w');
    for i = 1:3            % feature vector length
        for j = 1:3        % basis number
            fprintf(fid, '%10.10f ', V(i,j));
        end
        fprintf(fid, '\n');
    end
    fclose(fid);
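A sketch of the argument behind "A = BE + WI" for PCA, assuming that BE and WI denote the between-class and within-class scatter matrices $S_B$ and $S_W$ (the slide does not define them explicitly). With class means $\mu_c$, class sizes $N_c$, and the overall mean $\mu$, the total scatter $S_T$ used by PCA splits as:

```latex
\begin{align*}
S_T &= \sum_{c} \sum_{i \in c} (x_i - \mu)(x_i - \mu)^T \\
    &= \sum_{c} \sum_{i \in c}
       \bigl[(x_i - \mu_c) + (\mu_c - \mu)\bigr]
       \bigl[(x_i - \mu_c) + (\mu_c - \mu)\bigr]^T \\
    &= \underbrace{\sum_{c} \sum_{i \in c} (x_i - \mu_c)(x_i - \mu_c)^T}_{S_W}
     + \underbrace{\sum_{c} N_c\, (\mu_c - \mu)(\mu_c - \mu)^T}_{S_B}
\end{align*}
```

The two cross terms vanish because $\sum_{i \in c} (x_i - \mu_c) = 0$ within every class. The eigenvectors of $S_B + S_W$ are therefore those of the total scatter (equivalently, of the covariance matrix up to a constant factor), which is what PCA diagonalizes.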
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2,000 original samples after the PCA transformation; axes: feature 1 vs. feature 2]
[Figure: distribution of the 2,000 original samples after the LDA transformation; axes: feature 1 vs. feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with “latent” semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
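LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA (feature space; orthonormal basis $\varphi_1, \ldots, \varphi_k$; the $n$-dimensional $\mathbf{x}$ is mapped to the $k$-dimensional $\mathbf{y}$)
      $y_i = \varphi_i^T \mathbf{x}$, $\hat{\mathbf{x}} = \sum_{i=1}^{k} y_i \varphi_i$
      $\min \|\mathbf{x} - \hat{\mathbf{x}}\|^2$ for a given $k$
  – SVD (in LSA) (latent semantic space)
      $A_{m \times n} \approx A'_{m \times n} = U' \Sigma' V'^T$, keeping the leading $k$ of the $r$ dimensions, $r \le \min(m, n)$
      $\min \|A - A'\|_F^2$ for a given $k$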
ML-47
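LSA (3/7)
  – Singular Value Decomposition (SVD) is used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector $\mathbf{x}$ onto the basis vectors $\varphi_1, \varphi_2$ (coordinates $y_1, y_2$; $\theta$ is the angle between $\mathbf{x}$ and $\varphi_1$):
  $y_1 = \varphi_1^T \mathbf{x} = \|\varphi_1\| \|\mathbf{x}\| \cos\theta = \|\mathbf{x}\| \cos\theta$, where $\|\varphi_1\| = 1$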
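The projection of a vector onto an orthonormal basis can be checked numerically. The sketch below uses a hypothetical 2-dimensional basis and vector chosen only for illustration; it computes the coordinates $y_1, y_2$, confirms $y_1 = \|\mathbf{x}\| \cos\theta$, and rebuilds $\mathbf{x}$ from its coordinates.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

# Hypothetical orthonormal basis (a rotated coordinate frame) and input vector.
phi1 = [0.6, 0.8]
phi2 = [-0.8, 0.6]
x = [2.0, 1.0]

y1 = dot(phi1, x)          # coordinate of x along phi1
y2 = dot(phi2, x)          # coordinate of x along phi2

# Angle between x and phi1; since |phi1| = 1, y1 = |x| cos(theta).
cos_theta = y1 / (norm(phi1) * norm(x))

# With a complete orthonormal basis, x is recovered without error.
x_rebuilt = [y1 * a + y2 * b for a, b in zip(phi1, phi2)]
print(y1, y2, x_rebuilt)
```

Dropping $y_2$ and keeping only $y_1 \varphi_1$ gives the one-dimensional least-squares approximation of $\mathbf{x}$ along $\varphi_1$.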
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each represented by its terms and by a structure model; labeled paths: literal term matching, doc expansion, query expansion, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: “human computer interaction”
An OOV (out-of-vocabulary) word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $A_{m \times n} = U_{m \times r} \Sigma_{r} V^T_{r \times n}$, $r \le \min(m, n)$ (rows of $A$: words $w_1, \ldots, w_m$; columns: docs $d_1, \ldots, d_n$)
  $A'_{m \times n} = U'_{m \times k} \Sigma_{k} V'^T_{k \times n}$, $k \le r$
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row $A \in R^n$, Col $A \in R^m$
  – Docs and queries are represented in a $k$-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
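The relation $\|A\|_F^2 \ge \|A'\|_F^2$ can be illustrated for $k = 1$. The sketch below is not how SVD packages compute the decomposition: it swaps in plain power iteration on $A^T A$ to obtain the leading singular triplet, which assumes $\sigma_1 > \sigma_2$ and a starting vector that is not orthogonal to $v_1$. The 2x2 matrix is a hypothetical example.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def frobenius_sq(M):
    """Squared Frobenius norm: the sum of all squared entries."""
    return sum(a * a for row in M for a in row)

def leading_singular_triplet(A, iterations=100):
    """Power iteration on A^T A; returns (sigma_1, u_1, v_1)."""
    At = transpose(A)
    v = [1.0] + [0.0] * (len(A[0]) - 1)
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))
        length = math.sqrt(sum(x * x for x in w))
        v = [x / length for x in w]
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

# Hypothetical word-document matrix (2 words x 2 docs).
A = [[3.0, 0.0],
     [4.0, 5.0]]

sigma1, u1, v1 = leading_singular_triplet(A)
A1 = [[sigma1 * ui * vj for vj in v1] for ui in u1]     # rank-1 matrix A'
residual = [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, A1)]

print(frobenius_sq(A), frobenius_sq(A1), frobenius_sq(residual))
```

For this matrix the squared singular values are 45 and 5, so the rank-1 approximation keeps 45 of the total 50 and the residual carries the remaining 5.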
ML-52
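LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$): $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
    • Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$, with $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
      – As the square roots of the eigenvalues of $A^T A$
      – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
          $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
  – For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$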
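The two definitions of the singular values (square roots of the eigenvalues of $A^T A$, and lengths of the vectors $Av_i$) can be compared on a small case. The sketch below uses a hypothetical 2x2 matrix and the closed-form eigen decomposition of a symmetric 2x2 matrix, so no numerical library is needed; larger matrices would call for an eigen solver instead.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical 2 x 2 matrix used only for illustration.
A = [[3.0, 0.0],
     [4.0, 5.0]]

# G = A^T A is symmetric: [[a, b], [b, d]].
G = [[sum(A[k][i] * A[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
a, b, d = G[0][0], G[0][1], G[1][1]

# Closed-form eigenvalues of a symmetric 2 x 2 matrix, largest first.
center = (a + d) / 2.0
half_gap = math.sqrt(((a - d) / 2.0) ** 2 + b * b)
lambdas = [center + half_gap, center - half_gap]
sigmas = [math.sqrt(lam) for lam in lambdas]        # sigma_j = sqrt(lambda_j)

def unit_eigenvector(lam):
    # (b, lam - a) solves (G - lam I) v = 0 whenever b != 0.
    v = [b, lam - a]
    length = math.hypot(v[0], v[1])
    return [v[0] / length, v[1] / length]

V = [unit_eigenvector(lam) for lam in lambdas]      # v_1, v_2
AV = [matvec(A, v) for v in V]                      # A v_1, A v_2
lengths = [math.hypot(w[0], w[1]) for w in AV]      # |A v_i|
inner = AV[0][0] * AV[1][0] + AV[0][1] * AV[1][1]   # (A v_1) . (A v_2)

print(lambdas, sigmas, lengths, inner)
```

Here the eigenvalues are 45 and 5, each $\|Av_i\|$ matches $\sqrt{\lambda_i}$, and $Av_1$ is orthogonal to $Av_2$.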
ML-53
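LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
    $Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$
      $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
      $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
      $\Rightarrow [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r]$
      ($V$: an orthonormal matrix ($n \times r$), known in advance; $U$: also an orthonormal matrix ($m \times r$))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $[u_1 \cdots u_r \cdots u_m]\, \Sigma = A\, [v_1 \cdots v_r \cdots v_n]$
      $\Rightarrow U\Sigma = AV \Rightarrow U \Sigma V^T = A V V^T \Rightarrow A = U \Sigma V^T$ (since $V V^T = I_{n \times n}$)
      where $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
  – $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$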
ML-54
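LSA Derivations (3/7)
  $A_{m \times n} = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$
  $AV = U\Sigma$
[Figure: $A$ maps $R^n$ to $R^m$, with $V^T$ and $U$ as the changes of basis; $R^n$ is split into $V_1$ and $V_2$ and $R^m$ into $U_1$ and $U_2$. The $v_i$ span the row space of $A$, the $u_i$ span the row space of $A^T$, and $V_2$ is labeled $AX = 0$]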
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
      $A = U \Sigma V^T \Rightarrow AV = U \Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
      $A = U \Sigma V^T \Rightarrow A^T U = (U \Sigma V^T)^T U = V \Sigma^T U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$, $m \times n$; rows $w_1, \ldots, w_m$, columns $d_1, \ldots, d_n$)
    • compare two terms → dot product of two rows of $A$ (or an entry in $AA^T$)
    • compare two docs → dot product of two columns of $A$ (or an entry in $A^TA$)
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms ($w_i$, $w_j$) → dot product of two rows of $U'\Sigma'$
        $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$
    • compare two docs ($d_k$, $d_s$) → dot product of two rows of $V'\Sigma'$
        $A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
    • compare a query word and a doc → each individual entry of $A'$
  – $\Sigma'$ is for stretching or shrinking; $V'^TV'$ and $U'^TU'$ reduce to the identity matrix
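The term-term and doc-doc comparisons can be checked on a small case. In the sketch below the matrix and its SVD factors are a hypothetical example written out by hand (the factors satisfy $A = U \Sigma V^T$ for this particular $A$), and all $r = 2$ dimensions are kept, so $A' = A$ and the latent-space dot products must reproduce the entries of $AA^T$ and $A^TA$.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scale_columns(M, s):
    """Multiply column j of M by s[j], i.e. compute M * diag(s)."""
    return [[m * sj for m, sj in zip(row, s)] for row in M]

# Hypothetical matrix: rows are terms w1, w2; columns are docs d1, d2.
A = [[3.0, 0.0],
     [4.0, 5.0]]

# Hand-derived SVD factors of this A (columns of U and V are orthonormal).
s10, s2 = math.sqrt(10.0), math.sqrt(2.0)
U = [[1 / s10, 3 / s10],
     [3 / s10, -1 / s10]]
S = [math.sqrt(45.0), math.sqrt(5.0)]
V = [[1 / s2, 1 / s2],
     [1 / s2, -1 / s2]]

US = scale_columns(U, S)      # each row is a term vector
VS = scale_columns(V, S)      # each row is a doc vector

term_sim = dot(US[0], US[1])                                    # w1 vs. w2
doc_sim = dot(VS[0], VS[1])                                     # d1 vs. d2
term_sim_direct = dot(A[0], A[1])                               # entry of A A^T
doc_sim_direct = dot([r[0] for r in A], [r[1] for r in A])      # entry of A^T A

print(term_sim, term_sim_direct, doc_sim, doc_sim_direct)
```

With $k < r$ the same two dot products are taken over only the first $k$ columns of $U\Sigma$ and $V\Sigma$, which is where the latent-space similarities start to differ from the literal ones.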
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new $m \times 1$ query (or doc) vector
        $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
      – $q^T U$: the query is represented by the weighted sum of its constituent term vectors
      – $\Sigma^{-1}$: the separate dimensions are differentially weighted
      – $\hat{q}$ is just like a row of $V$
  – Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
        $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$
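A minimal sketch of fold-in and the latent-space cosine measure, under stated assumptions: `U_k` and `S_k` are hand-derived SVD factors of the hypothetical matrix `[[3, 0], [4, 5]]`, the function names are illustrative, and all $k = r = 2$ dimensions are kept so the results can be verified by hand.

```python
import math

# Hand-derived factors of the hypothetical A = [[3, 0], [4, 5]].
s10 = math.sqrt(10.0)
U_k = [[1 / s10, 3 / s10],
       [3 / s10, -1 / s10]]
S_k = [math.sqrt(45.0), math.sqrt(5.0)]

def fold_in(q, U_k, S_k):
    """q_hat = q^T U_k Sigma_k^{-1}, returned as a length-k row vector."""
    k = len(S_k)
    return [sum(q[i] * U_k[i][j] for i in range(len(q))) / S_k[j] for j in range(k)]

def latent_cosine(q_hat, d_hat, S_k):
    """Cosine between q_hat * Sigma and d_hat * Sigma."""
    qs = [a * s for a, s in zip(q_hat, S_k)]
    ds = [a * s for a, s in zip(d_hat, S_k)]
    numerator = sum(a * b for a, b in zip(qs, ds))
    return numerator / (math.sqrt(sum(a * a for a in qs)) *
                        math.sqrt(sum(b * b for b in ds)))

d1_hat = fold_in([3.0, 4.0], U_k, S_k)   # column d1 of A, folded in as a doc
q_hat = fold_in([1.0, 0.0], U_k, S_k)    # hypothetical query containing only w1
score = latent_cosine(q_hat, d1_hat, S_k)

print(d1_hat, q_hat, score)
```

Folding in a column of the original matrix returns the corresponding row of $V$, here $(1/\sqrt{2}, 1/\sqrt{2})$, which is the sense in which a folded-in vector is "just like a row of V". Because nothing is truncated in this example, the score (0.6) equals the ordinary cosine between the raw query and doc vectors; with $k < r$ the two would generally differ.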
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
    $\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems (“heterogeneous vocabulary”)
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (Row: Term, Col: Doc)
        2.3 0.0 4.2
        0.0 1.3 2.2
        3.8 0.0 0.5
        0.0 0.0 0.0
  – Each entry is weighted by TFxIDF score
  – The corresponding sparse input file:
        4 3 6      (4 terms, 3 docs, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
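The column-by-column layout shown on this slide can be produced from a dense matrix with a few lines of code. The sketch below follows only what the slide shows (a header with the row count, column count, and nonzero count, then for each column its nonzero count followed by "row value" pairs); the function name is illustrative, and the exact number formatting expected by the toolkit is an assumption.

```python
def to_sparse_text(matrix):
    """Serialize a dense term-by-doc matrix column by column."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Keep (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0.0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = ["%d %d %d" % (n_rows, n_cols, total)]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend("%d %s" % (r, value) for r, value in entries)
    return "\n".join(lines) + "\n"

# The 4-term x 3-doc example from the slide.
term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]

text = to_sparse_text(term_doc)
print(text)
```

Storing only the nonzero entries matters at realistic scale: a term-doc matrix with tens of thousands of terms and thousands of docs is almost entirely zeros.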
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852    (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (word vector ($u^T$): 1x100; 51253 words)
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
• LSA100-S (100 eigenvalues)
        100
        2686.18
        829.941
        559.59
        …
• LSA100-Vt (doc vector ($v^T$): 1x100; 2265 docs)
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector (TFxIDF weighted beforehand)
    $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
  – $q^T U$: the query is represented by the weighted sum of its constituent term vectors
  – $\Sigma^{-1}$: the separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$
100 512530003 0001 helliphellip0002 0002 helliphellip
word vector (uT) 1x100
51253 words
10026861882994155959hellip
100 eigenvalues
bull LSA100-Vt
100 22650021 0035 helliphellip0012 0022 helliphellip
doc vector (vT) 1x100
2265 docs
ML-65
LSA Toolkit SVDLIBC (55)
bull Fold-in a new mx1 query vector
bull Cosine measure between the query and doc vectors in the latent semantic space
( ) 111ˆ minus
timestimestimestimes Σ= kkkmmT
k UqqQuery represented by the weightedsum of it constituent term vectors
The separate dimensions are differentially weighted
Just like a row of V
( )ΣΣ
Σ=ΣΣ=
dq
dqdqcoinedqsimT
ˆˆ
ˆˆ)ˆˆ(ˆˆ
2
TFxIDF weighted beforehand
ML-42
HW-3 Feature Transformation (2/4)
ML-43
HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
    CoVar=[3.0 0.5 0.4
           0.9 6.3 0.2
           0.4 0.4 4.2];
    colormap('default');
    surf(CoVar);
• Eigen Decomposition
    BE=[3.0 3.5 1.4
        1.9 6.3 2.2
        2.4 0.4 4.2];
    WI=[4.0 4.1 2.1
        2.9 8.7 3.5
        4.4 3.2 4.3];
    % LDA
    IWI=inv(WI);
    A=IWI*BE;
    % PCA
    A=BE+WI;            % why? (Prove it)
    [V,D]=eig(A);       % all eigenvectors and eigenvalues
    [V,D]=eigs(A,3);    % or: the 3 largest-magnitude eigenpairs
    % Write the basis vectors to a file
    fid=fopen('Basis','w');
    for i=1:3           % feature vector length
      for j=1:3         % basis number
        fprintf(fid,'%10.10f ',V(i,j));
      end
      fprintf(fid,'\n');
    end
    fclose(fid);
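One possible sketch of the requested proof, assuming BE and WI denote the between-class and within-class scatter matrices (this section of the slides does not define them). The symbols $C_c$ (class $c$), $N_c$ (its size), $\mu_c$ (its mean) and $\mu$ (the global mean) are introduced here for illustration only. The cross terms vanish because $\sum_{x \in C_c} (x - \mu_c) = 0$, so the total scatter used by PCA splits into the two parts:

```latex
\begin{aligned}
S_T &= \sum_{c} \sum_{x \in C_c} (x - \mu)(x - \mu)^T \\
    &= \sum_{c} \sum_{x \in C_c} \big[(x - \mu_c) + (\mu_c - \mu)\big]\big[(x - \mu_c) + (\mu_c - \mu)\big]^T \\
    &= \underbrace{\sum_{c} \sum_{x \in C_c} (x - \mu_c)(x - \mu_c)^T}_{S_W \ (\text{WI})}
     + \underbrace{\sum_{c} N_c\, (\mu_c - \mu)(\mu_c - \mu)^T}_{S_B \ (\text{BE})}
\end{aligned}
```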
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2,000 original data points after the PCA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
[Figure: distribution of the 2,000 original data points after the LDA transform; axes: feature 1 (horizontal), feature 2 (vertical)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with “latent” semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
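LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA (feature space): the n-dimensional $X$ is mapped to the k-dimensional $Y$ using an orthonormal basis $\varphi_1, \dots, \varphi_k$
      $y_i = \varphi_i^T X$, $\hat{X} = \sum_{i=1}^{k} y_i \varphi_i$
      criterion: $\min \|X - \hat{X}\|^2$ for a given $k$
  – SVD (in LSA) (latent semantic space): the $m \times n$ matrix $A$ is approximated by the $m \times n$ matrix $A'$
      $A' = U' \Sigma' V'^T$, with $U'$ of size $m \times k$, $\Sigma'$ of size $k \times k$, $V'^T$ of size $k \times n$, and $k \le r \le \min(m, n)$
      criterion: $\min \|A - A'\|_F^2$ for a given $k$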
ML-47
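LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a Vector x
  $y_1 = \varphi_1^T x = \|x\| \, \frac{\varphi_1^T x}{\|\varphi_1\| \, \|x\|} = \|x\| \cos\theta_1$, where $\|\varphi_1\| = 1$
[Figure: vector $x$ projected onto the axes $\varphi_1$ and $\varphi_2$, giving the coordinates $y_1$ and $y_2$]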
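The projection above can be checked numerically. This is a minimal sketch in plain Python; the vector `x`, the basis vectors `phi1` and `phi2`, and the helper name `project` are illustrative choices, not part of the slides.

```python
import math

def project(x, phi):
    """Return y = phi^T x, the coordinate of x along the unit vector phi."""
    return sum(p * xi for p, xi in zip(phi, x))

x = [3.0, 4.0]
phi1 = [1.0, 0.0]   # unit vector along the first axis
phi2 = [0.0, 1.0]   # unit vector along the second axis

y1 = project(x, phi1)
y2 = project(x, phi2)

# Because ||phi1|| = 1, y1 equals ||x|| * cos(theta1).
norm_x = math.sqrt(sum(xi * xi for xi in x))
cos_theta1 = y1 / norm_x
```

With an orthonormal basis, the squared coordinates add up to the squared length of `x`, which is why dropping axes with small coordinates loses little in the least-squares sense.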
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each shown with their terms and a structure model; diagram labels: doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: “human computer interaction”
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n}$ (rows of $A$: words $w_1, w_2, \dots, w_m$; columns of $A$: docs $d_1, d_2, \dots, d_n$)
  $A'_{m \times n} = U'_{m \times k} \, \Sigma_k \, V'^T_{k \times n}$ (same word rows and doc columns)
  – $r \le \min(m, n)$ and $k \le r$
  – Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row $A \subseteq R^n$, Col $A \subseteq R^m$
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
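LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$): $V_{n \times n} = [v_1 \ v_2 \ \dots \ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \dots, n$, so that $\Sigma^2 = diag(\lambda_1, \dots, \lambda_n)$
  – As the square roots of the eigenvalues of $A^T A$
  – As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
      $\|Av_i\|^2 = (Av_i)^T Av_i = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
• For $\lambda_i \ne 0$, $i = 1, \dots, r$, $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col $A$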
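The two definitions of the singular values can be compared on a small case. This is a minimal sketch in plain Python, assuming a 2x2 example matrix chosen for illustration; the helper names `gram`, `eig_sym_2x2` and `matvec` are hypothetical, and the closed-form eigen solver only covers symmetric 2x2 matrices with a nonzero off-diagonal entry.

```python
import math

def gram(A):
    """Return A^T A for a matrix stored as a list of rows."""
    n = len(A[0])
    return [[sum(row[i] * row[j] for row in A) for j in range(n)] for i in range(n)]

def eig_sym_2x2(S):
    """Eigenpairs of a symmetric 2x2 matrix with nonzero off-diagonal entry,
    sorted by descending eigenvalue; eigenvectors have unit length."""
    a, b, d = S[0][0], S[0][1], S[1][1]
    mean = (a + d) / 2.0
    half = math.hypot((a - d) / 2.0, b)
    pairs = []
    for lam in (mean + half, mean - half):
        v = (b, lam - a)                      # solves (S - lam*I) v = 0 when b != 0
        norm = math.hypot(v[0], v[1])
        pairs.append((lam, (v[0] / norm, v[1] / norm)))
    return pairs

def matvec(A, v):
    """Return the product A v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[3.0, 0.0], [4.0, 5.0]]                  # illustrative matrix, not from the slides
pairs = eig_sym_2x2(gram(A))

# Definition 1: square roots of the eigenvalues of A^T A.
sigmas = [math.sqrt(lam) for lam, _ in pairs]
# Definition 2: lengths of the vectors A v_i.
images = [matvec(A, v) for _, v in pairs]
lengths = [math.hypot(w[0], w[1]) for w in images]
```

For this matrix, $A^T A$ has eigenvalues 45 and 5, so both definitions give $\sqrt{45}$ and $\sqrt{5}$, and the two image vectors $Av_1$ and $Av_2$ are orthogonal.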
ML-53
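LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col $A$
    $Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  – Suppose that $A$ (or $A^T A$) has rank $r \le n$
      $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col $A$
      $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
      $\Rightarrow [u_1 \ u_2 \ \dots \ u_r] \, \Sigma_r = A \, [v_1 \ v_2 \ \dots \ v_r]$
      ($V$: an orthonormal matrix ($n \times r$), known in advance; $U$: also an orthonormal matrix ($m \times r$))
    • Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$
      $\Rightarrow [u_1 \ \dots \ u_r \ \dots \ u_m] \, \Sigma = A \, [v_1 \ \dots \ v_r \ \dots \ v_n]$
      $\Rightarrow U \Sigma = AV \Rightarrow U \Sigma V^T = A V V^T \Rightarrow A = U \Sigma V^T$ (using $V V^T = I_{n \times n}$)
      where $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_r & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
  – $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$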
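The Frobenius-norm identity on this slide is stated without its intermediate steps. One way to see it, sketched here with the trace operator (not used on the slides) and the decomposition $A = U \Sigma V^T$ with $U^T U = I$ and $V^T V = I$:

```latex
\begin{aligned}
\|A\|_F^2 &= \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \operatorname{tr}(A^T A) \\
          &= \operatorname{tr}\!\big(V \Sigma^T U^T U \Sigma V^T\big)
           = \operatorname{tr}\!\big(V \Sigma^T \Sigma V^T\big) \\
          &= \operatorname{tr}\!\big(\Sigma^T \Sigma \, V^T V\big)
           = \operatorname{tr}\!\big(\Sigma^T \Sigma\big)
           = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2
\end{aligned}
```

Keeping only the $k$ largest singular values therefore removes $\sigma_{k+1}^2 + \dots + \sigma_r^2$ from the squared norm, which is consistent with $\|A\|_F^2 \ge \|A'\|_F^2$.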
ML-54
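LSA Derivations (3/7)
  $A_{m \times n} = U \Sigma V^T = (U_1 \ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T$
  $U \Sigma = AV$
[Figure: $A$ maps $R^n$ (basis $V$, split into $V_1$ and $V_2$) to $R^m$ (basis $U$, split into $U_1$ and $U_2$); diagram labels: “$v_i$ spans the row space of $A$”, “$u_i^T$ spans the row space of $A^T$”, “$AX = 0$”]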
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by columns of $V$
      $A = U \Sigma V^T \Rightarrow AV = U \Sigma V^T V = U \Sigma \Rightarrow U \Sigma = AV$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by columns of $U$
      $A = U \Sigma V^T \Rightarrow A^T U = (U \Sigma V^T)^T U = V \Sigma U^T U = V \Sigma \Rightarrow V \Sigma = A^T U$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$, $m \times n$; rows $w_1, w_2, \dots, w_m$; columns $d_1, d_2, \dots, d_n$)
    • compare two terms → dot product of two rows of $A$
      – or an entry in $A A^T$
    • compare two docs → dot product of two columns of $A$
      – or an entry in $A^T A$
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U' \Sigma'$
        $A' A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U' \Sigma')(U' \Sigma')^T$
    • compare two docs → dot product of two rows of $V' \Sigma'$
        $A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T$
    • compare a query word and a doc → each individual entry of $A'$
  – In both products the inner factor ($V'^T V'$ or $U'^T U'$) is the identity $I$; $\Sigma'$ is for stretching or shrinking
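The two identities above can be verified on a toy case. This is a minimal sketch in plain Python that starts from hand-picked factors with orthonormal columns (hypothetical values, $m = 3$ terms, $n = 2$ docs, $k = 2$) instead of running an SVD; the helper names `matmul` and `transpose` are illustrative.

```python
def matmul(X, Y):
    """Plain matrix product of two lists of rows."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(col) for col in zip(*X)]

# Hypothetical factors with orthonormal columns (not computed from real data).
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # m x k, one row per term
S = [[2.0, 0.0], [0.0, 1.0]]               # k x k, singular values on the diagonal
V = [[0.6, 0.8], [0.8, -0.6]]              # n x k, one row per doc

A = matmul(matmul(U, S), transpose(V))     # A' = U' S' V'^T

US = matmul(U, S)                          # scaled term vectors (rows)
VS = matmul(V, S)                          # scaled doc vectors (rows)

term_sims = matmul(US, transpose(US))      # should equal A' A'^T
doc_sims = matmul(VS, transpose(VS))       # should equal A'^T A'
```

The practical point is that term-term and doc-doc comparisons never need the full $m \times n$ matrix $A'$; the much smaller $k$-dimensional rows of $U' \Sigma'$ and $V' \Sigma'$ are enough.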
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m \times 1$ query (or doc) vector
    $\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{\|\hat{q} \Sigma\| \, \|\hat{d} \Sigma\|}$ ($\hat{q}$ and $\hat{d}$ are row vectors)
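The fold-in and cosine formulas can be sketched as follows. The factors `U`, `sigma` and the vectors `doc0` and `q` are hypothetical toy values (3 terms, $k = 2$), and the function names `fold_in` and `latent_cosine` are illustrative; a real system would take $U_k$ and $\Sigma_k$ from an SVD of the term-doc matrix.

```python
import math

def fold_in(q, U, sigma):
    """q_hat = q^T U_k Sigma_k^{-1}; q has m entries, U is m x k, sigma has k entries."""
    return [sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j]
            for j in range(len(sigma))]

def latent_cosine(q_hat, d_hat, sigma):
    """Cosine between the scaled row vectors q_hat*Sigma and d_hat*Sigma."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [a * s for a, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    return dot / (math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds)))

# Hypothetical factors: U has orthonormal columns, sigma holds the diagonal of Sigma_k.
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma = [2.0, 1.0]

# First column of A = U diag(sigma) V^T for V = [[0.6, 0.8], [0.8, -0.6]].
doc0 = [1.2, 0.8, 0.0]
d_hat = fold_in(doc0, U, sigma)      # recovers the matching row of V

q = [1.0, 0.0, 0.0]                  # query containing only the first term
q_hat = fold_in(q, U, sigma)
sim = latent_cosine(q_hat, d_hat, sigma)
```

Folding in a column that was part of the original matrix returns the corresponding row of $V$, which is the sense in which $\hat{q}$ is "just like a row of $V$".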
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
    $\hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k}$
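The term fold-in formula follows from the same relation used earlier, $U \Sigma = AV$. A short sketch of the step, where $a_i$ (the i-th row of $A$) and $u_i$ (the i-th row of $U$) are notation introduced here for illustration:

```latex
\begin{aligned}
A = U \Sigma V^T \;&\Rightarrow\; A V = U \Sigma \;\Rightarrow\; U = A V \Sigma^{-1} \\
&\Rightarrow\; u_i = a_i \, V \, \Sigma^{-1} \quad \text{(row } i \text{ of } U \text{ from row } i \text{ of } A\text{)} \\
&\Rightarrow\; \hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k} \quad \text{for a new term row } t
\end{aligned}
```

A folded-in term therefore lives in the same space as the rows of $U$, just as a folded-in query or doc lives in the same space as the rows of $V$.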
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems (“heterogeneous vocabulary”)
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (rows: terms; columns: docs)
        2.3 0.0 4.2
        0.0 1.3 2.2
        3.8 0.0 0.5
        0.0 0.0 0.0
  – Each entry is weighted by TFxIDF score
  – The same matrix in sparse text form (Row: term; Col: doc)
        4 3 6      (4 terms, 3 docs, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
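The sparse text layout shown on this slide (a header with rows, columns and nonzero count, then for each column its nonzero count followed by "row value" pairs) can be produced with a few lines of code. This is a minimal sketch in plain Python; the function name `to_sparse_text` is hypothetical and the number formatting (`%g`) is an assumption, not something the slide specifies.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) column by column, 0-based row indices."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
               for c in range(n_cols)]
    nnz = sum(len(col) for col in columns)
    lines = ["%d %d %d" % (n_rows, n_cols, nnz)]      # header: rows, cols, nonzeros
    for col in columns:
        lines.append(str(len(col)))                   # nonzero count of this column
        lines.extend("%d %g" % (r, v) for r, v in col)
    return "\n".join(lines)

# The 4-term, 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(term_doc)
```

Writing `text` to a file gives an input in the layout the slide passes to `svd -r st`; the all-zero fourth term simply never appears in any column.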
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852     (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
  – word vector ($u^T$): 1×100
  – 51253 words
• LSA100-S
        100
        2686.18
        829.941
        559.59
        …
  – 100 singular values
• LSA100-Vt
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
  – doc vector ($v^T$): 1×100
  – 2265 docs
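Reading these output files back is straightforward if one assumes the layout suggested by the slide: a header giving the dimensions, followed by whitespace-separated values. This is a minimal sketch in plain Python under that assumption; the function names and the tiny sample strings are hypothetical, and the actual file format should be checked against the SVDLIBC documentation.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols numbers into a list of rows."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("expected %d values, got %d" % (n_rows * n_cols, len(values)))
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def parse_singular_values(text):
    """Parse a count followed by that many singular values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(t) for t in tokens[1:]]
    if len(values) != count:
        raise ValueError("expected %d values, got %d" % (count, len(values)))
    return values

def column(matrix, j):
    """Vector of item j: the j-th column of a (dimensions x items) matrix."""
    return [row[j] for row in matrix]

# Tiny stand-ins for LSA100-Ut and LSA100-S (2 dimensions, 3 words).
ut = parse_dense_text("2 3\n0.003 0.001 0.5\n0.002 0.002 0.25\n")
s = parse_singular_values("2\n2686.18\n829.941\n")
word1 = column(ut, 1)     # latent vector of the second word
```

Because the files store $U^T$ and $V^T$, each word or doc vector is a column of the parsed matrix rather than a row.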
ML-65
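LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector ($q$ is TFxIDF weighted beforehand)
    $\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{\|\hat{q} \Sigma\| \, \|\hat{d} \Sigma\|}$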
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-43
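HW-3 Feature Transformation (3/4)
• Plot Covariance Matrix
• Eigen Decomposition

% Plot the covariance matrix as a surface
CoVar = [3.0 0.5 0.4; 0.9 6.3 0.2; 0.4 0.4 4.2];
colormap('default');
surf(CoVar);
% Between-class (BE) and within-class (WI) matrices
BE = [3.0 3.5 1.4; 1.9 6.3 2.2; 2.4 0.4 4.2];
WI = [4.0 4.1 2.1; 2.9 8.7 3.5; 4.4 3.2 4.3];
% LDA (choose this A or the PCA one below)
IWI = inv(WI);
A = IWI * BE;
% PCA
A = BE + WI;          % why? (Prove it)
[V, D] = eig(A);      % full eigen decomposition
[V, D] = eigs(A, 3);  % or: the 3 largest-magnitude eigenpairs
% Write the basis vectors to the file 'Basis'
fid = fopen('Basis', 'w');
for i = 1:3           % feature vector length
    for j = 1:3       % basis number
        fprintf(fid, '%10.10f ', V(i, j));
    end
    fprintf(fid, '\n');
end
fclose(fid);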
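The PCA line above asks why A = BE + WI. A short derivation sketch, assuming BE and WI denote the un-normalized between-class and within-class scatter matrices. The symbols $S_B$, $S_W$, $S_T$, $N_c$, $\mu_c$ and $\mu$ are introduced here for illustration only and do not appear in the slides; the homework may use a different normalization.

```latex
\begin{align*}
S_T &= \sum_{c} \sum_{x \in c} (x - \mu)(x - \mu)^T \\
    &= \sum_{c} \sum_{x \in c}
       \bigl[(x - \mu_c) + (\mu_c - \mu)\bigr]
       \bigl[(x - \mu_c) + (\mu_c - \mu)\bigr]^T \\
    &= \underbrace{\sum_{c} \sum_{x \in c} (x - \mu_c)(x - \mu_c)^T}_{S_W}
     + \underbrace{\sum_{c} N_c\, (\mu_c - \mu)(\mu_c - \mu)^T}_{S_B}
\end{align*}
```

The cross terms vanish because $\sum_{x \in c} (x - \mu_c) = 0$ for every class $c$. Under that assumption, BE + WI is the total scatter matrix, which has the same eigenvectors as the covariance matrix that PCA diagonalizes.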
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data points after PCA transformation (axes: feature 1 vs. feature 2)]
[Figure: distribution of the 2000 original data points after LDA transformation (axes: feature 1 vs. feature 2)]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  - Co-occurring terms are projected onto the same dimensions
  - In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  - Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
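LSA (2/7)
• Dimension Reduction and Feature Extraction
  - PCA: project $x \in R^n$ onto an orthonormal basis $\varphi_1, \ldots, \varphi_k$ of the $k$-dimensional feature space
    $y_i = \varphi_i^T x$, $\hat{x} = \sum_{i=1}^{k} y_i \varphi_i$
    $\min \| x - \hat{x} \|^2$ for a given $k$
  - SVD (in LSA): map $A$ into the $k$-dimensional latent semantic space
    $A_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}$, $r \le \min(m, n)$
    $A'_{m \times n} = U'_{m \times k} \Sigma'_{k \times k} V'^T_{k \times n}$
    $\min \| A - A' \|_F^2$ for a given $k$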
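The least-squares criterion for a given k can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function name `truncated_svd` and the toy matrix values are illustrative and not part of the slides. It keeps the k largest singular triplets and compares the squared Frobenius error with the sum of the discarded squared singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U' (m x k), the k largest singular values, and V'^T (k x n)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy 4 x 3 matrix (illustrative values only).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2
Uk, sk, Vtk = truncated_svd(A, k)
A_k = Uk @ np.diag(sk) @ Vtk          # rank-k approximation A'

# Squared Frobenius error versus the energy in the discarded singular values.
s_all = np.linalg.svd(A, compute_uv=False)
err = np.linalg.norm(A - A_k, "fro") ** 2
discarded = float(np.sum(s_all[k:] ** 2))
print(err, discarded)
```

The two printed numbers agree up to rounding, which is the sense in which the truncated SVD is a least-squares method for dimension reduction.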
ML-47
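LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector $x$ onto the axis $\varphi_1$, where $\theta_1$ is the angle between $x$ and $\varphi_1$:
$y_1 = \varphi_1^T x = \|\varphi_1\| \|x\| \cos\theta_1 = \|x\| \cos\theta_1$, where $\|\varphi_1\| = 1$
[Figure: vector $x$ and its projections $y_1$ and $y_2$ onto the axes $\varphi_1$ and $\varphi_2$]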
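A small numeric illustration of projecting a vector onto unit-length axes, written with the Python standard library only. The basis vectors and the vector x below are arbitrary illustrative choices, and the helper name `project` is hypothetical.

```python
import math

def project(x, phi):
    """Coordinate of x along the unit-length axis phi: y = phi^T x."""
    return sum(p * xi for p, xi in zip(phi, x))

# An orthonormal basis of the plane (a rotated pair of axes) and a vector x.
phi1 = [0.6, 0.8]
phi2 = [-0.8, 0.6]
x = [2.0, 1.0]

y1 = project(x, phi1)
y2 = project(x, phi2)

# Because ||phi1|| = 1, y1 equals ||x|| * cos(theta1).
norm_x = math.hypot(*x)
cos_theta1 = y1 / norm_x

# With a complete orthonormal basis, x is recovered as y1*phi1 + y2*phi2.
x_rebuilt = [y1 * a + y2 * b for a, b in zip(phi1, phi2)]
print(y1, y2, cos_theta1, x_rebuilt)
```

Dropping `y2` and keeping only `y1 * phi1` would give the one-dimensional approximation of x, which is the reduction step the slide illustrates.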
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each represented by their terms and by a structure model. Diagram labels: doc expansion, query expansion, literal term matching (between terms), latent semantic structure retrieval (between structure models)]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  Rows of $A$ correspond to words $w_1, w_2, \ldots, w_m$; columns correspond to docs $d_1, d_2, \ldots, d_n$
  $A_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}$, $r \le \min(m, n)$
  $A'_{m \times n} = U'_{m \times k} \Sigma_{k \times k} V'^T_{k \times n}$, $k \le r$
  - Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  - $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  - Row $A \subseteq R^n$, Col $A \subseteq R^m$
  - Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
ML-52
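LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$): $V_{n \times n} = [v_1\ v_2\ \cdots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$
  - As the square roots of the eigenvalues of $A^T A$: $\Sigma^2 = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
  - As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
    $\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$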
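The two definitions of the singular values can be compared numerically. This is a minimal sketch assuming NumPy; the 3 x 2 matrix is an arbitrary example, and the variable names are illustrative.

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [1.0, 1.0],
              [0.0, 3.0]])      # illustrative 3 x 2 matrix

# Eigen-decomposition of the symmetric matrix A^T A.
lam, V = np.linalg.eigh(A.T @ A)       # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]          # reorder so that lambda_1 >= lambda_2 >= ...
lam, V = lam[order], V[:, order]

# Definition 1: square roots of the eigenvalues of A^T A.
sigma_from_eig = np.sqrt(np.clip(lam, 0.0, None))
# Definition 2: lengths of the vectors A v_1, ..., A v_n.
sigma_from_len = np.linalg.norm(A @ V, axis=0)
# Reference: singular values as returned by a library SVD routine.
sigma_from_svd = np.linalg.svd(A, compute_uv=False)
print(sigma_from_eig, sigma_from_len, sigma_from_svd)
```

The `np.clip` call only guards against tiny negative eigenvalues caused by rounding; mathematically all eigenvalues of $A^T A$ are nonnegative.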
ML-53
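LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$
  $Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  - Suppose that $A$ (or $A^T A$) has rank $r \le n$
    $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$
  - Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$
    $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
    $\Rightarrow [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r]$
    ($V$: an orthonormal matrix ($n \times r$), known in advance; $U$: also an orthonormal matrix ($m \times r$))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      $[u_1\ \cdots\ u_r\ \cdots\ u_m]\, \Sigma = A\, [v_1\ \cdots\ v_r\ \cdots\ v_n]$
      $\Rightarrow U \Sigma = A V \Rightarrow U \Sigma V^T = A V V^T \Rightarrow A = U \Sigma V^T$ (using $V V^T = I_{n \times n}$)
      $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
• $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$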
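The construction of the u vectors from the v vectors can be traced on a rank-deficient example. This is a sketch assuming NumPy; the matrix is chosen to have rank 1 so that r < n, and the tolerance used to decide which eigenvalues count as nonzero is an assumption of this example.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])              # rank 1 by construction (illustrative)

lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

tol = 1e-10 * lam[0]                    # assumed threshold for "lambda_i != 0"
r = int(np.sum(lam > tol))              # numerical rank of A
sigma = np.sqrt(lam[:r])

# u_i = A v_i / sigma_i for i = 1..r gives an orthonormal basis of Col A.
U_r = (A @ V[:, :r]) / sigma

# A = U_r Sigma_r V_r^T, and ||A||_F^2 = sigma_1^2 + ... + sigma_r^2.
A_rebuilt = U_r @ np.diag(sigma) @ V[:, :r].T
frob_sq = np.linalg.norm(A, "fro") ** 2
print(r, sigma, frob_sq)
```

Only the first r columns of V are needed to rebuild A; the remaining columns correspond to zero eigenvalues and contribute nothing to the product.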
ML-54
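LSA Derivations (3/7)
[Figure: $A_{m \times n}$ maps $R^n$ (spanned by the columns of $V_1$ and $V_2$) to $R^m$ (spanned by the columns of $U_1$ and $U_2$)]
$U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$
  - $U \Sigma = A V$
  - The $v_i$ (columns of $V_1$) span the row space of $A$
  - The $u_i$ (columns of $U_1$) span the row space of $A^T$
  - The columns of $V_2$ span the solutions of $AX = 0$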
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
    $A = U \Sigma V^T \Rightarrow AV = U \Sigma V^T V = U \Sigma \Rightarrow U \Sigma = AV$
    • The i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  - Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
    $A = U \Sigma V^T \Rightarrow A^T U = (U \Sigma V^T)^T U = V \Sigma U^T U = V \Sigma \Rightarrow V \Sigma = A^T U$
    • The i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix ($A_{m \times n}$; rows $w_1, \ldots, w_m$; columns $d_1, \ldots, d_n$)
    • compare two terms $w_i$, $w_j$ → dot product of two rows of $A$, or an entry in $A A^T$
    • compare two docs $d_k$, $d_s$ → dot product of two columns of $A$, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of $A$
  - The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U' \Sigma'$
      $A' A'^T = (U' \Sigma' V'^T)(U' \Sigma' V'^T)^T = U' \Sigma' V'^T V' \Sigma'^T U'^T = (U' \Sigma')(U' \Sigma')^T$
    • compare two docs → dot product of two rows of $V' \Sigma'$
      $A'^T A' = (U' \Sigma' V'^T)^T (U' \Sigma' V'^T) = V' \Sigma'^T U'^T U' \Sigma' V'^T = (V' \Sigma')(V' \Sigma')^T$
    • compare a query word and a doc → each individual entry of $A'$
  - $\Sigma'$ is for stretching or shrinking; the products $V'^T V'$ and $U'^T U'$ reduce to the identity matrix
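The term-term and doc-doc comparisons in the reduced space can be checked on a toy matrix. This is a sketch assuming NumPy; the 4 x 3 matrix values and the names `term_vecs` and `doc_vecs` are illustrative.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])   # 4 terms x 3 docs, illustrative values
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = Uk @ Sk @ Vk.T              # the new word-document matrix A'

term_vecs = Uk @ Sk               # one k-dimensional row per term
doc_vecs = Vk @ Sk                # one k-dimensional row per doc

# Dot products of rows of U'Sigma' reproduce the entries of A'A'^T,
# and dot products of rows of V'Sigma' reproduce the entries of A'^T A'.
term_term = term_vecs @ term_vecs.T
doc_doc = doc_vecs @ doc_vecs.T
print(term_term.shape, doc_doc.shape)
```

Working with the k-dimensional rows avoids forming the full m x n matrix A', which matters when m and n are in the tens of thousands.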
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new $m \times 1$ query (or doc) vector
      $\hat{q}_{1 \times k} = (q^T)_{1 \times m} U_{m \times k} \Sigma^{-1}_{k \times k}$
      (The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; $\hat{q}$ is just like a row of $V$)
  - Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q} \Sigma^2 \hat{d}^T}{|\hat{q} \Sigma|\, |\hat{d} \Sigma|}$, where $\hat{q}$ and $\hat{d}$ are row vectors
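The fold-in and cosine formulas can be sketched as follows, assuming NumPy. The helper names `fold_in` and `lsa_similarity` and the toy matrix are hypothetical. As a sanity check, folding in a doc that was part of the original matrix should reproduce its row of V.

```python
import numpy as np

def fold_in(q, Uk, sk):
    """q_hat = q^T U_k Sigma_k^{-1}, where q is a length-m term vector."""
    return (q @ Uk) / sk

def lsa_similarity(q_hat, d_hat, sk):
    """Cosine between the row vectors q_hat Sigma_k and d_hat Sigma_k."""
    qs, ds = q_hat * sk, d_hat * sk
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds)))

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])   # 4 terms x 3 docs, illustrative values
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

# Folding in the first doc column gives back the first row of V_k.
d0_hat = fold_in(A[:, 0], Uk, sk)
sim_self = lsa_similarity(d0_hat, Vk[0], sk)

# A hypothetical new query that uses only the first and third terms.
q = np.array([1.0, 0.0, 1.0, 0.0])
q_hat = fold_in(q, Uk, sk)
sims = [lsa_similarity(q_hat, Vk[j], sk) for j in range(Vk.shape[0])]
print(sim_self, sims)
```

Dividing by `sk` works because $\Sigma_k$ is diagonal, so multiplying by its inverse is an element-wise division of the k coordinates.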
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector
  $\hat{t}_{1 \times k} = t_{1 \times n} V_{n \times k} \Sigma^{-1}_{k \times k}$
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (rows: terms; columns: docs)
      2.3 0.0 4.2
      0.0 1.3 2.2
      3.8 0.0 0.5
      0.0 0.0 0.0
  - Each entry is weighted by its TFxIDF score
  - Sparse text input file for this matrix
      4 3 6      (4 rows/terms, 3 cols/docs, 6 nonzero entries)
      2          (2 nonzero entries at Col 0)
      0 2.3      (Col 0, Row 0)
      2 3.8      (Col 0, Row 2)
      1          (1 nonzero entry at Col 1)
      1 1.3      (Col 1, Row 1)
      3          (3 nonzero entries at Col 2)
      0 4.2      (Col 2, Row 0)
      1 2.2      (Col 2, Row 1)
      2 0.5      (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
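The column-wise sparse layout shown for the 4 x 3 example can be produced from a dense matrix with a few lines of standard-library Python. This is a sketch of the layout as the slide describes it (a header with rows, columns and nonzero count, then for each column its nonzero count followed by row-index and value pairs); the function name `to_sparse_text` is hypothetical and is not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense row-major matrix into the column-wise sparse text layout."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0

    # Collect the (row index, value) pairs of the nonzero entries, column by column.
    columns = []
    for c in range(n_cols):
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)

    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]       # header: rows, cols, nonzeros
    for entries in columns:
        lines.append(str(len(entries)))          # nonzero count of this column
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines) + "\n"

# The 4-term, 3-doc example from the slide.
term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
text = to_sparse_text(term_doc)
print(text)
```

Writing `text` to a file would give an input in the shape the slide shows; whether a particular SVDLIBC build accepts it unchanged should be checked against the library's own documentation.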
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (the header gives the no. of indexing terms, the no. of docs, and the no. of nonzero entries)
    51253 2265 218852
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ......
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
    100 51253
    0.003 0.001 ......
    0.002 0.002 ......
  - word vector ($u^T$): 1x100
  - 51253 words
• LSA100-S
    100
    2686.18
    829.941
    559.59
    ...
  - 100 eigenvalues
• LSA100-Vt
    100 2265
    0.021 0.035 ......
    0.012 0.022 ......
  - doc vector ($v^T$): 1x100
  - 2265 docs
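The output files can be read back with a small standard-library parser. This is a sketch under stated assumptions: that the -Ut and -Vt files start with a "rows cols" header followed by whitespace-separated values, and that the -S file starts with a count followed by one value per line, as the listings above suggest. The function names and the tiny stand-in strings are hypothetical; real files should be checked against the SVDLIBC documentation.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols whitespace-separated numbers."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:2 + n_rows * n_cols]]
    if len(values) != n_rows * n_cols:
        raise ValueError("fewer values than the header announces")
    # Slice the flat list into n_rows rows of n_cols values each.
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def parse_singular_values(text):
    """Parse a count followed by that many numbers."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(t) for t in tokens[1:1 + count]]
    if len(values) != count:
        raise ValueError("fewer values than the header announces")
    return values

def column(matrix, j):
    """With a k x m layout, the k-dimensional vector of item j is column j."""
    return [row[j] for row in matrix]

# Tiny stand-ins for LSA100-Ut (2 dimensions x 3 words) and LSA100-S.
ut_text = "2 3\n0.5 0.1 0.2\n0.3 0.4 0.6\n"
s_text = "2\n4.0\n1.5\n"

Ut = parse_dense_text(ut_text)
S = parse_singular_values(s_text)
word1 = column(Ut, 1)      # latent vector of the second word
print(Ut, S, word1)
```

With a header such as "100 51253", each of the 100 rows holds one latent dimension across all words, so the 1x100 vector of a single word is read down a column.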
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector (TFxIDF weighted beforehand)
  $\hat{q}_{1 \times k} = (q^T)_{1 \times m} U_{m \times k} \Sigma^{-1}_{k \times k}$
  (The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; $\hat{q}$ is just like a row of $V$)
• Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q} \Sigma, \hat{d} \Sigma) = \frac{\hat{q} \Sigma^2 \hat{d}^T}{|\hat{q} \Sigma|\, |\hat{d} \Sigma|}$
ML-44
HW-3 Feature Transformation (4/4)
• Examples
[Figure: distribution of the 2000 original data samples after the PCA transformation; axes: feature 1, feature 2]
[Figure: distribution of the 2000 original data samples after the LDA transformation; axes: feature 1, feature 2]
ML-45
Latent Semantic Analysis (LSA) (1/7)
• Also called Latent Semantic Indexing (LSI) or Latent Semantic Mapping (LSM)
• A technique originally proposed for Information Retrieval (IR), which projects queries and docs into a space with "latent" semantic dimensions
  – Co-occurring terms are projected onto the same dimensions
  – In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
  – Dimensions of the reduced space correspond to the axes of greatest variation
• Closely related to Principal Component Analysis (PCA)
ML-46
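LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA: X in the n-dimensional feature space is projected onto an orthonormal basis $\phi_1, \ldots, \phi_k$ and represented by Y in k dimensions
    \[ y_i = \phi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i \phi_i, \qquad \min \|X - \hat{X}\|^2 \ \text{for a given } k \]
  – SVD (in LSA): the m×n matrix A is approximated by A' in a k-dimensional latent semantic space
    \[ A'_{m \times n} = U'_{m \times r}\, \Sigma'_{r \times r}\, V'^T_{r \times n}, \quad r \le \min(m, n), \qquad \min \|A - A'\|_F^2 \ \text{for a given } k \]
    Only the leading k×k block of Σ' is kept, together with the matching k columns of U' and k rows of V'^T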
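The least-squares criterion above can be checked numerically. The sketch below assumes NumPy (not mentioned in the slides) and a made-up 4×3 matrix; it builds the rank-k approximation from the k largest singular values and compares the squared Frobenius error with the sum of the discarded squared singular values.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k approximation U_k diag(s_k) V_k^T of A and all singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :], s

# Hypothetical 4-term x 3-doc matrix, used only for illustration.
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0]])
k = 2
A_k, s = truncated_svd(A, k)

error = np.linalg.norm(A - A_k, "fro") ** 2   # ||A - A'||_F^2
discarded = float(np.sum(s[k:] ** 2))         # sum of the dropped sigma_i^2
```

Under these assumptions `error` and `discarded` agree up to rounding, which is the sense in which the truncated SVD is the best rank-k fit.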
ML-47
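LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector x onto a unit basis vector $\varphi_1$, where θ is the angle between x and $\varphi_1$:
\[ y_1 = \|x\| \cos\theta = \|x\|\, \frac{\varphi_1^T x}{\|\varphi_1\|\, \|x\|} = \varphi_1^T x, \qquad \text{where } \|\varphi_1\| = 1 \]
[Figure: a vector x in the plane spanned by $\varphi_1$ and $\varphi_2$, with its projections $y_1$ and $y_2$]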
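A small worked instance of the projection formula, as a sketch in plain Python. The vector and the basis are made-up values; the helper name `project` is hypothetical.

```python
import math

def project(x, phi):
    """Return y = phi^T x, the coordinate of x along the unit vector phi."""
    norm = math.sqrt(sum(p * p for p in phi))
    if not math.isclose(norm, 1.0):
        raise ValueError("phi must have unit length")
    return sum(p * xi for p, xi in zip(phi, x))

x = [3.0, 4.0]                        # hypothetical example vector
phi1, phi2 = [1.0, 0.0], [0.0, 1.0]   # orthonormal basis
y1, y2 = project(x, phi1), project(x, phi2)

# The same coordinate through the angle form y1 = |x| cos(theta)
theta = math.atan2(x[1], x[0])        # angle between x and phi1
y1_from_angle = math.hypot(x[0], x[1]) * math.cos(theta)
```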
ML-48
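LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query are each represented by their terms and by a structure model; the term representations are compared by literal term matching, optionally after doc expansion or query expansion, and the structure models by latent semantic structure retrieval]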
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  \[ A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n) \]
  \[ A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^T_{k \times n}, \qquad k \le r \]
  – Rows of A and A' are indexed by the words $w_1, w_2, \ldots, w_m$; columns by the docs $d_1, d_2, \ldots, d_n$
  – Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – Row A ∈ R^n, Col A ∈ R^m
  – $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Docs and queries are represented in a k-dimensional space. The axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
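The properties listed on this slide can be verified on a small example. This is only a sketch: it assumes NumPy and a random stand-in for the word-document matrix, which has full rank so that r = min(m, n).

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed; the data is illustrative only
m, n, k = 6, 4, 2
A = rng.random((m, n))                   # stand-in for an m x n word-document matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: U is m x r, Vt is r x n
r = len(s)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # A' from the k largest singular values

gram_U = U.T @ U                         # U^T U, expected to be I (r x r)
gram_V = Vt @ Vt.T                       # V^T V, expected to be I (r x r)
fro_A = float(np.sum(A ** 2))            # ||A||_F^2 = sum_i sum_j a_ij^2
fro_Ak = float(np.sum(A_k ** 2))         # ||A'||_F^2
```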
ML-52
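LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers
      \[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0 \]
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$)
      \[ V_{n \times n} = [v_1\ v_2\ \cdots\ v_n], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n} \]
    • Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1, \ldots, n$, so that $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
      – As the square roots of the eigenvalues of $A^T A$
      – As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|,\ \sigma_2 = \|Av_2\|, \ldots$
      \[ \|Av_i\|^2 = v_i^T A^T A v_i = \lambda_i v_i^T v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i \]
  – For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A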
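Both definitions of the singular values can be compared numerically. The sketch below assumes NumPy and a made-up 4×3 matrix; `np.linalg.svd` is used only as a reference.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])           # hypothetical 4 x 3 matrix

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]             # reorder so that lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]

sigma_from_eig = np.sqrt(np.clip(lam, 0.0, None))    # sigma_j = sqrt(lambda_j)
sigma_from_len = np.linalg.norm(A @ V, axis=0)       # sigma_j = ||A v_j||
sigma_from_svd = np.linalg.svd(A, compute_uv=False)  # reference values
```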
ML-53
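LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
  \[ Av_i \cdot Av_j = (Av_i)^T (Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j) \]
  – Suppose that A (or $A^T A$) has rank r ≤ n
    \[ \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0 \]
  – Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A
    \[ u_i = \frac{Av_i}{\|Av_i\|} = \frac{1}{\sigma_i} A v_i \;\Rightarrow\; \sigma_i u_i = A v_i \;\Rightarrow\; [u_1\ u_2\ \cdots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \cdots\ v_r] \]
    Here $[v_1 \cdots v_r]$ is V, an orthonormal matrix (n×r) known in advance, and $[u_1 \cdots u_r]$ is U, also an orthonormal matrix (m×r)
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$
      \[ [u_1 \cdots u_r \cdots u_m]\, \Sigma = A\, [v_1 \cdots v_r \cdots v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \]
      where $VV^T = I_{n \times n}$ and
      \[ \Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix} \]
• Frobenius norm
  \[ \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2 \]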
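The construction on this slide can be followed step by step in code. This is a sketch assuming NumPy and a made-up 3×2 matrix of rank 2: V comes from the eigenvectors of $A^T A$, each $u_i$ is $Av_i / \sigma_i$, and the product $U \Sigma V^T$ is compared with A.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])                 # hypothetical 3 x 2 matrix of rank 2

lam, V = np.linalg.eigh(A.T @ A)           # eigenpairs of the symmetric matrix A^T A
order = np.argsort(lam)[::-1]              # lambda_1 >= lambda_2
lam, V = lam[order], V[:, order]
sigma = np.sqrt(lam)                       # every lambda_i > 0 here because rank(A) = 2

U = (A @ V) / sigma                        # column i is u_i = A v_i / sigma_i
A_rebuilt = U @ np.diag(sigma) @ V.T       # U Sigma V^T
```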
ML-54
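LSA Derivations (3/7)
[Figure: A (m×n) maps R^n to R^m; R^n is split into the parts spanned by $V_1$ and $V_2$, R^m into the parts spanned by $U_1$ and $U_2$; the null space is where AX = 0]
\[ A = U \Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T \]
• The $v_i$ span the row space of A
• The $u_i$ span the row space of $A^T$
• $U\Sigma = AV$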
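The block form can be illustrated on a rank-deficient matrix. The sketch below assumes NumPy and a made-up 3×3 matrix whose third column is the sum of the first two; the threshold `1e-10` for the numerical rank is an arbitrary choice.

```python
import numpy as np

# Hypothetical 3 x 3 matrix of rank 2.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A)                # full SVD: U is 3 x 3, Vt is 3 x 3
r = int(np.sum(s > 1e-10))                 # numerical rank
U1, S1, V1 = U[:, :r], np.diag(s[:r]), Vt[:r, :].T
V2 = Vt[r:, :].T                           # right singular vectors with sigma = 0

A_from_blocks = U1 @ S1 @ V1.T             # U_1 Sigma_1 V_1^T
A_from_proj = A @ V1 @ V1.T                # A V_1 V_1^T
null_image = A @ V2                        # A x = 0 for every x in span(V_2)
```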
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V
    \[ A = U \Sigma V^T \;\Rightarrow\; AV = U \Sigma V^T V = U \Sigma \;\Rightarrow\; U \Sigma = AV \]
    • the i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  – Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U
    \[ A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma U^T U = V \Sigma \;\Rightarrow\; V \Sigma = A^T U \]
    • the i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
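The two identities $U\Sigma = AV$ and $V\Sigma = A^T U$ can be checked directly. A sketch assuming NumPy and a made-up 4×3 term-doc matrix:

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0]])            # hypothetical 4 terms x 3 docs

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S, V = np.diag(s), Vt.T

term_coords = A @ V      # row i: projections of row i of A onto the columns of V
doc_coords = A.T @ U     # row j: projections of row j of A^T onto the columns of U
```

Each row of `term_coords` equals the matching row of U scaled by the singular values, and likewise for `doc_coords` and V.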
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A, m×n; rows $w_1, \ldots, w_m$, columns $d_1, \ldots, d_n$)
    • compare two terms → dot product of two rows of A, or an entry in $AA^T$
    • compare two docs → dot product of two columns of A, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of U'Σ'
      \[ A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T \]
    • compare two docs → dot product of two rows of V'Σ'
      \[ A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T \]
    • compare a query word and a doc → each individual entry of A'
  – The inner factors $V'^T V'$ and $U'^T U'$ reduce to the identity; Σ' is for stretching or shrinking
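The comparison rules for A' can be sketched as follows, assuming NumPy and a made-up 4×3 matrix. Term vectors are taken as rows of U'Σ' and doc vectors as rows of V'Σ'; their dot products are compared with the entries of $A'A'^T$ and $A'^T A'$.

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0]])            # hypothetical 4 terms x 3 docs
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = Uk @ Sk @ Vk.T                       # the new word-document matrix A'

term_vecs = Uk @ Sk                        # one row per term
doc_vecs = Vk @ Sk                         # one row per doc
term_term = term_vecs @ term_vecs.T        # all term-term dot products
doc_doc = doc_vecs @ doc_vecs.T            # all doc-doc dot products
```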
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m×1 query (or doc) vector
  \[ \hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k} \]
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
  \[ sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{|\hat{q}\Sigma|\, |\hat{d}\Sigma|} \]
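A sketch of fold-in and the cosine measure, assuming NumPy and a made-up toy collection (vocabulary, docs, and the helper names `fold_in` and `cosine` are all hypothetical). The query contains only "car" and doc d1 contains only "automobile" and "engine", so literal term matching gives zero similarity, while the latent-space cosine is high because both terms co-occur with "engine".

```python
import numpy as np

terms = ["car", "automobile", "engine", "flower", "petal"]   # hypothetical vocabulary
# Columns are docs: d0 = {car, engine}, d1 = {automobile, engine}, d2 = {flower, petal}
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

def fold_in(q):
    """q_hat = q^T U_k Sigma_k^{-1}, a 1 x k row vector."""
    return q @ Uk @ np.linalg.inv(Sk)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0, 0.0, 0.0, 0.0])        # query containing only "car"
q_hat = fold_in(q)

raw_sim = cosine(q, A[:, 1])                   # literal term matching against d1
latent_sim = cosine(q_hat @ Sk, Vk[1] @ Sk)    # cosine(q_hat Sigma, d_hat Sigma) for d1
unrelated_sim = cosine(q_hat @ Sk, Vk[2] @ Sk) # the same measure for d2
```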
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  \[ \hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k} \]
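One way to sanity-check the fold-in formulas is to fold in a term or doc that was already part of the analysis: the result should coincide with its existing row of U or V. A sketch assuming NumPy and a made-up 4×3 matrix:

```python
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 2.0, 1.0]])            # hypothetical 4 terms x 3 docs
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Vk = U[:, :k], Vt[:k, :].T
Sk_inv = np.diag(1.0 / s[:k])              # Sigma_k^{-1}; the kept sigma_i are nonzero here

t = A[0, :]                                # an existing term row (1 x n), treated as new
t_hat = t @ Vk @ Sk_inv                    # t_hat = t V Sigma^{-1}

d = A[:, 2]                                # an existing doc column (m x 1), treated as new
d_hat = d @ Uk @ Sk_inv                    # d_hat = d^T U Sigma^{-1}
```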
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (rows: terms; columns: docs)
        2.3 0.0 4.2
        0.0 1.3 2.2
        3.8 0.0 0.5
        0.0 0.0 0.0
  – Each entry is weighted by its TF×IDF score
  – The same matrix in the sparse input format
        4 3 6      (4 rows/terms, 3 cols/docs, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
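The column-by-column layout shown above can be produced from a dense matrix with a few lines of plain Python. This is a sketch that follows the layout on the slide only; the function name `to_sparse_text` is hypothetical and number formatting relies on Python's default float-to-string conversion.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) in the column-wise sparse layout."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Keep (row index, value) pairs for the nonzero entries of column c
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0.0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]     # header: rows, cols, nonzero entries
    for entries in columns:
        lines.append(str(len(entries)))        # number of nonzero entries in this column
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines) + "\n"

# The 4-term x 3-doc example from the slide
term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
sparse_text = to_sparse_text(term_doc)
```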
ML-63
LSA Toolkit SVDLIBC (3/5)
• Example term-doc matrix
        51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……
• SVD command (IR_svd.bat)
        svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit SVDLIBC (4/5)
• LSA100-Ut
        100 51253
        0.003 0.001 ……
        0.002 0.002 ……
  – word vector ($u^T$): 1×100; 51253 words
• LSA100-S
        100
        2686.18
        829.941
        559.59
        …
  – 100 singular values
• LSA100-Vt
        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
  – doc vector ($v^T$): 1×100; 2265 docs
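The output files can be read back with a small parser. The sketch below is plain Python and assumes the layout suggested by the slide: a header line with the dimensions followed by whitespace-separated values, with words (or docs) along the columns of the -Ut (or -Vt) file. The function names and the tiny sample strings are hypothetical stand-ins for the real files.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(tok) for tok in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(tok) for tok in tokens[1:]]
    if len(values) != count:
        raise ValueError("value count does not match the header")
    return values

def column(matrix, j):
    """Vector of word (or doc) j, assuming one word (or doc) per column."""
    return [row[j] for row in matrix]

# Toy stand-ins for LSA100-Ut and LSA100-S (2 dimensions, 3 words)
ut_text = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.005\n"
s_text = "2\n2686.18\n829.941\n"

ut = parse_dense_text(ut_text)
singular_values = parse_singular_values(s_text)
word_vector_0 = column(ut, 0)
```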
ML-65
LSA Toolkit SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TF×IDF weighted beforehand)
  \[ \hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k} \]
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
  \[ sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{|\hat{q}\Sigma|\, |\hat{d}\Sigma|} \]
ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK 
DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT 
ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU 
ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP 
ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA 
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-45
Latent Semantic Analysis (LSA) (17)
bull Also called Latent Semantic Indexing (LSI) Latent Semantic Mapping (LSM)
bull A technique originally proposed for Information Retrieval (IR) which projects queries and docs into a space with ldquolatentrdquo semantic dimensionsndash Co-occurring terms are projected onto the
same dimensions
ndash In the latent semantic space (with fewer dimensions) a query and doc can have high cosine similarity even if they do not share any terms
ndash Dimensions of the reduced space correspond to the axes of greatest variation
bull Closely related to Principal Component Analysis (PCA)
ML-46
LSA (27)
bull Dimension Reduction and Feature Extractionndash PCA
ndash SVD (in LSA)
Xφ Tiiy =
Y
knX
kφ1φ
sum=
k
iiiy
1
φ
kφ1φ
nX
rxr
Σrsquo
r le min(mn)rxn
VrsquoTUrsquo
mxrmxn mxn
kxkA Arsquo
kgiven afor ˆmin2
XX minus
kF given afor min 2AA minusprime
feature space
latent semanticspace
latent semanticspace
k
k
orthonormal basis
ML-47
LSA (37)
ndash Singular Value Decomposition (SVD) used for the word-document matrix
bull A least-squares method for dimension reduction
iφx
1ϕ2ϕ
1 where
cos
1
111 1
1
=
===
φ
xφxx
xx TT
y ϕϕ
θ
1y2y
Projection of a Vector x
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T
ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T
For stretching or shrinking
Irxr
Ursquo=Umxk
Σrsquo=Σk
Vrsquo=Vnxk
wjwi
dk
ds
ML-57
LSA Derivations (67)
bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the
original analysisbull Fold-in a new mx1 query (or doc) vector
ndash Cosine measure between the query and doc vectors in the latent semantic space
( ) 111ˆ minus
timestimestimestimes Σ= kkkmmT
k UqqQuery represented by the weightedsum of it constituent term vectors
The separate dimensions are differentially weighted
Just like a row of V
( )ΣΣ
Σ=ΣΣ=
dq
dqdqcoinedqsimT
ˆˆ
ˆˆ)ˆˆ(ˆˆ
2
row vectors
ML-58
LSA Derivations (77)
bull Fold-in a new 1 X n term vector 1
11ˆminustimestimestimestimes Σ= kkknnk Vtt
ML-59
LSA Example
bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels
Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)
ML-60
LSA Conclusions
bull Advantagesndash A clean formal framework and a clearly defined optimization
criterion (least-squares)bull Conceptual simplicity and clarity
ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)
ndash Good results for high-recall searchbull Take term co-occurrence into account
bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy
bull Eg bank basshellip
ML-61
LSA Toolkit SVDLIBC (15)
bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library
bull Download it at httptedlabmitedu~dr
ML-62
LSA Toolkit SVDLIBC (25)
bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs
ndash Each entry is weighted by TFxIDF score
bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality
23 00 42
00 13 22
38 00 05
00 00 00
Term
Doc4 3 6
2
0 23
2 38
11 1330 421 222 05
RowTem
Col Doc
Nonzero entries
2 nonzero entries at Col 0
Col 0 Row 0 Col 0 Row 2
1 nonzero entryat Col 1
Col 1 Row 1 3 nonzero entry
at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2
ML-63
LSA Toolkit SVDLIBC (35)
bull Example term-docmatrix
bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix
51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip
Indexing Term no Doc no Nonzero
entries
sparse matrix input prefix of output filesNo of reserved
eigenvectors name of sparse
matrix input
LSA100-Ut
LSA100-S
LSA100-Vt
output
ML-64
LSA Toolkit SVDLIBC (45)
bull LSA100-Ut
bull LSA100-S
100 512530003 0001 helliphellip0002 0002 helliphellip
word vector (uT) 1x100
51253 words
10026861882994155959hellip
100 eigenvalues
bull LSA100-Vt
100 22650021 0035 helliphellip0012 0022 helliphellip
doc vector (vT) 1x100
2265 docs
ML-65
LSA Toolkit SVDLIBC (55)
bull Fold-in a new mx1 query vector
bull Cosine measure between the query and doc vectors in the latent semantic space
( ) 111ˆ minus
timestimestimestimes Σ= kkkmmT
k UqqQuery represented by the weightedsum of it constituent term vectors
The separate dimensions are differentially weighted
Just like a row of V
( )ΣΣ
Σ=ΣΣ=
dq
dqdqcoinedqsimT
ˆˆ
ˆˆ)ˆˆ(ˆˆ
2
TFxIDF weighted beforehand
ML-46
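LSA (2/7)
• Dimension Reduction and Feature Extraction
  – PCA (feature space): project onto an orthonormal basis $\varphi_1, \dots, \varphi_k$
    $y_i = \varphi_i^T x$ maps the n-dimensional $x$ to a k-dimensional $y$
    $\hat{x} = \sum_{i=1}^{k} y_i \varphi_i$, chosen to give $\min \|x - \hat{x}\|^2$ for a given $k$
  – SVD (in LSA) (latent semantic space): $A_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}$, $r \le \min(m, n)$
    $A'_{m \times n} = U'_{m \times k} \Sigma'_{k \times k} V'^T_{k \times n}$, chosen to give $\min \|A - A'\|_F^2$ for a given $k$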
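The PCA side of this comparison can be checked numerically. The sketch below is illustrative only: the random data, the choice k = 2, and all variable names are assumptions, not part of the slides. It projects zero-mean samples onto the k leading eigenvectors of the sample covariance and confirms that the mean squared reconstruction error equals the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): 200 samples of a 5-dimensional vector x, one sample per row.
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)                      # PCA assumes zero-mean data

# Orthonormal basis: eigenvectors of the sample covariance matrix.
cov = X.T @ X / X.shape[0]
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
Phi = eigvecs[:, :k]                        # n x k, columns phi_1 .. phi_k
Y = X @ Phi                                 # y_i = phi_i^T x for every sample
X_hat = Y @ Phi.T                           # x_hat = sum_i y_i phi_i

# Mean squared reconstruction error versus the sum of discarded eigenvalues.
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, eigvals[k:].sum())
```

The two printed numbers agree up to round-off, which is the least-squares property that the SVD truncation in LSA shares.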
ML-47
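LSA (3/7)
  – Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a Vector x (onto the basis vectors $\varphi_1, \varphi_2$, giving coordinates $y_1, y_2$):
$y_1 = \|x\| \cos\theta_1 = \|x\| \frac{\varphi_1^T x}{\|\varphi_1\| \|x\|} = \varphi_1^T x$, where $\|\varphi_1\| = 1$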
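A small numeric check of the projection formula may help. The vectors below are arbitrary assumptions chosen for illustration; the code only confirms that the inner product with a unit-length basis vector equals the length of x times the cosine of the angle between them.

```python
import numpy as np

x = np.array([3.0, 4.0])            # example vector (assumed values)
phi1 = np.array([1.0, 0.0])         # unit-length basis vector
phi2 = np.array([0.0, 1.0])         # second unit-length basis vector

y1 = phi1 @ x                       # projection coefficient on phi1
y2 = phi2 @ x                       # projection coefficient on phi2
cos_theta1 = (phi1 @ x) / (np.linalg.norm(phi1) * np.linalg.norm(x))

print(y1, np.linalg.norm(x) * cos_theta1)
print(y1 * phi1 + y2 * phi2)        # x is recovered from its two projections
```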
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: a doc and a query, each represented by its terms and by a structure model; labeled paths: doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query “human computer interaction”
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
  $A_{m \times n} = U_{m \times r} \Sigma_{r \times r} V^T_{r \times n}$, with rows $w_1, w_2, \dots, w_m$ (words) and columns $d_1, d_2, \dots, d_n$ (docs)
  $A'_{m \times n} = U'_{m \times k} \Sigma_{k \times k} V'^T_{k \times n}$, with the same rows and columns
  – Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  – $r \le \min(m, n)$ and $k \le r$
  – $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  – Row A $\subseteq R^n$, Col A $\subseteq R^m$
  – Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$.
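The properties listed on this slide can be verified with NumPy's `numpy.linalg.svd`. This is a minimal sketch under assumptions: the 4 x 3 matrix and k = 2 are illustrative choices, not values prescribed by the slides.

```python
import numpy as np

# Toy word-document matrix A (m = 4 words, n = 3 docs); values are illustrative.
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD, s sorted descending
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]
A_k = Uk @ np.diag(sk) @ Vtk                       # rank-k approximation A'

print(np.allclose(Uk.T @ Uk, np.eye(k)))           # U'^T U' = I
print(np.allclose(Vtk @ Vtk.T, np.eye(k)))         # V'^T V' = I
print(np.linalg.norm(A, "fro") ** 2, np.linalg.norm(A_k, "fro") ** 2)
# Squared truncation error versus the sum of the discarded sigma_i^2.
print(np.linalg.norm(A - A_k, "fro") ** 2, np.sum(s[k:] ** 2))
```

The last line shows why truncation is a least-squares choice: the squared Frobenius error is exactly the energy in the dropped singular values.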
ML-52
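LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$
    • The eigenvectors $v_j$ ($v_j \in R^n$) can be chosen orthonormal: $V_{n \times n} = [v_1\ v_2\ \dots\ v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
    • Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \dots, n$
      – As the square roots of the eigenvalues of $A^T A$: $\Sigma^2 = diag(\lambda_1, \lambda_2, \dots, \lambda_n)$
      – As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
        $\|Av_i\|^2 = (Av_i)^T Av_i = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \Rightarrow \|Av_i\| = \sigma_i$
  – For $\lambda_i \ne 0$, $i = 1, \dots, r$, $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A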
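Both definitions of the singular values can be compared against a library SVD. The sketch below assumes an arbitrary random 5 x 3 matrix; the names are hypothetical and the clipping step is only a guard against round-off.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 3))                      # arbitrary m x n example matrix

lam, V = np.linalg.eigh(A.T @ A)                 # symmetric input: real eigenvalues, orthonormal V
order = np.argsort(lam)[::-1]                    # sort so lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]
lam = np.clip(lam, 0.0, None)                    # guard against tiny negative round-off

sigma_from_eig = np.sqrt(lam)                    # sigma_j = sqrt(lambda_j)
sigma_from_len = np.linalg.norm(A @ V, axis=0)   # sigma_j = ||A v_j||
sigma_ref = np.linalg.svd(A, compute_uv=False)   # reference values from the library

print(sigma_from_eig)
print(sigma_from_len)
print(sigma_ref)
```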
ML-53
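LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A
  $Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0$ for $i \ne j$
  – Suppose that A (or $A^T A$) has rank $r \le n$
    $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$, $\lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$
  – Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col A
    $u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \Rightarrow \sigma_i u_i = Av_i$
    $\Rightarrow [u_1\ u_2\ \dots\ u_r]\, \Sigma_r = A\, [v_1\ v_2\ \dots\ v_r]$
    Here V is an orthonormal matrix (n×r), known in advance; U is also an orthonormal matrix (m×r)
    • Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$
      $\Rightarrow [u_1\ u_2\ \dots\ u_r\ \dots\ u_m]\, \Sigma = A\, [v_1\ v_2\ \dots\ v_r\ \dots\ v_n]$
      $\Rightarrow U\Sigma = AV \Rightarrow U\Sigma V^T = AVV^T \Rightarrow A = U\Sigma V^T$, since $VV^T = I_{n \times n}$
      $\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$
  – $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2$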
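The construction of U from the eigenvectors of $A^T A$ can be traced step by step. The following sketch assumes a random full-column-rank 4 x 3 matrix (so r = n = 3 and every $\sigma_i > 0$); it is an illustration of the derivation, not a recommended way to compute an SVD.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 3))                  # random matrix, full column rank (r = 3)

lam, V = np.linalg.eigh(A.T @ A)             # eigenvectors v_i of A^T A
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(lam)                         # sigma_i = sqrt(lambda_i)

U = (A @ V) / sigma                          # column i is u_i = A v_i / sigma_i
Sigma = np.diag(sigma)

print(np.allclose(U.T @ U, np.eye(3)))       # the u_i are orthonormal
print(np.allclose(U @ Sigma, A @ V))         # U Sigma = A V
print(np.allclose(U @ Sigma @ V.T, A))       # A = U Sigma V^T
print(np.isclose(np.sum(A ** 2), np.sum(sigma ** 2)))   # ||A||_F^2 = sum of sigma_i^2
```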
ML-54
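LSA Derivations (3/7)
$A_{m \times n} = U\Sigma V^T = (U_1\ U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$
  – The $v_i$ (columns of $V_1$) span the row space of $A$, in $R^n$
  – The $u_i$ (columns of $U_1$) span the row space of $A^T$, in $R^m$
  – The columns of $V_2$ satisfy $AX = 0$
  – $U\Sigma = AV$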
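The partition into $V_1$ and $V_2$ is easiest to see on a rank-deficient matrix. In this sketch the 3 x 3 matrix, the rank tolerance `1e-10`, and the variable names are all assumptions made for illustration.

```python
import numpy as np

# Rank-deficient example: the third row is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A)                  # full SVD: U and Vt are both 3 x 3
r = int(np.sum(s > 1e-10))                   # numerical rank (tolerance is an assumption)
V1, V2 = Vt[:r].T, Vt[r:].T                  # row-space basis and null-space basis
U1 = U[:, :r]

print(r)
print(np.allclose(A @ V2, 0.0))              # columns of V2 solve A x = 0
print(np.allclose(A @ V1 @ V1.T, A))         # A = A V1 V1^T
print(np.allclose(U1 @ np.diag(s[:r]) @ V1.T, A))   # A = U1 Sigma1 V1^T
```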
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by columns of $V$
    $A = U\Sigma V^T \Rightarrow AV = U\Sigma V^T V = U\Sigma \Rightarrow U\Sigma = AV$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by columns of $U$
    $A = U\Sigma V^T \Rightarrow A^T U = (U\Sigma V^T)^T U = V\Sigma^T U^T U = V\Sigma \Rightarrow V\Sigma = A^T U$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
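The "projection" reading of these identities can be checked entry by entry. The sketch assumes a random 5 x 4 matrix and arbitrarily chosen indices i and j; it shows that a single entry of $U\Sigma$ (or $V\Sigma$) is an inner product with one basis vector.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 4))                   # arbitrary example matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V, Sigma = Vt.T, np.diag(s)

i, j = 2, 1                                   # arbitrary row and axis to inspect
row_proj = A[i] @ V[:, j]                     # row i of A projected onto column j of V
col_proj = A[:, i] @ U[:, j]                  # row i of A^T projected onto column j of U

print(np.isclose(row_proj, s[j] * U[i, j]))   # equals entry (i, j) of U Sigma
print(np.isclose(col_proj, s[j] * V[i, j]))   # equals entry (i, j) of V Sigma
print(np.allclose(U @ Sigma, A @ V), np.allclose(V @ Sigma, A.T @ U))
```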
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A_{m \times n}$, rows $w_1, w_2, \dots, w_m$, columns $d_1, d_2, \dots, d_n$)
    • compare two terms (e.g. $w_i$, $w_j$) → dot product of two rows of $A$, or an entry in $AA^T$
    • compare two docs (e.g. $d_k$, $d_s$) → dot product of two columns of $A$, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A'$), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U'\Sigma'$
      $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$, since $V'^T V' = I_{k \times k}$
    • compare two docs → dot product of two rows of $V'\Sigma'$
      $A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$
    • compare a query word and a doc → each individual entry of $A'$
  – $\Sigma'$ is used for stretching or shrinking the axes
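These comparisons translate directly into matrix products. The sketch below assumes a toy 4 x 3 term-doc matrix and k = 2; the names `term_vecs` and `doc_vecs` are hypothetical and simply denote the rows of $U'\Sigma'$ and $V'\Sigma'$.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])               # rows = terms, columns = docs (toy values)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k].T
A_k = Uk @ Sk @ Vk.T                          # the new word-document matrix A'

term_vecs = Uk @ Sk                           # one k-dimensional row per term
doc_vecs = Vk @ Sk                            # one k-dimensional row per doc

term_term = term_vecs @ term_vecs.T           # compare two terms
doc_doc = doc_vecs @ doc_vecs.T               # compare two docs
print(np.allclose(term_term, A_k @ A_k.T))
print(np.allclose(doc_doc, A_k.T @ A_k))
print(A_k[0, 2])                              # term 0 versus doc 2: an entry of A'
```

Working with the k-dimensional rows avoids forming the full m x m or n x n product when only a few pairs are needed.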
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
      $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
      – The query is represented by the weighted sum of its constituent term vectors
      – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
      – $\hat{q}$ is just like a row of $V$
  – Cosine measure between the query and doc vectors in the latent semantic space
    $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$, where $\hat{q}$ and $\hat{d}$ are row vectors
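Fold-in and the cosine measure can be sketched in a few lines. Everything here is an assumption for illustration: the helper names `fold_in_doc` and `lsa_similarity`, the toy matrix, and the query vector. The first check confirms that folding in a document already in the matrix reproduces its row of V.

```python
import numpy as np

def fold_in_doc(q, Uk, sk):
    """Map an m-dim term-weight vector q to k dims: q_hat = q^T U_k Sigma_k^{-1}."""
    return (q @ Uk) / sk

def lsa_similarity(q_hat, d_hat, sk):
    """Cosine between q_hat Sigma and d_hat Sigma (both treated as row vectors)."""
    qs, ds = q_hat * sk, d_hat * sk
    denom = np.linalg.norm(qs) * np.linalg.norm(ds)
    return float(qs @ ds / denom) if denom > 0 else 0.0

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])               # toy term-doc matrix (assumed values)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T

# A document that was part of the analysis folds in to its own row of V.
print(np.allclose(fold_in_doc(A[:, 0], Uk, sk), Vk[0]))

q = np.array([1.0, 0.0, 1.0, 0.0])            # hypothetical query using terms 0 and 2
q_hat = fold_in_doc(q, Uk, sk)
scores = [lsa_similarity(q_hat, Vk[j], sk) for j in range(Vk.shape[0])]
print(scores)                                 # one similarity per document
```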
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
  $\hat{t}_{1 \times k} = t_{1 \times n}\, V_{n \times k}\, \Sigma^{-1}_{k \times k}$
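The term fold-in mirrors the query fold-in with V in place of U. In this sketch the random 6 x 5 matrix, k = 3, the helper name `fold_in_term`, and the new term's weights are all illustrative assumptions. A term that was in the original matrix maps back to its own row of U, which is a convenient sanity check.

```python
import numpy as np

def fold_in_term(t, Vk, sk):
    """Map a 1 x n term row t (weights over docs) to k dims: t_hat = t V_k Sigma_k^{-1}."""
    return (t @ Vk) / sk

rng = np.random.default_rng(4)
A = rng.random(size=(6, 5))                   # 6 terms x 5 docs, illustrative weights
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T

# A term that was part of the analysis maps back to its row of U_k.
print(np.allclose(fold_in_term(A[1], Vk, sk), Uk[1]))

t_new = np.array([0.0, 1.5, 0.0, 0.7, 0.0])   # hypothetical new term seen in docs 1 and 3
t_hat = fold_in_term(t_new, Vk, sk)
print(t_hat)
```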
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems (“heterogeneous vocabulary”)
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g. bank, bass, …
ML-61
LSA Toolkit SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g. 4 terms and 3 docs (Row = Term, Col = Doc)
      2.3 0.0 4.2
      0.0 1.3 2.2
      3.8 0.0 0.5
      0.0 0.0 0.0
  – Each entry is weighted by TFxIDF score
  – The same matrix in sparse text form:
      4 3 6     (4 terms, 3 docs, 6 nonzero entries)
      2         (2 nonzero entries at Col 0)
      0 2.3     (Col 0, Row 0)
      2 3.8     (Col 0, Row 2)
      1         (1 nonzero entry at Col 1)
      1 1.3     (Col 1, Row 1)
      3         (3 nonzero entries at Col 2)
      0 4.2     (Col 2, Row 0)
      1 2.2     (Col 2, Row 1)
      2 0.5     (Col 2, Row 2)
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, 600, etc.) of LSA dimensionality
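The column-by-column layout shown on this slide is simple enough to generate by hand. The function below is a minimal sketch that follows only what the slide's example shows (a header of rows, columns, and nonzero count, then per column a count followed by row-value pairs); the name `to_sparse_text` is hypothetical and the sketch is not taken from the SVDLIBC sources.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) column by column, nonzero entries only."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        # Collect (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]            # header: rows, cols, nonzeros
    for entries in columns:
        lines.append(str(len(entries)))               # nonzero count for this column
        lines.extend(f"{r} {v}" for r, v in entries)  # one "row value" pair per line
    return "\n".join(lines)

term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
print(to_sparse_text(term_doc))
```

Running it on the 4 x 3 example reproduces the ten lines of the sparse listing for that matrix.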
ML-63
LSA Toolkit SVDLIBC (3/5)
• Example term-doc matrix
    51253 2265 218852     (no. of indexing terms, no. of docs, no. of nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……
• SVD command (IR_svd.bat)
    svd -r st -o LSA100 -d 100 Term-Doc-Matrix
  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1×100)
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
• LSA100-S (100 singular values, the diagonal of Σ)
    100
    26861882994155959…
• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1×100)
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
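To use these output files, they first have to be read back in. The sketch below assumes the layout suggested by the samples on this slide: a "rows cols" header followed by whitespace-separated values for the matrix files, and a count followed by that many values for the singular-value file. The function names and the tiny stand-in strings are hypothetical; the real files should be checked against the SVDLIBC documentation before relying on this.

```python
def parse_dense_matrix(text):
    """Parse 'rows cols' followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(tok) for tok in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(tok) for tok in tokens[1:]]
    if len(values) != count:
        raise ValueError("value count does not match the header")
    return values

# Tiny stand-ins for a k x m Ut file and an S file; the numbers are made up.
ut_text = "2 3\n0.5 0.1 0.2\n0.3 0.4 0.6\n"
s_text = "2\n9.0\n4.0\n"

Ut = parse_dense_matrix(ut_text)
S = parse_singular_values(s_text)
# Word vector j is column j of Ut, i.e. row j of U.
word_vectors = [[Ut[i][j] for i in range(len(Ut))] for j in range(len(Ut[0]))]
print(word_vectors[0], S)
```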
ML-65
LSA Toolkit SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
  $\hat{q}_{1 \times k} = (q^T)_{1 \times m}\, U_{m \times k}\, \Sigma^{-1}_{k \times k}$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space
  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\, \|\hat{d}\Sigma\|}$
ML-47
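LSA (3/7)
  - Singular Value Decomposition (SVD) used for the word-document matrix
    • A least-squares method for dimension reduction
Projection of a vector x onto a unit basis vector $\varphi_1$:
$$y_1 = \|x\| \cos\theta_1 = \|x\| \, \frac{\varphi_1^T x}{\|\varphi_1\| \, \|x\|} = \varphi_1^T x, \quad \text{where } \|\varphi_1\| = 1$$
[Figure: vector x with its projections $y_1$ and $y_2$ on the axes $\varphi_1$ and $\varphi_2$]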
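The projection above can be checked numerically. The following is a minimal sketch in plain Python; the vector `x` and the axes `phi1`, `phi2` are made-up illustration values, and `project` is a hypothetical helper name, not something from the slides.

```python
import math

def project(x, phi):
    """Scalar projection of x onto phi (phi is normalized to unit length first)."""
    norm_phi = math.sqrt(sum(p * p for p in phi))
    return sum(p * xi for p, xi in zip(phi, x)) / norm_phi

# Illustrative values (assumed): a 2-D vector and the two coordinate axes.
x = [3.0, 4.0]
phi1 = [1.0, 0.0]
phi2 = [0.0, 1.0]

y1 = project(x, phi1)  # phi1^T x
y2 = project(x, phi2)  # phi2^T x

# y1 = ||x|| cos(theta1), so cos(theta1) can be recovered from y1.
norm_x = math.sqrt(sum(xi * xi for xi in x))
cos_theta1 = y1 / norm_x
print(y1, y2, cos_theta1)
```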
ML-48
LSA (4/7)
• Frameworks to circumvent vocabulary mismatch
[Figure: Doc and Query are each shown with a "terms" level and a "structure model" level. Labeled links: doc expansion, query expansion, literal term matching, latent semantic structure retrieval]
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
$$A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n}$$
$$A'_{m \times n} = U'_{m \times k} \, \Sigma_{k \times k} \, V'^T_{k \times n}$$
  - Rows of A and A' correspond to words $w_1, w_2, \ldots, w_m$; columns correspond to docs $d_1, d_2, \ldots, d_n$
  - Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  - $r \le \min(m, n)$
  - $k \le r$ and $\|A\|_F^2 \ge \|A'\|_F^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  - Row A $\subseteq R^n$, Col A $\subseteq R^m$
  - Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
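The rank-k truncation and the Frobenius-norm inequality can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the 4×3 matrix `A` and the choice `k = 2` are made up for the example, and `np.linalg.svd` stands in for whatever SVD routine is actually used.

```python
import numpy as np

# Illustrative 4-term x 3-doc matrix (assumed values).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

# Reduced SVD: A = U diag(s) Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values to obtain A' (here A_k).
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Squared Frobenius norms: the sum of squared entries.
fro_A = np.sum(A ** 2)
fro_Ak = np.sum(A_k ** 2)
print(fro_A, fro_Ak)
```

The squared Frobenius norm equals the sum of the squared singular values that are kept, which is why truncation can only shrink it.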
ML-52
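LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($\in R^n$): $V_{n \times n} = [v_1 \; v_2 \; \cdots \; v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$
  - As the square roots of the eigenvalues of $A^T A$: $\Sigma^2 = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$
  - As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...
$$\|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$$
• For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A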
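The two definitions of the singular values (square roots of the eigenvalues of $A^T A$, and lengths of the vectors $Av_j$) can be compared numerically. A minimal NumPy sketch, assuming a made-up 4×3 matrix `A`; `np.linalg.eigh` is used because $A^T A$ is symmetric.

```python
import numpy as np

# Illustrative matrix (assumed values).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

# Eigen-decomposition of the symmetric matrix A^T A (eigh returns ascending order).
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]          # re-sort so that lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]

# Definition 1: sigma_j = sqrt(lambda_j); clip guards against tiny negative round-off.
sigma_from_eig = np.sqrt(np.clip(lam, 0.0, None))

# Definition 2: sigma_j = ||A v_j|| (column norms of A V).
sigma_from_len = np.linalg.norm(A @ V, axis=0)

# Reference values from a direct SVD.
sigma_svd = np.linalg.svd(A, compute_uv=False)
print(sigma_from_eig, sigma_from_len, sigma_svd)
```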
ML-53
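LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A
$$Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j)$$
  - Suppose that A (or $A^T A$) has rank $r \le n$:
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \quad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
  - Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A:
$$u_i = \frac{Av_i}{\|Av_i\|} = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1 \; u_2 \; \cdots \; u_r] \, \Sigma_r = A \, [v_1 \; v_2 \; \cdots \; v_r]$$
    Here $[v_1 \cdots v_r]$ is an orthonormal matrix (n×r) and $[u_1 \cdots u_r]$ is also an orthonormal matrix (m×r)
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$:
$$[u_1 \; \cdots \; u_r \; \cdots \; u_m] \, \Sigma = A \, [v_1 \; \cdots \; v_r \; \cdots \; v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U \Sigma V^T = A V V^T \;\Rightarrow\; A = U \Sigma V^T$$
      where $V V^T = I_{n \times n}$ and
$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$
  - $\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$, where $\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2$
  - Slide annotation: "Known in advance"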
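The construction $u_i = Av_i / \sigma_i$ can be followed step by step on a rank-deficient matrix. This is a sketch under assumptions: the 3×3 matrix `A` (rank 2, since its second row is twice the first) is made up, and the relative threshold used to decide which eigenvalues count as nonzero is an arbitrary numerical choice.

```python
import numpy as np

# Illustrative rank-2 matrix (assumed values): row 2 = 2 * row 1.
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])

# Eigenvectors of A^T A, sorted so that lambda_1 >= lambda_2 >= ...
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Rank r = number of eigenvalues that are nonzero up to round-off (assumed tolerance).
r = int(np.sum(lam > 1e-10 * lam[0]))
sigma = np.sqrt(lam[:r])

# u_i = A v_i / sigma_i for i = 1..r (division broadcasts over columns).
V_r = V[:, :r]
U_r = (A @ V_r) / sigma

# Reduced form of A = U Sigma V^T using only the r nonzero singular values.
A_rebuilt = U_r @ np.diag(sigma) @ V_r.T
print(r, np.round(A_rebuilt, 6))
```

Only the first r columns are needed to rebuild A; the extension to a full basis of $R^m$ fills out U without changing the product.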
ML-54
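LSA Derivations (3/7)
$$A = U \Sigma V^T = (U_1 \; U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T$$
• $v_i$ spans the row space of $A$
• $u_i$ spans the row space of $A^T$
• $AV = U\Sigma$
[Figure: $A_{m \times n}$ relates $R^n$ (basis V, split into $V_1$ and $V_2$) to $R^m$ (basis U, split into $U_1$ and $U_2$); the region labeled $AX = 0$ is the null space of A]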
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by columns of $V$
$$A = U \Sigma V^T \;\Rightarrow\; AV = U \Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  - Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by $U$
$$A = U \Sigma V^T \;\Rightarrow\; A^T U = (U \Sigma V^T)^T U = V \Sigma U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U$$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), m×n, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$
    • compare two terms → dot product of two rows of A, or an entry in $AA^T$
    • compare two docs → dot product of two columns of A, or an entry in $A^T A$
    • compare a term and a doc → each individual entry of A
  - The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • compare two terms → dot product of two rows of $U'\Sigma'$
    • compare two docs → dot product of two rows of $V'\Sigma'$
    • compare a query word and a doc → each individual entry of A'
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$
$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
  - In both products the middle factor reduces to the identity ($V'^T V' = U'^T U' = I_{k \times k}$); $\Sigma'$ is for stretching or shrinking
[Figure: matrix A with two term rows $w_i$, $w_j$ and two doc columns $d_k$, $d_s$ highlighted]
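The term-term and doc-doc comparisons in the reduced space can be sketched as dot products of rows of $U'\Sigma'$ and $V'\Sigma'$. A minimal NumPy illustration, assuming a made-up 4×3 matrix `A` and `k = 2`; the variable names are hypothetical.

```python
import numpy as np

# Illustrative 4-term x 3-doc matrix (assumed values).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]            # m x k
S_k = np.diag(s[:k])      # k x k
V_k = Vt[:k, :].T         # n x k

A_k = U_k @ S_k @ V_k.T   # the new word-document matrix A'

# Rows of U'S' are term coordinates; rows of V'S' are doc coordinates.
term_coords = U_k @ S_k
doc_coords = V_k @ S_k

# All pairwise comparisons at once.
term_term = term_coords @ term_coords.T   # equals A' A'^T
doc_doc = doc_coords @ doc_coords.T       # equals A'^T A'
print(np.round(term_term, 3))
print(np.round(doc_doc, 3))
```

Working with the k-dimensional coordinates avoids forming the full m×n matrix A' when only similarities are needed.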
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$$
      The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; $\hat{q}$ is just like a row of V
  - Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$$
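Query fold-in and the latent-space cosine measure can be sketched as below. This is an illustration under assumptions: the matrix `A`, the query `q`, and `k = 2` are made up, and `fold_in_query` and `latent_cosine` are hypothetical helper names. A query that folds in to the zero vector would need a guard against division by zero, which is omitted here.

```python
import numpy as np

def fold_in_query(q, U_k, s_k):
    """q_hat = q^T U_k Sigma_k^{-1} for a length-m term vector q."""
    return (q @ U_k) / s_k

def latent_cosine(q_hat, d_hat, s_k):
    """Cosine between q_hat * Sigma and d_hat * Sigma (both treated as row vectors)."""
    a, b = q_hat * s_k, d_hat * s_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-term x 3-doc matrix (assumed values).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Sanity check: folding in an existing doc (column 0 of A) gives its row of V_k.
d0_hat = fold_in_query(A[:, 0], U_k, s_k)

# A made-up query containing terms 0 and 2, ranked against all docs.
q = np.array([1.0, 0.0, 1.0, 0.0])
q_hat = fold_in_query(q, U_k, s_k)
sims = [latent_cosine(q_hat, V_k[j], s_k) for j in range(V_k.shape[0])]
print(np.round(sims, 3))
```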
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k}$$
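Term fold-in mirrors query fold-in with V in place of U. A short NumPy sketch, assuming the same made-up 4×3 matrix and `k = 2`; `fold_in_term` is a hypothetical helper name.

```python
import numpy as np

def fold_in_term(t, V_k, s_k):
    """t_hat = t V_k Sigma_k^{-1} for a length-n term vector t (one weight per doc)."""
    return (t @ V_k) / s_k

# Illustrative 4-term x 3-doc matrix (assumed values).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Folding in an existing term (row 0 of A) gives its row of U_k.
t0_hat = fold_in_term(A[0, :], V_k, s_k)
print(np.round(t0_hat, 4), np.round(U_k[0], 4))
```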
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handle synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Take term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, ...
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (rows are terms, columns are docs)

        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0

    stored in the sparse text format as

        4 3 6      (4 terms (rows), 3 docs (columns), 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)

  - Each entry is weighted by TFxIDF score
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
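The column-by-column sparse text layout shown on this slide can be produced from a dense matrix with a few lines of plain Python. This is a minimal sketch: `to_sparse_text` is a hypothetical helper name, and the layout (header of rows, columns, nonzero count; then per column a count followed by "row value" pairs) is taken from the slide's example.

```python
def to_sparse_text(matrix):
    """Serialize a dense list-of-rows matrix into the column-major sparse text layout."""
    rows, cols = len(matrix), len(matrix[0])
    # Collect (row index, value) pairs of the nonzero entries, one list per column.
    columns = [
        [(i, matrix[i][j]) for i in range(rows) if matrix[i][j] != 0]
        for j in range(cols)
    ]
    nnz = sum(len(entries) for entries in columns)
    lines = [f"{rows} {cols} {nnz}"]
    for entries in columns:
        lines.append(str(len(entries)))            # nonzero count of this column
        lines.extend(f"{i} {v}" for i, v in entries)
    return "\n".join(lines) + "\n"

# The 4-term x 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
print(sparse_text)
```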
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (first line: number of indexing terms, number of docs, number of nonzero entries)

        51253 2265 21885277   [nonzero-entry total and first column's entry count, fused in the source]
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ......

• SVD command (IR_svd.bat)

        svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: number of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
• Output files: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1×100)

        100 51253
        0.003 0.001 ......
        0.002 0.002 ......

• LSA100-S (100 singular values)

        100
        2686.18
        829.941
        559.59
        ...

• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1×100)

        100 2265
        0.021 0.035 ......
        0.012 0.022 ......
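The output files can be read back with a small parser. The sketch below is based only on the layout visible on the slide (a "rows cols" header followed by values for -Ut and -Vt; a count followed by one value per line for -S); that layout, the helper names, and the tiny sample strings are all assumptions for illustration, so check the toolkit's own documentation before relying on it.

```python
def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    rows, cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != rows * cols:
        raise ValueError("value count does not match the header")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]

def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(t) for t in tokens[1:]]
    if len(values) != count:
        raise ValueError("value count does not match the header")
    return values

# Made-up miniature inputs: 2 dimensions, 3 words.
ut_text = "2 3\n0.5 0.1 0.2\n0.3 0.4 0.6\n"
s_text = "2\n2686.18\n829.941\n"

ut = parse_dense_text(ut_text)               # 2 x 3: one row per dimension
word_vectors = [list(col) for col in zip(*ut)]  # transpose: one k-dim vector per word
sigma = parse_singular_values(s_text)
print(word_vectors, sigma)
```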
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)
$$\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$$
  The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$$
ltlt ASCII85EncodePages false AllowTransparency false AutoPositionEPSFiles true AutoRotatePages All Binding Left CalGrayProfile (Dot Gain 20) CalRGBProfile (sRGB IEC61966-21) CalCMYKProfile (US Web Coated 050SWOP051 v2) sRGBProfile (sRGB IEC61966-21) CannotEmbedFontPolicy Warning CompatibilityLevel 14 CompressObjects Tags CompressPages true ConvertImagesToIndexed true PassThroughJPEGImages true CreateJDFFile false CreateJobTicket false DefaultRenderingIntent Default DetectBlends true DetectCurves 00000 ColorConversionStrategy LeaveColorUnchanged DoThumbnails false EmbedAllFonts true EmbedOpenType false ParseICCProfilesInComments true EmbedJobOptions true DSCReportingLevel 0 EmitDSCWarnings false EndPage -1 ImageMemory 1048576 LockDistillerParams false MaxSubsetPct 100 Optimize true OPM 1 ParseDSCComments true ParseDSCCommentsForDocInfo true PreserveCopyPage true PreserveDICMYKValues true PreserveEPSInfo true PreserveFlatness true PreserveHalftoneInfo false PreserveOPIComments false PreserveOverprintSettings true StartPage 1 SubsetFonts true TransferFunctionInfo Apply UCRandBGInfo Preserve UsePrologue false ColorSettingsFile () AlwaysEmbed [ true ] NeverEmbed [ true ] AntiAliasColorImages false CropColorImages true ColorImageMinResolution 300 ColorImageMinResolutionPolicy OK DownsampleColorImages true ColorImageDownsampleType Bicubic ColorImageResolution 300 ColorImageDepth -1 ColorImageMinDownsampleDepth 1 ColorImageDownsampleThreshold 150000 EncodeColorImages true ColorImageFilter DCTEncode AutoFilterColorImages true ColorImageAutoFilterStrategy JPEG ColorACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt ColorImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000ColorACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000ColorImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasGrayImages false CropGrayImages true GrayImageMinResolution 300 GrayImageMinResolutionPolicy OK 
DownsampleGrayImages true GrayImageDownsampleType Bicubic GrayImageResolution 300 GrayImageDepth -1 GrayImageMinDownsampleDepth 2 GrayImageDownsampleThreshold 150000 EncodeGrayImages true GrayImageFilter DCTEncode AutoFilterGrayImages true GrayImageAutoFilterStrategy JPEG GrayACSImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt GrayImageDict ltlt QFactor 015 HSamples [1 1 1 1] VSamples [1 1 1 1] gtgt JPEG2000GrayACSImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt JPEG2000GrayImageDict ltlt TileWidth 256 TileHeight 256 Quality 30 gtgt AntiAliasMonoImages false CropMonoImages true MonoImageMinResolution 1200 MonoImageMinResolutionPolicy OK DownsampleMonoImages true MonoImageDownsampleType Bicubic MonoImageResolution 1200 MonoImageDepth -1 MonoImageDownsampleThreshold 150000 EncodeMonoImages true MonoImageFilter CCITTFaxEncode MonoImageDict ltlt K -1 gtgt AllowPSXObjects false CheckCompliance [ None ] PDFX1aCheck false PDFX3Check false PDFXCompliantPDFOnly false PDFXNoTrimBoxError true PDFXTrimBoxToMediaBoxOffset [ 000000 000000 000000 000000 ] PDFXSetBleedBoxToMediaBox true PDFXBleedBoxToTrimBoxOffset [ 000000 000000 000000 000000 ] PDFXOutputIntentProfile () PDFXOutputConditionIdentifier () PDFXOutputCondition () PDFXRegistryName () PDFXTrapped False Description ltlt CHS ltFEFF4f7f75288fd94e9b8bbe5b9a521b5efa7684002000500044004600206587686353ef901a8fc7684c976262535370673a548c002000700072006f006f00660065007200208fdb884c9ad88d2891cf62535370300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c676562535f00521b5efa768400200050004400460020658768633002gt CHT 
ltFEFF4f7f752890194e9b8a2d7f6e5efa7acb7684002000410064006f006200650020005000440046002065874ef653ef5728684c9762537088686a5f548c002000700072006f006f00660065007200204e0a73725f979ad854c18cea7684521753706548679c300260a853ef4ee54f7f75280020004100630072006f0062006100740020548c002000410064006f00620065002000520065006100640065007200200035002e003000204ee553ca66f49ad87248672c4f86958b555f5df25efa7acb76840020005000440046002065874ef63002gt DAN ltFEFF004200720075006700200069006e0064007300740069006c006c0069006e006700650072006e0065002000740069006c0020006100740020006f007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e007400650072002000740069006c0020006b00760061006c00690074006500740073007500640073006b007200690076006e0069006e006700200065006c006c006500720020006b006f007200720065006b007400750072006c00e60073006e0069006e0067002e0020004400650020006f007000720065007400740065006400650020005000440046002d0064006f006b0075006d0065006e0074006500720020006b0061006e002000e50062006e00650073002000690020004100630072006f00620061007400200065006c006c006500720020004100630072006f006200610074002000520065006100640065007200200035002e00300020006f00670020006e0079006500720065002egt DEU 
ltFEFF00560065007200770065006e00640065006e0020005300690065002000640069006500730065002000450069006e007300740065006c006c0075006e00670065006e0020007a0075006d002000450072007300740065006c006c0065006e00200076006f006e002000410064006f006200650020005000440046002d0044006f006b0075006d0065006e00740065006e002c00200076006f006e002000640065006e0065006e002000530069006500200068006f00630068007700650072007400690067006500200044007200750063006b006500200061007500660020004400650073006b0074006f0070002d0044007200750063006b00650072006e00200075006e0064002000500072006f006f0066002d00470065007200e400740065006e002000650072007a0065007500670065006e0020006d00f60063006800740065006e002e002000450072007300740065006c006c007400650020005000440046002d0044006f006b0075006d0065006e007400650020006b00f6006e006e0065006e0020006d006900740020004100630072006f00620061007400200075006e0064002000410064006f00620065002000520065006100640065007200200035002e00300020006f0064006500720020006800f600680065007200200067006500f600660066006e00650074002000770065007200640065006e002egt ESP 
ltFEFF005500740069006c0069006300650020006500730074006100200063006f006e0066006900670075007200610063006900f3006e0020007000610072006100200063007200650061007200200064006f00630075006d0065006e0074006f0073002000640065002000410064006f0062006500200050004400460020007000610072006100200063006f006e00730065006700750069007200200069006d0070007200650073006900f3006e002000640065002000630061006c006900640061006400200065006e00200069006d0070007200650073006f0072006100730020006400650020006500730063007200690074006f00720069006f00200079002000680065007200720061006d00690065006e00740061007300200064006500200063006f00720072006500630063006900f3006e002e002000530065002000700075006500640065006e00200061006200720069007200200064006f00630075006d0065006e0074006f00730020005000440046002000630072006500610064006f007300200063006f006e0020004100630072006f006200610074002c002000410064006f00620065002000520065006100640065007200200035002e003000200079002000760065007200730069006f006e0065007300200070006f00730074006500720069006f007200650073002egt FRA 
ltFEFF005500740069006c006900730065007a00200063006500730020006f007000740069006f006e00730020006100660069006e00200064006500200063007200e900650072002000640065007300200064006f00630075006d0065006e00740073002000410064006f00620065002000500044004600200070006f007500720020006400650073002000e90070007200650075007600650073002000650074002000640065007300200069006d007000720065007300730069006f006e00730020006400650020006800610075007400650020007100750061006c0069007400e90020007300750072002000640065007300200069006d007000720069006d0061006e0074006500730020006400650020006200750072006500610075002e0020004c0065007300200064006f00630075006d0065006e00740073002000500044004600200063007200e900e90073002000700065007500760065006e0074002000ea0074007200650020006f007500760065007200740073002000640061006e00730020004100630072006f006200610074002c002000610069006e00730069002000710075002700410064006f00620065002000520065006100640065007200200035002e0030002000650074002000760065007200730069006f006e007300200075006c007400e90072006900650075007200650073002egt ITA ltFEFF005500740069006c0069007a007a006100720065002000710075006500730074006500200069006d0070006f007300740061007a0069006f006e00690020007000650072002000630072006500610072006500200064006f00630075006d0065006e00740069002000410064006f006200650020005000440046002000700065007200200075006e00610020007300740061006d007000610020006400690020007100750061006c0069007400e00020007300750020007300740061006d00700061006e0074006900200065002000700072006f006f0066006500720020006400650073006b0074006f0070002e0020004900200064006f00630075006d0065006e007400690020005000440046002000630072006500610074006900200070006f00730073006f006e006f0020006500730073006500720065002000610070006500720074006900200063006f006e0020004100630072006f00620061007400200065002000410064006f00620065002000520065006100640065007200200035002e003000200065002000760065007200730069006f006e006900200073007500630063006500730073006900760065002egt JPN 
ltFEFF9ad854c18cea51fa529b7528002000410064006f0062006500200050004400460020658766f8306e4f5c6210306b4f7f75283057307e30593002537052376642306e753b8cea3092670059279650306b4fdd306430533068304c3067304d307e3059300230c730b930af30c830c330d730d730ea30f330bf3067306e53705237307e305f306f30d730eb30fc30d57528306b9069305730663044307e305930023053306e8a2d5b9a30674f5c62103055308c305f0020005000440046002030d530a130a430eb306f3001004100630072006f0062006100740020304a30883073002000410064006f00620065002000520065006100640065007200200035002e003000204ee5964d3067958b304f30533068304c3067304d307e30593002gt KOR ltFEFFc7740020c124c815c7440020c0acc6a9d558c5ec0020b370c2a4d06cd0d10020d504b9b0d1300020bc0f0020ad50c815ae30c5d0c11c0020ace0d488c9c8b85c0020c778c1c4d560002000410064006f0062006500200050004400460020bb38c11cb97c0020c791c131d569b2c8b2e4002e0020c774b807ac8c0020c791c131b41c00200050004400460020bb38c11cb2940020004100630072006f0062006100740020bc0f002000410064006f00620065002000520065006100640065007200200035002e00300020c774c0c1c5d0c11c0020c5f40020c2180020c788c2b5b2c8b2e4002egt NLD (Gebruik deze instellingen om Adobe PDF-documenten te maken voor kwaliteitsafdrukken op desktopprinters en proofers De gemaakte PDF-documenten kunnen worden geopend met Acrobat en Adobe Reader 50 en hoger) NOR 
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-48
LSA (47)
bull Frameworks to circumvent vocabulary mismatch
Doc
Query
terms
terms
doc expansion
query expansion
literal term matching
structure model
structure model
latent semanticstructure retrieval
ML-49
LSA (57)
ML-50
LSA (67)
Query ldquohuman computer interactionrdquo
An OOV word
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T
ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T
For stretching or shrinking
Irxr
Ursquo=Umxk
Σrsquo=Σk
Vrsquo=Vnxk
wjwi
dk
ds
ML-57
LSA Derivations (67)
bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the
original analysisbull Fold-in a new mx1 query (or doc) vector
ndash Cosine measure between the query and doc vectors in the latent semantic space
( ) 111ˆ minus
timestimestimestimes Σ= kkkmmT
k UqqQuery represented by the weightedsum of it constituent term vectors
The separate dimensions are differentially weighted
Just like a row of V
( )ΣΣ
Σ=ΣΣ=
dq
dqdqcoinedqsimT
ˆˆ
ˆˆ)ˆˆ(ˆˆ
2
row vectors
ML-58
LSA Derivations (77)
bull Fold-in a new 1 X n term vector 1
11ˆminustimestimestimestimes Σ= kkknnk Vtt
ML-59
LSA Example
bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels
Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)
ML-60
LSA Conclusions
bull Advantagesndash A clean formal framework and a clearly defined optimization
criterion (least-squares)bull Conceptual simplicity and clarity
ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)
ndash Good results for high-recall searchbull Take term co-occurrence into account
bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy
bull Eg bank basshellip
ML-61
LSA Toolkit SVDLIBC (15)
bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library
bull Download it at httptedlabmitedu~dr
ML-62
LSA Toolkit SVDLIBC (25)
bull Given a sparse term-doc matrixndash Eg 4 terms and 3 docs
ndash Each entry is weighted by TFxIDF score
bull Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
bull Evaluate the information retrieval capability of the LSA approach by using varying sizes (eg 100 200 600 etc) of LSA dimensionality
23 00 42
00 13 22
38 00 05
00 00 00
Term
Doc4 3 6
2
0 23
2 38
11 1330 421 222 05
RowTem
Col Doc
Nonzero entries
2 nonzero entries at Col 0
Col 0 Row 0 Col 0 Row 2
1 nonzero entryat Col 1
Col 1 Row 1 3 nonzero entry
at Col 2Col 2 Row 0 Col 2 Row 1 Col 2 Row 2
ML-63
LSA Toolkit SVDLIBC (35)
bull Example term-docmatrix
bull SVD command (IR_svdbat)svd -r st -o LSA100 -d 100 Term-Doc-Matrix
51253 2265 21885277508 7725771596 16213399612 13080868709 7725771713 7725771744 77257711190 77257711200 162133991259 7725771helliphellip
Indexing Term no Doc no Nonzero
entries
sparse matrix input prefix of output filesNo of reserved
eigenvectors name of sparse
matrix input
LSA100-Ut
LSA100-S
LSA100-Vt
output
ML-64
LSA Toolkit SVDLIBC (45)
bull LSA100-Ut
bull LSA100-S
100 512530003 0001 helliphellip0002 0002 helliphellip
word vector (uT) 1x100
51253 words
10026861882994155959hellip
100 eigenvalues
bull LSA100-Vt
100 22650021 0035 helliphellip0012 0022 helliphellip
doc vector (vT) 1x100
2265 docs
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m x 1 query vector q (TFxIDF weighted beforehand)

  \[ \hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k} \]

  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted
  - $\hat{q}$ is just like a row of V

• Cosine measure between the query and doc vectors in the latent semantic space

  \[ sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|} \]
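One way to see why a folded-in query is "just like a row of V" is to fold in a document that is already a column of A. The short derivation below assumes q is the j-th column of A; the symbol $e_j$ (the j-th standard basis vector) is notation introduced here, not taken from the slides.

```latex
\begin{aligned}
q &= A e_j = U \Sigma V^T e_j \\
\hat{q} &= q^T U_k \Sigma_k^{-1}
         = e_j^T V \Sigma U^T U_k \Sigma_k^{-1}
         = e_j^T V_k \Sigma_k \Sigma_k^{-1}
         = e_j^T V_k
\end{aligned}
```

So folding in an existing document returns exactly its row of $V_k$; a new query is placed in the same space by the same mapping.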
ML-49
LSA (5/7)
ML-50
LSA (6/7)
Query: "human computer interaction"
An OOV word
ML-51
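LSA (7/7)
• Singular Value Decomposition (SVD)

  \[ A_{m \times n} = U_{m \times r} \, \Sigma_{r \times r} \, V^T_{r \times n} \]

  - The rows of A correspond to the terms $w_1, w_2, \dots, w_m$ and the columns to the docs $d_1, d_2, \dots, d_n$
  - Both U and V have orthonormal column vectors: $U^T U = I_{r \times r}$, $V^T V = I_{r \times r}$
  - $r \le \min(m, n)$
  - Row A $\in R^n$, Col A $\in R^m$

• Reduced decomposition with $k \le r$

  \[ A'_{m \times n} = U'_{m \times k} \, \Sigma_{k \times k} \, V'^T_{k \times n} \]

  - Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
  - $\|A\|_F^2 \ge \|A'\|_F^2$, where

  \[ \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 \]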
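The decomposition and its rank-k truncation can be tried out numerically. The sketch below uses NumPy's `numpy.linalg.svd` on the 4-term, 3-doc example matrix from the SVDLIBC slides; the choice k = 2 is an arbitrary assumption made for illustration.

```python
import numpy as np

# 4 terms x 3 docs example matrix (TFxIDF weights).
A = np.array([
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

# Thin SVD: U is m x r, s holds the singular values, Vt is r x n.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of retained dimensions (illustrative choice)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
A_k = U_k @ np.diag(s_k) @ Vt_k          # A' = U' Sigma_k V'^T

fro_A = np.sum(A ** 2)                   # ||A||_F^2
fro_A_k = np.sum(A_k ** 2)               # ||A'||_F^2
print(fro_A, fro_A_k)
```

The squared Frobenius norm of the truncated matrix is the sum of the k retained squared singular values, so it can never exceed that of A.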
ML-52
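LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^T A$ is a symmetric $n \times n$ matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^n$): $V_{n \times n} = [v_1 \; v_2 \; \dots \; v_n]$, $v_j^T v_j = 1$, $V^T V = I_{n \times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \dots, n$
  - As the square roots of the eigenvalues of $A^T A$: $\Sigma^2 = diag(\lambda_1, \dots, \lambda_n)$
  - As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, ...

  \[ \|Av_i\|^2 = v_i^T A^T A v_i = v_i^T \lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i \]

• For $\lambda_i \ne 0$, $i = 1, \dots, r$, $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A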
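Both definitions of the singular values can be checked numerically. The sketch below is an illustration using NumPy's `numpy.linalg.eigh` on $A^T A$ for the small example matrix; it compares the square roots of the eigenvalues with the lengths $\|Av_j\|$ and with the singular values returned by `numpy.linalg.svd`.

```python
import numpy as np

A = np.array([
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

# Eigen-decomposition of the symmetric matrix A^T A (eigh returns ascending order).
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]            # reorder so lambda_1 >= lambda_2 >= ...
eigvals, V = eigvals[order], V[:, order]

# Definition 1: square roots of the eigenvalues (clip guards against tiny negatives).
sigma = np.sqrt(np.clip(eigvals, 0.0, None))

# Definition 2: lengths of the vectors A v_1, ..., A v_n.
lengths = np.linalg.norm(A @ V, axis=0)

# Reference values from the library SVD.
sigma_ref = np.linalg.svd(A, compute_uv=False)
print(sigma, lengths, sigma_ref)
```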
ML-53
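LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A

  \[ Av_i \cdot Av_j = (Av_i)^T Av_j = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \ne j) \]

  - Suppose that A (or $A^T A$) has rank $r \le n$

  \[ \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0, \quad \lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0 \]

  - Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col A

  \[ u_i = \frac{1}{\|Av_i\|} Av_i = \frac{1}{\sigma_i} Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1 \; u_2 \; \dots \; u_r] \, \Sigma_r = A \, [v_1 \; v_2 \; \dots \; v_r] \]

    • V: an orthonormal matrix ($n \times r$), known in advance
    • U: also an orthonormal matrix ($m \times r$)
    • Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $R^m$

  \[ [u_1 \; \dots \; u_r \; \dots \; u_m] \, \Sigma = A \, [v_1 \; \dots \; v_r \; \dots \; v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \]

  where $VV^T = I_{n \times n}$ and

  \[ \Sigma_{m \times n} = \begin{pmatrix} \Sigma_r & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix} \]

• Frobenius norm

  \[ \|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2 \]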
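The construction can be followed step by step in code: take the eigenvectors of $A^T A$, keep those with nonzero eigenvalue, and normalize $Av_i$ to obtain $u_i$. The sketch below uses a deliberately rank-deficient 3 x 2 matrix (an assumption chosen for illustration so that r < n) and a numerical threshold of 1e-10 to decide which eigenvalues count as zero.

```python
import numpy as np

# Rank-1 example: the second column is twice the first, so r = 1 < n = 2.
A = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
])

eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]            # lambda_1 >= lambda_2 >= ...
eigvals, V = eigvals[order], V[:, order]

r = int(np.sum(eigvals > 1e-10))             # numerical rank
V_r = V[:, :r]                               # v_1, ..., v_r
sigma_r = np.sqrt(eigvals[:r])               # sigma_i = sqrt(lambda_i)

# u_i = A v_i / sigma_i (broadcasting divides column i by sigma_i).
U_r = (A @ V_r) / sigma_r

A_rebuilt = U_r @ np.diag(sigma_r) @ V_r.T   # U_r Sigma_r V_r^T
print(r, sigma_r, A_rebuilt)
```

Here the single nonzero eigenvalue is 70, which is also the squared Frobenius norm of A, as the last formula on the slide predicts.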
ML-54
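LSA Derivations (3/7)
[Figure: A ($m \times n$) maps $R^n$ to $R^m$; $V_1$ and $V_2$ partition the basis of $R^n$, $U_1$ and $U_2$ partition the basis of $R^m$]

  \[ A = U \Sigma V^T = (U_1 \; U_2) \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^T \\ V_2^T \end{pmatrix} = U_1 \Sigma_1 V_1^T = A V_1 V_1^T \]

  - $\{v_i\}$ spans the row space of A
  - $\{u_i\}$ spans the row space of $A^T$
  - $AX = 0$ (null space)
  - $U \Sigma = AV$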
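The block partition can be inspected with a full SVD. The sketch below is an illustration on a rank-1 matrix (an assumed example, not from the slides): the columns of $V_2$ are sent to zero by A, and only the $U_1$, $\Sigma_1$, $V_1$ blocks are needed to rebuild A.

```python
import numpy as np

A = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
])

# Full SVD: U is 3 x 3 and Vt is 2 x 2, so the "extra" basis vectors are included.
U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = int(np.sum(s > 1e-10))                   # numerical rank

U1, U2 = U[:, :r], U[:, r:]                  # U = (U1 U2)
V1, V2 = Vt[:r, :].T, Vt[r:, :].T            # V = (V1 V2)
S1 = np.diag(s[:r])                          # Sigma_1

print(A @ V2)                                # columns of V2 solve AX = 0
print(U1 @ S1 @ V1.T)                        # equals A
```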
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V

  \[ A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^T V = U\Sigma \;\Rightarrow\; U\Sigma = AV \]

    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  - Each row of V is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of U

  \[ A = U\Sigma V^T \;\Rightarrow\; A^T U = (U\Sigma V^T)^T U = V\Sigma U^T U = V\Sigma \;\Rightarrow\; V\Sigma = A^T U \]

    • The i-th entry of a row of V is related to the projection of a corresponding row of $A^T$ onto the i-th column of U
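Both projection identities are easy to confirm numerically. The following sketch uses the thin SVD from NumPy on the small 4-term, 3-doc example matrix and compares the two sides of each identity.

```python
import numpy as np

A = np.array([
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
S, V = np.diag(s), Vt.T

term_side = U @ S        # rows of U, scaled by the singular values
term_proj = A @ V        # rows of A projected onto the columns of V

doc_side = V @ S         # rows of V, scaled by the singular values
doc_proj = A.T @ U       # rows of A^T projected onto the columns of U

print(np.allclose(term_side, term_proj), np.allclose(doc_side, doc_proj))
```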
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), $m \times n$, with rows $w_1, w_2, \dots, w_m$ and columns $d_1, d_2, \dots, d_n$
    • Compare two terms ($w_i$, $w_j$) → dot product of two rows of A, or an entry in $AA^T$
    • Compare two docs ($d_k$, $d_s$) → dot product of two columns of A, or an entry in $A^T A$
    • Compare a term and a doc → each individual entry of A
  - The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
    • Compare two terms → dot product of two rows of $U'\Sigma'$

  \[ A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T \]

    • Compare two docs → dot product of two rows of $V'\Sigma'$

  \[ A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T \]

    • Compare a query word and a doc → each individual entry of A'
  - In both products the inner factor ($V'^T V'$ or $U'^T U'$) is the identity $I_{k \times k}$; $\Sigma'$ is for stretching or shrinking
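The two identities mean that term-term and doc-doc comparisons never need the full matrix A'. A minimal sketch, assuming k = 2 and the small example matrix from the SVDLIBC slides (the variable names are illustrative):

```python
import numpy as np

A = np.array([
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = U_k @ S_k @ V_k.T                  # A'

term_coords = U_k @ S_k                  # one k-dimensional row per term
doc_coords = V_k @ S_k                   # one k-dimensional row per doc

term_term = term_coords @ term_coords.T  # all pairwise term comparisons
doc_doc = doc_coords @ doc_coords.T      # all pairwise doc comparisons

print(term_term.shape, doc_doc.shape)
```

Entry (i, j) of `term_term` equals entry (i, j) of A'A'^T, but is computed from k-dimensional vectors only.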
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  - For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new m x 1 query (or doc) vector

  \[ \hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k} \]

  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted
  - $\hat{q}$ is just like a row of V
  - Cosine measure between the query and doc vectors (row vectors) in the latent semantic space

  \[ sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|} \]
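Fold-in and the cosine measure can be combined into a small retrieval sketch. The code below is illustrative only: the query vector, the choice k = 2, and the helper names `fold_in` and `sim` are assumptions, and a real query would be TFxIDF weighted in the same way as the matrix.

```python
import numpy as np

A = np.array([
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T


def fold_in(q):
    """q_hat = q^T U_k Sigma_k^{-1}: map an m-dimensional vector into the latent space."""
    return q @ U_k @ np.linalg.inv(S_k)


def sim(q_hat, d_hat):
    """Cosine between q_hat Sigma and d_hat Sigma."""
    a, b = q_hat @ S_k, d_hat @ S_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


q = np.array([1.0, 0.0, 1.0, 0.0])       # hypothetical query using terms 0 and 2
q_hat = fold_in(q)

# Score every doc (each doc is a row of V_k) against the folded-in query.
scores = [sim(q_hat, V_k[j]) for j in range(V_k.shape[0])]
print(scores)
```

Folding in a column of A returns the corresponding row of V_k, which is the sense in which the folded-in query behaves like a row of V.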
ML-58
LSA Derivations (7/7)
• Fold-in a new 1 x n term vector

  \[ \hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k} \]
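The term fold-in mirrors the query fold-in with the roles of U and V exchanged. As a quick sanity check, the sketch below (with an assumed k = 2 and a hypothetical helper `fold_in_term`) folds in a term that is already a row of A and recovers its row of $U_k$.

```python
import numpy as np

A = np.array([
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, V_k = U[:, :k], Vt[:k, :].T
S_k_inv = np.diag(1.0 / s[:k])           # Sigma_k^{-1}


def fold_in_term(t):
    """t_hat = t V_k Sigma_k^{-1}: map a 1 x n term vector into the latent space."""
    return t @ V_k @ S_k_inv


t_hat = fold_in_term(A[0, :])            # fold in the existing term 0
print(t_hat, U_k[0])
```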
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels

[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
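For reference, a recall-precision curve at the 11 standard recall levels (0.0, 0.1, ..., 1.0) is commonly computed with interpolated precision, i.e. the highest precision observed at any recall at or above each level. The sketch below assumes that convention; whether the reported experiments used exactly this interpolation is not stated on the slide, and the function name and toy ranking are hypothetical.

```python
def eleven_point_precision(ranked_relevance, total_relevant):
    """ranked_relevance: booleans in rank order, True where the retrieved doc is relevant.
    Returns the interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    points = []   # (recall, precision) measured at each relevant doc
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    curve = []
    for i in range(11):
        level = i / 10
        # Interpolation: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


# Toy ranking: relevant docs at ranks 1 and 3, two relevant docs in total.
ranking = [True, False, True, False, False]
curve = eleven_point_precision(ranking, total_relevant=2)
print(curve)
```

Averaging such curves over all queries gives one curve per retrieval model, which is how models like VSM, LSA and HMM are usually compared.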
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, ...
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-50
LSA (6/7)
Query: “human computer interaction”
An OOV word
ML-51
LSA (7/7)
• Singular Value Decomposition (SVD)
$$A_{m\times n} = U_{m\times r}\,\Sigma_{r\times r}\,V^T_{r\times n}$$
(rows of A: words w1, w2, …, wm; columns of A: docs d1, d2, …, dn)
$$A'_{m\times n} = U'_{m\times k}\,\Sigma_{k\times k}\,V'^T_{k\times n}$$
Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$.
Both U and V have orthonormal column vectors: $U^TU = I_{r\times r}$, $V^TV = I_{r\times r}$
$k \le r$, $\|A\|_F^2 \ge \|A'\|_F^2$, $r \le \min(m,n)$
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2$$
$\mathrm{Row}\,A \subseteq \mathbb{R}^n$, $\mathrm{Col}\,A \subseteq \mathbb{R}^m$
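The rank-k truncation above can be sketched with NumPy. This is only an illustration: the helper name `truncated_svd` and the toy 4×3 matrix are assumptions, not part of the original slides.

```python
import numpy as np

def truncated_svd(A, k):
    """Return U' (m x k), the k largest singular values, and V'^T (k x n)."""
    # numpy returns the singular values in descending order
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

# Toy 4-term x 3-doc matrix (illustrative values only)
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

Uk, sk, Vkt = truncated_svd(A, k=2)
A_k = Uk @ np.diag(sk) @ Vkt      # rank-k approximation A'

fro_A = np.sum(A ** 2)            # ||A||_F^2 = sum_i sum_j a_ij^2
fro_Ak = np.sum(A_k ** 2)         # ||A'||_F^2, never larger than ||A||_F^2
```

Keeping only the k largest singular values is what makes $\|A'\|_F^2 \le \|A\|_F^2$ hold, since each dropped term $\sigma_i^2$ is nonnegative.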
ML-52
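LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  – $A^TA$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n \ge 0$
    • All eigenvectors $v_j$ are orthonormal ($v_j \in \mathbb{R}^n$): $V_{n\times n} = [v_1\ v_2\ \dots\ v_n]$, $v_j^Tv_j = 1$, $V^TV = I_{n\times n}$
• Define singular values $\sigma_j = \sqrt{\lambda_j},\ j = 1,\dots,n$, so that $\Sigma^2 = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$
  – As the square roots of the eigenvalues of $A^TA$
  – As the lengths of the vectors $Av_1, Av_2, \dots, Av_n$: $\sigma_1 = \|Av_1\|$, $\sigma_2 = \|Av_2\|$, …
$$\|Av_i\|^2 = (Av_i)^TAv_i = v_i^TA^TAv_i = v_i^T\lambda_iv_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$$
For $\lambda_i \ne 0$, $i = 1,\dots,r$, $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A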
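As a quick numerical check of the two definitions of the singular values, the following sketch compares $\sqrt{\lambda_j}$ and $\|Av_j\|$ with the values returned by a library SVD. The matrix is an arbitrary example chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

# Eigen-decomposition of the symmetric matrix A^T A (eigh returns ascending order)
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]          # reorder so lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]

sigma_from_eig = np.sqrt(np.clip(lam, 0.0, None))  # sigma_j = sqrt(lambda_j)
sigma_from_len = np.linalg.norm(A @ V, axis=0)     # sigma_j = ||A v_j||
sigma_ref = np.linalg.svd(A, compute_uv=False)     # library singular values
```

The `np.clip` call only guards against tiny negative eigenvalues caused by rounding; in exact arithmetic every $\lambda_j$ is nonnegative.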
ML-53
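LSA Derivations (2/7)
• $\{Av_1, Av_2, \dots, Av_r\}$ is an orthogonal basis of Col A
$$Av_i \cdot Av_j = (Av_i)^TAv_j = v_i^TA^TAv_j = \lambda_jv_i^Tv_j = 0 \quad (i \ne j)$$
  – Suppose that A (or $A^TA$) has rank $r \le n$
$$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \dots = \lambda_n = 0$$
  – Define an orthonormal basis $\{u_1, u_2, \dots, u_r\}$ for Col A
$$u_i = \frac{1}{\|Av_i\|}Av_i = \frac{1}{\sigma_i}Av_i \;\Rightarrow\; \sigma_iu_i = Av_i \;\Rightarrow\; [u_1\ u_2\ \dots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \dots\ v_r]$$
    (V: an orthonormal matrix (n×r), known in advance; U: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \dots, u_m\}$ of $\mathbb{R}^m$
$$[u_1\ \dots\ u_r\ \dots\ u_m]\,\Sigma = A\,[v_1\ \dots\ v_r\ \dots\ v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T \qquad (VV^T = I_{n\times n})$$
$$\Sigma_{m\times n} = \begin{pmatrix}\Sigma_{r\times r} & 0_{r\times(n-r)}\\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)}\end{pmatrix}$$
$$\|A\|_F^2 = \sigma_1^2 + \sigma_2^2 + \dots + \sigma_r^2, \qquad \text{where } \|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}a_{ij}^2$$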
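The construction of the $u_i$ from the $v_i$ can be replayed numerically. The sketch below builds $U_r$ column by column as $Av_i/\sigma_i$ and then rebuilds A; the example matrix and the rank tolerance `1e-10` are assumptions made for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])

lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

r = int(np.sum(lam > 1e-10))     # numerical rank: number of nonzero eigenvalues
sigma = np.sqrt(lam[:r])
V_r = V[:, :r]                   # n x r
U_r = (A @ V_r) / sigma          # column i is u_i = A v_i / sigma_i  (m x r)

A_rebuilt = U_r @ np.diag(sigma) @ V_r.T   # U_r Sigma_r V_r^T
```

Only the first r columns are needed to rebuild A, because the remaining $v_j$ satisfy $Av_j = 0$.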
ML-54
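LSA Derivations (3/7)
$$A = U\Sigma V^T = (U_1\ U_2)\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} = U_1\Sigma_1V_1^T = AV_1V_1^T$$
$$U\Sigma = AV$$
[Figure: $A_{m\times n}$ maps $\mathbb{R}^n$ to $\mathbb{R}^m$. $V_1$ (columns $v_i$) spans the row space of $A$; $U_1$ (columns $u_i$) spans the row space of $A^T$; $V_2$ and $U_2$ complete the bases $V$ and $U$, with $AX = 0$ for X in the span of $V_2$.]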
ML-55
LSA Derivations (4/7)
• Additional Explanations
  – Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$
$$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^TV = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
    • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
  – Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$
$$A = U\Sigma V^T \;\Rightarrow\; A^TU = (U\Sigma V^T)^TU = V\Sigma U^TU = V\Sigma \;\Rightarrow\; V\Sigma = A^TU$$
    • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
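Both projection identities, $U\Sigma = AV$ and $V\Sigma = A^TU$, are easy to confirm numerically. A minimal sketch, using an illustrative 4×3 matrix and NumPy's thin SVD:

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
V, S = Vt.T, np.diag(s)

term_coords = A @ V      # rows of A projected onto the columns of V
doc_coords = A.T @ U     # rows of A^T projected onto the columns of U
```

Here `term_coords` should coincide with `U @ S` and `doc_coords` with `V @ S`, so each row of U (or V) is the projection rescaled by the inverse singular values.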
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix (A, m×n; rows w1, w2, …, wm; columns d1, d2, …, dn)
    • compare two terms → dot product of two rows of A, or an entry in $AA^T$
    • compare two docs → dot product of two columns of A, or an entry in $A^TA$
    • compare a term and a doc → each individual entry of A
  – The new word-document matrix (A'), with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$
    • compare two terms (e.g. $w_i$, $w_j$) → dot product of two rows of $U'\Sigma'$
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$$
    • compare two docs (e.g. $d_k$, $d_s$) → dot product of two rows of $V'\Sigma'$
$$A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
    • compare a query word and a doc → each individual entry of A'
  – In both products the inner factor ($V'^TV'$ or $U'^TU'$) is the identity $I_{k\times k}$, and $\Sigma'$ is for stretching or shrinking
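The comparisons in the reduced space can be sketched as follows. The matrix, the choice k = 2, and the variable names are illustrative assumptions.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T
A_k = Uk @ Sk @ Vk.T                   # the new word-document matrix A'

term_vecs = Uk @ Sk                    # one k-dimensional row per term
doc_vecs = Vk @ Sk                     # one k-dimensional row per doc

term_term = term_vecs @ term_vecs.T    # all term-term comparisons
doc_doc = doc_vecs @ doc_vecs.T        # all doc-doc comparisons
```

Working with the k-dimensional rows avoids forming the full m×n matrix A' when only term-term or doc-doc scores are needed.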
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs q
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new m×1 query (or doc) vector
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
      The query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted (by $\Sigma^{-1}$); $\hat{q}$ is just like a row of V
  – Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)
$$sim(\hat{q},\hat{d}) = cosine(\hat{q}\Sigma,\ \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
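A minimal sketch of query fold-in and the cosine measure, assuming a truncated SVD computed with NumPy. The function names `fold_in_query` and `lsa_similarity` are hypothetical, and the toy matrix stands in for a real TFxIDF term-doc matrix.

```python
import numpy as np

def fold_in_query(q, Uk, sk):
    """q_hat (length k) = q^T U_k Sigma_k^{-1} for an m-dimensional term vector q."""
    return (q @ Uk) / sk

def lsa_similarity(q_hat, d_hat, sk):
    """Cosine between q_hat Sigma and d_hat Sigma (both treated as row vectors)."""
    qs, ds = q_hat * sk, d_hat * sk
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds)))

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

# Folding in an existing document column reproduces its row of V_k
d0_hat = fold_in_query(A[:, 0], Uk, sk)
```

Folding in a column that was already part of A is a convenient sanity check: the result should match the corresponding row of $V_k$, which is why a folded-in query is "just like a row of V".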
ML-58
LSA Derivations (7/7)
• Fold-in a new 1×n term vector
$$\hat{t}_{1\times k} = t_{1\times n}\,V_{n\times k}\,\Sigma^{-1}_{k\times k}$$
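Term fold-in mirrors query fold-in with V in place of U. The sketch below uses a hypothetical helper `fold_in_term` and checks it against an existing row of A; the matrix and k are illustrative.

```python
import numpy as np

def fold_in_term(t, Vk, sk):
    """t_hat (length k) = t V_k Sigma_k^{-1} for an n-dimensional row vector t."""
    return (t @ Vk) / sk

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

# Folding in an existing term row reproduces its row of U_k
t0_hat = fold_in_term(A[0, :], Vk, sk)
```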
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems (“heterogeneous vocabulary”)
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g. bank, bass, …
ML-61
LSA Toolkit SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g. 4 terms and 3 docs (rows: terms; columns: docs)

        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0

  – Each entry is weighted by its TFxIDF score
  – Sparse text representation of the same matrix:

        4 3 6      (4 rows/terms, 3 cols/docs, 6 nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g. 100, 200, 600, etc.) of LSA dimensionality
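The column-wise sparse layout described above (a header with rows, columns and nonzero count, then for each column its nonzero count followed by "row value" pairs) can be produced from a dense matrix with a few lines of Python. The function name `to_sparse_text` is hypothetical, and the `%g`-style number formatting is an assumption made for readability.

```python
def to_sparse_text(rows):
    """Serialise a dense term-by-doc matrix (list of rows) column by column."""
    n_rows, n_cols = len(rows), len(rows[0])
    columns = []
    for c in range(n_cols):
        # keep (row index, value) for every nonzero entry of column c
        entries = [(r, rows[r][c]) for r in range(n_rows) if rows[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)

    lines = [f"{n_rows} {n_cols} {total}"]              # header line
    for entries in columns:
        lines.append(str(len(entries)))                 # nonzero count of the column
        lines.extend(f"{r} {v:g}" for r, v in entries)  # "row value" pairs
    return "\n".join(lines) + "\n"

matrix = [[2.3, 0.0, 4.2],
          [0.0, 1.3, 2.2],
          [3.8, 0.0, 0.5],
          [0.0, 0.0, 0.0]]
text = to_sparse_text(matrix)
```

For the 4×3 example this yields the header `4 3 6` followed by the three column blocks; the all-zero fourth term contributes no entries.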
ML-63
LSA Toolkit SVDLIBC (3/5)
• Example term-doc matrix (sparse text input)

        51253 2265 218852     (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……

• SVD command (IR_svd.bat)

        svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1×100)

        100 51253
        0.003 0.001 ……
        0.002 0.002 ……

• LSA100-S (100 eigenvalues)

        100
        2686.18
        829.941
        559.59
        …

• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1×100)

        100 2265
        0.021 0.035 ……
        0.012 0.022 ……
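A small reader for these output files might look like the sketch below. It assumes the layout suggested by the listings on this slide: a "rows cols" header followed by the values for the -Ut and -Vt files, and a count followed by one value per line for the -S file. The function names are hypothetical, and the strings parsed at the end are tiny made-up stand-ins for the real files.

```python
def parse_dense_matrix(text):
    """Parse 'rows cols' followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(tok) for tok in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    return [values[i * n_cols:(i + 1) * n_cols] for i in range(n_rows)]

def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(tok) for tok in tokens[1:]]
    if len(values) != count:
        raise ValueError("value count does not match the header")
    return values

def column(matrix, j):
    """Vector of item j: column j of a k x N matrix such as Ut or Vt."""
    return [row[j] for row in matrix]

# Made-up miniature inputs in the assumed layout (2 dimensions, 3 words)
ut = parse_dense_matrix("2 3\n0.003 0.001 0.004\n0.002 0.002 0.001\n")
s = parse_singular_values("2\n2686.18\n829.941\n")
```

Because the files store the transposed matrices, the vector of word (or doc) j is read as column j, which is what `column(ut, j)` returns under this assumption.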
ML-65
LSA Toolkit SVDLIBC (5/5)
• Fold-in a new m×1 query vector (q is TFxIDF weighted beforehand)
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  – $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space
$$sim(\hat{q},\hat{d}) = cosine(\hat{q}\Sigma,\ \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
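Putting the two formulas together, a retrieval step over the toolkit's outputs could be sketched as follows. The function `rank_docs` is hypothetical; it takes the matrices in the transposed orientation of the -Ut and -Vt files and assumes the query vector has already been TFxIDF weighted. The toy matrix replaces a real collection.

```python
import numpy as np

def rank_docs(q, Ut, s, Vt):
    """Return doc indices by decreasing cosine(q_hat Sigma, d_hat Sigma), plus the scores."""
    q_hat = (Ut @ q) / s          # fold-in: q^T U Sigma^{-1}
    qs = q_hat * s                # q_hat Sigma
    ds = Vt.T * s                 # each row: one doc vector of V times Sigma
    sims = ds @ qs / (np.linalg.norm(ds, axis=1) * np.linalg.norm(qs))
    return np.argsort(-sims), sims

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Use the third document itself as the query; it should rank first
order, sims = rank_docs(A[:, 2], U[:, :k].T, s[:k], Vt[:k, :])
```

The sketch does not guard against documents whose reduced vector has zero norm; a production version would need to handle that case.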
ML-51
LSA (77)
bull Singular Value Decomposition (SVD)
=
w1w2
wm
d1 d2 dn
mxn
rxr
mxr
rxnA Umxr
Σr VTrxm
d1 d2 dn
w1w2
wm
=
w1w2
wm
d1 d2 dn
mxn
kxk
mxk
kxnArsquo Ursquomxk
Σk VrsquoTkxm
d1 d2 dn
w1w2
wm
Docs and queries are represented in a k-dimensional space The quantities ofthe axes can be properly weighted according to the associated diagonalvalues of Σk
VTV=IrXr
Both U and V has orthonormalcolumn vectors
UTU=IrXr
K le r ||A||F2 ge ||Arsquo|| F2
r le min(mn)
sumsum= =
=m
i
n
jijFaA
1 1
22
Row A Rn
Col A Rmisinisin
ML-52
LSA Derivations (17)
bull Singular Value Decomposition (SVD)ndash ATA is symmetric nxn matrix
bull All eigenvalues λj are nonnegative real numbers
bull All eigenvectors vj are orthonormal ( Rn)
bull Define singular valuesndash As the square roots of the eigenvalues of ATAndash As the lengths of the vectors Av1 Av2 hellip Avn
021 gegegege nλλλ
njjj 1 == λσ[ ]nnn vvvV 21=times
1=jT
jvv ( )nxn
T IVV =
( )ndiag λλλ 112 =Σ
22
11
Av
Av
=
=
σ
σFor λine 0 i=1helliprAv1 Av2 hellip Avr is an orthogonal basis of Col A
ii
iiiTii
TTii
Av
vvAvAvAv
σ
λλ
=rArr
===2
sigma
isin
ML-53
LSA Derivations (27)
bull Av1 Av2 hellip Avr is an orthogonal basis of Col A
ndash Suppose that A (or ATA) has rank r le n
ndash Define an orthonormal basis u1 u2 hellip ur for Col A
bull Extend to an orthonormal basis u1 u2 hellip um of Rm
( ) 0====bull jT
ijjTT
ijT
iji vvAvAvAvAvAvAv λ
0 0 2121 ====gtgegege ++ nrrr λλλλλλ
[ ] [ ]rrr
iiiii
ii
i
vvvAuuu
AvuAvAvAv
u
11
2121 =ΣrArr
=rArr== σσ
[ ] [ ]
T
TTnrmr
VUA
AVVVUAVU
vvvvAuuuu
Σ=rArr
=ΣrArr=ΣrArr
=ΣrArr
2121
222
21
2 rFA σσσ +++=
sumsum= =
=m
i
n
jijFaA
1 1
22
V an orthonormal matrix (nxr)
nxnI
Known in advance
( )
( ) ( ) ( ) ⎟⎟⎠
⎞⎜⎜⎝
⎛=
minustimesminustimesminus
minustimestimes
rnrmrrm
rnrrnm 00
0Σ Σ
u also an orthonormal matrix
(mxr)
ML-54
LSA Derivations (37)
mxn
A
2 V
1 V 1 U
2 U
( )
AVAV
VΣU
VV
000Σ
UUVUΣ
==
=
⎟⎟⎠
⎞⎜⎜⎝
⎛⎟⎟⎠
⎞⎜⎜⎝
⎛=
11
111
2
1121
T
T
T
TT0 =AX
Av
of space row thespans i
Ti
A
u
of space row
thespans
Rn Rm
U TV
AVU =Σ
ML-55
LSA Derivations (47)
bull Additional Explanationsndash Each row of is related to the projection of a corresponding
row of onto the basis formed by columns of
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
ndash Each row of is related to the projection of a corresponding row of onto the basis formed by
bull the i-th entry of a row of is related to the projection of a corresponding row of onto the i-th column of
UA V
AVUUVVUAV
VUAT
T
=ΣrArrΣ=Σ=rArr
Σ=
UA
V
V
UTA
( )UAV
VUUVUVUUA
VUA
T
TTTT
T
=ΣrArr
Σ=Σ=Σ=rArr
Σ=
TAV
U
ML-56
LSA Derivations (57)
bull Fundamental comparisons based on SVDndash The original word-document matrix (A)
ndash The new word-document matrix (Arsquo)bull compare two terms
rarr dot product of two rows of UrsquoΣrsquobull compare two docs
rarr dot product of two rows of VrsquoΣrsquobull compare a query word and a doc rarr each individual entry of Arsquo
w1w2
wm
d1 d2 dn
mxn
A
bull compare two terms rarr dot product of two rows of Andash or an entry in AAT
bull compare two docs rarr dot product of two columns of Andash or an entry in ATA
bull compare a term and a doc rarr each individual entry of A
ArsquoArsquoT=(UrsquoΣrsquoVrsquoT) (UrsquoΣrsquoVrsquoT)T=UrsquoΣrsquoVrsquoTVrsquoΣrsquoTUrsquoT =(UrsquoΣrsquo)(UrsquoΣrsquo)T
ArsquoTArsquo=(UrsquoΣrsquoVrsquoT)T rsquo(UrsquoΣrsquoVrsquoT) =VrsquoΣrsquoTrsquoUT UrsquoΣrsquoVrsquoT=(VrsquoΣrsquo)(VrsquoΣrsquo)T
For stretching or shrinking
Irxr
Ursquo=Umxk
Σrsquo=Σk
Vrsquo=Vnxk
wjwi
dk
ds
ML-57
LSA Derivations (67)
bull Fold-in find representations for pesudo-docs qndash For objects (new queries or docs) that did not appear in the
original analysisbull Fold-in a new mx1 query (or doc) vector
ndash Cosine measure between the query and doc vectors in the latent semantic space
( ) 111ˆ minus
timestimestimestimes Σ= kkkmmT
k UqqQuery represented by the weightedsum of it constituent term vectors
The separate dimensions are differentially weighted
Just like a row of V
( )ΣΣ
Σ=ΣΣ=
dq
dqdqcoinedqsimT
ˆˆ
ˆˆ)ˆˆ(ˆˆ
2
row vectors
ML-58
LSA Derivations (77)
bull Fold-in a new 1 X n term vector 1
11ˆminustimestimestimestimes Σ= kkknnk Vtt
ML-59
LSA Example
bull Experimental resultsndash HMM is consistently better than VSM at all recall levelsndash LSA is better than VSM at higher recall levels
Recall-Precision curve at 11 standard recall levels evaluated onTDT-3 SD collection (Using word-level indexing terms)
ML-60
LSA Conclusions
bull Advantagesndash A clean formal framework and a clearly defined optimization
criterion (least-squares)bull Conceptual simplicity and clarity
ndash Handle synonymy problems (ldquoheterogeneous vocabularyrdquo)
ndash Good results for high-recall searchbull Take term co-occurrence into account
bull Disadvantagesndash High computational complexityndash LSA offers only a partial solution to polysemy
bull Eg bank basshellip
ML-61
LSA Toolkit SVDLIBC (15)
bull Doug Rohdes SVD C Library version 13 is basedon the SVDPACKC library
bull Download it at httptedlabmitedu~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs
  - Each entry is weighted by TFxIDF score
• Perform SVD to obtain corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality

Example matrix (rows: terms, columns: docs):

            Doc 0   Doc 1   Doc 2
  Term 0     2.3     0.0     4.2
  Term 1     0.0     1.3     2.2
  Term 2     3.8     0.0     0.5
  Term 3     0.0     0.0     0.0

Sparse input file for the same matrix:

  4 3 6      4 rows (terms), 3 columns (docs), 6 nonzero entries
  2          2 nonzero entries at Col 0
  0 2.3      Col 0, Row 0
  2 3.8      Col 0, Row 2
  1          1 nonzero entry at Col 1
  1 1.3      Col 1, Row 1
  3          3 nonzero entries at Col 2
  0 4.2      Col 2, Row 0
  1 2.2      Col 2, Row 1
  2 0.5      Col 2, Row 2
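The column-by-column layout described on this slide can be produced with a few lines of plain Python. This is a sketch under assumptions: the helper name `to_sparse_text` is hypothetical, and the matrix entries are read as 2.3, 4.2, 1.3, 2.2, 3.8 and 0.5 (decimal points assumed).

```python
def to_sparse_text(rows):
    """Serialize a dense list-of-rows matrix column by column.

    Layout assumed from the slide: a header "n_rows n_cols n_nonzero", then for
    each column its nonzero count followed by one "row_index value" line per entry.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    columns = []
    for c in range(n_cols):
        # Collect (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, rows[r][c]) for r in range(n_rows) if rows[r][c] != 0]
        columns.append(entries)
    total = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {total}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines) + "\n"


matrix = [[2.3, 0.0, 4.2],
          [0.0, 1.3, 2.2],
          [3.8, 0.0, 0.5],
          [0.0, 0.0, 0.0]]
text = to_sparse_text(matrix)
print(text)
```

The all-zero fourth term contributes no lines of its own; it only shows up in the row count of the header.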
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix

  51253 2265 218852     no. of indexing terms, no. of docs, no. of nonzero entries
  77
  508 7.725771
  596 16.213399
  612 13.080868
  709 7.725771
  713 7.725771
  744 7.725771
  1190 7.725771
  1200 16.213399
  1259 7.725771
  ……

• SVD command (IR_svd.bat)

  svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (word vectors $u^{T}$, each 1×100; 51253 words)

  100 51253
  0.003 0.001 ……
  0.002 0.002 ……

• LSA100-S (100 singular values)

  100
  2686.18
  829.941
  559.59
  …

• LSA100-Vt (doc vectors $v^{T}$, each 1×100; 2265 docs)

  100 2265
  0.021 0.035 ……
  0.012 0.022 ……
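A minimal sketch of reading such output files back into memory follows. It assumes the files are plain text laid out as the slide shows: a "rows cols" header followed by the values for LSA100-Ut and LSA100-Vt, and a count followed by the values for LSA100-S. The function names and the tiny stand-in strings are made up for illustration; real files would be read from disk.

```python
def read_dense_text(text):
    """Parse "n_rows n_cols" followed by n_rows * n_cols whitespace-separated numbers."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:2 + n_rows * n_cols]]
    if len(values) != n_rows * n_cols:
        raise ValueError("fewer values than the header announces")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def read_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(t) for t in tokens[1:1 + count]]
    if len(values) != count:
        raise ValueError("fewer values than the header announces")
    return values


def column(matrix, j):
    """Return column j, i.e. the latent vector of word (or doc) j."""
    return [row[j] for row in matrix]


# Made-up stand-ins for LSA100-Ut and LSA100-S: 2 dimensions, 3 words.
ut = read_dense_text("2 3\n0.003 0.001 0.004\n0.002 0.002 0.001\n")
s = read_singular_values("2\n2686.18\n829.941\n")
word_1 = column(ut, 1)   # vector of the second word, one entry per dimension
print(word_1, s)
```

With a header of "100 51253", each of the 100 rows would hold one dimension across all words, so a word's 1×100 vector is a column of the parsed matrix.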
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector (TFxIDF weighted beforehand)

$$\hat{q}_{1 \times k} = \left(q^{T}\right)_{1 \times m} U_{m \times k}\, \Sigma^{-1}_{k \times k}$$

  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted (the $\Sigma^{-1}$ factor)
  - $\hat{q}$ is just like a row of V
• Cosine measure between the query and doc vectors in the latent semantic space

$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^{2}\,\hat{d}^{T}}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
ML-52
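LSA Derivations (1/7)
• Singular Value Decomposition (SVD)
  - $A^{T}A$ is a symmetric n×n matrix
    • All eigenvalues $\lambda_j$ are nonnegative real numbers

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \Sigma^{2} = diag(\lambda_1, \lambda_2, \ldots, \lambda_n)$$

    • All eigenvectors $v_j$ are orthonormal ($v_j \in R^{n}$)

$$V_{n \times n} = [v_1\ v_2\ \cdots\ v_n], \qquad v_j^{T}v_j = 1, \qquad V^{T}V = I_{n \times n}$$

• Define singular values
  - As the square roots of the eigenvalues of $A^{T}A$

$$\sigma_j = \sqrt{\lambda_j}, \qquad j = 1, \ldots, n$$

  - As the lengths of the vectors $Av_1, Av_2, \ldots, Av_n$

$$\sigma_1 = \|Av_1\|, \quad \sigma_2 = \|Av_2\|, \quad \ldots$$

$$\|Av_i\|^{2} = v_i^{T}A^{T}Av_i = v_i^{T}\lambda_i v_i = \lambda_i \;\Rightarrow\; \|Av_i\| = \sigma_i$$

  - For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A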
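The two definitions of the singular values can be checked numerically. The sketch below uses NumPy on a made-up 4×3 matrix; the variable names are chosen here for illustration.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

# Eigen-decomposition of the symmetric n x n matrix A^T A.
lam, V = np.linalg.eigh(A.T @ A)     # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]        # reorder so that lambda_1 >= ... >= lambda_n
lam, V = lam[order], V[:, order]

sigma_from_eig = np.sqrt(np.clip(lam, 0.0, None))     # sigma_j = sqrt(lambda_j)
sigma_from_len = np.linalg.norm(A @ V, axis=0)        # sigma_j = ||A v_j||
sigma_reference = np.linalg.svd(A, compute_uv=False)  # library singular values

print(sigma_from_eig, sigma_from_len, sigma_reference)
```

The `np.clip` call only guards against tiny negative eigenvalues caused by rounding; mathematically all $\lambda_j$ are nonnegative.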
ML-53
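LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col A

$$Av_i \bullet Av_j = (Av_i)^{T}Av_j = v_i^{T}A^{T}Av_j = \lambda_j v_i^{T}v_j = 0 \qquad (i \ne j)$$

  - Suppose that A (or $A^{T}A$) has rank r ≤ n

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$

  - Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A

$$u_i = \frac{1}{\|Av_i\|}Av_i = \frac{1}{\sigma_i}Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$$

    ($[v_1 \cdots v_r]$: an orthonormal matrix (n×r), known in advance; $[u_1 \cdots u_r]$: also an orthonormal matrix (m×r))
    • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^{m}$

$$[u_1\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ \cdots\ v_r\ \cdots\ v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^{T} = AVV^{T} \;\Rightarrow\; A = U\Sigma V^{T} \qquad (VV^{T} = I_{n \times n})$$

$$\Sigma_{m \times n} = \begin{pmatrix} \Sigma_{r \times r} & 0_{r \times (n-r)} \\ 0_{(m-r) \times r} & 0_{(m-r) \times (n-r)} \end{pmatrix}$$

$$\|A\|_F^{2} = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^{2} = \sigma_1^{2} + \sigma_2^{2} + \cdots + \sigma_r^{2}$$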
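The construction on this slide (eigenvectors of $A^{T}A$ first, then $u_i = Av_i/\sigma_i$) can be followed step by step in NumPy. This is a sketch on a made-up matrix, with a tolerance of 1e-10 chosen here to decide the numerical rank.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

# Step 1: eigenvectors of A^T A, sorted by decreasing eigenvalue.
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0.0, None))

# Step 2: keep the r directions with nonzero singular value.
r = int(np.sum(sigma > 1e-10))
V_r, sigma_r = V[:, :r], sigma[:r]

# Step 3: u_i = A v_i / sigma_i, computed for all columns at once.
U_r = (A @ V_r) / sigma_r

A_rebuilt = U_r @ np.diag(sigma_r) @ V_r.T
frob_sq_entries = np.sum(A ** 2)        # sum of squared entries of A
frob_sq_sigma = np.sum(sigma_r ** 2)    # sum of squared singular values
print(r, frob_sq_entries, frob_sq_sigma)
```

Only the first r columns of U are built here; extending them to a full basis of $R^{m}$ does not change the product, because the extra columns meet zero blocks of $\Sigma$.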
ML-54
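LSA Derivations (3/7)

$$A_{m \times n} = U\Sigma V^{T} = \begin{pmatrix} U_1 & U_2 \end{pmatrix}\begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix}\begin{pmatrix} V_1^{T} \\ V_2^{T} \end{pmatrix} = U_1\Sigma_1V_1^{T} = AV_1V_1^{T}$$

[Figure: A maps $R^{n}$ to $R^{m}$, with $U\Sigma = AV$. The $v_i$ in $V_1$ span the row space of A; the $u_i$ in $U_1$ span the row space of $A^{T}$; $V_2$ spans the null space, $AX = 0$.]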
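The block partition can be seen on a rank-deficient example. The sketch below uses a made-up 4×3 matrix whose third column is the sum of the first two, so r = 2 and $V_2$ has a single column; the tolerance 1e-10 is an arbitrary choice.

```python
import numpy as np

# Rank-2 example: column 2 = column 0 + column 1 (made-up values).
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 2.0, 2.0],
              [3.0, 0.0, 3.0],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A)              # full SVD: U is 4x4, Vt is 3x3
r = int(np.sum(s > 1e-10))               # numerical rank

U1, S1 = U[:, :r], np.diag(s[:r])
V1, V2 = Vt[:r, :].T, Vt[r:, :].T        # V1: row-space basis, V2: null-space basis

print(r)
print(np.round(A @ V2, 12))              # every entry is (numerically) zero
```

Dropping $U_2$ and $V_2$ loses nothing: they only ever multiply the zero blocks of $\Sigma$.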
ML-55
LSA Derivations (4/7)
• Additional Explanations
  - Each row of U is related to the projection of a corresponding row of A onto the basis formed by the columns of V

$$A = U\Sigma V^{T} \;\Rightarrow\; AV = U\Sigma V^{T}V = U\Sigma \;\Rightarrow\; U\Sigma = AV$$

    • The i-th entry of a row of U is related to the projection of a corresponding row of A onto the i-th column of V
  - Each row of V is related to the projection of a corresponding row of $A^{T}$ onto the basis formed by the columns of U

$$A = U\Sigma V^{T} \;\Rightarrow\; A^{T}U = (U\Sigma V^{T})^{T}U = V\Sigma U^{T}U = V\Sigma \;\Rightarrow\; V\Sigma = A^{T}U$$

    • The i-th entry of a row of V is related to the projection of a corresponding row of $A^{T}$ onto the i-th column of U
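Both projection identities are easy to confirm numerically. The sketch below uses the reduced SVD from NumPy on a made-up matrix; `row_coords` and `col_coords` are names introduced here.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # U: 4x3, s: 3 values, Vt: 3x3
V = Vt.T

row_coords = A @ V      # each row of A projected onto the columns of V
col_coords = A.T @ U    # each row of A^T projected onto the columns of U

# U * s scales column i of U by sigma_i, i.e. it forms U Sigma (likewise V * s).
print(np.allclose(row_coords, U * s), np.allclose(col_coords, V * s))
```

Dividing `row_coords` column-wise by `s` recovers U itself, which is why a row of U is said to be "related to" the projection rather than equal to it.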
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix (A), m×n, with rows $w_1, w_2, \ldots, w_m$ (terms) and columns $d_1, d_2, \ldots, d_n$ (docs)
    • compare two terms → dot product of two rows of A
      - or an entry in $AA^{T}$
    • compare two docs → dot product of two columns of A
      - or an entry in $A^{T}A$
    • compare a term and a doc → each individual entry of A
  - The new word-document matrix (A')
    • compare two terms → dot product of two rows of U'Σ'

$$A'A'^{T} = (U'\Sigma'V'^{T})(U'\Sigma'V'^{T})^{T} = U'\Sigma'V'^{T}V'\Sigma'^{T}U'^{T} = (U'\Sigma')(U'\Sigma')^{T}$$

    • compare two docs → dot product of two rows of V'Σ'

$$A'^{T}A' = (U'\Sigma'V'^{T})^{T}(U'\Sigma'V'^{T}) = V'\Sigma'^{T}U'^{T}U'\Sigma'V'^{T} = (V'\Sigma')(V'\Sigma')^{T}$$

    • compare a query word and a doc → each individual entry of A'
  - Here $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$; Σ' is for stretching or shrinking, and the cancelled inner product is the identity $I_{r \times r}$
[Figure labels: terms $w_i$, $w_j$; docs $d_k$, $d_s$]
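The comparisons in the reduced space can be sketched as follows. The toy matrix and k = 2 are assumptions made for illustration; `term_coords` and `doc_coords` are names introduced here for the rows of U'Σ' and V'Σ'.

```python
import numpy as np

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

k = 2  # number of retained dimensions (arbitrary for this sketch)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

A_k = U_k @ S_k @ V_k.T            # the new word-document matrix A'

term_coords = U_k @ S_k            # one row per term
doc_coords = V_k @ S_k             # one row per doc

term_term = term_coords @ term_coords.T   # all term-term dot products
doc_doc = doc_coords @ doc_coords.T       # all doc-doc dot products

print(np.allclose(term_term, A_k @ A_k.T), np.allclose(doc_doc, A_k.T @ A_k))
```

Working with the k-dimensional rows of `term_coords` and `doc_coords` avoids forming the full m×n matrix A' when only pairwise comparisons are needed.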
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-53
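LSA Derivations (2/7)
• $\{Av_1, Av_2, \ldots, Av_r\}$ is an orthogonal basis of Col $A$:
$$(Av_i)\cdot(Av_j) = (Av_i)^T(Av_j) = v_i^T A^T A v_j = \lambda_j v_i^T v_j = 0 \quad (i \neq j)$$
- Suppose that $A$ (or $A^TA$) has rank $r \le n$:
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0, \qquad \lambda_{r+1} = \lambda_{r+2} = \cdots = \lambda_n = 0$$
- Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col $A$:
$$u_i = \frac{1}{\|Av_i\|}Av_i = \frac{1}{\sigma_i}Av_i \;\Rightarrow\; \sigma_i u_i = Av_i \;\Rightarrow\; [u_1\ u_2\ \cdots\ u_r]\,\Sigma_r = A\,[v_1\ v_2\ \cdots\ v_r]$$
Here $[v_1\ v_2\ \cdots\ v_r]$ is an orthonormal matrix ($n \times r$) known in advance, and $[u_1\ u_2\ \cdots\ u_r]$ is also an orthonormal matrix ($m \times r$).
• Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^m$:
$$[u_1\ u_2\ \cdots\ u_r\ \cdots\ u_m]\,\Sigma = A\,[v_1\ v_2\ \cdots\ v_r\ \cdots\ v_n] \;\Rightarrow\; U\Sigma = AV \;\Rightarrow\; U\Sigma V^T = AVV^T \;\Rightarrow\; A = U\Sigma V^T$$
using $VV^T = I_{n\times n}$, where
$$\Sigma_{m\times n} = \begin{pmatrix}\Sigma_{r\times r} & 0_{r\times(n-r)}\\ 0_{(m-r)\times r} & 0_{(m-r)\times(n-r)}\end{pmatrix}$$
• Frobenius norm:
$$\|A\|_F^2 = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2$$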
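The construction on this slide can be traced on a small matrix. The sketch below is only an illustration: the 2x2 matrix and its hand-computed eigenpairs are assumptions chosen for easy arithmetic and do not come from the slides, and the helpers `matmul` and `transpose` are hypothetical names.

```python
import math

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

# Hypothetical 2x2 example (not from the slides).
A = [[3.0, 0.0],
     [4.0, 5.0]]

# Eigenpairs of A^T A = [[25, 20], [20, 25]], worked out by hand:
# lambda_1 = 45 with v_1 = (1, 1)/sqrt(2); lambda_2 = 5 with v_2 = (1, -1)/sqrt(2).
lambdas = [45.0, 5.0]
a = 1 / math.sqrt(2)
V = [[a, a],
     [a, -a]]                                  # columns are v_1, v_2

sigmas = [math.sqrt(lam) for lam in lambdas]   # sigma_i = sqrt(lambda_i) = ||A v_i||

# u_i = A v_i / sigma_i, stored as the columns of U.
AV = matmul(A, V)
U = [[AV[r][c] / sigmas[c] for c in range(2)] for r in range(2)]

# Rebuild A from the three factors: A = U Sigma V^T.
Sigma = [[sigmas[0], 0.0], [0.0, sigmas[1]]]
A_rebuilt = matmul(matmul(U, Sigma), transpose(V))

# Squared Frobenius norm as the sum of squared entries.
frobenius_sq = sum(x * x for row in A for x in row)
```

With these values, `A_rebuilt` matches `A` up to rounding, and `frobenius_sq` equals the sum of the squared singular values (45 + 5).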
ML-54
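LSA Derivations (3/7)
$$A_{m\times n} = U\Sigma V^T = (U_1\ U_2)\begin{pmatrix}\Sigma_1 & 0\\ 0 & 0\end{pmatrix}\begin{pmatrix}V_1^T\\ V_2^T\end{pmatrix} = U_1\Sigma_1V_1^T = AV_1V_1^T$$
[Figure: $A$ maps $R^n$ to $R^m$. In $R^n$, the $v_i$ of $V_1$ span the row space of $A$, and $V_2$ covers $AX = 0$. In $R^m$, the $u_i$ of $U_1$ span the row space of $A^T$, and $U_2$ is the remainder. The two sides are linked by $U$, $V^T$, and $AV = U\Sigma$.]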
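The partition into $U_1$, $V_1$ and $V_2$ is easiest to see on a rank-deficient matrix. The following sketch uses a hypothetical rank-1 matrix (an assumption for illustration, not taken from the slides) whose factors can be written down by hand.

```python
import math

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*X)]

# Hypothetical rank-1 matrix (m = n = 2, r = 1): the second row is twice the first.
A = [[1.0, 2.0],
     [2.0, 4.0]]

s = 1 / math.sqrt(5)
V1 = [[1 * s], [2 * s]]     # n x r: spans the row space of A
V2 = [[2 * s], [-1 * s]]    # n x (n - r): spans the null space, A x = 0
U1 = [[1 * s], [2 * s]]     # m x r: spans the column space of A
Sigma1 = [[5.0]]            # the single nonzero singular value

A_from_thin = matmul(matmul(U1, Sigma1), transpose(V1))   # U1 Sigma1 V1^T
A_projected = matmul(matmul(A, V1), transpose(V1))        # A V1 V1^T
A_times_V2 = matmul(A, V2)                                # the zero vector
```

Both `A_from_thin` and `A_projected` reproduce `A`, while `A_times_V2` vanishes, so only the $r$ leading singular triplets carry information.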
ML-55
LSA Derivations (4/7)
• Additional explanations
- Each row of $U$ is related to the projection of a corresponding row of $A$ onto the basis formed by the columns of $V$:
$$A = U\Sigma V^T \;\Rightarrow\; AV = U\Sigma V^TV = U\Sigma \;\Rightarrow\; U\Sigma = AV$$
  • the i-th entry of a row of $U$ is related to the projection of a corresponding row of $A$ onto the i-th column of $V$
- Each row of $V$ is related to the projection of a corresponding row of $A^T$ onto the basis formed by the columns of $U$:
$$A = U\Sigma V^T \;\Rightarrow\; A^TU = (U\Sigma V^T)^TU = V\Sigma^TU^TU = V\Sigma^T \;\Rightarrow\; V\Sigma^T = A^TU$$
  • the i-th entry of a row of $V$ is related to the projection of a corresponding row of $A^T$ onto the i-th column of $U$
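Written entry by entry, the two relations make the "projection" reading explicit. In the sketch below, $a_i$ (the i-th row of $A$), $a^{(i)}$ (the i-th column of $A$, i.e. the i-th row of $A^T$), $v_j$ and $u_j$ (the j-th columns of $V$ and $U$) are notation introduced here for illustration only.

```latex
\sigma_j\, u_{ij} \;=\; a_i \cdot v_j \;=\; \sum_{l=1}^{n} a_{il}\, v_{lj},
\qquad
\sigma_j\, v_{ij} \;=\; a^{(i)} \cdot u_j \;=\; \sum_{l=1}^{m} a_{li}\, u_{lj},
\qquad j = 1, \ldots, r
```

So each coordinate of a term (or doc) vector is the inner product of the original row (or column) with one basis vector, rescaled by $1/\sigma_j$.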
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
- The original word-document matrix ($A$, $m \times n$, with rows $w_1, w_2, \ldots, w_m$ and columns $d_1, d_2, \ldots, d_n$)
  • compare two terms → dot product of two rows of $A$, or an entry in $AA^T$
  • compare two docs → dot product of two columns of $A$, or an entry in $A^TA$
  • compare a term and a doc → each individual entry of $A$
- The new word-document matrix ($A' = U'\Sigma'V'^T$, with $U' = U_{m\times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n\times k}$)
  • compare two terms (e.g., $w_i$ and $w_j$) → dot product of two rows of $U'\Sigma'$
$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$$
  • compare two docs (e.g., $d_k$ and $d_s$) → dot product of two rows of $V'\Sigma'$
$$A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$
  • compare a query word and a doc → each individual entry of $A'$
- In both products the inner factor ($V'^TV'$ or $U'^TU'$) is the identity matrix, and $\Sigma'$ is there for stretching or shrinking the axes.
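The two identities can be checked numerically. The sketch below assumes a hypothetical rank-1 ($k = 1$) truncation of a 2-term by 2-doc matrix; the factor values are illustrative and not taken from the slides.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

# Hypothetical k = 1 factors (illustrative values only).
U_k = [[1 / math.sqrt(10)], [3 / math.sqrt(10)]]   # one row per term
S_k = [math.sqrt(45)]                               # diagonal of Sigma'
V_k = [[1 / math.sqrt(2)], [1 / math.sqrt(2)]]      # one row per doc

# Rows of U'Sigma' and V'Sigma': each latent dimension scaled by its singular value.
term_vecs = [[u * s for u, s in zip(row, S_k)] for row in U_k]
doc_vecs = [[v * s for v, s in zip(row, S_k)] for row in V_k]

# A' = U' Sigma' V'^T, built entry by entry for comparison.
A_k = [[sum(u * s * v for u, s, v in zip(urow, S_k, vrow)) for vrow in V_k]
       for urow in U_k]

term_term = dot(term_vecs[0], term_vecs[1])       # entry (0, 1) of A'A'^T via U'Sigma'
doc_doc = dot(doc_vecs[0], doc_vecs[1])           # entry (0, 1) of A'^T A' via V'Sigma'
term_term_direct = dot(A_k[0], A_k[1])            # dot product of two rows of A'
doc_doc_direct = dot([row[0] for row in A_k], [row[1] for row in A_k])  # two columns of A'
```

The point of the design is cost: comparing two terms through `term_vecs` touches $k$ numbers per term rather than the $n$ entries of a row of $A'$.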
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
- For objects (new queries or docs) that did not appear in the original analysis
  • Fold-in a new $m \times 1$ query (or doc) vector:
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
  • The query is represented by the weighted sum of its constituent term vectors ($q^TU$)
  • The separate dimensions are differentially weighted ($\Sigma^{-1}$)
  • The result $\hat{q}$ is just like a row of $V$
- Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors):
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
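A minimal sketch of fold-in and the latent-space cosine follows. The function names `fold_in_query` and `latent_cosine` are hypothetical, and the small factor matrices are illustrative values (a full-rank 2x2 decomposition) rather than anything from the slides.

```python
import math

def fold_in_query(q, U_k, S_k):
    """q_hat = q^T U_k Sigma_k^{-1}: map an m-dim term vector into the k-dim latent space."""
    k = len(S_k)
    return [sum(q[i] * U_k[i][j] for i in range(len(q))) / S_k[j] for j in range(k)]

def latent_cosine(q_hat, d_hat, S_k):
    """Cosine between q_hat*Sigma and d_hat*Sigma, both treated as row vectors."""
    qs = [x * s for x, s in zip(q_hat, S_k)]
    ds = [x * s for x, s in zip(d_hat, S_k)]
    num = sum(x * y for x, y in zip(qs, ds))
    den = math.sqrt(sum(x * x for x in qs)) * math.sqrt(sum(y * y for y in ds))
    return num / den if den else 0.0

# Illustrative factors of the hypothetical matrix A = [[3, 0], [4, 5]].
U_k = [[1 / math.sqrt(10), 3 / math.sqrt(10)],
       [3 / math.sqrt(10), -1 / math.sqrt(10)]]
S_k = [math.sqrt(45), math.sqrt(5)]
V_k = [[1 / math.sqrt(2), 1 / math.sqrt(2)],     # row 0: doc 0
       [1 / math.sqrt(2), -1 / math.sqrt(2)]]    # row 1: doc 1

# Folding in a query equal to doc 0 (the first column of A) recovers row 0 of V.
q = [3.0, 4.0]
q_hat = fold_in_query(q, U_k, S_k)
similarity_to_doc0 = latent_cosine(q_hat, V_k[0], S_k)
```

Folding in an existing document reproduces its row of $V$, which is a convenient sanity check before folding in unseen queries.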
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector:
$$\hat{t}_{1\times k} = t_{1\times n}\,V_{n\times k}\,\Sigma^{-1}_{k\times k}$$
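A short consistency check, sketched under the assumption that the $k$ retained singular values are nonzero: applying the fold-in formula to every existing row of $A$ at once gives back $U_k$, so a folded-in term lives in the same space as the original term vectors.

```latex
A V_k \;=\; U \Sigma V^T V_k \;=\; U_k \Sigma_k
\quad\Rightarrow\quad
A V_k \Sigma_k^{-1} \;=\; U_k
```

Here $U_k$, $\Sigma_k$ and $V_k$ denote the first $k$ columns of $U$, the leading $k \times k$ block of $\Sigma$, and the first $k$ columns of $V$.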
ML-59
LSA Example
• Experimental results
- HMM is consistently better than VSM at all recall levels
- LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
ML-60
LSA Conclusions
• Advantages
- A clean formal framework and a clearly defined optimization criterion (least-squares)
  • Conceptual simplicity and clarity
- Handles synonymy problems ("heterogeneous vocabulary")
- Good results for high-recall search
  • Takes term co-occurrence into account
• Disadvantages
- High computational complexity
- LSA offers only a partial solution to polysemy
  • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
- E.g., 4 terms and 3 docs (rows are terms, columns are docs):

    2.3 0.0 4.2
    0.0 1.3 2.2
    3.8 0.0 0.5
    0.0 0.0 0.0

- Each entry is weighted by its TFxIDF score
- The same matrix in sparse text form, listed column by column:

    4 3 6      4 rows (terms), 3 cols (docs), 6 nonzero entries
    2          2 nonzero entries at Col 0
    0 2.3      Col 0, Row 0
    2 3.8      Col 0, Row 2
    1          1 nonzero entry at Col 1
    1 1.3      Col 1, Row 1
    3          3 nonzero entries at Col 2
    0 4.2      Col 2, Row 0
    1 2.2      Col 2, Row 1
    2 0.5      Col 2, Row 2

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
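The column-wise layout described on this slide can be produced from a dense matrix with a few lines of code. This is a sketch only: the function name `to_sparse_text` is hypothetical, and the layout is inferred from the 4x3 example on the slide rather than from the toolkit's own documentation.

```python
def to_sparse_text(rows):
    """Serialize a dense matrix (list of rows) into the column-wise sparse text layout."""
    n_rows, n_cols = len(rows), len(rows[0])
    # For each column, collect (row index, value) pairs of the nonzero entries.
    columns = [
        [(r, rows[r][c]) for r in range(n_rows) if rows[r][c] != 0]
        for c in range(n_cols)
    ]
    nonzeros = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {nonzeros}"]          # header: rows, cols, nonzero count
    for entries in columns:
        lines.append(str(len(entries)))                # nonzero count for this column
        lines.extend(f"{r} {value}" for r, value in entries)
    return "\n".join(lines) + "\n"

# The 4-term x 3-doc example from the slide.
matrix = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
text = to_sparse_text(matrix)
```

For a real collection the dense matrix would not fit in memory, so the same loop would normally run over per-document term lists instead of a list of rows.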
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (sparse text input)

    51253 2265 218852     indexing term no., doc no., nonzero entries
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……

• SVD command (IR_svd.bat)

    svd -r st -o LSA100 -d 100 Term-Doc-Matrix

- -r st: sparse matrix input
- -o LSA100: prefix of output files
- -d 100: no. of reserved eigenvectors
- Term-Doc-Matrix: name of sparse matrix input
- Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut (51253 words; each word vector $u^T$ is 1x100)

    100 51253
    0.003 0.001 ……
    0.002 0.002 ……

• LSA100-S (100 singular values)

    100
    2686.18
    829.941
    559.59
    …

• LSA100-Vt (2265 docs; each doc vector $v^T$ is 1x100)

    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
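Reading these output files back is mostly a matter of splitting on whitespace. The sketch below assumes the layout shown on the slide (a "rows cols" header followed by the values for the -Ut and -Vt files, and a count followed by one value per line for the -S file); the function names and the toy strings are hypothetical, and a real run would read the file contents instead.

```python
def parse_dense_text(text):
    """Parse a 'rows cols' header followed by whitespace-separated values into a list of rows."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match the header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]

def parse_singular_values(text):
    """Parse a count followed by that many singular values."""
    tokens = text.split()
    values = [float(t) for t in tokens[1:]]
    if len(values) != int(tokens[0]):
        raise ValueError("value count does not match the header")
    return values

def transpose(rows):
    """Turn a k x m matrix (one row per dimension) into m vectors of length k."""
    return [list(col) for col in zip(*rows)]

# Toy stand-ins for LSA100-Ut and LSA100-S with k = 2 dimensions and 3 terms.
ut_text = "2 3\n0.003 0.001 0.002\n0.002 0.002 0.001\n"
s_text = "2\n2686.18\n829.941\n"

term_vectors = transpose(parse_dense_text(ut_text))   # one length-k vector per term
singular_values = parse_singular_values(s_text)
```

Transposing once after loading gives one row per word (or per doc), which matches how the vectors are used in the similarity formulas.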
ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector $q$ (TFxIDF weighted beforehand):
$$\hat{q}_{1\times k} = (q^T)_{1\times m}\,U_{m\times k}\,\Sigma^{-1}_{k\times k}$$
- The query is represented by the weighted sum of its constituent term vectors
- The separate dimensions are differentially weighted
- The result $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space:
$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^2\,\hat{d}^T}{|\hat{q}\Sigma|\,|\hat{d}\Sigma|}$$
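Putting the two formulas together, documents can be ranked against a query directly from the $U^T$, $\Sigma$ and $V^T$ factors. The sketch below is an assumption-laden illustration: `rank_documents` is a hypothetical name, the factors are small hand-made values rather than toolkit output, and the query is assumed to be TFxIDF weighted already.

```python
import math

def rank_documents(q, Ut, S, Vt):
    """Rank docs by cosine(q_hat*Sigma, d_hat*Sigma); Ut is k x m and Vt is k x n."""
    k = len(S)
    # q_hat * Sigma = (q^T U Sigma^{-1}) Sigma = q^T U, so the scaling cancels for the query.
    q_scaled = [sum(qi * u for qi, u in zip(q, Ut[j])) for j in range(k)]
    q_norm = math.sqrt(sum(x * x for x in q_scaled))
    scores = []
    for d in range(len(Vt[0])):
        d_scaled = [Vt[j][d] * S[j] for j in range(k)]   # d_hat * Sigma for doc d
        d_norm = math.sqrt(sum(x * x for x in d_scaled))
        num = sum(a * b for a, b in zip(q_scaled, d_scaled))
        score = num / (q_norm * d_norm) if q_norm and d_norm else 0.0
        scores.append((score, d))
    # Highest similarity first; ties broken by doc index.
    return sorted(scores, key=lambda item: (-item[0], item[1]))

# Illustrative factors of the hypothetical matrix A = [[3, 0], [4, 5]].
Ut = [[1 / math.sqrt(10), 3 / math.sqrt(10)],
      [3 / math.sqrt(10), -1 / math.sqrt(10)]]
S = [math.sqrt(45), math.sqrt(5)]
Vt = [[1 / math.sqrt(2), 1 / math.sqrt(2)],
      [1 / math.sqrt(2), -1 / math.sqrt(2)]]

ranking = rank_documents([0.0, 5.0], Ut, S, Vt)   # list of (score, doc index)
```

Because all dimensions are kept in this toy case, the latent-space cosine coincides with the ordinary cosine between the query and each column of the matrix; with $k$ smaller than the rank the scores would differ.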
ML-55
LSA Derivations (4/7)
• Additional explanations
  – Each row of $U$ is related to the projection of the corresponding row of $A$ onto the basis formed by the columns of $V$
    • The $i$-th entry of a row of $U$ is related to the projection of the corresponding row of $A$ onto the $i$-th column of $V$

    $A = U \Sigma V^T \Rightarrow AV = U \Sigma V^T V = U \Sigma \Rightarrow U \Sigma = AV$

  – Each row of $V$ is related to the projection of the corresponding row of $A^T$ onto the basis formed by the columns of $U$
    • The $i$-th entry of a row of $V$ is related to the projection of the corresponding row of $A^T$ onto the $i$-th column of $U$

    $A = U \Sigma V^T \Rightarrow A^T U = (U \Sigma V^T)^T U = V \Sigma U^T U = V \Sigma \Rightarrow V \Sigma = A^T U$
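The two projection identities can be checked numerically. The following is a minimal sketch, assuming NumPy and a small toy matrix whose values are purely illustrative; it is not part of the original slides.

```python
import numpy as np

# Toy 4x3 term-document matrix (illustrative values; rows = terms, columns = docs).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

# Thin SVD: A = U @ diag(s) @ Vt, with orthonormal columns in U and V.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T
Sigma = np.diag(s)

# Rows of U*Sigma are the rows of A projected onto the columns of V.
assert np.allclose(U @ Sigma, A @ V)

# Rows of V*Sigma are the rows of A^T projected onto the columns of U.
assert np.allclose(V @ Sigma, A.T @ U)
```

Both checks rely only on the orthonormality of the columns of $U$ and $V$, which is why the thin SVD is sufficient here.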
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  – The original word-document matrix ($A$, of size $m \times n$, with rows $w_1, w_2, \dots, w_m$ for terms and columns $d_1, d_2, \dots, d_n$ for docs)
    • Compare two terms (e.g., $w_i$ and $w_j$) → dot product of two rows of $A$, or an entry in $AA^T$
    • Compare two docs (e.g., $d_k$ and $d_s$) → dot product of two columns of $A$, or an entry in $A^TA$
    • Compare a term and a doc → each individual entry of $A$
  – The new word-document matrix ($A' = U' \Sigma' V'^T$, with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$)
    • Compare two terms → dot product of two rows of $U'\Sigma'$
    • Compare two docs → dot product of two rows of $V'\Sigma'$
    • Compare a query word and a doc → each individual entry of $A'$

  $A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^TV'\Sigma'^TU'^T = (U'\Sigma')(U'\Sigma')^T$

  $A'^TA' = (U'\Sigma'V'^T)^T(U'\Sigma'V'^T) = V'\Sigma'^TU'^TU'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$

  – In both products the inner factor ($V'^TV'$ or $U'^TU'$) is the identity matrix; $\Sigma'$ is for stretching or shrinking
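The comparisons in the reduced space can be illustrated with a rank-$k$ truncation. This is a minimal sketch under assumed toy values; the variable names (`U_k`, `term_vecs`, `doc_vecs`) are hypothetical and chosen only for this example.

```python
import numpy as np

# Toy 4x3 term-document matrix (illustrative values; rows = terms, columns = docs).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                          # number of latent dimensions kept
U_k = U[:, :k]                 # U' = U_{m x k}
S_k = np.diag(s[:k])           # Sigma' = Sigma_k
V_k = Vt[:k, :].T              # V' = V_{n x k}
A_k = U_k @ S_k @ V_k.T        # A' = U' Sigma' V'^T

term_vecs = U_k @ S_k          # row i represents term w_i
doc_vecs = V_k @ S_k           # row j represents doc d_j

# Term-term and doc-doc comparisons in A' equal dot products of these rows.
assert np.allclose(A_k @ A_k.T, term_vecs @ term_vecs.T)
assert np.allclose(A_k.T @ A_k, doc_vecs @ doc_vecs.T)

# A term-doc comparison is read directly from an entry of A'.
term0_doc2 = A_k[0, 2]
```

Working with `term_vecs` and `doc_vecs` avoids forming the full $m \times m$ or $n \times n$ product when only a few pairs need to be compared.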
ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  – For objects (new queries or docs) that did not appear in the original analysis
    • Fold-in a new $m \times 1$ query (or doc) vector:

      $\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$

      – $q^T U$: the query is represented by the weighted sum of its constituent term vectors
      – $\Sigma^{-1}$: the separate dimensions are differentially weighted
      – $\hat{q}$ is just like a row of $V$
  – Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors):

      $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$
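The fold-in and cosine steps can be sketched as follows. This is a minimal illustration assuming NumPy, a toy matrix, and a made-up query; the function names `fold_in_query` and `latent_cosine` are hypothetical.

```python
import numpy as np

def fold_in_query(q, U_k, s_k):
    """Return q_hat = q^T U_k Sigma_k^{-1} as a length-k row vector."""
    return (q @ U_k) / s_k

def latent_cosine(q_hat, d_hat, s_k):
    """Cosine between q_hat*Sigma_k and d_hat*Sigma_k (both row vectors)."""
    qs, ds = q_hat * s_k, d_hat * s_k
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds)))

# Toy 4x3 term-document matrix (illustrative values; rows = terms, columns = docs).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Folding in a document that was part of the analysis recovers its row of V_k.
d0_hat = fold_in_query(A[:, 0], U_k, s_k)
assert np.allclose(d0_hat, V_k[0])

# A made-up query containing term 0 and term 2, scored against every doc.
q = np.array([1.0, 0.0, 1.0, 0.0])
q_hat = fold_in_query(q, U_k, s_k)
scores = [latent_cosine(q_hat, V_k[j], s_k) for j in range(V_k.shape[0])]
```

The sketch assumes the retained singular values are nonzero and that neither scaled vector is all zeros; a production version would need to guard both cases.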
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector:

  $\hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k}$
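Term fold-in mirrors query fold-in with the roles of $U$ and $V$ swapped. A minimal sketch, assuming NumPy and toy values; `fold_in_term` is a hypothetical helper name.

```python
import numpy as np

def fold_in_term(t, V_k, s_k):
    """Return t_hat = t V_k Sigma_k^{-1} as a length-k row vector."""
    return (t @ V_k) / s_k

# Toy 4x3 term-document matrix (illustrative values; rows = terms, columns = docs).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Folding in a term that was part of the analysis recovers its row of U_k.
t0_hat = fold_in_term(A[0, :], V_k, s_k)
assert np.allclose(t0_hat, U_k[0])

# A made-up new term that occurs once in doc 1 and twice in doc 2.
t_new = np.array([0.0, 1.0, 2.0])
t_new_hat = fold_in_term(t_new, V_k, s_k)
```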
ML-59
LSA Example
• Experimental results
  – HMM is consistently better than VSM at all recall levels
  – LSA is better than VSM at higher recall levels
Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)
ML-60
LSA Conclusions
• Advantages
  – A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  – Handles synonymy problems ("heterogeneous vocabulary")
  – Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  – High computational complexity
  – LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs (Row: Term, Col: Doc)

            Doc 0   Doc 1   Doc 2
    Term 0   2.3     0.0     4.2
    Term 1   0.0     1.3     2.2
    Term 2   3.8     0.0     0.5
    Term 3   0.0     0.0     0.0

  – Each entry is weighted by its TFxIDF score
  – The corresponding sparse input file:

    4 3 6      (4 rows, 3 cols, 6 nonzero entries)
    2          (2 nonzero entries at Col 0)
    0 2.3      (Col 0, Row 0)
    2 3.8      (Col 0, Row 2)
    1          (1 nonzero entry at Col 1)
    1 1.3      (Col 1, Row 1)
    3          (3 nonzero entries at Col 2)
    0 4.2      (Col 2, Row 0)
    1 2.2      (Col 2, Row 1)
    2 0.5      (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
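The column-by-column layout of the sparse input can be produced from a dense matrix with a few lines of code. This is a minimal sketch, assuming NumPy and that the file consists of a "rows cols nonzeros" header followed, for each column, by its nonzero count and then one "row value" line per entry, as in the 4 x 3 example; `to_sparse_text` is a hypothetical helper, not part of SVDLIBC.

```python
import numpy as np

def to_sparse_text(A):
    """Serialize a dense matrix column by column in the sparse text layout."""
    A = np.asarray(A, dtype=float)
    n_rows, n_cols = A.shape
    # Header: number of rows, number of columns, total nonzero entries.
    lines = [f"{n_rows} {n_cols} {np.count_nonzero(A)}"]
    for c in range(n_cols):
        rows = np.flatnonzero(A[:, c])        # row indices of nonzeros in column c
        lines.append(str(len(rows)))          # count of nonzeros in this column
        lines.extend(f"{r} {A[r, c]:g}" for r in rows)
    return "\n".join(lines) + "\n"

# The 4-term, 3-doc example matrix.
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

sparse_text = to_sparse_text(A)
```

For a real collection the matrix would normally be built in a sparse structure from the start; the dense array is used here only to keep the example short.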
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix

    51253 2265 218852    (no. of indexing terms, no. of docs, no. of nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……

• SVD command (IR_svd.bat)

    svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  – -r st: sparse matrix input
  – -o LSA100: prefix of output files
  – -d 100: no. of reserved eigenvectors
  – Term-Doc-Matrix: name of the sparse matrix input
  – Output: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut

    100 51253
    0.003 0.001 ……
    0.002 0.002 ……

  – word vector ($u^T$): 1 x 100
  – 51253 words
• LSA100-S

    100
    2686.18
    829.941
    559.59
    …

  – 100 eigenvalues (the singular values in $\Sigma$)
• LSA100-Vt

    100 2265
    0.021 0.035 ……
    0.012 0.022 ……

  – doc vector ($v^T$): 1 x 100
  – 2265 docs
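The output files can be loaded back for retrieval experiments. The sketch below assumes that the -Ut and -Vt files are dense text with a "rows cols" header followed by the values in row-major order, and that the -S file holds a count followed by one value per line; the parser names and the tiny sample strings are hypothetical stand-ins for the real files.

```python
import numpy as np

def parse_dense_text(text):
    """Parse 'rows cols' followed by rows*cols values into a 2-D array."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = np.array([float(t) for t in tokens[2:]])
    return values.reshape(n_rows, n_cols)

def parse_singular_values(text):
    """Parse a count followed by that many values into a 1-D array."""
    tokens = text.split()
    n = int(tokens[0])
    values = np.array([float(t) for t in tokens[1:]])
    if len(values) != n:
        raise ValueError("value count does not match header")
    return values

# Made-up miniature stand-ins: 2 latent dimensions, 3 words.
ut_text = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.001\n"
s_text = "2\n5.0\n3.0\n"

Ut = parse_dense_text(ut_text)       # shape (dimensions, words)
S = parse_singular_values(s_text)

# Column i of Ut is the latent vector of word i (a row of U).
word0_vec = Ut[:, 0]
```

With the real files, the same functions would be applied to the contents of LSA100-Ut, LSA100-S, and LSA100-Vt after reading them from disk.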
ML-65
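LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector ($q$ is TFxIDF weighted beforehand):

  $\hat{q}_{1 \times k} = (q^T)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$

  – $q^T U$: the query is represented by the weighted sum of its constituent term vectors
  – $\Sigma^{-1}$: the separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space:

  $sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^2 \, \hat{d}^T}{|\hat{q}\Sigma| \, |\hat{d}\Sigma|}$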
ML-56
LSA Derivations (5/7)
• Fundamental comparisons based on SVD
  - The original word-document matrix $A$ ($m \times n$; rows are words $w_1, w_2, \ldots, w_m$, columns are docs $d_1, d_2, \ldots, d_n$)
    • compare two terms $w_i$, $w_j$ → dot product of two rows of $A$, or an entry in $AA^{T}$
    • compare two docs $d_k$, $d_s$ → dot product of two columns of $A$, or an entry in $A^{T}A$
    • compare a term and a doc → each individual entry of $A$
  - The new word-document matrix $A'$
    • compare two terms → dot product of two rows of $U'\Sigma'$
    • compare two docs → dot product of two rows of $V'\Sigma'$
    • compare a query word and a doc → each individual entry of $A'$

\[ A'A'^{T} = (U'\Sigma'V'^{T})(U'\Sigma'V'^{T})^{T} = U'\Sigma'V'^{T}V'\Sigma'^{T}U'^{T} = (U'\Sigma')(U'\Sigma')^{T} \]

\[ A'^{T}A' = (U'\Sigma'V'^{T})^{T}(U'\Sigma'V'^{T}) = V'\Sigma'^{T}U'^{T}U'\Sigma'V'^{T} = (V'\Sigma')(V'\Sigma')^{T} \]

  - Here $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$
  - $V'^{T}V'$ and $U'^{T}U'$ reduce to the identity matrix $I$
  - $\Sigma'$ is for stretching or shrinking
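The two identities above can be checked numerically. The following is a minimal NumPy sketch, assuming a small toy matrix and a hypothetical choice of k = 2; the variable names are illustrative and not part of the original slides.

```python
import numpy as np

# Toy 4-term x 3-document matrix (illustrative values only).
A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])

k = 2  # number of latent dimensions kept (hypothetical choice)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

A_k = U_k @ S_k @ V_k.T      # rank-k approximation A'
term_vecs = U_k @ S_k        # rows are term vectors (rows of U' Sigma')
doc_vecs = V_k @ S_k         # rows are document vectors (rows of V' Sigma')

# Term-term and doc-doc comparisons in A' reduce to dot products
# between rows of U' Sigma' and V' Sigma' respectively.
term_term_ok = np.allclose(A_k @ A_k.T, term_vecs @ term_vecs.T)
doc_doc_ok = np.allclose(A_k.T @ A_k, doc_vecs @ doc_vecs.T)
print(term_term_ok, doc_doc_ok)
```

Working with the k-dimensional rows of $U'\Sigma'$ and $V'\Sigma'$ avoids forming the full $m \times m$ or $n \times n$ products.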
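ML-57
LSA Derivations (6/7)
• Fold-in: find representations for pseudo-docs $q$
  - For objects (new queries or docs) that did not appear in the original analysis
• Fold-in a new $m \times 1$ query (or doc) vector

\[ \hat{q}_{1 \times k} = \left(q^{T}\right)_{1 \times m} U_{m \times k} \, \Sigma^{-1}_{k \times k} \]

  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  - $\hat{q}$ is just like a row of $V$
  - Cosine measure between the query and doc vectors in the latent semantic space ($\hat{q}$ and $\hat{d}$ are row vectors)

\[ \mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^{2}\,\hat{d}^{T}}{\left|\hat{q}\Sigma\right| \left|\hat{d}\Sigma\right|} \]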
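The fold-in and cosine formulas can be sketched as follows. This is a minimal NumPy illustration under assumed toy data; `fold_in_query`, `latent_cosine`, the matrix values and the query are hypothetical names and numbers introduced only for the example.

```python
import numpy as np

def fold_in_query(q, U_k, s_k):
    """Map an m-dimensional term-weight vector q to a 1 x k row: q^T U_k Sigma_k^{-1}."""
    return (q @ U_k) / s_k

def latent_cosine(q_hat, d_hat, s_k):
    """Cosine between q_hat * Sigma and d_hat * Sigma (both treated as row vectors)."""
    qs, ds = q_hat * s_k, d_hat * s_k
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds)))

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2  # hypothetical number of latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Folding in a document that was part of the analysis reproduces its row of V_k,
# which is why a folded-in vector is "just like a row of V".
d0_hat = fold_in_query(A[:, 0], U_k, s_k)
reproduces_row = np.allclose(d0_hat, V_k[0])

q = np.array([1.0, 0.0, 1.0, 0.0])  # hypothetical query using terms 0 and 2
q_hat = fold_in_query(q, U_k, s_k)
sims = [latent_cosine(q_hat, V_k[j], s_k) for j in range(V_k.shape[0])]
print(reproduces_row, sims)
```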
ML-58
LSA Derivations (7/7)
• Fold-in a new $1 \times n$ term vector

\[ \hat{t}_{1 \times k} = t_{1 \times n} \, V_{n \times k} \, \Sigma^{-1}_{k \times k} \]
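Term fold-in mirrors the query case, with $V$ in place of $U$. A small NumPy sketch, assuming the same kind of toy matrix; `fold_in_term` and the new term's occurrence pattern are hypothetical.

```python
import numpy as np

def fold_in_term(t, V_k, s_k):
    """Map a 1 x n term row t to the latent space: t V_k Sigma_k^{-1}."""
    return (t @ V_k) / s_k

A = np.array([[2.3, 0.0, 4.2],
              [0.0, 1.3, 2.2],
              [3.8, 0.0, 0.5],
              [0.0, 0.0, 0.0]])
k = 2  # hypothetical number of latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Sanity check: a term row that was part of the analysis folds in to its own row of U_k.
t0_hat = fold_in_term(A[0], V_k, s_k)
matches_u_row = np.allclose(t0_hat, U_k[0])

# A hypothetical new term that occurs in documents 0 and 2.
t_new = np.array([1.0, 0.0, 2.0])
t_new_hat = fold_in_term(t_new, V_k, s_k)
print(matches_u_row, t_new_hat)
```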
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)]
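For readers unfamiliar with the 11-point curve, the sketch below shows one way it is commonly computed for a single query, assuming the usual interpolated-precision convention (the slide does not state which variant was used). The function name and the toy ranking are hypothetical.

```python
def eleven_point_precision(ranked_relevance, num_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one ranked list.

    ranked_relevance: list of booleans, True where the ranked item is relevant.
    num_relevant: total number of relevant items for the query.
    """
    points = []  # (recall, precision) observed after each rank
    hits = 0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
        points.append((hits / num_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision: the best precision at any recall >= the level.
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]

# Toy ranking: relevant items at ranks 1 and 3, two relevant items in total.
curve = eleven_point_precision([True, False, True, False, False], num_relevant=2)
print(curve)
```

Averaging such curves over all queries gives a single recall-precision curve per retrieval model.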
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems (“heterogeneous vocabulary”)
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library version 1.3 is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs (rows are terms, columns are docs)

        2.3  0.0  4.2
        0.0  1.3  2.2
        3.8  0.0  0.5
        0.0  0.0  0.0

  - Each entry is weighted by its TF×IDF score
  - The same matrix in sparse text form, listed column by column

        4 3 6      (rows/terms, columns/docs, nonzero entries)
        2          (2 nonzero entries at Col 0)
        0 2.3      (Col 0, Row 0)
        2 3.8      (Col 0, Row 2)
        1          (1 nonzero entry at Col 1)
        1 1.3      (Col 1, Row 1)
        3          (3 nonzero entries at Col 2)
        0 4.2      (Col 2, Row 0)
        1 2.2      (Col 2, Row 1)
        2 0.5      (Col 2, Row 2)

• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach by using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
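The column-by-column layout described on this slide can be produced with a few lines of Python. This is a sketch based only on the layout shown here; `to_sparse_text` is a hypothetical helper, not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (list of rows) column by column.

    Layout: "rows cols nonzeros", then for each column its nonzero count
    followed by one "row_index value" line per nonzero entry.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = []
    for c in range(n_cols):
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    total = sum(len(col) for col in columns)
    lines = [f"{n_rows} {n_cols} {total}"]
    for col in columns:
        lines.append(str(len(col)))
        lines.extend(f"{r} {v}" for r, v in col)
    return "\n".join(lines)

# The 4-term x 3-doc example matrix from the slide.
term_doc = [[2.3, 0.0, 4.2],
            [0.0, 1.3, 2.2],
            [3.8, 0.0, 0.5],
            [0.0, 0.0, 0.0]]
text = to_sparse_text(term_doc)
print(text)
```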
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix

        51253 2265 218852    (no. of indexing terms, no. of docs, no. of nonzero entries)
        77
        508 7.725771
        596 16.213399
        612 13.080868
        709 7.725771
        713 7.725771
        744 7.725771
        1190 7.725771
        1200 16.213399
        1259 7.725771
        ……

• SVD command (IR_svd.bat)

        svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  - -r st: sparse matrix input
  - -o LSA100: prefix of output files
  - -d 100: no. of reserved eigenvectors
  - Term-Doc-Matrix: name of sparse matrix input
• Output: LSA100-Ut, LSA100-S, LSA100-Vt
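To evaluate several LSA dimensionalities, the same command can be issued once per size. The sketch below only builds the argument lists, reusing the flags shown on the slide; it assumes the `svd` executable is on the PATH, and the output prefixes other than LSA100 are hypothetical names following the same pattern.

```python
import shlex
import subprocess

def svd_commands(matrix_file, dims):
    """Build one svd invocation per dimensionality, using the flags from the slide."""
    return [["svd", "-r", "st", "-o", f"LSA{d}", "-d", str(d), matrix_file]
            for d in dims]

def run_all(commands):
    """Run each command in turn; requires the svd binary to be installed."""
    for cmd in commands:
        subprocess.run(cmd, check=True)

cmds = svd_commands("Term-Doc-Matrix", [100, 200, 600])
for cmd in cmds:
    print(shlex.join(cmd))
# run_all(cmds)  # uncomment once the svd executable is available
```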
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut

        100 51253
        0.003 0.001 ……
        0.002 0.002 ……

  - word vector ($u^{T}$): 1×100
  - 51253 words
• LSA100-S

        100
        2686.18
        829.941
        559.59
        …

  - 100 eigenvalues
• LSA100-Vt

        100 2265
        0.021 0.035 ……
        0.012 0.022 ……

  - doc vector ($v^{T}$): 1×100
  - 2265 docs
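A short parser makes the layout of these output files concrete. The sketch below assumes the dense text layout suggested by the slide (a "rows cols" header followed by the values, and a count followed by the values for the S file); the function names and the tiny stand-in strings are hypothetical.

```python
import numpy as np

def parse_dense_text(text):
    """Parse a 'rows cols' header followed by rows*cols whitespace-separated values."""
    tokens = text.split()
    rows, cols = int(tokens[0]), int(tokens[1])
    values = np.array([float(t) for t in tokens[2:2 + rows * cols]])
    return values.reshape(rows, cols)

def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    return np.array([float(t) for t in tokens[1:1 + count]])

# Tiny stand-ins with the same layout as LSA100-Ut and LSA100-S (2 dimensions, 3 words).
ut_text = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.001\n"
s_text = "2\n5.0\n2.5\n"

Ut = parse_dense_text(ut_text)         # k x m: one row per latent dimension
S = parse_singular_values(s_text)      # k values
term_vectors = Ut.T * S                # m x k: one scaled vector per word
print(term_vectors)
```

Transposing Ut gives one k-dimensional vector per word; the Vt file can be handled the same way to obtain one vector per doc.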
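ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new $m \times 1$ query vector (TF×IDF weighted beforehand)

\[ \hat{q}_{1 \times k} = \left(q^{T}\right)_{1 \times m} U_{m \times k} \, \Sigma^{-1}_{k \times k} \]

  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted (by $\Sigma^{-1}$)
  - $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space

\[ \mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^{2}\,\hat{d}^{T}}{\left|\hat{q}\Sigma\right| \left|\hat{d}\Sigma\right|} \]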
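Putting the pieces together, a query is weighted, folded in, and then used to rank the docs. The sketch below is a minimal end-to-end illustration on made-up counts; the IDF variant log(N/df) is an assumption (the slides do not specify the weighting formula), and all names and numbers are hypothetical.

```python
import numpy as np

# Hypothetical raw term counts: 4 terms x 3 docs.
counts = np.array([[2, 0, 3],
                   [0, 1, 2],
                   [4, 0, 1],
                   [0, 1, 0]], dtype=float)
n_docs = counts.shape[1]
df = (counts > 0).sum(axis=1)      # document frequency of each term
idf = np.log(n_docs / df)          # one common IDF variant, assumed here
A = counts * idf[:, None]          # TFxIDF-weighted term-doc matrix

k = 2  # hypothetical number of latent dimensions
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T

# Weight the query the same way as the docs, then fold it in.
q_counts = np.array([1.0, 0.0, 0.0, 1.0])   # hypothetical query: terms 0 and 3
q_hat = ((q_counts * idf) @ U_k) / s_k

# Cosine between q_hat * Sigma and each row of V_k * Sigma.
q_vec = q_hat * s_k
doc_vecs = V_k * s_k
scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
ranking = np.argsort(-scores)      # doc indices, best match first
print(ranking, scores)
```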
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-58
LSA Derivations (7/7)
• Fold-in a new 1 × n term vector $t_{1\times n}$

$$\hat{t}_{1\times k} = t_{1\times n}\, V_{n\times k}\, \Sigma^{-1}_{k\times k}$$
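As a small worked illustration of the fold-in (the numbers are made up for this example and do not come from the lecture), assume $k = n = 2$, $V = I$ and $\Sigma = \mathrm{diag}(3, 2)$. A new term that occurs with weight 2 in the second document only, $t = [\,0 \;\; 2\,]$, is then mapped to:

$$\hat{t} = t\,V\,\Sigma^{-1} = \begin{bmatrix} 0 & 2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1/3 & 0 \\ 0 & 1/2 \end{bmatrix} = \begin{bmatrix} 0 & 1 \end{bmatrix}$$

Under these assumptions the new term lies entirely on the second latent dimension, expressed in the same coordinates as the rows of $U$.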
ML-59
LSA Example
• Experimental results
  - HMM is consistently better than VSM at all recall levels
  - LSA is better than VSM at higher recall levels
Figure: Recall-precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection (using word-level indexing terms)
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs
  - Each entry is weighted by its TF×IDF score
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality
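Term-doc matrix (Row = Term, Col = Doc):

        Doc 0  Doc 1  Doc 2
Term 0   2.3    0.0    4.2
Term 1   0.0    1.3    2.2
Term 2   3.8    0.0    0.5
Term 3   0.0    0.0    0.0

Sparse text input:

4 3 6     4 rows (terms), 3 columns (docs), 6 nonzero entries
2         2 nonzero entries at Col 0
0 2.3       Col 0, Row 0
2 3.8       Col 0, Row 2
1         1 nonzero entry at Col 1
1 1.3       Col 1, Row 1
3         3 nonzero entries at Col 2
0 4.2       Col 2, Row 0
1 2.2       Col 2, Row 1
2 0.5       Col 2, Row 2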
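The column-by-column layout above can be produced mechanically from a dense matrix. The following is a minimal sketch in plain Python. It assumes the slide's entries read 2.3, 0.0, 4.2 and so on, and `to_sparse_text` is a hypothetical helper name for illustration, not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense term-doc matrix (list of rows) into the
    column-oriented sparse text layout described above."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0

    # Collect (row index, value) pairs for the nonzero entries of each column.
    columns = []
    for c in range(n_cols):
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)

    total_nonzero = sum(len(entries) for entries in columns)

    # Header: number of rows (terms), columns (docs), and nonzero entries.
    lines = [f"{n_rows} {n_cols} {total_nonzero}"]
    for entries in columns:
        lines.append(str(len(entries)))      # nonzero count for this column
        for r, value in entries:
            lines.append(f"{r} {value}")     # row index and value
    return "\n".join(lines)


# The 4-term, 3-doc example (values assumed to be 2.3, 0.0, 4.2, ...).
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
print(sparse_text)
```

Only the nonzero entries are written, so the all-zero fourth term contributes nothing beyond the row count in the header.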
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix

51253 2265 218852     no. of indexing terms, no. of docs, no. of nonzero entries
77
508 7.725771
596 16.213399
612 13.080868
709 7.725771
713 7.725771
744 7.725771
1190 7.725771
1200 16.213399
1259 7.725771
……

• SVD command (IR_svd.bat)

svd -r st -o LSA100 -d 100 Term-Doc-Matrix

  -r st              sparse matrix input
  -o LSA100          prefix of output files
  -d 100             no. of reserved eigenvectors
  Term-Doc-Matrix    name of sparse matrix input

• Output files: LSA100-Ut, LSA100-S, LSA100-Vt
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut

100 51253
0.003 0.001 ……
0.002 0.002 ……

  - Each word vector (u^T) is 1×100; 51253 words

• LSA100-S

10026861882994155959…

  - 100 singular values

• LSA100-Vt

100 2265
0.021 0.035 ……
0.012 0.022 ……

  - Each doc vector (v^T) is 1×100; 2265 docs
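A minimal sketch of reading the -Ut and -Vt output files, assuming the dense text layout shown above: a header with the number of rows and columns, followed by the values row by row. `parse_dense_text`, `column` and the sample string are hypothetical and use made-up values; a real LSA100-Ut would have 100 rows and 51253 columns.

```python
def parse_dense_text(text):
    """Parse a dense text matrix: the first two numbers are the row and
    column counts, the remaining numbers are the entries in row-major order."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(tok) for tok in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("matrix size does not match header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def column(matrix, j):
    """Return column j; in a Ut-style file this is the vector of word j."""
    return [row[j] for row in matrix]


# Toy stand-in for LSA100-Ut with 2 dimensions and 3 words (made-up values).
sample_ut = """2 3
0.003 0.001 0.004
0.002 0.002 0.005
"""
ut = parse_dense_text(sample_ut)
word_1 = column(ut, 1)
print(word_1)
```

Because the files store the transposed matrices, each row is one latent dimension and each word or doc vector is read off as a column.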
ML-65
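LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector

$$\hat{q}_{1\times k} = \left(q_{m\times 1}\right)^{T} U_{m\times k}\, \Sigma^{-1}_{k\times k}$$

  - The query $q$ is TF×IDF weighted beforehand
  - The query is represented by the weighted sum of its constituent term vectors
  - The separate dimensions are differentially weighted
  - $\hat{q}$ is just like a row of $V$

• Cosine measure between the query and doc vectors in the latent semantic space

$$\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^{2}\,\hat{d}^{T}}{\lvert \hat{q}\Sigma \rvert \, \lvert \hat{d}\Sigma \rvert}$$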
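The query fold-in and the scaled cosine measure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the toolkit's own code: the function names are hypothetical, and the toy factors U, sigma and V below are made-up values chosen so that they form the SVD of the 3-term, 2-doc matrix [[3, 0], [0, 2], [0, 0]].

```python
import math


def fold_in_query(q, U, sigma):
    """q_hat = q^T U Sigma^{-1}; q has m entries, U is m x k, sigma has k entries."""
    k = len(sigma)
    return [sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j] for j in range(k)]


def lsa_similarity(q_hat, d_hat, sigma):
    """cosine(q_hat Sigma, d_hat Sigma); assumes neither scaled vector is all zeros."""
    qs = [x * s for x, s in zip(q_hat, sigma)]
    ds = [x * s for x, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    norm_q = math.sqrt(sum(a * a for a in qs))
    norm_d = math.sqrt(sum(b * b for b in ds))
    return dot / (norm_q * norm_d)


# Toy factors (made-up): A = U diag(sigma) V^T = [[3, 0], [0, 2], [0, 0]].
U = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma = [3.0, 2.0]
V = [[1.0, 0.0], [0.0, 1.0]]   # one row per doc

q = [3.0, 0.0, 0.0]            # query identical to doc 0, already weighted
q_hat = fold_in_query(q, U, sigma)
sims = [lsa_similarity(q_hat, d_hat, sigma) for d_hat in V]
print(q_hat, sims)
```

In this toy setting the folded-in query coincides with the first row of V, so it scores a cosine of 1 against doc 0 and 0 against doc 1.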
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-60
LSA Conclusions
• Advantages
  - A clean formal framework and a clearly defined optimization criterion (least-squares)
    • Conceptual simplicity and clarity
  - Handles synonymy problems ("heterogeneous vocabulary")
  - Good results for high-recall search
    • Takes term co-occurrence into account
• Disadvantages
  - High computational complexity
  - LSA offers only a partial solution to polysemy
    • E.g., bank, bass, …
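The synonymy point can be illustrated with a minimal sketch. The four terms and three documents below are hypothetical toy data, and NumPy's `numpy.linalg.svd` stands in for whatever SVD routine is actually used. "car" and "automobile" never share a document, so their raw cosine is zero, but both co-occur with "engine", which pulls them together in a rank-2 latent space.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy counts: rows = terms, columns = docs.
terms = ["car", "automobile", "engine", "flower"]
A = np.array([
    [1.0, 0.0, 0.0],   # car        appears only in doc 0
    [0.0, 1.0, 0.0],   # automobile appears only in doc 1
    [1.0, 1.0, 0.0],   # engine     co-occurs with both
    [0.0, 0.0, 2.0],   # flower     appears only in doc 2
])

# Rank-k LSA: keep the k largest singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
term_vecs = U[:, :k] * s[:k]   # rows of U_k Sigma_k, one per term

raw_sim = cosine(A[0], A[1])                   # similarity in the original space
lsa_sim = cosine(term_vecs[0], term_vecs[1])   # similarity in the latent space
print(f"raw cosine = {raw_sim:.3f}, LSA cosine = {lsa_sim:.3f}")
```

The same mechanism explains the polysemy limitation: each term gets a single latent vector, so the different senses of a word such as "bank" are averaged rather than separated.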
ML-61
LSA Toolkit: SVDLIBC (1/5)
• Doug Rohde's SVD C Library, version 1.3, is based on the SVDPACKC library
• Download it at http://tedlab.mit.edu/~dr/
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  - E.g., 4 terms and 3 docs
  - Each entry is weighted by its TF×IDF score
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach using varying LSA dimensionalities (e.g., 100, 200, 600, etc.)
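Term-doc matrix (rows = terms, columns = docs):

            Doc 0   Doc 1   Doc 2
  Term 0     2.3     0.0     4.2
  Term 1     0.0     1.3     2.2
  Term 2     3.8     0.0     0.5
  Term 3     0.0     0.0     0.0

Sparse text representation of the same matrix:

  4 3 6        (4 rows/terms, 3 columns/docs, 6 nonzero entries)
  2            (2 nonzero entries at Col 0)
  0 2.3        (Col 0, Row 0)
  2 3.8        (Col 0, Row 2)
  1            (1 nonzero entry at Col 1)
  1 1.3        (Col 1, Row 1)
  3            (3 nonzero entries at Col 2)
  0 4.2        (Col 2, Row 0)
  1 2.2        (Col 2, Row 1)
  2 0.5        (Col 2, Row 2)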
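The column-by-column layout on this slide can be produced from a dense matrix with a few lines of code. The following is a minimal sketch; `to_sparse_text` is an illustrative helper name, not part of SVDLIBC.

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (rows = terms, columns = docs) into the
    column-oriented sparse text layout: a header line
    "rows cols nonzeros", then for each column its nonzero count
    followed by one "row value" line per nonzero entry."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    columns = []
    for c in range(n_cols):
        # Collect (row index, value) pairs for the nonzero entries of column c.
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    nnz = sum(len(entries) for entries in columns)
    lines = [f"{n_rows} {n_cols} {nnz}"]
    for entries in columns:
        lines.append(str(len(entries)))
        lines.extend(f"{r} {v}" for r, v in entries)
    return "\n".join(lines)

# The 4-term, 3-doc example from this slide.
example = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
print(to_sparse_text(example))
```

Storing only the nonzero entries matters at realistic scale, where a term-doc matrix has tens of thousands of rows and most entries are zero.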
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (sparse text format)

    51253 2265 218852      (no. of indexing terms, no. of docs, no. of nonzero entries)
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……

• SVD command (IR_svd.bat)

    svd -r st -o LSA100 -d 100 Term-Doc-Matrix

    -r st              sparse matrix input
    -o LSA100          prefix of output files
    -d 100             no. of retained dimensions (singular vectors)
    Term-Doc-Matrix    name of the sparse matrix input file

• Output files: LSA100-Ut, LSA100-S, LSA100-Vt
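To make the input layout concrete, here is a minimal sketch of a reader for the sparse text format, checked against the small 4-term, 3-doc example from the previous slide. `parse_sparse_text` is an illustrative helper, not an SVDLIBC function. It splits on whitespace only, so it does not depend on how the file is broken into lines.

```python
def parse_sparse_text(text):
    """Parse the column-oriented sparse text layout.

    Returns (n_rows, n_cols, columns), where columns[c] maps
    row index -> value for the nonzero entries of column c."""
    tokens = text.split()
    n_rows, n_cols, nnz = int(tokens[0]), int(tokens[1]), int(tokens[2])
    pos = 3
    columns = []
    for _ in range(n_cols):
        count = int(tokens[pos])          # nonzero entries in this column
        pos += 1
        column = {}
        for _ in range(count):
            row, value = int(tokens[pos]), float(tokens[pos + 1])
            pos += 2
            column[row] = value
        columns.append(column)
    # The header's nonzero count should agree with what was read.
    if sum(len(column) for column in columns) != nnz:
        raise ValueError("nonzero count in header does not match the data")
    return n_rows, n_cols, columns

sample = """4 3 6
2
0 2.3
2 3.8
1
1 1.3
3
0 4.2
1 2.2
2 0.5
"""
n_rows, n_cols, columns = parse_sparse_text(sample)
print(n_rows, n_cols, columns)
```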
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut

    100 51253
    0.003 0.001 ……
    0.002 0.002 ……

  - Word vector (u^T): 1×100
  - 51253 words

• LSA100-S

    100
    26861882994155959…

  - 100 singular values

• LSA100-Vt

    100 2265
    0.021 0.035 ……
    0.012 0.022 ……

  - Doc vector (v^T): 1×100
  - 2265 docs
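The shapes of the three output files can be checked on a small scale. The sketch below uses NumPy's `numpy.linalg.svd` as a stand-in for SVDLIBC (an assumption made for illustration) on the 4-term, 3-doc example, keeping k = 2 dimensions instead of 100.

```python
import numpy as np

# The 4-term, 3-doc example matrix (rows = terms, columns = docs).
A = np.array([
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

k = 2  # retained dimensions (100 in the slides)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

Ut_k = U[:, :k].T    # k x (no. of terms): counterpart of the -Ut file
S_k = s[:k]          # k singular values:  counterpart of the -S file
Vt_k = Vt[:k, :]     # k x (no. of docs):  counterpart of the -Vt file

# Rank-k reconstruction; in the least-squares (Frobenius) sense this is
# the best rank-k approximation of A.
A_k = Ut_k.T @ np.diag(S_k) @ Vt_k
error = np.linalg.norm(A - A_k)

print(Ut_k.shape, S_k.shape, Vt_k.shape)
print("singular values:", S_k, "reconstruction error:", error)
```

Each column of `Ut_k` is the k-dimensional vector of one term, and each column of `Vt_k` is the k-dimensional vector of one document.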
ML-65
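LSA Toolkit: SVDLIBC (5/5)
• Fold in a new m×1 query vector q (TF×IDF weighted beforehand)

$$\hat{q}_{1 \times k} = \left(q^{T}\right)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$$

  - q^T U: the query is represented by the weighted sum of its constituent term vectors
  - Σ^{-1}: the separate dimensions are differentially weighted
  - The resulting q̂ is just like a row of V

• Cosine measure between the query and doc vectors in the latent semantic space

$$\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\,\Sigma^{2}\,\hat{d}^{T}}{\lVert \hat{q}\Sigma \rVert \, \lVert \hat{d}\Sigma \rVert}$$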
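The fold-in and cosine steps can be sketched as follows. This is a minimal illustration under stated assumptions: NumPy's SVD replaces SVDLIBC, the matrix is the 4-term, 3-doc example rather than real TF×IDF data, and `fold_in` and `lsa_sim` are illustrative names. Folding in a column of the original matrix should reproduce that document's own row of V, which gives a convenient sanity check.

```python
import numpy as np

A = np.array([
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
])

k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]            # m x k term matrix
S_k = np.diag(s[:k])      # k x k diagonal matrix of singular values
V_k = Vt[:k, :].T         # n x k, one row per document

def fold_in(q, U_k, S_k):
    """Map an m-dimensional query into the latent space:
    q_hat = q^T U_k S_k^{-1}."""
    return q @ U_k @ np.linalg.inv(S_k)

def lsa_sim(q_hat, d_hat, S_k):
    """Cosine between q_hat S_k and d_hat S_k."""
    a, b = q_hat @ S_k, d_hat @ S_k
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Use document 2 itself as the query.
q = A[:, 2]
q_hat = fold_in(q, U_k, S_k)
scores = [lsa_sim(q_hat, V_k[j], S_k) for j in range(V_k.shape[0])]

print("folded-in query:", q_hat)
print("row of V for doc 2:", V_k[2])
print("similarity to each doc:", scores)
```

Ranking documents by `scores` in descending order gives the retrieval result for the query.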
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-62
LSA Toolkit: SVDLIBC (2/5)
• Given a sparse term-doc matrix
  – E.g., 4 terms and 3 docs
  – Each entry is weighted by its TF×IDF score
• Perform SVD to obtain the corresponding term and doc vectors represented in the latent semantic space
• Evaluate the information retrieval capability of the LSA approach using varying sizes (e.g., 100, 200, 600, etc.) of LSA dimensionality

    Term-doc matrix (Row: Term, Col: Doc)
      2.3  0.0  4.2
      0.0  1.3  2.2
      3.8  0.0  0.5
      0.0  0.0  0.0

    Sparse text input
      4 3 6      4 terms (rows), 3 docs (columns), 6 nonzero entries
      2          2 nonzero entries at Col 0
      0 2.3        Col 0, Row 0
      2 3.8        Col 0, Row 2
      1          1 nonzero entry at Col 1
      1 1.3        Col 1, Row 1
      3          3 nonzero entries at Col 2
      0 4.2        Col 2, Row 0
      1 2.2        Col 2, Row 1
      2 0.5        Col 2, Row 2
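The column-by-column layout on this slide can be produced mechanically from a dense matrix. The following is a minimal Python sketch, not part of SVDLIBC itself: the function name `to_sparse_text` is hypothetical, and it assumes the layout is exactly the one illustrated here (a header with the row, column, and nonzero counts, then for each column its nonzero count followed by row-index/value pairs).

```python
def to_sparse_text(matrix):
    """Serialize a dense matrix (rows = terms, columns = docs) into the
    column-oriented sparse text layout illustrated on the slide."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0

    # For every column, collect (row index, value) pairs of nonzero entries.
    columns = []
    for c in range(n_cols):
        entries = [(r, matrix[r][c]) for r in range(n_rows) if matrix[r][c] != 0]
        columns.append(entries)
    nnz = sum(len(entries) for entries in columns)

    lines = [f"{n_rows} {n_cols} {nnz}"]        # header: rows, columns, nonzeros
    for entries in columns:
        lines.append(str(len(entries)))         # nonzero count of this column
        for r, value in entries:
            lines.append(f"{r} {value}")        # row index and value
    return "\n".join(lines) + "\n"


# The 4-term, 3-doc example from the slide.
term_doc = [
    [2.3, 0.0, 4.2],
    [0.0, 1.3, 2.2],
    [3.8, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]
sparse_text = to_sparse_text(term_doc)
print(sparse_text)
```

The all-zero fourth term contributes no lines of its own; it only shows up in the row count of the header.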
ML-63
LSA Toolkit: SVDLIBC (3/5)
• Example term-doc matrix (sparse text input)

    51253 2265 218852      no. of indexing terms, no. of docs, no. of nonzero entries
    77
    508 7.725771
    596 16.213399
    612 13.080868
    709 7.725771
    713 7.725771
    744 7.725771
    1190 7.725771
    1200 16.213399
    1259 7.725771
    ……

• SVD command (IR_svd.bat)

    svd -r st -o LSA100 -d 100 Term-Doc-Matrix

    -r st             sparse matrix input (sparse text format)
    -o LSA100         prefix of the output files
    -d 100            no. of singular triplets (dimensions) to keep
    Term-Doc-Matrix   name of the sparse matrix input file

• Output files: LSA100-Ut, LSA100-S, LSA100-Vt
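Before running the command on a large input, it can be useful to check that the header of the sparse file agrees with its body. The sketch below is one way to do that in Python; `read_sparse_text` is a hypothetical helper, not an SVDLIBC function, and it assumes the same column-oriented layout as the small 4-term, 3-doc example. That small example is used as input because the 51253 × 2265 file is truncated on the slide.

```python
def read_sparse_text(text):
    """Parse the column-oriented sparse text layout.

    Returns (n_rows, n_cols, columns), where columns[c] is a list of
    (row index, value) pairs. Raises ValueError on inconsistent input."""
    tokens = text.split()
    n_rows, n_cols, nnz = int(tokens[0]), int(tokens[1]), int(tokens[2])
    pos = 3
    columns = []
    for _col in range(n_cols):
        count = int(tokens[pos])            # nonzero count of this column
        pos += 1
        entries = []
        for _entry in range(count):
            row, value = int(tokens[pos]), float(tokens[pos + 1])
            if not 0 <= row < n_rows:
                raise ValueError(f"row index {row} out of range")
            entries.append((row, value))
            pos += 2
        columns.append(entries)

    # The header must account for every entry, with no tokens left over.
    found = sum(len(entries) for entries in columns)
    if found != nnz or pos != len(tokens):
        raise ValueError("header does not match body")
    return n_rows, n_cols, columns


sample = """4 3 6
2
0 2.3
2 3.8
1
1 1.3
3
0 4.2
1 2.2
2 0.5
"""
n_rows, n_cols, columns = read_sparse_text(sample)
print(n_rows, n_cols, [len(entries) for entries in columns])
```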
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut

    100 51253
    0.003 0.001 ……
    0.002 0.002 ……

  – Word vector (u^T): 1×100
  – 51253 words

• LSA100-S

    100
    2686.18
    829.941
    559.59
    …

  – 100 singular values

• LSA100-Vt

    100 2265
    0.021 0.035 ……
    0.012 0.022 ……

  – Doc vector (v^T): 1×100
  – 2265 docs
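The three output files can be loaded with a few lines of code. The Python sketch below is only illustrative and rests on assumptions read off the slide: the -Ut and -Vt files start with a "rows columns" header followed by the values row by row, and the -S file starts with a count followed by one value per line. Under that reading each row of -Ut holds one latent dimension, so the vector of word i is column i. All function names are hypothetical, and the tiny strings are made-up stand-ins, not real SVDLIBC output.

```python
def parse_dense_text(text):
    """Parse a 'rows cols' header followed by rows * cols values."""
    tokens = text.split()
    n_rows, n_cols = int(tokens[0]), int(tokens[1])
    values = [float(t) for t in tokens[2:]]
    if len(values) != n_rows * n_cols:
        raise ValueError("value count does not match header")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


def parse_singular_values(text):
    """Parse a count followed by that many values."""
    tokens = text.split()
    count = int(tokens[0])
    values = [float(t) for t in tokens[1:]]
    if len(values) != count:
        raise ValueError("value count does not match header")
    return values


def column(matrix, j):
    """Return column j of a row-major matrix as a list."""
    return [row[j] for row in matrix]


# Made-up stand-ins: 2 latent dimensions, 3 words.
ut_text = "2 3\n0.6 0.0 0.8\n0.8 0.0 -0.6\n"
s_text = "2\n2.0\n1.0\n"

ut = parse_dense_text(ut_text)
singular_values = parse_singular_values(s_text)
word_0 = column(ut, 0)   # latent vector of word 0 (one entry per dimension)
print(word_0, singular_values)
```

With the real files, `column(ut, i)` would have 100 entries and `singular_values` would hold 100 numbers.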
ML-65
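LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector q (TF×IDF weighted beforehand)

\[
\hat{q}_{1 \times k} = \left( q^{T} \right)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}
\]

  – The query is represented by the weighted sum of its constituent term vectors
  – The separate dimensions are differentially weighted
  – \hat{q} is just like a row of V

• Cosine measure between the query and doc vectors in the latent semantic space

\[
\mathrm{sim}(\hat{q}, \hat{d}) = \mathrm{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^{2} \, \hat{d}^{T}}{\lVert \hat{q}\Sigma \rVert \, \lVert \hat{d}\Sigma \rVert}
\]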
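The fold-in and the cosine measure can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the toolkit's code: `fold_in` and `lsa_similarity` are hypothetical names, U is stored as an m × k list of rows, Σ is stored as a list of its k diagonal entries, and the toy U, Σ, and V below are made up so that the arithmetic is easy to follow.

```python
import math


def fold_in(q, U, sigma):
    """Compute q_hat = q^T U Sigma^{-1} for an m-vector q, an m x k
    matrix U, and the k diagonal entries of Sigma."""
    k = len(sigma)
    return [sum(q[i] * U[i][j] for i in range(len(q))) / sigma[j]
            for j in range(k)]


def lsa_similarity(q_hat, d_hat, sigma):
    """Cosine between q_hat * Sigma and d_hat * Sigma."""
    qs = [a * s for a, s in zip(q_hat, sigma)]
    ds = [b * s for b, s in zip(d_hat, sigma)]
    dot = sum(a * b for a, b in zip(qs, ds))
    norm = math.sqrt(sum(a * a for a in qs)) * math.sqrt(sum(b * b for b in ds))
    return dot / norm if norm else 0.0


# Toy rank-2 model with 3 terms and 2 docs (made-up values).
U = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]
sigma = [2.0, 1.0]
V = [[0.6, 0.8],      # doc 0 in the latent space (a row of V)
     [0.8, -0.6]]     # doc 1 in the latent space

# Doc 0 in term space is U * Sigma * V[0]^T = [1.2, 0.8, 0.0].
# Folding it in as a query should recover V[0], up to rounding.
q = [1.2, 0.8, 0.0]
q_hat = fold_in(q, U, sigma)
sim_doc0 = lsa_similarity(q_hat, V[0], sigma)
sim_doc1 = lsa_similarity(q_hat, V[1], sigma)
print(q_hat, sim_doc0, sim_doc1)
```

Folding in a column of the original matrix returns the corresponding row of V, which is the sense in which the folded-in query is "just like a row of V".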
ltFEFF004200720075006b00200064006900730073006500200069006e006e007300740069006c006c0069006e00670065006e0065002000740069006c002000e50020006f0070007000720065007400740065002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740065007200200066006f00720020007500740073006b00720069006600740020006100760020006800f800790020006b00760061006c00690074006500740020007000e500200062006f007200640073006b0072006900760065007200200065006c006c00650072002000700072006f006f006600650072002e0020005000440046002d0064006f006b0075006d0065006e00740065006e00650020006b0061006e002000e50070006e00650073002000690020004100630072006f00620061007400200065006c006c00650072002000410064006f00620065002000520065006100640065007200200035002e003000200065006c006c00650072002000730065006e006500720065002egt PTB ltFEFF005500740069006c0069007a006500200065007300730061007300200063006f006e00660069006700750072006100e700f50065007300200064006500200066006f0072006d00610020006100200063007200690061007200200064006f00630075006d0065006e0074006f0073002000410064006f0062006500200050004400460020007000610072006100200069006d0070007200650073007300f5006500730020006400650020007100750061006c0069006400610064006500200065006d00200069006d00700072006500730073006f0072006100730020006400650073006b0074006f00700020006500200064006900730070006f00730069007400690076006f0073002000640065002000700072006f00760061002e0020004f007300200064006f00630075006d0065006e0074006f00730020005000440046002000630072006900610064006f007300200070006f00640065006d0020007300650072002000610062006500720074006f007300200063006f006d0020006f0020004100630072006f006200610074002000650020006f002000410064006f00620065002000520065006100640065007200200035002e0030002000650020007600650072007300f50065007300200070006f00730074006500720069006f007200650073002egt SUO 
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice
ML-64
LSA Toolkit: SVDLIBC (4/5)
• LSA100-Ut
    100 51253
    0.003 0.001 ……
    0.002 0.002 ……
  – word vector ($u^T$): 1x100
  – 51253 words
• LSA100-S
    100
    2686.18
    829.941
    559.59
    …
  – 100 singular values
• LSA100-Vt
    100 2265
    0.021 0.035 ……
    0.012 0.022 ……
  – doc vector ($v^T$): 1x100
  – 2265 docs
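The file layouts listed above can be read with a few lines of NumPy. The following is a minimal sketch that assumes only what the slide shows: a header line holding the dimensions, followed by whitespace-separated values. The helper names `read_dense_matrix` and `read_singular_values` are hypothetical, and the tiny in-memory strings stand in for the real LSA100-Ut and LSA100-S files.

```python
import io
import numpy as np

def read_dense_matrix(stream):
    """Read a text matrix: first line 'rows cols', then rows*cols numbers."""
    rows, cols = (int(tok) for tok in stream.readline().split())
    values = np.array(stream.read().split(), dtype=float)
    if values.size != rows * cols:
        raise ValueError("header does not match the number of values")
    return values.reshape(rows, cols)

def read_singular_values(stream):
    """Read a text vector: first line is the count, then one value per line."""
    count = int(stream.readline().split()[0])
    values = np.array(stream.read().split(), dtype=float)
    if values.size != count:
        raise ValueError("header does not match the number of values")
    return values

# Toy stand-ins (assumed data): k = 2 dimensions, 3 words.
ut_text = "2 3\n0.003 0.001 0.004\n0.002 0.002 0.001\n"
s_text = "2\n2686.18\n829.941\n"

Ut = read_dense_matrix(io.StringIO(ut_text))    # k x (number of words)
S = read_singular_values(io.StringIO(s_text))   # k singular values
word_vectors = Ut.T                             # one 1 x k row per word
```

With the real files, `io.StringIO(...)` would be replaced by an opened file handle; transposing the matrix gives one 1x100 vector per word (or per document for LSA100-Vt).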
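ML-65
LSA Toolkit: SVDLIBC (5/5)
• Fold-in a new m×1 query vector $q$ (TF×IDF weighted beforehand)

$$\hat{q}_{1 \times k} = \left(q^{T}\right)_{1 \times m} \, U_{m \times k} \, \Sigma^{-1}_{k \times k}$$

  – $q^{T} U$: the query is represented by the weighted sum of its constituent term vectors
  – $\Sigma^{-1}$: the separate dimensions are differentially weighted
  – $\hat{q}$ is just like a row of $V$
• Cosine measure between the query and doc vectors in the latent semantic space

$$sim(\hat{q}, \hat{d}) = cosine(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q} \, \Sigma^{2} \, \hat{d}^{T}}{\left|\hat{q}\Sigma\right| \left|\hat{d}\Sigma\right|}$$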
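The fold-in and cosine steps above can be sketched in NumPy as follows. This is an illustrative sketch, not the toolkit's own code: the function names `fold_in` and `lsa_similarity` are hypothetical, the 4-term by 3-document matrix `A` is made-up toy data, and `numpy.linalg.svd` is used in place of SVDLIBC.

```python
import numpy as np

def fold_in(q, U_k, s_k):
    """Project an m-dim term vector into the k-dim latent space: q^T U_k Sigma_k^{-1}."""
    return (q @ U_k) / s_k

def lsa_similarity(q_hat, d_hat, s_k):
    """Cosine between q_hat*Sigma and d_hat*Sigma, as in the slide's sim() formula."""
    qs = q_hat * s_k
    ds = d_hat * s_k
    denom = np.linalg.norm(qs) * np.linalg.norm(ds)
    return float(qs @ ds / denom) if denom > 0 else 0.0

# Toy term-document matrix (assumed data): rows are terms, columns are docs.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k, s_k, V_k = U[:, :k], s[:k], Vt[:k, :].T   # V_k has one row per document

# Folding in a document that is already in A reproduces its row of V_k.
d0_hat = fold_in(A[:, 0], U_k, s_k)

# A query containing terms 0 and 2 (weights assumed already TFxIDF-scaled).
q = np.array([1.0, 0.0, 1.0, 0.0])
q_hat = fold_in(q, U_k, s_k)
scores = [lsa_similarity(q_hat, V_k[j], s_k) for j in range(V_k.shape[0])]
```

Folding in an existing document column recovers that document's row of $V$, which is one way to check the "just like a row of V" remark numerically.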
ltFEFF004b00e40079007400e40020006e00e40069007400e4002000610073006500740075006b007300690061002c0020006b0075006e0020006c0075006f0074002000410064006f0062006500200050004400460020002d0064006f006b0075006d0065006e007400740065006a00610020006c0061006100640075006b006100730074006100200074007900f6007000f60079007400e400740075006c006f0073007400750073007400610020006a00610020007600650064006f007300740075007300740061002000760061007200740065006e002e00200020004c0075006f0064007500740020005000440046002d0064006f006b0075006d0065006e00740069007400200076006f0069006400610061006e0020006100760061007400610020004100630072006f0062006100740069006c006c00610020006a0061002000410064006f00620065002000520065006100640065007200200035002e0030003a006c006c00610020006a006100200075007500640065006d006d0069006c006c0061002egt SVE ltFEFF0041006e007600e4006e00640020006400650020006800e4007200200069006e0073007400e4006c006c006e0069006e006700610072006e00610020006f006d002000640075002000760069006c006c00200073006b006100700061002000410064006f006200650020005000440046002d0064006f006b0075006d0065006e00740020006600f600720020006b00760061006c00690074006500740073007500740073006b0072006900660074006500720020007000e5002000760061006e006c00690067006100200073006b0072006900760061007200650020006f006300680020006600f600720020006b006f007200720065006b007400750072002e002000200053006b006100700061006400650020005000440046002d0064006f006b0075006d0065006e00740020006b0061006e002000f600700070006e00610073002000690020004100630072006f0062006100740020006f00630068002000410064006f00620065002000520065006100640065007200200035002e00300020006f00630068002000730065006e006100720065002egt ENU (Use these settings to create Adobe PDF documents for quality printing on desktop printers and proofers Created PDF documents can be opened with Acrobat and Adobe Reader 50 and later) gtgt Namespace [ (Adobe) (Common) (10) ] OtherNamespaces [ ltlt AsReaderSpreads false CropImagesToFrames true ErrorControl WarnAndContinue FlattenerIgnoreSpreadOverrides false 
IncludeGuidesGrids false IncludeNonPrinting false IncludeSlug false Namespace [ (Adobe) (InDesign) (40) ] OmitPlacedBitmaps false OmitPlacedEPS false OmitPlacedPDF false SimulateOverprint Legacy gtgt ltlt AddBleedMarks false AddColorBars false AddCropMarks false AddPageInfo false AddRegMarks false ConvertColors NoConversion DestinationProfileName () DestinationProfileSelector NA Downsample16BitImages true FlattenerPreset ltlt PresetSelector MediumResolution gtgt FormElements false GenerateStructure true IncludeBookmarks false IncludeHyperlinks false IncludeInteractive false IncludeLayers false IncludeProfiles true MultimediaHandling UseObjectSettings Namespace [ (Adobe) (CreativeSuite) (20) ] PDFXOutputIntentProfileSelector NA PreserveEditing true UntaggedCMYKHandling LeaveUntagged UntaggedRGBHandling LeaveUntagged UseDocumentBleed false gtgt ]gtgt setdistillerparamsltlt HWResolution [2400 2400] PageSize [612000 792000]gtgt setpagedevice