
ICA of Text Documents

Based on

Unsupervised Topic Separation and Keyword Identification in Document Collections: A Projection Approach

Ata Kabán and Mark Girolami

Jaakko Peltonen

[email protected]
26 October 2000


1 Introduction

• ICA: proposed as a useful technique for finding meaningful directions in multivariate data

• The objective function affects the form of the potential structure discovered

• Here, the problem is partitioning and analysis of sparse multivariate data

• Prior knowledge is used to derive a computationally inexpensive ICA


2 Introduction, continued

• Two complementary architectures:

• Skewness (asymmetry) is the right objective to optimize

• The two tasks will be unified in a single algorithm

• Result: fast convergence; computational cost linear in the number of training points

Observed documents → separate → Document prototypes

Observed words → separate → Topic features


3 Data Representation

• Vector space representation: document d = [t_1, t_2, ..., t_T]^T

• T = number of words in the dictionary (tens of thousands)

• elements are binary indicators or term frequencies → sparse representation

• D = term-document matrix (T × N, N = number of documents)

[Figure: D^T laid out with rows doc 1 … doc N and columns term 1 … term T.]
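To make the representation concrete, a minimal sketch (not from the original slides; the toy corpus, tokenization and variable names are illustrative assumptions) of building a sparse T × N term-document matrix with NumPy/SciPy:

    import numpy as np
    from scipy.sparse import csr_matrix

    # Toy corpus: each document is a list of already-tokenized words.
    docs = [["space", "orbit", "nasa"],
            ["god", "church", "faith"],
            ["orbit", "launch", "shuttle", "nasa"]]

    # Dictionary: word -> row index. T = dictionary size, N = number of documents.
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    T, N = len(vocab), len(docs)

    # Term frequencies as (row, col, value) triplets; duplicate entries are summed by csr_matrix.
    rows = [vocab[w] for d in docs for w in d]
    cols = [n for n, d in enumerate(docs) for _ in d]
    vals = [1.0] * len(rows)
    D = csr_matrix((vals, (rows, cols)), shape=(T, N))

    print(D.toarray())   # dense view, only sensible for toy data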


4 Preprocessing

• Assumption: observations = a noisy expansion of some denser group of latent topics

• Number of clusters or topics set a priori

• The K-dimensional LSA space is used as the topic-concept subspace

• PCA may lose important data components: sparsity → infrequent but meaningful correlations; less of a concern here

• Reconstruction: D ≈ D_K = U E V^T
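A minimal sketch of this preprocessing step (illustrative only; the stand-in matrix and variable names are assumptions), using scipy.sparse.linalg.svds, a Lanczos-type solver, and keeping the K largest singular triplets:

    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.default_rng(0)
    D = rng.poisson(0.05, size=(500, 120)).astype(float)   # stand-in T x N term-document matrix
    K = 4                                                   # number of topics, set a priori

    # Truncated SVD; sort the K singular triplets into descending order of singular value.
    U, s, Vt = svds(D, k=K)
    order = np.argsort(s)[::-1]
    U, s, Vt = U[:, order], s[order], Vt[order, :]
    E = np.diag(s)

    # Rank-K LSA reconstruction D ~ D_K = U E V^T.
    D_K = U @ E @ Vt
    print("relative reconstruction error:", np.linalg.norm(D - D_K) / np.linalg.norm(D))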


5 Prototype Documents from a Corpus

Assumption: documents = noisy linear mixture of (~independent) document prototypes

• Number of prototypes = number of topics; prototypes reside in the LSA space (K dimensions)

• Data projection onto the right singular vectors + variance normalization

X^(1) := E^-1 V^T D^T = U^T (a K × T matrix)

• Task: find the mixing matrix W^(1) and source documents S^(1) such that

X^(1) = W^(1)T S^(1) (S^(1): a K × T matrix)
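Continuing the sketch (illustrative, not the authors' code): after the truncated SVD, the projected and variance-normalized data X^(1) = E^-1 V^T D^T coincides with U^T, as the slide states:

    import numpy as np
    from scipy.sparse.linalg import svds

    rng = np.random.default_rng(0)
    D = rng.poisson(0.05, size=(500, 120)).astype(float)   # stand-in T x N matrix
    K = 4
    U, s, Vt = svds(D, k=K)
    order = np.argsort(s)[::-1]
    U, s, Vt = U[:, order], s[order], Vt[order, :]

    # Projection onto the right singular vectors + variance normalization: X1 = E^-1 V^T D^T.
    X1 = np.diag(1.0 / s) @ Vt @ D.T    # a K x T matrix
    print(np.allclose(X1, U.T))         # True: the projection is just U^T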


6 Prototype Documents from a Corpus, continued

• Basis vectors of the topic space are assumed to differ; to separate the prototypes, find independent components

Words in documents are distributed in a positively skewed way

• Search restricted to skewed (perhaps asymmetric) distributions

• After LSA, the unmixing matrix must be orthogonal (W^(1)-1 = W^(1)T)

[Figure: D^T (rows doc 1 … doc N, columns term 1 … term T) is mapped by W^(1) E^-1 V^T to S^(1) (rows topic 1 … topic K, columns term 1 … term T).]


7 Prototype Documents from a Corpus, continued

• Objective: a skewness measure, the Fisher skewness:

• Prior knowledge: small component mean; projection variance restricted to unity → simplified objective G(s) (the third-order moment)

• To prevent degenerate solutions, restrict w^T w = 1 at stationary points

• Solve with gradient methods or iteratively

F_S = E{s^3} / E{s^2}^(3/2)

G(s) = E{s^3} ≈ (1/T) Σ_{t=1..T} (w^T x_t)^3

Stationary points (with w^T w = 1): (1/T) Σ_{t=1..T} x_t g(w^T x_t) ∝ w, where g = G′
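A sketch of this objective and the one-unit fixed-point update (illustrative assumptions: whitened stand-in data with unit-variance rows, g(s) = s^2 as the derivative-type nonlinearity; not the authors' exact code):

    import numpy as np

    rng = np.random.default_rng(1)
    K, T = 4, 1000
    X = rng.gamma(1.0, size=(K, T))          # positively skewed stand-in data
    X -= X.mean(axis=1, keepdims=True)
    X /= X.std(axis=1, keepdims=True)        # unit variance per row

    def G(w, X):
        # Simplified objective: empirical third moment of the projection w^T x.
        return np.mean((w @ X) ** 3)

    def fixed_point_step(w, X):
        # One step: w <- (1/T) sum_t x_t g(w^T x_t) with g(s) = s^2, then renormalize.
        s = w @ X
        w_new = (X * s ** 2).mean(axis=1)
        return w_new / np.linalg.norm(w_new)

    w = rng.normal(size=K)
    w /= np.linalg.norm(w)
    for _ in range(10):
        w = fixed_point_step(w, X)
    print("G(s) =", G(w, X))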


8 Prototype Documents from a Corpus, continued

• Sources are positive → the skewness is positive (output sign is relevant!)

• K orthonormal projection directions → matrix iteration

• Similar to approximate Newton-Raphson optimization (FastICA-type derivation → small additional term)

• Computational complexity: O(2K²T + KT + 4K³)

W ← g(W_old X) X^T

W_new ← W (W^T W)^(-1/2)
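A sketch of the corresponding matrix iteration (illustrative; assumes the convention S = W X with orthonormal rows of W and g(s) = s^2, and computes the symmetric orthonormalization (W^T W)^(-1/2) via an eigendecomposition):

    import numpy as np

    def sym_orth(W):
        # Symmetric orthonormalization: W <- W (W^T W)^(-1/2).
        d, V = np.linalg.eigh(W.T @ W)
        return W @ V @ np.diag(1.0 / np.sqrt(d)) @ V.T

    def skew_step(W, X):
        # One matrix fixed-point step: W <- g(W X) X^T with g(s) = s^2, then orthonormalize.
        S = W @ X
        return sym_orth((S ** 2) @ X.T / X.shape[1])

    rng = np.random.default_rng(2)
    K, T = 4, 1000
    X = rng.gamma(1.0, size=(K, T))
    X -= X.mean(axis=1, keepdims=True)
    X /= X.std(axis=1, keepdims=True)

    W = sym_orth(rng.normal(size=(K, K)))
    for _ in range(10):                      # the slides report convergence in a few steps
        W = skew_step(W, X)
    print(np.allclose(W @ W.T, np.eye(K)))   # W stays (approximately) orthonormal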


9 Topic Features from Word Features

Assumption: terms = noisy linear expansion of (~independent) concepts (topics)

• Data compression:

X^(2) := E^-1 U^T D = V^T (a K × N matrix)

• Task: find the unmixing matrix W^(2) and topic features S^(2) such that

X^(2) = W^(2)T S^(2) (S^(2): a K × N matrix)

• This time, a clustering criterion is used


10 Topic Features from Word Features, continued

• Objective function (z_kn indicates the class of x_n); see Err below

• Stochastic minimization → EM-type algorithm; see the updates below

[Figure: D (rows term 1 … term T, columns doc 1 … doc N) is mapped by W^(2) E^-1 U^T to S^(2) (rows topic 1 … topic K, columns doc 1 … doc N).]

Err = (1/N) Σ_{n=1..N} Σ_{k=1..K} z_kn ||x_n − w_k||^2

E-step:  z_kn := softmax_k{ g(w_k,old^T x_n) }
M-step:  w_k := (1/N) X g(w_k,old^T X)^T
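One plausible reading of these updates, as a sketch only (the garbled slide content above is a reconstruction; the responsibility step uses a softmax over the projections with g taken as the identity, and the direction update is a standard soft-means step used here as a stand-in):

    import numpy as np

    def softmax(A, axis=0):
        A = A - A.max(axis=axis, keepdims=True)
        e = np.exp(A)
        return e / e.sum(axis=axis, keepdims=True)

    rng = np.random.default_rng(3)
    K, N = 3, 400
    labels_true = rng.integers(0, K, size=N)
    # Stand-in K x N projected data with one bump per class (illustrative only).
    X = 3.0 * np.eye(K)[:, labels_true] + rng.normal(size=(K, N))

    W = rng.normal(size=(K, K))          # rows w_k: one direction per topic

    for _ in range(20):
        # E-step: soft class responsibilities z_kn from a softmax over the projections w_k^T x_n.
        Z = softmax(W @ X, axis=0)       # K x N, columns sum to 1
        # M-step (stand-in): responsibility-weighted mean of the data for each direction.
        W = (Z @ X.T) / Z.sum(axis=1, keepdims=True)

    print(Z.argmax(axis=0)[:20])         # hard cluster labels for the first 20 documents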


11 Topic Features from Word Features, continued

• Comparison approach: a set of binary classifiers → algorithm (see the updates below):

• Maximizes G_k (below), a skewed, monotonically increasing function of topic s_k → a skewed prior is appropriate

• Variance is normalized after LSA; independent topics → source components aligned to orthonormal axes

• Similar to the previous architecture

g(w_k^T X) := sigmoid(w_k^T X)
w_k ← X g(w_k^T X)^T
w_k ← w_k / ||w_k||

G_k = (1/N) Σ_{n=1..N} log(1 + exp(w_k^T x_n)) + const
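A sketch of one such binary-classifier-style direction (illustrative; uses the log(1 + exp(·)) objective and the sigmoid-gradient update reconstructed above on skewed stand-in data; names and data are assumptions):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def G_k(w, X):
        # (1/N) sum_n log(1 + exp(w^T x_n)), a skewed, increasing function of the projection.
        return np.logaddexp(0.0, w @ X).mean()

    rng = np.random.default_rng(4)
    K, N = 4, 800
    X = rng.gamma(1.0, size=(K, N))      # positively skewed stand-in projections
    X -= X.mean(axis=1, keepdims=True)
    X /= X.std(axis=1, keepdims=True)

    w = rng.normal(size=K)
    w /= np.linalg.norm(w)
    for _ in range(50):
        # Gradient-type update: w <- X sigmoid(w^T X)^T / N, then renormalize to unit length.
        w = X @ sigmoid(w @ X) / N
        w /= np.linalg.norm(w)
    print("G_k(w) =", G_k(w, X))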


12 Combining the Tasks

• Joint optimization problem:

• Information from the linear outputs and from the weights is complementary:
  Topic clustering: weight peaks → representative words; projections → clustering information
  Document prototype search: weight peaks → clustering information; projections → index terms

• Review the separating weights on D: W^(2)T E^-1 U^T

Obj(W^(1), W^(2)) = G(s^(1)) + G(s^(2)) = Σ_{k=1..K} [ G(s_k^(1)) + G(s_k^(2)) ]

S^(1) = W^(1)T U^T,   S^(2) = W^(2)T E^-1 U^T D


13 Combining the Tasks, continued

• Whitening allows inspection but isn't practical → normalize the variance along the K principal directions:

D' := U E^-1 U^T D

• Find a new unmixing matrix W^(2') to maximize G(W^(2')T U^T D') = ... = G(W^(2')T X^(2)) → W^(2') = W^(2)

• Solve the relation W^(2)T U^T = S^(1) — the same relation W^(1)T U^T = S^(1) as in the first task

• Rewrite the objective: concatenate the data: [U^T, V^T]

W^(1) = W^(2) = W,   Obj(W) = G(W^T U^T) + G(W^T V^T)


14 Combining the Tasks, continued

• Resultant algorithm: O(2K²(T + N) + K(T + N) + 4K³)

Inputs: D, K

1. Decompose D with the Lanczos algorithm. Retain the K first singular values. Obtain U, E, V.
2. Let X = [U^T, V^T]
3. Iterate until convergence (the updates below):

Outputs: S ∈ ℝ^(K×(T+N)), W ∈ ℝ^(K×K)

• S: [T document prototypes | N topic features]; W: structure information of the identified topics in the corpus

S := W_old X
W := g(S) X^T
W_new := W (W^T W)^(-1/2)
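Putting the steps together, a compact end-to-end sketch of the combined algorithm as reconstructed above (illustrative only: the convention S = W X, g(s) = s^2, the synthetic stand-in for D, and all function names are assumptions, not the authors' implementation):

    import numpy as np
    from scipy.sparse.linalg import svds

    def sym_orth(W):
        # W <- W (W^T W)^(-1/2), symmetric orthonormalization.
        d, V = np.linalg.eigh(W.T @ W)
        return W @ V @ np.diag(1.0 / np.sqrt(d)) @ V.T

    def skew_ica_topics(D, K, n_iter=10, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Truncated (Lanczos-type) SVD, K largest singular triplets.
        U, s, Vt = svds(D.astype(float), k=K)
        order = np.argsort(s)[::-1]
        U, s, Vt = U[:, order], s[order], Vt[order, :]
        # 2. Concatenate the projected data: X = [U^T, V^T], a K x (T + N) matrix.
        X = np.hstack([U.T, Vt])
        # 3. Iterate the skewness fixed point with symmetric orthonormalization.
        W = sym_orth(rng.normal(size=(K, K)))
        for _ in range(n_iter):
            S = W @ X                        # current source estimates
            W = sym_orth((S ** 2) @ X.T)     # g(S) X^T with g(s) = s^2, then re-orthonormalize
        S = W @ X
        return S, W                          # S: [T document prototypes | N topic features]

    rng = np.random.default_rng(5)
    D = rng.poisson(0.05, size=(300, 80))    # toy T x N count matrix
    S, W = skew_ica_topics(D, K=3)
    print(S.shape, W.shape)                  # (3, 380) (3, 3)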


15 Simulations

10 most representative words selected by the algorithm (first table below) vs. 10 most frequent words (second table below); conformal with human labeling

1. Newsgroup data ('sci.crypt', 'sci.med', 'sci.space', 'soc.religion.christian')

Simulation 2. 10 most representative words, using 5 topics and 2 document classes ('sci.space', 'soc.religion.christian')

Simulation 1 — most representative words (one column per topic):

  medic      god         space    kei
  patient    christian   nasa     encrypt
  year       peopl       orbit    secur
  effect     rutger      launch   govern
  diseas     thing       dai      system
  doctor     bibl        mission  clipper
  studi      christ      flight   chip
  health     understand  engin    public
  call       church      shuttl   escrow
  test       point       system   de
  physician  question    scienc   law

Simulation 1 — most frequent words (one column per class):

  kei      effect    space   peopl
  encrypt  year      nasa    christian
  system   call      orbit   god
  chip     peopl     dai     rutger
  secur    medic     year    thing
  govern   question  system  church
  clipper  ve        high    bibl
  public   doctor    launch  question
  peopl    find      man     part
  escrow   patient   scienc  find
  comput   studi     engin   christ

Simulation 2 — most representative words, topics I-V:

  I         II         III     IV        V
  people    god        dai     space     sex
  church    man        year    system    issu
  group     life       nasa    shuttl    term
  thing     love       moon    design    sexual
  year      christian  jpl     research  basi
  find      live       earth   cost      respons
  question  jesu       orbit   human     homosexu
  bibl      christ     part    discuss   refer
  read      rutger     gov     launch    fornic
  faith     human      ron     dr        intercours
  issu      save       venu    station   law


16 Conclusions


• Clustering and keyword identification by an ICA variant that maximizes skewness

• Key assumption: an asymmetric latent prior

• Joint problem solved (D and D^T) → 'spatio-temporal' ICA

• Algorithm is linear in the number of documents, O(K²N)

• Fast convergence (3-8 steps)

• Potential number of topics can be greater than indicated by a human labeler → discover subtopics

• Hierarchical partitioning possible (recursive binary splits)

Dependency structure of the splitting in simulation 2:
  sci.space → space shuttle design (IV), space shuttle mission (III)
  soc.religion.christian → christian church (I), christian religion (II), christian morality (V)


17 Further Work

• Study links with other methods → improve flexibility

• Or develop a mechanism to allow a more structured representation, in a mixed or hierarchical manner

• For example: build model estimation into the algorithm

• Relax the equal w_k norm assumption

[Figure: projections of the documents, marked by class: x = 'sci.crypt', o = 'sci.space', plus two further markers for 'sci.med' and 'soc.religion.christian'.]