Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw


Research partially supported by the KBN research project 4 T11C "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems".

A search engine with SOM-based document set representation
Map visualizations in 3D (BEATCA)

The preparation of documents is done by an indexer, which turns each document into a vector-space model representation. The indexer also identifies frequent phrases in the document set for clustering and labelling purposes. Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded. The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation. The best map (with respect to some similarity measure) is used by the query processor in response to the user's query.

Document model in search engines
In the so-called vector model, a document is considered as a vector in the space spanned by the words it contains. Example terms: dog, food, walk. Example documents: "My dog likes this food", "When walking, I take some food".

Clustering document vectors
The document space is projected onto a 2D map of m x r cells. [Figure annotation, translated from Polish: a strong change of position is marked with a thick arrow.] An important difference to general clustering: not only should each cluster contain similar documents, but neighboring clusters should also be similar.

Our problem
Instability. Pre-defined major themes are needed.

Our approach
Find a coarse clustering into a few themes.

Bayesian Networks in Document Clustering
SOM document-map based search engines require initial document clustering in order to present results in a meaningful way. Latent Semantic Indexing based methods appear to be promising for this purpose. One of them, PLSA, has been empirically investigated. A modification to the original algorithm is proposed and an extension via TAN-like Bayesian networks is suggested.

A Bayesian Network
[Figure: an example network over the variables Chappi, dog, owner, food, meat, walk.] A Bayesian network represents a joint probability distribution as a product of the conditional probabilities of children given their parents in a directed acyclic graph. This yields high compression and simplification of reasoning.

BN applications in text processing
Document classification, document clustering, query expansion.

Hidden variable approaches
PLSA (Probabilistic Latent Semantic Analysis), PHITS (Probabilistic Hyperlink Analysis), combined PLSA/PHITS. These assume a hidden variable expressing the topic of the document. The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).

PLSA - concept
Let N be the term-document matrix of word counts, i.e., N_ij denotes how often a term t_i (single word or phrase) occurs in document d_j. PLSA is a probabilistic decomposition into factors z_k (1 <= k <= K):

P(t_i | d_j) = Σ_k P(t_i | z_k) P(z_k | d_j),

with non-negative probabilities and two sets of normalization constraints: Σ_i P(t_i | z_k) = 1 for all k, and Σ_k P(z_k | d_j) = 1 for all j. [Figure: a network with document node D, hidden variable Z, and term nodes T1, T2, ..., Tn.]

PLSA - concept (continued)
PLSA aims at maximizing

L := Σ_{i,j} N_ij log Σ_k P(t_i | z_k) P(z_k | d_j).
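As a rough illustration of how this objective can be maximized, here is a minimal EM sketch for the PLSA factorization above, assuming a dense NumPy term-document count matrix N; the function and variable names are illustrative and do not come from BEATCA.

    import numpy as np

    def plsa_em(N, K, iters=50, seed=0):
        """EM for PLSA: N is a (terms x documents) count matrix, K the number of factors z_k."""
        rng = np.random.default_rng(seed)
        T, D = N.shape
        p_t_z = rng.random((T, K))
        p_t_z /= p_t_z.sum(axis=0, keepdims=True)        # P(t_i | z_k), each column sums to 1
        p_z_d = rng.random((K, D))
        p_z_d /= p_z_d.sum(axis=0, keepdims=True)        # P(z_k | d_j), each column sums to 1
        for _ in range(iters):
            # E-step: responsibilities P(z_k | t_i, d_j), proportional to P(t_i|z_k) P(z_k|d_j)
            joint = p_t_z[:, None, :] * p_z_d.T[None, :, :]           # shape (T, D, K)
            resp = joint / (joint.sum(axis=2, keepdims=True) + 1e-12)
            weighted = N[:, :, None] * resp                           # N_ij * P(z_k | t_i, d_j)
            # M-step: re-estimate both conditional distributions
            p_t_z = weighted.sum(axis=1)
            p_t_z /= p_t_z.sum(axis=0, keepdims=True)
            p_z_d = weighted.sum(axis=0).T
            p_z_d /= p_z_d.sum(axis=0, keepdims=True)
        return p_t_z, p_z_d

A document d_j can then be assigned to its dominant factor, argmax_k P(z_k | d_j), which is the clustering-by-dominant-factor idea discussed below.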
Factors z_k can be interpreted as states of a latent mixing variable associated with each observation (i.e., each word occurrence). The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L. Different factors usually capture distinct "topics" of a document collection; by clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.

EM algorithm, step 0
The data table has columns D, Z, T1, T2, ..., Tn; the hidden variable Z is initialized randomly (its true values are unknown).

EM algorithm, step 1
The Bayesian network (D, Z, T1, ..., Tn, with Z as the hidden variable) is trained on the data.

EM algorithm, step 2
Z is sampled from the trained network; go to step 1 until convergence (the Z assignment is stable). Z is sampled for each record according to the probability distribution P(Z=1 | D=d, T1=t1, ..., Tn=tn), P(Z=2 | D=d, T1=t1, ..., Tn=tn), ....

The problem
Too high a number of adjustable variables. Pre-defined clusters are not identified. Long computation times. Instability.

Our suggestion
Use Naive Bayes in a sharp version: each document is assigned to its most probable class. We were successful: up to five classes were clustered well, at high speed (with 20,000 documents).

Next step
Naive Bayes assumes document and term independence. What if they are in fact dependent? Our solution: the TAN approach. First we create a Bayesian network of terms/documents, then assume there is a hidden variable. Promising results, but a deeper study is needed.

PLSA - a model with term TAN
[Figure: hidden variable Z with document nodes D1, D2, ..., Dk and a tree of dependencies over term nodes T1, ..., T6.]

PLSA - a model with document TAN
[Figure: hidden variable Z with term nodes T1, T2, ..., Ti and a tree of dependencies over document nodes D1, ..., D6.]
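For comparison, here is a rough sketch of the "sharp" Naive Bayes variant described above, under simplifying assumptions (binary term-occurrence features, Laplace smoothing); the name sharp_naive_bayes_em and the details are illustrative, not taken from the BEATCA implementation.

    import numpy as np

    def sharp_naive_bayes_em(X, K, iters=30, seed=0, alpha=1.0):
        """Hard-assignment EM: X is a (documents x terms) binary occurrence matrix,
        K is the number of hidden topic states; each document is assigned to its
        single most probable class after every iteration."""
        rng = np.random.default_rng(seed)
        n_docs, n_terms = X.shape
        z = rng.integers(0, K, size=n_docs)                # step 0: Z initialized at random
        for _ in range(iters):
            # step 1: train the Naive Bayes model from the current assignment
            prior = np.array([(z == k).sum() + alpha for k in range(K)])
            prior = prior / prior.sum()                                          # P(Z=k)
            cond = np.vstack([(X[z == k].sum(axis=0) + alpha) /
                              ((z == k).sum() + 2 * alpha) for k in range(K)])   # P(T_i=1 | Z=k)
            # step 2 ("sharp"): assign each document to its most probable class
            log_post = (np.log(prior)[None, :]
                        + X @ np.log(cond).T
                        + (1 - X) @ np.log(1 - cond).T)    # log P(Z=k | doc), up to a constant
            z_new = log_post.argmax(axis=1)
            if np.array_equal(z_new, z):                   # stop when the assignment is stable
                break
            z = z_new
        return z

In the terms used in the slides, the hard assignment removes the per-document mixing weights that PLSA has to fit, which is the reduction in adjustable variables credited with the better stability and speed; the TAN extension would then relax the Naive Bayes independence assumption by allowing a tree of term (or document) dependencies under the same hidden variable.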