Mining document maps Mieczyslaw Klopotek Slawomir Wierzchon Michal Draminski Krzysztof Ciesielski Mariusz Kujawiak Institute of Computer Science Polish

Mining document maps

Mieczyslaw Klopotek Slawomir Wierzchon Michal Draminski Krzysztof Ciesielski

Mariusz Kujawiak

Institute of Computer Science, Polish Academy of Sciences

Warsaw

SAWM 2004 Mining Document Maps

Agenda

Motivation

Our approach

Architecture

User interface

Visualization

Map creation

Clustering

Experimental results

Future directions


Motivation

The Web as well as intranets become increasingly content-rich: simple ranked lists or even hierarchies of results seem not to be adequate anymore

A good way of presenting massive document sets in an understandable form will be crucial in the near future

The BEATCA project aims to create a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting on-line replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach)


Our approach

The presentation method is based on WebSOM's map idea and is enriched with novel methods of document analysis, clustering and visualization.

A special architecture has been developed to enable experiments with various kinds of map creation, visualization, clustering and labelling algorithms.

BEATCA: Bayesian Evolutionary Approach to Text Connectivity Analysis


BEATCA architecture

The preparation of documents is done by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation

The indexer also identifies frequent phrases in the document set for clustering and labelling purposes

Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded

The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation

The 'best' map (with respect to some similarity measure) is used by the query processor in response to the user's query
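A minimal sketch of the indexing and dictionary-optimization steps, assuming already tokenized documents. The function name `build_tfidf` and the thresholds `min_df` and `max_df_ratio` are hypothetical; simple document-frequency cut-offs stand in here for the entropy and frequency criteria, whose exact form the slides do not give.

```python
import math
from collections import Counter

def build_tfidf(docs, min_df=2, max_df_ratio=0.8):
    """Turn tokenized documents into sparse TFxIDF vectors (dict: term -> weight)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Simplified dictionary optimization (an assumption, not the BEATCA rule):
    # drop terms that are too rare or too frequent to discriminate topics.
    vocab = {t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vocab, vectors

docs = [
    "sheep and goats on the farm".split(),
    "dairy goats and black sheep".split(),
    "public health and medical informatics".split(),
    "health care and drugs information".split(),
]
vocab, vectors = build_tfidf(docs)
```

In this toy collection "and" occurs in every document and is dropped as too frequent, while terms seen in only one document are dropped as too rare.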


BEATCA architecture

[Figure: Processing Flow Diagram - BEATCA. Components shown: Internet, Spider (downloading), HT-Base, Indexing + Optimizing, VEC-Base, Mapping, MAP-Base, Clustering of docs, DocGR-Base, Clustering of cells, CellGR-Base, DB Registry, Search Engine]


Example: summaries of documents


Example: S&W frequent phrases

sheep fiends

dairy goats

black sheep

special thanks

sheep and goats

university medical center

public health

medical informatics

information departments

pharmacy related

drugs information

health care


User interface

Search results are presented on a document map

Compact (fuzzy) topical areas are extracted

Query-related summaries are generated on-line

Maps can have one of the following topologies:

the traditional flat map (square or hexagonal cells)

rotating 3D map (torus, sphere, cylinder)

hyperbolic map (Poincaré or Klein projections)

growing map (Growing Neural Gas)


User interface


Map visualizations in 3D


Hyperbolic map visualizations

triangular tessellation

hexagonal tessellation


Kohonen learning overview

Unsupervised learning neural network model

Each neuron is represented by a reference vector in document space

Each vector element (term dimension) equals the term's TFxIDF weight

Iterative regression of reference vectors onto document vector space: [formula]

Similarity is computed as the cosine of the angle between the corresponding vectors
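A minimal sketch of one Kohonen learning step with cosine-based winner selection. It uses the textbook update m_i(t+1) = m_i(t) + alpha(t) * h_ci(t) * (x(t) - m_i(t)), which is not necessarily the exact formula from the slide. The Gaussian neighbourhood and the names `cosine` and `som_step` are assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def som_step(weights, x, alpha, radius):
    """One learning step, updating `weights` in place; returns the winner cell.

    weights: dict mapping (row, col) -> reference vector (list of floats)
    x:       document vector, alpha: learning rate, radius: neighbourhood width
    """
    # Winner = cell whose reference vector is most similar to the document.
    winner = max(weights, key=lambda cell: cosine(weights[cell], x))
    for cell, m in weights.items():
        d2 = (cell[0] - winner[0]) ** 2 + (cell[1] - winner[1]) ** 2
        h = math.exp(-d2 / (2 * radius ** 2))  # Gaussian neighbourhood (assumed)
        for k in range(len(m)):
            m[k] += alpha * h * (x[k] - m[k])
    return winner

weights = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0]}
x = [0.9, 0.1]
winner = som_step(weights, x, alpha=0.5, radius=1.0)
```

The winner moves halfway toward the document, and its neighbour moves by a smaller amount set by the neighbourhood function.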


How are the maps created

A modified WebSOM method is used:

compact reference vectors representation

broad-topic initialization method

joint winner search method

multi-level (hierarchical) maps

three-phase document clustering:

• initial grouping via PLSA/PHITS

• WEBSOM on document groups

• fuzzy cell clusters extraction and labelling


Reference vector representation

Vectors are sparse by nature

During learning process they become even sparser

Represented as balanced red-black trees

Tolerance threshold imposed

Terms (dimensions) below threshold are removed

Significant complexity reduction without negative quality impact
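A minimal sketch of a sparse reference vector with a tolerance threshold. A plain Python dict stands in for the balanced red-black tree mentioned above; the function name `update_sparse` and the `tolerance` value are hypothetical.

```python
def update_sparse(ref, doc, rate, tolerance=1e-3):
    """Move sparse reference vector `ref` toward sparse document vector `doc`.

    Both vectors are dicts (term -> weight); missing terms count as 0.
    Terms whose updated weight falls below `tolerance` are removed,
    which keeps the reference vector compact.
    """
    for term in set(ref) | set(doc):
        old = ref.get(term, 0.0)
        new = old + rate * (doc.get(term, 0.0) - old)
        if abs(new) < tolerance:
            ref.pop(term, None)   # below threshold: drop the dimension
        else:
            ref[term] = new
    return ref

ref = {"sheep": 0.8, "goat": 0.001}
doc = {"sheep": 1.0, "dairy": 0.5}
update_sparse(ref, doc, rate=0.5, tolerance=0.01)
```

After the update the negligible "goat" dimension is gone, while "dairy" has entered the vector with a weight above the threshold.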


Topic-sensitive initialization

Inter-topic similarities important both for map learning and visualization/cluster extraction

Simple approach:

Use LSI to select K main broad topics

Select K map cells (evenly spread over the map) as the fixpoints for individual topics

Initialize selected fixpoints with broad topics

Initialize remaining cells with the following rule: [formula]


Joint winner search

Global winner search: accurate but slow

Local winner search: faster but can be inaccurate during rapid changes

Start with single phase of global search

Document movements become smoother during the learning process: usually local search is enough

Use global search when occasional sudden moves occur (e.g. outliers, neighbourhood width decrease)
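A minimal sketch of combining local and global winner search. The slides do not say how a sudden move is detected, so the similarity threshold `min_sim` used as the fallback trigger is an assumption, as are the names `joint_winner`, `sim` and `radius`.

```python
def joint_winner(cells, sim, doc, last_winner=None, radius=1, min_sim=0.5):
    """Return (winner_cell, mode) for a document.

    cells:       list of (row, col) map cells
    sim:         function (cell, doc) -> similarity, higher is better
    last_winner: the document's winner from the previous iteration, if any
    """
    if last_winner is not None:
        r0, c0 = last_winner
        # Local search: only cells near the previous winner are examined.
        local = [c for c in cells
                 if abs(c[0] - r0) <= radius and abs(c[1] - c0) <= radius]
        best = max(local, key=lambda c: sim(c, doc))
        if sim(best, doc) >= min_sim:
            return best, "local"
    # First pass, or the local result looks poor: fall back to global search.
    return max(cells, key=lambda c: sim(c, doc)), "global"

# Toy setup: the "document" is just its ideal cell; similarity decays with distance.
cells = [(r, c) for r in range(4) for c in range(4)]
sim = lambda cell, doc: 1.0 / (1 + abs(cell[0] - doc[0]) + abs(cell[1] - doc[1]))

first = joint_winner(cells, sim, (3, 3))
nearby = joint_winner(cells, sim, (3, 3), last_winner=(2, 2))
jumped = joint_winner(cells, sim, (3, 3), last_winner=(0, 0))
```

Here the first call has no history and searches globally, the second finds the winner locally, and the third falls back to global search because nothing near the old position is similar enough.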


Hierarchical maps

Bottom-up approach

Feasible (with joint winner search method)

Start with most detailed map

Compute weighted centroids of map areas: [formula]

Use them as seeds for coarser map

Top-down approach is possible but requires fixpoints
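A minimal sketch of the bottom-up step, assuming that a map area is a square block of neighbouring cells and that each cell is weighted by the number of documents assigned to it. Both choices are assumptions, since the original weighting formula is not available; the name `coarse_seeds` is hypothetical.

```python
def coarse_seeds(fine, counts, block=2):
    """Compute seeds for a coarser map from a detailed one.

    fine:   dict (row, col) -> reference vector (list of floats)
    counts: dict (row, col) -> number of documents in that cell (assumed weight)
    block:  side length of the square area merged into one coarse cell
    """
    sums, totals = {}, {}
    for (r, c), vec in fine.items():
        key = (r // block, c // block)
        w = counts.get((r, c), 0)
        acc = sums.setdefault(key, [0.0] * len(vec))
        for k, v in enumerate(vec):
            acc[k] += w * v
        totals[key] = totals.get(key, 0) + w
    # Empty areas keep a zero vector instead of dividing by zero.
    return {key: ([s / totals[key] for s in acc] if totals[key] else acc)
            for key, acc in sums.items()}

fine = {(0, 0): [1.0, 0.0], (0, 1): [0.0, 1.0],
        (1, 0): [1.0, 1.0], (1, 1): [0.0, 0.0]}
counts = {(0, 0): 3, (0, 1): 1, (1, 0): 0, (1, 1): 0}
seeds = coarse_seeds(fine, counts)
```

The four detailed cells collapse into one coarse cell whose seed is dominated by the most populated cell.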


Clustering document groups

Numerous methods exist, but none of them is directly applicable:

Extremely fuzzy structure of topical groups in SOM cells

Necessity of taking into account similarity measures both in the original document space and in the map space

Outlier-handling problem during cluster formation

No a priori estimation of the number of topical groups

Fuzzy C-means on the lattice of map cells is applied

Graph theoretical approach (density- and distance- based MST) combined with fuzzy clustering

Clustered documents are labeled by weighted centroids of cell reference vectors scaled with between-group entropy


Example: biomedical documents


Term rank  Cluster #1     Cluster #2          Cluster #3          Cluster #4          Cluster #5      Cluster #6
           sci.math       sci.med / sci.math  talk.religion.misc  soc.culture.israel  comp.windows.x  talk.politics.misc
1          Die            Cipher              Men                 Israel              Boot            Funding
2          Probable       Block               Women               Palestinian         Windows         Study
3          Theory         Stream              Raped               Gun                 Files           Taxes
4          Registers      Key                 Children            Aziz                Menus           Stock
5          Mathematics    Algorithms          Child               Iraqis              Lib             Health
6          Equation       Combinations        Sex                 Koppel              Icon            Market
7          Cos            Distinction         Soc                 Israeli             Label           Social
8          Sequence       Encryption          Father              Jews                Folder          Mercer
9          Tex            Epimethius          Paternity           Resolution          Msvcrtd         Governing
10         Space          Randomness          Feminist            Oliver              Shortcut        Vaccinations
11         Gravitational  Smartcard           Trolling            Utah                Netzero         Measurement
12         Wave           Entropy             White               Firearms            Tab             Bushes
13         Latex          Yahoo               England             Settlements         Kernel          Computer
14         Files          Model               Support             Palestine           Installed       Companies
15         Unsigned       Lottery             Black               Permitted           Backup          Diabetes

Label candidates (5 newsgroups)


Experiments with execution time

The impact of the following factors on the speed of map creation was investigated:

Map size (total number of cells)

Optimization methods:

• dictionary optimization

• reference vector representation

Map quality assessment:

Compare with an 'ideal' map (e.g. one created without optimizations)

Identical initialization and learning parameters

Compute the sum of squared distances between the locations of each document on the two maps
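A minimal sketch of the map quality measure, assuming each map is given as a dictionary from document identifier to (row, col) cell coordinates on a flat map. The name `map_displacement` and the sample data are hypothetical.

```python
def map_displacement(pos_ideal, pos_other):
    """Sum of squared distances between each document's cell on two maps.

    pos_ideal, pos_other: dict doc_id -> (row, col); both must cover the
    same documents. 0 means the two maps place every document identically.
    """
    return sum(
        (pos_ideal[d][0] - pos_other[d][0]) ** 2
        + (pos_ideal[d][1] - pos_other[d][1]) ** 2
        for d in pos_ideal
    )

ideal = {"d1": (0, 0), "d2": (2, 3), "d3": (4, 4)}
optimized = {"d1": (0, 1), "d2": (2, 3), "d3": (1, 0)}
score = map_displacement(ideal, optimized)
```

Lower values mean the optimized map stays closer to the reference map built with the same initialization and learning parameters.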


Execution time - map size


Execution time - optimizations


Experiments with map convergence

We examined the convergence of the maps to a stable state depending on:

type of alpha function (search radius reduction)

type of winner search method

type of initialization method


Convergence – alpha functions


Convergence – winner search


Future research

Maps for joint term-citation model, taking into account between-group link flow direction

Fully distributed map creation

Adaptive document retrieval and clustering:

Bayesian network based relevance measure

Survival models for document update rate estimation

Dead link propagation methods for page freshness estimation

We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects


Future research

Bayesian networks will be applied in particular to:

measure relevance and classify documents

accelerate document clustering processes

construct a thesaurus supporting query enrichment

extract keywords

estimate between-topic dependencies

Immuno-genetic systems will be used for:

adaptive document clustering by referring to the mechanism of so-called metadynamics

extraction of compact characteristics of document groups by exploiting the mechanism of construction of universal and specialized antibodies

visualization and resolution adjustment of document maps


Thank you!

Any questions?
