Page 1:

Text Mining

Huaizhong KOU

PhD student of Georges Gardarin

PRiSM Laboratory

Page 2:

0. Content

1. Introduction: What happens; What is Text Mining; Text Mining vs Data Mining; Applications
2. Feature Extraction: Task; Indexing; Dimensionality Reduction
3. Document Categorization: Task; Architecture; Categorization Classifiers; Application: Trend Analysis
4. Document Clustering: Task; Algorithms; Application
5. Product
6. Reference

Page 3:

1. Text Mining: Introduction

1.1 What happens
1.2 What's Text Mining
1.3 Text Mining vs Data Mining
1.4 Application

Page 4:

1.1 Introduction: What happens (1)

Internet

• Information explosion
• 80% of information is stored in text documents: journals, web pages, emails, ...
• Difficult to extract specific information
• Current technologies ...

Page 5:

1.1 Introduction: What happens (2)

• It is necessary to automatically analyze, organize, summarize ...

Knowledge: "The stock value of XML companies will rise"

Page 6:

1.2 Introduction: What's Text Mining (1)

• Text Mining ::= the procedure of synthesizing information by analyzing the relations, the patterns, and the rules among textual data (semi-structured or unstructured text).

• Techniques ::= data mining, machine learning, information retrieval, statistics, natural-language understanding, case-based reasoning

Ref:[1]-[4]

Page 7:

1.2 Introduction: What's Text Mining (2)

[Figure: the text mining process. Sample documents are transformed into representation models; a learning phase produces domain-specific templates/models; in the working phase these are applied to text documents to produce knowledge and visualizations.]

Page 8:

1.3 Introduction: TM vs DM

                     Data Mining                        Text Mining
Data object          numerical & categorical data       textual data
Data structure       structured                         unstructured & semi-structured
Data representation  straightforward                    complex
Space dimension      < tens of thousands                > tens of thousands
Methods              data analysis, machine learning,   data mining, information
                     statistics, neural networks        retrieval, NLP, ...
Maturity             broad implementation since 1994    broad implementation starting 2000
Market size          10^5 analysts at large and         10^8 analysts, corporate workers
                     mid-size companies                 and individual users

Ref:[12][13]

Page 9:

1.4 Introduction: Application

• The potential applications are countless:

• Customer profile analysis

• Trend analysis

• Information filtering and routing

• Event tracking

• News story classification

• Web search

• ...

Ref:[12][13]

Page 10:

2. Feature Extraction

2.1 Task
2.2 Indexing
2.3 Weighting Model
2.4 Dimensionality Reduction

Ref:[7][11][14][18][20][22]

Page 11:

2.1 Feature Extraction: Task (1)

Task: extract a good subset of words to represent documents

Document collection → all unique words/phrases → Feature Extraction → all good words/phrases

Ref:[7][11][14][18][20][22]

Page 12:

2.1 Feature Extraction: Task (2)

While more and more textual information is available online, effective retrieval is difficult without good indexing of text content.

While-more-and-textual-information-is-available-online-effective-retrieval-difficult-without-good-indexing-text-content (16 naive terms)

↓ Feature Extraction

text-information-online-retrieval-index (5 feature terms, with frequencies 2 1 1 1 1)

Ref:[7][11][14][18][20][22]

Page 13:

2.2 Feature Extraction: Indexing (1)

Training documents
→ Identification of all unique words
→ Removal of stop words: non-informative words, e.g. {the, and, when, more}
→ Word stemming: removal of suffixes to generate word stems, grouping words to increase relevance, e.g. {walker, walking} → walk
→ Term weighting: naive terms; importance of each term in a document

Ref:[7][11][14][18][20][22]
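
A minimal Python sketch of this indexing pipeline. The stop-word list and the suffix-stripping rule below are toy stand-ins for a real stop list and a Porter-style stemmer:

import re
from collections import Counter

STOP_WORDS = {"the", "and", "when", "more", "was", "is", "a", "of", "to"}  # toy stop list

def stem(word):
    # crude suffix stripping standing in for a real stemmer (e.g. Porter's)
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index(document):
    words = re.findall(r"[a-z]+", document.lower())     # identify all words
    words = [w for w in words if w not in STOP_WORDS]   # remove stop words
    stems = [stem(w) for w in words]                    # word stemming
    return Counter(stems)                               # naive term frequencies

print(index("The walker was walking when more walkers arrived"))
# Counter({'walk': 3, 'arriv': 1}) -- {walker, walking, walkers} -> walk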

Page 14:

2.2 Feature Extraction: Indexing (2)

• Document representation: vector space model

d = (w1, w2, ..., wt) ∈ R^t

where wi is the weight of the ith term in document d.

Ref:[7][11][14][18][20][22]

Page 15:

2.3 Feature Extraction: Weighting Model (1)

• tf - Term Frequency weighting: wij = Freqij

Freqij ::= the number of times the jth term occurs in document Di.
Drawback: does not reflect a term's importance for document discrimination.

• Ex. D1 = ABRTSAQWAXAO, D2 = RTABBAXAQSAK

     A  B  K  O  Q  R  S  T  W  X
D1   3  1  0  1  1  1  1  1  1  1
D2   3  2  1  0  1  1  1  1  0  1

Ref:[11][22]

Page 16:

2.3 Feature Extraction: Weighting Model (2)

• tfidf - Inverse Document Frequency weighting

wij = Freqij * log(N / DocFreqj)

N ::= the number of documents in the training document collection.
DocFreqj ::= the number of documents in which the jth term occurs.

Advantage: reflects a term's importance for document discrimination.
Assumption: terms with low DocFreq are better discriminators than terms with high DocFreq in the document collection.

• Ex.

     A  B  K    O    Q  R  S  T  W    X
D1   0  0  0    0.3  0  0  0  0  0.3  0
D2   0  0  0.3  0    0  0  0  0  0    0

Ref:[11][13][22]
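
A small Python sketch of the tf and tfidf weights on the D1/D2 example above, using the base-10 logarithm so that log(2/1) ≈ 0.3 as in the table:

import math
from collections import Counter

docs = {"D1": "ABRTSAQWAXAO", "D2": "RTABBAXAQSAK"}
N = len(docs)
tf = {name: Counter(text) for name, text in docs.items()}             # Freq_ij
vocab = sorted(set().union(*tf.values()))
doc_freq = {t: sum(1 for c in tf.values() if t in c) for t in vocab}  # DocFreq_j

def tfidf(doc, term):
    # w_ij = Freq_ij * log(N / DocFreq_j)
    return tf[doc][term] * math.log10(N / doc_freq[term])

for name in docs:
    print(name, {t: round(tfidf(name, t), 1) for t in vocab if tfidf(name, t) > 0})
# D1 {'O': 0.3, 'W': 0.3}   D2 {'K': 0.3} -- terms in both documents get weight 0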

Page 17:

2.3 Feature Extraction: Weighting Model (3)

• Entropy weighting

wij = log(1.0 + Freqij) * (1 + entropy(wi))

where

entropy(wi) = (1 / log N) * Σ_{j=1..N} (Freqij / TotFreqi) * log(Freqij / TotFreqi)

TotFreqi ::= the total frequency of the ith term in the collection.

entropy(wi) is the average entropy of the ith term:
 -1 if the word occurs exactly once in every document
 0 if the word occurs in only one document

Ref:[11][13][22]
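
A sketch of the entropy weight as reconstructed above, on a made-up three-document collection; the toy words are chosen to hit the two boundary cases:

import math
from collections import Counter

docs = ["text mining mining", "text data", "text web"]        # toy collection
N = len(docs)
tf = [Counter(d.split()) for d in docs]
tot = {t: sum(c[t] for c in tf) for t in set().union(*tf)}    # TotFreq_i

def entropy(term):
    # (1 / log N) * sum over documents of p * log p, ranges from -1 to 0
    p = [c[term] / tot[term] for c in tf if c[term] > 0]
    return sum(x * math.log(x) for x in p) / math.log(N)

def weight(i, term):
    # w_ij = log(1.0 + Freq_ij) * (1 + entropy(w_i))
    return math.log(1.0 + tf[i][term]) * (1.0 + entropy(term))

print(round(entropy("text"), 2))       # -1.0: occurs once in every document
print(round(entropy("data"), 2))       # 0.0: occurs in only one document
print(round(weight(1, "data"), 2))     # 0.69: keeps its full log-frequency weight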

Page 18:

2.4 Feature Extraction: Dimension Reduction

• 2.4.1 Document Frequency Thresholding
• 2.4.2 X²-statistic
• 2.4.3 Latent Semantic Indexing

Ref:[11][20][21][27]

Page 19:

2.4.1 Dimension Reduction: DocFreq Thresholding

• Document Frequency Thresholding

Training documents D (naive terms)
→ Calculate DocFreq(w) for each word
→ Set a threshold θ
→ Remove all words with DocFreq(w) < θ
→ Feature terms

Ref:[11][20][21][27]
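
A minimal sketch of document frequency thresholding; the threshold value theta and the toy term sets are arbitrary examples:

from collections import Counter

def doc_freq_filter(docs_terms, theta):
    # docs_terms: one set of naive terms per training document
    df = Counter(t for terms in docs_terms for t in terms)   # DocFreq(w)
    return {t for t, n in df.items() if n >= theta}          # drop DocFreq(w) < theta

docs_terms = [{"text", "mining"}, {"text", "data"}, {"text", "web"}]
print(doc_freq_filter(docs_terms, theta=2))                  # {'text'}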

Page 20:

2.4.2 Dimension Reduction: X²-statistic

• Assumption: a pre-defined category set for the training collection D
• Goal: estimate the independence between a term and a category

Naive terms + category set C = {c1, c2, ..., cm}
→ Compute the term categorical score X²max(w)
→ Set a threshold θ
→ Remove all words with X²max(w) < θ
→ Feature terms

X²(w, cj) = N * (A*D - C*B)² / ((A+C) * (B+D) * (A+B) * (C+D))

X²max(w) = max_j { X²(w, cj) }

A := |{d | d ∈ cj and w ∈ d}|
B := |{d | d ∉ cj and w ∈ d}|
C := |{d | d ∈ cj and w ∉ d}|
D := |{d | d ∉ cj and w ∉ d}|
N := |{d | d ∈ D}|

Ref:[11][20][21][27]
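
A sketch of the X² score computed from the contingency counts A, B, C, D defined above; the counts in the example are made up:

def chi_square(A, B, C, D):
    # X^2(w, c) = N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D))
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def chi_square_max(per_category_counts):
    # X^2max(w) = max over categories c_j of X^2(w, c_j)
    return max(chi_square(*c) for c in per_category_counts)

# made-up counts: w occurs in 40 of the 50 docs of c_j, but in only
# 5 of the 100 documents outside c_j -- a strongly category-dependent term
print(round(chi_square(40, 5, 10, 95), 1))   # 89.3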

Page 21:

2.4.3 Dimension Reduction: LSI (1)

• LSI = Latent Semantic Indexing

1. SVD model: Singular Value Decomposition of the term-document matrix X

X = T0 * S0 * D0'        (t,d) = (t,m) * (m,m) * (m,d)

with T0' T0 = I, D0' D0 = I, m <= min(t,d)

T0: t x m orthogonal eigenvector matrix; its columns are eigenvectors of X X'.
D0: d x m orthogonal eigenvector matrix; its columns are eigenvectors of X' X.
S0: m x m diagonal matrix of singular values (square roots of eigenvalues), in decreasing order of importance.
m: rank of matrix X, m <= min(t,d).

Equivalently, X' = D0 * S0 * T0'        (d,t) = (d,m) * (m,m) * (m,t)

Ref:[11][20][21][27]
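
The SVD model can be illustrated with numpy, whose svd routine returns exactly the three factors above (here on a made-up 4x3 term-document matrix):

import numpy as np

# made-up term-document matrix X: t = 4 terms, d = 3 documents
X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])

T0, s, D0t = np.linalg.svd(X, full_matrices=False)   # X = T0 * S0 * D0'
S0 = np.diag(s)                       # singular values in decreasing order

print(np.allclose(X, T0 @ S0 @ D0t))          # True: exact reconstruction
print(np.allclose(T0.T @ T0, np.eye(3)))      # True: T0' T0 = I
print(np.allclose(D0t @ D0t.T, np.eye(3)))    # True: D0' D0 = I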

Page 22:

2.4.3 Dimension Reduction: LSI (2)

2. Approximate model

Select k <= m and keep only the k largest singular values:

appr(X) = T * S * D'        (t,d) ≈ (t,k) * (k,k) * (k,d)

• Each column of appr(X) (equivalently, each row of appr(X)') approximately represents one document.
• Given a document vector xi (a row of appr(X)') and its corresponding row di of D, the following holds:

xi = di * S * T'    and    di = xi * T * S^-1

Ref:[11][20][21][27]

Page 23:

2.4.3 Dimension Reduction: LSI (3)

3. Document representation model:

A new document d over the t naive terms, d = (w1, w2, ..., wt) ∈ R^t, is mapped to

appr(d) = d * T * S^-1 ∈ R^k        (1,k) = (1,t) * (t,k) * (k,k)

• There is no good method to determine k; it depends on the application domain.
• Some experiments suggest k between 100 and 300.

Ref:[11][20][21][27]
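
Continuing the numpy illustration: keep the k largest singular values to form appr(X), then fold a new document into the k-dimensional space via appr(d) = d T S^-1:

import numpy as np

X = np.array([[1., 0., 1.],
              [1., 1., 0.],
              [0., 1., 1.],
              [0., 0., 1.]])
T0, s, D0t = np.linalg.svd(X, full_matrices=False)

k = 2                                      # keep the k largest singular values
T, S, D = T0[:, :k], np.diag(s[:k]), D0t[:k, :].T
appr_X = T @ S @ D.T                       # appr(X): best rank-k approximation

d = np.array([0., 1., 1., 0.])             # new document over the t naive terms
appr_d = d @ T @ np.linalg.inv(S)          # appr(d) = d * T * S^-1, in R^k
print(appr_d.shape)                        # (2,)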

Page 24:

3. Document Categorization

3.1 Task
3.2 Architecture
3.3 Categorization Classifiers
3.4 Application

Ref:[1][2][4][5][11][18][23][24]

Page 25:

3.1 Categorization: Task

• Task: assignment of one or more predefined categories (topics, themes) to a document.

Page 26:

3.2 Categorization: Architecture

Training documents → preprocessing → weighting → feature selection → Classifier (built over the predefined categories)

New document d → Classifier → category(ies) assigned to d

Page 27:

3.3 Categorization Classifiers

3.3.1 Centroid-Based Classifier

3.3.2 k-Nearest Neighbor Classifier

3.3.3 Naive Bayes Classifier

Page 28:

3.3.1 Model: Centroid-Based Classifier (1)

1. Input: new document d = (w1, w2, ..., wn)
2. Predefined categories: C = {c1, c2, ..., cl}
3. // Compute the centroid vector of each category ci ∈ C:
   ci = (1 / |ci|) * Σ_{d' ∈ ci} d'
4. // Similarity model - cosine function:
   Simil(di, dj) = cos(di, dj) = (di · dj) / (||di|| ||dj||) = Σ_l (wil * wjl) / sqrt(Σ_l wil² * Σ_l wjl²)
5. // Compute similarity: Simil(ci, d) = cos(ci, d)
6. // Output: assign to document d the category cmax such that
   Simil(cmax, d) >= Simil(ci, d) for all ci ∈ C
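
A compact Python sketch of the centroid-based classifier above; the two toy categories and their document vectors are invented for illustration:

import numpy as np

def centroid_classify(d, categories):
    # categories: {name: (n_docs, n_terms) array of training document vectors}
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    centroids = {c: docs.mean(axis=0) for c, docs in categories.items()}  # step 3
    return max(centroids, key=lambda c: cos(centroids[c], d))            # steps 4-6

categories = {                       # made-up 3-term document vectors
    "sports":  np.array([[3., 0., 1.], [2., 1., 0.]]),
    "finance": np.array([[0., 2., 3.], [1., 1., 4.]]),
}
print(centroid_classify(np.array([0., 1., 2.]), categories))   # finance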

Page 29:

3.3.1 Model: Centroid-Based Classifier (2)

[Figure: three document vectors d1, d2, d3; the angle between d1 and d2 is smaller than the angle between d1 and d3]

• cos(angle(d1, d2)) > cos(angle(d1, d3))

• d2 is closer to d1 than d3 is

• The cosine-based similarity model can reflect the relations between features.

Page 30:

3.3.2 Model: k-Nearest Neighbor Classifier

1. Input: new document d
2. Training collection: D = {d1, d2, ..., dn}
3. Predefined categories: C = {c1, c2, ..., cl}
4. // Compute similarities: for (di ∈ D) { Simil(d, di) = cos(d, di); }
5. // Select the k nearest neighbors: construct the k-document subset Dk such that
   Simil(d, di) <= min{ Simil(d, doc) | doc ∈ Dk } for all di ∈ D - Dk
6. // Compute a score for each category:
   for (ci ∈ C) { score(ci) = 0; for (doc ∈ Dk) { score(ci) += ((doc ∈ ci) ? 1 : 0) } }
7. // Output: assign to d the category c with the highest score:
   score(c) >= score(ci) for all ci ∈ C - {c}
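
A corresponding sketch of the k-nearest-neighbor classifier, with one vote per neighbor as in step 6; the training vectors are made up:

import numpy as np
from collections import Counter

def knn_classify(d, training, k):
    # training: list of (document vector, category) pairs
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    neighbors = sorted(training, key=lambda t: cos(t[0], d), reverse=True)[:k]
    score = Counter(c for _, c in neighbors)        # one vote per neighbor
    return score.most_common(1)[0][0]

training = [(np.array([3., 0.]), "sports"),  (np.array([2., 1.]), "sports"),
            (np.array([0., 4.]), "finance"), (np.array([1., 3.]), "finance")]
print(knn_classify(np.array([0.5, 3.]), training, k=3))        # finance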

Page 31:

3.3.3 Model: Naive Bayes Classifier

Basic assumption: all terms are distributed in documents independently of each other.

1. Input: new document d = (w1, w2, ..., w|d|)
2. Predefined categories: C = {c1, c2, ..., cl}
3. // Compute the probability that d belongs to each class ci ∈ C:

Pr(ci | d) = Pr(d | ci) * Pr(ci) / Pr(d)
           = Pr(d | ci) * Pr(ci) / Σ_{cj ∈ C} Pr(d | cj) * Pr(cj)

// since the terms wj in a document are assumed independent of each other:

           = Pr(ci) * Π_{j=1..|d|} Pr(wj | ci) / Σ_{ck ∈ C} ( Pr(ck) * Π_{j=1..|d|} Pr(wj | ck) )

4. // Output: assign to d the category c with the highest probability:
Pr(c | d) = max_{ci ∈ C} Pr(ci | d)
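
A sketch of the Naive Bayes classifier above. The add-one (Laplace) smoothing of Pr(wj | ci) is an addition not shown on the slide, used to avoid zero probabilities:

import math
from collections import Counter

def train_nb(docs):
    # docs: list of (token list, category) pairs
    cats = {c for _, c in docs}
    prior = {c: sum(1 for _, cc in docs if cc == c) / len(docs) for c in cats}
    words = {c: Counter() for c in cats}
    for tokens, c in docs:
        words[c].update(tokens)
    vocab = set().union(*words.values())
    return prior, words, vocab

def classify_nb(tokens, prior, words, vocab):
    def log_pr(c):
        total = sum(words[c].values())
        # log Pr(c) + sum_j log Pr(w_j | c), with add-one smoothing
        return math.log(prior[c]) + sum(
            math.log((words[c][w] + 1) / (total + len(vocab))) for w in tokens)
    return max(prior, key=log_pr)

docs = [("goal match team".split(), "sports"),
        ("team win cup".split(), "sports"),
        ("stock market price".split(), "finance"),
        ("market crash bank".split(), "finance")]
prior, words, vocab = train_nb(docs)
print(classify_nb("team market win".split(), prior, words, vocab))   # sports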

Page 32:

3.4 Categorization Application: Trend Analysis - the EAnalyst System

Goal: predict trends in stock prices based on news stories.

Trend ::= (slope, confidence)

Learning process:
• Time-series data → piecewise linear fitting → find trends → trend clusters
• Retrieve the sample news documents relevant to each trend and align trends with docs
• Categorization of the aligned sample docs learns one language model per trend cluster

Working: incoming news documents are fed to the Bayes classifier, which selects the matching language model and hence the predicted new trend.

Ref:[28]

Page 33:

4. Document Clustering

4.1 Task
4.2 Algorithms
4.3 Application

Ref:[5][7][8][9][10][15][16][29]

Page 34:

4.1 Document Clustering: Task

• Task: group all documents so that documents within the same group are more similar to each other than to documents in other groups.

• Cluster hypothesis: relevant documents tend to be more closely related to each other than to non-relevant documents.

Ref:[5][7][8][9][10][15][16][29]

Page 35:

4.2 Document Clustering: Algorithms

• 4.2.1 k-means
• 4.2.2 Hierarchic Agglomerative Clustering (HAC)
• 4.2.3 Association Rule Hypergraph Partitioning (ARHP)

Ref:[5][7][8][9][10][15][16][29]

Page 36:

4.2.1 Document Clustering: k-means

• k-means: distance-based flat clustering

• Advantages:
  • linear time complexity
  • works relatively well in low-dimensional spaces

• Drawbacks:
  • distance computation in high-dimensional spaces is costly
  • the centroid vector may not summarize the cluster documents well
  • the initial k clusters affect the quality of the final clusters

0. Input: D ::= {d1, d2, ..., dn}; k ::= the number of clusters
1. Select k document vectors as the initial centroids of the k clusters
2. Repeat
3.   Select one vector d among the remaining documents
4.   Compute the similarities between d and the k centroids
5.   Put d in the closest cluster and recompute that centroid
6. Until the centroids do not change
7. Output: k clusters of documents

Ref:[5][7][8][9][10][15][16][29]
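
A sketch of k-means over document vectors. This is the batch (Lloyd-style) variant of the procedure above, using cosine similarity to the centroids, whereas the slide's version updates one document at a time:

import numpy as np

def kmeans(X, k, iters=20, seed=0):
    # X: (n, t) document vectors; assign every document to the most
    # cosine-similar centroid, then recompute centroids until stable
    rng = np.random.default_rng(seed)
    unit = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
    centroids = X[rng.choice(len(X), size=k, replace=False)]    # step 1
    for _ in range(iters):                                      # steps 2-6
        labels = (unit(X) @ unit(centroids).T).argmax(axis=1)   # closest cluster
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels                                               # step 7

X = np.array([[3., 0.], [2., 1.], [0., 4.], [1., 3.]])   # four toy documents
print(kmeans(X, k=2))          # e.g. [0 0 1 1]: two clusters of two documents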

Page 37:

4.2.2 Document Clustering: HAC

• Hierarchic Agglomerative Clustering (HAC): distance-based hierarchic clustering

• Advantages:
  • produces better-quality clusters
  • works relatively well in low-dimensional spaces

• Drawbacks:
  • distance computation in high-dimensional spaces is costly
  • quadratic time complexity

0. Input: D ::= {d1, d2, ..., dn}
1. Calculate the similarity matrix SIM[i,j]
2. Repeat
3.   Merge the two most similar clusters, K and L, into a new cluster KL
4.   Compute the similarities between KL and each remaining cluster and update SIM[i,j]
5. Until there is a single cluster (or a specified number of clusters)
6. Output: a dendrogram of clusters

Ref:[5][7][8][9][10][15][16][29]
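
HAC is available off the shelf; a minimal sketch with scipy, using average-linkage merges over cosine distances on made-up document vectors:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[3., 0.], [2., 1.], [0., 4.], [1., 3.]])  # four toy documents
dist = pdist(X, metric="cosine")          # step 1: pairwise (dis)similarities
Z = linkage(dist, method="average")       # steps 2-5: merge closest clusters
print(fcluster(Z, t=2, criterion="maxclust"))   # cut the dendrogram: 2 clusters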

Page 38:

4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (1)

• Hypergraph H = (V, E)
  V :: a set of vertices
  E :: a set of hyperedges; a hyperedge can connect more than two vertices

[Figure: example hypergraph over vertices {A, B, C, D, E, F, G} with weighted hyperedges a = 1, b = 1/3, c = 1, e = 3/4, f = 3/2, g = 1/3, h = 5/6, i = 1/2]

Ref:[30]-[35]

Page 39:

4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (2)

• Transactional view of documents and features:
  item ::= document
  transaction ::= feature

(Transactional database of documents and features; one transaction per feature)

              Doc1  Doc2  Doc3  ...  Docn
w1             5     5     2    ...   1
w2             2     4     3    ...   5
w3             0     0     0    ...   1
...
wt             6     0     0    ...   3

Ref:[30]-[35]

Page 40:

4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (3)

• Clustering:

Document-feature transaction database
→ Discover association rules (Apriori algorithm)
→ Construct the association rule hypergraph
→ Partition the hypergraph (hypergraph partitioning algorithm)
→ k partitions

• Hyperedges ::= frequent item sets
• Hyperedge weight ::= average of the confidences of all rules
• Assumption: documents occurring in the same frequent item set are more similar

Ref:[30]-[35]
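
A sketch of the first two ARHP stages on the document-as-item view: a brute-force level-wise search for frequent document sets (Apriori additionally prunes candidates) and confidence-averaged hyperedge weights. The partitioning stage relies on a multilevel hypergraph partitioner as in [34] and is not sketched here:

from itertools import combinations

# one transaction per feature, listing the documents in which it occurs
transactions = [{"d1", "d2"}, {"d1", "d2", "d3"}, {"d1", "d2"},
                {"d3", "d4"}, {"d3", "d4"}, {"d4"}]

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)

def frequent_itemsets(min_support, max_size=3):
    # level-wise search for document sets that co-occur often enough
    items = sorted({i for t in transactions for i in t})
    return {c: support(c)
            for size in range(2, max_size + 1)
            for c in combinations(items, size)
            if support(c) >= min_support}

def hyperedge_weight(itemset):
    # average of the confidences of the rules  itemset - {x} => {x}
    confs = [support(itemset) / support(tuple(set(itemset) - {x}))
             for x in itemset]
    return sum(confs) / len(confs)

for edge, supp in frequent_itemsets(min_support=2).items():
    print(edge, supp, round(hyperedge_weight(edge), 2))
# ('d1', 'd2') 3 1.0  and  ('d3', 'd4') 2 0.67 -- hyperedges with their weights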

Page 41:

4.2.3 Document Clustering: Association Rule Hypergraph Partitioning (4)

• Advantages:
  • No calculation of cluster means is required.
  • Linear time complexity.
  • The quality of the clusters is not affected by the dimensionality of the space.
  • Performs much better than traditional clustering in high-dimensional spaces, in terms of both cluster quality and runtime.

Ref:[30]-[35]

Page 42:

4.3 Document Clustering: Application

• Summarization of documents
• Navigation of large document collections
• Organization of Web search results

Ref:[10][15]-[17]

Page 43:

5. Product: Intelligent Miner for Text (IMT) (1)

IMT

• Text analysis tools:
  • Feature extraction: name extraction, term extraction, abbreviation extraction, relationship extraction
  • Categorization
  • Summarization
  • Clustering: hierarchical clustering, binary relational clustering
• Web searching tools:
  • Text search engine
  • NetQuestion Solution
  • Web Crawler

Ref:[5][36]

Page 44:

5. Product: Intelligent Miner for Text (IMT) (2)

1. Feature extraction tools
1.1 Information extraction
• Extract linguistic items that represent document contents
1.2 Feature extraction
• Assign different categories to the vocabulary in documents
• Measure their importance to the document content
1.3 Name extraction
• Locate names in text
• Determine what type of entity the name refers to
1.4 Term extraction
• Discover terms in text, including multiword technical terms
• Recognize variants of the same concept
1.5 Abbreviation recognition
• Find abbreviations and match them with their full forms
1.6 Relation extraction

Page 45:

5. Product: Intelligent Miner for Text (IMT) (3)

Feature extraction demo.

Page 46:

5. Product: Intelligent Miner for Text (IMT) (4)

2. Clustering tools
2.1 Applications
• Provide an overview of the content of a large document collection
• Identify hidden structures between groups of objects
• Improve the browsing process to find similar or related information
• Find outstanding documents within a collection
2.2 Hierarchical clustering
• Clusters are organized in a clustering tree; related clusters occur in the same branch of the tree.
2.3 Binary relational clustering
• Relationship of topics: document → cluster → topic.

NB: a preprocessing step for the categorization tool

Page 47:

5. Product: Intelligent Miner for Text (IMT) (5)

• Clustering demo: navigation of a document collection

Page 48:

5. Product: Intelligent Miner for Text (IMT) (6)

3. Summarization tools
3.1 Steps
• Compute the relevancy of each sentence to the document
• Extract the most relevant sentences
• Produce a summary of the document, with its length set by the user
3.2 Applications
• Judge the relevancy of a full text: easily determine whether the document is relevant enough to read.
• Enrich search results: the results of a query to a search engine can be enriched with a short summary of each document.
• Get a fast overview over document collections: read the summary instead of the full document.
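
A minimal extractive sketch of these steps (not IMT's actual algorithm): sentence relevancy is approximated by summed document-wide term frequencies, and the user sets the summary length:

import re
from collections import Counter

def summarize(text, max_sentences):
    # relevancy of a sentence ::= sum of document-wide term frequencies;
    # keep the most relevant sentences, summary length chosen by the user
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tf = Counter(re.findall(r"[a-z]+", text.lower()))
    score = lambda s: sum(tf[w] for w in re.findall(r"[a-z]+", s.lower()))
    best = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    return ". ".join(s for s in sentences if s in best) + "."

doc = ("Text mining extracts knowledge from text. The weather was nice. "
       "Mining text collections needs good indexing.")
print(summarize(doc, max_sentences=1))   # keeps the most relevant sentence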

Page 49:

5. Product: Intelligent Miner for Text (IMT) (7)

4. Categorization tool
• Applications
  • Organize intranet documents
  • Assign documents to folders
  • Dispatch requests
  • Forward news to subscribers

[Figure: a news-article categorizer routing documents and requests (e.g. "I like health news") into categories: sports, cultures, health, politics, economics, vacations]

Page 50:

6. Reference (1) Bibliography

[1] Marti A. Hearst, Untangling Text Data Mining, Proceedings of ACL'99, the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999 (invited paper). http://www.sims.berkeley.edu/~hearst

[2] Feldman and Dagan, KDT - Knowledge Discovery in Texts, In Proceedings of the First Annual Conference on Knowledge Discovery and Data Mining (KDD), Montreal, 1995.

[3] IJCAI-99 Workshop TEXT MINING: FOUNDATIONS, TECHNIQUES AND APPLICATIONS Stockholm, Sweden August 2, 1999 http://www.cs.biu.ac.il/~feldman/ijcai-workshop%20cfp.html

[4] Taeho C. Jo Text Categorization considering Categorical Weights and Substantial Weights of Informative Keywords, 1999 (http://www.sccs.chukyo-u.ac.jp/ICCS/olp/p3-13/p3-13.htm)

[5] Daniel Tkach (ed.), Text Mining: Turning Information Into Knowledge, A White Paper from IBM Technology, February 17, 1998. (http://allen.comm.virginia.edu/jtl5t/whiteweb.html)

[6] http://allen.comm.virginia.edu/jtl5t/index.htm

[7] G. Salton et al, Introduction to Modern Information Retrieval, McGraw-Hill Book company, 1983

[8] Michael Steinbach, George Karypis and Vipin Kumar, A Comparison of Document Clustering Techniques, KDD-2000.

[9] Douglass R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections, SIGIR '92, pp. 318-329.

[10] Oren Zamir, Oren Etzioni, Omid Madani, Richard M. Karp, Fast and Intuitive Clustering of Web Documents, KDD ’97, Pages 287 – 290, 1997

Page 51: Text Mining Huaizhong KOU PHD Student of Georges Gardarin PRiSM Laboratory

6. Reference (2) Bibliography

[11] Kjersti Aas et al., Text Categorization: A Survey, 1999.

[12] Text mining White Paper http://textmining.krdl.org.sg/whiteppr.html

[13] Gartner Group, Text mining White Paper , June 2000 http://www.xanalys.com/intelligence_tools/products/text_mining_text.html

[14] Yiming Y. and Jan O. Pedersen, A Comparative Study on Feature Selection in Text Categorization, In the 14th Int. Conf. on Machine Learning, pp. 412-420, 1997.

[15] C. J. van Rijsbergen, (1979), Information Retrieval, Buttersworth, London.

[16] Chris Buckley and Alan F. Lewit, Optimizations of inverted vector searches, SIGIR ’85, PP 97 – 110,1985.

[17] Daphe Koller and Mehran Sahami, Hierarchically classifying documents using very few words, proceedings of the 14th International Conference on Machine Learning, Nashville, Tennessee, July 1997, PP170 – 178.

[18] T. Joachims, A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization, In Int. Conf. Machine Learning, 1997.

[19] K. Lang, NewsWeeder: Learning to Filter Netnews, International conference on Machine learning, 1995, http://anther.learning.cs.cmu.edu/ml95.ps

[20] Hart, S.P., A Probabilistic Approach to Automatic Keyword Indexing, Journal of the American Society for Information Science, July-August, 1975.

Page 52:

6. Reference (3) Bibliography

[21] Scott D., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Vol. 41, No. 6, pp. 391-407, 1990.

[22] S.T. Dumais, Improving the Retrieval of Information from External Sources, Behavior Research Methods, Instruments and Computers, Vol. 23, No. 2, pp. 229-236, 1991.

[23] T. Yavuz and H. A. Guvenir, Application of k-Nearest Neighbor on Feature Projections Classifier to Text Categorization, 1998

[24]Eui-Hong H. and George K., Centroid-Based Document Classification: Analysis & Experimental Results, In European Conference on Principles of Data Mining and Knowledge Discovery(PKDD), 2000

[25]Vladmimir N. Vapnik, the Nature of Statistical Learning Theory, Springer, New York, 1995

[26] Ji He, Ah-Hwee Tan and Chew-Lim Tan, A Comparative Study on Chinese Text Categorization Methods, PRICAI 2000 Workshop on Text and Web Mining, Melbourne, pp. 24-35, August 2000. (http://textmining.krdl.org.sg/PRICAI2000/text-categorization.pdf)

[27] Erik W., Jan O. Pedersen and Andreas S. Weigend, A Neural Network Approach to Topic Spotting, In Proc. 4th Annual Symposium on Document Analysis and Information Retrieval, pp. 22-34, 1993.

[28] Lavrenko, V., Schmill, M., Lawrie, D., Ogilvie, P., Jensen, D. and Allan, J., In Proceedings of the KDD 2000 Conference, pp. 37-44.

[29] Peter W., Recent Trends in Hierarchic Document Clustering: A Critical Review, Information Processing & Management, Vol. 24, No. 5, pp. 577-597, 1988.

Page 53:

6. Reference (4) Bibliography

[30] J. Larocca Neto, A.D. Santos, C.A.A. Kaestner, A.A. Freitas. Document clustering and text summarization. Proc. 4th Int. Conf. Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), 41-55. London: The Practical Application Company. 2000.

[31] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, Clustering Based On Association Rule Hypergraphs, SIGMOD'97 Workshop on Research Issues on Data Mining and Knowledge Discovery, 1997.

[32] Daniel Boley, Maria Gini, Robert Gross, Eui-Hong (Sam) Han, Kyle Hastings, George Karypis, Vipin Kumar, Bamshad Mobasher, and Jerome Moore, Partitioning-Based Clustering for Web Document Categorization, Decision Support Systems Journal, Vol. 27, No. 3, pp. 329-341, 1999.

[34] George K., Rajat A., Vipin K. and Shashi S., Multilevel Hypergraph Partitioning: Applications in VLSI Domain, In Proceedings of the Design Automation Conference, 1997.

[35] Eui-Hong (Sam) Han, George Karypis, Vipin Kumar and Bamshad Mobasher, Clustering In A High-Dimensional Space Using Hypergraph Models, Technical Report #97-019, http://www-users.cs.umn.edu/~han/.

[36] Daniel S. Tkach, Information Mining with the IBM Intelligent Miner Family, IBM White Paper, February 1998.