Comparative Text Mining

Q. Mei, C. Liu, H. Su, A. Velivelli, B. Yu, C. Zhai

DAIS: The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign

Large Scale Information Management

Cross-Collection Text Mining, Cross-Collection Text Mining (II)

Temporal Text Mining, Temporal Text Mining (II)

Spatiotemporal Text Mining, Spatiotemporal Text Mining (II)

Collections: IBM laptop reviews, APPLE laptop reviews, DELL laptop reviews

Common Themes   “IBM” specific      “APPLE” specific     “DELL” specific
Speed           Slow, 100-200 MHz   Very Fast, 3-4 GHz   Moderate, 1-2 GHz
Hard disk       Large, 80-100 GB    Small, 5-10 GB       Medium, 20-50 GB
Battery Life    Long, 4-3 hrs       Medium, 3-2 hrs      Short, 2-1 hrs

Many applications involve a comparative analysis of several text collections.

Existing work in text mining has conceptually focused on a single collection of text and is thus inadequate for comparative text analysis.

We aim to develop methods for comparing multiple collections of text and performing comparative text mining.

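[Figure: structure of the model. A background model θ_B is shared by all collections. Each of the k themes has a common version θ_j (theme j in common) and one version θ_{j,i} specific to each collection C_i, i = 1, ..., m. Weights on the arrows: λ_B selects the background and 1 - λ_B the themes; π_{d,1}, ..., π_{d,k} select a theme within document d; λ_C selects the common version of the theme and 1 - λ_C the version specific to C_i.]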

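A mixture model for cross-collection comparative text mining:

$$p_d(w \mid C_i) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} \pi_{d,j} \left[ \lambda_C\, p(w \mid \theta_j) + (1 - \lambda_C)\, p(w \mid \theta_{j,i}) \right]$$

where θ_B is the background model, θ_j is the common version of theme j, θ_{j,i} is its version specific to collection C_i, and π_{d,j} is the weight of theme j in document d.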

Goal: Extract common themes and specific themes from comparable collections

Applications: Opinion extraction, business intelligence, news summarization, etc.

“Generating” word w in doc d in collection Ci
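The generation of a word w in a document d of collection C_i can be sketched as a nested mixture. The sketch below assumes that λ_B is the weight of the background, π_{d,j} the weight of theme j in the document, and λ_C the weight of the common version of a theme against its collection-specific version. The function name, the toy vocabulary, and every number are hypothetical illustrations, not values from the authors.

```python
def word_prob(w, i, pi_d, theme_common, theme_specific, background,
              lambda_b=0.9, lambda_c=0.5):
    """Probability of word w in a document of collection i (assumed mixture)."""
    themes = 0.0
    for j, weight in enumerate(pi_d):
        common = theme_common[j].get(w, 0.0)         # p(w | theme j, common)
        specific = theme_specific[j][i].get(w, 0.0)  # p(w | theme j, collection i)
        themes += weight * (lambda_c * common + (1 - lambda_c) * specific)
    return lambda_b * background.get(w, 0.0) + (1 - lambda_b) * themes


# Hypothetical toy models over a five-word vocabulary, two themes, two collections.
background = {"the": 0.6, "battery": 0.1, "disk": 0.1, "long": 0.1, "short": 0.1}
theme_common = [{"battery": 1.0}, {"disk": 1.0}]
theme_specific = [
    [{"long": 1.0}, {"short": 1.0}],                            # theme 0 in C_0, C_1
    [{"long": 0.5, "disk": 0.5}, {"short": 0.5, "disk": 0.5}],  # theme 1 in C_0, C_1
]
pi_d = [0.7, 0.3]  # theme weights of one hypothetical document

for word in ("battery", "long", "short"):
    p0 = word_prob(word, 0, pi_d, theme_common, theme_specific, background)
    p1 = word_prob(word, 1, pi_d, theme_common, theme_specific, background)
    print(word, round(p0, 4), round(p1, 4))
```

Because every component distribution sums to one, the mixture does too; a word such as "long" is more probable in the collection whose specific themes favor it, while "battery" has the same probability in both.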

Sample results (comparing news articles about Iraq war and Afghan war)

Reference: C. Zhai, A. Velivelli, and B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. KDD 2004.

Goal: Extract evolutionary theme patterns from a time-labeled collection

Applications: News summarization, literature analysis, opinion monitoring, etc.

[Figure: Theme evolution graph and threads of the Tsunami data set, laid out along a time axis. Themes shown: Immediate Reports; Statistics of Death and Loss; Personal Experience of Survivors; Statistics of Further Impact; Aid from Local Areas; Aid from the World; Donations from Countries; Specific Events of Aid…; Lessons from Tsunami; Research Inspired.]

[Figure legend: documents (Doc1, Doc3, Doc ..); theme spans; evolutionary transitions; theme evolution thread.]

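[Figure: mixture model for a document d. The components are themes θ_1, θ_2, ..., θ_k and a background θ_B, each a word distribution. Example distributions shown: (warning 0.3, system 0.2, ..); (aid 0.1, donation 0.05, support 0.02, ..); (statistics 0.2, loss 0.1, dead 0.05, ..); and function words (is 0.05, the 0.04, a 0.03, ..). Weights on the arrows: λ_B and 1 - λ_B for background versus themes, and π_{d,1}, π_{d,2}, ..., π_{d,k} for the themes within document d.]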

“Generating” word w in doc d in the collection
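As an illustration of this generative view, the sketch below draws words for one document: with probability λ_B the word comes from the background, otherwise a theme is picked according to the document's theme weights and the word is drawn from that theme. The word lists reuse the truncated example distributions from the slide (so they are renormalized when sampling); the values of λ_B and of the theme weights are hypothetical.

```python
import random

# Truncated example distributions from the slide; the ".." tails are unknown,
# so random.choices treats the listed values as relative weights.
THEMES = [
    {"warning": 0.3, "system": 0.2},
    {"aid": 0.1, "donation": 0.05, "support": 0.02},
    {"statistics": 0.2, "loss": 0.1, "dead": 0.05},
]
BACKGROUND = {"is": 0.05, "the": 0.04, "a": 0.03}


def generate_word(rng, pi_d, themes, background, lambda_b):
    """Draw one word from the background/theme mixture of a document."""
    if rng.random() < lambda_b:
        dist = background                      # background branch
    else:
        j = rng.choices(range(len(themes)), weights=pi_d)[0]
        dist = themes[j]                       # theme branch, theme j
    words = list(dist)
    return rng.choices(words, weights=[dist[w] for w in words])[0]


rng = random.Random(0)        # fixed seed only to make the run repeatable
pi_d = [0.5, 0.3, 0.2]        # hypothetical theme weights for one document
print([generate_word(rng, pi_d, THEMES, BACKGROUND, lambda_b=0.4)
       for _ in range(10)])
```

Fitting the model runs this process in reverse: given the observed words, estimate the theme distributions and the per-document weights, typically with an EM-style procedure.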

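[Figure: evolutionary transitions along the time axis T, between t1 and t2. Themes shown: A (information 0.2, topic 0.1, classification 0.1, text 0.05), B (microarray 0.2, gene 0.1, protein 0.05), and C (web 0.3, classification 0.1, topic 0.1). Question marks label the candidate evolutionary transitions to B and to C, which are decided by theme similarity.]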
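One way to make "theme similarity" concrete is to compare the word distributions of two themes with KL divergence and accept a transition when the divergence is small. This is a sketch under that assumption; the slide does not say which measure is used. The distributions are the truncated examples for themes A, B, and C, so they are smoothed and renormalized over the joint vocabulary; the smoothing constant is a hypothetical choice.

```python
import math

A = {"information": 0.2, "topic": 0.1, "classification": 0.1, "text": 0.05}
B = {"microarray": 0.2, "gene": 0.1, "protein": 0.05}
C = {"web": 0.3, "classification": 0.1, "topic": 0.1}


def smooth(dist, vocab, eps=0.01):
    """Add eps to every word and renormalize so the values sum to one."""
    total = sum(dist.values()) + eps * len(vocab)
    return {w: (dist.get(w, 0.0) + eps) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


vocab = sorted(set(A) | set(B) | set(C))
a, b, c = (smooth(d, vocab) for d in (A, B, C))

d_ab = kl_divergence(b, a)  # how much B diverges from A
d_ac = kl_divergence(c, a)  # how much C diverges from A
print("A -> B:", round(d_ab, 3))
print("A -> C:", round(d_ac, 3))
```

With these numbers C is closer to A than B is, since C shares "topic" and "classification" with A while B shares no words with it. A threshold on the divergence would then keep the transition to C and reject the one to B.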

[Figure: Theme life cycles of KDD abstracts. x-axis: Time (year), 1999 to 2004. y-axis: Normalized Strength of Theme, 0 to 0.02. Themes plotted: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business. Example theme word distributions: (gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …); (rules 0.0142, association 0.0064, support 0.0053, …).]

[Figure: Theme life cycles from the CNN news dataset, obtained by decoding the collection. The word sequence of the collection is labeled with the states θ_1, θ_2, θ_3, ... and the background B; each state emits a word w with output probability P(w|θ).]
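Decoding a word sequence into theme labels can be sketched with the Viterbi algorithm, assuming the themes and the background act as states of a hidden Markov model whose emission probabilities are the P(w|θ) above. The HMM reading, the transition probabilities, the emission tables, and the state names below are all hypothetical; the slide only shows that the collection is decoded.

```python
import math


def viterbi(words, states, start, trans, emit, floor=1e-6):
    """Most likely state sequence for words, computed in log space."""
    score = {s: math.log(start[s]) + math.log(emit[s].get(words[0], floor))
             for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for w in words[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + math.log(trans[r][s]))
            score[s] = (prev[best] + math.log(trans[best][s])
                        + math.log(emit[s].get(w, floor)))
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


states = ["aid", "stats", "B"]  # two hypothetical themes plus background
start = {s: 1 / 3 for s in states}
trans = {r: {s: 0.8 if r == s else 0.1 for s in states} for r in states}
emit = {
    "aid": {"aid": 0.5, "donation": 0.5},
    "stats": {"statistics": 0.5, "loss": 0.5},
    "B": {"the": 0.5, "is": 0.5},
}
words = ["the", "aid", "donation", "is", "statistics", "loss"]
path = viterbi(words, states, start, trans, emit)
print(list(zip(words, path)))
```

Once every word carries a theme label, the strength of a theme in a time window can be estimated by counting the words labeled with it in that window, which gives one point of a life-cycle curve.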

Reference: Q. Mei and C. Zhai. Discovering Evolutionary Theme Patterns from Text -- An Exploration of Temporal Text Mining. KDD 2005.

Goal: Model the spatiotemporal theme patterns in a collection of text.

Model the mixture of topics: common themes

Spatiotemporal content analysis: theme life cycles, theme coverage snapshots

Applications: Weblog mining, search result summarization, opinion tracking, business intelligence, etc.

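[Figure: generating word w in document d, written at time t and location l. With weight λ_B the word is drawn from the background, P(w|B); with weight 1 - λ_B it is drawn from one of the themes θ_1, ..., θ_i, ..., θ_k, using P(w|θ_i). The theme is chosen from the spatiotemporal context (Time = t; Location = l), P(θ_i|t,l), with weight λ_TL, or from the document, P(θ_i|d), with weight 1 - λ_TL.]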

Spatiotemporal model:

Compute theme life cycles:

Compute theme snapshots:

Reference: Q. Mei, C. Liu, H. Su, and C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. WWW 2006.

Sample results:

Sample results (Weblog data about “Hurricane Katrina”, 5 weeks, U.S.):

Models:

Cluster 1
  Common Theme: united 0.042, nations 0.04, …
  Iraq Theme: n 0.03, Weapons 0.024, Inspections 0.023, …
  Afghan Theme: northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …

Cluster 2
  Common Theme: killed 0.035, month 0.032, deaths 0.023, …
  Iraq Theme: troops 0.016, hoon 0.015, sanches 0.012, …
  Afghan Theme: taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …

Cluster 3
  …

The common theme indicates that “United Nations” is involved in both wars.

Collection-specific themes indicate different roles of “United Nations” in the two wars.

The first 2 weeks are mostly about “aid from the world”.

The next 2 weeks are mostly about “personal experience”.

[Figure annotations: Rising; Dropping]

Week 1: The theme is the strongest along the Gulf of Mexico

Week 2: The discussion moves towards the northern and western states

Week 3: The theme is distributed more uniformly over the states

Week 4: The theme is again strong along the east coast and the Gulf of Mexico

Week 5: The theme fades out in most states

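Model:

$$p(w : d, t, l) = \lambda_B\, P(w \mid B) + (1 - \lambda_B) \sum_{j=1}^{k} p(w, \theta_j \mid d, t, l)$$

Life cycle of theme j at a given location $\tilde{l}$:

$$p(t \mid \theta_j, \tilde{l}) = \frac{p(\theta_j \mid t, \tilde{l})\, p(t, \tilde{l})}{\sum_{t' \in T} p(\theta_j \mid t', \tilde{l})\, p(t', \tilde{l})}$$

Snapshot of theme j at a given time $\tilde{t}$:

$$p(\theta_j, l \mid \tilde{t}) = \frac{p(\theta_j \mid \tilde{t}, l)\, p(\tilde{t}, l)}{\sum_{l' \in L} \sum_{j'=1}^{k} p(\theta_{j'} \mid \tilde{t}, l')\, p(\tilde{t}, l')}$$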
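Theme life cycles and theme snapshots both follow from Bayes' rule once the model has estimated p(θ_j | t, l) for every time and location, together with the weight p(t, l) of each time-location pair. A minimal sketch, assuming those two tables are already available: the dictionary layout, the function names, and all numbers are hypothetical.

```python
def theme_life_cycle(p_theme, p_ctx, j, loc, times):
    """p(t | theme j, loc): how theme j is spread over time at one location."""
    joint = {t: p_theme[(t, loc)][j] * p_ctx[(t, loc)] for t in times}
    z = sum(joint.values())
    return {t: v / z for t, v in joint.items()}


def theme_snapshot(p_theme, p_ctx, t, locations, k):
    """p(theme j, l | t): coverage of every theme and location at one time."""
    joint = {(j, l): p_theme[(t, l)][j] * p_ctx[(t, l)]
             for l in locations for j in range(k)}
    z = sum(joint.values())
    return {key: v / z for key, v in joint.items()}


# Hypothetical estimates: 3 weeks, 2 locations, k = 2 themes.
times, locations, k = [1, 2, 3], ["LA", "TX"], 2
p_ctx = {(1, "LA"): 0.2, (1, "TX"): 0.1, (2, "LA"): 0.2,
         (2, "TX"): 0.2, (3, "LA"): 0.1, (3, "TX"): 0.2}      # p(t, l)
p_theme = {(1, "LA"): [0.8, 0.2], (1, "TX"): [0.5, 0.5],
           (2, "LA"): [0.5, 0.5], (2, "TX"): [0.6, 0.4],
           (3, "LA"): [0.2, 0.8], (3, "TX"): [0.3, 0.7]}      # p(theme | t, l)

cycle = theme_life_cycle(p_theme, p_ctx, j=0, loc="LA", times=times)
snapshot = theme_snapshot(p_theme, p_ctx, t=1, locations=locations, k=k)
print(cycle)
print(snapshot)
```

Plotting the first result against time gives a life-cycle curve for one theme at one location; plotting the second on a map, one theme at a time, gives a weekly snapshot.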
