Comparative Text Mining

Q. Mei, C. Liu, H. Su, A. Velivelli, B. Yu, C. Zhai

DAIS: The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign

Large Scale Information Management

Cross-Collection Text Mining Cross-Collection Text Mining (II)

Temporal Text Mining Temporal Text Mining (II)

Spatiotemporal Text Mining Spatiotemporal Text Mining (II)

Collections: IBM laptop reviews, APPLE laptop reviews, DELL laptop reviews

Common Themes   “IBM” specific       “APPLE” specific      “DELL” specific
Battery Life    Long, 3-4 hrs        Medium, 2-3 hrs       Short, 1-2 hrs
Hard disk       Large, 80-100 GB     Small, 5-10 GB        Medium, 20-50 GB
Speed           Slow, 100-200 MHz    Very Fast, 3-4 GHz    Moderate, 1-2 GHz

Many applications involve a comparative analysis of several text collections

Existing work in text mining has focused on a single collection of text and is therefore inadequate for comparative text analysis

We aim to develop methods for comparing multiple collections of text and performing comparative text mining

[Figure: generation of a word w in collection C_i. Components: background model θ_B; common theme models θ_1, …, θ_k; collection-specific theme models θ_{j,1}, θ_{j,2}, …, θ_{j,m} for each theme j and each collection C_1, C_2, …, C_m. Mixing weights on the arrows: λ_B and 1-λ_B, λ_C and 1-λ_C, and π_{d,1}, …, π_{d,k}.]

- A mixture model for cross-collection comparative text mining

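\[
p_d(w \mid C_i) = (1 - \lambda_B)\, p(w \mid \theta_B) + \lambda_B \sum_{j=1}^{k} \pi_{d,j} \left[ \lambda_C\, p(w \mid \theta_j) + (1 - \lambda_C)\, p(w \mid \theta_{j,i}) \right]
\]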

Goal: Extract common themes and specific themes from comparable collections

Applications: Opinion extraction, business intelligence, news summarization, etc.

“Generating” word w in doc d in collection Ci
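The generation step for one word can be sketched in a few lines of Python. This is only an illustration of the mixture, not the authors' implementation: the function name, the dictionary-based representation of the word distributions, and all numeric values in the usage lines are assumptions. It follows the poster's equation, in which the background term carries weight 1 - λ_B and the theme term carries weight λ_B.

```python
def cross_collection_word_prob(w, pi_d, background, common, specific,
                               lambda_b, lambda_c):
    """Probability of word w in document d of collection C_i.

    pi_d       -- list of k theme weights pi_{d,j} for document d (sums to 1)
    background -- dict word -> p(w | theta_B)
    common     -- list of k dicts, word -> p(w | theta_j)
    specific   -- list of k dicts, word -> p(w | theta_{j,i}) for collection C_i
    lambda_b   -- weight of the theme term (background gets 1 - lambda_b)
    lambda_c   -- weight of the common theme inside each theme pair
    """
    themed = 0.0
    for j, pi in enumerate(pi_d):
        # Each theme mixes its common model with its collection-specific model.
        pair = (lambda_c * common[j].get(w, 0.0)
                + (1.0 - lambda_c) * specific[j].get(w, 0.0))
        themed += pi * pair
    return (1.0 - lambda_b) * background.get(w, 0.0) + lambda_b * themed


# Toy usage with made-up numbers (one theme, k = 1).
background = {"the": 0.5, "united": 0.01}
common = [{"united": 0.042, "nations": 0.04}]
specific = [{"weapons": 0.024, "inspections": 0.023}]
p = cross_collection_word_prob("united", [1.0], background, common, specific,
                               lambda_b=0.1, lambda_c=0.5)
print(p)
```

In practice the distributions and weights would be estimated from data (for example with EM); the sketch only evaluates the mixture for given parameters.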

Sample results (comparing news articles about Iraq war and Afghan war)

Reference: C. Zhai, A. Velivelli, and B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. KDD 2004.

Goal: Extract evolutionary theme patterns from a time-labeled collection

Applications: News summarization, literature analysis, opinion monitoring, etc.

Theme evolution graph and threads of the Tsunami data set

[Figure: theme evolution graph. Theme nodes: Immediate Reports; Statistics of Death and Loss; Personal Experience of Survivors; Statistics of Further Impact; Aid from Local Areas; Aid from the World; Donations from Countries; Specific Events of Aid; …; Lessons from Tsunami; Research Inspired.]

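[Figure: documents Doc1, Doc3, … placed along a time axis, annotated with theme spans, evolutionary transitions, and a theme evolution thread.]

[Figure: “Generating” word w in doc d in the collection. Components: themes θ_1, θ_2, …, θ_k and background θ_B, each a word distribution. Example distributions shown: warning 0.3, system 0.2, …; aid 0.1, donation 0.05, support 0.02, …; statistics 0.2, loss 0.1, dead 0.05, …; background: is 0.05, the 0.04, a 0.03, …. Mixing weights on the arrows: λ_B, 1-λ_B, π_{d,1}, π_{d,2}, …, π_{d,k}.]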

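[Figure: evolutionary transition. Themes extracted from time spans t1 … t2 within the period T; candidate transitions between themes A, B, and C (marked “?”) are decided by theme similarity. Example theme distributions: microarray 0.2, gene 0.1, protein 0.05; web 0.3, classification 0.1, topic 0.1; information 0.2, topic 0.1, classification 0.1, text 0.05.]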
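The similarity formula itself did not survive in this text, so the following is only a sketch of one common choice: KL divergence between two theme word distributions, with a small additive smoothing term so that words missing from one theme do not produce infinities. The function name, the smoothing constant, and the use of KL divergence here are assumptions; the three example themes reuse the word probabilities shown in the figure.

```python
import math


def kl_divergence(p, q, eps=1e-6):
    """D(p || q) for two themes given as dicts word -> probability.

    Both distributions are smoothed with eps over the joint vocabulary
    and renormalized, so partial word lists can be compared.
    """
    vocab = sorted(set(p) | set(q))

    def smooth(dist):
        total = sum(dist.get(w, 0.0) + eps for w in vocab)
        return {w: (dist.get(w, 0.0) + eps) / total for w in vocab}

    ps, qs = smooth(p), smooth(q)
    return sum(ps[w] * math.log(ps[w] / qs[w]) for w in vocab)


# Example themes taken from the figure (truncated word lists).
theme_info = {"information": 0.2, "topic": 0.1, "classification": 0.1, "text": 0.05}
theme_web = {"web": 0.3, "classification": 0.1, "topic": 0.1}
theme_bio = {"microarray": 0.2, "gene": 0.1, "protein": 0.05}

# A smaller divergence suggests a more plausible evolutionary transition.
print(kl_divergence(theme_web, theme_info))
print(kl_divergence(theme_bio, theme_info))
```

With these truncated lists the web theme comes out closer to the information theme than the biology theme does, because they share "topic" and "classification". A threshold on the divergence would then decide which transitions to keep; its value is a tuning choice not given here.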

Theme life cycles of KDD Abstracts

[Figure: line plot. x-axis: Time (year), 1999 to 2004. y-axis: Normalized Strength of Theme, 0 to 0.02. Themes plotted: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business. Example theme word distributions: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …; rules 0.0142, association 0.0064, support 0.0053, ….]

Theme life cycles from the CNN news dataset

[Figure: decoding the collection. The collection is read as a sequence of words w w w …, and each word is assigned to one of the themes θ_1, θ_2, θ_3 or to the background B, using the output probability P(w|θ).]

Reference: Q. Mei and C. Zhai. Discovering Evolutionary Theme Patterns from Text -- An Exploration of Temporal Text Mining. KDD 2005.

Goal: Model the spatiotemporal theme patterns in a collection of text.

Model the mixture of topics: common themes

Spatiotemporal content analysis: theme life cycles, theme coverage snapshots

Applications: Weblog mining, search result summarization, opinion tracking, business intelligence, etc.

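[Figure: generation of a word w in document d written at time t and location l. Components: themes θ_1, …, θ_i, …, θ_k; the spatiotemporal context (Time = t; Location = l); background θ_B. Labels on the arrows: λ_B and 1-λ_B; λ_TL and 1-λ_TL; P(θ_i|t,l) and P(θ_i|d); P(w|θ_i); P(w|θ_B).]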
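The diagram above can be read as a two-level mixture, sketched below in Python. This is an illustration only. It assumes that λ_B is the weight of the background model, that λ_TL weights the context-level theme coverage P(θ_i|t,l) against the document-level coverage P(θ_i|d), and that the joint term factors as P(w|θ_i) times the blended coverage. The function name, data layout, and toy numbers are hypothetical.

```python
def spatiotemporal_word_prob(w, background, themes, cov_doc, cov_context,
                             lambda_b, lambda_tl):
    """Probability of word w in a document written at time t and location l.

    background  -- dict word -> P(w | theta_B)
    themes      -- list of k dicts, word -> P(w | theta_i)
    cov_doc     -- list of k values P(theta_i | d)
    cov_context -- list of k values P(theta_i | t, l)
    lambda_b    -- weight of the background model (assumption)
    lambda_tl   -- weight of the context coverage vs. document coverage (assumption)
    """
    themed = 0.0
    for i, theme in enumerate(themes):
        # Blend document-level and context-level theme coverage.
        coverage = (1.0 - lambda_tl) * cov_doc[i] + lambda_tl * cov_context[i]
        themed += coverage * theme.get(w, 0.0)
    return lambda_b * background.get(w, 0.0) + (1.0 - lambda_b) * themed


# Toy usage with made-up numbers (k = 2 themes).
background = {"the": 0.04, "aid": 0.001}
themes = [{"aid": 0.1, "donation": 0.05}, {"storm": 0.2}]
p = spatiotemporal_word_prob("aid", background, themes,
                             cov_doc=[0.7, 0.3], cov_context=[0.5, 0.5],
                             lambda_b=0.9, lambda_tl=0.5)
print(p)
```

Setting `lambda_tl` to 0 reduces the sketch to a plain per-document theme mixture, which is one way to see what the spatiotemporal context adds.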

Spatiotemporal model:

Compute theme life cycles:

Compute theme snapshots:

Reference: Q. Mei, C. Liu, H. Su, and C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. WWW 2006.

Sample results:

Sample results (Weblog data about “Hurricane Katrina”, 5 weeks, U.S.):

Models:

                Cluster 1                                                          Cluster 2                                               Cluster 3
Common Theme    united 0.042, nations 0.04, …                                      killed 0.035, month 0.032, deaths 0.023, …
Iraq Theme      n 0.03, Weapons 0.024, Inspections 0.023, …                        troops 0.016, hoon 0.015, sanches 0.012, …
Afghan Theme    northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, …   taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, …

The common theme indicates that the “United Nations” is involved in both wars

Collection-specific themes indicate different roles of the “United Nations” in the two wars

The first 2 weeks are mostly about “aid from the world”

The next 2 weeks are mostly about “personal experience”

[Plot annotations: Dropping, Rising]

Week 1: The theme is the strongest along the Gulf of Mexico

Week 2: The discussion moves towards the northern and western states

Week 3: The theme is distributed more uniformly over the states

Week 4: The theme is again strong along the east coast and the Gulf of Mexico

Week 5: The theme fades out in most states

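Spatiotemporal model (word w in document d at time t and location l):

\[
p(w : d, t, l) = \lambda_B\, p(w \mid \theta_B) + (1 - \lambda_B) \sum_{j=1}^{k} p(w, \theta_j \mid d, t, l)
\]

Theme life cycle at a given location \tilde{l}:

\[
p(t \mid \theta_j, \tilde{l}) = \frac{p(\theta_j \mid t, \tilde{l})\, p(t, \tilde{l})}{\sum_{\tilde{t} \in T} p(\theta_j \mid \tilde{t}, \tilde{l})\, p(\tilde{t}, \tilde{l})}
\]

Theme snapshot at a given time \tilde{t}:

\[
p(\theta_j, l \mid \tilde{t}) = \frac{p(\theta_j \mid \tilde{t}, l)\, p(\tilde{t}, l)}{\sum_{\tilde{l} \in L} \sum_{j'=1}^{k} p(\theta_{j'} \mid \tilde{t}, \tilde{l})\, p(\tilde{t}, \tilde{l})}
\]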
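Both quantities are renormalizations of the product p(θ_j|t,l) p(t,l): a life cycle normalizes over time at a fixed location, and a snapshot normalizes over locations and themes at a fixed time. The sketch below illustrates this in plain Python under stated assumptions: the function names, the dictionary layout keyed by (time, location), and the toy numbers are all hypothetical, and the snapshot normalization reflects one reading of the formula rather than a confirmed definition.

```python
def theme_life_cycle(p_theme_given_tl, p_tl, j, loc):
    """Distribution over time of theme j at a fixed location loc.

    p_theme_given_tl -- dict (t, l) -> list of k values p(theta_j | t, l)
    p_tl             -- dict (t, l) -> p(t, l)
    """
    times = sorted(t for (t, l) in p_tl if l == loc)
    joint = {t: p_theme_given_tl[(t, loc)][j] * p_tl[(t, loc)] for t in times}
    total = sum(joint.values())
    return {t: v / total for t, v in joint.items()}


def theme_snapshot(p_theme_given_tl, p_tl, time):
    """Joint distribution over (theme index, location) at a fixed time."""
    joint = {
        (j, l): p * p_tl[(t, l)]
        for (t, l), probs in p_theme_given_tl.items() if t == time
        for j, p in enumerate(probs)
    }
    total = sum(joint.values())
    return {key: v / total for key, v in joint.items()}


# Toy data: two weeks, two locations, k = 2 themes (all numbers made up).
p_tl = {(1, "LA"): 0.3, (1, "TX"): 0.2, (2, "LA"): 0.1, (2, "TX"): 0.4}
p_theme_given_tl = {
    (1, "LA"): [0.8, 0.2], (1, "TX"): [0.5, 0.5],
    (2, "LA"): [0.4, 0.6], (2, "TX"): [0.1, 0.9],
}
print(theme_life_cycle(p_theme_given_tl, p_tl, j=0, loc="LA"))
print(theme_snapshot(p_theme_given_tl, p_tl, time=1))
```

Plotting the life-cycle values against time for each theme gives curves like the weekly trends described above, and mapping the snapshot values onto locations gives one picture per week.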