
Contextual Text Mining


Page 1: Contextual Text Mining

2009 © Qiaozhu Mei University of Illinois at Urbana-Champaign

Contextual Text Mining

Qiaozhu Mei ([email protected])

University of Illinois at Urbana-Champaign

Page 2: Contextual Text Mining

Knowledge Discovery from Text

Text Mining System

Page 3: Contextual Text Mining


Trend of Text Content

Content Type:  Published Content | Professional web content | User generated content | Private text content
Amount / day:  3-4G              | ~2G                      | 8-10G                  | ~3T

- Ramakrishnan and Tomkins 2007

Page 4: Contextual Text Mining


Text on the Web (Unconfirmed)

[Figure: amount of text on the Web: ~750k/day, ~3M/day, ~150k/day; 1M, 10B, 6M, ~100B]

Where to Start? Where to Go? Gold?

Page 5: Contextual Text Mining


Context Information in Text

Context features of text:
• Author
• Time ("3:53 AM Jan 28th")
• Source ("From Ping.fm")
• Location ("Check Lap Kok, HK")
• Author's occupation ("self designer, publisher, editor …")
• Language
• Social network
• Sentiment

Page 6: Contextual Text Mining


Rich Context in Text

• 102M blogs
• ~3M msgs/day
• ~150k bookmarks/day
• ~300M words/month
• ~2M users
• 5M users, 500M URLs
• 8M contributors, 100+ languages
• 750K posts/day
• 100M users, >1M groups
• 73 years, ~400k authors, ~4k sources
• 1B queries? Per hour? Per IP?

Page 7: Contextual Text Mining


Text + Context = ?

Context = Guidance: "I Have A Guide!"

Page 8: Contextual Text Mining


Query + User = Personalized Search


MSR

Modern System Research

Medical simulation

Montessori School of Raleigh

Mountain Safety Research

MSR Racing

Wikipedia definitions

Metropolis Street Racer

Molten salt reactor

Mars sample return

Magnetic Stripe Reader

How much can personalization help?

If you know me, you should give me Microsoft Research…

Page 9: Contextual Text Mining


Common Themes   IBM                 APPLE               DELL
Battery Life    Long, 4-3 hrs       Medium, 3-2 hrs     Short, 2-1 hrs
Hard disk       Large, 80-100 GB    Small, 5-10 GB      Medium, 20-50 GB
Speed           Slow, 100-200 MHz   Very Fast, 3-4 GHz  Moderate, 1-2 GHz

IBM Laptop Reviews / APPLE Laptop Reviews / DELL Laptop Reviews

Customer Review + Brand = Comparative Product Summary

Can we compare Products?

Page 10: Contextual Text Mining


Hot Topics in SIGMOD

Literature + Time = Topic Trends

What’s hot in literature?

Page 11: Contextual Text Mining


One Week Later

Blogs + Time & Location = Spatiotemporal Topic Diffusion

How does discussion spread?

Page 12: Contextual Text Mining


Tom Hanks, who is my favorite movie star act the leading role.

protesting... will lose your faith by watching the movie.

a good book to past time.

... so sick of people making such a big deal about a fiction book

The Da Vinci Code

Blogs + Sentiment = Faceted Opinion Summary

What is good and what is bad?

Page 13: Contextual Text Mining


[Figure: coauthor network with topical communities: information retrieval, machine learning, data mining]

Publications + Social Network =Topical Community

Who works together on what?

Page 14: Contextual Text Mining


Query Log + User = Personalized Search
Literature + Time = Topic Trends
Review + Brand = Comparative Opinion
Blog + Time & Location = Spatiotemporal Topic Diffusion
Blog + Sentiment = Faceted Opinion Summary
Publications + Social Network = Topical Community

Text + Context = Contextual Text Mining

…..

A General Solution for All

Page 15: Contextual Text Mining


Contextual Text Mining

• Generative Model of Text
• Modeling Simple Context
• Modeling Implicit Context
• Modeling Complex Context
• Applications of Contextual Text Mining

Page 16: Contextual Text Mining


Generative Model of Text

P(word | Model)

A word distribution (a multinomial over words):
the 0.1, is 0.07, harry 0.05, potter 0.04, movie 0.04, plot 0.02, time 0.01, rowling 0.01

Generation: sample words from the model, e.g. "the.. movie.. harry.. potter is.. based.. on.. j..k..rowling"
Inference, Estimation: recover the model from observed text
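The generative view above can be sketched in a few lines (a toy illustration; the distribution is the one on the slide, renormalized since only its head is shown):

```python
import random

# Unigram language model from the slide: P(word | Model)
model = {"the": 0.10, "is": 0.07, "harry": 0.05, "potter": 0.04,
         "movie": 0.04, "plot": 0.02, "time": 0.01, "rowling": 0.01}

def generate(model, n):
    """Generation: draw n words i.i.d. from the multinomial."""
    words = list(model)
    total = sum(model.values())  # renormalize the truncated distribution
    return random.choices(words, weights=[model[w] / total for w in words], k=n)

print(generate(model, 8))
```

Estimation goes the other way: count word occurrences in observed text and normalize.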

Page 17: Contextual Text Mining


Contextualized Models

P(word | Model, Context)

Generation:
• How to select contexts?
• How to model the relations of contexts?

Inference:
• How to estimate contextual models?
• How to reveal contextual patterns?

Example contexts: Year = 1998; Year = 2008; Location = US; Location = China; Source = official; Sentiment = +

Two contextualized versions of the same topic:
book 0.15, harry 0.10, potter 0.08, rowling 0.05
movie 0.18, harry 0.09, potter 0.08, director 0.04

Page 18: Contextual Text Mining


Topics in Text

• Topic (Theme) = the subject of a discourse
• A topic covers multiple documents
• A document has multiple topics
• Topic = a soft cluster of documents
• Topic = a multinomial distribution of words

Many text mining tasks:
• Extracting topics from text
• Revealing contextual topic patterns

Example topic "Web Search": search 0.2, engine 0.15, query 0.08, user 0.07, ranking 0.06, …

Page 19: Contextual Text Mining


Probabilistic Topic Models

P(w) = Σ_{i=1..K} P(z = i) P(w | Topic_i)

Topic 1 (Apple iPod): ipod 0.15, nano 0.08, music 0.05, download 0.02, apple 0.01
Topic 2 (Harry Potter): movie 0.10, harry 0.09, potter 0.05, actress 0.04, music 0.02

Example document: "I downloaded the music of the movie harry potter to my ipod nano"
("ipod" is drawn from Topic 1 with probability 0.15; "harry" from Topic 2 with probability 0.09)
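The mixture formula can be computed directly (toy numbers from the slide; the topic priors P(z = i) are not given on the slide, so a uniform prior is assumed here):

```python
# Word distributions P(w | Topic_i) from the slide
topics = [
    {"ipod": 0.15, "nano": 0.08, "music": 0.05, "download": 0.02, "apple": 0.01},
    {"movie": 0.10, "harry": 0.09, "potter": 0.05, "actress": 0.04, "music": 0.02},
]
priors = [0.5, 0.5]  # assumed uniform P(z = i)

def p_word(w):
    """P(w) = sum over i of P(z = i) * P(w | Topic_i)."""
    return sum(pz * t.get(w, 0.0) for pz, t in zip(priors, topics))

print(p_word("music"))  # 0.5 * 0.05 + 0.5 * 0.02 = 0.035
```

Note how "music", which occurs in both topics, draws probability from both components of the mixture.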

Page 20: Contextual Text Mining


Parameter Estimation

• Maximizing data likelihood:

• Parameter Estimation using EM algorithm

Model* = argmax log P(Data | Model)

EM on the example document "I downloaded the music of the movie harry potter to my ipod nano":
• Guess the affiliation of each word (which topic generated it?)
• Estimate the params from the resulting pseudo-counts
• Repeat
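The two EM steps ("guess the affiliation", then "estimate the params" from pseudo-counts) can be sketched for a two-topic mixture; the document is the one from the slide, and the initialization is illustrative:

```python
from collections import Counter

doc = "i downloaded the music of the movie harry potter to my ipod nano".split()
vocab = sorted(set(doc))
K = 2

# Illustrative initialization: near-uniform, slightly perturbed to break symmetry
p_z = [0.5, 0.5]
p_w_z = []
for k in range(K):
    weights = {w: 1.0 + 0.1 * (i % K == k) for i, w in enumerate(vocab)}
    s = sum(weights.values())
    p_w_z.append({w: weights[w] / s for w in vocab})

for _ in range(20):
    # E-step: guess the affiliation P(z = k | w) of each word occurrence
    pseudo = [Counter() for _ in range(K)]
    for w in doc:
        joint = [p_z[k] * p_w_z[k][w] for k in range(K)]
        norm = sum(joint)
        for k in range(K):
            pseudo[k][w] += joint[k] / norm  # fractional pseudo-count
    # M-step: estimate the params from the pseudo-counts
    for k in range(K):
        total = sum(pseudo[k].values())
        p_z[k] = total / len(doc)
        p_w_z[k] = {w: pseudo[k][w] / total for w in vocab}
```

Each iteration increases (or preserves) the data likelihood; the fractional counts are exactly the pseudo-counts the slide refers to.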

Page 21: Contextual Text Mining


How Context Affects Topics

• Topics in science literature: 16th century vs. 21st century
• When do a computer scientist and a gardener use "tree, root, prune" in text?
• What does "tree" mean in "algorithm"?
• In Europe, "football" appears a lot in a soccer report. What about in the US?

Text is generated according to the context!

Page 22: Contextual Text Mining


Simple Contextual Topic Model

P(w) = Σ_{j=1..C} P(c_j) Σ_{i=1..K} P(z = i | Context_j) P(w | Topic_i, Context_j)

Topics (Topic 1: Apple iPod; Topic 2: Harry Potter) are conditioned on contexts (Context 1: 2004; Context 2: 2007).

Example document: "I downloaded the music of the movie harry potter to my iphone"

Page 23: Contextual Text Mining


Contextual Topic Patterns

• Compare contextualized versions of topics: contextual topic patterns
• Contextual topic patterns = conditional distributions (z: topic; c: context; w: word)
• P(z = i | c) (or P(c | z = j)): strength of topics in a context
• P(w | z = i, c): content variation of topics

Page 24: Contextual Text Mining


Example: Topic Life Cycles (Mei and Zhai KDD’05)

[Figure: normalized strength of themes over time (1999-2004): Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business]

Context = time. Comparing P(c | z).

Page 25: Contextual Text Mining


Example: Spatiotemporal Theme Pattern (Mei et al. WWW’06)

About Government Response in Hurricane Katrina:

Week 1: The theme is the strongest along the Gulf of Mexico
Week 2: The discussion moves towards the north and west
Week 3: The theme distributes more uniformly over the states
Week 4: The theme is again strong along the east coast and the Gulf of Mexico
Week 5: The theme fades out in most states

Context = time & location. Comparing P(z | c).

Page 26: Contextual Text Mining


Example: Evolutionary Topic Graph (Mei and Zhai KDD’05)

Theme snapshots in KDD, 1999-2004 (word distributions evolving over time):

• SVM 0.007, criteria 0.007, classification 0.006, linear 0.005, …
• decision 0.006, tree 0.006, classifier 0.005, class 0.005, Bayes 0.005, …
• classification 0.015, text 0.013, unlabeled 0.012, document 0.008, labeled 0.008, learning 0.007, …
• information 0.012, web 0.010, social 0.008, retrieval 0.007, distance 0.005, networks 0.004, …
• web 0.009, classification 0.007, features 0.006, topic 0.005, …
• mixture 0.005, random 0.006, cluster 0.006, clustering 0.005, variables 0.005, …
• topic 0.010, mixture 0.008, LDA 0.006, semantic 0.005, …

Context = time. Comparing P(w | z, c).

Page 27: Contextual Text Mining


Example: Event Impact Analysis(Mei and Zhai KDD’06)

Theme: retrieval models (SIGIR papers)

Events: the start of the TREC conferences (1992); the publication of the paper "A language modeling approach to information retrieval" (1998)

Theme snapshots around the events (word distributions):
• term 0.1599, relevance 0.0752, weight 0.0660, feedback 0.0372, model 0.0310, probabilistic 0.0188, document 0.0173, …
• vector 0.0514, concept 0.0298, model 0.0291, space 0.0236, boolean 0.0151, function 0.0123, …
• probabilist 0.0778, model 0.0432, logic 0.0404, boolean 0.0281, algebra 0.0200, estimate 0.0119, weight 0.0111, …
• xml 0.0678, email 0.0197, model 0.0191, collect 0.0187, judgment 0.0102, rank 0.0097, …
• model 0.1687, language 0.0753, estimate 0.0520, parameter 0.0281, distribution 0.0268, smooth 0.0198, likelihood 0.0059, …

Context = event. Comparing P(w | z, c).

Page 28: Contextual Text Mining


Implicit Context in Text

• Some contexts are hidden: sentiments, intents, impact, etc.
• A document's contexts are not known for sure; this affiliation must be inferred from the data
• Train a model M for each implicit context
• Provide M to the topic model as guidance

Page 29: Contextual Text Mining


Modeling Implicit Context

Topics (Topic 1: Apple iPod; Topic 2: Harry Potter) crossed with sentiment contexts:

Positive: good 0.10, like 0.05, perfect 0.02, …
Negative: hate 0.21, awful 0.03, disgust 0.01, …

Example document: "I like the song of the movie on my ipod … perfect, but hate the accent"

Page 30: Contextual Text Mining


Semi-supervised Topic Model(Mei et al. WWW’07)

Maximum Likelihood Estimation (MLE): Λ* = argmax_Λ log P(D | Λ)
Maximum A Posteriori (MAP) estimation: Λ* = argmax_Λ log ( P(D | Λ) P(Λ) )

Add Dirichlet priors on the topic word distributions θ_1 … θ_k (e.g., "love, great" for one topic and "hate, awful" for another): guidance from the user. This is similar to adding pseudo-counts to the observation.
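The pseudo-count intuition can be illustrated on a single word distribution (a minimal sketch, not the full topic model; the counts and prior weights are made up, and the prior weights are treated directly as added counts):

```python
from collections import Counter

observed = Counter({"love": 3, "great": 2, "movie": 5})  # word counts in the data
prior = {"love": 2.0, "great": 2.0}                      # user guidance as pseudo-counts

vocab = set(observed) | set(prior)
total = sum(observed.values()) + sum(prior.values())

# MAP-style estimate: (observed count + pseudo-count) / total
p_map = {w: (observed[w] + prior.get(w, 0.0)) / total for w in vocab}
print(p_map["love"])  # (3 + 2) / 14 = 5/14
```

Words favored by the prior get pulled up relative to the pure MLE, exactly the effect the conjugate Dirichlet prior has in the MAP objective above.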

Page 31: Contextual Text Mining


Example: Faceted Opinion Summarization (Mei et al. WWW’07)

Topic 1: Movie
• Neutral: "... Ron Howards selection of Tom Hanks to play Robert Langdon." / "Directed by: Ron Howard Writing credits: Akiva Goldsman ..." / "After watching the movie I went online and some research on ..."
• Positive: "Tom Hanks stars in the movie, who can be mad at that?" / "Tom Hanks, who is my favorite movie star act the leading role." / "Anybody is interested in it?"
• Negative: "But the movie might get delayed, and even killed off if he loses." / "protesting ... will lose your faith by ... watching the movie." / "... so sick of people making such a big deal about a FICTION book and movie."

Topic 2: Book
• Neutral: "I remembered when i first read the book, I finished the book in two days." / "I'm reading 'Da Vinci Code' now."
• Positive: "Awesome book." / "So still a good book to past time."
• Negative: "... so sick of people making such a big deal about a FICTION book and movie." / "This controversy book cause lots conflict in west society."

Context = topic & sentiment

Page 32: Contextual Text Mining


Results: Sentiment Dynamics

Facet: the book "The Da Vinci Code" (bursts during the movie, Pos > Neg)

Facet: the impact on religious beliefs (bursts during the movie, Neg > Pos)

Page 33: Contextual Text Mining


Results: Topic with User’s Guidance

• Topics for iPod:

                No Prior                              With Prior
Battery, nano   Marketing    Ads, spam    Nano        Battery
battery         apple        free         nano        battery
shuffle         microsoft    sign         color       shuffle
charge          market       offer        thin        charge
nano            zune         freepay      hold        usb
dock            device       complete     model       hour
itune           company      virus        4gb         mini
usb             consumer     freeipod     dock        life
hour            sale         trial        inch        rechargable

Page 34: Contextual Text Mining


Complex Context in Text

• Complex context = structure of contexts
• Many contexts have a latent structure: time; location; social network
• Why model context structure?
  – Reveal novel contextual patterns
  – Regularize contextual models
  – Alleviate data sparseness: smoothing

Page 35: Contextual Text Mining


Modeling Complex Context

Context A and context B are closely related.

Two intuitions:
• Regularization: Model(A) and Model(B) should be similar
• Smoothing: look at B if A doesn't have enough data

O(C) = Likelihood + Regularization

Page 36: Contextual Text Mining


Applications of Contextual Text Mining

• Personalized Search: personalization with backoff
• Social Network Analysis (for schools): finding topical communities
• Information Retrieval (for industry labs): smoothing language models

Page 37: Contextual Text Mining


Application I: Personalized Search


Page 38: Contextual Text Mining


Personalization with Backoff (Mei and Church WSDM’08)

• Ambiguous query: MSG
  – Madison Square Garden
  – Monosodium Glutamate
• Disambiguate based on the user's prior clicks
• We don't have enough data for everyone!
  – Back off to classes of users
• Proof of concept:
  – Context = segments defined by IP addresses
  – Other market segmentation (demographics)

Page 39: Contextual Text Mining


Apply Contextual Text Mining to Personalized Search

• The text data: query logs
• The generative model: P(Url | Query)
• The context: users (IP addresses)
• The contextual model: P(Url | Query, IP)
• The structure of context: the hierarchical structure of IP addresses

Page 40: Contextual Text Mining


Evaluation Metric: Entropy (H)

H(X) = -Σ_{x∈X} p(x) log p(x)

• Difficulty of encoding information (a distribution): size of the search space; difficulty of a task
• H = 20 bits: 1 million items distributed uniformly
• A powerful tool for sizing challenges and opportunities: How hard is search? How much does personalization help?
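The entropy computation is a one-liner; as a sanity check, a uniform distribution over 2^20 (about 1 million) items gives H = 20 bits:

```python
import math

def entropy(dist):
    """H(X) = -sum over x of p(x) * log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

uniform = [1 / 2**20] * 2**20  # ~1M items distributed uniformly
print(entropy(uniform))  # 20.0
```

A peaked distribution (an easy prediction task) has low entropy; a flat one (a hard task) has high entropy, which is what makes H useful for sizing search difficulty.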

Page 41: Contextual Text Mining


How Hard Is Search?

• Traditional search: H(URL | Query) = 2.8 (= 23.9 - 21.1)
• Personalized search: H(URL | Query, IP) = 1.2 (= 27.2 - 26.0)

Entropy (H):
Query          21.1
URL            22.1
IP             22.1
All but IP     23.9
All but URL    26.0
All but Query  27.1
All three      27.2

Personalization cuts H in half!

Page 42: Contextual Text Mining


Context = First k bytes of IP

P(Url | IP_4, Q): 156.111.188.243
P(Url | IP_3, Q): 156.111.188.*
P(Url | IP_2, Q): 156.111.*.*
P(Url | IP_1, Q): 156.*.*.*
P(Url | IP_0, Q): *.*.*.*

Full personalization: every context has a different model (sparse data!)
No personalization: all contexts share the same model
Personalization with backoff: similar contexts have similar models
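Backoff over IP prefixes can be sketched as a linear interpolation; the per-prefix click models, URLs, and λ weights below are illustrative (in the talk the λs are learned with EM):

```python
def ip_prefixes(ip):
    """'156.111.188.243' -> ['156.111.188.243', '156.111.188.*', ..., '*.*.*.*']"""
    parts = ip.split(".")
    return [".".join(parts[:k] + ["*"] * (4 - k)) for k in range(4, -1, -1)]

# Illustrative P(url | query, IP_i) for one query, keyed by IP prefix class
models = {
    "156.111.188.*": {"madisonsquaregarden.com": 0.7, "msg-recipes.example": 0.3},
    "*.*.*.*":       {"madisonsquaregarden.com": 0.2, "msg-recipes.example": 0.8},
}
lambdas = [0.0, 0.3, 0.0, 0.0, 0.7]  # weights for IP_4 ... IP_0 (illustrative)

def p_url(url, ip):
    """P(url | q, IP) = sum over i of lambda_i * P(url | q, IP_i)."""
    return sum(lam * models.get(prefix, {}).get(url, 0.0)
               for lam, prefix in zip(lambdas, ip_prefixes(ip)))

print(p_url("madisonsquaregarden.com", "156.111.188.243"))  # 0.3*0.7 + 0.7*0.2 = 0.35
```

Prefix classes with no data simply contribute nothing, so sparse contexts fall back on the population-level model.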

Page 43: Contextual Text Mining


Backing Off by IP

P(Url | IP, Q) = Σ_{i=0..4} λ_i P(Url | IP_i, Q)

• λs estimated with EM
• A little bit of personalization: better than too much, or too little

λ4: weights for the first 4 bytes of the IP
λ3: weights for the first 3 bytes of the IP
λ2: weights for the first 2 bytes of the IP
…

[Figure: EM-estimated λ values (roughly 0-0.3) for λ4 … λ0; annotations: "Sparse Data", "Missed Opportunity"]

Page 44: Contextual Text Mining


Context = Market Segmentation

• Traditional goal of marketing: segment customers (e.g., business vs. consumer) by need & value proposition
  – Need: segments ask different questions at different times
  – Value: different advertising opportunities
• Segmentation variables:
  – Queries, URL clicks, IP addresses
  – Geography & demographics (age, gender, income)
  – Time of day & day of week

Page 45: Contextual Text Mining


Business Days vs. Weekends: More Clicks and Easier Queries

[Figure: total clicks (3M-9M) and H(Url | IP, Q) (1.00-1.20) over Jan 2006 (the 1st is a Sunday); business days show more clicks and easier, lower-entropy queries]

Page 46: Contextual Text Mining


Harder Queries at TV Time

[Figure: query difficulty by time of day; harder queries at TV time]

Page 47: Contextual Text Mining


Application II: Information Retrieval


Page 48: Contextual Text Mining


Application: Text Retrieval

Document d (a text mining paper), with document language model (LM) θd, p(w|d):
text 4/100 = 0.04, mining 3/100 = 0.03, clustering 1/100 = 0.01, …, data = 0, computing = 0, …

Smoothed doc LM θd', p(w|d'):
text = 0.039, mining = 0.028, clustering = 0.01, …, data = 0.001, computing = 0.0005, …

Query q = "data mining", with query language model θq, p(w|q): data 1/2 = 0.5, mining 1/2 = 0.5
Expanded query model p(w|q'): data 0.4, mining 0.4, clustering 0.1, …

Similarity function (KL-divergence):
D(θq || θd) = Σ_{w∈V} p(w | θq) log ( p(w | θq) / p(w | θd) )
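The KL-divergence scoring can be sketched with the toy numbers above (smoothing is what keeps p(w|θd) non-zero for every query word, so the divergence stays finite):

```python
import math

query_lm = {"data": 0.5, "mining": 0.5}                  # p(w | theta_q)
smoothed_doc_lm = {"text": 0.039, "mining": 0.028,
                   "clustering": 0.01, "data": 0.001,
                   "computing": 0.0005}                  # p(w | theta_d')

def kl_divergence(p_q, p_d):
    """D(theta_q || theta_d) = sum over w of p(w|q) * log(p(w|q) / p(w|d)).
    Requires p(w|d) > 0 wherever p(w|q) > 0, which smoothing guarantees."""
    return sum(pw * math.log(pw / p_d[w]) for w, pw in p_q.items())

# Lower divergence = better match, so documents are ranked by -D
score = -kl_divergence(query_lm, smoothed_doc_lm)
```

Without smoothing, p("data" | d) = 0 and the divergence would be infinite, which is why the smoothed model θd' is used on the document side.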

Page 49: Contextual Text Mining


Smoothing a Document Language Model

Retrieval performance depends on the estimated LM, which depends on LM smoothing.

MLE P_MLE(w|d): text 4/100 = 0.04, mining 3/100 = 0.03, Assoc. 1/100 = 0.01, clustering 1/100 = 0.01, …, data = 0, computing = 0, …

Smoothed estimates:
text = 0.039, mining = 0.028, Assoc. = 0.009, clustering = 0.01, …, data = 0.001, computing = 0.0005, …
text = 0.038, mining = 0.026, Assoc. = 0.008, clustering = 0.01, …, data = 0.002, computing = 0.001, …

Goals: assign non-zero probability to unseen words; estimate a more accurate distribution from sparse data.

P(w|d) = (1 - λ) P_MLE(w|d) + λ P(w|collection)
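The interpolation can be sketched directly; with λ = 0.1 and assumed collection probabilities of 0.03 for "text" and 0.01 for "data", it reproduces the slide's smoothed values:

```python
def jelinek_mercer(tf, doc_len, p_collection, lam=0.1):
    """P(w|d) = (1 - lambda) * P_MLE(w|d) + lambda * P(w|collection)."""
    return (1 - lam) * (tf / doc_len) + lam * p_collection

p_text = jelinek_mercer(4, 100, 0.03)  # seen word: slightly discounted
p_data = jelinek_mercer(0, 100, 0.01)  # unseen word: small non-zero probability
print(p_text, p_data)
```

Under these assumptions p_text is about 0.039 and p_data is 0.001, matching the smoothed distribution shown above.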

Page 50: Contextual Text Mining


Apply Contextual Text Mining to Smoothing Language Models

• The text data: a collection of documents
• The generative model: P(word)
• The context: document
• The contextual model: P(w | d)
• The structure of context: the graph structure of documents
• Goal: use the graph of documents to estimate a good P(w | d)

Page 51: Contextual Text Mining


Traditional Document Smoothing in Information Retrieval

Estimate a reference language model θ_ref, then interpolate the MLE with it:

P(w|d) = α P_MLE(w|d) + β P(w|θ_ref)

Choices of reference:
• The collection [Ponte & Croft 98]
• Clusters [Liu & Croft 04]
• Nearest neighbors [Kurland & Lee 04]

Page 52: Contextual Text Mining


Graph-based Smoothing for Language Models in Retrieval (Mei et al. SIGIR 2008)

• A novel and general view of smoothing
• Collection = a graph of documents (can also be a word graph)
• P(w|d) = a surface on top of the graph: the MLE surface is spiky, the smoothed surface is smooth
• Smoothed LM = smoothed surface!

Page 53: Contextual Text Mining


The General Objective of Smoothing

O(C) = (1 - λ) Σ_{u∈V} w(u) (f_u - f̃_u)² + λ Σ_{(u,v)∈E} w(u,v) (f_u - f_v)²

• Σ_{u∈V} w(u) (f_u - f̃_u)²: fidelity to the MLE f̃_u
• Σ_{(u,v)∈E} w(u,v) (f_u - f_v)²: smoothness of the surface
• w(u): importance of vertices
• w(u,v): weights of edges (1/dist.)

Page 54: Contextual Text Mining


Smoothing Language Models using a Document Graph

• Construct a kNN graph of documents
• f_u = p(w|d_u); w(u) = Deg(u); w(u,v) = cosine similarity
• Additional Dirichlet smoothing on top

Document language model:
P(w|d_u) = (1 - λ) P_MLE(w|d_u) + λ Σ_{v∈V} ( w(u,v) / Deg(u) ) P(w|d_v)
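A single smoothing pass over a toy graph can be sketched as follows (edge weights and λ are illustrative; this one pass uses the neighbors' MLEs, where the full method iterates the update):

```python
# Toy kNN graph: w(u, v) are illustrative cosine similarities
graph = {
    "d1": {"d2": 0.8, "d3": 0.2},
    "d2": {"d1": 0.8},
    "d3": {"d1": 0.2},
}
p_mle = {"d1": 0.04, "d2": 0.03, "d3": 0.0}  # P_MLE(w | d) for a single word w
lam = 0.5

def smoothed(u):
    """P(w|d_u) = (1-lam)*P_MLE(w|d_u) + lam * sum_v w(u,v)/Deg(u) * P(w|d_v)."""
    deg = sum(graph[u].values())  # Deg(u) as total adjacent edge weight
    neighbors = sum(w * p_mle[v] for v, w in graph[u].items()) / deg
    return (1 - lam) * p_mle[u] + lam * neighbors

print(smoothed("d3"))  # d3 never contains w, but borrows probability from d1
```

Document d3 gives w zero probability under the MLE, yet ends up with a non-zero smoothed estimate because its neighbor on the graph contains the word.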

Page 55: Contextual Text Mining


Effectiveness of the Framework

Data Sets   Dirichlet   DMDG                DMWG†               DSDG                QMWG
AP88-90     0.217       0.254*** (+17.1%)   0.252*** (+16.1%)   0.239*** (+10.1%)   0.239 (+10.1%)
LA          0.247       0.258** (+4.5%)     0.257** (+4.5%)     0.251** (+1.6%)     0.247
SJMN        0.204       0.231*** (+13.2%)   0.229*** (+12.3%)   0.225*** (+10.3%)   0.219 (+7.4%)
TREC8       0.257       0.271*** (+5.4%)    0.271** (+5.4%)     0.261 (+1.6%)       0.260 (+1.2%)

† DMWG: reranking the top 3000 results, which usually yields reduced performance compared with ranking all documents.
Wilcoxon test: *, **, *** mean significance levels 0.1, 0.05, 0.01.

Graph-based smoothing >> baseline. Smoothing Doc LM >> relevance score >> Query LM.

Page 56: Contextual Text Mining


Intuitive Interpretation – Smoothing using Document Graph

P(w|d_u) = (1 - λ) P_ML(w|d_u) · 1 + (1 - λ) (1 - P_ML(w|d_u)) · 0 + λ Σ_{v∈V} ( w(u,v) / Deg(u) ) P(w|d_v)

Transition probabilities: P(u→1) = (1 - λ) P_ML(w|d_u); P(u→0) = (1 - λ)(1 - P_ML(w|d_u)); P(u→v) = λ w(u,v)/Deg(u).

Writing a word w in a document = a random walk on the document Markov chain; write down w upon reaching the "1" state. P(w|d_u) is the absorption probability into the "1" state: act as neighbors do.

Page 57: Contextual Text Mining


Application III: Social Network Analysis


Page 58: Contextual Text Mining


Topical Community Analysis

Topic modeling to help community extraction: label network communities with topics, e.g., "physicist, physics, scientist, theory, gravitation …" vs. "writer, novel, best-sell, book, language, film …"

Network analysis to help topic extraction: is Computer Science Literature = Information Retrieval + Data Mining + Machine Learning + …, or Domain Review + Algorithm + Evaluation + …?

Page 59: Contextual Text Mining


Apply Contextual Text Mining to Topical Community Analysis

• The text data: publications of researchers
• The generative model: topic model
• The context: author
• The contextual model: author-topic model
• The structure of context: a social network (the coauthor network of researchers)

Page 60: Contextual Text Mining


Intuitions

• People working on the same topic belong to the same "topical community"
• Good community: coherent topic + well connected
• A topic is semantically coherent if people working on this topic also collaborate a lot
• Example: an author whose coauthors all work on IR; is he more likely to be an IR person or a compiler person?
• Intuition: my topics are similar to my neighbors'

Page 61: Contextual Text Mining


Social Network Context for Topic Modeling

• Context = author
• Coauthors = similar contexts
• Intuition: I work on topics similar to my neighbors'
• Smooth the topic distributions P(θ_j | author) over the social network (e.g., the coauthor network)

Page 62: Contextual Text Mining

Topic Modeling with Network Regularization (NetPLSA)

• Basic assumption (e.g., on a co-author graph): related authors work on similar topics

O(G, C) = (1 - λ) Σ_d Σ_w c(w, d) log Σ_{j=1..k} p(θ_j | d) p(w | θ_j)
          - λ (1/2) Σ_{(u,v)∈E} w(u,v) Σ_{j=1..k} ( p(θ_j | u) - p(θ_j | v) )²

• First term: the PLSA log-likelihood; p(θ_j | d) is the topic distribution of a document
• Second term: a graph harmonic regularizer (a generalization of [Zhu '03]); w(u,v) is the importance (weight) of an edge; ( p(θ_j|u) - p(θ_j|v) )² is the difference of topic distributions on neighboring vertices
• λ: tradeoff between topic likelihood and smoothness
• The regularizer can be written as Σ_{j=1..k} f_j^T Δ f_j, where f_{j,u} = p(θ_j | u) and Δ is the graph Laplacian

Page 63: Contextual Text Mining


Topics & Communities without Regularization

Topic 1           Topic 2         Topic 3         Topic 4
term 0.02         peer 0.02       visual 0.02     interface 0.02
question 0.02     patterns 0.01   analog 0.02     towards 0.02
protein 0.01      mining 0.01     neurons 0.02    browsing 0.02
training 0.01     clusters 0.01   vlsi 0.01       xml 0.01
weighting 0.01    stream 0.01     motion 0.01     generation 0.01
multiple 0.01     frequent 0.01   chip 0.01       design 0.01
recognition 0.01  e 0.01          natural 0.01    engine 0.01
relations 0.01    page 0.01       cortex 0.01     service 0.01
library 0.01      gene 0.01       spike 0.01      social 0.01

? ? ? ?  Noisy community assignment

Page 64: Contextual Text Mining


Topics & Communities with Regularization

Topic 1           Topic 2           Topic 3           Topic 4
retrieval 0.13    mining 0.11       neural 0.06       web 0.05
information 0.05  data 0.06         learning 0.02     services 0.03
document 0.03     discovery 0.03    networks 0.02     semantic 0.03
query 0.03        databases 0.02    recognition 0.02  services 0.03
text 0.03         rules 0.02        analog 0.01       peer 0.02
search 0.03       association 0.02  vlsi 0.01         ontologies 0.02
evaluation 0.02   patterns 0.02     neurons 0.01      rdf 0.02
user 0.02         frequent 0.01     gaussian 0.01     management 0.01
relevance 0.02    streams 0.01      network 0.01      ontology 0.01

Information Retrieval | Data mining | Machine learning | Web

Coherent community assignment

Page 65: Contextual Text Mining


Topic Modeling and SNA Improve Each Other

Methods   Cut Edge Weights   Ratio Cut / Norm. Cut   Community Sizes (1-4)
PLSA      4831               2.14 / 1.25             2280, 2178, 2326, 2257
NetPLSA   662                0.29 / 0.13             2636, 1989, 3069, 1347
NCut      855                0.23 / 0.12             2699, 6323, 8, 11

(Cut edge weights and cut ratios: the smaller the better.)

- NCut: spectral clustering with normalized cut (J. Shi et al. 2000), a purely network-based community finding method.

Network regularization helps extract coherent communities (the network assures the focus of topics). Topic modeling helps balance communities (text implicitly bridges authors).

Page 66: Contextual Text Mining


Smoothed Topic Map

Map a topic on the network (e.g., using p(θ|a)).

[Figure: the topic "information retrieval" mapped on the coauthor network, PLSA vs. NetPLSA; labels: core contributors, intermediate, irrelevant]

Page 67: Contextual Text Mining


Summary of My Talk

• Text + Context = Contextual Text Mining: a new paradigm of text mining
• A novel framework for contextual text mining: probabilistic topic models, contextualized by simple context, implicit context, and complex context
• Applications of contextual text mining

Page 68: Contextual Text Mining


A Roadmap of My Work

[Figure: roadmap connecting Contextual Topic Models, Contextual Text Mining, and Information Retrieval & Web Search, spanning KDD 05, KDD 06a, KDD 06b, WWW 06, WWW 07, WWW 08, KDD 07, SIGIR 07, SIGIR 08, WSDM 08, CIKM 08, ACL 08]

Page 69: Contextual Text Mining


Research Discipline

[Figure: Text Information Management at the intersection of Text Mining, Information Retrieval, Data Mining, Natural Language Processing, Databases, Bioinformatics, Machine Learning, Applied Statistics, Social Networks, and Information Science]

Page 70: Contextual Text Mining


End Note


Page 71: Contextual Text Mining


Thank You
