Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities
Masumi Shirakawa1, Kotaro Nakayama2, Takahiro Hara1, Shojiro Nishio1
1Osaka University, Osaka, Japan; 2University of Tokyo, Tokyo, Japan


Page 1: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Masumi Shirakawa1, Kotaro Nakayama2, Takahiro Hara1, Shojiro Nishio1
1Osaka University, Osaka, Japan; 2University of Tokyo, Tokyo, Japan

Page 2: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Challenge in short text analysis
Statistics are not always enough.

A year and a half after Google pulled its popular search engine out of mainland China

Baidu and Microsoft did not disclose terms of the agreement

Page 3: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Challenge in short text analysis
Statistics are not always enough.

A year and a half after Google pulled its popular search engine out of mainland China

Baidu and Microsoft did not disclose terms of the agreement

They are talking about... search engines and China.

Page 4: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Challenge in short text analysis
Statistics are not always enough.

A year and a half after Google pulled its popular search engine out of mainland China

Baidu and Microsoft did not disclose terms of the agreement

How do machines know that the two sentences mention a similar topic?

They are talking about... search engines and China.

Page 5: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Reasonable solution
Use external knowledge.

A year and a half after Google pulled its popular search engine out of mainland China

Baidu and Microsoft did not disclose terms of the agreement

Wikipedia Thesaurus [Nakayama06]

Page 6: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Related work
ESA: Explicit Semantic Analysis [Gabrilovich07]

Add Wikipedia articles (entities) to a text as its semantic representation.

1. Get the search ranking of Wikipedia for each term (i.e., Wiki articles and scores).
2. Simply sum up the scores for aggregation.

[Figure: pipeline from input text T ("Apple sells a new product") through key term extraction (Apple, product, sells, new) and related entity finding (term t to entity c) to aggregation, producing the output: a ranked list of related entities c (e.g., Apple Inc., iPhone, iPad, pear, pricing, business).]
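To make the two-step aggregation above concrete, here is a minimal Python sketch; the `search_wikipedia` function and the toy index are placeholders assumed for illustration, not the actual ESA implementation.

```python
from collections import defaultdict

def esa_style_aggregation(terms, search_wikipedia):
    """ESA-style aggregation: sum per-term article scores (simplified sketch).

    `search_wikipedia(term)` is assumed to return a list of
    (article, score) pairs, e.g. from a Wikipedia full-text index.
    """
    aggregated = defaultdict(float)
    for term in terms:
        for article, score in search_wikipedia(term):  # step 1: per-term ranking
            aggregated[article] += score               # step 2: simply sum the scores
    # Return articles ranked by their summed scores.
    return sorted(aggregated.items(), key=lambda kv: kv[1], reverse=True)

# Toy stand-in for the Wikipedia search ranking.
toy_index = {
    "Apple":   [("Apple Inc.", 0.8), ("Apple (fruit)", 0.6), ("Pear", 0.2)],
    "product": [("Product (business)", 0.7), ("Apple Inc.", 0.3)],
    "new":     [("New", 0.5)],
}
print(esa_style_aggregation(["Apple", "product", "new"], lambda t: toy_index.get(t, [])))
```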

Page 7: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Problems in real-world noisy short texts
“Noisy” means semantically noisy in this work. (We do not handle informal or casual surface forms, or misspellings.)

Term ambiguity
• Apple (the fruit) should not be related to Microsoft.

Fluctuation of term dominance
• A term is not always important in texts.

We explore a more effective aggregation method.

Page 8: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

We propose Extended naïve Bayes, a probabilistic method, to aggregate related entities.

$$P(c \mid T) = \frac{\prod_{k=1}^{K} \bigl( P(t_k \in T)\, P(c \mid t_k) + (1 - P(t_k \in T))\, P(c) \bigr)}{P(c)^{K-1}}$$

[Figure: from text T ("Apple sells a new product") to related entities c. Key term extraction estimates the term dominance P(t ∈ T); related entity finding obtains P(c | t) via P(e | t) and P(c | e) ([Milne08], [Mihalcea07], [Song11], [Nakayama06]); aggregation via ∏_k P(c | t_k) / P(c)^(K-1) outputs a ranked list of related entities c (e.g., Apple Inc., iPhone, iPad).]
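The deck does not spell out how P(c | t) itself is obtained; one natural reading of the diagram's labels is a marginalization over candidate Wikipedia senses e of the term, combining disambiguation P(e | t) with entity relatedness P(c | e). The sketch below is only that reading, with hypothetical data structures, and should not be taken as the authors' exact procedure.

```python
def related_entity_prob(term, c, sense_prob, relatedness):
    """Hypothetical P(c | t): marginalize over candidate Wikipedia senses e of the term.

    sense_prob[term] -> {e: P(e | t)}  e.g. from anchor-text statistics (assumption)
    relatedness[e]   -> {c: P(c | e)}  e.g. from a Wikipedia link-based thesaurus (assumption)
    """
    return sum(p_e * relatedness.get(e, {}).get(c, 0.0)
               for e, p_e in sense_prob.get(term, {}).items())

# Toy example: "Apple" is ambiguous between the company and the fruit.
sense_prob = {"Apple": {"Apple Inc.": 0.7, "Apple (fruit)": 0.3}}
relatedness = {"Apple Inc.": {"iPhone": 0.6}, "Apple (fruit)": {"Pear": 0.5}}
print(related_entity_prob("Apple", "iPhone", sense_prob, relatedness))  # 0.42
```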

Page 9: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

When input is multiple terms
Apply naïve Bayes [Song11] to multiple terms to obtain a related entity c using each probability P(c | t_k).

$$P(c \mid t_1, \ldots, t_K) = \frac{P(t_1, \ldots, t_K \mid c)\, P(c)}{P(t_1, \ldots, t_K)} = \frac{P(c) \prod_k P(t_k \mid c)}{P(t_1, \ldots, t_K)} = \frac{\prod_k P(c \mid t_k)}{P(c)^{K-1}}$$

[Figure: each term t_1, ..., t_K (Apple, product, new) contributes its probability P(c | t_k) to a candidate entity c ("iPhone"); the score is computed for each related entity c.]

Page 10: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

(Same diagram and formula as Page 9.)

By using naïve Bayes, entities that are related to multiple terms can be boosted.
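To see the boosting effect numerically, here is a small sketch of the naïve Bayes combination ∏_k P(c | t_k) / P(c)^(K-1) with made-up probabilities (the numbers are illustrative only):

```python
def naive_bayes_score(p_c_given_t, p_c):
    """Combine per-term probabilities P(c | t_k) via naive Bayes (unnormalized)."""
    score = 1.0
    for p in p_c_given_t:
        score *= p
    return score / (p_c ** (len(p_c_given_t) - 1))

p_c = 0.01  # prior P(c) for the entity "iPhone" (illustrative)
print(naive_bayes_score([0.05], p_c))              # related to one term only: 0.05
print(naive_bayes_score([0.05, 0.04, 0.03], p_c))  # related to three terms: 0.6
```

An entity supported by three terms ends up with a score of 0.6 versus 0.05 for an entity supported by a single term, which is exactly the boosting described above.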

Page 11: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

When input is text
Not “multiple terms” but “text,” i.e., we don’t know which terms are key terms. We developed extended naïve Bayes to solve this problem.

[Figure: same diagram as Page 9, but we cannot observe which of the terms t_1, ..., t_K (Apple, product, ..., new) are the key terms for the candidate entity c ("iPhone").]

Page 12: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Extended naïve Bayes

[Figure: candidate key-term sets (states) T' = {t_1}, T' = {t_1, t_2}, ..., T' = {t_1, ..., t_K}, e.g., {Apple}, {Apple, product}, {Apple, product, new}. Naïve Bayes is applied to each state T', weighted by the probability P(T = T') that the set of key terms T is the state T'.]

Page 13: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Extended naïve Bayes (same diagram as Page 12). Summing over all candidate key-term states T', each weighted by the probability P(T = T') that the set of key terms T is the state T':

$$P(c \mid T) = \sum_{T'} P(c \mid T')\, P(T = T') = \frac{\prod_k \bigl( P(t_k \in T)\, P(c \mid t_k) + (1 - P(t_k \in T))\, P(c) \bigr)}{P(c)^{K-1}}$$

Page 14: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

(Same content as Page 13.)

Term dominance is incorporated into naïve Bayes
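Translating the closed-form formula above into code, the sketch below scores each candidate entity c from per-term probabilities P(c | t_k), term-dominance probabilities P(t_k ∈ T), and the prior P(c). The dictionaries and numbers are illustrative assumptions, not the authors' data or parameter values.

```python
def extended_naive_bayes(term_dominance, p_c_given_t, prior):
    """Score P(c | T) = prod_k (P(t_k in T) P(c | t_k) + (1 - P(t_k in T)) P(c)) / P(c)^(K-1).

    term_dominance : {term: P(term in T)}
    p_c_given_t    : {term: {entity: P(entity | term)}}
    prior          : {entity: P(entity)}
    """
    terms = list(term_dominance)
    entities = {c for probs in p_c_given_t.values() for c in probs}
    scores = {}
    for c in entities:
        p_c = prior[c]
        score = 1.0
        for t in terms:
            dom = term_dominance[t]
            # A term contributes P(c | t) when it is a key term, the prior P(c) otherwise.
            score *= dom * p_c_given_t.get(t, {}).get(c, 0.0) + (1.0 - dom) * p_c
        scores[c] = score / (p_c ** (len(terms) - 1))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy example for "Apple sells a new product" (all numbers illustrative).
term_dominance = {"Apple": 0.9, "product": 0.6, "sells": 0.2, "new": 0.1}
p_c_given_t = {
    "Apple":   {"Apple Inc.": 0.05, "Apple (fruit)": 0.03},
    "product": {"Apple Inc.": 0.02, "Product (business)": 0.04},
}
prior = {"Apple Inc.": 0.01, "Apple (fruit)": 0.01, "Product (business)": 0.01}
print(extended_naive_bayes(term_dominance, p_c_given_t, prior))
```

Because the product form is the closed form of the sum over all 2^K candidate key-term states T', no explicit enumeration of subsets is needed.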

Page 15: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Experiments on short text similarity datasets

[Datasets] Four datasets derived from word similarity datasets using a dictionary
[Comparative methods] Original ESA [Gabrilovich07], ESA with 16 parameter settings
[Metrics] Spearman’s rank correlation coefficient

ESA with a well-adjusted parameter is superior to our method for “clean” texts.
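For reference, Spearman's rank correlation between predicted similarities and gold-standard ratings can be computed in one call; the values below are placeholders, not the paper's data.

```python
from scipy.stats import spearmanr

# Predicted short-text similarities vs. gold-standard scores (placeholder values).
predicted = [0.8, 0.1, 0.55, 0.3]
gold = [9.0, 1.5, 6.0, 4.2]
rho, p_value = spearmanr(predicted, gold)
print(rho, p_value)
```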

Page 16: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Tweet clustering
K-means clustering using the vector of related entities for measuring distance

[Dataset] 12,385 tweets including 13 topics: #MacBook (1,251), #Silverlight (221), #VMWare (890), #MySQL (1,241), #Ubuntu (988), #Chrome (1,018), #NFL (1,044), #NHL (1,045), #NBA (1,085), #MLB (752), #MLS (981), #UFC (991), #NASCAR (878)
[Comparative methods] Bag-of-words (BOW), ESA with the same parameter, ESA with well-adjusted parameter
[Metric] Average of Normalized Mutual Information (NMI), 20 runs
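A minimal sketch of this clustering setup, assuming each tweet has already been mapped to a vector of related-entity scores; the random matrix and labels are placeholders, and scikit-learn is just one possible toolkit (not necessarily what the authors used).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

rng = np.random.default_rng(0)
X = rng.random((12385, 100))                    # placeholder: tweet -> related-entity score vector
true_topics = rng.integers(0, 13, size=12385)   # placeholder hashtag labels (13 topics)

pred = KMeans(n_clusters=13, n_init=10, random_state=0).fit_predict(X)
print(normalized_mutual_info_score(true_topics, pred))
```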

Page 17: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Results

[Figure: NMI score vs. number of related entities (10 to 2000). Reported NMI scores: BOW 0.429, ESA-same 0.421, ESA-adjusted 0.524, our method 0.567 (p-value < 0.01).]

Page 18: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Results

(Same figure as Page 17.)

Our method outperformed ESA with a well-adjusted parameter for noisy short texts.

Page 19: Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Conclusion
We proposed extended naïve Bayes to derive related Wikipedia entities given a real-world noisy short text.

[Future work]
Tackle multilingual short texts
Develop applications of the method