
Probabilistic Semantic Similarity Measurements for Noisy Short Texts Using Wikipedia Entities

Masumi Shirakawa¹, Kotaro Nakayama², Takahiro Hara¹, Shojiro Nishio¹
¹Osaka University, Osaka, Japan  ²University of Tokyo, Tokyo, Japan

Challenge in short text analysis

Statistics are not always enough.

"A year and a half after Google pulled its popular search engine out of mainland China"

"Baidu and Microsoft did not disclose terms of the agreement"

They are talking about search engines and China. How do machines know that the two sentences mention a similar topic?

Reasonable solution

Use external knowledge, e.g., the Wikipedia Thesaurus [Nakayama06].

Related work

ESA: Explicit Semantic Analysis [Gabrilovich07]
Adds Wikipedia articles (entities) to a text as its semantic representation.

1. Get the Wikipedia search ranking for each term (i.e., Wiki articles and scores).
2. Simply sum up the scores for aggregation.

[Figure: pipeline — input text T ("Apple sells a new product") → key term extraction (Apple, product, sells, new) → related entity finding (Apple Inc., iPhone, iPad, pear, pricing, business) → aggregation → output: ranked list of entities c.]
Problems in real-world noisy short texts
"Noisy" means semantically noisy in this work. (We do not handle informal or casual surface forms, or misspellings.)

Term ambiguity
• Apple (fruit) should not be related to Microsoft.

Fluctuation of term dominance
• A term is not always important in texts.

We explore a more effective aggregation method.

Probabilistic method

We propose extended naïve Bayes to aggregate related entities.

From text T to related entities c:

$$P(c \mid T) = \frac{\prod_{k=1}^{K} \bigl( P(t_k \in T)\, P(c \mid t_k) + (1 - P(t_k \in T))\, P(c) \bigr)}{P(c)^{K-1}}$$

Component probabilities: P(t ∈ T) for key term extraction [Mihalcea07][Song11], P(e|t) for disambiguation [Milne08], and P(c|e) for entity relatedness [Nakayama06]; P(c|t) combines P(e|t) and P(c|e).

[Figure: same pipeline as before — input T ("Apple sells a new product") → key term extraction → related entity finding → aggregation → output: ranked list of entities c.]
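The factorization P(t ∈ T) P(e|t) P(c|e) suggests that P(c|t) is obtained by marginalizing over the entity senses e of a term. A minimal sketch of that reading; the table names and all values are my own illustration, not from the paper:

```python
# Toy probability tables (illustrative values only).
disambiguation = {  # P(e|t): which Wikipedia entity a term t refers to
    "Apple": {"Apple Inc.": 0.7, "Apple (fruit)": 0.3},
}
relatedness = {  # P(c|e): relatedness between Wikipedia entities
    "Apple Inc.":    {"iPhone": 0.6, "Microsoft": 0.4},
    "Apple (fruit)": {"Pear": 0.8, "Orchard": 0.2},
}

def p_c_given_t(t, c):
    """P(c|t) = sum_e P(e|t) P(c|e): marginalize over the senses e of term t."""
    return sum(p_e * relatedness[e].get(c, 0.0)
               for e, p_e in disambiguation[t].items())
```

Because both senses of "Apple" contribute in proportion to P(e|t), a fruit-only reading cannot dominate unless the disambiguation probability supports it.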

When input is multiple terms

Apply naïve Bayes [Song11] to multiple terms to obtain related entities c using each probability P(c|t_k):

$$P(c \mid t_1, \ldots, t_K) = \frac{P(t_1, \ldots, t_K \mid c)\, P(c)}{P(t_1, \ldots, t_K)} = \frac{P(c) \prod_k P(t_k \mid c)}{P(t_1, \ldots, t_K)} = \frac{\prod_k P(c \mid t_k)}{P(c)^{K-1}}$$

Compute this for each related entity c.

[Figure: terms t_1 = "Apple", t_2 = "product", ..., t_K = "new" each contribute P(c|t_k) to a candidate entity c such as "iPhone".]

By using naïve Bayes, entities that are related to multiple terms can be boosted.
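The boosting effect of the naïve Bayes combination can be seen in a small sketch. All probability values here are toy numbers of my own, not from the paper:

```python
# Toy probability tables (illustrative values only).
P_C_GIVEN_T = {  # P(c|t_k): related entity c given term t_k
    "Apple":   {"iPhone": 0.20, "Apple Inc.": 0.50, "Apple (fruit)": 0.30},
    "product": {"iPhone": 0.40, "Apple Inc.": 0.30, "Pricing": 0.30},
    "new":     {"iPhone": 0.50, "Apple Inc.": 0.20, "Pricing": 0.30},
}
P_C = {"iPhone": 0.2, "Apple Inc.": 0.3, "Apple (fruit)": 0.2, "Pricing": 0.3}

def naive_bayes_score(c, terms):
    """P(c|t_1,...,t_K) = prod_k P(c|t_k) / P(c)^(K-1)."""
    prod = 1.0
    for t in terms:
        prod *= P_C_GIVEN_T[t].get(c, 1e-6)  # small floor for unseen pairs
    return prod / P_C[c] ** (len(terms) - 1)
```

With these numbers, "iPhone" is never the top entity for any single term, yet it wins once all three terms vote, because the product rewards entities related to every term.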

When input is text

The input is not "multiple terms" but "text," i.e., we do not know which terms are key terms. We developed extended naïve Bayes to solve this problem.

[Figure: same diagram, but which of t_1 = "Apple", t_2 = "product", ..., t_K = "new" are key terms cannot be observed.]

Extended naïve Bayes

Each subset T' of {t_1, ..., t_K} (e.g., {t_1}, {t_1, t_2}, ..., {t_1, ..., t_K}) is a candidate set of key terms. Apply naïve Bayes to each state T', and weight it by the probability that the set of key terms T is the state T':

$$\sum_{T'} P(c \mid T')\, P(T = T') = \frac{\prod_{k} \bigl( P(t_k \in T)\, P(c \mid t_k) + (1 - P(t_k \in T))\, P(c) \bigr)}{P(c)^{K-1}}$$

Term dominance is incorporated into naïve Bayes.
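When each term's key-term indicator is independent, the sum over all 2^K states T' collapses into the closed-form product above. A sketch that checks the two sides against each other, for a single candidate entity c and toy probabilities of my own:

```python
from itertools import combinations

P_IN_T = {"Apple": 0.9, "product": 0.6, "new": 0.3}          # P(t_k in T)
P_C_GIVEN_T = {"Apple": 0.20, "product": 0.40, "new": 0.50}  # P(c|t_k) for one entity c
P_C = 0.2                                                    # prior P(c)

def extended_nb(terms):
    """Closed form: prod_k (P(t_k in T) P(c|t_k) + (1 - P(t_k in T)) P(c)) / P(c)^(K-1)."""
    prod = 1.0
    for t in terms:
        prod *= P_IN_T[t] * P_C_GIVEN_T[t] + (1 - P_IN_T[t]) * P_C
    return prod / P_C ** (len(terms) - 1)

def brute_force(terms):
    """Sum over all subsets T': P(c|T') P(T = T'). Should equal extended_nb."""
    total = 0.0
    for r in range(len(terms) + 1):
        for subset in combinations(terms, r):
            # P(c|T') by plain naive Bayes over the subset (P(c|{}) = P(c)).
            p_c_subset = P_C
            for t in subset:
                p_c_subset *= P_C_GIVEN_T[t] / P_C
            # P(T = T'): independent inclusion/exclusion of each term.
            weight = 1.0
            for t in terms:
                weight *= P_IN_T[t] if t in subset else 1 - P_IN_T[t]
            total += p_c_subset * weight
    return total
```

The closed form costs O(K) per entity, while the explicit sum costs O(2^K), which is what makes the method practical.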

Experiments on short text similarity datasets

[Datasets] Four datasets derived from word similarity datasets using a dictionary
[Comparative methods] Original ESA [Gabrilovich07]; ESA with 16 parameter settings
[Metric] Spearman's rank correlation coefficient

ESA with a well-adjusted parameter is superior to our method for "clean" texts.
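Spearman's rank correlation, used as the metric above, is the Pearson correlation of the ranks. A self-contained sketch for the no-tie case (in practice, tied values get average ranks):

```python
def spearman(xs, ys):
    """Spearman's rank correlation (no-tie case): Pearson correlation of ranks."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```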

Tweet clustering

K-means clustering using the vector of related entities for measuring distance.

[Dataset] 12,385 tweets covering 13 topics: #MacBook (1,251), #Silverlight (221), #VMWare (890), #MySQL (1,241), #Ubuntu (988), #Chrome (1,018), #NFL (1,044), #NHL (1,045), #NBA (1,085), #MLB (752), #MLS (981), #UFC (991), #NASCAR (878)
[Comparative methods] Bag-of-words (BOW), ESA with the same parameter, ESA with a well-adjusted parameter
[Metric] Normalized Mutual Information (NMI), averaged over 20 runs
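NMI compares the cluster assignment against the topic (hashtag) labels. A self-contained sketch using the geometric-mean normalization I(U;V) / sqrt(H(U) H(V)); other normalizations (e.g., arithmetic mean) exist, and the slide does not say which was used:

```python
from collections import Counter
from math import log

def nmi(labels_true, labels_pred):
    """Normalized Mutual Information: I(U;V) / sqrt(H(U) H(V))."""
    n = len(labels_true)
    count_t = Counter(labels_true)
    count_p = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))
    # Mutual information I(U;V) from empirical distributions.
    mi = 0.0
    for (t, p), n_tp in count_joint.items():
        mi += (n_tp / n) * log(n * n_tp / (count_t[t] * count_p[p]))
    # Entropies H(U) and H(V).
    h_t = -sum((c / n) * log(c / n) for c in count_t.values())
    h_p = -sum((c / n) * log(c / n) for c in count_p.values())
    return mi / (h_t * h_p) ** 0.5 if h_t > 0 and h_p > 0 else 1.0
```

NMI is 1 for a clustering that matches the topics up to relabeling and 0 for one that is independent of them.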

Results

[Figure: NMI score vs. number of related entities (10 to 2000). Best scores: BOW 0.429, ESA-same 0.421, ESA-adjusted 0.524, our method 0.567; p-value < 0.01.]

Our method outperformed ESA with a well-adjusted parameter for noisy short texts.

Conclusion

We proposed extended naïve Bayes to derive related Wikipedia entities given a real-world noisy short text.

[Future work]
• Tackle multilingual short texts
• Develop applications of the method