Extracting Key Terms From Noisy and Multi-theme Documents. Maria Grineva, Maxim Grinev and Dmitry Lizorkin. Proceedings of the 18th International World Wide Web Conference, ACM WWW, 2009. Speaker: Chien-Liang Wu


Extracting Key Terms From Noisy and Multi-theme Documents

Maria Grineva, Maxim Grinev and Dmitry Lizorkin

Proceedings of the 18th International World Wide Web Conference, ACM WWW, 2009

Speaker: Chien-Liang Wu


Outline
  Motivation
  Framework of Key Terms Extraction
    Candidate Terms Extraction
    Word Sense Disambiguation
    Building Semantic Graph
    Discovering Community Structure of the Semantic Graph
    Selecting Valuable Communities
  Experiments
  Conclusions


Motivation
  Key Terms Extraction
    Basic step for various NLP tasks
      Document classification
      Document clustering
      Text summarization
  Challenges
    Web pages are typically noisy
      Side bars/menus, comments, …
    Dealing with multi-theme Web pages
      Portal home pages


Motivation (cont.)
  State-of-the-art Approaches to Key Terms Extraction
    Based on statistical learning
      TFxIDF model, keyphrase-frequency, …
    Require training set
  Approach in this paper
    Based on analyzing syntactic or semantic term relatedness within a document
    Compute semantic relatedness between terms using Wikipedia
    Model document as a semantic graph of terms and apply graph analysis techniques to it
    No training set required


Framework


Candidate Terms Extraction
  Goal:
    Extract all terms from the document
    For each term prepare a set of Wikipedia articles that can describe its meaning
  Parse the input document and extract all possible n-grams
  For each n-gram (+ its morphological variations) provide a set of Wikipedia article titles
    "drinks", "drinking", "drink" => [Wikipedia:] Drink; Drinking
  Avoid nonsense phrases appearing in the results
    "using", "electric cars are", …
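
The candidate extraction step above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: TITLE_INDEX is a hypothetical toy stand-in for a real Wikipedia title index, and normalize() is a deliberately crude suffix stripper standing in for proper morphological analysis.

```python
import re

# Hypothetical toy index (invented for illustration): maps a normalized
# surface form to the Wikipedia article titles it may refer to.
TITLE_INDEX = {
    "drink": ["Drink", "Drinking"],
    "electric car": ["Electric car"],
}

def normalize(ngram):
    """Crude morphological normalization: lowercase, strip 'ing' or 's' per word."""
    words = []
    for w in ngram.lower().split():
        for suffix in ("ing", "s"):
            if w.endswith(suffix) and len(w) - len(suffix) >= 3:
                w = w[: -len(suffix)]
                break
        words.append(w)
    return " ".join(words)

def candidate_terms(text, max_n=3):
    """Return {n-gram: candidate titles} for every n-gram found in the index."""
    tokens = re.findall(r"[A-Za-z0-9']+", text)
    found = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            titles = TITLE_INDEX.get(normalize(ngram))
            if titles:  # n-grams with no matching title are dropped
                found[ngram] = titles
    return found
```

Because only n-grams that match an article title survive, phrases such as "using" or "electric cars are" fall out without any extra filtering rule.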


Word Sense Disambiguation
  Goal:
    Choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step
  Reference:
    “Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation”, SYRCoDIS, 2008


Word Sense Disambiguation (contd.)
  Example text:
    Jigsaw is W3C's open-source project that started in May 1996. It is a web server platform that provides a sample HTTP 1.1 implementation and …
  Ambiguous term: “platform”
  Four Wikipedia concepts around this word: “open-source”, “web server”, “HTTP”, and “implementation”


Word Sense Disambiguation (contd.)
  A neighbor of an article
    Any Wikipedia article that has an incoming or an outgoing link to the original article
  Each term is assigned a single Wikipedia article that describes its meaning
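
As a rough sketch of how neighbors could drive the choice of a sense: the slides do not state the exact scoring rule, so the code below assumes a placeholder rule (pick the candidate sharing the most neighbors with the unambiguous context articles). The LINKS data is a hypothetical toy link graph, not real Wikipedia data.

```python
from collections import defaultdict

def neighbor_sets(links):
    """links: iterable of (source, target) article pairs.
    A neighbor is any article linked to or from the article, so direction is ignored."""
    nbrs = defaultdict(set)
    for src, dst in links:
        nbrs[src].add(dst)
        nbrs[dst].add(src)
    return nbrs

def disambiguate(candidates, context, nbrs):
    """Assumed rule: choose the candidate with the most neighbors shared with the context."""
    def score(article):
        return sum(len(nbrs[article] & nbrs[c]) for c in context)
    return max(candidates, key=score)

# Hypothetical toy link graph, invented for illustration only.
LINKS = [
    ("Computing platform", "Operating system"),
    ("Web server", "Operating system"),
    ("Web server", "HTTP"),
    ("HTTP", "Computing platform"),
    ("Railway platform", "Train station"),
    ("Oil platform", "Petroleum"),
]
nbrs = neighbor_sets(LINKS)
sense = disambiguate(
    ["Railway platform", "Oil platform", "Computing platform"],
    ["Web server", "HTTP"],
    nbrs,
)
```

With this toy data, "platform" in the Jigsaw example resolves to the computing sense because it is the only candidate that shares neighbors with "Web server".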


Building Semantic Graph
  Goal:
    Build document semantic graph using semantic relatedness between terms
  Semantic graph is a weighted graph
    Vertex: term
    Edge: two vertices are semantically related
    Weight of edge: semantic relatedness measure of the two terms
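
One simple way to hold such a graph is a dictionary from term pairs to weights. The sketch below is only an illustration: the relatedness function is passed in as a parameter, the threshold argument is an assumption (the slides do not mention one), and TOY_SCORES contains invented values.

```python
from itertools import combinations

def build_semantic_graph(terms, relatedness, threshold=0.0):
    """Return {(term_a, term_b): weight} for each pair whose relatedness exceeds threshold.
    Pairs are stored with the terms in sorted order, so each edge appears once."""
    graph = {}
    for a, b in combinations(sorted(terms), 2):
        weight = relatedness(a, b)
        if weight > threshold:
            graph[(a, b)] = weight
    return graph

# Hypothetical relatedness scores, invented for illustration.
TOY_SCORES = {
    frozenset(("Apple Inc.", "ITunes")): 0.6,
    frozenset(("Apple Inc.", "Braille")): 0.1,
}
graph = build_semantic_graph(
    ["ITunes", "Apple Inc.", "Braille"],
    lambda a, b: TOY_SCORES.get(frozenset((a, b)), 0.0),
)
```

Pairs with zero relatedness get no edge, so unrelated terms end up in separate parts of the graph.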


Building Semantic Graph (contd.)
  Using Dice-measure for Wikipedia-based semantic relatedness (reference: SYRCoDIS, 2008)

    $$\mathrm{Dice}(A, B) = \frac{2\,|n(A) \cap n(B)|}{|n(A)| + |n(B)|}$$

    where n(A) is the set of neighbors of article A
  Weights for various link types
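
A minimal sketch of the unweighted Dice measure over neighbor sets follows. It leaves out the per-link-type weights mentioned on the slide, since their values are not given here, and returning 0.0 for two empty neighbor sets is an assumption.

```python
def dice(neighbors_a, neighbors_b):
    """Dice coefficient of two neighbor sets: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a and not b:
        return 0.0  # assumption: articles with no neighbors are treated as unrelated
    return 2 * len(a & b) / (len(a) + len(b))
```

The value ranges from 0 (no shared neighbors) to 1 (identical neighbor sets).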


Detecting Community Structure of the Semantic Graph using Newman Algorithm


A news article: "Apple to Make ITunes More Accessible For the Blind"
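
The slide names only the "Newman Algorithm", so the variant is an assumption here. The sketch below uses greedy modularity agglomeration in the spirit of Newman's fast algorithm: start with one community per term, repeatedly merge the connected pair that gives the largest change in modularity Q, and keep the partition with the highest Q. The function name and the edge-dictionary input format are illustrative choices.

```python
def greedy_modularity_communities(edges):
    """edges: {(u, v): weight} for an undirected graph without self-loops.
    Returns (communities, modularity) for the best partition seen while merging."""
    total = sum(edges.values())
    nodes = sorted({n for pair in edges for n in pair})
    members = {i: {n} for i, n in enumerate(nodes)}
    index = {n: i for i, n in enumerate(nodes)}
    # e[i][j]: share of total weight between communities i and j (half per direction);
    # a[i]: share of all edge ends attached to community i.
    e = {i: {} for i in members}
    a = dict.fromkeys(members, 0.0)
    for (u, v), w in edges.items():
        i, j, half = index[u], index[v], w / (2 * total)
        e[i][j] = e[i].get(j, 0.0) + half
        e[j][i] = e[j].get(i, 0.0) + half
        a[i] += half
        a[j] += half
    q = -sum(x * x for x in a.values())  # singletons have no inner edges yet
    best_q, best = q, [set(m) for m in members.values()]
    while True:
        # Change in Q for merging each connected pair: 2 * (e_ij - a_i * a_j).
        merges = [(2 * (e[i][j] - a[i] * a[j]), i, j) for i in e for j in e[i] if i < j]
        if not merges:
            break
        dq, i, j = max(merges)
        for k, w in e.pop(j).items():  # fold community j into community i
            del e[k][j]
            if k != i:
                e[i][k] = e[i].get(k, 0.0) + w
                e[k][i] = e[i][k]
        a[i] += a.pop(j)
        members[i] |= members.pop(j)
        q += dq
        if q > best_q:
            best_q, best = q, [set(m) for m in members.values()]
    return best, best_q
```

On a graph made of two tightly connected groups joined by one weak edge, the best partition separates the two groups, which is the behavior the method relies on for multi-theme pages.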


Selecting Valuable Communities
  Goal: rank term communities in a way that:
    the higher ranked communities contain key terms
    the lower ranked communities contain unimportant terms and possible disambiguation mistakes
  Use
    Density of community Ci:

    $$\mathrm{density}(C_i) = \frac{\mathrm{sum}(\text{weight of inner edges in } C_i)}{\#\text{ of vertices in } C_i}$$
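
A small sketch of the density computation, assuming the semantic graph is stored as a dictionary from term pairs to edge weights (an illustrative representation, not one prescribed by the slides):

```python
def density(community, graph):
    """Sum of the weights of edges with both ends inside the community,
    divided by the number of vertices in the community."""
    inner = sum(w for (a, b), w in graph.items() if a in community and b in community)
    return inner / len(community)
```

Edges that cross the community boundary are ignored, so a community only scores well when its own terms are strongly related to each other.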


Selecting Valuable Communities (contd.)
  Informativeness of community Ci:

    $$\mathrm{informativeness}(C_i) = \frac{\mathrm{sum}(\text{keyphraseness of terms in } C_i)}{\#\text{ of terms in } C_i}$$

    $$\mathrm{keyphraseness}(\mathit{term}) = \frac{\mathrm{count}(D_{Link})}{\mathrm{count}(D_{term})}$$

    where:
      count(D_Link) is the number of Wikipedia articles in which this term appears as a link
      count(D_term) is the total number of articles in which it appears
  Informativeness gives higher values to named entities (for example, Apple Inc., Steve Jobs, Braille) than to general terms (Consumer, Agreement, Information)
  Community rank: density * informativeness
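
The ranking step can be sketched as below, taking per-community density values as given. The count pairs used in any example run are hypothetical; real values would come from Wikipedia link statistics. Returning 0.0 when a term appears in no article is an assumption.

```python
def keyphraseness(link_count, term_count):
    """count(D_Link) / count(D_term); 0.0 if the term appears in no article (assumption)."""
    return link_count / term_count if term_count else 0.0

def informativeness(community, counts):
    """counts: {term: (count_D_link, count_D_term)}. Mean keyphraseness over the community."""
    return sum(keyphraseness(*counts[t]) for t in community) / len(community)

def rank_communities(communities, densities, counts):
    """Return (rank, community) pairs sorted by density * informativeness, highest first."""
    scored = [
        (d * informativeness(c, counts), c)
        for c, d in zip(communities, densities)
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

A community of named entities, whose terms are almost always links when they occur, outranks a community of general terms even when the two have similar density.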


Selecting Valuable Communities (contd.)
  The decline in community rank marks the border between important and non-important term communities
  For the test collection, the decline coincides with the maximum F-measure in 73% of cases
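
The slides do not say how the decline is located. One plausible reading, used here purely as an assumption, is the largest drop between consecutive values in the descending list of community ranks:

```python
def cut_at_decline(ranks):
    """ranks: community rank values sorted in descending order.
    Returns how many top communities to keep, cutting at the largest drop
    (assumed definition of the decline)."""
    if len(ranks) < 2:
        return len(ranks)
    drops = [ranks[i] - ranks[i + 1] for i in range(len(ranks) - 1)]
    return drops.index(max(drops)) + 1
```

Terms from the communities above the cut would be reported as key terms; the rest would be discarded as noise or likely disambiguation mistakes.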


Experiment
  Noise-free dataset
    252 posts from 5 technical blogs
  22 annotators took part in this experiment
    Each document was analyzed by 5 different annotators
    A key term was valid if at least two participants identified it
    For each document, two sets of key terms were built
      Uncontrolled key terms: each annotator identified 5 to 10 key terms
      Controlled key terms: match a Wikipedia article title
  In total, 2009 key terms were collected; 93% of them are Wikipedia article titles
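
The agreement rule (a key term is valid if at least two participants identified it) can be expressed in a few lines. The function name and input shape are illustrative assumptions:

```python
from collections import Counter

def agreed_key_terms(annotations, min_votes=2):
    """annotations: one collection of key terms per annotator, for a single document.
    Returns the terms chosen by at least min_votes annotators."""
    # set(terms) ensures an annotator who repeats a term still counts once.
    votes = Counter(term for terms in annotations for term in set(terms))
    return {term for term, n in votes.items() if n >= min_votes}
```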


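Evaluation

    $$\mathrm{precision} = \frac{|\{\text{manually\_selected}\} \cap \{\text{machine\_selected}\}|}{|\{\text{machine\_selected}\}|}$$

    $$\mathrm{recall} = \frac{|\{\text{manually\_selected}\} \cap \{\text{machine\_selected}\}|}{|\{\text{manually\_selected}\}|}$$

    $$\text{F-measure} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$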
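
These three measures translate directly into set operations. In the sketch below, returning 0.0 for empty sets or zero overlap is an assumption made to avoid division by zero:

```python
def evaluate(manually_selected, machine_selected):
    """Return (precision, recall, f_measure) for one document's key terms."""
    manual, machine = set(manually_selected), set(machine_selected)
    hits = len(manual & machine)
    precision = hits / len(machine) if machine else 0.0
    recall = hits / len(manual) if manual else 0.0
    # With no overlap both precision and recall are 0, so F is defined as 0 here.
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure
```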


Results
  Revision of precision and recall
    Communities-based method extracts more related terms in each thematic group than a human => better term coverage
    Each participant reviewed these automatically extracted key terms and, if possible, extended their manually identified key terms
      389 additional manually selected key terms
      Precision ↑ 46.1%, recall ↑ 67.7%


Evaluation on Web Pages
  509 real-world web pages:
    Manually select key terms from web pages in the same manner
  Noise stability


Evaluation on Web Pages (contd.)
  Multi-theme stability
    50 web pages with diverse topics
      News websites and home pages of Internet portals with lists of featured articles
    Result:


Conclusion
  Extract key terms from text document
  No training dataset required
  Wikipedia-based knowledge base
    Word sense disambiguation
    Semantic graph
    Semantic relatedness
  Valuable key term communities