Extracting Key Terms From Noisy and Multi-theme Documents
Maria Grineva, Maxim Grinev and Dmitry Lizorkin
Proceedings of the 18th International World Wide Web Conference (WWW 2009), ACM
Speaker: Chien-Liang Wu
Outline
  Motivation
  Framework of Key Terms Extraction
    Candidate Terms Extraction
    Word Sense Disambiguation
    Building Semantic Graph
    Discovering Community Structure of the Semantic Graph
    Selecting Valuable Communities
  Experiments
  Conclusions
Motivation
Key Terms Extraction
  Basic step for various NLP tasks: document classification, document clustering, text summarization
Challenges
  Web pages are typically noisy: side bars/menus, comments, …
  Dealing with multi-theme Web pages: portal home pages
Motivation (cont.)
State-of-the-art approaches to key terms extraction
  Based on statistical learning: TF×IDF model, keyphrase frequency, …
  Require a training set
Approach in this paper
  Based on analyzing syntactic or semantic term relatedness within a document
  Compute semantic relatedness between terms using Wikipedia
  Model the document as a semantic graph of terms and apply graph analysis techniques to it
  No training set required
Framework
  (Pipeline diagram: candidate terms extraction → word sense disambiguation → building semantic graph → discovering community structure → selecting valuable communities)
Candidate Terms Extraction
Goal:
  Extract all terms from the document
  For each term, prepare a set of Wikipedia articles that can describe its meaning
Parse the input document and extract all possible n-grams
For each n-gram (plus its morphological variations), provide a set of Wikipedia article titles
  "drinks", "drinking", "drink" => [Wikipedia:] Drink; Drinking
Avoid nonsense phrases appearing in the results: "using", "electric cars are", …
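A minimal sketch of this step in Python, assuming a hypothetical wiki_titles dictionary built offline from Wikipedia titles, redirects and anchor texts; the tokenizer and names are illustrative, not the authors' implementation:

```python
# Sketch of candidate term extraction. `wiki_titles` is a hypothetical dictionary
# mapping normalized n-grams (including morphological variations) to sets of
# Wikipedia article titles. Filtering out n-grams with no Wikipedia entry is
# what discards nonsense phrases like "using" or "electric cars are".

def extract_candidate_terms(text, wiki_titles, max_n=3):
    tokens = text.lower().split()
    candidates = {}
    for n in range(1, max_n + 1):                 # all n-grams up to length max_n
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            articles = wiki_titles.get(ngram)     # candidate Wikipedia articles, if any
            if articles:                          # keep only n-grams that map to
                candidates[ngram] = articles      # at least one Wikipedia article
    return candidates

# Example: "drinks", "drinking" and "drink" would all resolve to the
# Wikipedia articles Drink and Drinking through such a table.
```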
Word Sense Disambiguation
Goal:
  Choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step
Reference:
  "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation", SYRCoDIS, 2008
Word Sense Disambiguation (contd.)
Example text:
  "Jigsaw is W3C's open-source project that started in May 1996. It is a web server platform that provides a sample HTTP 1.1 implementation and …"
Ambiguous term: "platform"
Four Wikipedia concepts around this word: "open-source", "web server", "HTTP", and "implementation"
Word Sense Disambiguation (contd.)
Neighbors of an article: all Wikipedia articles that have an incoming or an outgoing link to the original article
Each term is assigned a single Wikipedia article that describes its meaning
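A minimal sketch of this disambiguation strategy, assuming a relatedness(a, b) function (for example, the Dice measure shown later); helper and variable names are illustrative, not the authors' code:

```python
# Sketch of disambiguation: for each ambiguous term, pick the candidate article
# most related to the document's unambiguous terms (the context).

def disambiguate(candidates, relatedness):
    # Context articles: terms with exactly one candidate are unambiguous.
    context = [next(iter(arts)) for arts in candidates.values() if len(arts) == 1]
    resolved = {}
    for term, articles in candidates.items():
        if len(articles) == 1:
            resolved[term] = next(iter(articles))
        else:
            # Choose the candidate with the highest total relatedness to the context.
            resolved[term] = max(
                articles,
                key=lambda a: sum(relatedness(a, c) for c in context),
            )
    return resolved
```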
Building Semantic Graph
Goal:
  Build the document semantic graph using semantic relatedness between terms
The semantic graph is a weighted graph
  Vertex: a term
  Edge: connects two semantically related vertices
  Weight of edge: semantic relatedness measure of the two terms
Building Semantic Graph (contd.)
Using the Dice measure for Wikipedia-based semantic relatedness (reference: SYRCoDIS, 2008)
Weights for various link types

  Dice(A, B) = 2 * |n(A) ∩ n(B)| / (|n(A)| + |n(B)|)

  where n(A) is the set of neighbors of article A
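A small sketch of the unweighted form of this measure, assuming neighbors(a) returns the set of articles with a link to or from article a; the per-link-type weighting mentioned above is omitted:

```python
# Sketch of the Dice-based relatedness between two Wikipedia articles.

def dice_relatedness(a, b, neighbors):
    na, nb = neighbors(a), neighbors(b)
    if not na and not nb:
        return 0.0
    return 2 * len(na & nb) / (len(na) + len(nb))
```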
Detecting Community Structure of the Semantic Graph using the Newman Algorithm
Example: a news article, "Apple to Make iTunes More Accessible For the Blind" (semantic graph figure not reproduced)
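A sketch of graph construction and community detection with NetworkX; the slide names the Newman algorithm, and greedy modularity maximisation (Clauset-Newman-Moore) is used here as one concrete stand-in, not necessarily the authors' exact variant:

```python
# Sketch of building the weighted semantic graph and detecting term communities.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def build_semantic_graph(terms, relatedness, threshold=0.0):
    g = nx.Graph()
    g.add_nodes_from(terms)
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            w = relatedness(a, b)
            if w > threshold:               # connect only semantically related terms
                g.add_edge(a, b, weight=w)  # edge weight = semantic relatedness
    return g

def detect_communities(graph):
    # Returns a list of sets of terms, one set per detected community.
    return list(greedy_modularity_communities(graph, weight="weight"))
```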
Selecting Valuable Communities
Goal: rank term communities so that:
  the higher-ranked communities contain key terms
  the lower-ranked communities contain unimportant terms and possible disambiguation mistakes
Use the density of community Ci:

  density(Ci) = sum(weight of inner edges in Ci) / (# of vertices in Ci)
Selecting Valuable Communities (contd.)
Informativeness of community Ci:

  informativeness(Ci) = sum(keyphraseness of terms in Ci) / (# of terms in Ci)

  keyphraseness(term) = count(D_link) / count(D_term)

  where count(D_link) is the number of Wikipedia articles in which the term appears as a link,
  and count(D_term) is the total number of articles in which it appears

Gives higher values to named entities (for example, Apple Inc., Steve Jobs, Braille) than to general terms (Consumer, Agreement, Information)
Community rank: density * informativeness
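A sketch of the ranking computation over a NetworkX graph, assuming a hypothetical keyphraseness(term) helper backed by the Wikipedia statistics above:

```python
# Sketch of community ranking. `graph` is the weighted semantic graph,
# `community` a set of its vertices, and `keyphraseness(term)` a hypothetical
# helper implementing count(D_link) / count(D_term).

def density(graph, community):
    sub = graph.subgraph(community)
    inner_weight = sum(d["weight"] for _, _, d in sub.edges(data=True))
    return inner_weight / len(community)      # total inner edge weight per vertex

def informativeness(community, keyphraseness):
    return sum(keyphraseness(t) for t in community) / len(community)

def community_rank(graph, community, keyphraseness):
    # Communities are ranked by density * informativeness.
    return density(graph, community) * informativeness(community, keyphraseness)
```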
Selecting Valuable Communities (contd.)
A decline in rank marks the border between important and non-important term communities
For the test collection, the decline coincides with the maximum F-measure in 73% of cases
Experiment: Noise-free Dataset
  252 posts from 5 technical blogs
  22 annotators took part in this experiment
  Each document was analyzed by 5 different annotators
  A key term was considered valid if at least two participants identified it
  For each document, two sets of key terms were built:
    Uncontrolled key terms: each annotator identified 5-10 key terms
    Controlled key terms: key terms matched to Wikipedia article titles
  Finally, 2009 key terms were obtained; 93% of them are Wikipedia article titles
Evaluation
  precision = |{manually_selected} ∩ {machine_selected}| / |{machine_selected}|

  recall = |{manually_selected} ∩ {machine_selected}| / |{manually_selected}|

  F-measure = 2 * precision * recall / (precision + recall)
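A small sketch of these metrics computed over sets of key terms:

```python
# Sketch of the evaluation metrics over manually and machine-selected key term sets.

def evaluate(manually_selected, machine_selected):
    overlap = len(manually_selected & machine_selected)
    precision = overlap / len(machine_selected) if machine_selected else 0.0
    recall = overlap / len(manually_selected) if manually_selected else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```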
Results
Revision of precision and recall:
  The communities-based method extracts more related terms in each thematic group than a human, giving better term coverage
  Each participant reviewed the automatically extracted key terms and, where possible, extended his manually identified key terms
  389 additional manually selected key terms
  Precision ↑ 46.1%, recall ↑ 67.7%
Evaluation on Web Pages
  509 real-world web pages
  Key terms were manually selected from the web pages in the same manner
  Noise stability (results chart not reproduced)
Evaluation on Web Pages (contd.)
  Multi-theme stability: 50 web pages with diverse topics
    News websites and home pages of Internet portals with lists of featured articles
  Result: (chart not reproduced)
Conclusion
  Extracts key terms from a text document
  No training dataset required
  Wikipedia-based knowledge base:
    Word sense disambiguation
    Semantic graph and semantic relatedness
  Valuable key term communities