Extracting Key Terms From Noisy and Multi-theme Documents
Maria Grineva, Maxim Grinev and Dmitry Lizorkin
Proceedings of the 18th International World Wide Web Conference (WWW 2009), ACM
Speaker: Chien-Liang Wu
Outline
  Motivation
  Framework of Key Terms Extraction
    Candidate Terms Extraction
    Word Sense Disambiguation
    Building Semantic Graph
    Discovering Community Structure of the Semantic Graph
    Selecting Valuable Communities
  Experiments
  Conclusions
Motivation
Key Terms Extraction
  Basic step for various NLP tasks: document classification, document clustering, text summarization
Challenges
  Web pages are typically noisy: side bars/menus, comments, …
  Dealing with multi-theme Web pages: portal home pages
Motivation (cont.)
State-of-the-art approaches to key terms extraction
  Based on statistical learning: TF×IDF model, keyphrase frequency, …
  Require a training set
Approach in this paper
  Based on analyzing syntactic or semantic term relatedness within a document
  Compute semantic relatedness between terms using Wikipedia
  Model the document as a semantic graph of terms and apply graph analysis techniques to it
  No training set required
Framework
  (Pipeline diagram: candidate terms extraction → word sense disambiguation → building semantic graph → discovering community structure → selecting valuable communities)
Candidate Terms Extraction
Goal:
  Extract all terms from the document
  For each term, prepare a set of Wikipedia articles that can describe its meaning
Parse the input document and extract all possible n-grams
For each n-gram (plus its morphological variations), provide a set of Wikipedia article titles
  "drinks", "drinking", "drink" => [Wikipedia:] Drink; Drinking
Avoid nonsense phrases appearing in the results: "using", "electric cars are", …
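A minimal sketch of this step in Python, assuming a hypothetical wiki_titles dictionary built offline from Wikipedia titles, redirects and anchor texts; the tokenizer and names are illustrative, not the authors' implementation:

```python
# Sketch of candidate term extraction. `wiki_titles` is a hypothetical dictionary
# mapping normalized n-grams (including morphological variations) to sets of
# Wikipedia article titles. Filtering out n-grams with no Wikipedia entry is
# what discards nonsense phrases like "using" or "electric cars are".

def extract_candidate_terms(text, wiki_titles, max_n=3):
    tokens = text.lower().split()
    candidates = {}
    for n in range(1, max_n + 1):                 # all n-grams up to length max_n
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            articles = wiki_titles.get(ngram)     # candidate Wikipedia articles, if any
            if articles:                          # keep only n-grams that map to
                candidates[ngram] = articles      # at least one Wikipedia article
    return candidates

# Example: "drinks", "drinking" and "drink" would all resolve to the
# Wikipedia articles Drink and Drinking through such a table.
```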
Word Sense Disambiguation
Goal:
  Choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step
Reference:
  "Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation", SYRCoDIS, 2008
Word Sense Disambiguation (contd.)
Example text:
  "Jigsaw is W3C's open-source project that started in May 1996. It is a web server platform that provides a sample HTTP 1.1 implementation and …"
Ambiguous term: "platform"
Four Wikipedia concepts around this word: "open-source", "web server", "HTTP", and "implementation"
Word Sense Disambiguation (contd.)
Neighbors of an article: all Wikipedia articles that have an incoming or an outgoing link to the original article
Each term is assigned a single Wikipedia article that describes its meaning
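A minimal sketch of this disambiguation strategy, assuming a relatedness(a, b) function (for example, the Dice measure shown later); helper and variable names are illustrative, not the authors' code:

```python
# Sketch of disambiguation: for each ambiguous term, pick the candidate article
# most related to the document's unambiguous terms (the context).

def disambiguate(candidates, relatedness):
    # Context articles: terms with exactly one candidate are unambiguous.
    context = [next(iter(arts)) for arts in candidates.values() if len(arts) == 1]
    resolved = {}
    for term, articles in candidates.items():
        if len(articles) == 1:
            resolved[term] = next(iter(articles))
        else:
            # Choose the candidate with the highest total relatedness to the context.
            resolved[term] = max(
                articles,
                key=lambda a: sum(relatedness(a, c) for c in context),
            )
    return resolved
```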
Building Semantic Graph
Goal:
  Build the document semantic graph using semantic relatedness between terms
The semantic graph is a weighted graph
  Vertex: a term
  Edge: connects two semantically related vertices
  Weight of edge: semantic relatedness measure of the two terms
Building Semantic Graph (contd.)
Using the Dice measure for Wikipedia-based semantic relatedness (reference: SYRCoDIS, 2008)
Weights for various link types

  Dice(A, B) = 2 * |n(A) ∩ n(B)| / (|n(A)| + |n(B)|)

  where n(A) is the set of neighbors of article A
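A small sketch of the unweighted form of this measure, assuming neighbors(a) returns the set of articles with a link to or from article a; the per-link-type weighting mentioned above is omitted:

```python
# Sketch of the Dice-based relatedness between two Wikipedia articles.

def dice_relatedness(a, b, neighbors):
    na, nb = neighbors(a), neighbors(b)
    if not na and not nb:
        return 0.0
    return 2 * len(na & nb) / (len(na) + len(nb))
```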
Detecting Community Structure of the Semantic Graph using the Newman Algorithm
Example: a news article, "Apple to Make iTunes More Accessible For the Blind" (semantic graph figure not reproduced)
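A sketch of graph construction and community detection with NetworkX; the slide names the Newman algorithm, and greedy modularity maximisation (Clauset-Newman-Moore) is used here as one concrete stand-in, not necessarily the authors' exact variant:

```python
# Sketch of building the weighted semantic graph and detecting term communities.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def build_semantic_graph(terms, relatedness, threshold=0.0):
    g = nx.Graph()
    g.add_nodes_from(terms)
    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            w = relatedness(a, b)
            if w > threshold:               # connect only semantically related terms
                g.add_edge(a, b, weight=w)  # edge weight = semantic relatedness
    return g

def detect_communities(graph):
    # Returns a list of sets of terms, one set per detected community.
    return list(greedy_modularity_communities(graph, weight="weight"))
```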
Selecting Valuable Communities
Goal: rank term communities so that:
  the higher-ranked communities contain key terms
  the lower-ranked communities contain unimportant terms and possible disambiguation mistakes
Use the density of community Ci:

  density(Ci) = sum(weight of inner edges in Ci) / (# of vertices in Ci)
Selecting Valuable Communities (contd.)
Informativeness of community Ci:

  informativeness(Ci) = sum(keyphraseness of terms in Ci) / (# of terms in Ci)

  keyphraseness(term) = count(D_link) / count(D_term)

  where count(D_link) is the number of Wikipedia articles in which the term appears as a link,
  and count(D_term) is the total number of articles in which it appears

Gives higher values to named entities (for example, Apple Inc., Steve Jobs, Braille) than to general terms (Consumer, Agreement, Information)
Community rank: density * informativeness
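A sketch of the ranking computation over a NetworkX graph, assuming a hypothetical keyphraseness(term) helper backed by the Wikipedia statistics above:

```python
# Sketch of community ranking. `graph` is the weighted semantic graph,
# `community` a set of its vertices, and `keyphraseness(term)` a hypothetical
# helper implementing count(D_link) / count(D_term).

def density(graph, community):
    sub = graph.subgraph(community)
    inner_weight = sum(d["weight"] for _, _, d in sub.edges(data=True))
    return inner_weight / len(community)      # total inner edge weight per vertex

def informativeness(community, keyphraseness):
    return sum(keyphraseness(t) for t in community) / len(community)

def community_rank(graph, community, keyphraseness):
    # Communities are ranked by density * informativeness.
    return density(graph, community) * informativeness(community, keyphraseness)
```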
Selecting Valuable Communities (contd.)
A decline in rank marks the border between important and non-important term communities
For the test collection, the decline coincides with the maximum F-measure in 73% of cases
Experiment: Noise-free Dataset
  252 posts from 5 technical blogs
  22 annotators took part in this experiment
  Each document was analyzed by 5 different annotators
  A key term was considered valid if at least two participants identified it
  For each document, two sets of key terms were built:
    Uncontrolled key terms: each annotator identified 5-10 key terms
    Controlled key terms: key terms matched to Wikipedia article titles
  Finally, 2009 key terms were obtained; 93% of them are Wikipedia article titles
Evaluation
  precision = |{manually_selected} ∩ {machine_selected}| / |{machine_selected}|

  recall = |{manually_selected} ∩ {machine_selected}| / |{manually_selected}|

  F-measure = 2 * precision * recall / (precision + recall)
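A small sketch of these metrics computed over sets of key terms:

```python
# Sketch of the evaluation metrics over manually and machine-selected key term sets.

def evaluate(manually_selected, machine_selected):
    overlap = len(manually_selected & machine_selected)
    precision = overlap / len(machine_selected) if machine_selected else 0.0
    recall = overlap / len(manually_selected) if manually_selected else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```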
Results
Revision of precision and recall:
  The communities-based method extracts more related terms in each thematic group than a human, giving better term coverage
  Each participant reviewed the automatically extracted key terms and, where possible, extended his manually identified key terms
  389 additional manually selected key terms
  Precision ↑ 46.1%, recall ↑ 67.7%
Evaluation on Web Pages
  509 real-world web pages
  Key terms were manually selected from the web pages in the same manner
  Noise stability (results chart not reproduced)
Evaluation on Web Pages (contd.)
  Multi-theme stability: 50 web pages with diverse topics
    News websites and home pages of Internet portals with lists of featured articles
  Result: (chart not reproduced)
Conclusion
  Extracts key terms from a text document
  No training dataset required
  Wikipedia-based knowledge base:
    Word sense disambiguation
    Semantic graph and semantic relatedness
  Valuable key term communities