Natural Language Processing using Wikipedia
Rada Mihalcea University of North Texas
Text Wikification
Finding key terms in documents and linking them to relevant encyclopedic information.
Text Wikification (continued)
Motivation:
- Help Wikipedia contributors
- NLP applications (summarization, text categorization, metadata annotation, text similarity)
- Enrich educational materials
- Annotate web pages (semantic web)
A combined problem:
- Finding the important concepts: keyword extraction
- Finding the correct article: word sense disambiguation
Wikification pipeline
[Figure: the Wikify! pipeline. Raw (hyper)text is decomposed into clean text; keyword extraction (candidate extraction, then candidate ranking) produces text with selected keywords; word sense disambiguation extracts sense definitions from the sense inventory and combines a knowledge-based method (Lesk-like definition overlap) with a data-driven one (Naive Bayes trained on Wikipedia) through voting; recomposition yields the (hyper)text with linked keywords, i.e. the annotated text.]
Keyword Extraction
Finding important words/phrases in raw text: a two-stage process.
- Candidate extraction. Typical methods: n-grams, noun phrases
- Candidate ranking: rank the candidates by importance. Typical methods: unsupervised (information-theoretic) and supervised (machine learning using positional and linguistic features)
Keyword Extraction using Wikipedia
1. Candidate extraction
- Semi-controlled vocabulary: Wikipedia article titles and anchor texts (surface forms), e.g. “USA”, “U.S.” = “United States of America”
- More than 2,000,000 terms/phrases
- The vocabulary is broad (e.g., “the” and “a” are included)
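To make this step concrete, here is a minimal Python sketch of harvesting the surface-form vocabulary from wiki markup; the regex and the `articles` iterator are simplifying assumptions, not the actual Wikify! implementation:

    import re
    from collections import defaultdict

    # Matches [[Target article|surface form]] or [[Target article]]
    LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:\|([^\]]+))?\]\]")

    def build_surface_vocabulary(articles):
        """Map anchor texts (surface forms) to the article titles they link to.
        `articles` is assumed to be an iterable of raw wiki-markup strings,
        e.g. produced by a Wikipedia dump parser (hypothetical input)."""
        vocabulary = defaultdict(set)
        for markup in articles:
            for match in LINK_RE.finditer(markup):
                title = match.group(1).strip()
                surface = (match.group(2) or title).strip().lower()
                vocabulary[surface].add(title)
        return vocabulary

    # e.g. vocabulary["usa"] and vocabulary["u.s."] would both contain
    # "United States of America"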
Keyword Extraction using Wikipedia
2. Candidate ranking
- tf * idf, with the Wikipedia articles as the document collection
- Chi-squared independence of phrase and text: the degree to which the phrase appears more times than expected by chance
- Keyphraseness:
$$P(\text{keyword} \mid W) \approx \frac{\text{count}(D_{key})}{\text{count}(D_W)}$$

where $D_{key}$ are the documents in which $W$ was selected as a keyword (i.e. appears as a link) and $D_W$ are the documents in which $W$ appears at all.
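A minimal sketch of keyphraseness-based ranking, assuming the two document-frequency tables have been precomputed from a Wikipedia dump (the function and table names are illustrative):

    def keyphraseness(n_docs_linked, n_docs_mentioning):
        """P(keyword | W) = count(D_key) / count(D_W): the fraction of the
        articles mentioning phrase W in which W also appears as a link."""
        return n_docs_linked / n_docs_mentioning if n_docs_mentioning else 0.0

    def rank_by_keyphraseness(candidates, linked_counts, mention_counts):
        """Sort candidate phrases by keyphraseness, highest first."""
        return sorted(
            candidates,
            key=lambda w: keyphraseness(linked_counts.get(w, 0),
                                        mention_counts.get(w, 1)),
            reverse=True,
        )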
Evaluations
- Gold standard: 85 documents containing 7,286 links
- Links selected by Wikipedia users, which have undergone Wikipedia’s continuous editorial process
- Extract the top N keywords from the ranking, with N = 6% of the number of words
Results
[Figure: precision, recall, and F-measure (30%-60% range) for the three ranking methods: tf.idf, chi-squared, and keyphraseness.]
Example Keyword Extraction
Automatically extracted annotations vs. Wikipedia annotations (the slide shows the same passage twice: once with links selected by the system, once with the links chosen by Wikipedia contributors):
The United States of America is a federal constitutional republic comprising fifty states and a federal district. The country is situated almost entirely in the western hemisphere: its forty-eight contiguous states and Washington, D.C., the capital district, lie in central North America between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south; the state of Alaska is in the northwest of the continent with Canada to its east, and the state of Hawaii is in the mid-Pacific.
Wikification Pipeline
[The pipeline figure from above, repeated; the following slides cover the word sense disambiguation stage.]
Word Sense Disambiguation
- Channel: A channel is also the natural or man-made deeper course through a reef, bar, bay, or any shallow body of water.
- Meter: Each bar has a 2-beat unit, a 5-beat unit, and a 3-beat unit, with a stress at the beginning of each unit.
- Aida (café): In most shops a quick coffee while standing up at the bar is possible.
Wikipedia as a Sense Tagged Corpus
In most shops a quick coffee while standing up at the [[bar (counter) | bar]] is possible.
A channel is also the natural or man-made deeper course through a reef, [[bar (landform) | bar]], bay, or any shallow body of water.
Each [[bar (music) | bar]] has a 2-beat unit, a 5-beat unit, and a 3-beat unit, with a stress at the beginning of each unit.
Wikipedia links = Sense annotations
Sense Inventory
Alternative 1: disambiguation webpages
- Do not include all possible annotations: for [[measure (music) | bar]], the label measure (music) is not listed
- Inconsistent identifiers for disambiguation pages: paper (disambiguation) vs. paper
Alternative 2: extract all link annotations
- E.g. bar (counter), bar (music), bar (landform)
- Map them to WordNet senses
Building a Sense Tagged Corpus
Given an ambiguous word W:
1. Extract all the paragraphs in Wikipedia containing the ambiguous word W inside a link
2. Collect all the possible Wikipedia labels, i.e. the leftmost component of each link
3. Map the Wikipedia labels to WordNet senses
An Example
Given the ambiguous word W = BAR:
1. Extract all the paragraphs in Wikipedia containing the ambiguous word W inside a link: 1,217 paragraphs; after removing examples with the ambiguous link [[bar]]: 1,108 examples
2. Collect all the possible Wikipedia labels (the leftmost component of each link): 40 Wikipedia labels, e.g. bar (music); measure (music); musical notation
3. Map the Wikipedia labels to WordNet senses: 9 WordNet senses
Word sense      | Wikipedia labels                              | Wikipedia definition                         | WordNet definition
bar (counter)   | bar_(counter)                                 | The counter from which drinks are dispensed | A counter where you can obtain food or drink
bar (music)     | bar_(music), measure_music, musical_notation  | A period of music                            | Musical notation for a repeating pattern of musical beats
bar (landform)  | bar_(landform)                                | A type of beach behind which lies a lagoon   | A submerged (or partly submerged) ridge in a river or along a shore
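A rough Python sketch of steps 1 and 2 above, collecting sense labels for one ambiguous word; the `paragraphs` input is an assumption, and step 3 (mapping labels to WordNet senses) is left as a separate manual or heuristic step, as in the slides:

    import re

    def sense_tagged_examples(paragraphs, word):
        """Collect (sense_label, paragraph) pairs for an ambiguous word.
        Keeps paragraphs where `word` is the surface form of a piped link,
        e.g. [[bar (music) | bar]]; unpiped [[bar]] links are ambiguous
        and never match. `paragraphs` is a hypothetical iterable drawn
        from a parsed Wikipedia dump."""
        link_re = re.compile(
            r"\[\[\s*([^\]|]+?)\s*\|\s*" + re.escape(word) + r"\s*\]\]",
            re.IGNORECASE,
        )
        examples = []
        for paragraph in paragraphs:
            for match in link_re.finditer(paragraph):
                label = match.group(1)  # leftmost component, e.g. "bar (music)"
                if label.lower() != word.lower():  # drop [[bar|bar]]-style links
                    examples.append((label, paragraph))
        return examples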
Supervised Word Sense Disambiguation
Local and topical features in a Naive Bayes classifier; good performance on Senseval-2 and Senseval-3 data.
- Local features: current word and its part of speech, surrounding context of three words, collocational features
- Topical features: five keywords per sense, occurring at least three times
(Ng & Lee, 1996), (Lee & Ng, 2002)
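A minimal sketch of this feature extraction and classifier using scikit-learn; the feature names and context-window handling are illustrative, not the original feature set:

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    def extract_features(tokens, pos_tags, i, topical_keywords):
        """Local + topical features for the ambiguous token at position i."""
        feats = {"word": tokens[i], "pos": pos_tags[i]}
        for offset in (-3, -2, -1, 1, 2, 3):  # surrounding context of three words
            j = i + offset
            if 0 <= j < len(tokens):
                feats["ctx%+d" % offset] = tokens[j]
        for kw in topical_keywords:  # sense-specific keywords seen in context
            if kw in tokens:
                feats["topic=" + kw] = 1
        return feats

    # Naive Bayes over the Wikipedia-derived training examples:
    model = make_pipeline(DictVectorizer(), MultinomialNB())
    # model.fit(list_of_feature_dicts, sense_labels)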
Experiments on Senseval-2 / Senseval-3
Lexical sample WSD:
- 49 ambiguous nouns, from Senseval-2 (29) and Senseval-3 (20)
- Remove the words with only one Wikipedia sense (e.g., detention)
- Remove the words with all Wikipedia senses mapped to a single WordNet sense (e.g., Roman church and Catholic church both map to Catholic church)
- Final set: 30 nouns with Wikipedia labels mapped to at least two WordNet senses
Methods (ten-fold cross-validation):
- [WSD] Supervised word sense disambiguation on the Wikipedia sense-tagged corpora
- [MFS] Most frequent sense: choose the most frequent sense by default
- [Similarity] Similarity between the current example and the training data available for each sense
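A minimal sketch of the [Similarity] baseline, assuming tf-idf cosine similarity between the test context and the pooled training contexts of each sense (the vector space used originally may differ):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def similarity_baseline(context, examples_by_sense):
        """Return the sense whose pooled training contexts are closest to
        the current context. `examples_by_sense` maps sense -> list of
        training paragraphs (hypothetical input)."""
        senses = list(examples_by_sense)
        docs = [" ".join(examples_by_sense[s]) for s in senses]
        vectorizer = TfidfVectorizer().fit(docs + [context])
        sims = cosine_similarity(vectorizer.transform([context]),
                                 vectorizer.transform(docs))[0]
        return senses[int(sims.argmax())]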
Results on Senseval-2 / Senseval-3
Word      #s    #ex   MFS      Similarity   WSD
argument  2     114   70.17%   73.63%       89.47%
arm       2     291   61.85%   69.31%       84.87%
bank      3     1074  97.20%   97.20%       97.20%
bar       10    1108  47.38%   68.09%       83.12%
circuit   4     327   85.32%   85.62%       87.15%
degree    7     849   58.77%   73.05%       85.98%
stress    3     565   53.27%   54.28%       86.37%
Average   3.31  316   72.58%   78.02%       84.65%
Some Notes
- Words with no improvement: small number of examples in Wikipedia, e.g. restraint (9), shelter (17)
- Skewed sense distributions: bank has 1,044 occurrences as “financial institution” and 30 as “river bank”
- Different granularity: coarser-grained senses in Wikipedia
  - Missing senses: atmosphere: ambiance
  - Coarse distinctions: grasp: act of grasping (#1) = hold (#2)
  - Exceptions: dance performance, theatre performance
Experiments on Wikipedia
All-words WSD as “link disambiguation”: find the link assigned by the Wikipedia annotators.
Data set: the same data set used in the keyword evaluation, 85 documents containing 7,286 links.
Three methods:
- Supervised
- Similarity (unsupervised): measure the similarity between the context and the candidate article
- Combined: voting
Results
[Figure: precision, recall, and F-measure (50%-100% range) for the random baseline, most frequent sense, Similarity, Supervised, and Combined methods.]
Wikification
[The full pipeline figure, shown once more.]
Wikify! system (http://lit.csci.unt.edu/~wikify/ or www.wikifyer.com)
Overall System Evaluation
- Turing-like test
- Annotation of educational materials
Turing-like Test
Given a Wikipedia article, decide whether it was annotated by humans or by our automated system. As in the earlier example, the slide shows the same passage twice: automatically extracted annotations vs. Wikipedia annotations.
The United States of America is a federal constitutional republic comprising fifty states and a federal district. The country is situated almost entirely in the western hemisphere: its forty-eight contiguous states and Washington, D.C., the capital district, lie in central North America between the Pacific and Atlantic Oceans, bordered by Canada to the north and Mexico to the south; the state of Alaska is in the northwest of the continent with Canada to its east, and the state of Hawaii is in the mid-Pacific.
Turing-like Test
- 20 test subjects (mixed background)
- 10 document pairs for each subject (side by side)
- Average accuracy: 57%
- Ideal case = 50% success rate (total confusion)
Annotation of Educational Materials
Studies in cognitive science: “An important part of the learning process is the ability to connect the learning material to the prior knowledge of the learner” (Walter Kintsch, 1998).
The amount of required background material depends on:
- The level of explicitness of the text
- The knowledge of the learner (low-knowledge vs. high-knowledge learners)
Use the text wikifier to facilitate access to background knowledge.
A History Test
- A test consisting of 14 multiple-choice questions from a quiz in an online history course at UNT
- Half the questions linked to Wikipedia, half left in their original format
- 60 students took the test; randomly, either the first or the last 7 questions were wikified
- Students were instructed that they were allowed to use any information they wanted to answer the questions, and that they were not required to use the Wikipedia links
Results
[Figure: two panels comparing raw vs. wikified questions: the percentage of correct answers (60%-80% range) and the time to completion (significance levels p < 0.1 and p < 0.05).]
Lessons Learned
Wikipedia can be used as a source of evidence for text processing tasks:
- Keyword extraction
- Word sense disambiguation
Text wikification, linking documents to encyclopedic knowledge, can:
- Enrich educational materials
- Annotate web pages (semantic web)
- Support NLP applications: summarization, information retrieval, text categorization, text adaptation, topic identification, multilingual semantic networks
Ongoing Work: Text Adaptation
Planning for a Long Trip (Magellan’s Stories): “Serrao’s letters helped build in my mind the location of the Spice Islands, which later became the destination for my great voyage. I asked the King of Portugal to support my journey, but he refused. After that, I begged the King of Spain. He was interested in my plan since Spain was looking for a better sea route to Asia than the Portuguese route around the southern tip of Africa. It was going to be hard to find sailors, though. None of the Spanish sailors wanted to sail with me because I was Portuguese.”
Funded by the National Science Foundation under CAREER IIS-0747340, 2008-2013. Collaboration with Educational Testing Service (ETS).
Gloss annotations from the figure:
- Def: long trip with a specific objective, esp. by sea or air. En: trip, journey. Es: travesia, viaje
- Def: to travel by boat. En: navigate. Es: salir, navigar
Ongoing Work: Topic Identification
Automatic identification of the topic/category of a text (e.g., computer science, psychology), for books and learning objects.
Funded by the Texas Higher Education Coordinating Board and Google, 2008-2010.
Example: “The United States was involved in the Cold War.”

Article relevance scores:
United States        0.3793
Cold War             0.3111
Vietnam War          0.0023
World War I          0.0023
Communism            0.0027
Ronald Reagan        0.0027
Michail Gorbachev    0.0023

Category scores:
Cat: Wars Involving the United States    0.00779
Cat: Global Conflicts                    0.00779
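One way to read the example: article-level relatedness scores are pushed up to the Wikipedia categories of those articles. A sketch of such an aggregation, where the summing-and-normalizing rule is an assumption rather than the published method:

    from collections import defaultdict

    def category_scores(article_scores, categories_of):
        """Aggregate article relatedness scores into category scores.
        `article_scores`: article -> weight (as in the example above);
        `categories_of`: article -> iterable of its Wikipedia categories."""
        totals = defaultdict(float)
        for article, score in article_scores.items():
            for cat in categories_of.get(article, ()):
                totals[cat] += score
        norm = sum(totals.values()) or 1.0
        return {cat: s / norm for cat, s in totals.items()}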
Ongoing Work: Multilingual Semantic Networks
Funded by the National Science Foundation under IIS-1018613, 2010-2013.
Example sentence: “John Williams served as the principal conductor of the Boston Pops Orchestra”
[Figure: a multilingual semantic network derived from the sentence, with isA, instanceOf, and partOf relations among concept nodes, each lexicalized in English, French, and German: COMPOSER (composer / compositeur / Komponist), JOHN WILLIAMS (John Williams, Williams), PIANIST (pianist / pianiste / Pianist), MUSICIAN (musician / musicien / Musiker), CONDUCTOR (conductor / chef d’orchestre / Dirigent), CONDUCTOR OF THE BOSTON POPS ORCHESTRA, ORCHESTRA (orchestra / orchestre / Orchester), and BOSTON POPS ORCHESTRA (Boston Pops Orchestra / Orchestre Boston Pops).]
Thank You!
Questions?
Wikipedia for Natural Language Processing
- Word similarity: (Strube & Ponzetto, 2006), (Gabrilovich & Markovitch, 2007)
- Text categorization: (Gabrilovich & Markovitch, 2006)
- Named entity disambiguation: (Bunescu & Pasca, 2006)
Wikipedia vs. WordNet (Senseval)
Different granularity: coarser-grained senses in Wikipedia
- Missing senses: atmosphere: ambiance
- Coarse distinctions: grasp: act of grasping (#1) = hold (#2)
- Exceptions: dance performance, theatre performance

Wikipedia vs. Senseval: different sense distributions (low correlation between sense distributions, r = 0.51)

           #s    #ex  MFS     LeskC   WSD
Senseval   4.6   226  51.53%  58.33%  68.13%
Wikipedia  3.31  316  72.58%  78.02%  84.65%
Sense Disambiguation Learning Curve
Disambiguation accuracy using 10%, 20%, ..., 100% of the data.
[Figure: accuracy (70%-90% range) as a function of the fraction of training data used.]
Text Wikification
Finding key terms in documents and linking them to relevant encyclopedic information.
Lexical Semantics
Find the meaning of all words in unrestricted text; required for automatic machine translation, information retrieval, and text understanding.
- SenseLearner: minimally supervised learning. Evaluated at Senseval-2, Senseval-3, and Semeval (Semeval @ ACL 2007). Publicly available: http://lit.csci.unt.edu/~senselearner
- GWSD: unsupervised graph-based algorithms. Random walks on text structures find the most central meanings in a text (see the sketch below). Downloads: http://lit.csci.unt.edu/index.php/Downloads
Funded by the National Science Foundation
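A sketch of the random-walk idea behind GWSD, using PageRank over a sense graph; this is a generic reconstruction under stated assumptions, not the released code:

    import networkx as nx

    def graph_wsd(candidate_senses, sense_similarity):
        """Pick the most central sense of each word via a random walk.
        `candidate_senses`: word -> list of sense ids;
        `sense_similarity`: (sense, sense) -> float (e.g. definition overlap)."""
        g = nx.Graph()
        words = list(candidate_senses)
        for i, w1 in enumerate(words):
            for w2 in words[i + 1:]:  # no edges between senses of the same word
                for s1 in candidate_senses[w1]:
                    for s2 in candidate_senses[w2]:
                        sim = sense_similarity(s1, s2)
                        if sim > 0:
                            g.add_edge(s1, s2, weight=sim)
        rank = nx.pagerank(g, weight="weight")
        return {w: max(senses, key=lambda s: rank.get(s, 0.0))
                for w, senses in candidate_senses.items()}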
Lexical substitution: SubFinder
- Find semantically equivalent substitutes for a target word in a given context
- Combine corpus-based and knowledge-based approaches
- Combine monolingual and multilingual resources: WordNet, Encarta, bilingual dictionaries, large corpora
- Fared well in the Semeval 2007 lexical substitution task
TransFinder
- Find the translation of a target word in a given context
- Assist Hispanic students with the understanding of English texts
- Task at Semeval 2010
Lexical Semantics
Funded by the National Science Foundation
Text-to-text semantic similarity
- Find whether two pieces of text contain the same information
- Useful for information retrieval (search engines) and text summarization
- Focus on automatic student answer grading: given the instructor answer and the student answer, assign a grade and identify potential misunderstandings and areas that need clarification (a toy sketch follows below)
Lexical Semantics
Funded by the National Science Foundation
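As a toy illustration of grading by text-to-text similarity (the real grader uses richer semantic similarity measures than plain tf-idf cosine, so treat this purely as a sketch):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def grade_answer(instructor_answer, student_answer, max_grade=10.0):
        """Scale a similarity score between the two answers into a grade."""
        vectorizer = TfidfVectorizer().fit([instructor_answer, student_answer])
        sim = cosine_similarity(vectorizer.transform([instructor_answer]),
                                vectorizer.transform([student_answer]))[0, 0]
        return round(sim * max_grade, 1)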
Metadata Annotation for Learning Object Repositories
- Learning object repositories support the sharing and reuse of educational materials
- Identify keywords and related concepts for the automatic annotation of learning object repositories
- Keyword extraction using graph-based algorithms, with knowledge drawn from Wikipedia
Funded by the Texas Higher Education Coordinating Board (THECB)
Sentiment and Subjectivity
Add subjectivity and sentiment labels to word senses; important for the automatic analysis of political opinions, product reviews, and market research. Collaboration with Jan Wiebe, U. Pittsburgh.
- Automatic assignment of subjectivity to word senses
- Projection of subjectivity annotations and resources to other languages, via parallel texts / bilingual dictionaries or via machine translation
- Bootstrapping of subjectivity / sentiment seeds using propagation on graphs and word similarity (a sketch follows below)
Funded by the National Science Foundation
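A generic label-propagation sketch of the seed-bootstrapping idea: seed words keep their subjectivity scores, and other words absorb a similarity-weighted average of their neighbors' scores (the exact formulation in the project may differ):

    def propagate_sentiment(seeds, similarity, words, iterations=10, decay=0.9):
        """`seeds`: word -> score in [-1, 1] (held fixed); `similarity`: a
        word-pair similarity function; returns propagated scores."""
        scores = {w: seeds.get(w, 0.0) for w in words}
        for _ in range(iterations):
            updated = {}
            for w in words:
                if w in seeds:
                    updated[w] = seeds[w]  # seed labels stay fixed
                    continue
                num = sum(similarity(w, v) * scores[v] for v in words if v != w)
                den = sum(similarity(w, v) for v in words if v != w) or 1.0
                updated[w] = decay * num / den
            scores = updated
        return scores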
Affective text
- Automatic annotation of emotions in text: anger, disgust, fear, joy, sadness, surprise
- Collaboration with Carlo Strapparava, IRST
- Large data sets constructed
Computational humour
- Learning to recognize humour
- Identification of connections with other linguistic properties: affect, valence, semantic classes
Sentiment and Subjectivity
Text-to-image Synthesis
Language learning:
- Children
- Second (foreign) language
- People with language disorders
An international, language-independent knowledge base: pictures are transparent to languages.
Applications:
- Pictorial translations (“Letters to my cousin”)
- Bridging the gap between research in image and text processing: image retrieval/classification, natural language processing
Typical entry in a dictionary:
- pipe, tobacco pipe: a tube with a small bowl at one end; used for smoking tobacco
- pipe, pipage, piping: a long tube made of metal or plastic that is used to carry water or oil or gas etc.
- pipe, tabor pipe: a tubular wind instrument
+ pictorial representations