PhD defense held at Kno.e.sis Center, Wright State University, December 03, 2013.
Adaptive Semantic Annotation of Entity and Concept Mentions in Text
Pablo N. Mendes
PhD dissertation defense
Ohio Center of Excellence in Knowledge-enabled Computing (kno.e.sis)
Wright State University, Dayton, OH
Introductions and Thank you!
Outline
● Introduction, Motivation, Background
  – KB Tagging, Annotation as a Service
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
  – tweets, audio transcripts, educational material
Outline
● Introduction, Motivation, Background
  – KBT: Knowledge Base Tagging of Text
  – AaaS: Annotation as a Service
  – Adaptability
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
KBT, informally
● Knowledge Base Tagging (KBT)
● A developer needs to
– “extract entities”,
– “identify what is mentioned”,
– “connect to knowledge bases”.
● He/she is not an NLP or IE expert
● Would like to reuse as much as possible
● May have limited computational resources
→ Annotation as a Service (AaaS)
Example: the same passage processed by different extraction tasks.

On Thursday, April 11, 1996, a fire in an occupied passenger terminal at the airport in Düsseldorf, Germany, killed 17 people and injured 62. The fire began at approximately 3:31 p.m., about the time someone reported seeing sparks falling from the ceiling in the vicinity of a flower shop at the east end of the arrivals hall on the first floor.

● Named Entity Recognition (NER): DATE, TIME, and LOCATION mentions, e.g. “Thursday, April 11, 1996” (DATE), “3:31 p.m.” (TIME), “Düsseldorf”, “Germany” (LOCATION)
● Keyphrase Extraction (KE): “fire”, “sparks”, “passenger terminal”, “ceiling”, “arrivals hall”
● Automatic Term Recognition (ATR): domain terms such as “airport”, “passenger terminal”
● Wikification (WKF) / Entity Linking (EL): “Düsseldorf” → LOCATION, KB ID:4213421
Related Work
[Related-work landscape chart. Axes: syntactic vs. semantic; domain-specific Web content and auto-extracted facts vs. community-generated, cross-domain, multilingual knowledge. Syntactic tasks and systems: NER, KE, ATR, Wikification (Illinois Wikifier, TagMe, AIDA/Yago). Earlier semantic systems: Voquette SCORE, Semagix Freedom, SemTag. My work sits in the semantic, community-generated, cross-domain, multilingual corner.]
Related Work (commercial)
Adaptability
Inputs vary: news, scientific literature, tweets, audio transcripts, query keywords.
Desired outputs vary: new terms, named entities, important phrases, concepts related to an objective.

● Each developer may have a different application in mind
  – different input and output
  – “get key topics for summarization?”
  – “exhaustive tagging for semantic search?”
● There is no one-size-fits-all.
● But can we support adaptation to different “fits”?
Requirements
● Transparent process
  – Clear understanding of where things are working or failing
● Adaptable process
  – Ability to exchange individual components in order to achieve different goals
  – Ability to modify the behavior of existing components
● Adaptable to different inputs
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
A Conceptual Model of KBT

[Model diagram: a User (Creator) produces text (the Düsseldorf fire passage, tagged DATE and LOCATION); the System runs the pipeline Phrase Recognition, Candidate Selection, Disambiguation, Tagging over the KB and emits Annotations (e.g., Spark_(fire) with score 0.87) for a User (Consumer) pursuing an Objective; an Editor reviews annotations, and feedback flows back into the System.]
KBT and Related Tasks
Extraction task outcomes, by task (x = provided, / = partial):

Outcome                      | KE          | NER | EL      | WSD | WKF | ATR           | KBT
Recognize known terms        |             |     |         |     | x   |               | x
Recognize new terms          | x           | x   | x (NIL) |     |     | x             | x
Classify ontological type    |             | x   | /       |     |     |               | x
Resolve ambiguity            |             |     | x       | x   | x   |               | x
Measure importance/relevance | x (to text) |     |         |     | x   | x (to domain) | x
Tag each occurrence          |             | x   | x       |     |     |               | x
Novelty in the model
● Users and objective are explicit in the model
  – Knowledge about content creators provides context for new types of KBT
  – Knowledge about the consumer and objective allows customizing output
  – Feedback is used to learn from mistakes
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
Wikipedia Extraction
Knowledge Base
● DBpedia is a cross-domain KB extracted from Wikipedia [Auer et al. 2007, Bizer et al. 2009]
  – Describes 3.7M things through 400M facts
  – Uses an ontology of 320 classes and 1,650 properties
● DBpedia Live keeps DBpedia up-to-date with Wikipedia changes [Hellmann et al. 2009, Morsey et al. 2012]
● A whole ecosystem with an active community [Lehmann et al. 2013]
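Since DBpedia is itself queryable, a KBT developer can inspect the KB directly. A minimal sketch using the SPARQLWrapper library against the public endpoint (the query and limit are illustrative, not from the slides):

    # Sketch: querying DBpedia's public SPARQL endpoint (illustrative).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    # Ask for a few entities typed as dbo:Airport in the DBpedia ontology.
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?airport WHERE { ?airport a dbo:Airport } LIMIT 5
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["airport"]["value"])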
DBpedia Extraction Framework
[with Lehmann et al. @ SWJ 2013]
Added new extractors to support KBT:
- Thematic concepts
- Topical signatures
- Distributional Semantic Model statistics for semantic relatedness
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
System: default workflow
● Phrase Recognition:
  – mention recognition (e.g., NER)
● Candidate Selection:
  – detecting possible senses for a surface form
● Disambiguation:
  – choosing (ranking/classifying) one sense for a mention
● Tagging:
  – deciding whether to annotate, to account for entities not in the KB or uninformative annotations.
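To make the default workflow concrete, here is a toy sketch of the four stages (my illustration, not the system's actual code; the lexicon and scores are made up):

    # Toy sketch of the four-stage KBT workflow (illustrative only).
    LEXICON = {"new york": ["New_York_(magazine)", "Manhattan", "New_York_City"]}
    SCORE   = {"New_York_City": 0.67, "Manhattan": 0.22, "New_York_(magazine)": 0.10}

    def recognize_phrases(text):
        # Phrase Recognition: naive lexicon lookup over the lowercased text.
        return [sf for sf in LEXICON if sf in text.lower()]

    def select_candidates(surface_form):
        # Candidate Selection: all KB resources known for this surface form.
        return LEXICON.get(surface_form, [])

    def disambiguate(candidates):
        # Disambiguation: pick the highest-scoring sense (the real system
        # scores contextual relatedness; here the scores are fixed).
        return max(candidates, key=lambda uri: SCORE.get(uri, 0.0))

    def tag(uri, confidence=0.5):
        # Tagging: annotate only if the score clears a confidence threshold.
        return SCORE.get(uri, 0.0) >= confidence

    for sf in recognize_phrases("Lennon and McCartney went to New York."):
        best = disambiguate(select_candidates(sf))
        if tag(best):
            print(sf, "->", best)   # new york -> New_York_City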
(…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.

Phrase Recognition spots “New York”; Candidate Selection retrieves candidate senses, which Disambiguation scores by contextual relatedness:

  New York (magazine) 0.10
  New York 0.34
  Manhattan 0.22
  Province of New York 0.23
  New York City 0.67
  New York, New York (film) 0.45
  New York metropolitan area 0.56
  West New York, New Jersey 0.01
  Roman Catholic Archdiocese of New York 0.33
  Pennsylvania Station (New York City) 0.07

Disambiguation picks New York City; Tagging emits the annotation: “New York” (type: city, pos: 78, relevance: 0.67, ...).
A quick example

[Demo screenshots: an annotated text where “Show Top-K Candidates” lists alternatives for an ambiguous mention, e.g. LSU_Tigers vs. Louisiana State University.]
Virtuous Cycle
[with Héder @ WWW'2012]
Through the Sztakipedia toolbar, DBpedia Spotlight suggests links to Wikipedia editors, which catalyzes the evolution of the knowledge source.

The /feedback service:
- allows users to submit judgements
- enables system evolution with feedback
- also works on blogs, etc., with RDFaCE [Khalili]
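A sketch of what a feedback submission could look like; the endpoint path follows the /feedback service named above, but the parameter names below are hypothetical, not the documented API:

    # Hypothetical feedback submission; parameter names are illustrative.
    import requests

    requests.post(
        "http://spotlight.dbpedia.org/rest/feedback",
        data={
            "text": "Lennon and McCartney went to New York.",
            "surface_form": "New York",
            "entity_uri": "http://dbpedia.org/resource/New_York_City",
            "feedback": "correct",   # a user judgement: correct / incorrect
        },
        timeout=10,
    )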
Contextual relatedness score: TF*ICF
“Washington” has multiple candidate senses, each with a set W of context words:
- Washington, DC: W = {“capital”, “USA”, ...}
- George Washington: W = {“president”, “USA”, ...}
- Washington State: W = {“Seattle”, “USA”, ...}

ICF(“Washington”, “USA”) < ICF(“Washington”, “Seattle”)

[Mendes et al. @ ISEM2011]

TF*IDF (Term Frequency * Inverse Document Frequency):
- TF: relevance of a word in the context of a DBpedia resource
- IDF: words that are too common are less useful

ICF (Inverse Candidate Frequency), entropy-inspired:
- ICF is the rarity of a word relative to the possible senses: “USA” appears in the context of all three candidates, so it discriminates less than “Seattle”.
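A minimal sketch of the idea (my illustration of TF*ICF; the candidate contexts and counts are made up): ICF weighs a context word by how few of the surface form's candidate senses contain it, so “Seattle” discriminates among the Washingtons while “USA” does not.

    # Illustrative TF*ICF sketch; the data is made up.
    import math
    from collections import Counter

    # Context-word counts for each candidate sense of "Washington".
    CONTEXTS = {
        "Washington,_D.C.":   Counter({"capital": 12, "USA": 30}),
        "George_Washington":  Counter({"president": 25, "USA": 28}),
        "Washington_(state)": Counter({"Seattle": 18, "USA": 22}),
    }

    def icf(word):
        # Inverse Candidate Frequency: log(|candidates| / |candidates containing word|).
        n_with_word = sum(1 for ctx in CONTEXTS.values() if word in ctx)
        return math.log(len(CONTEXTS) / n_with_word) if n_with_word else 0.0

    def tf_icf(candidate, context_words):
        # Score a candidate by summing TF * ICF over the paragraph's words.
        ctx = CONTEXTS[candidate]
        return sum(ctx[w] * icf(w) for w in context_words if w in ctx)

    assert icf("USA") < icf("Seattle")   # the inequality from the slide
    print(tf_icf("Washington_(state)", ["Seattle", "USA"]))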
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
Core Evaluations
Phrase Recognition Results
At LREC'2012.

Spotting strategies compared:
- (L) Lexicon-based
- (LNP*) Lexicon-based with at least one noun
- (NPL) Noun phrases, lexicon lookup (Bloom filter)
- (CW) Lexicon-based, removing common words
- (Kea) Keyphrases
- (NER) Named entities only
- (NER ∪ NP) N-grams within noun phrases and NEs

Policies: S = { s | p(s) > cutoff_S }

Different spotting strategies evaluated on the CSAW dataset. Take home:
- It is not only about importance/relevance
- Precision is less critical here: it is taken care of in downstream steps
- Recall is key: a phrase missed at this stage is an overall failure
- Simple methods work quite well
- Combinations of techniques improve results
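For reference, the simplest lexicon-based strategy (L) amounts to scanning the text's n-grams against a surface-form dictionary. A toy sketch (the variants above add noun-phrase constraints, Bloom-filter lookup, etc.; the lexicon entries here are made up):

    # Toy lexicon-based spotting; illustrative only.
    LEXICON = {"new york", "apple corps", "lennon"}

    def spot(text, max_len=3):
        # Check every n-gram up to max_len tokens against the lexicon.
        tokens = text.lower().split()
        spots = []
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
                phrase = " ".join(tokens[i:j])
                if phrase in LEXICON:
                    spots.append((phrase, i, j))
        return spots

    print(spot("Lennon and McCartney went to New York"))
    # [('lennon', 0, 1), ('new york', 5, 7)]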
Context-Independent Strategies
● NAÏVE
  – Use the surface form to build a URI: “berlin” → dbpedia:Berlin
● PROMINENCE
  – P(u) = n(u) / N (the ‘popularity’/importance of this URI)
    ● n(u): number of times URI u occurred
    ● N: total number of occurrences
  – Intuition: URIs that have appeared often are more likely to appear again
● DEFAULT SENSE
  – P(u|s) = n(u,s) / n(s)
    ● n(u,s): number of times URI u occurred with surface form s
  – Intuition: some surface forms are strongly associated with specific URIs
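A toy sketch of the PROMINENCE and DEFAULT SENSE priors, estimated from (surface form, URI) wikilink pairs (the counts are made up):

    # Estimating P(u) and P(u|s) from wikilink occurrence counts; toy data.
    from collections import Counter

    pairs = ([("berlin", "Berlin")] * 90 + [("berlin", "Berlin_(band)")] * 10
           + [("new york", "New_York_City")] * 55 + [("new york", "New_York")] * 45)

    n_u  = Counter(uri for _, uri in pairs)   # n(u)
    n_s  = Counter(sf for sf, _ in pairs)     # n(s)
    n_us = Counter(pairs)                     # n(u, s)
    N    = len(pairs)

    def prominence(uri):
        return n_u[uri] / N                   # P(u) = n(u) / N

    def default_sense(uri, sf):
        return n_us[(sf, uri)] / n_s[sf]      # P(u|s) = n(u, s) / n(s)

    print(prominence("Berlin"))               # 0.45
    print(default_sense("Berlin", "berlin"))  # 0.9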
Disambiguation
● Preliminary results:
  – With 155,000 randomly selected wikilink samples
  – Balance of common and less prominent concepts (default sense: 55.12%)
  – Highly ambiguous (random: 17.77%)

At I-Semantics 2011.
Disambiguation+NIL
● Named Entities Only
At TAC KBP 2011.
TAC KBP 2010: Default Sense 79.91%, Random 62.00%, Unambiguous 30.36%

NIL accuracy = 79.27%
Non-NIL accuracy = 87.88%
Overall accuracy = 82.71%
Disambiguation Difficulty

● Geopolitical entities KB: 830K entities
● 311 blog posts, 790 annotations
Geolocation Disamb. Eval. results
- Validates our measure of “difficulty” (performance degrades as difficulty increases)
- Shows that our system is more robust for disambiguating low-dominance entities
Dominance Analysis
Tagging
• Decide which spots to annotate with links to the disambiguated resources
• Different use cases have different needs
  – Only annotate prominent resources?
  – Only if you’re sure disambiguation is correct?
  – Only people?
  – Only things related to Berlin?
Tagging in DBpedia Spotlight
• Tagging needs are application/user-specific
• Can be configured based on:
  – Thresholds
    • Confidence
    • Prominence (support)
  – Whitelist or blacklist of types
    • Hide all people; show only organizations
  – Complex definition of a “type” through a SPARQL query
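These knobs surface directly in the public REST API. A sketch of a call to the /annotate service (the endpoint and the text, confidence, support, and types parameters follow the public Spotlight service; the values are arbitrary examples):

    # Sketch: calling DBpedia Spotlight's annotate service with tagging knobs.
    import requests

    resp = requests.get(
        "http://spotlight.dbpedia.org/rest/annotate",
        params={
            "text": "Berlin is the capital of Germany.",
            "confidence": 0.4,          # minimum disambiguation confidence
            "support": 20,              # minimum prominence (support)
            "types": "DBpedia:Place",   # whitelist of types to keep
        },
        headers={"Accept": "application/json"},
        timeout=30,
    )
    for res in resp.json().get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])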
Tagging Evaluation (News)
● Preliminary results
  – Able to approximate both the best precision and the best recall
  – Varying the parameters covers a wide range of the P/R trade-off
Tagging (Take home)
● Combines features from spotting, candidate selection and disambiguation
● More informed to make decisions
● Can avoid/fix some mistakes from previous steps
● Offers a chance to adapt to users' needs
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
Case Study: Audio Tagging (http://www.bbc.co.uk/programmes)
Example: Audio Transcript
whirlpool or not the b. b. c. witnessed when the jam and capital but then fell to the advancing bad timing in maine nineteen forty five the civilians living there feared to sing and violence steve athens hughes from one woman custom finds that tying it sit beginning of may nineteen forty five but then is being squeezed between the british americans from the west in the russian army from the east but sides fighting for every inch of land and forgets to this city is being pulverized ...
● BBC Audio Archive tag suggestion
Raimond & Lowis, LDOW2012.
Despite the errors, phrases such as “German capital” and “May 1945” can be recognized.
● Tags: Berlin, World War II, Russian Army, etc.
Scenario: Audio Transcript Tagging
[Adapted workflow diagram: an Audio Creator produces audio; a Textual Content Creator (System) produces the transcript, with no punctuation or capitalization and high token transcription error rates. The KBT System pipeline (Phrase Recognition, Candidate Selection, Disambiguation, Tagging), backed by the KB, is adapted in three ways: 1. contextual relatedness, 2. dictionary-based mention detection, 3. entity-type preference-based reranking. It produces automated tags, alongside an Editor's editorial tags.]
Tagging Audio Transcripts
● Traditional NER features are missing
  – Sentence boundaries, POS tags, 50% token error rate, etc.
● Lexicon-based lookup is also difficult
  – e.g., “big date” transcribed instead of “big data”
● Our approach:
  – On-the-fly adaptation
  – Skip spotting; focus on named entities
  – Preliminary results: TopN = 0.19 – 0.21
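As a rough illustration of the lookup problem and one mitigation (my sketch, not the system's implementation): exact dictionary lookup misses ASR-garbled phrases, while approximate matching can recover them.

    # Fuzzy lexicon lookup over noisy transcript phrases; illustrative only.
    import difflib

    LEXICON = ["big data", "berlin", "world war ii"]

    def fuzzy_spot(phrase, cutoff=0.8):
        # "big date" (an ASR error) still matches the entry "big data".
        matches = difflib.get_close_matches(phrase, LEXICON, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    print(fuzzy_spot("big date"))   # -> big data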
Case Study: Tweet NER
[Workflow diagram: a Creator posts a tweet; the KBT System (Phrase Recognition, Candidate Selection, Disambiguation, Tagging, backed by the KB) produces tags that are used as features by a retrained CRF recognizer, which outputs entity mentions.]
● NER challenges
  – Informal text, faulty grammar, misspellings, short texts, irregular capitalization, etc.
  – Segmentation is harder than classification
● Our approach:
  – Distant supervision from DBpedia
  – DBpedia Spotlight tagging used as features
Tweet NER Results
● KBT tags added as features to a linear-chain CRF tagger
● NER improves with distant supervision from KBT
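For illustration, a minimal sketch of injecting KBT tags as token features for a linear-chain CRF; sklearn-crfsuite stands in for whatever CRF implementation was used, and the feature names, tags, and toy tweet are made up:

    # KBT tags as CRF features; sklearn-crfsuite used for illustration.
    import sklearn_crfsuite

    def token_features(tokens, kbt_tags, i):
        return {
            "word.lower": tokens[i].lower(),
            "word.istitle": tokens[i].istitle(),
            "kbt.tag": kbt_tags[i],   # e.g. "DBpedia:Place" or "O", from Spotlight
        }

    tokens   = ["omg", "Berlin", "is", "awesome"]   # one toy tweet
    kbt_tags = ["O", "DBpedia:Place", "O", "O"]     # Spotlight-derived tags
    labels   = ["O", "B-LOC", "O", "O"]             # gold BIO labels

    X = [[token_features(tokens, kbt_tags, i) for i in range(len(tokens))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))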
Educational Material
● Emergency Management Training
  – “tags that summarize what happened”
  – “configuration parameters allowed removing tags that were ‘too general’”
Case Study: Smart Filtering

[Mendes et al. WI'2010 and Triplification Challenge 2010]

How to look for competitors? Retrieve microposts mentioning competitors.

Example microposts:
- Some User @someuser, 4 Nov: “At home I have an IPad and my bro has a Microsoft Surface.” (https://twitter.com/someuser/status/123)
- Another User @anotheruser, 5 Nov: “The Asus Transformer Infinity is actually quite nifty.” (https://twitter.com/anotheruser/status/456)

Annotations record which products each micropost mentions (IPad, Microsoft Surface, Asus Transformer Infinity); the Knowledge Base records which categories each product belongs to (category:Wi-Fi, category:Touchscreen). Smart filtering then selects microposts that mention products belonging to the same categories as the IPad, roughly:

  SELECT ?tweet ?product ?category
  WHERE {
    ?tweet   :mentions ?product .
    ?product :belongs  ?category .
    :IPad    :belongs  ?category .
  }
Case Study: Website Tagging
[Diagram: a Consumer with an Objective uses the KBT System, backed by the KB, to compute website similarity.]

Evaluation: retrieving similar sites
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
Conclusion
● The model enables cross-task evaluations
  – KE, NER, etc. can be reused for KBT but individually often do not suffice
● The model enables deeper evaluations (beyond “black box”)
  – Prescribes modularized evaluation to identify the steps that need improvement
  – Introduces and validates a measure of “difficulty to disambiguate”
● The system adapts well to very distinct use cases
Limitations

● What the proposed model is not:
  – A silver bullet for all problems
  – A substitute for machine learning, expert knowledge, or linguistics research
Extensions to DBpedia
● We extended DBpedia to enable KBT
● Created new extractors for the necessary data and statistics
● Multilinguality: a community process to maintain international chapters
● Results:
  – Data to power the computation of features necessary for adaptive KBT
  – Prominence, relevance, pertinence, types, etc.
  – All reusable by other systems that build on DBpedia
DBpedia Spotlight

Demo:
- http://spotlight.dbpedia.org/demo/

Web service:
- http://spotlight.dbpedia.org/rest/{component}

Components are exposed as services:
- Phrase Recognition (/spot)
- Disambiguation (/disambiguate)
- Top-K disambiguations (/candidates)
- Relatedness (/related)
- Annotation (/annotate)

Source code: https://github.com/dbpedia-spotlight/dbpedia-spotlight/ (Apache v2 license)
My Ph.D. in retrospect
[Research timeline:
- Genome databases: TcruziDB, Garsa, ProtozoaDB (Bioinformatics'05, NAR'06, NAR'08)
- Complex entity recognition and relationship extraction (EKAW'08, WI'08)
- Knowledge-driven query formulation: Cuadro, Cuebee, TcruziKB (ICSC'08)
- Knowledge-driven text exploration: Scooner (ACMSE'10, BIBM'10, IESD@HT'13)
- Real-time information exploration/filtering: Twarql, Twitris (WI'10, SWC'10, SFSW@ESWC'10, ISEM'10)
- Evolution (WebSci'10, WWW'12a)
- Linked Data: Sieve (SWJ'13, EvoDyn'12, WWW'12b, ISWC'12, EDBT'12)
- Cross-domain entity recognition and linking / knowledge base tagging, i.e. this dissertation (ISEM'11, TAC'11, KCAP'11, LREC'12a, LREC'12b, CIKM'12, MSM'13, ISEM'13)]
More thanks!
… and other mentors and collaborators (too many great people for one slide!)
References
Other publications

● Bioinformatics IE & querying
  – 1 Bioinformatics Journal
  – 2 Nucleic Acids Research Journal
  – 1 IEEE ICSC, 1 EKAW, 1 Web Intelligence
● Linked Data quality and fusion
  – 1 LWDM 2012 @ EDBT
● Book chapters
  – Semantic Search on the Web, with Bizer et al.
  – The People’s Web Meets NLP, with the OKF OWLG
Impact of my research

● scholar.google.com: 480+ citations, h-index = 12
● Best paper award at I-Semantics 2011
  – 174 citations (according to scholar.google.com)
  – 4+2 students on Google Summer of Code 2012+2013
  – About 6 open-sourced third-party clients
● Awarded first prize in:
  – Triplification Challenge 2010
  – Scripting for the Semantic Web Challenge 2010
● 37 publications
  – 9 conferences, 5 workshops/posters, 3 magazines (bioinformatics)
  – 2 book chapters
  – 3 workshop proceedings
Leadership and Community Involvement

● Co-organizer of the Web of Linked Entities workshop series
  – ISWC 2012 and WWW 2013
● Founder of the DBpedia Portuguese initiative, involving volunteers from 5 Brazilian universities
● Maintainer of 3 open source projects
  – Cuebee: query formulation for RDF
  – Twarql: streaming annotated microposts
  – DBpedia Spotlight: adaptive semantic annotation
● PC member for several conferences and workshops: ISWC, ESWC, LREC, LDOW, IJSWIS, LDL 2012, JWS, SWJ, etc.
● EU projects
  – Leading FUB's participation in PlanetData (FP7 Network of Excellence)
  – Research on LOD2 (FP7 IP) and the BIG Public-Private Forum