PhD defense held at Kno.e.sis Center, Wright State University, December 03, 2013.
Adaptive Semantic Annotation of Entity and Concept Mentions in Text
Pablo N. Mendes
PhD dissertation defense
Ohio Center of Excellence in Knowledge-enabled Computing (kno.e.sis)
Wright State University, Dayton, OH
Introductions and Thank you!
Outline
● Introduction, Motivation, Background
  – KB Tagging, Annotation as a Service
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
  – tweets, audio transcripts, educational material
Outline
● Introduction, Motivation, Background
  – KBT: Knowledge Base Tagging of Text
  – AaaS: Annotation as a Service
  – Adaptability
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
KBT, informally
● Knowledge Base Tagging (KBT)
● A developer needs to
– “extract entities”,
– “identify what is mentioned”,
– “connect to knowledge bases”.
● He/she is not an NLP or IE expert
● Would like to reuse as much as possible
● May have limited computational resources
→ Annotation as a Service (AaaS)
Example: the same passage processed by different extraction tasks.

On Thursday, April 11, 1996, a fire in an occupied passenger terminal at the airport in Düsseldorf, Germany, killed 17 people and injured 62. The fire began at approximately 3:31 p.m., about the time someone reported seeing sparks falling from the ceiling in the vicinity of a flower shop at the east end of the arrivals hall on the first floor.

● Named Entity Recognition (NER): DATE, TIME, and LOCATION mentions, e.g. “Thursday, April 11, 1996” (DATE), “3:31 p.m.” (TIME), “Düsseldorf”, “Germany” (LOCATION)
● Keyphrase Extraction (KE): “fire”, “sparks”, “passenger terminal”, “ceiling”, “arrivals hall”
● Automatic Term Recognition (ATR): domain terms such as “airport”, “passenger terminal”
● Wikification (WKF) / Entity Linking (EL): “Düsseldorf” → LOCATION, KB ID:4213421
Related Work
[Related-work landscape chart. Axes: syntactic vs. semantic; domain-specific Web content and auto-extracted facts vs. community-generated, cross-domain, multilingual knowledge. Syntactic tasks and systems: NER, KE, ATR, Wikification (Illinois Wikifier, TagMe, AIDA/Yago). Earlier semantic systems: Voquette SCORE, Semagix Freedom, SemTag. My work sits in the semantic, community-generated, cross-domain, multilingual corner.]
Related Work (commercial)
Adaptability
Inputs vary: news, scientific literature, tweets, audio transcripts, query keywords.
Desired outputs vary: new terms, named entities, important phrases, concepts related to an objective.

● Each developer may have a different application in mind
  – different input and output
  – “get key topics for summarization?”
  – “exhaustive tagging for semantic search?”
● There is no one-size-fits-all.
● But can we support adaptation to different “fits”?
Requirements
● Transparent process
  – Clear understanding of where things are working or failing
● Adaptable process
  – Ability to exchange individual components in order to achieve different goals
  – Ability to modify the behavior of existing components
● Adaptable to different inputs
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
A Conceptual Model of KBT

[Model diagram: a User (Creator) produces text (the Düsseldorf fire passage, tagged DATE and LOCATION); the System runs the pipeline Phrase Recognition, Candidate Selection, Disambiguation, Tagging over the KB and emits Annotations (e.g., Spark_(fire) with score 0.87) for a User (Consumer) pursuing an Objective; an Editor reviews annotations, and feedback flows back into the System.]
KBT and Related Tasks
Extraction task outcomes, by task (x = provided, / = partial):

Outcome                      | KE          | NER | EL      | WSD | WKF | ATR           | KBT
Recognize known terms        |             |     |         |     | x   |               | x
Recognize new terms          | x           | x   | x (NIL) |     |     | x             | x
Classify ontological type    |             | x   | /       |     |     |               | x
Resolve ambiguity            |             |     | x       | x   | x   |               | x
Measure importance/relevance | x (to text) |     |         |     | x   | x (to domain) | x
Tag each occurrence          |             | x   | x       |     |     |               | x
Novelty in the model
● Users and objective are explicit in the model
  – Knowledge about content creators provides context for new types of KBT
  – Knowledge about the consumer and objective allows customizing output
  – Feedback is used to learn from mistakes
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
Wikipedia Extraction
Knowledge Base
● DBpedia is a cross-domain KB extracted from Wikipedia [Auer et al. 2007, Bizer et al. 2009]
  – Describes 3.7M things through 400M facts
  – Uses an ontology of 320 classes and 1,650 properties
● DBpedia Live keeps DBpedia up-to-date with Wikipedia changes [Hellmann et al. 2009, Morsey et al. 2012]
● A whole ecosystem with an active community [Lehmann et al. 2013]
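Since DBpedia is itself queryable, a KBT developer can inspect the KB directly. A minimal sketch using the SPARQLWrapper library against the public endpoint (the query and limit are illustrative, not from the slides):

    # Sketch: querying DBpedia's public SPARQL endpoint (illustrative).
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://dbpedia.org/sparql")
    # Ask for a few entities typed as dbo:Airport in the DBpedia ontology.
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        SELECT ?airport WHERE { ?airport a dbo:Airport } LIMIT 5
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    for row in results["results"]["bindings"]:
        print(row["airport"]["value"])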
DBpedia Extraction Framework
[with Lehmann et al. @ SWJ 2013]
Added new extractors to support KBT:
- Thematic concepts
- Topical signatures
- Distributional Semantic Model statistics for semantic relatedness
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base: DBpedia
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
System: default workflow
● Phrase Recognition:
  – mention recognition (e.g., NER)
● Candidate Selection:
  – detecting possible senses for a surface form
● Disambiguation:
  – choosing (ranking/classifying) one sense for a mention
● Tagging:
  – deciding whether to annotate, to account for entities not in the KB or uninformative annotations.
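To make the default workflow concrete, here is a toy sketch of the four stages (my illustration, not the system's actual code; the lexicon and scores are made up):

    # Toy sketch of the four-stage KBT workflow (illustrative only).
    LEXICON = {"new york": ["New_York_(magazine)", "Manhattan", "New_York_City"]}
    SCORE   = {"New_York_City": 0.67, "Manhattan": 0.22, "New_York_(magazine)": 0.10}

    def recognize_phrases(text):
        # Phrase Recognition: naive lexicon lookup over the lowercased text.
        return [sf for sf in LEXICON if sf in text.lower()]

    def select_candidates(surface_form):
        # Candidate Selection: all KB resources known for this surface form.
        return LEXICON.get(surface_form, [])

    def disambiguate(candidates):
        # Disambiguation: pick the highest-scoring sense (the real system
        # scores contextual relatedness; here the scores are fixed).
        return max(candidates, key=lambda uri: SCORE.get(uri, 0.0))

    def tag(uri, confidence=0.5):
        # Tagging: annotate only if the score clears a confidence threshold.
        return SCORE.get(uri, 0.0) >= confidence

    for sf in recognize_phrases("Lennon and McCartney went to New York."):
        best = disambiguate(select_candidates(sf))
        if tag(best):
            print(sf, "->", best)   # new york -> New_York_City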
(…) Upon their return, Lennon and McCartney went to New York to announce the formation of Apple Corps.

Phrase Recognition spots “New York”; Candidate Selection retrieves candidate senses, which Disambiguation scores by contextual relatedness:

  New York (magazine) 0.10
  New York 0.34
  Manhattan 0.22
  Province of New York 0.23
  New York City 0.67
  New York, New York (film) 0.45
  New York metropolitan area 0.56
  West New York, New Jersey 0.01
  Roman Catholic Archdiocese of New York 0.33
  Pennsylvania Station (New York City) 0.07

Disambiguation picks New York City; Tagging emits the annotation: “New York” (type: city, pos: 78, relevance: 0.67, ...).
A quick example

[Demo screenshots: an annotated text where “Show Top-K Candidates” lists alternatives for an ambiguous mention, e.g. LSU_Tigers vs. Louisiana State University.]
Virtuous Cycle
[with Héder @ WWW'2012]
Through the Sztakipedia toolbar, DBpedia Spotlight suggests links to Wikipedia editors, which catalyzes the evolution of the knowledge source.

The /feedback service:
- allows users to submit judgements
- enables system evolution with feedback
- also works on blogs, etc., with RDFaCE [Khalili]
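A sketch of what a feedback submission could look like; the endpoint path follows the /feedback service named above, but the parameter names below are hypothetical, not the documented API:

    # Hypothetical feedback submission; parameter names are illustrative.
    import requests

    requests.post(
        "http://spotlight.dbpedia.org/rest/feedback",
        data={
            "text": "Lennon and McCartney went to New York.",
            "surface_form": "New York",
            "entity_uri": "http://dbpedia.org/resource/New_York_City",
            "feedback": "correct",   # a user judgement: correct / incorrect
        },
        timeout=10,
    )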
Contextual relatedness score: TF*ICF
“Washington” has multiple candidate senses, each with a set W of context words:
- Washington, DC: W = {“capital”, “USA”, ...}
- George Washington: W = {“president”, “USA”, ...}
- Washington State: W = {“Seattle”, “USA”, ...}

ICF(“Washington”, “USA”) < ICF(“Washington”, “Seattle”)

[Mendes et al. @ ISEM2011]

TF*IDF (Term Frequency * Inverse Document Frequency):
- TF: relevance of a word in the context of a DBpedia resource
- IDF: words that are too common are less useful

ICF (Inverse Candidate Frequency), entropy-inspired:
- ICF is the rarity of a word relative to the possible senses: “USA” appears in the context of all three candidates, so it discriminates less than “Seattle”.
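A minimal sketch of the idea (my illustration of TF*ICF; the candidate contexts and counts are made up): ICF weighs a context word by how few of the surface form's candidate senses contain it, so “Seattle” discriminates among the Washingtons while “USA” does not.

    # Illustrative TF*ICF sketch; the data is made up.
    import math
    from collections import Counter

    # Context-word counts for each candidate sense of "Washington".
    CONTEXTS = {
        "Washington,_D.C.":   Counter({"capital": 12, "USA": 30}),
        "George_Washington":  Counter({"president": 25, "USA": 28}),
        "Washington_(state)": Counter({"Seattle": 18, "USA": 22}),
    }

    def icf(word):
        # Inverse Candidate Frequency: log(|candidates| / |candidates containing word|).
        n_with_word = sum(1 for ctx in CONTEXTS.values() if word in ctx)
        return math.log(len(CONTEXTS) / n_with_word) if n_with_word else 0.0

    def tf_icf(candidate, context_words):
        # Score a candidate by summing TF * ICF over the paragraph's words.
        ctx = CONTEXTS[candidate]
        return sum(ctx[w] * icf(w) for w in context_words if w in ctx)

    assert icf("USA") < icf("Seattle")   # the inequality from the slide
    print(tf_icf("Washington_(state)", ["Seattle", "USA"]))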
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
Core Evaluations
Phrase Recognition Results
At LREC'2012.

Spotting strategies compared:
- (L) Lexicon-based
- (LNP*) Lexicon-based with at least one noun
- (NPL) Noun phrases, lexicon lookup (Bloom filter)
- (CW) Lexicon-based, removing common words
- (Kea) Keyphrases
- (NER) Named entities only
- (NER ∪ NP) N-grams within noun phrases and NEs

Policies: S = { s | p(s) > cutoff_S }

Different spotting strategies evaluated on the CSAW dataset. Take home:
- It is not only about importance/relevance
- Precision is less critical here: it is taken care of in downstream steps
- Recall is key: a phrase missed at this stage is an overall failure
- Simple methods work quite well
- Combinations of techniques improve results
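For reference, the simplest lexicon-based strategy (L) amounts to scanning the text's n-grams against a surface-form dictionary. A toy sketch (the variants above add noun-phrase constraints, Bloom-filter lookup, etc.; the lexicon entries here are made up):

    # Toy lexicon-based spotting; illustrative only.
    LEXICON = {"new york", "apple corps", "lennon"}

    def spot(text, max_len=3):
        # Check every n-gram up to max_len tokens against the lexicon.
        tokens = text.lower().split()
        spots = []
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + 1 + max_len, len(tokens) + 1)):
                phrase = " ".join(tokens[i:j])
                if phrase in LEXICON:
                    spots.append((phrase, i, j))
        return spots

    print(spot("Lennon and McCartney went to New York"))
    # [('lennon', 0, 1), ('new york', 5, 7)]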
Context-Independent Strategies
● NAÏVE
  – Use the surface form to build a URI: “berlin” → dbpedia:Berlin
● PROMINENCE
  – P(u) = n(u) / N (the ‘popularity’/importance of this URI)
    ● n(u): number of times URI u occurred
    ● N: total number of occurrences
  – Intuition: URIs that have appeared often are more likely to appear again
● DEFAULT SENSE
  – P(u|s) = n(u,s) / n(s)
    ● n(u,s): number of times URI u occurred with surface form s
  – Intuition: some surface forms are strongly associated with specific URIs
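A toy sketch of the PROMINENCE and DEFAULT SENSE priors, estimated from (surface form, URI) wikilink pairs (the counts are made up):

    # Estimating P(u) and P(u|s) from wikilink occurrence counts; toy data.
    from collections import Counter

    pairs = ([("berlin", "Berlin")] * 90 + [("berlin", "Berlin_(band)")] * 10
           + [("new york", "New_York_City")] * 55 + [("new york", "New_York")] * 45)

    n_u  = Counter(uri for _, uri in pairs)   # n(u)
    n_s  = Counter(sf for sf, _ in pairs)     # n(s)
    n_us = Counter(pairs)                     # n(u, s)
    N    = len(pairs)

    def prominence(uri):
        return n_u[uri] / N                   # P(u) = n(u) / N

    def default_sense(uri, sf):
        return n_us[(sf, uri)] / n_s[sf]      # P(u|s) = n(u, s) / n(s)

    print(prominence("Berlin"))               # 0.45
    print(default_sense("Berlin", "berlin"))  # 0.9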
Disambiguation
● Preliminary results:
  – With 155,000 randomly selected wikilink samples
  – Balance of common and less prominent concepts (default sense: 55.12%)
  – Highly ambiguous (random: 17.77%)

At I-Semantics 2011.
Disambiguation+NIL
● Named Entities Only
At TAC KBP 2011.
TAC KBP 2010: Default Sense 79.91%, Random 62.00%, Unambiguous 30.36%

NIL accuracy = 79.27%
Non-NIL accuracy = 87.88%
Overall accuracy = 82.71%
Disambiguation Difficulty

● Geopolitical entities KB: 830K entities
● 311 blog posts, 790 annotations
Geolocation Disamb. Eval. results
- Validates our measure of “difficulty” (performance degrades as difficulty increases)
- Shows that our system is more robust for disambiguating low-dominance entities
Dominance Analysis
Tagging
• Decide which spots to annotate with links to the disambiguated resources
• Different use cases have different needs
  – Only annotate prominent resources?
  – Only if you’re sure disambiguation is correct?
  – Only people?
  – Only things related to Berlin?
Tagging in DBpedia Spotlight
• Tagging needs are application/user-specific
• Can be configured based on:
  – Thresholds
    • Confidence
    • Prominence (support)
  – Whitelist or blacklist of types
    • Hide all people; show only organizations
  – Complex definition of a “type” through a SPARQL query
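These knobs surface directly in the public REST API. A sketch of a call to the /annotate service (the endpoint and the text, confidence, support, and types parameters follow the public Spotlight service; the values are arbitrary examples):

    # Sketch: calling DBpedia Spotlight's annotate service with tagging knobs.
    import requests

    resp = requests.get(
        "http://spotlight.dbpedia.org/rest/annotate",
        params={
            "text": "Berlin is the capital of Germany.",
            "confidence": 0.4,          # minimum disambiguation confidence
            "support": 20,              # minimum prominence (support)
            "types": "DBpedia:Place",   # whitelist of types to keep
        },
        headers={"Accept": "application/json"},
        timeout=30,
    )
    for res in resp.json().get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])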
Tagging Evaluation (News)
● Preliminary results
  – Able to approximate both the best precision and the best recall
  – Varying the parameters covers a wide range of the P/R trade-off
Tagging (Take home)
● Combines features from spotting, candidate selection and disambiguation
● More informed to make decisions
● Can avoid/fix some mistakes from previous steps
● Offers a chance to adapt to users' needs
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
Case Study: Audio Tagging (http://www.bbc.co.uk/programmes)
Example: Audio Transcript
whirlpool or not the b. b. c. witnessed when the jam and capital but then fell to the advancing bad timing in maine nineteen forty five the civilians living there feared to sing and violence steve athens hughes from one woman custom finds that tying it sit beginning of may nineteen forty five but then is being squeezed between the british americans from the west in the russian army from the east but sides fighting for every inch of land and forgets to this city is being pulverized ...
● BBC Audio Archive tag suggestion
Raimond & Lowis, LDOW2012.
Despite the errors, phrases such as “German capital” and “May 1945” can be recognized.
● Tags: Berlin, World War II, Russian Army, etc.
Scenario: Audio Transcript Tagging
[Adapted workflow diagram: an Audio Creator produces audio; a Textual Content Creator (System) produces the transcript, with no punctuation or capitalization and high token transcription error rates. The KBT System pipeline (Phrase Recognition, Candidate Selection, Disambiguation, Tagging), backed by the KB, is adapted in three ways: 1. contextual relatedness, 2. dictionary-based mention detection, 3. entity-type preference-based reranking. It produces automated tags, alongside an Editor's editorial tags.]
Tagging Audio Transcripts
● Traditional NER features are missing
  – Sentence boundaries, POS tags, 50% token error rate, etc.
● Lexicon-based lookup is also difficult
  – e.g., “big date” transcribed instead of “big data”
● Our approach:
  – On-the-fly adaptation
  – Skip spotting; focus on named entities
  – Preliminary results: TopN = 0.19 – 0.21
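As a rough illustration of the lookup problem and one mitigation (my sketch, not the system's implementation): exact dictionary lookup misses ASR-garbled phrases, while approximate matching can recover them.

    # Fuzzy lexicon lookup over noisy transcript phrases; illustrative only.
    import difflib

    LEXICON = ["big data", "berlin", "world war ii"]

    def fuzzy_spot(phrase, cutoff=0.8):
        # "big date" (an ASR error) still matches the entry "big data".
        matches = difflib.get_close_matches(phrase, LEXICON, n=1, cutoff=cutoff)
        return matches[0] if matches else None

    print(fuzzy_spot("big date"))   # -> big data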
Case Study: Tweet NER
[Workflow diagram: a Creator posts a tweet; the KBT System (Phrase Recognition, Candidate Selection, Disambiguation, Tagging, backed by the KB) produces tags that are used as features by a retrained CRF recognizer, which outputs entity mentions.]
● NER challenges
  – Informal text, faulty grammar, misspellings, short texts, irregular capitalization, etc.
  – Segmentation is harder than classification
● Our approach:
  – Distant supervision from DBpedia
  – DBpedia Spotlight tagging used as features
Tweet NER Results
● KBT tags added as features to a linear-chain CRF tagger
● NER improves with distant supervision from KBT
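For illustration, a minimal sketch of injecting KBT tags as token features for a linear-chain CRF; sklearn-crfsuite stands in for whatever CRF implementation was used, and the feature names, tags, and toy tweet are made up:

    # KBT tags as CRF features; sklearn-crfsuite used for illustration.
    import sklearn_crfsuite

    def token_features(tokens, kbt_tags, i):
        return {
            "word.lower": tokens[i].lower(),
            "word.istitle": tokens[i].istitle(),
            "kbt.tag": kbt_tags[i],   # e.g. "DBpedia:Place" or "O", from Spotlight
        }

    tokens   = ["omg", "Berlin", "is", "awesome"]   # one toy tweet
    kbt_tags = ["O", "DBpedia:Place", "O", "O"]     # Spotlight-derived tags
    labels   = ["O", "B-LOC", "O", "O"]             # gold BIO labels

    X = [[token_features(tokens, kbt_tags, i) for i in range(len(tokens))]]
    y = [labels]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(X, y)
    print(crf.predict(X))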
Educational Material
● Emergency Management Training
  – “tags that summarize what happened”
  – “configuration parameters allowed removing tags that were ‘too general’”
Case Study: Smart Filtering

[Mendes et al. WI'2010 and Triplification Challenge 2010]

How to look for competitors? Retrieve microposts mentioning competitors.

Example microposts:
- Some User @someuser, 4 Nov: “At home I have an IPad and my bro has a Microsoft Surface.” (https://twitter.com/someuser/status/123)
- Another User @anotheruser, 5 Nov: “The Asus Transformer Infinity is actually quite nifty.” (https://twitter.com/anotheruser/status/456)

Annotations record which products each micropost mentions (IPad, Microsoft Surface, Asus Transformer Infinity); the Knowledge Base records which categories each product belongs to (category:Wi-Fi, category:Touchscreen). Smart filtering then selects microposts that mention products belonging to the same categories as the IPad, roughly:

  SELECT ?tweet ?product ?category
  WHERE {
    ?tweet   :mentions ?product .
    ?product :belongs  ?category .
    :IPad    :belongs  ?category .
  }
Case Study: Website Tagging
[Diagram: a Consumer with an Objective uses the KBT System, backed by the KB, to compute website similarity.]

Evaluation: retrieving similar sites
Outline
● Introduction, Motivation, Background
● Conceptual Model
● Knowledge Base
● System: DBpedia Spotlight
● Core Evaluations
● Case Studies
● Conclusion
Conclusion
● The model enables cross-task evaluations
  – KE, NER, etc. can be reused for KBT but individually often do not suffice
● The model enables deeper evaluations (beyond “black box”)
  – Prescribes modularized evaluation to identify the steps that need improvement
  – Introduces and validates a measure of “difficulty to disambiguate”
● The system adapts well to very distinct use cases
Limitations

● What the proposed model is not:
  – A silver bullet for all problems
  – A substitute for machine learning, expert knowledge, or linguistics research
Extensions to DBpedia
● We extended DBpedia to enable KBT
● Created new extractors for the necessary data and statistics
● Multilinguality: a community process to maintain international chapters
● Results:
  – Data to power the computation of features necessary for adaptive KBT
  – Prominence, relevance, pertinence, types, etc.
  – All reusable by other systems that build on DBpedia
DBpedia Spotlight

Demo:
- http://spotlight.dbpedia.org/demo/

Web service:
- http://spotlight.dbpedia.org/rest/{component}

Components are exposed as services:
- Phrase Recognition (/spot)
- Disambiguation (/disambiguate)
- Top-K disambiguations (/candidates)
- Relatedness (/related)
- Annotation (/annotate)

Source code: https://github.com/dbpedia-spotlight/dbpedia-spotlight/ (Apache v2 license)
My Ph.D. in retrospect
[Research timeline:
- Genome databases: TcruziDB, Garsa, ProtozoaDB (Bioinformatics'05, NAR'06, NAR'08)
- Complex entity recognition and relationship extraction (EKAW'08, WI'08)
- Knowledge-driven query formulation: Cuadro, Cuebee, TcruziKB (ICSC'08)
- Knowledge-driven text exploration: Scooner (ACMSE'10, BIBM'10, IESD@HT'13)
- Real-time information exploration/filtering: Twarql, Twitris (WI'10, SWC'10, SFSW@ESWC'10, ISEM'10)
- Evolution (WebSci'10, WWW'12a)
- Linked Data: Sieve (SWJ'13, EvoDyn'12, WWW'12b, ISWC'12, EDBT'12)
- Cross-domain entity recognition and linking / knowledge base tagging, i.e. this dissertation (ISEM'11, TAC'11, KCAP'11, LREC'12a, LREC'12b, CIKM'12, MSM'13, ISEM'13)]
More thanks!
… and other mentors and collaborators (too many great people for one slide!)
References
Other publications

● Bioinformatics IE & querying
  – 1 Bioinformatics Journal
  – 2 Nucleic Acids Research Journal
  – 1 IEEE ICSC, 1 EKAW, 1 Web Intelligence
● Linked Data quality and fusion
  – 1 LWDM 2012 @ EDBT
● Book chapters
  – Semantic Search on the Web, with Bizer et al.
  – The People’s Web Meets NLP, with the OKF OWLG
Impact of my research

● scholar.google.com: 480+ citations, h-index = 12
● Best paper award at I-Semantics 2011
  – 174 citations (according to scholar.google.com)
  – 4+2 students on Google Summer of Code 2012+2013
  – About 6 open-sourced third-party clients
● Awarded first prize in:
  – Triplification Challenge 2010
  – Scripting for the Semantic Web Challenge 2010
● 37 publications
  – 9 conferences, 5 workshops/posters, 3 magazines (bioinformatics)
  – 2 book chapters
  – 3 workshop proceedings
Leadership and Community Involvement

● Co-organizer of the Web of Linked Entities workshop series
  – ISWC 2012 and WWW 2013
● Founder of the DBpedia Portuguese initiative, involving volunteers from 5 Brazilian universities
● Maintainer of 3 open source projects
  – Cuebee: query formulation for RDF
  – Twarql: streaming annotated microposts
  – DBpedia Spotlight: adaptive semantic annotation
● PC member for several conferences and workshops: ISWC, ESWC, LREC, LDOW, IJSWIS, LDL 2012, JWS, SWJ, etc.
● EU projects
  – Leading FUB's participation in PlanetData (FP7 Network of Excellence)
  – Research on LOD2 (FP7 IP) and the BIG Public-Private Forum