
Undefined 1 (2009) 1–5
IOS Press

Lessons Learnt from the Named Entity rEcognition and Linking (NEEL) Challenge Series
Emerging Trends in Mining Semantics from Tweets

Giuseppe Rizzo a,∗, Bianca Pereira b, Andrea Varga c, Marieke van Erp d and Amparo Elizabeth Cano Basave e

a ISMB, Turin, Italy. E-mail: [email protected]
b Insight Centre for Data Analytics, NUI Galway, Ireland. E-mail: [email protected]
c The Content Group, Godalming, UK. E-mail: [email protected]
d Vrije Universiteit Amsterdam, Netherlands. E-mail: [email protected]
e Cube Global, United Kingdom. E-mail: [email protected]

Abstract. The large number of tweets generated daily provides decision makers with a means to obtain insights into recent events around the globe in near real-time. The main barrier to extracting such insights is that manually inspecting such a diverse and dynamic amount of information is impossible. This problem has attracted the attention of industry and research communities, resulting in algorithms for the automatic extraction of semantics from tweets and for linking them to machine-readable resources. While a tweet is superficially comparable to any other textual content, it hides a complex and challenging structure that requires domain-specific computational approaches for mining semantics from it. The NEEL challenge series, established in 2013, has contributed to the collection of emerging trends in the field and to the definition of standardised benchmark corpora for entity recognition and linking in tweets, ensuring high quality labelled data that facilitates comparisons between different approaches. This article reports the findings and lessons learnt from the challenge series through an analysis of the specific characteristics of the created corpora and their limitations, the approaches of the different participants, and pointers for furthering the field of entity recognition and linking in tweets.

Keywords: Microposts, Named Entity Recognition, Named Entity Linking, Disambiguation, Knowledge Base, Evaluation, Challenge

1. Introduction

Tweets have been proven to be useful in different applications and contexts such as music recommendation, spam detection, emergency response, market analysis, and decision making. The limited number of tokens in a tweet however implies a lack of sufficient contextual information necessary for understanding its content. A commonly used approach is to extract named entities, which are information units such

*Corresponding author. E-mail: [email protected]

as the names of a Person or an Organisation, a Location, a Brand, a Product, or a numerical expression including Time, Date, Money and Percent found in a sentence [39], and enrich the content of the tweet with such information. In the context of the NEEL challenge series, we extended this definition of named entity to a phrase representing the name (excluding the preceding definite article, i.e. “the”, and any other pre-posed, e.g. “Dr”, “Mr”, or post-posed modifiers) that belongs to a class of the NEEL Taxonomy (Appendix A) and is linked to a DBpedia resource. Semantically enriched tweets have been shown to help

0000-0000/09/$00.00 © 2009 – IOS Press and the authors. All rights reserved


addressing complex information seeking behaviour in social media, such as semantic search [86], deriving user interests [88], and disaster detection [87].

The automated identification, classification and linking of named entities has proven to be challenging due to, among other things, the inherent characteristics of tweets: i) the restricted length and ii) the noisy lexical nature, i.e. terminology differs between users when referring to the same thing, and non-standard abbreviations are common. Numerous initiatives have contributed to progress in the field, broadly covering different types of textual content (and thus going beyond the boundaries of tweets). For example, TAC-KBP [46] has established a yearly challenge covering newswire, websites, and discussion forum posts, ERD [50] focuses on search queries, and SemEval [51] on technical manuals and reports.

The NEEL challenge series, established in 2013 and running yearly since then, has captured a community need for making sense of tweets through a wealth of high quality annotated corpora and for monitoring the emerging trends in the field. The first edition of the challenge, named the Concept Extraction (CE) Challenge [1], focused on entity identification and classification. A step further in this task is to ground entities in tweets by linking them to knowledge base referents. This prompted the Named Entity Extraction and Linking (NEEL) Challenge the following year [2]. These two research avenues, which add to the intrinsic complexity of the tasks proposed in 2013 and 2014, prompted the Named Entity rEcognition and Linking (NEEL) Challenge in 2015 [3]. In 2015, the role of the named entity type in the grounding process was investigated, as well as the identification of named entities that cannot be grounded because they do not have a knowledge base referent (defined as NIL). The English DBpedia 2014 dataset was the designated referent knowledge base for the 2015 NEEL challenge, and the evaluation was performed by automatically querying Web APIs prepared by the participants, which also made it possible to measure computing time. The 2016 edition [4] consolidated the 2015 edition, using the English DBpedia 2015-04 version as referent knowledge base. This edition proposed an offline evaluation, where computing time was not taken into account in the final evaluation.

The four challenges have published four incremental manually labeled benchmark corpora. The creation of the corpora followed strict guidelines and protocols, resulting in high quality labeled data that can be used as seeds for reasoning and supervised approaches.

Despite these protocols, the corpora have strengths and weaknesses that we have discovered over the years; these are discussed in this article.

The purpose of each challenge was to set up an open and competitive environment that would encourage participants to deliver novel approaches, or improve on existing ones, for recognising and linking entities in tweets to either a referent knowledge base entry or NIL where such an entry does not exist. From the first challenge (in 2013) to the 2016 NEEL challenge, thirty research teams have submitted at least one entry to the competitions, proposing state-of-the-art approaches. More than three hundred teams have explicitly acquired the corpora over the four years, underlining the importance of the challenges in the field.1 The NEEL challenges have also attracted strong interest from industry, both as participants and as sponsors. For example, in 2013 and 2015 the best performing systems were proposed by industrial participants. The prizes were sponsored by industry (ebay2 in 2013 and SpazioDati3 in 2015) and research projects (LinkedTV4 in 2014, and FREME5 in 2016). The NEEL challenge also triggered the interest of localised challenges such as NEEL-IT, the NEEL challenge for tweets written in Italian [89], which brings a multilingual aspect to the NEEL contest.

This paper reports on the findings and lessons learnt from the last four years of NEEL challenges, analysing the corpora in detail, highlighting their limitations, and distilling guidance from the top performing approaches proposed by the different participants. The resulting body of work has implications for researchers, application designers and social media engineers who wish to harvest information from tweets for their own objectives. The remainder of this paper is structured as follows: in Section 2 we introduce a comparison with recent shared tasks in entity recognition and linking and underline the reasons that prompted the establishment of the NEEL challenge series. Next, in Section 3, the decisions regarding the different versions of the NEEL challenge are introduced and the initiative is compared against the other shared tasks. We then detail the steps followed in generating the four different corpora in Section 4, followed by a

1 This number does not account for the teams who experimented with the corpora outside the challenges' timeline.

2 http://www.ebay.com
3 http://www.spaziodati.eu
4 http://www.linkedtv.eu
5 http://freme-project.eu/


quantitative and qualitative analysis of the corpora in Section 5. We then list the different approaches presented and narrow down the emerging trends in Section 6, grounding the trends according to the evaluation strategies presented in Section 7. Section 8 reports the participants' results and provides an error analysis. We conclude and list our future activities in Section 9. Appendix A provides the NEEL Taxonomy and Appendix B the NEEL Challenge Annotation Guidelines.

2. Entity Linking Background

The first research challenge to identify the importance of the recognition of entities in textual documents was held in 1997 during the 7th Message Understanding Conference (MUC-7) [45]. In this challenge, the term named entity was used for the first time to represent terms in text that refer to instances of classes such as Person, Location, and Organisation. Since then, named entities have become a key aspect in different research domains, such as Information Extraction, Natural Language Processing, Machine Learning, and the Semantic Web.

Recognising an entity in a textual document was the first big challenge, but after overcoming this obstacle, the research community moved on to a second, more challenging task: disambiguating entities. This problem appears when a mention in text may refer to more than one entity. For instance, the mention Paul appearing in text may refer to the singer Paul McCartney, to the actor Paul Walker, or to any of the millions of people called Paul around the world. In the same manner, Copacabana can be a mention of the beach in Rio de Janeiro, Brazil, or the beach in Dubrovnik, Croatia. The problem of ambiguity translates into the question "which exact entity does the mention in text refer to?". To solve this problem, recognising the mention to an entity in text is only the first step in the semantic processing of textual documents; the next one is to ground the mention to an unambiguous representation of the same entity in a knowledge base. This task became known in the research community as Entity Disambiguation.

The Entity Disambiguation task became popular after Bunescu and Pasca [41] in 2006 explored the use of an encyclopaedia as a source of entities, and in particular after [42] demonstrated the benefit of using Wikipedia,6

6 http://en.wikipedia.org

a free crowd-sourced encyclopaedia, for this purpose. The reason why encyclopedic knowledge is important is that an encyclopaedia contains representations of entities in a variety of domains and, moreover, contains a single representation for each entity along with a symbolic or textual description. Therefore, from 2006 until 2009, there were two main areas of research: Entity Recognition, as a legacy of the work started during the MUC-7 challenge, and Entity Disambiguation, exploring encyclopedic knowledge bases as catalogues of entities.

In 2009, the TAC-KBP challenge [46] introduced a new problem to both the Entity Recognition and Entity Disambiguation communities. In Entity Recognition, the mention is recognised in text without information about the exact entity that is referred to by the mention. On the other hand, Entity Disambiguation focuses only on the resolution of entities that have a referent in a provided knowledge base. The TAC-KBP challenge illustrated the problem that a mention identified in text may not have a referent entity in the knowledge base. In this case, the suggestion was to link such a mention to a NIL entity in order to indicate that it is not present in the knowledge base. This problem was referred to as Named Entity Linking and it is still a hard and current research problem. Nowadays, however, the terms Entity Disambiguation and Entity Linking have become interchangeable.

Since the TAC-KBP challenge, there has been an explosion in the number of algorithms proposed for Entity Linking using a variety of textual documents, Knowledge Bases, and even different definitions of entities. This variety, whilst beneficial, also extends to how approaches are evaluated, regarding the metrics and gold standard datasets used. Such diversity makes it difficult to perform comparisons between various Entity Linking algorithms and creates the need for benchmark initiatives.

In this section, we first introduce the main components of the Entity Linking task and their possible variations, followed by a typical workflow used to approach the task, the expected output of each step and three strategies for the evaluation of Entity Linking systems. We conclude with an overview of benchmark initiatives and their decisions regarding the use of Entity Linking components and evaluation strategies.

2.1. Entity Linking components

Entity Linking is defined as the task of grounding entity mentions (i.e. phrases) in textual documents with


knowledge base entries, in which both mention and knowledge base entry are recognised as references to the same entity. If there is no knowledge base entry to ground a specific entity mention then the mention should be linked to a NIL reference instead.
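To make this definition concrete, the following minimal sketch shows one way to represent a single Entity Linking decision in code. It is illustrative only: the field names are ours, not part of any challenge specification.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LinkedMention:
    """One Entity Linking decision for a mention found in a document."""
    text: str                # surface form as it appears in the document
    start: int               # character offset where the mention begins
    end: int                 # character offset where the mention ends (exclusive)
    kb_entry: Optional[str]  # knowledge base identifier, or None for a NIL link

    @property
    def is_nil(self) -> bool:
        return self.kb_entry is None

# One mention grounded in the knowledge base, one linked to NIL.
mentions = [
    LinkedMention("Adele", 0, 5, "http://dbpedia.org/resource/Adele"),
    LinkedMention("my neighbour's band", 20, 39, None),
]
for m in mentions:
    print(m.text, "->", m.kb_entry or "NIL")
```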

In this definition, Entity Linking contains three main components: text, knowledge base, and entity. The features of each component may vary and, consequently, have an impact on the results of algorithms used to perform the task. For instance, a state-of-the-art solution based on long textual documents may have a poor performance when evaluated over short documents with little contextual information within the text. In a similar manner, a solution developed to link entities of types Person, Location, and Organisation may not be able to link entities of type Movie. Therefore, the choice of each component defines which type of solutions are being evaluated by each specific benchmark initiative.

2.1.1. Textual Document
In Entity Linking, textual documents are usually divided into two main categories: long text and short text. Long textual documents usually contain more than 400 words, such as news articles and web sites. Short documents (such as microposts7 or tweets) may have as few as 200 characters, or even contain a single word, as in search queries. Different types of text have their own characteristics that may influence Entity Linking algorithms.

Long textual documents provide a series of document-level features that can be explored for Entity Linking, such as: the presence of multiple entity mentions in a single document; well-written text (expressed by the lack or relative absence of misspellings); and the availability of contextual information that supports the grounding of each mention. Contextual information entails the supporting facts that help in deciding the best knowledge base entry to be linked with a given mention. For instance, let us assume the knowledge base has two candidate entries to be linked with the mention Michael Jordan. One of these entries refers to the professor at the University of California, Berkeley and the other to the basketball player. In order to decide which is the correct entry to be linked with the text, some context needs to be provided, such as: "played a game yesterday", or "won the championship", and so on.

7 Microposts is the term used in the social media field to refer to tweets and social media posts in general.

More context makes the task easier, and little or no context makes the task more challenging.
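As a rough illustration of how such context can be exploited, the sketch below scores candidate knowledge base entries by simple word overlap between the document and a short textual description of each candidate. The candidate descriptions are invented for the example; real systems rely on much richer features.

```python
def overlap_score(context: str, description: str) -> float:
    """Fraction of the candidate's description words that also occur in the context."""
    ctx = set(context.lower().split())
    desc = set(description.lower().split())
    return len(ctx & desc) / len(desc) if desc else 0.0

# Hypothetical candidate entries for the mention "Michael Jordan".
candidates = {
    "Michael_Jordan_(basketball)": "basketball player NBA championship game Chicago Bulls",
    "Michael_I._Jordan_(professor)": "professor machine learning statistics Berkeley",
}

text = "Michael Jordan played a game yesterday and won the championship"
best = max(candidates, key=lambda c: overlap_score(text, candidates[c]))
print(best)  # the basketball player, thanks to the overlapping words "game" and "championship"
```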

Short text documents are considered more challenging than long ones because they have the exact opposite features, such as: the presence of few entity mentions in a single document (due to the limited size of the text); the presence of misspellings or phonetic spelling (e.g. "I call u 2morrow" rather than "I call you tomorrow"); and the low availability of contextual information within the text. It is important to note though that even within the short text category there are still important distinctions between microposts and search queries that may impact the performance of Entity Linking algorithms.

When performing a search, it is expected that the search query will be composed of a mention to the entity of interest being searched for and, sometimes, of additional contextual information. Therefore, despite the challenge of being a short text document, search queries are assumed to contain at least one mention to an entity and are likely to contain additional contextual information. However, for microposts this assumption does not hold.

Microposts do not necessarily have an entity as target. For instance, a document with the content "So happy today!!!" does not explicitly cite any entity mention. Also, microposts may be used to talk about entities without providing any context within the text, as in "Adele, you rock!!". In this respect, Entity Linking for microposts is more challenging than for search queries because it is unclear whether a micropost will contain an entity and context for the linking. Furthermore, microposts are also more likely to contain misspellings and phonetic writing than search queries. If a search engine user misspells a term then it is very likely that she will not find the desired information. In this case, it is safe to assume that search engine users will try to avoid misspellings and phonetic writing as much as possible. On the other hand, in micropost communities, misspellings and phonetic writing are used as strategies to shorten words, thus enabling the communication of more information within a single micropost. Therefore, misspellings and phonetic writing are common features of microposts and need to be taken into consideration when performing Entity Linking.

2.1.2. Knowledge Base
The second component of Entity Linking we consider is the knowledge base used. Knowledge bases differ from each other regarding the domains of knowledge they cover (e.g. domain-specific or encyclopedic


knowledge), the features used to describe entries (e.g. long textual descriptions, attribute-value pairs, or relationship links between entities), their rate of updates, and the number of entities they cover.

As with textual documents, different characteristics will impact the Entity Linking task. The domain covered by the knowledge base will influence which entity mentions can possibly have a link. If there is a mismatch between the domain of the text (e.g. biomedical text) and the domain of the knowledge base (e.g. politics) then all, or most, entity mentions found in the text will not have a reference in the knowledge base. In the extreme case of complete mismatch, the Entity Linking process is reduced to Entity Recognition. Therefore, in order to perform linking, the knowledge base should be at least partially related to the domain of the text being linked.

Furthermore, the features used to describe entities in the knowledge base influence which algorithms can make use of it. For instance, if entities are represented only through textual descriptions, a text-based algorithm needs to be used to find the best mention-entry link. If, however, knowledge base entries are only described through relationship links with other entities then a graph-based algorithm may be more suitable.
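For illustration, the toy sketch below shows the graph-based intuition: candidates for a new mention are scored by how strongly they are connected, through knowledge base relationship links, to an entity already linked in the same document. The miniature graph is invented for the example.

```python
# Invented miniature knowledge base graph: entity -> set of directly related entities.
KB_GRAPH = {
    "Chicago_Bulls": {"Michael_Jordan_(basketball)", "NBA", "Chicago"},
    "Michael_Jordan_(basketball)": {"Chicago_Bulls", "NBA"},
    "Michael_I._Jordan": {"University_of_California,_Berkeley", "Machine_learning"},
    "NBA": {"Chicago_Bulls", "Michael_Jordan_(basketball)"},
}

def graph_score(candidate: str, linked_so_far: set) -> int:
    """Count direct links and shared neighbours between a candidate and already-linked entities."""
    neighbours = KB_GRAPH.get(candidate, set())
    return sum(
        len(neighbours & KB_GRAPH.get(entity, set())) + (entity in neighbours)
        for entity in linked_so_far
    )

linked = {"Chicago_Bulls"}  # an entity already resolved elsewhere in the document
for cand in ("Michael_Jordan_(basketball)", "Michael_I._Jordan"):
    print(cand, graph_score(cand, linked))  # the basketball player scores higher
```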

The third characteristic of a knowledge base which impacts Entity Linking is its rate of updates. Static knowledge bases (i.e. knowledge bases that are never or infrequently updated) represent only the status of a given domain at the moment they were generated. Any entity which becomes relevant to that domain after that point in time will not be represented in the knowledge base. Therefore, in a textual document, only mentions to entities that existed prior to the creation of the knowledge base can have a link; all others would be linked to NIL. The faster entities change in a given domain, the more likely it is for the knowledge base to become outdated. If text and knowledge base are completely disjoint, all links from the text would invariably be NIL. Depending on the textual document to be linked, the rate of updates may or may not be an important feature. Social and news media are more likely to see a faster change in their entities of interest than manufacturing reports, for instance.

Another characteristic of a knowledge base which may impact Entity Linking is the number of entities it covers. Two knowledge bases with otherwise the same characteristics may still vary in the number of entities they cover. When applied to Entity Linking, the more entities a knowledge base covers, the more likely there will be a match between text and knowledge base. Of

course, in this case we should assume that both knowledge bases are focused on representing key entities in their domain rather than long-tail ones.

2.1.3. Entity
The third component of interest for Entity Linking is the definition of entity. Despite its importance for the Entity Linking task, entities are not formally defined. Instead, entities are defined either through example or through the data available. Named entities are the most common case of definition by example. Named entities were introduced in 1997 as part of the Message Understanding Conference as instances of the Person, Organisation, and Geo-political types. An extension of named entities is usually performed through the inclusion of additional types such as Location, Facility, or Movie. In these cases there is no formal definition of entities; rather, they are exemplars of a set of categories.

An alternative definition of entities assumes that entities are anything represented in the knowledge base. In other words, the definition of entity is given by the data available (in this case, data from the knowledge base). Whereas this definition makes the Entity Linking task easier by not requiring any refined "human-like" reasoning about types, it makes it impossible to identify NIL links. If an entity is anything in the knowledge base, how could we ever possibly have, by definition, an entity which is not in the knowledge base?

The choice of entity definition will depend on the kind of Entity Linking desired. If the goal is to consider links to NIL then the definition based on types is the most suitable; otherwise the definition based on the knowledge base may be used.

2.2. Typical Entity Linking Workflow and Evaluation Strategies

Regardless of the different Entity Linking components, most proposed systems for Entity Linking follow a workflow similar to the one presented in Figure 1. This workflow is composed of the following steps: Mention Detection, Entity Typing, Candidate Detection, Candidate Selection, and NIL Clustering. Note that, although it is usually a sequential workflow, there are approaches that create a feedback loop between different steps, or merge two or more steps into a single one.

The Mention Detection step receives textual documents as input and recognises all terms in the text that refer to entities. The goal of this step is to perform typical Named Entity Recognition.


Fig. 1. Typical Entity Linking workflow with expected output of each step.

Next, the Entity Typing step detects the type of each mention previously recognised. This task is usually framed as a categorisation problem. Candidate Detection then receives the detected mentions and produces a list with all entries in the knowledge base that are candidates to be linked with each mention. In the Candidate Selection step, these candidate lists are processed and, by making use of available contextual information, the correct link for each mention, either an entry from the knowledge base or a NIL reference, is provided. Last, the NIL Clustering step receives a series of mentions linked to NIL as input and generates clusters of mentions referring to the same entity, i.e. each cluster contains all NIL mentions representing one, and only one, entity, and there are no two clusters representing the same entity.
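A minimal sketch of this workflow is given below. Each step is a stub standing in for the actual algorithm a system would plug in; only the shape of the data flowing between the steps follows the description above.

```python
from typing import Dict, List, Optional, Tuple

Mention = Tuple[int, int, str]  # (start offset, end offset, surface form)

def mention_detection(text: str) -> List[Mention]:
    """Recognise all terms in the text that refer to entities (stub)."""
    return [(0, 5, text[:5])] if text else []

def entity_typing(text: str, mention: Mention) -> str:
    """Assign a type to a previously recognised mention (stub)."""
    return "Person"

def candidate_detection(mention: Mention) -> List[str]:
    """List knowledge base entries that could be linked with the mention (stub)."""
    return []

def candidate_selection(text: str, mention: Mention, candidates: List[str]) -> Optional[str]:
    """Pick the best entry using the available context, or None for a NIL link (stub)."""
    return candidates[0] if candidates else None

def nil_clustering(nil_mentions: List[Mention]) -> Dict[str, List[Mention]]:
    """Group NIL mentions so that each cluster stands for exactly one entity (stub)."""
    clusters: Dict[str, List[Mention]] = {}
    for m in nil_mentions:
        clusters.setdefault(m[2].lower(), []).append(m)
    return clusters

def link(text: str):
    results, nils = [], []
    for mention in mention_detection(text):
        etype = entity_typing(text, mention)
        entry = candidate_selection(text, mention, candidate_detection(mention))
        if entry is None:
            nils.append(mention)
        results.append((mention, etype, entry))
    return results, nil_clustering(nils)

print(link("Adele, you rock!!"))
```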

The evaluation of Entity Linking systems is based on this typical workflow and can be of three types: end-to-end, step-by-step, or partial end-to-end.

An end-to-end strategy evaluates a system based only on the aggregated result of all its steps. This means that if one step in the workflow does not perform well and its error propagates through all subsequent steps, this type of evaluation will judge the system based only on the aggregated error.

In this case, a system that performs excellent Candidate Selection but poor Mention Detection can be considered as good as a system that performs poor Candidate Selection but excellent Mention Detection. The end-to-end strategy is very useful for application benchmarking, in which the goal is to maximise the results that will be consumed by another application built on Entity Linking (e.g. entity-based search). However, for research benchmarking, it is important to know which algorithms are the best fit for each of the steps in the Entity Linking workflow.

The opposite of an end-to-end evaluation is a step-by-step strategy. The goal of this evaluation is to provide a robust benchmark of algorithms for each step of the Entity Linking workflow. Each step is provided with the gold standard input (i.e. the correct input data for that specific step) in order to eliminate the propagation of errors from previous steps. The output of each step is then evaluated separately. Despite the robustness of this approach, this type of evaluation does not account for systems that do not follow the typical Entity Linking workflow, e.g. systems that merge two steps into a single one or that create feedback loops; it is also a highly time and labour consuming task to set up.

Finally, the partial end-to-end evaluation aims at evaluating the output of each Entity Linking step by analysing the final result of the whole system. Partial end-to-end evaluation uses different metrics that are influenced only by specific parts of the Entity Linking workflow. For instance, one metric evaluates only the links between mentions and entities in the knowledge base, another metric evaluates only links with NIL, yet another one evaluates only the correctly recognised mentions, whereas another metric measures the performance of the NIL Clustering.
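The sketch below illustrates this idea: the same end-to-end output is scored several times, each time against a different view of the gold standard (mention spans only, knowledge base links only, NIL links only). The tuple layout and metric names are ours, not those of any official scorer.

```python
from typing import Iterable, Set, Tuple

Annotation = Tuple[int, int, str]  # (start, end, link), with link == "NIL" for NIL mentions

def f1(gold: Set, pred: Set) -> float:
    """Micro-averaged F1 over two sets of annotations."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def partial_end_to_end(gold: Iterable[Annotation], pred: Iterable[Annotation]):
    gold, pred = set(gold), set(pred)
    spans = lambda anns: {(s, e) for s, e, _ in anns}
    kb = lambda anns: {a for a in anns if a[2] != "NIL"}
    nil = lambda anns: {a for a in anns if a[2] == "NIL"}
    return {
        "mention_match": f1(spans(gold), spans(pred)),  # sensitive to Mention Detection only
        "kb_link_match": f1(kb(gold), kb(pred)),        # sensitive to linking into the KB
        "nil_match": f1(nil(gold), nil(pred)),          # sensitive to NIL detection
    }

gold = [(0, 5, "dbr:Adele"), (10, 16, "NIL")]
pred = [(0, 5, "dbr:Adele"), (10, 16, "dbr:London")]
print(partial_end_to_end(gold, pred))  # perfect spans, imperfect KB links, missed NIL
```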

2.3. Entity Linking Benchmark Initiatives

The number of variations in Entity Linking makes it hard to benchmark Entity Linking systems. Different research communities focus on different types of text and knowledge base, and different algorithms will perform better or worse on any specific step. In this section, we present the Entity Linking benchmark initiatives to date, the Entity Linking specifications used, and the communities involved. The challenges are summarised in Table 1.

2.3.1. TAC-KBP
Entity Linking was first introduced in 2009 as a challenge for the Text Analysis Conference.8

8 http://www.nist.gov/tac


Characteristic | TAC-KBP (2014–2016) | ERD (2014) | SemEval (2015) | W-NUT (2015) | NEEL (2013–2016)
Text | newswire, web sites, discussion forum posts | web sites, search queries | technical manuals, reports, formal discussions | tweets | tweets
Knowledge Base | Wikipedia (2014), Freebase (2015–2016) | Freebase | Babelnet | none | none (2013), DBpedia (2014–2016)
Entity | given by Type | given by KB | given by KB | given by Type | given by Type
Evaluation (submission) | file | API | file | file | file (2013, 2014, 2016), API (2015)
Evaluation (strategy) | partial end-to-end | end-to-end | end-to-end | end-to-end | end-to-end (2013, 2014), partial end-to-end (2015, 2016)
Target Conference | TAC | SIGIR | NAACL-HLT | ACL-IJCNLP | WWW

Table 1. Named Entity Recognition and Linking challenges since 2013.

This conference was aimed at a community focused on the analysis of textual documents, and the challenge itself was part of the Knowledge Base Population track (also called TAC-KBP) [46]. The goal of this track was to explore algorithms for automatic knowledge base population from textual sources. In this track, Entity Linking was perceived as a fundamental step, in which entities are extracted from text and it is evaluated whether they already exist in the knowledge base to be populated (i.e. link to a knowledge base entry) or whether they should be included in it (i.e. link to NIL). The results of Entity Linking could be used either for direct population of knowledge bases or in conjunction with other TAC-KBP tasks such as Slot Filling.

As of 2009, the TAC-KBP benchmark was not concerned with the recognition of entities in text, in particular considering that its entities of interest were instances of the types Organisation, Geo-political, and Person, and the recognition of these types of entities in text was already a well-established task in the community. The challenge was then mainly concerned with correct Entity Typing and Candidate Selection. In later years, Mention Detection and NIL Clustering were also included in the TAC-KBP pipeline [47]. Also, more entity types are now considered, such as Location and Facility, as well as multiple languages [48].

Characteristics that have been constant in TAC-KBP are the use of long textual documents, entities given by Type, and the use of encyclopedic knowledge bases. A reason for long textual documents would be that this type of text, in particular news articles and web sites, is more likely to contain contextual information to populate a knowledge base. The use of entities given by Type is a direct consequence of the availability of named entity recognition algorithms based on types and the need for NIL detection.

The use of an encyclopedic knowledge base was motivated by the fact that Person, Organisation, and Geo-political entities are not domain-specific, and by the availability of Wikipedia as a freely available knowledge base on the Web.

2.3.2. ERD
The Entity Recognition and Disambiguation (ERD) challenge [50] was a benchmark initiative organised in 2014 as part of the SIGIR conference9 with the focus of enabling entity-based information retrieval. For such a task, a system needs to be able to index documents based on entities, rather than words, and to identify which entities satisfy a given query. Therefore, the ERD challenge proposed two Entity Linking tracks, a long text track based on web sites (i.e. the documents to be indexed), and a short text track based on search queries. For both tracks, entities identified in text should be linked with a subset of Freebase,10 a large collaborative encyclopedic knowledge base containing structured data.

The Information Retrieval community, and consequently the ERD challenge, focuses on the processing of large amounts of information. Therefore, the systems evaluated should provide not only correct results but also fulfil basic standards for large scale web systems, i.e. they should be available through Web APIs for public use, they should accept a minimum number of requests without timeout, and they should ensure a minimum uptime availability. All these standards were translated into the evaluation method of the ERD challenge, which required systems to have a given Web API available for querying during the time of the evaluation. Also, large scale web systems are evaluated regarding how useful their output is for the task at hand, regardless of the internal algorithms used.

9 http://sigir.org/sigir2014
10 https://developers.google.com/freebase


The evaluation used by ERD was therefore an end-to-end evaluation using standard information retrieval evaluation metrics (i.e. precision, recall, and f-measure).

2.3.3. W-NUT
The community of natural language processing and

computational linguistics within the ACL-IJCNLP11

conferences has always been interested in the study of long textual documents. One of the main characteristics of these documents is that they are usually written in standard English. However, with the advent of Twitter and other forms of microblogging, short documents started to receive increased attention from the academic community of computational linguists, especially because of their non-standard writing.

In 2015, the Workshop on Noisy User-generated Text (W-NUT) [52] promoted the study of documents that are not written in standard English, with tweets as the focus of its two shared tasks. One of these tasks was targeted at the normalisation of text. In other words, expressions such as "r u coming 2" should be normalised into standard English in the form of "are you coming to". The second task proposed named entity recognition within tweets, in which systems were required to detect mentions of entities corresponding to a list of ten entity types. This proposed task corresponds to the first two steps of the Entity Linking workflow: Mention Detection and Entity Typing.
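As a toy illustration of the normalisation task, the sketch below replaces out-of-vocabulary tokens using a small hand-written lookup table; the lexicon is invented for the example, and the actual W-NUT systems are considerably more sophisticated.

```python
# Invented normalisation lexicon mapping noisy tokens to standard English.
NORMALISATION_LEXICON = {
    "r": "are",
    "u": "you",
    "2": "to",
    "2morrow": "tomorrow",
}

def normalise(tweet: str) -> str:
    """Replace known noisy tokens with their standard English form."""
    return " ".join(NORMALISATION_LEXICON.get(token.lower(), token) for token in tweet.split())

print(normalise("r u coming 2"))      # -> "are you coming to"
print(normalise("I call u 2morrow"))  # -> "I call you tomorrow"
```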

2.3.4. SemEval
Word Sense Disambiguation and Entity Linking are

two tasks that perform disambiguation of textual documents through links with a knowledge base. Their main difference is that the former disambiguates the meaning of words with respect to a dictionary of word senses, whereas the latter disambiguates words with respect to a list of entity referents. These two tasks have been historically treated as different tasks since they require knowledge bases of a dissimilar nature. However, with the development of Babelnet, a knowledge base containing both entities and word senses, Word Sense Disambiguation and Entity Linking could finally be performed using a single knowledge base.

In 2015, a shared task for Multilingual All-Words Sense Disambiguation and Entity Linking [51] using Babelnet was proposed as part of the International Workshop on Semantic Evaluation (SemEval).12 In this task, the goal was to create a system that could perform both Word Sense Disambiguation and Entity Linking.

11 https://www.aclweb.org
12 http://alt.qcri.org/semeval2015

In Word Sense Disambiguation, senses are anything that is available in the dictionary of senses. Therefore, in order to make the integration of the two tasks easier, it followed that an entity is anything that is available in the knowledge base of entities. Also, given the complexity involved in joining the two tasks, the SemEval shared task focused on technical manuals, reports, and formal discussions, which tend to follow a more rigid written structure than tweets or other forms of informal natural language text. The use of such well-written texts makes the task easier at the mention recognition level (i.e. Mention Detection) and leaves the challenge at the disambiguation level (i.e. Candidate Selection).

3. The NEEL Challenge Series

Named Entity Recognition and Entity Linking have been active research topics since their introduction by MUC-7 in 1997 and TAC-KBP in 2009, respectively. The main focus of these initiatives had been on long textual documents, such as news articles or web sites. Meanwhile, microposts emerged as a new type of communication on the Social Web and have become a widespread format to express opinions, sentiments, and facts about entities. The popularisation of microposts through the use of Twitter,13 an established platform for the publication of microposts, exposed a gap in the research of the Named Entity Recognition and Entity Linking communities. The NEEL series was proposed as a benchmark initiative to fill this gap.

The evolution of the NEEL challenge followed the evolution of Entity Linking. The challenge was first held in 2013 under the name of Concept Extraction (CE) and was concerned with the detection of mentions of entities in microposts and the specification of their types. In the next year, already under the acronym NEEL, the challenge also included linking mentions to an encyclopedic Knowledge Base or to NIL. In 2015 and 2016, NEEL was expanded to also include the clustering of NIL mentions.

To propose a fair benchmark for approaches to Entity Linking with microposts, the organisation of the NEEL challenge had to make certain decisions concerning the different Entity Linking components and the available strategies for evaluation, always taking into consideration the trends and needs of the research community focused on the Web and microposts.

13 http://www.twitter.com


In this section, we provide the motivation for these decisions. A discussion of their impact is provided in later sections.

Text. The first decision that had to be taken regards the text used for the challenge. Twitter was chosen as the source of textual documents since it is a well-known platform for microposts on the Web, and it provides a public API that makes it easy to extract microposts both for generation of the benchmark corpora and for future use of the evaluated Entity Linking systems. More information on how Twitter was used to build the NEEL corpora is presented in Section 4.

Knowledge Base. Regardless of the type of text used, it is important for an Entity Linking challenge to provide a balance between mentions linked to the Knowledge Base and mentions linked to NIL. A better balance enables a fairer evaluation; otherwise the challenge would be biased towards algorithms that perform one task better than the other. In the case of tweets, the frequency of knowledge base updates is an important factor for the balance between knowledge base links and NIL links. Microposts are a dynamic form of communication usually dealing with recent events. If the collection of tweets is more recent than the entities in the knowledge base, the amount of NIL links is likely to be much higher than the links to entries in the Knowledge Base. Therefore, the rate at which the knowledge base is updated is an important factor for the NEEL challenge.

Taking this into account, we chose to use DBpedia [84], a structured knowledge base based on Wikipedia, mainly because it is frequently updated with entities appearing in events covered in social media. Another motivation for using DBpedia is that its format lends itself better to the task than Wikipedia itself. Each NEEL version used the latest available version of DBpedia.
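For illustration, the sketch below checks whether a candidate resource exists in DBpedia by issuing a SPARQL ASK query against the public endpoint. Note that the public endpoint reflects the current DBpedia release rather than the specific dumps designated in each challenge year, which are fixed explicitly by the challenge.

```python
import json
import urllib.parse
import urllib.request

def in_dbpedia(resource_uri: str, endpoint: str = "https://dbpedia.org/sparql") -> bool:
    """ASK the DBpedia SPARQL endpoint whether the resource occurs in any triple."""
    query = f"ASK {{ <{resource_uri}> ?p ?o }}"
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    request = urllib.request.Request(url, headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["boolean"]

# A mention with a DBpedia referent vs. one that would end up as a NIL link.
print(in_dbpedia("http://dbpedia.org/resource/Amy_Winehouse"))
```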

Definition of entities. Due to the dynamic nature of microposts, the recognition of NILs has been considered an important feature since the introduction of Entity Linking in the NEEL challenge in 2014. For this reason, but also to accommodate the participants from the Concept Extraction challenge, the definition of entities is given by type.

In 2013, the list of entity types was based on the taxonomy used in CoNLL 2003 [40]. From 2014 onwards, the NEEL Taxonomy (Appendix A) was created with the goal of providing a more fine-grained classification of entities. This would represent a vast

number of entities of interest in the context of the Web. The types of entities used and how the NEEL Taxonomy was built are described in Section 4.

Evaluation. The evaluation is the main component of a benchmark initiative because, after all, the goal of benchmarking is to compare different systems applied to the same data by using the same evaluation metrics. There are two main decisions regarding the evaluation process. The first decision is about the format in which the results of each system are gathered (i.e. via file transfer or calls to a Web API). The second decision regards how the results will be evaluated and which evaluation metrics will be applied.

The NEEL challenge has used different evaluation settings in different versions of the challenge. Each change has its own motivation, but the main focus for each of them was to provide a fair and comprehensive evaluation of the submitted systems.

The first decision regards the submission of a file versus evaluation through Web APIs. Both approaches have their advantages and disadvantages. The use of a file lowers the bar for new participants in the challenge because they do not need to develop a Web API in addition to the usual Entity Linking steps, nor to have a Web server available during the whole evaluation process. This was the model proposed for 2013, 2014, and 2016. However, during NEEL 2014, some participants suggested that the challenge should apply a blind evaluation, i.e. the participants should see the input data only at query time, in order to avoid the common mistake of tuning the system on the evaluation data. Therefore, in 2015 the submission of evaluation results was changed to Web API calls. The impact of this change was that a few teams could not participate in the challenge, mainly because their Web server was not available during evaluation or their API did not have the correct signature. This format of evaluation also required extra effort on the part of the organisers, who had to advise participant teams when their web servers were not available. Given the amount of problems generated and the absence of any real benefit, the organisation opted to go back to the transfer of files with the results of the systems, as in previous years.
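For illustration, a minimal sketch of the kind of annotation endpoint a participating team had to expose in 2015 is shown below, built on Python's standard http.server. The route, payload layout and response format here are hypothetical and do not reproduce the official challenge signature.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def annotate(tweet_text: str):
    """Placeholder for a team's Entity Linking pipeline."""
    return [{"mention": "Adele", "start": 0, "end": 5,
             "link": "http://dbpedia.org/resource/Adele", "type": "Person"}]

class AnnotationHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(annotate(payload.get("text", ""))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Keep the server reachable for the whole evaluation window.
    HTTPServer(("0.0.0.0", 8080), AnnotationHandler).serve_forever()
```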

The second decision concerns the evaluation strategy, which impacts the metrics used and the overall benchmark ranking. Here, the options are an end-to-end, a partial end-to-end, or a step-by-step evaluation. Borrowing from the named entity recognition community, the first two versions of the challenge (i.e. 2013 and 2014) were based on an end-to-end evaluation.


In this evaluation, standard evaluation metrics (i.e. precision, recall, and f-measure) were applied on top of the aggregated results of the system. A drawback of end-to-end evaluation is that in Entity Linking, if one step in the typical workflow does not perform well, its error will propagate until the last step. Therefore, an end-to-end evaluation will only judge a system based on the aggregated errors from all steps. This was not a problem when the systems were required to perform one or two simple steps, but once the challenge started requiring a larger number of steps, a more fine-grained evaluation became necessary.

A partial end-to-end strategy evaluates the output of each Entity Linking step by analysing only the final result of the system. This evaluation uses different metrics for each part of the workflow and had been successfully used in multiple TAC-KBP editions. Therefore, due to its benefits for the research community, the partial end-to-end evaluation has also been applied in the NEEL challenge in 2015 and 2016. Furthermore, the NEEL challenge applied this strategy using the same evaluation tool as TAC-KBP [49], which aimed to enable an easier interchange of participants between the two communities.

The step-by-step evaluation has never been applied within the NEEL series. Despite its robustness in eliminating error propagation, it is very time consuming, in particular if participant systems do not implement the typical workflow. The evaluation process for each year, as well as the specific metrics used, will be discussed in Section 7.

Target Conference. The NEEL challenge keeps in mind that microposts are of interest to a broader community, composed of researchers in Natural Language Processing, Information Retrieval, and Computational Linguistics, as well as a community interested in the World Wide Web. Given this, the NEEL Challenges were proposed as part of the International Workshop on Making Sense of Microposts, held in conjunction with consecutive World Wide Web conferences.

In the next sections we explain in detail how the NEEL challenges were organised, how the benchmark corpora were generated semi-manually, the details of the participant systems in each year, and the impact of each change on participation in subsequent years.

4. Corpus Creation

The organisation of the NEEL challenges led to the yearly release of datasets of high value for the research

community. Over the years, the datasets increased in size and coverage.

4.1. Collection procedure and statistics

The initial 2013 challenge dataset contains 4,265 tweets collected from the end of 2010 to the beginning of 2011 using the Twitter firehose with no explicit hashtag search. These tweets cover a variety of topics, including comments on news and politics. The dataset was split into 66% training and 33% test.

The second 2014 challenge dataset contains 3,505 event-annotated tweets, where each entity was linked to its corresponding DBpedia URI. This dataset was collected as part of the Redites project14 from 15th July 2011 to 15th August 2011 (31 days), comprising a set of over 18 million tweets obtained from the Twitter firehose. The 2014 dataset includes both event and non-event related tweets. The collection of event related tweets did not rely on the use of hashtags but on applying the First Story Detection (FSD) algorithm of [36] and [37]. This algorithm relies on locality-sensitive hashing, which processes each tweet as it arrives in time. The hashing dynamically builds up tweet clusters representing events. Notice that hashing in this context refers to a hashing technique, not to a Twitter hashtag. Within this collection, the FSD algorithm identified a series of events (stories) including the death of Amy Winehouse, the London Riots and the Oslo bombing. Since the challenge task was to automatically recognise and link named entities (to DBpedia referents), we built the challenge dataset considering both event and non-event tweets. While event tweets are more likely to contain entities, non-event tweets enabled us to evaluate the performance of the systems in avoiding false positives in the entity extraction phase. This dataset was split into training (70%) and test (30%) sets. Given that the 2014 task was to identify mentions and link them to the referent knowledge base entities, the class information was removed from the final release.
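A much simplified sketch of the first story detection idea is shown below: each incoming tweet is compared against the tweets seen so far and either joins the cluster of its nearest neighbour or, if nothing is similar enough, starts a new cluster (a "first story"). Exact nearest-neighbour search stands in here for the locality-sensitive hashing used in the cited work, and the threshold and tokenisation are arbitrary.

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def first_story_detection(tweets, threshold=0.3):
    """Assign each tweet, in arrival order, to the cluster of its nearest earlier tweet,
    or start a new cluster when no earlier tweet is similar enough."""
    seen = []      # (bag of words, cluster id) for every processed tweet
    clusters = {}  # cluster id -> list of tweets
    for text in tweets:
        bag = Counter(text.lower().split())
        best_sim, best_cluster = 0.0, None
        for other_bag, cid in seen:  # the real algorithm uses LSH to avoid this full scan
            sim = cosine(bag, other_bag)
            if sim > best_sim:
                best_sim, best_cluster = sim, cid
        cid = best_cluster if best_sim >= threshold else len(clusters)
        clusters.setdefault(cid, []).append(text)
        seen.append((bag, cid))
    return clusters

stream = [
    "Amy Winehouse found dead in London home",
    "RT Amy Winehouse dead at 27",
    "Riots spreading across London tonight",
]
print(first_story_detection(stream))  # two clusters: the Winehouse story and the riots
```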

The 2015 challenge dataset extends the 2014 dataset. This dataset consists of tweets published over a longer period, between 2011 and 2013. In addition to this, we also collected tweets from the Twitter firehose from November 2014 covering both event (such as the UCI Cyclo-cross World Cup) and non-event tweets. The dataset was split into training (58%), consisting of the entire 2014 dataset, development (8%), which enabled participants to tune their systems, and test (34%), drawn from the newly added 2015 tweets.

14 http://demeter.inf.ed.ac.uk/redites


dataset | tweets | words | tokens | tokens/tweet | entities | NILs | total entities | entities/tweet | NILs/tweet
2013 training | 2,815 | 10,439 | 51,969 | 18.46 | 2,107 | - | 3,195 | 1.88 | -
2013 test | 1,450 | 6,669 | 29,154 | 20.10 | 1,140 | - | 1,557 | 1.79 | -
2014 training | 2,340 | 12,758 | 41,037 | 17.54 | 1,862 | - | 3,819 | 3.26 | -
2014 test | 1,165 | 6,858 | 20,224 | 17.36 | 834 | - | 1,458 | 2.50 | -
2015 training | 3,498 | 13,752 | 67,393 | 19.27 | 2,058 | 451 | 4,016 | 1.99 | 0.22
2015 dev | 500 | 3,281 | 7,845 | 15.69 | 564 | 362 | 790 | 2.04 | 0.94
2015 test | 2,027 | 10,274 | 35,558 | 17.54 | 2,122 | 1,478 | 3,860 | 2.32 | 0.89
2016 training | 6,025 | 26,247 | 100,071 | 16.61 | 3,833 | 2,291 | 8,665 | 1.43 | 0.38
2016 dev | 100 | 841 | 1,406 | 14.06 | 174 | 85 | 338 | 3.38 | 0.85
2016 test | 3,164 | 13,728 | 45,164 | 14.27 | 430* | 284* | 1,022* | 3.41+ | 0.95+

Table 2. General statistics of the training, dev, and test data sets. tweets refers to the number of tweets in the set; words to the number of unique words, i.e. without repetition; tokens refers to the total number of words; tokens/tweet represents the average number of tokens per tweet; entities refers to the number of unique named entities, including NILs; NILs refers to the number of entities not yet available in the knowledge base; total entities corresponds to the number of entities with repetition in the set; entities/tweet refers to the average number of entities per tweet; NILs/tweet corresponds to the average number of NILs per tweet. * Only 300 tweets were randomly selected to be manually annotated and included in the gold standard. + These figures refer to the 300 tweets of the gold standard.


The 2016 challenge dataset builds on the 2014 and 2015 datasets, and consists of tweets extracted from the Twitter firehose from 2011 to 2013 and from 2014 to 2015 via a selection of popular hashtags. This dataset was split into training (65%), consisting of the entire 2015 dataset, development (1%), and test (34%) sets from the tweets newly collected for the 2016 challenge.

Statistics describing the training, development and test sets are provided in Table 2. In all but the 2015 challenge, the training datasets presented a higher rate of named entities linked to DBpedia than the development and test datasets. The percentage of tweets that mention at least one entity is 74.42% in the training set and 72.96% in the test set for the 2013 dataset; 32% in the training set and 40% in the test set for the 2014 dataset; 57.83% in the training set, 77.4% in the development set, and 82.05% in the test set for the 2015 dataset; and 67.60% in the training set, 100% in the development set, and 9.35% in the test set for the 2016 dataset. The overlap of entities between the training and test data is 8.09% for the 2013 dataset, 13.27% for the 2014 dataset, 4.6% for the 2015 dataset, and 6.59% for the 2016 dataset. Following the Terms of Use of Twitter, for all four challenge datasets participants were only provided with the tweet IDs and the annotations; the tweet text had to be downloaded from Twitter.
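The sketch below shows how figures of this kind can be derived from the released annotations; the tuple layout is illustrative, and the overlap measure implements one plausible reading of "overlap of entities" (shared links between training and test), which may differ from the exact definition used for the published numbers.

```python
from typing import Iterable, Set, Tuple

Annotation = Tuple[str, str, str]  # (tweet id, surface form, DBpedia link or "NIL")

def pct_tweets_with_entity(tweet_ids: Set[str], annotations: Iterable[Annotation]) -> float:
    """Percentage of tweets that mention at least one entity."""
    annotated = {tweet_id for tweet_id, _, _ in annotations}
    return 100.0 * len(annotated & tweet_ids) / len(tweet_ids)

def entity_overlap(train: Iterable[Annotation], test: Iterable[Annotation]) -> float:
    """Percentage of distinct test-set links that also appear in the training set."""
    train_links = {link for _, _, link in train}
    test_links = {link for _, _, link in test}
    return 100.0 * len(train_links & test_links) / len(test_links)

train = [("1", "Adele", "dbr:Adele"), ("2", "London", "dbr:London")]
test = [("3", "Adele", "dbr:Adele"), ("4", "Oslo", "dbr:Oslo")]
print(pct_tweets_with_entity({"1", "2", "5"}, train))  # 66.7: tweet "5" has no entity
print(entity_overlap(train, test))                     # 50.0: dbr:Adele is shared
```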

CE 2013 | NEEL Taxonomy
MISC | Thing
PER | Person
LOC | Location
ORG | Organization
- | Character
- | Product
- | Event

Table 3. Mapping between the taxonomy used in the first challenge of the NEEL challenge series (left column) and the taxonomy used since 2014 (right column).

4.2. Annotation taxonomy and class distribution

The taxonomy for annotating the entities changed from a four-class taxonomy in 2013, based on the taxonomy used in CoNLL 2003 [40], to an extended seven-type taxonomy, namely the NEEL Taxonomy (Appendix A), which is derived from the NERD Ontology [35]. This new taxonomy was introduced to provide a more fine-grained classification of the entities, covering also names of characters, products and events. Furthermore, it is deemed better suited to cope with the semantic diversity of named entities in textual documents, as shown in [60]. Table 3 shows the mapping between the two classification schemes. Summary statistics of the entity types are provided in Tables 4, 5, and 6 for the 2013, 2015, and 2016 corpora respectively.15

15 The statistics cover the observable data in the corpora. Thus, the distributions of implicit classes in the 2014 corpus are not reported. The class information was removed from that release on purpose, given the final objective of the task of having end-to-end solutions.


The most frequent entity type across all datasets is Person. This is followed by Organisation and Location in the 2013 and 2015 datasets. In the 2016 dataset the second and third most frequent types are Product and Organisation. The distributional differences between the entity types in the three sets are quite apparent, making the NEEL task challenging, particularly when addressed with supervised learning approaches.

Type | Training | Test
Person | 1,722 (53.89%) | 1,128 (72.44%)
Location | 621 (19.44%) | 100 (6.42%)
Organisation | 618 (19.34%) | 236 (15.16%)
Miscellaneous | 233 (7.29%) | 95 (6.10%)

Table 4. Entity type statistics for the two data sets from 2013.

Type Training Dev Test

Character 43 (1.07%) 5 (0.63%) 15 (0.39%)Event 182 (4.53%) 81 (10.25%) 219 (5.67%)

Location 786 (19.57%) 132 (16.71%) 957 (24.79%)Organization 968 (24.10%) 125(15.82%) 541 (14.02%)

Person 1102 (27.44%) 342 (43.29%) 1402 (36.32%)Product 541 (13.47%) 80 (10.13%) 575 (14.9%)Thing 394 (9.81%) 25 (3.16%) 151 (3.92%)

Table 5Entity type statistics for the three data sets from 2015.

Type Training Dev Test

Character 63 (0.73%) 19 (5.62%) 57 (5.58%)Event 482 (5.56%) 7 (2.07%) 24 (2.35%)

Location 1,868 (21.56%) 17 (5.03%) 43 (4.21%)Organization 1,641 (18.94%) 33 (9.76%) 158 (15.46%)

Person 2,846 (32.84%) 120 (35.50%) 337 (32.97%)Product 1,199 (13.84%) 128 (37.87%) 355 (34.74%)Thing 570 (6.58%) 14 (4.14%) 49 (4.79%)

Table 6Entity type statistics for the three data sets from 2016. The statisticsof the Test set refer to the manually annotated set of tweets selectedto generate the gold standard.

The choice of removing the class information from the release wasmade on purpose because of the final objective of the task of havingend-to-end solutions.

4.3. Annotation procedure

In the 2013 challenge, 4 annotators created the gold standard; in the 2014 challenge, a total of 14 annotators with different backgrounds were involved, including computer scientists, social scientists, social web experts, semantic web experts and natural language processing experts; in the 2015 challenge, 3 annotators generated the annotations; in the 2016 challenge, 2 experts took on the manual annotation campaign.

The annotation process for the 2013 dataset started with the unannotated corpus and consisted of the following steps:

Phase 1. Manual annotation: the corpus was split into four quarters, each of which was annotated by a different human annotator.

Phase 2. Consistency: for consistency checking, each annotator further checked the annotations that the other three performed to verify correctness.

Phase 3. Consensus: for the annotations without consensus, discussions among the four annotators were used to come to a final conclusion. This process resulted in resolving annotation inconsistencies.

Phase 4. Adjudication: a very small number of errors was also reported by the participants, which were taken into account in the final version of the dataset.

With the inclusion of entity links, the annotation process for the 2014 and 2015 datasets was amended to consist of the following phases:

Phase 1. Unsupervised automated annotation: the corpus was initially annotated using the NERD framework [43], which extracted potential entity mentions, candidate links to DBpedia, and entity types. The NERD framework was used as an off-the-shelf annotation tool, i.e. without any task-specific training.

Phase 2. Manual annotation: the labeled data set was divided into batches, with different annotators assigned to each batch (three annotators in the 2014 challenge and two annotators in the 2015 challenge). In this phase manual annotations were performed using an annotation tool (CrowdFlower for the 2014 challenge dataset,16 and GATE [44] for the 2015 challenge dataset17). The annotators were asked to analyse the annotations generated in Phase 1 by adding or removing entity annotations as required. The annotators were also asked to mark any ambiguous cases encountered. Along with the batches, the annotators also received the NEEL Challenge Annotation Guidelines (Appendix B).

16 For annotating the 2014 challenge dataset, we used CrowdFlower with selected expert annotators rather than the crowd.

Phase 3. Consistency: the annotators (three experts in the 2014 challenge, and a fourth annotator in the 2015 challenge) double-checked the annotations and generated the gold standard (for the training, development and test sets). Three main tasks were carried out here: i) consistency checking of entity types; ii) consistency checking of URIs; iii) resolution of ambiguous cases raised by the annotators. The annotators looped through Phases 2 and 3 of the process until the problematic cases were resolved.

Phase 4. NIL Clustering: particular to the 2015 challenge, a seed cluster generation algorithm merging string- and type-identical named entity mentions was used to generate an initial NIL clustering.

Phase 5. Consensus: also in the 2015 challenge, based on the results of the seed algorithm, the third annotator manually verified all NIL clusters in order to remove links asserted to the wrong cluster, and to merge clusters referring to the same entity. Special attention was paid to name variations such as acronyms, misspellings, and similar names.

Phase 6. Adjudication: where the challenge participants reported incorrect or missing annotations, each reported mention was evaluated by one of the challenge chairs to check compliance with the NEEL Challenge Annotation Guidelines, and additions and corrections were made as required.

In the 2016 challenge, the training set was built on top of the 2014 and 2015 datasets in order to provide continuity with previous years and to build upon existing findings. The 2016 challenge used the NEEL Challenge Annotation Guidelines provided in 2015. Due to the intensity of the annotation task, 10% of the test set was annotated manually.18 A random selection was performed while preserving the original distributions of types in the corpus, relying on the law of large numbers [85]. The annotation process for the 2016 test set consisted of the following steps:

17 For the 2015 challenge we chose GATE instead of CrowdFlower, because GATE allows for the annotation of entities according to an ontology, and for computing inter-annotator agreement on the dataset.

18 The participants were asked to annotate the entire corpus of tweets.

Phase 1. Manual annotation: the data set was divided into 2 batches, one for each annotator. In this phase, annotations were performed using GATE. The annotators were asked to analyse the annotations generated in Phase 1 by adding or removing entity annotations as required. The annotators were also asked to mark any ambiguous cases encountered. Along with the batches, the annotators received the NEEL Challenge Annotation Guidelines.

Phase 2. Consistency: the annotators checked each other's annotations and generated the gold standard (for the training, development and test sets). Three main tasks were carried out here: i) consistency checking of entity types; ii) consistency checking of URIs; iii) resolution of ambiguous cases raised by the annotators. The annotators iterated between Phases 1 and 2 until the problematic cases were resolved.

Phase 3. NIL Clustering: an unsupervised NIL clustering generation was performed, using a seed cluster generation algorithm based on exact string matching of mention strings and their types (a minimal sketch is given after this list).

Phase 4. Consensus: one of the two expert annotators went through all NIL clusters in order to, where appropriate, include mentions in or exclude them from a given cluster.

Phase 5. Adjudication: where the challenge participants reported incorrect or missing annotations, each reported mention was evaluated by one of the challenge chairs to check compliance with the Challenge Annotation Guidelines, and additions and corrections were made as required.
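The seed clustering used in Phase 3 groups NIL mentions that are string- and type-identical. The following minimal Python sketch (not the organisers' exact script, and with a hypothetical cluster-identifier scheme) illustrates the idea:

# Seed NIL cluster generation, assuming NIL-linked annotations are given as
# (mention_string, entity_type) pairs: mentions that are string- and
# type-identical (after case folding) share a cluster.
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

def seed_nil_clusters(nil_mentions: Iterable[Tuple[str, str]]) -> Dict[str, List[Tuple[str, str]]]:
    clusters: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for surface, etype in nil_mentions:
        key = f"NIL_{surface.lower()}_{etype}"  # hypothetical cluster identifier
        clusters[key].append((surface, etype))
    return clusters

# Example: both "Leicester City" mentions of type Organization fall into one
# seed cluster, which an annotator can later merge with or split from others.
print(seed_nil_clusters([("Leicester City", "Organization"),
                         ("Leicester City", "Organization"),
                         ("Leicester", "Location")]))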

The inter-annotator agreement (IAA) for the challenge datasets (2014, 2015 and 2016) is presented in Table 7.19 We computed these values using the annotation diff tool in GATE. As the annotators are not only classifying predefined mentions but can also define different mentions, traditional IAA measures such as Cohen's Kappa are less suited to this task. Therefore, we measured the IAA in terms of precision (P), recall (R) and F-measure (F1) using exact matches, where the annotations must be correct for both the entity mentions and the entity types over the full spans (entity mention and label) [80]. Annotations which were partially correct were considered as incorrect.

19 The inter-annotator agreement for the 2013 dataset could not be computed, as the challenge settings and intermediate data were lost due to a lack of organisation of the challenge.
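The exact-match agreement computation can be illustrated as follows; this is a simplified sketch, whereas the values in Table 7 were produced with GATE's annotation diff tool:

# Pairwise agreement as precision/recall/F1 with exact matches: an annotation
# counts as agreed only if both annotators produced the same
# (tweet_id, start, end, entity_type) tuple. Partial overlaps count as
# disagreement, mirroring the strict setting described above.
from typing import Set, Tuple

Annotation = Tuple[str, int, int, str]  # (tweet_id, start, end, entity_type)

def pairwise_iaa(a: Set[Annotation], b: Set[Annotation]) -> Tuple[float, float, float]:
    agreed = len(a & b)
    p = agreed / len(b) if b else 0.0  # annotator A treated as reference, B as "system"
    r = agreed / len(a) if a else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1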

Dataset      P         R         F1
NEEL 2014    49.49%    73.10%    59.02%
NEEL 2015    97.00%    98.5%     98.00%
NEEL 2016    90.31%    92.27%    91.28%

Table 7
Inter-Annotator Agreement on the challenge datasets.

The lessons learnt from building high quality gold standards are that the annotation process must be guided by annotation guidelines, that at least two annotators must be involved in the annotation process to ensure consistency, and that feedback from the participants is valuable in improving the quality of the datasets, providing annotations complementary to the cases found by the experts. The annotation guidelines, written by experts, must describe the annotation task (for instance, entity types and the NEEL taxonomy) through examples, and must be regularly updated during the manual annotation stage to describe special cases and issues encountered. In order to speed up the annotation process, it is good practice to employ an annotation tool. We used GATE because the annotation process was guided by a taxonomy-centric view. The annotation task took less time when the annotators shared the same background (e.g. all annotators were semantic web and natural language processing experts with experience in information extraction).

5. Corpus Analysis

While the main goals of the 2013-2016 challenges were the same, and the 2014-2016 corpora are largely built on top of each other, there are some differences among the datasets. In this section, we analyse the different datasets according to the characteristics of the entities and events annotated in them. We hereby reuse measures and scripts from [79] and add a readability measure analysis of the corpora. Note that for the Entity Linking analyses, we can only compare the 2014-2016 NEEL corpora since the 2013 corpus (CE2013) does not contain entity links.

5.1. Entity Overlap

Table 8 presents the entity overlap between the different datasets. Each row in the table represents the percentage of unique entities present in that dataset that are also represented in the other datasets.

5.2. Confusability

We define the true confusability of a surface form s as the number of meanings that this surface form can have.20 Because new organisations, people and places are named every day, there is no exhaustive collection of all named entities in the world. Therefore, the true confusability of a surface form is unknown, but we can estimate it through the function A(s) : S → N that maps a surface form to an estimate of the size of its candidate mapping, such that A(s) = |C(s)|.

The confusability of a surface form offers only a rough a priori estimate of the difficulty in linking that surface form. Observing the annotated occurrences of this surface form in a text collection allows us to make more informed estimates. We show the average number of meanings denoted by a surface form, indicating the confusability, as well as complementary statistical measures on the datasets in Table 9. In this table, we observe that most datasets have a low average number of meanings per surface form, but there is a fair amount of variation, i.e. in the number of surface forms that can refer to a meaning.
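A corpus-level estimate of confusability can be obtained directly from the annotations; the sketch below assumes they are available as (surface form, resource) pairs and reports the kind of summary statistics shown in Table 9:

# Estimate A(s) as the number of distinct resources observed for surface form
# s in the annotated corpus, and summarise average, min, max and standard
# deviation over all surface forms.
from collections import defaultdict
from statistics import mean, pstdev
from typing import Iterable, Tuple

def confusability_summary(annotations: Iterable[Tuple[str, str]]):
    meanings = defaultdict(set)
    for surface, resource in annotations:
        meanings[surface.lower()].add(resource)
    counts = [len(r) for r in meanings.values()]
    return {"average": mean(counts), "min": min(counts),
            "max": max(counts), "sigma": pstdev(counts)}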

5.3. Dominance

We define the true dominance of an entity resource ri21 for a given surface form si as a measure of how commonly ri is meant with regard to other possible meanings when si is used in a sentence. Let the dominance estimate D(ri, si) be the relative frequency with which the resource ri appears in Wikipedia links where si appears as the anchor text. Formally:

D(r_i, s_i) = \frac{|\mathrm{WikiLinks}(s_i, r_i)|}{\sum_{r \in R} |\mathrm{WikiLinks}(s_i, r)|}
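A sketch of this dominance estimate, assuming a hypothetical mapping from anchor texts to per-resource link counts harvested from Wikipedia:

# Dominance of a resource for a surface form, given anchor-text link counts.
from typing import Dict

def dominance(wiki_links: Dict[str, Dict[str, int]], surface: str, resource: str) -> float:
    counts = wiki_links.get(surface, {})
    total = sum(counts.values())
    return counts.get(resource, 0) / total if total else 0.0

# Example with toy counts: "paris" overwhelmingly links to the French capital.
toy = {"paris": {"dbr:Paris": 980, "dbr:Paris,_Texas": 15, "dbr:Paris_Hilton": 5}}
print(dominance(toy, "paris", "dbr:Paris"))  # 0.98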

The dominance statistics for the analysed datasets are presented in Table 10. The dominance scores for all corpora are quite high and the standard deviation is low, meaning that in the vast majority of cases a single resource is associated with a certain surface form in the annotations, leaving little variance for an automatic disambiguation system to resolve.

20 As surface form we refer to the lexical value of the mention.

21 An entity resource is an entry in a knowledge base that describes that entity; for example, http://dbpedia.org/resource/Hillary_Clinton is the DBpedia entry that describes the American politician Hillary Rodham Clinton.


                     NEEL 2014         NEEL 2015         NEEL 2016
NEEL 2014 (2,380)    -                 1,630 (68.49%)    1,633 (68.61%)
NEEL 2015 (2,800)    1,630 (58.21%)    -                 2,800 (100%)
NEEL 2016 (2,992)    1,633 (54.58%)    2,800 (93.58%)    -

Table 8
Entity overlap in the analysed datasets. Behind the dataset name in each row the number of unique entities present in that dataset is given. For each dataset pair the overlap is given as the number of entities and percentage (in parentheses).

Corpus       Average    Min.    Max.    σ
NEEL 2014    1.02       1       3       0.16
NEEL 2015    1.05       1       4       0.25
NEEL 2016    1.04       1       3       0.22

Table 9
Confusability statistics for the analysed datasets. Average stands for the average number of meanings per surface form, Min. and Max. stand for the minimum and maximum number of meanings per surface form found in the corpus respectively, and σ denotes the standard deviation.

Corpus       Dominance    Max    Min    σ
NEEL 2014    0.99         47     1      0.06
NEEL 2015    0.98         88     1      0.09
NEEL 2016    0.98         88     1      0.08

Table 10
Dominance statistics for the analysed datasets.

5.4. Summary

In this section, we have analysed the corpora in terms of their variance in named entities and readability.

As the datasets are built on top of each other, they show a fair amount of overlap in entities. This need not be a problem if there is enough variation among the entities, but the confusability and dominance statistics show that there are very few entities in our datasets with many different referents ("John Smiths"), and if such an entity is present, often only one of its referents is meant. To remedy this, future entity linking corpora should take care to balance the entity distribution and include more variety.

We experimented with various readability measures to assess the reading difficulty of the various tweet corpora. These measures would indicate that tweets are generally not very difficult in terms of word and sentence length, but the abbreviations and slang present in tweets prove them to be more difficult to interpret for readers outside the target community. To the best of our knowledge, there is no readability metric that takes this into account. Therefore we chose not to include those experimental results in this article.

6. Emerging Trends and Systems Overview

In the remainder of this analysis, we focus on two main tasks, namely Mention Detection and Candidate Selection. Thirty different approaches were applied in the four editions of the challenge since 2013. Table 11 lists all ranked teams.

6.1. Emerging Trends

Whilst there are substantial differences between the proposed approaches, a number of trends can be observed in the top-performing named entity recognition and linking approaches for tweets. Firstly, we observe the broad adoption of data-driven approaches: while in the first and second year of the challenge there was an extensive use of off-the-shelf approaches, the top ranking systems from 2013 to 2016 show a high dependence on the training data. This is not surprising, since these approaches are supervised, but it clearly suggests that labeled data is necessary to reach top performance. Additionally, the extensive use of knowledge bases as dictionaries of typed entities and as holders of entity relations has dramatically affected performance over the years. This strategy overcomes the lexical limitations of a tweet and performs well on the identification of entities available in the knowledge base used as referent. A phase common to all submitted approaches is normalisation, meaning smoothing the lexical variations of the tweets and translating them to language structures that can be better parsed by state-of-the-art approaches that expect more formal and well-formed text. Whilst the linguistic workflow favours the use of sequential solutions, Entity Recognition and Linking for tweets is proposed as a joint step using large knowledge bases as referent entity directories. While knowledge bases support the linking of entities with mentions in text, they cannot support the identification of novel and emerging entities. Ad-hoc solutions for tweets for the generation of NILs have been proposed, ranging from edit distance-based solutions to the use of Brown clustering.

APPROACH    AUTHORS                              NO. OF RUNS

2013 Entries
1           Habib, M. et al. [5]                 1
2           Dlugolinsky, S. et al. [6]           3
3           van Erp, M. et al. [7]               3
4           Cortis, K. [8]                       1
5           Godin, F. et al. [9]                 1
6           van Den Bosch, M. et al. [10]        3
7           Munoz-Garcia, O. et al. [11]         1
8           Genc, Y. et al. [12]                 1
9           Hossein, A. [13]                     1
10          Mendes, P. et al. [14]               3
11          Das, A. et al. [15]                  3
12          Sachidanandan, S. et al. [16]        1
13          de Oliveira, D. et al. [17]          1

2014 Entries
14          Chang, M. et al. [18]                1
15          Habib, M. et al. [19]                2
16          Scaiella, U. et al. [20]             2
17          Amir, M. et al. [21]                 3
18          Bansal, R. et al. [22]               1
19          Dahlmeier, D. et al. [23]            1

2015 Entries
20          Yamada, I. et al. [24]               10
21          Gârbacea, C. et al. [26]             10
22          Basile, P. et al. [27]               2
23          Guo, Z. et al. [25]                  1
24          Barathi Ganesh, H. B. et al. [28]    1
25          Sinha, P. et al. [29]                3

2016 Entries
26          Waitelonis, J. et al. [34]           1
27          Torres-Tramon, P. et al. [33]        1
28          Greenfield, K. et al. [31]           2
29          Ghosh, S. et al. [30]                3
30          Caliano, D. et al. [32]              2

Table 11
Per year submissions and number of runs for each team.

Between the first NEEL challenge on Concept Extraction (CE) and the 2016 edition we observe the following:

– tweet normalisation as the first step of any approach. This is generally defined as preprocessing and it increases the expressiveness of the tweets, e.g. via the expansion of Twitter accounts and hashtags with the actual names of the entities they represent, the conversion of non-ASCII characters, and, generally, noise filtering (a minimal illustration is given in the sketch after this list);

– the contribution of knowledge bases to the mention detection and typing task. This leads to higher coverage, which, along with the linguistic analysis and type prediction, better fits this particular domain;

– the use of high performing end-to-end approaches for the candidate selection. Such a methodology was further developed with the addition of fuzzy distance functions operating over n-grams and acronyms;

– the inclusion of a pruning stage to filter out candidate entities. This was presented in various approaches ranging from Learning-to-Rank to recasting the problem as a classification task. We observed that the classification-based approach reached better performance (in particular, the classifier that performed best for this task was implemented using an SVM with a radial basis function kernel), but it required extensive engineering of the feature set used for training;

– the use of hierarchical clustering of mentions to aggregate exact mentions of the same entity in the text, thus complementing the knowledge base entity directory when an entity is absent from it;

– a considerable decrease in off-the-shelf systems. These were popular in the first editions of NEEL, but in later editions their performance grew increasingly limited as the task became more constrained.
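As a minimal illustration of the normalisation step referenced in the first item, the sketch below expands user accounts and hashtags, transliterates non-ASCII characters and filters simple noise; the expansion dictionaries are hypothetical stand-ins for the richer resources participants used:

# Minimal tweet normalisation: account/hashtag expansion, ASCII folding,
# URL removal and whitespace cleanup.
import re
import unicodedata

ACCOUNT_NAMES = {"@bbcnews": "BBC News"}        # hypothetical expansion table
HASHTAG_NAMES = {"#oscars2016": "Oscars 2016"}  # hypothetical expansion table

def normalise(tweet: str) -> str:
    text = unicodedata.normalize("NFKD", tweet).encode("ascii", "ignore").decode("ascii")
    tokens = []
    for tok in text.split():
        low = tok.lower()
        if low.startswith("http"):
            continue  # drop URLs as noise
        tokens.append(ACCOUNT_NAMES.get(low, HASHTAG_NAMES.get(low, tok)))
    return re.sub(r"\s+", " ", " ".join(tokens)).strip()

print(normalise("@BBCNews réports from the #Oscars2016 http://t.co/xyz"))
# -> "BBC News reports from the Oscars 2016"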

Table 12 provides an overview of the methods and features used in these four years, grouped according to the steps of the workflow listed in Figure 1.

6.2. Systems overview

Table 13 presents a description of the approaches used for Mention Detection combined with Typing. Participants approached the task using lexical similarity matchers, machine learning algorithms, and hybrid methods that combine the two. For 2013, the strategies yielding the best results were hybrid, where models relied on the application of off-the-shelf systems (e.g., AIDA [55], ANNIE [56], OpenNLP,26 Illinois NET [57], Illinois Wikifier [58], LingPipe,27 OpenCalais, Stanford NER [59], WikiMiner,28 NERD [60], TwitterNLP [62], AlchemyAPI, DBpedia Spotlight, Zemanta) for both the identification of the boundaries of the entity (mention detection) and the assignment of a semantic type (entity typing). The top performing system proved to be System 1, which proposed an ensemble learning approach composed of a Conditional Random Field (CRF) and a Support Vector Machine (SVM) with a radial basis function kernel, specifically trained with the challenge dataset. The ensemble is performed via a union of the extraction results, while the typing is assigned via the class computed by the CRF.

26 https://opennlp.apache.org
27 http://alias-i.com/lingpipe
28 http://wikipedia-miner.cms.waikato.ac.nz

Step: Preprocessing
  Method: Cleaning, Expansion, Extraction
  Features: stop words, spelling dictionary, acronyms, hashtags, Twitter accounts, tweet timestamps, punctuation, capitalisation, token positions
  Knowledge Base: -
  Off-the-Shelf Systems: -

Step: Mention Detection
  Method: Approximate String Matching, Exact String Matching, Fuzzy String Matching, Acronym Search, Perfect String Matching, Levenshtein Matching, Context Similarity Matching, Conditional Random Fields, Random Forest, Jaccard String Matching, Prior Probability Matching
  Features: POS tags, tokens and adjacent tokens, contextual features, tweet timestamps, string similarity, n-grams, proper nouns, mention similarity score, Wikipedia titles, Wikipedia redirects, Wikipedia anchors, word embeddings
  Knowledge Base: Wikipedia, DBpedia
  Off-the-Shelf Systems: Semanticizer22

Step: Entity Typing
  Method: DBpedia Type, Logistic Regression, Random Forest, Conditional Random Fields
  Features: tokens, linguistic features, word embeddings, entity mentions, NIL mentions, DBpedia and Freebase types
  Knowledge Base: DBpedia, Freebase
  Off-the-Shelf Systems: AlchemyAPI,23 OpenCalais,24 Zemanta25

Step: Candidate Selection
  Method: Distributional Semantic Model, Random Forest, RankSVM, Random Walk with Restart, Learning to Rank
  Features: gloss, contextual features, graph distance
  Knowledge Base: Wikipedia, DBpedia
  Off-the-Shelf Systems: DBpedia Spotlight [61], AlchemyAPI, Zemanta, Babelfy [64]

Step: NIL Clustering
  Method: Conditional Random Fields, Random Forest, Brown Clustering, Lack of candidate, Score Threshold, Surface Form Aggregation, Type Aggregation
  Features: POS tags, contextual words, n-gram length, predicted entity types, capitalization ratio, entity mention label, entity mention type
  Knowledge Base: -
  Off-the-Shelf Systems: -

Table 12
Map of the approaches per sub-task applied in the NEEL series of challenges from 2013 until 2016.


The 2014 systems approached the Mention Detection task by adding lexicons and features computed from DBpedia resources. System 14, the best performing system, used a matcher between n-grams computed from the text and lexicon entries taken from DBpedia. From the 2014 challenge on, we observe more approaches favouring recall in the Mention Detection, while focusing less on linguistic features for mention detection. System 15, proposed by the same authors as the best performing system of 2013, addressed the Mention Detection task with a large set of linguistic and lexicon-related features (such as the probability of the candidate obtained from the Microsoft Web N-Gram services, or its appearance in WordNet) and using an SVM classifier with a radial basis function kernel specifically trained with the challenge data. Such an approach resulted in high precision, but it slightly penalised recall.

TEAM | EXTERNAL SYSTEM | MAIN FEATURES | MENTION DETECTION STRATEGY | LANGUAGE RESOURCE

2013 Entries
1 | AIDA | IsCap, AllCap, TwPOS2011 | CRF and SVM (RBF) | YAGO, Microsoft n-grams, WordNet
2 | ANNIE, OpenNLP, Illinois NET, Illinois Wikifier, LingPipe, OpenCalais, StanfordNER, WikiMiner | IsCap, AllCap, LowerCase, isNP, isVP, Token length | C4.5 decision tree | Google Gazetteer
3 | StanfordNER, NERD, TwitterNLP | IsCap, AllCap, Prefix, suffix, TwPOS2011, First word, last word | SVM SMO | -
4 | ANNIE | IsCap, ANNIE Pos | ANNIE | DBpedia and ANNIE Gazetteer
5 | Alchemy, DBpedia Spotlight, OpenCalais, Zemanta | - | Random Forest | -
6 | - | PosTreebank, lowercasing | IGTree memory-based taggers | Geonames.org Gazetteer, JRC names corpus
7 | Freeling | n-gram, PosFreeling 2012, isNP, Token Length | Lexical Similarity | Wiki and DBpedia Gazetteers
8 | NLTK [63] | n-grams, NLTKPos | Lexical Similarity | Wikipedia
9 | Babelfy API [64] | - | Lexical Similarity | DBpedia and BabelNet
10 | DBpedia Spotlight | n-grams, IsCap, AllCap, lower case | CRF | DBpedia, BALIE Gazetteers
11 | - | Stem, IsCap, TwPos2011, Follows | CRF | Country names, City names Gazetteers, Samsad and NICTA dictionaries, IsOOV
12 | - | IsCap, prefix, suffix | CRF | Wiki and Freebase Gazetteers
13 | - | n-gram | PageRank, CRF | YAGO, Wikipedia, WordNet

2014 Entries
14 | - | n-grams, stop words removal, punctuation as tokens | Lexical Similarity | Wikipedia and Freebase lexicons
15 | TwiNER [67] | Regular Expression, Entity phrases, N-gram | TwiNER and CRF | DBpedia Gazetteer, Wikipedia
16 | TAGME [65] | Wikipedia anchor texts, N-grams | Collective agreement and Wikipedia statistics | Wikipedia
17 | StanfordNER | - | - | NER Dictionary
18 | TwitterNLP | proper nouns sequence, n-grams | - | Wikipedia
19 | DBpedia Spotlight, TwitterNLP | Unigram, POS, lower, title and upper case, stripped words, isNumber, word cluster, DBpedia | CRF | DBpedia Gazetteer, Brown Clustering [66]

2015 Entries
20 | - | n-grams | Lexical Similarity joint with CRF, Random Forest | Wikipedia
21 | Semanticizer | - | CRF | DBpedia
22 | POS Tagger | n-grams | Maximum Entropy | DBpedia
23 | TwitIE [68] | - | - | DBpedia
24 | TwitIE | tokens | - | DBpedia
25 | - | tokens | CRF joint with POS Tagger | -

2016 Entries
26 | - | unigrams | Lexical Similarity | DBpedia
27 | GATE NLP | tokens | CRF | -
28 | - | n-grams | Lexical Similarity | DBpedia
29 | Stanford NER and ARK Twitter POS tagger [69] | tokens and POS | CRF | -
30 | - | tokens | Lexical Similarity | -

Table 13
Submissions of each team for the Mention Detection phase.


The 2015 best performing approach for Mention Detection, System 20, was largely inspired by the 2014 winning approach: n-grams are used to look up resources in DBpedia, together with a set of lexical features such as POS tags and position in tweets. The type was assigned by a Random Forest classifier specifically trained with the challenge dataset and using DBpedia-related features (such as PageRank [38]), word embeddings (contextual features), temporal popularity knowledge of an entity extracted from Wikipedia page view data, string similarity measures to measure the similarity between the title of the entity and the mention (such as edit distance), and linguistic features (such as POS tags, position in tweets, capitalization).

The 2016 best performing system, System 26, implements a lexicon matcher to match entities in the knowledge base to the unigrams computed from the text. The approach proposed a preliminary stage of tweet normalisation, resolving acronyms and hashtags to mentions written in natural language.

From 2014 on, the challenge task required participants to produce systems that were also able to link the detected mentions to their corresponding DBpedia link (if existing). Table 14 describes the approaches taken by the 2014, 2015, and 2016 participants for the Candidate Detection and Selection, and NIL Clustering stages. In 2014, most of the systems proposed a Candidate Selection step subsequent to the Mention Detection stage, implementing the conventional linguistic pipeline of first detecting the mention and then looking for referents of the mention in the external knowledge base. This resulted in a set of candidate links, which were ranked according to the similarity of the link with respect to the mention and the surrounding text. However, the best performing system (System 14) approached the Candidate Selection as a stage joint with the Mention Detection and link assignment, proposing a so-called end-to-end system. As opposed to most of the participants, which used off-the-shelf tools, System 14 proposed a SMART gradient boosting algorithm [83], specifically trained with the challenge dataset, where the features are textual features (such as textual similarity, contextual similarity), graph-based features (such as semantic cohesiveness between the entity-entity and entity-mention pairs), and statistical features (such as mention popularity using the Web as archive). The majority of the systems, including System 14, applied name normalisation for feature extraction, which was useful for identifying entities originally appearing as hashtags or username mentions. Among the most commonly used external knowledge sources are: NER dictionaries (e.g., Google CrossWiki), Knowledge Base Gazetteers (e.g., YAGO, DBpedia), weighted lexicons (e.g., Freebase, Wikipedia), and other sources (e.g., Microsoft Web N-gram).29 A wide range of features was investigated for Candidate Selection strategies: n-grams, capturing jointly the local (within a tweet) and global (within the knowledge base) contextual information of an entity via graph-based features (e.g., entity semantic cohesiveness). Other novel features included the use of Twitter account metadata and popularity-based statistical features for mention and entity characterisation respectively.

In the 2015 challenge, System 20 (ranked first) proposed an enhanced version of the 2014 challenge winner, combined with a pruning stage meant to increase the precision of the Candidate Selection while considering the role of the entity type assigned by a Conditional Random Field (CRF) classifier. In particular, System 20 is a five-stage sequential approach: preprocessing, generation of potential entity mentions, candidate selection, NIL detection, and entity mention typing. In the preprocessing stage, a tokenisation and Part-of-Speech (POS) tagging approach based on [69] was used, along with the extraction of tweet timestamps. The generation of potential entity mentions is addressed by computing n-grams (with n = 1..10 words) and matching them to Wikipedia titles, Wikipedia titles of the redirect pages, and anchor text using exact, fuzzy, and approximate match functions. An in-house dictionary of acronyms is built by splitting the mention surface into different n-grams (where one n-gram corresponds to one character). At this stage all entity mentions are linked to their candidates, i.e., the Wikipedia counterparts. The candidate selection is approached as a learning-to-rank problem: each mention is assigned a confidence score computed as the output of a supervised learning approach using a random forest as the classifier. An empirically defined threshold is used to select the relevant mentions; in case of mention overlap the span with the highest score is selected. System 20 implemented a Random Forest classifier working with a feature set composed of the predicted entity types, contextual features such as surrounding words, POS, length of the n-gram, and capitalization features to predict the NIL reference linked to the mention, thus turning the NIL Clustering into a supervised learning task that uses the challenge training dataset as training data. The mention entity typing stage is treated as a supervised learning task where two independent classifiers are built: a Logistic Regression classifier for typing entity mentions and a Random Forest for typing NIL entries. The other approaches can be classified as sequential, where the complexity is moved to only performing the right matching of the n-gram from the text and the (candidate) entity in the knowledge base. Most of these approaches exploit the popularity of the entities and apply distance similarity functions to better rank entities. From the analysis, a move to fully supervised in-house developed pipelines emerges, while the use of external systems is significantly reduced. The 2015 challenge also introduced the task of linking mentions to novel entities, i.e. entities not present in the knowledge base. All approaches in this challenge exploit lexical similarity distance functions and class information of the mentions.

29 http://research.microsoft.com/apps/pubs/default.aspx?id=130762

In 2016, the top performing system, System 26, proposed a lexicon-based joint Mention Extraction and Candidate Selection approach, where unigrams from tweets are mapped to DBpedia entities. A preprocessing stage cleans the tweets, assigns part-of-speech tags, and normalises the initial tweets by converting alphabetic, numeric, and symbolic Unicode characters to ASCII equivalents. For each entity candidate, it considers local and context-related features. Local features include the edit distance between the candidate labels and the n-gram, the candidate's link graph popularity, its DBpedia type, the provenance of the label, and the surface form that matches best. The context-related features assess the relation of a candidate entity to the other candidates within the given context. They include graph distance measurements, connected component analysis, or centrality and density observations using the DBpedia graph as pivot. The candidates are sorted according to the confidence score, which is used as a means to understand whether the entity actually describes the mention. In case the confidence score is lower than an empirically-determined threshold, the mention is annotated with a NIL.
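The confidence-threshold logic described above can be summarised as follows. This is a schematic sketch rather than System 26's implementation; the scoring function and the threshold value are placeholders:

# Threshold-based candidate selection with a NIL fallback: candidates for a
# mention are scored, the best one is kept if its confidence clears an
# empirically chosen threshold, otherwise the mention is annotated as NIL.
from typing import Callable, List, Optional

def select_candidate(mention: str,
                     candidates: List[str],
                     score: Callable[[str, str], float],
                     threshold: float = 0.5) -> Optional[str]:
    if not candidates:
        return None                      # no candidate at all -> NIL
    best = max(candidates, key=lambda c: score(mention, c))
    return best if score(mention, best) >= threshold else None  # None stands for NIL

# Usage with a toy scorer based on character overlap:
toy_score = lambda m, c: len(set(m.lower()) & set(c.lower())) / len(set(c.lower()))
print(select_candidate("Oscars", ["dbr:Academy_Awards", "dbr:Oscar_(name)"], toy_score, 0.3))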

The other approaches implement linguistic pipelines where the Candidate Selection is performed by looking up entities whose lexical value exactly matches the mentions against DBpedia titles, redirect pages, and disambiguation pages. We also observed a reduction in complexity for the NIL clustering, which came down to only considering the lexical distance of the mentions, as for System 27 with the Monge-Elkan similarity measure [81], or System 28, which experimented with the normalised Damerau-Levenshtein distance, performing better than Brown clustering [82].

7. Evaluation Strategies

In this section, the evaluation metrics used in the different challenges are described.

7.1. 2013 Evaluation Measures

In 2013, the submitted systems were evaluated based on their performance in extracting a mention and assigning its correct class as defined in the Gold Standard (GS). Thus a system was requested to provide a set of tuples of the form (m, t), where m is the mention and t is the type, which are then compared against the tuples of the gold standard. A type is any valid materialisation of the classes defined in Table 3, namely Person-type, Organisation-type, Location-type, and Misc-type. The precision (P), recall (R) and F-measure (F1) metrics were computed for each entity type. The final result for each system was reported as the average performance across the four entity types considered in the task. The evaluation was based on macro-averages across annotation types and tweets.

We performed a strict match between the tuples submitted and those in the GS. A strict match refers to an exact match, with conversion to lowercase, between a system value and the GS value for a given entity type t. Let (m, t) ∈ S_t denote the set of tuples extracted for an entity type t by system S; (m, t) ∈ GS denotes the set of tuples for entity type t in the gold standard. Then the set of true positives (TP), false positives (FP) and false negatives (FN) for a system is defined as:

TP_t = \{(m, t)_S \in S \mid \exists (m, t)_{GS} \in GS\}   (1)

FP_t = \{(m, t)_S \in S \mid \nexists (m, t)_{GS} \in GS\}   (2)

FN_t = \{(m, t)_{GS} \in GS \mid \nexists (m, t)_S \in S\}   (3)

TEAM | EXTERNAL SYSTEM | MAIN FEATURES | CANDIDATE SELECTION STRATEGY | LINGUISTIC KNOWLEDGE

2014 Approaches
14 | - | n-grams, lower case, entity graph features (entity semantic cohesiveness), popularity-based statistical features (clicks and visiting information from the Web) | DCD-SSVM [71] and SMART gradient boosting | Wikipedia, Freebase
15 | Google Search | n-grams, DBpedia and Wikipedia links, capitalisation | SVM | Wikipedia, DBpedia, WordNet, Web N-Gram, YAGO
16 | TAGME | link probability, mention-link commonness distance | C4.5 (for taxonomy-filter) | Wikipedia, DBpedia
17 | - | prefix, POS, suffix, Twitter account metadata, normalised mentions, trigrams | Entity Aggregate Prior, Prefix-tree Data Structure Classifier, Lexical Similarity | Wikipedia, DBpedia, YAGO
18 | - | Wikipedia context-based measure, anchor text measure, Twitter entity popularity | LambdaMART | Wikipedia Gazetteer, Google Cross Wiki Dictionary
19 | Wikipedia Search API, DBpedia Spotlight, Google Search | mentions | Lexical Similarity and Rule-based | Wikipedia, DBpedia

2015 Approaches
20 | - | word embeddings, entity popularity, commonness distance, string similarity distance | Random Forest, Logistic Regression | DBpedia
21 | Semanticizer | - | Learning to Rank | DBpedia
22 | - | mentions | Lesk [70] | DBpedia
23 | - | mentions, PageRank | Random Walks | DBpedia
24 | - | mentions | Lexical Similarity | DBpedia
25 | DBpedia Spotlight | mentions | Lexical Similarity | -

2016 Approaches
26 | - | graph distances, connected component analysis, or centrality and density observations | Learning to Rank | DBpedia
27 | - | mentions, graph distances | Lexical Similarity | DBpedia
28 | - | commonness, inverse document frequency anchor, term entity frequency, TCN, term frequency paragraph, and redirect | SVM | DBpedia
29 | Babelfy | mentions | - | -
30 | - | mentions | Lexical Similarity, context similarity | Wikipedia

Table 14
Submissions of each team for the Candidate Selection phase.

Since we require strict matches, a system must both detect the correct mention (m) and extract the correct entity type (t) from a tweet. Then, for a given entity type, we define:

P_t = \frac{|TP_t|}{|TP_t \cup FP_t|}   (4)

R_t = \frac{|TP_t|}{|TP_t \cup FN_t|}   (5)

Precision and recall, computed on a per-entity-type basis, are then macro-averaged as:

P = \frac{P_{PER} + P_{ORG} + P_{LOC} + P_{MISC}}{4}   (6)

R = \frac{R_{PER} + R_{ORG} + R_{LOC} + R_{MISC}}{4}   (7)

F1 = \frac{2 \times P \times R}{P + R}   (8)
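Putting Equations 1-8 together, the 2013 scoring scheme can be sketched as follows (a simplified illustration, not the official scorer):

# Strict (lowercased) matching of (mention, type) tuples, per-type
# precision/recall as in Equations 4-5, and macro-averaging over the four
# types as in Equations 6-8. Inputs are sets of (mention, type) tuples.
from typing import Set, Tuple

TYPES = ("PER", "ORG", "LOC", "MISC")

def macro_prf(system: Set[Tuple[str, str]], gold: Set[Tuple[str, str]]):
    sys_lc = {(m.lower(), t) for m, t in system}
    gold_lc = {(m.lower(), t) for m, t in gold}
    ps, rs = [], []
    for etype in TYPES:
        s_t = {x for x in sys_lc if x[1] == etype}
        g_t = {x for x in gold_lc if x[1] == etype}
        tp = len(s_t & g_t)
        ps.append(tp / len(s_t) if s_t else 0.0)   # |TP_t| / |TP_t ∪ FP_t|
        rs.append(tp / len(g_t) if g_t else 0.0)   # |TP_t| / |TP_t ∪ FN_t|
    p, r = sum(ps) / len(TYPES), sum(rs) / len(TYPES)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1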

Submissions were evaluated offline: participants were asked to annotate a test set of the GS within a short time window and to send the results in a TSV30 file.

7.2. 2014 Evaluation Measures

In 2014, a system S was evaluated in terms of its performance in extracting both mentions and links from a set of tweets. For each tweet of this set, a system S provided a tuple of the form (m, l), where m is the mention and l is the link. A link is any valid DBpedia URI31 that points to an existing resource (e.g. http://dbpedia.org/resource/Barack_Obama). The evaluation consisted of comparing the submission entry pairs against those in the GS. The measures used to evaluate each pair are precision (P), recall (R), and F-measure (F1). The evaluation was based on micro-averages.32

The evaluation procedure involved an a priori normalisation stage for each submission. Since some DBpedia links lead to redirect pages that point to final resources, we implemented a resolve mechanism for links that was uniformly applied to all participants. The correctness of the tuples provided by a system S was assessed as the exact match of the mention and the link. Here the tuple order was also taken into account. We define (m, l)_S ∈ S as the set of pairs extracted by the system S; (m, l)_GS ∈ GS denotes the set of pairs in the gold standard. We define the set of true positives (TP), false positives (FP), and false negatives (FN) for a given system as:

TP_l = \{(m, l)_S \in S \mid \exists (m, l)_{GS} \in GS\}   (9)

FP_l = \{(m, l)_S \in S \mid \nexists (m, l)_{GS} \in GS\}   (10)

FN_l = \{(m, l)_{GS} \in GS \mid \nexists (m, l)_S \in S\}   (11)

30 TSV stands for tab-separated values.

31 We considered all DBpedia v3.9 resources valid.

32 Since the 2014 NEEL Challenge, we opted to weigh all instances of TP, FP, FN for each tweet in the scoring, instead of weighing arithmetically by entity classes. This gives a more detailed picture of the effectiveness of the systems across different targets (typed mentions, links) and tweets.

TP_l defines the set of relevant pairs in S, in other words, the set of pairs in S that match the ones in GS. FP_l is the set of irrelevant pairs in S, in other words the pairs in S that do not match the pairs in GS. FN_l is the set of false negatives, denoting the pairs that are not recognised by S yet appear in GS. As our evaluation is based on a micro-average analysis, we sum the individual true positives, false positives, and false negatives. As we require an exact match for pairs (m, l), we are looking for strict entity recognition and linking matches; each system has to link each recognised entity to the correct resource l. Precision, Recall, and F1 are defined in Equation 12, Equation 13, and Equation 8 respectively.

P = \frac{\sum_l |TP_l|}{\sum_l |TP_l \cup FP_l|}   (12)

R = \frac{\sum_l |TP_l|}{\sum_l |TP_l \cup FN_l|}   (13)

Submissions were evaluated offline, where participants were asked to annotate the TS within a short time window and to send the results in a TSV file.

7.3. 2015 and 2016 Evaluation Measures

In the 2015 and 2016 editions of the NEEL challenge, systems were evaluated according to the number of mentions correctly detected, their type correctly asserted (i.e. the output of Mention Detection and Entity Typing), the links correctly assigned between a mention in a tweet and a knowledge base entry, and a NIL assigned when no knowledge base entry disambiguates the mention.

The required outputs were measured using a set of three evaluation metrics: strong_typed_mention_match, strong_link_match, and mention_ceaf. These metrics were combined into a final score (Equation 14).


score = 0.4 \times mention\_ceaf + 0.3 \times strong\_typed\_mention\_match + 0.3 \times strong\_link\_match   (14)

where the weights are empirically assigned to give more weight to the mention_ceaf, i.e. the ability of a system S to link the mention either to an existing entry in DBpedia or to a NIL entry generated by S and identified uniquely and consistently across different NILs.
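Equation 14 amounts to a fixed weighted combination of the three metric values, for instance:

# Final 2015/2016 ranking score (Equation 14) as a small helper; the inputs
# are the metric values produced by the scorer.
def neel_score(mention_ceaf: float,
               strong_typed_mention_match: float,
               strong_link_match: float) -> float:
    return (0.4 * mention_ceaf
            + 0.3 * strong_typed_mention_match
            + 0.3 * strong_link_match)

# Example: the 2015 winner's figures (0.84, 0.807, 0.762) give 0.8067.
print(round(neel_score(0.84, 0.807, 0.762), 4))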

The strong_typed_mention_match measures the performance of the system regarding the correct identification of mentions and their correct type assertion. The detection of mentions is still based on strict matching as in previous versions of the challenge; therefore the true positive (Equation 1), false positive (Equation 2), and false negative (Equation 3) counts are still calculated in the same manner. However, the measurement of precision and recall changed slightly. In 2013, we used macro-averaged precision and recall: first the precision, recall and F-measure are computed over the classes, and these are then averaged. Here, a more popular class can dominate the metrics. In 2015 we used micro-averaged metrics. In a micro-averaged precision and recall setup, each mention has an equal impact on the final result regardless of its class size. Therefore, precision (P) is calculated according to Equation 15 and recall (R) according to Equation 16. Finally, strong_typed_mention_match is the micro-averaged F1 as given by Equation 8.

P = \frac{\sum_t |TP_t|}{\sum_t |TP_t \cup FP_t|}   (15)

R = \frac{\sum_t |TP_t|}{\sum_t |TP_t \cup FN_t|}   (16)

The strong_link_match metric measures the correct link between a correctly recognised mention and a knowledge base entry. For a link to be considered correct, a system must detect the mention (m) and its type (t) correctly, as well as the correct knowledge base entry (l). Note also that this metric does not evaluate links to NIL. The detection of mentions is still based on strict matching as in previous versions of the challenge; therefore the true positive (Equation 9), false positive (Equation 10), and false negative (Equation 11) counts are still calculated in the same manner. This metric is also based on micro-averaged precision and recall as defined in Equation 12 and Equation 13, with the F1 as in Equation 8.

The last metric in our evaluation score is the Constrained Entity-Alignment F-measure (CEAF) [54]. This is a metric that measures coreference chains and is used to jointly evaluate the Candidate Selection and NIL Clustering steps. Let E = {m_1, . . . , m_n} denote the set of all mentions linked to e, where e is either a knowledge base entry or a NIL identifier. mention_ceaf finds the optimal alignment between the sets provided by the system and the gold standard and then computes micro-averaged precision and recall over the mentions.
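A compact sketch of the mention_ceaf idea: each entity (knowledge base entry or NIL identifier) corresponds to a set of mentions, system sets are aligned one-to-one with gold sets so that the summed overlap is maximal, and micro-averaged precision and recall follow. The official challenge scorer is the TAC-KBP tool; the code below is only an illustration using the Hungarian algorithm from SciPy:

# Mention-based CEAF sketch: overlap-maximising one-to-one alignment of
# system and gold mention sets, then micro-averaged precision/recall.
import numpy as np
from scipy.optimize import linear_sum_assignment
from typing import Dict, Set, Tuple

Mention = Tuple[str, int, int]  # (tweet_id, start, end)

def mention_ceaf(system: Dict[str, Set[Mention]], gold: Dict[str, Set[Mention]]):
    sys_sets, gold_sets = list(system.values()), list(gold.values())
    overlap = np.zeros((len(sys_sets), len(gold_sets)))
    for i, s in enumerate(sys_sets):
        for j, g in enumerate(gold_sets):
            overlap[i, j] = len(s & g)
    rows, cols = linear_sum_assignment(-overlap)  # maximise total overlap
    matched = overlap[rows, cols].sum()
    p = matched / max(1, sum(len(s) for s in sys_sets))
    r = matched / max(1, sum(len(g) for g in gold_sets))
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1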

In 2015, submissions were evaluated through an online process: participants were required to implement their systems as a publicly accessible web service following a REST-based protocol, and they could submit up to 10 contending entries to a registry of the NEEL challenge services. Each endpoint had a Web address (URI) and a name, which was referred to as runID. Upon receiving the registration of the REST endpoint, calls to the contending entry were scheduled for two different time windows, namely D-Time, to test the APIs, and T-Time, for the final evaluation and metric computations. To ensure the correctness of the results and avoid any loss, we triggered N (with N=100) calls to each entry. We then applied a majority voting approach over the set of annotations per tweet and statistically evaluated the latency by applying the law of large numbers [85]. The algorithm is detailed in Algorithm 1, where E is the set of contending entries and Tweet is the set of tweets. This offered the opportunity to measure the computing time systems spent in providing the answer. The computing time was proposed to resolve potential draws resulting from Equation 14.

As setting up a REST API increased the system implementation load on the participants, we reverted back to an offline evaluation setup in 2016. As in previous challenges, participants were asked to annotate the TS during a short time window and to send the results in a TSV file, which was then evaluated by the challenge chairs.

7.4. Summary

Three editions out of four followed an offline evaluation procedure. A discontinuity was introduced in 2015 with the introduction of the online evaluation procedure.

Algorithm 1 EVALUATE(E, Tweet, N = 100, M = 30)
 1: for all e_i ∈ E do
 2:   A^S = ∅, L^S = ∅
 3:   for all t_j ∈ Tweet do
 4:     for all n_k ∈ N do
 5:       (A, L) = annotate(t_j, e_i)
 6:     end for
 7:
 8:     // Majority Voting Selection of a from A
 9:     for all a_k ∈ A do
10:       hash(a_k)
11:     end for
12:     A^S_j = Majority Voting on the exact same hash(a_k)
13:
14:     // Random Selection of l from L
15:     generate L^T from the uniformly random selection of M l from L
16:     (µ, σ) = computeMuAndSigma(L^T)
17:     L^S_j = (µ, σ)
18:   end for
19: end for
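Algorithm 1 can be paraphrased in Python as follows, assuming a hypothetical annotate(tweet, endpoint) call that returns an annotation set together with its latency; the majority-voted annotations and the mean and standard deviation of a random latency sample are retained per tweet:

# Repeated-call evaluation: majority voting over N annotation calls per tweet
# and latency summarised over a random sample of M calls.
import random
from collections import Counter
from statistics import mean, pstdev

def evaluate_entry(endpoint, tweets, annotate, n_calls=100, m_sample=30):
    results = {}
    for tweet in tweets:
        outcomes, latencies = [], []
        for _ in range(n_calls):
            annotations, latency = annotate(tweet, endpoint)
            outcomes.append(frozenset(annotations))  # hashable for voting
            latencies.append(latency)
        voted, _ = Counter(outcomes).most_common(1)[0]
        sample = random.sample(latencies, m_sample)
        results[tweet] = {"annotations": voted,
                          "latency": (mean(sample), pstdev(sample))}
    return results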

Two issues were noted by the participants of the 2015 edition: i) the increased complexity of the task, going beyond the pure NEEL objectives; and ii) the unfair comparison of computing time with respect to big players, who can afford better computing resources than small research teams. These motivations led to the return to a conventional offline procedure for the 2016 edition. The emerging trend sees the consolidation of a de-facto standard scorer, proposed in TAC-KBP and now successfully adopted and widely used in our community. This scorer supports measuring the performance of the approaches across the entire annotation pipeline, ranging from Mention Extraction, Candidate Selection, and Typing to the detection of novel and emerging entities in highly dynamic contexts such as tweets.

8. Results

This section presents a compilation of the NEEL challenge results across the years. As the NEEL task evolved, the results of these years are not entirely comparable. Table 15 shows the results for the 2013 challenge task, where we report scores averaged over the four entity types analysed in this task.

The 2013 task consisted of building systems that could identify four entity types (i.e., Person, Location, Organisation and Miscellaneous) in a tweet. This task proved to be challenging, with some approaches favouring precision over recall. The best rank in precision was obtained by Team 1, which used a combination of rule-based and data-driven approaches, achieving 76.4% precision.

2013 Entries

TEAM    P        R        F1
1       0.764    0.604    0.67
2       0.724    0.613    0.662
3       0.735    0.611    0.658
4       0.734    0.595    0.61
5       0.688    0.546    0.589
6       0.774    0.548    0.589
7       0.683    0.483    0.561
8       0.685    0.5      0.54
9       0.662    0.482    0.518
10      0.627    0.383    0.494
11      0.564    0.43     0.491
12      0.501    0.468    0.489
13      0.53     0.402    0.399

Table 15
Scores achieved for the NEEL 2013 submissions.

For recall, results varied across the four entity types, with the miscellaneous and organisation types ranking lowest. Averaging over entity types, the best approach was that of Team 2, whose solution relied on gazetteers. All top 3 teams ranked by F-measure followed a hybrid approach combining rules and gazetteers.

The 2014 challenge task extended the concept extraction challenge by considering not only the entity type recognition but also the linking of entities to the DBpedia v3.9 knowledge base. Table 16 presents the results for this task, which follow the evaluation described in Section 7. There was a clear winner that outperformed all other systems on all three metrics; it was proposed by the Microsoft Research Lab Redmond.33 Most of the 2014 submissions followed a sequential approach, doing first the recognition and then the linking. The winning system (Team 14) introduced a novel approach, namely joint learning of recognition and linking from the training data. This approach outperformed the second best team in F-measure by over 15%.

The 2015 task extended the 2014 recognition and linking tasks with a clustering task. For this task participants had to provide clusters where each cluster contained only mentions of the same real world entity. For 2015 we also computed the latency of each system. Table 17 presents a ranked list of results for the 2015 submissions. The last column shows the final score for each participant following Equation 14.

33 https://www.microsoft.com/en-us/research/lab/microsoft-research-redmond/


2014 Entries

TEAM    P        R        F1
14      77.1     64.2     70.06
15      57.3     52.74    54.93
16      60.93    42.25    49.9
17      53.28    39.51    45.37
18      50.95    40.67    45.23
19      49.58    32.17    39.02

Table 16
Scores achieved for the NEEL 2014 submissions.

Here the winner (Team 20) outperformed the second best with a boost in tagging F1 of 41.9%, in clustering F1 of 28%, and in linking F1 of 23.9%. Team 20 improved upon the second best team on the overall score by 33.1%. For 2015, the winning team proposed an end-to-end system for both candidate selection and mention typing, along with a linguistic pipeline to perform entity typing and filtering. As in 2014, the best ranked system was proposed by a company, namely Studio Ousia,34 which focuses on knowledge extraction and artificial intelligence.

Finally, the 2016 challenge followed the same task as 2015. Team 26 outperformed all other participants, with an overall F1 score of 0.5486 and an absolute improvement of 16.58% compared to the second-best approach. Team 26 used a learning-to-rank approach for the candidate selection task, along with a series of graph-based metrics making use of DBpedia as their main linguistic knowledge source.

9. Conclusion

The NEEL challenge series was established in 2013 to foster the development of novel automated approaches for mining semantics from tweets and to provide standardised benchmark corpora enabling the community to compare systems.

This paper describes the decisions and procedures followed in setting up and running the task. We first described the annotation procedures used to create the NEEL corpora over the years. The procedures were incrementally adjusted over time to provide continuity and ensure reusability of the approaches over the different editions. While the consolidation has provided consistent labeled data, it has also shown the robustness of the community.

34 http://www.ousia.jp/en

We also described the different approaches proposed by the NEEL challenge participants. Over the years, we witnessed a convergence of the approaches towards data-driven solutions supported by knowledge bases. Knowledge bases are prominently used as a source for discovering known entities, relations among data, and labelled data for selecting candidates and suggesting novel entities. Data-driven approaches have become, with variations, the leading solution. Despite the consolidated number of options for addressing the challenge task, the participants' results show that the NEEL task remains challenging in the microposts domain.

Furthermore, we explained the different evaluation strategies used in the different challenges. These changes were driven by a desire to ensure fairness of the evaluation, transparency, and correctness. These adaptations involve the use of in-house scoring tools in 2013 and 2014, which were made publicly available and discussed in the community. Since 2015 the TAC-KBP challenge scorer was adopted, both to leverage the wide experience developed in the TAC-KBP community and to break down the analysis to account for the clustering.

Thanks to the yearly releases of the annotations and tweet IDs with a public license, the NEEL corpus has started to become widely adopted. Beyond the thirty teams who completed the evaluations in four years, more than three hundred participants have contacted the NEEL organisers with a request to acquire the corpora. The teams come from more than twenty different countries and are from both academia and industry. The 2014 and 2015 winners were companies operating in the field, respectively Microsoft and Studio Ousia. The 2013 and 2016 winners were academic teams. The success of the NEEL challenges is also illustrated by the sponsorships of the challenges offered by companies (ebay35 in 2013 and SpazioDati36 in 2015) and research projects (LinkedTV37 in 2014, and FREME38 in 2016).

The NEEL challenges also triggered the interest of local communities such as NEEL-IT.

35 http://www.ebay.com
36 http://www.spaziodati.eu
37 http://www.linkedtv.eu
38 http://freme-project.eu/


2015 Entries

TEAM    TAGGING F1    CLUSTERING F1    LINKING F1    LATENCY [S]    SCORE
20      0.807         0.84             0.762         8.5±3.62       0.8067
25      0.388         0.506            0.523         0.13±0.02      0.4757
21      0.412         0.643            0.316         0.19±0.09      0.4756
22      0.367         0.459            0.464         2.03±2.35      0.4329
23      0.329         0.394            0.415         3.41±7.62      0.3808
24      0             0.001            0             12.89±27.6     0.004

Table 17
Scores achieved for the NEEL 2015 submissions. Tagging refers to strong_typed_mention_match, clustering refers to mention_ceaf, and linking to strong_link_match.

2016 Entries

TEAM   TAGGING F1   CLUSTERING F1   LINKING F1   SCORE
26     0.473        0.641           0.501        0.5486
27     0.246        0.621           0.202        0.3828
28     0.319        0.366           0.396        0.3609
29     0.312        0.467           0.248        0.3548
30     0.246        0.203           0.162        0.3353

Table 18
Scores achieved for the NEEL 2016 submissions. Tagging refers to strong_typed_mention_match, clustering refers to mention_ceaf, and linking to strong_link_match.
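As a sanity check on Tables 17 and 18, the SCORE column is consistent (up to rounding) with a weighted combination of 0.4 x clustering (mention_ceaf) + 0.3 x tagging + 0.3 x linking. The weights below are inferred from the published values rather than restated from the task definition, so treat this snippet as a verification aid, not the official scorer.

```python
# Recompute the composite SCORE from the per-measure F1 values reported in
# Tables 17 and 18, assuming weights inferred from the published numbers:
# 0.4 * clustering (mention_ceaf) + 0.3 * tagging + 0.3 * linking.
def neel_score(tagging_f1, clustering_f1, linking_f1):
    return 0.4 * clustering_f1 + 0.3 * tagging_f1 + 0.3 * linking_f1

print(round(neel_score(0.807, 0.84, 0.762), 4))   # 0.8067, Team 20 (2015)
print(round(neel_score(0.473, 0.641, 0.501), 4))  # 0.5486, Team 26 (2016)
```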

sharing the algorithms and results of mining semantics from Italian tweets. In 2015, we also built bridges with the TAC community. We plan to strengthen these links and to involve a larger audience of potential participants from Linguistics, Machine Learning, Knowledge Extraction, and Data and Web Science.

Future work involves the generation of corpora that account for the low variance of entity-type semantics. We aim to create larger datasets covering a broader range of entity types and domains within the Twitter sphere. The 2015 enhancements to the evaluation strategy, which account for computational time, highlighted new challenges when weighing an algorithm's efficiency against its efficacy. Since handling large-scale data mining increasingly involves distributed computing and optimisation, we aim to develop new evaluation strategies. These strategies will ensure the fairness of the results when asking participants to produce large-scale annotations within a small window of time. Among the future efforts, we aim to identify the differences in performance among the disparate systems and their approaches, first characterising what can be considered an error by the systems in the context of the challenge, and then deriving insightful conclusions on the building blocks needed to build an optimal system automatically.

Finally, given the increasing interest in adopting the NEEL guidelines to create corpora for other languages, we aim to develop a multilingual NEEL challenge as a future activity.

Acknowledgments

This work was supported by the H2020 FREME project (GA no. 644771), by the research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289, and by the CLARIAH-CORE project financed by the Netherlands Organisation for Scientific Research (NWO).

References

[1] A. E. Cano Basave, A. Varga, M. Rowe, M. Stankovic, A. Dadzie, Making Sense of Microposts (#MSM2013) Concept Extraction Challenge, in: Concept Extraction Challenge at the Workshop on 'Making Sense of Microposts', Amparo E. Cano, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.



[2] A. E. Cano Basave, G. Rizzo, A. Varga, M. Rowe, M. Stankovic, A. Dadzie, Making Sense of Microposts (#Microposts2014) Named Entity Extraction & Linking Challenge, in: 4th Workshop on Making Sense of Microposts, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2014.

[3] G. Rizzo, A. E. Cano Amparo, B. Pereira, A. Varga, Making Sense of Microposts (#Microposts2015) Named Entity rEcognition & Linking Challenge, in: 5th International Workshop on Making Sense of Microposts, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2015.

[4] G. Rizzo, M. van Erp, J. Plu, R. Troncy, Making Sense of Microposts (#Microposts2016) Named Entity rEcognition and Linking (NEEL) Challenge, in: 6th International Workshop on Making Sense of Microposts, Aba-Sah Dadzie, Daniel Preoţiuc-Pietro, Danica Radovanović, Amparo E. Cano Basave, Katrin Weller, ed., CEUR-WS, 2016.

[5] M. Habib, M. Van Keulen, Z. Zhu, Concept Extraction Chal-lenge: University of Twente at #MSM2013, in: Concept Extrac-tion Challenge at the Workshop on ’Making Sense of Microp-osts’, Amparo E. Cano, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[6] S. Dlugolinský, Peter Krammer, Marek Ciglan, Michal La-clavik, MSM2013 IE Challenge: Annotowatch, in: Concept Ex-traction Challenge at the Workshop on ’Making Sense of Mi-croposts’, Amparo E. Cano, Matthew Rowe, Milan Stankovic,Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[7] M. Van Erp, G. Rizzo, R. Troncy, Learning with the Web:Spotting Named Entities on the Intersection of NERD and Ma-chine Learning, in: Concept Extraction Challenge at the Work-shop on ’Making Sense of Microposts’, Amparo E. Cano,Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[8] K. Cortis, ACE: A Concept Extraction Approach using LinkedOpen Data, in: Concept Extraction Challenge at the Work-shop on ’Making Sense of Microposts’, Amparo E. Cano,Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[9] F. Godin, P. Debevere, E. Mannens, W. De Neve, R. Van deWalle, Leveraging Existing Tools for Named Entity Recogni-tion in Microposts, in: Concept Extraction Challenge at theWorkshop on ’Making Sense of Microposts’, Amparo E. Cano,Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[10] A. van Den Bosch, T. Bogers, Memory-based Named EntityRecognition in Tweets, in: Concept Extraction Challenge at theWorkshop on ’Making Sense of Microposts’, Amparo E. Cano,Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[11] Ó. Muñoz-García, A. García-Silva, Ó. Corcho, Towards Con-cept Identification using a Knowledge-Intensive Approach, in:Concept Extraction Challenge at the Workshop on ’MakingSense of Microposts’, Amparo E. Cano, Matthew Rowe, MilanStankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[12] Y. Genc, W. Mason, J. V. Nickerson, Classifying ShortMessages using Collaborative Knowledge Bases: ReadingWikipedia to Understand Twitter, in: Concept Extraction Chal-lenge at the Workshop on ’Making Sense of Microposts’,Amparo E. Cano, Matthew Rowe, Milan Stankovic, Aba-SahDadzie, ed., CEUR-WS, 2013.

[13] A. Hossein Jadidinejad, Unsupervised Information Extractionusing BabelNet and DBpedia, in: Concept Extraction Challenge

at the Workshop on ’Making Sense of Microposts’, Amparo E.Cano, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed.,CEUR-WS, 2013.

[14] P. Mendes, D. Weissenborn, C. Hokamp, DBpedia Spotlightat the MSM2013 Challenge, in: Concept Extraction Challengeat the Workshop on ’Making Sense of Microposts’, Amparo E.Cano, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed.,CEUR-WS, 2013.

[15] A. Das, U. Burman, B. Ar, S. Bandyopadhyay, NER fromTweets: SRI-JU System MSM 2013, in: Concept ExtractionChallenge at the Workshop on ’Making Sense of Microposts’,Amparo E. Cano, Matthew Rowe, Milan Stankovic, Aba-SahDadzie, ed., CEUR-WS, 2013.

[16] S. Sachidanandan, P. Sambaturu, K. Karlapalem, NERTUW:Named Entity Recognition on Tweets using Wikipedia, in:Concept Extraction Challenge at the Workshop on ’MakingSense of Microposts’, Amparo E. Cano, Matthew Rowe, MilanStankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[17] D. de Oliveira, A. Laender, A. Veloso, A. Da Silva, Filter-Stream Named Entity Recognition: A Case Study at the#MSM2013 Concept Extraction Challenge, in: Concept Extrac-tion Challenge at the Workshop on ’Making Sense of Microp-osts’, Amparo E. Cano, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2013.

[18] M. Chang, B. Hsu, H. Ma, R. Loynd, K. Wang, E2E: An End-to-End Entity Linking System for Short and Noisy Text, in: 4th

Workshop on Making Sense of Microposts, Matthew Rowe, Mi-lan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2014.

[19] M. B. Habib, M. van Keulen, Z. Zhu, Named Entity Extraction and Linking Challenge: University of Twente at #Microposts2014, in: 4th Workshop on Making Sense of Microposts, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2014.

[20] U. Scaiella, M. Barbera, S. Parmesan, G. Prestia, E. Del Tes-sandoro, M. Veri, DataTXT at #Microposts2014 Challenge, in:4th Workshop on Making Sense of Microposts, Matthew Rowe,Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2014.

[21] M. Amir Yosef, J. Hoffart, Y. Ibrahim, A. Boldyrev, G.Weikum, Adapting AIDA for Tweets, in: 4th Workshop on Mak-ing Sense of Microposts, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2014.

[22] R. Bansal, S. Panem, P. Radhakrishnan, M. Gupta, V. Varma,Linking Entities in #Microposts, in: 4th Workshop on MakingSense of Microposts, Matthew Rowe, Milan Stankovic, Aba-SahDadzie, ed., CEUR-WS, 2014.

[23] D. Dahlmeier, N. Nandan, W. Ting, Part-of-Speech is (almost)enough: SAP Research & Innovation at the #Microposts2014NEEL Challenge, in: 4th Workshop on Making Sense of Micro-posts, Matthew Rowe, Milan Stankovic, Aba-Sah Dadzie, ed.,CEUR-WS, 2014.

[24] I. Yamada, H. Takeda, Y. Takefuji, An End-to-End Entity Link-ing Approach for Tweets, in: 5th International Workshop onMaking Sense of Microposts, Matthew Rowe, Milan Stankovic,Aba-Sah Dadzie, ed., CEUR-WS, 2015.

[25] Z. Guo, D. Barbosa, Entity Recognition and Linking on Tweetswith Random Walks, in: 5th International Workshop on MakingSense of Microposts, Matthew Rowe, Milan Stankovic, Aba-SahDadzie, ed., CEUR-WS, 2015.

[26] C. Gârbacea, D. Odijk, D. Graus, I. Sijaranamual, M. de Ri-jke, Combining Multiple Signals for Semanticizing Tweets: Uni-versity of Amsterdam at #Microposts2015, in: 5th International



Workshop on Making Sense of Microposts, Matthew Rowe, Mi-lan Stankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2015.

[27] P. Basile, A. Caputo, G. Semeraro, F. Narducci, UNIBA: Ex-ploiting a Distributional Semantic Model for Disambiguatingand Linking Entities in Tweets, in: 5th International Work-shop on Making Sense of Microposts, Matthew Rowe, MilanStankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2015.

[28] H. B. Barathi Ganesh, N. Abinaya, M. Anand Kumar, R.Vinaykumar, K.P. Soman, AMRITA - CENNEEL: Identificationand Linking of Twitter Entities, in: 5th International Work-shop on Making Sense of Microposts, Matthew Rowe, MilanStankovic, Aba-Sah Dadzie, ed., CEUR-WS, 2015.

[29] P. Sinha, B. Barik, Named Entity Extraction and Linkingin #Microposts, in: 5th International Workshop on MakingSense of Microposts, Matthew Rowe, Milan Stankovic, Aba-SahDadzie, ed., CEUR-WS, 2015.

[30] S. Ghosh, P. Maitra, D. Das, Feature Based Approach to Named Entity Recognition and Linking for Tweets, in: 6th International Workshop on Making Sense of Microposts, Aba-Sah Dadzie, Daniel Preoţiuc-Pietro, Danica Radovanović, Amparo E. Cano Basave, Katrin Weller, ed., CEUR-WS, 2016.

[31] K. Greenfield, R. Caceres, M. Coury, K. Geyer, Y. Gwon, J. Matterer, A. Mensch, C. Sahin, O. Simek, A Reverse Approach to Named Entity Extraction and Linking in Microposts, in: 6th International Workshop on Making Sense of Microposts, Aba-Sah Dadzie, Daniel Preoţiuc-Pietro, Danica Radovanović, Amparo E. Cano Basave, Katrin Weller, ed., CEUR-WS, 2016.

[32] D. Caliano, E. Fersini, P. Manchanda, M. Palmonari, E. Messina, UniMiB: Entity Linking in Tweets using Jaro-Winkler Distance, Popularity and Coherence, in: 6th International Workshop on Making Sense of Microposts, Aba-Sah Dadzie, Daniel Preoţiuc-Pietro, Danica Radovanović, Amparo E. Cano Basave, Katrin Weller, ed., CEUR-WS, 2016.

[33] P. Torres-Tramon, H. Hromic, B. Walsh, B. Heravi, C. Hayes, Kanopy4Tweets: Entity Extraction and Linking for Twitter, in: 6th International Workshop on Making Sense of Microposts, Aba-Sah Dadzie, Daniel Preoţiuc-Pietro, Danica Radovanović, Amparo E. Cano Basave, Katrin Weller, ed., CEUR-WS, 2016.

[34] J. Waitelonis, H. Sack, Named Entity Linking in #Tweets with KEA, in: 6th International Workshop on Making Sense of Microposts, Aba-Sah Dadzie, Daniel Preoţiuc-Pietro, Danica Radovanović, Amparo E. Cano Basave, Katrin Weller, ed., CEUR-WS, 2016.

[35] G. Rizzo, R. Troncy, S. Hellmann, M. Brummer, NERD meetsNIF: Lifting NLP Extraction Results to the Linked Data Cloud,in: 5th International Workshop on Linked Data on the Web(LDOW), 2012

[36] M. Osborne, S. Petrovic, R. Mccreadie, C. Macdonald and I.Ounis, Bieber no more: First story detection using twitter andwikipedia, in: Workshop on Time-aware Information Access,2012.

[37] S. Petrovic, M. Osborne, V. Lavrenko, Streaming First storydetection with application to Twitter, in: Human LanguageTechnologies: The 2010 Annual Conference of the North Amer-ican Chapter of the Association for Computational Linguistics,Association for Computational Linguistics, 2010.

[38] S. Brin and L. Page, The anatomy of a large-scale hypertextualweb search engine, Comput. Netw. ISDN Syst. 30 (1998): 107-117, doi: 10.1016/S0169-7552(98)00110-X.

[39] R. Grishman, B. Sundheim, Message Understanding

Conference-6: a brief history, in: 16th International Con-ference on Computational Linguistics - Volume 1 (COL-ING), Association for Computational Linguistics, 1996, doi:10.3115/992628.992709.

[40] E. F. Tjong Kim Sang, F. D. Meulder, Introductionto the CoNLL-2003 Shared Task: Language IndependentNamed Entity Recognition. in: 7th Conference on Natu-ral Language Learning at HLT-NAACL 2003 - Volume 4(CoNLL), Association for Computational Linguistics, 2003,doi: 10.3115/1119176.1119195.

[41] R. C. Bunescu, M. Pasca, Using Encyclopedic Knowledge forNamed entity Disambiguation, in: European Chapter of the As-sociation for Computational Linguistics (EACL), 2006.

[42] S. Cucerzan, Large-Scale Named Entity Disambiguation Basedon Wikipedia Data, in: Empirical Methods in Natural LanguageProcessing (EMNLP-CoNLL), 2007.

[43] G. Rizzo, M. van Erp, R. Troncy, Benchmarking the Extractionand Disambiguation of Named Entities on the Semantic Web,in: 9th International Conference on Language Resources andEvaluation, 2014.

[44] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, N.Aswani, I. Roberts, G. Gorrell, A. Funk, A. Roberts, D. Daml-janovic, T. Heitz, M. A. Greenwood, H. Saggion, J. Petrak, Y.Li, W. Peters, Text Processing with GATE (Version 6), GatewayPress CA, 2011.

[45] N. Chinchor, P. Robinson, MUC-7 named entity task definition, in: 7th Conference on Message Understanding, 1997.

[46] P. McNamee, H. T. Dang, Overview of the TAC 2009 Knowl-edge Base Population Track - NIST, in: Text Analysis Confer-ence (TAC), 2009.

[47] H. Ji, R. Grishman, H. T. Dang, Overview of the TAC 2011Knowledge Base Population Track - NIST, in: Text AnalysisConference (TAC), 2011.

[48] H. Ji, J. Nothman, B. Hachey, R. Florian, Overview of TAC-KBP2015 Tri-lingual Entity Discovery and Linking, in: TextAnalysis Conference (TAC), 2015.

[49] B. Hachey, J. Nothman, W. Radford (2014), Cheap and easyentity evaluation, in: Association for Computational LinguisticsConference (ACL), 2014.

[50] D. Carmel, M.-W. Chang, E. Gabrilovich, B.-J. P. Hsu,K. Wang, ERD’14: Entity Recognition and Disambigua-tion Challenge, SIGIR Forum 48 (2014), 63–77, doi:10.1145/2701583.2701591.

[51] A. Moro, R. Navigli, SemEval-2015 task 13: multilingual all-words sense disambiguation and entity linking, in: InternationalWorkshop on Semantic Evaluation (SemEval), 2015.

[52] T. Baldwin, Y.-B. Kim, M. C. de Marneffe, A. Ritter, B. Han,W. Xu, Shared tasks of the 2015 workshop on noisy user-generated text: Twitter lexical normalization and named entityrecognition, in: 53rd Annual Meeting of the Association forComputational Linguistics and The 7th International Joint Con-ference of the Asian Federation of Natural Language Processing(ACL-IJCNLP), 2015.

[53] A. Bagga, B. Baldwin, Algorithms for scoring coreferencechains, in: 1st International Conference on Language Resourcesand Evaluation Workshop on Linguistics Coreference, 1998.

[54] X. Luo, On coreference resolution performance met-rics, Human Language Technology and Empirical Meth-ods in Natural Language Processing Conference (HLT-EMNLP), Association for Computational Linguistics, 2005, doi:10.3115/1220575.1220579.



[55] J. Hoffart, M. Amir Yosef, I. Bordino, H. Fürstenau, M. Pinkal,M. Spaniol, B. Taneva, S. Thater, G. Weikum, Robust Disam-biguation of Named Entities in Text, in: Empirical Methods inNatural Language Processing Conference (EMNLP), Associa-tion for Computational Linguistics, 2011.

[56] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE:A Framework and Graphical Development Environment for Ro-bust NLP Tools and Applications, in: 40th Anniversary Meetingof the Association for Computational Linguistics (ACL), 2002.

[57] L. Ratinov, D. Roth, Design challenges and misconcep-tions in named entity recognition, in: 13th Conferenceon Computational Natural Language Learning (CoNLL),Association for Computational Linguistics, 2009 doi:10.3115/1596374.1596399.

[58] L. Ratinov, D. Roth, D. Downey, and M. Anderson, Local andGlobal Algorithms for Disambiguation to Wikipedia, in: 49th

Annual Meeting of the Association for Computational Linguis-tics (ACL), Association for Computational Linguistics, 2011.

[59] J. R. Finkel, T. Grenager, and C. Manning, Incorporating Non-local Information into Information Extraction Systems by GibbsSampling, in: 43rd Annual Meeting on Association for Compu-tational Linguistics (ACL), Association for Computational Lin-guistics, 2005, doi: 10.3115/1219840.1219885.

[60] G. Rizzo, R. Troncy, NERD: A Framework for UnifyingNamed Entity Recognition and Disambiguation Web ExtractionTools, in: Demonstrations at the 13th Conference of the Euro-pean Chapter of the Association for Computational Linguistics(EACL), Association for Computational Linguistics, 2012.

[61] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer, DBpe-dia Spotlight: Shedding Light on the Web of Documents, in: 7th

International Conference on Semantic Systems (I-Semantics),ACM, 2011, doi: 10.1145/2063518.2063519.

[62] A. Ritter, S. Clark, Mausam, and O. Etzioni, Named En-tity Recognition in Tweets: An Experimental Study, in: Confer-ence on Empirical Methods on Natural Language Processing(EMNLP), Association for Computational Linguistics, 2011.

[63] E. Loper, S. Bird, NLTK: The Natural Language Toolkit, in:ACL-02 Workshop on Effective Tools and Methodologies forTeaching Natural Language Processing and Computational Lin-guistics - Volume 1, Association for Computational Linguistics,2002, doi: 10.3115/1118108.1118117.

[64] A. Moro, A. Raganato, R. Navigli, Entity Linking meets WordSense Disambiguation: a Unified Approach, Transactions of theAssociation for Computational Linguistics 2 2014.

[65] P. Ferragina, U. Scaiella, TAGME: On-the-fly Annotation ofShort Text Fragments (by Wikipedia Entities), in: 19th Inter-national Conference on Information and Knowledge (CIKM),ACM, 2010, doi: 10.1145/1871437.1871689.

[66] P. Liang, Semi-Supervised Learning for Natural Language,Master Thesis, MIT, 2005.

[67] C. Li, J. Weng, Q. He, Y. Yao, A. Datta, A. Sun, and B. S. Lee, TwiNER: Named Entity Recognition in Targeted Twitter Stream, in: 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM SIGIR), ACM, 2012, doi: 10.1145/2348283.2348380.

[68] K. Bontcheva, L. Derczynski, A. Funk, M.A. Greenwood,D. Maynard, N. Aswani, TwitIE: An Open-Source InformationExtraction Pipeline for Microblog Text, in: International Con-ference on Recent Advances in Natural Language Processing,RANLP, 2013.

[69] K. Gimpel, N. Schneider, B. O’Connor, D. Das, Daniel Mills,

Jacob Eisenstein, Michael Heilman, Dani Yogatama, JeffreyFlanigan, and N. A. Smith, Part-of-speech Tagging for Twitter:Annotation, Features, and Experiments, in: 49th Annual Meet-ing of the Association for Computational Linguistics: HumanLanguage Technologies: Short Papers - Volume 2 (ACL), Asso-ciation for Computational Linguistics, 2011.

[70] P. Basile, A. Caputo, and G. Semeraro, An Enhanced LeskWord Sense Disambiguation Algorithm through a DistributionalSemantic Model, in: 25th International Conference on Compu-tational Linguistics (COLING), 2014.

[71] M. Chang, W. Yih, Dual Coordinate Descent Algorithms for Efficient Large Margin Structured Prediction, Transactions of the Association for Computational Linguistics 1 (2013), 207-218.

[72] Carl-Hugo Björnsson, Läsbarhet, Liber, 1968.

[73] Meri Coleman and T. L. Liau, A computer readability formula designed for machine scoring, Journal of Applied Psychology 60 (1975), 283–284, doi: 10.1037/h0076540.

[74] Rudolf Flesch, How to Write Plain English: A Book forLawyers and Consumers, Harper & Row, 1979.

[75] Robert Gunning, The Technique of Clear Writing, McGraw-Hill, 36–37 , 1952.

[76] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, and B. S.Chissom, Derivation of new readability formulas (automatedreadability index, fog count and flesch reading ease formula)for navy enlisted personnel, Technical report, Naval TechnicalTraining, U. S. Naval Air Station, Memphis, TN, 1975.

[77] G. Harry McLaughlin, Smog grading – a new readability for-mula, Journal of Reading 12 (1969), 639–646.

[78] R. J. Senter and E. A. Smith, Automated readability index,Technical report, Wright-Patterson Air Force Base, 1965.

[79] Marieke van Erp, Pablo Mendes, Heiko Paulheim, FilipIlievski, Julien Plu, Giuseppe Rizzo, and Joerg Waitelonis, Eval-uating entity linking: An analysis of current benchmark datasetsand a roadmap for doing a better job, in: 10th InternationalConference on Language Resources and Evaluation (LREC),2016.

[80] H. Cunningham, D. Maynard, K. Bontcheva, and et al., Devel-oping Language Processing Components with GATE Version 8(a User Guide), Technical Report, The University of Sheffield,Department of Computer Science, 2014.

[81] W. Cohen, P. Ravikumar, S. E. Fienberg, A Comparison ofString Metrics for Matching Names and Records, in: KDDWorkshop on Data Cleaning and Object Consolidation, 2003.

[82] P. F. Brown, P. V. de Souza, R. L. Mercer, V. J. Della Pietra, J.C. Lai, Class-based N-gram Models of Natural Language, Com-put. Linguist. 18 (1992), 467–479.

[83] J. H. Friedman. Greedy function approximation: A gradientboosting machine, in: Annals of Statistics, 1999.

[84] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P.N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, C. Bizer, DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia, Semantic Web 6 (2015), 167–195.

[85] R. Walpole, R. Myers, Probability and statistics for engineers& scientists (Eighth Edition), Pearson Education International,2007.

[86] F. Abel, I. Celik, G. Houben, P. Siehndel, Leveraging the semantics of tweets for adaptive faceted search on twitter, in: 10th International Conference on The Semantic Web - Volume Part I (ISWC), Springer-Verlag, 2011.

[87] A. Varga, A.E. Cano, M. Rowe, F. Ciravegna, Y. He, Linked



knowledge sources for topic classification of microposts: A se-mantic graph-based approach, Web Semantics: Science, Servicesand Agents on the World Wide Web 26 (2014), 36–57.

[88] F. Abel, Q. Gao, G. Houben, K. Tao, Semantic Enrichment ofTwitter Posts for User Profile Construction on the Social Web,The Semanic Web: Research and Applications. ESWC 2011,Antoniou G. et al. (eds), Lecture Notes in Computer Science,vol 6644. Springer, Berlin, Heidelberg, doi: 10.1007/978-3-642-21064-8_26.

[89] P. Basile, A. Caputo, A. L. Gentile, G. Rizzo, Overview of theEVALITA 2016 Named Entity rEcognition and Linking in ItalianTweets (NEEL-IT) Task, in: 5th Evaluation Campaign of NaturalLanguage Processing and Speech Tools for Italian (EVALITA),2016.

Appendix

A. NEEL Taxonomy

Thing
  languages
  ethnic groups
  nationalities
  religions
  diseases
  sports
  astronomical objects

Examples:
  If all the #[Sagittarius] in the world
  Jon Hamm is an [American] actor

Event
  holidays
  sport events
  political events
  social events

Examples:
  [London Riots]
  [2nd World War]
  [Tour de France]
  [Christmas]
  [Thanksgiving] occurs the ...

Character
  fictional characters
  comic characters
  title characters

Examples:
  [Batman]
  [Wolverine]
  [Donald Draper]
  [Harry Potter] is the strongest wizard in the school

Location
  public places (squares, opera houses, museums, schools, markets, airports, stations, swimming pools, hospitals, sports facilities, youth centers, parks, town halls, theatres, cinemas, galleries, universities, churches, medical centers, parking lots, cemeteries)
  regions (villages, towns, cities, provinces, countries, continents, dioceses, parishes)
  commercial places (pubs, restaurants, depots, hostels, hotels, industrial parks, nightclubs, music venues, bike shops)
  buildings (houses, monasteries, creches, mills, army barracks, castles, retirement homes, towers, halls, rooms, vicarages, courtyards)

Examples:
  [Miami]
  Paul McCartney at [Yankee Stadium]
  president of [united states]
  Five New [Apple Retail Store] Opening Around

Organization
  companies (press agencies, studios, banks, stock markets, manufacturers, cooperatives)
  subdivisions of companies
  brands
  political parties
  government bodies (ministries, councils, courts, political unions)
  press names (magazines, newspapers, journals)
  public organizations (schools, universities, charities)
  collections of people (sport teams, associations, theater companies, religious orders, youth organizations, musical bands)

Examples:
  [Apple] has updated Mac Os X
  [Celtics] won against
  [Police] intervene after disturbances
  [Prism] performed in Washington
  [US] has beaten the Japanese team



Person
  people's names (titles and roles are not included, such as Dr. or President)

Examples:
  [Barack Obama] is the current
  [Jon Hamm] is an American actor
  [Paul McCartney] at Yankee Stadium
  call it [Lady Gaga]

Product
  movies
  tv series
  music albums
  press products (journals, newspapers, magazines, books, blogs)
  devices (cars, vehicles, electronic devices)
  operating systems
  programming languages

Examples:
  Apple has updated [Mac Os X]
  Big crowd at the [Today Show]
  [Harry Potter] has beaten any records
  Washington's program [Prism]

B. NEEL Challenge Annotation Guidelines

The challenge task consists of three consecutive stages: 1) extraction and typing of entity mentions within a tweet; 2) linking of each mention to an entry in the English DBpedia39 representing the same real-world entity, or to NIL in case such an entry does not exist; and 3) clustering of all mentions linked to NIL, so that the same entity without a corresponding DBpedia entry is referenced with the same NIL identifier across all its mentions. This section introduces various definitions relevant to this task.
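As an illustration of what the three stages produce, here is a minimal, hypothetical sketch of the per-mention record implied by the task description: a typed mention span plus either a DBpedia IRI or a shared NIL identifier. The field names are illustrative only; the official submission format is defined in the challenge evaluation material.

```python
# Illustrative structure for a single NEEL annotation. Field names are
# hypothetical; they mirror the task stages, not the official file format.
from dataclasses import dataclass

@dataclass
class NeelAnnotation:
    tweet_id: str
    start: int          # character offset where the mention starts
    end: int            # character offset where the mention ends (exclusive)
    mention: str        # surface form, without '#' or '@'
    entity_type: str    # one of the NEEL Taxonomy classes (e.g. "Person")
    link: str           # DBpedia IRI, or "NIL<id>" shared by co-referent NIL mentions

annotations = [
    NeelAnnotation("123", 0, 12, "Barack Obama", "Person",
                   "http://dbpedia.org/resource/Barack_Obama"),
    NeelAnnotation("124", 10, 22, "Taylor Dugas", "Person", "NIL1"),
    NeelAnnotation("125", 3, 15, "Taylor Dugas", "Person", "NIL1"),  # same NIL cluster
]
```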

B.1. Named Entity

A named entity, in the context of the NEEL Challenge series, is a phrase representing the name, excluding the preceding definite article (i.e. "the") and any other pre-posed (e.g. "Dr", "Mr") or post-posed modifiers, that belongs to a class of the NEEL Taxonomy (Appendix A) and is linked to a DBpedia resource. Compound entities should be annotated in isolation.

39 In the 2016 NEEL Challenge we used DBpedia 2015-04 (http://wiki.dbpedia.org/dbpedia-data-set-2015-04) as the reference knowledge base, in 2015 we used DBpedia 2014 (http://wiki.dbpedia.org/data-set-2014), while for the 2014 edition we used DBpedia v3.9 (http://wiki.dbpedia.org/data-set-39).

B.2. Mentions and Typification

In this task we consider that an entity may be referenced in a tweet as a proper noun or acronym if:

1. it belongs to one of the categories specified in the NEEL Taxonomy (Appendix A);40

2. it can be linked to a DBpedia entry or to a NIL reference given the context of the tweet.

Notice:

– Pronouns (e.g., he, him): these are not considered entity mentions in the context of this challenge.

– Misspellings: lower-cased words and compressed words (e.g. "c u 2night" rather than "see you tonight") are common in tweets. Thus, they are still considered mentions if they can be directly mapped to proper nouns.

– Complete Mentions: entity extents that are substrings of a complete named entity mention are not identified. In:
Tweet: Barack Obama gives a speech at NATO

you could not select either of the words Barack or Obama by themselves. This is because they constitute a substring of the full mention [Barack Obama]. However, in the text "Barack was born in the city, at which time his parents named him Obama", both of the terms [Barack] and [Obama] should be selected as entity mentions.

– Overlapping mentions: nested entities with qualifiers should be considered as independent entities.
Tweet: Alabama CF Taylor Dugas has decided to end negotiations with the Cubs and will return to Alabama for his senior season. #bamabaseball

In this case the [Alabama CF] entity qualifies [Taylor Dugas]; the annotation for such a case should be: [Alabama CF, Organization, dbr:Alabama_Crimson_Tide41] [Taylor Dugas, Person, NIL]

40 Typification was introduced with the 2015 NEEL Challenge.

– Noun phrases: these are not considered as entity mentions.

∗ Ambiguous noun phrases:
Tweet: I am happy that an #asian team have won the womens world cup! After just returning from #asia i have seen how special you all are! Congrats

While [asian team] could be considered as an Organisation-type, it can refer to multiple entities; therefore we do not consider it as an entity mention and it should not be annotated.
∗ Noun phrases referring to an entity: While

noun phrases can be dereferenced to an existing entity, we do not consider them as entity mentions. In such cases we only keep "embedded" entity mentions.
Tweet: head of sharm el sheikh hospital is DENYING ...

In this case head of sharm el sheikh hospital refers to a Person-type; however, since it is not a proper noun we do not consider it as an entity mention. For that reason the annotation should only contain the embedded entity [sharm el sheikh hospital]: [sharm el sheikh hospital, Organization, dbr:Sharm_International_Hospital]
∗ Noun phrases completing the definition of

an entity.
Tweet: The best Panasonic LUMIX digital camera from a wide range of models

While digital camera describes the entity Panasonic LUMIX, it is not considered within the entity annotation. In this case the annotation should be [Panasonic, ORG, dbr:Panasonic] [LUMIX, Product, dbr:Lumix]

– Entity Typing by Context: entity mentions in a tweet can be typed based on the context in which they are used.

41 dbr is the prefix of http://dbpedia.org/resource/

Tweet: Five New Apple Retail Stores Opening Around the World: As we reported, Apple is opening 5 new retail stores on...

In this case [Apple Retail Stores] refers to a Location-type while the second [Apple] mention refers to an Organisation-type.

B.3. Special Cases in Social Media (#s and @s)

Entities may be referenced in a tweet preceded by hashtags and @s, or composed of hashtagged and @-nouns:
Tweets:
#[Obama] is proud to support the Respect for Marriage Act
#[Barack Obama] is proud to support the Respect for Marriage Act
@[BarackObama] is proud to support the Respect for Marriage Act

Hashtags can refer to entities, but this does not mean that all hashtags will be considered entities. We consider the following cases:

– Hashtagged nouns and noun-phrases.
Tweet: I burned the cake again. #fail

the hashtag #fail does not represent an entity as defined in the Guidelines. Thus, it should not be given as an entity mention.

– Partially tagged entities.
Tweet: Congrats to Wayne Gretzky, his son Trevor has officially signed with the Chicago @Cubs today

Here Chicago @Cubs refers to the proper noun characterising the Chicago Cubs entity (notice that in this case Chicago is not a qualifier but rather part of the entity mention). In this case the annotation should be [Chicago, Organization, dbr:Chicago_Cubs] [Cubs, Organization, dbr:Chicago_Cubs].

– Tagged entities: If a proper noun is split and tagged with two hashtags, the entity mention should be split into two separate mentions.
Tweet: #Amy #Winehouse

In this case the annotation should be [Amy, Person, dbr:Amy_Winehouse] [Winehouse, Person, dbr:Amy_Winehouse]



Note that the characters # and @ should not be included in the annotation string.
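A small, hypothetical sketch of how a system might normalise hashtag and @-mention tokens before emitting annotation strings, in line with the rule above; the regular expression and helper name are illustrative, not part of the official guidelines.

```python
import re

def annotation_surface(token: str) -> str:
    """Strip a leading '#' or '@' so the emitted annotation string
    contains only the entity name, as the guidelines require."""
    return re.sub(r"^[#@]", "", token)

print(annotation_surface("#Obama"))        # Obama
print(annotation_surface("@BarackObama"))  # BarackObama
print(annotation_surface("Cubs"))          # Cubs (unchanged)
```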

B.4. Use of Nicknames

The use of nicknames (i.e. descriptive names replacing the actual name of an entity) is commonplace in Social Media, for example the use of "SFGiants" to refer to "the San Francisco Giants". For these cases we coreference the nickname to the entity it refers to in the context of the tweet.
Tweet: #[Panda] with 3 straight hits to give #[SFGiants] 6-1 lead in 12th

[Panda] is linked to http://dbpedia.org/page/Pablo_Sandoval and [SFGiants] is linked to http://dbpedia.org/page/San_Francisco_Giants.

B.5. Links

The NEEL Challenge series uses the English DBpedia as the Knowledge Base for linking.

DBpedia is a widely available Linked Data dataset composed of a series of RDF (Resource Description Framework) resources. Each resource is

uniquely identified by a URI (Uniform Resource Identifier). A single RDF resource can be represented by a series of triples of the form <s, p, o>, where s contains the identifier of the resource (to be linked with a mention), p contains the identifier of a property, and o may contain a literal value or a reference to another resource.

In this challenge, a mention in a tweet should be linked to the identifier of a resource (i.e. the s in a triple) if available. Note that in DBpedia there are cases where a resource does not represent an entity; instead it represents an ambiguity case (a disambiguation resource), a category, or it simply redirects to another resource. In this challenge, only the final IRI properly describing a real-world entity (i.e. containing its descriptive attributes as well as relations to other entities) is considered for linking. Thus, if there is a redirection chain given by the property wikiPageRedirects, the correct IRI is the one at the end of this redirection chain.
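One possible way to resolve such redirection chains is to follow dbo:wikiPageRedirects via the public DBpedia SPARQL endpoint. The sketch below is an assumption about tooling (it uses the requests library and the endpoint at http://dbpedia.org/sparql, neither of which is prescribed by the challenge) and simply walks redirects until it reaches a resource with none.

```python
# Hypothetical helper: follow dbo:wikiPageRedirects chains on DBpedia until
# reaching a resource that no longer redirects. The tooling (requests + the
# public SPARQL endpoint) is an assumption, not part of the challenge rules.
import requests

SPARQL_ENDPOINT = "http://dbpedia.org/sparql"

def resolve_redirects(iri: str, max_hops: int = 5) -> str:
    for _ in range(max_hops):
        query = (
            "SELECT ?target WHERE { <%s> "
            "<http://dbpedia.org/ontology/wikiPageRedirects> ?target }" % iri
        )
        resp = requests.get(
            SPARQL_ENDPOINT,
            params={"query": query, "format": "application/sparql-results+json"},
            timeout=10,
        )
        bindings = resp.json()["results"]["bindings"]
        if not bindings:           # no redirect: this is the final IRI
            return iri
        iri = bindings[0]["target"]["value"]
    return iri

# e.g. resolve_redirects("http://dbpedia.org/resource/USA")
# should typically end at http://dbpedia.org/resource/United_States
```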

If no such resource is available, participants are asked to provide "NIL" plus an alphanumeric code (such as NIL1 or NILA) that uniquely identifies the entity within the annotated set.