Separating the Signal from the Noise: Predicting the Correct Entities in Named-Entity Linking

Drew Perkins

Uppsala University

Department of Linguistics and Philology

Master Programme in Language Technology

Master’s Thesis in Language Technology, 30 ECTS credits

June 9, 2020

Supervisors:

Gongbo Tang, Uppsala University

Thorsten Jacobs, Seavus

Abstract

In this study, I constructed a named-entity linking system that maps between

contextual word embeddings and knowledge graph embeddings to predict correct

entities. To establish a named-entity linking system, I first applied named-entity

recognition to identify the entities of interest. I then performed candidate generation

via locality sensitivity hashing (LSH), where a candidate group of potential

entities was created for each identified entity. Afterwards, my named-entity disambiguation

component selected the most probable candidate. By

concatenating contextual word embeddings and knowledge graph embeddings in

my disambiguation component, I present a novel approach to named-entity linking.

I conducted the experiments with the Kensho-Derived Wikimedia Dataset

and the AIDA CoNLL-YAGO Dataset; the former dataset was used for deployment

and the latter is a benchmark dataset for entity linking tasks. Three deep learning

models were evaluated on the named-entity disambiguation component with

different context embeddings. The evaluation was treated as a classification task,

where I trained my models to select the correct entity from a list of candidates.

By optimizing the named-entity linking through this methodology, this entire

system can be used in recommendation engines with a high F1 of 86% using the

former dataset. With the benchmark dataset, the proposed method is able to

achieve an F1 of 79%.

Contents

Acknowledgments

1. Introduction

1.1. Purpose and Motivation

1.2. Outline

2. Background

2.1. Graph Theory and Concepts

2.1.1. Algorithms

2.2. Knowledge Graphs

2.2.1. Knowledge Representation

2.2.2. Knowledge Bases

2.3. Named-Entity Linking Components

2.3.1. Named-Entity Recognition

2.3.2. Candidate Generation via Locality Sensitivity Hashing

2.3.3. Named-Entity Disambiguation

2.4. Feature Embeddings

2.4.1. Word Embeddings

2.4.2. Graph Embeddings

2.5. Neural Networks and Deep Learning

2.5.1. Long Short Term Memory (LSTM)

2.5.2. Convolutional Neural Networks (CNN)

2.5.3. Contextual Embeddings from Language Models (ELMo)

3. Methodology

3.1. Named-Entity Linking

3.2. Disambiguation Models

3.2.1. Embeddings

3.2.2. BiLSTM Model

3.2.3. CNN-BiLSTM Model

3.2.4. ELMo Model

3.2.5. Feed-Forward Neural Network (FFNN)

4. Experiments

4.1. Data

4.1.1. Kensho-Derived Wikimedia Dataset (KDWD)

4.1.2. AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA)

4.2. Settings

4.3. Evaluation Metrics

4.3.1. Precision and Recall

4.3.2. Classification Metrics

4.4. Results

4.4.1. The Effect of Training Data Size

4.4.2. Disambiguation Models

4.4.3. Candidate List Accuracy

4.4.4. Analysis and Discussion

5. Conclusion and Future Work

A. Named-Entity Linking Examples

Acknowledgments

I would like to thank my university supervisor Gongbo Tang for his help and guidance

in the structure and pragmatics of my thesis. I am deeply indebted to the Seavus AB

team, particularly my company supervisor Thorsten Jacobs, for their support, feedback,

resources, ideas, and deadlines for this thesis. I would like to thank COVID-19 for

destroying my social life in the months leading up to completing my thesis. Finally, I

would like to thank my family, friends, and girlfriend for their ongoing support during

my Master’s studies.

Seavus AB is an IT consulting firm that provides enterprise-wide business solutions

across the world, mainly covering the US and European markets. The department I

conducted this work in was their artificial intelligence and machine learning division

located in Stockholm. Their current work includes but is not limited to chatbots, QA

systems, and business intelligence.

1. Introduction

"I am convinced that the crux of the problem of learning is recognizing

relationships and being able to use them"

Christopher Strachey in a letter to Alan Turing, 1954

Interest in knowledge graphs has grown rapidly in recent years within natural

language processing. Whether it be natural language generation, question-answering,

or named-entity recognition and relation linking, when common natural language tasks

are combined with knowledge graphs, improvements can be made across tasks and

domains. That is why I sought to construct a named-entity linking system, in which

ambiguities in the named-entities can be detected and properly clarified with assistance

from knowledge graphs. During some initial work with named-entity recognition on a

corpus, I noticed that "Bush" would come up several times without any indication of whether

it referred to "George H.W. Bush" or "George W. Bush". For our

purposes, this seemed like a glaring oversight and one that we chose to expand on

to find a proper solution. One clear method to solve this problem is through named-entity

linking. A named-entity linking system consists of three primary components:

named-entity recognition, candidate generation, and named-entity disambiguation.

With all three of these components, we effectively identify entities, construct a list of

possible candidates for the identified entities, and finally disambiguate these entities

from the candidate list and link them to a distinctive identifier within a knowledge graph.

It is this final component, named-entity disambiguation, that I focused on for my

research and evaluation. The data I originally trained the disambiguation component

on was the Kensho-Derived Wikimedia Dataset, which includes Wikipedia text, links,

and the Wikidata knowledge base. I then conducted further studies with the AIDA

CoNLL-YAGO Dataset, a benchmark in named-entity linking. Furthermore, I performed

named-entity linking with three disambiguation models that map between contextual

word embeddings and knowledge graph embeddings. To optimize named-entity linking

with deep learning, I treated the problem as a classification task; the models predict

the correct entity among a series of candidates. With my best-performing model on

the benchmark dataset, I achieved 87% recall, 73% precision, and an ROC-AUC of 83%.

The thesis project was conducted at Seavus AB, an IT consulting firm in Stockholm,

Sweden. It was founded in Malmo, Sweden, yet has offices throughout Northern and

Southeastern Europe. Seavus AB offers state-of-the-art machine learning and artificial

intelligence services to companies from around the world.

1.1. Purpose and Motivation

The purpose of this thesis is to examine and evaluate different ways to improve

named-entity linking systems by mapping between context embeddings and

knowledge graph embeddings for correct entities; there appears to be little

research into the use of both of these embeddings together. The motivation came

from my initial findings of ambiguity with certain persons and places, such as in

the "Bush" example. Particularly when we are working with news corpora, a last

name can often be found alone, making this more than a mere isolated incident. Named-entity

linking is a step toward more accurate semantic representation and synonym

extraction, which remain a challenge in NLP research despite the robust ontology

networks and lexicons widely available. Not only that, the relations we make in

how we think, what we watch, and who we connect with are all inextricably linked

to how we have come to understand the world, and are fundamental to language

technology.

There are several research questions I considered to help me move forward with my

work. Can entities be adequately predicted through the concatenation of knowledge

graph embeddings with contextual word embeddings? Most current disambiguation

methods rely purely on the contextual information of documents. Can my system

manage name variations, i.e., the same entity appearing under various naming conventions?

This may be caused by aliases, spelling errors, or abbreviations. Can my system

manage ambiguity, i.e., the same mention being polysemous (having multiple

meanings) depending on the specific context? Can my system manage incomplete

information when there is a limited amount of knowledge? Will my system be able to

fill the contextual gaps?

1.2. Outline

The thesis is organized as follows. In Chapter 2, I introduce

essential graph theory concepts and terminology that are necessary to understand

the rest of the thesis. This is followed by knowledge graphs, the components

used for the named-entity linking system, feature embeddings, and the deep learning

background needed for the later chapters. In Chapter 3, I discuss the methodology

behind the system, embeddings, and disambiguation models. In Chapter 4, I

first investigate the datasets used. I then break down the settings of the data, the earlier

entity linking components, and the model environments for disambiguation.

This is followed by brief introductions to the evaluation metrics used in this work. I

finish this chapter by reporting my results, followed by an analysis and discussion.

In Chapter 5, I conclude my findings and examine a few different ways this research

could be expanded on in the future.

2. Background

First, I note the fundamental graph concepts and algorithms. I then discuss knowledge

graphs from base to graph structure. I follow with explanations of the named-entity

linking components, which include named-entity recognition, candidate generation, and

named-entity disambiguation. Afterwards, I delve into word and graph feature embeddings.

Finally, I explain the core deep learning necessary for this research.

2.1. Graph Theory and Concepts

In graph theory, graphs are a high-fidelity way of modeling pairwise relations between

objects. Graphs are composed of objects known as vertices, entities, or nodes that

are connected by edges, links, or relationships. A label of a node marks it as part of a

larger group. In classic graph theory, this traditionally signified one node, but this has

since taken on the meaning of a node group. Relationships are classified into

types rather than labels. Nodes and relationships can also have embedded attributes

known as properties that contain various data types, whether they be numerical or

categorical. A subgraph is a smaller section within a larger graph. A path is a group of

nodes and their connecting relationships.

Figure 2.1.: Semantic Triple

In addition, there are common graph attributes that should be considered. Graphs

can be connected or disconnected, depending on whether their nodes are all joined by relations. Nodes and relationships

can carry certain weights. Nodes can have relations with a fixed direction. In this

case, start nodes are known as heads and end nodes are known as tails; each combination of a

head, a relation, and a tail is called a semantic triple. Paths can be cyclic or acyclic

depending on whether they start and end on the same node. The relationship-to-node ratio can

be sparse or dense, which can lead to divergent results. Monopartite, bipartite, and

k-partite graphs are those that connect nodes of one, two, or any number of node

types. Each can express a subgraph of a knowledge graph.

2.1.1. Algorithms

Pathfinding

Two fundamental algorithms to traverse an entire graph are depth-first search and

breadth-first search. A depth-first search algorithm iterates outward from a starting

node to some end node before repeating a similar search down a different path from

the same start node. Breadth-first search traverses the graph one layer at a time, first

visiting each node at depth 1, then depth 2, and so on, until the entire graph has been

visited. Pathfinding algorithms are built on top of graph search algorithms such as these

two and explore routes between nodes until the goal node has been reached. The

pathfinding algorithms primarily covered in my work are shortest path (the shortest path

between nodes) and random walk (a sequence of nodes reached by following relationships

selected arbitrarily).
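To make the two traversal orders concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes a graph stored as a plain adjacency dictionary; the names `toy_graph`, `bfs_order`, `dfs_order`, and `shortest_path` are illustrative only. The shortest-path helper reuses breadth-first search, which is sufficient for unweighted graphs.

```python
from collections import deque


def bfs_order(graph, start):
    """Visit nodes layer by layer: depth 1 first, then depth 2, and so on."""
    visited = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.append(neighbor)
                queue.append(neighbor)
    return visited


def dfs_order(graph, start, visited=None):
    """Follow one path as deep as possible before backtracking."""
    if visited is None:
        visited = []
    visited.append(start)
    for neighbor in graph[start]:
        if neighbor not in visited:
            dfs_order(graph, neighbor, visited)
    return visited


def shortest_path(graph, start, goal):
    """Unweighted shortest path: BFS that remembers each node's parent."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # goal is not reachable from start


# Hypothetical toy graph (undirected, stored as adjacency lists).
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
```

On this toy graph, breadth-first search reaches both neighbors of "A" before moving on to "D", whereas depth-first search follows "A", "B", "D" before it ever visits "C".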

Centrality

Centrality is often used to retrieve the most important people or most relevant

answers in response to a query. One such algorithm is PageRank (Page et al., 1998),

which was devised at Google and allowed its search engine to

measure the most important web pages. It counts the number and quality of links to a

page to determine a rough estimate of how important a web page is. The underlying

assumption here is that more important pages are likely to receive more links from

other pages:

$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)}$$

The PageRank value for a page u relies on the PageRank values of each page v

in the set B_u (the pages that link to u), each divided by the number of links L(v) from page v.
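As a rough illustration of how this formula can be evaluated, the sketch below repeatedly applies the update to a small, made-up link graph. The `damping` parameter is an assumption that is not part of the formula as stated here (setting it to 1.0 recovers the plain sum), and the names `pagerank` and `toy_web` are hypothetical. The sketch also assumes every page has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively apply PR(u) = sum over v in B_u of PR(v) / L(v).

    links maps each page to the list of pages it links to.
    damping is an added assumption; damping=1.0 gives the plain formula.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for u in pages:
            # B_u: every page v that links to u; L(v): number of links from v.
            inbound = sum(rank[v] / len(links[v]) for v in pages if u in links[v])
            new_rank[u] = (1.0 - damping) / n + damping * inbound
        rank = new_rank
    return rank


# Hypothetical four-page web in which most pages link to "c".
toy_web = {
    "a": ["c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "a"],
}
ranks = pagerank(toy_web)
```

In this toy example, page "c" receives links from three of the four pages and ends up with the highest score, which matches the intuition that important pages receive more links.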

Community Detection

Social networks are the most striking, and paradigmatic, examples of relationships

between individuals (or communities) within a graph. Whether in a group composed of

your coworkers, family, or friends, people gravitate to form groups. Zachary’s network

of karate club members, a standard benchmark in community detection, demonstrated

aggregations of people as they drifted apart and formed two factions

(Zachary, 1977).

Figure 2.2.: Zachary’s Karate Club

Looking at Figure 2.2, it is possible to infer two major aggregations of people, one

around vertex 34 (the club president) and one around vertex 1 (the instructor).

The Louvain method is a clustering algorithm for community detection that evaluates

how much more densely connected the nodes within a community are compared with

how connected they would be in a random network (Lu et al., 2014), but graph neural networks have also been

used to detect overlapping and disjoint communities (Shchur and Günnemann, 2019).

Link Prediction

When we want to anticipate the most likely future relations in Figure 2.2, we use

link prediction algorithms to predict possible future connections in the network. In

addition, they can be used to propose missing links for obstructed or missing data.

The Adamic Adar algorithm (Adamic and Adar, 2003) was adopted early to predict

links in social networks, such as in Figure 2.2, using the formula:

$$A(x, y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{\log |N(u)|}$$

In Adamic and Adar (2003), N(u) is the set of nodes adjacent to u. A value of 0 asserts

that two nodes are not close, while higher values indicate closeness.
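The score can be computed directly from neighbor sets. The sketch below is only an illustration under the assumption that the graph is undirected and stored as a dictionary of neighbor sets; the function name `adamic_adar` and the graph `friends` are hypothetical.

```python
import math


def adamic_adar(graph, x, y):
    """Sum 1 / log|N(u)| over every neighbor u shared by x and y."""
    shared = graph[x] & graph[y]
    # A shared neighbor of two distinct nodes has at least two neighbors,
    # so log|N(u)| > 0; the guard only protects against malformed input.
    return sum(1.0 / math.log(len(graph[u])) for u in shared if len(graph[u]) > 1)


# Hypothetical friendship graph as neighbor sets.
friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat", "dan"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"bob", "cat"},
    "eve": set(),
}

score_ann_dan = adamic_adar(friends, "ann", "dan")  # two shared neighbors
score_ann_eve = adamic_adar(friends, "ann", "eve")  # no shared neighbors
```

Shared neighbors with few connections of their own contribute more to the score than highly connected hubs, because the logarithm of their neighborhood size is smaller.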

Similarity

Different vector-based metrics are applied when we want to compute the similarity

of pairs of nodes. Node similarity can be calculated by looking at how many neighbors

two nodes share, as in an approximate nearest neighbor algorithm. This algorithm

constructs a k-nearest neighbors (k-NN) graph for a set of objects based on a given

similarity algorithm.

If I use the Euclidean distance with k-NN, the distance between two

nodes p and q is:

$$d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$$

In the above formula, n is the number of dimensions (or features). The similarity

algorithms that can be leveraged with k-nearest neighbor are cosine similarity, Jaccard

similarity, Euclidean distance, and Pearson similarity, to name a few.
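A brute-force version of this idea fits in a few lines. The sketch below is illustrative only: it computes exact neighbors by sorting on Euclidean distance, not the approximate search discussed above, and the names `euclidean`, `k_nearest`, and `node_vectors` are hypothetical.

```python
import math


def euclidean(p, q):
    """Square root of the summed squared differences over all n dimensions."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def k_nearest(query, points, k):
    """Return the names of the k points closest to the query (exact search)."""
    return sorted(points, key=lambda name: euclidean(query, points[name]))[:k]


# Hypothetical 2-dimensional node vectors.
node_vectors = {
    "n1": (0.0, 1.0),
    "n2": (3.0, 4.0),
    "n3": (1.0, 1.0),
    "n4": (10.0, 10.0),
}
closest = k_nearest((0.0, 0.0), node_vectors, k=2)
```

Swapping `euclidean` for another function, such as a cosine or Jaccard distance, changes the notion of similarity without changing the k-NN selection step.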

2.2. Knowledge Graphs

2.2.1. Knowledge Representation

Knowledge graphs model information in the form of entities and relationships between

them. This sort of knowledge representation is a field, long explored in logic and

reasoning, focused on representing abstract information about the world in a way that

a computer can interpret. A few examples of knowledge representation formalisms

include semantic nets, systems architecture, frames, rules, and ontologies.

2.2.2. Knowledge Bases

A knowledge base is a centralized database for storing, organizing, and disseminating

represented knowledge. The general representation for a knowledge base is an object

model with classes, subclasses, and instances. Some of what I touched on, such as

semantic nets and ontologies, are these object models. The two main forms of knowledge

bases are machine-readable and human-readable. Machine-readable knowledge

bases store data that can only be analyzed by artificial intelligence systems. Human-readable

knowledge bases store documents and physical texts that can be accessed by

humans. The key factors to consider to determine the usefulness of knowledge bases

are the completeness, accuracy, and quality of the information we are using. When

we structure large, unstructured data, we often use a graphical representation of this

knowledge known as a knowledge graph.

2.3. Named-Entity Linking Components

2.3.1. Named-Entity Recognition

Many natural language processing applications require identifying named-entities in

text data and classifying them. Named-entities can be, for example, person or company

names, dates and time expressions, organizations, locations, etc. The task of identifying

these in a text is called named-entity recognition and is often performed for a specific

domain of unstructured data.

Figure 2.3.: Named-Entity Recognizer for Music

Named-entity recognition is usually a supervised task because of its reliance on

annotated data, such as CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003). With

this strong need for annotated data for certain domains, such as medicine or law, this

task becomes knowledge intensive by nature. The most common NLP applications that

benefit from named-entity recognition are QA, relation extraction, and coreference

resolution. In fact, coreference resolution solves the ambiguities of named-entity

recognition by finding the references to the same entity in a text.

2.3.2. Candidate Generation via Locality Sensitivity Hashing

In the candidate generation process, I needed to find the k-NN among the candidates

(Dong et al., 2011). Using brute force to process all possible candidate combinations

would have given us the exact nearest neighbor, but it is neither scalable nor fast. I also

calculated the frequency with which an anchor link (Wikipedia hyperlink) corresponds to a

target page, but this method alone was not enough. Thankfully, there are heuristics

that lead to a promising approximation to this k-NN search task, locality sensitivity

hashing (LSH) being one such algorithm (Gionis et al., 1999).

Figure 2.4.: Locality Sensitivity Hashing

LSH hashes data points into buckets so that data points near each other are located

in the same buckets with high probability, while data points far from each other

are likely to be in different buckets. This makes it possible to observe data within

various degrees of similarity. The reason why I don’t simply rely on the anchor link

for candidate generation is that even a slight character difference could result in

no matches. Similarity metrics, such as Jaccard similarity, can successfully retrieve

entities with names similar to the identified entity text (E. Zhu et al., 2016). Jaccard

similarity checks to see how similar two texts are and can be efficiently approximated

by MinHash LSH:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

If I have a greater intersection of characters between two words, then I can expect

a higher Jaccard index. The frequency and rarity of candidates in a set should be

given proper consideration when choosing the similarity measure. Entities that

constantly recur in the text tend to have embeddings with large normalizations.

This can dominate the candidate generation process, so variants of similarity

measures that put less emphasis on the normalization of an entity should be applied.

Regularization can aid in the inverse issue of rare entities being selected over more

relevant ones.

Figure 2.5.: Jaccard Index of 1/4

As in Figure 2.5, I have one word "Floyd" and another word "Florida"; I can expect

an overall Jaccard index of 1/4 because of how much they have in common

out of all their combined character features. Min hashing is the most critical aspect of the

algorithm and was chosen because of its effectiveness with Jaccard similarity, that is,

mapping similarity between sets.
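The Jaccard index and its MinHash approximation can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used in this work: words are split into character bigrams (an assumption that happens to reproduce the 1/4 of Figure 2.5), hash functions are simulated by salting SHA-1 with a seed, and the banding step that turns signatures into LSH buckets is omitted. All function names are hypothetical.

```python
import hashlib


def shingles(text, size=2):
    """Set of overlapping character n-grams (bigrams by default)."""
    return {text[i:i + size] for i in range(len(text) - size + 1)}


def jaccard(a, b):
    """Exact Jaccard index |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set.

    Assumes items is a non-empty set of strings.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


floyd = shingles("Floyd")      # {"Fl", "lo", "oy", "yd"}
florida = shingles("Florida")  # {"Fl", "lo", "or", "ri", "id", "da"}
exact = jaccard(floyd, florida)  # 2 shared bigrams out of 8 distinct ones
estimate = estimated_jaccard(minhash_signature(floyd), minhash_signature(florida))
```

The estimate fluctuates around the exact value and tightens as `num_hashes` grows; the benefit is that signatures have a fixed length regardless of how large the original sets are.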

2.3.3. Named-Entity Disambiguation

Named-entity disambiguation and named-entity linking are often used interchangeably, but

for the purposes of this research, I refer to the overall system as named-entity

linking and to its final component as named-entity disambiguation.

This component performs the task of mapping words of interest, such as names of

persons, locations and companies, from an input text document to corresponding

unique entities in a target knowledge base such as Wikipedia. When performing

named-entity disambiguation, I do not directly employ Wikipedia; there are databases

better suited for accessing and retrieving information from their knowledge base, such

as DBpedia or Wikidata.

Kulkarni et al. (2009) were the first to annotate and bridge unstructured text with

entity IDs from a knowledge base to disambiguate entities. Two years later, Hoffart

et al. (2011) made one of the preeminent contributions to the disambiguation task by

adding rich context through a combined framework of popularity priors, similarity

measures, and coherence algorithms. Their robust framework became the standard that

most state-of-the-art models augment and attempt to surpass. Finally, Parravicini et al.

(2019) achieved state-of-the-art accuracy by leveraging knowledge graph embeddings

for the disambiguation task.

Figure 2.6.: Named-Entity Disambiguation with Wikidata (Parravicini et al., 2019)

Fields such as author disambiguation (Franzoni et al., 2019), natural language generation

(NLG) (Koncel-Kedziorski et al., 2019; Logan et al., 2019), and QA systems (Reddy et al.,

2017) benefit from higher-level representations of text that we cannot reach with simple

recognition or recommendation algorithms. Named-entity disambiguation can discover

the underlying meanings needed to find concepts that are relevant to the task or

application but separate from the text. As an example of the benefits of named-entity

disambiguation, consider a simple query, “Floyd revolutionized rock with the

Wall”. As I mentioned with PageRank and LSH, search and recommendation engines try

to find the most relevant documents to recommend to a user and to find supplementary

information that may be to their liking.

Without a named-entity disambiguation component, the search engine only looks

for information that mentions "Floyd", "rock", and "wall". This engine may provide us

"Pink Floyd", "rock music", and "The Wall", but it could also give us false negatives,

meaning that it misses out on retrieving additional information pertaining to Pink

Floyd, such as "prog rock", "David Gilmour", and "Comfortably Numb". Even worse, the

engine could provide us a series of false positives, such as information on "rock wall

climbing", "Floyd Mayweather", "border walls", and "The Rock".

2.4. Feature Embeddings

One of the more reliable ways to encode the kind of properties that I have in mind is

through the use of feature vectors. Feature vectors consist of numeric or nominal values

in which we embed specific information as input to many machine learning algorithms.

This embedded information is a hidden low-dimensional vector representation used

to preserve linguistic, spatial, or extraneous features in a new space for effective

learning.

2.4.1. Word Embeddings

Word embeddings focus on the word features of a certain lexicon. They are capable

of capturing the context of a word, such as its semantic and syntactic similarity.

Words or phrases are mapped to vectors of real numbers. It involves a mathematical

embedding from a space with many dimensions per word to a continuous vector space

with a much lower dimension. Dimensionality reduction is the process of reducing

the number of random variables under consideration by obtaining a set of principal

variables. By reducing the size of word embeddings, we can improve their utility in

memory-constrained pipelines.

Figure 2.7.: Word2Vec

Word2Vec, developed at Google, is one of the most popular techniques for learning

word embeddings using a group of models (Goldberg and Levy, 2014). One of these

models in Word2Vec is known as the skip-gram model. It takes every word in a large

corpus and, one by one, the words that surround it within a defined window, and feeds them to a neural network that, after training, predicts the probability for each

word to actually appear in the window around the highlighted word.
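The training examples that a skip-gram model consumes can be generated with a simple sliding window. The sketch below only covers this pair-generation step, not the neural network itself; the function name `skipgram_pairs` and the example sentence are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every center word with each word inside its context window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = ["pink", "floyd", "rocks"]
pairs = skipgram_pairs(sentence, window=1)
```

Each (center, context) pair becomes one training example: the network sees the center word as input and is trained to assign high probability to the context word.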

Figure 2.8.: Context Window

Context2Vec is a neural model that learns a generic embedding function for

these context windows of target words. Melamud et al. (2016) demonstrated that

training efficiently on billions of words (within reasonable time constraints) could yield

high-quality context representations which significantly outperform traditional word

embeddings.

2.4.2. Graph Embeddings

Graph embeddings are the transformation of property graphs to a vector or set of

vectors. Embeddings should capture the graph topology, node-to-node relationship,

and other relevant information about graphs, subgraphs, and nodes. Embedding more

properties has the potential to yield better results. Graph embeddings are often

divided into three main groups:

• Node embeddings: We encode each node with its own vector representation. We

would use this embedding when we want to perform visualization or prediction

on the node level, e.g. visualization of vertices in the 2D plane, or prediction

of new connections based on vertex similarities. In many ways, this method's

mapping is very similar to that of Word2Vec. A few examples are DeepWalk (Perozzi

et al., 2014) and Node2Vec (Grover and Leskovec, 2016).

• Bilinear-based embeddings: We encode the relationships between the two entity

vectors using multiple matrices. Assuming we have a total number of entities as

E and a total number of relations as R, the total number of parameters will be E × E × R. Bilinear-based models like RESCAL (Nickel et al., 2011) generate the

score s of a triple (h,r,t) via tensor-factorization:

$$s(h, r, t) = \mathbf{e}_h^{T} M_r \mathbf{e}_t$$

The head embedding $\mathbf{e}_h$ is transposed (T), $\mathbf{e}_t$ is the tail embedding, and each relation is

represented as a matrix $M_r$. There is a need for weight decay with RESCAL

because each relation carries with it many parameters, which generally leads to

overfitting and degrades the overall performance (Nickel et al., 2011).

• Translation-based embeddings: We encode the whole relation with a single

vector. The fundamental notion is that the model is making the sum of the head

vector and relation vector as close as possible to the tail vector. Translation-based

models like TransE (Bordes et al., 2013) and TransD (Ji et al., 2015) solve link

prediction in multi-relational data by interpreting relationships as translations

operating on a learned low-dimensional embedding of the entities in a knowledge

graph, rather than on the graph structure itself.

Figure 2.9.: TransE

TransE is one of the most notable translation-based models for knowledge graph

embeddings due to the sheer simplicity of its method:

$$s(h, r, t) = d(\mathbf{e}_h + \mathbf{v}_r - \mathbf{e}_t)$$

As two embeddings are compared to generate the score s of their triple (h,r,t), the head embedding $\mathbf{e}_h$ is first translated by the relationship vector $\mathbf{v}_r$. TransE

returns lower scores for entities that are close; the semantic triple

score is therefore computed as above, where d is a dissimilarity function like $L_1$ or $L_2$.

DistMult (Yang et al., 2014) is similar to both RESCAL and TransE. Instead of full

matrices, Yang et al. (2014) reduce the number of relation parameters by only using diagonal

matrices, stored as vectors v, to generate the score s of a triple (h,r,t):

$$s(h, r, t) = \langle \mathbf{e}_h, \mathbf{v}_r, \mathbf{e}_t \rangle = \sum_{d=1}^{D} e_{h,d} \, v_{r,d} \, e_{t,d}$$

Above, d indexes the D entries of the diagonal, which limits the model to representing only symmetric

relations; the same embedding space is on the left and right sides. DistMult and

TransE both use a low number of parameters to achieve state-of-the-art results. However,

having a model that focuses solely on the diagonal matrices is not without its

limitations.
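The three scoring functions discussed in this section can be compared side by side on toy vectors. The sketch below is a hedged illustration in plain Python, not the embedding library used in this work; the vector and function names are hypothetical, with `e_h` and `e_t` standing for head and tail entity embeddings, `v_r` for a relation vector, and `M_r` for a relation matrix.

```python
def rescal_score(e_h, M_r, e_t):
    """Bilinear score: e_h^T M_r e_t with a full relation matrix."""
    return sum(
        e_h[i] * M_r[i][j] * e_t[j]
        for i in range(len(e_h))
        for j in range(len(e_t))
    )


def transe_score(e_h, v_r, e_t, norm=1):
    """Dissimilarity d(e_h + v_r - e_t); lower means a more plausible triple."""
    diffs = [abs(h + r - t) for h, r, t in zip(e_h, v_r, e_t)]
    if norm == 1:
        return sum(diffs)               # L1 distance
    return sum(d * d for d in diffs) ** 0.5  # L2 distance


def distmult_score(e_h, v_r, e_t):
    """Trilinear product: sum over dimensions of e_h[d] * v_r[d] * e_t[d]."""
    return sum(h * r * t for h, r, t in zip(e_h, v_r, e_t))


# Hypothetical 2-dimensional embeddings chosen so that e_h + v_r equals e_t.
e_h = [1.0, 2.0]
v_r = [0.5, -1.0]
e_t = [1.5, 1.0]
# The same relation written as a diagonal matrix for RESCAL.
M_r = [[0.5, 0.0], [0.0, -1.0]]
```

With a diagonal relation matrix, RESCAL and DistMult return the same score, and swapping head and tail leaves the DistMult score unchanged, which is the symmetry limitation mentioned above.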

2.5. Neural Networks and Deep Learning

2.5.1. Long Short Term Memory (LSTM)

Long Short Term Memory (LSTM) networks are a form of recurrent neural networks

(RNN) that are capable of learning long-term dependencies (Hochreiter and Schmidhuber,

1997). RNNs take the form of a chain of repeating modules; in standard RNNs, this repeating module has a very simple structure,

such as a single tanh layer.

Figure 2.10.: LSTM Unit

LSTMs also have this linked, chain structure, but the repeating module has a different

structure. Instead of having a single neural network layer, there are four layers, interacting

in a way that cleverly manages time intervals. In Figure 2.10, at center is an LSTM

unit composed of a cell, input gate, output gate and forget gate. The cell remembers

values over these time intervals and the gates regulate and control the information that

comes in and out of the cell. LSTMs were created to deal with the vanishing gradient

problem that is encountered with traditional RNNs.
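For reference, the gates described above are commonly written as follows. This is the widely used formulation with a forget gate, given here as an assumption about notation rather than as the exact variant implemented in this work; $x_t$ is the input at time step $t$, $h_{t-1}$ the previous hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

```latex
\begin{align}
f_t &= \sigma\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate cell values)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{align}
```

The four learned layers are the three sigmoid gates and the tanh candidate layer. Because the cell state is updated additively, gradients can pass across many time steps without repeatedly going through a squashing non-linearity.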

2.5.2. Convolutional Neural Networks (CNN)

A convolutional neural network applies 1D convolutions to extract features from text and then applies max pooling over the time-step dimension to obtain a fixed-length output. Text data typically calls for 1D convolutions and image data for 2D convolutions. With graphs in mind, studies have explored encoding graphs through 2D and 3D convolutions. However, learning a single graph through convolutions is one difficulty, and learning an entire knowledge graph through convolutions is another (Battaglia et al., 2018).
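To illustrate how a 1D convolution followed by max pooling over time yields a fixed-length output, here is a small pure-Python sketch. The kernels, the two-dimensional toy embeddings, and the function names are assumptions made for illustration only.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` (width x dim) over `sequence` (steps x dim) without padding."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(w * x
                    for k_row, s_row in zip(kernel, window)
                    for w, x in zip(k_row, s_row))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def max_over_time(feature_maps):
    """Keep the largest activation of each filter across all time steps."""
    return [max(fmap) for fmap in feature_maps]


def encode(sequence, kernels):
    """One output value per kernel, regardless of the sequence length."""
    return max_over_time([conv1d_valid(sequence, k) for k in kernels])


# Two hypothetical kernels of width 2 over 2-dimensional token embeddings.
kernels = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[-1.0, 1.0], [1.0, -1.0]],
]
short = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
longer = short + [[0.0, 1.0], [2.0, 0.0]]

print(encode(short, kernels), encode(longer, kernels))
```

Both inputs are reduced to a vector with one entry per kernel, so sequences of different lengths can feed the same downstream layer. The sketch assumes every sequence is at least as long as the kernel width.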

The general purpose of using various convolutions within one network over graph data is to capture larger representations in different dimensions. For example, a knowledge graph has nodes and edges, and each node and edge will more than likely carry additional labels and types. Where an LSTM fails to capture these features beyond linear representations, a CNN shows more promise in capturing these complexities. In link prediction, densely connected convolutional neural networks have been effective when used in conjunction with classic graph heuristics and similarity metrics (W. Wang et al., 2019).

Figure 2.11.: CNN with Semantic Triples

2.5.3. Contextual Embeddings from Language Models (ELMo)

To advance my context embeddings so that the entire sentence is considered before each word is assigned an embedding, I decided to implement ELMo embeddings. ELMo achieves state-of-the-art precision and recall on many tasks and is trained to predict the following words in sentences (Peters et al., 2018). ELMo uses two layers of bidirectional LSTMs (BiLSTM) in its training, with the two layers bridged by a residual connection. A residual connection allows gradients to flow through a network directly, without passing through the non-linear activation functions. The high-level intuition is that residual connections help neural networks train more successfully (Peters et al., 2018).

Figure 2.12.: BiLSTM Layers in ELMo

ELMo embeddings are character-based, which allows a neural network to use morphological cues to form representations for out-of-vocabulary tokens unseen in training. This is one area where static word embeddings like Word2Vec and GloVe usually fall short. Moreover, even when a static word embedding is created with a wide context window, the word ultimately has the same vector representation regardless of its context, whereas ELMo embeddings change with context. It is the text prediction performed by the forward and backward language models in ELMo that makes it one of the best at tracking language patterns and at transfer learning.

3. Methodology

I demonstrate the entire named-entity linking system from input to output. I then

discuss the embeddings, as well as the general model pipelines of the disambiguation

component and the feed-forward neural network (FFNN).

3.1. Named-Entity Linking

The input text is processed by a named-entity recognition component built on the spaCy implementation. The start and end positions of each named-entity are catalogued along with the raw text input and the identified named-entity. I excluded number-related entity types such as money, time, and percentages. Afterwards, the input text is cleaned and realigned with new start and end positions. The top candidates are then retrieved using the anchor link frequency and the LSH algorithm, and finally the named-entity disambiguation model is applied to deliver the output.

Figure 3.1.: Named-Entity Linking Components

For evaluation purposes, I focus on the named-entity disambiguation component. To prepare the training data, I performed text feature engineering for some key elements used in my work. I first set up clean candidate/anchor link lists. I then processed the 1.5 million Wikipedia sections that I was using. This took several days, and the text had to be processed a few times over in order to preserve alignment: I had to ensure that the position of the textual mention, in this case the anchor link, still pointed to the same span in the section text before and after processing. This required recalculating positions each time I cleaned the section text. For number substitution, I opted to replace numbers with hashes, as this was a useful way to avoid problems that would arise from mixed data types.
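The realignment step can be sketched as follows. This is a simplified illustration under stated assumptions: the cleaning rules (whitespace collapsing, replacing each number with a single `#`) and the function names are hypothetical stand-ins for the actual preprocessing, which is not specified in detail here.

```python
import re


def clean(text):
    """Toy cleaning step: replace every number with '#' and collapse whitespace."""
    text = re.sub(r"\d+(?:[.,]\d+)*", "#", text)
    return re.sub(r"\s+", " ", text)


def clean_with_alignment(text, start, end):
    """Clean `text` and return it with the mention's new (start, end) offsets."""
    before, mention, after = text[:start], text[start:end], text[end:]
    cleaned_before = clean(before)
    cleaned_mention = clean(mention)
    new_start = len(cleaned_before)
    new_end = new_start + len(cleaned_mention)
    return cleaned_before + cleaned_mention + clean(after), new_start, new_end


raw = "In 1973,  Tom Waits  released Closing Time."
start = raw.index("Tom Waits")
end = start + len("Tom Waits")

cleaned, new_start, new_end = clean_with_alignment(raw, start, end)
print(cleaned)
print(cleaned[new_start:new_end])  # Tom Waits
```

Cleaning the text before, inside, and after the mention separately keeps the offsets trivially recomputable, at the cost of not handling cleaning rules that span the mention boundary.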

3.2. Disambiguation Models

For my deep learning models, I treated named-entity disambiguation as a classification task. For each candidate in the candidate list of an identified entity in a text, the model predicts whether that candidate is the true named-entity. Specifically, given the knowledge graph embedding of the candidate and the local context embedding of the text as inputs, the model predicts true if the candidate is the correct entity and false otherwise. For example, my baseline model uses word embeddings with a local context window, which are trained as part of the embedding layer of a BiLSTM. The resulting context embeddings are concatenated with knowledge graph embeddings and fed into a feed-forward neural network. With this framework, I successfully mapped between the knowledge graph embeddings and local context embeddings to disambiguate the named-entities.
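The classification setup can be sketched in plain Python: each candidate yields one example whose features are the concatenation of the context embedding and the candidate's knowledge graph embedding, and a small feed-forward network outputs a probability. All dimensions, weights, and embeddings below are random, hypothetical placeholders rather than trained values, and the helper names are assumptions.

```python
import math
import random


def dense(vector, weights, bias):
    """Fully connected layer: weights @ vector + bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def score_candidate(context_vec, entity_vec, layers):
    """Probability that the candidate is the true entity for this mention."""
    features = context_vec + entity_vec  # list concatenation of both views
    (w_hidden, b_hidden), (w_out, b_out) = layers
    hidden = [max(0.0, z) for z in dense(features, w_hidden, b_hidden)]  # ReLU
    logit = dense(hidden, w_out, b_out)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


def random_layer(rng, n_out, n_in):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
context_dim, entity_dim, hidden_dim = 4, 3, 5
layers = (random_layer(rng, hidden_dim, context_dim + entity_dim),
          random_layer(rng, 1, hidden_dim))

# Hypothetical mention context and candidate embeddings (untrained, random).
context = [rng.uniform(-1, 1) for _ in range(context_dim)]
candidates = {name: [rng.uniform(-1, 1) for _ in range(entity_dim)]
              for name in ("Talking Heads", "Talking Heads (play)", "Pundit")}
gold = "Talking Heads"

# One training example per candidate: label 1 for the true entity, else 0.
examples = [(context + vec, int(name == gold)) for name, vec in candidates.items()]
probabilities = {name: score_candidate(context, vec, layers)
                 for name, vec in candidates.items()}
print(probabilities)
```

Since the weights are untrained, the printed probabilities are meaningless; the point is the shape of the data flow, with one binary decision per (context, candidate) pair.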

The notion of combining graph embeddings with local context embeddings to map ambiguous named-entities to those in a knowledge base stems from Parravicini et al. (2019), who successfully leveraged graph embeddings to achieve named-entity disambiguation. Using DBpedia as the knowledge base and existing graph algorithms for candidate generation, they achieved state-of-the-art accuracy on a number of datasets and fast retrieval of entities in real-world engines. In comparison to my work, they use different similarity metrics in their candidate generation rather than Jaccard similarity and the LSH algorithm, and their graph embeddings were node embeddings (DeepWalk) rather than higher-level knowledge graph embeddings. My approach is also novel since it is the first of its kind to concatenate contextual word embeddings, such as ELMo, with knowledge graph embeddings to conduct disambiguation as a classification task.

3.2.1. Embeddings

The Context2Vec embeddings were trained with the 500k most frequently occurring words in the dataset. I mapped this subset to embeddings with a context window of ±10 words, which are fed into an embedding layer of the model. The ELMo embeddings were trained on 1.5 million Wikipedia section texts from the dataset. I truncated the text to a window of ±10 words to make ELMo simple to compare with Context2Vec and to ease the computation.
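A context window of this kind can be extracted with a few lines of Python. The sketch below assumes token-level mention offsets and assumes that the mention itself is kept inside the window; both are illustrative choices, and the function name is hypothetical.

```python
def context_window(tokens, mention_start, mention_end, size=10):
    """Return the mention plus up to `size` tokens on each side.

    `mention_start` is inclusive and `mention_end` exclusive (token indices).
    """
    left = max(0, mention_start - size)
    right = min(len(tokens), mention_end + size)
    return tokens[left:right]


tokens = "the band Talking Heads formed in New York in 1975".split()

# The mention "Talking Heads" covers token positions 2 and 3.
print(context_window(tokens, 2, 4, size=2))
print(context_window(tokens, 2, 4))  # the default window covers the whole sentence
```

Clamping with `max` and `min` means mentions near the start or end of a section simply receive a shorter window instead of raising an error.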

The TransE embeddings were trained from 5 million vectors of entities and relations in Wikidata and Wikipedia. This includes general domain entities such as concepts, people, and things. I use the graph embedding engine GraphVite (X. Wang et al., 2019; Z. Zhu et al., 2019) to apply the knowledge graph embedding algorithms from Chapter 2.4.2 and generate embeddings in a short amount of time. DistMult was used in some preliminary experiments, but ultimately I chose TransE as my knowledge graph embeddings for the final results.

3.2.2. BiLSTM Model

My baseline model takes the Context2Vec embeddings as input. This input is fed into a BiLSTM, whose output is concatenated with the knowledge graph embeddings and then fed into an FFNN. The FFNN maps between the context embeddings and knowledge graph embeddings for the final output. The motivation for the baseline model was to have a model that I could augment and tune while building up the Context2Vec embedding component.

Figure 3.2.: BiLSTM Model Pipeline

A BiLSTM is also known to be quite effective for sequential tagging and word classification. Particularly with a window of 10 words before and after the target word, an LSTM is fundamental for Context2Vec.

3.2.3. CNN-BiLSTM Model

The motivation for the second model is much like that of the baseline model, yet the purpose was to develop a stack on top of the BiLSTM. A CNN captures hierarchical relations, and stacked CNN-LSTM models have shown positive results in text classification (Zhou et al., 2015). I use the CNN to extract a sequence of higher-level phrase representations and then feed this sequence into the BiLSTM for sentence representation. I fed my context embeddings into one convolutional layer to see how it would perform.

Figure 3.3.: CNN-BiLSTM Model Pipeline

3.2.4. ELMo Model

The ELMo model replaces the Context2Vec embeddings with ELMo embeddings as input. Since the ELMo embeddings are already built from BiLSTMs, I only use a basic LSTM to process them; this output is also concatenated with the knowledge graph embeddings, as in the BiLSTM model and CNN-BiLSTM model. Runtime varied greatly depending on the window size of the embedding.

Figure 3.4.: ELMo Model Pipeline

3.2.5. Feed-Forward Neural Network (FFNN)

Each model uses an FFNN at the end of its architecture. The FFNN ends in a sigmoid function that maps each score $s_i$ coming out of the preceding network into the range (0, 1) before the loss computation:

$$f(s_i) = \frac{1}{1 + e^{-s_i}}$$

The sigmoid, also known as the logistic function, is applied independently to each element. Unlike softmax loss, each vector component (class) is independent; therefore, the loss computed for each output class is not affected by the other classes. The sigmoid activation is applied to the scores before computing the cross-entropy loss:

$$CE = -\sum_{i}^{C} t_i \log\big(f(s_i)\big)$$

Here $t_i$ is the target label for class $i$ and $C$ is the number of classes. Cross-entropy calculates the difference between two (or more) probability distributions, and I use it to measure the performance of my models, whose output is a probability value between 0 and 1.
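For the two-class case used here, the sigmoid and the cross-entropy loss can be written out directly. The sketch below uses the common binary form of the loss, in which the second class probability is one minus the first; the scores and targets are made-up values for illustration, and the clipping constant `eps` is an assumption added for numerical safety.

```python
import math


def sigmoid(score):
    """Logistic function mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def binary_cross_entropy(target, probability, eps=1e-12):
    """Cross-entropy for a single binary label; eps guards against log(0)."""
    p = min(max(probability, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))


# Hypothetical raw scores for three candidates; only the first one is correct.
scores = [2.0, -1.0, 0.0]
targets = [1, 0, 0]

probabilities = [sigmoid(s) for s in scores]
losses = [binary_cross_entropy(t, p) for t, p in zip(targets, probabilities)]
mean_loss = sum(losses) / len(losses)
print(probabilities, mean_loss)
```

A confident correct prediction contributes a small loss, while an undecided score of 0 (probability 0.5) contributes log 2 regardless of the label.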

4. Experiments

I perform a number of experiments on openly available datasets and present results for different feature combinations, models, hyperparameters, and sample sizes. I first describe the datasets used for named-entity disambiguation, as well as their settings. The Kensho-Derived Wikimedia Dataset (KDWD) was used for the QA system, and the AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA) was used as an entity linking benchmark. Afterwards, I define the evaluation metrics used for named-entity disambiguation, perform my experiments, and compare the results with other notable works.

4.1. Data

4.1.1. Kensho-Derived Wikimedia Dataset (KDWD)

Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-sourced projects supported by the Wikimedia Foundation. Recently, Wikipedia added its six millionth English article after two decades of operation. Wikidata, a more machine-readable sister project, has accumulated more than 75 million items since its creation 8 years ago. The Wikimedia Foundation disseminates this information under a free license, and the projects have therefore been heavily researched by data scientists and computer science groups, particularly in the field of natural language processing (NLP). The Kensho-Derived Wikimedia Dataset (KDWD)¹ is a concentrated subset of the raw Wikimedia data in a condition more fit for NLP research.

Pages Tokens Entities Relations

5.3M 2.3B 51M 140M

Table 4.1.: KDWD Dataset

The KDWD dataset is structured in three layers of data: the text from the Wikipedia pages, the hyperlinks between the pages, and the entities and relations built from the Wikidata graph. Entities and relations are synonymous with items (Q) and statements (P) in Wikidata. For example, the Noam Chomsky (Q9049) item in Wikidata has statements that the item is an instance of (P31) human and has the occupation (P106) linguist (Q14467526) and political writer (Q15958642).

Even if I only observe the "Introduction" sections of these pages, I am still left with 460M tokens in my corpus. Evaluation can sometimes prove more difficult depending on the referent used in the data, whereby a system annotates an entity with an encrypted redirect rather than the direct entity in the URI. DBpedia (Auer et al., 2007), which relies on the Wikimedia project for its knowledge base, will have redirects such as http://dbpedia.org/resource/PEHDTSCKJBMA that are perfectly valid references to http://dbpedia.org/resource/Tom_Waits. In both datasets, I established the direct links for my network without relying on such redirects.

¹ https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data

4.1.2. AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA)

The AIDA CoNLL-YAGO Dataset created by Hoffart et al. (2011) contains assignments of entities to the mentions of named-entities that were annotated in the original CoNLL 2003 NER task (Tjong Kim Sang and De Meulder, 2003). The entities are identified by YAGO2 entity identifier, by Wikipedia URL, or by Freebase mid. For my purposes, I used the YAGO2 entity identifier as the target ID, rather than the Wikidata ID used in KDWD. Each mention of an entity has an accompanying text section that can be used to train the model.

Documents Entities

TRAIN 946 18k

VALID 215 4.6k

TEST 230 4.3k

Table 4.2.: CoNLL03/AIDA Dataset

The referent used is the Wikipedia link, or anchor link, that I also used in KDWD. The comparison between KDWD and CoNLL03/AIDA is easier because their entity linking is sourced from similar knowledge bases. CoNLL03/AIDA is a standard benchmark for disambiguation.

4.2. Settings

I discuss the environment I created from the data preparation, the NER component,

candidate generation component, and provide a comprehensive look at the three

models in the disambiguation component.

Data Setting: Due to the size of KDWD, I train on only the "Introduction" sections of 1.5 million pages. The amount of text in KDWD can vary dramatically, so feature engineering was quite computationally expensive. The data was split in the traditional way: 70% training, 15% validation, and 15% testing. I keep the 500k most frequent tokens as my Context2Vec lexicon. For the CoNLL03/AIDA dataset, I use all documents, which were already partitioned. The entire document text accompanies each of the entities mentioned in the dataset, whereas this structure preexists in KDWD. I use the 21k most frequent tokens as my Context2Vec lexicon. A maximum length of 50 tokens was set for processing the ELMo embeddings in both datasets. Text normalization was standard, except that numbers were replaced with hash symbols.
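The split and the frequency-based lexicon described in the data setting can be sketched as follows. The function names, the fixed random seed, and the convention of reserving id 0 for unknown tokens are assumptions made for this illustration.

```python
import random
from collections import Counter


def split_dataset(items, train_pct=70, valid_pct=15, seed=13):
    """Shuffle and split into train/validation/test; the remainder is the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_valid = len(shuffled) * valid_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])


def build_lexicon(token_lists, max_size):
    """Map the `max_size` most frequent tokens to ids starting at 1 (0 = unknown)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return {tok: idx
            for idx, (tok, _) in enumerate(counts.most_common(max_size), start=1)}


train, valid, test = split_dataset(range(100))
print(len(train), len(valid), len(test))  # 70 15 15

lexicon = build_lexicon([["a", "b", "a"], ["a", "c", "b"]], max_size=2)
print(lexicon)  # {'a': 1, 'b': 2}
```

Tokens outside the lexicon, such as `"c"` in the toy example, would be mapped to the reserved unknown id at lookup time.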

NER Setting: The entities were recognized using spaCy v2.0, which uses subword features and Bloom embeddings to parse entities (Serrà and Karatzoglou, 2017). The recognized entity types included person, location, and organization.

Candidate Generation Setting: The top candidates were calculated from the anchor link frequency and the LSH algorithm described in Chapter 2.3.2. I use Jaccard similarity and min hashing to measure the similarity between the given sets of candidates. This setting accounts for erroneous spelling and helps reduce the dimensionality of the data.
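As a rough sketch of how min hashing approximates Jaccard similarity, the following standard-library example compares character trigram sets. It covers only the MinHash signature, not the banding step of LSH, and the shingle size, number of hash functions, and use of MD5 as the hash family are assumptions chosen for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Character k-grams of a lower-cased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)


def minhash_signature(items, num_hashes=64):
    """Signature = minimum hash value of the set under each seeded hash function."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


query = shingles("talking heads")
near = shingles("talking head")
far = shingles("pundit")

sig_query = minhash_signature(query)
print(jaccard(query, near), estimate_jaccard(sig_query, minhash_signature(near)))
print(jaccard(query, far), estimate_jaccard(sig_query, minhash_signature(far)))
```

Because signatures have a fixed length, near-duplicate surface forms such as misspellings can be compared without materialising the full shingle sets, which is where the reduction in dimensionality comes from.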

NED Setting: The general pipelines were discussed in Chapter 3; here I give the exact sizes of the input (with sequences no longer than 20) and output, along with the detailed layers of each architecture. The candidate lists used for disambiguation label one true entity among ten false entities.

• BiLSTM Model: The BiLSTM has an output size of 100, and the FFNN has a dense layer of size 256 with ReLU activation. A dropout layer is added, and the final output layer is a dense layer of size 1 with sigmoid activation to classify the entities. The learning rate was 0.007 and the batch size was 256; training ran for up to 100 epochs, with early stopping once the validation loss had stopped improving for 2 epochs (a patience of 2).

Figure 4.1.: BiLSTM Model Architecture

• CNN-BiLSTM Model: Most of my hyperparameters and features were chosen from Yenter and Verma (2017), who used a CNN-LSTM for binary classification of movie review sentiment. I apply dropout before the convolution. The convolution has a kernel size of 5, 64 filters, ReLU activation, valid padding, and a stride of 1. I add a max pooling layer and batch normalization before feeding into the same BiLSTM and FFNN as in the BiLSTM model. The learning rate was 0.007 and the batch size was 256; training ran for up to 100 epochs, with early stopping once the validation loss had stopped improving for 2 epochs (a patience of 2).

Figure 4.2.: CNN-BiLSTM Model Architecture

• ELMo Model: The LSTM has an output size of 1024 to match the size of the ELMo embeddings, with a recurrent dropout of 0.2. The final dropout is 0.2. I used the Adam optimizer with a batch size of 64 and a learning rate of 0.001, training for up to 20 epochs with early stopping (Broscheit, 2019). The rest of the model stays consistent with the other two after concatenation.
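The early stopping rule shared by the three models can be sketched independently of any deep learning framework. The function below replays a list of per-epoch validation losses instead of training a real model; the loss values and the function name are hypothetical.

```python
def run_with_early_stopping(validation_losses, max_epochs=100, patience=2):
    """Return (epochs_run, best_epoch) given one validation loss per epoch."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    epochs_run = 0
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # Improvement: remember this epoch and reset the patience counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` consecutive epochs
    return epochs_run, best_epoch


# Hypothetical validation losses: the minimum is reached in epoch 3.
losses = [0.60, 0.45, 0.40, 0.41, 0.43, 0.39]
print(run_with_early_stopping(losses))  # (5, 3)
```

With a patience of 2, training halts after epoch 5 even though epoch 6 would have improved again, which is the usual tradeoff between compute time and the risk of stopping too early.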

Figure 4.3.: ELMo Model Architecture

4.3. Evaluation Metrics

4.3.1. Precision and Recall

The statistical measures I observe are F1, precision, and recall. Precision and recall both indicate the accuracy of a model, but they provide deeper insight into what the model is actually predicting. Precision is the percentage of retrieved results that are relevant to the task, whereas recall is the percentage of all relevant results that are correctly classified:

$$\text{Precision} = \frac{|\text{Relevant Results} \cap \text{Retrieved Results}|}{|\text{Retrieved Results}|}$$

$$\text{Recall} = \frac{|\text{Relevant Results} \cap \text{Retrieved Results}|}{|\text{Truly Relevant Results}|}$$

The tradeoff is that lowering precision returns irrelevant results that are not suitable for a user, while raising precision tends to lower recall, so that relevant results are missed. This inverse relationship is why I use the F1 score, the harmonic mean of precision and recall, to balance the two:

$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
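As a worked example of these three measures, the following sketch counts true positives, false positives, and false negatives for a binary labeling. The label vectors are invented for illustration and do not come from the experiments.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = true entity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical gold labels and predictions for eight candidates.
y_true = [1, 0, 0, 1, 1, 0, 1, 0]
y_pred = [1, 1, 1, 1, 0, 0, 1, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here three of the five predicted positives are correct (precision 0.6) and three of the four actual positives are found (recall 0.75), giving an F1 of about 0.67.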

4.3.2. Classification Metrics

I observe the classification task in two ways. First, I examine a confusion matrix of predicted results. A confusion matrix displays the number of predicted values on the y-axis and the number of actual values on the x-axis, broken down by class; in my task, there are two classes. A confusion matrix reveals not just how many errors the classifier makes but, more importantly, the types of errors produced by the model.

Second, I observe the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The AUC expresses how well a model is capable of distinguishing between classes, which are true named-entities and false named-entities in my classification task. The higher the AUC, the better the model is at predicting true and false entities. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR). The TPR is synonymous with recall, whereas the FPR, in contrast to precision, measures the ratio of false positives among the negative samples.
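The quantities behind the confusion matrix and the ROC curve can be computed by hand for a small example. The sketch below derives TPR and FPR at one threshold and computes the AUC through its rank interpretation (the probability that a random positive is scored above a random negative); the labels, scores, and function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def rates_at_threshold(y_true, scores, threshold):
    """True positive rate and false positive rate at a decision threshold."""
    y_pred = [int(s >= threshold) for s in scores]
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties = 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical gold labels and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]

tpr, fpr = rates_at_threshold(y_true, scores, threshold=0.5)
auc = roc_auc(y_true, scores)
print(tpr, fpr, auc)
```

Sweeping the threshold from 1 down to 0 and plotting each (FPR, TPR) pair traces the ROC curve; the pairwise formulation gives the same area without constructing the curve explicitly.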

4.4. Results

First, I compare the sample sizes and the effect they have on the precision and recall of my models. Then I compare the confusion matrices of the two datasets with the ELMo model and contrast the micro-precision of my model with state-of-the-art models. I note the accuracy of the candidate lists on which I base my predictions. I conclude with a brief analysis of the ROC curves and a discussion of remaining points.

4.4.1. The Effect of Training Data Size

The sample sizes listed in Table 4.3 and Table 4.4 are the approximate numbers of training and validation instances. Since the CoNLL03/AIDA dataset is far smaller than KDWD, this is reflected in the experiment sizes.

Model              Precision  Recall  F1    ROC-AUC  Sample Size
BiLSTM Model       0.71       0.66    0.68  0.76     500k
CNN-BiLSTM Model   0.72       0.67    0.69  0.78     500k
ELMo Model         0.86       0.72    0.78  0.90     500k
BiLSTM Model       0.74       0.75    0.74  0.82     1M
CNN-BiLSTM Model   0.72       0.82    0.77  0.85     1M
ELMo Model         0.85       0.80    0.82  0.91     1M
BiLSTM Model       0.86       0.86    0.86  0.92     2.8M
CNN-BiLSTM Model   0.84       0.88    0.86  0.92     2.8M
ELMo Model         0.88       0.81    0.84  0.93     2.8M

Table 4.3.: Average Disambiguation Scores of 5 Runs - KDWD

Unsurprisingly, increasing the number of samples used for training and validation correlates with an improvement on the classification task across all models. With 2.8M samples, the BiLSTM model and CNN-BiLSTM model slightly edge out the ELMo model. With less data, state-of-the-art models like ELMo remain effective, reaching an F1 of nearly 80% with the smallest sample size in Table 4.3.

Model              Precision  Recall  F1    ROC-AUC  Sample Size
BiLSTM Model       0.66       0.85    0.74  0.80     20k
CNN-BiLSTM Model   0.63       0.86    0.73  0.80     20k
ELMo Model         0.73       0.87    0.79  0.83     20k

Table 4.4.: Average Disambiguation Scores of 5 Runs - CoNLL03/AIDA

4.4.2. Disambiguation Models

While training my models, I include at most 10 false entities alongside the true entity as the potential candidates. When predicting the candidates, the BiLSTM model and CNN-BiLSTM model repeatedly show higher recall and lower precision, which indicates that the classifier marks too many of the candidates as the true candidate rather than isolating the one that is actually true. The ELMo model has higher precision and lower recall, which substantiates the notion that state-of-the-art language models like ELMo and BERT (Sun et al., 2019) can often have a harder time generalizing, thus generating more false negatives despite generating sufficient true positives. However, I find a direct contrast between the two datasets with the ELMo model: in Figure 4.5, on CoNLL03/AIDA, I obtain a much higher recall (87%) and lower precision (73%). It should be further noted that the confusion matrices in Figure 4.4 and Figure 4.5 both represent the largest set of training instances for each dataset. The BiLSTM model and CNN-BiLSTM model have consistent confusion matrices that hold across both datasets, but it stands to reason that CoNLL03/AIDA generates more false positives from a lack of sufficient diversity in the data. For example, when a specific subject or theme manifests in a text, the model has a harder time differentiating between the macro-context of the text and the micro-context of the entity within the text.

Figure 4.4.: ELMo CM - KDWD Figure 4.5.: ELMo CM - CoNLL03/AIDA

The precision that I have noted thus far has been the micro-precision. The micro-precision is the fraction of correctly disambiguated named-entities in an entire corpus, whereas the macro-precision is the fraction of correctly disambiguated named-entities averaged over their respective documents. In Table 4.5, my highest micro-precision on CoNLL03/AIDA is 9 percentage points below that of the lowest-scoring state-of-the-art disambiguation model, by Hoffart et al. (2011).
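The difference between the two averages is easiest to see on a small example. In the sketch below each document is represented by a hypothetical pair of counts (correctly disambiguated mentions, total mentions); the numbers and the function name are invented for illustration.

```python
def micro_macro_precision(documents):
    """`documents` is a list of (correct, total) mention counts per document."""
    total_correct = sum(correct for correct, _ in documents)
    total_mentions = sum(total for _, total in documents)
    micro = total_correct / total_mentions
    macro = sum(correct / total for correct, total in documents) / len(documents)
    return micro, macro


# Three hypothetical documents of very different lengths.
documents = [(9, 10), (1, 2), (40, 50)]
micro, macro = micro_macro_precision(documents)
print(round(micro, 3), round(macro, 3))  # 0.806 0.733
```

The short, poorly handled document pulls the macro average down much more than the micro average, because macro-precision weights every document equally while micro-precision weights every mention equally.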

Micro-Precision

J. Raiman and O. Raiman (2018) 0.95

Sil et al. (2018) 0.94

Le and Titov (2018) 0.93

Hoffart et al. (2011) 0.82

Our Model 0.73

Table 4.5.: Disambiguation Models - CoNLL03/AIDA

The disambiguation models in Table 4.5 are comparatively more complex networks, some of which implement a rich integration of the other components of the entity linking system that is absent in my work. J. Raiman and O. Raiman (2018) integrated symbolic knowledge into the reasoning process of a neural network. Sil et al. (2018) trained fine-grained similarities and dissimilarities between the query and candidate document. Le and Titov (2018) used multi-relational learning with candidates. Hoffart et al. (2011) approximate effective joint mention-entity mapping.

4.4.3. Candidate List Accuracy

I evaluated the accuracy of the candidate lists from which the candidates are selected, where the candidate with the highest probability is picked. The true candidate is not guaranteed to appear in the list. This is noted in Table 4.6 and highlights the strengths of my models in predicting the correct candidates. The accuracy is calculated using the recall at k approach.
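A minimal sketch of recall at k and of picking the highest-probability candidate is given below. The candidate lists, gold entities, and probabilities are invented examples, and the function names are assumptions.

```python
def recall_at_k(candidate_lists, gold_entities, k):
    """Share of mentions whose gold entity is among the top-k candidates."""
    hits = sum(1 for candidates, gold in zip(candidate_lists, gold_entities)
               if gold in candidates[:k])
    return hits / len(gold_entities)


def pick_entity(scored_candidates):
    """Return the candidate name with the highest predicted probability."""
    return max(scored_candidates, key=lambda pair: pair[1])[0]


# Hypothetical ranked candidate lists for three mentions.
candidate_lists = [
    ["Talking Heads", "Talking Heads (play)", "Pundit"],
    ["Tom Waits", "Tom Wait"],
    ["Paris, Texas", "Paris Hilton"],
]
gold_entities = ["Talking Heads (play)", "Tom Waits", "Paris"]

print(recall_at_k(candidate_lists, gold_entities, k=1))
print(recall_at_k(candidate_lists, gold_entities, k=3))
print(pick_entity([("Talking Heads", 0.91), ("Pundit", 0.12)]))  # Talking Heads
```

The third mention can never be linked correctly because its gold entity is missing from the list, which is the candidate generation bottleneck: no disambiguation model can recover an entity that was never retrieved.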

Model              KDWD  CoNLL03/AIDA
BiLSTM Model       0.92  0.83
CNN-BiLSTM Model   0.93  0.83
ELMo Model         0.92  0.85

Table 4.6.: Average Candidate List Accuracy of 5 Runs

In the classification task, it is possible either for more than one candidate to be predicted as the true entity or for no candidate to be predicted at all. I have chosen to predict the candidate with the highest probability as the true entity. In candidate generation, bottleneck problems materialize with gaps in the breadth of a knowledge base. I discuss solutions and penalties for this issue in Section 4.4.4.

4.4.4. Analysis and Discussion

As I found, the greatest gains in my models came from careful consideration of the context embeddings and of the tradeoff between precision and recall. With more advanced models like the ELMo model, I can expect higher precision at the cost of recall, whereas with the BiLSTM model and CNN-BiLSTM model I can expect higher recall at the cost of precision. In testing, the CNN-BiLSTM model showed a slight boost in recall and ROC-AUC compared to the BiLSTM model. With less data, the ELMo embeddings often perform better than the Context2Vec embeddings.

Figure 4.6.: CNN-BiLSTM - KDWD Figure 4.7.: CNN-BiLSTM - CoNLL03/AIDA

With a sample size of 2.8M training and validation instances, the BiLSTM model and CNN-BiLSTM model show stronger performance. The BiLSTM model improves by approximately 8-10% over the next smaller sample size tier. The CNN-BiLSTM model and ELMo model begin to converge in their performance levels with more data. However, when working with a smaller dataset and an incomplete knowledge base, it stands to reason that the ELMo embeddings would be the more robust choice. Comparing Figure 4.6 and Figure 4.7, I observe that the ROC curve is far more inconsistent with CoNLL03/AIDA. I get an average of 26-30% false positives with CoNLL03/AIDA and Context2Vec.

Figure 4.8.: ELMo - KDWD Figure 4.9.: ELMo - CoNLL03/AIDA

Nonetheless, Figure 4.8 and Figure 4.9 show smooth curves for both datasets when working with ELMo embeddings and the smallest sample sizes. In general, when relying too heavily on Wikipedia as the fundamental knowledge base, I will have a harder time generalizing my application to new data that does not reflect this structure well (Hachey et al., 2013). In my work, the models perform exceedingly well with KDWD and fairly well with CoNLL03/AIDA; more datasets should be analyzed for a deeper understanding of how effective the performance is across different texts and formats.

I must also consider how much the model overfits or underfits the data, and monitoring the loss is one clear indication. There is a difference of 0.2-0.3 between training and validation loss in CoNLL03/AIDA, whereas the difference in KDWD is an order of magnitude smaller at 0.02-0.03. This illustrates that my models are overfitting to the CoNLL03/AIDA dataset, but I believe this to be a healthy amount. Additional dropout layers and batch normalization, along with larger data, were used to address these issues where I could apply them.

Target                                P(Target|Anchor)  P(Anchor|Target)
Talking Heads                         0.972             0.973
Talking Heads (series)                0.031             0.869
Talking Heads (Australian TV series)  0.021             1.000
Talking Heads (play)                  0.015             1.000
Pundit                                0.004             0.007

Table 4.7.: Anchor Link of "talking heads"

Table 4.7 lists the five most likely target candidates for the anchor link "talking heads". However, when I retrieve candidates for this same entry from the LSH algorithm, I also get results such as "the walking seeds", "walking through fire", "headshaking", and "talking horse". Penalties, such as those that measure text dissimilarity and give weight to the difference, could filter out superfluous results during the disambiguation stage.
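One possible form of such a penalty is sketched below: a weighted string dissimilarity is subtracted from each candidate's predicted probability. This is only an illustration of the idea, not a method evaluated in this work; the probabilities, the weight of 0.5, and the use of `difflib.SequenceMatcher` as the dissimilarity measure are all assumptions.

```python
from difflib import SequenceMatcher


def dissimilarity(mention, candidate):
    """1 - similarity ratio of the two lower-cased strings (0 means identical)."""
    ratio = SequenceMatcher(None, mention.lower(), candidate.lower()).ratio()
    return 1.0 - ratio


def penalized_scores(mention, probabilities, weight=0.5):
    """Subtract a weighted dissimilarity penalty from each candidate probability."""
    return {name: prob - weight * dissimilarity(mention, name)
            for name, prob in probabilities.items()}


# Hypothetical model probabilities for three retrieved candidates.
raw = {"Talking Heads": 0.80, "talking horse": 0.82, "headshaking": 0.40}
adjusted = penalized_scores("talking heads", raw)

best_raw = max(raw, key=raw.get)
best_adjusted = max(adjusted, key=adjusted.get)
print(best_raw, "->", best_adjusted)  # talking horse -> Talking Heads
```

A purely string-based penalty would also punish legitimate aliases whose surface form differs from the mention, such as "Pundit" in Table 4.7, so in practice the weight would need to be tuned or combined with the anchor link statistics.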

Finally, I considered runtime. I ran my models on a Tesla K80 or P100 GPU. ELMo embeddings took approximately twice the amount of time to run as the other deep learning models. When working with the 2.8M samples from KDWD, runs would finish in 100-110 min with Context2Vec and 160-170 min with ELMo. For the final deployment of the entire named-entity linking system, I used the CNN-BiLSTM model with 2.8M samples for its modest improvement in the response time of query retrieval.

5. Conclusion and Future Work

In my work and research, I constructed a complete named-entity linking system that addresses many of the research questions that I originally posited. I manage name variations in the case of misspelling, so if the text contains "George Bosh" rather than "George Bush", the system will conclude that these two entities are the same. Name variations are also managed in the case of aliases, where the system can understand that "Bush Jr" and "George Bush" can be the same person given the proper context. However, abbreviations remain a challenge and usually rely on the primary representation of identity.

The ambiguity challenge is largely covered by my work, where often the right

context can give way to the correct entity. In the case of coreference resolution, there

remains room for improvement as a downstream task. Nonetheless, ambiguity is the

most evident challenge solved. In the case of incomplete information, my system

can determine context for those candidates within this snapshot of the dataset. The

bottleneck problem of the candidate lists remains a problem that limits the success of

all of the challenges I addressed.

To construct the system, I first had to recognize the named-entities. I then performed candidate generation by combining anchor link frequency with the LSH algorithm, where a candidate group of potential entities was created for each identified entity. Afterwards, my named-entity disambiguation component was applied to select the most probable candidate. This final selection step was treated as a classification task.

I managed to successfully incorporate two types of context embeddings (Context2Vec, ELMo) and concatenate them with knowledge graph embeddings (TransE), a fresh approach to named-entity linking that performs well with both KDWD and CoNLL03/AIDA. The evaluation focused mainly on the disambiguation component of the named-entity linking system. Running various models, hyperparameters, sample sizes, and embeddings, I was able to achieve a final F1 of 79% when applied to the CoNLL03/AIDA benchmark.

For future work, there are a few different aspects of my research that should be expanded on. The first would be improving the other components of my system. Joint modeling for named-entity recognition, candidate generation, and named-entity disambiguation has been done heuristically and through neural networks (Broscheit, 2019).

My named-entity recognition system is a vanilla model, but using an LSTM with conditional random fields (CRF) could likely improve it. CRFs predict sequences by using contextual information to supplement the information the model relies on to make a correct prediction. Given their strength with contextual information, it would make compelling work to develop a CRF for the disambiguation component as well, yet this usage lacks proper research and implementation.

The candidate generation component could be improved by experimenting with more graph algorithms for greater semantic understanding of text. I could have compared Jaccard similarity with cosine similarity, yet previous work suggested that Jaccard similarity would perform best with the LSH algorithm (E. Zhu et al., 2016). In addition, the candidate generation has bottleneck limitations that could be improved with more careful curation at the feature engineering stage rather than by changing the similarity metric.

Finally, Luan et al. (2018) assert that multi-task identification of entities, relations, and coreference resolution outperforms other models, such as mine, that focus purely on entities. There is a lot of unused relational data in my knowledge graph embeddings that could be incorporated into deep learning, as was shown in Figure 2.11. I primarily use link prediction and similarity metrics, but graph algorithms like PageRank and Louvain modularity have been leveraged with deep learning models too (Cao et al., 2018).

A. Named-Entity Linking Examples

Figure A.1.: Named-Entity Linking Example 1

Figure A.2.: Named-Entity Linking Example 2

Bibliography

Adamic, Lada A. and Eytan Adar (2003). “Friends and neighbors on the Web”. Soc. Networks 25, pp. 211–230.

Auer, Sören, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and

Zachary Ives (2007). “DBpedia: A Nucleus for a Web of Open Data”. In: vol. 6. Jan.

2007, pp. 722–735. doi: 10.1007/978-3-540-76298-0_52.

Battaglia, Peter, Jessica Blake Chandler Hamrick, Victor Bapst, Alvaro Sanchez, Vinicius

Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro,

Ryan Faulkner, Caglar Gulcehre, Francis Song, Andy Ballard, Justin Gilmer, George

E. Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Jayne Langston, Chris

Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals,

Yujia Li, and Razvan Pascanu (2018). “Relational inductive biases, deep learning, and

graph networks”. arXiv. url: https://arxiv.org/pdf/1806.01261.pdf.

Bordes, Antoine, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana

Yakhnenko (2013). “Translating Embeddings for Modeling Multi-Relational Data”.

In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. NIPS’13. Lake Tahoe, Nevada: Curran Associates Inc., pp. 2787–2795.

Broscheit, Samuel (2019). “Investigating Entity Knowledge in BERT with Simple Neural

End-To-End Entity Linking”. In: Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). Hong Kong, China: Association for

Computational Linguistics, Nov. 2019, pp. 677–685. doi: 10.18653/v1/K19-1063. url:

https://www.aclweb.org/anthology/K19-1063.

Cao, Jinxin, Di Jin, Liang Yang, and Jianwu Dang (2018). “Incorporating network

structure with node contents for community detection on large networks using

deep learning”. Neurocomputing 297 (Feb. 2018). doi: 10.1016/j.neucom.2018.01.065.

Dong, Wei, Moses Charikar, and Kai Li (2011). “Efficient K-nearest neighbor graph construction for generic similarity measures”. In: Jan. 2011, pp. 577–586. doi: 10.1145/1963405.1963487.

Franzoni, Valentina, Michele Lepri, and Alfredo Milani (2019). “Topological and Semantic Graph-based Author Disambiguation on DBLP Data in Neo4j”. CoRR abs/1901.08977. arXiv: 1901.08977. url: http://arxiv.org/abs/1901.08977.

Gionis, Aristides, Piotr Indyk, and Rajeev Motwani (1999). “Similarity Search in High

Dimensions via Hashing”. In: Proceedings of the 25th International Conference on Very Large Data Bases. VLDB ’99. San Francisco, CA, USA: Morgan Kaufmann Publishers

Inc., pp. 518–529. isbn: 1558606157.

Goldberg, Yoav and Omer Levy (2014). “word2vec Explained: deriving Mikolov et al.’s

negative-sampling word-embedding method”. CoRR abs/1402.3722. arXiv: 1402.3722.

url: http://arxiv.org/abs/1402.3722.

Grover, Aditya and Jure Leskovec (2016). “node2vec: Scalable Feature Learning for

Networks”. CoRR abs/1607.00653. arXiv: 1607.00653. url: http://arxiv.org/abs/1607.00653.

Hachey, Ben, Will Radford, Joel Nothman, Matthew Honnibal, and James Curran (2013). “Evaluating Entity Linking with Wikipedia”. Artificial Intelligence 194 (Jan. 2013), pp. 130–150. doi: 10.1016/j.artint.2012.04.005.

Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-term Memory”. Neural Computation 9 (Dec. 1997), pp. 1735–80. doi: 10.1162/neco.1997.9.8.1735.

Hoffart, Johannes, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum (2011). “Robust Disambiguation of Named Entities in Text”. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland, UK: Association for Computational Linguistics, July 2011, pp. 782–792. url: https://www.aclweb.org/anthology/D11-1072.

Ji, Guoliang, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao (2015). “Knowledge

Graph Embedding via Dynamic Mapping Matrix”. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing,

China: Association for Computational Linguistics, July 2015, pp. 687–696. doi:

10.3115/v1/P15-1067. url: https://www.aclweb.org/anthology/P15-1067.

Koncel-Kedziorski, Rik, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi (2019). “Text Generation from Knowledge Graphs with Graph Transformers”. CoRR abs/1904.02342. arXiv: 1904.02342. url: http://arxiv.org/abs/1904.02342.

Kulkarni, Sayali, Ganesh Ramakrishnan, and Soumen Chakrabarti (2009). “Collective annotation of Wikipedia entities in web text”. In: Jan. 2009, pp. 457–466. doi: 10.1145/1557019.1557073.

Le, Phong and Ivan Titov (2018). “Improving Entity Linking by Modeling Latent Relations between Mentions”. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, July 2018, pp. 1595–1604. doi: 10.18653/v1/P18-1148. url: https://www.aclweb.org/anthology/P18-1148.

Logan, Robert, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh (2019). “Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling”. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, July 2019, pp. 5962–5971. doi: 10.18653/v1/P19-1598. url: https://www.aclweb.org/anthology/P19-1598.

Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman (2014). “Parallel Heuristics for Scalable Community Detection”. Parallel Computing 486 (Oct. 2014). doi: 10.1016/j.parco.2015.03.003.

Luan, Yi, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi (2018). “Multi-Task Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, Oct. 2018, pp. 3219–3232. doi: 10.18653/v1/D18-1360. url: https://www.aclweb.org/anthology/D18-1360.

Melamud, Oren, Jacob Goldberger, and Ido Dagan (2016). “context2vec: Learning Generic Context Embedding with Bidirectional LSTM”. In: Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany: Association for Computational Linguistics, Aug. 2016, pp. 51–61. doi: 10.18653/v1/K16-1006. url: https://www.aclweb.org/anthology/K16-1006.

Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel (2011). “A Three-Way Model

for Collective Learning on Multi-Relational Data.” In: Jan. 2011, pp. 809–816.

Page, Larry, Sergey Brin, R. Motwani, and T. Winograd (1998). The PageRank Citation Ranking: Bringing Order to the Web.

Parravicini, Alberto, Rhicheek Patra, Davide B. Bartolini, and Marco D. Santambrogio (2019). “Fast and Accurate Entity Linking via Graph Embedding”. In: Proceedings of the 2nd Joint International Workshop on Graph Data Management Experiences &

Systems (GRADES) and Network Data Analytics (NDA). GRADES-NDA’19. Amsterdam, Netherlands: Association for Computing Machinery. isbn: 9781450367899. doi: 10.1145/3327964.3328499. url: https://doi.org/10.1145/3327964.3328499.

Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena (2014). “DeepWalk: Online Learning of Social Representations”. CoRR abs/1403.6652. arXiv: 1403.6652. url: http://arxiv.org/abs/1403.6652.

Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer (2018). “Deep contextualized word representations”. In: Proc. of NAACL.

Raiman, Jonathan and Olivier Raiman (2018). “DeepType: Multilingual Entity Linking by Neural Type System Evolution”. In: AAAI.

Reddy, Sathish, Dinesh Raghu, Mitesh M. Khapra, and Sachindra Joshi (2017). “Generating Natural Language Question-Answer Pairs from a Knowledge Graph Using a RNN Based Question Generation Model”. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. Valencia, Spain: Association for Computational Linguistics, Apr. 2017, pp. 376–385. url: https://www.aclweb.org/anthology/E17-1036.

Serrà, Joan and Alexandros Karatzoglou (2017). “Getting deep recommenders fit: Bloom

embeddings for sparse binary input/output networks”. CoRR abs/1706.03993. arXiv:

1706.03993. url: http://arxiv.org/abs/1706.03993.

Shchur, Oleksandr and Stephan Günnemann (2019). “Overlapping Community Detection with Graph Neural Networks”. arXiv: 1909.12201 [cs.LG].

Sil, Avirup, Gourab Kundu, Radu Florian, and Wael Hamza (2018). “Neural Cross-Lingual Entity Linking”. In: AAAI.

Sun, Chi, Xipeng Qiu, Yige Xu, and Xuanjing Huang (2019). “How to Fine-Tune BERT for Text Classification?” CoRR abs/1905.05583. arXiv: 1905.05583. url: http://arxiv.org/abs/1905.05583.

Tjong Kim Sang, Erik F. and Fien De Meulder (2003). “Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition”. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. url: https://www.aclweb.org/anthology/W03-0419.

Wang, Wentao, Lintao Wu, Ye Huang, Hao Wang, and Rongbo Zhu (2019). “Link

Prediction Based on Deep Convolutional Neural Network”. Information 10 (May

2019), p. 172. doi: 10.3390/info10050172.

Wang, Xiaozhi, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang

(2019). KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation. arXiv: 1911.06136 [cs.CL].

Yang, Bishan, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng (2014). “Embedding

Entities and Relations for Learning and Inference in Knowledge Bases” (Dec. 2014).

Yenter, Alec and Abhishek Verma (2017). “Deep CNN-LSTM with combined kernels from multiple branches for IMDb review sentiment analysis”. In: Oct. 2017, pp. 540–546. doi: 10.1109/UEMCON.2017.8249013.

Zachary, Wayne W (1977). “An information flow model for conflict and fission in small

groups”. Journal of anthropological research, pp. 452–473.

Zhou, Chunting, Chonglin Sun, Zhiyuan Liu, and Francis C. M. Lau (2015). “A C-LSTM

Neural Network for Text Classification”. CoRR abs/1511.08630. arXiv: 1511.08630.

url: http://arxiv.org/abs/1511.08630.

Zhu, Erkang, Fatemeh Nargesian, Ken Q. Pu, and Renée J. Miller (2016). “LSH Ensemble: Internet-Scale Domain Search”. Proc. VLDB Endow. 9.12 (Aug. 2016), pp. 1185–1196. issn: 2150-8097. doi: 10.14778/2994509.2994534. url: https://doi.org/10.14778/2994509.2994534.

Zhu, Zhaocheng, Shizhen Xu, Meng Qu, and Jian Tang (2019). “GraphVite: A High-

Performance CPU-GPU Hybrid System for Node Embedding”. CoRR abs/1903.00757.

arXiv: 1903.00757. url: http://arxiv.org/abs/1903.00757.
