Separating the Signal from the Noise: Predicting the Correct Entities in Named-Entity Linking
Drew Perkins
Uppsala University
Department of Linguistics and Philology
Master Programme in Language Technology
Master’s Thesis in Language Technology, 30 ECTS credits
June 9, 2020
Supervisors:
Gongbo Tang, Uppsala University
Thorsten Jacobs, Seavus
Abstract
In this study, I constructed a named-entity linking system that maps between
contextual word embeddings and knowledge graph embeddings to predict correct
entities. To establish a named-entity linking system, I first applied named-entity
recognition to identify the entities of interest. I then performed candidate
generation via locality sensitivity hashing (LSH), where a candidate group of potential
entities was created for each identified entity. Afterwards, my named-entity
disambiguation component was applied to select the most probable candidate. By
concatenating contextual word embeddings and knowledge graph embeddings in
my disambiguation component, I present a novel approach to named-entity
linking. I conducted the experiments with the Kensho-Derived Wikimedia Dataset
and the AIDA CoNLL-YAGO Dataset; the former dataset was used for deployment
and the latter is a benchmark dataset for entity linking tasks. Three deep learning
models were evaluated on the named-entity disambiguation component with
different context embeddings. The evaluation was treated as a classification task,
where I trained my models to select the correct entity from a list of candidates.
By optimizing the named-entity linking through this methodology, this entire
system can be used in recommendation engines with a high F1 of 86% using the
former dataset. With the benchmark dataset, the proposed method is able to
achieve an F1 of 79%.
Contents
Acknowledgments
1. Introduction
1.1. Purpose and Motivation
1.2. Outline
2. Background
2.1. Graph Theory and Concepts
2.1.1. Algorithms
2.2. Knowledge Graphs
2.2.1. Knowledge Representation
2.2.2. Knowledge Bases
2.3. Named-Entity Linking Components
2.3.1. Named-Entity Recognition
2.3.2. Candidate Generation via Locality Sensitivity Hashing
2.3.3. Named-Entity Disambiguation
2.4. Feature Embeddings
2.4.1. Word Embeddings
2.4.2. Graph Embeddings
2.5. Neural Networks and Deep Learning
2.5.1. Long Short Term Memory (LSTM)
2.5.2. Convolutional Neural Networks (CNN)
2.5.3. Contextual Embeddings from Language Models (ELMo)
3. Methodology
3.1. Named-Entity Linking
3.2. Disambiguation Models
3.2.1. Embeddings
3.2.2. BiLSTM Model
3.2.3. CNN-BiLSTM Model
3.2.4. ELMo Model
3.2.5. Feed-Forward Neural Network (FFNN)
4. Experiments
4.1. Data
4.1.1. Kensho-Derived Wikimedia Dataset (KDWD)
4.1.2. AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA)
4.2. Settings
4.3. Evaluation Metrics
4.3.1. Precision and Recall
4.3.2. Classification Metrics
4.4. Results
4.4.1. The Effect of Training Data Size
4.4.2. Disambiguation Models
4.4.3. Candidate List Accuracy
4.4.4. Analysis and Discussion
5. Conclusion and Future Work
A. Named-Entity Linking Examples
Acknowledgments
I would like to thank my university supervisor Gongbo Tang for his help and guidance
in the structure and pragmatics of my thesis. I am deeply indebted to the Seavus AB
team, particularly my company supervisor Thorsten Jacobs, for their support, feedback,
resources, ideas, and deadlines for this thesis. I would like to thank COVID-19 for
destroying my social life in the months leading up to completing my thesis. Finally, I
would like to thank my family, friends, and girlfriend for their ongoing support during
my Master’s studies.
Seavus AB is an IT consulting firm that provides enterprise-wide business solutions
across the world, mainly covering the US and European markets. The department I
conducted this work in was their artificial intelligence and machine learning division
located in Stockholm. Their current work includes but is not limited to chatbots, QA
systems, and business intelligence.
1. Introduction
"I am convinced that the crux of the problem of learning is recognizing
relationships and being able to use them"
Christopher Strachey in a letter to Alan Turing, 1954
Knowledge graphs have grown rapidly in use in recent years within natural
language processing. Whether it be natural language generation, question-answering,
or named-entity recognition and relation linking, when common natural language tasks
are combined with knowledge graphs, improvements can be made across tasks and
domains. That is why I sought to construct a named-entity linking system in which
ambiguities in named-entities can be detected and properly clarified with assistance
from knowledge graphs. In some initial work with named-entity recognition on a
corpus, I noticed that "Bush" would come up several times without any indication of
whether it referred to "George H.W. Bush" or "George W. Bush". For our
purposes, this seemed like a glaring oversight and one that we chose to expand on
to find a proper solution. One clear method to solve this problem is named-entity
linking. A named-entity linking system consists of three primary components:
named-entity recognition, candidate generation, and named-entity disambiguation.
With these three components, we identify entities, construct a list of
possible candidates for each identified entity, and finally disambiguate the entity
against the candidate list and link it to a distinct identifier within a knowledge graph.
It is this final component, named-entity disambiguation, that I focused on for my
research and evaluation. The data I originally trained the disambiguation component
on was the Kensho-Derived Wikimedia Dataset, which includes Wikipedia text, links,
and the Wikidata knowledge base. I then conducted further studies with the AIDA
CoNLL-YAGO Dataset, a benchmark in named-entity linking. Furthermore, I performed
named-entity linking with three disambiguation models that map between contextual
word embeddings and knowledge graph embeddings. To optimize named-entity linking
with deep learning, I treated the problem as a classification task; the models predict
the correct entity among a series of candidates. With my best model on
the benchmark dataset, I achieved 87% recall, 73% precision, and an ROC-AUC of 83%.
The thesis project was conducted at Seavus AB, an IT consulting firm in Stockholm,
Sweden. It was founded in Malmo, Sweden, yet has offices throughout Northern and
Southeastern Europe. Seavus AB offers state-of-the-art machine learning and artificial
intelligence services to companies from around the world.
1.1. Purpose and Motivation
The purpose of this thesis is to examine and evaluate different ways to improve
named-entity linking systems by mapping between context embeddings and
knowledge graph embeddings for correct entities; there appears to be little
research into the use of both of these embeddings together. The motivation came
from my initial findings of ambiguity with certain persons and places, such as in
the "Bush" example. Particularly when we are working with news corpora, a last
name can often be found alone, so this is more than a mere isolated incident.
Named-entity linking is a step toward more accurate semantic representation and synonym
extraction, which remain a challenge in NLP research despite the robust ontology
networks and lexicons widely available. Moreover, the relations we make in
how we think, what we watch, and who we connect with are all inextricably linked
to how we have come to understand the world, and are fundamental to language
technology.
There are several research questions I considered to help me move forward with my
work. Can entities be adequately predicted through the concatenation of knowledge
graph embeddings with contextual word embeddings? Most current disambiguation
methods rely purely on the contextual information of documents. Can my system
manage name variations, i.e., the same entity appearing under various naming
conventions? This may be caused by aliases, spelling errors, or abbreviations. Can my system
manage ambiguity, i.e., the same mention being polysemous (having multiple
meanings) depending on the specific context? Can my system manage incomplete
information when there is a limited amount of knowledge? Will my system be able to
fill the contextual gaps?
1.2. Outline
The thesis is organized as follows. In Chapter 2, I introduce
essential graph theory concepts and terminology that are necessary to understand
the rest of the thesis. This is followed by knowledge graphs, the components
used for the named-entity linking system, feature embeddings, and the deep learning
necessary to understand the later chapters. In Chapter 3, I discuss the methodology
behind the system, embeddings, and disambiguation models. In Chapter 4, I
first investigate the datasets used. I then break down the settings of the data, the earlier
entity linking components, and the model environments for disambiguation.
This is followed by brief introductions to the evaluation metrics used in this work. I
finish this chapter by reporting my results with an analysis and discussion afterwards.
In Chapter 5, I conclude my findings and examine a few different ways this research
could be expanded on in the future.
2. Background
First, I note the fundamental graph concepts and algorithms. I then discuss knowledge
graphs from base to graph structure. I follow with explanations of the named-entity
linking components, which include named-entity recognition, candidate generation, and
named-entity disambiguation. Afterwards, I delve into word and graph feature
embeddings. Finally, I explain the core deep learning necessary for this research.
2.1. Graph Theory and Concepts
In graph theory, graphs are a high-fidelity way of modeling pairwise relations between
objects. Graphs are composed of objects known as vertices, entities, or nodes that
are connected by edges, links, or relationships. A label of a node marks it as part of a
larger group. In classic graph theory, this traditionally signified one node, but this has
since taken on the meaning of a node group. Relationships are classified into
types rather than labels. Nodes and relationships can also have embedded attributes
known as properties that contain various data types, whether they be numerical or
categorical. A subgraph is a smaller section within a larger graph. A path is a group of
nodes and their connecting relationships.
Figure 2.1.: Semantic Triple
In addition, there are common graph attributes that should be considered. Graphs
can be connected or disconnected, depending on whether their nodes are all joined
by relations. Nodes and relationships can carry certain weights. Relationships can
have a fixed direction. In this case, start nodes are known as heads and end nodes
are known as tails; a head, relation, and tail together are called a semantic triple.
Paths can be cyclic or acyclic depending on whether they start and end on the same
node. The relationship-to-node ratio can be sparse or dense, which can lead to
divergent results. Monopartite, bipartite, and k-partite graphs are those that connect
nodes of one, two, or any number of node types. They express a subgraph of a
knowledge graph.
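As a minimal sketch of these concepts, a directed graph can be stored as a list of (head, relation, tail) triples and indexed by head node. The relation names and the `build_adjacency` helper below are hypothetical and chosen only for illustration; they are not taken from the datasets used in this thesis.

```python
from collections import defaultdict

# Hypothetical semantic triples: (head, relation, tail).
triples = [
    ("George W. Bush", "child_of", "George H.W. Bush"),
    ("George W. Bush", "born_in", "New Haven"),
    ("George H.W. Bush", "born_in", "Milton"),
]

def build_adjacency(triples):
    """Index directed, typed relationships by their head node."""
    adjacency = defaultdict(list)
    for head, relation, tail in triples:
        adjacency[head].append((relation, tail))
    return dict(adjacency)

graph = build_adjacency(triples)
```

Because the relationships are directed, a node that only appears as a tail (such as "Milton") has no outgoing entry in the index.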
2.1.1. Algorithms
Pathfinding
Two fundamental algorithms to traverse an entire graph are depth-first search and
breadth-first search. A depth-first search algorithm iterates outward from a starting
node to some end node before repeating a similar search down a different path from
the same start node. Breadth-first search iterates the graph one layer at a time, first
visiting each node at depth 1, then depth 2, and so on, until the entire graph has been
visited. Pathfinding algorithms are built on top of graph search algorithms such as these
two and explore routes between nodes until the goal node has been reached. The
pathfinding algorithms primarily covered in my work are shortest path (the shortest path
between nodes) and random walk (a sequence of nodes reached by following
relationships selected at random).
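The following sketch shows how a shortest-path search can be built on breadth-first search for an unweighted graph. The function name `shortest_path` and the toy graph are assumptions made for illustration, not part of the system described in this thesis.

```python
from collections import deque

def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns one shortest path in an unweighted graph, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in adjacency.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

# Hypothetical directed toy graph.
toy_graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
path = shortest_path(toy_graph, "A", "E")
```

Because nodes are visited one layer at a time, the first path that reaches the goal is guaranteed to have the fewest relationships.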
Centrality
Centrality is often used to retrieve the most important people or most relevant
answers in response to a query. PageRank (Page et al., 1998), an algorithm
devised at Google, lets its search engine measure the importance of web pages.
It counts the number and quality of links to a
page to determine a rough estimate of how important a web page is. The underlying
assumption here is that more important pages are likely to receive more links from
other pages:
\[ PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)} \]
The PageRank value for a page u depends on the PageRank value of each page v in
the set B_u (the pages that link to u), divided by the number of links L(v) from page v.
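A minimal sketch of this computation, assuming the simplified form given here, is to start from a uniform rank and apply the update repeatedly. The published algorithm additionally uses a damping factor, which is omitted to stay close to the formula in the text; the `pagerank` function and the three-page toy web are hypothetical.

```python
def pagerank(out_links, iterations=100):
    """Iterate PR(u) = sum over v in B_u of PR(v) / L(v).

    out_links maps each page to the pages it links to. This undamped
    form assumes every page has at least one outgoing link.
    """
    pages = list(out_links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for v, targets in out_links.items():
            share = rank[v] / len(targets)  # PR(v) / L(v)
            for u in targets:
                new_rank[u] += share
        rank = new_rank
    return rank

# Hypothetical toy web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy web, page C receives links from both other pages and ends up with a higher rank than page B, which is linked only from A.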
Community Detection
Social networks are the most striking, and paradigmatic, examples of relationships
between individuals (or communities) within a graph. Whether among
coworkers, family, or friends, people gravitate to form groups. Zachary’s network
of karate club members, a standard benchmark in community detection, demonstrated
aggregations of people as they drifted apart and formed two factions
(Zachary, 1977).
Figure 2.2.: Zachary’s Karate Club
By looking at Figure 2.2, it is possible to infer two major aggregations of people:
those around vertex 34 (the club president) and those around vertex 1 (the instructor).
The Louvain method is a clustering algorithm for community detection that evaluates
how much more densely connected the nodes within a community are than they
would be in a random network (Lu et al., 2014), but graph neural networks have also been
used to detect overlapping and disjoint communities (Shchur and Günnemann, 2019).
Link Prediction
When we want to anticipate the most likely future relations in Figure 2.2, we use
link prediction algorithms to predict possible future connections in the network. In
addition, they can be used to propose missing links for obstructed or missing data.
The Adamic-Adar algorithm (Adamic and Adar, 2003) was adopted early to predict
links in social networks, such as the one in Figure 2.2, using the formula:
\[ A(x, y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{\log |N(u)|} \]
In Adamic and Adar (2003), N(u) is the set of nodes adjacent to u. A value of 0 asserts
that two nodes are not close, while higher values indicate closeness.
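A short sketch of the Adamic-Adar score follows, assuming an undirected network stored as a dictionary of neighbor sets. The `adamic_adar` function name and the toy network are hypothetical.

```python
import math

def adamic_adar(neighbors, x, y):
    """A(x, y) = sum over shared neighbors u of 1 / log|N(u)|."""
    shared = neighbors[x] & neighbors[y]
    return sum(1.0 / math.log(len(neighbors[u])) for u in shared)

# Hypothetical undirected toy network (each edge is listed in both directions).
toy_network = {
    "x": {"a", "b"},
    "y": {"a", "b"},
    "a": {"x", "y"},
    "b": {"x", "y", "c"},
    "c": {"b"},
}
score_xy = adamic_adar(toy_network, "x", "y")
score_ac = adamic_adar(toy_network, "a", "c")
```

Shared neighbors with few connections of their own contribute more to the score than highly connected hubs, and a pair with no shared neighbors scores 0.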
Similarity
Different vector-based metrics are applied when we want to compute the similarity
of pairs of nodes. Node similarity can be calculated by looking at how many neighbors
two nodes share, as in an approximate nearest neighbor algorithm. This algorithm
constructs a k-nearest neighbors (k-NN) graph for a set of objects based on a given
similarity measure.
With k-NN and the Euclidean distance, the distance between two nodes p and q is:
\[ d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \]
In the above formula, n is the number of dimensions (or features). The similarity
measures that can be used with k-nearest neighbors include cosine similarity, Jaccard
similarity, Euclidean distance, and Pearson similarity, to name a few.
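To make the distance computation concrete, here is a brute-force k-NN sketch using the Euclidean distance. The function names and the two-dimensional toy vectors are assumptions for illustration; an approximate method would avoid comparing the query against every other vector.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

def k_nearest(vectors, query, k):
    """Return the k keys whose vectors are closest to the query key (brute force)."""
    others = [name for name in vectors if name != query]
    return sorted(others, key=lambda name: euclidean(vectors[query], vectors[name]))[:k]

# Hypothetical 2-dimensional node vectors.
toy_vectors = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (1.0, 0.0), "d": (10.0, 10.0)}
nearest = k_nearest(toy_vectors, "a", 2)
```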
2.2. Knowledge Graphs
2.2.1. Knowledge Representation
Knowledge graphs model information in the form of entities and relationships between
them. This sort of knowledge representation is a field, long explored in logic and
reasoning, focused on representing abstract information about the world in a way that
a computer can interpret. A few examples of knowledge representation formalisms
include semantic nets, systems architecture, frames, rules, and ontologies.
2.2.2. Knowledge Bases
A knowledge base is a centralized database for storing, organizing, and disseminating
represented knowledge. The general representation for a knowledge base is an object
model with classes, subclasses, and instances. Some of what I touched on, such as
semantic nets and ontologies, are such object models. The two main forms of
knowledge bases are machine-readable and human-readable. Machine-readable knowledge
bases store data that can only be analyzed by artificial intelligence systems.
Human-readable knowledge bases store documents and physical texts that can be accessed by
humans. The key factors in determining the usefulness of a knowledge base
are the completeness, accuracy, and quality of its information. When
we structure large, unstructured data, we often use a graphical representation of this
knowledge known as a knowledge graph.
2.3. Named-Entity Linking Components
2.3.1. Named-Entity Recognition
Many natural language processing applications require identifying named-entities in
text data and classifying them. Named-entities can be, for example, person or company
names, dates and time expressions, organizations, locations, etc. The task of identifying
these in a text is called named-entity recognition and is often performed for a specific
domain of unstructured data.
Figure 2.3.: Named-Entity Recognizer for Music
Named-entity recognition is usually a supervised task because of its reliance on
annotated data, such as CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003). With
this strong need for annotated data for certain domains, such as medicine or law, this
task becomes knowledge intensive by nature. The most common NLP applications that
benefit from named-entity recognition are QA, relation extraction, and coreference
resolution. In fact, coreference resolution solves the ambiguities of named-entity
recognition by finding the references to the same entity in a text.
2.3.2. Candidate Generation via Locality Sensitivity Hashing
In the candidate generation process, I needed to find the k-NN among the candidates
(Dong et al., 2011). Using brute force to process all possible candidate combinations
would have given the exact nearest neighbors, but it is neither scalable nor fast. I also
calculated the frequency with which an anchor link (Wikipedia hyperlink) corresponds to a
target page, but this method alone was not enough. Thankfully, there are heuristics
that give a good approximation for this k-NN search task, locality sensitivity
hashing (LSH) being one such algorithm (Gionis et al., 1999).
Figure 2.4.: Locality Sensitivity Hashing
LSH hashes data points into buckets so that data points near each other are located
in the same buckets with high probability, while data points far from each other
are likely to be in different buckets. This makes it possible to observe data within
various degrees of similarity. The reason why I don’t simply rely on the anchor link
for candidate generation is that even a slight character difference could result in
no matches. Similarity metrics, such as Jaccard similarity, can successfully retrieve
entities with names similar to the identified entity text (E. Zhu et al., 2016). Jaccard
similarity measures how similar two texts are and can be efficiently approximated
by MinHash LSH:
\[ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \]
If I have a greater intersection of characters between two words, then I can expect
a higher Jaccard index. The frequency and rarity of candidates in a set should be
given proper consideration when choosing the similarity measure. Entities that
constantly recur in the text tend to have embeddings with large norms.
This can dominate the candidate generation process, so variants of similarity
measures that put less emphasis on the norm of an entity should be applied.
Regularization can aid in the inverse issue of rare entities being selected over more
relevant ones.
Figure 2.5.: Jaccard Index of 1/4
As in Figure 2.5, I have one word "Floyd" and another word "Florida"; I can expect
an overall Jaccard index of 1/4 because of how much they have in common
out of all their character features. Min hashing is the most critical aspect of the
algorithm and was chosen because of its effectiveness at approximating Jaccard
similarity, that is, the similarity between sets.
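The sketch below computes the Jaccard index for "Floyd" and "Florida" and builds MinHash signatures for the same sets. It assumes the words are split into character bigrams, a choice that reproduces the index of 1/4 quoted above but is not stated in the text. The helper names, the use of SHA-256 as the hash family, and the signature length are likewise illustrative assumptions, and the bucketing step of LSH is left out.

```python
import hashlib

def shingles(text, n=2):
    """Character n-grams of a string, as a set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """J(A, B) = |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def minhash_signature(items, num_hashes=64):
    """One minimum hash value per seeded hash function."""
    return [
        min(int(hashlib.sha256(f"{seed}:{item}".encode()).hexdigest(), 16) for item in items)
        for seed in range(num_hashes)
    ]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of agreeing signature positions approximates J(A, B)."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

floyd, florida = shingles("Floyd"), shingles("Florida")
exact = jaccard(floyd, florida)  # 2 shared bigrams out of 8 distinct bigrams
approx = estimate_jaccard(minhash_signature(floyd), minhash_signature(florida))
```

The estimate becomes more reliable as the number of hash functions grows, while comparing signatures stays much cheaper than comparing every pair of full sets.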
2.3.3. Named-Entity Disambiguation
Named-entity disambiguation and named-entity linking are often used interchangeably, but
for the purposes of this research, I refer to the overall system as named-entity
linking and to its final component as named-entity disambiguation.
This component performs the task of mapping words of interest, such as names of
persons, locations, and companies, from an input text document to corresponding
unique entities in a target knowledge base such as Wikipedia. When performing
named-entity disambiguation, I do not directly employ Wikipedia; there are databases
better suited for accessing and retrieving information from its knowledge base, such
as DBpedia or Wikidata.
Kulkarni et al. (2009) were the first to annotate and bridge unstructured text with
entity IDs from a knowledge base to disambiguate entities. Two years later, Hoffart
et al. (2011) made one of the preeminent contributions to the disambiguation task by
adding rich context through a combined framework of popularity priors, similarity
measures, and coherence algorithms. Their robust framework became the standard that
most state-of-the-art models augment and attempt to surpass. Finally, Parravicini et al.
(2019) achieved state-of-the-art accuracy by leveraging knowledge graph embeddings
for the disambiguation task.
Figure 2.6.: Named-Entity Disambiguation with Wikidata (Parravicini et al., 2019)
Fields such as author disambiguation (Franzoni et al., 2019), natural language generation
(NLG) (Koncel-Kedziorski et al., 2019; Logan et al., 2019), and QA systems (Reddy et al.,
2017) benefit from higher-level representations of text that we cannot reach with simple
recognition or recommendation algorithms. Named-entity disambiguation can discover
underlying meanings, which lets us find concepts that are relevant to the task or
application but do not appear in the text itself. As an example of the benefits of
named-entity disambiguation, consider a simple query, "Floyd revolutionized rock with the
Wall". As I mentioned with PageRank and LSH, search and recommendation engines try
to find the most relevant documents to recommend to a user and to find supplementary
information that may be to their liking.
Without a named-entity disambiguation component, the search engine only looks
for information that mentions "Floyd", "rock", and "wall". This engine may provide us
with "Pink Floyd", "rock music", and "The Wall", but it could also give us false negatives,
meaning that it misses additional information pertaining to Pink
Floyd, such as "prog rock", "David Gilmour", and "Comfortably Numb". Even worse, the
engine could provide us with a series of false positives, such as information on "rock wall
climbing", "Floyd Mayweather", "border walls", and "The Rock".
2.4. Feature Embeddings
One of the more reliable ways to encode the kind of properties that I have in mind is
through the use of feature vectors. Feature vectors consist of numeric or nominal values
that embed specific information and serve as input to many machine learning algorithms.
This embedded information is a hidden low-dimensional vector representation used
to preserve linguistic, spatial, or other features in a new space for effective
learning.
2.4.1. Word Embeddings
Word embeddings focus on the word features of a certain lexicon. They are capable
of capturing the context of a word, such as its semantic and syntactic similarity.
Words or phrases are mapped to vectors of real numbers. It involves a mathematical
embedding from a space with many dimensions per word to a continuous vector space
with a much lower dimension. Dimensionality reduction is the process of reducing
the number of random variables under consideration by obtaining a set of principal
variables. By reducing the size of word embeddings, we can improve their utility in
memory-constrained pipelines.
Figure 2.7.: Word2Vec
Word2Vec is one of the most popular techniques, developed by Google, to learn
word embeddings using a group of models (Goldberg and Levy, 2014). One of these
models in Word2Vec is known as the skip-gram model. It takes every word in a large
corpus and pairs it, one by one, with the words that surround it within a defined window.
These pairs are fed to a neural network that, after training, predicts the probability of
each word appearing in the window around the highlighted word.
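The pairing step of the skip-gram model can be sketched as follows. The function name `skipgram_pairs` and the window size are assumptions for illustration; the neural network that is trained on these pairs is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) training pairs within a fixed window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "Floyd revolutionized rock with the Wall".split()
pairs = skipgram_pairs(sentence, window=1)
```

With a window of 1, each word is paired only with its immediate neighbors; a wider window adds more distant context words to the training pairs.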
Figure 2.8.: Context Window
Context2Vec is a neural model that learns a generic embedding function for
these context windows of target words. Melamud et al. (2016) demonstrated that
training on billions of words is feasible within reasonable time constraints and yields
high-quality context representations which significantly outperform traditional word
embeddings.
2.4.2. Graph Embeddings
Graph embeddings are the transformation of property graphs to a vector or set of
vectors. Embeddings should capture the graph topology, node-to-node relationship,
and other relevant information about graphs, subgraphs, and nodes. Embedding more
properties has the potential to yield better results. Graph embeddings are often
divided into three main groups:
• Node embeddings: We encode each node with its own vector representation. We
would use this embedding when we want to perform visualization or prediction
on the node level, e.g. visualization of vertices in the 2D plane, or prediction
of new connections based on vertex similarities. In many ways, this method is
very similar in mapping as Word2Vec. A few examples are DeepWalk (Perozzi
et al., 2014) and Node2Vec (Grover and Leskovec, 2016).
• Bilinear-based embeddings: We encode the relationships between the two entity
vectors using multiple matrices. Assuming we have a total number of entities
E and a total number of relations R, the total number of parameters will be
E x E x R. Bilinear-based models like RESCAL (Nickel et al., 2011) generate the
score s of a triple (h, r, t) via tensor factorization:
\[ s(h, r, t) = v_h^{T} M_r v_t \]
The head embedding $v_h$ is transposed (superscript T), $v_t$ is the tail embedding,
and each relation is represented by a matrix $M_r$. There is a need for weight decay
with RESCAL because each relation carries with it many parameters, which generally
leads to overfitting and degrades the overall performance (Nickel et al., 2011).
• Translation-based embeddings: We encode the whole relation with a single
vector. The fundamental notion is that the model is making the sum of the head
vector and relation vector as close as possible to the tail vector. Translation-based
models like TransE (Bordes et al., 2013) and TransD (Ji et al., 2015) solve link
prediction in multi-relational data by interpreting relationships as translations
operating on a learned low-dimensional embedding of the entities in a knowledge
graph, rather than on the graph structure itself.
Figure 2.9.: TransE
TransE is one of the most notable translation-based models for knowledge graph
embeddings due to the sheer simplicity of its method:
\[ s(h, r, t) = d(v_h + v_r - v_t) \]
As two embeddings are compared to generate the score s of their triple (h, r, t),
the head embedding $v_h$ is first translated by the relationship vector $v_r$. TransE
assigns lower scores to triples whose translated head lies close to the tail; the
semantic triple score is therefore computed as above, where d is a dissimilarity
function such as the $L_1$ or $L_2$ norm.
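The bilinear and translation-based scoring functions described in the list above can be sketched in a few lines. The function names and the two-dimensional embeddings are hypothetical; real models learn these vectors and matrices from the knowledge graph.

```python
def rescal_score(v_h, M_r, v_t):
    """Bilinear score v_h^T M_r v_t."""
    return sum(
        v_h[i] * M_r[i][j] * v_t[j]
        for i in range(len(v_h))
        for j in range(len(v_t))
    )

def transe_score(v_h, v_r, v_t, norm=1):
    """Translation score d(v_h + v_r - v_t) with d the L1 or L2 norm; lower is closer."""
    diff = [h + r - t for h, r, t in zip(v_h, v_r, v_t)]
    if norm == 1:
        return sum(abs(x) for x in diff)
    return sum(x * x for x in diff) ** 0.5

# Hypothetical 2-dimensional embeddings for illustration only.
head, relation = [1.0, 0.0], [0.0, 2.0]
good_tail, bad_tail = [1.0, 2.0], [4.0, 6.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
```

Here `good_tail` equals `head + relation`, so its TransE score is zero, while `bad_tail` lies far from the translated head and receives a larger score.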
DistMult (Yang et al., 2014) is similar to both RESCAL and TransE. Instead of full
matrices, Yang et al. (2014) reduce the number of parameters per relation by using only
diagonal matrices, which can be stored as vectors v, to generate the score s of a triple
(h, r, t):
\[ s(h, r, t) = \langle v_h, v_r, v_t \rangle = \sum_{d=1}^{D} v_{h,d} \, v_{r,d} \, v_{t,d} \]
Above, d indexes the D embedding dimensions, and the relation vector $v_r$ holds the
diagonal of the relation matrix. Because the score is unchanged when head and tail
are swapped, DistMult is limited to representing only symmetric relations; the same
embedding space is on the left and right sides. DistMult and
TransE both use a low number of parameters to achieve state-of-the-art results.
However, having a model that focuses solely on diagonal matrices is not without its
limitations.
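The symmetry limitation can be seen directly in a small sketch of the DistMult score as an element-wise product summed over dimensions. The function name and embedding values are hypothetical.

```python
def distmult_score(v_h, v_r, v_t):
    """Sum over dimensions d of v_h[d] * v_r[d] * v_t[d]."""
    return sum(h * r * t for h, r, t in zip(v_h, v_r, v_t))

# Hypothetical embeddings for illustration only.
a, b, rel = [1.0, 2.0], [3.0, -1.0], [0.5, 2.0]
forward = distmult_score(a, rel, b)   # score of (a, rel, b)
backward = distmult_score(b, rel, a)  # score of (b, rel, a)
```

Swapping head and tail leaves the score unchanged, so a directed relation such as "child of" cannot be told apart from its inverse by this scoring function alone.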
2.5. Neural Networks and Deep Learning
2.5.1. Long Short Term Memory (LSTM)
Long Short Term Memory (LSTM) networks are a form of recurrent neural network
(RNN) capable of learning long-term dependencies (Hochreiter and Schmidhuber,
1997). RNNs take the form of a chain of repeating modules; in standard RNNs, this
repeating module has a very simple structure, such as a single tanh layer.
Figure 2.10.: LSTM Unit
LSTMs also have this linked, chain structure, but the repeating module has a different
structure. Instead of having a single neural network layer, there are four layers, interacting
in a way that cleverly manages time intervals. In Figure 2.10, at center is an LSTM
unit composed of a cell, an input gate, an output gate, and a forget gate. The cell remembers
values over these time intervals, and the gates regulate the information that
comes in and out of the cell. LSTMs were created to deal with the vanishing gradient
problem that is encountered with traditional RNNs.
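The interaction of the cell and the three gates can be sketched with a single scalar LSTM step. This follows the commonly used LSTM formulation, which is an assumption here since the text does not spell out the gate equations; the function names and the weight layout are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, weights):
    """One step of a scalar LSTM cell.

    weights maps each of "forget", "input", "output", "candidate"
    to a (w_x, w_h, bias) tuple.
    """
    def pre(name):
        w_x, w_h, bias = weights[name]
        return w_x * x + w_h * h_prev + bias

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate values for the cell
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c

zero_weights = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, weights=zero_weights)
```

With all weights at zero, every gate is half open, so the cell keeps half of its previous value. Learned weights let the gates open or close depending on the input.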
2.5.2. Convolutional Neural Networks (CNN)
A convolutional neural network applies 1D convolutions to map features of text
and then applies max pooling over the time-step dimension to obtain
a fixed-length output. We typically use 1D convolutions when working with
text data and 2D convolutions when working with image data. With graphs in
mind, studies have explored the encapsulation of graphs through 2D
and 3D convolutions. However, learning a graph through convolutions is
one difficulty, and learning an entire knowledge graph through convolutions is another
(Battaglia et al., 2018).
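A minimal sketch of a single 1D convolution filter followed by max pooling over the time-step dimension is given below. The function names, the filter values, and the two-dimensional token embeddings are hypothetical; a practical model would use many learned filters.

```python
def conv1d(sequence, kernel):
    """Slide one filter over a sequence of feature vectors (no padding)."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        outputs.append(sum(
            w * x
            for kernel_row, token in zip(kernel, window)
            for w, x in zip(kernel_row, token)
        ))
    return outputs

def max_over_time(feature_map):
    """Max pooling over the time-step dimension."""
    return max(feature_map)

# Hypothetical 4-token sentence with 2-dimensional token embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
kernel = [[1.0, 1.0], [1.0, -1.0]]  # one filter spanning 2 tokens
feature_map = conv1d(tokens, kernel)
pooled = max_over_time(feature_map)
```

The feature map grows with the sentence length, but pooling reduces it to one value per filter, which is what makes the output fixed-length.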
The general purpose of using various convolutions within one network of graph
data is to capture larger representations in different dimensions. For example, one
knowledge graph will have nodes and edges, and each node and edge will more
than likely have additional labels and types. Unsurprisingly, where LSTM fails to
capture these features beyond linear representations, CNN shows more promise in
capturing these complexities. In link prediction, densely connected convolutional
neural networks have been effective when used in conjunction with classic graph heuristics
and similarity metrics (W. Wang et al., 2019).
Figure 2.11.: CNN with Semantic Triples
2.5.3. Contextual Embeddings from Language Models (ELMo)
To advance my context embeddings, so that the entire
sentence is considered before each word is assigned an embedding, I decided
to implement ELMo embeddings. ELMo performs many tasks with state-of-the-art
precision and recall and is trained by predicting the following words in sentences (Peters et al., 2018).
ELMo uses two layers of bidirectional LSTMs (BiLSTM) in its training, with both layers
bridged by a residual connection. A residual connection allows gradients
to flow through a network directly, without passing through the non-linear activation
functions. The high-level intuition is that residual connections help neural networks
to train more successfully (Peters et al., 2018).
Figure 2.12.: BiLSTM Layers in ELMo
ELMo embeddings are character-based, which allows a neural network to use
morphological cues to form representations for out-of-vocabulary tokens unseen
in training. Static word embeddings like Word2Vec and GloVe
usually fall short here. Even when we create a word embedding with a wide context window,
the word will ultimately have the same vector representation regardless of the context.
ELMo embeddings change with context. It is the text prediction performed
by the forward and backward language models in ELMo that makes it one of the best
at tracking language patterns and at transfer learning.
3. Methodology
I demonstrate the entire named-entity linking system from input to output. I then
discuss the embeddings, as well as the general model pipelines of the disambiguation
component and the feed-forward neural network (FFNN).
3.1. Named-Entity Linking
The input text is processed by a named-entity recognition component set up with the
spaCy implementation. The start and end positions of the named-entity are catalogued
along with the raw text input and the identified named-entity. I excluded a few
number-related named-entity types, such as money, time, and percentages. Afterwards,
the input text is cleaned and realigned with new start and end positions. The top
candidates are then retrieved from the anchor link frequency and the LSH algorithm,
and finally, the named-entity disambiguation model is applied to deliver the output.
Figure 3.1.: Named-Entity Linking Components
For evaluation purposes, I focus on the named-entity disambiguation component. To
prepare the training data, I performed text feature engineering for some key elements
that were used in my work. I first set up clean candidate/anchor link lists. I then
processed the 1.5 million sections of Wikipedia that I was using. This took several
days and the text had to be processed a few times over. The reason for
reprocessing is alignment: I had to ensure that the position of the textual
mention, in my case the anchor link, in the section text was the same before and after
processing. This required recalculating positions each time I cleaned the section text.
In the case of number substitution, I opted to replace numbers with hashes as this was
a useful way to avoid problems that would arise from mixed data types.
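The realignment step can be illustrated with a small sketch that cleans a text while recording where each original character ends up, so that mention offsets can be recalculated afterwards. The cleaning rules shown here (digits replaced with hashes, runs of whitespace collapsed) and the helper names `clean_with_offsets` and `realign` are assumptions for illustration; the actual cleaning applied in this work involved more steps.

```python
def clean_with_offsets(text):
    """Return the cleaned text and a map from original to cleaned char index."""
    cleaned, index_map = [], []
    prev_space = False
    for ch in text:
        index_map.append(len(cleaned))   # where this character will land
        if ch.isdigit():
            cleaned.append("#")          # number substitution with hashes
            prev_space = False
        elif ch.isspace():
            if not prev_space:           # keep only the first space of a run
                cleaned.append(" ")
            prev_space = True
        else:
            cleaned.append(ch)
            prev_space = False
    index_map.append(len(cleaned))       # sentinel so end offsets can be mapped
    return "".join(cleaned), index_map

def realign(start, end, index_map):
    """Translate a mention's original (start, end) into cleaned-text offsets."""
    return index_map[start], index_map[end]

raw = "In 1997,   Tom Waits  toured."
start = raw.index("Tom Waits")
end = start + len("Tom Waits")
text, index_map = clean_with_offsets(raw)
new_start, new_end = realign(start, end, index_map)
```

Here the mention moves from offset 11 in the raw text to offset 9 in the cleaned text, and `text[new_start:new_end]` still returns the anchor text.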
3.2. Disambiguation Models
For my deep learning models, I treated the task of named-entity disambiguation as a
classification task. For a candidate in the candidate list for an identified entity in a text,
the model predicts whether this candidate is the true named-entity for the identified
named-entity. Specifically, given the knowledge embedding of the candidate and the
local context embedding of the text as inputs, the model predicts true if the candidate
is the correct entity and false otherwise. For example, my baseline model uses word
embeddings with a local context window, which are trained as part of the embedding
layer of a BiLSTM. The resulting context embeddings are then concatenated with
knowledge graph embeddings and fed into a feed-forward neural network. With this
framework, I successfully mapped between the knowledge graph embeddings and
local context embeddings to disambiguate the named-entities.
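A minimal NumPy sketch of this classification step is given below: a context vector is concatenated with each candidate's knowledge graph vector and passed through a small feed-forward network with a sigmoid output. The dimensions, random weights and the function name `score_candidate` are hypothetical and serve only to show the data flow, not the trained models.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score_candidate(context_vec, kg_vec, W1, b1, w2, b2):
    """Probability that the candidate is the true entity for the mention."""
    joint = np.concatenate([context_vec, kg_vec])    # context + graph features
    hidden = np.maximum(W1 @ joint + b1, 0.0)        # dense layer with ReLU
    return sigmoid(w2 @ hidden + b2)                 # single sigmoid output

rng = np.random.default_rng(0)
ctx_dim, kg_dim, hidden_dim = 6, 4, 5                # toy sizes for illustration
W1 = rng.normal(size=(hidden_dim, ctx_dim + kg_dim))
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=hidden_dim)
b2 = 0.0
context = rng.normal(size=ctx_dim)                   # stands in for the BiLSTM output
candidates = rng.normal(size=(3, kg_dim))            # stands in for TransE vectors
probs = np.array([score_candidate(context, c, W1, b1, w2, b2)
                  for c in candidates])
predictions = probs > 0.5                            # true / false per candidate
```

Each candidate is scored independently, which is why more than one candidate, or none, can be predicted as true for the same mention.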
The notion of incorporating graph embeddings with local context embeddings to
map ambiguous named-entities to those in a knowledge base stems from Parravicini et
al. (2019). In their framework, Parravicini et al. (2019) successfully leveraged
graph embeddings to achieve named-entity disambiguation. Using DBpedia as the
knowledge base and existing graph algorithms for candidate generation, they were
able to achieve state-of-the-art accuracy on a number of datasets and fast retrieval of
entities in real-world engines. In comparison to my work, they use different similarity
metrics in their candidate generation rather than Jaccard similarity and the LSH
algorithm, and their graph embeddings were node embeddings (DeepWalk) rather
than higher-level knowledge graph embeddings. My approach is also novel since it is
the first of its kind to concatenate contextual word embeddings, such as ELMo, with
knowledge graph embeddings to conduct disambiguation as a classification task.
3.2.1. Embeddings
The Context2Vec embeddings were trained with the 500k most frequently occurring
words in the dataset. With this subset, I mapped them to embeddings with a context
window of (+/-) 10 words, which is fed into an embedding layer of the model. The
ELMo embeddings were trained by taking 1.5 million Wikipedia section texts from
the dataset. I truncated the text to a window of (+/-) 10 words to make ELMo simple to
compare with Context2Vec and also ease the computation.
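The (+/-) 10 word truncation can be sketched as a simple slice around the mention. The function name `context_window` and the toy token list are assumptions for illustration.

```python
def context_window(tokens, mention_start, mention_end, size=10):
    """Keep `size` tokens on each side of tokens[mention_start:mention_end]."""
    left = max(0, mention_start - size)              # clip at the text start
    right = min(len(tokens), mention_end + size)     # clip at the text end
    return tokens[left:right]

tokens = [f"w{i}" for i in range(40)]
tokens[20:22] = ["Tom", "Waits"]                     # a two-token mention
window = context_window(tokens, 20, 22)
edge_window = context_window(tokens, 2, 3)           # mention near the start
```

A mention in the middle of the text yields 22 tokens (10 before, the 2 mention tokens, 10 after), while a mention near a boundary yields a shorter window that would need padding before batching.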
The TransE embeddings were trained from 5 million vectors of entities and relations
in Wikidata and Wikipedia. This includes general domain entities such as concepts,
people, and things. I use the graph embedding engine GraphVite (X. Wang et al.,
2019; Z. Zhu et al., 2019) to utilize the knowledge graph embedding algorithms from
Chapter 2.4.2 and generate embeddings in a short amount of time. DistMult was used
in some preliminary experiments, but ultimately I chose TransE as my knowledge
graph embeddings for the final results.
3.2.2. BiLSTM Model
My baseline model takes the Context2Vec embeddings as input. This input is fed
into a BiLSTM, whose output is concatenated with the knowledge graph embeddings
and then fed into a FFNN. The FFNN maps between the context embeddings and
knowledge graph embeddings for the final output. The motivation for the baseline
model was to have a model that I could augment and tune around the Context2Vec
embeddings.
Figure 3.2.: BiLSTM Model Pipeline
A BiLSTM is also known to be quite effective at sequential tagging and word
classification. Particularly when looking at a window of 10 words before and
after the target word, an LSTM is fundamental for Context2Vec.
3.2.3. CNN-BiLSTM Model
The motivation for the second model is much like that of the baseline model, yet the purpose
was to stack a further layer on the BiLSTM. A CNN captures hierarchical relations, and
stacked CNN-LSTM models have shown positive results in text classification (Zhou et al.,
2015). I use the CNN to extract a sequence of higher-level phrase representations, and
then feed this into the BiLSTM for sentence representation. I fed my context
embeddings into one convolutional layer to see how it would perform.
Figure 3.3.: CNN-BiLSTM Model Pipeline
3.2.4. ELMo Model
The ELMo model replaces the Context2Vec embeddings with ELMo embeddings as
input, and since the ELMo embeddings are already produced by BiLSTMs, I only use a
basic LSTM to process them; this output is also concatenated with the knowledge graph
embeddings, as in the BiLSTM model and CNN-BiLSTM model. Depending on the window size of the
embedding, runtime would vary greatly.
Figure 3.4.: ELMo Model Pipeline
3.2.5. Feed-Forward Neural Network (FFNN)
Each model utilizes a FFNN towards the back of its architecture. The FFNN ends in a
sigmoid function that maps the scores coming out of the preceding network into the range
(0,1) before the loss computation:
$$f(s_i) = \frac{1}{1 + e^{-s_i}}$$
It is applied independently to each element and is also known as the logistic
function. Unlike softmax loss, each vector component (class) is independent, therefore
the loss computed for each output class is not affected by other classes. The sigmoid
activation function is applied to the scores before computing the cross-entropy loss:
$$CE = -\sum_{i}^{C} t_i \log(f(s_i))$$
where $t_i$ is the target label and $f(s_i)$ is the sigmoid output for class $i$.
I use cross-entropy to measure the difference between two (or more) probability
distributions; it measures the performance of my models, whose output is a probability
value between 0 and 1.
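As a small worked example, the sketch below evaluates the sigmoid and the cross-entropy for a single candidate. It uses the usual two-class (binary) form of the loss, which adds the term for the negative class; this form, the function names and the scores are assumptions chosen for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def binary_cross_entropy(target, score):
    """Loss for one candidate: target is 1 (true entity) or 0 (false entity)."""
    p = sigmoid(score)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

# The same true candidate (target 1) scored with three different raw scores.
confident_right = binary_cross_entropy(1, 4.0)
unsure = binary_cross_entropy(1, 0.0)       # sigmoid(0) = 0.5, loss = log(2)
confident_wrong = binary_cross_entropy(1, -4.0)
```

The loss grows as the predicted probability moves away from the target, so a confident wrong prediction is penalized far more than an uncertain one.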
4. Experiments
I perform a number of experiments and present the results on openly available
datasets using different feature combinations, models, hyperparameters, and sample
sizes. The datasets for named-entity disambiguation are detailed,
as well as their settings. The Kensho-Derived Wikimedia Dataset (KDWD)
was used for the QA system and the AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA) was
used as an entity linking benchmark. Afterwards, I define the evaluation metrics used
for named-entity disambiguation, perform my experiments and compare the results
with other notable works.
4.1. Data
4.1.1. Kensho-Derived Wikimedia Dataset (KDWD)
Wikipedia, the free encyclopedia, and Wikidata, the free knowledge base, are crowd-
sourced projects supported by the Wikimedia Foundation. Recently, Wikipedia added
its six millionth English article after two decades of operation. Wikidata, a more machine-
readable sister project, has accumulated more than 75 million items since its creation 8 years ago.
The Wikimedia Foundation disseminates this information under a free license, and
it has therefore been heavily researched by data scientists and computer science groups,
particularly in the field of natural language processing (NLP). The Kensho-Derived
Wikimedia Dataset (KDWD) [1] is a concentrated subset of the raw Wikimedia data in a
condition more fit for NLP research.
Pages Tokens Entities Relations
5.3M 2.3B 51M 140M
Table 4.1.: KDWD Dataset
The KDWD dataset is structured with three layers of data: the text from the
Wikipedia page, the hyperlinks between the pages, and the entities and relations built
from the Wikidata graph. Entities and relations are synonymous with items (Q) and
statements (P) in Wikidata. For example, the Noam Chomsky (Q9049) item in Wikidata
has statements that the item is an instance of (P31) human (Q5) and has the occupation (P106)
linguist (Q14467526) and political writer (Q15958642).
If I only observe the "Introduction" sections of these pages, I am still left with 460M
tokens in my corpus. Sometimes evaluation can prove more difficult depending on the
referent used in the data, whereby a system annotates an entity with an opaque
redirect rather than the direct entity in the URI. DBpedia (Auer et al., 2007), which
relies on the Wikimedia project for its knowledge base, will have redirects such as
http://dbpedia.org/resource/PEHDTSCKJBMA that will be completely apt for referencing
http://dbpedia.org/resource/Tom_Waits. In both datasets, I established the direct links
without relying on such redirects for my network.
[1] https://www.kaggle.com/kenshoresearch/kensho-derived-wikimedia-data
4.1.2. AIDA CoNLL-YAGO Dataset (CoNLL03/AIDA)
The AIDA CoNLL-YAGO Dataset created by Hoffart et al. (2011) contains assignments
of entities to the mentions of named-entities that were annotated in the original CoNLL
2003 NER task (Tjong Kim Sang and De Meulder, 2003). The entities are identified by
YAGO2 identifier, by Wikipedia URL, or by Freebase mid. For my purposes, I
used the YAGO2 entity identifier as the target ID, rather than the Wikidata ID used
in KDWD. Each mention of an entity has the accompanying text section that can be
used to train the model.
Split Documents Entities
TRAIN 946 18k
VALID 215 4.6k
TEST 230 4.3k
Table 4.2.: CoNLL03/AIDA Dataset
The referent used is the Wikipedia link, or anchor link, that I used in KDWD. It
makes the comparison between KDWD and CoNLL03/AIDA easier when their entity-
linking is sourced from similar knowledge bases. CoNLL03/AIDA is the standard for
disambiguation.
4.2. Settings
I discuss the environment I created from the data preparation, the NER component,
candidate generation component, and provide a comprehensive look at the three
models in the disambiguation component.
Data Setting: Due to the size of KDWD, I train on only the "Introduction" sections of
1.5 million pages. The amount of text per section in KDWD can vary dramatically, so feature
engineering was quite computationally expensive. The data was split into the
traditional 70% training, 15% validation, and 15% testing. I keep the 500k most frequent
tokens as my Context2Vec lexicon. For the CoNLL03/AIDA dataset, I use all documents,
which were already partitioned. I attach the entire document text to each of the
entities mentioned in the dataset, whereas this structure already exists in KDWD. I use
the 21k most frequent tokens as my Context2Vec lexicon. A max length of 50 tokens
was set for processing the ELMo embeddings in both datasets. Text normalization was
standard, but numbers were replaced with hashes.
NER Setting: The entities were recognized using spaCy v2.0, which uses subword
features and Bloom embeddings to parse entities (Serrà and Karatzoglou, 2017). The
entity types recognized include person, location, and organization, among others.
Candidate Generation Setting: The top candidates were calculated from the anchor
link frequency and the LSH algorithm in Chapter 2.3.2. I use Jaccard similarity and
MinHash to estimate the similarity between these given sets of candidates. This setting
accounts for erroneous spelling and helps reduce the dimensionality of the data.
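The combination of Jaccard similarity, MinHash and LSH can be sketched with the standard library alone. The example below is a simplified illustration rather than the implementation used in this thesis: the character trigram shingles, the MD5-based hash family, the 32 hash functions and the 16 bands are all assumptions made for the sketch.

```python
import hashlib

def shingles(name, k=3):
    """Character k-grams of a lowercased name."""
    name = name.lower()
    return {name[i:i + k] for i in range(max(1, len(name) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash(items, num_hashes=32):
    """Signature: for each seeded hash, the minimum hash value over the set."""
    return [min(int.from_bytes(
                    hashlib.md5(f"{seed}:{x}".encode()).digest()[:8], "big")
                for x in items)
            for seed in range(num_hashes)]

def lsh_buckets(signatures, bands=16):
    """Group names whose signatures agree on at least one band."""
    buckets = {}
    for name, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(name)
    return buckets

names = ["George Bush", "George Bosh", "Talking Heads", "Tom Waits"]
signatures = {n: minhash(shingles(n)) for n in names}
buckets = lsh_buckets(signatures)

query = "George Bosh"
candidates = set()
for members in buckets.values():
    if query in members:
        candidates |= members - {query}   # names sharing a bucket with the query

exact = jaccard(shingles("George Bush"), shingles("George Bosh"))
```

The misspelled and correct names share 6 of 12 distinct trigrams, so their exact Jaccard similarity is 0.5. Names with high Jaccard similarity are likely, though not guaranteed, to collide in at least one band, which is what allows LSH to retrieve candidates without comparing the query against every entity.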
NED Setting: The general pipelines were discussed in Chapter 3, but here I give
the exact sizes of the input (with sequences no longer than 20) and output, along
with the detailed layers of their architecture. The candidate lists that are used for
disambiguation contain one true entity among ten false entities.
• BiLSTM Model: The BiLSTM has an output size of 100 and the FFNN has a dense
layer size of 256 with ReLU activation. A dropout layer is added, and the final
output layer is a dense layer of 1 with sigmoid activation to classify the entities.
Learning rate was 0.007, batch size was 256, with up to 100 epochs and early stopping
once the validation loss stopped improving (patience of 2).
Figure 4.1.: BiLSTM Model Architecture
• CNN-BiLSTM Model: Most of my hyperparameters and features were chosen
from Yenter and Verma (2017), who used a CNN-LSTM for binary classification
of movie review sentiment. I have a dropout before the convolution. The convolution
has a kernel size of 5, 64 filters, ReLU activation, valid padding, and a stride of 1.
I add a max pooling layer and batch normalization before feeding into the same
BiLSTM and FFNN as in the BiLSTM model. Learning rate was 0.007, batch size
was 256, with up to 100 epochs and early stopping once the validation loss stopped
improving (patience of 2).
Figure 4.2.: CNN-BiLSTM Model Architecture
• ELMo Model: The LSTM has an output size of 1024 to match the size
of the ELMo embeddings. The LSTM has a recurrent dropout of 0.2. The final
dropout is 0.2. I used the Adam optimizer with a batch size of 64, a learning rate
of 0.001 and 20 epochs, also with early stopping (Broscheit, 2019). The rest of
the model stays consistent with the other two after concatenation.
Figure 4.3.: ELMo Model Architecture
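The early-stopping rule shared by the three models in the settings above can be sketched in plain Python. The function name and the list of validation losses are hypothetical; in practice the loss would come from evaluating the model after each epoch, and the deep learning framework's own callback would normally be used.

```python
def train_with_early_stopping(val_losses, max_epochs=100, patience=2):
    """Return (epochs_run, best_epoch) given per-epoch validation losses.

    `val_losses` stands in for the loss measured after each real epoch.
    """
    best_loss, best_epoch, waited = float("inf"), 0, 0
    epochs_run = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0   # improvement
        else:
            waited += 1                                      # no improvement
            if waited >= patience:
                break                                        # stop training
    return epochs_run, best_epoch

losses = [0.60, 0.45, 0.40, 0.41, 0.43, 0.39]   # hypothetical validation losses
epochs_run, best_epoch = train_with_early_stopping(losses)
```

With a patience of 2, training stops after epoch 5 because epochs 4 and 5 do not improve on epoch 3, even though a later epoch would have been slightly better. This is the usual trade-off between training time and the risk of stopping too early.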
4.3. Evaluation Metrics
4.3.1. Precision and Recall
The general statistical measures I am observing are F1, precision, and recall. Precision
and recall both give us indications of the accuracy of a model but provide deeper
meanings for what the model is actually predicting. Precision is the percentage
of our results which are relevant to the task, whereas recall is the percentage of
total relevant results which are correctly classified:
$$\text{Precision} = \frac{|\text{Relevant Results} \cap \text{Retrieved Results}|}{|\text{Retrieved Results}|}$$
$$\text{Recall} = \frac{|\text{Relevant Results} \cap \text{Retrieved Results}|}{|\text{Truly Relevant Results}|}$$
The tradeoffs are that lowering our precision will give us irrelevant results not
suitable for a user and that raising precision will generally come at the cost of recall,
so that relevant results are missed. This inverse relationship is
why we use the harmonic mean, the F1 score, to balance precision and recall:
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
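The three measures can be computed directly from the counts of true positives, false positives and false negatives. The counts in this sketch are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts, guarding against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts: 80 correct links, 20 wrong links, 40 missed links.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
```

Here precision is 80/100 = 0.80, recall is 80/120 ≈ 0.67 and F1 is 160/220 ≈ 0.73, which sits closer to the lower of the two values, as expected of a harmonic mean.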
4.3.2. Classification Metrics
There are two ways I observe the classification task in my method. I first observe
a confusion matrix of predicted results. A confusion matrix displays the number of
predicted values on the y-axis and the number of actual values on the x-axis, and is
broken down by each class. In my task, there are two classes. A confusion matrix allows
us to understand not just the errors being made by the classifier, but more importantly,
the types of errors that are being produced by the model.
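A two-class confusion matrix with predicted values as rows and actual values as columns can be built in a few lines. The label lists below are hypothetical and only illustrate the layout.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Rows are predicted classes and columns are actual classes (0 or 1)."""
    counts = Counter(zip(predicted, actual))
    return [[counts[(p, a)] for a in (0, 1)] for p in (0, 1)]

actual    = [1, 0, 0, 1, 0, 0, 1, 0]   # 1 = true entity, 0 = false entity
predicted = [1, 0, 1, 1, 0, 0, 0, 0]
cm = confusion_matrix(actual, predicted)
tn, fn = cm[0]                          # row of candidates predicted false
fp, tp = cm[1]                          # row of candidates predicted true
```

For these labels the matrix is `[[4, 1], [1, 2]]`: one false candidate was wrongly accepted (a false positive) and one true candidate was wrongly rejected (a false negative), which are the two error types the text distinguishes.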
Secondly, I observe a receiver operating characteristic (ROC) curve and the area
under the curve (AUC). It expresses how well a model is capable of distinguishing
between classes: true named-entities and false named-entities in my classification task.
The higher the AUC, the better the model is at predicting true and false entities. The
ROC curve plots the true positive rate (TPR) against the false
positive rate (FPR). The TPR is synonymous with recall; the FPR, in contrast to precision,
measures the ratio of false positives among the negative samples.
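A minimal sketch of how the ROC points and the AUC are obtained is shown below. It sweeps the decision threshold over the predicted scores and integrates with the trapezoidal rule; the labels and scores are hypothetical, and a library routine would normally be used for real evaluation.

```python
def roc_points(labels, scores):
    """(FPR, TPR) pairs obtained by sweeping the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]                  # 1 = true entity, 0 = false entity
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]      # hypothetical model probabilities
points = roc_points(labels, scores)
area = auc(points)
```

In this example 8 of the 9 possible (true, false) candidate pairs are ranked correctly, and the area under the curve is accordingly 8/9 ≈ 0.89.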
4.4. Results
First, I juxtapose the sample sizes and the effects these have on the precision
and recall of my models. Then I compare confusion matrices of the two datasets with
the ELMo model and contrast the micro-precision of my model with state-of-the-art
models. I note the accuracy of the candidate lists on which I base my predictions.
I conclude with a brief analysis of the ROC curves and a discussion of remaining
tangents.
4.4.1. The Effect of Training Data Size
The sample sizes that are listed in Table 4.3 and Table 4.4 are approximate sizes of the
training and validation instances yielded. Since the CoNLL03/AIDA dataset is far smaller
than KDWD, this was reflected in experiment sizes.
Model Precision Recall F1 ROC-AUC Sample Size
BiLSTM Model 0.71 0.66 0.68 0.76 500k
CNN-BiLSTM Model 0.72 0.67 0.69 0.78 500k
ELMo Model 0.86 0.72 0.78 0.90 500k
BiLSTM Model 0.74 0.75 0.74 0.82 1M
CNN-BiLSTM Model 0.72 0.82 0.77 0.85 1M
ELMo Model 0.85 0.80 0.82 0.91 1M
BiLSTM Model 0.86 0.86 0.86 0.92 2.8M
CNN-BiLSTM Model 0.84 0.88 0.86 0.92 2.8M
ELMo Model 0.88 0.81 0.84 0.93 2.8M
Table 4.3.: Average Disambiguation Scores of 5 Runs - KDWD
Unsurprisingly, an increase in the number of samples I train and validate on correlates with an
improvement on the classification task across all models. The BiLSTM model and CNN-
BiLSTM model slightly edge out the ELMo model in F1 with 2.8M samples.
With less data, state-of-the-art models like ELMo can still be effective, performing near
an F1 of 80% with the smallest sample size in Table 4.3.
Model Precision Recall F1 ROC-AUC Sample Size
BiLSTM Model 0.66 0.85 0.74 0.80 20k
CNN-BiLSTM Model 0.63 0.86 0.73 0.80 20k
ELMo Model 0.73 0.87 0.79 0.83 20k
Table 4.4.: Average Disambiguation Scores of 5 Runs - CoNLL03/AIDA
4.4.2. Disambiguation Models
While training my models, I include at most 10 false entities alongside the
true entity as the potential candidates. When predicting the candidates, the BiLSTM
model and CNN-BiLSTM model repeatedly have higher recall and lower precision,
which indicates that the classifier labels too many of the candidates as the true
candidate rather than isolating the single true candidate. The ELMo model has higher
precision and lower recall, which substantiates the notion that state-of-the-art language
models like ELMo and BERT (Sun et al., 2019) can have a harder time generalizing,
thus generating more false negatives, despite generating sufficient true positives.
However, I find a direct contrast between the two datasets with the ELMo model. In
Figure 4.5, I obtain a much higher recall (87%) and lower precision (73%). It should
be further noted that both confusion matrices in Figure 4.4 and Figure 4.5 correspond to the
datasets with the largest training instances. The BiLSTM model and CNN-BiLSTM
model have consistent confusion matrices across both datasets, but it
stands to reason that CoNLL03/AIDA generates more false positives from the lack of
sufficient diversity in the data. For example, when a specific subject or theme manifests
in a text, the model has a harder time differentiating between the macro-context of
the text and the micro-context of the entity within the text.
Figure 4.4.: ELMo CM - KDWD Figure 4.5.: ELMo CM - CoNLL03/AIDA
The precision that I have noted thus far has been the micro-precision.
The micro-precision is the fraction of correctly disambiguated named-entities in an
entire corpus, whereas the macro-precision is the fraction of correctly disambiguated
named-entities averaged by their respective documents. In Table 4.5 there is a drop of 9
percentage points from the lowest state-of-the-art disambiguation model, by Hoffart et al. (2011),
to my highest micro-precision with CoNLL03/AIDA.
Model Micro-Precision
J. Raiman and O. Raiman (2018) 0.95
Sil et al. (2018) 0.94
Le and Titov (2018) 0.93
Hoffart et al. (2011) 0.82
Our Model 0.73
Table 4.5.: Disambiguation Models - CoNLL03/AIDA
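The difference between micro-precision and macro-precision can be seen in a short sketch. The per-document counts are hypothetical and are not results from this thesis.

```python
def micro_precision(docs):
    """docs: list of (correct, total) mention counts, one pair per document."""
    return sum(c for c, _ in docs) / sum(t for _, t in docs)

def macro_precision(docs):
    """Average of the per-document precision values."""
    return sum(c / t for c, t in docs) / len(docs)

docs = [(9, 10), (1, 2), (40, 50)]   # hypothetical counts for three documents
micro = micro_precision(docs)        # 50 / 62
macro = macro_precision(docs)        # (0.9 + 0.5 + 0.8) / 3
```

The short document with only two mentions pulls the macro-precision down to about 0.73, while it barely affects the micro-precision of about 0.81, which is dominated by documents with many mentions.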
These disambiguation models in Table 4.5 are comparatively more complex networks,
some of which implement rich integration of the other components of the entity linking
system that is absent in my work. J. Raiman and O. Raiman (2018) integrated symbolic
knowledge into the reasoning process of a neural network. Sil et al. (2018) trained
fine-grained similarities and dissimilarities between the query and candidate document.
Le and Titov (2018) used multi-relational learning with candidates. Hoffart et al. (2011)
approximate effective joint mention-entity mapping.
4.4.3. Candidate List Accuracy
I evaluated the performance of the candidate lists from which the candidates are
selected, where the candidate with the highest probability is picked. It is not guaranteed
that the true candidate appears in the list; when it is missing, the model cannot select it. This is noted in
Table 4.6 and highlights the strengths of my models in predicting the correct candidates.
The accuracy is calculated with the recall at k approach.
Model KDWD CoNLL03/AIDA
BiLSTM Model 0.92 0.83
CNN-BiLSTM Model 0.93 0.83
ELMo Model 0.92 0.85
Table 4.6.: Average Candidate List Accuracy of 5 Runs
In the classification task, it is possible for more than one candidate
to be predicted as the true entity, or for no candidate to be predicted as the true entity. I
have chosen to predict the candidate with the highest probability as the true entity. In
candidate generation, bottleneck problems materialize with gaps in the breadth of a
knowledge base. I discuss solutions and penalties for this issue in Section 4.4.4.
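Both ideas in this section, picking the most probable candidate and measuring recall at k over the candidate lists, can be sketched briefly. The probabilities, the placeholder entity names and the function names are hypothetical.

```python
def pick_entity(candidate_probs):
    """Choose the candidate with the highest predicted probability."""
    return max(candidate_probs, key=candidate_probs.get)

def recall_at_k(ranked_lists, gold, k):
    """Share of mentions whose gold entity is among the first k candidates."""
    hits = sum(1 for cands, g in zip(ranked_lists, gold) if g in cands[:k])
    return hits / len(gold)

# Hypothetical probabilities from the disambiguation model for one mention.
probs = {"Talking Heads": 0.91, "Talking Heads (play)": 0.12, "Pundit": 0.03}
chosen = pick_entity(probs)

# Hypothetical ranked candidate lists for three mentions and their gold entities.
ranked = [["A", "B", "C"], ["D", "E", "F"], ["G", "H", "I"]]
gold = ["B", "F", "Z"]               # "Z" is never generated: a bottleneck miss
r_at_2 = recall_at_k(ranked, gold, k=2)
r_at_3 = recall_at_k(ranked, gold, k=3)
```

The third mention illustrates the bottleneck: because its gold entity never enters the candidate list, no value of k and no disambiguation model can recover it.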
4.4.4. Analysis and Discussion
As I found, the greatest advantages in my models came from careful consideration of the
context embeddings and the tradeoff between precision and recall. With more advanced
models like the ELMo model, I can expect higher precision at the cost of recall, whereas with
the BiLSTM model and CNN-BiLSTM model I can expect higher recall at the cost
of precision. On testing with the CNN-BiLSTM model, I found a slight recall and
ROC-AUC boost compared to the BiLSTM model. The ELMo embeddings often perform
better than the Context2Vec embeddings with less data.
Figure 4.6.: CNN-BiLSTM - KDWD Figure 4.7.: CNN-BiLSTM - CoNLL03/AIDA
With a sample size of 2.8M training and validation instances, the BiLSTM model
and CNN-BiLSTM model show stronger performance. The BiLSTM model increases
approximately 8-10% from the smaller sample size tier. CNN-BiLSTM model and ELMo
model begin to converge in their performance levels with more data. However, when
working with a smaller dataset and an incomplete knowledge base, it stands to reason
that ELMo embeddings would be a more robust model in this setting. Comparing
Figure 4.6 and Figure 4.7, I observe how the ROC curve is far more inconsistent with
CoNLL03/AIDA. I get an average of 26-30% false positives with CoNLL03/AIDA and
Context2Vec.
Figure 4.8.: ELMo - KDWD Figure 4.9.: ELMo - CoNLL03/AIDA
Nonetheless, Figure 4.8 and Figure 4.9 show two smooth curves in both datasets
when working with ELMo embeddings and the smallest sample sizes. In general, when
relying too heavily on Wikipedia as the fundamental knowledge base, I will have a
harder time generalizing my application to new data that does not reflect this structure
well (Hachey et al., 2013). In my work, the models perform exceedingly well with
KDWD and fairly with CoNLL03/AIDA; more datasets should be analyzed for a deeper
understanding of how effective the performance is across different texts and formats.
I must consider how much the model overfits or underfits the data, and monitoring
loss can be one clear indication. There is a 0.2-0.3 difference between training and validation
loss in CoNLL03/AIDA, whereas the difference in KDWD is an order of magnitude smaller
at 0.02-0.03. This illustrates how my models are overfitting to the CoNLL03/AIDA
dataset, but I believe this to be a healthy amount. Additional dropout layers and batch
normalization, along with more data, were used to address these issues where I could
apply them.
Target P(Target|Anchor) P(Anchor|Target)
Talking Heads 0.972 0.973
Talking Heads (series) 0.031 0.869
Talking Heads (Australian TV series) 0.021 1.000
Talking Heads (play) 0.015 1.000
Pundit 0.004 0.007
Table 4.7.: Anchor Link of "talking heads"
In Table 4.7, the top 5 likely target candidates are from the anchor link of "talking
heads". However, when I retrieve candidates for this same entry, I will get results such
as "the walking seeds", "walking through fire", "headshaking", and "talking horse" from
the LSH algorithm. Penalties, such as those that observe text dissimilarity wherein the
difference is given weight, could filter out superfluous results during the disambiguation
stage.
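One possible form of such a penalty is sketched below, purely as an illustration of the idea and not as something implemented in this thesis. Each candidate's probability is scaled down in proportion to its string dissimilarity from the mention, here measured with `difflib.SequenceMatcher` from the Python standard library; the probabilities, the `weight` parameter and the function name `penalized` are hypothetical.

```python
from difflib import SequenceMatcher

def penalized(candidates, mention, weight=0.5):
    """Scale each candidate probability down by its string dissimilarity."""
    rescored = {}
    for name, prob in candidates.items():
        similarity = SequenceMatcher(None, mention.lower(), name.lower()).ratio()
        rescored[name] = prob * (1.0 - weight * (1.0 - similarity))
    return rescored

# Hypothetical model probabilities for candidates named in the text above.
candidates = {"Talking Heads": 0.60, "talking horse": 0.58, "headshaking": 0.55}
rescored = penalized(candidates, "talking heads")
best = max(rescored, key=rescored.get)
```

An exact match keeps its probability unchanged, while near-miss strings returned by the LSH algorithm lose part of their score, which widens the margin between the intended entity and superfluous candidates.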
Furthermore, I consider runtime. I ran my models on a Tesla K80 or P100 GPU. ELMo
embeddings took approximately twice the amount of time to run as the other deep
learning models. When working with the 2.8M samples from KDWD, my runs
would finish in 100-110 min with Context2Vec and 160-170 min with ELMo. For the
final deployment of the entire named-entity linking system, I used the CNN-BiLSTM
model with 2.8M samples for its modest improvement in the response time of query
retrieval.
5. Conclusion and Future Work
In my work and research, I constructed a complete named-entity linking system that
addresses many of the research questions that I originally posited. I manage name
variations in the case of misspelling, so if we have "George Bosh" rather than "George Bush",
the system will conclude that these two entities are the same. Name variations are also
managed in the case of aliases, where it can understand that "Bush Jr" and "George
Bush" can be the same person given the proper context. However, abbreviations remain
a challenge and usually rely on the primary representation of identity.
The ambiguity challenge is largely covered by my work, where often the right
context can give way to the correct entity. In the case of coreference resolution, there
remains room for improvement as a downstream task. Nonetheless, ambiguity is the
most evident challenge solved. In the case of incomplete information, my system
can determine context for those candidates within this snapshot of the dataset. The
bottleneck problem of the candidate lists remains a problem that limits the success of
all of the challenges I addressed.
In order to begin construction, I first had to recognize the named-entities. I then
performed candidate generation by combining anchor link frequency with the LSH
algorithm, where a candidate group of potential entities was created for each identified
entity. Afterwards, my named-entity disambiguation component was applied to
select the most probable candidate. The selection in this final step was
treated as a classification task.
I managed to successfully incorporate two types of context embeddings (Context2Vec,
ELMo) to concatenate with knowledge graph embeddings (TransE), a fresh approach
to named-entity linking that performs well with both KDWD and CoNLL03/AIDA.
The focus of evaluation was mainly on the disambiguation component of
the named-entity linking system. Running various models, hyperparameters, sample
sizes, and embeddings, I was able to achieve a final F1 of 79% when applied to
the CoNLL03/AIDA benchmark.
For future work, there are a few different aspects of my research that should be
expanded on. The first one would be improving the other components of my system.
Joint modeling for named-entity recognition, candidate generation, and named-entity
disambiguation has been done heuristically and through neural networks (Broscheit,
2019).
My named-entity recognition system is a vanilla model, but using an LSTM with
conditional random fields (CRF) could likely improve it. CRFs predict
sequences by using contextual information to supplement the information
used by the model to make a correct prediction. Given their strength with contextual
information, it would make compelling work to develop a CRF for the disambiguation
component as well, yet this usage lacks proper research and implementation.
The candidate generation component could be improved by experimenting with
more graph algorithms for greater semantic understanding of text. I could have compared
Jaccard similarity with cosine similarity, yet previous work suggested that Jaccard
similarity would perform best with the LSH algorithm (E. Zhu et al., 2016). In addition,
the candidate generation has bottleneck limitations that could be improved with more
careful curation at the feature engineering stage instead of changing the similarity
metric.
Finally, Luan et al. (2018) assert that multi-task identification of entities, relations,
and coreference resolution outperforms other models, such as mine, that focus
purely on entities. There is a lot of unused relational data in my knowledge graph
embeddings that could be incorporated into deep learning, as was shown in Figure
2.11. I primarily use link prediction and similarity metrics, but graph algorithms like
PageRank and Louvain modularity have been leveraged with deep learning models
too (Cao et al., 2018).
A. Named-Entity Linking Examples
Figure A.1.: Named-Entity Linking Example 1
Figure A.2.: Named-Entity Linking Example 2
Bibliography
Adamic, Lada A. and Eytan Adar (2003). “Friends and neighbors on the Web”. Soc. Networks 25, pp. 211–230.
Auer, Sören, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and
Zachary Ives (2007). “DBpedia: A Nucleus for a Web of Open Data”. In: vol. 6. Jan.
2007, pp. 722–735. doi: 10.1007/978-3-540-76298-0_52.
Battaglia, Peter, Jessica Blake Chandler Hamrick, Victor Bapst, Alvaro Sanchez, Vinicius
Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro,
Ryan Faulkner, Caglar Gulcehre, Francis Song, Andy Ballard, Justin Gilmer, George
E. Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Jayne Langston, Chris
Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matt Botvinick, Oriol Vinyals,
Yujia Li, and Razvan Pascanu (2018). “Relational inductive biases, deep learning, and
graph networks”. arXiv. url: https://arxiv.org/pdf/1806.01261.pdf.
Bordes, Antoine, Nicolas Usunier, Alberto Garcia-Durán, Jason Weston, and Oksana
Yakhnenko (2013). “Translating Embeddings for Modeling Multi-Relational Data”.
In: Proceedings of the 26th International Conference on Neural Information Processing
Systems - Volume 2. NIPS’13. Lake Tahoe, Nevada: Curran Associates Inc., pp. 2787–2795.
Broscheit, Samuel (2019). “Investigating Entity Knowledge in BERT with Simple Neural
End-To-End Entity Linking”. In: Proceedings of the 23rd Conference on Computational
Natural Language Learning (CoNLL). Hong Kong, China: Association for
Computational Linguistics, Nov. 2019, pp. 677–685. doi: 10.18653/v1/K19-1063. url:
https://www.aclweb.org/anthology/K19-1063.
Cao, Jinxin, Di Jin, Liang Yang, and Jianwu Dang (2018). “Incorporating network
structure with node contents for community detection on large networks using
deep learning”. Neurocomputing 297 (Feb. 2018). doi: 10.1016/j.neucom.2018.01.065.
Dong, Wei, Moses Charikar, and Kai Li (2011). “Efficient K-nearest neighbor graph
construction for generic similarity measures”. In: Jan. 2011, pp. 577–586.
doi: 10.1145/1963405.1963487.
Franzoni, Valentina, Michele Lepri, and Alfredo Milani (2019). “Topological and Semantic
Graph-based Author Disambiguation on DBLP Data in Neo4j”. CoRR abs/1901.08977.
arXiv: 1901.08977. url: http://arxiv.org/abs/1901.08977.
Gionis, Aristides, Piotr Indyk, and Rajeev Motwani (1999). “Similarity Search in High
Dimensions via Hashing”. In: Proceedings of the 25th International Conference on Very
Large Data Bases. VLDB ’99. San Francisco, CA, USA: Morgan Kaufmann Publishers
Inc., pp. 518–529. isbn: 1558606157.
Goldberg, Yoav and Omer Levy (2014). “word2vec Explained: deriving Mikolov et al.’s
negative-sampling word-embedding method”. CoRR abs/1402.3722. arXiv: 1402.3722.
url: http://arxiv.org/abs/1402.3722.
Grover, Aditya and Jure Leskovec (2016). “node2vec: Scalable Feature Learning for
Networks”. CoRR abs/1607.00653. arXiv: 1607.00653.
url: http://arxiv.org/abs/1607.00653.
Hachey, Ben, Will Radford, Joel Nothman, Matthew Honnibal, and James Curran (2013).
“Evaluating Entity Linking with Wikipedia”. Artificial Intelligence 194 (Jan. 2013),
pp. 130–150. doi: 10.1016/j.artint.2012.04.005.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long Short-term Memory”. Neural
computation 9 (Dec. 1997), pp. 1735–80. doi: 10.1162/neco.1997.9.8.1735.
Hoffart, Johannes, Mohamed Amir Yosef, Ilaria Bordino, Hagen Fürstenau, Manfred
Pinkal, Marc Spaniol, Bilyana Taneva, Stefan Thater, and Gerhard Weikum (2011).
“Robust Disambiguation of Named Entities in Text”. In: Proceedings of the 2011
Conference on Empirical Methods in Natural Language Processing. Edinburgh, Scotland,
UK.: Association for Computational Linguistics, July 2011, pp. 782–792. url:
https://www.aclweb.org/anthology/D11-1072.
Ji, Guoliang, Shizhu He, Liheng Xu, Kang Liu, and Jun Zhao (2015). “Knowledge
Graph Embedding via Dynamic Mapping Matrix”. In: Proceedings of the 53rd Annual
Meeting of the Association for Computational Linguistics and the 7th International
Joint Conference on Natural Language Processing (Volume 1: Long Papers). Beijing,
China: Association for Computational Linguistics, July 2015, pp. 687–696. doi:
10.3115/v1/P15-1067. url: https://www.aclweb.org/anthology/P15-1067.
Koncel-Kedziorski, Rik, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh
Hajishirzi (2019). “Text Generation from Knowledge Graphs with Graph Transformers”.
CoRR abs/1904.02342. arXiv: 1904.02342. url: http://arxiv.org/abs/1904.02342.
Kulkarni, Sayali, Ganesh Ramakrishnan, and Soumen Chakrabarti (2009). “Collective
annotation of Wikipedia entities in web text”. In: Jan. 2009, pp. 457–466.
doi: 10.1145/1557019.1557073.
Le, Phong and Ivan Titov (2018). “Improving Entity Linking by Modeling Latent
Relations between Mentions”. In: Proceedings of the 56th Annual Meeting of the
Association for Computational Linguistics (Volume 1: Long Papers). Melbourne,
Australia: Association for Computational Linguistics, July 2018, pp. 1595–1604. doi:
10.18653/v1/P18-1148. url: https://www.aclweb.org/anthology/P18-1148.
Logan, Robert, Nelson F. Liu, Matthew E. Peters, Matt Gardner, and Sameer Singh
(2019). “Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language
Modeling”. In: Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. Florence, Italy: Association for Computational Linguistics,
July 2019, pp. 5962–5971. doi: 10.18653/v1/P19-1598.
url: https://www.aclweb.org/anthology/P19-1598.
Lu, Hao, Mahantesh Halappanavar, and Ananth Kalyanaraman (2014). “Parallel
Heuristics for Scalable Community Detection”. Parallel Computing 486 (Oct. 2014). doi:
10.1016/j.parco.2015.03.003.
Luan, Yi, Luheng He, Mari Ostendorf, and Hannaneh Hajishirzi (2018). “Multi-Task
Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph
Construction”. In: Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing. Brussels, Belgium: Association for Computational Linguistics,
Oct. 2018, pp. 3219–3232. doi: 10.18653/v1/D18-1360.
url: https://www.aclweb.org/anthology/D18-1360.
Melamud, Oren, Jacob Goldberger, and Ido Dagan (2016). “context2vec: Learning
Generic Context Embedding with Bidirectional LSTM”. In: Proceedings of The 20th
SIGNLL Conference on Computational Natural Language Learning. Berlin, Germany:
Association for Computational Linguistics, Aug. 2016, pp. 51–61.
doi: 10.18653/v1/K16-1006. url: https://www.aclweb.org/anthology/K16-1006.
Nickel, Maximilian, Volker Tresp, and Hans-Peter Kriegel (2011). “A Three-Way Model
for Collective Learning on Multi-Relational Data.” In: Jan. 2011, pp. 809–816.
Page, Larry, Sergey Brin, R. Motwani, and T. Winograd (1998). The PageRank Citation
Ranking: Bringing Order to the Web.
Parravicini, Alberto, Rhicheek Patra, Davide B. Bartolini, and Marco D. Santambrogio
(2019). “Fast and Accurate Entity Linking via Graph Embedding”. In: Proceedings
of the 2nd Joint International Workshop on Graph Data Management Experiences &
Systems (GRADES) and Network Data Analytics (NDA). GRADES-NDA’19. Amsterdam,
Netherlands: Association for Computing Machinery. isbn: 9781450367899. doi:
10.1145/3327964.3328499. url: https://doi.org/10.1145/3327964.3328499.
Perozzi, Bryan, Rami Al-Rfou, and Steven Skiena (2014). “DeepWalk: Online Learning
of Social Representations”. CoRR abs/1403.6652. arXiv: 1403.6652.
url: http://arxiv.org/abs/1403.6652.
Peters, Matthew E., Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark,
Kenton Lee, and Luke Zettlemoyer (2018). “Deep contextualized word
representations”. In: Proc. of NAACL.
Raiman, Jonathan and Olivier Raiman (2018). “DeepType: Multilingual Entity Linking
by Neural Type System Evolution”. In: AAAI.
Reddy, Sathish, Dinesh Raghu, Mitesh M. Khapra, and Sachindra Joshi (2017).
“Generating Natural Language Question-Answer Pairs from a Knowledge Graph Using
a RNN Based Question Generation Model”. In: Proceedings of the 15th Conference
of the European Chapter of the Association for Computational Linguistics: Volume 1,
Long Papers. Valencia, Spain: Association for Computational Linguistics, Apr. 2017,
pp. 376–385. url: https://www.aclweb.org/anthology/E17-1036.
Serrà, Joan and Alexandros Karatzoglou (2017). “Getting deep recommenders fit: Bloom
embeddings for sparse binary input/output networks”. CoRR abs/1706.03993. arXiv:
1706.03993. url: http://arxiv.org/abs/1706.03993.
Shchur, Oleksandr and Stephan Günnemann (2019). “Overlapping Community
Detection with Graph Neural Networks”. arXiv: 1909.12201 [cs.LG].
Sil, Avirup, Gourab Kundu, Radu Florian, and Wael Hamza (2018). “Neural
Cross-Lingual Entity Linking”. In: AAAI.
Sun, Chi, Xipeng Qiu, Yige Xu, and Xuanjing Huang (2019). “How to Fine-Tune BERT
for Text Classification?” CoRR abs/1905.05583. arXiv: 1905.05583.
url: http://arxiv.org/abs/1905.05583.
Tjong Kim Sang, Erik F. and Fien De Meulder (2003). “Introduction to the CoNLL-2003
Shared Task: Language-Independent Named Entity Recognition”. In: Proceedings of
the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147.
url: https://www.aclweb.org/anthology/W03-0419.
Wang, Wentao, Lintao Wu, Ye Huang, Hao Wang, and Rongbo Zhu (2019). “Link
Prediction Based on Deep Convolutional Neural Network”. Information 10 (May
2019), p. 172. doi: 10.3390/info10050172.
Wang, Xiaozhi, Tianyu Gao, Zhaocheng Zhu, Zhiyuan Liu, Juanzi Li, and Jian Tang
(2019). KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language
Representation. arXiv: 1911.06136 [cs.CL].
Yang, Bishan, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng (2014). “Embedding
Entities and Relations for Learning and Inference in Knowledge Bases” (Dec. 2014).
Yenter, Alec and Abhishek Verma (2017). “Deep CNN-LSTM with combined kernels
from multiple branches for IMDb review sentiment analysis”. In: Oct. 2017,
pp. 540–546. doi: 10.1109/UEMCON.2017.8249013.
Zachary, Wayne W. (1977). “An information flow model for conflict and fission in small
groups”. Journal of anthropological research, pp. 452–473.
Zhou, Chunting, Chonglin Sun, Zhiyuan Liu, and Francis C. M. Lau (2015). “A C-LSTM
Neural Network for Text Classification”. CoRR abs/1511.08630. arXiv: 1511.08630.
url: http://arxiv.org/abs/1511.08630.
Zhu, Erkang, Fatemeh Nargesian, Ken Q. Pu, and Renée J. Miller (2016). “LSH Ensemble:
Internet-Scale Domain Search”. Proc. VLDB Endow. 9.12 (Aug. 2016), pp. 1185–1196.
issn: 2150-8097. doi: 10.14778/2994509.2994534.
url: https://doi.org/10.14778/2994509.2994534.
Zhu, Zhaocheng, Shizhen Xu, Meng Qu, and Jian Tang (2019). “GraphVite: A High-
Performance CPU-GPU Hybrid System for Node Embedding”. CoRR abs/1903.00757.
arXiv: 1903.00757. url: http://arxiv.org/abs/1903.00757.