Faculty of Arts and Philosophy
Valerie Dumortier
How can a thorough linguistic analysis
contribute to automatic sentiment
analysis in Twitter?
Master's dissertation submitted in fulfilment of the requirements for the degree of
Master in Multilingual Communication
2014
Supervisor: Ms Orphee De Clercq
Department of Translation, Interpreting and Communication
PREFACE
I would like to use this opportunity to thank everyone who made it possible to write this
dissertation.
First, I would like to thank my supervisor, Orphee De Clercq, for providing apt and
constructive feedback and for her constant support.
Secondly, I would like to thank my parents for giving me the opportunity to start and finish
these studies, for their constant support and for never ceasing to believe in me. I would also
like to thank my brother, Jasper Dumortier, for the relaxing and entertaining brother-sister
moments during my studies. My boyfriend, Sander De Loof, has supported me through thick
and thin during these four years and I would like to thank him for his support and his patience
with me while I was writing this dissertation. The rest of my family also deserves a big thank you.
Finally, I would like to thank all my friends who helped make my life as a student what it is
now. Thanks to them I have all these amazing memories, which I will always carry in my
heart. There is one person I would like to thank in particular: Joy Gillespie. She was an
important source of support while I was writing this dissertation and at many other moments
during these studies.
Table of contents
LIST OF TABLES AND IMAGES ........................................................................................... 7
INTRODUCTION ...................................................................................................................... 9
PART I: LITERATURE STUDY ............................................................................................ 11
1. SENTIMENT ANALYSIS ............................................................................................... 11
1.1. Basic concepts of sentiment analysis ......................................................................... 11
1.2. The process of sentiment analysis ............................................................................. 12
1.2.1. Subjectivity classification .................................................................................. 12
1.2.2. Sentiment classification ...................................................................................... 13
2. SENTIMENT ANALYSIS IN TWITTER ....................................................................... 16
2.1. Twitter ....................................................................................................................... 16
2.2. Shallow machine learning approaches to perform SA in Twitter data ...................... 17
2.3. Approaches incorporating more linguistic knowledge .............................................. 22
2.3.1. Sarcasm .............................................................................................................. 22
2.3.2. Negation ............................................................................................................. 23
PART II: CORPUS ANALYSIS .............................................................................................. 25
1. METHODOLOGY ........................................................................................................... 25
1.1. Data collection ........................................................................................................... 25
1.2. Data annotation .......................................................................................................... 29
1.3. LT3-tool ..................................................................................................................... 30
2. RESULTS ......................................................................................................................... 31
2.1. Evaluation .................................................................................................................. 31
2.2. Quantitative analysis .................................................................................................. 33
2.3. Qualitative analysis .................................................................................................... 36
2.3.1. Type 1: negative tweets labelled as positive ...................................................... 38
2.3.2. Type 2: negative tweets labelled as neutral ........................................................ 41
2.3.3. Type 3: positive tweets labelled as negative ...................................................... 44
2.3.4. Type 4: positive tweets labelled as neutral ......................................................... 46
2.3.5. Type 5: neutral tweets labelled as negative ........................................................ 47
2.3.6. Type 6: neutral tweets labelled as positive ......................................................... 48
2.4. Interpretation of the error analysis ............................................................................. 50
2.5. Recommendations ..................................................................................................... 51
CONCLUSION ........................................................................................................................ 53
BIBLIOGRAPHY .................................................................................................................... 55
APPENDIX I: MATRICES, ACCURACY, RECALL AND PRECISION ............................ 59
APPENDIX II: EXACT DATA OF ERROR ANALYSIS ...................................................... 62
LIST OF TABLES AND IMAGES
Table 1 – Short summary of the shallow machine learning approaches .................................. 20
Table 2 - Number of tweets and average tweet lengths of both corpora .................................. 28
Table 3 - Overview of the number of negative, positive and neutral tweets (labelled by the
human annotator) in both corpora ............................................................................................ 30
Table 4 - Overview of the different types of errors .................................................................. 32
Table 5 – Confusion matrix of the British Airways and Ryanair corpus combined ................ 33
Table 6 - Overview of the accuracy, precision and recall numbers of the British Airways and
Ryanair corpus combined ......................................................................................................... 33
Table 7 – Confusion matrix of the British Airways corpus ..................................................... 34
Table 8 – Confusion matrix of the Ryanair corpus .................................................................. 35
Table 9 - Overview of the accuracy, precision and recall numbers of the British Airways
corpus ....................................................................................................................................... 35
Table 10 - Overview of the accuracy, precision and recall numbers of the Ryanair corpus .... 35
Table 11 - Overview of the different error categories .............................................................. 37
Table 12 - Overview of the number of major errors in each type ............................................ 37
Table 13 - Overview of the number of minor errors in each type ............................................ 37
Table 14 - Pie chart of the shares of the different types of errors ............................................ 38
Table 15 – Error analysis of the British Airways and Ryanair corpus combined .................... 62
Table 16 – Error analysis of the British Airways corpus ......................................................... 62
Table 17 – Error analysis of the Ryanair corpus ...................................................................... 62
Image 1 - Example of the Twitter API with the search term "British Airways" ...................... 27
Image 2 – Possibilities to specify a search on Topsy according to time/date, kind of message
and language respectively ........................................................................................................ 27
Image 3 - Example of result page on Topsy with the search term “British Airways” and with
the specifications as in Image 2 ................................................................................................ 28
INTRODUCTION
Twitter is an important social media platform for companies. Since nowadays people are
constantly online and share their opinions on almost everything, Twitter offers companies the
opportunity to understand their consumers better. Consequently, companies use Twitter as an
alternative medium for their customer service1. As many consumers post reviews on Twitter,
sentiment analysis can be an interesting technique for companies to gauge the overall reception
of their products, services, etc. Sentiment analysis is a technique that is used to label a
document, sentence, word, etc. according to its sentiment, which is either neutral, negative or
positive. Still, because of their typical linguistic characteristics, tweets seem to be difficult
for ordinary sentiment analysis tools to label correctly.
This dissertation investigates sentiment analysis in Twitter. As tweets are typically very short
messages with a maximum of 140 characters, the task of performing sentiment analysis on
tweets is fairly different from performing it on longer texts such as reviews on Amazon. Since
tweets have a number of typical linguistic characteristics, these have to be taken into account.
Examples are the use of slang words or abbreviations which are not universally known and
the lack of linking words. Moreover, sometimes more background knowledge is required in
order to completely grasp the message.
The Language and Translation Technology Team of Ghent University has developed a
standard sentiment analysis tool (further referred to as LT3-tool) for tweets. This dissertation
examines the performance of this tool and aims to find out how it could be improved
based on a thorough linguistic analysis. For this purpose, we compiled a corpus of tweets
about two airline companies. We chose to explicitly search for tweets mentioning either
British Airways or Ryanair, as these two airline companies have a good and a bad reputation
respectively, which influences the sort of tweets that are sent. The corpus consists of tweets that
would most likely be labelled incorrectly by a sentiment analysis tool. The tweets of the
corpus are labelled by both a human annotator and the LT3-tool. This creates the opportunity
to investigate which aspects of tweets are difficult for automatic processing.
In this dissertation we will try to answer the following questions:
- How well does a machine sentiment analysis tool (the LT3-tool) perform at labelling
linguistically challenging tweets compared to humans?
1 KLM is a good example of a company that communicates mainly through Twitter with their consumers as
customer service (https://twitter.com/KLM, last accessed: May 25th 2014).
- What is or what are the most common mistake(s) a machine sentiment analysis tool
(the LT3-tool) makes when labelling linguistically challenging tweets?
- How could a machine sentiment analysis tool (the LT3-tool) be improved in order to
label linguistically challenging tweets better?
The work presented in this dissertation consists of two major parts: an extensive literature
overview and a corpus analysis.
After introducing the topic, we first present a literature study (Part 1) with a general overview
of the task of sentiment analysis on standard text (Section 1). We then continue by describing
the current state of the art of sentiment analysis in Twitter and zoom in on the challenges that
arise when trying to perform sentiment analysis in Twitter (Section 2).
In the second part of this dissertation, the corpus analysis (Part 2), we first introduce the
methodology that was used: we describe how the data was collected (1.1), annotated (1.2) and
automatically labelled (1.3). Then we describe the results by first giving some more
information about how the actual evaluation was performed (2.1), after which we extensively
discuss the results in both a quantitative (2.2) and qualitative (2.3) manner. We finish this part
with an analysis on the meta-level and by presenting some recommendations for improving
the LT3-tool (2.4 and 2.5).
To conclude, we repeat the main findings of this dissertation and present some prospects for
future work.
PART I: LITERATURE STUDY
1. SENTIMENT ANALYSIS
In this first part of the literature study, sentiment analysis in general is discussed. We start by
defining some basic concepts (1.1) and explaining the standard process for performing
sentiment analysis (1.2).
1.1. Basic concepts of sentiment analysis
In general, textual information can be divided into two types: facts and opinions. Liu (2010,
p1) defines the former as “objective expressions about entities, events and their properties”
and the latter as “subjective expressions that describe people’s sentiments, appraisals or
feelings toward entities, events and their properties”.
Sentiment analysis, or opinion mining, is a technique which originated from the need to
organize and structure large amounts of opinionated data. Liu (2010, p3) defines sentiment
analysis as “the computational study of opinions, sentiments and emotions expressed in text
[…] The opinion can be expressed on anything, e.g., a product, a service, an individual, an
organization, an event, or a topic.” Nasukawa & Yi (2003, p71) define sentiment analysis as a
technique “to identify how sentiments are expressed in texts and whether the expressions
indicate positive (favorable) or negative (unfavorable) opinions toward the subject”. Esuli &
Sebastiani (2006, p193) define opinion mining or sentiment analysis as “a recent subdiscipline
of computational linguistics which is concerned not with the topic a document is about, but
with the opinion it expresses”.
We can perceive subjectivity as one of the key elements of sentiment analysis or opinion
mining because a subjective sentence expresses an opinion most of the time. All sentences
have a certain level of subjectivity. A sentence with a low level of subjectivity and in which
an opinion is expressed implicitly is called an objective sentence, which means that this
sentence only shares factual information. However, when the level of subjectivity is high and
when an opinion is expressed explicitly, it is a subjective sentence, because personal feelings
or beliefs are also shared. To clarify this, let us compare the objective sentence “The sun
is shining.” to the subjective sentence “I love it when the sun is shining”. The former sentence
simply states how the weather is, i.e. it is a fact that the sun is shining, whereas the latter
sentence clearly utters a certain feeling of the speaker towards the fact that the sun is shining.
Within the study of sentiment analysis the level of subjectivity can be determined for both
sentences and documents. Mostly, a document is defined as opinionated or subjective when it
contains various opinionated sentences. However, caution is necessary since one specific
report can contain various opinions. Also, if it contains many quoted sentences, the opinions
expressed may not necessarily be the author's own.
When discussing sentiment analysis it is important to define what sentiment is exactly or at
least how we understand it throughout this dissertation. Sentiment can be defined as “an
attitude or opinion”, “feelings of love, sympathy, kindness, etc.” or “an attitude, thought, or
judgement prompted by feeling”2. Sentiment has a polarity or orientation, which is also
sometimes called polarity of opinion or semantic orientation (Liu, 2010). Some only
distinguish two types of polarity, negative and positive (Pang & Lee, 2004; Hu & Liu, 2004),
while others also consider neutral as a sentiment orientation (Liu, 2010; Kim & Hovy, 2004).
For our study we consider sentiment as a positive, negative or neutral emotion towards a
certain entity3.
1.2. The process of sentiment analysis
Sentiment analysis is a process that consists of two stages: subjectivity classification and
sentiment classification. The second stage of sentiment analysis, sentiment classification, can
also occur on different levels: document, sentence, word and feature level. (Liu, 2010; Kumar
& Sebastian, 2012)
1.2.1. Subjectivity classification
As mentioned above, the first step in sentiment analysis is subjectivity classification. In this
task, the system categorizes a text or sentence as either objective or subjective (also called
opinionated versus non-opinionated). Objective or non-opinionated texts or sentences render a
2 http://www.merriam-webster.com/dictionary/sentiment (last accessed: May 20th 2014)
3 In this dissertation, the entity is airline companies (see Section 1.1 of the corpus analysis).
fact while subjective or opinionated texts or sentences express certain feelings of the author
towards something. (Liu, 2010; Kumar & Sebastian, 2012)
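The subjectivity classification step described above can be illustrated with a minimal rule-based sketch. The cue-word list and the function below are illustrative assumptions for this dissertation only, not a system from the literature; real classifiers learn such cues from annotated data.

```python
# Minimal rule-based subjectivity classifier: a sentence is marked
# "subjective" if it contains at least one cue word that typically
# signals personal feelings; otherwise it is marked "objective".
# The cue list below is a toy example for illustration only.
SUBJECTIVITY_CUES = {"love", "hate", "think", "feel", "awful",
                     "great", "terrible", "amazing", "believe"}

def classify_subjectivity(sentence):
    # Strip simple punctuation and lowercase each token before lookup.
    tokens = {t.strip(".,!?").lower() for t in sentence.split()}
    return "subjective" if tokens & SUBJECTIVITY_CUES else "objective"

print(classify_subjectivity("The sun is shining."))                # objective
print(classify_subjectivity("I love it when the sun is shining"))  # subjective
```

Note that the two example sentences are the ones used in Section 1.1 to contrast factual and opinionated text: the input/output behaviour is the same as in a learned classifier, a sentence goes in and an objective/subjective label comes out.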
1.2.2. Sentiment classification
Once a sequence of text has been labelled as subjective or opinionated, the polarity of the text
or sentence needs to be determined. Sentiment classification is the process of determining
whether a document expresses a negative or positive opinion (sometimes neutral is also
considered as a sentiment). In order to do this, Liu (2010) states that it is necessary to have
information on five different aspects of the object. First, it is necessary to know the object on
which the opinion is expressed. Secondly, it is important to know the feature(s) about which
opinions are being expressed. One single sentence can comment on different features while
several sentences can express an opinion on one sole feature. Thirdly, the opinion itself has to
be known. Liu (2010, p4) defines an opinion on a feature as “a positive or negative view,
attitude or emotion or appraisal […] from an opinion holder”. A fourth aspect of the object is
the opinion holder, the person who expresses the opinion. Finally the system needs
information on when the opinion was expressed.
Sentiment classification can be done on different levels: document level, sentence level, word
level and feature level. Below, the four levels are briefly described and explained.
Sentiment classification on document level
As the name explains, this task determines the polarity of a whole document. It is then
assumed that the whole text holds one sole opinion and is therefore written by only one
author. A difficulty on this level of sentiment analysis is the expression of multiple opinions
in one text. (Kumar & Sebastian, 2012)
According to Liu (2010, p10) sentiment classification on the document level can, on the one
hand, be seen as “a supervised learning problem with two class labels (positive and
negative)”. Liu (2010) argues that domain adaptation is important in document-level
sentiment classification, as opinion words may express different sentiments depending on the
domain in which they are used. On the other hand, he also considers unsupervised learning as
a possible approach in which opinion words and phrases are of major importance (also when
this is done on document level).
Sentiment classification on sentence level
Sentiment classification on the sentence level is the task of determining the polarity of a
single sentence. One sentence may contain different opinions. These sentences are mostly
compound sentences, which are not suitable for sentiment classification on the sentence level.
Therefore, determining the strength of opinions is also important as it can be necessary to
determine which opinion has more weight. (Liu, 2010; Kumar & Sebastian, 2012)
Sentiment classification on word level
In word-level sentiment classification, the sentiment of sole words, often called opinion
words, is identified. According to Liu (2010), opinion words are often used in sentiment
classification tasks. Opinion words can be subdivided according to two dimensions: positive
versus negative and base type versus comparative type. Positive opinion words express
desired states whereas negative opinion words express undesired states.
Base type opinion words are the words that express an opinion on their own. The opinion in
these words is straightforward: they are either negative (e.g.: bad, horrible, etc.) or positive
(e.g.: good, marvellous, etc.). Comparative type opinion words are words that are used to
express an opinion using comparative and superlative words. They thus do not express an
opinion on one sole object, but on multiple objects. Although the comparative word can be a
negative word, this does not necessarily mean that the opinion on both compared objects is
negative.
An opinion lexicon is a list of opinion words, opinion phrases and idioms. Liu (2010)
distinguishes three approaches to creating opinion word lists, also known as lexicons: manual,
dictionary-based or corpus-based. In the manual approach, opinion word lists are collected by
manually labelling each word with its corresponding sentiment. This technique is often combined
with one of the two automated techniques. In the dictionary-based approach, first opinion
words are collected of which the opinion orientation is commonly known. They are used as
seed words to expand the lexicon by adding synonyms and antonyms. This iterative process
stops when no more new words are found. Usually the resulting lexicon is manually validated.
One of the main disadvantages of this approach, however, is that it is not able to find domain-
specific opinion words. A certain word can for example be positive in one domain, but
negative in another. The word “budget”, for example, does not always have a positive
connotation in the airline domain as it is often linked with poor service and old airplanes (e.g.:
In fact, British Airlines. This queue and service is reminiscent of a budget airline in holiday
season. Yet you’re not quite budget, are you...4). In other domains, “budget” has a positive
connotation. According to Liu (2010, p15) the corpus-based approach relies on “syntactic or
co-occurrence patterns and also [on] a seed list of opinion words to find other opinion words
in a large corpus”. Other words which are positioned near the opinion word, such as conjoined
adjectives, are therefore also considered for inclusion in the lexicon. Adjectives conjoined by
‘and’ generally have the same opinion orientation (e.g.: […] Yes, I hate myself too, but its
cheap and direct, and the times are good5) while adjectives conjoined by ‘but’ generally have
different orientations (e.g.: good but not much). The advantage of the corpus-based approach -
when compared to the dictionary-based approach - is that it is capable of finding domain-
specific opinion words, together with their orientation. (Liu, 2010; Kumar & Sebastian, 2012)
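The dictionary-based expansion Liu (2010) describes — seed words propagating their polarity through synonyms and reversing it through antonyms until no new words are found — can be sketched as follows. The tiny thesaurus is a hypothetical stand-in for a real dictionary such as WordNet, and the example words are taken from the base type examples above.

```python
# Sketch of dictionary-based lexicon expansion: seed words of known
# polarity are grown iteratively through synonyms (same polarity) and
# antonyms (opposite polarity) until a fixed point is reached.
# This tiny thesaurus is an illustrative stand-in for a real dictionary.
SYNONYMS = {"good": ["great"], "great": ["marvellous"],
            "bad": ["horrible"], "horrible": ["dreadful"]}
ANTONYMS = {"good": ["bad"], "bad": ["good"]}

def expand_lexicon(seeds):
    lexicon = dict(seeds)            # word -> +1 (positive) / -1 (negative)
    changed = True
    while changed:                   # stop when no more new words are found
        changed = False
        for word, polarity in list(lexicon.items()):
            for syn in SYNONYMS.get(word, []):
                if syn not in lexicon:
                    lexicon[syn] = polarity
                    changed = True
            for ant in ANTONYMS.get(word, []):
                if ant not in lexicon:
                    lexicon[ant] = -polarity
                    changed = True
    return lexicon

# Starting from the single positive seed "good":
print(expand_lexicon({"good": 1}))
```

As the text notes, the resulting lexicon would normally be validated manually, and the approach cannot capture domain-specific polarity shifts such as "budget" in the airline domain.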
Sentiment analysis on feature level
It is useful to classify a text on document or on sentence level, but this approach may be seen
as rather simplistic. A document which is classified as negative does not necessarily render an
exclusively negative opinion: some features may be considered as positive by the author. A
further, deeper and more detailed analysis of the text is needed to define the opinion of the
author of the text. Feature-based sentiment analysis is a technique to determine whether the
author’s opinion on a specific feature of a product is positive or negative (Liu, 2010).
There are several techniques to extract features from a text. Liu (2010) distinguishes two
types of online reviews: “pros, cons and the detailed review” on the one hand and free format
on the other. The “pros, cons and detailed review” type consists of two parts: a separate list of
the pros and cons of the object, and a full review. In this type of review, it is easier to extract
different features as they are clearly listed (a list with all the cons and a list with all the pros).
In a free format review, pros and cons are not listed separately, and the author can decide
himself how to structure his full text review. This makes it more difficult to extract separate
features. Hu & Liu (2004) established a method to find explicit features that are either nouns
4 Extracted from the British Airways corpus
5 Extracted from the Ryanair corpus
or noun phrases. To do this, a corpus of customer reviews of the product was collected. The
method consists of two steps. First, frequently used nouns and noun phrases are detected.
Only the nouns and noun phrases that occur frequently enough in the reviews are retained. In
this way, the features which are most frequently commented on (i.e. the most important
features) are assembled. In the second step, opinion words are used to find more infrequent
features.
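The first step of the Hu & Liu (2004) method — retaining only the nouns and noun phrases that occur frequently enough across the review corpus — can be roughly sketched as follows. The PoS tags are assumed to be given by an upstream tagger, and the frequency threshold is an illustrative choice, not the one from the original paper.

```python
from collections import Counter

# Step 1 of Hu & Liu (2004), sketched: count nouns across PoS-tagged
# reviews and keep only those frequent enough to count as candidate
# product features. Tag set (Penn-style "NN*") and threshold are
# illustrative assumptions.
def frequent_features(tagged_reviews, min_count=2):
    counts = Counter(tok.lower()
                     for review in tagged_reviews
                     for tok, tag in review
                     if tag.startswith("NN"))
    return {noun for noun, c in counts.items() if c >= min_count}

reviews = [
    [("The", "DT"), ("battery", "NN"), ("is", "VBZ"), ("great", "JJ")],
    [("Battery", "NN"), ("life", "NN"), ("disappoints", "VBZ")],
    [("Nice", "JJ"), ("screen", "NN")],
]
print(frequent_features(reviews))   # {'battery'}
```

The second step of the method, using opinion words to recover infrequent features, would extend this by looking at nouns that co-occur with known opinion words.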
Based on the above-mentioned levels, we conclude that for our research, i.e. sentiment
analysis in Twitter, the analysis could occur on three levels: sentence level, word level and
feature level. As tweets are very short messages, sentence and document level can be seen as
equal. Sentiment analysis on Twitter can therefore occur on sentence level.
In our dissertation, the data collection occurred on feature level as only tweets with the entity
British Airways or Ryanair were considered. Once we collected our data, we performed
sentiment analysis on document level as the entire tweets were considered.
2. SENTIMENT ANALYSIS IN TWITTER
2.1. Twitter
Twitter6 is a micro-blogging site that was founded in March 2006 by Biz Stone, Jack Dorsey
and Evan Williams. Its logo is a blue bird and one tweet can consist of a maximum of 140
characters. Currently, Twitter is available in 35 different languages. The micro-blogging site
is not only popular among ordinary people; celebrities and companies are also active on the
site.
Although messages on micro-blogging sites such as Twitter are limited to 140 characters,
their popularity grows at an enormous rate. Kaplan and Haenlein (2011) mention three factors
explaining the success of micro-blogs. First, through Twitter one can receive updates on even
the most trivial matters that happen in other people’s lives. Secondly, Twitter allows a push-
push-pull combination. Finally, it offers a “platform for virtual exhibitionism and voyeurism
for both active contributors and passive observers” (p106).
6 http://www.twitter.com (last accessed: May 20th 2014)
The first factor explains why micro-blogging sites are so popular. The push-push-pull
combination illustrates how tweets can benefit companies. There are three different
stages of the marketing process in which Twitter can be used. First, it can be used in pre-
purchase: Twitter can be used to read what clients have to say about a
particular company or its products, which can actually turn those clients into co-producers. Secondly,
micro-blogging sites can be useful in purchase. Companies can advertise and spread brand-
reinforcing messages through micro-blogging. They can not only communicate with their
customers but also with and through their employees. The final marketing area is post-
purchase. Companies can improve their customer service and complaint management
processes through micro-blogging sites. (Kaplan and Haenlein, 2011)
Kaplan and Haenlein (2011) mention three rules for successful micro-blogging, namely
relevance, respect and return. For the first factor, “relevance”, there are two important things a
company has to keep in mind: listen before you tweet, and find the right balance
between sending too few and too many messages. A second rule is “respect”. You have to
respect your followers by identifying yourself, using appropriate language and not deceiving
other users. The third rule “return” implies that firms using micro-blogging sites have to keep
in mind the benefits and return-on-investment of their activities. (Kaplan and Haenlein, 2011)
2.2. Shallow machine learning approaches to perform SA in Twitter data
Sentiment analysis is a difficult task on short chunks of text such as tweets, but different
classifiers have already been developed which show some promising results for labelling
tweets. Go et al. (2009), Barbosa & Feng (2010), Davidov et al. (August 2010), Mohammed
et al. (2013) and Kouloumpis et al. (2011) all developed a different classifier to detect
sentiment in Twitter. In this section, we explain these approaches and highlight the
similarities and differences.
As far as the Twitter datasets are concerned, Go et al. (2009) used one that consisted of tweets
which all contained emoticons. Barbosa & Feng (2010) used a combination of objective and
subjective sentences from three sources: Twendz7, Twitter Sentiment8 and TweetFeel9, while
7 An app to search for tweets, which can be downloaded from http://www.twtbase.com/twendz/ (last accessed: May 20th 2014)
8 http://www.sentiment140.com/ (last accessed: May 20th 2014)
9 http://www.tweetfeel.com/ (last accessed: May 20th 2014)
Davidov et al. (August 2010) used an existing dataset (by Brendan O’Connor). Further, the
Mohammed et al. (2013) dataset was provided by the organizers of the ‘SemEval2013 Task 2:
Sentiment Analysis in Twitter’ as the Mohammed et al. (2013) paper was one of the
submissions of that SemEval task. For this dataset, named entities were extracted from the
Twitter API (Nakov et al., 2013). Finally, Kouloumpis et al. (2011) used three different kinds
of datasets: a hashtag dataset in which all tweets contained a hashtag, an emoticon dataset
which appeared to be the same dataset used in Go et al. (2009) and a manually labelled
dataset which contained tweets on certain topics and in which each tweet was labelled with its
sentiment. Although there are no similarities in the data collection itself, apart from
Kouloumpis et al. (2011) and Go et al. (2009) using the same emoticon dataset, all datasets
were further processed and only English tweets were retained. In both Go et al. (2009) and
Barbosa & Feng (2010), tweets with contradictory labels were removed. Some other
important processing steps in the Go et al. (2009) dataset were that only the most common
emoticons were retained (these were: “ :), :-), : ), :D, =), :(, :-( and : (”). Besides this, retweets,
tweets with the :P emoticon and duplicates were removed. Further, Barbosa & Feng (2010)
considered one tweet per user only and removed tweets containing top opinion words such as
“cool” and “awesome”. The SemEval dataset, used by Mohammed et al. (2013), was
processed using a lexicon tool. Only tweets containing at least one sentiment word were
considered (Nakov et al., 2013). Finally, Kouloumpis et al. (2011) selected the hashtags that
would be most useful to identify a tweet as positive, negative or neutral. All tweets in the
hashtag dataset that lacked these hashtags were removed.
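The preprocessing steps described above can be sketched in the spirit of Go et al. (2009): retweets and duplicates are dropped, and the remaining tweets are labelled by the polarity of the emoticon they contain. The exact filtering rules below are simplified assumptions; the emoticon lists mirror the ones quoted in the text.

```python
# Emoticon-based distant labelling, sketched after Go et al. (2009):
# drop retweets and duplicates, then label each remaining tweet by the
# polarity of the emoticon it contains. Rules are deliberately simplified.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")

def distant_label(tweets):
    labelled, seen = [], set()
    for tweet in tweets:
        if tweet.startswith("RT") or tweet in seen:
            continue                      # drop retweets and duplicates
        seen.add(tweet)
        if any(e in tweet for e in POSITIVE):
            labelled.append((tweet, "positive"))
        elif any(e in tweet for e in NEGATIVE):
            labelled.append((tweet, "negative"))
    return labelled

sample = ["Great flight :)", "RT Great flight :)",
          "Lost my bag :(", "Great flight :)"]
print(distant_label(sample))
```

Tweets without any of the retained emoticons are silently discarded, which mirrors the restriction of the Go et al. (2009) dataset to emoticon-bearing tweets.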
This brings us to a description of the machine learning approaches. In all systems, different
features were used to test different classifiers. Go et al. (2009) considered four different
feature sets: unigrams, bigrams, unigrams combined with bigrams, and part-of-speech (PoS). The latter,
PoS or meta-features, were also considered by Barbosa & Feng (2010) together with syntax
features. Besides unigrams, which is similar to Go et al. (2009), Davidov et al. (August 2010)
also considered n-gram features, pattern features and punctuation features. Mohammed et al.
(2013) used some features that were also used in the other classifiers such as n-grams, PoS,
emoticons and punctuation. On top of these, they added character n-grams, all-caps, hashtags,
lexicons, elongated words, clusters and negation. Kouloumpis et al. (2011) also used n-gram
features, lexicon features and PoS features, but moreover they considered micro-blogging
features.
As mentioned earlier, Go et al. (2009), Barbosa & Feng (2010), Davidov et al. (August 2010),
Kouloumpis et al. (2011) and Mohammed et al. (2013) all developed their own classifier. Go
et al. (2009) are the only ones who compared their classifier to others while Barbosa & Feng
(2010), Davidov et al. (August 2010), Mohammed et al. (2013) and Kouloumpis et al. (2011)
all simply tested their own classifier.
The keyword-based classifier developed by Go et al. (2009) counts the number of positive and negative words in a chunk of text. The higher count 'wins'; in the case of a tie, in other words when there is an equal number of positive and negative words, the tweet is labelled positive. They used the Twittratr10 word list in order to know which words to label positive and which to label negative. They compared their own classifier to three
machine learning algorithms. Barbosa & Feng (2010) considered the two steps of the
sentiment analysis process (cf. 1.2.1 and 1.2.2) when drawing up their classifiers: a
subjectivity classifier and a polarity classifier. While the former focusses on the tweets’
syntax features (emoticons and upper case), the latter focusses on the meta-features (PoS-
tagging). Both Davidov et al. (August 2010) and Mohammed et al. (2013) combined all
features in one single classifier. In order to examine which feature influenced the results the most, however, Mohammed et al. (2013) ran their classifier several times: the first time all features were included, and in each later run one feature was left out. Finally, Kouloumpis et
al. (2011) also tried to combine their different datasets (hashtag, emoticon and manually
annotated datasets) with different features in order to find out which combinations provided
the best results.
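The keyword-counting scheme of Go et al. (2009) described above can be sketched as follows; the word sets are illustrative placeholders, not the actual Twittratr word list:

```python
# Minimal sketch of the keyword-counting classifier described above.
# The word sets below are illustrative placeholders, not the actual
# Twittratr lexicon.
POSITIVE_WORDS = {"good", "great", "love", "awesome", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "awful", "delayed"}

def keyword_classify(tweet):
    tokens = tweet.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    # Ties (including zero hits) are labelled positive, as described above.
    return "negative" if neg > pos else "positive"
```

Note that under this scheme a tweet without any lexicon hits defaults to positive, which is one reason such keyword classifiers are usually compared against machine learning algorithms.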
In general, all classifiers performed quite well. Go et al. (2009) concluded that unigrams are
good predictors, whereas bigrams seem more useful in combination with unigrams. Although
Davidov et al. (August 2010) did not use the same features as Go et al. (2009), they also
stated that a combination of various features resulted in a better performance of their
classifier. Further, Davidov et al. (August 2010) argued that some hashtags and smileys tend
to occur together rather frequently and some sentiment types also depend on each other.
Mohammed et al. (2013) concluded that lexicons influenced the performance of the classifier remarkably. Moreover, the automatic sentiment lexicons, i.e. the NRC Hashtag Sentiment Lexicon (Mohammad, 2012) and the Sentiment140 Lexicon (Go et al., 2009), had a more positive influence on the classifier than the manual lexicons, i.e. the NRC Emotion Lexicon (Mohammad & Turney, 2010; Mohammad & Yang, 2011), the MPQA Lexicon (Wilson et al., 2005) and the Bing Liu Lexicon (Hu & Liu, 2004). Finally, the experiments of Kouloumpis et al. (2011) showed that it was generally useful to add the emoticon dataset to the hashtag dataset. Another conclusion is that the PoS feature caused a poorer performance. The best results were obtained when the n-gram, lexical and micro-blogging features were trained on the hashtag dataset only.

10 http://www.twittratr.com/ - now automatically redirected to the website of its developer (last accessed: May 20th 2014)
There is one contradiction in the results: whereas Go et al. (2009) and Kouloumpis et al. (2011) explicitly mentioned that PoS-tagging does not seem to be a useful feature, Barbosa & Feng (2010) did not report any problems arising with that feature. In Table 1, an overview is presented of the
shallow machine learning approaches which were discussed in this section.
Go et al. (2009)
Features: unigrams; bigrams; unigrams and bigrams; PoS
Classifier: comparison of their own keyword-based classifier to three machine learning algorithms
Results: unigrams: good; bigrams: not good; unigrams and bigrams: good; PoS: not useful

Barbosa & Feng (2010)
Features: meta-features (PoS): negative polarity, positive polarity and verbs; syntax features: emoticons and upper case
Classifier: two types of classifiers: a subjectivity classifier and a polarity classifier
Results: good results, but one important limitation: sentences that contain contradicting sentiments

Mohammed et al. (2013)
Features: n-grams, character n-grams, all-caps, PoS, hashtags, lexicons, punctuation, emoticons, elongated words, clusters and negation
Classifier: a classifier in which all features are combined; later adapted by leaving features out
Results: the lexicon feature influences the classifier the most

Davidov et al. (August 2010)
Features: single word features (unigrams), n-gram features, pattern features and punctuation features
Classifier: a classifier in which all features are combined
Results: a combination of different features seems to be very important

Kouloumpis et al. (2011)
Features: n-gram features, lexicon features, PoS features and micro-blogging features
Classifier: a classifier in which all features are combined; later adapted by leaving features out
Results: best results when combining the emoticon and hashtag dataset or when training n-gram, lexical and micro-blogging features on the hashtag dataset only; PoS: not useful

Table 1 – Short summary of the shallow machine learning approaches
As to the datasets, Liu et al. (2012) propose a new model which tries to handle the challenge of combining manually labelled data (as in Davidov et al. (August 2010)) and noisy labelled data (as in Go et al. (2009)) for training: the Emoticon Smoothed Language Model (ESLAM). Basically, a manually labelled dataset is smoothed with a noisy labelled dataset (with emoticons).
The dataset consisted of 5513 manually labelled tweets, of which 3727 remained after removing non-English and spam tweets11. The preprocessing of the dataset consisted of
replacing Twitter usernames starting with @ by ‘twitterusername’, replacing all digits with
‘twitterdigit’, replacing all URLs with ‘twitterurl’, removing all stopwords, lower casing and
stemming all words and removing retweets and duplicates. After this pre-processing, 956
random tweets (478 positive and 478 negative) were chosen for polarity classification and
1948 (974 positive and 974 negative) for subjectivity classification.
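The pre-processing steps of Liu et al. (2012) listed above can be sketched as follows; stemming and retweet/duplicate removal are omitted, and the tiny stopword set is illustrative only:

```python
import re

# Simplified sketch of the pre-processing described above (Liu et al.,
# 2012). Stemming and retweet/duplicate removal are omitted; the tiny
# stopword set is illustrative only.
STOPWORDS = {"the", "a", "an", "is", "to"}

def preprocess(tweet):
    t = tweet.lower()
    t = re.sub(r"@\w+", "twitterusername", t)     # Twitter usernames
    t = re.sub(r"https?://\S+", "twitterurl", t)  # URLs (before digits)
    t = re.sub(r"\d+", "twitterdigit", t)         # digit sequences
    return " ".join(w for w in t.split() if w not in STOPWORDS)
```

The URL pattern is applied before the digit pattern so that digits inside URLs are not replaced separately.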
In order to test whether using emoticons is useful, Liu et al. (2012) compared ESLAM to a fully supervised language model (LM). Normally, however, an LM is only used for documents. To solve this problem, two 'documents' were made: one a concatenation of all positive tweets and one a concatenation of all negative tweets in the training data. A first conclusion which could be drawn
from the test was that both methods performed better when more manually labelled data was
used. Secondly, the performance of ESLAM was always better than the performance of LM,
especially when a small amount of manually labelled data was used. According to Liu et al. (2012), this means that noisy (emoticon) data does contain useful information. Another experiment showed that using only noisy labelled data was not enough, since the results revealed that using more manually labelled data resulted in a better performance.
We can conclude that shallow machine learning approaches already perform well at sentiment analysis on Twitter data. As mentioned above, classifiers perform well when certain features are considered, such as lexicons, unigrams, unigrams in combination with bigrams, etc. More profound linguistic characteristics, however, such as sarcasm and negation, seem to be the main problem. First of all, it is sometimes not clear to a classifier which word a
negation particle is meant for. Further, labelling tweets as sarcastic is something only humans can do, and even for a human annotator this is sometimes a difficult task. In the following section, approaches for these characteristics are discussed.

11 As in e-mail programs or other social media websites, spam is also present on Twitter. It can occur in various forms, such as abusing the @ function in order to reach users when spreading unwanted messages, creating various accounts, posting links to dubious websites, etc. https://support.twitter.com/entries/64986# (last accessed: May 20th 2014)
2.3. Approaches incorporating more linguistic knowledge
2.3.1. Sarcasm
González-Ibáñez et al. (2011) and Davidov et al. (July 2010) agree that a sarcastic tweet
means the opposite of what is actually written and that this is very hard to detect. Davidov et
al. (July 2010) investigated how to distinguish sarcastic from non-sarcastic utterances on
Twitter and Amazon while González-Ibáñez et al. (2011) compared the performance of
human annotators and machine learning techniques when classifying tweets as positive or
negative. Both Davidov et al. (July 2010) and González-Ibáñez et al. (2011) used a dataset consisting of tweets that were explicitly marked with the hashtag #sarcasm.
Results showed that Davidov et al.’s (July 2010) model scored well on the Twitter dataset
(with tweets containing #sarcasm) as Twitter messages are short and the sarcasm is mostly not
dependent on other sentences (which can be the case on Amazon, where larger chunks of text are used). In order to pinpoint some typical characteristics of sarcasm in tweets, they
used a “hash-gold standard set”, which contains the tweets labelled with #sarcasm. They
found three typical uses of the hashtag. First, twitter users want others to be able to find their
sarcastic tweets and therefore label them as such. Secondly, the hashtag can be used in order
to indicate that their previous tweet was sarcastic. Finally, the hashtag can be used when there
is a lack of context from which the sarcasm could be deduced and therefore the users
explicitly mention that something is sarcastic. These uses show that tweets with #sarcasm are
both noisy and biased. Noisy because not all tweets containing that hashtag are actually
sarcastic and biased because without the hashtag, it would be difficult, even for humans, to
label them as sarcastic.
González-Ibáñez et al. (2011) countered the concern of Davidov et al. (July 2010) that tweets with #sarcasm are noisy by removing from their dataset all tweets where the hashtag did not occur at the end. They found that lexical and pragmatic features were not enough for a classifier
to distinguish sarcastic tweets from positive or negative tweets. Further, they investigated
whether human annotators would be better at the task of determining whether a tweet was
sarcastic, negative or positive. Their results revealed that the human annotators also scored low on the classification task. Consequently, González-Ibáñez et al. (2011) concluded that only tweets which were explicitly labelled sarcastic by their author could be used, since authors are the only ones able to label their tweets correctly. This is also why they are convinced that their "approach to create a gold standard of sarcastic tweets is more suitable in the context of Twitter messages" (p. 585). Davidov et al. (July 2010) also used features in their algorithm:
punctuation-based and pattern-based features. The punctuation features can be further divided
into five different features: sentence length in words, the number of exclamation marks,
question marks and quotes in a sentence, and finally the number of capitalized words in a
sentence.
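The five punctuation-based features listed above can be sketched as follows; splitting on whitespace is our own simplifying assumption, not necessarily the tokenization of Davidov et al. (July 2010):

```python
# Sketch of the five punctuation-based features listed above
# (Davidov et al., July 2010). Whitespace tokenization is a
# simplifying assumption.
def punctuation_features(sentence):
    words = sentence.split()
    return {
        "length_in_words": len(words),
        "exclamation_marks": sentence.count("!"),
        "question_marks": sentence.count("?"),
        "quotes": sentence.count('"'),
        "capitalized_words": sum(1 for w in words if w[:1].isupper()),
    }
```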
2.3.2. Negation
Negation occurs frequently in tweets, and in short texts or sentences in general. To the best of our knowledge, however, no studies have considered this aspect for Twitter specifically. That is why, for this dissertation, we discuss previous studies on negation that could also be useful for handling negation in tweets.
Wiegand et al. (2010) provided a recent study on negation in sentiment analysis. In their
paper, they provided an overview of computational approaches that deal with this matter. We
will only discuss those that are relevant for sentiment analysis in Twitter.
Pang et al. (2002) found that a bag-of-words representation is effective. With this technique, the classifier itself has to determine which words in the dataset or feature set are polar and which are not. When a negation particle occurs, the words following it are turned into artificial negated words up to the next punctuation mark.
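This negation handling can be sketched as follows; the "NOT_" prefix and the particle list are illustrative choices, while the idea of adding artificial words up to the next punctuation mark follows the description above:

```python
import re

# Sketch of the negation handling described above: after a negation
# particle, every token is turned into an artificial "NOT_" word until
# the next punctuation mark (in the spirit of Pang et al., 2002).
NEGATION_PARTICLES = {"not", "no", "never"}

def mark_negation(tokens):
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            negating = False  # negation scope ends at punctuation
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATION_PARTICLES:
                negating = True
    return out
```

In this way "like" and "NOT_like" become distinct bag-of-words features, so the classifier can learn their polarities separately.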
A second type of approach comprises models which incorporate prior knowledge of polar expressions. Polanyi & Zaenen (2004) developed a model that uses contextual valence shifting.
In this model, polar expressions are given scores: positive scores to positive polar expressions
and negative scores to negative polar expressions. When a polar expression is negated, the
polarity score of the expression is flipped. This model’s effectiveness is however unknown as
it has never been implemented. The second approach of this type is the one by Wilson et al.
(2005). They developed a classifier and added a negation feature. They concluded that the
more features were used in the classifier, the better it performed. In other words, adding
negation features to a classifier helps.
Another approach, by Jia et al. (2009), examined the impact of scope models for negation, in other words, models which focus on the scope of a negation. They developed a complex model which also considers, for example, disambiguation and negative rhetorical questions. Wiegand et al. (2010) argued that this model performs better than simpler methods as it keeps linguistic insights in mind.
Negation does not only occur within phrases or sentences, but sometimes also within words.
These words are mostly words which are clearly either positive or negative, so this should not
be a problem in sentiment analysis. However, new words (containing negation) are arising
and these may not be listed in the polarity lexicon. Therefore, Wiegand et al. (2010) argued
that polarity classifiers should also be able to detect these words.
Liu & Seneff (2009) also made an important remark. They observed that a negated polar
expression (e.g.: not pretty) does not necessarily have the same polar strength as its unnegated
polar expression with the opposite polarity (e.g.: ugly). Although “not pretty” and “ugly” do
have the same polarity, their polar strength is different as in this case “not pretty” is less
negative than “ugly”. Liu & Seneff (2009) therefore proposed a model in which words are
given a score for their intensity and polarity.
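The score-flipping idea of Polanyi & Zaenen (2004) and the strength observation of Liu & Seneff (2009) can be illustrated together in a small sketch; the lexicon, its scores and the function name are invented for illustration:

```python
# Illustrative sketch of contextual valence shifting: polar expressions
# carry scores and a negation flips the sign (Polanyi & Zaenen, 2004).
# The lexicon and scores below are invented for illustration only.
LEXICON = {"pretty": 2.0, "good": 2.0, "ugly": -3.0, "bad": -2.0}
NEGATORS = {"not", "never", "no"}

def polarity_score(tokens):
    """Sum lexicon scores, flipping the sign of a negated polar word."""
    score, negated = 0.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negated = True
        elif tok in LEXICON:
            score += -LEXICON[tok] if negated else LEXICON[tok]
            negated = False
    return score
```

With this toy lexicon, "not pretty" scores -2.0 while "ugly" scores -3.0: the flipped score is milder than the direct antonym, in line with Liu & Seneff's remark that a negated polar expression and its antonym need not have the same polar strength.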
Finally, Wiegand et al. (2010) formulated some of the limits that negation modelling still has
in sentiment analysis. First, some words can have different polar expressions in different
contexts. Secondly, some polar opinions can only be detected and understood when the reader
has world knowledge. Finally, rare constructions are often not yet considered as specific
negation models can only be developed when a large enough corpus is available. Only then
can the models be properly tested.
PART II: CORPUS ANALYSIS
In the second part we describe the corpus analysis that was performed for this dissertation.
For our data collection we decided to focus on one specific domain: airline companies.
The most important objective of this dissertation’s experiments is to help improve the
performance of a basic sentiment analysis tool (see 1.3). In order to be able to do that, it is
necessary to know why it fails at labelling the tweets correctly. After analysing the results,
some propositions and conclusions will be drawn up in order to help improve the tool. The
focus will be on more profound text attributes.
We can reformulate these objectives in three questions:
- How well does a machine sentiment analysis tool (the LT3-tool) perform at labelling linguistically challenging tweets compared to humans?
- What is or what are the most common mistake(s) a machine sentiment analysis tool
(the LT3-tool) makes when labelling linguistically challenging tweets?
- How could a machine sentiment analysis tool (the LT3-tool) be improved in order to
label linguistically challenging tweets better?
We will now first describe the methodology followed for our research, after which we will
discuss the results.
1. METHODOLOGY
1.1. Data collection
We chose to compile two different corpora with tweets on two airline companies, British Airways and Ryanair. These airline companies were selected as they have very different reputations. This is an interesting aspect, as these reputations can influence the kinds of tweets that are sent. British Airways has a rather good reputation: it is a non-budget airline and is held in high regard not only by British travellers but also by travellers worldwide. The most recent proof of this good reputation is its election as best brand in the UK (Molley, 2014; Winch, 2014). Ryanair, on the other hand, has a (much) worse reputation, which may be due to its being a low-budget airline. For a lot of travellers, a low-budget airline often equals an airline company with a bad customer service. This is definitely the case for Ryanair, as it is known for offering cheap flight tickets but no other extras for its customers (Vizard, 2013). In the past few months, however, Ryanair has been making extra efforts to boost its public image (Travelmole, 2014).
In order to be eligible for our corpus, all tweets had to meet some criteria:
1. The tweet had to mention one of the two chosen airline companies (Ryanair or British Airways).
2. It had to be sent during November 2013.
3. The tweet had to contain at least one sentiment which would be difficult to predict using an automatic system.
This last criterion was the most important one.
Starting in November 2013, as many tweets as possible were collected using the Twitter API. One limitation of this service is that it only allows a user to download tweets from the preceding week12. During the month of November, we entered our two search terms "Ryanair" and "British Airways" in the tool on a daily basis. This already resulted in a corpus of 45 Ryanair and 60 British Airways tweets. However, since we aimed to have at least five tweets for each day in November – a criterion that was not met during this period – we had to find another tool allowing us to search for more tweets. To this purpose, we used the
website www.topsy.com (last accessed: May 20th 2014). This website allows you to search
for old tweets by date. Since this website also searches other social media based on a specific query, we had to specify two restrictions, i.e. (1) that we only wanted tweets and (2) that we only wanted English tweets. For each day in November, separate searches were performed and tweets meeting our criteria were selected.
In order to illustrate the data collection, we present screenshots of the websites used to collect the tweets. Image 1 shows the Twitter API with the search entry "British Airways", while Images 2 and 3 are screenshots of the Topsy website using the same search entry.
12 https://dev.twitter.com/docs/using-search (last accessed: May 20th 2014)
Image 1 - Example of the Twitter API with the search term "British Airways"
Image 2 – Possibilities to specify a search on Topsy according to time/date, kind of message and language respectively
Image 3 - Example of result page on Topsy with the search term “British Airways” and with the specifications as in Image 2
As our objective was to find five tweets for each day in November, we continued searching for tweets on the Topsy website until that criterion was met. The final corpus consists of 300 tweets in total, 150 for Ryanair and 150 for British Airways13. Table 2 presents some data
characteristics of our corpus.
The corpus contains different types of information on the tweets. First, the date and exact time
are provided. Secondly, the tweet itself is in the corpus. Most emoticons were copied, except
the less common ones that could not be typed (such as an airplane). A third element in the
corpus is the Twitter username of the type @username. The fourth element is the name that the Twitter user provides him- or herself. Normally, you would expect this to be the real name of the user, but this is not always the case, as some users include asterisks (*) in their name, for example. The two
last columns of the corpus contain information on the annotation of the tweets, which will be
discussed in closer detail in the following section.
                  Number of tweets   Average tweet length
Ryanair                        150                  16.54
British Airways                150                  17.29
Table 2 - Number of tweets and average tweet lengths of both corpora
13 The Ryanair and British Airways corpora are enclosed with this dissertation via Minerva.
1.2. Data annotation
Before testing a sentiment analysis tool on our dataset it was crucial that our data had
manually verified sentiment labels; they had to be a gold standard.
During our data collection we already paid special attention to the third criterion, i.e. why a
sentiment analysis tool would label the sentiment of the tweet incorrectly. This implies that a
first rough annotation of the tweets had already been performed. Next to each tweet there was a description of why this tweet was chosen, which often (implicitly) mentioned whether the particular tweet was positive, negative or neutral.
However, it was still necessary to perform an additional annotation in a uniform manner. To
this purpose all tweets were annotated manually as positive, negative or neutral.
Annotating a corpus is a fairly subjective task, as one annotator could describe a tweet as sarcastic whereas another considers it a serious tweet. The same is true for this corpus' annotation; some annotations might be up for discussion, of which we now present some examples:
I'm wondering if @rewardgateway is a bit like the old British Airways in that it's staff
have to pass a test for extremely good lookingness
This tweet could be interpreted as either negative (British Airways focusses on the looks and
not the capacities) or positive (all British Airways stewardesses are good-looking). This is
why we decided to label it as neutral in order to keep those two interpretations in mind.
British Airways pulling out of Manchester was disappointing but attracted other
airlines. Market will change over next 5 yrs.
One could consider this a negative tweet because the twitter user is disappointed in British
Airways for leaving. Others could label it positive as being disappointed can express a
positive opinion of British Airways. In the corpus however, this tweet was also labelled
neutral in order to keep those two interpretations in mind.
We were not able to perform an inter-annotator agreement study due to a lack of time.
However, we consider the pre-selection step and extensive literature study sufficient to form a
rather unambiguous judgment. Table 3 presents the total number of positive, negative and
neutral tweets in each corpus. In the Ryanair corpus, the number of negative tweets is considerably higher (nineteen more) than in the British Airways corpus. This could be a consequence of Ryanair having a worse reputation than British Airways.
                  Sentiment   # of tweets
British Airways   Negative             95
                  Positive             45
                  Neutral              10
Ryanair           Negative            114
                  Positive             30
                  Neutral               6
Table 3 - Overview of the number of negative, positive and neutral tweets (labelled by the human annotator) in both corpora
1.3. LT3-tool
The LT3-tool is a sentiment analysis tool developed by and discussed in Van Hee et al. (2014). It is a fairly standard sentiment analysis tool including as many features as possible. The LT3-tool considers ten different feature groups: lexicons, n-grams, n-grams and lexicons, normalization, PoS, negation, word shape, named entity, dependency and PMI. When testing the tool, first only three feature groups were used (lexicons, n-grams, and n-grams and lexicons) and later the other feature groups were added one by one. Van Hee et al. (2014) did this in order to examine how much each feature group influenced the performance of the tool. Using the SemEval2013 Twitter dataset (Nakov et al., 2013) for training, Van Hee et al. (2014) tested the performance of the LT3-tool on different datasets: a dataset with regular tweets, a dataset with sarcastic tweets, a dataset with SMS messages and a dataset consisting of blog posts.
Results showed that the LT3-tool performed best on the dataset with regular tweets and worst
on the sarcastic dataset. Further, almost all features had a positive influence on the
performance of the tool except for the dependency feature.
For our analysis, we used the LT3-tool, trained on the SemEval data with all available
features, and tested it on both the British Airways and Ryanair corpus.
2. RESULTS
2.1. Evaluation
Before coming to the actual results, we will first explain how we evaluated the tool. The error analysis itself consists of two dimensions: a quantitative and a qualitative analysis. The quantitative analysis consists of calculating the performance of the LT3-tool (accuracy, precision and recall) using confusion matrices. The qualitative analysis was done by classifying the different errors of the tool.
Quantitative analysis
Once the results of the LT3-tool are outputted, they can be compared to those of the human annotator. In order to calculate this performance, a confusion matrix is used in which "each column […] represents the instances in a predicted class, while each row represents the instances in an actual class"14. The results of both datasets are represented in two separate matrices.
Using the data in those matrices, three performance metrics can be calculated: accuracy,
precision and recall. The first value, accuracy, indicates “the degree of closeness of
measurements of a quantity to that quantity's actual (true) value"15. For the LT3-tool in particular, it indicates how many tweets were labelled correctly, irrespective of the classes
(positive, negative, neutral). The second value, precision, indicates "the degree to which repeated measurements under unchanged conditions show the same results"15. For example, when one wishes to calculate the precision of the positively labelled tweets, one calculates what percentage of the tweets that were labelled positive by the system is actually correct. In this dissertation, the precision is calculated for all possible labels: negative, positive and
neutral. The third and last value is recall, which can be defined as "the fraction of relevant instances that are retrieved"16. In other words, with this we calculate what percentage of, for example, the positive tweets was recognized by the sentiment analysis tool. Again, this value
14 http://en.wikipedia.org/wiki/Confusion_matrix (last accessed: May 20th 2014)
15 http://en.wikipedia.org/wiki/Accuracy_and_precision (last accessed: May 20th 2014)
16 http://en.wikipedia.org/wiki/Recall_(information_retrieval) (last accessed: May 20th 2014)
is calculated for the negative, positive and neutral labels. We will now present the different formulas17 used to calculate these metrics.
Accuracy:
(neg.neg + pos.pos + neu.neu) / total number of tweets

Precision of the negative, positive and neutral class respectively:
neg.neg / (neg.neg + pos.neg + neu.neg)
pos.pos / (neg.pos + pos.pos + neu.pos)
neu.neu / (neg.neu + pos.neu + neu.neu)

Recall of the negative, positive and neutral class respectively:
neg.neg / (neg.neg + neg.pos + neg.neu)
pos.pos / (pos.neg + pos.pos + pos.neu)
neu.neu / (neu.neg + neu.pos + neu.neu)
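Under the notation above, the three metrics can be computed from a confusion matrix as in the following sketch; the data layout and function name are our own choices:

```python
# Sketch of the metric computations described above. The confusion
# matrix is stored as counts[human_label][machine_label]; the layout
# and the name "metrics" are our own choices.
def metrics(counts):
    labels = ["neg", "pos", "neu"]
    total = sum(counts[h][m] for h in labels for m in labels)
    accuracy = sum(counts[l][l] for l in labels) / total
    # Precision of class m: correct m-predictions over all m-predictions.
    precision = {m: counts[m][m] / sum(counts[h][m] for h in labels)
                 for m in labels}
    # Recall of class h: correctly found h-tweets over all tweets the
    # human annotator labelled h.
    recall = {h: counts[h][h] / sum(counts[h][m] for m in labels)
              for h in labels}
    return accuracy, precision, recall
```

Applied to the combined confusion matrix of Section 2.2, this yields an accuracy of roughly 21% and a precision of roughly 49% for the negative class.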
Qualitative analysis
In the qualitative analysis, six types of errors will be discussed according to how the human
annotator and the LT3-tool labelled the tweets.
         Human coder   LT3-tool
Type 1   Negative      Positive
Type 2   Negative      Neutral
Type 3   Positive      Negative
Type 4   Positive      Neutral
Type 5   Neutral       Negative
Type 6   Neutral       Positive
Table 4 - Overview of the different types of errors
In Table 4, an overview is provided of the six different types of errors that are distinguished in the qualitative analysis in Section 2.3. Type 1 and Type 2 errors are tweets which were labelled negative by the human annotator, but positive and neutral respectively by the LT3-tool. Type 3 and Type 4 errors concern tweets labelled positive by the human annotator and negative and neutral respectively by the sentiment analysis tool. Finally, Type 5 and Type 6 errors were made when the human annotator labelled tweets as neutral and the LT3-tool labelled them negative and positive respectively.

17 The sequence "X.Y" represents the human annotator labelling a tweet X and the LT3-tool labelling the tweet Y. Other abbreviations are "pos" for positive, "neg" for negative and "neu" for neutral.
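The error typology of Table 4 can be written as a small lookup; the label strings and function name are illustrative:

```python
# Sketch of the error typology in Table 4 as a lookup table: the pair
# (human label, LT3-tool label) maps to an error type; agreement maps
# to None. Label strings and function name are illustrative.
ERROR_TYPES = {
    ("negative", "positive"): 1,
    ("negative", "neutral"): 2,
    ("positive", "negative"): 3,
    ("positive", "neutral"): 4,
    ("neutral", "negative"): 5,
    ("neutral", "positive"): 6,
}

def error_type(human, machine):
    # Returns None when the two labels agree (no error).
    return ERROR_TYPES.get((human, machine))
```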
We will now discuss the results in closer detail.
2.2. Quantitative analysis18

                            Predicted sentiment (machine)
Overall sentiment (human)    Negative   Positive   Neutral
Negative                           37         90        82
Positive                           35         19        21
Neutral                             4          6         6
Table 5 – Confusion matrix of the British Airways and Ryanair corpus combined
The confusion matrix of both corpora combined is presented in Table 5. It shows how many times the LT3-tool labelled the tweets correctly or incorrectly. Overall, 37 negative, 19 positive and 6 neutral tweets were labelled correctly. In other words, the accuracy of the LT3-tool on our total corpus is 21%. This indicates that the collected tweets are indeed difficult to label for an automatic sentiment analysis tool.
            Negative   Positive   Neutral
Accuracy                  21%
Precision       49%        17%        6%
Recall          18%        25%       38%
Table 6 - Overview of the accuracy, precision and recall numbers of the British Airways and Ryanair corpus combined

When comparing the performance metrics presented in Table 6, there are some remarkable differences. First of all, the precision of the negative tweets (49%) is significantly higher than that of the positive (17%) and neutral (6%) tweets. In other words, of the tweets the LT3-tool labelled negative, a much larger share was correct than of those it labelled positive or neutral. The recall of the negative tweets, however, is the lowest: 18% compared to 25% and 38% for the positive and neutral tweets respectively. Although half of the tweets labelled negative by the LT3-tool were labelled correctly, the tool did miss a significant number of the tweets that were negative in reality. Further, the precision of the neutral tweets (6%) is very low, meaning that in many tweets the opinion was expressed with neutral words or the positive and negative opinion words cancelled each other out. Finally, the precision of the neutral tweets is the lowest whereas their recall is the highest. This high recall could be explained by the corpus containing few tweets labelled neutral by the human annotator (sixteen tweets in total, which is 5% of the entire corpus). In the qualitative error analysis, we hope to shed some light on the types of errors that were frequently made by the LT3-tool.

18 The calculations of all values mentioned below can be found in Appendix I.
For completeness, we present below the confusion matrices of both airline companies, British
Airways in Table 7 and Ryanair in Table 8:
                            Predicted sentiment (machine)
Overall sentiment (human)    Negative   Positive   Neutral
Negative                           24         40        31
Positive                           19         14        12
Neutral                             1          5         4
Table 7 – Confusion matrix of the British Airways corpus
                            Predicted sentiment (machine)
Overall sentiment (human)    Negative   Positive   Neutral
Negative                           13         50        51
Positive                           16          5         9
Neutral                             3          1         2
Table 8 – Confusion matrix of the Ryanair corpus
            Negative   Positive   Neutral
Accuracy                  28%
Precision       55%        24%        9%
Recall          25%        31%       40%
Table 9 - Overview of the accuracy, precision and recall numbers of the British Airways corpus
            Negative   Positive   Neutral
Accuracy                  13%
Precision       41%         9%        3%
Recall          11%        17%       33%
Table 10 - Overview of the accuracy, precision and recall numbers of the Ryanair corpus
A comparison of Tables 9 and 10 shows that the performance metrics of the Ryanair corpus
are generally lower than those of the British Airways corpus. First of all, the accuracy on the
British Airways corpus is 15 percentage points higher than on the Ryanair corpus, which is a
significant difference. All precision values are lower in the Ryanair corpus, but the precision of
the positive tweets is remarkably lower (also 15 percentage points). The recall values are
likewise lower for negative, positive and neutral tweets, with the positive tweets again showing
the largest difference. The values for neutral tweets (as labelled by the human annotator) also
differ, but this cannot be generalized as too few tweets were labelled neutral in either corpus.
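To make the relation between the confusion matrices and the reported metrics explicit, the snippet below recomputes the figures of Table 9 from the counts of Table 7 (a worked example in Python; the dictionary layout is our own, not part of the LT3-tool):

```python
# Recomputing the evaluation metrics of Table 9 from the confusion
# matrix of Table 7 (rows = human label, columns = machine label).
labels = ["negative", "positive", "neutral"]
ba = {  # British Airways corpus, counts taken from Table 7
    "negative": {"negative": 24, "positive": 40, "neutral": 31},
    "positive": {"negative": 19, "positive": 14, "neutral": 12},
    "neutral":  {"negative": 1,  "positive": 5,  "neutral": 4},
}

total = sum(sum(row.values()) for row in ba.values())   # 150 tweets
correct = sum(ba[l][l] for l in labels)                 # 42 correct
accuracy = correct / total                              # 42 / 150 = 28%

def precision(label):
    """Correct predictions of `label` over all predictions of `label` (column sum)."""
    predicted = sum(ba[h][label] for h in labels)
    return ba[label][label] / predicted

def recall(label):
    """Correct predictions of `label` over all human annotations of `label` (row sum)."""
    annotated = sum(ba[label].values())
    return ba[label][label] / annotated

for l in labels:
    print(f"{l}: precision {precision(l):.0%}, recall {recall(l):.0%}")
print(f"accuracy {accuracy:.0%}")
```

Rounded to whole percentages, the output reproduces the values reported in Table 9.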
2.3. Qualitative analysis
As the prime objective of this dissertation is to improve the LT3-tool, it is necessary to know
which mistakes the system makes. After a first thorough inspection we saw that we could
distinguish eight different error categories: sarcasm, negation, experience, comparison,
synonym, ambivalent, side-taking and Twitter-specific signs. When a tweet could be placed in
different categories, we opted for the most dominant error.
We will now first explain these error categories in closer detail. In the literature overview, the
common errors “sarcasm” (Section 2.3.1) and “negation” (Section 2.3.2) have already been
discussed in closer detail. In a sarcastic tweet, the twitter user says something but actually
means the exact opposite, which seems to be difficult for a sentiment analysis tool to detect.
In a negated tweet, a negation particle changes the sentiment orientation of the word that
follows. This also poses a problem for a sentiment analysis tool, as it is not always clear which
word the negation particle relates to. Note that negation does not always occur with an
ordinary negation particle (such as “not”): sometimes other parts of a sentence can indicate
negation (such as “rather than”) or at least strengthen or weaken the overall sentiment within
a tweet. This is why we decided to also include modifiers in the “negation” category. The third
error category is “experience”; this is when a twitter user tells
a story related to one of the airline companies. The story is usually told using neutral words
and lets the reader draw his/her own conclusions. “Comparison” is a fourth category and
occurs when the twitter user compares the airline company to another company. This can be
confusing for a sentiment analysis tool as it is not always clear which company the negative
or positive opinion is meant for. Fifthly, the error “synonym” indicates that a company is
perceived as an equivalent of a certain sentiment or opinion. When the sixth error,
“ambivalent”, occurs, a word in the tweet carries different sentiments in different contexts
and, in the context of that particular tweet, is labelled incorrectly. “Side-taking” is the seventh
error, which occurs when a twitter user takes the airline company’s side
in discussions. Finally, “Twitter-specific signs” is an error category occurring when for
example a hashtag expresses an important sentiment about the tweet, but cannot be read as
such by the sentiment analysis tool.
In Table 11, an overview is provided of the different error categories with a short explanation.
The exact count of errors per corpus can be found in Appendix II; all examples used in this
section are extracted from either the British Airways or Ryanair corpus.
Sarcasm                 A sarcastic utterance is used in the tweet
Negation                A negation particle or modifier is used in the tweet
Experience              The tweet reports an experience of the twitter user
Comparison              The entity is compared to another entity (entity: airline company)
Synonym                 The entity is seen as a synonym to a certain sentiment or opinion
Ambivalent              A word is used which can have different meanings in various contexts
Side-taking             The twitter user supports the entity
Twitter-specific signs  A sign is used which is typically used in Twitter
Table 11 - Overview of the different error categories
Before proceeding to the actual in-depth error analysis, an overview is provided in Tables 12
and 13, which summarize how many major (Table 12) and minor (Table 13) errors were found
in each of the six types.
Major errors
          Sarcasm   Negation   Comparison   Experience
Type 1    45        14         15           13
Type 2    37        7          7            20
Type 3    5         11         12           1
Type 4    1         1          7            8
Type 5    0         0          3            1
Type 6    0         0          2            4
Total     88        33         46           47
Table 12 - Overview of the number of major errors in each type
Minor errors
          Synonym   Ambivalent   Side-taking   Twitter-specific signs
Type 1    1         0            0             2
Type 2    10        0            0             1
Type 3    0         1            5             0
Type 4    0         2            1             1
Type 5    0         0            0             0
Type 6    0         0            0             0
Total     11        3            6             4
Table 13 - Overview of the number of minor errors in each type
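The per-type totals behind these tables can be recomputed directly from the raw counts; the resulting sums (90, 82, 35, 21, 4 and 6 errors) match the numbers reported per type in Sections 2.3.1 to 2.3.6, and the shares match the pie chart up to rounding. A short verification in Python (the data layout is our own):

```python
# Recomputing the per-type error totals and shares from Tables 12 and 13.
major = {  # Table 12 rows: sarcasm, negation, comparison, experience
    1: [45, 14, 15, 13], 2: [37, 7, 7, 20], 3: [5, 11, 12, 1],
    4: [1, 1, 7, 8], 5: [0, 0, 3, 1], 6: [0, 0, 2, 4],
}
minor = {  # Table 13 rows: synonym, ambivalent, side-taking, Twitter-specific signs
    1: [1, 0, 0, 2], 2: [10, 0, 0, 1], 3: [0, 1, 5, 0],
    4: [0, 2, 1, 1], 5: [0, 0, 0, 0], 6: [0, 0, 0, 0],
}

per_type = {t: sum(major[t]) + sum(minor[t]) for t in major}
total = sum(per_type.values())  # 238 erroneously labelled tweets in all

for t, n in sorted(per_type.items()):
    print(f"Type {t}: {n} errors ({n / total:.0%})")
```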
Tables 12 and 13 show again that most errors were made in Types 1, 2 and 3. The large share
of Type 1 and Type 2 errors can be explained by 70% of the tweets being labelled negative
by the human annotator, which increases the likelihood of those tweets being labelled
erroneously by the LT3-tool. Type 3 also has quite a large share, which can be explained by
the LT3-tool mostly predicting the exact opposite sentiment when it labelled erroneously
(another reason why Type 1 is notably present). Further, the rather small share of Type 5
and Type 6 errors can be explained by the fact that few tweets were labelled neutral by the
human annotator (only 5%).
In Table 14, the large shares of Type 1, Type 2 and Type 3 errors are shown more visually in
a pie chart. The correctly labelled tweets are not considered in this chart.
Table 14 - Pie chart of the shares of the different types of errors (both airline companies:
Type 1 38%, Type 2 34%, Type 3 15%, Type 4 9%, Type 5 2%, Type 6 2%)
2.3.1. Type 1: negative tweets labelled as positive
Type 1 includes all tweets that the human annotator labelled negative, but which were
erroneously labelled positive by the LT3-tool. In total, 90 errors of this type were
committed, 40 in the British Airways corpus and 50 in the Ryanair corpus. These errors
belong to different categories, of which four stand out: “sarcasm”, “negation”,
“comparison” and “experience”.
Sarcastic tweets seem to have caused substantial problems in Type 1, as this error accounts
for 50% of the erroneously labelled tweets. In this case, positive words are used, but the
twitter user points out something negative.
[1] Another triumphant @Ryanair exp. 1 desk open out of 15 with a massive Q. 2 people on desk, 1 booking, 1 just chatting – brilliant
[2] Love Ryanair, best airline ever.. #nextjoke #satonmiarse
[3] Great so @British_Airways have left my back in Norway and I’m in Denmark. Non lols.
[4] Back in rainy Belgium L thanks for my delays #britishairways
In [1], [3] and [4], the twitter users use positive words to describe their negative experience
in the same sentence. For a human annotator, the negative experience that accompanies the
positive words indicates that the tweet is sarcastic. [2] is a slightly different case, as there the
hashtags indicate that the tweet is sarcastic.
Negation, though in a more restricted way, is also an important cause of negative tweets being
labelled positive (16%).
[5] From my door to the gate in 50 minutes, thank god I’m not flying Ryanair… - Oliver
[6] A special little joy grows in my heart every time I book a flight that’s not with Ryanair.
[7] . @British_Airways trust me, it was anything but enjoyable, especially the return trip – back to @VirginAtlantic for me…
[8] @Bitish_Airways rather than thank me for my patience I’d rather you were on time.
The four examples above show that negation seems to be an issue when it occurs in a special
form. In [5] and [6], the LT3-tool has registered “thank” and “joy” as positive words, but
since there is a negation particle before “Ryanair”, the sentiment shifts. However, for the
LT3-tool this is probably not clear as “Ryanair” and not “thank” and “joy” are negated.
Tweets [7] and [8] show that negation does not always occur with the usual negation particles.
In [7] “anything but” is the sequence negating “enjoyable”. In [8] on the other hand, “rather
than” negates or weakens “thank”.
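One possible direction for handling such cues is window-based negation: a negation cue flips the polarity of sentiment words within the next few tokens. The sketch below is purely illustrative (the word lists, the window size and the scoring scheme are our own assumptions, not the LT3-tool's actual implementation):

```python
# Illustrative window-based negation handling: a negation cue (including
# multiword cues such as "anything but" and "rather than") flips the
# polarity of sentiment words in the following few tokens.
POS = {"thank", "joy", "enjoyable", "love", "great", "brilliant"}
NEG_CUES = {"not", "n't", "never", "anything but", "rather than"}

def score(tokens, window=3):
    total = 0
    negate_left = 0  # tokens still covered by the last negation cue
    i = 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2])
        if len(tokens[i:i + 2]) == 2 and bigram in NEG_CUES:
            negate_left = window   # multiword cue opens a negation window
            i += 2
            continue
        tok = tokens[i]
        if tok in NEG_CUES:
            negate_left = window   # single-word cue opens a negation window
        elif tok in POS:
            total += -1 if negate_left else 1  # flip polarity inside the window
        if negate_left:
            negate_left -= 1
        i += 1
    return total

# "anything but" flips "enjoyable" to negative, as in example [7]
print(score("it was anything but enjoyable".split()))
```

Such a heuristic still fails on cases like [5] and [6], where the negated word is the entity itself rather than the sentiment word, illustrating why window-based approaches are only a partial solution.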
Thirdly, some tweets were labelled incorrectly when British Airways or Ryanair was
compared to another company. In this case, positive things are said about other companies,
which makes these tweets negative for British Airways or Ryanair. 17% of the Type 1 errors
are of the “comparison” category.
[9] @jula18 @gems2311 @desiderata2013 I used to use Ryanair much prefer aer lingus now plus you can book seats its not a fee for all.x
[10] The @wizzair airline is a great example of how you can offer cheap flights without having an ass retarded website like ryanair
[11] @greggsulkin @british_airways Virgin Atlantic – solution to flying with British ever again. Soooo much better
[12] why can’t British airways be as cool as American ones???
In all the above examples something good is said about another airline company, which
makes either British Airways or Ryanair look bad. In [9] the twitter user implies that Aer
Lingus is better as passengers do not have to pay when choosing their seats (which they have
to do when booking with Ryanair). In [10] WizzAir is praised for its good website and cheap
tickets. Although the tweet implies that Ryanair also has cheap tickets, which is seen as
something positive, the negative aspect of “an ass retarded website” dominates. In both [11]
and [12] British Airways is seen as a bad airline company when compared to Virgin Atlantic
and American airline companies in general.
The fourth common Type 1 error committed by the LT3-tool is “experience”. These tweets
report on something the twitter user has experienced with one of the airline companies,
leaving it up to the readers to draw their conclusions. In this case, negative conclusions
should be drawn, which is trivial for human annotators but quite an issue for a sentiment
analysis tool. 14% of the erroneously labelled tweets belong to this category.
[13] Dear Ryanair… independent.co.uk/voices/coment…. This story is shocking. Better to save up for longer and travel with a decent airline.
[14] lnkd.in/d48KMsV Well done to Carolyn and the team. Amazing results and well and truly puts Ryanair in its place.
[15] @British_Airways ..furthermore, your Gibraltar contact is a SPANISH NUMBER. I’m Gibraltarian! English is our national language! So …
[16] British Airways bureaucracy & processes: flying with my wife and can’t select her seat next to me because we are in different bookings.
In the Ryanair examples [13] and [14], the twitter users both refer to something unknown to
the readers. Their language use, however, reveals a negative opinion. In [13], for example, it
is implied that Ryanair is not a decent airline, and in [14] the user explicitly says that Ryanair
is “put in its place”, which also has a negative connotation. In the British Airways tweets,
examples [15] and [16], the twitter users share their negative experience. However, neither of
them explicitly says something negative about British Airways. They simply narrate their
(negative) experience without using explicitly negative words and leave it up to the readers to
draw their conclusions. It is understandable that this is not obvious for the LT3-tool, which
makes this a difficult problem to tackle automatically.
Further, in the Ryanair corpus two other error categories can be distinguished: “synonym” (1)
and “Twitter-specific signs” (2). These categories were not found in the British Airways
corpus.
[17] @DominicFarrell herpes is better than. Ryanair !!!
[18] @ihateryanair you guys rock! keep up the good work! these thieves @ryanair must be stopped!
[19] It appears that Jet2 are no better than Ryanair #ripoffmerchants
In [17] Ryanair is clearly seen as a synonym for something negative as it is compared to
herpes and the twitter user expresses his preference for herpes over Ryanair. In [18] and [19]
the Twitter-specific signs @ and # express an important opinion on the tweet. In [18] the
twitter user praises another user, namely @ihateryanair. This is important information as this
clearly expresses a negative attitude towards Ryanair. In [19] on the other hand, the
information in the hashtag #ripoffmerchants is essential as it expresses a negative opinion
towards both Jet2 and Ryanair. It seems that the LT3-tool is unable to split these hashtags and
at-mentions into separate linguistic units.
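Splitting such concatenated hashtags is feasible in principle with dictionary-based segmentation. The sketch below is a minimal illustration (the tiny LEXICON is invented for the example; a real system would use a full word list or a language model):

```python
# Minimal sketch of hashtag segmentation: split a hashtag body such as
# "ripoffmerchants" into known words via dynamic programming, preferring
# the segmentation with the fewest words. The LEXICON is illustrative only.
LEXICON = {"rip", "off", "ripoff", "merchants", "next", "joke",
           "i", "hate", "ryanair"}

def segment(tag, lexicon=LEXICON):
    """Return the shortest segmentation of `tag` into lexicon words,
    or None when no full segmentation exists."""
    n = len(tag)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in lexicon:
                cand = best[j] + [tag[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

print(segment("ripoffmerchants"))  # ['ripoff', 'merchants']
print(segment("ihateryanair"))     # ['i', 'hate', 'ryanair']
```

Once split, the recovered words (“ripoff”, “hate”) could be scored like ordinary tokens, which would make the sentiment in examples [18] and [19] visible to the tool.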
2.3.2. Type 2: negative tweets labelled as neutral
Type 2 includes all tweets that the human annotator labelled negative, but which were
erroneously labelled neutral by the LT3-tool. In total, 82 errors of this type were
committed, 31 in the British Airways corpus and 51 in the Ryanair corpus. These errors
belong to different categories, of which three stand out: “sarcasm”, “experience” and “synonym”.
Again sarcasm seems to pose the biggest problem for the LT3-tool. In total 45% of the tweets
were falsely labelled neutral because of sarcasm. Although the twitter users use either neutral
words or an equal amount of negative and positive words in this case, the overall message in
the tweet is sarcastic.
[20] I think a walk and a swim will get us home quicker than @Ryanair tonight #clontarf #dublinairport pic.twitter.com/od21YJwddk
[21] @The_Old_Penfold @howthingslook @jana_obscura @GregorServais Ryanair, right, where you have to peddle to make the plane go.
[22] When the lady at check in told me I had an aisle seat , what she meant was “here’s a middle seat” – thanks @BritishAirways #terminal5 #a380
[23] The @British_Airways Avios frequent flyer points are about as useful as Monopoly money.
In both [20] and [21] the twitter users make jokes implying that going home with Ryanair
takes a long time and no explicitly negative words are used. A human annotator however can
detect the joke while the LT3-tool takes into account the linguistic features only. In [23] the
twitter user also makes a joke as he alludes to the frequent flyer points being worth nothing
(as Monopoly money is also worthless). Finally, tweet [22] is a classic example of sarcasm as
the “thanks @BritishAirways” is simply not sincere. This can be detected from the context.
Also in Type 2, “experience” is an error occurring regularly (24%). In this case, twitter users
tell a story which has a negative outcome, but they do not use explicitly negative words and
therefore the LT3-tool labelled these tweets as neutral.
[24] Ryanair – was your website built in 1999?
[25] At Rygge ‘Oslo’ Airport, which, to be honest Ryanair, is only near Oslo if you zoom really really far out on your maps.
[26] Just looked at the twitter id of @British_Airways and they only look at tweets until 5pm – so much for #globalservice
[27] I really do wish @British_Airways would call me to tell me where my luggage that they lost is. Am on holiday with just clothes I’m wearing!
In tweets [24], [25] and [26], the twitter users remark on a bad aspect of Ryanair or British
Airways. Although it may read like an ordinary comment, it is actually a form of criticism;
these tweets therefore cannot be seen as neutral, as they clearly voice critique. The same
occurs in [27], but as opposed to the other three tweets, this message has a more personal
connotation, as this twitter user really wants to know where his clothes are.
The third important error is “synonym” which occurs when British Airways or Ryanair are
seen as a synonym of something bad (12%). The “synonym” errors are mainly made in the
Ryanair corpus rather than in the British Airways corpus. This difference can be explained by
Ryanair having a very bad reputation compared to British Airways.
[28] So @northernrailorg is the Ryanair of trains.
[29] We are staying at a Ryanair hotel.
[30] @davidfrum This is the RyanAir approach: if you’re in the same time zone, it’s close enough.
[31] What is this?! British Airways? pic.twitter.com/oYAwhaYT8J
In the first three examples from the Ryanair corpus, [28], [29] and [30], the company is
clearly perceived as something negative. Tweets [28] and [29] are fairly short, but for a
human annotator it is still obvious that these are negative tweets about Ryanair, as he or she
has background knowledge of Ryanair’s bad reputation. In [30], the twitter user confirms
that “the RyanAir approach” has a negative connotation by explaining what is wrong with
that approach. In [31] it is clear that this is a negative tweet, even without consulting the
image, as “what is this?!” would not be used in a positive context.
Finally, three minor error categories are left: “negation” (7), “comparison” (7) and “Twitter-
specific signs” (1). The “negation” errors can be found in both corpora, three in the British
Airways and four in the Ryanair corpus. The “comparison” errors however are only present in
the Ryanair corpus, while the “Twitter-specific signs” error can be found in the British
Airways corpus.
[32] Flying British Airways 747, should have plenty of room.
[33] used to Ryanair flying business class in Emirates this time is like being in another planet
[34] I’m a patient man but now been on hold 41 minutes at british airways executive club #ba #britishairways #costumerserviceneedsimprovement
Tweet [32] demonstrates once more (cf. [7] and [8]) that other sequences, in this case
modification particles, can also cause negation. In this example, “should have” negates that
there is “plenty of room”; the tweet is therefore negative, as the twitter user regards this as a
drawback. In [33], Ryanair is compared to another airline company, Emirates. The
comparison does not favour Ryanair, as Emirates appears to be better; however, no explicitly
negative words were used to express that sentiment. Finally, in [34] the hashtags contain the
most important information of the tweet. Since they consist of tokens which are all
concatenated, the negative words in the hashtag cannot be detected by a sentiment analysis tool.
2.3.3. Type 3: positive tweets labelled as negative
Type 3 includes all tweets that the human annotator labelled positive, but which were
erroneously labelled negative by the LT3-tool. In total, 35 errors of this type were
committed, 19 in the British Airways corpus and 16 in the Ryanair corpus. These errors
belong to different categories, of which two stand out: “negation” and “comparison”.
Overall, “negation” accounts for 31% of the errors.
[35] Amazed - new Ryanair website didn't drive me mad when checking in. Ryanair in improved website shocker!!
[36] Ryanair booked us onto new flights, no charge, despite the lost lorry cargo on the M11 being absolutely nothing to do with them. Impressed.
[37] Just got the British Airways magazine with the staff holidays in. Caribbean for £30 a night doesn’t sound too bad.
[38] Not at all keen on T4, not only is it miles from anywhere but more importantly there’s a distinct lack of @British_Airways
Tweet [35] expresses a genuinely positive comment on the new Ryanair website: the twitter
user negates that the website drove him mad, which was probably the case with the old
website. Tweet [36] also reports something positive about Ryanair, as the negative word
“charge” is negated. In [37], the problem arises that the sentiment analysis tool does not
know which word the negation particle applies to; in this example, “bad” is negated although
the negation particle is positioned before “sound”. Finally, in [38] several negation particles
apply to both “T4” and “British Airways”. The most important negation, however, “there’s a
distinct lack of @British_Airways”, expresses a positive comment on British Airways.
In Type 3, “comparison” errors are also a cause of concern, occurring 34% of the time. The
airline companies are compared to others and, in this case, the positive things apply to
either British Airways or Ryanair. Although the difference is not significant, the bad
reputation of Ryanair could explain the lower number of “comparison” errors in that corpus.
[39] Have booked flights for @buildstuffit.It I have gone with ryanair. Yes, I hate myself too, but its cheap and direct, and the times are good
[40] €124/£103 on Aer Lingus, €49/£41 on Ryanair to Amsterdam. I’m kind of tempted… [Granted I’d have to book a hotel and shit too]
[41] I hate easyjet, never me ruder staff. Nothing in comparison to British Airways. #easyjet #britishairways
[42] Fuck me, British Airways just make every other airline look like a incompetent arseholes when it comes to service. #1stClass
At first sight, [39] might seem a negative tweet, as the twitter user declares that he hates
himself because he chose Ryanair. However, the positive aspects of Ryanair compared with
other airline companies (cheap, direct, good timetable) dominate in this tweet. A difference
with other comparison errors is that Ryanair is not compared with another specific airline
company, but with all companies in general. The same holds for [42], where the twitter user
declares that British Airways is the best of all airline companies. In [40] the twitter user
stresses the cheap tickets Ryanair offers in comparison with Aer Lingus. Finally, in [41] all
negative aspects apply to EasyJet, as British Airways is depicted as a very good company.
Finally, the other Type 3 errors can be divided into four categories: “sarcasm” (5),
“experience” (1), “ambivalent” (1) and “side-taking” (5). Three “sarcasm” errors were found
in the British Airways corpus and two in the Ryanair corpus. Both the “experience” and the
“ambivalent” error are found in the British Airways corpus. Finally, four “side-taking” errors
can be found in the British Airways corpus and one in the Ryanair corpus.
[43] My plane ticket was so cheap I called British Airways to make sure it wasn’t a scam
[44] better download and upload speed begging it off british airways’ internet than in my hous I feel like crying
[45] the british airways waterside is sick
[46] Everyone is bad mouthing the British Airways and I’m like wait what, what happened?
In [43] the twitter user makes a joke about the price of his ticket, which can therefore be
categorized as sarcasm. Tweet [44] recounts an experience a British Airways client had with
their internet; although a negative word (“crying”) is used, it is clear to a human annotator
that this is a positive tweet for British Airways. The error in tweet [45] is labelled
“ambivalent” because “sick” is usually a negative word, but in this context it is used
positively as a slang word expressing that something is very good. Finally, the twitter user in
[46] sides with British Airways when he hears that some people have been bad-mouthing them.
2.3.4. Type 4: positive tweets labelled as neutral
Type 4 includes all tweets that the human annotator labelled positive, but which were
erroneously labelled neutral by the LT3-tool. In total, 21 errors of this type were
committed, 12 in the British Airways corpus and 9 in the Ryanair corpus. These errors
belong to different categories, of which two stand out: “comparison” and “experience”.
Overall, 38% of the errors belong to the category “experience”; in this case, the twitter user
relates a positive experience.
[47] @askDUBairport is ryanair really letting us have two carry on bags?
[48] Italian newspaper announcing that Ryanair will allow a 2nd item of cabin baggage (eg handbag) as of Dec 1st. Almost dropped my phone
[49] @vraiment_moi I used to work for british airways many years ago and every year me and the misses get free flights anywhere in the world
[50] @str8edgeracer where are you headed? I used to fly into heathrow for work all the time. Go to the british airways loung.
Both [47] and [48] report on the news that Ryanair allows a second carry-on bag on its
airplanes. For a human annotator it is obvious that both twitter users react positively to the
news, but as there are no explicitly positive words in the tweets, this may not be as obvious
for a sentiment analysis tool. As for [49], the twitter user has a very good experience with
British Airways, as he and his wife receive free tickets from the company; the experience as a
whole is therefore positive. The twitter user in [50] clearly had a positive experience in the
British Airways lounge, although he does not mention that explicitly. A human annotator,
however, can deduce that the British Airways lounge must be good because otherwise the
twitter user would not send others there.
In Type 4, 33% of the committed errors belong to the category “comparison”. In this case,
either British Airways or Ryanair is seen as the good company when compared to others.
[51] @Niamhyy you should get a free flight~all the hassle you are going through. If they don’t answer book with Ryanair
[52] First ever Ryanair trip went without a hitch. How very unlike @British_Airways #ABBA
[53] @NinaWarhurst @British_Airways if it was Ryanair or easy jet they would of charged for being ill and call it a administration charge !!!!!
[54] Now THAT’S a billboard! British Airways - #lookup in Piccadilly Circus: youtu.be/GtJx_pZjvzc via @youtube
In [51] no other airline company is mentioned, but Ryanair is clearly presented as the better
company, as the twitter user thinks that @Niamhyy would not have that kind of trouble with
Ryanair. In [52] British Airways and Ryanair are compared and the twitter user has an
obvious preference for Ryanair; for a sentiment analysis tool, however, this does not seem
to be clear. In [53] British Airways is praised for not asking an extra charge, something the
twitter user is positively surprised by, as he points out that other airline companies (Ryanair
and EasyJet) would do so. Finally, in [54] the new British Airways billboard is
complimented; in comparison to other billboards, the twitter user finds it brilliant.
Finally, other Type 4 errors can be divided into five categories: “sarcasm” (1), “negation” (1),
“ambivalent” (2), “side-taking” (1) and “Twitter-specific signs” (1). The errors “sarcasm”,
“ambivalent” and “Twitter-specific signs” can be found in the British Airways corpus while
the errors “negation” and “side-taking” can be found in the Ryanair corpus.
[55] I mind when I got to sit in first class on British airways cos there were no seats left in economy
[56] 20 mins late…… this wouldnt happen on a ryanair flight…
[57] In British Airways businessclass last night. They spoilt us rotten. fb.me/1raZdGW4e
[58] Dear @Ryanair if it’s true that Muslims are going to boycott you after o’learys anti burka comments,I am willing to pay double to travel.
[59] British Airways > American Airlines
In the sarcastic tweet [55], “I mind” is used sarcastically, as no one would mind being seated
in first class instead of economy. In [56] it is negated that a Ryanair flight would be 20
minutes late. For a sentiment analysis tool this may not be clear, however, as the negation
particle and “late” are in different sentences. Further, “rotten” can be seen as an ambivalent
word in [57], since in combination with “to spoil” it is a very positive word. In [58], Ryanair
receives support from a twitter user for the anti-burka comments of its CEO. Finally, a
sentiment analysis tool is not able to read “>” in [59], which is positive for British Airways.
2.3.5. Type 5: neutral tweets labelled as negative
Type 5 includes all tweets that the human annotator labelled neutral, but which were
erroneously labelled negative by the LT3-tool. In total, four errors of this type were
committed, one in the British Airways corpus and three in the Ryanair corpus. Three of
these errors can be placed in the category “comparison”; the fourth error is of the
“experience” category. Type 5 errors form a rather small group, which can be explained by
the fact that few tweets were labelled neutral by the human annotator.
[60] @British_Airways sad just had worst travel exp of my life with B A #Luggage took longer than actual flite to shop up. Worse than ryanair!
[61] what I wouldn’t give to hear that Ryanair trumpet right now…. delays again @British_Airways
[62] RT @MarketingWeekEd: Ryanair is trying to shake off its reputation for poor costumer service bit.ly/1cusnPN
[63] British Airways pulling out of Manchester was disappointing but attracted other airlines. Market will change over next 5 yrs.
A lot of negative words are used in [60], as the twitter user reports on a negative experience
with British Airways. For Ryanair, however, as this is a tweet from the Ryanair corpus, this
is neither a negative nor a positive tweet, since negative and positive elements balance each
other out: the twitter user believes that Ryanair is better than British Airways, but at the
same time Ryanair is used as a yardstick for bad airlines. Also in [61], negative and positive
aspects of Ryanair are balanced out. The twitter user points out that there would be no
delays with Ryanair, contrary to British Airways, but he also expresses his annoyance with
the Ryanair trumpet. In [62], the twitter user retweets a newspaper article; it is, however,
not clear which sentiment he has towards it, which is why this tweet was labelled neutral.
Finally, in [63], positive and negative elements are balanced out again, as the twitter user
claims he is disappointed that British Airways pulled out of Manchester, but on the other
hand he also seems to accept it.
2.3.6. Type 6: neutral tweets labelled as positive
Type 6 includes all tweets that the human annotator labelled neutral, but which were
erroneously labelled positive by the LT3-tool. In total, six errors of this type were
committed, five in the British Airways corpus and one in the Ryanair corpus. The errors can
be divided into two categories: “comparison” and “experience”. Type 6 errors also form a
rather small group, as few tweets were labelled neutral by the human annotator.
66% of the Type 6 errors can be placed in the “experience” category.
[64] I’m wondering if @rewardgateway is a bit like the old British Airways in that it’s staff have to pass a test for extremely good lookingness.
[65] Wow! The British Airways app works with the iPhone Passbook. This is actually the first time I use it in some useful way
[66] British Airways I love you but never again will I fly with you when I’m pregnant
[67] Love airlines that claim you can choose your seat in advance. And then you can’t. #britishairways
Generally, in all the above examples, the twitter user expresses both negative and positive
things which balance each other out; the LT3-tool, however, registered only the positive
comment. In [64] the positive comment is clearly present, as the twitter user explicitly
mentions that the British Airways staff is good-looking; at the same time, however, he
denounces that the staff is hired based on their looks. The same occurs in [65]: the twitter
user is satisfied with the British Airways app, but he also implies that it took a long time
before he could use it in a useful way and that it is therefore not that useful. The twitter user
in [66] states that she likes to fly with British Airways, but also that they do not treat their
pregnant passengers correctly. Finally, in [67] the twitter user says that he likes the
possibility of choosing your seat in advance with British Airways (positive aspect); however,
it seems that he cannot do so at the time (negative aspect).
The other 33% of the Type 6 errors can be categorised as “comparison” errors.
[68] Good luck to the GOALies ,Philippines bound Sunday morn with 40 tonnes of aid onboard free Aer Lingus plane. Ryanair next ??
[69] @BritishAirways always thought your food to be the best but for flying experience @VirginAtlantic just can’t be beat!
In [68] the twitter user points out that Aer Lingus sends help to the Philippines and simply
asks whether Ryanair would do the same. The positive words in this tweet therefore apply to
Aer Lingus; for Ryanair, it is a neutral tweet. In [69], British Airways is compared to Virgin
Atlantic. The twitter user mentions positive points of both companies, which is probably the
reason why the tweet was labelled positive by the sentiment analysis tool. For British
Airways, however, this is neither a negative nor a positive tweet.
2.4. Interpretation of the error analysis
In this section, we will discuss how the results of the qualitative error analysis can be
interpreted. Some general conclusions will be drawn which are then used to formulate some
recommendations in order to help improve the performance of the LT3-tool.
First of all, we can divide the eight error categories into two groups: main errors and minor errors. "Sarcasm", "negation", "comparison" and "experience" belong to the group of main errors, while "synonym", "ambivalent", "side-taking" and "Twitter-specific signs" belong to the group of minor errors. In the literature study, "sarcasm" and "negation" were already put forward as two important sentiment detection problems.
A second conclusion applies to the sarcasm errors. The majority of the sarcasm errors were made in the Type 1 and Type 2 tweets, which were labelled negative by the human annotator and erroneously, i.e. positive or neutral, by the LT3-tool. We can therefore conclude that sarcasm is mostly used to express a negative opinion and that the LT3-tool cannot detect this sarcasm. Although Davidov et al. (July 2010) and González-Ibáñez et al. (2011) reported that their classifiers performed well on a sarcastic dataset, it is important to point out that their training was done on a gold-standard dataset containing only tweets with the hashtag "#sarcasm". Based on our examples, however, we clearly noted that not all sarcastic tweets are labelled explicitly with "#sarcasm".
A third finding is that the negation error mainly occurred in Type 1 and Type 3 labelling errors. This means that when negation is used, it mostly shifts the entire sentiment. The LT3-tool often had problems when the negation particle did not directly negate or modify the sentiment words but other words within the tweet (further away from the negation particle). The bag-of-words model by Pang et al. (2002), in which every token between a negation word and the following punctuation mark is tagged, might be useful to solve this problem. We agree with Wilson et al. (2005) that negation features improve the performance of a sentiment analysis tool. Further, our analysis confirms some of the concerns that Wiegand et al. (2010) observed, namely that world knowledge is sometimes necessary to label a tweet correctly and that rare constructions can be negating (cf. modification particles).
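The negation-tagging scheme of Pang et al. (2002) can be sketched as follows. This is a minimal illustration only: the cue and punctuation lists are assumptions, not the LT3-tool's actual configuration.

```python
NEGATION_CUES = {"not", "no", "never", "n't", "hardly"}  # illustrative cue list
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def mark_negation(tokens):
    """Tag every token between a negation cue and the next punctuation mark.

    Follows the scheme of Pang et al. (2002): tagged tokens become distinct
    bag-of-words features, so "useful" and "useful_NEG" are kept apart.
    """
    marked, in_scope = [], False
    for tok in tokens:
        low = tok.lower()
        if low in PUNCTUATION:
            in_scope = False  # punctuation closes the negation scope
            marked.append(tok)
        elif low in NEGATION_CUES:
            marked.append(tok)
            in_scope = True   # open a negation scope
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked
```

Applied to a tweet such as "the app is not useful at all .", this yields "useful_NEG", "at_NEG" and "all_NEG", so a classifier can learn that negated sentiment words behave differently.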
A fourth conclusion which could be drawn is that the synonym errors occurred almost exclusively (with one exception) in the Ryanair corpus. This difference between the two corpora could be explained by Ryanair's bad reputation. The presence of background information is an important factor in this case, as a human annotator is aware of the reputation of a company, while a sentiment analysis tool is not. Also in sarcastic tweets, background information is sometimes necessary in order to grasp the correct meaning.
A fifth conclusion is that the "comparison" error is the sole error that is present in all error types. However, similar to the "sarcasm" error, it is mostly present in Type 1 and Type 2 errors: tweets which were labelled negative by the human annotator and respectively positive and neutral by the LT3-tool. The main problem in those tweets with the error "comparison" is that the LT3-tool did not know which words applied to which company. This could be explained by a lack of dependency features.
A sixth conclusion which could be drawn is that important information in hashtags and @mentions often cannot be detected by the LT3-tool, since these contain many concatenated words. As a consequence, important sentiment information is not considered by the LT3-tool.
A final conclusion is that our presumption that reputation influences the number of negative/positive tweets is confirmed. The number of negative tweets in the Ryanair corpus was considerably higher (twenty more negative tweets) than in the British Airways corpus. This difference was most likely caused by Ryanair's reputation, which is much worse than that of British Airways.
2.5. Recommendations
Based on this analysis, we can propose some recommendations to improve the performance of the LT3-tool. As mentioned earlier, the error analysis confirms the concerns discussed in Section 2.3 of the literature study, namely that sarcasm and negation cause substantial problems for sentiment classification tools. Despite the negation features added to the LT3-tool, negation still seems to cause problems, especially when it is not obvious to which word the negation particle applies. Moreover, negation is sometimes not expressed in the ordinary way (i.e. not with an ordinary negation particle, but with a phrase equivalent to negation or by means of modifiers). This could be solved by adding negation features which address this deficit, as well as additional negation and modality cues.
A second recommendation consists in adding sarcasm features, which are currently not incorporated in the LT3-tool; their absence is clearly visible in the results of the error analysis. Also, the hashtag "#sarcasm" should not be seen as the sole hashtag indicating sarcasm (cf. "#nextjoke" in example [2]); features considering such hashtags could therefore help improve the LT3-tool. Further, a sarcasm lexicon could be drawn up containing words that are typically used in sarcastic tweets. It is difficult to put forward such examples as we did not study this matter, but the "non lols" sequence in example [3] could be a sequence indicating sarcasm. This sarcasm lexicon could then be added as a lexicon feature.
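Such a feature group could be sketched as follows. Apart from "#sarcasm", "#nextjoke" and "non lols", which come from the examples in this study, the listed indicators are purely hypothetical placeholders, not a validated lexicon.

```python
# Hypothetical sarcasm indicators; "#sarcasm", "#nextjoke" and "non lols"
# come from this study's examples, the others are illustrative guesses.
SARCASM_HASHTAGS = {"#sarcasm", "#nextjoke", "#irony"}
SARCASM_PHRASES = {"non lols", "yeah right"}

def sarcasm_features(tweet):
    """Return binary sarcasm features for one tweet (illustrative sketch)."""
    lower = tweet.lower()
    return {
        "has_sarcasm_hashtag": any(tag in lower for tag in SARCASM_HASHTAGS),
        "has_sarcasm_phrase": any(phrase in lower for phrase in SARCASM_PHRASES),
    }
```

These binary features could then simply be appended to the classifier's existing feature vector.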
Thirdly, more dependency features should be added, as the LT3-tool currently often makes erroneous links between different words (i.e. which negative or positive words apply to which company when a comparison is made). Words which are further apart, especially, are often linked erroneously.
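The attribution problem can be illustrated with a toy heuristic that credits each sentiment word to the most recently mentioned company. Both lexicons below are hypothetical, and a real solution would rely on dependency parses rather than on word order.

```python
# Toy lexicons for illustration only.
COMPANIES = {"@britishairways": "British Airways", "@virginatlantic": "Virgin Atlantic"}
SENTIMENT = {"best": 1, "beat": 1, "worst": -1, "delayed": -1}

def attribute_sentiment(tokens):
    """Credit each sentiment word to the last company mentioned before it.

    A crude stand-in for dependency features: it handles tweets like [69],
    but fails whenever a sentiment word precedes its actual target.
    """
    scores, current = {}, None
    for tok in tokens:
        low = tok.lower()
        if low in COMPANIES:
            current = COMPANIES[low]   # switch attribution target
            scores.setdefault(current, 0)
        elif low in SENTIMENT and current is not None:
            scores[current] += SENTIMENT[low]
    return scores
```

Applied to example [69], this heuristic credits "best" to British Airways and "beat" to Virgin Atlantic, instead of assigning all positive words to a single company.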
Fourth, the tool should somehow be able to consider background information on the companies concerned, as Twitter users expect their readers to know something about, for example, the reputation of a certain company (cf. the Ryanair corpus in this dissertation). Adding such features could help diminish the number of "synonym" errors. However, it might be complex to add such features, as they would need frequent revision because reputations can shift over time (cf. Ryanair is trying to change its reputation).
Finally, another important pre-processing step should be added which enables the tool to segment hashtags and @mentions. At present, important sentiment indicators sometimes occur in those hashtags or @mentions, but they cannot be detected by the LT3-tool.
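Such a segmentation step could be sketched as a longest-match-first split against a word list. The tiny vocabulary below is an assumption for illustration only; a real system would need a full dictionary.

```python
# Tiny illustrative vocabulary; a real system would use a full word list.
VOCAB = {"british", "airways", "worst", "flight", "ever", "late", "again"}

def segment_hashtag(tag, vocab=VOCAB):
    """Greedily split a hashtag such as '#worstflightever' into known words.

    Tries the longest dictionary word first at each position; returns the
    original tag unchanged when no full segmentation is found.
    """
    text = tag.lstrip("#@").lower()
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            return [tag]  # unknown substring: give up
    return words
```

The recovered words ("worst", "flight", "ever") could then be passed to the sentiment lexicons like any other tokens, instead of being lost inside a single concatenated string.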
CONCLUSION
This dissertation consisted of two main parts: on the one hand, a literature study on sentiment analysis in general and on sentiment analysis in Twitter specifically; on the other hand, a corpus analysis.
In the literature study, we provided an overview of research that has been conducted on sentiment analysis in general and we discussed the state of the art on sentiment analysis in Twitter. In Section 1 of the literature study, some basic concepts were explained, such as sentiment and subjectivity. Further, the two steps in the sentiment classification process, subjectivity classification and sentiment classification, were defined. The subclasses in the sentiment classification process (document level, sentence level, word level and feature level) were also briefly described. In Section 2, we presented an overview of the current state of the art on sentiment analysis in Twitter specifically, since the specific linguistic features in tweets call for a different approach. We first provided some basic information about Twitter itself, after which we made a subdivision between two main approaches to sentiment analysis in Twitter: shallow machine learning approaches and approaches incorporating more linguistic knowledge. As far as the shallow machine learning approaches are concerned, we can conclude that these already form a solid baseline for performing sentiment analysis on Twitter data. Using features such as lexicons and part-of-speech (PoS) tags, these classifiers achieve good results. However, deeper linguistic phenomena such as sarcasm and negation cause problems for these shallow classifiers. The approaches incorporating more linguistic knowledge try to solve these problems. We provided an overview of studies on incorporating sarcasm and negation features in classifiers, and we can conclude that adding such features to a classifier helps.
In the second part of this dissertation, the corpus analysis, we tested the performance of the LT3-tool, a classifier developed by the Language and Translation Technology Team of Ghent University. We collected a corpus of tweets reporting on airline companies (British Airways and Ryanair), which were labelled manually for sentiment (positive, negative and neutral). In the next phase, the labels of the human annotator were compared to those of the classifier. The corpus analysis itself consisted of both a quantitative and a qualitative analysis. In the quantitative analysis, three values were calculated: accuracy, precision and recall. In the qualitative analysis, an extensive error analysis was performed.
We can conclude that the LT3-tool performs poorly on our dataset; it obtained an accuracy of only 21%. This was expected, however, since our corpus deliberately consisted of tweets which should be difficult to label automatically. As to precision, we found that the classifier performed best on the negative tweets; for recall, on the other hand, the LT3-tool performed best on the neutral tweets. In order to gain more insight into the errors committed, we performed an in-depth error analysis which revealed some recurring difficulties. We distinguished eight different error categories, four of which caused substantial problems for the LT3-tool: "sarcasm", "negation", "comparison" and "experience". Other, minor problems were "synonym", "ambivalent", "side-taking" and "Twitter-specific signs". In the literature study, the errors "sarcasm" and "negation" had already been put forward as important shortcomings of other sentiment analysis tools. Our analysis confirmed those concerns, as "sarcasm" and "negation" still posed significant problems even though negation features, for example, were included in the classifier. This error analysis enabled us to propose some recommendations for improving the performance of the LT3-tool. Although negation and dependency features were already considered in the LT3-tool, we proposed adding more of them: a feature which also treats modifiers as negation, and additional dependency features so that words which are further apart can also be linked. Further, we recommended adding two new feature groups: first, features that account for sarcasm, which is currently not recognised by the classifier; secondly, a feature group which can in some way represent background information. The latter is not self-evident, however, since such features would need a lot of supervision, as reputations, for example (which can be seen as a kind of background information), can shift. Finally, adding a pre-processing step to the tool which is able to segment @mentions and hashtags would be very useful, as a lot of important sentiment information in those Twitter-specific signs is currently lost.
We hope this study has shown that an extensive linguistic analysis offers valuable insights for improving sentiment analysis in Twitter. It would be very interesting to see in follow-up research whether the recommendations put forward in this dissertation can be implemented in the LT3-tool and whether they actually improve the automatic sentiment analysis of our corpus.
BIBLIOGRAPHY
Barbosa, L. & Feng, J. (2010). Robust Sentiment Detection on Twitter from Biased and Noisy
Data. Coling 2010: Poster Volume. p36-44
Davidov, D., Tsur, O. & Rappoport, A. (August 2010). Enhanced Sentiment Learning Using
Twitter Hashtags and Smileys. Coling 2010: Poster Volume. p241-249
Davidov, D., Tsur, O. & Rappoport, A. (July 2010). Semi-Supervised Recognition of
Sarcastic Sentences in Twitter and Amazon. Proceedings of the Fourteenth
Conference on Computational Natural Language Learning. p107-116
Esuli, A. & Sebastiani, F. (2006). Determining Term Subjectivity and Term Orientation for
Opinion Mining. EACL. Vol. 6. p193-200
Go, A., Bhayani, R., & Huang, L. (2009). Twitter sentiment classification using distant
supervision. CS224N Project Report, Stanford. p1-12.
González-Ibáñez, R., Muresan, S. & Wacholder, N. (2011). Identifying Sarcasm in Twitter: A
Closer Look. Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Short Papers. p581-586
Hu, M. & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). p168-
177.
Jia, L., Yu, C. & Meng, W. (2009). The Effect of Negation on Sentiment Analysis and
Retrieval Effectiveness. Proceedings of the 18th ACM Conference on Information and
Knowledge Management. p1827-1830
Kaplan, A. M. & Haenlein, M. (2011). The early bird catches the news: Nine things you
should know about micro-blogging. Business Horizons. 54, p105-113.
Kim, S.M. & Hovy, E. (2004). Determining the Sentiment of Opinions. Proceedings of the
20th International Conference on Computational Linguistics. p1367-1373
Kouloumpis, E., Wilson, T. & Moore, J. (2011). Twitter Sentiment Analysis: The Good the
Bad and the OMG! Association for the Advancement of Artificial Intelligence. p538-541
Kumar, A. & Sebastian, T.M. (2012). Sentiment Analysis: A Perspective on its Past, Present
and Future. I.J. Intelligent Systems and Applications. 10, p1-14.
Liu, B. (2010). Sentiment Analysis and Subjectivity. In: Indurkhya, N. and Damerau, F.J.
Handbook of Natural Language Processing. 2nd ed. Cambridge, UK: Chapman and
Hall/CRC. p627-667.
Liu, K.L., Li W.J. & Guo, M. (2012). Emoticon Smoothed Language Models for Twitter
Sentiment Analysis. Association for the Advancement of Artificial Intelligence. p456-
462
Liu, J. & Seneff, S. (2009). Review Sentiment Scoring via a Parse-and-Paraphrase Paradigm.
Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing. p161-169
Mohammad, S.M. (2012). #Emotional Tweets. Proceedings of the First Joint Conference on
Lexical and Computational Semantics (*SEM). p246-255
Mohammad, S.M., Kiritchenko, S. & Zhu, X. (2013). NRC-Canada: Building the State-of-
the-Art in Sentiment Analysis of Tweets. Proceedings of the Seventh International
Workshop on Semantic Evaluation Exercises (SemEval-2013). Atlanta, Georgia, US.
Mohammad, S.M. & Turney, P.D. (2010). Emotions Evoked by Common Words and Phrases:
Using Mechanical Turk to Create an Emotion Lexicon. Proceedings of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies 2010 Workshop on Computational Approaches to Analysis
and Generation of Emotion in Text. p26-34
Mohammad, S.M. & Yang, T. (2011). Tracking Sentiment in Mail: How Genders Differ on
Emotional Axes. Proceedings of the 2nd Workshop on Computational Approaches to
Subjectivity and Sentiment Analysis (WASSA 2011). p70-79
Molloy, A. (2014). British Airways climbs to new heights as it tops UK brands list. Available:
http://www.independent.co.uk/news/uk/home-news/british-airways-climbs-to-new-
heights-as-it-tops-uk-brands-list-9148758.html. Last accessed 16th May 2014.
Nakov, P., Kozareva, Z., Ritter, A., Rosenthal, S., Stoyanov, V. & Wilson, T. (2013).
SemEval-2013 Task 2: Sentiment Analysis in Twitter. Second Joint Conference on
Lexical and Computational Semantics (*SEM), Volume 2: Seventh International
Workshop on Semantic Evaluation (SemEval 2013). p312-320
Nasukawa, T. & Yi, J. (2003). Sentiment analysis: Capturing favorability using natural
language processing. Proceedings of the 2nd International Conference on Knowledge
Capture. p70-77
Pang, B. & Lee, L. (2004). A Sentimental Education: Sentiment Analysis Using Subjectivity
Summarization Based on Minimum Cuts. Proceedings of the 42nd Annual Meeting of
the Association for Computational Linguistics. p271-278
Pang, B., Lee, L. & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification Using
Machine Learning Techniques. Proceedings of the ACL-02 Conference on Empirical
Methods in Natural Language Processing-Volume 10. p79-86
Polanyi, L. & Zaenen, A. (2004) Context Valence Shifters. Proceedings of the Advancement
of Artificial Intelligence Conference Spring Symposium on Exploring Attitude and
Affect in Text. p106-111
Travelmole. (2014). Ryanair to boost reputation with new marketing chief. Available:
http://www.travelmole.com/news_feature.php?news_id=2009841. Last accessed 16th
May 2014.
Van Hee, C., Van de Kauter, M., De Clercq, O., Lefever, E. & Hoste, V. (2014). LT3:
Sentiment Classification in user-generated content using a rich feature set. Submitted
to the eighth international workshop on Semantic Evaluation Exercises (SemEval-2014)
Vizard, S. (2013). Brand Audit: Ryanair. Available:
http://www.marketingweek.co.uk/sectors/travel-and-leisure/brand-audit-
ryanair/4008433.article. Last accessed 16th May 2014.
Wiegand, M., Balahur, A., Roth, B., Klakow, D. & Montoyo, A. (2010). A Survey on the
Role of Negation in Sentiment Analysis. Proceedings of the Workshop on Negation
and Speculation in Natural Language Processing. p60-68
Wilson, T., Wiebe, J. & Hoffmann, P. (2005). Recognizing Contextual Polarity in Phrase-
Level Sentiment Analysis. Proceedings of Human Language Technology Conference
and Conference on Empirical Methods in Natural Language Processing
Winch, J. (2014). British Airways tops UK brand ratings. Available:
http://www.telegraph.co.uk/finance/newsbysector/retailandconsumer/10656804/Britis
h-Airways-tops-UK-brand-rankings.html. Last accessed 16th May 2014.
APPENDIX I: MATRICES, ACCURACY, RECALL AND PRECISION
GENERAL
                            Predicted sentiment (machine)
Overall sentiment (human)   Negative    Positive    Neutral
Negative                    37          90          82
Positive                    35          19          21
Neutral                     4           6           6
Accuracy = (37 + 19 + 6) / 300 = 0.21 = 21%
Precision(positive) = 19 / (19 + 90 + 6) = 0.17 = 17%
Recall(positive) = 19 / (19 + 35 + 21) = 0.25 = 25%
Precision(negative) = 37 / (37 + 35 + 4) = 0.49 = 49%
Recall(negative) = 37 / (37 + 90 + 82) = 0.18 = 18%
Precision(neutral) = 6 / (6 + 82 + 21) = 0.06 = 6%
Recall(neutral) = 6 / (6 + 4 + 6) = 0.38 = 38%
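For verification, these values can be recomputed from the confusion matrix with a short sketch (rows are human labels, columns are machine labels, in the order negative, positive, neutral):

```python
# Confusion matrix from the table above: rows = human label,
# columns = machine label, order: negative, positive, neutral.
MATRIX = [
    [37, 90, 82],  # human: negative
    [35, 19, 21],  # human: positive
    [4,  6,  6],   # human: neutral
]

def accuracy(m):
    # Correct predictions sit on the diagonal.
    return sum(m[i][i] for i in range(len(m))) / sum(sum(row) for row in m)

def precision(m, i):
    # Of everything the machine labelled as class i, how much was correct?
    column_total = sum(row[i] for row in m)
    return m[i][i] / column_total if column_total else 0.0

def recall(m, i):
    # Of everything the human labelled as class i, how much did the machine find?
    row_total = sum(m[i])
    return m[i][i] / row_total if row_total else 0.0
```

For instance, accuracy(MATRIX) yields 62/300 ≈ 0.21 and precision(MATRIX, 0) yields 37/76 ≈ 0.49, matching the figures above.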
BRITISH AIRWAYS
                            Predicted sentiment (machine)
Overall sentiment (human)   Negative    Positive    Neutral
Negative                    24          40          31
Positive                    19          14          12
Neutral                     1           5           4
Accuracy = (24 + 14 + 4) / 150 = 0.28 = 28%
Precision(positive) = 14 / (14 + 40 + 5) = 0.24 = 24%
Recall(positive) = 14 / (14 + 19 + 12) = 0.31 = 31%
Precision(negative) = 24 / (24 + 19 + 1) = 0.55 = 55%
Recall(negative) = 24 / (24 + 40 + 31) = 0.25 = 25%
Precision(neutral) = 4 / (4 + 31 + 12) = 0.09 = 9%
Recall(neutral) = 4 / (4 + 1 + 5) = 0.40 = 40%
RYANAIR
                            Predicted sentiment (machine)
Overall sentiment (human)   Negative    Positive    Neutral
Negative                    13          50          51
Positive                    16          5           9
Neutral                     3           1           2
Accuracy = (13 + 5 + 2) / 150 = 0.13 = 13%
Precision(positive) = 5 / (5 + 50 + 1) = 0.09 = 9%
Recall(positive) = 5 / (5 + 16 + 9) = 0.17 = 17%
Precision(negative) = 13 / (13 + 16 + 3) = 0.41 = 41%
Recall(negative) = 13 / (13 + 50 + 51) = 0.11 = 11%
Precision(neutral) = 2 / (2 + 9 + 51) = 0.03 = 3%
Recall(neutral) = 2 / (2 + 3 + 1) = 0.33 = 33%
APPENDIX II: EXACT DATA OF ERROR ANALYSIS
Sarcasm Negation Comparison Experience Synonym Ambivalent Side-taking Twitter-specific signs
Type 1 45 14 15 13 1 0 0 2
Type 2 37 7 7 20 10 0 0 1
Type 3 5 11 12 1 0 1 5 0
Type 4 1 1 7 8 0 2 1 1
Type 5 0 0 3 1 0 0 0 0
Type 6 0 0 2 4 0 0 0 0
Table 15 – Error analysis of the British Airways and Ryanair corpus combined
Sarcasm Negation Comparison Experience Synonym Ambivalent Side-taking Twitter-specific signs
Type 1 20 8 4 8 0 0 0 0
Type 2 14 3 0 12 1 0 0 1
Type 3 3 6 7 1 0 1 1 0
Type 4 1 0 3 5 0 2 0 1
Type 5 0 0 1 0 0 0 0 0
Type 6 0 0 1 4 0 0 0 0
Table 16 – Error analysis of the British Airways corpus
Sarcasm Negation Comparison Experience Synonym Ambivalent Side-taking Twitter-specific signs
Type 1 25 6 11 5 1 0 0 2
Type 2 23 4 7 8 9 0 0 0
Type 3 2 5 5 0 0 0 4 0
Type 4 0 1 4 3 0 0 1 0
Type 5 0 0 2 1 0 0 0 0
Type 6 0 0 1 0 0 0 0 0
Table 17 – Error analysis of the Ryanair corpus