Faculty of Arts and Philosophy
Valerie Dumortier
How can a thorough linguistic analysis
contribute to automatic sentiment
analysis in Twitter?
Master's dissertation submitted in fulfilment of the requirements for the degree of
Master in Multilingual Communication
2014
Supervisor: Ms Orphee De Clercq
Department of Translation, Interpreting and Communication
PREFACE
I would like to use this opportunity to thank everyone who made it possible to write this
dissertation.
First, I would like to thank my supervisor, Orphee De Clercq, for providing apt and
constructive feedback and for her constant support.
Secondly, I would like to thank my parents for giving me the opportunity to start and finish
these studies, for their constant support and for never ceasing to believe in me. I would also
like to thank my brother, Jasper Dumortier, for the relaxing and entertaining brother-sister
moments during my studies. My boyfriend, Sander De Loof, has supported me through thick
and thin during these four years and I would like to thank him for his support and his patience
with me while I was writing this dissertation. The rest of my family also deserves a big thank you.
Finally, I would like to thank all my friends who helped make my life as a student what it is
now. Thanks to them I have all these amazing memories, which I will always carry in my
heart. There is one person I would like to thank in particular: Joy Gillespie. She was an
important source of support while I was writing this dissertation and at many other moments
during these studies.
Table of contents
LIST OF TABLES AND IMAGES ........................................................................................... 7
INTRODUCTION ...................................................................................................................... 9
PART I: LITERATURE STUDY ............................................................................................ 11
1. SENTIMENT ANALYSIS ............................................................................................... 11
1.1. Basic concepts of sentiment analysis ......................................................................... 11
1.2. The process of sentiment analysis ............................................................................. 12
1.2.1. Subjectivity classification .................................................................................. 12
1.2.2. Sentiment classification ...................................................................................... 13
2. SENTIMENT ANALYSIS IN TWITTER ....................................................................... 16
2.1. Twitter ....................................................................................................................... 16
2.2. Shallow machine learning approaches to perform SA in Twitter data ...................... 17
2.3. Approaches incorporating more linguistic knowledge .............................................. 22
2.3.1. Sarcasm .............................................................................................................. 22
2.3.2. Negation ............................................................................................................. 23
PART II: CORPUS ANALYSIS .............................................................................................. 25
1. METHODOLOGY ........................................................................................................... 25
1.1. Data collection ........................................................................................................... 25
1.2. Data annotation .......................................................................................................... 29
1.3. LT3-tool ..................................................................................................................... 30
2. RESULTS ......................................................................................................................... 31
2.1. Evaluation .................................................................................................................. 31
2.2. Quantitative analysis .................................................................................................. 33
2.3. Qualitative analysis .................................................................................................... 36
2.3.1. Type 1: negative tweets labelled as positive ...................................................... 38
2.3.2. Type 2: negative tweets labelled as neutral ........................................................ 41
2.3.3. Type 3: positive tweets labelled as negative ...................................................... 44
2.3.4. Type 4: positive tweets labelled as neutral ......................................................... 46
2.3.5. Type 5: neutral tweets labelled as negative ........................................................ 47
2.3.6. Type 6: neutral tweets labelled as positive ......................................................... 48
2.4. Interpretation of the error analysis ............................................................................. 50
2.5. Recommendations ..................................................................................................... 51
CONCLUSION ........................................................................................................................ 53
BIBLIOGRAPHY .................................................................................................................... 55
APPENDIX I: MATRICES, ACCURACY, RECALL AND PRECISION ............................ 59
APPENDIX II: EXACT DATA OF ERROR ANALYSIS ...................................................... 62
LIST OF TABLES AND IMAGES
Table 1 – Short summary of the shallow machine learning approaches .................................. 20
Table 2 - Number of tweets and average tweet lengths of both corpora .................................. 28
Table 3 - Overview of the number of negative, positive and neutral tweets (labelled by the
human annotator) in both corpora ............................................................................................ 30
Table 4 - Overview of the different types of errors .................................................................. 32
Table 5 – Confusion matrix of the British Airways and Ryanair corpus combined ................ 33
Table 6 - Overview of the accuracy, precision and recall numbers of the British Airways and
Ryanair corpus combined ......................................................................................................... 33
Table 7 – Confusion matrix of the British Airways corpus ..................................................... 34
Table 8 – Confusion matrix of the Ryanair corpus .................................................................. 35
Table 9 - Overview of the accuracy, precision and recall numbers of the British Airways
corpus ....................................................................................................................................... 35
Table 10 - Overview of the accuracy, precision and recall numbers of the Ryanair corpus .... 35
Table 11 - Overview of the different error categories .............................................................. 37
Table 12 - Overview of the number of major errors in each type ............................................ 37
Table 13 - Overview of the number of minor errors in each type ............................................ 37
Table 14 - Pie chart of the shares of the different types of errors ............................................ 38
Table 15 – Error analysis of the British Airways and Ryanair corpus combined .................... 62
Table 16 – Error analysis of the British Airways corpus ......................................................... 62
Table 17 – Error analysis of the Ryanair corpus ...................................................................... 62
Image 1 - Example of the Twitter API with the search term "British Airways" ...................... 27
Image 2 – Possibilities to specify a search on Topsy according to time/date, kind of message
and language respectively ........................................................................................................ 27
Image 3 - Example of result page on Topsy with the search term “British Airways” and with
the specifications as in Image 2 ................................................................................................ 28
INTRODUCTION
Twitter is an important social media platform for companies. Since nowadays people are
constantly online and share their opinions on almost everything, Twitter offers companies the
opportunity to understand their consumers better. Consequently, companies use Twitter as an
alternative medium for their customer service1. As many consumers post reviews on Twitter,
sentiment analysis can be an interesting technique for companies to gauge the overall reception
of their products, services, etc. Sentiment analysis is a technique that is used to label a
document, sentence, word, etc. according to its sentiment, which is either neutral, negative or
positive. Still, because of their typical linguistic characteristics, tweets seem to be difficult
for ordinary sentiment analysis tools to label correctly.
This dissertation investigates sentiment analysis in Twitter. As tweets are typically very short
messages with a maximum of 140 characters, the task of performing sentiment analysis on
tweets is fairly different from performing it on longer texts such as reviews on Amazon. Since
tweets have a number of typical linguistic characteristics, these have to be taken into account.
Examples are the use of slang words or abbreviations which are not universally known and
the lack of linking words. Moreover, sometimes more background knowledge is required in
order to completely grasp the message.
The Language and Translation Technology Team of Ghent University has developed a
standard sentiment analysis tool (further referred to as LT3-tool) for tweets. This dissertation
examines the performance of this tool and aims to find out how it could be improved
based on a thorough linguistic analysis. For this purpose, we compiled a corpus of tweets
about two airline companies. We chose to explicitly search for tweets mentioning either
British Airways or Ryanair, as these two airline companies have a good and a bad reputation
respectively, which influences the sort of tweets that are sent. The corpus consists of tweets that
would most likely be labelled incorrectly by a sentiment analysis tool. The tweets of the
corpus are labelled by both a human annotator and the LT3-tool. This creates the opportunity
to investigate which aspects of tweets are difficult for automatic processing.
In this dissertation we will try to answer the following questions:
- How well does a machine sentiment analysis tool (the LT3-tool) perform at labelling
linguistically challenging tweets compared to humans?
1 KLM is a good example of a company that communicates mainly through Twitter with their consumers as
customer service (https://twitter.com/KLM, last accessed: May 25th 2014).
- What is or what are the most common mistake(s) a machine sentiment analysis tool
(the LT3-tool) makes when labelling linguistically challenging tweets?
- How could a machine sentiment analysis tool (the LT3-tool) be improved in order to
label linguistically challenging tweets better?
The work presented in this dissertation consists of two major parts: an extensive literature
overview and a corpus analysis.
After introducing the topic, we first present a literature study (Part 1) with a general overview
of the task of sentiment analysis on standard text (Section 1). We then continue by describing
the current state of the art of sentiment analysis in Twitter and zoom in on the challenges that
arise when trying to perform sentiment analysis in Twitter (Section 2).
In the second part of this dissertation, the corpus analysis (Part 2), we first introduce the
methodology that was used: we describe how the data was collected (1.1), annotated (1.2) and
automatically labelled (1.3). Then we describe the results by first giving some more
information about how the actual evaluation was performed (2.1), after which we extensively
discuss the results in both a quantitative (2.2) and qualitative (2.3) manner. We finish this part
with an analysis on the meta-level and by presenting some recommendations for improving
the LT3-tool (2.4 and 2.5).
To conclude, we repeat the main findings of this dissertation and present some prospects for
future work.
PART I: LITERATURE STUDY
1. SENTIMENT ANALYSIS
In this first part of the literature study, sentiment analysis in general is discussed. We start by
defining some basic concepts (1.1) and explaining the standard process for performing
sentiment analysis (1.2).
1.1. Basic concepts of sentiment analysis
In general, textual information can be divided into two types: facts and opinions. Liu (2010,
p1) defines the former as “objective expressions about entities, events and their properties”
and the latter as “subjective expressions that describe people’s sentiments, appraisals or
feelings toward entities, events and their properties”.
Sentiment analysis, or opinion mining, is a technique which originated from the need to
organize and structure large amounts of opinionated data. Liu (2010, p3) defines sentiment
analysis as “the computational study of opinions, sentiments and emotions expressed in text
[…] The opinion can be expressed on anything, e.g., a product, a service, an individual, an
organization, an event, or a topic.” Nasukawa & Yi (2003, p71) define sentiment analysis as a
technique “to identify how sentiments are expressed in texts and whether the expressions
indicate positive (favorable) or negative (unfavorable) opinions toward the subject”. Esuli &
Sebastiani (2006, p193) define opinion mining or sentiment analysis as “a recent subdiscipline
of computational linguistics which is concerned not with the topic a document is about, but
with the opinion it expresses”.
We can perceive subjectivity as one of the key elements of sentiment analysis or opinion
mining because a subjective sentence expresses an opinion most of the time. All sentences
have a certain level of subjectivity. A sentence with a low level of subjectivity and in which
an opinion is expressed implicitly is called an objective sentence, which means that this
sentence only shares factual information. However, when the level of subjectivity is high and
when an opinion is expressed explicitly, it is a subjective sentence, because personal feelings
or beliefs are also shared. To clarify this, let us compare the objective sentence “The sun
is shining.” to the subjective sentence “I love it when the sun is shining”. The former sentence
simply states how the weather is, i.e. it is a fact that the sun is shining, whereas the latter
sentence clearly utters a certain feeling of the speaker towards the fact that the sun is shining.
Within the study of sentiment analysis the level of subjectivity can be determined for both
sentences and documents. Mostly, a document is defined as opinionated or subjective when it
contains various opinionated sentences. However, caution is necessary since one specific
report can contain various opinions. Also, if it contains many quoted sentences, the opinions
expressed may not necessarily be the author's own.
When discussing sentiment analysis it is important to define what sentiment is exactly or at
least how we understand it throughout this dissertation. Sentiment can be defined as “an
attitude or opinion”, “feelings of love, sympathy, kindness, etc.” or “an attitude, thought, or
judgement prompted by feeling”2. Sentiment has a polarity or orientation, which is also
sometimes called polarity of opinion or semantic orientation (Liu, 2010). Some only
distinguish two types of polarity, negative and positive (Pang & Lee, 2004; Hu & Liu, 2004),
while others also consider neutral as a sentiment orientation (Liu, 2010; Kim & Hovy, 2004).
For our study we consider sentiment as a positive, negative or neutral emotion towards a
certain entity3.
1.2. The process of sentiment analysis
Sentiment analysis is a process that consists of two stages: subjectivity classification and
sentiment classification. The second stage of sentiment analysis, sentiment classification, can
also occur on different levels: document, sentence, word and feature level. (Liu, 2010; Kumar
& Sebastian, 2012)
1.2.1. Subjectivity classification
As mentioned above, the first step in sentiment analysis is subjectivity classification. In this
task, the system categorizes a text or sentence as either objective or subjective (also called
opinionated versus non-opinionated). Objective or non-opinionated texts or sentences render a
2 http://www.merriam-webster.com/dictionary/sentiment (last accessed: May 20th 2014)
3 In this dissertation, the entity is airline companies (see Section 1.1 of the corpus analysis).
fact while subjective or opinionated texts or sentences express certain feelings of the author
towards something. (Liu, 2010; Kumar & Sebastian, 2012)
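The subjectivity classification step described above can be illustrated with a minimal rule-based sketch. The cue-word list and the function below are illustrative assumptions for this dissertation only, not a system from the literature; real classifiers learn such cues from annotated data.

```python
# Minimal rule-based subjectivity classifier: a sentence is marked
# "subjective" if it contains at least one cue word that typically
# signals personal feelings; otherwise it is marked "objective".
# The cue list below is a toy example for illustration only.
SUBJECTIVITY_CUES = {"love", "hate", "think", "feel", "awful",
                     "great", "terrible", "amazing", "believe"}

def classify_subjectivity(sentence):
    # Strip simple punctuation and lowercase each token before lookup.
    tokens = {t.strip(".,!?").lower() for t in sentence.split()}
    return "subjective" if tokens & SUBJECTIVITY_CUES else "objective"

print(classify_subjectivity("The sun is shining."))                # objective
print(classify_subjectivity("I love it when the sun is shining"))  # subjective
```

Note that the two example sentences are the ones used in Section 1.1 to contrast factual and opinionated text: the input/output behaviour is the same as in a learned classifier, a sentence goes in and an objective/subjective label comes out.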
1.2.2. Sentiment classification
Once a sequence of text has been labelled as subjective or opinionated, the polarity of the text
or sentence needs to be determined. Sentiment classification is the process of determining
whether a document expresses a negative or positive opinion (sometimes neutral is also
considered as a sentiment). In order to do this, Liu (2010) states that it is necessary to have
information on five different aspects of the object. First, it is necessary to know the object on
which the opinion is expressed. Secondly, it is important to know the feature(s) about which
opinions are being expressed. One single sentence can comment on different features while
several sentences can express an opinion on one sole feature. Thirdly, the opinion itself has to
be known. Liu (2010, p4) defines an opinion on a feature as “a positive or negative view,
attitude or emotion or appraisal […] from an opinion holder”. A fourth aspect of the object is
the opinion holder, the person who expresses the opinion. Finally the system needs
information on when the opinion was expressed.
Sentiment classification can be done on different levels: document level, sentence level, word
level and feature level. Below, the four levels are briefly described and explained.
Sentiment classification on document level
As the name explains, this task determines the polarity of a whole document. It is then
assumed that the whole text holds one sole opinion and is therefore written by only one
author. A difficulty on this level of sentiment analysis is the expression of multiple opinions
in one text. (Kumar & Sebastian, 2012)
According to Liu (2010, p10) sentiment classification on the document level can, on the one
hand, be seen as “a supervised learning problem with two class labels (positive and
negative)”. Liu (2010) argues that domain adaptation is important in document-level
sentiment classification, as opinion words may express different sentiments depending on the
domain in which they are used. On the other hand, he also considers unsupervised learning as
a possible approach in which opinion words and phrases are of major importance (also when
this is done on document level).
Sentiment classification on sentence level
Sentiment classification on the sentence level is the task of determining the polarity of a
single sentence. One sentence may contain different opinions. These sentences are mostly
compound sentences, which are not suitable for sentiment classification on the sentence level.
Therefore, determining the strength of opinions is also important as it can be necessary to
determine which opinion has more weight. (Liu, 2010; Kumar & Sebastian, 2012)
Sentiment classification on word level
In word-level sentiment classification, the sentiment of sole words, often called opinion
words, is identified. According to Liu (2010), opinion words are often used in sentiment
classification tasks. Opinion words can be subdivided according to two dimensions: positive
versus negative and base type versus comparative type. Positive opinion words express
desired states whereas negative opinion words express undesired states.
Base type opinion words are the words that express an opinion on their own. The opinion in
these words is straightforward: they are either negative (e.g.: bad, horrible, etc.) or positive
(e.g.: good, marvellous, etc.). Comparative type opinion words are words that are used to
express an opinion using comparative and superlative words. They thus do not express an
opinion on one sole object, but on multiple objects. Although the comparative word can be a
negative word, this does not necessarily mean that the opinion on both compared objects is
negative.
An opinion lexicon is a list of opinion words, opinion phrases and idioms. Liu (2010)
distinguishes three approaches to creating opinion word lists, also known as lexicons: manual,
dictionary-based or corpus-based. In the manual approach, opinion word lists are collected by
manually labelling each word with its corresponding sentiment. This technique is often combined
with one of the two automated techniques. In the dictionary-based approach, first opinion
words are collected of which the opinion orientation is commonly known. They are used as
seed words to expand the lexicon by adding synonyms and antonyms. This iterative process
stops when no more new words are found. Usually the resulting lexicon is manually validated.
One of the main disadvantages of this approach, however, is that it is not able to find domain-
specific opinion words. A certain word can for example be positive in one domain, but
negative in another. The word “budget”, for example, does not always have a positive
connotation in the airline domain as it is often linked with poor service and old airplanes (e.g.:
In fact, British Airlines. This queue and service is reminiscent of a budget airline in holiday
season. Yet you’re not quite budget, are you...4). In other domains, “budget” has a positive
connotation. According to Liu (2010, p15) the corpus-based approach relies on “syntactic or
co-occurrence patterns and also [on] a seed list of opinion words to find other opinion words
in a large corpus”. Other words which are positioned near the opinion word, such as conjoined
adjectives, are therefore also considered for inclusion in the lexicon. Adjectives conjoined by
‘and’ generally have the same opinion orientation (e.g.: […] Yes, I hate myself too, but its
cheap and direct, and the times are good5) while adjectives conjoined by ‘but’ generally have
different orientations (e.g.: good but not much). The advantage of the corpus-based approach -
when compared to the dictionary-based approach - is that it is capable of finding domain-
specific opinion words, together with their orientation. (Liu, 2010; Kumar & Sebastian, 2012)
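The dictionary-based expansion Liu (2010) describes — seed words propagating their polarity through synonyms and reversing it through antonyms until no new words are found — can be sketched as follows. The tiny thesaurus is a hypothetical stand-in for a real dictionary such as WordNet, and the example words are taken from the base type examples above.

```python
# Sketch of dictionary-based lexicon expansion: seed words of known
# polarity are grown iteratively through synonyms (same polarity) and
# antonyms (opposite polarity) until a fixed point is reached.
# This tiny thesaurus is an illustrative stand-in for a real dictionary.
SYNONYMS = {"good": ["great"], "great": ["marvellous"],
            "bad": ["horrible"], "horrible": ["dreadful"]}
ANTONYMS = {"good": ["bad"], "bad": ["good"]}

def expand_lexicon(seeds):
    lexicon = dict(seeds)            # word -> +1 (positive) / -1 (negative)
    changed = True
    while changed:                   # stop when no more new words are found
        changed = False
        for word, polarity in list(lexicon.items()):
            for syn in SYNONYMS.get(word, []):
                if syn not in lexicon:
                    lexicon[syn] = polarity
                    changed = True
            for ant in ANTONYMS.get(word, []):
                if ant not in lexicon:
                    lexicon[ant] = -polarity
                    changed = True
    return lexicon

# Starting from the single positive seed "good":
print(expand_lexicon({"good": 1}))
```

As the text notes, the resulting lexicon would normally be validated manually, and the approach cannot capture domain-specific polarity shifts such as "budget" in the airline domain.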
Sentiment analysis on feature level
It is useful to classify a text on document or on sentence level, but this approach may be seen
as rather simplistic. A document which is classified as negative does not necessarily render an
exclusively negative opinion: some features may be considered as positive by the author. A
further, deeper and more detailed analysis of the text is needed to define the opinion of the
author of the text. Feature-based sentiment analysis is a technique to determine whether the
author’s opinion on a specific feature of a product is positive or negative (Liu, 2010).
There are several techniques to extract features from a text. Liu (2010) distinguishes two
types of online reviews: “pros, cons and the detailed review” on the one hand and free format
on the other. The “pros, cons and detailed review” type consists of two parts: a separate list of
the pros and cons of the object, and a full review. In this type of review, it is easier to extract
different features as they are clearly listed (a list with all the cons and a list with all the pros).
In a free format review, pros and cons are not listed separately, and the author can decide
himself how to structure his full text review. This makes it more difficult to extract separate
features. Hu & Liu (2004) established a method to find explicit features that are either nouns
4 Extracted from the British Airways corpus
5 Extracted from the Ryanair corpus
or noun phrases. To do this, a corpus of customer reviews of the product was collected. The
method consists of two steps. First, frequently used nouns and noun phrases are detected.
Only the nouns and noun phrases that occur frequently enough in the reviews are retained. In
this way, the features which are most frequently commented on (i.e. the most important
features) are assembled. In the second step, opinion words are used to find more infrequent
features.
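The first step of the Hu & Liu (2004) method — retaining only the nouns and noun phrases that occur frequently enough across the review corpus — can be roughly sketched as follows. The PoS tags are assumed to be given by an upstream tagger, and the frequency threshold is an illustrative choice, not the one from the original paper.

```python
from collections import Counter

# Step 1 of Hu & Liu (2004), sketched: count nouns across PoS-tagged
# reviews and keep only those frequent enough to count as candidate
# product features. Tag set (Penn-style "NN*") and threshold are
# illustrative assumptions.
def frequent_features(tagged_reviews, min_count=2):
    counts = Counter(tok.lower()
                     for review in tagged_reviews
                     for tok, tag in review
                     if tag.startswith("NN"))
    return {noun for noun, c in counts.items() if c >= min_count}

reviews = [
    [("The", "DT"), ("battery", "NN"), ("is", "VBZ"), ("great", "JJ")],
    [("Battery", "NN"), ("life", "NN"), ("disappoints", "VBZ")],
    [("Nice", "JJ"), ("screen", "NN")],
]
print(frequent_features(reviews))   # {'battery'}
```

The second step of the method, using opinion words to recover infrequent features, would extend this by looking at nouns that co-occur with known opinion words.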
Based on the above-mentioned levels, we conclude that for our research, i.e. sentiment
analysis in Twitter, the analysis could occur on three levels: sentence level, word level and
feature level. As tweets are very short messages, sentence and document level can be seen as
equal. Sentiment analysis on Twitter can therefore occur on sentence level.
In our dissertation, the data collection occurred on feature level as only tweets with the entity
British Airways or Ryanair were considered. Once we collected our data, we performed
sentiment analysis on document level as the entire tweets were considered.
2. SENTIMENT ANALYSIS IN TWITTER
2.1. Twitter
Twitter6 is a micro-blogging site that was founded in March 2006 by Biz Stone, Jack Dorsey
and Evan Williams. Its logo is a blue bird and one tweet can consist of a maximum of 140
characters. Currently, Twitter is available in 35 different languages. The micro-blogging site
is not only popular among ordinary people; celebrities and companies are also active on the
site.
Although messages on micro-blogging sites such as Twitter are limited to 140 characters,
their popularity grows at an enormous rate. Kaplan and Haenlein (2011) mention three factors
explaining the success of micro-blogs. First, through Twitter one can receive updates on even
the most trivial matters that happen in other people’s lives. Secondly, Twitter allows a push-
push-pull combination. Finally, it offers a “platform for virtual exhibitionism and voyeurism
for both active contributors and passive observers” (p106).
6 http://www.twitter.com (last accessed: May 20th 2014)
The first factor explains why micro-blogging sites are so popular. The push-push-pull
combination illustrates how tweets can benefit companies. There are three different
stages of the marketing process in which Twitter can be used. First, it can be used in pre-
purchase: Twitter can be used to read what clients have to say about a
particular company or its products, which can actually turn those clients into co-producers. Secondly,
micro-blogging sites can be useful in purchase. Companies can advertise and spread brand-
reinforcing messages through micro-blogging. They can not only communicate with their
customers but also with and through their employees. The final marketing area is post-
purchase. Companies can improve their customer service and complaint management
processes through micro-blogging sites. (Kaplan and Haenlein, 2011)
Kaplan and Haenlein (2011) mention three rules for successful micro-blogging, namely
relevance, respect and return. For the first factor, “relevance”, there are two important things a
company has to keep in mind: listen before you tweet, and find the right balance
between sending too few and too many messages. A second rule is “respect”. You have to
respect your followers by identifying yourself, using appropriate language and not deceiving
other users. The third rule “return” implies that firms using micro-blogging sites have to keep
in mind the benefits and return-on-investment of their activities. (Kaplan and Haenlein, 2011)
2.2. Shallow machine learning approaches to perform SA in Twitter data
Sentiment analysis is a difficult task on short chunks of text such as tweets, but different
classifiers have already been developed which show some promising results for labelling
tweets. Go et al. (2009), Barbosa & Feng (2010), Davidov et al. (August 2010), Mohammed
et al. (2013) and Kouloumpis et al. (2011) all developed a different classifier to detect
sentiment in Twitter. In this section, we explain these approaches and highlight the
similarities and differences.
As far as the Twitter datasets are concerned, Go et al. (2009) used one that consisted of tweets
which all contained emoticons. Barbosa & Feng (2010) used a combination of objective and
subjective sentences from three sources: Twendz7, Twitter Sentiment8 and TweetFeel9, while
7 An app to search for tweets, which can be downloaded from http://www.twtbase.com/twendz/ (last accessed: May 20th 2014)
8 http://www.sentiment140.com/ (last accessed: May 20th 2014)
9 http://www.tweetfeel.com/ (last accessed: May 20th 2014)
Davidov et al. (August 2010) used an existing dataset (by Brendan O’Connor). Further, the
Mohammed et al. (2013) dataset was provided by the organizers of the ‘SemEval2013 Task 2:
Sentiment Analysis in Twitter’ as the Mohammed et al. (2013) paper was one of the
submissions of that SemEval task. For this dataset, named entities were extracted from the
Twitter API (Nakov et al., 2013). Finally, Kouloumpis et al. (2011) used three different kinds
of datasets: a hashtag dataset in which all tweets contained a hashtag, an emoticon dataset
which appeared to be the same dataset used in Go et al. (2009) and a manually labelled
dataset which contained tweets on certain topics and in which each tweet was labelled with its
sentiment. Although there are no similarities in the data collection itself, apart from
Kouloumpis et al. (2011) and Go et al. (2009) using the same emoticon dataset, all datasets
were further processed and only English tweets were retained. In both Go et al. (2009) and
Barbosa & Feng (2010), tweets with contradictory labels were removed. Some other
important processing steps in the Go et al. (2009) dataset were that only the most common
emoticons were retained (these were: “ :), :-), : ), :D, =), :(, :-( and : (”). Besides this, retweets,
tweets with the :P emoticon and duplicates were removed. Further, Barbosa & Feng (2010)
considered one tweet per user only and removed tweets containing top opinion words such as
“cool” and “awesome”. The SemEval dataset, used by Mohammed et al. (2013), was
processed using a lexicon tool. Only tweets containing at least one sentiment word were
considered (Nakov et al., 2013). Finally, Kouloumpis et al. (2011) selected the hashtags that
would be most useful to identify a tweet as positive, negative or neutral. All tweets in the
hashtag dataset that lacked these hashtags were removed.
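The preprocessing steps described above can be sketched in the spirit of Go et al. (2009): retweets and duplicates are dropped, and the remaining tweets are labelled by the polarity of the emoticon they contain. The exact filtering rules below are simplified assumptions; the emoticon lists mirror the ones quoted in the text.

```python
# Emoticon-based distant labelling, sketched after Go et al. (2009):
# drop retweets and duplicates, then label each remaining tweet by the
# polarity of the emoticon it contains. Rules are deliberately simplified.
POSITIVE = (":)", ":-)", ": )", ":D", "=)")
NEGATIVE = (":(", ":-(", ": (")

def distant_label(tweets):
    labelled, seen = [], set()
    for tweet in tweets:
        if tweet.startswith("RT") or tweet in seen:
            continue                      # drop retweets and duplicates
        seen.add(tweet)
        if any(e in tweet for e in POSITIVE):
            labelled.append((tweet, "positive"))
        elif any(e in tweet for e in NEGATIVE):
            labelled.append((tweet, "negative"))
    return labelled

sample = ["Great flight :)", "RT Great flight :)",
          "Lost my bag :(", "Great flight :)"]
print(distant_label(sample))
```

Tweets without any of the retained emoticons are silently discarded, which mirrors the restriction of the Go et al. (2009) dataset to emoticon-bearing tweets.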
This brings us to a description of the machine learning approaches. In all systems, different
features were used to test different classifiers. Go et al. (2009) considered four different
feature sets: unigrams, bigrams, unigrams combined with bigrams, and part-of-speech (PoS). The latter,
PoS or meta-features, were also considered by Barbosa & Feng (2010) together with syntax
features. Besides unigrams, which is similar to Go et al. (2009), Davidov et al. (August 2010)
also considered n-gram features, pattern features and punctuation features. Mohammed et al.
(2013) used some features that were also used in the other classifiers such as n-grams, PoS,
emoticons and punctuation. On top of these, they added character n-grams, all-caps, hashtags,
lexicons, elongated words, clusters and negation. Kouloumpis et al. (2011) also used n-gram
features, lexicon features and PoS features, but moreover they considered micro-blogging
features.
As mentioned earlier, Go et al. (2009), Barbosa & Feng (2010), Davidov et al. (August 2010),
Kouloumpis et al. (2011) and Mohammed et al. (2013) all developed their own classifier. Go
et al. (2009) are the only ones who compared their classifier to others while Barbosa & Feng
(2010), Davidov et al. (August 2010), Mohammed et al. (2013) and Kouloumpis et al. (2011)
all simply tested their own classifier.
The keyword-based classifier developed by Go et al. (2009) counts the number of positive and negative words in a chunk of text. The higher count 'wins'; in the case of a tie, in other words when there is an equal number of positive and negative words, the tweet is labelled positive. They used the Twittratr10 word list in order to know which words to label positive and which to label negative. They compared their own classifier to three
machine learning algorithms. Barbosa & Feng (2010) considered the two steps of the
sentiment analysis process (cf. 1.2.1 and 1.2.2) when drawing up their classifiers: a
subjectivity classifier and a polarity classifier. While the former focusses on the tweets’
syntax features (emoticons and upper case), the latter focusses on the meta-features (PoS-
tagging). Both Davidov et al. (August 2010) and Mohammed et al. (2013) combined all
features in one single classifier. In order to examine which feature influenced the results the most, however, Mohammed et al. (2013) ran their classifier several times: the first time all features were included, and in each later run one feature was left out. Finally, Kouloumpis et
al. (2011) also tried to combine their different datasets (hashtag, emoticon and manually
annotated datasets) with different features in order to find out which combinations provided
the best results.
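The keyword-counting scheme of Go et al. (2009) described above can be sketched as follows; the word sets are illustrative placeholders, not the actual Twittratr word list:

```python
# Minimal sketch of the keyword-counting classifier described above.
# The word sets below are illustrative placeholders, not the actual
# Twittratr lexicon.
POSITIVE_WORDS = {"good", "great", "love", "awesome", "happy"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "awful", "delayed"}

def keyword_classify(tweet):
    tokens = tweet.lower().split()
    pos = sum(1 for t in tokens if t in POSITIVE_WORDS)
    neg = sum(1 for t in tokens if t in NEGATIVE_WORDS)
    # Ties (including zero hits) are labelled positive, as described above.
    return "negative" if neg > pos else "positive"
```

Note that under this scheme a tweet without any lexicon hits defaults to positive, which is one reason such keyword classifiers are usually compared against machine learning algorithms.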
In general, all classifiers performed quite well. Go et al. (2009) concluded that unigrams are
good predictors, whereas bigrams seem more useful in combination with unigrams. Although
Davidov et al. (August 2010) did not use the same features as Go et al. (2009), they also
stated that a combination of various features resulted in a better performance of their
classifier. Further, Davidov et al. (August 2010) argued that some hashtags and smileys tend
to occur together rather frequently and some sentiment types also depend on each other.
Mohammed et al. (2013) concluded that lexicons influenced the performance of the classifier remarkably. Moreover, the automatic sentiment lexicons, i.e. the NRC Hashtag Sentiment Lexicon (Mohammad, 2012) and the Sentiment140 Lexicon (Go et al., 2009), had a more positive influence on the classifier than the manual lexicons, i.e. the NRC Emotion Lexicon (Mohammad & Turney, 2010; Mohammad & Yang, 2011), the MPQA Lexicon (Wilson et al., 2005) and the Bing Liu Lexicon (Hu & Liu, 2004). Finally, the experiments of Kouloumpis et al. (2011) showed that it was generally useful to add the emoticon dataset to the hashtag dataset. Another conclusion is that the PoS feature caused a poorer performance. The best results were obtained when the n-gram, lexical and micro-blogging features were trained on the hashtag dataset only.

10 http://www.twittratr.com/ - now automatically redirected to the website of its developer (last accessed: May 20th 2014)
There is one contradiction in the results: whereas Go et al. (2009) and Kouloumpis et al. (2011) explicitly mentioned that PoS-tagging does not seem to be a useful feature, Barbosa & Feng (2010) did not report any problems arising with that feature. In Table 1, an overview is presented of the
shallow machine learning approaches which were discussed in this section.
Go et al. (2009)
Features: unigrams; bigrams; unigrams and bigrams; PoS
Classifier: comparison of their own keyword-based classifier to three machine learning algorithms
Results: unigrams: good; bigrams: not good; unigrams and bigrams: good; PoS: not useful

Barbosa & Feng (2010)
Features: meta-features (PoS): negative polarity, positive polarity and verbs; syntax features: emoticons and upper case
Classifier: two types of classifiers: a subjectivity classifier and a polarity classifier
Results: good results, but one important limitation: sentences that contain contradicting sentiments

Mohammed et al. (2013)
Features: n-grams, character n-grams, all-caps, PoS, hashtags, lexicons, punctuation, emoticons, elongated words, clusters and negation
Classifier: a classifier in which all features are combined; later adapted by leaving features out
Results: the lexicon feature influences the classifier the most

Davidov et al. (August 2010)
Features: single word features (unigrams), n-gram features, pattern features and punctuation features
Classifier: a classifier in which all features are combined
Results: a combination of different features seems to be very important

Kouloumpis et al. (2011)
Features: n-gram features, lexicon features, PoS features and micro-blogging features
Classifier: a classifier in which all features are combined; later adapted by leaving features out
Results: best results when combining the emoticon and hashtag dataset or when training n-gram, lexical and micro-blogging features on the hashtag dataset only; PoS: not useful

Table 1 – Short summary of the shallow machine learning approaches
As to the datasets, Liu et al. (2012) propose a new model which tries to handle the challenge of combining manually labelled data (as in Davidov et al. (August 2010)) and noisy labelled data (as in Go et al. (2009)) for training: the Emoticon Smoothed Language Model (ESLAM). Basically, a manually labelled dataset is smoothed with a noisy labelled dataset (with emoticons).
The dataset consisted of 5513 manually labelled tweets, of which 3727 remained after removing non-English and spam tweets11. The preprocessing of the dataset consisted of
replacing Twitter usernames starting with @ by ‘twitterusername’, replacing all digits with
‘twitterdigit’, replacing all URLs with ‘twitterurl’, removing all stopwords, lower casing and
stemming all words and removing retweets and duplicates. After this pre-processing, 956
random tweets (478 positive and 478 negative) were chosen for polarity classification and
1948 (974 positive and 974 negative) for subjectivity classification.
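The pre-processing steps of Liu et al. (2012) listed above can be sketched as follows; stemming and retweet/duplicate removal are omitted, and the tiny stopword set is illustrative only:

```python
import re

# Simplified sketch of the pre-processing described above (Liu et al.,
# 2012). Stemming and retweet/duplicate removal are omitted; the tiny
# stopword set is illustrative only.
STOPWORDS = {"the", "a", "an", "is", "to"}

def preprocess(tweet):
    t = tweet.lower()
    t = re.sub(r"@\w+", "twitterusername", t)     # Twitter usernames
    t = re.sub(r"https?://\S+", "twitterurl", t)  # URLs (before digits)
    t = re.sub(r"\d+", "twitterdigit", t)         # digit sequences
    return " ".join(w for w in t.split() if w not in STOPWORDS)
```

The URL pattern is applied before the digit pattern so that digits inside URLs are not replaced separately.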
In order to test whether using emoticons is useful, Liu et al. (2012) compared ESLAM to a fully supervised language model (LM). Normally, however, an LM is only used for documents. To solve this problem, two 'documents' were made: one a concatenation of all positive tweets and one a concatenation of all negative tweets in the training data. A first conclusion which could be drawn
from the test was that both methods performed better when more manually labelled data was
used. Secondly, the performance of ESLAM was always better than the performance of LM,
especially when a small amount of manually labelled data was used. According to Liu et al. (2012), this means that noisy (emoticon) data does contain useful information. Another experiment showed that using only noisy labelled data was not enough, since the results revealed that using more manually labelled data resulted in a better performance.
We can conclude that shallow machine learning approaches already perform well at sentiment analysis on Twitter data. As mentioned above, classifiers perform well when certain features are considered, such as lexicons, unigrams, unigrams in combination with bigrams, etc. More profound linguistic characteristics, however, such as sarcasm and negation, seem to be the main problem. First of all, it is sometimes not clear to a classifier which word a
negation particle is meant for. Further, labelling tweets as sarcastic is something only humans can do, and even for a human annotator this is sometimes a difficult task. In the following section, approaches for these characteristics are discussed.

11 As in e-mail programs or other social media websites, spam is also present on Twitter. It can occur in various forms, such as abusing the @ function in order to reach users when spreading unwanted messages, creating various accounts, posting links to dubious websites, etc. https://support.twitter.com/entries/64986# (last accessed: May 20th 2014)
2.3. Approaches incorporating more linguistic knowledge
2.3.1. Sarcasm
González-Ibáñez et al. (2011) and Davidov et al. (July 2010) agree that a sarcastic tweet
means the opposite of what is actually written and that this is very hard to detect. Davidov et
al. (July 2010) investigated how to distinguish sarcastic from non-sarcastic utterances on
Twitter and Amazon while González-Ibáñez et al. (2011) compared the performance of
human annotators and machine learning techniques when classifying tweets as positive or
negative. Both Davidov et al. (July 2010) and González-Ibáñez et al. (2011) used a dataset consisting of tweets that were explicitly marked with the hashtag #sarcasm.
Results showed that Davidov et al.’s (July 2010) model scored well on the Twitter dataset
(with tweets containing #sarcasm) as Twitter messages are short and the sarcasm is mostly not
dependent on other sentences (which can be the case on Amazon, where larger chunks of text are used). In order to pinpoint some typical characteristics of sarcasm in tweets, they
used a “hash-gold standard set”, which contains the tweets labelled with #sarcasm. They
found three typical uses of the hashtag. First, twitter users want others to be able to find their
sarcastic tweets and therefore label them as such. Secondly, the hashtag can be used in order
to indicate that their previous tweet was sarcastic. Finally, the hashtag can be used when there
is a lack of context from which the sarcasm could be deduced and therefore the users
explicitly mention that something is sarcastic. These uses show that tweets with #sarcasm are
both noisy and biased. Noisy because not all tweets containing that hashtag are actually
sarcastic and biased because without the hashtag, it would be difficult, even for humans, to
label them as sarcastic.
González-Ibáñez et al. (2011) countered the concern of Davidov et al. (July 2010) that tweets with #sarcasm are noisy by removing from their dataset all tweets where the hashtag did not occur at the end. They found that lexical and pragmatic features were not enough for a classifier
to distinguish sarcastic tweets from positive or negative tweets. Further, they investigated
whether human annotators would be better at the task of determining whether a tweet was
sarcastic, negative or positive. Their results revealed that the human annotators also scored low on the classification task. Consequently, González-Ibáñez et al. (2011) concluded that only tweets which were explicitly labelled sarcastic by their author could be used, since authors are the only ones able to label their tweets correctly. This is also why they are convinced that their "approach to create a gold standard of sarcastic tweets is more suitable in the context of Twitter messages" (p. 585). Davidov et al. (July 2010) also used features in their algorithm:
punctuation-based and pattern-based features. The punctuation features can be further divided
into five different features: sentence length in words, the number of exclamation marks,
question marks and quotes in a sentence, and finally the number of capitalized words in a
sentence.
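The five punctuation-based features listed above can be sketched as follows; splitting on whitespace is our own simplifying assumption, not necessarily the tokenization of Davidov et al. (July 2010):

```python
# Sketch of the five punctuation-based features listed above
# (Davidov et al., July 2010). Whitespace tokenization is a
# simplifying assumption.
def punctuation_features(sentence):
    words = sentence.split()
    return {
        "length_in_words": len(words),
        "exclamation_marks": sentence.count("!"),
        "question_marks": sentence.count("?"),
        "quotes": sentence.count('"'),
        "capitalized_words": sum(1 for w in words if w[:1].isupper()),
    }
```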
2.3.2. Negation
Negation occurs frequently in tweets, and in short texts or sentences in general. To the best of our knowledge, however, no studies have considered this aspect for Twitter specifically. That is why, for this dissertation, we discuss previous studies on negation that could also be useful for handling negation in tweets.
Wiegand et al. (2010) provided a recent study on negation in sentiment analysis. In their
paper, they provided an overview of computational approaches that deal with this matter. We
will only discuss those that are relevant for sentiment analysis in Twitter.
Pang et al. (2002) found that a bag-of-words representation is effective. With this technique, the classifier itself has to determine which words in the dataset or feature set are polar and which are not. When a negation particle occurs, the words following it are turned into artificial negated words up to the next punctuation mark.
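This negation handling can be sketched as follows; the "NOT_" prefix and the particle list are illustrative choices, while the idea of adding artificial words up to the next punctuation mark follows the description above:

```python
import re

# Sketch of the negation handling described above: after a negation
# particle, every token is turned into an artificial "NOT_" word until
# the next punctuation mark (in the spirit of Pang et al., 2002).
NEGATION_PARTICLES = {"not", "no", "never"}

def mark_negation(tokens):
    out, negating = [], False
    for tok in tokens:
        if re.fullmatch(r"[.,!?;:]", tok):
            negating = False  # negation scope ends at punctuation
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATION_PARTICLES:
                negating = True
    return out
```

In this way "like" and "NOT_like" become distinct bag-of-words features, so the classifier can learn their polarities separately.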
A second type of approach comprises models which incorporate prior knowledge of polar expressions. Polanyi & Zaenen (2004) developed a model that uses contextual valence shifting.
In this model, polar expressions are given scores: positive scores to positive polar expressions
and negative scores to negative polar expressions. When a polar expression is negated, the
polarity score of the expression is flipped. This model’s effectiveness is however unknown as
it has never been implemented. The second approach of this type is the one by Wilson et al.
(2005). They developed a classifier and added a negation feature. They concluded that the
more features were used in the classifier, the better it performed. In other words, adding
negation features to a classifier helps.
Another approach, by Jia et al. (2009), examined the impact of scope models for negation, in other words, models which focus on the scope of a negation. They developed a complex model which also considers, for example, disambiguation and negative rhetorical questions. Wiegand et al. (2010) argued that this model performs better than simpler methods as it keeps linguistic insights in mind.
Negation does not only occur within phrases or sentences, but sometimes also within words.
These words are mostly words which are clearly either positive or negative, so this should not
be a problem in sentiment analysis. However, new words (containing negation) are arising
and these may not be listed in the polarity lexicon. Therefore, Wiegand et al. (2010) argued
that polarity classifiers should also be able to detect these words.
Liu & Seneff (2009) also made an important remark. They observed that a negated polar
expression (e.g.: not pretty) does not necessarily have the same polar strength as its unnegated
polar expression with the opposite polarity (e.g.: ugly). Although “not pretty” and “ugly” do
have the same polarity, their polar strength is different as in this case “not pretty” is less
negative than “ugly”. Liu & Seneff (2009) therefore proposed a model in which words are
given a score for their intensity and polarity.
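The score-flipping idea of Polanyi & Zaenen (2004) and the strength observation of Liu & Seneff (2009) can be illustrated together in a small sketch; the lexicon, its scores and the function name are invented for illustration:

```python
# Illustrative sketch of contextual valence shifting: polar expressions
# carry scores and a negation flips the sign (Polanyi & Zaenen, 2004).
# The lexicon and scores below are invented for illustration only.
LEXICON = {"pretty": 2.0, "good": 2.0, "ugly": -3.0, "bad": -2.0}
NEGATORS = {"not", "never", "no"}

def polarity_score(tokens):
    """Sum lexicon scores, flipping the sign of a negated polar word."""
    score, negated = 0.0, False
    for tok in tokens:
        if tok in NEGATORS:
            negated = True
        elif tok in LEXICON:
            score += -LEXICON[tok] if negated else LEXICON[tok]
            negated = False
    return score
```

With this toy lexicon, "not pretty" scores -2.0 while "ugly" scores -3.0: the flipped score is milder than the direct antonym, in line with Liu & Seneff's remark that a negated polar expression and its antonym need not have the same polar strength.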
Finally, Wiegand et al. (2010) formulated some of the limits that negation modelling still has
in sentiment analysis. First, some words can have different polar expressions in different
contexts. Secondly, some polar opinions can only be detected and understood when the reader
has world knowledge. Finally, rare constructions are often not yet considered as specific
negation models can only be developed when a large enough corpus is available. Only then
can the models be properly tested.
PART II: CORPUS ANALYSIS
In the second part we describe the corpus analysis that was performed for this dissertation.
For our data collection we decided to focus on one specific domain: airline companies.
The most important objective of this dissertation’s experiments is to help improve the
performance of a basic sentiment analysis tool (see 1.3). In order to be able to do that, it is
necessary to know why it fails at labelling the tweets correctly. After analysing the results,
some propositions and conclusions will be drawn up in order to help improve the tool. The
focus will be on more profound text attributes.
We can reformulate these objectives in three questions:
- How well does a machine sentiment analysis tool (the LT3-tool) perform at labelling linguistically challenging tweets compared to humans?
- What is or what are the most common mistake(s) a machine sentiment analysis tool
(the LT3-tool) makes when labelling linguistically challenging tweets?
- How could a machine sentiment analysis tool (the LT3-tool) be improved in order to
label linguistically challenging tweets better?
We will now first describe the methodology followed for our research, after which we will
discuss the results.
1. METHODOLOGY
1.1. Data collection
We chose to compile two different corpora with tweets on two airline companies, British Airways and Ryanair. These airline companies were selected as they have very different reputations. This is an interesting aspect, as these reputations can influence the kinds of tweets that are sent. British Airways has a rather good reputation: it is a non-budget airline and is held in high regard not only by British travellers but also by travellers worldwide. The most recent proof of this good reputation is its election as best brand in the UK (Molley, 2014; Winch, 2014). Ryanair, on the other hand, has a (much) worse reputation, which may be due to its being a low-budget airline. For a lot of travellers, a low-budget airline often equals an airline company with a bad customer service. This is definitely the case for Ryanair, as it is known for offering cheap flight tickets but no other extras for its customers (Vizard, 2013). In the past few months, however, Ryanair has been making extra efforts to boost its public image (Travelmole, 2014).
In order to be eligible for our corpus, all tweets had to meet some criteria:
1. The tweet had to mention one of the two chosen airline companies (Ryanair or British Airways).
2. It had to be sent during November 2013.
3. The tweet had to contain at least one sentiment which would be difficult to predict using an automatic system.
This last criterion was the most important one.
Starting in November 2013, as many tweets as possible were collected using the Twitter API. One limitation of this service is that it only allows a user to download tweets from the preceding week12. During the month of November, we entered our two search terms "Ryanair" and "British Airways" in the tool on a daily basis. This already resulted in a corpus of 45 Ryanair and 60 British Airways tweets. However, since we aimed to have at least five tweets for each day in November – a criterion that was not met during this period – we had to find another tool allowing us to search for more tweets. To this purpose, we used the
website www.topsy.com (last accessed: May 20th 2014). This website allows you to search
for old tweets by date. Since this website also searches other social media based on a specific query, we had to specify two restrictions, i.e. (1) that we only wanted tweets and (2) that we only wanted English tweets. For each day in November, separate searches were performed and tweets meeting our criteria were selected.
In order to illustrate the data collection, we present screenshots of the websites used to collect the tweets. Image 1 shows the Twitter API with the search entry "British Airways", while Images 2 and 3 are screenshots of the Topsy website using the same search entry.
12 https://dev.twitter.com/docs/using-search (last accessed: May 20th 2014)
Image 1 - Example of the Twitter API with the search term "British Airways"
Image 2 – Possibilities to specify a search on Topsy according to time/date, kind of message and language respectively
Image 3 - Example of result page on Topsy with the search term “British Airways” and with the specifications as in Image 2
As our objective was to find five tweets for each day in November, we continued searching for tweets on the Topsy website until that criterion was met. The final corpus consists of 300 tweets in total, 150 for Ryanair and 150 for British Airways13. Table 2 presents some data
characteristics of our corpus.
The corpus contains different types of information on the tweets. First, the date and exact time
are provided. Secondly, the tweet itself is in the corpus. Most emoticons were copied, except
the less common ones that could not be typed (such as an airplane). A third element in the
corpus is the Twitter username of the type @username. The fourth element is the name that the Twitter user provides him- or herself. Normally, you would expect this to be the real name of the user, but this is not always the case, as some users include asterisks (*) in their name, for example. The two
last columns of the corpus contain information on the annotation of the tweets, which will be
discussed in closer detail in the following section.
                  Number of tweets   Average tweet length
Ryanair                        150                  16.54
British Airways                150                  17.29
Table 2 - Number of tweets and average tweet lengths of both corpora
13 The Ryanair and British Airways corpora are enclosed with this dissertation via Minerva.
1.2. Data annotation
Before testing a sentiment analysis tool on our dataset it was crucial that our data had
manually verified sentiment labels; they had to be a gold standard.
During our data collection we already paid special attention to the third criterion, i.e. why a
sentiment analysis tool would label the sentiment of the tweet incorrectly. This implies that a
first rough annotation of the tweets had already been performed. Next to each tweet there was a description of why this tweet was chosen, which often (implicitly) mentioned whether the particular tweet was positive, negative or neutral.
However, it was still necessary to perform an additional annotation in a uniform manner. To
this purpose all tweets were annotated manually as positive, negative or neutral.
Annotating a corpus is a fairly subjective task, as one annotator could describe a tweet as sarcastic whereas another considers it a serious tweet. The same is true for this corpus' annotation; some annotations might be up for discussion, of which we now present some examples:
I'm wondering if @rewardgateway is a bit like the old British Airways in that it's staff
have to pass a test for extremely good lookingness
This tweet could be interpreted as either negative (British Airways focusses on the looks and
not the capacities) or positive (all British Airways stewardesses are good-looking). This is
why we decided to label it as neutral in order to keep those two interpretations in mind.
British Airways pulling out of Manchester was disappointing but attracted other
airlines. Market will change over next 5 yrs.
One could consider this a negative tweet because the twitter user is disappointed in British
Airways for leaving. Others could label it positive as being disappointed can express a
positive opinion of British Airways. In the corpus however, this tweet was also labelled
neutral in order to keep those two interpretations in mind.
We were not able to perform an inter-annotator agreement study due to a lack of time.
However, we consider the pre-selection step and extensive literature study sufficient to form a
rather unambiguous judgment. Table 3 presents the total number of positive, negative and
neutral tweets in each corpus. In the Ryanair corpus, the number of negative tweets is considerably higher (nineteen more) than in the British Airways corpus. This could be a consequence of Ryanair having a worse reputation than British Airways.
                  Sentiment   # of tweets
British Airways   Negative             95
                  Positive             45
                  Neutral              10
Ryanair           Negative            114
                  Positive             30
                  Neutral               6
Table 3 - Overview of the number of negative, positive and neutral tweets (labelled by the human annotator) in both corpora
1.3. LT3-tool
The LT3-tool is a sentiment analysis tool developed by and discussed in Van Hee et al. (2014). It is a fairly standard sentiment analysis tool including as many features as possible. The LT3-tool considers ten different feature groups: lexicons, n-grams, n-grams and lexicons, normalization, PoS, negation, word shape, named entity, dependency and PMI. When testing the tool, first only three feature groups were used (lexicons, n-grams, and n-grams and lexicons) and later the other feature groups were added one by one. Van Hee et al. (2014) did this in order to examine how much each feature group influenced the performance of the tool. Using the SemEval2013 Twitter dataset (Nakov et al., 2013) for training, Van Hee et al. (2014) tested the performance of the LT3-tool on different datasets: a dataset with regular tweets, a dataset with sarcastic tweets, a dataset with SMS messages and a dataset consisting of blog posts.
Results showed that the LT3-tool performed best on the dataset with regular tweets and worst
on the sarcastic dataset. Further, almost all features had a positive influence on the
performance of the tool except for the dependency feature.
For our analysis, we used the LT3-tool, trained on the SemEval data with all available
features, and tested it on both the British Airways and Ryanair corpus.
2. RESULTS
2.1. Evaluation
Before coming to the actual results, we will first explain how we evaluated the tool. The error analysis itself consists of two dimensions: a quantitative and a qualitative analysis. The quantitative analysis consists of calculating the performance of the LT3-tool (accuracy, precision and recall) using confusion matrices. The qualitative analysis was done by classifying the different errors of the tool.
Quantitative analysis
Once the results of the LT3-tool are outputted, they can be compared to those of the human annotator. In order to calculate this performance, a confusion matrix is used in which "each column […] represents the instances in a predicted class, while each row represents the instances in an actual class"14. The results of both datasets are represented in two separate matrices.
Using the data in those matrices, three performance metrics can be calculated: accuracy,
precision and recall. The first value, accuracy, indicates “the degree of closeness of
measurements of a quantity to that quantity's actual (true) value"15. For the LT3-tool in particular, it indicates how many tweets were labelled correctly, irrespective of the classes
(positive, negative, neutral). The second value, precision, indicates "the degree to which repeated measurements under unchanged conditions show the same results"15. For example, when one wishes to calculate the precision of the positively labelled tweets, one calculates what percentage of the tweets that were labelled positive by the system is actually correct. In this dissertation, the precision is calculated for all possible labels: negative, positive and
neutral. The third and last value is recall, which can be defined as "the fraction of relevant instances that are retrieved"16. In other words, with this we calculate what percentage of, for example, the positive tweets was recognized by the sentiment analysis tool. Again, this value
14 http://en.wikipedia.org/wiki/Confusion_matrix (last accessed: May 20th 2014)
15 http://en.wikipedia.org/wiki/Accuracy_and_precision (last accessed: May 20th 2014)
16 http://en.wikipedia.org/wiki/Recall_(information_retrieval) (last accessed: May 20th 2014)
is calculated for the negative, positive and neutral labels. We will now present the different formulas17 used to calculate these metrics.
Accuracy:
(neg.neg + pos.pos + neu.neu) / total number of tweets

Precision of the negative, positive and neutral class respectively:
neg.neg / (neg.neg + pos.neg + neu.neg)
pos.pos / (neg.pos + pos.pos + neu.pos)
neu.neu / (neg.neu + pos.neu + neu.neu)

Recall of the negative, positive and neutral class respectively:
neg.neg / (neg.neg + neg.pos + neg.neu)
pos.pos / (pos.neg + pos.pos + pos.neu)
neu.neu / (neu.neg + neu.pos + neu.neu)
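Under the notation above, the three metrics can be computed from a confusion matrix as in the following sketch; the data layout and function name are our own choices:

```python
# Sketch of the metric computations described above. The confusion
# matrix is stored as counts[human_label][machine_label]; the layout
# and the name "metrics" are our own choices.
def metrics(counts):
    labels = ["neg", "pos", "neu"]
    total = sum(counts[h][m] for h in labels for m in labels)
    accuracy = sum(counts[l][l] for l in labels) / total
    # Precision of class m: correct m-predictions over all m-predictions.
    precision = {m: counts[m][m] / sum(counts[h][m] for h in labels)
                 for m in labels}
    # Recall of class h: correctly found h-tweets over all tweets the
    # human annotator labelled h.
    recall = {h: counts[h][h] / sum(counts[h][m] for m in labels)
              for h in labels}
    return accuracy, precision, recall
```

Applied to the combined confusion matrix of Section 2.2, this yields an accuracy of roughly 21% and a precision of roughly 49% for the negative class.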
Qualitative analysis
In the qualitative analysis, six types of errors will be discussed according to how the human
annotator and the LT3-tool labelled the tweets.
         Human coder   LT3-tool
Type 1   Negative      Positive
Type 2   Negative      Neutral
Type 3   Positive      Negative
Type 4   Positive      Neutral
Type 5   Neutral       Negative
Type 6   Neutral       Positive
Table 4 - Overview of the different types of errors
In Table 4, an overview is provided of the six different types of errors that are distinguished in the qualitative analysis in Section 2.3. Type 1 and Type 2 errors are tweets which were labelled negative by the human annotator, but positive and neutral respectively by the LT3-tool. Type 3 and Type 4 errors concern tweets labelled positive by the human annotator and negative and neutral respectively by the sentiment analysis tool. Finally, Type 5 and Type 6 errors were made when the human annotator labelled tweets as neutral and the LT3-tool labelled them negative and positive respectively.

17 The sequence "X.Y" represents the human annotator labelling a tweet X and the LT3-tool labelling the tweet Y. Other abbreviations are "pos" for positive, "neg" for negative and "neu" for neutral.
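The error typology of Table 4 can be written as a small lookup; the label strings and function name are illustrative:

```python
# Sketch of the error typology in Table 4 as a lookup table: the pair
# (human label, LT3-tool label) maps to an error type; agreement maps
# to None. Label strings and function name are illustrative.
ERROR_TYPES = {
    ("negative", "positive"): 1,
    ("negative", "neutral"): 2,
    ("positive", "negative"): 3,
    ("positive", "neutral"): 4,
    ("neutral", "negative"): 5,
    ("neutral", "positive"): 6,
}

def error_type(human, machine):
    # Returns None when the two labels agree (no error).
    return ERROR_TYPES.get((human, machine))
```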
We will now discuss the results in closer detail.
2.2. Quantitative analysis18

                            Predicted sentiment (machine)
Overall sentiment (human)    Negative   Positive   Neutral
Negative                           37         90        82
Positive                           35         19        21
Neutral                             4          6         6
Table 5 – Confusion matrix of the British Airways and Ryanair corpus combined
The confusion matrix of both corpora combined is presented in Table 5. It shows how many times the LT3-tool labelled the tweets correctly or incorrectly. Overall, 37 negative, 19 positive and 6 neutral tweets were labelled correctly. In other words, the accuracy of the LT3-tool on our total corpus is 21%. This indicates that the collected tweets are indeed difficult to label for an automatic sentiment analysis tool.
            Negative   Positive   Neutral
Accuracy                  21%
Precision       49%        17%        6%
Recall          18%        25%       38%
Table 6 - Overview of the accuracy, precision and recall numbers of the British Airways and Ryanair corpus combined

When comparing the performance metrics presented in Table 6, there are some remarkable differences. First of all, the precision of the negative tweets (49%) is significantly higher than that of the positive (17%) and neutral (6%) tweets. In other words, of the tweets the LT3-tool labelled negative, a much larger share was correct than of those it labelled positive or neutral. The recall of the negative tweets, however, is the lowest: 18% compared to 25% and 38% for the positive and neutral tweets respectively. Although half of the tweets labelled negative by the LT3-tool were labelled correctly, the tool did miss a significant number of the tweets that were negative in reality. Further, the precision of the neutral tweets (6%) is very low, meaning that in many tweets the opinion was expressed with neutral words or the positive and negative opinion words cancelled each other out. Finally, the precision of the neutral tweets is the lowest whereas their recall is the highest. This high recall could be explained by the corpus containing few tweets labelled neutral by the human annotator (sixteen tweets in total, which is 5% of the entire corpus). In the qualitative error analysis, we hope to shed some light on the types of errors that were frequently made by the LT3-tool.

18 The calculations of all values mentioned below can be found in Appendix I.
For completeness, we present below the confusion matrices of both airline companies, British
Airways in Table 7 and Ryanair in Table 8:
                            Predicted sentiment (machine)
Overall sentiment (human)    Negative   Positive   Neutral
Negative                           24         40        31
Positive                           19         14        12
Neutral                             1          5         4
Table 7 – Confusion matrix of the British Airways corpus
                            Predicted sentiment (machine)
Overall sentiment (human)    Negative   Positive   Neutral
Negative                           13         50        51
Positive                           16          5         9
Neutral                             3          1         2
Table 8 – Confusion matrix of the Ryanair corpus
            Negative   Positive   Neutral
Accuracy                  28%
Precision       55%        24%        9%
Recall          25%        31%       40%
Table 9 - Overview of the accuracy, precision and recall numbers of the British Airways corpus
            Negative   Positive   Neutral
Accuracy                  13%
Precision       41%         9%        3%
Recall          11%        17%       33%
Table 10 - Overview of the accuracy, precision and recall numbers of the Ryanair corpus
A comparison of Tables 9 and 10 shows that the performance metrics of the Ryanair corpus
are generally lower than those of the British Airways corpus. First of all, the accuracy on the
British Airways corpus is 15 percentage points higher than on the Ryanair corpus, which is a
significant difference. All precision values are lower in the Ryanair corpus, but the precision of
the positive tweets is remarkably lower (also 15 percentage points). The recall values are
likewise lower for negative, positive and neutral tweets, with the positive tweets again showing
the largest difference. The values for neutral tweets (as labelled by the human annotator) also
differ, but this cannot be generalized as too few tweets were labelled neutral in either corpus.
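To make the relation between the confusion matrices and the reported metrics explicit, the snippet below recomputes the figures of Table 9 from the counts of Table 7 (a worked example in Python; the dictionary layout is our own, not part of the LT3-tool):

```python
# Recomputing the evaluation metrics of Table 9 from the confusion
# matrix of Table 7 (rows = human label, columns = machine label).
labels = ["negative", "positive", "neutral"]
ba = {  # British Airways corpus, counts taken from Table 7
    "negative": {"negative": 24, "positive": 40, "neutral": 31},
    "positive": {"negative": 19, "positive": 14, "neutral": 12},
    "neutral":  {"negative": 1,  "positive": 5,  "neutral": 4},
}

total = sum(sum(row.values()) for row in ba.values())   # 150 tweets
correct = sum(ba[l][l] for l in labels)                 # 42 correct
accuracy = correct / total                              # 42 / 150 = 28%

def precision(label):
    """Correct predictions of `label` over all predictions of `label` (column sum)."""
    predicted = sum(ba[h][label] for h in labels)
    return ba[label][label] / predicted

def recall(label):
    """Correct predictions of `label` over all human annotations of `label` (row sum)."""
    annotated = sum(ba[label].values())
    return ba[label][label] / annotated

for l in labels:
    print(f"{l}: precision {precision(l):.0%}, recall {recall(l):.0%}")
print(f"accuracy {accuracy:.0%}")
```

Rounded to whole percentages, the output reproduces the values reported in Table 9.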
2.3. Qualitative analysis
As the prime objective of this dissertation is to improve the LT3-tool, it is necessary to know
which mistakes the system makes. After a first thorough inspection we saw that we could
distinguish eight different error categories: sarcasm, negation, experience, comparison,
synonym, ambivalent, side-taking and Twitter-specific signs. When a tweet could be placed in
different categories, we opted for the most dominant error.
We will now first explain these error categories in closer detail. In the literature overview, the
common errors “sarcasm” (Section 2.3.1) and “negation” (Section 2.3.2) have already been
discussed in closer detail. In a sarcastic tweet, the twitter user says something but actually
means the exact opposite, which seems to be difficult for a sentiment analysis tool to detect.
In a negated tweet, a negation particle changes the sentiment orientation of the word that
follows. This also poses a problem for a sentiment analysis tool, as it is not always clear which
word the negation particle relates to. Note that negation does not always occur with an
ordinary negation particle (such as “not”): sometimes other parts of a sentence can indicate
negation (such as “rather than”) or at least strengthen or weaken the overall sentiment within
a tweet. This is why we decided to also include modifiers in the “negation” category. The third
error category is “experience”; this is when a twitter user tells
a story related to one of the airline companies. The story is usually told using neutral words
and lets the reader draw his/her own conclusions. “Comparison” is a fourth category and
occurs when the twitter user compares the airline company to another company. This can be
confusing for a sentiment analysis tool as it is not always clear which company the negative
or positive opinion is meant for. Fifthly, the error “synonym” indicates that a company is
perceived as an equivalent of a certain sentiment or opinion. When the sixth error,
“ambivalent”, occurs, a word in the tweet carries different sentiments in different contexts
and, in the context of that particular tweet, is labelled incorrectly. “Side-taking” is the seventh
error, which occurs when a twitter user takes the airline company’s side
in discussions. Finally, “Twitter-specific signs” is an error category occurring when for
example a hashtag expresses an important sentiment about the tweet, but cannot be read as
such by the sentiment analysis tool.
In Table 11, an overview is provided of the different error categories with a short explanation.
The exact count of errors per corpus can be found in Appendix II; all examples used in this
section are extracted from either the British Airways or Ryanair corpus.
Sarcasm                 A sarcastic utterance is used in the tweet
Negation                A negation particle or modifier is used in the tweet
Experience              The tweet reports an experience of the twitter user
Comparison              The entity is compared to another entity (entity: airline company)
Synonym                 The entity is seen as a synonym to a certain sentiment or opinion
Ambivalent              A word is used which can have different meanings in various contexts
Side-taking             The twitter user supports the entity
Twitter-specific signs  A sign is used which is typically used in Twitter
Table 11 - Overview of the different error categories
Before proceeding to the actual in-depth error analysis, an overview is provided in Tables 12
and 13, which summarize how many major (Table 12) and minor (Table 13) errors were found
in each of the six types.
Major errors
          Sarcasm   Negation   Comparison   Experience
Type 1    45        14         15           13
Type 2    37        7          7            20
Type 3    5         11         12           1
Type 4    1         1          7            8
Type 5    0         0          3            1
Type 6    0         0          2            4
Total     88        33         46           47
Table 12 - Overview of the number of major errors in each type
Minor errors
          Synonym   Ambivalent   Side-taking   Twitter-specific signs
Type 1    1         0            0             2
Type 2    10        0            0             1
Type 3    0         1            5             0
Type 4    0         2            1             1
Type 5    0         0            0             0
Type 6    0         0            0             0
Total     11        3            6             4
Table 13 - Overview of the number of minor errors in each type
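The per-type totals behind these tables can be recomputed directly from the raw counts; the resulting sums (90, 82, 35, 21, 4 and 6 errors) match the numbers reported per type in Sections 2.3.1 to 2.3.6, and the shares match the pie chart up to rounding. A short verification in Python (the data layout is our own):

```python
# Recomputing the per-type error totals and shares from Tables 12 and 13.
major = {  # Table 12 rows: sarcasm, negation, comparison, experience
    1: [45, 14, 15, 13], 2: [37, 7, 7, 20], 3: [5, 11, 12, 1],
    4: [1, 1, 7, 8], 5: [0, 0, 3, 1], 6: [0, 0, 2, 4],
}
minor = {  # Table 13 rows: synonym, ambivalent, side-taking, Twitter-specific signs
    1: [1, 0, 0, 2], 2: [10, 0, 0, 1], 3: [0, 1, 5, 0],
    4: [0, 2, 1, 1], 5: [0, 0, 0, 0], 6: [0, 0, 0, 0],
}

per_type = {t: sum(major[t]) + sum(minor[t]) for t in major}
total = sum(per_type.values())  # 238 erroneously labelled tweets in all

for t, n in sorted(per_type.items()):
    print(f"Type {t}: {n} errors ({n / total:.0%})")
```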
Tables 12 and 13 show again that most errors were made in Types 1, 2 and 3. The large share
of Type 1 and Type 2 errors can be explained by 70% of the tweets being labelled negative
by the human annotator, which increases the likelihood of those tweets being labelled
erroneously by the LT3-tool. Type 3 also has quite a large share, which can be explained by
the LT3-tool mostly predicting the exact opposite sentiment when it labelled erroneously
(another reason why Type 1 is notably present). Further, the rather small share of Type 5
and Type 6 errors can be explained by the fact that few tweets were labelled neutral by the
human annotator (only 5%).
In Table 14, the large shares of Type 1, Type 2 and Type 3 errors are shown more visually in
a pie chart. The correctly labelled tweets are not considered in this chart.
Table 14 - Pie chart of the shares of the different types of errors (both airline companies:
Type 1 38%, Type 2 34%, Type 3 15%, Type 4 9%, Type 5 2%, Type 6 2%)
2.3.1. Type 1: negative tweets labelled as positive
Type 1 includes all tweets that the human annotator labelled negative, but which were
erroneously labelled positive by the LT3-tool. In total, 90 errors of this type were
committed, 40 in the British Airways corpus and 50 in the Ryanair corpus. These errors
belong to different categories, of which four stand out: “sarcasm”, “negation”,
“comparison” and “experience”.
Sarcastic tweets seem to have caused substantial problems in Type 1, as this error accounts
for 50% of the erroneously labelled tweets. In this case, positive words are used, but the
twitter user points out something negative.
[1] Another triumphant @Ryanair exp. 1 desk open out of 15 with a massive Q. 2 people on desk, 1 booking, 1 just chatting – brilliant
[2] Love Ryanair, best airline ever.. #nextjoke #satonmiarse
[3] Great so @British_Airways have left my back in Norway and I’m in Denmark. Non lols.
[4] Back in rainy Belgium L thanks for my delays #britishairways
In [1], [3] and [4], the twitter users use positive words to describe their negative experience
in the same sentence. For a human annotator, the negative experience that accompanies the
positive words indicates that the tweet is sarcastic. [2] is a slightly different case, as there the
hashtags indicate that the tweet is sarcastic.
Negation, though in a more restricted way, is also an important cause of negative tweets being
labelled positive (16%).
[5] From my door to the gate in 50 minutes, thank god I’m not flying Ryanair… - Oliver
[6] A special little joy grows in my heart every time I book a flight that’s not with Ryanair.
[7] . @British_Airways trust me, it was anything but enjoyable, especially the return trip – back to @VirginAtlantic for me…
[8] @Bitish_Airways rather than thank me for my patience I’d rather you were on time.
The four examples above show that negation seems to be an issue when it occurs in a special
form. In [5] and [6], the LT3-tool has registered “thank” and “joy” as positive words, but
since there is a negation particle before “Ryanair”, the sentiment shifts. However, for the
LT3-tool this is probably not clear as “Ryanair” and not “thank” and “joy” are negated.
Tweets [7] and [8] show that negation does not always occur with the usual negation particles.
In [7] “anything but” is the sequence negating “enjoyable”. In [8] on the other hand, “rather
than” negates or weakens “thank”.
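One possible direction for handling such cues is window-based negation: a negation cue flips the polarity of sentiment words within the next few tokens. The sketch below is purely illustrative (the word lists, the window size and the scoring scheme are our own assumptions, not the LT3-tool's actual implementation):

```python
# Illustrative window-based negation handling: a negation cue (including
# multiword cues such as "anything but" and "rather than") flips the
# polarity of sentiment words in the following few tokens.
POS = {"thank", "joy", "enjoyable", "love", "great", "brilliant"}
NEG_CUES = {"not", "n't", "never", "anything but", "rather than"}

def score(tokens, window=3):
    total = 0
    negate_left = 0  # tokens still covered by the last negation cue
    i = 0
    while i < len(tokens):
        bigram = " ".join(tokens[i:i + 2])
        if len(tokens[i:i + 2]) == 2 and bigram in NEG_CUES:
            negate_left = window   # multiword cue opens a negation window
            i += 2
            continue
        tok = tokens[i]
        if tok in NEG_CUES:
            negate_left = window   # single-word cue opens a negation window
        elif tok in POS:
            total += -1 if negate_left else 1  # flip polarity inside the window
        if negate_left:
            negate_left -= 1
        i += 1
    return total

# "anything but" flips "enjoyable" to negative, as in example [7]
print(score("it was anything but enjoyable".split()))
```

Such a heuristic still fails on cases like [5] and [6], where the negated word is the entity itself rather than the sentiment word, illustrating why window-based approaches are only a partial solution.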
Thirdly, some tweets were labelled incorrectly when British Airways or Ryanair was
compared to another company. In this case, positive things are said about other companies,
which makes these tweets negative for British Airways or Ryanair. 17% of the Type 1 errors
are of the “comparison” category.
[9] @jula18 @gems2311 @desiderata2013 I used to use Ryanair much prefer aer lingus now plus you can book seats its not a fee for all.x
[10] The @wizzair airline is a great example of how you can offer cheap flights without having an ass retarded website like ryanair
[11] @greggsulkin @british_airways Virgin Atlantic – solution to flying with British ever again. Soooo much better
[12] why can’t British airways be as cool as American ones???
In all the above examples something good is said about another airline company, which
makes either British Airways or Ryanair look bad. In [9] the twitter user implies that Aer
Lingus is better as passengers do not have to pay when choosing their seats (which they have
to do when booking with Ryanair). In [10] WizzAir is praised for its good website and cheap
tickets. Although the tweet implies that Ryanair also has cheap tickets, which is seen as
something positive, the negative aspect of “an ass retarded website” dominates. In both [11]
and [12] British Airways is seen as a bad airline company when compared to Virgin Atlantic
and American airline companies in general.
The fourth common Type 1 error committed by the LT3-tool is “experience”. These tweets
report on something the twitter user has experienced with one of the airline companies,
leaving it up to the readers to draw their conclusions. In this case, negative conclusions
should be drawn, which is trivial for human annotators but quite an issue for a sentiment
analysis tool. 14% of the erroneously labelled tweets belong to this category.
[13] Dear Ryanair… independent.co.uk/voices/coment…. This story is shocking. Better to save up for longer and travel with a decent airline.
[14] lnkd.in/d48KMsV Well done to Carolyn and the team. Amazing results and well and truly puts Ryanair in its place.
[15] @British_Airways ..furthermore, your Gibraltar contact is a SPANISH NUMBER. I’m Gibraltarian! English is our national language! So …
[16] British Airways bureaucracy & processes: flying with my wife and can’t select her seat next to me because we are in different bookings.
In the Ryanair examples [13] and [14], the twitter users both refer to something unknown to
the readers. Their language use, however, reveals a negative opinion. In [13], for example, it
is implied that Ryanair is not a decent airline, and in [14] the user explicitly says that Ryanair
is “put in its place”, which also has a negative connotation. In the British Airways tweets,
examples [15] and [16], the twitter users share their negative experience. However, neither of
them explicitly says something negative about British Airways. They simply narrate their
(negative) experience without using explicitly negative words and leave it up to the readers to
draw their conclusions. It is understandable that this is not obvious for the LT3-tool, which
makes this a difficult problem to tackle automatically.
Further, in the Ryanair corpus two other error categories can be distinguished: “synonym” (1)
and “Twitter-specific signs” (2). These categories were not found in the British Airways
corpus.
[17] @DominicFarrell herpes is better than. Ryanair !!!
[18] @ihateryanair you guys rock! keep up the good work! these thieves @ryanair must be stopped!
[19] It appears that Jet2 are no better than Ryanair #ripoffmerchants
In [17] Ryanair is clearly seen as a synonym for something negative as it is compared to
herpes and the twitter user expresses his preference for herpes over Ryanair. In [18] and [19]
the Twitter-specific signs @ and # express an important opinion on the tweet. In [18] the
twitter user praises another user, namely @ihateryanair. This is important information as this
clearly expresses a negative attitude towards Ryanair. In [19] on the other hand, the
information in the hashtag #ripoffmerchants is essential as it expresses a negative opinion
towards both Jet2 and Ryanair. It seems that the LT3-tool is unable to split these hashtags and
at-mentions into separate linguistic units.
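Splitting such concatenated hashtags is feasible in principle with dictionary-based segmentation. The sketch below is a minimal illustration (the tiny LEXICON is invented for the example; a real system would use a full word list or a language model):

```python
# Minimal sketch of hashtag segmentation: split a hashtag body such as
# "ripoffmerchants" into known words via dynamic programming, preferring
# the segmentation with the fewest words. The LEXICON is illustrative only.
LEXICON = {"rip", "off", "ripoff", "merchants", "next", "joke",
           "i", "hate", "ryanair"}

def segment(tag, lexicon=LEXICON):
    """Return the shortest segmentation of `tag` into lexicon words,
    or None when no full segmentation exists."""
    n = len(tag)
    best = [None] * (n + 1)
    best[0] = []
    for i in range(1, n + 1):
        for j in range(i):
            if best[j] is not None and tag[j:i] in lexicon:
                cand = best[j] + [tag[j:i]]
                if best[i] is None or len(cand) < len(best[i]):
                    best[i] = cand
    return best[n]

print(segment("ripoffmerchants"))  # ['ripoff', 'merchants']
print(segment("ihateryanair"))     # ['i', 'hate', 'ryanair']
```

Once split, the recovered words (“ripoff”, “hate”) could be scored like ordinary tokens, which would make the sentiment in examples [18] and [19] visible to the tool.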
2.3.2. Type 2: negative tweets labelled as neutral
Type 2 includes all tweets that the human annotator labelled negative, but which were
erroneously labelled neutral by the LT3-tool. In total, 82 errors of this type were
committed, 31 in the British Airways corpus and 51 in the Ryanair corpus. These errors
belong to different categories, of which three stand out: “sarcasm”, “experience” and “synonym”.
Again sarcasm seems to pose the biggest problem for the LT3-tool. In total 45% of the tweets
were falsely labelled neutral because of sarcasm. Although the twitter users use either neutral
words or an equal amount of negative and positive words in this case, the overall message in
the tweet is sarcastic.
[20] I think a walk and a swim will get us home quicker than @Ryanair tonight #clontarf #dublinairport pic.twitter.com/od21YJwddk
[21] @The_Old_Penfold @howthingslook @jana_obscura @GregorServais Ryanair, right, where you have to peddle to make the plane go.
[22] When the lady at check in told me I had an aisle seat , what she meant was “here’s a middle seat” – thanks @BritishAirways #terminal5 #a380
[23] The @British_Airways Avios frequent flyer points are about as useful as Monopoly money.
In both [20] and [21] the twitter users make jokes implying that going home with Ryanair
takes a long time and no explicitly negative words are used. A human annotator however can
detect the joke while the LT3-tool takes into account the linguistic features only. In [23] the
twitter user also makes a joke as he alludes to the frequent flyer points being worth nothing
(as Monopoly money is also worthless). Finally, tweet [22] is a classic example of sarcasm as
the “thanks @BritishAirways” is simply not sincere. This can be detected from the context.
Also in Type 2, “experience” is an error occurring regularly (24%). In this case, twitter users
tell a story which has a negative outcome, but they do not use explicitly negative words and
therefore the LT3-tool labelled these tweets as neutral.
[24] Ryanair – was your website built in 1999?
[25] At Rygge ‘Oslo’ Airport, which, to be honest Ryanair, is only near Oslo if you zoom really really far out on your maps.
[26] Just looked at the twitter id of @British_Airways and they only look at tweets until 5pm – so much for #globalservice
[27] I really do wish @British_Airways would call me to tell me where my luggage that they lost is. Am on holiday with just clothes I’m wearing!
In tweets [24], [25] and [26], the twitter users remark on a bad aspect of Ryanair or British
Airways. Although it may read like an ordinary comment, it is actually a form of criticism;
these tweets therefore cannot be seen as neutral, as they clearly voice critique. The same
occurs in [27], but as opposed to the other three tweets, this message has a more personal
connotation, as this twitter user really wants to know where his clothes are.
The third important error is “synonym” which occurs when British Airways or Ryanair are
seen as a synonym of something bad (12%). The “synonym” errors are mainly made in the
Ryanair corpus rather than in the British Airways corpus. This difference can be explained by
Ryanair having a very bad reputation compared to British Airways.
[28] So @northernrailorg is the Ryanair of trains.
[29] We are staying at a Ryanair hotel.
[30] @davidfrum This is the RyanAir approach: if you’re in the same time zone, it’s close enough.
[31] What is this?! British Airways? pic.twitter.com/oYAwhaYT8J
In the first three examples from the Ryanair corpus, [28], [29] and [30], the company is
clearly perceived as something negative. Tweets [28] and [29] are fairly short, but for a
human annotator it is still obvious that these are negative tweets about Ryanair, as he or she
has background knowledge of Ryanair’s bad reputation. In [30], the twitter user confirms
that “the RyanAir approach” has a negative connotation by explaining what is wrong with
that approach. In [31] it is clear that this is a negative tweet, even without consulting the
image, as “what is this?!” would not be used in a positive context.
Finally, three minor error categories are left: “negation” (7), “comparison” (7) and “Twitter-
specific signs” (1). The “negation” errors can be found in both corpora, three in the British
Airways and four in the Ryanair corpus. The “comparison” errors however are only present in
the Ryanair corpus, while the “Twitter-specific signs” error can be found in the British
Airways corpus.
[32] Flying British Airways 747, should have plenty of room.
[33] used to Ryanair flying business class in Emirates this time is like being in another planet
[34] I’m a patient man but now been on hold 41 minutes at british airways executive club #ba #britishairways #costumerserviceneedsimprovement
Tweet [32] demonstrates once more (cf. [7] and [8]) that other sequences, in this case
modification particles, can also cause negation. In this example, “should have” negates that
there is “plenty of room”; the tweet is therefore negative, as the twitter user regards this as a
drawback. In [33], Ryanair is compared to another airline company, Emirates. The
comparison does not favour Ryanair, as Emirates appears to be better; however, no explicitly
negative words were used to express that sentiment. Finally, in [34] the hashtags contain the
most important information of the tweet. Since they consist of tokens which are all
concatenated, the negative words in the hashtag cannot be detected by a sentiment analysis tool.
2.3.3. Type 3: positive tweets labelled as negative
Type 3 includes all tweets that the human annotator labelled positive, but which were
erroneously labelled negative by the LT3-tool. In total, 35 errors of this type were
committed, 19 in the British Airways corpus and 16 in the Ryanair corpus. These errors
belong to different categories, of which two stand out: “negation” and “comparison”.
Overall, “negation” accounts for 31% of the errors.
[35] Amazed - new Ryanair website didn't drive me mad when checking in. Ryanair in improved website shocker!!
[36] Ryanair booked us onto new flights, no charge, despite the lost lorry cargo on the M11 being absolutely nothing to do with them. Impressed.
[37] Just got the British Airways magazine with the staff holidays in. Caribbean for £30 a night doesn’t sound too bad.
[38] Not at all keen on T4, not only is it miles from anywhere but more importantly there’s a distinct lack of @British_Airways
Tweet [35] expresses a genuinely positive comment on the new Ryanair website: the twitter
user negates that the website drove him mad, which was probably the case with the old
website. Tweet [36] also reports something positive about Ryanair, as the negative word
“charge” is negated. In [37], the problem arises that the sentiment analysis tool does not
know which word the negation particle applies to; in this example, “bad” is negated although
the negation particle is positioned before “sound”. Finally, in [38] several negation particles
apply to both “T4” and “British Airways”. The most important negation, however, “there’s a
distinct lack of @British_Airways”, expresses a positive comment on British Airways.
In Type 3, “comparison” errors are also a cause of concern, occurring 34% of the time. The
airline companies are compared to others and, in this case, the positive things apply to
either British Airways or Ryanair. Although the difference is not significant, the bad
reputation of Ryanair could explain the lower number of “comparison” errors in that corpus.
[39] Have booked flights for @buildstuffit.It I have gone with ryanair. Yes, I hate myself too, but its cheap and direct, and the times are good
[40] €124/£103 on Aer Lingus, €49/£41 on Ryanair to Amsterdam. I’m kind of tempted… [Granted I’d have to book a hotel and shit too]
[41] I hate easyjet, never me ruder staff. Nothing in comparison to British Airways. #easyjet #britishairways
[42] Fuck me, British Airways just make every other airline look like a incompetent arseholes when it comes to service. #1stClass
At first sight, [39] might seem a negative tweet, as the twitter user declares that he hates
himself because he chose Ryanair. However, the positive aspects of Ryanair compared with
other airline companies (cheap, direct, good timetable) dominate in this tweet. A difference
with other comparison errors is that Ryanair is not compared with another specific airline
company, but with all companies in general. The same holds for [42], where the twitter user
declares that British Airways is the best of all airline companies. In [40] the twitter user
stresses the cheap tickets Ryanair offers in comparison with Aer Lingus. Finally, in [41] all
negative aspects apply to EasyJet, as British Airways is depicted as a very good company.
Finally, the other Type 3 errors can be divided into four categories: “sarcasm” (5),
“experience” (1), “ambivalent” (1) and “side-taking” (5). Three “sarcasm” errors were found
in the British Airways corpus and two in the Ryanair corpus. Both the “experience” and the
“ambivalent” error are found in the British Airways corpus. Finally, four “side-taking” errors
can be found in the British Airways corpus and one in the Ryanair corpus.
[43] My plane ticket was so cheap I called British Airways to make sure it wasn’t a scam
[44] better download and upload speed begging it off british airways’ internet than in my hous I feel like crying
[45] the british airways waterside is sick
[46] Everyone is bad mouthing the British Airways and I’m like wait what, what happened?
In [43] the twitter user makes a joke about the price of his ticket, which can therefore be
categorized as sarcasm. Tweet [44] recounts an experience a British Airways client had with
their internet; although a negative word (“crying”) is used, it is clear to a human annotator
that this is a positive tweet for British Airways. The error in tweet [45] is labelled
“ambivalent” because “sick” is usually a negative word, but in this context it is used
positively as a slang word expressing that something is very good. Finally, the twitter user in
[46] sides with British Airways when he hears that some people have been bad-mouthing them.
2.3.4. Type 4: positive tweets labelled as neutral
Type 4 includes all tweets that the human annotator labelled positive, but which were
erroneously labelled neutral by the LT3-tool. In total, 21 errors of this type were
committed, 12 in the British Airways corpus and 9 in the Ryanair corpus. These errors
belong to different categories, of which two stand out: “comparison” and “experience”.
Overall, 38% of the errors belong to the category “experience”; in this case, the twitter user
relates a positive experience.
[47] @askDUBairport is ryanair really letting us have two carry on bags?
[48] Italian newspaper announcing that Ryanair will allow a 2nd item of cabin baggage (eg handbag) as of Dec 1st. Almost dropped my phone
[49] @vraiment_moi I used to work for british airways many years ago and every year me and the misses get free flights anywhere in the world
[50] @str8edgeracer where are you headed? I used to fly into heathrow for work all the time. Go to the british airways loung.
Both [47] and [48] report on the news that Ryanair allows a second carry-on bag on its
airplanes. For a human annotator it is obvious that both twitter users react positively to the
news, but as there are no explicitly positive words in the tweets, this may not be as obvious
for a sentiment analysis tool. As for [49], the twitter user has a very good experience with
British Airways, as he and his wife receive free tickets from the company; the experience as a
whole is therefore positive. The twitter user in [50] clearly had a positive experience in the
British Airways lounge, although he does not mention that explicitly. A human annotator,
however, can deduce that the British Airways lounge must be good because otherwise the
twitter user would not send others there.
In Type 4, 33% of the committed errors belong to the category “comparison”. In this case,
either British Airways or Ryanair is seen as the good company when compared to others.
[51] @Niamhyy you should get a free flight~all the hassle you are going through. If they don’t answer book with Ryanair
[52] First ever Ryanair trip went without a hitch. How very unlike @British_Airways #ABBA
[53] @NinaWarhurst @British_Airways if it was Ryanair or easy jet they would of charged for being ill and call it a administration charge !!!!!
[54] Now THAT’S a billboard! British Airways - #lookup in Piccadilly Circus: youtu.be/GtJx_pZjvzc via @youtube
In [51] no other airline company is mentioned, but Ryanair is clearly presented as the better
company, as the twitter user thinks that @Niamhyy would not have that kind of trouble with
Ryanair. In [52] British Airways and Ryanair are compared and the twitter user has an
obvious preference for Ryanair; for a sentiment analysis tool, however, this does not seem
to be clear. In [53] British Airways is praised for not asking an extra charge, something the
twitter user is positively surprised by, as he points out that other airline companies (Ryanair
and EasyJet) would do so. Finally, in [54] the new British Airways billboard is
complimented; in comparison to other billboards, the twitter user finds it brilliant.
Finally, other Type 4 errors can be divided into five categories: “sarcasm” (1), “negation” (1),
“ambivalent” (2), “side-taking” (1) and “Twitter-specific signs” (1). The errors “sarcasm”,
“ambivalent” and “Twitter-specific signs” can be found in the British Airways corpus while
the errors “negation” and “side-taking” can be found in the Ryanair corpus.
[55] I mind when I got to sit in first class on British airways cos there were no seats left in economy
[56] 20 mins late…… this wouldnt happen on a ryanair flight…
[57] In British Airways businessclass last night. They spoilt us rotten. fb.me/1raZdGW4e
[58] Dear @Ryanair if it’s true that Muslims are going to boycott you after o’learys anti burka comments,I am willing to pay double to travel.
[59] British Airways > American Airlines
In the sarcastic tweet [55], “I mind” is used sarcastically, as no one would mind being seated
in first class instead of economy. In [56] it is negated that a Ryanair flight would be 20
minutes late. For a sentiment analysis tool this may not be clear, however, as the negation
particle and “late” are in different sentences. Further, “rotten” can be seen as an ambivalent
word in [57], since in combination with “to spoil” it is a very positive word. In [58], Ryanair
receives support from a twitter user for the anti-burka comments of its CEO. Finally, a
sentiment analysis tool is not able to read “>” in [59], which is positive for British Airways.
2.3.5. Type 5: neutral tweets labelled as negative
Type 5 includes all tweets that the human annotator labelled neutral, but which were
erroneously labelled negative by the LT3-tool. In total, four errors of this type were
committed, one in the British Airways corpus and three in the Ryanair corpus. Three of
these errors can be placed in the category “comparison”; the fourth error is of the
“experience” category. Type 5 errors form a rather small group, which can be explained by
the fact that few tweets were labelled neutral by the human annotator.
[60] @British_Airways sad just had worst travel exp of my life with B A #Luggage took longer than actual flite to shop up. Worse than ryanair!
[61] what I wouldn’t give to hear that Ryanair trumpet right now…. delays again @British_Airways
[62] RT @MarketingWeekEd: Ryanair is trying to shake off its reputation for poor costumer service bit.ly/1cusnPN
[63] British Airways pulling out of Manchester was disappointing but attracted other airlines. Market will change over next 5 yrs.
A lot of negative words are used in [60], as the twitter user reports on a negative experience
with British Airways. For Ryanair, however, as this is a tweet from the Ryanair corpus, this
is neither a negative nor a positive tweet, since negative and positive elements balance each
other out: the twitter user believes that Ryanair is better than British Airways, but at the
same time Ryanair is used as a yardstick for bad airlines. Also in [61], negative and positive
aspects of Ryanair are balanced out. The twitter user points out that there would be no
delays with Ryanair, contrary to British Airways, but he also expresses his annoyance with
the Ryanair trumpet. In [62], the twitter user retweets a newspaper article; it is, however,
not clear which sentiment he has towards it, which is why this tweet was labelled neutral.
Finally, in [63], positive and negative elements are balanced out again, as the twitter user
claims he is disappointed that British Airways pulled out of Manchester, but on the other
hand he also seems to accept it.
2.3.6. Type 6: neutral tweets labelled as positive
Type 6 includes all tweets that the human annotator labelled neutral, but which were
erroneously labelled positive by the LT3-tool. In total, six errors of this type were
committed, five in the British Airways corpus and one in the Ryanair corpus. The errors can
be divided into two categories: “comparison” and “experience”. Type 6 errors also form a
rather small group, as few tweets were labelled neutral by the human annotator.
66% of the Type 6 errors can be placed in the “experience” category.
[64] I’m wondering if @rewardgateway is a bit like the old British Airways in that it’s staff have to pass a test for extremely good lookingness.
[65] Wow! The British Airways app works with the iPhone Passbook. This is actually the first time I use it in some useful way
[66] British Airways I love you but never again will I fly with you when I’m pregnant
[67] Love airlines that claim you can choose your seat in advance. And then you can’t. #britishairways
Generally, in all the above examples, the twitter user expresses both negative and positive
things which balance each other out; the LT3-tool, however, registered only the positive
comment. In [64] the positive comment is clearly present, as the twitter user explicitly
mentions that the British Airways staff is good-looking; at the same time, however, he
denounces that the staff is hired based on their looks. The same occurs in [65]: the twitter
user is satisfied with the British Airways app, but he also implies that it took a long time
before he could use it in a useful way and that it is therefore not that useful. The twitter user
in [66] states that she likes to fly with British Airways, but also that they do not treat their
pregnant passengers correctly. Finally, in [67] the twitter user says that he likes the
possibility of choosing your seat in advance with British Airways (positive aspect); however,
it seems that he cannot do so at the time (negative aspect).
The other 33% of the Type 6 errors can be categorised as “comparison” errors.
[68] Good luck to the GOALies ,Philippines bound Sunday morn with 40 tonnes of aid onboard free Aer Lingus plane. Ryanair next ??
[69] @BritishAirways always thought your food to be the best but for flying experience @VirginAtlantic just can’t be beat!
In [68] the twitter user points out that Aer Lingus sends help to the Philippines and simply
asks whether Ryanair would do the same. The positive words in this tweet therefore apply to
Aer Lingus; for Ryanair, it is a neutral tweet. In [69], British Airways is compared to Virgin
Atlantic. The twitter user mentions positive points of both companies, which is probably the
reason why the tweet was labelled positive by the sentiment analysis tool. For British
Airways, however, this is neither a negative nor a positive tweet.
2.4. Interpretation of the error analysis
In this section, we will discuss how the results of the qualitative error analysis can be
interpreted. Some general conclusions will be drawn which are then used to formulate some
recommendations in order to help improve the performance of the LT3-tool.
First of all, we can divide the eight error categories into two groups: main errors and minor errors. "Sarcasm", "negation", "comparison" and "experience" belong to the group of main errors, while "synonym", "ambivalent", "side-taking" and "Twitter-specific signs" belong to the group of minor errors. In the literature study, "sarcasm" and "negation" were already put forward as two important sentiment detection problems.
A second conclusion applies to the sarcasm errors. The majority of the sarcasm errors were made in the Type 1 and Type 2 tweets, which were labelled negative by the human annotator and erroneously, i.e. positive or neutral, by the LT3-tool. We can therefore conclude that sarcasm is mostly used to express a negative opinion and that the LT3-tool cannot detect this sarcasm. Although Davidov et al. (July 2010) and González-Ibáñez et al. (2011) reported that their classifiers performed well on a sarcastic dataset, it is important to point out that their training was done on a gold-standard dataset containing only tweets with the hashtag "#sarcasm". Based on our examples, however, we clearly noted that not all sarcastic tweets are labelled explicitly with "#sarcasm".
A third finding is that the negation error mainly occurred in Type 1 and Type 3 labelling errors. This means that when negation is used, it mostly shifts the entire sentiment. The LT3-tool often had problems when the negation particle did not directly negate or modify the sentiment words but other words within the tweet (further away from the negation particle). The bag-of-words model by Pang et al. (2002), in which every token between a negation word and the following punctuation mark is tagged, might be useful to solve this problem. We agree with Wilson et al. (2005) that negation features improve the performance of a sentiment analysis tool. Further, our analysis confirms some of the concerns that Wiegand et al. (2010) observed, namely that world knowledge is sometimes necessary to label a tweet correctly and that rare constructions can be negating (cf. modification particles).
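The negation-tagging scheme of Pang et al. (2002) can be sketched as follows. This is a minimal illustration only: the cue and punctuation lists are assumptions, not the LT3-tool's actual configuration.

```python
NEGATION_CUES = {"not", "no", "never", "n't", "hardly"}  # illustrative cue list
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def mark_negation(tokens):
    """Tag every token between a negation cue and the next punctuation mark.

    Follows the scheme of Pang et al. (2002): tagged tokens become distinct
    bag-of-words features, so "useful" and "useful_NEG" are kept apart.
    """
    marked, in_scope = [], False
    for tok in tokens:
        low = tok.lower()
        if low in PUNCTUATION:
            in_scope = False  # punctuation closes the negation scope
            marked.append(tok)
        elif low in NEGATION_CUES:
            marked.append(tok)
            in_scope = True   # open a negation scope
        elif in_scope:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked
```

Applied to a tweet such as "the app is not useful at all .", this yields "useful_NEG", "at_NEG" and "all_NEG", so a classifier can learn that negated sentiment words behave differently.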
A fourth conclusion which could be drawn is that the synonym errors occurred almost exclusively (with one exception) in the Ryanair corpus. This difference between the two corpora could be explained by Ryanair's bad reputation. The presence of background information is an important factor in this case, as a human annotator is aware of the reputation of a company, while a sentiment analysis tool is not. Also in sarcastic tweets, background information is sometimes necessary in order to grasp the correct meaning.
A fifth conclusion is that the "comparison" error is the sole error that is present in all error types. However, similar to the "sarcasm" error, it is mostly present in Type 1 and Type 2 errors: tweets which were labelled negative by the human annotator and respectively positive and neutral by the LT3-tool. The main problem in those tweets with the error "comparison" is that the LT3-tool did not know which words applied to which company. This could be explained by a lack of dependency features.
A sixth conclusion which could be drawn is that important information in hashtags and @mentions often cannot be detected by the LT3-tool, since these contain many concatenated words. As a consequence, important sentiment information is not considered by the LT3-tool.
A final conclusion is that our presumption that reputation influences the number of negative/positive tweets is confirmed. The number of negative tweets in the Ryanair corpus was considerably higher (twenty more negative tweets) than in the British Airways corpus. This difference was most likely caused by Ryanair's reputation, which is much worse than that of British Airways.
2.5. Recommendations
Based on this analysis, we can propose some recommendations to improve the performance of the LT3-tool. As mentioned earlier, the error analysis confirms the concerns discussed in Section 2.3 of the literature study, namely that sarcasm and negation cause substantial problems for sentiment classification tools. Despite the negation features added to the LT3-tool, negation still seems to cause problems, especially when it is not obvious to which word the negation particle applies. Moreover, negation is sometimes not expressed in the ordinary way (i.e. not with an ordinary negation particle, but with a phrase equivalent to negation or by means of modifiers). This could be solved by adding negation features which address this deficit, as well as additional negation and modality cues.
A second recommendation consists in adding sarcasm features, which are currently not incorporated in the LT3-tool; their absence is clearly visible in the results of the error analysis. Also, the hashtag "#sarcasm" should not be seen as the sole hashtag indicating sarcasm (cf. "#nextjoke" in example [2]); features considering such hashtags could therefore help improve the LT3-tool. Further, a sarcasm lexicon could be drawn up containing words that are typically used in sarcastic tweets. It is difficult to put forward such examples as we did not study this matter, but the "non lols" sequence in example [3] could be a sequence indicating sarcasm. This sarcasm lexicon could then be added as a lexicon feature.
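Such a feature group could be sketched as follows. Apart from "#sarcasm", "#nextjoke" and "non lols", which come from the examples in this study, the listed indicators are purely hypothetical placeholders, not a validated lexicon.

```python
# Hypothetical sarcasm indicators; "#sarcasm", "#nextjoke" and "non lols"
# come from this study's examples, the others are illustrative guesses.
SARCASM_HASHTAGS = {"#sarcasm", "#nextjoke", "#irony"}
SARCASM_PHRASES = {"non lols", "yeah right"}

def sarcasm_features(tweet):
    """Return binary sarcasm features for one tweet (illustrative sketch)."""
    lower = tweet.lower()
    return {
        "has_sarcasm_hashtag": any(tag in lower for tag in SARCASM_HASHTAGS),
        "has_sarcasm_phrase": any(phrase in lower for phrase in SARCASM_PHRASES),
    }
```

These binary features could then simply be appended to the classifier's existing feature vector.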
Thirdly, more dependency features should be added, as the LT3-tool currently often makes erroneous links between different words (i.e. which negative or positive words apply to which company when a comparison is made). Words which are further apart, especially, are often linked erroneously.
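The attribution problem can be illustrated with a toy heuristic that credits each sentiment word to the most recently mentioned company. Both lexicons below are hypothetical, and a real solution would rely on dependency parses rather than on word order.

```python
# Toy lexicons for illustration only.
COMPANIES = {"@britishairways": "British Airways", "@virginatlantic": "Virgin Atlantic"}
SENTIMENT = {"best": 1, "beat": 1, "worst": -1, "delayed": -1}

def attribute_sentiment(tokens):
    """Credit each sentiment word to the last company mentioned before it.

    A crude stand-in for dependency features: it handles tweets like [69],
    but fails whenever a sentiment word precedes its actual target.
    """
    scores, current = {}, None
    for tok in tokens:
        low = tok.lower()
        if low in COMPANIES:
            current = COMPANIES[low]   # switch attribution target
            scores.setdefault(current, 0)
        elif low in SENTIMENT and current is not None:
            scores[current] += SENTIMENT[low]
    return scores
```

Applied to example [69], this heuristic credits "best" to British Airways and "beat" to Virgin Atlantic, instead of assigning all positive words to a single company.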
Fourth, the tool should somehow be able to consider background information on the companies concerned, as Twitter users expect their readers to know something about, for example, the reputation of a certain company (cf. the Ryanair corpus in this dissertation). Adding such features could help diminish the number of "synonym" errors. However, it might be complex to add such features, as they would need frequent revision because reputations can shift over time (cf. Ryanair is trying to change its reputation).
Finally, another important pre-processing step should be added which enables the tool to segment hashtags and @mentions. At present, important sentiment indicators sometimes occur in those hashtags or @mentions, but they cannot be detected by the LT3-tool.
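Such a segmentation step could be sketched as a longest-match-first split against a word list. The tiny vocabulary below is an assumption for illustration only; a real system would need a full dictionary.

```python
# Tiny illustrative vocabulary; a real system would use a full word list.
VOCAB = {"british", "airways", "worst", "flight", "ever", "late", "again"}

def segment_hashtag(tag, vocab=VOCAB):
    """Greedily split a hashtag such as '#worstflightever' into known words.

    Tries the longest dictionary word first at each position; returns the
    original tag unchanged when no full segmentation is found.
    """
    text = tag.lstrip("#@").lower()
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            if text[i:j] in vocab:
                words.append(text[i:j])
                i = j
                break
        else:
            return [tag]  # unknown substring: give up
    return words
```

The recovered words ("worst", "flight", "ever") could then be passed to the sentiment lexicons like any other tokens, instead of being lost inside a single concatenated string.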
CONCLUSION
This dissertation consisted of two main parts: on the one hand, a literature study on sentiment analysis in general and on sentiment analysis in Twitter specifically; on the other hand, a corpus analysis.
In the literature study, we provided an overview of research that has been conducted on sentiment analysis in general and we discussed the state of the art on sentiment analysis in Twitter. In Section 1 of the literature study, some basic concepts were explained, such as sentiment and subjectivity. Further, the two steps in the sentiment classification process, subjectivity classification and sentiment classification, were defined. The subclasses in the sentiment classification process (document level, sentence level, word level and feature level) were also briefly described. In Section 2, we presented an overview of the current state of the art on sentiment analysis in Twitter specifically, since the specific linguistic features in tweets call for a different approach. We first provided some basic information about Twitter itself, after which we made a subdivision between two main approaches to sentiment analysis in Twitter: shallow machine learning approaches and approaches incorporating more linguistic knowledge. As far as the shallow machine learning approaches are concerned, we can conclude that these already form a solid baseline for performing sentiment analysis on Twitter data. Using features such as lexicons and part-of-speech (PoS) tags, these classifiers achieve good results. However, deeper linguistic phenomena such as sarcasm and negation cause problems for these shallow classifiers. The approaches incorporating more linguistic knowledge try to solve these problems. We provided an overview of studies on incorporating sarcasm and negation features in classifiers, and we can conclude that adding such features to a classifier helps.
In the second part of this dissertation, the corpus analysis, we tested the performance of the LT3-tool, a classifier developed by the Language and Translation Technology Team of Ghent University. We collected a corpus of tweets reporting on airline companies (British Airways and Ryanair), which were labelled manually for sentiment (positive, negative and neutral). In the next phase, the labels of the human annotator were compared to those of the classifier. The corpus analysis itself consisted of both a quantitative and a qualitative analysis. In the quantitative analysis, three values were calculated: accuracy, precision and recall. In the qualitative analysis, an extensive error analysis was performed.
We can conclude that the LT3-tool performs poorly on our dataset; it obtained an accuracy of only 21%. This was expected, however, since our corpus deliberately consisted of tweets which should be difficult to label automatically. As to precision, we found that the classifier performed best on the negative tweets; for recall, on the other hand, the LT3-tool performed best on the neutral tweets. In order to gain more insight into the errors committed, we performed an in-depth error analysis which revealed some recurring difficulties. We distinguished eight different error categories, four of which caused substantial problems for the LT3-tool: "sarcasm", "negation", "comparison" and "experience". Other, minor problems were "synonym", "ambivalent", "side-taking" and "Twitter-specific signs". In the literature study, the errors "sarcasm" and "negation" had already been put forward as important shortcomings of other sentiment analysis tools. Our analysis confirmed those concerns, as "sarcasm" and "negation" still posed significant problems even though negation features, for example, were included in the classifier. This error analysis enabled us to propose some recommendations for improving the performance of the LT3-tool. Although negation and dependency features were already considered in the LT3-tool, we proposed adding more of them: a feature which also treats modifiers as negation, and additional dependency features so that words which are further apart can also be linked. Further, we recommended adding two new feature groups: first, features that account for sarcasm, which is currently not recognised by the classifier; secondly, a feature group which can in some way represent background information. The latter is not self-evident, however, since such features would need a lot of supervision, as reputations, for example (which can be seen as a kind of background information), can shift. Finally, adding a pre-processing step to the tool which is able to segment @mentions and hashtags would be very useful, as a lot of important sentiment information in those Twitter-specific signs is currently lost.
We hope this study has shown that an extensive linguistic analysis offers valuable insights for improving sentiment analysis in Twitter. It would be very interesting to see in follow-up research whether the recommendations put forward in this dissertation can be implemented in the LT3-tool and whether they actually improve the automatic sentiment analysis of our corpus.
BIBLIOGRAPHY
Barbosa, L. & Feng, J. (2010). Robust Sentiment Detection on Twitter from Biased and Noisy
Data. Coling 2010: Poster Volume. p36-44
Davidov, D., Tsur, O. & Rappoport, A. (August 2010). Enhanced Sentiment Learning Using
Twitter Hashtags and Smileys. Coling 2010: Poster Volume. p241-249
Davidov, D., Tsur, O. & Rappoport, A. (July 2010). Semi-Supervised Recognition of
Sarcastic Sentences in Twitter and Amazon. Proceedings of the Fourteenth
Conference on Computational Natural Language Learning. p107-116
Esuli, A. & Sebastiani, F. (2006). Determining Term Subjectivity and Term Orientation for
Opinion Mining. EACL. Vol. 6. p193-200
Go, A., Bhayani, R., & Huang, L. (2009). Twitter sentiment classification using distant
supervision. CS224N Project Report, Stanford. p1-12.
González-Ibáñez, R., Muresan, S. & Wacholder, N. (2011). Identifying Sarcasm in Twitter: A
Closer Look. Proceedings of the 49th Annual Meeting of the Association for
Computational Linguistics: Short Papers. p581-586
Hu, M. & Liu, B. (2004). Mining and summarizing customer reviews. Proceedings of the
ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). p168-
177.
Jia, L., Yu, C. & Meng, W. (2009). The Effect of Negation on Sentiment Analysis and
Retrieval Effectiveness. Proceedings of the 18th ACM Conference on Information and
Knowledge Management. p1827-1830
Kaplan, A. M. & Haenlein, M. (2011). The early bird catches the news: Nine things you
should know about micro-blogging. Business Horizons. 54, p105-113.
Kim, S.M. & Hovy, E. (2004). Determining the Sentiment of Opinions. Proceedings of the
20th International Conference on Computational Linguistics. p1367-1373
Kouloumpis, E., Wilson, T. & Moore, J. (2011). Twitter Sentiment Analysis: The Good the
Bad and the OMG! Association for the Advancement of Artificial Intelligence. p538-541
Kumar, A. & Sebastian, T.M. (2012). Sentiment Analysis: A Perspective on its Past, Present
and Future. I.J. Intelligent Systems and Applications. 10, p1-14.
Liu, B. (2010). Sentiment Analysis and Subjectivity. In: Indurkhya, N. and Damerau, F.J.
Handbook of Natural Language Processing. 2nd ed. Cambridge, UK: Chapman and
Hall/CRC. p627-667.
Liu, K.L., Li W.J. & Guo, M. (2012). Emoticon Smoothed Language Models for Twitter
Sentiment Analysis. Association for the Advancement of Artificial Intelligence. p456-
462
Liu, J. & Seneff, S. (2009). Review Sentiment Scoring via a Parse-and-Paraphrase Paradigm.
Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing. p161-169
Mohammad, S.M. (2012). #Emotional Tweets. Proceedings of the First Joint Conference on
Lexical and Computational Semantics (*SEM). p246-255
Mohammad, S.M., Kiritchenko, S. & Zhu, X. (2013). NRC-Canada: Building the State-of-
the-Art in Sentiment Analysis of Tweets. Proceedings of the Seventh International
Workshop on Semantic Evaluation Exercises (SemEval-2013). Atlanta, Georgia, US.
Mohammad, S.M. & Turney, P.D. (2010). Emotions Evoked by Common Words and Phrases:
Using Mechanical Turk to Create an Emotion Lexicon. Proceedings of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies 2010 Workshop on Computational Approaches to Analysis
and Generation of Emotion in Text. p26-34
Mohammad, S.M. & Yang, T. (2011). Tracking Sentiment in Mail: How Genders Differ on
Emotional Axes. Proceedings of the 2nd Workshop on Computational Approaches to
Subjectivity and Sentiment Analysis (WASSA 2011). p70-79
Molloy, A. (2014). British Airways climbs to new heights as it tops UK brands list. Available:
http://www.independent.co.uk/news/uk/home-news/british-airways-climbs-to-new-
heights-as-it-tops-uk-brands-list-9148758.html. Last accessed 16th May 2014.
Nakov, P., Kozareva, Z., Ritter, A., Rosenthal, S., Stoyanov, V. & Wilson, T. (2013).
SemEval-2013 Task 2: Sentiment Analysis in Twitter. Second Joint Conference on
Lexical and Computational Semantics (*SEM), Volume 2: Seventh International
Workshop on Semantic Evaluation (SemEval 2013). p312-320
Nasukawa, T. & Yi, J. (2003). Sentiment analysis: Capturing favorability using natural
language processing. Proceedings of the 2nd International Conference on Knowledge
Capture. p70-77
Pang, B. & Lee, L. (2004). A Sentimental Education: Sentiment Analysis Using Subjectivity
Summarization Based on Minimum Cuts. Proceedings of the 42nd Annual Meeting of
the Association for Computational Linguistics. p271-278
Pang, B., Lee, L. & Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification Using
Machine Learning Techniques. Proceedings of the ACL-02 Conference on Empirical
Methods in Natural Language Processing-Volume 10. p79-86
Polanyi, L. & Zaenen, A. (2004) Context Valence Shifters. Proceedings of the Advancement
of Artificial Intelligence Conference Spring Symposium on Exploring Attitude and
Affect in Text. p106-111
Travelmole. (2014). Ryanair to boost reputation with new marketing chief. Available:
http://www.travelmole.com/news_feature.php?news_id=2009841. Last accessed 16th
May 2014.
Van Hee, C., Van de Kauter, M., De Clercq, O., Lefever, E. & Hoste, V. (2014). LT3:
Sentiment Classification in user-generated content using a rich feature set. Submitted
to the eighth international workshop on Semantic Evaluation Exercises (SemEval-2014)
Vizard, S. (2013). Brand Audit: Ryanair. Available:
http://www.marketingweek.co.uk/sectors/travel-and-leisure/brand-audit-
ryanair/4008433.article. Last accessed 16th May 2014.
Wiegand, M., Balahur, A., Roth, B., Klakow, D. & Montoyo, A. (2010). A Survey on the
Role of Negation in Sentiment Analysis. Proceedings of the Workshop on Negation
and Speculation in Natural Language Processing. p60-68
Wilson, T., Wiebe, J. & Hoffmann, P. (2005). Recognizing Contextual Polarity in Phrase-
Level Sentiment Analysis. Proceedings of Human Language Technology Conference
and Conference on Empirical Methods in Natural Language Processing
Winch, J. (2014). British Airways tops UK brand ratings. Available:
http://www.telegraph.co.uk/finance/newsbysector/retailandconsumer/10656804/Britis
h-Airways-tops-UK-brand-rankings.html. Last accessed 16th May 2014.
APPENDIX I: MATRICES, ACCURACY, RECALL AND PRECISION
GENERAL
                            Predicted sentiment (machine)
Overall sentiment (human)   Negative    Positive    Neutral
Negative                    37          90          82
Positive                    35          19          21
Neutral                     4           6           6
Accuracy = (37 + 19 + 6) / 300 = 0.21 = 21%
Precision(positive) = 19 / (19 + 90 + 6) = 0.17 = 17%
Recall(positive) = 19 / (19 + 35 + 21) = 0.25 = 25%
Precision(negative) = 37 / (37 + 35 + 4) = 0.49 = 49%
Recall(negative) = 37 / (37 + 90 + 82) = 0.18 = 18%
Precision(neutral) = 6 / (6 + 82 + 21) = 0.06 = 6%
Recall(neutral) = 6 / (6 + 4 + 6) = 0.38 = 38%
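For verification, these values can be recomputed from the confusion matrix with a short sketch (rows are human labels, columns are machine labels, in the order negative, positive, neutral):

```python
# Confusion matrix from the table above: rows = human label,
# columns = machine label, order: negative, positive, neutral.
MATRIX = [
    [37, 90, 82],  # human: negative
    [35, 19, 21],  # human: positive
    [4,  6,  6],   # human: neutral
]

def accuracy(m):
    # Correct predictions sit on the diagonal.
    return sum(m[i][i] for i in range(len(m))) / sum(sum(row) for row in m)

def precision(m, i):
    # Of everything the machine labelled as class i, how much was correct?
    column_total = sum(row[i] for row in m)
    return m[i][i] / column_total if column_total else 0.0

def recall(m, i):
    # Of everything the human labelled as class i, how much did the machine find?
    row_total = sum(m[i])
    return m[i][i] / row_total if row_total else 0.0
```

For instance, accuracy(MATRIX) yields 62/300 ≈ 0.21 and precision(MATRIX, 0) yields 37/76 ≈ 0.49, matching the figures above.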
BRITISH AIRWAYS
                            Predicted sentiment (machine)
Overall sentiment (human)   Negative    Positive    Neutral
Negative                    24          40          31
Positive                    19          14          12
Neutral                     1           5           4
Accuracy = (24 + 14 + 4) / 150 = 0.28 = 28%
Precision(positive) = 14 / (14 + 40 + 5) = 0.24 = 24%
Recall(positive) = 14 / (14 + 19 + 12) = 0.31 = 31%
Precision(negative) = 24 / (24 + 19 + 1) = 0.55 = 55%
Recall(negative) = 24 / (24 + 40 + 31) = 0.25 = 25%
Precision(neutral) = 4 / (4 + 31 + 12) = 0.09 = 9%
Recall(neutral) = 4 / (4 + 1 + 5) = 0.40 = 40%
RYANAIR
                            Predicted sentiment (machine)
Overall sentiment (human)   Negative    Positive    Neutral
Negative                    13          50          51
Positive                    16          5           9
Neutral                     3           1           2
Accuracy = (13 + 5 + 2) / 150 = 0.13 = 13%
Precision(positive) = 5 / (5 + 50 + 1) = 0.09 = 9%
Recall(positive) = 5 / (5 + 16 + 9) = 0.17 = 17%
Precision(negative) = 13 / (13 + 16 + 3) = 0.41 = 41%
Recall(negative) = 13 / (13 + 50 + 51) = 0.11 = 11%
Precision(neutral) = 2 / (2 + 9 + 51) = 0.03 = 3%
Recall(neutral) = 2 / (2 + 3 + 1) = 0.33 = 33%
APPENDIX II: EXACT DATA OF ERROR ANALYSIS
Sarcasm Negation Comparison Experience Synonym Ambivalent Side-taking Twitter-specific signs
Type 1 45 14 15 13 1 0 0 2
Type 2 37 7 7 20 10 0 0 1
Type 3 5 11 12 1 0 1 5 0
Type 4 1 1 7 8 0 2 1 1
Type 5 0 0 3 1 0 0 0 0
Type 6 0 0 2 4 0 0 0 0
Table 15 – Error analysis of the British Airways and Ryanair corpus combined
Sarcasm Negation Comparison Experience Synonym Ambivalent Side-taking Twitter-specific signs
Type 1 20 8 4 8 0 0 0 0
Type 2 14 3 0 12 1 0 0 1
Type 3 3 6 7 1 0 1 1 0
Type 4 1 0 3 5 0 2 0 1
Type 5 0 0 1 0 0 0 0 0
Type 6 0 0 1 4 0 0 0 0
Table 16 – Error analysis of the British Airways corpus
Sarcasm Negation Comparison Experience Synonym Ambivalent Side-taking Twitter-specific signs
Type 1 25 6 11 5 1 0 0 2
Type 2 23 4 7 8 9 0 0 0
Type 3 2 5 5 0 0 0 4 0
Type 4 0 1 4 3 0 0 1 0
Type 5 0 0 2 1 0 0 0 0
Type 6 0 0 1 0 0 0 0 0
Table 17 – Error analysis of the Ryanair corpus