
Automatic Detection of Nastiness and Early Signs of Cyberbullying

Incidents on Social Media

by

Niloofar Safi Samghabadi

A dissertation submitted to the Department of Computer Science,

College of Natural Sciences and Mathematics

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in Computer Science

Chair of Committee: Dr. Thamar Solorio

Committee Member: Dr. Edgar Gabriel

Committee Member: Dr. Rakesh Verma

Committee Member: Dr. Ruihong Huang

University of Houston

May 2020

Copyright 2020, Niloofar Safi Samghabadi

ACKNOWLEDGMENTS

I would like to express my most profound appreciation to my advisor, Dr. Thamar Solorio,

for her continuous encouragement, patience, kindness, and supervision that enabled me to

pursue the degree. This research would not have been possible without her support and

guidance. I want to extend my sincere gratitude to Dr. Edgar Gabriel, Dr. Rakesh Verma,

and Dr. Ruihong Huang for serving on my dissertation committee and providing constructive

feedback to improve my research. I would like to thank the Department of Computer Science

at the University of Houston for financially supporting me during my five years of Ph.D.

study. I also wish to acknowledge the current and former members of RiTUAL Lab, from whom I have always learned over the last several years. Without their collaboration and support, I could not have reached this point.

This dissertation would not have been possible without the support, help, and love of my

husband, and the unconditional love of my family. I am sincerely grateful to them all for

being the support and driving force that I needed.


ABSTRACT

Although social media has made it easy for people to connect on an unlimited virtual space,

it has also opened doors to people who misuse it to bully others. Nowadays, abusive behavior

and cyberbullying are considered major issues in cyberspace that can seriously affect the

mental and physical health of victims. However, due to the growing number of social media

users, manual moderation of online content is impractical. Available automatic systems for

hate speech and cyberbullying detection fail to make opportune predictions, which makes

them ineffective for warning the possible victims of these attacks.

In this dissertation, we aim at advancing new technology that will help to protect vul-

nerable online users against cyber attacks. As a first approximation to this goal, we develop

computational methods to identify extremely aggressive texts automatically. We start by

exploiting a wide range of linguistic features to create a machine learning model to detect on-

line abusive content. Then, we build a deep neural architecture to identify offensive content

in online short and noisy texts more precisely, by incorporating emotion information into

textual representations. We further expand these methods and propose a Natural Language

Processing system that constantly monitors online conversations, and triggers an alert when

a possible case of cyberbullying is happening. We design a new evaluation framework, and

show that our system is able to provide timely and accurate cyberbullying predictions, based

on limited evidence.

In this research, we are mainly concerned with kids and young adults, as the most vul-

nerable group of users under online attacks. To this end, we propose new language resources

for both tasks of abusive language and cyberbullying detection from social media platforms

that are specifically popular among youth. Furthermore, within our experiments, we

discuss the differences among these corpora and the other available resources that include

data on adult topics.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES

1 Introduction
  1.1 Motivation
  1.2 Research Objectives
  1.3 Structure of this Dissertation

2 Literature Review
  2.1 Automatic Detection of Online Aggression and Hate Speech
  2.2 Automatic Detection of Cyberbullying
  2.3 Early Text Categorization
  2.4 Challenges and Limitations

3 Problem Formulation
  3.1 Abusive Language Detection
  3.2 Cyberbullying Detection
    3.2.1 Chunk-by-Chunk Evaluation
    3.2.2 Post-by-Post Evaluation

4 Data Resources
  4.1 ask.fm Abusive Language Dataset
    4.1.1 Data Collection and Sampling
    4.1.2 Annotation Process
    4.1.3 Data Statistics
  4.2 Curious Cat Abusive Language Dataset
    4.2.1 Data Collection and Annotation
    4.2.2 Data Statistics
  4.3 ask.fm Cyberbullying Dataset
  4.4 Other Available Datasets
    4.4.1 Other Abusive Language Corpora
    4.4.2 Instagram Cyberbullying Corpus

5 Feature Engineering for Abusive Language Detection
  5.1 Methodology
  5.2 Experimental Setup
  5.3 Baseline Method
  5.4 Classification Results and Analysis
  5.5 Analysis of Bad Words
  5.6 Feature-Based Model on Other Resources
    5.6.1 Data
    5.6.2 Data Preprocessing
    5.6.3 Methodology
    5.6.4 Experimental Setup
    5.6.5 Results
    5.6.6 Analysis
  5.7 Findings

6 Paying Attention to the Emotions for Abusive Language Detection
  6.1 Model Architecture
  6.2 Experimental Setup
  6.3 Baselines
  6.4 Classification Results
  6.5 Why Does DeepMoji Work?
  6.6 Findings

7 Detecting the Early Signs of Cyberbullying
  7.1 Traditional Machine Learning Approach
    7.1.1 Methodology
    7.1.2 Experimental Settings
    7.1.3 Classification Results
    7.1.4 Analysis
  7.2 Deep Learning Approach
    7.2.1 Time-Wise Segmentation
    7.2.2 Model Architecture
    7.2.3 Experimental Setup
    7.2.4 Decision-Making Process
    7.2.5 Classification Results
    7.2.6 Comparison with State-of-the-art
  7.3 Findings

8 Conclusions and Future Work
  8.1 Conclusions
  8.2 Future Work


BIBLIOGRAPHY


LIST OF TABLES

4.1 Statistics for ask.fm abusive language corpus.
4.2 Examples of the different topics in ask.fm abusive language dataset.
4.3 Statistics for Curious Cat abusive language corpus.
4.4 Parts of a cyberbullying instance in our corpus.
4.5 Statistics for ask.fm cyberbullying corpus.
4.6 Average length of posts, and words in ask.fm, Curious Cat, Kaggle, and Wikipedia data sets in terms of the average number of words and the average number of characters, respectively.
4.7 Data statistics for Instagram cyberbullying corpus.

5.1 Negative patterns for detecting nastiness. The capital letters show the abbreviations for the following POS tags: L = nominal + verbal (e.g., I'm) / verbal + nominal (e.g., let's), R = adverb, D = determiner, A = adjective, N = noun, O = pronoun (not possessive).
5.2 The results of baseline experiments for invective class.
5.3 Classification results for invective class. N/A stands for the feature sets that are not applicable to Kaggle and Wikipedia datasets.
5.4 Top negative features.
5.5 Examples of mislabeled instances by the classifier.
5.6 Degree of negativity for bad words.
5.7 Data distribution for TRAC English and Hindi corpus.
5.8 Validation results employing various feature sets for the English and Hindi datasets using the Logistic Regression model. In this table, BU stands for Binary Unigram, and N/A stands for the feature sets that are not applicable to Hindi data.
5.9 Results for the English test set. FB and SM stand for Facebook and Social Media, respectively.
5.10 Results for the Hindi test set. FB and SM stand for Facebook and Social Media, respectively.
5.11 Misclassified examples in case of the aggression level. In these examples, the predicted labels seem more reasonable to us than the actual labels, which show that the perceived level of aggression is subjective.

6.1 Classification results in terms of F1-score for the negative/offensive class and weighted F1. The values that are in bold show the best results obtained for each dataset. +DM refers to the experiments in which we directly concatenate DM vectors with the last hidden representation generated by the model.
7.1 F1-score for the chunk-by-chunk evaluation for the positive class. The bold values show the best performance gained for each feature set.
7.2 Padding length in segment-level and conversation-level for various segmentation algorithms.
7.3 Classification results for cyberbullying detection using different segmentation methods in terms of precision, recall, F1, F-latency, and average number of comments that the model needs to monitor for making a cyberbullying prediction. The values in bold show the best scores obtained for each segmentation method.
7.4 Comparison to the state-of-the-art results for detecting early signs of cyberbullying.

LIST OF FIGURES

3.1 Chunk-by-chunk evaluation framework. The larger white document shows an instance/conversation. The smaller colorful documents indicate 10 different equally sized chunks of the conversation. In this setup, the model is evaluated across 10 different iterations. In each iteration, the model gets access to one more chunk of data.
3.2 Post-by-post evaluation framework. In this framework, in every iteration t, the model gets access to the t-th post in each conversation k. Then, the output is generated based on the current information, and the system decides whether to trigger a cyberbullying alert through a policy-based decision-making process. (a) The system did not trigger an alert for conversation k, and finally labeled it as a non-cyberbullying instance after monitoring all the posts inside this conversation. (b) The system triggered a cyberbullying alert for conversation k at the end of the third iteration, and was not allowed to monitor more posts for that conversation.

4.1 An example of a random user's timeline in ask.fm.
4.2 CrowdFlower interface for contributors to annotate the data.
4.3 An example of a random user's timeline on the Curious Cat website.
4.4 Complete agreement for questions and answers across negative and positive labeled data.
4.5 Overall process of building ask.fm cyberbullying corpus.
5.1 Label distribution comparison between training and evaluation sets for both English and Hindi languages.
5.2 Confusion matrix plots of our best performing systems for English Facebook and Social Media data.
5.3 Confusion matrix plots of our best performing systems for Hindi Facebook and Social Media data.

6.1 Overall architecture of our unified deep neural model for abusive language detection. This model combines the textual and emotion representations for a given input comment.
6.2 Top 5 emojis that DeepMoji model assigned to one neutral and one offensive instance from our Curious Cat data. The words are colored based on the attention weights given by the DeepMoji model. Darker colors show higher attention weights.
6.3 Emoji distribution over Curious Cat data.

7.1 Flow of Emojis.
7.2 Overall architecture of our model for cyberbullying detection. The model consists of three main modules: (1) Word Encoder that encodes the sequence of words in each segment to create segment representation, (2) Segment Encoder that encodes the sequence of segments to create media session representation, and (3) Classification Layer that provides the final classification. UMR (User Mention Rate), ATG (Average Time Gap), and ADM (Average DeepMoji Vector) show the hand-engineered features that are used to provide more context for each segment.
7.3 Updating the input sequence for a long media session at iteration t. Each segment includes at most 5 posts, but we might have a lower number of posts in a segment due to the big time gap between two posts. In this example, the time gap between p7 and p8 is greater than one month, and they are placed in different segments.
7.4 Delay distribution for HAN + UMR model.

Chapter 1

Introduction

Nowadays, the internet has become the primary communication tool worldwide. There

are several social media platforms through which people get equal opportunities to share

information and interact with each other in a virtually unlimited space. Such platforms are

beneficial for online users to develop better social skills, and learn about new ideas and issues.

However, they might put users at risk of harassment, bullying, and cyber-attacks, as

well.

1.1 Motivation

Cyberbullying is one of the most unfortunate effects of online social media. It is defined as the

use of information/communication technologies (ICTs) to harm other people by sending or

posting negative, harmful, false, or mean content to them intentionally and repeatedly. The

most vulnerable group of users under these attacks are teens and preteens [34]. According

to a High School Youth Risk Behavior Survey, 14.8% of students surveyed nationwide in the

United States (US) reported being bullied electronically [46]. Another study, done by the


Cyberbullying Research Center from 2007 to 2015, shows that on average, 26.3% of middle

and high school students from across the US were victims of cyberbullying [49]. Also, on

average, about 16% of the students admitted that they cyberbullied others at some point

in their lives. Previous research shows that there is a statistically significant relationship

between low self-esteem and experiences with cyberbullying [50]. Additionally, cyberbullying

victims face social, emotional, physiological, and psychological disorders that may lead them

to harm themselves or to commit suicide [76, 23]. Therefore, it is extremely important to detect cyberbullying before it affects our young generation.

The daily growth of online communities has raised the need for developing automatic

methods to moderate online content. To meet this need, automatic detection of cyberbul-

lying and online abusive content has become a hot topic in the area of Natural Language

Processing and Machine Learning in recent years. However, most of the available studies

are focused on offline settings (i.e., detecting cyberbullying after it has taken place). The biggest

drawback of such models is that they cannot be used for prevention. The ideal system has

to take an opportune action (e.g., triggering an alert) based on limited evidence considering:

(1) the confidence in the decision to take action, and (2) the risk of waiting for more evidence.

Traditional text classification algorithms do not take the prediction time into account. This

motivated us to explore early text categorization strategies. In this scenario, the system per-

formance is evaluated based on both the accuracy and the earliness of the predictions. More

specifically, we deployed the early text categorization strategies to design a dynamic mecha-

nism that monitors the streams of messages for online users and provides timely predictions

of whether cyberbullying is happening in any of those conversations.


1.2 Research Objectives

In this dissertation, we aim at developing computational methods to detect abusive language

and early signs of cyberbullying in social media. The main objectives of our work are as

follows:

1. Design discriminative methods to identify online posts that include aggression and

abusive language.

2. Expand those methods to monitor the thread of online messages, and build a sequential

decision-making model on top to detect the early signs of cyberbullying.

As a part of this research, we build appropriate corpora that allow us to design automated

approaches for detecting abusive language and early signs of cyberbullying. We collect the

data from social media platforms that are specifically popular among teens and pre-teens.

We focused on this demographic group because they are particularly vulnerable to online

attacks, and possibly less experienced in handling them. Abusive language can have

many forms, and swear words are not always a good indicator of it. A comment can still be offensive towards a user without including any profane words. Also, sometimes people

use bad words with no intention to attack others (e.g., for joking around with their friends).

Given this, we introduce a new approach for sampling data that covers more forms of abuse.

We further use these language resources to design sophisticated methods that extract con-

textual information from online posts using several types of features (e.g., lexical, emotional,

semantic). These features are further used to identify abusive language in

different online domains.

Finally, we expand the proposed abuse detection methods to build a system that can

detect the early signs of cyberbullying accurately, using as little evidence as possible.


More precisely, we advance automatic techniques to model the dynamic interactions among

online users, and design a decision-making process to analyze this information sequentially.

This system can be further applied to other tasks, where early prediction is relevant (e.g.,

detecting cyber-pedophiles).

1.3 Structure of this Dissertation

In this section, we provide a brief explanation of the chapters included in this dissertation

as follows:

Chapter 2 provides a comprehensive overview of the previous research that has been

done with respect to the related topics.

Chapter 3 discusses the task formulation as well as the evaluation metrics that we used

for both tasks of abusive language and cyberbullying detection.

Chapter 4 describes the motivation behind the social media platforms that we chose for

collecting the data, and discusses two different approaches that we use for creating the new

language resources for both tasks of abusive language and cyberbullying detection. In this

chapter, we also introduce the other available corpora that we use to examine the robustness

of our proposed models.

Chapter 5 covers our initial efforts to detect online abusive language using traditional

machine learning approaches. This chapter aims at investigating the effects of various hand-

engineered features on the task of aggression identification.

Chapter 6 is dedicated to introducing an end-to-end deep neural network architecture

using a new attention mechanism called emoji-aware attention. The motivation behind this

chapter is to create a unified deep learning model that enriches the textual representation of

online posts using their hidden emoji information.

Chapter 7 describes our approaches to detect the early signs of cyberbullying in social


media. In this chapter, we develop a predictive model that consecutively monitors the thread

of online messages, and identifies the conversations that include cyberbullying incidents as

early as possible, using partial information.

Chapter 8 summarizes the main contributions and the significant findings of this disser-

tation. We also list possible future research in the area of abusive language and cyberbullying

detection.


Chapter 2

Literature Review

This chapter discusses the relevant works that have been done on topics similar to those of this

dissertation. The first section reviews the automatic methods for hate speech and abusive

language detection. Afterward, we briefly review the existing approaches for cyberbullying

detection. We also discuss the research work in the area of early text categorization. Finally,

we mention the main challenges and limitations of the previous studies that we aim to address

in this dissertation.

2.1 Automatic Detection of Online Aggression and Hate

Speech

In recent years, several studies have been done toward detecting abusive and hateful language

in online texts. Some of these works targeted different online platforms like Twitter [73, 10,

80], Wikipedia [75], and Facebook [29] to create new language resources for encouraging

other research groups to contribute to the task of aggression identification.


The most common approach for abuse detection is to utilize a combination of vari-

ous types of hand-engineered features. Previous studies explored the use of lexicon-based

features [54, 21, 74], bag-of-words (BoW) [65, 71], and combination of word and character n-

grams with other features such as typed-dependency relations and sentiment score [13, 69].

A few related works also investigated user-level information such as age [9], gender [73],

geo-location, and language [20], and showed promising results.

With the advent of deep neural networks, multiple studies have explored the proficiency

of these approaches for abusive language detection. Among various neural network architec-

tures, Convolutional Neural Network (CNN) [48, 61, 43], and its combination with Gated

Recurrent Unit (GRU) [70, 81] have been shown to work quite efficiently for the task. However, it

has been shown that the ensemble of deep learning and traditional machine learning models

outperforms models that use one or the other [68], since these techniques make differ-

ent errors. There are also a few works that incorporated user information into the system,

and showed some improvements in the performance. For example, they used Graph Convo-

lutional Networks to generate user embeddings, considering the network of the users [55, 42].

2.2 Automatic Detection of Cyberbullying

Although there are several works on detecting abusive language and hate speech, only a

few studies have addressed cyberbullying detection. Dinakar et al. constructed a common

sense knowledge base - BullySpace - with knowledge about bullying situations and a wide

range of common daily topics using YouTube data [12]. Xu et al. studied bullying traces

and formulated cyberbullying detection as different Natural Language Processing (NLP)

tasks [76]. For instance, they used latent topic modeling to analyze the topics commonly

discussed in bullying comments.

Other available research works on this topic have investigated cyberbullying on Instagram


and Vine [24], using text-based [52, 78], and image-based [27] features. There are also a

few studies that have used temporal information to detect cyberbullying by using several

different time-related features [64] and modeling the structure of a social media session with

a hierarchical attention model [7].

2.3 Early Text Categorization

Early text categorization is an emerging research topic that is becoming more popular due to specialized forums such as eRisk-CLEF.1 eRisk started in 2017 and aims at exploring the evaluation methodology, effectiveness metrics, and practical applications (particularly those related to health and safety) of early risk detection on the Internet. eRisk has emphasized topics such as detecting the early signs of depression [36], anorexia [37, 38], and self-harm [38] by monitoring the threads of online messages collected from Reddit.2 Most of the current approaches for this task rely on basic methods

such as naïve Bayes with fixed thresholds for the maximum posterior probability [15, 16, 14]

to perform early classification. The rest of the available works use either manual feature

engineering methods like bag-of-words (BoW) [67, 2], or an ensemble of several neural and

non-neural methods [44, 17] to approach the task. Most of these models are very complex

and require high computational resources.

2.4 Challenges and Limitations

Although several research works have been done on abusive language detection, existing

methods mostly focused on one or two different datasets. There is no proof that these

techniques perform the same on other online domains. Our first incentive in this dissertation

1https://erisk.irlab.org
2https://www.reddit.com


is to explore the possibility of having one single model that solves the problem of aggression

identification across different online platforms. Another relevant limitation, in this case,

is the lack of data. Several corpora are available for detecting hate speech and offensive

language. However, due to the diverse nature of data across different social media, we still

need to investigate further resources. For example, most of the available datasets are on adult

topics, and to the best of our knowledge, none of them mainly targets youth. However, as

we discussed in Chapter 1, they are the most vulnerable group of users under online abusive

behavior.

Regarding cyberbullying detection, all previous approaches identify the event after it

takes place. Therefore, none of them are practical for assisting the process of online mod-

eration. To address this limitation, we propose a system that learns the dynamics of online

conversations and can accurately trigger a cyberbullying alert in a timely manner, with as

little evidence as possible. Again, one crucial step in advancing such a system is to build the

appropriate resources suited for the early detection of cyberbullying.


Chapter 3

Problem Formulation

This chapter provides high-level definitions for the two following related problems that we

address in this dissertation: (1) abusive language detection, and (2) detecting the early signs

of cyberbullying. We also introduce the evaluation metrics that we used for each of the tasks.

3.1 Abusive Language Detection

Similar to the previous related studies, we approach online abusive language detection as a text classification task. In this scenario, we have a list of online comments, c_1, c_2, ..., c_n, as the input to the model, and the classifier is supposed to predict the correct class of a given input comment. All our data resources include only two classes: offensive and neutral.

We examine two different approaches for abusive language detection: (1) machine learning

models with hand-crafted features, and (2) deep neural networks. In the first category of the

models, we use a binary classifier. However, in our deep neural models, instead of having one

single output neuron with a sigmoid activation on top, we have a two-neuron output (one per

class) and use softmax activation to generate the probability distribution over the classes.

The reason is that in the former scenario, we have to set a particular threshold through


which we could map the single probability value to one of the available classes. However,

we found it very difficult to define a threshold that works well across all the resources and

models. As for the evaluation, we used the following metrics:

1. F1-score for the offensive class: This is our primary metric, which is calculated as follows:

\[ \text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{3.1} \]

\[ \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{3.2} \]

\[ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3.3} \]

We use F1-score, because all the datasets that we used are highly imbalanced towards the

positive/neutral class.

2. Weighted F1: This metric calculates the average F1-score over both classes. We used

this metric to ensure that the model does not sacrifice the positive/neutral class to increase

the performance of the negative class.

3. AUC score: This metric calculates the area under the ROC (Receiver Operating

Characteristic) curve. The ROC curve is plotted with True Positive Rate (TPR) against the

False Positive Rate (FPR) at all classification thresholds. TPR and FPR are computed as

follows:

\[ \text{TPR} = \text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}} \tag{3.4} \]

\[ \text{FPR} = \frac{\text{False Positive}}{\text{True Negative} + \text{False Positive}} \tag{3.5} \]

AUC score shows how well the model is capable of distinguishing between classes and is

one of the most popular metrics for evaluating binary classification tasks. We use this metric

to compare the performance of our proposed model with state-of-the-art results.
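To make the computation of these three metrics concrete, the following minimal sketch (assuming scikit-learn is available and that label 1 marks the offensive/invective class) shows how they can be obtained; the labels and probabilities below are hypothetical examples, not values from our datasets:

```python
# Minimal sketch of the three evaluation metrics used in this section.
# Label 1 = offensive/invective class, label 0 = neutral class (assumption).
from sklearn.metrics import f1_score, roc_auc_score

y_true = [0, 1, 0, 0, 1, 1, 0, 0]                   # gold labels (hypothetical)
y_pred = [0, 1, 0, 1, 1, 0, 0, 0]                   # hard predictions
y_prob = [0.1, 0.9, 0.2, 0.6, 0.8, 0.4, 0.3, 0.1]   # P(offensive) per comment

f1_offensive = f1_score(y_true, y_pred, pos_label=1)        # metric 1
f1_weighted = f1_score(y_true, y_pred, average="weighted")  # metric 2
auc = roc_auc_score(y_true, y_prob)                         # metric 3

print(f"F1 (offensive): {f1_offensive:.3f}")
print(f"Weighted F1:    {f1_weighted:.3f}")
print(f"AUC:            {auc:.3f}")
```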


3.2 Cyberbullying Detection

Most of the available research on cyberbullying detection formulated this problem as a typical

text classification task. However, in this dissertation, we approach this problem from a

different perspective, which is detecting the early signs of cyberbullying, based on limited

evidence. We use a particular text classification scenario called “early text categorization”.

In this scenario, we have a dynamic text-stream of online posts, p_1, p_2, p_3, ..., that gradually

form an interactive conversation. Our ultimate goal is to create a system that can provide

accurate and timely cyberbullying alerts without the need to wait for a substantial amount of

evidence. The main difference of this scenario as compared with regular text classification is

that in this case, both the timeliness and the accuracy of the predictions are crucial. Another

characteristic of the early text categorization scenario is that once the classifier decides to

trigger an alert after seeing the t-th comment in an online text-stream, it cannot see the

next comments, or change the decision.

In this dissertation, we examine both traditional machine learning and deep learning

approaches to create the early cyberbullying classifier. In both cases, the training is done

as a usual text classification task. However, the evaluation is done using one of the follow-

ing evaluation frameworks: (1) chunk-by-chunk evaluation framework, or (2) post-by-post

evaluation framework.

3.2.1 Chunk-by-Chunk Evaluation

In this evaluation framework, every instance (conversation) in the test set is divided into

10 equally sized chunks. Each chunk includes 10% of the posts inside the conversation, and

there is no overlap between two different chunks. The first chunk contains the oldest 10% of

the posts, the second chunk consists of the second oldest 10%, and so forth. In this setup,

the evaluation score is reported within 10 iterations.


Figure 3.1: Chunk-by-chunk evaluation framework. The larger white document shows an instance/conversation. The smaller colorful documents indicate 10 different equally sized chunks of the conversation. In this setup, the model is evaluated across 10 different iterations. In each iteration, the model gets access to one more chunk of data.

Figure 3.1 illustrates the chunk-by-chunk evaluation framework. In the first iteration, we gen-

erate a document representation starting with the first chunk and evaluate the performance

of the classifier based on this representation. In every next iteration, we incrementally add

one more chunk of test data and evaluate the performance of the model once again. There-

fore, in the last iteration (iteration 10), we get access to 100% of the data in test instances.

At last, we have 10 different evaluation scores using a different portion of test data in each

iteration. With this evaluation framework, the goal is to get better performance in earlier

iterations. This approach is a standard framework that has commonly been used by several


related works [36, 37, 35, 15].

As for the evaluation metric, we report the F1-score for the cyberbullying class (the class of

interest) in each iteration. We use this metric since our data is highly imbalanced towards

the non-cyberbullying class.
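As an illustration, a minimal sketch of the chunk-by-chunk loop is given below; the `classify` function and the conversation format are hypothetical placeholders rather than the exact models and data structures used in our experiments:

```python
# Sketch of chunk-by-chunk evaluation: each test conversation is split into
# 10 equally sized chunks, and the classifier is re-evaluated as chunks are
# accumulated. `classify(posts) -> 0/1` is a hypothetical placeholder.
from sklearn.metrics import f1_score

def split_into_chunks(posts, n_chunks=10):
    """Split a chronologically ordered list of posts into n chunks."""
    size = max(1, len(posts) // n_chunks)
    chunks = [posts[i * size:(i + 1) * size] for i in range(n_chunks - 1)]
    chunks.append(posts[(n_chunks - 1) * size:])  # last chunk takes the remainder
    return chunks

def chunk_by_chunk_f1(conversations, labels, classify, n_chunks=10):
    """Return one F1-score (cyberbullying class) per iteration."""
    chunked = [split_into_chunks(c, n_chunks) for c in conversations]
    scores = []
    for t in range(1, n_chunks + 1):
        preds = []
        for chunks in chunked:
            visible = [p for chunk in chunks[:t] for p in chunk]  # first t chunks
            preds.append(classify(visible))
        scores.append(f1_score(labels, preds, pos_label=1))
    return scores
```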

3.2.2 Post-by-Post Evaluation

In this evaluation framework, instead of a new chunk of data, the system gets access to only

one more new post per conversation in each iteration. Besides, we need to have a sequential

decision-making process that, in every iteration, decides whether to label each conversation

as cyberbullying based on the current information, or to wait and monitor more posts in the next iterations.

Figure 3.2 illustrates the post-by-post evaluation framework. In the first iteration, the

model only has access to the first post of every conversation k in test data. In the second

iteration, it gets access to the second posts, and so forth. Once the model decides to label a

conversation as cyberbullying, it is not allowed to see the remaining posts in that conversation

or to change the decision. Whenever the model reaches the last post in a media session, it

automatically labels that media session as non-cyberbullying. In this evaluation framework,

we evaluate the performance once, after the last iteration (i.e., when all the conversations

are labeled as cyberbullying or non-cyberbullying). The goal is to identify cyberbullying

conversations accurately and as early as possible.
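The sketch below illustrates this protocol for a single conversation; the `score` function (returning a probability of cyberbullying) and the fixed threshold are hypothetical stand-ins for the policy-based decision-making process described in Chapter 7:

```python
# Sketch of the post-by-post protocol for one conversation. The model sees
# one more post per iteration; once it raises an alert it stops reading.
# `score` and the threshold policy are hypothetical placeholders.
def monitor_conversation(posts, score, threshold=0.9):
    """Return (predicted_label, number_of_posts_read)."""
    for t in range(1, len(posts) + 1):
        if score(posts[:t]) >= threshold:    # enough evidence: trigger the alert
            return 1, t                      # cyberbullying, after t posts
    return 0, len(posts)                     # reached the end: non-cyberbullying
```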

Within this framework, we use the following evaluation metrics:

1. Precision: We report this score for the cyberbullying class. It shows how many of the

cyberbullying predictions made by the model are correct. We calculate this measure through

Equation 3.1.

2. Recall: We compute this score for the cyberbullying class to measure how many instances


Figure 3.2: Post-by-post evaluation framework. In this framework, in every iteration t, the model gets access to the t-th post in each conversation k. Then, the output is generated based on the current information, and the system decides whether to trigger a cyberbullying alert through a policy-based decision-making process. (a) The system did not trigger an alert for conversation k, and finally labeled it as a non-cyberbullying instance after monitoring all the posts inside this conversation. (b) The system triggered a cyberbullying alert for conversation k at the end of the third iteration, and was not allowed to monitor more posts for that conversation.

of this class are identified by our model. Equation 3.2 shows how we calculate this metric.

3. F1-score: This is the harmonic mean of precision and recall (Equation 3.3). F1-score

is a good indicator of the model performance when the data is very imbalanced.

4. Average number of comments: This shows the average number of comments that

the system needs to monitor before making the cyberbullying predictions. This metric is a

good indicator of timeliness.

5. F-latency [56]: We also use F-latency, which takes into account both the timeliness and

the correctness of the predictions. To calculate this metric, we used the same settings as


eRisk2019 [38]. Assume that m ∈ M is a media session, and our early detection system

iteratively analyzes the comments posted to m. After monitoring k_m comments (k_m ≥ 1), the system makes a decision pred_m ∈ {0, 1}, and actual_m ∈ {0, 1} denotes the actual label. Delay is a key component of evaluating the early prediction of cyberbullying, because we do

not want the system to detect cyberbullying incidents too late. We measure the delay as

follows:

\[ \text{latency}_{TP} = \text{median}\{k_m : m \in M,\ pred_m = actual_m = 1\} \tag{3.6} \]

where 1 stands for the cyberbullying class. Similar to [38], we compute the latency only

for the true positives (i.e., the cyberbullying media sessions that are labeled correctly). The intuition is that the false negatives (i.e., actual_m = 1, pred_m = 0) are not detected by

the system and, therefore, they would not generate an alert. However, in [56], the authors

calculated the latency for all instances where actual_m = 1.

Furthermore, we assign the following penalty to each individual true positive decision,

taken after monitoring k_m comments:

\[ \text{penalty}(k_m) = -1 + \frac{2}{1 + \exp(-p \cdot (k_m - 1))} \tag{3.7} \]

where p is a parameter that determines how quickly the penalty should increase. Similar

to [38], we set p such that the penalty equals 0.5 at the median number of comments in a

media session.1 Then, we calculate the overall speed factor of the system as follows:

\[ \text{speed} = 1 - \text{median}\{\text{penalty}(k_m) : m \in M,\ pred_m = actual_m = 1\} \tag{3.8} \]

Speed would be equal to 1 for a system that detects the true positives right after seeing

1In our case p = 0.02.


the first comment. Finally, we calculate the F-latency as follows:

\[ \text{F-latency} = F1 \cdot \text{speed} \tag{3.9} \]
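The following sketch shows how Equations 3.6–3.9 can be combined into a single score; the decision tuples are hypothetical, and p is set to the value used in our experiments (0.02):

```python
# Sketch of the F-latency computation (Equations 3.6-3.9). Each decision is
# (predicted, actual, k), where k is the number of comments read before
# deciding. The list of decisions here is a hypothetical placeholder.
import math
from statistics import median
from sklearn.metrics import f1_score

def penalty(k, p=0.02):
    return -1 + 2 / (1 + math.exp(-p * (k - 1)))   # Equation 3.7

def f_latency(decisions, p=0.02):
    y_pred = [d[0] for d in decisions]
    y_true = [d[1] for d in decisions]
    tp_penalties = [penalty(k, p) for pred, actual, k in decisions
                    if pred == actual == 1]        # true positives only
    speed = 1 - median(tp_penalties) if tp_penalties else 0.0   # Equation 3.8
    return f1_score(y_true, y_pred, pos_label=1) * speed        # Equation 3.9
```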


Chapter 4

Data Resources

We decided to create new data resources for the tasks of abusive language and cyberbullying

detection. Our goal was to mainly target youth conversations since they are the most

vulnerable group of online users under abusive behavior and cyberbullying. We started by using a dictionary of bad words to find potentially offensive online posts. We further

presented an approach to help find more forms of aggression (i.e., implicit as well as explicit

forms). Following that approach, we introduced two new corpora: one for abusive language

detection, and the other for cyberbullying detection.

4.1 ask.fm Abusive Language Dataset

The first dataset that we created includes highly negative posts from ask.fm.1 ask.fm is a

semi-anonymous social network, where anyone can post a comment/question to any other

user and may choose to do so anonymously. An example of a random user’s timeline on

ask.fm is illustrated in Figure 4.1. Given that people tend to engage in abusive behavior

under cover of anonymity [66], this anonymity option in ask.fm allows attackers to freely

1https://ask.fm


harass the other users by flooding their pages with profanity-laden questions and comments.

Seeing a lot of offensive messages on their profile page often disturbs a user. Several teen suicides have been attributed to cyberbullying in ask.fm [22, 60].

Figure 4.1: An example of a random user’s timeline in ask.fm.

This phenomenon motivated us to crawl a number of ask.fm accounts. We further ana-

lyzed those accounts manually to ascertain how cyberbullying has been carried out on this

particular site. We learned that victims have their profile page flooded with abusive posts.

We concluded that to detect cyberbullying incidents, we first need to identify abusive mes-

sages. Thus, from identifying victims of cyberbullying, we switched to looking for word

patterns that make a post offensive.

A big challenge there was that abusive posts are rare compared to the rest of online

posts. To ensure that we would obtain enough invective posts, we decided to focus exclusively on posts that contain profanity. This is analogous to the method used in data


collection by [76]; they limited their Twitter data to tweets containing any of the words

bully, bullied, bullying.

4.1.1 Data Collection and Sampling

Since most of the abusive posts we observed in our small scale study contained profanities,

we decided to analyze the occurrence of bad words in a random collection of social media

data. We scraped about 586K question-answer pairs from 1,954 random users in ask.fm from

28 January to 14 February 2015. We limited crawling to posts in English by determining the percentage of English words (≥ 70%) on the user's first page using a Python library called PyEnchant.2
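As an illustration of this filter, a minimal sketch using PyEnchant could look as follows; the whitespace tokenization and the exact notion of a user's "first page" are simplifications, not the precise crawling code:

```python
# Sketch of the English filter: keep a user only if at least 70% of the
# alphabetic word tokens on their first page pass an English dictionary check.
import enchant

english = enchant.Dict("en_US")

def english_ratio(text):
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        return 0.0
    return sum(english.check(w) for w in words) / len(words)

def keep_user(first_page_text, threshold=0.7):
    return english_ratio(first_page_text) >= threshold
```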

To create the appropriate bad word list, we compiled a list from Google’s bad words

list3 and terms listed in [25]. We shortlisted some of the bad words in the list, based on

their frequency in the collected data. For those selected words, we also considered

their morphological variations and slang. Then, we looked at a small sample of data and

extracted all posts containing any of those bad words. The resulting dataset consists of

about 350 question-answer pairs. That small portion of data was divided among five different

annotators for two-way annotation. A third annotator resolved all disagreements. From these

annotations, we computed the negative use rate (NUR) of each bad word w_i. Equation 4.1 defines NUR, where Count(PI, w_i) and Count(PN, w_i) are the counts of posts including word w_i tagged as invective and neutral, respectively.

\[ \text{NUR}(w_i) = \frac{Count(PI, w_i)}{Count(PI, w_i) + Count(PN, w_i)} \tag{4.1} \]

According to NUR, we ranked the list of profane words, and removed words that were

2http://pythonhosted.org/pyenchant/
3https://code.google.com/p/badwordslist/downloads/detail?name=badwords.txt


Figure 4.2: CrowdFlower interface for contributors to annotate the data.

below the threshold (0.05). The final list included the words f*ck, a*s, sh*t, die, kill, h*e,

as**ole, s*ck, n**ger, stfu, b*tch, and cut plus their morphological variations and slang. We

called this small set of annotated data "gold data" and used it for annotating a larger

sample of data via CrowdFlower.4
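A minimal sketch of this ranking step is shown below; it assumes a hypothetical list of (post, label) annotations and illustrative candidate words rather than the actual annotated sample:

```python
# Sketch of the negative use rate (NUR, Equation 4.1) used to rank candidate
# bad words. `annotated` is a hypothetical list of (post_text, label) pairs
# with label "invective" or "neutral".
def negative_use_rate(word, annotated):
    invective = sum(1 for text, label in annotated
                    if word in text.lower() and label == "invective")
    neutral = sum(1 for text, label in annotated
                  if word in text.lower() and label == "neutral")
    total = invective + neutral
    return invective / total if total else 0.0

def shortlist(candidates, annotated, threshold=0.05):
    # keep only the words whose NUR reaches the threshold used in this section
    return [w for w in candidates if negative_use_rate(w, annotated) >= threshold]
```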

4.1.2 Annotation Process

With the small gold annotated data, we started a crowdsourcing task of annotating around

600 question-answer pairs in CrowdFlower.5 We provided a simple annotation guideline for

contributors with some positive and negative examples to ease their task. Three different

contributors annotated each question-answer pair. Figure 4.2 shows the interface we designed

for the task.

To ensure the high quality of the data, the same data was reviewed and annotated

by four in-lab annotators using a 3-way annotation scheme. Initially, we found that the

4http://www.crowdflower.com/
5CrowdFlower has been recently rebranded to FigureEight.


inter-annotator agreement was low. Hence, we changed the annotation guideline until the

contributors and our internal annotators had a reasonable agreement. We learned that

although the task may seem simple, it is possibly not so for the external contributors.

Thus, it is necessary to iterate the process several times to ensure high-quality data. Then,

from the original set containing our gold data and the extra 600 labeled pairs, we labeled more data into two classes (invective and neutral) using a combination of in-lab and CrowdFlower annotations. Eventually, around 5,600 question-answer pairs were annotated with this

iterative process. The average inter-annotator agreement kappa score for this data was 0.453, which shows a moderate agreement among annotators. This score is reasonable for this task

since offensive language is very subjective [58].

4.1.3 Data Statistics

Table 4.1 shows the data distribution in ask.fm corpus.6 Based on this table, the number of

invective questions is greater than that of invective answers. This shows that in most cases, the attack was initiated by a question/comment sent to a user's timeline. There is also a

noticeable amount of invective answers in the dataset, most of which are probably the replies

to an invective question/comment. For the experiments, we randomly divided the data into

training and test sets using a 70:30 training-to-test ratio, preserving the same distribution of invective and neutral classes in both sets. We used 20% of the training data as the

validation set.

While annotating, we found various types of abuse in the data. Example 1 in Table 4.2

shows instances of sexual harassment directed towards female users. In most of these cases,

the attacking user is anonymous, and he/she is constantly posting similar questions on

the victim’s profile. We also found several instances where the purpose of the post is to

6The data is accessible via the following link: http://ritual.uh.edu/resources/


Table 4.1: Statistics for ask.fm abusive language corpus.

Class      Question   Answer    Total
invective     1,114      909    2,023
neutral       4,483    4,688    9,171
Total         5,597    5,597   11,194

defend/protect self or another person by standing up for a friend or posting hateful or

threatening messages to the anonymous users (Example 2 in Table 4.2). Also, the use of

profane words does not necessarily convey hostility. Looking at the question and answer

pair in Example 3 in Table 4.2, it is obvious that the users are joking with each other. On

ask.fm, some users discourage cyberbullying by motivating victims to stay strong and not

to hurt or kill themselves. Example 4 in Table 4.2 illustrates this case.

Table 4.2: Examples of the different topics in ask.fm abusive language dataset.

Ex. 1
  Question: Send nudes to me babe? :) I'll send you some :)
  Answer: stfu
  Question: C'mon post something sexy. Like a yoga pants pic or your bra or thong
Ex. 2
  Question: She's not ugly you blind ass bat
Ex. 3
  Question: you + me + my bed = fuckkk (;
  Answer: Haha ooooooh shit (;
Ex. 4
  Question: well I just want you to know I'm suicidal and 13. and I'm probably gonna kill myself tonight . . .
  Answer: No please don't seriously god put you on this earth for a reason and that reason was not for you to take yourself off of it . . .

All these examples show that the introduced ask.fm dataset covers a wide range of topics

related to cyberbullying. We believe that the dataset will be a valuable resource for other researchers

carrying out abusive language detection research.


4.2 Curious Cat Abusive Language Dataset

Most of the available resources for the task of abusive language detection have been created

based on either a list of bad words or seed words related to abusive topics. However, the

following examples indicate that profane words are not a good criterion for filtering abusive

content anymore:

Neutral: Damn you are such a BEAUTIFUL F*CKING MOMMY!

Offensive: u should use ur hands to choke urself.

In addition, language keeps changing over time. Based on an article published

by Linguistic Society of America (LSA),7 “many of the changes that occur in language begin

with young adults. As young people interact with others their own age, their language

grows to include words, phrases, and constructions that are different from those of the older

generation.” Given that our target group is youth, if we stick to a list of bad words for creating the dataset, we may miss newer forms of abusive language.

Therefore, we decided to create a new corpus without sticking to a specific list of bad words.

Instead, we use an abusive language classifier that is able to learn new offensive words, phrases,

and slang.

For creating this dataset, we collected the data from Curious Cat,8 a semi-anonymous

question-answering social media platform. This website is very popular among youth and has

more than 15 million registered users. The anonymity option available on Curious Cat

opens the door to digital abuse, as on ask.fm. On this website, users can choose not to reveal

any personal information on their account, as well as post comments/questions on other

users’ timelines anonymously. Due to these properties, there are two significant limitations

with respect to Curious Cat data: (1) the post content is usually too short, and (2) there is

7https://www.linguisticsociety.org/content/english-changing
8https://curiouscat.me


either no or limited information about the sender of a post. Figure 4.3 illustrates an example

of a random user's timeline on the Curious Cat website.

Figure 4.3: An example of a random user's timeline on the Curious Cat website.

4.2.1 Data Collection and Annotation

We crawled about 500K English question-answer pairs from 2K randomly chosen users in

Curious Cat. To avoid biasing the data towards some specific swear words, we did not

use a particular list of bad words to find potentially offensive messages. Instead, we exploited

a pre-trained classifier with reasonable performance on the other resources. Since the format

of the Curious Cat data is similar to ask.fm, we utilized the classification method, which is

described in Chapter 5 of this dissertation. We chose this classifier because: (1) it reports the

state-of-the-art results on ask.fm, and (2) it utilizes lexical features that make it capable of


learning new words and phrases related to the offensive class. This model combines lexical,

domain-specific, and emotion-related features and uses an SVM classifier to detect nastiness.

We trained that classifier on the full ask.fm dataset (presented in Section 4.1), and applied it

to Curious Cat to automatically label all rows of data. While ask.fm and Curious Cat have

the same format, we noticed key differences between them, which may substantially affect

the quality of automatic labeling. For example, with Curious Cat, we observed numerous

sexual posts that are full of profanities, yet not offensive to the user. For instance, a user

may encourage others to post sexual comments to him/her, like the following example:

Question: I wanna s*ck your d*ck so hard and taste your c*m.

Answer: Enter my DMs beautiful.

Therefore, we created the primary version of our data by randomly selecting 2,482

question-answer pairs, where 60% were chosen from the offensive labeled data, and 40%

selected from the positive/neutral labeled data (we only considered the label of the ques-

tions). Using a 3-way annotation scheme, we asked four in-lab annotators, including three

undergraduate students and one graduate student, to annotate each row of the data. Based on the annotations, Fleiss's kappa score [19] was 0.5, which shows a moderate agreement among the

annotators.
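For reference, the agreement score can be computed as in the following sketch, which assumes the statsmodels implementation of Fleiss' kappa; the ratings matrix below is a hypothetical example, not our actual annotations:

```python
# Sketch of the agreement computation: one row per annotated post, one column
# per annotator (0 = neutral, 1 = offensive). The example ratings are hypothetical.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [0, 0, 0],   # all three annotators say neutral
    [1, 1, 0],   # two say offensive, one says neutral
    [1, 1, 1],   # unanimous offensive
    [0, 1, 0],
])

counts, _ = aggregate_raters(ratings)     # items x categories count table
print(f"Fleiss' kappa: {fleiss_kappa(counts):.3f}")
```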

Figure 4.4 shows the rate of “complete agreement” among all annotators for positive and

negative questions and answers. By complete agreement, we mean the case where all the

annotators assigned the same class to an instance (in Curious Cat data, an instance could

be a question or answer). Based on the figure, the complete agreement on the negative class

is much less than the positive/neutral one. This observation demonstrates the fact that the

perceived level of aggression is very subjective since negativity is a function of context and

social culture. It is also interesting to see that for negative instances, the annotation results

show more complete agreements on top of the questions compared to answers. It indicates


that it was harder for the annotators to decide whether a reply to a comment is offensive.

Figure 4.4: Complete agreement for questions and answers across negative and positive labeleddata.

4.2.2 Data Statistics


Table 4.3 shows the final distribution of the proposed corpus. Statistics show that 95%

of negative comments were posted on users’ timelines anonymously. Looking at the labeled

data, we also found that about 100 instances of abusive posts do not include any profanities,

and 1,327 positive/neutral posts have at least one profane word. This shows that the proposed

sampling method could capture the implicit forms of abusive language as well as explicit

ones. It also filtered out examples that included bad words but do not attack the other

users.

For the experiments, we randomly split Curious Cat data into train and test sets with

a 70:30 training to test ratio and utilized 20% of the train data as the validation set. We

preserved the same distribution of offensive and neutral classes in all data partitions.


Table 4.3: Statistics for Curious Cat abusive language corpus

Class      Question   Answer   Total
Offensive       609      171     780
Neutral        1873     2311    4184
Total          2482     2482    4964

4.3 ask.fm Cyberbullying Dataset

Abusive language detection can be considered the initial step towards finding cyberbully-

ing incidents. Cyberbullying happens when the attacker deliberately sends several offensive

messages to the victim, repeatedly. Therefore, we need to monitor at least parts of users' whole conversations to detect such episodes. For creating the cyberbullying corpus,

we collected our data from ask.fm, because of two main properties of this website: 1) the

anonymity option, and 2) its popularity among teens. Typically in ask.fm, the data consists

of question-answer pairs in users’ timeline.

Figure 4.5: Overall process of building the ask.fm cyberbullying corpus.

Figure 4.5 shows the corpus creation scheme. We collected a large amount of ask.fm data, including the full history of question-answer pairs for 3K users. The question field includes a question/comment posted by other users, and the answer field consists of the reply to that question/comment provided by the owner of the account.9 As we mentioned earlier, in order to find cyberbullying incidents, we first had to look for threads of messages that include a high ratio of abusive comments. Therefore, we used the same approach as presented in Section 4.2.1 to automatically label each row of data. We created the cyberbullying and non-cyberbullying classes as follows:

9The answer can be an empty string; questions/comments that the user has not replied to are not shown on his/her timeline.

Cyberbullying class: To make the cyberbullying instances (CB), we created a fixed-length sliding window and moved it through the whole history of question-answer pairs per user. For each window sample, we calculated the ratio of offensive questions/comments inside the window based on the automatic labels. We did not consider the labels of the answers, since we were looking for users that received a large amount of negativity within a specific period. If the negativity ratio inside the window was greater than a pre-defined threshold, we treated that window as a potential cyberbullying event. Additionally, we checked whether we could expand such a window by considering more question-answer pairs while keeping the negativity rate inside it above the defined threshold. This step is crucial to capture the whole cyberbullying episode (a sketch of this windowing procedure is given after the class definitions below). Finally, since automatic labeling was likely to be noisy, we asked two in-lab annotators to check the resulting windows manually. We labeled a window as a cyberbullying instance if both annotators agreed that it included a cyberbullying episode. Table 4.4 shows some parts of a cyberbullying instance in our corpus.

Non-cyberbullying class: To make the non-cyberbullying instances (Non-CB), we employed the same method, but in reverse. More specifically, we looked for windows with a negativity ratio below the defined threshold. We created bins with various negativity ratios (e.g., 0%-5%, 5%-10%, and so on) and made sure to add a fair number of samples from each category to our data. To include false-positive examples, we also added the window samples that were labeled as highly negative but were not annotated as cyberbullying by our annotators (e.g., when two users fight with each other in a third user's timeline).
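Below is a minimal sketch of the windowing procedure described above, assuming a per-user list of automatically labeled question-answer pairs; the data structure and names are illustrative, and the window size (20) and negativity threshold (40%) follow the values reported later in this section.

    # `pairs` is a per-user list of (question_is_offensive, question, answer) tuples
    # produced by the automatic labeler; structure and names are illustrative.
    MIN_WINDOW = 20      # minimum number of question-answer pairs per window
    THRESHOLD = 0.40     # minimum ratio of offensive questions inside the window

    def negativity(window):
        # ratio of offensive questions; answer labels are ignored
        return sum(1 for is_off, _, _ in window if is_off) / len(window)

    def candidate_cb_windows(pairs):
        """Yield potential cyberbullying windows to be verified by annotators."""
        i = 0
        while i + MIN_WINDOW <= len(pairs):
            window = pairs[i:i + MIN_WINDOW]
            if negativity(window) >= THRESHOLD:
                # expand the window while the negativity stays above the threshold
                end = i + MIN_WINDOW
                while end < len(pairs) and negativity(pairs[i:end + 1]) >= THRESHOLD:
                    end += 1
                yield pairs[i:end]
                i = end          # continue after the detected episode
            else:
                i += 1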


Table 4.4: Parts of a cyberbullying instance in our corpus.

Q: didn’t you used to make yourself throw up or something? It obviously didn’t work because you’re still over weight
A: you’re ignorant.
Q: I’m not trying to be!!!! you’re just better off dead so go right ahead. Nobody’s holding you back honey. We won’t miss you.
A: thanks for the clarification
Q: glad I could help! Let me know when you’re dead so I can spit on your grave!!! :-)
A: ok
Q: Fucking bulimic bitch
A: yeah totally!!
Q: tell your mom I said hi when you see her in hell!!! She’s so proud of how you’ve turned out. Just kidding
A: she’s definitely in heaven. and she’s my godmother. and I know she loves me
Q: oh look here your best friend coming to the rescue how cute. She secretly thinks you’re worthless too. Nobody actually cares! They just say they do. Oh silly Meaghan so naive. You need serious help. Maybe you should ask your pointer and middle fingers? They’ve seemed to help you this far
A: please just stop.

We empirically fixed the minimum window size and negativity threshold to 20 and 40%, respectively (i.e., the potential cyberbullying windows included at least 20 question-answer pairs, and at least 40% of the questions were labeled as offensive). Table 4.5 shows the distribution of the data in our ask.fm cyberbullying corpus in terms of the number of users in each class. Since cyberbullying is a rare event, we kept the ratio of positive to negative examples at 1:10 to be closer to real-world scenarios.

Table 4.5: Statistics for ask.fm cyberbullying corpus.

Class               Training   Test   Total
Cyberbullying       19         8      27
Non-cyberbullying   190        80     270
Total               209        88     297


4.4 Other Available Datasets

In this section, we introduce the other available resources for the tasks of aggression identification and cyberbullying detection that we used in the experiments conducted in this dissertation.

4.4.1 Other Abusive Language Corpora

We also applied our abusive language detection models to other available corpora to better assess the performance of the proposed models. One of these datasets is the Kaggle data released in 2012 for a task hosted by Kaggle called Detecting Insults in Social Commentary.10 This data contains posts on adult topics like politics, employment, and the military. Another source of data that we used is the Wikipedia abusive language dataset [75], which includes approximately 115k labeled discussion comments from English Wikipedia. The dataset was labeled by Crowdflower annotators as to whether each comment contains a personal attack. The Kaggle and Wikipedia corpora come with separate training, evaluation, and test sets, and we used the same partitions in our experiments.

Table 4.6 compares the four resources that we used in this dissertation for the task of abusive language detection. Among all corpora, Kaggle is the most balanced, whereas Wikipedia has the lowest negativity ratio. Compared to these two datasets, our corpora have a reasonable number of negative examples. Also, as we can see in this table, posts in ask.fm and Curious Cat are much shorter than those in the Kaggle and Wikipedia data. It also seems that users on these two platforms tend to use shorter words and more abbreviations.

10https://www.kaggle.com/c/detecting-insults-in-social-commentary


Table 4.6: Average length of posts, and words in ask.fm, Curious Cat, Kaggle, and Wikipedia data sets in terms of the average number of words and the average number of characters, respectively.

Corpus        Size     Negativity ratio   Avg no. of words   Avg length of words
ask.fm        11194    18.08%             13.92              4.73
Curious Cat   4964     15.71%             15.3               3.43
Kaggle        6597     26.42%             38.35              5.54
Wikipedia     ∼115K    11.70%             81.29              5.94

4.4.2 Instagram Cyberbullying Corpus

We used our ask.fm cyberbullying dataset for initial experiments showing that the early text categorization scenario is applicable to the task of cyberbullying detection. However, because of the small size of this dataset, we could not further use it to train our more advanced deep neural architecture. Deep learning models have proved effective for text classification, but they are very data-hungry. Therefore, we looked for other available resources. There are two other corpora for the task of cyberbullying detection, one from Instagram [26] and another from Vine [53]. We decided not to use the latter since the Vine platform is no longer active.

Table 4.7: Data statistics for Instagram cyberbullying corpus.

Category    Bullying        Non-bullying      Total
under 10%   13 (4.68%)      265 (95.32%)      278
10%-40%     187 (18.37%)    831 (81.63%)      1,018
above 40%   478 (51.84%)    444 (48.16%)      922
Total       678 (30.57%)    1,540 (69.43%)    2,218

The Instagram cyberbullying dataset includes 2,218 media sessions labeled as cyberbullying or normal. Each media session consists of a media object/image and its associated comments. There are three categories of media sessions in this corpus: (1) under 10% of the comments include at least one negative word, (2) between 10% and 40% of the comments contain at least one bad word, and (3) above 40% of the comments include at least one bad word. Table 4.7 shows the distribution of classes over the different categories of this dataset. For the experiments, we randomly split the Instagram data into train and test sets with an 80:20 training-to-test ratio and utilized 20% of the train data as the validation set.


Chapter 5

Feature Engineering for Abusive

Language Detection

In this chapter, we propose a traditional classification model that makes use of various hand-engineered linguistic features to predict whether an online comment is abusive. We use the ask.fm data (Section 4.1) as the main corpus for conducting the experiments. We investigate the robustness of the model by applying it to the Kaggle and Wikipedia datasets (Section 4.4) as well.

5.1 Methodology

For this task, we consider each question/comment or answer as a single post. We use the following hand-crafted features to extract information from each post:

1. Lexical: Words are a powerful medium to convey feelings, describe, or express ideas. With this notion, we use word n-grams (n=1, 2, 3), char n-grams (n=3, 4, 5), and k-skip n-grams1 (k=2, n=2, 3) as features, and we weigh each term with its term frequency-inverse document frequency (TF-IDF). A small sketch of these TF-IDF n-gram features is given after this feature list.

1We use this feature to capture long-distance context

2. POS colored n-grams: We use n-grams of tokens with their POS tags to understand the importance of the role played by the syntactic class of a token in making a post invective. For instance, “Oh, f**k!” is a neutral sentence, whereas “F**k you!” is an offensive one; the POS tags for the word f**k in these two examples are NN and VB, respectively. By using this feature, we hope to capture such patterns. We use CMU’s Part of Speech tagger2 to get the POS tags for each document.

3. Emoticons (E): Emoticons are practical tools that people use to convey their nega-

tive/positive feelings better. We manually create a dictionary of happy and sad emoticons,

considering the most frequent emojis in the training data. To build the feature vector, we

use a normalized count of happy, sad, and total emoticons in each post.

4. SentiWordNet (SWN): We exploit sentiment polarity of online comments with re-

spect to neutrality, positivity, and negativity scores given by SentiWordNet [3]. We further

concatenate this information with the average count of nouns, verbs, adverbs, and adjectives

in the post to create the feature set. We use Ark Tweet NLP [47] for tokenization.

5. LIWC (Linguistic Inquiry and Word Count): LIWC2007 includes around 70 word

categories to analyze different language dimensions like emotions (e.g., sadness, anger, etc.),

self-references, and casual words in each text [51]. We use a normalized count of words

separated by any of the LIWC categories to form a feature set.

6. Style and Writing density (WR): This category focuses on the properties of the text. For example, abusive users sometimes try to sound more forceful by using capital letters in their comments. Our stylistic feature set consists of the number of words, characters, all-uppercase words, exclamation marks, and question marks, as well as the average word length, average sentence length, and average number of words per sentence.

2http://www.cs.cmu.edu/~ark/TweetNLP/#pos


7. Question-Answer (QA): Our ask.fm abusive language data comes from a semi-anonymous social network that contains question-answer pairs. Therefore, certain features like the type of post (question or answer), whether the post is a reply to an anonymous question/comment, user mentions in the post, the bad word ratio, and the list of bad words that appear might be useful for detecting online invective posts. For example, based on the training data, questions show a higher rate of abuse as the starting points of a conversation.

8. Patterns (P): Based on work by Yin et al. [79], and a careful review of our ask.fm

training data, we extracted the patterns presented in Table 5.1. These patterns include

combinations of lexical forms and POS tags. As another feature set, we build a binary

vector to check the existence of any of these patterns in the post.

Table 5.1: Negative patterns for detecting nastiness. The capital letters show the abbreviations for the following POS tags: L = nominal + verbal (e.g. I’m)/verbal + nominal (e.g. let’s), R = adverb, D = determiner, A = adjective, N = noun, O = pronoun (not possessive)

Pattern                                    Example
L (You’re) + R + D + A* + N (bad word)     You’re just a pussy.
L (You’re) + D + A* + N (bad word)         You’re a one retarded b*tch.
V (bad word) + O                           I want to kill (V) you (O).
O + N (bad word)                           You shitheads.
N + N* (at least 2 bad words)              You stupid ass (N) dip (N) shit (N)
O (You) + A + N (bad word)                 You stupid ass.
V (bad word) + D + N (bad word)            S**k my ass.

9. Embeddings: The idea behind this approach is to use a vector space model to improve lexical-semantic modeling [32]. We utilize two different types of features in this case: the first is defined by averaging the word embeddings of all the words in each post, and the second is based on a document embedding approach [32].

10. LDA: In order to find and analyze the topics involved in invective posts, we employ

one of the best-known topic modeling algorithms, Latent Dirichlet Allocation (LDA) [4]. For

creating the feature set, we compute the probability of the appearance of each topic in the


post.
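As a concrete illustration of the lexical TF-IDF features in item 1 above, the following is a minimal scikit-learn sketch; the vectorizer settings and the FeatureUnion layout are illustrative assumptions rather than the exact pipeline used here, and the k-skip n-grams are omitted since they would require a custom analyzer.

    # Minimal sketch of the word/char TF-IDF n-gram features (illustrative settings).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import FeatureUnion

    lexical_features = FeatureUnion([
        ("word_1_3", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
        ("char_3_5", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))),
    ])
    # X_train = lexical_features.fit_transform(train_posts)
    # X_test = lexical_features.transform(test_posts)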

5.2 Experimental Setup

We feed the different combinations of extracted features into a linear SVM classifier to predict the final labels.3 We evaluate the classification model on three different datasets: ask.fm, Kaggle, and Wikipedia.4 Our goal is to show that the proposed model works well not only on our ask.fm dataset but also on data from other domains. For each of these language resources, we use the validation set to search for the best C parameter for the classifier through grid search over the following values: {1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 10000}. Since the datasets are highly skewed, we perform oversampling of the invective instances during training to mitigate the imbalanced data problem.
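A minimal sketch of this setup is shown below, assuming 0/1 labels and scikit-learn; class_weight="balanced" is used here as a simple stand-in for the oversampling described above, and the helper name is illustrative.

    # Select C on the validation set, then refit on the training data.
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    C_GRID = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100, 1000, 10000]

    def tune_svm(X_train, y_train, X_val, y_val):
        """Pick the C value that maximizes F1 on the validation set."""
        best_c, best_f1 = None, -1.0
        for c in C_GRID:
            # class_weight="balanced" stands in for oversampling the invective class
            clf = LinearSVC(C=c, class_weight="balanced")
            clf.fit(X_train, y_train)
            f1 = f1_score(y_val, clf.predict(X_val))
            if f1 > best_f1:
                best_c, best_f1 = c, f1
        return LinearSVC(C=best_c, class_weight="balanced").fit(X_train, y_train)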

Moreover, for the embedding features, we build the vector space by training on the 290,634 unique words coming from all 586K question-answer pairs we crawled from ask.fm. We also train separate word and document embedding models on the Kaggle and Wikipedia corpora. For the LDA feature, we again use all the crawled data from ask.fm. For training the LDA topic model, we consider all question-answer pairs related to each user as a single document and ignore users with fewer than ten pairs. For the other two datasets, we look at each comment as a separate document and train the LDA model using the training set. With regard to preprocessing, we remove stopwords and words that occur fewer than seven times across the whole corpus, and set the number of topics to 20.
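A minimal gensim sketch of this LDA feature construction is shown below; the document construction, variable names, and the interpretation of the frequency cutoff (here, a document-frequency threshold) are illustrative assumptions.

    # Train an LDA model and extract per-post topic probabilities.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    def lda_features(train_docs, num_topics=20):
        # train_docs: list of token lists (stopwords already removed)
        vocab = Dictionary(train_docs)
        vocab.filter_extremes(no_below=7)   # drop rare words (illustrative cutoff)
        bows = [vocab.doc2bow(doc) for doc in train_docs]
        lda = LdaModel(bows, id2word=vocab, num_topics=num_topics)
        # per-post feature: probability of each topic appearing in the post
        feats = [dict(lda.get_document_topics(bow, minimum_probability=0.0))
                 for bow in bows]
        return feats, lda, vocab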

3We conducted all the experiments with the Logistic Regression classifier as well. However, the final results were lower in comparison to SVM.

4All these datasets are presented in Chapter 4


5.3 Baseline Method

People use emoticons to help convey their emotions when they are posting online. In the baseline experiment, we first check whether a post contains any emoticons from the following list: {<3, :-), :), (-:, (:, :o), :c)}. We chose this list because, based on the ask.fm training data, these emoticons are used to show positive feelings. If the post contains at least one of these emoticons, we label it “neutral”. Otherwise, we calculate the ratio of bad words to total words; if it is greater than a given threshold, our baseline system predicts the post as “invective”.

\[
\mathrm{invective}(x) =
\begin{cases}
0, & \text{if } \mathrm{badWordRatio}(x) < T \\
1, & \text{if } \mathrm{badWordRatio}(x) \geq T
\end{cases}
\tag{5.1}
\]
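A minimal sketch of this baseline rule is shown below; the bad word list, tokenization, and function name are assumptions, and only the emoticon list comes from the text above.

    # Baseline: positive emoticon -> neutral; otherwise threshold the bad-word ratio.
    POSITIVE_EMOTICONS = {"<3", ":-)", ":)", "(-:", "(:", ":o)", ":c)"}

    def invective(post, bad_words, threshold):
        tokens = post.split()
        if any(tok in POSITIVE_EMOTICONS for tok in tokens):
            return 0      # contains a positive emoticon -> neutral
        if not tokens:
            return 0
        ratio = sum(tok.lower() in bad_words for tok in tokens) / len(tokens)
        return 1 if ratio >= threshold else 0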

Table 5.2 shows the results for the baseline experiment. We select the best threshold

value among all threshold values {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8} by performing grid

search using the training set for training, and the validation set for testing.

Table 5.2: The results of baseline experiments for invective class

Our dataset Kaggle dataset Wikipedia datasetExperiment AUC F-score AUC F-score AUC F-score

Majority Baseline 0.5 0.0 0.5 0.0 0.5 0.0Minority Baseline 0.5 0.3 0.5 0.41 0.5 0.18

Our Baseline 0.567 0.27 0.597 0.36 0.610 0.28

5.4 Classification Results and Analysis

Table 5.3 shows the classification results using various combinations of features for all three corpora. Based on the last row of the table, we can conclude that combining all features does not necessarily give the best F1-score. However, we obtain an F1-score of 0.59 for our ask.fm data when we selectively combine different types of features.

Table 5.3: Classification results for invective class. N/A stands for the feature sets that are not applicable to Kaggle and Wikipedia datasets

Feature                              Our dataset       Kaggle dataset    Wikipedia dataset
                                     AUC     F-score   AUC     F-score   AUC     F-score
Unigram (U)                          0.768   0.57      0.813   0.71      0.882   0.72
Bigram (B)                           0.680   0.48      0.742   0.62      0.810   0.66
Trigram (T)                          0.587   0.31      0.647   0.46      0.702   0.53
Word 1, 2, 3gram (UBT)               0.726   0.55      0.777   0.68      0.830   0.74
Char 3gram (CT)                      0.753   0.55      0.805   0.70      0.883   0.69
Char 4gram (C4)                      0.748   0.56      0.812   0.72      0.879   0.73
Char 5gram (C5)                      0.717   0.52      0.793   0.71      0.869   0.74
Char 3, 4, 5gram (C345)              0.734   0.55      0.811   0.73      0.866   0.75
2 skip 2gram (2S2G)                  0.654   0.44      0.756   0.65      0.764   0.65
2 skip 3gram (2S3G)                  0.593   0.32      0.649   0.46      0.712   0.52
POS colored unigram (POSU)           0.762   0.56      0.803   0.70      0.874   0.71
POS colored bigram (POSB)            0.674   0.47      0.732   0.61      0.806   0.65
POS colored trigram (POST)           0.577   0.28      0.643   0.45      0.697   0.52
POSU+POSB+POST (POS123)              0.724   0.55      0.788   0.68      0.824   0.73
Question-Answer (QA)                 0.744   0.52      N/A     N/A       N/A     N/A
Emoticon (E)                         0.511   0.30      0.505   0.41      0.524   0.19
QA + E                               0.743   0.52      N/A     N/A       N/A     N/A
SentiWordNet (SWN)                   0.602   0.35      0.575   0.39      0.632   0.30
C345 + SWN                           0.736   0.55      0.797   0.72      0.866   0.75
LIWC                                 0.662   0.42      0.715   0.57      0.787   0.53
QA + LIWC                            0.764   0.55      N/A     N/A       N/A     N/A
Writing Density (WR)                 0.564   0.30      0.566   0.42      0.682   0.31
U + WR                               0.769   0.57      0.804   0.70      0.878   0.71
Patterns (P)                         0.539   0.17      0.518   0.09      0.544   0.16
QA+LIWC+P                            0.756   0.54      N/A     N/A       N/A     N/A
Word2vec (W2V)                       0.745   0.51      0.759   0.63      0.854   0.61
Doc2vec (D2V)                        0.750   0.52      0.792   0.66      0.886   0.60
LDA                                  0.626   0.37      0.559   0.40      0.577   0.26
LIWC+E+SWN+W2V+D2V                   0.780   0.56      0.799   0.68      0.889   0.65
U+C4+QA+LIWC+E+SWN+W2V+D2V           0.785   0.57      N/A     N/A       N/A     N/A
U+C4+POSU+QA+D2V+LDA                 0.781   0.58      N/A     N/A       N/A     N/A
C4+U+QA+E                            0.766   0.59      N/A     N/A       N/A     N/A
All Features                         0.756   0.56      0.798   0.71      0.882   0.75
Best Previous Reported score         –       –         0.842   –         –       –

Although some features like SWN and P alone perform worse than, or not much better than, the baseline (comparing either AUC or F1-score), it seems that selectively combining them with other features improves the performance of the system. The results show that combining a feature with others produces, in most cases but not all, a higher AUC score compared to using a single feature for training the classifier. This means that each feature carries valuable information about different aspects of the posts. Interestingly, combining the emotion-based features with the embedding ones (LIWC+E+SWN+W2V+D2V) gives us one of the best AUC scores. It shows that the emotions reflected in the text provide useful information about whether it is hostile or not. However, the results that we obtained from the LDA features are not remarkable; even combining LDA with the other features does not seem to improve the performance. One reason may be the sparsity of the feature vectors in this case. The LDA features rank all trained topics over each document, producing a vector for each post that contains the probability of each topic belonging to the post. Since online comments are generally very short, these vectors are very sparse.

Table 5.3 shows the results for the Kaggle and Wikipedia datasets as well. The results do not outperform the best AUC score reported by Kaggle's winner (0.8425). However, we consider our method promising, since not all the features are customized for the Kaggle dataset. Also, we compare the results with those reported on the Wikipedia corpus [75]. In that research, the authors only presented the AUC of their model trained on the train split and evaluated on the development split. With the same configuration, our results are similar to those that they reported (e.g., using the same experimental setup, they got an AUC of 0.952 for word n-grams, and we got an AUC of 0.956 for word unigrams). Overall, the results of applying our model to the Kaggle and Wikipedia data show that our approach is applicable to other domains. It is also interesting to investigate why we obtain higher scores for these two corpora compared to our ask.fm dataset. Referring to the comparison of all three corpora in Section 4.4, we believe that the primary reasons are:

1. In ask.fm, posts are either questions or answers, which are shorter than the ones in the other datasets. Looking at our ask.fm data, we found that in many cases both the question and the answer include only one word, which makes the decision hard.

2. Online posts do not generally follow formal language conventions. Because ask.fm is mostly used by teenagers and youth, there are more misspellings and abbreviations in the texts, which makes their processing much more difficult.

5https://www.kaggle.com/c/detecting-insults-in-social-commentary/leaderboard

Among all the features, the only one that works poorly is P, especially on the Kaggle data. As mentioned in Section 5.1, we extracted those patterns by looking only at the ask.fm training data, so it makes sense that they may not provide good results for the other datasets. However, it would still be interesting to investigate the possibility of extracting negative patterns from the text automatically.

Table 5.4: Top negative features

Feature: U (word unigrams)
  Our dataset: bitch, fuck, asshole, shut, stfu, off, you, stupid, fucking, ugly, pussy, u, ass, slut, face
  Kaggle dataset: you, idiot, stupid, dumb, loser, your, moron, ignorant, you’re, faggot, bitch, shut, asshole, ass, retard
  Wikipedia dataset: fuck, fucking, stupid, idiot, shit, asshole, ass, moron, bullshit, suck, idiots, bitch, sucks, dick, penis

Feature: C4 (char 4-grams)
  Our dataset: itch, bitc, ass, fuc, uck_, stfu, hoe, bit, tfu_, fuck, stf, dumb, off, you, slut
  Kaggle dataset: you, you_, re a, diot, idi, idio, dumb, moro, oron, dum, your, bitc, tard, fuc, oser
  Wikipedia dataset: fuck, fuc, shit, uck_, diot, ass, suck, idio, moro, shi, gay, bitc, oron, dick

Table 5.4 lists the most important features learned by the classifier. The “_” sign represents the whitespace character. Interestingly, the classifier has learned to discriminate between neutral and invective words. One of the most notable observations from this table is that the second-person pronoun is ranked as one of the top negative features. This observation shows that online attackers mostly address their victims directly, and it supports our idea that invective posts might follow specific patterns. Also, in the ask.fm dataset, the word “face” is ranked as one of the highly negative features. Therefore, we can conclude that attackers post negative comments about victims' faces, in some cases as a reaction to an uploaded photo. Moreover, the top bad words captured from the other datasets (like idiot, stupid, moron) provide us with the opportunity to expand our bad word list and enrich our ask.fm corpus.

The misclassified examples show that the classifier gets confused in the following cases: (1) single-profane-word answers (Example 1 in Table 5.5), (2) question-answer pairs in which users joke around using profanities (Example 2 in Table 5.5), (3) posts that include a mixture of politeness and profanity (Example 3 in Table 5.5), and (4) posts that include bad words but are offered as compliments (Example 4 in Table 5.5).

Table 5.5: Examples of mislabeled instances by the classifier

    Posts
1   Answer: stfu
    Answer: Die
2   Question: Fuck you brian lmao
    Answer: xD ty
3   Question: Can I kill you?
    Question: Can we fuck please?
4   Question: You are hot as fuck

5.5 Analysis of Bad Words

Table 5.6 shows the degree of negativity for the words in our final bad word list. By this

analysis, we would like to estimate how negative each word is by itself. For computing this

measure, we consider the posts in our ask.fm dataset that contain only one profane word.

Then, for each bad word wi in the list, we apply the same formula as Equation 4.1 to calculate

the ratio of the negative posts containing wi or any of its variations to the total posts in

which wi or any of its variations appears as the single bad word.
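A minimal sketch of this computation is shown below; the post representation and the variant mapping are illustrative assumptions.

    # Degree of negativity for a bad word, using posts with exactly one bad word.
    def negativity_degree(word, posts, variants, bad_words):
        # posts: list of (tokens, is_negative) pairs from the labeled ask.fm data
        # variants: maps each bad word to the set of its spelling variations
        hits, neg = 0, 0
        forms = variants.get(word, {word})
        for tokens, is_negative in posts:
            present = [t for t in tokens if t in bad_words]
            # keep only posts where this word (or a variant) is the single bad word
            if present and all(t in forms for t in present):
                hits += 1
                neg += int(is_negative)
        return neg / hits if hits else 0.0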

Table 5.6: Degree of negativity for bad words

bad word   negativity
as**ole    51.16%
kill       12.47%
f*ck       33.05%
n**ger     13.30%
sh*t       15.23%
cut        4.85%
b*tch      41.65%
a*s        24.77%
die        7.41%
s*ck       26.88%
h*e        36.58%
stfu       51.55%


From Table 5.6, it is clear that most of the bad words are used in a neutral/positive way more often than in a negative way. Although these numbers are also related to the overall incidence of nastiness, there are some noteworthy findings. For example, when the word “f*ck” is used as a verb, it refers to sexual activity; this form is used more often in neutral/positive posts than in negative ones. This finding reflects a sexualized teen culture, part of a growing problem affecting young social media users. The low degree of negativity of the words “die”, “kill”, and “cut” is also interesting. By looking at the data, we found that the likelihood that these harm-related words reflect online harassment is related to the appearance of other bad words. Moreover, the data shows that these words are sometimes used to threaten people or encourage them to commit suicide. In contrast, the acronym “stfu” has the most substantial degree of negativity. We believe that these observations are related to the versatility of the words: it is less likely to see the acronym “stfu” being used in a neutral/positive way than the other terms. Also, some words like “suck” and “hoe” seem to carry a highly negative weight.

5.6 Feature-Based Model on Other Resources

Using a very similar approach, we participated in the first workshop on Trolling, Aggression, and Cyberbullying (TRAC). The TRAC 2018 Shared Task on Aggression Identification [30] aims at developing a classifier that performs a 3-way classification of a given data instance among “Overtly Aggressive”, “Covertly Aggressive”, and “Non-aggressive”. In this section, we present the different models we submitted to the shared task. We mainly used lexical and semantic features in our final submissions. This competition contained two main subtasks, one focusing on English data and another on Hindi. For each language, two test sets were provided: one from Facebook and another from an anonymized social media platform. The training and development data for both languages come from Facebook. We obtained weighted F1-measures of 0.5921 for the English Facebook task (ranked 12th out of 30 teams), 0.5663 for the English Social Media task (ranked 6th out of 30 teams), 0.6451 for the Hindi Facebook task (ranked 1st out of 15 teams), and 0.4853 for the Hindi Social Media task (ranked 2nd out of 15 teams).

5.6.1 Data

The datasets were provided by [31]. Table 5.7 shows the distribution of training, validation,

and test (Facebook and social media) data for English and Hindi corpora. Each post in the

data was labeled with one out of three possible tags:

• Non-aggressive (NAG): There is no aggression in the text.

• Overtly aggressive (OAG): The text contains either aggressive lexical items or

certain syntactic structures.

• Covertly aggressive (CAG): The text contains an indirect attack against the target

using polite expressions in most cases.

Table 5.7: Data distribution for TRAC English and Hindi corpus

Data      Training (FB)   Validation (FB)   Test (FB)   Test (SM)
English   12000           3001              916         1257
Hindi     12000           3001              970         1194

5.6.2 Data Preprocessing

For cleaning the English dataset, we lowercased the data and removed URLs, email addresses, and numbers. We also did minor stemming by removing “ing” and plural and possessive “s”, and replaced a few common abstract grammatical forms with their formal versions.


On manual inspection of the Hindi training data, we applied a word-level language identification system [41] and found that approximately 60% of the data is Hindi-English code-mixed. Moreover, some instances use Roman script for Hindi, while others are in Devanagari. Only 26% of the training data is in the Devanagari script. We normalized the data by transliterating instances in Devanagari to Roman script: we identified these instances using Unicode pattern matching and transliterated them to Roman script using the indic-trans transliteration tool6.

5.6.3 Methodology

To create our model, we made use of the following features:

1. Lexical: We used the same lexical features as described in Section 5.1. In addition, we also considered another weighting scheme by using binary word n-grams (n=1, 2, 3).

2. Word Embeddings: For the embedding model, we used pre-trained vectors trained on part of the Google News dataset, covering about 3 million words7. We computed the word embedding feature vector by averaging the word vectors of all the words in each comment, skipping the words that are not in the vocabulary of the pre-trained model. This representation was only used for the English data, and the coverage of the Google word embeddings is 63% for this corpus.

3. Sentiment: We used Stanford Sentiment Analysis tool [62]8 to extract fine-grained

sentiment distribution of each comment. For every message, we calculated the mean and

standard deviation of sentiment distribution over all sentences and used them as the feature

vector.

6https://github.com/libindic/indic-trans
7https://code.google.com/archive/p/word2vec/
8https://nlp.stanford.edu/sentiment/code.html


4. LIWC (Linguistic Inquiry and Word Count): We used the same approach as

Section 5.1 to build this feature vector. However, we did not consider all the LIWC2007 [51]

word categories in this case, but just the ones related to positive or negative emotions and

self-references. This feature is only applicable to English data.

5. Gender Probability: Following the approach presented in [72], we used a lexicon trained on Twitter data [59] to calculate the probability of gender. We further converted these probabilities to binary gender by considering the positive cases as female and the rest as male. Finally, we built the feature vectors consisting of the gender probability and the binary gender for each message. This feature is also not applicable to Hindi.

5.6.4 Experimental Setup

Since the task was multi-class classification, we employed a one-versus-rest classifier. It trains a separate classifier for each class and labels every message with the class that has the highest predicted probability across all classifiers. We tried both Logistic Regression and linear SVM as the estimator for the classifier, and we decided to use the former in our final systems due to its better performance on the validation data.
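A minimal scikit-learn sketch of this one-versus-rest setup is shown below; the variable names (X_train, y_train, X_test) are illustrative assumptions.

    # One-versus-rest classification over the OAG/CAG/NAG labels.
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    # ovr.fit(X_train, y_train)
    # preds = ovr.predict(X_test)   # class with the highest score wins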

5.6.5 Results

For both datasets, we considered various combinations of the features mentioned earlier and trained several classification models. Table 5.8 shows the results for both languages on the validation sets.

Table 5.9 includes the results of our three submitted systems for the English Facebook and

Social Media data. In all three methods, we used the same set of features as follows: binary

unigram, word unigram, character n-grams of length 4 and 5, and word embeddings. In the

first system, we used both train and validation sets for training our ensemble classifier. In


Table 5.8: Validation results employing various feature sets for the English and Hindi datasets using the Logistic Regression model. In this table, BU stands for Binary Unigram, and N/A stands for the feature sets that are not applicable to Hindi data.

                                Weighted F1
Feature                         English   Hindi
Unigram (U)                     0.5804    0.6159
Bigram (B)                      0.4637    0.5195
Trigram (T)                     0.3846    0.4300
Char 3gram (C3)                 0.5694    0.6065
Char 4gram (C4)                 0.5794    0.6212
Char 5gram (C5)                 0.5758    0.6195
Word Embeddings (W2V)           0.5463    N/A
Sentiment (S)                   0.3961    N/A
LIWC                            0.4350    N/A
Gender Probability (GP)         0.3440    N/A
BU + U + C4 + C5 + W2V          0.5875    N/A
C3 + C4 + C5                    0.5494    0.6207
U + C3 + C4 + C5                0.5541    0.6267

the second system, we only used the train data for training the model. The third system was set up the same as the second one, but we also corrected the misspellings in the data using the PyEnchant9 spell-checking tool. Unfortunately, we could not try applying the sentiment and lexicon-based features after spell correction due to the restrictions on the total number of submissions. However, we believe that it might have improved the performance of the system.

Table 5.9: Results for the English test set. FB and SM stand for Facebook and Social Media, respectively.

                    F1 (weighted)
System              FB        SM
Random Baseline     0.3535    0.3477
System 1            0.5673    0.5453
System 2            0.5847    0.5391
System 3            0.5921    0.5663

Table 5.10 shows the performance of our submitted models for the Hindi Facebook and Social Media data. For the Hindi dataset, the combination of word unigrams and character n-grams of length 3, 4, and 5 gave the best performance on the validation set. These features capture the word usage distribution across classes. Both System 1 and System 2 used these features, trained on the training set only and on the training and validation sets, respectively.

9https://pypi.org/project/pyenchant

Table 5.10: Results for the Hindi test set. FB and SM stand for Facebook and Social Media, respectively.

                    F1 (weighted)
System              FB        SM
Random Baseline     0.3571    0.3206
System 1            0.6451    0.4853
System 2            0.6292    0.4689

5.6.6 Analysis

Looking at the mislabeled instances in the validation phase, we found that there are two

main reasons for the classifier mistakes:

1. Perceived level of aggression is subjective. There are some examples in the validation

dataset where the label is CAG, but it is more likely to be OAG and vice versa.

Table 5.11 shows some of these examples.

2. There are several typos and misspellings in the data that certainly affect the perfor-

mance.

Table 5.11: Misclassified examples with respect to the aggression level. In these examples, the predicted labels seem more reasonable to us than the actual labels, which shows that the perceived level of aggression is subjective.

Language: English
  Example: What has so far Mr.Yechuri done for this Country. Ask him to shut down his bloody piehole for good or I if given the chance will crap on his mouth hole.
  Actual: CAG    Predicted: OAG

Language: English
  Example: The time you tweeted is around 3 am morning,,which is not at all a namaz time.,As you bollywood carrier is almost finished, you are preparing yourself for politics by these comments.
  Actual: OAG    Predicted: CAG

Language: Hindi
  Example: ajeeb chutya hai.... kahi se course kiya hai ya paida hee chutya hua tha
  Actual: CAG    Predicted: OAG

Language: Hindi
  Example: Salman aur aamir ki kounsi movie release huyee jo aandhi me dub gaye?? ?Bikau chatukar media
  Actual: OAG    Predicted: CAG


Furthermore, based on Figure 5.1, it is evident that the Hindi corpus is more balanced than the English one in terms of the proportion of OAG and CAG instances. That could be a good reason why the performance of the lexical features is better for the Hindi data.

Figure 5.1: Label distribution comparison between training and evaluation sets for both English and Hindi languages. (a) Data distribution for training sets; (b) data distribution for evaluation sets.

Figure 5.2: Confusion matrix plots of our best performing systems for English Facebook and Social Media data (rows: true labels; columns: predicted labels, in the order OAG, CAG, NAG).

(a) EN-FB task:
        OAG   CAG   NAG
OAG      83    42    19
CAG      28    79    35
NAG      75   210   345

(b) EN-SM task:
        OAG   CAG   NAG
OAG     244    74    43
CAG     150   131   132
NAG      26   105   352

Figure 5.2a shows the confusion matrix of our best model for all three classes in the English Facebook corpus. Based on this figure, the classifier mislabeled several NAG instances as CAG. Since our system mostly relies on lexical features, we can conclude that there are far fewer profanities in CAG instances compared to OAG ones. Therefore, without considering the sentiment aspects of the messages, it is hard to distinguish CAG instances from NAG ones. This is also visible in Figure 5.2b, where the classifier gets confused when labeling CAG instances, both with and without profanities, in the English Social Media corpus.

Figure 5.3a shows that for the Hindi Facebook data, the biggest challenge is to distinguish OAG instances from CAG ones. In this case, our approach was built upon lexical features only; therefore, it can be inferred from the figure that even indirectly aggressive messages in Hindi contain many profanities. However, for the Hindi Social Media corpus, we have the same concern as for the English data.

Figure 5.3: Confusion matrix plots of our best performing systems for Hindi Facebook and Social Media data (rows: true labels; columns: predicted labels, in the order OAG, CAG, NAG).

(a) HI-FB task:
        OAG   CAG   NAG
OAG     199   155     8
CAG      69   313    31
NAG      25    54   116

(b) HI-SM task:
        OAG   CAG   NAG
OAG     177   134   148
CAG      50   155   176
NAG      23    78   253

5.7 Findings

In this chapter, we used a traditional machine learning classifier, along with various types of hand-engineered features, to detect abusive language in online content. Considering different combinations of features as the input to the model, we evaluated the performance on our ask.fm dataset, as well as on two other corpora, Kaggle and Wikipedia. Based on the results, there is no particular feature set that works best for all three resources. One reason is the different formats of the data in these corpora. For instance, the question-answer feature is not applicable to the Kaggle and Wikipedia data; however, it boosts the performance of the model on the ask.fm corpus. Also, for both the Kaggle and Wikipedia datasets, character n-gram features work better than word n-grams, whereas the opposite is true for the ask.fm data. This finding can be explained by the different nature of adults' and teens' conversations. Additionally, the analysis of bad words in the ask.fm data showed that young people use many profanities in their daily conversations, but more often in a neutral/positive way. All these observations demonstrate the need for data resources that mainly include youths' conversations.


Chapter 6

Paying Attention to the Emotions for

Abusive Language Detection

Many research works have reported that offensive posts are contextual, personalized, and creative, which makes them harder to detect than spam [39, 33]. Even without bad words, a post can be hostile to the receiver. On the other hand, the use of profane words does not necessarily convey a negative meaning [1]. In Section 5.5, we also showed that profanities are mostly used neutrally in today's teen talk. Therefore, although traditional approaches using lexical features have proved to work quite well for the task of abusive language detection [12, 45, 10], these types of features can introduce some bias into the system by focusing on profane words.

Researchers have recently started to use user-level information to increase the chance of finding abusive users [55, 42]. For example, they employed Graph Convolutional Neural Networks to model users' interactions. However, this type of approach has two serious limitations:

1. Some social media platforms offer anonymity options to their users, which makes it impossible to track all the users involved in a conversation. For example, on ask.fm and Curious Cat, it is not possible to identify the sender of an anonymous message, while a large number of comments are sent anonymously. In such cases, we cannot form the entire user network; therefore, this method is not generalizable to all domains.

2. Abusive language datasets include a limited number of social media comments. Therefore, the users' interaction network might be biased towards specific users that have more comments in the data.

To overcome the challenges mentioned above, we propose a single deep neural architecture that considers only the content of the online posts, while trying to extract more aspects of the text to detect more forms of abusive language. Our model employs both textual and emotional cues from the input text to decide whether it is offensive or not. As part of this model, we introduce the Emotion-Aware Attention (EA) mechanism, which dynamically learns to weigh the words based on the emotions behind the text. We use this method to reduce the model's bias towards bad words. To prove the robustness of our model, we apply it to all the abusive language data resources that we introduced in Chapter 4 (ask.fm, Curious Cat, Kaggle, and Wikipedia), and compare the performance with the state-of-the-art results reported on each of these corpora.

6.1 Model Architecture

Our proposed model detects whether a given input text is offensive or not. Figure 6.1

shows the overall architecture of the model. The motivation behind this model is to exploit

emotional representation to better distinguish the use of profanities in an offensive way from

a neutral way. Our model consists of the following modules:


Figure 6.1: Overall architecture of our unified deep neural model for abusive language detection. This model combines the textual and emotion representations for a given input comment.

1. Bidirectional Long Short-Term Memory (BiLSTM): This module includes an

embedding layer that generates the corresponding embedding matrix for the given input

text. Then, we pass the embedding vectors to a Bidirectional LSTM (BiLSTM) layer to

extract the contextual information from the text.

2. DeepMoji (DM): Emojis help online users to show their actual feelings within the text. With this notion, we hypothesize that emojis are effective tools to provide context for online comments and to recognize more forms of abuse. For capturing emotion from the text, we use the DM model pre-trained on a large set of Twitter data [18]. This model covers 64 frequently used online emojis. It takes an online comment as input and provides a 64-dimensional representation as output that shows how relevant each emoji is to the given input. Figure 6.2 illustrates the top 5 emojis that the DM model assigned to two instances chosen from our Curious Cat dataset. Both comments are very short and include the single bad word “die”; however, one of them is neutral and the other is offensive. The top emojis assigned to each of these examples by the DM model are completely different. Therefore, we can conclude that DM is able to recognize the tone of the language correctly. The colors also show the attention weights assigned by the model; darker colors indicate higher attention weights. Interestingly, the word “die” has the highest attention score in the offensive instance.

Figure 6.2: Top 5 emojis that the DeepMoji model assigned to one neutral and one offensive instance from our Curious Cat data. The words are colored based on the attention weights given by the DeepMoji model. Darker colors show higher attention weights.

We utilize the last hidden representation of the DM model and feed it to a non-linear

layer to project it into the same space as the output of the BiLSTM module.

3. Emotion-Aware Attention (EA): We hypothesize that it is not enough to consider only the word representations in the attention model, for the following two reasons:

1. Many bad words may also be used in a neutral way to make jokes and provide com-

pliments among friends.

2. Some texts do not contain any profanity but are still offensive to the receiver.

Both reasons may confuse the model in its final prediction. Therefore, we designed the EA

mechanism to consider not only the word representations but also the emotions behind the

text to better determine the most important words in a document. Our approach is similar

to [40], where the authors proposed an attention model that uses genre information to find

the most appropriate features for predicting the likability of books.

Let's assume that $h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$ is the concatenation of the forward and backward hidden states of the BiLSTM, and $e$ is the output of the DM module. To measure the importance of words, we calculate the attention weights $\alpha_i$ as follows:

\[
\alpha_i = \frac{\exp(\mathrm{score}(h_i, e))}{\sum_{i'} \exp(\mathrm{score}(h_{i'}, e))} \tag{6.1}
\]

where the score(.) function is defined as:

\[
\mathrm{score}(h_i, e) = v^{T} \tanh(W_a h_i + W_e e + b_a) \tag{6.2}
\]

where $W_a$ and $W_e$ are weight matrices, and $b_a$ and $v$ are the parameters of the model. $W_e$ is shared across the words and adds emotion effects to the attention weights. The output of the attention layer is the weighted sum $r$, calculated as follows:

\[
r = \sum_i \alpha_i h_i \tag{6.3}
\]
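The following is a minimal PyTorch sketch of the emotion-aware attention defined by Equations 6.1-6.3; the attention dimension and variable names are illustrative assumptions, not the exact implementation used in this dissertation.

    # Emotion-aware attention over BiLSTM states H, conditioned on the DM vector e.
    import torch
    import torch.nn as nn

    class EmotionAwareAttention(nn.Module):
        def __init__(self, hidden_dim, emotion_dim, attn_dim=100):
            super().__init__()
            self.W_a = nn.Linear(hidden_dim, attn_dim, bias=True)    # W_a h_i + b_a
            self.W_e = nn.Linear(emotion_dim, attn_dim, bias=False)  # W_e e (shared)
            self.v = nn.Linear(attn_dim, 1, bias=False)              # v^T

        def forward(self, H, e):
            # H: (batch, seq_len, hidden_dim) BiLSTM states; e: (batch, emotion_dim)
            scores = self.v(torch.tanh(self.W_a(H) + self.W_e(e).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)      # Eq. 6.1: attention weights
            r = (alpha * H).sum(dim=1)                # Eq. 6.3: weighted sum of states
            return r, alpha.squeeze(-1)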

4. Feed-Forward Neural Network: We concatenate the outputs of the attention model

and DM module to further take into account the direct effect of the emotions on the model.

The resulting vector is then fed into a hidden dense layer with 100 neurons. To improve the

generalization of the model, we use batch normalization and dropout with a rate of 0.5 after

the hidden layer. Finally, we use a two neuron output layer along with softmax activation

to predict whether the input text is offensive or not.

6.2 Experimental Setup

As for preprocessing, we truncate the posts to 200 tokens,1 and left-pad the shorter sequences with zeros. We use Binary Cross Entropy to compute the loss between the predicted and actual labels, which is calculated as follows:

\[
l = -\frac{1}{N}\sum_{i=1}^{N} \left[\, y_i \log(y'_i) + (1 - y_i)\log(1 - y'_i) \,\right] \tag{6.4}
\]

where $y_i$ is the actual label and $y'_i$ is the predicted probability of the class of interest. Additionally, to smooth the imbalance problem in the datasets, we add class-weight information to the loss function. The network weights are updated using the Adam optimizer [28] with a learning rate of 1e-5. We train the model over 200 epochs and report the test results based on the best macro F1 obtained on the validation set.

1We also tried a sequence length of 100, but it did not work well for Kaggle and Wikipedia since the comments are longer in these two datasets.
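As a rough illustration only, a corresponding PyTorch configuration could look like the following; model, neg_count, and pos_count are assumed to exist, and a single-logit output is used here instead of the two-neuron softmax head described in Section 6.1.

    # Illustrative class-weighted BCE loss and Adam optimizer setup.
    import torch

    criterion = torch.nn.BCEWithLogitsLoss(
        pos_weight=torch.tensor([neg_count / pos_count]))  # weight the rare class
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)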

6.3 Baselines

We compare our proposed model against the state-of-the-art and several strong baselines

listed below:

1. DM Baseline: In this model, we directly pass the output of the DM module to the

dense and output layers. The motivation behind this baseline is to estimate the power of

the DM model to detect abusive language on its own.

2. BiLSTM + Regular Attention (RA): This model calculates the score(.) function without considering DM vectors, using the following formula: $v^T \tanh(W_a h_i + b_a)$. The motivation behind this model is to compare the effect of regular attention with emotion-aware attention.

3. BERT Baseline: In this model, we directly pass the last hidden representation of BERT for the [CLS] token to the dense and output layers. With this model, we plan to test the power of BERT as a feature extractor for the task of abusive language detection.

4. Sam’17 [57]: We presented this model in Chapter 5. It holds the state-of-the-art results on the ask.fm corpus and applies an SVM classifier on top of a combination of various features.

5. Kaggle Winner: This is the result of the winner of the Kaggle competition. The model includes an ensemble of several machine learning classifiers with word n-gram and character n-gram lexical features.2

6. Bodapati’19 [5]: This work reported the state-of-the-art results on the Wikipedia

dataset. The model includes BERT with a single dense layer on top to fine-tune it for the

task of abusive language detection.3

We run our proposed model, BiLSTM + EA + DM, as well as all of the baselines

with two different pre-trained embeddings: (1) 200-dimensional Glove4 embeddings pre-

trained on Twitter, and (2) BERTbase (uncased) contextualized embeddings pre-trained on

the BookCorpus and English Wikipedia corpus [11].5 For all the models, we also conduct

additional experiments by feeding the concatenation of the hidden representation and output

of the DM module to the feed-forward layer (instead of using the hidden representation only).

We do these experiments to study the direct effect of the emotion vectors on the final output.

6.4 Classification Results

We present our results in Table 6.1, where we report the F1-score for the abusive class along with the weighted average F1 over both classes. Based on the results, our primary model performs significantly better than Sam'17 on the ask.fm dataset. Also, the best obtained AUC score on Kaggle is 0.925, which shows an 8% improvement over the Kaggle Winner's result. For Wikipedia, we achieve the same weighted F1 of 95.7 as reported by Bodapati'19 (in the paper) with the BiLSTM + RA + DM model using BERT, while, unlike Bodapati'19, we do not fine-tune BERT (fine-tuning BERT is computationally expensive, especially on large corpora like Wikipedia). However, when we re-implemented their model, we achieved a slightly higher weighted F1 of 95.9, which is what we report in Table 6.1. Although the gap between the weighted F1 of our best model and Bodapati'19 is only 0.2%, we can see that the F1 for the offensive class is around 3% worse with our model. This finding indicates that our model probably works better for the neutral class.

2https://www.kaggle.com/c/detecting-insults-in-social-commentary/leaderboard The code for this model is available online.
3We implemented this model ourselves since the code is not released.
4https://nlp.stanford.edu/projects/glove
5We did not fine-tune the BERT, and only used it as a feature extractor.

Table 6.1: Classification results in terms of F1-score for the negative/offensive class and weighted F1. The values that are in bold show the best results obtained for each dataset. +DM refers to the experiments in which we directly concatenate DM vectors with the last hidden representation generated by the model.

Model                        Curious Cat     ask.fm          Kaggle                   Wikipedia
                             F1     W F1     F1     W F1     F1     W F1    AUC       F1     W F1
DM Baseline                  71.90  91.0     59.21  85.1     73.45  86.0    0.913     72.20  94.5
Glove: BiLSTM + RA           58.62  85.1     52.97  83.7     69.67  84.9    0.888     74.13  95.1
Glove: BiLSTM + RA + DM      69.83  89.6     61.61  86.1     73.46  86.4    0.919     75.74  95.3
Glove: BiLSTM + EA           61.65  87.0     53.86  84.0     67.18  81.4    0.885     75.92  95.3
Glove: BiLSTM + EA + DM      72.16  91.1     62.62  86.1     74.93  86.8    0.919     77.09  95.5
BERT: BERT Baseline          40.86  81.6     37.29  80.1     64.72  81.4    0.862     50.84  89.6
BERT: BERT Baseline + DM     70.17  89.9     60.79  85.6     76.50  87.6    0.920     73.24  94.9
BERT: BiLSTM + RA            58.75  87.2     50.75  83.7     74.44  87.2    0.921     75.85  95.3
BERT: BiLSTM + RA + DM       71.95  90.9     62.37  85.2     75.00  87.1    0.925     77.89  95.7
BERT: BiLSTM + EA            62.03  87.6     56.97  84.9     74.01  85.6    0.922     77.15  95.5
BERT: BiLSTM + EA + DM       71.20  90.6     60.95  84.9     73.90  86.6    0.921     77.58  95.6
Sam'17                       65.54  88.3     58.47  84.1     72.85  86.0    0.806     74.48  94.7
Kaggle Winner                65.86  90.0     51.49  84.4     72.03  86.5    0.842     74.45  95.2
Bodapati'19                  68.19  89.9     56.38  85.0     76.86  88.5    0.925     80.13  95.9

Overall, we obtained the best results with our main proposed model, BiLSTM + EA + DM, for the Curious Cat and ask.fm corpora using Glove embeddings. Based on the McNemar test with a confidence level of 0.05, the improvements with our model are statistically significant (p-value < 0.001) compared to the state-of-the-art results on these two datasets. For the Kaggle and Wikipedia corpora, the BERT Baseline + DM and BiLSTM + RA + DM models provide the best results, respectively. These approaches do not outperform the results reported by Bodapati'19; however, based on the McNemar test, our results are comparable to those of the state-of-the-art (p-value > 0.05). Although our primary model, BiLSTM + EA + DM, is not the winner for the Kaggle and Wikipedia datasets, the best performing models still have DM as part of their architectures. This observation demonstrates the advantages of using the DM representation to extract contextual information from online content. The reason is that DM was trained using fine-grained emoji categories that can capture different levels of emotional feelings (e.g., different anger emojis express different degrees of anger). Such information helps the model to determine the tone of the language more precisely. In Section 6.5, we provide a more detailed analysis of the DM model. Moreover, for all the datasets except Curious Cat, the best performing model is significantly better than the DM Baseline. Therefore, we can conclude that both textual and emotional representations are needed to detect abusive content.

It is interesting that Glove embeddings seem to work better than BERT on the Curious Cat and ask.fm corpora, especially on the negative class. In contrast, using BERT improves the performance of the models on the Kaggle and Wikipedia datasets. The reason is the differing nature of the language across domains. For instance, the ask.fm and Curious Cat data mostly include teens' posts, where the language is more unstructured and noisy, and the posts are much shorter compared to Kaggle and Wikipedia. Glove, on the other hand, was pre-trained on Twitter data, where posts are usually very short, noisy, and unstructured, similar to these two corpora. Therefore, Glove can probably generalize the semantic representation better in such cases compared to BERT. This big difference in the performance of the various architectures on different data resources shows that it is very challenging to build a model that works well on different domains. It also confirms the need to collect data from more social media platforms.


6.5 Why Does DeepMoji Work?

To show why the emoji representation is helpful for detecting abusive language on social media, we plot the emoji distribution over the neutral and offensive classes for the Curious Cat

training data (Figure 6.3). For creating this plot, we use the average DM vector extracted

for each instance. This vector shows the relevance of each emoji to a specific comment. We

created one overall emoji vector per class by averaging the emoji vectors extracted for all

of the instances of the same class. Finally, we chose 19 out of the 64 emojis used in the

DM project to create the final plot. As Figure 6.3 shows, there are different patterns visible

for the neutral and offensive classes. This observation validates our hypothesis on why it is

useful to incorporate emoji information into the model.

Figure 6.3: Emoji distribution over Curious Cat data

Based on Figure 6.3, angry emojis are highly correlated with the offensive class, whereas happy and love faces appear more frequently in the neutral class. For some of the happy and love faces, the difference between the offensive and neutral classes is much smaller. We believe that this represents the scenario where a defender (a user who defends the victim of online attacks) tries to support an attacked user by complimenting him/her, while expressing hatred towards the attackers. Sad faces are more frequent in neutral instances than in offensive ones, possibly reflecting cases where a user expresses his/her unhappiness in response to an attack. Interestingly, the laughing face shows a higher probability for the negative class. This observation shows that sometimes the attackers attempt to humiliate or mock the victims in their offensive messages. Additionally, the plot shows exactly the same probabilities for the poker face over the offensive and neutral classes; therefore, we can conclude that this emoji does not convey any additional information related to offensive language. Other emojis that indicate violent and threatening behavior towards the receiver also seem to be more associated with the offensive class.

6.6 Findings

In this chapter, we investigated the advantages of using emoji representations to detect online abusive language. Social media users make use of emojis in their online posts to reflect their feelings within the text. Therefore, we hypothesized that emojis could help to understand the emotions underlying online posts and to recognize offensive language better. We created a unified deep neural architecture that incorporates emotion information into textual features through a mechanism called emotion-aware attention to detect abusive language. We reported the performance of our model on different domains and showed very promising results in comparison to the state-of-the-art models.

Based on the results, we found that the behavior of our proposed model is entirely

different on the domains that are popular among youth (ask.fm and Curious Cat), and the

ones with more structured texts (Kaggle and Wikipedia). Our primary model (BiLSTM +

EA + DM) showed significantly better performance compared to fine-tuned BERT on ask.fm

and Curious Cat data. The results indicated that for these two corpora, Glove embeddings

work better than BERT. However, our model could not outperform fine-tuned BERT on

Kaggle and Wikipedia. This big difference in the results shows how challenging it is to build


a model that works well across different domains. It also confirms the need to explore various

data resources.


Chapter 7

Detecting the Early Signs of

Cyberbullying

In this chapter, we move from abusive language detection to a more specific problem, cyber-

bullying detection. As we defined earlier, in a cyberbullying episode, the attackers deliber-

ately use strongly negative language to bully others in cyberspace repeatedly. We need to

advance systems that are able to detect cyberbullying incidents as soon as possible before

they can cause irreparable damage to our young generation. As we discussed in Chapter 2,

there are several studies on cyberbullying detection. However, most of them are focused

on offline settings and cannot be used for warning people. Instead, we aim at developing

predictive models that continuously monitor online content, and provide early and accurate

predictions based on limited evidence when a possible case of cyberbullying is happening.

Early text categorization is a particular text classification scenario, where it is essential

to know the category of a document as soon as possible. In this scenario, the accuracy

and the timeliness of the predictions are both critical. This emerging topic has been re-

cently employed in various areas, especially the ones related to health and safety, such as

depression detection and self-harm detection. In this chapter, we plan to create an early


text categorization framework to address cyberbullying detection. Herein, unlike abusive

language detection, we work with the threads of messages (conversations) instead of single

posts. The ultimate goal is to create a system that accurately triggers an alert when the risk

of cyberbullying is high, monitoring as few posts inside the conversation as possible (starting

from the first post).

7.1 Traditional Machine Learning Approach

We use our ask.fm cyberbullying dataset (Section 4.3) to investigate the possibility of applying the early text categorization scenario to the task of cyberbullying detection. Since

our corpus is small, we follow a traditional machine learning approach along with several

hand-engineered features. We further evaluate our model using the chunk-by-chunk evalua-

tion framework that we introduced in Chapter 3.

In this scenario, the instances/conversations in test data are read sequentially in chunks

of texts that are fed into the classifier in an incremental fashion to obtain the prediction

at chunk t. In our case, at every time t, we only have access to question-answer pairs in

the first t chunks of test data to make the predictions. However, the training is done, as

usual, using all chunks of data per instance. The intuition behind this scenario is to learn

the overall pattern of a conversation and to investigate how helpful this pattern is to detect

cyberbullying in the early stages of the conversation. Chunk-by-chunk evaluation is the standard framework for evaluating early text categorization models in different evaluation forums such as eRisk [36, 38].

7.1.1 Methodology

We make use of the following features to extract the information from the text:


1. Lexical: We use word n-grams (n = 1, 2, 3) and char n-grams (n = 3, 4, 5) as they were

proven to be effective lexical representation for abuse and hate speech detection (Chapter 5).

For word n-gram features, we build a vocabulary that only considers the top 10K features

ordered by term frequency across the corpus. We weigh each term with its term frequency-

inverse document frequency (TF-IDF); a sketch of this feature extraction appears after this list.

2. Word Embeddings: As for semantic features, we use the pre-trained Google News

word2vec model, including embeddings for about 3 million words. For each post inside a

conversation/instance, we create the feature vector by averaging the word embeddings of all

the words in each post.

3. Style and Writing density (WR): We extract the stylistic properties of the text, in-

cluding the number of words, characters, all uppercase words, exclamations, question marks,

as well as average word length, sentence length, and words per sentence.

4. LIWC (Linguistic Inquiry and Word Count): We use LIWC2007 word categories

[51] to extract different language dimensions like various emotions (e.g., sadness, anger, etc.),

self-references, and casual words in each text. To create this feature set, we use a normalized

count of words separated by any of the LIWC categories.

5. DeepMoji: Emojis can make a textual message easier to understand by suggesting pictures that help convey its meaning and emotion. DeepMoji is a deep learning model that is

pre-trained on a large set of Twitter data [18]. Given an input text, this model provides an

output representation for 64 frequently used online emojis. This representation shows how

relevant each of those emojis is to the given input. We apply this pre-trained model on our

data and extract the last hidden representation as the feature set for each post.
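As a rough illustration of how the lexical and embedding features above could be assembled, the sketch below uses scikit-learn; texts, w2v (a pre-loaded word2vec model such as a gensim KeyedVectors object), and the pipeline layout are assumptions rather than the exact implementation used in our experiments.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Word 1-3 grams limited to the top 10K terms by frequency, and char 3-5 grams,
# both weighted by TF-IDF, as described above.
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3), max_features=10000)
char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 5))

def avg_word2vec(text, w2v, dim=300):
    """Average the word2vec embeddings of the tokens in a post (zeros if none found)."""
    vecs = [w2v[w] for w in text.split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def lexical_features(train_texts, test_texts):
    """Fit the vectorizers on the training posts and stack the word and char TF-IDF blocks."""
    X_train = hstack([word_tfidf.fit_transform(train_texts),
                      char_tfidf.fit_transform(train_texts)])
    X_test = hstack([word_tfidf.transform(test_texts),
                     char_tfidf.transform(test_texts)])
    return X_train, X_test
```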


7.1.2 Experimental Settings

To configure an early text classification scenario, we use a chunk-by-chunk evaluation frame-

work. In the experiments, we divide each instance into 10 equally sized chunks. Every chunk

includes 10% of question-answer pairs in that conversation.

In this chunk-by-chunk setting, we consider all questions and all answers within a chunk

as separate documents. Then, we extract the aforementioned features from each document

instead of a single post. The reason for separating questions and answers is that we believe

these two categories of posts reflect two different views (i.e., commenters vs. the account

holder). We concatenate question-based and answer-based feature vectors to get a single

representation for each instance. Then, we feed the final representation to a linear SVM

classifier. For each set of features, we tune the C parameter of the classifier with a grid

search over the following values: {0.1, 1, 2, 5, 10}.
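A minimal sketch of this classification step is given below, assuming the concatenated question- and answer-based feature matrix X_train and the labels y_train are already built; the cross-validation folds and scoring choice are illustrative assumptions, not values reported in this dissertation.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Tune C over the values listed above with a grid search on the training data.
grid = GridSearchCV(LinearSVC(), param_grid={"C": [0.1, 1, 2, 5, 10]},
                    scoring="f1", cv=5)
grid.fit(X_train, y_train)
clf = grid.best_estimator_   # linear SVM reused for every chunk-wise evaluation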

To perform the chunk-by-chunk evaluation, as we discussed in Section 3.2.1, we report

the performance of the different methods using increasing amounts of textual evidence within

10 different iterations. In the first iteration, we generate a document representation starting

with the first chunk and evaluate the performance of the classifier based on this representa-

tion. In every next iteration, we incrementally add one more chunk of test data and evaluate

the performance of the model once again. The ultimate goal is to obtain better performance

in the earlier iterations.
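The iteration scheme just described boils down to the loop sketched below, where test_chunks[i] holds the ten chunks of test conversation i and build_representation stands for whichever feature extractor is being evaluated; both names are hypothetical.

```python
from sklearn.metrics import f1_score

def chunk_by_chunk_eval(clf, test_chunks, y_test, build_representation):
    """Evaluate a trained classifier with increasing amounts of test evidence."""
    scores = []
    for t in range(1, 11):
        # Only the first t chunks of each conversation are visible at iteration t.
        X_t = [build_representation(chunks[:t]) for chunks in test_chunks]
        scores.append(f1_score(y_test, clf.predict(X_t)))
    return scores
```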

7.1.3 Classification Results

Table 7.1 shows the classification results in terms of F1-score for the cyberbullying class (the

class of interest). The results of WR and LIWC features are not included in the table due to

the very low performance of the model using these features. Even combining them with the

other features does not improve the performance. However, they seem to be helpful for the


task of abusive language detection (please refer to Section 5.4). This contradiction indicates

that in practice, there are some differences between the two tasks of abusive language and

cyberbullying detection. One major difference is the format of the data. For example,

with regard to WR features, in cyberbullying detection, several users are involved in a conversation and use various writing styles and language tones in their comments. Because

of the method that we use for feature extraction, it is hard to extract cyberbullies’ writing

style out of all this information. However, in abusive language detection, every input is a

single comment written by a particular user.

Table 7.1: F1-score for the chunk-by-chunk evaluation for the positive class. The bold values show the best performance gained for each feature set.

Feature              Iter1  Iter2  Iter3  Iter4  Iter5  Iter6  Iter7  Iter8  Iter9  Iter10
Unigram              0.46   0.54   0.66   0.76   0.61   0.71   0.71   0.61   0.61   0.67
Bigram               0.00   0.20   0.00   0.20   0.36   0.36   0.40   0.40   0.22   0.40
Trigram              0.00   0.20   0.22   0.22   0.40   0.54   0.54   0.61   0.50   0.33
Char 3gram           0.40   0.22   0.40   0.40   0.54   0.36   0.40   0.54   0.54   0.54
Char 4gram           0.22   0.22   0.22   0.22   0.40   0.40   0.40   0.40   0.40   0.40
Char 5gram           0.00   0.22   0.22   0.22   0.22   0.40   0.40   0.40   0.40   0.40
Word2Vec             0.43   0.59   0.47   0.53   0.53   0.57   0.50   0.50   0.50   0.36
Unigram + Word2Vec   0.67   0.61   0.67   0.67   0.71   0.71   0.71   0.71   0.61   0.66
DeepMoji             0.73   0.78   0.75   0.80   0.88   0.82   0.82   0.93   0.75   0.77

Based on the results, the best F1-score is obtained from DeepMoji features using eight

chunks of data (80% of data). Even in earlier chunks, this method works significantly better

than the other approaches. This observation shows that the emoji-based representations for cyberbullying and non-cyberbullying instances are likely to be very different. We further

analyze this result in Section 7.1.4.

Taking into account that the average number of question-answer pairs in each chunk of

the test data is 4, unigram+Word2Vec and DeepMoji features show very promising results

in the earlier chunks. This finding indicates that with these two feature sets, the classifier

is able to identify cyberbullying instances after monitoring only a few question-answer pairs

inside the conversations. Overall, we can conclude that adding more information to the test


data decreases the performance of the system after a while (especially in the last two chunks).

Even for the Word2Vec feature, we get the best performance using only the first two chunks

of the test data. The reason could be the distribution of the offensive messages in a cy-

berbullying episode. Cyberbullying usually starts with a couple of questions/comments from the attacker(s), and as it goes forward, one or more users get involved in the conversation as the victim's bystanders. Some of these users try to encourage the victim to stay

strong against the attacks, and some others start defending the victim by posting aggres-

sive comments targeting the attacker(s). With regards to the approach that we used for

feature extraction, this information possibly confuses the classifier when it gets access to the

later chunks. To sum up, Table 7.1 indicates that we can successfully adapt the early text

categorization approach to the cyberbullying detection task, where the system shows better

performance using less evidence.

7.1.4 Analysis

Figure 7.1 illustrates the flow of emojis for a non-cyberbullying and a cyberbullying instance

in our corpus. Based on this figure, we can understand better why DeepMoji representation

helps to detect cyberbullying at its early stages. For making this figure, we chose 6 out of 64

emojis from the output of the DeepMoji model. We selected an emoji set that covers various

emotions (e.g., happiness, sadness, anger). We create the plot based on the average emoji

vector that we calculate for the available textual data in each chunk. The y-axis shows the

probability that each emoji could be assigned to every chunk.

Based on Figure 7.1a, in a non-cyberbullying conversation, we have a mixture of the

emojis (i.e., overall, no emoji is dominant). Nevertheless, in a cyberbullying instance (Figure

7.1b), negative emojis like and are almost dominant, specifically in the first few chunks,

while the probability of positive emojis like and are much lower. Interestingly, the


(a) Non-cyberbullying instance (b) Cyberbullying instance

Figure 7.1: Flow of Emojis.

laughing face emoji ( ) is also showing a higher probability in the cyberbullying example.

We may conclude that probably in this instance, the attacker(s) made fun of the victim.

7.2 Deep Learning Approach

Deep neural models have widely been used to address the task of abusive language detection,

and have shown improvements over traditional machine learning approaches. However,

to the best of our knowledge, there is only one research study that applied deep learning

to cyberbullying detection [7]. In that work, the authors created a hierarchical attention

network to model the sequence of posts inside a conversation for detecting the Instagram

media sessions that include cyberbullying. They used the same Instagram dataset that

we described in Section 4.4.2. They compared their method with several machine learning

models, and showed very promising results. Therefore, we decided to explore the deep neural

models to create our early cyberbullying classifier. Since such models are very data-hungry,

we could not use our ask.fm cyberbullying dataset for the experimentation. Instead, we used

the Instagram corpus that includes more than 2,200 instances (media sessions).

In this section, we create a hierarchical deep neural architecture to encode the Instagram

media sessions. We further develop a sequential decision-making process that monitors the


thread of online posts inside each conversation. It provides a post-by-post evaluation frame-

work, through which the system decides whether to label a conversation as cyberbullying

based on the monitored posts, or wait for more evidence.

7.2.1 Time-Wise Segmentation

Cyberbullying is a sustained attack over a period of time.1 Previous studies have shown that temporal features are advantageous for detecting this phenomenon [64, 7]. Therefore, we

decided to take into account the temporal dynamics of the posts, while modeling online

conversations.

We divide each media session into several segments, where each segment may include one

to several posts. We hypothesize that modeling the sequence of segments instead of posts

can help the system to identify cyberbullying incidents better. Therefore, we formulate the

task as follows: given a sequence of segments as input, the model predicts whether this sequence is cyberbullying or not. For segmenting the media sessions, we first sort the comments in ascending order based on the time they were posted. Then, we pursue the following three approaches:

1. Varying Length Segmentation: In this method, we take into account: (1) the ratio

of offensive messages inside a segment, and (2) the time gap between the two consecutive

posts inside a segment. We used our best emotion-aware deep neural model (presented in

Chapter 6) trained on Curious Cat data to automatically label each post in a media session

as offensive or neutral. We chose the Curious Cat corpus to train the model for two reasons:

1. Instagram social media is more similar to Curious Cat and ask.fm. Among these two

different resources, the first one includes more forms of abusive language, whereas the latter one was created based on a short list of bad words.

1 https://www.helpguide.org/articles/abuse/bullying-and-cyberbullying.htm

2. The overall performance of the model on Curious Cat data is far better than ask.fm

(Section 6.4, Table 6.1). Thus, we can assume that the model trained on Curious Cat may transfer better to other domains.

After labeling all posts in each media session, we create a sliding window with a default size

of 5 over its comments. We start from the first comment and move the window throughout

the whole conversation. At every step, we check the ratio of offensive messages inside the

window. If it is greater than 40% (i.e., 2 out of 5 comments were labeled as offensive),

we add more posts to that window, one by one, until: (1) the negative ratio becomes less than the threshold, or (2) the time gap between the last comment inside the window and the new comment is greater than one month. Each finalized window is considered as a segment. It should be noted that there is no overlap between two adjacent segments.

In this approach, the longer segments probably include strongly negative comments that were posted within a specific interval of time. Therefore, the larger segments can be a good indicator of possible cyberbullying incidents (see the sketch after this list).2

2. Constant Length Segmentation: This method divides each media session into several

segments with a constant length of 5. It means that each segment includes at most 5 posts.

In this method, we also take the time gap between two posts into account, such that two

comments with a time gap of more than a month cannot be placed in the same segment.

In this situation, we start a new segment with the more recent post. The reason is that

cyberbullying is a repeated attack over a period of time. Therefore, a large time gap between two comments in a media session most likely shows that there is no cyberbullying incident

in that part of the conversation.

3. Single Post Segmentation: In this approach, we consider each post as a single segment.

This method is widely used to represent conversational information [6, 63].

2 A cyberbullying conversation might include more than one cyberbullying incident.
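The following sketch illustrates the first (varying length) strategy above; posts is assumed to be a time-sorted list of (timestamp, text) pairs and offensive a parallel list of 0/1 labels produced by the Curious Cat classifier. The one-month gap is approximated with 30 days, and all names are illustrative.

```python
from datetime import timedelta

ONE_MONTH = timedelta(days=30)   # approximation of the one-month gap described above

def varying_length_segments(posts, offensive, window=5, ratio=0.4):
    """Grow a sliding window while the offensive ratio stays above the threshold
    and the next post arrives within a month; adjacent windows do not overlap."""
    segments, i = [], 0
    while i < len(posts):
        j = min(i + window, len(posts))
        while (j < len(posts)
               and sum(offensive[i:j]) / (j - i) > ratio
               and posts[j][0] - posts[j - 1][0] <= ONE_MONTH):
            j += 1
        segments.append(posts[i:j])   # finalized window becomes one segment
        i = j
    return segments
```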

7.2.2 Model Architecture

In our proposed model, we make use of a popular deep neural model for document classifi-

cation called Hierarchical Attention Network (HAN) [77]. HAN encodes the word-level and

sentence-level information in a document. It uses attention at both levels to find more or less important content while building the document representation. We use a similar architecture to first encode the sequence of words inside a segment to create a segment representation.

Then, we encode the sequence of segment representations to make the document represen-

tation. We use the attention mechanism in both levels with the intuition that not all the

words and segments have the same contribution in the final classification. Figure 7.2 shows

the overall architecture of our proposed model. The model consists of the following modules:

1. Word Encoder with Attention: As we mentioned earlier, the input to the model is

a sequence of segments; each of them includes at least one post. We first concatenate all

posts inside a segment to create a single sequence of words. We use <SP> and <EP> as

the special tokens that show the start and the end of each post, respectively. With these

special tokens, we can have a single sequence of words per segment, yet keeping the posts

separated in this sequence. The next step is to embed the words into vectors of real values through an embedding matrix. As for the embeddings, we use 200-dimensional Glove3 embeddings trained on Twitter. Then, we pass the embedding vectors to a Bidirectional LSTM (BiLSTM) layer. We obtain the annotation h_{it} for each word w_{it} by concatenating the forward and backward hidden states, i.e., h_{it} = [\overrightarrow{h}_{it}; \overleftarrow{h}_{it}]. Not all the words contribute

equally to the segment representation. Therefore, we use the attention mechanism to learn

the importance of the words through the following formulations:

3https://nlp.stanford.edu/projects/glove


Figure 7.2: Overall architecture of our model for cyberbullying detection. The model consists of three main modules: (1) Word Encoder that encodes the sequence of words in each segment to create the segment representation, (2) Segment Encoder that encodes the sequence of segments to create the media session representation, and (3) Classification Layer that provides the final classification. UMR (User Mention Rate), ATG (Average Time Gap), and ADM (Average DeepMoji Vector) show the hand-engineered features that are used to provide more context for each segment.


u_{it} = v^T \tanh(W_w h_{it} + b_w)    (7.1)

\alpha_{it} = \frac{\exp(u_{it})}{\sum_{t'} \exp(u_{it'})}    (7.2)

s'_i = \sum_t \alpha_{it} h_{it}    (7.3)

where s'_i is the resulting representation for the i-th segment.
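Equations 7.1-7.3 correspond to a standard additive attention layer over the BiLSTM outputs; a minimal PyTorch sketch (our own illustration, not the exact implementation) is:

```python
import torch
import torch.nn as nn

class WordAttention(nn.Module):
    """Additive attention over BiLSTM word annotations (Eqs. 7.1-7.3)."""
    def __init__(self, hidden_dim, attn_dim):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, attn_dim)        # W_w and b_w
        self.context = nn.Linear(attn_dim, 1, bias=False)  # v

    def forward(self, h):                                  # h: (batch, seq_len, hidden_dim)
        u = torch.tanh(self.proj(h))                       # Eq. 7.1
        alpha = torch.softmax(self.context(u).squeeze(-1), dim=-1)  # Eq. 7.2
        s = torch.bmm(alpha.unsqueeze(1), h).squeeze(1)    # Eq. 7.3: weighted sum over words
        return s, alpha
```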

Sometimes the caption of the photo in a media session might provoke the other users

to post negative comments, e.g., when the account holder expresses his/her opinion about

a specific topic, and another user has an opposite view. With this intuition, we try to

model the relevancy of each segment to the caption of the photo in each media session

using another attention mechanism called co-attention [8]. Through this method, we aim

at learning the attention weights of the words in segments based on their relevancy to the

caption. To calculate the segment-caption co-attention, we first pass the caption through

the embeddings, and then the BiLSTM layers to generate caption representation. Suppose

that S = {h1, h2, ..., hN} is the hidden representation of encoded words in a segment, and

C = {c1, c2, ..., cT} is the feature matrix of encoded words in the caption. We first calculate

the affinity matrix F ∈ RT×N as follows:

F = tanh(CTWlS) (7.4)

where Wl is the learnable weighting matrix. Then, we used this affinity matrix as a feature

and calculated the word attention maps as follows:

Hs = tanh(WsS + (WcC)F ) (7.5)

where W_s and W_c are the weight parameters that can be learned through the model. We compute the attention weights of the words, and then the segment representation, as follows:

\alpha^s = \mathrm{softmax}(w_{hs}^T H^s)    (7.6)

\hat{s}_i = \sum_t \alpha^s_t h_t    (7.7)

Ultimately, we create the final segment representation s_i by concatenating s'_i and \hat{s}_i, i.e., s_i = [s'_i; \hat{s}_i].

2. Segment Encoder with Attention: To provide more context for each segment, we

make use of the following features, as well as the representation provided by the word encoder:

1. User Mention Rate (UMR): Cyberbullies mostly target their victims directly. With

this intuition, for each segment, we calculate the ratio of comments containing a di-

rect user mention. By user mention, we mean the comments that include the token

@username.

2. Average Time Gaps (ATG): Cyberbullying is a continuous attack over a period

of time. The previous studies showed that there is a strong correlation between the

strength of support for cyberbullying incidents and the media sessions that include

frequent postings within 1 hour of each other [27]. Therefore, we decided to take

temporal information into account. We calculate the average time gap ∆t between the

posts inside a segment, in terms of the number of hours.

3. Average DeepMoji Vector (ADM): Based on what we showed in Section 7.1.4,

DeepMoji representation can help identify cyberbullying incidents. We extract the

DeepMoji representation per post, and calculate the average DeepMoji vector over all

the posts inside a segment.

We concatenate all these features and pass them to a feed-forward layer to: (1) regulate their


range of values, and (2) compress all this information into a single representation. Then, for

each segment, we concatenate the s_i representation with the feature representation and feed the resulting vectors to the segment-level BiLSTM. Not all the segments in a conversation are of the same importance for identifying the class of the media session. Therefore, we employ another regular attention mechanism, like the one used inside the word encoder, to weigh the

segments, and generate the final representation for the media session.
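A minimal sketch of how UMR, ATG, and ADM could be computed per segment is shown below; posts is assumed to be a list of (timestamp, text) pairs, dm_vectors the matching DeepMoji outputs, and detecting mentions via the "@" character is a simplification of the @username check described above.

```python
import numpy as np

def segment_context_features(posts, dm_vectors):
    """Return (UMR, ATG, ADM) for one segment."""
    n = len(posts)
    umr = sum("@" in text for _, text in posts) / n            # user mention rate
    gaps_h = [(posts[k + 1][0] - posts[k][0]).total_seconds() / 3600
              for k in range(n - 1)]                           # gaps between posts, in hours
    atg = float(np.mean(gaps_h)) if gaps_h else 0.0            # average time gap
    adm = dm_vectors.mean(axis=0)                              # average DeepMoji vector
    return umr, atg, adm
```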

3. Document Classifier: We feed the media session representation to a fully-connected

layer to compress the output representation of the segment encoder. Finally, we pass the resulting vector to an output layer with softmax activation to predict whether the

media session includes cyberbullying or not.

7.2.3 Experimental Setup

Table 7.2 shows the padding lengths at the segment level and conversation level. We chose the No.

segments and No. posts values based on the median number of segments in a media session,

and the median number of posts in a segment, respectively. No. segments shows the maxi-

mum number of segments that we consider to create the input to the model. We remove the

extra segments from the end of the longer media sessions, and for the shorter ones, we add

all-zero segments to the end of the sequence. No. posts shows the maximum number

of posts that we consider in each segment. In the varying length segmentation method, we

only considered the first ten posts in the larger segments. As we mentioned earlier, we con-

catenate the remaining posts in a segment to create a sequence of words per segment. Before

that, we truncate the posts to 65 tokens based on the median length of posts in training

data. After concatenating the posts inside the segments, we find the maximum segment

length, i.e., the maximum number of tokens in a segment. Then we right-pad the shorter

segments with zeros.


Table 7.2: Padding length at the segment level and conversation level for the various segmentation algorithms.

Segmentation       No. segments   No. posts
Varying Length     10             10
Constant Length    10             5
Single Post        50             1
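The padding scheme of Table 7.2 amounts to the following sketch, where each media session is a list of segments and each segment a list of token ids (0 is assumed to be the padding id); the function is illustrative rather than the exact preprocessing code.

```python
def pad_session(segments, n_segments, seg_len):
    """Truncate/pad a media session to n_segments segments of seg_len token ids each."""
    segments = [s[:seg_len] + [0] * max(0, seg_len - len(s)) for s in segments[:n_segments]]
    segments += [[0] * seg_len] * (n_segments - len(segments))   # all-zero segments at the end
    return segments
```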

We use Binary Cross Entropy (Equation 6.4) to compute the loss between predicted and

actual labels. The network weights are updated using the Adam optimizer [28] with a learning

rate of 1e−5. We train the model over 250 epochs, and save the best model based on the

best weighted F1 obtained from the validation set.
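Assuming a PyTorch implementation of the model and standard data loaders (model, train_loader, val_loader, and val_labels are all hypothetical names, and the model is assumed to return the cyberbullying probability), the training configuration described above corresponds roughly to:

```python
import torch
from sklearn.metrics import f1_score

criterion = torch.nn.BCELoss()                             # binary cross entropy (Eq. 6.4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # Adam, learning rate 1e-5

best_f1 = 0.0
for epoch in range(250):
    model.train()
    for segments, feats, labels in train_loader:           # hypothetical DataLoader
        optimizer.zero_grad()
        loss = criterion(model(segments, feats).squeeze(-1), labels.float())
        loss.backward()
        optimizer.step()

    # keep the checkpoint with the best weighted F1 on the validation set
    model.eval()
    with torch.no_grad():
        preds = [(model(s, f).squeeze(-1) > 0.5).long() for s, f, _ in val_loader]
    f1 = f1_score(val_labels, torch.cat(preds).cpu().numpy(), average="weighted")
    if f1 > best_f1:
        best_f1 = f1
        torch.save(model.state_dict(), "best_model.pt")
```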

7.2.4 Decision-Making Process

As we mentioned earlier, in our early text categorization scenario, we train the model using

the whole training data similar to regular text classification. However, the evaluation process

is different. In these experiments, we evaluate the performance through a post-by-post

evaluation framework that we discussed in Section 3.2.2. Therefore, at every time t, we only

have access to the first t comments in each media session. Then, the system should decide

whether to label the media session as cyberbullying based on the currently seen posts, or to wait for more posts.

For creating the appropriate input for the model in this evaluation framework, we start

with an empty sequence of segments. Then, we update this sequence at every iteration with

one more post based on the segmentation approach that we used for training the model.

For the models trained with a sequence of varying or constant length segments, we assume

that we can have at most a sequence of 10 segments as the input to the model, where each segment includes at most 5 posts. For the longer media sessions, if we need to add

a new post to the sequence, but all the segments are already full, we remove the very first segment from the beginning of the sequence, which contains the oldest information. Figure 7.3

illustrates this update process.

Figure 7.3: Updating the input sequence for a long media session at iteration t. Each segment includes at most 5 posts, but we might have a lower number of posts in a segment due to the big time gap between two posts. In this example, the time gap between p7 and p8 is greater than one month, and they are placed in different segments.

Therefore, in each iteration, we update the input sequence based on the new post and pass

it to the model to get the output distribution over the cyberbullying and non-cyberbullying

classes for that iteration. Furthermore, we design a policy-based decision-making process

that, in every iteration, decides whether to label each media session based on the current and previous outputs of the model, or to wait for more posts. In our policy, we take

into account the performance of the model in two consecutive iterations. More precisely, the

system labels a media session i as cyberbullying in the t-th iteration if the output probability

of cyberbullying class for both current and previous input sequences is greater than 80%. We

fixed this threshold empirically, using the development set. With this policy, we aim to focus

on the limited time frames in which the model is confident that the risk of cyberbullying is

high. Once the system decides to label a media session as cyberbullying based on the seen

posts at iteration t, it is not allowed to monitor the rest of the posts in that media session,

or to change the decision. Also, if the system reaches the last post in a media session without raising an alert, it is labeled as non-cyberbullying.
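Putting the policy together, a simplified version looks like the sketch below; update_segments and predict_proba stand for hypothetical callables that maintain the segment sequence as in Figure 7.3 and query the trained model for the cyberbullying probability of the current sequence.

```python
def monitor_session(posts, update_segments, predict_proba, threshold=0.8):
    """Post-by-post policy: raise a cyberbullying alert at the first iteration t
    where the model's cyberbullying probability exceeds `threshold` at both
    iterations t-1 and t. Returns the label and the number of posts monitored."""
    segments, prev_prob = [], 0.0
    for t, post in enumerate(posts, start=1):
        segments = update_segments(segments, post)  # hypothetical helper (cf. Figure 7.3)
        prob = predict_proba(segments)              # P(cyberbullying | first t posts)
        if prob > threshold and prev_prob > threshold:
            return "cyberbullying", t               # decision is final; stop monitoring
        prev_prob = prob
    return "non-cyberbullying", len(posts)          # end of session reached without an alert
```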


7.2.5 Classification Results

To find the contribution of each element in the performance, we compare our primary model

with the following variations:

1. HAN: We exploit the original architecture of the Hierarchical Attention Network

without using any of the hand-engineered features.

2. HAN + UMR: We only consider User Mention Rate to provide more context for the

segment representations.

3. HAN + ATG: We only use the Average Time Gap in the segment encoder as the

hand-crafted feature.

4. HAN + ADM: We only concatenate the Average DeepMoji vectors with the segment

representations in the segment encoder.

We run our proposed model, as well as all of the variations with and without using the

co-attention mechanism.

Table 7.3 shows the classification results in terms of precision, recall, F1, and F-latency

for the cyberbullying class using our decision policy for cyberbullying detection. Based on the results, we can conclude that overall, the performance of the different models using varying length segmentation is better than with the two other segmentation approaches. This observation supports our hypothesis that this method can help the model to better identify the

cyberbullying patterns. Adding co-attention decreases the performance in almost all the

models. The possible reasons could be:

1. Captions are usually much shorter than the comments posted on a media session. Based

on the training data, we found that the average number of words in captions is around

10; however, the average number of words in a comment is around 73. Therefore, it


Table 7.3: Classification results for cyberbullying detection using different segmentation methods in terms of precision, recall, F1, F-latency, and the average number of comments that the model needs to monitor for making a cyberbullying prediction. The values in bold show the best scores obtained for each segmentation method.

Model                  Precision  Recall  F1     F-latency  Avg. # of comments

Varying Length
HAN                    76.06      78.83   77.42  72.01      16.08
HAN + co-attn          73.03      81.02   76.82  71.45      18.00
HAN + UMR              75.34      80.29   77.74  72.31      16.40
HAN + UMR + co-attn    67.81      86.13   75.88  70.58      16.04
HAN + ATG              64.29      85.40   73.35  68.96      15.25
HAN + ATG + co-attn    67.92      78.83   72.97  67.87      17.01
HAN + ADM              74.50      81.02   77.62  72.20      15.71
HAN + ADM + co-attn    69.37      81.02   74.75  70.27      15.70
Full Model             74.48      78.83   76.60  71.24      15.80
Full Model + co-attn   70.70      81.02   75.51  70.98      15.63

Constant Length
HAN                    78.40      71.53   74.81  69.58      15.16
HAN + co-attn          75.56      74.45   75.00  67.90      17.73
HAN + UMR              77.37      77.37   77.37  71.96      17.18
HAN + UMR + co-attn    68.71      81.75   74.67  69.45      17.59
HAN + ATG              68.52      81.02   74.25  69.80      15.56
HAN + ATG + co-attn    68.96      72.99   70.92  64.56      19.21
HAN + ADM              75.00      78.83   76.87  72.26      15.73
HAN + ADM + co-attn    68.26      83.21   75.00  70.13      16.28
Full Model             72.61      83.21   77.55  72.13      15.92
Full Model + co-attn   67.50      78.83   72.73  68.37      14.97

Single Post
HAN                    71.15      81.02   75.77  71.23      14.30
HAN + co-attn          75.00      72.26   73.61  67.00      16.63
HAN + UMR              73.82      80.29   76.92  72.31      17.13
HAN + UMR + co-attn    68.67      75.18   71.78  66.76      16.48
HAN + ATG              72.66      73.72   73.19  68.80      15.86
HAN + ATG + co-attn    66.25      77.37   71.38  67.10      15.20
HAN + ADM              71.15      81.02   75.77  71.23      13.88
HAN + ADM + co-attn    67.48      80.29   73.33  69.30      13.62
Full Model             74.00      81.02   77.35  72.72      15.58
Full Model + co-attn   69.85      69.34   69.60  65.43      14.61

is much harder to extract useful information from the caption. Also, for many of the media sessions, the caption is empty and does not provide any information.

2. Sometimes, the caption is not directly relevant to the posted photo. The content of the

photo could affect the conversation more significantly. Unfortunately, we did not have

access to the images on the Instagram dataset to examine this assumption empirically.

Among all the models, HAN + ATG seems to have the worst performance for all the

segmentation techniques. The reason could be the way that we encode the time information.


We found that in most of the media sessions, the time gaps between the latest posts are much

higher than the time gaps between the posts at the start or middle of the conversation. The

gaps could be in the range of several minutes to a couple of years. Therefore, encoding the

temporal information is hard. We can also see in Table 7.3 that we do not get the best

results from our full model when we utilize varying length segmentation. However, using

the two other segmentation methods, we obtained the best results with our primary model.

That is because the varying length segmentation method might generate very large segments containing many comments. Therefore, for extracting the hand-crafted features, calculating the average vector over such long segments is probably not the best approach.

(a) Varying length segmentation. (b) Single post segmentation.

Figure 7.4: Delay distribution for HAN + UMR model.

Another interesting observation from Table 7.3 is that the F-latency scores that we obtain

for the HAN + UMR model using the varying length and single post segmentation methods are the same. However, for the latter, the F1 score is lower, while the average number of comments needed to make the decision is higher than for the former. Figure 7.4 illustrates the

delay distribution for both models. By delay, we mean the minimum number of posts that

the model needs to monitor before predicting the cyberbullying media sessions. As we can


see in this figure, using single post segmentation, the model could make earlier decisions for more media sessions compared to varying length segmentation. However, there are more outliers (delay > 80) in this case, which make the average value higher.

7.2.6 Comparison with State-of-the-art

To the best of our knowledge, there is only one research study that addressed the early de-

tection of cyberbullying [78]. In this work, the authors proposed a model called “CONcISE”

for timely and accurate cyberbullying detection. This model includes a two-stage approach

that classifies comments (offensive vs. neutral) in a media session as they become available

and raises an alert at the session-level when a threshold is exceeded. In this study, the

authors used the Instagram cyberbullying corpus as well. However, they did not consider all

the media sessions in their experiments. They only made use of the media sessions in which 40% of the comments include at least one bad word. They used 22.1% of those media sessions for training (203 media sessions), and the remaining 77.9% for testing (719 media sessions). To

compare our model with CONcISE, we use the same training and test sets as what they

used, and train the models that performed best for different segmentation strategies (based

on Table 7.3). Table 7.4 compares our models with three different variations of their ap-

proach. In each of these variations, they utilized a different list of bad words for detecting

the abusive comments.

Based on Table 7.4, our HAN + UMR model using varying length segmentation outperforms the state-of-the-art results in both performance (more than 2% improvement on

F1-score) and earliness of the decisions (around 10 fewer posts needed on average to make

the cyberbullying prediction). The results of our full model using the other two segmentation

algorithms are also comparable to CONcISE’s results, considering that our model can make

the decision much earlier than CONcISE. Also, we believe that the small size of training


Table 7.4: Comparison to the state-of-the-art results for detecting early signs of cyberbullying.

Model                             Precision  Recall  F1    Avg. # of comments (std.)
CONcISE-10                        69.5       79.4    74.1  26.68 (16.92)
CONcISE-profane                   74.5       76.9    75.7  30.62 (21.82)
CONcISE-Noswearing                74.2       77.6    75.9  29.75 (21.09)
HAN + UMR (Single Post)           71.2       68.7    69.9  20.55 (26.07)
HAN + UMR (Constant Length)       71.8       62.6    66.9  24.33 (23.80)
HAN + UMR (Varying Length)        74.4       82.6    78.3  16.28 (18.13)
Full Model (Single Post)          72.2       78.3    75.1  14.49 (17.19)
Full Model (Constant Length)      71.1       79.1    74.9  16.36 (18.46)
Full Model (Varying Length)       66.1       48.4    55.9  23.39 (25.25)

data in these experiments affects the overall performance of our deep neural model.

7.3 Findings

In this chapter, we first investigated the possibility of applying the early text categorization

scenario to detect cyberbullying incidents in their early stages. We started with traditional

machine learning approaches, and showed the advantages of using the flow of emojis (through

DeepMoji representation) to identify early signs of cyberbullying.

Next, we switched to deep learning approaches and proposed a hierarchical attention

network to address the problem of cyberbullying detection. We examined three different

time-wise segmentation techniques for creating the input to our model, and reported the

state-of-the-art results for early prediction of cyberbullying. We showed that our model

is able to monitor the sequence of comments in a media session and provide timely and

accurate predictions of cyberbullying conversations based on limited evidence.


Chapter 8

Conclusions and Future Work

In this chapter, we provide an overview of the work that we conducted in this dissertation.

We first summarize the goals and significant findings of our research. Then, we discuss the

shortcomings of our study and the possible future work in the area.

8.1 Conclusions

In this dissertation, we address the two related tasks of abusive language and cyberbullying

detection on online social media. Our goal is to advance new technology that will help to

protect vulnerable online users against cyber attacks. The first significant contribution of this

dissertation is to introduce new corpora for both tasks of abusive language and cyberbullying

detection. We mainly focused on the social media domains that are very popular among kids

and youth and are less explored by other research groups. We collected two datasets from

ask.fm and Curious Cat websites for the task of abusive language detection. The moderate

inter-annotator agreement scores for annotating both corpora validated the fact that the perceived level

of aggression is very subjective, and that the abusive language detection is a difficult task,

even for human annotators. We also introduced a new corpus for cyberbullying detection,


including several conversations collected from ask.fm social media. In the process of data

collection, we explored different sampling techniques that help us to capture more forms of

abusive content.

We began our experiments with creating discriminative methods to identify extremely

offensive online content automatically. We formulated this problem as a binary text classifi-

cation task. Our initial models consist of a traditional machine learning classifier, along with

a set of hand-engineered features. We explored several types of features, including lexical,

semantic, sentiment, stylistic, and lexicon-based features, to capture the different aspects of

the text. We learned that among these, lexical features are very powerful tools to detect

explicit forms of abuse. However, when we combine them with other selectively chosen fea-

tures, the performance of the model improves. Our analysis over a limited list of frequently

used bad words on ask.fm showed that online users, especially teens, mostly use profanities

neutrally. Therefore, there is a chance that lexical features add bias to the model towards

some specific bad words. These findings motivated us to build a more robust representation

that can capture contextual information from short and noisy online texts.

As the next step, we decided to use advanced deep neural techniques that recently re-

ported significant improvements in various text classification problems. With the notion

that the emotional aspects of the online text can help the automatic system to distinguish

between the use of profanity in a neutral way from an offensive way, we proposed a unified

deep neural architecture to detect abusive comments. Within our model, we introduced

a novel attention mechanism called “emotion-aware attention” that incorporates emotion

information into textual representations to identify the importance of the words inside a

comment. We used DeepMoji [18] model as part of our system to extract fine-grained emo-

tion information from the texts. The results proved the effectiveness of this approach, where

we obtained the state-of-the-result results on ask.fm and Curious Cat data comparing to


several strong baselines such as fine-tuned BERT.

Another conclusion of this dissertation is that none of our approaches works well across

different domains. In our experiments, we used our ask.fm and Curious Cat datasets as the

candidates for social media platforms that are especially popular among kids and youth. We

also used two other available language resources, Kaggle and Wikipedia, as the candidates for

social forums that are mostly popular among adults. Overall, we noticed that our proposed

models show entirely different behavior on ask.fm and Curious Cat as compared with Kaggle

and Wikipedia. For instance, with our deep neural architecture, Glove embeddings produce

much better results on ask.fm and Curious Cat in comparison with BERT. However, the

opposite is true on Kaggle and Wikipedia. Based on the results across different models and

datasets, we can conclude that ask.fm is the most difficult and challenging corpus. We can

provide two reasons for that: (1) ask.fm has the lowest average length of comments across all

corpora, and (2) almost all the instances in ask.fm include at least one profane word from a

short list of bad words. These observations demonstrate the need to explore different social

media websites, and leave open the question of whether it is possible to have an abusive language classifier that performs well across the various domains.

Our ultimate goal in this dissertation was to create a system that monitors the dynamic

text-stream of comments in online conversations, and accurately triggers cyberbullying alerts,

using limited evidence. We proposed a hierarchical attention network that analyzes a conver-

sation in two different levels: word-level and segment-level, where each segment is a subset

of the conversation, and may include a various number of posts. We also designed a post-by-

post evaluation framework along with a policy-based decision-making process through which

the system decides whether to make a cyberbullying prediction based on the current and

previous information, or wait for more evidence. With our proposed system, we reported

the state-of-the-art results for the task of early detection of cyberbullying.


To sum up, in this dissertation, we developed new technologies that can further be used

as part of a framework to protect vulnerable online users against online aggression and

cyberbullying. In this way, we advanced Natural Language Processing techniques by: (1)

extracting different types of information/features to provide contextual representations for

online short and noisy texts, and (2) developing a system that triggers timely and accu-

rate cyberbullying alerts based on limited evidence. Our proposed system can be further

adapted to the relevant tasks, where early risk prediction is crucial (e.g., detecting online

sexual predators, or mental health problems such as depression). We make our data re-

sources as well as all the proposed methods in this dissertation publicly available with the

hope of encouraging other researchers to contribute to the tasks of abusive language and

cyberbullying detection.

8.2 Future Work

During the research done in this dissertation, we identified some remaining challenges in

the field of abusive language and cyberbullying detection. With regard to the data, all the

available resources mostly include explicit forms of abusive language and are biased towards

profane words. This bias further affects the performance of automated abuse detection

models. To solve this problem, advanced sampling techniques are needed that are capable

of capturing implicit forms of abusive language as well.

One limitation of our research is that we did not exploit the inner characteristics of our

ask.fm and Curious Cat corpora. As we discussed in Chapter 4, we have two different types

of posts in these two datasets: (1) question, and (2) answer. In our experiments, we did not

explore the relevancy of these two types of posts but encoded them as separate comments.

However, jointly modeling the question and answer within a pair might be a better encoding

for the online users’ interactions. For example, the answer can provide more context of

88

whether a receiving question/comment is offensive towards the receiver.

Additionally, the conducted research on abusive language detection can be extended to

improve the results. Although some of our proposed models outperformed the state-of-the-

art results for our datasets, we found that those improvements are not generalizable to other

domains. Therefore, still more research has to be done on whether it is possible to have an

abusive language classifier that performs well across the various online domains.

Our early cyberbullying detection system can be improved, as well. One significant limi-

tation of our model is that we used a fixed threshold in the decision-making process. Learning

a dynamic threshold instead can make the model more generalizable to other domains. Fur-

thermore, although it was reported in the previous related studies that temporal information

could help detect cyberbullying, our results show the opposite. We believe that the way we

encoded this information into the model can be improved.

Finally, in this dissertation, we developed automatic methods to detect online abusive

language and cyberbullying that can help to protect vulnerable online users against cyber

attacks. There are still several limitations to reach this ultimate goal, and we hope that the

research community finds this work useful on the path to addressing the remaining challenges

in the field.


Bibliography

[1] al-Khateeb, H. M., and Epiphaniou, G. How technology can mitigate and coun-teract cyber-stalking and online grooming. Computer Fraud & Security 2016, Fwul1(2016), 14–18.

[2] Aragon, M. E., Lopez-Monroy, A. P., and Montes-y-Gomez, M. INAOE-CIMAT at erisk 2019: Detecting signs of anorexia using fine-grained emotions. InWorking Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano,Switzerland, September 9-12, 2019 (2019), vol. 2380 of CEUR Workshop Proceedings,CEUR-WS.org.

[3] Baccianella, S., Esuli, A., and Sebastiani, F. Sentiwordnet 3.0: An enhancedlexical resource for sentiment analysis and opinion mining. In LREC (2010), vol. 10,pp. 2200–2204.

[4] Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent Dirichlet allocation. Journalof Machine Learning Research 3, Jan (2003), 993–1022.

[5] Bodapati, S., Gella, S., Bhattacharjee, K., and Al-Onaizan, Y. Neuralword decomposition models for abusive language detection. In Proceedings of the ThirdWorkshop on Abusive Language Online (Florence, Italy, Aug. 2019), Association forComputational Linguistics, pp. 135–145.

[6] Chang, J. P., and Danescu-Niculescu-Mizil, C. Trouble on the horizon: Forecasting the derailment of online conversations as they develop. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019), pp. 4745–4756.

[7] Cheng, L., Guo, R., Silva, Y., Hall, D., and Liu, H. Hierarchical attention networks for cyberbullying detection on the instagram social network. In Proceedings of the 2019 SIAM International Conference on Data Mining (2019), SIAM, pp. 235–243.

[8] Cui, L., Shu, K., Wang, S., Lee, D., and Liu, H. defend: A system for explainable fake news detection. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019), pp. 2961–2964.


[9] Dadvar, M., Trieschnigg, D., Ordelman, R., and de Jong, F. Improvingcyberbullying detection with user context. In European Conference on InformationRetrieval (2013), Springer, pp. 693–696.

[10] Davidson, T., Warmsley, D., Macy, M., and Weber, I. Automated hate speechdetection and the problem of offensive language. In Eleventh International AAAI Con-ference on Web and Social Media (2017).

[11] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-trainingof deep bidirectional transformers for language understanding. In Proceedings of the2019 Conference of the North American Chapter of the Association for ComputationalLinguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019),pp. 4171–4186.

[12] Dinakar, K., Jones, B., Havasi, C., Lieberman, H., and Picard, R. Com-mon sense reasoning for detection, prevention, and mitigation of cyberbullying. ACMTransactions on Interactive Intelligent Systems (TiiS) 2, 3 (2012), 18.

[13] Dinakar, K., Reichart, R., and Lieberman, H. Modeling the detection of textualcyberbullying. In Fifth International AAAI Conference on Weblogs and Social Media(2011).

[14] Errecalde, M. L., Villegas, M. P., Funez, D. G., Ucelay, M. J. G., andCagnina, L. C. Temporal variation of terms as concept space for early risk predic-tion. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum,Dublin, Ireland, September 11-14, 2017 (2017), vol. 1866 of CEUR Workshop Proceed-ings, CEUR-WS.org.

[15] Escalante, H. J., Montes, M., Villasenor, L., and Errecalde, M. L. Earlytext classification: a naive solution. In Proceedings of NAACL-HLT 2016, 7th Workshopon Computational Approaches to Subjectivity, Sentiment and Social Media Analysis(2016), pp. 91–99.

[16] Escalante, H. J., Villatoro-Tello, E., Garza, S. E., Lopez-Monroy, A. P.,y Gomez, M. M., and Villasenor-Pineda, L. Early detection of deception andaggressiveness using profile-based representations. Expert Systems with Applications 89,Supplement C (2017), 99 – 111.

[17] Fano, E., Karlgren, J., and Nivre, J. Uppsala university and gavagai at CLEFerisk: Comparing word embedding models. In Working Notes of CLEF 2019 - Con-ference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019(2019), vol. 2380 of CEUR Workshop Proceedings, CEUR-WS.org.

[18] Felbo, B., Mislove, A., Søgaard, A., Rahwan, I., and Lehmann, S. Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (2017), pp. 1615–1625.

[19] Fleiss, J. L. Measuring nominal scale agreement among many raters. PsychologicalBulletin 76, 5 (1971), 378.

[20] Galan-Garcıa, P., Puerta, J. G. d. l., Gomez, C. L., Santos, I., andBringas, P. G. Supervised machine learning for the detection of troll profiles intwitter social network: Application to a real case of cyberbullying. Logic Journal of theIGPL 24, 1 (2016), 42–53.

[21] Gitari, N. D., Zuping, Z., Damien, H., and Long, J. A lexicon-based approach forhate speech detection. International Journal of Multimedia and Ubiquitous Engineering10, 4 (2015), 215–230.

[22] Healy. Ask.fm is relocating to ireland and no oneis happy about it. http://mashable.com/2014/11/05/

ask-fm-relocation-ireland-cyberbullying-suicides-cold-shoulder/

#SdafIlqyoGqg, 2014.

[23] Hinduja, S., and Patchin, J. W. Connecting adolescent suicide to the severity ofbullying and cyberbullying. Journal of School Violence 18, 3 (2019), 333–346.

[24] Hosseinmardi, H., Ghasemianlangroodi, A., Han, R., Lv, Q., and Mishra,S. Analyzing negative user behavior in a semi-anonymous social network. CoRRabs/1404.3839 (2014).

[25] Hosseinmardi, H., Han, R., Lv, Q., Mishra, S., and Ghasemianlangroodi,A. Towards understanding cyberbullying behavior in a semi-anonymous social network.In Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACMInternational Conference on (2014), IEEE, pp. 244–252.

[26] Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., andMishra, S. Analyzing labeled cyberbullying incidents on the instagram social net-work. In International Conference on Social Informatics (2015), Springer, pp. 49–66.

[27] Hosseinmardi, H., Mattson, S. A., Rafiq, R. I., Han, R., Lv, Q., and Mishra, S. Detection of cyberbullying incidents on the instagram social network. arXiv preprint arXiv:1503.03909 (2015).

[28] Kingma, D. P., and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[29] Kumar, R., Ojha, A. K., Malmasi, S., and Zampieri, M. Benchmarking aggres-sion identification in social media. In Proceedings of the First Workshop on Trolling,Aggression and Cyberbullying (TRAC-2018) (2018), pp. 1–11.


[30] Kumar, R., Ojha, A. K., Malmasi, S., and Zampieri, M. Benchmarking Aggres-sion Identification in Social Media. In Proceedings of the First Workshop on Trolling,Aggression and Cyberbulling (TRAC) (Santa Fe, USA, 2018).

[31] Kumar, R., Reganti, A. N., Bhatia, A., and Maheshwari, T. Aggression-annotated corpus of hindi-english code-mixed data. In Proceedings of the 11th LanguageResources and Evaluation Conference (LREC) (Miyazaki, Japan, 2018).

[32] Le, Q., and Mikolov, T. Distributed representations of sentences and documents.In International Conference on Machine Learning (2014), pp. 1188–1196.

[33] Lieberman, H., Dinakar, K., and Jones, B. Let’s gang up on cyberbullying.Computer 44, 9 (2011), 93–96.

[34] Livingstone, S., Haddon, L., Gorzig, A., and Olafsson, K. Risks and safetyon the internet. The Perspective of European Children. Final Findings from the EUKids Online Survey of (2010), 9–16.

[35] Lopez Monroy, A. P., Gonzalez, F. A., Montes, M., Escalante, H. J.,and Solorio, T. Early text classification using multi-resolution concept represen-tations. In Proceedings of the 2018 Conference of the North American Chapter of theAssociation for Computational Linguistics: Human Language Technologies, NAACL-HLT-2018, Volume 1 (Long Papers) (2018), Association for Computational Linguistics,pp. 1216–1225.

[36] Losada, D. E., Crestani, F., and Parapar, J. CLEF 2017 erisk overview: Early risk prediction on the internet: Experimental foundations. In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum, Dublin, Ireland, September 11-14, 2017 (2017), vol. 1866 of CEUR Workshop Proceedings, CEUR-WS.org.

[37] Losada, D. E., Crestani, F., and Parapar, J. Overview of erisk: Early riskprediction on the internet (extended lab overview). In Working Notes of CLEF 2018 -Conference and Labs of the Evaluation Forum, Avignon, France, September 10-14, 2018(2018), vol. 2125 of CEUR Workshop Proceedings, CEUR-WS.org.

[38] Losada, D. E., Crestani, F., and Parapar, J. Overview of erisk at CLEF 2019: Early risk prediction on the internet (extended overview). In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019 (2019), vol. 2380 of CEUR Workshop Proceedings, CEUR-WS.org.

[39] Macbeth, J., Adeyema, H., Lieberman, H., and Fry, C. Script-based storymatching for cyberbullying prevention. In CHI ’13 Extended Abstracts on Human Fac-tors in Computing Systems (New York, NY, USA, 2013), CHI EA ’13, ACM, pp. 901–906.


[40] Maharjan, S., Montes, M., Gonzalez, F. A., and Solorio, T. A genre-aware attention model to improve the likability prediction of books. In Proceedingsof the 2018 Conference on Empirical Methods in Natural Language Processing (2018),pp. 3381–3391.

[41] Mave, D., Maharjan, S., and Solorio, T. Language Identification and Analysisof Code-Switched Social Media Text. In Proceedings of the Third Workshop on Com-putational Approaches to Linguistic Code-Switching (Melbourne, Australia, July 2018),Association for Computational Linguistics.

[42] Mishra, P., Del Tredici, M., Yannakoudakis, H., and Shutova, E. Abu-sive language detection with graph convolutional networks. In Proceedings of the 2019Conference of the North American Chapter of the Association for Computational Lin-guistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019),pp. 2145–2150.

[43] Mishra, P., Yannakoudakis, H., and Shutova, E. Neural character-based com-position models for abuse detection. In Proceedings of the 2nd Workshop on AbusiveLanguage Online (ALW2) (2018), pp. 1–10.

[44] Mohammadi, E., Amini, H., and Kosseim, L. Quick and (maybe not so) easy de-tection of anorexia in social media posts. In Working Notes of CLEF 2019 - Conferenceand Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019 (2019),vol. 2380 of CEUR Workshop Proceedings, CEUR-WS.org.

[45] Nobata, C., Tetreault, J., Thomas, A., Mehdad, Y., and Chang, Y. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web (Republic and Canton of Geneva, Switzerland, 2016), WWW ’16, International World Wide Web Conferences Steering Committee, pp. 145–153.

[46] nobullying.com. The complicated web of teen lives - 2015 bullying report. http://nobullying.com/the-complicated-web-of-teen-lives-2015-bullying-report/, 2015.

[47] Owoputi, O., O’Connor, B., Dyer, C., Gimpel, K., Schneider, N., and Smith, N. A. Improved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2013), pp. 380–390.

[48] Park, J. H., and Fung, P. One-step and two-step classification for abusive language detection on twitter. In Proceedings of the First Workshop on Abusive Language Online (2017), pp. 41–45.

[49] Patchin, J. W. Summary of our cyberbullying research (2004-2015). http://cyberbullying.org/summary-of-our-cyberbullying-research, 2015.

[50] Patchin, J. W., and Hinduja, S. Cyberbullying and self-esteem. Journal of School Health 80, 12 (2010), 614–621.

[51] Pennebaker, J. W., Booth, R. J., and Francis, M. E. LIWC2007: Linguistic inquiry and word count. Austin, Texas: liwc.net (2007).

[52] Rafiq, R. I., Hosseinmardi, H., Han, R., Lv, Q., and Mishra, S. Scalable and timely detection of cyberbullying in online social networks. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (2018), ACM, pp. 1738–1747.

[53] Rafiq, R. I., Hosseinmardi, H., Han, R., Lv, Q., Mishra, S., and Mattson, S. A. Careful what you share in six seconds: Detecting cyberbullying instances in vine. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 (2015), ACM, pp. 617–622.

[54] Razavi, A. H., Inkpen, D., Uritsky, S., and Matwin, S. Offensive language detection using multi-level classification. In Canadian Conference on Artificial Intelligence (2010), Springer, pp. 16–27.

[55] Ribeiro, M. H., Calais, P. H., Santos, Y. A., Almeida, V. A., and Meira Jr, W. Characterizing and detecting hateful users on twitter. In Twelfth International AAAI Conference on Web and Social Media (2018).

[56] Sadeque, F., Xu, D., and Bethard, S. Measuring the latency of depression detection in social media. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (2018), ACM, pp. 495–503.

[57] Samghabadi, N. S., Maharjan, S., Sprague, A., Diaz-Sprague, R., and Solorio, T. Detecting nastiness in social media. In Proceedings of the First Workshop on Abusive Language Online (2017), pp. 63–72.

[58] Sap, M., Card, D., Gabriel, S., Choi, Y., and Smith, N. A. The risk of racial bias in hate speech detection. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (2019), pp. 1668–1678.

[59] Sap, M., Park, G., Eichstaedt, J., Kern, M., Stillwell, D., Kosinski, M., Ungar, L., and Schwartz, H. A. Developing age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014), pp. 1146–1151.

[60] Shute. Cyberbullying suicides: what will it take to have ask.fm shut down? - telegraph. http://www.telegraph.co.uk/news/health/children/10225846/Cyberbullying-suicides-What-will-it-take-to-have-Ask.fm-shut-down.html, 2013.

[61] Singh, V., Varshney, A., Akhtar, S. S., Vijay, D., and Shrivastava, M. Aggression detection on social media text using deep neural networks. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2) (2018), pp. 43–50.

[62] Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (2013), pp. 1631–1642.

[63] Song, K., Bing, L., Gao, W., Lin, J., Zhao, L., Wang, J., Sun, C., Liu, X., and Zhang, Q. Using customer service dialogues for satisfaction analysis with context-assisted multiple instance learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (2019), pp. 198–207.

[64] Soni, D., and Singh, V. Time reveals all wounds: Modeling temporal characteristics of cyberbullying. In Twelfth International AAAI Conference on Web and Social Media (2018).

[65] Sood, S. O., Antin, J., and Churchill, E. Using crowdsourcing to improve profanity detection. In 2012 AAAI Spring Symposium Series (2012).

[66] Sticca, F., and Perren, S. Is cyberbullying worse than traditional bullying? Examining the differential roles of medium, publicity, and anonymity for the perceived severity of bullying. Journal of Youth and Adolescence 42, 5 (2013), 739–750.

[67] Trifan, A., and Oliveira, J. L. Bioinfo@uavr at erisk 2019: delving into social media texts for the early detection of mental and food disorders. In Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019 (2019), vol. 2380 of CEUR Workshop Proceedings, CEUR-WS.org.

[68] van Aken, B., Risch, J., Krestel, R., and Löser, A. Challenges for toxic comment classification: An in-depth error analysis. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2) (2018), pp. 33–42.

[69] Van Hee, C., Lefever, E., Verhoeven, B., Mennes, J., Desmet, B., De Pauw, G., Daelemans, W., and Hoste, V. Detection and fine-grained classification of cyberbullying events. In Proceedings of the International Conference Recent Advances in Natural Language Processing (2015), INCOMA Ltd. Shoumen, Bulgaria, pp. 672–680.

[70] Wang, C. Interpreting neural network hate speech classifiers. In Proceedings of the 2nd Workshop on Abusive Language Online (ALW2) (2018), pp. 86–92.

[71] Warner, W., and Hirschberg, J. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media (2012), Association for Computational Linguistics, pp. 19–26.

[72] Waseem, Z. Are you a racist or am i seeing things? annotator influence on hate speech detection on twitter. In Proceedings of the First Workshop on NLP and Computational Social Science (2016), pp. 138–142.

[73] Waseem, Z., and Hovy, D. Hateful symbols or hateful people? predictive features for hate speech detection on twitter. In Proceedings of the NAACL Student Research Workshop (2016), pp. 88–93.

[74] Wiegand, M., Ruppenhofer, J., Schmidt, A., and Greenberg, C. Inducing a lexicon of abusive words – a feature-based approach. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (2018), pp. 1046–1056.

[75] Wulczyn, E., Thain, N., and Dixon, L. Ex machina: Personal attacks seen at scale. In Proceedings of the 26th International Conference on World Wide Web (2017), International World Wide Web Conferences Steering Committee, pp. 1391–1399.

[76] Xu, J.-M., Jun, K.-S., Zhu, X., and Bellmore, A. Learning from bullying traces in social media. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Stroudsburg, PA, USA, 2012), NAACL HLT ’12, Association for Computational Linguistics, pp. 656–666.

[77] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., and Hovy, E. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2016), pp. 1480–1489.

[78] Yao, M., Chelmis, C., and Zois, D. S. Cyberbullying ends here: Towards robust detection of cyberbullying in social media. In The World Wide Web Conference (2019), pp. 3427–3433.

[79] Yin, D., Davison, B. D., Xue, Z., Hong, L., Kontostathis, A., and Edwards, L. Detection of harassment on web 2.0. Proceedings of the Content Analysis in the Web 2.0 (CAW2.0) 2 (2009), 1–7.

[80] Zampieri, M., Malmasi, S., Nakov, P., Rosenthal, S., Farra, N., and Kumar, R. Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019), pp. 1415–1420.

[81] Zhang, Z., Robinson, D., and Tepper, J. Detecting hate speech on twitter using a convolution-GRU based deep neural network. In European Semantic Web Conference (2018), Springer, pp. 745–760.
