
15th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2018) ©2018 IEEE

Autonomic Author Identification in Internet Relay Chat (IRC)

Sicong Shao, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona

Cihan Tunc, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona

Amany Al-Shawi, National Center for Cybersecurity Technology, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia

Salim Hariri, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona

Abstract— With the advances in Internet technologies and services, social media has gained enormous popularity, in part because these platforms provide anonymity: users post their messages under nicknames. Unfortunately, cyber-criminals have exploited this anonymity to hide their identities and operations. Hence, there is growing interest in the cybersecurity research community in identifying the authors of malicious messages and activities. Internet Relay Chat (IRC) channels are widely used to exchange messages and information among malicious users involved in cybercrime. In this paper, we present an autonomic author identification technique based on personality profiling and analysis of IRC messages. We first monitor the IRC channels using our autonomic bots and then create a personality profile for each targeted author. We demonstrate that personality analysis is an efficient approach to author identification and achieves high detection rates.

Keywords— Author identification; cybersecurity; machine learning; Internet Relay Chat (IRC); Watson AI platform; personality insights

I. INTRODUCTION

Advances in Internet and mobile services have led to rapid growth in the use of social networks, where users can post opinions and share information across the globe immediately and anonymously. However, this anonymity has also been exploited by cyber-criminals to create fake and illegal accounts and pretend to be somebody else [36]. People with illegitimate purposes can exploit the powerful dissemination features of social networking platforms to spread malicious information and influence public opinion, which can pose a serious threat in cyberspace. For example, in the “Myspace mom” incident, cruel messages and cyber-bullying sent from a fake account led to the suicide of a teenage girl [29]. Furthermore, perpetrators can avoid detection by using anonymous servers, spoofing, VPNs, and fake accounts. Hence, homeland security and law enforcement agencies have launched projects to prevent deceptive attacks and track the identities of senders in order to improve protection against terrorism, child predators, etc. [30].

Current solutions can detect and find cyber-criminals only if they make the mistake of revealing real-identity information; e.g., the Andromeda botnet mastermind (a.k.a. Ar3s) was arrested through his public ICQ number [31]. Hence, it is critically important to develop innovative tools that can efficiently identify authors, their origins, languages, locations, etc., regardless of the approach used to hide their identity. Thus, developing effective methods for tracing illegitimate messages is an important cybersecurity research problem. Author identification for online messages is one of the suggested solutions; it can be formulated as assigning a piece of writing to one of a given set of authors [3]. The goal of the research presented in this paper is to develop methods for author identification in cyberspace, including the ability to group intercepted anonymous messages that belong to the same author as well as those generated by known terrorists or cyber-criminals [30].

To achieve these capabilities, we present the design and implementation of the Automatic Author Identification and Characterization (AAIC) framework, which identifies authorship using personality profile analysis, based on the fact that a person's personality characteristics are relatively stable [37]. For monitoring and identification of the authors, we use Internet Relay Chat (IRC) channels. Security groups (both malicious and non-malicious) actively use IRC to share knowledge and get help because it provides many professional channels (chatrooms) for real-time communication [1][2], including channels devoted to cybercrime such as hacking, cracking, and carding [9][25]. IRC is used in cybercrime not only because it is a common method of communication in the cybercrime community but also because it ensures anonymity: users can hide in public channels (chatrooms) and change their user names regularly.

Compared to most author identification techniques, which focus on newspapers, emails, website forums, blogs, etc. [3][4][5][6][7][8], performing author identification in IRC is more challenging for several reasons. First, most previous work studied asynchronous computer-mediated communication (CMC), including emails, web forums, blogs, and comments, while few studies address author identification in synchronous media such as chatrooms. As a result, there are fewer clues on how to find effective features for distinguishing a user in a chatroom such as IRC. Second, IRC channel administrators generally dislike bots (used to monitor/log the channels) and block them and even their IPs, which necessitates intelligent monitoring systems. Third, most author identification studies have focused on identifying a very limited number of the most active users (e.g., up to 20) [3][4][5][6][7]. For example, Zheng et al. [3] identified up to 20 of the most active users who frequently posted messages in newsgroups, with the best accuracy being around 83% for 20 authors. However, more candidates need to be considered because an IRC channel usually has more potential suspects. Fourth, the average IRC message is very short compared to emails or blog entries. For example, the average length of a message in #anonops, one of the IRC channels operated by the anonymous organization, is 5.7 words, based on more than 800,000 messages collected by our monitoring bot. Last but not least, hacker, anonymity, and terrorist groups use sophisticated techniques to hide and fake the stylometric features that are commonly used for author identification. However, even when their stylometric features change, their personality characteristics remain the same, which provides the means to identify them using personality profile analysis.

978-1-5386-9120-5/18/$31.00 ©2018 IEEE

The main idea in our author identification approach for IRC channels is that suspects in cyberspace unconsciously leave traces of their personality in their online messages. These messages can be used to model personality characteristics, which in turn can be used to distinguish each individual. Moreover, most author identification techniques rely on stylometric features, which can be manipulated and controlled [3][4][5][6][7][8]; hiding and faking personality features, however, is arduous.

The remainder of this paper is structured as follows. Section II provides background information about IRC, author identification, and the IBM Watson Artificial Intelligence (AI) platform. Section III explains our Automatic Author Identification and Characterization (AAIC) framework. Section IV presents the experimental environment and evaluation results. Finally, Section V concludes the paper.

II. BACKGROUND AND RELATED WORK

A. Internet Relay Chat (IRC)

Internet Relay Chat (IRC) is a popular communication method, especially in the cyber-domain. IRC requires a server that provides networking for the connected users through a protocol that facilitates real-time text communication [1]. IRC has traditionally been used for legitimate purposes, but it has also been used extensively by hacker, anonymity, and terrorist groups over the years [22]. IRC provides two methods of communication: (i) private chat, i.e., one-to-one messages, and (ii) broadcast public messages. In a channel, public messages sent by a user are broadcast to all other users in the same channel in real time. This differs from websites (e.g., blogs), where users can read previously posted messages at any time by browsing them [9]. In contrast to website blogs, where offline collection and batch processing work efficiently, in IRC-based communication real-time collection and threat detection are critical research issues [1][10].

In this research, in order to monitor the IRC channels and the chats, we have developed autonomic IRC bots for the comprehensive real-time recording of the IRC data using several strategies as will be discussed in Section III.

B. Author Identification

Author identification has been widely used for purposes such as computer forensics, plagiarism checking, and detecting social media misuse, using e-mail, website forums, and blogs. For example, Zheng et al. [3] and Narayanan et al. [11] conducted a series of studies to identify the authors of online messages. Author identification for online messages is a particularly important issue in cybercrime because one of the defining features of cybercrime is anonymity. Anonymous users typically fake their personal information and hide their identity to escape security investigation.

The authors in [32] apply n-gram analysis using 3-15 word-based n-grams and apply k-NN, outlier classification, and collective classification. In [33], the authors focus on the frequencies of the most frequently occurring character n-grams and apply SVM for classification. In [11], the authors extract features from each post and pass them to the classifier. In doing so, they use a fixed set of function words, such as “the” and “in”, because function words bear little relation to the topic of conversation. They also exclude bigrams and trigrams, which may be significantly influenced by specific words. Hill et al. [34] showed that stylometry enables identifying reviewers of research papers with reasonably high accuracy, given that the adversary, assumed to be a member of the community, has access to a large number of unblinded reviews by potential reviewers through serving on conference and grant selection committees. The authors in [35] apply unsupervised machine learning algorithms based on word frequencies. However, that study includes only a very small number of authors in its demonstration.

IRC provides anonymity by protocol [12]. An IRC server automatically masks a user's IP when the user connects, which is the most important layer of anonymity. Moreover, unlike most social media platforms, which require sign-up, users join hacker channels through an easy non-registration process and can change their user names as often as they wish. With the help of hacker groups, e.g., #anonops, an infamous international anonymous organization that provides a network in IRC, users can hide in IRC channels to plan and commit cybercrime [13]. Therefore, it would be very valuable if security experts could identify the author of IRC messages given a list of suspects. There are a few previous studies related to author identification in IRC. Layton et al. [38] used the IRC messages of 50 users from 8 years of Ubuntu channel logs and achieved an accuracy of up to 55% using inverse-author-frequency (iaf) weighting and Recentred Local Profile (RLP) methods. Inches et al. [39] used the “irc logs” dataset, which merges heterogeneous chat messages from 40 different IRC channels. By applying chi-squared distance and Kullback-Leibler divergence to determine the similarity between author profiles, they achieved best accuracies of up to 92%, 52%, and 61% with 20, 132, and 148 users, respectively. However, this approach does not consider the degree of class imbalance and does not address author identification within a single channel, which is a more difficult real-world problem due to the lack of data. Furthermore, both approaches focus only on normal channels and do not consider hacker channels, where the logs are harder to collect and users are more sophisticated at hiding their identities.

C. Personality Analysis through IBM Watson AI Platform

IBM Watson is the AI platform service provided by IBM that allows users to integrate AI into their applications and to train, manage, and analyze data in a secure cloud environment (guaranteeing the privacy of the data with respect to IBM and third parties) [14][15][16].

We leverage the IBM Watson Assistant and Personality Insights capabilities to build the conversation module and the personality feature extraction module of our automatic author identification and characterization framework. Watson Assistant is an AI assistant service for social media that answers questions through pre-configured content intents, such as banking [17]. The service can also improve over time by using the conversation history to better understand input [17].

Another service we leverage in our approach is IBM Personality Insights, which integrates psychology and data analytics algorithms to analyze the given content and create a personality profile [18]. The IBM Personality Insights service uses three models: Big Five, Needs, and Values [18]. The Big Five personality characteristics represent the most widely used model for generally describing how a person engages with the world. The model includes five primary dimensions: (1) Agreeableness: a person's tendency to be compassionate and cooperative toward others; (2) Conscientiousness: a person's tendency to act in an organized or thoughtful way; (3) Extraversion: a person's tendency to seek stimulation in the company of others; (4) Emotional range, also referred to as Neuroticism or Natural reactions: the extent to which a person's emotions are sensitive to the person's environment; and (5) Openness: the extent to which a person is open to experiencing a variety of activities. Each of these top-level dimensions has six facets that further characterize an individual according to the dimension. The Needs model describes which aspects of a product will resonate with a person and includes twelve characteristic needs: Excitement, Harmony, Curiosity, Ideal, Closeness, Self-expression, Liberty, Love, Practicality, Stability, Challenge, and Structure. The Values model describes motivating factors that influence a person's decision making and includes five values: Self-transcendence, Conservation, Hedonism, Self-enhancement, and Open to change. Watson infers personality features from textual information using an open-vocabulary approach [18]. Using GloVe, an open-source word embedding technique, the service obtains a vector representation for the words in the input text [40]. It then feeds this representation to a machine learning model that infers a personality profile. To train the model, IBM uses scores from surveys that were conducted among thousands of users along with their Twitter data [18].

III. SYSTEM DESIGN

The Automatic Author Identification and Characterization (AAIC) framework architecture for the IRC environment is shown in Figure 1. The AAIC components can be outlined as follows: (a) data collection and pre-processing; (b) feature extraction; (c) learning unit; (d) author identification.

Figure 1. The architecture of automatic author identification and characterization framework

For IRC channel monitoring and conversation logging, we created an autonomic IRC bot that provides robust continuous monitoring, comprehensive information collection, and pre-processing in real time. Using these capabilities, the IRC bot monitors the channels and extracts structured data for analysis in the following format: Username + Chat message content + Time. The architecture of the autonomic IRC monitoring bot is shown in Figure 2.
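The record format above can be illustrated with a minimal sketch. It assumes the bot receives raw lines in the standard IRC relay form `:nick!user@host PRIVMSG <target> :<text>`; the names `ChatRecord` and `parse_privmsg` are hypothetical and are not taken from the AAIC implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ChatRecord:
    """One logged message: Username + Chat message content + Time."""
    username: str
    message: str
    time: datetime


def parse_privmsg(raw_line: str, received_at: datetime) -> Optional[ChatRecord]:
    """Turn one raw IRC line into a record, or None if it is not a PRIVMSG."""
    # Relayed messages carry a prefix: ":nick!user@host PRIVMSG #chan :text"
    if not raw_line.startswith(":"):
        return None
    prefix, _, rest = raw_line[1:].partition(" ")
    command, _, params = rest.partition(" ")
    if command != "PRIVMSG":
        return None
    _target, _, text = params.partition(" :")
    username = prefix.split("!", 1)[0]
    return ChatRecord(username, text.rstrip("\r\n"), received_at)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(parse_privmsg(":alice!a@host PRIVMSG #anonops :hello world\r\n", now))
```

The timestamp is taken at receipt because plain IRC messages do not carry one; a real logger would likely also record the channel name.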

We have observed that IRC channel monitoring and logging pose multiple challenges. First of all, in some critical channels, if a user is identified as a non-contributing user or as a bot, the channel operators (i.e., administrators) block the user and even the IP. Hence, to prevent the bot from being identified by IRC channel administrators, we integrated a conversation module that can provide basic answers to questions. For this purpose, we leveraged IBM Watson Assistant [17] to add a natural language interface and automate the interactions with users in the monitored channels. The procedure for building the conversation capability is as follows:

1) We create a workspace, which is a container in IBM Cloud for the artifacts that define the conversation flow.

2) We transform the IRC messages collected from the monitored IRC channel into the Intent, Entity, and Dialog content of the workspace to train the conversation capability. An Intent is a purpose expressed in an IRC user’s input messages, such as a topic about cybercrime or anonymous activity. An Entity represents a term, an object, or a data type that is relevant to the IRC user’s intents. By identifying the entity mentioned in an IRC user’s input, Watson Assistant can apply a specific context to an intent. For example, an entity may represent a hacking tool with which the IRC user intends to launch a cyber-attack, or a math calculation or time inquiry question posed to detect bots. To train Watson Assistant to identify IRC users’ entities, we list the possible values for the entities that IRC users may mention. The Dialog is a branching conversation flow that defines the response of the conversation module when it identifies the defined intents and entities. We provide responses by analyzing our collected IRC chat messages based on the intents and entities that we recognize in the input chat messages. As we provide this information, Watson Assistant uses the IRC chat messages to create a machine learning model for understanding IRC messages. Through retraining, we ensure that our chat module keeps the latest models for handling conversation in the monitored IRC channel when new chat data is introduced.

3) After creating the model for IRC conversation, we connect Watson Assistant to the conversation module through the Watson Assistant API. The conversation module also includes a response trigger that prompts Watson Assistant to respond to group chat in the channel, one-to-one chat in the channel, and private chat via the Direct Client-to-Client (DCC) protocol.
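As a rough illustration of the response trigger, the sketch below sorts an incoming message into the three chat types named above. It is an assumption about how such a trigger might decide, not the authors' logic: the function name `classify_chat` is hypothetical, private chat is approximated by a message whose target is the bot's own nickname (a real DCC session is a separate direct connection and is not modeled), and no Watson Assistant call is made.

```python
def classify_chat(target: str, text: str, bot_nick: str) -> str:
    """Return 'private', 'one-to-one', or 'group' for an incoming message.

    target   -- the PRIVMSG target (a channel name or a nickname)
    text     -- the message body
    bot_nick -- the nickname the monitoring bot is currently using
    """
    # Channel names start with '#' or '&'; anything else is a nickname,
    # so the message was sent to the bot directly.
    if not target.startswith(("#", "&")):
        return "private"
    lowered = text.lower()
    nick = bot_nick.lower()
    # "nick: ..." or "nick, ..." is the usual way to address one user in a channel.
    if lowered.startswith((nick + ":", nick + ",")):
        return "one-to-one"
    return "group"


if __name__ == "__main__":
    print(classify_chat("#anonops", "watcher: what time is it?", "watcher"))
```

A trigger built this way could, for instance, always answer private and one-to-one messages and answer group chat only occasionally; that policy is a design choice the source does not specify.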

Cyber-criminals may also create a temporary channel where they can discuss their plans or invite other users to work together. For example, on December 6, 2010, users on the anonops server suddenly started using a temporary channel called #operationpayback, which had been quiet for months [19]. In this channel, the cyber attackers discussed their motivation and plans and then launched DDoS attacks against the websites of the Swedish Prosecution Authority, EveryDNS, Senator Joseph Lieberman, and others, causing these websites to experience downtime [19][26]. Therefore, to track such activities, we developed a self-replication module that allows a parent bot to generate a new (child) bot that inherits all the capabilities for continuous autonomic monitoring. By trusting all self-signed certificates, the bot is able to join hacker servers and channels that enforce TLS/SSL access to their network.
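The two mechanisms in this paragraph can be sketched as follows, under stated assumptions. `BotConfig` and `spawn_child` are hypothetical names illustrating how a child bot might inherit its parent's settings; the TLS part uses Python's standard `ssl` module to build a context that accepts any certificate, including self-signed ones. Disabling verification removes protection against man-in-the-middle attacks, so this is only appropriate for a monitoring bot that must connect regardless.

```python
import ssl
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BotConfig:
    """Hypothetical settings a monitoring bot carries."""
    server: str
    channel: str
    nickname: str
    use_tls: bool = True


def spawn_child(parent: BotConfig, new_channel: str, new_nickname: str) -> BotConfig:
    """Create a child configuration that inherits everything except channel and nick."""
    return replace(parent, channel=new_channel, nickname=new_nickname)


def make_permissive_tls_context() -> ssl.SSLContext:
    """Build a TLS context that trusts all certificates, including self-signed ones."""
    context = ssl.create_default_context()
    # Order matters: hostname checking must be turned off before
    # certificate verification can be disabled.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


if __name__ == "__main__":
    parent = BotConfig("irc.anonops.com", "#anonops", "watcher")
    child = spawn_child(parent, "#operationpayback", "watcher2")
    print(child, make_permissive_tls_context().verify_mode)
```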

Figure 2. The architecture of autonomic IRC monitoring bot

A. Feature Extraction

In the monitored IRC channel, users send messages that represent social communication. These messages can be measured and used to characterize personality, and personality characteristics differ from individual to individual. Personality influences most of a user's activities and behaviors in the IRC channel, including something as natural as the way the user converses and interacts with others. Moreover, personality also influences the way IRC users make decisions, including the choice of cyberattack type and hacking products, attack and crime motivation, organization of hacking activities, development of malicious tools, and so on.

Using IBM’s Personality Insights service as explained in Section II, we analyze individual authors’ IRC messages and infer their intrinsic personality characteristics to create their personality profiles. Our author identification approach can also work across languages, as the IBM Personality Insights service supports multiple languages (e.g., English, Japanese, Korean, Arabic), which is important for international cybersecurity investigations [23][24].

It has been shown that a reliable personality profile can be created from 3,000 words [18]. If the given text has fewer than 600 words, the service still analyzes it, but the result is not guaranteed to provide sufficient personality information; 100 words is the minimum threshold.
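The word-count thresholds above suggest a simple pre-check before a user's messages are sent for analysis. The sketch below is illustrative only: the function name and the four labels are hypothetical, and the label for the 600 to 2,999 word range is an assumption, since the text only characterizes the 100, 600, and 3,000 word boundaries.

```python
from typing import Iterable, Tuple

MIN_WORDS = 100        # below this, the text is under the minimum threshold
WEAK_WORDS = 600       # below this, results are not guaranteed to be sufficient
RELIABLE_WORDS = 3000  # at or above this, a reliable profile can be created


def text_sufficiency(messages: Iterable[str]) -> Tuple[int, str]:
    """Count whitespace-separated words across messages and label the total."""
    words = sum(len(message.split()) for message in messages)
    if words < MIN_WORDS:
        return words, "below_minimum"
    if words < WEAK_WORDS:
        return words, "not_guaranteed"
    if words < RELIABLE_WORDS:
        return words, "intermediate"  # assumed label; the source does not name this range
    return words, "reliable"


if __name__ == "__main__":
    print(text_sufficiency(["hello world"] * 400))
```

With an average message length of 5.7 words, as reported for #anonops, a user would need roughly 3000 / 5.7 ≈ 526 messages to reach the 3,000-word level.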

Feature extraction starts with the collection of the IRC dataset and its pre-processing by our autonomic IRC bot, which separates the messages of individual authors. Next, we obtain individual user characteristics through the feature extraction unit, which passes an individual suspect’s IRC chat messages to the personality analysis module. By calling the Personality Insights service from IBM Cloud, the personality analysis module obtains an individual user’s personality profile in JSON format, containing normalized personality analysis results based on three models: Big Five, Needs, and Values. The Big Five model contains five primary dimensions: Agreeableness, Conscientiousness, Extraversion, Emotional range, and Openness. Each of these primary dimensions includes six facet features that further distinguish a user. The Needs model contains twelve need features, and the Values model includes five value features. We select all the facet features in each primary dimension of the Big Five model (but not the five top-level scores for openness, agreeableness, emotional range, conscientiousness, and extraversion, because they are high-level dimension features), all the features of the Needs model, and all the features of the Values model to represent the personality of the suspect candidate user, which yields 47 features in total (30 facets, 12 needs, and 5 values). A sunburst chart visualization of a user’s personality profile is shown in Figure 3.
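To make the feature selection concrete, the following sketch flattens a profile into a 47-value vector. The JSON layout is an assumption for illustration: it supposes a profile with top-level `personality`, `needs`, and `values` lists, where each Big Five entry holds its facets under `children` and every trait carries a normalized `percentile`. The function name is hypothetical, and the actual service response should be checked against IBM's documentation.

```python
from typing import Dict, List


def profile_to_features(profile: Dict) -> List[float]:
    """Flatten a personality profile into facet, need, and value scores."""
    features: List[float] = []
    # Keep only the six facets under each Big Five dimension; the
    # dimension-level score itself is deliberately skipped.
    for dimension in profile["personality"]:
        for facet in dimension["children"]:
            features.append(facet["percentile"])
    features.extend(need["percentile"] for need in profile["needs"])
    features.extend(value["percentile"] for value in profile["values"])
    return features


if __name__ == "__main__":
    # Synthetic profile with the expected shape: 5 x 6 facets, 12 needs, 5 values.
    demo = {
        "personality": [
            {"percentile": 0.9, "children": [{"percentile": 0.1 * k} for k in range(6)]}
            for _ in range(5)
        ],
        "needs": [{"percentile": 0.5} for _ in range(12)],
        "values": [{"percentile": 0.3} for _ in range(5)],
    }
    print(len(profile_to_features(demo)))
```

The count follows from the selection described above: 5 × 6 = 30 facets, plus 12 needs and 5 values, gives 47.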

Figure 3. Visualization of an IRC user’s personality features


B. Learning Unit

After creating the features that define each user (i.e., feature extraction), we apply classification methods to build the author identification model. To identify each author, we adopted machine learning algorithms such as k-Nearest Neighbor (k-NN) with different numbers of nearest neighbors and Support Vector Machine (SVM).

1) k-Nearest Neighbor

The k-Nearest Neighbor (k-NN) classifier identifies the author of a given text from a given set of candidate users, which can be viewed as a multi-class classification task. In this problem, the k-NN method classifies an unknown personality sample obtained from the Personality Insights service by assigning it to the author to whom the majority of its nearest neighbors belong. In k-NN analysis, we use the Euclidean distance as the measure of personality similarity. The distance between two data samples $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n)$ is calculated as $\sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $i$ is the index of the features, which are normalized personality characteristic values derived from IRC chat content.
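A minimal sketch of this classifier, using only the standard library, is shown below. The function names and the toy two-feature data are illustrative assumptions; real samples would be the 47-value personality vectors.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train: List[Tuple[Sequence[float], str]],
                query: Sequence[float], k: int = 3) -> str:
    """Return the majority author label among the k nearest training samples."""
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    # On a tie, Counter keeps insertion order, so the label that appeared
    # first among the nearest neighbors wins.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    samples = [([0.0, 0.0], "user_a"), ([0.0, 1.0], "user_a"),
               ([5.0, 5.0], "user_b"), ([6.0, 5.0], "user_b")]
    print(knn_predict(samples, [0.2, 0.1], k=3))
```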

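2) Support Vector Machine

The Support Vector Machine (SVM) algorithm is a machine learning approach that we leverage for authorship analysis. SVM can be defined as maximizing the margin between two classes given a dataset $(x_i, y_i)$, $i = 1, \dots, l$, $x_i \in \mathbb{R}^n$, $y_i \in \{+1, -1\}$, where $y_i$ is the class label and $x_i$ represents the feature vector. The label of an unknown data sample $x$ is determined by $y = \operatorname{sgn}(w^T \phi(x) + b)$, where $\phi(x)$ is a mapping to a higher-dimensional space that yields a nonlinear SVM and $w$ is the vector that SVM needs to optimize. The optimal hyperplane is obtained by solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i \left( w \cdot \phi(x_i) + b \right) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \dots, l \tag{1}$$

where $\xi_i$ is a slack variable and $C$ is a penalty factor. Its dual form is:

$$\arg\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C, \; i = 1, \dots, l \tag{2}$$

where $K(x_i, x_j)$ is a kernel function. We use the radial basis function (RBF) kernel in our problem.

The classification function is:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right) \tag{3}$$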
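To make the classification function concrete, the following Python sketch evaluates it with an RBF kernel of the common form K(x, z) = exp(-gamma * ||x - z||^2). The support vectors, multipliers, bias, and the value of gamma are hypothetical values chosen for illustration; the paper does not report its kernel parameters, and in practice these quantities come from solving the dual problem.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)


def svm_decide(support_vectors, alphas, labels, b, x, gamma=0.5):
    """Evaluate sgn(sum_i alpha_i * y_i * K(x_i, x) + b) for a binary SVM."""
    score = b + sum(
        a * y * rbf_kernel(sv, x, gamma)
        for sv, a, y in zip(support_vectors, alphas, labels)
    )
    return 1 if score >= 0 else -1


# Hypothetical trained model: one support vector per class.
# The multipliers satisfy the dual constraint sum_i alpha_i * y_i = 0.
svs = [[0.0, 0.0], [1.0, 1.0]]
alphas = [1.0, 1.0]
labels = [+1, -1]
b = 0.0

near_first = svm_decide(svs, alphas, labels, b, [0.1, 0.0])
near_second = svm_decide(svs, alphas, labels, b, [0.9, 1.0])
print(near_first, near_second)
```

Because the kernel value decays with distance, a query point is pulled toward the label of the support vectors it lies closest to.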

This optimization problem is a quadratic programming problem, which can be solved by a sequential minimal optimization type decomposition method [20]. The binary SVM can be extended to multi-class classification by combining several two-class SVM classifiers into a single multi-class classifier.

In our approach to author identification, we used LIBSVM [20] to implement the multi-class SVM. LIBSVM uses the one-against-one method for multi-class classification, which needs $N(N-1)/2$ classifiers for N-class classification [20]. Each classifier is trained on samples from the two corresponding classes. After all the classifiers are trained, a voting mechanism is used at test time: the unknown personality sample is assigned to the suspect with the most votes.
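The one-against-one voting scheme can be sketched as follows. This is an illustration under stated assumptions rather than LIBSVM's code: `pairwise_decide` is a hypothetical stand-in for a trained binary SVM, implemented here as a nearest-centroid rule on a single feature so that the example is self-contained.

```python
from collections import Counter
from itertools import combinations


def num_pairwise_classifiers(n_classes):
    """Number of binary classifiers needed by one-against-one: N(N-1)/2."""
    return n_classes * (n_classes - 1) // 2


def one_vs_one_predict(classes, pairwise_decide, x):
    """Let every pair of classes vote on x and return the class with most votes.

    pairwise_decide(a, b, x) must return either a or b.
    """
    votes = Counter(pairwise_decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


# Hypothetical stand-in for trained binary SVMs: choose the nearer 1-D centroid.
centroids = {"u1": 0.0, "u2": 5.0, "u3": 10.0}


def nearest_centroid_decide(a, b, x):
    return a if abs(x - centroids[a]) <= abs(x - centroids[b]) else b


winner = one_vs_one_predict(list(centroids), nearest_centroid_decide, 4.2)
print(winner, num_pairwise_classifiers(30))
```

For the 30-author experiments this scheme implies 30 × 29 / 2 = 435 binary classifiers, compared with 10 for the 5-author case.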

IV. EXPERIMENTS AND RESULTS

To evaluate the effectiveness of our approach, we designed several author identification tasks in various IRC channels monitored by our autonomic IRC bot technique (shown in Table I). We selected six active IRC channels for the demonstration.

• The #anonops channel is an international communication platform controlled by the Anonymous hacking organization.

• The #2600 channel is a highly active community with hacker magazines and monthly hacker meetings [1].

• The #computer channel is an active computer discussion channel located on the Underworld server, useful for understanding cybercrime and immoral deeds on the Internet.

• The #politics channel is another important Underworld channel whose topics relate to political warfare.

• The #security and #networking channels are two popular channels on the freenode server covering computer and network security topics.

Table I. TOTAL # OF MESSAGES OF THE MONITORED CHANNELS

Server Name          Channel Name    Total # of Messages    Collection Date Range
irc.anonops.com      #anonops        817,435                8/15/17 – 4/13/18
irc.2600.net         #2600           549,400                4/01/17 – 4/13/18
irc.underworld.no    #computer       186,458                9/13/17 – 4/13/18
irc.underworld.no    #politics       109,416                1/04/18 – 4/13/18
irc.freenode.net     ##security      243,273                4/13/17 – 4/13/18
irc.freenode.net     #networking     220,060                9/06/17 – 4/13/18

After unstructured IRC messages are extracted in real time and transformed into structured CSV data files, the feature extraction unit automatically analyzes each user's IRC messages and returns the personality profile in JSON format by communicating with the IBM Personality Insights service. For a successful personality analysis, the IBM Personality Insights service requires 3,000 words [18]. Hence, we obtain each personality sample in JSON format from 3,000 consecutive words written by the same user, without any overlap between samples, and discard any remaining text shorter than 3,000 words. The sample files of each author are stored in that author's own personality profile document.
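The sample construction step (non-overlapping 3,000-word segments, with the short remainder discarded) can be sketched as below. The function name and the whitespace-based word splitting are assumptions made for illustration; the paper does not specify how words are tokenized.

```python
def split_into_samples(messages, words_per_sample=3000):
    """Cut one user's messages into consecutive, non-overlapping word chunks.

    Words are split on whitespace (an assumption). Any trailing words that do
    not fill a complete chunk are discarded.
    """
    words = [word for message in messages for word in message.split()]
    n_full = len(words) // words_per_sample
    return [
        " ".join(words[i * words_per_sample:(i + 1) * words_per_sample])
        for i in range(n_full)
    ]


# Tiny demonstration with 3-word samples instead of 3,000-word samples.
messages = ["one two three", "four five", "six seven eight"]
samples = split_into_samples(messages, words_per_sample=3)
print(samples)
```

In this toy run the eight words yield two complete samples, and the last two words are dropped because they cannot fill a third one.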

As the number of authors increases, we face a class-imbalance problem in which the authors contribute unequal numbers of samples. To address this issue, we apply undersampling to our personality datasets, which is an efficient method for class-imbalance learning [21]. The undersampling method uses a subset of the samples from the majority class to train the classifier [21]. We undersample the majority class by randomly removing its samples until the minority class reaches a specified percentage of the majority class [28]. We have observed that the ratio of the biggest class to the smallest class is 4:1, and the ratio of the prevalent class to the small class is smaller than 4:1. The number of samples for each author in the different IRC channels is shown in Figure 4.
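Random undersampling of the majority classes can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name is hypothetical, and using 4:1 as the cap on the largest-to-smallest class ratio is an assumption taken from the ratio reported above.

```python
import random


def undersample(samples_by_author, max_ratio=4.0, seed=0):
    """Randomly drop samples so no author has more than max_ratio times
    the sample count of the smallest author."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    smallest = min(len(samples) for samples in samples_by_author.values())
    cap = int(max_ratio * smallest)
    balanced = {}
    for author, samples in samples_by_author.items():
        if len(samples) > cap:
            balanced[author] = rng.sample(samples, cap)  # random subset
        else:
            balanced[author] = list(samples)
    return balanced


# Hypothetical sample counts for three authors.
data = {"u1": list(range(100)), "u2": list(range(30)), "u3": list(range(10))}
balanced = undersample(data, max_ratio=4.0)
sizes = {author: len(samples) for author, samples in balanced.items()}
print(sizes)
```

Only the classes above the cap lose samples; smaller classes are left untouched, so no information is discarded from the authors who already have little data.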

Figure 4. Number of samples for each of the 30 most active users in the six channels.

Most authorship identification studies work on static websites (offline data) and a limited number of authors, typically between two and twenty. As the number of authors increases, the identification accuracy decreases significantly. For example, Zheng et al. [3] identified up to the 20 most active users posting in newsgroups, with a best accuracy of 83%. In contrast, we use streaming messages and up to the 30 most active authors. In Figure 5 (a-f), we show the author identification results of the k-NN and SVM algorithms and compare the effect of the number of authors. In these experiments, we used the top 5, 10, 20, and 30 authors from six different IRC channels. The activity degree is measured by the total number of messages sent by a user, not the total number of words or sentences. For each test, we trained our classifier model using k-NN (1-NN, 3-NN, and 5-NN) and SVM, respectively.

To evaluate the classifier performance, we used the accuracy measure that is normally adopted in author identification [3]. We calculate the accuracy for all classifiers using Leave-One-Out Cross-Validation (LOOCV) in order to train on as many samples as possible. LOOCV is k-fold cross-validation taken to its logical extreme, with the number of folds $k$ equal to $m$ (i.e., the number of data points in the set). LOOCV performs $m$ experiments on a dataset with $m$ examples. For each experiment, LOOCV uses $m - 1$ examples for training and the remaining example for testing [27].
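The LOOCV accuracy computation can be sketched as below. The 1-NN classifier and the two-feature toy dataset are hypothetical stand-ins chosen to keep the example self-contained; any classifier with the same call shape could be passed in.

```python
import math


def nearest_neighbor(train, query):
    """1-NN classifier: return the label of the closest training sample."""
    def distance(item):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(item[0], query)))
    return min(train, key=distance)[1]


def loocv_accuracy(dataset, classify):
    """Hold out each example once, train on the rest, and report accuracy."""
    correct = 0
    for i, (features, label) in enumerate(dataset):
        training = dataset[:i] + dataset[i + 1:]  # all examples except the i-th
        if classify(training, features) == label:
            correct += 1
    return correct / len(dataset)


# Hypothetical personality samples for two authors.
dataset = [
    ([0.10, 0.80], "alice"), ([0.12, 0.79], "alice"), ([0.14, 0.82], "alice"),
    ([0.90, 0.20], "bob"), ([0.88, 0.22], "bob"), ([0.86, 0.18], "bob"),
]
accuracy = loocv_accuracy(dataset, nearest_neighbor)
print(accuracy)
```

Each example is tested exactly once against a model that never saw it, which is why the procedure suits datasets where samples per author are scarce.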

In our experiments (as shown in Figure 5), we observe that the k-NN algorithm provides acceptable accuracy in detecting the authors, comparable to the study in [3]. We have observed that as the value of k decreases, the accuracy increases because the author groups are classified better. Furthermore, SVM achieves significantly higher accuracy than k-NN in the monitored channels. The accuracy also shows a downward trend as the number of authors increases. For example, SVM achieved 95.89% to 100% accuracy when the number of authors is 5. Given 10 authors, SVM achieved 99.34%, 99.18%, 94.17%, 93.73%, 93.33%, and 90.43% accuracy. When extending to 20 authors, SVM achieved accuracy ranging from 92.12% to 96.13% in #anonops, #2600, and #politics. The fact that #politics maintains good accuracy with very few samples suggests that, for politics-related topics such as political warfare, personality is easier to distinguish from individual to individual than for the traditional cybersecurity and cybercrime topics in the other channels. The SVM accuracy results for #computer, #security, and #networking are 86.45%, 81.68%, and 77.17% in the 20-author test, respectively. When extending to 30 authors, the accuracy results of #anonops and #politics are still higher than 90%, and the accuracy of #2600 decreases to 87.35%. For the #computer, #security, and #networking channels, the results decrease to 83.60%, 75.11%, and 75.22%, respectively.

The decrease in accuracy with the increasing number of authors is predictable: as the number of top users increases, the authors with fewer messages (i.e., the authors who do not frequently participate in the channel conversation) cannot provide sufficient information to effectively discriminate authorship. For the #networking and #security channels in particular, authorship detection suffered seriously from the lack of data for infrequent authors. However, the results for the more active channels suggest that collecting further data would increase the accuracy.

It appears that the personality features based on Personality Insights are very effective for author identification. While k-NN provides sufficient accuracy (compared to other studies), SVM outperforms it, with average improvements of 9.88%, 10.60%, and 10.73% over 1-NN, 3-NN, and 5-NN, respectively.

Figure 5. The author identification accuracy results of six monitored channels. (a) accuracy of #anonops. (b) accuracy of #2600. (c) accuracy of #politics. (d) accuracy of #computer. (e) accuracy of #networking. (f) accuracy of ##security.

V. CONCLUSION

The anonymity of Internet services, especially in social media, provides freedom to users. On the other hand, it can be exploited for underground cybercriminal activities. It is highly desirable to be able to identify anonymous cybercriminals and individuals spreading malicious software tools. To address this cybersecurity challenge, we presented in this paper an autonomic, personality-analysis-based author identification approach for the Internet Relay Chat (IRC) environment. Compared to previously applied techniques that focus on stylometric measures and deep learning, our approach builds on the fact that each author's personality characteristics leave footprints in the text. By using IBM Watson Personality Insights, we were able to extract this information and apply classification techniques to identify individual authors. Using IRC chat logs collected by our autonomic IRC bots in various cybersecurity and underground channels, as well as general channels (computer and politics), we have demonstrated that the personality-based solution can identify authors effectively. We observed 92%-96% identification accuracy for 20 authors when sufficient chat messages are available.

ACKNOWLEDGMENT

This work is partly supported by the Air Force Office of Scientific Research (AFOSR) Dynamic Data-Driven Application Systems (DDDAS) award number FA9550-18-1-0427, National Science Foundation (NSF) research projects NSF-1624668 and SES-1314631, and Thomson Reuters in the framework of the Partner University Fund (PUF) project (PUF is a program of the French Embassy in the United States and the FACE Foundation and is supported by American donors and the French government).

REFERENCES

[1] S. Shao, C. Tunc, P. Satam, and S. Hariri, “Real-Time IRC Threat Detection Framework,” In Foundations and Applications of Self* Systems (FAS*W), 2017 IEEE 2nd International Workshops on, pp. 318-323.

[2] J. Yu, C. Tunc, and S. Hariri, “Automated Framework for Scalable Collection and Intelligent Analytics of Hacker IRC Information,” In Cloud and Autonomic Computing (ICCAC), 2016 International Conference on, pp. 33-39.

[3] R. Zheng, J. Li, H. Chen, and Z. Huang, “A framework for authorship identification of online messages: Writing-style features and classification techniques,” Journal of the Association for Information Science and Technology 57, no. 3 (2006): 378-393.

[4] J. Savoy, “Authorship attribution based on specific vocabulary,” ACM Transactions on Information Systems (TOIS) 30, no. 2 (2012): 12.

[5] S. Segarra, M. Eisen, and A. Ribeiro, “Authorship attribution through function word adjacency networks,” IEEE Transactions on Signal Processing 63, no. 20 (2015): 5464-5478.

[6] S. Phani, S. Lahiri, and A. Biswas, “A machine learning approach for authorship attribution for Bengali blogs,” In Asian Language Processing (IALP), 2016 International Conference on, pp. 271-274.

[7] J. Ma, B. Xue, and M. Zhang, “A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages,” In Intelligence and Security Informatics, 2016 Springer Pacific-Asia Workshop on, pp. 33-52.

[8] S. R. Pillay, and T. Solorio, “Authorship attribution of web forum posts,” In eCrime Researchers Summit (eCrime), 2010, pp. 1-7.

[9] V. Benjamin, B. Zhang, J. F. Nunamaker Jr, and H. Chen, “Examining hacker participation length in cybercriminal Internet-relay-chat communities,” Journal of Management Information Systems 33, no. 2 (2016): 482-510.

[10] V. Benjamin, and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint, pp. 25-32.

[11] A. Narayanan, H. Paskov, Z. Gong, J. Bethencourt, E. Stefanov, E. R. Shin, and D. Song, “On the feasibility of internet-scale author identification,” In Security and Privacy (SP), 2012 IEEE Symposium on, pp. 300-314.

[12] E. Cooke, F. Jahanian, and D. McPherson, “The Zombie Roundup: Understanding, Detecting, and Disrupting Botnets,” SRUTI 5 (2005): 6-6.

[13] V. Benjamin, W. Li, T. Holt, and H. Chen, “Exploring threats and vulnerabilities in hacker web: Forums, IRC and carding shops,” In Intelligence and Security Informatics (ISI), 2015 IEEE International Conference on, pp. 85-90.

[14] “IBM Watson Platform service,” [Online] URL: https://console.bluemix.net/developer/watson/dashboard, Accessed: December 2017

[15] “IBM Watson SDKs,” [Online] URL: https://console.bluemix.net/developer/watson/sdks-and-tools, Accessed: December 2017

[16] “AI Everywhere with IBM Watson and Apple Core ML,” [Online] URL: https://www.ibm.com/blogs/watson/2018/03/ai-everywhere-ibm-watson-apple-core-ml/, Accessed: March 2018

[17] “IBM Watson Assistant service,” [Online] URL: https://www.ibm.com/watson/services/conversation/, Accessed: December 2017

[18] “IBM Watson Personality Insights service,” [Online] URL: https://console.bluemix.net/docs/services/personality-insights, Accessed: December 2017

[19] P. Olson. We Are Anonymous: Inside the Hacker World of LulzSec, Anonymous, and the Global Cyber Insurgency. Back Bay Books, 2013

[20] C. Chang, and C. Lin, “LIBSVM: a library for support vector machines,” ACM transactions on intelligent systems and technology (TIST) 2, no. 3 (2011): 27.

[21] X. Liu, J. Wu, and Z. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, no. 2 (2009): 539-550.

[22] V. Benjamin, and H. Chen, “Securing cyberspace: Identifying key actors in hacker communities,” In Intelligence and Security Informatics (ISI), 2012 IEEE International Conference on, pp. 24-29.

[23] V. Benjamin, and H. Chen, “Identifying language groups within multilingual cybercriminal forums,” In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on, pp. 205-207.

[24] Z. Fang, X. Zhao, Q. Wei, G. Chen, Y. Zhang, C. Xing, W. Li, and H. Chen, “Exploring key hackers and cybersecurity threats in Chinese hacker communities,” In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on, pp. 13-18.

[25] J. Radianti, “A study of a social behavior inside the online black markets,” In Emerging Security Information Systems and Technologies (SECURWARE), 2010 Fourth International Conference on, pp. 189-194.

[26] M. Sauter. The coming swarm: DDOS actions, hacktivism, and civil disobedience on the Internet. Bloomsbury Publishing USA. 2014

[27] P. Refaeilzadeh, L. Tang, and H. Liu, “Cross-validation,” In Encyclopedia of Database Systems, L. Liu and M. T. Özsu (Eds.). Springer (2009).

[28] N. V. Chawla, “C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure,” In Machine Learning, 2003 International Conference on, vol. 3, p. 66.

[29] “Myspace Mom” [Online] http://www.foxnews.com/story/2007/12/06/myspace-mom-linked-to-missouri-teen-suicide-being-cyber-bullied-herself.html, Accessed: Feb 2018

[30] N. Cheng, R. Chandramouli, and K. P. Subbalakshmi. “Author gender identification from text.” Digital Investigation 8, no. 1 (2011): 78-88.

[31] “Mastermind behind sophisticated, massive botnet outs himself,” [Online] URL: https://arstechnica.com/tech-policy/2017/12/mastermind-behind-massive-botnet-tracked-down-by-sloppy-opsec/, Accessed: Feb 2018

[32] J. Peng, K. R. Choo, and H. Ashman. “Bit-level n-gram based forensic authorship analysis on social media: Identifying individuals from linguistic profiles,” Journal of Network and Computer Applications 70 (2016): 171-182.

[33] E. Stamatatos. “Author identification: Using text sampling to handle the class imbalance problem.” Information Processing & Management 44, no. 2 (2008): 790-799.

[34] S. Hill, and F. Provost. “The myth of the double-blind review?: author identification using only citations,” Acm Sigkdd Explorations Newsletter 5, no. 2 (2003): 179-184.

[35] S. Nirkhi, R. V. Dharaskar, and V. M. Thakare. “Authorship Verification of Online Messages for Forensic Investigation,” Procedia Computer Science 78 (2016): 640-645.

[36] W. Wu, J. Zhou, Y. Xiang, and L. Xu, “How to achieve non-repudiation of origin with privacy protection in cloud computing,” J. Comput. Syst. Sci., vol. 79, no. 8 (2013): 1200-1213.

[37] D. A. Cobb-Clark, and S. Schurer. “The stability of big-five personality traits.” Economics Letters 115, no. 1 (2012): 11-15.

[38] R. Layton, S. McCombie, and P. Watters. “Authorship attribution of irc messages using inverse author frequency,” In Cybercrime and Trustworthy Computing Workshop (CTC), 2012 Third, pp. 7-13.

[39] G. Inches, M. Harvey, and F. Crestani, “Finding participants in a chat: Authorship attribution for conversational documents,” In Social Computing (SocialCom), 2013 International Conference on, pp. 272-279.

[40] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” In Empirical Methods in Natural Language Processing (EMNLP), 2014 Conference on, pp. 1532-1543.