
Toward Finding Malicious Cyber Discussions in Social Media∗

Richard P. Lippmann, David J. Weller-Fahy, Alyssa C. Mensch, William M. Campbell, Joseph P. Campbell, William W. Streilein, Kevin M. Carter

MIT Lincoln Laboratory, 244 Wood Street, Lexington, MA 02420
{lippmann,djwf,alyssa.mensch,wcampbell,jpc,wws}@ll.mit.edu, [email protected]

Abstract

Security analysts gather essential information about cyber attacks, exploits, vulnerabilities, and victims by manually searching social media sites. This effort can be dramatically reduced using natural language machine learning techniques. Using a new English text corpus containing more than 250K discussions from Stack Exchange, Reddit, and Twitter on cyber and non-cyber topics, we demonstrate the ability to detect more than 90% of the cyber discussions with fewer than 1% false alarms. If an original searched document corpus includes only 5% cyber documents, then our processing provides an enriched corpus for analysts where 83% to 95% of the documents are on cyber topics. Good performance was obtained using term frequency (TF) – inverse document frequency (IDF) (TF–IDF) features and either logistic regression or linear support vector machine (SVM) classifiers. A classifier trained using prior historical data accurately detected 86% of emergent Heartbleed discussions, and retrospective experiments demonstrate that classifier performance remains stable up to a year without retraining.

Introduction

Four essential tasks must be performed to secure any enterprise computer network. System administrators must (1) anticipate future attacks, (2) defend against attacks that can be prevented, (3) observe their own network to detect evidence of attempted and successful attacks, and (4) recover from successful attacks. Improving the ability to anticipate attacks improves defenses, targeting of observations, and the ability to recover from attacks. Anticipation can take advantage of social media conversations by criminal hackers and security researchers who discuss cyber attacks, exploits, vulnerabilities, strategies, tools, and actual and potential victims on the Internet. Cyber analysts currently manually examine the Internet to find these discussions.

∗DISTRIBUTION STATEMENT A. Approved for public release: distribution unlimited. This material is based upon work supported by the Department of Defense under Air Force Contract No. FA8721-05-C-0002 and/or FA8702-15-D-0001. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Department of Defense.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Manual searches often use metadata (e.g., account names, thread or discussion topics, sources and destinations of social media discussions). This process is labor intensive and sometimes ineffective because attackers can easily hide malicious conversations in discussions on non-cyber topics (e.g., music or astronomy) and security researchers can post information using news feeds or forums that contain only a small percentage of cyber content. In this paper, we present an efficient and effective application that directly mines discussion text using machine learning and human language technology (HLT) approaches.

Related Work

This work expands upon and provides more detail about previous experiments (Lippmann et al. 2016), and provides further information about current directions in our research. There has been substantial research on human language technology (HLT) (Jurafsky and Martin 2008) and cybersecurity (Anderson 2008), but only a few researchers are beginning to apply HLT to the cybersecurity domain. Recent research (Sabottke, Suciu, and Dumitras 2015) demonstrated that finding keywords related to exploits and vulnerabilities across more than 250K Twitter tweets provides some useful evidence that a vulnerability listed in the National Vulnerability Database (NVD) (https://nvd.nist.gov) will actually be exploited. In another recent paper, interactions between known cyber criminals on social media were analyzed to distinguish between discussions where cyber-attack tools are bought or sold and discussions where cyber criminals share tools or information without any monetary exchange (Lau, Xia, and Ye 2014). The authors do not identify cyber discussions, but require manual extraction of cyber discussions before analysis. Two recent papers focus on the problem of extracting relational information concerning vulnerabilities from text (Jones et al. 2015; Syed et al. 2016). Example relations might be that “Adobe is vendor of Acrobat” or “CVE-2002-2435 is a vulnerability in Acrobat”. The exploratory study by Jones et al. only extracted vulnerability relations from 62 news articles, while Syed et al. focused on well-structured NVD vulnerability descriptions but mapped extracted entities to corresponding entities in the large DBpedia structured database (Auer et al. 2007). Such an extended database could be used to support queries concerning vulnerabilities and assist in extracting vulnerability information from other unstructured text. Syed et al. extended earlier work by Mulwad et al. (2011) in which entities such as software products, vulnerabilities, and organizations were extracted from a small set of 107 NVD vulnerability text descriptions. The conceptual approach described by Mulwad et al. includes a classifier to identify documents that describe vulnerabilities, but the simple text classifier described was developed and tested using only 155 example documents, including only 75 NVD text descriptions as examples of documents describing vulnerabilities.



Corpus         | Cyber Topics | Non-cyber Topics | Number of Documents | Median Number of Words | Time Covered                      | Document Labeling Method
Stack Exchange | 5            | 10               | ~200K               | 245                    | Years                             | Community topic and tags
Reddit         | 10           | 51               | ~59K                | 152                    | Months (non-cyber), Years (cyber) | Sub-Reddit topic
Twitter        | 127          | 500              | 627                 | 546                    | Months                            | Expert cyber users' tweets

Table 1: Social Media Corpora Document Labeling

Classifier Development

Our major goal is to develop text classifiers that analyze the large number of discussions in online forums and discover the few that are most likely to be concerned with attack methods, exploits, vulnerabilities, strategies, attack tools, defenses, and actual and potential victims that are of interest to cyber analysts. There are two potential use cases for a filter that detects malicious cyber topics. First, already discovered Internet content, such as discussions under specified Stack Exchange (http://www.stackexchange.com) or Reddit (http://www.reddit.com) topics, as well as lists of users in Twitter (http://www.twitter.com), can be processed to rank the content by relevance. This ranking is necessary because discussions often drift off-topic and may move to other forums. Second, new Internet forums (e.g., Twitter hashtag threads) can be discovered that contain cyber discussions of interest. This scenario is more difficult because the search can extend across many types of social media and forums. It would benefit from a classifier that works well across different forum and language types.

Training and Testing Corpora

Training requires both cyber and non-cyber social media discussions. Cyber discussions should be representative of those that are of most interest to analysts. Non-cyber training examples should contain discussions that cover many non-cyber topics to provide good performance during use, when any of a wide range of non-cyber topics might be encountered. After a classifier is trained, it can be fed input text from a social media discussion and provide as output the probability that the discussion is on a cyber topic. An output probability supports both use cases described above. Conversations in forums of interest can be ranked by probability, or many new forums can be scanned to identify those with the greatest number of probable cyber conversations.

We trained and tested our classifiers using text discussions from three social media forums: Stack Exchange, Reddit, and Twitter. Stack Exchange is a well-moderated question-and-answer network with communities dedicated to many diverse topics. Questions and answers can be quite comprehensive, long, and well written. Reddit is a minimally moderated variety of forums with main topics called sub-Reddits and many individual threads or discussions under each topic. Twitter data consist of short text messages (tweets) with at most 140 characters each. Tweets can be followed via user names and hashtags that identify tweets on a similar topic, or Twitter lists (i.e., curated groups of Twitter users). These three corpora were selected because they contain text with at least some cyber content, span a range of social media types, and offer a history of posts over a reasonable time span.

Table 1 shows the number of cyber and non-cyber topics in each corpus, the number of documents and median number of words in documents, the time period covered by the collection, and a summary of how documents were labeled as cyber or non-cyber. Documents were created using all posts concerning discussions on a specific question for Stack Exchange, all posts for a specific sub-Reddit thread in Reddit, and collected tweets from users with a minimum of 20 and a maximum of 300 tweets. Preprocessing eliminated side information such as the date, hashtag, and user name that might not be available for future pure text searches. Documents for Stack Exchange and Reddit were labeled using topic titles and tags already provided in each corpus. For Reddit, all posts under cyber-related topics (e.g., reverse engineering, security, malware, blackhat) were labeled as cyber and posts on non-cyber topics (e.g., astronomy, electronics, beer, biology, music, movies, fitness) were labeled as non-cyber. For Stack Exchange we also used lower-level tags such as penetration-test, buffer-overflow, denial-of-service, and heartbleed to further identify cyber discussions. For Twitter, 127 cyber experts were identified and tweets from those experts were labeled cyber while tweets from 500 other randomly selected users were labeled non-cyber. The disparity between the time covered for the cyber and non-cyber Reddit posts is a function of the limits imposed by the Reddit API: the most recent 999 posts in each sub-Reddit were collected, and the non-cyber posts occurred at a much higher rate than the cyber posts.
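For illustration, per-user Twitter documents under the constraints above could be assembled as in the following minimal sketch; the (user, text) tweet record format is an assumption, and only the 20-tweet minimum and 300-tweet maximum come from the description above.

# Sketch of per-user Twitter document assembly, assuming each tweet is a
# (user, text) pair; the field names are hypothetical.
from collections import defaultdict

def build_user_documents(tweets, min_tweets=20, max_tweets=300):
    """Concatenate each user's tweets into one document, keeping only users
    with at least min_tweets and truncating at max_tweets; metadata such as
    dates, hashtags, and user names is deliberately excluded."""
    by_user = defaultdict(list)
    for user, text in tweets:
        by_user[user].append(text)
    return {user: " ".join(texts[:max_tweets])
            for user, texts in by_user.items() if len(texts) >= min_tweets}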

Reference Keyword Detector

As a baseline reference against which to compare more advanced classifiers, we used a keyword-based approach that had been developed by security analysts to detect cyber discussions. This approach searches for 200 keywords and key phrases in documents and counts the number of occurrences. Higher counts indicate documents that are more likely about cyber topics. Examples of keywords include “kit,” “Infected,” and “checksum.” Examples of phrases are “buffer overflow,” “privilege escalation,” and “Distributed Denial of Service.” This reference keyword detector was similar to the approach used by Sabottke, Suciu, and Dumitras (2015) to detect Twitter tweets discussing vulnerabilities, except that security analysts selected keywords by hand based on past experience.
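The detector amounts to a simple term-counting score. A minimal sketch in Python follows; the actual 200 analyst-selected keywords and phrases are not listed in this paper, so the handful of terms here is illustrative only, and the decision threshold is a tunable assumption.

# Minimal sketch of a keyword-counting baseline detector. The short term list
# below is illustrative; the paper's 200 analyst-chosen terms are not public.
import re

CYBER_TERMS = ["checksum", "infected", "buffer overflow",
               "privilege escalation", "distributed denial of service"]

def keyword_score(document: str) -> int:
    """Count occurrences of all keywords/phrases; higher scores suggest cyber content."""
    text = document.lower()
    return sum(len(re.findall(re.escape(term), text)) for term in CYBER_TERMS)

# A document is flagged as cyber when its count exceeds a tunable threshold.
print(keyword_score("The exploit kit caused a buffer overflow on the infected host."))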

Processing and Classification

Our classification pipeline requires preprocessing each document, generating an input feature for each distinct word in a document, and training a classifier to distinguish between cyber and non-cyber documents on the basis of these generated features. Preprocessing employs stemming to normalize word endings and text normalization techniques, such as the removal of words containing numbers and the replacement of Uniform Resource Locators (URLs) with a special token indicating a URL, to ensure that the feature inputs are standardized. We used term frequency (TF) – inverse document frequency (IDF) (TF–IDF) features, created by counting the occurrences of words in documents and normalizing by the number of documents in which the words occur (e.g., Jurafsky and Martin, 2008). In our research and in the human language technology (HLT) community's research in general, TF–IDF features have provided good performance when used in text classification. For testing, the inverse document frequency counts were set as calculated during training. Most of the experiments used single word counts to create features. Using N-grams and feature selection provided only small differences in performance. Much like Wang and Manning (2012), we found that logistic regression classifiers (e.g., Hastie, Tibshirani, and Friedman, 2009) and support vector machine (SVM) classifiers (e.g., Cortes and Vapnik, 1995) provided similar performance. Both classifiers require little computation to analyze a document and provide an output proportional to the probability that an input is a cyber document. We also experimented with a number of different types of features and classifiers, including latent Dirichlet allocation (LDA) with 20 topics as features, a random forest classifier, and both word2vec and paragraph2vec embeddings, but none of the features and classifiers we tried improved performance. In the remainder of this paper, we describe results obtained using an L2-regularized logistic regression classifier for the Stack Exchange and Reddit corpora and a linear SVM classifier for the smaller Twitter corpus.
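As an illustration of the pipeline just described, the sketch below builds unigram TF–IDF features and an L2-regularized logistic regression classifier with scikit-learn. The paper does not specify its exact tokenizer, stemmer, or hyperparameters, so the normalization function and default regularization strength here are assumptions (stemming is omitted for brevity); note that the vectorizer's IDF weights are fixed at training time and reused at test time, as described above.

# Sketch of a TF-IDF + L2-regularized logistic regression pipeline using
# scikit-learn; the preprocessing choices here are illustrative, not the paper's.
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def normalize(text: str) -> str:
    """Replace URLs with a special token and drop tokens containing digits."""
    text = re.sub(r"https?://\S+", " URLTOKEN ", text.lower())
    return " ".join(tok for tok in text.split()
                    if not any(ch.isdigit() for ch in tok))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=normalize)),        # unigram TF-IDF features
    ("clf", LogisticRegression(penalty="l2", max_iter=1000)),  # L2-regularized LR
])

# train_texts: list of document strings; train_labels: 1 = cyber, 0 = non-cyber
# pipeline.fit(train_texts, train_labels)
# p_cyber = pipeline.predict_proba(test_texts)[:, 1]  # probability a document is cyber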

Figure 1: The DET curves indicate that the classifiers perform well for all three social media corpora, missing fewer than 10% of input cyber documents at 1% false alarms.

Results

All results shown were obtained using 10-fold cross-validation training/testing performed separately on each corpus. Figure 1 shows results for logistic regression classifiers on Stack Exchange, Twitter, and Reddit. Results for Stack Exchange are for all documents, results for Twitter are for documents for users with more than 100 tweets, and results for Reddit are for documents containing more than 200 words. Each classifier outputs the probability that a document discusses cyber topics. This probability can be used to label the document as cyber or non-cyber based on a set threshold (the minimum probability required for the classifier to label a document as cyber). The document labels then make it possible to determine the number of false alarms (i.e., non-cyber documents that are classified as cyber) and misses (i.e., cyber documents that are classified as non-cyber) for each threshold. We present our results in the form of detection error tradeoff (DET) curves that show false alarm and miss probabilities, plotted on normal deviate scales, as the threshold on the classifier's output probability varies (Martin et al. 1997). These curves make it possible to determine the output percentage of cyber documents for any input percentage of cyber documents and any threshold setting.
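For reference, the miss and false alarm probabilities plotted in a DET curve can be obtained by sweeping a threshold over the classifier's output probabilities; the evaluation code used for the paper is not given, so the following is only a minimal sketch.

# Sketch of miss / false alarm computation for a DET-style threshold sweep.
import numpy as np

def det_points(scores, labels, thresholds):
    """scores: classifier cyber probabilities; labels: 1 = cyber, 0 = non-cyber.
    Returns (miss probability, false alarm probability) at each threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in thresholds:
        predicted_cyber = scores >= t
        miss = np.mean(~predicted_cyber[labels == 1])         # cyber labeled non-cyber
        false_alarm = np.mean(predicted_cyber[labels == 0])   # non-cyber labeled cyber
        points.append((miss, false_alarm))
    return points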

Good performance in this figure is indicated by curves that are lower and closer to the bottom left. As shown by the gray box in Figure 1, a false alarm rate below 1% and a miss rate below 10% was selected as our performance target after discussions with analysts. This provides a high concentration of cyber documents in the filtered documents sent to analysts. If we assume that the initial set of documents to be classified has been pre-selected to contain 5% (5/100) cyber documents, then a classifier with lower than 1% false alarms and lower than 10% misses will result in 83% (83/100) cyber documents in the filtered stream of documents classified as cyber that are presented to an analyst. This means that with this classifier an analyst has to examine roughly 16.6 times fewer documents to find the same number of cyber documents. In commonly used information retrieval terminology, we require recall (the percentage of cyber documents labeled “cyber” by the classifier) above 90% and precision (the percentage of cyber documents among all those labeled “cyber” by the classifier) above 83% for an input stream of documents containing only 5% cyber documents.
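For clarity, the 83% figure follows directly from the stated operating point (90% detection at 1% false alarms on an input stream that is 5% cyber); a worked calculation of the precision and the corresponding enrichment factor is

\[
\text{precision} \;=\; \frac{0.05 \times 0.90}{0.05 \times 0.90 \,+\, 0.95 \times 0.01} \;=\; \frac{0.045}{0.0545} \;\approx\; 0.83,
\qquad
\frac{0.83}{0.05} \;\approx\; 16.6 .
\]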

The curves shown in Figure 1 indicate that the classifiers we developed for each social media corpus do meet the performance target. They miss less than 10% of cyber documents and classify less than 1% of the non-cyber documents as cyber. Before obtaining these results, we explored the effect of varying the minimum number of words in each document, the amount of training data used, and the types of text preprocessing. We also explored feature selection, using n-grams, and alternative classifiers. None of the exploratory results were substantially better than those shown here.

Although we focus on reaching the above performance target, there are times when analysts only have time to examine the few documents the classifier rates as having the highest probability of being cyber. Performance in this case can also be determined from the DET curves shown. The simplest measure would be the recall (one minus the vertical miss probability) at the left of the curves, where the false alarm probability is set as determined by the number of documents examined. In most cases, performance at the left of a curve correlates with performance in the target region, so our discussion focuses on reaching the targeted gray box region in the rest of this paper.

Comparative Analysis of Classifiers

Figure 2 compares the performance of the baseline keyword classifier to the logistic regression classifier using Stack Exchange data. The logistic regression classifier (solid line) passes through the performance target region. The baseline keyword system (dashed line) performs substantially worse. At a false alarm probability of 10%, the system fails to detect roughly 40% of the cyber documents; at a false alarm probability of 1%, the miss probability is roughly 60%. To determine the cause of this poor performance, we examined the Stack Exchange documents that corresponded with the false alarms. False alarms were often caused by one or more occurrences of cyber keywords in documents with topics unrelated to cyber. For example, the keyword “infected” appeared in documents referring to bacterial infection. Similarly, the keyword “checksum” appeared in many documents on technical topics. Simply counting occurrences of keywords without considering the context of the documents led to the false alarms. Worst-case performance, shown by the chance-guessing curve in the upper right (dashed-dotted line), is obtained by randomly labeling each document as being cyber with a probability that ranges from zero to one.

Top 50 Cyber Terms:
password   secur      attack    user     server
vulner     exploit    hash      malwar   encrypt
file       ip         network   access   ?
:          site       phone     window   key
vir        infect     sql       scan     client
web        inject     websit    email    ’
authent    instal     <url>     hack     log
test       question   databas   salt     servic
browser    system     code      applic   tool
http       app        ‘         login    @

Top 50 Non-Cyber Terms:
$          arduino       power    voltag   god
tax        circuit       signal   movi     board
play       rate          credit   film     led
hi         pin           music    wire     frequenc
christian  stock         output   clock    serial
note       design        sensor   guitar   current
}          resistor      motor    chip     high
logic      microcontrol  chord    ,        church
year       jesu          {        price    devic
electron   invest        compon   fund     interrupt

Table 2: The most discriminative cyber and non-cyber terms (those with the largest-magnitude coefficients) in a logistic regression classifier trained on Stack Exchange. Note that the tables of words should be read from left to right, top to bottom.

Table 2 provides some insight into why our logistic regression classifier performs better than the keyword system. At the top are the 50 terms that receive the highest positive weights and thus contribute more than other terms in causing a document to be classified as cyber. These terms span a wide range of cyber discussions on several topics. Many of these terms and other positively weighted cyber terms used by this classifier are highly likely to be present in cyber documents. Unlike the keyword system, our classifier strongly indicates cyber only if many of the 50 cyber terms are combined in one document; multiple instances of one term will not yield a strong cyber indication. The bottom of the table lists the 50 terms that receive the highest-magnitude negative weights and thus contribute more than others in causing a document to be classified as non-cyber. These terms indicate the breadth of topics that non-cyber documents cover. This diversity suggests that a large set of non-cyber documents needs to be fed into the logistic regression classifier to obtain good performance.
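Term lists like those in Table 2 can be read directly off a fitted linear model by sorting its coefficients. A minimal sketch, assuming the scikit-learn pipeline sketched earlier (with steps named "tfidf" and "clf"), is:

# Sketch: list the most positively (cyber) and most negatively (non-cyber)
# weighted terms of a fitted TF-IDF + logistic regression pipeline.
import numpy as np

def top_terms(pipeline, n=50):
    vocab = np.array(pipeline.named_steps["tfidf"].get_feature_names_out())
    weights = pipeline.named_steps["clf"].coef_[0]
    order = np.argsort(weights)
    return vocab[order[-n:]][::-1], vocab[order[:n]]  # (cyber terms, non-cyber terms)

# cyber_terms, noncyber_terms = top_terms(pipeline)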

The Effect of Document Length and Amount of Training Data

The DET curves in Figure 3 show how performance depends on the number of words in a Reddit document. The top solid line shows performance using all 59 thousand documents. The lower curves show performance only for documents that exceed the displayed word count. As seen, performance initially increases rapidly as the number of words increases. However, the rate of performance increase slows as the minimum number of words increases, and classifier performance enters the target range when the minimum number of words is above 200. Our results suggest that 200 or more words in an Internet conversation are required to provide accurate classification of cyber and non-cyber documents. In our application, this is a reasonable restriction on document length because analysts are looking for long-term cyber discussions and not isolated mentions of cyber topics.



Figure 2: The DET curves for Stack Exchange documents show that the logistic regression classifier (solid line) significantly outperforms both the baseline keyword system (dashed line) and chance guessing (dashed-dotted line).


Figure 3: As the minimum number of words in each Reddit document is increased, the classifier's performance improves. Note that the top solid line is for all documents, while the bottom solid line is for all documents containing more than 200 words.

Classifier performance also improves for Twitter as the number of words per document and the amount of non-cyber training data are increased (Figure 4). For Twitter, a document is composed of all the tweets from a single user, so the number of words per document is increased by including more tweets per user. The number of non-cyber training documents is increased by randomly sampling users and collecting their tweets in additional documents. Because we assume that there is a very low probability of a randomly sampled user discussing cyber topics, no extra labeling cost is incurred by incorporating additional training data. The same 127 cyber users were used to obtain each of the results in Figure 4. On average, there are 10 words per tweet after preprocessing, so in each of the results with a minimum of 20 tweets, there are 200 words per document. Performance was further improved by collecting additional tweets and increasing the average number of words per document to 1,000. These results are consistent with the Reddit results, showing improved classifier performance both as more words are added to the documents and as more non-cyber training data are provided.

Stability in Performance over Time

Another test of our logistic regression approach determined whether a classifier trained on posts from before the public release of the Heartbleed vulnerability (a vulnerability in OpenSSL's implementation of the Secure Sockets Layer (SSL) protocol) could detect social media discussions concerning Heartbleed. Such discussions could only be found if they included words that were used in prior social network cyber discussions, because the classifier would never have seen the word “Heartbleed”. Figure 5 plots the cumulative percentage of Stack Exchange threads detected by a logistic regression classifier trained on 3,924 cyber and 7,848 non-cyber documents posted before the Heartbleed attack was announced on 8 April 2014. The classifier immediately detects the flurry of posts on 8 April and in the following days. Of the 106 Heartbleed-tagged threads, 86% were detected at a false alarm rate of 1%. Our logistic regression classifier performed much better than the keyword baseline system, which only detected 5% of the Heartbleed discussions, because our system detects words related to the protocols affected by Heartbleed (e.g., SSL, Transport Layer Security (TLS)) and other words associated with cyber vulnerabilities (e.g., malware, overflow, attack). The keyword system lacked the keywords used in Heartbleed discussions, and thus suffered from a high miss rate.



Figure 4: As the number of non-cyber users and tweets per user are increased, performance improves. For example, at 1% false alarms, the miss rate is 40% with 100 non-cyber users (solid line), 20% with 250 users (dashed line), and only 6% with 500 users (dashed-dotted line). By adding additional tweets from the 500 users, the miss rate was reduced to 2% at 1% false alarms (dotted line).

A system to detect cyber documents is most useful if it does not require frequent retraining to match possible changes in cyber vocabulary over time. We performed experiments in which a classifier was trained on Stack Exchange data up to a given date and then tested every month after that date without retraining. Figure 6 plots the miss percentage (averaged over false alarm rates ranging from 0.25% to 2.0%) for a classifier that was trained on data before June 2012 and then tested on the new data in each month for the following year. The results indicate that the miss rate increases little over the year and is always below 10%. The experiment was repeated over multiple time periods from 2012 through 2014, producing similar results each time. It is thus sufficient to retrain our classifier only once every six months to a year.
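A retrospective evaluation of this kind (train once on documents before a cutoff date, then score each later month without retraining) can be organized as in the minimal sketch below; the (text, label, month) record format is an assumption, and the per-month scores would then be summarized with DET statistics as in the earlier sketch.

# Sketch of the retrospective stability test: train once before a cutoff month,
# then score each later month without retraining. Record fields are hypothetical.
def temporal_stability(pipeline, documents, cutoff_month):
    """documents: list of (text, label, month) with month as 'YYYY-MM' strings."""
    train = [(t, y) for t, y, m in documents if m < cutoff_month]
    pipeline.fit([t for t, _ in train], [y for _, y in train])
    results = {}
    for month in sorted({m for _, _, m in documents if m >= cutoff_month}):
        batch = [(t, y) for t, y, m in documents if m == month]
        probs = pipeline.predict_proba([t for t, _ in batch])[:, 1]
        # Pair each score with its label; summarize later (e.g., with det_points).
        results[month] = list(zip(probs, [y for _, y in batch]))
    return results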

Conclusions and Future Work

After developing three text corpora containing cyber and non-cyber documents, we demonstrated that human language technology (HLT) classifiers perform well for all corpora, providing a high concentration of cyber documents after filtering. Logistic regression classifiers provided good performance when there were more than roughly 200 words in a discussion. In one illustrative test, a classifier trained before the major Heartbleed vulnerability was announced could accurately detect discussions of this vulnerability, and classifier performance was maintained without retraining even when tested on discussions occurring six months to a year after training. Classifiers can process an input containing only 5% cyber documents into an enriched set for analysts containing between 83% and 95% cyber documents.


Figure 5: Our classifier detects 86% of the Stack Exchange posts discussing Heartbleed after training on data from before the Heartbleed vulnerability (circles), but the baseline keyword system detects only 5% of posts discussing Heartbleed (triangles).

Preliminary experiments suggest that performance degrades when a classifier is trained on one corpus (e.g., Reddit) and tested on another (e.g., Stack Exchange). To improve cross-domain performance, we are currently exploring using neural network word embeddings as word features to take advantage of unlabeled training data (Mikolov et al. 2013). We are also exploring adapting classifiers across domains and creating more general non-cyber word distribution models that can be used across corpora and created without document labels.

We have begun collecting and curating both English and non-English social media content to extend our approaches to other languages and to ease transition to other sources of textual information. Follow-on work includes automatically linking entities in discussions to cyber concepts as suggested by Mulwad et al. One goal of this work will be to automatically fill in components of the “Diamond Model” of intrusion analysis (Caltagirone, Pendergast, and Betz 2013) by extracting data from discussions concerning attacker capabilities, attacker infrastructure, and victims of cyber adversaries. Such discussions could provide information about actors and victims, exploits and tools being used or developed, tactics, and technologies being exploited as required by the Diamond Model. They could also provide details about how an adversary performs different steps of the cyber kill chain (Hutchins et al. 2011). Another goal is to improve performance for cyber discussions containing fewer than 200 words and to reduce the amount of training data required when extending our work to additional Internet social media forums. We are currently developing a classifier using new content to identify cyber discussions related to attack planning, implementation, and execution.


Figure 6: A classifier trained on Stack Exchange data before June 2012 and tested every month after that for one year stays within the performance target.

Acknowledgements

We are grateful to Julie Dhanraj, Beth Richards, Jason Duncan, Peter Laird, and their subject-matter experts for their support. Vineet Mehta's initial contributions are appreciatively acknowledged. Thank you to Clifford Weinstein and Kara Greenfield for assistance in initiating the program and providing helpful insights.

References

Anderson, R. J. 2008. Security Engineering: A Guide to Building Dependable Distributed Systems. Wiley.

Auer, S.; Bizer, C.; Kobilarov, G.; Lehmann, J.; Cyganiak, R.; and Ives, Z. 2007. DBpedia: A nucleus for a web of open data. Springer.

Caltagirone, S.; Pendergast, A.; and Betz, C. 2013. The Diamond Model of Intrusion Analysis. Technical Report ADA586960, Center for Cyber Threat Intelligence and Threat Research, Hanover, MD.

Cortes, C., and Vapnik, V. 1995. Support-Vector Networks. Machine Learning 20(3):273–297.

Hastie, T.; Tibshirani, R.; and Friedman, J. 2009. The Elements of Statistical Learning, volume 2. Springer.

Hutchins, E. M.; Cloppert, M. J.; and Amin, R. M. 2011. Intelligence-driven computer network defense informed by analysis of adversary campaigns and intrusion kill chains. Leading Issues in Information Warfare & Security Research 1:80.

Jones, C. L.; Bridges, R. A.; Huffer, K. M. T.; and Goodall, J. R. 2015. Towards a Relation Extraction Framework for Cyber-Security Concepts. In Proceedings of the 10th Annual Cyber and Information Security Research Conference, CISR '15, 11:1–11:4. ACM.

Jurafsky, D., and Martin, J. 2008. Speech and Language Processing, 2nd Edition. Prentice Hall.

Lau, R. Y.; Xia, Y.; and Ye, Y. 2014. A Probabilistic Generative Model for Mining Cybercriminal Networks from Online Social Media. IEEE Computational Intelligence Magazine 9(1):31–43.

Lippmann, R. P.; Campbell, W. M.; Weller-Fahy, D. J.; Mensch, A. C.; Zeno, G. M.; and Campbell, J. P. 2016. Finding Malicious Cyber Discussions in Social Media. Lincoln Laboratory Journal 22.

Martin, A. F.; Doddington, G. R.; Kamm, T. M.; Ordowski, M. L.; and Przybocki, M. A. 1997. The DET curve in assessment of detection task performance. In Fifth European Conference on Speech Communication and Technology, EUROSPEECH 1997, Rhodes, Greece, September 22-25, 1997.

Mikolov, T.; Chen, K.; Corrado, G.; and Dean, J. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mulwad, V.; Li, W.; Joshi, A.; Finin, T.; and Viswanathan, K. 2011. Extracting Information about Security Vulnerabilities from Web Text. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on, volume 3, 257–260. IEEE.

Sabottke, C.; Suciu, O.; and Dumitras, T. 2015. Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits. In Proceedings of the 24th USENIX Conference on Security Symposium, 1041–1056. USENIX Association.

Syed, Z.; Padia, A.; Mathews, M. L.; Finin, T.; and Joshi, A. 2016. UCO: A Unified Cybersecurity Ontology. In Proceedings of the AAAI Workshop on Artificial Intelligence for Cyber Security. AAAI Press.

Wang, S., and Manning, C. D. 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, 90–94. Association for Computational Linguistics.
