Visualizing Social Media Sentiment in Disaster Scenarios

Project Report - 2nd Semester, 2016

2016-11-20

Student’s Name: J. A. Samarasinghe

Student ID: 18161335

Course: Master by Coursework in Computer Science

Department of Computing, Curtin University, Western Australia

Email: [email protected]

Supervisor: Sonny Pham

Co-Supervisor: N/A

1 ABSTRACT

According to [1], Twitter is a social medium that shows the known characteristics of human social networks. It can be summarized as a social networking and microblogging service that enables users to send and read messages of up to 140 characters, known as tweets. Tweets contain rich information about people’s preferences and the way they look at a particular problem.

Recently, Twitter has been used successfully to measure the impact of real-world disasters. The work in [2] demonstrates this by exploiting an Ebola Twitter dataset to show how Twitter data can reveal interesting patterns in disaster scenarios.

Our research is similar to [2]. We plan to examine the Zika virus (ZIKV) outbreak by using Twitter data to reveal patterns through visual analytics techniques and sentiment modeling. The proposed system is therefore a visual analytics framework for sentiment visualization of geo-located Twitter data.

2 OBJECTIVES

According to Wikipedia, sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude towards a particular topic, product, incident, etc., is positive, negative or neutral.

In our research we plan to use sentiment analysis to identify the positive, negative and neutral aspects of Twitter data regarding the Zika virus (ZIKV). We plan to use a crawler system to collect data from Twitter. For the sentiment analysis, we plan to use three tools: SentiWordNet, SentiStrength and CoreNLP. Each of the three will classify tweets as Positive, Negative or Neutral.

With these classifications, it is possible to compare the results with real Zika virus (ZIKV) data. Provided our methods are accurate, the compared results should agree with the data from the real-world incident.

The research questions motivating our project are as follows:

• How can we obtain more accurate results from the three sentiment classifiers?

• How accurate is the method we use to retrieve the data?

• How can we derive a final result from the three classifier results?

• Do the Twitter data results reflect the real-world results?

3 BACKGROUND

Twitter is a well-known website that provides social networking and microblogging services. It enables its users to update their status in tweets, follow the people they are interested in, retweet others’ posts and even communicate with them directly. Any user can post a message, referred to as a tweet, and the public can view it on their home page. Because of its very large number of users, Twitter is a good platform for analyzing public sentiment. Sentiment analysis on Twitter data provides an effective way to represent real-time public sentiment, which is critical for decision making in various domains. For instance, a company can study the public sentiment in tweets to obtain users’ feedback on its products. This is useful for academic purposes as well as for organizations. The recent topics within periods of sentiment variation are correlated with the reasons behind the variations.

Nowadays many researchers use Twitter data to analyze various scenarios according to the requirements of their research. In our research we use Twitter data to analyze the Zika virus (ZIKV) outbreak, which began in early 2015 in Brazil and is currently ongoing, mostly in the Americas and in Pacific countries, including several Pacific islands. The work in [2] is very similar to this proposed research project.

Sentiment analysis is currently a major trend in natural language processing. Twitter contains a large number of short messages, and most researchers use sentiment analysis when exploring social media. Our intention is the same. According to [3], microblogging has become one of the major types of communication. Their work uses its own methodology for targeted review sentiment analysis and classification. In their work, sentiment analysis is performed by generating data automatically in a structured manner. Classification uses three classes: positive, negative and neutral. They report that the algorithm gives better results and performance because it uses a fast and accurate naive Bayes classifier.

In this research we use three tools to obtain the required results: SentiStrength, SentiWordNet and CoreNLP. Using all three, we can categorize tweets as positive, negative or neutral, as in [3], and then combine their outputs to reach a more accurate result.

• A desktop system for Crawling

We plan to use a desktop system as the crawler to collect data from Twitter. Earlier research has used this type of crawler. The system in [4], a real-time tweet analysis system for earthquake event detection and reporting, obtains data from large Twitter communities and captures real-time data from Twitter. Our proposed crawler is very similar: at this stage we capture real-time data and store it in a database for further use.

• SentiStrength

According to [5], sentiment analysis software reads text and uses an algorithm to produce an estimate of its sentiment content. SentiStrength employs a lexicon of sentiment words and word stems together with average positive or negative scores, and uses the scores of the constituent words unless these are modified by any of the additional classification rules, such as those for emoticons, negations and booster words.

SentiStrength has near-human accuracy on general short social web texts but is less accurate when the texts often contain sarcasm, as in the case of political discussions. The accuracy of SentiStrength can be enhanced by extending its lexicon and altering its mood setting for sets of text with a narrow topic focus. According to [5], SentiStrength can be used to analyze large-scale sentiment patterns in the social web in addition to its commercial uses.

• SentiWordNet

Sentiment classification concerns the use of automatic methods for predicting the orientation of subjective content in text documents, with applications in a number of areas including recommender and advertising systems, customer intelligence and information retrieval. SentiWordNet is an opinion lexicon derived from the WordNet database in which each term is associated with numerical scores indicating positive and negative sentiment information.

According to [6], SentiWordNet can prove a useful tool for opinion mining applications because of its wide coverage (all WordNet synsets are tagged according to each of the three labels Objective, Positive, Negative) and because of its fine grain, obtained by qualifying the labels by means of numerical scores.

[7] assessed the use of the SentiWordNet opinion lexicon in the task of sentiment classification of film reviews. Results obtained by simple word counting were similar to other results employing manual lexicons, indicating that SentiWordNet performs well when compared with manual resources on this task. In addition, using SentiWordNet as a source of features for a supervised learning scheme has shown improvements over pure term counting.

• CoreNLP

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

[8] describes the design and use of the Stanford CoreNLP toolkit, an extensible pipeline that provides core natural language analysis. Its extensive documentation maximizes the potential for independent learning.

4 SIGNIFICANCE

In this research we create a system to visualize social media sentiment in disaster scenarios. If the project succeeds, we will obtain results that closely match the actual real-world data. This would mean that social media such as Twitter can be used to analyze disaster scenarios around the world, as well as other similar situations.

If we can find a successful method to mine the data and derive a result, we will be able to apply it to current data for real-time monitoring. We may then be able to anticipate developments before they actually happen, which would be a great service to society.

Nowadays many researchers use social media such as Twitter to predict and understand various types of scenarios. Our research is a first step in a larger body of work, and the points above are its main significance.

5 METHODOLOGY

After receiving the tweet data with geo-location data, we store it in a MongoDB database. Using SentiStrength, SentiWordNet and CoreNLP, we can separate the tweet data into three categories: Positive, Negative and Neutral. After a tweet has passed through SentiStrength, SentiWordNet and CoreNLP, we decide which category suits it best by applying the Boyer-Moore majority voting algorithm to the three results. With that output we can create a heat map and/or scatter plot (since we have geographical data) and a forex-style chart to display the flow of data by date. The sections below explain the steps and procedures required to reach the required results.

• Collecting Data

Tweets on Twitter are publicly available. According to Twitter’s API documentation, the search function is restricted to 180 queries per 15 minutes (720 queries per hour). Since we use the Streaming API, which captures real-time data from Twitter, this query limit does not apply to our crawl. However, since we need geo-located data, we had to capture data region by region.

We created the crawler with filters for keyword, language and region.

Since we need results regarding the Zika virus, we created keyword filters for ’zika virus’ and ’zeka virus’, because people tend to use both spellings. We set the language filter to English (’en’), since we are all familiar with it and it is easier to manage when passing tweets through the classifiers.

To filter data by region we need the geo-coordinates for those regions. Every place has coordinates given as two values (latitude and longitude), in one of the following formats:

– Decimal Degrees - N/S/E/W

– Decimal Degrees - Plus/Minus

– Degrees/Minutes/Seconds - N/S/E/W

– Degrees/Minutes/Seconds - Plus/Minus

The location filter for a Twitter search through tweepy needs four decimal-degree values as input coordinates, forming a bounding box: [south-west longitude, south-west latitude, north-east longitude, north-east latitude].

Figure 1: location format.
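The coordinate handling can be sketched as follows. This is a minimal illustration rather than the crawler's actual code: the helper names dms_to_decimal and bounding_box, and the example region, are hypothetical. It assumes the box is passed with the south-west corner first and the north-east corner second, each as a longitude followed by a latitude.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds with N/S/E/W into signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    # In the plus/minus convention, South and West are negative.
    return -value if hemisphere in ("S", "W") else value


def bounding_box(sw_lon, sw_lat, ne_lon, ne_lat):
    """Return [sw_lon, sw_lat, ne_lon, ne_lat] after basic sanity checks."""
    if not (-180 <= sw_lon <= 180 and -180 <= ne_lon <= 180):
        raise ValueError("longitude must lie in [-180, 180]")
    if not (-90 <= sw_lat <= 90 and -90 <= ne_lat <= 90):
        raise ValueError("latitude must lie in [-90, 90]")
    if sw_lat >= ne_lat:
        raise ValueError("south-west latitude must be below north-east latitude")
    return [sw_lon, sw_lat, ne_lon, ne_lat]


# Hypothetical region, used only for illustration.
box = bounding_box(-74.0, -34.0, -34.0, 5.5)
print(box)
```

The resulting list has the four values in the order expected by the location filter, so a region exported in degrees/minutes/seconds can be converted first and then checked before it is passed to the crawler.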

www.openstreetmap.org gives its output as four decimal values. (Search for the required area and click on the Export button.)

Figure 2: Region coordinates.

We then save all the retrieved data in MongoDB using separate collections, region-wise.

We used a timer inside the program in case an exception terminates it. With the timer, if an exception occurs and the program stops, it restarts automatically after 15 seconds.
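The restart behaviour can be sketched roughly as below. The function name run_with_restart and its parameters are hypothetical, and the crawler itself is represented by any callable; only the 15-second delay comes from the description above.

```python
import time


def run_with_restart(task, delay_seconds=15, max_restarts=None, sleep=time.sleep):
    """Run task(); if it raises, wait delay_seconds and start it again.

    max_restarts=None retries without limit; a number caps the restarts.
    Returns how many restarts were needed before task() finished normally.
    """
    restarts = 0
    while True:
        try:
            task()
            return restarts
        except Exception as error:
            if max_restarts is not None and restarts >= max_restarts:
                raise
            restarts += 1
            print(f"crawler stopped ({error!r}); restarting in {delay_seconds}s")
            sleep(delay_seconds)
```

Catching Exception rather than everything means that a manual interrupt (Ctrl+C) still stops the program instead of triggering another restart.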

• Sentiment Analysis

To analyze the sentiments we follow the steps below.

– SentiStrength

SentiStrength performs automatic sentiment analysis of up to 16,000 social web texts per second with up to human-level accuracy for English; other languages are available or easily added. SentiStrength estimates the strength of positive and negative sentiment in short texts, even for informal language. It has human-level accuracy for short social web texts in English, except political texts.

SentiStrength was originally developed for English and optimized for general short social web texts but can be configured for other languages and contexts by changing its input files.

According to the project specification, we need to use a publicly available version that is compatible with other software. The Java version of SentiStrength is similar to the Windows version in core functions but has additional capabilities. It can conduct binary (positive/negative), trinary (positive/neutral/negative) and single-scale (-4 very negative to +4 very positive) classifications in addition to the standard type, and can conduct keyword-oriented and domain-oriented classifications. It also has a special mode for binary and trinary classification on longer texts. It allows wildcards in the idiom list file. It can process about 16,000 tweets per second.

Positive sentiment strength ranges from 1 (not positive) to 5 (extremely positive) and negative sentiment strength from -1 (not negative) to -5 (extremely negative). The sentiment strength detection results are not always accurate: they are guesses that use a set of rules to identify words and language patterns usually associated with sentiment.

There are four types of output we can obtain using SentiStrength.

∗ “Dual Type”: the text ’I love dogs quite a lot but cats I really hate.’ has positive strength 3 and negative strength -5 (overall result -1 if positive < negative, 1 if positive > negative). Approximate classification rationale: I love[3] dogs quite a lot but cats I really hate[-4] [-1 booster word] .[sentence: 3,-5] [result: max + and - of any sentence][overall result = -1 as positive<negative].

∗ “Binary Type”: the text ’I love dogs quite a lot but cats I really hate.’ has binary result -1 (1 is positive, -1 is negative). Approximate classification rationale: I love[3] dogs quite a lot but cats I really hate[-4] [-1 booster word] .[sentence: 3,-5] [result: max + and - of any sentence][overall result = -1 as positive<negative].

∗ “Trinary Type”: the text ’I love dogs quite a lot but cats I really hate.’ has trinary result -1 (1 is positive, 0 is neutral, -1 is negative). Approximate classification rationale: I love[3] dogs quite a lot but cats I really hate[-4] [-1 booster word] .[sentence: 3,-5] [result: max + and - of any sentence][overall result = -1 as positive<negative].

∗ “Scale Type”: the text ’I love dogs quite a lot but cats I really hate.’ has scale result -2. Approximate classification rationale: I love[3] dogs quite a lot but cats I really hate[-4] [-1 booster word] .[sentence: 3,-5] [result: max + and - of any sentence][scale result = sum of positive and negative scores].
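How the trinary and scale results follow from the dual scores can be sketched in a few lines. This is an illustration of the rules quoted above, not SentiStrength's own code; the function names are hypothetical, and treating equal strengths as neutral is an assumption.

```python
def trinary_result(positive, negative):
    """Map dual scores (positive in 1..5, negative in -5..-1) to 1, 0 or -1."""
    if positive > -negative:
        return 1
    if positive < -negative:
        return -1
    return 0  # assumption: equal strengths count as neutral


def scale_result(positive, negative):
    """Single-scale score from -4 to +4: the sum of both strengths."""
    return positive + negative


# Dual scores of the example sentence: positive 3, negative -5.
print(trinary_result(3, -5), scale_result(3, -5))
```

For the example sentence this gives the trinary result -1 and the scale result -2, matching the values listed above.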

– SentiWordNet

SentiWordNet is a lexical resource for opinion mining. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, objectivity. Each synset s is associated with three numerical scores Pos(s), Neg(s) and Obj(s), which indicate how positive, negative and objective (i.e., neutral) the terms contained in the synset are.

NLTK (Natural Language Toolkit) is the basic requirement for downloading SentiWordNet. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing and semantic reasoning.

SentiWordNet is a resource that contains a list of English terms, each of which has been assigned a positivity score and a negativity score. This information is extracted and matched against the text to produce an overall score and hence a prediction of the sentiment expressed in the document.

SentiWordNet is made up of tens of thousands of words, their meanings, the part of speech represented and the degree of positivity and negativity of each word, ranging from 0 to 1. These words were all derived from the WordNet 2.0 database, a database of English words and their meanings in which terms are organized according to semantic relations. The words are grouped by their synonyms into what are called synsets. Since the same word can have different meanings depending on the part of speech, SentiWordNet was designed by ranking the subjectivity of all terms/synsets according to the part of speech the term belongs to. In short, SentiWordNet extends WordNet by adding subjectivity information (+ or -) to every word in the database.
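The scoring idea can be illustrated with a small sketch. The lexicon below is a toy stand-in with made-up scores, not real SentiWordNet data, and the names TOY_LEXICON, objectivity and classify as well as the threshold are hypothetical. It relies on the three scores of a synset summing to 1, so Obj(s) = 1 - Pos(s) - Neg(s).

```python
# Hypothetical (Pos, Neg) scores; real values would be looked up in SentiWordNet.
TOY_LEXICON = {
    "good": (0.75, 0.0),
    "bad": (0.0, 0.625),
    "virus": (0.0, 0.25),
}


def objectivity(pos, neg):
    """Obj(s) = 1 - Pos(s) - Neg(s), since the three scores sum to 1."""
    return 1.0 - pos - neg


def classify(text, lexicon=TOY_LEXICON, threshold=0.1):
    """Sum Pos - Neg over the known words and map the total to a label."""
    total = 0.0
    for token in text.lower().split():
        pos, neg = lexicon.get(token.strip(".,!?"), (0.0, 0.0))
        total += pos - neg
    if total > threshold:
        return "positive"
    if total < -threshold:
        return "negative"
    return "neutral"


print(classify("Good news about the virus"))
```

Words missing from the lexicon contribute nothing, so a text made only of unknown or objective words falls into the neutral class.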

– Stanford CoreNLP

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract open-class relations between mentions, etc.

We can then separate the tweet data into three categories: Positive, Negative and Neutral. After a tweet has passed through SentiStrength, SentiWordNet and CoreNLP, we decide which category suits it best using the Boyer-Moore majority voting algorithm. With that output we can create a heat map and/or scatter plot (since we have geographical data) and a time series visualization to display the flow of data by date.

– Boyer-Moore Majority Vote Algorithm - The Boyer-Moore majority vote algorithm is an algorithm for finding the majority of a sequence of elements using linear time and constant space. In its simplest form, the algorithm finds a majority element, if there is one: that is, an element that occurs in more than half of the positions of the input. However, if there is no majority, the algorithm will not detect that fact, and will still output one of the elements. A version of the algorithm that makes a second pass through the data can be used to verify that the element found in the first pass really is a majority. Following the project specification, we made one change: when there is no majority value, the algorithm returns neutral rather than ignoring those elements.

Figure 3: Boyer-Moore majority voting algorithm.
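A minimal sketch of the two-pass variant with the neutral fallback is given below. The function name majority_vote and the label strings are assumptions made for illustration; the program shown in the figure may differ in detail.

```python
def majority_vote(labels, fallback="neutral"):
    """Boyer-Moore majority vote with a verification pass.

    Returns the label that occurs in more than half of the positions,
    or `fallback` when no such label exists.
    """
    labels = list(labels)

    # First pass: keep one candidate and a counter.
    candidate, count = None, 0
    for label in labels:
        if count == 0:
            candidate, count = label, 1
        elif label == candidate:
            count += 1
        else:
            count -= 1

    # Second pass: confirm the candidate really is a majority.
    occurrences = sum(1 for label in labels if label == candidate)
    if labels and occurrences * 2 > len(labels):
        return candidate
    return fallback


print(majority_vote(["positive", "negative", "positive"]))
print(majority_vote(["positive", "negative", "neutral"]))
```

With three classifiers, a label wins as soon as two of them agree; when all three disagree, the verification pass fails and the tweet is treated as neutral.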

• Geo-Location and Interactive Maps

“Geo-location” is the process of identifying the geographic location of an object. Twitter allows its users to provide their location when they publish a tweet, in the form of latitude and longitude coordinates. With this information we can create visualizations of our data in the form of interactive maps. Here I plan to use the GeoJSON format together with a JavaScript library for the interactive map output in my project.

“GeoJSON” is an open format for encoding a variety of geographic data structures. A GeoJSON object may represent a geometry, a feature, or a collection of features. GeoJSON supports the following geometry types: Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection. Features in GeoJSON contain a geometry object and additional properties, and a feature collection represents a list of features.
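As an illustration, a classified tweet could be turned into a GeoJSON Point feature as sketched below. The function names and the property names "text" and "sentiment" are hypothetical choices for this example; note that GeoJSON positions list the longitude before the latitude.

```python
import json


def tweet_to_feature(longitude, latitude, text, sentiment):
    """Build a GeoJSON Feature with a Point geometry for one tweet."""
    return {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [longitude, latitude]},
        "properties": {"text": text, "sentiment": sentiment},
    }


def to_feature_collection(features):
    """Wrap a list of features into a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


# Made-up coordinates and text, used only to show the structure.
collection = to_feature_collection(
    [tweet_to_feature(-43.2, -22.9, "example tweet text", "negative")]
)
print(json.dumps(collection, indent=2))
```

The printed document is plain JSON, so it can be saved to a file and loaded by a map library that reads GeoJSON.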

6 ETHICAL ISSUES

In this research, everything we plan to use is either open source or used with permission given by the original authors for non-commercial purposes.

Since we are using the Twitter Streaming API, we have not faced any ethical issues.

7 FACILITIES, RESOURCES AND EXPECTED OUTPUT

• Back end

– Python

– MongoDB

• Front end

– Java-GUI

– GeoJSON for Geo location map

– Forex for time map

• Storage

– MongoDB

• Resources

– Twitter data

– Sample twitter data set for testing

• Sample Final Output

Figure 4: sample Output GUI.

8 RESULTS

Twitter Database Used

Figure 5: DataBase.

Twitter Collections Used

Figure 6: Collections.

Retrieved data with Geo-location and without Geo-location.

Figure 7: Program Results.

Figure 8: Program Results.

Figure 9: Program Results Count.

Sample program for getting sentiment results from the three classifiers, and the final result, for one given text.

Figure 10: Program.

Figure 11: Sample Program result.

For checking purposes, we used a sample Twitter dataset.

We obtained a sample tweet set labeled with positive, negative and neutral values.

• Sample Twitter Data Set - AirLine Twitter Sentiment.

• A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February 2015, with tweets categorized as positive, negative or neutral.

• Extracted 90 tweets: positive = 30, negative = 30 and neutral = 30.

• Extracted only the sentiment column and the text column (sample tweets.txt).

We saved the sentiment and text separately. We obtained the answer from all three classifiers (CoreNLP, SentiStrength and SentiWordNet) using the extracted text, and compared each classifier’s results separately with the given label in the test tweet set to check whether the answers match. We noticed that the classifiers are weak at identifying neutral sentiment. We used the maximum entropy method to get a final answer from the three classifier answers.
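The comparison against the given labels can be sketched as a per-class accuracy calculation. The function name per_class_accuracy and the short label lists are made up for illustration and are not the project's test data.

```python
from collections import Counter


def per_class_accuracy(gold, predicted):
    """Return, for each gold label, the fraction of tweets predicted correctly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    totals, correct = Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        totals[gold_label] += 1
        if gold_label == predicted_label:
            correct[gold_label] += 1
    return {label: correct[label] / totals[label] for label in totals}


# Made-up labels, used only to show the calculation.
gold = ["positive", "negative", "neutral", "neutral"]
predicted = ["positive", "negative", "positive", "neutral"]
print(per_class_accuracy(gold, predicted))
```

Reporting the accuracy separately for each class makes a weakness on one class, such as neutral tweets, visible even when the overall accuracy looks acceptable.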

All the test results (Results.txt) are included in the submission. Here are the final result percentages.

Figure 12: sentiment Result using sample data.

9 CONCLUSION AND FUTURE WORKS

When we collected data, there were duplicates among the results. Since we collect data from different regions separately, there should not be any duplicates. We therefore need to check the coordinates again to clearly identify the region boundaries, because Twitter returns results according to the coordinate boundaries. We should also look for other ways to obtain coordinates rather than using the OpenStreetMap results with 4 decimal places.

We should try to get country-based results to obtain more geo-tagged tweets, since we need geo-located data for visualization.

We need to modify the classifiers to get more accurate results, as currently they are not very accurate at identifying neutral tweets.

After getting more accurate results and collecting more geo-tagged data, we need to visualize the results and produce an output similar to the sample output shown above. We should also identify the strengths and weaknesses of the classifiers used.
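Until the region boundaries are corrected, one possible stopgap is to drop repeated tweets by their identifier when reading the collections. The sketch below assumes each stored tweet is a dictionary that keeps Twitter's "id_str" field; the function name drop_duplicates is hypothetical.

```python
def drop_duplicates(tweets, key="id_str"):
    """Keep the first occurrence of each tweet, identified by tweets[i][key]."""
    seen = set()
    unique = []
    for tweet in tweets:
        tweet_id = tweet[key]
        if tweet_id not in seen:
            seen.add(tweet_id)
            unique.append(tweet)
    return unique


# Made-up records standing in for tweets captured by two overlapping regions.
region_a = [{"id_str": "1", "text": "first"}, {"id_str": "2", "text": "second"}]
region_b = [{"id_str": "2", "text": "second"}, {"id_str": "3", "text": "third"}]
print(len(drop_duplicates(region_a + region_b)))
```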

References

[1] H. Kwak, C. Lee, H. Park, and S. Moon, “What is twitter, a social network or a news media?,” in Proceedings of the 19th International Conference on World Wide Web, WWW ’10, (New York, NY, USA), pp. 591–600, ACM, 2010.

[2] Y. Lu, X. Hu, F. Wang, S. Kumar, H. Liu, and R. Maciejewski, “Visualizing social media sentiment in disaster scenarios,” in Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, (Republic and Canton of Geneva, Switzerland), pp. 1211–1215, International World Wide Web Conferences Steering Committee, 2015.

[3] B. N. Kolte and P. Deshmukh, “Analysis and classification of public sentiments from twitter,”

[4] G. V. Shendge, M. R. Pawar, N. D. Patil, P. R. Pawar, and D. B. Bagul, “Real time tweet analysis for event detection & reporting system for earthquake,” 2015.

[5] M. Thelwall, “Heart and soul: Sentiment strength detection in the social web with sentistrength,” Proceedings of the CyberEmotions, pp. 1–14, 2013.

[6] A. Esuli and F. Sebastiani, “Sentiwordnet: A publicly available lexical resource for opinion mining,” in Proceedings of LREC, vol. 6, pp. 417–422, Citeseer, 2006.

[7] B. Ohana and B. Tierney, “Sentiment classification of reviews using sentiwordnet,” in 9th IT&T Conference, p. 13, 2009.

[8] C. D. Manning, M. Surdeanu, J. Bauer, J. R. Finkel, S. Bethard, and D. McClosky, “The stanford corenlp natural language processing toolkit,” in ACL (System Demonstrations), pp. 55–60, 2014.
