The data | The script | Your turn | Questions?

Hands-on Workshop: Big (Twitter) Data

Damian Trilling
[email protected]
@damian0604
www.damiantrilling.net

Afdeling Communicatiewetenschap, Universiteit van Amsterdam

30 January 2014, 10.45

#bigdata

Analyzing social media with Python and other tools (2/4)


In this session (2/4):

1 The data
• Recording tweets with yourTwapperkeeper
• CSV-files
• Other ways to collect tweets
• Not that different: Facebook posts

2 The script
• Pseudo-code
• Python code
• The output

3 Your turn

4 Questions?


The data: Recording tweets with yourTwapperkeeper

http://datacollection.followthenews-uva.cloudlet.sara.nl


yourTwapperkeeper

Storage
Continuously calls the Twitter API and saves all tweets containing specific hashtags to a MySQL database.

You tell it once which data to collect, and then wait some months.


Retrieving the data
You could access the MySQL database directly.

But yourTwapperkeeper has a nice interface that allows you to export the data to a format we can use for the analysis.


The data: CSV-files


The format of our choice

• Virtually all programs can read it
• Even human-readable in a simple text editor
• Plain text, with a comma (or a semicolon) denoting column breaks
• No built-in limit regarding the size
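To make the "plain text with a delimiter" point concrete, here is a minimal sketch using the csv module from Python's standard library. It is written for Python 3 (the workshop script itself uses Python 2), and the two sample strings and the helper name parse are made up for illustration.

```python
import csv
import io

# Two made-up snippets: one comma-separated, one semicolon-separated.
comma_text = "text,from_user\nhello world,alice\n"
semicolon_text = "text;from_user\nhello world;alice\n"


def parse(text, delimiter):
    """Parse CSV text into a list of rows (each row is a list of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows_comma = parse(comma_text, ",")
rows_semicolon = parse(semicolon_text, ";")

# Both variants yield the same table once the right delimiter is given.
print(rows_comma)
print(rows_comma == rows_semicolon)
```

The only thing that differs between the two files is the delimiter character, which is why it has to be told to the reader explicitly.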


    text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
    :-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://pbs.twimg.com/profile_images/378800000673845195/b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820
    Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg.com/profile_images/2943831271/b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935
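As a rough illustration of how such a file can be read by column name, the sketch below parses the header and the first row with csv.DictReader (Python 3) and converts the time column, which appears to be a Unix timestamp, to a readable date. The profile image URL is replaced by a shortened placeholder, which is an assumption made only to keep the example compact.

```python
import csv
import io
from datetime import datetime, timezone

# Header and first row as on the slide; the profile image URL is a
# shortened placeholder (example.org), NOT the real value.
sample = (
    "text,to_user_id,from_user,id,from_user_id,iso_language_code,source,"
    "profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,"
    "created_at,time\n"
    ":-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop "
    "http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,"
    "http://example.org/avatar.jpeg,,0,0,"
    "Sun Dec 01 09:57:00 +0000 2013,1385891820\n"
)

# DictReader uses the first line as column names.
rows = list(csv.DictReader(io.StringIO(sample)))
first = rows[0]

# Interpret the "time" column as seconds since 1970-01-01 (UTC).
posted = datetime.fromtimestamp(int(first["time"]), tz=timezone.utc)
print(first["from_user"], posted.isoformat())
```

Empty fields such as to_user_id simply come back as empty strings.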


The data: Other ways to collect tweets


Again, we want a CSV file...

• If you want tweets per person: www.allmytweets.net
• Up to six days back: www.scraperwiki.com
• Buy it from a commercial vendor
• TCAT (from the guys at DMI/mediastudies)
• For specific purposes, write your own Python script to access the Twitter API (if you want to, I can show you more about this tomorrow)


The data: Not that different: Facebook posts


Have a look at netvizz

• Gephi files for network analysis
• ... and a tab-separated file (essentially the same as CSV) with the content

An alternative: Facepager

• Tool to query different APIs (among others, Twitter and Facebook) and to store the result in a CSV table
• http://www.ls1.ifkw.uni-muenchen.de/personen/wiss_ma/keyling_till/software.html
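To show why a tab-separated file is "essentially the same as CSV", here is a small Python 3 sketch that reads tab-separated text and writes it back comma-separated. The column names and rows are invented for illustration; they are not the actual layout of a netvizz or Facepager export.

```python
import csv
import io

# Hypothetical tab-separated content; the column names are invented
# and do not reflect a real netvizz export.
tsv_text = "post_id\tmessage\n1\tHello, world\n2\tSecond post\n"

# Same reader as for CSV, only the delimiter changes.
reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
header = next(reader)
posts = list(reader)

# Write the same table back out as comma-separated text.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows([header] + posts)
csv_text = out.getvalue()
print(csv_text)
```

Note that the writer puts quotes around "Hello, world" because the field itself contains a comma.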


The script: Pseudo-code


Our task: Identify all tweets that include a reference to Poland.
Let’s start with some pseudo-code!

    open csv-table
    for each line:
        append column 1 to a list of tweets
        append column 3 to a list of corresponding users
        look for searchstring in column 1
        append search result to a list of results
    save lists to a new csv-file


The script: Python code


    #!/usr/bin/python
    from unicsv import CsvUnicodeReader
    from unicsv import CsvUnicodeWriter
    import re
    inputfilename="mytweets.csv"
    outputfilename="myoutput.csv"
    user_list=[]
    tweet_list=[]
    search_list=[]
    searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')
    print "Opening "+inputfilename
    reader=CsvUnicodeReader(open(inputfilename,"r"))
    for row in reader:
        tweet_list.append(row[0])
        user_list.append(row[2])
        matches1 = searchstring1.findall(row[0])
        matchcount1=0
        for word in matches1:
            matchcount1=matchcount1+1
        search_list.append(matchcount1)
    print "Constructing data matrix"
    outputdata=zip(tweet_list,user_list,search_list)
    headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
    print "Write data matrix to ",outputfilename
    writer=CsvUnicodeWriter(open(outputfilename,"wb"))
    writer.writerows(headers)
    writer.writerows(outputdata)


    #!/usr/bin/python
    # We start with importing some modules:
    from unicsv import CsvUnicodeReader
    from unicsv import CsvUnicodeWriter
    import re

    # Let us define two variables that contain
    # the names of the files we want to use
    inputfilename="mytweets.csv"
    outputfilename="myoutput.csv"


    # We create some empty lists that we will use later on.
    # A list can contain several variables
    # and is denoted by square brackets.
    user_list=[]
    tweet_list=[]
    search_list=[]


    # What do we want to look for?
    searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

    # Enough preparation, let the program begin!
    # We tell the user what is going on...
    print "Opening "+inputfilename

    # ... and call the module that reads the input file.
    reader=CsvUnicodeReader(open(inputfilename,"r"))
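To get a feeling for what this search pattern does and does not match, the following Python 3 sketch runs it on a few made-up texts (they are not from the workshop dataset). Because the pattern has no word boundaries, it also matches inside longer words. The stricter variant with a leading \b is only a possible refinement suggested here, not part of the workshop script.

```python
import re

# The pattern from the script, with straight quotes.
searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

# Made-up example texts (not taken from the workshop dataset).
examples = [
    "Klimaattop in Warschau, Polen",
    "Geen woord over de top",
    "Liverpool wint weer",
]

for text in examples:
    print(text, "->", searchstring1.findall(text))

# Possible refinement (an assumption, not the author's code): require a
# word boundary before the match, so "Liverpool" no longer counts while
# inflected forms such as "Poolse" still do.
stricter = re.compile(r'\b(?:[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa)')
print(stricter.findall("Liverpool wint weer"))
print(stricter.findall("Poolse delegatie"))
```

Whether such false positives matter depends on the dataset; for tweets collected on a climate-summit hashtag they are probably rare.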


    # Now we read the file line by line.
    # The indented block is repeated for each row
    # (thus, each tweet)
    for row in reader:
        # append data from the current row to our lists.
        # Note that we start counting with 0.
        tweet_list.append(row[0])
        user_list.append(row[2])

        # Let us count how often our searchstring is used
        # in this tweet
        matches1 = searchstring1.findall(row[0])
        matchcount1=0
        for word in matches1:
            matchcount1=matchcount1+1
        search_list.append(matchcount1)
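The counting loop spells out each step, which is helpful when learning. As a side note, the number of matches is simply the length of the list that findall returns, so the loop can be replaced by len(). The sketch below (Python 3, with a made-up tweet text) shows that both approaches give the same count.

```python
import re

searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')
tweet = "Van Warschau naar Warszawa: #klimaattop in Polen"  # made-up text

# Explicit loop, as in the workshop script.
matchcount1 = 0
for word in searchstring1.findall(tweet):
    matchcount1 = matchcount1 + 1

# Equivalent shortcut: the number of matches is the length of the list.
shortcut = len(searchstring1.findall(tweet))
print(matchcount1, shortcut)
```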


    # Time to put all the data in one container
    # and save it:

    print "Constructing data matrix"
    outputdata=zip(tweet_list,user_list,search_list)
    headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
    print "Write data matrix to ",outputfilename
    writer=CsvUnicodeWriter(open(outputfilename,"wb"))
    writer.writerows(headers)
    writer.writerows(outputdata)
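The script is written for Python 2 and relies on the workshop's unicsv helper. As a rough sketch only, the same steps could look like this in Python 3 with the standard csv module. The function name count_mentions, the has_header switch, and the UTF-8 encoding are assumptions made for this illustration and are not part of the workshop material.

```python
import csv
import re

SEARCH = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')


def count_mentions(inputfilename, outputfilename, has_header=True):
    """Copy tweet text and user to a new CSV file, adding a match count.

    Returns the number of tweets written.
    """
    with open(inputfilename, newline="", encoding="utf-8") as infile, \
         open(outputfilename, "w", newline="", encoding="utf-8") as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        if has_header:
            next(reader, None)  # skip the column names (assumption)
        writer.writerow(["tweet", "user", "how often is Poland mentioned?"])
        n_rows = 0
        for row in reader:
            # Column 1 (index 0) holds the text, column 3 (index 2) the user.
            writer.writerow([row[0], row[2], len(SEARCH.findall(row[0]))])
            n_rows += 1
    return n_rows
```

It would be called as count_mentions("mytweets.csv", "myoutput.csv"). Unlike the original, this version writes each row immediately instead of collecting three lists first, which avoids keeping the whole dataset in memory.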


The script: myoutput.csv


    tweet,user,how often is Poland mentioned?
    :-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,henklbr,0
    Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,Europarl_NL,1
    RT @greenami1: De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Kli...,LarsMoratis,1
    De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Klimaschutz #FAZ,greenami1,1
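Once the output file exists, a first summary is easy to compute. The Python 3 sketch below counts how many tweets mention Poland at least once. It uses a shortened, made-up stand-in for myoutput.csv (the tweet texts are placeholders) so that it runs on its own.

```python
import csv
import io

# A shortened, made-up stand-in for myoutput.csv; only the column names
# and user names follow the slide, the tweet texts are placeholders.
output_text = (
    "tweet,user,how often is Poland mentioned?\n"
    "tweet one,henklbr,0\n"
    "tweet two,Europarl_NL,1\n"
    "tweet three,greenami1,1\n"
)

rows = list(csv.DictReader(io.StringIO(output_text)))

# The count column is read as text, so convert it to integers first.
counts = [int(r["how often is Poland mentioned?"]) for r in rows]
tweets_with_mention = sum(1 for c in counts if c > 0)
share = tweets_with_mention / len(counts)
print(tweets_with_mention, "of", len(counts), "tweets mention Poland")
```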


Try it yourself!
We’ll help you get started. Please go to http://beehub.nl/bigdata-cw/workshop and download some files. Save the Python files unicsv.py and myfirstscript.py as well as the dataset mytweets.csv in a new folder called workshop on your H-drive.

When you are done, start Python (GUI) from the Windows Start Menu.


Recap

1 The data
• Recording tweets with yourTwapperkeeper
• CSV-files
• Other ways to collect tweets
• Not that different: Facebook posts

2 The script
• Pseudo-code
• Python code
• The output

3 Your turn

4 Questions?


This afternoon

Your own script


Questions or comments?
