DIY basic Facebook data mining



Pleasures of basic Facebook data



Guest Lecture at Charles University, Prague, 4.12.2013

1. Why A tiny philosophical


2. How No programming, just copy


Today we are going to talk about :

The Boring part

Why are we doing this?

What's in it for you?

What are other ways to do this?

The Fun part

How is it done?

Why would I even try to mine FB data myself?

What is a Facebook like worth to your business?

In what ways are my fans like my other customers?

What do I actually know about my fans and followers on top of their age?

Can I group my followers into segments?

Can I target my followers based on what they (are) like?

Which ones are creating the most activity?

What on earth are all the other ones doing?

How similar/different is my competitor's fanbase?

Here's why. Sample questions:

Built-in insights are fine for fanpage managers, but not for research

Who could have guessed..

External validity: Research in social media tells you little about life outside social media (Facebook self vs. real self).

Sampling: Only some profiles are public > Is there enough data to make claims about my fanbase?

Organic environment: Network engineers keep changing stuff, so you are in constant need of adjustment.

Limitations of FB research?

OK, but there are other ways..

Bambillion !

Always posted by a lady in her 40s

Indeed, there are ways:

Ask professionals and pay them accordingly (see below)

Set up a social media login or create an app (a rather good


Use ready-made tools and solutions (and pay for the useful ones)






Buy more

What does a brand manager want from a customer?





Engage more

What does a fanpage manager want from a fan?

How is it done?

Facebook developers are smart so the road is a bit thorny

Good tools are usually not free

Open source tools are usually not as good

It's mostly fine legally

Obstacles ahead

… but I am not a technical type.

a) Find someone who is

b) Break it down into little steps

c) Your chance to stand out

MS Excel / iOS Numbers Programs > MS Office / ??


OpenRefine: engineered at Google Inc., formerly named Google Refine

Facebook's own Graph API

Tools to use (where facebook meets google and google meets microsoft)

Subjects to examine (pick any fanpage or group or event)




                           Pilsner Urquell                  Gambrinus
Product                    More expensive, high-end beer    Widely and wildly consumed cheaper
                           Quality, tradition, national     Fun, shared moments, soccer
Number of fans             204 734                          47 566
Number of posts in 2013    415                              425

Not really competitors, they have the same mothership!

Hypothesis time

H1 : Their active fanbase consists of less than 10% of the total fans

H2 : There is more than 10% overlap in their active fanbase

H3 : Gambrinus and Pilsner Urquell have the same engagement per post

H4 : The interest positioning will show a small affinity, as beer is widely appreciated across the population

Action !

Step 1 - Do not fear the Graph API


Access_token !

Fields selector

Result window

Step 1 – Facebook is nothing but a couple of big tables

Step 1 – The JSON result format (JavaScript Object Notation)

The Graph API gives you a result in JSON format, a visually disturbing yet convenient format used in web applications. Wait and see how OpenRefine handles it..
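To show why JSON is convenient despite its looks, here is a minimal Python sketch that parses a Graph-API-style result and flattens it into one row per like. The sample response, its field names (data, likes, id, name) and the ids and names in it are illustrative assumptions, not real output.

```python
import json

# Hypothetical, shortened response; the field names are assumed for illustration.
SAMPLE = """
{
  "data": [
    {"id": "post_1", "likes": {"data": [{"id": "101", "name": "Fan A"},
                                         {"id": "102", "name": "Fan B"}]}},
    {"id": "post_2", "likes": {"data": [{"id": "101", "name": "Fan A"}]}},
    {"id": "post_3"}
  ]
}
"""

def flatten_likes(raw):
    """Return one (post_id, user_id, user_name) row per like."""
    rows = []
    for post in json.loads(raw).get("data", []):
        # In this sketch a post without likes simply has no "likes" key.
        for like in post.get("likes", {}).get("data", []):
            rows.append((post["id"], like["id"], like["name"]))
    return rows

rows = flatten_likes(SAMPLE)
for row in rows:
    print(row)
```

Each printed row is one "like" node, which is roughly the table shape a spreadsheet can work with.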

No, not this Json

Get the id of the fanpage. There are many ways to do it, e.g.:

1) Click on the page's profile picture

2) Look in the address bar and copy the last number before "type"

Step 2 – Making a simple Graph API query


1) Get a fresh access_token

2) And get data from your own timeline



Important, otherwise you will only get a handful

1) Repeat with our fanpage

2) And add some more fields – query likes and comments, increase limit, reduce timespan with a unix timestamp (135..)

146991996743/posts?fields=likes,comments&limit=20000&since=1356998400 (from 1.1.2013)
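The since value is a Unix timestamp, i.e. seconds since 1.1.1970. A minimal Python sketch for producing one, assuming the cut-off is meant as midnight UTC (the helper name to_unix is made up for this example):

```python
from datetime import datetime, timezone

def to_unix(year, month, day):
    """Unix timestamp (whole seconds) for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

since = to_unix(2013, 1, 1)
print(since)  # 1356998400, the value used in the query
```

Changing the date moves the start of the timespan, e.g. to_unix(2013, 6, 1) for posts from June onwards.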

Step 2 – Making a more complex query

A) URL :

B) query : 146991996743/posts?fields=likes,comments&limit=20000&since=1356998400

C) Access token : &access_token=XXXXXXXXX……and so on

Put together A+B+C :,comments&limit=20000&since=1356998400&access_token=XXXXX

Step 3 – Build a string to post the same query in browser address bar
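The same A+B+C string can be assembled in a few lines of Python. This is only a sketch: the host in BASE_URL is an assumption (the usual Graph API address), the function name is made up, and the token is a placeholder.

```python
from urllib.parse import urlencode

# Assumption: the usual Graph API host, used here only for illustration.
BASE_URL = "https://graph.facebook.com/"

def build_query_url(page_id, access_token, since, limit=20000):
    """Join URL (A), query (B) and access token (C) into one address."""
    params = {
        "fields": "likes,comments",
        "limit": limit,
        "since": since,
        "access_token": access_token,
    }
    # safe="," keeps the comma in "likes,comments" unescaped
    return f"{BASE_URL}{page_id}/posts?{urlencode(params, safe=',')}"

url = build_query_url("146991996743", "XXXXX", 1356998400)
print(url)
```

Swapping the page id is then the only change needed to query another fanpage.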

Step 4 – Run OpenRefine

1) Run the programme (it opens in your browser)

2) Select Web Addresses

Step 5 – Paste your address into the field

1) Take our query,comments&limit=20000&since=1356998400&access_token=XXXXXXX

2) Paste here

3) Click next

Step 6 – Transform your result

1) Tell the programme that your result is JSON by clicking on "JSON Files"

Step 7 – Pick an individual node !

This is one „like“ on a post made by user Maggu Ka

Step 7 – Behold !

Click on "Create Project" in the upper left and download the data as an Excel sheet

Be sure this does not exceed your "limit" in the query, otherwise increase the limit

Back to Step 3 !

The only thing you need to change is the id – instead of Gambrinus, now try the Pilsner Urquell id. Don't remember?

Analysis (sort of)

Note : The metrics chosen could be redesigned to reflect other things, like time and location

Engagement, like.. ehm, kiwi.. has layers

Sample question : Has my post attracted anyone outside the usual bunch of followers who simply like everything?

Skin : All fans

Inside : Fans who interact

Core : Fans who interact more than once


Make crude metrics of those layers

Tip : By playing with the column named created_time you can see how your core fanbase has been gaining and losing interest in your posts and whether it kept interacting = compute the lifetime of a fan
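A rough way to compute such a fan lifetime is the number of days between a fan's first and last interaction. The sketch below assumes created_time arrives as text like 2013-01-10T09:00:00+0000; the sample rows and the function name are hypothetical.

```python
from datetime import datetime

# Hypothetical (user_id, created_time) pairs; the timestamp format is an assumption.
INTERACTIONS = [
    ("101", "2013-01-10T09:00:00+0000"),
    ("101", "2013-03-11T09:00:00+0000"),
    ("102", "2013-02-01T12:00:00+0000"),
]

def fan_lifetimes(interactions):
    """Days between each fan's first and last interaction."""
    first, last = {}, {}
    for user_id, created in interactions:
        t = datetime.strptime(created, "%Y-%m-%dT%H:%M:%S%z")
        first[user_id] = min(first.get(user_id, t), t)
        last[user_id] = max(last.get(user_id, t), t)
    return {u: (last[u] - first[u]).days for u in first}

lifetimes = fan_lifetimes(INTERACTIONS)
print(lifetimes)
```

A fan with a single interaction gets a lifetime of 0 days here, which is one possible convention among several.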

Skin : All fans = 100%

Unique ids within interactions / All fans = 7%

Fans with more than 1 interaction / All fans = 2%
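These layer ratios can be computed from a plain list of user ids, one entry per like or comment. A minimal sketch with made-up toy data; the function and key names are illustrative, not part of any tool:

```python
from collections import Counter

def layer_metrics(total_fans, interactor_ids):
    """interactor_ids: one user id per like or comment, repeats included."""
    counts = Counter(interactor_ids)
    active = len(counts)                             # unique interactors
    core = sum(1 for n in counts.values() if n > 1)  # more than 1 interaction
    return {
        "active": active,
        "core": core,
        "active_share": active / total_fans,
        "core_share": core / total_fans,
        "core_of_active": core / active if active else 0.0,
    }

# Hypothetical toy data: 100 fans, 5 interactions by 3 distinct users
m = layer_metrics(100, ["a", "a", "b", "c", "c"])
print(m)
```

In the toy data 3% of fans are active and 2% belong to the core; real input would be the id column of the exported spreadsheet.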

Try it with real Gambrinus fanpage data

Tip : What are these ratios among competitors? Isn't that more important than the widely cited number of fans? Are any of your fans also in the competitor's core fanbase? Uhh, you nasty weasels!

47 566 = 100%

2004 unique interactors = 4.2%

575 interactors with more than 1 action = 1.2% (28% of all active fans)

And now the Pilsner Urquell


204 734 = 100%

2358 unique interactors = 1.1%

715 interactors with more than 1 action = 0.3% (30% of all active fans)

Stand-off revisited. H1 rejected and H2 confirmed


                                   Pilsner Urquell        Gambrinus
Number of fans                     204 734                47 566
Number of posts in 2013            415                    425
Number of active fans in 2013      2358 / 1.1%            2004 / 4.2%
Number of repeated interactions    715 / 30% of active    575 / 28% of active
Fanbase overlap                    5% of active

Variations : Share of all interactions created by the TOP 10% fans..
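Both the overlap and the "top 10%" variation reduce to simple set and counting operations. The sketch below is one possible reading: it measures overlap against the combined active fans of both pages, which is an assumption because the slides do not say which denominator was used, and the function names are made up.

```python
from collections import Counter

def overlap_share(active_a, active_b):
    """Share of the combined active fans that appear on both pages."""
    a, b = set(active_a), set(active_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_share(interactor_ids, top_fraction=0.10):
    """Share of all interactions created by the most active fans."""
    counts = sorted(Counter(interactor_ids).values(), reverse=True)
    if not counts:
        return 0.0
    k = max(1, round(len(counts) * top_fraction))  # size of the top group
    return sum(counts[:k]) / sum(counts)

# Hypothetical toy data
print(overlap_share(["a", "b", "c"], ["c", "d"]))  # 1 shared id out of 4
ids = ["a"] * 11 + list("bcdefghij")               # 10 fans, 20 interactions
print(top_share(ids))                              # the top fan made 11 of 20
```

Dividing by the active fans of just one page instead of the union would give a page-specific overlap figure.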

How to compute average engagement?

1) You may want to try querying the "insights" table, but this mostly fails for pages other than your own

2) Otherwise you need all the posts with likes, comments (and shares) already aggregated

3) Paste this query into OpenRefine as before and work with the Excel sheet from there:

post_id, like_info,comment_info,share_info from stream where source_id=146991996743 and created_time>1356998400 and actor_id=146991996743 LIMIT 20000&access_token=XXXXX

Tip : Limit the type by adding type in(46,80,128,247) to the where clause so you don't get posts like "group created"


                           Pilsner Urquell     Gambrinus
Average engagement         248                 74
Median engagement          144                 29
10% top-trimmed average    169 / diff of 79    44 / diff of 30
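The three figures can be reproduced with the Python standard library once there is one engagement number per post. The sketch assumes engagement means likes + comments + shares and that the "10% top trimmed" average drops the highest 10% of posts before averaging; both are assumptions, and the numbers are invented.

```python
from statistics import mean, median

def trimmed_top_mean(values, top_fraction=0.10):
    """Mean after dropping the highest top_fraction of values."""
    ordered = sorted(values)
    drop = int(len(ordered) * top_fraction)
    kept = ordered[:len(ordered) - drop] if drop else ordered
    return mean(kept)

# Hypothetical engagement per post (likes + comments + shares)
engagement = [10, 20, 30, 40, 50, 60, 70, 80, 90, 1000]

print(mean(engagement))
print(median(engagement))
print(trimmed_top_mean(engagement))
```

With this toy data the plain average is 145, the median 55 and the trimmed average 50, which shows how a single viral post inflates the average.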

Stand-off again. H3 rejected

Tip : For more precise information, you may want to exclude the top 5% fans to see how much it changes

This may look surprising, especially considering the active fanbase is more or less equal. Seems like the total fanbase does play a role.

Study competitor‘s top posts

Tip : Take the URL of the page and add /posts/ and the post id you get from spreadsheet.

Some conclusions

Followers have a lifespan, some are zombies, some have left Facebook

A large group of active followers is better than a large zombie fanbase => Facebook's EdgeRank has buried your posts for those guys anyway.

You can make up metrics once you have the data > sometimes better to have the data first

The Graph API returns errors all the time, so don't be discouraged..

Step 4 – Sum it up

The dodgy part : Know more about the fans

The fans are well described by their favorites, likes, interests, ...

Facebook ids of fans + Web Scraper

You have someone's Facebook id => you can visit her profile

You have a web scraper (like OpenRefine) => you can visit all the profiles without actually browsing through them

.. And download whatever the browser sees..

It is against Facebook's policies to scrape profile pages en masse, but it's "ok" as a training exercise.

Pete Warden scraped 200 000 000 FB profiles and they let the lawyers off the leash


Step 2 – Preparing data for Outwit Hub

OutWit Hub is a free intelligent scraper (limited amounts of data)

Prepare the links to the Pilsner fans' profiles in a Notepad file like the one below, then open the .txt file in OutWit Hub via File => Open

Step 3 – Creating a scraper in Outwit Hub

Prepare a scraper

1) Go to the „scrapers“ tab

2) Click new

3) Name the scraper somehow

4) Do the rest as below

Get everything starting with --- and ending with

Step 4 – Running the scraper on a couple of links

Step 5 – Calculate Affinity

Count occurrences of individual fanpages in the results and compare them to their occurrence in the total Czech Facebook population of 3 770 000

1) Natural affinity = Total fans of the page / 3 770 000

2) Pilsner affinity = Occurrences in results / Fans of Pilsner

3) Affinity ratio = The ratio of the two

4) Repeat for all fanpages

5) Bring up those where the occurrence is the largest
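The affinity steps above amount to dividing two shares. In this minimal sketch the ratio is taken as Pilsner affinity divided by natural affinity, and sample_size stands for the number of Pilsner fans whose likes were actually collected; both choices are assumptions, and the example numbers are invented.

```python
CZECH_FB_POPULATION = 3_770_000  # figure quoted in the text

def affinity_ratio(page_total_fans, occurrences, sample_size,
                   population=CZECH_FB_POPULATION):
    """How over- or under-represented a fanpage is among the examined fans."""
    natural = page_total_fans / population  # share of all Czech Facebook users
    within = occurrences / sample_size      # share among the fans examined
    return within / natural

# Hypothetical: a page with 377 000 fans appears for 30 of 100 examined fans
ratio = affinity_ratio(377_000, 30, 100)
print(round(ratio, 2))
```

A ratio above 1 means the page is liked more often among the examined fans than in the general population; below 1 means less often.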


Step 6 – Results (sample)

Step 6 – Troubleshooting

a) Go to Preferences > Time Settings and make sure none of the sliders are "in the red". That would result in frequent CAPTCHA checks on most protected servers..

b) Make sure your scraper is targeting the right domain

c) Make sure your "Marker Before" and "Marker After" are actually present on the page..

d) It is becoming easier to program an app than to try to scrape a meaningful amount of data

Thank you. Now to your questions.

Credits for affinity idea : Work by Jan Schmid & Josef Šlerka

Images :

Download all materials at :

By the way, Mark Zuckerberg likes Pilsner Urquell.
