Mining Social Web APIs with IPython Notebook Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com Montréal - 9 April 2014

Mining Social Web APIs with IPython Notebook (PyCon 2014)

From the tutorial description at https://us.pycon.org/2014/schedule/presentation/134/

Description

Social websites such as Twitter, Facebook, LinkedIn, Google+, and GitHub have vast amounts of valuable insights lurking just beneath the surface, and this workshop minimizes the barriers to exploring and mining this valuable data by presenting turn-key examples from the thoroughly revised 2nd Edition of Mining the Social Web.

Abstract

This workshop teaches you fundamental data mining techniques as applied to popular social websites by adapting example code from Mining the Social Web (2nd Edition, O'Reilly 2013) in a tutorial-style step-by-step manner that is designed specifically to accommodate attendees with very little programming or domain experience. This workshop's extensive use of IPython Notebook facilitates interactive learning with turn-key examples against a Vagrant-based virtual machine that takes care of installing all 3rd party dependencies that are needed. The barriers to entry are truly minimal, which allows maximal use of the time to be spent on interactive learning.

The workshop is somewhat broadly designed and acclimates you to mining social data from Twitter, Facebook, LinkedIn, Google+, and GitHub APIs in five corresponding modules with the following memorable approach for each of them:

* Aspire - Set out to answer a question or test a hypothesis as part of a data science experiment
* Acquire - Collect and store the data that you need to answer the question or test the hypothesis
* Analyze - Use fundamental data mining techniques to explore and exploit the data
* Summarize - Present analytical findings in a compact and meaningful way

Each module consists of a brief period in which each attendee will customize the corresponding notebook for the module with their own account credentials, with the remainder of the module devoted to learning what data is available from the API and exercises demonstrating analysis of the data, all from a pre-populated IPython Notebook. Time will be set aside at the end of each module for attendees to hack on the code, discuss examples, and ask any lingering questions.

Page 1: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Mining Social Web APIs with IPython Notebook

Matthew A. Russell - @ptwobrussell - http://MiningTheSocialWeb.com

Montréal - 9 April 2014

1

Page 2: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Intro

2

Page 3: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Hello, My Name Is ... Matthew

3

Background in Computer Science

Data mining & machine learning

CTO @ Digital Reasoning Systems

Data mining; machine learning

Author @ O'Reilly Media

5 published books on technology

Principal @ Zaffra

Selective boutique consulting

Page 4: Mining Social Web APIs with IPython Notebook (PyCon 2014)

4

The only easy day was yesterday

-- Motto of the U.S. Navy SEALs

Page 5: Mining Social Web APIs with IPython Notebook (PyCon 2014)

5

It pays to be a winner

-- Motto of the U.S. Navy SEALs

Page 6: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Transforming Curiosity Into Insight

6

An open source software (OSS) project

http://bit.ly/MiningTheSocialWeb2E

A book

http://bit.ly/135dHfs

Accessible to (virtually) everyone

Virtual machine with turn-key coding templates for data science experiments

Think of the book as "premium" support for the OSS project

Page 7: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Table of Contents (1/2)

Chapter 1 - Mining Twitter: Exploring Trending Topics, Discovering What People Are Talking About, and More

Chapter 2 - Mining Facebook: Analyzing Fan Pages, Examining Friendships, and More

Chapter 3 - Mining LinkedIn: Faceting Job Titles, Clustering Colleagues, and More

Chapter 4 - Mining Google+: Computing Document Similarity, Extracting Collocations, and More

Chapter 5 - Mining Web Pages: Using Natural Language Processing to Understand Human Language, Summarize Blog Posts, and More

Chapter 6 - Mining Mailboxes: Analyzing Who's Talking to Whom About What, How Often, and More

7

Page 8: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Table of Contents (2/2)

Chapter 7 - Mining GitHub: Inspecting Software Collaboration Habits, Building Interest Graphs, and More

Chapter 8 - Mining the Semantically Marked-Up Web: Extracting Microformats, Inferencing over RDF, and More

Chapter 9 - Twitter Cookbook

Appendix A - Information About This Book's Virtual Machine Experience

Appendix B - OAuth Primer

Appendix C - Python and IPython Notebook Tips & Tricks

8

Page 9: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Designed for Pedagogy

Brief Intro

Objectives

API Primer

Analysis Technique(s)

Data Visualization

Recap

Suggested Exercises

Recommended Resources

9

Page 10: Mining Social Web APIs with IPython Notebook (PyCon 2014)

The Social Web Is All the Rage

World population: ~7B people

Facebook: 1.15B users

Twitter: 500M users

Google+: 343M users

LinkedIn: 238M users

~200M+ blogs (conservative estimate)

10

Page 11: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Overview

Intro (5 mins)

Module 1 - Virtual Machine Setup (10 mins)

Module 2 - Mining Twitter (45 mins)

Module 3 - Mining Facebook (30 mins)

BREAK (20 mins)

Module 4 - Mining LinkedIn (30 mins)

Module 5 - Choice: Open Hack (30 mins)

Module 6 - Privacy & Ethics (20 mins)

Module 7 - Final Q&A; Surveys (10 mins)

11

Page 12: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Module Format

~10-15 minutes of exposition

I talk; you listen

~15 minutes of independent (or collaborative) work

You hack while I walk around and help you

~5 minutes of recap with Q&A

You ask; I try to answer

12

Page 13: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Workshop Objective

To send you away as a social web hacker

Broad working knowledge of popular social web APIs

Hands-on experience hacking on social web data with a common toolkit

Not for me to talk to you for 3 straight hours

13

Page 14: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Just a Few More Things

This workshop is...

An adaptation of Mining the Social Web, 2nd Edition

More of a guided hacking session where you follow along (vs a preso)

Wider than it is deep

There's only so much you can do in a few hours

I'm available 24/7 this week (and beyond) to help you be successful

14

Page 15: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Assumptions

At some point in your life, you have

Programmed with Python

Worked with JSON

Made requests and processed responses to/from web servers

Or you want to learn to do these things now...

And you're a quick learner

15

Page 16: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Module 1: Virtual Machine Setup

16

Page 17: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Why do you need a VM?

17

To save time

Because installation and configuration management is harder than it first appears

So that you can focus on the task at hand instead

So that I can support you regardless of your hardware and operating system

Page 18: Mining Social Web APIs with IPython Notebook (PyCon 2014)

But I can do all of that myself...

True...

If you would rather troubleshoot unexpected installation/configuration issues instead of immediately focusing on the real task at hand

At least give it a shot before resorting to your own devices so that you don't have to install specific versions of ~40 Python packages

Including scientific computing tools that require underlying C/C++ code to be compiled

Which requires specific versions of developer libraries to be installed

You get the idea...

18

Page 19: Mining Social Web APIs with IPython Notebook (PyCon 2014)

The Virtual Machine Experience

Vagrant

A nice abstraction around virtual machine providers

One ring to rule them all

VirtualBox, VMware, AWS, ...

IPython Notebook

The easiest way to program with Python

A better REPL (interpreter)

Great for hacking

19

Page 20: Mining Social Web APIs with IPython Notebook (PyCon 2014)

What happens when you vagrant up?

Vagrant follows the instructions in your Vagrantfile

Starts up a VirtualBox instance

Uses Chef to provision it

Installs OS patches/updates

Installs MTSW software dependencies

Starts IPython Notebook server on port 8888

20

Page 21: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Why Should I Use IPython Notebook?

Because it's great for hacking

And hacking is usually the first step

Because it's great for collaboration

Sharing/publishing results is trivial

Because the UX is as easy as working in a notepad

Think of it as "executable paper"

21

Page 22: Mining Social Web APIs with IPython Notebook (PyCon 2014)

22

Page 23: Mining Social Web APIs with IPython Notebook (PyCon 2014)

23

Page 24: Mining Social Web APIs with IPython Notebook (PyCon 2014)

VM Quick Start Instructions

Go to http://MiningTheSocialWeb.com/quick-start/

Follow the instructions

And watch the screencasts!

Basically:

Install VirtualBox & Vagrant

Run "vagrant up" in a terminal to start a guest VM

Then, go to http://localhost:8888 on your host machine's web browser

24

Page 25: Mining Social Web APIs with IPython Notebook (PyCon 2014)

What Could Be Easier?

A hosted version of the VM!

But only for a few hours during this workshop

Because it costs money to run these servers

Go to [See Live Slides for URL] and pick a machine

Do not share the URLs outside of this workshop!

Please don't try to hack the machines

Learn how I arrived at this setup at http://MiningTheSocialWeb.com

25

Page 26: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Module 2: Mining Twitter

26

Page 27: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Objectives

27

Be able to identify Twitter primitives

Understand tweet metadata and how to use it

Learn how to extract entities such as user mentions, hashtags, and URLs from tweets

Apply techniques for performing frequency analysis with Python

Be able to plot histograms of Twitter data with IPython Notebook

Page 28: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Twitter Primitives

28

Account Types: "Anything"

"Following" Relationships

Favorites

Retweets

Replies

(Almost) No Privacy Controls

Page 29: Mining Social Web APIs with IPython Notebook (PyCon 2014)

API Requests

RESTful requests

Everything is a "resource"

You GET, PUT, POST, and DELETE resources

Standard HTTP "verbs"

Example: GET https://api.twitter.com/1.1/statuses/user_timeline.json?screen_name=SocialWebMining

Streaming API filters

JSON responses

Cursors (not quite pagination)
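As a rough illustration of the request shape above, the following sketch builds the URL for a GET against a v1.1-style resource and reads the cursor out of a response. The helper names build_resource_url and next_cursor are hypothetical, and a real request would also need OAuth credentials, which are omitted here.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.twitter.com/1.1"

def build_resource_url(resource, **params):
    """Return the URL for a GET request against a v1.1-style resource."""
    query = urlencode(sorted(params.items()))
    url = "{0}/{1}.json".format(API_ROOT, resource)
    return url + "?" + query if query else url

def next_cursor(response):
    """Return the cursor for the following page, or None when there is none.

    Cursored responses carry a "next_cursor" field; a value of 0 signals
    that no further pages remain.
    """
    return response.get("next_cursor", 0) or None

url = build_resource_url("statuses/user_timeline", screen_name="SocialWebMining")
print(url)
```

Cursors differ from page numbers in that each response tells you which cursor value to send next, so a client keeps requesting until next_cursor returns None.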

29

Page 30: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Twitter is an Interest Graph

30

Roberto Mercedes

Jorge

Ana

Nina

Johnny Araya

Rodolfo Hernández

Page 31: Mining Social Web APIs with IPython Notebook (PyCon 2014)

What's in a Tweet?

31

140 Characters ...

... Plus ~5KB of metadata!

Authorship

Time & location

Tweet "entities"

Replying, retweeting, favoriting, etc.

Page 32: Mining Social Web APIs with IPython Notebook (PyCon 2014)

What are Tweet Entities?

Essentially, the "easy to get at" data in the 140 characters

@usermentions

#hashtags

URLs

multiple variations

(financial) symbols

stock tickers

media
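The entity types listed above arrive pre-parsed in a tweet's "entities" field, so no regular expressions are needed. Here is a minimal sketch that pulls them out; the sample tweet is made up for illustration and only includes the fields the function reads, and extract_entities is a hypothetical helper name.

```python
def extract_entities(tweet):
    """Collect the entity values from a v1.1-style tweet dictionary."""
    entities = tweet.get("entities", {})
    return {
        "user_mentions": [m["screen_name"] for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
        # Each URL entity has several variations (url, expanded_url,
        # display_url); this sketch keeps the expanded form.
        "urls": [u["expanded_url"] for u in entities.get("urls", [])],
        "symbols": [s["text"] for s in entities.get("symbols", [])],
        "media": [m["media_url"] for m in entities.get("media", [])],
    }

# Made-up sample data shaped like an API response
sample_tweet = {
    "text": "Slides from @ptwobrussell's #PyCon tutorial: http://t.co/abc $ORLY",
    "entities": {
        "user_mentions": [{"screen_name": "ptwobrussell"}],
        "hashtags": [{"text": "PyCon"}],
        "urls": [{
            "url": "http://t.co/abc",
            "expanded_url": "http://MiningTheSocialWeb.com",
            "display_url": "MiningTheSocialWeb.com",
        }],
        "symbols": [{"text": "ORLY"}],
    },
}

found = extract_entities(sample_tweet)
print(found)
```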

32

Page 33: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Data Mining = Curiosity + Stats

Curiosity

Interests, desires, and intuitions

Statistics

Counting

Comparing

Filtering

Ranking

Hypothesis testing; knowledge discovery

33

Page 34: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Histograms

A chart that is handy for frequency analysis

They look like bar charts...except they're not bar charts

Each value on the x-axis is a range (or "bin") of values

Not categorical data

Each value on the y-axis is the combined frequency of values in each range
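To make the binning idea concrete, here is a small sketch that groups values into fixed-width ranges and prints a text histogram. The retweet counts are invented for illustration, and the histogram helper is a hypothetical name; in the notebook you would more likely hand the raw values to a plotting library.

```python
from collections import Counter

def histogram(values, bin_width):
    """Map each bin's lower bound to the number of values that fall in it."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Made-up retweet counts for a handful of tweets
retweet_counts = [0, 1, 1, 3, 4, 7, 12, 12, 15, 18, 23, 41]
bin_width = 10

bins = histogram(retweet_counts, bin_width)
for lower, frequency in bins.items():
    # Each row is one range on the x-axis; the bar length is the y-axis value
    print("{0:>3}-{1:<3} {2}".format(lower, lower + bin_width - 1, "#" * frequency))
```

Note that this sketch simply omits empty ranges (here 30-39), whereas a plotted histogram would show them as gaps.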

34

Page 35: Mining Social Web APIs with IPython Notebook (PyCon 2014)

35

Example: Histogram of Retweets

Page 36: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Social Media Analysis Framework

A memorable four-step process to guide data science experiments:

Aspire

To test a hypothesis (answer a question)

Acquire

Get the data

Analyze

Count things

Summarize

Plot the results
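The four steps can be sketched as one small function per step. Everything below is illustrative: the tweets are made-up stand-ins for API results, the function names are hypothetical, and the "plot" is reduced to text rows so the example runs without any plotting library.

```python
from collections import Counter

def acquire():
    """Acquire: return the tweets to study (made-up stand-ins for API results)."""
    return [
        {"text": "Off to #PyCon to talk #Python", "hashtags": ["PyCon", "Python"]},
        {"text": "Tutorials day at #pycon", "hashtags": ["pycon"]},
        {"text": "#Python plus #IPython is fun", "hashtags": ["Python", "IPython"]},
        {"text": "See you at #PyCon", "hashtags": ["PyCon"]},
    ]

def analyze(tweets):
    """Analyze: count hashtags, ignoring case."""
    return Counter(tag.lower() for tweet in tweets for tag in tweet["hashtags"])

def summarize(counts, n=2):
    """Summarize: render the n most frequent hashtags as text rows."""
    return ["{0:<8} {1}".format(tag, count) for tag, count in counts.most_common(n)]

# Aspire: which hashtags dominate this (tiny) sample of tweets?
counts = analyze(acquire())
for row in summarize(counts):
    print(row)
```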

36

Page 37: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Exercises

Review Python idioms in the "Appendix C (Python Tips & Tricks)" notebook

Follow the setup instructions in the "Chapter 1 (Mining Twitter)" notebook

Fill in Example 1-1 with credentials and begin work

Execute each example sequentially

Customize queries

Explore tweet metadata; count tweet entities; plot histograms of results

Explore the "Chapter 9 (Twitter Cookbook)" notebook

Think of it as a collection of building blocks

37

Page 38: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Module 3: Mining Facebook

38

Page 39: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Objectives

39

Be able to identify Facebook primitives

Learn about Facebook’s Social Graph API and how to make API requests

Understand how Open Graph protocol extends Facebook's Social Graph API

Be able to analyze likes from Facebook pages and friends

Page 40: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Facebook Primitives

Account Types: People & Pages

Mutual Connections

Likes

Shares

Comments

Extensive Privacy Controls

40

Page 41: Mining Social Web APIs with IPython Notebook (PyCon 2014)

API Requests

Social Graph API requests

Not RESTful but easy to learn and use

Special "field expansion" syntax

Example: GET http://graph.facebook.com/ptwobrussell/?fields=id,name,friends.fields(likes.limit(10))

JSON responses

Traditional pagination
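The field expansion syntax and the pagination model above can be sketched as follows. The helpers expand and iter_pages are hypothetical names, the fetch argument stands in for whatever performs the HTTP request and decodes the JSON, and the demo pages are made up; a real request would also need an access token.

```python
def expand(field, *subfields, limit=None):
    """Render one field in field-expansion syntax, e.g. likes.limit(10)."""
    rendered = field
    if subfields:
        rendered += ".fields({0})".format(",".join(subfields))
    if limit is not None:
        rendered += ".limit({0})".format(limit)
    return rendered

def iter_pages(first_url, fetch):
    """Yield every item in "data", following "paging.next" links until none remain."""
    url = first_url
    while url:
        page = fetch(url)
        for item in page.get("data", []):
            yield item
        url = page.get("paging", {}).get("next")

fields = ",".join(["id", "name", expand("friends", expand("likes", limit=10))])
url = "http://graph.facebook.com/ptwobrussell/?fields=" + fields
print(url)

# Made-up responses keyed by URL, standing in for real HTTP calls
fake_pages = {
    "page1": {"data": [{"name": "A"}, {"name": "B"}], "paging": {"next": "page2"}},
    "page2": {"data": [{"name": "C"}]},
}
names = [item["name"] for item in iter_pages("page1", fake_pages.__getitem__)]
print(names)
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.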

41

Page 42: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Facebook is an Interest Graph

42

Roberto Mercedes

Jorge

Ana

Nina

Johnny Araya

Rodolfo Hernández

Page 43: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Facebook API Explorer

43

Go to https://developers.facebook.com/tools/explorer

Really, go there right now...

Page 44: Mining Social Web APIs with IPython Notebook (PyCon 2014)

44

Retrieve Your Likes

Page 45: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Facebook Permissions

45

Page 46: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Facebook Permissions

46

Page 47: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Explore Facebook Pages

47

Names of pages

MiningTheSocialWeb

CrossFit

OReilly

Web URLs (OGP extensions to Facebook's Social Graph)

http://www.imdb.com/title/tt0117500

Page 48: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Social Media Analysis Framework

Recall the same four step process to guide data science experiments:

Aspire

Acquire

Analyze

Summarize

48

Page 49: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Social Network Diagram with D3

49

Page 50: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Exercises

Copy/paste your access token from the Graph API Explorer into the "Chapter 2 (Mining Facebook)" notebook

Paste the value and execute the cell just before Example 2-1

Execute examples sequentially (try to at least make it to Example 2-10)

Analyze your likes, your friends and likes from pages of interest

If you have time...

Remaining examples

50

Page 51: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Module 4: Mining LinkedIn

51

Page 52: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Objectives

52

Learn about LinkedIn’s Developer Platform

Understand how clustering works

A fundamental type of machine learning

Be able to employ geocoding services to arrive at a set of coordinates from a textual reference to a location

Visualize geographic data with cartograms

Page 53: Mining Social Web APIs with IPython Notebook (PyCon 2014)

LinkedIn Primitives

Account Types: People, Companies

The data seems "more closely held" than Facebook or Twitter

No FOAF visibility

Richest data source

Profile descriptions from mutual connections

A little messier than it first appears

Not necessarily a bad thing

53

Page 54: Mining Social Web APIs with IPython Notebook (PyCon 2014)

API Requests

(Strangely) RESTful Requests

Not really RESTful

Field selector syntax

http://api.linkedin.com/v1/people/~:(first-name,last-name,headline,picture-url)

XML responses

CSV address book download
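The field selector syntax in the example URL above is just a parenthesized, comma-separated list appended to the resource path. A minimal sketch that assembles it (people_url is a hypothetical helper name, and authentication is omitted):

```python
def people_url(member="~", fields=()):
    """Build a v1-style people URL; "~" refers to the authenticated member."""
    url = "http://api.linkedin.com/v1/people/" + member
    if fields:
        url += ":({0})".format(",".join(fields))
    return url

profile_url = people_url(fields=["first-name", "last-name", "headline", "picture-url"])
print(profile_url)
```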

54

Page 55: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Is LinkedIn an Interest Graph?

Fundamentally: yes. But not so much at the developer API level

Less trivial to find some of the "pivots"

No Skills API (yet?)

But the data is there (mostly in profile descriptions) for your direct connections

Companies, job titles, job descriptions

Lots of richness is tucked away in human language data

55

Page 56: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Clustering

An unsupervised machine learning technique

Think: an algorithm that organizes the data into partitions

56

Page 57: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Example: Clustered Job Titles

57

Page 58: Mining Social Web APIs with IPython Notebook (PyCon 2014)

3 Steps to Clustering Your Data

Normalization

Compare (similarity/distance measurement)

n-grams, edit distance, and Jaccard are common, but your imagination is the limit

Why can't you just compare everything to everything?

Dimensionality Reduction

Ideally, your clustering algorithm will mitigate the pain

k-means is among the most common clustering techniques in use
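Two of the points above lend themselves to a quick sketch: normalizing job titles before comparing them, and why comparing everything to everything gets expensive. The abbreviation table is a made-up assumption for illustration, and normalize and pair_count are hypothetical helper names.

```python
# Hypothetical abbreviation table; a real one would grow with your data
ABBREVIATIONS = {
    "sr.": "senior",
    "sr": "senior",
    "jr.": "junior",
    "vp": "vice president",
    "cto": "chief technology officer",
}

def normalize(title):
    """Lowercase a job title, drop commas, and expand known abbreviations."""
    tokens = title.lower().replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

def pair_count(n):
    """Number of distinct pairs among n items: n * (n - 1) / 2."""
    return n * (n - 1) // 2

print(normalize("Sr. Software Engineer"))
print(normalize("VP, Engineering"))
print(pair_count(1000))
```

Because the number of pairs grows quadratically, even a modest contact list produces a large number of comparisons, which is the motivation for reducing dimensionality or choosing an algorithm that avoids all-pairs work.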

58

Page 59: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Jaccard Similarity

59
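Jaccard similarity is the size of the intersection of two sets divided by the size of their union. A minimal sketch over the words in two job titles follows; treating two empty sets as having similarity 0.0 is a choice made here, not something the slides specify.

```python
def jaccard(a, b):
    """Return |A intersect B| / |A union B| for two iterables of hashable items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(union)

title1 = "chief technology officer".split()
title2 = "chief executive officer".split()
similarity = jaccard(title1, title2)
print(similarity)
```

The two titles share 2 of 4 distinct words, so the score is 0.5; identical sets score 1.0 and disjoint sets score 0.0.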

Page 60: Mining Social Web APIs with IPython Notebook (PyCon 2014)

k-Means Explained

1. Randomly pick k points in the data space as initial values that will be used to compute the k clusters: K1, K2, ..., Kk.

2. Assign each of the n points to a cluster by finding the nearest Ki, effectively creating k clusters and requiring k*n comparisons.

3. For each of the k clusters, calculate the centroid, or the mean of the cluster, and reassign its Ki value to be that value. (Hence, you’re computing “k-means” during each iteration of the algorithm.)

4. Repeat steps 2–3 until the members of the clusters do not change between iterations. Generally speaking, relatively few iterations are required for convergence.
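The four steps above can be sketched in plain Python. This is an illustrative implementation under a few assumptions: initial values are sampled from the data points themselves, squared Euclidean distance is the comparison, and a cluster that ends up empty keeps its previous value. The points are made up.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean (centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iterations=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: pick k initial values
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest centroid (k*n comparisons)
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Step 4: stop once cluster membership no longer changes
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each centroid to the mean of its cluster
        for i in range(k):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[i] = mean(members)
    return centroids, assignment

# Made-up data: two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

Which cluster gets label 0 depends on the random initialization, so only the grouping of points is meaningful, not the label values themselves.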

60

Page 61: Mining Social Web APIs with IPython Notebook (PyCon 2014)

k-Means: Initialize

61

Page 62: Mining Social Web APIs with IPython Notebook (PyCon 2014)

k-Means: Step 1

62

Page 63: Mining Social Web APIs with IPython Notebook (PyCon 2014)

k-Means: Step 2

63

Page 64: Mining Social Web APIs with IPython Notebook (PyCon 2014)

k-Means: Step 3

64

Page 65: Mining Social Web APIs with IPython Notebook (PyCon 2014)

k-Means: (Fast-Forward) Step 9

65

Page 66: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Geocoding

Transforming a location to a set of coordinates

Nashville, TN => (36.16783905029297, -86.77816009521484)

A harder problem than it first appears

The Bing API is especially generous

Requires an account sign up: http://bingmapsportal.com

Use the API key with the geopy package
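A hedged sketch of the geopy usage mentioned above. It assumes a reasonably recent geopy release in which geocode() returns an object with latitude and longitude attributes (or None when nothing matches). The helper names are hypothetical, and the stub geocoder exists only so the example runs offline without an API key.

```python
from types import SimpleNamespace

def make_bing_geocoder(api_key):
    """Create a geopy Bing geocoder (geopy is imported only when called)."""
    from geopy.geocoders import Bing
    return Bing(api_key=api_key)

def to_coordinates(geocoder, place):
    """Return (latitude, longitude) for a place name, or None if not found."""
    location = geocoder.geocode(place)
    if location is None:
        return None
    return (location.latitude, location.longitude)

class StubGeocoder:
    """Offline stand-in for a real geocoder, for illustration only."""
    KNOWN = {"Nashville, TN": (36.16783905029297, -86.77816009521484)}

    def geocode(self, place):
        if place not in self.KNOWN:
            return None
        lat, lon = self.KNOWN[place]
        return SimpleNamespace(latitude=lat, longitude=lon)

# With a real key: geocoder = make_bing_geocoder("YOUR_BING_MAPS_KEY")
geocoder = StubGeocoder()
print(to_coordinates(geocoder, "Nashville, TN"))
print(to_coordinates(geocoder, "Nowhere in particular"))
```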

66

Page 67: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Introducing: The Dorling Cartogram

67

Page 68: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Social Media Analysis Framework

Remember: Use the same four step process to guide data science experiments:

Aspire

Acquire

Analyze

Summarize

68

Page 69: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Exercises

Follow the instructions in the "Chapter 3 (Mining LinkedIn)" notebook to create an API connection and follow along with the first few examples

Download your connections as a CSV file from http://www.linkedin.com/people/export-settings and save them to your VM

A deviation from instructions in Example 3-6 is necessary for remote VMs

See http://bit.ly/mtsw-ch03-helper-code

Create a Bing Maps portal account and get your API key for Examples 3-8 and beyond

Try clustering your contacts in Example 3-12

Try Example 3-13 (visualizing data in Google Earth) at home...

69

Page 70: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Module 5: Choice

70

Page 71: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Objectives

71

To work on "loose ends" or areas of interest from previous modules

To hack on code in notebooks not yet encountered

To set up the virtual machine on your own box if you haven't yet

To collaborate/talk and otherwise make the most of our togetherness

Page 72: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Social Media Analysis Framework

Remember:

Aspire

Acquire

Analyze

Summarize

72

Page 73: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Recommendations

Set up your own development environment if you haven't already

Appendix A

Text Mining & Natural Language Processing

Chapter 4 (Mining Google+) & Chapter 5 (Mining Web Pages)

Graph Mining

Chapter 7 (Mining GitHub)

Analyzing Semantic Markup

Chapter 8 (Mining the Semantically Marked-Up Web)

73

Page 74: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Module 6: Privacy & Ethics

74

Page 75: Mining Social Web APIs with IPython Notebook (PyCon 2014)

75

Know thy data, and know thyself

--Matthew A. Russell

Page 76: Mining Social Web APIs with IPython Notebook (PyCon 2014)

76

If we have data, let’s look at data. If we have opinions, let’s go with mine

--Jim Barksdale

Page 77: Mining Social Web APIs with IPython Notebook (PyCon 2014)

77

In God we trust. All others must bring data

--W. Edwards Deming

Page 78: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Communication => Data

Communication

Senders

humans & machines

Messages

natural language, images, videos, etc.

Recipients

humans & machines

78

Page 79: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Data Alchemy

Data: Documents & document fragments (text messages, etc.)

Information: "Assertions", summaries, tags, etc.

Knowledge: Aggregated, queryable information

Wisdom: “Compressed” knowledge

Gold: Money

79

Page 80: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Machine Learning

80

A program that learns (improves) from experience (data) according to some objective

Supervised learning

Unsupervised learning

Reinforcement learning

How to do it

Program mathematical models and hope for the best...

How to do it well

Program state-of-the-art mathematical models with sufficient representative data

Page 81: Mining Social Web APIs with IPython Notebook (PyCon 2014)

81

Knowledge is a process of piling up facts; wisdom lies in their simplification

--Martin Fischer

Page 82: Mining Social Web APIs with IPython Notebook (PyCon 2014)

82

Any sufficiently advanced technology is indistinguishable from magic

--Arthur C. Clarke

Page 83: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Is Privacy Already an Illusion?

83

Digital happenings circa 2014

The Cloud

Social Media

Deep Learning

The Internet of Things

Internet.org

Page 84: Mining Social Web APIs with IPython Notebook (PyCon 2014)

84

Civilization is the progress toward a society of privacy...

-- Ayn Rand

Page 85: Mining Social Web APIs with IPython Notebook (PyCon 2014)

85

If you have something that you don’t want anyone to know, maybe you shouldn’t be doing it in the first place.

-- Eric Schmidt, (former) CEO of Google

Page 86: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Influences on Ethics

Capitalism, economics, & marketing

A for-profit corporation's fiduciary duty: To maximize the common stock's value

How to do it? By transacting commerce

How to do it well? By advertising more effectively than competitors

How to do it really well? With highly relevant personalized ads (recommenders)

Terms of Service (ToS) - The legal extent of ethical obligations?

86

Page 87: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Module 7: Final Q&A; Survey

87

Survey Link:

https://www.surveymonkey.com/s/pycon2014_tutorials

Page 88: Mining Social Web APIs with IPython Notebook (PyCon 2014)

Free Stuff

http://MiningTheSocialWeb.com

Mining the Social Web 2E Chapter 1 (Chimera)

http://bit.ly/13XgNWR

Source Code (GitHub)

http://bit.ly/MiningTheSocialWeb2E

http://bit.ly/1fVf5ej (numbered examples)

Screencasts (Vimeo)

http://bit.ly/mtsw2e-screencasts

88