Social Data

Toby Segaran
Author, Programming Collective Intelligence
Data Magnate, Metaweb Technologies


Data mining?

“Sorting through data* to identify patterns and establish relationships”

* usually a lot of data

Where and why?

Methods and examples

Where and why?

• Targeted Advertising

• Recommendations

• Search Results

• Group Discovery

• Filtering of Documents

• Theme Extraction

Google ad

Facebook ad

This is strange...

• Google just has text

• Facebook knows more about me

• But it’s taking a few cues...

Status: “engaged”

Where and why?

• Targeted Advertising

• Recommendations

• Search Results

• Group Discovery

• Filtering of Documents

• Theme Extraction

Real Amazon Products

Netflix Prize

Strands Contest

Custom News


Where and why?

• Targeted Advertising

• Recommendations

• Search Results

• Group Discovery

• Filtering of Documents

• Theme Extraction

Ranking algorithms

The now-incredibly-famous paper

Ranking algorithms

• Google begins tracking clicks in 2005

• MSN search claims neural network

• AOL Data Scandal

Learning behavior

Where and why?

• Targeted Advertising

• Recommendations

• Search Results

• Group Discovery

• Filtering of Documents

• Theme Extraction

In Biology

Page grouping

News stories

Where and why?

• Targeted Advertising

• Recommendations

• Search Results

• Group Discovery

• Filtering of Documents

• Theme Extraction

The obvious: spam

SpamBayes

Other email uses

Web documents

“As you add information to Twine, it is automatically tagged so that you and others can find it more easily”

Where and why?

• Targeted Advertising

• Recommendations

• Search Results

• Group Discovery

• Filtering of Documents

• Theme Extraction

What is the buzz?

Customer Community

Where and why?

Methods and examples

Methods and Examples

• Bayesian Filtering

• Distance Metrics

• Clustering

• Decision Trees

• Network Analysis

• Feature Extraction

Bayesian Filtering

schoolwork, algorithm

v1agra, trades, associate
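One common way to build a filter like this is a naive Bayes classifier over word counts. The sketch below is only an illustration of that idea, not the method of any particular product: the class name, the "spam" and "ham" labels, the training sentences, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import defaultdict


class NaiveBayesFilter:
    """Toy word-based naive Bayes classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))  # label -> word -> count
        self.doc_counts = defaultdict(int)                        # label -> documents seen
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in text.lower().split():
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def log_score(self, text, label):
        # log P(label) + sum of log P(word | label), with add-one smoothing
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        for word in text.lower().split():
            count = self.word_counts[label].get(word, 0)
            score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def classify(self, text):
        # Pick the label with the highest log score
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Hypothetical training data built around the words on the slide
f = NaiveBayesFilter()
f.train("cheap v1agra trades for every associate", "spam")
f.train("v1agra trades online", "spam")
f.train("schoolwork on the sorting algorithm", "ham")
f.train("algorithm notes for schoolwork", "ham")

print(f.classify("v1agra trades"))
print(f.classify("algorithm schoolwork"))
```

Working in log space avoids underflow when many small probabilities are multiplied together.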

Craigslist personals

Analysis

Five Cities

W4M Personal Ads

Results

New York: Mets, Lounges, Offense, Desires, Musical, Submissive, Create, Song, Oral

Boston: Pink, Sox, Poetry, Intellectually, Punk, Appreciation, Exercise, Winter, Education

Chicago: Cubs, Burbs, Bears, Girlie, Insecure, Cheat, Importance, Blunt, Mouth

Results

Los Angeles: Excellent, Vegas, Meaningful, Star, Lame, Industry, Heat, Fitness, Entertainment, Latino

San Francisco: Tee, Employment, Picnic, STD, Tasting, Hikes, French, .com, Kayaking, Cycling

Methods and Examples

• Bayesian Filtering

• Distance Metrics

• Clustering

• Decision Trees

• Network Analysis

• Feature Extraction

Preference distance

Sarah Marshall   Leatherheads
3                3
2                3
1                5
2                5

[Chart: ratings plotted on two axes, each running from 1 to 5; distances marked: 1 and 2.23]

For recommendations

[Chart: same 1-to-5 axes, with distances 1 and 2.23 marked]

Known ratings: Prom Night: 5 and Prom Night: 2

Unknown rating: ? (first slide), 4.1 (second slide)
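The numbers on these slides can be reproduced with a Euclidean distance and a distance-weighted average. The sketch below is one reading of the slide, not a statement of the author's method: the three rating pairs and the 1/distance weighting are assumptions, chosen because they give distances of 1 and about 2.23 and an estimate that rounds to 4.1.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(neighbors):
    """Inverse-distance weighted average of (distance, rating) pairs.

    Assumes every distance is greater than zero.
    """
    weights = [1.0 / d for d, _ in neighbors]
    ratings = [r for _, r in neighbors]
    return sum(w * r for w, r in zip(weights, ratings)) / sum(weights)


# Hypothetical (Sarah Marshall, Leatherheads) ratings
me = (3, 3)
near = (2, 3)
far = (2, 5)

d_near = euclidean(me, near)   # 1.0
d_far = euclidean(me, far)     # sqrt(5), about 2.24

# Assumption: the closer person rated Prom Night 5, the farther one rated it 2
estimate = predict([(d_near, 5), (d_far, 2)])
print(round(estimate, 1))
```

The closer person's rating counts for more, which is why the estimate sits much nearer to 5 than to 2.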

Linguistic distance

All words: The, Six, Degrees, Hypothesis, Experienced, It, Is, When, You, Travel

Words kept: Six, Degrees, Hypothesis, Experienced, Travel

Word counts: Six 3, Degrees 3, Hypothesis 1, Experienced 5, Travel 6
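Turning text into counts like these usually takes three steps: lowercase, drop very common words, count what is left. The sketch below assumes a tiny hypothetical stop-word list and a made-up sentence built from the slide's words, so its counts will not match the slide.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer
STOP_WORDS = {"the", "it", "is", "when", "you"}


def word_counts(text):
    """Lowercase the text, pull out runs of letters, drop stop words, count the rest."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


counts = word_counts("The Six Degrees hypothesis: experienced it is when you travel")
print(counts)
```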

Linguistic distance

                   “china”  “kids”  “music”  “travel”  “yahoo”
Gothamist             0        3       3        3         0
GigaOM                6        0       1        4         2
Quick Online Tips     0        2       2        0        12
O’Reilly Radar        1        0       3        6         4

Linguistic distance

                   “china”  “kids”  “music”  “yahoo”
Gothamist             0        3       3        0
GigaOM                6        0       1        2
Quick Online Tips     0        2       2       12

Euclidean “as the crow flies” = 12 (approx)
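As a rough check, the Euclidean distance between every pair of blogs in the four-word table can be computed directly. The sketch assumes each blog is just its row of counts; two of the three pairs come out close to 12, so it does not say which pair the slide's figure refers to.

```python
import math
from itertools import combinations

# Word counts from the four-column table, in the order china, kids, music, yahoo
blogs = {
    "Gothamist": [0, 3, 3, 0],
    "GigaOM": [6, 0, 1, 2],
    "Quick Online Tips": [0, 2, 2, 12],
}


def euclidean(a, b):
    """Square root of the summed squared differences between two count rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


distances = {
    (a, b): euclidean(blogs[a], blogs[b]) for a, b in combinations(blogs, 2)
}
for pair, d in distances.items():
    print(pair, round(d, 2))
```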

Article/blog similarity

Valleywag - Huffington > Slashdot - Wired

Methods and Examples

• Bayesian Filtering

• Distance Metrics

• Clustering

• Decision Trees

• Network Analysis

• Feature Extraction

Hierarchical Clustering

[Chart sequence, four steps: points plotted on two axes, each running from 1 to 5]

Hierarchical Clustering
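A bottom-up version of hierarchical clustering can be sketched in a few lines: start with every point as its own cluster, then keep merging the two closest clusters. The points below are hypothetical, and measuring closeness by the distance between cluster centroids is an assumption; other linkage rules are equally common.

```python
import math
from itertools import combinations


def centroid(points):
    """Average position of a list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def hierarchical_cluster(points):
    """Repeatedly merge the two closest clusters until one remains.

    Returns a nested tuple describing the merge tree.
    """
    # Each cluster is (tree, member_points)
    clusters = [(p, [p]) for p in points]
    while len(clusters) > 1:
        # Find the pair of clusters whose centroids are closest
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]][1]),
                                     centroid(clusters[ij[1]][1])),
        )
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Hypothetical points on a 1-to-5 grid
points = [(1, 1), (1, 2), (4, 4), (5, 4), (5, 5)]
tree = hierarchical_cluster(points)
print(tree)
```

The nested tuple plays the role of a dendrogram: pairs that were merged early sit deepest in the tree.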

Grouping bloggers


Grouping articles

Methods and Examples

• Bayesian Filtering

• Distance Metrics

• Clustering

• Decision Trees

• Network Analysis

• Feature Extraction

Decision Trees

CART Algorithm

Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

From any dataset...

CART Algorithm

Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

... find the best split ...

Type is C?
  Yes: Avg = 4.5
  No:  Avg = 2.1

CART Algorithm

Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

... and repeat.

Type is C?
  Yes: Duracell?
    Yes: 4
    No:  5
  No: Duracell?
    Yes: 2
    No:  2.2
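The "find the best split" step can be sketched on the battery table. The version below assumes the usual regression-tree criterion, picking the test that leaves the lowest size-weighted variance in battery life; the slide does not say which criterion it uses, and the function names are made up for the example.

```python
rows = [
    ("Duracell", "C", 4.0),
    ("Energizer", "C", 5.0),
    ("Duracell", "AA", 2.0),
    ("Energizer", "AA", 2.2),
]
COLUMNS = ["Brand", "Type"]


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def best_split(rows):
    """Try every (column, value) test and keep the one that leaves the
    lowest size-weighted variance in battery life."""
    best = None
    for col in range(len(COLUMNS)):
        # dict.fromkeys keeps values in order of first appearance
        for value in dict.fromkeys(row[col] for row in rows):
            yes = [r[-1] for r in rows if r[col] == value]
            no = [r[-1] for r in rows if r[col] != value]
            if not yes or not no:
                continue
            score = (len(yes) * variance(yes) + len(no) * variance(no)) / len(rows)
            if best is None or score < best[0]:
                best = (score, COLUMNS[col], value, mean(yes), mean(no))
    return best


score, column, value, avg_yes, avg_no = best_split(rows)
print(f"{column} is {value}? yes avg={avg_yes:.1f}, no avg={avg_no:.1f}")
```

The "and repeat" step would call `best_split` again on each side of the split until the groups can no longer be divided.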

Hot or Not


Methods and Examples

• Bayesian Filtering

• Distance Metrics

• Clustering

• Decision Trees

• Network Analysis

• Feature Extraction

A network

[Diagram: six nodes labeled A, B, C, D, E, F]

PageRank

[Diagram: the same network, every node starting at 1.0]

D = 0.15 + 0.85*E/1 + 0.85*F/2 + 0.85*B/1 = 2.275

[Diagram sequence: node scores after successive updates, listed in the order they appear on each slide]

0.58, 0.58, 1.0, 2.275, 1.0, 0.15
0.58, 0.58, 2.08, 1.56, 0.3, 0.15
1.03, 1.03, 1.48, 1.56, 0.3, 0.15
0.78, 0.78, 1.48, 1.34, 0.3, 0.15
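The update rule in the formula for D can be applied to every page in turn. In the sketch below only D's inbound links (from B, E, and F, with F linking to two pages) come from the slide; the rest of the link structure is hypothetical, and all pages are updated at once from the previous scores, which may differ from the order used on the slides.

```python
DAMPING = 0.85

# Hypothetical link structure. Only D's inbound links (from B, E and F,
# with F linking to two pages) are taken from the slide's formula.
links = {
    "A": ["C"],
    "B": ["D"],
    "C": ["A"],
    "D": ["C"],
    "E": ["D"],
    "F": ["D", "A"],
}


def pagerank_step(scores):
    """One synchronous update: every page is recomputed from the old scores."""
    new_scores = {}
    for page in links:
        inbound = [src for src, outs in links.items() if page in outs]
        new_scores[page] = (1 - DAMPING) + DAMPING * sum(
            scores[src] / len(links[src]) for src in inbound
        )
    return new_scores


scores = {page: 1.0 for page in links}
first = pagerank_step(scores)
print(round(first["D"], 3))

# Keep iterating so the scores can settle
for _ in range(50):
    scores = pagerank_step(scores)
```

A page with no inbound links, such as F here, drops to the 0.15 floor after the first update and stays there.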

CI FOO participants

Science papers

The paper attempts to provide an alternative method for measuring the importance of scientific papers based on the Google's PageRank. The method is a meaningful extension of the common integer counting of citations and is then experimented for bringing PageRank to the citation analysis in a large citation network. It offers a more integrated picture of the publications' influence in a specific field.

Bringing PageRank to the citation analysis

Clustering coefficient

“How many of each person's friends are friends with each other?”

Clustering coefficient

[Diagram: nodes A, B, C, D, E, F]

Low clustering coefficient

Clustering coefficient

[Diagram: nodes A, B, C, D, E, F]

High clustering coefficient: “small world graph”
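The question on the slide translates almost directly into code: for each person, count how many pairs of their friends are also friends with each other. The two graphs below are hypothetical examples on the slide's node names, not the graphs drawn on the slides, and scoring a person with fewer than two friends as 0 is an assumed convention.

```python
from itertools import combinations


def clustering_coefficient(graph, node):
    """Fraction of pairs of node's friends that are friends with each other."""
    friends = graph[node]
    if len(friends) < 2:
        return 0.0
    pairs = list(combinations(sorted(friends), 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)


def average_clustering(graph):
    """Mean of the per-node coefficients."""
    return sum(clustering_coefficient(graph, n) for n in graph) / len(graph)


def make_graph(edges):
    """Build an undirected adjacency map from a list of friendships."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Hypothetical friendship graphs
star = make_graph([("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("A", "F")])
triangles = make_graph([("A", "B"), ("B", "C"), ("A", "C"),
                        ("D", "E"), ("E", "F"), ("D", "F"), ("C", "D")])

print(average_clustering(star))
print(round(average_clustering(triangles), 2))
```

In the star, nobody's friends know each other, so the coefficient is zero; in the two linked triangles, most friends of friends are friends.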

Twitter!


Methods and Examples

• Bayesian Filtering

• Distance Metrics

• Clustering

• Decision Trees

• Network Analysis

• Feature Extraction

Independent Features

Message boards


Matrix Factorization

      Msg1  Msg2  Msg3  Msg4  Msg5
F1      1     0     2     3     0
F2      0     2     1     1     3
F3      1     0     2     0     0

            F1  F2  F3
Gym          0   1   2
Calorie      2   0   1
Weigh        2   2   1
Carbs        1   0   3
Treadmill    0   1   2

Features Matrix x Weight Matrix

Current Guess

            Msg1  Msg2  Msg3  Msg4  Msg5
Gym           1     3     3     0     1
Calorie       0     2     4     1     3
Weigh         2     3     1     0     1
Carbs         0     1     1     0     2
Treadmill     3     2     0     2     2

Matrix Factorization

      Msg1  Msg2  Msg3  Msg4  Msg5
F1      1     0     2     3     0
F2      0     2     1     1     3
F3      1     0     2     0     0

            F1  F2  F3
Gym          0   1   2
Calorie      2   0   1
Weigh        2   2   1
Carbs        1   0   3
Treadmill    0   1   2

Features Matrix x Weight Matrix

Target Result

            Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0

Current Guess

            Msg1  Msg2  Msg3  Msg4  Msg5
Gym           1     3     3     0     1
Calorie       0     2     4     1     3
Weigh         2     3     1     0     1
Carbs         0     1     1     0     2
Treadmill     3     2     0     2     2

Matrix Factorization

      Msg1  Msg2  Msg3  Msg4  Msg5
F1      2     0     0     1     0
F2      0     2     0     1     3
F3      1     0     1     0     0

            F1  F2  F3
Gym          1   0   0
Calorie      0   1   1
Weigh        0   0   2
Carbs        0   1   0
Treadmill    1   0   0

Features Matrix x Weight Matrix

Target Result

            Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0

Current Guess

            Msg1  Msg2  Msg3  Msg4  Msg5
Gym           2     0     0     3     0
Calorie       0     2     1     1     3
Weigh         1     0     2     0     0
Carbs         0     3     0     0     2
Treadmill     1     0     0     2     0
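The guess-and-improve loop on these slides can be sketched with non-negative matrix factorization. The slides do not say which update rule is used; the version below assumes the standard multiplicative update rules, a random starting guess, and three features. The small whole numbers on the slides are illustrative, so a real run gives fractional values.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cost(v, guess):
    """Sum of squared differences between target and current guess."""
    return sum((x - y) ** 2 for rv, rg in zip(v, guess) for x, y in zip(rv, rg))


def factorize(v, features=3, iterations=200, seed=0, eps=1e-9):
    """Find non-negative w (rows x features) and h (features x columns)
    so that w times h approximates v."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() for _ in range(features)] for _ in range(rows)]
    h = [[rng.random() for _ in range(cols)] for _ in range(features)]
    history = [cost(v, matmul(w, h))]
    for _ in range(iterations):
        # Update h: h *= (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(features)]
        # Update w: w *= (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(features)]
             for i in range(rows)]
        history.append(cost(v, matmul(w, h)))
    return w, h, history


# Target result from the slide: rows are Gym, Calorie, Weigh, Carbs, Treadmill
target = [
    [2, 0, 0, 3, 0],
    [0, 2, 1, 1, 3],
    [1, 0, 2, 0, 0],
    [0, 3, 0, 0, 2],
    [1, 0, 0, 2, 0],
]
w, h, history = factorize(target)
print(round(history[0], 2), round(history[-1], 4))
```

Each pass nudges both factor matrices so that their product moves closer to the target, which is the "current guess" improving from slide to slide.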

Interpreting Features

      Msg1  Msg2  Msg3  Msg4  Msg5
F1      2     0     0     1     0
F2      0     2     0     1     3
F3      1     0     1     0     0

            F1  F2  F3
Gym          1   0   0
Calorie      0   1   1
Weigh        0   0   2
Carbs        0   1   0
Treadmill    1   0   0

Features Matrix

Weight Matrix

Theme 1    Theme 2   Theme 3
Gym        Calorie   Weigh
Treadmill  Carbs     Calorie

Msg1     Msg2     Msg3     etc.
Theme 1  Theme 2  Theme 3
Theme 3
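Reading themes off the two small matrices can be automated: take the heaviest words in each feature column, and the heaviest feature for each message. The sketch below uses the slide's numbers; the function names and the two-word limit are assumptions made for the example.

```python
words = ["Gym", "Calorie", "Weigh", "Carbs", "Treadmill"]

# Word-by-feature matrix from the slide (columns F1, F2, F3)
word_features = [
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 2],
    [0, 1, 0],
    [1, 0, 0],
]

# Feature-by-message matrix from the slide (columns Msg1..Msg5)
feature_messages = [
    [2, 0, 0, 1, 0],
    [0, 2, 0, 1, 3],
    [1, 0, 1, 0, 0],
]


def top_words(feature, limit=2):
    """Words with the largest non-zero weight in one feature."""
    weighted = [(row[feature], word)
                for row, word in zip(word_features, words) if row[feature] > 0]
    weighted.sort(key=lambda pair: -pair[0])  # stable, so ties keep word order
    return [word for _, word in weighted[:limit]]


def strongest_theme(message):
    """1-based index of the feature with the largest weight for a message."""
    column = [row[message] for row in feature_messages]
    return column.index(max(column)) + 1


themes = [top_words(f) for f in range(3)]
for number, theme in enumerate(themes, start=1):
    print(f"Theme {number}: {', '.join(theme)}")
```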

Diet & Body themes

Atkins, Induction, South, Beach, Carbs

Chocolate, Black, Coffee, Olive, Broccoli

Gym, Weights, Exercise, Running, Injured

Cook, Recipe, Fried, Home

Money, Organic, Want, Best

Calories, Weight, Fats, Protein, Cholesterol

Wikipedia people

she, her, after, when, father, women

series, television, show, which, radio, bbc

league, major, baseball, season, played, with

olympics, competed, won, summer, medal, athlete

university, professor, received, science, research, born

We’re just getting started...

Homepage http://kiwitobes.com

Freebase http://freebase.com

Questions?