Graph mining techniques applied to blogs

Mary McGlohon. Seminar on Social Media Analysis, Oct 2 2007


Page 1: Graph mining techniques applied to blogs

Graph mining techniques applied to blogs

Mary McGlohon

Seminar on Social Media Analysis, Oct 2 2007

Page 2: Graph mining techniques applied to blogs

2

Last week…

Lots of methods for graph mining and link analysis.

Page 3: Graph mining techniques applied to blogs

3

Last week…

Lots of methods for graph mining and link analysis.

This week…

A few examples of these methods applied to blogs.

Page 4: Graph mining techniques applied to blogs

4

Paper #1

● Jure Leskovec, Mary McGlohon, Christos Faloutsos, Natalie Glance, and Matthew Hurst. Patterns of Cascading Behavior in Large Blog Graphs, SDM 2007.

– What temporal and topological features do we observe in a large network of blogs?

Page 5: Graph mining techniques applied to blogs

5

Representing blogs as graphs

[Figure: blogosphere network of blogs B1 to B4; example blogs: slashdot, boingboing, Dlisted, MichelleMalkin]

Page 6: Graph mining techniques applied to blogs

6

Representing blogs as graphs

[Figure: the blogosphere network of blogs B1 to B4 (slashdot, boingboing, Dlisted, MichelleMalkin) and the derived blog network, whose edges carry numeric weights]

Page 7: Graph mining techniques applied to blogs

7

Representing blogs as graphs

[Figure: the blogosphere network of blogs B1 to B4 (slashdot, boingboing, Dlisted, MichelleMalkin), the derived blog network with weighted edges, and the post network of posts a to e]

Page 8: Graph mining techniques applied to blogs

8

Extracting subgraphs: Cascades

We gather cascades using the following procedure:

– Find all initiators (out-degree 0).

[Figure: post network of posts a to e]

Page 9: Graph mining techniques applied to blogs

9

Extracting subgraphs: Cascades

We gather cascades using the following procedure:

– Find all initiators (out-degree 0).

– Follow in-links.

[Figure: post network of posts a to e, with in-links followed from the initiators]

Page 10: Graph mining techniques applied to blogs

10

Extracting subgraphs: Cascades

We gather cascades using the following procedure:

– Find all initiators (out-degree 0).

– Follow in-links.

– Produces directed acyclic graph.

[Figure: post network of posts a to e and the cascades extracted from it]
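The procedure above can be sketched in Python. This is a minimal illustration rather than the authors' code: the function name `extract_cascades`, the convention that `(src, dst)` means post `src` links to post `dst`, and the toy edge list are all assumptions, and isolated posts are skipped.

```python
from collections import defaultdict, deque

def extract_cascades(edges):
    """Group posts into cascades, one per initiator.

    edges: iterable of (src, dst) pairs, meaning post src links to post dst.
    Returns {initiator: (set_of_posts, set_of_edges)}.
    """
    in_links = defaultdict(set)   # dst -> posts that link to dst
    has_out_link = set()
    for src, dst in set(edges):
        in_links[dst].add(src)
        has_out_link.add(src)

    # Initiators: posts that are linked to but link to nothing (out-degree 0).
    initiators = sorted(n for n in in_links if n not in has_out_link)

    cascades = {}
    for root in initiators:
        seen = {root}
        cascade_edges = set()
        queue = deque([root])
        while queue:                      # follow in-links breadth-first
            node = queue.popleft()
            for src in in_links.get(node, ()):
                cascade_edges.add((src, node))
                if src not in seen:
                    seen.add(src)
                    queue.append(src)
        cascades[root] = (seen, cascade_edges)
    return cascades

# Hypothetical post network over posts a..e (not the edges from the slide figure).
toy_edges = [("b", "a"), ("c", "a"), ("d", "b"), ("e", "b"), ("e", "c")]
toy_cascades = extract_cascades(toy_edges)
```

Because a post can only link to posts that already exist, the collected edges form a directed acyclic graph, as the last step of the procedure states.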

Page 11: Graph mining techniques applied to blogs

11

Paper #1,2 Dataset (Nielsen Buzzmetrics)

● Gathered from August-September 2005*

● Used set of 44,362 blogs, traced cascades

● 2.4 million posts, ~5 million out-links, 245,404 blog-to-blog links

[Figure: number of posts over time; x-axis: Time [1 day], y-axis: Number of posts]

Page 12: Graph mining techniques applied to blogs

12

Temporal Observations

Does blog traffic behave periodically?

• Posts have a “weekend effect”: less traffic on Saturday/Sunday.

Page 13: Graph mining techniques applied to blogs

13

Temporal Observations

How does post popularity change over time?

[Figure: Monday post dropoff; x-axis: days after post, y-axis: number of in-links (log). Annotations: popularity on day 1, popularity on day 40]

Page 14: Graph mining techniques applied to blogs

14

Temporal Observations

How does post popularity change over time?

[Figure: number of in-links vs. days after post]

[Figure: Monday post dropoff; x-axis: days after post, y-axis: number of in-links (log)]

Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006].

Page 15: Graph mining techniques applied to blogs

15

Temporal Observations

How does post popularity change over time?

[Figure: number of in-links vs. days after post]

Post popularity dropoff follows a power law identical to that found in communication response times in [Vazquez2006].

The probability that a post written at time $t_p$ acquires a link at time $t_p + \Delta$ is:

$$p(t_p + \Delta) \propto \Delta^{-1.5}$$
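One rough way to check a power-law dropoff like this is to fit a straight line to the log-log plot of in-links against days after posting. The sketch below assumes ordinary least squares on the logs (a simple but not the most robust estimator) and uses synthetic counts generated with exponent -1.5; the function name and data are illustrative, not from the paper.

```python
import math

def fit_power_law_exponent(delays, counts):
    """Least-squares slope of log(count) against log(delay)."""
    xs = [math.log(d) for d in delays]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Synthetic data: in-link counts that decay exactly as delay ** -1.5.
days = list(range(1, 41))
links = [1000.0 * d ** -1.5 for d in days]
exponent = fit_power_law_exponent(days, links)
```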

Page 16: Graph mining techniques applied to blogs

16

Topological Observations

What graph properties does the blog network exhibit?

[Figure: blog network of blogs B1 to B4 with weighted edges]

Page 17: Graph mining techniques applied to blogs

17

Topological Observations

What graph properties does the blog network exhibit?

● 44,356 nodes, 122,153 edges

● Half of blogs belong to largest connected component.

[Figure: blog network of blogs B1 to B4 with weighted edges]

Page 18: Graph mining techniques applied to blogs

18

Topological Observations

What power laws does the blog network exhibit?

Both in-degree and out-degree follow power-law distributions: the in-degree exponent is -1.7 and the out-degree exponent is near -3.

This suggests strong rich-get-richer phenomena.

[Figure: two log-log plots. Left: count vs. number of blog in-links. Right: count vs. number of blog out-links.]

Page 19: Graph mining techniques applied to blogs

19

Topological Observations

What graph properties does the post network exhibit?

[Figure: post network of posts a to e]

Page 20: Graph mining techniques applied to blogs

20

Topological Observations

What graph properties does the post network exhibit?

[Figure: post network of posts a to e]

Very sparsely connected: 98% of posts are isolated.

Inlinks/outlinks also follow power laws.

Page 21: Graph mining techniques applied to blogs

21

Topological Observations

How do we measure how information flows through the network?

Common cascade shapes are extracted using algorithms in [Leskovec2006].

Page 22: Graph mining techniques applied to blogs

22

Topological Observations

How do we measure how information flows through the network?

The number of edges increases linearly with cascade size, while the effective diameter increases logarithmically, suggesting tree-like structures.

[Figure: two plots. Left: number of edges vs. cascade size (# nodes). Right: effective diameter vs. cascade size.]

Page 23: Graph mining techniques applied to blogs

More on cascades

● Cascade sizes, including sizes of particular shapes (stars, chains) also follow power laws.

● This paper also presents a model for influence propagation that generates cascades based on the SIS model of epidemiology. The topic of influence propagation has been reserved for a later date.

Page 24: Graph mining techniques applied to blogs

24

Paper #2

Mary McGlohon, Jure Leskovec, Christos Faloutsos, Matthew Hurst, and Natalie Glance. Finding patterns in blog shapes and blog evolution, SDM 2007.

● Do different kinds of blogs exhibit different properties?

● What tools can we use to describe the behavior of a blog over time?

Page 25: Graph mining techniques applied to blogs

● Suppose we wanted to characterize a blog based on the properties of its posts.

– Obtain a set of post features based on its role in a cascade.

– Use PCA for dimensionality reduction.

Page 26: Graph mining techniques applied to blogs

26

Post features

● There are several terms we use to describe cascades:

● In-link, out-link

– Green node has one out-link.

– Yellow node has one in-link.

● Depth downwards/upwards

– Pink node has an upward depth of 1, downward depth of 2.

● Conversation mass upwards/downwards

– Pink node has upward CM 1, downward CM 3.
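These features can be computed with a breadth-first search in each link direction. The sketch below is an illustration under stated assumptions: depth is taken as the largest BFS distance, conversation mass as the number of posts reached, and the results are named by link direction ("along in-links", "along out-links") because the text here does not say which direction the slide calls "upward". The function names and the toy edges are hypothetical.

```python
from collections import defaultdict, deque

def reach(adjacency, start):
    """Return (depth, mass): largest BFS distance from start and number
    of posts reached, not counting start itself."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return max(dist.values()), len(dist) - 1

def post_features(edges, post):
    """edges: (src, dst) pairs meaning post src links to post dst."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    depth_out, mass_out = reach(out_links, post)
    depth_in, mass_in = reach(in_links, post)
    return {
        "in_links": len(in_links.get(post, ())),
        "out_links": len(out_links.get(post, ())),
        "depth_along_out_links": depth_out,
        "mass_along_out_links": mass_out,
        "depth_along_in_links": depth_in,
        "mass_along_in_links": mass_in,
    }

# Hypothetical cascade: p links to r; x and z link to p; y links to x.
toy_edges = [("p", "r"), ("x", "p"), ("z", "p"), ("y", "x")]
features = post_features(toy_edges, "p")
```

In this toy cascade, post `p` gets depth 1 and mass 1 in one direction and depth 2 and mass 3 in the other, the same pattern of values as the pink node described above.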

Page 27: Graph mining techniques applied to blogs

27

Dimensionality reduction

● Post features may be correlated, so some information may be unnecessary.

● Principal Component Analysis is a method of dimensionality reduction.

[Figure: hypothetical scatter plot for each blog; axes: depth upwards, conversation mass upwards]

Page 30: Graph mining techniques applied to blogs

30

Setting up the matrix

[Table: one row per post (slashdot-p001, slashdot-p002, …, boingboing-p001, boingboing-p002, …), ~2,400,000 posts in total; one column per feature: log(# in-links), log(# out-links), log(CM up), log(CM down), log(depth up), log(depth down)]

Run PCA…
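To make the PCA step concrete, here is a small pure-Python sketch that finds the first principal component of a feature matrix by power iteration on its covariance matrix. Power iteration is a stand-in chosen to keep the example dependency-free; the slides do not say how PCA was computed, and a numerical library would normally be used for a matrix of this size. The two-column toy matrix is hypothetical.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the first principal component.

    rows: list of equal-length feature lists, one per post.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the centered features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical matrix: two perfectly correlated features per post.
toy_rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, scores = first_principal_component(toy_rows)
```

With perfectly correlated columns, a single component captures all the variance, which is the situation the "post features may be correlated" bullet describes.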

Page 31: Graph mining techniques applied to blogs

31

PostFeatures: Results

• Observation: Posts within a blog tend to retain similar network characteristics.

– PC1 ~ CM upward

– PC2 ~ CM downward

Page 32: Graph mining techniques applied to blogs

32

PostFeatures: Results

• Observation: Posts within a blog tend to retain similar network characteristics.

[Figure: PCA projection with MichelleMalkin and Dlisted labeled]

– PC1 ~ CM upward

– PC2 ~ CM downward

Page 33: Graph mining techniques applied to blogs

33

● Suppose we want to cluster blogs based on content. What features do we use?

– Get set of features based on cascade shapes.

– Run PCA to reduce dimensionality.

Page 34: Graph mining techniques applied to blogs

34

PCA on a sparse matrix

• This time, each blog is one row.

• Use log(count+1)

• Project onto 2 PC…

[Table: one row per blog (slashdot, boingboing, …), ~44,000 blogs; one column per cascade type, ~9,000 cascade types; entries are log(count+1)]
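A sparse blog-by-cascade-type matrix with log(count+1) entries can be held as a dictionary of dictionaries, storing only the nonzero cells. This sketch assumes natural logarithms (the base is not stated) and uses hypothetical cascade type names and observations.

```python
import math
from collections import Counter

def blog_cascade_rows(observations):
    """Build sparse rows {blog: {cascade_type: log(count + 1)}}.

    observations: iterable of (blog, cascade_type) pairs, one per cascade
    of that type in which the blog took part.
    """
    counts = Counter(observations)
    rows = {}
    for (blog, cascade_type), count in counts.items():
        rows.setdefault(blog, {})[cascade_type] = math.log(count + 1)
    return rows

# Hypothetical observations; "star" and "chain" stand in for cascade types.
toy_observations = [
    ("slashdot", "star"), ("slashdot", "star"), ("slashdot", "chain"),
    ("boingboing", "chain"),
]
rows = blog_cascade_rows(toy_observations)
```

Missing entries mean a count of zero, and log(0+1) = 0, so the transform keeps the matrix sparse.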

Page 35: Graph mining techniques applied to blogs

35

CascadeType: Results

● Observation: Content of blogs and cascade behavior are often related.

• Distinct clusters for “conservative” and “humorous” blogs (hand-labeling).

Page 37: Graph mining techniques applied to blogs

37

● What about time series data? How can we deal with that?

● Problem: time series data is nonuniform and difficult to analyze.

[Figure: in-links over time]

Page 38: Graph mining techniques applied to blogs

38

BlogTimeFractal: Definitions

● Fortunately, we find that behavior is often self-similar.

● The 80-20 law describes self-similarity.

● For any sequence, we divide it into two equal-length subsequences. 80% of traffic is in one, 20% in the other.

– Repeat recursively.
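The recursive 80-20 split can be simulated directly. The sketch below generates a synthetic series by repeatedly halving every interval and giving a fraction b of its traffic to one half; choosing the heavier half at random with a seeded generator is an assumption made here for illustration, as are the function and parameter names.

```python
import random

def b_model(total, levels, b=0.8, rng=None):
    """Generate 2**levels traffic values by recursive biased splitting."""
    rng = rng or random.Random(0)   # seeded so the example is repeatable
    series = [float(total)]
    for _ in range(levels):
        split = []
        for value in series:
            heavy, light = b * value, (1 - b) * value
            # Put the heavy share in the left or right half at random.
            split.extend((heavy, light) if rng.random() < 0.5 else (light, heavy))
        series = split
    return series

# 1000 units of traffic split over 2**3 = 8 intervals with the 80-20 law.
traffic = b_model(1000, levels=3, b=0.8)
```

The total traffic is preserved at every level, and the busiest interval holds b**levels of it, which is what makes the series bursty.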

Page 39: Graph mining techniques applied to blogs

39

Self-similarity

● The bias factor for the 80-20 law is b=0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]

Page 40: Graph mining techniques applied to blogs

40

Self-similarity

● The bias factor for the 80-20 law is b=0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]

Q: How do we estimate b?

Page 41: Graph mining techniques applied to blogs

41

Self-similarity

● The bias factor for the 80-20 law is b=0.8.

[Figure: a sequence split into two halves holding 20% and 80% of the traffic]

Q: How do we estimate b?

A: Entropy plots!

Page 42: Graph mining techniques applied to blogs

42

BlogTimeFractal

● An entropy plot plots entropy vs. resolution.

● From time series data, begin with resolution R = T/2.

● Record entropy $H_R$.

Page 43: Graph mining techniques applied to blogs

43

BlogTimeFractal

● An entropy plot plots entropy vs. resolution.

● From time series data, begin with resolution R = T/2.

● Record entropy $H_R$.

● Recursively take finer resolutions.

Page 44: Graph mining techniques applied to blogs

44

BlogTimeFractal

● An entropy plot plots entropy vs. resolution.

● From time series data, begin with resolution R = T/2.

● Record entropy $H_R$.

● Recursively take finer resolutions.

Page 45: Graph mining techniques applied to blogs

45

BlogTimeFractal: Definitions

● Entropy measures the non-uniformity of the histogram at a given resolution.

● We define the entropy of our sequence at a given resolution $R$:

$$H_R = -\sum_{t} p(t) \log_2 p(t)$$

where $p(t)$ is the fraction of posts from a blog in interval $t$, $R$ is the resolution, and $2^R$ is the number of intervals.
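An entropy plot can be sketched as follows: aggregate the series into 2^R equal intervals for each resolution R, compute the entropy in bits, and fit a line through the (R, entropy) points. The function names, the power-of-two length requirement, and the least-squares slope are assumptions of this sketch, not details taken from the paper.

```python
import math

def entropy_at_resolution(series, R):
    """Entropy in bits after aggregating series into 2**R equal intervals."""
    n_intervals = 2 ** R
    width = len(series) // n_intervals
    total = sum(series)
    entropy = 0.0
    for i in range(n_intervals):
        p = sum(series[i * width:(i + 1) * width]) / total
        if p > 0:                      # treat 0 * log(0) as 0
            entropy -= p * math.log2(p)
    return entropy

def entropy_plot(series):
    """Return [(R, H_R)] for R = 0 .. log2(len(series))."""
    levels = len(series).bit_length() - 1
    if len(series) != 2 ** levels:
        raise ValueError("series length must be a power of two")
    return [(R, entropy_at_resolution(series, R)) for R in range(levels + 1)]

def slope(points):
    """Least-squares slope of the entropy plot."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx

uniform_plot = entropy_plot([1] * 16)          # evenly spread traffic
point_plot = entropy_plot([5] + [0] * 15)      # all traffic in one interval
```

Evenly spread traffic gains one bit of entropy per level of resolution, while traffic concentrated in a single interval never gains any.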

Page 46: Graph mining techniques applied to blogs

46

BlogTimeFractal

● For a b-model (and self similar cases), entropy plot is linear. The slope s will tell us the bias factor.

● Lemma: For traffic generated by a b-model, the bias factor b obeys the equation:

$$s = -b \log_2 b - (1-b) \log_2 (1-b)$$
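The lemma gives s as a function of b, so recovering b from a measured slope means inverting it numerically. The sketch below uses bisection on the interval [0.5, 1], where the right-hand side decreases from 1 to 0; the function names and tolerance are choices made for this example.

```python
import math

def binary_entropy(b):
    """Right-hand side of the lemma: -b log2 b - (1-b) log2 (1-b)."""
    if b in (0.0, 1.0):
        return 0.0
    return -b * math.log2(b) - (1 - b) * math.log2(1 - b)

def bias_from_slope(s, tol=1e-9):
    """Solve the lemma for b in [0.5, 1] given an entropy-plot slope s."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("slope must lie between 0 and 1")
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_entropy(mid) > s:   # entropy too high: b must be larger
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Slope reported in the slides for Michelle Malkin in-links.
b_estimate = bias_from_slope(0.85)
```

For s = 0.85 this gives a bias factor that rounds to 0.72, and the endpoints s = 1 and s = 0 map to b = 0.5 and b = 1.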

Page 47: Graph mining techniques applied to blogs

47

Entropy Plots

● Linear plot → Self-similarity

[Figure: entropy plot; x-axis: resolution, y-axis: entropy]

Page 48: Graph mining techniques applied to blogs

48

Entropy Plots

● Linear plot → Self-similarity

● Uniform: slope s=1, bias=.5

● Point mass: s=0, bias=1

[Figure: entropy plot; x-axis: resolution, y-axis: entropy]

Page 49: Graph mining techniques applied to blogs

49

Entropy Plots

● Linear plot → Self-similarity

● Uniform: slope s=1, bias=.5

● Point mass: s=0, bias=1

[Figure: entropy plot; x-axis: resolution, y-axis: entropy]

Michelle Malkin in-links, s= 0.85

By Lemma 1, b= 0.72

Page 50: Graph mining techniques applied to blogs

50

BlogTimeFractal: Results

● Observation: Most time series of interest are self-similar.

● Observation: Bias factor is approximately 0.7, that is, more bursty than uniform (70/30 law).

[Figure: entropy plots for MichelleMalkin: in-links, b=.72; conversation mass, b=.76; number of posts, b=.70]

Page 51: Graph mining techniques applied to blogs

Papers #1,2 conclusions

● There are several power laws observed in a network of blogs.

● We can extract cascades to help describe how information propagates through a network.

● We can use cascade properties to describe behavior of some blogs.

● We can also use self-similarity to describe behavior of blogs over time.

Page 52: Graph mining techniques applied to blogs

52

Paper #3

● Eytan Adar, Li Zhang, Lada A. Adamic, and Rajan M. Lukose. Implicit Structure and the Dynamics of Blogspace. WWW 2004.

– What are the large- and small-scale patterns of blog epidemics?

Page 53: Graph mining techniques applied to blogs

Large scale: Epidemic profiles

● Example: The effects of popular websites linking to a given blog may cause popularity spikes.

53

Page 54: Graph mining techniques applied to blogs

Large scale: Epidemic profiles

● Quantify popularity of a topic into a vector.

● Then, cluster different topics’ profiles.

Page 55: Graph mining techniques applied to blogs

Large scale: Epidemic profiles

● Used k-means clustering on topic buzz to identify different ways ideas gain and lose popularity. Found k=4 worked best.

[Figure: centroids of the clusters identified]
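For readers unfamiliar with k-means, here is a compact sketch applied to popularity vectors. It is a plain Lloyd's-algorithm illustration with a deterministic start (the first k vectors), which is an assumption of this example; the paper's exact setup is not given here, and the four toy profiles are invented.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=100):
    """Cluster vectors into k groups; returns (centroids, assignment)."""
    centroids = [list(v) for v in vectors[:k]]   # simple deterministic start
    assignment = None
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break                                 # converged
        assignment = new_assignment
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

# Hypothetical daily popularity profiles: two early spikes, two late spikes.
profiles = [[9, 1, 0, 0], [0, 0, 2, 8], [8, 2, 0, 0], [0, 1, 1, 9]]
centroids, assignment = kmeans(profiles, k=2)
```

Each centroid is itself a popularity profile, which is why the clusters can be read as typical ways a topic gains and loses attention.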

Page 56: Graph mining techniques applied to blogs

Large scale: Epidemic profiles

● ‘Catchall’- picked up by different communities, no major spike.

● ‘Back page’ news- delayed spike, broader popularity.

● ‘Slashdot’- link picked up quickly, dies off quickly.

● ‘Front page’ news- immediate spike, broader popularity.

[Figure: share of topics per cluster: ‘catchall’ 48%, ‘slashdot’ 14%, ‘back page’ 20%, ‘front page’ 18%]

Page 57: Graph mining techniques applied to blogs

Link gathering

● Links acquired by blogrolls or automated trackbacks.

● Posts sometimes give information on source of information (‘via’).

57

May 16 2003, 8:48a

“GIANTmicrobes http://www.giantmicrobes.com/ ‘We make stuffed animals that look like tiny microbes– only a million times actual size! Now available: The Common Cold, The Flu, Sore Throat, and Stomach Ache.’ (via BoingBoing)”

Page 58: Graph mining techniques applied to blogs

Small scale: link mining

● Links acquired by blogrolls or automated trackbacks.

● Posts sometimes give information on source of information (‘via’).

58

May 16 2003, 8:48a

“GIANTmicrobes http://www.giantmicrobes.com/ ‘We make stuffed animals that look like tiny microbes– only a million times actual size! Now available: The Common Cold, The Flu, Sore Throat, and Stomach Ache.’ (via BoingBoing)”

[Figure: images labeled Epstein-Barr and Ebola]

Page 59: Graph mining techniques applied to blogs

Small scale: link mining

● Unfortunately, since ‘via’ information is rare (O(.1%)), there needs to be a better way to infer infection paths.

– Solution: link prediction.

Page 60: Graph mining techniques applied to blogs

Link prediction

● Predict likelihood of 2 blogs linking to each other.

– Blog similarity: common links to other blogs

– Link similarity: common non-blog links

– Textual similarity: text vector similarity

– Timing of posts on certain topics.

● The first three are cosine similarities; timing is a likelihood based on observed distributions of link timings.
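Cosine similarity between two blogs can be computed from sparse vectors of what each blog links to. In this sketch the vectors are dictionaries mapping a link target to a weight; the blog names are reused from earlier slides, but the link sets themselves are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {key: weight} dicts."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0            # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical link vectors: which blogs each blog links to.
blog_a_links = {"slashdot": 1.0, "boingboing": 1.0}
blog_b_links = {"boingboing": 1.0, "dlisted": 1.0}
similarity = cosine_similarity(blog_a_links, blog_b_links)
```

The same function works for non-blog links or for text vectors; only the keys and weights change.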

60

Page 61: Graph mining techniques applied to blogs

Link prediction results

● Used SVMs to predict links.

– Undirected link prediction accuracy 91%

– (Directed link prediction, 57%)

61

Page 62: Graph mining techniques applied to blogs

More goodies from Paper #3

● And…

– Built Zoomgraph, a visualization tool (stay tuned next week.)

– Proposed iRank, a ranking based on “infectiousness” of blogs (stay tuned Oct. 23.)

A more in-depth slide show may be found here: http://www.blogpulse.com/papers/Adar_blogworkshop2_ppt.pdf

62

Page 63: Graph mining techniques applied to blogs

63

Paper #4

● Noor Ali-Hasan and Lada Adamic. Expressing Social Relationships on the Blog through Links and Comments. ICWSM 2007.

– Do different blog communities exhibit certain structural properties?

Page 64: Graph mining techniques applied to blogs

[Ali-Hasan and Adamic 2007]

● Dataset of 3 blogging communities

– Dallas/Ft. Worth

– United Arab Emirates (UAE)

– Kuwait

● Analyzed 3 types of links

– Blogrolls (on a blog’s webpage)

– Citations (link in a post)

– Comments (interaction in a post’s discussion)

64

Page 65: Graph mining techniques applied to blogs

65

Citation link

Blogroll link

Page 66: Graph mining techniques applied to blogs

66

Comment link

Page 67: Graph mining techniques applied to blogs

Link type analysis

● It is of interest to compare different types of links…

– Co-occurrences of different link types.

[Figure: co-occurrence of link types (Kuwait)]

Page 68: Graph mining techniques applied to blogs

Link type analysis

● It is of interest to compare different types of links…

– Co-occurrences of different link types.

– Reciprocity among link types, between communities.

[Figure: link reciprocation rates]

[Figure: co-occurrence of link types (Kuwait)]
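A reciprocation rate can be illustrated with a few lines of code. The definition used here, the fraction of directed links whose reverse link also exists, is a common one but is an assumption of this sketch; the paper may define the rate differently, and the edge list is invented.

```python
def reciprocity(edges):
    """Fraction of directed links (u, v) for which (v, u) is also present."""
    edge_set = {(u, v) for u, v in edges if u != v}   # ignore self-links
    if not edge_set:
        return 0.0
    mutual = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

# Hypothetical links of one type: a and b link to each other, the rest do not.
toy_links = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "d")]
rate = reciprocity(toy_links)
```

Running the same function separately on blogroll, citation, and comment links would give one rate per link type for each community.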

Page 69: Graph mining techniques applied to blogs

Structural properties

● Centralization- to what extent links are not uniformly distributed. (low in all communities, indicating “hubs”)

69

Links per blog

Page 70: Graph mining techniques applied to blogs

Structural properties

● Centralization- to what extent links are not uniformly distributed. (low in all communities, indicating “hubs”)

● Modularity- to what extent “subcommunities” have formed.

70

Links per blog

Modularity

Page 71: Graph mining techniques applied to blogs

71

Comparing communities

Dallas-Fort Worth

-Most links are external to community (91%)

-Low centralization

-Low reciprocity

UAE

-Fewer links external to community

-More centralization

-Obvious “hub” structure

Kuwait

-Fewest links external to community (53%)

-Highly centralized

-Much reciprocity

Page 72: Graph mining techniques applied to blogs

Paper #4 Conclusions

● Based on a survey, they suggest that these different network characteristics indicate different mindsets inside the community.

– Kuwait bloggers more often reported blogging in order to make new friends.

– DFW bloggers more often reported blogging to update friends/family on events.

Page 73: Graph mining techniques applied to blogs

Conclusions

● Link analysis has discovered patterns in several aspects of the blogosphere.

– Observing general network characteristics.

– Describing behavior of specific blogs, or blog topics.

– Illustrating how influence propagates.

– Comparing different blogging communities.