Studying Blogspace
Ravi Kumar
IBM Almaden Research Center
ravi@almaden.ibm.com
Etymology
From the OED new ed. (draft entry, Mar 2003) …
blog intr. To write or maintain a weblog. Also: to read or browse through weblogs, esp. habitually.
web·log n. 2. A frequently updated web site consisting of personal observations, excerpts from other sources, etc., typically run by a single person, and usually with hyperlinks to other sites; an online journal or diary.
From WWW 2003 (Kumar, Novak, Raghavan, Tomkins) …
blog·space n. The collection of weblogs; = blogosphere, blogsphere, blogistan, …
Blogs 101
• Characteristics
– Pages with reverse chronological sequences of dated entries
– Usually contain a persistent sidebar containing a profile (and other blogs read by the author – “blogroll”)
– Usually maintained and published by one of the common variants of public-domain blog software
• From Slashdot, 1999
“… a new, personal, and determinedly non-hostile evolution of the electric community. They are also the freshest example of how people use the Net to make their own, radically different new media”
Look and feel
• Quirky
• Highly personal
• Consumed by a small number of regular repeat visitors
• Often updated multiple times each day
• Highly interwoven into a network of small but active micro-communities
• Eg: LiveJournal, Xanga, DeadJournal, Blogger, Memepool, …
The blog era
• Blogs began in 1996, but exploded in popularity in 1999
– Proliferation of authoring tools
• Newsweek 2002 estimates ~500K
• LiveJournal 2005 estimates ~3.5M
• Annual Blogathon for charity
– Bloggers update their blogs every 30m for 24h
– Sponsors pay …
• Impact of blogs
– “Miserable failure” on Google
Structural study
(Kumar, Novak, Raghavan, Tomkins, CACM 2004)
LiveJournal blogspace
• LiveJournal.com: popular blog site
• 1.3M bloggers (Feb 2004)
• 3.5M bloggers (Apr 2005)
• Each blogger has a profile
– Name, age, …
– Geographic information (city, state, zip, …)
– Friends and friend of
– Interests/communities
Eg, LiveJournal user “bill”
LJ bloggers in US
[Map; legend: < 1K, < 5K, < 10K, < 25K, < 50K, ~ 100K]
LJ bloggers world-wide
[Map; legend: < 1K, < 2K, < 5K, ~ 25K, ~ 50K, ~ 75K]
Who are they?
[Table; columns: Age, %, Representative interests]
Friendship graph
• Directed
• 80% mutual
• Average degree ~ 14
• Power law degrees
• Clustering coeff. ~ 0.2
• Most friendships explained by age, location, interest
[Venn diagram over Age, Location, Interest; region values as extracted: 1%, 20%, 16%, 5%, 16%, 22%]
Evolutionary study
(Kumar, Novak, Raghavan, Tomkins, WWW 2003)
Blogs and evolution
• Every blog contains a dated record of
– Every word ever written to the blog
– Every link ever added in the blog
• Blogs are an increasingly important medium, but
– Few systematic studies have been performed
– Such study should take an evolutionary perspective
[Brewington et al] [Bharat et al] [Fetterly et al] [Cho et al]
– Tools for understanding evolution not fully understood
Time graphs
[Figure: an underlying graph on vertices v1, v2, v3, v4 and its time graph, drawn against a time axis labeled Jan through Aug]
Community evolution in blogs
• What are the communities within the time graph?
– Community definition, extraction
– Graph-based methods (trawling)
[Kumar Raghavan Rajagopalan Tomkins, WWW 99]
• How active are these communities, and over what timeframe?
– Burst analysis [Kleinberg, KDD 02]
Community extraction
• Community analysis based on graph structure
• Idea: there are many subgraphs that would never occur in a random graph – if we find such subgraphs, there must be some reason
• In blogspace, we enumerate dense subgraphs using a greedy heuristic
Dense subgraph enumeration (heuristic)
• Scan edges, find triangles
• For each triangle, greedily grow its neighbor set
• Growth is allowable based on a measure of connectivity to the current dense subgraph
• Extracted “communities” are not unique
Current size (N):   2    <=6    <=9    <=20    >20
Must connect to:    2    N-1    N-2    0.7N    0.6N
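The growth rule above can be sketched in Python. This is a minimal illustration and not the authors' implementation. It assumes details the slide does not state: the graph is treated as undirected, fractional thresholds are rounded up, and among eligible vertices the one with the most links into the current subgraph is added first. The names required_links, find_triangles, and grow are illustrative.

```python
from itertools import combinations


def required_links(n):
    """Links a new vertex needs into a subgraph of size n (table above).

    Rounding 0.7N and 0.6N up is an assumption; integer arithmetic
    avoids floating-point surprises.
    """
    if n <= 2:
        return 2
    if n <= 6:
        return n - 1
    if n <= 9:
        return n - 2
    if n <= 20:
        return (7 * n + 9) // 10   # ceil(0.7 * n)
    return (6 * n + 9) // 10       # ceil(0.6 * n)


def find_triangles(adj):
    """Yield each triangle once as a tuple (u, v, w) with u < v < w.

    adj maps each vertex to the set of its neighbors (undirected).
    """
    for u in adj:
        higher = sorted(x for x in adj[u] if x > u)
        for v, w in combinations(higher, 2):
            if w in adj[v]:
                yield (u, v, w)


def grow(adj, seed):
    """Greedily grow a dense subgraph from a seed triangle."""
    community = set(seed)
    while True:
        need = required_links(len(community))
        candidates = set().union(*(adj[v] for v in community)) - community
        best, best_links = None, 0
        for c in sorted(candidates):
            links = len(adj[c] & community)
            if links >= need and links > best_links:
                best, best_links = c, links
        if best is None:
            return community
        community.add(best)
```

Different seed triangles can grow into overlapping or identical vertex sets, which is consistent with the remark that extracted communities are not unique.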
Bursts: Static to dynamic communities
• Phenomenon to characterize: A topic in a temporal stream occurs in a “burst of activity”
• Model source as multi-state
• Each state has certain emission properties
• Traversal between states is controlled by a Markov model
• Determine most likely underlying state sequence over time, given observable output
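One way to make the last bullet concrete is the Viterbi algorithm on a two-state model. The sketch below is a simplification and not Kleinberg's exact formulation. It assumes that the observable output is a count of events per time step, that each state emits counts from a Poisson distribution, and that the source stays in its current state with a fixed probability. The rates and the stay probability are made-up illustrative values.

```python
import math


def poisson_logpmf(k, rate):
    """Log-probability of observing k events under a Poisson(rate)."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def most_likely_states(counts, rates=(0.5, 5.0), stay=0.9):
    """Return the most likely state sequence (1-based) for the counts.

    rates[i] is the assumed output rate of state i + 1; stay is the
    assumed probability of remaining in the current state.
    """
    if not counts:
        return []
    n_states = len(rates)
    log_stay = math.log(stay)
    log_move = math.log((1 - stay) / (n_states - 1))

    def step(p, s):
        return log_stay if p == s else log_move

    # score[s]: best log-probability of any state path ending in s.
    score = [math.log(1.0 / n_states) + poisson_logpmf(counts[0], r)
             for r in rates]
    back = []
    for k in counts[1:]:
        new_score, pointers = [], []
        for s, r in enumerate(rates):
            prev = max(range(n_states),
                       key=lambda p: score[p] + step(p, s))
            new_score.append(score[prev] + step(prev, s)
                             + poisson_logpmf(k, r))
            pointers.append(prev)
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]
```

With these assumed parameters, a quiet stretch followed by a run of high counts, such as [0, 0, 1, 0, 6, 7, 5], is labeled 1 1 1 1 2 2 2.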
An example
[Figure: utterances along a time axis, generated by a two-state source]
I’ve been thinking about your idea with the asparagus…
Uh huh I think I see…
Uh huh Yeah, that’s what I’m saying…
So then I said “Hey, let’s give it a try”
And anyway she said maybe, okay?
State 1: Output rate: very low
State 2: Output rate: very high
Transition labels: 0.005, 0.01
Most likely “hidden” sequence: 1 1 1 1 2 2 2
Some experiments
• Crawled 24,109 blogs from popular sites (2003)
• Extract archive links from blogs
• Extract all dates on blog pages, and tag each word and link with a date
– Simple heuristics to automatically extract time-stamps from entries (regular expressions, training, …)
– Obtained dates for ~90% of edges
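As an illustration of the regular-expression side of time-stamp extraction, here is a small Python sketch. The two date formats it recognizes (ISO dates and "Month D, YYYY") are assumptions chosen for the example. The slides do not say which patterns were actually used, and the function name extract_dates is hypothetical.

```python
import re
from datetime import date

MONTHS = {name: i for i, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}

ISO = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")          # 2003-03-14
LONG = re.compile(r"\b([A-Za-z]+) (\d{1,2}), (\d{4})\b")  # March 14, 2003


def extract_dates(text):
    """Return the valid dates found in text, in order of appearance."""
    found = []
    for m in ISO.finditer(text):
        y, mo, d = map(int, m.groups())
        found.append((m.start(), y, mo, d))
    for m in LONG.finditer(text):
        mo = MONTHS.get(m.group(1).lower())
        if mo is not None:
            found.append((m.start(), int(m.group(3)), mo, int(m.group(2))))
    dates = []
    for _, y, mo, d in sorted(found):
        try:
            dates.append(date(y, mo, d))
        except ValueError:
            pass  # skip impossible dates such as 2003-02-31
    return dates
```

Each word or link on the page could then be tagged with the nearest preceding date, which is one plausible reading of the tagging step.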
Experiments (contd.)
• The time graph
– 22,299 nodes, 70,472 unique edges
– 0.77M multiedges (average edge multiplicity = 11)
• Consider graphs formed by prefixes from Jan 1, 1999 to some later month – generate 47 “prefix graphs” for analysis
• Enumerate communities and analyze their burstiness
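The prefix-graph construction can be sketched as follows. This is a minimal Python illustration and assumes each edge is a (source, destination, (year, month)) tuple. That representation and the function names are assumptions, as is the choice to fold any edge dated before the first month into the first prefix.

```python
def months(start, count):
    """Yield count consecutive (year, month) pairs beginning at start."""
    year, month = start
    for _ in range(count):
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def prefix_graphs(timed_edges, start=(1999, 1), count=47):
    """Yield (cutoff, edges) for each monthly prefix of the time graph.

    Each prefix holds every unique (source, destination) pair whose
    time-stamp is at or before the cutoff month.
    """
    edges = sorted(timed_edges, key=lambda e: e[2])
    graph, i = set(), 0
    for cutoff in months(start, count):
        while i < len(edges) and edges[i][2] <= cutoff:
            graph.add(edges[i][:2])   # multiedges collapse to one edge
            i += 1
        yield cutoff, frozenset(graph)
```

Each prefix extends the previous one, so structural measures computed on them trace how the graph evolves month by month.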
SCC growth
[Plots: largest SCC as fraction of all nodes; 2nd and 3rd largest SCCs as fraction of all nodes]
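The quantities plotted here depend on the sizes of the strongly connected components (SCCs) of a directed graph. A minimal Python sketch using Kosaraju's two-pass algorithm is given below. The slides do not say which SCC algorithm was used, so this is only one standard choice, and the function name scc_sizes is illustrative.

```python
def scc_sizes(nodes, edges):
    """Return SCC sizes of a directed graph, largest first."""
    nodes = list(nodes)
    fwd = {v: [] for v in nodes}
    rev = {v: [] for v in nodes}
    for u, v in edges:
        fwd[u].append(v)
        rev[v].append(u)

    # Pass 1: record vertices in order of DFS completion (iterative DFS).
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            v, neighbors = stack[-1]
            for w in neighbors:
                if w not in seen:
                    seen.add(w)
                    stack.append((w, iter(fwd[w])))
                    break
            else:
                order.append(v)
                stack.pop()

    # Pass 2: sweep the reversed graph in reverse completion order;
    # each sweep collects exactly one SCC.
    sizes, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        stack, size = [root], 0
        while stack:
            v = stack.pop()
            size += 1
            for w in rev[v]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```

Dividing the first entry by the number of nodes gives the largest-SCC fraction. The second and third entries give the other two curves.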
Connectivity in Blogspace
[Plots: fraction of nodes participating in some community; number of communities; number of nodes participating in a community]
Burstiness of communities
[Plot: number of communities in “high state” during each time period]
Are these results a fluke?
• “Randomized Blogspace”: A distribution over time graphs that look very much like the time graph of Blogspace, but with some of the locality of the true graph removed
• Vertices and edges arrive at the same times, and each edge has the same source, but a randomly chosen destination
• If randomized blogspace behaves like blogspace, then the community structure is a fake
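The randomization can be sketched in Python as below. The slide leaves several details open, so this version assumes the following: a vertex "arrives" at the time of its first edge, the new destination is drawn uniformly from the vertices that have arrived by the edge's time, and self-loops are avoided. Edges are (source, destination, time) tuples and the function name is hypothetical.

```python
import random


def randomized_blogspace(timed_edges, seed=0):
    """Keep each edge's source and time; redraw its destination."""
    rng = random.Random(seed)
    edges = sorted(timed_edges, key=lambda e: e[2])
    arrived, seen = [], set()
    out, i = [], 0
    for src, dst, t in edges:
        # Admit every vertex whose first edge occurs at or before t.
        while i < len(edges) and edges[i][2] <= t:
            for v in edges[i][:2]:
                if v not in seen:
                    seen.add(v)
                    arrived.append(v)
            i += 1
        choices = [v for v in arrived if v != src]
        new_dst = rng.choice(choices) if choices else dst
        out.append((src, new_dst, t))
    return out
```

Running the same SCC and community analyses on the output gives the baseline against which the true time graph is compared.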
SCC evolution
[Plots: Blogspace; Randomized Blogspace]
Randomized Blogspace forms an SCC much earlier
Community evolution
[Plots: Blogspace; Randomized Blogspace]
Blogspace has many more communities
Exogenous events
Number of communities identified automatically as exhibiting “bursty” behavior – measure of cohesiveness of the blogspace
Number of blog pages that belong to a community
Number of blog communities
Wired magazine publishes an article on weblogs that impacts the tech community
Newsweek magazine publishes an article that reaches the population at large, responding to emergence, and triggering mainstream adoption
Some questions …
• Modeling
– Edge arrivals
– ‘Interesting’ events
• Algorithms
– Prediction
– Information percolation
– Search
– o(t · T(n))
• Studies
– Sociological
– Effect on search and ranking
Prediction via blogs
(Gruhl, Guha, Kumar, Novak, Tomkins, 2005)
Blogs as trend indicators
• Can blogs be used to predict trends?
• Data
– Amazon sales rank of some books
– Blog chatter in an index
• Questions
– How well do they correlate?
– Can sales rank be predicted using blogs?
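The correlation question can be explored with a lagged cross-correlation. The Python sketch below assumes two equally spaced series of the same length: blog mention counts and a sales signal already transformed so that higher means better sales (raw sales rank runs the other way). The function names and this preprocessing are assumptions, and this is not the authors' procedure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / math.sqrt(vx * vy)


def cross_correlation(mentions, sales, max_lag):
    """Map lag -> correlation of mentions[t] with sales[t + lag].

    A peak at a positive lag means blog chatter leads sales.
    """
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = mentions[:len(mentions) - lag], sales[lag:]
        else:
            a, b = mentions[-lag:], sales[:len(sales) + lag]
        n = min(len(a), len(b))
        if n >= 2:
            result[lag] = pearson(a[:n], b[:n])
    return result
```

For example, if a spike in mentions is followed two steps later by a matching spike in sales, the correlation peaks at lag 2.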
The Lance Armstrong Performance Program
Query: Lance Armstrong OR Tour de France
Vanity Fair
Cross-correlation for Lance Armstrong
Simple inferences
• How to formulate queries automatically
– Depends on the object (book, movie, …)
– Simple heuristics work well
• Predicting sales motion is hard
• Predicting spikes appears relatively easier
• More to be done …
Blogs and social networks
(Kumar, Liben-Nowell, Novak, Raghavan, Tomkins, 2005)
Social networks
• Blog friendship graph is a social network
• Is there a simple model to describe this network?
• Desiderata
– Fit experimental observations
– Exhibit “six-degrees of separation”
– Theoretically tractable
RBF: Rank-Based Friendship
• Population network model
• Each person has a geographic location
• d(·, ·) measures geographic distance
• rank_A(B) = #{ C : d(A, C) < d(A, B) }
• Pr[A “befriends” B] ∝ 1/rank_A(B)
– Independent of distance
– Works with arbitrary population densities
• Plus local links to neighbors
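The rank definition and the befriending probability can be sketched in Python as follows. This is a minimal sketch under assumptions: locations are points in the plane with Euclidean distance standing in for d, and a rank of 0 (which only occurs for co-located people) is clamped to 1. The clamp, the function names, and the sampling helper are not part of the model as stated.

```python
import math
import random


def rank(points, a, b):
    """rank_a(b): number of people strictly closer to a than b is.

    points maps each person to an (x, y) location. Person a counts
    toward the total, so a's nearest neighbor has rank 1.
    """
    d_ab = math.dist(points[a], points[b])
    return sum(1 for c in points
               if math.dist(points[a], points[c]) < d_ab)


def befriend_probabilities(points, a):
    """Normalized probabilities proportional to 1 / rank_a(b)."""
    weights = {b: 1.0 / max(1, rank(points, a, b))
               for b in points if b != a}
    total = sum(weights.values())
    return {b: w / total for b, w in weights.items()}


def sample_friend(points, a, rng=random):
    """Draw one long-range friend for a under the rank-based rule."""
    probs = befriend_probabilities(points, a)
    people = list(probs)
    return rng.choices(people, weights=[probs[p] for p in people])[0]
```

With four people on a line at x = 0, 1, 2, 4, the person at 0 sees ranks 1, 2, 3 for the others, so the probabilities are 6/11, 3/11, and 2/11. Because only ranks matter, uniformly rescaling every distance leaves these probabilities unchanged.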
RBF: Preliminary results
• Fits LiveJournal friendship experimental graph data (using geo data in the profile)
• Greedy routing: Is able to route messages from source to destination most of the time, just using geographic information
• Theoretical analysis: Can show that this model guarantees geographic routing to work
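Greedy geographic routing can be sketched as follows: at each step the message goes to whichever friend of the current holder is geographically closest to the target, and routing fails if no friend is strictly closer than the holder. This Python version is illustrative only. The data layout (points as (x, y) pairs, friends as adjacency lists) and the function name are assumptions.

```python
import math


def greedy_route(points, friends, source, target):
    """Return (path, reached) for greedy geographic routing.

    points maps each person to an (x, y) location; friends maps each
    person to an iterable of their friends.
    """
    def to_target(v):
        return math.dist(points[v], points[target])

    path, current = [source], source
    # Distance to the target strictly decreases, so the loop terminates.
    while current != target:
        best = min(friends[current], key=to_target, default=None)
        if best is None or to_target(best) >= to_target(current):
            return path, False   # dead end: no friend is closer
        current = best
        path.append(current)
    return path, True
```

Only locations are consulted, never the global friendship graph, which is what "just using geographic information" amounts to in this sketch.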