S3G2 - a Scalable Structure-correlated Social Graph Generator

S3G2 . 27-Aug-12. Page 1/23

S3G2: a Scalable Structure-correlated Social Graph Generator

Minh-Duc Pham Peter Boncz Orri Erling

Database Architectures Group, Centrum Wiskunde & Informatica (CWI)


Data correlations between attributes

SELECT personID FROM person
WHERE firstName = 'Joachim' AND addressCountry = 'Germany'

SELECT personID FROM person
WHERE firstName = 'Cesare' AND addressCountry = 'Italy'

Query optimizers may underestimate or overestimate the result size of conjunctive predicates

Anti-Correlation

[Slide labels: Loew, Prandelli; Joachim, Cesare; Cesare, Joachim]

Correlation between predicates has been studied to some extent in database research (e.g. in the LEO project)

But: correlation-aware query optimization is still hardly mainstream in database products
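The estimation problem can be made concrete with a small sketch. It assumes a toy person table (the rows and counts below are invented for illustration, not taken from S3G2) and an optimizer that multiplies per-predicate selectivities as if the predicates were independent.

```python
# Hypothetical toy table of (firstName, addressCountry) rows, invented for illustration.
persons = (
    [("Joachim", "Germany")] * 40
    + [("Cesare", "Italy")] * 40
    + [("Joachim", "Italy")] * 1
    + [("Cesare", "Germany")] * 1
    + [("Anna", "Germany")] * 9
    + [("Anna", "Italy")] * 9
)


def independence_estimate(rows, name, country):
    """Estimated result size of firstName = name AND addressCountry = country,
    assuming the two predicates are independent."""
    n = len(rows)
    sel_name = sum(1 for first, _ in rows if first == name) / n
    sel_country = sum(1 for _, ctry in rows if ctry == country) / n
    return sel_name * sel_country * n


def actual_count(rows, name, country):
    """True result size of the conjunctive predicate."""
    return sum(1 for first, ctry in rows if first == name and ctry == country)


for name, country in [("Joachim", "Germany"), ("Joachim", "Italy")]:
    print(name, country,
          independence_estimate(persons, name, country),
          actual_count(persons, name, country))
```

With these invented rows the independence estimate is 20.5 for both queries, while the actual counts are 40 (Joachim, Germany) and 1 (Joachim, Italy): an underestimate for the correlated pair and an overestimate for the anti-correlated one.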


Data correlations between attributes

SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
      jn1.name = 'VLDB Journal' AND jn2.name = 'TODS'


Data correlations over joins

SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
      jn1.name = 'VLDB Journal' AND jn2.name = 'Bioinformatics'   -- versus 'TODS'

This challenges optimizers to adjust the estimated hit ratio of the join pa1.author = pa2.author depending on the other predicates

Correlated predicates are still a frontier area in database research
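A minimal sketch of why the join hit ratio depends on the journal predicates. The paper list below is invented for illustration; the function counts the same paper pairs as the COUNT(*) query on this slide.

```python
from collections import Counter

# Hypothetical toy data: one (author, journal name) tuple per paper.
papers = [
    ("a1", "VLDB Journal"), ("a1", "TODS"),
    ("a2", "VLDB Journal"), ("a2", "TODS"), ("a2", "TODS"),
    ("a3", "VLDB Journal"), ("a3", "Bioinformatics"),
    ("b1", "Bioinformatics"), ("b1", "Bioinformatics"),
    ("b2", "Bioinformatics"),
]


def same_author_pairs(rows, journal1, journal2):
    """Count pairs (pa1, pa2) with pa1.author = pa2.author,
    pa1 published in journal1 and pa2 published in journal2."""
    left = Counter(author for author, journal in rows if journal == journal1)
    right = Counter(author for author, journal in rows if journal == journal2)
    return sum(count * right[author] for author, count in left.items())


print(same_author_pairs(papers, "VLDB Journal", "TODS"))            # 3
print(same_author_pairs(papers, "VLDB Journal", "Bioinformatics"))  # 1
```

In this toy data there are more Bioinformatics papers (4) than TODS papers (3), yet the join with VLDB Journal authors produces fewer results, because the author sets barely overlap.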


Emerging class in database systems

Higher need for correlation-awareness

• graph queries navigate over many steps (= joins)
• well-known effect in RDF systems (many self-joins)
• implicit structure of the graph/RDF data model re-appears in queries as correlations (structural correlation)

No existing graph benchmark specifically tests for the effects of correlations

• Synthetic graphs used for benchmarking do not have structural correlations

Graph database systems

Need a data generator that produces synthetic graphs with data/structure correlations: S3G2


what data do we generate?

• social network, Facebook-like

how to generate correlated properties?

• with a compact data generator

how to generate correlated structure?

• multiple correlation dimensions

• scalable MapReduce algorithm (multi-pass)

Next …


S3G2: Generating a Correlated Social Graph

[Figure: example social graph. Node types: User, Post, Photo, Comment. Edge labels: knows, create, upload, like, hasName, studyAt, liveAt, InRelationShip. Example property values: “Yamaku”, “EPFL”, “Switzerland”.]


How do data generators generate values? E.g. FirstName

Generating Correlated Property Values



Value Dictionary D()

• a fixed set of values, e.g., {“Andrea”,“Anna”,“Cesare”,“Camilla”,“Duc”,“Joachim”,“Leon”,“Orri”, .. }

Probability density function F()

• steers how the generator chooses values
  − cumulative distribution over dictionary entries determines which value to pick

• could be anything: uniform, binomial, geometric, etc.
  − geometric (discrete exponential) seems to explain many natural phenomena

Generating Property Values
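The interplay of D() and F() can be sketched as follows. This is a minimal illustration, not the S3G2 implementation: the parameter p = 0.4, the truncation of the geometric distribution to the dictionary size, and all function names are assumptions.

```python
import random

# Value dictionary D(): position i holds the value of rank i + 1.
D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]


def geometric_cdf(size, p):
    """Cumulative distribution of a geometric distribution truncated to `size` ranks."""
    weights = [(1 - p) ** k * p for k in range(size)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    cdf[-1] = 1.0  # guard against floating-point drift
    return cdf


def pick_rank(cdf, u):
    """Return the 1-based rank whose cumulative interval contains u in [0, 1)."""
    for rank, bound in enumerate(cdf, start=1):
        if u < bound:
            return rank
    return len(cdf)


def generate_value(dictionary, cdf, rng):
    """Draw one dictionary value; low ranks are drawn far more often than high ranks."""
    return dictionary[pick_rank(cdf, rng.random()) - 1]


cdf = geometric_cdf(len(D), p=0.4)  # p is a hypothetical parameter
rng = random.Random(1)
print([generate_value(D, cdf, rng) for _ in range(5)])
```

Which value ends up at which rank is exactly what the ranking function decides; here the dictionary order itself serves as a fixed ranking.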





Ranking Function R()

• Gives each value a unique rank between 1 and |D|
  − determines which value gets which probability

• Depends on some parameters (parameterized function)
  − value frequency distribution becomes correlated by the parameters of R()





{“Andrea”,“Anna”,“Cesare”,“Camilla”,“Duc”,“Joachim”,“Leon”,“Orri”, .. }

geometric distribution

Ranking Function R(gender,country,birthyear)

• gender, country, birthyear: correlation parameters

How to implement R()?

We need a table storing |Gender| × |Country| × |BirthYear| × |D| ranks

• |Gender| × |Country| × |BirthYear|: limited #combinations
• |D|: potentially many!

Our Solution:
- Just store the rank of the top-N values, not all |D|
- Assign the rank of the other dictionary values randomly
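The compact ranking table can be sketched like this. The table contents, the choice N = 2, and the use of a parameter-derived seed for the random tail are assumptions made for illustration; S3G2 may store and randomize its ranks differently.

```python
import random

D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

# Hypothetical top-N table (N = 2): only the most frequent values are stored
# per (gender, country, birthyear) combination. The entries are invented.
TOP_N = {
    ("male", "Germany", 1990): ["Joachim", "Leon"],
    ("male", "Italy", 1990): ["Cesare", "Andrea"],
}


def ranking(gender, country, birthyear):
    """Return the dictionary values ordered by rank for one parameter combination.

    The stored top-N values take ranks 1..N; all other values get the
    remaining ranks in a random order."""
    key = (gender, country, birthyear)
    top = TOP_N.get(key, [])
    rest = [value for value in D if value not in top]
    # Seeding with the parameters keeps the random tail reproducible per combination.
    random.Random(repr(key)).shuffle(rest)
    return top + rest


print(ranking("male", "Germany", 1990)[:3])
print(ranking("male", "Italy", 1990)[:3])
```

Storage is proportional to the number of parameter combinations times N rather than times |D|. Feeding the returned order into a skewed F() such as the geometric distribution makes the top-N values dominate for their parameter combination.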


Compact Correlated Property Value Generation

Using geometric distribution for function F()


Main source of dictionary values from DBpedia (http://dbpedia.org)

Various realistic property value correlations (→)

e.g.,
(person.location, person.gender, person.birthDay) → person.firstName
person.location → person.lastName
person.location → person.university
person.createdDate → person.photoAlbum.createdDate
…

Correlated Properties used in S3G2


Correlated Edges in a social network

[Figure: example graph of persons P1 to P5 connected by <knows> edges. Property labels shown: <firstname> (“Anna”, “Laura”), <is> Student, <studyAt> (“University of Leipzig”, “University of Amsterdam”), <liveAt> (“Germany”, “Netherlands”), <birthYear> (“1990”), <like> <Britney Spears>.]


How to generate correlated edges?

[Figure: the example graph of persons P1 to P5 with their properties; question marks stand in place of the <knows> edges that are still to be generated.]

• Compute the similarity of two nodes based on their (correlated) properties.

• Use a probability density function w.r.t. this similarity for connecting nodes.

[Plot: connection probability against similarity, from highly similar to less similar.]

Multiple correlation dimensions:
- studying near each other
- liking the same music
- etc.

Continuously accessing possibly any node for correlated edges → expensive random I/Os for graphs of a size > RAM


Our observation

[Figure: the example graph of persons P1 to P5 with their properties and <knows> edges.]

Probability that two nodes are connected is skewed w.r.t. the similarity between the nodes (due to the probability distribution)

[Plot: connection probability against similarity, from highly similar to less similar, with a “Window” marked on it.]

Trick: disregard nodes with too large a similarity distance (only connect nodes within a similarity window)
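A short sketch of why the window trick loses little. It assumes that similarity distance is an integer position distance (1 = most similar candidate) and that the connection probability is geometric with a hypothetical parameter p; both are illustrative assumptions.

```python
def connection_probability(distance, p):
    """Geometric probability of connecting to the node `distance` positions away (distance >= 1)."""
    return (1 - p) ** (distance - 1) * p


def mass_outside_window(window, p):
    """Probability mass ignored when nodes more than `window` positions away are disregarded."""
    return (1 - p) ** window


p, window = 0.1, 50  # hypothetical parameter values
inside = sum(connection_probability(d, p) for d in range(1, window + 1))
print(inside, mass_outside_window(window, p))
```

With p = 0.1 and a window of 50 positions, only about 0.5% of the probability mass falls outside the window, so almost all edges that the skewed distribution would pick are still reachable.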


We can sort nodes on a correlation dimension

Similarity metric

Sort nodes on similarity (similar nodes are brought near each other)

Probability function

Pick edge between two nodes based on their ranked distance

(often: geometric distribution, again)

Similarity metric + Probability function

[Figure: ranking along the “Having studied together” dimension: P1 (Munich), P5 (Dresden), P3 (Leipzig), P2 (Leipzig), P4 (Potsdam)]

We use space-filling curves (e.g. Z-order) to get a linear dimension
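A minimal sketch of linearizing a two-dimensional similarity metric with a Z-order (Morton) curve. The grid coordinates assigned to the cities are invented so that the result matches the order on the slide; they are not real geography and not taken from S3G2.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of two non-negative integers into one Z-order (Morton) code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return code


# Hypothetical (person, university city, grid cell) tuples in arbitrary input order.
persons = [
    ("P3", "Leipzig", (0, 1)),
    ("P4", "Potsdam", (1, 1)),
    ("P1", "Munich", (0, 0)),
    ("P2", "Leipzig", (0, 1)),
    ("P5", "Dresden", (1, 0)),
]

# Sorting on the Z-order code brings persons from nearby cells next to each other.
ranked = sorted(persons, key=lambda person: z_order(*person[2]))
print([name for name, _, _ in ranked])  # ['P1', 'P5', 'P3', 'P2', 'P4']
```

Persons who share a cell (here P3 and P2, both in Leipzig) receive the same code and therefore end up adjacent in the sorted order.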


Sort nodes using MapReduce on similarity metric

Reduce() function keeps a window of nodes to generate edges

• Keep low memory usage (sliding window approach)

Slide the window in multiple passes; each pass corresponds to one correlation dimension (multiple MapReduce jobs)

• For each node we choose a degree per pass (also using a probability function); this steers how many edges are picked in the window for that node

Generate edges along correlation dimensions
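The sliding-window passes can be sketched in plain Python, without Hadoop. This is a simplified single-process illustration under stated assumptions: nodes are integer ids, each dimension is a sort-key function, the per-pass degree is drawn uniformly (a stand-in for the probability function mentioned above), and each candidate in the window is accepted by an independent trial with geometrically decaying probability. All names and parameters are hypothetical.

```python
import random
from collections import deque


def generate_edges(sorted_nodes, window_size, p, max_degree, rng):
    """One pass over nodes already sorted on a correlation dimension.

    Only the last `window_size` nodes are kept in memory; each arriving node
    may connect to nodes in that window, preferring the nearest ones."""
    window = deque(maxlen=window_size)
    edges = []
    for node in sorted_nodes:
        degree = rng.randint(0, max_degree)  # hypothetical per-pass degree choice
        picked = 0
        # The most similar nodes sit at the right end of the window.
        for distance, other in enumerate(reversed(window), start=1):
            if picked >= degree:
                break
            if rng.random() < (1 - p) ** (distance - 1) * p:
                edges.append((other, node))
                picked += 1
        window.append(node)
    return edges


def multi_pass(nodes, dimensions, window_size, p, max_degree, seed=0):
    """Run one pass per correlation dimension and merge the resulting edges."""
    rng = random.Random(seed)
    edges = set()
    for sort_key in dimensions:  # one MapReduce job per dimension in the slides
        ordered = sorted(nodes, key=sort_key)
        for a, b in generate_edges(ordered, window_size, p, max_degree, rng):
            edges.add((min(a, b), max(a, b)))  # store each undirected edge once
    return edges


nodes = list(range(100))
dimensions = [lambda n: n, lambda n: (n * 37) % 100]  # two made-up orderings
print(len(multi_pass(nodes, dimensions, window_size=5, p=0.5, max_degree=3, seed=42)))
```

Memory use per pass is bounded by the window size, independent of the number of nodes, which is the property that lets the reduce step stream over a sorted input.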


Having studied together

Having common interests (hobbies)

Random dimension

• motivation: not all friendships are explainable (…)

(Of course, these two correlation dimensions are still a gross simplification of reality, but this provides some interesting material for benchmark queries.)

Correlation dimensions for our Social Graph


Social graph characteristics

• Output graph has characteristics similar to those observed in real social networks (i.e., “small-world network” characteristics)
  - Power-law social degree distribution
  - Low average path length
  - High clustering coefficient

Scalability

• Generates up to 1.2 TB of data (1.2 million users) in half an hour
  - Runs on a cluster of 16 nodes (part of the SciLens cluster, www.scilens.org)

• Scales out linearly

Evaluation (… see the paper)


We propose a novel framework for a scalable graph generator that can

• Generate huge graphs with correlations between the graph structure and the graph data
• Exploit the parallelism offered by the MapReduce paradigm for scalability

Future step: Use S3G2 as the base for a novel benchmark in graph query processing (www.w3.org/wiki/Social_Network_Intelligence_BenchMark)

Conclusion
