S3G2 . 27-Aug-12. Page 1/23
S3G2: a Scalable Structure-correlated Social Graph Generator
Minh-Duc Pham, Peter Boncz, Orri Erling
Database Architectures Group, Centrum Wiskunde & Informatica (CWI)
S3G2 . 27-Aug-12. Page 2/23
Data correlations between attributes
SELECT personID FROM person
WHERE firstName = 'Joachim' AND addressCountry = 'Germany'
SELECT personID FROM person
WHERE firstName = 'Cesare' AND addressCountry = 'Italy'
[Slide labels: Joachim Loew, Cesare Prandelli]
Query optimizers may underestimate or overestimate the result size of conjunctive predicates
Anti-correlation: the same queries with the first names swapped ('Cesare' with 'Germany', 'Joachim' with 'Italy')
Correlation between predicates has been studied to some extent in database research (e.g. in the LEO project)
But: correlation-aware query optimization is still hardly mainstream in database products
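The under- and overestimation mentioned above can be illustrated with a minimal sketch. It assumes an optimizer that multiplies per-predicate selectivities as if they were independent; the toy table and the helper names `independence_estimate` and `actual_count` are hypothetical and not taken from the slides.

```python
# Hypothetical toy table of (firstName, addressCountry) rows; not data from the slides.
people = (
    [("Joachim", "Germany")] * 40
    + [("Cesare", "Italy")] * 40
    + [("Joachim", "Italy")] * 1
    + [("Cesare", "Germany")] * 1
    + [("Anna", "Germany")] * 9
    + [("Anna", "Italy")] * 9
)


def independence_estimate(rows, first_name, country):
    """Estimate the result size assuming the two predicates are independent."""
    n = len(rows)
    sel_name = sum(1 for f, _ in rows if f == first_name) / n
    sel_country = sum(1 for _, c in rows if c == country) / n
    return n * sel_name * sel_country


def actual_count(rows, first_name, country):
    """Count the rows that really satisfy both predicates."""
    return sum(1 for f, c in rows if f == first_name and c == country)


for name, country in [("Joachim", "Germany"), ("Cesare", "Germany")]:
    est = independence_estimate(people, name, country)
    act = actual_count(people, name, country)
    print(f"{name} in {country}: estimated {est:.1f}, actual {act}")
```

In this toy data the correlated pair is underestimated and the anti-correlated pair is overestimated, although both receive the same estimate.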
S3G2 . 27-Aug-12. Page 3/23
Data correlations between attributes
SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
      jn1.name = 'VLDB Journal' AND jn2.name = 'TODS'
S3G2 . 27-Aug-12. Page 4/23
Data correlations over joins
SELECT COUNT(*)
FROM paper pa1 JOIN journal jn1 ON pa1.journal = jn1.ID,
     paper pa2 JOIN journal jn2 ON pa2.journal = jn2.ID
WHERE pa1.author = pa2.author AND
      jn1.name = 'VLDB Journal' AND jn2.name = 'TODS'
(variant: jn2.name = 'Bioinformatics' instead of 'TODS')
A challenge for optimizers: adjust the estimated join hit ratio of pa1.author = pa2.author depending on the other predicates
Correlated predicates are still a frontier area in database research
S3G2 . 27-Aug-12. Page 5/23
Graph database systems
Emerging class in database systems
Higher need for correlation-awareness
• graph queries navigate over many steps (= joins)
• well-known effect in RDF systems (many self-joins)
• implicit structure of the graph/RDF data model re-appears in queries as correlations (structural correlation)
No existing graph benchmark specifically tests for the effects of correlations
• Synthetic graphs used for benchmarking do not have structural correlations
Need a data generator that produces synthetic graphs with data/structure correlations: S3G2
S3G2 . 27-Aug-12. Page 6/23
Next …
what data do we generate?
• social network, Facebook-like
how to generate correlated properties?
• with a compact data generator
how to generate correlated structure?
• multiple correlation dimensions
• scalable MapReduce algorithm (multi-pass)
S3G2 . 27-Aug-12. Page 7/23
S3G2: Generating a Correlated Social Graph
[Figure: example social graph. Node types: User, Post, Photo, Comment. Edge labels: knows, create, upload, like, hasName, studyAt, liveAt, InRelationShip. Example property values: “Yamaku”, “EPFL”, “Switzerland”.]
S3G2 . 27-Aug-12. Page 9/23
Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
S3G2 . 27-Aug-12. Page 10/23
Generating Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
• a fixed set of values, e.g., {“Andrea”, “Anna”, “Cesare”, “Camilla”, “Duc”, “Joachim”, “Leon”, “Orri”, ..}
Probability density function F()
• steers how the generator chooses values
− cumulative distribution over dictionary entries determines which value to pick
• could be anything: uniform, binomial, geometric, etc.
− geometric (discrete exponential) seems to explain many natural phenomena
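The interplay of D() and F() can be sketched as follows. This is a minimal illustration, assuming a geometric distribution truncated to the dictionary size; the parameter `p = 0.3`, the seed, and the function names are assumptions made for the example, not values from S3G2.

```python
import random

# Dictionary D(): the example values from the slide, listed in rank order
# (position 0 holds the most likely value).
D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]


def geometric_cdf(size, p):
    """Cumulative probabilities of a geometric distribution truncated to `size` ranks."""
    weights = [(1 - p) ** k * p for k in range(size)]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w / total
        cdf.append(running)
    cdf[-1] = 1.0  # guard against floating-point drift
    return cdf


def pick_value(dictionary, cdf, rng):
    """Draw u in [0, 1) and return the first entry whose cumulative probability exceeds u."""
    u = rng.random()
    for value, threshold in zip(dictionary, cdf):
        if u < threshold:
            return value
    return dictionary[-1]


rng = random.Random(42)
cdf = geometric_cdf(len(D), p=0.3)
sample = [pick_value(D, cdf, rng) for _ in range(10_000)]
print({name: sample.count(name) for name in D})
```

Values near the front of the dictionary are drawn far more often than values near the end, which is the skew the geometric distribution is meant to provide.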
S3G2 . 27-Aug-12. Page 11/23
Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
Probability density function F()
Ranking Function R()
• Gives each value a unique rank between 1 and |D|
− determines which value gets which probability
• Depends on some parameters (parameterized function)
− value frequency distribution becomes correlated by the parameters of R()
S3G2 . 27-Aug-12. Page 12/23
Generating Correlated Property Values
How do data generators generate values? E.g. FirstName
Value Dictionary D()
• {“Andrea”, “Anna”, “Cesare”, “Camilla”, “Duc”, “Joachim”, “Leon”, “Orri”, ..}
Probability density function F()
• geometric distribution
Ranking Function R(gender, country, birthyear)
• gender, country, birthyear: correlation parameters
How to implement R()?
• We need a table storing |Gender| × |Country| × |BirthYear| × |D| entries
− |Gender| × |Country| × |BirthYear|: limited number of combinations
− |D|: potentially many!
Our solution:
• Just store the rank of the top-N values, not all |D|
• Assign the rank of the other dictionary values randomly
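The top-N idea can be sketched as below. This is only an illustration under stated assumptions: the table contents, N = 2, and the choice to seed the random ranks from the parameter combination (so that they are reproducible) are all hypothetical and not described in the slides.

```python
import random

D = ["Andrea", "Anna", "Cesare", "Camilla", "Duc", "Joachim", "Leon", "Orri"]

# Hypothetical top-N table (N = 2): only the most popular values per
# (gender, country, birthyear) combination are stored explicitly.
TOP_N = {
    ("male", "Germany", 1960): ["Joachim", "Leon"],
    ("male", "Italy", 1960): ["Cesare", "Andrea"],
}


def ranking(gender, country, birth_year):
    """Return D reordered so that position i holds the value with rank i + 1."""
    key = (gender, country, birth_year)
    top = TOP_N.get(key, [])
    rest = [v for v in D if v not in top]
    # Assumption: seed from the parameters so the random ranks of the
    # remaining values are the same every time this combination is seen.
    rng = random.Random(repr(key))
    rng.shuffle(rest)
    return top + rest


print(ranking("male", "Germany", 1960))
print(ranking("male", "Italy", 1960))
```

The stored table then grows with N per parameter combination instead of with |D|, while every dictionary value still receives a unique rank.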
S3G2 . 27-Aug-12. Page 13/23
Compact Correlated Property Value Generation
Using geometric distribution for function F()
S3G2 . 27-Aug-12. Page 14/23
Correlated Properties used in S3G2
Main source of dictionary values: DBpedia (http://dbpedia.org)
Various realistic property value correlations (→), e.g.,
• (person.location, person.gender, person.birthDay) → person.firstName
• person.location → person.lastName
• person.location → person.university
• person.createdDate → person.photoAlbum.createdDate
• …
S3G2 . 27-Aug-12. Page 16/23
Correlated Edges in a social network
[Figure: example graph of persons P1 to P5 connected by <knows> edges. Properties shown include <is> (Student), <firstname> (“Anna”, “Laura”), <studyAt> (“University of Leipzig”, “University of Amsterdam”), <liveAt> (“Germany”, “Netherlands”), <birthYear> (“1990”), and <like> (<Britney Spears>).]
S3G2 . 27-Aug-12. Page 17/23
How to generate correlated edges?
[Figure: the example graph of persons P1 to P5 with their properties, without <knows> edges; question marks indicate candidate edges.]
Continuously accessing possibly any node for correlated edges leads to expensive random I/Os for graphs of a size > RAM
• Compute the similarity of two nodes based on their (correlated) properties.
• Use a probability density function w.r.t. this similarity for connecting nodes
[Plot: connection probability versus similarity, from highly similar to less similar]
Multiple correlation dimensions:
• Studying near each other
• Liking the same music
• etc.
S3G2 . 27-Aug-12. Page 18/23
Our observation
[Figure: the example graph of persons P1 to P5 with their properties and <knows> edges, as on the “Correlated Edges in a social network” slide]
Probability that two nodes are connected is skewed w.r.t. the similarity between the nodes (due to the probability distribution)
[Plot: connection probability versus similarity, from highly similar to less similar, with a “Window” region marked]
Trick: disregard nodes with too large a similarity distance (only connect nodes within a similarity window)
S3G2 . 27-Aug-12. Page 19/23
We can sort nodes on a correlation dimension
Similarity metric
Sort nodes on similarity (similar nodes are brought near each other)
Probability function
Pick edge between two nodes based on their ranked distance
(often: geometric distribution, again)
Similarity metric + Probability function
<Ranking along the “Having studied together” dimension>
P1 Munich
P5 Dresden
P3 Leipzig
P2 Leipzig
P4 Potsdam
We use space-filling curves (e.g. Z-order) to get a linear dimension
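A Z-order (Morton) code maps a two-dimensional position to a single sortable number by interleaving bits. The sketch below assumes small integer grid coordinates invented purely for illustration; they are not real geographic data and the slides do not specify how locations are encoded.

```python
def z_order(x, y, bits=16):
    """Interleave the bits of two non-negative integers into one Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return code


# Hypothetical grid coordinates (x, y) for the study locations of the five persons.
nodes = [
    ("P1", "Munich", 1, 0),
    ("P2", "Leipzig", 2, 3),
    ("P3", "Leipzig", 2, 3),
    ("P4", "Potsdam", 3, 3),
    ("P5", "Dresden", 3, 2),
]

# Sorting on the Z-order code yields one linear ranking in which nodes
# that are close on the grid tend to end up close to each other.
ranked = sorted(nodes, key=lambda n: z_order(n[2], n[3]))
for person, city, x, y in ranked:
    print(person, city, z_order(x, y))
```

Nodes with identical coordinates receive the same code, so a tie-breaking rule (here the stable sort order) decides their relative position.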
S3G2 . 27-Aug-12. Page 20/23
Generate edges along correlation dimensions
Sort nodes using MapReduce on the similarity metric
Reduce() function keeps a window of nodes to generate edges
• Keeps memory usage low (sliding window approach)
Slide the window in multiple passes; each pass corresponds to one correlation dimension (multiple MapReduce jobs)
• for each node we choose a degree per pass (also using a prob. function)
− steers how many edges are picked in the window for that node
[Figure: sliding window W over the sorted nodes]
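One pass of the sliding-window approach might look like the sketch below. It is a single-process illustration, not the MapReduce implementation: the function name, the fixed per-node degree, the geometric acceptance rule, and all parameter values are assumptions. For simplicity the degree budget only counts edges to earlier nodes in the sort order.

```python
import random
from collections import deque


def generate_edges(sorted_nodes, window_size, degree_of, p, rng):
    """Emit edges between nodes that lie within `window_size` positions of each other.

    `sorted_nodes` is assumed to be already ordered along one correlation dimension.
    """
    edges = []
    window = deque(maxlen=window_size)  # the only nodes kept in memory
    for node in sorted_nodes:
        budget = degree_of(node)
        # Walk the window from nearest to farthest neighbour; the acceptance
        # probability decays geometrically with the ranked distance.
        for distance, other in enumerate(reversed(window), start=1):
            if budget == 0:
                break
            if rng.random() < p * (1 - p) ** (distance - 1):
                edges.append((other, node))
                budget -= 1
        window.append(node)  # the oldest node drops out automatically
    return edges


nodes = list(range(1000))  # stand-in for node IDs in sorted order
edges = generate_edges(nodes, window_size=10, degree_of=lambda n: 3,
                       p=0.5, rng=random.Random(7))
print(len(edges), "edges, e.g.", edges[:5])
```

Running the function once per correlation dimension, each time on a differently sorted node list, would correspond to the multiple passes described above.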
S3G2 . 27-Aug-12. Page 21/23
Correlation dimensions for our Social Graph
• Having studied together
• Having common interests (hobbies)
• Random dimension
− motivation: not all friendships are explainable (…)
(of course, these two correlation dimensions are still a gross simplification of reality, but this provides some interesting material for benchmark queries)
S3G2 . 27-Aug-12. Page 22/23
Evaluation (… see the paper)
Social graph characteristics
• Output graph has characteristics similar to those observed in real social networks (i.e., “small-world network” characteristics)
− Power-law social degree distribution
− Low average path length
− High clustering coefficient
Scalability
• Generates up to 1.2 TB of data (1.2 million users) in half an hour
− Runs on a cluster of 16 nodes (part of the SciLens cluster, www.scilens.org)
• Scales out linearly
S3G2 . 27-Aug-12. Page 23/23
Conclusion
We propose a novel framework for a scalable graph generator that can
• Generate huge graphs with correlations between the graph structure and the graph data
• Exploit the parallelism offered by the MapReduce paradigm for scalability
Future step: use S3G2 as the base for a novel benchmark in graph query processing (www.w3.org/wiki/Social_Network_Intelligence_BenchMark)