[Figure: Average KS Distance (y-axis, 0 to 1) vs. Stream Size (%) (x-axis, 0 to 100)]
Motivation
• Studying complex networks is challenging due to their:
  - Heterogeneous and dependent structure
  - Large size
  - Evolution over time
• Network sampling is used to select a subset of nodes/edges from the full graph
• A sample is representative of the full graph if it preserves characteristics of the graph
  - e.g. degree, path length, clustering coefficient
• Network sampling can be broadly classified as: (1) Node (2) Edge (3) Topology
Problem Definition
• Current sampling methods require access to either: (a) the entire graph (b) a node's neighbors
• This requires the graph to be memory-resident
• Big graphs are too large/dynamic to fit in main memory
• Most big graphs evolve as a stream of edges over time
  - e.g. email communications, tweets in Twitter hashtags
• We study how to sample from large graph streams
• We propose a novel graph sampling method, PIES:
  - reservoir-based sampling
  - maintains a dynamic/changing representative sample from large graph streams
  - uses a single pass over the edges, O(|E|)
• We propose streaming implementations for current sampling methods
  - Node, Edge, and Breadth First sampling
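A streaming version of node sampling can be sketched with the min-hash reservoir idea listed under Algorithms: hash every node and keep only the n nodes with the smallest hash values, in one pass over the edges. This is a minimal sketch, not the authors' implementation. The names `node_hash` and `minhash_node_sample`, the use of SHA-256, and the `seed` parameter are assumptions made for illustration, and only the sampled node set is returned.

```python
import hashlib
import heapq


def node_hash(node, seed=0):
    """Map a node id to a pseudo-random 64-bit integer (illustrative choice of hash)."""
    digest = hashlib.sha256(f"{seed}:{node}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_node_sample(edge_stream, n, seed=0):
    """Keep the n nodes with the smallest hash values seen in a stream of (u, v) edges."""
    if n <= 0:
        return set()
    heap = []          # max-heap on hash value, stored as (-hash, node)
    in_sample = set()  # nodes currently held in the reservoir
    for u, v in edge_stream:
        for node in (u, v):
            if node in in_sample:
                continue
            h = node_hash(node, seed)
            if len(heap) < n:
                # Reservoir not full yet: always admit the node.
                heapq.heappush(heap, (-h, node))
                in_sample.add(node)
            elif h < -heap[0][0]:
                # New node beats the largest hash in the reservoir: swap it in.
                _, evicted = heapq.heapreplace(heap, (-h, node))
                in_sample.discard(evicted)
                in_sample.add(node)
    return in_sample
```

Because the largest hash in the reservoir only ever decreases, an evicted node can never re-enter, so the result depends only on the set of nodes in the stream and not on edge order.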
Paper ID: 8
Nesreen K. Ahmed
Purdue University, CS Dept.
Jennifer Neville
Purdue University, CS Dept.
Ramana Kompella
Purdue University, CS Dept.
Results
- Average KS distance across 5 datasets, 10 repeated runs
- Key observation: PIES outperforms NS, ES, and BFS for degree, path length, and clustering coefficient
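The KS (Kolmogorov-Smirnov) distance is the largest gap between two cumulative distributions. The sketch below assumes it is computed between the empirical distribution of a statistic (here, node degree) in the full graph and in the sample. The function names `degree_sequence` and `ks_distance` are illustrative and not taken from the poster.

```python
from bisect import bisect_right
from collections import Counter


def degree_sequence(edges):
    """Return the list of node degrees for an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return list(deg.values())


def ks_distance(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    best = 0.0
    # The supremum is attained at one of the observed values.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        best = max(best, abs(cdf_a - cdf_b))
    return best
```

A value of 0 means the two distributions coincide; a value of 1 means they do not overlap at all.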
[Figure panels: Clustering Coefficient, P(X>x); Path Length, P(X>x)]
- Sample size = 20%
- Key observations:
  - NS, ES, BFS under-estimate the graph distributions
  - PIES accurately preserves the three distributions
[Figure panel: Degree, P(X>x)]
[Figure: Average KS Distance vs. Sampling Fraction (%) for PIES, BFS, NS, ES; panels: Degree, Path Length, Clustering Coeff.]
[Figure labels: Node Sampling (NS), Forest Fire Sampling, Edge Sampling (ES)]
- Averaged over sample sizes from 5% to 40%
- Key observation: PIES maintains a representative sample at different points of the stream
Algorithms
• Partially Induced Edge Sampling (PIES)
  - using a reservoir to maintain a random sample
  - using partial induction in the forward direction of the stream to capture the connectivity of the nodes
• Implementation of Node Sampling
  - keep a reservoir with n nodes with min hash values
• Implementation of Edge Sampling
  - keep a reservoir with m edges with min hash values
• Implementation of Breadth-First Sampling
  - using a sliding window of size w
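The two PIES ingredients above (a node reservoir plus forward partial induction) can be sketched as follows. This is a hedged illustration, not the authors' code. It rests on assumptions the poster does not state: the graph is treated as undirected, self-loops are skipped, an arriving edge triggers replacement with probability m/t (m = edges seen while the reservoir was filling, t = edges seen so far), and the evicted node is chosen uniformly at random. The function name `pies` and its parameters are hypothetical.

```python
import random


def pies(edge_stream, n, seed=None):
    """One-pass sketch: return (nodes, edges) sampled from an iterable of (u, v) pairs."""
    if n < 2:
        raise ValueError("n must be at least 2")
    rng = random.Random(seed)
    adj = {}  # sampled node -> set of sampled neighbours
    m = 0     # edges accepted while the node reservoir was filling
    for t, (u, v) in enumerate(edge_stream, start=1):
        if u == v:
            continue  # assumption: ignore self-loops
        if len(adj) < n:
            # Filling phase: admit both endpoints (may overshoot n by one node).
            adj.setdefault(u, set())
            adj.setdefault(v, set())
            m += 1
        elif rng.random() <= m / t:
            # Replacement phase: swap each unseen endpoint for a random sampled node.
            for new in (u, v):
                if new not in adj:
                    candidates = [x for x in adj if x != u and x != v]  # O(n) scan
                    victim = rng.choice(candidates)
                    for nb in adj.pop(victim):
                        adj[nb].discard(victim)  # drop the victim's incident edges
                    adj[new] = set()
        # Partial induction: keep the edge only if both endpoints are sampled now.
        if u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    edges = {frozenset((a, b)) for a, nbrs in adj.items() for b in nbrs}
    return set(adj), edges
```

Edges between two sampled nodes are added only from the moment both nodes are in the reservoir, which is what makes the induction "partial": earlier edges between them are never recovered, so no second pass over the stream is needed.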
Spectrum of Computational Models for Sampling Graphs