
Nesreen K. Ahmed, Jennifer Neville, Ramana Kompella
nesreenahmed.com/posters/ahmed-bigmine2012-ppt.pdf


[Figure: Average KS Distance (0 to 1) vs. Stream Size (%) (20 to 100)]

Motivation

•  Studying complex networks is a challenging task due to their:
   -  Heterogeneous and dependent structure
   -  Large size
   -  Evolution over time

•  Network sampling is used to select a subset of nodes/edges from the full graph

•  A sample is representative of the full graph if it preserves characteristics of the graph
   -  e.g. degree, path length, clustering coefficient

•  Network sampling can be broadly classified as: (1) Node (2) Edge (3) Topology

Problem Definition

•  Current sampling methods require access to either: (a) the entire graph or (b) a node's neighbors

•  This requires the graph to be memory-resident

•  Big graphs are too large/dynamic to fit in main memory

•  Most big graphs evolve as a stream of edges over time
   -  e.g. email communications, tweets in Twitter hashtags

•  Study how to sample from large graph streams

•  We propose a novel graph sampling method, PIES:
   -  reservoir-based sampling
   -  maintains a dynamic/changing representative sample from large graph streams
   -  uses a single pass over the edges, O(|E|)

•  We propose streaming implementations for current sampling methods
   -  Node, Edge, and Breadth First sampling
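
The Algorithms section describes the streaming node and edge samplers as keeping a reservoir of the n nodes (or m edges) with the minimum hash values. A minimal Python sketch of that idea follows. The helper names (`item_hash`, `min_hash_reservoir`, `nodes_of`), the choice of SHA-1 as the hash, and the `seed` parameter are assumptions made for illustration and do not come from the poster.

```python
import hashlib
import heapq


def item_hash(item, seed=0):
    """Map an item to a pseudo-random 64-bit integer (hash choice is an assumption)."""
    digest = hashlib.sha1(f"{seed}:{item!r}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def min_hash_reservoir(items, k, seed=0):
    """Keep the k distinct items with the smallest hash values in one pass."""
    if k <= 0:
        raise ValueError("k must be positive")
    heap = []     # max-heap on the hash, stored as (-hash, item)
    kept = set()  # items currently in the reservoir
    for item in items:
        if item in kept:
            continue
        h = item_hash(item, seed)
        if len(heap) < k:
            heapq.heappush(heap, (-h, item))
            kept.add(item)
        elif h < -heap[0][0]:
            # New item beats the largest hash in the reservoir: swap it in.
            _, evicted = heapq.heapreplace(heap, (-h, item))
            kept.discard(evicted)
            kept.add(item)
    return kept


def nodes_of(edge_stream):
    """Yield both endpoints of every (u, v) edge in the stream."""
    for u, v in edge_stream:
        yield u
        yield v
```

With these helpers, `min_hash_reservoir(nodes_of(stream), n)` would play the role of streaming node sampling and `min_hash_reservoir(stream, m)` that of streaming edge sampling. Because each item's hash is fixed, the result depends only on which items appear in the stream, not on their arrival order. For undirected graphs, edges would need to be put in a canonical orientation before hashing.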

Paper ID: 8

Nesreen K. Ahmed, Purdue University, CS Dept.

Jennifer Neville, Purdue University, CS Dept.

Ramana Kompella, Purdue University, CS Dept.

Results

-  Average KS-Distance across 5 datasets, 10 repeated runs

-  Key observation: PIES outperforms NS, ES, BFS for degree, path length, and clustering coefficient
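
The KS distance reported here compares the distribution of a property (for example, degree) in the sample against the same property in the full graph. A minimal Python sketch follows, assuming the standard two-sample Kolmogorov-Smirnov statistic, i.e. the maximum absolute difference between the two empirical CDFs. The function name `ks_distance` is illustrative and not taken from the poster.

```python
from bisect import bisect_right


def ks_distance(xs, ys):
    """Return max |F_x(v) - F_y(v)| over all observed values v."""
    xs, ys = sorted(xs), sorted(ys)
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    best = 0.0
    for v in set(xs) | set(ys):
        # Empirical CDFs: fraction of observations less than or equal to v.
        fx = bisect_right(xs, v) / len(xs)
        fy = bisect_right(ys, v) / len(ys)
        best = max(best, abs(fx - fy))
    return best
```

A value of 0 means the two empirical distributions coincide, and 1 means they do not overlap at all. Calling it with the degree sequence of the full graph and the degree sequence of the sampled subgraph gives one point on a degree panel. The averaging over datasets and runs is a separate step.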

[Figure: P(X>x) vs. Clustering Coefficient]

[Figure: P(X>x) vs. Path Length]

-  Sample size = 20%

-  Key observations:
   NS, ES, and BFS under-estimate the graph distributions
   PIES accurately preserves the three distributions

[Figure: P(X>x) vs. Degree]

[Figure: Average KS Distance (0 to 1) vs. Sampling Fraction (%) (5 to 40) for PIES, BFS, NS, and ES; panels: Degree, Path Length, Clustering Coeff.]

Legend: Node Sampling (NS), Forest Fire Sampling, Edge Sampling (ES)

-  Average on sample sizes 5% to 40%

-  Key observation: PIES maintains a representative sample at different points of the stream

Algorithms

•  Partially Induced Edge Sampling (PIES)
   -  using a reservoir to maintain a random sample
   -  using partial induction in the forward direction of the stream to capture the connectivity of the nodes

•  Implementation of Node Sampling
   -  keep a reservoir with n nodes with min hash values

•  Implementation of Edge Sampling
   -  keep a reservoir with m edges with min hash values

•  Implementation of Breadth-First Sampling
   -  using a sliding window of size w
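
The two PIES bullets (a node reservoir plus partial induction in the forward direction of the stream) can be read as the following single-pass procedure. This is a hedged Python sketch of one such reading, not the authors' implementation. The replacement probability `m / t`, the uniform choice of which node to evict, the dropping of an evicted node's edges, and the name `pies_sketch` are all assumptions that the poster does not state.

```python
import random


def pies_sketch(edge_stream, n, seed=None):
    """One pass over (u, v) edges; returns (sampled_nodes, sampled_edges)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    rng = random.Random(seed)
    nodes, pos, adj = [], {}, {}  # reservoir list, node -> index, node -> neighbors
    m = 0                         # edges consumed while filling the reservoir

    def replace_random(new_node, other):
        # Evict a uniformly chosen node (never the edge's other endpoint).
        while True:
            i = rng.randrange(len(nodes))
            if nodes[i] != other:
                break
        old = nodes[i]
        for nb in adj.pop(old):   # drop the evicted node's sampled edges
            adj[nb].discard(old)
        del pos[old]
        nodes[i], pos[new_node], adj[new_node] = new_node, i, set()

    for t, (u, v) in enumerate(edge_stream, start=1):
        if u == v:
            continue              # self-loops are skipped in this sketch
        if len(nodes) < n:
            # Filling phase: admit endpoints until the reservoir holds n nodes.
            for x in (u, v):
                if x not in pos and len(nodes) < n:
                    pos[x] = len(nodes)
                    nodes.append(x)
                    adj[x] = set()
            m = t
        elif rng.random() < m / t:  # assumed sampling probability
            if u not in pos:
                replace_random(u, v)
            if v not in pos:
                replace_random(v, u)
        # Partial induction: keep the edge only if both endpoints are
        # currently sampled; earlier edges are never revisited.
        if u in pos and v in pos:
            adj[u].add(v)
            adj[v].add(u)

    edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return set(nodes), edges
```

The induction is "partial" because an edge is kept only if it arrives after both of its endpoints are in the reservoir, so no second pass or stored history is needed. Memory stays bounded by the n sampled nodes and the edges among them.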

Spectrum of Computational Models for Sampling Graphs