Large Scale Complex Network Analysis using the Hybrid Combination of a MapReduce Cluster and a Highly Multithreaded System

Seunghwa Kang, David A. Bader


Page 2

Various Complex Networks

• Friendship network
• Citation network
• Web-link graph
• Collaboration network

=> Need to extract graphs from large volumes of raw data.

=> Extracted graphs are highly irregular.

Source: http://www.facebook.com

Source: http://academic.research.microsoft.com

Page 3

A Challenge Problem

• Extracting a subgraph from a larger graph.
- The input graph: An R-MAT* graph (undirected, unweighted) with approx. 4.29 billion vertices and 275 billion edges (7.4 TB in text format).
- R-MAT parameters: a=0.55, b=0.1, c=0.1, d=0.25.
- Extract subnetworks that cover 10%, 5%, and 2% of the vertices.
• Finding a single-pair shortest path (for up to 30 pairs).

* D. Chakrabarti, Y. Zhan, and C. Faloutsos, “R-MAT: A recursive model for graph mining,” SIAM Int’l Conf. on Data Mining (SDM), 2004.

Source: Seokhee Hong
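
The R-MAT model named on this slide can be sketched in a few lines. The following is a minimal illustration, not the generator the authors used: it picks one of four adjacency-matrix quadrants per bit with probabilities a, b, c, d, and omits the noise, deduplication, and self-loop removal that production generators usually add. The function names and the `scale` parameter are assumptions made for this sketch. As a side note, 2^32 is about 4.29 billion, which is consistent with the vertex count on the slide.

```python
import random

def rmat_edge(scale, a=0.55, b=0.1, c=0.1, d=0.25, rng=random):
    """Return one (u, v) edge of an R-MAT graph with 2**scale vertices.

    At each of the `scale` levels, one quadrant of the adjacency matrix is
    chosen with probability a, b, c, or d (d is implied by the final else),
    which fixes one bit of u and v.
    """
    u = v = 0
    for _ in range(scale):
        r = rng.random()
        u <<= 1
        v <<= 1
        if r < a:              # top-left quadrant: both bits stay 0
            pass
        elif r < a + b:        # top-right quadrant
            v |= 1
        elif r < a + b + c:    # bottom-left quadrant
            u |= 1
        else:                  # bottom-right quadrant
            u |= 1
            v |= 1
    return u, v

def rmat_edges(scale, num_edges, seed=0):
    """Generate a reproducible list of R-MAT edges (hypothetical helper)."""
    rng = random.Random(seed)
    return [rmat_edge(scale, rng=rng) for _ in range(num_edges)]

# A tiny instance: 2**10 vertices, 1000 edges.
sample = rmat_edges(10, 1000, seed=1)
```

For an undirected graph such as the one on the slide, each generated pair would be treated as an unordered edge.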

Page 4

Presentation Outline

• Present the hybrid system.
• Solve the problem using three different systems: a MapReduce cluster, a highly multithreaded system, and the hybrid system.
• Show the effectiveness of the hybrid system by
- Algorithm level analyses
- System level analyses
- Experimental results

Page 5

Highlights

Theory level analysis
- A MapReduce cluster: Graph extraction: $W_{\mathrm{MapReduce}}(n) \approx \Theta(T^*(n))$. Shortest path: $W_{\mathrm{MapReduce}}(n) > \Theta(T^*(n))$.
- A highly multithreaded system: Work optimal.
- A hybrid system of the two: Effective if $|T_{\mathrm{hmt}} - T_{\mathrm{MapReduce}}| > n / BW_{\mathrm{inter}}$.

System level analysis
- A MapReduce cluster: Bisection bandwidth and disk I/O overhead.
- A highly multithreaded system: Limited aggregate computing power, disk capacity, and I/O bandwidth.
- A hybrid system of the two: $BW_{\mathrm{inter}}$ is important.

Experiments
- A MapReduce cluster: Five orders of magnitude slower than the highly multithreaded system in finding a shortest path.
- A highly multithreaded system: Incapable of storing the input graph.
- A hybrid system of the two: Efficient in solving the challenge problem.

Pages 6-8

A Hybrid System to Address the Distinct Computational Challenges

[Diagram, built up over three slides: a MapReduce cluster and a highly multithreaded system, labeled with the two tasks below.]

1. Graph extraction
2. Graph analysis queries

Page 9

The MapReduce Programming Model

• Scans the entire input data in the map phase.
• # MapReduce iterations = the depth of a directed acyclic graph (DAG) for MapReduce computation.

[Figure: data flow. Input data -> map -> Intermediate data -> sort -> Sorted intermediate data -> reduce -> Output data]

[Figure: DAG for MapReduce computation. Depth 1: A[0] ... A[7]. Depth 2: A'[0] ... A'[4]. Depth 3: A''[0] ... A''[4]]
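
The map, sort, and reduce phases on this slide can be illustrated with a single-process simulation. This is a sketch for intuition only and is not Hadoop code: `map_reduce`, `degree_mapper`, and `degree_reducer` are hypothetical names, and the example job (counting vertex degrees from an edge list) is chosen here purely as an illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Run one MapReduce iteration in memory (hypothetical helper)."""
    # Map phase: every input record is scanned exactly once.
    intermediate = [kv for rec in records for kv in mapper(rec)]
    # Sort phase: order intermediate (key, value) pairs by key.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(key, [value for _, value in group]))
    return output

def degree_mapper(edge):
    """Emit (vertex, 1) for both endpoints of an undirected edge."""
    u, v = edge
    yield (u, 1)
    yield (v, 1)

def degree_reducer(vertex, ones):
    """Sum the counts collected for one vertex."""
    yield (vertex, sum(ones))

edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
degrees = map_reduce(edges, degree_mapper, degree_reducer)
print(degrees)  # [(1, 2), (2, 2), (3, 3), (4, 1)]
```

A computation whose DAG has depth k would call `map_reduce` k times, feeding each output into the next iteration, and each call rescans its whole input.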

Page 10

Evaluating the Efficiency of MapReduce Algorithms

• $W_{\mathrm{MapReduce}} = \sum_{i=1}^{k} O\big(n_i \cdot (1 + f_i \cdot (1 + r_i)) + p_r \cdot \mathrm{Sort}(n_i f_i / p_r)\big)$
- $k$: # MapReduce iterations.
- $n_i$: the input data size for the ith iteration.
- $f_i$: map output size / map input size.
- $r_i$: reduce output size / reduce input size.
- $p_r$: # reducers.
• Extracting a subgraph
- $k = 1$ and $f_i \ll 1$, so $W_{\mathrm{MapReduce}}(n) \approx \Theta(T^*(n))$, where $T^*(n)$ is the time complexity of the best sequential algorithm.
• Finding a single-pair shortest path
- $k = \lceil d/2 \rceil$ and $f_i \approx 1$, so $W_{\mathrm{MapReduce}}(n) > \Theta(T^*(n))$.
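
To make the work formula concrete, the sketch below evaluates it numerically for the two cases on this slide. It is an illustration under stated assumptions: constants hidden by the O-notation are set to 1, `Sort(m)` is modeled as m log2 m (the slide does not specify a sort model), and the input sizes, ratios, and iteration counts in the example are invented for demonstration.

```python
import math

def sort_cost(m):
    """Assumed sort model: m * log2(m); inputs of size <= 1 cost m."""
    return m * math.log2(m) if m > 1 else m

def mapreduce_work(iterations, p_r):
    """Evaluate the work formula with all hidden constants set to 1.

    iterations: list of (n_i, f_i, r_i) tuples, one per MapReduce iteration.
    p_r: number of reducers.
    """
    total = 0.0
    for n_i, f_i, r_i in iterations:
        scan = n_i * (1 + f_i * (1 + r_i))          # read input, write map and reduce output
        sort = p_r * sort_cost(n_i * f_i / p_r)     # each reducer sorts its share
        total += scan + sort
    return total

n = 1_000_000
# Subgraph extraction: one iteration, map output much smaller than input.
extraction = mapreduce_work([(n, 0.01, 1.0)], p_r=4)
# Shortest path: several iterations (5 is an illustrative value), map output about as large as input.
shortest_path = mapreduce_work([(n, 1.0, 1.0)] * 5, p_r=4)
```

With these illustrative inputs the extraction work stays close to n, while the shortest-path work is many times larger because every iteration rescans and re-sorts data of size about n.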


Page 13

Bisection Bandwidth Requirements for a MapReduce Cluster

• The shuffle phase, which requires inter-node communication, can be overlapped with the map phase.
• If $T_{\mathrm{map}} > T_{\mathrm{shuffle}}$, $T_{\mathrm{shuffle}}$ does not affect the overall execution time.
- $T_{\mathrm{map}}$ scales trivially.
- To scale $T_{\mathrm{shuffle}}$ linearly, bisection bandwidth also needs to scale in proportion to the number of nodes. Yet, the cost to linearly scale bisection bandwidth increases super-linearly.
- If $f \ll 1$, the sub-linear scaling of $T_{\mathrm{shuffle}}$ does not increase the overall execution time.
- If $f \approx 1$, it increases the overall execution time.
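
The overlap argument on this slide can be expressed as a toy model. This is a rough sketch and not a model taken from the presentation: it assumes perfect overlap, so the combined map and shuffle time is the larger of the two, and it assumes a simple rate model in which map time is input size divided by aggregate map throughput and shuffle time is map output size divided by bisection bandwidth. All parameter names and numbers are illustrative.

```python
def map_shuffle_time(n, f, nodes, map_rate, bisection_bw):
    """Return (overall, t_map, t_shuffle) under a perfect-overlap assumption.

    n: input size, f: map output size / map input size,
    nodes: number of nodes, map_rate: per-node map throughput,
    bisection_bw: bisection bandwidth of the cluster network.
    """
    t_map = n / (nodes * map_rate)      # map work divides evenly across nodes
    t_shuffle = n * f / bisection_bw    # only map output crosses the network
    return max(t_map, t_shuffle), t_map, t_shuffle

# f << 1: the shuffle hides behind the map phase.
small_f = map_shuffle_time(n=1000, f=0.01, nodes=10, map_rate=10, bisection_bw=50)
# f ~ 1: the shuffle becomes the bottleneck.
large_f = map_shuffle_time(n=1000, f=1.0, nodes=10, map_rate=10, bisection_bw=50)
```

In the first case the overall time equals the map time; in the second it equals the shuffle time, which is the situation the last bullet describes.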

Pages 14-16

A single-pair shortest path

[Figure: a single-pair shortest path, shown over three slides]

Source: http://academic.research.microsoft.com

Page 17

Disk I/O overhead

• Disk I/O overhead is unavoidable if the size of data overflows the main memory capacity.
• Raw data can be very large.
• Extracted graphs are much smaller.
- The Facebook network: 400 million users × 130 friends per user, which is less than 256 GB using the sparse representation.

[Figure: an example graph with vertices 1-7 and its sparse representation]
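
The Facebook estimate on this slide can be checked with a short calculation. The sketch assumes a layout the slide does not spell out: one 4-byte vertex ID per adjacency entry (400 million IDs fit in 32 bits) and one 8-byte offset per vertex, in the style of a compressed adjacency array. The function name and byte sizes are assumptions for illustration.

```python
def sparse_graph_bytes(num_vertices, avg_degree, id_bytes=4, offset_bytes=8):
    """Estimate memory for an adjacency-array representation (assumed layout)."""
    adjacency = num_vertices * avg_degree * id_bytes   # one ID per (user, friend) entry
    offsets = (num_vertices + 1) * offset_bytes        # start index of each adjacency list
    return adjacency + offsets

facebook_bytes = sparse_graph_bytes(400_000_000, 130)
facebook_gb = facebook_bytes / 1e9
print(f"{facebook_gb:.1f} GB")  # 211.2 GB
```

Under these assumptions the graph needs roughly 211 GB, which is consistent with the slide's bound of less than 256 GB.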

Page 18

A Highly Multithreaded System w/ the Shared Memory Programming Model

• Provide a random access mechanism.
• In SMPs, non-contiguous accesses are expensive.*
• Multithreading tolerates memory access latency.+
• There is a work optimal parallel algorithm to find a single-pair shortest path.

[Figure: Sun Fire T2000 (Niagara). Source: Sun Microsystems]

[Figure: Cray XMT. Source: Cray]

* D. R. Helman and J. Ja’Ja’, “Prefix computations on symmetric multiprocessors,” J. of Parallel and Distributed Computing, 61(2), 2001.
+ D. A. Bader, V. Kanade, and K. Madduri, “SWARM: A parallel programming framework for multi-core processors,” Workshop on Multithreaded Architectures and Applications, 2007.
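
For an unweighted graph, a single-pair shortest path can be found with breadth-first search, which relies on the kind of random access to adjacency lists that shared memory provides. The sketch below is a sequential illustration only; it is not the parallel algorithm referred to on the slide, and the function name and the dictionary-based graph layout are assumptions made here.

```python
from collections import deque

def shortest_path(adj, source, target):
    """Return a shortest path from source to target as a vertex list, or None.

    adj: dict mapping each vertex to a list of its neighbors (unweighted graph).
    """
    parent = {source: None}          # also serves as the visited set
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == target:
            # Walk parent pointers back to the source.
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in adj.get(u, ()):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None

adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3, 5], 5: [4], 6: []}
print(shortest_path(adj, 1, 5))  # [1, 2, 4, 5]
```

Each vertex and edge is touched at most once, so the sequential work is linear in the size of the graph; the neighbor lookups are the non-contiguous memory accesses that multithreading is meant to hide.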

Pages 19-21

A single-pair shortest path

[Figure: a single-pair shortest path, shown over three slides]

Source: http://academic.research.microsoft.com

Page 22

Low Latency High Bisection Bandwidth Interconnection Network

• Latency increases as the size of a system increases.
- A larger number of threads and additional parallelism are required as latency increases.
• Network cost to linearly scale bisection bandwidth increases super-linearly.
- But not too expensive for a small number of nodes.
• These limit the size of a system.
- Reveal limitations in extracting a subgraph from a very large graph.

Page 23

The Time Complexity of an Algorithm on the Hybrid System

• $T_{\mathrm{hybrid}} = \sum_{i=1}^{k} \min(T_{i,\mathrm{MapReduce}} + \Delta,\ T_{i,\mathrm{hmt}} + \Delta)$
- $k$: # steps.
- $T_{i,\mathrm{MapReduce}}$ and $T_{i,\mathrm{hmt}}$: time complexities of the ith step on a MapReduce cluster and a highly multithreaded system, respectively.
- $\Delta$: $n_i / BW_{\mathrm{inter}} \times \delta(i-1, i)$.
- $n_i$: the input data size for the ith step.
- $BW_{\mathrm{inter}}$: the bandwidth between a MapReduce cluster and a highly multithreaded system.
- $\delta(i-1, i)$: 0 if the selected platforms for the (i-1)th and ith steps are the same; 1 otherwise.
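
Because the transfer term Δ depends on which platform ran the previous step, one way to evaluate the formula is a small dynamic program over the platform choice at each step. The sketch below is one possible reading of the formula, not the authors' code. It assumes the data starts on the MapReduce cluster, represents an infeasible step with infinite time, and borrows the 10% subgraph timings reported later in the deck as example inputs, treating the reported memory loading time as the transfer cost (an assumption).

```python
def hybrid_time(steps, bw_inter, start="mapreduce"):
    """Minimum total time over all per-step platform choices.

    steps: list of (n_i, t_mapreduce, t_hmt) tuples.
    bw_inter: bandwidth between the two platforms.
    start: platform assumed to hold the data before step 1.
    """
    platforms = ("mapreduce", "hmt")
    # best[p]: least total time so far with the data currently on platform p.
    best = {p: (0.0 if p == start else float("inf")) for p in platforms}
    for n_i, t_mr, t_hmt in steps:
        step_time = {"mapreduce": t_mr, "hmt": t_hmt}
        transfer = n_i / bw_inter
        best = {
            cur: min(
                best[prev] + step_time[cur] + (transfer if prev != cur else 0.0)
                for prev in platforms
            )
            for cur in platforms
        }
    return min(best.values())

# Step 1: extraction (24 h on MapReduce, infeasible on the multithreaded system).
# Step 2: 30 shortest-path queries (103 h vs 0.00073 h), with a 0.83 h transfer.
steps = [(1.0, 24.0, float("inf")), (0.83, 103.0, 0.00073)]
total_hours = hybrid_time(steps, bw_inter=1.0)  # about 24.83 hours
```

With these inputs the cheapest plan extracts on the MapReduce cluster, pays the transfer once, and runs the queries on the highly multithreaded system.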

Page 24

Test Platforms

• A MapReduce cluster
- 4 nodes.
- 4 dual core 2.4 GHz Opteron processors and 8 GB main memory per node.
- 96 disks (1 TB per disk).
• A highly multithreaded system
- A single socket UltraSparc T2 1.2 GHz processor (8 cores, 64 threads).
- 32 GB main memory.
- 2 disks (145 GB per disk).
• A hybrid system of the two

Source: http://hadoop.apache.org/

[Figure: Sun Fire T2000 (Niagara). Source: Sun Microsystems]

Page 25

A subgraph that covers 10% of the input graph

[Figure: execution time (hours) vs. num. pairs (0 to 30) for the MapReduce cluster and the hybrid system]

Execution time in hours (MapReduce / Hybrid)
- Subgraph extraction: 24 / 24
- Memory loading: - / 0.83
- Finding a shortest path (for 30 pairs): 103 / 0.00073

Once the subgraph is loaded into the memory, the hybrid system analyzes the subgraph five orders of magnitude faster than the MapReduce cluster (103 hours vs 2.6 seconds).

Page 26

Subgraphs that cover 5% (left) and 2% (right) of the input graph

[Figures: execution time (hours) vs. num. pairs (0 to 30) for the MapReduce cluster and the hybrid system, one panel per subgraph size]

5% subgraph, execution time in hours (MapReduce / Hybrid)
- Subgraph extraction: 22 / 22
- Memory loading: - / 0.42
- Finding a shortest path (for 30 pairs): 61 / 0.00047

2% subgraph, execution time in hours (MapReduce / Hybrid)
- Subgraph extraction: 21 / 21
- Memory loading: - / 0.038
- Finding a shortest path (for 30 pairs): 5.2 / 0.00019
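
The reported timings can be turned into speedup figures with a few lines of arithmetic. The sketch below only restates numbers from the result slides (hours for extraction, memory loading, and 30 shortest-path queries); the dictionary layout and function name are choices made here, and a missing memory loading entry for the MapReduce cluster is treated as zero.

```python
import math

# (subgraph extraction, memory loading, shortest paths for 30 pairs), in hours.
results = {
    "10%": {"mapreduce": (24, 0.0, 103), "hybrid": (24, 0.83, 0.00073)},
    "5%": {"mapreduce": (22, 0.0, 61), "hybrid": (22, 0.42, 0.00047)},
    "2%": {"mapreduce": (21, 0.0, 5.2), "hybrid": (21, 0.038, 0.00019)},
}

def summarize(results):
    """Return {size: (query_speedup, end_to_end_speedup)} for each subgraph size."""
    summary = {}
    for size, r in results.items():
        query_speedup = r["mapreduce"][2] / r["hybrid"][2]
        total_speedup = sum(r["mapreduce"]) / sum(r["hybrid"])
        summary[size] = (query_speedup, total_speedup)
    return summary

summary = summarize(results)
query_seconds_10pct = 0.00073 * 3600  # hybrid query time for the 10% subgraph, in seconds
```

For the 10% subgraph the query-only ratio is about 1.4 × 10^5, which matches the "five orders of magnitude" statement, and 0.00073 hours is about 2.6 seconds. End to end, the hybrid system's advantage is smaller because both systems spend most of their time on subgraph extraction.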

Page 27

Conclusions

• We identified the key computational challenges in large-scale complex network analysis problems.
• Our hybrid system effectively addresses the challenges by using the right tool in the right place in a synergistic way.
• Our work showcases a holistic approach to solving real-world challenges.

Page 28

Acknowledgment of Support