Large-Scale Graph Processing ~Introduction~ (Complete Version)


Description

For large-scale graph processing, computation models more efficient than MapReduce have been proposed, and implementations are under way in projects such as Google Pregel, Giraph, Hama, and GoldenOrb. Hama and Giraph are also being adapted to NextGen Apache Hadoop MapReduce. This lightning talk introduces what "Large-Scale Graph Processing" is by comparing it with MapReduce, and closes with the distinguishing features of each project.


Large-Scale Graph Processing

@doryokujin, Hadoop Conference Japan 2011 Fall

~Introduction~

・井上 敬浩 (Takahiro Inoue), 26 years old

・twitter: doryokujin

・Data mining engineer

・Lead of MongoDB JP

・Interested in Hadoop, MongoDB, and graph databases

・Marathon best: 2 hours 33 minutes

Self-Introduction

・Large-scale graph data, driven mainly by social networks, is growing explosively
→ analyzing graph data has become a necessity

・PageRank, popularity ranking, recommendation, shortest paths, friend search
→ is MapReduce really the right model for distributed processing of large graphs?

Motivation: Why Graph?

http://www.catehuston.com/blog/2009/11/02/touchgraph/

In the @IT interview article of September 15, Doug Cutting mentioned Giraph.

"Hadoop MapReduce Design Patterns: Large-Scale Text Data Processing with MapReduce" (Japanese edition)
Written by Jimmy Lin and Chris Dyer; supervised by 神林 飛志 and 野村 直之; translated by 玉川 竜司
Scheduled for release on October 1, 2011; 210 pages; list price ¥2,940

Chapter 5 of this MapReduce design patterns book covers graph algorithms.

Motivation: Why Graph?

MR: Good For Simple Problems

[Figure: a single MapReduce job: Map tasks feed a shuffle into Reduce tasks, reading from and writing to HDFS]

MR: Bad for Iterative Problems

[Figure: iterative workloads chain MapReduce jobs; every iteration i → i+1 pays for the shuffle & barrier, job start/shutdown, and reloading the data from HDFS]

[Figure: example weighted directed graph with vertices A–G]

Is MR Fit for Graph Data?

[Figure: in a vertex-based computation, each vertex updates its value from its neighbours' messages between iterations i and i+1, e.g. taking min(6,4) at one vertex]

Graph Processing = “Vertex Based Approach”

How do you express the message passing between adjacent vertices (iteration i → i+1) in MapReduce?

Is MR Fit for Graph Data?

BSP: Bulk Synchronous Parallel

[Figure: a super step (see http://en.wikipedia.org/wiki/Bulk_Synchronous_Parallel)]

BSP: Bulk Synchronous Parallel

1. Local Computation: each processor performs independent computation on its local data

2. Communication: the processors exchange messages

3. Barrier Synchronisation: wait until every processor has finished its message passing

Iterate the "super step" made of steps 1–3.
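To make the model concrete, here is a minimal in-memory sketch of that superstep loop (plain Python; the function and parameter names are mine, and a real BSP runtime distributes the processors and the barrier across machines):

def bsp_run(local_states, compute, max_supersteps=30):
    # local_states: one state object per "processor".
    # compute(state, inbox) does the local computation and returns
    # a list of (dest_processor_id, message) pairs.
    inboxes = [[] for _ in local_states]
    for superstep in range(max_supersteps):
        outgoing = []
        # 1. local computation on every processor
        for pid, state in enumerate(local_states):
            outgoing.extend(compute(state, inboxes[pid]))
        # 2. communication: deliver all messages
        inboxes = [[] for _ in local_states]
        for dest, msg in outgoing:
            inboxes[dest].append(msg)
        # 3. barrier: implicit here because everything runs sequentially;
        #    in a real runtime nobody starts superstep+1 before delivery completes
        if not outgoing:  # no messages at all means the computation has converged
            break
    return local_states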

Relation: MR and BSP

Local Computation = Map Phase

Communication + Barrier = Shuffle and Sort Phase

Aggregation or (next) Local Computation = Reduce Phase

a super step

[Figure: relationship between BSP, iterative MR, and MR]

Relation: MR and BSP

Graph, Matrix, Machine Learning

[Figure: BSP, iterative MR, and plain MR mapped against Graph Processing, Matrix Computation, and Machine Learning workloads ※1]

※1 It has been shown that many machine learning models can be expressed in MapReduce.

・Announced by Google in June 2009

- Applies BSP to Graph Processing

- At Google, roughly 80% of large-scale data processing uses MapReduce and 20% uses Pregel

- Solved a shortest-path problem on a graph with 1 billion nodes and 80 billion edges in about 200 seconds on 480 machines in parallel

- Reportedly used for YouTube's graph-based recommendations

- The paper is also publicly available

Google Pregel

・Input:
- a directed graph (each vertex and edge has a unique_id and a mutable value)

・Each superstep S:
- receive the messages sent during superstep S-1
- Compute(): apply a user-defined function to each vertex V
- it may change V's local state and the graph's local topology
- Communication: send messages to adjacent vertices
- Barrier Synchronisation: wait until all communication has finished

Google Pregel Model

・Termination condition:
- a vertex with no further work sends a halt signal (Vote to Halt) and becomes inactive
- the computation ends when every vertex is inactive at the same time
- equivalently, it ends once no messages are being passed at all

Google Pregel Model

[Figure: vertex state machine: an active vertex becomes inactive by voting to halt, and an inactive vertex is reactivated when it receives a message]
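A toy driver loop that captures this termination rule might look as follows (illustrative Python, not the actual Pregel API; here compute() is assumed to return a halted flag plus its outgoing messages instead of calling into a framework):

def pregel_run(vertices, max_supersteps=30):
    # vertices: dict id -> vertex object whose compute(msgs) returns
    # (halted, outgoing), with outgoing a list of (dest_id, message) pairs.
    inbox = {vid: [] for vid in vertices}
    active = set(vertices)                  # every vertex starts out active
    for superstep in range(max_supersteps):
        next_inbox = {vid: [] for vid in vertices}
        for vid in list(active):
            halted, outgoing = vertices[vid].compute(inbox[vid])
            if halted:
                active.discard(vid)         # vote to halt -> inactive
            for dest, msg in outgoing:
                next_inbox[dest].append(msg)
                active.add(dest)            # an incoming message reactivates the target
        inbox = next_inbox
        if not active:                      # all vertices inactive, no messages in flight
            break
    return vertices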

Map: BSP to Graph Processing

Local Computation -> Compute(), a user-defined function applied to each vertex

Communication -> message passing to adjacent vertices

Barrier synchronisation -> wait until message passing from every vertex has finished

a super step

SSSP: Parallel BFS

MapReduce & Pregel

※ SSSP: Single Source Shortest Paths, BFS: Breadth First Search


SSSP: MapReduce Model

[Figure: example weighted directed graph with vertices A–G]

initialize

・Load: Adjacency List
A: <(B,5),(D,3)>
B: <(E,1)>
C: <(F,5)>
D: <(B,1),(C,3),(E,4),(F,2)>
E: <>
F: <(G,4)>
G: <>

[Figure: A is the source vertex; initial distances A=0, all other vertices +∞]

・Map Input: [Graph Structure]
- <A: <0, (B,5),(D,3)>>

- <B: <∞, (E,1)>>

- <C: <∞, (F,5)>>

- <D: <∞, (B,1),(C,3),(E,4),(F,2)>>

- <E: <∞>>

- <F: <∞, (G,4)>>

- <G: <∞>>

SSSP: MapReduce Model

[Figure: distances A=0, all other vertices +∞]

・Map Output:
- (B,5),(D,3), <A: <0, (B,5),(D,3)>>

- (E,∞), <B: <∞, (E,1)>>

- (F,∞), <C: <∞, (F,5)>>

- (B,∞),(C,∞),(E,∞),(F,∞), <D: <∞, (B,1),(C,3),(E,4),(F,2)>>

- <E: <∞>>

- (G,∞), <F: <∞, (G,4)>>

- <G: <∞>>

SSSP: MapReduce Model (the graph structure is also sent to the reducers)

Flush to local disk

[Figure: distances A=0, all other vertices +∞]

・Reduce Input:
[A] - <A: <0, (B,5),(D,3)>>

[B] - (B,5),(B,∞), <B: <∞, (E,1)>>

[C] - (C,∞), <C: <∞, (F,5)>>

[D] - (D,3), <D: <∞, (B,1),(C,3),(E,4),(F,2)>>

[E] - (E,∞),(E,∞), <E,<∞>>

[F] - (F,∞),(F,∞), <F: <∞, (G,4)>>

[G] - (G,∞), <G: <∞>>

SSSP: MapReduce Model

[Figure: after the first Reduce, distances A=0, B=5, D=3, all other vertices +∞]

SSSP: MapReduce Model

・Reduce Process:
[A] - <A: <0, (B,5),(D,3)>>
[B] - (B,5),(B,∞), <B: <∞, (E,1)>>
[C] - (C,∞), <C: <∞, (F,5)>>
[D] - (D,3), <D: <∞, (B,1),(C,3),(E,4),(F,2)>>
[E] - (E,∞),(E,∞), <E: <∞>>
[F] - (F,∞),(F,∞), <F: <∞, (G,4)>>
[G] - (G,∞), <G: <∞>>

After the Reduce, flush to HDFS

[Figure: distances A=0, B=5, D=3, all other vertices +∞ (start of the second iteration)]

SSSP: MapReduce Model

・Map Input (Reduce Output):
- <A: <0, (B,5),(D,3)>>

- <B: <5, (E,1)>>

- <C: <∞, (F,5)>>

- <D: <3, (B,1),(C,3),(E,4),(F,2)>>

- <E,<∞>>

- <F: <∞, (G,4)>>

- <G: <∞>>

SSSP: MapReduce Model

[Figure: distances A=0, B=5, D=3, all other vertices +∞]

・Map Output:
- (B,5),(D,3), <A: <0, (B,5),(D,3)>>

- (E,6), <B: <5, (E,1)>>

- (F,∞), <C: <∞, (F,5)>>

- (B,4),(C,6),(E,7),(F,5), <D: <3, (B,1),(C,3),(E,4),(F,2)>>

- <E,<∞>>

- (G,∞), <F: <∞, (G,4)>>

- <G: <∞>>

Flush to local disk

[Figure: after the second Reduce, distances A=0, B=4, C=6, D=3, E=6, F=5, G=+∞]

・Reduce Process:
[A] - <A: <0, (B,5),(D,3)>>

[B] - (B,5),(B,4), <B: <5, (E,1)>>

[C] - (C,6), <C: <∞, (F,5)>>

[D] - (D,3), <D: <3, (B,1),(C,3),(E,4),(F,2)>>

[E] - (E,6),(E,7), <E, <∞>>

[F] - (F,∞),(F,5), <F: <∞, (G,4)>>

[G] - (G,∞), <G: <∞>>

SSSP: MapReduce Model

After the Reduce, flush to HDFS

[Figure: distances A=0, B=4, C=6, D=3, E=6, F=5, G=+∞ (start of the third iteration)]

・Map Input (Reduce Output):
- <A: <0, (B,5),(D,3)>>

- <B: <4, (E,1)>>

- <C: <6, (F,5)>>

- <D: <3, (B,1),(C,3),(E,4),(F,2)>>

- <E: <6>>

- <F: <5, (G,4)>>

- <G: <∞>>

SSSP: MapReduce Model

[Figure: distances A=0, B=4, C=6, D=3, E=6, F=5, G=+∞]

・Map Output:
- (B,5),(D,3), <A: <0, (B,5),(D,3)>>

- (E,5), <B: <4, (E,1)>>

- (F,11), <C: <6, (F,5)>>

- (B,4),(C,6),(E,7),(F,5), <D: <3, (B,1),(C,3),(E,4),(F,2)>>

- <E: <6>>

- (G,9), <F: <5, (G,4)>>

- <G: <∞>>

SSSP: MapReduce Model

Flush to local disk

[Figure: after the third Reduce, distances A=0, B=4, C=6, D=3, E=5, F=5, G=9]

・Reduce Process:
[A] - <A: <0, (B,5),(D,3)>>

[B] - (B,5),(B,4), <B: <4, (E,1)>>

[C] - (C,6), <C: <6, (F,5)>>

[D] - (D,3), <D: <3, (B,1),(C,3),(E,4),(F,2)>>

[E] - (E,5), (E,7), <E, <6>>

[F] - (F,5),(F,11), <F: <5, (G,4)>>

[G] - (G,9), <G: <∞>>

SSSP: MapReduce Model

[Figure: no distance changes in the following iteration; final shortest-path distances A=0, B=4, C=6, D=3, E=5, F=5, G=9 (end)]

SSSP: MapReduce Model

class ShortestPathMapper(Mapper):
    def map(self, node_id, Node):
        # send the graph structure itself to the reducer
        emit node_id, Node
        # get the node's current distance and add each edge distance to it
        dist = Node.get_value()
        for neighbour_node_id in Node.get_adjacency_list():
            dist_to_nbr = Node.get_distance(node_id, neighbour_node_id)
            emit neighbour_node_id, dist + dist_to_nbr

SSSP: MapReduce Model

class ShortestPathReducer(Reducer):
    def reduce(self, node_id, dist_list):
        min_dist = sys.maxint
        for dist in dist_list:
            # dist_list also contains the Node carrying the graph structure
            if is_node(dist):
                Node = dist
            elif dist < min_dist:
                min_dist = dist
        # keep the current value when no shorter distance arrived
        if min_dist < Node.get_value():
            Node.set_value(min_dist)
        emit node_id, Node

SSSP: MapReduce Model
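The map() and reduce() above make up one MapReduce job; a driver keeps resubmitting the job until no distance changes between iterations. A self-contained Python sketch that simulates this loop in memory on the example graph (names and structure are illustrative, not the Hadoop API):

import sys
from collections import defaultdict

INF = sys.maxsize

def sssp_mapreduce(adj, source):
    # adj: {node: [(neighbour, edge_distance), ...]}
    dist = {v: (0 if v == source else INF) for v in adj}
    while True:
        # map phase: propose current distance + edge weight to each neighbour
        shuffled = defaultdict(list)
        for v, d in dist.items():
            if d < INF:
                for nbr, w in adj[v]:
                    shuffled[nbr].append(d + w)
        # reduce phase: keep the minimum of the current and proposed distances
        new_dist = dict(dist)
        for v, proposals in shuffled.items():
            new_dist[v] = min(dist[v], min(proposals))
        if new_dist == dist:  # no update in this iteration: converged
            return new_dist
        dist = new_dist

# The example graph from the slides:
adj = {"A": [("B", 5), ("D", 3)], "B": [("E", 1)], "C": [("F", 5)],
       "D": [("B", 1), ("C", 3), ("E", 4), ("F", 2)],
       "E": [], "F": [("G", 4)], "G": []}
print(sssp_mapreduce(adj, "A"))
# -> {'A': 0, 'B': 4, 'C': 6, 'D': 3, 'E': 5, 'F': 5, 'G': 9}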

・MapReduce:
- poor at processing "dense" graphs
- the graph structure has to be re-sent in every shuffle phase
- optimization techniques exist for the basic graph problems

・Pregel:
- simple algorithms
- network traffic consists of messages only

MapReduce v.s. Pregel

SSSP: MR Optimization

・Combiner:
- cuts the network traffic in the shuffle phase
- applying the same logic as reduce() inside combine() is enough (see the sketch after this list)

・In-Mapper Combiner:
- uses a buffer (hash map) during the map phase
- emits once the buffer reaches its size limit or after map processing finishes

・Shimmy trick:
- instead of shipping it from the mappers, read the graph structure directly from HDFS in a clever way
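For SSSP, a combiner only has to keep the local minimum of the distances a mapper emits for each target node before they hit the network. A minimal sketch in plain Python (illustrative names; a real Hadoop combiner would implement the Reducer interface):

def combine(mapper_output):
    # mapper_output: iterable of (node_id, distance) pairs emitted by one mapper.
    # Returns one (node_id, min_distance) pair per node, shrinking shuffle traffic.
    best = {}
    for node_id, dist in mapper_output:
        if node_id not in best or dist < best[node_id]:
            best[node_id] = dist
    return list(best.items())

# e.g. combine([("B", 5), ("B", 4), ("E", 7)]) -> [("B", 4), ("E", 7)]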

# In-Mapper Combiner
class ShortestPathMapper(Mapper):
    def __init__(self):
        self.buffer = {}

    def check_and_put(self, key, value):
        # keep only the smallest distance seen so far for each key
        if key not in self.buffer or value < self.buffer[key]:
            self.buffer[key] = value

    def check_and_emit(self):
        # flush the buffer once it exceeds its size limit
        if is_exceed_limit_buffer_size(self.buffer):
            for key, value in self.buffer.items():
                emit key, value
            self.buffer = {}

    def close(self):
        # flush whatever is left after the last map() call
        for key, value in self.buffer.items():
            emit key, value

    # ...continue
    def map(self, node_id, Node):
        # send the graph structure
        emit node_id, Node
        # get the node value and add it to each edge distance
        dist = Node.get_value()
        for nbr_node_id in Node.get_adjacency_list():
            dist_to_nbr = Node.get_distance(node_id, nbr_node_id)
            dist_nbr = dist + dist_to_nbr
            self.check_and_put(nbr_node_id, dist_nbr)
        self.check_and_emit()

Shimmy Trick
・Exploits the idea of a "parallel merge join"

[Figure: records sorted by join_key are split into partitions P1–P3 and joined in parallel]

Shimmy Trick
・Assume the vertices of graph G are ordered by sorted node_id
・Split G = G1 ∪ G2 ∪ ... ∪ Gn
・Make the number of reducers equal to the number of graph partitions n
・Partitioner:
- keep it identical across all iterations
- the set of node_ids sent to reducer Ri must always be contained in the node_id set of graph partition Gi (see the sketch below)
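A range partitioner that mirrors the graph partitioning could be sketched like this (plain Python, illustrative; boundaries holds the largest node_id of each partition Gi):

import bisect

def make_range_partitioner(boundaries):
    # boundaries: sorted list of the last node_id in each graph partition Gi.
    # The returned partitioner sends node_id to the reducer that holds Gi.
    def get_partition(node_id, num_reducers):
        part = bisect.bisect_left(boundaries, node_id)
        return min(part, num_reducers - 1)
    return get_partition

# With partitions G1 = ids 1..10 and G2 = ids 11..20:
partition = make_range_partitioner([10, 20])
assert partition(7, 2) == 0 and partition(15, 2) == 1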

Shimmy Trick
・Mapper:
- emits only (node_id, dist) pairs; the graph structure is not emitted

・Reducer:
- the node_ids handed to a reducer arrive already sorted by the shuffle phase
- their order matches the node_id order of the corresponding graph partition Gi
- the reducer reads Gi sequentially from HDFS to pick up each node's structure, then emits the computed minimum paired with that structure

Shimmy Trick

[Figure: each reducer streams its own graph partition (id_1..id_10 on one reducer, id_11..id_20 on another) from HDFS, merges it with the shuffled (node_id, distance) lists, and emits the updated nodes N1', N2', ... back to HDFS for the next Map]

# Shimmy trick
class ShortestPathReducer(Reducer):
    def __init__(self):
        P.open_graph_partition()

    def emit_precede_node(self, node_id):
        # stream the graph partition until node_id is reached,
        # re-emitting the untouched preceding nodes as they are
        for pre_node_id, Node in P.read():
            if node_id == pre_node_id:
                return Node
            else:
                emit pre_node_id, Node

    # (...continue)
    def reduce(self, node_id, dist_list):
        Node = self.emit_precede_node(node_id)
        min_dist = sys.maxint
        for dist in dist_list:
            if dist < min_dist:
                min_dist = dist
        if min_dist < Node.get_value():
            Node.set_value(min_dist)
        emit node_id, Node

SSSP: Parallel BFS

MapReduce & Pregel

※ SSSP: Single Source Shortest Paths, BFS: Breadth First Search

SSSP: Pregel Model

[Figure: graph A–G partitioned across workers P1 and P2; Compute() in the first superstep; distances A=0, all other vertices +∞]

Compute(): each vertex computes a new value from its current value and the messages received in the previous superstep

[Figure: Compute() followed by Communicate; distances still A=0, all other vertices +∞]

SSSP: Pregel Model

Communicate: pass the updated value as a message to the vertices on the outgoing edges

[Figure: Compute() → Communicate → Barrier completes the first superstep]

SSSP: Pregel Model

Barrier: wait until message passing to every vertex has finished

[Figure: Compute() of the second superstep; B and D have received messages and updated to B=5, D=3 (A=0, all other vertices +∞)]

SSSP: Pregel Model

Compute(): each vertex computes a new value from its current value and the messages received in the previous superstep

[Figure: Communicate and Barrier of the second superstep; distances A=0, B=5, D=3, all other vertices +∞]

SSSP: Pregel Model

Communicate & Barrier: pass the updated values as messages along the outgoing edges, then wait

[Figure: Compute() of the third superstep; distances A=0, B=4, C=6, D=3, E=6, F=5, G=+∞]

SSSP: Pregel Model

[Figures: the remaining supersteps repeat Compute() → Communicate → Barrier across partitions P1 and P2; the distances converge to A=0, B=4, C=6, D=3, E=5, F=5, G=9, no further messages are sent, every vertex votes to halt, and the computation terminates]

SSSP: Pregel Model

class ShortestPathVertex:
    def compute(self, msgs):
        min_dist = 0 if self.is_source() else sys.maxint
        # take the minimum over the values arriving on incoming edges
        for msg in msgs:
            min_dist = min(min_dist, msg.get_value())
        if min_dist < self.get_value():
            # update the current value (state)
            self.set_current_value(min_dist)
            # send the new value along every outgoing edge
            out_edge_iterator = self.get_out_edge_iterator()
            for out_edge in out_edge_iterator:
                recipient = out_edge.get_other_element(self.get_id())
                self.send_message(recipient.get_id(),
                                  min_dist + out_edge.get_distance())
        self.vote_to_halt()

・MapReduce:
- poor at processing "dense" graphs
- network traffic carries both the state and the graph structure
- the basic problems can be optimized

・Pregel:
- simple algorithms
- network traffic consists of messages only

MapReduce v.s. Pregel

Hama, Giraph, GoldenOrb, Pregel

Hama, GoldenOrb, Giraph

                  Hama        GoldenOrb        Giraph
API               BSP         Pregel (Graph)   Pregel (Graph)
NextGen MR        supported   ?                supported
License           Apache      Apache           Apache
Infrastructure    required    required         not needed (runs on Hadoop)

→ YARN (NextGen MR) support!

Hama handles BSP in general; Giraph runs as an iteration of Map-only jobs on Hadoop; GoldenOrb and Giraph expose a Graph API conforming to Pregel.

[Figure: BSP, iterative MR, and MR mapped against Graph Processing, Matrix Computation, and Machine Learning workloads]

Hama covers BSP in general; Google Pregel, Giraph, and GoldenOrb are specialized for Graph Processing (the Pregel model).

Hama, GoldenOrb, Giraph

・Master
- Controls supersteps and faults
- Schedules jobs
- Manages Workers

・Worker
- Task processor
- Runs with GFS and BigTable
- Communicates with the other Workers

[Fault Tolerance]
・Checkpoint
- At each superstep S:
  ・Workers: checkpoint V, E, and messages
  ・Master: checkpoints the aggregators
  ・Saved to persistent (local) storage
・Node failure
- Detected by ping messages
- Reload the checkpoint and restart from S

Pregel: Architecture

※ Still under investigation

※ Map-only jobs in Hadoop + thread assignment

・Master:
- Uses an InputFormat for the graph
- Creates VertexSplitObjects
- Synchronizes the supersteps
- Handles changes that occur within supersteps
- Multiple masters for fault tolerance

・Worker
- Reads vertices from the VertexSplitObjects and splits them into VertexRanges
- Executes compute() for each vertex and buffers incoming messages
- Runs with HDFS

・Zookeeper

Apache Giraph: Architecture

[Fault Tolerance]
・Checkpointing
- Multiple masters and Zookeeper
- The same concept as in Pregel

※ Still under investigation

・BSP Master (≒ JobTracker)
- Controls supersteps and faults
- Schedules jobs
- Manages the Groom Servers

・Groom Server (≒ TaskTracker)
- BSP task processor
- Runs with HDFS and other DFSs
※ Hadoop RPC is used for the BSPPeers to communicate with each other

・Zookeeper:
- Manages the barrier synchronisation of the BSPPeers

Apache Hama: Architecture

HAMA: An Efficient Matrix Computation with the MapReduce Framework

Sangwon Seo, Jaehong Kim, Seongwook Jin, Seungryoul Maeng (Computer Science Division, KAIST, South Korea), Edward J. Yoon (User Service Development Center, NHN Corp., South Korea), and Jin-Soo Kim (School of Information and Communication, Sungkyunkwan University, South Korea)

Abstract: Various scientific computations have become so complex, and thus computation tools play an important role. In this paper, we explore the state-of-the-art framework providing high-level matrix computation primitives with MapReduce through the case study approach, and demonstrate these primitives with different computation engines to show the performance and scalability. We believe the opportunity for using MapReduce in scientific computation is even more promising than the success to date in the parallel systems literature.

I. INTRODUCTION

As cloud computing environment emerges, Google has introduced the MapReduce framework to accelerate parallel and distributed computing on more than a thousand of inexpensive machines. Google has shown that the MapReduce framework is easy to use and provides massive scalability with extensive fault tolerance [2]. Especially, MapReduce fits well with complex data-intensive computations such as high-dimensional scientific simulation, machine learning, and data mining. Google and Yahoo! are known to operate dedicated clusters for MapReduce applications, each cluster consisting of several thousands of nodes. One of typical MapReduce applications in these companies is to analyze search logs to characterize user tendencies. The success of Google prompted an Apache open-source project called Hadoop [11], which is the clone of the MapReduce framework. Recently, Hadoop grew into an enormous project unifying many Apache subprojects such as HBase [12] and Zookeeper [13].

Massive matrix/graph computations are often used as primary means for many data-intensive scientific applications. For example, such applications as large-scale numerical analysis, data mining, computational physics, and graph rendering frequently require the intensive computation power of matrix inversion. Similarly, graph computations are key primitives for various scientific applications such as machine learning, information retrieval, bioinformatics, and social network analysis.

[Fig. 1. The overall architecture of HAMA: HAMA Shell and the HAMA API sit on top of the HAMA Core with pluggable computation engines (MapReduce, BSP, Dryad), backed by storage systems (HDFS, File, RDBMS, HBase) and Zookeeper for distributed locking]

HAMA is a distributed framework on Hadoop for massive matrix and graph computations. HAMA aims at a powerful tool for various scientific applications, providing basic primitives for developers and researchers with simple APIs. HAMA is currently being incubated as one of the subprojects of Hadoop by the Apache Software Foundation [10]. Figure 1 illustrates the overall architecture of HAMA.

HAMA has a layered architecture consisting of three components: HAMA Core for providing many primitives to matrix and graph computations, HAMA Shell for interactive user console, and HAMA API. The HAMA Core component also determines the appropriate computation engine. At this moment, HAMA supports three computation engines: Hadoop's MapReduce engine, our own BSP (Bulk Synchronous Parallel) [9] engine, and Microsoft's Dryad [3] engine. The Hadoop's MapReduce engine is used for matrix computations, while the BSP and Dryad engines are commonly used for graph computations. The main difference between BSP and Dryad is that BSP gives high performance with good data locality, while Dryad provides highly flexible computations with fine control over the communication graph.

※ Still under investigation

http://wiki.apache.org/hama/Articles

・For graph processing there are both MR-based and BSP-based solutions

・Choose between them based on the graph's structure and the algorithm at hand

・Basic problems such as SSSP can be solved efficiently with the optimized MR techniques that have been proposed

・More complex problems are comparatively simpler to implement with BSP

・Giraph iterates only the Map phase of the Hadoop framework, so it has the highest affinity with Hadoop

・Going forward, operational verification and benchmarking are needed

Summary

・Large-scale graph computing at Google

・ Pregel: A System for Large-Scale Graph Processing

・Processing graph/relational data with Map-Reduce and Bulk Synchronous Parallel

・2010-Pregel

・Google Pregel and other massive graph distributed systems.

・Design patterns for efficient graph algorithms in MapReduce

・Apache HAMA: An Introduction to Bulk Synchronization Parallel on Hadoop

・2011.06.29. Giraph - Hadoop Summit 2011

・Graph Exploration with Apache Hadoop and MapReduce

・Graph Exploration with Apache Hama

・Shortest Path Finding with Apache Hama

References