Large-scale Graph Processing with GPU MapReduce
Hitoshi Sato, Global Scientific Information and Computing Center, Tokyo Institute of Technology
Large-scale Graphs and HPC
} Emergence of large-scale graphs
  } Various applications: traffic networks, SNS, smart grids, biology, business intelligence, etc.
} Modern supercomputers
  } Can deliver peta-flops class performance with peta-byte class storage
} An important kernel in HPC
  } Graph500
} Algorithm theory + practical HPC implementation techniques
TSUBAME2.0 (Nov. 2010, with NEC-HP)
} A green, cloud-based supercomputer at Tokyo Tech (Tokyo, Japan)
} ~2.4 PFlops (peak), ~1.2 PFlops (Linpack)
  } 4th in the TOP500 (Nov. 2010), 14th (Jun. 2012)
} Next-gen multi-core x86 CPUs + GPUs
  } 1432 nodes, Intel Westmere/Nehalem-EX CPUs
  } 4224 NVIDIA Tesla (Fermi) M2050 GPUs
} ~95 TB of memory with 0.7 PB/s aggregate bandwidth
} Optical dual-rail QDR InfiniBand with full bisection bandwidth (fat tree)
} 1.2 MW power, PUE = 1.28
  } 2nd in the Green500 (Nov. 2010), Greenest Production Supercomputer (Nov. 2010)
} VM operation (KVM), Linux + Windows HPC
TSUBAME2.0 Overview
} Computing nodes: 2.4 PFlops (CPU+GPU), 224.69 TFlops (CPU), ~100 TB memory, ~200 TB SSD
  } Thin nodes: HP ProLiant SL390s G7 x 1408 (32 nodes x 44 racks); CPU: Intel Westmere-EP 2.93 GHz, 6 cores x 2 = 12 cores/node; GPU: NVIDIA Tesla M2050 x 3/node (PCI-E gen2 x16, 2 slots/node); Mem: 54 GB (96 GB); SSD: 60 GB x 2 = 120 GB (120 GB x 2 = 240 GB)
  } Medium nodes: HP ProLiant DL580 G7 x 24; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores x 4 = 32 cores/node; Mem: 128 GB; SSD: 120 GB x 4 = 480 GB
  } Fat nodes: HP ProLiant DL580 G7 x 10; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores x 4 = 32 cores/node; Mem: 256 GB (512 GB); SSD: 120 GB x 4 = 480 GB
  } GSIC: existing NVIDIA Tesla S1070 GPUs
} Interconnect: full-bisection optical QDR InfiniBand network
  } Core switches: Voltaire Grid Director 4700 x 12 (324 IB QDR ports each)
  } Edge switches: Voltaire Grid Director 4036 x 179 (36 IB QDR ports each), 4036E x 6 (34 IB QDR ports + 2 x 10GbE ports each)
  } External network connections: SupreTitenet, SupreSinet3
} HDD-based storage: 7.13 PB total (parallel FS 5.93 PB + Home 1.2 PB)
  } Parallel FS: MDS/OSS servers HP DL360 G6 x 30 (MDS x 10, OSS x 20), DDN SFA10000 x 5 (10 enclosures each)
  } Home: storage servers HP DL380 G6 x 4, BlueArc Mercury 100 x 2, DDN SFA10000 x 1 (10 enclosures)
  } High-speed data transfer servers: NFS/CIFS x 4, NFS/CIFS/iSCSI x 2
} StorageTek SL8500 tape library: ~4 PB
TSUBAME2.0 Storage Overview: 11 PB (7 PB HDD, 4 PB tape)
[Figure: storage system diagram; all volumes attach to the InfiniBand QDR network for LNET and other services]
} Parallel file system volumes ("Global Work Space" #1-#3 and "Scratch"; Lustre; /work0, /work9, /work19, /gscr0): 3.6 PB on DDN SFA10k #1-#5, QDR IB (x4) x 20 links
} Home volumes: 1.2 PB on SFA10k #6, exported as "cNFS/Clustered Samba w/ GPFS" (GPFS #1-#4) and "NFS/CIFS/iSCSI by BlueArc" for HOME and system applications; QDR IB (x4) x 8 and 10GbE x 2 links
} GPFS with HSM: 2.4 PB HDD + ~4 PB tape
} Node-local SSDs ("Thin node SSD", "Fat/Medium node SSD") serve as scratch and grid storage; ~130 TB and 190 TB are labeled in the diagram
Storage roles highlighted in the diagram:
• Home storage for computing nodes; cloud-based campus storage services
• Concurrent parallel I/O, e.g. MPI-IO (see the sketch below)
• Fine-grained R/W I/O (checkpoints, temporary files)
• Data transfer service between SCs/CCs
• Read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys)
• Backup
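As an aside on what "concurrent parallel I/O" means in practice, here is a minimal MPI-IO sketch (my illustration, not part of the original deck) in which every rank writes its own contiguous block of a shared file on a parallel-file-system path; the path /work0/example.dat and the block size are hypothetical.

```cuda
// Minimal MPI-IO sketch (host-side code): each rank writes one contiguous
// block of a shared file on the parallel file system. Path/sizes illustrative.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int block = 1 << 20;              // 1 Mi ints per rank (hypothetical)
    std::vector<int> buf(block, rank);      // fill with this rank's id

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/work0/example.dat",   // hypothetical path
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Collective write: rank r owns the byte range [r*block*4, (r+1)*block*4).
    MPI_Offset offset = (MPI_Offset)rank * block * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf.data(), block, MPI_INT,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```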
Towards Exa-scale Supercomputing
} Computation
  } Increase in parallelism, heterogeneity, and density
  } Multi-core, many-core processors
  } Heterogeneous processors
} Storage/memory architecture
  } Deep hierarchy
  } Next-gen memory, FLASH/NVRAM, parallel FS
  } Memory and I/O walls
} Challenges for post-peta-scale systems: scalability, heterogeneity, locality, storage hierarchy, I/O, network, power, fault tolerance (FT), algorithms, productivity
Memory Tiers
[Figure: memory/storage hierarchy plotted by latency (nsec, usec, msec) against capacity (KB to PB): L1/L2 cache, L3 cache, DRAM, HDD, Tape; the gaps between DRAM and disk/tape correspond to the memory wall and the I/O wall]

Memory Tiers (new memory devices)
[Figure: the same hierarchy extended with accelerator memory, PCI-e Flash, and Flash/SSD devices that fill the gap between DRAM and HDD, pushing toward both higher speed and larger capacity]

Memory Tiers (Volatility)
[Figure: the hierarchy split into volatile tiers (caches, DRAM, accelerator memory) and non-volatile tiers (PCI-e Flash, Flash devices/SSD, HDD, Tape)]

Memory Tiers (APIs)
[Figure: the faster tiers (caches, DRAM, accelerator memory) are accessed transparently as memory, while Flash devices/SSD, HDD, and Tape are accessed as files through a file system; the two access styles are sketched in code after these slides]

Key Technology in Exa-scale Data-Intensive Supercomputing
[Figure: application kernels (e.g. large-scale graphs) and middleware (e.g. PGAS) built on APIs for software-controllable new memory devices across the whole hierarchy]
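To make the memory-API vs. file-API distinction above concrete, here is a minimal host-side sketch (my illustration, not from the deck; compilable with nvcc like the other examples) that touches the same flash-backed file both ways; the path /flash/data.bin is hypothetical.

```cuda
// File-style vs memory-style access to the same flash-backed data (illustrative).
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "/flash/data.bin";    // hypothetical file on a flash device
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // (a) File API: explicit read() through the file system.
    char first = 0;
    pread(fd, &first, 1, 0);

    // (b) Memory API: mmap() the file, then access it like an ordinary array;
    //     paging moves data between flash and DRAM transparently.
    char* p = static_cast<char*>(mmap(nullptr, st.st_size, PROT_READ,
                                      MAP_PRIVATE, fd, 0));
    char same = p[0];                         // same byte as `first`, via a load

    std::printf("%d %d\n", first, same);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```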
Large-Scale Graph Processing with GPGPU
} MapReduce
  } Locality-aware parallel data processing
  } GIM-V (Generalized Iterative Matrix-Vector multiplication) for large-scale graphs
} GPGPU
  } Pros: massively parallel threads, high memory bandwidth
  } Cons: data transfer between host and GPU device
Goal
} MapReduce on large-scale environments equipped with accelerators
  } Exploiting locality
  } Taking the memory hierarchy into account
  } 1000+ accelerators
} Development that anticipates the latest devices
  } Accelerators: NVIDIA K20, Intel Xeon Phi
  } Next-generation memory, FLASH/NVRAM
} Graphs are one application instance
} Work in progress: GIM-V for GPU MapReduce (Hamar)
GIM-V Algorithm
• Generalized Iterative Matrix-Vector multiplication*1
  – v' = M ×G v, where v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_{i,j}, v_j)})) for i = 1..n
  – Expresses various graph algorithms
    • PageRank, Random Walk with Restart, Diameter Estimation, Connected Components (a PageRank instantiation is sketched below)
  – Hadoop-based implementation (PEGASUS)
*1 : Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining, 2009
[Figure: GIM-V computation — Stage 1 applies combine2(m_{i,j}, v_j) to each nonzero element of row i of M; Stage 2 applies combineAll_i to the partial results x_j and assign(v_i, ...) to produce v'_i]
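As a concrete instance of the three operators, here is a minimal sketch of PageRank expressed in GIM-V form, following the PEGASUS formulation; the function signatures and the toy check in main are assumptions for illustration, not the actual Hamar/Mars interfaces.

```cuda
// PageRank through the three GIM-V operators (sketch; names assumed).
#include <cstdio>

// Stage 1: contribution of vertex j to vertex i along edge (j -> i).
__host__ __device__ float combine2(float m_ij, float v_j, float c) {
    return c * m_ij * v_j;
}
// Stage 2: combine all partial contributions x_j destined for vertex i.
__host__ __device__ float combineAll(const float* x, int num) {
    float s = 0.f;
    for (int k = 0; k < num; ++k) s += x[k];
    return s;
}
// Stage 2: new value of vertex i (PageRank ignores the old value v_i).
__host__ __device__ float assign(float /*v_i*/, float combined, float c, int n) {
    return (1.f - c) / n + combined;
}

int main() {
    // Toy host-side check: a 2-vertex cycle, column-normalized M = [[0,1],[1,0]].
    const float c = 0.85f; const int n = 2;
    float v[2] = {0.5f, 0.5f};
    float x0 = combine2(1.f, v[1], c);                  // contribution to vertex 0
    float v0 = assign(v[0], combineAll(&x0, 1), c, n);
    std::printf("new PageRank of vertex 0: %f\n", v0);  // 0.15/2 + 0.85*0.5 = 0.5
    return 0;
}
```

Other algorithms change only these operators; connected components, for example, replaces the sum in combineAll with a minimum and keeps the smaller of the old and combined values in assign.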
Mars
• Mars*1: an existing GPU-based MapReduce framework
  – Map and Reduce functions are implemented as CUDA kernels
    • #Mappers/#Reducers = #GPU threads = #keys
    • Map/Reduce Count → prefix sum → Map/Reduce (see the sketch below)
  – The Shuffle stage runs a GPU-based bitonic sort
  – CPU-GPU communication when Map starts
[Figure: Mars pipeline — Preprocess → Map stage (Map Split, Map Count, prefix sum, Map) → Shuffle stage (Sort) → Reduce stage (Reduce Split, Reduce Count, prefix sum, Reduce); the Mars scheduler runs on the CPU, the stages execute on the GPU]
*1 : Fang, W. et al., "Mars: Accelerating MapReduce with Graphics Processors", IEEE Transactions on Parallel and Distributed Systems, 2011
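The Count → prefix sum → Map/Reduce scheme mentioned above is how Mars-style frameworks emit variable-sized output without dynamic allocation on the GPU. Below is a minimal sketch of the idea (my illustration using Thrust, not Mars source code); the map_count/map_emit kernels and the toy emission rule are hypothetical.

```cuda
// Two-pass output allocation, Mars-style (sketch).
// Pass 1: each thread counts how many records it will emit.
// Scan:   an exclusive prefix sum turns counts into write offsets.
// Pass 2: each thread writes its records at its reserved offsets.
#include <thrust/device_vector.h>
#include <thrust/scan.h>

__global__ void map_count(const int* input, int n, int* counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) counts[i] = 1 + (input[i] & 1);     // toy rule: 1 or 2 records
}

__global__ void map_emit(const int* input, int n, const int* offsets, int* out_keys) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int pos = offsets[i];                          // this thread's private slots
    out_keys[pos] = input[i];                      // first record
    if (input[i] & 1) out_keys[pos + 1] = -input[i];   // optional second record
}

void run_map(const thrust::device_vector<int>& input) {
    int n = (int)input.size();
    thrust::device_vector<int> counts(n), offsets(n);
    int threads = 256, blocks = (n + threads - 1) / threads;

    map_count<<<blocks, threads>>>(thrust::raw_pointer_cast(input.data()), n,
                                   thrust::raw_pointer_cast(counts.data()));
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    int total = (int)offsets.back() + (int)counts.back();   // total output size

    thrust::device_vector<int> out_keys(total);
    map_emit<<<blocks, threads>>>(thrust::raw_pointer_cast(input.data()), n,
                                  thrust::raw_pointer_cast(offsets.data()),
                                  thrust::raw_pointer_cast(out_keys.data()));
    cudaDeviceSynchronize();
}
```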
Mars Extension for Multi-GPU Devices
[Figure: multi-GPU execution — the scheduler uploads inputs CPU → GPU, each GPU runs Map and Sort, intermediate data are downloaded GPU → CPU for exchange and uploaded back, then each GPU runs Reduce and results are downloaded GPU → CPU]
• Multi-GPU GIM-V implementation on top of Mars using MPI
  – Continuous execution of multiple MapReduce stages
    • CPU-GPU communication only at the start and the end of each iteration
    • Convergence test as post-processing
  – Inter-GPU communication in the Shuffle stage
    • GPU → CPU, all-to-all among processes, then CPU → GPU
    • After the global data exchange, each GPU sorts its intermediate key-value pairs locally
  – Convergence test
    • First, each GPU counts the number of converged vertices locally
    • Then the global number of converged vertices is obtained with MPI_Allreduce
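The convergence test described above maps naturally onto a local GPU count followed by a single MPI_Allreduce. Here is a minimal sketch (my reconstruction with assumed names such as globally_converged, not the Hamar source):

```cuda
// Convergence test for multi-GPU GIM-V (sketch): local GPU count + MPI_Allreduce.
#include <mpi.h>
#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cmath>

struct Converged {
    float eps;
    __host__ __device__ bool operator()(const thrust::tuple<float, float>& t) const {
        return fabsf(thrust::get<0>(t) - thrust::get<1>(t)) < eps;   // |v' - v| < eps
    }
};

// Returns true on every rank when all vertices of the distributed graph converged.
bool globally_converged(const thrust::device_vector<float>& v_old,
                        const thrust::device_vector<float>& v_new,
                        long long global_num_vertices, float eps, MPI_Comm comm) {
    // 1) Each GPU counts its locally converged vertices.
    long long local = thrust::count_if(
        thrust::make_zip_iterator(thrust::make_tuple(v_new.begin(), v_old.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(v_new.end(),   v_old.end())),
        Converged{eps});

    // 2) One Allreduce gives every rank the global converged-vertex count.
    long long global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_LONG_LONG, MPI_SUM, comm);
    return global == global_num_vertices;
}
```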
GIM-V for Multi-GPU Devices
• Graph partitioning
  – Partition the graph into sub-graphs according to the number of GPUs
  – In the Shuffle stage, vertices/edges are distributed according to the list of vertex ids each GPU owns
• Data structure
  – Mars stores metadata (size) and payload (actual data) for each key-value pair
  – We eliminate the metadata and use a fixed-size payload to reduce the amount of transferred data
• Thread allocation
  – Mars assigns a single CUDA thread to the reduce operation over the values of a single key
  – Our implementation allocates multiple CUDA threads to a single reduce operation in combine2 of MapReduce stage 1 (see the sketch after the figure below)
[Figure: GIM-V iteration flow — Preprocess (read input, graph partition) → GIM-V Stage 1 → GIM-V Stage 2 → Convergence Test → Postprocess (write output); the scheduler runs on the CPU, the GIM-V stages execute on the GPUs]
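The thread-allocation change (several CUDA threads cooperating on one key's reduce, instead of Mars's one thread per key) can be sketched as a block-per-key reduction; the kernel below is an illustration of the idea under an assumed data layout (key_start offsets into a sorted value array), not the actual implementation.

```cuda
// Block-per-key reduction (sketch): all values belonging to one key are summed
// cooperatively by one thread block, instead of by a single thread as in Mars.
// key_start[k] .. key_start[k+1] delimit key k's values in the sorted value array.
__global__ void reduce_per_key(const float* values, const int* key_start,
                               int num_keys, float* out) {
    __shared__ float partial[256];            // assumes blockDim.x == 256
    int k = blockIdx.x;                       // one block per key
    if (k >= num_keys) return;

    int begin = key_start[k], end = key_start[k + 1];
    float sum = 0.f;
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        sum += values[i];                     // strided loop over this key's values
    partial[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[k] = partial[0];       // combineAll result for key k
}
```

It would be launched with one block per key, e.g. reduce_per_key<<<num_keys, 256>>>(values, key_start, num_keys, out).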
Experiments
• Question
  – Performance of our GIM-V implementation on GPUs
• Measurement method
  – A single round of iterations
    • vs. CPU-based Mars
    • vs. Hadoop-based implementation (PEGASUS)
• Methods
  – Application
    • PageRank: measures the relative importance of web pages
  – Input data
    • Artificial Kronecker graph, generated by the Graph 500 generator
    • Parameters
      – SCALE: log2 of #vertices (#vertices = 2^SCALE)
      – Edge factor: #edges = 16 × #vertices
[Figure: Kronecker graph generation — edges (i1, j1), (i2, j2), ... are added to the adjacency matrix by recursively selecting quadrants with probabilities 0.57, 0.19, 0.19, 0.05; a simplified sampler is sketched below]
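For reference, the figure's recursive quadrant selection can be written as a simplified R-MAT style sampler (my illustration; the real Graph 500 generator additionally permutes vertex labels and perturbs the probabilities):

```cuda
// Simplified Kronecker (R-MAT style) edge sampling: SCALE recursive quadrant
// choices per edge, using the initiator probabilities 0.57, 0.19, 0.19, 0.05.
#include <cstdio>
#include <cstdlib>
#include <cstdint>

void sample_edge(int scale, uint64_t* src, uint64_t* dst) {
    uint64_t i = 0, j = 0;
    for (int level = 0; level < scale; ++level) {
        double r = (double)rand() / RAND_MAX;   // sketch: rand() is good enough here
        i <<= 1; j <<= 1;
        if      (r < 0.57)               { /* quadrant A: top-left            */ }
        else if (r < 0.57 + 0.19)        { j |= 1;          /* B: top-right   */ }
        else if (r < 0.57 + 0.19 + 0.19) { i |= 1;          /* C: bottom-left */ }
        else                             { i |= 1; j |= 1;  /* D: bottom-right*/ }
    }
    *src = i; *dst = j;
}

int main() {
    const int scale = 20;                         // #vertices = 2^SCALE
    const uint64_t num_edges = 16ULL << scale;    // edge factor 16
    for (uint64_t e = 0; e < num_edges; ++e) {
        uint64_t s, d;
        sample_edge(scale, &s, &d);
        // ... append (s, d) to the edge list ...
    }
    std::printf("generated %llu edges for SCALE %d\n",
                (unsigned long long)num_edges, scale);
    return 0;
}
```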
Experimental environments
• We use 3 GPUs on 1 node
  – CPU: 6 cores, 12 threads (Hyper-Threading enabled)
  – GPU
    • CUDA driver version: 4.1; CUDA runtime version: 4.0
    • Compute capability: 2.0
    • Shared memory/L1 cache size: 64 KB
• Mars
  – MarsGPU-n
    • n GPUs per node (n = 1, 2, 3)
    • #threads = #distinct keys; 256 threads per thread block
  – MarsCPU
    • 12 threads per node
    • Implemented in C with the POSIX thread library instead of CUDA
    • Sort is implemented as a parallel quick sort
• PEGASUS
  – Hadoop 0.21.0, with HDFS as the file system
                    CPU                   GPU
Model               Intel Xeon X5670      NVIDIA Tesla C2050
# Physical cores    6                     448
Frequency           2.93 GHz              1.15 GHz
Amount of memory    16.3 GB               2.7 GB (global)
Compiler            gcc 4.3.4             nvcc 3.2
Conclusions
• Conclusions
  – Scalable MapReduce-based GIM-V implementation using multiple GPUs
    • 87.04 ME/s on SCALE 30 (256 nodes, 768 GPUs)
    • 1.52x speedup over the CPU-based implementation on SCALE 29
    • Optimization of load balance
• Future work
  – Optimization of our implementation: improve communication and locality
  – Data handling for graphs that do not fit in GPU memory
    • Use local storage as well as CPU/GPU memories
    • Efficient memory hierarchy management