Large-scale Graph Processing with GPU MapReduce
Hitoshi Sato, Global Scientific Information and Computing Center, Tokyo Institute of Technology
Large-scale Graphs and HPC
} Emergence of large-scale graphs
  } Various applications: traffic networks, SNS, smart grids, biology, business intelligence, etc.
} Modern supercomputers
  } Can deliver peta-flops class performance with peta-byte class storage
} An important kernel in HPC
  } Graph500
} Algorithm theory + practical HPC implementation techniques
TSUBAME2.0 (Nov. 2010, with NEC-HP)
} A green, cloud-based supercomputer at Tokyo Tech (Tokyo, Japan)
} ~2.4 PFlops (peak), ~1.2 PFlops (Linpack)
  } 4th in the TOP500 (Nov. 2010), 14th (Jun. 2012)
} Next-gen multi-core x86 CPUs + GPUs
  } 1432 nodes, Intel Westmere/Nehalem-EX CPUs
  } 4224 NVIDIA Tesla (Fermi) M2050 GPUs
} ~95 TB of memory with 0.7 PB/s aggregate bandwidth
} Optical dual-rail QDR InfiniBand with full bisection bandwidth (fat tree)
} 1.2 MW power, PUE = 1.28
  } 2nd in the Green500 (Nov. 2010), Greenest Production Supercomputer (Nov. 2010)
} VM operation (KVM), Linux + Windows HPC
TSUBAME2.0 Overview
} Computing nodes: 2.4 PFlops (CPU+GPU), 224.69 TFlops (CPU), ~100 TB memory, ~200 TB SSD
  } Thin nodes: HP ProLiant SL390s G7 x 1408 (32 nodes x 44 racks); CPU: Intel Westmere-EP 2.93 GHz, 6 cores x 2 = 12 cores/node; GPU: NVIDIA Tesla M2050 x 3/node (PCI-E gen2 x16, 2 slots/node); Mem: 54 GB (96 GB); SSD: 60 GB x 2 = 120 GB (120 GB x 2 = 240 GB)
  } Medium nodes: HP ProLiant DL580 G7 x 24; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores x 4 = 32 cores/node; Mem: 128 GB; SSD: 120 GB x 4 = 480 GB
  } Fat nodes: HP ProLiant DL580 G7 x 10; CPU: Intel Nehalem-EX 2.0 GHz, 8 cores x 4 = 32 cores/node; Mem: 256 GB (512 GB); SSD: 120 GB x 4 = 480 GB
  } GSIC: existing NVIDIA Tesla S1070 GPUs
} Interconnect: full-bisection optical QDR InfiniBand network
  } Core switches: Voltaire Grid Director 4700 x 12 (324 IB QDR ports each)
  } Edge switches: Voltaire Grid Director 4036 x 179 (36 IB QDR ports each), 4036E x 6 (34 IB QDR ports + 2 x 10GbE ports each)
  } External network connections: SupreTitenet, SupreSinet3
} HDD-based storage: 7.13 PB total (parallel FS 5.93 PB + Home 1.2 PB)
  } Parallel FS: MDS/OSS servers HP DL360 G6 x 30 (MDS x 10, OSS x 20), DDN SFA10000 x 5 (10 enclosures each)
  } Home: storage servers HP DL380 G6 x 4, BlueArc Mercury 100 x 2, DDN SFA10000 x 1 (10 enclosures)
  } High-speed data transfer servers: NFS/CIFS x 4, NFS/CIFS/iSCSI x 2
} StorageTek SL8500 tape library: ~4 PB
TSUBAME2.0 Storage Overview: 11 PB (7 PB HDD, 4 PB tape)
[Figure: storage system diagram; all volumes attach to the InfiniBand QDR network for LNET and other services]
} Parallel file system volumes ("Global Work Space" #1-#3 and "Scratch"; Lustre; /work0, /work9, /work19, /gscr0): 3.6 PB on DDN SFA10k #1-#5, QDR IB (x4) x 20 links
} Home volumes: 1.2 PB on SFA10k #6, exported as "cNFS/Clustered Samba w/ GPFS" (GPFS #1-#4) and "NFS/CIFS/iSCSI by BlueArc" for HOME and system applications; QDR IB (x4) x 8 and 10GbE x 2 links
} GPFS with HSM: 2.4 PB HDD + ~4 PB tape
} Node-local SSDs ("Thin node SSD", "Fat/Medium node SSD") serve as scratch and grid storage; ~130 TB and 190 TB are labeled in the diagram
Storage roles highlighted in the diagram:
• Home storage for computing nodes; cloud-based campus storage services
• Concurrent parallel I/O, e.g. MPI-IO (see the sketch below)
• Fine-grained R/W I/O (checkpoints, temporary files)
• Data transfer service between SCs/CCs
• Read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys)
• Backup
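As an aside on what "concurrent parallel I/O" means in practice, here is a minimal MPI-IO sketch (my illustration, not part of the original deck) in which every rank writes its own contiguous block of a shared file on a parallel-file-system path; the path /work0/example.dat and the block size are hypothetical.

```cuda
// Minimal MPI-IO sketch (host-side code): each rank writes one contiguous
// block of a shared file on the parallel file system. Path/sizes illustrative.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int block = 1 << 20;              // 1 Mi ints per rank (hypothetical)
    std::vector<int> buf(block, rank);      // fill with this rank's id

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/work0/example.dat",   // hypothetical path
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Collective write: rank r owns the byte range [r*block*4, (r+1)*block*4).
    MPI_Offset offset = (MPI_Offset)rank * block * sizeof(int);
    MPI_File_write_at_all(fh, offset, buf.data(), block, MPI_INT,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```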
Towards Exa-scale Supercomputing
} Computation
  } Increase in parallelism, heterogeneity, and density
  } Multi-core, many-core processors
  } Heterogeneous processors
} Storage/memory architecture
  } Deep hierarchy
  } Next-gen memory, FLASH/NVRAM, parallel FS
  } Memory and I/O walls
} Challenges for post-peta-scale systems: scalability, heterogeneity, locality, storage hierarchy, I/O, network, power, fault tolerance (FT), algorithms, productivity
Memory Tiers
[Figure: memory/storage hierarchy plotted by latency (nsec, usec, msec) against capacity (KB to PB): L1/L2 cache, L3 cache, DRAM, HDD, Tape; the gaps between DRAM and disk/tape correspond to the memory wall and the I/O wall]

Memory Tiers (new memory devices)
[Figure: the same hierarchy extended with accelerator memory, PCI-e Flash, and Flash/SSD devices that fill the gap between DRAM and HDD, pushing toward both higher speed and larger capacity]

Memory Tiers (Volatility)
[Figure: the hierarchy split into volatile tiers (caches, DRAM, accelerator memory) and non-volatile tiers (PCI-e Flash, Flash devices/SSD, HDD, Tape)]

Memory Tiers (APIs)
[Figure: the faster tiers (caches, DRAM, accelerator memory) are accessed transparently as memory, while Flash devices/SSD, HDD, and Tape are accessed as files through a file system; the two access styles are sketched in code after these slides]

Key Technology in Exa-scale Data-Intensive Supercomputing
[Figure: application kernels (e.g. large-scale graphs) and middleware (e.g. PGAS) built on APIs for software-controllable new memory devices across the whole hierarchy]
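To make the memory-API vs. file-API distinction above concrete, here is a minimal host-side sketch (my illustration, not from the deck; compilable with nvcc like the other examples) that touches the same flash-backed file both ways; the path /flash/data.bin is hypothetical.

```cuda
// File-style vs memory-style access to the same flash-backed data (illustrative).
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main() {
    const char* path = "/flash/data.bin";    // hypothetical file on a flash device
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // (a) File API: explicit read() through the file system.
    char first = 0;
    pread(fd, &first, 1, 0);

    // (b) Memory API: mmap() the file, then access it like an ordinary array;
    //     paging moves data between flash and DRAM transparently.
    char* p = static_cast<char*>(mmap(nullptr, st.st_size, PROT_READ,
                                      MAP_PRIVATE, fd, 0));
    char same = p[0];                         // same byte as `first`, via a load

    std::printf("%d %d\n", first, same);
    munmap(p, st.st_size);
    close(fd);
    return 0;
}
```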
Large-Scale Graph Processing with GPGPU
} MapReduce
  } Locality-aware parallel data processing
  } GIM-V (Generalized Iterative Matrix-Vector multiplication) for large-scale graphs
} GPGPU
  } Pros: massively parallel threads, high memory bandwidth
  } Cons: data transfer between host and GPU device
Goal
} MapReduce on large-scale environments equipped with accelerators
  } Exploiting locality
  } Taking the memory hierarchy into account
  } 1000+ accelerators
} Development that anticipates the latest devices
  } Accelerators: NVIDIA K20, Intel Xeon Phi
  } Next-generation memory, FLASH/NVRAM
} Graphs are one application instance
} Work in progress: GIM-V for GPU MapReduce (Hamar)
GIM-V Algorithm
• Generalized Iterative Matrix-Vector multiplication*1
  – v' = M ×G v, where v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_{i,j}, v_j)})) for i = 1..n
  – Expresses various graph algorithms
    • PageRank, Random Walk with Restart, Diameter Estimation, Connected Components (a PageRank instantiation is sketched below)
  – Hadoop-based implementation (PEGASUS)
*1 : Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining, 2009
[Figure: GIM-V computation — Stage 1 applies combine2(m_{i,j}, v_j) to each nonzero element of row i of M; Stage 2 applies combineAll_i to the partial results x_j and assign(v_i, ...) to produce v'_i]
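As a concrete instance of the three operators, here is a minimal sketch of PageRank expressed in GIM-V form, following the PEGASUS formulation; the function signatures and the toy check in main are assumptions for illustration, not the actual Hamar/Mars interfaces.

```cuda
// PageRank through the three GIM-V operators (sketch; names assumed).
#include <cstdio>

// Stage 1: contribution of vertex j to vertex i along edge (j -> i).
__host__ __device__ float combine2(float m_ij, float v_j, float c) {
    return c * m_ij * v_j;
}
// Stage 2: combine all partial contributions x_j destined for vertex i.
__host__ __device__ float combineAll(const float* x, int num) {
    float s = 0.f;
    for (int k = 0; k < num; ++k) s += x[k];
    return s;
}
// Stage 2: new value of vertex i (PageRank ignores the old value v_i).
__host__ __device__ float assign(float /*v_i*/, float combined, float c, int n) {
    return (1.f - c) / n + combined;
}

int main() {
    // Toy host-side check: a 2-vertex cycle, column-normalized M = [[0,1],[1,0]].
    const float c = 0.85f; const int n = 2;
    float v[2] = {0.5f, 0.5f};
    float x0 = combine2(1.f, v[1], c);                  // contribution to vertex 0
    float v0 = assign(v[0], combineAll(&x0, 1), c, n);
    std::printf("new PageRank of vertex 0: %f\n", v0);  // 0.15/2 + 0.85*0.5 = 0.5
    return 0;
}
```

Other algorithms change only these operators; connected components, for example, replaces the sum in combineAll with a minimum and keeps the smaller of the old and combined values in assign.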
Mars
• Mars*1: an existing GPU-based MapReduce framework
  – Map and Reduce functions are implemented as CUDA kernels
    • #Mappers/#Reducers = #GPU threads = #keys
    • Map/Reduce Count → prefix sum → Map/Reduce (see the sketch below)
  – The Shuffle stage runs a GPU-based bitonic sort
  – CPU-GPU communication when Map starts
[Figure: Mars pipeline — Preprocess → Map stage (Map Split, Map Count, prefix sum, Map) → Shuffle stage (Sort) → Reduce stage (Reduce Split, Reduce Count, prefix sum, Reduce); the Mars scheduler runs on the CPU, the stages execute on the GPU]
*1 : Fang, W. et al., "Mars: Accelerating MapReduce with Graphics Processors", IEEE Transactions on Parallel and Distributed Systems, 2011
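The Count → prefix sum → Map/Reduce scheme mentioned above is how Mars-style frameworks emit variable-sized output without dynamic allocation on the GPU. Below is a minimal sketch of the idea (my illustration using Thrust, not Mars source code); the map_count/map_emit kernels and the toy emission rule are hypothetical.

```cuda
// Two-pass output allocation, Mars-style (sketch).
// Pass 1: each thread counts how many records it will emit.
// Scan:   an exclusive prefix sum turns counts into write offsets.
// Pass 2: each thread writes its records at its reserved offsets.
#include <thrust/device_vector.h>
#include <thrust/scan.h>

__global__ void map_count(const int* input, int n, int* counts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) counts[i] = 1 + (input[i] & 1);     // toy rule: 1 or 2 records
}

__global__ void map_emit(const int* input, int n, const int* offsets, int* out_keys) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int pos = offsets[i];                          // this thread's private slots
    out_keys[pos] = input[i];                      // first record
    if (input[i] & 1) out_keys[pos + 1] = -input[i];   // optional second record
}

void run_map(const thrust::device_vector<int>& input) {
    int n = (int)input.size();
    thrust::device_vector<int> counts(n), offsets(n);
    int threads = 256, blocks = (n + threads - 1) / threads;

    map_count<<<blocks, threads>>>(thrust::raw_pointer_cast(input.data()), n,
                                   thrust::raw_pointer_cast(counts.data()));
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());
    int total = (int)offsets.back() + (int)counts.back();   // total output size

    thrust::device_vector<int> out_keys(total);
    map_emit<<<blocks, threads>>>(thrust::raw_pointer_cast(input.data()), n,
                                  thrust::raw_pointer_cast(offsets.data()),
                                  thrust::raw_pointer_cast(out_keys.data()));
    cudaDeviceSynchronize();
}
```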
Mars Extension for Multi-GPU Devices
[Figure: multi-GPU execution — the scheduler uploads inputs CPU → GPU, each GPU runs Map and Sort, intermediate data are downloaded GPU → CPU for exchange and uploaded back, then each GPU runs Reduce and results are downloaded GPU → CPU]
• Multi-GPU GIM-V implementation on top of Mars using MPI
  – Continuous execution of multiple MapReduce stages
    • CPU-GPU communication only at the start and the end of each iteration
    • Convergence test as post-processing
  – Inter-GPU communication in the Shuffle stage
    • GPU → CPU, all-to-all among processes, then CPU → GPU
    • After the global data exchange, each GPU sorts its intermediate key-value pairs locally
  – Convergence test
    • First, each GPU counts the number of converged vertices locally
    • Then the global number of converged vertices is obtained with MPI_Allreduce
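The convergence test described above maps naturally onto a local GPU count followed by a single MPI_Allreduce. Here is a minimal sketch (my reconstruction with assumed names such as globally_converged, not the Hamar source):

```cuda
// Convergence test for multi-GPU GIM-V (sketch): local GPU count + MPI_Allreduce.
#include <mpi.h>
#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <cmath>

struct Converged {
    float eps;
    __host__ __device__ bool operator()(const thrust::tuple<float, float>& t) const {
        return fabsf(thrust::get<0>(t) - thrust::get<1>(t)) < eps;   // |v' - v| < eps
    }
};

// Returns true on every rank when all vertices of the distributed graph converged.
bool globally_converged(const thrust::device_vector<float>& v_old,
                        const thrust::device_vector<float>& v_new,
                        long long global_num_vertices, float eps, MPI_Comm comm) {
    // 1) Each GPU counts its locally converged vertices.
    long long local = thrust::count_if(
        thrust::make_zip_iterator(thrust::make_tuple(v_new.begin(), v_old.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(v_new.end(),   v_old.end())),
        Converged{eps});

    // 2) One Allreduce gives every rank the global converged-vertex count.
    long long global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_LONG_LONG, MPI_SUM, comm);
    return global == global_num_vertices;
}
```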
GIM-V for Multi-GPU Devices
• Graph partitioning
  – Partition the graph into sub-graphs according to the number of GPUs
  – In the Shuffle stage, vertices/edges are distributed according to the list of vertex ids each GPU owns
• Data structure
  – Mars stores metadata (size) and payload (actual data) for each key-value pair
  – We eliminate the metadata and use a fixed-size payload to reduce the amount of transferred data
• Thread allocation
  – Mars assigns a single CUDA thread to the reduce operation over the values of a single key
  – Our implementation allocates multiple CUDA threads to a single reduce operation in combine2 of MapReduce stage 1 (see the sketch after the figure below)
[Figure: GIM-V iteration flow — Preprocess (read input, graph partition) → GIM-V Stage 1 → GIM-V Stage 2 → Convergence Test → Postprocess (write output); the scheduler runs on the CPU, the GIM-V stages execute on the GPUs]
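The thread-allocation change (several CUDA threads cooperating on one key's reduce, instead of Mars's one thread per key) can be sketched as a block-per-key reduction; the kernel below is an illustration of the idea under an assumed data layout (key_start offsets into a sorted value array), not the actual implementation.

```cuda
// Block-per-key reduction (sketch): all values belonging to one key are summed
// cooperatively by one thread block, instead of by a single thread as in Mars.
// key_start[k] .. key_start[k+1] delimit key k's values in the sorted value array.
__global__ void reduce_per_key(const float* values, const int* key_start,
                               int num_keys, float* out) {
    __shared__ float partial[256];            // assumes blockDim.x == 256
    int k = blockIdx.x;                       // one block per key
    if (k >= num_keys) return;

    int begin = key_start[k], end = key_start[k + 1];
    float sum = 0.f;
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        sum += values[i];                     // strided loop over this key's values
    partial[threadIdx.x] = sum;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in shared memory
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[k] = partial[0];       // combineAll result for key k
}
```

It would be launched with one block per key, e.g. reduce_per_key<<<num_keys, 256>>>(values, key_start, num_keys, out).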
Experiments
• Question
  – Performance of our GIM-V implementation on GPUs
• Measurement method
  – A single round of iterations
    • vs. CPU-based Mars
    • vs. Hadoop-based implementation (PEGASUS)
• Methods
  – Application
    • PageRank: measures the relative importance of web pages
  – Input data
    • Artificial Kronecker graph, generated by the Graph 500 generator
    • Parameters
      – SCALE: log2 of #vertices (#vertices = 2^SCALE)
      – Edge factor: #edges = 16 × #vertices
[Figure: Kronecker graph generation — edges (i1, j1), (i2, j2), ... are added to the adjacency matrix by recursively selecting quadrants with probabilities 0.57, 0.19, 0.19, 0.05; a simplified sampler is sketched below]
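For reference, the figure's recursive quadrant selection can be written as a simplified R-MAT style sampler (my illustration; the real Graph 500 generator additionally permutes vertex labels and perturbs the probabilities):

```cuda
// Simplified Kronecker (R-MAT style) edge sampling: SCALE recursive quadrant
// choices per edge, using the initiator probabilities 0.57, 0.19, 0.19, 0.05.
#include <cstdio>
#include <cstdlib>
#include <cstdint>

void sample_edge(int scale, uint64_t* src, uint64_t* dst) {
    uint64_t i = 0, j = 0;
    for (int level = 0; level < scale; ++level) {
        double r = (double)rand() / RAND_MAX;   // sketch: rand() is good enough here
        i <<= 1; j <<= 1;
        if      (r < 0.57)               { /* quadrant A: top-left            */ }
        else if (r < 0.57 + 0.19)        { j |= 1;          /* B: top-right   */ }
        else if (r < 0.57 + 0.19 + 0.19) { i |= 1;          /* C: bottom-left */ }
        else                             { i |= 1; j |= 1;  /* D: bottom-right*/ }
    }
    *src = i; *dst = j;
}

int main() {
    const int scale = 20;                         // #vertices = 2^SCALE
    const uint64_t num_edges = 16ULL << scale;    // edge factor 16
    for (uint64_t e = 0; e < num_edges; ++e) {
        uint64_t s, d;
        sample_edge(scale, &s, &d);
        // ... append (s, d) to the edge list ...
    }
    std::printf("generated %llu edges for SCALE %d\n",
                (unsigned long long)num_edges, scale);
    return 0;
}
```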
Experimental environments
• We use 3 GPUs on 1 node
  – CPU: 6 cores, 12 threads (Hyper-Threading enabled)
  – GPU
    • CUDA driver version: 4.1; CUDA runtime version: 4.0
    • Compute capability: 2.0
    • Shared memory/L1 cache size: 64 KB
• Mars
  – MarsGPU-n
    • n GPUs per node (n = 1, 2, 3)
    • #threads = #distinct keys; 256 threads per thread block
  – MarsCPU
    • 12 threads per node
    • Implemented in C with the POSIX thread library instead of CUDA
    • Sort is implemented as a parallel quick sort
• PEGASUS
  – Hadoop 0.21.0, with HDFS as the file system
                    CPU                   GPU
Model               Intel Xeon X5670      NVIDIA Tesla C2050
# Physical cores    6                     448
Frequency           2.93 GHz              1.15 GHz
Amount of memory    16.3 GB               2.7 GB (global)
Compiler            gcc 4.3.4             nvcc 3.2
Conclusions
• Conclusions
  – Scalable MapReduce-based GIM-V implementation using multiple GPUs
    • 87.04 ME/s on SCALE 30 (256 nodes, 768 GPUs)
    • 1.52x speedup over the CPU-based implementation on SCALE 29
    • Optimization of load balance
• Future work
  – Optimization of our implementation: improve communication and locality
  – Data handling for graphs that do not fit in GPU memory
    • Use local storage as well as CPU/GPU memories
    • Efficient memory hierarchy management