Large-Scale Graph Processing with GPU MapReduce, Hitoshi Sato, Global Scientific Information and Computing Center (GSIC), Tokyo Institute of Technology



  • Large-Scale Graph Processing with GPU MapReduce

    Hitoshi Sato
    Global Scientific Information and Computing Center, Tokyo Institute of Technology

  • Large-scale Graphs and HPC

    - Emergence of large-scale graphs
      - Various applications: traffic networks, SNS, smart grids, biology, business intelligence, etc.
    - Modern supercomputers can accommodate peta-flops class performance with peta-byte class storage
    - Important kernel in HPC: Graph500
      - Algorithm theory + practical HPC implementation techniques

  • TSUBAME2.0 (Nov. 2010, /w NEC-HP)

    - A green, cloud-based supercomputer at Tokyo Tech (Tokyo, Japan)
      - ~2.4 PFlops (peak), ~1.2 PFlops (Linpack)
      - 4th in the TOP500 (Nov. 2010), 14th (Jun. 2012)
    - Next-gen multi-core x86 CPUs + GPUs
      - 1432 nodes, Intel Westmere/Nehalem-EX CPUs
      - 4224 NVIDIA Tesla (Fermi) M2050 GPUs
      - ~95 TB of memory with 0.7 PB/s aggregate bandwidth
      - Optical dual-rail QDR InfiniBand with full-bisection bandwidth (fat tree)
      - 1.2 MW power, PUE = 1.28
    - 2nd in the Green500 (Nov. 2010); Greenest Production Supercomputer (Nov. 2010)
    - VM operation (KVM), Linux + Windows HPC

  • TSUBAME2.0 Overview

    Computing nodes: 2.4 PFlops (CPU+GPU), 224.69 TFlops (CPU), ~100 TB memory, ~200 TB SSD
    - Thin nodes: HP ProLiant SL390s G7, 1408 nodes (32 nodes × 44 racks)
      CPU: Intel Westmere-EP 2.93 GHz, 6 cores × 2 = 12 cores/node; GPU: NVIDIA Tesla M2050, 3 GPUs/node; Mem: 54 GB (96 GB); SSD: 60 GB × 2 = 120 GB (120 GB × 2 = 240 GB)
    - Medium nodes: HP ProLiant DL580 G7, 24 nodes
      CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 4 = 32 cores/node; Mem: 128 GB; SSD: 120 GB × 4 = 480 GB
    - Fat nodes: HP ProLiant DL580 G7, 10 nodes
      CPU: Intel Nehalem-EX 2.0 GHz, 8 cores × 4 = 32 cores/node; Mem: 256 GB (512 GB); SSD: 120 GB × 4 = 480 GB
    - GSIC: NVIDIA Tesla S1070 GPUs, PCI-E gen2 x16, 2 slots/node
    - Management servers

    Interconnect: full-bisection optical QDR InfiniBand network
    - Core switches: Voltaire Grid Director 4700 ×12 (324 QDR IB ports each)
    - Edge switches: Voltaire Grid Director 4036 ×179 (36 QDR IB ports each)
    - Edge switches /w 10 GbE ports: Voltaire Grid Director 4036E ×6 (34 QDR IB ports + 2 × 10 GbE each)
    - External networks: Super TITANET, SINET3

    HDD-based storage systems: 7.13 PB in total (parallel FS 5.93 PB + home 1.2 PB)
    - Parallel FS: MDS/OSS servers HP DL360 G6 ×30 (OSS ×20, MDS ×10); storage DDN SFA10000 ×5 (10 enclosures each)
    - Home: storage servers HP DL380 G6 ×4, BlueArc Mercury 100 ×2; storage DDN SFA10000 ×1 (10 enclosures)
    - StorageTek SL8500 tape library: ~4 PB
    - High-speed data transfer servers: NFS/CIFS ×4, NFS/CIFS/iSCSI ×2

  • TSUBAME2.0 Storage Overview

    [Storage architecture diagram] TSUBAME2.0 storage totals 11 PB (7 PB HDD, 4 PB tape), all reachable over the InfiniBand QDR network used for LNET and other services:
    - Parallel file system volumes (Lustre, DDN SFA10k #1-#5): "Global Work Space" #1-#3 and "Scratch" (/work0, /work9, /work19, /gscr0), 3.6 PB; attached via QDR IB (×4) × 20
    - Home volumes (SFA10k #6, GPFS #1-#4 with HSM): HOME and system-application storage, 1.2 PB, exported as cNFS/Clustered Samba w/ GPFS and as NFS/CIFS/iSCSI by BlueArc, backed by 2.4 PB HDD + ~4 PB tape for grid storage; attached via QDR IB (×4) × 8 and 10 GbE × 2
    - Node-local SSDs ("thin node SSD", "fat/medium node SSD") used as scratch: on the order of 130 TB and 190 TB

  • TSUBAME2.0 Storage Overview (I/O workloads)

    [Same storage architecture diagram, annotated with the I/O workloads the 11 PB of storage (7 PB HDD, 4 PB tape) serves:]
    - Home storage for computing nodes; cloud-based campus storage services
    - Concurrent parallel I/O (e.g. MPI-IO)
    - Fine-grained R/W I/O (checkpoints, temporary files)
    - Read-mostly I/O (data-intensive apps, parallel workflows, parameter surveys)
    - Data transfer service between SCs/CCs
    - Backup

  • Towards Exa-scale Supercomputing

    - Computation: increase in parallelism, heterogeneity, density
      - Multi-core and many-core processors; heterogeneous processors
    - Storage/memory architecture: deep hierarchy
      - Next-gen memory, FLASH/NVRAM, parallel FS
      - Memory and I/O walls

    [Diagram: problems on the way to post-peta-scale systems: network, locality, productivity, fault tolerance (FT), algorithms, power, storage hierarchy, I/O, heterogeneity, scalability]

  • Memory Tiers

    [Figure: the memory hierarchy plotted as latency (ns, 10^-9 s; µs, 10^-6 s; ms, 10^-3 s) versus capacity (KB to PB): L1/L2 cache, L3 cache, DRAM, then HDD and tape; the gaps between them mark the memory wall and the I/O wall]

  • Memory Tiers

    [Figure: the same latency-capacity hierarchy with new memory devices filling the gaps: SSD, PCI-e flash, and flash devices between DRAM and HDD, plus accelerator memory; the new devices push toward lower latency ("faster") and larger capacity ("higher capacity")]

  • Memory Tiers (Volatility)

    [Figure: the same hierarchy split into volatile tiers (L1/L2 cache, L3 cache, DRAM, accelerator memory) and non-volatile tiers (PCI-e flash, flash devices, SSD, HDD, tape)]

  • Memory Tiers (APIs)

    [Figure: the same hierarchy split by access interface: tiers accessed transparently as memory (caches, DRAM, accelerator memory) and tiers accessed as files through a file system (PCI-e flash, flash devices, SSD, HDD, tape)]
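
    To make the two access models in the figure concrete, here is a minimal sketch (not from the talk): it touches the same flash/SSD-backed bytes first through the file-system API and then transparently through a memory mapping. The file name scratch.bin is a placeholder.

      import mmap, os

      path = "scratch.bin"  # hypothetical file on a flash/SSD-backed file system

      # Create a small file to stand in for a region of a flash device.
      with open(path, "wb") as f:
          f.write(b"\x00" * 4096)

      # (1) File API: explicit seek/read/write calls that go through the file system.
      with open(path, "r+b") as f:
          f.seek(128)
          f.write(b"file-api")
          f.seek(128)
          print(f.read(8))              # b'file-api'

      # (2) Memory API: map the same bytes and access them like ordinary memory.
      with open(path, "r+b") as f:
          with mmap.mmap(f.fileno(), 0) as m:
              m[256:264] = b"mem-api!"  # plain slice assignment, no read()/write()
              print(bytes(m[256:264]))  # b'mem-api!'

      os.remove(path)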

  • Key Technology in Exa-scale Data-Intensive Supercomputing

    [Figure: the memory/storage hierarchy again, with the software stack for large-scale graphs layered on top: application kernels, middleware (e.g. PGAS), and APIs for software-controllable new memory devices (accelerator memory, PCI-e flash, flash devices, SSD)]

  • Memory Tiers (repeated build-up)

    [Figures: the memory-tier hierarchy shown again step by step: caches/DRAM/HDD/tape; adding SSD, PCI-e flash, and flash devices; the volatility split (volatile vs. non-volatile); and the API split (transparent access as memory vs. access as files through a file system), with application kernels, middleware, and APIs for software-controllable new devices on top]

  • Large-Scale Graph Processing /w GPGPU

    - MapReduce
      - Locality-aware parallel data processing
      - GIM-V (Generalized Iterative Matrix-Vector multiplication) for large-scale graphs
    - GPGPU
      - Pros: massively parallel threads; high memory bandwidth
      - Cons: data transfer between the host and the GPU device

  • Goal

    - MapReduce for large-scale, accelerator-equipped environments
      - Exploit locality
      - Take the memory hierarchy into account
      - Scale to 1000+ accelerators
      - Develop with upcoming devices in mind
        - Accelerators: NVIDIA K20, Intel Xeon Phi
        - Next-generation memory, FLASH/NVRAM
      - Treat the graph as a single instance
    - Currently in progress
      - GIM-V for GPU MapReduce
      - Hamar

  • GIM-V Algorithm

    - Generalized Iterative Matrix-Vector multiplication*1: v' = M ×_G v, where
      v'_i = assign(v_i, combineAll_i({x_j | j = 1..n, x_j = combine2(m_{i,j}, v_j)}))  (i = 1..n)
    - Expresses various graph algorithms: PageRank, Random Walk with Restart, diameter estimation, connected components
    - Hadoop-based implementation (PEGASUS)

    [Figure: Stage 1 applies combine2(m_{i,j}, v_j) to each matrix element and its matching vector element; Stage 2 applies combineAll_i over the partial results of row i and assign(v_i, ...) to produce v'_i]

    *1: Kang, U. et al., "PEGASUS: A Peta-Scale Graph Mining System - Implementation and Observations", IEEE International Conference on Data Mining, 2009
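
    To make the three GIM-V operators concrete, the sketch below (mine, not from the talk) runs GIM-V over an edge list with PageRank's combine2/combineAll/assign; the toy graph and damping factor are illustrative only.

      def gimv_iteration(edges, v, combine2, combineAll, assign):
          """One GIM-V iteration: edges maps (i, j) -> m_ij, v maps vertex -> value."""
          partial = {i: [] for i in v}                    # Stage 1 output, grouped by row i
          for (i, j), m_ij in edges.items():
              partial[i].append(combine2(m_ij, v[j]))     # Stage 1: combine2(m_ij, v_j)
          return {i: assign(v[i], combineAll(xs))         # Stage 2: combineAll, then assign
                  for i, xs in partial.items()}

      def pagerank(adj, d=0.85, iters=30):
          """PageRank expressed as GIM-V with m_ij = 1/out_degree(j)."""
          n = len(adj)
          edges = {(i, j): 1.0 / len(adj[j]) for j in adj for i in adj[j]}
          v = {i: 1.0 / n for i in adj}
          combine2 = lambda m_ij, v_j: m_ij * v_j
          combineAll = lambda xs: sum(xs)
          assign = lambda v_i, s: (1.0 - d) / n + d * s
          for _ in range(iters):
              v = gimv_iteration(edges, v, combine2, combineAll, assign)
          return v

      if __name__ == "__main__":
          adj = {0: [1, 2], 1: [2], 2: [0], 3: [2]}       # toy directed graph: node -> out-neighbors
          print(pagerank(adj))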

  • Mars

    - Mars*1: an existing GPU-based MapReduce framework
      - Map and Reduce functions are implemented as CUDA kernels
        - # mappers/reducers = # GPU threads = # keys
        - Map/Reduce Count → prefix sum → Map/Reduce
      - The Shuffle stage runs a GPU-based bitonic sort
      - CPU-GPU communication when Map starts

    [Pipeline: Preprocess → Map Split → Map Count → Prefix Sum → Map (Map stage) → Sort (Shuffle stage) → Reduce Split → Reduce Count → Prefix Sum → Reduce (Reduce stage); the kernels run on the GPU under the Mars scheduler]

    *1: Fang, W. et al., "Mars: Accelerating MapReduce with Graphics Processors", IEEE Transactions on Parallel and Distributed Systems, 2011
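
    The "count, prefix-sum, then emit" pattern Mars uses to size its GPU output buffers can be shown schematically on the host side; the sketch below is plain Python (NumPy cumsum standing in for the GPU prefix-sum scan), not Mars code, and map_func is a made-up word-count mapper.

      import numpy as np

      def map_func(record):
          # toy mapper: emit (word, 1) for every word in the record
          return [(w, 1) for w in record.split()]

      def mars_like_map(records):
          # Pass 1 (Map Count): how many key-value pairs will each mapper emit?
          counts = np.array([len(map_func(r)) for r in records])
          # Exclusive prefix sum turns counts into non-overlapping write offsets.
          offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))
          # Pass 2 (Map): every mapper writes into its own slot, no contention.
          out = [None] * int(counts.sum())
          for r, off in zip(records, offsets):
              for k, kv in enumerate(map_func(r)):
                  out[int(off) + k] = kv
          return out

      print(mars_like_map(["graph processing on gpu", "mapreduce on gpu"]))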

  • Mars Extension for Multi-GPU Devices

    [Pipeline per GPU: Upload (CPU → GPU) → Map → Sort → Reduce → Download (GPU → CPU), coordinated by the scheduler]

    - Multi-GPU GIM-V implementation on top of Mars using MPI
      - Continuous execution of multiple MapReduce stages
        - CPU-GPU communication only at the start and the end of each iteration
        - Convergence test as a post-processing step
      - Inter-GPU communication in the Shuffle stage
        - GPU → CPU, all-to-all between processes, then CPU → GPU
        - After the global data exchange, each GPU sorts its intermediate key-values locally
      - Convergence test
        - Each GPU first counts its converged vertices locally
        - The global number of converged vertices is then obtained with MPI_Allreduce
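
    A host-side sketch of the two MPI collectives described above: the all-to-all shuffle keyed by destination GPU and the Allreduce-based convergence test. It assumes mpi4py with one MPI rank per GPU and a modulo vertex ownership; it is an illustration, not the authors' code.

      from mpi4py import MPI

      comm = MPI.COMM_WORLD
      rank, size = comm.Get_rank(), comm.Get_size()    # one rank per GPU assumed

      def owner(vertex_id):
          return vertex_id % size                      # illustrative 1-D vertex partitioning

      def shuffle(key_values):
          """Send each (vertex, value) pair to the rank that owns the vertex, then sort locally."""
          buckets = [[] for _ in range(size)]
          for vid, val in key_values:
              buckets[owner(vid)].append((vid, val))
          received = comm.alltoall(buckets)            # global all-to-all data exchange
          return sorted(kv for part in received for kv in part)

      def converged_globally(old, new, eps=1e-6):
          """Count unconverged vertices locally, then combine the counts with Allreduce."""
          local_unconverged = sum(abs(new[v] - old[v]) > eps for v in new)
          return comm.allreduce(local_unconverged, op=MPI.SUM) == 0

      if __name__ == "__main__":
          # toy run: each rank contributes three pairs; after shuffle they sit on their owners
          pairs = [(rank * 3 + k, float(rank)) for k in range(3)]
          print(rank, shuffle(pairs))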

  • GIM-V for Multi-GPU Devices

    - Graph partitioning
      - Partition the graph into sub-graphs according to the number of GPUs
      - In the Shuffle stage, distribute vertices/edges according to the list of vertex ids each GPU owns
    - Data structure
      - Mars keeps metadata (sizes) and payload (actual data) for its key-value pairs
      - We eliminate the metadata and use fixed-size payloads to reduce the amount of data
    - Thread allocation
      - Mars assigns a single CUDA thread to the reduce operation over the values of a single key
      - Our implementation allocates multiple CUDA threads to a single reduce operation in combine2 in MapReduce stage 1

    [Pipeline: Preprocess (read input, graph partition) → GIM-V Stage 1 → GIM-V Stage 2 → convergence test → Postprocess (write output); the GIM-V stages run on the GPU under the scheduler]
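
    The fixed-size payload idea can be sketched with a NumPy structured array: when every key-value record is (vertex id, value) with a fixed stride, record k lives at index k and no per-record size/offset metadata is needed. The modulo ownership rule below is my illustration, not the paper's partitioning scheme.

      import numpy as np

      # Fixed-size key-value records: no per-record metadata, just a flat typed buffer.
      kv_t = np.dtype([("vid", np.int64), ("val", np.float64)])

      def make_records(vids, vals):
          recs = np.empty(len(vids), dtype=kv_t)
          recs["vid"], recs["val"] = vids, vals
          return recs

      def partition_by_owner(recs, num_gpus):
          """Split records into per-GPU sub-graphs by vertex-id ownership (vid % num_gpus)."""
          owners = recs["vid"] % num_gpus
          return [recs[owners == g] for g in range(num_gpus)]

      recs = make_records([0, 1, 2, 3, 4, 5], [0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
      for g, part in enumerate(partition_by_owner(recs, 3)):
          print("GPU", g, part)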

  • Experiments

    - Question: performance of our GIM-V implementation on GPUs
    - Measurement method: a single round of iterations
      - vs. CPU-based Mars
      - vs. the Hadoop-based implementation (PEGASUS)
    - Application
      - PageRank: measures the relative importance of web pages
    - Input data
      - Artificial Kronecker graphs, generated by the Graph500 generator
      - Parameters
        - SCALE: log2 of the number of vertices (#vertices = 2^SCALE)
        - Edge factor: #edges = 16 × #vertices

    [Figure: the adjacency matrix is built recursively from a 2×2 initiator of edge-insertion probabilities (0.57, 0.19; 0.19, 0.05)]
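
    A compact sketch of the Graph500-style Kronecker edge generator the slide refers to: each edge descends SCALE levels of the adjacency matrix, choosing a quadrant with probabilities A, B, C, D = 0.57, 0.19, 0.19, 0.05. This is a simplified illustration (the reference generator also perturbs the probabilities and permutes vertex labels).

      import random

      A, B, C, D = 0.57, 0.19, 0.19, 0.05              # Graph500 initiator probabilities

      def kronecker_edge(scale, rng):
          """Sample one edge by recursively picking a quadrant of the adjacency matrix."""
          i = j = 0
          for _ in range(scale):
              i, j = 2 * i, 2 * j
              r = rng.random()
              if r < A:            pass                 # top-left quadrant
              elif r < A + B:      j += 1               # top-right
              elif r < A + B + C:  i += 1               # bottom-left
              else:                i, j = i + 1, j + 1  # bottom-right
          return i, j

      def kronecker_graph(scale, edgefactor=16, seed=0):
          rng = random.Random(seed)
          n_edges = edgefactor * (2 ** scale)           # #edges = 16 x #vertices
          return [kronecker_edge(scale, rng) for _ in range(n_edges)]

      edges = kronecker_graph(scale=8)                  # 2^8 = 256 vertices, 4096 edges
      print(len(edges), edges[:5])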

  • Experimental Environment

    - We use 3 GPUs on 1 node
      - CPU: 6 cores, 12 threads (Hyper-Threading enabled)
      - GPU: CUDA driver 4.1, CUDA runtime 4.0, compute capability 2.0, shared/L1 cache 64 KB
    - Mars
      - MarsGPU-n: n GPUs per node (n = 1, 2, 3); # threads = # distinct keys, 256 threads per thread block
      - MarsCPU: 12 threads per node, implemented in C with the POSIX threads library instead of CUDA; sort is a parallel quicksort
    - PEGASUS: Hadoop 0.21.0 with HDFS as the file system

                        CPU                 GPU
      Model             Intel Xeon X5670    Tesla C2050
      # physical cores  6                   448
      Frequency         2.93 GHz            1.15 GHz
      Memory            16.3 GB             2.7 GB (global)
      Compiler          gcc 4.3.4           nvcc 3.2

  • Conclusions

    - Conclusions
      - A scalable MapReduce-based GIM-V implementation using multiple GPUs
        - 87.04 ME/s on SCALE 30 (256 nodes, 768 GPUs)
        - 1.52x speedup over the CPU-based implementation on SCALE 29
        - Optimization of load balance
    - Future work
      - Optimize our implementation: improve communication and locality
      - Handle data that does not fit in GPU memory: use local storage as well as CPU/GPU memory; efficient memory-hierarchy management
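
    For a rough sense of scale (my reading, assuming ME/s counts edges processed per second and the Graph500 edge factor of 16 used in the experiments):

      \[
        \#\text{edges} = 16 \times 2^{30} \approx 1.72 \times 10^{10},
        \qquad
        t_{\text{iteration}} \approx \frac{1.72 \times 10^{10}}{87.04 \times 10^{6}\ \text{edges/s}} \approx 197\ \text{s}.
      \]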