Li Chen
State Key Laboratory of Computer Architecture
Institute of Computing Technology, CAS

Exploiting the Potential of Modern Supercomputers Through High Level Language Abstractions
Exploit Hierarchical and Irregular Parallelism in UPC
2
Exploit Hierarchical and Irregular Parallelism in UPC-H
Motivations
Why use UPC?
Exploit the tiered network of Dawning 6000
– GASNet support for HPP architecture
Exploit hierarchical data parallelism for regular applications
Shared work list for irregular applications
3
Deep Memory Hierarchies in Modern Computing Platforms
Many-core accelerators
Traditional multicore processors
– e.g., Harpertown, Dunnington
Intra-node parallelism should be well exploited
4
HPP Interconnect of Dawning 6000
Traditional cluster vs. HPP architecture of Dawning 7000
– Discrete CPU: App CPU and OS CPU
– Hypernode: discrete OS, SSI
– Discrete interconnection: data interconnect, OS interconnect, global sync
Three-tier network
– PE: cache coherence, 2 CPUs
– HPP: 4 nodes
– IB
Global address space, through HPP controller
5
Mapping Hierarchical Parallelism to Modern Supercomputers
Hybrid programming models, MPI+X
– MPI+OpenMP/TBB (+OpenACC/OpenCL)
– MPI+StarPU
– MPI+UPC
Sequoia
– Explicitly tune the data layout and data transfer (parallel memory hierarchy)
– Recursive task tree, static mapping for tasks
HTA
– Data type for hierarchical tiled arrays (multi-level tiling)
– Parallel operators: map parallelism statically
X10
– Combines HTA with Sequoia
– Abstraction of memory hierarchies: hierarchical place tree (Habanero-Java)
– Nested task parallelism, task mapping deferred until launch time
6
Challenges in Efficient Parallel Graph Processing
Data-driven computations
– Parallelism cannot be exploited statically
– Computation partitioning is not suitable
Unstructured problems
– Unstructured and highly irregular data structures
– Data partitioning is not suitable and may lead to load imbalance
Poor locality
– Data access patterns have little locality
High data-access-to-computation ratio
– Explores the structure, not the computation
– Dominated by waits for memory fetches
[Figure: expressing parallelism is low level and tedious, and execution is dominated by memory latency and communication; borrowed from Andrew Lumsdaine, Douglas Gregor]
7
Why Unified Parallel C?
UPC, a parallel extension to ISO C99
– A dialect of the PGAS (Partitioned Global Address Space) languages
Important UPC features
– Global address space: a thread may directly read/write remote data
– Partitioned: data is designated as local or global, with affinity
– Two kinds of memory consistency (strict and relaxed)
UPC performance benefit over MPI
– Permits data sharing, better memory utilization
Looking ahead to future many-core chips and exascale systems
– Better bandwidth and latency using one-sided messages (GASNet)
– No less scalable than MPI (up to 128K threads)
Why use UPC?
– Exploits the non-uniform memory access character of modern computers
– Programmability very close to shared-memory programming
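As a minimal illustration of these features (my example in standard UPC, not code from the talk), the fragment below declares a blocked shared array, writes the elements each thread has affinity to, and reads a remote element directly through the global address space:

#include <upc_relaxed.h>
#include <stdio.h>

/* one block of 4 elements per thread, spread round-robin over the threads */
shared [4] double a[4*THREADS];

int main(void) {
    /* affinity expression &a[i]: each iteration runs on the thread that owns a[i] */
    upc_forall (int i = 0; i < 4*THREADS; i++; &a[i])
        a[i] = MYTHREAD;                        /* writes to data with local affinity */

    upc_barrier;
    if (MYTHREAD == 0)                          /* direct read of (possibly remote) shared data */
        printf("last element = %g\n", (double)a[4*THREADS-1]);
    return 0;
}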
8
The Status of the UPC Community
Portability and usability
– Many different UPC implementations and tools
Compilers: Berkeley UPC, Cray UPC, HP UPC, GCC-based Intrepid UPC and MTU UPC
Performance tools: GASP interface and Parallel Performance Wizard (PPW)
Debuggability: TotalView
– Interoperability with pthreads/MPI(/OpenMP)
UPC is developing in:
– Hierarchical parallelism, asynchronous execution
Tasking mechanisms: scalable work stealing; hierarchical tasking libraries; places, async/finish; asynchronous remote methods
Nested parallelism
Instant teams: data-centric collectives
Irregular benchmarks: UTS, MADNESS, GAP
– Interoperability
Support for hybrid programming with OpenMP and other languages
More convenient support for writing libraries
9
What is UPC-H?
Developed by the compiler group of ICT
– H: heterogeneous, hierarchical
Based on the Berkeley UPC compiler
Features added by ICT:
– Support for HW features of Dawning series computers
HPP interconnect; load/store in the physical global address space
– Hierarchical data distribution and parallelism
Godson-T (many-core processor), GPU clusters, the D6000 computer and x86 clusters
– SWL (shared work list) support for graph algorithms
– Communication optimizations
Software cache, message vectorization
– Runtime system for heterogeneous platforms
Data management
UPC-H Support for HPP Architecture
11
The UPC system lacks a BCL conduit
[Figure: GASNet software stack: the GASNet extended API over the GASNet core API, with conduits for InfiniBand, inter-process shared memory, and the new HPP BCL conduit]
GASNet: Networking for Global-Address Space Languages
BCL: low level communication layer of HPP
12
Implementation of the BCL-conduit
Initialization of the tiered network
– Construct the topology of the tiered network
– Set up a reliable datagram service through QP virtualization
– Initialize internal data structures such as send buffers
Finalization of communication
Network selection in the core API of GASNet
– PSHM, HPP, IB
Flow control of messages
Implementation of Active Messages
– Short message: NAP
– Medium message: NAP
– Long message: RDMA + NAP
– RDMA put/get: RDMA + NAP
[Figure: two-tiered topology vs. three-tiered topology]
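As a rough sketch only (the helper names and the size threshold are my assumptions, not the BCL or GASNet API), the size-based dispatch described above might look like this:

#include <stddef.h>

/* hypothetical transport helpers standing in for the BCL layer */
void nap_send(int dest, int handler, const void *payload, size_t nbytes);
void rdma_put(int dest, void *dst_addr, const void *src, size_t nbytes);

enum { MEDIUM_MAX = 4096 };   /* assumed medium-message payload limit */

/* Short and medium messages travel over NAP; long messages move the bulk
 * payload by RDMA and then send a NAP notification that runs the handler. */
void am_request(int dest, int handler, void *dst_addr,
                const void *payload, size_t nbytes) {
    if (nbytes <= MEDIUM_MAX) {
        nap_send(dest, handler, payload, nbytes);
    } else {
        rdma_put(dest, dst_addr, payload, nbytes);
        nap_send(dest, handler, NULL, 0);
    }
}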
13
BCL Conduit: latency of short messages
Latency of short messages, intra-HPP
Latency of short messages, inter-HPP
14
BCL Conduit, Bandwidth of Med. Messages (intra HPP)
15
BCL Conduit, Bandwidth of Med. Messages (inter HPP)
16
BCL Conduit, Latency of Med. Messages (intra HPP)
17
BCL Conduit , Latency of Med. Messages (inter HPP)
18
BCL Conduit, Latency of Barriers
Net latency of barrier (inter-HPP)
Net latency of barrier (intra-HPP)
19
Summary and Ongoing Work of UPC-H Targeting Dawning 6000
Summary
– The UPC-H compiler now supports the HPP architecture and benefits from the 3-tier network
Ongoing work
– Optimization of the DMA registration strategy
– Evaluate the HPP-supported barrier and collectives
– A full evaluation
Hierarchical Data Parallelism, UPC-H Support for Regular
Applications
21
UPC-H (UPC-Hierarchical/Heterogeneous) Execution Model
Standard UPC is SPMD style and has flat parallelism
UPC-H extension
– Mixes SPMD with fork-join
[Figure: a UPC program runs as SPMD UPC threads; at a upc_forall fork point each UPC thread spawns implicit thread subgroups and implicit threads, which join again at the upc_forall join point]
Two approaches to express hierarchical parallelism
– Implicit threads (or gtasks), organized in thread groups implicitly specified by the data distribution
– Explicit low-level gtasks
22
Multi-level Data Distribution
Data distribution => an implicit thread tree
shared [32][32], [4][4],[1][1] float A[128][128];
[Figure: the implicit thread tree for this declaration; the 128x128 array is split into 32x32 upc-tiles (one per UPC thread), each upc-tile into 4x4 subgroup-tiles (64 per upc-tile), and each subgroup-tile into 1x1 thread-tiles (16 per subgroup-tile), giving the tree UPC program, UPC threads, subgroups, logical implicit threads]
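As a small worked example (my illustration, not from the slides), the per-level owner of element A[i][j] under this declaration can be computed by integer division with each tiling factor:

#include <stdio.h>

/* tiling factors from: shared [32][32], [4][4], [1][1] float A[128][128]; */
enum { UPC_TILE = 32, SUBGRP_TILE = 4, THREAD_TILE = 1 };

int main(void) {
    int i = 70, j = 9;                                   /* an arbitrary element A[i][j] */
    int upc_i = i / UPC_TILE, upc_j = j / UPC_TILE;      /* owning upc-tile */
    int sub_i = (i % UPC_TILE) / SUBGRP_TILE;            /* subgroup-tile inside the upc-tile */
    int sub_j = (j % UPC_TILE) / SUBGRP_TILE;
    int thr_i = (i % SUBGRP_TILE) / THREAD_TILE;         /* thread-tile inside the subgroup-tile */
    int thr_j = (j % SUBGRP_TILE) / THREAD_TILE;
    printf("upc-tile (%d,%d), subgroup-tile (%d,%d), thread-tile (%d,%d)\n",
           upc_i, upc_j, sub_i, sub_j, thr_i, thr_j);
    return 0;
}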
23
UPC-H: Mapping Forall Loop to the Implicit Thread Tree
Leverage an existing language construct, upc_forall
– Independent loops
– Pointer-to-shared or integer-type affinity expression
[Figure: loop iterations are mapped onto the implicit thread tree and then onto the CUDA thread tree, driven by the 3-level data distribution and the machine configuration]

shared [32][32], [4][4], [1][1] float A[128][128];
… …
upc_forall (i = 0; i < 128; i++; continue)
  upc_forall (j = 6; j < 129; j++; &A[i][j-1])
    ... body ...

=> Thread topology: <THREADS, 64, 16> (each UPC thread owns a 32x32 upc-tile, which holds (32/4)x(32/4) = 64 subgroup-tiles, each holding 4x4 = 16 thread-tiles)
24
UPC-H Codes for nbody
shared [1024], [128], [1] point P[4096];
shared [1024][1024] point tempf[4096][4096];

for (int time = 0; time < 1000; time++) {
  upc_forall (int i = 0; i < N; i++; &P[i])
    for (int j = 0; j < N; j++) {
      if (j != i) {
        distance = (float)sqrt((P[i].x-P[j].x)*(P[i].x-P[j].x) +
                               (P[i].y-P[j].y)*(P[i].y-P[j].y));
        if (distance != 0) {
          magnitude = (G*m[i]*m[j])/(distance*distance + C*C);
          ……
          tempf[i][j].x = magnitude*direction.x/distance;
          tempf[i][j].y = magnitude*direction.y/distance;
        }
      }
    }
  upc_forall (int i = 0; … …)
    … …
}
25
Overview of the Compiling Support
On the Berkeley UPC compiler v2.8.0
Compiler analysis
– Multi-dimensional and multi-level data distribution
– Affinity-aware multi-level tiling
UPC tiling, subgroup tiling, thread tiling
Memory tiling for scratchpad memory
– Communication optimization
Message vectorization, loop peeling, static communication scheduling
– Data layout optimizations for GPU
Shared memory optimization
Find better data layouts for memory coalescing: array transpose and structure splitting
– Code generation: CUDA, hierarchical parallelism
26
Affinity-aware Multi-level Loop Tiling (example)

shared [32][32], [4][4], [1][1] float A[128][128];
… …
upc_forall (i = 6; i < 128; i++; continue)
  upc_forall (j = 0; j < 128; j++; &A[i-1][j])
    ... F[i][j] ...

Step 1: iteration space transformation, to make the affinity expression consistent with the data space

upc_forall (i = 5; i < 127; i++; continue)
  upc_forall (j = 0; j < 128; j++; &A[i][j])
    ... F[i+1][j] ...   // after the transformation

Step 2: three-level tiling (actually two levels here)

for (iu = 0; iu < 128; iu = iu + 32)
  for (ju = 0; ju < 128; ju = ju + 32)            // UPC thread affinity
    if (has_affinity(MYTHREAD, &A[iu][ju])) {
      // for the exposed region
      … dsm_read …  F[iu+1 : min(128,iu+32)][ju : min(127,ju+31)]
      for (ib = iu; ib < min(128, iu+32); ib = ib + 4)
        for (jb = ju; jb < min(128, ju+32); jb = jb + 4)
          for (i = ib; i < min(128, ib+4); i = i + 1)
            for (j = jb; j < min(128, jb+4); j = j + 1)
              if (i >= 5 && i < 127)              // sink guards here!
                ... F[i+1][j] ...;
    } // of UPC thread affinity

Step 3: spawn fine-grained threads
… …
27
Memory Optimizations for CUDA
What data will be put into shared memory?
– A 0-1 bin packing problem (over shared memory's capacity)
The profit: reuse degree integrated with the coalescing attribute
– Inter-thread reuse and intra-thread reuse
– Average reuse degree for a merged region
The cost: the volume of the referenced array region
Prefer inter-thread reuse
– Compute the profit and cost for each reference
What is the optimal data layout in the GPU's global memory?
– Coalescing attributes of array references; only the contiguity constraints of coalescing are considered
– Legality analysis
– Cost model and amortization analysis
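A minimal sketch of such a profit/cost selection (my illustration, not the UPC-H compiler's actual algorithm): greedily pick candidate array regions by profit-to-cost ratio until the shared-memory budget is exhausted.

#include <stdlib.h>

typedef struct {
    const char *name;    /* candidate array region for shared memory */
    double profit;       /* e.g. reuse degree weighted by the coalescing attribute */
    size_t cost;         /* bytes occupied by the referenced region */
    int chosen;
} Candidate;

static int by_ratio(const void *a, const void *b) {
    const Candidate *x = a, *y = b;
    double rx = x->profit / (double)x->cost, ry = y->profit / (double)y->cost;
    return (rx < ry) - (rx > ry);                 /* sort by descending ratio */
}

/* Greedy 0-1 selection under the shared-memory capacity (a heuristic,
 * not an exact solution of the underlying packing problem). */
void choose_for_shared_memory(Candidate *c, int n, size_t capacity) {
    size_t used = 0;
    qsort(c, n, sizeof *c, by_ratio);
    for (int i = 0; i < n; i++)
        if (used + c[i].cost <= capacity) {
            c[i].chosen = 1;
            used += c[i].cost;
        }
}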
28
Overview of the Runtime Support
Multi-dimensional data distribution support
Gtask support on multicore platforms
– Workload scheduling, synchronization, topology-aware mapping and binding
DSM system for unified memory management
– GPU heap management
– Block-based memory consistency
– Inter-UPC-thread message generation and data shuffling
Data shuffling to generate data tiles with halos
Data transformations for GPUs
– Dynamic data layout transformations for global memory coalescing, demand driven
– Demand-driven data transfer between CPU and GPU
29
Unified Memory Management
Demand-driven data transfer
– Only on the local data space; no software caching of remote data
– Consistency is maintained at the boundary between CPU code and GPU code
Demand-driven data layout transformation
– Redundant data transformations are removed
– An extra field records the current layout of each data-tile copy
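A minimal sketch of such a data-tile descriptor and the demand-driven check at the CPU/GPU boundary (the field and helper names are my assumptions, not the actual UPC-H runtime):

#include <stddef.h>

typedef enum { LAYOUT_ROW_MAJOR, LAYOUT_TRANSPOSED } layout_t;
enum { VALID_HOST = 1, VALID_DEVICE = 2 };

typedef struct {
    void *host_ptr;       /* copy in CPU memory */
    void *dev_ptr;        /* copy in GPU global memory */
    size_t bytes;
    layout_t layout;      /* extra field: current layout of this tile copy */
    int valid;            /* bitmask: where the up-to-date copy lives */
} tile_copy_t;

/* hypothetical helpers */
void copy_host_to_device(void *dst, const void *src, size_t bytes);
void transform_layout_on_device(void *buf, size_t bytes, layout_t from, layout_t to);

/* Called at the boundary before GPU code touches a tile:
 * transfer and re-layout only when actually needed. */
void make_device_ready(tile_copy_t *t, layout_t wanted) {
    if (!(t->valid & VALID_DEVICE)) {
        copy_host_to_device(t->dev_ptr, t->host_ptr, t->bytes);
        t->valid |= VALID_DEVICE;
    }
    if (t->layout != wanted) {                    /* skipped when redundant */
        transform_layout_on_device(t->dev_ptr, t->bytes, t->layout, wanted);
        t->layout = wanted;
    }
}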
30
Benchmarks for the GPU Cluster

Application | Description | Original language | Application field | Source
Nbody | n-body simulation | CUDA+MPI | Scientific computing | CUDA campus programming contest 2009
LBM | Lattice Boltzmann method in computational fluid dynamics | C | Scientific computing | SPEC CPU 2006
CP | Coulombic potential | CUDA | Scientific computing | UIUC Parboil benchmark
MRI-FHD | Magnetic resonance imaging, FHD | CUDA | Medical image analysis | UIUC Parboil benchmark
MRI-Q | Magnetic resonance imaging, Q | CUDA | Medical image analysis | UIUC Parboil benchmark
TPACF | Two-point angular correlation function | CUDA | Scientific computing | UIUC Parboil benchmark
31
UPC-H Performance on GPU Cluster
A 4-node CUDA cluster with 1000M Ethernet; each node has
– CPUs: 2 dual-core AMD Opteron 880
– GPU: NVIDIA GeForce 9800 GX2
Compilers: nvcc (2.2) -O3, GCC (3.4.6) -O3
[Chart: one-node speedup over serial execution for nbody and lbm; bars for base DSM, memory coalescing, SM reuse, and manual CUDA]
[Chart: four-node speedup over serial execution (log2) for nbody, mri-fhd, mri-q, tpacf and cp; bars for base DSM, memory coalescing, SM reuse, and manual CUDA/MPI]
Performance: 72% of the manual CUDA versions, on average
32
UPC-H Performance on Godson-T
The average speedup of the SPM optimization is 2.30; with double buffering it is 2.55
[Chart: speedups of UPC-H benchmarks on Godson-T]
33
UPC-H Performance on Multi-core Cluster
Hardware and software
– Xeon X7550 CPUs, 8 per node = 64 cores/node, 40Gb InfiniBand, ibv conduit, mvapich2-1.4
Benchmarks
– NPB: CG, FT
– nbody, MM, Cannon MM
Results
– NPB performance: UPC-H reaches 90% of UPC+OpenMP
– Cannon MM can leverage optimal data sharing and communication coalescing, and expresses complicated hierarchical data parallelism that is hard to express in UPC+OpenMP
[Chart: performance ratio UPC-H / UPC+OMP versus thread team size (1, 2, 4, 8) for CG class B/C, FT class B/C and nbody-16384]
[Chart: UPC-H / UPC speedups for Cannon MM versus total threads (4 to 64) and matrix sizes 1024, 2048, 4096, 8192]
SWL, UPC-H Support for Graph Algorithms
35
Introduction
Graph
– A flexible abstraction for describing relationships between discrete objects
– The basis of exploration-based applications (genomics, astrophysics, social network analysis, machine learning)
Graph search algorithms
– An important technique for analyzing the vertices or edges of a graph
– Breadth-first search (BFS) is widely used and is the basis of many others (CC, SSSP, best-first search, A*)
– Kernel of the Graph500 benchmarks
36
Challenges in Efficient Parallel Graph Processing
Data-driven computations
– Parallelism cannot be exploited statically
– Computation partitioning is not suitable
Unstructured problems
– Unstructured and highly irregular data structures
– Data partitioning is not suitable and may lead to load imbalance
Poor locality
– Data access patterns have little locality
High data-access-to-computation ratio
– Explores the structure, not the computation
– Dominated by waits for memory fetches
[Figure: the same challenges (expressing parallelism, memory-latency- and communication-dominated execution, low-level and tedious programming) addressed by a user-directed, automatically optimized, global-view, high-level abstraction; borrowed from Andrew Lumsdaine, Douglas Gregor]
37
Tedious Optimizations of BFS (graph algorithm)

Optimizing BFS on clusters:

Perf. problem | Goal | Techniques
Memory access | Leverage non-blocking caches | Multithreading
Synchronization | Reduce the overhead of shared-data protection | Use atomic operations, not locks (sketch below)
Synchronization | The scalability problem of collective operations | Multithreading + hierarchical collectives
Communication | Avoid small messages that waste network bandwidth | Message vectorization
Communication | Hide the overhead of communication | Asynchronous operations
Communication | Reduce the number of messages | Multithreading
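As an illustration of the "atomic operations, not locks" row (my sketch in plain C11, not code from the talk), the visited-bit test-and-set in a parallel BFS can use a compare-and-swap instead of a lock:

#include <stdatomic.h>
#include <stdint.h>

/* one visited bit per vertex; each 64-bit word is updated atomically */
static _Atomic uint64_t visited[(1 << 20) / 64];

/* Returns 1 if this call marked the vertex (it was not visited before). */
static int try_mark_visited(long v) {
    _Atomic uint64_t *word = &visited[v >> 6];
    uint64_t bit = 1ULL << (v & 63);
    uint64_t old = atomic_load_explicit(word, memory_order_relaxed);
    while (!(old & bit)) {
        if (atomic_compare_exchange_weak_explicit(word, &old, old | bit,
                memory_order_acq_rel, memory_order_relaxed))
            return 1;        /* we set the bit first: vertex newly visited */
    }
    return 0;                /* another thread had already visited it */
}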
38
Data-Centric Parallelism Abstraction for Irregular Applications

Amorphous data parallelism (Keshav Pingali)
– Active elements (activities)
– Neighborhood
– Ordering
Def: given a set of active nodes and an ordering on active nodes, amorphous data parallelism is the parallelism that arises from simultaneously processing active nodes, subject to neighborhood and ordering constraints
Exploiting such parallelism: a work list
– Keeps track of active elements and their ordering
– Unordered-set iterator and ordered-set iterator (in the Galois system)
– Conflicts among concurrent operations: support for speculative execution
39
Design Principles of SWL
Programmability
– Global-view programming
– High-level language abstraction
Flexibility
– User control of data locality (construction/execution)
– Customizable construction and behavior of work items
Lightweight speculative execution
– Triggered by user hints, not purely automatic
– Lightweight conflict detection; locks are too costly
40
SWL Extension in UPC-H
1) Specify a work list
2) User-defined work constructor
3) Two iterators over the work list: a blocking one and a non-blocking one
4) Two kinds of work-item dispatcher
5) User-assisted speculation: upc_spec_get() / upc_spec_put()
Optimization details hidden from users:
message coalescing, queue management, asynchronous communication, speculative execution, etc.
41
Level Synchronized BFS in SWL, Code Example
In UPCH on clusters (the slide shows the Galois shared-memory version alongside for comparison):

Work_t usr_add(Msg_t msg) {
  Work_t res_work;
  if (!TEST_VISITED(msg.tgt)) {
    pred[msg.tgt] = msg.src;
    SET_VISITED(msg.tgt);
    res_work = msg.tgt;
  } else
    res_work = NULL;
  return res_work;
}

while (1) {
  int any_set = 0;
  upc_worklist_foreach (Work_t rcv : list1) {
    size_t ei = g.rowstarts[VERTEX_LOCAL(rcv)];
    size_t ei_end = g.rowstarts[VERTEX_LOCAL(rcv) + 1];
    for ( ; ei < ei_end; ++ei) {
      long w = g.column[ei];
      if (w == rcv) continue;
      Msg_t msg;
      msg.tgt = w;
      msg.src = rcv;
      upc_worklist_add(list2, &pred[w], usr_add(msg));
      any_set = 1;
    } // for each row
  } // foreach
  bupc_all_reduce_allI(.....);
  if (final_set[MYTHREAD] == 0) break;
  upc_worklist_exchage(list1, list2);
} // while
42
Asynchronous BFS in SWL, Code Example
Asynchronous implementation on SM machines (Galois)
In UPCH on clusters:
43
Implementation of SWL

Execution model
– SPMD
– SPMD + multithreading
Master/slave
– State transitions: executing; idle; termination detection; exit
Work dispatching
– AM-based, distributed
– Coalescing of work items and asynchronous transfer
– Mutual exclusion on the SWL and the work-item buffers
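A minimal sketch of coalescing work items into per-destination buffers before an asynchronous, AM-based transfer (my illustration with hypothetical names, not the UPC-H runtime):

#define THREADS_MAX 1024              /* assumed upper bound on the number of UPC threads */
#define COALESCE_CAP 256              /* work items buffered per destination thread */

typedef struct { long tgt, src; } Msg_t;

typedef struct {
    Msg_t items[COALESCE_CAP];
    int count;
} send_buf_t;

static send_buf_t out[THREADS_MAX];   /* one outgoing buffer per destination thread */

/* hypothetical transport: ship a batch of work items to 'dest' asynchronously via an AM */
void am_send_work_async(int dest, const Msg_t *items, int n);

/* Add a work item destined for thread 'dest'; the batch is flushed only when the
 * buffer fills, turning many small messages into one larger asynchronous message. */
void worklist_add_remote(int dest, Msg_t m) {
    send_buf_t *b = &out[dest];
    b->items[b->count++] = m;
    if (b->count == COALESCE_CAP) {
        am_send_work_async(dest, b->items, b->count);
        b->count = 0;
    }
}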
44
User-Assisted Speculative Execution
User API
upc_spec_get
– Acquire data ownership
– Data transfer; get the shadow copy
– Conflict checking and rollback
– Enter the critical region
upc_cmt_put
– Release data ownership
– Commit the computation
Compiler: identify speculative hints
– upc_spec_get/put
Fine-grained atomic protection
– Full/empty bits
Runtime system: two modes, speculative and non-speculative; rollback of data and computation
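A minimal usage sketch of wrapping an update to a shared vertex record in this API (only the names upc_spec_get/upc_cmt_put come from the slides; the signatures and return value are my assumptions):

#include <stddef.h>

/* assumed prototypes */
int upc_spec_get(shared void *obj, void *shadow, size_t nbytes);        /* acquire ownership, fill shadow copy */
void upc_cmt_put(shared void *obj, const void *shadow, size_t nbytes);  /* commit and release ownership */

typedef struct { long pred; int level; } vinfo_t;
shared vinfo_t vtx[1024*THREADS];     /* 1024 vertices per UPC thread */

void speculative_update(long v, long new_pred, int new_level) {
    vinfo_t local;
    /* enter the critical region; the runtime detects conflicts and rolls back */
    if (upc_spec_get(&vtx[v], &local, sizeof local)) {
        if (new_level < local.level) {              /* the actual update, on the shadow copy */
            local.pred = new_pred;
            local.level = new_level;
        }
        upc_cmt_put(&vtx[v], &local, sizeof local);
    }
}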
45
SPMD Execution, on Shared Memory Machine and Cluster

Shared-memory machine: Intel Xeon X7550 @ 2.00GHz * 8; scale=20, edgefactor=16
– On the shared-memory machine, UPC gets very close to OpenMP
Cluster: (Intel Xeon E5450 @ 3.00GHz * 2) * 64 nodes; scale=20, edgefactor=16
– On the cluster, UPC is better than MPI:
1) It saves one copy for each work item
2) Frequent polling raises the network throughput
46
SPMD+MT, on X86 Cluster
[Chart: SWL SYNC BFS performance versus the number of pthreads per UPC thread; scale=24, edgefactor=16]
47
On D6000, Strong Scaling of SWL SYNC BFS
ICT Loongson-3A V0.5 FPU [email protected], *2
1) The MPI conduit has large overhead
2) The tiered network behaves better when more intra-HPP communication happens
[Chart: strong scaling of SWL_SYNC_BFS in TEPS over the configurations H1N4P1(4), H4N1P1(1), H4N2P1(2), H4N4P1(4), H4N4P2(4), H4N4P2(8), comparing MPI_Simple(MPI), UPC_SWL(IBV), UPC_SWL(IBV+BCL) and UPC_SWL(MPI); scale=24, edgefactor=16]
48
Summary and Future Work on SWL
Summary
– Put forward the Shared Work List (SWL) extension to UPC to tackle amorphous data parallelism
– Using SWL, BFS achieves better performance and scalability than MPI at certain scales and runtime configurations
– Realizes tedious optimizations with less user effort
Future work
– Realize and evaluate the speculative execution support (Delaunay triangulation refinement)
– Add a dynamic scheduler to the SWL iterators
– Evaluate more graph algorithms
49
Acknowledgement
Shenglin Tang
Shixiong Xu
Xingjing Lu
Zheng Wu
Lei Liu
Chengpeng Li
Zheng Jing
50
THANKS
51
Workload distribution of an upc_forall

shared [32][32], [4][4], [1][1] float A[128][128];
… …
upc_forall (i = 0; i < 128; i++; continue)
  upc_forall (j = 5; j < 128; j++; &A[i][j])
    ... body ...
[Figure: workload distribution of the iteration space (i: 0-127, j: 5-127) over the UPC program, its 16 UPC threads (THREADS=16, a 4x4 grid of threads 0-15), the 64 subgroups of each upc-tile, and the implicit threads, drawn as a tree whose edges carry iteration ranges such as 0:63, 0:15, 0:3]
52
Leverage load/store support within HPP