
Page 1: Exploiting the Potential of Modern Supercomputers Through High Level Language Abstractions

Exploit Hierarchical and Irregular Parallelism in UPC

Li Chen
State Key Laboratory of Computer Architecture
Institute of Computing Technology, CAS

Page 2: Exploit Hierarchical and Irregular Parallelism in UPC-H

Motivations

Why use UPC?

Exploit the tiered network of Dawning 6000

– GASNet support for HPP architecture

Exploit hierarchical data parallelism for regular applications

Shared work list for irregular applications

Page 3: Deep Memory Hierarchies in Modern Computing Platforms

Many-core accelerators

Traditional multicore processors (e.g., Harpertown, Dunnington)

Intra-node parallelism should be well exploited

Page 4: HPP Interconnect of Dawning 6000

Traditional cluster vs. HPP architecture of Dawning 7000
– Discrete CPUs: App CPU and OS CPU
– Hypernode: discrete OS, SSI
– Discrete interconnects: data interconnect, OS interconnect, global sync

Three-tier network
– PE: cache coherence, 2 CPUs
– HPP: 4 nodes
– IB

Global address space, through the HPP controller

Page 5: Mapping Hierarchical Parallelism to Modern Supercomputers

Hybrid programming models, MPI+X

– MPI+OpenMP/TBB (+OpenACC/OpenCL)
– MPI+StarPU
– MPI+UPC

Sequoia
– Explicitly tune the data layout and data transfer (parallel memory hierarchy)
– Recursive task tree, static mapping for tasks

HTA
– Data type for hierarchical tiled arrays (multi-level tiling)
– Parallel operators: map parallelism statically

X10
– Combines HTA with Sequoia
– Abstraction of memory hierarchies: hierarchical place tree (Habanero-Java)
– Nested task parallelism, task mapping deferred until launch time

Page 6: Challenges in Efficient Parallel Graph Processing

Data-driven computations
– Parallelism cannot be exploited statically
– Computation partitioning is not suitable

Unstructured problems
– Unstructured and highly irregular data structures
– Data partitioning is not suitable and may lead to load-balancing problems

Poor locality
– Data access patterns have little locality

High data-access-to-computation ratio
– Explores the structure, not the computation
– Dominated by waiting for memory fetches

[Figure: whether on shared-memory machines or clusters, expressing the parallelism is low level and tedious, and execution is memory-latency dominated (and, on clusters, communication dominated).]

(borrowed from Andrew Lumsdaine, Douglas Gregor)

Page 7: Why Unified Parallel C?

UPC, a parallel extension of ISO C99
– A dialect of PGAS (Partitioned Global Address Space) languages

Important UPC features
– Global address space: a thread may directly read/write remote data
– Partitioned: data is designated as local or global, with affinity
– Two kinds of memory consistency (strict and relaxed)

UPC performance benefit over MPI
– Permits data sharing, better memory utilization

Thinking of future many-core chips and exascale systems
– Better bandwidth and latency using one-sided messages (GASNet)
– No less scalable than MPI (up to 128K threads)

Why use UPC?
– Captures the non-uniform memory access nature of modern computers
– Programmability very close to shared-memory programming
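A minimal UPC sketch of these features (the block size, array name, and length are illustrative, not taken from the talk): a blocked shared array, owner-computes iteration via upc_forall affinity, and a one-sided remote read.

#include <upc.h>
#include <stdio.h>

#define N 1024

/* Block distribution: blocks of 16 elements, dealt round-robin to threads. */
shared [16] double a[N];

int main(void) {
    int i;
    /* Owner-computes: each thread executes only the iterations whose
       affinity expression &a[i] points to data it owns. */
    upc_forall(i = 0; i < N; i++; &a[i])
        a[i] = MYTHREAD;

    upc_barrier;

    /* One-sided access: thread 0 reads an element owned by another thread
       simply by dereferencing the shared array; no matching receive needed. */
    if (MYTHREAD == 0)
        printf("a[N-1] has affinity to thread %d, value %.0f\n",
               (int)upc_threadof(&a[N - 1]), a[N - 1]);
    return 0;
}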

Page 8: The Status of the UPC Community

Portability and usability
– Many different UPC implementations and tools

Berkeley UPC, Cray UPC, HP UPC, GCC-based Intrepid UPC, and MTU UPC
Performance tools: GASP interface and Parallel Performance Wizard (PPW)
Debuggability: TotalView

– Interoperability with pthreads/MPI (/OpenMP)

UPC is developing in several directions:
– Hierarchical parallelism and asynchronous execution
  Tasking mechanisms: scalable work stealing; hierarchical tasking libraries
  Places, async/finish; asynchronous remote methods
  Nested parallelism
  Instant teams: data-centric collectives
  Irregular benchmarks: UTS, MADNESS, GAP
– Interoperability
  Support for hybrid programming with OpenMP and other languages
  More convenient support for writing libraries

Page 9: What is UPC-H?

Developed by the compiler group of ICT
– H: heterogeneous, hierarchical

Based on the Berkeley UPC compiler

Features added by ICT
– Support for HW features of the Dawning series computers
  HPP interconnect
  Load/store in the physical global address space
– Hierarchical data distribution and parallelism
  Godson-T (many-core processor), GPU clusters, the D6000 computer, and x86 clusters
– SWL support for graph algorithms
– Communication optimizations
  Software cache, message vectorization
– Runtime system for heterogeneous platforms
  Data management

Page 10: UPC-H Support for HPP Architecture

Page 11: Lack of a BCL Conduit in the UPC System

[Figure: GASNet software stack. The GASNet Extended API sits on the GASNet Core API, which runs over the network conduits: InfiniBand, inter-process shared memory (PSHM), and HPP BCL.]

GASNet: networking for global address space languages
BCL: the low-level communication layer of HPP

Page 12: Implementation of the BCL Conduit

Initialization of the tiered network
– Construct the topology of the tiered network
– Set up a reliable datagram service through QP virtualization
– Initialize internal data structures such as send buffers

Finalization of communication

Network selection in the GASNet core API
– PSHM, HPP, IB

Flow control of messages

Implementation of Active Messages
– Short message: NAP
– Medium message: NAP
– Long message: RDMA + NAP
– RDMA Put/Get: RDMA + NAP

[Figure: two-tiered topology vs. three-tiered topology]

Page 13: BCL Conduit: Latency of Short Messages

[Charts: latency of short messages, intra-HPP and inter-HPP]

Page 14: BCL Conduit, Bandwidth of Med. Messages (intra-HPP)

Page 15: BCL Conduit, Bandwidth of Med. Messages (inter-HPP)

Page 16: BCL Conduit, Latency of Med. Messages (intra-HPP)

Page 17: BCL Conduit, Latency of Med. Messages (inter-HPP)

Page 18: BCL Conduit, Latency of Barriers

Net latency of barrier (inter-HPP)

Net latency of barrier (intra-HPP)

Page 19: Summary and Ongoing Work of UPC-H Targeting Dawning 6000

Summary
– The UPC-H compiler now supports the HPP architecture and benefits from the 3-tier network

Ongoing work
– Optimization of the DMA registration strategy
– Evaluate the HPP-supported barrier and collectives
– Full-length evaluation

Page 20: Hierarchical Data Parallelism, UPC-H Support for Regular Applications

Page 21: UPC-H (UPC-Hierarchical/Heterogeneous) Execution Model

Standard UPC is SPMD style and has flat parallelism

UPC-H extension
– Mix SPMD with fork-join

[Figure: a UPC program of SPMD UPC threads reaches a fork point at an upc_forall, where each UPC thread spawns implicit subgroups and implicit threads; they join again at the end of the upc_forall.]

– Two approaches to express hierarchical parallelism
  Implicit threads (gtasks), organized in thread groups implicitly specified by the data distribution
  Explicit low-level gtasks

Page 22: Multi-level Data Distribution

Data distribution => an implicit thread tree

shared [32][32], [4][4],[1][1] float A[128][128];

[Figure: the declaration induces an implicit thread tree. The UPC program splits the 128x128 array into 16 UPC-tiles of 32x32, one per UPC thread; each UPC-tile is split into 64 subgroup-tiles of 4x4; each subgroup-tile is split into 16 thread-tiles of 1x1, mapped to logical implicit threads.]

Page 23: UPC-H: Mapping Forall Loop to the Implicit Thread Tree

Leverage an existing language construct, upc_forall
– Independent loop
– Pointer-to-shared or integer affinity expression

Loop iterations -> implicit thread tree -> CUDA thread tree (given the 3-level data distribution and the machine configuration)

shared [32][32],[4][4],[1][1] float A[128][128];
... ...
upc_forall(i=0; i<128; i++; continue)
  upc_forall(j=6; j<129; j++; &A[i][j-1])
    ... body ...

=> Thread topology: <THREADS, 64, 16>
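A worked reading of this topology, following the tile counts implied by the distribution: the [32][32] top-level blocks give (128/32)^2 = 16 UPC-level tiles, one per UPC thread (THREADS); the [4][4] blocks give (32/4)^2 = 64 subgroup tiles within each UPC tile; and the [1][1] blocks give (4/1)^2 = 16 implicit threads per subgroup. Hence <THREADS, 64, 16>, which maps naturally onto a CUDA launch of 64 thread blocks of 16 threads per UPC thread.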

Page 24: UPC-H Codes for nbody

shared [1024],[128],[1] point P[4096];
shared [1024][1024] point tempf[4096][4096];

for (int time = 0; time < 1000; time++) {
  upc_forall(int i = 0; i < N; i++; &P[i])
    for (int j = 0; j < N; j++) {
      if (j != i) {
        distance = (float)sqrt((P[i].x-P[j].x)*(P[i].x-P[j].x) +
                               (P[i].y-P[j].y)*(P[i].y-P[j].y));
        if (distance != 0) {
          magnitude = (G*m[i]*m[j])/(distance*distance + C*C);
          ......
          tempf[i][j].x = magnitude*direction.x/distance;
          tempf[i][j].y = magnitude*direction.y/distance;
        }
      }
    }

  upc_forall(int i = 0; ... ...)
    ... ...
}

Page 25: Overview of the Compiling Support

On the Berkeley UPC compiler v2.8.0

Compiler analysis

– Multi-dimensional and multi-level data distribution
– Affinity-aware multi-level tiling
  UPC tiling, subgroup tiling, thread tiling
  Memory tiling for scratchpad memory
– Communication optimization
  Message vectorization, loop peeling, static communication scheduling
– Data layout optimizations for GPU
  Shared-memory optimization
  Finding better data layouts for memory coalescing: array transpose and structure splitting
– Code generation: CUDA, hierarchical parallelism

Page 26: Affinity-aware Multi-level Loop Tiling (Example)

shared [32][32],[4][4],[1][1] float A[128][128];
... ...
upc_forall(i=6; i<128; i++; continue)
  upc_forall(j=0; j<128; j++; &A[i-1][j])
    ... F[i][j] ...

Step 1: iteration space transformation, to make the affinity expression consistent with the data space

upc_forall(i=5; i<127; i++; continue)
  upc_forall(j=0; j<128; j++; &A[i][j])
    ... F[i+1][j] ...   // after transformation

Step 2: three-level tiling (actually two levels here)

for (iu=0; iu<128; iu=iu+32)
  for (ju=0; ju<128; ju=ju+32)            // upc thread affinity
    if (has_affinity(MYTHREAD, &A[iu][ju])) {
      // for the exposed region
      ...dsm_read... F[iu+1 : min(128, iu+32)][ju : min(127, ju+31)]
      for (ib=iu; ib<min(128, iu+32); ib=ib+4)
        for (jb=ju; jb<min(128, ju+32); jb=jb+4)
          for (i=ib; i<min(128, ib+4); i=i+1)
            for (j=jb; j<min(128, jb+4); j=j+1)
              if (i>=5 && i<127)          // sink guards here!
                ... F[i+1][j] ...;
    } // of upc thread affinity

Step 3: spawn fine-grained threads
... ...

Page 27: Memory Optimizations for CUDA

What data should be placed in shared memory?
– A 0-1 bin-packing problem (over the shared memory's capacity)
– The profit: reuse degree integrated with the coalescing attribute
  Inter-thread reuse and intra-thread reuse; average reuse degree for a merged region; prefer inter-thread reuse
– The cost: the volume of the referenced array region
– Compute the profit and cost for each reference (a selection sketch follows below)

What is the optimal data layout in the GPU's global memory?
– Coalescing attributes of array references; only the contiguity constraints of coalescing are considered
– Legality analysis
– Cost model and amortization analysis
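A minimal sketch of the selection step, assuming a greedy approximation of this 0-1 bin packing that ranks candidates by profit density (reuse weighted by coalescing, divided by bytes); the names, numbers, and the heuristic itself are illustrative, not the compiler's actual algorithm.

#include <stdio.h>
#include <stdlib.h>

/* One candidate array region that could be staged in CUDA shared memory. */
typedef struct {
    const char *name;
    double      profit;  /* reuse degree weighted by coalescing attribute */
    size_t      cost;    /* bytes of the referenced region (with halo)    */
    int         chosen;
} Candidate;

static int by_density(const void *a, const void *b) {
    const Candidate *x = a, *y = b;
    double dx = x->profit / (double)x->cost;
    double dy = y->profit / (double)y->cost;
    return (dx < dy) - (dx > dy);     /* sort by descending profit density */
}

/* Greedy 0-1 packing: take the densest regions until the budget runs out. */
static void select_for_shared_memory(Candidate *c, int n, size_t budget) {
    qsort(c, n, sizeof *c, by_density);
    for (int i = 0; i < n; i++) {
        if (c[i].cost <= budget) {
            c[i].chosen = 1;
            budget -= c[i].cost;
        }
    }
}

int main(void) {
    Candidate cand[] = {
        { "A_tile", 8.0, 16 * 1024, 0 },   /* high inter-thread reuse */
        { "B_tile", 2.0, 32 * 1024, 0 },
        { "C_tile", 5.0,  8 * 1024, 0 },
    };
    select_for_shared_memory(cand, 3, 48 * 1024);   /* 48 KB budget */
    for (int i = 0; i < 3; i++)
        printf("%s -> %s\n", cand[i].name,
               cand[i].chosen ? "shared memory" : "global memory");
    return 0;
}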

Page 28: Overview of the Runtime Support

Multi-dimensional data distribution support

Gtask support on multicore platforms

– Workload scheduling, synchronization, topology-aware mapping and binding

DSM system for unified memory management
– GPU heap management
– Memory consistency, block-based
– Inter-UPC message generation and data shuffling
  Data shuffling to generate data tiles with halos

Data transformations for GPUs
– Dynamic data layout transformations, for global memory coalescing, demand driven
– Demand-driven data transfer between CPU and GPU

Page 29: Unified Memory Management

Demand-driven data transfer
– Only on the local data space; no software caching of remote data
– Consistency is maintained at the boundary between CPU code and GPU code

Demand-driven data layout transformation
– Redundant data transformations are removed
– An extra field records the current layout of each data tile copy
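A sketch of the bookkeeping this implies, assuming a per-tile descriptor whose extra layout field lets the runtime skip redundant transfers and layout transformations; all field and function names are hypothetical, and a real runtime would issue CUDA copies and a transform kernel where the comments indicate.

#include <string.h>

typedef enum { LAYOUT_ROW_MAJOR, LAYOUT_TRANSPOSED } layout_t;
typedef enum { COPY_ON_HOST, COPY_ON_DEVICE } location_t;

/* Hypothetical per-tile descriptor kept by the DSM runtime. */
typedef struct {
    void      *host_ptr;
    void      *dev_ptr;      /* would be a device allocation on a real GPU   */
    size_t     bytes;
    location_t valid_copy;   /* which side holds the up-to-date data         */
    layout_t   cur_layout;   /* extra field: current layout of the tile copy */
} tile_desc_t;

/* Demand driven: transfer and re-layout only when the descriptor shows the
   device copy is stale or laid out differently from what the kernel wants,
   so redundant transformations are removed. */
static void ensure_on_device(tile_desc_t *t, layout_t wanted) {
    if (t->valid_copy != COPY_ON_DEVICE) {
        memcpy(t->dev_ptr, t->host_ptr, t->bytes);  /* real: cudaMemcpy H2D */
        t->valid_copy = COPY_ON_DEVICE;
    }
    if (t->cur_layout != wanted) {
        /* real: launch a layout-transform (e.g. transpose) kernel here */
        t->cur_layout = wanted;
    }
}

int main(void) {
    static char host[64], dev[64];
    tile_desc_t t = { host, dev, sizeof host, COPY_ON_HOST, LAYOUT_ROW_MAJOR };
    ensure_on_device(&t, LAYOUT_TRANSPOSED);  /* copies once, re-layouts once */
    ensure_on_device(&t, LAYOUT_TRANSPOSED);  /* second call is a no-op       */
    return 0;
}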

Page 30: Benchmarks for the GPU Cluster

Application | Description                                              | Original language | Application field      | Source
Nbody       | n-body simulation                                        | CUDA+MPI          | Scientific computing   | CUDA campus programming contest 2009
LBM         | Lattice Boltzmann method in computational fluid dynamics | C                 | Scientific computing   | SPEC CPU 2006
CP          | Coulombic potential                                      | CUDA              | Scientific computing   | UIUC Parboil Benchmark
MRI-FHD     | Magnetic resonance imaging, FHD                          | CUDA              | Medical image analysis | UIUC Parboil Benchmark
MRI-Q       | Magnetic resonance imaging, Q                            | CUDA              | Medical image analysis | UIUC Parboil Benchmark
TPACF       | Two-point angular correlation function                   | CUDA              | Scientific computing   | UIUC Parboil Benchmark

Page 31: UPC-H Performance on GPU Cluster

4-node CUDA cluster, 1000M Ethernet; each node has
– CPUs: 2 dual-core AMD Opteron 880
– GPU: NVIDIA GeForce 9800 GX2

Compilers: nvcc (2.2) -O3, GCC (3.4.6) -O3

[Charts: one-node speedup over serial execution (nbody, lbm) and four-node speedup over serial execution, log2 scale (nbody, mri-fhd, mri-q, tpacf, cp); series: base DSM, memory coalescing, SM reuse, manual CUDA/MPI.]

Performance: 72%, on average.

Page 32: UPC-H Performance on Godson-T

The average speedup of the SPM optimization is 2.30; that of double buffering is 2.55.

[Chart: speedups on Godson-T]

Page 33: UPC-H Performance on Multi-core Cluster

Hardware and software

– Xeon X7550 * 8 = 64 cores/node, 40Gb InfiniBand, ibv conduit, mvapich2-1.4

Benchmarks
– NPB: CG, FT
– nbody, MM, Cannon MM

Results
– NPB performance: UPC-H reaches 90% of UPC+OMP
– Cannon MM can leverage optimal data sharing and communication coalescing
– Expresses complicated hierarchical data parallelism that is hard to express in UPC+OpenMP

[Charts: performance ratio UPC-H/UPC+OMP vs. thread team size (1, 2, 4, 8) for CG-B-2, CG-C-2, FT-B-4, FT-C-8, nbody-16384-16; and UPC-H/UPC speedups for Cannon MM vs. total threads (4, 8, 16, 32, 64) for matrix sizes 1024, 2048, 4096, 8192.]

Page 34: SWL, UPC-H Support for Graph Algorithms

Page 35: Introduction

Graph
– A flexible abstraction for describing relationships between discrete objects
– The basis of exploration-based applications (genomics, astrophysics, social network analysis, machine learning)

Graph search algorithms
– An important technique for analyzing the vertices or edges of a graph
– Breadth-first search (BFS) is widely used and is the basis of many others (CC, SSSP, best-first search, A*)
  Kernel of the Graph500 benchmark

Page 36: Challenges in Efficient Parallel Graph Processing

Data-driven computations
– Parallelism cannot be exploited statically
– Computation partitioning is not suitable

Unstructured problems
– Unstructured and highly irregular data structures
– Data partitioning is not suitable and may lead to load-balancing problems

Poor locality
– Data access patterns have little locality

High data-access-to-computation ratio
– Explores the structure, not the computation
– Dominated by waiting for memory fetches

[Figure: whether on shared-memory machines or clusters, expressing the parallelism is low level and tedious, and execution is memory-latency dominated (and, on clusters, communication dominated); the approach proposed here aims at user-directed automatic optimization with a global-view, high-level programming model.]

(borrowed from Andrew Lumsdaine, Douglas Gregor)

Page 37: Tedious Optimizations of BFS (graph algorithm)

Optimizing BFS on clusters:

Perf. problem   | Goal                                               | Technique
Memory access   | Leverage non-blocking caches                       | Multithreading
Synchronization | Reduce the overhead of shared-data protection      | Atomic operations instead of locks
Synchronization | Scalability of collective operations               | Multithreading + hierarchical collectives
Communication   | Avoid small messages that waste network bandwidth  | Message vectorization
Communication   | Hide the overhead of communication                 | Asynchronous operations
Communication   | Reduce the number of messages                      | Multithreading

(A sketch of message vectorization follows this table.)
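A minimal sketch of the message-vectorization idea named in the table, assuming per-destination send buffers that are flushed as a single message once full; the buffer size, types, and flush routine are illustrative, and the SWL runtime described later performs this coalescing on the user's behalf.

#define NTHREADS 64        /* number of UPC threads / ranks (illustrative) */
#define BATCH    1024      /* edge visits coalesced into one message       */

typedef struct { long src, tgt; } Visit;

static Visit sendbuf[NTHREADS][BATCH];
static int   fill[NTHREADS];

/* Stand-in for the actual transport (e.g. one active message carrying the
   whole batch); here it only resets the buffer. */
static void flush_to(int owner) {
    /* ... send fill[owner] Visit records to 'owner' in a single message ... */
    fill[owner] = 0;
}

/* Instead of one tiny message per discovered edge, append the (src, tgt)
   record to the owner's buffer and send only when a full batch is ready. */
static void enqueue_visit(int owner, long src, long tgt) {
    sendbuf[owner][fill[owner]++] = (Visit){ src, tgt };
    if (fill[owner] == BATCH)
        flush_to(owner);
}

int main(void) {
    enqueue_visit(3, 10, 42);  /* buffered, not yet sent            */
    flush_to(3);               /* drain the partial batch at the end */
    return 0;
}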

Page 38: Data-Centric Parallelism Abstraction for Irregular Applications

Amorphous Data Parallelism (Keshav Pingali)
– Active elements (activities)
– Neighborhood
– Ordering

Exploiting such parallelism: a work list
– Keeps track of active elements and their ordering
  Unordered-set iterator, ordered-set iterator
– Conflicts among concurrent operations: support for speculative execution, as in the Galois system

Definition: given a set of active nodes and an ordering on active nodes, amorphous data-parallelism is the parallelism that arises from simultaneously processing active nodes, subject to neighborhood and ordering constraints.

(A toy work-list sketch follows.)
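A toy, sequential sketch of the work-list pattern behind this definition (not Galois or UPC-H code; the chain graph is purely illustrative): processing one active node may activate new nodes, which go back onto the same unordered list.

#include <stdio.h>

#define MAXN 16

/* Unordered work list of active nodes: processing order does not matter. */
static int wl[MAXN], top = 0;
static int visited[MAXN];

/* Toy neighborhood: node v is connected only to v+1 (a chain). */
static int next_of(int v) { return v + 1 < MAXN ? v + 1 : -1; }

int main(void) {
    wl[top++] = 0;                 /* one initial active element */
    visited[0] = 1;
    while (top > 0) {
        int v = wl[--top];         /* take any active node */
        int w = next_of(v);        /* inspect its neighborhood */
        if (w >= 0 && !visited[w]) {
            visited[w] = 1;        /* the activity creates a new active node */
            wl[top++] = w;
        }
        printf("processed node %d\n", v);
    }
    return 0;
}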

Page 39: Design Principles of SWL

Programmability

– Global-view programming
– High-level language abstraction

Flexibility
– User control of data locality (construction/execution)
– Customize the construction and behavior of work items

Lightweight speculative execution
– Triggered by user hints, not purely automatic
– Lightweight conflict detection; locks are too costly

Page 40: SWL Extension in UPC-H

1) Specify a work list
2) User-defined work constructor
3) Two iterators over the work list: a blocking one and a non-blocking one
4) Two kinds of work-item dispatcher
5) User-assisted speculation: upc_spec_get(), upc_spec_put()

Hides optimization details from users:
message coalescing, queue management, asynchronous communication, speculative execution, etc.

Page 41: Level Synchronized BFS in SWL, Code Example

In UPC-H on clusters:

Work_t usr_add(Msg_t msg) {
  Work_t res_work;
  if (!TEST_VISITED(msg.tgt)) {
    pred[msg.tgt] = msg.src;
    SET_VISITED(msg.tgt);
    res_work = msg.tgt;
  } else
    res_work = NULL;
  return res_work;
}

while (1) {
  int any_set = 0;
  upc_worklist_foreach(Work_t rcv: list1) {
    size_t ei     = g.rowstarts[VERTEX_LOCAL(rcv)];
    size_t ei_end = g.rowstarts[VERTEX_LOCAL(rcv) + 1];
    for ( ; ei < ei_end; ++ei) {
      long w = g.column[ei];
      if (w == rcv) continue;
      Msg_t msg;
      msg.tgt = w;
      msg.src = rcv;
      upc_worklist_add(list2, &pred[w], usr_add(msg));
      any_set = 1;
    } // for each edge of rcv
  } // foreach
  bupc_all_reduce_allI(.....);
  if (final_set[MYTHREAD] == 0) break;
  upc_worklist_exchage(list1, list2);
} // while

Page 42: Asynchronous BFS in SWL, Code Example

Asynchronous implementation on shared-memory machines (Galois)

In UPC-H on clusters:

Page 43: Implementation of SWL

Execution model
– SPMD
– SPMD + multithreading

Master/slave
– State transitions: executing; idle; termination detection; exit

Work dispatching
– AM-based, distributed
– Coalescing work items and asynchronous transfer
– Mutual exclusion on the SWL and the work-item buffers

Page 44: User-Assisted Speculative Execution

User API
upc_spec_get:
– Get data ownership
– Data transfer; get the shadow copy
– Conflict checking and rollback
– Enter the critical region
upc_cmt_put:
– Release data ownership
– Commit the computation

Compiler
– Identify speculative hints: upc_spec_get/put
– Fine-grained atomic protection: full/empty bits

Runtime system
– Two modes: speculative and non-speculative
– Rollback of data and computation
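A hedged usage sketch of the get/update/commit pattern described above (the deck writes the pair as upc_spec_get/upc_spec_put on the SWL slide and upc_spec_get/upc_cmt_put here; the actual UPC-H signatures are not given, so the stubs below only copy data and stand in for the real ownership, full/empty-bit, and rollback machinery).

#include <upc.h>
#include <stdlib.h>

typedef struct { double x, y; } node_t;
shared node_t mesh[1024];

/* Stub stand-ins for the UPC-H speculation API (assumed names/behavior).
   The real runtime also acquires ownership, checks conflicts via full/empty
   bits, and may roll the work item back; these stubs only move data. */
static node_t *spec_get(shared node_t *obj) {
    node_t *shadow = malloc(sizeof *shadow);
    upc_memget(shadow, obj, sizeof *shadow);  /* take a shadow copy */
    return shadow;
}
static void spec_put(shared node_t *obj, node_t *shadow) {
    upc_memput(obj, shadow, sizeof *shadow);  /* commit if no conflict */
    free(shadow);
}

/* Usage pattern: speculative get, update the private shadow copy, commit. */
void update_node(int v) {
    node_t *local = spec_get(&mesh[v]);
    local->x += 1.0;
    spec_put(&mesh[v], local);
}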

Page 45: SPMD Execution, on a Shared-Memory Machine and a Cluster

Cluster: (Intel Xeon E5450 @ 3.00GHz * 2) * 64 nodes, Scale=20, edgefactor=16
Shared-memory machine: Intel Xeon X7550 @ 2.00GHz * 8, Scale=20, edgefactor=16

On the shared-memory machine, UPC gets very close to OpenMP.

On the cluster, UPC is better than MPI:
1) It saves one copy for each work item
2) Frequent polling raises the network throughput

Page 46: SPMD+MT, on an x86 Cluster

[Chart: SWL SYNC BFS, Scale=24, EdgeFactor=16; x-axis: pthreads per UPC thread]

Page 47: On D6000, Strong Scaling of SWL SYNC BFS

ICT Loongson-3A V0.5 FPU [email protected], *2

1) The MPI conduit has large overhead
2) The tiered network behaves better when more intra-HPP communication happens

[Chart: strong scaling of SWL_SYNC_BFS, TEPS (0 to 1.4e7), for configurations H1N4P1(4), H4N1P1(1), H4N2P1(2), H4N4P1(4), H4N4P2(4), H4N4P2(8); series: MPI_Simple(MPI), UPC_SWL(IBV), UPC_SWL(IBV+BCL), UPC_SWL(MPI). Scale=24, EdgeFactor=16.]

Page 48: Summary and Future Work on SWL

Summary
– Put forward the Shared Work List (SWL) extension to UPC to tackle amorphous data-parallelism
– Using SWL, BFS achieves better performance and scalability than MPI at certain scales and runtime configurations
– Tedious optimizations are realized with less user effort

Future work
– Realize and evaluate the speculative execution support (Delaunay triangulation refinement)
– Add a dynamic scheduler to the SWL iterators
– Evaluate more graph algorithms

Page 49: Acknowledgement

Shenglin Tang

Shixiong Xu

Xingjing Lu

Zheng Wu

Lei Liu

Chengpeng Li

Zheng Jing

Page 50: THANKS

Page 51: Workload Distribution of an upc_forall

shared [32][32],[4][4],[1][1] float A[128][128];
... ...
upc_forall(i=0; i<128; i++; continue)
  upc_forall(j=5; j<128; j++; &A[i][j])
    ... body ...

[Figure: workload distribution for THREADS=16. The iteration space (i = 0..127, j = 5..127) is distributed over the 16 UPC threads, arranged as a 4x4 grid (threads 0..15); each UPC thread's tile is further divided among 64 subgroups (the 0-th grid, 4x4 thread-tiles each) and their implicit threads, with multiple edges in the mapping tree where a thread owns several iteration ranges.]
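A small sketch of the owner computation for this example, assuming the top-level [32][32] tiles are assigned to the 16 UPC threads in row-major order, as the 4x4 thread grid in the figure suggests; the helper name is illustrative.

#include <stdio.h>

#define N   128
#define BLK 32
#define TPR (N / BLK)   /* top-level tiles (threads) per row: 4 */

/* UPC thread that owns A[i][j] under the [32][32] top-level distribution. */
static int owner_of(int i, int j) {
    return (i / BLK) * TPR + (j / BLK);   /* 0..15 */
}

int main(void) {
    printf("A[0][5]     -> thread %d\n", owner_of(0, 5));      /* 0  */
    printf("A[0][127]   -> thread %d\n", owner_of(0, 127));    /* 3  */
    printf("A[127][127] -> thread %d\n", owner_of(127, 127));  /* 15 */
    return 0;
}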

Page 52: Leverage load/store support within HPP