Li Chen
State Key Laboratory of Computer Architecture
Institute of Computing Technology, CAS

Exploiting the Potential of Modern Supercomputers Through High Level Language Abstractions
Exploit Hierarchical and Irregular Parallelism in UPC
2
Exploit Hierarchical and Irregular Parallelism in UPC-H
Motivations
Why use UPC?
Exploit the tiered network of Dawning 6000
– GASNet support for HPP architecture
Exploit hierarchical data parallelism for regular applications
Shared work list for irregular applications
3
Deep Memory Hierarchies in Modern Computing Platforms
Many-core accelerators
Traditional multicore processors
– e.g., Harpertown, Dunnington
Intra-node parallelism should be well exploited
4
HPP Interconnect of Dawning 6000
Traditional cluster vs. HPP architecture of Dawning 7000
– Discrete CPU: App CPU and OS CPU
– Hypernode: discrete OS, SSI
– Discrete interconnection: data interconnect, OS interconnect, global sync
Three-tier network
– PE: cache coherence, 2 CPUs
– HPP: 4 nodes
– IB
Global address space, through HPP controller
5
Mapping Hierarchical Parallelism to Modern Supercomputers
Hybrid programming models, MPI+X
– MPI+OpenMP/TBB (+OpenACC/OpenCL)
– MPI+StarPU
– MPI+UPC
Sequoia
– Explicitly tune the data layout and data transfer (parallel memory hierarchy)
– Recursive task tree, static mapping for tasks
HTA
– Data type for hierarchical tiled arrays (multi-level tiling)
– Parallel operators: map parallelism statically
X10
– Combines HTA with Sequoia
– Abstraction of memory hierarchies: hierarchical place tree (Habanero-Java)
– Nested task parallelism, task mapping deferred until launch time
6
Challenges in Efficient Parallel Graph Processing
Data-driven computations
– Parallelism cannot be exploited statically
– Computation partitioning is not suitable
Unstructured problems
– Unstructured and highly irregular data structures
– Data partitioning is not suitable and may lead to load imbalance
Poor locality
– Data access patterns have little locality
High data-access-to-computation ratio
– Explores the structure, not the computation
– Dominated by waits for memory fetches
[Figure: expressing parallelism is low level and tedious, and execution is dominated by memory latency and communication; borrowed from Andrew Lumsdaine, Douglas Gregor]
7
Why Unified Parallel C?
UPC, a parallel extension to ISO C99
– A dialect of the PGAS (Partitioned Global Address Space) languages
Important UPC features
– Global address space: a thread may directly read/write remote data
– Partitioned: data is designated as local or global, with affinity
– Two kinds of memory consistency (strict and relaxed)
UPC performance benefit over MPI
– Permits data sharing, better memory utilization
Looking ahead to future many-core chips and exascale systems
– Better bandwidth and latency using one-sided messages (GASNet)
– No less scalable than MPI (up to 128K threads)
Why use UPC?
– Exploits the non-uniform memory access character of modern computers
– Programmability very close to shared-memory programming
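As a minimal illustration of these features (my example in standard UPC, not code from the talk), the fragment below declares a blocked shared array, writes the elements each thread has affinity to, and reads a remote element directly through the global address space:

#include <upc_relaxed.h>
#include <stdio.h>

/* one block of 4 elements per thread, spread round-robin over the threads */
shared [4] double a[4*THREADS];

int main(void) {
    /* affinity expression &a[i]: each iteration runs on the thread that owns a[i] */
    upc_forall (int i = 0; i < 4*THREADS; i++; &a[i])
        a[i] = MYTHREAD;                        /* writes to data with local affinity */

    upc_barrier;
    if (MYTHREAD == 0)                          /* direct read of (possibly remote) shared data */
        printf("last element = %g\n", (double)a[4*THREADS-1]);
    return 0;
}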
8
The Status of the UPC Community
Portability and usability
– Many different UPC implementations and tools
Compilers: Berkeley UPC, Cray UPC, HP UPC, GCC-based Intrepid UPC and MTU UPC
Performance tools: GASP interface and Parallel Performance Wizard (PPW)
Debuggability: TotalView
– Interoperability with pthreads/MPI(/OpenMP)
UPC is developing in:
– Hierarchical parallelism, asynchronous execution
Tasking mechanisms: scalable work stealing; hierarchical tasking libraries; places, async/finish; asynchronous remote methods
Nested parallelism
Instant teams: data-centric collectives
Irregular benchmarks: UTS, MADNESS, GAP
– Interoperability
Support for hybrid programming with OpenMP and other languages
More convenient support for writing libraries
9
What is UPC-H?
Developed by the compiler group of ICT
– H: heterogeneous, hierarchical
Based on the Berkeley UPC compiler
Features added by ICT:
– Support for HW features of Dawning series computers
HPP interconnect; load/store in the physical global address space
– Hierarchical data distribution and parallelism
Godson-T (many-core processor), GPU clusters, the D6000 computer and x86 clusters
– SWL (shared work list) support for graph algorithms
– Communication optimizations
Software cache, message vectorization
– Runtime system for heterogeneous platforms
Data management
UPC-H Support for HPP Architecture
11
The UPC system lacks a BCL conduit
[Figure: GASNet software stack: the GASNet extended API over the GASNet core API, with conduits for InfiniBand, inter-process shared memory, and the new HPP BCL conduit]
GASNet: Networking for Global-Address Space Languages
BCL: low level communication layer of HPP
12
Implementation of the BCL-conduit
Initialization of the tiered network
– Construct the topology of the tiered network
– Set up a reliable datagram service through QP virtualization
– Initialize internal data structures such as send buffers
Finalization of communication
Network selection in the core API of GASNet
– PSHM, HPP, IB
Flow control of messages
Implementation of Active Messages
– Short message: NAP
– Medium message: NAP
– Long message: RDMA + NAP
– RDMA put/get: RDMA + NAP
[Figure: two-tiered topology vs. three-tiered topology]
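As a rough sketch only (the helper names and the size threshold are my assumptions, not the BCL or GASNet API), the size-based dispatch described above might look like this:

#include <stddef.h>

/* hypothetical transport helpers standing in for the BCL layer */
void nap_send(int dest, int handler, const void *payload, size_t nbytes);
void rdma_put(int dest, void *dst_addr, const void *src, size_t nbytes);

enum { MEDIUM_MAX = 4096 };   /* assumed medium-message payload limit */

/* Short and medium messages travel over NAP; long messages move the bulk
 * payload by RDMA and then send a NAP notification that runs the handler. */
void am_request(int dest, int handler, void *dst_addr,
                const void *payload, size_t nbytes) {
    if (nbytes <= MEDIUM_MAX) {
        nap_send(dest, handler, payload, nbytes);
    } else {
        rdma_put(dest, dst_addr, payload, nbytes);
        nap_send(dest, handler, NULL, 0);
    }
}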
13
BCL Conduit: latency of short messages
Latency of short messages, intra-HPP
Latency of short messages, inter-HPP
14
BCL Conduit, Bandwidth of Med. Messages (intra HPP)
15
BCL Conduit, Bandwidth of Med. Messages (inter HPP)
16
BCL Conduit, Latency of Med. Messages (intra HPP)
17
BCL Conduit , Latency of Med. Messages (inter HPP)
18
BCL Conduit, Latency of Barriers
Net latency of barrier (inter-HPP)
Net latency of barrier (intra-HPP)
19
Summary and Ongoing Work of UPC-H Targeting Dawning 6000
Summary
– The UPC-H compiler now supports the HPP architecture and benefits from the 3-tier network
Ongoing work
– Optimization of the DMA registration strategy
– Evaluate the HPP-supported barrier and collectives
– A full evaluation
Hierarchical Data Parallelism, UPC-H Support for Regular
Applications
21
UPC-H (UPC-Hierarchical/Heterogeneous) Execution Model
Standard UPC is SPMD style and has flat parallelism
UPC-H extension
– Mixes SPMD with fork-join
[Figure: a UPC program runs as SPMD UPC threads; at a upc_forall fork point each UPC thread spawns implicit thread subgroups and implicit threads, which join again at the upc_forall join point]
Two approaches to express hierarchical parallelism
– Implicit threads (or gtasks), organized in thread groups implicitly specified by the data distribution
– Explicit low-level gtasks
22
Multi-level Data Distribution
Data distribution => an implicit thread tree
shared [32][32], [4][4],[1][1] float A[128][128];
[Figure: the implicit thread tree for this declaration; the 128x128 array is split into 32x32 upc-tiles (one per UPC thread), each upc-tile into 4x4 subgroup-tiles (64 per upc-tile), and each subgroup-tile into 1x1 thread-tiles (16 per subgroup-tile), giving the tree UPC program, UPC threads, subgroups, logical implicit threads]
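As a small worked example (my illustration, not from the slides), the per-level owner of element A[i][j] under this declaration can be computed by integer division with each tiling factor:

#include <stdio.h>

/* tiling factors from: shared [32][32], [4][4], [1][1] float A[128][128]; */
enum { UPC_TILE = 32, SUBGRP_TILE = 4, THREAD_TILE = 1 };

int main(void) {
    int i = 70, j = 9;                                   /* an arbitrary element A[i][j] */
    int upc_i = i / UPC_TILE, upc_j = j / UPC_TILE;      /* owning upc-tile */
    int sub_i = (i % UPC_TILE) / SUBGRP_TILE;            /* subgroup-tile inside the upc-tile */
    int sub_j = (j % UPC_TILE) / SUBGRP_TILE;
    int thr_i = (i % SUBGRP_TILE) / THREAD_TILE;         /* thread-tile inside the subgroup-tile */
    int thr_j = (j % SUBGRP_TILE) / THREAD_TILE;
    printf("upc-tile (%d,%d), subgroup-tile (%d,%d), thread-tile (%d,%d)\n",
           upc_i, upc_j, sub_i, sub_j, thr_i, thr_j);
    return 0;
}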
23
UPC-H: Mapping Forall Loop to the Implicit Thread Tree
Leverage an existing language construct, upc_forall
– Independent loops
– Pointer-to-shared or integer-type affinity expression
[Figure: loop iterations are mapped onto the implicit thread tree and then onto the CUDA thread tree, driven by the 3-level data distribution and the machine configuration]

shared [32][32], [4][4], [1][1] float A[128][128];
… …
upc_forall (i = 0; i < 128; i++; continue)
  upc_forall (j = 6; j < 129; j++; &A[i][j-1])
    ... body ...

=> Thread topology: <THREADS, 64, 16> (each UPC thread owns a 32x32 upc-tile, which holds (32/4)x(32/4) = 64 subgroup-tiles, each holding 4x4 = 16 thread-tiles)
24
UPC-H Codes for nbody
shared [1024], [128], [1] point P[4096];
shared [1024][1024] point tempf[4096][4096];

for (int time = 0; time < 1000; time++) {
  upc_forall (int i = 0; i < N; i++; &P[i])
    for (int j = 0; j < N; j++) {
      if (j != i) {
        distance = (float)sqrt((P[i].x-P[j].x)*(P[i].x-P[j].x) +
                               (P[i].y-P[j].y)*(P[i].y-P[j].y));
        if (distance != 0) {
          magnitude = (G*m[i]*m[j])/(distance*distance + C*C);
          ……
          tempf[i][j].x = magnitude*direction.x/distance;
          tempf[i][j].y = magnitude*direction.y/distance;
        }
      }
    }
  upc_forall (int i = 0; … …)
    … …
}
25
Overview of the Compiling Support
On the Berkeley UPC compiler v2.8.0
Compiler analysis
– Multi-dimensional and multi-level data distribution
– Affinity-aware multi-level tiling
UPC tiling, subgroup tiling, thread tiling
Memory tiling for scratchpad memory
– Communication optimization
Message vectorization, loop peeling, static communication scheduling
– Data layout optimizations for GPU
Shared memory optimization
Find better data layouts for memory coalescing: array transpose and structure splitting
– Code generation: CUDA, hierarchical parallelism
26
Affinity-aware Multi-level Loop Tiling (example)

shared [32][32], [4][4], [1][1] float A[128][128];
… …
upc_forall (i = 6; i < 128; i++; continue)
  upc_forall (j = 0; j < 128; j++; &A[i-1][j])
    ... F[i][j] ...

Step 1: iteration space transformation, to make the affinity expression consistent with the data space

upc_forall (i = 5; i < 127; i++; continue)
  upc_forall (j = 0; j < 128; j++; &A[i][j])
    ... F[i+1][j] ...   // after the transformation

Step 2: three-level tiling (actually two levels here)

for (iu = 0; iu < 128; iu = iu + 32)
  for (ju = 0; ju < 128; ju = ju + 32)            // UPC thread affinity
    if (has_affinity(MYTHREAD, &A[iu][ju])) {
      // for the exposed region
      … dsm_read …  F[iu+1 : min(128,iu+32)][ju : min(127,ju+31)]
      for (ib = iu; ib < min(128, iu+32); ib = ib + 4)
        for (jb = ju; jb < min(128, ju+32); jb = jb + 4)
          for (i = ib; i < min(128, ib+4); i = i + 1)
            for (j = jb; j < min(128, jb+4); j = j + 1)
              if (i >= 5 && i < 127)              // sink guards here!
                ... F[i+1][j] ...;
    } // of UPC thread affinity

Step 3: spawn fine-grained threads
… …
27
Memory Optimizations for CUDA
What data will be put into shared memory?
– A 0-1 bin packing problem (over shared memory's capacity)
The profit: reuse degree integrated with the coalescing attribute
– Inter-thread reuse and intra-thread reuse
– Average reuse degree for a merged region
The cost: the volume of the referenced array region
Prefer inter-thread reuse
– Compute the profit and cost for each reference
What is the optimal data layout in the GPU's global memory?
– Coalescing attributes of array references; only the contiguity constraints of coalescing are considered
– Legality analysis
– Cost model and amortization analysis
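A minimal sketch of such a profit/cost selection (my illustration, not the UPC-H compiler's actual algorithm): greedily pick candidate array regions by profit-to-cost ratio until the shared-memory budget is exhausted.

#include <stdlib.h>

typedef struct {
    const char *name;    /* candidate array region for shared memory */
    double profit;       /* e.g. reuse degree weighted by the coalescing attribute */
    size_t cost;         /* bytes occupied by the referenced region */
    int chosen;
} Candidate;

static int by_ratio(const void *a, const void *b) {
    const Candidate *x = a, *y = b;
    double rx = x->profit / (double)x->cost, ry = y->profit / (double)y->cost;
    return (rx < ry) - (rx > ry);                 /* sort by descending ratio */
}

/* Greedy 0-1 selection under the shared-memory capacity (a heuristic,
 * not an exact solution of the underlying packing problem). */
void choose_for_shared_memory(Candidate *c, int n, size_t capacity) {
    size_t used = 0;
    qsort(c, n, sizeof *c, by_ratio);
    for (int i = 0; i < n; i++)
        if (used + c[i].cost <= capacity) {
            c[i].chosen = 1;
            used += c[i].cost;
        }
}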
28
Overview of the Runtime Support
Multi-dimensional data distribution support
Gtask support on multicore platforms
– Workload scheduling, synchronization, topology-aware mapping and binding
DSM system for unified memory management
– GPU heap management
– Block-based memory consistency
– Inter-UPC-thread message generation and data shuffling
Data shuffling to generate data tiles with halos
Data transformations for GPUs
– Dynamic data layout transformations for global memory coalescing, demand driven
– Demand-driven data transfer between CPU and GPU
29
Unified Memory Management
Demand-driven data transfer
– Only on the local data space; no software caching of remote data
– Consistency is maintained at the boundary between CPU code and GPU code
Demand-driven data layout transformation
– Redundant data transformations are removed
– An extra field records the current layout of each data-tile copy
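A minimal sketch of such a data-tile descriptor and the demand-driven check at the CPU/GPU boundary (the field and helper names are my assumptions, not the actual UPC-H runtime):

#include <stddef.h>

typedef enum { LAYOUT_ROW_MAJOR, LAYOUT_TRANSPOSED } layout_t;
enum { VALID_HOST = 1, VALID_DEVICE = 2 };

typedef struct {
    void *host_ptr;       /* copy in CPU memory */
    void *dev_ptr;        /* copy in GPU global memory */
    size_t bytes;
    layout_t layout;      /* extra field: current layout of this tile copy */
    int valid;            /* bitmask: where the up-to-date copy lives */
} tile_copy_t;

/* hypothetical helpers */
void copy_host_to_device(void *dst, const void *src, size_t bytes);
void transform_layout_on_device(void *buf, size_t bytes, layout_t from, layout_t to);

/* Called at the boundary before GPU code touches a tile:
 * transfer and re-layout only when actually needed. */
void make_device_ready(tile_copy_t *t, layout_t wanted) {
    if (!(t->valid & VALID_DEVICE)) {
        copy_host_to_device(t->dev_ptr, t->host_ptr, t->bytes);
        t->valid |= VALID_DEVICE;
    }
    if (t->layout != wanted) {                    /* skipped when redundant */
        transform_layout_on_device(t->dev_ptr, t->bytes, t->layout, wanted);
        t->layout = wanted;
    }
}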
30
Benchmarks for the GPU Cluster

Application | Description | Original language | Application field | Source
Nbody | n-body simulation | CUDA+MPI | Scientific computing | CUDA campus programming contest 2009
LBM | Lattice Boltzmann method in computational fluid dynamics | C | Scientific computing | SPEC CPU 2006
CP | Coulombic potential | CUDA | Scientific computing | UIUC Parboil benchmark
MRI-FHD | Magnetic resonance imaging, FHD | CUDA | Medical image analysis | UIUC Parboil benchmark
MRI-Q | Magnetic resonance imaging, Q | CUDA | Medical image analysis | UIUC Parboil benchmark
TPACF | Two-point angular correlation function | CUDA | Scientific computing | UIUC Parboil benchmark
31
UPC-H Performance on GPU Cluster
A 4-node CUDA cluster with 1000M Ethernet; each node has
– CPUs: 2 dual-core AMD Opteron 880
– GPU: NVIDIA GeForce 9800 GX2
Compilers: nvcc (2.2) -O3, GCC (3.4.6) -O3
[Chart: one-node speedup over serial execution for nbody and lbm; bars for base DSM, memory coalescing, SM reuse, and manual CUDA]
[Chart: four-node speedup over serial execution (log2) for nbody, mri-fhd, mri-q, tpacf and cp; bars for base DSM, memory coalescing, SM reuse, and manual CUDA/MPI]
Performance: 72% of the manual CUDA versions, on average
32
UPC-H Performance on Godson-T
The average speedup of the SPM optimization is 2.30; with double buffering it is 2.55
[Chart: speedups of UPC-H benchmarks on Godson-T]
33
UPC-H Performance on Multi-core Cluster
Hardware and software
– Xeon X7550 CPUs, 8 per node = 64 cores/node, 40Gb InfiniBand, ibv conduit, mvapich2-1.4
Benchmarks
– NPB: CG, FT
– nbody, MM, Cannon MM
Results
– NPB performance: UPC-H reaches 90% of UPC+OpenMP
– Cannon MM can leverage optimal data sharing and communication coalescing, and expresses complicated hierarchical data parallelism that is hard to express in UPC+OpenMP
[Chart: performance ratio UPC-H / UPC+OMP versus thread team size (1, 2, 4, 8) for CG class B/C, FT class B/C and nbody-16384]
[Chart: UPC-H / UPC speedups for Cannon MM versus total threads (4 to 64) and matrix sizes 1024, 2048, 4096, 8192]
SWL, UPC-H Support for Graph Algorithms
35
Introduction
Graph
– A flexible abstraction for describing relationships between discrete objects
– The basis of exploration-based applications (genomics, astrophysics, social network analysis, machine learning)
Graph search algorithms
– An important technique for analyzing the vertices or edges of a graph
– Breadth-first search (BFS) is widely used and is the basis of many others (CC, SSSP, best-first search, A*)
– Kernel of the Graph500 benchmarks
36
Challenges in Efficient Parallel Graph Processing
Data-driven computations
– Parallelism cannot be exploited statically
– Computation partitioning is not suitable
Unstructured problems
– Unstructured and highly irregular data structures
– Data partitioning is not suitable and may lead to load imbalance
Poor locality
– Data access patterns have little locality
High data-access-to-computation ratio
– Explores the structure, not the computation
– Dominated by waits for memory fetches
[Figure: the same challenges (expressing parallelism, memory-latency- and communication-dominated execution, low-level and tedious programming) addressed by a user-directed, automatically optimized, global-view, high-level abstraction; borrowed from Andrew Lumsdaine, Douglas Gregor]
37
Tedious Optimizations of BFS (graph algorithm)

Optimizing BFS on clusters:

Perf. problem | Goal | Techniques
Memory access | Leverage non-blocking caches | Multithreading
Synchronization | Reduce the overhead of shared-data protection | Use atomic operations, not locks (sketch below)
Synchronization | The scalability problem of collective operations | Multithreading + hierarchical collectives
Communication | Avoid small messages that waste network bandwidth | Message vectorization
Communication | Hide the overhead of communication | Asynchronous operations
Communication | Reduce the number of messages | Multithreading
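As an illustration of the "atomic operations, not locks" row (my sketch in plain C11, not code from the talk), the visited-bit test-and-set in a parallel BFS can use a compare-and-swap instead of a lock:

#include <stdatomic.h>
#include <stdint.h>

/* one visited bit per vertex; each 64-bit word is updated atomically */
static _Atomic uint64_t visited[(1 << 20) / 64];

/* Returns 1 if this call marked the vertex (it was not visited before). */
static int try_mark_visited(long v) {
    _Atomic uint64_t *word = &visited[v >> 6];
    uint64_t bit = 1ULL << (v & 63);
    uint64_t old = atomic_load_explicit(word, memory_order_relaxed);
    while (!(old & bit)) {
        if (atomic_compare_exchange_weak_explicit(word, &old, old | bit,
                memory_order_acq_rel, memory_order_relaxed))
            return 1;        /* we set the bit first: vertex newly visited */
    }
    return 0;                /* another thread had already visited it */
}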
38
Data-Centric Parallelism Abstraction for Irregular Applications

Amorphous data parallelism (Keshav Pingali)
– Active elements (activities)
– Neighborhood
– Ordering
Def: given a set of active nodes and an ordering on active nodes, amorphous data parallelism is the parallelism that arises from simultaneously processing active nodes, subject to neighborhood and ordering constraints
Exploiting such parallelism: a work list
– Keeps track of active elements and their ordering
– Unordered-set iterator and ordered-set iterator (in the Galois system)
– Conflicts among concurrent operations: support for speculative execution
39
Design Principles of SWL
Programmability
– Global-view programming
– High-level language abstraction
Flexibility
– User control of data locality (construction/execution)
– Customizable construction and behavior of work items
Lightweight speculative execution
– Triggered by user hints, not purely automatic
– Lightweight conflict detection; locks are too costly
40
SWL Extension in UPC-H
1) Specify a work list
2) User-defined work constructor
3) Two iterators over the work list: a blocking one and a non-blocking one
4) Two kinds of work-item dispatcher
5) User-assisted speculation: upc_spec_get() / upc_spec_put()
Optimization details hidden from users:
message coalescing, queue management, asynchronous communication, speculative execution, etc.
41
Level Synchronized BFS in SWL, Code Example
In UPCH on clusters (the slide shows the Galois shared-memory version alongside for comparison):

Work_t usr_add(Msg_t msg) {
  Work_t res_work;
  if (!TEST_VISITED(msg.tgt)) {
    pred[msg.tgt] = msg.src;
    SET_VISITED(msg.tgt);
    res_work = msg.tgt;
  } else
    res_work = NULL;
  return res_work;
}

while (1) {
  int any_set = 0;
  upc_worklist_foreach (Work_t rcv : list1) {
    size_t ei = g.rowstarts[VERTEX_LOCAL(rcv)];
    size_t ei_end = g.rowstarts[VERTEX_LOCAL(rcv) + 1];
    for ( ; ei < ei_end; ++ei) {
      long w = g.column[ei];
      if (w == rcv) continue;
      Msg_t msg;
      msg.tgt = w;
      msg.src = rcv;
      upc_worklist_add(list2, &pred[w], usr_add(msg));
      any_set = 1;
    } // for each row
  } // foreach
  bupc_all_reduce_allI(.....);
  if (final_set[MYTHREAD] == 0) break;
  upc_worklist_exchage(list1, list2);
} // while
42
Asynchronous BFS in SWL, Code Example
Asynchronous implementation on SM machines (Galois)
In UPCH on clusters:
43
Implementation of SWL

Execution model
– SPMD
– SPMD + multithreading
Master/slave
– State transitions: executing; idle; termination detection; exit
Work dispatching
– AM-based, distributed
– Coalescing of work items and asynchronous transfer
– Mutual exclusion on the SWL and the work-item buffers
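A minimal sketch of coalescing work items into per-destination buffers before an asynchronous, AM-based transfer (my illustration with hypothetical names, not the UPC-H runtime):

#define THREADS_MAX 1024              /* assumed upper bound on the number of UPC threads */
#define COALESCE_CAP 256              /* work items buffered per destination thread */

typedef struct { long tgt, src; } Msg_t;

typedef struct {
    Msg_t items[COALESCE_CAP];
    int count;
} send_buf_t;

static send_buf_t out[THREADS_MAX];   /* one outgoing buffer per destination thread */

/* hypothetical transport: ship a batch of work items to 'dest' asynchronously via an AM */
void am_send_work_async(int dest, const Msg_t *items, int n);

/* Add a work item destined for thread 'dest'; the batch is flushed only when the
 * buffer fills, turning many small messages into one larger asynchronous message. */
void worklist_add_remote(int dest, Msg_t m) {
    send_buf_t *b = &out[dest];
    b->items[b->count++] = m;
    if (b->count == COALESCE_CAP) {
        am_send_work_async(dest, b->items, b->count);
        b->count = 0;
    }
}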
44
User-Assisted Speculative Execution
User API
upc_spec_get
– Acquire data ownership
– Data transfer; get the shadow copy
– Conflict checking and rollback
– Enter the critical region
upc_cmt_put
– Release data ownership
– Commit the computation
Compiler: identify speculative hints
– upc_spec_get/put
Fine-grained atomic protection
– Full/empty bits
Runtime system: two modes, speculative and non-speculative; rollback of data and computation
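A minimal usage sketch of wrapping an update to a shared vertex record in this API (only the names upc_spec_get/upc_cmt_put come from the slides; the signatures and return value are my assumptions):

#include <stddef.h>

/* assumed prototypes */
int upc_spec_get(shared void *obj, void *shadow, size_t nbytes);        /* acquire ownership, fill shadow copy */
void upc_cmt_put(shared void *obj, const void *shadow, size_t nbytes);  /* commit and release ownership */

typedef struct { long pred; int level; } vinfo_t;
shared vinfo_t vtx[1024*THREADS];     /* 1024 vertices per UPC thread */

void speculative_update(long v, long new_pred, int new_level) {
    vinfo_t local;
    /* enter the critical region; the runtime detects conflicts and rolls back */
    if (upc_spec_get(&vtx[v], &local, sizeof local)) {
        if (new_level < local.level) {              /* the actual update, on the shadow copy */
            local.pred = new_pred;
            local.level = new_level;
        }
        upc_cmt_put(&vtx[v], &local, sizeof local);
    }
}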
45
SPMD Execution, on Shared Memory Machine and Cluster

Shared-memory machine: Intel Xeon X7550 @ 2.00GHz * 8; scale=20, edgefactor=16
– On the shared-memory machine, UPC gets very close to OpenMP
Cluster: (Intel Xeon E5450 @ 3.00GHz * 2) * 64 nodes; scale=20, edgefactor=16
– On the cluster, UPC is better than MPI:
1) It saves one copy for each work item
2) Frequent polling raises the network throughput
46
SPMD+MT, on X86 Cluster
[Chart: SWL SYNC BFS performance versus the number of pthreads per UPC thread; scale=24, edgefactor=16]
47
On D6000, Strong Scaling of SWL SYNC BFS
ICT Loongson-3A V0.5 FPU [email protected], *2
1) The MPI conduit has large overhead
2) The tiered network behaves better when more intra-HPP communication happens
[Chart: strong scaling of SWL_SYNC_BFS in TEPS over the configurations H1N4P1(4), H4N1P1(1), H4N2P1(2), H4N4P1(4), H4N4P2(4), H4N4P2(8), comparing MPI_Simple(MPI), UPC_SWL(IBV), UPC_SWL(IBV+BCL) and UPC_SWL(MPI); scale=24, edgefactor=16]
48
Summary and Future Work on SWL
Summary
– Put forward the Shared Work List (SWL) extension to UPC to tackle amorphous data parallelism
– Using SWL, BFS achieves better performance and scalability than MPI at certain scales and runtime configurations
– Realizes tedious optimizations with less user effort
Future work
– Realize and evaluate the speculative execution support (Delaunay triangulation refinement)
– Add a dynamic scheduler to the SWL iterators
– Evaluate more graph algorithms
49
Acknowledgement
Shenglin Tang
Shixiong Xu
Xingjing Lu
Zheng Wu
Lei Liu
Chengpeng Li
Zheng Jing
50
THANKS
51
Workload distribution of an upc_forall

shared [32][32], [4][4], [1][1] float A[128][128];
… …
upc_forall (i = 0; i < 128; i++; continue)
  upc_forall (j = 5; j < 128; j++; &A[i][j])
    ... body ...
[Figure: workload distribution of the iteration space (i: 0-127, j: 5-127) over the UPC program, its 16 UPC threads (THREADS=16, a 4x4 grid of threads 0-15), the 64 subgroups of each upc-tile, and the implicit threads, drawn as a tree whose edges carry iteration ranges such as 0:63, 0:15, 0:3]
52
Leverage load/store support within HPP