DEGAS Programming System via C++ Library Extension
Yili Zheng
Computational Research Division
Lawrence Berkeley National Laboratory
C++ is Increasingly Important
Co-Design Center Proxy Apps
10 Apps: 4 in C++, 3 in C, 3 in FORTRAN (source: Steve Hofmyer)
Other C++ examples: BoxLib, Combinatorial BLAS, …
DEGAS Lib supports C/C++ apps and can also be used from FORTRAN.
A “Compiler-Free” Approach
• Leverage the standardized C++ specifications and their compiler implementations
  – New features in C++11 used by the DEGAS lib: async, future, lambda functions
• C++11 is well supported
DEGAS Lib Software Stack
[Software stack diagram, bottom to top: Network Drivers and OS Libraries → GASNet Communication Library → DEGAS Lib Runtime (alongside the UPC Runtime) → DEGAS Template Header Files, compiled by a standard C++ Compiler for C/C++ Apps; UPC Apps reach the UPC Runtime through the UPC Compiler.]
Design Goals
• PGAS
  – global address space memory management
  – one-sided communication
  – statically shared variables and arrays
  – synchronization primitives
• Dynamic
  – asynchronous task execution
• Hierarchical PGAS
C++ Templates for PGAS Shared Data
• C++ templates enable generic programming
• Declaration:
    template<class T> struct matrix { T *elements; };
• Instantiation:
    matrix<double> A;
    matrix<std::complex<double>> B;
• In DEGAS we use templates to describe shared data:
    shared_var_t<int> s;        // = shared int s; in UPC
    shared_array_t<int> sa(8);  // = shared int sa[8]; in UPC
Shared Variable Example

#include <degas.h>
using namespace degas;

shared_var_t<int> s = 0;  // shared by all processes
global_lock_t l;

// SPMD
void update() {
  for (int i = 0; i < 100; i++)
    s = s + 1;  // a race condition

  barrier();

  // shared data accesses with lock protection
  for (int i = 0; i < 100; i++) {
    l.lock();
    s = s + 1;
    l.unlock();
  }
}
Translation Flow

shared_var_t<int> s;
s = 1;  // “=” is overloaded

The C++ Compiler resolves the assignment to the overloaded operator:

    operator=(s, 1);

Inside the DEGAS Lib Runtime, the operator asks: is “s” local?
• Yes → direct memory access
• No → a one-sided put through GASNet, e.g. UPCR_PUT_PSHARED_VAL(sp, a);

A sketch of this dispatch follows below.
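As an illustration of the dispatch, here is a minimal single-process sketch; shared_var_sketch and its members are hypothetical names, not the DEGAS implementation:

#include <cstdio>

// Sketch: operator= checks ownership and either stores directly or
// would issue a one-sided put. The "remote" branch only logs here,
// since a real implementation would call into GASNet.
template<class T>
class shared_var_sketch {
public:
  shared_var_sketch(int owner, int myrank)
    : owner_(owner), myrank_(myrank), value_() {}

  shared_var_sketch& operator=(const T& v) {
    if (owner_ == myrank_) {
      value_ = v;  // local: direct memory access
    } else {
      std::printf("one-sided put to rank %d\n", owner_);  // remote
    }
    return *this;
  }

private:
  int owner_, myrank_;
  T value_;  // storage when this rank owns the variable
};

int main() {
  shared_var_sketch<int> s(/*owner=*/0, /*myrank=*/0);
  s = 1;  // resolves to the local branch
  return 0;
}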
Dynamic Shared Data
• Global address space pointers:
    global_ptr_t<data_type, place_type> p;
• Dynamic global memory allocation:
    global_ptr_t allocate(place_type place, size_t count);
• Data transfer functions (memory copy):
    copy(global_ptr_t src, global_ptr_t dst, size_t count);

allocate and copy are function templates. C++ compilers can instantiate the right version of allocate and copy based on the types of the pointers (metaprogramming). A usage sketch follows below.
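As a usage sketch under the signatures above (the template parameters, the integer place, and exchange_example are assumptions, not the documented DEGAS API):

#include <degas.h>
#include <cstddef>
using namespace degas;

void exchange_example() {
  const std::size_t n = 1024;
  // Allocate n doubles in the global address space at two places
  // (assuming a place can be named by a process rank).
  global_ptr_t<double> src = allocate<double>(/*place=*/0, n);
  global_ptr_t<double> dst = allocate<double>(/*place=*/1, n);
  // One-sided memory copy; the compiler instantiates the right
  // version of copy from the pointer types.
  copy(src, dst, n);
}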
Advanced Communication Features
• Non-blocking one-sided communication (see the sketch after this list)
  – async_copy()
  – async_copy_fence()
• MPI-style collective operations built directly on GASNet
  – broadcast, gather, scatter, reduce, all-to-all, …
• Can put to and get from any memory location
  – Leverages GASNet’s dynamic memory registration feature
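A hedged sketch of the non-blocking pair named above (the buffer types and neighbor_exchange are illustrative assumptions):

#include <degas.h>
#include <cstddef>
using namespace degas;

// Overlap communication with computation: start the copies, compute,
// then fence before touching the received data.
void neighbor_exchange(global_ptr_t<double> send, global_ptr_t<double> recv,
                       std::size_t count) {
  async_copy(send, recv, count);  // returns immediately
  // ... local computation can proceed while data is in flight ...
  async_copy_fence();             // block until outstanding copies complete
}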
Asynchronous Task Execution
• C++11 async library (a complete example follows below):
    future = std::async(Function&& f, Args&&... args);
    future.wait();
    rv = future.get();
• DEGAS async library:
    future = degas::async(Function&& f, Args&&... args);
    degas::async(place)(Function&& f, Args&&... args);
    degas::async_after(place, future)(Function&& f, Args&&... args);
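For reference, a complete, standard-only C++11 example of the std::async pattern (nothing here is DEGAS-specific):

#include <cstdio>
#include <future>

int add(int a, int b) { return a + b; }

int main() {
  // Launch add() on another thread; f is a handle to the pending result.
  std::future<int> f = std::async(std::launch::async, add, 2, 3);
  // ... do other work while add() runs ...
  f.wait();           // block until the task completes
  int rv = f.get();   // retrieve the result
  std::printf("rv = %d\n", rv);
  return 0;
}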
Async Task Example

#include <degas.h>
#include <cstdio>

// A regular function can be called by remote processes
void print_num(int num) {
  printf("arg: %d\n", num);
}

// Only the master process executes the main function
int main(int argc, char **argv) {
  node_range_t group(1, node_count(), 2);  // odd-numbered processes
  async(group)(print_num, 123);            // call a function remotely
  degas::wait();  // wait for the remote tasks to complete
  return 0;
}
Async with Lambda Function

// Process 0 spawns async tasks
for (int i = 0; i < node_count(); i++) {
  // spawn a task at place "i";
  // the task is expressed by a lambda function
  async(i)([] (int num) {
      printf("num: %d\n", num);
    },
    1000 + i);  // argument to the lambda function
}
degas::wait();  // wait for all tasks to finish

mpirun -n 4 ./test_async

Output:
num: 1000
num: 1001
num: 1002
num: 1003
DEGAS Async is a Building Block
• One-sided collective communication
• Remote atomic operations
• Active Message Broadcast
Random Access Benchmark (GUPS)

// shared uint64_t Table[TableSize]; in UPC
shared_array_t<uint64_t> Table(TableSize);

// Main update loop
void RandomAccessUpdate() {
  uint64_t ran = starts(NUPDATE / NP * myid);
  for (uint64_t i = myid; i < NUPDATE; i += NP) {
    ran = (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);
    Table[ran & (TableSize - 1)] ^= ran;
  }
}
[Figure: logical vs. physical data layout of Table. Logically the array holds elements 0–15; physically it is distributed cyclically: Process 0 owns 0, 4, 8, 12; Process 1 owns 1, 5, 9, 13; Process 2 owns 2, 6, 10, 14; Process 3 owns 3, 7, 11, 15.]
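A small sketch of the cyclic mapping shown in the figure; cyclic_map and its members are illustrative names for the standard round-robin rule:

#include <cstdint>
#include <cstdio>

// Cyclic layout over np processes: element i lives on process i % np
// at local offset i / np.
struct cyclic_map {
  std::uint64_t np;
  std::uint64_t owner(std::uint64_t i) const { return i % np; }
  std::uint64_t local_offset(std::uint64_t i) const { return i / np; }
};

int main() {
  cyclic_map m{4};
  // Element 9 -> process 1, offset 2, matching the figure above.
  std::printf("element 9 -> process %llu, offset %llu\n",
              (unsigned long long)m.owner(9),
              (unsigned long long)m.local_offset(9));
  return 0;
}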
Experimental System: MIC
Photo source: http://newsroom.intel.com/community/intel_newsroom/blog/2012/11/12/
Manycore - A Good Fit for PGAS
61 cores, 30 MB aggregate L2, 1 TFlop/s, 352 GB/s memory bandwidth

[Block diagram of the Knights Corner (MIC) micro-architecture: cores, each with a private L2 cache and tag directory (TD), connected to memory controllers (MC) and PCIe client logic.]

PGAS Abstraction
[Figure: each thread (Thread 0 through Thread 3) has a private segment and a shared segment in the global address space.]
GUPS Performance on MIC
[Charts: Random Access Latency (Time in μs) and Giga-Updates Per Second (GUPS) vs. Num. of Processes (1 to 60), comparing DEGAS C++ with UPC.]
Overhead is only about 0.2 μs (~220 cycles).
GUPS Performance on BlueGene/Q
Overhead is negligible at large scale.
[Charts: Random Access Latency (Time in μsec) and Giga-Updates Per Second (GUPS) vs. Num. of Processes (1 to 8192), comparing DEGAS C++ with UPC.]
LULESH Proxy App
• Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics
• Proxy app for UHPC, ExMatEx, and LLNL ASC
• Written in C++, with MPI, OpenMP, and CUDA versions
https://codesign.llnl.gov/lulesh.php
LULESH 3-D Data Partitioning
LULESH Communication Pattern
Each partition communicates with 26 neighbors:
• 6 faces
• 12 edges
• 8 corners
(6 + 12 + 8 = 26)

[Figure: cross-section view of the 3-D processor grid]
Data Layout of Each Partition
• 3-D array A[x][y][z] of size N³
• Row-major storage; the z index varies fastest
• Blue planes (fixed x) are contiguous
• Green planes (fixed y) are stride-N² chunks
• Red planes (fixed z) are stride-N elements
(An index-arithmetic sketch follows below.)
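A short index-arithmetic sketch (illustrative, not from the LULESH source) showing where the three strides come from for a row-major N×N×N array with z varying fastest:

#include <cassert>
#include <cstddef>

// Linear index of A[x][y][z] in row-major order.
inline std::size_t idx(std::size_t x, std::size_t y, std::size_t z,
                       std::size_t N) {
  return (x * N + y) * N + z;  // stride N^2 in x, N in y, 1 in z
}

int main() {
  const std::size_t N = 8;
  assert(idx(1, 0, 0, N) - idx(0, 0, 0, N) == N * N);  // x stride
  assert(idx(0, 1, 0, N) - idx(0, 0, 0, N) == N);      // y stride
  assert(idx(0, 0, 1, N) - idx(0, 0, 0, N) == 1);      // z stride
  // Hence a fixed-x plane is one contiguous run of N^2 elements,
  // a fixed-y plane is N runs of N elements spaced N^2 apart, and
  // a fixed-z plane is N^2 single elements spaced N apart.
  return 0;
}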
Convert MPI to DEGAS Lib
MPI version:

  // Post non-blocking receives
  MPI_Irecv(RecvBuf_1);
  …
  MPI_Irecv(RecvBuf_N);

  // Post non-blocking sends
  Pack_Data_to_Buf();
  MPI_Isend(SendBuf_1);
  …
  MPI_Isend(SendBuf_N);

  MPI_Wait(…);
  …
  Unpack_Data(…);

DEGAS Lib version:

  Pack_Data_to_Buf();

  // Post non-blocking copies
  async_copy(SendBuf_1, RecvBuf_1);
  …
  async_copy(SendBuf_N, RecvBuf_N);

  async_copy_fence(…);
  …
  Unpack_Data(…);
LULESH Performance on MIC
[Chart: Relative Performance on MIC vs. Num. of Processes (1, 8, 27, 64, 125), comparing DEGAS with MPI.]
Porting LULESH to DEGAS is “Easy”
• DEGAS lib and LULESH speak the same language: C++
• LULESH is well modularized, so it is easy to find the communication components
• Non-blocking one-sided data copy functions are handy and efficient; they leverage hardware RDMA and shared memory for PGAS
• MPI-style collectives are necessary
Concluding Remarks
• DEGAS Lib: a minimally invasive technique for bringing DEGAS functionality to existing apps
• Many-core architectures with hardware shared memory (no cache coherence is OK) will be ideal for PGAS, HPGAS, and DEGAS
• High-level abstractions in the base language are crucial to productivity
See You Tonight for PyGAS
Thank You!