Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and more

Rob Farber Chief Scientist, BlackDog Endeavors, LLC

Contact info at blackdogendeavors.com for consulting, teaching, writing, and other inquiries

http://www.scientificcomputing.com/

https://www.google.com/url?q=http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/01/28/the-george-washington-university-uses-gpus-to-study-flying-snakes.aspx&sa=U&ei=3DgmU7aeLsrkoASloIDoAw&ved=0CC8Q9QEwAQ&sig2=yzwuSJXqJUQBD1RwSIm3eQ&usg=AFQjCNGkhYPeiFXvKvSH6OHhOccW-q6-2Q

http://www.pgroup.com/index.htm

mailto:[email protected]

https://www.google.com/url?q=http://en.wikipedia.org/wiki/Linux_Journal&sa=U&ei=kzgmU6WJNpbYoAT1t4KgAw&ved=0CC0Q9QEwAA&sig2=YEb7zD16WsmnX4PkEdOGVg&usg=AFQjCNH7La4Nm8Cf1AbeyuW6K-0zjV-l9w

DATA

Previous talks: from “Hello World” to exascale double objFunc( ... ) { double err=0.; #pragma acc parallel loop reduction(+:err) #pragma omp parallel for reduction(+ : err) { err = 0.; for(int i=0; i < nExamples; i++) { // transform float d=myFunc(i, param, example, nExamples, NULL); //reduce err += d*d; } } return sqrt(err); }

int main() { cout << "Hello World" << endl; // load data and initialize parameters init(); #pragma acc data \ copyin(param[0:N_PARAM-1]) \ pcopyin(example[0:nExamples*EXAMPLE_SIZE-1]) { optimize( objFunc ); // the optimizer calls the objective function } return 0; }

13 PF/s average sustained flops using 16,384 GPUs (On both linear and nonlinear problems using MPI and thousands of GPUs)

3

𝐸𝑓𝑓𝑒𝑐𝑡𝑖𝑣𝑒𝑅𝑎𝑡𝑒 =

𝑇𝑜𝑡𝑎𝑙𝑂𝑝𝐶𝑜𝑢𝑛𝑡

𝑇𝑏𝑟𝑜𝑎𝑑𝑐𝑎𝑠𝑡 +𝑇𝑜𝑏𝑗𝑒𝑐𝑡𝑓𝑢𝑛𝑐 +𝑇𝑟𝑒𝑑𝑢𝑐𝑒

Note: Always report “Honest Flops”

0

2000

4000

6000

8000

10000

12000

14000

0 2000 4000 6000 8000 10000 12000 14000 16000 18000

TF/s

(s

ingl

e-p

reci

sio

n c

alcu

lati

on

wit

h a

do

ub

le-p

reci

sio

n r

ed

uct

ion

)

Number of K20X GPUs

PCA

NLPCA

Acknowledgements: Oak Ridge Nat. Lab and Fernanda Foertter (PI) This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Minimal performance variation!

20 PF/s expected using all GPUs +

CPU cores

13 PF/s using 16,384 GPUs

Code: Numerical and Computational Optimization on the Intel Phi

http://www.drdobbs.com/parallel/numerical-and-computational-optimization/240151128

http://www.drdobbs.com/parallel/numerical-and-computational-optimization/240151128

• Threads can only communicate within a thread block – (yes, there are atomic ops)

• Fast hardware scheduling – Blks run when dependencies resolved

Scalability required to use all those cores (strong scaling execution model)

OpenACC portability exploits strong scaling execution /* matrix-acc.c */ int main() { … // Compute matrix multiplication. #pragma acc kernels copyin(a,b) copy(c) for (i = 0; i < SIZE; ++i) { for (j = 0; j < SIZE; ++j) { for (k = 0; k < SIZE; ++k) { c[i][j] += a[i][k] * b[k][j]; } } } return 0; }

! matrix-acc.f program example1 … !$acc data copyin(a,b) copy(c) !$acc kernels loop ! Compute matrix multiplication. do i=1, n_size do j=1, n_size do k = 1, n_size c(i,j) = c(i,j) + a(i,k) * b(k,j) enddo enddo enddo !$acc end data end program example1

int main()

{

cout << "Hello World" << endl;

// load data and initialize parameters

init();

#pragma acc data \

copyin(param[0:N_PARAM-1]) \

pcopyin(example[0:nExamples*EXAMPLE_SIZE-1])

{

optimize( objFunc ); // the optimizer calls the objective function

}

return 0;

}

C Fortran

C++

Multi-core, coprocessor and GPU versions exist


https://www.google.com/url?q=http://www.caps-entreprise.com/&sa=U&ei=qz0mU828BdjroATW1IG4BA&ved=0CC8Q9QEwAQ&sig2=fUTwvZwpZoTakVE28MOmeg&usg=AFQjCNFL7Ndyb9gNJ2AOoMwTb71uzsMhhw

https://www.google.com/url?q=http://iconlibrary.iconshock.com/tag/illustrator/&sa=U&ei=1TomU-qwE8zwoATnrIHQAw&ved=0CDEQ9QEwAg&sig2=mY78X-VbxiC7yoMKTN_vLg&usg=AFQjCNE36Hqcc76PtThzUYmvgwozBTZ43g

http://emerginggrowth.com/wp-content/uploads/2012/09/CRAY-logo.png

How to work with complex data structures like graphs?

• Build a parallel stack for complex data structures – Must be performance robust (e.g. low-wait)

– Allow transparent host <-> device <-> host access

– For performance, must manage memory!

• A stack is required for dynamic parallelism – Dynamic parallelism == variable output!

– Following is a fast intro!

Outline to create a parallel stack 1. A fast, robust ParallelCounter class

– Counts items in stack … cannot limit parallelism • Even in pathological cases where every thread increments at the

same time

2. Transparent data movement between host and devices

3. Combine 1&2 to act as a stack that provides: – Fast object allocation – Transparent access to data on both host and device – Can handle arbitrary numbers of inserts efficiently

Atomics are great (but don’t use them)

thread

0

atomicAdd thread atomicAdd thread atomicAdd thread atomicAdd

1 2 3 4

Much faster when there is no contention

thread atomicAdd thread atomicAdd thread atomicAdd thread atomicAdd

0 0

0 0 1 1

1 1

Minimize the wait by partitioning (threadIdx.x % WARP_SIZE)


0 0

0 0 1 1

1 2 4

Sum the memory locations for counter value

Pretty simple in C++ • CUDA, Supercomputing for the Masses:

Part 25 (Atomic Operations and Low-Wait Algorithms in CUDA)

– Note the count array in red

• GPU: N_ATOMIC=32 means zero wait

inline __device__ uint32_t operator+=(uint32_t x) {

return atomicAdd(count + (threadIdx.x % N_ATOMIC), x);

}

• Works great on multicore! – OpenMP, Intel Xeon Phi

inline __device__ uint32_t operator+=(uint32_t x) {

return atomicAdd(count + (random() % N_ATOMIC), x);

}

struct ParallelCounter { public: uint32_t count[N_ATOMIC]; inline __device__ uint32_t operator-=(uint32_t x) { return atomicSub(count + (threadIdx.x % N_ATOMIC), x); } inline __device__ uint32_t operator+=(uint32_t x) { return atomicAdd(count + (threadIdx.x % N_ATOMIC), x); } // spread the counts across the counter __device__ __host__ void set(uint32_t x) { for(int i=0; i < N_ATOMIC; i++) count[i]=x/N_ATOMIC; for(int i=0; i < x % N_ATOMIC; i++) count[i] += 1; } inline __device__ __host__ uint32_t getCount() { // simplest slow method for right now. uint32_t sum=0; for(int i=0; i < N_ATOMIC; i++) { sum += count[i]; } return sum; } };

http://www.drdobbs.com/parallel/atomic-operations-and-low-wait-algorithm/240160177




ParallelCounter performance is great (lower is better)

0.00E+00

1.00E+00

2.00E+00

3.00E+00

4.00E+00

5.00E+00

6.00E+00

7.00E+00

8.00E+00

0 500000000 1000000000 1500000000 2000000000 2500000000 3000000000 3500000000 4000000000

Ru

nti

me

(se

con

ds)

nSamples

C2050 singleCounter

K20c singleCounter

C2050 ParallelCounter

K20c ParallelCounter

Fermi (single)

Fermi (low-wait)

Kepler (low-wait)

Kepler (single)


– Counts items in stack … cannot limit parallelism • even in pathological cases where every thread increments at the

same time



• Needed for cudaMemcpy() and Intel Xeon Phi data transfers

• Not mainstream! – Yet

• Happily addressed by the C++ standards committee for binary read/write – File I/O – Socket I/O

DEVICE { From, Attribute, to

}

Host/Device(s) object layout compatibility

HOST { Attribute, From, to

}

DEVICE { Attribute, From, to

}

HOST { Attribute, From, to

}

Ensure host/device object compatibility! • If your compiler supports the C++11 standard, use the standard!

• Unfortunately NVCC – like most compilers - is not C++11 compliant!

– Means the programmer must use compiler front end calls • Not ideal but it works

– This is not extremely as the calls are used only for a safety check and not functional reasons

• GNU – __is_pod() – __is_standard_layout() – __has_trivial_copy()

• Microsoft users – is_pod() – is_standard_layout() – has_trivial_copy()

Do the data movement

• Pre-6.0 – CUDA, Supercomputing for the Masses: Part 28

A Massively Parallel Stack for Data Allocation

– CUDA, Supercomputing for the Masses: Part 27 A Robust Histogram for Massive Parallelism

– CUDA, Supercomputing for the Masses: Part 26 CUDA: Unifying Host/Device Interactions with a Single C++ Macro

• CUDA 6.0: TBD

http://www.drdobbs.com/parallel/a-massively-parallel-stack-for-data-allo/240162018

http://www.drdobbs.com/parallel/a-robust-histogram-for-massive-paralleli/240161600

http://www.drdobbs.com/parallel/a-robust-histogram-for-massive-paralleli/240161600

http://www.drdobbs.com/parallel/cuda-unifying-hostdevice-interactions-wi/240161436


– Counts items in stack … cannot limit parallelism • even in pathological cases where every thread increments at the

same time



Allocation times on a GPU

• The CUDA allocator (cudaMalloc) is very heavyweight because it must manage all device memory – It’s only going to get slower

• The solution – Allocate many objects of a single type at one time

Results of testMalloc.cu

• Comparing individual malloc vs one big alloc

– 7.86 vs. 0.00001766 seconds

$ nvprof ./testMalloc 1000000 128 ======== NVPROF is profiling testMalloc... ======== Command: testMalloc 1000000 128 Using: nElements 1000000 128 ======== Profiling result: Time(%) Time Calls Avg Min Max Name 100.00 7.86s 1 7.86s 7.86s 7.86s performIndividualAlloc(int, int) 0.00 17.66us 1 17.66us 17.66us 17.66us performBlockAlloc(int, int)

Modify ParallelCounter

• Need unique counts that are never duplicated

– ParallelCounter can potentially increase by warpsize in one instruction!


0 0

0 0 1 1

1 1

WARP

BoundedParallelCounter

count[0] 0

maxColCount -1

count[1] maxColCount

2*maxColCount -1

count[2] 2*maxColCount

3*maxColCount -1

1. Index off individual counters in the count array 2. Calculate the index so there is no overlap between counters: [0,N) [N,2N)

[2N,3N) {where N = maxColCount}

3. Each index represents a unique stack structure/class location • Find unused data structures quickly and in parallel even when all counters increment at the same time!

Use transparent object movement • Transparent object movement allows

– Host and device allocation

– Host and device traversal

• Create complex structures on the host

– Compute with them on the device

– Read variable results on the host

Host CUDA, Supercomputing for the Masses: Part 28 A Massively Parallel Stack for Data Allocation

http://www.drdobbs.com/parallel/a-massively-parallel-stack-for-data-allo/240162018

A GPU is actually many separate computers!

(Counter to popular thought – GPUs are great for small problems)

http://regmedia.co.uk/2012/05/17/nvidia_kepler2_smx_block_diagram.jpg















ParallelCounter is also an accumulator!

template <class ACC_TYPE, uint32_t N_ATOMIC=32> struct ParallelAccumulator { private: ACC_TYPE accum[N_ATOMIC]; public: __device__ ParallelAccumulator<ACC_TYPE, N_ATOMIC>() { } __device__ ~ParallelAccumulator<ACC_TYPE, N_ATOMIC>() { } inline __device__ ACC_TYPE operator-=(ACC_TYPE x) { return atomicSub(accum + (threadIdx.x % N_ATOMIC), x); } inline __device__ ACC_TYPE operator+=(ACC_TYPE x) { return atomicAdd(accum + (threadIdx.x % N_ATOMIC), x); } // spread the values across the accumulator __device__ void set(ACC_TYPE x) { for(int i=0; i < N_ATOMIC; i++) accum[i]=x; } inline __device__ ACC_TYPE getValue() { // simplest slow method for right now. ACC_TYPE sum=0; for(int i=0; i < N_ATOMIC; i++) { sum += accum[i]; } return sum; } };

Float/Double/struct/class Data Acc += val sum







Actual code (C++ template arg)

Parallel Concurrent Sorting // Based on http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm __global__ void bitonic_sort_step(float *data, int j, int k, int n) { //unsigned int i, ixj; /* Sorting partners: i and ixj */ for(unsigned int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x) { unsigned int ixj = i^j; /* The threads with the lowest ids sort the array. */ if ((ixj)>i) { if ((i&k)==0) { /* Sort ascending */ if (data[i]>data[ixj]) { /* exchange(i,ixj); */ float temp = data[i]; data[i] = data[ixj]; data[ixj] = temp; } } else { /* Sort descending */ if (data[i]<data[ixj]) { /* exchange(i,ixj); */ float temp = data[i]; data[i] = data[ixj]; data[ixj] = temp; } } } } }

Float/Double/struct/class Data Sort

Actual code

Sort

Float/Double/struct/class Data Sort Sort







__global__ void k_bitonic_sort(float *data, int n, int powOfTwo) { int nThreadsPerBlock = 512; int nBlocks = (powOfTwo-n)/nThreadsPerBlock; nBlocks /= 64; nBlocks = (nBlocks==0)?1:nBlocks; if(n < powOfTwo) bitonic_init<<<nBlocks, nThreadsPerBlock>>>(data, n, powOfTwo); nThreadsPerBlock = (powOfTwo > nThreadsPerBlock)?nThreadsPerBlock:powOfTwo; nBlocks = powOfTwo/nThreadsPerBlock; nBlocks /= 64; nBlocks = (nBlocks==0)?1:nBlocks; int j, k; /* Major step */ for (k = 2; k <= powOfTwo; k *= 2) { /* Minor step */ for (j=k>>1; j>0; j= j>>1) { bitonic_sort_step<<<nBlocks, nThreadsPerBlock>>>(data, j, k,powOfTwo); } } }

Concurrent Parallel Sorting (10k floats, 15 streams, k40c)

Great job NVIDIA!

• Nice scheduling trapezoid!

simpleBitonic sort

simpleQsort

• simpleBitonic: 4.5 ms

• simpleQsort: 246.9 ms

Recursion is expensive

(not surprising)

• Recursion is expensive

Lots of heterogeneity, variable sampling, varying certainty of measure: Umm…what can we trust?

Cold Spring Harbor Laboratory MS analysis pipeline POC: Dr. John Wilson (jwilson at cshl.edu)

Public URL forthcoming

• Protein differences between states cause disease • Proteins are the structures and machines of life • Biology varies: heterogeneous observations with large scatter

• Mass spectrometry (MS) reads out protein differences • MS data are not Gaussian and change (calibration, temp., sample, etc.) • Non-linear measurement reliability as a function of intensity • Unpredictable number of observations • => Control everything internally and translate all observations to p-values.

• Work in those.

High variability at low S/N or at saturation

Low variability within linear range of measure

• People can and do look at a single overall measure (e.g. p=0.08) o Then go and do 3 years and $500k

of research o Only to come back and say “Hmm,

can I look at that spectrum again?” • Often they then find out they are

looking at highly variable, low-intensity observations. (OUCH!)




• Bootstrapping: non-parametric resampling technique that works for all distributions and small n; allows determination of certainty of measure • Steps: sample observations, combine p values, repeat >=10,000 times, sort, find CIs

• Example: 18 heterogeneous measurements • p from 0.007 (down regulated) to 0.14 (up reg.) • Most say down reg.; overall p = 0.08 (maybe down reg.) • HOWEVER, 95% CI = 0.001 - 0.558! • In ~30-40% of observation sets,

upper 95% CIs exceed choice of alpha

p = 0.08

p-val

0.028534

0.262530

0.286691

0.609363

0.783723

0.378550

0.252444

0.067699

0.859116

0.292653

0.834993

0.471238

GPUs for interactive answers Timeline running on k40c + k20c

timeline

Cold Spring Harbor Laboratory MS analysis pipeline POC: Dr. John Wilson (jwilson at cshl.edu)

Public URL forthcoming




SurveyFit™ Find the distribution that best fits your data

info at statperfect.com

Multicore + K40c + k20c + GT640

• Parametric modeling can be extremely helpful • Forecasting, interpolation, understanding, speed, simplification

• Assumptions and/or the wrong model can have drastically terrible effects • “Recipe for Disaster: The Formula That Killed Wall Street” (Wired 2/09) • Oops, financial markets are not Gaussian distributed

• Many hundreds of probability distribution functions exist • Many techniques exist to fit data • Many goodness of fit tests

• Measure how well the model describes the data • Techniques to determine the relative quality of a model also exist

SurveyFit™ solves every distribution (in library)

With every technique (in library)

Ranks everything against everything Enabled by massive parallel processing




Kriging Interpolation and Graph Algorithms

Kriging Interpolation • Heavily used by the Air

Force and GIS services • Widely used in the domain

of spatial analysis and computer experiments

Graph Algorithms (sadly no time! ) • Heavily used in social media

analysis • See mpgraph & Graphlab http://www.systap.com/mpgraph/api/html/index.htm

http://www.systap.com/mpgraph/api/html/index.htm

http://www.systap.com/mpgraph/api/html/index.htm

Parallel Kriging (single K20c delivers a 55x speedup over a quad core Xeon)

Three kriging steps 1. Estimation of the semivariogram 2. Semivariogram model fitting. 3. Expression of the N_TILE x N_TILE solution of the

kriging weights and creation of the raster points.

Strong scalability of Magma sgetrf() by GPUs

http://icl.cs.utk.edu/news_pub/submissions/plagma_lu.pdf

for(int i=0; i < TILE_SIZE; i++) for(int j=0; j < TILE_SIZE; j++) kriging_results(i,j,..., raster)

4-core Xeon 825 ms

Nvidia K20c 15 ms

Single GPU Runtime on a representative dataset

Strong scaling by number of GPUs!

kriging_results(0,0,..., raster)

kriging_results(0,1,..., raster)

kriging_results(i/2-1,j,..., raster)

Parallel Implementation (one kriging result() call per thread)

kriging_results(i/2,0,..., raster)

kriging_results(i/2,1,..., raster)

kriging_results(i/2,j,..., raster)

2x GPU = 2x faster 4x GPU = 4x faster 16x GPU = …

http://icl.cs.utk.edu/news_pub/submissions/plagma_lu.pdf

Thank you! (Although there is so much more to say!)








• Looks sort of Gaussian

Probability Density Function

Histogram

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Data (watch the red line) SurveyFit™ animation:

• And a Gaussian sort of fits


Histogram Normal

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Gaussian (sort of fits) SurveyFit™ animation:

• General logistic looks better…


Histogram Gen. Logistic

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

General Logistic SurveyFit™ animation:

• Hypersecant is yet better


Histogram Hypersecant

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Hypersecant SurveyFit™ animation:

• Johnson SU is by far the best fit


Histogram Johnson SU

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Johnson SU SurveyFit™ animation:

• Gaussian compared to Johnson SU • The actual SurveyFit™product examines hundreds of distributions

and many methods


Histogram Johnson SU Normal

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Results (Johnson SU vs Gaussian) SurveyFit™ animation:

Thank you! (Although there is so much more to say!)








Documents

Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and