39
S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and more Rob Farber Chief Scientist, BlackDog Endeavors, LLC Contact info at blackdogendeavors.com for consulting, teaching, writing, and other inquiries

Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 2: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

DATA

Previous talks: from “Hello World” to exascale double objFunc( ... ) { double err=0.; #pragma acc parallel loop reduction(+:err) #pragma omp parallel for reduction(+ : err) { err = 0.; for(int i=0; i < nExamples; i++) { // transform float d=myFunc(i, param, example, nExamples, NULL); //reduce err += d*d; } } return sqrt(err); }

int main() { cout << "Hello World" << endl; // load data and initialize parameters init(); #pragma acc data \ copyin(param[0:N_PARAM-1]) \ pcopyin(example[0:nExamples*EXAMPLE_SIZE-1]) { optimize( objFunc ); // the optimizer calls the objective function } return 0; }

Page 3: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

13 PF/s average sustained flops using 16,384 GPUs (On both linear and nonlinear problems using MPI and thousands of GPUs)

3

𝐸𝑓𝑓𝑒𝑐𝑡𝑖𝑣𝑒𝑅𝑎𝑡𝑒 =

𝑇𝑜𝑡𝑎𝑙𝑂𝑝𝐶𝑜𝑢𝑛𝑡

𝑇𝑏𝑟𝑜𝑎𝑑𝑐𝑎𝑠𝑡 +𝑇𝑜𝑏𝑗𝑒𝑐𝑡𝑓𝑢𝑛𝑐 +𝑇𝑟𝑒𝑑𝑢𝑐𝑒

Note: Always report “Honest Flops”

0

2000

4000

6000

8000

10000

12000

14000

0 2000 4000 6000 8000 10000 12000 14000 16000 18000

TF/s

(s

ingl

e-p

reci

sio

n c

alcu

lati

on

wit

h a

do

ub

le-p

reci

sio

n r

ed

uct

ion

)

Number of K20X GPUs

PCA

NLPCA

Acknowledgements: Oak Ridge Nat. Lab and Fernanda Foertter (PI) This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725.

Minimal performance variation!

20 PF/s expected using all GPUs +

CPU cores

13 PF/s using 16,384 GPUs

Code: Numerical and Computational Optimization on the Intel Phi

Page 4: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• Threads can only communicate within a thread block – (yes, there are atomic ops)

• Fast hardware scheduling – Blks run when dependencies resolved

Scalability required to use all those cores (strong scaling execution model)

Page 5: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

OpenACC portability exploits strong scaling execution /* matrix-acc.c */ int main() { … // Compute matrix multiplication. #pragma acc kernels copyin(a,b) copy(c) for (i = 0; i < SIZE; ++i) { for (j = 0; j < SIZE; ++j) { for (k = 0; k < SIZE; ++k) { c[i][j] += a[i][k] * b[k][j]; } } } return 0; }

! matrix-acc.f program example1 … !$acc data copyin(a,b) copy(c) !$acc kernels loop ! Compute matrix multiplication. do i=1, n_size do j=1, n_size do k = 1, n_size c(i,j) = c(i,j) + a(i,k) * b(k,j) enddo enddo enddo !$acc end data end program example1

int main()

{

cout << "Hello World" << endl;

// load data and initialize parameters

init();

#pragma acc data \

copyin(param[0:N_PARAM-1]) \

pcopyin(example[0:nExamples*EXAMPLE_SIZE-1])

{

optimize( objFunc ); // the optimizer calls the objective function

}

return 0;

}

C Fortran

C++

Multi-core, coprocessor and GPU versions exist

Page 6: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

How to work with complex data structures like graphs?

• Build a parallel stack for complex data structures – Must be performance robust (e.g. low-wait)

– Allow transparent host <-> device <-> host access

– For performance, must manage memory!

• A stack is required for dynamic parallelism – Dynamic parallelism == variable output!

– Following is a fast intro!

Page 7: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Outline to create a parallel stack 1. A fast, robust ParallelCounter class

– Counts items in stack … cannot limit parallelism • Even in pathological cases where every thread increments at the

same time

2. Transparent data movement between host and devices

3. Combine 1&2 to act as a stack that provides: – Fast object allocation – Transparent access to data on both host and device – Can handle arbitrary numbers of inserts efficiently

Page 8: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Atomics are great (but don’t use them)

thread

0

atomicAdd thread atomicAdd thread atomicAdd thread atomicAdd

1 2 3 4

Page 9: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Much faster when there is no contention

thread atomicAdd thread atomicAdd thread atomicAdd thread atomicAdd

0 0

0 0 1 1

1 1

Page 10: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Minimize the wait by partitioning (threadIdx.x % WARP_SIZE)

thread atomicAdd thread atomicAdd thread atomicAdd thread atomicAdd

0 0

0 0 1 1

1 2 4

Sum the memory locations for counter value

Page 11: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Pretty simple in C++ • CUDA, Supercomputing for the Masses:

Part 25 (Atomic Operations and Low-Wait Algorithms in CUDA)

– Note the count array in red

• GPU: N_ATOMIC=32 means zero wait

inline __device__ uint32_t operator+=(uint32_t x) {

return atomicAdd(count + (threadIdx.x % N_ATOMIC), x);

}

• Works great on multicore! – OpenMP, Intel Xeon Phi

inline __device__ uint32_t operator+=(uint32_t x) {

return atomicAdd(count + (random() % N_ATOMIC), x);

}

struct ParallelCounter { public: uint32_t count[N_ATOMIC]; inline __device__ uint32_t operator-=(uint32_t x) { return atomicSub(count + (threadIdx.x % N_ATOMIC), x); } inline __device__ uint32_t operator+=(uint32_t x) { return atomicAdd(count + (threadIdx.x % N_ATOMIC), x); } // spread the counts across the counter __device__ __host__ void set(uint32_t x) { for(int i=0; i < N_ATOMIC; i++) count[i]=x/N_ATOMIC; for(int i=0; i < x % N_ATOMIC; i++) count[i] += 1; } inline __device__ __host__ uint32_t getCount() { // simplest slow method for right now. uint32_t sum=0; for(int i=0; i < N_ATOMIC; i++) { sum += count[i]; } return sum; } };

Page 12: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

ParallelCounter performance is great (lower is better)

0.00E+00

1.00E+00

2.00E+00

3.00E+00

4.00E+00

5.00E+00

6.00E+00

7.00E+00

8.00E+00

0 500000000 1000000000 1500000000 2000000000 2500000000 3000000000 3500000000 4000000000

Ru

nti

me

(se

con

ds)

nSamples

C2050 singleCounter

K20c singleCounter

C2050 ParallelCounter

K20c ParallelCounter

Fermi (single)

Fermi (low-wait)

Kepler (low-wait)

Kepler (single)

Page 13: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Outline to create a parallel stack 1. A fast, robust ParallelCounter class

– Counts items in stack … cannot limit parallelism • even in pathological cases where every thread increments at the

same time

2. Transparent data movement between host and devices

3. Combine 1&2 to act as a stack that provides: – Fast object allocation – Transparent access to data on both host and device – Can handle arbitrary numbers of inserts efficiently

Page 14: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• Needed for cudaMemcpy() and Intel Xeon Phi data transfers

• Not mainstream! – Yet

• Happily addressed by the C++ standards committee for binary read/write – File I/O – Socket I/O

DEVICE { From, Attribute, to

}

Host/Device(s) object layout compatibility

HOST { Attribute, From, to

}

DEVICE { Attribute, From, to

}

HOST { Attribute, From, to

}

Page 15: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Ensure host/device object compatibility! • If your compiler supports the C++11 standard, use the standard!

• Unfortunately NVCC – like most compilers - is not C++11 compliant!

– Means the programmer must use compiler front end calls • Not ideal but it works

– This is not extremely as the calls are used only for a safety check and not functional reasons

• GNU – __is_pod() – __is_standard_layout() – __has_trivial_copy()

• Microsoft users – is_pod() – is_standard_layout() – has_trivial_copy()

Page 16: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Do the data movement

• Pre-6.0 – CUDA, Supercomputing for the Masses: Part 28

A Massively Parallel Stack for Data Allocation

– CUDA, Supercomputing for the Masses: Part 27 A Robust Histogram for Massive Parallelism

– CUDA, Supercomputing for the Masses: Part 26 CUDA: Unifying Host/Device Interactions with a Single C++ Macro

• CUDA 6.0: TBD

Page 17: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Outline to create a parallel stack 1. A fast, robust ParallelCounter class

– Counts items in stack … cannot limit parallelism • even in pathological cases where every thread increments at the

same time

2. Transparent data movement between host and devices

3. Combine 1&2 to act as a stack that provides: – Fast object allocation – Transparent access to data on both host and device – Can handle arbitrary numbers of inserts efficiently

Page 18: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Allocation times on a GPU

• The CUDA allocator (cudaMalloc) is very heavyweight because it must manage all device memory – It’s only going to get slower

• The solution – Allocate many objects of a single type at one time

Page 19: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Results of testMalloc.cu

• Comparing individual malloc vs one big alloc

– 7.86 vs. 0.00001766 seconds

$ nvprof ./testMalloc 1000000 128 ======== NVPROF is profiling testMalloc... ======== Command: testMalloc 1000000 128 Using: nElements 1000000 128 ======== Profiling result: Time(%) Time Calls Avg Min Max Name 100.00 7.86s 1 7.86s 7.86s 7.86s performIndividualAlloc(int, int) 0.00 17.66us 1 17.66us 17.66us 17.66us performBlockAlloc(int, int)

Page 20: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Modify ParallelCounter

• Need unique counts that are never duplicated

– ParallelCounter can potentially increase by warpsize in one instruction!

thread atomicAdd thread atomicAdd thread atomicAdd thread atomicAdd

0 0

0 0 1 1

1 1

WARP

Page 21: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

BoundedParallelCounter

count[0] 0

maxColCount -1

count[1] maxColCount

2*maxColCount -1

count[2] 2*maxColCount

3*maxColCount -1

1. Index off individual counters in the count array 2. Calculate the index so there is no overlap between counters: [0,N) [N,2N)

[2N,3N) {where N = maxColCount}

3. Each index represents a unique stack structure/class location • Find unused data structures quickly and in parallel even when all counters increment at the same time!

Page 22: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Use transparent object movement • Transparent object movement allows

– Host and device allocation

– Host and device traversal

• Create complex structures on the host

– Compute with them on the device

– Read variable results on the host

Host CUDA, Supercomputing for the Masses: Part 28 A Massively Parallel Stack for Data Allocation

Page 24: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

ParallelCounter is also an accumulator!

template <class ACC_TYPE, uint32_t N_ATOMIC=32> struct ParallelAccumulator { private: ACC_TYPE accum[N_ATOMIC]; public: __device__ ParallelAccumulator<ACC_TYPE, N_ATOMIC>() { } __device__ ~ParallelAccumulator<ACC_TYPE, N_ATOMIC>() { } inline __device__ ACC_TYPE operator-=(ACC_TYPE x) { return atomicSub(accum + (threadIdx.x % N_ATOMIC), x); } inline __device__ ACC_TYPE operator+=(ACC_TYPE x) { return atomicAdd(accum + (threadIdx.x % N_ATOMIC), x); } // spread the values across the accumulator __device__ void set(ACC_TYPE x) { for(int i=0; i < N_ATOMIC; i++) accum[i]=x; } inline __device__ ACC_TYPE getValue() { // simplest slow method for right now. ACC_TYPE sum=0; for(int i=0; i < N_ATOMIC; i++) { sum += accum[i]; } return sum; } };

Float/Double/struct/class Data Acc += val sum

Float/Double/struct/class Data Acc += val sum

Float/Double/struct/class Data Acc += val sum

Float/Double/struct/class Data Acc += val sum

Float/Double/struct/class Data Acc += val sum

Float/Double/struct/class Data Acc += val sum

Float/Double/struct/class Data Acc += val sum

Actual code (C++ template arg)

Page 25: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Parallel Concurrent Sorting // Based on http://www.tools-of-computing.com/tc/CS/Sorts/bitonic_sort.htm __global__ void bitonic_sort_step(float *data, int j, int k, int n) { //unsigned int i, ixj; /* Sorting partners: i and ixj */ for(unsigned int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x) { unsigned int ixj = i^j; /* The threads with the lowest ids sort the array. */ if ((ixj)>i) { if ((i&k)==0) { /* Sort ascending */ if (data[i]>data[ixj]) { /* exchange(i,ixj); */ float temp = data[i]; data[i] = data[ixj]; data[ixj] = temp; } } else { /* Sort descending */ if (data[i]<data[ixj]) { /* exchange(i,ixj); */ float temp = data[i]; data[i] = data[ixj]; data[ixj] = temp; } } } } }

Float/Double/struct/class Data Sort

Actual code

Sort

Float/Double/struct/class Data Sort Sort

Float/Double/struct/class Data Sort Sort

Float/Double/struct/class Data Sort Sort

Float/Double/struct/class Data Sort Sort

Float/Double/struct/class Data Sort Sort

Float/Double/struct/class Data Sort Sort

Float/Double/struct/class Data Sort Sort

__global__ void k_bitonic_sort(float *data, int n, int powOfTwo) { int nThreadsPerBlock = 512; int nBlocks = (powOfTwo-n)/nThreadsPerBlock; nBlocks /= 64; nBlocks = (nBlocks==0)?1:nBlocks; if(n < powOfTwo) bitonic_init<<<nBlocks, nThreadsPerBlock>>>(data, n, powOfTwo); nThreadsPerBlock = (powOfTwo > nThreadsPerBlock)?nThreadsPerBlock:powOfTwo; nBlocks = powOfTwo/nThreadsPerBlock; nBlocks /= 64; nBlocks = (nBlocks==0)?1:nBlocks; int j, k; /* Major step */ for (k = 2; k <= powOfTwo; k *= 2) { /* Minor step */ for (j=k>>1; j>0; j= j>>1) { bitonic_sort_step<<<nBlocks, nThreadsPerBlock>>>(data, j, k,powOfTwo); } } }

Page 26: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Concurrent Parallel Sorting (10k floats, 15 streams, k40c)

Great job NVIDIA!

• Nice scheduling trapezoid!

simpleBitonic sort

simpleQsort

• simpleBitonic: 4.5 ms

• simpleQsort: 246.9 ms

Recursion is expensive

(not surprising)

• Recursion is expensive

Page 27: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Lots of heterogeneity, variable sampling, varying certainty of measure: Umm…what can we trust?

Cold Spring Harbor Laboratory MS analysis pipeline POC: Dr. John Wilson (jwilson at cshl.edu)

Public URL forthcoming

• Protein differences between states cause disease • Proteins are the structures and machines of life • Biology varies: heterogeneous observations with large scatter

• Mass spectrometry (MS) reads out protein differences • MS data are not Gaussian and change (calibration, temp., sample, etc.) • Non-linear measurement reliability as a function of intensity • Unpredictable number of observations • => Control everything internally and translate all observations to p-values.

• Work in those.

High variability at low S/N or at saturation

Low variability within linear range of measure

• People can and do look at a single overall measure (e.g. p=0.08) o Then go and do 3 years and $500k

of research o Only to come back and say “Hmm,

can I look at that spectrum again?” • Often they then find out they are

looking at highly variable, low-intensity observations. (OUCH!)

Page 28: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• Bootstrapping: non-parametric resampling technique that works for all distributions and small n; allows determination of certainty of measure • Steps: sample observations, combine p values, repeat >=10,000 times, sort, find CIs

• Example: 18 heterogeneous measurements • p from 0.007 (down regulated) to 0.14 (up reg.) • Most say down reg.; overall p = 0.08 (maybe down reg.) • HOWEVER, 95% CI = 0.001 - 0.558! • In ~30-40% of observation sets,

upper 95% CIs exceed choice of alpha

p = 0.08

p-val

0.028534

0.262530

0.286691

0.609363

0.783723

0.378550

0.252444

0.067699

0.859116

0.292653

0.834993

0.471238

GPUs for interactive answers Timeline running on k40c + k20c

timeline

Cold Spring Harbor Laboratory MS analysis pipeline POC: Dr. John Wilson (jwilson at cshl.edu)

Public URL forthcoming

Page 29: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

SurveyFit™ Find the distribution that best fits your data

info at statperfect.com

Multicore + K40c + k20c + GT640

• Parametric modeling can be extremely helpful • Forecasting, interpolation, understanding, speed, simplification

• Assumptions and/or the wrong model can have drastically terrible effects • “Recipe for Disaster: The Formula That Killed Wall Street” (Wired 2/09) • Oops, financial markets are not Gaussian distributed

• Many hundreds of probability distribution functions exist • Many techniques exist to fit data • Many goodness of fit tests

• Measure how well the model describes the data • Techniques to determine the relative quality of a model also exist

SurveyFit™ solves every distribution (in library)

With every technique (in library)

Ranks everything against everything Enabled by massive parallel processing

Page 30: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Kriging Interpolation and Graph Algorithms

Kriging Interpolation • Heavily used by the Air

Force and GIS services • Widely used in the domain

of spatial analysis and computer experiments

Graph Algorithms (sadly no time! ) • Heavily used in social media

analysis • See mpgraph & Graphlab http://www.systap.com/mpgraph/api/html/index.htm

Page 31: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

Parallel Kriging (single K20c delivers a 55x speedup over a quad core Xeon)

Three kriging steps 1. Estimation of the semivariogram 2. Semivariogram model fitting. 3. Expression of the N_TILE x N_TILE solution of the

kriging weights and creation of the raster points.

Strong scalability of Magma sgetrf() by GPUs

http://icl.cs.utk.edu/news_pub/submissions/plagma_lu.pdf

for(int i=0; i < TILE_SIZE; i++) for(int j=0; j < TILE_SIZE; j++) kriging_results(i,j,..., raster)

4-core Xeon 825 ms

Nvidia K20c 15 ms

Single GPU Runtime on a representative dataset

Strong scaling by number of GPUs!

kriging_results(0,0,..., raster)

kriging_results(0,1,..., raster)

kriging_results(i/2-1,j,..., raster)

Parallel Implementation (one kriging result() call per thread)

kriging_results(i/2,0,..., raster)

kriging_results(i/2,1,..., raster)

kriging_results(i/2,j,..., raster)

2x GPU = 2x faster 4x GPU = 4x faster 16x GPU = …

Page 33: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• Looks sort of Gaussian

Probability Density Function

Histogram

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Data (watch the red line) SurveyFit™ animation:

Page 34: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• And a Gaussian sort of fits

Probability Density Function

Histogram Normal

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Gaussian (sort of fits) SurveyFit™ animation:

Page 35: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• General logistic looks better…

Probability Density Function

Histogram Gen. Logistic

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

General Logistic SurveyFit™ animation:

Page 36: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• Hypersecant is yet better

Probability Density Function

Histogram Hypersecant

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Hypersecant SurveyFit™ animation:

Page 37: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• Johnson SU is by far the best fit

Probability Density Function

Histogram Johnson SU

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Johnson SU SurveyFit™ animation:

Page 38: Rob Farber Chief Scientist, BlackDog Endeavors, LLC ......S4178: Killer-app Fundamentals: Massively-parallel data structures, Performance to 13 PF/s, Portability, Transparency, and

• Gaussian compared to Johnson SU • The actual SurveyFit™product examines hundreds of distributions

and many methods

Probability Density Function

Histogram Johnson SU Normal

x

1.81.71.61.51.41.31.21.110.90.80.70.60.50.40.3

f(x)

0.052

0.048

0.044

0.04

0.036

0.032

0.028

0.024

0.02

0.016

0.012

0.008

0.004

0

Results (Johnson SU vs Gaussian) SurveyFit™ animation: