42
Unstructured Computations on Knights Landing Architecture Mohammed A. Al Farhan and David E. Keyes Extreme Computing Research Center King Abdullah University of Science and Technology Intel Extreme Performance Users Group (IXPUG) Middle East Conference 2018 at KAUST April 22-25, 2018 Thuwal, Saudi Arabia April 24, 2018

Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Unstructured Computations on Knights

Landing Architecture

Mohammed A. Al Farhan and David E. Keyes

Extreme Computing Research CenterKing Abdullah University of Science and Technology

Intel Extreme Performance Users Group (IXPUG)Middle East Conference 2018 at KAUSTApril 22-25, 2018 • Thuwal, Saudi Arabia

April 24, 2018

Page 2: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Highlights of the Contributions

Address the classical tension between unstructured data (indirect addressing) andhighly structured architectures (vector instructions)

Algorithm: nonlinearly implicit pseudo-transient Newton-Krylov-Schwarz solver

Application: computational aerodynamics (Gordon Bell winning NASA legacy code)

Architecture: Intel Xeon Phi “Knights Landing”

AVX-512CD instructions provide new solutions for conflicting objectives

Independence of writes versus memory locality

Finest-grain implementation of PETSc-FUN3D to exploit language features

Gains in wall-clock time 2.9x over scalar compilation for strong thread scaling

Distributed-memory not addressed here; proven for two decades in weak scaling

Gains in energy-to-solution for a given wall-clock time up to 1.7x

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 1 / 39

Page 3: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Outline

1 Application – Computational Aerodynamics

2 Shared-memory Optimizations and TuningData-level Parallelism

3 Performance Evaluation and Results

4 Summary and Reflections

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 2 / 39

Page 4: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Fully Unstructured Navier-Stokes in 3D

An unstructured tetrahedral mesh Euler and Navier-Stokes research code, which isclosely related to the export-controlled state-of-the-practice from NASA LangleyResearch Center

For over two decades FUN3D has been under active development for modeling fluidflow, and design optimization of airplanes, automobiles, and submarines with anumber of vertices up to 10 billion

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 3 / 39

Page 5: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

PETSc-FUN3D

1995

1999Gordon Bell Special Prize

2005Adaptive Linear Solver

2016KNC/HSX/BDX

2017KNL/SKY

2018GPU

2001Parallel I/O

2010IBM BlueGene/P

2015SNB/IVB

MIMD/SPMD

1996 2002

SIMD/SIMT

2009

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 4 / 39

Page 6: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Edge-based Loop Kernel – Sequential

1: for all e ∈ array of edges do2: Get the index of the right node of e3: Get the index of the left node of e4: Read the flow variables from both endpoints of e5: Compute the flux and the residual6: Perform “write-back” operations to update the flow variables of e7: end for

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 5 / 39

Page 7: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Spatial and Temporal Locality of Reference

Left endpoints of the edges are ordered in ascending orderI The right endpoints and the edges’ components are ordered accordingly

Traversing the edges index table is done through one iterator pointer that issequentially incremented

A kernel loop that iterates over the stencil data items is essentially transformedfrom a loop over the edges into a loop over the vertices, in which the iterativetraversing is linearly based upon the left endpoints

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 6 / 39

Page 8: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Geometric Data LayoutTree of Struct-of-Arrays (SoA)

*edges(struct)

nedges (size t)

*nodes(struct) *n0 (unsigned int)

*n1 (unsigned int)

*normals(struct)

*x (double)

*y (double)

*z (double)

*ln (double)

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 7 / 39

Page 9: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Residual and Gradient Data LayoutFROM Array-of-Structs-of-Arrays (AoSoA) TO Array-of-Structs-of-Strided-Arrays

struct {X: u v w p ......... u v w p

(A) Y: u v w p ......... u v w pZ: u v w p ......... u v w p

} Gradient;

struct {(B) Q: u v w p ......... u v w p

} Residual;

Alignment and padding to 64-bytes cache line boundaries

Contiguously adjust every 4 DoF next to each other (u, v, w, p)

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 8 / 39

Page 10: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Memory-aware Allocation

Explicit heap allocator based on how frequent the data are being accessedI Intel memkind heap manager library and jemalloc library

F hbw posix memalign psize()

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 9 / 39

Page 11: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Thread-level Parallelism

Edges workloads is partitioned using multilevel k-way partitioning scheme of METIS

METIS searches for an adequate load balance across the OpenMP threads

It minimizes the aggregate cross edges’ weight (i.e., reduce the edge-cut)

Reverse Cuthill-McKee (RCM) is used for reordering the vertices

Work replication – METIS distribution conserves thread safety without the need ofatomic

Only the “master” thread is allowed to perform the write-back operation – “OwnerCompute”

The METIS strategy of redundancy in computations purges the overheads of usingglobal synchronizing barrier between the working threads

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 10 / 39

Page 12: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Edge-based Loop – Optimized and Threaded

1 #pragma omp parallel

2 {

3 const uint32_t t = omp_get_thread_num();

4 for(uint32_t i = ie[t]; i < ie[t+1]; i++){

5 /*Load and compute*/

6 if(parts[n0[i]] == t){/*Write-back into v[n0[i]]*/}

7 if(parts[n1[i]] == t){/*Write-back into v[n1[i]]*/}

8 }

9 }

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 11 / 39

Page 13: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Intel AVX-512 Instruction Set Architecture

Knights Landing

Skylake

AVX512F: Foundation

AVX512CD: Conflict Detection

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 12 / 39

Page 14: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Intel AVX-512 Instruction Set Architecture

Knights Landing

Skylake

AVX512ER: Exponential and Reciprocal

AVX512PF: Prefetching

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 13 / 39

Page 15: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Intel AVX-512 Instruction Set Architecture

Knights Landing

Skylake

AVX512VL: Vector Length

AVX512DQ: Doubleword and Quadword

AVX512BW: Byte and Word

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 14 / 39

Page 16: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Vectorizing Edge-based Loop Kernels

Read: Load and gather instructions

Arithmetic: FMA (vfmadd/vfmsub/vfnmadd)

Control-flow to Data-flow: Masking instructions

Square Root/Division: Reciprocal instructionsI√x: mm512 mul pd( mm512 rsqrtXX pd(x),x)

I cx : mm512 mul pd( mm512 rcpXX pd(x),c)

Prefetching: Software gather prefetching instructions

Write-back: Conflict detection and scatter instructions

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 15 / 39

Page 17: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Conflict Detection Instructions

01234567

Conflict-free Array

00011122

Array with Conflicts

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 16 / 39

Page 18: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Conflict Detection Instructions

01234567

Conflict-free Array

00011122

Array with Conflicts

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 17 / 39

Page 19: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Conflict Detection Instructions

01234567

Conflict-free Array

0 0 00 0 10 0 21 1 01→

1→

11 1 22 2 02 2 1

Array with Conflicts

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 18 / 39

Page 20: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Conflict Detection Instructions1 Generate an initial mask based on the thread ID and node index

2 Scan the SIMD lane to identify the conflicted data

3 Generate a mask that separates a conflict-free data subset

4 Perform a safe gather mask operation (load) to generate registers with contiguousdata items

5 Update the operands with multiple FMA instructions (number of double precisionarithmetics)

6 Perform a safe scatter mask operation (write-back) to sprinkle the updated dataitems over the memory addresses based on their indices

7 Perform SIMD boolean operation to “mask out” the already written indices fromthe mask (swizzling)

8 Repeat the aforementioned steps again on all of the remaining subsets within theSIMD lane

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 19 / 39

Page 21: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Vectorized Edge-based Loop Kernel1 #pragma omp parallel

2 {

3 const uint32_t t = omp_get_thread_num();

4 const uint32_t l = ie[t+1] - ((ie[t+1]-ie[t]) % 8);

5 for(uint32_t i = ie[t]; i < l; i += 8){

6 /*Load and compute on the SIMD lane elements*/

7

8 /*The Write-back inner loop*/

9 _mm512_cmpeq_epi32_mask(/*...*/);

10 do {

11 _mm512_mask_conflict_epi32(/*...*/);

12 _mm512_broadcastmw_epi32(/*...*/);

13 _mm512_mask_testn_epi32_mask(/*...*/);

14 /*Gather, compute, and scatter*/

15 _mm512_kxor(/*...*/);

16 } while(/*...*/);

17 }

18 /*Peel and remainder loop*/

19 for(uint32_t i = l; i < ie[t+1]; i++){

20 /*load and compute*/

21 if(parts[n0[i]] == t){/*Write-back into v[n0[i]]*/}

22 if(parts[n1[i]] == t){/*Write-back into v[n1[i]]*/}

23 }

24 }

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 20 / 39

Page 22: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Fine-grained Data Partitioning

Improve the vectorization efficiency and increase the size of the independent datasubset within a SIMD lane

Bucket sort routine based on a variant of the edge coloring algorithm

Aim to:I Extract independent subsets of the edgesI Prune the overhead of the innermost CD loop

Extract at most 8 independent edges within a bucketI To avoid a potential disruption of the initial ordering for cache locality efficiency

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 21 / 39

Page 23: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Bucket Sort Based on Partial Coloring

Edges Bitmap0 1 2 3 4 5 6 7

0 0 11 0 2 1 1 1 1 1 1 0 0 Bucket 12 0 33 1 0 1 1 1 1 0 0 0 0 Bucket 24 1 25 1 3 1 1 1 1 0 0 0 0 Bucket 36 2 37 5 4 1 1 0 0 0 0 0 0 Bucket 4

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 22 / 39

Page 24: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Bucket Sort Based on Partial Coloring

1: Create an nedges per color array to track the colors2: Create a bitmap to track the endpoints coloring scheme3: for c← 0, k ← 0, i← 0 to nndges do4: Clear all of the bitmap bits5: for j ← i to nndges do6: Read n0[j], n1[j] bits from the bitmap → b0, b17: if b0 ∨ b1 is TRUE then8: CONTINUE9: end if

10: if nedges per color[c] = 8 then11: BREAK12: end if13: Set n0[j], n1[j] bits in bitmap14: Swap edge data from index j to k and vice versa15: Increment k and nedges per color[c] by 116: end for17: Increment c by 118: i ← k19: end for

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 23 / 39

Page 25: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Mesh size1

Vertices: 2,761,774

DoFs: 11,047,096

Edges: 18,945,809

1In 1999 Gordon Bell prize paper, this was considered to be a large mesh but it is hostedconveniently today within a single node

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 24 / 39

Page 26: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Experimental Design

Sample space: 20 independent experiments

Runtime:I API: RDSTC instructions and POSIX timestampsI Results summary: arithmetic mean

Memory bandwidth and FLOPS:I Bandwidth: MC/EDC RPQ and WPQ insertsI Flops: count the arithmetic operations manuallyI Results summary: harmonic mean

For every experiment, the source code is recompiled, and the memory and cacheare completely flushed

+/- standard deviation of the mean for each experimental sample

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39

Page 27: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Hardware Specifications

KNL-A KNL-B KNL-C HSX BDX SKYFamily x200 x200 x200 E5V3 E5V4 ScalableModel 7210 7210 7290 2699 2680 8176Socket(s) 1 1 1 2 2 2Cores 64 64 72 36 28 56GHz 1.30 1.30 1.50 2.30 2.40 2.10Watts/socket 215 215 245 145 120 165DDR4 (GB) 96 96 192 264 132 264Freq. Driver* I A A A I AMax GHz 1.50 1.30 1.50 2.30 3.30 2.10Governor** P C C O P OTurbo Boost × X X X X X

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 26 / 39

Page 28: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Flux Performance with Different KNL Modes

0

20

40

60

80

100

120

140

160

180

64 128 256

Tim

e (

Second)

Number of Threads

Chip: KNL-B

All2All/CAll2All/F

Quadrant/CQuadrant/F

Hemisphere/CHemisphere/F

SNC-2/CSNC-2/FSNC-4/CSNC-4/F

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 27 / 39

Page 29: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Optimizations of the Flux Kernel

0

80

160

240

320

400

64 128 256

Tim

e (

Second)

Number of Threads

Chip: KNL-A

BaselineOptimized/HBW-AVX512/HBW

IntrinsicsIntrinsics/HBW

Reordering/HBW

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 28 / 39

Page 30: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Optimizations of the Gradient Kernel

0

80

160

240

320

64 128 256

Tim

e (

Second)

Number of Threads

Chip: KNL-A

BaselineOptimized/HBW-AVX512/HBW

IntrinsicsIntrinsics/HBW

Reordering/HBWRestructuring/HBW

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 29 / 39

Page 31: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Strong Thread Scalability

0

4000

8000

12000

16000

20000

1 2 4 8 16 32 64 128 256

Tim

e (

Second)

Number of Threads

Chip: KNL-B

PETSc RoutinesFlux Kernel

Gradient KernelJacobian Matrix Construction

Preprocessing and SetupPseudo Time Step Kernel

Aerodynamics Forces Computation

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 30 / 39

Page 32: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Strong Thread Scalability (Flux Kernel)

16

64

256

1024

4096

1 2 4 8 16 32 64 128 256

Tim

e (

Second)

Number of Threads

Chip: KNL-B

100%

100%

100%

98%

98%

94%

88%54% 24%

1.0

2.1

4.0

7.8

15.7

30.1

56.168.5 61.8

Flux Routine

Ideal Scaling

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 31 / 39

Page 33: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Strong Thread Scalability (Gradient Kernel)

16

64

256

1024

4096

1 2 4 8 16 32 64 128 256

Tim

e (

Second)

Number of Threads

Chip: KNL-B

100%

100%

100%

100%

98%

88%72% 42% 22%

1.0

2.3

4.3

8.3

15.7

28.345.8 53.7 57.2

Gradient Routine

Ideal Scaling

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 32 / 39

Page 34: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Performance Speedup

0

10

20

30

40

50

60

70

164

128

256 1

64

128

256 1

64

128

256 1

64

128

256 1

64

128

256 1

64

128

256

Sp

eed

up

Number of Threads

Chip: KNL-B

ForcesTSSetupJacoGradFlux

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 33 / 39

Page 35: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Amdahl’s Law

0

1

2

3

4

1 2 4 8 16 32 64 128 256

Speedup

Number of Threads

Chip: KNL-B

Amdahl's LawActual Speedup

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 34 / 39

Page 36: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Memory Bandwidth and Flops PerformanceFlux Kernel Gradient Kernel

Threads 72 144 288 72 144 288GFlop/s 117 144 123 21 24 25

Flux Kernel Gradient KernelThreads 72 144 288 72 144 288

DD

R Read: GB/s 14 17 17 8 10 12Write: GB/s 7 9 9 0.2 0.4 1

HB

W Read: GB/s 40 51 45 43 53 54Write: GB/s 0.1 0.1 0.1 18 22 25

GFlop/s: 5% out of 3 TFlop/sGB/s:

I DRAM: 25% out of STREAM (77 GB/s (R), 36 GB/s (W))I MCDRAM: 17% out of STREAM (314 GB/s (R), 171 GB/s (W))

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 35 / 39

Page 37: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Flux Performance on x86 Hardware

0

40

80

120

160

200

240

280

320

360

18

36

72

14

28

56

64

128

256

72

144

288

28

56

112

Tim

e (

Secon

d)

Number of ThreadsSKXKNL-CKNL-BBDXHSX

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 36 / 39

Page 38: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Gradient Performance on x86 Hardware

0

40

80

120

160

200

240

18

36

72

14

28

56

64

128

256

72

144

288

28

56

112

Tim

e (

Secon

d)

Number of ThreadsSKXKNL-CKNL-BBDXHSX

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 37 / 39

Page 39: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

PETSc-FUN3D Performance on x86 Hardware

0

1000

2000

3000

4000

5000

6000

18

36

72

14

28

56

64

128

256

72

144

288

28

56

112

Tim

e (

Secon

d)

Number of ThreadsSKXKNL-CKNL-BBDXHSX

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 38 / 39

Page 40: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Conclusion

Demonstrate several shared-memory optimizations to extract:1 Thread-level parallelism – Careful workload distributions and load balancing2 Data-level parallelism – Utilizing the capabilities of AVX-512 ISA

Achieve 2.9x speedup in the performance of the flux kernel relative to the baselinecode

Exhibit almost linear scalability up to the full core count of KNL (64 cores), andcontinued scalability with SMT

Maintain on Skylake roughly similar speedup with less power consumption [KNL:245 Watts; SKX: 330 Watts]

Future Direction −→ Port PETSc-FUN3D onto GPU

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 39 / 39

Page 41: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Conclusion

Demonstrate several shared-memory optimizations to extract:1 Thread-level parallelism – Careful workload distributions and load balancing2 Data-level parallelism – Utilizing the capabilities of AVX-512 ISA

Achieve 2.9x speedup in the performance of the flux kernel relative to the baselinecode

Exhibit almost linear scalability up to the full core count of KNL (64 cores), andcontinued scalability with SMT

Maintain on Skylake roughly similar speedup with less power consumption [KNL:245 Watts; SKX: 330 Watts]

Future Direction −→ Port PETSc-FUN3D onto GPU

M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 39 / 39

Page 42: Unstructured Computations on Knights Landing Architecture · M. A. Al Farhan (KAUST) Unstructured Computations on KNL April 24, 2018 25 / 39. Hardware Specifications KNL-A KNL-B KNL-C

Q/A