Introduction to Intel Xeon Phi
This project is funded by Structural Funds of the European Union (ESF) and state budget of the Czech Republic
Martin Stachoň Lubomír Říha
IT4Innovations
Outline
• Accelerators philosophy
• Programming models
• Practical info for Salomon
Historical Analysis
[Figure: performance over time, 1993 to 2011 – vector machines, massively parallel processors (TeraFLOPS with MPPs, 1993-), then MPPs with multicores and heterogeneous accelerators after the end of Moore's Law in clocking: PetaFLOPS with Cell (HPCC, 2008-), with GP-GPUs, and with MIC (2011).]
Trends for Petaflop/s Machines
CPUs: Wider vector units, more cores
• General-purpose in nature
• High single-thread performance, moderate floating point throughput
• 2x E5-2680 - 0.40 Tflop/s, 260W
GPUs: Thousands of very simple stream processors
• Specialized for floating point.
• New programming models: CUDA, OpenCL, OpenACC
• Tesla K40 - 1.43 Tflop/s, 235W
MIC: Take CPU trends to an extreme, optimize for floating point.
• Retain general-purpose nature and programming models from CPU
• Low single-thread performance, high aggregate FP throughput
• Xeon Phi 7120P - 1.24 Tflop/s, 300W
Accelerators in HPC
Hardware Accelerators - Speeding up the Slow Part of the Code
• Enable higher performance through fine-grained parallelism
• Offer higher computational density than CPUs
• Accelerators present heterogeneity!
Main Features
• Coprocessor to the CPU (GPU or MIC)
• PCIe-based interconnection
• Separate memory
• Provides high-bandwidth access to local data
Future: Accelerators only?
Accelerated Execution Model
[Diagram: host CPU (µP) connected to an accelerator (GPU, MIC, FPGA, Cell CBE, …); control and input data are transferred to the device, then output data and control are transferred back to the host.]
• Fine grain computations with the accelerators, others with the CPU
• Interaction between accelerator and CPU can be blocking or asynchronous
• This scenario is replicated across the whole system and standard HPC parallel programming paradigms used for intranode interactions
Processors vs. Accelerators
Accelerators
• tailored for compute-intensive, highly data parallel computation
• many parallel execution units
• have significantly faster and more advanced memory interfaces
• more transistors can be devoted to data processing
• fewer transistors for data caching and flow control
• Very efficient for: fast parallel floating-point processing; high computation per memory access
• Not as efficient for: branching-intensive operations; random-access, memory-intensive operations
[Diagram: a CPU die dominated by control logic and cache with a few ALUs, vs. an accelerator die dominated by many ALUs; both attached to DRAM.]
Processors have few execution units but higher clock speeds.
Intel MIC Architecture
• Up to 61 Cores, 244 Threads
• 512‐bit SIMD instructions
• >1TFLOPS DP-F.P. peak
• Up to 16GB GDDR5 Memory • 352 GB/s peak, but ~170 GB/s measured
• PCIe 2.0 x16 - 5.0 GT/s, 16-bit
• Data Cache • L1 32KB/core • L2 512KB/core, 30.5 MB/chip
• Up to 300W TDP (card)
• Linux* operating system • IP addressable - coprocessor becomes a network node • Common x86/IA (no binary compatibility) • Programming Models and SW-Tools
Intel MIC Architecture Overview
TD – cache Tag Directory
Based on what Intel learned from – Larrabee – SCC – TeraFlops Research Chip
Memory – up to 16 GB of GDDR5
• used for everything, including the OS image and the local filesystem
• multiple memory controllers on the card, accessed over a shared ring bus; cores compete for access
• transfers to/from the card go over PCIe, so maximum speeds are around 7 GB/s
Core Architecture Overview
• 61 cores • Full support for the x86 instruction set • In order execution • Coherent caches (per core)
• 32KB L1 instruction and data caches. • 512KB shared L2 data/instruction
• Two scalar pipelines • Scalar Unit based on Pentium® processors • Dual issue with scalar instructions • Pipelined one-per-clock scalar throughput • One pipeline is limited in functionality
• SIMD Vector Processing Engine • only significant difference to the Pentium • 512-bit vector processing unit (VPU) • 32 512-bit wide vector registers
• 4 hardware threads per core • 4-clock latency, hidden by round-robin scheduling of threads • Cannot issue back-to-back instructions in the same thread: a minimum of two threads per core is needed to achieve full compute potential
Vectorization
Source : Xeon Phi Tutorial Tim Cramer | Rechen- und Kommunikationszentrum
SIMD Vector Capabilities
• MMX: MMX Pentium and Pentium II (PentiumPro didn’t have MMX)
• SSE: Pentium III
• SSE2: Pentium 4
• SSE3: Pentium 4 with 90nm technology
• SSSE3: Core 2 Duo/Quad (65nm technology)
• SSE4.1: Core 2 Duo/Quad (45nm technology)
• SSE4.2: 1st generation Core i7 (45nm, 32nm)
• AVX: 2nd/3rd generation Core i7 (32nm, 22nm)
Vectorization
Source : Xeon Phi Tutorial Tim Cramer | Rechen- und Kommunikationszentrum
SIMD Vector Basic Arithmetic
Vectorization
Source : Xeon Phi Tutorial Tim Cramer | Rechen- und Kommunikationszentrum
SIMD Fused Multiply and Add - FMA
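The arithmetic and FMA slides themselves are figures; as a minimal illustrative sketch (not from the original slides), a multiply-add loop of the following form is what the Intel compiler maps onto 512-bit FMA instructions (8 doubles per instruction) on the Xeon Phi VPU:

/* Sketch: restrict tells the compiler the arrays do not alias, so the loop
   can be vectorized into fused multiply-add instructions. */
void fma_loop(double *restrict a, const double *restrict b,
              const double *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c[i];   /* one fused multiply-add per element */
}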
MIC vs. CPU
• CPUs designed for all workloads, high single-thread performance
• MIC also general purpose, though optimized for number crunching
• Focus on high aggregate throughput via lots of weaker threads
Regularly achieve >2x performance compared to dual E5 CPUs Single core scalar performance is 1/10th of E5 core
                      MIC (SE10P)   CPU (E5)    MIC is…
Number of cores       61            8           much higher
Clock speed (GHz)     1.01          2.7         lower
SIMD width (bit)      512           256         higher
DP GFLOPS/core        16+           21+         lower
HW threads/core       4             1*          higher
Memory BW (GB/s)      320; GDDR5    100; DDR3   higher
Sustained BW (GB/s)   170           75          higher
High-performance Xeon Phi applications exploit both parallelism and vector processing.
MIC vs. CPU
[Figure: performance (GFLOPS) vs. number of threads (1 to 10000) for CPU and MIC in four regimes: scalar & single thread, vector & single thread, scalar & multiple threads, vector & multiple threads; more parallelism and more vectorization give more performance.]
MIC Programming: Advantages • Intel’s MIC is based on x86 technology
• x86 cores w/ caches and cache coherency • SIMD instruction set • but not binary compatible with host x86 (cross-compilation needed) • Coherent cache (but …)
• Programming for Phi is similar to programming for CPUs • Familiar languages: C/C++ and Fortran • Familiar parallel programming models: OpenMP & MPI • MPI on host and on the coprocessor • Any code can run on MIC, not just kernels
• Optimizing for Phi is similar to optimizing for CPUs • “Optimize once, run anywhere” • Early Phi porting efforts for codes “in the field” have obtained double the performance of Sandy Bridge
Will My Code Run on Xeon Phi? • Yes
• … but that’s the wrong question • Will your code run “best” on Phi?, or • Will you get great Phi performance without additional work?
• Codes port easily • Minutes to days depending mostly on library dependencies
• Performance can require real work • Getting codes to run “at all” is almost too easy • Need to put in the effort to get what you expect
• Scalability • Multiple threads per core is really important • Getting your code to vectorize is really important
Performance Expectations
“WOW, 240 hardware threads on a single chip! My application will just rock!” You really believe that?
Remember the limitations!
• In-order cores
• Limited hardware prefetching
• Cores running at 1 GHz only
• Small caches (2 levels)
• Poor single-thread performance
• Small main memory
• PCIe as bottleneck + offload overhead
... most likely NO …
MIC Programming Considerations
• Getting full performance from the Intel® MIC architecture requires both a high degree of parallelism and vectorization
• Not all code can be written this way • Not all programs make sense on this architecture
• Intel® MIC is different from Xeon • It specializes in running highly parallel and vectorized code. • Not optimized for processing serial code
• Parallelism and vectorization optimizations are beneficial across both architectures
Definition of a Node
• A “node” contains a host component and a MIC component
• host – refers to the Sandy Bridge component
• MIC – refers to one or two Intel Xeon Phi co-processor cards
Spectrum of Programming Models
MPI+Offload Programming Model
MPI ranks on CPUs only
• All messages into/out of host CPUs
• Offload models used to accelerate MPI ranks
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® Xeon Phi™ coprocessor
Build an Intel® 64 executable with included offload by using the Intel compiler
Run instances of the MPI application on the host, offloading code onto the coprocessor
Advantages of more cores and wider SIMD for certain applications
Symmetric Programming Model
MPI ranks on both Intel® Xeon Phi™ Architecture and host CPUs
• Messages to/from
any core
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
Build binaries by using the respective compilers targeting Intel 64 and Intel Xeon Phi Architecture
Upload the binary to the Intel Xeon Phi coprocessor
Run instances of the MPI application on different mixed nodes
Coprocessor only Programming Model
MPI ranks on Intel® Xeon Phi™ coprocessor only
• All messages into/out of the coprocessors
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads used directly within MPI processes
Build the Intel Xeon Phi coprocessor binary using the Intel® compiler
Upload the binary to the Intel Xeon Phi coprocessor
Run instances of the MPI application on Intel Xeon Phi coprocessor nodes
Offload Programming Model Overview • Programmer designates code sections to offload
• No further programming/API usage is needed
• The compiler and the runtime automatically manage setup/teardown, data transfer, and synchronization
• Add pragmas and new keywords to working host code to make sections of code run on the Intel® Xeon Phi™ coprocessor
• Similar to adding parallelism to serial code using OpenMP* pragmas
• The Intel compiler generates code for both target architectures at once
• The resulting binary runs whether or not a coprocessor is present • Unless you use #pragma offload target(mic:cardnumber)
• The compiler adds code to transfer data automatically to the coprocessor and to start your code running (with no extra coding on your part)
• Hence the term “Heterogeneous Compiler” or “Offload Compiler”
Data Transfer Overview
• The host CPU and the coprocessor do not share physical or virtual memory in hardware
• Two offload data transfer models are available: • 1. Explicit Copy
• Programmer designates variables that need to be copied between host and card in the offload pragma/directive
• Syntax: Pragma/directive-based • C/C++ Example:
• #pragma offload target(mic) in(data:length(size))
• 2. Implicit Copy (only Cilk+) – not covered here
• Programmer marks variables that need to be shared between host and card
• The same variable can then be used in both host and coprocessor code
• Runtime automatically maintains coherence at the beginning and end of offload statements
Data Transfer: C/C++ Offload using Explicit Copies

C/C++ syntax and semantics:
• Offload pragma: #pragma offload <clauses> <statement>
  Allow the next statement to execute on the coprocessor or the host CPU
• Offload transfer: #pragma offload_transfer <clauses>
  Initiates asynchronous data transfer, or initiates and completes synchronous data transfer
• Offload wait: #pragma offload_wait <clauses>
  Specifies a wait for a previously initiated asynchronous activity
• Keyword for variable & function definitions: __attribute__((target(mic)))
  Compile a function for, or allocate a variable on, both CPU and coprocessor
• Entire blocks of code: #pragma offload_attribute(push, target(mic)) … #pragma offload_attribute(pop)
  Mark entire files or large blocks of code for generation on both host CPU and coprocessor
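A minimal sketch (assumed, not from the original slides) of the last two rows: a global array and a helper function made available on both host and coprocessor.

/* Everything between push/pop is compiled for both the host and the coprocessor. */
#pragma offload_attribute(push, target(mic))
double table[1024];                      /* global visible on host and MIC */
double sum_table(int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += table[i];
    return s;
}
#pragma offload_attribute(pop)

/* Equivalent per-symbol marking of a single function: */
__attribute__((target(mic))) double scale(double x) { return 2.0 * x; }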
Conceptual Transformation
OpenMP* examples
Clauses / Modifiers: syntax and semantics
• Target specification: target( name[:card_number] ) selects where to run the construct
• Inputs: in(var-list modifiers_opt) copies data from host to coprocessor
• Outputs: out(var-list modifiers_opt) copies data from coprocessor to host
• Inputs & outputs: inout(var-list modifiers_opt) copies host to coprocessor and back when the offload completes
• Non-copied data: nocopy(var-list modifiers_opt) keeps data local to the target
• Conditional offload: if (condition) takes a boolean expression

Modifiers
• Specify pointer length: length(element-count-expr) copies N elements of the pointer's type
• Control pointer memory allocation: alloc_if( condition ) allocates memory to hold data referenced by the pointer if the condition is TRUE
• Control freeing of pointer memory: free_if( condition ) frees memory used by the pointer if the condition is TRUE
• Control target data alignment: align( expression ) specifies the minimum memory alignment on the target

Variables and pointers are restricted to scalars, structs of scalars, and arrays of scalars
Rules & Limitations
• The Host from/to Coprocessor data types allowed in a simple offload:
• Scalar variables of all types • Must be globals or statics if you wish to use them with nocopy, alloc_if, or free_if (i.e. if they are to persist on the coprocessor between offload calls)
• Structs that are bit-wise copyable (no pointer data members) • Arrays of the above types • Pointers to the above types
• What is allowed within coprocessor code?
• All data types can be used (incl. full C++ objects)
• Any parallel programming technique (Pthreads*, Intel® TBB, OpenMP*, etc.)
• Intel® Xeon Phi™ versions of Intel® MKL
Offload using Explicit Copies: Example
float reduction(float *data, int numberOf)
{
float ret = 0.f;
#pragma offload target(mic) in(data:length(numberOf))
{
#pragma omp parallel for reduction(+:ret)
for (int i=0; i < numberOf; ++i)
ret += data[i];
}
return ret;
}
Note: length(numberOf) copies numberOf elements (numberOf*sizeof(float) bytes) to the coprocessor, not numberOf bytes – the compiler knows data's type
Data Movement
• Default treatment of in/out variables in a #pragma offload statement
• At the start of an offload: • Space is allocated on the coprocessor • in variables are transferred to the coprocessor
• At the end of an offload: • out variables are transferred from the coprocessor • Space for both types (as well as inout) is deallocated on the coprocessor
Heterogeneous Compiler: Reminder of What is Generated
• Note that the compiler generates two binaries:
• The host version • includes all functions/variables in the source code, whether
marked • #pragma offload,
• __attribute__((target(mic))) …….. or not
• The coprocessor version
• includes only functions/variables marked in the source code • #pragma offload,
• __attribute__((target(mic)))
• Linking creates one executable with both binaries included!
Heterogeneous Compiler: Command-line options • “–openmp” is automatically set when you build
• Don’t need –no-offload if compiling only for Xeon • Generates same Xeon only code as previous compilers
• But –no-offload creates smaller binaries
• Most command line arguments set for the host are set for the coprocessor build
• Unless overridden by
–offload-option,mic,xx=”…”
• Add –watch=mic-cmd to display the compiler options automatically passed to the offload compilation
Heterogeneous Compiler: Command-line options cont.
Offload-specific arguments to the Intel® Compiler:
• Generate host only code (by default both host + coprocessor code is generated): -no-offload
• Produce a report of offload data transfers at compile time -opt-report-phase=offload
• Add Intel® Xeon Phi™ compiler switches -offload-options,mic,compiler,“switches”
• Add Intel® Xeon Phi™ assembler switches -offload-options,mic,as:“switches”
• Add Intel® Xeon Phi™ linker switches -offload-options,mic,ld,“switches”
Example:
icc -I/my_dir/include -DMY_DEFINE=10 -offload-options, mic,compiler, “-I/my_dir/mic/include -DMY_DEFINE=20“ hello.c
Passes “-I/my_dir/mic/include -I/my_dir/include -DMY_DEFINE=10 -DMY_DEFINE=20” to the offload compiler
Compiler Support for SIMD Vectorization
Intel auto-vectorizer: combination of loop unrolling and SIMD instructions to get vectorized loops
• No guarantee given, the compiler might need some hints
Compiler feedback: use -vec-report[n] to control the diagnostic information of the vectorizer
• n can be between 0-5 (recommended 3 or 5)
• concentrate on hotspots for optimization
C/C++ aliasing: use the restrict keyword
Intel-specific pragmas:
• #pragma vector (Fortran: !DIR$ VECTOR) indicates to the compiler that the loop should be vectorized
• #pragma simd (Fortran: !DIR$ SIMD) enforces vectorization of the (innermost) loop
SIMD support will be added in OpenMP 4.0
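As a minimal sketch (assumed, not from the slides) of how these hints look in C code, with restrict-qualified pointers and the Intel-specific pragma:

/* restrict tells the compiler the arrays do not alias; #pragma simd asks it
   to vectorize the loop even if its own analysis is inconclusive. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double q, int n)
{
    #pragma simd
    for (int i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}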
Simultaneous Host/Coprocessor Computing
• A #pragma offload statement blocks the thread until the statement completes
• Simultaneous host and coprocessor computing requires multiple threads of execution on the host:
• One or more to block until their #pragma offload statements complete
• Others to do simultaneous processing on the host
• You can use most multithreading APIs to do this • OpenMP* tasks or parallel sections • Pthreads* • Intel® TBB’s parallel_invoke, Intel® Cilk™ Plus, …
Simultaneous Host/Coprocessor Computing - Sequential
[Diagram: the master thread offloads from within an OpenMP single region; the other threads sit idle until the offload finishes, then all threads workshare on the CPU.]
Simultaneous Host/Coprocessor Computing - Concurrent
[Diagram: the offload is issued from an OpenMP single nowait region; the remaining threads immediately workshare on the CPU, and the offloading thread assists once its single region is done.]
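A minimal sketch of the concurrent pattern (assumed, not the original slide code): one thread drives the offload from a single nowait region while the remaining threads process host work; do_mic_work(), do_host_chunk() and n_host_chunks are hypothetical placeholders.

#pragma omp parallel
{
    #pragma omp single nowait          /* one thread issues the offload; the others do not wait */
    {
        #pragma offload target(mic)    /* blocks only this thread; do_mic_work must be
                                          declared __attribute__((target(mic)))            */
        do_mic_work();
    }

    #pragma omp for schedule(dynamic)  /* other threads workshare on the CPU;               */
    for (int i = 0; i < n_host_chunks; i++)   /* the offloading thread joins when it is done */
        do_host_chunk(i);
}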
Thread Placement
• Thread placement may be controlled with the following environment variable
• KMP_AFFINITY=<type>
Asynchronous Offload and Data Transfer
• signal() and wait()
• Available async functionality • offload
• offload_transfer
• offload_wait
• Examples #pragma offload target(mic:0) signal(flg1)
#pragma offload_transfer target(mic:0) signal(flg2) wait(flg1)
#pragma offload_wait target(mic:0) wait(flg1)
Signal, Wait and tag
• Examples #pragma offload_transfer target(mic:0) \
signal(tagA) wait(tag0, tag1) ...
• Do not start transfer until the operations signaling tag0 and tag1 are complete
• Upon completion, indicate completion using tagA
#pragma offload target(mic:0) \
signal(tagB) wait(tag2, tag3, tag4) ...
• Do not start until the operations signaling tag2, tag3 and tag4 are complete
• Upon completion, indicate completion using tagB
Persistence of Pointer Data
• Default treatment of in/out variables in a #pragma offload statement
• At the start of an offload: space is allocated on the coprocessor, and in variables are transferred to the coprocessor
• At the end of an offload: out variables are transferred from the coprocessor, and space for both types (as well as inout) is deallocated on the coprocessor
• This behavior can be modified
• free_if(boolean) controls space deallocation on the coprocessor at the end of the offload
• alloc_if(boolean) controls space allocation on the coprocessor at the start of the offload
• Use nocopy rather than in/out/inout to indicate that the variable's value is reused from a previous offload or is only relevant within this offload section
Persistence of Pointer Data: Example
• Allocate space on the coprocessor, transfer data to it, and do not release it at the end (persist)
• Use persisting data in subsequent offload code
• At end, transfer data from, and deallocate
__attribute__((target(mic))) static float *A, *B, *C, *C1;
// Transfer matrices A, B, and C to MIC device and
// do not deallocate matrices A and B
#pragma offload target(mic) \
in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
in(A:length(NCOLA * LDA) alloc_if(1) free_if(0)) /* e.g. ALLOC  */ \
in(B:length(NCOLB * LDB) alloc_if(1) free_if(0)) /* and RETAIN  */ \
inout(C:length(N * LDC))
{
sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB,
&beta, C, &LDC);
}
Persistence of Pointer Data: Example
// Transfer matrix C1 to MIC device and reuse matrices A and B
#pragma offload target(mic) \
in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) /* e.g. REUSE  */ \
nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) /* and RETAIN  */ \
inout(C1:length(N * LDC1))
{
sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B,
&LDB, &beta1, C1, &LDC1);
}
// Deallocate A and B on an Intel(R) Xeon Phi(TM) device
#pragma offload target(mic) \
nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(1)) /* e.g. REUSE  */ \
nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(1)) /* and FREE    */ \
inout(C1:length(N * LDC1))
{
x = stuff(C1);
}
Allocation for Parts of Arrays • Allocation of array slices is possible
• the alloc(p[5:1000]) modifier allocates 1000 elements on the coprocessor
• first useable element has index 5, last 1004 (dark blue + orange)
• p[10:100] specifies 100 elements to transfer (orange)
int *p;
// 1000 elements allocated. Data transferred into p[10:100]
#pragma offload … in ( p[10:100] : alloc(p[5:1000]) )
{ … }
[Diagram: array of length 1005; the allocated region starts at the first element p[5], and the transferred slice p[10:100] lies within it.]
source: Tim Cramer, Rechen- und Kommunikationszentrum (RZ)
Double Buffering Example
• Overlap computation and communication
• Generalizes to data domain decomposition
Double Buffering Example
Double Buffering Example
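The double-buffering slides themselves are figures; a minimal sketch of the idea (assumed, not the original code) using asynchronous offload_transfer with signal/wait to overlap the transfer of one block with computation on another:

/* Sketch: the buffer pointers themselves are used as signal tags;
   process_on_mic() is a hypothetical coprocessor function. */
__attribute__((target(mic))) void process_on_mic(float *buf, int n)
{
    for (int i = 0; i < n; i++) buf[i] *= 2.0f;   /* placeholder computation */
}

void pipeline(float *buf0, float *buf1, int n)
{
    /* Start both transfers asynchronously; each signals its tag on completion. */
    #pragma offload_transfer target(mic:0) in(buf0:length(n) alloc_if(1) free_if(0)) signal(buf0)
    #pragma offload_transfer target(mic:0) in(buf1:length(n) alloc_if(1) free_if(0)) signal(buf1)

    /* Compute on block 0 as soon as it has arrived; block 1 keeps streaming in meanwhile. */
    #pragma offload target(mic:0) wait(buf0) nocopy(buf0:length(n) alloc_if(0) free_if(1))
    process_on_mic(buf0, n);

    #pragma offload target(mic:0) wait(buf1) nocopy(buf1:length(n) alloc_if(0) free_if(1))
    process_on_mic(buf1, n);
}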
Intel Math Kernel Library - MKL Usage Models on Intel Xeon Phi Coprocessor
• Automatic Offload • No code changes required • Automatically uses both host and target • Transparent data transfer and execution management
• Compiler Assisted Offload • Explicit controls of data transfer and remote execution using
compiler offload pragmas/directives • Can be used together with Automatic Offload
• Native Execution • Compile and run code directly on the Xeon Phi • Input data is copied to targets in advance - no need to transfer data • homogeneous
Performance: DGEMM on Xeon Phi • Xeon Phis have 60 cores, 4 HW threads/core:
• $ export OMP_NUM_THREADS=240
• First try … and fail: • $./dgemm_offload
• Matrices of size 10000x10000
• 250 Gflops
• Theoretical peak performance is: • peak = (# cores)*(vector size)*(ops/cycle)*(frequency) • peak = 60*8*2*1.052 = 1011 Gflops!
Where are the flops?
Performance: Thread Affinity • Affinity is VERY important on manycore systems. By setting
KMP_AFFINITY performance is significantly affected:
• Scatter affinity: • export ENV_PREFIX=MIC;
• export MIC_KMP_AFFINITY=scatter; ./dgemm_offload
• 250 Gflops
• Compact affinity: • export MIC_KMP_AFFINITY=compact; ./dgemm_offload
• 500 Gflops
• Balanced affinity: • export MIC_KMP_AFFINITY=balanced; ./dgemm_offload
• 500 Gflops
• Balanced affinity was introduced specifically for the MIC
Performance: Alignment and Huge Pages
Huge Pages
For any array allocation bigger than 10MB, use huge pages:
• export MIC_USE_2MB_BUFFERS=10M
• 750 Gflops
Alignment
• As a general rule, we need to align arrays to the vector size:
• 16-byte alignment for SSE processors,
• 32-byte alignment for AVX processors,
• 64-byte alignment for Xeon Phi.
• In offload mode the compiler assumes that array alignment is the same on both the host and the device. We need to change the alignment on the host to match that on the device:
double *A = (double*) \
_mm_malloc(sizeof(double)*size_A, Alignment);
Alignment = 16: 750 Gflops; Alignment = 64: 780 Gflops
Inspired by: Gilles Fourestey, CSCS
From 37% to 80% of the peak performance
#pragma offload in(a:length(count) align(64))
Performance Expectations
“WOW, 240 hardware threads on a single chip! My application will just rock!” You really believe that?
Remember the limitations!
• In-order cores
• Limited hardware prefetching
• Cores running at 1 GHz only
• Small caches (2 levels)
• Poor single-thread performance
• Small main memory
• PCIe as bottleneck + offload overhead
Offload report lines for the successive DGEMM runs:
[Offload] [MIC 0] [Tag 0] [MIC Time] 8.25 7.89 (seconds) 240 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 7.95 7.33 (seconds) 250 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 7.82 7.65 (seconds) 255 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 4.00 4.00 (seconds) 500 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 4.00 4.00 (seconds) 500 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 2.67 2.57 (seconds) 750 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 2.67 2.58 (seconds) 750 GFLOPS
[Offload] [MIC 0] [Tag 1] [MIC Time] 2.67 (seconds) 750 GFLOPS
[Offload] [MIC 0] [Tag 1] [MIC Time] 2.67 (seconds) 750 GFLOPS
[Offload] [MIC 0] [Tag 1] [MIC Time] 2.57 (seconds) 780 GFLOPS

Corresponding run settings:
MIC_OMP_NUM_THREADS=240, Data alignment = 16
MIC_KMP_AFFINITY="disabled"
MIC_KMP_AFFINITY="none"
MIC_KMP_AFFINITY="scatter"
MIC_KMP_AFFINITY="compact"
MIC_KMP_AFFINITY="balanced"
MIC_USE_2MB_BUFFERS=100M
MIC_OMP_NUM_THREADS=236 MIC_KMP_AFFINITY="compact" Data alignment = 16
Data alignment = 32
Data alignment = 64
Performance: DGEMM on Anselm
Matrix dimension is set to 10 000
[Chart: DGEMM performance with the default thread count vs. leaving one core for the OS.]
Stream benchmark on Anselm (Triad)

Size of input arrays [MB]    Bandwidth [GB/s]
32                           119.3047
64                           142.6699
128                          148.3955
256                          156.2842
512                          159.2646
1024                         158.7783
2048                         159.0215
4096                         159.0313

[Chart: Triad bandwidth (GB/s, 0 to 200) vs. array size (1 to 4096 MB).]
http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad http://www.cs.virginia.edu/stream/stream_mail/2013/0002.html
Compiler Knobs
-mmic: build an application that runs natively on the Intel® Xeon Phi coprocessor
-O3: optimize for maximum speed and enable more aggressive optimizations that may not improve performance on some programs
-openmp: enable the compiler to generate multi-threaded code based on the OpenMP* directives (same as -fopenmp)
-opt-prefetch-distance=64,8: software prefetch 64 cache lines ahead for the L2 cache, 8 cache lines ahead for the L1 cache
-opt-streaming-cache-evict=0: turn off all cache-line evicts
-opt-streaming-stores always: enables generation of streaming stores under the assumption that the application is memory bound
-DSTREAM_ARRAY_SIZE=64000000: increase the array size to be compliant with the STREAM rules
-mcmodel=medium: compiler restricts code to the first 2GB; no memory restriction on data
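Putting these knobs together, a native STREAM build might look like the following (illustrative command line; the stream.c file name is an assumption):

$ icc -mmic -O3 -openmp -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 \
      -opt-streaming-stores always -DSTREAM_ARRAY_SIZE=64000000 -mcmodel=medium \
      stream.c -o stream-mic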
a(i) = b(i) + q*c(i)
// Requires <immintrin.h> (for the _mm512_* intrinsics and _mm_prefetch) and <omp.h>;
// REUSE is a macro (e.g. alloc_if(0) free_if(0)) defined elsewhere in the original source.
static void add(double* l, double* r, double *res, int length)
{
# pragma offload target(mic) in(length) in(l,r,res : REUSE)
{
#ifdef __MIC__
# pragma omp parallel
{
int part = length/omp_get_num_threads();
int start = part*omp_get_thread_num();
double *myl=l+start, *myr=r+start, *myres=res+start;
# pragma noprefetch
for (int L2 = 0; L2+512*1024/8/4 <= part; L2 += 512*1024/8/4)
{
# pragma nofusion
# pragma noprefetch
for (int L1 = 0; L1+32*1024/8/4 <= 512*1024/8/4; L1 += 32*1024/8/4)
{
# pragma nofusion
# pragma noprefetch
for (int cacheline = 0; cacheline+8 <= 32*1024/8/4; cacheline += 8)
{
_mm_prefetch((const char*)(myr+L2+L1+cacheline), _MM_HINT_T1);
_mm_prefetch((const char*)(myl+L2+L1+cacheline), _MM_HINT_T1);
}
# pragma nofusion
# pragma noprefetch
for (int cacheline = 0; cacheline+8 <= 32*1024/8/4; cacheline += 8)
{
_mm_prefetch((const char*)(myr+L2+L1+cacheline), _MM_HINT_T0);
_mm_prefetch((const char*)(myl+L2+L1+cacheline), _MM_HINT_T0);
}
# pragma nofusion
# pragma noprefetch
for (int cacheline = 0; cacheline+8+8+8+8 <= 32*1024/8/4; cacheline += 8+8+8+8)
{
__m512d r0 = _mm512_load_pd(myr+L2+L1+cacheline+0*8); __m512d l0 = _mm512_load_pd(myl+L2+L1+cacheline+0*8);
__m512d r1 = _mm512_load_pd(myr+L2+L1+cacheline+1*8); __m512d l1 = _mm512_load_pd(myl+L2+L1+cacheline+1*8);
_mm512_storenrngo_pd(myres+L2+L1+cacheline+0*8, _mm512_add_pd(r0, l0));
_mm512_storenrngo_pd(myres+L2+L1+cacheline+1*8, _mm512_add_pd(r1, l1));
__m512d r2 = _mm512_load_pd(myr+L2+L1+cacheline+2*8); __m512d l2 = _mm512_load_pd(myl+L2+L1+cacheline+2*8);
__m512d r3 = _mm512_load_pd(myr+L2+L1+cacheline+3*8); __m512d l3 = _mm512_load_pd(myl+L2+L1+cacheline+3*8);
_mm512_storenrngo_pd(myres+L2+L1+cacheline+2*8, _mm512_add_pd(r2, l2));
_mm512_storenrngo_pd(myres+L2+L1+cacheline+3*8, _mm512_add_pd(r3, l3));
}
}
}
}
#endif
}
}
#pragma omp parallel for
for (j=0; j<STREAM_ARRAY_SIZE; j++)
    a[j] = b[j]+scalar*c[j];
Stream triad
Roofline Model for Intel Xeon Phi
• Memory bandwidth measured with the STREAM benchmark is about 157 GB/s
• To reach the peak performance an even mix of multiply and add operations is needed ("fused multiply-add")
• Without SIMD vectorization only 1/16 of the peak performance is achievable
Peak performance of an Intel Xeon Phi coprocessor (1.2 GHz) is
• 1171 GFLOPS
• (1.2 GHz * 16 OPs/cycle * 61 cores)
Roofline Model: SpMV
Roofline model using the read memory bandwidth BW and the theoretical peak performance P.
Model for SpMV y = A * x. Assumptions:
• x, y can be kept in the cache (~ 15 MB)
• A is too big for the caches (~ 3200 MB)
• n << nnz
• Compressed Row Storage (CRS) format: one value (double) and one index (int) element have to be loaded per non-zero (dimension nnz) → 12 bytes
• Operational intensity O = 2 FLOP / 12 byte = 1/6 FLOP/byte (→ memory-bound)
• Performance limit: L = min{P, O * BW}
[Chart: roofline for the Intel Xeon Phi (STREAM 156 GB/s, O = 1/6) and for SNB (2.6 GHz, STREAM 74.2 GB/s, peak 332.8 GFLOPS).]
source: Tim Cramer, Rechen- und Kommunikationszentrum (RZ)
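As a quick sanity check of that limit (my arithmetic from the numbers above): for the Xeon Phi, L = min{1171 GFLOPS, (1/6) FLOP/byte * 156 GB/s} = min{1171, 26} = 26 GFLOPS, i.e. roughly 2% of peak; for the Sandy Bridge node, L = min{332.8, 74.2/6} ≈ 12.4 GFLOPS.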
Accelerators in Anselm and Salomon
Anselm • 4 Nodes - cn[204-207]
• 2x Intel Sandy Bridge E5-2470, 2.3GHz
• 96 GB RAM
• 1x Intel Xeon Phi 5110P, 60 cores, 8 GB RAM
Salomon • 432 Nodes (Perrin) – cns[577-1008]
• 2x Intel Haswell E5-2680v3, 2.5 GHz
• 128 GB RAM
• 2x Xeon Phi 7120P, 61 cores, 16 GB RAM
Differences from Anselm
• Two cards, more powerful model
• Infiniband
• /scratch accessible via Lustre
• Planned: full SW stack for Phi
• Planned: Phis as a “separated cluster”
MIC Programming on Salomon • Environment, PBS, MIC nodes, ….
• Device info
• MIC Programming modes • Native mode
• Compiler assisted offload mode
• Automatic offload mode using MKL library
• MPI and MIC
• Code has to be compiled on a node with MIC / MPSS installed
• Nodes cns577 – cns1008
• Get an interactive session to a node with MIC • $ qsub -I –X -A DD-14-1 -q qprod -lselect=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120, walltime=03:00:00
• To setup the MIC programming environment use: • $ module load intel/2015b
• To get information about MIC accelerator use: • $ micinfo
MIC Programming on Salomon
echo hostname | qsub -q R162356
echo hostname | qsub -q R162356 -g dd-14-1
echo hostname | qsub -q R162356 -W group_list dd-14-1
echo hostname | qsub -q R162356 -W group_list=dd-14-1
MicInfo Utility Log - Created Mon Jul 22 00:23:50 2013
System Info
HOST OS : Linux
OS Version : 2.6.32-504.16.2.el6.x86_64
Driver Version : 3.4.1-1
MPSS Version : 3.4.1
Host Physical Memory : 131930 MB
Device No: 0, Device Name: mic0
Version
...
uOS Version : 2.6.38.8+mpss3.4.1
...
Board
...
SMC HW Revision : Product 300W Passive CS
Cores
Total No of Active Cores : 61
Voltage : 1005000 uV
Frequency : 1238095 kHz
Thermal
...
Die Temp : 59 C
GDDR
...
GDDR Size : 15872 MB
GDDR Technology : GDDR5
GDDR Speed : 5.500000 GT/s
...
Native Mode
Native Mode Example • Edit source:
• $ vim vect-add-short.c
#include <stdio.h>
#include <stdlib.h>   // for rand()
typedef int T;
#define SIZE 20
T in1[SIZE]; T in2[SIZE]; T res[SIZE];
// CPU function to generate a vector of random numbers
void random_T (T *a, int size) {
int i;
for (i = 0; i < size; i++)
a[i] = rand() % 10000; // random number between 0 and 9999
}
int main()
{ int i;
random_T(in1, SIZE); random_T(in2, SIZE);
#pragma omp parallel for
for (i=0; i<SIZE; i++)
res[i] = in1[i] + in2[i];
}
Execution on host:
Compile: $ icc -xHost -fopenmp vect-add-short.c -o vect-add-host
Run: $ export OMP_NUM_THREADS=16
$./vect-add-host
Native Mode Example • Compile on HOST
• $ icc -mmic -fopenmp vect-add-short.c -o vect-add-mic
• Connect to MIC: • ssh mic0 , or
• ssh mic1
• Setup path OpenMP libraries
• mic0 $ export LD_LIBRARY_PATH=/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic:$LD_LIBRARY_PATH
• Set number of OpenMP threads (1 – 240): • mic0 $ export OMP_NUM_THREADS=240
• Run: • mic0 $ ~/path_to_binary/vect-add-mic
List of libraries required for execution of OpenMP parallel code on Intel Xeon Phi:
/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic
libiomp5.so ; libimf.so ; libsvml.so ; libirng.so ; libintlc.so.5
Offload Mode
Basic Offload Mode Example - Pi • Edit source: $ vim source-offload.cpp
• Turn on offload info: • $ export OFFLOAD_REPORT=2
• Compile: • $ icc source-offload.cpp -o bin-offload
• Run: • $ ./bin-offload
#include <iostream>
int main(int argc, char* argv[])
{
const int niter = 100000;
double result = 0;
#pragma offload target(mic)
for (int i = 0; i < niter; ++i) {
const double t = (i + 0.5) / niter;
result += 4.0 / (t * t + 1.0);
}
result /= niter;
std::cout << "Pi ~ " << result << '\n';
}
Implement simultaneous host/coprocessor computing 1.) sequential 2.) concurrent Hint: export OMP_NUM_THREADS=4
Parallel Offload OpenMP Example • Edit source:
• $ vim vect-add.c
#include <stdio.h>
#include <stdlib.h>   // for rand() used by random_T()
typedef int T;
#define SIZE 1000
#pragma offload_attribute(push, target(mic))
T in1[SIZE]; T in2[SIZE]; T res[SIZE];
#pragma offload_attribute(pop)
// MIC function to add two vectors
__attribute__((target(mic))) void add_mic(T *a, T *b, T *c, int size) {
int i = 0;
#pragma omp parallel for
for (i = 0; i < size; i++)
c[i] = a[i] + b[i];
}
// CPU function to add two vectors
void add_cpu (T *a, T *b, T *c, int size) {
int i;
for (i = 0; i < size; i++)
c[i] = a[i] + b[i];
}
Parallel Offload OpenMP Example
int main()
{
int i;
random_T(in1, SIZE);
random_T(in2, SIZE);
#pragma offload target(mic) in(in1,in2) inout(res)
{ // 1. Parallel loop from main function
#pragma omp parallel for
for (i=0; i<SIZE; i++)
res[i] = in1[i] + in2[i];
// 2. ... or parallel loop is called inside the function
add_mic(in1, in2, res, SIZE);
}
//Check the results with CPU implementation
T res_cpu[SIZE];
add_cpu(in1, in2, res_cpu, SIZE);
compare(res, res_cpu, SIZE);
}
Parallel Offload OpenMP Example
__attribute__((target(mic))) void add_mic(T *a, T *b, T *c, int size) {
int i = 0;
#pragma omp parallel for
for (i = 0; i < size; i++)
c[i] = a[i] + b[i]; }
Implement : 1.) Transfer and keep C = A + B 2.) Reuse and keep C = A – B 3.) Reuse and return C = A * B
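A minimal sketch for step 1 (one possible solution, my assumption based on the persistence modifiers introduced earlier): transfer A and B, compute C = A + B, and retain the buffers on the coprocessor so the later steps can reuse them.

// Step 1: transfer and keep C = A + B. free_if(0) keeps in1/in2/res allocated
// on the coprocessor so subsequent offloads can reuse them via nocopy/alloc_if(0).
#pragma offload target(mic) \
    in(in1, in2 : free_if(0)) \
    out(res : free_if(0))
{
    add_mic(in1, in2, res, SIZE);
}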
// CPU function to generate a vector of random numbers
void random_T (T *a, int size) {
int i;
for (i = 0; i < size; i++)
a[i] = rand() % 10000; // random number between 0 and 9999
}
// CPU function to compare two vectors
int compare(T *a, T *b, T size ){
int pass = 0;
int i;
for (i = 0; i < size; i++){
if (a[i] != b[i]) {
printf("Value mismatch at location %d, values %d and %d\n",i, a[i],
b[i]);
pass = 1;
}
}
if (pass == 0) printf ("Test passed\n"); else printf ("Test Failed\n");
return pass;
}
Parallel Offload OpenMP Example
• Remaining functions, copy and paste before main function
Parallel Offload OpenMP Example • Turn on offload info:
• $ export OFFLOAD_REPORT=2
• Compile: • $ icc vect-add.c -openmp_report2 -vec-report2
-o vect-add
• Run: • $ ./vect-add
Some debugging options
openmp_report[0|1|2]
- controls the OpenMP parallelizer diagnostic level
vec-report[0|1|2]
- controls the compiler based vectorization diagnostic level
vect-add.c(60): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
vect-add.c(53): (col. 3) remark: loop was not vectorized: statement cannot
be vectorized.
vect-add.c(54): (col. 3) remark: loop was not vectorized: statement cannot
be vectorized.
vect-add.c(72): (col. 3) remark: LOOP WAS VECTORIZED.
vect-add.c(73): (col. 3) remark: loop was not vectorized: existence of
vector dependence.
vect-add.c(61): (col. 5) remark: LOOP WAS VECTORIZED.
vect-add.c(16): (col. 3) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
vect-add.c(17): (col. 5) remark: LOOP WAS VECTORIZED.
vect-add.c(24): (col. 3) remark: LOOP WAS VECTORIZED.
vect-add.c(32): (col. 12) remark: loop was not vectorized: statement cannot
be vectorized.
vect-add.c(39): (col. 3) remark: loop was not vectorized: existence of
vector dependence.
vect-add.c(60): (col. 5) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
vect-add.c(61): (col. 5) remark: *MIC* LOOP WAS VECTORIZED.
vect-add.c(16): (col. 3) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
vect-add.c(17): (col. 5) remark: *MIC* LOOP WAS VECTORIZED.
Parallel Offload OpenMP Example
Automatic Offload using MKL Example • Edit source:
• $ vim sgemm-ao-short.c
As of Intel MKL 11.0.2 only the following functions are enabled for automatic offload:
• Level-3 BLAS functions: ?GEMM (for m,n > 2048, k > 256), ?TRSM (for M,N > 3072), ?TRMM (for M,N > 3072), ?SYMM (for M,N > 2048)
• LAPACK functions: LU (M,N > 8192), QR, Cholesky
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>
#include "mkl.h"
int main(int argc, char **argv)
{
float *A, *B, *C; /* Matrices */
MKL_INT N = 2560; /* Matrix dimensions */
MKL_INT LD = N; /* Leading dimension */
int matrix_bytes; /* Matrix size in bytes */
int matrix_elements; /* Matrix size in elements */
float alpha = 1.0, beta = 1.0; /* Scaling factors */
char transa = 'N', transb = 'N'; /* Transposition options */
int i, j; /* Counters */
matrix_elements = N * N;
matrix_bytes = sizeof(float) * matrix_elements;
Automatic Offload using MKL Example
/* Allocate the matrices */
A = malloc(matrix_bytes); B = malloc(matrix_bytes);
C = malloc(matrix_bytes);
/* Initialize the matrices */
for (i = 0; i < matrix_elements; i++) {
    A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
}
printf("Computing SGEMM on the host\n");
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
printf("Enabling Automatic Offload\n");
mkl_mic_enable();
int ndevices = mkl_mic_get_device_count(); // Number of MIC devices
printf("Automatic Offload enabled: %d MIC devices present\n", ndevices);
printf("Computing SGEMM with automatic workdivision\n");
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
/* Free the matrix memory */
free(A); free(B); free(C);
printf("Done\n"); return 0;
}
Automatic Offload using MKL Example
?gemm computes C := alpha*A*B + beta*C
Automatic Offload using MKL Example • Turn on offload info:
• $ export OFFLOAD_REPORT=2
• Compile: • $ icc -mkl sgemm-ao-short.c -o sgemm
• Run: • $ export OMP_NUM_THREADS=24
• $./sgemm
Computing SGEMM on the host
Enabling Automatic Offload
Automatic Offload enabled: 2 MIC devices present
Computing SGEMM with automatic workdivision
[MKL] [MIC --] [AO Function] SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision] 0.44 0.28 0.28
[MKL] [MIC 00] [AO SGEMM CPU Time] 0.278222 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time] 0.086087 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 34078720 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 7864320 bytes
[MKL] [MIC 01] [AO SGEMM CPU Time] 0.278222 seconds
[MKL] [MIC 01] [AO SGEMM MIC Time] 0.097036 seconds
[MKL] [MIC 01] [AO SGEMM CPU->MIC Data] 34078720 bytes
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 7864320 bytes
This example is a simplified version of an example from MKL. The full version can be found here:
$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c
MPI
MPI
• MPI programming models
• Host-only model - all MPI ranks reside on the host. The coprocessors can be used by using offload pragmas. (Using MPI calls inside offloaded code is not supported.)
• Coprocessor-only model - all MPI ranks reside only on the coprocessors.
• Symmetric model - the MPI ranks reside on both the host and the coprocessor. Most general MPI case.
MPI Example • Edit source:
• $ vim mpi-test.c
#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
int rank, size;
int len;
char node[MPI_MAX_PROCESSOR_NAME];
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
MPI_Get_processor_name(node,&len);
printf( "Hello world from process %d of %d on host %s \n", rank, size, node );
MPI_Finalize();
return 0;
}
MPI Example • Get access to 2 nodes with MIC accelerators
• $ qsub -I -A DD-XX-X -q qprod -lselect=2:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120
• Load Intel MPI module and setup host environment • $ module load intel/2015b
• $ export I_MPI_MIC_POSTFIX=-mic
• $ export I_MPI_MIC=1
• Compile on HOST • For both host and mic:
• $ mpiicc -xHost -o mpi-test mpi-test.c
• $ mpiicc -mmic -o mpi-test-mic mpi-test.c
• Setup environment once for all • $ vim ~/.profile
PS1='[\u@\h \W]\$ '
export PATH=/usr/bin:/usr/sbin:/bin:/sbin
#OpenMP
export LD_LIBRARY_PATH=/apps/intel/composer_xe_2013.5.192/compiler/lib/mic:$LD_LIBRARY_PATH
#Intel MPI
export LD_LIBRARY_PATH=/apps/intel/impi/4.1.1.036/mic/lib/:$LD_LIBRARY_PATH
export PATH=/apps/intel/impi/4.1.1.036/mic/bin/:$PATH
PS1='[\u@\h \W]\$ ’
export PATH=/usr/bin:/usr/sbin:/bin:/sbin
#IMPI
export PATH=/apps/all/impi/5.0.3.048-iccifort-2015.3.187-GNU-5.1.0-2.25/mic/bin/:$PATH
#OpenMP (ICC, IFORT), IMKL and IMPI
export LD_LIBRARY_PATH=/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/mkl/lib/mic:/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/lib/mic:/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic:$LD_LIBRARY_PATH
Execution on Host or single MIC • Host-only model
• $ mpirun -np 4 ./mpi-test
• Coprocessor-only model executed from MIC • $ ssh mic0
• $ mpirun -np 4 ~/path_to_binary/mpi-test-mic
• Coprocessor-only model executed from HOST • $ export I_MPI_MIC=1
• $ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -host mic0 –n 2 ~/path_to_binary/mpi-test-mic : -host mic1 –n 2 ~/path_to_binary/mpi-test-mic
Hello world from process 2 of 4 on host r25u30n719
Hello world from process 0 of 4 on host r25u30n719
Hello world from process 1 of 4 on host r26u31n712
Hello world from process 3 of 4 on host r26u31n712
Hello world from process 0 of 4 on host r25u30n719-mic0
Hello world from process 1 of 4 on host r25u30n719-mic0
Hello world from process 2 of 4 on host r25u30n719-mic0
Hello world from process 3 of 4 on host r25u30n719-mic0
Hello world from process 1 of 4 on host r25u30n719-mic0
Hello world from process 0 of 4 on host r25u30n719-mic0
Hello world from process 2 of 4 on host r25u30n719-mic1
Hello world from process 3 of 4 on host r25u30n719-mic1
PBS Generated Node-Files
• PBS generates a set of node-files • Host only node-file:
• /lscratch/${PBS_JOBID}/nodefile-cn
• MIC only node-file:
• /lscratch/${PBS_JOBID}/nodefile-mic
• Host and MIC node-file:
• /lscratch/${PBS_JOBID}/nodefile-mix
• Each host or accelerator is listed only once per file
per node using "-n" parameter of the mpirun command
MPI Tuning
• For best performance, use:
• This ensures that MPI inside a node will use SHMEM communication, between the host and the Phi the IB SCIF will be used, and between different nodes or Phis on different nodes a CCL-Direct proxy will be used.
• Please note: Other FABRICS like tcp,ofa may be used (even combined with shm), but there is a severe loss of performance (by an order of magnitude). Usage of a single DAPL PROVIDER (e.g. I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u) will cause failure of Host<->Phi and/or Phi<->Phi communication. Usage of I_MPI_DAPL_PROVIDER_LIST on a non-accelerated node will cause failure of any MPI communication, since those nodes don't have a SCIF device and there is no CCL-Direct proxy running.
$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
MPI Bandwidth
Bandwidth MB/s
HOST0-HOST1 10672.53
HOST0-MIC0 10196.13
HOST0-MIC1 10398.99
HOST0MIC0-HOST0MIC1 1821.21
HOST0MIC0-HOST1MIC0 5043.12
HOST0MIC0-HOST1MIC1 5654.13
HOST0MIC1-HOST1MIC1 5674.11
Thanks to Filip Stanek
Host-only model on multiple nodes • Use “nodefile-cn” nodefile
• mpirun -n 4 -machinefile
/lscratch/${PBS_JOBID}/nodefile-cn
~/path_to_binary/mpi-test
Hello world from process 2 of 4 on host r38u21n993
Hello world from process 1 of 4 on host r38u22n994
Hello world from process 3 of 4 on host r38u22n994
Hello world from process 0 of 4 on host r38u21n993
Coprocessor-only model on multiple nodes
Hello world from process 2 of 4 on host r38u22n994-mic0
Hello world from process 0 of 4 on host r38u21n993-mic0
Hello world from process 3 of 4 on host r38u22n994-mic1
Hello world from process 1 of 4 on host r38u21n993-mic1
cn204-mic0.bullx cn205-mic0.bullx
• Use “nodefile-mic” nodefile • mpirun -n 4
-machinefile /lscratch/${PBS_JOBID}/nodefile-
mic
~/path_to_binary/mpi-test
mpirun -n 4 -genv I_MPI_FABRICS_LIST tcp -genv I_MPI_FABRICS shm:tcp -genv I_MPI_TCP_NETMASK=10.1.0.0/16 -genv I_MPI_DEBUG 5 -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -machinefile /lscratch/${PBS_JOBID}/nodefile-mic ~/anselm-intro/mic/mpi-test
Symmetric model on multiple nodes
Hello world from process 0 of 4 on host r38u21n993
Hello world from process 1 of 4 on host r38u22n994
Hello world from process 3 of 4 on host r38u21n993-mic1
Hello world from process 2 of 4 on host r38u21n993-mic0
• Use “nodefile-mix” nodefile • export I_MPI_FABRICS=shm:ofa
mpirun -n 4
-machinefile /lscratch/${PBS_JOBID}/nodefile-mix
~/path_to_binary/mpi-test
mpirun -n 4 -genv I_MPI_FABRICS_LIST tcp -genv I_MPI_FABRICS shm:tcp -genv I_MPI_TCP_NETMASK=10.1.0.0/16 -genv I_MPI_DEBUG 5 -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -machinefile /lscratch/${PBS_JOBID}/nodefile-mix ~/anselm-intro/mic/mpi-test
Miscellaneous
• Issues with Xeon Phi
• Debugging – Vtune, DDT
Debugging with Allinea DDT
Debugging with Allinea DDT
• Native Xeon Phi non-MPI Programs
• Native Xeon Phi Intel MPI Programs
• Heterogeneous (host + Xeon Phi) Intel MPI Programs
• Heterogeneous Programs (#pragma offload)
DDT launch method by program type:
• Native Xeon Phi non-MPI Programs: remote
• Native Xeon Phi Intel MPI Programs: remote
• Heterogeneous (host + Xeon Phi) Intel MPI Programs: GUI / offline
• Heterogeneous Programs (#pragma offload): GUI / offline
• Get an interactive session to a node with MIC • $ qsub –X -I -A DD-XX-X -q qmic -lselect=1:ncpus=1, walltime=03:00:00
• To setup the MIC programming environment use: • $ module load intel impi allinea-ddt-map
• Code has to be compiled on a node with MIC / MPSS installed
• Nodes cn204 – cn207
Debugging with Allinea DDT
Native Xeon Phi non-MPI Programs
• Compile with -g and -O0 flags
• icc -g -O0 -mmic -fopenmp vect-add-short.c -o vect-add-mic-debug
• Start DDT on the host (using the host installation of DDT).
• ddt
Native Xeon Phi non-MPI Programs
1. Click the Remote Launch drop-down on the Welcome Page and select Configure...
2. Enter the host name of the Xeon Phi card in the Host Name box
• mic0
3. Select the path to the Xeon Phi installation of DDT in the Installation Directory box.
• /apps/debug/allinea/4.2/
4. Click Test Remote Launch and ensure the settings are correct.
5. Click Ok.
6. Click Run and Debug a Program on the Welcome Page.
7. Select a native Xeon Phi program in the Application box in the Run window.
8. Click Run.
Native Xeon Phi MPI Programs
Native Xeon Phi MPI Programs
• Compile with -g and -O0 flags
• mpiicc -g -O0 -xhost -o mpi-test-debug mpi-test.c
• mpiicc -g -O0 -mmic -o mpi-test-debug-mic mpi-test.c
• Start DDT on the host (using the host installation of DDT).
• ddt
Native Xeon Phi MPI Programs Click the Remote Launch drop-down on the Welcome Page and select Configure...
1. Enter the host name of the Xeon Phi card in the Host Name box
• mic0
2. Select the path to the Xeon Phi installation of DDT in the Installation Directory box.
• /apps/debug/allinea/4.2/
3. Select remote script to initialize the environment on MIC
• ~/.profile
4. Click Test Remote Launch and ensure the settings are correct.
5. Click Ok.
Native Xeon Phi MPI Programs
1. Click Run and Debug a Program on the Welcome Page.
2. Select a native Xeon Phi MPI program in the Application box in the Run window.
DDT should have detected 'Intel MPI (MPMD)' as the MPI implementation in File → Options (DDT → Preferences on Mac OS X) → System.
3. Click Run.
Native Xeon Phi MPI Programs
Heterogeneous Programs with (#pragma offload)
Heterogeneous Programs with (#pragma offload)
• Compile with -g and -O0 flags
• icc -g -O0 vect-add.c -o vect-add-debug
• Start DDT on the host (using the host installation of DDT).
• ddt
Heterogeneous Programs with (#pragma offload)
1. Open the Options window: File → Options
2. Select Intel MPI (MPMD) as the MPI Implementation on the System page.
3. Check the Heterogeneous system support check box on the System page.
4. Click Ok.
5. Ensure Control → Default Breakpoints → Stop on Xeon Phi offload is checked.
Heterogeneous Programs with (#pragma offload)
1. Click Run and Debug a Program on the Welcome Page.
2. Select a heterogeneous program that uses #pragma offload in the Application box in the Run window.
3. Click Run.
Heterogeneous Programs with (#pragma offload)
Heterogeneous (host + Xeon Phi) Intel MPI Programs • Does not work on Anselm yet
Heterogeneous (host + Xeon Phi) Intel MPI Programs
1. Start DDT on the host (using the host installation of DDT).
2. Open the Options window: File → Options
3. Select Intel MPI (MPMD) as the MPI Implementation on the System page.
4. Check the Heterogeneous system support check box on the System page.
5. Click Ok.
Heterogeneous (host + Xeon Phi) Intel MPI Programs
1. Click Run and Debug a Program in the Welcome Page.
2. Select the path to the host executable in the Application box in the Run window.
3. Enter an MPMD style mpiexec command line in the mpiexec Arguments box, e.g.
-np 8 -host micdev /home/user/examples/hello-host : -np 32 -host micdev-mic0 /home/user/examples/hello-mic
4. Set Number of processes to be the total number of processes launched on both the host and Xeon Phi (e.g. 40 for the above mpiexec Arguments line).
5. Add I_MPI_MIC=enable to the Environment Variables box.
6. Click Run. You may need to wait a minute for the Xeon Phi processes to connect.