Introduction to Intel Xeon Phi
This project is funded by Structural Funds of the European Union (ESF) and state budget of the Czech Republic
Martin Stachoň Lubomír Říha
IT4Innovations
Outline
• Accelerators philosophy
• Programming models
• Practical info for Salomon
Historical Analysis
[Figure: performance over time, 1993 to 2011 – vector machines, massively parallel processors (TeraFLOPS with MPPs, 1993-), then MPPs with multicores and heterogeneous accelerators after the end of Moore's Law in clocking: PetaFLOPS with Cell (HPCC, 2008-), with GP-GPUs, and with MIC (2011).]
Trends for Petaflop/s Machines
CPUs: Wider vector units, more cores
• General-purpose in nature
• High single-thread performance, moderate floating point throughput
• 2x E5-2680 - 0.40 Tflop/s, 260W
GPUs: Thousands of very simple stream processors
• Specialized for floating point.
• New programming models: CUDA, OpenCL, OpenACC
• Tesla K40 - 1.43 Tflop/s, 235W
MIC: Take CPU trends to an extreme, optimize for floating point.
• Retain general-purpose nature and programming models from CPU
• Low single-thread performance, high aggregate FP throughput
• Xeon Phi 7120P - 1.24 Tflop/s, 300W
Accelerators in HPC
Hardware Accelerators - Speeding up the Slow Part of the Code
• Enable higher performance through fine-grained parallelism
• Offer higher computational density than CPUs
• Accelerators present heterogeneity!
Main Features
• Coprocessor to the CPU (GPU or MIC)
• PCIe-based interconnection
• Separate memory
• Provides high-bandwidth access to local data
Future: Accelerators only?
Accelerated Execution Model
[Diagram: host CPU (µP) connected to an accelerator (GPU, MIC, FPGA, Cell CBE, …); control and input data are transferred to the device, then output data and control are transferred back to the host.]
• Fine grain computations with the accelerators, others with the CPU
• Interaction between accelerator and CPU can be blocking or asynchronous
• This scenario is replicated across the whole system and standard HPC parallel programming paradigms used for intranode interactions
Processors vs. Accelerators
Accelerators
• tailored for compute-intensive, highly data parallel computation
• many parallel execution units
• have significantly faster and more advanced memory interfaces
• more transistors can be devoted to data processing
• fewer transistors for data caching and flow control
• Very efficient for: fast parallel floating-point processing; high computation per memory access
• Not as efficient for: branching-intensive operations; random-access, memory-intensive operations
[Diagram: a CPU die dominated by control logic and cache with a few ALUs, vs. an accelerator die dominated by many ALUs; both attached to DRAM.]
Processors have few execution units but higher clock speeds.
Intel MIC Architecture
• Up to 61 Cores, 244 Threads
• 512‐bit SIMD instructions
• >1TFLOPS DP-F.P. peak
• Up to 16GB GDDR5 Memory • 352 GB/s peak, but ~170 GB/s measured
• PCIe 2.0 x16 - 5.0 GT/s, 16-bit
• Data Cache • L1 32KB/core • L2 512KB/core, 30.5 MB/chip
• Up to 300W TDP (card)
• Linux* operating system • IP addressable - coprocessor becomes a network node • Common x86/IA (no binary compatibility) • Programming Models and SW-Tools
Intel MIC Architecture Overview
TD – cache Tag Directory
Based on what Intel learned from – Larrabee – SCC – TeraFlops Research Chip
Memory – up to 16 GB of GDDR5
• used for everything, including the OS image and the local filesystem
• multiple memory controllers on the card, accessed over a shared ring bus; cores compete for access
• transfers to/from the card go over PCIe, so maximum speeds are around 7 GB/s
Core Architecture Overview
• 61 cores • Full support for the x86 instruction set • In order execution • Coherent caches (per core)
• 32KB L1 instruction and data caches. • 512KB shared L2 data/instruction
• Two scalar pipelines • Scalar Unit based on Pentium® processors • Dual issue with scalar instructions • Pipelined one-per-clock scalar throughput • One pipeline is limited in functionality
• SIMD Vector Processing Engine • only significant difference to the Pentium • 512-bit vector processing unit (VPU) • 32 512-bit wide vector registers
• 4 hardware threads per core • 4-clock latency, hidden by round-robin scheduling of threads • Cannot issue back-to-back instructions in the same thread: a minimum of two threads per core is needed to achieve full compute potential
Vectorization
Source : Xeon Phi Tutorial Tim Cramer | Rechen- und Kommunikationszentrum
SIMD Vector Capabilities
• MMX: MMX Pentium and Pentium II (PentiumPro didn’t have MMX)
• SSE: Pentium III
• SSE2: Pentium 4
• SSE3: Pentium 4 with 90nm technology
• SSSE3: Core 2 Duo/Quad (65nm technology)
• SSE4.1: Core 2 Duo/Quad (45nm technology)
• SSE4.2: 1st generation Core i7 (45nm, 32nm)
• AVX: 2nd/3rd generation Core i7 (32nm, 22nm)
Vectorization
Source : Xeon Phi Tutorial Tim Cramer | Rechen- und Kommunikationszentrum
SIMD Vector Basic Arithmetic
Vectorization
Source : Xeon Phi Tutorial Tim Cramer | Rechen- und Kommunikationszentrum
SIMD Fused Multiply and Add - FMA
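The arithmetic and FMA slides themselves are figures; as a minimal illustrative sketch (not from the original slides), a multiply-add loop of the following form is what the Intel compiler maps onto 512-bit FMA instructions (8 doubles per instruction) on the Xeon Phi VPU:

/* Sketch: restrict tells the compiler the arrays do not alias, so the loop
   can be vectorized into fused multiply-add instructions. */
void fma_loop(double *restrict a, const double *restrict b,
              const double *restrict c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i] * c[i];   /* one fused multiply-add per element */
}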
MIC vs. CPU
• CPUs designed for all workloads, high single-thread performance
• MIC also general purpose, though optimized for number crunching
• Focus on high aggregate throughput via lots of weaker threads
Regularly achieve >2x performance compared to dual E5 CPUs Single core scalar performance is 1/10th of E5 core
                      MIC (SE10P)   CPU (E5)    MIC is…
Number of cores       61            8           much higher
Clock speed (GHz)     1.01          2.7         lower
SIMD width (bit)      512           256         higher
DP GFLOPS/core        16+           21+         lower
HW threads/core       4             1*          higher
Memory BW (GB/s)      320; GDDR5    100; DDR3   higher
Sustained BW (GB/s)   170           75          higher
High-performance Xeon Phi applications exploit both parallelism and vector processing.
MIC vs. CPU
[Figure: performance (GFLOPS) vs. number of threads (1 to 10000) for CPU and MIC in four regimes: scalar & single thread, vector & single thread, scalar & multiple threads, vector & multiple threads; more parallelism and more vectorization give more performance.]
MIC Programming: Advantages • Intel’s MIC is based on x86 technology
• x86 cores w/ caches and cache coherency • SIMD instruction set • but not binary compatible with host x86 (cross-compilation needed) • Coherent cache (but …)
• Programming for Phi is similar to programming for CPUs • Familiar languages: C/C++ and Fortran • Familiar parallel programming models: OpenMP & MPI • MPI on host and on the coprocessor • Any code can run on MIC, not just kernels
• Optimizing for Phi is similar to optimizing for CPUs • “Optimize once, run anywhere” • Early Phi porting efforts for codes “in the field” have obtained double the performance of Sandy Bridge
Will My Code Run on Xeon Phi? • Yes
• … but that’s the wrong question • Will your code run “best” on Phi?, or • Will you get great Phi performance without additional work?
• Codes port easily • Minutes to days depending mostly on library dependencies
• Performance can require real work • Getting codes to run “at all” is almost too easy • Need to put in the effort to get what you expect
• Scalability • Multiple threads per core is really important • Getting your code to vectorize is really important
Performance Expectations
“WOW, 240 hardware threads on a single chip! My application will just rock!” You really believe that?
Remember the limitations!
• In-order cores
• Limited hardware prefetching
• Cores running at 1 GHz only
• Small caches (2 levels)
• Poor single-thread performance
• Small main memory
• PCIe as bottleneck + offload overhead
... most likely NO …
MIC Programming Considerations
• Getting full performance from the Intel® MIC architecture requires both a high degree of parallelism and vectorization
• Not all code can be written this way • Not all programs make sense on this architecture
• Intel® MIC is different from Xeon • It specializes in running highly parallel and vectorized code. • Not optimized for processing serial code
• Parallelism and vectorization optimizations are beneficial across both architectures
Definition of a Node
• A “node” contains a host component and a MIC component
• host – refers to the Sandy Bridge component
• MIC – refers to one or two Intel Xeon Phi co-processor cards
Spectrum of Programming Models
MPI+Offload Programming Model
MPI ranks on CPUs only
• All messages into/out of host CPUs
• Offload models used to accelerate MPI ranks
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* within Intel® Xeon Phi™ coprocessor
Build an Intel® 64 executable with included offload by using the Intel compiler
Run instances of the MPI application on the host, offloading code onto the coprocessor
Advantages of more cores and wider SIMD for certain applications
Symmetric Programming Model
MPI ranks on both Intel® Xeon Phi™ Architecture and host CPUs
• Messages to/from
any core
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads* used directly within MPI processes
Build binaries by using the respective compilers targeting Intel 64 and Intel Xeon Phi Architecture
Upload the binary to the Intel Xeon Phi coprocessor
Run instances of the MPI application on different mixed nodes
Coprocessor only Programming Model
MPI ranks on Intel® Xeon Phi™ coprocessor only
• All messages into/out of the coprocessors
• Intel® Cilk™ Plus, OpenMP*, Intel® Threading Building Blocks, Pthreads used directly within MPI processes
Build the Intel Xeon Phi coprocessor binary using the Intel® compiler
Upload the binary to the Intel Xeon Phi coprocessor
Run instances of the MPI application on Intel Xeon Phi coprocessor nodes
Offload Programming Model Overview • Programmer designates code sections to offload
• No further programming/API usage is needed
• The compiler and the runtime automatically manage setup/teardown, data transfer, and synchronization
• Add pragmas and new keywords to working host code to make sections of code run on the Intel® Xeon Phi™ coprocessor
• Similar to adding parallelism to serial code using OpenMP* pragmas
• The Intel compiler generates code for both target architectures at once
• The resulting binary runs whether or not a coprocessor is present • Unless you use #pragma offload target(mic:cardnumber)
• The compiler adds code to transfer data automatically to the coprocessor and to start your code running (with no extra coding on your part)
• Hence the term “Heterogeneous Compiler” or “Offload Compiler”
Data Transfer Overview
• The host CPU and the coprocessor do not share physical or virtual memory in hardware
• Two offload data transfer models are available: • 1. Explicit Copy
• Programmer designates variables that need to be copied between host and card in the offload pragma/directive
• Syntax: Pragma/directive-based • C/C++ Example:
• #pragma offload target(mic) in(data:length(size))
• 2. Implicit Copy (only Cilk+) – not covered here
• Programmer marks variables that need to be shared between host and card
• The same variable can then be used in both host and coprocessor code
• Runtime automatically maintains coherence at the beginning and end of offload statements
Data Transfer: C/C++ Offload using Explicit Copies

C/C++ syntax and semantics:
• Offload pragma: #pragma offload <clauses> <statement>
  Allow the next statement to execute on the coprocessor or the host CPU
• Offload transfer: #pragma offload_transfer <clauses>
  Initiates asynchronous data transfer, or initiates and completes synchronous data transfer
• Offload wait: #pragma offload_wait <clauses>
  Specifies a wait for a previously initiated asynchronous activity
• Keyword for variable & function definitions: __attribute__((target(mic)))
  Compile a function for, or allocate a variable on, both CPU and coprocessor
• Entire blocks of code: #pragma offload_attribute(push, target(mic)) … #pragma offload_attribute(pop)
  Mark entire files or large blocks of code for generation on both host CPU and coprocessor
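A minimal sketch (assumed, not from the original slides) of the last two rows: a global array and a helper function made available on both host and coprocessor.

/* Everything between push/pop is compiled for both the host and the coprocessor. */
#pragma offload_attribute(push, target(mic))
double table[1024];                      /* global visible on host and MIC */
double sum_table(int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += table[i];
    return s;
}
#pragma offload_attribute(pop)

/* Equivalent per-symbol marking of a single function: */
__attribute__((target(mic))) double scale(double x) { return 2.0 * x; }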
Conceptual Transformation
OpenMP* examples
Clauses / Modifiers: syntax and semantics
• Target specification: target( name[:card_number] ) selects where to run the construct
• Inputs: in(var-list modifiers_opt) copies data from host to coprocessor
• Outputs: out(var-list modifiers_opt) copies data from coprocessor to host
• Inputs & outputs: inout(var-list modifiers_opt) copies host to coprocessor and back when the offload completes
• Non-copied data: nocopy(var-list modifiers_opt) keeps data local to the target
• Conditional offload: if (condition) takes a boolean expression

Modifiers
• Specify pointer length: length(element-count-expr) copies N elements of the pointer's type
• Control pointer memory allocation: alloc_if( condition ) allocates memory to hold data referenced by the pointer if the condition is TRUE
• Control freeing of pointer memory: free_if( condition ) frees memory used by the pointer if the condition is TRUE
• Control target data alignment: align( expression ) specifies the minimum memory alignment on the target

Variables and pointers are restricted to scalars, structs of scalars, and arrays of scalars
Rules & Limitations
• The Host from/to Coprocessor data types allowed in a simple offload:
• Scalar variables of all types • Must be globals or statics if you wish to use them with nocopy, alloc_if, or free_if (i.e. if they are to persist on the coprocessor between offload calls)
• Structs that are bit-wise copyable (no pointer data members) • Arrays of the above types • Pointers to the above types
• What is allowed within coprocessor code?
• All data types can be used (incl. full C++ objects)
• Any parallel programming technique (Pthreads*, Intel® TBB, OpenMP*, etc.)
• Intel® Xeon Phi™ versions of Intel® MKL
Offload using Explicit Copies: Example
float reduction(float *data, int numberOf)
{
float ret = 0.f;
#pragma offload target(mic) in(data:length(numberOf))
{
#pragma omp parallel for reduction(+:ret)
for (int i=0; i < numberOf; ++i)
ret += data[i];
}
return ret;
}
Note: length(numberOf) copies numberOf elements (numberOf*sizeof(float) bytes) to the coprocessor, not numberOf bytes – the compiler knows data's type
Data Movement
• Default treatment of in/out variables in a #pragma offload statement
• At the start of an offload: • Space is allocated on the coprocessor • in variables are transferred to the coprocessor
• At the end of an offload: • out variables are transferred from the coprocessor • Space for both types (as well as inout) is deallocated on the coprocessor
Heterogeneous Compiler: Reminder of What is Generated
• Note that the compiler generates two binaries:
• The host version • includes all functions/variables in the source code, whether
marked • #pragma offload,
• __attribute__((target(mic))) …….. or not
• The coprocessor version
• includes only functions/variables marked in the source code • #pragma offload,
• __attribute__((target(mic)))
• Linking creates one executable with both binaries included!
Heterogeneous Compiler: Command-line options • “–openmp” is automatically set when you build
• Don’t need –no-offload if compiling only for Xeon • Generates same Xeon only code as previous compilers
• But –no-offload creates smaller binaries
• Most command line arguments set for the host are set for the coprocessor build
• Unless overridden by
–offload-option,mic,xx=”…”
• Add –watch=mic-cmd to display the compiler options automatically passed to the offload compilation
Heterogeneous Compiler: Command-line options cont.
Offload-specific arguments to the Intel® Compiler:
• Generate host only code (by default both host + coprocessor code is generated): -no-offload
• Produce a report of offload data transfers at compile time -opt-report-phase=offload
• Add Intel® Xeon Phi™ compiler switches -offload-options,mic,compiler,“switches”
• Add Intel® Xeon Phi™ assembler switches -offload-options,mic,as:“switches”
• Add Intel® Xeon Phi™ linker switches -offload-options,mic,ld,“switches”
Example:
icc -I/my_dir/include -DMY_DEFINE=10 -offload-options, mic,compiler, “-I/my_dir/mic/include -DMY_DEFINE=20“ hello.c
Passes “-I/my_dir/mic/include -I/my_dir/include -DMY_DEFINE=10 -DMY_DEFINE=20” to the offload compiler
Compiler Support for SIMD Vectorization
Intel auto-vectorizer: combination of loop unrolling and SIMD instructions to get vectorized loops
• No guarantee given, the compiler might need some hints
Compiler feedback: use -vec-report[n] to control the diagnostic information of the vectorizer
• n can be between 0-5 (recommended 3 or 5)
• concentrate on hotspots for optimization
C/C++ aliasing: use the restrict keyword
Intel-specific pragmas:
• #pragma vector (Fortran: !DIR$ VECTOR) indicates to the compiler that the loop should be vectorized
• #pragma simd (Fortran: !DIR$ SIMD) enforces vectorization of the (innermost) loop
SIMD support will be added in OpenMP 4.0
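As a minimal sketch (assumed, not from the slides) of how these hints look in C code, with restrict-qualified pointers and the Intel-specific pragma:

/* restrict tells the compiler the arrays do not alias; #pragma simd asks it
   to vectorize the loop even if its own analysis is inconclusive. */
void triad(double *restrict a, const double *restrict b,
           const double *restrict c, double q, int n)
{
    #pragma simd
    for (int i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}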
Simultaneous Host/Coprocessor Computing
• A #pragma offload statement blocks the thread until the statement completes
• Simultaneous host and coprocessor computing requires multiple threads of execution on the host:
• One or more to block until their #pragma offload statements complete
• Others to do simultaneous processing on the host
• You can use most multithreading APIs to do this • OpenMP* tasks or parallel sections • Pthreads* • Intel® TBB’s parallel_invoke, Intel® Cilk™ Plus, …
Simultaneous Host/Coprocessor Computing - Sequential
[Diagram: the master thread offloads from within an OpenMP single region; the other threads sit idle until the offload finishes, then all threads workshare on the CPU.]
Simultaneous Host/Coprocessor Computing - Concurrent
[Diagram: the offload is issued from an OpenMP single nowait region; the remaining threads immediately workshare on the CPU, and the offloading thread assists once its single region is done.]
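A minimal sketch of the concurrent pattern (assumed, not the original slide code): one thread drives the offload from a single nowait region while the remaining threads process host work; do_mic_work(), do_host_chunk() and n_host_chunks are hypothetical placeholders.

#pragma omp parallel
{
    #pragma omp single nowait          /* one thread issues the offload; the others do not wait */
    {
        #pragma offload target(mic)    /* blocks only this thread; do_mic_work must be
                                          declared __attribute__((target(mic)))            */
        do_mic_work();
    }

    #pragma omp for schedule(dynamic)  /* other threads workshare on the CPU;               */
    for (int i = 0; i < n_host_chunks; i++)   /* the offloading thread joins when it is done */
        do_host_chunk(i);
}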
Thread Placement
• Thread placement may be controlled with the following environment variable
• KMP_AFFINITY=<type>
Asynchronous Offload and Data Transfer
• signal() and wait()
• Available async functionality • offload
• offload_transfer
• offload_wait
• Examples #pragma offload target(mic:0) signal(flg1)
#pragma offload_transfer target(mic:0) signal(flg2) wait(flg1)
#pragma offload_wait target(mic:0) wait(flg1)
Signal, Wait and tag
• Examples #pragma offload_transfer target(mic:0) \
signal(tagA) wait(tag0, tag1) ...
• Do not start transfer until the operations signaling tag0 and tag1 are complete
• Upon completion, indicate completion using tagA
#pragma offload target(mic:0) \
signal(tagB) wait(tag2, tag3, tag4) ...
• Do not start until the operations signaling tag2, tag3 and tag4 are complete
• Upon completion, indicate completion using tagB
Persistence of Pointer Data
• Default treatment of in/out variables in a #pragma offload statement
• At the start of an offload: space is allocated on the coprocessor, and in variables are transferred to the coprocessor
• At the end of an offload: out variables are transferred from the coprocessor, and space for both types (as well as inout) is deallocated on the coprocessor
• This behavior can be modified
• free_if(boolean) controls space deallocation on the coprocessor at the end of the offload
• alloc_if(boolean) controls space allocation on the coprocessor at the start of the offload
• Use nocopy rather than in/out/inout to indicate that the variable's value is reused from a previous offload or is only relevant within this offload section
Persistence of Pointer Data: Example
• Allocate space on the coprocessor, transfer data to it, and do not release it at the end (persist)
• Use persisting data in subsequent offload code
• At end, transfer data from, and deallocate
__attribute__((target(mic))) static float *A, *B, *C, *C1;
// Transfer matrices A, B, and C to MIC device and
// do not deallocate matrices A and B
#pragma offload target(mic) \
in(transa, transb, M, N, K, alpha, beta, LDA, LDB, LDC) \
in(A:length(NCOLA * LDA) alloc_if(1) free_if(0)) /* e.g. ALLOC  */ \
in(B:length(NCOLB * LDB) alloc_if(1) free_if(0)) /* and RETAIN  */ \
inout(C:length(N * LDC))
{
sgemm(&transa, &transb, &M, &N, &K, &alpha, A, &LDA, B, &LDB,
&beta, C, &LDC);
}
Persistence of Pointer Data: Example
// Transfer matrix C1 to MIC device and reuse matrices A and B
#pragma offload target(mic) \
in(transa1, transb1, M, N, K, alpha1, beta1, LDA, LDB, LDC1) \
nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(0)) /* e.g. REUSE  */ \
nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(0)) /* and RETAIN  */ \
inout(C1:length(N * LDC1))
{
sgemm(&transa1, &transb1, &M, &N, &K, &alpha1, A, &LDA, B,
&LDB, &beta1, C1, &LDC1);
}
// Deallocate A and B on an Intel(R) Xeon Phi(TM) device
#pragma offload target(mic) \
nocopy(A:length(NCOLA * LDA) alloc_if(0) free_if(1)) /* e.g. REUSE  */ \
nocopy(B:length(NCOLB * LDB) alloc_if(0) free_if(1)) /* and FREE    */ \
inout(C1:length(N * LDC1))
{
x = stuff(C1);
}
Allocation for Parts of Arrays • Allocation of array slices is possible
• the alloc(p[5:1000]) modifier allocates 1000 elements on the coprocessor
• first useable element has index 5, last 1004 (dark blue + orange)
• p[10:100] specifies 100 elements to transfer (orange)
int *p;
// 1000 elements allocated. Data transferred into p[10:100]
#pragma offload … in ( p[10:100] : alloc(p[5:1000]) )
{ … }
[Diagram: array of length 1005; the allocated region starts at the first element p[5], and the transferred slice p[10:100] lies within it.]
source: Tim Cramer, Rechen- und Kommunikationszentrum (RZ)
Double Buffering Example
• Overlap computation and communication
• Generalizes to data domain decomposition
Double Buffering Example
Double Buffering Example
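The double-buffering slides themselves are figures; a minimal sketch of the idea (assumed, not the original code) using asynchronous offload_transfer with signal/wait to overlap the transfer of one block with computation on another:

/* Sketch: the buffer pointers themselves are used as signal tags;
   process_on_mic() is a hypothetical coprocessor function. */
__attribute__((target(mic))) void process_on_mic(float *buf, int n)
{
    for (int i = 0; i < n; i++) buf[i] *= 2.0f;   /* placeholder computation */
}

void pipeline(float *buf0, float *buf1, int n)
{
    /* Start both transfers asynchronously; each signals its tag on completion. */
    #pragma offload_transfer target(mic:0) in(buf0:length(n) alloc_if(1) free_if(0)) signal(buf0)
    #pragma offload_transfer target(mic:0) in(buf1:length(n) alloc_if(1) free_if(0)) signal(buf1)

    /* Compute on block 0 as soon as it has arrived; block 1 keeps streaming in meanwhile. */
    #pragma offload target(mic:0) wait(buf0) nocopy(buf0:length(n) alloc_if(0) free_if(1))
    process_on_mic(buf0, n);

    #pragma offload target(mic:0) wait(buf1) nocopy(buf1:length(n) alloc_if(0) free_if(1))
    process_on_mic(buf1, n);
}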
Intel Math Kernel Library - MKL Usage Models on Intel Xeon Phi Coprocessor
• Automatic Offload • No code changes required • Automatically uses both host and target • Transparent data transfer and execution management
• Compiler Assisted Offload • Explicit controls of data transfer and remote execution using
compiler offload pragmas/directives • Can be used together with Automatic Offload
• Native Execution • Compile and run code directly on the Xeon Phi • Input data is copied to targets in advance - no need to transfer data • homogeneous
Performance: DGEMM on Xeon Phi • Xeon Phis have 60 cores, 4 HW threads/core:
• $ export OMP_NUM_THREADS=240
• First try … and fail: • $./dgemm_offload
• Matrices of size 10000x10000
• 250 Gflops
• Theoretical peak performance is: • peak = (# cores)*(vector size)*(ops/cycle)*(frequency) • peak = 60*8*2*1.052 = 1011 Gflops!
Where are the flops?
Performance: Thread Affinity • Affinity is VERY important on manycore systems. By setting
KMP_AFFINITY performance is significantly affected:
• Scatter affinity: • export ENV_PREFIX=MIC;
• export MIC_KMP_AFFINITY=scatter; ./dgemm_offload
• 250 Gflops
• Compact affinity: • export MIC_KMP_AFFINITY=compact; ./dgemm_offload
• 500 Gflops
• Balanced affinity: • export MIC_KMP_AFFINITY=balanced; ./dgemm_offload
• 500 Gflops
• Balanced affinity was introduced specifically for the MIC
Performance: Alignment and Huge Pages
Huge Pages
For any array allocation bigger than 10MB, use huge pages:
• export MIC_USE_2MB_BUFFERS=10M
• 750 Gflops
Alignment
• As a general rule, we need to align arrays to the vector size:
• 16-byte alignment for SSE processors,
• 32-byte alignment for AVX processors,
• 64-byte alignment for Xeon Phi.
• In offload mode the compiler assumes that array alignment is the same on both the host and the device. We need to change the alignment on the host to match that on the device:
double *A = (double*) \
_mm_malloc(sizeof(double)*size_A, Alignment);
Alignment = 16: 750 Gflops; Alignment = 64: 780 Gflops
Inspired by: Gilles Fourestey, CSCS
From 37% to 80% of the peak performance
#pragma offload in(a:length(count) align(64))
Performance Expectations
“WOW, 240 hardware threads on a single chip! My application will just rock!” You really believe that?
Remember the limitations!
• In-order cores
• Limited hardware prefetching
• Cores running at 1 GHz only
• Small caches (2 levels)
• Poor single-thread performance
• Small main memory
• PCIe as bottleneck + offload overhead
Offload report lines for the successive DGEMM runs:
[Offload] [MIC 0] [Tag 0] [MIC Time] 8.25 7.89 (seconds) 240 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 7.95 7.33 (seconds) 250 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 7.82 7.65 (seconds) 255 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 4.00 4.00 (seconds) 500 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 4.00 4.00 (seconds) 500 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 2.67 2.57 (seconds) 750 GFLOPS
[Offload] [MIC 0] [Tag 0] [MIC Time] 2.67 2.58 (seconds) 750 GFLOPS
[Offload] [MIC 0] [Tag 1] [MIC Time] 2.67 (seconds) 750 GFLOPS
[Offload] [MIC 0] [Tag 1] [MIC Time] 2.67 (seconds) 750 GFLOPS
[Offload] [MIC 0] [Tag 1] [MIC Time] 2.57 (seconds) 780 GFLOPS

Corresponding run settings:
MIC_OMP_NUM_THREADS=240, Data alignment = 16
MIC_KMP_AFFINITY="disabled"
MIC_KMP_AFFINITY="none"
MIC_KMP_AFFINITY="scatter"
MIC_KMP_AFFINITY="compact"
MIC_KMP_AFFINITY="balanced"
MIC_USE_2MB_BUFFERS=100M
MIC_OMP_NUM_THREADS=236 MIC_KMP_AFFINITY="compact" Data alignment = 16
Data alignment = 32
Data alignment = 64
Performance: DGEMM on Anselm
Matrix dimension is set to 10 000
[Chart: DGEMM performance with the default thread count vs. leaving one core for the OS.]
Stream benchmark on Anselm (Triad)

Size of input arrays [MB]    Bandwidth [GB/s]
32                           119.3047
64                           142.6699
128                          148.3955
256                          156.2842
512                          159.2646
1024                         158.7783
2048                         159.0215
4096                         159.0313

[Chart: Triad bandwidth (GB/s, 0 to 200) vs. array size (1 to 4096 MB).]
http://software.intel.com/en-us/articles/optimizing-memory-bandwidth-on-stream-triad http://www.cs.virginia.edu/stream/stream_mail/2013/0002.html
Compiler Knobs
-mmic: build an application that runs natively on the Intel® Xeon Phi coprocessor
-O3: optimize for maximum speed and enable more aggressive optimizations that may not improve performance on some programs
-openmp: enable the compiler to generate multi-threaded code based on the OpenMP* directives (same as -fopenmp)
-opt-prefetch-distance=64,8: software prefetch 64 cache lines ahead for the L2 cache, 8 cache lines ahead for the L1 cache
-opt-streaming-cache-evict=0: turn off all cache-line evicts
-opt-streaming-stores always: enables generation of streaming stores under the assumption that the application is memory bound
-DSTREAM_ARRAY_SIZE=64000000: increase the array size to be compliant with the STREAM rules
-mcmodel=medium: compiler restricts code to the first 2GB; no memory restriction on data
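Putting these knobs together, a native STREAM build might look like the following (illustrative command line; the stream.c file name is an assumption):

$ icc -mmic -O3 -openmp -opt-prefetch-distance=64,8 -opt-streaming-cache-evict=0 \
      -opt-streaming-stores always -DSTREAM_ARRAY_SIZE=64000000 -mcmodel=medium \
      stream.c -o stream-mic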
a(i) = b(i) + q*c(i)
// Requires <immintrin.h> (for the _mm512_* intrinsics and _mm_prefetch) and <omp.h>;
// REUSE is a macro (e.g. alloc_if(0) free_if(0)) defined elsewhere in the original source.
static void add(double* l, double* r, double *res, int length)
{
# pragma offload target(mic) in(length) in(l,r,res : REUSE)
{
#ifdef __MIC__
# pragma omp parallel
{
int part = length/omp_get_num_threads();
int start = part*omp_get_thread_num();
double *myl=l+start, *myr=r+start, *myres=res+start;
# pragma noprefetch
for (int L2 = 0; L2+512*1024/8/4 <= part; L2 += 512*1024/8/4)
{
# pragma nofusion
# pragma noprefetch
for (int L1 = 0; L1+32*1024/8/4 <= 512*1024/8/4; L1 += 32*1024/8/4)
{
# pragma nofusion
# pragma noprefetch
for (int cacheline = 0; cacheline+8 <= 32*1024/8/4; cacheline += 8)
{
_mm_prefetch((const char*)(myr+L2+L1+cacheline), _MM_HINT_T1);
_mm_prefetch((const char*)(myl+L2+L1+cacheline), _MM_HINT_T1);
}
# pragma nofusion
# pragma noprefetch
for (int cacheline = 0; cacheline+8 <= 32*1024/8/4; cacheline += 8)
{
_mm_prefetch((const char*)(myr+L2+L1+cacheline), _MM_HINT_T0);
_mm_prefetch((const char*)(myl+L2+L1+cacheline), _MM_HINT_T0);
}
# pragma nofusion
# pragma noprefetch
for (int cacheline = 0; cacheline+8+8+8+8 <= 32*1024/8/4; cacheline += 8+8+8+8)
{
__m512d r0 = _mm512_load_pd(myr+L2+L1+cacheline+0*8); __m512d l0 = _mm512_load_pd(myl+L2+L1+cacheline+0*8);
__m512d r1 = _mm512_load_pd(myr+L2+L1+cacheline+1*8); __m512d l1 = _mm512_load_pd(myl+L2+L1+cacheline+1*8);
_mm512_storenrngo_pd(myres+L2+L1+cacheline+0*8, _mm512_add_pd(r0, l0));
_mm512_storenrngo_pd(myres+L2+L1+cacheline+1*8, _mm512_add_pd(r1, l1));
__m512d r2 = _mm512_load_pd(myr+L2+L1+cacheline+2*8); __m512d l2 = _mm512_load_pd(myl+L2+L1+cacheline+2*8);
__m512d r3 = _mm512_load_pd(myr+L2+L1+cacheline+3*8); __m512d l3 = _mm512_load_pd(myl+L2+L1+cacheline+3*8);
_mm512_storenrngo_pd(myres+L2+L1+cacheline+2*8, _mm512_add_pd(r2, l2));
_mm512_storenrngo_pd(myres+L2+L1+cacheline+3*8, _mm512_add_pd(r3, l3));
}
}
}
}
#endif
}
}
#pragma omp parallel for
for (j=0; j<STREAM_ARRAY_SIZE; j++)
    a[j] = b[j]+scalar*c[j];
Stream triad
Roofline Model for Intel Xeon Phi
• Memory bandwidth measured with the STREAM benchmark is about 157 GB/s
• To reach the peak performance an even mix of multiply and add operations is needed ("fused multiply-add")
• Without SIMD vectorization only 1/16 of the peak performance is achievable
Peak performance of an Intel Xeon Phi coprocessor (1.2 GHz) is
• 1171 GFLOPS
• (1.2 GHz * 16 OPs/cycle * 61 cores)
Roofline Model: SpMV
Roofline model using the read memory bandwidth BW and the theoretical peak performance P.
Model for SpMV y = A * x. Assumptions:
• x, y can be kept in the cache (~ 15 MB)
• A is too big for the caches (~ 3200 MB)
• n << nnz
• Compressed Row Storage (CRS) format: one value (double) and one index (int) element have to be loaded per non-zero (dimension nnz) → 12 bytes
• Operational intensity O = 2 FLOP / 12 byte = 1/6 FLOP/byte (→ memory-bound)
• Performance limit: L = min{P, O * BW}
[Chart: roofline for the Intel Xeon Phi (STREAM 156 GB/s, O = 1/6) and for SNB (2.6 GHz, STREAM 74.2 GB/s, peak 332.8 GFLOPS).]
source: Tim Cramer, Rechen- und Kommunikationszentrum (RZ)
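As a quick sanity check of that limit (my arithmetic from the numbers above): for the Xeon Phi, L = min{1171 GFLOPS, (1/6) FLOP/byte * 156 GB/s} = min{1171, 26} = 26 GFLOPS, i.e. roughly 2% of peak; for the Sandy Bridge node, L = min{332.8, 74.2/6} ≈ 12.4 GFLOPS.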
Accelerators in Anselm and Salomon
Anselm • 4 Nodes - cn[204-207]
• 2x Intel Sandy Bridge E5-2470, 2.3GHz
• 96 GB RAM
• 1x Intel Xeon Phi 5110P, 60 cores, 8 GB RAM
Salomon • 432 Nodes (Perrin) – cns[577-1008]
• 2x Intel Haswell E5-2680v3, 2.5 GHz
• 128 GB RAM
• 2x Xeon Phi 7120P, 61 cores, 16 GB RAM
Differences from Anselm
• Two cards, more powerful model
• Infiniband
• /scratch accessible via Lustre
• Planned: full SW stack for Phi
• Planned: Phis as a “separated cluster”
MIC Programming on Salomon • Environment, PBS, MIC nodes, ….
• Device info
• MIC Programming modes • Native mode
• Compiler assisted offload mode
• Automatic offload mode using MKL library
• MPI and MIC
• Code has to be compiled on a node with MIC / MPSS installed
• Nodes cns577 – cns1008
• Get an interactive session to a node with MIC • $ qsub -I –X -A DD-14-1 -q qprod -lselect=1:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120, walltime=03:00:00
• To setup the MIC programming environment use: • $ module load intel/2015b
• To get information about MIC accelerator use: • $ micinfo
MIC Programming on Salomon
echo hostname | qsub -q R162356
echo hostname | qsub -q R162356 -g dd-14-1
echo hostname | qsub -q R162356 -W group_list dd-14-1
echo hostname | qsub -q R162356 -W group_list=dd-14-1
MicInfo Utility Log - Created Mon Jul 22 00:23:50 2013
System Info
HOST OS : Linux
OS Version : 2.6.32-504.16.2.el6.x86_64
Driver Version : 3.4.1-1
MPSS Version : 3.4.1
Host Physical Memory : 131930 MB
Device No: 0, Device Name: mic0
Version
...
uOS Version : 2.6.38.8+mpss3.4.1
...
Board
...
SMC HW Revision : Product 300W Passive CS
Cores
Total No of Active Cores : 61
Voltage : 1005000 uV
Frequency : 1238095 kHz
Thermal
...
Die Temp : 59 C
GDDR
...
GDDR Size : 15872 MB
GDDR Technology : GDDR5
GDDR Speed : 5.500000 GT/s
...
Native Mode
Native Mode Example • Edit source:
• $ vim vect-add-short.c
#include <stdio.h>
#include <stdlib.h>   // for rand()
typedef int T;
#define SIZE 20
T in1[SIZE]; T in2[SIZE]; T res[SIZE];
// CPU function to generate a vector of random numbers
void random_T (T *a, int size) {
int i;
for (i = 0; i < size; i++)
a[i] = rand() % 10000; // random number between 0 and 9999
}
int main()
{ int i;
random_T(in1, SIZE); random_T(in2, SIZE);
#pragma omp parallel for
for (i=0; i<SIZE; i++)
res[i] = in1[i] + in2[i];
}
Execution on host:
Compile: $ icc -xHost -fopenmp vect-add-short.c -o vect-add-host
Run: $ export OMP_NUM_THREADS=16
$./vect-add-host
Native Mode Example • Compile on HOST
• $ icc -mmic -fopenmp vect-add-short.c -o vect-add-mic
• Connect to MIC: • ssh mic0 , or
• ssh mic1
• Setup path OpenMP libraries
• mic0 $ export LD_LIBRARY_PATH=/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic:$LD_LIBRARY_PATH
• Set number of OpenMP threads (1 – 240): • mic0 $ export OMP_NUM_THREADS=240
• Run: • mic0 $ ~/path_to_binary/vect-add-mic
List of libraries required for execution of OpenMP parallel code on Intel Xeon Phi:
/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic
libiomp5.so ; libimf.so ; libsvml.so ; libirng.so ; libintlc.so.5
Offload Mode
Basic Offload Mode Example - Pi • Edit source: $ vim source-offload.cpp
• Turn on offload info: • $ export OFFLOAD_REPORT=2
• Compile: • $ icc source-offload.cpp -o bin-offload
• Run: • $ ./bin-offload
#include <iostream>
int main(int argc, char* argv[])
{
const int niter = 100000;
double result = 0;
#pragma offload target(mic)
for (int i = 0; i < niter; ++i) {
const double t = (i + 0.5) / niter;
result += 4.0 / (t * t + 1.0);
}
result /= niter;
std::cout << "Pi ~ " << result << '\n';
}
Implement simultaneous host/coprocessor computing 1.) sequential 2.) concurrent Hint: export OMP_NUM_THREADS=4
Parallel Offload OpenMP Example • Edit source:
• $ vim vect-add.c
#include <stdio.h>
#include <stdlib.h>   // for rand() used by random_T()
typedef int T;
#define SIZE 1000
#pragma offload_attribute(push, target(mic))
T in1[SIZE]; T in2[SIZE]; T res[SIZE];
#pragma offload_attribute(pop)
// MIC function to add two vectors
__attribute__((target(mic))) void add_mic(T *a, T *b, T *c, int size) {
int i = 0;
#pragma omp parallel for
for (i = 0; i < size; i++)
c[i] = a[i] + b[i];
}
// CPU function to add two vectors
void add_cpu (T *a, T *b, T *c, int size) {
int i;
for (i = 0; i < size; i++)
c[i] = a[i] + b[i];
}
Parallel Offload OpenMP Example
int main()
{
int i;
random_T(in1, SIZE);
random_T(in2, SIZE);
#pragma offload target(mic) in(in1,in2) inout(res)
{ // 1. Parallel loop from main function
#pragma omp parallel for
for (i=0; i<SIZE; i++)
res[i] = in1[i] + in2[i];
// 2. ... or parallel loop is called inside the function
add_mic(in1, in2, res, SIZE);
}
//Check the results with CPU implementation
T res_cpu[SIZE];
add_cpu(in1, in2, res_cpu, SIZE);
compare(res, res_cpu, SIZE);
}
Parallel Offload OpenMP Example
__attribute__((target(mic))) void add_mic(T *a, T *b, T *c, int size) {
int i = 0;
#pragma omp parallel for
for (i = 0; i < size; i++)
c[i] = a[i] + b[i]; }
Implement : 1.) Transfer and keep C = A + B 2.) Reuse and keep C = A – B 3.) Reuse and return C = A * B
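A minimal sketch for step 1 (one possible solution, my assumption based on the persistence modifiers introduced earlier): transfer A and B, compute C = A + B, and retain the buffers on the coprocessor so the later steps can reuse them.

// Step 1: transfer and keep C = A + B. free_if(0) keeps in1/in2/res allocated
// on the coprocessor so subsequent offloads can reuse them via nocopy/alloc_if(0).
#pragma offload target(mic) \
    in(in1, in2 : free_if(0)) \
    out(res : free_if(0))
{
    add_mic(in1, in2, res, SIZE);
}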
// CPU function to generate a vector of random numbers
void random_T (T *a, int size) {
int i;
for (i = 0; i < size; i++)
a[i] = rand() % 10000; // random number between 0 and 9999
}
// CPU function to compare two vectors
int compare(T *a, T *b, T size ){
int pass = 0;
int i;
for (i = 0; i < size; i++){
if (a[i] != b[i]) {
printf("Value mismatch at location %d, values %d and %d\n",i, a[i],
b[i]);
pass = 1;
}
}
if (pass == 0) printf ("Test passed\n"); else printf ("Test Failed\n");
return pass;
}
Parallel Offload OpenMP Example
• Remaining functions, copy and paste before main function
Parallel Offload OpenMP Example • Turn on offload info:
• $ export OFFLOAD_REPORT=2
• Compile: • $ icc vect-add.c -openmp_report2 -vec-report2
-o vect-add
• Run: • $ ./vect-add
Some debugging options
openmp_report[0|1|2]
- controls the OpenMP parallelizer diagnostic level
vec-report[0|1|2]
- controls the compiler based vectorization diagnostic level
vect-add.c(60): (col. 5) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
vect-add.c(53): (col. 3) remark: loop was not vectorized: statement cannot
be vectorized.
vect-add.c(54): (col. 3) remark: loop was not vectorized: statement cannot
be vectorized.
vect-add.c(72): (col. 3) remark: LOOP WAS VECTORIZED.
vect-add.c(73): (col. 3) remark: loop was not vectorized: existence of
vector dependence.
vect-add.c(61): (col. 5) remark: LOOP WAS VECTORIZED.
vect-add.c(16): (col. 3) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
vect-add.c(17): (col. 5) remark: LOOP WAS VECTORIZED.
vect-add.c(24): (col. 3) remark: LOOP WAS VECTORIZED.
vect-add.c(32): (col. 12) remark: loop was not vectorized: statement cannot
be vectorized.
vect-add.c(39): (col. 3) remark: loop was not vectorized: existence of
vector dependence.
vect-add.c(60): (col. 5) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
vect-add.c(61): (col. 5) remark: *MIC* LOOP WAS VECTORIZED.
vect-add.c(16): (col. 3) remark: *MIC* OpenMP DEFINED LOOP WAS PARALLELIZED.
vect-add.c(17): (col. 5) remark: *MIC* LOOP WAS VECTORIZED.
Parallel Offload OpenMP Example
Automatic Offload using MKL Example • Edit source:
• $ vim sgemm-ao-short.c
As of Intel MKL 11.0.2 only the following functions are enabled for automatic offload:
• Level-3 BLAS functions: ?GEMM (for m,n > 2048, k > 256), ?TRSM (for M,N > 3072), ?TRMM (for M,N > 3072), ?SYMM (for M,N > 2048)
• LAPACK functions: LU (M,N > 8192), QR, Cholesky
#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>
#include <stdint.h>
#include "mkl.h"
int main(int argc, char **argv)
{
float *A, *B, *C; /* Matrices */
MKL_INT N = 2560; /* Matrix dimensions */
MKL_INT LD = N; /* Leading dimension */
int matrix_bytes; /* Matrix size in bytes */
int matrix_elements; /* Matrix size in elements */
float alpha = 1.0, beta = 1.0; /* Scaling factors */
char transa = 'N', transb = 'N'; /* Transposition options */
int i, j; /* Counters */
matrix_elements = N * N;
matrix_bytes = sizeof(float) * matrix_elements;
Automatic Offload using MKL Example
/* Allocate the matrices */
A = malloc(matrix_bytes); B = malloc(matrix_bytes);
C = malloc(matrix_bytes);
/* Initialize the matrices */
for (i = 0; i < matrix_elements; i++) {
    A[i] = 1.0; B[i] = 2.0; C[i] = 0.0;
}
printf("Computing SGEMM on the host\n");
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
printf("Enabling Automatic Offload\n");
mkl_mic_enable();
int ndevices = mkl_mic_get_device_count(); // Number of MIC devices
printf("Automatic Offload enabled: %d MIC devices present\n", ndevices);
printf("Computing SGEMM with automatic workdivision\n");
sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
/* Free the matrix memory */
free(A); free(B); free(C);
printf("Done\n"); return 0;
}
Automatic Offload using MKL Example
?gemm computes C := alpha*A*B + beta*C
Automatic Offload using MKL Example • Turn on offload info:
• $ export OFFLOAD_REPORT=2
• Compile: • $ icc -mkl sgemm-ao-short.c -o sgemm
• Run: • $ export OMP_NUM_THREADS=24
• $./sgemm
Computing SGEMM on the host
Enabling Automatic Offload
Automatic Offload enabled: 2 MIC devices present
Computing SGEMM with automatic workdivision
[MKL] [MIC --] [AO Function] SGEMM
[MKL] [MIC --] [AO SGEMM Workdivision] 0.44 0.28 0.28
[MKL] [MIC 00] [AO SGEMM CPU Time] 0.278222 seconds
[MKL] [MIC 00] [AO SGEMM MIC Time] 0.086087 seconds
[MKL] [MIC 00] [AO SGEMM CPU->MIC Data] 34078720 bytes
[MKL] [MIC 00] [AO SGEMM MIC->CPU Data] 7864320 bytes
[MKL] [MIC 01] [AO SGEMM CPU Time] 0.278222 seconds
[MKL] [MIC 01] [AO SGEMM MIC Time] 0.097036 seconds
[MKL] [MIC 01] [AO SGEMM CPU->MIC Data] 34078720 bytes
[MKL] [MIC 01] [AO SGEMM MIC->CPU Data] 7864320 bytes
This example is a simplified version of an example from MKL. The full version can be found here:
$MKL_EXAMPLES/mic_ao/blasc/source/sgemm.c
MPI
MPI
• MPI programming models
• Host-only model - all MPI ranks reside on the host. The coprocessors can be used by using offload pragmas. (Using MPI calls inside offloaded code is not supported.)
• Coprocessor-only model - all MPI ranks reside only on the coprocessors.
• Symmetric model - the MPI ranks reside on both the host and the coprocessor. Most general MPI case.
MPI Example • Edit source:
• $ vim mpi-test.c
#include <stdio.h>
#include <mpi.h>
int main (int argc, char *argv[])
{
int rank, size;
int len;
char node[MPI_MAX_PROCESSOR_NAME];
MPI_Init (&argc, &argv); /* starts MPI */
MPI_Comm_rank (MPI_COMM_WORLD, &rank); /* get current process id */
MPI_Comm_size (MPI_COMM_WORLD, &size); /* get number of processes */
MPI_Get_processor_name(node,&len);
printf( "Hello world from process %d of %d on host %s \n", rank, size, node );
MPI_Finalize();
return 0;
}
MPI Example • Get access to 2 nodes with MIC accelerators
• $ qsub -I -A DD-XX-X -q qprod -lselect=2:ncpus=24:accelerator=True:naccelerators=2:accelerator_model=phi7120
• Load Intel MPI module and setup host environment • $ module load intel/2015b
• $ export I_MPI_MIC_POSTFIX=-mic
• $ export I_MPI_MIC=1
• Compile on HOST • For both host and mic:
• $ mpiicc -xHost -o mpi-test mpi-test.c
• $ mpiicc -mmic -o mpi-test-mic mpi-test.c
• Setup environment once for all • $ vim ~/.profile
PS1='[\u@\h \W]\$ '
export PATH=/usr/bin:/usr/sbin:/bin:/sbin
#OpenMP
export LD_LIBRARY_PATH=/apps/intel/composer_xe_2013.5.192/compiler/lib/mic:$LD_LIBRARY_PATH
#Intel MPI
export LD_LIBRARY_PATH=/apps/intel/impi/4.1.1.036/mic/lib/:$LD_LIBRARY_PATH
export PATH=/apps/intel/impi/4.1.1.036/mic/bin/:$PATH
PS1='[\u@\h \W]\$ ’
export PATH=/usr/bin:/usr/sbin:/bin:/sbin
#IMPI
export PATH=/apps/all/impi/5.0.3.048-iccifort-2015.3.187-GNU-5.1.0-2.25/mic/bin/:$PATH
#OpenMP (ICC, IFORT), IMKL and IMPI
export LD_LIBRARY_PATH=/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/mkl/lib/mic:/apps/all/imkl/11.2.3.187-iimpi-7.3.5-GNU-5.1.0-2.25/lib/mic:/apps/all/icc/2015.3.187-GNU-5.1.0-2.25/composer_xe_2015.3.187/compiler/lib/mic:$LD_LIBRARY_PATH
Execution on Host or single MIC • Host-only model
• $ mpirun -np 4 ./mpi-test
• Coprocessor-only model executed from MIC • $ ssh mic0
• $ mpirun -np 4 ~/path_to_binary/mpi-test-mic
• Coprocessor-only model executed from HOST • $ export I_MPI_MIC=1
• $ mpirun -genv LD_LIBRARY_PATH $MIC_LD_LIBRARY_PATH -host mic0 –n 2 ~/path_to_binary/mpi-test-mic : -host mic1 –n 2 ~/path_to_binary/mpi-test-mic
Hello world from process 2 of 4 on host r25u30n719
Hello world from process 0 of 4 on host r25u30n719
Hello world from process 1 of 4 on host r26u31n712
Hello world from process 3 of 4 on host r26u31n712
Hello world from process 0 of 4 on host r25u30n719-mic0
Hello world from process 1 of 4 on host r25u30n719-mic0
Hello world from process 2 of 4 on host r25u30n719-mic0
Hello world from process 3 of 4 on host r25u30n719-mic0
Hello world from process 1 of 4 on host r25u30n719-mic0
Hello world from process 0 of 4 on host r25u30n719-mic0
Hello world from process 2 of 4 on host r25u30n719-mic1
Hello world from process 3 of 4 on host r25u30n719-mic1
PBS Generated Node-Files
• PBS generates a set of node-files • Host only node-file:
• /lscratch/${PBS_JOBID}/nodefile-cn
• MIC only node-file:
• /lscratch/${PBS_JOBID}/nodefile-mic
• Host and MIC node-file:
• /lscratch/${PBS_JOBID}/nodefile-mix
• Each host or accelerator is listed only once per file
per node using "-n" parameter of the mpirun command
MPI Tuning
• For best performance, use:
• This ensures that MPI inside a node will use SHMEM communication, between the host and the Phi the IB SCIF will be used, and between different nodes or Phis on different nodes a CCL-Direct proxy will be used.
• Please note: Other FABRICS like tcp,ofa may be used (even combined with shm), but there is a severe loss of performance (by an order of magnitude). Usage of a single DAPL PROVIDER (e.g. I_MPI_DAPL_PROVIDER=ofa-v2-mlx4_0-1u) will cause failure of Host<->Phi and/or Phi<->Phi communication. Usage of I_MPI_DAPL_PROVIDER_LIST on a non-accelerated node will cause failure of any MPI communication, since those nodes don't have a SCIF device and there is no CCL-Direct proxy running.
$ export I_MPI_FABRICS=shm:dapl
$ export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
MPI Bandwidth
Bandwidth MB/s
HOST0-HOST1 10672.53
HOST0-MIC0 10196.13
HOST0-MIC1 10398.99
HOST0MIC0-HOST0MIC1 1821.21
HOST0MIC0-HOST1MIC0 5043.12
HOST0MIC0-HOST1MIC1 5654.13
HOST0MIC1-HOST1MIC1 5674.11
Thanks to Filip Stanek
Host-only model on multiple nodes • Use “nodefile-cn” nodefile
• mpirun -n 4 -machinefile
/lscratch/${PBS_JOBID}/nodefile-cn
~/path_to_binary/mpi-test
Hello world from process 2 of 4 on host r38u21n993
Hello world from process 1 of 4 on host r38u22n994
Hello world from process 3 of 4 on host r38u22n994
Hello world from process 0 of 4 on host r38u21n993
Coprocessor-only model on multiple nodes
Hello world from process 2 of 4 on host r38u22n994-mic0
Hello world from process 0 of 4 on host r38u21n993-mic0
Hello world from process 3 of 4 on host r38u22n994-mic1
Hello world from process 1 of 4 on host r38u21n993-mic1
cn204-mic0.bullx cn205-mic0.bullx
• Use “nodefile-mic” nodefile • mpirun -n 4
-machinefile /lscratch/${PBS_JOBID}/nodefile-
mic
~/path_to_binary/mpi-test
mpirun -n 4 -genv I_MPI_FABRICS_LIST tcp -genv I_MPI_FABRICS shm:tcp -genv I_MPI_TCP_NETMASK=10.1.0.0/16 -genv I_MPI_DEBUG 5 -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -machinefile /lscratch/${PBS_JOBID}/nodefile-mic ~/anselm-intro/mic/mpi-test
Symmetric model on multiple nodes
Hello world from process 0 of 4 on host r38u21n993
Hello world from process 1 of 4 on host r38u22n994
Hello world from process 3 of 4 on host r38u21n993-mic1
Hello world from process 2 of 4 on host r38u21n993-mic0
• Use “nodefile-mix” nodefile • export I_MPI_FABRICS=shm:ofa
mpirun -n 4
-machinefile /lscratch/${PBS_JOBID}/nodefile-mix
~/path_to_binary/mpi-test
mpirun -n 4 -genv I_MPI_FABRICS_LIST tcp -genv I_MPI_FABRICS shm:tcp -genv I_MPI_TCP_NETMASK=10.1.0.0/16 -genv I_MPI_DEBUG 5 -genv LD_LIBRARY_PATH /apps/intel/impi/4.1.1.036/mic/lib/ -machinefile /lscratch/${PBS_JOBID}/nodefile-mix ~/anselm-intro/mic/mpi-test
Miscellaneous
• Issues with Xeon Phi
• Debugging – Vtune, DDT
Debugging with Allinea DDT
Debugging with Allinea DDT
• Native Xeon Phi non-MPI Programs
• Native Xeon Phi Intel MPI Programs
• Heterogeneous (host + Xeon Phi) Intel MPI Programs
• Heterogeneous Programs (#pragma offload)
DDT launch method by program type:
• Native Xeon Phi non-MPI Programs: remote
• Native Xeon Phi Intel MPI Programs: remote
• Heterogeneous (host + Xeon Phi) Intel MPI Programs: GUI / offline
• Heterogeneous Programs (#pragma offload): GUI / offline
• Get an interactive session to a node with MIC • $ qsub –X -I -A DD-XX-X -q qmic -lselect=1:ncpus=1, walltime=03:00:00
• To setup the MIC programming environment use: • $ module load intel impi allinea-ddt-map
• Code has to be compiled on a node with MIC / MPSS installed
• Nodes cn204 – cn207
Debugging with Allinea DDT
Native Xeon Phi non-MPI Programs
• Compile with -g and -O0 flags
• icc -g -O0 -mmic -fopenmp vect-add-short.c -o vect-add-mic-debug
• Start DDT on the host (using the host installation of DDT).
• ddt
Native Xeon Phi non-MPI Programs
1. Click the Remote Launch drop-down on the Welcome Page and select Configure...
2. Enter the host name of the Xeon Phi card in the Host Name box
• mic0
3. Select the path to the Xeon Phi installation of DDT in the Installation Directory box.
• /apps/debug/allinea/4.2/
4. Click Test Remote Launch and ensure the settings are correct.
5. Click Ok.
6. Click Run and Debug a Program on the Welcome Page.
7. Select a native Xeon Phi program in the Application box in the Run window.
8. Click Run.
Native Xeon Phi MPI Programs
Native Xeon Phi MPI Programs
• Compile with -g and -O0 flags
• mpiicc -g -O0 -xhost -o mpi-test-debug mpi-test.c
• mpiicc -g -O0 -mmic -o mpi-test-debug-mic mpi-test.c
• Start DDT on the host (using the host installation of DDT).
• ddt
Native Xeon Phi MPI Programs Click the Remote Launch drop-down on the Welcome Page and select Configure...
1. Enter the host name of the Xeon Phi card in the Host Name box
• mic0
2. Select the path to the Xeon Phi installation of DDT in the Installation Directory box.
• /apps/debug/allinea/4.2/
3. Select remote script to initialize the environment on MIC
• ~/.profile
4. Click Test Remote Launch and ensure the settings are correct.
5. Click Ok.
Native Xeon Phi MPI Programs
1. Click Run and Debug a Program on the Welcome Page.
2. Select a native Xeon Phi MPI program in the Application box in the Run window.
DDT should have detected 'Intel MPI (MPMD)' as the MPI implementation in File → Options (DDT → Preferences on Mac OS X) → System.
3. Click Run.
Native Xeon Phi MPI Programs
Heterogeneous Programs with (#pragma offload)
Heterogeneous Programs with (#pragma offload)
• Compile with -g and -O0 flags
• icc -g -O0 vect-add.c -o vect-add-debug
• Start DDT on the host (using the host installation of DDT).
• ddt
Heterogeneous Programs with (#pragma offload)
1. Open the Options window: File → Options
2. Select Intel MPI (MPMD) as the MPI Implementation on the System page.
3. Check the Heterogeneous system support check box on the System page.
4. Click Ok.
5. Ensure Control → Default Breakpoints → Stop on Xeon Phi offload is checked.
Heterogeneous Programs with (#pragma offload)
1. Click Run and Debug a Program on the Welcome Page.
2. Select a heterogeneous program that uses #pragma offload in the Application box in the Run window.
3. Click Run.
Heterogeneous Programs with (#pragma offload)
Heterogeneous (host + Xeon Phi) Intel MPI Programs • Does not work on Anselm yet
Heterogeneous (host + Xeon Phi) Intel MPI Programs
1. Start DDT on the host (using the host installation of DDT).
2. Open the Options window: File → Options
3. Select Intel MPI (MPMD) as the MPI Implementation on the System page.
4. Check the Heterogeneous system support check box on the System page.
5. Click Ok.
Heterogeneous (host + Xeon Phi) Intel MPI Programs
1. Click Run and Debug a Program in the Welcome Page.
2. Select the path to the host executable in the Application box in the Run window.
3. Enter an MPMD style mpiexec command line in the mpiexec Arguments box, e.g.
-np 8 -host micdev /home/user/examples/hello-host : -np 32 -host micdev-mic0 /home/user/examples/hello-mic
4. Set Number of processes to be the total number of processes launched on both the host and Xeon Phi (e.g. 40 for the above mpiexec Arguments line).
5. Add I_MPI_MIC=enable to the Environment Variables box.
6. Click Run. You may need to wait a minute for the Xeon Phi processes to connect.