
Intel MIC Programming Workshop, Hardware Overview & Native Execution

Ostrava, 7-8.2.2017

Intel MIC Programming Workshop @ Ostrava

Agenda

● Introduction to accelerators in HPC
● Architecture overview of the Intel Xeon Phi products
● Programming models
● Native mode programming and hands-on


Why Do We Need Coprocessors or “Accelerators” in HPC?

• In the past, computers got faster by increasing the core clock frequency, but this has reached its limit, mainly because of power requirements and heat dissipation constraints.

• Today, processor cores are not getting any faster; instead, the number of cores per chip increases and vector registers get wider.

• In HPC, we need chips that provide higher computing performance at lower energy consumption.



• One solution is a heterogeneous system containing both CPUs and “accelerators”, plus other forms of parallelism such as vector instruction support

• Two hardware acceleration options are common: Intel Xeon Phi and Nvidia GPU

• Both are designed to accelerate computation through massive parallelisation and to deliver high-performance, energy-efficient computing in HPC



Accelerators in the Top 500 List

[Figure: accelerators in the Top 500 list; labels: Nvidia GPU, Intel Xeon Phi]


Green500 List for November 2016

All systems in the top 10 are accelerator-based (5 using Xeon Phi).


Architectures Comparison (CPU vs GPU)

CPU

• Large cache and sophisticated flow control minimise latency for arbitrary memory accesses in serial processes.

GPU

• Simple flow control and limited cache.

• More transistors for computing in parallel (up to 17 billion on a Pascal GPU).

• SIMD execution.

[Figure: block diagrams of a CPU (control logic, cache, a few ALUs, DRAM) and a GPU (many ALUs, DRAM)]

                   Intel Xeon CPU E5-2697 v4 “Broadwell”   Nvidia GPU P100
Cores @ Clock      2 x 18 cores @ ≥ 2.3 GHz                56 SMs @ 1.3 GHz
SP Perf./core      ≥ 73.6 GFlop/s                          up to 166 GFlop/s
SP peak            ≥ 2.6 TFlop/s                           up to 9.3 TFlop/s
Transistors/TDP    2 x 7 Billion / 2 x 145 W               14 Billion / 300 W
BW                 2 x 62.5 GB/s                           510 GB/s


Intel Xeon Phi Products

• The first product, Knights Corner (KNC), was released in 2012 and was the first architecture to support 512-bit vectors.

• The second generation, Knights Landing (KNL), was released at ISC16 in Frankfurt. It also supports 512-bit vectors, with a new instruction set called Intel Advanced Vector Extensions 512 (Intel AVX-512).

• KNL has a peak performance of 6 TFLOP/s in single precision, about 3 times that of KNC, helped by its 2 vector processing units (VPUs) per core, twice as many as KNC.


Intel MIC Architecture in Common with Intel Multi-core Xeon CPUs

In common:
• x86 architecture
• C, C++ and Fortran
• Standard parallelisation libraries
• Similar optimisation methods

Intel Xeon Phi (MIC) coprocessor:
• PCIe bus connection
• IP-addressable
• Own Linux version OS
• Full Intel software tool suite
• 8 - 16 GB GDDR5 DRAM (cached)
• Up to 61 x86 (64-bit) cores
• Up to 1 GHz
• 512-bit wide vector registers
• 4-way hyper-threading
• 64-bit addressing
• SSE, AVX and AVX2 are not supported
• Intel Initial Many Core Instructions (IMCI)

Intel Xeon CPU:
• Up to 22 cores/socket
• Up to 3 GHz
• Up to 1.54 TB RAM
• 256-bit AVX vectors
• 2-way hyper-threading


Architectures Comparison

• CPU: general-purpose architecture
• MIC: power-efficient multiprocessor x86 design architecture
• GPU: massively data parallel

[Figure: block diagrams of CPU, MIC and GPU showing control logic, cache, ALUs and DRAM]

Momme Allalen: Intel MIC Programming Workshop, Feb. 7-8, 2017

Knights Corner vs Knights Landing

● Knights Corner
  − Co-processor
  − Binary incompatible with other architectures
  − 1.1 GHz processor
  − 8 GB RAM
  − 22 nm process
  − One 512-bit VPU
  − No support for branch prediction and fast unaligned memory access

● Knights Landing
  − No PCIe, self-hosted
  − Binary compatible with prior Xeon architectures (non-Phi)
  − 1.4 GHz processor
  − Up to 400 GB RAM (with MCDRAM)
  − 14 nm process
  − Two 512-bit VPUs
  − Support for branch prediction and fast unaligned memory access


The Intel Xeon Phi (KNC) Microarchitecture

• A bidirectional ring interconnect connects all the cores, the L2 caches (through a distributed global Tag Directory, TD), the PCIe client logic, the GDDR5 memory controllers, etc.


Network Access

• Network access possible using TCP/IP tools like ssh.

• NFS mounts on the Xeon Phi are supported.

First experiences with the Intel MIC architecture at LRZ, Volker Weinberg and Momme Allalen, inSIDE Vol. 11 No. 2, Autumn 2013


Cache Structure


Intel's Jim Jeffers and James Reinders: Intel Xeon® Phi™ Coprocessor High Performance Programming, the first book on how to program this HPC vector chip, is available on Amazon

Cache line size                      64 B
L1 size                              32 KB data cache, 32 KB instruction cache
L1 latency                           1 cycle
L2 size                              512 KB
L2 ways                              8
L2 latency                           15-30 cycles
Memory → L2 prefetching              hardware and software
L2 → L1 prefetching                  software only
Translation Lookaside Buffer (TLB)   64 pages of size 4 KB (256 KB coverage)
coverage options (L1, data)          8 pages of size 2 MB (16 MB coverage)



Features of the Initial Many Core Instructions (IMCI) Set

512-bit wide registers
• can pack up to eight 64-bit elements (long int, double)
• or up to sixteen 32-bit elements (int, float)

Arithmetic Instructions
• Addition, subtraction and multiplication
• Fused Multiply-Add instruction (FMA)
• Division and reciprocal calculation
• Exponential functions and the power function
• Logarithms (natural, base 2 and base 10)
• Square root, inverse square root, hypotenuse value and cube root
• Trigonometric functions (sin, cos, tan, sinh, cosh, tanh, asin, acos, atan)


Lab 1: Access Salomon


For the Labs: Salomon System Initialisation

● Access:
  ssh -l USERID login1.salomon.it4i.cz

● Load the Intel environment on the host via:
  module load intel impi

● Submit an interactive job via:
  qsub -I -A DD-16-1 -q R553892.isrv5 -l select=1:accelerator=True,walltime=06:00:00 -l

● qstat -a

● qstat -u $USER


Interacting with Intel Xeon Phi Coprocessors

user@host~$ micinfo -listdevices
MicInfo Utility Log
Created Thu Jan 7 09:33:20 2016

List of Available Devices

deviceId | domain | bus# | pciDev# | hardwareId
---------|--------|------|---------|-----------
       0 |      0 |   20 |       0 |   22508086
       1 |      0 |   8b |       0 |   22508086
-------------------------------------------------

user@host~$ micinfo | grep -i cores
Cores
Total No of Active Cores : 61
Cores
Total No of Active Cores : 61


Useful Tools and Files on Coprocessor

● top - display Linux tasks
● ps - report a snapshot of the current processes
● kill - send signals to processes, or list signals
● ifconfig - configure a network interface
● traceroute - print the route packets take to a network host
● mpiexec.hydra - run Intel MPI natively
● /proc/cpuinfo
● /proc/meminfo


Interacting with Intel Xeon Phi Coprocessors

user@host~$ /sbin/lspci | grep -i "co-processor"
20:00.0 Co-processor: Intel Corporation Xeon Phi …. (rev 20)
8b:00.0 Co-processor: Intel Corporation Xeon Phi …. (rev 20)

user@host~$ cat /etc/hosts | grep mic1
user@host~$ cat /etc/hosts | grep mic1-ib0 | wc -l
user@host~$ ssh mic0    (or: ssh mic1)

user@host-mic0~$ ls /
bin boot dev etc home init lib lib64 lrz media mnt proc root sbin sys tmp usr var

• micsmc: a utility for monitoring the physical parameters of Intel Xeon Phi coprocessors: model, memory, core rail temperatures, core frequency, power usage, etc.


Parallelism on the Heterogeneous Systems

Vectorisation - SIMD Parallelism MPI/OpenMP ccNUMA domains Multiple accelerators

[Figure: host node diagram with per-core L1D/L2 caches, shared L3 caches, memory interfaces, memory, PCIe buses and other I/O resources]


MIC Programming Models

• Native Mode
  − Programs started on Xeon Phi
  − Cross-compilation using -mmic
  − User access to Xeon Phi is necessary

• Offload to MIC
  − Offload using OpenMP extensions
  − Automatically offload some routines using MKL
    − MKL Compiler Assisted Offload (CAO)
    − MKL Automatic Offload (AO)

• MPI tasks on Host and MIC
  − Treat the coprocessor like another host
  − MPI only and MPI + X (X may be OpenMP, TBB, Cilk, OpenCL, etc.)

[Figure: Host/MIC diagrams showing main() with #pragma offload target(mic) and a call to myFunction()]


Native Mode

• First ensure that the application is suitable for native execution.
• The application runs entirely on the MIC coprocessor without offload from a host system.
• Compile the application for native execution using the flag: -mmic
• Also build the required libraries for native execution.
• Copy the executable and any dependencies, such as run-time libraries, to the coprocessor.
• Mount file shares on the coprocessor for accessing input data sets and saving output data sets.
• Log in to the coprocessor via console, set up the environment and run the executable.
• You can debug the native application via a debug server running on the coprocessor.


Invocation of the Intel MPI compiler

Language   MPI Compiler   Compiler
C          mpiicc         icc
C++        mpiicpc        icpc
Fortran    mpiifort       ifort


Native Mode

• Compile on the Host:

~$ source $INTEL_BASE/linux/bin/compilervars.sh intel64

~$ icc -mmic hello.c -o hello

~$ ifort -mmic hello.f90 -o hello

• Launch execution from the MIC:

~$ scp hello $HOST-mic0:
~$ ssh $HOST-mic0
~$ ./hello
hello, world



Native Mode: micnativeloadex

The tool automatically transfers the code and its dependent libraries to the coprocessor and executes it from the host:

~$ ./hello
-bash: ./hello: cannot execute binary file

~$ export SINK_LIBRARY_PATH=…/intel/compiler/lib/mic

~$ micnativeloadex ./hello
hello, world

~$ micnativeloadex ./hello -v
hello, world
Remote process returned: 0
Exit reason: SHUTDOWN OK


Native Mode

• Use the Intel compiler flag -mmic

• Remove assembly and unnecessary dependencies

• Use --host=x86_64 to avoid “program does not run” errors from configure

• Example, the GNU Multiple Precision Arithmetic Library (GMP):

user@host~$ wget http://mirror.bibleonline.ru/gnu/gmp/gmp-6.1.0.tar.bz2
user@host~$ tar -xf gmp-6.1.0.tar.bz2
user@host~$ cd gmp-6.1.0
user@host~$ ./configure --prefix=${gmp_build} CC=icc CFLAGS="-mmic" --host=x86_64 --disable-assembly --enable-cxx
user@host~$ make

Lab 2: Native Mode


For the Labs: Salomon System Initialisation

● Load the Intel environment on the host via:
  module load intel

● Submit an interactive job via:
  qsub -I -A DD-16-44 -q R553892.isrv5 -l select=1:ncpus=24:accelerator=True:naccelerators=2 -l walltime=06:00:00

● Exercise code:
  cp -r /home/weinberg/MIC_Workshop/ .

● Exercise sheets and slides online: https://goo.gl/IPBNmK


Important MPI environment variables

● Important paths are already set by the intel module; otherwise use:
  − . $ICC_BASE/bin/compilervars.sh intel64
  − . $MPI_BASE/bin64/mpivars.sh

● Recommended environment on Salomon:
  module load intel
  export I_MPI_HYDRA_BOOTSTRAP=ssh
  export I_MPI_MIC=enable
  export I_MPI_FABRICS=shm:dapl
  export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1
  export MIC_LD_LIBRARY_PATH=$MIC_LD_LIBRARY_PATH:/apps/all/impi/5.1.2.150-iccifort-2016.1.150-GCC-4.9.3-2.25/mic/lib/
  (the last path depends on the installed version)


micinfo and _SC_NPROCESSORS_ONLN

~$ micinfo -listdevices
~$ micinfo | grep -i cores

~$ cat hello.c
#include <stdio.h>
#include <unistd.h>

int main() {
    /* _SC_NPROCESSORS_ONLN: number of logical cores currently online */
    printf("Hello world! I have %ld logical cores.\n", sysconf(_SC_NPROCESSORS_ONLN));
    return 0;
}

~$ icc hello.c -o hello-host && ./hello-host

~$ icc -mmic hello.c -o hello-mic

~$ micnativeloadex ./hello-mic

I_MPI_FABRICS

● The following network fabrics are available for the Intel Xeon Phi coprocessor:

Fabric   Network hardware and software used
shm      Shared memory
tcp      TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB)
ofa      OFA-capable network fabrics, including InfiniBand (through OFED verbs)
dapl     DAPL-capable network fabrics, such as InfiniBand, iWarp, Dolphin, and XPMEM (through DAPL)
ofi      OFI-capable network fabrics, such as Intel True Scale Fabric, Intel Omni-Path, InfiniBand and Ethernet

I_MPI_FABRICS

● The default can be changed by setting the I_MPI_FABRICS environment variable to
  I_MPI_FABRICS=<fabric> or
  I_MPI_FABRICS=<intra-node fabric>:<inter-nodes fabric>

● Intranode: Shared Memory, Internode: DAPL (default on SuperMIC/MUC)
  − export I_MPI_FABRICS=shm:dapl

● Intranode: Shared Memory, Internode: TCP (can be used in case of InfiniBand problems)
  − export I_MPI_FABRICS=shm:tcp


Thank you.
