Upload
others
View
15
Download
0
Embed Size (px)
Citation preview
Intel MIC Programming Workshop,Hardware Overview & Native Execution
1
Ostrava, 7-8.2.2017
Intel MIC Programming Workshop @ Ostrava
● Intro @ accelerators on HPC ● Architecture overview of the Intel Xeon Phi Products ● Programming models ● Native mode programming & hands on
2
Agenda
7-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Why do we Need Coprocessors/or “Accelerators” on HPC?
• In the past, computers got faster by increasing the clock frequency of the core, but this has now reached its limit mainly due to power requirements and heat dissipation restrictions (unmanageable problem)
• Today, processor cores are not getting any faster, but instead the number of cores per chip increases and registers are getting wider
• On HPC, we need a chip that can provide higher computing performance at lower energy
37-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Why do we Need Coprocessors/or “Accelerators” on HPC?
• One solution is a heterogeneous system containing both CPUs and “accelerators”, plus other forms of parallelism such as vector instruction support
• Two types of hardware acceleration options, Intel Xeon Phi and Nvidia GPU • Designed to accelerate a computation through massive parallelisation, and to
deliver the highest performance and energy efficient computing in HPC
47-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Why do we Need Coprocessors/or “Accelerators” on HPC?
57-8 February 2017
Nvidia GPU
Intel Xeon Phi
Intel MIC Programming Workshop @ Ostrava
Accelerators in the Top 500 List
7-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Green500 List for November 2016
7
Allsystemsinthetop10areaccelerator-based(5usingXeonPhi)
7-8 February 2017
Intel Xeon Phi
Intel MIC Programming Workshop @ Ostrava
Architectures Comparison (CPU vs GPU)
• Large cache and sophisticated flow control minimise latency for arbitrary memory access for serial process.
• Simple flow control and limited cache.
• More transistors for computing in parallel (up to 17 billion on Pascal GPU)
• (SIMD)
CPU
ControlALU
Cache
ALU
ALU ALU
DRAM
GPU
DRAM
87-8 February 2017
Intel Xeon CPU E5-2697v4 “Brodwell”
Nvidia GPU P100
Cores @ Clock 2 x 18 cores @ ≥ 2.3 GHz 56SMs @ 1.3 GHzSP Perf./core ≥ 73.6 GFlop/s up to 166 GFlop/s
SP peak ≥ 2.6 TFlop/s up to 9.3 TFlop/s
Transistors/TDP 2x7 Billion /2x145W 14 Billion/300WBW 2 x 62.5 GB/s 510 GB/s
Intel MIC Programming Workshop @ Ostrava
Intel Xeon Phi Products
• The first product was released in 2012 named Knights Corner (KNC) which is the first architecture supporting 512 bit vectors.
• The 2nd generation released at ISC16 in Frankfurt named Knights Landing (KNL) also support 512bit vectors with a new instruction set called Intel Advanced Vector Instructions 512 (Intel AVX-512)
• KNL has a peak performance of 6 TFLOP/s in single precision ~ 3 times what KNC had, due to 2 vector processing units (VPUs) per core, doubled compared to the KNC.
97-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Intel MIC Architecture in common with Intel multi-core Xeon CPUs !
• X86 architecture • C, C++ and Fortran • Standard parallelisation libraries • Similar optimisation methods
• PCIe bus connection • IP-addressable
• Own Linux version OS
• Full Intel software tool suite • 8 - 16 GB GDDR5 DRAM (cached) • Up to 61 x86 (64 bit) cores
• Up to 1 GHz • 512 bit wide vector registers
• 4 way hyper-threading • 64-bit addressing • SSE, AVX or AVX2: are not supported
• Intel Initial Many Core Instructions (IMCI).
107-8 February 2017
• Up to 22 cores/socket • Up to 3 GHz • Up to 1.54 TB RAM • 256 bit AVX vectors • 2-way hyper-threading
Intel MIC Programming Workshop @ Ostrava
GPUMICCPU
Architectures Comparison
General-purpose architecture
Control
Cache
Massively data parallel
ALU
DRAM
ALU
ALU ALU
DRAMDRAM
Power-efficient Multiprocessor X86 design architecture
117-8 February 2017
Momme Allalen: Intel MIC Programming Workshop, Feb. 7-8, 2017
Knights Corner vs Knights Landing
● Knights Corner ● Co-processor ● Binary incompatible with other
architecture ● 1.1 GHz processor ● 8 GB RAM ● 22 nm process ● One 512-bit VPU ● No support for branch prediction and
fast unaligned memory access
● Knights Landing ● No PCIe - Self hosted ● Binary compatible with prior Xeon
architectures (no phi) ● 1.4 GHz processor ● Up to 400 GB RAM (with MCDRAM) ● 14 nm process ● Tow 512-bit VPUs ● Support for branch prediction and
fast unaligned memory access
28/12/16 12
Intel MIC Programming Workshop @ Ostrava
The Intel Xeon Phi (KNC) Microarchitecture
• Bidirectional ring interconnect which connects all the cores, L2 caches through a distributed global Tag Directory (TD), PCIe client logic,
GDDR5 memory controllers …etc.
137-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Network Access
• Network access possible using TCP/IP tools like ssh.
• NFS mounts on Xeon Phi Supported.
First experiences with the Intel MIC architecture at LRZ, Volker Weinberg and Momme Allalen, inSIDE Vol. 11 No.2 Autumn2013
147-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Cache Structure
15
Intel's Jim Jeffers and James Reinders: Intel Xeon® Phi™ Coprocessor High Performance Programming, the first book on how to program this HPC vector chip, is available on Amazon
Cache line size 64BL1 size 32KB data cache, 32KB instruction code
L1 latency 1 cycle
L2 size 512 KB
L2 ways 8
L2 latency 15-30 cycles
Memory à L2 prefetching hardware and software
L2 à L1 prefetching software onlyTranslation Lookside Buffer (TLB) 64 pages of size 4KB (256KB coverage)
Coverage options (L1, data) 8 pages of size 2MB (16MB coverage)
7-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Features of the Initial Many Core Instructions (IMCI) Set
512-bit wide registers • can pack up to eight 64-bit elements (long int, double) • up to sixteen 32-bit elements (int, float)
Arithmetic Instructions • Addition, subtraction and multiplication • Fused Multiply-Add instruction (FMA) • Division and reciprocal calculation • Exponential functions and the power function • Logarithms (natural, base 2 and base 10) • Square root, inverse square root, hypothenuse value and cubic root • Trigonometric functions (sin, cos, tan, sinh, cosh, tanh, asin, acos, atan)
167-8 February 2017
17
Lab1: Access Salomon
Intel MIC Programming Workshop @ Ostrava
For the Labs: Salomon System Initialisation
● Access: ssh -l USERID login1.salomon.it4i.cz ● Load the Intel environment on the host via:
module load intel impi ● Submit an interactive job via
qsub -I -A DD-16-1 -q R553892.isrv5 -l select=1:accelerator=True,walltime=06:00:00 -l
● qstat -a
● qstat -u $USER
7-8 February 2017 18
Intel MIC Programming Workshop @ Ostrava
Interacting with Intel Xeon Phi Coprocessors
user@host~$micinfo –listdevices
MicInfo Utility Log Created Thu Jan 7 09:33:20 2016 List of Available Devices deviceId | domain | bus# | pciDev# | hardwareId ------------|-------------|---------|------------|---------------- 0 | 0 | 20 | 0 | 22508086
1 | 0 | 8b | 0 | 22508086
------------------------------------------------------------------
user@host~$ micinfo | grep -i cores Cores
Total No of Active Cores : 61 Cores
Total No of Active Cores : 61
197-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Useful Tools and Files on Coprocessor
● top - display Linux tasks ● ps - report a snapshot of the current processes. ● kill - send signals to processes, or list signals
● ifconfig - configure a network interface ● traceroute - print the route packets take to network host
● mpiexec.hydra – run Intel MPI natively
● /proc/cpuinfo ● /proc/meminfo
207-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Interacting with Intel Xeon Phi Coprocessors
user@host~$ /sbin/lspci | grep -i "co-processor”
20:00.0 Co-processor: Intel Corporation Xeon Phi …. (rev 20)
8b:00.0 Co-processor: Intel Corporation Xeon Phi …. (rev 20)
user@host~$ cat /etc/hosts | grep mic1
user@host~$ cat /etc/hosts | grep mic1-ib0 | wc –l user@host~$ ssh mic0 or ssh mic1
user@host-mic0~$ ls /bin boot dev etc home init lib lib64 lrz media mnt proc root sbin sys tmp usr var • micsmc a utility for monitoring the physical parameters of Intel Xeon Phi
coprocessors: model, memory, core rail temperatures, core frequency, power usage, etc.
217-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Parallelism on the Heterogeneous Systems
Vectorisation - SIMD Parallelism MPI/OpenMP ccNUMA domains Multiple accelerators
L1DL2
L1DL2
L1DL2
L1DL2
L1DL2
L1DL2
L1DL2
L1DL2
L3L3
Other I/O
12 3
45
6
12
34
PCIe buses Other I/O resources
56
Memory Interface
Memory Memory
Memory Interface
227-8 February 2017
Intel MIC Programming Workshop @ Ostrava
• Native Mode • Programs started on Xeon Phi • Cross-compilation using –mmic • User access to Xeon Phi is necessary
• Offload to MIC • Offload using OpenMP extensions
• Automatically offload some routines using MKL
• MKL Compiled assisted offload (CAO)
• MKL automatic Offload (AO)
• MPI tasks on Host and MIC • Treat the coprocessor like another host
• MPI only and MPI + X (X may be OpenMP, TBB, Cilk, OpenCL…etc.)
MIC Programming Models
myFunction();main() {
#pragma offload target(mic)
}
Host MIC
main() {
#pragma offload target(mic)
}
Host MIC
237-8 February 2017
MIC Programming Models
247-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Native Mode
• First ensure that the application is suitable for native execution. • The application runs entirely on the MIC coprocessor without offload
from a host system. • Compile the application for native execution using the flag: -mmic • Build also the required libraries for native execution. • Copy the executable and any dependencies, such as run time libraries
to the coprocessor. • Mount file shares to the coprocessor for accessing input data sets and
saving output data sets. • Login to the coprocessor via console, setup the environment and run
the executable. • You can debug the native application via a debug server running on
the coprocessor.
257-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Invocation of the Intel MPI compiler
Language MPI Compiler CompilerC mpiicc iccC++ mpiicpc icpcFortran mpiifort ifort
7-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Native Mode
• Compile on the Host:
~$ $INTEL_BASE/linux/bin/compilervars.sh intel64
~$ icpc -mmic hello.c -o hello
~$ifort -mmic hello.f90 -o hello
• Launch execution from the MIC: ~$ scp hello $HOST-mic0: ~$ ssh $HOST-mic0
~$ ./hello
hello, world
277-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Native Mode: micnativeloadex
The tool automatically transfer the code and dependent libraries and execute from the host:
~$ ./hello -bash: ./hello: cannot execute Binary file
~$ export SINK_LIBRARY_PATH=…/intel/compiler/lib/mic
~$ micnativeloadex ./hello
hello, world
~$ micnativeloadex ./hello -v hello, worldRemote process returned: 0Exit reason: SHUTDOWN OK
287-8 February 2017
Intel MIC Programming Workshop @ Ostrava
Native Mode
• Use the Intel compiler flag –mmic
• Remove assembly and unnecessary dependencies
• Use –host=“—host=x86_64” to avoid program does not run errors
• Example, the GNU Multiple Precision Arithmetic Library (GMP):
~$host: wget http://mirror.bibleonline.ru/gnu/gmp/gmp-6.1.0.tar.bz2 ~$host: tar –xf gmp-6.1.0.tar.bz2 ~$host: cd gmp.6.1.0
~$host:./configure --prefix=${gmp_build} CC=icc CFLAGS=“-mmic” –host=x86_64 –disable-assembly --enable-cxx ~$host: make
297-8 February 2017
30
Lab 2: Native Mode
Intel MIC Programming Workshop @ Ostrava
For the Labs: Salomon System Initialisation
● Load the Intel environment on the host via: module load intel
● Submit an interactive job via qsub -I -A DD-16-44 -q R553892.isrv5 -l select=1:ncpus=24:accelerator=True:naccelerators=2 -l walltime=06:00:00
● Exercise Code:cp –r /home/weinberg/MIC_Workshop/ .
● Exercise Sheets + Slides online: https://goo.gl/IPBNmK
7-8 February 2017 31
Intel MIC Programming Workshop @ Ostrava
Important MPI environment variables
● Important Paths are already set by intel module, otherwise use: − . $ICC_BASE/bin/compilervars.sh intel64 − . $MPI_BASE/bin64/mpivars.sh
● Recommended environment on Salomon:module load intel export I_MPI_HYDRA_BOOTSTRAP=ssh export I_MPI_MIC=enable export I_MPI_FABRICS=shm:dapl export I_MPI_DAPL_PROVIDER_LIST=ofa-v2-mlx4_0-1u,ofa-v2-scif0,ofa-v2-mcm-1 export MIC_LD_LIBRARY_PATH = $MIC_LD_LIBRARY_PATH:/apps/all/impi/5.1.2.150-iccifort-2016.1.150-GCC-4.9.3-2.25/mic/lib/ depending on version
7-8 February 2017
Intel MIC Programming Workshop @ Ostrava
micinfo and _SC_NPROCESSORS_ONLN
~$ micinfo –listdevices ~$ micinfo | grep -i cores
~$ cat hello.c #include <stdio.h>
#include <unistd.h>int main(){ printf("Hello world! I have %ld logical cores.\n", sysconf(_SC_NPROCESSORS_ONLN)); } ~$ icc hello.c –o hello-host && ./hello-host
~$ icc –mmic hello.c –o hello-mic
~$ micnativeloadex ./hello-mic
337-8 February 2017
I_MPI_FABRICS
● The following network fabrics are available for the Intel Xeon Phi coprocessor:
Fabric Network Hardware and Software usedshm Shared-memorytcp TCP/IP-capable network fabrics, such as Ethernet
and InfiniBand (through IPoIB)ofa OFA-capable network fabrics including InfiniBand
(through OFED verbs)dapl DAPL–capable network fabrics, such as
InfiniBand, iWarp, Dolphin, and XPMEM (through DAPL)
ofi OFA-capable network fabrics such as Intel True fabric, Intel Omni-Path, InfiniBand and Eternet
Intel MIC Programming Workshop @ Ostrava7-8 February 2017
I_MPI_FABRICS
● The default can be changed by setting the I_MPI_FABRICS environment variable to I_MPI_FABRICS=<fabric> or I_MPI_FABRICS= <intra-node fabric>:<inter-nodes fabric>
● Intranode: Shared Memory, Internode: DAPL(Default on SuperMIC/MUC) − export I_MPI_FABRICS=shm:dapl
● Intranode: Shared Memory, Internode: TCP(Can be used in case of Infiniband problems) − export I_MPI_FABRICS=shm:tcp
Intel MIC Programming Workshop @ Ostrava7-8 February 2017
Intel MIC Programming Workshop @ Ostrava 36
Thank you.
7-8 February 2017