Intel Xeon Phi – Basic Tutorial

© 2013 Regents of the University of Minnesota. All rights reserved.

Evan Bollig and Brent Swartz
1pm, 12/19/2013


Overview
•  Intro to MSI
•  Intro to the MIC Architecture
•  Targeting the Xeon Phi
•  Examples
   –  Automatic Offload
   –  Offload Mode
   –  Native Mode
•  Distributed Jobs
   –  Symmetric MPI


A Quick Introduction to MSI


MSI at a Glance

HPC Resources
•  Koronis
•  Itasca
•  Calhoun
•  Cascade
•  GPUT

Laboratories
•  Biomedical Modeling, Simulation and Design
•  Basic Sciences
•  Life Sciences
•  Scientific Development
•  Remote Visualization

Software
•  Chemical and Physical Sciences
•  Engineering
•  Graphics and Visualization
•  Life Sciences
•  Development Tools

User Services
•  Consulting
•  Tutorials
•  Code Porting
•  Parallelization
•  Visualization


HPC Resources

MSI’s Mission: provide researchers* access to, and support for, HPC resources to facilitate successful and cutting-edge research in all disciplines.

•  Koronis: SGI Altix, 1140 Intel Nehalem cores, 2.96 TB of memory
•  Itasca: Hewlett-Packard 3000BL, 8728 Intel Nehalem cores, 26 TB of memory
•  Calhoun: SGI Altix XE 1300, 1440 Intel Xeon Clovertown cores, 2.8 TB of memory
•  Cascade: 15 Dell compute nodes, 32 Nvidia M2070s (4:1), 8 Nvidia Kepler K20s (2:1), 4 Intel Xeon Phi (1:1, 2:1)
•  GPUT: 4 Exxact Corp GPU blades, 16 Nvidia GeForce GTX 480 (4:1)

* UMN and other MN institutions


Tutorials/Workshops
•  Introductory
   –  Unix, Linux, remote computing, job submission, queue policy
•  Programming & Scientific Computation
   –  Code parallelization, programming languages, math libraries
•  Computational Physics
   –  Fluid dynamics, space physics, structural mechanics, material science
•  Computational Chemistry
   –  Quantum chemistry, classical molecular modeling, drug design, cheminformatics
•  Computational Biology
   –  Structural biology, computational genomics, proteomics, bioinformatics

www.msi.umn.edu/tutorial


Introduction to the MIC Architecture


Fee-fi-fo-fum
•  What’s in a name?
   –  Knights Corner
   –  Many Integrated Core (MIC)
   –  Xeon Phi
   –  Intel 5110P (B1)


PHI architecture
•  PHI hardware is described here:
   http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner


PHI Performance
•  Briefly, PHI performance is described here:
   http://www.intel.com/content/www/us/en/benchmarks/xeon-phi-product-family-performance-brief.html


Phi vs GPU

•  Why the Phi?
   –  x86-64 (Intel 64) instructions
   –  Bandwidth: 320 GB/s
   –  IP Addressable
   –  Code portability
   –  Symmetric Mode
   –  MKL Auto Offload

•  Why the GPU?
   –  Massive following and literature
   –  SIMT
   –  Dynamic Parallelism
   –  OpenCL Drivers
   –  cuBLAS, cuRAND, cuSPARSE, etc.


MSI PHI description
•  An MSI PHI quickstart guide is available here:
   https://www.msi.umn.edu/content/intel-phi-quickstart


Roofline Model

[Figure: two roofline plots of Peak Possible GFLOP/sec (DP, axis 1 to 1024) versus Operational Intensity (FLOPs:Byte, axis 0.0625 to 16).]
•  NVidia K20 and M2070: ceilings at 144 GByte/sec and 515 GFLOP/sec, and at 208 GByte/sec and 1170 GFLOP/sec
•  Intel Xeon Phi 5110P (B1): ceilings at 320 GByte/sec and 1011 GFLOP/sec

Manage performance expectations according to operational intensity (O.I.).
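
The roofline bound behind these plots is min(peak, bandwidth × O.I.). Here is a minimal Python sketch using the peak and bandwidth figures quoted on the slide. The function and dictionary names are illustrative, the pairing of the two GPU ceilings to the M2070 and K20 follows their published specifications, and the O.I. of 0.25 FLOPs/byte is an assumed example value for a memory-bound kernel, not a measurement.

```python
def attainable_gflops(peak_gflops, bandwidth_gbytes, intensity):
    """Roofline bound: limited either by compute peak or by memory traffic."""
    return min(peak_gflops, bandwidth_gbytes * intensity)


def ridge_point(peak_gflops, bandwidth_gbytes):
    """Operational intensity above which a kernel is no longer bandwidth-bound."""
    return peak_gflops / bandwidth_gbytes


# (peak DP GFLOP/sec, memory bandwidth GByte/sec) as quoted on the slide.
DEVICES = {
    "Nvidia M2070": (515.0, 144.0),
    "Nvidia K20": (1170.0, 208.0),
    "Intel Xeon Phi 5110P": (1011.0, 320.0),
}

if __name__ == "__main__":
    oi = 0.25  # assumed example intensity (FLOPs per byte)
    for name, (peak, bw) in DEVICES.items():
        print(f"{name}: at O.I. {oi} at most "
              f"{attainable_gflops(peak, bw, oi):.0f} GFLOP/sec; "
              f"ridge point {ridge_point(peak, bw):.2f} FLOPs/byte")
```

At this intensity the Phi bound is 320 × 0.25 = 80 GFLOP/sec, far below the 1011 GFLOP/sec peak, which is the sense in which O.I. should set expectations.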


Targeting the Xeon Phi


MSI PHI demonstration
•  At MSI, the only compiler that currently has OpenMP 4.0 support is the latest intel/cluster module, loaded using:
   % module load intel/cluster


MSI PHI demonstration
•  Can obtain an interactive PHI node using:
   % qsub -I -lwalltime=4:00:00,nodes=1:ppn=16:phi,pmem=200mb


MSI PHI demonstration
•  Can obtain info about the Phi using:
   % /opt/intel/mic/bin/micinfo
•  As shown in the micinfo output, each of the current 2 Phi nodes has 1 attached Phi coprocessor containing 60 cores, with a frequency of 1.053 GHz, for a peak of 1011 GFLOPS, and 7936 MB of memory.
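
The 1011 GFLOPS peak can be reproduced from the core count and clock reported by micinfo. This sketch assumes each core retires one 512-bit fused multiply-add per cycle (8 double-precision lanes × 2 FLOPs). That per-cycle figure comes from the Knights Corner architecture rather than from the micinfo output, and the function name is illustrative.

```python
def peak_dp_gflops(cores, ghz, simd_lanes=8, flops_per_lane=2):
    """Theoretical double-precision peak in GFLOP/sec.

    simd_lanes=8     : a 512-bit vector holds 8 doubles (assumption)
    flops_per_lane=2 : a fused multiply-add counts as 2 FLOPs (assumption)
    """
    return cores * ghz * simd_lanes * flops_per_lane


if __name__ == "__main__":
    # 60 cores at 1.053 GHz, as reported by micinfo on the slide.
    print(round(peak_dp_gflops(60, 1.053)))  # prints 1011
```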


PHI Execution Mode
•  Phi execution mode figure:
   http://download.intel.com/newsroom/kits/xeon/phi/pdfs/Intel-Xeon-Phi-Coprocessor_ProductBrief.pdf


MKL PHI usage
•  Intel® Math Kernel Library Link Line Advisor (a web tool to help users choose correct link line options):
   http://software.intel.com/sites/products/mkl/
•  “Using Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors” section in the User’s Guide:
   http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm


MKL PHI code examples
•  $MKLROOT/examples/mic_ao
•  $MKLROOT/examples/mic_offload
–  dexp: VML example (vdExp)
–  dgaussian: double precision Gaussian RNG
–  fft: complex-to-complex 1D FFT
–  sexp: VML example (vsExp)
–  sgaussian: single precision Gaussian RNG
–  sgemm: SGEMM example
–  sgemm_f: SGEMM example (Fortran 90)
–  sgemm_reuse: SGEMM with data persistence
–  sgeqrf: QR factorization
–  sgetrf: LU factorization
–  spotrf: Cholesky



PHI Optimization Tips
•  Problem size considerations:
   –  Large problems have more parallelism.
   –  But not too large (8 GB memory on a coprocessor).
   –  FFT prefers power-of-2 sizes.
•  Data alignment consideration:
   –  64-byte alignment for better vectorization.
•  OpenMP thread count and thread affinity:
   –  Avoid thread migration for better data locality.
•  Large (2 MB) pages for memory allocation:
   –  Reduce TLB misses and memory allocation overhead.
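
To make the "not too large" tip concrete, the sketch below estimates the largest square matrix-multiply (three N×N double-precision matrices) that fits in coprocessor memory. It uses the 7936 MB reported by micinfo and treats MB as 2^20 bytes. The `reserve_fraction` set aside for the coprocessor OS and runtime is a hypothetical placeholder, not a measured value, and the function name is illustrative.

```python
import math


def max_square_dgemm_n(memory_mb, n_matrices=3, bytes_per_elem=8,
                       reserve_fraction=0.25):
    """Largest N such that n_matrices N x N arrays fit in the usable memory.

    reserve_fraction is an assumed share of memory kept free for the
    coprocessor OS, runtime, and buffers (hypothetical value).
    """
    usable_bytes = memory_mb * 1024 * 1024 * (1.0 - reserve_fraction)
    max_elements = int(usable_bytes // (n_matrices * bytes_per_elem))
    return math.isqrt(max_elements)


if __name__ == "__main__":
    print(max_square_dgemm_n(7936))                        # prints 16125
    print(max_square_dgemm_n(7936, reserve_fraction=0.0))  # prints 18620
```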


KMP_AFFINITY
•  Pin threads to cores
   –  Compact
   –  Scatter
   –  Balanced
   –  Explicit
   –  None

http://www.cac.cornell.edu/education/training/StampedeJune2013/mic-130618.pdf, Slide 29
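
As a rough picture of the compact, scatter, and balanced settings, the sketch below assigns thread IDs to cores under each policy. It is a simplified model only: it does not call the OpenMP runtime, it ignores granularity and offset modifiers, and the function name and the 4-core, 4-threads-per-core example are illustrative assumptions.

```python
def place_threads(n_threads, n_cores, hw_threads_per_core, policy):
    """Return the core index assigned to each thread ID (simplified model)."""
    if n_threads > n_cores * hw_threads_per_core:
        raise ValueError("more threads than hardware thread contexts")
    if policy == "compact":
        # Fill every hardware thread of a core before moving to the next core.
        return [t // hw_threads_per_core for t in range(n_threads)]
    if policy == "scatter":
        # Deal threads round-robin across the cores.
        return [t % n_cores for t in range(n_threads)]
    if policy == "balanced":
        # Spread evenly over cores, keeping consecutive thread IDs together.
        base, extra = divmod(n_threads, n_cores)
        placement = []
        for core in range(n_cores):
            count = base + (1 if core < extra else 0)
            placement.extend([core] * count)
        return placement
    raise ValueError(f"unsupported policy: {policy}")


if __name__ == "__main__":
    for policy in ("compact", "scatter", "balanced"):
        print(policy, place_threads(8, 4, 4, policy))
```

With 8 threads on 4 cores, compact uses only two cores, scatter separates neighbouring thread IDs, and balanced uses all cores while keeping neighbours together.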


Native Mode

(via MPIrun)


Git Checkout
•  SSH to cascade
•  module load cmake intel/cluster
•  git clone /home/support/public/tutorials/phi_cmake_example.git


Build
•  cd phi_cmake_example
•  mkdir build
•  cd build
•  cmake ..
•  make


Run
•  cd mic_mpi
•  cp ../../mic_mpi/job_simple.pbs .
•  qsub job_simple.pbs


Interactive Mode
•  qsub -I -lwalltime=4:00:00,nodes=1:ppn=16:phi
•  export I_MPI_MIC=enable
•  export I_MPI_MIC_POSTFIX=.mic
•  mpirun -host ${HOSTNAME}-mic0 -np 4 `readlink -f quad.x`


An OpenCL Example

(Research in progress)


What is an RBF?


RBF-FD?
Classical FD: Vandermonde System

Substitute for each
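
For reference, classical FD weights can be obtained by requiring the stencil to differentiate monomials exactly, which yields a Vandermonde system. Below is a minimal sketch in pure Python with exact rational arithmetic. The function name `fd_weights` and the three-node stencil are illustrative and not taken from the slides.

```python
from fractions import Fraction
from math import factorial


def fd_weights(nodes, center, deriv=1):
    """Weights c_j so that sum_j c_j * u(x_j) approximates u^(deriv)(center)."""
    n = len(nodes)
    offsets = [Fraction(x) - Fraction(center) for x in nodes]
    # Row k of the Vandermonde system enforces exactness for (x - center)^k.
    A = [[d ** k for d in offsets] for k in range(n)]
    # The deriv-th derivative of (x - center)^k at center is k! if k == deriv.
    b = [Fraction(factorial(k)) if k == deriv else Fraction(0) for k in range(n)]
    # Gauss-Jordan elimination, swapping rows when a pivot is zero.
    for col in range(n):
        pivot = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(n):
            if r != col and A[r][col] != 0:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * p for a, p in zip(A[r], A[col])]
                b[r] -= factor * b[col]
    return [b[i] / A[i][i] for i in range(n)]


if __name__ == "__main__":
    # Centered first-derivative stencil on nodes -1, 0, 1 (unit spacing).
    print(fd_weights([-1, 0, 1], 0))
```

For nodes -1, 0, 1 this recovers the familiar centered-difference weights -1/2, 0, 1/2.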


RBF-FD?


RBF-FD Stencils


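Sparse Mat-Vec Multiply (SpMV)

$$\mathcal{L}u(x)\big|_{x=x_c} = \frac{du(x_c)}{dx} \approx \sum_{j=1}^{n} c_j\, u(x_j), \qquad \left(\mathcal{L} = \frac{\partial}{\partial x}\right)$$

[Figure labels: $D_x$, $c$, $\mathcal{L}u(x_k)$]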


Sparse Formats

[Figure: one sparse matrix stored three ways: COO (Row, Col, Value arrays), CSR (Row Ptr, Col, Value arrays), and ELL (Col, Value arrays).]
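
The three formats can be sketched in plain Python as follows. The 4×4 matrix is a hypothetical example chosen to have uneven row lengths, not the matrix from the figure. All function names are illustrative, and padding ELL rows with column 0 and value 0 is one common convention assumed here.

```python
def dense_to_coo(A):
    """COO: parallel Row, Col, Value arrays, one entry per nonzero."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(A):
        for j, v in enumerate(row):
            if v != 0:
                rows.append(i)
                cols.append(j)
                vals.append(v)
    return rows, cols, vals


def coo_to_csr(n_rows, rows, cols, vals):
    """CSR: replace Row with Row Ptr offsets (input must be sorted by row)."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]
    return row_ptr, list(cols), list(vals)


def csr_to_ell(row_ptr, cols, vals, pad_col=0):
    """ELL: pad every row to the longest row's nonzero count."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    ell_cols, ell_vals = [], []
    for i in range(n_rows):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        pad = width - (hi - lo)
        ell_cols.append(cols[lo:hi] + [pad_col] * pad)
        ell_vals.append(vals[lo:hi] + [0] * pad)
    return ell_cols, ell_vals


def spmv_csr(row_ptr, cols, vals, x):
    """y = A x using CSR storage."""
    return [sum(vals[k] * x[cols[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]


def spmv_ell(ell_cols, ell_vals, x):
    """y = A x using ELL storage; padded entries contribute zero."""
    return [sum(v * x[c] for c, v in zip(crow, vrow))
            for crow, vrow in zip(ell_cols, ell_vals)]


if __name__ == "__main__":
    A = [[1, 0, 0, 2],
         [0, 3, 0, 0],
         [4, 5, 6, 0],
         [0, 0, 0, 7]]  # hypothetical example matrix
    rows, cols, vals = dense_to_coo(A)
    row_ptr, c, v = coo_to_csr(len(A), rows, cols, vals)
    ell_cols, ell_vals = csr_to_ell(row_ptr, c, v)
    x = [1, 1, 1, 1]
    print(row_ptr, spmv_csr(row_ptr, c, v, x), spmv_ell(ell_cols, ell_vals, x))
```

ELL trades padding for a regular, fixed-width layout, which suits stencil matrices where every row has about the same number of nonzeros.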


ViennaCL Performance
•  GPU to Phi performance is NOT portable.
   1)  OpenCL driver is still BETA!
   2)  Loops vectorize differently


SpMM with MIC Intrinsics

(Content from submitted paper; slides kept separate)


Additional Items


Optimal Mapping of Work to Cores/Accelerators

•  Which programming model is optimal is still an open question.

•  Programming model options for shared memory and accelerators include OpenMP 3.1, OpenMP 4.0 (with accelerator, affinity, and SIMD directives), OpenACC, Nvidia-specific CUDA, and OpenCL.
   –  http://www.hpcwire.com/2013/12/03/compilers-accelerated-programming/


OpenACC
•  OpenACC 2.0 was released this summer:
   –  http://www.openacc-standard.org/
•  Improvements include: procedure calls, nested parallelism, more dynamic data management support, and more.
•  OpenACC 2.0 additions described by PGI's Michael Wolfe at SC13:
   –  http://www.nvidia.com/object/sc13-technology-theater.html
•  PGI will support OpenACC 2.0 starting in Jan 2014, with PGI 14.1.
   –  Current MSI module pgi/13.9 supports OpenACC 1.0 directives.
•  GCC will support OpenACC soon:
   –  http://www.hpcwire.com/2013/11/14/openacc-broadens-appeal-gcc-compiler-support/
   –  OpenACC 2.0 expected in 2014


OpenMP 4.0
•  MSI Intel module intel/cluster/2013 supports OpenMP 4.0, except for combined directives.
   –  http://software.intel.com/en-us/articles/openmp-40-features-in-intel-fortran-composer-xe-2013
•  For more information on OpenMP, see:
   –  http://openmp.org/wp/


Knights Landing
•  Information on Knights Landing, the Intel PHI follow-on due out in 2014/2015:
   –  http://www.theregister.co.uk/2013/06/17/intel_knights_landing_xeon_phi_fabric_interconnects/
   –  http://www.hpcwire.com/2013/11/23/intel-brings-knights-roundtable-sc13/
•  Expect much more memory per Knights Landing socket, and significantly improved memory latency and bandwidth.


Questions?
•  MSI home page
   –  www.msi.umn.edu
•  Software
   –  www.msi.umn.edu/sw
•  Password reset
   –  www.msi.umn.edu/password
•  Tutorials
   –  www.msi.umn.edu/tutorial
•  FAQ
   –  www.msi.umn.edu/support/faq.html
•  MSI help desk is staffed Monday through Friday from 8:30AM to 7:00PM.
•  Walk-in help available in room 569 Walter.
•  Phone 612.626.0802
•  Email [email protected]


Thank You

The University of Minnesota is an equal opportunity educator and employer. This PowerPoint is available in alternative formats upon request. Direct requests to Minnesota Supercomputing Institute, 599 Walter Library, 117 Pleasant St. SE, Minneapolis, Minnesota, 55455, 612-624-0528.
