Intel Xeon Phi – Basic Tutorial


Evan Bollig and Brent Swartz, 1pm, 12/19/2013

© 2013 Regents of the University of Minnesota. All rights reserved.

Overview
•  Intro to MSI
•  Intro to the MIC Architecture
•  Targeting the Xeon Phi
•  Examples
   –  Automatic Offload
   –  Offload Mode
   –  Native Mode
•  Distributed Jobs
   –  Symmetric MPI


A Quick Introduction to MSI


MSI at a Glance

HPC Resources
•  Koronis
•  Itasca
•  Calhoun
•  Cascade
•  GPUT

Laboratories
•  Biomedical Modeling, Simulation and Design
•  Basic Sciences
•  Life Sciences
•  Scientific Development
•  Remote Visualization

Software
•  Chemical and Physical Sciences
•  Engineering
•  Graphics and Visualization
•  Life Sciences
•  Development Tools

User Services
•  Consulting
•  Tutorials
•  Code Porting
•  Parallelization
•  Visualization


HPC Resources

MSI's mission: provide researchers* access to, and support for, HPC resources to facilitate successful and cutting-edge research in all disciplines.

•  Koronis: SGI Altix, 1140 Intel Nehalem cores, 2.96 TB of memory
•  Itasca: Hewlett-Packard 3000BL, 8728 Intel Nehalem cores, 26 TB of memory
•  Calhoun: SGI Altix XE 1300, 1440 Intel Xeon Clovertown cores, 2.8 TB of memory
•  Cascade: 15 Dell compute nodes, 32 NVIDIA M2070s (4:1), 8 NVIDIA Kepler K20s (2:1), 4 Intel Xeon Phi (1:1, 2:1)
•  GPUT: 4 Exxact Corp GPU blades, 16 NVIDIA GeForce GTX 480 (4:1)

* UMN and other MN institutions


Tutorials/Workshops
•  Introductory
   –  Unix, Linux, remote computing, job submission, queue policy
•  Programming & Scientific Computation
   –  Code parallelization, programming languages, math libraries
•  Computational Physics
   –  Fluid dynamics, space physics, structural mechanics, material science
•  Computational Chemistry
   –  Quantum chemistry, classical molecular modeling, drug design, cheminformatics
•  Computational Biology
   –  Structural biology, computational genomics, proteomics, bioinformatics

www.msi.umn.edu/tutorial


Introduction to the MIC Architecture


Fee-fi-fo-fum
•  What's in a name?
   –  Knights Corner
   –  Many Integrated Core (MIC)
   –  Xeon Phi
   –  Intel 5110P (B1)


PHI architecture
•  PHI hardware is described here:
   http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner


PHI Performance
•  Briefly, PHI performance is described here:
   http://www.intel.com/content/www/us/en/benchmarks/xeon-phi-product-family-performance-brief.html


Phi vs GPU
•  Why the Phi?
   –  x86-64 (Intel 64) instructions
   –  Bandwidth: 320 GB/s
   –  IP addressable
   –  Code portability
   –  Symmetric Mode
   –  MKL Automatic Offload
•  Why the GPU?
   –  Massive following and literature
   –  SIMT
   –  Dynamic Parallelism
   –  OpenCL drivers
   –  cuBLAS, cuRAND, cuSPARSE, etc.


MSI PHI description
•  An MSI PHI quickstart guide is available here:
   https://www.msi.umn.edu/content/intel-phi-quickstart


Roofline Model

[Figure: roofline plots of peak possible GFLOP/sec (double precision) versus operational intensity (FLOPs:Byte).
•  NVIDIA M2070: 144 GByte/sec memory ceiling, 515 GFLOP/sec compute ceiling.
•  NVIDIA K20: 208 GByte/sec memory ceiling, 1170 GFLOP/sec compute ceiling.
•  Intel Xeon Phi 5110P (B1): 320 GByte/sec memory ceiling, 1011 GFLOP/sec compute ceiling.]

Manage performance expectations using the operational intensity (O.I.) of your kernel.
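The roofline caps attainable throughput at the lesser of the compute peak and bandwidth times operational intensity:

\[ \mathrm{GFLOP/s} \;\le\; \min\bigl(\mathrm{peak\ GFLOP/s},\ \mathrm{bandwidth} \times \mathrm{O.I.}\bigr) \]

For the Phi 5110P numbers above, a kernel remains bandwidth-bound until its O.I. exceeds roughly 1011 / 320 ≈ 3.2 FLOPs per byte.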


Targeting the Xeon Phi


MSI PHI demonstration
•  At MSI, the only compiler that currently has OpenMP 4.0 support is the latest Intel compiler, loaded via the intel/cluster module:

% module load intel/cluster


MSI PHI demonstration
•  Obtain an interactive PHI node using:

% qsub -I -l walltime=4:00:00,nodes=1:ppn=16:phi,pmem=200mb


MSI PHI demonstration
•  Obtain info about the Phi using:

% /opt/intel/mic/bin/micinfo

•  As shown in the micinfo output, each of the current two Phi nodes has one attached Phi coprocessor containing 60 cores at 1.053 GHz, for a peak of 1011 GFLOPS, and 7936 MB of memory.
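That peak follows from the vector width: each core's 512-bit vector unit can retire 16 double-precision FLOPs per cycle (8-wide fused multiply-add), so

\[ 60\ \text{cores} \times 1.053\ \text{GHz} \times 16\ \text{FLOPs/cycle} \approx 1011\ \text{GFLOPS (DP)}. \]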


PHI Execution Mode
•  Phi execution mode figure:
   http://download.intel.com/newsroom/kits/xeon/phi/pdfs/Intel-Xeon-Phi-Coprocessor_ProductBrief.pdf


MKL PHI usage
•  Intel® Math Kernel Library Link Line Advisor (a web tool to help users choose correct link line options):
   http://software.intel.com/sites/products/mkl/


MKL PHI usage
•  "Using Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors" section in the User's Guide:
   http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm


MKL PHI code examples
•  $MKLROOT/examples/mic_ao
•  $MKLROOT/examples/mic_offload
   –  dexp: VML example (vdExp)
   –  dgaussian: double precision Gaussian RNG
   –  fft: complex-to-complex 1D FFT
   –  sexp: VML example (vsExp)
   –  sgaussian: single precision Gaussian RNG


MKL PHI code examples
   –  sgemm: SGEMM example
   –  sgemm_f: SGEMM example (Fortran 90)
   –  sgemm_reuse: SGEMM with data persistence
   –  sgeqrf: QR factorization
   –  sgetrf: LU factorization
   –  spotrf: Cholesky factorization
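In the spirit of the mic_ao examples, a minimal Automatic Offload sketch follows (assumptions: the MKL 11.x shipped with the intel/cluster module; compile with icc -mkl). AO also works with no code changes at all via % export MKL_MIC_ENABLE=1.

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 4096;  /* AO engages only for sufficiently large GEMMs */
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    if (!A || !B || !C) return 1;
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    if (mkl_mic_enable() != 0)  /* request Automatic Offload at runtime */
        fprintf(stderr, "AO unavailable; MKL runs host-only\n");

    /* C = A*B; MKL decides how much of the work to ship to the coprocessor */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %.1f (expect %.1f)\n", C[0], 2.0 * n);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}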


PHI Optimization Tips
•  Problem size considerations:
   –  Large problems have more parallelism.
   –  But not too large (8 GB memory on a coprocessor).
   –  FFT prefers power-of-2 sizes.


PHI Optimization Tips
•  Data alignment considerations:
   –  Use 64-byte alignment for better vectorization (see the sketch below).
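A minimal sketch of 64-byte-aligned allocation using _mm_malloc/_mm_free, which the Intel (and GNU) compilers provide via immintrin.h:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int n = 1024;
    /* 64-byte alignment matches the Phi's 512-bit (64-byte) vector registers */
    double *x = (double *)_mm_malloc(n * sizeof(double), 64);
    if (!x) return 1;
    for (int i = 0; i < n; i++) x[i] = (double)i;
    printf("64-byte aligned: %s\n", ((uintptr_t)x % 64 == 0) ? "yes" : "no");
    _mm_free(x);
    return 0;
}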


PHI Optimization Tips
•  OpenMP thread count and thread affinity:
   –  Avoid thread migration for better data locality.


PHI Optimization Tips
•  Large (2 MB) pages for memory allocation:
   –  Reduce TLB misses and memory allocation overhead (see the example below).
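For offloaded code, one way to request 2 MB pages is the offload runtime's environment variable; the threshold below is illustrative, and the exact semantics should be checked against the MPSS/compiler documentation:

% export MIC_USE_2MB_BUFFERS=16K

With this setting, offload buffers of 16 KB and larger are backed by 2 MB pages on the coprocessor.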


KMP_AFFINITY
•  Pin threads to cores (see the example below):
   –  Compact
   –  Scatter
   –  Balanced
   –  Explicit
   –  None

http://www.cac.cornell.edu/education/training/StampedeJune2013/mic-130618.pdf (slide 29)
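For example, a native-mode run might pin threads with (the thread count is illustrative; KNC cores expose four hardware threads each):

% export KMP_AFFINITY=balanced
% export OMP_NUM_THREADS=240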


Native Mode (via mpirun)


Git Checkout
•  SSH to cascade
•  module load cmake intel/cluster
•  git clone /home/support/public/tutorials/phi_cmake_example.git


Build
•  cd phi_cmake_example
•  mkdir build
•  cd build
•  cmake ..
•  make


Run
•  cd mic_mpi
•  cp ../../mic_mpi/job_simple.pbs .
•  qsub job_simple.pbs


Interactive Mode
•  qsub -I -l walltime=4:00:00,nodes=1:ppn=16:phi
•  export I_MPI_MIC=enable
•  export I_MPI_MIC_POSTFIX=.mic
•  mpirun -host ${HOSTNAME}-mic0 -np 4 `readlink -f quad.x`
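The I_MPI_MIC_POSTFIX setting makes mpirun launch quad.x.mic on the coprocessor. A sketch of how the two binaries would be built (assuming the Intel MPI compiler wrappers; the tutorial's CMake files take care of this):

% mpiicc -o quad.x quad.c
% mpiicc -mmic -o quad.x.mic quad.c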


An OpenCL Example

(Research in progress)


What is an RBF?
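A radial basis function depends only on the distance from its center node,

\[ \phi_j(x) = \phi\left(\lVert x - x_j \rVert\right), \]

for example the Gaussian \( \phi(r) = e^{-(\varepsilon r)^2} \).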


RBF-FD?

Classical FD: the stencil weights come from solving a Vandermonde system built on polynomial basis functions. RBF-FD: substitute an RBF for each polynomial basis function.



RBF-FD Stencils


Sparse Mat-Vec Multiply (SpMV)

RBF-FD turns applying a differential operator into a sparse matrix-vector product. At each stencil center \( x_c \), the derivative is a weighted sum over the \( n \) stencil nodes:

\[
\mathcal{L}u(x)\Big|_{x=x_c} \;\approx\; \sum_{j=1}^{n} c_j\, u(x_j),
\qquad \text{e.g. } \mathcal{L} = \frac{\partial}{\partial x}
\ \text{ gives }\ \frac{du(x_c)}{dx}.
\]

Collecting the weights \( c_j \) for every stencil center \( x_k \) assembles a sparse differentiation matrix \( D_x \), so \( D_x u \) approximates \( \mathcal{L}u(x_k) \) at all nodes.


Sparse Formats

[Figure: the same small sparse matrix stored in three formats:
•  COO: parallel Row, Col, and Value arrays, one entry per nonzero.
•  CSR: a compressed Row Ptr array plus Col and Value arrays.
•  ELL: fixed-width Col and Value arrays, padded to a uniform row length.]
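Of the three, CSR is the most common starting point. A minimal, generic CSR SpMV kernel (a sketch, not the tutorial's code):

#include <stdio.h>

/* y = A*x; row_ptr has nrows+1 entries, col/val hold the nonzeros row by row */
void spmv_csr(int nrows, const int *row_ptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example: [[4 0 1], [0 2 0], [3 0 5]] */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col[]     = {0, 2, 1, 0, 2};
    double val[]     = {4, 1, 2, 3, 5};
    double x[] = {1, 1, 1}, y[3];
    spmv_csr(3, row_ptr, col, val, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);  /* expect 5 2 8 */
    return 0;
}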


ViennaCL Performance
•  GPU-to-Phi performance is NOT portable:
   1)  The OpenCL driver is still BETA!
   2)  Loops vectorize differently.


SpMM with MIC Intrinsics

(Content from submitted paper; slides kept separate)


Additional Items


Optimally Mapping Work to Cores/Accelerators
•  Which programming model is optimal is still an open question.
•  Models for shared memory / accelerator programming include OpenMP 3.1, OpenMP 4.0 (with accelerator, affinity, and SIMD directives), OpenACC, NVIDIA-specific CUDA, and OpenCL.
   –  http://www.hpcwire.com/2013/12/03/compilers-accelerated-programming/


OpenACC
•  OpenACC 2.0 was released this summer:
   –  http://www.openacc-standard.org/
•  Improvements include procedure calls, nested parallelism, more dynamic data management support, and more.
•  OpenACC 2.0 additions were described by PGI's Michael Wolfe at SC13:
   –  http://www.nvidia.com/object/sc13-technology-theater.html


OpenACC
•  PGI will support OpenACC 2.0 starting in January 2014, with PGI 14.1.
   –  The current MSI module pgi/13.9 supports OpenACC 1.0 directives.
•  GCC will support OpenACC soon:
   –  http://www.hpcwire.com/2013/11/14/openacc-broadens-appeal-gcc-compiler-support/
   –  OpenACC 2.0 support is expected in 2014.


OpenMP 4.0
•  The MSI Intel module intel/cluster/2013 supports OpenMP 4.0, except for combined directives (see the sketch below).
   –  http://software.intel.com/en-us/articles/openmp-40-features-in-intel-fortran-composer-xe-2013
•  For more information on OpenMP, see:
   –  http://openmp.org/wp/
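A minimal offload sketch using separate (non-combined) directives, consistent with the restriction above (illustrative code, not from the tutorial; compile with, e.g., icc -openmp):

#include <stdio.h>

#define N (1 << 20)
static float x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Offload to the coprocessor; note the separate target and parallel-for
       directives rather than a single combined one. */
    #pragma omp target map(to: x) map(tofrom: y)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %.1f (expect 4.0)\n", y[0]);
    return 0;
}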


Knights Landing
•  Information on the Intel PHI follow-on, Knights Landing, due out in 2014/2015:
   –  http://www.theregister.co.uk/2013/06/17/intel_knights_landing_xeon_phi_fabric_interconnects/
   –  http://www.hpcwire.com/2013/11/23/intel-brings-knights-roundtable-sc13/
•  Expect much more memory per Knights Landing socket, and significantly improved memory latency and bandwidth.


•  MSI home page –  www.msi.umn.edu

•  Software –  www.msi.umn.edu/sw

•  Password reset –  www.msi.umn.edu/password

•  Tutorials –  www.msi.umn.edu/tutorial

•  FAQ –  www.msi.umn.edu/support/faq.html

Questions?


Questions?
•  The MSI help desk is staffed Monday through Friday, 8:30 AM to 7:00 PM.
•  Walk-in help is available in room 569 Walter.
•  Phone: 612-626-0802
•  Email: help@msi.umn.edu


Thank You

The University of Minnesota is an equal opportunity educator and employer. This PowerPoint is available in alternative formats upon request. Direct requests to Minnesota Supercomputing Institute, 599 Walter library, 117 Pleasant St. SE,

Minneapolis, Minnesota, 55455, 612-624-0528.
