Intel Xeon Phi – Basic Tutorial


Evan Bollig and Brent Swartz, 1pm, 12/19/2013

© 2013 Regents of the University of Minnesota. All rights reserved.

Overview
•  Intro to MSI
•  Intro to the MIC Architecture
•  Targeting the Xeon Phi
•  Examples
   –  Automatic Offload
   –  Offload Mode
   –  Native Mode
•  Distributed Jobs
   –  Symmetric MPI


A Quick Introduction to MSI


MSI at a Glance

HPC Resources
•  Koronis
•  Itasca
•  Calhoun
•  Cascade
•  GPUT

Laboratories
•  Biomedical Modeling, Simulation and Design
•  Basic Sciences
•  Life Sciences
•  Scientific Development
•  Remote Visualization

Software
•  Chemical and Physical Sciences
•  Engineering
•  Graphics and Visualization
•  Life Sciences
•  Development Tools

User Services
•  Consulting
•  Tutorials
•  Code Porting
•  Parallelization
•  Visualization


HPC Resources

MSI's mission: provide researchers* access to, and support for, HPC resources to facilitate successful and cutting-edge research in all disciplines.

•  Koronis: SGI Altix, 1140 Intel Nehalem cores, 2.96 TB of memory
•  Itasca: Hewlett-Packard 3000BL, 8728 Intel Nehalem cores, 26 TB of memory
•  Calhoun: SGI Altix XE 1300, 1440 Intel Xeon Clovertown cores, 2.8 TB of memory
•  Cascade: 15 Dell compute nodes, 32 NVIDIA M2070s (4:1), 8 NVIDIA Kepler K20s (2:1), 4 Intel Xeon Phi (1:1, 2:1)
•  GPUT: 4 Exxact Corp GPU blades, 16 NVIDIA GeForce GTX 480 (4:1)

* UMN and other MN institutions


Tutorials/Workshops
•  Introductory
   –  Unix, Linux, remote computing, job submission, queue policy
•  Programming & Scientific Computation
   –  Code parallelization, programming languages, math libraries
•  Computational Physics
   –  Fluid dynamics, space physics, structural mechanics, material science
•  Computational Chemistry
   –  Quantum chemistry, classical molecular modeling, drug design, cheminformatics
•  Computational Biology
   –  Structural biology, computational genomics, proteomics, bioinformatics

www.msi.umn.edu/tutorial


Introduction to the MIC Architecture


Fee-fi-fo-fum
•  What's in a name?
   –  Knights Corner
   –  Many Integrated Core (MIC)
   –  Xeon Phi
   –  Intel 5110P (B1)


PHI architecture
•  PHI hardware is described here:
   http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner


PHI Performance
•  Briefly, PHI performance is described here:
   http://www.intel.com/content/www/us/en/benchmarks/xeon-phi-product-family-performance-brief.html


Phi vs GPU
•  Why the Phi?
   –  x86-64 (Intel 64) instructions
   –  Bandwidth: 320 GB/s
   –  IP addressable
   –  Code portability
   –  Symmetric Mode
   –  MKL Automatic Offload
•  Why the GPU?
   –  Massive following and literature
   –  SIMT
   –  Dynamic Parallelism
   –  OpenCL drivers
   –  cuBLAS, cuRAND, cuSPARSE, etc.


MSI PHI description
•  An MSI PHI quickstart guide is available here:
   https://www.msi.umn.edu/content/intel-phi-quickstart


Roofline Model

[Figure: roofline plots of peak possible GFLOP/sec (double precision) versus operational intensity (FLOPs:Byte).
•  NVIDIA M2070: 144 GByte/sec memory ceiling, 515 GFLOP/sec compute ceiling.
•  NVIDIA K20: 208 GByte/sec memory ceiling, 1170 GFLOP/sec compute ceiling.
•  Intel Xeon Phi 5110P (B1): 320 GByte/sec memory ceiling, 1011 GFLOP/sec compute ceiling.]

Manage performance expectations using the operational intensity (O.I.) of your kernel.
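The roofline caps attainable throughput at the lesser of the compute peak and bandwidth times operational intensity:

\[ \mathrm{GFLOP/s} \;\le\; \min\bigl(\mathrm{peak\ GFLOP/s},\ \mathrm{bandwidth} \times \mathrm{O.I.}\bigr) \]

For the Phi 5110P numbers above, a kernel remains bandwidth-bound until its O.I. exceeds roughly 1011 / 320 ≈ 3.2 FLOPs per byte.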


Targeting the Xeon Phi


MSI PHI demonstration
•  At MSI, the only compiler that currently has OpenMP 4.0 support is the latest Intel compiler, loaded via the intel/cluster module:

% module load intel/cluster


MSI PHI demonstration
•  Obtain an interactive PHI node using:

% qsub -I -l walltime=4:00:00,nodes=1:ppn=16:phi,pmem=200mb


MSI PHI demonstration
•  Obtain info about the Phi using:

% /opt/intel/mic/bin/micinfo

•  As shown in the micinfo output, each of the current two Phi nodes has one attached Phi coprocessor containing 60 cores at 1.053 GHz, for a peak of 1011 GFLOPS, and 7936 MB of memory.
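That peak follows from the vector width: each core's 512-bit vector unit can retire 16 double-precision FLOPs per cycle (8-wide fused multiply-add), so

\[ 60\ \text{cores} \times 1.053\ \text{GHz} \times 16\ \text{FLOPs/cycle} \approx 1011\ \text{GFLOPS (DP)}. \]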


PHI Execution Mode
•  Phi execution mode figure:
   http://download.intel.com/newsroom/kits/xeon/phi/pdfs/Intel-Xeon-Phi-Coprocessor_ProductBrief.pdf


MKL PHI usage
•  Intel® Math Kernel Library Link Line Advisor (a web tool to help users choose correct link line options):
   http://software.intel.com/sites/products/mkl/


MKL PHI usage
•  "Using Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors" section in the User's Guide:
   http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm


MKL PHI code examples
•  $MKLROOT/examples/mic_ao
•  $MKLROOT/examples/mic_offload
   –  dexp: VML example (vdExp)
   –  dgaussian: double precision Gaussian RNG
   –  fft: complex-to-complex 1D FFT
   –  sexp: VML example (vsExp)
   –  sgaussian: single precision Gaussian RNG


MKL PHI code examples
   –  sgemm: SGEMM example
   –  sgemm_f: SGEMM example (Fortran 90)
   –  sgemm_reuse: SGEMM with data persistence
   –  sgeqrf: QR factorization
   –  sgetrf: LU factorization
   –  spotrf: Cholesky factorization
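In the spirit of the mic_ao examples, a minimal Automatic Offload sketch follows (assumptions: the MKL 11.x shipped with the intel/cluster module; compile with icc -mkl). AO also works with no code changes at all via % export MKL_MIC_ENABLE=1.

#include <stdio.h>
#include <mkl.h>

int main(void)
{
    const MKL_INT n = 4096;  /* AO engages only for sufficiently large GEMMs */
    double *A = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *B = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    double *C = (double *)mkl_malloc((size_t)n * n * sizeof(double), 64);
    if (!A || !B || !C) return 1;
    for (long i = 0; i < (long)n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    if (mkl_mic_enable() != 0)  /* request Automatic Offload at runtime */
        fprintf(stderr, "AO unavailable; MKL runs host-only\n");

    /* C = A*B; MKL decides how much of the work to ship to the coprocessor */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0] = %.1f (expect %.1f)\n", C[0], 2.0 * n);
    mkl_free(A); mkl_free(B); mkl_free(C);
    return 0;
}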


PHI Optimization Tips
•  Problem size considerations:
   –  Large problems have more parallelism.
   –  But not too large (8 GB memory on a coprocessor).
   –  FFT prefers power-of-2 sizes.


PHI Optimization Tips
•  Data alignment considerations:
   –  Use 64-byte alignment for better vectorization (see the sketch below).
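A minimal sketch of 64-byte-aligned allocation using _mm_malloc/_mm_free, which the Intel (and GNU) compilers provide via immintrin.h:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int n = 1024;
    /* 64-byte alignment matches the Phi's 512-bit (64-byte) vector registers */
    double *x = (double *)_mm_malloc(n * sizeof(double), 64);
    if (!x) return 1;
    for (int i = 0; i < n; i++) x[i] = (double)i;
    printf("64-byte aligned: %s\n", ((uintptr_t)x % 64 == 0) ? "yes" : "no");
    _mm_free(x);
    return 0;
}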


PHI Optimization Tips
•  OpenMP thread count and thread affinity:
   –  Avoid thread migration for better data locality.


PHI Optimization Tips
•  Large (2 MB) pages for memory allocation:
   –  Reduce TLB misses and memory allocation overhead (see the example below).
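For offloaded code, one way to request 2 MB pages is the offload runtime's environment variable; the threshold below is illustrative, and the exact semantics should be checked against the MPSS/compiler documentation:

% export MIC_USE_2MB_BUFFERS=16K

With this setting, offload buffers of 16 KB and larger are backed by 2 MB pages on the coprocessor.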


KMP_AFFINITY
•  Pin threads to cores (see the example below):
   –  Compact
   –  Scatter
   –  Balanced
   –  Explicit
   –  None

http://www.cac.cornell.edu/education/training/StampedeJune2013/mic-130618.pdf (slide 29)
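For example, a native-mode run might pin threads with (the thread count is illustrative; KNC cores expose four hardware threads each):

% export KMP_AFFINITY=balanced
% export OMP_NUM_THREADS=240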


Native Mode (via mpirun)


Git Checkout
•  SSH to cascade
•  module load cmake intel/cluster
•  git clone /home/support/public/tutorials/phi_cmake_example.git


Build
•  cd phi_cmake_example
•  mkdir build
•  cd build
•  cmake ..
•  make


Run
•  cd mic_mpi
•  cp ../../mic_mpi/job_simple.pbs .
•  qsub job_simple.pbs


Interactive Mode
•  qsub -I -l walltime=4:00:00,nodes=1:ppn=16:phi
•  export I_MPI_MIC=enable
•  export I_MPI_MIC_POSTFIX=.mic
•  mpirun -host ${HOSTNAME}-mic0 -np 4 `readlink -f quad.x`
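The I_MPI_MIC_POSTFIX setting makes mpirun launch quad.x.mic on the coprocessor. A sketch of how the two binaries would be built (assuming the Intel MPI compiler wrappers; the tutorial's CMake files take care of this):

% mpiicc -o quad.x quad.c
% mpiicc -mmic -o quad.x.mic quad.c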


An OpenCL Example

(Research in progress)


What is an RBF?
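A radial basis function depends only on the distance from its center node,

\[ \phi_j(x) = \phi\left(\lVert x - x_j \rVert\right), \]

for example the Gaussian \( \phi(r) = e^{-(\varepsilon r)^2} \).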


RBF-FD?

Classical FD: the stencil weights come from solving a Vandermonde system built on polynomial basis functions. RBF-FD: substitute an RBF for each polynomial basis function.



RBF-FD Stencils


Sparse Mat-Vec Multiply (SpMV)

RBF-FD turns applying a differential operator into a sparse matrix-vector product. At each stencil center \( x_c \), the derivative is a weighted sum over the \( n \) stencil nodes:

\[
\mathcal{L}u(x)\Big|_{x=x_c} \;\approx\; \sum_{j=1}^{n} c_j\, u(x_j),
\qquad \text{e.g. } \mathcal{L} = \frac{\partial}{\partial x}
\ \text{ gives }\ \frac{du(x_c)}{dx}.
\]

Collecting the weights \( c_j \) for every stencil center \( x_k \) assembles a sparse differentiation matrix \( D_x \), so \( D_x u \) approximates \( \mathcal{L}u(x_k) \) at all nodes.


Sparse Formats

[Figure: the same small sparse matrix stored in three formats:
•  COO: parallel Row, Col, and Value arrays, one entry per nonzero.
•  CSR: a compressed Row Ptr array plus Col and Value arrays.
•  ELL: fixed-width Col and Value arrays, padded to a uniform row length.]
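Of the three, CSR is the most common starting point. A minimal, generic CSR SpMV kernel (a sketch, not the tutorial's code):

#include <stdio.h>

/* y = A*x; row_ptr has nrows+1 entries, col/val hold the nonzeros row by row */
void spmv_csr(int nrows, const int *row_ptr, const int *col,
              const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col[k]];
        y[i] = sum;
    }
}

int main(void)
{
    /* 3x3 example: [[4 0 1], [0 2 0], [3 0 5]] */
    int    row_ptr[] = {0, 2, 3, 5};
    int    col[]     = {0, 2, 1, 0, 2};
    double val[]     = {4, 1, 2, 3, 5};
    double x[] = {1, 1, 1}, y[3];
    spmv_csr(3, row_ptr, col, val, x, y);
    printf("%g %g %g\n", y[0], y[1], y[2]);  /* expect 5 2 8 */
    return 0;
}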


ViennaCL Performance
•  GPU-to-Phi performance is NOT portable:
   1)  The OpenCL driver is still BETA!
   2)  Loops vectorize differently.


SpMM with MIC Intrinsics

(Content from submitted paper; slides kept separate)


Additional Items


Optimally Mapping Work to Cores/Accelerators
•  Which programming model is optimal is still an open question.
•  Models for shared memory / accelerator programming include OpenMP 3.1, OpenMP 4.0 (with accelerator, affinity, and SIMD directives), OpenACC, NVIDIA-specific CUDA, and OpenCL.
   –  http://www.hpcwire.com/2013/12/03/compilers-accelerated-programming/


OpenACC
•  OpenACC 2.0 was released this summer:
   –  http://www.openacc-standard.org/
•  Improvements include procedure calls, nested parallelism, more dynamic data management support, and more.
•  OpenACC 2.0 additions were described by PGI's Michael Wolfe at SC13:
   –  http://www.nvidia.com/object/sc13-technology-theater.html


OpenACC
•  PGI will support OpenACC 2.0 starting in January 2014, with PGI 14.1.
   –  The current MSI module pgi/13.9 supports OpenACC 1.0 directives.
•  GCC will support OpenACC soon:
   –  http://www.hpcwire.com/2013/11/14/openacc-broadens-appeal-gcc-compiler-support/
   –  OpenACC 2.0 support is expected in 2014.


OpenMP 4.0
•  The MSI Intel module intel/cluster/2013 supports OpenMP 4.0, except for combined directives (see the sketch below).
   –  http://software.intel.com/en-us/articles/openmp-40-features-in-intel-fortran-composer-xe-2013
•  For more information on OpenMP, see:
   –  http://openmp.org/wp/
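A minimal offload sketch using separate (non-combined) directives, consistent with the restriction above (illustrative code, not from the tutorial; compile with, e.g., icc -openmp):

#include <stdio.h>

#define N (1 << 20)
static float x[N], y[N];

int main(void)
{
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Offload to the coprocessor; note the separate target and parallel-for
       directives rather than a single combined one. */
    #pragma omp target map(to: x) map(tofrom: y)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = 2.0f * x[i] + y[i];

    printf("y[0] = %.1f (expect 4.0)\n", y[0]);
    return 0;
}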


Knights Landing
•  Information on the Intel PHI follow-on, Knights Landing, due out in 2014/2015:
   –  http://www.theregister.co.uk/2013/06/17/intel_knights_landing_xeon_phi_fabric_interconnects/
   –  http://www.hpcwire.com/2013/11/23/intel-brings-knights-roundtable-sc13/
•  Expect much more memory per Knights Landing socket, and significantly improved memory latency and bandwidth.


•  MSI home page –  www.msi.umn.edu

•  Software –  www.msi.umn.edu/sw

•  Password reset –  www.msi.umn.edu/password

•  Tutorials –  www.msi.umn.edu/tutorial

•  FAQ –  www.msi.umn.edu/support/faq.html

Questions?


Questions?
•  The MSI help desk is staffed Monday through Friday, 8:30 AM to 7:00 PM.
•  Walk-in help is available in room 569 Walter.
•  Phone: 612-626-0802
•  Email: help@msi.umn.edu


Thank You

The University of Minnesota is an equal opportunity educator and employer. This PowerPoint is available in alternative formats upon request. Direct requests to Minnesota Supercomputing Institute, 599 Walter library, 117 Pleasant St. SE,

Minneapolis, Minnesota, 55455, 612-624-0528.
