Intel Xeon Phi – Basic Tutorial

Evan Bollig and Brent Swartz
1pm, 12/19/2013

© 2013 Regents of the University of Minnesota. All rights reserved.



Overview
•  Intro to MSI
•  Intro to the MIC Architecture
•  Targeting the Xeon Phi
•  Examples
  –  Automatic Offload
  –  Offload Mode
  –  Native Mode
•  Distributed Jobs
  –  Symmetric MPI


A Quick Introduction to MSI


MSI at a Glance

HPC Resources
•  Koronis
•  Itasca
•  Calhoun
•  Cascade
•  GPUT

Laboratories
•  Biomedical Modeling, Simulation and Design
•  Basic Sciences
•  Life Sciences
•  Scientific Development
•  Remote Visualization

Software
•  Chemical and Physical Sciences
•  Engineering
•  Graphics and Visualization
•  Life Sciences
•  Development Tools

User Services
•  Consulting
•  Tutorials
•  Code Porting
•  Parallelization
•  Visualization


HPC Resources

MSI's Mission: provide researchers* access to, and support for, HPC resources to facilitate successful and cutting-edge research in all disciplines.

•  Koronis: SGI Altix
  –  1140 Intel Nehalem cores
  –  2.96 TB of memory
•  Itasca: Hewlett-Packard 3000BL
  –  8728 Intel Nehalem cores
  –  26 TB of memory
•  Calhoun: SGI Altix XE 1300
  –  1440 Intel Xeon Clovertown cores
  –  2.8 TB of memory
•  Cascade: 15 Dell compute nodes
  –  32 Nvidia M2070s (4:1)
  –  8 Nvidia Kepler K20s (2:1)
  –  4 Intel Xeon Phi (1:1, 2:1)
•  GPUT: 4 Exxact Corp GPU blades
  –  16 Nvidia GeForce GTX 480 (4:1)

* UMN and other MN institutions


Tutorials/Workshops
•  Introductory
  –  Unix, Linux, remote computing, job submission, queue policy
•  Programming & Scientific Computation
  –  Code parallelization, programming languages, math libraries
•  Computational Physics
  –  Fluid dynamics, space physics, structural mechanics, material science
•  Computational Chemistry
  –  Quantum chemistry, classical molecular modeling, drug design, cheminformatics
•  Computational Biology
  –  Structural biology, computational genomics, proteomics, bioinformatics

www.msi.umn.edu/tutorial


Introduction to the MIC Architecture


Fee-fi-fo-fum
•  What's in a name?
  –  Knights Corner
  –  Many Integrated Core (MIC)
  –  Xeon Phi
  –  Intel 5110P (B1)


PHI architecture
•  PHI hardware is described here: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner


PHI Performance
•  Briefly, PHI performance is described here: http://www.intel.com/content/www/us/en/benchmarks/xeon-phi-product-family-performance-brief.html


Phi vs GPU

•  Why the Phi?
  –  x86-64 (Intel 64) instructions
  –  Bandwidth: 320 GB/s
  –  IP Addressable
  –  Code portability
  –  Symmetric Mode
  –  MKL Auto Offload
•  Why the GPU?
  –  Massive following and literature
  –  SIMT
  –  Dynamic Parallelism
  –  OpenCL Drivers
  –  cuBLAS, cuRAND, cuSPARSE, etc.


MSI PHI description
•  An MSI PHI quickstart guide is available here: https://www.msi.umn.edu/content/intel-phi-quickstart


Roofline Model

[Figure: two roofline plots of peak possible GFLOP/sec (DP) versus operational intensity (FLOPs:Byte). Both axes are logarithmic: operational intensity from 0.0625 to 16, GFLOP/sec from 1 to 1024.]
•  NVidia K20 and M2070
  –  K20: 208 GByte/sec, 1170 GFLOP/sec
  –  M2070: 144 GByte/sec, 515 GFLOP/sec
•  Intel Xeon Phi 5110P (B1)
  –  320 GByte/sec, 1011 GFLOP/sec

Use a kernel's operational intensity (O.I.) to manage performance expectations.
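As a rough illustration of how the roofline ceilings bound performance, the sketch below takes the lower of the compute ceiling and the bandwidth ceiling scaled by operational intensity. The Phi figures are the ones quoted on this slide; the function and variable names are made up for the example.

```python
def attainable_gflops(peak_gflops, bandwidth_gbytes, intensity):
    """Roofline bound: the lower of the compute and memory ceilings."""
    return min(peak_gflops, bandwidth_gbytes * intensity)


# Ceilings for the Intel Xeon Phi 5110P as quoted on the slide.
PHI_PEAK = 1011.0  # GFLOP/sec (DP)
PHI_BW = 320.0     # GByte/sec

# Ridge point: the intensity above which a kernel is no longer bandwidth bound.
ridge = PHI_PEAK / PHI_BW

for oi in (0.0625, 0.25, 1, 4, 16):
    bound = attainable_gflops(PHI_PEAK, PHI_BW, oi)
    print(f"O.I. {oi:>7}: at most {bound:7.1f} GFLOP/sec")
```

Under this model a kernel with an operational intensity of 0.25 FLOPs per byte cannot exceed 80 GFLOP/sec on the Phi, no matter how well it vectorizes.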


Targeting the Xeon Phi


MSI PHI demonstration
•  At MSI, the only compiler that currently has OpenMP 4.0 support is the latest intel/cluster module, loaded using:

% module load intel/cluster


MSI PHI demonstration
•  You can obtain an interactive PHI node using:

% qsub -I -lwalltime=4:00:00,nodes=1:ppn=16:phi,pmem=200mb


MSI PHI demonstration
•  You can obtain info about the Phi using:

% /opt/intel/mic/bin/micinfo

•  As the micinfo output shows, each of the current 2 Phi nodes has 1 attached Phi coprocessor containing 60 cores, with a frequency of 1.053 GHz, for a peak of 1011 GFLOPS, and 7936 MB of memory.
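The quoted peak can be checked with a little arithmetic. The sketch below assumes 16 double-precision FLOPs per core per cycle (8 lanes in a 512-bit vector, times 2 for a fused multiply-add); that factor is an assumption added here, not something stated on the slide.

```python
cores = 60
clock_ghz = 1.053

# Assumption: 8 double-precision lanes per 512-bit vector, times 2 for fused multiply-add.
flops_per_cycle = 8 * 2

peak_gflops = cores * clock_ghz * flops_per_cycle
print(f"{peak_gflops:.2f} GFLOP/sec")  # rounds to the 1011 GFLOPS quoted above
```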


PHI Execution Mode
•  Phi execution mode figure: http://download.intel.com/newsroom/kits/xeon/phi/pdfs/Intel-Xeon-Phi-Coprocessor_ProductBrief.pdf


MKL PHI usage
•  Intel® Math Kernel Library Link Line Advisor (a web tool that helps users choose the correct link line options): http://software.intel.com/sites/products/mkl/


MKL PHI usage
•  “Using Intel® Math Kernel Library on Intel® Xeon Phi™ Coprocessors” section in the User’s Guide: http://software.intel.com/sites/products/documentation/doclib/mkl_sa/11/mkl_userguide_lnx/index.htm


MKL PHI code examples
•  $MKLROOT/examples/mic_ao
•  $MKLROOT/examples/mic_offload
  –  dexp: VML example (vdExp)
  –  dgaussian: double precision Gaussian RNG
  –  fft: complex-to-complex 1D FFT
  –  sexp: VML example (vsExp)
  –  sgaussian: single precision Gaussian RNG


MKL PHI code examples
  –  sgemm: SGEMM example
  –  sgemm_f: SGEMM example (Fortran 90)
  –  sgemm_reuse: SGEMM with data persistence
  –  sgeqrf: QR factorization
  –  sgetrf: LU factorization
  –  spotrf: Cholesky factorization



PHI Optimization Tips
•  Problem size considerations:
  –  Large problems have more parallelism.
  –  But not too large (8GB memory on a coprocessor).
  –  FFT prefers power-of-2 sizes.
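The problem-size tips can be turned into two quick checks before a run. This is a minimal sketch: the helper names are made up, the 7936 MB figure is the one reported by micinfo earlier in the tutorial (interpreted here as MiB, which is an assumption), and the three-matrix footprint stands in for a DGEMM-like kernel.

```python
def matrix_bytes(rows, cols, bytes_per_element=8):
    """Storage for a dense matrix of doubles."""
    return rows * cols * bytes_per_element


def fits_on_coprocessor(total_bytes, memory_mb=7936):
    """True if the working set fits in coprocessor memory (MB treated as MiB)."""
    return total_bytes <= memory_mb * 1024 * 1024


def next_power_of_two(n):
    """Smallest power of two that is >= n, for positive n."""
    return 1 << (n - 1).bit_length()


# Three square matrices (A, B, C), as in C = A * B.
for n in (16000, 20000):
    total = 3 * matrix_bytes(n, n)
    print(n, total, fits_on_coprocessor(total))

# Padding an FFT length up to a power of two.
print(next_power_of_two(1000))
```

In practice some coprocessor memory is taken by the operating system and runtime, so the real limit is lower than this check suggests.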


PHI Optimization Tips
•  Data alignment consideration:
  –  64-byte alignment for better vectorization.
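One way to act on the alignment tip is to pad each row of a 2D array so that every row starts on a 64-byte boundary. The sketch below only computes the padded sizes; it assumes the base address of the array is itself 64-byte aligned, and the helper names are hypothetical.

```python
ALIGNMENT = 64  # bytes, as recommended on the slide


def align_up(n_bytes, alignment=ALIGNMENT):
    """Round n_bytes up to the next multiple of alignment."""
    return (n_bytes + alignment - 1) // alignment * alignment


def padded_row_length(n_elements, bytes_per_element=8, alignment=ALIGNMENT):
    """Row length in elements such that consecutive rows stay aligned."""
    return align_up(n_elements * bytes_per_element, alignment) // bytes_per_element


print(padded_row_length(1000))  # 1000 doubles are already a multiple of 64 bytes
print(padded_row_length(1001))  # padded up to the next 64-byte boundary
```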


PHI Optimization Tips
•  OpenMP thread count and thread affinity:
  –  Avoid thread migration for better data locality.


PHI Optimization Tips
•  Large (2MB) pages for memory allocation:
  –  Reduce TLB misses and memory allocation overhead.
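To see why large pages reduce TLB pressure, compare how many pages are needed to map the same working set. The 1 GiB working set and the 4 KB default page size are assumptions chosen for illustration.

```python
def pages_needed(n_bytes, page_bytes):
    """Number of pages required to map n_bytes (ceiling division)."""
    return -(-n_bytes // page_bytes)


working_set = 1 << 30  # hypothetical 1 GiB array

small_pages = pages_needed(working_set, 4 * 1024)         # assumed 4 KB default pages
large_pages = pages_needed(working_set, 2 * 1024 * 1024)  # 2 MB pages from the slide

print(small_pages, large_pages, small_pages // large_pages)
```

Each page needs its own translation, so the same array takes 512 times fewer TLB entries with 2 MB pages under these assumptions.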


KMP_AFFINITY
•  Pin threads to cores
  –  Compact
  –  Scatter
  –  Balanced
  –  Explicit
  –  None

http://www.cac.cornell.edu/education/training/StampedeJune2013/mic-130618.pdf, Slide 29
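The difference between the compact, scatter, and balanced settings is easiest to see on a small case. The sketch below is a simplified model of thread-to-core placement, not the Intel OpenMP runtime's actual algorithm; the function name and the 3-core, 4-threads-per-core machine are hypothetical.

```python
def place_threads(n_threads, n_cores, threads_per_core, policy):
    """Return the core index assigned to each thread ID (simplified model)."""
    if n_threads > n_cores * threads_per_core:
        raise ValueError("more threads than hardware thread contexts")
    if policy == "compact":
        # Fill every hardware thread of a core before moving to the next core.
        return [t // threads_per_core for t in range(n_threads)]
    if policy == "scatter":
        # Deal threads out round-robin across the cores.
        return [t % n_cores for t in range(n_threads)]
    if policy == "balanced":
        # Spread evenly across cores, keeping neighboring thread IDs together.
        base, extra = divmod(n_threads, n_cores)
        placement = []
        for core in range(n_cores):
            placement += [core] * (base + (1 if core < extra else 0))
        return placement
    raise ValueError(f"unknown policy: {policy}")


for policy in ("compact", "scatter", "balanced"):
    print(policy, place_threads(6, 3, 4, policy))
```

With 6 threads, compact leaves the third core idle, scatter separates neighboring thread IDs, and balanced uses every core while keeping neighbors on the same core. The real runtime selects the policy through the KMP_AFFINITY environment variable.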


Native Mode

(via MPIrun)


Git Checkout
•  SSH to cascade
•  module load cmake intel/cluster
•  git clone /home/support/public/tutorials/phi_cmake_example.git


Build
•  cd phi_cmake_example
•  mkdir build
•  cd build
•  cmake ..
•  make


Run
•  cd mic_mpi
•  cp ../../mic_mpi/job_simple.pbs .
•  qsub job_simple.pbs


Interactive Mode
•  qsub -I -lwalltime=4:00:00,nodes=1:ppn=16:phi
•  export I_MPI_MIC=enable
•  export I_MPI_MIC_POSTFIX=.mic
•  mpirun -host ${HOSTNAME}-mic0 -np 4 `readlink -f quad.x`


An OpenCL Example

(Research in progress)


What is an RBF?


RBF-FD?

Classical FD: Vandermonde System

Substitute for each
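For the classical FD case, the weights come from requiring the weighted sum to be exact for the monomials 1, x, x², and so on, which gives a (transposed) Vandermonde system. The sketch below solves that system in exact rational arithmetic for a stencil centered at x = 0; the function name and the small Gauss-Jordan solver are illustrative choices, not taken from the slides.

```python
from fractions import Fraction
from math import factorial


def fd_weights(nodes, order):
    """Weights c_j so that sum_j c_j * u(x_j) approximates the order-th
    derivative of u at x = 0. Requires distinct nodes and order < len(nodes)."""
    n = len(nodes)
    # Row k enforces exactness for the monomial x**k.
    rows = [[Fraction(x) ** k for x in nodes] for k in range(n)]
    # The order-th derivative of x**k at 0 is order! when k == order, else 0.
    rhs = [Fraction(factorial(order)) if k == order else Fraction(0)
           for k in range(n)]

    # Gauss-Jordan elimination with a search for a nonzero pivot.
    for col in range(n):
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rhs[col], rhs[pivot] = rhs[pivot], rhs[col]
        for r in range(n):
            if r != col and rows[r][col] != 0:
                factor = rows[r][col] / rows[col][col]
                rows[r] = [a - factor * b for a, b in zip(rows[r], rows[col])]
                rhs[r] -= factor * rhs[col]
    return [rhs[i] / rows[i][i] for i in range(n)]


print(fd_weights([-1, 0, 1], 1))  # central difference for the first derivative
print(fd_weights([-1, 0, 1], 2))  # second-derivative stencil
```

The first call recovers the familiar central-difference weights (-1/2, 0, 1/2) and the second recovers (1, -2, 1).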


RBF-FD?


RBF-FD Stencils


Sparse Mat-Vec Multiply (SpMV)

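\[
Lu(x)\big|_{x=x_c} = \frac{du(x_c)}{dx} \approx \sum_{j=1}^{n} c_j\, u(x_j), \qquad \left( L = \frac{\partial}{\partial x} \right)
\]

[Figure labels: differentiation matrix $D_x$; approximated values $\widehat{Lu}(x_k)$]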
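The weighted-sum approximation on this slide can be tried on a single stencil. The sketch below uses classical central-difference weights rather than RBF-FD weights, and the test function u(x) = x³, the center, and the spacing are all chosen here for illustration.

```python
def apply_stencil(weights, values):
    """Approximate Lu(x_c) as the weighted sum of u over the stencil nodes."""
    return sum(c * u for c, u in zip(weights, values))


h = 0.5
x_c = 1.0
nodes = [x_c - h, x_c, x_c + h]
weights = [-1 / (2 * h), 0.0, 1 / (2 * h)]  # central difference for d/dx


def u(x):
    return x ** 3


approx = apply_stencil(weights, [u(x) for x in nodes])
exact = 3 * x_c ** 2  # analytic derivative of x**3 at x_c

print(approx, exact, approx - exact)
```

Each row of a differentiation matrix holds one such set of weights, so evaluating the derivative at every node amounts to one sparse matrix-vector product.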


Sparse Formats

[Figure: a small example sparse matrix with eight nonzero values (1 through 8), stored in three formats.]
•  COO: Row, Col, and Value arrays
•  CSR: Row Ptr, Col, and Value arrays
•  ELL: Col and Value arrays
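The three formats can be compared on a toy matrix. The sketch below converts COO triplets to CSR and then to ELL, and runs the same SpMV in both; the 4x4 matrix is hypothetical (it is not the one drawn on the slide), and the function names are made up for the example.

```python
def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triplets, already sorted by row, into CSR arrays."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:
        row_ptr[r + 1] += 1           # count nonzeros per row
    for i in range(n_rows):
        row_ptr[i + 1] += row_ptr[i]  # prefix sum gives row start offsets
    return row_ptr, list(cols), list(vals)


def csr_to_ell(row_ptr, cols, vals):
    """Pad every row to the width of the longest row (column 0, value 0.0)."""
    n_rows = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n_rows))
    ell_cols, ell_vals = [], []
    for i in range(n_rows):
        start, end = row_ptr[i], row_ptr[i + 1]
        pad = width - (end - start)
        ell_cols.append(cols[start:end] + [0] * pad)
        ell_vals.append(vals[start:end] + [0.0] * pad)
    return ell_cols, ell_vals


def csr_spmv(row_ptr, cols, vals, x):
    """y = A x with A stored in CSR."""
    return [
        sum(vals[k] * x[cols[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


def ell_spmv(ell_cols, ell_vals, x):
    """y = A x with A stored in ELL; padded entries contribute zero."""
    return [
        sum(v * x[c] for c, v in zip(row_cols, row_vals))
        for row_cols, row_vals in zip(ell_cols, ell_vals)
    ]


# Hypothetical 4x4 matrix (not the one drawn on the slide):
# [[1, 5, 0, 0],
#  [0, 2, 6, 0],
#  [8, 0, 3, 7],
#  [0, 0, 0, 4]]
coo_rows = [0, 0, 1, 1, 2, 2, 2, 3]
coo_cols = [0, 1, 1, 2, 0, 2, 3, 3]
coo_vals = [1.0, 5.0, 2.0, 6.0, 8.0, 3.0, 7.0, 4.0]

row_ptr, csr_cols, csr_vals = coo_to_csr(4, coo_rows, coo_cols, coo_vals)
ell_cols, ell_vals = csr_to_ell(row_ptr, csr_cols, csr_vals)

x = [1.0, 2.0, 3.0, 4.0]
y_csr = csr_spmv(row_ptr, csr_cols, csr_vals, x)
y_ell = ell_spmv(ell_cols, ell_vals, x)
print(row_ptr, y_csr, y_ell)
```

ELL trades extra storage (the padding) for a regular, fixed-width layout, which tends to suit wide vector units when row lengths are similar.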


ViennaCL Performance
•  GPU to Phi performance is NOT portable.
  1)  OpenCL driver is still BETA!
  2)  Loops vectorize differently


SpMM with MIC Intrinsics

(Content from submitted paper; slides kept separate)


Additional Items


Optimal Mapping of Work to Cores/Accelerators

•  Which programming model is optimal is still an outstanding issue.
•  Models for shared memory / accelerator programming include OpenMP 3.1, OpenMP 4.0 (with accelerator, affinity, and SIMD directives), OpenACC, Nvidia-specific CUDA, and OpenCL.
  –  http://www.hpcwire.com/2013/12/03/compilers-accelerated-programming/


OpenACC
•  OpenACC 2.0 was released in summer 2013:
  –  http://www.openacc-standard.org/
•  Improvements include procedure calls, nested parallelism, more dynamic data management support, and more.
•  OpenACC 2.0 additions were described by PGI's Michael Wolfe at SC13:
  –  http://www.nvidia.com/object/sc13-technology-theater.html


OpenACC
•  PGI will support OpenACC 2.0 starting in Jan 2014, with PGI 14.1.
  –  Current MSI module pgi/13.9 supports OpenACC 1.0 directives.
•  GCC will support OpenACC soon:
  –  http://www.hpcwire.com/2013/11/14/openacc-broadens-appeal-gcc-compiler-support/
  –  OpenACC 2.0 expected in 2014


OpenMP 4.0
•  MSI Intel module intel/cluster/2013 supports OpenMP 4.0, except for combined directives.
  –  http://software.intel.com/en-us/articles/openmp-40-features-in-intel-fortran-composer-xe-2013
•  For more information on OpenMP, see:
  –  http://openmp.org/wp/


Knights Landing
•  Information on Knights Landing, the Intel PHI follow-on due out in 2014/2015:
  –  http://www.theregister.co.uk/2013/06/17/intel_knights_landing_xeon_phi_fabric_interconnects/
  –  http://www.hpcwire.com/2013/11/23/intel-brings-knights-roundtable-sc13/
•  Expect much more memory per Knights Landing socket, and significantly improved memory latency and bandwidth.


Questions?
•  MSI home page
  –  www.msi.umn.edu
•  Software
  –  www.msi.umn.edu/sw
•  Password reset
  –  www.msi.umn.edu/password
•  Tutorials
  –  www.msi.umn.edu/tutorial
•  FAQ
  –  www.msi.umn.edu/support/faq.html


Questions?
•  MSI help desk is staffed Monday through Friday from 8:30AM to 7:00PM.
•  Walk-in help available in room 569 Walter.
•  Phone 612.626.0802
•  Email [email protected]


Thank You

The University of Minnesota is an equal opportunity educator and employer. This PowerPoint is available in alternative formats upon request. Direct requests to Minnesota Supercomputing Institute, 599 Walter Library, 117 Pleasant St. SE, Minneapolis, Minnesota, 55455, 612-624-0528.