ORIGINAL AUTHOR JAMES REINDERS, INTEL PRESENTED BY ADITYA AMBARDEKAR Overview for Intel Xeon Processors and Intel Xeon Phi coprocessors

ORIGINAL AUTHORJAMES REINDERS, INTEL

PRESENTED BY ADITYA AMBARDEKAR

Overview for Intel Xeon Processors and Intel Xeon Phi coprocessors

References:

Intel® Xeon Phi™ Coprocessor (codename Knights Corner) http://

software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner Intel® Many Integrated Core Architecture : An Overview and Programming Models

http://www.olcf.ornl.gov/wp-content/training/electronic-structure-2012/ORNL_Elec_Struct_WS_02062012.pdf

Intel details Knights Corner architecture at long last http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/#.

USLaX6U4uuJ Xeon Phi Update:

http://www.advancedclustering.com/news/xeon-phi-update.html Optimization and Performance Tuning for Intel® Xeon Phi™ Coprocessors - Part 1:

Optimization Essentials http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-

phi-coprocessors-part-1-optimization Results at TeraGrid 2011 conference.

http://www.hpcwire.com/hpcwire/2011-08-15/adventures_with_hpc_accelerators:_gpus_and_intel_mic_coprocessors.html

NCCS introduction to MIC. http://www.nccs.nasa.gov/images/Intro-to-MIC012913.pdf

NCSA Scientist Backs MICs over GPUs http://goparallel.sourceforge.net/ncsa-scientist-backs-mics-gpus/

http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner




http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/

http://semiaccurate.com/2012/08/28/intel-details-knights-corner-architecture-at-long-last/

http://www.advancedclustering.com/news/xeon-phi-update.html

http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization

http://software.intel.com/en-us/articles/optimization-and-performance-tuning-for-intel-xeon-phi-coprocessors-part-1-optimization



http://www.nccs.nasa.gov/images/Intro-to-MIC012913.pdf

http://www.nccs.nasa.gov/images/Intro-to-MIC012913.pdf

http://goparallel.sourceforge.net/ncsa-scientist-backs-mics-gpus/



Paper flow

Introduction ( Performance capability, Parallelism-Dual-transforming-tuning advantage).

Features of Knight’s corner with architecture diagram (MIC). Tuning your applications for parallel performance (authors favorite

point). Performance and Cache Optimizations. Compiler and Programming models (MPI vs. Offload). Xeon Phi vs. GPU(author doesn’t go into details so we will cover it

under additional topic). Summary.

Introduction to Many Integrated Core (MIC )

The basis of the Intel MIC architecture is to leverage x86 legacy by creating a x86-compatible multiprocessor architecture that can utilize existing parallelization software tools.[14]Programming tools include OpenMP, OpenCL,[39] Intel Cilk Plus and specialized versions of Intel's Fortran, C++[40] and math libraries.

More than 50 cores, multiple threads.

http://en.wikipedia.org/wiki/Intel_MIC

http://en.wikipedia.org/wiki/OpenMP

http://en.wikipedia.org/wiki/OpenCL


http://en.wikipedia.org/wiki/Intel_Cilk_Plus




Multicore vs. Many core

Fig 01 : Many Core vs. Multicore.Ref http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner.

Introduction to Xeon-Phi

Intel Xeon Phi Coprocessors-brand name for Many Integrated Core architecture products.

Xeon Phi is essentially a parallel x86 supercomputer on a chip.

Applications that fully utilize scaling capabilities of Intel Xeon processor based systems now have additional power efficient scaling ,vector support and local memory bandwidth.

Maintaining the programmability and support associated with Intel Xeon processors.

First Intel Xeon Phi Coprocessor: Knights Corner

X86 based SMP on chip with over 50 cores, multiple threads per core(e.g. 4 threads on SE10X) and 512 bit SIMD instructions.

Trace history of Pentium design, but with addition of 64 bit support, 4 hardware threads per core.

Requires host processor.Runs Linux.Cores clocked at 1 Ghz or more.Each core has access to 512-KB cache locally

with high speed access to all other L2 caches.

Typical Platform

Fig 02: System Platform.Ref: http://software.intel.com/en-us/articles/Intel-Xeon-phi-coprocessor-codename-knights-corner.

Connection via PCIe bus. Since the Intel Xeon Phi coprocessor runs a Linux operating system, a virtualized TCP/IP stack could be

implemented over the PCIe bus, allowing the user to access the coprocessor as a network node. Thus, any user can connect to the coprocessor through a secure shell and directly run individual jobs or

submit batch jobs to it. Multiple Intel Xeon Phi coprocessors can be installed in a single host system. Within a single system, the

coprocessors can communicate with each other through the PCIe peer-to-peer interconnect without any intervention from the host.

Similarly, the coprocessors can also communicate through a network card without any intervention from the host.

Knight Corner architecture:-

Fig 03. Knight Corner Architecture .

Ref: Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors.

Ring Architecture

Fig 04. Ring Architecture Ref: http://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-codename-knights-corner

When to use Intel Xeon Phi?

Application maximizes capabilities of Intel Xeon Processor.

High Performance comes from parallel software combined with Parallel Hardware.

Fig 05 Software and Hardware Parallelism. Ref: Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors.

When to use Intel Xeon Phi?

Application scales well past 100 threads qualify as highly parallel

More Parallelism = Better Performance.Two fundamental considerations for

application A) Scaling. B) Vectorization.

Tuning your applications for Xeon Phi

Check Scaling Create a simple graph of performance, as you run with

various number of threads on Intel Xeon Processor based System.

Check Vectorization. Compile your application with and without Vectorization.

Compare the performances. Most effective use of Intel Xeon Phi coprocessors will be when most cycles executing are in vector instructions.

Check for memory bandwidth For this to be efficient, application needs to exhibit good

locality of reference and utilizes caches well in its core computations.

Tuning your applications for Xeon Phi

Communication vs Computation ratio, when using MPI. for deciding native vs offline model. Strategy of

overlapping communication and I/O with computation.

Tools to measure Performances

Intel Vtune amplifier XE 2013 To measure computations L1 compute density

Intel trace Analyzer and collector Profiling MPI communications.

Performance Optimizations

Memory access and loop transformations. (ex cache blocking, loop unrolling, prefetching, tiling…)

Blocking or Tiling Code runs faster when data are reused while they are still in the processor

registers or the processor cache. It is frequently possible to block or tile operations so that data are reused before they are evicted from cache.

Data Structure transformations:- Code will run best when data are accessed in sequential address-order from

memory. Frequently developers will change the data structure to allow this linear access pattern. A common transformation is from an array of structures to a structure of arrays ( AoS to SoA).

Algorithm Selection: Favor ones those are parallelization and vectorization friendly.

Large page considerations: Use Linuxlibhugetlbfs library – provides easy access to huge mempory pages.

preload library to back text, data, malloc() or shared memory with hugepages

Cache Optimizations

Maximum effective use of caches Maximize locality of references first

organized around threads being used per core and then around all the threads across the coprocessor.

Ensuring prefetching is utilized efficiently. Organizing data streams is best c

Offload VS Native model.

Coprocessor usage in application.1]Processor Centric “offload” model:

Program viewed as running on processors and offloading select work to coprocessors.

2] Native model: Program runs natively on both processors and

coprocessors which may communicate with each other.

Offload Model

Fortran/C++ pragma support.Future version of Open MP to include offload

directives.Offload complex program components that

Xeon Phi can process.No worries for placement of ranks on

coprocessor cores, to load balance work across all the cores available.

Offload model example

You can control the offloading in your own code by issuing a special compiler pragma before the loop you want to offload: #pragma offload target (mic)

Here is an example code snipped where following OpenMP loop would be offloaded to the Phi #pragma omp parallel for reduction(+:pi) for (i = 0; i < count; i ++) { float t = (float)((i+0.5)/count); pi += 4.0/(1.0+t*t); } pi /= count; /* executes on host */

When building code with the Intel compiler, a single additional flag is all that’s necessary: icc -offload-build -o hello hello.c

Native Model

MPI program may run on native with ranks on coprocessors and processors

One needs to fit problems in coprocessor environment.

Limited memory on coprocessor.Load Balancing with Xeon processor cores

and Xeon Phi processor cores

Native model example

Running MPI codes Since the Intel Xeon Phi has an OS and is fully network accessible,

the existing MPI code can be run on the Phi along with existing compute nodes. Let's assume you have 2 compute nodes (node01, node02), each with a Xeon Phi installed (node01-phi, node02-phi).

First we build the Intel MPI code

For the host Xeon processor: mpiicc -o mpi-hello.x86_64 mpi-hello.c

For the Xeon Phi processor (add the additional ‘-mmic’ argument): mpicc -mmic -o mpi-hello.mic mpi-hello.c

Run it as follows: mpirun -np 4 -host node01 mpi-hello.x86_64 \ -host node01-phi mpi-hello.mic \ -host node02 mpi-

hello.x86_64 \ -host node02-phi mpi-hello.mic

Recommended Compilers/Programming models.

Fortran Programmers- Use Open MP, DO CONCURRENT & MPI

C++ Programmers- Intel TBB, Intel Cilk Plus and Open MP.

C programmers:- Open MP and Intel Cilk Plus

Xeon Phi vs. GPU.

Table 01: Xeon Phi vs. GPU

Features XeonPhi\MIC GPU

Architecture X86 based Streaming processors

Caches Coherent caches

Shared memory and caches

MPI MPI on host & MIC

MPI on host only

Programming Native/offload kernels

Languages C, C++, Fortran and Open MP…

Cuda, opencl…

Summary

Fundamentals: maximize parallel computations and minimize data movement.

Parallel computations: Scaling and Vectorization.

“Transforming and Tuning” applications for scaling, vector usage and memory usage for use of Xeon Processors and Phi processors.

MIC offers X86 based architecture legacy support for features over GPU new streaming based architecture.

Questions

Documents

ORIGINAL AUTHOR JAMES REINDERS, INTEL PRESENTED BY ADITYA AMBARDEKAR Overview for Intel Xeon Processors and Intel Xeon Phi coprocessors