1 Three Questions - community.hartree.stfc.ac.uk

Three Questions every one keeps asking

Stephen Blair-Chappell

Intel Compiler Labs

Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Three Common Requests

“How can I make my program run

faster?”

“How can I make my program

parallel?”

“Will my code run on any CPU? -

compatibility”

2

8/2/2012


Three Components

• Intel® Composer XE• Use to generate fast, safe, parallel code

(C/C++, Fortran)

• Intel® VTune™ Amplifier XE• Find hotspots and bottlenecks in you code.

• Intel® Inspector XE• Use to find memory and threading errors

3

8/2/2012

Composer Composer XE

•Compiler•Libraries

Inspector XE•Memory Errors•Parallel Errors

Amplifier XE

•Profiler

Intel® Parallel Studio XE


Four Three Components

• Intel® Composer XE

• Use to generate fast, safe, parallel code (C/C++, Fortran)

• Intel® VTune™ Amplifier XE

• Find hotspots and bottlenecks in you code.

• Intel® Inspector XE

• Use to find memory and threading errors

4

8/2/2012

Composer Composer XE

•Compiler•Libraries

Inspector XE•Memory Errors•Parallel Errors

Amplifier XE

•Profiler


+ Advisor• Intel® Parallel Advisor

• Use to model parallelism in your existing applications




faster?”


parallel?”


compatibility”

5

8/2/2012


The compiler uses many optimisation techniques

6

8/2/2012

Faster Code

fast floating point

http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdfhttp://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html


7

Often we are happy with out-of-the-box experience

When was the last time you looked

at some documentation?

Faster Code


startStep 1

Step 2

Step 3

Step 4

Step 6

Step 7

Step 5

Use General Optimizations

Build with optimization disabled

Use Processor-Specific Options

Tune automatic vectorization

Example options Windows (Linux)/Od (-O0)

/01,/02,/03 (-O1, -O2, -O3)

/QxSSE4.2 (-xsse4.2)/QxHOST (-xhost)

/Qipo (-ipo)

/Qprof-gen (-prof-gen)/Qprof-use (-prof-use)

/Qguide (-guide)

Use Intel Family of Parallel Models /Qparallel (-parallel)

Add Inter-procedural

Implement Parallelism or use Automatic Parallelism

Use Profile Guided Optimization

The

Sev

en O

pti

mis

atio

n S

tep

s

Faster Code


Vectorisation is …

9

8/2/2012

Faster Code

a[3] a[2]

b[3] b[2]

c[3] c[2]

+ +a[1] a[0]

b[1] b[0]

c[1] c[0]

+ +for (i=0;i<MAX;i++)c[i]=a[i]+b[i];

70 instrSingle-Precision VectorsStreaming operations

144 instrDouble-precision Vectors8/16/3264/128-bit vector integer

13 instrComplex Data

32 instrDecode

47 instrVideo Graphics building blocksAdvanced vector instr

SSE

1999

SSE2

2000

SSE3

2004

SSSE3

2006

SSE4.1

2007

SSE4.2

2008

8 instrString/XML processingPOP-CountCRC

AES-NI

2009

7 instrEncryption and DecryptionKey Generation

AVX

2011

~100 new instr.~300

legacy sseinstrupdated256-bit vector3 and 4-operand instructions

AVX2

2012\2013

Int. AVX expands to 256 bitImproved bit manip.fmaVector shiftsGather

MIC

2012

512-bit vector


Dif

fere

nt

Way

s of

In

sert

ing

V

ecto

rise

dC

ode

10

8/2/2012

Assembler code (addps)

Vector intrinsic (mm_add_ps())

SIMD intrinsic class (F32vec4 add)

Compiler: Auto vectorization hints (#pragma ivdep, …)

Programmer control

Ease of use

Compiler: Fully automatic vectorization

Cilk Plus Array Notation

User Mandated Vectorization( SIMD Directive)

Manual CPU Dispatch (__declspec(cpu_dispatch …))

Use Performance Libraries (e.g. IPP and MKL)

Faster Code


An example

11

8/2/2012

Faster Code

Speedup by swapping compiler

Verified using VTune

Speedup by upgrading silicon




faster?”


parallel?”


compatibility”

12

8/2/2012


Speedup using parallelism

13

8/2/2012

Parallel Code

Insp

ecto

r X

E

Mem

ory

Thre

ads

Debug

con

curr

ency

Tune

Am

plifi

er X

E

Lock

s &

wai

ts

Am

plifi

er X

EH

otsp

ot

Analyze

EB

S (X

E o

nly

)

Four Step Development

Com

pose

r X

E Compiler

Libraries

Implement

MKL

Cilk Plus

TBB

OpenMP

IPP

12

3

4

Analyze

Implement

Debug

Tune


Four Different Ways to Find the Hotspots

1. Using Intel compiler’s loop profiler & profile viewer

2. Using the compiler’s Auto-parallelizer

3. Using Amplifier XE

4. Performing a Survey with Advisor

14

8/2/2012

Analyze

Implement

Debug

Tune

Parallel Code


Language to help parallelism

Intel® Cilk™ Plus

OpenMP

Intel® Threading Building Blocks

Intel® MPI

Fortran Coarrays

OpenCL

Native Threads

15

8/2/2012

Parallel Code

cilk_for (int i = 0; i < max_row; i++) {

for (int j = 0; j < max_col; j++ ){

p[i][j] = mandel( complex(scale(i), scale(j)));}

}

#pragma omp parallel forfor(i=1;i<=4;i++) {

printf(“Iter: %d”, i);}


Four Different Ways to Find your Parallel Errors

1. Using Inspector XE

2. Perform a Static Security Analysis

3. Debug with Parallel Debug Extensions

4. Use Advisor

16

8/2/2012

Analyze

Implement

Debug

Tune

Parallel Code


An example …

17

8/2/2012

Parallel Code

1

2

3

4

5

61. Hotspot Analysis2. Implement3. Find Threading Errors 4,5,6. Tune Parallelismhttps://makebettercode.com/parallel_landing_required/lib/pdf/5373_IN_ParallelMag_Sudoku_060911.pdf




faster?”


parallel?”


compatibility”

18

8/2/2012


Will my program run on any CPU?

19

8/2/2012

Compatible Code

Compatibility

Future Proofing• build?

• run?

• Performance?

OS-agnosticCPU-agnosticLanguage / StandardsToolsScalability


20

8/2/2012

On the graphs, bigger is better

Parallel

Vectorised


Running Example: Monte Carlo#pragma omp parallel forfor(int opt = 0; opt < OPT_N; opt++){

float VBySqrtT = VOLATILITY * sqrtf(T[opt]);float MuByT = (RISKFREE ‐ 0.5f * VOLATILITY * VOLATILITY) * T[opt];float Sval = S[opt];float Xval = X[opt];float val = 0.0f, val2 = 0.0f;

#pragma simd reduction(+:val) reduction(+:val2)for(int pos = 0; pos < RAND_N; pos++){

float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);

val += callValue;val2 += callValue * callValue;

}

float exprt = expf(‐RISKFREE *T[opt]);h_CallResult[opt] = exprt * val / (float)RAND_N;float stdDev = sqrtf(((float)RAND_N*val2 ‐ val*val) /

((float)RAND_N*(float)(RAND_N – 1.f)));h_CallConfidence[opt] =(float)(exprt * 1.96f *

stdDev/sqrtf((float)RAND_N));}

SFTL003 hands on lab

Intel® Parallel Studio XE 2013 andIntel® Cluster Studio XE 2013

Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications


• Industry-leading performance from advanced compilers

• Comprehensive libraries

• Parallel programming models

• Insightful analysis tools

More Cores. Wider Vectors. Performance Delivered.Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013

23

Serial Performance

Scaling Performance

EfficientlyMulticore Many-core

128 Bits

256 Bits

512 Bits

50+ cores

More Cores

Wider VectorsTask & Data

Parallel Performance

Distributed Performance


24

Phase Product Feature Benefit

Build

Intel®Advisor XE

Threading design assistant(Studio products only)

• Simplifies, demystifies, and speeds parallel application design

Intel®Composer XE

• C/C++ and Fortran compilers • Intel® Threading Building Blocks• Intel® Cilk™ Plus• Intel® Integrated Performance

Primitives• Intel® Math Kernel Library

• Enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core

Intel®MPI Library†

High Performance MessagePassing (MPI) Library

• Enabling High Performance Scalability, Interconnect Independence, Runtime Fabric Selection, and Application Tuning Capability

Verify& Tune

Intel®VTune™Amplifier XE

Performance Profiler for optimizing application performance and scalability

• Remove guesswork, saves time, makes it easier to find performance and scalability bottlenecks

Intel®Inspector XE

Memory & threading dynamicanalysis for code quality

Static Analysis for code quality

• Increased productivity, code quality, and lowers cost, finds memory, threading , and security defects before they happen

Intel® TraceAnalyzer & Collector†

MPI Performance Profiler for understanding application correctness & behavior

• Analyze performance of MPI programs and visualize parallel application behavior and communications patterns to identify hotspots

Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 †

Efficiently Produce Fast, Scalable and Reliable Applications


25

Performance

Improved compiler and library performance

+ Ivy Bridgemicroarchitecture

+ Haswellmicroarchitecture

+ Intel® Xeon Phi™coprocessor

Performance Profiling

A dozen new analysis features

Low overhead Java* profiling

CPU Power Analysis

Reliability

Pointer checker

Heap growth analysis

Improved MPI fault tolerance†

Reproducibility

Conditional numerical reproducibility

Standards

ExpandedC++ 11

Expanded Fortran 2008

MPI 2.2†

ParallelismAssistance

Analysis extended to include Linux*, Fortran and C# (in addition to Windows* and C/C++)

Top New Features

Efficiently produce fast, scalable and reliable applications running on Windows* and Linux*

†Intel® Cluster Studio XE


Tool

MS Compiler

ICC

ICC

VTune

ICC

Inspector

ICC

VTune

ICC

Target

13‐0

13‐1

13‐2

13‐3

13‐4

13‐5

13‐6

13‐7

13‐8

26

8/2/2012

Add OpenMPCode

Find Hotspot

Add SSE Intrinsics

Check Correctness

Tune Parallelism

Find Hotspot

Fix Correctness

Macro

STEP_0

STEP_1

STEP_2

STEP_3

STEP_4

STEP_5

STEP_6

STEP_7

STEP_8

Solver

Generator

Key

Serial Release Mode

OpenMP Debug Mode

OpenMP Release Mode

1st version

Finish

The Build Environment

Build examplemake 13-0

or nmake 13-0


How to Run

13-0.exe test.txt

27

8/2/2012


Your Challenge – the hands-on

Examine each of the eight stage and use a combination of the compiler, inspector, and amplifier to understand ‘what’s going on’

Answer these questions

• Is the application using the CPU at it’s best? (Steps 0, 2 and 8)

• What’s the biggest hotspot in the serial code? (steps 1 and 3)

• What errors were introduced into the parallelism? (Steps 4, 5 & 6)

• How well is the parallelism tuned? (Steps 7 & 8)

Supplement:

Why is the Linux version slower than the Windows Version?

28

8/2/2012

Thank You

29

Backup

30

Intel® Parallel Studio XE Intel® Cluster Studio XE(30 minutes)

31

Intel® Parallel Studio XE 2013 andIntel® Cluster Studio XE 2013

Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications


• Industry-leading performance from advanced compilers

• Comprehensive libraries

• Parallel programming models

• Insightful analysis tools

More Cores. Wider Vectors. Performance Delivered.Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013

33

Serial Performance

Scaling Performance

EfficientlyMulticore Many-core

128 Bits

256 Bits

512 Bits

50+ cores

More Cores

Wider VectorsTask & Data

Parallel Performance

Distributed Performance


What’s New? Intel® Parallel Studio XE 2013/ Intel® Cluster Studio XE 2013

Performance Leadership:• 3rd Generation Intel® Core™ Processors (code name

“Ivy Bridge”) and future Intel® processors (code name “Haswell”)

• Intel® Xeon Phi™ coprocessors• Improved C++ and Fortran performance

New Product Capabilities• Latest OS: Windows* 8 Desktop, Linux*• IDE: Visual Studio 2008, 2010, 2012 and gnu tool chain• Standards: C99, selected C++11 features, almost

complete Fortran 2003 support and selected features from Fortran 2008, Fortran 2008, MPI 2.2

34

Compilers & Libraries


Intel® Cluster Studio XE

Boost Performance

35


Support for Latest Intel Processors and Coprocessors

36

Intel® Ivy Bridge microarchitecture

Intel® Haswell microarchitecture

Intel® Xeon Phi™ coprocessor

Intel® C++ and Fortran Compiler

✔AVX

✔AVX2, FMA3

✔IMCI

Intel® TBB library ✔ ✔ ✔

Intel® MKL library ✔AVX

✔AVX2, FMA3 ✔

Intel® MPI library ✔ ✔ ✔

Intel® VTune™ Amplifier XE†

✔Hardware Events

✔Hardware Events

✔Hardware Events

Intel® Inspector XE✔

Memory & Thread Checks

✔Memory & Thread

✔Memory & Thread††

† Hardware events for new processors added as new processors ship.†† Analysis runs on multicore processors, provides analysis for multicore and many-core processors.


Performance-Oriented Compiler SuitesIntel® Compilers, Performance Libraries, Debugging ToolsOn Windows, Linux and Mac OS X

37

Intel® C++ Composer XE 2013

• Intel® C++ Compiler XE 13.0with Intel® Cilk™ Plus

• Intel® TBB• Intel® MKL• Intel® IPP• Intel® Xeon Phi™ product

family support, Linux

Intel® Fortran Composer XE 2013

• Intel® Fortran Compiler XE 13.0• Intel® MKL• Compatibility with Compaq

Visual Fortran*• Fortran 2003, 2008 support• Intel® Xeon Phi™ product

family support, Linux

Intel Composer XE 2013

• Combines Intel C++ Composer XE and Intel® Fortran Composer XE

• For Fortran developers who also want Intel C++

• Windows (requires Visual Studio) and Linux only

Windows: Intel C++/Visual* C++ compatibility & integration into Microsoft* Visual Studio*

Linux: Intel C++/gcc* compatibility & integration into Eclipse* CDT

Mac OS X: Intel C++/gcc compatibility & integration into XCode* Environment

All: Intel Fortran performance leadership, compatible with Compaq* Visual* Fortran

All: Leadership performance on Intel and compatible architectures

All: One Year Intel® Premier Support. Renewable Annually.

Performance . Compatibility. Support.


Superior C++ Compiler Performance

38

More Performance• Just recompile• Uses Intel® AVX and Intel® AVX2 instructions• Intel® Xeon Phi™ product family support, Linux: Compiler, debugger (Linux)• Intel® Cilk™ Plus: Tasking and vectorization


Superior Fortran Compiler Performance

39

More Performance • Just recompile• Intel® Xeon Phi™ product family: Linux compiler, debugger support • Access to Intel® AVX and Intel® AVX2 instructions (-xa or /Qxa)• Auto-parallelizer & directives to access SIMD instructions• Coarrays & synchronization constructs support parallel programming• Loop optimization directives: VECTOR, PARALLEL, SIMD• More control over array data alignment (align arrayNbytes)


C++ Performance Guide Performance Wizard for Windows

• Quick 5 step process for more performance

• Get help choosing optimization options

40

Gain Performance with Less Effort





Intel® Math Kernel Library (MKL)• Highly optimized threaded math routines

• Applications in science, engineering, finance

• Use Intel® MKL on Windows*, Linux*, Mac OS*

• Use Intel® MKL with Intel compiler, gcc, MSFT*, PGI

• Component of Intel® Parallel Studio XE and Intel® Cluster Studio XE

41

33% of math libraries users rely on Intel’s Math Kernel

Library

EDC North America Development Survey

2011, Volume II

Drop In The Next Intel® MKL Version to Unlock New Processor Performance


LAPACK Performance Improves withIntel® Math Kernel Library

42



Optimized for Performance and Power Efficiency

• Highly optimized using SSE, AVX instruction sets

• Performance beyond what an optimized compiler produces alone

• Highly optimized using SSE, AVX instruction sets

• Performance beyond what an optimized compiler produces alone

Intel Engineered & Future Proofed to Save You Time

• Ready-to-use & royalty free

• Fully optimized for current and past processors

• Save development, debug, and maintenance time

• Code once now, receive future optimizations later

• Ready-to-use & royalty free

• Fully optimized for current and past processors

• Save development, debug, and maintenance time

• Code once now, receive future optimizations later

Wide range of Cross Platform & OS Functionality

• Thousands of optimized functions

• Supports Windows*, Linux*, and Mac OS* X

• Supports Intel® Atom, Intel® Core, Intel® Xeon, platforms

• Thousands of optimized functions

• Supports Windows*, Linux*, and Mac OS* X

• Supports Intel® Atom, Intel® Core, Intel® Xeon, platforms

Intel® Integrated Performance Primitives (IPP)A Library Of Highly Optimized Algorithmic Building Blocks For Media And Data Applications

43

Availability: Part of several different product packages with single, multi-user licenses as well as volume, academic, and student discounts available.

Try it Before You Buy It: Download a trial version today at intel.com/software/products/eval

Performance Building Blocks to Make Your Applications Faster, Faster


Intel® IPPBoost from Intel® AVX

44


45

Where is my application…

Spending Time? Wasting Time? Waiting Too Long?

• Focus tuning on functions taking time

• See call stacks• See time on source

• See cache misses on your source

• See functions sorted by # of cache misses

• See locks by wait time• Red/Green for CPU utilization during wait

Intel® VTune™ Amplifier XE Performance Profiler

• Windows & Linux• Low overhead• No special recompiles

Claire CatesPrincipal Developer, SAS Institute Inc.

We improved the performance of the latest run 3 fold. We wouldn't have found the problem without something like Intel® VTune™ Amplifier XE.

Intel® VTune™ Amplifier XE

Advanced Profiling for Scalable Multicore Performance45


A Dozen New Analysis FeaturesIntel® VTune™ Amplifier XE 2013

More Profiling Data1) Statistical Call Counts

Data for Inlining & Parallelization2) Hardware Events + Stacks

Lower overhead, Higher resolution Finds hot spots in small functions

3) Uncore Event CountingMore accurate bandwidth analysis

4) Ivy Bridge Events5) Haswell Events

Updates as new processors ship6) Intel® Xeon Phi™ Products

Hardware events

46

Easier To Use7) Source View for Inlined Code

(For Intel® and GCC* compilers)8) Java Tuning

Results map to the Java source9) Task Annotation API

Label and visualize tasks. 10) User Defined Metrics

Create meaningful metrics from events11) Programmable Hot Keys

Start and stop collection easily12) More/Better Advanced Profiles

(e.g., Bandwidth)

Easy to Use, Wealth of Data, Powerful Analysis

Intel® VTune™ Amplifier XE


Low Overhead Java* ProfilingIntel® VTune™ Amplifier XE 2013

Low Overhead & Precise• Sampling is fast / unobtrusive• Hardware sampling even faster

(Now with optional stacks!)• Advanced profiles are unique

(cache misses, bandwidth…)

47

Versatile & Easy to Use

Multiple simultaneous JVMsMixed Java / C++ / FortranSee results on the Java source

Better Data, Lower Overhead, Easier to Use

Intel®VTune™ Amplifier XE


CPU Power AnalysisIntel® VTune™ Amplifier XE 2013

To decrease CPU power usage minimize wake-ups • Identify wake-up causes– Timers triggered by

application– Interrupts mapped to HW intr

level– Show wake-up rate

• Display source code for events that wake-up processor

• Show CPU frequencies by CPU core(CPU frequencies can change by CPU activity level)

• Linux only

48

Select & filter to see a single wake up object:

Uniquely Identifies the Cause of Wake-ups and Give Timer Call Stacks

Intel®VTune™ Amplifier XE

Scale Forward

49


Simplify and Speed Threading DesignIntel® Advisor XE – Threading Assistant

The Challenge of Parallel Design:• Need to implement to measure

performance• Implementation is time consuming• Disrupts regular product development• Testing difficult without tools

50

Intel Advisor XE Separates Design & Implementation• Fast exploration of multiple options• Find errors before implementation• Design without disrupting development

New! Linux* and Windows* New! C, C++, Fortran and C# code

Add Parallelism with Less Effort, Less Risk and More Impact

Intel® Advisor XE


1) Analyze it.

3) Tune it.

4) Check it.

5) Do it!

2) Design it. (Compiler ignores these annotations.)

Design Then ImplementIntel® Advisor XE 2013 – Threading Assistant

51

Less Effort, Less Risk, More Impact

Design Parallelism• No disruption to

regular development • All test cases

continue to work• Tune and debug the

design before you implement it

Implement Parallelism

Intel® Advisor XE


Abstract, Scalable and Composable

Scale Forward with Intel Parallel ModelsExtend to Intel® Xeon Phi™ Coprocessors

52

Support StandardsOpenMP Coarray Fortran MPI


C/C++ language extensions to simplify

parallelism


Widely used C++ template library for thread

management

Intel® Xeon Processors, and

Compatible Processors

Intel® Xeon Phi™ product family

Open programming models and also Intel products

Don’t Leave Your Code Behind



Simplify ParallelismIntel® Cilk™ Plus, Intel® Threading Building Blocks

53


• 3 simple keywords & array notations for parallelism

• Support for task and data parallelism

• Semantics similar to serial code

• Simple way to parallelize your code

• Sequentially consistent, low overhead, powerful solution

• Supports C, C++, Windows and Linux


• Parallel algorithms and data structures

• Scalable memory allocation and task scheduling

• Synchronization primitives

• Rich feature set for general purpose parallelism

• Available as open source or commercial license

• Supports C++, Windows, Linux, Mac OS X, other OSs

What

Features

Why

Language extensions tosimplify task/data parallelism

Widely used C++ templatelibrary for task parallelism

Task and Data Parallelism Made Easier



Parallelize Applications For PerformanceIntel® Threading Building Blocks (TBB)A popular, proven parallel C++ abstractionA C++ template library

• Scalable memory allocation

• Load-balancing

• Work-stealing task scheduling

• Thread-safe pipeline

• Flexible flow graph

• Concurrent containers

• High-level parallel algorithms

• Numerous synchronization primitives

• Open source, and portable acrossmany OSs

54

Simplify Parallelism with a Scalable Parallel Model

Michaël Rouillé, CTO, Golaem

"Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table


Scale Forward and Extend to Intel® Xeon Phi™ CoprocessorsIntel® Cilk™ Plus

55

Intel® Cilk™ Plus (Language Extension to C/C++)

Easier Task & Data Parallelism3 simple Keywords:cilk_for, cilk_spawn, cilk_sync

Intel® Cilk™ Plus Array Notation Save time with powerful vectorization

Minimize Software Re-Work for New Hardware

Increase Reliability

56


Pointer Checker

Finds buffer overflows and dangling pointers before memory corruption occurs

Powerful error reporting

Integrates into standard debuggers (Microsoft, gdb, Intel)

57

Buffer Overflow{

char *my_chp = "abc";char *an_chp = (char *) malloc (strlen((char *)my_chp));memset (an_chp, '@', sizeof(my_chp));

}

Dangling pointer{

char *p, *q;p = malloc(10);q = p;free(p);*q = 0;

}

CHKP: Bounds check errorTraceback:./a.out(main+0x1b2) [0x402d7a] in file mems.c at line 13

Pointer Checker Highlights Programming Errors For More Secure Applications





Conditional Numerical Reproducibility

Intel® Math Kernel Library: • New deterministic task scheduling

and code path selection options

OpenMP*:• New deterministic reduction

option

Intel® Threading Building Blocks• New parallel deterministic reduce

option

58

Help Achieve Reproducible Results, Despite Non-associative Floating Point Math

“I’m a C++ and Fortran developer and have high praise for the Intel® Math Kernel Library. One nice feature I’d like to stress is the numerical reproducibility of MKL which helps me get the assurance I need that I’m getting the same floating point results from run to run."

Franz BernasekOwner / CEO , Senior DeveloperMSTC Modern Software Technology



Expanded C++ 11 support

• Additional type traits• Initializer lists (partial)• Generalized constant expressions (partial)• Noexcept (partial)• Range based for loops • Conversions of lambdas to function pointers

59

Excellent Support for C++ 11 on Windows* and Linux*



Expanded Fortran 2008 Support

• Maximum array rank has been raised to 31 dimensions (Fortran 2008 specifies 15)

• Recursive type may have ALLOCATABLE components

• Coarrays– CODIMENSION attribute– SYNC ALL statement– SYNC IMAGES statement– SYNC MEMORY statement– CRITICAL and END CRITICAL statements– LOCK and UNLOCK statements– ERROR STOP statement– ALLOCATE and DEALLOCATE may specify coarrays– Intrinsic procedures IMAGE_INDEX, LCOBOUND,

NUM_IMAGES, THIS_IMAGE, UCOBOUND

• CONTIGUOUS attribute

• MOLD keyword in ALLOCATE

• DO CONCURRENT

• NEWUNIT keyword in OPEN

60

G0 and G0.d format edit descriptorUnlimited format item repeat count specifierCONTAINS section may be emptyIntrinsic procedures BESSEL_J0, BESSEL_J1, BESSEL_JN, BESSEL_YN, BGE, BGT, BLE, BLT, DSHIFTL, DSHIFTR, ERF, ERFC, ERFC_SCALED, GAMMA, HYPOT, IALL, IANY, IPARITY, IS_CONTIGUOUS, LEADZ, LOG_GAMMA, MASKL, MASKR, MERGE_BITS, NORM2, PARITY, POPCNT, POPPAR, SHIFTA, SHIFTL, SHIFTR, STORAGE_SIZE, TRAILZAdditions to intrinsic module ISO_FORTRAN_ENV: ATOMIC_INT_KIND, ATOMIC_LOGICAL_KIND, CHARACTER_KINDS, INTEGER_KINDS, INT8, INT16, INT32, INT64, LOCK_TYPE, LOGICAL_KINDS, REAL_KINDS, REAL32, REAL64, REAL128, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, STAT_UNLOCKED

New: ATOMIC_DEFINE and ATOMIC_REF, initialization of polymorphic INTENT(OUT) dummy arguments, standard handling of G format and of printing the value zero, coarrays (more support), polymorphic source allocation

Leadership F2008 Support on Linux*, Windows* & OSX*



Dynamic Analysis Finds Memory & Threading ErrorsIntel® Inspector XE 2013

Find and eliminate errors• Memory leaks, invalid access…• Races & deadlocks• Analyze hybrid MPI cluster apps• Heap growth analysis

Faster & Easier to use• Debugger breakpoints

• Break on selected errors • Run faster to known error

• Pause/resume collection• Narrow analysis focus• Better performance

• Improved error suppression

61

Find Errors Early When They are Less Expensive

Intel® InspectorXE


Heap Growth AnalysisIntel® Inspector XE 2013

Does Application Memory Usage Mysteriously Grow? • Set an analysis interval with

start and analysis end points– Click a button –or–– Use an API

• See a list of memory allocations that are not freed in the interval

• Quickly zero in on suspicious activity that contributes to heap growth

62

Speeds Diagnosis of Difficult to Find Heap Errors

Intel® InspectorXE


Static Analysis Finds Coding and Security ErrorsIntel® Parallel Studio XE 2013

Find over 250 error types e.g.:• Incorrect directives• Security errors

Easier to use• Choose your priority:- Minimize false errors- Maximize error detection

• Hierarchical navigation of results• Share comments with the team

Increased Accuracy & Speed• Detect errors without all source files • Better scaling with large code bases

Code Complexity Metrics• Find code likely to be less reliable

63

Static Analysis is only available in Studio XE bundles. It is not sold separately.

Find Errors and Harden your Security




Cluster Tools

64


Scale Forward, Scale FasterIntel Cluster Tools

Scale Performance – Perform on More Nodes

• MPI Latency - Intel® MPI Library - Up to 6.5X as fast as alternative MPI libraries

• Compiler Performance – Industry leading Intel® C/C++ & Fortran compilers

Scale Forward – multicore now, many-core ready

• Intel® MPI Library scales beyond 120k processes

• Focused to preserve programming investments for multicore and many-core machines

Scale Efficiently – Tune & Debug on More Nodes

• Thread & Memory Correctness Checking – Intel® Inspector XE now MPI enabled across many nodes

• Rapid Node Level Performance Profiling – Intel VTune Amplifier XE can identify hotspots faster and on thousands of nodes

65



High Performance Standards Driven Fabric Flexible MPI Library65


On the Path to ExascaleIntel® MPI Library and (part of Cluster Studio XE 2013)

Latest hardware support• Ivy Bridge and Haswell• Intel® Xeon Phi™

Coprocessor

Increased Scaling• 120k Processes

Standards Support• MPI 2.2

66

Continued Scaling Capacity to Meet Ever Growing HPC Demands

Proc

esse

s

Year

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

2010 2011 2012 2013 2014 2015 2016 2017 2018

Intel® MPILibrary, Kprocesses

Doubling,Kprocesses

Exascale,Kprocesses(estimated)

90K

60K 120K

Intel® MPI Library


Improved MPI Fault Tolerance

Implementation of Berkeley Lab Checkpoint/Restart (BLCR) †

Primary Uses• Scheduling• Process Migration• Failure Recovery

†

67

Enabling Capabilities for Robust at Scale MPI Computing

Fault Recovery Scenario

Node Fault

Checkpointing

Checkpoint Recovery

Intel®MPI Library


MPI 2.2 Support

Backwards compatible with MPI 2.1 programs

Delivers Distributed Graph Topology Interface• Scalable & Informative for MPI Library Communications• Easy to Use Mechanism for Conveying Comms Patterns

to MPI Applications• Used by MPI Library to Improve Mapping Process to

Process Communications• Allows better fit for Applications Communications to

Hardware Capabilities

68

Intel®MPI Library

Outstanding Support Of The Latest MPI Standard


Optimize MPI CommunicationsIntel® Trace Analyzer and Collector (part of Cluster Studio XE 2013)

Visually understand parallel application behavior• Communications Patterns• Hotspots• Load Balance

MPI Checking• Detect Deadlocks• Data Corruption• Errors in Parameters, Data Types, etc

Scaling• Analysis Capability increasing to 6k

processes

69

Expanding MPI Profiling Capacity for Communications Optimization

Proc

esse

s

Year

0

1000

2000

3000

4000

5000

6000

7000

2010 2011 2012

Intel® Trace Analyzer and Collector (processes)

Intel® ITAC


INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice


70

Backup

72


Value of SuitesSuite Only Features

• Advisor XEParallelism Advice

• C++ Performance Guide Performance Wizard

• Pointer CheckerReduces memory corruption

• Code Complexity Analysis Find code likely to be less reliable

• Static Analysis Improved! Find Errors and Harden your Security 73


What’s New in Libraries?Intel® MKL

• Digital random number generator (DRNG) for improved vector statistics calculations

• Automatically utilize Intel® Xeon Phi™ Coprocessors and balance compute loads between CPUs and coprocessors

Intel® IPP

• Enhanced image resize performance primitives

• Improved IPP footprint size

Intel® TBB

• Improved usability and reliability of the Flow Graph feature

• Additional C++11 Support

74

Ready to Use Libraries to Increase Performance

"Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table—crowd simulation software.”

Michaël Rouillé, CTO, Golaem



INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.

Optimization Notice

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

Legal Disclaimer & Optimization Notice


75

8/2/201275

Documents

1 Three Questions - community.hartree.stfc.ac.uk