Upload
others
View
3
Download
0
Embed Size (px)
Citation preview
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Three Common Requests
“How can I make my program run
faster?”
“How can I make my program
parallel?”
“Will my code run on any CPU? -
compatibility”
2
8/2/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Three Components
• Intel® Composer XE• Use to generate fast, safe, parallel code
(C/C++, Fortran)
• Intel® VTune™ Amplifier XE• Find hotspots and bottlenecks in you code.
• Intel® Inspector XE• Use to find memory and threading errors
3
8/2/2012
Composer Composer XE
•Compiler•Libraries
Inspector XE•Memory Errors•Parallel Errors
Amplifier XE
•Profiler
Intel® Parallel Studio XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Four Three Components
• Intel® Composer XE
• Use to generate fast, safe, parallel code (C/C++, Fortran)
• Intel® VTune™ Amplifier XE
• Find hotspots and bottlenecks in you code.
• Intel® Inspector XE
• Use to find memory and threading errors
4
8/2/2012
Composer Composer XE
•Compiler•Libraries
Inspector XE•Memory Errors•Parallel Errors
Amplifier XE
•Profiler
Intel® Parallel Studio XE
+ Advisor• Intel® Parallel Advisor
• Use to model parallelism in your existing applications
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Three Common Requests
“How can I make my program run
faster?”
“How can I make my program
parallel?”
“Will my code run on any CPU? -
compatibility”
5
8/2/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
The compiler uses many optimisation techniques
6
8/2/2012
Faster Code
fast floating point
http://software.intel.com/sites/products/collateral/hpc/compilers/compiler_qrg12.pdfhttp://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
7
Often we are happy with out-of-the-box experience
When was the last time you looked
at some documentation?
Faster Code
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
startStep 1
Step 2
Step 3
Step 4
Step 6
Step 7
Step 5
Use General Optimizations
Build with optimization disabled
Use Processor-Specific Options
Tune automatic vectorization
Example options Windows (Linux)/Od (-O0)
/01,/02,/03 (-O1, -O2, -O3)
/QxSSE4.2 (-xsse4.2)/QxHOST (-xhost)
/Qipo (-ipo)
/Qprof-gen (-prof-gen)/Qprof-use (-prof-use)
/Qguide (-guide)
Use Intel Family of Parallel Models /Qparallel (-parallel)
Add Inter-procedural
Implement Parallelism or use Automatic Parallelism
Use Profile Guided Optimization
The
Sev
en O
pti
mis
atio
n S
tep
s
Faster Code
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Vectorisation is …
9
8/2/2012
Faster Code
a[3] a[2]
b[3] b[2]
c[3] c[2]
+ +a[1] a[0]
b[1] b[0]
c[1] c[0]
+ +for (i=0;i<MAX;i++)c[i]=a[i]+b[i];
70 instrSingle-Precision VectorsStreaming operations
144 instrDouble-precision Vectors8/16/3264/128-bit vector integer
13 instrComplex Data
32 instrDecode
47 instrVideo Graphics building blocksAdvanced vector instr
SSE
1999
SSE2
2000
SSE3
2004
SSSE3
2006
SSE4.1
2007
SSE4.2
2008
8 instrString/XML processingPOP-CountCRC
AES-NI
2009
7 instrEncryption and DecryptionKey Generation
AVX
2011
~100 new instr.~300
legacy sseinstrupdated256-bit vector3 and 4-operand instructions
AVX2
2012\2013
Int. AVX expands to 256 bitImproved bit manip.fmaVector shiftsGather
MIC
2012
512-bit vector
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Dif
fere
nt
Way
s of
In
sert
ing
V
ecto
rise
dC
ode
10
8/2/2012
Assembler code (addps)
Vector intrinsic (mm_add_ps())
SIMD intrinsic class (F32vec4 add)
Compiler: Auto vectorization hints (#pragma ivdep, …)
Programmer control
Ease of use
Compiler: Fully automatic vectorization
Cilk Plus Array Notation
User Mandated Vectorization( SIMD Directive)
Manual CPU Dispatch (__declspec(cpu_dispatch …))
Use Performance Libraries (e.g. IPP and MKL)
Faster Code
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
An example
11
8/2/2012
Faster Code
Speedup by swapping compiler
Verified using VTune
Speedup by upgrading silicon
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Three Common Requests
“How can I make my program run
faster?”
“How can I make my program
parallel?”
“Will my code run on any CPU? -
compatibility”
12
8/2/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Speedup using parallelism
13
8/2/2012
Parallel Code
Insp
ecto
r X
E
Mem
ory
Thre
ads
Debug
con
curr
ency
Tune
Am
plifi
er X
E
Lock
s &
wai
ts
Am
plifi
er X
EH
otsp
ot
Analyze
EB
S (X
E o
nly
)
Four Step Development
Com
pose
r X
E Compiler
Libraries
Implement
MKL
Cilk Plus
TBB
OpenMP
IPP
12
3
4
Analyze
Implement
Debug
Tune
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Four Different Ways to Find the Hotspots
1. Using Intel compiler’s loop profiler & profile viewer
2. Using the compiler’s Auto-parallelizer
3. Using Amplifier XE
4. Performing a Survey with Advisor
14
8/2/2012
Analyze
Implement
Debug
Tune
Parallel Code
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Language to help parallelism
Intel® Cilk™ Plus
OpenMP
Intel® Threading Building Blocks
Intel® MPI
Fortran Coarrays
OpenCL
Native Threads
15
8/2/2012
Parallel Code
cilk_for (int i = 0; i < max_row; i++) {
for (int j = 0; j < max_col; j++ ){
p[i][j] = mandel( complex(scale(i), scale(j)));}
}
#pragma omp parallel forfor(i=1;i<=4;i++) {
printf(“Iter: %d”, i);}
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Four Different Ways to Find your Parallel Errors
1. Using Inspector XE
2. Perform a Static Security Analysis
3. Debug with Parallel Debug Extensions
4. Use Advisor
16
8/2/2012
Analyze
Implement
Debug
Tune
Parallel Code
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
An example …
17
8/2/2012
Parallel Code
1
2
3
4
5
61. Hotspot Analysis2. Implement3. Find Threading Errors 4,5,6. Tune Parallelismhttps://makebettercode.com/parallel_landing_required/lib/pdf/5373_IN_ParallelMag_Sudoku_060911.pdf
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Three Common Requests
“How can I make my program run
faster?”
“How can I make my program
parallel?”
“Will my code run on any CPU? -
compatibility”
18
8/2/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Will my program run on any CPU?
19
8/2/2012
Compatible Code
Compatibility
Future Proofing• build?
• run?
• Performance?
OS-agnosticCPU-agnosticLanguage / StandardsToolsScalability
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
20
8/2/2012
On the graphs, bigger is better
Parallel
Vectorised
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Running Example: Monte Carlo#pragma omp parallel forfor(int opt = 0; opt < OPT_N; opt++){
float VBySqrtT = VOLATILITY * sqrtf(T[opt]);float MuByT = (RISKFREE ‐ 0.5f * VOLATILITY * VOLATILITY) * T[opt];float Sval = S[opt];float Xval = X[opt];float val = 0.0f, val2 = 0.0f;
#pragma simd reduction(+:val) reduction(+:val2)for(int pos = 0; pos < RAND_N; pos++){
float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT, l_Random[pos]);
val += callValue;val2 += callValue * callValue;
}
float exprt = expf(‐RISKFREE *T[opt]);h_CallResult[opt] = exprt * val / (float)RAND_N;float stdDev = sqrtf(((float)RAND_N*val2 ‐ val*val) /
((float)RAND_N*(float)(RAND_N – 1.f)));h_CallConfidence[opt] =(float)(exprt * 1.96f *
stdDev/sqrtf((float)RAND_N));}
SFTL003 hands on lab
Intel® Parallel Studio XE 2013 andIntel® Cluster Studio XE 2013
Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
More Cores. Wider Vectors. Performance Delivered.Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
23
Serial Performance
Scaling Performance
EfficientlyMulticore Many-core
128 Bits
256 Bits
512 Bits
50+ cores
More Cores
Wider VectorsTask & Data
Parallel Performance
Distributed Performance
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
24
Phase Product Feature Benefit
Build
Intel®Advisor XE
Threading design assistant(Studio products only)
• Simplifies, demystifies, and speeds parallel application design
Intel®Composer XE
• C/C++ and Fortran compilers • Intel® Threading Building Blocks• Intel® Cilk™ Plus• Intel® Integrated Performance
Primitives• Intel® Math Kernel Library
• Enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core
Intel®MPI Library†
High Performance MessagePassing (MPI) Library
• Enabling High Performance Scalability, Interconnect Independence, Runtime Fabric Selection, and Application Tuning Capability
Verify& Tune
Intel®VTune™Amplifier XE
Performance Profiler for optimizing application performance and scalability
• Remove guesswork, saves time, makes it easier to find performance and scalability bottlenecks
Intel®Inspector XE
Memory & threading dynamicanalysis for code quality
Static Analysis for code quality
• Increased productivity, code quality, and lowers cost, finds memory, threading , and security defects before they happen
Intel® TraceAnalyzer & Collector†
MPI Performance Profiler for understanding application correctness & behavior
• Analyze performance of MPI programs and visualize parallel application behavior and communications patterns to identify hotspots
Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013 †
Efficiently Produce Fast, Scalable and Reliable Applications
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
25
Performance
Improved compiler and library performance
+ Ivy Bridgemicroarchitecture
+ Haswellmicroarchitecture
+ Intel® Xeon Phi™coprocessor
Performance Profiling
A dozen new analysis features
Low overhead Java* profiling
CPU Power Analysis
Reliability
Pointer checker
Heap growth analysis
Improved MPI fault tolerance†
Reproducibility
Conditional numerical reproducibility
Standards
ExpandedC++ 11
Expanded Fortran 2008
MPI 2.2†
ParallelismAssistance
Analysis extended to include Linux*, Fortran and C# (in addition to Windows* and C/C++)
Top New Features
Efficiently produce fast, scalable and reliable applications running on Windows* and Linux*
†Intel® Cluster Studio XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Tool
MS Compiler
ICC
ICC
VTune
ICC
Inspector
ICC
VTune
ICC
Target
13‐0
13‐1
13‐2
13‐3
13‐4
13‐5
13‐6
13‐7
13‐8
26
8/2/2012
Add OpenMPCode
Find Hotspot
Add SSE Intrinsics
Check Correctness
Tune Parallelism
Find Hotspot
Fix Correctness
Macro
STEP_0
STEP_1
STEP_2
STEP_3
STEP_4
STEP_5
STEP_6
STEP_7
STEP_8
Solver
Generator
Key
Serial Release Mode
OpenMP Debug Mode
OpenMP Release Mode
1st version
Finish
The Build Environment
Build examplemake 13-0
or nmake 13-0
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
How to Run
13-0.exe test.txt
27
8/2/2012
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Your Challenge – the hands-on
Examine each of the eight stage and use a combination of the compiler, inspector, and amplifier to understand ‘what’s going on’
Answer these questions
• Is the application using the CPU at it’s best? (Steps 0, 2 and 8)
• What’s the biggest hotspot in the serial code? (steps 1 and 3)
• What errors were introduced into the parallelism? (Steps 4, 5 & 6)
• How well is the parallelism tuned? (Steps 7 & 8)
Supplement:
Why is the Linux version slower than the Windows Version?
28
8/2/2012
Intel® Parallel Studio XE 2013 andIntel® Cluster Studio XE 2013
Helping Developers Efficiently Produce Fast, Scalable and Reliable Applications
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
• Industry-leading performance from advanced compilers
• Comprehensive libraries
• Parallel programming models
• Insightful analysis tools
More Cores. Wider Vectors. Performance Delivered.Intel® Parallel Studio XE 2013 and Intel® Cluster Studio XE 2013
33
Serial Performance
Scaling Performance
EfficientlyMulticore Many-core
128 Bits
256 Bits
512 Bits
50+ cores
More Cores
Wider VectorsTask & Data
Parallel Performance
Distributed Performance
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
What’s New? Intel® Parallel Studio XE 2013/ Intel® Cluster Studio XE 2013
Performance Leadership:• 3rd Generation Intel® Core™ Processors (code name
“Ivy Bridge”) and future Intel® processors (code name “Haswell”)
• Intel® Xeon Phi™ coprocessors• Improved C++ and Fortran performance
New Product Capabilities• Latest OS: Windows* 8 Desktop, Linux*• IDE: Visual Studio 2008, 2010, 2012 and gnu tool chain• Standards: C99, selected C++11 features, almost
complete Fortran 2003 support and selected features from Fortran 2008, Fortran 2008, MPI 2.2
34
Compilers & Libraries
Intel® Parallel Studio XE
Intel® Cluster Studio XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Support for Latest Intel Processors and Coprocessors
36
Intel® Ivy Bridge microarchitecture
Intel® Haswell microarchitecture
Intel® Xeon Phi™ coprocessor
Intel® C++ and Fortran Compiler
✔AVX
✔AVX2, FMA3
✔IMCI
Intel® TBB library ✔ ✔ ✔
Intel® MKL library ✔AVX
✔AVX2, FMA3 ✔
Intel® MPI library ✔ ✔ ✔
Intel® VTune™ Amplifier XE†
✔Hardware Events
✔Hardware Events
✔Hardware Events
Intel® Inspector XE✔
Memory & Thread Checks
✔Memory & Thread
✔Memory & Thread††
† Hardware events for new processors added as new processors ship.†† Analysis runs on multicore processors, provides analysis for multicore and many-core processors.
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Performance-Oriented Compiler SuitesIntel® Compilers, Performance Libraries, Debugging ToolsOn Windows, Linux and Mac OS X
37
Intel® C++ Composer XE 2013
• Intel® C++ Compiler XE 13.0with Intel® Cilk™ Plus
• Intel® TBB• Intel® MKL• Intel® IPP• Intel® Xeon Phi™ product
family support, Linux
Intel® Fortran Composer XE 2013
• Intel® Fortran Compiler XE 13.0• Intel® MKL• Compatibility with Compaq
Visual Fortran*• Fortran 2003, 2008 support• Intel® Xeon Phi™ product
family support, Linux
Intel Composer XE 2013
• Combines Intel C++ Composer XE and Intel® Fortran Composer XE
• For Fortran developers who also want Intel C++
• Windows (requires Visual Studio) and Linux only
Windows: Intel C++/Visual* C++ compatibility & integration into Microsoft* Visual Studio*
Linux: Intel C++/gcc* compatibility & integration into Eclipse* CDT
Mac OS X: Intel C++/gcc compatibility & integration into XCode* Environment
All: Intel Fortran performance leadership, compatible with Compaq* Visual* Fortran
All: Leadership performance on Intel and compatible architectures
All: One Year Intel® Premier Support. Renewable Annually.
Performance . Compatibility. Support.
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Superior C++ Compiler Performance
38
More Performance• Just recompile• Uses Intel® AVX and Intel® AVX2 instructions• Intel® Xeon Phi™ product family support, Linux: Compiler, debugger (Linux)• Intel® Cilk™ Plus: Tasking and vectorization
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Superior Fortran Compiler Performance
39
More Performance • Just recompile• Intel® Xeon Phi™ product family: Linux compiler, debugger support • Access to Intel® AVX and Intel® AVX2 instructions (-xa or /Qxa)• Auto-parallelizer & directives to access SIMD instructions• Coarrays & synchronization constructs support parallel programming• Loop optimization directives: VECTOR, PARALLEL, SIMD• More control over array data alignment (align arrayNbytes)
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
C++ Performance Guide Performance Wizard for Windows
• Quick 5 step process for more performance
• Get help choosing optimization options
40
Gain Performance with Less Effort
Compilers & Libraries
Intel® Parallel Studio XE
Intel® Cluster Studio XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® Math Kernel Library (MKL)• Highly optimized threaded math routines
• Applications in science, engineering, finance
• Use Intel® MKL on Windows*, Linux*, Mac OS*
• Use Intel® MKL with Intel compiler, gcc, MSFT*, PGI
• Component of Intel® Parallel Studio XE and Intel® Cluster Studio XE
41
33% of math libraries users rely on Intel’s Math Kernel
Library
EDC North America Development Survey
2011, Volume II
Drop In The Next Intel® MKL Version to Unlock New Processor Performance
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
LAPACK Performance Improves withIntel® Math Kernel Library
42
Compilers & Libraries
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Optimized for Performance and Power Efficiency
• Highly optimized using SSE, AVX instruction sets
• Performance beyond what an optimized compiler produces alone
• Highly optimized using SSE, AVX instruction sets
• Performance beyond what an optimized compiler produces alone
Intel Engineered & Future Proofed to Save You Time
• Ready-to-use & royalty free
• Fully optimized for current and past processors
• Save development, debug, and maintenance time
• Code once now, receive future optimizations later
• Ready-to-use & royalty free
• Fully optimized for current and past processors
• Save development, debug, and maintenance time
• Code once now, receive future optimizations later
Wide range of Cross Platform & OS Functionality
• Thousands of optimized functions
• Supports Windows*, Linux*, and Mac OS* X
• Supports Intel® Atom, Intel® Core, Intel® Xeon, platforms
• Thousands of optimized functions
• Supports Windows*, Linux*, and Mac OS* X
• Supports Intel® Atom, Intel® Core, Intel® Xeon, platforms
Intel® Integrated Performance Primitives (IPP)A Library Of Highly Optimized Algorithmic Building Blocks For Media And Data Applications
43
Availability: Part of several different product packages with single, multi-user licenses as well as volume, academic, and student discounts available.
Try it Before You Buy It: Download a trial version today at intel.com/software/products/eval
Performance Building Blocks to Make Your Applications Faster, Faster
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Intel® IPPBoost from Intel® AVX
44
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
45
Where is my application…
Spending Time? Wasting Time? Waiting Too Long?
• Focus tuning on functions taking time
• See call stacks• See time on source
• See cache misses on your source
• See functions sorted by # of cache misses
• See locks by wait time• Red/Green for CPU utilization during wait
Intel® VTune™ Amplifier XE Performance Profiler
• Windows & Linux• Low overhead• No special recompiles
Claire CatesPrincipal Developer, SAS Institute Inc.
We improved the performance of the latest run 3 fold. We wouldn't have found the problem without something like Intel® VTune™ Amplifier XE.
Intel® VTune™ Amplifier XE
Advanced Profiling for Scalable Multicore Performance45
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
A Dozen New Analysis FeaturesIntel® VTune™ Amplifier XE 2013
More Profiling Data1) Statistical Call Counts
Data for Inlining & Parallelization2) Hardware Events + Stacks
Lower overhead, Higher resolution Finds hot spots in small functions
3) Uncore Event CountingMore accurate bandwidth analysis
4) Ivy Bridge Events5) Haswell Events
Updates as new processors ship6) Intel® Xeon Phi™ Products
Hardware events
46
Easier To Use7) Source View for Inlined Code
(For Intel® and GCC* compilers)8) Java Tuning
Results map to the Java source9) Task Annotation API
Label and visualize tasks. 10) User Defined Metrics
Create meaningful metrics from events11) Programmable Hot Keys
Start and stop collection easily12) More/Better Advanced Profiles
(e.g., Bandwidth)
Easy to Use, Wealth of Data, Powerful Analysis
Intel® VTune™ Amplifier XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Low Overhead Java* ProfilingIntel® VTune™ Amplifier XE 2013
Low Overhead & Precise• Sampling is fast / unobtrusive• Hardware sampling even faster
(Now with optional stacks!)• Advanced profiles are unique
(cache misses, bandwidth…)
47
Versatile & Easy to Use
Multiple simultaneous JVMsMixed Java / C++ / FortranSee results on the Java source
Better Data, Lower Overhead, Easier to Use
Intel®VTune™ Amplifier XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
CPU Power AnalysisIntel® VTune™ Amplifier XE 2013
To decrease CPU power usage minimize wake-ups • Identify wake-up causes– Timers triggered by
application– Interrupts mapped to HW intr
level– Show wake-up rate
• Display source code for events that wake-up processor
• Show CPU frequencies by CPU core(CPU frequencies can change by CPU activity level)
• Linux only
48
Select & filter to see a single wake up object:
Uniquely Identifies the Cause of Wake-ups and Give Timer Call Stacks
Intel®VTune™ Amplifier XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Simplify and Speed Threading DesignIntel® Advisor XE – Threading Assistant
The Challenge of Parallel Design:• Need to implement to measure
performance• Implementation is time consuming• Disrupts regular product development• Testing difficult without tools
50
Intel Advisor XE Separates Design & Implementation• Fast exploration of multiple options• Find errors before implementation• Design without disrupting development
New! Linux* and Windows* New! C, C++, Fortran and C# code
Add Parallelism with Less Effort, Less Risk and More Impact
Intel® Advisor XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
1) Analyze it.
3) Tune it.
4) Check it.
5) Do it!
2) Design it. (Compiler ignores these annotations.)
Design Then ImplementIntel® Advisor XE 2013 – Threading Assistant
51
Less Effort, Less Risk, More Impact
Design Parallelism• No disruption to
regular development • All test cases
continue to work• Tune and debug the
design before you implement it
Implement Parallelism
Intel® Advisor XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Abstract, Scalable and Composable
Scale Forward with Intel Parallel ModelsExtend to Intel® Xeon Phi™ Coprocessors
52
Support StandardsOpenMP Coarray Fortran MPI
Intel® Cilk™ Plus
C/C++ language extensions to simplify
parallelism
Intel® Threading Building Blocks
Widely used C++ template library for thread
management
Intel® Xeon Processors, and
Compatible Processors
Intel® Xeon Phi™ product family
Open programming models and also Intel products
Don’t Leave Your Code Behind
Compilers & Libraries
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Simplify ParallelismIntel® Cilk™ Plus, Intel® Threading Building Blocks
53
Intel® Cilk™ Plus
• 3 simple keywords & array notations for parallelism
• Support for task and data parallelism
• Semantics similar to serial code
• Simple way to parallelize your code
• Sequentially consistent, low overhead, powerful solution
• Supports C, C++, Windows and Linux
Intel® Threading Building Blocks
• Parallel algorithms and data structures
• Scalable memory allocation and task scheduling
• Synchronization primitives
• Rich feature set for general purpose parallelism
• Available as open source or commercial license
• Supports C++, Windows, Linux, Mac OS X, other OSs
What
Features
Why
Language extensions tosimplify task/data parallelism
Widely used C++ templatelibrary for task parallelism
Task and Data Parallelism Made Easier
Compilers & Libraries
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Parallelize Applications For PerformanceIntel® Threading Building Blocks (TBB)A popular, proven parallel C++ abstractionA C++ template library
• Scalable memory allocation
• Load-balancing
• Work-stealing task scheduling
• Thread-safe pipeline
• Flexible flow graph
• Concurrent containers
• High-level parallel algorithms
• Numerous synchronization primitives
• Open source, and portable acrossmany OSs
54
Simplify Parallelism with a Scalable Parallel Model
Michaël Rouillé, CTO, Golaem
"Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Scale Forward and Extend to Intel® Xeon Phi™ CoprocessorsIntel® Cilk™ Plus
55
Intel® Cilk™ Plus (Language Extension to C/C++)
Easier Task & Data Parallelism3 simple Keywords:cilk_for, cilk_spawn, cilk_sync
Intel® Cilk™ Plus Array Notation Save time with powerful vectorization
Minimize Software Re-Work for New Hardware
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Pointer Checker
Finds buffer overflows and dangling pointers before memory corruption occurs
Powerful error reporting
Integrates into standard debuggers (Microsoft, gdb, Intel)
57
Buffer Overflow{
char *my_chp = "abc";char *an_chp = (char *) malloc (strlen((char *)my_chp));memset (an_chp, '@', sizeof(my_chp));
}
Dangling pointer{
char *p, *q;p = malloc(10);q = p;free(p);*q = 0;
}
CHKP: Bounds check errorTraceback:./a.out(main+0x1b2) [0x402d7a] in file mems.c at line 13
Pointer Checker Highlights Programming Errors For More Secure Applications
Compilers & Libraries
Intel® Parallel Studio XE
Intel® Cluster Studio XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Conditional Numerical Reproducibility
Intel® Math Kernel Library: • New deterministic task scheduling
and code path selection options
OpenMP*:• New deterministic reduction
option
Intel® Threading Building Blocks• New parallel deterministic reduce
option
58
Help Achieve Reproducible Results, Despite Non-associative Floating Point Math
“I’m a C++ and Fortran developer and have high praise for the Intel® Math Kernel Library. One nice feature I’d like to stress is the numerical reproducibility of MKL which helps me get the assurance I need that I’m getting the same floating point results from run to run."
Franz BernasekOwner / CEO , Senior DeveloperMSTC Modern Software Technology
Compilers & Libraries
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Expanded C++ 11 support
• Additional type traits• Initializer lists (partial)• Generalized constant expressions (partial)• Noexcept (partial)• Range based for loops • Conversions of lambdas to function pointers
59
Excellent Support for C++ 11 on Windows* and Linux*
Compilers & Libraries
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Expanded Fortran 2008 Support
• Maximum array rank has been raised to 31 dimensions (Fortran 2008 specifies 15)
• Recursive type may have ALLOCATABLE components
• Coarrays– CODIMENSION attribute– SYNC ALL statement– SYNC IMAGES statement– SYNC MEMORY statement– CRITICAL and END CRITICAL statements– LOCK and UNLOCK statements– ERROR STOP statement– ALLOCATE and DEALLOCATE may specify coarrays– Intrinsic procedures IMAGE_INDEX, LCOBOUND,
NUM_IMAGES, THIS_IMAGE, UCOBOUND
• CONTIGUOUS attribute
• MOLD keyword in ALLOCATE
• DO CONCURRENT
• NEWUNIT keyword in OPEN
60
G0 and G0.d format edit descriptorUnlimited format item repeat count specifierCONTAINS section may be emptyIntrinsic procedures BESSEL_J0, BESSEL_J1, BESSEL_JN, BESSEL_YN, BGE, BGT, BLE, BLT, DSHIFTL, DSHIFTR, ERF, ERFC, ERFC_SCALED, GAMMA, HYPOT, IALL, IANY, IPARITY, IS_CONTIGUOUS, LEADZ, LOG_GAMMA, MASKL, MASKR, MERGE_BITS, NORM2, PARITY, POPCNT, POPPAR, SHIFTA, SHIFTL, SHIFTR, STORAGE_SIZE, TRAILZAdditions to intrinsic module ISO_FORTRAN_ENV: ATOMIC_INT_KIND, ATOMIC_LOGICAL_KIND, CHARACTER_KINDS, INTEGER_KINDS, INT8, INT16, INT32, INT64, LOCK_TYPE, LOGICAL_KINDS, REAL_KINDS, REAL32, REAL64, REAL128, STAT_LOCKED, STAT_LOCKED_OTHER_IMAGE, STAT_UNLOCKED
New: ATOMIC_DEFINE and ATOMIC_REF, initialization of polymorphic INTENT(OUT) dummy arguments, standard handling of G format and of printing the value zero, coarrays (more support), polymorphic source allocation
Leadership F2008 Support on Linux*, Windows* & OSX*
Compilers & Libraries
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Dynamic Analysis Finds Memory & Threading ErrorsIntel® Inspector XE 2013
Find and eliminate errors• Memory leaks, invalid access…• Races & deadlocks• Analyze hybrid MPI cluster apps• Heap growth analysis
Faster & Easier to use• Debugger breakpoints
• Break on selected errors • Run faster to known error
• Pause/resume collection• Narrow analysis focus• Better performance
• Improved error suppression
61
Find Errors Early When They are Less Expensive
Intel® InspectorXE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Heap Growth AnalysisIntel® Inspector XE 2013
Does Application Memory Usage Mysteriously Grow? • Set an analysis interval with
start and analysis end points– Click a button –or–– Use an API
• See a list of memory allocations that are not freed in the interval
• Quickly zero in on suspicious activity that contributes to heap growth
62
Speeds Diagnosis of Difficult to Find Heap Errors
Intel® InspectorXE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Static Analysis Finds Coding and Security ErrorsIntel® Parallel Studio XE 2013
Find over 250 error types e.g.:• Incorrect directives• Security errors
Easier to use• Choose your priority:- Minimize false errors- Maximize error detection
• Hierarchical navigation of results• Share comments with the team
Increased Accuracy & Speed• Detect errors without all source files • Better scaling with large code bases
Code Complexity Metrics• Find code likely to be less reliable
63
Static Analysis is only available in Studio XE bundles. It is not sold separately.
Find Errors and Harden your Security
Compilers & Libraries
Intel® Parallel Studio XE
Intel® Cluster Studio XE
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Scale Forward, Scale FasterIntel Cluster Tools
Scale Performance – Perform on More Nodes
• MPI Latency - Intel® MPI Library - Up to 6.5X as fast as alternative MPI libraries
• Compiler Performance – Industry leading Intel® C/C++ & Fortran compilers
Scale Forward – multicore now, many-core ready
• Intel® MPI Library scales beyond 120k processes
• Focused to preserve programming investments for multicore and many-core machines
Scale Efficiently – Tune & Debug on More Nodes
• Thread & Memory Correctness Checking – Intel® Inspector XE now MPI enabled across many nodes
• Rapid Node Level Performance Profiling – Intel VTune Amplifier XE can identify hotspots faster and on thousands of nodes
65
Compilers & Libraries
Intel® Cluster Studio XE
High Performance Standards Driven Fabric Flexible MPI Library65
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
On the Path to ExascaleIntel® MPI Library and (part of Cluster Studio XE 2013)
Latest hardware support• Ivy Bridge and Haswell• Intel® Xeon Phi™
Coprocessor
Increased Scaling• 120k Processes
Standards Support• MPI 2.2
66
Continued Scaling Capacity to Meet Ever Growing HPC Demands
Proc
esse
s
Year
0
2000
4000
6000
8000
10000
12000
14000
16000
18000
2010 2011 2012 2013 2014 2015 2016 2017 2018
Intel® MPILibrary, Kprocesses
Doubling,Kprocesses
Exascale,Kprocesses(estimated)
90K
60K 120K
Intel® MPI Library
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Improved MPI Fault Tolerance
Implementation of Berkeley Lab Checkpoint/Restart (BLCR) †
Primary Uses• Scheduling• Process Migration• Failure Recovery
†
67
Enabling Capabilities for Robust at Scale MPI Computing
Fault Recovery Scenario
Node Fault
Checkpointing
Checkpoint Recovery
Intel®MPI Library
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
MPI 2.2 Support
Backwards compatible with MPI 2.1 programs
Delivers Distributed Graph Topology Interface• Scalable & Informative for MPI Library Communications• Easy to Use Mechanism for Conveying Comms Patterns
to MPI Applications• Used by MPI Library to Improve Mapping Process to
Process Communications• Allows better fit for Applications Communications to
Hardware Capabilities
68
Intel®MPI Library
Outstanding Support Of The Latest MPI Standard
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Optimize MPI CommunicationsIntel® Trace Analyzer and Collector (part of Cluster Studio XE 2013)
Visually understand parallel application behavior• Communications Patterns• Hotspots• Load Balance
MPI Checking• Detect Deadlocks• Data Corruption• Errors in Parameters, Data Types, etc
Scaling• Analysis Capability increasing to 6k
processes
69
Expanding MPI Profiling Capacity for Communications Optimization
Proc
esse
s
Year
0
1000
2000
3000
4000
5000
6000
7000
2010 2011 2012
Intel® Trace Analyzer and Collector (processes)
Intel® ITAC
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
70
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Value of SuitesSuite Only Features
• Advisor XEParallelism Advice
• C++ Performance Guide Performance Wizard
• Pointer CheckerReduces memory corruption
• Code Complexity Analysis Find code likely to be less reliable
• Static Analysis Improved! Find Errors and Harden your Security 73
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
What’s New in Libraries?Intel® MKL
• Digital random number generator (DRNG) for improved vector statistics calculations
• Automatically utilize Intel® Xeon Phi™ Coprocessors and balance compute loads between CPUs and coprocessors
Intel® IPP
• Enhanced image resize performance primitives
• Improved IPP footprint size
Intel® TBB
• Improved usability and reliability of the Flow Graph feature
• Additional C++11 Support
74
Ready to Use Libraries to Increase Performance
"Intel® TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table—crowd simulation software.”
Michaël Rouillé, CTO, Golaem
Compilers & Libraries
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
75
8/2/201275