Getting Reproducible Results with Intel® MKL 11.0
Todd Rosenquist
Technical Consulting Engineer
Intel® Math Kernel Library
Copyright© 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
The agenda
Reproducible results in Intel MKL
• The symptom
• The problem
• The reality
• The requirements
• A conditional solution
• A beginner’s guide
• Performance
• Further resources
Try the feature in the recently released Intel® MKL 11.0
Ever seen something like this?
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
…or this?
Intel® Xeon® Processor E5540 Intel® Xeon® Processor E3-1275
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
Why do results vary?
Root cause for variations in results
• With floating-point numbers, the order of computation matters!
• Double precision example where (a+b)+c ≠ a+(b+c):
2^-63 + 1 + -1 = 2^-63 (infinitely precise result)
(2^-63 + 1) + -1 = 0 (correct IEEE double precision result)
2^-63 + (1 + -1) = 2^-63 (correct IEEE double precision result)
Order matters when doing floating point arithmetic.
Why does the order of operations change in Intel MKL?
Many optimizations require a change in order of operations.
Optimizations
• Instruction sets: code path optimized to use all the processor features available on the system where the program is run
• Memory alignment: affects grouping of data in registers
• Multiple cores / multiple processors: most functions are threaded to use as many cores as will give good scalability
• Non-deterministic task scheduling: some algorithms use asynchronous task scheduling for optimal performance
Why are reproducible results important for Intel MKL users?
Technical/legacy: Software correctness is determined by comparison to previous ‘gold’ results.
Debugging: When developing and debugging, a higher degree of run-to-run stability is required to find potential problems.
Legal: Accreditation or approval of software might require exact reproduction of previously defined results.
Customer perception: Developers may understand the technical issues with reproducibility but still require reproducible results, since end users or customers will be disconcerted by the inconsistencies.
Source: Email correspondence with Kai Diethelm of GNS. See his whitepaper: http://www.computer.org/cms/Computer.org/ComputingNow/homepage/2012/0312/W_CS_TheLimitsofReproducibilityinNumericalSimulation.pdf
Balancing Reproducibility and Performance: Conditional Numerical Reproducibility (CNR)
Goal: Achieve best performance possible for cases that require reproducibility
Memory alignment
• Align memory: try the Intel MKL memory allocation functions
• 64-byte alignment for processors in the next few years
Number of threads
• Set the number of threads to a constant number
• Use sequential libraries
Deterministic task scheduling
• Ensures that FP operations occur in a fixed order, so that results are reproducible
Code path control
• Maintains consistent code paths across processors
• Will often mean lower performance on the latest processors
Why “Conditional”?
• In Intel MKL 11.0, reproducibility is currently available under certain conditions:
– Within a single operating system / architecture
  – Reproducibility only applies within the blue boxes, not between them…
  [Diagram: operating systems Linux*, Windows*, and Mac OS X, each with IA32 and Intel® 64 boxes]
– Reproducibility on all supported servers and workstations
  – No support yet for Intel® Xeon Phi™ coprocessors
– Within a particular version of Intel MKL
  – Results in version 11.0 update 1 may differ from results in version 11.0
– Reproducibility controls in Intel MKL only affect Intel MKL functions
Conditions for reproducibility
Aligned input and output arrays in function calls
• 16-byte alignment for the family of SSE instruction sets
• 32-byte alignment for AVX
• 64-byte alignment for future processors <- choose this to be safe
Set the same number of computational threads for the library in each run
Use the same Intel MKL parameters from run to run
• Example: You cannot call a function in 3 blocks in one run and 4 blocks in the next
Use the new functions & controls to ensure deterministic task scheduling and to control code paths
• CNR controls must be set or called before any computational math functions in Intel MKL
Example - COMPATIBLE
For reproducible results on Intel and Intel-compatible CPUs supporting SSE2 instructions or later
• function call
mkl_cbwr_set(MKL_CBWR_COMPATIBLE)
• or environment variable
set MKL_CBWR=COMPATIBLE
Note: MKL_CBWR_COMPATIBLE is provided because Intel and Intel-compatible CPUs have approximation instructions (e.g., rcpps/rsqrtps) that may return different results. This option ensures that Intel MKL uses an SSE2-only code path that does not contain any of these instructions.
Example – SSE2
For the same results on every Intel processor that supports SSE2 instructions or later
• function call
mkl_cbwr_set(MKL_CBWR_SSE2)
• or environment variable
set MKL_CBWR=SSE2
Note: on non-Intel processors the results may differ since only the MKL_CBWR_COMPATIBLE path is supported
Example – SSE4.2
For the same results on every Intel processor that supports SSE4.2 instructions or later
• function call
mkl_cbwr_set(MKL_CBWR_SSE4_2)
• or environment variable
set MKL_CBWR=SSE4_2
Note: on non-Intel processors the results may differ since only the MKL_CBWR_COMPATIBLE path is supported
Example – deterministic task scheduling
For consistent results on all supported processors without fixing the code branch
• function call
mkl_cbwr_set(MKL_CBWR_AUTO)
• or environment variable
set MKL_CBWR=AUTO
• Note
– This will ensure deterministic task scheduling
– It will not give you reproducibility from processor to processor
Example – Find out the best performing option from a pool of processors
For the best option given a pool of computing resources in a grid setting, you may launch a simple program as follows:
#include <stdio.h>
#include <mkl.h>

int main(void) {
    int my_cbwr_branch;
    /* Find the best CNR branch available on this machine */
    my_cbwr_branch = mkl_cbwr_get_auto_branch();
    /* mkl_cbwr_set() returns MKL_CBWR_SUCCESS when the branch is accepted */
    if (mkl_cbwr_set(my_cbwr_branch) != MKL_CBWR_SUCCESS) {
        printf("Error in setting branch. Aborting...\n");
        return -1; /* not a valid branch value */
    }
    /* Report the branch through the exit code */
    return my_cbwr_branch;
}
Examine all results and use mkl_cbwr_set(<minimum_result>)
The full list of options:
COMPATIBLE 3
SSE2 4
SSE3 5
SSSE3 6
SSE4_1 7
SSE4_2 8
AVX 9
AVX2 10
Change this sort of inconsistency…
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
• Align memory
• Constant # of threads
• Turn on CNR with either
mkl_cbwr_set(MKL_CBWR_AUTO)
or
set MKL_CBWR=AUTO
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
Change this inconsistency in results…
Intel® Xeon® Processor E5540 Intel® Xeon® Processor E3-1275
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
C:\Users\me>test.exe
4.012345678902222
…to get reproducible results?
• Align memory
• Constant # of threads
• Turn on CNR with either…
mkl_cbwr_set(MKL_CBWR_SSE4_2)
or
set MKL_CBWR=SSE4_2
Intel® Xeon® Processor E5540 (Supporting SSE4.2 instructions)
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
Intel® Xeon® Processor E3-1275 (Supporting AVX instructions)
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
C:\Users\me>test.exe
4.012345678901111
What’s next?
https://softwareproductsurvey.intel.com/survey/150072/1afd/
Further resources on conditional numerical reproducibility
Intel MKL Documentation – online and in the product
• Intel MKL User’s Guide
• Reference Manual
Knowledgebase articles on CNR
Support
• Intel MKL user forum
• Intel Premier support
Feedback
• Survey: https://softwareproductsurvey.intel.com/survey/150072/1afd/
New optimizations and features
Support for the Intel® Xeon Phi™ coprocessor based on the Intel® Many Integrated Core Architecture (Intel® MIC Architecture) on Linux* only
Optimizations using the new Intel® Advanced Vector Extensions 2 (AVX2) including the new FMA3 instructions
FFTs: Completed support for real-to-complex transforms with sizes given by 64-bit integers
Local threading control function
• mkl_set_num_threads_local()
Sept 18th, 2012, 9:00 AM: Interesting ties between tools and new hardware features: How Intel Tools support the many new features in processors and coprocessors
Oct 2nd, 2012, 9:00 AM: Pointer Checker: Catch Out-of-Bounds Memory Accesses Easily!
Oct 16th, 2012, 9:00 AM: How Intel® Parallel Studio XE is used to improve the HMMER application
Oct 30th, 2012, 9:00 AM: Using the Intel® Math Kernel Library 11.0 and Compiler to Obtain Run-to-Run Reproducible Results
Oct 9th, 2012, 9:00 AM: Achieving better parallel performance of Fortran programs with Intel® VTune™ Amplifier XE profiling
Oct 23rd, 2012, 9:00 AM: Three common Fortran mistakes you can avoid by using Intel® Inspector XE
Nov 6th, 2012, 9:00 AM: Avoid common parallelization mistakes with the help of Intel® Advisor XE
Dec 4th, 2012, 9:00 AM: Fortran 2008 Standard Parallel Programming Features in Intel® Fortran Composer XE*
http://software.intel.com/en-us/fall-webinar-series-psxe-and-fsxe
Summary
Conditional Numerical Reproducibility (CNR) provides:
• reproducible results from run-to-run
• reproducible results from processor-to-processor
• the ability to balance reproducibility requirements with great performance
Evaluate CNR in the following:
• Intel® Math Kernel Library 11.0
• Intel® Composer XE 2013
• Intel® Parallel Studio XE 2013
• Intel® Cluster Studio XE 2013
Provide feedback: https://softwareproductsurvey.intel.com/survey/150072/1afd/
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright © , Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804
Legal Disclaimer & Optimization Notice