41
© 2011 IBM Corporation High Performance Programming with IBM XL Compilers and Libraries SPXXL/ScicomP-17 2011 Summer Workshop Yaoqing Gao [email protected] Raúl E. Silvera [email protected] IBM Toronto Lab

High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM Corporation

High Performance Programming with IBM XL Compilers and Libraries

SPXXLScicomP-17 2011 Summer Workshop

Yaoqing Gao ygaocaibmcom

Rauacutel E Silvera raulscaibmcom

IBM Toronto Lab

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

IBM Rational Disclaimer

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any way IBM the IBM logo Rational the Rational logo Telelogic the Telelogic logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Agenda

Overview of XL Compiler FamilyMajor Features of XL CC++ V111 and XL Fortran V131Migration from GNU to XL compilers XML Compiler Transformation ReportsCompiler Optimizations for Performance

ndashProfile Directed Feedback OptimizationndashSIMDization and VectorizationndashLoop TransformationsndashData PrefetchndashData ReorganizationndashInliningndashParallelization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Overview of XL Compiler Family

Similar compilation technology used to implement C C++ and Fortran CompilersSupports AIX Linux on Power zOS (CC++ only) BlueGene CellAdvanced optimization capabilities

ndashExploitation and tuning for latest hardware implementations

ndashAggressive loop analysis and transformations (unimodular and polyhedral framework)

ndashWhole program optimizationndashSIMD code generation and Vectorization exploitationndashParallelization (automatic and user-driven through OpenMP)

ndashProfile-driven optimization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Major Features of XLC111 XLF131POWER7 support and exploitationLanguage standard conformance

ndashFull Fortran 2003ndashFull OpenMP 30 ndashAdditional C++0X features

Optimization enhancementsndashAutomatic parallelizationndashInliningndashLoop analysis and transformationsndashDelinquent load driven optimizationsndashProfile-directed feedback

Productivity enhancementndashXML compiler transformation reports

Other featuresndashFine-grained strict controlndashProPolicendashfunc_tracendashOther options and directives

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

HPC Performance Tuning with XL Compilers

-O3 ndashqarch=pwr7 or

ndashO3 ndashqhot ndashqarch=pwr7

with ndashqnostrict or ndashqstrict

Profiling for hot spot detection

Compiler instrumentation ndashqpdf1=level=12 pdf2

-pg for gprofxprofiler -qlist for tprof

User-provided profile functions -qfunctrace

SIMDization

Automatic SIMDization ndashO3 or above with ndashqsimd

User explicit SIMD program -qaltivec

Loop transformations

Loop transformations ndashO3 or above

Whole program optimizations

-O4 or ndashO5 for inter-procedural optimization inlining code partition data reorganization

Parallelization

User explicit parallelization only -qsmp=omp

Auto parallelization -qsmp (-qsmp=auto)

Polyhedral framework-qsmp with -qhot=level=2

XML Transformation Reports

-qlistfmt=xml=all

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Migration from GCC to IBM XL Compilers

CompatibilityndashSource level

bull Supports many gccg++ language extensions and annotations

ndashBinary levelbull Link objects from gccg++ and XL CC++

gxlc and gxlc++ utilities for compile option mappingndashControlled by the gxlccfg configuration file for option mappings from GCC to XL CC++

ndashModify the contents of the gxlccfg to meet your specific compilation requirements

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC

-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename

Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)

gxlccfg format

abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo

eg

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 2: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

IBM Rational Disclaimer

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any way IBM the IBM logo Rational the Rational logo Telelogic the Telelogic logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Agenda

Overview of XL Compiler FamilyMajor Features of XL CC++ V111 and XL Fortran V131Migration from GNU to XL compilers XML Compiler Transformation ReportsCompiler Optimizations for Performance

ndashProfile Directed Feedback OptimizationndashSIMDization and VectorizationndashLoop TransformationsndashData PrefetchndashData ReorganizationndashInliningndashParallelization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Overview of XL Compiler Family

Similar compilation technology used to implement C C++ and Fortran CompilersSupports AIX Linux on Power zOS (CC++ only) BlueGene CellAdvanced optimization capabilities

ndashExploitation and tuning for latest hardware implementations

ndashAggressive loop analysis and transformations (unimodular and polyhedral framework)

ndashWhole program optimizationndashSIMD code generation and Vectorization exploitationndashParallelization (automatic and user-driven through OpenMP)

ndashProfile-driven optimization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Major Features of XLC111 XLF131POWER7 support and exploitationLanguage standard conformance

ndashFull Fortran 2003ndashFull OpenMP 30 ndashAdditional C++0X features

Optimization enhancementsndashAutomatic parallelizationndashInliningndashLoop analysis and transformationsndashDelinquent load driven optimizationsndashProfile-directed feedback

Productivity enhancementndashXML compiler transformation reports

Other featuresndashFine-grained strict controlndashProPolicendashfunc_tracendashOther options and directives

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

HPC Performance Tuning with XL Compilers

-O3 ndashqarch=pwr7 or

ndashO3 ndashqhot ndashqarch=pwr7

with ndashqnostrict or ndashqstrict

Profiling for hot spot detection

Compiler instrumentation ndashqpdf1=level=12 pdf2

-pg for gprofxprofiler -qlist for tprof

User-provided profile functions -qfunctrace

SIMDization

Automatic SIMDization ndashO3 or above with ndashqsimd

User explicit SIMD program -qaltivec

Loop transformations

Loop transformations ndashO3 or above

Whole program optimizations

-O4 or ndashO5 for inter-procedural optimization inlining code partition data reorganization

Parallelization

User explicit parallelization only -qsmp=omp

Auto parallelization -qsmp (-qsmp=auto)

Polyhedral framework-qsmp with -qhot=level=2

XML Transformation Reports

-qlistfmt=xml=all

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Migration from GCC to IBM XL Compilers

CompatibilityndashSource level

bull Supports many gccg++ language extensions and annotations

ndashBinary levelbull Link objects from gccg++ and XL CC++

gxlc and gxlc++ utilities for compile option mappingndashControlled by the gxlccfg configuration file for option mappings from GCC to XL CC++

ndashModify the contents of the gxlccfg to meet your specific compilation requirements

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC

-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename

Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)

gxlccfg format

abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo

eg

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 3: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Agenda

Overview of XL Compiler FamilyMajor Features of XL CC++ V111 and XL Fortran V131Migration from GNU to XL compilers XML Compiler Transformation ReportsCompiler Optimizations for Performance

ndashProfile Directed Feedback OptimizationndashSIMDization and VectorizationndashLoop TransformationsndashData PrefetchndashData ReorganizationndashInliningndashParallelization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Overview of XL Compiler Family

Similar compilation technology used to implement C C++ and Fortran CompilersSupports AIX Linux on Power zOS (CC++ only) BlueGene CellAdvanced optimization capabilities

ndashExploitation and tuning for latest hardware implementations

ndashAggressive loop analysis and transformations (unimodular and polyhedral framework)

ndashWhole program optimizationndashSIMD code generation and Vectorization exploitationndashParallelization (automatic and user-driven through OpenMP)

ndashProfile-driven optimization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Major Features of XLC111 XLF131POWER7 support and exploitationLanguage standard conformance

ndashFull Fortran 2003ndashFull OpenMP 30 ndashAdditional C++0X features

Optimization enhancementsndashAutomatic parallelizationndashInliningndashLoop analysis and transformationsndashDelinquent load driven optimizationsndashProfile-directed feedback

Productivity enhancementndashXML compiler transformation reports

Other featuresndashFine-grained strict controlndashProPolicendashfunc_tracendashOther options and directives

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

HPC Performance Tuning with XL Compilers

-O3 ndashqarch=pwr7 or

ndashO3 ndashqhot ndashqarch=pwr7

with ndashqnostrict or ndashqstrict

Profiling for hot spot detection

Compiler instrumentation ndashqpdf1=level=12 pdf2

-pg for gprofxprofiler -qlist for tprof

User-provided profile functions -qfunctrace

SIMDization

Automatic SIMDization ndashO3 or above with ndashqsimd

User explicit SIMD program -qaltivec

Loop transformations

Loop transformations ndashO3 or above

Whole program optimizations

-O4 or ndashO5 for inter-procedural optimization inlining code partition data reorganization

Parallelization

User explicit parallelization only -qsmp=omp

Auto parallelization -qsmp (-qsmp=auto)

Polyhedral framework-qsmp with -qhot=level=2

XML Transformation Reports

-qlistfmt=xml=all

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Migration from GCC to IBM XL Compilers

CompatibilityndashSource level

bull Supports many gccg++ language extensions and annotations

ndashBinary levelbull Link objects from gccg++ and XL CC++

gxlc and gxlc++ utilities for compile option mappingndashControlled by the gxlccfg configuration file for option mappings from GCC to XL CC++

ndashModify the contents of the gxlccfg to meet your specific compilation requirements

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC

-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename

Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)

gxlccfg format

abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo

eg

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 4: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Overview of XL Compiler Family

Similar compilation technology used to implement C C++ and Fortran CompilersSupports AIX Linux on Power zOS (CC++ only) BlueGene CellAdvanced optimization capabilities

ndashExploitation and tuning for latest hardware implementations

ndashAggressive loop analysis and transformations (unimodular and polyhedral framework)

ndashWhole program optimizationndashSIMD code generation and Vectorization exploitationndashParallelization (automatic and user-driven through OpenMP)

ndashProfile-driven optimization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Major Features of XLC111 XLF131POWER7 support and exploitationLanguage standard conformance

ndashFull Fortran 2003ndashFull OpenMP 30 ndashAdditional C++0X features

Optimization enhancementsndashAutomatic parallelizationndashInliningndashLoop analysis and transformationsndashDelinquent load driven optimizationsndashProfile-directed feedback

Productivity enhancementndashXML compiler transformation reports

Other featuresndashFine-grained strict controlndashProPolicendashfunc_tracendashOther options and directives

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

HPC Performance Tuning with XL Compilers

-O3 ndashqarch=pwr7 or

ndashO3 ndashqhot ndashqarch=pwr7

with ndashqnostrict or ndashqstrict

Profiling for hot spot detection

Compiler instrumentation ndashqpdf1=level=12 pdf2

-pg for gprofxprofiler -qlist for tprof

User-provided profile functions -qfunctrace

SIMDization

Automatic SIMDization ndashO3 or above with ndashqsimd

User explicit SIMD program -qaltivec

Loop transformations

Loop transformations ndashO3 or above

Whole program optimizations

-O4 or ndashO5 for inter-procedural optimization inlining code partition data reorganization

Parallelization

User explicit parallelization only -qsmp=omp

Auto parallelization -qsmp (-qsmp=auto)

Polyhedral framework-qsmp with -qhot=level=2

XML Transformation Reports

-qlistfmt=xml=all

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Migration from GCC to IBM XL Compilers

CompatibilityndashSource level

bull Supports many gccg++ language extensions and annotations

ndashBinary levelbull Link objects from gccg++ and XL CC++

gxlc and gxlc++ utilities for compile option mappingndashControlled by the gxlccfg configuration file for option mappings from GCC to XL CC++

ndashModify the contents of the gxlccfg to meet your specific compilation requirements

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC

-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename

Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)

gxlccfg format

abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo

eg

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 5: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Major Features of XLC111 XLF131POWER7 support and exploitationLanguage standard conformance

ndashFull Fortran 2003ndashFull OpenMP 30 ndashAdditional C++0X features

Optimization enhancementsndashAutomatic parallelizationndashInliningndashLoop analysis and transformationsndashDelinquent load driven optimizationsndashProfile-directed feedback

Productivity enhancementndashXML compiler transformation reports

Other featuresndashFine-grained strict controlndashProPolicendashfunc_tracendashOther options and directives

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

HPC Performance Tuning with XL Compilers

-O3 ndashqarch=pwr7 or

ndashO3 ndashqhot ndashqarch=pwr7

with ndashqnostrict or ndashqstrict

Profiling for hot spot detection

Compiler instrumentation ndashqpdf1=level=12 pdf2

-pg for gprofxprofiler -qlist for tprof

User-provided profile functions -qfunctrace

SIMDization

Automatic SIMDization ndashO3 or above with ndashqsimd

User explicit SIMD program -qaltivec

Loop transformations

Loop transformations ndashO3 or above

Whole program optimizations

-O4 or ndashO5 for inter-procedural optimization inlining code partition data reorganization

Parallelization

User explicit parallelization only -qsmp=omp

Auto parallelization -qsmp (-qsmp=auto)

Polyhedral framework-qsmp with -qhot=level=2

XML Transformation Reports

-qlistfmt=xml=all

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Migration from GCC to IBM XL Compilers

CompatibilityndashSource level

bull Supports many gccg++ language extensions and annotations

ndashBinary levelbull Link objects from gccg++ and XL CC++

gxlc and gxlc++ utilities for compile option mappingndashControlled by the gxlccfg configuration file for option mappings from GCC to XL CC++

ndashModify the contents of the gxlccfg to meet your specific compilation requirements

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC

-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename

Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)

gxlccfg format

abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo

eg

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 6: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

HPC Performance Tuning with XL Compilers

-O3 ndashqarch=pwr7 or

ndashO3 ndashqhot ndashqarch=pwr7

with ndashqnostrict or ndashqstrict

Profiling for hot spot detection

Compiler instrumentation ndashqpdf1=level=12 pdf2

-pg for gprofxprofiler -qlist for tprof

User-provided profile functions -qfunctrace

SIMDization

Automatic SIMDization ndashO3 or above with ndashqsimd

User explicit SIMD program -qaltivec

Loop transformations

Loop transformations ndashO3 or above

Whole program optimizations

-O4 or ndashO5 for inter-procedural optimization inlining code partition data reorganization

Parallelization

User explicit parallelization only -qsmp=omp

Auto parallelization -qsmp (-qsmp=auto)

Polyhedral framework-qsmp with -qhot=level=2

XML Transformation Reports

-qlistfmt=xml=all

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Migration from GCC to IBM XL Compilers

CompatibilityndashSource level

bull Supports many gccg++ language extensions and annotations

ndashBinary levelbull Link objects from gccg++ and XL CC++

gxlc and gxlc++ utilities for compile option mappingndashControlled by the gxlccfg configuration file for option mappings from GCC to XL CC++

ndashModify the contents of the gxlccfg to meet your specific compilation requirements

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC

-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename

Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)

gxlccfg format

abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo

eg

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 7: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Migration from GCC to IBM XL Compilers

CompatibilityndashSource level

bull Supports many gccg++ language extensions and annotations

ndashBinary levelbull Link objects from gccg++ and XL CC++

gxlc and gxlc++ utilities for compile option mappingndashControlled by the gxlccfg configuration file for option mappings from GCC to XL CC++

ndashModify the contents of the gxlccfg to meet your specific compilation requirements

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC

-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename

Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)

gxlccfg format

abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo

eg

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 8: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

gxlc gxlc++ gxlC

gxlc

gxlc++

gxlC

-v -WxltXL compiler optionsgt ltgcc g++ optionsgt filename

Creating customized configuration file(XLC_USR_CONFIG environment variable to specify the location of your defined configuration file)

gxlccfg format

abcd gcc_or_g++_option ldquoxlc_or_xlc++_optionldquo

eg

nnnc -ansi -qlanglvl=extc89 -qnokeyword=inline -qnokeyword=typeof -qnokeyword=asm - qnocpluscmt -D__STRICT_ANSI__

nnn -B -B

nnn -C -C

nnn -c -c

nnn -dM -qshowmacros

nnn -D -D

E E

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 9: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XML Compiler Transformation ReportsGenerate compilation reports consumable by other toolsndash Enable better visualization and analysis of compiler informationndash Help users do manual performance tuningndash Help automatic performance tuning through performance tool

integrationUnified report from all compiler subcomponents and analysisndash Compiler optionsndash Pseudo-sourcesndash Compiler transformations including missed opportunities

Consistent support among Fortran CC++ Controlled under option

-qlistfmt=xml=inlines generates inlining information-qlistfmt=xml=transform generates loop transformation information-qlistfmt=xml=data generates data reorganization information-qlistfmt=xml=pdf generates dynamic profiling information-qlistfmt=xml=all turns on all optimization content-qlistfmt=xml=none turns off all optimization content

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 10: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Transformation Report ContentsProgram characteristics ndash Compiler versionndash Date of compilation ndash Source file tablendash Function Tablendash Loop table (line number nest level iteration count loop attributes)ndash Transformation table (line number transformation description)ndash Pseudocode

Transformations at both high and low-level optimizationsndash Intra-procedural transformations

bull Loop transformationsbull Data prefetchbull Vectorization and SIMDizationbull Parallelizationbull Instruction scheduling

ndash Inter-procedural transformationsbull Inliningbull Data reorganization

Profiling informationndash Basic block countersndash Call countersndash Cache miss counters

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 11: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XL Compiler Assisted Performance Analysis and Tuning

Compiler Optimizations for Performancendash Compiler static analysis and optimization ndash User guided optimizations through compiler options

and directivesndash Automatic compiler optimizations through profile

directed feedback

XL Compiler and Tooling Integrationndash Compiler feedback view in PTPndash Compiler transformation reports and HPCS toolkit help

detect bottlenecks and identify solutions

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 12: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Compiler Feedback View

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 13: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SOURCE CODE

COMPILE AND LINK WITH ndashqpdf1Static analysisProfile based refinement

COMPILE AND LINK WITH ndashqpdf2Profile directed optimizations

INSTRUMENTEDAPPLICATION

OPTIMIZEDAPPLICATION

PROFILE DATA

SAMPLE INPUTS

SAMPLE INPUTS

Multiple-pass Dynamic Profiling Infrastructure

Hardware and software constraints

Multiple sample runs for different hardware performance events

Profile based instrumentation refinement

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 14: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Basic Block and Call Counter Information

Call Counter InformationRegion 1

Region Execution Count

1

Call Coverage 32 49

Call Counters

Call Name Call Execution Count

Line

smtctl_ 0 79

is_smt_on_ 1 55

jbind_ 1 109

jbind_ 0 108

aff_ 1 38

Step 1 compile the application with ndashqpdf1 to generate an instrumented executable

Step 2 run the executable with typical input data set to gather profiling information

Step 3 re-compile the application with ndashqpdf2 ndashqlistfmt=xml=all to generate the optimized executable and XML compiler transformation report

Region 1

Region Execution Count 5

Block Coverage 81 81

Block Counters

Block Index

Block Execution Count

Start Line

End Line

3 5 1 33

4 5 33 34

Basic Block Counter Information

helliphellip

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 15: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Cache Miss Information

Memory Reference Region Line Cache

Level Miss Count Miss Rate

((double )((char )d-zfaci5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzfaci5[]rns50[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 408 2 17446 24

((double )((char )d-ztp25addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtztp25[]rns26[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 412 2 9999 14

((double )((char )d-zdqsdtemp5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdqsdtemp5[]rns53[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIV7]

3 413 2 9974 14

((double )((char )d-zcld5addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))-gtzcld5[]rns72[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia- gtkidiarns4 + CIVC]

3 525 2 1033429 6

((double )((char )d-zdr15addr + max((long long) klon-gtklonrns00ll) -8ll - 8ll))- gtzdr15[]rns41[(long long) ktdia-gtktdiarns6 + CIV11][(long long) kidia-gtkidiarns4 + CIVC]

3 557 2 11553 16Delinquent load

Source code location

Cache miss

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 16: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Source location

Loop informationLoop TableLoopIndex

StartLine

EndLine

Parent Loop Index Nest Level

Minimum Cost

Maximum Cost

Iteration Count

Attributes

1 203 1 19630 19630 75 (array)bull

well behavedbull

bump normalizedbull

lower bound normalized

2 188 1 20413 20413 149 (array)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

3 141 1 13300 13300 100 (default)

bull

perfect nestbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

4 203 1 19630 19630 5 (PDF)

bull

residualbull

well behavedbull

bump normalizedbull

guardedbull

lower bound normalized

Loop iteration count based on static analysis or dynamic profiling

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 17: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Transformation Reports

Seq Type Phase Region Line Loop Index

Descriptio n

Attributes

2LoopVector (success)

High Level Optimizer 4 217 1

Loop vectorizati on was performed

not available

3 LoopFusion (success)

High Level Optimizer

4 108 3 Loops were fused

bullLoop Line Number 108bullLoop Line Number 206

4LoopVector Version (success)

High Level Optimizer 4 108 3

Vector versioning was performed

not available

20ModuloSch edule (success)

Low Level Optimizer 12 3499 26

Loop was modulo scheduled

bullInitiation Interval 12

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 18: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

filec

foo (float p float q float r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

Performance Tuning with Compiler Transformation Reports-qlistfmt=xml=all

filec

foo (float restrict p float restrict q float restrict r int n)

for (int i=0 ilt n i++) p[i] = p[i] + q[i]r[i]

filexmlLoop cannot be automatically parallelized A dependency is carried by variable aliasing

Original source file modified source file

filexmlLoop was automatically parallelizedLoop was modulo scheduled

Tuning

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 19: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Explicit SIMD programming for POWER7 Enabled under -qaltivec

Successor to altivec programming extensions on POWER6PPC970ndashAltivec data types 16-byte vectors

vector char 16 elements vector short 8 elements vector pixel 8 elements vector int 4 elements vector float 4 elements

ndashVSX Altivec extensions 16-byte vectors vector double 2 elements vector long long 2 elements

Altivec built-in functions extended to new data typesvec_add(vector double vector double) vec_sub(vector long long vector long long)

New vector operations vec_mul vec_div hellipUnaligned load and store operations

ndashAltivec truncating loadsstores still available vec_ld vec_stndashNew non-truncating loadsstores vec_xld2 vec_xstd2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 20: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

21

Automatic SIMDizationAutomatic SIMDization for VMX and VSX

ndashSupports data types of INTEGER UNSIGNED REAL and COMPLEX

FeaturesndashBasic block level SIMDizatonndashLoop level aggregationndashData conversion reductionndashLoop with limited control flowndashAutomatic SIMDization with ndashqstrict (VSX) and -qnostrictndashSupport of unaligned vector memory accesses (VSX)ndashAutomatic SIMDization enabled at ndashO3 -qsimd

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 21: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable alignment

Use __attribute__((aligned(n)) to set data alignment

Use __alignx(16 a) to indicate the data alignment to the compiler

Use -qassert=refalign if all references are naturally aligned

Use array references instead of pointers where possible

data dependence prevents SIMD vectorization

Use fewer pointers when possible

Use pragma independent if it has no loop carried dependency

Use pragma disjoint (a b) if a and b are disjoint

Use restrict keyword or compiler option ndashqrestrict

User actionsTransformation report

Loop was SIMD vectorized

Use pragma simd_level(10) to force the compiler to do SIMDizationIt is not profitable

to vectorize

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 22: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

SIMDization Tuning

memory accesses have

non-vectorizable strides

Loop interchange for stride-one accesses when possible

Data layout reshape for stride-one accesses

Higher optimization to propagate compile known stride information

Stride versioning

Do statement splitting and loop splitting

User actionsTransformation report

either operation or data type is not suitable for SIMD vectorization

Convert while-loops into do-loops when possible

Limited use of control flow in a loop

Use MIN MAX instead of if-then-else

Eliminate function calls in a loop through inlining

loop structure prevents SIMD vectorization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 23: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

MASS enhancements and Auto-vectorization

MASS enhancements for POWER7ndashPOWER7 vector MASS library (libmassvp7a)

bull Internally exploit VSX instructionsSP average speedup of 199 vs Power5 MASSVDP average speedup of 127 vs Power5 MASSV

ndashPOWER7 SIMD MASS library (libmass_simdp7a)bull Tuned math routines operating on vector data typesbull Over 35 frequently used mathematical functions bull Both simple and double precisionbull To be used in conjunction with explicit SIMD programming

Auto-vectorization at optimization level ndashO3 or above -qstrict=vectorprecision to maintain precision over all loop iterations

for (i=0iltni++)

b[i]=sqrt(a[i])

__vsqrt_P7(ban)

Loop vectorization was performed

Transformation report

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 24: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Software-controlled data prefetching for POWER7

Software control over POWER7 prefetch engine supporting up to 12 data streams

Fine grained software controlled data prefetch including stream type stream length stream stride prefetch depth at optimization level -O3 ndashqhot or above

ndashMore aggressive exploitation under option ndash qprefetch=aggressive

Global analysis for coarse grained prefetch engine control at optimization level -O5

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 25: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Built-in functions for POWER7 data prefetching and cache control

Transient cache line touchvoid __dcbtt(void address)void __dcbtstt (void address)

Partial cache line touchvoid __partial_dcbt(void address)

Stride-N stream prefetchvoid __protected_stream_stride(offset stride stream_ID)

Transient stream prefetchvoid __transient_protected_stream_count_depth(unit_count

depth stream_ID)void __transient_unlimited_protected_stream_depth(prefetch_depth

stream_ID)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 26: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Example of POWER7 data prefetching

for (i=0 ilt n i++) a[i] = b[i] +

__protected_store_stream_set(FORWARD ampa 11)

__protected_stream_count_depth(nsizeof(double)128 DEEPER 11)

__protected_stream_set(FORWARD ampb 0)

__transient_protected_stream_count_depth(nsizeof(double)128 DEEPER 0)

__eieio()

__protected_stream_go()

Store stream prefetch for array a

transient stream prefetch for array bStream id

Stream length

Stream direction

Prefetch depth

Start stream prefetch

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 27: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Loop Optimization

Traditional unimodular loop transformations for prefect regular loop nests

ndashCompiler loop transformations including loop fusion loop distribution unroll-and-jam loop tiling loop rerolling loop collapsing loop unrolling

ndashCompiler pragmas unroll stream_unroll block_loop unrollandfusendashUnder the optimization level O3 O3 ndashqhot or above

Polyhedral framework for any loop nestsndashProvide abstract representation for aggressive analysis and

complicated transformations of arbitrary loop nests and shapes under option control ndashqhot=level=2 with -qsmp

bull Loop skewing loop tiling for triangular loop shapesndashPerform exact dependence testing through unified dependence

formulation to enable more aggressive loop transformations in both traditional and polyhedral frameworks under at all hot levels (-O3 ndashqhot or above)

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 28: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Examples

Dependence analysis

do i = 1Ndo j = 1ido k = i+1Nc(k) = c(j) + b(jk)

Enable interchange of j and k loops to improve access locality for b

ndashIdentifies independence of memory accesses to c

Affects all optimization levels that include -qhot

Loop transformations

Tiling of triangular matrix multiplicationdo i = 1Ndo j = i+1Ndo k = i+1Nc(ji)+=a(ki)b(jk)

Also allows transformation of imperfect loop nests

ndashIntervening code between loops

Only available at -qhot=level=2

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 29: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Polyhedral Loop Transformation Example

Sequence of Imperfect Loop nests

for j hellip for k hellipA[j] = B[k]for i hellipC[i][k]=hellip

for j hellip for k hellipM[j] = N[k]for i hellip X = C[i][k]

InputParallelism amp Locality

Optimized Loopsfor jT hellipomp parallel forfor kT hellipfor j hellipfor k hellipA[j] = B[k]M[j] = N[k]for iC[i][k] = hellipX = C[i][k]

Output

Loop fusionLoop skewing to enable tiling

Loop tiling for cache

Loop skewing forPipeline parallelization

Loop tiling for registers

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 30: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Data ReorganizationData reorganization transformations to reduce memory latency

ndashContext sensitive and insensitive safety analysisndashData affinity analysis and shape analysisndashData splitting data transposing data interleaving to reshape data layout

Enabled at O5

Data Reorganization Report

Seq

Type Phase Data Name

Category Region Line Description

1 ArraySplitting High Level Optimizer

iplus 9An array of a large aggregated data-type was split into multiple arrays of smaller data-types

27 ArrayCoalescing High Level Optimizer

net Global variables were aggregated

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 31: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

S[0]F0 S[0]F1 S[0]F2 S[0]F3

S[1]F0 S[1]F1 S[1]F2 S[1]F3

S[2]F0 S[2]F1 S[2]F2 S[2]F3

S[3]F0 S[3]F1 S[3]F2 S[3]F3

hellip hellip hellip hellip

F0[0]

F1[0]

F2[0]

F3[0]

F0[1]

F1[1]

F2[1]

F3[1]

F0[2]

F1[2]

F2[2]

F3[2]

F0[3]

F1[3]

F2[3]

F3[3]

hellip

hellip

hellip

hellip

A[0]

A[1]

A[2]

A[3]

A[0][2] A[0][3]A[0][1]A[0][0]

A[1][2] A[1][3]A[1][1]A[1][0]

A[2][2] A[2][3]A[2][1]A[2][0]

A[3][2] A[3][3]A[3][1]A[3][0]

hellip

hellip

hellip

hellip

A[3] A[3][2] A[3][3]A[3][1]A[3][0] hellip

A[0][0] A[0][1] A[0][2] A[0][3]

A[1][0] A[1][1] A[1][2] A[1][3]

A[2][0] A[2][1] A[2][2] A[2][3]

A[3][0] A[3][1] A[3][2] A[3][3]

hellip hellip hellip hellip

hellip

hellip

hellip

hellip

hellip

Arsquo[0][0]

Arsquo[1][0]

Arsquo[2][0]

Arsquo[3][0]

Arsquo[0][1]

Arsquo[1][1]

Arsquo[2][1]

Arsquo[3][1]

Arsquo[0][2]

Arsquo[1][2]

Arsquo[2][2]

Arsquo[3][2]

Arsquo[0][3]

Arsquo[1][3]

Arsquo[2][3]

Arsquo[3][3]

hellip

hellip

hellip

hellip

hellip helliphelliphelliphellip

Array splitting

Array merging Array transposing

Data locality cache utilization

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 32: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

User Explicit Parallelization with OpenMPFull OpenMP 30 implementation on C C++ and Fortran

ndashFull OpenMP task parallelizationndashPrivatization of Fortran descriptor-based arrays

Efficient Threadprivate implementation using OS supported Thread-Local Storage TLS by default

ndashBypass expensive pthread key mechanismndashpthread-based implementation available under option control -qsmp=nOSTLS for backward compatibility

Improved interaction between OpenMP and automatic SIMD

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 33: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Automatic parallelizationImprovements to automatic parallelization

ndashMore effective array data flow analysis for array privatization

ndashAutomatic privatization of descriptor-based Fortran arraysndashRuntime-dependence testing

SMP runtime improvementsndashLeverage of TLS on SMP runtime implementation

Enable automatic parallelization with the compiler option -qsmp

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 34: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

XLSMPOPTS Environment Variable for Runtime Tuning

XLSMPOPTS environment variable allows you to tune runtime behavior of OpenMP and autoparallel programs

Some suboptions of interestndashspins and yields to define the behavior of idle threadsndashThread binding using startproc and stride suboptionsndashnew bind suboption on AIX bind=SDL=ltstart resourcegtltnumber of resourcesgtltstridegtbindlist=SDL=i0i1hellipix

ndashschedule to define the runtime scheduling algorithm used for parallel loops (static dynamic guided)

bull Note that the default schedule has changed from runtime to auto in V11V13

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 35: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Inlining

Single control knob to enable inliningndashSimplifies inlining control for programmer

-qinline=level=X X=010 (default 5)ndashConsistent across all languages and optimization levels ndashPrevious mechanisms still available

User inline control with ndashqinline+|-ltfunction_namegt

Automatic inlining before loop optimizationndashPreviously only available at ndashO5 or user inlining on C++ndashAvailable at all levels of ndashqhot (default at ndashO3 and up)ndashEnables early inlining of Fortran module procedures

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 36: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Control over optimizations that may affect program results -qstrict suboptions

Aggressive optimization may affect the results of the programndashPrecision of floating-point computationndashHandling of special cases of IEEE FP standard (INF NAN etc)ndashUse of alternate math libraries

-qstrict guarantees identical result to noopt at the expense of optimization

ndashSuboptions allow fine-grain control over this guaranteendashExamples

-qstrict=precision Strict FP precision-qstrict=exceptions Strict FP exceptions-qstrict=ieeefp Strict IEEE FP implementation-qstrict=nans Strict general and computation of NANs-qstrict=order Do not modify evaluation order-qstrict=vectorprecision Maintain precision over all loop iterations

Can be combined -qstrict=precisionnonans

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 37: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

The IBM Rational CC++ Cafeacute on IBM developerWorksibmcomrationalcafecommunityccppibmcomrationalcafecommunityccpp

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 38: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM Corporation40 SP-XXL Compiler update | IBM ConfidentialJan 2011

IBM | Software Group | Rational

Fortran Cafe on IBM developerWorks

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 39: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Feature RequestRequest for a feature to be supported by our compilers

CC++ feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005811

Fortran feature request pagendashhttpwwwibmcomsupportdocviewwssuid=swg27005812

Or send e-mail to xl_featurecaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 40: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

Documentation

An information center containing the documentation for the XL Fortran V131 and XL CC++ V111 versions of the AIX compilers is available at httppublibboulderibmcominfocentercomphelpv111v131indexjspndashNow downloadable as fully searchable package

Whitepaper ldquoCode optimization with the IBM XL Compilersrdquohttpwww-01ibmcomsupportdocviewwssuid=swg27005174

Whitepaper ldquoOverview of the IBM XL CC++ and XL Fortran Compiler Familyrdquo available at httpwwwibmcomsupportdocviewwssuid=swg27005175

Please send any comments or suggestions on this information center or about the existing C C++ or Fortran documentation shipped with the products to compinfocaibmcom

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43
Page 41: High Performance Programming with IBM XL Compilers and ...spscicomp.org/wordpress/wp-content/uploads/2011/05/gao-Perform… · –Exploitation and tuning for latest hardware implementations

copy 2011 IBM CorporationHigh Performance Programming with IBM XL Compilers and Libraries | SPXXLScicomP-17 2011 Summer WorkshopMay 2011

IBM | Software Group | Rational

43

copy Copyright IBM Corporation 2011 All rights reserved The information contained in these materials is provided for informational purposes only and is provided AS IS without warranty of any kind express or implied IBM shall not be responsible for any damages arising out of the use of or otherwise related to these materials Nothing contained in these materials is intended to nor shall have the effect of creating any warranties or representations from IBM or its suppliers or licensors or altering the terms and conditions of the applicable license agreement governing the use of IBM software References in these materials to IBM products programs or services do not imply that they will be available in all countries in which IBM operates Product release dates andor capabilities referenced in these materials may change at any time at IBMrsquos sole discretion based on market opportunities or other factors and are not intended to be a commitment to future product or feature availability in any wayIBM the IBM logo the on-demand business logo Rational the Rational logo and other IBM products and services are trademarks of the International Business Machines Corporation in the United States other countries or both Other company product or service names may be trademarks or service marks of others

  • High Performance Programming with IBM XL Compilers and Libraries SPXXLScicomP-17 2011 Summer Workshop
  • IBM Rational Disclaimer
  • Agenda
  • Overview of XL Compiler Family
  • Major Features of XLC111 XLF131
  • HPC Performance Tuning with XL Compilers
  • Migration from GCC to IBM XL Compilers
  • gxlc gxlc++ gxlC
  • XML Compiler Transformation Reports
  • Compiler Transformation Report Contents
  • XL Compiler Assisted Performance Analysis and Tuning
  • Compiler Feedback View
  • Slide Number 13
  • Basic Block and Call Counter Information
  • Cache Miss Information
  • Loop information
  • Loop Transformation Reports
  • Slide Number 18
  • Explicit SIMD programming for POWER7Enabled under -qaltivec
  • Automatic SIMDization
  • SIMDization Tuning
  • SIMDization Tuning
  • MASS enhancements and Auto-vectorization
  • Software-controlled data prefetching for POWER7
  • Built-in functions for POWER7 data prefetching and cache control
  • Example of POWER7 data prefetching
  • Loop Optimization
  • Polyhedral Loop Transformation Examples
  • Polyhedral Loop Transformation Example
  • Data Reorganization
  • Slide Number 33
  • User Explicit Parallelization with OpenMP
  • Automatic parallelization
  • XLSMPOPTS Environment Variable for Runtime Tuning
  • Inlining
  • Control over optimizations that may affect program results-qstrict suboptions
  • The IBM Rational CC++ Cafeacute on IBM developerWorks
  • Fortran Cafe on IBM developerWorks
  • Feature Request
  • Documentation
  • Slide Number 43