
Intel® Composer XE for HPC customers

July 2010

Denis Makoshenko, Intel, SSG

Copyright © 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Intel® C++ and Fortran Composer XE


Intel® C++ Composer XE components

• Intel® C++ Compiler XE 12.0
• Intel® Debugger with parallel debugging support (Linux*) 12.0
• Intel® Parallel Debugger Extension (Windows*) 12.0
• Intel® Math Kernel Library (Intel® MKL) 10.3
• Intel® Integrated Performance Primitives (Intel® IPP) 7.0
• Intel® Threading Building Blocks (Intel® TBB) 3.0

Intel® Composer XE for Fortran components

• Intel® Fortran Compiler XE (Linux*, MacOS*) 12.0
• Intel® Visual Fortran Compiler XE (Windows*) 12.0
• Intel® Debugger with parallel debugging support (Linux*) 12.0
• Intel® Parallel Debugger Extension (Windows*) 12.0
• Intel® Math Kernel Library (Intel® MKL) 10.3


Important Notes

• Some of the names used here for features of the future compiler and library products are not yet finalized

• The feature set and functionality may change (slightly) in the final product version


Intel® Compiler Architecture

• C++ Front End / FORTRAN Front End

• Profiler

• Disambiguation: types, arrays, pointers, structures, directives

• Interprocedural analysis and optimizations: inlining, constant propagation, whole-program detection, mod/ref, points-to

• Loop optimizations: data dependences, prefetch, vectorizer, unroll/interchange/fusion/distribution, auto-parallel/OpenMP

• Global scalar optimizations: partial redundancy elimination, dead store elimination, strength reduction, dead code elimination

• Code generation: instruction selection, scheduling, register allocation


Interprocedural Optimization: extends optimizations across file boundaries

Without IPO: file1.c, file2.c, file3.c, and file4.c are each compiled and optimized separately.

With IPO: file1.c, file2.c, file3.c, and file4.c are compiled and optimized together as one unit.

-ip    Only between modules of one source file

-ipo   Modules of multiple files/whole application


Profile-Guided Optimizations (PGO)

• Use execution-time feedback to guide (final) optimization

• Helps I-cache, paging, branch prediction

• Enabled optimizations:
  – Basic block ordering
  – Better register allocation
  – Better decisions on which functions to inline
  – Function ordering
  – Switch-statement optimization


PGO Usage: Three-Step Process

Step 1 – Instrumented compilation: icc -prof_gen prog.c
  Produces the instrumented executable prog.exe.

Step 2 – Instrumented execution: run prog.exe on a typical dataset.
  Produces .dyn files containing dynamic info. (Delete old .dyn files unless you want their info included.)

Step 3 – Feedback compilation: icc -prof_use prog.c
  Merges the .dyn files into a summary file (.dpi) and uses the profile in the final optimization.


Some Generic Features

• Compatibility with standards (ANSI C, ISO C++, ANSI C99, Fortran 95, Fortran 2003)

• Compatibility with leading open-source tools (ICC vs. GCC, IDB vs. GDB, ICL vs. CL, …)

• OpenMP support and automatic parallelization

• Sophisticated optimizations:
  – Profile-guided optimization
  – Multi-file interprocedural optimization

• Detailed compilation report generation

• Support for other Intel® tools


A Few General Switches

Functionality                                                  Linux*

Disable optimization                                           -O0

Optimize for speed (no code size increase)                     -O1

Optimize for speed (default)                                   -O2

High-level optimizer (e.g. loop unrolling)                     -O3

Vectorization for x86 (-xSSE2 is default)                      <many options>

Aggressive optimizations (e.g. -ipo, -O3, -no-prec-div,        -fast
-static, -xHost for x86 Linux*)

Create symbols for debugging                                   -g

Generate assembly files                                        -S

Optimization report generation                                 -opt-report

OpenMP support                                                 -openmp

Automatic parallelization (uses the OpenMP* runtime)           -parallel


Optimization Report Options

• -opt-report: generate an optimization report to stderr (or a file)

• -opt-report-file <file>: specify the filename for the generated report

• -opt-report-phase <phase_name>: specify the phase that reports are generated for

• -opt-report-routine <name>: report on routines containing the given name

• -opt-report-help: display the optimization phases available for reporting

• -vec-report<level>: generate a vectorization report


New Features of Intel® Composer XE

• GAP – Guided Automatic Parallelization

• SIMD directives: provide additional information to the compiler to enable vectorization of loops

• Loop profiler: the report file contains
  – average, minimum, and maximum iteration counts of loops
  – call counts of routines
  – self-time and total-time of functions / loops

• Static security analyzer: a tool based on compiler technology to statically verify programs

• C/C++ specific: vector notation, Cilk, C++0x features

• Fortran specific: F2003 status, CAF, DO CONCURRENT


Memory Reference Disambiguation: Options/Directives Related to Aliasing

• -alias_args[-]

• -ansi-alias[-]

• -fno-alias: no aliasing in the whole program

• -fno-fnalias: no aliasing within single units

• -restrict (C99): the -restrict switch and the restrict qualifier enable selective pointer disambiguation

There are many more options, different for Windows* and Linux*, and different for C/C++ and Fortran.


GAP – Guided Automatic Parallelization

Key design ideas:

• Use compiler infrastructure to help the developer detect what is blocking certain optimizations – in particular vectorization, parallelization, and data transformations – and to change the code correspondingly

• Very specific hints to fix the problem

• Not a separate tool but a feature of the C/C++ and Fortran compilers

• Exploits multi-year experience built into compiler development

• Performance-tuning knowledge based on dealing with numerous applications, benchmarks, and compute kernels

It is not:

• An automatic vectorizer or parallelizer; in fact, no code is generated, which accelerates the analysis


GAP – How it Works: Selection of the Most Relevant Switches

Multiple compiler switches activate and fine-tune guidance analysis.

• Activate messages individually for vectorization, parallelization, data transformations, or all three:

  -guide[=level]
  -guide-vec[=level]
  -guide-par[=level]
  -guide-data-trans[=level]

  The optional argument level=1,2,3,4 controls the extent of the analysis; Intel Composer only supports up to level 3.

• Control the source-code part for which analysis is done:

  -guide-opts=<arg>

  Example: -guide-opts="convert.c,'funca(int)'"


Vectorization Example

void mul(NetEnv* ne, Vector* rslt,
         Vector* den, Vector* flux1,
         Vector* flux2, Vector* num)
{
    float *r, *d, *n, *s1, *s2;
    int i;

    r = rslt->data;  d = den->data;
    n = num->data;   s1 = flux1->data;
    s2 = flux2->data;

    for (i = 0; i < ne->len; ++i)
        r[i] = s1[i]*s2[i] + n[i]*d[i];
}

GAP Messages (simplified):

1. “Use a local variable to hoist the upper bound of the loop at line 29 (variable: ne->len) if the upper bound does not change during execution of the loop”

2. “Use ‘#pragma ivdep’ to help vectorize the loop at line 29, if the arrays in the loop do not have cross-iteration dependencies: r, s1, s2, n, d”

-> Upon recompilation, the loop will be vectorized

The compiler guides the user on the source change, on what pragma to insert, and on how to determine whether that pragma is correct for this case.


Data Transformation Example

struct S3 {
    int a;
    int b;          // hot
    double c[100];
    struct S2 *s2_ptr;
    int d;
    int e;
    struct S1 *s1_ptr;
    char *c_p;
    int f;          // hot
};

for (ii = 0; ii < N; ii++) {
    sp->b = ii;
    sp->f = ii + 1;
    sp++;
}

peel.c(22): remark #30756: (DTRANS) Splitting the structure 'S3' into two parts will improve data locality and is highly recommended. Frequently accessed fields are 'b, f'; performance may improve by putting these fields into one structure and the remaining fields into another structure. Alternatively, performance may also improve by reordering the fields of the structure. Suggested field order: 'b, f, s2_ptr, s1_ptr, a, c, d, e, c_p'. [VERIFY] The suggestion is based on the field references in the current compilation …


Privatization Directives for “pragma parallel”

• Mark variables in loops as ‘private’ for automatic parallelization, similar to what the OpenMP private clause does
  – Available for Fortran and C/C++
  – Used by GAP as advice to add in when appropriate

Syntax for C/C++:

  #pragma parallel [ clause [ [,] clause ]… ]

where clause can be one of the following:

  always [ assert ]
  private( var [ : expr ] [ , var [ : expr ] ]… )
  lastprivate( var [ : expr ] [ , var [ : expr ] ]… )

where var is a variable name.

Note: if <expr> is missing, the semantics correspond to OpenMP 3.0.


Concurrency-Safe Function Attribute (also used by GAP)

Windows syntax:
  __declspec(concurrency_safe[(profitable | cost(cycle-count))])

Linux syntax:
  __attribute__((concurrency_safe[(profitable | cost(cycle-count))]))

(In Fortran, similar functionality is available via a directive.)

Semantics: __attribute__((concurrency_safe)) declares that the function has no “unacceptable” side effects when invoked in parallel.

profitable clause: the loops or blocks that contain calls to this function are profitable to parallelize.


Questions?


Intel® Code Coverage Tool

Example of a code-coverage summary for a project. The workload applied in this test exercised 34 of 143 blocks, representing 5 of 19 functions in 2 of 3 modules. In the file SAMPLE.C, 4 of 5 functions were exercised.

Clicking on SAMPLE.C produces a listing that highlights the code that was exercised. In this example, the pink-highlighted code was never exercised, the yellow code was run but not exercised by any of the tests set up by the developer, and the beige code was partially covered.