Oracle® Solaris Studio 12.3: OpenMP API User's Guide Part No: E21996 January, 2012



Oracle® Solaris Studio 12.3: OpenMP API User's Guide

Part No: E21996, January 2012


Copyright © 2003, 2012, Oracle and/or its affiliates. All rights reserved.

This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit, perform, publish or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited.

The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing.

If this is software or related documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, the following notice is applicable:

U.S. GOVERNMENT RIGHTS

Programs, software, databases, and related documentation and technical data delivered to U.S. Government customers are "commercial computer software" or "commercial technical data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, duplication, disclosure, modification, and adaptation shall be subject to the restrictions and license terms set forth in the applicable Government contract, and, to the extent applicable by the terms of the Government contract, the additional rights set forth in FAR 52.227-19, Commercial Computer Software License (December 2007). Oracle America, Inc., 500 Oracle Parkway, Redwood City, CA 94065.

This software or hardware is developed for general use in a variety of information management applications. It is not developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.

Oracle and Java are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners.

Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Opteron, the AMD logo, and the AMD Opteron logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.

This software or hardware and documentation may provide access to or information on content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services. Oracle Corporation and its affiliates will not be responsible for any loss, costs, or damages incurred due to your access to or use of third-party content, products, or services.



Contents

Preface

1 Introducing the OpenMP API
  1.1 Where to Find the OpenMP Specifications
  1.2 Special Conventions

2 Compiling and Running OpenMP Programs
  2.1 Compiler Options
  2.2 OpenMP Environment Variables
    2.2.1 OpenMP Environment Variables Defaults
    2.2.2 Oracle Solaris Studio Environment Variables
  2.3 Processor Binding
    2.3.1 Virtual and Logical Processor IDs
    2.3.2 Interpreting the Value Specified for SUNW_MP_PROCBIND
    2.3.3 Interaction With OS Processor Sets
  2.4 Stacks and Stack Sizes
  2.5 Checking and Analyzing OpenMP Programs

3 Implementation-Defined Behaviors
  3.1 Task Scheduling Points
  3.2 Memory Model
  3.3 Internal Control Variables
  3.4 Dynamic Adjustment of Threads
  3.5 Loop Directive
  3.6 Constructs
    3.6.1 SECTIONS
    3.6.2 SINGLE
    3.6.3 ATOMIC
  3.7 Routines
    3.7.1 omp_set_num_threads()
    3.7.2 omp_set_schedule()
    3.7.3 omp_set_max_active_levels()
    3.7.4 omp_get_max_active_levels()
  3.8 Environment Variables
  3.9 Fortran Issues
    3.9.1 THREADPRIVATE Directive
    3.9.2 SHARED Clause
    3.9.3 Runtime Library Definitions

4 Nested Parallelism
  4.1 Execution Model
  4.2 Control of Nested Parallelism
    4.2.1 OMP_NESTED
    4.2.2 OMP_THREAD_LIMIT
    4.2.3 OMP_MAX_ACTIVE_LEVELS
  4.3 Using OpenMP Library Routines Within Nested Parallel Regions
  4.4 Some Tips on Using Nested Parallelism

5 Tasking
  5.1 Tasking Model
  5.2 Data Environment
  5.3 Tasking Example
  5.4 Programming Considerations
    5.4.1 THREADPRIVATE and Thread-Specific Information
    5.4.2 Locks
    5.4.3 References to Stack Data

6 Automatic Scoping of Variables
  6.1 Autoscoping Data Scope Clause
    6.1.1 __auto Clause
    6.1.2 default(__auto) Clause
  6.2 Scoping Rules for a Parallel Construct
    6.2.1 Scoping Rules for Scalar Variables
    6.2.2 Scoping Rules for Arrays
  6.3 Scoping Rules for a task Construct
    6.3.1 Scoping Rules for Scalar Variables
    6.3.2 Scoping Rules for Arrays
  6.4 General Comments About Autoscoping
  6.5 Restrictions
  6.6 Checking the Results of Autoscoping
  6.7 Autoscoping Examples

7 Scope Checking
  7.1 Using the Scope Checking Feature
  7.2 Restrictions

8 Performance Considerations
  8.1 Some General Performance Recommendations
  8.2 False Sharing And How To Avoid It
    8.2.1 What Is False Sharing?
    8.2.2 Reducing False Sharing
  8.3 Oracle Solaris OS Tuning Features

A Placement of Clauses on Directives

Index


Preface

This guide describes the specifics of the OpenMP shared memory API supported by the Oracle Solaris Studio 12.3 C, C++, and Fortran compilers.

Supported Platforms

This Oracle Solaris Studio release supports platforms that use the SPARC family of processor architectures running the Oracle Solaris operating system, as well as platforms that use the x86 family of processor architectures running Oracle Solaris or specific Linux systems.

This document uses the following terms to cite differences between x86 platforms:

■ “x86” refers to the larger family of 64-bit and 32-bit x86 compatible products.
■ “x64” points out specific 64-bit x86 compatible CPUs.
■ “32-bit x86” points out specific 32-bit information about x86 based systems.

Information specific to Linux systems refers only to supported Linux x86 platforms, while information specific to Oracle Solaris systems refers only to supported Oracle Solaris platforms on SPARC and x86 systems.

For a complete list of supported hardware platforms and operating system releases, see the Oracle Solaris Studio Release Notes.

Oracle Solaris Studio Documentation

You can find complete documentation for Oracle Solaris Studio software as follows:

■ Product documentation is located at the Oracle Solaris Studio documentation web site, including release notes, reference manuals, user guides, and tutorials.

■ Online help for the Code Analyzer, the Performance Analyzer, the Thread Analyzer, dbxtool, DLight, and the IDE is available through the Help menu, as well as through the F1 key and Help buttons on many windows and dialog boxes in these tools.

■ Man pages for command-line tools describe a tool's command options.


Resources for Developers

Visit the Oracle Technical Network web site to find these resources for developers using Oracle Solaris Studio:

■ Articles on programming techniques and best practices
■ Links to complete documentation for recent releases of the software
■ Information on support levels
■ User discussion forums

Access to Oracle Support

Oracle customers have access to electronic support through My Oracle Support. For information, visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=info or visit http://www.oracle.com/pls/topic/lookup?ctx=acc&id=trs if you are hearing impaired.

Typographic Conventions

The following table describes the typographic conventions that are used in this book.

TABLE P–1 Typographic Conventions

Typeface    Meaning                                          Example

AaBbCc123   The names of commands, files, and directories,   Edit your .login file.
            and onscreen computer output                     Use ls -a to list all files.
                                                             machine_name% you have mail.

AaBbCc123   What you type, contrasted with onscreen          machine_name% su
            computer output                                  Password:

aabbcc123   Placeholder: replace with a real name or value   The command to remove a file is rm filename.

AaBbCc123   Book titles, new terms, and terms to be          Read Chapter 6 in the User's Guide.
            emphasized                                       A cache is a copy that is stored locally.
                                                             Do not save the file.
                                                             Note: Some emphasized items appear bold online.


Shell Prompts in Command Examples

The following table shows the default UNIX system prompt and superuser prompt for shells that are included in the Oracle Solaris OS. Note that the default system prompt that is displayed in command examples varies, depending on the Oracle Solaris release.

TABLE P–2 Shell Prompts

Shell Prompt

Bash shell, Korn shell, and Bourne shell $

Bash shell, Korn shell, and Bourne shell for superuser #

C shell machine_name%

C shell for superuser machine_name#


C H A P T E R   1

Introducing the OpenMP API

The OpenMP Application Program Interface (API) is a portable, parallel programming model for shared memory multithreaded architectures, developed in collaboration with a number of computer vendors. The specifications were created and are published by the OpenMP Architecture Review Board.

The OpenMP API is the recommended parallel programming model for all Oracle Solaris Studio compilers on Oracle Solaris platforms.

1.1 Where to Find the OpenMP Specifications

The material presented in this manual describes issues specific to the Oracle Solaris Studio implementation of the OpenMP 3.1 API, which can be found on the official OpenMP web site at http://www.openmp.org.

Additional information about OpenMP, including tutorials and other resources for developers, can be found on the cOMPunity web site at http://www.compunity.org.

The latest information about the Oracle Solaris Studio compiler releases and their implementation of the OpenMP API can be found on the Oracle Solaris Studio portal at http://www.oracle.com/technetwork/server-storage/solarisstudio.

1.2 Special Conventions

In the tables and examples in this document, Fortran directives and source code are shown in uppercase, but are case-insensitive.

The term structured-block refers to a block of Fortran or C/C++ statements having no transfers into or out of the block.

Constructs within square brackets, [...], are optional.



Throughout this manual, “Fortran” refers to the Fortran 95 language and the Oracle Solaris Studio compiler, f95(1).

The terms “directive” and “pragma” are used interchangeably in this manual. OpenMP API directives are significant comments inserted by the programmer to instruct the compiler to use specialized features. As comments, they are not part of the host C, C++, or Fortran language, and may be ignored or enacted depending on compiler options.


C H A P T E R   2

Compiling and Running OpenMP Programs

This chapter describes compiler options and runtime settings affecting programs that utilize the OpenMP API.

Note – Starting with Oracle Solaris Studio 12.3, the default number of threads used for an OpenMP program is 2 instead of 1. This number can be changed by setting the OMP_NUM_THREADS environment variable prior to running the program, by calling the omp_set_num_threads() routine, or by using the num_threads clause on a PARALLEL directive.

2.1 Compiler Options

To enable explicit parallelization with OpenMP directives, compile your program with the cc, CC, or f95 option flag -xopenmp. (The f95 compiler accepts both -xopenmp and -openmp as synonyms.)

The -xopenmp flag accepts the following keyword sub-options.

-xopenmp=parallel   Enables recognition of OpenMP pragmas.

                    The minimum optimization level for -xopenmp=parallel is -xO3. The compiler changes the optimization from a lower level to -xO3 if necessary, and issues a warning.


Page 14: Oracle Solaris Studio 12.3 OpenMP API User's Guide

-xopenmp=noopt      Enables recognition of OpenMP pragmas.

                    The compiler does not raise the optimization level if it is lower than -xO3. If you explicitly set the optimization level lower than -xO3, as in -xO2 -xopenmp=noopt, the compiler will issue an error.

                    If you do not specify an optimization level with -xopenmp=noopt, the OpenMP pragmas are recognized and the program is parallelized accordingly, but no optimization is done.

-xopenmp=stubs      This option is no longer supported. An OpenMP stubs library is provided for users’ convenience.

                    To compile an OpenMP program that calls OpenMP library routines but ignores the OpenMP pragmas, compile the program without an -xopenmp option and link the object files with the libompstubs.a library. For example:

                    % cc omp_ignore.c -lompstubs

                    Linking with both libompstubs.a and the OpenMP runtime library libmtsk.so is unsupported and may result in unexpected behavior.

-xopenmp=none       Disables recognition of OpenMP pragmas and does not change the optimization level.

Additional Notes:

■ For best performance and functionality on Oracle Solaris platforms, make sure that the latest OpenMP runtime library, libmtsk.so, is installed on the running system.
■ If you do not specify -xopenmp on the command line, the compiler assumes -xopenmp=none (disabling recognition of OpenMP pragmas).
■ If you specify -xopenmp without a keyword sub-option, the compiler assumes -xopenmp=parallel.
■ Specifying -xopenmp=parallel or -xopenmp=noopt will define the _OPENMP preprocessor token to be YYYYMM (specifically 201107L for C/C++ and 201107 for Fortran 95).
■ When debugging OpenMP programs with dbx, compile with -xopenmp=noopt -g.
■ The default optimization level for -xopenmp might change in future releases. Compilation warning messages can be avoided by specifying an appropriate optimization level explicitly.
■ With Fortran 95, -xopenmp, -xopenmp=parallel, and -xopenmp=noopt will add -stackvar automatically. See “2.4 Stacks and Stack Sizes” on page 22.
■ When compiling and linking an OpenMP program in separate steps, include -xopenmp on each of the compile and the link steps.
■ Use the -xvpara option with -xopenmp, -xopenmp=parallel, or -xopenmp=noopt to display compiler warnings about potential OpenMP programming related problems, and messages related to autoscoping (see Chapter 6, “Automatic Scoping of Variables”) and scope checking (see Chapter 7, “Scope Checking”). For example, warning messages will be emitted if the compiler detects the situations in the following list:
  ■ Loops are parallelized using OpenMP directives when there are data dependencies between different loop iterations.
  ■ OpenMP data-sharing attribute clauses are problematic. For example, declaring a variable shared whose accesses in an OpenMP parallel region may cause a data race, or declaring a variable private whose value in a parallel region is used after the parallel region.

2.2 OpenMP Environment Variables

The OpenMP specification defines several environment variables that control the execution of OpenMP programs. For details, refer to the OpenMP API Version 3.1 specification at http://openmp.org. See also “3.8 Environment Variables” on page 28 for specific information on the implementation of OpenMP environment variables by the Oracle Solaris Studio compilers.

Additional environment variables defined by the Oracle Solaris Studio compilers that are not part of the OpenMP specification are summarized in “2.2.2 Oracle Solaris Studio Environment Variables” on page 16.

2.2.1 OpenMP Environment Variables Defaults

This section describes the defaults for the OpenMP environment variables.

OMP_SCHEDULE If not set, a default value of STATIC is used.

Example: setenv OMP_SCHEDULE ’GUIDED,4’

OMP_NUM_THREADS If not set, a default of 2 is used.

Example: setenv OMP_NUM_THREADS 16

OMP_DYNAMIC If not set, a default value of TRUE is used.

Example: setenv OMP_DYNAMIC FALSE

OMP_NESTED If not set, the default is FALSE.

Example: setenv OMP_NESTED FALSE

OMP_STACKSIZE The default is 4 Megabytes for 32-bit applications, and 8 Megabytes for 64-bit applications.

Example: setenv OMP_STACKSIZE 10M


OMP_WAIT_POLICY If not set, the default is PASSIVE.

OMP_MAX_ACTIVE_LEVELS If not set, the default is 4.

OMP_THREAD_LIMIT If not set, the default is 1024.

OMP_PROC_BIND If not set, the default is FALSE.

2.2.2 Oracle Solaris Studio Environment Variables

The following additional environment variables affect the execution of OpenMP programs and are not part of the OpenMP specifications. Note that the values specified for the following environment variables are case insensitive and can be in uppercase or lowercase.

2.2.2.1 PARALLEL

For compatibility with legacy programs, setting the PARALLEL environment variable has the same effect as setting OMP_NUM_THREADS. However, if both PARALLEL and OMP_NUM_THREADS are set, they must be set to the same value.

2.2.2.2 SUNW_MP_WARN

Controls warning messages issued by the OpenMP runtime library. If SUNW_MP_WARN is set to TRUE, the runtime library issues warning messages to stderr. In addition, the runtime library outputs the settings of all environment variables for informational purposes. If the environment variable is set to FALSE, the runtime library does not issue any warning messages or output any settings. The default is FALSE.

Example:

setenv SUNW_MP_WARN TRUE

The runtime library will also issue warning messages if the program registers a callback function to accept warning messages. A program can register a user callback function by calling the following function:

int sunw_mp_register_warn (void (*func)(void *));

The address of the callback function is passed as an argument to sunw_mp_register_warn(). This function returns 0 upon successfully registering the callback function, or 1 upon failure.

If the program has registered a callback function, the runtime library will call the registered function and pass a pointer to the localized string containing the error message. The memory pointed to is no longer valid upon return from the callback function.


Note – The OpenMP runtime library has the ability to check for many common OpenMP violations, such as incorrect nesting and deadlocks. Runtime checking, however, adds overhead to the execution of the program. Set SUNW_MP_WARN to TRUE while testing or debugging a program to enable warning messages from the OpenMP runtime library to be displayed.

2.2.2.3 SUNW_MP_THR_IDLE

Controls the status of idle threads in an OpenMP program that are waiting at a barrier or waiting for new parallel regions to work on. You can set the value to be one of the following: SPIN, SLEEP, SLEEP(time s), SLEEP(time ms), or SLEEP(time mc), where time is an integer that specifies an amount of time, and s, ms, and mc specify the time unit (seconds, milliseconds, and microseconds, respectively).

SPIN specifies that an idle thread should spin while waiting at a barrier or waiting for new parallel regions to work on. SLEEP without a time argument specifies that an idle thread should sleep immediately. SLEEP with a time argument specifies the amount of time a thread should spin-wait before going to sleep.

The default idle thread status is to sleep after possibly spin-waiting for some amount of time. SLEEP, SLEEP(0), SLEEP(0s), SLEEP(0ms), and SLEEP(0mc) are all equivalent.

If both SUNW_MP_THR_IDLE and OMP_WAIT_POLICY are set, then OMP_WAIT_POLICY will be ignored.

Examples:

setenv SUNW_MP_THR_IDLE SPIN

setenv SUNW_MP_THR_IDLE SLEEP

setenv SUNW_MP_THR_IDLE SLEEP(2s)

setenv SUNW_MP_THR_IDLE SLEEP(20ms)

setenv SUNW_MP_THR_IDLE SLEEP(150mc)

2.2.2.4 SUNW_MP_PROCBIND

This environment variable works on both Oracle Solaris and Linux systems. The SUNW_MP_PROCBIND environment variable can be used to bind threads of an OpenMP program to virtual processors on the running system. Performance can be enhanced with processor binding, but performance degradation will occur if multiple threads are bound to the same virtual processor. If both SUNW_MP_PROCBIND and OMP_PROC_BIND are set, they must be set to the same value. See “2.3 Processor Binding” on page 20 for details.

2.2.2.5 SUNW_MP_MAX_POOL_THREADS

Specifies the maximum size of the thread pool. The thread pool contains only non-user threads that the OpenMP runtime library creates to work on parallel regions. The pool does not contain the initial (main) thread or any threads created explicitly by the user’s program. If this environment variable is set to zero, the thread pool will be empty and all parallel regions will be executed by one thread. If not specified, the default is 1023. See “4.2 Control of Nested Parallelism” on page 32 for details.

Note that SUNW_MP_MAX_POOL_THREADS specifies the maximum number of non-user OpenMP threads to use for the whole program, while OMP_THREAD_LIMIT specifies the maximum number of user and non-user OpenMP threads for the whole program. If both SUNW_MP_MAX_POOL_THREADS and OMP_THREAD_LIMIT are set, they must have consistent values such that OMP_THREAD_LIMIT is set to one more than the value of SUNW_MP_MAX_POOL_THREADS.

2.2.2.6 SUNW_MP_MAX_NESTED_LEVELS

Sets the maximum number of nested active parallel regions. A parallel region is active if it is executed by a team consisting of more than one thread. If SUNW_MP_MAX_NESTED_LEVELS is not specified, the default is 4. See “4.2 Control of Nested Parallelism” on page 32 for details.

Note that if both SUNW_MP_MAX_NESTED_LEVELS and OMP_MAX_ACTIVE_LEVELS are set, they must be set to the same value.

2.2.2.7 STACKSIZE

Sets the stack size for each thread. The value is in Kilobytes. The default thread stack sizes are 4 Megabytes on 32-bit SPARC V8 and x86 platforms, and 8 Megabytes on 64-bit SPARC V9 and x86 platforms.

The STACKSIZE environment variable accepts numerical values with a suffix of either B, K, M, or G for Bytes, Kilobytes, Megabytes, or Gigabytes, respectively. If no suffix is specified, the default is Kilobytes.

Examples:

setenv STACKSIZE 8192 // sets the thread stack size to 8 Megabytes

setenv STACKSIZE 16M // sets the thread stack size to 16 Megabytes

Note that if both STACKSIZE and OMP_STACKSIZE are set, they must be set to the same value. If the values are not the same, a runtime error occurs.

2.2.2.8 SUNW_MP_GUIDED_WEIGHT

Sets the weighting factor used to determine the size of chunks assigned to threads in loops with GUIDED scheduling. The value should be a positive floating-point number, and will apply to all loops with GUIDED scheduling in the program. If not set, the default value assumed is 2.0.


2.2.2.9 SUNW_MP_WAIT_POLICY

Allows fine-grain control of the behavior of threads in the program that are waiting for work (idle), waiting at a barrier, or waiting for tasks to complete. The behavior for each of the above types of wait has three possibilities: spin for a while, yield the CPU for a while, and sleep until awakened.

The syntax is (shown using csh):

setenv SUNW_MP_WAIT_POLICY IDLE=val:BARRIER=val:TASKWAIT=val

IDLE=val, BARRIER=val, and TASKWAIT=val are optional keywords that specify the type of wait being controlled.

Each of these keywords has a val setting that describes the wait behavior: SPIN, YIELD, or SLEEP.

SPIN(time) specifies how long a thread should spin before yielding the CPU. time can be in seconds, milliseconds, or microseconds (denoted by s, ms, and mc, respectively). If no time unit is specified, then seconds is assumed. SPIN with no time parameter means that the thread should continuously spin while waiting.

YIELD(number) specifies the number of times a thread should yield the CPU before sleeping. After each yield of the CPU, a thread will run again when the operating system schedules it to run. YIELD with no number parameter means the thread should continuously yield while waiting.

SLEEP specifies that a thread should immediately go to sleep.

Note that SPIN, SLEEP, and YIELD settings for a particular type of wait can be specified in any order. The settings are separated by commas. "SPIN(0),YIELD(0)" is the same as SLEEP, that is, sleep immediately. When processing the settings for IDLE, BARRIER, and TASKWAIT, the “left-most wins” rule is used.

If both SUNW_MP_WAIT_POLICY and OMP_WAIT_POLICY are set, OMP_WAIT_POLICY will be ignored.

Examples:

% setenv SUNW_MP_WAIT_POLICY “BARRIER=SPIN”

A thread waiting at a barrier spins until all threads in the team have reached the barrier.

% setenv SUNW_MP_WAIT_POLICY “IDLE=SPIN(10ms),YIELD(5)”

A thread waiting for work (idle) spins for 10 milliseconds, then yields the CPU 5 times before going to sleep.

% setenv SUNW_MP_WAIT_POLICY \

“IDLE=SPIN(10ms),YIELD(2):BARRIER=SLEEP:TASKWAIT=YIELD(10)”


A thread waiting for work (idle) spins for 10 milliseconds, then yields the CPU 2 times before going to sleep; a thread waiting at a barrier goes to sleep immediately; a thread waiting at a taskwait yields the CPU 10 times before going to sleep.

2.3 Processor Binding

With processor binding, the programmer instructs the operating system that a thread in the program should run on the same processor throughout the execution of the program.

Processor binding, when used with static scheduling, benefits applications that exhibit a certain data reuse pattern, where data accessed by a thread in a parallel or worksharing region will be in the local cache from a previous invocation of a parallel or worksharing region.

From the hardware point of view, a computer system is composed of one or more physical processors. From the operating system point of view, each of these physical processors maps to one or more virtual processors onto which threads in a program can be run. If n virtual processors are available, then n threads can be scheduled to run at the same time. Depending on the system, a virtual processor may be a processor, a core, a hardware thread, and so on.

For example, the SPARC T3 physical processor has eight cores, and each core can run eight simultaneous processing threads. From the Oracle Solaris system point of view, there are 64 virtual processors onto which threads can be scheduled to run. On Oracle Solaris platforms, the number of virtual processors can be determined by using the psrinfo(1M) command. On Linux systems, the text file /proc/cpuinfo provides information about available processors.

When the operating system binds threads to processors, they are in effect bound to specific virtual processors, not physical processors.

Set the SUNW_MP_PROCBIND environment variable to bind threads in an OpenMP program. The value specified for SUNW_MP_PROCBIND can be one of the following:

■ The string "TRUE" or "FALSE" or "COMPACT" or "SCATTER" (or lowercase "true", "false", "compact", or "scatter"). For example, % setenv SUNW_MP_PROCBIND "TRUE"

■ A non-negative integer. For example, % setenv SUNW_MP_PROCBIND "2"

■ A list of two or more non-negative integers separated by one or more spaces. For example, if four threads are used, % setenv SUNW_MP_PROCBIND "2 2 4 6" will bind two threads to processor 2, one thread to processor 4, and one thread to processor 6.

■ Two non-negative integers, n1 and n2, separated by a minus sign ("-"). n1 must be less than or equal to n2. For example, % setenv SUNW_MP_PROCBIND "0-6"


Note that the non-negative integers referred to above denote logical identifiers (IDs). Logical IDs may be different from virtual processor IDs. The difference is explained in the next section.

2.3.1 Virtual and Logical Processor IDs

Each virtual processor in a system has a unique processor ID. The Oracle Solaris psrinfo(1M) command displays information about the virtual processors in a system, including their virtual processor IDs. The prtdiag(1M) command displays system configuration and diagnostic information.

You can use psrinfo -pv to list all physical processors in the system and the virtual processorsthat are associated with each physical processor.

Virtual processor IDs could be sequential, but gaps in the IDs can also occur. For example, on a Sun Fire 4810 with 8 UltraSPARC IV processors (16 cores), the virtual processor IDs might be: 0, 1, 2, 3, 8, 9, 10, 11, 512, 513, 514, 515, 520, 521, 522, 523.

Logical processor IDs, on the other hand, are consecutive integers that start with 0. If the number of virtual processors available in the system is n, then their logical IDs are 0, 1, ..., n-1, in the order presented by psrinfo(1M).

For an interpretation of the values accepted by SUNW_MP_PROCBIND, see “2.3.2 Interpreting the Value Specified for SUNW_MP_PROCBIND” on page 21.

2.3.2 Interpreting the Value Specified for SUNW_MP_PROCBIND

If the value specified for SUNW_MP_PROCBIND is FALSE, the threads will not be bound to any processors. This is the default setting.

If the value specified for SUNW_MP_PROCBIND is TRUE, then the threads will be bound to virtual processors in a round-robin fashion. The starting processor for the binding is determined by the runtime library with the goal of achieving best performance.

If the value specified for SUNW_MP_PROCBIND is COMPACT, then the threads will be bound to virtual processors that are as close together as possible on the system. COMPACT allows threads to share data caches and thus improve data locality.

If the value specified for SUNW_MP_PROCBIND is SCATTER, then the threads will be bound to virtual processors that are far apart. This will allow higher memory bandwidth for each of the threads. SCATTER is the opposite of COMPACT.


If the value specified for SUNW_MP_PROCBIND is a non-negative integer, then that integer denotes the starting logical ID of the virtual processor to which threads should be bound. Threads will be bound to virtual processors in a round-robin fashion, starting with the processor with the specified logical ID and wrapping around to the processor with logical ID 0 after binding to the processor with logical ID n-1.

If the value specified for SUNW_MP_PROCBIND is a list of two or more non-negative integers, then threads will be bound in a round-robin fashion to virtual processors with the specified logical IDs. Processors with logical IDs other than those specified will not be used.

If the value specified for SUNW_MP_PROCBIND is two non-negative integers separated by a minus ("-"), then threads will be bound in a round-robin fashion to virtual processors in the range that begins with the first logical ID and ends with the second logical ID. Processors with logical IDs other than those included in the range will not be used.

If the value specified for SUNW_MP_PROCBIND does not conform to one of the forms described above, or if an invalid logical ID is given, then an error message will be emitted and execution of the program will terminate.

If the number of threads in an OpenMP program is greater than the number of virtual processors available, some virtual processors will have more than one thread bound to them. This may negatively impact performance.

2.3.3 Interaction With OS Processor Sets

A processor set can be specified using the psrset utility on Oracle Solaris platforms, or the taskset command on Linux platforms. SUNW_MP_PROCBIND does not take processor sets into account. If you use processor sets, then it is your responsibility to ensure that the setting of SUNW_MP_PROCBIND is consistent with the processor set used. Otherwise, the setting of SUNW_MP_PROCBIND will override the processor set setting on Linux systems, while on Oracle Solaris systems an error message will be issued.

2.4 Stacks and Stack Sizes

The executing program maintains a main stack for the initial (or main) thread executing the program, as well as distinct stacks for each slave thread. Stacks are temporary memory address spaces used to hold arguments and automatic variables during invocation of a subprogram or function reference. Stack overflow might occur if the size of a thread's stack is too small, causing silent data corruption or a segmentation fault.

Compiling Fortran programs with the f95 -stackvar option forces the allocation of local variables and arrays on the stack as if they were automatic variables. Use of -stackvar with OpenMP programs is implied with explicitly parallelized programs because it improves the optimizer’s ability to parallelize loops with calls. (See the Fortran User’s Guide for a discussion of the -stackvar flag.) However, this usage could lead to stack overflow if not enough memory is allocated for the stack.

Use the limit C-shell command or the ulimit Bourne or Korn shell command to display or set the stack size for the initial (or main) thread. In general, the default stack size for the initial thread is 8 Megabytes.

Each slave thread of an OpenMP program has its own thread stack. This stack mimics the initial (or main) thread stack but is unique to the thread. The thread’s PRIVATE arrays and variables (local to the thread) are allocated on the thread stack. The default size is 4 Megabytes on 32-bit SPARC V8 and x86 platforms, and 8 Megabytes on 64-bit SPARC V9 and x86 platforms. The size of the slave thread stack is set with the OMP_STACKSIZE environment variable.

demo% setenv OMP_STACKSIZE 16384    <- Set thread stack size to 16 Mb (C shell)

demo$ OMP_STACKSIZE=16384           <- Set thread stack size to 16 Mb (Bourne/Korn shell)
demo$ export OMP_STACKSIZE

To detect stack overflows, compile your Fortran, C, or C++ programs with the -xcheck=stkovf compiler option to force a segmentation fault on stack overflow, thereby stopping the program before any data corruption can occur.

2.5 Checking and Analyzing OpenMP Programs

Oracle Solaris Studio provides several tools to help debug and analyze OpenMP programs.

■ dbx is an interactive debugging tool that provides facilities to run a program in a controlled fashion and inspect the state of a stopped program. Refer to dbx(1) for more information.

■ The Thread Analyzer is a tool for detecting data races and deadlocks in a multithreaded program. Refer to the Thread Analyzer manual and the tha(1) and libtha(3) man pages for details.

■ The Performance Analyzer analyzes the performance of an OpenMP program. Refer to the Performance Analyzer manual or the collect(1) and analyzer(1) man pages for details.


Implementation-Defined Behaviors

This chapter documents how certain OpenMP features behave when the program is compiled using Oracle Solaris Studio compilers. (See Appendix E of the OpenMP 3.1 API Specification.)

3.1 Task Scheduling Points

Task scheduling points in untied task regions occur at the same points as in tied task regions. Therefore, within untied task regions, task scheduling points occur at the following locations:

■ Encountered task constructs
■ Encountered taskwait constructs
■ Encountered taskyield constructs
■ Encountered barrier directives
■ Implicit barrier regions
■ At the end of the untied task region

3.2 Memory Model

Memory accesses by multiple threads to the same variable without synchronization are not necessarily atomic with respect to each other. Several implementation-dependent and application-dependent factors affect whether accesses are atomic. Some variables might be larger than the largest atomic memory operation on the target platform. Some variables might be misaligned or of unknown alignment, and the compiler or the runtime system might need to use multiple loads/stores to access the variable. Sometimes there are faster code sequences that use more loads/stores.


3.3 Internal Control Variables

The following internal control variables are defined by the implementation:

■ nthreads-var: Controls the number of threads requested for encountered parallel regions. The initial value of nthreads-var is 1.

■ dyn-var: Controls whether dynamic adjustment of the number of threads is enabled for encountered parallel regions. The initial value of dyn-var is TRUE (that is, dynamic adjustment is enabled).

■ run-sched-var: Controls the schedule that the runtime schedule clause uses for loop regions. The initial value of run-sched-var is static with no chunk size.

■ def-sched-var: Controls the implementation-defined default scheduling of loop regions. The initial value of def-sched-var is static with no chunk size.

■ bind-var: Controls the binding of threads to processors. The initial value of bind-var is FALSE.

■ stacksize-var: Controls the stack size for threads that the OpenMP implementation creates. The initial value of stacksize-var is 4 Megabytes for 32-bit applications and 8 Megabytes for 64-bit applications.

■ wait-policy-var: Controls the desired behavior of waiting threads. The initial value of wait-policy-var is PASSIVE.

■ thread-limit-var: Controls the maximum number of threads participating in the OpenMP program. The initial value of thread-limit-var is 1024.

■ max-active-levels-var: Controls the maximum number of nested active parallel regions. The initial value of max-active-levels-var is 4.

3.4 Dynamic Adjustment of Threads

The implementation provides the ability to dynamically adjust the number of threads. Dynamic adjustment is enabled by default. Set the OMP_DYNAMIC environment variable to FALSE or call the omp_set_dynamic() routine with the appropriate argument to disable dynamic adjustment.

When a thread encounters a parallel construct, the number of threads delivered by this implementation is determined according to Algorithm 2.1 in the OpenMP 3.1 Specification. In exceptional situations, such as a lack of system resources, the number of threads supplied will be less than described in Algorithm 2.1. In these situations, if SUNW_MP_WARN is set to TRUE or a callback function is registered through a call to sunw_mp_register_warn(), a warning message is issued.


3.5 Loop Directive

The integer type used to compute the iteration count of a collapsed loop is long.

The effect of the schedule(runtime) clause when the run-sched-var internal control variable is set to auto is static with no chunk size.

3.6 Constructs

3.6.1 SECTIONS

The structured blocks in the sections construct are assigned to the threads in the team in a static, no chunk size fashion, so that each thread gets an approximately equal number of consecutive structured blocks.

3.6.2 SINGLE

The first thread to encounter the single construct will execute the construct.

3.6.3 ATOMIC

The implementation replaces all atomic directives by enclosing the target statement with a special, named critical construct. This operation enforces exclusive access between all atomic regions in the program, regardless of whether these regions update the same or different storage locations.

3.7 Routines

3.7.1 omp_set_num_threads()

If the argument to omp_set_num_threads() is not a positive integer, then the call is ignored. A warning message is issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

3.7.2 omp_set_schedule()

The behavior for the Oracle Solaris Studio specific sunw_mp_sched_reserved schedule is the same as static with no chunk size.


3.7.3 omp_set_max_active_levels()

If omp_set_max_active_levels() is called from within an active parallel region, then the call is ignored. A warning message is issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

If the argument to omp_set_max_active_levels() is not a non-negative integer, then the call is ignored. A warning message is issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

3.7.4 omp_get_max_active_levels()

omp_get_max_active_levels() can be called from anywhere in the program. The call returns the value of the max-active-levels-var internal control variable.

3.8 Environment Variables

OMP_SCHEDULE: If the schedule type specified for OMP_SCHEDULE is not one of the valid types (static, dynamic, guided, or auto), then the environment variable is ignored and the default schedule (static with no chunk size) is used. A warning message is issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

If the schedule type specified for the OMP_SCHEDULE environment variable is static, dynamic, or guided but the specified chunk size is a negative integer, then the chunk size used is as follows: for static, there is no chunk size; for dynamic and guided, the chunk size is 1. A warning message is issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

OMP_NUM_THREADS: If the value of the variable is not a positive integer, then the environment variable is ignored and a warning message is issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

If the value of the variable is greater than the number of threads the implementation can support, the following actions are taken:

■ If dynamic adjustment of the number of threads is enabled, then the number of threads will be reduced and a warning message will be issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

■ If dynamic adjustment of the number of threads is disabled, then an error message will be issued and the program will stop.

OMP_PROC_BIND: If the value specified for OMP_PROC_BIND is neither TRUE nor FALSE, then an error message will be issued and the program will stop.

OMP_DYNAMIC: If the value specified for OMP_DYNAMIC is neither TRUE nor FALSE, then the value will be ignored and the default value TRUE will be used. A warning message will be issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

OMP_NESTED: If the value specified for OMP_NESTED is neither TRUE nor FALSE, then the value will be ignored and the default value FALSE will be used. A warning message will be issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

OMP_STACKSIZE: If the value given for OMP_STACKSIZE does not conform to the specified format, then the value will be ignored and the default value (4 Megabytes for 32-bit applications, and 8 Megabytes for 64-bit applications) is used. A warning message is issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

OMP_WAIT_POLICY: The ACTIVE behavior for a thread is spin. The PASSIVE behavior for a thread is sleep, after possibly spinning for a while.

OMP_MAX_ACTIVE_LEVELS: If the value specified for OMP_MAX_ACTIVE_LEVELS is not a non-negative integer, then the value will be ignored and the default value (4) will be used. A warning message will be issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

OMP_THREAD_LIMIT: If the value specified for OMP_THREAD_LIMIT is not a positive integer, then the value will be ignored and the default value (1024) will be used. A warning message will be issued if SUNW_MP_WARN is set to TRUE or a callback function is registered by a call to sunw_mp_register_warn().

3.9 Fortran Issues

The following issues apply to Fortran only.

3.9.1 THREADPRIVATE Directive

If the conditions for values of data in the thread-private objects of threads (other than the initial thread) to persist between two consecutive active parallel regions do not all hold, then the allocation status of an allocatable array in the second region might be "not currently allocated".


3.9.2 SHARED Clause

Passing a shared variable to a non-intrinsic procedure could result in the value of the shared variable being copied into temporary storage before the procedure reference, and back out of the temporary storage into the actual argument storage after the procedure reference. This copying into and out of temporary storage can occur only if the conditions found in the shared clause section of the OpenMP 3.1 Specification hold, namely:

■ The actual argument is one of the following:
  ■ A shared variable
  ■ A subobject of a shared variable
  ■ An object associated with a shared variable
  ■ An object associated with a subobject of a shared variable

■ The actual argument is also one of the following:
  ■ An array section
  ■ An array section with a vector subscript
  ■ An assumed-shape array
  ■ A pointer array

■ The associated dummy argument for this actual argument is an explicit-shape array or an assumed-size array.

3.9.3 Runtime Library Definitions

Both the include file omp_lib.h and the module file omp_lib are provided in the implementation.

On Oracle Solaris platforms, the OpenMP runtime library routines that take an argument are extended with a generic interface so arguments of different Fortran KIND types can be accommodated.


Nested Parallelism

This chapter discusses the features of OpenMP nested parallelism.

4.1 Execution Model

OpenMP uses the fork-join model of parallel execution. When a thread encounters a parallel construct, the thread creates a team composed of itself and some additional (possibly zero) number of threads. The encountering thread becomes the master of the new team. The other threads of the team are called slave threads of the team. All team members execute the code inside the parallel construct. When a thread finishes its work within the parallel construct, it waits at an implicit barrier at the end of the parallel construct. When all team members have arrived at the barrier, the threads can leave the barrier. The master thread continues execution of user code beyond the end of the parallel construct, while the slave threads wait to be summoned to join other teams.

OpenMP parallel regions can be nested inside each other. If nested parallelism is disabled, then the new team created by a thread encountering a parallel construct inside a parallel region consists only of the encountering thread. If nested parallelism is enabled, then the new team may consist of more than one thread.

The OpenMP runtime library maintains a pool of threads that can be used as slave threads in parallel regions. When a thread encounters a parallel construct and needs to create a team of more than one thread, the thread will check the pool and grab idle threads from the pool, making them slave threads of the team. The master thread might get fewer slave threads than it needs if the pool does not contain a sufficient number of idle threads. When the team finishes executing the parallel region, the slave threads return to the pool.


4.2 Control of Nested Parallelism

Nested parallelism can be controlled at runtime by setting various environment variables prior to execution of the program.

4.2.1 OMP_NESTED

Nested parallelism can be enabled or disabled by setting the OMP_NESTED environment variable or calling omp_set_nested().

The following example has three levels of nested parallel constructs.

EXAMPLE 4–1 Nested Parallelism Example

#include <omp.h>
#include <stdio.h>

void report_num_threads(int level)
{
    #pragma omp single
    {
        printf("Level %d: number of threads in the team - %d\n",
               level, omp_get_num_threads());
    }
}

int main()
{
    omp_set_dynamic(0);
    #pragma omp parallel num_threads(2)
    {
        report_num_threads(1);
        #pragma omp parallel num_threads(2)
        {
            report_num_threads(2);
            #pragma omp parallel num_threads(2)
            {
                report_num_threads(3);
            }
        }
    }
    return(0);
}

Compiling and running this program with nested parallelism enabled produces the following (sorted) output:

% setenv OMP_NESTED TRUE

% a.out

Level 1: number of threads in the team - 2

Level 2: number of threads in the team - 2

Level 2: number of threads in the team - 2

Level 3: number of threads in the team - 2


Level 3: number of threads in the team - 2

Level 3: number of threads in the team - 2

Level 3: number of threads in the team - 2

The following example runs the same program but with nested parallelism disabled:

% setenv OMP_NESTED FALSE

% a.out

Level 1: number of threads in the team - 2

Level 2: number of threads in the team - 1

Level 3: number of threads in the team - 1

Level 2: number of threads in the team - 1

Level 3: number of threads in the team - 1

4.2.2 OMP_THREAD_LIMIT

The OpenMP runtime library maintains a pool of threads that can be used as slave threads in parallel regions. The setting of the OMP_THREAD_LIMIT environment variable controls the number of threads in the pool. By default, the number of threads in the pool is at most 1023.

The thread pool consists of only non-user threads that the runtime library creates. The pool does not include the initial thread or any thread created explicitly by the user's program.

If OMP_THREAD_LIMIT is set to 1 (or SUNW_MP_MAX_POOL_THREADS is set to 0), then the thread pool will be empty and all parallel regions will be executed by one thread.

The following example shows that a parallel region can get fewer slave threads if the pool does not contain sufficient threads. The code is the same as Example 4–1. The number of threads needed for all the parallel regions to be active at the same time is 8. Therefore, the pool needs to contain at least 7 threads. If OMP_THREAD_LIMIT is set to 6 (or SUNW_MP_MAX_POOL_THREADS to 5), then the pool contains at most 5 slave threads. This implies that two of the four innermost parallel regions might not be able to get all the slave threads requested. The following example shows one possible result:

% setenv OMP_NESTED TRUE

% setenv OMP_THREAD_LIMIT 6

% a.out

Level 1: number of threads in the team - 2

Level 2: number of threads in the team - 2

Level 2: number of threads in the team - 2

Level 3: number of threads in the team - 2

Level 3: number of threads in the team - 2

Level 3: number of threads in the team - 1

Level 3: number of threads in the team - 1

4.2.3 OMP_MAX_ACTIVE_LEVELS

The environment variable OMP_MAX_ACTIVE_LEVELS controls the maximum number of nested active parallel regions. A parallel region is active if it is executed by a team consisting of more than one thread. The default maximum number of nested active parallel regions is 4.


Note that setting this environment variable is not sufficient to enable nested parallelism. This environment variable simply controls the maximum number of nested active parallel regions; it does not enable nested parallelism. To enable nested parallelism, OMP_NESTED must be set to TRUE, or omp_set_nested() must be called with an argument that evaluates to true.

The following code will create 4 levels of nested parallel regions. If OMP_MAX_ACTIVE_LEVELS is set to 2, then nested parallel regions at nested depths of 3 and 4 are executed single-threaded.

#include <omp.h>
#include <stdio.h>

#define DEPTH 5

void report_num_threads(int level)
{
    #pragma omp single
    {
        printf("Level %d: number of threads in the team - %d\n",
               level, omp_get_num_threads());
    }
}

void nested(int depth)
{
    if (depth == DEPTH)
        return;

    #pragma omp parallel num_threads(2)
    {
        report_num_threads(depth);
        nested(depth+1);
    }
}

int main()
{
    omp_set_dynamic(0);
    omp_set_nested(1);
    nested(1);
    return(0);
}

The following example shows the possible results from compiling and running this program with a maximum nesting level of 4. (Actual results would depend on how the OS schedules threads.)

% setenv OMP_MAX_ACTIVE_LEVELS 4

% a.out |sort

Level 1: number of threads in the team - 2

Level 2: number of threads in the team - 2

Level 2: number of threads in the team - 2

Level 3: number of threads in the team - 2

Level 3: number of threads in the team - 2

Level 3: number of threads in the team - 2

Level 3: number of threads in the team - 2

Level 4: number of threads in the team - 2

Level 4: number of threads in the team - 2

Level 4: number of threads in the team - 2

Level 4: number of threads in the team - 2


Level 4: number of threads in the team - 2

Level 4: number of threads in the team - 2

Level 4: number of threads in the team - 2

Level 4: number of threads in the team - 2

The following example shows a possible result running with the nesting level set at 2:

% setenv OMP_MAX_ACTIVE_LEVELS 2

% a.out |sort

Level 1: number of threads in the team - 2

Level 2: number of threads in the team - 2

Level 2: number of threads in the team - 2

Level 3: number of threads in the team - 1

Level 3: number of threads in the team - 1

Level 3: number of threads in the team - 1

Level 3: number of threads in the team - 1

Level 4: number of threads in the team - 1

Level 4: number of threads in the team - 1

Level 4: number of threads in the team - 1

Level 4: number of threads in the team - 1

Again, these examples show only some possible results. Actual results would depend on how the OS schedules threads.

4.3 Using OpenMP Library Routines Within Nested Parallel Regions

Calls to the following OpenMP routines within nested parallel regions deserve some discussion.

- omp_set_num_threads()

- omp_get_max_threads()

- omp_set_dynamic()

- omp_get_dynamic()

- omp_set_nested()

- omp_get_nested()

The set calls affect future parallel regions at the same or inner nesting levels encountered by the calling thread only. They do not affect parallel regions encountered by other threads.

The get calls return the values set by the calling thread. When a thread becomes the master of a team executing a parallel region, all other members of the team inherit the values of the master thread. When the master thread exits a nested parallel region and continues executing the enclosing parallel region, the values for that thread revert to their values in the enclosing parallel region just before executing the nested parallel region.

EXAMPLE 4–2 Calls to OpenMP Routines Within Parallel Regions

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_nested(1);
    omp_set_dynamic(0);

    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            omp_set_num_threads(4);    /* line A */
        else
            omp_set_num_threads(6);    /* line B */

        /* The following statement will print out
         *
         * 0: 2 4
         * 1: 2 6
         *
         * omp_get_num_threads() returns the number
         * of threads in the team, so it is
         * the same for the two threads in the team.
         */
        printf("%d: %d %d\n", omp_get_thread_num(),
               omp_get_num_threads(),
               omp_get_max_threads());

        /* Two inner parallel regions will be created,
         * one with a team of 4 threads, and the other
         * with a team of 6 threads.
         */
        #pragma omp parallel
        {
            #pragma omp master
            {
                /* The following statement will print out
                 *
                 * Inner: 4
                 * Inner: 6
                 */
                printf("Inner: %d\n", omp_get_num_threads());
            }
            omp_set_num_threads(7);    /* line C */
        }

        /* Again two inner parallel regions will be created,
         * one with a team of 4 threads, and the other
         * with a team of 6 threads.
         *
         * The omp_set_num_threads(7) call at line C
         * has no effect here, since it affects only
         * parallel regions at the same or inner nesting
         * level as line C.
         */
        #pragma omp parallel
        {
            printf("count me.\n");
        }
    }
    return(0);
}

The following example shows a possible result from compiling and running this program:

% a.out

0: 2 4

Inner: 4

1: 2 6

Inner: 6

count me.

count me.

count me.

count me.

count me.

count me.

count me.

count me.

count me.

count me.

4.4 Some Tips on Using Nested Parallelism

■ Nesting parallel regions provides an immediate way to allow more threads to participate in the computation. For example, suppose you have a program that contains two levels of parallelism and the degree of parallelism at each level is 2. Also, suppose your system has four CPUs and you want to use all four CPUs to speed up the execution of this program. Just parallelizing any one level will use only two CPUs. You want to parallelize both levels.

■ Nesting parallel regions can easily create too many threads and oversubscribe the system. Set OMP_THREAD_LIMIT and OMP_MAX_ACTIVE_LEVELS appropriately to limit the number of threads in use and prevent runaway oversubscription.

■ Creating nested parallel regions adds overhead. If the outer level has enough parallelism and the load is balanced, using all the threads at the outer level of the computation will be more efficient than creating nested parallel regions at the inner levels. For example, suppose you have a program that contains two levels of parallelism. The degree of parallelism at the outer level is four and the load is balanced. You have a system with four CPUs and want to use all four CPUs to speed up the execution of this program. In general, using all four threads for the outer level could yield better performance than using two threads for the outer parallel region and using the other two threads as slave threads for the inner parallel regions.


Chapter 5: Tasking

This chapter describes the OpenMP Tasking Model.

5.1 Tasking Model

Tasking facilitates the parallelization of applications where units of work are generated dynamically, as in recursive structures or while loops.

In OpenMP, an explicit task is specified using the task directive. The task directive defines the code associated with the task and its data environment. The task construct can be placed anywhere in the program: whenever a thread encounters a task construct, a new task is generated.

When a thread encounters a task construct, it may choose to execute the task immediately or defer its execution until a later time. If task execution is deferred, then the task is placed in a conceptual pool of tasks that is associated with the current parallel region. The threads in the current team will take tasks out of the pool and execute them until the pool is empty. A thread that executes a task might be different from the thread that originally encountered it.

The code associated with a task construct will be executed only once. A task is tied if the code is executed by the same thread from beginning to end. A task is untied if the code can be executed by more than one thread, so that different threads execute different parts of the code. By default, tasks are tied; a task can be specified to be untied by using the untied clause on the task directive.

Threads are allowed to suspend execution of a task region at a task scheduling point in order to execute a different task. If the suspended task is tied, then the same thread later resumes execution of the suspended task. If the suspended task is untied, then any thread in the current team may resume the task execution.

The OpenMP specification defines the following task scheduling points for tied tasks:

■ The point of encountering a task construct


■ The point of encountering a taskwait construct
■ The point of encountering a taskyield construct
■ The point of encountering an implicit or explicit barrier
■ The completion point of the task

As implemented in the Oracle Solaris Studio compilers, these scheduling points are also the task scheduling points for untied tasks.

In addition to explicit tasks specified using the task directive, the OpenMP specification introduces the notion of implicit tasks. An implicit task is a task generated by the implicit parallel region, or generated when a parallel construct is encountered during execution. The code for each implicit task is the code inside the parallel construct. Each implicit task is assigned to a different thread in the team and is tied; that is, the implicit task is always executed from beginning to end by the thread to which it is initially assigned.

All implicit tasks generated when a parallel construct is encountered are guaranteed to be complete when the master thread exits the implicit barrier at the end of the parallel region. On the other hand, all explicit tasks generated within a parallel region are guaranteed to be complete on exit from the next implicit or explicit barrier within the parallel region.

The OpenMP 3.1 specification defines various types of tasks that the programmer can use to reduce the overhead of tasking.

An undeferred task is a task for which execution is not deferred with respect to its generating task region. That is, its generating task region is suspended until execution of the undeferred task is completed. An example of an undeferred task is a task with an if clause expression that evaluates to false. In this case, an undeferred task is generated, and the encountering thread must suspend the current task region, for which execution cannot be resumed until the task with the if clause is completed.

An included task is a task for which execution is sequentially included in the generating task region. That is, it is undeferred and executed immediately by the encountering thread. An example of an included task is a task that is a descendant of a final task (described below).

The distinction between an included task and an undeferred task is subtle. In the case of an undeferred task, the generating task region is suspended until execution of the undeferred task is completed, but the undeferred task may not be executed immediately as soon as it is encountered. The undeferred task may be placed in the conceptual pool and executed at a later time by the encountering thread or by some other thread; in the meantime, the generating task is suspended. Once the execution of the undeferred task is completed, the generating task can resume.

In the case of an included task, the included task is executed immediately by the encountering thread as soon as it is encountered. The task is not placed in the pool to be executed at a later time. The generating task is suspended until the execution of the included task is completed. Once the execution of the included task is completed, the generating task can resume.


A merged task is a task whose data environment is the same as that of its generating task region. When a mergeable clause is present on a task construct, and the generated task is an undeferred task or an included task, then the implementation may choose to generate a merged task instead. If a merged task is generated, then the behavior is as though there were no task directive at all.

A final task is a task that forces all of its child tasks to become final and included tasks. When a final clause is present on a task construct and the final clause expression evaluates to TRUE, the generated task will be a final task. All descendants of that final task will be both final and included tasks.

5.2 Data Environment

The task directive takes the following data attribute clauses that define the data environment of the task:

■ default (private | firstprivate | shared | none)
■ private (list)
■ firstprivate (list)
■ shared (list)

All references within a task to a variable listed in the shared clause refer to the variable with that same name known immediately prior to the task directive.

For each private and firstprivate variable, new storage is created, and all references to the original variable in the lexical extent of the task construct are replaced by references to the new storage. A firstprivate variable is initialized with the value of the original variable at the moment the task is encountered.

The OpenMP specification version 3.0 (in Section 2.9.1) describes how the data-sharing attributes of variables referenced in parallel, task, and worksharing regions are determined.

The data-sharing attributes of variables referenced in a construct may be one of the following: predetermined, explicitly determined, or implicitly determined. Variables with explicitly determined data-sharing attributes are those that are referenced in a given construct and are listed in a data-sharing attribute clause on the construct. Variables with implicitly determined data-sharing attributes are those that are referenced in a given construct, do not have predetermined data-sharing attributes, and are not listed in a data-sharing attribute clause on the construct.

The rules for how the data-sharing attributes of variables are implicitly determined might not always be obvious. To avoid any surprises, be sure to explicitly scope all variables that are referenced in a task construct using the data-sharing attribute clauses, rather than relying on the OpenMP implicit scoping rules.


5.3 Tasking Example

The following C/C++ program illustrates how the OpenMP task and taskwait directives can be used to compute Fibonacci numbers recursively.

In the example, the parallel directive denotes a parallel region which will be executed by four threads. In the parallel construct, the single directive is used to indicate that only one of the threads will execute the print statement that calls fib(n).

The call to fib(n) generates two tasks, indicated by the task directive. One of the tasks computes fib(n-1) and the other computes fib(n-2). The return values are added together to produce the value returned by fib(n). Each of the calls to fib(n-1) and fib(n-2) will in turn generate two tasks. Tasks will be recursively generated until the argument passed to fib() is less than 2.

Note the final clause on each of the task directives. If the final clause expression (n <= THRESHOLD) evaluates to true, then the generated task will be a final task. All task constructs encountered during execution of a final task will generate included (and final) tasks. So included tasks will be generated when fib is called with the argument n = 9, 8, ..., 2. These included tasks will be executed immediately by the encountering threads, thus reducing the overhead of placing tasks in the conceptual pool.

The taskwait directive ensures that the two tasks generated in an invocation of fib() are completed (that is, the tasks compute i and j) before that invocation of fib() returns.

Note that although only one thread executes the single directive and hence the call to fib(n), all four threads will participate in executing the tasks generated.

The following example is compiled using the Oracle Solaris Studio C++ compiler.

EXAMPLE 5–1 Tasking Example: Computing Fibonacci Numbers

#include <stdio.h>
#include <omp.h>

#define THRESHOLD 10

int fib(int n)
{
    int i, j;

    if (n < 2)
        return n;

    #pragma omp task shared(i) firstprivate(n) final(n <= THRESHOLD)
    i = fib(n-1);

    #pragma omp task shared(j) firstprivate(n) final(n <= THRESHOLD)
    j = fib(n-2);

    #pragma omp taskwait
    return i+j;
}

int main()
{
    int n = 30;

    omp_set_dynamic(0);
    omp_set_num_threads(4);

    #pragma omp parallel shared(n)
    {
        #pragma omp single
        printf("fib(%d) = %d\n", n, fib(n));
    }
}

% CC -xopenmp -xO3 task_example.cc

% a.out

fib(30) = 832040

5.4 Programming Considerations

Tasking introduces a layer of complexity to an OpenMP program. The programmer needs to pay special attention to how a program with tasks works. This section discusses some programming issues to consider.

5.4.1 THREADPRIVATE and Thread-Specific Information

When a thread encounters a task scheduling point, the implementation may choose to suspend the current task and schedule the thread to work on another task. This behavior implies that the value of a threadprivate variable, or other thread-specific information such as the thread number, may change across a task scheduling point.

If the suspended task is tied, then the thread that resumes executing the task will be the same thread that suspended it. Therefore, the thread number will remain the same after the task is resumed. However, the value of a threadprivate variable may change, because the thread might have been scheduled to work on another task that modified the threadprivate variable before resuming the suspended task.

If the suspended task is untied, then the thread that resumes executing the task may be different from the thread that suspended it. Therefore, both the thread number and the value of a threadprivate variable before and after the task scheduling point might be different.


5.4.2 Locks

OpenMP specifies that locks are no longer owned by threads, but by tasks. Once a lock is acquired, the current task owns it, and the same task must release it before task completion.

The critical construct, on the other hand, remains a thread-based mutual exclusion mechanism.

The change in lock ownership requires extra care when using locks. The following example (which appears in Appendix A of the OpenMP 3.1 specification) is conforming in OpenMP 2.5 because the thread that releases the lock lck in the parallel region is the same thread that acquired the lock in the sequential part of the program (the master thread of a parallel region and the initial thread are the same). However, it is not conforming in OpenMP 3.1 because the task region that releases the lock lck is different from the task region that acquires the lock.

EXAMPLE 5–2 Example Using Locks: Non-Conforming in OpenMP 3.1

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>

int main()
{
    int x;
    omp_lock_t lck;

    omp_init_lock(&lck);
    omp_set_lock(&lck);
    x = 0;

    #pragma omp parallel shared (x)
    {
        #pragma omp master
        {
            x = x + 1;
            omp_unset_lock(&lck);
        }
    }
    omp_destroy_lock(&lck);
}

5.4.3 References to Stack Data

A task is likely to have references to data on the stack of the routine where the task construct appears. Because the execution of a task may be deferred until the next implicit or explicit barrier, a given task could execute after the stack of the routine where it appears has already been popped and the stack data overwritten, thereby destroying the stack data listed as shared by the task.


The programmer is responsible for inserting the needed synchronizations to ensure that variables are still on the stack when the task references them, as illustrated in the following two examples.

In the first example, i is specified to be shared in the task construct, and the task accesses the copy of i that is allocated on the stack of work().

Task execution may be deferred, so tasks are executed at the implicit barrier at the end of the parallel region in main(), after the work() routine has already returned. So when a task references i, it accesses some undetermined value that happens to be on the stack at that time.

For correct results, the programmer needs to make sure that work() does not exit before the tasks have completed. This is done by inserting a taskwait directive after the task construct. Alternatively, i can be specified to be firstprivate in the task construct, instead of shared.

EXAMPLE 5–3 Stack Data: First Example — Incorrect Version

#include <stdio.h>
#include <omp.h>

void work()
{
    int i;

    i = 10;

    #pragma omp task shared(i)
    {
        #pragma omp critical
        printf("In Task, i = %d\n", i);
    }
}

int main(int argc, char** argv)
{
    omp_set_num_threads(8);
    omp_set_dynamic(0);

    #pragma omp parallel
    {
        work();
    }
}

EXAMPLE 5–4 Stack Data: First Example — Corrected Version

#include <stdio.h>
#include <omp.h>

void work()
{
    int i;

    i = 10;

    #pragma omp task shared(i)
    {
        #pragma omp critical
        printf("In Task, i = %d\n", i);
    }

    /* Use TASKWAIT for synchronization. */
    #pragma omp taskwait
}

int main(int argc, char** argv)
{
    omp_set_num_threads(8);
    omp_set_dynamic(0);

    #pragma omp parallel
    {
        work();
    }
}

In the second example, j in the task construct references the j in the sections construct. Therefore, the task accesses the firstprivate copy of j in the sections construct, which (in some implementations, including the Oracle Solaris Studio compilers) is a local variable on the stack of the outlined routine for the sections construct.

Task execution may be deferred, so the task is executed at the implicit barrier at the end of the sections region, after the outlined routine for the sections construct has exited. So when the task references j, it accesses some undetermined value on the stack.

For correct results, the programmer needs to make sure that the task is executed before the sections region reaches its implicit barrier, by inserting a taskwait directive after the task construct. Alternatively, j can be specified to be firstprivate in the task construct, instead of shared.

EXAMPLE 5–5 Second Example — Incorrect Version

#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    omp_set_num_threads(2);
    omp_set_dynamic(0);

    int j = 100;

    #pragma omp parallel shared(j)
    {
        #pragma omp sections firstprivate(j)
        {
            #pragma omp section
            {
                #pragma omp task shared(j)
                {
                    #pragma omp critical
                    printf("In Task, j = %d\n", j);
                }
            }
        }
    }
    printf("After parallel, j = %d\n", j);
}

EXAMPLE 5–6 Second Example — Corrected Version

#include <stdio.h>
#include <omp.h>

int main(int argc, char** argv)
{
    omp_set_num_threads(2);
    omp_set_dynamic(0);

    int j = 100;

    #pragma omp parallel shared(j)
    {
        #pragma omp sections firstprivate(j)
        {
            #pragma omp section
            {
                #pragma omp task shared(j)
                {
                    #pragma omp critical
                    printf("In Task, j = %d\n", j);
                }

                /* Use TASKWAIT for synchronization. */
                #pragma omp taskwait
            }
        }
    }
    printf("After parallel, j = %d\n", j);
}


Chapter 6: Automatic Scoping of Variables

Declaring the data sharing attributes of variables referenced in an OpenMP construct is called scoping. A description of each of the data sharing attributes can be found in the Data-Sharing Attribute Clauses section (Chapter 2) of the OpenMP specification.

In an OpenMP program, every variable referenced in an OpenMP construct is scoped. Generally, a variable referenced in a construct may be scoped in one of two ways. Either the programmer explicitly declares the scope of a variable with a data sharing attribute clause, or the implementation of the OpenMP API in the compiler automatically applies rules for predetermined or implicitly determined scopes according to the Data Environment: Data Scoping Rules section (Chapter 2) of the OpenMP specification.

Most users find scoping to be the hardest part of using the OpenMP paradigm. Explicitly scoping variables can be tedious and error-prone, especially with large and complicated programs. Moreover, the rules for implicitly determined and predetermined scopes of variables specified in the OpenMP 3.0 specification can yield some unexpected results. The task directive, which was introduced in OpenMP Specification 3.0, added to the complexity and difficulty of scoping.

The automatic scoping feature (called autoscoping) supported by the Oracle Solaris Studio compilers can be a very helpful tool, as it relieves the programmer from having to explicitly determine the scopes of variables. With autoscoping, the compiler determines the scopes of variables by using some smart rules in a very simple user model.

Earlier compiler releases limited autoscoping to variables in a parallel construct. Current Oracle Solaris Studio compilers extend the autoscoping feature to variables referenced in a task construct as well.


6.1 Autoscoping Data Scope Clause

Autoscoping is invoked either by specifying the variables to be scoped automatically on a __auto data scope clause, or by using a default(__auto) clause. Both are extensions to the OpenMP specification provided by the Oracle Solaris Studio compilers.

6.1.1 __auto Clause

Syntax: __auto(list-of-variables)

For Fortran, __AUTO(list-of-variables) is also accepted.

The __auto clause on a parallel or task construct directs the compiler to automatically determine the scope of the named variables in the construct. (Note the two underscores before auto.)

The __auto clause can appear on a PARALLEL, PARALLEL DO/for, PARALLEL SECTIONS, Fortran 95 PARALLEL WORKSHARE, or TASK directive.

If a variable is specified on the __auto clause, then it cannot be specified in any other data sharing attribute clause.

6.1.2 default(__auto) Clause

Syntax: default(__auto)

For Fortran, DEFAULT(__AUTO) is also accepted.

The default(__auto) clause on a parallel or task construct directs the compiler to automatically determine the scope of all variables referenced in the construct that are not explicitly scoped in any data scope clause.

The default(__auto) clause can appear on a PARALLEL, PARALLEL DO/for, PARALLEL SECTIONS, Fortran 95 PARALLEL WORKSHARE, or TASK directive.

6.2 Scoping Rules for a Parallel Construct

Under automatic scoping, the compiler applies the rules described in this section to determine the scope of a variable in a parallel construct.

These rules do not apply to variables scoped implicitly by the OpenMP specification, such as loop index variables of worksharing DO or for loops.


6.2.1 Scoping Rules for Scalar Variables

When autoscoping a scalar variable that is referenced in a parallel construct and that does not have a predetermined or implicitly determined scope, the compiler checks the use of the variable against the following rules, PS1-PS3, in the given order.

■ PS1: If the use of the variable in the parallel region is free of data race conditions for the threads in the team executing the region, then the variable is scoped SHARED.

■ PS2: If in each thread executing the parallel region the variable is always written before being read by the same thread, then the variable is scoped PRIVATE. The variable is scoped LASTPRIVATE if it can be scoped PRIVATE, is read before it is written after the parallel region, and the construct is either a PARALLEL DO or a PARALLEL SECTIONS.

■ PS3: If the variable is used in a reduction operation that can be recognized by the compiler, then the variable is scoped REDUCTION with that particular operation type.

6.2.2 Scoping Rules for Arrays

■ PA1: If the use of the array in the parallel region is free of data race conditions for the threads in the team executing the region, then the array is scoped as SHARED.

6.3 Scoping Rules for a task Construct

Under automatic scoping, the compiler applies the rules described in this section to determine the scope of a variable in a task construct.

These rules do not apply to variables scoped implicitly by the OpenMP specification, such as loop index variables of PARALLEL DO/for loops.

6.3.1 Scoping Rules for Scalar Variables

When autoscoping a scalar variable that is referenced in a task construct and that does not have a predetermined or implicitly determined scope, the compiler checks the use of the variable against the following rules, TS1-TS5, in the given order.

■ TS1: If the use of the variable is read-only in the task construct and read-only in the parallel construct in which the task construct is enclosed, then the variable is autoscoped as FIRSTPRIVATE.

■ TS2: If the use of the variable is free of data race and the variable will be accessible while the task is executing, then the variable is autoscoped as SHARED.

■ TS3: If the use of the variable is free of data race, is read-only in the task construct, and the variable may not be accessible while the task is executing, then the variable is autoscoped as FIRSTPRIVATE.


■ TS4: If the use of the variable is not free of data race, and in each thread executing the task region the variable is always written before being read by the same thread, and the value assigned to the variable in the task is not used outside the task region, then the variable is autoscoped as PRIVATE.

■ TS5: If the use of the variable is not free of data race, and the variable is not read-only in the task region, and some read in the task region might get the value defined outside the task, and the value assigned to the variable inside the task is not used outside the task region, then the variable is autoscoped as FIRSTPRIVATE.

6.3.2 Scoping Rules for Arrays

Autoscoping for tasks does not handle arrays.

6.4 General Comments About Autoscoping

Note that task autoscoping rules and autoscoping results could change in future releases. The order in which implicitly determined scoping rules and autoscoping rules are applied could also change in future releases.

The programmer explicitly requests autoscoping with the __auto(list-of-variables) clause or the default(__auto) clause. Specifying the default(__auto) or __auto(list-of-variables) clause for a parallel construct does not imply that the same clause applies to task constructs that are lexically or dynamically enclosed in the parallel construct.

When autoscoping a variable that does not have a predetermined or implicitly determined scope, the compiler checks the use of the variable against the above rules in the given order. If a rule matches, the compiler will scope the variable according to the matching rule. If no rule matches, or if autoscoping cannot handle the variable (due to certain restrictions, described in the next section), the compiler will scope the variable as SHARED and treat the parallel or task construct as if an IF (.FALSE.) or if(0) clause were specified.

Autoscoping failures generally occur for one of two reasons. One reason is that the use of the variable does not match any of the rules. The other reason is that the source code is too complex for the compiler to do a sufficient analysis. Function calls, complicated array subscripts, memory aliasing, and user-implemented synchronizations are some typical causes.


6.5 Restrictions

■ To enable autoscoping, the program must be compiled with -xopenmp at an optimization level of -xO3 or higher. Autoscoping is not enabled if the program is compiled with just -xopenmp=noopt.

■ Parallel and task autoscoping in C and C++ can handle only basic data types: integer, floating point, and pointer.

■ Task autoscoping cannot handle arrays.
■ Task autoscoping in C and C++ cannot handle global variables.
■ Task autoscoping cannot handle untied tasks.
■ Task autoscoping cannot handle tasks that are lexically enclosed in some other tasks. For example:

#pragma omp task /* task1 */
{
    ...
    #pragma omp task /* task2 */
    {
        ...
    }
    ...
}

In the example, the compiler does not attempt autoscoping for task2 because it is lexically nested in task1. The compiler will scope all variables referenced in task2 as SHARED and treat task2 as if an IF(.FALSE.) or if(0) clause is specified on the task.

■ Only OpenMP directives are recognized and used in the analysis. Calls to OpenMP runtime routines are not recognized. For example, if a program uses omp_set_lock() and omp_unset_lock() to implement a critical section, the compiler is not able to detect the existence of the critical section. Use CRITICAL and END CRITICAL directives if possible.

■ Only synchronizations specified by using OpenMP synchronization directives, such as BARRIER and MASTER, are recognized and used in data race analysis. User-implemented synchronizations such as busy-waiting are not recognized.

6.6 Checking the Results of Autoscoping

Use compiler commentary to check autoscoping results and to see whether any parallel regions were serialized because autoscoping failed.

When the program is compiled with the -g debug option, the compiler produces inline commentary that can be viewed with the er_src command, as shown in the following example. The er_src command is provided as part of the Oracle Solaris Studio software. For more information, see the er_src(1) man page or the Oracle Solaris Studio Performance Analyzer manual.


A good place to start is to compile with the -xvpara option. Compiling with -xvpara will give you a general idea about whether autoscoping for a particular construct was successful.

EXAMPLE 6–1 Autoscoping With -vpara

% cat source1.f

INTEGER X(100), Y(100), I, T

C$OMP PARALLEL DO DEFAULT(__AUTO)

DO I=1, 100

T = Y(I)

X(I) = T*T

END DO

C$OMP END PARALLEL DO

END

% f95 -xopenmp -xO3 -vpara -c -g source1.f

"source1.f", line 2: Autoscoping for OpenMP construct succeeded.

Check er_src for details

If autoscoping fails for a particular construct, a warning message is issued (with -xvpara) as shown in the following example.

EXAMPLE 6–2 Autoscoping Failure With -vpara

% cat source2.f

INTEGER X(100), Y(100), I, T

C$OMP PARALLEL DO DEFAULT(__AUTO)

DO I=1, 100

T = Y(I)

CALL FOO(X)

X(I) = T*T

END DO

C$OMP END PARALLEL DO

END

% f95 -xopenmp -xO3 -vpara -c -g source2.f

"source2.f", line 2: Warning: Autoscoping for OpenMP construct failed.

Check er_src for details. Parallel region will be executed by

a single thread.

More detailed information appears in the compiler commentary displayed by er_src:

EXAMPLE 6–3 Using er_src

% er_src source2.o

Source file: source2.f

Object file: source2.o

Load Object: source2.o

1. INTEGER X(100), Y(100), I, T

Source OpenMP region below has tag R1

Variables autoscoped as SHARED in R1: y

Variables autoscoped as PRIVATE in R1: t, i

Variables treated as shared because they cannot be autoscoped in R1: x

R1 will be executed by a single thread because
autoscoping for some variables was not successful

Private variables in R1: i, t

Shared variables in R1: y, x

2. C$OMP PARALLEL DO DEFAULT(__AUTO)

Source loop below has tag L1

L1 parallelized by explicit user directive

L1 autoparallelized

L1 parallel loop-body code placed in function _$d1A2.MAIN_

along with 0 inner loops

L1 could not be pipelined because it contains calls

3. DO I=1, 100

4. T = Y(I)

5. CALL FOO(X)

6. X(I) = T*T

7. END DO

8. C$OMP END PARALLEL DO

9. END

10.

6.7 Autoscoping Examples

This section provides some examples to illustrate how the autoscoping rules work.

EXAMPLE 6–4 A More Complicated Example

1. REAL FUNCTION FOO (N, X, Y)

2. INTEGER N, I

3. REAL X(*), Y(*)

4. REAL W, MM, M

5.

6. W = 0.0

7.

8. C$OMP PARALLEL DEFAULT(__AUTO)

9.

10. C$OMP SINGLE

11. M = 0.0

12. C$OMP END SINGLE

13.

14. MM = 0.0

15.

16. C$OMP DO

17. DO I = 1, N

18. T = X(I)

19. Y(I) = T

20. IF (MM .GT. T) THEN

21. W = W + T

22. MM = T

23. END IF

24. END DO

25. C$OMP END DO

26.

27. C$OMP CRITICAL


28. IF ( MM .GT. M ) THEN

29. M = MM

30. END IF

31. C$OMP END CRITICAL

32.

33. C$OMP END PARALLEL

34.

35. FOO = W - M

36.

37. RETURN

38. END

The function FOO() contains a parallel region, which contains a SINGLE construct, a work-sharing DO construct, and a CRITICAL construct. Besides the OpenMP parallel constructs, the code in the parallel region does the following:

1. Copy the value in array X to array Y.
2. Find the maximum positive value in X, and store it in M.
3. Accumulate the value of some elements of X into variable W.

How does the compiler use these rules to correctly scope the variables in the parallel region?

The following variables are used in the parallel region: I, N, MM, T, W, M, X, and Y. The compiler will determine the following information:

■ Scalar I is the loop index of the work-sharing DO loop. The OpenMP specification mandates that I be scoped PRIVATE.

■ Scalar N is only read in the parallel region and therefore will not cause a data race, so it is scoped as SHARED following rule S1.

■ Any thread executing the parallel region will execute statement 14, which sets the value of scalar MM to 0.0. This write will cause a data race, so rule S1 does not apply. The write happens before any read of MM in the same thread, so MM is scoped as PRIVATE according to rule S2.

■ Similarly, scalar T is scoped as PRIVATE.
■ Scalar W is read and then written at statement 21, so rules S1 and S2 do not apply. The addition operation is both associative and commutative; therefore, W is scoped as REDUCTION(+) according to rule S3.

■ Scalar M is written in statement 11, which is inside a SINGLE construct. The implicit barrier at the end of the SINGLE construct ensures that the write in statement 11 will not happen concurrently with either the read in statement 28 or the write in statement 29, and the latter two will not happen at the same time because both are inside the same CRITICAL construct. No two threads can access M at the same time. Therefore, the writes and reads of M in the parallel region do not cause a data race, and, following rule S1, M is scoped SHARED.

■ Array X is only read and not written in the region, so it is scoped as SHARED by rule A1.


■ The writes to array Y are distributed among the threads, and no two threads will write to the same elements of Y. Because no data race occurs, Y is scoped SHARED according to rule A1.

EXAMPLE 6–5 Example With QuickSort

static void par_quick_sort (int p, int r, float *data)

{

if (p < r)

{

int q = partition (p, r, data);

#pragma omp task default(__auto) if ((r-p)>=low_limit)

par_quick_sort (p, q-1, data);

#pragma omp task default(__auto) if ((r-p)>=low_limit)

par_quick_sort (q+1, r, data);

}

}

int main ()

{

...

#pragma omp parallel

{

#pragma omp single nowait

par_quick_sort (0, N-1, &Data[0]);

}

...

}

er_src result:

Source OpenMP region below has tag R1

Variables autoscoped as FIRSTPRIVATE in R1: p, q, data

Firstprivate variables in R1: data, p, q

47. #pragma omp task default(__auto) if ((r-p)>=low_limit)

48. par_quick_sort (p, q-1, data);

Source OpenMP region below has tag R2

Variables autoscoped as FIRSTPRIVATE in R2: q, r, data

Firstprivate variables in R2: data, q, r

49. #pragma omp task default(__auto) if ((r-p)>=low_limit)

50. par_quick_sort (q+1, r, data);

The scalar variables p and q, and the pointer variable data, are read-only in the task construct and read-only in the parallel region. Therefore, they are autoscoped as FIRSTPRIVATE according to TS1.

EXAMPLE 6–6 Another Fibonacci Example

int fib (int n)

{

int x, y;

if (n < 2) return n;

#pragma omp task default(__auto)

x = fib(n - 1);


#pragma omp task default(__auto)

y = fib(n - 2);

#pragma omp taskwait

return x + y;

}

er_src result:

Source OpenMP region below has tag R1

Variables autoscoped as SHARED in R1: x

Variables autoscoped as FIRSTPRIVATE in R1: n

Shared variables in R1: x

Firstprivate variables in R1: n

24. #pragma omp task default(__auto) /* shared(x) firstprivate(n) */

25. x = fib(n - 1);

Source OpenMP region below has tag R2

Variables autoscoped as SHARED in R2: y

Variables autoscoped as FIRSTPRIVATE in R2: n

Shared variables in R2: y

Firstprivate variables in R2: n

26. #pragma omp task default(__auto) /* shared(y) firstprivate(n) */

27. y = fib(n - 2);

28.

29. #pragma omp taskwait

30. return x + y;

31. }

Scalar n is read-only in the task constructs and read-only in the parallel construct. Therefore, n is autoscoped as FIRSTPRIVATE, according to TS1.

Scalar variables x and y are local variables of function fib(). Accesses to x and y in both tasks are free of data race. Because there is a taskwait, the two tasks will complete execution before the thread executing fib(), which encountered the tasks, exits fib(). This implies that x and y will be accessible while the two tasks are executing. Therefore, x and y are autoscoped as SHARED, according to TS2.

EXAMPLE 6–7 Another Autoscoping Example

int main(void)

{

int yy = 0;

#pragma omp parallel default(__auto) shared(yy)

{

int xx = 0;

#pragma omp single

{

#pragma omp task default(__auto) // task1

{

xx = 20;


}

}

#pragma omp task default(__auto) // task2

{

yy = xx;

}

}

return 0;

}

er_src result:

Source OpenMP region below has tag R1

Variables autoscoped as PRIVATE in R1: xx

Private variables in R1: xx

Shared variables in R1: yy

7. #pragma omp parallel default(__auto) shared(yy)

8. {

9. int xx = 0;

10.

Source OpenMP region below has tag R2

11. #pragma omp single

12. {

Source OpenMP region below has tag R3

Variables autoscoped as SHARED in R3: xx

Shared variables in R3: xx

13. #pragma omp task default(__auto) // task1

14. {

15. xx = 20;

16. }

17. }

18.

Source OpenMP region below has tag R4

Variables autoscoped as PRIVATE in R4: yy

Variables autoscoped as FIRSTPRIVATE in R4: xx

Private variables in R4: yy

Firstprivate variables in R4: xx

19. #pragma omp task default(__auto) // task2

20. {

21. yy = xx;

22. }

23. }

In this example, xx is a private variable in the parallel region. One of the threads in the team modifies its initial value of xx by executing task1. Then all of the threads encounter task2, which uses xx to do some computation.

In task1, the use of xx is free of data race. Because there is an implicit barrier at the end of the single construct and task1 must complete before exiting this barrier, xx will be accessible while task1 is executing. Therefore, according to TS2, xx is autoscoped as SHARED on task1.


In task2, the use of xx is read-only. However, the use of xx is not read-only in the enclosing parallel construct. Because xx is predetermined as PRIVATE for the parallel construct, whether xx will be accessible while task2 is executing is not certain. Therefore, according to TS3, xx is autoscoped FIRSTPRIVATE on task2.

In task2, the use of yy is not free of data race, and in each thread executing task2, the variable yy is always written before being read by the same thread. So, according to TS4, yy is autoscoped PRIVATE on task2.

EXAMPLE 6–8 A Final Example

int foo(void)

{

int xx = 1, yy = 0;

#pragma omp parallel shared(xx,yy)

{

#pragma omp task default(__auto)

{

xx += 1;

#pragma omp atomic

yy += xx;

}

#pragma omp taskwait

}

return 0;

}

er_src result:

Source OpenMP region below has tag R1

Shared variables in R1: yy, xx

5. #pragma omp parallel shared(xx,yy)

6. {

Source OpenMP region below has tag R2

Variables autoscoped as SHARED in R2: yy

Variables autoscoped as FIRSTPRIVATE in R2: xx

Shared variables in R2: yy

Firstprivate variables in R2: xx

7. #pragma omp task default(__auto)

8. {

9. xx += 1;

10.

11. #pragma omp atomic

12. yy += xx;

13. }

14.

15. #pragma omp taskwait

16. }


The use of xx in the task construct is not read-only and is not free of data race. However, the read of xx in the task region gets the value of xx defined outside the task. (In this example, because xx is SHARED for the parallel region, the definition of xx is actually outside the parallel region.) Therefore, according to TS5, xx is autoscoped as FIRSTPRIVATE.

The use of yy in the task construct is not read-only but is free of data race. yy will be accessible while the task is executing because there is a taskwait. Therefore, according to TS2, yy is autoscoped as SHARED.


Scope Checking

Autoscoping can help the programmer decide how to scope variables. However, for some complicated programs, autoscoping might not be successful or the result of autoscoping might not be what the programmer expects. Incorrect scoping can cause many inconspicuous yet serious problems. For example, incorrectly scoping some variable as SHARED may cause a data race; incorrectly privatizing a variable may result in an undefined value for the variable outside the construct.

Oracle Solaris Studio C, C++, and Fortran compilers provide a compile-time scope-checking feature where the compiler determines whether variables in an OpenMP program are correctly scoped.

Based on the compiler's capabilities, scope checking can discover potential problems including data races, inappropriate privatization or reduction of variables, and other scoping issues. During scope checking, the data-sharing attributes specified by the programmer, the implicit data-sharing attributes determined by the compiler, and autoscoping results are all checked by the compiler.

7.1 Using the Scope Checking Feature

To enable scope checking, compile the OpenMP program with the -xvpara and -xopenmp options, and at optimization level -xO3 or higher. Scope checking does not work if the program is compiled with just -xopenmp=noopt. If the optimization level is less than -xO3, the compiler will issue a warning message and will not do any scope checking.

During scope checking, the compiler will check all OpenMP constructs. If the scoping of some variables causes problems, the compiler will issue warning messages and, in some cases, suggestions for the correct data-sharing attribute clause to use.

For example:


EXAMPLE 7–1 Scope Checking

% cat source1.c

#include <omp.h>

#include <string.h>

int main()

{

int g[100], b, i;

memset(g, 0, sizeof(int)*100);

#pragma omp parallel for shared(b)

for (i = 0; i < 100; i++)

{

b += g[i];

}

return 0;

}

% cc -xopenmp -xO3 -xvpara source1.c

"source1.c", line 10: Warning: inappropriate scoping

variable ’b’ may be scoped inappropriately as ’shared’

. write at line 13 and write at line 13 may cause data race

"source1.c", line 10: Warning: inappropriate scoping

variable ’b’ may be scoped inappropriately as ’shared’

. write at line 13 and read at line 13 may cause data race

The compiler will not do scope checking if the optimization level is less than -xO3.

% cc -xopenmp=noopt -xvpara source1.c

"source1.c", line 10: Warning: Scope checking under vpara compiler

option is supported with optimization level -xO3 or higher.

Compile with a higher optimization level to enable this feature

The following example shows potential scoping errors:

EXAMPLE 7–2 Scoping Error Example

% cat source2.c

#include <omp.h>

int main()

{

int g[100];

int r=0, a=1, b, i;

#pragma omp parallel for private(a) lastprivate(i) reduction(+:r)

for (i = 0; i < 100; i++)

{

g[i] = a;

b = b + g[i];

r = r * g[i];


}

a = b;

return 0;

}

% cc -xopenmp -xO3 -xvpara source2.c

"source2.c", line 8: Warning: inappropriate scoping

variable ’r’ may be scoped inappropriately as ’reduction’

. reference at line 13 may not be a reduction of the specified type

"source2.c", line 8: Warning: inappropriate scoping

variable ’a’ may be scoped inappropriately as ’private’

. read at line 11 may be undefined

. consider ’firstprivate’

"source2.c", line 8: Warning: inappropriate scoping

variable ’i’ may be scoped inappropriately as ’lastprivate’

. value defined inside the parallel construct is not used outside

. consider ’private’

"source2.c", line 8: Warning: inappropriate scoping

variable ’b’ may be scoped inappropriately as ’shared’

. write at line 12 and write at line 12 may cause data race

"source2.c", line 8: Warning: inappropriate scoping

variable ’b’ may be scoped inappropriately as ’shared’

. write at line 12 and read at line 12 may cause data race

This artificial example shows some typical scoping errors that scope checking can detect.

1. r is specified as a reduction variable whose operation is +, but actually the operation should be *.

2. a is explicitly scoped as PRIVATE. Because PRIVATE variables do not have an initial value, the reference on line 11 to a could read garbage values. The compiler points out this problem, and suggests that the programmer consider scoping a as FIRSTPRIVATE.

3. Variable i is the loop index variable. In some cases, the programmer may wish to specify it to be LASTPRIVATE if the value of the loop index is used after the loop. However, in the example, i is not referenced at all after the loop. The compiler issues a warning and suggests that the programmer scope i as PRIVATE. Using PRIVATE instead of LASTPRIVATE can lead to better performance.

4. The programmer does not explicitly specify a data-sharing attribute for variable b. According to page 79, lines 27-28 of the OpenMP Specification 3.0, b will be implicitly scoped as SHARED. However, scoping b as SHARED will cause a data race. The correct data-sharing attribute of b should be REDUCTION.


7.2 Restrictions

■ As mentioned, scope checking works only with optimization level -xO3 or higher. Scope checking does not work if the program is compiled with just -xopenmp=noopt.
■ Only synchronizations specified by using OpenMP synchronization directives, such as BARRIER and MASTER, are recognized and used in the data race analysis. User-implemented synchronizations, such as busy-waiting, are not recognized.


Performance Considerations

Once you have a correct, working OpenMP program, consider its overall performance. This chapter describes some general techniques that you can use to improve the efficiency and scalability of an OpenMP application.

The Oracle Solaris Studio portal posts occasional articles and case studies regarding performance analysis and optimization of OpenMP applications at http://www.oracle.com/technetwork/server-storage/solarisstudio.

8.1 Some General Performance Recommendations

Some general techniques for improving performance of OpenMP applications are:

■ Minimize synchronization.
  ■ Avoid or minimize the use of BARRIER, CRITICAL sections, ORDERED regions, and locks.
  ■ Use the NOWAIT clause where possible to eliminate redundant or unnecessary barriers. For example, there is always an implied barrier at the end of a parallel region. Adding NOWAIT to a final DO in the region eliminates one redundant barrier.

  ■ Use named CRITICAL sections for fine-grained locking.
  ■ Use explicit FLUSH with care. Flushes can cause data cache restores to memory, and subsequent data accesses might require reloads from memory, all of which decrease efficiency.

■ By default, idle threads will be put to sleep after a certain timeout period. If a thread does not find work by the end of the timeout period, it will go to sleep, thus avoiding wasting processor cycles at the expense of other threads. The default timeout period might not be sufficient for your application, causing the threads to go to sleep too soon or too late. In general, if an application has dedicated processors to run on, SPIN would give the best application performance. If an application shares processors with other applications at the same time, then the default setting, or a setting that does not let the threads spin for too long,


would be best for system throughput. Use the SUNW_MP_THR_IDLE environment variable to override the default timeout period, even up to the point where the idle threads will never be put to sleep and remain active all the time.

■ Parallelize at the highest level possible, such as outer DO/FOR loops. Enclose multiple loops in one parallel region. In general, make parallel regions as large as possible to reduce parallelization overhead. For example, this construct is less efficient:

!$OMP PARALLEL

....

!$OMP DO

....

!$OMP END DO

....

!$OMP END PARALLEL

!$OMP PARALLEL

....

!$OMP DO

....

!$OMP END DO

....

!$OMP END PARALLEL

A more efficient construct:

!$OMP PARALLEL

....

!$OMP DO

....

!$OMP END DO

.....

!$OMP DO

....

!$OMP END DO

!$OMP END PARALLEL

■ Use PARALLEL DO/FOR instead of worksharing DO/FOR directives in parallel regions. The PARALLEL DO/FOR is implemented more efficiently than a general parallel region containing possibly several loops. For example, this construct is less efficient:

!$OMP PARALLEL

!$OMP DO

.....

!$OMP END DO

!$OMP END PARALLEL

while this construct is more efficient:

!$OMP PARALLEL DO

....

!$OMP END PARALLEL DO


■ On Oracle Solaris systems, use SUNW_MP_PROCBIND to bind threads to processors. Processor binding, when used along with static scheduling, benefits applications that exhibit a certain data reuse pattern where data accessed by a thread in a parallel region will be in the local cache from a previous invocation of a parallel region. See “2.3 Processor Binding” on page 20.

■ Use MASTER instead of SINGLE wherever possible.
  ■ The MASTER directive is implemented as an IF-statement with no implicit BARRIER: IF(omp_get_thread_num() == 0) {...}
  ■ The SINGLE directive is implemented similarly to other worksharing constructs. Keeping track of which thread reached SINGLE first adds additional runtime overhead. There is an implicit BARRIER if NOWAIT is not specified, which is less efficient.

■ Choose the appropriate loop scheduling.
  ■ STATIC causes no synchronization overhead and can maintain data locality when data fits in cache. However, STATIC could lead to load imbalance.
  ■ DYNAMIC and GUIDED incur a synchronization overhead to keep track of which chunks have been assigned. While these schedules could lead to poor data locality, they can improve load balancing. Experiment with different chunk sizes.

■ Use LASTPRIVATE with care, as it has the potential of high overhead.
  ■ Data needs to be copied from private to shared storage upon return from the parallel construct.
  ■ The compiled code checks which thread executes the logically last iteration. This imposes extra work at the end of each chunk in a parallel DO/FOR. The overhead adds up if there are many chunks.

■ Use efficient thread-safe memory management.
  ■ Applications could be using malloc() and free() explicitly, or implicitly in the compiler-generated code for dynamic/allocatable arrays, vectorized intrinsics, and so on.
  ■ The thread-safe malloc() and free() in libc have a high synchronization overhead caused by internal locking. Faster versions can be found in the libmtmalloc library. Link with -lmtmalloc to use libmtmalloc.

■ Small data cases could cause OpenMP parallel loops to underperform. Use the if clause on PARALLEL constructs to indicate that a loop should run in parallel only in those cases where some performance gain can be expected.

■ When possible, merge loops. For example, merge two loops:

!$omp parallel do

do i = ...

..........statements_1...

end do


!$omp parallel do

do i = ...

..........statements_2...

end do

Merged into a single loop:

!$omp parallel do

do i = ...

..........statements_1...

..........statements_2...

end do

■ Try nested parallelism if your application lacks scalability beyond a certain level. See “1.2 Special Conventions” on page 11 for more information about nested parallelism in OpenMP.

8.2 False Sharing And How To Avoid It

Careless use of shared memory structures with OpenMP applications can result in poor performance and limited scalability. Multiple processors updating adjacent shared data in memory can result in excessive traffic on the multiprocessor interconnect and, in effect, cause serialization of computations.

8.2.1 What Is False Sharing?

Most high performance processors, such as UltraSPARC processors, insert a cache buffer between slow memory and the high speed registers of the CPU. Accessing a memory location causes a slice of actual memory (a cache line) containing the memory location requested to be copied into the cache. Subsequent references to the same memory location or those around it can probably be satisfied out of the cache until the system determines it is necessary to maintain the coherency between cache and memory.

However, simultaneous updates of individual elements in the same cache line coming from different processors invalidate entire cache lines, even though these updates are logically independent of each other. Each update of an individual element of a cache line marks the line as invalid. Other processors accessing a different element in the same line see the line marked as invalid. They are forced to fetch a more recent copy of the line from memory or elsewhere, even though the element accessed has not been modified. This occurs because cache coherency is maintained on a cache-line basis and not for individual elements. As a result, interconnect traffic and overhead increase. Also, while the cache-line update is in progress, access to the elements in the line is inhibited.

This situation is called false sharing. If it occurs frequently, the performance and scalability of an OpenMP application suffer significantly.

False sharing degrades performance when all of the following conditions occur:


■ Shared data is modified by multiple processors.
■ Multiple processors update data within the same cache line.
■ Data updating occurs very frequently (as in a tight loop).

Note that shared data that is read-only in a loop does not lead to false sharing.

8.2.2 Reducing False Sharing

Careful analysis of those parallel loops that play a major part in the execution of an application can reveal performance scalability problems caused by false sharing. In general, false sharing can be reduced by the following techniques:

■ Making use of private data as much as possible
■ Using the compiler’s optimization features to eliminate memory loads and stores

In specific cases, the impact of false sharing might be less visible when dealing with larger problem sizes, as there might be less sharing.

Techniques for tackling false sharing are very much dependent on the particular application. In some cases, a change in the way the data is allocated can reduce false sharing. In other cases, changing the mapping of iterations to threads by giving each thread more work per chunk (by changing the chunksize value) can also lead to a reduction in false sharing.

8.3 Oracle Solaris OS Tuning Features

Oracle Solaris supports features that improve the performance of OpenMP programs without hardware upgrades, such as Memory Placement Optimizations (MPO) and Multiple Page Size Support (MPSS), among others.

MPO enables the OS to allocate pages close to the processors that access those pages. Sun Fire E20K and Sun Fire E25K systems have lower memory latencies within the same UniBoard than between different UniBoards. The default MPO policy, called first-touch, allocates memory on the UniBoard containing the processor that first touches the memory. The first-touch policy can greatly improve the performance of applications whose data accesses are made mostly to the memory local to each processor. Compared to a random memory placement policy, where memory is evenly distributed throughout the system, first-touch placement can lower memory latencies and increase bandwidth, leading to higher performance.
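A common way to exploit the first-touch policy is to initialize data in parallel, with the same decomposition the compute loops will use, so each thread's pages are placed near the processor that first writes them. A minimal sketch (the function name is ours, not from the guide):

```c
#include <stdlib.h>
#include <stddef.h>

/* First-touch initialization: under a first-touch placement policy,
 * a page is allocated near the processor that first writes it.
 * Touching the array in parallel with a static schedule keeps each
 * thread's chunk of pages local to that thread. */
double *alloc_first_touch(size_t n) {
    double *a = malloc(n * sizeof *a);
    if (a == NULL)
        return NULL;
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = 0.0;   /* each thread first-touches its own chunk */
    return a;
}
```

By contrast, a serial initialization loop places all pages near the initializing thread's processor, penalizing the other threads later.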

The MPSS feature enables a program to use different page sizes for different regions of virtual memory. The default Oracle Solaris page size is relatively small (8 KB on UltraSPARC processors and 4 KB on AMD64 Opteron processors). Applications that suffer from too many TLB misses could experience a performance boost by using a larger page size.
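The benefit of larger pages is easy to quantify with back-of-the-envelope arithmetic (the sizes below are examples): fewer pages are needed to map the same buffer, so far fewer TLB entries cover the same data.

```c
#include <stddef.h>

/* Number of pages -- and therefore TLB entries -- needed to map a
 * buffer at a given page size. A 512 MB array needs 65,536 pages at
 * the 8 KB default but only 128 pages with 4 MB large pages. */
size_t pages_needed(size_t bytes, size_t page_size) {
    return (bytes + page_size - 1) / page_size;   /* round up */
}
```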

TLB misses can be measured using the Oracle Solaris Studio Performance Analyzer.


The default page size on a specific platform can be obtained with the Oracle Solaris OS command /usr/bin/pagesize. The -a option on this command lists all the supported page sizes. (See the pagesize(1) man page for details.)
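For example (the output values vary by system; getconf PAGESIZE is the POSIX equivalent, shown here so the snippet also runs outside Solaris):

```shell
# On Oracle Solaris, per the text above:
#   /usr/bin/pagesize        # default page size, e.g. 8192
#   /usr/bin/pagesize -a     # all supported page sizes
# POSIX equivalent that works on most systems:
getconf PAGESIZE
```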

The three ways to change the default page size for an application are:

■ Use the Oracle Solaris OS command ppgsz(1).
■ Compile the application with the -xpagesize, -xpagesize_heap, and -xpagesize_stack options. See the compiler man pages for details.
■ Use MPSS-specific environment variables. See the mpss.so.1(1) man page for details.


APPENDIX A

Placement of Clauses on Directives

TABLE A–1 Pragmas Where Clauses Can Appear

Clause/Pragma   parallel  do/for  sections  single  parallel  parallel  parallel   task
                                                    do/for    sections  workshare

if                 +                                   +         +         +         +
private            +         +       +         +      +         +         +         +
shared             +                                   +         +         +         +
firstprivate       +         +       +         +      +         +         +         +
lastprivate                  +       +                +         +
default            +                                   +         +         +         +
reduction          +         +       +                +         +         +
copyin             +                                   +         +         +
copyprivate                                  + (1)
ordered                      +                        +
schedule                     +                        +
nowait                       + (2)   + (2)   + (2)
num_threads        +                                   +         +         +
untied                                                                               +
final                                                                                +
mergeable                                                                            +
__auto             +                                   +         +         +         +


1. Fortran only: COPYPRIVATE can appear on the END SINGLE directive.
2. For Fortran, a NOWAIT modifier can only appear on the END DO, END SECTIONS, END SINGLE, or END WORKSHARE directives.
3. Only Fortran supports WORKSHARE and PARALLEL WORKSHARE.


Index

A
__auto, 50
automatic scoping, 49–61

C
cache line, 70
compiling for OpenMP, 13–23

D
default(__auto), 50
directive, 12
documentation, accessing, 7
documentation index, 7
dynamic thread adjustment, 15

E
environment variables, 15–20

F
false sharing, 70

G
guided scheduling, 18

I
idle threads, 17
implementation, 25

M
memory placement optimization (MPO), 71

N
nested parallelism, 15, 31, 32
number of threads, OMP_NUM_THREADS, 15

O
OMP_DYNAMIC, 15
OMP_MAX_ACTIVE_LEVELS, 16, 33
OMP_NESTED, 15, 32
OMP_NUM_THREADS, 15
OMP_PROC_BIND, 16
OMP_SCHEDULE, 15
OMP_STACKSIZE, 15
OMP_THREAD_LIMIT, 16
OMP_WAIT_POLICY, 16
OpenMP API specification, 11
Oracle Solaris OS tuning, 71

P
PARALLEL environment variable, 16
parallelism, nested, 31
performance, 67
pragma, 12

S
scalability, 70
scheduling, OMP_SCHEDULE, 15
scoping of variables
  automatic, 49–61
  compiler commentary, 53–55
  rules, 50–51
SLEEP, 17
SPIN, 17
stack size, 18, 22
stacks, 22
STACKSIZE, 18
-stackvar, 22
SUNW_MP_MAX_POOL_THREADS, 33
SUNW_MP_THR_IDLE, 17
SUNW_MP_WAIT_POLICY, 19
SUNW_MP_WARN, 16

T
task construct, automatic scoping rules, 51–52
thread stack size, 18

W
warning messages, 16
weighting factor, 18

X
-xopenmp, 13
