
Page 1

Confidential © Copyright 2018 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice.

SGI UV3000 Parallel Programming and Optimization

Nov 1st, 2019

Japan Advanced Institute of Science and Technology

Hewlett-Packard Japan, Ltd.

Page 2

Confidential ©2017 SGI

1. System summary

2. Submit job

3. Creating a job script

4. Compiler Options

5. Numerical Library

6. First-Touch Policy & Data Placement

7. Debugger

8. Exercise: Compile and Execute

9. Exercise: Auto-parallelization and OpenMP

Contents

Page 3

System summary

Page 4

UV3000

Model: SGI UV3000

Total:
  System   1 (4 racks)
  IRU      16
  Blade    128
  CPU      71.27 TFLOPS, 256 CPUs / 1536 cores
  Memory   32 TB (16 GB x 2048 DIMMs)
  Disk     External disk device, 160 TB (physical), SGI InfiniteStorage 5100
  I/O I/F  Dual-port 16 Gbps FC HBA
  NW I/F   Dual-port 10 GbE

Per blade:
  CPU      Intel Xeon E5-4655 v3 x 2 (6 cores / 2.9 GHz / 30 MB / 9.6 GT/s / 135 W)
  Memory   16 GB DDR4 x 2 DIMMs x 4 channels per CPU, 2133 MT/s
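The peak-performance figure can be cross-checked from the per-core numbers above, assuming 16 double-precision FLOPs per cycle per core (AVX2 FMA on this processor generation; the slide does not state this factor):

```shell
# 1536 cores x 2.9 GHz x 16 DP FLOPs/cycle (assumed AVX2 FMA throughput)
awk 'BEGIN { printf "%.2f TFLOPS\n", 1536 * 2.9e9 * 16 / 1e12 }'
# -> 71.27 TFLOPS, matching the table
```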

Page 5

InfiniteStorage 5100

Model: SGI InfiniteStorage 5100
  System          1
  Chassis         4U enclosure
  Controller      Active/active controllers
  Host interface  4x 16 Gb FC ports
  Cache size      816 GB (16 GB system cache + 800 GB SSD)
  Disk units      4 TB NL-SAS HDD x 42, 800 GB SAS SSD x 2

Page 6

Login Server (uv)

Model: SGI Rackable C2112-GP2
  System   1
  CPU      2x Intel Xeon E5-2667 v3
  Memory   128 GB (16 GB x 8 DIMMs)
  Disk     1 TB SATA HDD x 2
  Network  4x Gigabit Ethernet, 2x 10 Gigabit Ethernet

Page 7

Network

[Diagram: the campus network connects over 10 GbE to the UV3000 login/PMT server and to the NFS server hosting /home; the UV3000 connects to its storage over 16 Gbps FC.]

Page 8

Server Room

Page 9

UV3000 Blade

[Block diagram of a UV3000 blade: two Intel Xeon E5-4655 v3 sockets, each with six cores (core 0 to core 5) sharing a 30 MB L3 cache, linked to each other by QPI; each socket drives eight DDR4 DIMMs (16 GB each in this system); both sockets connect through QPI to the UV Hub, which provides the NUMAlink6 (NL6) channels to the rest of the system.]

Page 10

Topology

This drawing shows 1/8 of the topology. Each vertex of the cube is connected to two router blades.

Page 11

– OS SUSE Linux Enterprise Server 12

– SGI® Performance Suite

– SGI Accelerate

– SGI MPT

– Intel® Parallel Studio XE 2016 update2, 2017 update1

– PGI compiler 2017

– Gaussian 16

Software

Page 12

– Use the UV3000 system via the login server (uv).

Login to the login server

hostname: uv.jaist.ac.jp

• You can log in to the login server with ssh; file transfer is available via scp.

$ ssh -l <userID> uv
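A convenience sketch (not from the slides): an entry in ~/.ssh/config on your local machine shortens both the ssh and scp command lines. The user name here is a placeholder.

```
# ~/.ssh/config on the client side ("userID" is a placeholder)
Host uv
    HostName uv.jaist.ac.jp
    User userID

# Login and file transfer then become:
#   ssh uv
#   scp results.dat uv:~/
```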

Page 13

– Each user is assigned a home directory. The home directory resides on the file server and is shared with each system via NFS. It is the working area when you log in to each system.

– The UV3000 (altix-uv) has a local work directory (/work). We recommend using it because it has good I/O performance.

Note, however, that the work area is not backed up. Please keep data at your own risk.

– The work directory is NFS-mounted on the login server (uv).

Storage of UV3000

Page 14

– module is a user interface to the Modules package. The Modules package provides for the dynamic modification of the user's environment.

– mpi/PrgEnv-intel_sgi is loaded by default.

– mpi/PrgEnv-intel_sgi is the environment in which the Intel compiler and SGI MPT can be used.

– List all available modulefiles in the current environment:

$ module avail

– Load modulefile(s) into the shell environment:

$ module load mpi/PrgEnv-intel_sgi

– If you use a non-standard environment in a batch script, you need to perform the initial configuration of the module command in the batch script:

source /etc/profile.d/modules.sh

module load mpi/PrgEnv-intel

– In the case of csh, use "/etc/profile.d/modules.csh" instead.

Switching programming environment

Page 15

Main Modulefile

Module name                                         Compiler                        MPI
Intel/16.0.2                                        Intel Parallel Studio XE 2016   X
Intel/17.0.1                                        Intel Parallel Studio XE 2017   X
mpi/PrgEnv-intel-2016.2                             Intel Parallel Studio XE 2016   Intel MPI
mpi/PrgEnv-intel, mpi/PrgEnv-intel-2017.1           Intel Parallel Studio XE 2017   Intel MPI
mpi/PrgEnv-intel-2016.2_sgi                         Intel Parallel Studio XE 2016   SGI MPT
mpi/PrgEnv-intel_sgi, mpi/PrgEnv-intel-2017.1_sgi   Intel Parallel Studio XE 2017   SGI MPT
pgi/17.1                                            PGI compiler 17.1               X
pgi/PrgEnv-pgi                                      PGI compiler 17.1               OpenMPI

Page 16

– PBS Professional (PBS) has been introduced as the job management system. There are two ways to run a job: one is interactive mode, the other is to use a script. For how to create a job script file, please refer to "Creating a job script".

– Please specify the execution conditions when you submit jobs.

PBS Professional

queue    vnode    core     memory       wall time   executions (per user)   priority
TINY     1-4      1-24     ~512GB       6 hours     - (1)                   160
SINGLE   1-2      1-12     128-256GB    1 week      32 (16)                 150
SMALL    4-8      12-48    256GB-1TB    1 week      16 (6)                  130
MEDIUM   8-32     48-192   1TB-4TB      3 days      4 (1)                   90
LARGE    32-64    192-384  4TB-8TB      2 days      2 (1)                   70
XLARGE   64-128   384-768  4TB-16TB     2 days      1 (1)                   30
APPLI    1-2      1-12     128GB-256GB  3 weeks     16 (6)                  110
LONG-S   1-8      12-48    256GB-1TB    2 weeks     3 (1)                   110
LONG-M   8-32     48-192   1TB-4TB      1 week      1 (1)                   90
LONG-L   32-96    192-576  4TB-12TB     5 days      1 (1)                   30

Page 17

– The UV3000 system can be used as a single SMP machine with up to 32 TB of memory and 256 sockets (1536 cores).

*SMP (Symmetric Multi-Processing) is a multi-processor architecture in which processing is distributed across multiple CPUs that are all treated equally for parallel processing. It is also translated as "symmetric multi-processor".

A single process can access up to 32 TB of memory. (On JAIST's PC cluster, unless you use a parallel program, one process can use at most 64 GB of memory.) Simply put, think of it as a personal computer that runs 1536 cores and 32 TB of memory under one OS.

To use the full 32 TB of memory, you must apply to the administrator.
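To put the figures above in perspective, the average memory per core follows directly from the totals (integer arithmetic, rounded down):

```shell
# 32 TB of shared memory across 1536 cores
cores=1536
mem_gb=$((32 * 1024))
echo "$((mem_gb / cores)) GB per core on average"
# -> 21 GB per core on average
```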

– It is possible to perform calculations using large-scale memory.

– It is possible to automatically parallelize existing source code by using the -parallel option of the Intel compiler.

– Well-suited calculations:

– Calculations using large memory

– Programs that use automatic parallelization or OpenMP

– Weak point:

– Calculations with heavy disk I/O

(The directly connected work disk (/work) is available, but I/O performance is lower because it is shared by the entire OS.)

– Notes on executing programs:

– Avoid large output to standard output. You may experience a system malfunction.

About the program to be executed

Page 18

Submit job

Page 19

– The qsub command is used to submit a batch job to PBS. <job script> is a script for job submission.

$ qsub <job script>

In the following example, the script is in the home directory and is submitted from there with the qsub command. If you submit from a different location, please modify the userID and directory accordingly. There is a sample program in /work/Samples.

Submit job (batch)

$ cat test_prog.sh
#!/bin/bash
#PBS -q TINY
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -N sample_JOB
#PBS -j oe
cd ${PBS_O_WORKDIR}
dplace ./sample.out

---Submit job---
$ qsub test_prog.sh
---Finish---
$ cat /user1/xxx/userID/directory/sample_JOB.ojobID

Page 20

– Jobs can also be run interactively.

$ qsub -I

When the job submission is successful, a job ID is assigned. The session ends with the exit command.

Submit job (Interactive)

$ qsub -I -l select=1:ncpus=4
qsub: waiting for job 4107.altix-uv to start
qsub: job 4107.altix-uv ready

altix-uv /home/sgise2>
altix-uv /home/sgise2> cd /work/sgise2
altix-uv /work/sgise2> ./mathprogram1
INTEGRAL[ 0.1 0.9: 100000000 STEPS]
altix-uv /work/sgise2> exit
logout

qsub: job 4107.altix-uv completed

Page 21

qsub options

Option            Description
-q queue_name     Job is submitted to the named queue at the default server. If not specified, the
                  job is submitted to the default queue (TINY).
-N name           Sets the job's Job_Name attribute to name. If no script is used, the job's name is
                  "STDIN". (A string up to 236 characters in length; it must consist of an alphabetic
                  or numeric character followed by printable, non-whitespace characters.)
-a date_time      Point in time after which the job is eligible for execution, given in pairs of
                  digits. Sets the job's Execution_Time attribute to date_time.
                  Format: [[[[CC]YY]MM]DD]hhmm[.SS]
-j oe | eo        Whether and how to join the job's standard error and standard output streams.
                  Sets the job's Join_Path attribute.
                  oe: standard error is merged into standard output.
                  eo: standard output is merged into standard error.
-o path_name      Path to be used for the job's standard output stream. Sets the job's Output_Path
                  attribute. If -o is not specified, PBS copies the standard output to the current
                  working directory where qsub was executed, using the default filename
                  <job name>.o<sequence number>.
-e path_name      Path to be used for the job's standard error stream. Sets the job's Error_Path
                  attribute. If -e is not specified, PBS copies the standard error to the current
                  working directory where qsub was executed, using the default filename
                  <job name>.e<sequence number>.
-m mail_option    The set of conditions under which mail about the job is sent. Sets the job's
                  Mail_Points attribute. The argument can be either "n" or any combination of
                  a, b, and e.
                  n: no mail is sent.
                  a: mail is sent when the job is aborted by the batch system.
                  b: mail is sent when the job begins execution.
                  e: mail is sent when the job terminates.
-M mail_address   List of users to whom mail about the job is sent.
-l keyword=value  Requests resources and specifies job placement. Sets the job's Resource_List
                  attribute.
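For example, with the defaults above, a job named sample_JOB that received sequence number 4107 (an illustrative value) would produce output files named as follows:

```shell
jobname="sample_JOB"
seq=4107
echo "${jobname}.o${seq}"   # default standard-output filename
echo "${jobname}.e${seq}"   # default standard-error filename (when -j is not used)
```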

Page 22

UV3000 creates virtual nodes with PBS. Jobs are structured to execute using virtual nodes.

In the case of the JAIST UV3000, the 256-CPU (1536-core), 32 TB system is divided into 256 virtual nodes.

Virtual nodes are specified in the select statement.

There are two ways to specify virtual nodes: by the number of virtual nodes, or by the number of cores to be used.

The following is an example of executing MPI using 24 cores.

Specifying the number of virtual nodes:

qsub -l select=4:ncpus=6:mpiprocs=6

Specifying the number of cores:

qsub -l select=1:ncpus=24:mpiprocs=24

About virtual nodes

Page 23

Use a select statement to specify the number of nodes and the number of cores.

Write the job submission options as:

-l select=N1:ncpus=N2

In the case of an MPI job, it looks like the following:

-l select=N1:ncpus=N2:mpiprocs=N3

N1: the number of nodes.

N2: the number of cores in one compute node.

N3: the number of MPI processes in one compute node.

In the case of a pure MPI job, N3 = N2.

In the case of a hybrid MPI/OpenMP job, N3 = N2 / (number of OpenMP threads).

If the values change from node to node, the format is as follows (connect with +):

-l select=N1:ncpus=N2:mpiprocs=N3+M1:ncpus=M2:mpiprocs=M3
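The N3 rule above can be checked with a line of shell arithmetic. For a hybrid job on 24-core virtual nodes with 6 OpenMP threads per MPI process (illustrative values):

```shell
ncpus=24          # cores per virtual node (N2)
omp_threads=6     # OpenMP threads per MPI process
mpiprocs=$((ncpus / omp_threads))   # N3 = N2 / threads
echo "-l select=1:ncpus=${ncpus}:mpiprocs=${mpiprocs}"
# -> -l select=1:ncpus=24:mpiprocs=4
```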

About the select statement

Page 24

– The qstat command is used to display the status of jobs, queues, and batch servers.

Job confirmation

uv /home/sgise2> qstat
Job id             Name             User       Time Use  S  Queue
-----------------  ---------------  ---------  --------  -  ------
3627.altix-uv      STDIN            s1520207   316:42:1  R  SMALL
3629.altix-uv      STDIN            s1520207   84:22:43  R  SMALL
3630.altix-uv      sym-62           s1420207   00:00:37  R  SINGLE
3793.altix-uv      sym-62           s1420207   100:35:3  R  SINGLE
3854.altix-uv      translate_full   s1520207   366:04:0  R  APPLI

Description of default job status columns:

• Job id: the job identifier.

• Name: job name.

• User: username of the job owner.

• Time Use: CPU time used.

• S: job status.

• Queue: the queue in which the job resides.

The job's state:

• Q: job is queued.

• R: job is running.

• E: job is exiting after having run.

• S: job is suspended.

Page 25

Option          Description
-u user_name    If a destination is given, status for jobs at that destination owned by users in
                user_name is displayed. If a job_identifier is given, status information for that
                job is displayed regardless of the job's ownership.
-f (JOBID)      Full display. Job, queue, or server attributes are displayed one to a line.
-Q              Display queue status in default format.
-s              Any comment added by the administrator or scheduler is shown on the line below
                the basic information.

Job confirmation

• qstat main options

Page 26

Job confirmation

• If a destination is given, information for running or suspended jobs at that destination is displayed. Use "qstat -f <job ID>".

• Display queue status in default format. Use "qstat -Q".

uv /home/sgise2> qstat -Q
Queue            Max   Tot  Ena  Str   Que   Run   Hld   Wat   Trn   Ext Type
---------------- ----- ---- ---- ---- ----- ----- ----- ----- ----- ----- ----
LONG-L               0    0  yes  yes     0     0     0     0     0     0 Exec
SINGLE              32   19  yes  yes     0    19     0     0     0     0 Exec
SMALL               16    7  yes  yes     0     7     0     0     0     0 Exec
MEDIUM               4    0  yes  yes     0     0     0     0     0     0 Exec
LARGE                2    0  yes  yes     0     0     0     0     0     0 Exec
XLARGE               1    0  yes  yes     0     0     0     0     0     0 Exec
APPLI               16    3  yes  yes     0     3     0     0     0     0 Exec
LONG-S               4    0  yes  yes     0     0     0     0     0     0 Exec
LONG-M               2    0  yes  yes     0     0     0     0     0     0 Exec
TINY                 0    0  yes  yes     0     0     0     0     0     0 Exec

• Max: Maximum number of jobs allowed to run concurrently in the queue.

• Tot: Total number of jobs in the queue.

• Ena: Whether the queue is enabled or disabled.

• Str: Whether the queue is started or stopped.

• Que: Number of queued jobs.

• Run: Number of running jobs.

• Hld: Number of held jobs.

• Wat: Number of waiting jobs.

• Trn: Number of jobs being moved (transiting).

• Ext: Number of exiting jobs.

• Type: Type of queue: execution or routing.

Page 27

– The qdel command deletes jobs in the order given. A PBS job may be deleted by its owner only.

$ qdel <JOBID>

Cancel job

Page 28

– dplace command

dplace is used to bind a related set of processes to specific CPUs or nodes to prevent process migration. It is recommended to use the dplace command when executing parallel programs.

Note that the options differ between automatic-parallelization/OpenMP programs and MPI programs. The numbers are not related to the number of CPUs.

By default, memory is allocated to a process on the node that the process is executing on. If a process moves from node to node during its lifetime, a higher percentage of memory references will be to remote nodes. Remote accesses typically have higher access times, and process performance may suffer.

Process layout

Page 29

– Serial Program.

Process layout

$ dplace ./a.out

• Auto-parallelized code / OpenMP code (for programs compiled with Intel compiler 17.0.0 or earlier)

Specify the -x2 option:

$ export OMP_NUM_THREADS=8

$ export KMP_AFFINITY=disabled

$ dplace -x2 ./a.out

• From Intel compiler 17.0.1, the behavior changed so that the management thread is no longer spawned; therefore the -x2 option is unnecessary.

• The export command is a bash-family command. In the case of csh, use "setenv OMP_NUM_THREADS 8" instead.

• In PBS batch jobs, OMP_NUM_THREADS is automatically set to the number of CPUs specified with the qsub command.

Page 30

– SGI MPT code

Specify the -s1 option:

Process layout

$ mpiexec_mpt -np 8 dplace -s1 ./a.out

※In PBS batch jobs, please use mpiexec_mpt.

The number of MPI processes (the -np option) is automatically set from the mpiprocs value given to the qsub command.

Page 31

Option            Description
-x <skip_mask>    Provides the ability to skip placement of processes. <skip_mask> is a bitmask.
                  If bit N of <skip_mask> is set, then the (N+1)th process that is forked is not
                  placed. For example, setting the mask to 6 prevents the 2nd and 3rd processes
                  from being placed: the first process (the process named by the <command>) is
                  assigned to the first CPU, the second and third processes are not placed, the
                  fourth process is assigned to the second CPU, and so on. This option is useful
                  for certain classes of threaded applications that spawn a few helper processes
                  that typically do not use much CPU time. (Hint: Intel OpenMP applications
                  currently should be placed using -x2. This could change in future versions of
                  OpenMP.)
-s <skip_count>   Skip the first <skip_count> processes before starting to place processes onto
                  CPUs. This option is useful if the first <skip_count> processes are "shepherd"
                  processes used only for launching the application. If <skip_count> is not
                  specified, a default value of 0 is used.

Process layout

• The -x and -s options specify processes that are not to be bound to a CPU.

Page 32

– omplace command (available only with SGI MPT)

The omplace command causes the successive threads in a threaded or in a hybrid MPI/threaded job to be pinned to unique CPUs. The CPUs are assigned in order from the effective CPU list within the contained cpuset. This command is layered on dplace, and can be easier to use with MPI application launch commands because it hides the details associated with process skip counts, nested MPI and OpenMP processes and threads, and complex CPU lists.

Process layout

$ export OMP_NUM_THREADS=4
$ export KMP_AFFINITY=disabled
$ mpiexec_mpt -np 8 omplace -nt $OMP_NUM_THREADS ./a.out

※-np specifies the number of MPI processes; -nt specifies the number of threads per MPI process.

Page 33

Creating a job script

Page 34

– Non-parallel code

Options to the qsub command are the lines beginning with "#PBS" (lines 2 to 6).

(A #PBS line is interpreted as a comment line by the shell, but as an option line by the qsub command.)

Creating a job script

1. #!/bin/bash
2. #PBS -q TINY
3. #PBS -l select=1:ncpus=1
4. #PBS -N serial_JOB
5. #PBS -o serial_out_file
6. #PBS -j oe
7. source /etc/profile.d/modules.sh
8. module load PrgEnv-intel
9. cd ${PBS_O_WORKDIR}
10. dplace ./a.out

line 1: the script is to be interpreted and run by the bash shell.

line 2: name of the queue.

line 3: use a select statement to specify the number of nodes and cores (the job submission options).

line 4: name of the job.

line 5: path of the standard output file.

line 6: merge the standard error stream into standard output.

line 7: initial configuration for using the module command. If you use csh, use "modules.csh" instead.

line 8: load the Intel compiler environment with the module command. (*Lines 7 and 8 are not required for SGI MPT.)

line 9: move to the directory with the executable file.

line 10: execution of the program.

Non-parallel programs are executed with the dplace command with no options.

Page 35

– Auto-parallelized code / OpenMP code

The degree of parallelism (the OMP_NUM_THREADS environment variable) is set to the number of CPUs specified in "-l select ... ncpus".

Creating a job script

1. #!/bin/bash
2. #PBS -q TINY
3. #PBS -l select=1:ncpus=6
4. #PBS -N omp_JOB
5. #PBS -o omp_out_file
6. #PBS -j oe
7. source /etc/profile.d/modules.sh
8. module load PrgEnv-intel
9. export OMP_NUM_THREADS=8
10. export KMP_AFFINITY=disabled
11. cd ${PBS_O_WORKDIR}
12. dplace -x2 ./a.out   (or dplace ./a.out)

line 3: specifies the number of CPUs to use (the degree of parallelism).

* Lines 7 and 8 are not required for SGI MPT.

line 9: specify the degree of OpenMP parallelism. If you use csh, "setenv OMP_NUM_THREADS 8". This environment variable is mandatory.

line 10: disable the affinity handling of the Intel compiler. If you use csh, "setenv KMP_AFFINITY disabled".

line 11: move to the directory with the executable file.

line 12: execution of the program.

Note: for automatic parallelization with Intel compiler 17.0.0 or earlier and for OpenMP parallel programs, specify the dplace -x2 option.

Page 36

– Example of a job using the /work area

– The UV3000 has a work area (/work) for jobs with heavy I/O. Make a directory named after your account in /work and use it for data input and output during execution. Please do not leave data in the work area after the job finishes; it may be removed without notice.

The following is an example script for running a program. If the program can specify the output location directly, the copy step is not required.

Creating a job script

1. #!/bin/bash
2. #PBS -q TINY
3. #PBS -l select=1:ncpus=6
4. #PBS -N smp_JOB
5. #PBS -o smp_out_file
6. #PBS -j oe
7. source /etc/profile.d/modules.sh
8. module load PrgEnv-intel
9. export OMP_NUM_THREADS=8
10. export KMP_AFFINITY=disabled
11. cp -rp ${PBS_O_WORKDIR}/program /work/userID/
12. cd /work/userID/program
13. dplace -x2 ./a.out   (or dplace ./a.out)

line 2: name of the queue.

* Lines 7 and 8 are not required for SGI MPT.

line 11: copy the directory containing the executable file to the /work area.

line 12: move to the directory with the executable file.

line 13: execution of the program.

For automatic parallelization with Intel compiler 17.0.0 or earlier and for OpenMP parallel programs, specify the dplace -x2 option.

Page 37

– MPI code (using SGI MPT)

Please specify the number of CPUs with the "-l select=...:ncpus=...:mpiprocs=..." options.

Creating a job script

1. #!/bin/bash
2. #PBS -q GEN
3. #PBS -l select=2:ncpus=4:mpiprocs=4
4. #PBS -N sgimpt_JOB
5. #PBS -o sgimpt_out_file
6. #PBS -j oe
7. cd ${PBS_O_WORKDIR}
8. mpiexec_mpt dplace -s1 ./a.out

line 3: specifies the number of CPUs to use (parallelism) and the number of MPI processes.

line 8: use the mpiexec_mpt command to run MPI parallel jobs. The -np option is set automatically by PBS (from mpiprocs).

Note: SGI MPT programs are executed with the dplace command with the -s1 option.

Page 38

– SGI MPT hybrid (MPI + OpenMP) code

Please specify the number of CPUs with the "-l select=...:ncpus=...:mpiprocs=..." options.

Creating a job script

1. #!/bin/bash
2. #PBS -q SMALL
3. #PBS -l select=4:ncpus=6:mpiprocs=2
4. #PBS -N hybrid_JOB
5. #PBS -o hybrid_out_file
6. #PBS -j oe
7. export OMP_NUM_THREADS=2
8. export KMP_AFFINITY=disabled
9. cd ${PBS_O_WORKDIR}
10. mpiexec_mpt omplace -nt ${OMP_NUM_THREADS} ./a.out

line 3: specifies the number of CPUs to use (parallelism) and the number of MPI processes.

line 7: specify the degree of OpenMP parallelism. If you use csh, "setenv OMP_NUM_THREADS 2". This environment variable is mandatory.

line 10: use the omplace and mpiexec_mpt commands to run hybrid parallel jobs.
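A quick sanity check of the resource request in this script: the total core and MPI rank counts follow from the select line (select x ncpus and select x mpiprocs):

```shell
select=4; ncpus=6; mpiprocs=2; omp_threads=2
echo "total cores reserved: $((select * ncpus))"      # -> 24
echo "total MPI ranks:      $((select * mpiprocs))"   # -> 8
echo "busy cores per node:  $((mpiprocs * omp_threads)) of ${ncpus}"
```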

Page 39

Compiler Options

Page 40

– Intel compiler command

– icc (C/C++)

– icpc (C++)

– ifort (Fortran)

– Display the compiler options

– icc -help

– icc -help [category]

– If a category is specified, that category of compiler options is displayed.

– Display the compiler version information.

– icc -V

– Example

– icc [options] file1.c [file2.c ...]

Compiler Command

Page 41

– Serial Code

$ icc -O3 prog.c (compile)

$ dplace ./a.out (execute)

– OpenMP Code

$ icc -O3 -qopenmp prog_omp.c (compile)

$ setenv KMP_AFFINITY disabled

$ setenv OMP_NUM_THREADS 4 (set the number of threads to use during execution)

$ dplace ./a.out (execute)

– MPI Code

$ icc -O3 prog_mpi.c -lmpi (compile)

$ mpiexec_mpt -np 4 dplace -s1 ./a.out (execute)

– Hybrid (MPI+OpenMP) Code

$ icc -O3 -qopenmp prog_hyb.c -lmpi (compile)

$ setenv KMP_AFFINITY disabled

$ setenv OMP_NUM_THREADS 4 (set the number of threads to use during execution)

$ mpiexec_mpt -np 4 omplace -nt ${OMP_NUM_THREADS} ./a.out (execute)

*MPI option: -lmpi should be added at the end of the compile line.

Compile and Execute – C/C++


– Key compiler options by default

Option Description

Optimization level: -O2 Optimization for higher performance.

Generate optimized code specialized for the Intel processor: -msse2 Generates Intel SSE2 and SSE instructions for Intel Xeon processors.

– Recommended options

Option Description

Optimization level: -O3 Performs -O2 optimizations and enables more aggressive loop transformations such as fusion and block-unroll-and-jam.

Generate optimized code specialized for the Intel processor: -xCORE-AVX2 Generates Intel AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for the Intel Xeon E5 v3 processor.

Recommended compiler options


Option Description

-O0 Disables all optimizations. Used for debugging.

-O1

•Enables global optimization

•Disables inlining of some instructions.

This optimization level may improve performance for applications with very large code size, many branches,

and execution time not dominated by code within loops.

-O2

If the optimization level is not specified, this optimization level is enabled by default. The option enables:

• Inlining of intrinsics

• Intra-file interprocedural optimizations

Inlining, constant propagation, forward substitution, routine attribute propagation, variable

address-taken analysis. etc.

• The following capabilities for performance gain

• Loop unrolling, dead-code elimination, global instruction scheduling and control speculation,

exception handling optimization, etc.

-O3

Performs -O2 optimizations and enables more aggressive loop transformations such as fusion,
block-unroll-and-jam, and collapsing of IF statements.

The O3 option is recommended for applications that have loops that heavily use floating point calculations and

process large data sets.

When -O3 is used with -axCORE-AVX2 or -xCORE-AVX2, the compiler performs more aggressive data

dependency analysis than for -O2, which may result in longer compilation times.

-fast The -fast option is a macro option which enables the -ipo, -O3, -no-prec-div, -static, -fp-model fast=2, and -xHOST options.

※Because the -fast option includes -static, use the -Bdynamic option to link dynamically when only a dynamic library is available.

Optimization options


Optimization options

Option Description

-xprocessor Tells the compiler to generate optimized code specialized for the Intel
processor that executes your program.

-axprocessor Tells the compiler to generate multiple, processor-specific auto-dispatch
code paths for Intel processors if there is a performance benefit.

-vec Enables or disables vectorization.

-no-prec-div Enables optimizations that give slightly less precise results than full IEEE
division.

-no-prec-sqrt Enables a faster but less precise implementation of square root.


– -axprocessor, -xprocessor

Generates optimized code specialized for the Intel processor.

Optimization options for a specific Intel processor

Processor Generates optimized code specialized for the Intel processor

HOST Generates instructions for the highest instruction set available on the
compilation host processor.

CORE-AVX2 Generates optimized code for the Intel Xeon E5 v3 processor family and enables
AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions.

SSE4.2 Generates optimized code for Westmere-EX (Intel Xeon E7-8800 family) and
enables SSE4.2, SSE4, SSSE3, SSE3, SSE2, and SSE instructions.


– Interprocedural Options

Optimization Options

Option Description

-ip Enables single-file interprocedural optimization (inline function expansion,
constant propagation, etc.). It may substantially improve compiler optimization.

-ipo Enables multi-file interprocedural optimization (between files). It is important to
compile the entire application, with all related source files together, when you
specify -ipo.


– Floating Point Operation

Optimization Options

Option Description

-ftz Flushes denormal results of floating-point underflow to zero.
Every optimization level except -O0 sets -ftz.
If this option produces undesirable changes in the numerical behavior of your program,
you can turn Flush-To-Zero mode off with the -no-ftz option.

-fltconsistency Enables improved floating-point consistency.

-fp-model keyword

Controls the semantics of floating-point calculations.

Keyword
precise: Disables optimizations that are not value-safe on floating-point data.
fast[=1|2]: Enables more aggressive optimizations on floating-point data.
strict: Enables precise and except, disables contractions, and enables pragma stdc fenv_access.
source: Rounds intermediate results to source-defined precision.
double: Rounds intermediate results to 53-bit (double) precision.
extended: Rounds intermediate results to 64-bit (extended) precision.
[no-]except: Determines whether strict floating-point exception semantics are honored.


Alias Options

Option Description

-falias (default)
-fno-alias
-ffnalias (default)
-fno-fnalias

Specifies that aliasing should not be assumed in the program (-fno-alias) or within
functions (-fno-fnalias). If there is no aliasing, the compiler can optimize more
aggressively. This especially affects C/C++ codes.

If you can rewrite the source code, you can instead use the restrict keyword (C99)
on the pointers in question.

[Figure: with no aliasing, the regions accessed through pointers p and q are disjoint; with aliasing, the p and q access regions overlap.]


Optimization Report

Option Description

-qopt-report [=n] Generate an optimization report. Indicates the level of detail in the report. You can

specify values 0 through 5. If you specify zero, no report is generated.

-qopt-report-file=name Specifies the filename to hold the optimization report. If you specify stderr, the

output goes to stderr. If you specify stdout, the output goes to stdout.

-qopt-report-routine=name Generate reports on the routines containing the specified name.

-qopt-report-phase=name Generates reports for the optimizer you specify in phase.

-qopt-report-help Displays the optimizer phases available for report generation.

Phase Description

cg The phase for code generation

ipo The phase for Interprocedural

Optimization

loop The phase for loop optimization

openmp The phase of OpenMP

* Optimizer Phases

Phase Description

pgo The phase of Profile Guided Optimization

tcollect The phase for trace collection

vec The phase for vectorization

all All optimizer phases. This is the default if you do not specify list.


Option Description

-static Prevents linking with shared libraries.

-Bstatic Enables static linking of user’s library.

-Bdynamic Enables dynamic linking libraries at run time.

-shared-intel Causes Intel-provided libraries to be linked in dynamically.

-static-intel Causes Intel-provided libraries to be linked in statically.

Linking Options


–The Intel® Compiler deals with 32-bit and 64-bit memory models differently.

–Intel®64 memory model
– small (default): Tells the compiler to restrict code and data to the
first 2GB of address space.

– medium (-mcmodel=medium): Tells the compiler to restrict code to the first 2GB; it places no memory restriction on data.

– large (-mcmodel=large): Places no memory restriction on code or data.

–When you specify -mcmodel=medium or -mcmodel=large, you must also specify the compiler option -shared-intel.

Intel®64 Memory Model


Numerical Library


– Feature

– Scientific Technical Computing Library

– Optimized for the Intel Processor

– Multi-threading

–Thread parallel

–Thread safe

– Runtime auto processor detection

– C and Fortran Interface

Intel Math Kernel Library (MKL)


– The Intel Math Kernel Library contains the following functions.

– BLAS

– BLACS

– LAPACK

– ScaLAPACK

– PBLAS

– Sparse Solver

– Vector Math Library (VML)

– Vector Statistical Library (VSL)

– Conventional DFTs and Cluster DFTs

– Etc.

Intel Math Kernel Library (MKL)


–Linking

Intel Math Kernel Library (MKL)

Serial: $ icc $CFLAGS -o test test.c -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

Thread parallel: $ icc $CFLAGS -o test test.c -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5

Serial: $ icc $CFLAGS -o test test.c -mkl=sequential

Thread parallel: $ icc $CFLAGS -o test test.c -mkl=parallel

The Intel compiler can link MKL with the "-mkl" option.


–BLACS and/or ScaLAPACK

Intel Math Kernel Library (MKL)

Serial: $ icc -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core example1.c -lmpi

Thread parallel: $ icc -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 example1.c -lmpi

Serial: $ icc -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 \
    -mkl=sequential example1.c -lmpi

Thread parallel: $ icc -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 \
    -mkl=parallel example1.c -lmpi

The Intel compiler can link MKL with the "-mkl" option.

SGI MPT

In the case of Intel MPI, the compiler command and the BLACS library option are different.
*MPI option: -lmpi should be added at the end.


– Cautions for the thread-parallel version of MKL.

Intel Math Kernel Library (MKL)

Serial execution: Set the environment variable OMP_NUM_THREADS=1 or link the serial version of MKL.

Thread parallel execution: Set the environment variable OMP_NUM_THREADS. When MKL functions are used in an OpenMP code, the code runs with the number of threads defined by OMP_NUM_THREADS. If you want to run the MKL functions with a different number of threads than OMP_NUM_THREADS defines, you must set MKL_NUM_THREADS in addition to OMP_NUM_THREADS.

*When an MKL function is called inside an OpenMP parallel region, MKL threading is disabled by default, because OpenMP nested parallelism is disabled. If you want to enable nested parallelism, set the environment variable OMP_NESTED=yes.

MPI execution: If you want to run with MPI parallelism only, set the environment variable OMP_NUM_THREADS=1 or link the serial version of MKL so that MKL does not spawn parallel threads.

Hybrid execution: If you want to run with both MPI and thread parallelism, set the environment variables OMP_NUM_THREADS and/or MKL_NUM_THREADS.


First-Touch Policy & Data Placement


– SGI UV3000 is a NUMA architecture, and data is placed in memory under a first-touch policy.

– A physical page is placed on the memory node near the processor that first accesses the data. This is the "first touch" policy.

– There is "local" and "remote" memory in a NUMA architecture.

– To achieve high performance, data must be placed in the memory local to the processor which uses it.

– It is therefore important which core each process is placed on (the dplace / omplace commands).

First Touch Policy

[Figure: First Touch Policy overview on SGI UV3000. Four UV blades, each with two CPUs, a HUB, and two 128GB memories, are connected by a NUMAlink router. Memory on a process's own blade is "local memory"; memory on other blades is "remote memory".]


– All data is allocated with a "first-touch" policy (in system page units).

– The initialization loop, if executed serially, will grab pages from a single node.

– In the parallel loop, multiple processors then access that one node's memory, so access to the single node becomes a bottleneck.

First-touch Policy

for( i=0; i<N; ++i ){      /* executed serially: one thread first-touches every page */
    a[i]=0.0;
    b[i]=(double)i/2.0;
    c[i]=(double)i/3.0;
    d[i]=(double)i/7.0;
}
#pragma omp parallel for   /* parallelized: all threads access memory on a single node */
for( i=0; i<N; ++i ){
    a[i] = b[i] + c[i] + d[i];
}

[Figure: two UV blades connected by a NUMAlink router. Because the serial initialization loop first-touches every page, all data is allocated on one node, and the parallelized compute loop bottlenecks on access to that node.]


– Perform the initialization in parallel.

– Data is then allocated to local memory under the first-touch policy.

– Data is distributed naturally, so there is minimal data exchange between nodes and performance improves.

First-touch Policy

#pragma omp parallel for shared(a, b, c, d)
for( i=0; i<N; ++i ){      /* executed in parallel: each thread first-touches its own pages */
    a[i]=0.0;
    b[i]=(double)i/2.0;
    c[i]=(double)i/3.0;
    d[i]=(double)i/7.0;
}
#pragma omp parallel for shared(a, b, c, d)
for( i=0; i<N; ++i ){
    a[i] = b[i] + c[i] + d[i];
}

[Figure: two UV blades connected by a NUMAlink router. With the parallelized initialization loop, data is allocated to each thread's local memory and is distributed naturally across the nodes.]


Debugger


Debugger (Serial Code, OpenMP code)

– gdb (GNU Debugger)
– Linux standard debugger
– Multi-thread (OpenMP, pthread)

(Examples)
– Core file analysis
% gdb ./a.out core
(gdb) where
(gdb) w

– Run a program under gdb
% gdb ./a.out
(gdb) run

– Attach to a running process
% gdb a.out [process id]


– The environment variable MPI_SLAVE_DEBUG_ATTACH specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH=N, the MPI process with rank N prints a message during program setup and sleeps for 20 seconds.

– Specifies the MPI process to be debugged.

$ setenv MPI_SLAVE_DEBUG_ATTACH 0 (specify the rank0)

$ mpirun -np 4 ./a.out

MPI rank 0 sleeping for 20 seconds while you attach the debugger.

You can use this debugger command:

gdb /proc/26071/exe 26071

or

idb -pid 26071 /proc/26071/exe

– In another window, attach to the target process from debugger.

– $ gdb /proc/26071/exe 26071

– (gdb) cont

Debugger (MPI code)


Exercise:Compile and Execute


– Login to UV3000.

– After login to UV3000, ”mpi/PrgEnv-intel_sgi” module is already loaded by default.

– “mpi/PrgEnv-intel_sgi” includes Intel Compiler 17.1 and SGI MPT.

– Copy and extract the training file, prepare the working directory.

Login to UV3000

$ ssh -l login-name uv

$ module list
Currently Loaded Modulefiles:
  1) mpi/PrgEnv-intel_sgi

$ cp /work/Samples/Seminar/training_2019_11.tar.gz .
$ tar zxvf training_2019_11.tar.gz
$ cd training_2019_11
$ ls


– In this training, we use UV3000 with an interactive job. Submit an interactive job with 4 sockets (24 cores).

– When the interactive job starts, the following messages are shown and you can use 4 sockets (24 cores) of UV3000.

– Change directory to the current directory in which you performed “qsub” command.

Interactive Job

$ qsub -I -q TINY -l select=1:ncpus=24

qsub: waiting for job 4248.altix-uv to start
qsub: job 4248.altix-uv ready

$ cd ${PBS_O_WORKDIR}


We use “Himeno Benchmark code in C” in this training.

Dr. Ryutaro Himeno, director of the Advanced Center for Computing and Communication, developed this benchmark to evaluate the performance of incompressible fluid analysis code. It measures the speed of the major loops in solving Poisson's equation with the Jacobi iteration method. The result of this benchmark is reported in MFLOPS.

FLOPS is an acronym for "FLoating-point Operations Per Second". Higher FLOPS means higher floating-point operation performance.

Test Code

RIKEN Advanced Center for Computing and Communication Himeno Benchmark(http://accc.riken.jp/supercom/himenobmt/)


Compile the serial code (Himeno benchmark, static allocation, in C) with the following command. ("-DLARGE" sets the grid size to L.)

Execute “himeno.serial”.

Check the result of “himeno.serial”.

Compile (Serial Code)

$ icc -o himeno.serial -DLARGE himenoBMTxps.c

$ dplace ./himeno.serial

Loop executed for 348 times
Gosa : 7.323683e-04
MFLOPS measured : 6903.578631  cpu : 56.392519
Score based on Pentium III 600MHz : 84.189983

Check the value of MFLOPS.


Try the Optimization Level (-O3) and the special optimization option for Intel Xeon E5 v3(Haswell) (-xCORE-AVX2).

Compile with “-O3” option.

Additionally, compile with “-xCORE-AVX2” option.

Run “himeno.serial” and check the results(MFLOPS).

Compile (Serial Code)

$ icc -o himeno.serial -DLARGE -O3 himenoBMTxps.c

$ icc -o himeno.serial -DLARGE -O3 -xCORE-AVX2 himenoBMTxps.c


Exercise:Auto-parallelization and OpenMP


– Auto-parallelization

– Overview

– Compiler options

– Compile and execution

– Performance test

– OpenMP

– Overview

– OpenMP pragma

– Compiler options

– Add pragma

– Compile and execution

– Performance test

Procedure


–Auto-parallelization by the Intel Compiler

–Generates multi-threaded code

–Combined with compiler optimizations

–Enabled by a compiler option only

–Diagnostic information available

(A multi-threaded source code is not created.)

Auto-parallelization (Overview)


-parallel Tells the auto-parallelizer to generate multithreaded code for loops that can be safely

executed in parallel. To use this option, you must also specify option -O2 or -O3.

-par-threshold[n]

Sets a threshold for the auto-parallelization of loops.

n=0: loops get auto-parallelized always, regardless of computation work volume.

n=100: loops get auto-parallelized when performance gains are predicted based on the

compiler analysis data. Loops get auto-parallelized only if profitable parallel execution is

almost certain.

The intermediate values 1 to 99 represent the percentage probability of profitable speed-up.

For example, n=50 directs the compiler to parallelize only if there is a 50% probability of

the code speeding up when executed in parallel.

-qopt-report=n

-qopt-report-phase=par

-qopt-report-file=stdout

Controls the diagnostic information reported by the auto-parallelizer. The diagnostic

information is not output by default.

n=1: reports which loops were parallelized.

n=2: Generates level 1 details, and reports which loops were not parallelized along with a short reason.

n=3:Generates level 2 details, and prints the memory locations that are categorized as private, shared,

reduction, etc..

n=4: For this phase, this is the same as specifying level 3.

n=5: Generates level 4 details, and dependency edges that inhibit parallelization.

Auto-parallelization (Compiler Options)


Compile the serial code (Himeno benchmark, static allocation, in C) with the auto-parallelization option.

The auto-parallelization report is shown with -qopt-report=1 -qopt-report-phase=par -qopt-report-file=stdout.

Auto-parallelization (Compile)

$ icc -o himeno.par -parallel -DLARGE himenoBMTxps.c -qopt-report=1 -qopt-report-phase=par -qopt-report-file=stdout

Begin optimization report for: jacobi(int)

Report from: Auto-parallelization optimizations [par]

LOOP BEGIN at himenoBMTxps.c(195,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at himenoBMTxps.c(198,5)
remark #17109: LOOP WAS AUTO-PARALLELIZED
LOOP BEGIN at himenoBMTxps.c(199,7)
remark #25460: No loop optimizations reported


The Loop of line 198 was auto-parallelized.

AUTO-PARALLELIZED LOOP

189 float
190 jacobi(int nn)
191 {
192   int i,j,k,n;
193   float gosa, s0, ss;
194
195   for(n=0 ; n<nn ; ++n){
196     gosa = 0.0;
197
198     for(i=1 ; i<imax-1 ; i++)
199       for(j=1 ; j<jmax-1 ; j++)
200         for(k=1 ; k<kmax-1 ; k++){
201           s0 = a[0][i][j][k] * p[i+1][j ][k ]
202              + a[1][i][j][k] * p[i ][j+1][k ]
203              + a[2][i][j][k] * p[i ][j ][k+1]
204              + b[0][i][j][k] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]

The Loop of line 198.

LOOP BEGIN at himenoBMTxps.c(198,5)

remark #17109: LOOP WAS AUTO-PARALLELIZED


Specify the number of threads with OMP_NUM_THREADS, execute “himeno.par” (Don’t forget KMP_AFFINITY=disabled)

Check the result (with 6 threads)

Auto-parallelization (Execution)

Loop executed for 871 times
Gosa : 6.077909e-04
MFLOPS measured : 16907.951967  cpu : 57.629344
Score based on Pentium III 600MHz : 206.194536

$ setenv OMP_NUM_THREADS 6
$ setenv KMP_AFFINITY disabled
$ dplace ./himeno.par


– Check the performance with 1, 2, 4, 6, 12, and 24 threads.

Auto-parallelization (Performance Test)

# threads Performance [MFLOPS]

1

2

4

6

12

24


Parallelize by OpenMP pragma

OpenMP Overview

#pragma omp parallel for shared(A, B, C)
for ( i = 1 ; i < 10000 ; i++ ) {
    A[i] = B[i] + C[i-1] + C[i+1];
}

Key OpenMP pragmas
・PARALLEL { …… }

・PARALLEL FOR, PARALLEL FOR REDUCTION(+: …)

・MASTER

・CRITICAL

・BARRIER


#include <stdio.h>
int main(void)
{
    #pragma omp parallel      /* parallel region */
    {
        #pragma omp critical
        printf("hello, world\n");
    }
}

–PARALLEL pragma
#pragma omp parallel [clause...]

– A parallel region is a block of code that will be executed by multiple threads.


$ icc -qopenmp -qopt-report=1 -qopt-report-phase=openmp -qopt-report-file=stdout hello.c

Intel(R) Advisor can now assist with vectorization and show optimization

report messages with your source code.

See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Begin optimization report for: main(void)

Report from: OpenMP optimizations [openmp]

hello.c(5:3-5:3):OMP:main: OpenMP DEFINED REGION WAS PARALLELIZED

===========================================================================$

$ setenv OMP_NUM_THREADS 4

$ setenv KMP_AFFINITY disabled

$ dplace ./a.out

hello, world

hello, world

hello, world

hello, world

$

hello, world (example)

[Figure: execution flow. Start: the master thread executes the serial portion of the code; slave threads are created at the parallel region; each thread executes the printf; a barrier joins the threads; the master thread resumes execution after the parallel region. End.]


・for pragma
#pragma omp for [clause...]

– Inside a parallel region, the iterations of the loop immediately following the pragma are divided among the threads of the team and executed in parallel.

– The default schedule is "STATIC", which means the iterations are divided evenly (if possible) and contiguously among the threads.

Work sharing of the for loop

[Figure: work sharing of a loop of length N (i = 1, 2, ..., N) among 4 threads; threads 0-3 each receive a contiguous chunk of N/4 iterations.]


void daxpy(int n, double c, double *x, double *y)
{
    int i;
    #pragma omp parallel for private(i) shared(c,x,y)
    for(i = 0 ; i < n ; i++) {
        y[i] = y[i] + c * x[i];
    }
}

Work sharing of for loop

–parallel for pragma
– parallel pragma + for pragma

– Create the parallel region and divide the for loop.


– Each variable in the parallel region or the divided loop must be either

– independent in each thread, or

– shared by all threads.

These variables are specified with the data scope clauses.

– The data scope clauses are used as an option of the parallel region or for pragma.

#pragma omp parallel for private(i) shared(n, c, x, y)

Data Scoping

shared clause / private clause


– shared clause

– The variable specified in shared clause exists in only one memory location and all threads can read or write to the address.

– The shared object is the same object as the master thread's.

shared and private

[Figure: with shared(n, c, x, y), the master thread's variables n, c, x, and y are each a single object accessed by all threads.]


–private clause

– The variable specified in the private clause is created as a new object in each thread.

– The private object is unrelated to the master thread's.

shared and private

[Figure: with private(i), each thread receives its own independent copy of i, unrelated to the master thread's variables.]


Reduction operation

Reduction Operation

for (i = 1; i < 10000; i++) {
    S = S + A[i];
}

Thread 0:
for (i = 1; i < 5000; i++) {
    S = S + A[i];
}

Thread 1:
for (i = 5000; i < 10000; i++) {
    S = S + A[i];
}

If the variable S is shared, the result is wrong, because threads 0 and 1 write into S at the same time.

When S is local (private), each thread computes only a partial sum. How do we then obtain the total?

Page 88

Reduction operation

Reduction Operation

#pragma omp parallel for reduction(+:S)
for (i = 1; i < 10000; i++) {
    S = S + A[i];
}

Thread 0:
for (i = 1; i < 5000; i++) {
    S0 = S0 + A[i];
}

Thread 1:
for (i = 5000; i < 10000; i++) {
    S1 = S1 + A[i];
}

S = S + S0 + S1

The result of a reduction may not match the serial result: because the order of the operations differs, the rounding error differs. When the number of threads is changed, the answer may change slightly.

Page 89

– reduction clause

– The reduction operation is the contracting of an array into a scalar variable by some operation.

– Reduction clause format

#pragma omp for reduction(op : var)

var is a comma-delimited list of reduction variables. op is one of the following.

operators (C/C++): +, *, -, &, |, ^, &&, || (Fortran: +, *, -, .AND., .OR., .EQV., .NEQV.)

intrinsics (C/C++): max, min (Fortran: MAX, MIN, IAND, IOR, IEOR)

– Variables that appear in a REDUCTION clause must be SHARED in the enclosing context. A private copy of each variable in the list is created for each thread, as if the PRIVATE clause had been used. The private copy is initialized according to the operator.

– op = +, - : initial value 0

– op = * : initial value 1

– op = max : smallest representable number

– op = min : largest representable number

Reduction Operation

Page 90

-qopenmp
Enables the parallelizer to generate multi-threaded code based on OpenMP pragmas.

-qopt-report=n

-qopt-report-phase=openmp

-qopt-report-file=stdout

Controls the OpenMP parallelizer diagnostic messages. The diagnostic messages are not output by default.

n=1: Reports loops, regions, sections, and tasks successfully parallelized.

n=2: Generates level 1 details, plus messages indicating successful handling of master constructs, single constructs, critical constructs, ordered constructs, atomic pragmas, and so forth.

OpenMP (Compiler Options)

Page 91

The hotspot of the Himeno benchmark code (static allocation in C) is the loop at line 198.

Add OpenMP pragma

189 float
190 jacobi(int nn)
191 {
192   int i,j,k,n;
193   float gosa, s0, ss;
194
195   for(n=0 ; n<nn ; ++n){
196     gosa = 0.0;
197
198     for(i=1 ; i<imax-1 ; i++)
199       for(j=1 ; j<jmax-1 ; j++)
200         for(k=1 ; k<kmax-1 ; k++){
201           s0 = a[0][i][j][k] * p[i+1][j  ][k  ]
202              + a[1][i][j][k] * p[i  ][j+1][k  ]
203              + a[2][i][j][k] * p[i  ][j  ][k+1]
204              + b[0][i][j][k] * ( p[i+1][j+1][k  ] - p[i+1][j-1][k  ]
205                                - p[i-1][j+1][k  ] + p[i-1][j-1][k  ] )
206              + b[1][i][j][k] * ( p[i  ][j+1][k+1] - p[i  ][j-1][k+1]
207                                - p[i  ][j+1][k-1] + p[i  ][j-1][k-1] )
208              + b[2][i][j][k] * ( p[i+1][j  ][k+1] - p[i-1][j  ][k+1]
209                                - p[i+1][j  ][k-1] + p[i-1][j  ][k-1] )
210              + c[0][i][j][k] * p[i-1][j  ][k  ]
211              + c[1][i][j][k] * p[i  ][j-1][k  ]
212              + c[2][i][j][k] * p[i  ][j  ][k-1]
213              + wrk1[i][j][k];
214
215           ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
216
217           gosa += ss*ss;
218           /* gosa = (gosa > ss*ss) ? a : b; */
219
220           wrk2[i][j][k] = p[i][j][k] + omega * ss;
221         }
222
223     for(i=1 ; i<imax-1 ; ++i)
224       for(j=1 ; j<jmax-1 ; ++j)
225         for(k=1 ; k<kmax-1 ; ++k)
226           p[i][j][k] = wrk2[i][j][k];
227
228   } /* end n loop */
229
230   return(gosa);
231 }

Page 92

Parallelize the loop at line 198 with "#pragma omp parallel for".

– Add "#pragma omp parallel for".

– Specify the private variables. (There is no need to specify shared variables, because variables are shared by default.)

– Use a reduction clause for the variable "gosa", because gosa is a sum of "ss".

Add OpenMP pragma

189 float
190 jacobi(int nn)
191 {
192   int i,j,k,n;
193   float gosa, s0, ss;
194
195   for(n=0 ; n<nn ; ++n){
196     gosa = 0.0;
197
    #pragma omp parallel for reduction(+:gosa) private(i, j, k, s0, ss)
198     for(i=1 ; i<imax-1 ; i++)
199       for(j=1 ; j<jmax-1 ; j++)
200         for(k=1 ; k<kmax-1 ; k++){
201           s0 = a[0][i][j][k] * p[i+1][j  ][k  ]
202              + a[1][i][j][k] * p[i  ][j+1][k  ]
203              + a[2][i][j][k] * p[i  ][j  ][k+1]
204              + b[0][i][j][k] * ( p[i+1][j+1][k  ] - p[i+1][j-1][k  ]
205                                - p[i-1][j+1][k  ] + p[i-1][j-1][k  ] )
206              + b[1][i][j][k] * ( p[i  ][j+1][k+1] - p[i  ][j-1][k+1]
207                                - p[i  ][j+1][k-1] + p[i  ][j-1][k-1] )
208              + b[2][i][j][k] * ( p[i+1][j  ][k+1] - p[i-1][j  ][k+1]
209                                - p[i+1][j  ][k-1] + p[i-1][j  ][k-1] )
210              + c[0][i][j][k] * p[i-1][j  ][k  ]
211              + c[1][i][j][k] * p[i  ][j-1][k  ]
212              + c[2][i][j][k] * p[i  ][j  ][k-1]
213              + wrk1[i][j][k];
214
215           ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
216
217           gosa += ss*ss;
218           /* gosa = (gosa > ss*ss) ? a : b; */
219
220           wrk2[i][j][k] = p[i][j][k] + omega * ss;
221         }
222
223     for(i=1 ; i<imax-1 ; ++i)
224       for(j=1 ; j<jmax-1 ; ++j)
225         for(k=1 ; k<kmax-1 ; ++k)
226           p[i][j][k] = wrk2[i][j][k];
227
228   } /* end n loop */
229
230   return(gosa);
231 }

Page 93

Create a parallel region that includes the loops at lines 198 and 223. Each loop is parallelized with "#pragma omp for". Save this modification as "himenoBMTxps_omp01.c".

Add OpenMP pragma

189 float
190 jacobi(int nn)
191 {
192   int i,j,k,n;
193   float gosa, s0, ss;
194
195   for(n=0 ; n<nn ; ++n){
196     gosa = 0.0;
197
    #pragma omp parallel private(i, j, k, s0, ss)
    {
    #pragma omp for reduction(+:gosa)
198     for(i=1 ; i<imax-1 ; i++)
199       for(j=1 ; j<jmax-1 ; j++)
200         for(k=1 ; k<kmax-1 ; k++){
201           s0 = a[0][i][j][k] * p[i+1][j  ][k  ]
202              + a[1][i][j][k] * p[i  ][j+1][k  ]
203              + a[2][i][j][k] * p[i  ][j  ][k+1]
204              + b[0][i][j][k] * ( p[i+1][j+1][k  ] - p[i+1][j-1][k  ]
205                                - p[i-1][j+1][k  ] + p[i-1][j-1][k  ] )
206              + b[1][i][j][k] * ( p[i  ][j+1][k+1] - p[i  ][j-1][k+1]
207                                - p[i  ][j+1][k-1] + p[i  ][j-1][k-1] )
208              + b[2][i][j][k] * ( p[i+1][j  ][k+1] - p[i-1][j  ][k+1]
209                                - p[i+1][j  ][k-1] + p[i-1][j  ][k-1] )
210              + c[0][i][j][k] * p[i-1][j  ][k  ]
211              + c[1][i][j][k] * p[i  ][j-1][k  ]
212              + c[2][i][j][k] * p[i  ][j  ][k-1]
213              + wrk1[i][j][k];
214
215           ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
216
217           gosa += ss*ss;
218           /* gosa = (gosa > ss*ss) ? a : b; */
219
220           wrk2[i][j][k] = p[i][j][k] + omega * ss;
221         }
222
    #pragma omp for
223     for(i=1 ; i<imax-1 ; ++i)
224       for(j=1 ; j<jmax-1 ; ++j)
225         for(k=1 ; k<kmax-1 ; ++k)
226           p[i][j][k] = wrk2[i][j][k];
    }
227
228   } /* end n loop */
229
230   return(gosa);
231 }

Page 94

Compile the OpenMP code (himenoBMTxps_omp01.c).

The OpenMP report is generated by the -qopt-report=1 -qopt-report-phase=openmp -qopt-report-file=stdout options.

Specify the number of threads with OMP_NUM_THREADS, then execute "himeno.omp01". (Don't forget KMP_AFFINITY=disabled.)

OpenMP (Compiler and Execute)

$ icc -o himeno.omp01 -qopenmp -DLARGE himenoBMTxps_omp01.c -qopt-report=1 -qopt-report-phase=openmp -qopt-report-file=stdout

Begin optimization report for: jacobi(int)

Report from: OpenMP optimizations [openmp]

himenoBMTxps_omp01.c(198:1-198:1):OMP:jacobi: OpenMP DEFINED REGION WAS PARALLELIZED

$ setenv OMP_NUM_THREADS 6
$ setenv KMP_AFFINITY disabled
$ dplace ./himeno.omp01

Page 95

– Check the performance of himenoBMTxps_omp01.c with 1, 2, 4, 6, 12, and 24 threads.

OpenMP (Performance Test)

# threads | Performance [MFLOPS]
----------|---------------------
        1 |
        2 |
        4 |
        6 |
       12 |
       24 |

Page 96

In himenoBMTxps_omp01.c, we parallelized only the jacobi function.

Because the UV3000 has a NUMA architecture, data are placed in memory according to the first-touch policy. Since the initialization of the arrays is performed serially in this code, all arrays are placed on the memory of the CPU socket on which the initialization ran.

If we use more than 2 sockets, performance is bottlenecked by remote memory access. So we also need to parallelize the initmt function, which performs the initialization of the arrays.

Add more OpenMP pragma

Page 97

Parallelize the loops at lines 152 and 170. Save this modification as "himenoBMTxps_omp02.c".

Add more OpenMP pragma

147 void
148 initmt()
149 {
150   int i,j,k;
151
    #pragma omp parallel for private(i,j,k)
152   for(i=0 ; i<MIMAX ; i++)
153     for(j=0 ; j<MJMAX ; j++)
154       for(k=0 ; k<MKMAX ; k++){
155         a[0][i][j][k]=0.0;
156         a[1][i][j][k]=0.0;
157         a[2][i][j][k]=0.0;
158         a[3][i][j][k]=0.0;
159         b[0][i][j][k]=0.0;
160         b[1][i][j][k]=0.0;
161         b[2][i][j][k]=0.0;
162         c[0][i][j][k]=0.0;
163         c[1][i][j][k]=0.0;
164         c[2][i][j][k]=0.0;
165         p[i][j][k]=0.0;
166         wrk1[i][j][k]=0.0;
167         bnd[i][j][k]=0.0;
168       }
169
    #pragma omp parallel for private(i,j,k)
170   for(i=0 ; i<imax ; i++)
171     for(j=0 ; j<jmax ; j++)
172       for(k=0 ; k<kmax ; k++){
173         a[0][i][j][k]=1.0;
174         a[1][i][j][k]=1.0;
175         a[2][i][j][k]=1.0;
176         a[3][i][j][k]=1.0/6.0;
177         b[0][i][j][k]=0.0;
178         b[1][i][j][k]=0.0;
179         b[2][i][j][k]=0.0;
180         c[0][i][j][k]=1.0;
181         c[1][i][j][k]=1.0;
182         c[2][i][j][k]=1.0;
183         p[i][j][k]=(float)(i*i)/(float)((imax-1)*(imax-1));
184         wrk1[i][j][k]=0.0;
185         bnd[i][j][k]=1.0;
186       }
187 }

Page 98

– Check the performance of himenoBMTxps_omp02.c with 1, 2, 4, 6, 12, and 24 threads.

Compare these results with those of himenoBMTxps_omp01.c.

OpenMP (Performance Test)

# threads | Performance [MFLOPS]
----------|---------------------
        1 |
        2 |
        4 |
        6 |
       12 |
       24 |

Page 99

Ref.: How to use the vi editor

– start

$ vi file-name

– mode

– The vi editor has a command mode and an insert mode.

– Command mode: ESC key

– Insert mode: i key

– exit

– Exit the vi editor from command mode.

– Exit after saving the file: :wq

– Exit without saving the file: :q!

Page 100

Ref.: How to use the vi editor

Operation            | Command | Description
---------------------|---------|--------------------------------------------
Go to insert mode    | i       | Insert text at the cursor position
Go to command mode   | Esc     |
Move the cursor      | → ( l ) | Right
                     | ← ( h ) | Left
                     | ↑ ( k ) | Up
                     | ↓ ( j ) | Down
Delete characters    | x       | Delete a character
                     | dd      | Delete a line (= cut a line)
Cut / Copy / Paste   | yy      | Copy a line
                     | dd      | Cut a line
                     | p       | Paste
Search               | /string | Forward search
                     | ?string | Backward search
                     | n       | Repeat the search in the same direction
                     | N       | Repeat the search in the opposite direction
Save / Exit          | :q!     | Exit without saving the file
                     | :wq     | Exit after saving the file
                     | ZZ      | Exit after saving the file
                     | :w      | Save (overwrite) without exiting

Page 101

Thank you