
Page 1

Confidential © Copyright 2018 Hewlett Packard Enterprise Development LP. The information contained herein is subject to change without notice.

SGI UV3000 Parallel Programming and Optimization

Nov 1st, 2019

Japan Advanced Institute of Science and Technology

Hewlett-Packard Japan, Ltd.

Page 2

Confidential ©2017 SGI

1. System summary

2. Submit job

3. Creating a job script

4. Compiler Options

5. Numerical Library

6. First-Touch Policy & Data Placement

7. Debugger

8. Exercise: Compile and Execute

9. Exercise: Auto-parallelization and OpenMP

Contents

Page 3

System summary

Page 4

UV3000

Model: SGI UV3000

Total:
  System   1 (4 racks)
  IRU      16
  Blade    128
  CPU      71.27 TFLOPS, 256 CPUs / 1536 cores
  Memory   32 TB (16 GB x 2048 DIMMs)
  Disk     External disk device, 160 TB (physical), SGI InfiniteStorage 5100
  I/O I/F  Dual-port 16 Gbps FC HBA
  NW I/F   Dual-port 10 GbE

Per blade:
  CPU      Intel Xeon E5-4655 v3 x 2 (6 cores / 2.9 GHz / 30 MB / 9.6 GT/s / 135 W)
  Memory   16 GB DDR4 x 2 DIMMs x 4 channels per CPU, 2133 MT/s
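The peak-performance figure can be cross-checked from the per-core numbers above, assuming 16 double-precision FLOPs per cycle per core (AVX2 FMA on this processor generation; the slide does not state this factor):

```shell
# 1536 cores x 2.9 GHz x 16 DP FLOPs/cycle (assumed AVX2 FMA throughput)
awk 'BEGIN { printf "%.2f TFLOPS\n", 1536 * 2.9e9 * 16 / 1e12 }'
# -> 71.27 TFLOPS, matching the table
```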

Page 5

InfiniteStorage 5100

Model: SGI InfiniteStorage 5100
  System          1
  Chassis         4U enclosure
  Controller      Active/active controllers
  Host interface  4x 16 Gb FC ports
  Cache size      816 GB (16 GB system cache + 800 GB SSD)
  Disk units      4 TB NL-SAS HDD x 42, 800 GB SAS SSD x 2

Page 6

Login Server (uv)

Model: SGI Rackable C2112-GP2
  System   1
  CPU      2x Intel Xeon E5-2667 v3
  Memory   128 GB (16 GB x 8 DIMMs)
  Disk     1 TB SATA HDD x 2
  Network  4x Gigabit Ethernet, 2x 10 Gigabit Ethernet

Page 7

Network

[Diagram: the campus network connects over 10 GbE to the UV3000 login/PMT server and to the NFS server hosting /home; the UV3000 connects to its storage over 16 Gbps FC.]

Page 8

Server Room

Page 9

UV3000 Blade

[Block diagram of a UV3000 blade: two Intel Xeon E5-4655 v3 sockets, each with six cores (core 0 to core 5) sharing a 30 MB L3 cache, linked to each other by QPI; each socket drives eight DDR4 DIMMs (16 GB each in this system); both sockets connect through QPI to the UV Hub, which provides the NUMAlink6 (NL6) channels to the rest of the system.]

Page 10

Topology

This drawing shows 1/8 of the topology. Each vertex of the cube is connected to two router blades.

Page 11

– OS SUSE Linux Enterprise Server 12

– SGI® Performance Suite

– SGI Accelerate

– SGI MPT

– Intel® Parallel Studio XE 2016 update2, 2017 update1

– PGI compiler 2017

– Gaussian 16

Software

Page 12

– Use the UV3000 system via the login server (uv).

Login to the login server

hostname: uv.jaist.ac.jp

• You can log in to the login server with ssh; file transfer is available via scp.

$ ssh -l <userID> uv
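A convenience sketch (not from the slides): an entry in ~/.ssh/config on your local machine shortens both the ssh and scp command lines. The user name here is a placeholder.

```
# ~/.ssh/config on the client side ("userID" is a placeholder)
Host uv
    HostName uv.jaist.ac.jp
    User userID

# Login and file transfer then become:
#   ssh uv
#   scp results.dat uv:~/
```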

Page 13

– Each user is assigned a home directory. The home directory resides on the file server and is shared with each system via NFS. It is the working area when you log in to each system.

– The UV3000 (altix-uv) has a local work directory (/work). We recommend using it because it has good I/O performance.

Note, however, that the work area is not backed up. Please keep data at your own risk.

– The work directory is NFS-mounted on the login server (uv).

Storage of UV3000

Page 14

– module is a user interface to the Modules package. The Modules package provides for the dynamic modification of the user's environment.

– mpi/PrgEnv-intel_sgi is loaded by default.

– mpi/PrgEnv-intel_sgi is the environment in which the Intel compiler and SGI MPT can be used.

– List all available modulefiles in the current environment:

$ module avail

– Load modulefile(s) into the shell environment:

$ module load mpi/PrgEnv-intel_sgi

– If you use a non-standard environment in a batch script, you need to perform the initial configuration of the module command in the batch script:

source /etc/profile.d/modules.sh

module load mpi/PrgEnv-intel

– In the case of csh, use "/etc/profile.d/modules.csh" instead.

Switching programming environment

Page 15

Main Modulefile

Module name                                         Compiler                        MPI
Intel/16.0.2                                        Intel Parallel Studio XE 2016   X
Intel/17.0.1                                        Intel Parallel Studio XE 2017   X
mpi/PrgEnv-intel-2016.2                             Intel Parallel Studio XE 2016   Intel MPI
mpi/PrgEnv-intel, mpi/PrgEnv-intel-2017.1           Intel Parallel Studio XE 2017   Intel MPI
mpi/PrgEnv-intel-2016.2_sgi                         Intel Parallel Studio XE 2016   SGI MPT
mpi/PrgEnv-intel_sgi, mpi/PrgEnv-intel-2017.1_sgi   Intel Parallel Studio XE 2017   SGI MPT
pgi/17.1                                            PGI compiler 17.1               X
pgi/PrgEnv-pgi                                      PGI compiler 17.1               OpenMPI

Page 16

– PBS Professional (PBS) has been introduced as the job management system. There are two ways to run a job: one is interactive mode, the other is to use a script. For how to create a job script file, please refer to "Creating a job script".

– Please specify the execution conditions when you submit jobs.

PBS Professional

queue    vnode    core     memory       wall time   executions (per user)   priority
TINY     1-4      1-24     ~512GB       6 hours     - (1)                   160
SINGLE   1-2      1-12     128-256GB    1 week      32 (16)                 150
SMALL    4-8      12-48    256GB-1TB    1 week      16 (6)                  130
MEDIUM   8-32     48-192   1TB-4TB      3 days      4 (1)                   90
LARGE    32-64    192-384  4TB-8TB      2 days      2 (1)                   70
XLARGE   64-128   384-768  4TB-16TB     2 days      1 (1)                   30
APPLI    1-2      1-12     128GB-256GB  3 weeks     16 (6)                  110
LONG-S   1-8      12-48    256GB-1TB    2 weeks     3 (1)                   110
LONG-M   8-32     48-192   1TB-4TB      1 week      1 (1)                   90
LONG-L   32-96    192-576  4TB-12TB     5 days      1 (1)                   30

Page 17

– The UV3000 system can be used as a single SMP machine with up to 32 TB of memory and 256 sockets (1536 cores).

*SMP (Symmetric Multi-Processing) is a multi-processor architecture in which processing is distributed across multiple CPUs that are all treated equally for parallel processing. It is also translated as "symmetric multi-processor".

A single process can access up to 32 TB of memory. (On JAIST's PC cluster, unless you use a parallel program, one process can use at most 64 GB of memory.) Simply put, think of it as a personal computer that runs 1536 cores and 32 TB of memory under one OS.

To use the full 32 TB of memory, you must apply to the administrator.
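To put the figures above in perspective, the average memory per core follows directly from the totals (integer arithmetic, rounded down):

```shell
# 32 TB of shared memory across 1536 cores
cores=1536
mem_gb=$((32 * 1024))
echo "$((mem_gb / cores)) GB per core on average"
# -> 21 GB per core on average
```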

– It is possible to perform calculations using large-scale memory.

– It is possible to automatically parallelize existing source code by using the -parallel option of the Intel compiler.

– Well-suited calculations:

– Calculations using large memory

– Programs that use automatic parallelization or OpenMP

– Weak point:

– Calculations with heavy disk I/O

(The directly connected work disk (/work) is available, but I/O performance is lower because it is shared by the entire OS.)

– Notes on executing programs:

– Avoid large output to standard output. You may experience a system malfunction.

About the program to be executed

Page 18

Submit job

Page 19

– The qsub command is used to submit a batch job to PBS. <job script> is a script for job submission.

$ qsub <job script>

In the following example, the script is in the home directory and is submitted from there with the qsub command. If you submit from a different location, please modify the userID and directory accordingly. There is a sample program in /work/Samples.

Submit job (batch)

$ cat test_prog.sh
#!/bin/bash
#PBS -q TINY
#PBS -l select=1:ncpus=1:mem=1gb
#PBS -N sample_JOB
#PBS -j oe
cd ${PBS_O_WORKDIR}
dplace ./sample.out

---Submit job---
$ qsub test_prog.sh
---Finish---
$ cat /user1/xxx/userID/directory/sample_JOB.ojobID

Page 20

– Jobs can also be run interactively.

$ qsub -I

When the job submission is successful, a job ID is assigned. The session ends with the exit command.

Submit job (Interactive)

$ qsub -I -l select=1:ncpus=4
qsub: waiting for job 4107.altix-uv to start
qsub: job 4107.altix-uv ready

altix-uv /home/sgise2>
altix-uv /home/sgise2> cd /work/sgise2
altix-uv /work/sgise2> ./mathprogram1
INTEGRAL[ 0.1 0.9: 100000000 STEPS]
altix-uv /work/sgise2> exit
logout

qsub: job 4107.altix-uv completed

Page 21

qsub options

Option            Description
-q queue_name     Job is submitted to the named queue at the default server. If not specified, the
                  job is submitted to the default queue (TINY).
-N name           Sets the job's Job_Name attribute to name. If no script is used, the job's name is
                  "STDIN". (A string up to 236 characters in length; it must consist of an alphabetic
                  or numeric character followed by printable, non-whitespace characters.)
-a date_time      Point in time after which the job is eligible for execution, given in pairs of
                  digits. Sets the job's Execution_Time attribute to date_time.
                  Format: [[[[CC]YY]MM]DD]hhmm[.SS]
-j oe | eo        Whether and how to join the job's standard error and standard output streams.
                  Sets the job's Join_Path attribute.
                  oe: standard error is merged into standard output.
                  eo: standard output is merged into standard error.
-o path_name      Path to be used for the job's standard output stream. Sets the job's Output_Path
                  attribute. If -o is not specified, PBS copies the standard output to the current
                  working directory where qsub was executed, using the default filename
                  <job name>.o<sequence number>.
-e path_name      Path to be used for the job's standard error stream. Sets the job's Error_Path
                  attribute. If -e is not specified, PBS copies the standard error to the current
                  working directory where qsub was executed, using the default filename
                  <job name>.e<sequence number>.
-m mail_option    The set of conditions under which mail about the job is sent. Sets the job's
                  Mail_Points attribute. The argument can be either "n" or any combination of
                  a, b, and e.
                  n: no mail is sent.
                  a: mail is sent when the job is aborted by the batch system.
                  b: mail is sent when the job begins execution.
                  e: mail is sent when the job terminates.
-M mail_address   List of users to whom mail about the job is sent.
-l keyword=value  Requests resources and specifies job placement. Sets the job's Resource_List
                  attribute.
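For example, with the defaults above, a job named sample_JOB that received sequence number 4107 (an illustrative value) would produce output files named as follows:

```shell
jobname="sample_JOB"
seq=4107
echo "${jobname}.o${seq}"   # default standard-output filename
echo "${jobname}.e${seq}"   # default standard-error filename (when -j is not used)
```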

Page 22

UV3000 creates virtual nodes with PBS. Jobs are structured to execute using virtual nodes.

In the case of the JAIST UV3000, the 256-CPU (1536-core), 32 TB system is divided into 256 virtual nodes.

Virtual nodes are specified in the select statement.

There are two ways to specify virtual nodes: by the number of virtual nodes, or by the number of cores to be used.

The following is an example of executing MPI using 24 cores.

Specifying the number of virtual nodes:

qsub -l select=4:ncpus=6:mpiprocs=6

Specifying the number of cores:

qsub -l select=1:ncpus=24:mpiprocs=24

About virtual nodes

Page 23

Use a select statement to specify the number of nodes and the number of cores.

Write the job submission options as:

-l select=N1:ncpus=N2

In the case of an MPI job, it looks like the following:

-l select=N1:ncpus=N2:mpiprocs=N3

N1: the number of nodes.

N2: the number of cores in one compute node.

N3: the number of MPI processes in one compute node.

In the case of a pure MPI job, N3 = N2.

In the case of a hybrid MPI/OpenMP job, N3 = N2 / (number of OpenMP threads).

If the values change from node to node, the format is as follows (connect with +):

-l select=N1:ncpus=N2:mpiprocs=N3+M1:ncpus=M2:mpiprocs=M3
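The N3 rule above can be checked with a line of shell arithmetic. For a hybrid job on 24-core virtual nodes with 6 OpenMP threads per MPI process (illustrative values):

```shell
ncpus=24          # cores per virtual node (N2)
omp_threads=6     # OpenMP threads per MPI process
mpiprocs=$((ncpus / omp_threads))   # N3 = N2 / threads
echo "-l select=1:ncpus=${ncpus}:mpiprocs=${mpiprocs}"
# -> -l select=1:ncpus=24:mpiprocs=4
```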

About the select statement

Page 24

– The qstat command is used to display the status of jobs, queues, and batch servers.

Job confirmation

uv /home/sgise2> qstat
Job id             Name             User       Time Use  S  Queue
-----------------  ---------------  ---------  --------  -  ------
3627.altix-uv      STDIN            s1520207   316:42:1  R  SMALL
3629.altix-uv      STDIN            s1520207   84:22:43  R  SMALL
3630.altix-uv      sym-62           s1420207   00:00:37  R  SINGLE
3793.altix-uv      sym-62           s1420207   100:35:3  R  SINGLE
3854.altix-uv      translate_full   s1520207   366:04:0  R  APPLI

Description of default job status columns:

• Job id: the job identifier.

• Name: job name.

• User: username of the job owner.

• Time Use: CPU time used.

• S: job status.

• Queue: the queue in which the job resides.

The job's state:

• Q: job is queued.

• R: job is running.

• E: job is exiting after having run.

• S: job is suspended.

Page 25

Option          Description
-u user_name    If a destination is given, status for jobs at that destination owned by users in
                user_name is displayed. If a job_identifier is given, status information for that
                job is displayed regardless of the job's ownership.
-f (JOBID)      Full display. Job, queue, or server attributes are displayed one to a line.
-Q              Display queue status in default format.
-s              Any comment added by the administrator or scheduler is shown on the line below
                the basic information.

Job confirmation

• qstat main options

Page 26

Job confirmation

• If a destination is given, information for running or suspended jobs at that destination is displayed. Use "qstat -f <job ID>".

• Display queue status in default format. Use "qstat -Q".

uv /home/sgise2> qstat -Q
Queue            Max   Tot  Ena  Str   Que   Run   Hld   Wat   Trn   Ext Type
---------------- ----- ---- ---- ---- ----- ----- ----- ----- ----- ----- ----
LONG-L               0    0  yes  yes     0     0     0     0     0     0 Exec
SINGLE              32   19  yes  yes     0    19     0     0     0     0 Exec
SMALL               16    7  yes  yes     0     7     0     0     0     0 Exec
MEDIUM               4    0  yes  yes     0     0     0     0     0     0 Exec
LARGE                2    0  yes  yes     0     0     0     0     0     0 Exec
XLARGE               1    0  yes  yes     0     0     0     0     0     0 Exec
APPLI               16    3  yes  yes     0     3     0     0     0     0 Exec
LONG-S               4    0  yes  yes     0     0     0     0     0     0 Exec
LONG-M               2    0  yes  yes     0     0     0     0     0     0 Exec
TINY                 0    0  yes  yes     0     0     0     0     0     0 Exec

• Max: Maximum number of jobs allowed to run concurrently in the queue.

• Tot: Total number of jobs in the queue.

• Ena: Whether the queue is enabled or disabled.

• Str: Whether the queue is started or stopped.

• Que: Number of queued jobs.

• Run: Number of running jobs.

• Hld: Number of held jobs.

• Wat: Number of waiting jobs.

• Trn: Number of jobs being moved (transiting).

• Ext: Number of exiting jobs.

• Type: Type of queue: execution or routing.

Page 27

– The qdel command deletes jobs in the order given. A PBS job may be deleted by its owner only.

$ qdel <JOBID>

Cancel job

Page 28

– dplace command

dplace is used to bind a related set of processes to specific CPUs or nodes to prevent process migration. It is recommended to use the dplace command when executing parallel programs.

Note that the options differ between automatic-parallelization/OpenMP programs and MPI programs. The numbers are not related to the number of CPUs.

By default, memory is allocated to a process on the node that the process is executing on. If a process moves from node to node during its lifetime, a higher percentage of memory references will be to remote nodes. Remote accesses typically have higher access times, and process performance may suffer.

Process layout

Page 29

– Serial Program.

Process layout

$ dplace ./a.out

• Auto-parallelized code / OpenMP code (for programs compiled with Intel compiler 17.0.0 or earlier)

Specify the -x2 option:

$ export OMP_NUM_THREADS=8

$ export KMP_AFFINITY=disabled

$ dplace -x2 ./a.out

• From Intel compiler 17.0.1, the behavior changed so that the management thread is no longer spawned; therefore the -x2 option is unnecessary.

• The export command is a bash-family command. In the case of csh, use "setenv OMP_NUM_THREADS 8" instead.

• In PBS batch jobs, OMP_NUM_THREADS is automatically set to the number of CPUs specified with the qsub command.

Page 30

– SGI MPT code

Specify the -s1 option:

Process layout

$ mpiexec_mpt -np 8 dplace -s1 ./a.out

※In PBS batch jobs, please use mpiexec_mpt.

The number of MPI processes (the -np option) is automatically set from the mpiprocs value given to the qsub command.

Page 31

Option            Description
-x <skip_mask>    Provides the ability to skip placement of processes. <skip_mask> is a bitmask.
                  If bit N of <skip_mask> is set, then the (N+1)th process that is forked is not
                  placed. For example, setting the mask to 6 prevents the 2nd and 3rd processes
                  from being placed: the first process (the process named by the <command>) is
                  assigned to the first CPU, the second and third processes are not placed, the
                  fourth process is assigned to the second CPU, and so on. This option is useful
                  for certain classes of threaded applications that spawn a few helper processes
                  that typically do not use much CPU time. (Hint: Intel OpenMP applications
                  currently should be placed using -x2. This could change in future versions of
                  OpenMP.)
-s <skip_count>   Skip the first <skip_count> processes before starting to place processes onto
                  CPUs. This option is useful if the first <skip_count> processes are "shepherd"
                  processes used only for launching the application. If <skip_count> is not
                  specified, a default value of 0 is used.

Process layout

• The -x and -s options specify processes that are not to be bound to a CPU.

Page 32

– omplace command (available only with SGI MPT)

The omplace command causes the successive threads in a threaded or in a hybrid MPI/threaded job to be pinned to unique CPUs. The CPUs are assigned in order from the effective CPU list within the contained cpuset. This command is layered on dplace, and can be easier to use with MPI application launch commands because it hides the details associated with process skip counts, nested MPI and OpenMP processes and threads, and complex CPU lists.

Process layout

$ export OMP_NUM_THREADS=4
$ export KMP_AFFINITY=disabled
$ mpiexec_mpt -np 8 omplace -nt $OMP_NUM_THREADS ./a.out

※-np specifies the number of MPI processes; -nt specifies the number of threads per MPI process.

Page 33

Creating a job script

Page 34

– Non-parallel code

Options to the qsub command are the lines beginning with "#PBS" (lines 2 to 6).

(A #PBS line is interpreted as a comment line by the shell, but as an option line by the qsub command.)

Creating a job script

1. #!/bin/bash
2. #PBS -q TINY
3. #PBS -l select=1:ncpus=1
4. #PBS -N serial_JOB
5. #PBS -o serial_out_file
6. #PBS -j oe
7. source /etc/profile.d/modules.sh
8. module load PrgEnv-intel
9. cd ${PBS_O_WORKDIR}
10. dplace ./a.out

line 1: the script is to be interpreted and run by the bash shell.

line 2: name of the queue.

line 3: use a select statement to specify the number of nodes and cores (the job submission options).

line 4: name of the job.

line 5: path of the standard output file.

line 6: merge the standard error stream into standard output.

line 7: initial configuration for using the module command. If you use csh, use "modules.csh" instead.

line 8: load the Intel compiler environment with the module command. (*Lines 7 and 8 are not required for SGI MPT.)

line 9: move to the directory with the executable file.

line 10: execution of the program.

Non-parallel programs are executed with the dplace command with no options.

Page 35

– Auto-parallelized code / OpenMP code

The degree of parallelism (the OMP_NUM_THREADS environment variable) is set to the number of CPUs specified in "-l select ... ncpus".

Creating a job script

1. #!/bin/bash
2. #PBS -q TINY
3. #PBS -l select=1:ncpus=6
4. #PBS -N omp_JOB
5. #PBS -o omp_out_file
6. #PBS -j oe
7. source /etc/profile.d/modules.sh
8. module load PrgEnv-intel
9. export OMP_NUM_THREADS=8
10. export KMP_AFFINITY=disabled
11. cd ${PBS_O_WORKDIR}
12. dplace -x2 ./a.out   (or dplace ./a.out)

line 3: specifies the number of CPUs to use (the degree of parallelism).

* Lines 7 and 8 are not required for SGI MPT.

line 9: specify the degree of OpenMP parallelism. If you use csh, "setenv OMP_NUM_THREADS 8". This environment variable is mandatory.

line 10: disable the affinity handling of the Intel compiler. If you use csh, "setenv KMP_AFFINITY disabled".

line 11: move to the directory with the executable file.

line 12: execution of the program.

Note: for automatic parallelization with Intel compiler 17.0.0 or earlier and for OpenMP parallel programs, specify the dplace -x2 option.

Page 36

– Example of a job using the /work area

– The UV3000 has a work area (/work) for jobs with heavy I/O. Make a directory named after your account in /work and use it for data input and output during execution. Please do not leave data in the work area after the job finishes; it may be removed without notice.

The following is an example script for running a program. If the program can specify the output location directly, the copy step is not required.

Creating a job script

1. #!/bin/bash
2. #PBS -q TINY
3. #PBS -l select=1:ncpus=6
4. #PBS -N smp_JOB
5. #PBS -o smp_out_file
6. #PBS -j oe
7. source /etc/profile.d/modules.sh
8. module load PrgEnv-intel
9. export OMP_NUM_THREADS=8
10. export KMP_AFFINITY=disabled
11. cp -rp ${PBS_O_WORKDIR}/program /work/userID/
12. cd /work/userID/program
13. dplace -x2 ./a.out   (or dplace ./a.out)

line 2: name of the queue.

* Lines 7 and 8 are not required for SGI MPT.

line 11: copy the directory containing the executable file to the /work area.

line 12: move to the directory with the executable file.

line 13: execution of the program.

For automatic parallelization with Intel compiler 17.0.0 or earlier and for OpenMP parallel programs, specify the dplace -x2 option.

Page 37

– MPI code (using SGI MPT)

Please specify the number of CPUs with the "-l select=...:ncpus=...:mpiprocs=..." options.

Creating a job script

1. #!/bin/bash
2. #PBS -q GEN
3. #PBS -l select=2:ncpus=4:mpiprocs=4
4. #PBS -N sgimpt_JOB
5. #PBS -o sgimpt_out_file
6. #PBS -j oe
7. cd ${PBS_O_WORKDIR}
8. mpiexec_mpt dplace -s1 ./a.out

line 3: specifies the number of CPUs to use (parallelism) and the number of MPI processes.

line 8: use the mpiexec_mpt command to run MPI parallel jobs. The -np option is set automatically by PBS (from mpiprocs).

Note: SGI MPT programs are executed with the dplace command with the -s1 option.

Page 38

– SGI MPT hybrid (MPI + OpenMP) code

Please specify the number of CPUs with the "-l select=...:ncpus=...:mpiprocs=..." options.

Creating a job script

1. #!/bin/bash
2. #PBS -q SMALL
3. #PBS -l select=4:ncpus=6:mpiprocs=2
4. #PBS -N hybrid_JOB
5. #PBS -o hybrid_out_file
6. #PBS -j oe
7. export OMP_NUM_THREADS=2
8. export KMP_AFFINITY=disabled
9. cd ${PBS_O_WORKDIR}
10. mpiexec_mpt omplace -nt ${OMP_NUM_THREADS} ./a.out

line 3: specifies the number of CPUs to use (parallelism) and the number of MPI processes.

line 7: specify the degree of OpenMP parallelism. If you use csh, "setenv OMP_NUM_THREADS 2". This environment variable is mandatory.

line 10: use the omplace and mpiexec_mpt commands to run hybrid parallel jobs.
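A quick sanity check of the resource request in this script: the total core and MPI rank counts follow from the select line (select x ncpus and select x mpiprocs):

```shell
select=4; ncpus=6; mpiprocs=2; omp_threads=2
echo "total cores reserved: $((select * ncpus))"      # -> 24
echo "total MPI ranks:      $((select * mpiprocs))"   # -> 8
echo "busy cores per node:  $((mpiprocs * omp_threads)) of ${ncpus}"
```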

Page 39

Compiler Options

Page 40

– Intel compiler command

– icc (C/C++)

– icpc (C++)

– ifort (Fortran)

– Display the compiler options

– icc -help

– icc -help [category]

– If a category is specified, that category of compiler options is displayed.

– Display the compiler version information.

– icc -V

– Example

– icc [options] file1.c [file2.c ...]

Compiler Command

Page 41

– Serial Code

$ icc -O3 prog.c (compile)

$ dplace ./a.out (execute)

– OpenMP Code

$ icc -O3 -qopenmp prog_omp.c (compile)

$ setenv KMP_AFFINITY disabled

$ setenv OMP_NUM_THREADS 4 (set the number of threads to use during execution)

$ dplace ./a.out (execute)

– MPI Code

$ icc -O3 prog_mpi.c -lmpi (compile)

$ mpiexec_mpt -np 4 dplace -s1 ./a.out (execute)

– Hybrid (MPI+OpenMP) Code

$ icc -O3 -qopenmp prog_hyb.c -lmpi (compile)

$ setenv KMP_AFFINITY disabled

$ setenv OMP_NUM_THREADS 4 (set the number of threads to use during execution)

$ mpiexec_mpt -np 4 omplace -nt ${OMP_NUM_THREADS} ./a.out (execute)

*MPI option: -lmpi should be added at the end of the compile line.

Compile and Execute – C/C++


– Key compiler options by default

Option Description

Optimization level: -O2 Optimization for higher performance.

Generate optimized code specialized for the Intel processor: -msse2 Generates Intel SSE2 and SSE instructions for Intel Xeon processors.

– Recommended options

Option Description

Optimization level: -O3 Performs -O2 optimizations and enables more aggressive loop transformations such as fusion and block-unroll-and-jam.

Generate optimized code specialized for the Intel processor: -xCORE-AVX2 Generates Intel AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions for the Intel Xeon E5 v3 processor.

Recommended compiler options


Option Description

-O0 Disables all optimizations. Used for debugging.

-O1

•Enables global optimization

•Disables inlining of some instructions.

This optimization level may improve performance for applications with very large code size, many branches,

and execution time not dominated by code within loops.

-O2

If the optimization level is not specified, this optimization level is enabled by default. The option enables:

• Inlining of intrinsics

• Intra-file interprocedural optimizations

Inlining, constant propagation, forward substitution, routine attribute propagation, variable

address-taken analysis. etc.

• The following capabilities for performance gain

• Loop unrolling, dead-code elimination, global instruction scheduling and control speculation,

exception handling optimization, etc.

-O3

Performs -O2 optimizations and enables more aggressive loop transformations such as fusion,
block-unroll-and-jam, and collapsing of IF statements.

The O3 option is recommended for applications that have loops that heavily use floating point calculations and

process large data sets.

When -O3 is used with -axCORE-AVX2 or -xCORE-AVX2, the compiler performs more aggressive data

dependency analysis than for -O2, which may result in longer compilation times.

-fast The -fast option is a macro option which enables the -ipo, -O3, -no-prec-div, -static, -fp-model fast=2, and -xHOST options.

※Because the -fast option includes -static, use the -Bdynamic option to link dynamically when only a dynamic library is available.

Optimization options


Optimization options

Option Description

-xprocessor Tells the compiler to generate optimized code specialized for the Intel
processor that executes your program.

-axprocessor Tells the compiler to generate multiple, processor-specific auto-dispatch
code paths for Intel processors if there is a performance benefit.

-vec Enables or disables vectorization.

-no-prec-div Enables optimizations that give slightly less precise results than full IEEE
division.

-no-prec-sqrt Enables a faster but less precise implementation of square root.


– -axprocessor, -xprocessor

Generates optimized code specialized for the Intel processor.

Optimization options for a specific Intel processor

Processor Generates optimized code specialized for the Intel processor

HOST Generates instructions for the highest instruction set available on the
compilation host processor.

CORE-AVX2 Generates optimized code for the Intel Xeon E5 v3 processor family and enables
AVX2, AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions.

SSE4.2 Generates optimized code for Westmere-EX (Intel Xeon E7-8800 family) and
enables SSE4.2, SSE4, SSSE3, SSE3, SSE2, and SSE instructions.


– Interprocedural Options

Optimization Options

Option Description

-ip Enables single-file interprocedural optimization (inline function expansion,
constant propagation, etc.). It may substantially improve compiler optimization.

-ipo Enables multi-file interprocedural optimization (between files). It is important to
compile the entire application, with all related source files together, when you
specify -ipo.


– Floating Point Operation

Optimization Options

Option Description

-ftz Flushes denormal results of floating-point underflow to zero.
Every optimization level except -O0 sets -ftz.
If this option produces undesirable changes in the numerical behavior of your program,
you can turn Flush-To-Zero mode off with the -no-ftz option.

-fltconsistency Enables improved floating-point consistency.

-fp-model keyword

Controls the semantics of floating-point calculations.

Keyword
precise: Disables optimizations that are not value-safe on floating-point data.
fast[=1|2]: Enables more aggressive optimizations on floating-point data.
strict: Enables precise and except, disables contractions, and enables pragma stdc fenv_access.
source: Rounds intermediate results to source-defined precision.
double: Rounds intermediate results to 53-bit (double) precision.
extended: Rounds intermediate results to 64-bit (extended) precision.
[no-]except: Determines whether strict floating-point exception semantics are honored.


Alias Options

Option Description

-falias (default)
-fno-alias
-ffnalias (default)
-fno-fnalias

Specifies that aliasing should not be assumed in the program (-fno-alias) or within
functions (-fno-fnalias). If there is no aliasing, the compiler can optimize more
aggressively. This especially affects C/C++ codes.

If you can rewrite the source code, you can instead use the restrict keyword (C99)
on the pointers in question.

[Figure: with no aliasing, the regions accessed through pointers p and q are disjoint; with aliasing, the p and q access regions overlap.]


Optimization Report

Option Description

-qopt-report [=n] Generate an optimization report. Indicates the level of detail in the report. You can

specify values 0 through 5. If you specify zero, no report is generated.

-qopt-report-file=name Specifies the filename to hold the optimization report. If you specify stderr, the

output goes to stderr. If you specify stdout, the output goes to stdout.

-qopt-report-routine=name Generate reports on the routines containing the specified name.

-qopt-report-phase=name Generates reports for the optimizer you specify in phase.

-qopt-report-help Displays the optimizer phases available for report generation.

Phase Description

cg The phase for code generation

ipo The phase for Interprocedural

Optimization

loop The phase for loop optimization

openmp The phase of OpenMP

* Optimizer Phases

Phase Description

pgo The phase of Profile Guided Optimization

tcollect The phase for trace collection

vec The phase for vectorization

all All optimizer phases. This is the default if you do not specify list.


Option Description

-static Prevents linking with shared libraries.

-Bstatic Enables static linking of user’s library.

-Bdynamic Enables dynamic linking libraries at run time.

-shared-intel Causes Intel-provided libraries to be linked in dynamically.

-static-intel Causes Intel-provided libraries to be linked in statically.

Linking Options


–The Intel® Compiler deals with 32-bit and 64-bit memory models differently.

–Intel®64 memory model
– small (default): Tells the compiler to restrict code and data to the
first 2GB of address space.

– medium (-mcmodel=medium): Tells the compiler to restrict code to the first 2GB; it places no memory restriction on data.

– large (-mcmodel=large): Places no memory restriction on code or data.

–When you specify -mcmodel=medium or -mcmodel=large, you must also specify the compiler option -shared-intel.

Intel®64 Memory Model


Numerical Library


– Feature

– Scientific Technical Computing Library

– Optimized for the Intel Processor

– Multi-threading

–Thread parallel

–Thread safe

– Runtime auto processor detection

– C and Fortran Interface

Intel Math Kernel Library (MKL)


– The Intel Math Kernel Library contains the following functions.

– BLAS

– BLACS

– LAPACK

– ScaLAPACK

– PBLAS

– Sparse Solver

– Vector Math Library (VML)

– Vector Statistical Library (VSL)

– Conventional DFTs and Cluster DFTs

– Etc.

Intel Math Kernel Library (MKL)


–Linking

Intel Math Kernel Library (MKL)

Serial: $ icc $CFLAGS -o test test.c -lmkl_intel_lp64 -lmkl_sequential -lmkl_core

Thread parallel: $ icc $CFLAGS -o test test.c -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5

Serial: $ icc $CFLAGS -o test test.c -mkl=sequential

Thread parallel: $ icc $CFLAGS -o test test.c -mkl=parallel

The Intel compiler can link MKL with the "-mkl" option.


–BLACS and/or ScaLAPACK

Intel Math Kernel Library (MKL)

Serial: $ icc -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 \
    -lmkl_intel_lp64 -lmkl_sequential -lmkl_core example1.c -lmpi

Thread parallel: $ icc -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 \
    -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -liomp5 example1.c -lmpi

Serial: $ icc -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 \
    -mkl=sequential example1.c -lmpi

Thread parallel: $ icc -lmkl_scalapack_lp64 -lmkl_blacs_sgimpt_lp64 \
    -mkl=parallel example1.c -lmpi

The Intel compiler can link MKL with the "-mkl" option.

SGI MPT

In the case of Intel MPI, the compiler command and the BLACS library option are different.
*MPI option: -lmpi should be added at the end.


– Cautions for the thread-parallel version of MKL.

Intel Math Kernel Library (MKL)

Serial execution: Set the environment variable OMP_NUM_THREADS=1 or link the serial version of MKL.

Thread parallel execution: Set the environment variable OMP_NUM_THREADS. When MKL functions are used in an OpenMP code, the code runs with the number of threads defined by OMP_NUM_THREADS. If you want to run the MKL functions with a different number of threads than OMP_NUM_THREADS defines, you must set MKL_NUM_THREADS in addition to OMP_NUM_THREADS.

*When an MKL function is called inside an OpenMP parallel region, MKL threading is disabled by default, because OpenMP nested parallelism is disabled. If you want to enable nested parallelism, set the environment variable OMP_NESTED=yes.

MPI execution: If you want to run with MPI parallelism only, set the environment variable OMP_NUM_THREADS=1 or link the serial version of MKL so that MKL does not spawn parallel threads.

Hybrid execution: If you want to run with both MPI and thread parallelism, set the environment variables OMP_NUM_THREADS and/or MKL_NUM_THREADS.


First-Touch Policy & Data Placement


– SGI UV3000 is a NUMA architecture, and data is placed in memory under a first-touch policy.

– A physical page is placed on the memory node near the processor that first accesses the data. This is the "first touch" policy.

– There is "local" and "remote" memory in a NUMA architecture.

– To achieve high performance, data must be placed in the memory local to the processor which uses it.

– It is therefore important which core each process is placed on (the dplace / omplace commands).

First Touch Policy

[Figure: First Touch Policy overview on SGI UV3000. Four UV blades, each with two CPUs, a HUB, and two 128GB memories, are connected by a NUMAlink router. Memory on a process's own blade is "local memory"; memory on other blades is "remote memory".]


– All data is allocated with a "first-touch" policy (in system page units).

– The initialization loop, if executed serially, will grab pages from a single node.

– In the parallel loop, multiple processors then access that one node's memory, so access to the single node becomes a bottleneck.

First-touch Policy

for( i=0; i<N; ++i ){      /* executed serially: one thread first-touches every page */
    a[i]=0.0;
    b[i]=(double)i/2.0;
    c[i]=(double)i/3.0;
    d[i]=(double)i/7.0;
}
#pragma omp parallel for   /* parallelized: all threads access memory on a single node */
for( i=0; i<N; ++i ){
    a[i] = b[i] + c[i] + d[i];
}

[Figure: two UV blades connected by a NUMAlink router. Because the serial initialization loop first-touches every page, all data is allocated on one node, and the parallelized compute loop bottlenecks on access to that node.]


– Perform the initialization in parallel.

– Data is then allocated to local memory under the first-touch policy.

– Data is distributed naturally, so there is minimal data exchange between nodes and performance improves.

First-touch Policy

#pragma omp parallel for shared(a, b, c, d)
for( i=0; i<N; ++i ){      /* executed in parallel: each thread first-touches its own pages */
    a[i]=0.0;
    b[i]=(double)i/2.0;
    c[i]=(double)i/3.0;
    d[i]=(double)i/7.0;
}
#pragma omp parallel for shared(a, b, c, d)
for( i=0; i<N; ++i ){
    a[i] = b[i] + c[i] + d[i];
}

[Figure: two UV blades connected by a NUMAlink router. With the parallelized initialization loop, data is allocated to each thread's local memory and is distributed naturally across the nodes.]


Debugger


Debugger (Serial Code, OpenMP code)

– gdb (GNU Debugger)
– Linux standard debugger
– Multi-thread (OpenMP, pthread)

(Examples)
– Core file analysis
% gdb ./a.out core
(gdb) where
(gdb) w

– Run a program under gdb
% gdb ./a.out
(gdb) run

– Attach to a running process
% gdb a.out [process id]


– The environment variable MPI_SLAVE_DEBUG_ATTACH specifies the MPI process to be debugged. If you set MPI_SLAVE_DEBUG_ATTACH=N, the MPI process with rank N prints a message during program setup and sleeps for 20 seconds.

– Specifies the MPI process to be debugged.

$ setenv MPI_SLAVE_DEBUG_ATTACH 0 (specify the rank0)

$ mpirun -np 4 ./a.out

MPI rank 0 sleeping for 20 seconds while you attach the debugger.

You can use this debugger command:

gdb /proc/26071/exe 26071

or

idb -pid 26071 /proc/26071/exe

– In another window, attach to the target process from debugger.

– $ gdb /proc/26071/exe 26071

– (gdb) cont

Debugger (MPI code)


Exercise:Compile and Execute


– Login to UV3000.

– After login to UV3000, ”mpi/PrgEnv-intel_sgi” module is already loaded by default.

– “mpi/PrgEnv-intel_sgi” includes Intel Compiler 17.1 and SGI MPT.

– Copy and extract the training file, prepare the working directory.

Login to UV3000

$ ssh -l login-name uv

$ module list
Currently Loaded Modulefiles:
  1) mpi/PrgEnv-intel_sgi

$ cp /work/Samples/Seminar/training_2019_11.tar.gz .
$ tar zxvf training_2019_11.tar.gz
$ cd training_2019_11
$ ls


– In this training, we use UV3000 with an interactive job. Submit an interactive job with 4 sockets (24 cores).

– When the interactive job starts, the following messages are shown and you can use 4 sockets (24 cores) of UV3000.

– Change directory to the current directory in which you performed “qsub” command.

Interactive Job

$ qsub -I -q TINY -l select=1:ncpus=24

qsub: waiting for job 4248.altix-uv to start
qsub: job 4248.altix-uv ready

$ cd ${PBS_O_WORKDIR}


We use “Himeno Benchmark code in C” in this training.

Dr. Ryutaro Himeno, director of the Advanced Center for Computing and Communication, developed this benchmark to evaluate the performance of incompressible fluid analysis code. It measures the speed of the major loops in solving Poisson's equation with the Jacobi iteration method. The result of this benchmark is reported in MFLOPS.

FLOPS is an acronym for "FLoating-point Operations Per Second". Higher FLOPS means higher floating-point operation performance.

Test Code

RIKEN Advanced Center for Computing and Communication Himeno Benchmark(http://accc.riken.jp/supercom/himenobmt/)


Compile the serial code (Himeno benchmark, static allocation, in C) with the following command. ("-DLARGE" sets the grid size to L.)

Execute “himeno.serial”.

Check the result of “himeno.serial”.

Compile (Serial Code)

$ icc -o himeno.serial -DLARGE himenoBMTxps.c

$ dplace ./himeno.serial

Loop executed for 348 times
Gosa : 7.323683e-04
MFLOPS measured : 6903.578631  cpu : 56.392519
Score based on Pentium III 600MHz : 84.189983

Check the value of MFLOPS.


Try the Optimization Level (-O3) and the special optimization option for Intel Xeon E5 v3(Haswell) (-xCORE-AVX2).

Compile with “-O3” option.

Additionally, compile with “-xCORE-AVX2” option.

Run “himeno.serial” and check the results(MFLOPS).

Compile (Serial Code)

$ icc -o himeno.serial -DLARGE -O3 himenoBMTxps.c

$ icc -o himeno.serial -DLARGE -O3 -xCORE-AVX2 himenoBMTxps.c


Exercise:Auto-parallelization and OpenMP


– Auto-parallelization

– Overview

– Compiler options

– Compile and execution

– Performance test

– OpenMP

– Overview

– OpenMP pragma

– Compiler options

– Add pragma

– Compile and execution

– Performance test

Procedure


–Auto-parallelization by the Intel Compiler

–Generates multi-threaded code

–Combined with compiler optimizations

–Enabled by a compiler option only

–Diagnostic information available

(A multi-threaded source code is not created.)

Auto-parallelization (Overview)


-parallel Tells the auto-parallelizer to generate multithreaded code for loops that can be safely

executed in parallel. To use this option, you must also specify option -O2 or -O3.

-par-threshold[n]

Sets a threshold for the auto-parallelization of loops.

n=0: loops get auto-parallelized always, regardless of computation work volume.

n=100: loops get auto-parallelized when performance gains are predicted based on the

compiler analysis data. Loops get auto-parallelized only if profitable parallel execution is

almost certain.

The intermediate values 1 to 99 represent the percentage probability of profitable speed-up.

For example, n=50 directs the compiler to parallelize only if there is a 50% probability of

the code speeding up when executed in parallel.

-qopt-report=n

-qopt-report-phase=par

-qopt-report-file=stdout

Controls the diagnostic information reported by the auto-parallelizer. The diagnostic

information is not output by default.

n=1: reports which loops were parallelized.

n=2: Generates level 1 details, and reports which loops were not parallelized along with a short reason.

n=3:Generates level 2 details, and prints the memory locations that are categorized as private, shared,

reduction, etc..

n=4: For this phase, this is the same as specifying level 3.

n=5: Generates level 4 details, and dependency edges that inhibit parallelization.

Auto-parallelization (Compiler Options)


Compile the serial code (Himeno benchmark, static allocation, in C) with the auto-parallelization option.

The auto-parallelization report is shown with -qopt-report=1 -qopt-report-phase=par -qopt-report-file=stdout.

Auto-parallelization (Compile)

$ icc -o himeno.par -parallel -DLARGE himenoBMTxps.c -qopt-report=1 -qopt-report-phase=par -qopt-report-file=stdout

Begin optimization report for: jacobi(int)

Report from: Auto-parallelization optimizations [par]

LOOP BEGIN at himenoBMTxps.c(195,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at himenoBMTxps.c(198,5)
remark #17109: LOOP WAS AUTO-PARALLELIZED
LOOP BEGIN at himenoBMTxps.c(199,7)
remark #25460: No loop optimizations reported


The Loop of line 198 was auto-parallelized.

AUTO-PARALLELIZED LOOP

189 float
190 jacobi(int nn)
191 {
192   int i,j,k,n;
193   float gosa, s0, ss;
194
195   for(n=0 ; n<nn ; ++n){
196     gosa = 0.0;
197
198     for(i=1 ; i<imax-1 ; i++)
199       for(j=1 ; j<jmax-1 ; j++)
200         for(k=1 ; k<kmax-1 ; k++){
201           s0 = a[0][i][j][k] * p[i+1][j ][k ]
202              + a[1][i][j][k] * p[i ][j+1][k ]
203              + a[2][i][j][k] * p[i ][j ][k+1]
204              + b[0][i][j][k] * ( p[i+1][j+1][k ] - p[i+1][j-1][k ]

The Loop of line 198.

LOOP BEGIN at himenoBMTxps.c(198,5)

remark #17109: LOOP WAS AUTO-PARALLELIZED


Specify the number of threads with OMP_NUM_THREADS, execute “himeno.par” (Don’t forget KMP_AFFINITY=disabled)

Check the result (with 6 threads)

Auto-parallelization (Execution)

Loop executed for 871 times
Gosa : 6.077909e-04
MFLOPS measured : 16907.951967  cpu : 57.629344
Score based on Pentium III 600MHz : 206.194536

$ setenv OMP_NUM_THREADS 6
$ setenv KMP_AFFINITY disabled
$ dplace ./himeno.par


– Check the performance with 1, 2, 4, 6, 12, and 24 threads.

Auto-parallelization (Performance Test)

# threads Performance [MFLOPS]

1

2

4

6

12

24


Parallelize by OpenMP pragma

OpenMP Overview

#pragma omp parallel for shared(A, B, C)
for ( i = 1 ; i < 10000 ; i++ ) {
    A[i] = B[i] + C[i-1] + C[i+1];
}

Key OpenMP pragmas
・PARALLEL { …… }

・PARALLEL FOR, PARALLEL FOR REDUCTION(+: …)

・MASTER

・CRITICAL

・BARRIER


#include <stdio.h>
int main(void)
{
    #pragma omp parallel      /* parallel region */
    {
        #pragma omp critical
        printf("hello, world\n");
    }
}

–PARALLEL pragma
#pragma omp parallel [clause...]

– A parallel region is a block of code that will be executed by multiple threads.


$ icc -qopenmp -qopt-report=1 -qopt-report-phase=openmp -qopt-report-file=stdout hello.c

Intel(R) Advisor can now assist with vectorization and show optimization

report messages with your source code.

See "https://software.intel.com/en-us/intel-advisor-xe" for details.

Begin optimization report for: main(void)

Report from: OpenMP optimizations [openmp]

hello.c(5:3-5:3):OMP:main: OpenMP DEFINED REGION WAS PARALLELIZED

===========================================================================$

$ setenv OMP_NUM_THREADS 4

$ setenv KMP_AFFINITY disabled

$ dplace ./a.out

hello, world

hello, world

hello, world

hello, world

$

hello, world (example)

[Figure: execution flow. Start: the master thread executes the serial portion of the code; slave threads are created at the parallel region; each thread executes the printf; a barrier joins the threads; the master thread resumes execution after the parallel region. End.]


・for pragma
#pragma omp for [clause...]

– Inside a parallel region, the iterations of the loop immediately following the pragma are divided among the threads of the team and executed in parallel.

– The default schedule is "STATIC", which means the iterations are divided evenly (if possible) and contiguously among the threads.

Work sharing of the for loop

[Figure: work sharing of a loop of length N (i = 1, 2, ..., N) among 4 threads; threads 0-3 each receive a contiguous chunk of N/4 iterations.]


void daxpy(int n, double c, double *x, double *y)
{
    int i;
    #pragma omp parallel for private(i) shared(c,x,y)
    for(i = 0 ; i < n ; i++) {
        y[i] = y[i] + c * x[i];
    }
}

Work sharing of for loop

–parallel for pragma
– parallel pragma + for pragma

– Create the parallel region and divide the for loop.


– Each variable in the parallel region or the divided loop must be either

– independent in each thread, or

– shared by all threads.

These variables are specified with the data scope clauses.

– The data scope clauses are used as an option of the parallel region or for pragma.

#pragma omp parallel for private(i) shared(n, c, x, y)

Data Scoping

shared clause / private clause


– shared clause

– The variable specified in shared clause exists in only one memory location and all threads can read or write to the address.

– The shared object is the same object as the master thread's.

shared and private

[Figure: with shared(n, c, x, y), the master thread's variables n, c, x, and y are each a single object accessed by all threads.]


–private clause

– The variable specified in the private clause is created as a new object in each thread.

– The private object is unrelated to the master thread's.

shared and private

[Figure: with private(i), each thread receives its own independent copy of i, unrelated to the master thread's variables.]


Reduction operation

Reduction Operation

for (i = 1; i < 10000; i++) {
    S = S + A[i];
}

Thread 0:
for (i = 1; i < 5000; i++) {
    S = S + A[i];
}

Thread 1:
for (i = 5000; i < 10000; i++) {
    S = S + A[i];
}

If the variable S is shared, the result is wrong, because threads 0 and 1 write into S at the same time.

When S is local (private), each thread computes only a partial sum. How do we then obtain the total?

Page 88

Reduction operation

Reduction Operation

#pragma omp parallel for reduction(+:S)
for (i = 1; i < 10000; i++) {
    S = S + A[i];
}

Thread 0:
for (i = 1; i < 5000; i++) {
    S0 = S0 + A[i];
}

Thread 1:
for (i = 5000; i < 10000; i++) {
    S1 = S1 + A[i];
}

S = S + S0 + S1

The result of a reduction may not match the serial result: because the order of the operations differs, the rounding error differs. When the number of threads is changed, the answer may change slightly.

Page 89

– reduction clause

– The reduction operation is the contracting of an array into a scalar variable by some operation.

– Reduction clause format

#pragma omp for reduction(op : var)

var is a comma-delimited list of reduction variables. op is one of the following.

operators (C/C++): +, *, -, &, |, ^, &&, || (Fortran: +, *, -, .AND., .OR., .EQV., .NEQV.)

intrinsics (C/C++): max, min (Fortran: MAX, MIN, IAND, IOR, IEOR)

– Variables that appear in a REDUCTION clause must be SHARED in the enclosing context. A private copy of each variable in the list is created for each thread, as if the PRIVATE clause had been used. The private copy is initialized according to the operator.

– op = +, - : initial value 0

– op = * : initial value 1

– op = max : smallest representable number

– op = min : largest representable number

Reduction Operation

Page 90

-qopenmp
Enables the parallelizer to generate multi-threaded code based on OpenMP pragmas.

-qopt-report=n

-qopt-report-phase=openmp

-qopt-report-file=stdout

Controls the OpenMP parallelizer diagnostic messages. The diagnostic messages are not output by default.

n=1: Reports loops, regions, sections, and tasks successfully parallelized.

n=2: Generates level 1 details, plus messages indicating successful handling of master constructs, single constructs, critical constructs, ordered constructs, atomic pragmas, and so forth.

OpenMP (Compiler Options)

Page 91

The hotspot of the Himeno benchmark code (static allocation in C) is the loop at line 198.

Add OpenMP pragma

189 float
190 jacobi(int nn)
191 {
192   int i,j,k,n;
193   float gosa, s0, ss;
194
195   for(n=0 ; n<nn ; ++n){
196     gosa = 0.0;
197
198     for(i=1 ; i<imax-1 ; i++)
199       for(j=1 ; j<jmax-1 ; j++)
200         for(k=1 ; k<kmax-1 ; k++){
201           s0 = a[0][i][j][k] * p[i+1][j  ][k  ]
202              + a[1][i][j][k] * p[i  ][j+1][k  ]
203              + a[2][i][j][k] * p[i  ][j  ][k+1]
204              + b[0][i][j][k] * ( p[i+1][j+1][k  ] - p[i+1][j-1][k  ]
205                                - p[i-1][j+1][k  ] + p[i-1][j-1][k  ] )
206              + b[1][i][j][k] * ( p[i  ][j+1][k+1] - p[i  ][j-1][k+1]
207                                - p[i  ][j+1][k-1] + p[i  ][j-1][k-1] )
208              + b[2][i][j][k] * ( p[i+1][j  ][k+1] - p[i-1][j  ][k+1]
209                                - p[i+1][j  ][k-1] + p[i-1][j  ][k-1] )
210              + c[0][i][j][k] * p[i-1][j  ][k  ]
211              + c[1][i][j][k] * p[i  ][j-1][k  ]
212              + c[2][i][j][k] * p[i  ][j  ][k-1]
213              + wrk1[i][j][k];
214
215           ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
216
217           gosa += ss*ss;
218           /* gosa = (gosa > ss*ss) ? a : b; */
219
220           wrk2[i][j][k] = p[i][j][k] + omega * ss;
221         }
222
223     for(i=1 ; i<imax-1 ; ++i)
224       for(j=1 ; j<jmax-1 ; ++j)
225         for(k=1 ; k<kmax-1 ; ++k)
226           p[i][j][k] = wrk2[i][j][k];
227
228   } /* end n loop */
229
230   return(gosa);
231 }

Page 92

Parallelize the loop at line 198 with "#pragma omp parallel for".

– Add "#pragma omp parallel for".

– Specify the private variables. (There is no need to specify shared variables, because variables are shared by default.)

– Use a reduction clause for the variable "gosa", because gosa is a sum of "ss".

Add OpenMP pragma

189 float
190 jacobi(int nn)
191 {
192   int i,j,k,n;
193   float gosa, s0, ss;
194
195   for(n=0 ; n<nn ; ++n){
196     gosa = 0.0;
197
    #pragma omp parallel for reduction(+:gosa) private(i, j, k, s0, ss)
198     for(i=1 ; i<imax-1 ; i++)
199       for(j=1 ; j<jmax-1 ; j++)
200         for(k=1 ; k<kmax-1 ; k++){
201           s0 = a[0][i][j][k] * p[i+1][j  ][k  ]
202              + a[1][i][j][k] * p[i  ][j+1][k  ]
203              + a[2][i][j][k] * p[i  ][j  ][k+1]
204              + b[0][i][j][k] * ( p[i+1][j+1][k  ] - p[i+1][j-1][k  ]
205                                - p[i-1][j+1][k  ] + p[i-1][j-1][k  ] )
206              + b[1][i][j][k] * ( p[i  ][j+1][k+1] - p[i  ][j-1][k+1]
207                                - p[i  ][j+1][k-1] + p[i  ][j-1][k-1] )
208              + b[2][i][j][k] * ( p[i+1][j  ][k+1] - p[i-1][j  ][k+1]
209                                - p[i+1][j  ][k-1] + p[i-1][j  ][k-1] )
210              + c[0][i][j][k] * p[i-1][j  ][k  ]
211              + c[1][i][j][k] * p[i  ][j-1][k  ]
212              + c[2][i][j][k] * p[i  ][j  ][k-1]
213              + wrk1[i][j][k];
214
215           ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
216
217           gosa += ss*ss;
218           /* gosa = (gosa > ss*ss) ? a : b; */
219
220           wrk2[i][j][k] = p[i][j][k] + omega * ss;
221         }
222
223     for(i=1 ; i<imax-1 ; ++i)
224       for(j=1 ; j<jmax-1 ; ++j)
225         for(k=1 ; k<kmax-1 ; ++k)
226           p[i][j][k] = wrk2[i][j][k];
227
228   } /* end n loop */
229
230   return(gosa);
231 }

Page 93

Create a parallel region that includes the loops at lines 198 and 223. Each loop is parallelized with "#pragma omp for". Save this modification as "himenoBMTxps_omp01.c".

Add OpenMP pragma

189 float
190 jacobi(int nn)
191 {
192   int i,j,k,n;
193   float gosa, s0, ss;
194
195   for(n=0 ; n<nn ; ++n){
196     gosa = 0.0;
197
    #pragma omp parallel private(i, j, k, s0, ss)
    {
    #pragma omp for reduction(+:gosa)
198     for(i=1 ; i<imax-1 ; i++)
199       for(j=1 ; j<jmax-1 ; j++)
200         for(k=1 ; k<kmax-1 ; k++){
201           s0 = a[0][i][j][k] * p[i+1][j  ][k  ]
202              + a[1][i][j][k] * p[i  ][j+1][k  ]
203              + a[2][i][j][k] * p[i  ][j  ][k+1]
204              + b[0][i][j][k] * ( p[i+1][j+1][k  ] - p[i+1][j-1][k  ]
205                                - p[i-1][j+1][k  ] + p[i-1][j-1][k  ] )
206              + b[1][i][j][k] * ( p[i  ][j+1][k+1] - p[i  ][j-1][k+1]
207                                - p[i  ][j+1][k-1] + p[i  ][j-1][k-1] )
208              + b[2][i][j][k] * ( p[i+1][j  ][k+1] - p[i-1][j  ][k+1]
209                                - p[i+1][j  ][k-1] + p[i-1][j  ][k-1] )
210              + c[0][i][j][k] * p[i-1][j  ][k  ]
211              + c[1][i][j][k] * p[i  ][j-1][k  ]
212              + c[2][i][j][k] * p[i  ][j  ][k-1]
213              + wrk1[i][j][k];
214
215           ss = ( s0 * a[3][i][j][k] - p[i][j][k] ) * bnd[i][j][k];
216
217           gosa += ss*ss;
218           /* gosa = (gosa > ss*ss) ? a : b; */
219
220           wrk2[i][j][k] = p[i][j][k] + omega * ss;
221         }
222
    #pragma omp for
223     for(i=1 ; i<imax-1 ; ++i)
224       for(j=1 ; j<jmax-1 ; ++j)
225         for(k=1 ; k<kmax-1 ; ++k)
226           p[i][j][k] = wrk2[i][j][k];
    }
227
228   } /* end n loop */
229
230   return(gosa);
231 }

Page 94

Compile the OpenMP code (himenoBMTxps_omp01.c).

The OpenMP report is generated by the -qopt-report=1 -qopt-report-phase=openmp -qopt-report-file=stdout options.

Specify the number of threads with OMP_NUM_THREADS, then execute "himeno.omp01". (Don't forget KMP_AFFINITY=disabled.)

OpenMP (Compiler and Execute)

$ icc -o himeno.omp01 -qopenmp -DLARGE himenoBMTxps_omp01.c -qopt-report=1 -qopt-report-phase=openmp -qopt-report-file=stdout

Begin optimization report for: jacobi(int)

Report from: OpenMP optimizations [openmp]

himenoBMTxps_omp01.c(198:1-198:1):OMP:jacobi: OpenMP DEFINED REGION WAS PARALLELIZED

$ setenv OMP_NUM_THREADS 6
$ setenv KMP_AFFINITY disabled
$ dplace ./himeno.omp01

Page 95

– Check the performance of himenoBMTxps_omp01.c with 1, 2, 4, 6, 12, and 24 threads.

OpenMP (Performance Test)

# threads | Performance [MFLOPS]
----------|---------------------
        1 |
        2 |
        4 |
        6 |
       12 |
       24 |

Page 96

In himenoBMTxps_omp01.c, we parallelized only the jacobi function.

Because the UV3000 has a NUMA architecture, data are placed in memory according to the first-touch policy. Since the initialization of the arrays is performed serially in this code, all arrays are placed on the memory of the CPU socket on which the initialization ran.

If we use more than 2 sockets, performance is bottlenecked by remote memory access. So we also need to parallelize the initmt function, which performs the initialization of the arrays.

Add more OpenMP pragma

Page 97

Parallelize the loops at lines 152 and 170. Save this modification as "himenoBMTxps_omp02.c".

Add more OpenMP pragma

147 void
148 initmt()
149 {
150   int i,j,k;
151
    #pragma omp parallel for private(i,j,k)
152   for(i=0 ; i<MIMAX ; i++)
153     for(j=0 ; j<MJMAX ; j++)
154       for(k=0 ; k<MKMAX ; k++){
155         a[0][i][j][k]=0.0;
156         a[1][i][j][k]=0.0;
157         a[2][i][j][k]=0.0;
158         a[3][i][j][k]=0.0;
159         b[0][i][j][k]=0.0;
160         b[1][i][j][k]=0.0;
161         b[2][i][j][k]=0.0;
162         c[0][i][j][k]=0.0;
163         c[1][i][j][k]=0.0;
164         c[2][i][j][k]=0.0;
165         p[i][j][k]=0.0;
166         wrk1[i][j][k]=0.0;
167         bnd[i][j][k]=0.0;
168       }
169
    #pragma omp parallel for private(i,j,k)
170   for(i=0 ; i<imax ; i++)
171     for(j=0 ; j<jmax ; j++)
172       for(k=0 ; k<kmax ; k++){
173         a[0][i][j][k]=1.0;
174         a[1][i][j][k]=1.0;
175         a[2][i][j][k]=1.0;
176         a[3][i][j][k]=1.0/6.0;
177         b[0][i][j][k]=0.0;
178         b[1][i][j][k]=0.0;
179         b[2][i][j][k]=0.0;
180         c[0][i][j][k]=1.0;
181         c[1][i][j][k]=1.0;
182         c[2][i][j][k]=1.0;
183         p[i][j][k]=(float)(i*i)/(float)((imax-1)*(imax-1));
184         wrk1[i][j][k]=0.0;
185         bnd[i][j][k]=1.0;
186       }
187 }

Page 98

– Check the performance of himenoBMTxps_omp02.c with 1, 2, 4, 6, 12, and 24 threads.

Compare these results with those of himenoBMTxps_omp01.c.

OpenMP (Performance Test)

# threads | Performance [MFLOPS]
----------|---------------------
        1 |
        2 |
        4 |
        6 |
       12 |
       24 |

Page 99

Ref.: How to use the vi editor

– start

$ vi file-name

– mode

– The vi editor has a command mode and an insert mode.

– Command mode: ESC key

– Insert mode: i key

– exit

– Exit the vi editor from command mode.

– Exit after saving the file: :wq

– Exit without saving the file: :q!

Page 100

Ref.: How to use the vi editor

Operation            | Command | Description
---------------------|---------|--------------------------------------------
Go to insert mode    | i       | Insert text at the cursor position
Go to command mode   | Esc     |
Move the cursor      | → ( l ) | Right
                     | ← ( h ) | Left
                     | ↑ ( k ) | Up
                     | ↓ ( j ) | Down
Delete characters    | x       | Delete a character
                     | dd      | Delete a line (= cut a line)
Cut / Copy / Paste   | yy      | Copy a line
                     | dd      | Cut a line
                     | p       | Paste
Search               | /string | Forward search
                     | ?string | Backward search
                     | n       | Repeat the search in the same direction
                     | N       | Repeat the search in the opposite direction
Save / Exit          | :q!     | Exit without saving the file
                     | :wq     | Exit after saving the file
                     | ZZ      | Exit after saving the file
                     | :w      | Save (overwrite) without exiting

Page 101

Thank you