
Page 1: Foundation to Parallel Programming

Foundation to Parallel Programming

Page 2: Foundation to Parallel Programming

Fundamentals of Parallel Program Design
• Introduction to Parallel Programming
• Approaches for Parallel Programs
• Parallel Programming Model
• Parallel Programming Paradigm

Page 3: Foundation to Parallel Programming

Parallel Programming is a Complex Task

• Parallel software developers handle issues such as:
  – non-determinism
  – communication
  – synchronization
  – data partitioning and distribution
  – load balancing
  – fault tolerance
  – heterogeneity
  – shared or distributed memory
  – deadlocks
  – race conditions

Page 4: Foundation to Parallel Programming

Users’ Expectations from Parallel Programming Environments

• Parallel computing can be widely successful only if parallel software meets the expectations of users, such as:
  – provide architecture/processor-type transparency
  – provide network/communication transparency
  – be easy to use and reliable
  – provide support for fault tolerance
  – accommodate heterogeneity
  – assure portability
  – provide support for traditional high-level languages
  – be capable of delivering increased performance
  – provide parallelism transparency

Page 5: Foundation to Parallel Programming

2 Approaches to Constructing Parallel Programs

Serial code segment:
    for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
    for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];

(a) Constructing the parallel program with library routines (examples: MPI, PVM, Pthreads); the library supplies my_process_id(), number_of_processes(), and barrier():
    id = my_process_id();
    p = number_of_processes();
    for (i = id; i < N; i = i + p) A[i] = b[i] * b[i+1];
    barrier();
    for (i = id; i < N; i = i + p) c[i] = A[i] + A[i+1];

(b) Extending a serial language (example: Fortran 90):
    A(0:N-1) = b(0:N-1) * b(1:N)
    c = A(0:N-1) + A(1:N)

(c) Adding compiler directives (example: SGI Power C):
    #pragma parallel
    #pragma shared(A, b, c)
    #pragma local(i)
    {
    #pragma pfor iterate(i=0; N; 1)
        for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
    #pragma synchronize
    #pragma pfor iterate(i=0; N; 1)
        for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];
    }
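A concrete rendering of approach (a), offered as a minimal sketch rather than the slide's own program: it uses Pthreads (one of the libraries named above), with a thread id in place of my_process_id(), a fixed thread count in place of number_of_processes(), and pthread_barrier_wait() in place of barrier(). The array sizes and values are illustrative assumptions.

    /* Library-routine approach with Pthreads: each thread takes a strided share
       of the iterations; the barrier separates the two loops because the second
       loop reads A[i+1], which another thread may have computed. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 16
    #define P 4                              /* number of threads (assumed) */

    static double A[N + 1], b[N + 2], c[N];
    static pthread_barrier_t bar;

    static void *worker(void *arg) {
        int id = (int)(long)arg;             /* plays the role of my_process_id() */
        for (int i = id; i < N; i += P) A[i] = b[i] * b[i + 1];
        pthread_barrier_wait(&bar);          /* plays the role of barrier() */
        for (int i = id; i < N; i += P) c[i] = A[i] + A[i + 1];
        return NULL;
    }

    int main(void) {
        pthread_t t[P];
        for (int i = 0; i < N + 2; i++) b[i] = i;
        pthread_barrier_init(&bar, NULL, P);
        for (long id = 0; id < P; id++) pthread_create(&t[id], NULL, worker, (void *)id);
        for (int id = 0; id < P; id++) pthread_join(t[id], NULL);
        pthread_barrier_destroy(&bar);
        printf("c[0] = %g\n", c[0]);         /* expect b[0]*b[1] + b[1]*b[2] */
        return 0;
    }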

Page 6: Foundation to Parallel Programming

2 Approaches to Constructing Parallel Programs

Method                Examples            Advantages                                                Disadvantages
Library routines      MPI, PVM            Easy to implement; no new compiler needed                 No compiler checking, analysis, or optimization
Language extension    Fortran 90          Allows compiler checking, analysis, and optimization      Hard to implement; requires a new compiler
Compiler directives   SGI Power C, HPF    Intermediate between the library-routine and              Have no effect on a serial platform
                                          extension approaches

Comparison of the three approaches to constructing parallel programs

Page 7: Foundation to Parallel Programming

3 Parallelism Issues

3.1 Homogeneity of processes
SIMD: all processes execute the same instruction at the same time.
MIMD: different processes may execute different instructions at the same time.
SPMD: the processes are homogeneous; multiple processes execute the same code on different data (generally a synonym for data parallelism). Usually corresponds to parallel loops and data-parallel constructs; single code.
MPMD: the processes are heterogeneous; multiple processes execute different code. Writing a completely heterogeneous parallel program for a computer with 1000 processors would be very difficult.

Page 8: Foundation to Parallel Programming

3 Parallelism Issues

Parallel block:
    parbegin S1 S2 S3 ... Sn parend
S1, S2, S3, ..., Sn may be different pieces of code.

Parallel loop: when all processes in the parallel block share the same code, i.e. S1, S2, S3, ..., Sn are identical,
    parbegin S1 S2 S3 ... Sn parend
simplifies to
    parfor (i = 1; i <= n; i++) S(i)
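As a modern analogue (an assumption; the slide's pseudocode is language-neutral), the parfor construct corresponds to an OpenMP parallel loop in C; S(i) here is a hypothetical per-iteration body.

    #include <stdio.h>
    #include <omp.h>

    /* per-iteration body standing in for S(i) */
    void S(int i) { printf("iteration %d on thread %d\n", i, omp_get_thread_num()); }

    int main(void) {
        int n = 8;
        #pragma omp parallel for        /* parfor (i = 1; i <= n; i++) S(i) */
        for (int i = 1; i <= n; i++)
            S(i);
        return 0;
    }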

Page 9: Foundation to Parallel Programming

3 Parallelism Issues

Constructing SPMD programs with the single-code method. To express the SPMD program
    parfor (i = 0; i <= N; i++) foo(i)
the user writes a single program:
    pid = my_process_id();
    numproc = number_of_processes();
    parfor (i = pid; i <= N; i = i + numproc) foo(i)
This program is compiled into an executable A, which a shell script loads onto N processing nodes:
    run A -numnodes N

Constructing SPMD programs with the data-parallel method. To express the SPMD program
    parfor (i = 0; i <= N; i++) { C[i] = A[i] + B[i]; }
the user can write a single data assignment statement:
    C = A + B
or
    forall (i = 1, N) C[i] = A[i] + B[i]
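A minimal sketch of the single-code method, assuming MPI: every node runs the same executable, the library supplies pid and numproc, and an ordinary strided loop plays the role of parfor; foo() is a hypothetical per-iteration task.

    #include <stdio.h>
    #include <mpi.h>

    static void foo(int i) { printf("task %d\n", i); }

    int main(int argc, char **argv) {
        int pid, numproc, N = 100;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &pid);       /* my_process_id()        */
        MPI_Comm_size(MPI_COMM_WORLD, &numproc);   /* number_of_processes()  */
        for (int i = pid; i <= N; i += numproc)    /* parfor (i = pid; i <= N; i += numproc) */
            foo(i);
        MPI_Finalize();
        return 0;
    }

    /* Loaded onto the nodes much as the slide's "run A -numnodes N",
       e.g. (assumption): mpirun -np <numnodes> ./A */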

Page 10: Foundation to Parallel Programming

3 Parallelism Issues

Faking MPMD with SPMD. To express the MPMD program
    parbegin S1 S2 S3 parend
one can use the following SPMD program:
    parfor (i = 0; i < 3; i++) {
        if (i == 0) S1
        if (i == 1) S2
        if (i == 2) S3
    }

Constructing MPMD programs with the multiple-code method (for languages that provide no parallel block or parallel loop). To express the MPMD program
    parbegin S1 S2 S3 parend
the user writes three programs, compiles them into three executables S1, S2, S3, and uses a shell script to load them onto three processing nodes:
    run S1 on node1
    run S2 on node2
    run S3 on node3
S1, S2, and S3 are sequential-language programs plus library calls for interaction.

Page 11: Foundation to Parallel Programming

3 Parallelism Issues

3.2 Static and dynamic parallelism
Program structure: the way a program is built up from its components.
Static parallelism: the structure of the program and the number of processes can be determined before run time (e.g. at compile, link, or load time).
Dynamic parallelism: otherwise; processes are created and terminated at run time.

Example of static parallelism:
    parbegin P, Q, R parend
where P, Q, R are static.

Example of dynamic parallelism:
    while (C > 0) begin
        fork(foo(C));
        C := boo(C);
    end

Page 12: Foundation to Parallel Programming

3 Parallelism Issues

The general way to exploit dynamic parallelism: Fork/Join
Fork: spawn a child process.
Join: force the parent process to wait for the child.

Process A:
    begin
        Z := 1
        fork(B);
        T := foo(3);
    end

Process B:
    begin
        fork(C);
        X := foo(Z);
        join(C);
        output(X + Y);
    end

Process C:
    begin
        Y := foo(Z);
    end
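A hedged Pthreads rendering of the fork/join example above: fork() maps to pthread_create and join() to pthread_join; foo() is given a placeholder body, and the final join in main (not shown on the slide) simply keeps process A alive until B finishes.

    #include <pthread.h>
    #include <stdio.h>

    static int Z, X, Y, T;
    static int foo(int v) { return v * v; }        /* placeholder body (assumption) */

    static void *proc_C(void *arg) { Y = foo(Z); return NULL; }

    static void *proc_B(void *arg) {
        pthread_t c;
        pthread_create(&c, NULL, proc_C, NULL);    /* fork(C) */
        X = foo(Z);
        pthread_join(c, NULL);                     /* join(C): Y is now valid */
        printf("output: %d\n", X + Y);
        return NULL;
    }

    int main(void) {                               /* process A */
        pthread_t b;
        Z = 1;
        pthread_create(&b, NULL, proc_B, NULL);    /* fork(B) */
        T = foo(3);
        pthread_join(b, NULL);                     /* added so main does not exit early */
        return 0;
    }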

Page 13: Foundation to Parallel Programming

3 Parallelism Issues

3.3 Process grouping
Purpose: to support interaction among processes; processes that need to interact are usually scheduled into the same group.
A member of a process group is uniquely identified by a group identifier plus a member rank.

3.4 Partitioning and allocation
Principle: keep the system busy computing most of the time, rather than idle or busy interacting, without sacrificing parallelism (degree of parallelism).
Partitioning: cutting up the data and the workload.
Allocation: mapping the partitioned data and workload onto the compute nodes (processors).
Allocation styles:
    Explicit allocation: the user specifies how data and workload are placed.
    Implicit allocation: the compiler and run-time support system decide.
Locality principle: place the data a process needs close to the process code that uses it.

Page 14: Foundation to Parallel Programming

3 Parallelism Issues

Degree of parallelism (DOP): the number of processes executing simultaneously.

Granularity: the amount of computation performed between two consecutive parallel or interaction operations.
    Instruction-level parallelism
    Block-level parallelism
    Process-level parallelism
    Task-level parallelism
Degree of parallelism and granularity are often reciprocal: increasing the granularity reduces the degree of parallelism.
Increasing the degree of parallelism increases the system (synchronization) overhead.

Page 15: Foundation to Parallel Programming

Levels of Parallelism

Code granularity                   Code item           Parallelised with
Large grain (task level)           Program             PVM/MPI
Medium grain (control level)       Function (thread)   Threads
Fine grain (data level)            Loop                Compilers
Very fine grain (multiple issue)   With hardware       CPU

[Figure: tasks i-1, i, i+1 decompose into functions func1()–func3(), which decompose into loop iterations a(0)=.., b(0)=.., ..., which decompose into individual operations (+, x, load)]

Page 16: Foundation to Parallel Programming

Responsible for Parallelization

Grain size   Code item                                  Parallelised by
Very fine    Instruction                                Processor
Fine         Loop / instruction block                   Compiler
Medium       Function (standard one page)               Programmer
Large        Program / separate heavy-weight process    Programmer

Page 17: Foundation to Parallel Programming

4 Interaction/Communication Issues

Interaction: the mutual influence between processes.

4.1 Types of interaction
Communication: an operation that transfers data between two or more processes.
Communication mechanisms:
    shared variables
    parameters passed from a parent process to a child process
    message passing

Page 18: Foundation to Parallel Programming

4 Interaction/Communication Issues

Synchronization:
    atomic synchronization
    control synchronization (barriers, critical sections)
    data synchronization (locks, conditional critical regions, monitors, events)

Examples:
Atomic synchronization:
    parfor (i := 1; i < n; i++) {
        atomic { x := x + 1; y := y - 1 }
    }
Barrier synchronization:
    parfor (i := 1; i < n; i++) {
        Pi
        barrier
        Qi
    }
Critical section:
    parfor (i := 1; i < n; i++) {
        critical { x := x + 1; y := y + 1 }
    }
Data synchronization (semaphore synchronization):
    parfor (i := 1; i < n; i++) {
        lock(S); x := x + 1; y := y - 1; unlock(S)
    }
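A hedged OpenMP rendering of the same four synchronization forms in C (an assumption; the slide's pseudocode is language-neutral). Note that an OpenMP atomic covers only a single update, so only x is updated atomically here, whereas the slide's atomic block updates x and y together.

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int x = 0, y = 0;
        omp_lock_t S;
        omp_init_lock(&S);

        #pragma omp parallel
        {
            #pragma omp atomic            /* atomic synchronization (single update) */
            x++;

            /* phase Pi ... */
            #pragma omp barrier           /* barrier between phases Pi and Qi */
            /* phase Qi ... */

            #pragma omp critical          /* critical { x := x + 1; y := y + 1 } */
            { x++; y++; }

            omp_set_lock(&S);             /* lock(S); ... ; unlock(S) */
            x++; y--;
            omp_unset_lock(&S);
        }

        omp_destroy_lock(&S);
        printf("x = %d, y = %d\n", x, y);
        return 0;
    }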

Page 19: Foundation to Parallel Programming

4 Interaction/Communication Issues

Aggregation: merging the partial results computed by the individual processes into one complete result through a sequence of supersteps; each superstep contains a short computation and a simple communication and/or synchronization.
Aggregation styles: reduction, scan.

Example: computing the inner product of two vectors
    parfor (i := 1; i < n; i++) {
        X[i] := A[i] * B[i]
        inner_product := aggregate_sum(X[i]);
    }
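A minimal sketch of the same inner-product aggregation, assuming MPI: each rank forms a partial sum of its share of X[i] = A[i]*B[i], and MPI_Reduce plays the role of aggregate_sum(). The array size and contents are illustrative assumptions.

    #include <stdio.h>
    #include <mpi.h>

    #define N 8

    int main(int argc, char **argv) {
        double A[N], B[N], partial = 0.0, inner_product = 0.0;
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        for (int i = 0; i < N; i++) { A[i] = i; B[i] = 1.0; }
        for (int i = rank; i < N; i += size)      /* this rank's share of X[i] */
            partial += A[i] * B[i];
        MPI_Reduce(&partial, &inner_product, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("inner product = %g\n", inner_product);
        MPI_Finalize();
        return 0;
    }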

Page 20: Foundation to Parallel Programming

4 Interaction/Communication Issues

4.2 Modes of interaction
Processes P1, P2, ..., Pn share a piece of interaction code C.
Synchronous interaction: all participants arrive at the interaction code C at the same time and execute it together.
Asynchronous interaction: a process that reaches C may execute it without waiting for the other processes to arrive.

Page 21: Foundation to Parallel Programming

4 Interaction/Communication Issues

4.3 Interaction patterns
One-to-one: point-to-point
One-to-many: broadcast, scatter
Many-to-one: gather, reduce
Many-to-many: total exchange, scan, permutation/shift
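A hedged mapping of these interaction patterns onto the MPI collectives that implement them; the buffer sizes and the limit of 64 processes are illustrative assumptions.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, size, v, sum, prefix;
        int sendbuf[64] = {0}, recvbuf[64];        /* assumes at most 64 processes */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        v = rank + 1;

        if (size > 1) {                            /* one-to-one: point-to-point */
            if (rank == 0) MPI_Send(&v, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            if (rank == 1) MPI_Recv(&v, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Bcast(&v, 1, MPI_INT, 0, MPI_COMM_WORLD);                           /* one-to-many: broadcast */
        MPI_Scatter(sendbuf, 1, MPI_INT, &v, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* one-to-many: scatter */
        MPI_Gather(&v, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);     /* many-to-one: gather */
        MPI_Reduce(&v, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);           /* many-to-one: reduce */
        MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD); /* many-to-many: total exchange */
        MPI_Scan(&v, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);             /* many-to-many: scan */

        if (rank == 0) printf("reduce gave %d\n", sum);
        MPI_Finalize();
        return 0;
    }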

Page 22: Foundation to Parallel Programming

4 Interaction/Communication Issues

[Figure: communication patterns among three processes P1, P2, P3 holding values 1, 3, 5]
(a) Point-to-point (one-to-one): P1 sends a value to P3.
(b) Broadcast (one-to-many): P1 sends one value to all processes.
(c) Scatter (one-to-many): P1 sends a value to each node.
(d) Gather (many-to-one): P1 receives a value from each node.

Page 23: Foundation to Parallel Programming

4 Interaction/Communication Issues

[Figure: further communication patterns among P1, P2, P3]
(e) Total exchange (many-to-many): every node sends a different message to every node.
(f) Shift (permutation, many-to-many): every node sends a value to the next node and receives a value from the previous node.
(g) Reduce (many-to-one): P1 obtains the sum 1 + 3 + 5 = 9.
(h) Scan (many-to-many): P1 obtains 1, P2 obtains 1 + 3 = 4, P3 obtains 1 + 3 + 5 = 9.

Page 24: Foundation to Parallel Programming

Approaches for Parallel Programs Development

• Implicit parallelism
  – Supported by parallel languages and parallelizing compilers that take care of identifying parallelism, scheduling the calculations, and placing the data.
• Explicit parallelism
  – Based on the assumption that the user is often the best judge of how parallelism can be exploited for a particular application.
  – The programmer is responsible for most of the parallelization effort, such as task decomposition, mapping tasks to processors, and the communication structure.

Page 25: Foundation to Parallel Programming

Strategies for Development


Page 26: Foundation to Parallel Programming

Parallelization Procedure

Sequential Computation → (Decomposition) → Tasks → (Assignment) → Process Elements → (Orchestration) → (Mapping) → Processors

Page 27: Foundation to Parallel Programming

Sample Sequential Program

    ...
    loop {
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                                 + a[i-1][j] + a[i+1][j] + a[i][j]);
            }
        }
    }
    ...

FDM (Finite Difference Method)

Page 28: Foundation to Parallel Programming

Parallelize the Sequential Program

• Decomposition

The label "a task" marks the unit of decomposition: one update of a[i][j] in the loop body.

    ...
    loop {
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                                 + a[i-1][j] + a[i+1][j] + a[i][j]);   /* <- a task */
            }
        }
    }
    ...

Page 29: Foundation to Parallel Programming

Parallelize the Sequential Program

• Assignment

[Figure: the tasks are divided among four process elements (PE)]
Divide the tasks equally among the process elements.

Page 30: Foundation to Parallel Programming

Parallelize the Sequential Program

• Orchestration

[Figure: the four process elements (PE), with communication between them]
The process elements need to communicate and to synchronize.

Page 31: Foundation to Parallel Programming

Parallelize the Sequential Program

• Mapping

[Figure: the process elements (PE) are mapped onto the processors of a multiprocessor]
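A minimal sketch of decomposition and assignment for this computation, assuming OpenMP and a Jacobi-style variant that writes into a separate array so that rows can be updated concurrently; the slide's in-place update would additionally require ordering between neighbouring rows.

    #include <stdio.h>
    #include <omp.h>

    #define N 64
    static double a[N + 2][N + 2], a_new[N + 2][N + 2];

    /* one sweep: the rows are divided among the threads (the "process elements") */
    static void sweep(void) {
        #pragma omp parallel for
        for (int i = 1; i <= N; i++)
            for (int j = 1; j <= N; j++)
                a_new[i][j] = 0.2 * (a[i][j-1] + a[i][j+1]
                                     + a[i-1][j] + a[i+1][j] + a[i][j]);
    }

    int main(void) {
        a[N / 2][N / 2] = 1.0;       /* illustrative initial condition */
        sweep();
        printf("a_new[%d][%d] = %g\n", N / 2, N / 2, a_new[N / 2][N / 2]);
        return 0;
    }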

Page 32: Foundation to Parallel Programming

Parallel Programming Models
• Sequential Programming Model
• Shared Memory Model (Shared Address Space Model)
  – DSM
  – Threads/OpenMP (enabled for clusters)
  – Java threads
• Message Passing Model
  – PVM
  – MPI
• Hybrid Model
  – Mixing shared and distributed memory models
  – Using OpenMP and MPI together
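A hedged sketch of the hybrid model: MPI distributes work across processes (nodes) while OpenMP threads share the work inside each process. The loop and its bounds are illustrative assumptions.

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    int main(int argc, char **argv) {
        int rank, size, provided;
        long local = 0, total = 0;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* each MPI process takes a strided share of the iterations;
           OpenMP threads then split that share inside the process */
        #pragma omp parallel for reduction(+:local)
        for (long i = rank; i < 1000000; i += size)
            local += i;

        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %ld\n", total);
        MPI_Finalize();
        return 0;
    }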

Page 33: Foundation to Parallel Programming

Sequential Programming Model

• Functional
  – Naming: can name any variable in the virtual address space
    • Hardware (and perhaps compilers) does the translation to physical addresses
  – Operations: loads and stores
  – Ordering: sequential program order

Page 34: Foundation to Parallel Programming

Sequential Programming Model

• Performance
  – Relies (mostly) on dependences on single locations: dependence order
  – Compiler: reordering and register allocation
  – Hardware: out-of-order execution, pipeline bypassing, write buffers
  – Transparent replication in caches

Page 35: Foundation to Parallel Programming

SAS(Shared Address Space) Programming Model

[Figure: two threads (processes) in one system read and write a shared variable X: one issues read(X), the other write(X)]
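A hedged illustration of the shared-address-space picture above, using Pthreads: two threads of one process access the same variable X, with a mutex keeping the accesses orderly. The values are illustrative.

    #include <pthread.h>
    #include <stdio.h>

    static int X = 0;                              /* the shared variable */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void *writer(void *arg) {
        pthread_mutex_lock(&m);
        X = 42;                                    /* write(X) */
        pthread_mutex_unlock(&m);
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        pthread_join(t, NULL);
        pthread_mutex_lock(&m);
        printf("read(X) = %d\n", X);               /* read(X) */
        pthread_mutex_unlock(&m);
        return 0;
    }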

Page 36: Foundation to Parallel Programming

Shared Address Space Architectures
• Naturally provided on a wide range of platforms
  – History dates at least to precursors of mainframes in the early 1960s
  – Wide range of scale: a few to hundreds of processors
• Popularly known as shared memory machines or model
  – Ambiguous: memory may be physically distributed among processors

Page 37: Foundation to Parallel Programming

Shared Address Space Architectures
• Any processor can directly reference any memory location
  – Communication occurs implicitly as a result of loads and stores
• Convenient:
  – Location transparency
  – Similar programming model to time-sharing on uniprocessors
    • Except processes run on different processors
    • Good throughput on multiprogrammed workloads

Page 38: Foundation to Parallel Programming

Shared Address Space Model
• Process: virtual address space plus one or more threads of control
• Portions of the address spaces of processes are shared
• Writes to shared addresses are visible to other threads (in other processes too)
• Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
• The OS uses shared memory to coordinate processes

[Figure: virtual address spaces of processes P0..Pn communicating via shared addresses; each space has a shared portion mapped to common physical addresses and a private portion (P0 private, P1 private, ..., Pn private); loads and stores to the shared portion reach the machine's physical address space]

Page 39: Foundation to Parallel Programming

Communication Hardware for SAS
• Also a natural extension of the uniprocessor
• Already have processor, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort
  – Memory capacity increased by adding modules, I/O by adding controllers
  – Add processors for processing!

[Figure: processors, memory modules (Mem), and I/O controllers (I/O ctrl) attached to an interconnect, with I/O devices on the controllers]

Page 40: Foundation to Parallel Programming

History of SAS Architecture
• "Mainframe" approach
  – Motivated by multiprogramming
  – Extends the crossbar used for memory bandwidth and I/O
  – Originally, processor cost limited scale to small systems; later, the cost of the crossbar did
  – Bandwidth scales with p
  – High incremental cost; use multistage networks instead
• "Minicomputer" approach
  – Almost all microprocessor systems have a bus
  – Motivated by multiprogramming
  – Used heavily for parallel computing
  – Called symmetric multiprocessor (SMP)
  – Latency larger than for a uniprocessor
  – Bus is the bandwidth bottleneck; caching is key: coherence problem
  – Low incremental cost

[Figure: crossbar connecting processors (P), memories (M), and I/O (mainframe approach) vs. a shared bus connecting processors with caches ($), memories, and I/O (minicomputer approach)]

Page 41: Foundation to Parallel Programming

Example: Intel Pentium Pro Quad

– Highly integrated, targeted at high volume
– Low latency and bandwidth

[Figure: P-Pro modules (CPU, 256-KB L2 cache, interrupt controller, bus interface, MIU) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with a memory controller driving 1-, 2-, or 4-way interleaved DRAM, and PCI bridges with PCI I/O cards]

Page 42: Foundation to Parallel Programming

Example: SUN Enterprise

– Memory on the processor cards themselves
  • 16 cards of either type: processors + memory, or I/O
– But all memory is accessed over the bus, so the machine is symmetric
– Higher bandwidth, higher latency bus

[Figure: CPU/memory cards (two UltraSPARC processors with caches, a memory controller, and a bus interface/switch) and I/O cards (SBus, Fibre Channel, 100bT, SCSI interfaces) connected by the Gigaplane bus (256-bit data, 41-bit address, 83 MHz)]

Page 43: Foundation to Parallel Programming

Scaling Up

– Problem is the interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than a crossbar
  • latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
  • Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)

[Figure: "dance hall" organization (all memories on one side of the network, processor-cache pairs on the other) vs. distributed memory (a memory module attached to each processor-cache node)]

Page 44: Foundation to Parallel Programming

Shared Address Space Programming Model

• Naming
  – Any process can name any variable in the shared space
• Operations
  – Loads and stores, plus those needed for ordering
• Simplest ordering model
  – Within a process/thread: sequential program order
  – Across threads: some interleaving (as in time-sharing)
  – Additional orders through synchronization

Page 45: Foundation to Parallel Programming

Synchronization
• Mutual exclusion (locks)
  – No ordering guarantees
• Event synchronization
  – Ordering of events to preserve dependences
  – e.g. producer → consumer of data
  – 3 main types:
    • point-to-point
    • global
    • group

Page 46: Foundation to Parallel Programming

MP Programming Model

[Figure: a process on Node A executes send(Y) and a process on Node B executes receive(Y'); the message carries the contents of Y to Y']
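A minimal sketch of the send/receive picture above, assuming MPI (run with at least two processes): rank 0 plays the process on Node A and rank 1 the process on Node B.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank;
        double Y = 3.14, Yprime = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&Y, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);        /* send(Y)      */
        } else if (rank == 1) {
            MPI_Recv(&Yprime, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                              /* receive(Y')  */
            printf("Node B received %g\n", Yprime);
        }
        MPI_Finalize();
        return 0;
    }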

Page 47: Foundation to Parallel Programming

Message-Passing Programming Model

– Send specifies the data buffer to be transmitted and the receiving process
– Recv specifies the sending process and the application storage to receive into
– Memory-to-memory copy, but processes must be named
– A user process names only local data and entities in the process/tag space
– In the simplest form, the send/recv match achieves a pairwise synchronization event
– Many overheads: copying, buffer management, protection

[Figure: process P executes "Send X, Q, t" and process Q executes "Receive Y, P, t"; the matching send and receive copy local address X in P's address space to local address Y in Q's address space]

Page 48: Foundation to Parallel Programming

Message Passing Architectures
• Complete computer as the building block, including I/O
  – Communication via explicit I/O operations
• Programming model: directly access only the private address space (local memory); communicate via explicit messages (send/receive)
• Programming model more removed from basic hardware operations
  – Library or OS intervention

Page 49: Foundation to Parallel Programming

Evolution of Message-Passing Machines

• Early machines: FIFO on each link
  – synchronous ops
  – Replaced by DMA, enabling non-blocking ops
    • Buffered by system at destination until recv
• Diminishing role of topology
  – Store-and-forward routing
  – Cost is in the node-to-network interface
  – Simplifies programming

[Figure: hypercube topology with nodes 000, 001, 010, 011, 100, 101, 110, 111]

Page 50: Foundation to Parallel Programming

Example: IBM SP-2

– Made out of essentially complete RS6000 workstations
– Network interface integrated in the I/O bus (bandwidth limited by the I/O bus)
  • Doesn't need to see memory references

[Figure: IBM SP-2 node: Power 2 CPU with L2 cache and memory controller driving 4-way interleaved DRAM on the memory bus; an i860-based NIC with DMA and its own DRAM sits on the MicroChannel I/O bus; nodes are connected by a general interconnection network formed from 8-port switches]

Page 51: Foundation to Parallel Programming

Example: Intel Paragon

– Network interface integrated in the memory bus, for performance

[Figure: Intel Paragon node: two i860 processors with L1 caches, a memory controller with 4-way interleaved DRAM, and a NI with DMA and driver on the 64-bit, 50 MHz memory bus; nodes attach to every switch of an 8-bit, 175 MHz, bidirectional 2D grid network. Sandia's Intel Paragon XP/S-based supercomputer]

Page 52: Foundation to Parallel Programming

Message Passing Programming Model

• Naming
  – Processes can name private data directly
  – No shared address space
• Operations
  – Explicit communication: send and receive
  – Send transfers data from the private address space to another process
  – Receive copies data from a process into the private address space
  – Must be able to name processes

Page 53: Foundation to Parallel Programming

Message Passing Programming Model

• Ordering
  – Program order within a process
  – Send and receive can provide point-to-point synchronization between processes
• Can construct a global address space
  – Process number + address within the process address space
  – But no direct operations on these names

Page 54: Foundation to Parallel Programming

Design Issues Apply at All Layers

• The programming model's position provides constraints/goals for the system
• In fact, each interface between layers supports or takes a position on:
  – Naming model
  – Set of operations on names
  – Ordering model
  – Replication
  – Communication performance

Page 55: Foundation to Parallel Programming

Naming and Operations

• Naming and operations in the programming model can be directly supported by lower levels, or translated by the compiler, libraries, or OS
• Example
  – Shared virtual address space in the programming model
    • Hardware interface supports a shared physical address space
    • Direct support by hardware through virtual-to-physical mappings, no software layers

Page 56: Foundation to Parallel Programming

Naming and Operations (Cont’d)

• Hardware supports independent physical address spaces
  – The system/user interface can provide SAS through the OS
    • virtual-to-physical mappings only for data that are local
    • remote data accesses incur page faults; data brought in via page-fault handlers
  – Or through compilers or the runtime, above the system/user interface

Page 57: Foundation to Parallel Programming

Naming and Operations (Cont’d)

• Example: implementing message passing
• Direct support at the hardware interface
  – But match and buffering benefit from more flexibility
• Support at the system/user interface or above in software (almost always)
  – Hardware interface provides basic data transport
  – Send/receive built in software for flexibility (protection, buffering)
  – Or lower interfaces provide SAS, and send/receive built on top with buffers and loads/stores

Page 58: Foundation to Parallel Programming

Naming and Operations (Cont’d)

• Need to examine the issues and tradeoffs at every layer
  – Frequencies and types of operations, costs
• Message passing
  – No assumptions on orders across processes except those imposed by send/receive pairs
• SAS
  – How processes see the order of other processes' references defines the semantics of SAS
• Ordering very important and subtle

Page 59: Foundation to Parallel Programming

Naming and Operations (Cont’d)

• Uniprocessors play tricks with orders to gain parallelism or locality
• These are even more important in multiprocessors
• Need to understand which old tricks are valid, and learn new ones
• How programs behave, what they rely on, and the hardware implications

Page 60: Foundation to Parallel Programming

Message Passing Model / Shared Memory Model

                Message Passing    Shared Memory
Architecture    any                SMP or DSM
Programming     difficult          easier
Performance     good               better (SMP), worse (DSM)
Cost            less expensive     very expensive (SunFire 15K: $4,140,830)

Page 61: Foundation to Parallel Programming

Parallelisation Paradigms

• Task-Farming / Master-Worker
• Single-Program Multiple-Data (SPMD)
• Pipelining
• Divide and Conquer
• Speculation

Page 62: Foundation to Parallel Programming

Master-Worker/Slave Model
• The master decomposes the problem into small tasks, distributes them to the workers, and gathers the partial results to produce the final result.
• Mapping / load balancing
  – Static
  – Dynamic: used when the number of tasks is larger than the number of CPUs, when tasks are only known at run time, or when the CPUs are heterogeneous.

[Figure: static master-worker organization]
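A hedged master-worker sketch in MPI (run with at least two processes): rank 0 acts as the master, handing out task indices dynamically and gathering results; the workers compute a placeholder do_task(). The tag values and the termination convention (task id -1) are assumptions.

    #include <stdio.h>
    #include <mpi.h>

    #define NTASKS 20
    static double do_task(int t) { return t * 2.0; }        /* placeholder work */

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (rank == 0) {                                    /* master */
            double sum = 0.0, r;
            int next = 0, active = 0, stop = -1;
            MPI_Status st;
            /* seed each worker with one task */
            for (int w = 1; w < size && next < NTASKS; w++, next++, active++)
                MPI_Send(&next, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
            while (active > 0) {
                MPI_Recv(&r, 1, MPI_DOUBLE, MPI_ANY_SOURCE, 0, MPI_COMM_WORLD, &st);
                sum += r; active--;
                if (next < NTASKS) {                        /* dynamic load balancing */
                    MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, 0, MPI_COMM_WORLD);
                    next++; active++;
                }
            }
            for (int w = 1; w < size; w++)                  /* tell workers to stop */
                MPI_Send(&stop, 1, MPI_INT, w, 0, MPI_COMM_WORLD);
            printf("sum of results = %g\n", sum);
        } else {                                            /* worker */
            int t;
            for (;;) {
                MPI_Recv(&t, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                if (t < 0) break;
                double r = do_task(t);
                MPI_Send(&r, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        MPI_Finalize();
        return 0;
    }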

Page 63: Foundation to Parallel Programming

Single-Program Multiple-Data
• The most commonly used model.
• Each process executes the same piece of code, but on a different part of the data; the data are split among the available processors.
• Also described as geometric/domain decomposition or data parallelism.

Page 64: Foundation to Parallel Programming

Pipelining

• Suitable for fine-grained parallelism.
• Also suitable for applications involving multiple stages of execution that need to operate on a large number of data sets.

Page 65: Foundation to Parallel Programming

Divide and Conquer
• A problem is divided into two or more subproblems; each subproblem is solved independently, and the results are combined.
• 3 operations: split, compute, and join.
• Master-worker/task-farming is like divide and conquer with the master doing both the split and the join operation.
• (It can be seen as a "hierarchical" master-worker technique.)

Page 66: Foundation to Parallel Programming

Speculative Parallelism

• Used when it is quite difficult to achieve parallelism through one of the previous paradigms.
• Problems with complex dependencies: use "look-ahead" execution.
• Can also mean employing different algorithms to solve the same problem.