Foundations of Parallel Programming (并行程序设计基础)
• Introduction to Parallel Programming
• Approaches for Parallel Programs
• Parallel Programming Model
• Parallel Programming Paradigm
Parallel Programming is a Complex Task
• Parallel software developers handle issues such as:
– Non-determinism
– Communication
– Synchronization
– Data partitioning and distribution
– Load balancing
– Fault tolerance
– Heterogeneity
– Shared or distributed memory
– Deadlocks
– Race conditions
Users’ Expectations from Parallel Programming Environments
• Parallel computing can be widely successful only if parallel software meets the expectations of its users, such as:
– provide architecture/processor-type transparency
– provide network/communication transparency
– be easy to use and reliable
– provide support for fault tolerance
– accommodate heterogeneity
– assure portability
– provide support for traditional high-level languages
– be capable of delivering increased performance
– provide parallelism transparency
2 Approaches to Constructing Parallel Programs
Serial code segment:

  for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
  for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];

(a) Constructing a parallel program with library routines (examples: MPI, PVM, Pthreads):

  id = my_process_id();
  p  = number_of_processes();
  for (i = id; i < N; i = i + p) A[i] = b[i] * b[i+1];
  barrier();
  for (i = id; i < N; i = i + p) c[i] = A[i] + A[i+1];

Here my_process_id(), number_of_processes(), and barrier() are library routines.

(b) Extending a serial language (example: Fortran 90):

  A(0:N-1) = b(0:N-1) * b(1:N)
  c = A(0:N-1) + A(1:N)

(c) Constructing a parallel program with compiler annotations (example: SGI Power C):

  #pragma parallel
  #pragma shared(A, b, c)
  #pragma local(i)
  {
    #pragma pfor iterate(i=0; N; 1)
    for (i = 0; i < N; i++) A[i] = b[i] * b[i+1];
    #pragma synchronize
    #pragma pfor iterate(i=0; N; 1)
    for (i = 0; i < N; i++) c[i] = A[i] + A[i+1];
  }
Comparison of the three construction methods:

  Method                 Examples           Advantages                                    Disadvantages
  Library routines       MPI, PVM           Easy to implement; no new compiler needed     No compiler checking, analysis, or optimization
  Language extension     Fortran 90         Allows compiler checking, analysis,           Hard to implement; requires a new compiler
                                            and optimization
  Compiler annotations   SGI Power C, HPF   Between the two approaches above              Has no effect on a serial platform
3 Parallelism Issues
3.1 Homogeneity of Processes
SIMD: all processes execute the same instruction at the same time.
MIMD: different processes may execute different instructions at the same time.
SPMD: the processes are homogeneous; multiple processes execute the same code on different data (generally a synonym for data parallelism). Usually corresponds to parallel loops and data-parallel constructs; a single code.
MPMD: the processes are heterogeneous; multiple processes execute different code. Writing a fully heterogeneous parallel program for a computer with 1000 processors is very difficult.
Parallel block:
  parbegin S1 S2 S3 ... Sn parend
where S1, S2, ..., Sn may be different code.

Parallel loop: when all processes in a parallel block share the same code,
  parbegin S1 S2 S3 ... Sn parend   (S1 ... Sn identical)
simplifies to
  parfor (i = 1; i <= n; i++) S(i)
Constructing SPMD programs

Single-code approach. To express the SPMD program
  parfor (i = 0; i <= N; i++) foo(i)
the user writes one program:
  pid = my_process_id();
  numproc = number_of_processes();
  parfor (i = pid; i <= N; i = i + numproc) foo(i)
This program is compiled into one executable A, which a shell script loads onto N processing nodes:
  run A -numnodes N

Data-parallel approach. To express the SPMD program
  parfor (i = 0; i <= N; i++) { C[i] = A[i] + B[i]; }
the user can write a single data-assignment statement:
  C = A + B
or
  forall (i = 1, N) C[i] = A[i] + B[i]
Constructing MPMD programs

Faking MPMD with SPMD. To express the MPMD program
  parbegin S1 S2 S3 parend
one can use the SPMD program:
  parfor (i = 0; i < 3; i++) {
    if (i == 0) S1
    if (i == 1) S2
    if (i == 2) S3
  }

Multiple-code approach. For languages that provide no parallel block or parallel loop, to express the MPMD program
  parbegin S1 S2 S3 parend
the user writes three programs, compiles them into three executables S1, S2, S3, and loads them onto three processing nodes with a shell script:
  run S1 on node1
  run S2 on node2
  run S3 on node3
S1, S2, and S3 are sequential-language programs plus library calls for interaction.
3.2 Static and Dynamic Parallelism
Program structure: the way a program is built from its components.

Example of static parallelism:
  parbegin P, Q, R parend
where P, Q, R are static.

Example of dynamic parallelism:
  while (C > 0) begin
    fork(foo(C));
    C := boo(C);
  end

Static parallelism: the structure of the program and the number of processes can be determined before run time (e.g. at compile, link, or load time). Dynamic parallelism: otherwise; processes are created and terminated at run time.
A general way to exploit dynamic parallelism: Fork/Join.
Fork: spawn a child process. Join: force the parent to wait for the child.

Process A:
  begin
    Z := 1;
    fork(B);
    T := foo(3);
  end

Process B:
  begin
    fork(C);
    X := foo(Z);
    join(C);
    output(X + Y);
  end

Process C:
  begin
    Y := foo(Z);
  end
3.3 Process Grouping
Purpose: to support interaction among processes; processes that need to interact are often scheduled into the same group. A group member is uniquely identified by: group identifier + member rank.

3.4 Partitioning and Allocation
Principle: keep the system busy computing most of the time, rather than idle or busy interacting, without sacrificing (the degree of) parallelism.
Partitioning: cutting up the data and the workload.
Allocation: mapping the partitioned data and workload onto the computing nodes (processors).
Allocation styles:
– Explicit allocation: the user specifies how data and load are placed.
– Implicit allocation: decided by the compiler and the runtime support system.
Locality principle: place the data a process needs close to the process code that uses it.
Degree of Parallelism (DOP): the number of component processes executing simultaneously.
Granularity: the computational load executed between two successive parallel or interaction operations.
– instruction-level parallelism
– block-level parallelism
– process-level parallelism
– task-level parallelism
Degree of parallelism and grain size are often inversely related: increasing the grain size reduces the degree of parallelism, while increasing the degree of parallelism increases the system (synchronization) overhead.
Levels of Parallelism
– Large grain (task level): programs/tasks; parallelised with PVM/MPI
– Medium grain (control level): functions; parallelised with threads
– Fine grain (data level): loops; parallelised by the compiler
– Very fine grain (multiple issue): individual operations; parallelised in hardware (CPU)
(Figure: code granularity vs. code item and the layer responsible for parallelization.)
  Grain size   Code item                                 Parallelised by
  Very fine    Instruction                               Processor
  Fine         Loop / instruction block                  Compiler
  Medium       Function (standard: about one page)       Programmer
  Large        Program / separate heavy-weight process   Programmer
4 Interaction/Communication Issues

Interaction: the mutual influence among processes.

4.1 Types of Interaction
Communication: an operation that transfers data between two or more processes.
Communication styles:
– shared variables
– parent passes to child (parameter passing)
– message passing
Synchronization:
– atomic synchronization
– control synchronization (barriers, critical sections)
– data synchronization (locks, conditional critical regions, monitors, events)

Examples:

Atomic synchronization:
  parfor (i := 1; i < n; i++) {
    atomic { x := x + 1; y := y - 1 }
  }

Barrier synchronization:
  parfor (i := 1; i < n; i++) {
    Pi
    barrier
    Qi
  }

Critical section:
  parfor (i := 1; i < n; i++) {
    critical { x := x + 1; y := y + 1 }
  }

Data synchronization (semaphore):
  parfor (i := 1; i < n; i++) {
    lock(S); x := x + 1; y := y - 1; unlock(S)
  }
Aggregation: merging the partial results computed by the component processes into one complete result through a sequence of supersteps, each containing a short computation and a simple communication and/or synchronization.
Aggregation styles: reduction, scan.

Example: computing the inner product of two vectors:
  parfor (i := 1; i < n; i++) {
    X[i] := A[i] * B[i];
    inner_product := aggregate_sum(X[i]);
  }
4.2 Modes of Interaction
(Interaction code C executed by processes P1, P2, ..., Pn.)
Synchronous interaction: all participants arrive at the same time and execute the interaction code C together.
Asynchronous interaction: a process may execute C as soon as it arrives at C, without waiting for the other processes to arrive.
4.3 Patterns of Interaction
– One-to-one: point-to-point
– One-to-many: broadcast, scatter
– Many-to-one: gather, reduce
– Many-to-many: total exchange, scan, permutation/shift
(Figure: communication patterns among processes P1, P2, P3 holding values 1, 3, 5.)
(a) Point-to-point (one-to-one): P1 sends a value to P3.
(b) Broadcast (one-to-many): P1 sends a value to all.
(c) Scatter (one-to-many): P1 sends a distinct value to each node.
(d) Gather (many-to-one): P1 receives a value from each node.
(Figure: communication patterns among P1, P2, P3, continued.)
(e) Total exchange (many-to-many): every node sends a different message to every node.
(f) Shift (permutation, many-to-many): every node sends a value to the next node and receives a value from the previous node.
(g) Reduce (many-to-one): P1 obtains the sum 1 + 3 + 5 = 9.
(h) Scan (many-to-many): P1 obtains 1, P2 obtains 1 + 3 = 4, P3 obtains 1 + 3 + 5 = 9.
Approaches for Parallel Programs Development
• Implicit parallelism
– Supported by parallel languages and parallelizing compilers that take care of identifying parallelism, scheduling the calculations, and placing the data.
• Explicit parallelism
– Based on the assumption that the user is often the best judge of how parallelism can be exploited for a particular application.
– The programmer is responsible for most of the parallelization effort: task decomposition, mapping tasks to processors, the communication structure.
Strategies for Development

Parallelization Procedure
1. Decomposition: sequential computation → tasks
2. Assignment: tasks → process elements
3. Orchestration: coordination (communication and synchronization) among process elements
4. Mapping: process elements → processors
Sample Sequential Program — FDM (Finite Difference Method)

  ...
  loop {
    for (i = 0; i < N; i++) {
      for (j = 0; j < N; j++) {
        a[i][j] = 0.2 * (a[i][j-1] + a[i][j+1] + a[i-1][j] + a[i+1][j] + a[i][j]);
      }
    }
  }
  ...

(Boundary handling is omitted on the slide: as written, the stencil reads a[i][-1], a[-1][j], etc. at the edges.)
Parallelize the Sequential Program
• Decomposition: break the loop nest into tasks; on the slide, each execution of the inner statement (one grid-point update) is a task.
Parallelize the Sequential Program
• Assignment: divide the tasks equally among the process elements (PEs).
Parallelize the Sequential Program
• Orchestration: the process elements need to communicate and to synchronize.
Parallelize the Sequential Program
• Mapping: assign the process elements to the processors of the multiprocessor.
Parallel Programming Models
• Sequential Programming Model
• Shared Memory Model (Shared Address Space Model)
– DSM
– Threads/OpenMP (enabled for clusters)
– Java threads
• Message Passing Model
– PVM
– MPI
• Hybrid Model
– Mixing shared and distributed memory models
– Using OpenMP and MPI together
Sequential Programming Model
• Functional
– Naming: can name any variable in the virtual address space
• Hardware (and perhaps compilers) does the translation to physical addresses
– Operations: loads and stores
– Ordering: sequential program order
• Performance
– Relies (mostly) on dependences on a single location: dependence order
– Compiler: reordering and register allocation
– Hardware: out-of-order execution, pipeline bypassing, write buffers
– Transparent replication in caches
SAS (Shared Address Space) Programming Model
(Figure: two threads/processes perform read(X) and write(X) on a shared variable X through the system.)
Shared Address Space Architectures
• Naturally provided on a wide range of platforms
– History dates at least to precursors of mainframes in the early 60s
– Wide range of scale: a few to hundreds of processors
• Popularly known as shared memory machines or model
– Ambiguous: memory may be physically distributed among processors
• Any processor can directly reference any memory location
– Communication occurs implicitly as a result of loads and stores
• Convenient:
– Location transparency
– Programming model similar to time-sharing on uniprocessors
• Except processes run on different processors
• Good throughput on multiprogrammed workloads
Shared Address Space Model
• Process: virtual address space plus one or more threads of control
• Portions of the address spaces of processes are shared
• Writes to shared addresses are visible to other threads (in other processes too)
• Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization
• OS uses shared memory to coordinate processes
(Figure: virtual address spaces of processes P0..Pn communicating via shared addresses; the shared portion of each address space maps to common physical addresses, each private portion to its own.)
Communication Hardware for SAS
• Also a natural extension of the uniprocessor
• Already have processors, one or more memory modules, and I/O controllers connected by some hardware interconnect
– Memory capacity increased by adding modules, I/O by adding controllers
– Add processors for processing!
(Figure: processors, memory modules, and I/O controllers attached to a shared interconnect.)
History of SAS Architecture
• "Mainframe" approach
– Motivated by multiprogramming
– Extends the crossbar used for memory bandwidth and I/O
– Originally processor cost limited configurations to small scale
• later, the cost of the crossbar did
– Bandwidth scales with p
– High incremental cost; use multistage networks instead
• "Minicomputer" approach
– Almost all microprocessor systems have a bus
– Motivated by multiprogramming
– Used heavily for parallel computing
– Called symmetric multiprocessor (SMP)
– Latency larger than for a uniprocessor
– Bus is the bandwidth bottleneck
• caching is key: coherence problem
– Low incremental cost
Example: Intel Pentium Pro Quad
– Highly integrated, targeted at high volume
– Low latency and bandwidth
(Figure: four P-Pro modules on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); each module has a CPU, bus interface, interrupt controller, and 256-KB L2 cache; a memory controller with 1-, 2-, or 4-way interleaved DRAM and PCI bridges to PCI I/O cards complete the system.)
Example: Sun Enterprise
– Memory on the processor cards themselves
• 16 cards of either type: processors + memory, or I/O
– But all memory is accessed over the bus, so the machine is symmetric
– Higher bandwidth, higher latency bus
(Figure: the Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards, each with two processors, their L2 caches, and a memory controller, and I/O cards with SBUS slots, 2 FiberChannel, 100bT, and SCSI interfaces.)
Scaling Up
– Problem is the interconnect: cost (crossbar) or bandwidth (bus)
– Dance-hall: bandwidth still scalable, but lower cost than crossbar
• latencies to memory uniform, but uniformly large
– Distributed memory or non-uniform memory access (NUMA)
• Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response)
(Figure: "dance hall" organization with all processors on one side of the network and all memories on the other, vs. distributed memory with a memory module attached to each processor.)
Shared Address Space Programming Model
• Naming
– Any process can name any variable in the shared space
• Operations
– Loads and stores, plus those needed for ordering
• Simplest ordering model
– Within a process/thread: sequential program order
– Across threads: some interleaving (as in time-sharing)
– Additional orders through synchronization

Synchronization
• Mutual exclusion (locks)
– No ordering guarantees
• Event synchronization
– Ordering of events to preserve dependences
– e.g. producer —> consumer of data
– 3 main types:
• point-to-point
• global
• group
MP (Message Passing) Programming Model
(Figure: a process on Node A executes send(Y); a process on Node B executes receive(Y'); the message carries the value of Y into Y'.)
Message-Passing Programming Model
– Send specifies the data buffer to be transmitted and the receiving process
– Recv specifies the sending process and the application storage to receive into
– Memory-to-memory copy, but need to name processes
– A user process names only local data and entities in process/tag space
– In the simplest form, the send/recv match achieves a pairwise synchronization event
– Many overheads: copying, buffer management, protection
(Figure: process P executes Send X, Q, t; process Q executes Receive Y, P, t; the two match on process and tag, copying address X in P's local address space to address Y in Q's.)
Message Passing Architectures
• Complete computer as building block, including I/O
– Communication via explicit I/O operations
• Programming model: directly access only the private address space (local memory); communicate via explicit messages (send/receive)
• Programming model more removed from basic hardware operations
– Library or OS intervention
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
– synchronous ops
– Replaced by DMA, enabling non-blocking ops
• Buffered by the system at the destination until recv
• Diminishing role of topology
– Store-and-forward routing
– Cost is in the node-network interface
– Simplifies programming
(Figure: a 3-dimensional hypercube with nodes labelled 000 through 111.)
Example: IBM SP-2
– Made out of essentially complete RS6000 workstations
– Network interface integrated into the I/O bus (bandwidth limited by the I/O bus)
• Doesn't need to see memory references
(Figure: an SP-2 node with Power 2 CPU, L2 cache, and memory controller with 4-way interleaved DRAM on the memory bus; the NIC, containing an i860, DMA engine, and DRAM, sits on the MicroChannel I/O bus and connects to a general interconnection network formed from 8-port switches.)
Example: Intel Paragon
– Network interface integrated into the memory bus, for performance
(Figure: a Paragon node with two i860 processors (each with an L1 cache), memory controller with 4-way interleaved DRAM, driver, NI, and DMA on a 64-bit, 50 MHz memory bus; a processing node is attached to every switch of a 2D grid network with 8-bit, 175 MHz, bidirectional links. Photo: Sandia's Intel Paragon XP/S-based supercomputer.)
Message Passing Programming Model
• Naming
– Processes can name private data directly
– No shared address space
• Operations
– Explicit communication: send and receive
– Send transfers data from the private address space to another process
– Receive copies data from a process into the private address space
– Must be able to name processes
• Ordering
– Program order within a process
– Send and receive can provide point-to-point synchronization between processes
• Can construct a global address space
– Process number + address within that process's address space
– But no direct operations on these names
Design Issues Apply at All Layers
• The programming model's position provides constraints/goals for the system
• In fact, each interface between layers supports or takes a position on:
– Naming model
– Set of operations on names
– Ordering model
– Replication
– Communication performance
Naming and Operations
• Naming and operations in the programming model can be directly supported by lower levels, or translated by the compiler, libraries, or OS
• Example: shared virtual address space in the programming model
– Hardware interface supports a shared physical address space
• Direct support by hardware through virtual-to-physical mappings, no software layers
– Hardware supports independent physical address spaces
• The system/user interface can provide SAS through the OS
– v-to-p mappings only for data that are local
– remote data accesses incur page faults, brought in via page-fault handlers
• Or through compilers or runtime, above the system/user interface
Naming and Operations (Cont'd)
• Example: implementing message passing
• Direct support at the hardware interface
– But match and buffering benefit from more flexibility
• Support at the system/user interface or above, in software (almost always)
– Hardware interface provides basic data transport
– Send/receive built in software for flexibility (protection, buffering)
– Or lower interfaces provide SAS, and send/receive is built on top with buffers and loads/stores
Naming and Operations (Cont'd)
• Need to examine the issues and trade-offs at every layer
– Frequencies and types of operations, costs
• Message passing
– No assumptions on orders across processes except those imposed by send/receive pairs
• SAS
– How processes see the order of other processes' references defines the semantics of SAS
• Ordering is very important and subtle
• Uniprocessors play tricks with orders to gain parallelism or locality
• These are more important in multiprocessors
• Need to understand which old tricks are still valid, and learn new ones
• How programs behave, what they rely on, and the hardware implications
Message Passing Model / Shared Memory Model

                 Message Passing   Shared Memory
  Architecture   any               SMP or DSM
  Programming    difficult         easier
  Performance    good              better (SMP), worse (DSM)
  Cost           less expensive    very expensive (SunFire 15K: $4,140,830)
Parallelisation Paradigms
• Task-Farming/Master-Worker
• Single-Program Multiple-Data (SPMD)
• Pipelining
• Divide and Conquer
• Speculation
Master-Worker/Slave Model
• The master decomposes the problem into small tasks, distributes them to workers, and gathers the partial results to produce the final result.
• Mapping/load balancing:
– Static
– Dynamic
• Dynamic balancing suits cases where the number of tasks is larger than the number of CPUs, task sizes are known only at runtime, or the CPUs are heterogeneous.
Single-Program Multiple-Data
• The most commonly used model.
• Each process executes the same piece of code, but on a different part of the data: the data are split among the available processors.
• Also called geometric/domain decomposition, or data parallelism.
Pipelining
• Suitable for fine-grained parallelism.
• Also suitable for applications involving multiple stages of execution that need to operate on large numbers of data sets.
Divide and Conquer
• A problem is divided into two or more subproblems; each subproblem is solved independently, and the results are combined.
• 3 operations: split, compute, and join.
• Master-worker/task-farming is like divide and conquer with the master doing both the split and the join operations.
• (It resembles a "hierarchical" master-worker technique.)
Speculative Parallelism
• Used when it is quite difficult to achieve parallelism through one of the previous paradigms.
• Problems with complex dependencies: use "look-ahead" execution.
• Or: employing different algorithms for solving the same problem.