

CS 267 Applications of Parallel Computers

Lecture 6: Distributed Memory (continued)

Data Parallel Programming

James Demmel

http://www.cs.berkeley.edu/~demmel/cs267_Spr99


Recap of Last Lecture

° Distributed memory machines
• Each processor has independent memory

• Connected by network

° Cost = #messages * α + #words_sent * β + #flops * f

° Distributed memory programming
• MPI

• Send/Receive

• Collective Communication

• Sharks and Fish under gravity as example


Outline

° Distributed Memory Programming (continued)
• Review Gravity Algorithms

• Look at Sharks and Fish code

° Data Parallel Programming


Example: Sharks and Fish

° N fish on P procs, N/P fish per processor
• At each time step, compute forces on fish and move them

° Need to compute gravitational interaction
• In the usual N^2 algorithm, every fish depends on every other fish

• every fish needs to “visit” every processor, even if it “lives” on one

° What is the cost?


2 Algorithms for Gravity: What are their costs?

Algorithm 1

Copy local Fish array of length N/P to Tmp array
for j = 1 to N
    for k = 1 to N/P, Compute force from Tmp(k) on Fish(k)
    “Rotate” Tmp by 1:
        for k = 2 to N/P, Tmp(k) <= Tmp(k-1)
        recv(my_proc - 1, Tmp(1))
        send(my_proc + 1, Tmp(N/P))

Algorithm 2

Copy local Fish array of length N/P to Tmp array
for j = 1 to P
    for k = 1 to N/P
        for m = 1 to N/P, Compute force from Tmp(k) on Fish(m)
    “Rotate” Tmp by N/P:
        recv(my_proc - 1, Tmp(1:N/P))
        send(my_proc + 1, Tmp(1:N/P))

What could go wrong? (be careful of overwriting Tmp)
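Using the recap's cost model, a rough per-processor estimate (my own back-of-the-envelope, not from the slides): Algorithm 1 sends N small messages (one fish each), so roughly N*α + N*β + N*(N/P)*f, while Algorithm 2 sends only P messages of N/P fish each, so roughly P*α + N*β + N*(N/P)*f: the same flops and the same data volume, but far fewer message start-ups.

A minimal sketch of the “rotate by 1” step of Algorithm 1 in Fortran with MPI, the recap's programming model. The subroutine name, the tmp/nlocal names, and the complex fish representation are my own assumptions, not the Sharks and Fish code; the point is saving Tmp(N/P) before the shift and pairing the send and receive so nothing is overwritten and no deadlock occurs.

subroutine rotate_by_one(tmp, nlocal, comm)
  use mpi
  implicit none
  integer, intent(in)    :: nlocal, comm
  complex, intent(inout) :: tmp(nlocal)
  integer :: me, nprocs, left, right, ierr
  integer :: status(MPI_STATUS_SIZE)
  complex :: outgoing

  call MPI_Comm_rank(comm, me, ierr)
  call MPI_Comm_size(comm, nprocs, ierr)
  left  = mod(me - 1 + nprocs, nprocs)     ! my_proc - 1 on the ring
  right = mod(me + 1, nprocs)              ! my_proc + 1 on the ring

  outgoing = tmp(nlocal)                   ! save the element the shift would overwrite
  tmp(2:nlocal) = tmp(1:nlocal-1)          ! local part of the rotate (array assignment,
                                           ! so no element is clobbered mid-shift)

  ! combined send/receive avoids the deadlock of every process blocking in send first
  call MPI_Sendrecv(outgoing, 1, MPI_COMPLEX, right, 0, &
                    tmp(1),   1, MPI_COMPLEX, left,  0, &
                    comm, status, ierr)
end subroutine rotate_by_one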


More Algorithms for Gravity

° Algorithm 3 (in sharks and fish code)
• All processors send their Fish to Proc 0

• Proc 0 broadcasts all Fish to all processors

° Tree-algorithms
• Barnes-Hut, Greengard-Rokhlin, Anderson

• O(N log N) instead of O(N^2)

• Parallelizable with cleverness

• “Just” an approximation, but as accurate as you like (often only a few digits are needed, so why pay for more)

• Same idea works for other problems where the effects of distant objects become “smooth” or “compressible”

- electrostatics, vorticity, …

- radiosity in graphics

- anything satisfying Poisson equation or something like it

• Will talk about it in detail later in course


CS 267 Applications of Parallel Computers

Lecture 11: Data Parallel Programming

Kathy Yelick

http://www.cs.berkeley.edu/~dmartin/cs267/


Outline

° Break for Mat Mul results

° Quick Evolution of Data-Parallel Machines

° Fortran 90

° HPF extensions


Data Parallel Architectures

° Programming model
• operations are performed on each element of a large (regular) data structure in a single step

• arithmetic, global data transfer

° A processor is logically associated with each data element, with general communication and cheap global synchronization.

• driven originally by simple O.D.E.

(Diagram: a control processor driving a grid of processor–memory (P-M) elements.)


Vector Machines

° The Cray-1
• RISC machine with load/store vector architecture
• Compact vector instructions, with non-unit stride
• High-performance scalar (ECL, single phase, 80 MHz, freon cooled)
• Tightly integrated scalar and vector => low start-up cost
• Function unit chaining (C = Ax + B)

(Diagram: vector registers feeding pipelined function units, backed by a highly interleaved semiconductor (SRAM) memory.)


Use of SIMD Model on Vector Machines

(Diagram: the vector machine viewed as $vlr virtual processors VP0 … VP$vlr-1, each with 32 general-purpose registers vr0–vr31 of $vdw bits, 32 one-bit flag registers vf0–vf31, and 16 32-bit control registers vcr0–vcr15.)


Evolution of Data Parallelism

° Rigid control structure (SIMD in Flynn’s Taxonomy)
• SISD = uniprocessor, MIMD = multiprocessor

° Cost savings in centralized instruction sequencer

° Simple, regular calculations usually have good locality

• realize on SM or MP machine with decent compiler

• may still require fast global synchronization

° Programming model converges to SPMD
• data parallel appears as a convention in many languages


Fortran90 Execution Model

• Sequential composition of parallel (or scalar) statements
• Parallel operations on arrays
• Arrays have rank (# dimensions), shape (extents), type (elements)
  – HPF adds layout
• Communication implicit in array operations
• Configuration independent


Example: gravitational fish

integer, parameter :: nfish = 10000

complex fishp(nfish), fishv(nfish), force(nfish), accel(nfish)

real fishm(nfish)

. . .

do while (t < tfinal)

t = t + dt

fishp = fishp + dt*fishv

call compute_current(force,fishp)

accel = force/fishm

fishv = fishv + dt*accel

...

enddo

. . .

subroutine compute_current(force,fishp)

complex force(:),fishp(:)

force = (3,0)*(fishp*(0,1))/(max(abs(fishp),0.01)) - fishp

end

(Slide annotations: the whole-array assignments are parallel assignments; *, +, and / act as pointwise parallel operators.)


Array Operations

Parallel Assignment

A = 0 ! scalar extension

L = .TRUE.

B = [1,2,3,4] ! array constructor

X = [1:n] ! real sequence [1.0, 2.0, . . .,n]

I = [0:100:4] ! integer sequence [0,4,8,...,100]

C = [ 50[1], 50[2,3] ] ! 150 elements: 50 ones, then 2,3 repeated 50 times

D = C ! array copy

Binary array operators operate pointwise on conformable arrays

• have the same size and shape
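The constructor notation above is slide shorthand; a minimal, compilable Fortran 90 version of the same assignments (my own sketch) uses (/ ... /) constructors and implied-do loops for the sequences:

program array_ops_demo
  implicit none
  integer :: i
  integer :: b(4), iseq(26)
  real    :: x(10), a(10)
  logical :: l(10)

  a = 0.0                          ! scalar extension
  l = .true.
  b = (/ 1, 2, 3, 4 /)             ! array constructor
  x = (/ (real(i), i = 1, 10) /)   ! real sequence 1.0, 2.0, ..., 10.0
  iseq = (/ (i, i = 0, 100, 4) /)  ! integer sequence 0, 4, 8, ..., 100
  print *, b + b                   ! pointwise operation on conformable arrays
end program array_ops_demo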


Array Sections

Portion of an array defined by a triplet in each dimension
• may appear wherever an array is used

A(3) ! third element

A(1:5) ! first five elements

A(1:5:1) ! same

A(:5) ! same

A(1:10:2) ! odd elements in order

A(10:2:-2) ! even in reverse order

A(10:2:2) ! []

B(1:2,3:4) ! 2x2 block

B(1, :) ! first row

B(:, j) ! jth column
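These forms are easy to check with a small self-contained program (my own sketch; the arrays a and b are illustrative):

program sections_demo
  implicit none
  integer :: i
  integer :: a(10), b(4,4)

  a = (/ (i, i = 1, 10) /)
  b = reshape((/ (i, i = 1, 16) /), (/ 4, 4 /))

  print *, a(1:5)        ! first five elements
  print *, a(1:10:2)     ! odd-indexed elements in order
  print *, a(10:2:-2)    ! even-indexed elements in reverse order
  print *, b(1, :)       ! first row
  print *, b(1:2, 3:4)   ! 2x2 block
end program sections_demo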


Reduction Operators

Reduce an array to a scalar under an associative binary operation

• sum, product

• minval, maxval

• count (number of .TRUE. elements of logical array)

• any, all

Reductions are the simplest form of communication.

do while (t < tfinal)

t = t + dt

fishp = fishp + dt*fishv

call compute_current(force,fishp)

accel = force/fishm

fishv = fishv + dt*accel

fishspeed = abs(fishv)

mnsqvel = sqrt(sum(fishspeed*fishspeed)/nfish)

dt = .1*maxval(fishspeed) / maxval(abs(accel))

enddo

(Slide annotation: implicit broadcast of the scalar reduction results.)


Conditional Operation

force = (3,0)*(fishp*(0,1))/(max(abs(fishp),0.01)) - fishp

could use

dist = 0.01

where (abs(fishp) > dist) dist = abs(fishp)

or

far = abs(fishp) > 0.01

where (far) dist = abs(fishp)

or

where (abs(fishp) .ge. 0.01)

dist = abs(fishp)

elsewhere

dist = 0.01

end where

No nested wheres. Only assignments are allowed in the body of a where. The boolean expression is really a mask array.
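A small self-contained check of the where/elsewhere form (my own sketch; the 5-element array is illustrative):

program where_demo
  implicit none
  real :: fishp_abs(5), dist(5)

  fishp_abs = (/ 0.001, 0.5, 0.02, 0.0, 2.0 /)
  where (fishp_abs >= 0.01)
    dist = fishp_abs       ! assigned only where the mask is true
  elsewhere
    dist = 0.01            ! assigned where the mask is false
  end where
  print *, dist            ! 0.01  0.5  0.02  0.01  2.0
end program where_demo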


Forall in HPF (Extends F90)

FORALL ( triplet, triplet,…,mask ) assignment

forall ( i = 1:n) A(i) = 0 ! same as A = 0

forall ( i = 1:n ) X(i) = i ! same as X = [ 1:n ]

forall (i=1:nfish) fishp(i) = (i*2.0/nfish)-1.0

forall (i=1:n, j = 1:m) H(i,j) = i+j

forall (i=1:n, j = 1:m) C(i+j*2) = j

forall (i = 1:n) D(Index(i)) = C(i,i) ! Maybe

forall (i=1:n, j = 1:n, k = 1:n)

* C(i,j) = C(i,j) + A(i,k) * B(k,j) ! NO

Evaluate entire RHS for all index values (in any order)

Perform all assignments (in any order)

No more than one value for each element on the left (may be checked)
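The “evaluate the entire RHS first” rule is what distinguishes FORALL from an ordinary do loop; a small sketch (my own example, and FORALL needs an HPF or Fortran 95 compiler):

program forall_demo
  implicit none
  integer :: i
  real :: a(5)

  a = (/ 1.0, 2.0, 3.0, 4.0, 5.0 /)
  ! the whole right-hand side is evaluated before any assignment, so this
  ! in-place shift reads the old values of a, unlike a sequential do loop
  forall (i = 2:5) a(i) = a(i-1)
  print *, a               ! 1.0  1.0  2.0  3.0  4.0
end program forall_demo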


Conditional (masked) intrinsics

Most intrinsics take an optional mask argument

funny_prod = product( A, A .ne. 0)

bigem = maxval(A, mask = inside )

Use of masks in the FORALL assignment (HPF)

forall ( i=1:n, j=1:m, A(i,j) .ne. 0.0 ) B(i,j) = 1.0 / A(i,j)

forall ( i=1:n, inside) A(i) = i/n


Subroutines

• Arrays can be passed as arguments.

• Shapes must match.

• Limited dynamic allocation

• Arrays passed by reference, sections by value (i.e., a copy is made)

• HPF: either remap or inherit

• Can extract array information using inquiry functions
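A minimal sketch (my own, not the course code) of an assumed-shape array argument together with the inquiry functions; the module provides the explicit interface that assumed-shape dummies require:

module scaling
contains
  subroutine scale_rows(a, s)
    implicit none
    real, intent(inout) :: a(:,:)   ! assumed-shape: extents come from the caller
    real, intent(in)    :: s
    integer :: n, m

    n = size(a, 1)                  ! inquiry: extent of the first dimension
    m = size(a, 2)                  ! inquiry: extent of the second dimension
    a(1:n, 1:m) = s * a(1:n, 1:m)   ! whole-array (data parallel) operation
  end subroutine scale_rows
end module scaling

A caller would “use scaling” and can pass any two-dimensional real array or array section.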


Implicit Communication

Operations on conformable array sections may require data movement

i.e., communication

A(1:10, : ) = B(1:10, : ) + B(11:20, : )

Parallel finite difference A’[i] = (A[i+1] - A[i])*dt

A(1:n-1) = (A(2:n) - A(1:n-1)) * dt

Example: smear pixels

show(:,1:m-1) = show(:,1:m-1) + show(:,2:m)

show(1:m-1,:) = show(1:m-1,:) + show(2:m,:)


Global Communication

c(:, 1:5:2) = c(:, 2:6:2) ! shift noncontiguous sections

D = D(10:1:-1) ! permutation (reverse)

A = [1,0,2,0,0,0,4]

I = [1,3,7]

B = A(I) ! B = [1,2,4] ``gather''

C(I) = B ! C = A ``scatter'' (no duplicates on left)

D = A([1,1,3,3]) ! replication
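The gather/scatter shorthand above is ordinary Fortran 90 vector subscripting; a small self-contained sketch (my own) replaying it:

program gather_scatter_demo
  implicit none
  integer :: a(7), idx(3), b(3), c(7)

  a   = (/ 1, 0, 2, 0, 0, 0, 4 /)
  idx = (/ 1, 3, 7 /)
  b   = a(idx)        ! gather:  b = (/ 1, 2, 4 /)
  c   = 0
  c(idx) = b          ! scatter: legal because idx has no duplicates; here c ends up equal to a
  print *, b
  print *, c
end program gather_scatter_demo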


Specialized Communication

CSHIFT( array, shift [, dim] ) ! cyclic shift in one dimension

EOSHIFT( array, shift [, boundary, dim] ) ! end-off shift

TRANSPOSE( matrix ) ! matrix transpose

SPREAD(array, dim, ncopies)
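A minimal sketch (my own example) exercising these intrinsics with the standard Fortran 90 argument order, CSHIFT(array, shift, dim), EOSHIFT(array, shift, boundary, dim), SPREAD(source, dim, ncopies):

program comm_intrinsics_demo
  implicit none
  integer :: i
  integer :: v(5), m(2,3)

  v = (/ (i, i = 1, 5) /)
  m = reshape((/ (i, i = 1, 6) /), (/ 2, 3 /))

  print *, cshift(v, shift=1)               ! 2 3 4 5 1   (cyclic shift)
  print *, eoshift(v, shift=1, boundary=0)  ! 2 3 4 5 0   (end-off shift)
  print *, transpose(m)                     ! the 3x2 transpose of m
  print *, spread(v, dim=2, ncopies=2)      ! shape (5,2): two copies of v
end program comm_intrinsics_demo

On a distributed-memory target these intrinsics are exactly the kind of operation that implies communication.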


Example: nbody calculation

subroutine compute_gravity(force,fishp,fishm,nfish)

complex force(:),fishp(:),fishm(:)

complex fishmp(nfish), fishpp(DSHAPE(fishp)), dif(DSIZE(force))

integer k

force = (0.,0.)

fishpp = fishp

fishmp = fishm

do k=1, nfish-1

fishpp = cshift(fishpp, DIM=1, SHIFT=-1)

fishmp = cshift(fishmp, DIM=1, SHIFT=-1)

dif = fishpp - fishp

force = force + (fishmp * fishm * dif / (abs(dif)*abs(dif)))

enddo

end


HPF Data Distribution (layout) directives

° Can ALIGN arrays with other arrays for affinity
• elements that are operated on together

° Can ALIGN with TEMPLATE for abstract index space

° Can DISTRIBUTE arrays over processor grids

° Compiler maps processor grids to physical procs.

(Diagram: Arrays are ALIGNed with other arrays or templates, which are DISTRIBUTEd onto abstract processors, which the compiler maps onto the physical computer.)


Alignment

ALIGN A(I) WITH B(I)

ALIGN A(I) WITH B(I+2)

ALIGN C(I) WITH B(2*I)

ALIGN D(i,j) WITH E(j,i)

ALIGN D(:,*) WITH A(:)  - collapse dimension
ALIGN A(:) WITH D(*,:)  - replication

(Diagram: boxes for A, B, and C illustrating the identity, offset-by-2, and stride-2 alignments above.)


Layouts

(Diagram: the six 2-D distribution patterns: (Block, *), (*, Block), (Block, Block), (Cyclic, *), (Cyclic, Block), (Cyclic, Cyclic).)
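A hedged sketch (my own; the array names and sizes are illustrative) of requesting some of these layouts with HPF DISTRIBUTE directives:

      program layout_demo
      real a(1024,1024), b(1024,1024), c(1024,1024)
! rows kept whole: contiguous blocks of rows per processor
!HPF$ DISTRIBUTE a(BLOCK, *)
! both dimensions dealt out round-robin
!HPF$ DISTRIBUTE b(CYCLIC, CYCLIC)
! 2-D blocks, one per processor of a 2-D processor grid
!HPF$ DISTRIBUTE c(BLOCK, BLOCK)
      a = 0.0
      b = 1.0
      c = a + b
      end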


Example

Declaring Processor Grids:

!HPF$ PROCESSORS P(32)
!HPF$ PROCESSORS Q(4,8)

Distributing Arrays onto Processor Grids:

!HPF$ PROCESSORS p(32)
      real D(1024), E(1024)
!HPF$ DISTRIBUTE D(BLOCK)

!HPF$ DISTRIBUTE E(BLOCK) ONTO p
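To connect the directives, a hedged sketch (my own; t, procs, a, and b are illustrative names) that aligns two arrays to a shared template and distributes the template:

      program align_demo
      real a(100), b(100)
!HPF$ PROCESSORS procs(4)
!HPF$ TEMPLATE t(100)
! co-locate a(i) and b(i) so that a = a + b needs no communication
!HPF$ ALIGN a(i) WITH t(i)
!HPF$ ALIGN b(i) WITH t(i)
!HPF$ DISTRIBUTE t(BLOCK) ONTO procs
      a = 1.0
      b = 2.0
      a = a + b
      end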


Independent

° assert that the iterations of a do-loop can be performed independently without changing the result computed.

• in any order or concurrently

!HPF$ INDEPENDENT

do i=1,n

A(Index(i)) = B(i)

enddo


Extra slides


Recap: Historical Perspective

° Diverse spectrum of parallel machines designed to implement a particular programming model directly

° Technological convergence on collections of microprocessors on a scalable interconnection network

° Map any programming model to simple hardware
• with some specialization

(Diagram: early programming models and the machines built to implement them directly: Shared Address Space on centralized shared memory, Message Passing on hypercubes and grids, Data Parallel on SIMD. The converged architecture is a scalable interconnection network connecting nodes that are essentially complete computers: processor (P), cache ($), memory (M), and a communication assist (CA).)


Basics of a Parallel Language

° How is parallelism expressed?

° How is communication expressed?

° How is synchronization expressed?

° What global / local data structures can be constructed?

° How do you optimize for performance?


Where are things going

° High-end
• collections of almost complete workstations/SMPs on a high-speed network
• with specialized communication assist integrated with the memory system to provide global access to shared data

° Mid-end
• almost all servers are bus-based CC SMPs
• high-end servers are replacing the bus with a network
  - Sun Enterprise 10000, IBM J90, HP/Convex SPP
• volume approach is Ppro quadpack + SCI ring
  - Sequent, Data General

° Low-end
• SMP desktop is here

° Major change ahead
• SMP on a chip as a building block


Scan Operations

forall (i=1:5) B(i) = SUM( A(1:i) ) ! forward running sum

forall (i=1:n) B(i) = SUM( A(n-i+1:n) ) ! reverse direction

dimension fact(n)

fact = [1:n]

forall (i=1:n) fact(i) = product( fact(1:i) )

or

CMF_SCAN_op (dest,source,segment,axis,direction,inclusion,mode,mask)

op = [add,max,min,copy,ior,iand,ieor]


CMF Homes and Layouts

• Machine is really a front-end sparc connected to a bunch of sparc nodes.

• Arrays have a home: FE or CM. Don't mismatch homes!
  - If you do parallel operations on it, it lives on the CM
  - If you specify a layout, it lives on the CM
  - A FORALL with a user function or intrinsic will usually end up on the front end

• Arrays have a layout
  - The compiler views the machine as a multidimensional grid of processors
  - A multidimensional array is placed over this grid in blocks of some shape
  - The layout directive tells the compiler which dimensions should remain processor-local
  - If you give a layout directive, all subroutines that operate on the array must give the same one!


CMF Layout Example

C Layout matrix as 8x8 grid of nxn blocks

integer, parameter :: p = 8

integer, parameter :: n = 64

double precision, array(p,p,n,n) :: A,B,C

CMF$ LAYOUT a(:NEWS,:NEWS,:SERIAL,:SERIAL)

CMF$ LAYOUT b(:NEWS,:NEWS,:SERIAL,:SERIAL)

CMF$ LAYOUT c(:NEWS,:NEWS,:SERIAL,:SERIAL)

call CMF_describe_array(A)


2D Electromagnetics

c D[ex,t] = c ( D[bz,y] - jx )

ex(1:m-1,2:n-1) = ex(1:m-1,2:n-1) +

* epsy * (bz(1:m-1,2:n-1) - bz(1:m-1,1:n-2)) - jx(1:m-1,2:n-1)

c D[ey,t] = c ( -D[bz,x] - jy )

ey(2:m-1,1:n-1) = ey(2:m-1,1:n-1) -

* epsx * (bz(2:m-1,1:n-1) - bz(1:m-2,1:n-1)) - jy(2:m-1,1:n-1)

c D[bz,t] = c ( D[ex,y] - D[ey,x] )

bz(1:m-1,1:n-1) = bz(1:m-1,1:n-1) + (

* epsy * (ex(1:m-1,2:n) - ex(1:m-1,1:n-1)) -

* epsx * (ey(2:m,1:n-1) - ey(1:m-1,1:n-1)) )


Stencil Calculations

nbrcnt(2:d-1,2:d-1) = fish(1:d-2,2:d-1) + fish(1:d-2,1:d-2) +
*                     fish(1:d-2,3:d)   + fish(2:d-1,1:d-2) +
*                     fish(2:d-1,3:d)   + fish(3:d,2:d-1)   +
*                     fish(3:d,1:d-2)   + fish(3:d,3:d)

How do you eliminate unnecessary data movement?


Blocking

subroutine compute_gravity(force,fishp,fishm,nblocks)

complex force(:,B),fishp(:,B),fishm(:,B)

complex fishmp(nblocks,B), fishpp(nblocks,B),dif(nblocks,B)

CMF$ layout force(:news, :serial), . . .

force = (0.,0.)

fishpp = fishp

fishmp = fishm

do k=1, nblocks-1

fishpp = cshift(fishpp, DIM=1, SHIFT=-1)

fishmp = cshift(fishmp, DIM=1, SHIFT=-1)

do j = 1, B

forall (i = 1:nblocks) dif(i,:) = fishpp(i,j) - fishp(i,:)

forall (i = 1:nblocks) force(i,:) = force(i,:) +

* (fishmp(i,j) * fishm(i,:) * dif(i,:) / (abs(dif(i,:))*abs(dif(i,:))))

end do

enddo


Load balancing in Wator worlds?

° What if the current function were highly irregular?

° What if fish breed?

° How would you optimize away open ocean?

° How do you compute the fish density?

(Figure: fish positions in the ocean and a grid of fish-per-cell counts.)


Other Data Parallel Languages

• *LISP, C*

• NESL, FP

• PC++

• APL, MATLAB, . . .