Page 1:

CS4402 – Parallel Computing

Lecture 7

Parallel Graphics – More Fractals

Scheduling

Page 2:

FRACTALS

Page 3:

Fractals

A fractal is a set of points such that:

- its fractal dimension exceeds its topological dimension [infinite detail at every point].

- it satisfies self-similarity: any part of the fractal is similar to the whole fractal.

Generating a fractal is an iterative process:

- start from P0

- iteratively generate P1=F(P0), P2=F(P1), …, Pn=F(Pn-1), …

P0 is a set of initial points

F is a transformation:

Geometric transformations: translations, rotations, scaling, …

Non-linear coordinate transformations.

Page 4:

We work with two rectangular areas.

The user space:

- real coordinates (x,y)

- bounded by [xMin,xMax] × [yMin,yMax]

The screen space:

- integer coordinates (i,j)

- bounded by [0,w-1] × [0,h-1]

- is upside down: the Oy axis points downward

How do we squeeze the user space into the screen space?

How do we translate (x,y) into (i,j)?

Points vs Pixels
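Below is a minimal sketch of the point-to-pixel translation, assuming the names w, h, xMin, xMax, yMin, yMax from above; the function pixel_to_point is an illustrative name, not code from the lecture.

void pixel_to_point(int i, int j, int w, int h,
                    double xMin, double xMax, double yMin, double yMax,
                    double *x, double *y)
{
    // column i is mapped linearly onto [xMin, xMax]
    *x = xMin + i * (xMax - xMin) / (w - 1);

    // row j is mapped onto [yMin, yMax]; we subtract from yMax because
    // the screen's Oy axis points downward
    *y = yMax - j * (yMax - yMin) / (h - 1);
}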

Page 5:

Julia Sets – Self-Squaring Fractals

Consider the generating function F(z) = z^2 + c, where z, c ∈ C.

Sequence of complex numbers: z_0 ∈ C and z_{n+1} = z_n^2 + c.

Chaotic behaviour, but two attractors for |z_n|: 0 and +∞.

For a given c ∈ C, the Julia set J_c consists of all the points whose orbit remains finite (bounded).

Page 6:

Julia Sets – Algorithm

Inputs:

c ∈ C the complex parameter; [xmin,xmax] × [ymin,ymax] a region in the plane.

Niter the number of iterations per orbit; R a threshold for the attractor.

Output: Jc the Julia set of c

Algorithm

For each pixel (i,j) on the screen

translate (i,j) into (x,y)

construct z_0 = x + y·i (i here denotes the imaginary unit)

find the orbit of z_0 [first Niter elements]

if (all the orbit points are under the threshold R) draw the pixel (i,j)
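The code on the next slide calls func and CompAbs without defining them. One plausible implementation is sketched here, assuming a simple Complex struct; these definitions are illustrative, not taken from the lecture.

#include <math.h>

typedef struct { double re, im; } Complex;

// F(z) = z^2 + c
Complex func(Complex z, Complex c)
{
    Complex r;
    r.re = z.re*z.re - z.im*z.im + c.re;
    r.im = 2*z.re*z.im + c.im;
    return r;
}

// |z|
double CompAbs(Complex z)
{
    return sqrt(z.re*z.re + z.im*z.im);
}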

Page 7:

for (i = 0; i < width; i++)
for (j = 0; j < width; j++) {

    int k = 0;

    // construct the orbit of z
    z.re = XMIN + i*STEP;
    z.im = YMIN + j*STEP;
    for (k = 0; k < NUMITER; k++) {
        z = func(z, c);
        if (CompAbs(z) > R) break;
    }

    // test if the orbit stayed bounded (the inner loop never broke out)
    if (k > NUMITER-1) {
        MPE_Draw_point(graph, i, j, MPE_YELLOW);
        MPE_Update(graph);
    }
    else {
        MPE_Draw_point(graph, i, j, MPE_RED);
        MPE_Update(graph);
    }
}

Page 8:

Julia Sets – || Algorithm

Remark 1.

The double for loop on (i,j) can be split across the processors, e.g.:

uniform block or cyclic on i.

uniform block or cyclic on j.

There is no communication at all between the processors, therefore this is an embarrassingly || computation.

Remark 2.

Each processor draws a block of the fractal, or several rows, on the XGraph.

Processor P rank knows the area it has to draw.

Page 9:

// uniform block on i; three alternative partitionings are left commented out:
for (i = rank*width/size; i < (rank+1)*width/size; i++)
for (j = 0; j < width; j++) {
// for (i = rank; i < width; i += size) for (j = 0; j < width; j++) {                    // cyclic on i
// for (i = 0; i < width; i++) for (j = rank*width/size; j < (rank+1)*width/size; j++) { // block on j
// for (i = 0; i < width; i++) for (j = rank; j < width; j += size) {                    // cyclic on j

    int k = 0;

    // construct the orbit of z
    z.re = XMIN + i*STEP;
    z.im = YMIN + j*STEP;
    for (k = 0; k < NUMITER; k++) {
        z = func(z, c);
        if (CompAbs(z) > R) break;
    }

    // test if the orbit stayed bounded
    if (k > NUMITER-1) {
        MPE_Draw_point(graph, i, j, MPE_YELLOW);
        MPE_Update(graph);
    }
    else {
        MPE_Draw_point(graph, i, j, MPE_RED);
        MPE_Update(graph);
    }
}

Page 12:

The Mandelbrot Set

THE MANDELBROT FRACTAL IS AN INDEX FOR JULIA FRACTALS

The Mandelbrot set contains all the points c ∈ C such that

z_0 = 0 and z_{n+1} = z_n^2 + c has a finite (bounded) orbit.

Inputs: [xmin,xmax] × [ymin,ymax] a region in the plane.

Niter the number of iterations per orbit; R a threshold for the attractor.

Output: M the Mandelbrot set.

Algorithm

For each (x,y) in [xmin,xmax] × [ymin,ymax]

c = x + i*y;

find the orbit of z_0 = 0 while under the threshold.

if (all the orbit points are under the threshold) draw the pixel for c = (x,y)

Page 13:

for (i = 0; i < width; i++)
for (j = 0; j < width; j++) {

    int k = 0;

    // construct the point c
    c.re = XMIN + i*STEP;
    c.im = YMIN + j*STEP;

    // construct the orbit of 0
    z.re = z.im = 0;
    for (k = 0; k < NUMITER; k++) {
        z = func(z, c);
        if (CompAbs(z) > R) break;
    }

    // test if the orbit stayed bounded
    if (k > NUMITER-1) {
        MPE_Draw_point(graph, i, j, MPE_YELLOW);
        MPE_Update(graph);
    }
    else {
        MPE_Draw_point(graph, i, j, MPE_RED);
        MPE_Update(graph);
    }
}

Page 14:

The Mandelbrot Set – || Algorithm

Remark 1.

The double for loop on (i,j) can be split across the processors, e.g.:

uniform block or cyclic on i.

uniform block or cyclic on j.

There is no communication at all between the processors, therefore this is an embarrassingly || computation.

Remark 2.

When the orbit goes to infinity, escaping after k steps, we can draw the pixel (i,j) with the k-th colour from a palette, as sketched below.

Bands with the same colour contain points with the same escape behaviour.
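A minimal sketch of Remark 2, assuming a palette array colors[0..NUMITER-1] of MPE colours (e.g. built with MPE_Make_color_array); the palette is an assumption, not part of the original code.

// inside the per-pixel loop, after the orbit loop has set k
if (k > NUMITER-1) {
    // bounded orbit: the point belongs to the set
    MPE_Draw_point(graph, i, j, MPE_BLACK);
}
else {
    // escaped after k steps: colour by escape time
    MPE_Draw_point(graph, i, j, colors[k]);
}
MPE_Update(graph);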

Page 16:

Fractal and Prime Numbers

Prime numbers can generate fractals.

Remarks:

- If p > 5 is prime then p%5 is 1, 2, 3 or 4.
- The values 1, 2, 3, 4 represent directions to move, e.g. left, right, up, down.
- The fractal has the sizes w and h.

Step 1. Initialise a matrix of colours with 0.

Step 2. For each number p > 5:

if p is prime then
    if (p%5==1) x = (x-1)%w;
    if (p%5==2) x = (x+1)%w;
    if (p%5==3) y = (y-1)%h;
    if (p%5==4) y = (y+1)%h;
    increase the colour of (x,y)

Step 3. Draw the pixels with the colour matrix.

Page 17:

Simple Remarks

The prime number set is infinite; furthermore, it has no pattern.

prime: 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, …

move: 3, 0, 2, 1, 3, 2, 4, 3, 4, 1, 2, …

The set of moves satisfies:

- it does not have any pattern: the moves are quite random.

- the numbers of 1-s, 2-s, 3-s and 4-s moves are quite similar,

hence the central pixels are reached more often.

The primality testing inside the for loop is the most expensive operation.
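The code on the next slide relies on an isPrime function that the slides do not define. A simple trial-division sketch (an assumption, not the lecture's version) makes the cost visible: each test is O(√p), which is why the loop dominates the running time.

// trial division: O(sqrt(p)) per test
int isPrime(long p)
{
    if (p < 2) return 0;
    for (long d = 2; d*d <= p; d++)
        if (p % d == 0) return 0;
    return 1;
}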

Page 18:

// initialise the matrix with 0
for (i = 0; i < width; i++)
    for (j = 0; j < width; j++)
        map[i][j] = 0;

// start from the image centre
posX = posY = width/2;

// traverse the odd numbers and keep the primes
for (i = 0; i < n; i++) {

    if (isPrime(2*i+1)) {

        // move to a new position on the map and increment it
        // (adding width before % keeps the index non-negative in C)
        move = (2*i+1) % 5;
        if (move == 1) posX = (posX-1+width) % width;
        if (move == 2) posX = (posX+1) % width;
        if (move == 3) posY = (posY-1+width) % width;
        if (move == 4) posY = (posY+1) % width;

        map[posY][posX]++;
    }
}

Page 19:

Parallel Computation: Simple Remarks

Processor rank gets some primes to test, using some partitioning of the candidates.

Processor rank therefore traverses the pixels according to its own sequence of moves.

Processor rank has to work with its own matrix map.

The maps must be reduced (summed elementwise) onto processor 0 to find the total number of hits.

Page 20:

Parallel Computation: Simple Remarks

The parallel computation of processor rank follows the steps:

1. Initialise the matrix map.

2. For each prime number assigned to rank do

1. Find the move and go to a new location

2. Increment the map

3. Reduce the matrix map.

4. If processor 0 then draw the map.

Page 21:

Splitting Loops

How to split the sequential loop if we have size processors?

Maths: n iterations & size processors ⇒ n/size iterations per processor.

for (i = 0; i < n; i++) {
    // body of loop
    loop_body(data, i);
}

Page 22:

Splitting Loops in Similar Blocks

P rank gets the iterations rank*n/size, rank*n/size+1,…, (rank+1)*n/size-1

for (i = rank*n/size; i < (rank+1)*n/size; i++) {
    // acquire the data for this iteration
    loop_body(data, i);
}


Page 23:

Splitting Loops in Cycles

P rank gets the iterations rank, rank+size, rank+2*size, …

for (i = rank; i < n; i += size) {
    // acquire the data for this iteration
    loop_body(data, i);
}


Page 24:

Splitting Loops in Variable Blocks

P rank gets the iterations l[rank], l[rank]+1,…, u[rank]

for (i = l[rank]; i <= u[rank]; i++) {
    // acquire the data for this iteration
    loop_body(data, i);
}
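How l[rank] and u[rank] are chosen is the subject of the balanced-workload slides below; as a preview, here is a hedged sketch of one way to compute them when the workloads w[0..n-1] are known in advance (the function and array names are assumptions).

void balanced_blocks(const double *w, int n, int size, int *l, int *u)
{
    double total = 0, acc = 0;
    int i, p = 0;

    for (i = 0; i < n; i++) total += w[i];

    l[0] = 0;
    for (i = 0; i < n; i++) {
        acc += w[i];
        // close block p once the cumulative workload passes its quota
        while (p < size-1 && acc >= total * (p+1) / size) {
            u[p] = i;
            l[p+1] = i + 1;
            p++;
        }
    }
    u[size-1] = n - 1;   // the last block takes whatever remains
}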


Page 25:

// initialise the matrix with 0
for (i = 0; i < width; i++)
    for (j = 0; j < width; j++)
        map[i][j] = 0;

// start from the image centre
posX = posY = width/2;

// traverse this processor's block of prime candidates
for (i = rank*n/size; i < (rank+1)*n/size; i++) {

    if (isPrime(p = 2*i+1)) {

        // move to a new position on the map and increment it
        move = p % 5;
        if (move == 1) posX = (posX-1+width) % width;
        if (move == 2) posX = (posX+1) % width;
        if (move == 3) posY = (posY-1+width) % width;
        if (move == 4) posY = (posY+1) % width;

        map[posY][posX]++;
    }
}

// sum the local maps on processor 0 (map and globalMap must be contiguous,
// e.g. static arrays, for the reduction over width*width elements)
MPI_Reduce(&map[0][0], &globalMap[0][0], width*width, MPI_LONG, MPI_SUM, 0,
           MPI_COMM_WORLD);

if (rank == 0) {
    for (i = 0; i < width; i++)
        for (j = 0; j < width; j++)
            MPE_Draw_point(graph, i, j, colors[globalMap[i][j]]);
}

Page 27:

Scheduling

Page 28:

Parallel Loops

Parallel loops represent the main source of parallelism.

Consider a system with p processors P1, P2, …, Pp and the loop:

for i=1, n do

call loop_body(i)

end for

Scheduling Problem:

Map the iterations {1,2,…,n} onto processors so that:

- the execution time is minimal.

- the execution times per processor are balanced.

- the processors' idle time is minimal.

Page 29:

Parallel Loops

Suppose that the workload of loop_body is known and given by w1, w2, …, wn.

For processor PJ the set of iterations is SJ = {i1, i2, …, ik}, so:

- the execution time of processor PJ is T(PJ) = ∑{wi : i ∈ SJ}.

- the execution time of the parallel loop is T = max{T(PJ) : J = 1, 2, …, p}.

For example, for n = 4 iterations with workloads w = (1, 2, 3, 4) on p = 2 processors, the block partition {1, 2}, {3, 4} gives T = max(3, 7) = 7, while the pairing {1, 4}, {2, 3} gives T = max(5, 5) = 5.

Static scheduling: the partition is found at compile time.

Dynamic scheduling: the partition is found at run time.

Page 30:

Data Dependency

A dependency exists between program statements when the order of statement execution affects the results of the program.

A data dependency results from multiple uses of the same storage location(s) by different tasks: the data produced by one task is input for another.

Dependencies are important to parallel programming because they are one of the primary inhibitors to parallelism.

Loops whose iterations carry data dependencies cannot be scheduled in parallel.

Example: The following for loop contains a data dependency: iteration i reads a[i-1], which is written by iteration i-1.

for i=1, n do

a[i]=a[i-1]+1

end for

Page 31:

Load Balancing

Load balancing refers to the practice of distributing work among the processors so that all processors are kept busy all of the time.

If all the processor execution times are the same, then a perfect load balance is achieved.

Load imbalance is the most important overhead of parallel computation and reflects the case when the processors' execution times differ.

Page 34:

Useful Rules:

- If the workloads are similar, then use static uniform block scheduling.

- If the workloads increase/decrease, then use static cyclic scheduling.

- If the workloads are known and simple, then use them to guide the load balance.

- If the workloads are not known, then use dynamic methods.

Page 35:

Balanced Workload Block Scheduling

w1, w2, …, wn are the workloads of the iterations:

- the total workload is w1 + w2 + … + wn

- the average per processor is W = (w1 + w2 + … + wn) / size

Each processor gets consecutive iterations:

- l_rank, u_rank – the lower and upper indices of the block

- the workload of the block is w_{l_rank} + w_{l_rank+1} + … + w_{u_rank} ≈ W

Page 36:

Balanced Workload Block Scheduling

It is simpler to work with integrals: model the workload as a continuous function w(i).

The average workload per processor is

W = (1/size) · ∫_0^n w(i) di

Each processor id gets a block [x_id, x_{id+1}] holding an average share of the workload:

∫_{x_id}^{x_{id+1}} w(i) di = W

Summing the first id blocks, the cut points satisfy

∫_0^{x_id} w(i) di = id · W
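As a worked example (an assumption for illustration, not from the slides), take a linearly growing workload w(i) = i. Then ∫_0^x w(i) di = x^2/2, so W = n^2/(2·size), and solving x_id^2/2 = id·W gives x_id = n·√(id/size).

#include <math.h>

// cut point x_id for the workload w(i) = i: x_id = n * sqrt(id / size)
int cut(int id, int n, int size)
{
    return (int)(n * sqrt((double)id / size));
}

// processor rank would then execute the iterations
// cut(rank, n, size) .. cut(rank+1, n, size) - 1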

Page 42:

Granularity

Granularity is the ratio of computation to communication.

Periods of computation are typically separated from periods of communication by synchronization events.

Fine-grain Parallelism: relatively small amounts of computational work are done between communication events.

It facilitates load balancing, but implies high communication overhead and less opportunity for performance enhancement.

Coarse-grain Parallelism: relatively large amounts of computational work are done between communication/synchronization events. It is harder to load balance efficiently.