
Lecture 39: Review Session #1

• Reminders
– Final exam: Thursday 12/18/2014 @ 3:10pm, Sloan 150
– Course evaluation (Blue Course Evaluation): access through zzusis


Problem #1

• Consider a system with two multiprocessor machines in the following configurations:
– Machine A has two processors, each with a 512 MB local memory; local memory access latency is 20 cycles per word, and remote memory access latency is 60 cycles per word.
– Machine B has two processors with a shared 1 GB memory and an access latency of 40 cycles per word.
• Suppose an application runs one thread on each of the two processors, and each thread needs to access an entire array of 4096 words.
– Is it possible to partition this array across the local memories of machine A so that the application runs faster on A than on B?
• If so, specify the partitioning.
• If not, by how many more cycles must B's memory latency worsen before some partitioning on A runs faster than B?
– Assume that memory operations dominate the execution time.


Solution #1

• Suppose x words are local to one processor and (T − x) words are local to the other, where T = 4096.
• Execution time on A = max(20x + 60(T − x), 60x + 20(T − x)) = max(60T − 40x, 20T + 40x)
• This max is minimized at x = T/2, where it equals 40T (unit time).
• Execution time on B = 40T
• So we cannot make A faster than B. However, if each B access were one cycle slower (that is, a 41-cycle access latency), A could be faster.



Problem #2

• Consider a multi-core processor with heterogeneous cores A, B, C and D, where core B runs twice as fast as A, core C runs three times as fast as A, and cores D and A run at the same speed (i.e., they have the same processor frequency, microarchitecture, etc.). Suppose an application needs to compute the square of each element in an array of 256 elements. Consider the following two divisions of labor:

– (a) Core A: 32 elements, Core B: 128 elements, Core C: 64 elements, Core D: 32 elements
– (b) Core A: 48 elements, Core B: 128 elements, Core C: 80 elements, Core D: unused

• Compute (1) the total execution time taken in the two cases and (2) the cumulative processor utilization (the amount of total time the processors are not idle, divided by the total execution time). For case (b), how would the utilization change if you do not count Core D (assuming we have another application to run on Core D)? Ignore cache effects by assuming that a perfect pre-fetcher is in operation.


Solution #2

• (1) Total execution time
– (a) Total execution time = max(32/1, 128/2, 64/3, 32/1) = 64 (unit time)
– (b) Total execution time = max(48/1, 128/2, 80/3, 0/1) = 64 (unit time)
• (2) Utilization
– (a) Utilization = (32/1 + 128/2 + 64/3 + 32/1) / (4 × 64) ≈ 0.58
– (b) Utilization = (48/1 + 128/2 + 80/3 + 0/1) / (4 × 64) ≈ 0.54
– (b) Utilization (if Core D is ignored) = (48/1 + 128/2 + 80/3) / (3 × 64) ≈ 0.72



Problem #3

• How would you rewrite the following sequential code so that it can be run as two parallel threads on a dual-core processor?

int A[80], B[80], C[80], D[80];

for (int i = 0; i < 40; i++)
{
    A[i] = B[i] * D[2*i];
    C[i] = C[i] + B[2*i];
    D[i] = 2*B[2*i];
    A[i+40] = C[2*i] + B[i];
}



Solution #3

• The code can be rewritten into two threads as follows (the four arrays are shared between the threads, so they are declared only once):

int A[80], B[80], C[80], D[80];

• Thread 1:

for (int i = 0; i < 40; i++)
{
    A[i] = B[i] * D[2*i];
    C[i] = C[i] + B[2*i];
    A[i+40] = C[2*i] + B[i];
}

• Thread 2:

for (int i = 0; i < 40; i++)
{
    D[i] = 2*B[2*i];
}



RAID 0 and RAID 1

• RAID 0 has no additional redundancy (the name is a misnomer) – it uses an array of disks and stripes (interleaves) data across the disks to improve parallelism and throughput

• RAID 1 mirrors or shadows every disk – every write happens to two disks

• Reads to the mirror may happen only when the primary disk fails – or, you may try to read both together and the quicker response is accepted

• Expensive solution: high reliability at twice the cost


RAID 3

• Data is bit-interleaved across several disks, and a separate disk maintains parity information for a set of bits. On a failure, the parity bits are used to reconstruct the missing data.

• For example: with 8 disks, bit 0 is in disk-0, bit 1 is in disk 1, …, bit 7 is in disk-7; disk-8 maintains parity for all 8 bits

• For any read, 8 disks must be accessed (as we usually read more than a byte at a time) and for any write, 9 disks must be accessed as parity has to be re-calculated

• High throughput for a single request, low cost for redundancy (overhead: 12.5% in the above example), low task-level parallelism


RAID 4 / RAID 5

• Data is block interleaved – this allows us to get all our data from a single disk on a read – in case of a disk error, read all 9 disks

• Block interleaving reduces throughput for a single request (as only a single disk drive is servicing the request), but improves task-level parallelism as other disk drives are free to service other requests.

• On a write, we access the disk that stores the data and the parity disk – parity information can be updated simply by checking if the new data differs from the old data


RAID 3 vs RAID 4



RAID 5

• If we have a single disk for parity, multiple writes cannot happen in parallel (as all writes must update the parity info)

• RAID 5 distributes the parity block to allow simultaneous writes


Problem #4

• Discuss why RAID 3 is not suited for transaction processing applications. What kind of applications is it suitable for and why?



Solution #4

• RAID 3 is unsuited to transaction processing because each read involves activity at every disk, whereas in RAID 4 and RAID 5 a read involves activity at only one disk. RAID 3's disadvantages are mitigated when long sequential reads are common, but even then its performance never exceeds that of RAID 5. For this reason, RAID 3 has been all but abandoned commercially.
