A Seminar on Optimizations for Modern Architectures
Computer Science Seminar 2, 236802, Winter 2006
Lecturer: Erez Petrank
http://www.cs.technion.ac.il/~erez/courses/seminar


Page 1: Lecturer:  Erez Petrank cs.technion.ac.il/~erez/courses/seminar

A Seminar on Optimizations for Modern Architectures

Computer Science Seminar 2, 236802
Winter 2006

Lecturer: Erez Petrank

http://www.cs.technion.ac.il/~erez/courses/seminar

Page 2:

Topics today

• Administration
• Introduction
• Course topics
• Administration again

Page 3:

Course format

• Lectures 2-14 will be given by you.
• Each lecture will follow a lecture from a course by Ken Kennedy at Rice University.
• Slide drafts are at http://www.cs.rice.edu/~ken/comp515/lectures .
• Original lectures took 80 minutes.
• 16 such lectures, one by each registered student.

Page 4:

Grading

• Presentation: at least 85%.
• Participation in class: at most 15%.
• Reduction of 5 points for each (unjustified) absence.
• Presentation criteria:
  – Understanding the material
  – Slide effectiveness
  – Communicating the ideas

Page 5:

Seminar Requirement

• Easier than “standard”:
  – Material is taken from a book.
  – Draft slides are available on the internet.
• Harder than “standard”:
  – A presentation is 80 minutes instead of 50.
  – The presentation should be much better than the draft slides.
  – Presentations may depend on previous material, so one actually has to listen…
  – The lecturer has to know what he is talking about.
• Goal: make people understand!

Page 6:

Scheduling

• Since lectures are not synchronized with our 2×50-minute schedule, one must be prepared to start after the previous talk ends.
• Lecturers should send me their talk slides by 12:30 on the day of the talk.
• After the talk, please modify the slides to create a final version and send it to me by the end of Thursday.

Page 7:

The book

Optimizing Compilers for Modern Architectures

by Ken Kennedy & Randy Allen

(Available at the library: one reserved copy and two more for a week's loan.)

Page 8:

The nature of this topic

• Trying to automatically detect potential parallelization seems to be harder than expected.
• Requires some math, some common sense, some compiler knowledge, some NP-hardness, etc.
• At the frontline of programming-languages research today.
• Interesting from both a research and a practical point of view.

Page 9:

An Example

• Can we run the following loop iterations in parallel?

    for (i=0; i<1000; i++)
      for (j=0; j<1000; j++)
        for (k=0; k<1000; k++)
          for (l=0; l<1000; l++)
            A(i+3, 2j+2, k+5, l) = A(2j-i+k, 2k, 7, 2i)

• A sufficient question is whether one iteration writes to a location in A that another iteration reads from…
• This translates to whether there exist two integer vectors (i1,j1,k1,l1) and (i2,j2,k2,l2) such that i1+3 = 2j2-i2+k2, 2j1+2 = 2k2, k1+5 = 7, and l1 = 2i2.

(Solutions should also be within the loop bounds.)
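As an illustration, a brute-force check in C. The bound is shrunk from 1000 to 10 to keep the search tiny (an assumption for the sketch), and the equations are used to derive (i1,j1,k1,l1) directly from (i2,j2,k2,l2):

```c
#include <assert.h>

/* Search for integer vectors (i1,j1,k1,l1) and (i2,j2,k2,l2) satisfying
 * the dependence equations from the slide:
 *   i1+3 = 2*j2 - i2 + k2,  2*j1+2 = 2*k2,  k1+5 = 7,  l1 = 2*i2,
 * with all indices in [0, bound).  Returns 1 iff a solution exists. */
int has_dependence(int bound) {
    for (int i2 = 0; i2 < bound; i2++)
        for (int j2 = 0; j2 < bound; j2++)
            for (int k2 = 0; k2 < bound; k2++) {
                /* the equations determine the first vector directly */
                int i1 = 2*j2 - i2 + k2 - 3;
                int j1 = k2 - 1;   /* from 2*j1 + 2 = 2*k2 */
                int k1 = 2;        /* from k1 + 5 = 7 */
                int l1 = 2*i2;
                if (i1 >= 0 && i1 < bound && j1 >= 0 && j1 < bound &&
                    k1 < bound && l1 < bound)
                    return 1;      /* dependence found within bounds */
            }
    return 0;
}
```

For example, i2=0, j2=2, k2=1 gives (i1,j1,k1,l1) = (2,0,2,0), so a dependence exists even in the tiny bound; a real compiler cannot afford this search, which is exactly why the restricted tests discussed next matter.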

Page 10:

An Example

• Do there exist two integer vectors (i1,j1,k1,l1) and (i2,j2,k2,l2) such that i1+3 = 2j2-i2+k2, 2j1+2 = 2k2, k1+5 = 7, and l1 = 2i2?
• This is a set of linear equations. Let's solve them!
• Bad news: finding integer solutions for linear equations is NP-hard.
• So what do we do?
• Restrict to special cases that are common and for which we have the math tools to solve…

Page 11:

Introduction to the Course (Chapter 1)

• What kind of parallel platforms are we targeting?

• How “dependence” relates to it all.

Page 12:

Moore’s Law

Page 13:

Features of Machine Architectures

• Pipelining.
• Superscalar: multiple (pipelined) execution units.
  – When controlled by the software, this is called VLIW.
• Vector operations.
• Parallel processing:
  – Shared memory, distributed memory, message passing.
• Registers.
• Cache hierarchy.
• Combinations of the above:
  – Parallel-vector machines.

Page 14:

Compiler’s Job

• Many advanced developers write platform-conscious code.
  – Parallelism, cache behavior, etc.
• This is time-consuming and bug-prone.
• Bugs are extremely difficult to find.
• Holy grail: the compiler gets “standard” code and optimizes it to make best use of the hardware.

Page 15:

Main Tool: Reordering

• Changing the execution order of the original code:
  – Do not wait for cache misses.
  – Run instructions in parallel.
  – Improve “branch prediction”, etc.
• Main question: which orders preserve program semantics?
  – Dependence.
  – Safety.

Page 16:

Dependence

• Dependence is the main theme in this book.

• How do we determine dependence?
• How do we use this knowledge to improve programs?
• Next, we go through the various platform-relevant properties: pipelining, vector instructions, superscalar, processor parallelism, memory hierarchy.

Page 17:

Earliest: Instruction Pipelining

• Instruction pipelining
  – DLX instruction pipeline
• The picture shows perfect pipelining, executing 1 instruction/cycle.

(Diagram: three consecutive instructions progressing through the IF, ID, EX, MEM, WB stages over cycles c1–c7, one stage per cycle.)

Instruction fetch, decode, execute, memory access, and write-back to register. (RISC.)

Page 18:

Replicated Execution Logic

• Pipelined execution units.
• Multiple execution units.

(Diagram: a pipelined floating-point adder with stages Fetch Operands (FO), Equate Exponents (EE), Add Mantissas (AM), and Normalize Result (NR), with successive sums b1+c1, b2+c2, … in flight; and four replicated adders, each computing one of the independent sums b1+c1 through b4+c4.)

Typical programs do not exploit such pipelining or parallelism well. But a compiler can help by reordering instructions. (Fine-grained parallelism.)

Page 19:

Parallelism Hazards

• Structural hazard:
  – A single bus brings both instructions and data, so a memory access cannot happen together with an instruction fetch.
• Data hazard:
  – The result produced by one instruction is required by the following instruction.
• Control hazard:
  – Branches.
• Hazards create stalls. Goal: minimize stalls.

Page 20:

Vector Instructions

• By the ’70s, the task of keeping the pipeline full had become cumbersome.
• Vector operations: an attempt to simplify instruction processing.
• Idea: a long “vector” register that is loaded, operated on, or stored in a single operation.
• Example:

    VLOAD  V1,A
    VLOAD  V2,B
    VADD   V3,V1,V2
    VSTORE V3,C

Page 21:

Vector Instructions

• The machine runs vector operations using the same pipelined hardware.
• Simplification: we know what is happening.
• Platform/system costs: the processor has more registers, more instructions, cache complications, etc.

Page 22:

Vector Operations

• How does a programmer specify, or how does a compiler identify, vector operations?

    DO I = 1, 64
      C(I) = A(I) + B(I)
      D(I+1) = D(I) + C(I)
    ENDDO
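The first statement of this loop is vectorizable, but the second carries a recurrence through D. A minimal C sketch (plain arrays, N = 64) of loop distribution, which separates the vectorizable statement from the recurrence; since C(I) never depends on D, the split is legal:

```c
#include <assert.h>

#define N 64

/* The fused loop from the slide: the first statement is vectorizable,
 * but the second carries a recurrence on D. */
void fused(const double *A, const double *B, double *C, double *D) {
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];
        D[i+1] = D[i] + C[i];
    }
}

/* Loop distribution: the C loop can now run as one vector operation;
 * the D recurrence stays sequential. */
void distributed(const double *A, const double *B, double *C, double *D) {
    for (int i = 0; i < N; i++)      /* vectorizable */
        C[i] = A[i] + B[i];
    for (int i = 0; i < N; i++)      /* sequential recurrence */
        D[i+1] = D[i] + C[i];
}
```

Both versions compute identical results; a vectorizing compiler performs exactly this kind of split before emitting vector instructions for the first loop.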

Page 23:

VLIW (Very Long Instruction Word)

• Multiple instructions issued in the same cycle:
  – Wide-word instruction.
• Challenges for the programmer/compiler:
  – Find enough parallel instructions.
  – Reschedule operations to minimize the number of cycles.
• Avoid hazards:
  – Dependence!

Page 24:

SMP Parallelism

• Multiple processors with uniform shared memory.
• Coarse-grained (task) parallelism: independent tasks.
• Tasks do not rely on one another for obtaining data or making control decisions.

(Diagram: processors p1–p4 connected to shared memory by a bus.)

Page 25:

Memory Hierarchy

• Memory speed falls behind processor speed, and the gap is still increasing.
• Solution: use a fast cache between memory and the CPU.
• Implication: a program's cache behavior has a significant impact on its efficiency.
• Challenge: how can we enhance reuse?
  – Correct! We reorder instructions.

(Diagram: Memory – Cache – CPU.)

Page 26:

Memory Hierarchy: Example

    DO I = 1, N
      DO J = 1, M
        C(I) = C(I) + A(J)

• If M is small enough that A(1:M) fits in the cache, then there are (almost) no cache misses on A. Otherwise, there is a cache miss on A per loop iteration.
• Solution for a cache of size L:

    DO jj = 1, M, L
      DO I = 1, N
        DO J = jj, jj+L-1
          C(I) = C(I) + A(J)
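A minimal C rendition of this transformation (0-based indices; the concrete N, M, and L values in the test are illustrative). Because each C(I) still accumulates the elements of A in increasing J order, the blocked version computes bit-identical results:

```c
#include <assert.h>

/* Original loop nest from the slide: for each i, sweep all of A. */
void sum_original(int n, int m, const double *A, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            C[i] += A[j];
}

/* Blocked version: process A in strips of L elements so that each strip
 * stays in cache while every C[i] uses it. */
void sum_blocked(int n, int m, int L, const double *A, double *C) {
    for (int jj = 0; jj < m; jj += L)
        for (int i = 0; i < n; i++)
            for (int j = jj; j < jj + L && j < m; j++)
                C[i] += A[j];
}
```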

Page 27:

Distributed Memory

• Memory packaged with processors:
  – Message passing.
  – Distributed shared memory.
• SMP clusters:
  – Shared memory on a node, message passing off node.
• The problem is more complex: split both computation and data.
  – Minimizing communication: data placement.
  – Create copies? Maintain coherence, etc.
• We will probably not have time to discuss data partitioning in this seminar.

Page 28:

Some Words on Dependence

• Goal: aggressive transformations to improve performance.
• Problem: when is a transformation legal?
  – Simple answer: when the program's meaning is not changed.
• Meaning?
  – The same sequence of memory states? Too strong!
  – The same answers? Good, but in general undecidable. We need a sufficient condition.
• In this course we use: dependence.
  – It ensures that instructions that access the same location (with at least one store) are not reordered.

Page 29:

Bernstein Conditions

• When is it safe to run two instructions I1 and I2 in parallel?
• If none of the following holds:
  1. I1 writes into a memory location that I2 reads.
  2. I2 writes into a memory location that I1 reads.
  3. Both I1 and I2 write to the same memory location.
• Loop parallelism: think of loop iterations as tasks.
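The three conditions can be written down directly. A minimal C sketch, modeling each instruction's read and write sets as arrays of hypothetical "location ids" (the sets and ids are inventions for the example, not part of any real compiler API):

```c
#include <assert.h>
#include <stdbool.h>

/* Do two sets of location ids share an element? */
static bool intersects(const int *a, int na, const int *b, int nb) {
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            if (a[i] == b[j]) return true;
    return false;
}

/* Safe to run I1 and I2 in parallel iff none of the three
 * Bernstein conditions holds. */
bool bernstein_parallel(const int *r1, int nr1, const int *w1, int nw1,
                        const int *r2, int nr2, const int *w2, int nw2) {
    return !intersects(w1, nw1, r2, nr2)   /* 1: I1 writes what I2 reads */
        && !intersects(w2, nw2, r1, nr1)   /* 2: I2 writes what I1 reads */
        && !intersects(w1, nw1, w2, nw2);  /* 3: both write one location */
}
```

For instance, with locations x=0, y=1, z=2, w=3: the statements `x = y + 1` and `z = w + 2` pass all three conditions and may run in parallel, while `x = y + 1` and `y = x` violate conditions 1 and 2.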

Page 30:

Bernstein Conditions

• Bernstein condition 1 (true dependence):
  – I1 writes into a memory location that I2 reads.

    DO i = 1, N
      A(i+1) = A(i) + B(i)

• Bernstein condition 2 (antidependence):
  – I2 writes into a memory location that I1 reads.

    DO i = 1, N
      A(i-1) = A(i) + B(i)

Page 31:

Bernstein Conditions

• Bernstein condition 3 (output dependence):
  – Both I1 and I2 write to the same memory location.

    DO i = 1, N
      S = A(i) + B(i)

  – Better:

    X = 10
    X = 20

Page 32:

Compiler Technologies

• Program transformations:
  – Most architectural issues are dealt with by reordering.
    • Vectorization, parallelization, cache-reuse enhancement.
  – Challenges:
    • Determining when transformations are legal.
    • Selecting transformations based on profitability.
• All transformations require some understanding of the ways that instructions and statements depend on one another (share data).

Page 33:

A Common Problem: Matrix Multiplication

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

This code performs perfectly well on a scalar machine. The compiler only needs to identify the common variable C(J,I) in the inner loop and put it in a register.

Page 34:

Pipelined Machine

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

But suppose it ran on a pipelined machine with 4 stages. Problem: the inner (K) loop cannot be parallelized.

Page 35:

Problem for Pipelines

• Instructions in the inner (K) loop are dependent.
• Solution: work on several iterations of the J loop simultaneously.

(Diagram: with one J iteration, the additions C(1,1) + A(1,1)*B(1,1), then + A(1,2)*B(2,1), form a serial chain; with four J iterations, the independent additions C(1,1)+A(1,1)*B(1,1), C(2,1)+A(2,1)*B(1,1), C(3,1)+A(3,1)*B(1,1), and C(4,1)+A(4,1)*B(1,1) fill the pipeline.)

Page 36:

MatMult for a Pipelined Machine

DO I = 1, N
  DO J = 1, N, 4
    C(J,I)   = 0.0
    C(J+1,I) = 0.0
    C(J+2,I) = 0.0
    C(J+3,I) = 0.0
    DO K = 1, N
      C(J,I)   = C(J,I)   + A(J,K)   * B(K,I)
      C(J+1,I) = C(J+1,I) + A(J+1,K) * B(K,I)
      C(J+2,I) = C(J+2,I) + A(J+2,K) * B(K,I)
      C(J+3,I) = C(J+3,I) + A(J+3,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
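As a quick check that 4-way unrolling of the J loop preserves results, a minimal C sketch (row-major 2-D arrays with a small N = 8, an assumption for the example; index names follow the slide). Each C element accumulates over K in the same order in both versions, so the results are bit-identical:

```c
#include <assert.h>

#define N 8   /* must be a multiple of 4 for this sketch */

/* Original scalar matrix multiply: C(J,I) accumulates over K. */
void matmul(double C[N][N], const double A[N][N], const double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            C[j][i] = 0.0;
            for (int k = 0; k < N; k++)
                C[j][i] += A[j][k] * B[k][i];
        }
}

/* J loop unrolled by 4: the four accumulations in the K loop body are
 * independent, so they can overlap in a 4-stage pipeline. */
void matmul_unroll4(double C[N][N], const double A[N][N],
                    const double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j += 4) {
            C[j][i] = C[j+1][i] = C[j+2][i] = C[j+3][i] = 0.0;
            for (int k = 0; k < N; k++) {
                C[j][i]   += A[j][k]   * B[k][i];
                C[j+1][i] += A[j+1][k] * B[k][i];
                C[j+2][i] += A[j+2][k] * B[k][i];
                C[j+3][i] += A[j+3][k] * B[k][i];
            }
        }
}
```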

Page 37:

Matrix Multiply on Vector Machines

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

Reminder: this was the original code.

Page 38:

Problems for Vectors

• The inner loop must be vectorizable:
  – And should be stride 1.
• Vector registers have finite length (Cray: 64 elements).
• Solution:
  – Strip-mine the loop over the stride-one dimension to the register length (say, 64).
  – Move the loop that iterates over the strip to the innermost position.

Page 39:

Vectorizing Matrix Multiply

DO I = 1, N
  DO J = 1, N, 64
    DO JJ = 0, 63
      C(J+JJ,I) = 0.0
    ENDDO
    DO K = 1, N
      DO JJ = 0, 63
        C(J+JJ,I) = C(J+JJ,I) + A(J+JJ,K) * B(K,I)
      ENDDO
    ENDDO
  ENDDO
ENDDO

Page 40:

MatMult for a Vector Machine

DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO
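The same strip-mined structure in plain C, with a hypothetical vector length VL = 4 standing in for the Cray's 64 elements (an assumption to keep the example small). Each inner JJ loop corresponds to one vector operation on the slide:

```c
#include <assert.h>

#define N 8
#define VL 4   /* stand-in for the 64-element vector register length */

/* Strip-mined matrix multiply: the J dimension is processed in strips of
 * VL contiguous, stride-one elements; each JJ loop plays the role of one
 * vector operation C(J:J+VL-1,I). */
void matmul_stripmined(double C[N][N], const double A[N][N],
                       const double B[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j += VL) {
            for (int jj = 0; jj < VL; jj++)       /* vector set-to-zero */
                C[j+jj][i] = 0.0;
            for (int k = 0; k < N; k++)
                for (int jj = 0; jj < VL; jj++)   /* vector multiply-add */
                    C[j+jj][i] += A[j+jj][k] * B[k][i];
        }
}
```

Per element, the K accumulation order is unchanged, so the result matches the straightforward triple loop exactly.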

Page 41:

Matrix Multiply on Parallel SMPs

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

On an SMP, we prefer to parallelize at a coarse level because of the cost involved in synchronizing the parallel runs.

Page 42:

Matrix Multiply on Parallel SMPs

DO I = 1, N   ! Independent for all I
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

We can let each processor execute some of the outer loop iterations. (They should all have access to the full matrix.)

Page 43:

MatMult on a Shared-Memory MP

PARALLEL DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO

Page 44:

MatMult on a Vector SMP

PARALLEL DO I = 1, N
  DO J = 1, N, 64
    C(J:J+63,I) = 0.0
    DO K = 1, N
      C(J:J+63,I) = C(J:J+63,I) + A(J:J+63,K) * B(K,I)
    ENDDO
  ENDDO
END PARALLEL DO

Page 45:

Matrix Multiply for Cache Reuse

DO I = 1, N
  DO J = 1, N
    C(J,I) = 0.0
    DO K = 1, N
      C(J,I) = C(J,I) + A(J,K) * B(K,I)
    ENDDO
  ENDDO
ENDDO

Page 46:

Problems on Cache

• There is reuse of C but no reuse of A and B.
• Solution:
  – Block the loops so you get reuse of both A and B.
    • Multiply a block of A by a block of B and add to a block of C.

Page 47:

MatMult on a Uniprocessor with Cache

DO I = 1, N, S
  DO J = 1, N, S
    DO p = I, I+S-1
      DO q = J, J+S-1
        C(q,p) = 0.0
      ENDDO
    ENDDO
    DO K = 1, N, S
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+S-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO

(Figure annotation: the A and B blocks hold S·T elements each and the C block holds S² elements, where T is the K-dimension block size; in the code above T = S.)
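A runnable C sketch of this blocked (tiled) loop nest, with N = 8 and S = 4 as illustrative assumptions (a real S would be chosen so the three blocks fit in cache). Per element, K still advances in increasing order, so the result matches the unblocked multiply exactly:

```c
#include <assert.h>

#define N 8
#define S 4   /* block size; N must be a multiple of S in this sketch */

/* Blocked matrix multiply from the slide: blocks of A and B are
 * multiplied and accumulated into a block of C, so each block is
 * reused while it sits in cache. */
void matmul_blocked(double C[N][N], const double A[N][N],
                    const double B[N][N]) {
    for (int i = 0; i < N; i += S)
        for (int j = 0; j < N; j += S) {
            for (int p = i; p < i + S; p++)       /* zero the C block */
                for (int q = j; q < j + S; q++)
                    C[q][p] = 0.0;
            for (int k = 0; k < N; k += S)        /* accumulate blocks */
                for (int p = i; p < i + S; p++)
                    for (int q = j; q < j + S; q++)
                        for (int r = k; r < k + S; r++)
                            C[q][p] += A[q][r] * B[r][p];
        }
}
```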

Page 48:

MatMult on a Distributed-Memory MP

PARALLEL DO I = 1, N
  PARALLEL DO J = 1, N
    C(J,I) = 0.0
  ENDDO
ENDDO

PARALLEL DO I = 1, N, S
  PARALLEL DO J = 1, N, S
    DO K = 1, N, T
      DO p = I, I+S-1
        DO q = J, J+S-1
          DO r = K, K+T-1
            C(q,p) = C(q,p) + A(q,r) * B(r,p)
          ENDDO
        ENDDO
      ENDDO
    ENDDO
  ENDDO
ENDDO

Page 49:

Conclusion

• Tailoring the program to a given platform is difficult.
• Due to the speed of hardware change, modifications may be required repeatedly.
• Thus, it is useful to let the compiler do much of the work.
• How much can it do? This is the topic of this course.

Page 50:

Summary

• Modern computer architectures present many performance challenges.
• Most of the problems can be overcome by changing the order of instructions, in particular by transforming loop nests.
  – Transformations are not obviously correct.
• Dependence tells us when this is feasible.
  – Most of the course concentrates on the detection and use of dependence (or independence) for the various goals.

Page 51:

Administration Again

Page 52:

Registration Procedure

1. I'll wait for additional initial registration emails for the next 2 hours.
2. I will send emails asking students whether they want to pick a lecture.
3. I will wait a limited time for responses and then go to the next candidates. If you want to ensure your place in this course, check your email every 4 hours tomorrow and respond.
4. A schedule will be built and put on the internet with the names of registered students.
5. (If you don't get registered, you are still welcome as a listener…)