Computer Architecture Lecture 7 Compiler Considerations and Optimizations


Page 1: Computer Architecture

Computer Architecture

Lecture 7

Compiler Considerations and Optimizations

Page 2: Computer Architecture

Structure of Recent Compilers

Front End

Transforms the language to a common intermediate form.

Note: Only a few companies make front ends for C. The source code of a C++ front end is about 30 times bigger than that of a C front end. Most front ends down-convert C++ to C before compilation.

High Level Optimization

High Level Loop Optimization

Example: Procedure In-lining

(Language dependent, machine independent)

Global Optimization

Global and Local Optimization and register allocation

(Slightly language dependent, slightly machine dependent)

Code Generation

Detailed instruction selection and machine-dependent optimization (language independent, highly machine dependent)

Page 3: Computer Architecture

Compiler Prime Targets

  • Program correctness
  • Speed
  • Compilation time?
  • Phases of compilers help write bug-free code

Page 4: Computer Architecture

Optimizations

  • High-level
  • Local (within a basic block)
  • Global (across branches)
  • Register allocation, live range analysis
  • Processor dependent

Page 5: Computer Architecture

Optimization Names

  • Procedure integration
  • Common sub-expression elimination / dead code elimination

        A = b + c    ; dead code, eliminated: no subsequent use of b + c
        A = x + y

    Similarly, a procedure that does not return a value and uses only local variables will be eliminated. (Test this in VC++.)
  • Constant propagation: a variable that is used as a constant is replaced by that constant. (Variables won't; constants aren't. Osborn's Law)
  • Global sub-expression elimination
  • Copy propagation (after a = b, a will be replaced by b)
  • Code motion (code that does not change with the loop index will be moved out of the loop)
  • Induction variable elimination (A = A + 5 in a loop that runs n times will be replaced with A = A + 5 * n and moved out of the loop, if A is not used inside the loop)
  • Strength reduction (a multiply is replaced with shift and add if possible; A*25 + B*25 will be replaced with (A + B) * 25)
  • Pipeline scheduling
  • Branch optimization
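Three of the optimizations named above can be sketched by hand in C. This is only an illustration of what a compiler might do, not the output of any particular compiler; the function names (times25_mul, times25_shift, fill_naive, fill_optimized) and the loop body are assumptions made up for the example.

```c
#include <stddef.h>

/* Strength reduction: a * 25 rewritten with shifts and adds (25 = 16 + 8 + 1).
   Unsigned operands keep the shifts well defined. */
unsigned times25_mul(unsigned a)   { return a * 25u; }
unsigned times25_shift(unsigned a) { return (a << 4) + (a << 3) + a; }

/* Before: a loop-invariant product is recomputed every iteration, and
   total is an induction variable that the loop body never reads. */
long fill_naive(long *out, size_t n, long a, long b)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        long scale = a * b;        /* does not change with i */
        out[i] = scale + (long)i;
        total = total + 5;         /* plays the role of A = A + 5 */
    }
    return total;
}

/* After: the same function with both transformations applied by hand. */
long fill_optimized(long *out, size_t n, long a, long b)
{
    long scale = a * b;            /* code motion: hoisted out of the loop */
    for (size_t i = 0; i < n; i++)
        out[i] = scale + (long)i;
    return 5 * (long)n;            /* induction variable elimination: 0 + 5 * n */
}
```

Both versions of each pair return the same values; only the amount of work per iteration differs.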

Page 6: Computer Architecture

Problems with Pointers

    A = 5;
    p = x + y;
    *p = 9;      (only the programmer knows that p == &A)

Compiler cannot assign a register to A
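The aliasing problem can be reproduced in a few lines of C. This is a minimal sketch; the function name store_through and its parameters are hypothetical. Because the compiler cannot prove that p and a point to different objects, it has to reload *a from memory after the store through p instead of keeping the value 5 in a register.

```c
/* Writes 5 to *a, then 9 through p, then reads *a back.
   If p aliases a, the second store overwrites the first. */
int store_through(int *a, int *p)
{
    *a = 5;
    *p = 9;
    return *a;   /* 9 when p == a, otherwise 5 */
}
```

Calling store_through(&v, &v) and store_through(&v, &w) returns different results from the same code, which is why the value cannot safely stay in a register.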

Page 7: Computer Architecture

Architecture Help

  • Provide orthogonality: the operations, the data types, the addressing modes, and the register functions should be orthogonal.
  • Simplify trade-offs between alternatives. (With caches and pipelining, trade-offs have become very complex.) For example, the most difficult one in a register-memory architecture: how many times a variable should be referenced before it is assigned a register.
  • Provide instructions to bind variables with constants.
  • Most SIMD kernels are hand-coded because compiler support is lacking.

Page 8: Computer Architecture

Hand-Coded vs. Compiler-Generated
On TMS320C6203 (VLIW CPU) (reported May 2000)

EEMBC Telecom Kernel          Execution-time ratio       Code-size ratio
                              (compiler/hand-written)    (compiler/hand-written)
Convolution Encoder           44.0                       0.5
Fixed Point Complex FFT       13.5                       1.0
Viterbi GSM Decoder           13.0                       0.7
Fixed Point Bit Allocation     7.0                       1.4
Auto-correlation               1.8                       0.7

Page 9: Computer Architecture

Basic Compiler Techniques

  • Basic pipelining
  • Static loop unrolling

Example:

Instruction producing result    Instruction using result    Latency (clock cycles)
FP ALU                          FP ALU                      3
FP ALU                          Store                       2
Load                            FP ALU                      1
FP Load                         FP Store                    0

Page 10: Computer Architecture

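Example (Contd…)

Loop: L.D    F0, 0(R1)       ; load an array element
      ADD.D  F4, F0, F2      ; add the scalar held in F2
      S.D    F4, 0(R1)       ; store the result
      DADDUI R1, R1, #-8     ; step the pointer back by 8 bytes
      BNEQ   R1, R2, Loop    ; repeat until R1 equals R2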

Page 11: Computer Architecture

Example (Without Scheduling)

Loop: L.D    F0, 0(R1)
      stall                  ; load-use delay (LUD)
      ADD.D  F4, F0, F2
      stall
      stall
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      stall
      BNEQ   R1, R2, Loop
      stall                  ; successor flushed

Total 10cc

Page 12: Computer Architecture

Example (With Scheduling)

Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2
      stall
      BNEQ   R1, R2, Loop
      S.D    F4, 8(R1)       ; delay slot

Total 6cc (3 for data, 3 overhead)

Page 13: Computer Architecture

Example (Static Loop Unrolling 4 times)

Loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)
      BNEQ   R1, R2, Loop
      S.D    F16, 8(R1)      ; delay slot

Total 3.5cc per element (14 cycles for 4 elements)

Compiler Considerations:
  1. Use of delay slot
  2. Loop level independence
  3. Register assignment
  4. Proper loop adjustment
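At the source level, the same unrolling can be sketched in C. This is an illustrative version under stated assumptions: the function names are hypothetical, the array is indexed upward from 0 rather than downward as in the assembly, and a cleanup loop is added for lengths that are not a multiple of 4 (the "proper loop adjustment" consideration).

```c
#include <stddef.h>

/* Rolled loop: one element per iteration. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled four times: the four statements are independent of each other,
   so a scheduler is free to interleave their loads, adds and stores. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
    for (; i < n; i++)      /* cleanup for the remaining 0 to 3 elements */
        x[i] = x[i] + s;
}
```

The loop overhead (index update and branch) is paid once per four elements instead of once per element, which is where the cycle savings in the assembly version come from.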

Page 14: Computer Architecture

Example (Static Dual Issue, 1 Int and 1 FP/CC)

      Integer instruction     FP instruction        Clock cycle
Loop: L.D F0, 0(R1)                                 1
      L.D F6, -8(R1)                                2
      L.D F10, -16(R1)        ADD.D F4,F0,F2        3
      L.D F14, -24(R1)        ADD.D F8,F6,F2        4
      L.D F18, -32(R1)        ADD.D F12,F10,F2      5
      S.D F4, 0(R1)           ADD.D F16,F14,F2      6
      S.D F8, -8(R1)          ADD.D F20,F18,F2      7
      S.D F12, -16(R1)                              8
      DADDUI R1,R1, #-40                            9
      S.D F16, 16(R1)                               10
      BNEQ R1,R2, Loop                              11
      S.D F20, 8(R1)                                12    ; delay slot

Total 2.4cc per element (12 cycles for 5 elements)

LUD: load-use delay

Page 15: Computer Architecture

VLIW

  • Compiler formats issue packets
  • Compiler ensures that dependencies are not present
  • 64- to 200-bit-long instructions

Page 16: Computer Architecture

Example (VLIW: 1 Int, 2 FP, 2 LD/ST per CC; 5 slots)

Mem 1 Slot          Mem 2 Slot          FP 1 Slot           FP 2 Slot           Int/Branch
L.D F0, 0(R1)       L.D F6, -8(R1)
L.D F10, -16(R1)    L.D F14, -24(R1)
L.D F18, -32(R1)    L.D F22, -40(R1)    ADD.D F4,F0,F2      ADD.D F8,F6,F2
L.D F26, -48(R1)                        ADD.D F12,F10,F2    ADD.D F16,F14,F2
                                        ADD.D F20,F18,F2    ADD.D F24,F22,F2
S.D F4, 0(R1)       S.D F8, -8(R1)      ADD.D F28,F26,F2
S.D F12, -16(R1)    S.D F16, -24(R1)                                            DADDUI R1,R1, #-56
S.D F20, 24(R1)     S.D F24, 16(R1)
S.D F28, 8(R1)                                                                  BNEQ R1,R2, Loop

1.29cc per element (9 cycles for 7 elements), 23 slots used out of a potential 45

Page 17: Computer Architecture

Loop Level Parallelism

Loop-carried dependence: data calculated in one loop iteration is required in a later iteration.

A parallel loop:

    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

Page 18: Computer Architecture

Example

    for (i = 1; i <= 100; i = i + 1) {
        A[i+1] = A[i] + C[i];
        B[i+1] = B[i] + A[i+1];
    }

Dependences?

Page 19: Computer Architecture

Example 2

Make the following loop parallel.

    for (i = 1; i <= 100; i = i + 1) {
        A[i] = A[i] + B[i];
        B[i+1] = C[i] + D[i];
    }
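One way to remove the loop-carried dependence is to shift the statements so that each iteration produces the B element it consumes. The sketch below is an assumption about the intended answer, not something stated on the slide; the function names are hypothetical, and arrays are indexed 1..N+1 as in the example, so callers must supply at least N + 2 elements.

```c
#define N 100

/* Original form: A[i] reads the B[i] written by the previous iteration. */
void loop_original(double *A, double *B, const double *C, const double *D)
{
    for (int i = 1; i <= N; i = i + 1) {
        A[i] = A[i] + B[i];
        B[i + 1] = C[i] + D[i];
    }
}

/* Transformed form: the first A update and the last B update are peeled off,
   so the dependence on B[i + 1] now stays inside a single iteration. */
void loop_transformed(double *A, double *B, const double *C, const double *D)
{
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i = i + 1) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];
}
```

Both functions leave A and B with the same contents; in the transformed version no iteration depends on a value computed by another iteration, so the iterations can run in parallel.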

Page 20: Computer Architecture

The GCD Test

  • A loop stores to index a*j + b and later fetches from index c*k + d.
  • If a loop-carried dependence exists, then GCD(c, a) must divide (d - b) exactly (no remainder). So when GCD(c, a) does not divide (d - b), no dependence is possible: the test is sufficient to rule a dependence out.

    for (i = 1; i <= 100; i = i + 1)
        x[2*i + 3] = x[2*i] * 5;

  • This test ignores loop bounds.
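The test is short enough to write out in C. This is a minimal sketch; the function names gcd_int and gcd_test_may_depend are hypothetical. For the loop above, a = 2, b = 3, c = 2, d = 0, so GCD(2, 2) = 2 and d - b = -3; since 2 does not divide -3, no dependence is possible.

```c
#include <stdlib.h>

/* Greatest common divisor of |m| and |n| by Euclid's algorithm. */
static int gcd_int(int m, int n)
{
    m = abs(m);
    n = abs(n);
    while (n != 0) {
        int t = m % n;
        m = n;
        n = t;
    }
    return m;
}

/* Store index is a*j + b, fetch index is c*k + d.
   Returns 0 when the GCD test rules a dependence out,
   1 when a dependence may exist (loop bounds are ignored). */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd_int(c, a);
    if (g == 0)                 /* both indices are constants */
        return b == d;
    return (d - b) % g == 0;
}
```

A return value of 1 only means the test could not exclude a dependence; because loop bounds are ignored, the accesses may still never overlap in practice.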

Page 21: Computer Architecture

Example 2

Use renaming to find ILP.

    for (i = 1; i <= 100; i = i + 1) {
        Y[i] = X[i] / c1;
        X[i] = X[i] + c2;
        Z[i] = Y[i] + c3;
        Y[i] = c4 - Y[i] / c;
    }
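A possible renamed version is sketched below in C. Several things here are assumptions: the last statement is taken to be Y[i] = c4 - Y[i] (the trailing "/c" on the slide is left out), and the function names and the extra arrays T and X1 are introduced only for the illustration. Renaming the first Y[i] to T[i] and the updated X[i] to X1[i] removes the output dependence and the antidependences, leaving only the true dependences on T[i].

```c
#define N 100

/* Loop as on the slide (arrays indexed 1..N, so N + 1 elements are needed). */
void rename_before(double *X, double *Y, double *Z,
                   double c1, double c2, double c3, double c4)
{
    for (int i = 1; i <= N; i = i + 1) {
        Y[i] = X[i] / c1;
        X[i] = X[i] + c2;   /* antidependence: X[i] is read just above */
        Z[i] = Y[i] + c3;   /* true dependence on the first Y[i] */
        Y[i] = c4 - Y[i];   /* output dependence on the first Y[i] */
    }
}

/* Renamed loop: T holds the temporary quotient, X1 the updated X. */
void rename_after(const double *X, double *X1, double *Y, double *Z, double *T,
                  double c1, double c2, double c3, double c4)
{
    for (int i = 1; i <= N; i = i + 1) {
        T[i]  = X[i] / c1;
        X1[i] = X[i] + c2;
        Z[i]  = T[i] + c3;
        Y[i]  = c4 - T[i];
    }
}
```

After the renamed loop, code that previously read X has to read X1 instead; that bookkeeping is the price of the extra parallelism.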

Page 22: Computer Architecture

Other Techniques

Copy propagation:

    Addi R1, R2, #4
    Addi R1, R1, #4
  to
    Addi R1, R2, #8

Tree height reduction:

    Add R1, R2, R3
    Add R2, R1, R5
    Add R7, R2, R8

Recurrence optimization:

    sum = sum + x[i]
  to
    sum = (sum + x[1]) + (x[2] + x[3]) + (x[4] + x[5])
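The recurrence optimization can be sketched in C as a sum unrolled five times with the grouping shown above. The function names are hypothetical and the array is indexed from 0. Note that regrouping floating-point additions can change rounding, so a compiler may apply this to floating-point code only when reassociation is permitted.

```c
#include <stddef.h>

/* Serial recurrence: every add waits for the previous one. */
double sum_serial(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum = sum + x[i];
    return sum;
}

/* Unrolled five times and regrouped: the two inner pair sums do not depend
   on sum, so they can be computed in parallel with (sum + x[i]). */
double sum_grouped(const double *x, size_t n)
{
    double sum = 0.0;
    size_t i = 0;
    for (; i + 5 <= n; i += 5)
        sum = (sum + x[i]) + (x[i + 1] + x[i + 2]) + (x[i + 3] + x[i + 4]);
    for (; i < n; i++)      /* cleanup for the remaining 0 to 4 elements */
        sum = sum + x[i];
    return sum;
}
```

The dependent chain through sum shrinks from five adds per five elements to three, which is the same idea as tree height reduction applied across loop iterations.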