Code Scheduling Techniques for VLIW Processors and Their Applicability to Clustered Architectures

Anup Gangwar

Embedded Systems Group, Department of Computer Science and Engineering,
Indian Institute of Technology Delhi
http://embedded.cse.iitd.ernet.in

March 11, 2004

Ph.D. Thursday Seminar Series


Outline

Processor Architectures: RISC, SuperScalar and VLIW

Clustered VLIW Processors

Intermediate Representations for VLIW Compilers

Block Formation Techniques

Acyclic Code Scheduling in VLIW Compilers

Cyclic Code Scheduling in VLIW Compilers

Acyclic Code Scheduling in Clustered VLIW Compilers

Cyclic Code Scheduling in Clustered VLIW Compilers

Conclusions


RISC Processor Architecture

A pipelined RISC processor can execute at most one instruction per cycle (IPC)

Typical hazards such as branches and cache misses reduce IPC to less than one

Advantages:

Simplified hardware and compiler

Low power consumption

Disadvantages:

Low performance

Increase in performance beyond 1 IPC can be achieved by multiple-issue processors

Most of the current embedded processors are RISCs: ARM, MIPS, StrongARM etc.


SuperScalar Processor Architecture

Have multiple functional units (ALUs, LD/ST, FALUs etc.)

Multiple instruction executions may be in progress at the same time

Detect parallelism dynamically at run-time

Advantages:

Binary compatibility across all generations of processors

Compilation is trivial; at most, the compiler can rearrange instructions to facilitate detection of ILP at run-time

Disadvantages:

High power consumption

Complicated hardware: hence not very suitable for customization

Most of the General Purpose Processors are SuperScalars: Pentium (Pro, II, III, 4), UltraSPARC, Athlon, MIPS R10000 etc.


VLIW Processor Architecture

The compiler extracts parallelism; these architectures have evolved from horizontal microcoded architectures

Latest industry-coined acronym: EPIC, for Explicitly Parallel Instruction Computing

Commercial Architectures:

General Purpose Computing: Intel Itanium

Embedded Computing: TriMedia, TI C6x, Sun’s MAJC etc.

RISC              SuperScalar       VLIW (4 issue)
ADD r1, r2, r3    ADD r1, r2, r3    ADD r1, r2, r3 | SUB r4, r2, r3 | NOP | NOP
SUB r4, r2, r3    SUB r4, r2, r3    MUL r5, r1, r2 | NOP | NOP | NOP
MUL r5, r1, r2    MUL r5, r1, r2


VLIW Processor Architecture (contd)

Advantages:

Simplified hardware: suitable for customization

Less power consumption as compared to SuperScalar processors

High performance

Disadvantages:

Complicated compiler: limits retargetability

Code size blow-up due to explicit NOPs
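The NOP overhead can be made concrete with a small sketch. It counts the explicit NOP slots in the two 4-issue bundles from the RISC/SuperScalar/VLIW comparison; the function name `nop_fraction` is illustrative and not part of the talk.

```python
from typing import List


def nop_fraction(bundles: List[List[str]]) -> float:
    """Return the fraction of issue slots that hold explicit NOPs."""
    total_slots = sum(len(bundle) for bundle in bundles)
    nop_slots = sum(op == "NOP" for bundle in bundles for op in bundle)
    return nop_slots / total_slots


# The two 4-issue VLIW bundles from the comparison on the previous slide.
bundles = [
    ["ADD r1, r2, r3", "SUB r4, r2, r3", "NOP", "NOP"],
    ["MUL r5, r1, r2", "NOP", "NOP", "NOP"],
]
print(nop_fraction(bundles))  # 5 of the 8 slots are NOPs
```

Real VLIW encodings often compress NOPs, so this figure is an upper bound for an uncompressed encoding rather than a measurement.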


Clustered VLIW Processors

For N functional units connected to an RF, the area grows as $N^3$, the delay as $N^{3/2}$ and the power as $N^3$ (Rixner et al., HPCA 2000)

The solution is to break up the RF into a set of smaller RFs

[Figure: clustered VLIW organization. Cluster 1, Cluster 2 and Cluster 3 each contain functional units FU1, FU2 and FU3 and their own register file (Register File 1, 2 and 3); the figure also shows an Interconnection Network and the Memory System]
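The benefit of splitting the register file can be estimated from the growth rates quoted above. The sketch below is a back-of-the-envelope model only: it assumes the same scaling holds for each smaller RF and ignores the cost of the inter-cluster interconnect. The function name `clustered_rf_cost` is hypothetical.

```python
from typing import Dict


def clustered_rf_cost(n_fus: int, n_clusters: int) -> Dict[str, float]:
    """Cost of n_clusters small RFs relative to one monolithic RF.

    Assumes area ~ N^3, delay ~ N^(3/2) and power ~ N^3 per register file,
    where N is the number of functional units attached to that file.
    Interconnect cost is deliberately left out (an assumption).
    """
    per_cluster = n_fus / n_clusters  # FUs attached to each smaller RF
    ratio = per_cluster / n_fus
    return {
        "area": n_clusters * ratio ** 3,   # summed over all clusters
        "delay": ratio ** 1.5,             # access delay of one small RF
        "power": n_clusters * ratio ** 3,  # summed over all clusters
    }


print(clustered_rf_cost(n_fus=8, n_clusters=4))
```

Under these assumptions, 8 functional units split into 4 clusters need about 1/16 of the area and power and about 1/8 of the delay of a single monolithic register file.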


Clustered VLIW Architectures

[Figures: four inter-cluster interconnect types, each drawn for two clusters with one register file (R.F.) per cluster and LD/ST and ALU units]

RF-to-RF

Write Across (2)

Read Across (2)

Write/Read Across (2)


Compilation for Clustered VLIWs

Compilation is complicated due to partial connectivity between clusters

Important acyclic and cyclic (modulo) scheduling techniques developed for monolithic VLIWs are not directly applicable

The Operation ⇒ FU-in-Cluster binding problem is the most critical

Another problem is register allocation

The scheduling/binding techniques developed so far are specific to the inter-cluster interconnect


Block Diagram of a Typical VLIW Compiler

[Figure: compilation flow C-Source → High-Level IR → Low-Level IR → Assembly]

Compiler Front-end (C-Source to High-Level IR):
* Input parsing
* Profiling
* Function inlining
* High-level code transformations

* IR conversion (High-Level IR to Low-Level IR)

Compiler Back-end (Low-Level IR to Assembly):
* Block formation
* ILP enhancement
* Classical optimizations

* Acyclic scheduling
* Modulo scheduling
* Register allocation
* Post-pass scheduling


High Level Intermediate Representation

SUIF2

Designed to capture high-level information

No instructions, only statements and expressions

Statements: if-else, do-while etc.

Expressions: Binary expressions etc.

Hcode (IMPACT)

Designed to capture high-level information

Similar to SUIF2, contains statements mixed with expressions

No notion of instructions, enables control-flow profiling


Low Level Intermediate Representation

MachSUIF

Designed to capture low-level information, which is architecture specific

Interfaces for target specific code generation, register allocation, scheduling

Lcode (IMPACT)

Designed to capture low-level information, for a typical VLIW architecture

Lcode nodes (instructions), support predication etc.

Enables block formation, register allocation, code scheduling etc.


Block Formation Techniques: Basic Block

Code Snippet:

....
if( i == 10 ){
    j = 100;
    k = 110;
}else{
    j = 200;
    k = 210;
}

Basic Blocks:

BB-1        BB-2
j = 100;    j = 200;
k = 110;    k = 210;

Single entry single exit blocks

No control flow information, representable as DFGs

Typically very small; large only in a few media applications


Block Formation Techniques: Traces

Code Snippet:

....
a = 50;
if( i == 10 ){
    j = 100;
    k = 110;
}else{
    j = 200;
    k = 210;
}
c = a + j + k;

Trace (when the then part is more frequently executed):

Trace-1
a = 50;
j = 100;
k = 110;
c = a + j + k;

Single entry single exit blocks, list scheduling after trace detection

Bookkeeping code added to preserve semantics

No speculative execution (in original Fisher approach)
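Trace detection can be sketched as a greedy walk over a profiled control-flow graph. The version below is a simplification and not Fisher's full algorithm: it only grows the trace forward from a given entry block. The block names and edge counts are made up for illustration, and `select_trace` is a hypothetical helper.

```python
from typing import Dict, List


def select_trace(cfg: Dict[str, Dict[str, int]], entry: str) -> List[str]:
    """Greedily follow the most frequently taken edge starting at `entry`.

    cfg maps a block name to {successor: profiled edge count}.
    """
    trace = [entry]
    current = entry
    while cfg.get(current):
        successor = max(cfg[current], key=cfg[current].get)
        if successor in trace:  # stop at a back edge
            break
        trace.append(successor)
        current = successor
    return trace


# Blocks of the code snippet above; the edge counts are illustrative.
cfg = {
    "entry": {"then": 90, "else": 10},  # a = 50; if( i == 10 )
    "then": {"join": 90},               # j = 100; k = 110;
    "else": {"join": 10},               # j = 200; k = 210;
    "join": {},                         # c = a + j + k;
}
print(select_trace(cfg, "entry"))
```

With the then part executed more often, the selected trace is entry, then, join, which matches Trace-1.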


Block Formation Techniques: HyperBlock

Require support from hardware in the form of predicated execution

Assume a predicate (TRUE/FALSE) value is appended to each instruction as an argument, e.g.
ADD r1, r2, r3 → ADD r1, r2, r3, p1

Each instruction execution is conditional on the predicate being either TRUE or FALSE


HyperBlocks (contd)

Code Snippet:

....
a = 50;
if( i == 10 ){
    j = 100;
    k = 110;
}else{
    j = 200;
    k = 210;
}
c = a + j + k;

HyperBlock:

HyperBlock-1
a = 50, TRUE
p<t> = ? i == 10;
j = 100, p<t>
k = 110, p<t>
j = 200, p<f>
k = 210, p<f>
c = a + j + k, TRUE

Single entry multiple exit blocks, incorporates both IF and ELSE parts

Specialized scheduling techniques after detection, tend to grow large

Performance may not increase due to dynamic nullification
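The effect of predication on HyperBlock-1 can be illustrated with a tiny interpreter. This is a sketch under simple assumptions: every operation is fetched, and an operation whose guard is false is nullified, meaning it leaves the register state unchanged. The names `run_hyperblock`, `p_t` and `p_f` are illustrative.

```python
from typing import Callable, Dict, List, Tuple

# (destination register, value function, guarding predicate)
Op = Tuple[str, Callable[[Dict[str, int]], int], str]


def run_hyperblock(i: int) -> Dict[str, int]:
    """Execute the if-converted block from the slide for a given i."""
    regs: Dict[str, int] = {"i": i}
    # Predicate define: p_t holds the condition, p_f its complement.
    preds = {"TRUE": True, "p_t": i == 10, "p_f": i != 10}
    ops: List[Op] = [
        ("a", lambda r: 50, "TRUE"),
        ("j", lambda r: 100, "p_t"),
        ("k", lambda r: 110, "p_t"),
        ("j", lambda r: 200, "p_f"),
        ("k", lambda r: 210, "p_f"),
        ("c", lambda r: r["a"] + r["j"] + r["k"], "TRUE"),
    ]
    for dest, compute, guard in ops:
        if preds[guard]:  # a false guard nullifies the operation
            regs[dest] = compute(regs)
    return regs


print(run_hyperblock(10)["c"], run_hyperblock(3)["c"])
```

All six operations occupy issue slots on every execution, although only four of them take effect. This is the dynamic nullification cost mentioned above.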


Block Formation Techniques: SuperBlock

Superblocks are traces with all join points removed via tail duplication

May lead to larger code than traces

Enable easy application of certain optimizations such as copy propagation

Require hardware support for speculative execution


SuperBlocks (contd)

Code Snippet:

....
a = 50;
if( i == 10 ){
    j = 100;
    k = 110;
}else{
    j = 200;
    k = 210;
}
c = a + j + k;

SuperBlocks (when the then part is more frequently executed):

SuperBlock-1
a = 50;
j = 100;
k = 110;
c = a + j + k;

Single entry multiple exit blocks

Specialized scheduling after detection


Block Formation Techniques: Treegion

Code Snippet:

....
if( i == 10 ){
    j = 100;
    k = 110;
}else{
    j = 200;
    k = 210;
}
c = j + k;

Treegions:

Treegion-1
....
if( i == 10 ){
    j = 100;
    k = 110;
}else{
    j = 200;
    k = 210;
}

Single entry multiple exit block (a tree of BBs)

Global scheduling is possible, support for speculative execution

Profile information is not needed


Comparison of Block Formation Techniques

Name          If-Else Included   Entry    Exits      H/W Supp.?   Profile Info.?
Basic Block   N.A.               Single   Single     None         No
Traces        Only One           Single   Single     None         Yes
HyperBlock    Both               Single   Multiple   Yes          Yes
SuperBlock    One or Both        Single   Multiple   Yes          Yes
Treegion      Global             Single   Multiple   Yes          No


Acyclic Code Scheduling: List Scheduling

Works on a Data Flow Graph (DFG)

1. Sort nodes based on some priority function
2. Schedule as many nodes in this cycle as possible
3. Increase the schedule step and repeat steps 2 and 3 till all nodes are scheduled

Works well for DFGs; modifications are needed to work with HyperBlocks and SuperBlocks.
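The three steps above can be sketched as follows. This is a minimal version that assumes identical functional units and unit latency for every operation; the priority function is supplied by the caller (here, the length of the longest path to a sink). The name `list_schedule` is illustrative.

```python
from typing import Dict, List, Set


def list_schedule(deps: Dict[str, List[str]], priority: Dict[str, int],
                  width: int) -> List[List[str]]:
    """Schedule a DFG on `width` identical unit-latency functional units.

    deps maps each node to the nodes it depends on.
    Returns one list of nodes per cycle.
    """
    # Step 1: sort nodes by priority, highest first.
    order = sorted(deps, key=lambda n: -priority[n])
    done: Set[str] = set()
    schedule: List[List[str]] = []
    while len(done) < len(deps):
        # Step 2: fill this cycle with ready nodes, up to the issue width.
        ready = [n for n in order
                 if n not in done and all(p in done for p in deps[n])]
        cycle = ready[:width]
        if not cycle:
            raise ValueError("dependence cycle in DFG")
        schedule.append(cycle)
        # Step 3: move to the next cycle and repeat.
        done.update(cycle)
    return schedule


# The ADD/SUB/MUL example from the VLIW slide: MUL reads r1 written by ADD.
deps = {"ADD": [], "SUB": [], "MUL": ["ADD"]}
priority = {"ADD": 2, "SUB": 1, "MUL": 1}
print(list_schedule(deps, priority, width=4))
```

For a 4-issue machine this places ADD and SUB in the first cycle and MUL in the second, which corresponds to the two VLIW bundles shown earlier.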


Acyclic Code Scheduling: Trace Scheduling

Works on traces

1. Do a trace formation on the region of interest
2. Perform list scheduling on the trace
3. Add compensation code

It has been superseded by SuperBlock scheduling algorithms


Cyclic Code Scheduling: S/W Pipelining

Proposed by B. R. Rau in 1981, subsequently modified by Lam (PLDI, 1988), Rau (MICRO, 1994) and others

Lam’s approach employs hierarchical reduction, which is not really relevant today due to HyperBlocks etc.

Iterative Modulo Scheduling by Rau (MICRO, 1994) is the most established

Code Snippet:

for( i = 0; i < 100; i++ )
    a = a + 1;

Architecture has 4 ALUs:

Steady State Pipeline
ADD.0 a1 = a0 + 1;
ADD.1 a3 = a2 + 1;
ADD.2 a5 = a4 + 1;
ADD.3 a7 = a6 + 1;
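Modulo schedulers typically start from a lower bound on the initiation interval (II), the number of cycles between the starts of successive iterations. The sketch below computes the usual resource and recurrence bounds under simplifying assumptions: every operation occupies one unit of its class for one cycle, and the recurrence cycles are given explicitly rather than discovered from the dependence graph. The function names are illustrative.

```python
import math
from typing import Dict, List, Tuple


def res_mii(op_counts: Dict[str, int], unit_counts: Dict[str, int]) -> int:
    """Resource bound: operations per iteration over available units."""
    return max(math.ceil(op_counts[k] / unit_counts[k]) for k in op_counts)


def rec_mii(recurrences: List[Tuple[int, int]]) -> int:
    """Recurrence bound.

    Each recurrence is (total latency around the dependence cycle,
    iteration distance spanned by the cycle).
    """
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=1)


def min_ii(op_counts: Dict[str, int], unit_counts: Dict[str, int],
           recurrences: List[Tuple[int, int]]) -> int:
    """Lower bound on the initiation interval."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


# The loop above: one ADD per iteration and 4 ALUs. The ADD feeds itself in
# the next iteration; a latency of 1 cycle is an assumption.
print(min_ii({"alu": 1}, {"alu": 4}, [(1, 1)]))
```

A scheduler in the style of iterative modulo scheduling would try to find a schedule at this II and increase it when no schedule is found.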


Acyclic Code Scheduling in Clustered VLIW Compilers: Problems

Cluster Assignment

Register Allocation

Instruction Scheduling


Acyclic Code Scheduling: Jacome et al.

Works on DFGs for RF-to-RF architectures

Divide the DFG into a collection of Vertical and Horizontal Aggregates

A derived value, Load, is used to estimate the load that results if a particular operation is scheduled onto a particular cluster

The remaining problem is to map these aggregates onto the clustered architecture; an aggregate is scheduled onto a cluster as a whole and is not further subdivided

Algorithm can be used for both design-space exploration and code generation


Jacome et al. (contd)

Vertical and Horizontal Aggregates:

[Figure: example DFG with nodes 1 to 8, grouped into vertical aggregates VA-1 and VA-2 and horizontal aggregates HA-1 and HA-2]


Acyclic Code Scheduling: Sanchez et al.

Works on DFGs for RF-to-RF architectures

Greedy algorithm which tries to minimize communication amongst clusters

Simultaneous cluster assignment and scheduling for each operation

Register assignment is trivial, with generated values first going to the cluster in which the operation is scheduled
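The idea of greedily minimizing inter-cluster communication can be illustrated with a much simplified sketch. It is not the published algorithm: it only assigns clusters (no cycle-by-cycle scheduling), visits operations in a given topological order, and uses a per-cluster capacity to stop everything from collapsing onto one cluster. The name `greedy_assign` and the `capacity` parameter are assumptions made for this example.

```python
from typing import Dict, List, Tuple


def greedy_assign(deps: Dict[str, List[str]], n_clusters: int,
                  capacity: int) -> Dict[str, int]:
    """Assign each operation to the cluster needing the fewest copies.

    deps maps each node to its predecessors and must be listed in
    topological order. n_clusters * capacity must be at least len(deps).
    """
    cluster_of: Dict[str, int] = {}
    load = [0] * n_clusters
    for node, preds in deps.items():
        candidates = [c for c in range(n_clusters) if load[c] < capacity]

        def cost(c: int) -> Tuple[int, int]:
            # Operands living in another cluster need an inter-cluster copy.
            copies = sum(cluster_of[p] != c for p in preds)
            return (copies, load[c])  # fewest copies, then lightest load

        best = min(candidates, key=cost)
        cluster_of[node] = best
        load[best] += 1
    return cluster_of


# Two independent dependence chains, interleaved in topological order.
deps = {"a1": [], "b1": [], "a2": ["a1"], "b2": ["b1"],
        "a3": ["a2"], "b3": ["b2"]}
print(greedy_assign(deps, n_clusters=2, capacity=3))
```

Each chain ends up on its own cluster, so no inter-cluster copies are needed for this input.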


Acyclic Code Scheduling: PCC

Partial Component Clustering, proposed by G. Desoli, 1998

Works for DFGs and RF-to-RF architectures

Problem reduction is achieved by identifying Sub-DAGs which are in turn scheduled and bound

Works well if the Sub-DAGs are balanced in terms of number of operations and schedule lengths


PCC (contd)

[Figure: DAG of Matrix Init After Loop Unrolling]


Acyclic Code Scheduling: RAW

The RAW Architecture:

[Figure: The RAW Microprocessor]

Each processor contains an ALU, instruction/data memory and a switch, which has its own instruction memory


RAW (contd)

Unlike a VLIW processor, the RAW architecture has multiple flows of control

The instruction partitioner partitions the instructions into multiple groups, at both local and global levels

The clustering and merging phases bring down the number of such groups to the number of processing resources

The RAW scheduler schedules these groups onto each processor and generates the required communication operations


Acyclic Code Scheduling: CARS

Single step cluster assignment, register allocation and instruction scheduling avoids back-tracking and leads to better overall schedules

Works for any region (Basic-Blocks, HyperBlocks, SuperBlocks, Treegions) and RF-to-RF architectures

Register allocation is performed on the fly to incorporate effects of generated spill code


Acyclic Code Scheduling: TTS

An Example Treegion:

[Figure: control-flow graph with basic blocks BB1 to BB7, partitioned into Treegion-1, Treegion-2 and Treegion-3]


TTS (contd)

The TTS Algorithm:

1. Sort basic blocks as per the execution frequency
2. Start list scheduling from the root
3. Consider speculative execution similar to SuperBlocks
4. Repeat till all basic blocks have been scheduled


Acyclic Code Scheduling: Rainer Leupers

Works on DFGs for Write Across type of architectures (TI C6201)

Operation partitioning is based on a simple simulated annealing approach, with the cost being the schedule length of the DFG including the copy operations

Finally, list scheduling is carried out on the DFG

A 100-node DFG is partitioned in 10 CPU seconds on a Sun Ultra-1, though the run-time could become prohibitive for big DFGs
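A partitioner of this kind can be sketched with a generic simulated annealing loop. The code below is not Leupers' implementation: the cost function is a simple greedy schedule length in which a value that crosses clusters arrives `copy_latency` cycles later, copies do not occupy functional units, and the cooling schedule and all parameter values are arbitrary choices for illustration.

```python
import math
import random
from typing import Dict, List, Tuple


def schedule_length(deps: Dict[str, List[str]], cluster_of: Dict[str, int],
                    width: int, copy_latency: int = 1) -> int:
    """Length of a greedy schedule with `width` unit-latency FUs per cluster.

    deps maps each node to its predecessors, in topological order.
    """
    finish: Dict[str, int] = {}
    used: Dict[Tuple[int, int], int] = {}  # (cluster, cycle) -> busy FUs
    for node, preds in deps.items():
        c = cluster_of[node]
        start = max((finish[p] + (copy_latency if cluster_of[p] != c else 0)
                     for p in preds), default=0)
        while used.get((c, start), 0) >= width:
            start += 1  # all FUs of this cluster are busy in this cycle
        used[(c, start)] = used.get((c, start), 0) + 1
        finish[node] = start + 1
    return max(finish.values())


def anneal_partition(deps: Dict[str, List[str]], n_clusters: int, width: int,
                     steps: int = 2000, seed: int = 0
                     ) -> Tuple[Dict[str, int], int]:
    """Search for a cluster assignment with a short schedule."""
    rng = random.Random(seed)
    nodes = list(deps)
    current = {n: 0 for n in nodes}  # start with everything on cluster 0
    cost = schedule_length(deps, current, width)
    best, best_cost = dict(current), cost
    temp = 2.0
    for _ in range(steps):
        node = rng.choice(nodes)
        old = current[node]
        current[node] = rng.randrange(n_clusters)  # move one operation
        new_cost = schedule_length(deps, current, width)
        accept = (new_cost <= cost
                  or rng.random() < math.exp((cost - new_cost) / temp))
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = dict(current), cost
        else:
            current[node] = old  # reject the move
        temp = max(temp * 0.995, 1e-3)  # geometric cooling
    return best, best_cost


# Eight independent operations, two clusters with two FUs each.
deps = {f"op{i}": [] for i in range(8)}
best, best_cost = anneal_partition(deps, n_clusters=2, width=2)
print(best_cost)
```

Every move re-evaluates the whole schedule, so the cost per step grows with the DFG size. This is consistent with the run-time concern noted above for big DFGs.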


Acyclic Code Scheduling: Our Approach

DFGs are built out of the execution traces

Operation partitioning amongst clusters is carried out by assuming a best neighbour connectivity amongst clusters

Finally, a resource constrained list scheduling is carried out to generate a correct schedule

Very time consuming and not suitable for code generation, only for evaluation of architectures


Comparison of Acyclic Scheduling Algorithms

Name             Region     Architecture(s)   Run-time
Jacome et al.    DFG        RF-to-RF          May be high
Sanchez et al.   DFG        RF-to-RF          Fast
PCC              DFG        RF-to-RF          May be high
RAW              Global     MIT RAW           May be high
CARS             All        RF-to-RF          Low
TTS              Treegion   RF-to-RF          Low
Leupers          DFG        Write Across      May be high
Ours             DFG        All               High


Cyclic Code Scheduling: CALIBER

Works on DFGs for RF-to-RF architectures, proposed by Jacome et al., 2002

Tries to achieve final scheduling latencies which are close to monolithic VLIWs

Kernel detection tries to minimize the schedule length as well as register pressure and code size

The retiming module is an extension of the previous modulo scheduling algorithms

The graph partitioning module generates a number of solutions based on their acyclic scheduling algorithm, which are heuristically pruned


Cyclic Code Scheduling: Sanchez et al.

Works on DFGs for RF-to-RF architectures

Kernel detection, spill code generation, register allocation and cluster assignment are performed simultaneously

Similar to monolithic VLIW modulo scheduling algorithms apart from the integration of the above modules


Cyclic Code Scheduling: Others

Another algorithm from Sanchez et al. (MICRO, 2000) takes care of address generation for multiple-banked memory architectures

A few others exist, but they are similar to the CALIBER and Sanchez et al. approaches


Comparison of Cyclic Code Scheduling Algorithms

Name             Region   Architecture(s)   Run-time
Jacome et al.    DFG      RF-to-RF          Medium
Sanchez et al.   DFG      RF-to-RF          Medium
Others           DFG      RF-to-RF          -


Conclusions: Where does all this take us?

All scheduling techniques are for RF-to-RF architectures except one (Rainer Leupers, PACT 2000)

Interconnects influence the scheduling algorithms to a large extent; RF-to-RF is the simplest to deal with, but has the lowest performance (Gangwar et al., WASP, 2003)

Modulo scheduling solutions for clustered architectures are trivial, and a comparative study with concrete running times is missing

Future work will focus on developing scheduling algorithms for architectures with specialized interconnects


Thank You
