UNIVERSITAT POLITÈCNICA DE CATALUNYA
Departament d'Arquitectura de Computadors
Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning
Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli
{aaleta, jmcodina, fran, antonio}@[email protected]
PACT 2002, Charlottesville, Virginia – September 2002
Clustered Architectures
• Current/future challenges in processor design
  • Delay in the transmission of signals
  • Power consumption
  • Architecture complexity
• Clustering: divide the system into semi-independent units
  • Each unit is a cluster
  • Fast intra-cluster interconnects
  • Slow inter-cluster interconnects
• Common trend in commercial VLIW processors
  • TI's C6x
  • Analog's TigerSHARC
  • HP's LX
  • Equator's MAP1000
Architecture Overview
[Figure: n clusters, each with a local register file, FUs, and a memory unit, sharing an L1 cache and connected by register buses]
Instruction Scheduling
• For non-clustered architectures, the scheduler must honor:
  • Resources
  • Dependences
• For clustered architectures, it must additionally handle cluster assignment:
  • Minimize inter-cluster communication delays
  • Exploit communication locality
• This work focuses on modulo scheduling for clustered VLIW architectures, a technique to schedule loops
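Modulo scheduling starts from the minimum initiation interval (MII), the larger of a resource-constrained bound and a recurrence-constrained bound. A minimal sketch of these two standard lower bounds (the operation mix, FU counts, and recurrence latencies below are made up for illustration):

```python
from math import ceil

def res_mii(op_counts, fu_counts):
    """Resource-constrained lower bound on II: for each FU type,
    operations of that type divided by units available per cycle."""
    return max(ceil(op_counts[t] / fu_counts[t]) for t in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained lower bound: for each dependence cycle,
    total latency divided by total iteration distance."""
    return max(ceil(lat / dist) for lat, dist in cycles)

# Hypothetical loop: 6 integer ops and 2 memory ops on a machine with
# 2 integer FUs and 1 memory unit; one recurrence of latency 4, distance 2.
mii = max(res_mii({'int': 6, 'mem': 2}, {'int': 2, 'mem': 1}),
          rec_mii([(4, 2)]))
print(mii)  # 3
```

The scheduler tries II = MII first and increments II only when no valid schedule is found.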
Talk Outline
• Previous work
• Proposed algorithm
  • Overview
  • Graph partitioning
  • Pseudo-scheduling
• Performance evaluation
• Conclusions
MS for Clustered Architectures
• Two steps:
  • Data Dependence Graph partitioning: each instruction is assigned to a cluster
  • Scheduling: instructions are scheduled in a suitable slot, but only in the preassigned cluster
• In previous work, two different approaches were proposed:
  • Two steps: cluster assignment first, then scheduling (II++ and retry on failure)
  • One step: cluster assignment and scheduling combined; there is no initial cluster assignment, and the scheduler is free to choose any cluster
Goal of the Work
• Both approaches have benefits
• Two steps: global vision of the Data Dependence Graph
  • Workload is better split among the clusters
  • Number of communications is reduced
• One step: local vision of the partial schedule
  • Cluster assignment is performed with information from the partial schedule
• Goal: obtain an algorithm that combines the benefits of both approaches
Baseline
• Baseline scheme: GP [Aletà et al., MICRO-34]
  • Cluster assignment performed with a graph partitioning algorithm
  • Feedback between the partitioner and the scheduler
  • Results outperformed previous approaches
  • Still little information available for cluster assignment
• New algorithm: a better partition
  • Pseudo-schedules are used to guide the partition
  • Global vision of the Data Dependence Graph
  • More information to perform cluster assignment
Algorithm Overview
[Flowchart: II := MII; compute initial partition; start scheduling; select next operation (j++); schedule Op_j based on the current partition; if that fails, move Op_j to another cluster; if it still cannot be scheduled, refine the partition, II++, and restart]
Algorithm Overview (flowchart recap)
Graph Partitioning Background
• Problem statement: split the nodes into a predetermined number of sets while optimizing an objective function
• Multilevel strategy: coarsen the graph
  • Iteratively fuse pairs of nodes into new macro-nodes
• Enhancing heuristics
  • Avoid excess load in any one set
  • Reduce execution time of the loops
Graph Coarsening
• Previous definitions: matching, slack
• Iterate until the number of nodes equals the number of clusters:
  • Edges are weighted according to:
    • Impact on execution time of adding a bus delay to the edge
    • Slack of the edge
  • Select the maximum-weight matching
  • Nodes linked by edges in the matching are fused into a single macro-node
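One coarsening pass can be sketched as follows. This is a simplified illustration: it assumes edge weights have already been computed from slack and bus-delay impact, and it uses a greedy approximation rather than an exact maximum-weight matching:

```python
def coarsen_once(nodes, edges):
    """One coarsening pass: greedily pick a heavy matching, then fuse
    each matched pair of nodes into a macro-node.
    edges: dict mapping (u, v) -> weight."""
    matched = set()
    matching = []
    # Greedy approximation: scan edges in decreasing weight order.
    for (u, v), w in sorted(edges.items(), key=lambda e: -e[1]):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    # Fuse matched pairs; unmatched nodes survive unchanged.
    fused = {u: (u, v) for u, v in matching}
    fused.update({v: (u, v) for u, v in matching})
    new_nodes = {fused.get(n, n) for n in nodes}
    return new_nodes, matching

# Hypothetical 4-node graph with precomputed edge weights.
nodes = {'A', 'B', 'C', 'D'}
edges = {('A', 'B'): 4, ('B', 'C'): 2, ('C', 'D'): 4, ('A', 'D'): 1}
new_nodes, matching = coarsen_once(nodes, edges)
print(len(new_nodes))  # 2 macro-nodes
```

Repeating this pass shrinks the graph until it has as many macro-nodes as clusters.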
Coarsening Example
[Figure: starting from the initial graph with edge weights 4, 4, 2, 1, a matching is found, matched nodes are fused, and the process repeats until the final coarsened graph is reached]
Example (II)
• 1st step: partition induced in the original graph
[Figure: initial graph, induced partition, final graph]
• Estimation of execution time needed: pseudo-schedules
• Information obtained: II, SC, lifetimes, spills
Reducing Execution Time
• Building pseudo-schedules:
  • Dependences: respected if possible; otherwise a penalty on register pressure and/or execution time is assessed
  • Cluster assignment: the partition is strictly followed
Pseudo-schedule: example
• Configuration: 2 clusters, 1 FU/cluster, 1 bus of latency 1, II = 2, instruction latency = 3
[Figure: modulo reservation table for the induced partition of A, B, C, D: A at cycle 0, B at cycle 3, D at cycle 4; C cannot be placed at cycle 6 or 7]
Pseudo-schedule: example (cont.)
[Figure: same partition; the pseudo-schedule finally places C at cycle 8]
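Slot placement in a modulo reservation table can be illustrated with a toy sketch. This is not a reconstruction of the slide's exact dependence graph: the chain A → B → C, the clusters, and the latencies below are assumed for illustration:

```python
def place(schedule, cluster, earliest, ii, horizon=64):
    """Find the first cycle >= earliest whose modulo slot
    (cycle mod II) is still free in the given cluster
    (1 FU per cluster: one operation per slot)."""
    for cycle in range(earliest, horizon):
        if (cluster, cycle % ii) not in schedule:
            schedule[(cluster, cycle % ii)] = cycle
            return cycle
    return None

# Assumed toy chain A -> B -> C across clusters; 2 clusters, 1 FU each,
# II = 2, instruction latency 3, inter-cluster bus latency 1.
ii, sched = 2, {}
tA = place(sched, 1, 0, ii)           # A in cluster 1 at cycle 0
tB = place(sched, 2, tA + 3 + 1, ii)  # B after A: latency 3 + bus 1
tC = place(sched, 1, tB + 3 + 1, ii)  # C after B: latency 3 + bus 1
print(tA, tB, tC)  # 0 4 9  (C pushed past cycle 8: slot 0 is held by A)
```

The modulo conflict on C shows why the reservation table, not just dependence latencies, determines where an operation can land.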
Heuristic description
• While there is improvement, iterate:
  • Different partitions are obtained by moving nodes among clusters
  • Partitions that overload resources in any of the clusters are discarded
  • The partition minimizing execution time is chosen
  • In case of a tie, the one that minimizes register pressure is selected
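The selection rule can be sketched as a hill-climbing loop; `exec_time`, `reg_pressure`, and `overloads` below are stand-ins for the pseudo-schedule-based estimates, and the toy cost functions in the usage example are invented for illustration:

```python
def refine(partition, nodes, clusters, exec_time, reg_pressure, overloads):
    """Hill-climbing refinement: try moving each node to each other
    cluster; discard moves that overload a cluster; keep the move
    minimizing execution time, breaking ties by register pressure."""
    best = partition
    best_key = (exec_time(best), reg_pressure(best))
    improved = True
    while improved:
        improved = False
        for n in nodes:
            for c in clusters:
                if c == best[n]:
                    continue
                cand = dict(best)
                cand[n] = c
                if overloads(cand):
                    continue  # resource overload in some cluster
                key = (exec_time(cand), reg_pressure(cand))
                if key < best_key:
                    best, best_key, improved = cand, key, True
    return best

# Toy usage: balance 4 nodes over 2 clusters of capacity 3.
nodes = ['A', 'B', 'C', 'D']
clusters = [0, 1]
load = lambda p, c: sum(1 for n in p if p[n] == c)
exec_time = lambda p: abs(load(p, 0) - load(p, 1))  # toy: imbalance
reg_pressure = lambda p: 0                          # toy: ignored
overloads = lambda p: any(load(p, c) > 3 for c in clusters)
part = refine({n: 0 for n in nodes}, nodes, clusters,
              exec_time, reg_pressure, overloads)
print(load(part, 0), load(part, 1))  # 2 2
```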
Algorithm Overview (flowchart recap)
The Scheduling Step
• To schedule the partition we use URACAM [Codina et al., PACT'01]
  • Figure of merit
  • Uses dynamic transformations to improve the partial schedule
    • Register communications: bus ↔ memory
    • Spill code on the fly: register pressure ↔ memory
• If an instruction cannot be scheduled in the cluster assigned by the partition:
  • Try all other clusters
  • Select the best one according to the figure of merit
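The fallback can be sketched as follows; `can_schedule` and `merit` are hypothetical stand-ins for URACAM's actual feasibility test and figure of merit:

```python
def schedule_op(op, preferred, clusters, can_schedule, merit):
    """Try the cluster assigned by the partition first; if the op does
    not fit there, try all other clusters and pick the best one
    according to a figure of merit (higher is better)."""
    if can_schedule(op, preferred):
        return preferred
    candidates = [c for c in clusters
                  if c != preferred and can_schedule(op, c)]
    if not candidates:
        return None  # no cluster fits: refine partition / increase II
    return max(candidates, key=lambda c: merit(op, c))

# Toy usage: cluster 0 is full, lower-numbered clusters are preferred.
best = schedule_op('mul1', 0, [0, 1, 2],
                   can_schedule=lambda op, c: c != 0,
                   merit=lambda op, c: -c)
print(best)  # 1
```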
Algorithm Overview (flowchart recap)
Partition Refinement
• II has increased, so a better partition may exist for the new II:
  • New slots have been generated in each cluster
  • More lifetimes are available
  • A larger number of bus communications is allowed
• The coarsening process is repeated
  • Only edges between nodes in the same set can appear in the matching
  • After coarsening, the induced partition is the last partition that could not be scheduled
• The execution-time-reduction heuristic is then reapplied
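The matching restriction can be sketched as a pre-filter on the edge set before re-coarsening (`set_of` is a hypothetical map from node to its current partition set):

```python
def intra_set_edges(edges, set_of):
    """Keep only edges whose endpoints lie in the same partition set,
    so re-coarsening never fuses nodes across sets."""
    return {(u, v): w for (u, v), w in edges.items()
            if set_of[u] == set_of[v]}

# Toy usage: the B-C edge crosses sets and is excluded from matching.
edges = {('A', 'B'): 4, ('B', 'C'): 2, ('C', 'D'): 4}
set_of = {'A': 0, 'B': 0, 'C': 1, 'D': 1}
print(intra_set_edges(edges, set_of))  # {('A', 'B'): 4, ('C', 'D'): 4}
```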
Benchmarks and Configurations
• Benchmarks: all of SPECfp95, using the ref input set
• Two schedulers evaluated: GP (previous work) and PSP (pseudo-schedule-based)

Resources (per cluster):
Configuration   INT   FP   MEM
Unified          4     4    4
2-cluster        2     2    2
4-cluster        1     1    1
Latencies:
Operation      INT   FP
ARITH           1     3
MUL/ABS         2     6
DIV/SQR/TRG     6    18
MEM             2
GP vs PSP
[Charts: instructions per cycle, baseline (GP) vs PSP, for two configurations: 32 registers split into 2 clusters, 1 bus (L=1); and 32 registers split into 4 clusters, 1 bus (L=1)]
GP vs PSP
[Charts: instructions per cycle, baseline (GP) vs PSP, for two configurations: 64 registers split into 4 clusters, 1 bus (L=2); and 32 registers split into 4 clusters, 1 bus (L=2)]
Conclusions
• A new algorithm to perform MS for clustered VLIW architectures
  • Cluster assignment based on multilevel graph partitioning
• The partitioning algorithm is improved
  • Based on pseudo-schedules
  • Reliable information available to guide the partition
• Outperforms previous work: 38.5% speedup for some configurations
Any questions?
GP vs PSP (backup)
[Charts: instructions per cycle, baseline (GP) vs PSP, for two configurations: 64 registers split into 2 clusters, 1 bus (L=1); and 64 registers split into 4 clusters, 1 bus (L=1)]
Different Alternatives
• Two steps (cluster assignment, then scheduling; II++ on failure):
  • Global vision when assigning clusters
  • Schedule follows the assignment exactly
  • Re-scheduling does not take into account more resources available
• One step (cluster assignment + scheduling; II++ on failure):
  • Local vision when assigning and scheduling
  • Assignment is based on current resource usage
  • No global view of the graph
• This work (cluster assignment with feedback from scheduling):
  • Global and local views of the graph
  • If an instruction cannot be scheduled, depending on the reason:
    • Re-schedule
    • Re-compute cluster assignment
Clustered Architectures (backup)
• Current/future challenges in processor design
  • Delay in the transmission of signals
  • Power consumption
  • Architecture complexity
• Solutions:
  • VLIW architectures
  • Clustering: divide the system into semi-independent units
    • Fast intra-cluster interconnects
    • Slow inter-cluster interconnects
• Common trend in commercial VLIW processors
  • TI's C6x
  • Analog's TigerSHARC
  • HP's LX
  • Equator's MAP1000
Example (I)
• 1st step: coarsening the graph
[Figure: initial graph with edge weights 1, 5, 3; a matching is found, matched nodes are fused, and the process repeats until the final coarsened graph]
Example (I)
• 1st step: partition induced in the original graph
[Figure: initial graph, coarsened graph, and induced partition]
Reducing Execution Time
• Heuristic description:
  • Different partitions are obtained by moving nodes among clusters
  • Partitions overloading resources in any of the clusters are discarded
  • The partition minimizing execution time is chosen
  • In case of a tie, the one that minimizes register pressure is selected
• Estimation of execution time needed: pseudo-schedules
• Building pseudo-schedules:
  • Dependences: respected if possible; otherwise a penalty on register pressure and/or execution time is assumed
  • Cluster assignment: the partition is strictly followed
• Valuable information can be estimated: II, length of the pseudo-schedule, register pressure
Pseudo-schedules
• Execution time