UNIVERSITAT POLITÈCNICA DE CATALUNYA
Departament d'Arquitectura de Computadors
Exploiting Pseudo-schedules to Guide Data Dependence Graph Partitioning
Alex Aletà, Josep M. Codina, Jesús Sánchez, Antonio González, David Kaeli
{aaleta, jmcodina, fran, antonio}@[email protected]
PACT 2002, Charlottesville, Virginia – September 2002
Clustered Architectures
• Current/future challenges in processor design
  • Delay in the transmission of signals
  • Power consumption
  • Architecture complexity
• Clustering: divide the system into semi-independent units
  • Each unit is a cluster
  • Fast intra-cluster interconnects
  • Slow inter-cluster interconnects
• Common trend in commercial VLIW processors
  • TI's C6x
  • Analog's TigerSHARC
  • HP's LX
  • Equator's MAP1000
Architecture Overview
[Figure: n clusters, each with a local register file, FUs, and a memory unit, sharing an L1 cache and connected by register buses]
Instruction Scheduling
• For non-clustered architectures, the scheduler must honor:
  • Resources
  • Dependences
• For clustered architectures, it must additionally handle cluster assignment:
  • Minimize inter-cluster communication delays
  • Exploit communication locality
• This work focuses on modulo scheduling for clustered VLIW architectures, a technique to schedule loops
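Modulo scheduling starts from the minimum initiation interval (MII), the larger of a resource-constrained bound and a recurrence-constrained bound. A minimal sketch of these two standard lower bounds (the operation mix, FU counts, and recurrence latencies below are made up for illustration):

```python
from math import ceil

def res_mii(op_counts, fu_counts):
    """Resource-constrained lower bound on II: for each FU type,
    operations of that type divided by units available per cycle."""
    return max(ceil(op_counts[t] / fu_counts[t]) for t in op_counts)

def rec_mii(cycles):
    """Recurrence-constrained lower bound: for each dependence cycle,
    total latency divided by total iteration distance."""
    return max(ceil(lat / dist) for lat, dist in cycles)

# Hypothetical loop: 6 integer ops and 2 memory ops on a machine with
# 2 integer FUs and 1 memory unit; one recurrence of latency 4, distance 2.
mii = max(res_mii({'int': 6, 'mem': 2}, {'int': 2, 'mem': 1}),
          rec_mii([(4, 2)]))
print(mii)  # 3
```

The scheduler tries II = MII first and increments II only when no valid schedule is found.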
Talk Outline
• Previous work
• Proposed algorithm
  • Overview
  • Graph partitioning
  • Pseudo-scheduling
• Performance evaluation
• Conclusions
MS for Clustered Architectures
• Two steps:
  • Data Dependence Graph partitioning: each instruction is assigned to a cluster
  • Scheduling: instructions are scheduled in a suitable slot, but only in the preassigned cluster
• In previous work, two different approaches were proposed:
  • Two steps: cluster assignment first, then scheduling (II++ and retry on failure)
  • One step: cluster assignment and scheduling combined; there is no initial cluster assignment, and the scheduler is free to choose any cluster
Goal of the Work
• Both approaches have benefits
• Two steps: global vision of the Data Dependence Graph
  • Workload is better split among the clusters
  • Number of communications is reduced
• One step: local vision of the partial schedule
  • Cluster assignment is performed with information from the partial schedule
• Goal: obtain an algorithm that combines the benefits of both approaches
Baseline
• Baseline scheme: GP [Aletà et al., MICRO-34]
  • Cluster assignment performed with a graph partitioning algorithm
  • Feedback between the partitioner and the scheduler
  • Results outperformed previous approaches
  • Still little information available for cluster assignment
• New algorithm: a better partition
  • Pseudo-schedules are used to guide the partition
  • Global vision of the Data Dependence Graph
  • More information to perform cluster assignment
Algorithm Overview
[Flowchart: II := MII; compute initial partition; start scheduling; select next operation (j++); schedule Op_j based on the current partition; if that fails, move Op_j to another cluster; if it still cannot be scheduled, refine the partition, II++, and restart]
Algorithm Overview (flowchart recap)
Graph Partitioning Background
• Problem statement: split the nodes into a predetermined number of sets while optimizing an objective function
• Multilevel strategy: coarsen the graph
  • Iteratively fuse pairs of nodes into new macro-nodes
• Enhancing heuristics
  • Avoid excess load in any one set
  • Reduce execution time of the loops
Graph Coarsening
• Previous definitions: matching, slack
• Iterate until the number of nodes equals the number of clusters:
  • Edges are weighted according to:
    • Impact on execution time of adding a bus delay to the edge
    • Slack of the edge
  • Select the maximum-weight matching
  • Nodes linked by edges in the matching are fused into a single macro-node
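One coarsening pass can be sketched as follows. This is a simplified illustration: it assumes edge weights have already been computed from slack and bus-delay impact, and it uses a greedy approximation rather than an exact maximum-weight matching:

```python
def coarsen_once(nodes, edges):
    """One coarsening pass: greedily pick a heavy matching, then fuse
    each matched pair of nodes into a macro-node.
    edges: dict mapping (u, v) -> weight."""
    matched = set()
    matching = []
    # Greedy approximation: scan edges in decreasing weight order.
    for (u, v), w in sorted(edges.items(), key=lambda e: -e[1]):
        if u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    # Fuse matched pairs; unmatched nodes survive unchanged.
    fused = {u: (u, v) for u, v in matching}
    fused.update({v: (u, v) for u, v in matching})
    new_nodes = {fused.get(n, n) for n in nodes}
    return new_nodes, matching

# Hypothetical 4-node graph with precomputed edge weights.
nodes = {'A', 'B', 'C', 'D'}
edges = {('A', 'B'): 4, ('B', 'C'): 2, ('C', 'D'): 4, ('A', 'D'): 1}
new_nodes, matching = coarsen_once(nodes, edges)
print(len(new_nodes))  # 2 macro-nodes
```

Repeating this pass shrinks the graph until it has as many macro-nodes as clusters.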
Coarsening Example
[Figure: starting from the initial graph with edge weights 4, 4, 2, 1, a matching is found, matched nodes are fused, and the process repeats until the final coarsened graph is reached]
Example (II)
• 1st step: partition induced in the original graph
[Figure: initial graph, induced partition, final graph]
• Estimation of execution time needed: pseudo-schedules
• Information obtained: II, SC, lifetimes, spills
Reducing Execution Time
• Building pseudo-schedules:
  • Dependences: respected if possible; otherwise a penalty on register pressure and/or execution time is assessed
  • Cluster assignment: the partition is strictly followed
Pseudo-schedule: example
• Configuration: 2 clusters, 1 FU/cluster, 1 bus of latency 1, II = 2, instruction latency = 3
[Figure: modulo reservation table for the induced partition of A, B, C, D: A at cycle 0, B at cycle 3, D at cycle 4; C cannot be placed at cycle 6 or 7]
Pseudo-schedule: example (cont.)
[Figure: same partition; the pseudo-schedule finally places C at cycle 8]
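Slot placement in a modulo reservation table can be illustrated with a toy sketch. This is not a reconstruction of the slide's exact dependence graph: the chain A → B → C, the clusters, and the latencies below are assumed for illustration:

```python
def place(schedule, cluster, earliest, ii, horizon=64):
    """Find the first cycle >= earliest whose modulo slot
    (cycle mod II) is still free in the given cluster
    (1 FU per cluster: one operation per slot)."""
    for cycle in range(earliest, horizon):
        if (cluster, cycle % ii) not in schedule:
            schedule[(cluster, cycle % ii)] = cycle
            return cycle
    return None

# Assumed toy chain A -> B -> C across clusters; 2 clusters, 1 FU each,
# II = 2, instruction latency 3, inter-cluster bus latency 1.
ii, sched = 2, {}
tA = place(sched, 1, 0, ii)           # A in cluster 1 at cycle 0
tB = place(sched, 2, tA + 3 + 1, ii)  # B after A: latency 3 + bus 1
tC = place(sched, 1, tB + 3 + 1, ii)  # C after B: latency 3 + bus 1
print(tA, tB, tC)  # 0 4 9  (C pushed past cycle 8: slot 0 is held by A)
```

The modulo conflict on C shows why the reservation table, not just dependence latencies, determines where an operation can land.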
Heuristic description
• While there is improvement, iterate:
  • Different partitions are obtained by moving nodes among clusters
  • Partitions that overload resources in any of the clusters are discarded
  • The partition minimizing execution time is chosen
  • In case of a tie, the one that minimizes register pressure is selected
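The selection rule can be sketched as a hill-climbing loop; `exec_time`, `reg_pressure`, and `overloads` below are stand-ins for the pseudo-schedule-based estimates, and the toy cost functions in the usage example are invented for illustration:

```python
def refine(partition, nodes, clusters, exec_time, reg_pressure, overloads):
    """Hill-climbing refinement: try moving each node to each other
    cluster; discard moves that overload a cluster; keep the move
    minimizing execution time, breaking ties by register pressure."""
    best = partition
    best_key = (exec_time(best), reg_pressure(best))
    improved = True
    while improved:
        improved = False
        for n in nodes:
            for c in clusters:
                if c == best[n]:
                    continue
                cand = dict(best)
                cand[n] = c
                if overloads(cand):
                    continue  # resource overload in some cluster
                key = (exec_time(cand), reg_pressure(cand))
                if key < best_key:
                    best, best_key, improved = cand, key, True
    return best

# Toy usage: balance 4 nodes over 2 clusters of capacity 3.
nodes = ['A', 'B', 'C', 'D']
clusters = [0, 1]
load = lambda p, c: sum(1 for n in p if p[n] == c)
exec_time = lambda p: abs(load(p, 0) - load(p, 1))  # toy: imbalance
reg_pressure = lambda p: 0                          # toy: ignored
overloads = lambda p: any(load(p, c) > 3 for c in clusters)
part = refine({n: 0 for n in nodes}, nodes, clusters,
              exec_time, reg_pressure, overloads)
print(load(part, 0), load(part, 1))  # 2 2
```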
Algorithm Overview (flowchart recap)
The Scheduling Step
• To schedule the partition we use URACAM [Codina et al., PACT'01]
  • Figure of merit
  • Uses dynamic transformations to improve the partial schedule
    • Register communications: bus ↔ memory
    • Spill code on the fly: register pressure ↔ memory
• If an instruction cannot be scheduled in the cluster assigned by the partition:
  • Try all other clusters
  • Select the best one according to the figure of merit
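The fallback can be sketched as follows; `can_schedule` and `merit` are hypothetical stand-ins for URACAM's actual feasibility test and figure of merit:

```python
def schedule_op(op, preferred, clusters, can_schedule, merit):
    """Try the cluster assigned by the partition first; if the op does
    not fit there, try all other clusters and pick the best one
    according to a figure of merit (higher is better)."""
    if can_schedule(op, preferred):
        return preferred
    candidates = [c for c in clusters
                  if c != preferred and can_schedule(op, c)]
    if not candidates:
        return None  # no cluster fits: refine partition / increase II
    return max(candidates, key=lambda c: merit(op, c))

# Toy usage: cluster 0 is full, lower-numbered clusters are preferred.
best = schedule_op('mul1', 0, [0, 1, 2],
                   can_schedule=lambda op, c: c != 0,
                   merit=lambda op, c: -c)
print(best)  # 1
```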
Algorithm Overview (flowchart recap)
Partition Refinement
• II has increased, so a better partition may exist for the new II:
  • New slots have been generated in each cluster
  • More lifetimes are available
  • A larger number of bus communications is allowed
• The coarsening process is repeated
  • Only edges between nodes in the same set can appear in the matching
  • After coarsening, the induced partition is the last partition that could not be scheduled
• The execution-time-reduction heuristic is then reapplied
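The matching restriction can be sketched as a pre-filter on the edge set before re-coarsening (`set_of` is a hypothetical map from node to its current partition set):

```python
def intra_set_edges(edges, set_of):
    """Keep only edges whose endpoints lie in the same partition set,
    so re-coarsening never fuses nodes across sets."""
    return {(u, v): w for (u, v), w in edges.items()
            if set_of[u] == set_of[v]}

# Toy usage: the B-C edge crosses sets and is excluded from matching.
edges = {('A', 'B'): 4, ('B', 'C'): 2, ('C', 'D'): 4}
set_of = {'A': 0, 'B': 0, 'C': 1, 'D': 1}
print(intra_set_edges(edges, set_of))  # {('A', 'B'): 4, ('C', 'D'): 4}
```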
Benchmarks and Configurations
• Benchmarks: all of SPECfp95, using the ref input set
• Two schedulers evaluated: GP (previous work) and PSP (pseudo-schedule-based)

Resources (per cluster):
Configuration   INT   FP   MEM
Unified          4     4    4
2-cluster        2     2    2
4-cluster        1     1    1
Latencies:
Operation      INT   FP
ARITH           1     3
MUL/ABS         2     6
DIV/SQR/TRG     6    18
MEM             2
GP vs PSP
[Charts: instructions per cycle, baseline (GP) vs PSP, for two configurations: 32 registers split into 2 clusters, 1 bus (L=1); and 32 registers split into 4 clusters, 1 bus (L=1)]
GP vs PSP
[Charts: instructions per cycle, baseline (GP) vs PSP, for two configurations: 64 registers split into 4 clusters, 1 bus (L=2); and 32 registers split into 4 clusters, 1 bus (L=2)]
Conclusions
• A new algorithm to perform MS for clustered VLIW architectures
  • Cluster assignment based on multilevel graph partitioning
• The partitioning algorithm is improved
  • Based on pseudo-schedules
  • Reliable information available to guide the partition
• Outperforms previous work: 38.5% speedup for some configurations
Any questions?
GP vs PSP (backup)
[Charts: instructions per cycle, baseline (GP) vs PSP, for two configurations: 64 registers split into 2 clusters, 1 bus (L=1); and 64 registers split into 4 clusters, 1 bus (L=1)]
Different Alternatives
• Two steps (cluster assignment, then scheduling; II++ on failure):
  • Global vision when assigning clusters
  • Schedule follows the assignment exactly
  • Re-scheduling does not take into account more resources available
• One step (cluster assignment + scheduling; II++ on failure):
  • Local vision when assigning and scheduling
  • Assignment is based on current resource usage
  • No global view of the graph
• This work (cluster assignment with feedback from scheduling):
  • Global and local views of the graph
  • If an instruction cannot be scheduled, depending on the reason:
    • Re-schedule
    • Re-compute cluster assignment
Clustered Architectures (backup)
• Current/future challenges in processor design
  • Delay in the transmission of signals
  • Power consumption
  • Architecture complexity
• Solutions:
  • VLIW architectures
  • Clustering: divide the system into semi-independent units
    • Fast intra-cluster interconnects
    • Slow inter-cluster interconnects
• Common trend in commercial VLIW processors
  • TI's C6x
  • Analog's TigerSHARC
  • HP's LX
  • Equator's MAP1000
Example (I)
• 1st step: coarsening the graph
[Figure: initial graph with edge weights 1, 5, 3; a matching is found, matched nodes are fused, and the process repeats until the final coarsened graph]
Example (I)
• 1st step: partition induced in the original graph
[Figure: initial graph, coarsened graph, and induced partition]
Reducing Execution Time
• Heuristic description:
  • Different partitions are obtained by moving nodes among clusters
  • Partitions overloading resources in any of the clusters are discarded
  • The partition minimizing execution time is chosen
  • In case of a tie, the one that minimizes register pressure is selected
• Estimation of execution time needed: pseudo-schedules
• Building pseudo-schedules:
  • Dependences: respected if possible; otherwise a penalty on register pressure and/or execution time is assumed
  • Cluster assignment: the partition is strictly followed
• Valuable information can be estimated: II, length of the pseudo-schedule, register pressure
Pseudo-schedules
• Execution time