Cost-Sensitive Operation Partitioning for Synthesizing Multicluster Datapath Architectures

Michael Chu, Kevin Fan, Rajiv Ravindran, Scott Mahlke
Advanced Computer Architecture Lab
University of Michigan, Electrical Engineering and Computer Science

Workshop on Application-Specific Processors (WASP-2)
December 2, 2003



Clustered Architectures

• Decentralize architecture to reduce register file bottleneck
• Used in Lx/ST200, TI C6x, Analog Devices TigerSHARC and others
• Goal of our work: automatic synthesis of an application-specific heterogeneous multicluster architecture

[Figure: Homogeneous Clustered Architecture. Cluster 1 (32-bit) and Cluster 2 (32-bit), each with its own register file and identical FUs that all support +, *, -, <<]

[Figure: Heterogeneous Clustered Architecture. Cluster 1 (32-bit) and Cluster 2 (8-bit), each with its own register file and a different mix of FUs (+ -, <<, <<*)]


Our Approach
• Partition operations with both performance and required hardware cost in mind
  – Maintain performance and reduce cost (bitwidth, FU repertoire)
  – Previous work has focused on single basic block, single cluster [Note ‘91] [Paulin ‘89] [Marwedel ‘90]
• Each partition dictates a cluster configuration which has an associated hardware cost

[Figure: two clusters, each with a register file (RF) and functional units (FU)]


Our Proposed System
• Today’s Focus: Cost-Sensitive Operation Partitioning
• Input: Application, High-level machine specification:
  – Number of clusters, number of generic FU’s
• Output: Multicluster Architecture Description


Cost-Sensitive Operation Partitioning
• Builds off Region-Based Hierarchical Operation Partitioning (RHOP)
  – Pure performance based partitioner, no notion of hardware cost
  – Weight calculation creates guides for good partitions
  – Partitioning clusters based on given weights
• Cost metric added to Graph Partitioning phase which accounts for gate cost

[Figure: a Region passes through Weight Calculation and then Graph Partitioning; the example graph is annotated with weights such as 10, 8, and 1]


Coarsening Phase
• Progressively groups highly related operations together
  – Continually pairs operations together
  – Forces partitioner to consider several operations as a single unit
  – Traditional RHOP: coarsen using edge weights
  – Cost-centric coarsening can ignore dependence edge criticality

[Figure: Coarsened State 1 through Coarsened State 4; legend: narrow bitwidth, wide bitwidth]
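
The pairing step can be sketched in a few lines of Python. This is only an illustration of edge-weight-driven coarsening in general: the function name `coarsen_once`, the dictionary encoding of the graph, and the greedy heaviest-edge-first rule are assumptions made for the example, not details taken from RHOP.

```python
def coarsen_once(nodes, edge_weights):
    """One coarsening pass: pair operations along the heaviest free edges.

    nodes        -- iterable of operation names
    edge_weights -- dict mapping (u, v) edges to a weight (hypothetical encoding)
    Returns a list of groups; each group is a tuple of one or two operations.
    """
    matched = set()
    groups = []
    # Visit edges from heaviest to lightest so strongly related ops pair first.
    for (u, v), _ in sorted(edge_weights.items(), key=lambda kv: kv[1], reverse=True):
        if u not in matched and v not in matched:
            groups.append((u, v))
            matched.update((u, v))
    # Operations left without a partner survive as singleton groups.
    groups.extend((n,) for n in nodes if n not in matched)
    return groups


# Made-up example graph: a-b is the most critical edge, c-d the least.
example_nodes = ["a", "b", "c", "d", "e"]
example_edges = {("a", "b"): 10, ("b", "c"): 8, ("c", "d"): 1}
print(coarsen_once(example_nodes, example_edges))
```

Repeating the pass on the resulting groups would give successively coarser states. A cost-centric variant could presumably reuse the same routine with a different affinity in place of the dependence-based edge weights, for example one that favors pairing operations of similar bitwidth.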


Partitioning Phase
• Travel back through each of the coarsening steps; at each stage try refining partition
  – est_cycles: performance metric from traditional RHOP
  – Adds new cost metric for cost of the cluster

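Performance/cost benefit of a move:

\[
\text{benefit} = \frac{1}{\text{cost}_{new} \times \text{est\_cycles}_{new}} - \frac{1}{\text{cost}_{old} \times \text{est\_cycles}_{old}}
\]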


Cost-Sensitive Refinement
• Moves are made when they have positive benefit
• When no more moves can be made, algorithm uncoarsens to previous coarsened state and tries moving again

[Figure: three partition states of narrow and wide bitwidth operations: est cycles = 7, cost: 28K; est cycles = 8, cost: 15K; est cycles = 7, cost: 15K]
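
The acceptance rule can be tried on the three states shown in the figure. The sketch below assumes the benefit is the gain in 1 / (cost × est_cycles) when going from the old partition to the new one, and that the three states are successive; both are assumptions about the slide, and the names `benefit` and `accept_move` are invented for the example.

```python
def benefit(cost_old, est_cycles_old, cost_new, est_cycles_new):
    """Assumed metric: gain in 1 / (cost * est_cycles) from old to new partition."""
    return 1.0 / (cost_new * est_cycles_new) - 1.0 / (cost_old * est_cycles_old)


def accept_move(old, new):
    """Take a move only when its benefit is positive.

    Each state is a (gate cost, est_cycles) pair.
    """
    return benefit(old[0], old[1], new[0], new[1]) > 0


# The three states from the figure, as (gate cost, est_cycles).
states = [(28_000, 7), (15_000, 8), (15_000, 7)]
for old, new in zip(states, states[1:]):
    print(old, "->", new, "accepted:", accept_move(old, new))
```

Under this reading the first move is accepted even though it adds a cycle, because the cost × est_cycles product drops from 196,000 to 120,000; the second move then recovers the cycle at the same cost, lowering the product to 105,000.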


Multicluster Cost Model
• Cost model determines an estimate of gate cost of clusters
  – Estimate minimum required cost to support partitioned operations
• Factors that influence hardware cost:
  – Register file size/width
  – Functional Unit (FU) width
  – FU opcode repertoire
• Greedy algorithm used
  – Ignores dependences between operations
  – Similar to Rec/Res MII calculations for software pipelined loops

[Figure: operations *10, *16, +8, +16, +32, +16, ordered from high cost to low cost, are assigned to Int Unit 1 and Int Unit 2 under a 32-bit register file]

Total cost of cluster: 1 32-bit register file, 1 16-bit multiplier/adder, 1 32-bit adder
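
One way such a greedy estimate could work is sketched below in Python. Everything specific here is an assumption made for illustration: the gate-cost function `opcode_cost` (quadratic for multipliers, linear otherwise), the single width per FU, the per-FU capacity of ceil(ops / FUs) as a rough stand-in for a resource bound, and the tie-break on load. Only the six operations and the expected cluster come from the slide.

```python
from dataclasses import dataclass, field
from math import ceil


def opcode_cost(opcode, width):
    """Hypothetical gate cost: multipliers quadratic in width, other ops linear."""
    return width * width if opcode == "*" else width


@dataclass
class FU:
    width: int = 0                            # widest operation supported
    opcodes: set = field(default_factory=set)  # opcode repertoire
    load: int = 0                             # operations assigned so far

    def cost(self, width=None, opcodes=None):
        width = self.width if width is None else width
        opcodes = self.opcodes if opcodes is None else opcodes
        return sum(opcode_cost(op, width) for op in opcodes)

    def added_cost(self, opcode, width):
        """Extra gates needed for this FU to also support (opcode, width)."""
        grown = self.cost(max(self.width, width), self.opcodes | {opcode})
        return grown - self.cost()


def greedy_cluster(ops, num_fus):
    """Assign ops (opcode, width) to FUs, most expensive first, ignoring dependences."""
    fus = [FU() for _ in range(num_fus)]
    capacity = ceil(len(ops) / num_fus)  # assumed per-FU limit
    for opcode, width in sorted(ops, key=lambda o: opcode_cost(*o), reverse=True):
        open_fus = [fu for fu in fus if fu.load < capacity]
        # Cheapest incremental cost wins; ties go to the less loaded FU.
        fu = min(open_fus, key=lambda f: (f.added_cost(opcode, width), f.load))
        fu.width = max(fu.width, width)
        fu.opcodes.add(opcode)
        fu.load += 1
    rf_width = max(width for _, width in ops)  # register file must hold the widest value
    return rf_width, fus


slide_ops = [("*", 10), ("*", 16), ("+", 8), ("+", 16), ("+", 32), ("+", 16)]
rf_width, units = greedy_cluster(slide_ops, num_fus=2)
print(rf_width, [(fu.width, sorted(fu.opcodes)) for fu in units])
```

With these assumptions the six operations yield a 32-bit register file, a 16-bit unit supporting multiply and add, and a 32-bit add-only unit, which matches the total shown on the slide. A different cost function or capacity rule could change the assignment.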


Experimental Methodology
• Trimaran toolset: a retargetable VLIW compiler
• Evaluated main loop of DSP kernels and selected benchmarks from MediaBench, MiBench and NetBench
• Bitwidth information gathered through automatic program analysis
• Cost estimates computed using Synopsys design tools at 0.18µ
• 64 registers per cluster

Name     Configuration
2-2111   2 clusters; 2 I, 1 F, 1 M, 1 B per cluster
4-2111   4 clusters; 2 I, 1 F, 1 M, 1 B per cluster
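
The configuration names follow a simple pattern: cluster count, a dash, then one digit per unit type in the order I, F, M, B. A small Python sketch of a parser is shown below; `MachineConfig`, `parse_config`, and `total_units` are hypothetical names, and reading the letters as integer, floating-point, memory, and branch units is an assumption, since the slide gives only the letters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineConfig:
    clusters: int
    units_per_cluster: dict  # unit letter (I, F, M, B) -> count per cluster

    def total_units(self):
        """Number of generic FUs across all clusters."""
        return self.clusters * sum(self.units_per_cluster.values())


def parse_config(name):
    """Parse a name such as '2-2111' into a MachineConfig."""
    clusters, units = name.split("-")
    counts = {kind: int(digit) for kind, digit in zip("IFMB", units)}
    return MachineConfig(int(clusters), counts)


for name in ("2-2111", "4-2111"):
    cfg = parse_config(name)
    print(name, cfg.clusters, cfg.units_per_cluster, cfg.total_units())
```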


2-Cluster Cost Savings and Performance

[Figure: bar chart of Percentage Performance Loss / Cost Savings (y-axis from -20.0 to 40.0) for channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, and Average; series: Performance Loss, Cost Savings]


Source of Cost Savings Breakdown

[Figure: bar chart of Normalized Cost (y-axis from 0.3 to 1) for channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, and Average; series: CS-RHOP 32-bit, CS-RHOP]


Pareto Charts of Examined Machines
• A wide spectrum of machine configurations were examined
• Multiple groups often appear with expensive units

[Figure: two Pareto charts, fsed kernel and LU kernel; x-axis: Cost (thousands of gates), y-axis: Relative Performance]
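
For readers unfamiliar with Pareto charts, the short Python sketch below shows how the non-dominated designs can be picked out of a set of (cost, relative performance) points. The function name `pareto_front` is hypothetical and the data points are made up for the example; they are not read from the charts.

```python
def pareto_front(designs):
    """Return designs not dominated by a design that is both cheaper and faster.

    designs -- iterable of (cost, relative_performance) pairs
    """
    front = []
    best_perf = float("-inf")
    # Scan from cheapest to most expensive; at equal cost, try the faster design first.
    for cost, perf in sorted(designs, key=lambda d: (d[0], -d[1])):
        if perf > best_perf:  # keep only points that improve on every cheaper design
            front.append((cost, perf))
            best_perf = perf
    return front


# Illustrative points only: (cost in thousands of gates, relative performance).
made_up = [(6.5, 0.60), (7.0, 0.90), (8.0, 0.80), (9.0, 1.00)]
print(pareto_front(made_up))
```

In this example the 8.0K-gate design is dropped because a cheaper design already performs better.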


Work in Progress
• Merging step
  – How can machine designs for several basic blocks be combined?
• Inaccurate cost model
  – How can a more accurate estimate for the cost be developed?
• Space Exploration (external/internal)
  – Number of clusters and generic FU’s are externally spacewalked
  – Allowable performance increase internally spacewalked
  – What areas of this space exploration should be external/internal?
• Reprogrammability of designed machines


Conclusions
• Developed a cost-sensitive method for partitioning operations across clusters
• Used this partitioning to define an application-specific low-cost multicluster datapath architecture
• Average performance loss and cost savings for two and four cluster machines:

Machine Configuration   Performance Loss   Cost Savings
2-cluster               -5.4%              20.4%
4-cluster               -2.5%              28.0%


Questions?

http://cccp.eecs.umich.edu


Backup Slides


4-Cluster Cost Savings and Performance

[Figure: bar chart of Percentage Performance Loss / Cost Savings (y-axis from -20.0 to 70.0) for channel, dct, fft, fsed, huffman, LU, rls, rawcaudio, rawdaudio, gsmdecode, gsmencode, blowfish, crc, url, and Average; series: Performance, Cost Savings]


Previous Work
• Datapath synthesis
  – Cathedral-III: complete synthesis system from IMEC
  – Paulin and Knight: force directed scheduling
  – Sehwa: designed processing pipelines from behavioral specs
  – PICO: designed application-specific VLIW processors
• Bitwidth sensitive datapath synthesis
  – Valen-C: augmented C language to convey bitwidth information


Weight Calculation Phase
• Edge weights
  – Assigns higher weight to edges likely to increase schedule length when cut
  – Uses a slack distribution method to assign weights
• Node weights
  – Assigns weights to each operation based on how much it is likely to affect the load of the FUs in the cluster
  – Higher weights attributed to operations that can
• Not changed from Traditional RHOP
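
The idea behind slack-based edge weights can be illustrated with a simplified Python sketch. This is a stand-in, not RHOP's slack distribution method: it assumes unit latencies, computes each edge's slack from ASAP and ALAP times, and assigns one of two placeholder weights (10 for zero-slack edges, 1 otherwise). The names `edge_slacks` and `edge_weights` are invented for the example.

```python
def edge_slacks(nodes, edges, latency=1):
    """Slack of each dependence edge in a DAG.

    nodes -- operation names listed in topological order
    edges -- list of (producer, consumer) pairs
    """
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)

    # Earliest start times, walking forward through the DAG.
    asap = {}
    for n in nodes:
        asap[n] = max((asap[p] + latency for p in preds[n]), default=0)
    length = max(asap.values())

    # Latest start times that keep the schedule length, walking backward.
    alap = {}
    for n in reversed(nodes):
        alap[n] = min((alap[s] - latency for s in succs[n]), default=length)

    return {(u, v): alap[v] - asap[u] - latency for u, v in edges}


def edge_weights(slacks, critical=10, non_critical=1):
    """Zero-slack edges would lengthen the schedule if cut, so they weigh more."""
    return {edge: critical if slack == 0 else non_critical
            for edge, slack in slacks.items()}


# Made-up example: a -> b -> c is the critical chain, d -> c has one cycle of slack.
order = ["a", "d", "b", "c"]
deps = [("a", "b"), ("b", "c"), ("d", "c")]
print(edge_weights(edge_slacks(order, deps)))
```

Cutting a heavy edge here corresponds to placing a critical producer and consumer in different clusters, which is what the partitioner tries to avoid.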