Heterogeneous Clustered VLIW Microarchitectures
Alex Aletà, Josep M. Codina, Antonio González and David Kaeli
UNIVERSITAT POLITÈCNICA DE CATALUNYA, Departament d’Arquitectura de Computadors
Northeastern University
CGO’07, San Jose, California - March 2007


Clustered Microarchitectures

• Challenges in processor design
  - Wire delays
  - Power consumption
• Clustering: divide the system into semi-independent units
  - Each unit ⇒ Cluster
  - Fast interconnects intra-cluster
  - Slow interconnects inter-clusters
• Common trend in commercial VLIW processors
  - DSP/embedded domain

Clustered VLIW Architecture

[Figure: block diagram of a clustered VLIW processor. Labels: I-cache; clusters 1, 2, ..., N connected by register buses; a cluster detail with a register file and INT, FP and MEM units; data cache; main memory.]

Motivation

• Not all instructions have the same impact on execution time
• Divide resources
  - Performance oriented clusters
    - Higher voltages
    - Faster
    - Place critical instructions
  - Power oriented clusters
    - Lower voltages
    - Consume less power
    - Place non-critical instructions

[Figure: data dependence graph with nodes A to N.]

Motivation

Homogeneous configuration
Cycle time: C0 = 1, C1 = 1, C2 = 1, ICN = 1

Scheduling

Cycle  C0  C1  C2  Bus
0      A
1      B           Com A
2      C   I   L
3      D   J   M
4      E   K   N
5      F
6      G
7      H

[Figure: data dependence graph with nodes A to N.]

Motivation

Heterogeneous configuration
Cycle time: C0 = 1, C1 = 2, C2 = 2, ICN = 1

Scheduling

Cycle  C0  C1  C2  Bus
0      A
1      B           Com A
2      C   I   L
3      D
4      E   J   M
5      F
6      G   K   N
7      H

[Figure: data dependence graph with nodes A to N.]

Talk Outline

• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions

Heterogeneous Architecture

• Configured similar to a multiple clock domain design
  - Domain boundaries:
    - Each cluster
    - Inter-connection Network
    - Memory hierarchy
• Each domain can use a different voltage / frequency
  - Performance oriented: higher voltage/frequency
  - Power oriented: lower voltage/frequency
• Communication between domains: synchronization queues

Heterogeneous Architecture

[Figure: block diagram of the clustered VLIW processor (I-cache; clusters 1, 2, ..., N; register buses; register file with INT, FP and MEM units; data cache; main memory), with synchronization queues added.]

Distributed Control Path

• Fetch and decode units distributed among clusters
  - Instruction lay-out
    - Grouped by cluster

[Figure: instruction layout for two control paths. Centralized control path: a single PC, with instructions stored instruction by instruction (I1C1 I1C2 I1C3 I1C4, I2C1 I2C2 I2C3 I2C4, I3C1 I3C2 I3C3 I3C4, ...). Distributed control path: PC1 to PC4, with instructions stored cluster by cluster (I1C1 I2C1 I3C1 ..., I1C2 I2C2 I3C2 ..., I1C3 I2C3 I3C3 ..., I1C4 I2C4 I3C4 ...).]

[Zhong et al. PACT’05]

Talk Outline

• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions

Statically Scheduled Processors

• Performance relies on the compiler
  - Instruction scheduling
  - Register allocation
  - Clustered microarchitectures
    - Cluster assignment
    - Communications
• Multimedia and numeric code
  - Majority of the execution in loop bodies
  - Modulo scheduling

Modulo Scheduling

• Effective technique for scheduling loops
  - Overlaps loop iterations
  - Increases parallelism

[Figure: iteration space over time. Iterations 1 to 4 of a three-instruction loop body (A, B, C) overlap across cycles 1 to 6.]

• Kernel: pattern repeated every II cycles
• II ≥ MII (Minimum Initiation Interval)
  - Constrained by resources and recurrences
  - MII = max {resMII, recMII}
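
The bound MII = max {resMII, recMII} can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the per-resource operation counts and the (latency, distance) description of each recurrence are hypothetical, and the two bounds follow the textbook modulo-scheduling definitions.

```python
from math import ceil

def res_mii(op_counts, unit_counts):
    """Resource bound: each resource class needs ceil(ops / units) cycles."""
    return max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)

def rec_mii(recurrences):
    """Recurrence bound: each dependence cycle needs
    ceil(total latency / total iteration distance) cycles."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=0)

def mii(op_counts, unit_counts, recurrences):
    """Minimum initiation interval: the larger of the two bounds."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))

# Hypothetical loop: 6 INT ops on 4 INT units, 3 MEM ops on 4 memory ports,
# and one recurrence with total latency 6 and iteration distance 1.
ops = {"int": 6, "mem": 3}
units = {"int": 4, "mem": 4}
print(mii(ops, units, [(6, 1)]))
```

In this made-up loop the recurrence dominates, which is the situation the later slides on recurrence constrained loops focus on.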


MS for Heterogeneous

• Each cluster has its own II
• Constant Iteration Time (IT)
• MIT = max {resMIT, recMIT}

[Figure: iteration space over time. Iterations 1 to 3 of a loop body with instructions A, B, C, D, E across cycles 1 to 6, and the resulting kernel.]

Proposed Technique

[Flow: Profile homogeneous architecture → Select frequencies and voltages → Modulo Scheduling]

Select Voltages and Frequencies

• For each domain
• At program level
• Consider different delays between fast and slow domains
  - Estimate execution time
  - Estimate energy consumption
  - Select minimum ED²
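
The last step, picking the configuration with the smallest energy-delay-squared product, can be sketched as below. The candidate names and their energy and time estimates are invented for illustration only; in the talk these estimates come from the profiling-based models described on the following slides.

```python
def ed2(energy, delay):
    """Energy-delay-squared product: lower is better."""
    return energy * delay ** 2

def select_configuration(candidates):
    """Return the name of the candidate with the smallest ED^2.

    candidates maps a configuration name to a pair
    (estimated energy, estimated execution time).
    """
    return min(candidates, key=lambda name: ed2(*candidates[name]))

# Hypothetical estimates, relative to a homogeneous baseline of (1.0, 1.0).
candidates = {
    "homogeneous": (1.00, 1.00),
    "1 fast + 3 slow": (0.80, 1.05),
    "all slow": (0.60, 1.40),
}
best = select_configuration(candidates)
print(best, round(ed2(*candidates[best]), 3))
```

Because delay is squared, a configuration that saves a lot of energy but runs much slower (the "all slow" entry here) can still lose to a milder trade-off.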


Estimate Execution Time

• Use profiling
  - Texec = (niters + SC - 1) · II · Tcycle, where SC is the stage count of the modulo schedule
• Estimated IT
  - Enough to accommodate
    - All instructions
    - All recurrences
    - Value lifetimes of the profiled homogeneous
    - Communications required by the profiled homogeneous
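
A sketch of the execution-time estimate, assuming the standard modulo-scheduling form (niters + SC - 1) · II · Tcycle, where SC is the stage count: the kernel runs once per iteration plus SC - 1 extra times to fill and drain the software pipeline. The function and parameter names are hypothetical. The second function relies on the relation II · Tcycle = IT used elsewhere in the talk.

```python
def estimate_exec_time(n_iters, ii, stage_count, t_cycle):
    """Estimated loop time: (n_iters + SC - 1) kernel repetitions,
    each lasting II cycles of length t_cycle."""
    return (n_iters + stage_count - 1) * ii * t_cycle

def estimate_exec_time_it(n_iters, iteration_time, stage_count):
    """Heterogeneous variant: every cluster shares the same iteration time
    IT = II * t_cycle, so the estimate depends on IT alone."""
    return (n_iters + stage_count - 1) * iteration_time

# Hypothetical profiled loop: 100 iterations, II = 6 cycles, 3 stages, 1 ns clock.
print(estimate_exec_time(100, 6, 3, 1.0))   # time in ns
print(estimate_exec_time_it(100, 6.0, 3))   # same loop expressed through IT = 6 ns
```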


Estimate Energy Consumption

• Same microarchitecture, different voltages
  - Relative dynamic power

\[
\frac{P_{dyn_1}}{P_{dyn_2}}
  = \frac{p_t \cdot f_1 \cdot C_L \cdot V_{dd_1}^2}{p_t \cdot f_2 \cdot C_L \cdot V_{dd_2}^2}
  = \frac{f_1 \cdot V_{dd_1}^2}{f_2 \cdot V_{dd_2}^2}
\]

  - Relative static power

\[
\frac{P_{stat_1}}{P_{stat_2}}
  = \frac{I_0 \cdot \frac{W}{W_0} \cdot 10^{-V_{th_1}/S} \cdot V_{dd_1}}{I_0 \cdot \frac{W}{W_0} \cdot 10^{-V_{th_2}/S} \cdot V_{dd_2}}
  = 10^{\frac{V_{th_2} - V_{th_1}}{S}} \cdot \frac{V_{dd_1}}{V_{dd_2}}
\]
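
The two ratios can be evaluated directly once the shared terms cancel: the dynamic ratio reduces to (f1 · Vdd1²) / (f2 · Vdd2²) and the static ratio to 10^((Vth2 - Vth1) / S) · Vdd1 / Vdd2. The sketch below is an illustration with hypothetical parameter names and invented operating points; it assumes, as the slide title states, that both domains share the same microarchitecture.

```python
def relative_dynamic_power(f1, vdd1, f2, vdd2):
    """P_dyn1 / P_dyn2 = (f1 * Vdd1^2) / (f2 * Vdd2^2).
    Activity factor and load capacitance cancel for identical hardware."""
    return (f1 * vdd1 ** 2) / (f2 * vdd2 ** 2)

def relative_static_power(vdd1, vth1, vdd2, vth2, s):
    """P_stat1 / P_stat2 = 10^((Vth2 - Vth1) / S) * Vdd1 / Vdd2.
    s is the subthreshold slope, in the same voltage unit as vth1 and vth2."""
    return 10 ** ((vth2 - vth1) / s) * vdd1 / vdd2

# Hypothetical operating points: domain 1 is a slow cluster at half the
# frequency and 0.8x the supply voltage of domain 2, with equal thresholds.
print(relative_dynamic_power(0.5, 0.8, 1.0, 1.0))
print(relative_static_power(0.8, 0.3, 1.0, 0.3, 0.1))
```

With equal threshold voltages the static ratio reduces to the ratio of supply voltages, while the dynamic ratio falls faster because the supply voltage enters squared.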


MS Algorithm

[Flowchart: Compute MIT → IT := MIT → Select IIs & freqs → OK? If NO: Increase IT and select again.]

Cluster  Supported cycle times
C0       1 ns
C1       5/4 ns; 4/3 ns; 3/2 ns

IT = 7 ns
  II(C0) = 7 / 1 = 7 cycles
  II(C1) = 7 / 1.3333 = 5.25
  II(C1) = 7 / 1.25 = 5.6
  II(C1) = 7 / 1.5 = 4.667
  No supported C1 cycle time gives a whole number of cycles, so IT is increased.

IT = 8 ns
  II(C0) = 8 / 1 = 8 cycles
  II(C1) = 8 / 1.3333 = 6 cycles
  II(C1) = 8 / 1.25 = 6.4
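
The search illustrated by these numbers can be reproduced with exact fractions: an IT is usable only if every cluster has a supported cycle time that divides it into a whole number of cycles. The sketch below is an assumption-laden illustration, not the authors' code: the function names, the 1 ns increment and the preference for the slowest valid clock are all hypothetical choices.

```python
from fractions import Fraction

def select_iis(it_ns, supported):
    """For each cluster, pick a supported cycle time that divides IT into a
    whole number of cycles. Return None if some cluster has no such option."""
    choice = {}
    for cluster, cycle_times in supported.items():
        valid = [t for t in cycle_times if (it_ns / t).denominator == 1]
        if not valid:
            return None
        t = max(valid)  # assumed tie-break: prefer the slowest valid clock
        choice[cluster] = (t, int(it_ns / t))
    return choice

def first_feasible_it(mit_ns, supported, step=Fraction(1), limit=100):
    """Increase IT from MIT in fixed steps until select_iis succeeds."""
    it = Fraction(mit_ns)
    for _ in range(limit):
        choice = select_iis(it, supported)
        if choice is not None:
            return it, choice
        it += step
    raise ValueError("no feasible IT found within the search limit")

# Cycle times from the slide, in ns.
supported = {
    "C0": [Fraction(1)],
    "C1": [Fraction(5, 4), Fraction(4, 3), Fraction(3, 2)],
}
it, choice = first_feasible_it(7, supported)
print(it, choice)
```

Using `Fraction` avoids the rounding problem visible in the slide's 1.3333 approximation of 4/3, where a floating-point division would not produce an exact integer.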


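MS Algorithm

[Flowchart: Compute MIT → IT := MIT → Select IIs & freqs → OK? → YES: Partition DDG → Schedule → OK? → YES: Done. A NO at either check leads to Increase IT, followed by a new selection of IIs and frequencies.]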
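
One plausible reading of this flowchart as a driver loop is sketched below. The exact branch targets are not fully recoverable from the slide, so the control flow is an assumption, and every callback name (`select`, `partition`, `schedule`, `increase`) is hypothetical: each stands for one box of the flowchart and returns None on failure.

```python
def modulo_schedule(mit, select, partition, schedule, increase, max_tries=64):
    """Try increasing iteration times until all three phases succeed."""
    it = mit
    for _ in range(max_tries):
        config = select(it)                      # Select IIs & freqs
        if config is not None:
            parts = partition(it, config)        # Partition DDG
            if parts is not None:
                result = schedule(it, config, parts)  # Schedule
                if result is not None:
                    return it, result            # Done
        it = increase(it)                        # Increase IT and retry
    raise RuntimeError("no schedule found within max_tries")

# Toy stand-ins: selection fails below IT = 8, scheduling fails below IT = 9.
found_it, found = modulo_schedule(
    7,
    select=lambda it: {"it": it} if it >= 8 else None,
    partition=lambda it, cfg: ["part"],
    schedule=lambda it, cfg, parts: "ok" if it >= 9 else None,
    increase=lambda it: it + 1,
)
print(found_it, found)
```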


MS for Heterogeneous

• Multilevel graph partitioning algorithm
  - Coarsening
    - Place critical recurrences in fast clusters
  - Refinement
    - Estimate execution time and energy consumption
    - Optimize for ED²
• Scheduling
  - Instructions delayed due to synchronization hazards

Recurrence Constrained Loops

• Large benefits expected
  - A small number of instructions are critical for execution time
• 2-cycle latency instructions
• 4 clusters, 1 FU per cluster
• Homogeneous
  - Recurrence: 6 cycles
  - II = 6 cycles (all clusters!) ⇒ lots of unused slots
• Heterogeneous
  - 1 fast cluster, cycle time = 1 ns
  - 3 slow clusters, cycle time = 2 ns
  - IT = 6 ns
  - II(fast cluster) = 6
  - II(slow clusters) = 3

[Figure: data dependence graph with nodes A to K.]
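
The "unused slots" remark can be made concrete by counting issue slots per kernel: with one functional unit per cluster, each cluster offers II = IT / cycle time slots. The helper below is a hypothetical illustration using only the numbers on this slide.

```python
from fractions import Fraction

def kernel_slots(it_ns, cycle_times_ns, fus_per_cluster=1):
    """Issue slots per kernel: each cluster offers IT / cycle time slots per FU."""
    total = 0
    for t in cycle_times_ns:
        ii = Fraction(it_ns) / Fraction(t)
        if ii.denominator != 1:
            raise ValueError("IT must be a whole number of cycles in every cluster")
        total += int(ii) * fus_per_cluster
    return total

homogeneous = kernel_slots(6, [1, 1, 1, 1])    # II = 6 in all four clusters
heterogeneous = kernel_slots(6, [1, 2, 2, 2])  # II = 6 fast, 3 in each slow cluster
print(homogeneous, heterogeneous)
```

Both configurations keep the same 6 ns iteration time set by the recurrence, but the heterogeneous one provides fewer slots, so fewer of them stay empty while the slow clusters run at a lower voltage and frequency.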


Talk Outline

• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions

Experimental Environment

• Microarchitecture
  - Clusters
    - 4 clusters
    - 1 FP-unit, 1 INT-unit, 1 memory port, 16 registers per cluster
  - Inter-connection Network
    - 1-cycle latency broadcast buses
  - Heterogeneity
    - Clusters: 1 performance oriented / 3 power oriented
• Benchmarks:
  - SpecFP2k Fortran programs
  - Loops obtained with ORC
• Baseline: homogeneous architecture

Results

[Figure: bar chart of ED² (y-axis, 0.5 to 1) for 1 bus and 2 buses. Benchmarks: wupwise, swim, mgrid, applu, galgel, facerec, lucas, fma3d, sixtrack, apsi, and the mean. Callouts on the chart: 35%, 30% and 17%.]

High ILP Programs

• All instructions have a similar impact on execution time
  - No benefit using different frequencies
  - Higher IPC
  - Dynamic energy accounts for a majority of the total energy consumption
  - Use lower Vdd and lower frequencies
  - Memory hierarchy and inter-connection network can use different voltages

[Figure: data dependence graph with nodes A to I.]

Talk Outline

• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions

Conclusions

• Heterogeneous clustered architectures
  - Clusters run at different voltages / frequencies
  - Instructions that impact execution time scheduled in fast clusters
  - Remaining instructions in power-oriented clusters
• Proposed compiler techniques
  - Algorithm to select the voltages / frequencies
  - MS for heterogeneous configurations
• ED²: 15% improvement on average
  - Up to 35% for selected programs


Motivating Example

Cycle time: C1 = 1 ns; C2 = 1 ns in the first schedule, 2 ns in the second

Cycle  C1  Bus     C2
0      A
1      B   Comm A
2      C           E
3      D

[The slide shows this schedule twice, once per C2 cycle time; the extracted entries are the same in both.]

[Figure: data dependence graph with nodes A to E.]

Heterogeneous Architecture

• Signals

[Figure: clock distribution diagram. Labels: clock; frequency multiplier; frequency dividers; enable_C1 ... enable_CN, enable_mem, enable_all; clusters C1 to CN; bus; sync queues; on-chip memory.]

Branch Instructions

• Branches decoupled into several instructions (Unbundled Branch Architecture)
  - Branch target computation
    - Independent in each cluster
  - Branch condition evaluation
    - Computed in one cluster
    - Broadcast to the rest
  - Control transfer
    - Different in each cluster

Results

[Figure: two bar charts of ED² (y-axis, 0.5 to 1), one for 1 bus and one for 2 buses. Benchmarks in both: wupwise, swim, mgrid, applu, galgel, facerec, lucas, fma3d, sixtrack, apsi, and the mean.]

Recurrence Constrained Loops

• Largest benefits obtained
  - A small number of instructions are critical for execution time
  - Improvement: 30% - 35% for 189.lucas and 200.sixtrack
• 2-cycle latency instructions
• 4 clusters, 1 FU per cluster
• Homogeneous
  - Recurrence: 6 cycles
  - II = 6 cycles (all clusters!) ⇒ lots of unused slots
• Heterogeneous
  - 1 fast cluster, cycle time = 1 ns
  - 3 slow clusters, cycle time = 2 ns
  - IT = 6 ns
  - II(fast cluster) = 6
  - II(slow clusters) = 3

[Figure: data dependence graph with nodes A to K.]

Smallest Benefits

• Around 5%
  - 168.wupwise
    - Loops have different characteristics
  - 173.applu
    - Low number of iterations
    - Schedule length has a big impact