35
UNIVERSITAT POLITÈCNICA DE CATALUNYA Departament d’Arquitectura de Computadors Heterogeneous Clustered VLIW Microarchitectures Alex Aletà, Josep M. Codina, Antonio González and David Kaeli CGO’07, San Jose, California - March 2007 Northeastern University

Heterogeneous Clustered VLIW Microarchitectures

  • Upload
    yule

  • View
    37

  • Download
    0

Embed Size (px)

DESCRIPTION

Northeastern University. CGO’07, San Jose, California - March 2007. Heterogeneous Clustered VLIW Microarchitectures. Alex Aletà, Josep M. Codina, Antonio González and David Kaeli. Clustered Microarchitectures. Challenges in processor design Wire delays Power consumption - PowerPoint PPT Presentation

Citation preview

Page 1: Heterogeneous Clustered VLIW Microarchitectures

UNIVERSITAT POLITÈCNICA DE CATALUNYADepartament d’Arquitectura de Computadors

Heterogeneous Clustered VLIW Microarchitectures

Heterogeneous Clustered VLIW Microarchitectures

Alex Aletà, Josep M. Codina, Antonio González and David Kaeli

CGO’07, San Jose, California - March 2007

Northeastern University

Page 2: Heterogeneous Clustered VLIW Microarchitectures

2

Clustered MicroarchitecturesClustered Microarchitectures

Challenges in processor design Wire delays Power consumption

Clustering: divide the system into semi-independent units Each unit Cluster

Fast interconnects intra-cluster Slow interconnects inter-clusters

Common trend in commercial VLIW processors DSP/embedded domain

Page 3: Heterogeneous Clustered VLIW Microarchitectures

3

Clustered VLIW ArchitectureClustered VLIW Architecture

CLUSTER

1CLUSTER

2CLUSTER

N

MAIN MEMORY

Register buses

Clustered VLIW processor

DATA CACHE

DATA CACHE

INT INT FP FP MEM MEM

REGISTER FILE

I-CACHE

Page 4: Heterogeneous Clustered VLIW Microarchitectures

4

MotivationMotivation

Not all instructions have the same impact on execution time

Divide resources Performance oriented clusters

Higher voltages Faster Place critical instructions

Power oriented clusters Lower voltages Consume less power Place non-critical instructions

L

M

NK

I

J

A

B

C

D

E

F

G

H

Page 5: Heterogeneous Clustered VLIW Microarchitectures

5

MotivationMotivation

Cycle

C0 C1 C2 Bus

0 A1 B Com A2 C I L3 D J M4 E K N5 F6 G7 H

L

M

NK

I

J

A

B

C

D

E

F

G

H

C0 C1 C2ICN

Cycle time

1 1 1 1

Homogeneous

Scheduling

Page 6: Heterogeneous Clustered VLIW Microarchitectures

6

MotivationMotivation

Cycle

C0 C1 C2 Bus

0 A

1 B Com A

2 CI L

3 D

4 EJ M

5 F

6 GK N

7 H

L

M

NK

I

J

A

B

C

D

E

F

G

H

C0 C1 C2ICN

Cycle time

1 2 2 1

Heterogeneous

Scheduling

Page 7: Heterogeneous Clustered VLIW Microarchitectures

7

Talk OutlineTalk Outline

Heterogeneous Clustered Architecture

Proposed Compiler Techniques

Experimental Evaluation

Conclusions

Page 8: Heterogeneous Clustered VLIW Microarchitectures

8

Heterogeneous ArchitectureHeterogeneous Architecture

Configured similar to a multiple clock domain design Domain boundaries:

Each cluster Inter-connection Network Memory hierarchy

Each domain can use a different voltage / frequency Performance oriented: higher voltage/frequency Power oriented: lower voltage/frequency

Communication between domains: synchronization queues

Page 9: Heterogeneous Clustered VLIW Microarchitectures

9

DATA CACHE

INT INT FP FP MEM MEM

REGISTER FILE

I-CACHE

Heterogeneous ArchitectureHeterogeneous Architecture

CLUSTER

1CLUSTER

2CLUSTER

N

MAIN MEMORY

Register buses

Clustered VLIW processor

DATA CACHE

synchronizationqueues

Page 10: Heterogeneous Clustered VLIW Microarchitectures

10

Distributed Control PathDistributed Control Path

Fetch and decode units distributed among clusters Instruction lay-out

Grouped by cluster

I1C1 I3

C1I2C1 I1

C2 I3C2I2

C2

I1C3 I3

C3I2C3 I1

C4 I3C4I2

C4

I1C1 I1

C4I1C3I1

C2 I2C1 I2

C4I2C3I2

C2 I3C1 I3

C4I3C3I3

C2

Centralized Control Path

PC

... .........

DistributedControl Path

PC3 PC4

PC1 PC2

[Zhong et al. PACT’05]

Page 11: Heterogeneous Clustered VLIW Microarchitectures

11

Talk OutlineTalk Outline

Heterogeneous Clustered Architecture

Proposed Compiler Techniques

Experimental Evaluation

Conclusions

Page 12: Heterogeneous Clustered VLIW Microarchitectures

12

Statically Scheduled ProcessorsStatically Scheduled Processors

Performance relies on the compiler

Instruction scheduling

Register allocation

Clustered microarchitectures

Cluster assignment

Communications

Multimedia and numeric code

Majority of the execution in loop bodies

Modulo scheduling

Page 13: Heterogeneous Clustered VLIW Microarchitectures

13

cycle 6

cycle 5

cycle 4

cycle 3

cycle 2

cycle 1

Modulo SchedulingModulo Scheduling

Effective technique for scheduling loops

Overlaps loop iterations

Increases parallelism

Iteration space

A

B

C

4th ItA

B

C

3rd It

A

B

C

1st It

Tim

e

Kernel: pattern repeated every II cyclesA

B

C

2nd It

II ≥MII (Minimum Initiation Interval)

Constrained by resources and recurrences

MII = max {resMII, recMII}

Page 14: Heterogeneous Clustered VLIW Microarchitectures

14

MS for HeterogeneousMS for Heterogeneous

A

B

C D

E A

B

C D

E A

B

C D

E

cycle 6

cycle 5

cycle 4

cycle 3

cycle 2

cycle 1

Iteration space

Tim

e

1st It

2nd It

3rd It

A

B

C D

E

KernelEach cluster has its own II

Constant Iteration Time (IT)

MIT = max {resMIT, recMIT}

Page 15: Heterogeneous Clustered VLIW Microarchitectures

15

Proposed TechniqueProposed Technique

Profile homogenous architecture

Select frequencies and voltages

Modulo Scheduling

Page 16: Heterogeneous Clustered VLIW Microarchitectures

16

Select Voltages and FrequenciesSelect Voltages and Frequencies

For each domain

At program level

Consider different delays between fast and slow domains Estimate execution time Estimate energy consumption Select minimum ED2

Page 17: Heterogeneous Clustered VLIW Microarchitectures

17

Estimate Execution TimeEstimate Execution Time

Use profiling Texec= niters · (II + SC - 1) · Tcycle

Estimated IT Enough to accommodate

All instructions All recurrences

Value lifetimes of the profiled homogeneous

Communications required by the profiled homogeneous

Page 18: Heterogeneous Clustered VLIW Microarchitectures

18

Estimate Energy ConsumptionEstimate Energy Consumption

Same microarchitecture, different voltages Relative dynamic power

Relative static power

2

22

2

11

2

1

ddLt

ddLt

dyn

dyn

VCfp

VCfp

P

P

20

0

10

0

2

1

2

1

10

10

ddS

V

t

ddS

V

t

stat

stat

VWWI

VWWI

P

Pth

th

2

112

10dd

ddS

VV

V

Vthth

2

22

2

11

dd

dd

Vf

Vf

Page 19: Heterogeneous Clustered VLIW Microarchitectures

19

MS AlgorithmMS Algorithm

ComputeMIT

IT:=MIT

IncreaseIT

Select IIs& freqs

OK? NO

Cluster

Supported cycle times

C0 1ns

C1 5/4 ns ; 4/3 ns ; 3/2 nsIT= 7ns II(C0)= 7 /1 = 7 cycles

II(C1)= 7 / 1.3333 = 5.25II(C1)= 7 / 1.25 = 5.6

II(C1)= 7 / 1.5 = 4.667

IT= 8ns II(C0)= 8 / 1= 8 cycles II(C1)= 8 / 1.3333 = 6II(C1)= 8 / 1.25 = 6.4

Page 20: Heterogeneous Clustered VLIW Microarchitectures

20

MS AlgorithmMS Algorithm

ComputeMIT

IT:=MIT

IncreaseIT

Done

PartitionDDG

Schedule

Select IIs& freqs

OK?

OK?YES

YES

NO

NO

Page 21: Heterogeneous Clustered VLIW Microarchitectures

21

MS for HeterogeneousMS for Heterogeneous

Multilevel graph partitioning algorithm Coarsening

Place critical recurrences in fast clusters Refinement

Estimate execution time and energy consumption Optimize for ED2

Scheduling Instructions delayed due to synchronization hazards

Page 22: Heterogeneous Clustered VLIW Microarchitectures

22

Recurrence Constrained LoopsRecurrence Constrained Loops

Large benefits expected A small number of instructions are critical for execution time

2-cycle latency instructions 4 clusters, 1 FU per cluster

Homogeneous Recurrence: 6 cycles II= 6 cycles (all clusters!) lots of unused slots

A

B

C

H

I

J

K

D

E

F

G

Heterogeneous 1 fast cluster, cycle time= 1ns 3 slow clusters, cycle time= 2ns

IT= 6nsII(fast cluster)= 6II(slow clusters)= 3

Page 23: Heterogeneous Clustered VLIW Microarchitectures

23

Talk OutlineTalk Outline

Heterogeneous Clustered Architecture

Proposed Compiler Techniques

Experimental Evaluation

Conclusions

Page 24: Heterogeneous Clustered VLIW Microarchitectures

24

Experimental EnvironmentExperimental Environment

Microarchitecture Clusters

4 clusters 1 FP-unit, 1 INT-unit, 1 memory port, 16 registers per cluster

Inter-connection Network 1-cycle latency broadcast buses

Heterogeneity Clusters: 1 performance oriented / 3 power oriented

Benchmarks: SpecFP2k Fortran programs Loops obtained with ORC

Baseline: homogeneous architecture

Page 25: Heterogeneous Clustered VLIW Microarchitectures

25

ResultsResults

0.5

0.6

0.7

0.8

0.9

1

1 bus

2 buses

ED2

35%30%

17%

Page 26: Heterogeneous Clustered VLIW Microarchitectures

26

High ILP ProgramsHigh ILP Programs

All instructions have a similar impact on execution time No benefit using different frequencies Higher IPC

Dynamic energy accounts for a majority of the total energy consumption

Use lower Vdd and lower frequencies

Memory hierarchy and inter-connection network can use different voltages

A

H

I

B

C

D

E

F

G

Page 27: Heterogeneous Clustered VLIW Microarchitectures

27

Talk OutlineTalk Outline

Heterogeneous Clustered Architecture

Proposed Compiler Techniques

Experimental Evaluation

Conclusions

Page 28: Heterogeneous Clustered VLIW Microarchitectures

28

ConclusionsConclusions

Heterogeneous clustered architectures Clusters run at different voltages / frequencies

Instructions that impact execution time scheduled in fast clusters

Remaining instructions in power-oriented clusters

Proposed compiler techniques Algorithm to select the voltages / frequencies MS for heterogeneous configurations

ED2 : 15% improvement on average Up to 35% for selected programs

Page 29: Heterogeneous Clustered VLIW Microarchitectures

UNIVERSITAT POLITÈCNICA DE CATALUNYADepartament d’Arquitectura de Computadors

Heterogeneous Clustered VLIW Microarchitectures

Heterogeneous Clustered VLIW Microarchitectures

Alex Aletà, Josep M. Codina, Antonio González and David Kaeli

CGO’07, San Jose, California - March 2007

Page 30: Heterogeneous Clustered VLIW Microarchitectures

30

Motivating ExampleMotivating Example

C1 Bus C2

0 A

1 B Comm A

2 C E

3 D

A

C

B E

DC1 Bus C2

0 A

1 B Comm A

2 CE

3 D

C1 C2

Cycle time

1 ns

1 ns

2 ns

Page 31: Heterogeneous Clustered VLIW Microarchitectures

31

Heterogeneous ArchitectureHeterogeneous Architecture

Signals

enable_mem

enable_C1

enable_CN

Freq.MultiplierDivider

syncqueues

bu

s

C1

CN

enable_all

clock

On ChipMemory

Page 32: Heterogeneous Clustered VLIW Microarchitectures

32

Branch InstructionsBranch Instructions

Branches decoupled in several instructions (Unbundled Branch Architecture) Branch target computation

Independent in each cluster Branch condition evaluation

Computed in one cluster Broadcasted to the rest

Control transfer Different in each cluster

Page 33: Heterogeneous Clustered VLIW Microarchitectures

33

ResultsResults

0.5

0.6

0.7

0.8

0.9

1

0.5

0.6

0.7

0.8

0.9

1

ED2 ED21 bus 2 buses

Page 34: Heterogeneous Clustered VLIW Microarchitectures

34

Recurrence Constrained LoopsRecurrence Constrained Loops

Largest benefits obtained A small number of instructions are critical for execution time Improvement: 30% - 35% for 189.lucas and 200.sixtrack

2-cycle latency instructions 4 clusters, 1 FU per cluster

Homogeneous Recurrence: 6 cycles II= 6 cycles (all clusters!) lots of unused slots

A

B

C

H

I

J

K

D

E

F

G

Heterogeneous 1 fast cluster, cycle time= 1ns 3 slow clusters, cycle time= 2ns

IT= 6nsII(fast cluster)= 6II(slow clusters)= 3

Page 35: Heterogeneous Clustered VLIW Microarchitectures

35

Smallest BenefitsSmallest Benefits

Around 5% 168.wupwise

Loops have different characteristics 173.applu

Low number of iterations Schedule length has a big impact