An Approach for Implementing Efficient Superscalar CISC Processors
Shiliang Hu, Ilhyun Kim, Mikko Lipasti, James Smith



HPCA 2006, Austin, TX

Processor Design Challenges

• CISC challenges: suboptimal internal micro-ops

– Complex decoders & obsolete features/instructions

– Instruction count expansion: 40% to 50% mgmt, comm …

– Redundancy & Inefficiency in the cracked micro-ops

– Solution: Dynamic optimization

• Other current challenges (CISC & RISC)

– Efficiency (nowadays, less performance gain per transistor)

– Power consumption has become acute

– Solution: Novel efficient microarchitectures


Solution: Architecture Innovations

[Figure: layers, as labeled: software in the architected ISA (OS, drivers, lib code, apps); architected ISA, e.g. x86; dynamic translation; implementation ISA, e.g. fusible ISA; HW implementation (processors, mem-sys, I/O devices).]

• ISA mapping:
– Hardware: simple translation, good for startup performance.
– Software: dynamic optimization, good for hotspots.

• Can we combine the advantages of both?
– Startup: fast, simple translation
– Steady state: intelligent translation/optimization, for hotspots.

[Figure: two design styles. Conventional HW design: decoders + pipeline. VM paradigm: software binary translator + code $ + pipeline.]


Microarchitecture: Macro-op Execution

• Enhanced OoO superscalar microarchitecture
– Process & execute fused macro-ops as single instructions throughout the entire pipeline
– Analogy: car-pooling in every highway lane reduces congestion and raises throughput, AND the speed limit rises from 65 mph to 80 mph.

[Figure: macro-op pipeline with stages Fetch, Align/Fuse, Decode, Rename, Dispatch, Wake-up, Select, RF, EXE, MEM, WB, Retire; labels: fuse bit, cache ports, 3-1 ALUs.]


Related Work: x86 processors

• AMD K7/K8 microarchitecture
– Macro-operations
– High performance, efficient pipeline

• Intel Pentium M
– Micro-op fusion.
– Stack manager.
– High performance, low power.

• Transmeta x86 processors
– Co-designed x86 VM
– VLIW engine + code morphing software.


Related Work

• Co-designed VM: IBM DAISY, BOA
– Full system translator on tree regions + VLIW engine
– Other research projects: e.g. DBT for ILDP

• Macro-op execution
– ILDP, Dynamic Strands, Dataflow Mini-graph, CCG.
– Fill Unit, SCISM, rePLay, PARROT.

• Dynamic Binary Translation / Optimization
– SW based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL. Java and .NET HLL VM runtime systems
– HW based: trace cache fill units, rePLay, PARROT, etc.


Co-designed x86 processor architecture

[Figure: block diagram with x86 code, memory hierarchy, VM translation / optimization software, I-$, code $ (macro-op), vertical x86 decoder, horizontal micro / macro-op decoder, rename/dispatch, issue buffer, and pipeline EXE backend.]

• Co-designed virtual machine paradigm
– Startup: simple hardware decode/crack for fast translation
– Steady state: dynamic software translation/optimization for hotspots.


Fusible Instruction Set

• RISC-ops with unique features:

– A fusible bit per instr. for fusing

– Dense encoding, 16/32-bit ISA

• Special Features to Support x86

– Condition codes

– Addressing modes

– Aware of long immediate values

Fusible ISA Instruction Formats (F = fusible bit; fields listed per format)

Core 32-bit instruction formats:
– F, 10b opcode, 21-bit immediate / displacement
– F, 10b opcode, 16-bit immediate / displacement, 5b Rds
– F, 10b opcode, 11b immediate / disp, 5b Rsrc, 5b Rds
– F, 16-bit opcode, 5b Rsrc, 5b Rsrc, 5b Rds

Add-on 16-bit instruction formats for code density:
– F, 5b opcode, 10b immediate / disp
– F, 5b opcode, 5b immd, 5b Rds
– F, 5b opcode, 5b Rsrc, 5b Rds
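The role of the fusible bit can be illustrated with a minimal sketch. It assumes, as a labeled assumption not stated on the slide, that a set F bit marks the head of a pair and that the head is fused with the instruction immediately following it. The function name `group_macro_ops` and the `(fuse_bit, mnemonic)` tuple layout are hypothetical.

```python
def group_macro_ops(stream):
    """Group a decoded instruction stream into macro-ops.

    stream: list of (fuse_bit, mnemonic) tuples in program order.
    Assumption (not from the slides): fuse_bit == 1 marks the head of a
    pair, and the tail is the instruction that immediately follows it.
    """
    macro_ops = []
    i = 0
    while i < len(stream):
        fuse_bit, mnemonic = stream[i]
        if fuse_bit and i + 1 < len(stream):
            # Head and tail travel through the pipeline as one entity.
            macro_ops.append((mnemonic, stream[i + 1][1]))
            i += 2
        else:
            # Un-fused instruction: a macro-op of size one.
            macro_ops.append((mnemonic,))
            i += 1
    return macro_ops


# Six RISC-ops with two heads marked produce four macro-ops.
example_stream = [(1, "ADD"), (0, "AND"), (0, "ST"),
                  (0, "LD.zx"), (1, "ADD"), (0, "LD")]
grouped = group_macro_ops(example_stream)
```

Under this assumption the hardware needs no dependence analysis to find pairs: the translator has already placed head and tail next to each other, so alignment logic only inspects one bit per instruction.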


Macro-op Fusing Algorithm

• Objectives:
– Maximize fused dependent pairs
– Simple & fast

• Heuristics:
– Pipelined scheduler: only single-cycle ALU ops can be a head. Minimize non-fused single-cycle ALU ops
– Criticality: fuse instructions that are “close” in the original sequence. ALU-op criticality is easier to estimate.
– Simplicity: 2 or fewer distinct register operands per fused pair

• Solution: two-pass fusing algorithm:
– The 1st pass, a forward scan, prioritizes ALU ops, i.e. for each ALU-op tail candidate, look backward in the scan for its head
– The 2nd pass considers all kinds of RISC-ops as tail candidates


Fusing Algorithm: Example

x86 asm:
-----------------------------------------------------------
1. lea eax, DS:[edi + 01]
2. mov [DS:080b8658], eax
3. movzx ebx, SS:[ebp + ecx << 1]
4. and eax, 0000007f
5. mov edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops:
-----------------------------------------------------------
1. ADD Reax, Redi, 1
2. ST Reax, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. AND Reax, 0000007f
5. ADD R17, Reax, Resi
6. LD Redx, mem[R17 + 0x7c]

After fusing: Macro-ops
-----------------------------------------------------------
1. ADD R18, Redi, 1 :: AND Reax, R18, 007f
2. ST R18, mem[R22]
3. LD.zx Rebx, mem[Rebp + Recx << 1]
4. ADD R17, Reax, Resi :: LD Redx, mem[R17+0x7c]
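The pairing in this example can be reproduced with a simplified sketch of the two-pass heuristic. This is an illustration under stated assumptions, not the authors' implementation: every "alu" op is treated as single-cycle, the operand limit is read as at most two distinct source registers per pair, only register true dependences are tracked (no memory dependences), and the renaming that introduces R18 is not modeled. All names (`RiscOp`, `true_deps`, `creates_cycle`, `fuse`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RiscOp:
    name: str
    kind: str                 # "alu", "ld", "st" (all "alu" assumed single-cycle)
    dst: Optional[str]        # destination register, if any
    srcs: Tuple[str, ...]     # source registers


def true_deps(ops):
    """deps[i] = indices of the most recent earlier writers of op i's sources."""
    last_writer, deps = {}, []
    for i, op in enumerate(ops):
        deps.append({last_writer[r] for r in op.srcs if r in last_writer})
        if op.dst is not None:
            last_writer[op.dst] = i
    return deps


def creates_cycle(deps, head, tail):
    """True if an op between head and tail depends on head and feeds tail."""
    reached = set()
    for x in range(head + 1, tail):
        if deps[x] & ({head} | reached):
            reached.add(x)
    return bool(deps[tail] & reached)


def fuse(ops):
    """Return sorted (head, tail) index pairs chosen by the two passes."""
    deps = true_deps(ops)
    fused, pairs = set(), []
    for tail_kinds in ({"alu"}, None):      # pass 1: ALU tails; pass 2: any tail
        for t, tail in enumerate(ops):      # forward scan
            if t in fused or (tail_kinds and tail.kind not in tail_kinds):
                continue
            for h in sorted(deps[t], reverse=True):   # look backward, nearest first
                head = ops[h]
                if h in fused or head.kind != "alu":
                    continue
                sources = set(head.srcs) | (set(tail.srcs) - {head.dst})
                if len(sources) > 2 or creates_cycle(deps, h, t):
                    continue
                pairs.append((h, t))
                fused.update((h, t))
                break
    return sorted(pairs)


# The six RISC-ops from the example (0-based indices).
ops = [
    RiscOp("ADD", "alu", "Reax", ("Redi",)),
    RiscOp("ST", "st", None, ("Reax", "R22")),
    RiscOp("LD.zx", "ld", "Rebx", ("Rebp", "Recx")),
    RiscOp("AND", "alu", "Reax", ("Reax",)),
    RiscOp("ADD", "alu", "R17", ("Reax", "Resi")),
    RiscOp("LD", "ld", "Redx", ("R17",)),
]
pairs = fuse(ops)   # [(0, 3), (4, 5)]
```

Pass 1 fuses the first ADD with the AND. The second ADD finds its only producer already taken, so it stays free until pass 2, where it becomes the head of the final LD. This matches the two macro-op pairs listed in the example.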


Instruction Fusing Profile

[Figure: chart of the instruction fusing profile. Y-axis: percentage of dynamic instructions (0% to 100%); categories: ALU, FP or NOPs, BR, ST, LD, Fused.]

• 55+% fused RISC-ops increases effective ILP by 1.4

• Only 6% single-cycle ALU ops left un-fused.
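One way to read the 1.4 figure, offered here as an interpretation rather than the authors' stated derivation: if a fraction f of RISC-ops is fused into pairs, each pair occupies one pipeline slot instead of two, so the same slots carry 1 / (1 - f/2) times as many RISC-ops. The function names below are hypothetical.

```python
def macro_ops_per_risc_op(fused_fraction):
    """Pipeline entities per RISC-op when fused ops travel two per entity."""
    if not 0.0 <= fused_fraction <= 1.0:
        raise ValueError("fused_fraction must be between 0 and 1")
    return 1.0 - fused_fraction / 2.0


def effective_bandwidth_gain(fused_fraction):
    """RISC-ops carried per pipeline slot, relative to no fusing."""
    return 1.0 / macro_ops_per_risc_op(fused_fraction)


gain = effective_bandwidth_gain(0.55)   # 1 / 0.725, roughly 1.38
```

With f = 0.55 the gain is about 1.38, consistent with the rounded 1.4 on the slide.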


Processor Pipeline

[Figure: x86 pipeline vs. macro-op pipeline. x86 pipeline stages: Fetch, Align, x86 Decode1, x86 Decode2, x86 Decode3, Rename, Dispatch, Wakeup, Select, Payload, RF, EXE, WB, Retire. Macro-op pipeline stages: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wakeup, Select, Payload, RF, EXE, WB, Retire. Labels: pipelined 2-cycle issue logic, reduced instr. traffic throughout, pipelined scheduler, reduced forwarding.]

• Macro-op pipeline for efficient hotspot execution
– Execute macro-ops
– Higher IPC, and higher clock speed potential
– Shorter pipeline front-end


Co-designed x86 pipeline front-end

[Figure: front-end stages Fetch (16 bytes), Align / Fuse, Decode, Rename, Dispatch; instructions 1 to 6 are aligned into slot 0, slot 1, and slot 2.]


Co-designed x86 pipeline backend

[Figure: back-end stages Wakeup, Select, Payload, RF, EXE, WB/Mem with a 2-cycle macro-op scheduler. Lanes 0, 1, and 2 each show dual-entry slots, an issue port, 2 read ports, an ALU, and a 3-1 ALU; Mem Port 0 and Mem Port 1 are also shown.]


Experimental Evaluation

• x86vm: Experimental framework for exploring the co-designed x86 virtual machine paradigm.

• Proposed co-designed x86 processor – A specific instantiation of the framework.

– Software components: VMM – DBT, Code caches, VM runtime control and resource management system (Extracted some source code from BOCHS 2.2)

– Hardware components: Microarchitecture timing simulators, Baseline OoO Superscalar, Macro-op Execution, etc.

• Benchmarks: SPEC2000 integer


Performance Evaluation: SPEC2000

[Figure: relative IPC performance (y-axis, 0.5 to 1.3) vs. issue window size (x-axis: 16, 32, 48, 64). Series: 4-wide Macro-op, 3-wide Macro-op, 2-wide Macro-op, 4-wide Base, 3-wide Base.]


Performance Contributors

• Many factors contribute to the IPC performance improvement:
– Code straightening
– Macro-op fusing and execution
– Shorter pipeline front-end (reduced branch penalty)
– Collapsed 3-1 ALUs (resolve branches & addresses sooner)

• Besides the baseline and macro-op models, we model three intermediate configurations:
– M0: baseline + code cache
– M1: M0 + macro-op fusing
– M2: M1 + shorter pipeline front-end (macro-op mode)
– Macro-op: M2 + collapsed 3-1 ALUs


Performance Contributors: SPEC2000

[Figure: normalized IPC speedup (%) (y-axis, -10 to 70). Series: M0: Base + Code $; M1: M0 + fusing; M2: M1 + shorter pipe; Macro-op: M2 + 3-1 ALU.]


Conclusions

• Architecture enhancement
– The hardware/software co-designed paradigm enables novel designs & more desirable system features
– Fusing dependent instruction pairs collapses the dataflow graph to increase ILP

• Complexity effectiveness
– Pipelined 2-cycle instruction scheduler
– Significantly reduced ALU value forwarding network
– DBT software reduces hardware complexity

• Power consumption implication
– Reduced pipeline width
– Reduced inter-instruction communication and instruction management


Finale – Questions & Answers

Suggestions and comments are welcome, Thank you!


Outline

• Motivation & Introduction

• Processor Microarchitecture Details

• Evaluation & Conclusions


Performance Simulation Configuration

Parameter | Baseline | Baseline pipelined | Macro-op
ROB size | 128 | 128 | 128
Retire width | 3,4 | 3,4 | 2,3,4 MOP
Scheduler pipeline stages | 1 | 2 | 2
Fuse RISC-ops? | No | No | Yes
Issue width | 3,4 | 3,4 | 2,3,4 MOP
Issue window size (all) | Variable. Sample points: from 16, up to 64. Effectively larger for the macro-op mode.
Register file | 128 entries, 8,10 read ports, 5,6 write ports (baseline models) | 128 entries, 6,8,10 read & 6,8,10 write ports (macro-op)
Functional units (all) | 4,6,8 INT ALU, 2 MEM R/W ports, 2 FP ALU
Cache hierarchy (all) | 4-way 32KB L1-I, 4-way 32KB L1-D, 8-way 1 MB L2
Cache/memory latency (all) | L1: 2 cycles + 1 cycle AGU, L2: 8 cycles, Mem: 200 cycles for the 1st chunk, 6 cycles b/w chunks
Fetch width | 16-byte x86 instructions (baseline models) | 16B fusible micro-ops (macro-op)
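The latency row can be turned into a small worked calculation. This is a sketch only: the table does not give the number of chunks per cache line, so `chunks` is a hypothetical parameter, and the functions ignore how the levels overlap or add up in the actual simulator.

```python
L1_HIT_CYCLES = 2        # L1 access latency from the table
AGU_CYCLES = 1           # address generation adds one cycle
L2_CYCLES = 8            # L2 latency from the table
MEM_FIRST_CHUNK = 200    # cycles until the first chunk arrives from memory
MEM_NEXT_CHUNK = 6       # cycles between subsequent chunks


def l1_load_latency():
    """Cycles for a load that hits in L1, including address generation."""
    return AGU_CYCLES + L1_HIT_CYCLES


def memory_fill_latency(chunks):
    """Cycles until the last of `chunks` chunks arrives from main memory.

    `chunks` is an assumed transfer count; the slides do not state it.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    return MEM_FIRST_CHUNK + MEM_NEXT_CHUNK * (chunks - 1)


four_chunk_fill = memory_fill_latency(4)   # 200 + 3 * 6 = 218
```

For example, a line delivered in four chunks would take 218 cycles to arrive in full, against 3 cycles for an L1 hit.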


Fuse Macro-ops: An Illustrative Example

# | x86 instructions | Fusible ISA | Execution latency
1 | | LD Rtmp, [Rebx + 02] | 3
2 | cmp ds:[ebx + 02], 0d | CMP Rtmp, 0d :: Jz 2f | 1
3 | jnz 08115ae1 | |
4 | jmp 08115bf2 | (direct jmp removed) |
5 | add esp, 0c | ADD.cc Resp, 0c :: LD Rebx,[Resp] | 3
6 | pop ebp | ADD Resp, 4 :: LD Rtmp,[Resp] | 3
7 | ret_near | ADD Resp, 4 | 1
8 | | BR.ret Rtmp | 1

Figure label: 16 Bytes

6 x86 instructions (20 bytes) become 9 RISC-like instructions, fused into 6 macro-ops: 6 issue queue slots & 6 issues.


Translation Framework

Dynamic binary translation framework:

1. Form hotspot superblock. Crack x86 instructions into RISC-style micro-ops

2. Perform Cluster Analysis of embedded long immediate values and assign to registers if necessary.

3. Generate RISC-ops (IR form) in the implementation ISA

4. Construct DDG (Data Dependency Graph) for the superblock

5. Fusing Algorithm: Scan looking for dependent pairs to be fused. Forward scan, backward pairing. Two-pass to prioritize ALU ops.

6. Assign registers; re-order fused dependent pairs together, extend live ranges for precise traps, use consistent state mapping at superblock exits

7. Code generation to code cache


Other DBT Software Profile

• Of all fused macro-ops:
– 50% ALU-ALU pairs.
– 30% fused condition test & conditional branch pairs.
– Others mostly ALU-MEM ops pairs.

• Of all fused macro-ops:
– 70+% are inter-x86-instruction fusion.
– 46% access two distinct source registers,
– only 15% (6% of all instruction entities) write two distinct destination registers.

• Translation overhead profile
– About 1000+ instructions per translated hotspot instruction.


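Dependence Cycle Detection

[Figure: three dependence-graph cases (a), (b), (c) over nodes A, B, C, D (with N, X, Y), marking head and tail fusing candidates and a questioned dependence.]

• All cases are generalized to (c) due to Anti-Scan Fusing Heuristic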
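For comparison, a generic cycle check can be sketched as follows. It is not the simplified case-(c) test the slide refers to: it collapses every fused pair into one node and asks whether the resulting dependence graph still has a topological order. The function name and the edge/pair encodings are hypothetical; `graphlib` is in the Python standard library from 3.9.

```python
from graphlib import CycleError, TopologicalSorter


def fusing_creates_cycle(num_ops, edges, pairs):
    """Check whether treating each fused pair as one node makes the DDG cyclic.

    num_ops: number of RISC-ops, indexed 0..num_ops-1 in program order.
    edges:   (producer, consumer) dependence edges between op indices.
    pairs:   (head, tail) index pairs proposed for fusing.
    """
    # Map every op to its group representative (the head for fused tails).
    group = {i: i for i in range(num_ops)}
    for head, tail in pairs:
        group[tail] = head

    # Predecessor sets between groups; edges inside a pair are dropped.
    preds = {group[i]: set() for i in range(num_ops)}
    for src, dst in edges:
        if group[src] != group[dst]:
            preds[group[dst]].add(group[src])

    try:
        TopologicalSorter(preds).prepare()   # raises CycleError on a cycle
    except CycleError:
        return True
    return False


# Op 1 sits between head 0 and tail 3 on a dependence chain: fusing is illegal.
chain = fusing_creates_cycle(4, [(0, 1), (1, 3), (0, 3)], [(0, 3)])
# Two pairs that depend on each other in opposite directions.
crossed = fusing_creates_cycle(4, [(0, 2), (1, 3), (0, 3), (1, 2)],
                               [(0, 2), (1, 3)])
```

A full graph check like this costs more than a local test, which is one plausible reason to prefer a scan order under which only a single case needs to be examined.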


HST back-end profile

• Light-weight opts: ProcLongImm, DDG setup, encode – tens of instrs. each

• Heavy-weight opts: uops translation, fusing, codegen – none dominates

[Figure: overhead per x86 instruction (initial load from disk). Y-axis: number of x86 instructions (0 to 1200); components: ProcLongImm, xlate_uop, DDGsetup, Fuse macro-ops, Codegen.]


Hotspot Coverage vs. runs

[Figure: hotspot coverage % (y-axis, 0 to 100). Series: 100M, TestRun, RefRun.]


Hotspot Detected vs. runs

[Figure: overhead, in instructions translated per million instructions executed (y-axis, 0 to 40), per benchmark: 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf. Series: 100M, TestRun, RefRun.]


Performance Evaluation: SPEC2000

[Figure: y-axis 0.5 to 1.3; x-axis: 16, 32, 48, 64. Series: 4-wide Macro-op, 3-wide Macro-op, 2-wide Macro-op, 4-wide Base, 3-wide Base.]


Performance evaluation (WSB2004)

[Figure: relative IPC performance (y-axis, 0.5 to 1.2) vs. issue buffer size (x-axis: 16, 32, 48, 64). Series: 4-wide Macro-op, 3-wide Macro-op, 2-wide Macro-op, 4-wide Base, 3-wide Base.]


Performance Contributors (WSB2004)

[Figure: normalized IPC speedup (%) (y-axis, -10 to 40). Series: M0: Base + Code $; M1: M0 + fusing; M2: M1 + shorter pipe; Macro-op: M2 + 3-1 ALU.]


Future Directions

• Co-designed virtual machine technology:
– Confidence: more realistic benchmark study, important for whole-workload behavior such as hotspot behavior and impact of context switches.
– Enhancement: more synergetic, complexity-effective HW/SW co-design techniques.
– Application: specific enabling techniques for specific novel computer architectures of the future.

• Example co-designed x86 processor design:
– Confidence: study as above.
– Enhancement: HW μ-arch: reduce register write ports. VMM: more dynamic optimizations in HST, e.g. CSE, software stack manager, SIMDification.