An Approach for Implementing Efficient Superscalar CISC Processors

Shiliang Hu, Ilhyun Kim, Mikko Lipasti, James Smith

HPCA 2006, Austin, TX

Processor Design Challenges

• CISC challenges: suboptimal internal micro-ops
– Complex decoders and obsolete features/instructions
– Instruction count expansion: 40% to 50% (mgmt, comm …)
– Redundancy and inefficiency in the cracked micro-ops
– Solution: dynamic optimization

• Other current challenges (CISC & RISC)
– Efficiency: less performance gain per transistor nowadays
– Power consumption has become acute
– Solution: novel, efficient microarchitectures


[Figure: software in the architected ISA, e.g. x86 (OS, drivers, library code, applications), is mapped by dynamic translation onto the implementation ISA, e.g. the fusible ISA, which runs on the hardware implementation (processors, memory system, I/O devices).]

Solution: Architecture Innovations

• ISA mapping:
– Hardware: simple translation, good for startup performance.
– Software: dynamic optimization, good for hotspots.

• Can we combine the advantages of both?
– Startup: fast, simple translation.
– Steady state: intelligent translation/optimization for hotspots.
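The staged strategy can be pictured as a dispatcher that stays on the simple translation path until a code region proves hot. The sketch below is illustrative only: the threshold value, the class name `StagedTranslator`, and the string placeholders for translated code are assumptions, not details given in the talk.

```python
from collections import defaultdict

HOT_THRESHOLD = 3  # hypothetical value; the slides do not state a threshold


class StagedTranslator:
    """Toy model: simple translation at startup, optimization for hotspots."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.exec_counts = defaultdict(int)  # region address -> executions seen
        self.code_cache = {}                 # region address -> optimized code

    def run_region(self, addr):
        """Return the name of the path that executes the region at addr."""
        if addr in self.code_cache:
            return "optimized"               # steady state: reuse the code cache
        self.exec_counts[addr] += 1
        if self.exec_counts[addr] >= self.threshold:
            # The region became hot: optimize it once, for later executions.
            self.code_cache[addr] = f"optimized code for {addr:#x}"
        return "simple"                      # startup: fast, simple translation
```

With a threshold of 3, a region runs on the simple path three times and on the optimized path from the fourth execution onward.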

[Figure: conventional HW design (decoders feeding the pipeline) versus the VM paradigm (a software binary translator fills a code $ that feeds the pipeline).]


Microarchitecture: Macro-op Execution

• Enhanced OoO superscalar microarchitecture
– Process and execute fused macro-ops as single instructions throughout the entire pipeline
– Analogy: when all lanes car-pool, the highway is less congested and throughput is higher, AND the speed limit rises from 65 mph to 80 mph.

[Figure: macro-op pipeline stages: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wake-up, Select, RF, EXE, MEM, WB, Retire. Annotations: fuse bit, cache ports, 3-1 ALUs.]


Related Work: x86 processors

• AMD K7/K8 microarchitecture
– Macro-operations
– High-performance, efficient pipeline

• Intel Pentium M
– Micro-op fusion
– Stack manager
– High performance, low power

• Transmeta x86 processors
– Co-designed x86 VM
– VLIW engine + code morphing software


Related Work

• Co-designed VM: IBM DAISY, BOA – Full system translator on tree regions + VLIW engine

– Other research projects: e.g. DBT for ILDP

• Macro-op execution

– ILDP, Dynamic Strands, Dataflow Mini-graph, CCG.

– Fill Unit, SCISM, rePLay, PARROT.

• Dynamic Binary Translation / Optimization

– SW based (often user mode only): UQBT, Dynamo (RIO), IA-32 EL; Java and .NET HLL VM runtime systems

– HW based: trace cache fill units, rePLay, PARROT, etc.


Co-designed x86 processor architecture

• Co-designed virtual machine paradigm
– Startup: simple hardware decode/crack for fast translation
– Steady state: dynamic software translation/optimization for hotspots

[Figure: x86 code from the memory hierarchy reaches the pipeline along two paths: through the I-$ and the vertical x86 decoder, or, after VM translation/optimization software, through the macro-op code $. Both feed the horizontal micro-op/macro-op decoder, followed by rename/dispatch, the issue buffer, and the pipeline EXE backend.]


Fusible Instruction Set

• RISC-ops with unique features:

– A fusible bit per instr. for fusing

– Dense encoding, 16/32-bit ISA

• Special Features to Support x86

– Condition codes

– Addressing modes

– Aware of long immediate values

Fusible ISA Instruction Formats

Every format carries a 1-bit fusible flag (F).

- Core 32-bit instruction formats
  F + 10b opcode + 21-bit immediate / displacement
  F + 10b opcode + 16-bit immediate / displacement + 5b Rds
  F + 10b opcode + 11b immediate / disp + 5b Rsrc + 5b Rds
  F + 16-bit opcode + 5b Rsrc + 5b Rsrc + 5b Rds

- Add-on 16-bit instruction formats for code density
  F + 5b opcode + 10b immediate / disp
  F + 5b opcode + 5b immd + 5b Rds
  F + 5b opcode + 5b Rsrc + 5b Rds
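As a quick consistency check, the field widths listed for the instruction formats can be written down as tables and summed. This is a minimal sketch: the format names and the MSB-first field order used by `pack` are assumptions made for illustration; only the field widths come from the slide.

```python
# Field widths in bits come from the slide; names and field ORDER are assumed.
CORE_32BIT_FORMATS = {
    "long_imm": [("F", 1), ("opcode", 10), ("imm", 21)],
    "reg_imm16": [("F", 1), ("opcode", 10), ("imm", 16), ("Rds", 5)],
    "reg_reg_imm11": [("F", 1), ("opcode", 10), ("imm", 11), ("Rsrc", 5), ("Rds", 5)],
    "three_reg": [("F", 1), ("opcode", 16), ("Rsrc1", 5), ("Rsrc2", 5), ("Rds", 5)],
}
ADDON_16BIT_FORMATS = {
    "imm10": [("F", 1), ("opcode", 5), ("imm", 10)],
    "reg_imm5": [("F", 1), ("opcode", 5), ("imm", 5), ("Rds", 5)],
    "reg_reg": [("F", 1), ("opcode", 5), ("Rsrc", 5), ("Rds", 5)],
}


def total_width(fields):
    """Sum the bit widths of all fields in one format."""
    return sum(width for _, width in fields)


def pack(fields, values):
    """Pack field values into one integer, first field in the top bits."""
    word = 0
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word
```

Every core format sums to 32 bits and every add-on format to 16 bits, so the single fusible bit fits without widening either encoding.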


Macro-op Fusing Algorithm

• Objectives:
– Maximize fused dependent pairs
– Simple and fast

• Heuristics:
– Pipelined scheduler: only single-cycle ALU ops can be a head. Minimize non-fused single-cycle ALU ops.
– Criticality: fuse instructions that are “close” in the original sequence. ALU-op criticality is easier to estimate.
– Simplicity: two or fewer distinct register operands per fused pair.

• Solution: two-pass fusing algorithm
– The 1st pass, a forward scan, prioritizes ALU ops: for each ALU-op tail candidate, look backward in the scan for its head.
– The 2nd pass considers all kinds of RISC-ops as tail candidates.
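The two-pass scan can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' translator code: the `Op` record, the `window` limit standing in for "close", and the choice to count only source registers toward the two-operand limit are all hypothetical, and register renaming and dependence-cycle checks are left out.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Op:
    kind: str                         # "ALU", "LD", "ST" or "BR"
    dst: Optional[str]                # destination register, or None
    srcs: Tuple[str, ...]             # source registers
    fused_with: Optional[int] = None  # index of the partner once fused


def can_fuse(head, tail):
    """Head must be a single-cycle ALU op feeding the tail; at most 2 sources."""
    if head.kind != "ALU" or head.dst is None or head.dst not in tail.srcs:
        return False
    external = set(head.srcs) | (set(tail.srcs) - {head.dst})
    return len(external) <= 2


def find_head(ops, t, window):
    """Look backward from tail index t for the nearest free, fusible producer."""
    pending = set(ops[t].srcs)  # sources whose nearest writer is not yet seen
    for h in range(t - 1, max(t - 1 - window, -1), -1):
        dst = ops[h].dst
        if dst in pending:
            if ops[h].fused_with is None and can_fuse(ops[h], ops[t]):
                return h
            pending.discard(dst)  # older writers of dst hold stale values
    return None


def fuse(ops, window=8):
    """Pass 1 takes only ALU tails; pass 2 takes any op. Returns (head, tail) pairs."""
    pairs = []
    for tail_kinds in ({"ALU"}, {"ALU", "LD", "ST", "BR"}):
        for t, tail in enumerate(ops):
            if tail.fused_with is not None or tail.kind not in tail_kinds:
                continue
            h = find_head(ops, t, window)
            if h is not None:
                ops[h].fused_with, tail.fused_with = t, h
                pairs.append((h, t))
    return pairs
```

Fed the six RISC-ops of the example that follows (0-indexed), the sketch pairs ADD with AND in the first pass and ADD with LD in the second, matching the macro-ops shown there.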


Fusing Algorithm: Example

x86 asm:

-----------------------------------------------------------

1. lea eax, DS:[edi + 01]

2. mov [DS:080b8658], eax

3. movzx ebx, SS:[ebp + ecx << 1]

4. and eax, 0000007f

5. mov edx, DS:[eax + esi << 0 + 0x7c]

RISC-ops:

-----------------------------------------------------------

1. ADD Reax, Redi, 1

2. ST Reax, mem[R22]

3. LD.zx Rebx, mem[Rebp + Recx << 1]

4. AND Reax, 0000007f

5. ADD R17, Reax, Resi

6. LD Redx, mem[R17 + 0x7c]

After fusing: Macro-ops

-----------------------------------------------------------

1. ADD R18, Redi, 1 :: AND Reax, R18, 007f

2. ST R18, mem[R22]

3. LD.zx Rebx, mem[Rebp + Recx << 1]

4. ADD R17, Reax, Resi :: LD Redx, mem[R17 + 0x7c]


Instruction Fusing Profile

[Figure: stacked bar chart. Y-axis: percentage of dynamic instructions (0% to 100%). Categories: ALU, FP or NOPs, BR, ST, LD, Fused.]

• 55+% of RISC-ops are fused, which increases effective ILP by 1.4x

• Only 6% single-cycle ALU ops left un-fused.
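One way to read the ILP figure is as simple counting: a fused pair occupies one scheduling entity instead of two. The helper below is a back-of-the-envelope sketch under the assumption that the fused percentage counts RISC-ops that are members of a pair; the function name is hypothetical.

```python
def macro_op_compression(fused_fraction):
    """Ratio of RISC-ops to scheduled entities when fused ops travel in pairs."""
    if not 0.0 <= fused_fraction <= 1.0:
        raise ValueError("fused_fraction must be between 0 and 1")
    # Unfused ops count once each; every two fused ops count as one entity.
    entities = (1.0 - fused_fraction) + fused_fraction / 2.0
    return 1.0 / entities
```

For a fused fraction of 0.55 this gives 1 / 0.725, roughly 1.38, in line with the 1.4 quoted on the slide.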


Processor Pipeline

• Macro-op pipeline for efficient hotspot execution
– Execute macro-ops
– Higher IPC and higher clock speed potential
– Shorter pipeline front-end

[Figure: x86 pipeline: Fetch, Align, x86 Decode1, x86 Decode2, x86 Decode3, Rename, Dispatch, Wakeup, Select, Payload, RF, EXE, WB, Retire. Macro-op pipeline: Fetch, Align/Fuse, Decode, Rename, Dispatch, Wakeup, Select, Payload, RF, EXE, WB, Retire. Annotations: pipelined 2-cycle issue logic (pipelined scheduler), reduced instruction traffic throughout, reduced forwarding.]


[Figure: Co-designed x86 pipeline front-end. A 16-byte fetch delivers micro-ops 1 to 6; the Align/Fuse stage places them into slots 0 to 2, with fused pairs (2 with 3, 4 with 5) sharing a slot. Stages shown: Fetch, Align/Fuse, Decode, Rename, Dispatch.]


[Figure: Co-designed x86 pipeline backend. A 2-cycle macro-op scheduler (Wakeup, Select) feeds three lanes through the Payload, RF, EXE and WB/Mem stages. Each lane (0 to 2) has dual-entry storage, 2 read ports, and its own issue port; execution resources are ALU0 to ALU2, each with a collapsed 3-1 ALU, plus Mem Port 0 and Mem Port 1.]


Experimental Evaluation

• x86vm: Experimental framework for exploring the co-designed x86 virtual machine paradigm.

• Proposed co-designed x86 processor: a specific instantiation of the framework.

– Software components: VMM (DBT, code caches, VM runtime control and resource management system). Some source code was extracted from BOCHS 2.2.

– Hardware components: microarchitecture timing simulators (baseline OoO superscalar, macro-op execution, etc.).

• Benchmarks: SPEC2000 integer


Performance Evaluation: SPEC2000

[Figure: relative IPC performance (y-axis, 0.5 to 1.3) versus issue window size (x-axis: 16, 32, 48, 64). Series: 4-wide Macro-op, 3-wide Macro-op, 2-wide Macro-op, 4-wide Base, 3-wide Base.]


Performance Contributors

• Many factors contribute to the IPC performance improvement:
– Code straightening
– Macro-op fusing and execution
– Reduced pipeline front-end (reduces branch penalty)
– Collapsed 3-1 ALUs (resolve branches and addresses sooner)

• Besides the baseline and macro-op models, we model three intermediate configurations:
– M0: baseline + code cache
– M1: M0 + macro-op fusing
– M2: M1 + shorter pipeline front-end (macro-op mode)
– Macro-op: M2 + collapsed 3-1 ALUs


Performance Contributors: SPEC2000

[Figure: normalized IPC speedup (%) (y-axis, -10 to 70). Series: M0 = Base + code $; M1 = M0 + fusing; M2 = M1 + shorter pipe; Macro-op = M2 + 3-1 ALU.]


Conclusions

• Architecture enhancement
– The hardware/software co-designed paradigm enables novel designs and more desirable system features
– Fusing dependent instruction pairs collapses the dataflow graph to increase ILP

• Complexity effectiveness
– Pipelined 2-cycle instruction scheduler
– Significantly reduced ALU value forwarding network
– DBT software reduces hardware complexity

• Power consumption implications
– Reduced pipeline width
– Reduced inter-instruction communication and instruction management


Finale – Questions & Answers

Suggestions and comments are welcome. Thank you!


Outline

• Motivation & Introduction

• Processor Microarchitecture Details

• Evaluation & Conclusions


Performance Simulation Configuration

Parameter | Baseline | Baseline Pipelined | Macro-op
ROB size | 128 | 128 | 128
Retire width | 3, 4 | 3, 4 | 2, 3, 4 MOP
Scheduler pipeline stages | 1 | 2 | 2
Fuse RISC-ops? | No | No | Yes
Issue width | 3, 4 | 3, 4 | 2, 3, 4 MOP
Register file | 128 entries; 8, 10 read ports; 5, 6 write ports (both baseline columns) | 128 entries; 6, 8, 10 read & 6, 8, 10 write ports (macro-op)
Fetch width | 16-byte x86 instructions (both baseline columns) | 16 B fusible micro-ops (macro-op)

Common to all configurations:
Issue window size: variable; sample points from 16 up to 64. Effectively larger for the macro-op mode.
Functional units: 4, 6, 8 INT ALU; 2 MEM R/W ports; 2 FP ALU
Cache hierarchy: 4-way 32 KB L1-I, 4-way 32 KB L1-D, 8-way 1 MB L2
Cache/memory latency: L1: 2 cycles + 1 cycle AGU; L2: 8 cycles; Mem: 200 cycles for the 1st chunk, 6 cycles between chunks


Fuse Macro-ops: An Illustrative Example

#  x86 instructions          Fusible ISA                          Execution latency
1                            LD Rtmp, [Rebx + 02]                 3
2  cmp ds:[ebx + 02], 0d     CMP Rtmp, 0d :: Jz 2f                1
3  jnz 08115ae1
4  jmp 08115bf2              (direct jmp removed)
5  add esp, 0c               ADD.cc Resp, 0c :: LD Rebx,[Resp]    3
6  pop ebp                   ADD Resp, 4 :: LD Rtmp,[Resp]        3
7  ret_near                  ADD Resp, 4                          1
8                            BR.ret Rtmp                          1

x86: 6 instructions, 20 Bytes. Fusible ISA: 9 RISC-like instructions, 16 Bytes, fused into 6 macro-ops that take 6 issue queue slots and 6 issues.


Translation Framework

Dynamic binary translation framework:

1. Form hotspot superblock. Crack x86 instructions into RISC-style micro-ops

2. Perform Cluster Analysis of embedded long immediate values and assign to registers if necessary.

3. Generate RISC-ops (IR form) in the implementation ISA

4. Construct DDG (Data Dependency Graph) for the superblock

5. Fusing Algorithm: Scan looking for dependent pairs to be fused. Forward scan, backward pairing. Two-pass to prioritize ALU ops.

6. Assign registers; re-order fused dependent pairs together, extend live ranges for precise traps, use consistent state mapping at superblock exits

7. Code generation to code cache
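Step 4 can be illustrated with a single pass that records the last writer of each register. This is a minimal sketch assuming each RISC-op is reduced to a `(dst, srcs)` tuple; it tracks only register read-after-write dependences and ignores memory dependences and condition codes, which a real translator would also need.

```python
def build_ddg(ops):
    """Map each op index to the set of op indices that produce its sources.

    ops is a list of (dst, srcs) tuples; dst is a register name or None.
    """
    last_writer = {}  # register name -> index of its most recent writer
    ddg = {}
    for i, (dst, srcs) in enumerate(ops):
        # Read sources before recording dst, so "AND Reax, Reax" sees the old Reax.
        ddg[i] = {last_writer[r] for r in srcs if r in last_writer}
        if dst is not None:
            last_writer[dst] = i
    return ddg
```

Registers that are live on entry to the superblock have no producer inside it and simply contribute no edge.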


Other DBT Software Profile

• Of all fused macro-ops:
– 50% are ALU-ALU pairs.
– 30% are fused condition test & conditional branch pairs.
– Others are mostly ALU-MEM op pairs.

• Of all fused macro-ops:
– 70+% are inter-x86-instruction fusions.
– 46% access two distinct source registers.
– Only 15% (6% of all instruction entities) write two distinct destination registers.

• Translation overhead profile
– About 1000+ instructions per translated hotspot instruction.


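Dependence Cycle Detection

[Figure: three cases, (a), (b) and (c), of fusing a head and a tail among the nodes A, B, C, D (with X, Y and N), showing where a fused pair could create a dependence cycle.]

• All cases are generalized to (c) due to the Anti-Scan Fusing Heuristic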
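A generic way to test whether fusing a pair would create a cycle is to ask whether the tail also depends on the head through some third op: that op would then have to run both after and before the merged macro-op. The sketch below is a plain graph search written for illustration; it is not necessarily the check the authors use, and the function name and the `ddg` layout (each op mapped to the set of ops it depends on) are assumptions.

```python
def fusing_creates_cycle(ddg, head, tail):
    """True if another op lies on a dependence path from head to tail."""
    # Start from the tail's producers other than the head and walk backward.
    stack = [p for p in ddg.get(tail, ()) if p != head]
    seen = set()
    while stack:
        node = stack.pop()
        if node == head:
            return True  # head -> ... -> node -> tail passes through a third op
        if node in seen:
            continue
        seen.add(node)
        stack.extend(ddg.get(node, ()))
    return False
```

For example, if op 1 depends on op 0 and op 2 depends on both, fusing 0 with 2 is rejected because op 1 sits between them, while fusing 0 with 1 is fine.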


HST back-end profile

• Light-weight opts: ProcLongImm, DDG setup, encode (tens of instrs. each)

• Heavy-weight opts: uops translation, fusing, codegen (none dominates)

[Figure: overhead per x86 instruction -- initial load from disk. Y-axis: number of x86 instructions (0 to 1200). Components: ProcLongImm, xlate_uop, DDGsetup, Fuse macro-ops, Codegen.]


Hotspot Coverage vs. runs

[Figure: hotspot coverage % (y-axis, 0 to 100) for three runs: 100M, TestRun, RefRun.]


Hotspot Detected vs. runs

[Figure: overhead, in instructions translated per million instructions executed (y-axis, 0 to 40), for 164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2 and 300.twolf. Series: 100M, TestRun, RefRun.]





Performance evaluation (WSB2004)

[Figure: relative IPC performance (y-axis, 0.5 to 1.2) versus issue buffer size (x-axis: 16, 32, 48, 64).]



Performance Contributors (WSB2004)

[Figure: normalized IPC speedup (%) (y-axis, -10 to 40). Series: M0 = Base + code $; M1 = M0 + fusing; M2 = M1 + shorter pipe; Macro-op = M2 + 3-1 ALU.]


Future Directions

• Co-designed virtual machine technology:
– Confidence: more realistic benchmark studies, important for whole-workload behavior such as hotspot behavior and the impact of context switches.
– Enhancement: more synergetic, complexity-effective HW/SW co-design techniques.
– Application: specific enabling techniques for specific novel computer architectures of the future.

• Example co-designed x86 processor design:
– Confidence: study as above.
– Enhancement, HW μ-arch: reduce register write ports.
– Enhancement, VMM: more dynamic optimizations in HST, e.g. CSE, software stack manager, SIMDification.
