CS250 VLSI Systems Design

Lecture 11: Patterns for Communication Links, Rocket µArchitecture, Testing

John Wawrzynek, Krste Asanovic, with John Lazzaro and Brian Zimmer (TA)

UC Berkeley, Fall 2011


Interconnect Design Patterns


Implementing Communication Queues

A queue can be implemented as a centralized FIFO with a single control FSM if both ends are close to each other and directly connected.

[Figure: FIFO with a single control block (Cntl.)]

In large designs, there may be several cycles of communication latency from one end to the other. This introduces delay both in forward data propagation and in reverse flow control.

Control is then split into send and receive portions. A credit-based flow control scheme is often used to tell the sender how many units of data it can send before overflowing the receiver's buffer.

[Figure: queue with separate Send and Recv. control blocks]


End-End Credit-Based Flow Control

For a one-way latency of N cycles, need 2*N buffers at the receiver to ensure full bandwidth

– It takes at least 2N cycles before the sender can be informed that the first unit sent was consumed (or not) by the receiver

– If the receive buffer fills up and stalls communication, it takes N cycles before the first credit flows back to the sender to restart the flow, then N cycles for a value to arrive from the sender. Meanwhile, the receiver can work from its 2*N buffered values
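The 2*N sizing argument can be checked with a small cycle-level simulation. The sketch below is not from the lecture: it assumes a hypothetical model in which the sender sends one unit per cycle whenever it holds a credit, the receiver consumes one unit per cycle when its buffer is non-empty, and both data and credits take `latency` cycles to cross the link. The function name `simulate` and its parameters are illustrative.

```python
from collections import deque


def simulate(latency, buffers, cycles):
    """Return how many units the receiver consumes in `cycles` cycles.

    Assumed model: one credit per receiver buffer slot, and a one-way
    wire delay of `latency` cycles for both data and returning credits.
    """
    credits = buffers                       # sender starts with all credits
    data_wire = deque([0] * latency)        # units in flight, sender -> receiver
    credit_wire = deque([0] * latency)      # credits in flight, receiver -> sender
    occupancy = 0                           # units waiting in the receive buffer
    consumed = 0
    for _ in range(cycles):
        # Whatever entered a wire `latency` cycles ago arrives now.
        occupancy += data_wire.popleft()
        credits += credit_wire.popleft()
        assert occupancy <= buffers, "credits should prevent overflow"
        # Receiver consumes one unit and returns one credit.
        if occupancy > 0:
            occupancy -= 1
            consumed += 1
            credit_wire.append(1)
        else:
            credit_wire.append(0)
        # Sender transmits only when it holds a credit.
        if credits > 0:
            credits -= 1
            data_wire.append(1)
        else:
            data_wire.append(0)
    return consumed


if __name__ == "__main__":
    n = 4
    print("2N buffers:", simulate(n, 2 * n, 1000), "units in 1000 cycles")
    print(" N buffers:", simulate(n, n, 1000), "units in 1000 cycles")
```

Under these assumptions, 2*N buffers keep the link busy every cycle after the initial fill, while N buffers let the sender run out of credits and leave the link idle for roughly half the time.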


Distributed Flow Control

An alternative to end-end control is distributed flow control (chain of FIFOs)

Requires less storage, as communication flops reused as buffers, but needs more distributed control circuitry

– Lots of small buffers also less efficient than single larger buffer

Sometimes not possible to insert logic into communication path

– e.g., wave-pipelined multi-cycle wiring path, or photonic link

[Figure: chain of FIFO stages, each with its own control block (Cntl.)]

Network Patterns

Connects multiple units using shared resources

Bus: low-cost, ordered

Crossbar: high-performance

Multi-stage network: trade cost/performance


Buses

Buses were popular board-level option for implementing communication as they saved pins and wires

Less attractive on-chip as wires are plentiful and buses are slow and cumbersome with central control

Often used on-chip when shrinking existing legacy system design onto single chip

Newer designs moving to either dedicated point-point unit communications or an on-chip network

[Figure: bus units attached to a shared bus with central bus control]

On-Chip Network

On-chip network multiplexes long-range wires to reduce cost

Routers use distributed flow control to transmit packets

Units usually need end-end credit flow control in addition because intermediate buffering in network is shared by all units

[Figure: on-chip network of four routers]

Rocket µArchitecture

UCB Rocket: An In-Order RISC-V Decoupled µArchitecture

A family of µarchitectures supporting hardware floating-point, demand-paged virtual memory

In-order single or dual-issue, decoupled floating-point unit, precise traps

32-bit or 64-bit implementations

From 5-stage to ~9-stage pipelines

Designed to be close to commercial cores

Single-issue version will be made available in time for class projects


George Stephenson’s Rocket


“The Rocket was the most advanced steam engine of its day. It was built for the Rainhill Trials held by the Liverpool & Manchester Railway in 1829 to choose the best and most competent design. It set the standard for a hundred and fifty years of steam locomotive power. Though the Rocket was not the first steam locomotive, its claim to fame is that it was the first to bring together several innovations to produce the most advanced locomotive of its day, and the template for most steam locomotives since.” [Wikipedia]

[AllyJane, LensFlare]


A Simple Core?

[Figure: block diagram of the Rocket core, showing the Fetch, Decode, Execute, Memory, and Commit stages of the integer pipeline (ITLB, I$, branch target buffer, scoreboard, register file, ALU, IMUL, IDIV, DTLB, D$, SAQ, ISDQ, PTW, HTIF queues) and the decoupled Floating Point Unit with its command, operand, response, and load/store data queues]

Rocket Pipeline Structure

Four major phases of execution

– Instruction fetch: get instruction bits from I-cache

– Decode, including operand fetch and issue: read register file, determine interlocks and bypass control

– Execution: perform instruction

– Commit: if no traps or interrupts, write architectural state

Each phase can contain multiple pipeline stages, but approx. one stage each in initial design.

Rocket 5-Stage Pipeline Structure

Integer Pipeline: P, F, D, X, M, C

– P: Generate Next PC

– F: Fetch Instruction

– D: Decode, Operand Fetch, Issue

– X: Execute Integer ALU

– M: Data Cache

– C: Commit

Floating-Point Pipeline: FD, FX1, FX2, FX3, FW

– FD: FP Decode, Operand Fetch, Issue

– FX1, FX2, FX3: FP Execute Stages

– FW: FP Register Write

[Figure: the two pipelines drawn side by side, with the Commit Point marked]

P is a pseudo-stage, as contents spread over many stages

FPU decoupling queue placed at commit point in pipeline


PC Generation

[Figure: Rocket core block diagram]

Next PC can come from a number of sources

– PC+4 if sequential fetch (predicted not-taken)

– Predicted branch address (if predicted taken)

– Resolved branch address (if mispredicted)

– Replay PC (if pipeline flush/replay from either X or M stage)

– Trap handler address (on trap/interrupt)

– Restore PC (at end of trap/interrupt handler)
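The list of next-PC sources can be sketched as a priority selection. The slide does not give the priority order, so the ordering below is an assumption: events raised later in the pipeline (traps, then handler return, then replay, then mispredict) override earlier ones. The function name `next_pc` and all parameter names are hypothetical.

```python
def next_pc(pc, *, trap=False, handler_pc=0,
            eret=False, restore_pc=0,
            replay=False, replay_pc=0,
            mispredict=False, branch_addr=0,
            predict_taken=False, predict_addr=0):
    """Pick the next fetch PC from the sources listed on the slide.

    Assumed priority (not stated in the lecture): later-stage events win.
    """
    if trap:                # trap or interrupt: jump to handler
        return handler_pc
    if eret:                # end of trap/interrupt handler: restore PC
        return restore_pc
    if replay:              # pipeline flush/replay from X or M stage
        return replay_pc
    if mispredict:          # resolved branch address corrects a bad prediction
        return branch_addr
    if predict_taken:       # branch predicted taken
        return predict_addr
    return pc + 4           # sequential fetch (predicted not-taken)


if __name__ == "__main__":
    print(hex(next_pc(0x100)))
    print(hex(next_pc(0x100, predict_taken=True, predict_addr=0x200)))
    print(hex(next_pc(0x100, predict_taken=True, predict_addr=0x200,
                      mispredict=True, branch_addr=0x300)))
```

In hardware this would be a priority encoder driving a multiplexer rather than a chain of `if` statements; the sketch only shows the selection logic.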

Implementing Precise Traps

[Figure: Rocket core block diagram]

Handle traps in program order at end of memory stage (the commit point)

Synchronous trap can be generated in any stage; it is held in Error PC (EPC) and Cause registers that are shifted down the pipeline

EPC always holds PC of instruction in that stage

Asynchronous interrupts handled in memory stage

Trap/interrupt flushes pipe and resets PC to handler address

RISC-V floating-point ISA designed to have no traps (only exception flags), so commit point before FPU decode
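A minimal sketch of why handling traps at a single commit point keeps them precise. It assumes a simplified model that is not from the lecture: each in-flight instruction carries its PC and an optional cause, and instructions reach the commit point oldest first. The name `commit_in_order` and the cause strings are hypothetical.

```python
def commit_in_order(in_flight, handler_pc):
    """Commit instructions oldest-first until one carries a trap cause.

    in_flight: list of (pc, cause) in program order; cause is None
    when the instruction raised no trap.
    Returns (committed_pcs, trap_info), where trap_info is None if no trap.
    """
    committed = []
    for pc, cause in in_flight:
        if cause is not None:
            # Everything younger is flushed; fetch restarts at the handler.
            trap_info = {"epc": pc, "cause": cause, "next_pc": handler_pc}
            return committed, trap_info
        committed.append(pc)
    return committed, None


if __name__ == "__main__":
    # The instruction at 0x108 is flagged early (in decode), the one at
    # 0x104 later (in memory), yet the older trap is the one reported.
    window = [(0x100, None),
              (0x104, "illegal address"),
              (0x108, "illegal instruction")]
    done, trap = commit_in_order(window, handler_pc=0x2000)
    print([hex(p) for p in done], trap)
```

Because the decision is made only at the commit point, the order in which stages detect problems does not matter; only program order does.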

Fetch Stage

[Figure: Rocket core block diagram]

Predict next PC from current PC using BTB - fed back to P stage

Fetch instructions from cache into instruction queue

Translate virtual address PC into physical address PC for I-cache physical tag check, check for illegal PC -> signal trap

I-cache miss goes to memory system, I-stream prefetcher fetches sequential blocks ahead of miss

Decode Stage

[Figure: Rocket core block diagram]

Decode instructions from queue, check for illegal ops -> signal trap

Fetch register operands and sign-extend immediate

Check for unavailable source operands using scoreboard (busy bit per register), stall decode if not available

Set busy bit for long latency operations

Calculate bypass control and mux bypass operands into ALU inputs
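The scoreboard check described above can be sketched as one busy bit per register. This is an illustrative model rather than Rocket's actual logic; the class name `Scoreboard`, its method names, and the default of 32 registers are assumptions.

```python
class Scoreboard:
    """One busy bit per register (hypothetical software model)."""

    def __init__(self, num_regs=32):
        self.busy = [False] * num_regs

    def must_stall(self, sources):
        """Decode stalls if any source register is still being produced."""
        return any(self.busy[r] for r in sources)

    def set_busy(self, rd):
        """Called at issue for a long-latency operation writing rd."""
        self.busy[rd] = True

    def clear(self, rd):
        """Called when the long-latency result finally arrives."""
        self.busy[rd] = False


if __name__ == "__main__":
    sb = Scoreboard()
    sb.set_busy(3)                   # e.g. a divide that will write r3
    print(sb.must_stall([3, 4]))     # dependent instruction waits
    print(sb.must_stall([1, 2]))     # independent instruction proceeds
    sb.clear(3)                      # divide result written back
    print(sb.must_stall([3, 4]))
```

Short-latency results never touch the busy bits in this model; they are assumed to reach the next instruction through the bypass network instead.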

Integer Execute Stage

[Figure: Rocket core block diagram]

Most integer instructions complete in one cycle and can be bypassed to next instruction

Integer multiply takes a few cycles, overlapped with memory stage

Integer divide takes many cycles - so sets busy bit on destination register

Branches resolved in ALU - mispredict detected by comparing target address with EPC in following instruction (was correct path taken?)

ALU calculates load+store addresses, integer store data placed in store data queue (ISDQ)

Memory Stage

[Figure: Rocket core block diagram]

Virtual load/store address translated and checked -> illegal address trap

Store addresses are always queued in order in SAQ to wait for data in ISDQ or FSDQ (from FPU). Stores go ahead when address and data are available.

Loads can bypass stores if no conflict, but replayed if conflict with address in SAQ.

Non-blocking cache supports multiple outstanding primary and secondary misses.

Flushes pipe and injects handler PC if any traps or interrupts.

End of memory stage is commit point - FPU operations enqueued if no traps.

– FPU load instruction enqueues command to read FPU load data queue

– FPU store instruction enqueues command to send FP register value to FSDQ
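The rule that loads may bypass queued stores unless their addresses conflict can be sketched as an overlap check against every pending entry. This assumes the SAQ is a store address queue holding an address and a size per entry, and that a conflict means overlapping byte ranges; the real comparison granularity is not given in the slides. The class and method names are hypothetical.

```python
from collections import deque


class StoreAddressQueue:
    """In-order queue of pending store addresses (hypothetical model)."""

    def __init__(self):
        self.entries = deque()          # (address, size in bytes), oldest first

    def enqueue(self, addr, size):
        self.entries.append((addr, size))

    def retire_oldest(self):
        """Oldest store goes ahead once its data is available."""
        return self.entries.popleft()

    def load_conflicts(self, addr, size):
        """True if the load's bytes overlap any pending store's bytes."""
        return any(addr < s_addr + s_size and s_addr < addr + size
                   for s_addr, s_size in self.entries)


if __name__ == "__main__":
    saq = StoreAddressQueue()
    saq.enqueue(0x1000, 8)
    print(saq.load_conflicts(0x1004, 4))   # overlaps the store: replay
    print(saq.load_conflicts(0x1008, 4))   # disjoint: load may bypass
    saq.retire_oldest()
    print(saq.load_conflicts(0x1004, 4))   # store has drained
```

A load that reports a conflict would be replayed rather than forwarded in this model, matching the replay behaviour described on the slide.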

Commit Stage

[Figure: Rocket core block diagram]

Architectural registers written with final values

Busy bits on scoreboard cleared as results arrive

Data cache finishes aligning and sign-extending small width values. Rocket only bypasses 32-bit and 64-bit values from end of memory stage, other sizes of load operands bypassed from end of commit stage.

FPU begins decoding instructions from FPU queue.
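The aligning and sign-extending step for small-width loads can be illustrated as a shift, a mask, and a sign fix-up. The sketch assumes a 64-bit little-endian data word and naturally aligned accesses, neither of which is stated on the slide; `extract_load` is a hypothetical helper name.

```python
def extract_load(word64, addr, size, signed=True):
    """Extract `size` bytes at byte offset addr % 8 from a 64-bit word.

    Assumes little-endian byte order and an access that does not cross
    the 8-byte word. Returns a Python int, negative if sign-extended.
    """
    shift = (addr % 8) * 8                  # align: move wanted bytes to bit 0
    bits = size * 8
    value = (word64 >> shift) & ((1 << bits) - 1)
    if signed and value & (1 << (bits - 1)):
        value -= 1 << bits                  # sign-extend from the top bit
    return value


if __name__ == "__main__":
    word = 0x1122334455667788
    print(extract_load(word, 0, 1))                 # signed byte 0x88
    print(extract_load(word, 0, 1, signed=False))   # unsigned byte 0x88
    print(hex(extract_load(word, 2, 2)))            # halfword at offset 2
    print(hex(extract_load(word, 4, 4)))            # word at offset 4
```

The extra shift and mux levels for byte and halfword loads are a plausible reason for treating those sizes separately from 32-bit and 64-bit values, though the slide does not spell out the reason.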

[Figure: Rocket core block diagram]

FPU built around a fused multiply-add unit (2008 revision of IEEE 754 FP standard) with full hardware support for all cases including subnormals.

Regfile holds value in internal recoded format with extra bit to simplify handling of subnormals. Have to convert on load/store and move to/from integer.
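A fused multiply-add rounds once, after the exact product and sum, instead of rounding the product and then the sum. The sketch below illustrates that difference in software using exact rational arithmetic; it says nothing about how Rocket's FPU or its recoded format is built. The helper name `fused_multiply_add` is an assumption, and it handles finite inputs only.

```python
from fractions import Fraction


def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once to a double.

    Software emulation for finite inputs only (no infinities or NaNs).
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


# Operands chosen so the product 1 - 2**-60 is not representable
# as a double and rounds to exactly 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

unfused = a * b + c                    # product rounded, then sum rounded
fused = fused_multiply_add(a, b, c)    # single rounding at the end

if __name__ == "__main__":
    print("unfused:", unfused)
    print("fused:  ", fused)
```

With these operands the separately rounded version loses the small term entirely, while the single-rounding version keeps it.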


Design Verification

Verification large part of NRE cost

Too expensive to respin part

– prototype cost in $Ms

– lost time-to-market $10Ms

2-3X engineer time on verification versus design

Only getting worse over time as chips get larger and more complex

Types of Checks

Design verification: “Does RTL design implement the functional specification?”

Tool/implementation checking: “Does design layout match RTL design?”

Physical design checking: “Does design work across all process corners, obey all the electrical design rules (antenna rules, electromigration, ...), is power/clock/reset distribution OK, does design meet design-for-X rules (X = test, manufacturing, reliability, ...)?”

Manufacturing testing: “Does a fabricated chip implement the design to specification?”

Design Verification Greatest Challenge

Tool/implementation checking mostly automated using static formal verification checks (though finding and fixing errors can be labor-intensive)

Same for EDRC rules and other physical design checks

Manufacturing tests can be automatically generated from RTL if scan chains used for all state elements (automatic test pattern generation - ATPG)

Source of Bugs in RTL Design

Specification incorrect

– Designers built an implementation faithful to the specification, but the specification was wrong.

Specification misread

– Designers built an implementation faithful to their reading of the specification, but they misunderstood the specification.

Incorrect RTL design

– The RTL design does not do what designer wanted it to.

Incorrect RTL coding

– The RTL design was correct in designer’s head, but the RTL code doesn’t match that RTL design.

Avoiding Incorrect Specification

Build an executable version of the specification, which should be simple functional model of intended design

– For RISC-V cores, we have a C++ instruction set interpreter, requiring only a few lines of code for each instruction.

Exercise executable specification inside system-level test harness with representative workload

– For RISC-V, we have built a test harness that can run programs on simulator. Classic test for processors was booting Unix on functional model.

– System-C common in industry for this level of modeling, where entire system modeled sufficiently to run whole software stack. FPGA emulation popular to accelerate model.
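To give a feel for "a few lines of code for each instruction", here is a toy instruction set interpreter. It is not the course's C++ interpreter: it is a Python sketch that uses symbolic tuples instead of real RISC-V encodings, covers only three operations, and counts the PC in instructions rather than bytes. The function name `run` and the tuple layout are assumptions.

```python
MASK = (1 << 64) - 1        # model 64-bit registers with wraparound


def run(program, max_steps=10_000):
    """Interpret a list of (op, a, b, c) tuples; return the register file."""
    regs = [0] * 32
    pc = 0                                  # index into `program`, not a byte address
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break                           # ran off the end: stop
        op, a, b, c = program[pc]
        next_pc = pc + 1
        if op == "addi":                    # rd=a, rs1=b, imm=c
            regs[a] = (regs[b] + c) & MASK
        elif op == "add":                   # rd=a, rs1=b, rs2=c
            regs[a] = (regs[b] + regs[c]) & MASK
        elif op == "bne":                   # rs1=a, rs2=b, offset=c (in instructions)
            if regs[a] != regs[b]:
                next_pc = pc + c
        else:
            raise ValueError(f"illegal instruction {op!r}")
        regs[0] = 0                         # x0 is hardwired to zero in RISC-V
        pc = next_pc
    return regs


if __name__ == "__main__":
    # Sum 5 + 4 + 3 + 2 + 1 into x2.
    program = [
        ("addi", 1, 0, 5),      # x1 = 5 (loop counter)
        ("addi", 2, 0, 0),      # x2 = 0 (running sum)
        ("add", 2, 2, 1),       # x2 += x1
        ("addi", 1, 1, -1),     # x1 -= 1
        ("bne", 1, 0, -2),      # loop back to the add while x1 != 0
    ]
    print(run(program)[2])
```

A model this small is easy to review against the written specification, which is the point of keeping the executable specification simple.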

Avoiding Misread Specification

Have executable specification as “golden model”

Have different designers write executable specification and system test code to catch misread specification when building golden model

If errors found, don’t just fix model, also rewrite specification to make it less ambiguous or more readable.

Perform extensive directed and random testing to compare RTL design with golden model

Catching bad RTL design or coding

Perform extensive directed and random testing to compare RTL design with golden model

Modern processor design team will perform many billions of cycles of RTL simulation using 10,000s of cores prior to tapeout
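Random testing against a golden model can be sketched in a few lines. In this illustration, which is not from the lecture, the "golden model" is a one-line 32-bit add and the stand-in for the RTL design is a bit-level ripple-carry adder with an optional injected bug. All names (`golden_add`, `ripple_add`, `random_test`, `broken_bit`) are hypothetical.

```python
import random

WIDTH = 32
MASK = (1 << WIDTH) - 1


def golden_add(a, b):
    """Golden model: the simplest possible statement of the intent."""
    return (a + b) & MASK


def ripple_add(a, b, broken_bit=None):
    """Stand-in for the design under test: a bit-level ripple-carry adder.

    If broken_bit is set, the carry into that bit is dropped (injected bug).
    """
    result, carry = 0, 0
    for i in range(WIDTH):
        x, y = (a >> i) & 1, (b >> i) & 1
        if i == broken_bit:
            carry = 0
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result


def random_test(dut, golden, trials=10_000, seed=0):
    """Drive both models with the same random inputs; collect mismatches."""
    rng = random.Random(seed)               # fixed seed keeps failures reproducible
    failures = []
    for _ in range(trials):
        a, b = rng.getrandbits(WIDTH), rng.getrandbits(WIDTH)
        if dut(a, b) != golden(a, b):
            failures.append((a, b))
    return failures


if __name__ == "__main__":
    print("correct design:", len(random_test(ripple_add, golden_add)), "mismatches")
    buggy = lambda a, b: ripple_add(a, b, broken_bit=16)
    print("buggy design:  ", len(random_test(buggy, golden_add)), "mismatches")
```

Directed tests would add hand-picked corner cases, such as all-zeros and all-ones operands, to the same comparison loop.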


When are you done?

[Figure: plot of bug rate (bugs found per minute of testing) against time]

But did you find all bugs, or reach limits of your test coverage?

Test Coverage

Did every bit toggle?

Was every value on every bus?

Was every state machine transition taken?

Could your tests observe this happening?
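The first question, whether every bit toggled, can be measured from a trace of signal values. The sketch below assumes one particular definition, that a bit counts as covered only after it has been seen both rising and falling; commercial coverage tools may define toggle coverage differently. The function name `toggle_coverage` is hypothetical.

```python
def toggle_coverage(samples, width):
    """Return (fraction of bits covered, mask of uncovered bits).

    A bit is covered once the trace contains both a 0->1 and a 1->0
    transition on it (assumed definition).
    """
    mask = (1 << width) - 1
    rose = fell = 0
    prev = None
    for value in samples:
        value &= mask
        if prev is not None:
            rose |= ~prev & value           # bits that went 0 -> 1
            fell |= prev & ~value           # bits that went 1 -> 0
        prev = value
    covered = rose & fell & mask
    return bin(covered).count("1") / width, mask & ~covered


if __name__ == "__main__":
    print(toggle_coverage([0b0000, 0b1111, 0b0000], 4))
    print(toggle_coverage([0, 1, 0, 1], 4))
```

This only answers whether the stimulus exercised each bit. The last question on the slide, whether the tests could observe the effect, needs a separate check that a wrong value would propagate to a compared output.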

Unit Testing

Divide and Conquer

Tradeoff between cost of defining unit boundary and improved test visibility and coverage

Typical granularity of test units in processor:

– Floating-point functional units

– Caches

– Integer core

– Whole processor