Lecture 11, Rocket Testing CS250, UC Berkeley, Fall 2011 CS250 VLSI Systems Design Lecture 11: Patterns for Communication Links, Rocket μArchitecture, Testing John Wawrzynek, Krste Asanovic, with John Lazzaro and Brian Zimmer (TA) UC Berkeley Fall 2011

CS250 VLSI Systems Design Lecture 11: Patterns for …inst.eecs.berkeley.edu/~cs250/fa11/lectures/lec11.pdf · 2011. 10. 11. · CS250 VLSI Systems Design Lecture 11: Patterns for


  • Lecture 11, Rocket Testing CS250, UC Berkeley, Fall 2011

    CS250 VLSI Systems Design
    Lecture 11: Patterns for Communication Links, Rocket µArchitecture, Testing

    John Wawrzynek, Krste Asanovic, with John Lazzaro and Brian Zimmer (TA)

    UC Berkeley, Fall 2011

  • Interconnect Design Patterns

  • Implementing Communication Queues

    A queue can be implemented as a centralized FIFO with a single control FSM if both ends are close to each other and directly connected.

    In large designs, there may be several cycles of communication latency from one end to the other. This introduces delay both in forward data propagation and in reverse flow control.

    Control is then split into send and receive portions. A credit-based flow control scheme is often used to tell the sender how many units of data it can send before overflowing the receiver’s buffer.


  • End-End Credit-Based Flow Control

    For a one-way latency of N cycles, need 2*N buffers at the receiver to ensure full bandwidth

    – Will take at least 2N cycles before the sender can be informed that the first unit sent was consumed (or not) by the receiver

    If the receive buffer fills up and stalls communication, it will take N cycles before the first credit flows back to the sender to restart flow, then N cycles for a value to arrive from the sender

    – Meanwhile, the receiver can work from its 2*N buffered values
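The 2*N sizing rule can be checked with a small cycle-level model. The sketch below is illustrative only: the function name `simulate_credit_link`, the one-item-per-cycle sender, and the receiver that drains one item per cycle are all assumptions made for the example, not details from the lecture.

```python
from collections import deque

def simulate_credit_link(latency, buffers, cycles):
    """Return how many items the receiver consumes in `cycles` cycles.

    Assumed model: the sender may send one item per cycle while it holds a
    credit, wires take `latency` cycles in each direction, and the receiver
    drains one buffered item per cycle, returning one credit for it.
    """
    credits = buffers              # one credit per receiver buffer slot
    data_in_flight = deque()       # arrival times of items on the forward wires
    credits_in_flight = deque()    # arrival times of credits on the reverse wires
    occupancy = 0                  # items currently held in the receive buffer
    consumed = 0
    for now in range(cycles):
        # Credits that finished the return trip become usable again.
        while credits_in_flight and credits_in_flight[0] <= now:
            credits_in_flight.popleft()
            credits += 1
        # The sender transmits only while it holds a credit.
        if credits > 0:
            credits -= 1
            data_in_flight.append(now + latency)
        # Items that finished the forward trip land in the receive buffer.
        while data_in_flight and data_in_flight[0] <= now:
            data_in_flight.popleft()
            occupancy += 1
            assert occupancy <= buffers, "credits must prevent overflow"
        # The receiver consumes one item and sends its credit back.
        if occupancy > 0:
            occupancy -= 1
            consumed += 1
            credits_in_flight.append(now + latency)
    return consumed

full = simulate_credit_link(latency=4, buffers=8, cycles=1000)
half = simulate_credit_link(latency=4, buffers=4, cycles=1000)
```

Under these assumptions, `full` corresponds to 2*N buffers and sustains one item per cycle after the initial N-cycle fill, while `half` (only N buffers) leaves the sender idle for half of every round trip.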


  • Distributed Flow Control

    An alternative to end-end control is distributed flow control (chain of FIFOs)

    Requires less storage, as communication flops are reused as buffers, but needs more distributed control circuitry

    – Lots of small buffers are also less efficient than a single larger buffer

    Sometimes it is not possible to insert logic into the communication path

    – e.g., wave-pipelined multi-cycle wiring path, or photonic link
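A chain of FIFOs can be sketched as a list of one-entry stages in which each stage looks only at its immediate neighbour. This is a hypothetical behavioural model (the name `run_chain` and the `sink_ready` callback are assumptions for illustration); it shows the pipeline registers themselves holding data when the receiver stalls.

```python
def run_chain(items, num_stages, sink_ready):
    """Push `items` through a chain of one-entry stages.

    `sink_ready(cycle)` is an assumed callback that says whether the receiver
    accepts data in that cycle. Returns (delivered items, cycles taken).
    """
    stages = [None] * num_stages   # each stage holds at most one value
    pending = list(items)
    delivered = []
    cycle = 0
    while pending or any(s is not None for s in stages):
        # The last stage hands its value over only if the receiver is ready.
        if stages[-1] is not None and sink_ready(cycle):
            delivered.append(stages[-1])
            stages[-1] = None
        # Local control: a stage advances only if the next stage is empty.
        for i in range(num_stages - 2, -1, -1):
            if stages[i] is not None and stages[i + 1] is None:
                stages[i + 1], stages[i] = stages[i], None
        # The sender injects a new value whenever the first stage is free.
        if pending and stages[0] is None:
            stages[0] = pending.pop(0)
        cycle += 1
    return delivered, cycle

# Receiver accepts data only every third cycle; nothing is lost or reordered.
delivered, cycles = run_chain(range(10), num_stages=4,
                              sink_ready=lambda c: c % 3 == 0)
```

No stage needs to know the end-to-end latency, which is the storage saving the slide describes; the cost is a small piece of control logic at every stage.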


  • Network Patterns

    Connects multiple units using shared resources

    Bus: low-cost, ordered

    Crossbar: high-performance

    Multi-stage network: trade cost/performance
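One rough way to see the cost side of this trade-off is to count crosspoints. The sketch below assumes a full N-by-N crossbar and, as one example of a multi-stage network, a butterfly built from 2x2 switches; the butterfly is a choice made here for illustration and is not named in the lecture, and the function names are hypothetical.

```python
def crossbar_crosspoints(ports):
    """A full crossbar has one crosspoint per (input, output) pair."""
    return ports * ports

def butterfly_crosspoints(ports):
    """Crosspoints in a butterfly of 2x2 switches (assumed topology).

    There are log2(ports) stages, each with ports/2 switches, and every
    2x2 switch contains 4 crosspoints.
    """
    stages = ports.bit_length() - 1
    if ports < 2 or (1 << stages) != ports:
        raise ValueError("ports must be a power of two")
    return stages * (ports // 2) * 4

for n in (4, 16, 64):
    print(n, crossbar_crosspoints(n), butterfly_crosspoints(n))
```

The crossbar count grows as N squared while the multi-stage count grows as N log N; what the count hides is that the multi-stage network can block and adds a hop per stage.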


  • Buses

    Buses were a popular board-level option for implementing communication as they saved pins and wires

    Less attractive on-chip as wires are plentiful and buses are slow and cumbersome with central control

    Often used on-chip when shrinking an existing legacy system design onto a single chip

    Newer designs are moving to either dedicated point-to-point unit communications or an on-chip network


  • On-Chip Network

    On-chip network multiplexes long-range wires to reduce cost

    Routers use distributed flow control to transmit packets

    Units usually need end-end credit flow control in addition because intermediate buffering in network is shared by all units


  • Rocket µArchitecture

  • UCB Rocket: An In-Order RISC-V Decoupled µArchitecture

    A family of µarchitectures supporting hardware floating-point, demand-paged virtual memory

    In-order single or dual-issue, decoupled floating-point unit, precise traps

    32-bit or 64-bit implementations

    From 5-stage to ~9-stage pipelines

    Designed to be close to commercial cores

    Single-issue version will be made available in time for class projects

  • George Stephenson’s Rocket

    “The Rocket was the most advanced steam engine of its day. It was built for the Rainhill Trials held by the Liverpool & Manchester Railway in 1829 to choose the best and most competent design. It set the standard for a hundred and fifty years of steam locomotive power. Though the Rocket was not the first steam locomotive, its claim to fame is that it was the first to bring together several innovations to produce the most advanced locomotive of its day, and the template for most steam locomotives since.” [Wikipedia]

    [AllyJane, LensFlare]

  • A Simple Core?

    [Figure: block diagram of the Rocket core. Labeled regions: Fetch, Decode, Execute, Memory, Commit, and the Floating Point Unit (Decode, Execute, Commit). Labeled blocks include ITLB, I$, Branch Target Buffer, Instruction Queue, Scoreboard, Regfile, ALU, IMUL, IDIV, DTLB, D$, D$ Control, MSHR, PTW, SAQ, ISDQ, FSDQ, FPU Command Queue, FPU Load Data Reorder Queue, FMA, HTIF Request/Response Queues, Prefetcher, NPCGEN Priority Encoder, Ctrl Regs, Timer, and Tile Link.]

  • Rocket Pipeline Structure

    Four major phases of execution:

    – Instruction fetch: get instruction bits from I-cache
    – Decode, including operand fetch and issue: read register file, determine interlocks and bypass control
    – Execution: perform instruction
    – Commit: if no traps or interrupts, write architectural state

    Each phase can contain multiple pipeline stages, but approx. one stage each in initial design.

  • Rocket 5-Stage Pipeline Structure

    Integer pipeline:

    – P: generate next PC
    – F: fetch instruction
    – D: decode, operand fetch, issue
    – X: execute integer ALU
    – M: data cache
    – C: commit

    Floating-point pipeline:

    – FD: FP decode, operand fetch, issue
    – FX1, FX2, FX3: FP execute stages
    – FW: FP register write

    P is a pseudo-stage, as its contents are spread over many stages

    FPU decoupling queue placed at commit point in pipeline

  • PC Generation

    [Figure: Rocket core block diagram]

    Next PC can come from a number of sources:

    – PC+4 if sequential fetch (predicted not-taken)
    – Predicted branch address (if predicted taken)
    – Resolved branch address (if mispredicted)
    – Replay PC (if pipeline flush/replay from either X or M stage)
    – Trap handler address (on trap/interrupt)
    – Restore PC (at end of trap/interrupt handler)
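The next-PC choice amounts to a priority selection among these sources. The sketch below is a software stand-in for that mux; the function name `next_pc`, its keyword arguments, and in particular the priority order (later pipeline events override earlier predictions) are assumptions for illustration, not the order used in the actual design.

```python
def next_pc(pc, *, trap_handler=None, restore_pc=None, replay_pc=None,
            resolved_branch=None, predicted_target=None):
    """Pick the next fetch address from the candidate sources.

    Each keyword is None when that source is inactive this cycle. The tuple
    below lists sources from highest to lowest priority; this ordering is an
    assumption made for the example.
    """
    for candidate in (trap_handler, restore_pc, replay_pc,
                      resolved_branch, predicted_target):
        if candidate is not None:
            return candidate
    return pc + 4   # sequential fetch (predicted not-taken)

sequential = next_pc(0x100)
predicted = next_pc(0x100, predicted_target=0x200)
trapped = next_pc(0x100, predicted_target=0x200, trap_handler=0x80)
```

In hardware the same selection would be a priority encoder driving a mux rather than a loop.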

  • Implementing Precise Traps

    [Figure: Rocket core block diagram]

    Handle traps in program order at end of memory stage (the commit point)

    A synchronous trap can be generated in any stage; it is held in Error PC (EPC) and Cause registers that are shifted down the pipeline

    EPC always holds PC of instruction in that stage

    Asynchronous interrupts handled in memory stage

    Trap/interrupt flushes pipe and resets PC to handler address

    RISC-V floating-point ISA designed to have no traps (only exception flags), so the commit point can sit before FPU decode


  • Fetch Stage

    [Figure: Rocket core block diagram]

    Predict next PC from current PC using BTB; the prediction is fed back to the P stage

    Fetch instructions from cache into instruction queue

    Translate virtual PC into physical PC for the I-cache physical tag check; an illegal PC signals a trap

    I-cache miss goes to memory system; I-stream prefetcher fetches sequential blocks ahead of miss
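A branch target buffer can be pictured as a small tagged table indexed by PC. The sketch below assumes a direct-mapped organisation, 4-byte instructions, and a full-PC tag; the class name, the table size, and those organisational choices are hypothetical and are not taken from the Rocket design.

```python
class BranchTargetBuffer:
    """Direct-mapped BTB sketch: maps a fetch PC to a predicted next PC."""

    def __init__(self, entries=64):
        self.entries = entries
        self.table = [None] * entries      # each slot: (pc, target) or None

    def _index(self, pc):
        # Drop the two byte-offset bits of a 4-byte instruction (assumption).
        return (pc >> 2) % self.entries

    def predict(self, pc):
        """Return the stored target on a hit, else the sequential PC."""
        slot = self.table[self._index(pc)]
        if slot is not None and slot[0] == pc:
            return slot[1]
        return pc + 4

    def update(self, pc, target):
        """Record a taken branch; overwrites whatever shared the slot."""
        self.table[self._index(pc)] = (pc, target)

btb = BranchTargetBuffer(entries=4)
miss_prediction = btb.predict(0x40)   # nothing learned yet: falls through
btb.update(0x40, 0x10)                # branch at 0x40 was taken to 0x10
hit_prediction = btb.predict(0x40)
```

Because the lookup uses only the current PC, the prediction is available early enough to feed the next-PC selection before the instruction itself has been decoded.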


  • Decode Stage

    [Figure: Rocket core block diagram]

    Decode instructions from queue; an illegal op signals a trap

    Fetch register operands and sign-extend immediate

    Check for unavailable source operands using scoreboard (busy bit per register), stall decode if not available

    Set busy bit for long-latency operations

    Calculate bypass control and mux bypass operands into ALU inputs
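The scoreboard check can be sketched as one busy bit per register. The class and method names below are hypothetical, and the model covers only the source-operand check described on the slide.

```python
class Scoreboard:
    """One busy bit per architectural register (sketch)."""

    def __init__(self, num_regs=32):
        self.busy = [False] * num_regs

    def must_stall(self, sources):
        """Decode stalls if any source register is still being produced."""
        return any(self.busy[r] for r in sources)

    def set_busy(self, rd):
        """Called when a long-latency operation that writes rd is issued."""
        self.busy[rd] = True

    def clear(self, rd):
        """Called when the result for rd finally arrives."""
        self.busy[rd] = False

sb = Scoreboard()
sb.set_busy(5)                              # e.g. a divide writing r5 is issued
stall_dependent = sb.must_stall([5, 6])     # consumer of r5 must wait
stall_independent = sb.must_stall([1, 2])   # unrelated instruction proceeds
sb.clear(5)                                 # divide result arrives
stall_after_clear = sb.must_stall([5, 6])
```

Short operations never touch the scoreboard in this model because their results reach consumers through the bypass network instead.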


  • Integer Execute Stage

    [Figure: Rocket core block diagram]

    Most integer instructions complete in one cycle and can be bypassed to the next instruction

    Integer multiply takes a few cycles, overlapped with the memory stage

    Integer divide takes many cycles, so sets busy bit on destination register

    Branches resolved in ALU; mispredict detected by comparing target address with EPC of the following instruction (was the correct path taken?)

    ALU calculates load and store addresses; integer store data placed in the integer store data queue (ISDQ)
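The mispredict check reduces to one comparison: the PC carried by the instruction behind the branch must equal the address the branch actually resolved to. The helper below is a hypothetical sketch that assumes 4-byte instructions.

```python
def is_mispredict(branch_pc, taken, target, following_pc):
    """Return True if the instruction fetched after the branch is on the wrong path.

    `following_pc` stands for the PC (EPC) carried by the next instruction in
    the pipeline; the 4-byte fall-through is an assumption of this sketch.
    """
    correct_pc = target if taken else branch_pc + 4
    return following_pc != correct_pc

# Branch at 0x100 resolves taken to 0x180.
wrong_path = is_mispredict(0x100, True, 0x180, following_pc=0x104)
right_path = is_mispredict(0x100, True, 0x180, following_pc=0x180)
```

Comparing against the PC already travelling down the pipeline means the stage does not need a separate record of what was predicted.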


  • Memory Stage

    [Figure: Rocket core block diagram]

    Virtual load/store address translated and checked; an illegal address signals a trap

    Store addresses are always queued in order in the SAQ to wait for data in the ISDQ or the FSDQ (from FPU). Stores go ahead when address and data are available.

    Loads can bypass stores if there is no conflict, but are replayed if they conflict with an address in the SAQ.

    Non-blocking cache supports multiple outstanding primary and secondary misses.

    Flushes pipe and injects handler PC if any traps or interrupts.

    End of memory stage is commit point; FPU operations are enqueued if no traps:

    – FPU load instruction enqueues command to read FPU load data queue
    – FPU store instruction enqueues command to send FP register value to FSDQ
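The load-versus-SAQ check can be sketched as an address comparison against every store still waiting in the queue. The class below is hypothetical, and the 8-byte comparison granularity is an assumption chosen for the example.

```python
from collections import deque

class StoreAddressQueue:
    """In-order queue of store addresses still waiting for their data (sketch)."""

    def __init__(self, granule_bytes=8):
        self.pending = deque()              # oldest store at the left
        self.granule_bytes = granule_bytes  # comparison granularity (assumed)

    def enqueue_store(self, addr):
        self.pending.append(addr)

    def retire_oldest(self):
        """The oldest store has its data and goes ahead to the cache."""
        return self.pending.popleft()

    def load_must_replay(self, addr):
        """A load conflicts if it touches the same granule as a pending store."""
        g = self.granule_bytes
        return any(a // g == addr // g for a in self.pending)

saq = StoreAddressQueue()
saq.enqueue_store(0x1000)
bypass_ok = not saq.load_must_replay(0x2000)   # different address: load bypasses
conflict = saq.load_must_replay(0x1004)        # same 8-byte granule: replay
saq.retire_oldest()
conflict_after_retire = saq.load_must_replay(0x1004)
```

Replaying the load, rather than forwarding store data to it, keeps the queue simple at the cost of extra cycles when a conflict does occur.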

  • Commit Stage

    [Figure: Rocket core block diagram]

    Architectural registers written with final values

    Busy bits on scoreboard cleared as results arrive

    Data cache finishes aligning and sign-extending small width values. Rocket only bypasses 32-bit and 64-bit values from end of memory stage, other sizes of load operands bypassed from end of commit stage.

    FPU begins decoding instructions from FPU queue.
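Aligning and sign-extending a sub-word load is a shift, a mask, and a conditional subtraction. The function below is a sketch under assumptions: the name `extract_load` is hypothetical, and it treats the cache as returning a 64-bit word with little-endian byte numbering.

```python
def extract_load(word64, byte_offset, size_bytes, signed):
    """Pull a naturally aligned value out of a 64-bit word and extend it.

    Assumes little-endian byte numbering within the word. Returns a Python
    int, negative when a signed value has its top bit set.
    """
    assert size_bytes in (1, 2, 4, 8)
    assert byte_offset % size_bytes == 0 and byte_offset + size_bytes <= 8
    bits = size_bytes * 8
    # Align: shift the selected bytes down to bit 0, then mask to width.
    value = (word64 >> (byte_offset * 8)) & ((1 << bits) - 1)
    # Sign-extend: reinterpret the top bit as negative weight.
    if signed and value >> (bits - 1):
        value -= 1 << bits
    return value

word = 0x0000_0000_0000_80FF
signed_byte0 = extract_load(word, 0, 1, signed=True)     # byte 0xFF
signed_byte1 = extract_load(word, 1, 1, signed=True)     # byte 0x80
unsigned_half = extract_load(word, 0, 2, signed=False)   # halfword 0x80FF
```

The shift and extension sit after the cache data array, which is consistent with the slide's point that narrower loads are bypassed one stage later than 32-bit and 64-bit loads.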


  • Floating-Point Unit

    [Figure: Rocket core block diagram]

    FPU built around a fused multiply-add unit (2008 revision of IEEE 754 FP standard) with full hardware support for all cases including subnormals.

    Regfile holds value in internal recoded format with extra bit to simplify handling of subnormals. Have to convert on load/store and move to/from integer.
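What distinguishes a fused multiply-add is that a*b + c is rounded once rather than twice. The sketch below emulates that single rounding with exact rational arithmetic for finite inputs; the helper name `fused_multiply_add` is hypothetical, and this is a software illustration rather than a description of the hardware datapath.

```python
from fractions import Fraction

def fused_multiply_add(a, b, c):
    """Compute a*b + c exactly, then round once to the nearest double.

    Works for finite floats only; infinities and NaNs are not handled.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30        # exact product a*b is 1 - 2**-60

unfused = a * b - 1.0       # product is rounded to 1.0 before the subtract
fused = fused_multiply_add(a, b, -1.0)
```

With separate multiply and add, the tiny term is lost when the product is rounded, so `unfused` comes out as zero; the fused form keeps it because rounding happens only after the addition.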


  • Design Verification

  • CS250, UC Berkeley, Fall 2011Lecture 11, Rocket Testing

    Verification large part of NRE cost

    Too expensive to respin part:
    – prototype cost in $Ms
    – lost time-to-market costs $10Ms

    Engineers spend 2-3X as much time on verification as on design

    Only getting worse over time as chips get larger and more complex

    Types of Checks

    Design verification: “Does RTL design implement the functional specification?”

    Tool/implementation checking: “Does design layout match RTL design?”

    Physical design checking: “Does design work across all process corners and obey all the electrical design rules (antenna rules, electromigration, ...)? Is power/clock/reset distribution OK? Does design meet design-for-X rules (X = test, manufacturing, reliability, ...)?”

    Manufacturing testing: “Does a fabricated chip implement the design to specification?”

    Design Verification Greatest Challenge

    Tool/implementation checking mostly automated using static formal verification checks (though finding and fixing errors can be labor-intensive)

    Same for EDRC rules and other physical design checks

    Manufacturing tests can be automatically generated from RTL if scan chains used for all state elements (automatic test pattern generation - ATPG)
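
    To make the scan-chain idea concrete, here is a minimal Python sketch of the mechanics only: shift a pattern into every state element, clock the logic once, then shift the captured state back out for comparison. The `ScanChain` class and the `invert_all` logic are hypothetical names invented for this illustration; it does not model how an ATPG tool chooses patterns.

```python
class ScanChain:
    """Toy scan chain: flops[0] is nearest scan-in, flops[-1] drives scan-out."""

    def __init__(self, length):
        self.flops = [0] * length

    def shift(self, scan_in):
        """One scan clock: every flop takes its neighbour's value."""
        scan_out = self.flops[-1]
        self.flops = [scan_in] + self.flops[:-1]
        return scan_out

    def load(self, pattern):
        """Shift in a full pattern so that flops == pattern afterwards."""
        for bit in reversed(pattern):
            self.shift(bit)

    def capture(self, logic):
        """One functional clock: flops capture the outputs of the logic."""
        self.flops = list(logic(self.flops))

    def unload(self):
        """Shift out the whole state (zeros shift in), returned in flop order."""
        out = [self.shift(0) for _ in range(len(self.flops))]
        return out[::-1]


def invert_all(state):
    """Hypothetical combinational logic under test: invert every bit."""
    return [bit ^ 1 for bit in state]


chain = ScanChain(4)
chain.load([1, 0, 1, 1])      # control: set every state element
chain.capture(invert_all)     # apply one clock of normal operation
observed = chain.unload()     # observe: read every state element back
```

    A tester would compare `observed` against the response expected from a fault-free design; a mismatch flags a defective chip.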

    Source of Bugs in RTL Design

    Specification incorrect
    – Designers built an implementation faithful to the specification, but the specification was wrong.

    Specification misread
    – Designers built an implementation faithful to their reading of the specification, but they misunderstood the specification.

    Incorrect RTL design
    – The RTL design does not do what the designer wanted it to.

    Incorrect RTL coding
    – The RTL design was correct in the designer’s head, but the RTL code doesn’t match that RTL design.

    Avoiding Incorrect Specification

    Build an executable version of the specification, which should be a simple functional model of the intended design
    – For RISC-V cores, we have a C++ instruction set interpreter, requiring only a few lines of code for each instruction.

    Exercise the executable specification inside a system-level test harness with a representative workload
    – For RISC-V, we have built a test harness that can run programs on the simulator.
    – Classic test for processors was booting Unix on the functional model.
    – SystemC is common in industry for this level of modeling, where the entire system is modeled sufficiently to run the whole software stack.
    – FPGA emulation is popular to accelerate the model.
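
    As a rough illustration of how small an executable specification can be, the sketch below interprets a toy three-instruction register machine in Python. The instruction names (`addi`, `add`, `bne`), the tuple encoding, and the `run` function are assumptions made for this example; it is not the course's C++ RISC-V interpreter and does not follow the real RISC-V encoding.

```python
MASK64 = (1 << 64) - 1


def run(program, max_steps=1000):
    """Interpret a toy program and return the final register file."""
    regs = [0] * 32
    pc = 0
    steps = 0
    while 0 <= pc < len(program) and steps < max_steps:
        op, a, b, c = program[pc]
        next_pc = pc + 1
        if op == "addi":    # regs[a] = regs[b] + immediate c
            regs[a] = (regs[b] + c) & MASK64
        elif op == "add":   # regs[a] = regs[b] + regs[c]
            regs[a] = (regs[b] + regs[c]) & MASK64
        elif op == "bne":   # jump to instruction index c if regs[a] != regs[b]
            if regs[a] != regs[b]:
                next_pc = c
        else:
            raise ValueError(f"unknown instruction: {op}")
        regs[0] = 0         # register 0 always reads as zero
        pc = next_pc
        steps += 1
    return regs


# Test program: sum the integers 5, 4, 3, 2, 1 into register 2.
PROGRAM = [
    ("addi", 1, 0, 5),    # r1 = 5 (loop counter)
    ("addi", 2, 0, 0),    # r2 = 0 (accumulator)
    ("add", 2, 2, 1),     # r2 += r1
    ("addi", 1, 1, -1),   # r1 -= 1
    ("bne", 1, 0, 2),     # loop back while r1 != 0
]

final_regs = run(PROGRAM)
```

    Each instruction costs two or three lines, which is what makes a model like this easy to review against the written specification.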

    Avoiding Misread Specification

    Have executable specification as “golden model”

    Have different designers write executable specification and system test code to catch misread specification when building golden model

    If errors found, don’t just fix model, also rewrite specification to make it less ambiguous or more readable.

    Perform extensive directed and random testing to compare RTL design with golden model

    Catching bad RTL design or coding

    Perform extensive directed and random testing to compare RTL design with golden model

    Modern processor design team will perform many billions of cycles of RTL simulation using 10,000s of cores prior to tapeout
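
    The comparison loop itself is simple. The following Python sketch runs a handful of directed cases plus seeded random cases through a golden model and a second implementation, and collects any disagreements. Everything here is an assumption for illustration: `golden_add` stands in for the executable specification, and `dut_add` (a bit-serial ripple-carry adder) stands in for what would really be an RTL simulation.

```python
import random

WIDTH = 8
MASK = (1 << WIDTH) - 1


def golden_add(a, b):
    """Golden model: WIDTH-bit wrapping addition, written as plainly as possible."""
    return (a + b) & MASK


def dut_add(a, b):
    """Stand-in for the design under test: a ripple-carry adder, bit by bit."""
    result, carry = 0, 0
    for i in range(WIDTH):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i
        carry = (x & y) | (carry & (x ^ y))
    return result


def compare(tests):
    """Return (a, b, expected, actual) for every test where the models disagree."""
    failures = []
    for a, b in tests:
        expected, actual = golden_add(a, b), dut_add(a, b)
        if expected != actual:
            failures.append((a, b, expected, actual))
    return failures


# Directed tests target known corner cases; random tests cover the rest.
directed = [(0, 0), (MASK, 1), (MASK, MASK), (0x55, 0xAA)]
rng = random.Random(0)  # fixed seed so a failure can be reproduced
randoms = [(rng.randrange(MASK + 1), rng.randrange(MASK + 1)) for _ in range(1000)]

mismatches = compare(directed + randoms)
```

    Seeding the random generator matters in practice: a failing random test is only useful if it can be replayed.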

    When are you done?

    [Figure: bug rate (bugs found per minute of testing) plotted against time]

    But did you find all the bugs, or just reach the limits of your test coverage?

    Test Coverage

    Did every bit toggle?

    Was every value on every bus?

    Was every state machine transition taken?

    Could your tests observe this happening?
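
    The first of these questions is easy to mechanize. As a minimal sketch, assuming the simulator can dump the value of a signal on each cycle, the hypothetical `toggle_coverage` function below reports which bit positions were seen at both 0 and 1 during a test. Note that it measures activity only; it says nothing about whether the test could observe the effect.

```python
def toggle_coverage(samples, width):
    """Return the set of bit positions observed at both 0 and 1."""
    mask = (1 << width) - 1
    seen_one = 0
    seen_zero = 0
    for value in samples:
        seen_one |= value & mask
        seen_zero |= ~value & mask
    both = seen_one & seen_zero
    return {i for i in range(width) if (both >> i) & 1}


# Values of a 4-bit signal sampled over three cycles.
trace = [0b0000, 0b0101, 0b0100]
covered = toggle_coverage(trace, width=4)
uncovered = set(range(4)) - covered
```

    Here bits 1 and 3 never left 0, so the test suite has not exercised them and needs more stimulus.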

    Unit Testing

    Divide and Conquer

    Tradeoff between cost of defining unit boundary and improved test visibility and coverage

    Typical granularity of test units in processor:
    – Floating-point functional units
    – Caches
    – Integer core
    – Whole processor
