
  • EE457 Final Exam - Fall 2010 (ee457_Final_Fall2010_r1.fm, December 10, 2010 3:28 pm). Copyright 2010 Gandhi Puvvada

    Fall 2010 EE457
    Instructor: Gandhi Puvvada
    Final Exam (30%)
    Date: 12/10/2010, Friday
    Closed Book, Closed Notes; Time: 8:00 - 10:45AM, SGM123
    Calculator and Cadence Verilog Guide allowed
    Total points: 235; Perfect score: 220 / 235
    Name:

    1 ( 42 points) 25 min.

    Pipelining (Lab 7 Part 3 modified):

    On the next page you find the original Lab 7 Part 3 Block Diagram, provided for your information. On the page after, you find a modified diagram for you to complete.

    Mr. Trojan says that what you intend to do at 12:01AM (at the beginning of a clock) can easily be done at 11:59PM of the previous day (at the end of the previous clock, logic wise, assuming timing is not an issue). The original forwarding in EX1 (controlled by FU1, FORW1) is now moved to ID (controlled by FU_ID, FORW_ID). And the original forwarding in EX2 (controlled by FU2, FORW2) is now moved to EX1 (controlled by FU_EX1, FORW_EX1).

    These changes in forwarding do not cause any change in
    (a) HDU or generation of STALL T / F
    (b) generation of SKIP1 or SKIP2 T / F
    (c) the internal forwarding logic/mechanism in the register file T / F

    Draw the logic for the two new FUs (Forwarding Units).

    If you were to code this new design in RTL coding style, among ID, EX1, and EX2, you would code _____________ first, and then ______________, and finally _____________. Assume that the register file is negative-edge triggered and the rest of the system is positive-edge triggered.

    In the RTL coding of Lab 7 Part 3, in the main clocked procedural block, we used _______________ ( if(STALL) / if (~STALL)) ________________ (with / without) an else clause.

    [Answer area for the two new forwarding units. Labels: PRIORITY, FORW_ID, FU_ID, FORW_EX1, FU_EX1.]

    FOR REFERENCE ONLY

    [Fig. 1: Lab 7 Part 3 Block Diagram (revised 7/18/2010). Five-stage pipeline (IF, ID, EX1, EX2, WB) with PC, I-MEM, Reg. File, two comparison stations in the ID stage (ID_XMEX1, ID_XMEX2), HDU, FU1, FU2, the A-3, A+4, and add-1 units, the muxes X1_Mux, R1_Mux, X2_Mux, and R2_Mux, and the signals STALL, PRIORITY, FORW1, FORW2, SKIP1, and SKIP2. Diagram not reproduced.]

    Figure note: ID_XMEX1 = ID_XA matched with EX1_RA.

    1. Complete all missing connections to the Reg. File. Also complete the RA (Result Address) connection in ID stage (ID_RA).
    2. Complete all five enable (EN) controls on the pipeline registers (including PC).
    3. Complete the forwarding path from EX2 to EX1. Should it start from upstream or downstream of the X2_mux?
    4. Complete the skip controls (SKIP1, SKIP2).
    5. Draw the logic for the HDU, FU1, and FU2, producing STALL, PRIORITY, FORW1, FORW2.

    COMPLETE THIS

    [Subpart 1 figure: Lab 7 Part 3 Block Diagram, modified 12/7/2010 for the Fall 2010 Final Exam. Same five-stage pipeline (IF, ID, EX1, EX2, WB), now with an internally forwarding Reg. File, forwarding units FU_ID and FU_EX1, the muxes XID_Mux, R1_Mux, XEX1_Mux, and R2_Mux, the data labels ID_XD_OUT, EX1_XD, EX1_XD_OUT, EX2_XD, and EX2_XD_OUT, and the signals STALL, PRIORITY, FORW_ID, FORW_EX1, SKIP1, and SKIP2. One annotation on the figure reads: Write a "1" or "2". Diagram not reproduced.]

    Figure note: ID_XMEX1 = ID_XA matched with EX1_RA.

    1. Connect/label all missing connections to the Reg. File. Also complete the RA (Result Address) connection in ID stage (ID_RA).
    2. Complete all five enable (EN) controls on the pipeline registers (including PC).
    3. Complete the forwarding paths into ID. If a path is not needed, write "no connection".
    4. Complete the skip controls (SKIP1, SKIP2).
    5. Draw on a separate paper the logic for the FU_ID and FU_EX1, producing PRIORITY, FORW_ID, FORW_EX1.

    2 ( 12 points) 5 min.

    RTL coding: Suppose you are asked to write Verilog RTL code using one clocked always procedural block for the control unit (CU) and another clocked procedural block for the datapath unit (DPU).

    [Figure: two alternative partitions (left diagram and right diagram) of the same design into CU and DPU. Each diagram shows the NSL, the SM with Current_State, the OFL, and the datapath registers X_Reg and Y_Reg, each fed by a 2-to-1 mux (I0, I1, S, Y). Diagrams not reproduced.]

    A. Would you divide the two parts as per the left diagram or the right diagram? Left / Right
    B. Is it essential to have the RESET control for the CU or the DPU? CU / DPU
    C. The outputs of OFL will be treated as intermediate variables or final outputs? Intermediate / Final
    D. You will be using blocking or non-blocking assignments to produce these OFL outputs? Blocking / Non-blocking
    E. Is it possible to combine the two clocked always blocks into one single always block? Yes / No
    F. If combining is possible, the combined always block __________ (will / will not) have a RESET control signal in the event list (sensitivity list).

    3 ( 27 points) 20 min.

    Arithmetic (Fast Adders)

    3.1 You are taught the following cascadable incrementer which performs R2R1R0 = A2A1A0 + C0.

    [Figure: 3-bit cascadable incrementer. Three incrementing cells (input a = X2, X1, X0 taken from A2, A1, A0, carry input c, outputs p and s = S2, S1, S0 giving R2, R1, R0) send p2, p1, p0 to a "New CLL INC" block, which takes C0 and produces C1, C2, and C3. Diagram not reproduced.]

    Incrementing cell:
    Si = Xi (+) 0 (+) Ci = Xi (+) Ci
    pi = Xi + 0 = Xi
    gi = Xi . 0 = 0

    New CLL INC: Since all gi are zeros,
    C1 = p0 . C0
    C2 = p1 . p0 . C0
    C3 = p2 . p1 . p0 . C0

    Least significant module’s C0 is tied to a 1.
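The incrementer equations above can be checked exhaustively with a short sketch. The function name `increment3` and the bit-list representation are illustrative choices, not part of the exam; the equations themselves are the ones stated for the incrementing cell and the New CLL INC.

```python
def increment3(a, c0):
    """Return (R2R1R0, C3) for a 3-bit value a plus carry-in c0."""
    x = [(a >> i) & 1 for i in range(3)]   # X0, X1, X2
    p = x                                  # pi = Xi + 0 = Xi, and gi = 0
    # Lookahead carries: every term is a product of propagates with C0.
    c1 = p[0] & c0
    c2 = p[1] & p[0] & c0
    c3 = p[2] & p[1] & p[0] & c0
    carries = [c0, c1, c2]
    s = [x[i] ^ carries[i] for i in range(3)]  # Si = Xi (+) Ci
    r = s[0] | (s[1] << 1) | (s[2] << 2)
    return r, c3


# Exhaustive check against ordinary integer addition.
for a in range(8):
    for c0 in (0, 1):
        r, c3 = increment3(a, c0)
        assert (c3 << 3) | r == a + c0
```

Because every gi is zero, each carry is a single AND of the lower propagates with C0, which is what keeps this CLL smaller than a generic one.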

    Complete the following cascadable decrementer, which performs R2R1R0 = A2A1A0 - 1 by adding 111 to subtract a 1 (R2R1R0 = A2A1A0 + 111). Complete the 7 rectangles.

    7pts

    [Figure: 3-bit cascadable decrementer. Three decrementing cells (input a = X2, X1, X0 taken from A2, A1, A0, carry input c, outputs g and s = S2, S1, S0 giving R2, R1, R0) send g2, g1, g0 to a "New CLL DEC" block, which takes C0 and produces C1, C2, and C3. Diagram not reproduced.]

    Decrementing cell (XNOR):
    Si = Xi (+) 1 (+) Ci = Xi XNOR Ci
    pi = Xi + 1 = ____
    gi = Xi . 1 = ____

    New CLL DEC: Since all pi are ____,
    C1 = ____ C0
    C2 = ____ C0
    C3 = ____ C0

    Least significant module’s C0 is tied to a ____.

    3.2 You have gone through the following solution to a question in an earlier exam.

    9-bit constant addition: You need to add 000_111_000 to A8A7A6A5A4A3A2A1A0 and produce the result R8R7R6R5R4R3R2R1R0. Mr. Trojan says that you should be able to do it by cascading just one incrementer and one decrementer designed before. Complete the design below.

    SOLUTION

    [Figure: one New CLL INC incrementer (cells X2, X1, X0 with sums S2, S1, S0) and one New CLL DEC decrementer (cells Y2, Y1, Y0 with outputs D2, D1, D0) placed between the inputs A8-A0 and the results R8-R0; the carry out is marked "C9?". Diagram not reproduced.]

    A variation of the above question is to add 000_111_000_111 to A11A10A9A8A7A6A5A4A3A2A1A0 and produce the result R11R10R9R8R7R6R5R4R3R2R1R0. The following design is complete and correct. Let us analyze and try to improve it!

    [Figure: 12-bit design built as a linear cascade of four 3-bit modules: a decrementer on A2-A0, an incrementer on A5-A3, a decrementer on A8-A6, and an incrementer on A11-A9, producing R11-R0, with the carries C0, C3, C6, C9, and C12 chained from module to module. Diagram not reproduced.]

    9pts

    State cumulative delays in gate-delays:

    C6: _______; C9: _______; C11: _______; R11: _______; C12: _______; C3: _______;

    Note: In EE457, we count an XOR or an XNOR as a 2-gate-delay device.

    Miss Trojan proposed to have Group Propagates (upper case P’s) and Group Generates (upper case G’s) so that she can add a 2nd level CLL and avoid the linear cascade shown above.

    Here is Miss Trojan’s proposed design. She wants you to simplify a regular CLL to form the special 2nd level CLL which takes advantage of the specific values of P’s and G’s and overall C0. Note that this design is meant for this specific 12-bit constant addition and it need not be cascadable or extendible.

    5pts

    [Figure: a New CLL INC block (inputs p0, p1, p2, C0; outputs C1, C2, C3) and a New CLL DEC block (inputs g0, g1, g2, C0; outputs C1, C2, C3), each extended with Group Propagate (P) and Group Generate (G) outputs. Diagram not reproduced.]

    New CLL INC: Since the individual gi’s are all _______ (zero/one), the Group G = ________ and the Group P = ________

    New CLL DEC: Since the individual pi’s are all _______ (zero/one), the Group P = ________ and the Group G = ________

    20pts

    [Figure: Miss Trojan’s Design. Four 3-bit modules (New CLL DEC on A2-A0, New CLL INC on A5-A3, New CLL DEC on A8-A6, New CLL INC on A11-A9, producing R11-R0) each send P and G to a "2nd level CLL specific for this problem". That CLL has the inputs C0, P0, G0, P1, G1, P2, G2, P3, G3 and the outputs C1, C2, C3, C4; the module carries in the figure are labeled C0, C3, C6, C9, and C12. Diagram not reproduced.]

    Write equations for C4, C3, C2, C1 for a generic CLL in terms of the 9 inputs (C0, P0, G0, P1, G1, P2, G2, P3, G3), and simplify, substituting constants of 0 or 1 wherever possible. For example, C0 = 0.

    C0: 0 (0 / 1 / variable);
    P0: _______ (0 / 1 / variable); G0: _______ (0 / 1 / variable);
    P1: _______ (0 / 1 / variable); G1: _______ (0 / 1 / variable);
    P2: _______ (0 / 1 / variable); G2: _______ (0 / 1 / variable);
    P3: _______ (0 / 1 / variable); G3: _______ (0 / 1 / variable);

    C1 = G0 + P0.C0 =

    C2 =

    C3 =

    C4 =

    State delays in gate-delays

    C6: _______; C9: _______; C11: _______; R11: _______; C12: _______; C3: _______;

    Cancelled

    3.3 In a 256x256 multiplier, to reduce 256 PPs (partial products) in a CSA tree, the number of levels of the CSA needed is approximately
    (a) log_10 256 (b) log_2 256 (c) log_1.5 256 (d) log_1.5 64 (e) log_2 64 (f) other ___________

    The above is ________________ (lower / upper) bound.

    The partial CSA tree on the right needs _______ iterations to reduce the 256 PPs to _____ (2 / 1) vector(s) for further processing by the CPA.
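A full CSA tree can be counted level by level with a small sketch. It assumes each level groups the current vectors in threes, each 3:2 carry-save adder turns three vectors into two, and any leftover vectors pass through unchanged. The function name `csa_levels` is illustrative, and this counts a full tree rather than the partial tree in the figure.

```python
def csa_levels(num_vectors):
    """Count the 3:2 CSA levels needed to reduce num_vectors to 2 vectors."""
    levels = 0
    while num_vectors > 2:
        groups, leftover = divmod(num_vectors, 3)
        # Each group of three becomes a sum vector and a carry vector.
        num_vectors = 2 * groups + leftover
        levels += 1
    return levels


print(csa_levels(256))
```

Each level shrinks the vector count by a factor of roughly 1.5, which is why a logarithm to the base 1.5 appears among the choices.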

    4 ( 71 points) 40 min.

    4.1 Out of Order (OoO) Execution

    4.1.1 "Branch Prediction and speculative execution beyond branches" is only possible in the design on the _______ (left / right) because in the other design, if we dispatch instructions based on prediction, these speculative instructions ____________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________

    4.1.2 We know that a memory delay of 10ns would mean 10 clocks for a processor running at ___________ (1 GHz / 2 GHz) and the same 10ns would mean 20 clocks if the processor is running at ___________ (1 GHz / 2 GHz). Hence, when increasing the processor frequency from 1 GHz to 2 GHz (without any change in memory speed), you would recommend that the depth of the instruction queues is __________________ (increased / decreased) in the case of ____________________________________________________________________________(state the queue name/names from among Integer queue, Load-Store queue, Divider queue, Multiplier queue).
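The clock counts quoted in 4.1.2 follow from multiplying the memory delay by the clock frequency. A minimal sketch, with the hypothetical helper name `delay_in_clocks`:

```python
def delay_in_clocks(delay_ns, freq_ghz):
    """Convert a delay in nanoseconds to clock cycles at freq_ghz.

    One GHz is one cycle per nanosecond, so cycles = ns * GHz.
    """
    return round(delay_ns * freq_ghz)


for freq_ghz in (1, 2):
    print(freq_ghz, "GHz:", delay_in_clocks(10, freq_ghz), "clocks")
```

The same 10 ns memory delay therefore spans twice as many clocks when the processor frequency doubles.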

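    [Figure for 3.3: partial CSA tree with stages S1, S2, S3, S4, and S5 feeding a CPA. Diagram not reproduced.]

    6pts

    [Figure for 4.1: two out-of-order execution designs, one on the left and one on the right. Surviving labels in both designs: Issue Unit, Int. Divider, Int. Multiplier, TAG FIFO. Diagrams not reproduced.]

    3pts

    4pts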

    4.1.3 RAW dependency is solved by simply making the reader wait until the writer can forward the information to the reader in the design on _____________ (the left / the right / both sides).

    4.1.4 In the short code of 4 lines on the side, you notice that OoO can potentially cause ___________________________________ (RAW/WAR/WAW/multiple of these(state them)) hazards for $8.

    In-order writing alone in the design on the _____________ (left / right) eliminates _____________________________ (RAW/WAR/WAW) hazards among register.

    Design on the left: Let us say that the dispatch unit assigns a symbolic Tag of LION to the destination register $8 of instr. #1 and a little later assigns TIGER to the destination register $8 of instr. #3. LION is written against $8 in RST first and later it is replaced by TIGER. State how the hazards listed by you for $8 are addressed in the design on the left.____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ Design on the right: Let us say that the dispatch unit assigns the ROB Tag of 21 to the destination register $8 of instr. #1 and a little later assigns the ROB Tag of 23 to the destination register $8 of instr. #3. Again state how the hazards listed by you for $8 are addressed in the design on the right.

    ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________ ____________________________________________________________________________

    4.1.5 Conditional branches cause more stalls in dispatch in the design on the ___________ (left / right) where as in the design on the ___________ (left / right) more flushes due to branch mispredictions occur.

    4.1.6 Flush of ___________________________ (IFQ / Backend / ROB) is same in both the designs due to (circle all applicable items below) (a) mispredicted conditional branches (beq/bne) (b) unconditional jump (j) (c) unconditional jump and link (jal) (d) unconditional program return (jr $31)

    4.1.7 It is enough to predict the direction of a conditional branch using BPB, standing for Branch _____________ Buffer, if prediction is done from the ________________ (IF stage / Dispatch stage), but in the other case, we need BTB, standing for Branch ___________ Buffer to provide the target address.

    Code for question 4.1.4:

    add $8, $1, $2;     instr. #1
    lw $10, 100($8);    instr. #2
    add $8, $3, $4;     instr. #3
    lw $11, 100($8);    instr. #4

    2pts
    4pts
    10pts
    3pts
    3pts
    3pts

    Cancelled

    4.1.8 On the side we have shown three instructions at the PC values 1040H, 2040H, and 3040H. They are distanced by 1000H bytes = 400H words. If BPB and BTB are each 1K deep (2^10 = 1K = 400H), does it cause aliasing? Where is aliasing unacceptable? ______________________ (trying to predict from IF stage / trying to predict from dispatch stage / both / none). Explain briefly.
    _____________________________________________________________________________
    _____________________________________________________________________________
    _____________________________________________________________________________
    _____________________________________________________________________________
    _____________________________________________________________________________
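The indexing in 4.1.8 can be tried out with a short sketch. It assumes the 1K-deep buffer is indexed by the low-order 10 bits of the word address (PC bits 11 down to 2), which is a common arrangement but is not stated in the question; `bpb_index` is a hypothetical helper name.

```python
ENTRIES = 1 << 10  # 1K-deep BPB/BTB (2^10 = 400H)


def bpb_index(pc):
    """Index from the word address: drop 2 byte-offset bits, keep 10 bits."""
    return (pc >> 2) & (ENTRIES - 1)


pcs = [0x1040, 0x2040, 0x3040]
indices = [bpb_index(pc) for pc in pcs]
for pc, index in zip(pcs, indices):
    print(f"PC {pc:04X}H -> entry {index:03X}H")
```

Under this assumption, PCs that are a multiple of 400H words apart differ only in address bits above the index, so they share one entry unless a tag tells them apart.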

    4.2 Exceptions:

    4.2.1 Page fault _______________ (needs /does not need) to be associated with an instruction and hence ________ (is / isn’t) a precise exception.

    4.2.2 As part of handling a precise exception, we need to
    (a) tag the offending instruction with its Cause and EPC info. T/F
    (b) convert the offending instruction and all the following instructions into bubbles. T/F
    (c) allow all preceding (senior) instructions in process order to complete. T/F
    (d) be silent and carry the Cause and EPC until the offending instruction reaches WB. T/F

    4.2.3 Place a check mark in the stage (or stages) an exception can occur.

    Exception                     IF   ID   EX   MEM  WB
    Page Fault
    Integer Overflow
    Undefined Opcode
    Memory Protection Violation

    4.3 RAS (Return Address stack)

    4.3.1 Consider the following 4 types of program control instructions:
    (a) unconditional branches (example: j)
    (b) function calls (example: jal)
    (c) conditional branches (example: beq, bne)
    (d) function returns (example: jr $31)

    RAS provides target address for the __________________ (j/jal/beq/bne/jr $31) instruction(s).

    Instructions for question 4.1.8:

    1040: beq $1, $2, 20;
    . . . . .
    2040: add $4, $5, $6;
    . . . . .
    3040: beq $10, $20, 5;

    6pts
    2pts
    4pts
    4pts
    4pts

    A PUSH operation on RAS takes place when _________________ (j/jal/beq/bne/jr $31) is/are executed.
    A POP operation on RAS takes place when ___________________ (j/jal/beq/bne/jr $31) is/are executed.

    4.3.2 RAS, being usually _______________ (small / large), can only predict the return address. The prediction can go wrong if the degree of nesting/ the degree of recursion in recursive call _______________________ (exceeds / does not exceed) the depth of the RAS.
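The effect of RAS depth on nested calls can be explored with a toy model. This sketch assumes a return address stack that pushes on a call, pops on a return, and discards its oldest entry when it overflows; real designs differ in how they handle overflow, and the class name, helper name, and addresses are made up for illustration.

```python
class ReturnAddressStack:
    """Fixed-depth stack that drops its oldest entry on overflow."""

    def __init__(self, depth):
        self.depth = depth
        self.entries = []

    def push(self, return_address):
        if len(self.entries) == self.depth:
            self.entries.pop(0)  # overflow: the oldest return address is lost
        self.entries.append(return_address)

    def pop(self):
        # An empty stack has no prediction to offer.
        return self.entries.pop() if self.entries else None


def count_mispredicted_returns(nesting, depth):
    """Make `nesting` nested calls, then return from all of them."""
    ras = ReturnAddressStack(depth)
    actual = []  # the true return addresses, innermost call last
    for level in range(nesting):
        return_address = 0x1000 + 4 * level  # made-up addresses
        actual.append(return_address)
        ras.push(return_address)
    mispredicted = 0
    while actual:
        if ras.pop() != actual.pop():
            mispredicted += 1
    return mispredicted


print(count_mispredicted_returns(nesting=4, depth=4))
print(count_mispredicted_returns(nesting=6, depth=4))
```

In this model the innermost returns are always predicted correctly; only the returns whose entries were pushed out by deeper calls lose their prediction.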

    4.4 CMP (Chip Multiprocessors) with CMT (Chip multithreading)

    4.4.1 ILP (standing for ___________________________ ) has been more or less fully exploited and processor architects have turned to exploit TLP (standing for ___________________________ )

    4.4.2 When one thread is switched with another in a multi-threaded core, the register file contents are saved in the main memory. T / F Since the number of alternative register files ______________ (is / isn’t quite) finite and since the number of process control block copies in memory ______________ (is / isn’t quite) finite, the number of threads per a multi-threaded core is _____________________________________, where as the number of processes that can be run on a core using software context switching is generally ___________________________________________________________________.

    4.4.3 Functional units such as ALU are ____________ (common / separate) for the 4 threads running on a core.

    4.4.4 The stall penalty due to dependency on a load word instruction can usually be avoided in ______ _______________________ (Fine-Grained / Coarse-Grained / both types of / neither type of ) multithreading.

    4.4.5 _________ (Fine / Coarse)-grain multithreading switches threads on each instruction where as _________ (fine / coarse)-grain switches threads on costly stalls such as cache misses.

    4.4.6 Dynamic power considerations favor ________________________ (Uniprocessors/Multiprocessors).

    4.4.7 A _____________________ (non-blocking / blocking) cache handles multiple cache requests, usually as long as they are hits under a pending miss. A CMP (such as Sun Niagara T1) needs to use a ____________________ (non-blocking / blocking) cache to be able to execute multiple threads. A non-blocking cache ________________ (is / isn’t) useful in an OoO executing processor as it is ______________ (possible / not possible) to handle several load/store instructions in the cache.

    4.5 Multiprocessors and Cache Coherence:

    4.5.1 Snoopy controller, in a ____________________ (write-through /write-back/both/neither) cache-coherence system, does not care to watch read transactions from the other processors [ R(j) ].

    3pts

    2pts

    6pts

    1pts

    2pts

    2pts

    1pts

    3pts

    3pts

    3pts

    Label the following two state diagrams as write-through or write-back.

    [Figure: two cache-coherence state diagrams. Diagrams not reproduced.]

    First diagram: write-through / write-back (1.5pts)
    Second diagram: write-through / write-back (1.5pts)

    4.5.2 The "dual" directory in snoopy protocol refers to the duplicated _________ (TAG / DATA / TLB) RAM.

    4.5.3 If there is a "Dirty bit" besides a "Valid bit" associated with a cache block, then the designer must be using a ____________________ (write-through / write-back / any of the two / neither) cache.

    5 ( 18 points) 10 min. Non-linear pipelines:

    Complete the datapath to support the function, Z, using dedicated OUTPUT stage registers for each stage (and the needed muxes). Show where the output Z is taken from. Complete the reservation table and arrive at ICV. Draw state diagram, record greedy simple cycle(s) and arrive at MAL.

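    [Figure: datapath to complete, with input X and four stages: SQR (square), -3 (subtract 3), /9 (divide by 9), and +5 (add 5), each with a dedicated OUTPUT stage register. Show the tap off for the Z output. Diagram not reproduced.]

    Z = ((X^2 / 9)^2 + 5 - 3) / 9

    Reservation table for Z

    Stage          1   2   3   4   5   6
    Square
    Subtract 3
    Divide by 9
    Add 5

    ICV: _________________

    STATE DIAGRAM

    MAL analysis: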

    6 ( 44 points) 25 min.

    Virtual Memory and Cache

    Specs of the Trojan computer (a 32-bit address, 32-bit data, byte-addressable machine with a physically addressed cache, more specifically a PIPT cache):

    Virtual address space = 4GB, Virtual address = 32 bits (VA31-VA0) (2^32 = 4G), Physical address space = 4GB, Physical address = 32 bits (PA31-PA0) (2^32 = 4G)

    Page size = 2KB (2^11 = 2K), TLB size = 64 entries (fully-associative) (2^6 = 64)
    Page table organization: 2-level table with 256-entry (2^8 = 256) page directory (top level table)

    Cache size = 192KB (3*2^16 = 192K), Cache Block (cache line) size = four 32-bit words (16 bytes total) (2^4 = 16), Cache mapping: Set-associative with three blocks per set. (note 3 blocks per set)

    Main memory organization: Lower-order Interleaved. Degree of interleaving to suit the most efficient access of the main-memory block for transferring to cache.
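The field widths implied by these specs can be worked out mechanically. The sketch below is one way to do the arithmetic; the constant names are illustrative, and it assumes the second-level page table index takes whatever VPN bits remain after the page directory index.

```python
ADDRESS_BITS = 32
PAGE_SIZE = 2 * 1024        # 2KB pages
PAGE_DIR_ENTRIES = 256      # top-level page directory
CACHE_SIZE = 192 * 1024     # 192KB
BLOCK_SIZE = 16             # four 32-bit words
WAYS = 3                    # three blocks per set


def log2_exact(n):
    """Return log2(n), insisting that n is a power of two."""
    assert n > 0 and n & (n - 1) == 0, "not a power of two"
    return n.bit_length() - 1


# Virtual address: VPN | page offset
page_offset_bits = log2_exact(PAGE_SIZE)
vpn_bits = ADDRESS_BITS - page_offset_bits
dir_index_bits = log2_exact(PAGE_DIR_ENTRIES)
second_level_index_bits = vpn_bits - dir_index_bits  # assumption: the rest

# Physical address for the cache: TAG | SET | block offset (word + byte)
num_sets = CACHE_SIZE // (BLOCK_SIZE * WAYS)
block_offset_bits = log2_exact(BLOCK_SIZE)
set_bits = log2_exact(num_sets)
tag_bits = ADDRESS_BITS - set_bits - block_offset_bits

print("page offset:", page_offset_bits, "VPN:", vpn_bits)
print("directory index:", dir_index_bits,
      "2nd-level index:", second_level_index_bits)
print("sets:", num_sets, "set bits:", set_bits, "tag bits:", tag_bits)
```

The cache size is not a power of two, but dividing by the three blocks per set leaves a power-of-two number of sets, so the set field is still a whole number of bits.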

    6.1 Divide the virtual address into VPN (Virtual Page Number) and Page offset fields. Since TLB is a fully associative TLB, we ____________ (further divide / do not divide) the VPN into TAG and SET fields. How many comparators of what size are needed in the TLB? _____________ _______________________________________

    Is any portion of the virtual address used for "indexing" TLB? ______________ (Yes / No ).

    6.2 Divide the virtual address into VPN and Page offset fields again and further divide the VPN (based on the page table organization information) into page directory index and 2nd-level page table index.

    6.3 Divide the physical address into PPFN (Physical Page Frame Number) and Page offset fields.

    4pts

    [Answer field for 6.1: virtual address VA31 VA30 ... VA1 VA0 (VA31-VA0), with the Word and Byte fields and the BE3-BE0 bank enables (byte enables) already marked.]

    3pts

    [Answer field for 6.2: virtual address VA31 VA30 ... VA1 VA0 (VA31-VA0), with the Word and Byte fields and the BE3-BE0 bank enables (byte enables) already marked.]

    3pts

    [Answer field for 6.3: physical address PA31 PA30 ... PA1 PA0 (PA31-PA0), with the Word and Byte fields and the BE3-BE0 bank enables (byte enables) already marked.]

    6.4 Divide the physical address (based on cache specifications) into TAG, SET, WORD and BYTE fields

    6.5 If the 32-bit physical byte address (produced by address translation through TLB or Page Table) is 90586124H (1001_0000_0101_1000_0110_0001_0010_0100B), which set in the cache you will be approaching? Does this set number form an index (an address) into _____________________________ (the multiple TAG RAMs/the single TAG RAM/neither of these).
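Splitting a physical address into cache fields is a matter of shifts and masks. This sketch assumes 12 set bits and 4 block-offset bits, which is what 192KB / (16 bytes x 3 blocks per set) = 4096 sets works out to; the function name `split_physical_address` is illustrative.

```python
def split_physical_address(pa, set_bits=12, block_offset_bits=4):
    """Return (tag, set_index, word, byte) for a 32-bit physical address."""
    offset = pa & ((1 << block_offset_bits) - 1)
    set_index = (pa >> block_offset_bits) & ((1 << set_bits) - 1)
    tag = pa >> (set_bits + block_offset_bits)
    word = offset >> 2   # which 32-bit word within the 16-byte block
    byte = offset & 0x3  # which byte within that word
    return tag, set_index, word, byte


tag, set_index, word, byte = split_physical_address(0x90586124)
print(f"tag={tag:04X}H set={set_index:03X}H word={word} byte={byte}")
```

The same set number is presented as the address to every TAG RAM and DATA RAM in a set-associative organization, and the tag field is what the comparators check.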

    Complete the TAG RAM details in the side panel.

    6.6 Complete the Cache DATA RAM details below.

    6.7 Complete the Interleaved Main Memory details below.

    6.8 TLB miss does not cause a TRAP. T / F During TLB look up, a Read/Write/Execute violation (a memory protection violation) causes a TRAP. T / F

    6.9 If there is only one TAG RAM, it is a ________________________ (direct mapped/set-associative/fully-associative) cache.

    3pts

    [Answer field for 6.4: physical address PA31 PA30 ... PA1 PA0 (PA31-PA0), with the Word and Byte fields and the BE3-BE0 bank enables (byte enables) already marked.]

    [Side panel for 6.5: one TAG RAM with Address, Data_in, and Data_out ports, a comparator, and a HIT output. Diagram not reproduced.]

    TAG RAM Size = ____________ + valid
    _____ more (besides the above) are needed in this cache.

    8pts

    5pts

    [Figure for 6.6: cache DATA RAM with an Address input, connected to the Trojan Processor data bus D31-D0 as four byte-wide banks (D31-D24, D23-D16, D15-D8, D7-D0). Diagram not reproduced.]

    Size: Each of the 4 byte-wide banks is a ____________ x 8
    ______ more such DATA RAM units (besides the one on the side)

    5pts

    [Figure for 6.7: one interleaved main memory unit with four 256MB byte-wide banks (D31-D24, D23-D16, D15-D8, D7-D0), addressed by PA____ - PA____ and connected to D31-D0 through a 32-bit bidirectional buffer (XCVR). Diagram not reproduced.]

    ______ more such units (besides the one on the left) exist in Main Memory.

    2.5pts

    1.5pts

    6.10 In a set associative cache of 2 blocks per set and 4 words per block, the degree of lower-order interleaving recommended for the main memory is __________ (1-way/2-way/4-way/other namely ...) and the number of TAG RAMs is __________ (8/16/32/other namely ...). The depth of a TAG RAM is determined by ________________________________________.

    6.11 The fully associative TLB can have a non-power of 2 number of entries, say 53 entries. T / F
    The number of sets in a set associative mapping can be a non-power of 2 number, say 53 sets. T / F
    The number of TAG RAMs in a set associative mapping can be a non-power of 2 number, say 3. T / F

    7 ( 21 points) 15 min.

    Page Table: Number of A,B,C Tables built by the OS:

    PQRST on the side represents a 20-bit (5-digit hex) VPN in a 3-level page table with upper 8 bits (PQ) indexing the A-level table, next 8 bits (RS) indexing the B-level tables, and the last 4 bits (T) indexing the C-level tables.

    7.1 Suppose the first 8 distinct virtual pages accessed by the application program had the VPNs as stated in TABLE-I (in sorted order). How many tables of what size are built by OS by this time?
    A-level: _____________________________________________
    B-level: _____________________________________________
    C-level: _____________________________________________

    7.2 Complete 8 distinct VPNs of your choice in TABLE-II such that the least number of A,B,C tables are built by OS. This least set consists of ____ of A-Table(s), ____ of B-Table(s), ____ of C-Table(s).

    7.3 Similarly, complete 8 distinct VPNs of your choice in TABLE-III such that the most number of A,B,C tables are built by OS. This most set consists of ____ of A-Table(s), ____ of B-Table(s), ____ of C-Table(s).

    5pts

    4pts

    TABLE-I        TABLE-II       TABLE-III
    P Q R S T      P Q R S T      P Q R S T
    1 2 3 4 5
    1 2 3 4 7
    1 2 3 6 5
    1 3 3 6 5
    1 4 3 6 5
    1 5 3 6 5
    1 6 3 6 5
    1 6 5 6 5
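Counting the tables for a set of VPNs can be sketched as follows. The sketch assumes the OS builds tables on demand: a single A-level table, one B-level table per distinct PQ value in use, and one C-level table per distinct PQRS value in use. The function name `count_tables` is illustrative.

```python
def count_tables(vpns):
    """Return (A, B, C) table counts for a list of 20-bit VPNs (PQRST)."""
    a_tables = 1 if vpns else 0
    # Each distinct upper 8 bits (PQ) uses one A-level entry -> one B table.
    b_tables = {vpn >> 12 for vpn in vpns}
    # Each distinct upper 16 bits (PQRS) uses one B-level entry -> one C table.
    c_tables = {vpn >> 4 for vpn in vpns}
    return a_tables, len(b_tables), len(c_tables)


TABLE_I = [0x12345, 0x12347, 0x12365, 0x13365,
           0x14365, 0x15365, 0x16365, 0x16565]

a, b, c = count_tables(TABLE_I)
print(f"A-level: {a} table(s) of {1 << 8} entries")
print(f"B-level: {b} table(s) of {1 << 8} entries")
print(f"C-level: {c} table(s) of {1 << 4} entries")
```

The same function can be used to try out candidate VPN sets for TABLE-II and TABLE-III, since VPNs that share leading digits share tables.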

    9pts

    6pts

    6pts

    Blank space for rough work

    We enjoyed teaching this course. Hope you liked the course. Hope to see some of you in EE454L or EE560. Grades will be out in a week. Enjoy your Xmas break! - Gandhi, Jonathan, Prasanjeet, Sabya, Mehrtash, Ben, Ankit, Girish, Jingming, Sumit
