
Advanced Microarchitecture


CS8803: Advanced Microarchitecture

Lecture 3: Superscalar Fetch

Fetch Rate is an ILP Upper Bound
To sustain an execution rate of N IPC, you must be able to sustain a fetch rate of N IPC!

Over the long term, you cannot burn 2000 calories a day while only consuming 1500 calories a day. You will starve!

This also suggests that you don't need to fetch N instructions every cycle, just on average.
("I'm not fat! I just have a lot of calorie buffers.")

Impediments to Perfect Fetch
A machine with superscalar degree N will ideally fetch N instructions every cycle. This doesn't happen, due to:
- instruction cache organization
- branches
- and the interaction between the two

Instruction Cache Organization
To fetch N instructions per cycle from the I$, we need:
- a physical organization of the I$ row wide enough to store N instructions
- the ability to access the entire row at the same time
[Figure: I$ array with address decoder, per-row tag, and four instructions per cache line]
Alternative: do multiple fetches per cycle. Not good: increases cycle time/latency by too much.

Fetch Operation
Each cycle, the PC of the next instruction to fetch is used to access an I$ line. The N instructions specified by this PC and the next N-1 sequential addresses form a fetch group. The fetch group might not be aligned with the row structure of the I$.

Fragmentation via Misalignment
If PC = xxx01001 and N = 4, the ideal fetch group is xxx01001 through xxx01100 (inclusive).
[Figure: the fetch group straddles two rows of the I$ array]
Since we can only access one line per cycle, we fetch only 3 instructions (instead of N = 4).

Fetch Rate Computation
Assume N = 4 and that the fetch group starts at a random location. Then the fetch rate is
1/4 x 4 + 1/4 x 3 + 1/4 x 2 + 1/4 x 1 = 2.5 instructions per cycle.
(This is just to demonstrate how to analytically estimate fetch rates.)

Reduces Fetch Bandwidth
It now takes two cycles to fetch N instructions. Halved fetch bandwidth!
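The analytical estimates in this lecture (this 2.5 IPC figure, the 3.25 IPC wide-line variant, and the branch-limited 2.048 IPC computation a few slides later) can be checked with a short script. A minimal sketch, assuming Python; the slides themselves contain only the arithmetic, so the function names are illustrative:

```python
def misaligned_fetch_rate(n, line_size):
    # The fetch group starts at a uniformly random offset in the line;
    # from offset s, only min(n, line_size - s) instructions are
    # available before the end of the line.
    total = sum(min(n, line_size - s) for s in range(line_size))
    return total / line_size

def branch_limited_rate(n, p_branch):
    # expected(k) = expected instructions fetched when k sequential
    # slots remain; fetching stops at the first (always-taken) branch.
    def expected(k):
        if k == 1:
            return 1.0
        return p_branch * 1.0 + (1 - p_branch) * (1 + expected(k - 1))
    # Average over the n equally likely starting offsets.
    return sum(expected(k) for k in range(1, n + 1)) / n

print(misaligned_fetch_rate(4, 4))   # 2.5  (line width equals fetch width)
print(misaligned_fetch_rate(4, 8))   # 3.25 (line twice the fetch width)
print(branch_limited_rate(4, 0.2))   # ~2.048 (branch every 5 instructions)
```

This also makes the suggested exercise easy: vary `p_branch` and `line_size` to see how quickly taken branches erode fetch bandwidth.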
[Figure: cycle 1 fetches the tail of one cache line; cycle 2 fetches the head of the next]
The reduction may not be as bad as a full halving: just because you fetched only K < N instructions during cycle 1 does not limit you to fetching only N-K instructions in cycle 2.

Reducing Fetch Fragmentation
Make |fetch group| != |row width|.
[Figure: cache line twice as wide as the fetch group]
If the start of the fetch group is N or more instructions from the end of the cache line, then N instructions can be delivered.
(This approach is not terribly practical: you either have to read out twice as many instructions (2x the bitlines), or you need special logic to enable one wordline for some columns and another wordline for the others.)

May Require Extra Hardware
[Figure: a rotator realigns the fetched instructions into an aligned fetch group]
(Arbitrary rotation is not cheap to implement! Remember that each input to the rotator is a full instruction, which may be 32 bits wide.)

Fetch Rate Computation
Let N = 4 and cache line size = 8. Then the fetch rate is
5/8 x 4 + 1/8 x 3 + 1/8 x 2 + 1/8 x 1

= 3.25 instructions per cycle.
(Another example, simply assuming that the cache line is twice as wide as the fetch group.)

Fragmentation via Branches
Even if the fetch group is aligned, and/or the cache line is larger than the fetch group, taken branches disrupt fetch.
[Figure: a taken branch in mid-group wastes the fetch slots after it]

Fetch Rate Computation
Let N = 4. Assume a branch every 5 instructions on average, that branches are always taken, and that a branch target may start at any offset in a cache row. So there is a 25% chance of the fetch group starting at each location, and a 20% chance for each instruction to be a branch.

Fetch Rate Computation (2)
1/4 x 1
+ 1/4 x (0.2 x 1 + 0.8 x 2)
+ 1/4 x (0.2 x 1 + 0.8 x (0.2 x 2 + 0.8 x 3))
+ 1/4 x (0.2 x 1 + 0.8 x (0.2 x 2 + 0.8 x (0.2 x 3 + 0.8 x 4)))
= 2.048 instructions fetched per cycle
(Simplified analysis: this doesn't account for the higher probability of the fetch group being aligned when the previous fetch group contained no branches. Easy exercise: estimate the fetch rate with different taken probabilities and cache-line widths.)

Ex. IBM RS/6000
[Figure: PC = B1010; a four-column interleaved I$ with per-column T-logic, feeding an instruction buffer network that realigns instructions B10-B13 into fetch order after the tag check]
The address can be broken down as (B)(10)(10), so we want to fetch from addresses B1010, B1011, B1100, and B1101. In the 3rd column (the first instruction to fetch), the T-logic compares the offset (the last "10") to its own position (column 2).
Since the offset is less than or equal to its own position (the 3rd column has an index of 2), the T-logic does not modify the row selection (the first "10"). In the 4th column (column 3), the T-logic similarly compares the original offset of 2 to its column index of 3 and leaves the row index alone. In the 1st column (column 0), the original offset of 2 is greater than the column index of 0, so the T-logic increments the row index to select the next row instead (row 3). The 2nd column behaves similarly. This results in half of the instructions coming from row 2 (for columns 2 and 3) and the other half coming from row 3 (for columns 0 and 1). At the very end, a tag check is still performed with the upper bits of the original address.

Types of Branches
Direction: conditional vs. unconditional

Target:
- PC-encoded: PC-relative or absolute offset
- computed (target derived from a register)

Must resolve both direction and target to determine the next fetch group.

Prediction
Generally, hardware predictors are used for both direction and target. The direction predictor simply predicts whether a branch is taken or not-taken (the exact algorithms are covered next lecture). Target prediction needs to predict an actual address.
(This lecture does not discuss how to predict the direction of branches (T vs. NT); see the next lecture for that.)

Where Are the Branches?
Before we can predict a branch, we need to know that we have a branch to predict!

Where's the branch in this fetch group?
[Figure: the PC selects a raw bit pattern from the I$; nothing in the bits themselves identifies a branch]
(The main point: if all we have is a PC, we don't know where any branches are, or whether they even exist, since we haven't even fetched the instructions yet, let alone decoded them.)

Simplistic Fetch Engine
[Figure: the fetch PC accesses the I$; a predecoder (PD) per slot examines each fetched instruction; the first branch's PC + sizeof(inst) feeds the direction and target predictors]
Huge latency! Clock frequency plummets.
(PD = predecoder, which does only enough decode work to identify the branches. A mux selects the first branch in the fetch group, since there may be multiple branches.)

Branch Identification
[Figure: same pipeline, but the predictors are driven by stored branch-location bits instead of predecoders]
Store 1 bit per instruction, set if the instruction is a branch.
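The "1 bit per instruction" scheme and the mux that picks the first branch can be sketched in a few lines. A minimal illustration, assuming Python; representing the predecode bits as a list is my simplification, not any real design:

```python
def first_branch(branch_bits):
    # branch_bits: predecoded 1-bit-per-instruction marks for the fetch
    # group. Return the index of the first branch (the one the
    # predictors must handle), or None if the group has no branch.
    for i, is_branch in enumerate(branch_bits):
        if is_branch:
            return i
    return None

first_branch([0, 0, 1, 1])  # -> 2 (later branches are past the first redirect)
first_branch([0, 0, 0, 0])  # -> None (fetch continues sequentially)
```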

The partial-decode logic is removed, but there is still a long latency (the I$ access itself sometimes takes more than one cycle). Note: sizeof(inst) may not be known before decode (e.g., x86). The branch bits are predecoded when the line is filled from the L2.

Line Granularity
Predict the next fetch group independent of the exact location of branches in the current fetch group. If there's only one branch in a fetch group, does it really matter where it is?
[Figure: one predictor entry per instruction PC ("X X T X X N X X T N") vs. one predictor entry per fetch group]
(The obvious challenge is if a fetch group contains more than one branch; in such a situation, having only one predictor entry per group, rather than per instruction, will lead to aliasing problems, potentially for both direction and target prediction. This is discussed more on the next slide.)

Predicting by Line
[Figure: the cache line address indexes the direction and target predictors in parallel with the I$; the fall-through address is the line address + sizeof($-line); the line contains two branches, br1 (target X) and br2 (target Y)]

br1 | br2 | correct dir pred | correct target pred
 N  |  N  |        N         |         --
 N  |  T  |        T         |         Y
 T  |  -  |        T         |         X

Better! The latency is now determined by the branch predictor. But this is still challenging: we may need to choose between multiple targets for the same cache line.
(The main point: the critical path no longer goes through the I$.)
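Per-line prediction and the aliasing hazard it creates can be sketched concretely. A minimal sketch, assuming Python, a 16-byte line, and a plain dict standing in for a real hardware table; all names and addresses here are illustrative:

```python
LINE_BYTES = 16  # assumed fetch-group / cache-line size

def line_addr(pc):
    # One predictor entry per fetch group: drop the PC's offset bits.
    return pc & ~(LINE_BYTES - 1)

next_group = {}  # line address -> predicted next fetch-group address

def train(branch_pc, target):
    next_group[line_addr(branch_pc)] = target

def predict(pc):
    # Miss -> predict the sequential next line (+ sizeof($-line)).
    return next_group.get(line_addr(pc), line_addr(pc) + LINE_BYTES)

# Aliasing: two branches in the same line share one entry, so training
# on br2 silently evicts br1's target.
train(0x1004, 0x2000)   # br1: taken to target X
train(0x1008, 0x3000)   # br2: taken to target Y, overwrites br1's entry
predict(0x1004)          # -> 0x3000: wrong whenever br1 is the taken one
```

The upside, as the slide says, is that `predict` needs only the line address, so the lookup runs in parallel with (not after) the I$ access.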

The side table just illustrates the point made in the notes of the previous slide: since there are two branches in this one fetch group/cache line, this may lead to more difficult prediction scenarios.

Multiple Branch Prediction
[Figure: the PC, minus its low-order bits, indexes the I$, the direction predictor (producing e.g. "N N N T" for the line's branches), and the target predictor (addr0-addr3); "scan for 1st T" logic selects the first taken branch's target, with PC + sizeof($-line) as the fall-through]
(I.e., trying to make predictions for all of the branches within the cache line at the same time.)

Direction Prediction
Details next lecture. Over 90% accurate today for integer applications, and higher still for FP applications.

Target Prediction
PC-relative branches:
- If not-taken: next address = branch address + sizeof(inst)
- If taken: next address = branch address + SEXT(offset)
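The two formulas above can be written out directly. A sketch assuming Python; the 12-bit offset width and the example addresses are made up for illustration, and note that the taken case follows the slide's formula (some real ISAs add the offset to PC + sizeof(inst) instead):

```python
def sext(field, bits):
    # Interpret a `bits`-wide bit field as a two's-complement integer.
    sign = 1 << (bits - 1)
    return (field & (sign - 1)) - (field & sign)

def pc_relative_target(branch_pc, inst_size, offset_field, offset_bits, taken):
    # Not-taken: fall through to the next sequential instruction.
    # Taken: branch address plus the sign-extended encoded offset.
    if taken:
        return branch_pc + sext(offset_field, offset_bits)
    return branch_pc + inst_size

pc_relative_target(0x2000, 4, 0xFF0, 12, True)   # -> 0x1FF0 (backward branch)
pc_relative_target(0x2000, 4, 0xFF0, 12, False)  # -> 0x2004 (fall through)
```

Since neither sizeof(inst) nor the encoded offset changes between executions, both outcomes are fully predictable once the branch has been seen once.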

Sizeof(inst) doesn't change, and the offset doesn't change (not counting self-modifying code).
(Indirect branches are not discussed here, although they should be mentioned.)

Taken Targets Only
Only need to predict taken-branch targets. A taken branch's target is the same every time.
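Storing only taken-branch targets is the core idea behind a branch target buffer. A minimal dict-based sketch of that policy, assuming Python; the names and interface are illustrative, not a real design:

```python
btb = {}  # branch PC -> last observed taken target

def btb_update(branch_pc, taken, target):
    # Only taken branches allocate an entry: a not-taken "target" is
    # just the fall-through PC, which needs no storage.
    if taken:
        btb[branch_pc] = target

def btb_lookup(pc):
    # Hit -> predicted taken target; miss -> no prediction (fall through).
    return btb.get(pc)

btb_update(0x400, True, 0x900)
btb_update(0x404, False, None)
btb_lookup(0x400)  # -> 0x900
btb_lookup(0x404)  # -> None (never seen taken)
```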
