Computer Architecture: A Constructive Approach Branch Prediction - 1 Arvind

Computer Architecture: A Constructive Approach

Branch Prediction - 1

ArvindComputer Science & Artificial Intelligence Lab.Massachusetts Institute of Technology

April 9, 2012 L16-1http://csg.csail.mit.edu/6.S078

http://csg.csail.mit.edu/6.S078

I-cache

Fetch Buffer

IssueBuffer

Func.Units

Arch.State

Execute

Decode

ResultBuffer Commit

PCFetch

Branchexecuted

Next fetch started

Modern processors may have > 10 pipeline stages between next PC calculation and branch resolution !

Control Flow Penalty

How much work is lost if pipeline doesn’t follow correct instruction flow?

~ Loop length x pipeline width

April 9, 2012 L12-2

Average Run-Length between BranchesAverage dynamic instruction mix from SPEC92:

SPECint92 SPECfp92ALU 39 % 13 % FPU Add 20 %FPU Mult 13 %load 26 % 23 %store 9 % 9 %branch 16 % 8 %other 10 % 12 %

SPECint92: compress, eqntott, espresso, gcc , liSPECfp92: doduc, ear, hydro2d, mdijdp2, su2cor

What is the average run-length between branches? April 9, 2012 L16-3http://csg.csail.mit.edu/6.S078

Instruction Taken known?Target known?JJRBEQZ/BNEZ

MIPS Branches and JumpsEach instruction fetch depends on one or two pieces of information from the preceding instruction: 1. Is the preceding instruction a taken branch? 2. If so, what is the target address?

After Inst. Decode

After Inst. Decode After Inst. DecodeAfter Inst. Decode After Reg. FetchAfter Exec

Currently our simple pipelined architecture does very simple branch predictionWhat is it?

Branch is predicted not taken: pc, pc+4, pc+8, …

Can we do better?

Branch Prediction Bits• Assume 2 BP bits per instruction• Use saturating counter

On ¬taken

On taken

1 1 Strongly taken

1 0 Weakly taken

0 1 Weakly ¬taken

0 0 Strongly ¬taken

Branch History Table (BHT)

4K-entry BHT, 2 bits/entry, ~80-90% correct predictions

0 0Fetch PC

Branch? Target PC

I-Cache

Opcode offsetInstruction

kBHT Index

2k-entryBHT,2 bits/entry

Taken/¬Taken?

Where does BHT fit in the processor pipeline?

BHT can only be used after instruction decode

What should we do at the fetch stage?

Need a mechanism to update the BHT where does the update information come from

Overview of branch prediction

Need next PC immediately

Decode RegRead Execute

Instr type, PC relative

targets available

Simple conditions,

register targets available

Complex conditions available

Next AddrPred

BP,JMP,Ret

Loose loop Loose loop Loose loopTight loop

Best predictors reflect program

behavior

Next Address Predictor (NAP)first attempt

BP bits are stored with the predicted target address. IF stage: nPC = If (BP=taken) then target else pc+4 later: check prediction, if wrong then kill the instruction and update BTB & BPb else update BPb

Branch Target Buffer (2k entries)k

BPbpredicted

target BP

target

Address Collisions

What will be fetched after the instruction at 1028?NAP prediction = Correct target =

Assume a 128-entry NAP BPbtarget

take236

1028 Add .....

132 Jump 100

InstructionMemory

2361032

kill PC=236 and fetch PC=1032

Is this a common occurrence?Can we avoid these bubbles?

Use NAP for Control Instructions onlyNAP contains useful information for branch and jump instructions only

Do not update it for other instructions

For all other instructions the next PC is (PC)+4 !

How to achieve this effect without decoding the instruction?

Branch Target Buffer (BTB)a special form of NAP

• Keep the (pc, predicted pc) in the BTB • pc+4 is predicted if no pc match is found• BTB is updated only for branches and jumps

2k-entry direct-mapped BTBI-Cache PC

Entry PC

=match

predicted

target

target PC

Permits nextPC to be determined before instruction is decodedApril 9, 2012 L16-13http://csg.csail.mit.edu/6.S078

Consulting BTB Before Decoding

1028 Add .....

132 Jump 100

BPbtargettake236

entry PC132

• The match for pc =1028 fails and 1028+4 is fetched eliminates false predictions after ALU

instructions

• BTB contains entries only for control transfer instructions

more room to store branch targetsEven very small BTBs are very effective

ObservationsThere is a plethora of branch prediction schemes – their importance grows with the depth of processor pipelineProcessors often use more than one prediction schemeIt is usually easy to understand the data structures required to implement a particular schemeIt takes considerably more effort to understand how a particular scheme with its lookup and updates is integrated in the pipeline and how various schemes interact with each other

PlanWe will begin with a very simple 2-stage pipeline and integrate a simple BTB scheme in it

We will extend the design to a multistage pipeline and integrate at least one more predictor, say BHT, in the pipeline (next lecture)

revisit the simple two-stage pipeline without branch prediction

Decoupled Fetch and Execute

Fetch Execute

<instructions, pc, epoch>

nextPC

Fetch sends instructions to Execute along with pc and other control informationExecute sends information about the target pc to Fetch, which updates pc and other control registers whenever it looks at the nextPC fifo

A solution using epochAdd fEpoch and eEpoch registers to the processor state; initialize them to the same value The epoch changes whenever Execute determines that the pc prediction is wrong. This change is reflected immediately in eEpoch and eventually in fEpoch via nextPC FIFOAssociate the fEpoch with every instruction when it is fetched In the execute stage, reject, i.e., kill, the instruction if its epoch does not match eEpoch

Two-Stage pipelineA robust two-rule solution

InstMemory

Decode

Register File

Execute

DataMemory

BypassFIFO

PipelineFIFO

PCfEpo

Either fifo can be a normal (>1 element) fifoApril 9, 2012 L16-19http://csg.csail.mit.edu/6.S078

Two-stage pipeline Decoupledmodule mkProc(Proc); Reg#(Addr) pc <- mkRegU; RFile rf <- mkRFile; IMemory iMem <- mkIMemory; DMemory dMem <- mkDMemory; PipeReg#(TypeFetch2Decode) ir <- mkPipeReg; Reg#(Bool) fEpoch <- mkReg(False); Reg#(Bool) eEpoch <- mkReg(False); FIFOF#(Addr) nextPC <- mkBypassFIFOF; rule doFetch (ir.notFull); let inst = iMem(pc); ir.enq(TypeFetch2Decode {pc:pc, epoch:fEpoch, inst:inst}); if(nextPC.notEmpty) begin pc<=nextPC.first; fEpoch<=!fEpoch; nextPC.deq;end else pc <= pc + 4; endrule

explicit guard

simple branch predictionApril 9, 2012 L16-20http://csg.csail.mit.edu/6.S078

Two-stage pipeline Decoupled contrule doExecute (ir.notEmpty); let irpc = ir.first.pc; let inst = ir.first.inst; if(ir.first.epoch==eEpoch) begin let eInst = decodeExecute(irpc, inst, rf); let memData <- dMemAction(eInst, dMem); regUpdate(eInst, memData, rf); if (eInst.brTaken) begin nextPC.enq(eInst.addr); eEpoch <= !eEpoch; end end ir.deq;endruleendmodule

Two-Stage pipeline with a Branch Predictor

InstMemory

Decode

Register File

Execute

DataMemory

PCfEpo

chBranch

Predictor

Branch Predictor Interfaceinterface NextAddressPredictor; method Addr prediction(Addr pc); method Action update(Addr pc, Addr target);endinterface

Null Branch Predictionmodule mkNeverTaken(NextAddressPredictor); method Addr prediction(Addr pc); return pc+4; endmethod

method Action update(Addr pc, Addr target); noAction; endmethodendmodule

Replaces PC+4 with … Already implemented in the pipeline

Right most of the time Why?

Branch Target Prediction (BTB)module mkBTB(NextAddressPredictor); RegFile#(LineIdx, Addr) tagArr <- mkRegFileFull; RegFile#(LineIdx, Addr) targetArr <- mkRegFileFull;

method Addr prediction(Addr pc); LineIdx index = truncate(pc >> 2); let tag = tagArr.sub(index); let target = targetArr.sub(index); if (tag==pc) return target; else return (pc+4); endmethod

method Action update(Addr pc, Addr target); LineIdx index = truncate(pc >> 2); tagArr.upd(index, pc);

targetArr.upd(index, target); endmethodendmoduleApril 9, 2012 L16-25http://csg.csail.mit.edu/6.S078

Two-stage pipeline + BPmodule mkProc(Proc); Reg#(Addr) pc <- mkRegU; RFile rf <- mkRFile; IMemory iMem <- mkIMemory; DMemory dMem <- mkDMemory; PipeReg#(TypeFetch2Decode) ir <- mkPipeReg; Reg#(Bool) fEpoch <- mkReg(False); Reg#(Bool) eEpoch <- mkReg(False); FIFOF#(Tuple2#(Addr,Addr)) nextPC <- mkBypassFIFOF; NextAddressPredictor bpred <- mkNeverTaken; The definition of TypeFetch2Decode is changed to

include predicted pc typedef struct { Addr pc; Addr ppc; Bool epoch; Data inst; } TypeFetch2Decode deriving (Bits, Eq);

Some target predictor

Two-stage pipeline + BP Fetch rule rule doFetch (ir.notFull); let ppc = bpred.prediction(pc); let inst = iMem(pc); ir.enq(TypeFetch2Decode {pc:pc, ppc:ppc, epoch:fEpoch, inst:inst}); if(nextPC.notEmpty) begin match{.ipc, .ippc} = nextPC.first; pc <= ippc; fEpoch <= !fEpoch; nextPC.deq; bpred.update(ipc, ippc); end else pc <= ppc; endrule

Two-stage pipeline + BP Execute rulerule doExecute (ir.notEmpty); let irpc = ir.first.pc; let inst = ir.first.inst; let irppc = ir.first.ppc; if(ir.first.epoch==eEpoch) begin let eInst = decodeExecute(irpc, irppc, inst, rf); let memData <- dMemAction(eInst, dMem); regUpdate(eInst, memData, rf); if (eInst.missPrediction) begin nextPC.enq(tuple2(irpc, eInst.brTaken ? eInst.addr : irpc+4)); eEpoch <= !eEpoch; end end ir.deq;endruleendmodule

Execute Functionfunction ExecInst exec(DecodedInst dInst, Data rVal1, Data rVal2, Addr pc, Addr ppc); ExecInst einst = ?;

let aluVal2 = (dInst.immValid)? dInst.imm : rVal2 let aluRes = alu(rVal1, aluVal2, dInst.aluFunc); let brAddr = brAddrCal(pc, rVal1, dInst.iType, dInst.imm); einst.itype = dInst.iType; einst.addr = (memType(dInst.iType)? aluRes : brAddr; einst.data = dInst.iType==St ? rVal2 : aluRes; einst.brTaken = aluBr(rVal1, aluVal2, dInst.brComp); einst.missPrediction = brTaken ? brAddr!=ppc : (pc+4)!=ppc; einst.rDst = dInst.rDst; return einst;endfunctionApril 7, 2012 L7-29

http://csg.csail.mit.edu/6.s078Rev

Computer Architecture: A Constructive Approach Branch Prediction - 1 Arvind

Documents

Constructive Computer Architecture: Branch Prediction: Direction Predictors Arvind

Constructive Computer Architecture: Branch Prediction: Direction Predictors Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute

Constructive Computer Architecture Cache Coherence Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology November

Computer Architecture: A Constructive Approach Next Address Prediction – Six Stage Pipeline

Constructive Computer Architecture: Multistage Pipelined Processors and modular refinement Arvind

Branch Prediction Speculative Execution · 1 Branch Prediction and Speculative Execution Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material

Computer Architecture: A Constructive Approach Branch Prediction - 2 Arvind

Computer Architecture: A Constructive Approach, Arvind, Rishiyur S

Constructive Computer Architecture Virtual Memory and …csg.csail.mit.edu/6.175/archive/2015/lectures/L21...Constructive Computer Architecture Virtual Memory and Interrupts Arvind

Computer Architecture: A Constructive Approach Sequential Circuits Arvind

Constructive Computer Architecture: Branch Prediction Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology October

Constructive Computer Architecture Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology 6.175: L01 – September 3,

Constructive Computer Architecture Interrupts/Exceptions/Faults Arvind

Computer Architecture: A Constructive Approach Combinational ALU Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

Computer Architecture: A Constructive Approach Branch Prediction - 1 Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of

Computer Architecture: A Constructive Approach Branch Direction Prediction – Six Stage Pipeline

Computer Architecture: A Constructive Approach Sequential Circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology

6.S078 - Computer Architecture: A Constructive Approach Combinational circuits Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute

Constructive Computer Architecture: Bluespec execution model and concurrency semantics Arvind

Computer Architecture: A Constructive Approach Branch Direction Prediction – Pipeline Integration