The Microarchitecture Level (Chapter 4)

Source: abdnoor80.weebly.com/uploads/1/2/2/8/12282674/ch4_coa.pdf (2018-09-07)


  • The Microarchitecture Level

    Chapter 4

  • The Microarchitecture Level

    • Its job is to implement the ISA (Instruction Set Architecture).
    • Each instruction has a few fields: the first field is the OPCODE (operation code), which identifies the instruction. The second is the OPERAND, which tells which variable is involved.

  • The Data Path (1)

    • The data path is that part of the CPU containing the ALU.

    • It contains a number of 32-bit registers.

    • 6 lines for ALU controls (next slide).

    • Shifter, two control lines: SLL8 (Shift Left Logical) shifts by 1 byte, SRA1 (Shift Right Arithmetic) shifts by 1 bit.

    • MAR (Memory Address Register)
    • MDR (Memory Data Register)
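The two shifter operations named above can be sketched in a few lines; the function names and the 32-bit mask constant are my own, not part of the book's notation:

```python
MASK32 = 0xFFFFFFFF  # the data path operates on 32-bit words

def sll8(word):
    # SLL8: shift left logical by 1 byte (8 bits), truncated to 32 bits
    return (word << 8) & MASK32

def sra1(word):
    # SRA1: shift right arithmetic by 1 bit, replicating the sign bit
    return (word >> 1) | (word & 0x80000000)

print(hex(sll8(0x12345678)))  # 0x34567800
print(hex(sra1(0x80000004)))  # 0xc0000002
```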

  • The MAR/MDR combination is used to read and write ISA-level data words. The PC/MBR combination is used to read the executable ISA-level program.


  • Data Path Timing

    Timing diagram of one data path cycle.

  • Stacks (1)

    Use of a stack for storing local variables. (a) While A is active. (b) After A calls B. (c) After B calls C. (d) After C and B return and A calls D.

  • Stacks (2)

    Use of an operand stack for doing an arithmetic computation.
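A minimal sketch of operand-stack arithmetic as in the figure, adding two local variables on the stack (the operand values here are invented for illustration):

```python
stack = []

# Evaluate a2 + a3 on an operand stack (values are made-up examples)
a2, a3 = 8, 5
stack.append(a2)                         # push a2  -> [8]
stack.append(a3)                         # push a3  -> [8, 5]
stack.append(stack.pop() + stack.pop())  # add: pop both, push the sum
result = stack.pop()                     # pop the result, stored back to a1
print(result)  # 13
```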


  • Compiling Java to IJVM (1)

    (a) A Java fragment. (b) The corresponding Java assembly language. (c) The IJVM program in hexadecimal.
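To make the translation concrete, here is a toy interpreter for three IJVM-style stack instructions, executing the statement i = j + k; the variable names and values are assumptions for illustration, not the book's figure:

```python
def run(program, variables):
    # Execute a list of (opcode, operand...) tuples on an operand stack.
    stack = []
    for op, *args in program:
        if op == "ILOAD":        # push a local variable onto the stack
            stack.append(variables[args[0]])
        elif op == "IADD":       # pop two values, push their sum
            stack.append(stack.pop() + stack.pop())
        elif op == "ISTORE":     # pop the top of stack into a variable
            variables[args[0]] = stack.pop()
    return variables

# i = j + k  compiles to:
prog = [("ILOAD", "j"), ("ILOAD", "k"), ("IADD",), ("ISTORE", "i")]
print(run(prog, {"i": 0, "j": 2, "k": 5}))  # {'i': 7, 'j': 2, 'k': 5}
```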


  • Compiling Java to IJVM (2)

    The stack after each instruction of Fig. 4-14(b).


  • Improving performance

    1. Cache memory
    2. Branch prediction
    3. Out-of-order execution with register renaming
    4. Speculative execution

    Modern processors place overwhelming demands on the memory system, both in terms of latency (the delay in supplying an operand) and bandwidth (the amount of data supplied per unit of time).

    One way to attack this problem is by providing caches. A cache holds the most recently used memory words in a small, fast memory, speeding up access to them.

    There are several benefits from having separate caches for instructions and data, often called a split cache. First, memory operations can be initiated independently in each cache, effectively doubling the bandwidth of the memory system.


  • Cache Memory

    A system with three levels of cache.

  • Cache Memory

    The CPU chip itself contains a small instruction cache and a small data cache, typically 16 KB to 64 KB. Then there is the level 2 cache, which is not on the CPU chip, but may be included in the CPU package, next to the CPU chip and connected to it by a high-speed path. This cache is generally unified, containing a mixture of data and instructions. A typical size for the L2 cache is 512 KB to 1 MB. The third-level cache is on the processor board and consists of a few megabytes of SRAM, which is much faster than the main DRAM memory.

    Caches depend on two kinds of address locality to achieve their goal. Spatial locality is the observation that memory locations with addresses numerically similar to a recently accessed memory location are likely to be accessed in the near future.

    Temporal locality occurs when recently accessed memory locations are accessed again. This may occur, for example, with memory locations near the top of the stack, or instructions inside a loop.

    All caches use the following model. Main memory is divided up into fixed-size blocks called cache lines. A cache line typically consists of 4 to 64 consecutive bytes. Lines are numbered consecutively starting at 0, so with a 32-byte line size, line 0 is bytes 0 to 31, line 1 is bytes 32 to 63, and so on.
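The line-numbering rule above is just integer division by the line size; a one-function sketch:

```python
LINE_SIZE = 32  # bytes per cache line, matching the example above

def line_number(address):
    # Which memory line a byte address falls in
    return address // LINE_SIZE

print(line_number(0))   # 0: bytes 0..31  are line 0
print(line_number(45))  # 1: bytes 32..63 are line 1
print(line_number(64))  # 2
```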


  • Cache Memory

    When memory is referenced, the cache controller circuit checks to see if the word referenced is currently in the cache. If so, the value there can be used, saving a trip to main memory. If the word is not there, some line entry is removed from the cache and the line needed is fetched from memory or a lower-level cache to replace it.

    Different forms for the cache:

    1. Direct-Mapped Caches
    2. Set-Associative Caches


  • Direct-Mapped Caches

    The next slide shows a direct-mapped cache that contains 2048 entries. Each entry (row) in the cache can hold exactly one cache line from main memory. With a 32-byte cache line size (for this example), the cache can hold 64 KB. Each cache entry consists of three parts:

    1. The Valid bit indicates whether there is any valid data in this entry. When the system is booted (started), all entries are marked as invalid.

    2. The Tag field consists of a unique, 16-bit value identifying the corresponding line of memory from which the data came.

    3. The Data field contains a copy of the data in memory. This field holds one cache line of 32 bytes.


  • Direct-Mapped Caches

    For storing and retrieving data from the cache, the address is broken into four components, as shown in Fig. (b):

    1. The TAG field corresponds to the Tag bits stored in a cache entry.
    2. The LINE field indicates which cache entry holds the corresponding data, if they are present.
    3. The WORD field tells which word within a line is referenced.
    4. The BYTE field is usually not used, but if only a single byte is requested, it tells which byte within the word is needed. For a cache supplying only 32-bit words, this field will always be 0.

    When the CPU produces a memory address, the hardware extracts the 11 LINE bits from the address and uses them to index into the cache to find one of the 2048 entries. If that entry is valid, the TAG field of the memory address and the Tag field in the cache entry are compared. If they agree, the cache entry holds the word being requested, a situation called a cache hit. On a hit, a word being read can be taken from the cache, eliminating the need to go to memory. Only the word actually needed is extracted from the cache entry; the rest of the entry is not used. If the cache entry is invalid or the tags do not match, the needed entry is not present in the cache, a situation called a cache miss. In this case, the 32-byte cache line is fetched from memory and stored in the cache entry, replacing what was there. However, if the existing cache entry has been modified since being loaded, it must be written back to main memory before being discarded.
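The lookup described above can be sketched as follows, using the field widths of this example (16-bit TAG, 11-bit LINE, 3-bit WORD, 2-bit BYTE); the toy word-addressed memory and the read-only treatment are assumptions of the sketch:

```python
def split_address(addr):
    # Break a 32-bit address into the four fields of Fig. (b)
    byte = addr & 0x3               # 2 BYTE bits
    word = (addr >> 2) & 0x7        # 3 WORD bits: 8 words per line
    line = (addr >> 5) & 0x7FF      # 11 LINE bits: 2048 cache entries
    tag  = (addr >> 16) & 0xFFFF    # 16 TAG bits
    return tag, line, word, byte

# Each entry: [valid, tag, data]; data is one 32-byte line = 8 words
cache = [[False, 0, [0] * 8] for _ in range(2048)]

def read(addr, memory):
    tag, line, word, _ = split_address(addr)
    valid, stored_tag, data = cache[line]
    if valid and stored_tag == tag:
        return data[word], "hit"          # cache hit: no memory access
    base = (addr >> 5) << 5               # start of the 32-byte line
    new_data = [memory[base + 4 * i] for i in range(8)]
    cache[line] = [True, tag, new_data]   # replace whatever was there
    return new_data[word], "miss"

memory = {a: a for a in range(0, 128, 4)}  # toy word-addressed memory
print(read(8, memory))    # (8, 'miss')  first access fetches the line
print(read(12, memory))   # (12, 'hit')  same line, already cached
```

The write-back of modified entries mentioned above is omitted from this read-only sketch.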


  • Direct-Mapped Caches

    (a) A direct-mapped cache. (b) A 32-bit virtual address.

  • Set-Associative Caches

    A four-way set-associative cache.
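A sketch of a four-way set lookup with LRU replacement; the set count and line size here are invented for illustration and may differ from the figure's cache:

```python
from collections import OrderedDict

NUM_SETS, WAYS, LINE_SIZE = 512, 4, 32   # illustrative sizes

# One OrderedDict per set, mapping tag -> line data, oldest first
sets = [OrderedDict() for _ in range(NUM_SETS)]

def access(addr):
    line_no = addr // LINE_SIZE
    s = sets[line_no % NUM_SETS]     # any of the 4 ways in this set may match
    tag = line_no // NUM_SETS
    if tag in s:
        s.move_to_end(tag)           # mark as most recently used
        return "hit"
    if len(s) == WAYS:               # set full: evict the LRU way
        s.popitem(last=False)
    s[tag] = None                    # fetch the line (data omitted)
    return "miss"

print(access(0))       # miss
print(access(4))       # hit: same 32-byte line
print(access(0x4000))  # miss: same set, different tag
```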

  • Branch Prediction

    (a) A program fragment. (b) Its translation to a generic assembly language.


  • Dynamic Branch Prediction (At run time)

    (a) A 1-bit branch history. (b) A 2-bit branch history. (c) A mapping between branch instruction address and target address.

    The CPU maintains a history table (in special hardware), in which it logs conditional branches as they occur, so they can be looked up when they occur again. The history table contains one entry for each conditional branch instruction. The entry contains the address of the branch instruction along with a bit telling whether it was taken the last time it was executed.


  • Dynamic Branch Prediction (2)

    A 2-bit finite-state machine for branch prediction.
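The four states can be modeled as a saturating counter; a sketch, assuming the usual 0..3 encoding where the upper two states predict "taken":

```python
def predict(state):
    # States 0-1 predict "not taken", states 2-3 predict "taken"
    return state >= 2

def update(state, taken):
    # Move one step toward the observed outcome, saturating at 0 and 3
    return min(state + 1, 3) if taken else max(state - 1, 0)

state = 3  # start in "strongly taken"
predictions = []
for outcome in [True, True, False, True]:
    predictions.append(predict(state))
    state = update(state, outcome)
print(predictions)  # [True, True, True, True]: one miss does not flip it
```

This shows why 2 bits beat 1: a single mispredict (e.g. a loop exit) only moves the counter one step, so the next prediction is still "taken".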

  • Out-of-Order Execution and Register Renaming (1)

    A superscalar CPU with in-order issue and in-order completion.

  • Out-of-Order Execution and Register Renaming (2)

    A superscalar CPU with in-order issue and in-order completion.

  • Out-of-Order Execution and Register Renaming (3)

    Operation of a superscalar CPU with out-of-order issue and out-of-order completion.

  • Speculative Execution

    a) A program fragment.
    b) The corresponding basic block graph.

    • Computer programs can be broken up into basic blocks, with each basic block consisting of a linear sequence of code with one entry point on top and one exit on the bottom.

    • Executing code before it is known if it is even going to be needed is called speculative execution.

  • Microarchitecture of Pentium 4 CPU

    1. NetBurst Microarchitecture

    2. Pentium 4 consists of four major subsections:

    • Memory subsystem: L2 cache, prefetch unit

    • Front end: fetches instructions from L2 and decodes them in program order; the trace cache is the L1 instruction cache

    • Out-of-order control: instructions can be issued out of order; the retirement unit has the task of retiring instructions, in order, and keeping track of where it is

    • Execution units: carry out the integer and floating-point operations


  • Overview of the NetBurst Microarchitecture

    The block diagram of the Pentium 4.

  • Front End

    a) Prefetches instructions that are likely to be executed
    b) Fetches instructions that haven’t been prefetched
    c) Decodes instructions into µops
    d) Generates µops for complex instructions or special-purpose code
    e) Predicts branches


  • Trace Cache

    a) Primary instruction cache in the NetBurst architecture
    b) Stores decoded µops
    c) ~12K capacity
    d) On a Trace Cache miss, instructions are fetched and decoded from the L2 cache

    Pentium 4 Trace Cache

    • Has its own branch predictor that directs where instruction fetching needs to go next in the Trace Cache.

    • Removes:
      – Decoding costs on frequently decoded instructions
      – Extra latency to decode instructions upon branch mispredictions

  • Branch Prediction

    a) Predicts ALL near branches
      – Includes conditional branches, unconditional calls and returns, and indirect branches

    b) Does not predict far transfers
      – Includes far calls, irets, and software interrupts


  • Branch Prediction

    a) Dynamically predicts the direction and target of branches
    b) If no dynamic prediction is available, statically predicts
      – Taken for backward looping branches
      – Not taken for forward branches
    c) Traces are built across predicted branches to avoid branch penalties

    Branch Target Buffer
    • Uses a branch history table and a branch target buffer to predict
    • Updating occurs when the branch is retired
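The static fallback rule above (backward taken, forward not taken) amounts to one comparison; the addresses in this sketch are made up:

```python
def static_predict(branch_addr, target_addr):
    # With no dynamic history: backward branches (loop bottoms) are
    # predicted taken, forward branches predicted not taken
    return target_addr < branch_addr

print(static_predict(0x100, 0x0F0))  # True:  backward, likely a loop
print(static_predict(0x100, 0x120))  # False: forward, predicted not taken
```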


  • Out-of-Order Execution

    a) Designed to optimize performance by handling the most common operations in the most common context as fast as possible
    b) 126 µops can be in flight at once
      – Up to 48 loads / 24 stores


  • Execution

    a) Can dispatch up to 6 µops per cycle
    b) Exceeds trace cache and retirement µop bandwidth
      – Allows for greater flexibility in issuing µops to different execution units
    c) µops will be executed on the proper execution engine by the processor
    d) The number of execution engines limits the amount of execution that can be performed
    e) Integer and floating-point units comprise this limiting factor


  • Retirement

    a) During this stage, results are written back to memory or to the actual IA-32 registers that were referred to before renaming took place.
    b) This unit retires all instructions in their original order, taking all branches into account.
    c) Three µops may be retired in one clock cycle.
    d) The processor detects and recovers from mispredictions in this stage.
    e) Also, a reorder buffer (ROB) is used:
      – Updates the architectural state
      – Manages the ordering of exceptions


  • Renaming Registers


  • Renaming Registers

    a) This stage renames logical registers to the physical register space
    b) In the NetBurst architecture there are 128 registers with unique names
    c) Basically, any references to the original IA-32 general-purpose registers are renamed to one of the internal physical registers.
    d) Also, it removes false register name dependencies between instructions, allowing the processor to execute more instructions in parallel.
    e) Parallel execution helps keep all resources busy
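A minimal sketch of the renaming step: every write allocates a fresh physical register via a register alias table. The data-structure names and the free-list discipline here are my own illustration, not Intel's exact mechanism:

```python
free_list = list(range(128))   # 128 physical registers, as stated above
rat = {}                       # register alias table: logical -> physical

def rename(dst, srcs):
    # Read the current mappings for the source operands, then give the
    # destination a brand-new physical register. Because each write gets
    # a fresh register, WAR/WAW name dependences disappear.
    phys_srcs = [rat[s] for s in srcs]
    rat[dst] = free_list.pop(0)
    return rat[dst], phys_srcs

rat["EAX"] = free_list.pop(0)  # initial mappings
rat["EBX"] = free_list.pop(0)
print(rename("EAX", ["EAX", "EBX"]))  # (2, [0, 1])
print(rename("EAX", ["EAX"]))         # (3, [2]): reads the previous write
```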


  • Schedulers

    a) Ensure µops execute in the correct sequence
    b) Disperse µops in the queue (or pool) to the proper execution units
    c) The scheduler looks to the pool for requests, and checks the functional units to see if the necessary resources are available.


  • The NetBurst Pipeline

    A simplified view of the Pentium 4 data path.


  • The Microarchitecture of the 8051 CPU

    The microarchitecture of the 8051.


  • The Microarchitecture of the 8051 CPU

    ACC: accumulator, where most computational results are stored
    B: used in multiplication and division, as well as being a scratch register for holding temporary results
    SP: stack pointer, points to the top of the stack
    IR: instruction register, holds the instruction currently being executed
    TMP1 and TMP2: latches for the ALU; to perform an ALU operation, the operands are first copied to these latches, then the ALU is started
    PSW: Program Status Word, used for the condition codes, which indicate if the result was zero, negative, etc.

    RAM ADDR: the data RAM is 128 bytes, needing 8 bits to address it all
    ROM ADDR: the code ROM is 64 KB, needing 16 bits to address it all
    DPTR: Double poinTeR, a 16-bit scratch register for managing and assembling 16-bit addresses
    PC: 16-bit program counter, points to the next instruction to be fetched


  • The Microarchitecture of the 8051 CPU

    1. The 8051 is a synchronous processor.
    2. Most instructions take one clock cycle, each divided into six parts (states).
    3. During the first state, the next instruction is fetched from the ROM, put on the main bus, and routed to the IR. During the second state, the instruction is decoded and the PC incremented. During the third state, the operands are prepared. During the fourth state, one of the operands is put on the main bus, usually for shipment to TMP1 where it can be latched for use as an ALU operand; the ACC can also be copied to TMP2. During the fifth state, the ALU executes. Finally, during the sixth state, the ALU output is written back to its destination on the main bus.