Advanced Microprocessor Course (EC311) Unit 2


  • Advanced Microprocessors

    UNIT -II

  • Hardware details of the Pentium

    CPU pin descriptions :

    Pentium 60 MHz and 66 MHz: 273-pin PGA (Pin Grid Array).

    Power supply: 5 V.

    Newer Pentiums: faster clock speed, 296-pin PGA, 3.3 V power supply.

  • Pentium Processor Pin details

  • 1. A20M (Address 20 mask) - input pin

    To force the Pentium to limit addressable memory to 1 MB.

    Only active in real mode.

    Undefined in protected mode.

    2. A3-A31 (Address lines) - bidirectional pins

    These 29 address lines, together with the byte enable outputs, form the

    Pentium's 32-bit address bus (4 GB memory space).

    3. BE0 - BE7 (Byte enable) output pin

    The byte enable pins are used to determine which bytes must be

    written to external memory, or which bytes were requested by the

    CPU for the current cycle.

    These signals are generated internally by the processor from

    address lines A0, A1 and A2.
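As a rough illustration of the relationship between the low address bits and the active-low byte enables (a sketch, not the Pentium's actual internal logic; the function name and interface are invented here):

```python
def byte_enables(addr, size):
    """Active-low byte enables BE7#-BE0# for a 64-bit data bus (sketch).

    A2-A0 of the address select the starting byte lane; `size` bytes
    are enabled from there.  A 0 bit in the result means that lane's
    BE# pin is asserted (the pins are active low).
    """
    lane = addr & 0b111                  # low three address bits
    mask = ((1 << size) - 1) << lane     # 1 bits mark the lanes in use
    return ~mask & 0xFF                  # invert for active-low pins

# A 2-byte access at offset 5 asserts BE5# and BE6#:
print(format(byte_enables(0x05, 2), "08b"))  # 10011111
```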

  • 4. ADS (Address data strobe) output pin

    The address status indicates that a new valid bus cycle is

    currently being driven by the Pentium processor

    5. AHOLD - ( Address hold) input pin

    It is used to place the Pentium's address bus into a high

    impedance state.

  • AP (Address parity) Bidirectional pin

    It is used to indicate the even parity of address lines A5 - A31.

    APCHK# - (Address parity check) output pin

    Detected a parity error on the address bus during inquire cycles

    External circuitry is responsible for taking the appropriate action if a

    parity error is encountered.

    APICEN - (Advanced Programmable Interrupt Controller Enable) -

    Input pin

    Enables or disables the on-chip APIC interrupt controller.

    BF[1:0] - (Bus Frequency) - Input pin

    Determines the bus-to-core frequency ratio. BF[1:0] are sampled at

    RESET.

  • BOFF# - (Back off) - input pin

    This input causes the processor to terminate any bus cycle

    currently in progress and tri state its buses.

    BOFF# has the highest priority of the bus-hold inputs.

    D63-D0 - (Data lines) Bidirectional pin

    Lines D7-D0 define the least significant byte of the data bus;

    lines D63-D56 define the most significant byte of the data bus

  • DP7-DP0 - (Data parity) - Bidirectional pin

    To indicate the even parity of each data byte on the data bus.

    DP7 applies to D63-56, DP0 applies to D7-0.

    HOLD - (Hold bus) - input pin

    Completes the current bus cycle and tri states its bus signals.

    Activate HLDA.

    HLDA - (hold acknowledge) output pin

    To indicate that the Pentium has been placed in a hold state.
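The even-parity rule used by the DP (and AP) pins can be sketched briefly: the parity bit is chosen so that the byte plus its parity bit contain an even number of 1s.

```python
def even_parity_bit(byte):
    """Parity bit that gives the byte an even total number of 1 bits."""
    ones = bin(byte & 0xFF).count("1")
    return ones & 1          # 1 when the byte has an odd count of 1s

print(even_parity_bit(0b10110000))  # 1 (three 1s; the parity bit makes four)
print(even_parity_bit(0b10110001))  # 0 (already an even count)
```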

  • Bus Operations

    Types of bus cycles:

    Single transfer cycle

    Burst transfer cycle

    Interrupt ack cycle

    Inquire cycle etc.

    Some of the signals are used to indicate the type of bus cycle.

    M/IO# - Memory / input output - output pin

    High - memory cycle; low - I/O operation.

    D/C# - Data / Code - output pin

    This output indicates that the current bus cycle is accessing code or

    data.

    High - data; low - code.

  • W/R# - Write / Read - output pin

    This output indicates that the current bus cycle is a read

    operation or a write operation.

    High - write operation; low - read operation.

    CACHE# - Cacheability - output pin

    This output indicates whether the data associated with the

    current bus cycle is being read from or written to the internal

    cache.

    All the burst reads are cacheable and all cacheable read cycles

    are bursted.

    KEN# - Cache enable - input pin

    The cache enable input is used to determine if the current cycle

    is cacheable.
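The basic cycle-type decode described by these pins can be condensed into a short sketch (special cycles such as interrupt acknowledge need additional pins and are omitted here):

```python
def decode_cycle(m_io, d_c, w_r):
    """Decode the basic bus-cycle type from M/IO#, D/C#, W/R# (1 = high)."""
    space = "memory" if m_io else "I/O"
    kind = "data" if d_c else "code"
    op = "write" if w_r else "read"
    return f"{space} {kind} {op}"

print(decode_cycle(1, 1, 0))  # memory data read
print(decode_cycle(0, 1, 1))  # I/O data write
```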

  • Bus State Definition

    Ti: This is the bus idle state.

    In this state, no bus cycles are being run.

    The processor may or may not be driving the address and status

    pins, depending on the state of the HLDA, AHOLD, and BOFF#

    inputs.

    An asserted BOFF# or RESET always forces the state machine

    back to this state.

    HLDA is only driven in this state.

    T1: This is the first clock of a bus cycle.

    Valid address and status are driven out and ADS# is asserted.

    There is one outstanding bus cycle.

    T2: This is the second and subsequent clock of the first outstanding bus

    cycle.

  • In state T2, data is driven out (if the cycle is a write)

    data is expected (if the cycle is a read)

    The BRDY# pin is sampled.

    There is one outstanding bus cycle.

    BRDY# - Burst ready - input pin

    Read cycle indicate data is available on the data bus

    Write cycle - informs the processor that the output data has

    been stored.

  • Single-Transfer Cycle

  • Burst Cycles

    Cache uses burst cycles.

    A new 8 byte chunk can be transferred every clock cycle.

    The processor supplies the starting address of the first group of 8 bytes at

    the beginning of the cycle.

    The next groups of 8 bytes are transferred according to the burst order.

    Burst transfer order:

  • The external memory system

    must generate the remaining 3

    addresses itself, and supply the

    data in the correct order.

    Address and BEs are asserted

    only in the first transfer and

    are not driven for each

    transfer.
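Intel's interleaved burst order (used by the 486 and Pentium) is equivalent to XOR-ing the starting 8-byte-chunk offset with 0, 8, 10H, 18H in turn; a small sketch, assuming chunk-aligned start offsets within a 32-byte line:

```python
def burst_order(start_offset):
    """Burst addresses for a 32-byte line filled in 8-byte chunks.

    The processor drives only the first address; the memory system
    derives the remaining three using the interleaved order.
    """
    return [start_offset ^ (i << 3) for i in range(4)]

print([hex(a) for a in burst_order(0x10)])  # ['0x10', '0x18', '0x0', '0x8']
```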

  • T12: This state indicates there are two outstanding bus cycles.

    The processor is starting the second bus cycle at the same time

    that data is being transferred for the first.

    In T12, the processor drives

    BRDY# - first cycle

    ADS# - second cycle

    T2P: This state indicates there are two outstanding bus cycles.

    both are in their second and subsequent clocks.

    Same job as T12

    TD : Dead state

    This state is used to insert a dead state between two consecutive

    cycles (read followed by write or vice versa) in order to give

    the system bus time to change states.

  • BREQ - (Bus request) - output pin

    The bus request output tells the external system that the

    Pentium has internally generated a bus request.

    This happens even if the Pentium is not driving its bus at the

    moment.

    NA# - (Next address) - input pin

    Indicates that the external memory system is ready to accept a

    new bus cycle although all data transfers for the current cycle

    have not yet completed.

    Issue ADS# for a pending cycle two clocks after NA# is asserted.

    Pentium supports up to 2 outstanding bus cycles.

  • Flow Functional description

    0 No Request Pending

    1 The processor starts a new bus cycle & ADS# is asserted in the T1 state.

    2 Second clock cycle of current bus cycle

    3 The processor stays in T2 until the transfer is over ( BRDY#) if no new

    request becomes pending or if NA# is not asserted.

    4 If there is a new request pending when the current cycle

    is complete, and if NA# was sampled asserted, the processor begins from T1.

    5 If no cycle is pending when the processor finishes the current cycle or NA# is

    not asserted, the processor goes back to the idle state.

    6 processing the current cycle (one outstanding cycle)

    If NA# is asserted, the processor moves to T12 indicating that the processor

    now has two outstanding cycles.

    ADS# is asserted for the second cycle.

  • 7 When the processor finishes the current cycle, and no dead clock is needed, it goes to

    the T2 state.

    8 When the processor finishes the current cycle and a dead clock is needed, it goes to the

    TD state.

    9 If the current cycle is not completed, the processor always moves to T2P to process the

    data transfer.

    10 The processor stays in T2P until the first cycle transfer is over.

    11 The processor finishes the first cycle and no dead clock is needed, it goes to T2 state

    12 When the first cycle is complete, and a dead clock is needed, it goes to TD state.

    13 If NA# was sampled, a new request is pending, it goes to T12 state.

    14 If NA# was not asserted, no new request is pending, it goes to T2 state.

  • Processor control Instructions

    Lock - s/w instruction - Lock bus during next instruction

    Executing lock - Lock# output goes low

    Lock is used as a prefix to another instruction.

    Lock# - h/w pin - (Bus Lock) - output pin

    To indicate that the current bus cycle is locked & may not be

    interrupted by another bus master.

    Locked operation :

    Atomic operation: an operation that cannot be broken down into smaller

    sub-operations.

    Semaphore - a special type of counter variable that must be read,

    updated and stored in one single uninterruptable operation.

    This requires a read cycle followed by a write cycle.

  • The XCHG instruction automatically locks the bus when one of its

    operands is a memory operand.

    If AHOLD or HOLD is activated in the middle of a locked operation,

    the locked operation is not affected.

    But it is affected when the BOFF# signal is asserted.

    Interrupt acknowledge cycle

    INTR Interrupt request - input pin

    When high, the Pentium initiates interrupt processing:

    it reads an 8-bit vector number to select the ISR.

    The processor runs two interrupt ack cycles in response to an

    INTR request.

    Both the cycles are locked.

  • First cycle - D0 - D7 is ignored by the processor

    Second cycle - D0 - D7 is accepted by the processor

    Byte enable outputs are used to distinguish the two cycles.

    BE4 low and all other BEs high - first cycle.

    BE0 low and all other BEs high - second cycle.

  • Shutdown :

    If the Pentium detects an internal parity error, it runs the shutdown

    cycle.

    Execution is suspended in shutdown until the processor

    receives an NMI, INIT, or RESET request.

    Cache is unchanged.

    RESET processor reset input pin

    Forces the Pentium processor to begin execution at a known state.

    Internal caches are invalidated upon RESET.

    Fetch its first instruction from address FFFFFFF0H.

    INIT - initialization - input pin

    Forces the Pentium processor to begin execution in a known state.

    The processor state after INIT is the same as the state after RESET

    except that the internal caches, write buffers, and floating point

    registers retain the values they had prior to INIT.

  • NMI - Non-maskable interrupt - input pin

    request signal indicates that an external non-maskable interrupt has

    been generated.

    No external int ack cycles are generated.

    HALT cycles

    When the HLT (halt) instruction is executed, a HALT cycle is run.

    INTR signal may also be used to resume the execution.

    WB/WT# - (writeback/writethrough) - input pin

    allows a data cache line to be defined as writeback (1) or writethrough

    (0) on a line-by-line basis.

    Writeback: writing results only to the cache.

    Writethrough: writing results to both the cache and main memory.

  • Cache is a small high-speed memory. Stores data from some frequently

    used addresses (of main memory).

    Cache hit: data found in cache. Results in data transfer at maximum speed.

    Cache miss: data not found in cache. The processor loads the data from memory

    and copies it into the cache. This results in extra delay, called the miss

    penalty.

    Hit ratio = percentage of memory accesses satisfied by the cache.

    Miss ratio = 1 - hit ratio

    Instruction and Data caches

  • Average memory access time =

    Hit ratio * Tcache + (1 - Hit ratio) * (Tcache + TRAM)

    RAM access time = 70 ns

    Cache access time = 10 ns

    Hit ratio =0.85

    Assume there is no external cache.

    Tavg = 0.85 * 10 + (1- 0.85) * (10 + 70)

    = 20.5 ns

    Cache Line : Cache is partitioned into lines (also called blocks). During

    data transfer, a whole line is read or written.

    Each line has a tag that indicates the address in Memory from which the line

    has been copied
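The worked example above can be reproduced directly from the formula:

```python
def avg_access_time(hit_ratio, t_cache, t_ram):
    """Average memory access time with a single cache level.

    A hit costs t_cache; a miss probes the cache and then reads
    main memory (t_cache + t_ram).
    """
    return hit_ratio * t_cache + (1 - hit_ratio) * (t_cache + t_ram)

print(avg_access_time(0.85, 10, 70))  # ~20.5 ns, matching the example
```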

  • Types of Cache

    1. Fully Associative

    2. Direct Mapped

    3. Set Associative

    Sequential Access :

    Start at the beginning and read through in order

    Access time depends on location of data and previous location

    Example: tape

    Direct Access :

    Individual blocks have unique address

    Access is by jumping to vicinity then performing a sequential search

    Access time depends on location of data within "block" and previous

    location

    Example: hard disk

  • Random access:

    Each location has a unique address

    Access time is independent of location or previous access

    e.g. RAM

    Associative access :

    Data is retrieved based on a portion of its contents rather than its

    address

    Access time is independent of location or previous access

    e.g. cache

  • Performance

    Transfer Rate : Rate at which data can be moved

    For random-access memory, equal to 1/(cycle time)

    For non-random-access memory, the following relationship holds:

    TN = TA + N/R

    where

    TN = Average time to read or write N bits

    TA = Average access time

    N = Number of bits

    R = Transfer rate, in bits per second(bps)
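The same relationship can be evaluated directly; the numbers below are illustrative, not from the text:

```python
def read_time(t_access, n_bits, rate_bps):
    """T_N = T_A + N/R: time to read N bits from non-random-access memory."""
    return t_access + n_bits / rate_bps

# e.g. 0.1 s average access time, 9600 bits at 9600 bps:
print(read_time(0.1, 9600, 9600))  # 1.1 s
```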

  • Fully Associative Cache

    Allows any line in main memory

    to be stored at any location in the

    cache.

    Main memory and cache are both

    divided into lines of equal size.

  • No restriction on mapping from Memory to Cache.

    It requires a large number of comparators to check all the addresses.

    Associative search of tags is expensive.

    Feasible for very small size caches only (less than 4 K).

    Some special-purpose caches, such as the virtual memory Translation

    Lookaside Buffer (TLB), are associative caches.

    Associative mapping works the best, but is complex to implement.

  • Direct-Mapped Cache

    One way set associative cache.

    Memory divided into cache pages

    Page size and cache size both are

    equal.

    Line 0 of any page - Line 0 of

    cache

    Directly maps the memory line into

    an equivalent cache line.

    Direct has the lowest performance,

    but is easiest to implement.

    Direct is often used for instruction

    cache.

    Less flexible

  • Set-Associative Cache

    Set associative is a compromise

    between the other two.

    The more ways, the better the

    performance, but the more complex

    and expensive the cache.

    Combination of fully associative and

    direct mapped caching schemes.

    Divide the cache in to equal sections

    called cache ways.

    Page size is equal to the size of the cache way.

    Each cache way is treated like a small direct mapped cache.

  • Design of cache organization

    Cache size : 4KB

    Line size : 32 bytes

    Physical address : 32 bit

    Fully Associative Cache

    32 bit physical address is divided

    into two fields.

    n = cache size / line size = number of lines

    b = log2(line size) = bits for offset

    remaining upper bits = tag address bits

  • Consider the fully associative mapping

    scheme with a 27-bit tag and 5-bit offset

    01111101011101110001101100111000

    Compare all tag fields for the value

    011111010111011100011011001.

    If a match is found, return byte 11000

    (24 decimal) of the line.

  • Direct Cache Addressing

    n = cache size / line size = number of lines

    b = log2(line size) = bits for offset

    log2(number of lines) = bits for cache index

    remaining upper bits = tag address bits

  • Direct mapping scheme with 20-bit tag, 7-bit

    index and 5-bit offset

    01111101011101110001101100111000

    Compare the tag field of line 1011001

    (89 decimal) for the value

    01111101011101110001.

    If it matches, return byte 11000 (24 decimal) of

    the line.

  • Set Associative Mapping

    n = cache size / line size = number of lines

    b = log2(line size) = bits for offset

    w = number of lines per set (ways)

    s = n / w = number of sets

    log2(number of sets) = bits for cache index

    remaining upper bits = tag address bits

  • Two-way set-associative mapping with 21-bit tag, 6-bit index and 5-bit

    offset

    01111101011101110001101100111000

    Compare the tag fields of lines 0110010 and 0110011 for the value

    011111010111011100011.

    If a match is found, return byte 11000 (24 decimal) of that line
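The three addressing schemes differ only in how many index bits they use; a sketch that reproduces the worked examples (the function name is invented for this illustration):

```python
def split_address(addr, cache_size, line_size, ways):
    """Split an address into (tag, set index, byte offset).

    ways=1 gives direct mapping; ways = number of lines gives a fully
    associative cache (one set, no index bits).  Sizes are powers of 2.
    """
    n_lines = cache_size // line_size
    n_sets = n_lines // ways
    b = line_size.bit_length() - 1       # offset bits
    s = n_sets.bit_length() - 1          # index bits
    offset = addr & (line_size - 1)
    index = (addr >> b) & (n_sets - 1)
    tag = addr >> (b + s)
    return tag, index, offset

addr = 0b01111101011101110001101100111000
# Direct mapped, 4 KB cache, 32-byte lines: 20-bit tag, 7-bit index, 5-bit offset
tag, index, offset = split_address(addr, 4096, 32, 1)
print(index, offset)   # 89 24, as in the direct-mapped example
```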

  • Instruction & Data Cache of Pentium

    Both caches are organized as

    2-way set associative caches

    Cache size : 8KB

    Line size : 32 bytes

    Physical address : 32 bits

    128 sets, total 256 entries

    Each entry in a set has its own

    tag

  • Data Cache of Pentium

    Tags in the data cache are triple ported

    They can be accessed from 3 different places at the same time

    U pipeline

    V pipeline

    Bus snooping

    Each entry in data cache can be configured for write through or write-back

    Parity bits are used to maintain data integrity

    Each tag and every byte in data cache has its own parity bit.

  • Instruction Cache of Pentium

    Instruction cache is write protected to prevent self-modifying code.

    Tags in instruction cache are also triple ported

    Two ports for split-line accesses

    Third port for bus snooping

    In Pentium (since CISC), instructions are of variable length(1-15bytes).

    Multibyte instructions may straddle two sequential lines stored in the code

    cache.

    The cache would then need two sequential accesses, which degrades performance.

    Solution: Split line Access

  • Split-line Access

    It permits upper half of one line and lower half of next to be fetched from

    code cache in one clock cycle.

    When split-line is read, the information is not correctly aligned.

    The bytes need to be rotated so that prefetch queue receives instruction in

    proper order.

    Instruction boundaries within the cache line need to be defined

    There is one parity bit for every 8 byte of data in instruction cache

  • Split-line Access (figure)

  • Multiprocessor System

    When multiple processors are used in a single system, there needs to be a

    mechanism whereby all processors agree on the contents of shared cache

    information.

    For example, two or more processors may utilize data from the same memory

    location, X.

    Each processor may change the value of X; which value of X should then be

    considered?

    If each processor changes the value of the data item, we have different

    (incoherent) values of X's data in each cache.

    Solution : Cache Coherency Mechanism

  • A multiprocessor system with incoherent cache data

  • Clean data: the data in the cache and the data in main memory

    are the same.

    Dirty data: the data has been modified in the cache but not in

    main memory.

    Stale data: the data has been modified in main memory but not

    in the cache.

    Out-of-date main memory data: the data has been modified in the cache

    but not in main memory; the copy in main memory is out of date.

  • Cache Coherency

    The Pentium's mechanism is called the MESI

    (Modified/Exclusive/Shared/Invalid) protocol.

    This protocol uses two bits stored with each line of data to keep track of the

    state of cache line.

    The four states are defined as follows:

    Modified:The current line has been modified (does not match with main memory)

    and is only available in a single cache.

    Exclusive:The current line has not been modified (matches with main memory)

    and is only available in a single cache.

    Writing to this line changes its state to modified

  • Shared:

    Copies of the current line may exist in more than one cache.

    A write to this line causes a write through to main memory and may

    invalidate the copies in the other caches.

    Invalid:

    The current line is empty.

    A read from this line will generate a miss.

    Only the shared and invalid states are used in code cache.

    MESI protocol requires Pentium to monitor all accesses to main

    memory in a multiprocessor system. This is called bus snooping.

    Bus Snooping: It is used to maintain consistent data in a

    multiprocessor system where each processor has a separate cache.

  • Consider the above example.

    If the Processor 3 writes its local copy of X(30) back to memory, the

    memory write cycle will be detected by the other 3 processors.

    Each processor will then run an internal inquire cycle to determine

    whether its data cache contains address of X.

    Processors 1 and 2 then update their caches based on their individual MESI

    states.

    The Pentium's address lines are used as inputs during an inquire cycle to

    accomplish bus snooping.

  • Coherence vs. consistency

    Cache coherence protocols guarantee that eventually all copies are updated.

    Depending on how and when these updates are performed, a read

    operation may sometimes return unexpected values.

    Consistency deals with what values can be returned to the user by a read

    operation (may return unexpected values if the update is not

    complete).

  • Cache Coherency Protocol Implementations

    Snooping

    used with low-end, bus-based MPs

    few processors

    centralized memory

    Directory-based

    used with higher-end MPs

    more processors

    distributed memory

  • When we write, should we write to cache or memory?

    Write through cache :write to both cache and main memory.

    Cache and memory are always consistent.

    Write back cache : write only to cache and set a dirty bit.

    When the block gets replaced from the cache,

    write it out to memory.

    Snoop : when a cache is watching the address lines for transaction, this is

    called a snoop.

    This function allows the cache to see if any transactions are

    accessing memory it contains within itself.

    Snarf: when a cache takes the information from the data lines, the cache is

    said to have snarfed the data.

    This function allows the cache to be updated and maintain consistency

  • Cache consistency cycles

    Inquire cycle

    EADS# - (External address strobe) - input pin

    This signal indicates that a valid external address has been driven

    onto the Pentium processor address pins to be used for an inquire

    cycle.

    HIT# - (inquire cycle hit / miss) - output pin

    The hit indication is driven to reflect the outcome of an inquire cycle.

    If an inquire cycle hits a valid line in either the data or instruction cache,

    this pin is asserted two clocks after EADS#.

    If the inquire cycle misses the cache, this pin is negated two clocks

    after EADS#.

    This pin changes its value only as a result of an inquire cycle and

    retains its value between the cycles.

  • HITM# - (hit / miss modified cache line) - output pin

    The hit to a modified line output is driven to reflect the outcome of

    an inquire cycle.

    It is asserted after inquire cycles which resulted in a hit to a modified

    line in the data cache.

    INV (invalidation) - input pin

    determines the final cache line state (S or I) in case of an inquire

    cycle hit.

    It is sampled together with the address for the inquire cycle in the

    clock EADS# is sampled active.

    High - the cache line is invalidated.

    Low - the cache line is marked shared.

    On a miss, INV has no effect.

    On a hit to a modified line, the line will be written back regardless of the state

    of INV.
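The pin behaviour described above can be condensed into a small sketch of the inquire-cycle outcome (MESI states abbreviated M/E/S/I; the function shape is invented for this illustration):

```python
def inquire_cycle(line_state, inv):
    """Outcome of an inquire cycle: (hit, hit_modified, new_state).

    On a miss INV has no effect; on a hit, INV high invalidates the
    line and INV low leaves it shared.  A hit to a modified line is
    written back regardless of INV (HITM# asserted).
    """
    if line_state == "I":
        return (False, False, "I")
    hit_modified = line_state == "M"     # line will be written back
    return (True, hit_modified, "I" if inv else "S")

print(inquire_cycle("M", inv=1))  # (True, True, 'I')
```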

  • LRU Algorithm

    One or more bits are added to the cache entry to support the LRU algorithm.

    One LRU bit and two valid bits for the two lines.

    If an invalid line (of the two) is found, it is replaced with the newly

    referenced data.

    If all the lines are valid, the LRU line is replaced by the new one.
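A minimal sketch of this replacement choice for a 2-way set (names invented for illustration):

```python
def choose_victim(valid0, valid1, lru_bit):
    """Line to replace in a 2-way set: an invalid line first, else the LRU line."""
    if not valid0:
        return 0
    if not valid1:
        return 1
    return lru_bit           # both valid: the least recently used line goes

print(choose_victim(True, False, 0))  # 1: the invalid line is replaced
print(choose_victim(True, True, 1))   # 1: the LRU bit decides
```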

  • Four way set associative - LRU algorithm

  • FLUSH# - (Flush cycle) - input pin

    cache flush input forces the Pentium processor to write back all

    modified lines in the data cache and invalidate its internal caches.

    A Flush Acknowledge special cycle will be generated by the Pentium

    processor indicating completion of the write back and invalidation.

    Byte enables indicate the type of bus cycle. BE4 is low and all other BEs are

    high.

    BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0

    1 1 1 0 1 1 1 1

    Cache instructions:

    INVD invalidate cache

    Effectively erases all the information in the data cache. (by marking

    it all invalid).

  • WBINVD - write back and invalidate cache

    A write-back special cycle is driven after the WBINVD instruction is

    executed.

    BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0

    1 1 1 1 0 1 1 1

    The INVD instruction should be used with care: it does not

    write back modified cache lines.

    A flush special cycle is driven after the INVD and WBINVD instructions are

    executed.

    BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0

    1 1 1 1 1 1 0 1

    For WBINVD, the write-back cycle is generated first, followed by the flush cycle.

  • Super scalar Architecture

    Processors capable of executing multiple instructions in parallel

    are known as superscalar machines.

    Parallel execution is possible through the U & V pipelines of the Pentium.

    Four restrictions are placed on a pair of integer instructions attempting parallel

    execution:

    1. Both must be simple instructions

    (Mov, Inc, Dec)

    2. No data dependencies may exist between them:

    a read-after-write dependency, or

    both instructions writing to the same operand.

  • 3. Neither instruction may contain both immediate data and a displacement

    value.

    MOV table[SI], 7

    4. Prefixed instruction may only execute in the U pipeline.

    MOV ES:[DI], AL

    For floating point instruction the first instruction of the pair must be one of

    the following :

    FADD, FSUB, FMUL, FDIV, FCOM

    Second instruction must be FXCH

    The compiler plays an important role in the ordering of instruction during

    code generation.

  • Pipeline and Instruction Flow

    (Figure: successive instruction pairs I1-I4 advancing one stage per clock

    through the five pipeline stages PF, D1, D2, EX, WB)

    5 stage pipeline

    PF : prefetch

    D1 : Instruction decode

    D2 : Address Generation

    EX : Execute -ALU and Cache Access

    WB : Write Back

  • U pipeline can execute any processor instruction (including the initial

    stages of the floating point instructions)

    V pipeline only executes simple instructions.

  • Instructions are fed into the PF stage from the cache or memory.

    D1 stage - determines whether the current pair of instructions can execute together.

    D2 stage - addresses for operands that reside in memory are calculated.

    EX stage - operands are read from the data cache or memory.

    ALU operations are performed.

    Branch predictions are verified (except for conditional

    branches).

    WB stage - the results of the completed instruction are written;

    conditional branch predictions are verified.

    When paired instructions reach the EX stage, it is possible that one or the

    other will stall and require additional cycles to execute.

  • Stall - no work is done.

    Pipeline stalls lower performance.

    If U stalls, V continues executing; if V stalls, U continues executing.

    Both instructions must progress to the WB stage before another pair may

    enter the EX stage.

  • Branch Prediction

    Branch Prediction Strategies :

    Static

    The actions for a branch are fixed for each branch during the entire

    execution. The actions are fixed at compile time.

    Decided before runtime

    Based on the object code

    Dynamic

    The decision causing the branch prediction can dynamically change

    during the program execution.

    Based on the execution history.

    Prediction decisions may change during the execution of the

    program

  • BHT: Branch History Table - 2-bit dynamic prediction.

    Each entry holds a 2-bit counter: states 11 and 10 predict "taken",

    states 01 and 00 predict "not taken" (sequential).

    (Figure: state transition diagram of the most frequently used 2-bit

    dynamic prediction (Smith algorithm). AT = actually taken, ANT =

    actually not taken. The four states are strongly taken (11), weakly

    taken (10), weakly not taken (01), and strongly not taken (00); each AT

    moves the counter toward 11, each ANT toward 00. The counter is

    initialised when a branch is taken for the first time.)
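The 2-bit Smith counter can be sketched directly; the initial state is assumed here to be "strongly taken":

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 11/10 predict taken, 01/00 not taken."""

    def __init__(self, state=0b11):       # assumed initial state
        self.state = state

    def predict(self):
        return self.state >= 0b10         # taken in states 10 and 11

    def update(self, actually_taken):
        if actually_taken:
            self.state = min(self.state + 1, 0b11)   # saturate at 11
        else:
            self.state = max(self.state - 1, 0b00)   # saturate at 00

p = TwoBitPredictor()
p.update(False)                # 11 -> 10: still predicts taken
print(p.predict())             # True
p.update(False)                # 10 -> 01: prediction flips
print(p.predict())             # False
```

Note how a single mispredicted branch does not flip a "strong" state; two in a row are needed, which is what makes the scheme robust to loop exits.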

  • The prediction will be either taken or not taken.

    If the prediction turns out to be true, the pipeline will not be flushed, and no clock cycles will

    be lost.

    If the prediction turns out to be false, the pipeline is flushed and started over with the correct

    instruction.

    It is best if the predictions are true most of the time.

  • Branch target buffer : four way set associative cache

    256 entries, 64 sets

    Whenever a branch is taken, the CPU enters the destination (target)

    address in the BTB.

    The BTB stores two history bits that indicate the execution history of the

    branch instruction.

    Two 32-byte prefetch buffers work with the BTB and the D1 stage of the U &

    V pipelines to keep a steady stream of instructions flowing into the

    pipelines.

    One buffer prefetches instructions from the current program address.

    The other buffer, activated when the BTB predicts "taken", prefetches

    instructions from the target address.

  • Functional Block Diagram of Pentium (figure)

  • Floating point unit

    Coprocessor family:

    8086  - 8087

    80286 - 80287

    80386 - 80387

    80486 - internal FPU (not pipelined)

    Pentium - internal FPU (pipelined)

  • Floating-point format (IEEE 754): sign bit, exponent, mantissa.

  • The floating-point instructions are those that are executed by the

    processor's floating-point unit (FPU).

    These instructions operate on floating-point (real), extended integer, and

    binary-coded decimal (BCD) operands.

    The term floating point is derived from the fact that there is no fixed

    number of digits before and after the decimal point; that is, the decimal

    point can float.

    There are also representations in which the number of digits before and

    after the decimal point is set, called fixed-point representations.
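The sign/exponent/mantissa split can be inspected in Python for the 64-bit IEEE 754 double format (the Pentium FPU also uses an 80-bit extended format internally, not shown here):

```python
import struct

def decompose(x):
    """IEEE 754 double: return (sign, biased exponent, 52-bit fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent (bias 1023)
    fraction = bits & ((1 << 52) - 1)     # mantissa without the hidden 1
    return sign, exponent, fraction

print(decompose(-1.0))   # (1, 1023, 0): negative, unbiased exponent 0, hidden 1
```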

  • PF - prefetch

    D1 - instruction decode

    D2 - address generation

    EX - memory and register read;

    FP data converted into memory format;

    memory write

    X1 - FP execute stage one;

    memory data converted into FP format;

    write operand to FP register file;

    Bypass 1: send data back to EX stage

    X2 - FP execute stage two

    WF - round FP result and write to FP register file;

    Bypass 2: send data back to EX stage

    ER - error reporting, update status word

    Bypass 1:

    FLD ST

    FMUL ST

    Bypass 2:

    The result of an arithmetic instruction in the WF stage is made available

    to the next instruction fetching operands in the EX stage.

    FADD, FSUB, FMUL, FDIV, FCOM

    The second instruction must be FXCH.

    First instruction - U pipeline (makes up the first five stages of the FPU pipeline)

    Second instruction - V pipeline

  • Eight 80-bit floating-point registers,

    ST(0) through ST(7)