Advanced Microprocessor Course (EC311) Unit 2


  • Advanced Microprocessors

    UNIT -II

  • Hardware details of the Pentium

    CPU pin descriptions :

    Pentium 60 MHz and 66 MHz: 273-pin PGA (Pin Grid Array).

    Power supply: 5 V.

    Newer Pentiums: faster clock speed, 296-pin PGA, 3.3 V power supply.

  • Pentium Processor Pin details

  • 1. A20M (Address 20 mask) - input pin

    To force the Pentium to limit addressable memory to 1 MB.

    Only active in real mode.

    Undefined in protected mode.

    2. A3-A31 (Address lines) - bidirectional pins

    These 29 address lines, together with the byte enable outputs, form the

    Pentium's 32-bit address bus (4 GB memory space).

    3. BE0 - BE7 (Byte enable) output pin

    The byte enable pins are used to determine which bytes must be

    written to external memory, or which bytes were requested by the

    CPU for the current cycle.

    These signals are generated internally by the processor from

    address lines A0, A1 and A2.
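As a rough illustration of the relationship between the low address bits and the active-low byte enables (a sketch, not the Pentium's actual internal logic; the function name and interface are invented here):

```python
def byte_enables(addr, size):
    """Active-low byte enables BE7#-BE0# for a 64-bit data bus (sketch).

    A2-A0 of the address select the starting byte lane; `size` bytes
    are enabled from there.  A 0 bit in the result means that lane's
    BE# pin is asserted (the pins are active low).
    """
    lane = addr & 0b111                  # low three address bits
    mask = ((1 << size) - 1) << lane     # 1 bits mark the lanes in use
    return ~mask & 0xFF                  # invert for active-low pins

# A 2-byte access at offset 5 asserts BE5# and BE6#:
print(format(byte_enables(0x05, 2), "08b"))  # 10011111
```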

  • 4. ADS (Address data strobe) output pin

    The address status indicates that a new valid bus cycle is

    currently being driven by the Pentium processor

    5. AHOLD - ( Address hold) input pin

    It is used to place the Pentium's address bus into a high

    impedance state.

  • AP (Address parity) Bidirectional pin

    It is used to indicate the even parity of address lines A5 - A31.

    APCHK# - (Address parity check) output pin

    Detected a parity error on the address bus during inquire cycles

    External circuitry is responsible for taking the appropriate action if a

    parity error is encountered.

    APICEN - (Advanced Programmable Interrupt Controller Enable) -

    Input pin

    Enables or disables the on-chip APIC interrupt controller.

    BF[1:0] - (Bus Frequency) - Input pin

    Determines the bus-to-core frequency ratio. BF[1:0] are sampled at

    RESET.

  • BOFF# - (Back off) - input pin

    This input causes the processor to terminate any bus cycle

    currently in progress and tri state its buses.

    BOFF# has the highest priority of the bus-hold inputs.

    D63-D0 - (Data lines) Bidirectional pin

    Lines D7-D0 define the least significant byte of the data bus;

    lines D63-D56 define the most significant byte of the data bus

  • DP7-DP0 - (Data parity) - Bidirectional pin

    To indicate the even parity of each data byte on the data bus.

    DP7 applies to D63-56, DP0 applies to D7-0.

    HOLD - (Hold bus) - input pin

    Completes the current bus cycle and tri states its bus signals.

    Activate HLDA.

    HLDA - (hold acknowledge) output pin

    To indicate that the Pentium has been placed in a hold state.
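The even-parity rule used by the DP (and AP) pins can be sketched briefly: the parity bit is chosen so that the byte plus its parity bit contain an even number of 1s.

```python
def even_parity_bit(byte):
    """Parity bit that gives the byte an even total number of 1 bits."""
    ones = bin(byte & 0xFF).count("1")
    return ones & 1          # 1 when the byte has an odd count of 1s

print(even_parity_bit(0b10110000))  # 1 (three 1s; the parity bit makes four)
print(even_parity_bit(0b10110001))  # 0 (already an even count)
```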

  • Bus Operations

    Types of bus cycles:

    Single transfer cycle

    Burst transfer cycle

    Interrupt ack cycle

    Inquire cycle etc.

    Some of the signals are used to indicate the type of bus cycle.

    M/IO# - Memory / input output - output pin

    High - memory cycle; low - I/O operation.

    D/C# - Data / Code - output pin

    This output indicates that the current bus cycle is accessing code or

    data.

    High - data; low - code.

  • W/R# - Write / Read - output pin

    This output indicates that the current bus cycle is a read

    operation or a write operation.

    High - write operation; low - read operation.

    CACHE# - Cacheability - output pin

    This output indicates whether the data associated with the

    current bus cycle is being read from or written to the internal

    cache.

    All the burst reads are cacheable and all cacheable read cycles

    are bursted.

    KEN# - Cache enable - input pin

    The cache enable input is used to determine if the current cycle

    is cacheable.
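The basic cycle-type decode described by these pins can be condensed into a short sketch (special cycles such as interrupt acknowledge need additional pins and are omitted here):

```python
def decode_cycle(m_io, d_c, w_r):
    """Decode the basic bus-cycle type from M/IO#, D/C#, W/R# (1 = high)."""
    space = "memory" if m_io else "I/O"
    kind = "data" if d_c else "code"
    op = "write" if w_r else "read"
    return f"{space} {kind} {op}"

print(decode_cycle(1, 1, 0))  # memory data read
print(decode_cycle(0, 1, 1))  # I/O data write
```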

  • Bus State Definition

    Ti: This is the bus idle state.

    In this state, no bus cycles are being run.

    The processor may or may not be driving the address and status

    pins, depending on the state of the HLDA, AHOLD, and BOFF#

    inputs.

    An asserted BOFF# or RESET always forces the state machine

    back to this state.

    HLDA is only driven in this state.

    T1: This is the first clock of a bus cycle.

    Valid address and status are driven out and ADS# is asserted.

    There is one outstanding bus cycle.

    T2: This is the second and subsequent clock of the first outstanding bus

    cycle.

  • In state T2, data is driven out (if the cycle is a write)

    data is expected (if the cycle is a read)

    The BRDY# pin is sampled.

    There is one outstanding bus cycle.

    BRDY# - Burst ready - input pin

    Read cycle indicate data is available on the data bus

    Write cycle - informs the processor that the output data has

    been stored.

  • Single-Transfer Cycle

  • Burst Cycles

    Cache uses burst cycles.

    A new 8 byte chunk can be transferred every clock cycle.

    The processor supplies the starting address of the first group of 8 bytes at

    the beginning of the cycle.

    The next groups of 8 bytes are transferred according to the burst order.

    Burst transfer order:

  • The external memory system

    must generate the remaining 3

    addresses itself, and supply the

    data in the correct order.

    Address and BEs are asserted

    only in the first transfer and

    are not driven for each

    transfer.
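Intel's interleaved burst order (used by the 486 and Pentium) is equivalent to XOR-ing the starting 8-byte-chunk offset with 0, 8, 10H, 18H in turn; a small sketch, assuming chunk-aligned start offsets within a 32-byte line:

```python
def burst_order(start_offset):
    """Burst addresses for a 32-byte line filled in 8-byte chunks.

    The processor drives only the first address; the memory system
    derives the remaining three using the interleaved order.
    """
    return [start_offset ^ (i << 3) for i in range(4)]

print([hex(a) for a in burst_order(0x10)])  # ['0x10', '0x18', '0x0', '0x8']
```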

  • T12: This state indicates there are two outstanding bus cycles.

    The processor is starting the second bus cycle at the same time

    that data is being transferred for the first.

    In T12, the processor drives

    BRDY# - first cycle

    ADS# - second cycle

    T2P: This state indicates there are two outstanding bus cycles.

    both are in their second and subsequent clocks.

    Same job as T12

    TD : Dead state

    This state is used to insert a dead state between two consecutive

    cycles (read followed by write or vice versa) in order to give

    the system bus time to change states.

  • BREQ - (Bus request) - output pin

    The bus request output tells the external system that the

    Pentium has internally generated a bus request.

    This happens even if the Pentium is not driving its bus at the

    moment.

    NA# - (Next address) - input pin

    Indicates that the external memory system is ready to accept a

    new bus cycle although all data transfers for the current cycle

    have not yet completed.

    Issue ADS# for a pending cycle two clocks after NA# is asserted.

    Pentium supports up to 2 outstanding bus cycles.

  • Flow Functional description

    0 No Request Pending

    1 The processor starts a new bus cycle & ADS# is asserted in the T1 state.

    2 Second clock cycle of current bus cycle

    3 The processor stays in T2 until the transfer is over ( BRDY#) if no new

    request becomes pending or if NA# is not asserted.

    4 If there is a new request pending when the current cycle

    is complete, and if NA# was sampled asserted, the processor begins from T1.

    5 If no cycle is pending when the processor finishes the current cycle or NA# is

    not asserted, the processor goes back to the idle state.

    6 processing the current cycle (one outstanding cycle)

    If NA# is asserted, the processor moves to T12 indicating that the processor

    now has two outstanding cycles.

    ADS# is asserted for the second cycle.

  • 7 When the processor finishes the current cycle, and no dead clock is needed, it goes to

    the T2 state.

    8 When the processor finishes the current cycle and a dead clock is needed, it goes to the

    TD state.

    9 If the current cycle is not completed, the processor always moves to T2P to process the

    data transfer.

    10 The processor stays in T2P until the first cycle transfer is over.

    11 The processor finishes the first cycle and no dead clock is needed, it goes to T2 state

    12 When the first cycle is complete, and a dead clock is needed, it goes to TD state.

    13 If NA# was sampled, a new request is pending, it goes to T12 state.

    14 If NA# was not asserted, no new request is pending, it goes to T2 state.

  • Processor control Instructions

    Lock - s/w instruction - Lock bus during next instruction

    Executing lock - Lock# output goes low

    Lock is used as a prefix to another instruction.

    Lock# - h/w pin - (Bus Lock) - output pin

    To indicate that the current bus cycle is locked & may not be

    interrupted by another bus master.

    Locked operation :

    Atomic operation: an operation that cannot be broken down into smaller

    sub-operations.

    Semaphore - a special type of counter variable that must be read,

    updated and stored in one single uninterruptable operation.

    This requires a read cycle followed by a write cycle.

  • The XCHG instruction automatically locks the bus when one of its

    operands is a memory operand.

    If AHOLD or HOLD is activated in the middle of a locked operation,

    the locked operation is not affected.

    But it is affected when the BOFF# signal is asserted.

    Interrupt acknowledge cycle

    INTR Interrupt request - input pin

    When high, the Pentium initiates interrupt processing:

    it reads an 8-bit vector number to select the ISR.

    The processor runs two interrupt ack cycles in response to an

    INTR request.

    Both the cycles are locked.

  • First cycle - D0 - D7 is ignored by the processor

    Second cycle - D0 - D7 is accepted by the processor

    Byte enable outputs are used to distinguish the two cycles.

    BE4 low and all other BEs high - first cycle.

    BE0 low and all other BEs high - second cycle.

  • Shutdown :

    If the Pentium detects an internal parity error, it runs the shutdown

    cycle.

    Execution is suspended in shutdown until the processor

    receives an NMI, INIT, or RESET request.

    Cache is unchanged.

    RESET processor reset input pin

    Forces the Pentium processor to begin execution at a known state.

    Internal caches are invalidated upon RESET.

    Fetch its first instruction from address FFFFFFF0H.

    INIT - initialization - input pin

    Forces the Pentium processor to begin execution in a known state.

    The processor state after INIT is the same as the state after RESET

    except that the internal caches, write buffers, and floating point

    registers retain the values they had prior to INIT.

  • NMI - Non-maskable interrupt - input pin

    request signal indicates that an external non-maskable interrupt has

    been generated.

    No external int ack cycles are generated.

    HALT cycles

    When the HLT (halt) instruction is executed, a HALT cycle is run.

    INTR signal may also be used to resume the execution.

    WB/WT# - (writeback/writethrough) - input pin

    allows a data cache line to be defined as writeback (1) or writethrough

    (0) on a line-by-line basis.

    Writeback: writing results only to the cache.

    Writethrough: writing results to both the cache and main memory.

  • Cache is a small high-speed memory. Stores data from some frequently

    used addresses (of main memory).

    Cache hit: data found in cache. Results in data transfer at maximum speed.

    Cache miss: data not found in cache. The processor loads the data from memory

    and copies it into the cache. This results in extra delay, called the miss

    penalty.

    Hit ratio = percentage of memory accesses satisfied by the cache.

    Miss ratio = 1 - hit ratio

    Instruction and Data caches

  • Average memory access time =

    Hit ratio * Tcache + (1 - Hit ratio) * (Tcache + TRAM)

    RAM access time = 70 ns

    Cache access time = 10 ns

    Hit ratio =0.85

    Assume there is no external cache.

    Tavg = 0.85 * 10 + (1- 0.85) * (10 + 70)

    = 20.5 ns

    Cache Line : Cache is partitioned into lines (also called blocks). During

    data transfer, a whole line is read or written.

    Each line has a tag that indicates the address in Memory from which the line

    has been copied
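The worked example above can be reproduced directly from the formula:

```python
def avg_access_time(hit_ratio, t_cache, t_ram):
    """Average memory access time with a single cache level.

    A hit costs t_cache; a miss probes the cache and then reads
    main memory (t_cache + t_ram).
    """
    return hit_ratio * t_cache + (1 - hit_ratio) * (t_cache + t_ram)

print(avg_access_time(0.85, 10, 70))  # ~20.5 ns, matching the example
```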

  • Types of Cache

    1. Fully Associative

    2. Direct Mapped

    3. Set Associative

    Sequential Access :

    Start at the beginning and read through in order

    Access time depends on location of data and previous location

    Example: tape

    Direct Access :

    Individual blocks have unique address

    Access is by jumping to vicinity then performing a sequential search

    Access time depends on location of data within "block" and previous

    location

    Example: hard disk

  • Random access:

    Each location has a unique address

    Access time is independent of location or previous access

    e.g. RAM

    Associative access :

    Data is retrieved based on a portion of its contents rather than its

    address

    Access time is independent of location or previous access

    e.g. cache

  • Performance

    Transfer Rate : Rate at which data can be moved

    For random-access memory, equal to 1/(cycle time)

    For non-random-access memory, the following relationship holds:

    TN = TA + N/R

    where

    TN = Average time to read or write N bits

    TA = Average access time

    N = Number of bits

    R = Transfer rate, in bits per second(bps)
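The same relationship can be evaluated directly; the numbers below are illustrative, not from the text:

```python
def read_time(t_access, n_bits, rate_bps):
    """T_N = T_A + N/R: time to read N bits from non-random-access memory."""
    return t_access + n_bits / rate_bps

# e.g. 0.1 s average access time, 9600 bits at 9600 bps:
print(read_time(0.1, 9600, 9600))  # 1.1 s
```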

  • Fully Associative Cache

    Allows any line in main memory

    to be stored at any location in the

    cache.

    Main memory and cache are both

    divided into lines of equal size.

  • No restriction on mapping from Memory to Cache.

    It requires a large number of comparators to check all the addresses.

    Associative search of tags is expensive.

    Feasible for very small size caches only (less than 4 K).

    Some special-purpose caches, such as the virtual memory Translation

    Lookaside Buffer (TLB), are associative caches.

    Associative mapping works the best, but is complex to implement.

  • Direct-Mapped Cache

    One way set associative cache.

    Memory divided into cache pages

    Page size and cache size both are

    equal.

    Line 0 of any page - Line 0 of

    cache

    Directly maps the memory line into

    an equivalent cache line.

    Direct has the lowest performance,

    but is easiest to implement.

    Direct is often used for instruction

    cache.

    Less flexible

  • Set-Associative Cache

    Set associative is a compromise

    between the other two.

    The more ways, the better the

    performance, but the more complex

    and expensive the cache.

    Combination of fully associative and

    direct mapped caching schemes.

    Divide the cache in to equal sections

    called cache ways.

    Page size is equal to the size of the cache way.

    Each cache way is treated like a small direct mapped cache.

  • Design of cache organization

    Cache size : 4KB

    Line size : 32 bytes

    Physical address : 32 bit

    Fully Associative Cache

    32 bit physical address is divided

    into two fields.

    n = cache size / line size = number of lines

    b = log2(line size) = bits for offset

    remaining upper bits = tag address bits

  • Consider the fully associative mapping

    scheme with a 27-bit tag and 5-bit offset

    01111101011101110001101100111000

    Compare all tag fields for the value

    011111010111011100011011001.

    If a match is found, return byte 11000

    (24 decimal) of the line.

  • Direct Cache Addressing

    n = cache size / line size = number of lines

    b = log2(line size) = bits for offset

    log2(number of lines) = bits for cache index

    remaining upper bits = tag address bits

  • Direct mapping scheme with 20-bit tag, 7-bit

    index and 5-bit offset

    01111101011101110001101100111000

    Compare the tag field of line 1011001

    (89 decimal) for the value

    01111101011101110001.

    If it matches, return byte 11000 (24 decimal) of

    the line.

  • Set Associative Mapping

    n = cache size / line size = number of lines

    b = log2(line size) = bits for offset

    w = number of lines per set (ways)

    s = n / w = number of sets

    log2(number of sets) = bits for cache index

    remaining upper bits = tag address bits

  • Two-way set-associative mapping with 21-bit tag, 6-bit index and 5-bit

    offset

    01111101011101110001101100111000

    Compare the tag fields of lines 0110010 and 0110011 for the value

    011111010111011100011.

    If a match is found, return byte 11000 (24 decimal) of that line
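The three addressing schemes differ only in how many index bits they use; a sketch that reproduces the worked examples (the function name is invented for this illustration):

```python
def split_address(addr, cache_size, line_size, ways):
    """Split an address into (tag, set index, byte offset).

    ways=1 gives direct mapping; ways = number of lines gives a fully
    associative cache (one set, no index bits).  Sizes are powers of 2.
    """
    n_lines = cache_size // line_size
    n_sets = n_lines // ways
    b = line_size.bit_length() - 1       # offset bits
    s = n_sets.bit_length() - 1          # index bits
    offset = addr & (line_size - 1)
    index = (addr >> b) & (n_sets - 1)
    tag = addr >> (b + s)
    return tag, index, offset

addr = 0b01111101011101110001101100111000
# Direct mapped, 4 KB cache, 32-byte lines: 20-bit tag, 7-bit index, 5-bit offset
tag, index, offset = split_address(addr, 4096, 32, 1)
print(index, offset)   # 89 24, as in the direct-mapped example
```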

  • Instruction & Data Cache of Pentium

    Both caches are organized as

    2-way set associative caches

    Cache size : 8KB

    Line size : 32 bytes

    Physical address : 32 bits

    128 sets, total 256 entries

    Each entry in a set has its own

    tag

  • Data Cache of Pentium

    Tags in the data cache are triple ported

    They can be accessed from 3 different places at the same time

    U pipeline

    V pipeline

    Bus snooping

    Each entry in data cache can be configured for write through or write-back

    Parity bits are used to maintain data integrity

    Each tag and every byte in data cache has its own parity bit.

  • Instruction Cache of Pentium

    Instruction cache is write protected to prevent self-modifying code.

    Tags in instruction cache are also triple ported

    Two ports for split-line accesses

    Third port for bus snooping

    In Pentium (since CISC), instructions are of variable length(1-15bytes).

    Multibyte instructions may straddle two sequential lines stored in the code

    cache.

    The cache would then need two sequential accesses, which degrades performance.

    Solution: Split line Access

  • Split-line Access

    It permits upper half of one line and lower half of next to be fetched from

    code cache in one clock cycle.

    When split-line is read, the information is not correctly aligned.

    The bytes need to be rotated so that prefetch queue receives instruction in

    proper order.

    Instruction boundaries within the cache line need to be defined

    There is one parity bit for every 8 byte of data in instruction cache

  • Split-line Access (figure)

  • Multiprocessor System

    When multiple processors are used in a single system, there needs to be a

    mechanism whereby all processors agree on the contents of shared cache

    information.

    For example, two or more processors may utilize data from the same memory

    location, X.

    Each processor may change the value of X; which value of X should then be

    considered?

    If each processor changes the value of the data item, we have different

    (incoherent) values of X's data in each cache.

    Solution : Cache Coherency Mechanism

  • A multiprocessor system with incoherent cache data

  • Clean data: the data in the cache and the data in main memory

    are the same.

    Dirty data: the data has been modified in the cache but not in

    main memory.

    Stale data: the data has been modified in main memory but not

    in the cache.

    Out-of-date main memory data: the data has been modified in the cache

    but not in main memory; the copy in main memory is out of date.

  • Cache Coherency

    The Pentium's mechanism is called the MESI

    (Modified/Exclusive/Shared/Invalid) protocol.

    This protocol uses two bits stored with each line of data to keep track of the

    state of cache line.

    The four states are defined as follows:

    Modified:The current line has been modified (does not match with main memory)

    and is only available in a single cache.

    Exclusive:The current line has not been modified (matches with main memory)

    and is only available in a single cache.

    Writing to this line changes its state to modified

  • Shared:

    Copies of the current line may exist in more than one cache.

    A write to this line causes a write through to main memory and may

    invalidate the copies in the other caches.

    Invalid:

    The current line is empty.

    A read from this line will generate a miss.

    Only the shared and invalid states are used in code cache.

    MESI protocol requires Pentium to monitor all accesses to main

    memory in a multiprocessor system. This is called bus snooping.

    Bus Snooping: It is used to maintain consistent data in a

    multiprocessor system where each processor has a separate cache.

  • Consider the above example.

    If the Processor 3 writes its local copy of X(30) back to memory, the

    memory write cycle will be detected by the other 3 processors.

    Each processor will then run an internal inquire cycle to determine

    whether its data cache contains address of X.

    Processors 1 and 2 then update their caches based on their individual MESI

    states.

    The Pentium's address lines are used as inputs during an inquire cycle to

    accomplish bus snooping.

  • Coherence vs. consistency

    Cache coherence protocols guarantee that eventually all copies are updated.

    Depending on how and when these updates are performed, a read

    operation may sometimes return unexpected values.

    Consistency deals with what values can be returned to the user by a read

    operation (may return unexpected values if the update is not

    complete).

  • Cache Coherency Protocol Implementations

    Snooping

    used with low-end, bus-based MPs

    few processors

    centralized memory

    Directory-based

    used with higher-end MPs

    more processors

    distributed memory

  • When we write, should we write to cache or memory?

    Write through cache :write to both cache and main memory.

    Cache and memory are always consistent.

    Write back cache : write only to cache and set a dirty bit.

    When the block gets replaced from the cache,

    write it out to memory.

    Snoop : when a cache is watching the address lines for transaction, this is

    called a snoop.

    This function allows the cache to see if any transactions are

    accessing memory it contains within itself.

    Snarf: when a cache takes the information from the data lines, the cache is

    said to have snarfed the data.

    This function allows the cache to be updated and maintain consistency

  • Cache consistency cycles

    Inquire cycle

    EADS# - (External address strobe) - input pin

    This signal indicates that a valid external address has been driven

    onto the Pentium processor address pins to be used for an inquire

    cycle.

    HIT# - (inquire cycle hit / miss) - output pin

    The hit indication is driven to reflect the outcome of an inquire cycle.

    If an inquire cycle hits a valid line in either the data or instruction cache,

    this pin is asserted two clocks after EADS#.

    If the inquire cycle misses the cache, this pin is negated two clocks

    after EADS#.

    This pin changes its value only as a result of an inquire cycle and

    retains its value between the cycles.

  • HITM# - (hit / miss modified cache line) - output pin

    The hit to a modified line output is driven to reflect the outcome of

    an inquire cycle.

    It is asserted after inquire cycles which resulted in a hit to a modified

    line in the data cache.

    INV (invalidation) - input pin

    determines the final cache line state (S or I) in case of an inquire

    cycle hit.

    It is sampled together with the address for the inquire cycle in the

    clock EADS# is sampled active.

    High - the cache line is invalidated.

    Low - the cache line is marked shared.

    On a miss, INV has no effect.

    On a hit to a modified line, the line will be written back regardless of the state

    of INV.
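The pin behaviour described above can be condensed into a small sketch of the inquire-cycle outcome (MESI states abbreviated M/E/S/I; the function shape is invented for this illustration):

```python
def inquire_cycle(line_state, inv):
    """Outcome of an inquire cycle: (hit, hit_modified, new_state).

    On a miss INV has no effect; on a hit, INV high invalidates the
    line and INV low leaves it shared.  A hit to a modified line is
    written back regardless of INV (HITM# asserted).
    """
    if line_state == "I":
        return (False, False, "I")
    hit_modified = line_state == "M"     # line will be written back
    return (True, hit_modified, "I" if inv else "S")

print(inquire_cycle("M", inv=1))  # (True, True, 'I')
```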

  • LRU Algorithm

    One or more bits are added to the cache entry to support the LRU algorithm.

    One LRU bit and two valid bits for the two lines.

    If an invalid line (of the two) is found, it is replaced with the newly

    referenced data.

    If all the lines are valid, the LRU line is replaced by the new one.
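A minimal sketch of this replacement choice for a 2-way set (names invented for illustration):

```python
def choose_victim(valid0, valid1, lru_bit):
    """Line to replace in a 2-way set: an invalid line first, else the LRU line."""
    if not valid0:
        return 0
    if not valid1:
        return 1
    return lru_bit           # both valid: the least recently used line goes

print(choose_victim(True, False, 0))  # 1: the invalid line is replaced
print(choose_victim(True, True, 1))   # 1: the LRU bit decides
```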

  • Four way set associative - LRU algorithm

  • FLUSH# - (Flush cycle) - input pin

    cache flush input forces the Pentium processor to write back all

    modified lines in the data cache and invalidate its internal caches.

    A Flush Acknowledge special cycle will be generated by the Pentium

    processor indicating completion of the write back and invalidation.

    Byte enables indicate the type of bus cycle. BE4 is low and all other BEs are

    high.

    BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0

    1 1 1 0 1 1 1 1

    Cache instructions:

    INVD invalidate cache

    Effectively erases all the information in the data cache. (by marking

    it all invalid).

  • WBINVD - write back and invalidate cache

    A write-back special cycle is driven after the WBINVD instruction is

    executed.

    BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0

    1 1 1 1 0 1 1 1

    The INVD instruction should be used with care: it does not

    write back modified cache lines.

    A flush special cycle is driven after the INVD and WBINVD instructions are

    executed.

    BE7 BE6 BE5 BE4 BE3 BE2 BE1 BE0

    1 1 1 1 1 1 0 1

    For WBINVD, the write-back cycle is generated first, followed by the flush cycle.

  • Super scalar Architecture

    Processors capable of executing multiple instructions in parallel

    are known as superscalar machines.

    Parallel execution is possible through the U & V pipelines of the Pentium.

    Four restrictions are placed on a pair of integer instructions attempting parallel

    execution:

    1. Both must be simple instructions

    (Mov, Inc, Dec)

    2. No data dependencies may exist between them:

    a read-after-write dependency, or

    both instructions writing to the same operand.

  • 3. Neither instruction may contain both immediate data and a displacement

    value.

    MOV table[SI], 7

    4. Prefixed instruction may only execute in the U pipeline.

    MOV ES:[DI], AL

    For floating point instruction the first instruction of the pair must be one of

    the following :

    FADD, FSUB, FMUL, FDIV, FCOM

    Second instruction must be FXCH

    The compiler plays an important role in the ordering of instruction during

    code generation.

  • Pipeline and Instruction Flow

    (Figure: successive instruction pairs I1-I4 advancing one stage per clock

    through the five pipeline stages PF, D1, D2, EX, WB)

    5 stage pipeline

    PF : prefetch

    D1 : Instruction decode

    D2 : Address Generation

    EX : Execute -ALU and Cache Access

    WB : Write Back

  • U pipeline can execute any processor instruction (including the initial

    stages of the floating point instructions)

    V pipeline only executes simple instructions.

  • Instructions are fed into the PF stage from the cache or memory.

    D1 stage - determines whether the current pair of instructions can execute together.

    D2 stage - addresses for operands that reside in memory are calculated.

    EX stage - operands are read from the data cache or memory.

    ALU operations are performed.

    Branch predictions are verified (except for conditional

    branches).

    WB stage - the results of the completed instruction are written;

    conditional branch predictions are verified.

    When paired instructions reach the EX stage, it is possible that one or the

    other will stall and require additional cycles to execute.

  • Stall - no work is done.

    Pipeline stalls lower performance.

    If U stalls, V continues executing; if V stalls, U continues executing.

    Both instructions must progress to the WB stage before another pair may

    enter the EX stage.

  • Branch Prediction

    Branch Prediction Strategies :

    Static

    The actions for a branch are fixed for each branch during the entire

    execution. The actions are fixed at compile time.

    Decided before runtime

    Based on the object code

    Dynamic

    The decision causing the branch prediction can dynamically change

    during the program execution.

    Based on the execution history.

    Prediction decisions may change during the execution of the

    program

  • BHT: Branch History Table - 2-bit dynamic prediction.

    Each entry holds a 2-bit counter: states 11 and 10 predict "taken",

    states 01 and 00 predict "not taken" (sequential).

    (Figure: state transition diagram of the most frequently used 2-bit

    dynamic prediction (Smith algorithm). AT = actually taken, ANT =

    actually not taken. The four states are strongly taken (11), weakly

    taken (10), weakly not taken (01), and strongly not taken (00); each AT

    moves the counter toward 11, each ANT toward 00. The counter is

    initialised when a branch is taken for the first time.)
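The 2-bit Smith counter can be sketched directly; the initial state is assumed here to be "strongly taken":

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 11/10 predict taken, 01/00 not taken."""

    def __init__(self, state=0b11):       # assumed initial state
        self.state = state

    def predict(self):
        return self.state >= 0b10         # taken in states 10 and 11

    def update(self, actually_taken):
        if actually_taken:
            self.state = min(self.state + 1, 0b11)   # saturate at 11
        else:
            self.state = max(self.state - 1, 0b00)   # saturate at 00

p = TwoBitPredictor()
p.update(False)                # 11 -> 10: still predicts taken
print(p.predict())             # True
p.update(False)                # 10 -> 01: prediction flips
print(p.predict())             # False
```

Note how a single mispredicted branch does not flip a "strong" state; two in a row are needed, which is what makes the scheme robust to loop exits.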

  • The prediction will be either taken or not taken.

    If the prediction turns out to be true, the pipeline will not be flushed, and no clock cycles will

    be lost.

    If the prediction turns out to be false, the pipeline is flushed and started over with the correct

    instruction.

    It is best if the predictions are true most of the time.

  • Branch target buffer : four way set associative cache

    256 entries, 64 sets

    Whenever a branch is taken, the CPU enters the destination (target)

    address in the BTB.

    The BTB stores two history bits that indicate the execution history of the

    branch instruction.

    Two 32-byte prefetch buffers work with the BTB and the D1 stage of the U &

    V pipelines to keep a steady stream of instructions flowing into the

    pipelines.

    One buffer prefetches instructions from the current program address.

    The other buffer, activated when the BTB predicts "taken", prefetches

    instructions from the target address.

  • Functional Block Diagram of Pentium (figure)

  • Floating point unit

    Coprocessor family:

    8086  - 8087

    80286 - 80287

    80386 - 80387

    80486 - internal FPU (not pipelined)

    Pentium - internal FPU (pipelined)

  • Floating-point format (IEEE 754): sign bit, exponent, mantissa.

  • The floating-point instructions are those that are executed by the

    processor's floating-point unit (FPU).

    These instructions operate on floating-point (real), extended integer, and

    binary-coded decimal (BCD) operands.

    The term floating point is derived from the fact that there is no fixed

    number of digits before and after the decimal point; that is, the decimal

    point can float.

    There are also representations in which the number of digits before and

    after the decimal point is set, called fixed-point representations.
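The sign/exponent/mantissa split can be inspected in Python for the 64-bit IEEE 754 double format (the Pentium FPU also uses an 80-bit extended format internally, not shown here):

```python
import struct

def decompose(x):
    """IEEE 754 double: return (sign, biased exponent, 52-bit fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent (bias 1023)
    fraction = bits & ((1 << 52) - 1)     # mantissa without the hidden 1
    return sign, exponent, fraction

print(decompose(-1.0))   # (1, 1023, 0): negative, unbiased exponent 0, hidden 1
```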

  • PF - prefetch

    D1 - instruction decode

    D2 - address generation

    EX - memory and register read;

    FP data converted into memory format;

    memory write

    X1 - FP execute stage one;

    memory data converted into FP format;

    write operand to FP register file;

    Bypass 1: send data back to EX stage

    X2 - FP execute stage two

    WF - round FP result and write to FP register file;

    Bypass 2: send data back to EX stage

    ER - error reporting, update status word

    Bypass 1:

    FLD ST

    FMUL ST

    Bypass 2:

    The result of an arithmetic instruction in the WF stage is made available

    to the next instruction fetching operands in the EX stage.

    FADD, FSUB, FMUL, FDIV, FCOM

    The second instruction must be FXCH.

    First instruction - U pipeline (makes up the first five stages of the FPU pipeline)

    Second instruction - V pipeline

  • Eight 80-bit floating-point registers,

    ST(0) through ST(7)