The Story so far:

• Instruction Set Architectures
• Performance issues
• ALUs
• Single-cycle CPU
• Multicycle CPU: datapath, control, exceptions
• Pipelining
• Memory systems
  – Static/Dynamic RAM technologies
  – Cache structures: direct mapped, associative
  – Virtual memory

Memory systems

Memory Systems

[Figure: the five classic components of a computer -- control, datapath, memory, input, output -- with memory highlighted.]

Technology Trends (from 1st lecture)

DRAM generations:

  Year   Size     Cycle Time
  1980   64 Kb    250 ns
  1983   256 Kb   220 ns
  1986   1 Mb     190 ns
  1989   4 Mb     165 ns
  1992   16 Mb    145 ns
  1995   64 Mb    120 ns

           Capacity         Speed (latency)
  Logic:   2x in 3 years    2x in 3 years
  DRAM:    4x in 3 years    2x in 10 years
  Disk:    4x in 3 years    2x in 10 years

Over this span, capacity improved 1000:1 while speed improved only 2:1.

Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Plot: performance vs. year, 1980-2000, log scale from 1 to 1000. The µProc curve ("Moore's Law") grows 60%/yr (2X/1.5 yr); the DRAM curve grows 9%/yr (2X/10 yrs). The processor-memory performance gap grows about 50% per year.]

Today's Situation: Microprocessor

• Rely on caches to bridge the gap.
• Microprocessor-DRAM performance gap
  – time of a full cache miss, in instructions executed:
    1st Alpha (7000):   340 ns / 5.0 ns =  68 clks x 2 or 136 instructions
    2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks x 4 or 320 instructions
    3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6 or 648 instructions
  – 1/2X latency x 3X clock rate x 3X instr/clock => ~5X more instructions per miss
• Processor speeds -- multi-GHz (Pentium 4-class parts ran around 3 GHz in 2003)
• Memory speeds -- Mac/PC/workstation DRAM: ~50 ns; disks are even slower...
• How can we span this "access time" gap?
  – 1 instruction fetch per instruction
  – 15/100 instructions also do a data read or write (load or store)

Impact on Performance

• Suppose a processor executes at:
  – Clock rate = 200 MHz (5 ns per cycle)
  – CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory ops get a 50-cycle miss penalty.

[Pie chart of where cycles go: Ideal CPI 1.1 (35%), Data Miss 1.5 (49%), Inst Miss 0.5 (16%).]

• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cyc) + (0.30 (data mops/ins) x 0.10 (miss/data mop) x 50 (cycle/miss))
      = 1.1 cycle + 1.5 cycle = 2.6
• 58% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
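The arithmetic is worth checking mechanically. A minimal Python sketch, using only the numbers assumed on this slide:

    # CPI including data-miss stalls, per the slide's assumptions.
    ideal_cpi    = 1.1
    ld_st_frac   = 0.30   # fraction of instructions that are loads/stores
    miss_rate    = 0.10   # fraction of memory ops that miss
    miss_penalty = 50     # cycles per miss

    data_stalls = ld_st_frac * miss_rate * miss_penalty   # 1.5 cycles/instr
    cpi = ideal_cpi + data_stalls                         # 2.6
    print(f"CPI = {cpi:.1f}; stalled {data_stalls / cpi:.0%} of the time")
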
Memory system - hierarchical

[Figure: processor (control + datapath) backed by a chain of memories, each larger and slower than the last.]

  Speed:     fastest ........... slowest
  Size:      smallest .......... biggest
  Cost/bit:  highest ........... lowest

Why hierarchy works

• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

[Figure: probability of reference vs. address space (0 to 2^n - 1), sharply peaked around a few regions.]

Locality

• A property of memory references in "typical programs": they tend to favor a portion of their address space at any given time.
• Temporal -- tendency to reference locations recently referenced.
• Spatial -- tendency to reference locations "near" those recently referenced.

[Figure: references plotted against the address space over time; after touching x, the program is likely to reference x again soon, and to reference the zone of addresses around x.]

Memory Hierarchy: How Does it Work?

• Temporal Locality (locality in time):
  => keep most recently accessed data items closer to the processor.
• Spatial Locality (locality in space):
  => move blocks consisting of contiguous words to the upper levels.

[Figure: upper-level memory holds block X close to the processor; block Y sits in lower-level memory; blocks move between the levels.]

Memory Hierarchy: How Does it Work?

• Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again.
• This is done because we can build large, slow memories and small, fast memories, but we can't build large, fast memories.
• If it works, we get the illusion of SRAM access time with disk capacity.
  – SRAM access times are 2-25 ns, at a cost of $100 to $250 per MByte.
  – DRAM access times are 60-120 ns, at a cost of $5 to $10 per MByte.
  – Disk access times are 10 to 20 million ns, at a cost of $0.10 to $0.20 per MByte.

Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: block X).
  – Hit rate: the fraction of memory accesses found in the upper level.
  – Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: data must be retrieved from a block in the lower level (block Y).
  – Miss rate = 1 - (hit rate).
  – Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit time << miss penalty.

Memory Hierarchy of a Modern Computer System

• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

[Figure: processor (registers, on-chip cache) -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (disk/tape). Speed (ns): 1s / 10s / 100s / 10,000,000s (10s ms) / 10,000,000,000s (10s sec). Size (bytes): 100s / Ks / Ms / Gs / Ts.]

Main Memory Background

• Performance of main memory:
  – Latency: cache miss penalty.
    • Access time: time between the request and the word arriving.
    • Cycle time: time between requests.
  – Bandwidth: I/O & large-block miss penalty (L2).
• Main memory is DRAM: Dynamic Random Access Memory.
  – Dynamic, since it needs to be refreshed periodically (~8 ms).
  – Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS, or Row Access Strobe
    • CAS, or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory.
  – No refresh (6 transistors/bit vs. 1 transistor/bit).
  – Address not divided.
• Size ratio DRAM/SRAM: 4-8x; cost and cycle-time ratio SRAM/DRAM: 8-16x.

Static RAM Cell

[Figure: 6-transistor SRAM cell -- cross-coupled inverters hold the stored bit; two access transistors connect the cell to the bit and bit-bar lines under control of the word (row select) line. In some designs the PMOS loads are replaced with pullups to save area.]

• Write:
  1. Drive the bit lines (bit = 1, bit-bar = 0).
  2. Select the row.
• Read:
  1. Precharge bit and bit-bar to Vdd.
  2. Select the row.
  3. The cell pulls one line low.
  4. A sense amp on the column detects the difference between bit and bit-bar.

Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16 x 4 array of SRAM cells. An address decoder on A0-A3 drives word lines Word 0 .. Word 15. Each of the four bit columns has a write driver & precharger at the top and a sense amp at the bottom, taking Din 0-3 and producing Dout 0-3 under control of WrEn and Precharge.]

Problems with SRAM

• Six transistors use up a lot of area.
• Consider a "zero" stored in the cell:
  – Transistor N1 will try to pull "bit" to 0.
  – Transistor P2 will try to pull "bit bar" to 1.
• But the bit lines are precharged high: are P1 and P2 really necessary?

[Figure: the cell storing 0 with Select = 1; bit = 1, bit-bar = 0; N1 and P2 on, N2 and P1 off.]

Logic Diagram of a Typical SRAM

[Figure: a 2^N-word x M-bit SRAM with N address pins (A), M data pins (D), write enable (WE_L), and output enable (OE_L).]

• Write Enable is usually active low (WE_L).
• Din and Dout are combined on one set of pins (D) to save pins, controlled by a separate output enable (OE_L):
  – WE_L asserted, OE_L deasserted => D is the data input pin.
  – WE_L deasserted, OE_L asserted => D is the data output pin.

[Timing diagrams. Write: the write address is presented on A, data in on D, and WE_L is pulsed, with a write setup time before and a write hold time after the pulse. Read: the read address is presented on A with OE_L asserted; D leaves high-Z and data out becomes valid after the read access time.]

1-Transistor Memory Cell (DRAM)

[Figure: a single access transistor, gated by the row select line, connects the bit line to a storage capacitor.]

• Write:
  1. Drive the bit line.
  2. Select the row.
• Read:
  1. Precharge the bit line to Vdd.
  2. Select the row.
  3. The cell and bit line share charges.
     • Very small voltage changes on the bit line.
  4. Sense (fancy sense amp).
     • Can detect changes of ~1 million electrons.
  5. Write: restore the value (the read is destructive).
• Refresh:
  1. Just do a dummy read to every cell.

Classical DRAM Organization (square)

[Figure: a square RAM cell array. A row decoder, driven by the row address, selects one word (row) line; a column selector & I/O circuits block, driven by the column address, picks data off the bit (data) lines. Each intersection represents a 1-T DRAM cell.]

• Row and column address together select 1 bit at a time.

Logic Diagram of a Typical DRAM

[Figure: a 256K x 8 DRAM with 9 multiplexed address pins (A), 8 data pins (D), and control pins RAS_L, CAS_L, WE_L, OE_L.]

• Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low.
• Din and Dout are combined (D):
  – WE_L asserted (low), OE_L deasserted (high) => D serves as the data input pin.
  – WE_L deasserted (high), OE_L asserted (low) => D is the data output pin.
• Row and column addresses share the same pins (A):
  – RAS_L goes low: pins A are latched in as the row address.
  – CAS_L goes low: pins A are latched in as the column address.
  – RAS/CAS are edge-sensitive.

Key DRAM Timing Parameters

• tRAC: minimum time from RAS falling to valid data output.
  – Quoted as the speed of a DRAM; a fast 4 Mb DRAM has tRAC = 60 ns.
• tRC: minimum time from the start of one row access to the start of the next.
  – tRC = 110 ns for a 4 Mb DRAM with a tRAC of 60 ns.
• tCAC: minimum time from CAS falling to valid data output.
  – 15 ns for a 4 Mb DRAM with a tRAC of 60 ns.
• tPC: minimum time from the start of one column access to the start of the next.
  – 35 ns for a 4 Mb DRAM with a tRAC of 60 ns.

DRAM Performance

• A 60 ns (tRAC) DRAM can:
  – perform a row access only every 110 ns (tRC);
  – perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
    • In practice, external address delays and bus turnaround make it 40 to 50 ns.
• These times do not include the time to drive the addresses off the microprocessor, nor the memory-controller overhead.
  – Driving parallel DRAMs, external memory controller, bus turnaround, SIMM module, pins...
  – 180 ns to 250 ns latency from processor to memory is good for a "60 ns" (tRAC) DRAM.
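To see what these parameters imply for raw bandwidth, here is a rough Python sketch; the x8 part width comes from the 256K x 8 logic diagram above, and the framing is illustrative:

    # Peak bandwidth of an x8 DRAM part, from the slide's timing parameters.
    t_rc = 110e-9          # row cycle time (s)
    t_pc = 35e-9           # page-mode column cycle time (s)
    bytes_per_access = 1   # a 256K x 8 part delivers 8 bits per access

    print(f"new row every access: {bytes_per_access / t_rc / 1e6:.1f} MB/s")
    print(f"same-row accesses:    {bytes_per_access / t_pc / 1e6:.1f} MB/s")
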
DRAM Write Timing

[Timing diagram for the 256K x 8 DRAM: RAS_L falls with the row address on A; CAS_L then falls with the column address; data in is driven on D. Two write cycles are shown -- an early write cycle (WE_L asserted before CAS_L) and a late write cycle (WE_L asserted after CAS_L) -- annotated with the WR access time and the DRAM WR cycle time.]

• Every DRAM access begins with the assertion of RAS_L.
• 2 ways to write: early or late relative to CAS.

DRAM Read Timing

[Timing diagram for the 256K x 8 DRAM: RAS_L falls with the row address; CAS_L falls with the column address; D goes from high-Z to data out after the read access time and output-enable delay. Two read cycles are shown -- an early read cycle (OE_L asserted before CAS_L) and a late read cycle (OE_L asserted after CAS_L) -- annotated with the DRAM read cycle time.]

• Every DRAM access begins with the assertion of RAS_L.
• 2 ways to read: early or late relative to CAS.

Cycle Time versus Access Time

[Timeline: each access occupies an access time; successive accesses must be separated by the (longer) cycle time.]

• DRAM (read/write) cycle time >> DRAM (read/write) access time -- roughly 2:1; why?
• DRAM (read/write) cycle time:
  – How frequently can you initiate an access?
  – Analogy: a little kid can only ask his father for money on Saturday.
• DRAM (read/write) access time:
  – How quickly will you get what you want once you initiate an access?
  – Analogy: as soon as he asks, his father will give him the money.
• DRAM bandwidth limitation analogy:
  – What happens if he runs out of money on Wednesday?

Increasing Bandwidth - Interleaving

Access pattern without interleaving: the CPU starts the access for D1, waits the full memory cycle until D1 is available, and only then starts the access for D2.

Access pattern with 4-way interleaving: the CPU accesses bank 0, then banks 1, 2, and 3 on successive cycles; by the time bank 3 has been started, bank 0 can be accessed again.

[Figure: CPU connected to memory banks 0-3, with the four banks' access timelines staggered.]
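A toy Python model of the two access patterns; the 8-cycle bank busy time and 1-cycle transfer are invented for illustration:

    # Cycles to fetch n consecutive words with and without bank interleaving.
    BANK_BUSY = 8   # cycles a bank is tied up per access (assumed)
    TRANSFER  = 1   # cycles to deliver one word to the CPU (assumed)

    def fetch_cycles(n_words, n_banks):
        busy_until = [0] * n_banks           # when each bank is free again
        t = 0
        for i in range(n_words):
            b = i % n_banks                  # word i lives in bank i mod n_banks
            t = max(t, busy_until[b])        # wait if that bank is still busy
            busy_until[b] = t + BANK_BUSY
            t += TRANSFER
        return t

    print(fetch_cycles(8, 1))   # 57 cycles: every access waits out the bank
    print(fetch_cycles(8, 4))   # 12 cycles: banks 1-3 overlap bank 0's busy time
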
Fewer DRAMs/System over Time

Minimum PC memory size vs. DRAM generation ('86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb), as the number of chips needed in two successive generations:

  4 MB:    32 x 1 Mb    or  8 x 4 Mb
  8 MB:    16 x 4 Mb    or  4 x 16 Mb
  16 MB:    8 x 16 Mb   or  2 x 64 Mb
  32 MB:    4 x 64 Mb   or  1 x 256 Mb
  64 MB:    8 x 64 Mb   or  2 x 256 Mb
  128 MB:   4 x 256 Mb  or  1 x 1 Gb
  256 MB:   8 x 256 Mb  or  2 x 1 Gb

• Memory per system grows at 25%-30% / year; memory per DRAM grows at 60% / year.
• Increasing DRAM capacity => fewer chips => harder to have banks.

Page Mode DRAM: Motivation

• Regular DRAM organization:
  – N rows x N columns x M bits.
  – Read & write M bits at a time.
  – Each M-bit access requires a full RAS/CAS cycle.
• Fast Page Mode DRAM adds an N x M "register" to save a row.

[Timing diagram: two M-bit accesses, each needing RAS_L with the row address followed by CAS_L with the column address. Figure: N-row x N-column DRAM array with row address, column address, and M-bit output.]

Fast Page Mode Operation

• Fast Page Mode DRAM: an N x M "SRAM" saves a row.
• After a row is read into the register:
  – Only CAS is needed to access other M-bit blocks on that row.
  – RAS_L remains asserted while CAS_L is toggled.

[Timing diagram: one RAS_L assertion with the row address, then four CAS_L pulses with successive column addresses, yielding the 1st through 4th M-bit accesses. Figure: the DRAM array with the N x M "SRAM" row register between the array and the M-bit output.]
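A back-of-the-envelope Python comparison using the parameters from the "Key DRAM Timing Parameters" slide (tRC = 110 ns, tRAC = 60 ns, tPC = 35 ns):

    # Time to read 4 M-bit blocks that happen to lie in the same row.
    t_rc, t_rac, t_pc = 110, 60, 35   # ns

    without_page_mode = 4 * t_rc            # a full RAS/CAS cycle per block
    with_page_mode    = t_rac + 3 * t_pc    # open the row once, then CAS-only
    print(without_page_mode, "ns vs.", with_page_mode, "ns")   # 440 vs. 165
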
DRAMs over Time

  DRAM generation:          1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
  1st gen. sample           '84     '87     '90     '93     '96     '99
  Die size (mm2)            55      85      130     200     300     450
  Memory area (mm2)         30      47      72      110     165     250
  Memory cell area (µm2)    28.84   11.1    4.26    1.64    0.61    0.23

Memory - Summary

• Two different types of locality:
  – Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  – Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.
• DRAM is slow but cheap and dense:
  – A good choice for presenting the user with a BIG memory system.
• SRAM is fast but expensive and not very dense:
  – A good choice for providing the user FAST access time.

Cache Fundamentals

• cache hit -- an access where the data is found in the cache.
• cache miss -- an access where it isn't.
• hit time -- time to access the cache.
• miss penalty -- time to move data from the further level to the closer one, then to the cpu.
• hit ratio -- percentage of accesses where the data is found in the cache.
• miss ratio -- (1 - hit ratio).
• cache block size or cache line size -- the amount of data that gets transferred on a cache miss.
• instruction cache -- a cache that only holds instructions.
• data cache -- a cache that only caches data.
• unified cache -- a cache that holds both.

[Figure: cpu <-> lowest-level cache <-> next-level memory/cache.]

Caching Issues

On a memory access --
• How do I know if this is a hit or a miss?

On a cache miss --
• Where do I put the new data?
• What data do I throw out?
• How do I remember what data this is?

[Figure: cpu -> lowest-level cache; on a miss, the access continues to the next-level memory/cache.]

A simple cache: similar to the branch prediction table in pipelining?

• A cache that can put a line of data in exactly one place is called direct-mapped.
• An index is used to determine which line an address might be found in; a tag holds the rest of the address.

[Figure: a 4-entry tag/data array; each block holds one word, and each word in memory maps to exactly one cache location. Reference address string (decimal, then binary): 4 00000100, 8 00001000, 12 00001100, 4 00000100, 8 00001000, 20 00010100, 4 00000100, 8 00001000, 20 00010100, 24 00011000, 12 00001100, 8 00001000, 4 00000100.]

• Conflict misses are misses caused by different memory locations mapped to the same cache index.
  – Solution 1: make the cache bigger.
  – Solution 2: multiple entries for the same cache index.
• Conflict misses = 0, by definition, for fully associative caches.
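To make the example concrete, here is a minimal Python sketch of this 4-entry direct-mapped cache replaying the slide's address string (the code is illustrative, but the hit/miss sequence is exactly what the figure steps through):

    # Direct-mapped cache: 4 blocks, one 4-byte word per block.
    NUM_BLOCKS = 4
    tags = [None] * NUM_BLOCKS

    for addr in [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]:
        block = addr // 4                # word address
        index = block % NUM_BLOCKS       # the one line this address can use
        tag   = block // NUM_BLOCKS      # the rest of the address
        hit = tags[index] == tag
        if not hit:
            tags[index] = tag            # replace whatever was there
        print(f"addr {addr:2d}: index {index}, tag {tag}, {'hit' if hit else 'miss'}")

Addresses 4 and 20 share index 1, so they keep evicting each other: those are the conflict misses described above.
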
Fully associative cache

• A cache that can put a line of data anywhere is called fully associative.
• The most popular replacement strategy is LRU (least recently used).
• How do you find the data? The tag identifies the full address of the cached data.

[Figure: a 4-entry tag/data array; each block holds one word, and any block can hold any word. Same reference address string as before.]

Fully associative cache

• Forget about the cache index:
  – Compare the cache tags of all cache entries in parallel.
  – Example: with 32-byte blocks and 32-bit addresses, we need N 27-bit comparators.
• By definition: conflict misses = 0 for a fully associative cache.

[Figure: each entry has a valid bit, a 27-bit cache tag, and 32 bytes of cache data (bytes 0-31, 32-63, ...). The incoming address splits into a cache tag (bits 31-5), compared against every entry at once, and a byte select (bits 4-0, e.g. 0x01).]

An n-way set associative cache

• A cache that can put a line of data in exactly n places is called n-way set-associative.
• The cache lines that share the same index are a cache set.

[Figure: two tag/data arrays side by side; 4 entries in all, each block holds one word, and each word in memory maps to one of a set of n cache lines. Same reference address string as before.]

An n-way set associative cache

• N-way set associative: N entries for each cache index.
  – N direct-mapped caches operating in parallel.
• Example: two-way set associative cache.
  – The cache index selects a "set" from the cache.
  – The two tags in the set are compared in parallel.
  – Data is selected based on the tag result.

[Figure: two banks of (valid, cache tag, cache data) entries indexed by the cache index. Each bank's tag is compared against the address tag; the compare results are ORed into Hit and drive a mux (Sel1/Sel0) that picks the cache block.]
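The same experiment for an n-way set-associative cache with LRU replacement, as a Python sketch (the geometry and address string are the slides'; the code itself is illustrative):

    # n-way set-associative cache, LRU replacement, one 4-byte word per block.
    def run(addresses, num_blocks=4, ways=2):
        num_sets = num_blocks // ways
        sets = [[] for _ in range(num_sets)]   # per set: tags, most recent last
        for addr in addresses:
            block = addr // 4
            index = block % num_sets
            tag   = block // num_sets
            tags = sets[index]
            hit = tag in tags
            if hit:
                tags.remove(tag)          # re-appended below as most recent
            elif len(tags) == ways:
                tags.pop(0)               # evict the least recently used
            tags.append(tag)
            print(f"addr {addr:2d}: set {index}, tag {tag}, {'hit' if hit else 'miss'}")

    run([4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4])
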
Direct vs. Set Associative Caches

• N-way set associative cache versus direct mapped cache:
  – N comparators vs. 1.
  – Extra MUX delay for the data.
  – Data comes AFTER the hit/miss decision and set selection.
• In a direct mapped cache, the cache block is available BEFORE hit/miss:
  – Possible to assume a hit and continue; recover later if it was a miss.

[Figure: the same two-way set-associative datapath as on the previous slide.]

Longer cache blocks

• Large cache blocks take advantage of spatial locality.
• Too large a block size can waste cache space.
• Longer cache blocks require less tag space.

[Figure: a 4-entry tag/data array where each block holds two words; each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches). Same reference address string as before.]

Increasing block size

[Plot: miss rate (0%-40%) vs. block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB. Miss rate falls as blocks grow, then rises again when blocks get large relative to a small cache.]

• In general, larger block sizes take advantage of spatial locality, BUT:
  – Larger block size means larger miss penalty:
    • it takes longer to fill up the block.
  – If the block size is too big relative to the cache size, the miss rate will go up:
    • too few cache blocks.
• In general, average access time:
  – = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
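The average-access-time formula in executable form; the numbers below are invented but era-typical (1-cycle hit, 5% miss rate, 40-cycle total miss time):

    # Average access time, using the slide's formulation: the miss penalty here
    # is the total time for a miss, not the extra time on top of a hit.
    hit_time, miss_rate, miss_penalty = 1, 0.05, 40   # cycles, fraction, cycles

    amat = hit_time * (1 - miss_rate) + miss_penalty * miss_rate
    print(f"average access time = {amat:.2f} cycles")   # 2.95
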
Sources of Cache Misses

• Compulsory (cold start or process migration; first reference): the first access to a block.
  – A "cold" fact of life: not a whole lot you can do about it.
  – Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location.
  – Solution 1: increase cache size.
  – Solution 2: increase associativity.
• Capacity:
  – The cache cannot contain all the blocks accessed by the program.
  – Solution: increase cache size.
• Invalidation: another process (e.g., I/O) updates memory.

Sources of Cache Misses

                      Direct Mapped   N-way Set Associative   Fully Associative
  Cache size          Big             Medium                  Small
  Compulsory miss     Same            Same                    Same
  Conflict miss       High            Medium                  Zero
  Capacity miss       Low             Medium                  High
  Invalidation miss   Same            Same                    Same

Accessing a cache

1. Use the index and tag to access the cache and determine hit/miss.
2. If hit, return the requested data.
3. If miss, select a cache block to be replaced, and access memory or the next lower cache (possibly stalling the processor):
   - load the entire missed cache line into the cache;
   - return the requested data to the CPU (or the higher cache).
4. If the next lower memory is a cache, go to step 1 for that cache.

[Figure: the five-stage pipeline (IF ID EX MEM WB), with the ICache feeding instruction fetch and the DCache sitting in the MEM stage.]

Accessing a cache

• 64 KB cache, direct-mapped, 32-byte cache blocks.

[Figure: the 32-bit address splits into a 16-bit tag (bits 31-16), an 11-bit index (bits 15-5), a 3-bit word offset, and a 2-bit byte offset. 64 KB / 32 bytes = 2K cache blocks/sets (rows 0..2047). The indexed entry's valid bit and 16-bit tag feed a comparator against the address tag to produce hit/miss; the 256-bit data block feeds a word-select mux that yields the 32-bit word.]
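These widths fall straight out of the cache geometry. A small Python helper (hypothetical, not from the slides) reproduces them for this cache and for the 2-way cache on the next slide:

    # Tag/index/offset widths for a set-associative cache (sizes: powers of 2).
    def cache_fields(cache_bytes, block_bytes, ways=1, addr_bits=32):
        sets = cache_bytes // (block_bytes * ways)
        offset_bits = block_bytes.bit_length() - 1   # log2
        index_bits  = sets.bit_length() - 1
        return addr_bits - index_bits - offset_bits, index_bits, offset_bits

    print(cache_fields(64 * 1024, 32))           # (16, 11, 5): this slide
    print(cache_fields(32 * 1024, 16, ways=2))   # (18, 10, 4): the next slide
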
Accessing a cache

• 32 KB cache, 2-way set-associative, 16-byte blocks (cache lines).

[Figure: the 32-bit address splits into an 18-bit tag, a 10-bit index, a 2-bit word offset, and a 2-bit byte offset. 32 KB / 16 bytes / 2 = 1K cache sets (rows 0..1023). Each set holds two (valid, tag, data) entries; both tags are compared against the address tag in parallel to produce hit/miss.]

Cache Alignment

[Figure: the memory address split into tag, index, and block offset; memory drawn as consecutive cache-line-sized pieces 0, 1, 2, ...]

• The data that gets moved into the cache on a miss are all the data whose addresses share the same tag and index (regardless of which datum was accessed first).
• This results in:
  – no overlap of cache lines;
  – easy mapping of addresses to cache lines (no additions);
  – the data at address X always being present in the same location within the cache block (at byte X mod blocksize), if it is there at all.
• Think of memory as organized into cache-line-sized pieces (because in reality, it is!).
• Recall the DRAM page-mode architecture!!

Basic Memory Hierarchy questions

• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

• Placement: the cache structure determines the "where":
  – Fully associative, direct mapped, 2-way set associative.
  – Set-associative mapping = block number modulo number of sets.
• Identification: tag on each block.
  – No need to check the index or block offset.
• Replacement: easy for direct mapped. For set associative or fully associative:
  – Random
  – LRU (least recently used)

What about writes?

• Stores must be handled differently than loads, because...
  – they don't necessarily require the CPU to stall;
  – they change the content of cache/memory (creating memory consistency issues);
  – they may require both a load and a store to complete.

What about writes?

• On a store hit, write the new data to the cache. In a write-through cache, also write the data immediately to memory. In a write-back cache, mark the line as dirty.
• On a store miss, initiate a cache-block load from memory for a write-allocate cache; write directly to memory for a write-around cache.
• On any kind of cache miss in a write-back cache, if the line to be replaced is dirty, write it back to memory.
• Keep memory and cache identical?
  – write-through => all writes go to both cache and main memory.
  – write-back => writes go only to the cache; modified cache lines are written back to memory when the line is replaced.
• Make room in the cache on a store miss?
  – write-allocate => on a store miss, bring the written line into the cache.
  – write-around => on a store miss, ignore the cache.
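As a concrete reading of those policies, here is a hypothetical Python sketch of the store path for a write-back, write-allocate cache (the structure and helper names are invented for illustration):

    # Store path of a write-back, write-allocate cache (direct-mapped for brevity).
    lines = {}   # index -> {"tag": ..., "dirty": ..., "data": ...}

    def store(index, tag, data, write_back, load_line):
        line = lines.get(index)
        if line is None or line["tag"] != tag:      # store miss
            if line is not None and line["dirty"]:
                write_back(index, line)             # evict the dirty victim first
            line = {"tag": tag, "dirty": False,
                    "data": load_line(index, tag)}  # write-allocate: fetch the line
            lines[index] = line
        line["data"] = data    # the store itself goes only to the cache...
        line["dirty"] = True   # ...memory is updated later, at eviction

    store(3, 0x1A, 42, write_back=lambda i, l: None, load_line=lambda i, t: 0)

A write-through cache would instead forward every store to memory and needs no dirty bit; a write-around cache would skip the allocation on a miss.
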
Cache Performance

TCPI = BCPI + MCPI, where:
  – TCPI = total CPI;
  – BCPI = base CPI, i.e. the CPI assuming perfect memory;
  – MCPI = the memory CPI, the number of cycles (per instruction) the processor is stalled waiting for memory.

MCPI = accesses/instruction x miss rate x miss penalty
  – This assumes we stall the pipeline on both read and write misses, that the miss penalty is the same for both, and that cache hits require no stalls.
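The same formula in executable form (the inputs are illustrative assumptions, not measurements from the slides):

    # TCPI = BCPI + MCPI, with MCPI = accesses/instr * miss rate * miss penalty.
    bcpi = 1.1
    accesses_per_instr = 1.3   # 1 fetch + 0.3 data references (assumed)
    miss_rate = 0.02           # (assumed)
    miss_penalty = 50          # cycles (assumed)

    mcpi = accesses_per_instr * miss_rate * miss_penalty
    print(f"TCPI = {bcpi + mcpi:.2f}")   # 1.1 + 1.3 = 2.40
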
Example -- DEC Alpha 21164 Caches

[Figure: the 21164 CPU core with split instruction and data caches, a unified on-chip L2 cache, and an off-chip L3 cache.]

• ICache and DCache -- 8 KB, direct-mapped, 32-byte lines.
• L2 cache -- 96 KB, 3-way set-associative, 32-byte lines.
• L3 cache -- 1 MB, direct-mapped, 32-byte lines.

DEC Alpha 21164's L2 cache: 96 KB, 3-way set associative, 32-byte lines, 64-bit addresses.

  memory address:  tag | index | offset

How many "offset" bits? How many "index" bits? How many "tag" bits? Draw the cache picture -- how do you tell if it's a hit?

• 8 KB cache, direct-mapped, 32-byte lines.

[Figure: the 32-bit address splits into a 19-bit tag, an 8-bit index, a 3-bit word offset, and a 2-bit byte offset. 8 KB / 32 bytes = 256 cache blocks/sets (rows 0..255); the indexed entry's valid bit and tag feed a comparator against the address tag to give hit/miss and a 32-bit word.]

• 96 KB cache, 3-way set-associative, 32-byte blocks (cache lines).
  => each block (way) holds 32 bytes
  => 3 bits locate the right 32-bit word within a block, plus 2 byte-offset bits
  => each "index" has 3 blocks (3-way): in effect, three places the data could go
  => 96 bytes per "row"
  => 96 KB / 96 B = 1K rows
  => index = 10 bits
  => tag = 32 - (10 + 3 + 2) = 17 bits
     (using 32-bit addresses for simplicity; with the Alpha's 64-bit addresses the tag would be 49 bits)

Pentium 4: Trace Cache and Instruction Decode

[Figure: the Pentium 4 front end, where the trace cache holds already-decoded micro-ops past the instruction decoder.]

Remember micro-instructions??

Power4

[Figure: the POWER4 chip holds two processor cores. Each core has a 64 KB I-cache and a 32 KB D-cache (128-byte lines, 2-way set-associative, write-through). The cores share an on-chip L2 of 1.44 MB (128-byte lines, 8-way set-associative, write-back) and an off-chip L3 of 128 MB (512-byte lines, 8-way set-associative, write-back). A multi-chip module combines chips to provide processors P0 through P7.]

Real Caches

• Often direct-mapped at the highest level (closest to the CPU), associative further away.
• Split I and D caches close to the processor; unified further away (for throughput rather than miss rate).
• Write-through and write-back are both common, but never write-through all the way to memory.
• 32-byte cache lines are very common.
• Non-blocking:
  – The processor doesn't stall on a miss, but only on the use of a miss (if even then).
  – This means the cache must be able to handle multiple outstanding accesses.

Recall: Levels of the Memory Hierarchy

  Level           Capacity     Access time    Cost                   Staging/xfer unit (managed by)
  CPU registers   100s bytes   <10s ns        --                     instr. operands, 1-8 bytes (prog./compiler)
  Cache           K bytes      10-100 ns      $.01-.001/bit          blocks, 8-128 bytes (cache cntl)
  Main memory     M bytes      100 ns-1 us    $.01-.001              pages, 512-4K bytes (OS)
  Disk            G bytes      ms             10^-3 to 10^-4 cents   files, Mbytes (user/operator)
  Tape            infinite     sec-min        10^-6 cents            --

Upper levels are faster; lower levels are larger.

Virtual Memory

• Main memory can act as a cache for the secondary storage (disk).
• Advantages:
  – the illusion of having more physical memory
  – program relocation
  – protection

[Figure: address translation maps virtual addresses either to physical addresses in main memory or to disk addresses.]

Virtual Memory

• ...is just caching, but uses different terminology:

  cache        VM
  block        page
  cache miss   page fault
  address      virtual address
  index        physical address (sort of)

• What happens if another program on the processor uses the same addresses that yours does?
• What happens if your program uses addresses that don't exist in the machine?
• What happens to "holes" in the address space your program uses?
• So, virtual memory provides:
  – performance (through the caching effect)
  – protection
  – ease of programming/compilation
  – efficient use of memory

Virtual Memory

• ...is just a mapping function from virtual memory addresses to physical memory locations, which allows caching of virtual pages in physical memory.
• Why is it different from an ordinary cache?
  – MUCH higher miss penalty (millions of cycles)! Therefore:
  – large pages [the equivalent of a cache line] (4 KB to MBs);
  – associative mapping of pages (typically fully associative);
  – software handling of misses (but not hits!!);
  – write-through is not an option, only write-back.

Page Tables

• The page table maps virtual memory addresses -- virtual page numbers, the high-order bits of the virtual address -- to physical memory locations.
  – A page may be in main memory or may be on disk; the "valid bit" says which.

Page Tables

[Figure: the virtual address splits into a virtual page number and a page offset. The page table register -- a hardware register that tells the VM system where to find the page table -- points at the table; the virtual page number indexes it, yielding a valid bit and a physical page number; the physical page number concatenated with the page offset forms the physical address.]

• Why don't we need a "tag" field?
• How does the operating system switch to a new process???
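A minimal Python sketch of that lookup, assuming 4 KB pages and a simple one-level table (everything here is invented for illustration):

    # Virtual-to-physical translation through a one-level page table.
    PAGE_SIZE = 4096   # 4 KB pages => 12-bit page offset

    page_table = [(True, 7), (False, 0), (True, 3)]   # vpn -> (valid, ppn)

    def translate(vaddr):
        vpn    = vaddr // PAGE_SIZE   # virtual page number: it indexes the
        offset = vaddr %  PAGE_SIZE   # table directly, so no tag is needed
        valid, ppn = page_table[vpn]
        if not valid:
            raise RuntimeError("page fault: OS must fetch the page from disk")
        return ppn * PAGE_SIZE + offset

    print(hex(translate(0x2123)))     # vpn 2 -> ppn 3, so 0x3123
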
Virtual Address and a Cache

[Figure: CPU -> translation (VA to PA) -> cache; hits return data, misses go to main memory.]

• It takes an extra memory access to translate a VA to a PA.
• This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.
• ASIDE: why access the cache with a PA at all? VA caches have a problem!
  – Synonym/alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
  – On an update, you must update all cache entries with the same physical address, or memory becomes inconsistent.
  – Determining this requires significant hardware -- essentially an associative lookup on the physical address tags to see if you have multiple hits; or
  – a software-enforced alias boundary: VA and PA must agree in their low-order bits up to the cache size.

TLBs

A way to speed up translation is to use a special cache of recently used page-table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.

[TLB entry: virtual address | physical address | dirty | ref | valid | access]

TLB access time is comparable to cache access time (much less than main memory access time).

Translation lookaside buffer (TLB): a hardware cache for the page table. Typical size: 256 entries, 1- to 4-way set associative.

Translation Look-Aside Buffers

• Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
• TLBs are usually small -- typically not more than 128-256 entries even on high-end machines. This permits a fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations.

[Figure: translation with a TLB. The CPU presents a VA to the TLB lookup; on a TLB hit the PA goes straight to the cache, and a cache hit returns data; on a TLB miss the full translation is performed and the TLB refilled; a cache miss goes to main memory. Typical relative times: cache access ~1/2 t, TLB lookup ~t, full translation ~20 t.]

Translation Look-Aside Buffers

• Machines with TLBs go one step further to reduce cycles/cache access: they overlap the cache access with the TLB access.
• This works because the high-order bits of the VA are used to look in the TLB while the low-order bits are used as the index into the cache.
• Overlapped access only works as long as the address bits used to index into the cache do not change as a result of VA translation.
• This usually limits things to small caches, large page sizes, or high n-way set-associative caches if you want a large cache.
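The constraint is easy to state in code: the cache's index + offset bits must fit within the page offset, so translation leaves them unchanged. A Python check with example sizes:

    # Can the cache index be taken entirely from untranslated page-offset bits?
    def overlap_ok(cache_bytes, ways, page_bytes):
        index_plus_offset = (cache_bytes // ways).bit_length() - 1   # log2
        return index_plus_offset <= page_bytes.bit_length() - 1

    print(overlap_ok(4096, 1, 4096))    # True:  4 KB direct-mapped, 4 KB pages
    print(overlap_ok(16384, 1, 4096))   # False: 16 KB direct-mapped needs VA bits
    print(overlap_ok(16384, 4, 4096))   # True:  4-way associativity fixes it
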
All together!

[Figure: the complete path -- CPU, TLB, cache, and main memory -- combined in one diagram.]

• Virtual memory provides:
  – protection
  – sharing
  – performance
  – the illusion of a large main memory
• Virtual memory would require twice as many memory accesses, so we cache page-table entries in the TLB.
• Four things can go wrong on a memory access: a cache miss; a TLB miss (but the page table entry is in the cache); a TLB miss with the page table entry not in the cache; a page fault.

Summary

• The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
  – Temporal locality: locality in time.
  – Spatial locality: locality in space.
• Three major categories of cache misses:
  – Compulsory misses: sad facts of life; example: cold-start misses.
  – Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
  – Capacity misses: increase cache size.
• Cache design space:
  – total size, block size, associativity
  – replacement policy
  – write-hit policy (write-through, write-back)
  – write-miss policy

Summary

• Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) Which block is replaced on a miss? 4) How are writes handled?
• Page tables map virtual addresses to physical addresses.
• TLBs are important for fast translation.
• TLB misses are significant in processor performance (funny times, as most systems can't access all of the 2nd-level cache without TLB misses!).
• Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
  – 1000X DRAM growth removed the controversy.
• Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection is more important than the memory hierarchy.
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean for compilers, data structures, and algorithms?