The Story so far:

• Instruction Set Architectures
• Performance issues
• ALUs
• Single-cycle CPU
• Multicycle CPU: datapath, control, exceptions
• Pipelining
• Memory systems
  – Static/Dynamic RAM technologies
  – Cache structures: direct mapped, associative
  – Virtual memory

Memory systems

Memory Systems

[Figure: the five classic components of a computer -- control, datapath, memory, input, output -- with memory highlighted.]

Technology Trends (from 1st lecture)

DRAM generations:

  Year   Size     Cycle Time
  1980   64 Kb    250 ns
  1983   256 Kb   220 ns
  1986   1 Mb     190 ns
  1989   4 Mb     165 ns
  1992   16 Mb    145 ns
  1995   64 Mb    120 ns

           Capacity         Speed (latency)
  Logic:   2x in 3 years    2x in 3 years
  DRAM:    4x in 3 years    2x in 10 years
  Disk:    4x in 3 years    2x in 10 years

Over this span, capacity improved 1000:1 while speed improved only 2:1.

Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Plot: performance vs. year, 1980-2000, log scale from 1 to 1000. The µProc curve ("Moore's Law") grows 60%/yr (2X/1.5 yr); the DRAM curve grows 9%/yr (2X/10 yrs). The processor-memory performance gap grows about 50% per year.]

Today's Situation: Microprocessor

• Rely on caches to bridge the gap.
• Microprocessor-DRAM performance gap
  – time of a full cache miss, in instructions executed:
    1st Alpha (7000):   340 ns / 5.0 ns =  68 clks x 2 or 136 instructions
    2nd Alpha (8400):   266 ns / 3.3 ns =  80 clks x 4 or 320 instructions
    3rd Alpha (t.b.d.): 180 ns / 1.7 ns = 108 clks x 6 or 648 instructions
  – 1/2X latency x 3X clock rate x 3X instr/clock => ~5X more instructions per miss
• Processor speeds -- multi-GHz (Pentium 4-class parts ran around 3 GHz in 2003)
• Memory speeds -- Mac/PC/workstation DRAM: ~50 ns; disks are even slower...
• How can we span this "access time" gap?
  – 1 instruction fetch per instruction
  – 15/100 instructions also do a data read or write (load or store)

Impact on Performance

• Suppose a processor executes at:
  – Clock rate = 200 MHz (5 ns per cycle)
  – CPI = 1.1
  – 50% arith/logic, 30% ld/st, 20% control
• Suppose that 10% of memory ops get a 50-cycle miss penalty.

[Pie chart of where cycles go: Ideal CPI 1.1 (35%), Data Miss 1.5 (49%), Inst Miss 0.5 (16%).]

• CPI = ideal CPI + average stalls per instruction
      = 1.1 (cyc) + (0.30 (data mops/ins) x 0.10 (miss/data mop) x 50 (cycle/miss))
      = 1.1 cycle + 1.5 cycle = 2.6
• 58% of the time the processor is stalled waiting for memory!
• A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
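The arithmetic is worth checking mechanically. A minimal Python sketch, using only the numbers assumed on this slide:

    # CPI including data-miss stalls, per the slide's assumptions.
    ideal_cpi    = 1.1
    ld_st_frac   = 0.30   # fraction of instructions that are loads/stores
    miss_rate    = 0.10   # fraction of memory ops that miss
    miss_penalty = 50     # cycles per miss

    data_stalls = ld_st_frac * miss_rate * miss_penalty   # 1.5 cycles/instr
    cpi = ideal_cpi + data_stalls                         # 2.6
    print(f"CPI = {cpi:.1f}; stalled {data_stalls / cpi:.0%} of the time")
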
Memory system - hierarchical

[Figure: processor (control + datapath) backed by a chain of memories, each larger and slower than the last.]

  Speed:     fastest ........... slowest
  Size:      smallest .......... biggest
  Cost/bit:  highest ........... lowest

Why hierarchy works

• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.

[Figure: probability of reference vs. address space (0 to 2^n - 1), sharply peaked around a few regions.]

Locality

• A property of memory references in "typical programs": they tend to favor a portion of their address space at any given time.
• Temporal -- tendency to reference locations recently referenced.
• Spatial -- tendency to reference locations "near" those recently referenced.

[Figure: references plotted against the address space over time; after touching x, the program is likely to reference x again soon, and to reference the zone of addresses around x.]

Memory Hierarchy: How Does it Work?

• Temporal Locality (locality in time):
  => keep most recently accessed data items closer to the processor.
• Spatial Locality (locality in space):
  => move blocks consisting of contiguous words to the upper levels.

[Figure: upper-level memory holds block X close to the processor; block Y sits in lower-level memory; blocks move between the levels.]

Memory Hierarchy: How Does it Work?

• Memory hierarchies exploit locality by caching (keeping close to the processor) data likely to be used again.
• This is done because we can build large, slow memories and small, fast memories, but we can't build large, fast memories.
• If it works, we get the illusion of SRAM access time with disk capacity.
  – SRAM access times are 2-25 ns, at a cost of $100 to $250 per MByte.
  – DRAM access times are 60-120 ns, at a cost of $5 to $10 per MByte.
  – Disk access times are 10 to 20 million ns, at a cost of $0.10 to $0.20 per MByte.

Memory Hierarchy: Terminology

• Hit: data appears in some block in the upper level (example: block X).
  – Hit rate: the fraction of memory accesses found in the upper level.
  – Hit time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: data must be retrieved from a block in the lower level (block Y).
  – Miss rate = 1 - (hit rate).
  – Miss penalty: time to replace a block in the upper level + time to deliver the block to the processor.
• Hit time << miss penalty.

Memory Hierarchy of a Modern Computer System

• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.

[Figure: processor (registers, on-chip cache) -> second-level cache (SRAM) -> main memory (DRAM) -> secondary storage (disk) -> tertiary storage (disk/tape). Speed (ns): 1s / 10s / 100s / 10,000,000s (10s ms) / 10,000,000,000s (10s sec). Size (bytes): 100s / Ks / Ms / Gs / Ts.]

Main Memory Background

• Performance of main memory:
  – Latency: cache miss penalty.
    • Access time: time between the request and the word arriving.
    • Cycle time: time between requests.
  – Bandwidth: I/O & large-block miss penalty (L2).
• Main memory is DRAM: Dynamic Random Access Memory.
  – Dynamic, since it needs to be refreshed periodically (~8 ms).
  – Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS, or Row Access Strobe
    • CAS, or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory.
  – No refresh (6 transistors/bit vs. 1 transistor/bit).
  – Address not divided.
• Size ratio DRAM/SRAM: 4-8x; cost and cycle-time ratio SRAM/DRAM: 8-16x.

Static RAM Cell

[Figure: 6-transistor SRAM cell -- cross-coupled inverters hold the stored bit; two access transistors connect the cell to the bit and bit-bar lines under control of the word (row select) line. In some designs the PMOS loads are replaced with pullups to save area.]

• Write:
  1. Drive the bit lines (bit = 1, bit-bar = 0).
  2. Select the row.
• Read:
  1. Precharge bit and bit-bar to Vdd.
  2. Select the row.
  3. The cell pulls one line low.
  4. A sense amp on the column detects the difference between bit and bit-bar.

Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16 x 4 array of SRAM cells. An address decoder on A0-A3 drives word lines Word 0 .. Word 15. Each of the four bit columns has a write driver & precharger at the top and a sense amp at the bottom, taking Din 0-3 and producing Dout 0-3 under control of WrEn and Precharge.]

Problems with SRAM

• Six transistors use up a lot of area.
• Consider a "zero" stored in the cell:
  – Transistor N1 will try to pull "bit" to 0.
  – Transistor P2 will try to pull "bit bar" to 1.
• But the bit lines are precharged high: are P1 and P2 really necessary?

[Figure: the cell storing 0 with Select = 1; bit = 1, bit-bar = 0; N1 and P2 on, N2 and P1 off.]

Logic Diagram of a Typical SRAM

[Figure: a 2^N-word x M-bit SRAM with N address pins (A), M data pins (D), write enable (WE_L), and output enable (OE_L).]

• Write Enable is usually active low (WE_L).
• Din and Dout are combined on one set of pins (D) to save pins, controlled by a separate output enable (OE_L):
  – WE_L asserted, OE_L deasserted => D is the data input pin.
  – WE_L deasserted, OE_L asserted => D is the data output pin.

[Timing diagrams. Write: the write address is presented on A, data in on D, and WE_L is pulsed, with a write setup time before and a write hold time after the pulse. Read: the read address is presented on A with OE_L asserted; D leaves high-Z and data out becomes valid after the read access time.]

1-Transistor Memory Cell (DRAM)

[Figure: a single access transistor, gated by the row select line, connects the bit line to a storage capacitor.]

• Write:
  1. Drive the bit line.
  2. Select the row.
• Read:
  1. Precharge the bit line to Vdd.
  2. Select the row.
  3. The cell and bit line share charges.
     • Very small voltage changes on the bit line.
  4. Sense (fancy sense amp).
     • Can detect changes of ~1 million electrons.
  5. Write: restore the value (the read is destructive).
• Refresh:
  1. Just do a dummy read to every cell.

Classical DRAM Organization (square)

[Figure: a square RAM cell array. A row decoder, driven by the row address, selects one word (row) line; a column selector & I/O circuits block, driven by the column address, picks data off the bit (data) lines. Each intersection represents a 1-T DRAM cell.]

• Row and column address together select 1 bit at a time.

Logic Diagram of a Typical DRAM

[Figure: a 256K x 8 DRAM with 9 multiplexed address pins (A), 8 data pins (D), and control pins RAS_L, CAS_L, WE_L, OE_L.]

• Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low.
• Din and Dout are combined (D):
  – WE_L asserted (low), OE_L deasserted (high) => D serves as the data input pin.
  – WE_L deasserted (high), OE_L asserted (low) => D is the data output pin.
• Row and column addresses share the same pins (A):
  – RAS_L goes low: pins A are latched in as the row address.
  – CAS_L goes low: pins A are latched in as the column address.
  – RAS/CAS are edge-sensitive.

Key DRAM Timing Parameters

• tRAC: minimum time from RAS falling to valid data output.
  – Quoted as the speed of a DRAM; a fast 4 Mb DRAM has tRAC = 60 ns.
• tRC: minimum time from the start of one row access to the start of the next.
  – tRC = 110 ns for a 4 Mb DRAM with a tRAC of 60 ns.
• tCAC: minimum time from CAS falling to valid data output.
  – 15 ns for a 4 Mb DRAM with a tRAC of 60 ns.
• tPC: minimum time from the start of one column access to the start of the next.
  – 35 ns for a 4 Mb DRAM with a tRAC of 60 ns.

DRAM Performance

• A 60 ns (tRAC) DRAM can:
  – perform a row access only every 110 ns (tRC);
  – perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
    • In practice, external address delays and bus turnaround make it 40 to 50 ns.
• These times do not include the time to drive the addresses off the microprocessor, nor the memory-controller overhead.
  – Driving parallel DRAMs, external memory controller, bus turnaround, SIMM module, pins...
  – 180 ns to 250 ns latency from processor to memory is good for a "60 ns" (tRAC) DRAM.
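To see what these parameters imply for raw bandwidth, here is a rough Python sketch; the x8 part width comes from the 256K x 8 logic diagram above, and the framing is illustrative:

    # Peak bandwidth of an x8 DRAM part, from the slide's timing parameters.
    t_rc = 110e-9          # row cycle time (s)
    t_pc = 35e-9           # page-mode column cycle time (s)
    bytes_per_access = 1   # a 256K x 8 part delivers 8 bits per access

    print(f"new row every access: {bytes_per_access / t_rc / 1e6:.1f} MB/s")
    print(f"same-row accesses:    {bytes_per_access / t_pc / 1e6:.1f} MB/s")
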
DRAM Write Timing

[Timing diagram for the 256K x 8 DRAM: RAS_L falls with the row address on A; CAS_L then falls with the column address; data in is driven on D. Two write cycles are shown -- an early write cycle (WE_L asserted before CAS_L) and a late write cycle (WE_L asserted after CAS_L) -- annotated with the WR access time and the DRAM WR cycle time.]

• Every DRAM access begins with the assertion of RAS_L.
• 2 ways to write: early or late relative to CAS.

DRAM Read Timing

[Timing diagram for the 256K x 8 DRAM: RAS_L falls with the row address; CAS_L falls with the column address; D goes from high-Z to data out after the read access time and output-enable delay. Two read cycles are shown -- an early read cycle (OE_L asserted before CAS_L) and a late read cycle (OE_L asserted after CAS_L) -- annotated with the DRAM read cycle time.]

• Every DRAM access begins with the assertion of RAS_L.
• 2 ways to read: early or late relative to CAS.

Cycle Time versus Access Time

[Timeline: each access occupies an access time; successive accesses must be separated by the (longer) cycle time.]

• DRAM (read/write) cycle time >> DRAM (read/write) access time -- roughly 2:1; why?
• DRAM (read/write) cycle time:
  – How frequently can you initiate an access?
  – Analogy: a little kid can only ask his father for money on Saturday.
• DRAM (read/write) access time:
  – How quickly will you get what you want once you initiate an access?
  – Analogy: as soon as he asks, his father will give him the money.
• DRAM bandwidth limitation analogy:
  – What happens if he runs out of money on Wednesday?

Increasing Bandwidth - Interleaving

Access pattern without interleaving: the CPU starts the access for D1, waits the full memory cycle until D1 is available, and only then starts the access for D2.

Access pattern with 4-way interleaving: the CPU accesses bank 0, then banks 1, 2, and 3 on successive cycles; by the time bank 3 has been started, bank 0 can be accessed again.

[Figure: CPU connected to memory banks 0-3, with the four banks' access timelines staggered.]
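A toy Python model of the two access patterns; the 8-cycle bank busy time and 1-cycle transfer are invented for illustration:

    # Cycles to fetch n consecutive words with and without bank interleaving.
    BANK_BUSY = 8   # cycles a bank is tied up per access (assumed)
    TRANSFER  = 1   # cycles to deliver one word to the CPU (assumed)

    def fetch_cycles(n_words, n_banks):
        busy_until = [0] * n_banks           # when each bank is free again
        t = 0
        for i in range(n_words):
            b = i % n_banks                  # word i lives in bank i mod n_banks
            t = max(t, busy_until[b])        # wait if that bank is still busy
            busy_until[b] = t + BANK_BUSY
            t += TRANSFER
        return t

    print(fetch_cycles(8, 1))   # 57 cycles: every access waits out the bank
    print(fetch_cycles(8, 4))   # 12 cycles: banks 1-3 overlap bank 0's busy time
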
Fewer DRAMs/System over Time

Minimum PC memory size vs. DRAM generation ('86 1 Mb, '89 4 Mb, '92 16 Mb, '96 64 Mb, '99 256 Mb, '02 1 Gb), as the number of chips needed in two successive generations:

  4 MB:    32 x 1 Mb    or  8 x 4 Mb
  8 MB:    16 x 4 Mb    or  4 x 16 Mb
  16 MB:    8 x 16 Mb   or  2 x 64 Mb
  32 MB:    4 x 64 Mb   or  1 x 256 Mb
  64 MB:    8 x 64 Mb   or  2 x 256 Mb
  128 MB:   4 x 256 Mb  or  1 x 1 Gb
  256 MB:   8 x 256 Mb  or  2 x 1 Gb

• Memory per system grows at 25%-30% / year; memory per DRAM grows at 60% / year.
• Increasing DRAM capacity => fewer chips => harder to have banks.

Page Mode DRAM: Motivation

• Regular DRAM organization:
  – N rows x N columns x M bits.
  – Read & write M bits at a time.
  – Each M-bit access requires a full RAS/CAS cycle.
• Fast Page Mode DRAM adds an N x M "register" to save a row.

[Timing diagram: two M-bit accesses, each needing RAS_L with the row address followed by CAS_L with the column address. Figure: N-row x N-column DRAM array with row address, column address, and M-bit output.]

Fast Page Mode Operation

• Fast Page Mode DRAM: an N x M "SRAM" saves a row.
• After a row is read into the register:
  – Only CAS is needed to access other M-bit blocks on that row.
  – RAS_L remains asserted while CAS_L is toggled.

[Timing diagram: one RAS_L assertion with the row address, then four CAS_L pulses with successive column addresses, yielding the 1st through 4th M-bit accesses. Figure: the DRAM array with the N x M "SRAM" row register between the array and the M-bit output.]
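A back-of-the-envelope Python comparison using the parameters from the "Key DRAM Timing Parameters" slide (tRC = 110 ns, tRAC = 60 ns, tPC = 35 ns):

    # Time to read 4 M-bit blocks that happen to lie in the same row.
    t_rc, t_rac, t_pc = 110, 60, 35   # ns

    without_page_mode = 4 * t_rc            # a full RAS/CAS cycle per block
    with_page_mode    = t_rac + 3 * t_pc    # open the row once, then CAS-only
    print(without_page_mode, "ns vs.", with_page_mode, "ns")   # 440 vs. 165
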
DRAMs over Time

  DRAM generation:          1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
  1st gen. sample           '84     '87     '90     '93     '96     '99
  Die size (mm2)            55      85      130     200     300     450
  Memory area (mm2)         30      47      72      110     165     250
  Memory cell area (µm2)    28.84   11.1    4.26    1.64    0.61    0.23

Memory - Summary

• Two different types of locality:
  – Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  – Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
• By taking advantage of the principle of locality:
  – Present the user with as much memory as is available in the cheapest technology.
  – Provide access at the speed offered by the fastest technology.
• DRAM is slow but cheap and dense:
  – A good choice for presenting the user with a BIG memory system.
• SRAM is fast but expensive and not very dense:
  – A good choice for providing the user FAST access time.

Cache Fundamentals

• cache hit -- an access where the data is found in the cache.
• cache miss -- an access where it isn't.
• hit time -- time to access the cache.
• miss penalty -- time to move data from the further level to the closer one, then to the cpu.
• hit ratio -- percentage of accesses where the data is found in the cache.
• miss ratio -- (1 - hit ratio).
• cache block size or cache line size -- the amount of data that gets transferred on a cache miss.
• instruction cache -- a cache that only holds instructions.
• data cache -- a cache that only caches data.
• unified cache -- a cache that holds both.

[Figure: cpu <-> lowest-level cache <-> next-level memory/cache.]

Caching Issues

On a memory access --
• How do I know if this is a hit or a miss?

On a cache miss --
• Where do I put the new data?
• What data do I throw out?
• How do I remember what data this is?

[Figure: cpu -> lowest-level cache; on a miss, the access continues to the next-level memory/cache.]

A simple cache: similar to the branch prediction table in pipelining?

• A cache that can put a line of data in exactly one place is called direct-mapped.
• An index is used to determine which line an address might be found in; a tag holds the rest of the address.

[Figure: a 4-entry tag/data array; each block holds one word, and each word in memory maps to exactly one cache location. Reference address string (decimal, then binary): 4 00000100, 8 00001000, 12 00001100, 4 00000100, 8 00001000, 20 00010100, 4 00000100, 8 00001000, 20 00010100, 24 00011000, 12 00001100, 8 00001000, 4 00000100.]

• Conflict misses are misses caused by different memory locations mapped to the same cache index.
  – Solution 1: make the cache bigger.
  – Solution 2: multiple entries for the same cache index.
• Conflict misses = 0, by definition, for fully associative caches.
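To make the example concrete, here is a minimal Python sketch of this 4-entry direct-mapped cache replaying the slide's address string (the code is illustrative, but the hit/miss sequence is exactly what the figure steps through):

    # Direct-mapped cache: 4 blocks, one 4-byte word per block.
    NUM_BLOCKS = 4
    tags = [None] * NUM_BLOCKS

    for addr in [4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4]:
        block = addr // 4                # word address
        index = block % NUM_BLOCKS       # the one line this address can use
        tag   = block // NUM_BLOCKS      # the rest of the address
        hit = tags[index] == tag
        if not hit:
            tags[index] = tag            # replace whatever was there
        print(f"addr {addr:2d}: index {index}, tag {tag}, {'hit' if hit else 'miss'}")

Addresses 4 and 20 share index 1, so they keep evicting each other: those are the conflict misses described above.
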
Fully associative cache

• A cache that can put a line of data anywhere is called fully associative.
• The most popular replacement strategy is LRU (least recently used).
• How do you find the data? The tag identifies the full address of the cached data.

[Figure: a 4-entry tag/data array; each block holds one word, and any block can hold any word. Same reference address string as before.]

Fully associative cache

• Forget about the cache index:
  – Compare the cache tags of all cache entries in parallel.
  – Example: with 32-byte blocks and 32-bit addresses, we need N 27-bit comparators.
• By definition: conflict misses = 0 for a fully associative cache.

[Figure: each entry has a valid bit, a 27-bit cache tag, and 32 bytes of cache data (bytes 0-31, 32-63, ...). The incoming address splits into a cache tag (bits 31-5), compared against every entry at once, and a byte select (bits 4-0, e.g. 0x01).]

An n-way set associative cache

• A cache that can put a line of data in exactly n places is called n-way set-associative.
• The cache lines that share the same index are a cache set.

[Figure: two tag/data arrays side by side; 4 entries in all, each block holds one word, and each word in memory maps to one of a set of n cache lines. Same reference address string as before.]

An n-way set associative cache

• N-way set associative: N entries for each cache index.
  – N direct-mapped caches operating in parallel.
• Example: two-way set associative cache.
  – The cache index selects a "set" from the cache.
  – The two tags in the set are compared in parallel.
  – Data is selected based on the tag result.

[Figure: two banks of (valid, cache tag, cache data) entries indexed by the cache index. Each bank's tag is compared against the address tag; the compare results are ORed into Hit and drive a mux (Sel1/Sel0) that picks the cache block.]
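The same experiment for an n-way set-associative cache with LRU replacement, as a Python sketch (the geometry and address string are the slides'; the code itself is illustrative):

    # n-way set-associative cache, LRU replacement, one 4-byte word per block.
    def run(addresses, num_blocks=4, ways=2):
        num_sets = num_blocks // ways
        sets = [[] for _ in range(num_sets)]   # per set: tags, most recent last
        for addr in addresses:
            block = addr // 4
            index = block % num_sets
            tag   = block // num_sets
            tags = sets[index]
            hit = tag in tags
            if hit:
                tags.remove(tag)          # re-appended below as most recent
            elif len(tags) == ways:
                tags.pop(0)               # evict the least recently used
            tags.append(tag)
            print(f"addr {addr:2d}: set {index}, tag {tag}, {'hit' if hit else 'miss'}")

    run([4, 8, 12, 4, 8, 20, 4, 8, 20, 24, 12, 8, 4])
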
Direct vs. Set Associative Caches

• N-way set associative cache versus direct mapped cache:
  – N comparators vs. 1.
  – Extra MUX delay for the data.
  – Data comes AFTER the hit/miss decision and set selection.
• In a direct mapped cache, the cache block is available BEFORE hit/miss:
  – Possible to assume a hit and continue; recover later if it was a miss.

[Figure: the same two-way set-associative datapath as on the previous slide.]

Longer cache blocks

• Large cache blocks take advantage of spatial locality.
• Too large a block size can waste cache space.
• Longer cache blocks require less tag space.

[Figure: a 4-entry tag/data array where each block holds two words; each word in memory maps to exactly one cache location (this cache is twice the total size of the prior caches). Same reference address string as before.]

Increasing block size

[Plot: miss rate (0%-40%) vs. block size (4 to 256 bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB. Miss rate falls as blocks grow, then rises again when blocks get large relative to a small cache.]

• In general, larger block sizes take advantage of spatial locality, BUT:
  – Larger block size means larger miss penalty:
    • it takes longer to fill up the block.
  – If the block size is too big relative to the cache size, the miss rate will go up:
    • too few cache blocks.
• In general, average access time:
  – = Hit Time x (1 - Miss Rate) + Miss Penalty x Miss Rate
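The average-access-time formula in executable form; the numbers below are invented but era-typical (1-cycle hit, 5% miss rate, 40-cycle total miss time):

    # Average access time, using the slide's formulation: the miss penalty here
    # is the total time for a miss, not the extra time on top of a hit.
    hit_time, miss_rate, miss_penalty = 1, 0.05, 40   # cycles, fraction, cycles

    amat = hit_time * (1 - miss_rate) + miss_penalty * miss_rate
    print(f"average access time = {amat:.2f} cycles")   # 2.95
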
Sources of Cache Misses

• Compulsory (cold start or process migration; first reference): the first access to a block.
  – A "cold" fact of life: not a whole lot you can do about it.
  – Note: if you are going to run "billions" of instructions, compulsory misses are insignificant.
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location.
  – Solution 1: increase cache size.
  – Solution 2: increase associativity.
• Capacity:
  – The cache cannot contain all the blocks accessed by the program.
  – Solution: increase cache size.
• Invalidation: another process (e.g., I/O) updates memory.

Sources of Cache Misses

                      Direct Mapped   N-way Set Associative   Fully Associative
  Cache size          Big             Medium                  Small
  Compulsory miss     Same            Same                    Same
  Conflict miss       High            Medium                  Zero
  Capacity miss       Low             Medium                  High
  Invalidation miss   Same            Same                    Same

Accessing a cache

1. Use the index and tag to access the cache and determine hit/miss.
2. If hit, return the requested data.
3. If miss, select a cache block to be replaced, and access memory or the next lower cache (possibly stalling the processor):
   - load the entire missed cache line into the cache;
   - return the requested data to the CPU (or the higher cache).
4. If the next lower memory is a cache, go to step 1 for that cache.

[Figure: the five-stage pipeline (IF ID EX MEM WB), with the ICache feeding instruction fetch and the DCache sitting in the MEM stage.]

Accessing a cache

• 64 KB cache, direct-mapped, 32-byte cache blocks.

[Figure: the 32-bit address splits into a 16-bit tag (bits 31-16), an 11-bit index (bits 15-5), a 3-bit word offset, and a 2-bit byte offset. 64 KB / 32 bytes = 2K cache blocks/sets (rows 0..2047). The indexed entry's valid bit and 16-bit tag feed a comparator against the address tag to produce hit/miss; the 256-bit data block feeds a word-select mux that yields the 32-bit word.]
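These widths fall straight out of the cache geometry. A small Python helper (hypothetical, not from the slides) reproduces them for this cache and for the 2-way cache on the next slide:

    # Tag/index/offset widths for a set-associative cache (sizes: powers of 2).
    def cache_fields(cache_bytes, block_bytes, ways=1, addr_bits=32):
        sets = cache_bytes // (block_bytes * ways)
        offset_bits = block_bytes.bit_length() - 1   # log2
        index_bits  = sets.bit_length() - 1
        return addr_bits - index_bits - offset_bits, index_bits, offset_bits

    print(cache_fields(64 * 1024, 32))           # (16, 11, 5): this slide
    print(cache_fields(32 * 1024, 16, ways=2))   # (18, 10, 4): the next slide
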
Accessing a cache

• 32 KB cache, 2-way set-associative, 16-byte blocks (cache lines).

[Figure: the 32-bit address splits into an 18-bit tag, a 10-bit index, a 2-bit word offset, and a 2-bit byte offset. 32 KB / 16 bytes / 2 = 1K cache sets (rows 0..1023). Each set holds two (valid, tag, data) entries; both tags are compared against the address tag in parallel to produce hit/miss.]

Cache Alignment

[Figure: the memory address split into tag, index, and block offset; memory drawn as consecutive cache-line-sized pieces 0, 1, 2, ...]

• The data that gets moved into the cache on a miss are all the data whose addresses share the same tag and index (regardless of which datum was accessed first).
• This results in:
  – no overlap of cache lines;
  – easy mapping of addresses to cache lines (no additions);
  – the data at address X always being present in the same location within the cache block (at byte X mod blocksize), if it is there at all.
• Think of memory as organized into cache-line-sized pieces (because in reality, it is!).
• Recall the DRAM page-mode architecture!!

Basic Memory Hierarchy questions

• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

• Placement: the cache structure determines the "where":
  – Fully associative, direct mapped, 2-way set associative.
  – Set-associative mapping = block number modulo number of sets.
• Identification: tag on each block.
  – No need to check the index or block offset.
• Replacement: easy for direct mapped. For set associative or fully associative:
  – Random
  – LRU (least recently used)

What about writes?

• Stores must be handled differently than loads, because...
  – they don't necessarily require the CPU to stall;
  – they change the content of cache/memory (creating memory consistency issues);
  – they may require both a load and a store to complete.

What about writes?

• On a store hit, write the new data to the cache. In a write-through cache, also write the data immediately to memory. In a write-back cache, mark the line as dirty.
• On a store miss, initiate a cache-block load from memory for a write-allocate cache; write directly to memory for a write-around cache.
• On any kind of cache miss in a write-back cache, if the line to be replaced is dirty, write it back to memory.
• Keep memory and cache identical?
  – write-through => all writes go to both cache and main memory.
  – write-back => writes go only to the cache; modified cache lines are written back to memory when the line is replaced.
• Make room in the cache on a store miss?
  – write-allocate => on a store miss, bring the written line into the cache.
  – write-around => on a store miss, ignore the cache.
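As a concrete reading of those policies, here is a hypothetical Python sketch of the store path for a write-back, write-allocate cache (the structure and helper names are invented for illustration):

    # Store path of a write-back, write-allocate cache (direct-mapped for brevity).
    lines = {}   # index -> {"tag": ..., "dirty": ..., "data": ...}

    def store(index, tag, data, write_back, load_line):
        line = lines.get(index)
        if line is None or line["tag"] != tag:      # store miss
            if line is not None and line["dirty"]:
                write_back(index, line)             # evict the dirty victim first
            line = {"tag": tag, "dirty": False,
                    "data": load_line(index, tag)}  # write-allocate: fetch the line
            lines[index] = line
        line["data"] = data    # the store itself goes only to the cache...
        line["dirty"] = True   # ...memory is updated later, at eviction

    store(3, 0x1A, 42, write_back=lambda i, l: None, load_line=lambda i, t: 0)

A write-through cache would instead forward every store to memory and needs no dirty bit; a write-around cache would skip the allocation on a miss.
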
Cache Performance

TCPI = BCPI + MCPI, where:
  – TCPI = total CPI;
  – BCPI = base CPI, i.e. the CPI assuming perfect memory;
  – MCPI = the memory CPI, the number of cycles (per instruction) the processor is stalled waiting for memory.

MCPI = accesses/instruction x miss rate x miss penalty
  – This assumes we stall the pipeline on both read and write misses, that the miss penalty is the same for both, and that cache hits require no stalls.
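The same formula in executable form (the inputs are illustrative assumptions, not measurements from the slides):

    # TCPI = BCPI + MCPI, with MCPI = accesses/instr * miss rate * miss penalty.
    bcpi = 1.1
    accesses_per_instr = 1.3   # 1 fetch + 0.3 data references (assumed)
    miss_rate = 0.02           # (assumed)
    miss_penalty = 50          # cycles (assumed)

    mcpi = accesses_per_instr * miss_rate * miss_penalty
    print(f"TCPI = {bcpi + mcpi:.2f}")   # 1.1 + 1.3 = 2.40
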
Example -- DEC Alpha 21164 Caches

[Figure: the 21164 CPU core with split instruction and data caches, a unified on-chip L2 cache, and an off-chip L3 cache.]

• ICache and DCache -- 8 KB, direct-mapped, 32-byte lines.
• L2 cache -- 96 KB, 3-way set-associative, 32-byte lines.
• L3 cache -- 1 MB, direct-mapped, 32-byte lines.

DEC Alpha 21164's L2 cache: 96 KB, 3-way set associative, 32-byte lines, 64-bit addresses.

  memory address:  tag | index | offset

How many "offset" bits? How many "index" bits? How many "tag" bits? Draw the cache picture -- how do you tell if it's a hit?

• 8 KB cache, direct-mapped, 32-byte lines.

[Figure: the 32-bit address splits into a 19-bit tag, an 8-bit index, a 3-bit word offset, and a 2-bit byte offset. 8 KB / 32 bytes = 256 cache blocks/sets (rows 0..255); the indexed entry's valid bit and tag feed a comparator against the address tag to give hit/miss and a 32-bit word.]

• 96 KB cache, 3-way set-associative, 32-byte blocks (cache lines).
  => each block (way) holds 32 bytes
  => 3 bits locate the right 32-bit word within a block, plus 2 byte-offset bits
  => each "index" has 3 blocks (3-way): in effect, three places the data could go
  => 96 bytes per "row"
  => 96 KB / 96 B = 1K rows
  => index = 10 bits
  => tag = 32 - (10 + 3 + 2) = 17 bits
     (using 32-bit addresses for simplicity; with the Alpha's 64-bit addresses the tag would be 49 bits)

Pentium 4: Trace Cache and Instruction Decode

[Figure: the Pentium 4 front end, where the trace cache holds already-decoded micro-ops past the instruction decoder.]

Remember micro-instructions??

Power4

[Figure: the POWER4 chip holds two processor cores. Each core has a 64 KB I-cache and a 32 KB D-cache (128-byte lines, 2-way set-associative, write-through). The cores share an on-chip L2 of 1.44 MB (128-byte lines, 8-way set-associative, write-back) and an off-chip L3 of 128 MB (512-byte lines, 8-way set-associative, write-back). A multi-chip module combines chips to provide processors P0 through P7.]

Real Caches

• Often direct-mapped at the highest level (closest to the CPU), associative further away.
• Split I and D caches close to the processor; unified further away (for throughput rather than miss rate).
• Write-through and write-back are both common, but never write-through all the way to memory.
• 32-byte cache lines are very common.
• Non-blocking:
  – The processor doesn't stall on a miss, but only on the use of a miss (if even then).
  – This means the cache must be able to handle multiple outstanding accesses.

Recall: Levels of the Memory Hierarchy

  Level           Capacity     Access time    Cost                   Staging/xfer unit (managed by)
  CPU registers   100s bytes   <10s ns        --                     instr. operands, 1-8 bytes (prog./compiler)
  Cache           K bytes      10-100 ns      $.01-.001/bit          blocks, 8-128 bytes (cache cntl)
  Main memory     M bytes      100 ns-1 us    $.01-.001              pages, 512-4K bytes (OS)
  Disk            G bytes      ms             10^-3 to 10^-4 cents   files, Mbytes (user/operator)
  Tape            infinite     sec-min        10^-6 cents            --

Upper levels are faster; lower levels are larger.

Virtual Memory

• Main memory can act as a cache for the secondary storage (disk).
• Advantages:
  – the illusion of having more physical memory
  – program relocation
  – protection

[Figure: address translation maps virtual addresses either to physical addresses in main memory or to disk addresses.]

Virtual Memory

• ...is just caching, but uses different terminology:

  cache        VM
  block        page
  cache miss   page fault
  address      virtual address
  index        physical address (sort of)

• What happens if another program on the processor uses the same addresses that yours does?
• What happens if your program uses addresses that don't exist in the machine?
• What happens to "holes" in the address space your program uses?
• So, virtual memory provides:
  – performance (through the caching effect)
  – protection
  – ease of programming/compilation
  – efficient use of memory

Virtual Memory

• ...is just a mapping function from virtual memory addresses to physical memory locations, which allows caching of virtual pages in physical memory.
• Why is it different from an ordinary cache?
  – MUCH higher miss penalty (millions of cycles)! Therefore:
  – large pages [the equivalent of a cache line] (4 KB to MBs);
  – associative mapping of pages (typically fully associative);
  – software handling of misses (but not hits!!);
  – write-through is not an option, only write-back.

Page Tables

• The page table maps virtual memory addresses -- virtual page numbers, the high-order bits of the virtual address -- to physical memory locations.
  – A page may be in main memory or may be on disk; the "valid bit" says which.

Page Tables

[Figure: the virtual address splits into a virtual page number and a page offset. The page table register -- a hardware register that tells the VM system where to find the page table -- points at the table; the virtual page number indexes it, yielding a valid bit and a physical page number; the physical page number concatenated with the page offset forms the physical address.]

• Why don't we need a "tag" field?
• How does the operating system switch to a new process???
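A minimal Python sketch of that lookup, assuming 4 KB pages and a simple one-level table (everything here is invented for illustration):

    # Virtual-to-physical translation through a one-level page table.
    PAGE_SIZE = 4096   # 4 KB pages => 12-bit page offset

    page_table = [(True, 7), (False, 0), (True, 3)]   # vpn -> (valid, ppn)

    def translate(vaddr):
        vpn    = vaddr // PAGE_SIZE   # virtual page number: it indexes the
        offset = vaddr %  PAGE_SIZE   # table directly, so no tag is needed
        valid, ppn = page_table[vpn]
        if not valid:
            raise RuntimeError("page fault: OS must fetch the page from disk")
        return ppn * PAGE_SIZE + offset

    print(hex(translate(0x2123)))     # vpn 2 -> ppn 3, so 0x3123
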
Virtual Address and a Cache

[Figure: CPU -> translation (VA to PA) -> cache; hits return data, misses go to main memory.]

• It takes an extra memory access to translate a VA to a PA.
• This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.
• ASIDE: why access the cache with a PA at all? VA caches have a problem!
  – Synonym/alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
  – On an update, you must update all cache entries with the same physical address, or memory becomes inconsistent.
  – Determining this requires significant hardware -- essentially an associative lookup on the physical address tags to see if you have multiple hits; or
  – a software-enforced alias boundary: VA and PA must agree in their low-order bits up to the cache size.

TLBs

A way to speed up translation is to use a special cache of recently used page-table entries -- this has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.

[TLB entry: virtual address | physical address | dirty | ref | valid | access]

TLB access time is comparable to cache access time (much less than main memory access time).

Translation lookaside buffer (TLB): a hardware cache for the page table. Typical size: 256 entries, 1- to 4-way set associative.

Translation Look-Aside Buffers

• Just like any other cache, the TLB can be organized as fully associative, set associative, or direct mapped.
• TLBs are usually small -- typically not more than 128-256 entries even on high-end machines. This permits a fully associative lookup on those machines. Most mid-range machines use small n-way set-associative organizations.

[Figure: translation with a TLB. The CPU presents a VA to the TLB lookup; on a TLB hit the PA goes straight to the cache, and a cache hit returns data; on a TLB miss the full translation is performed and the TLB refilled; a cache miss goes to main memory. Typical relative times: cache access ~1/2 t, TLB lookup ~t, full translation ~20 t.]

Translation Look-Aside Buffers

• Machines with TLBs go one step further to reduce cycles/cache access: they overlap the cache access with the TLB access.
• This works because the high-order bits of the VA are used to look in the TLB while the low-order bits are used as the index into the cache.
• Overlapped access only works as long as the address bits used to index into the cache do not change as a result of VA translation.
• This usually limits things to small caches, large page sizes, or high n-way set-associative caches if you want a large cache.
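The constraint is easy to state in code: the cache's index + offset bits must fit within the page offset, so translation leaves them unchanged. A Python check with example sizes:

    # Can the cache index be taken entirely from untranslated page-offset bits?
    def overlap_ok(cache_bytes, ways, page_bytes):
        index_plus_offset = (cache_bytes // ways).bit_length() - 1   # log2
        return index_plus_offset <= page_bytes.bit_length() - 1

    print(overlap_ok(4096, 1, 4096))    # True:  4 KB direct-mapped, 4 KB pages
    print(overlap_ok(16384, 1, 4096))   # False: 16 KB direct-mapped needs VA bits
    print(overlap_ok(16384, 4, 4096))   # True:  4-way associativity fixes it
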
All together!

[Figure: the complete path -- CPU, TLB, cache, and main memory -- combined in one diagram.]

• Virtual memory provides:
  – protection
  – sharing
  – performance
  – the illusion of a large main memory
• Virtual memory would require twice as many memory accesses, so we cache page-table entries in the TLB.
• Four things can go wrong on a memory access: a cache miss; a TLB miss (but the page table entry is in the cache); a TLB miss with the page table entry not in the cache; a page fault.

Summary

• The Principle of Locality: a program is likely to access a relatively small portion of the address space at any instant of time.
  – Temporal locality: locality in time.
  – Spatial locality: locality in space.
• Three major categories of cache misses:
  – Compulsory misses: sad facts of life; example: cold-start misses.
  – Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
  – Capacity misses: increase cache size.
• Cache design space:
  – total size, block size, associativity
  – replacement policy
  – write-hit policy (write-through, write-back)
  – write-miss policy

Summary

• Caches, TLBs, and virtual memory can all be understood by examining how they deal with 4 questions: 1) Where can a block be placed? 2) How is a block found? 3) Which block is replaced on a miss? 4) How are writes handled?
• Page tables map virtual addresses to physical addresses.
• TLBs are important for fast translation.
• TLB misses are significant in processor performance (funny times, as most systems can't access all of the 2nd-level cache without TLB misses!).
• Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
  – 1000X DRAM growth removed the controversy.
• Today VM allows many processes to share a single memory without having to swap all processes to disk; VM protection is more important than the memory hierarchy.
• Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean for compilers, data structures, and algorithms?