8/19/2019 Class10 Memory
The Memory Hierarchy
CS 105: Tour of the Black Holes of Computing

Topics:
- Storage technologies and trends
- Locality of reference
- Caching in the memory hierarchy
Random-Access Memory (RAM)
Key features:
- RAM is packaged as a chip
- Basic storage unit is a cell (one bit per cell)
- Multiple RAM chips form a memory

Static RAM (SRAM):
- Each cell stores a bit with a six-transistor circuit
- Retains value indefinitely, as long as it is kept powered
- Relatively insensitive to disturbances such as electrical noise
- Faster and more expensive than DRAM

Dynamic RAM (DRAM):
- Each cell stores a bit with a capacitor and transistor
- Value must be refreshed every 10-100 ms
- Sensitive to disturbances
- Slower and cheaper than SRAM
Non-Volatile RAM (NVRAM)
Key feature: Keeps data when power is lost

Several types:
- Most important is NAND flash
- Ongoing R&D

NAND flash:
- Reading is similar to DRAM (though somewhat slower)
- Writing is packed with restrictions:
  - Can't change existing data
  - Must erase in large blocks (e.g., 64K)
  - A block dies after about 100K erases
- Writing is slower than reading (mostly due to erase cost)
- Chips are often packaged with a Flash Translation Layer (FTL)
  - Spreads out writes ("wear leveling")
  - Makes the chip appear like a disk drive
Conventional DRAM Organization
d x w DRAM: dw total bits organized as d supercells of size w bits

[Figure: a 16 x 8 DRAM chip, organized as a 4 x 4 array of supercells (rows 0-3, cols 0-3) with an internal row buffer; the memory controller (to CPU) sends addresses over a 2-bit addr line and transfers data over an 8-bit data line; supercell (2,1) is highlighted.]
Reading DRAM Supercell (2,1)
Step 1(a): Row address strobe (RAS) selects row 2.
Step 1(b): Row 2 is copied from the DRAM array to the internal row buffer.

[Figure: the memory controller drives RAS = 2 over the 2-bit addr lines of the 16 x 8 DRAM chip; all of row 2 lands in the internal row buffer.]
Reading DRAM Supercell (2,1)
Step 2(a): Column access strobe (CAS) selects column 1.
Step 2(b): Supercell (2,1) is copied from the internal row buffer to the data lines, and eventually back to the CPU.

[Figure: the memory controller drives CAS = 1 over the addr lines; supercell (2,1) travels from the internal row buffer over the 8-bit data lines to the CPU.]
Memory Modules

[Figure: a 64 MB memory module consisting of eight 8M x 8 DRAMs (DRAM 0 through DRAM 7); the memory controller broadcasts addr (row = i, col = j) to every chip, and each chip answers with its supercell (i,j); chip d supplies bits 8d through 8d+7, so together the eight bytes form the 64-bit doubleword at main memory address A (DRAM 0 supplies bits 0-7, DRAM 1 bits 8-15, ..., DRAM 7 bits 56-63).]
Typical Bus Structure Connecting CPU and Memory
A bus is a collection of parallel wires that carry address, data, and control signals.
Buses are typically shared by multiple devices.

[Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge, which connects over the memory bus to main memory.]
Memory Read Transaction (1)
Load operation: movl A, %eax

CPU places address A on the memory bus.

[Figure: the bus interface drives A onto the bus; main memory holds word x at address A.]
Memory Read Transaction (2)
Load operation: movl A, %eax

Main memory reads A from the memory bus, retrieves word x, and places it on the bus.

[Figure: main memory drives x onto the bus through the I/O bridge.]
Memory Read Transaction (3)
Load operation: movl A, %eax

CPU reads word x from the bus and copies it into register %eax.

[Figure: x travels through the bus interface into %eax.]
Memory Write Transaction (1)
Store operation: movl %eax, A

CPU places address A on the bus; main memory reads it and waits for the corresponding data word to arrive.

[Figure: %eax holds y; the bus interface drives A onto the bus.]
Memory Write Transaction (2)
Store operation: movl %eax, A

CPU places data word y on the bus.

[Figure: the bus interface drives y onto the bus.]
Memory Write Transaction (3)
Store operation: movl %eax, A

Main memory reads data word y from the bus and stores it at address A.

[Figure: main memory writes y into the location addressed by A.]
Disk Geometry
- Disks consist of platters, each with two surfaces.
- Each surface consists of concentric rings called tracks.
- Each track consists of sectors separated by gaps.

[Figure: one surface viewed from above, showing the spindle, the tracks (track k highlighted), and the sectors and gaps along a track.]
Disk Geometry (Multiple-Platter View)
Aligned tracks form a cylinder.

[Figure: three platters (0-2) on one spindle give six surfaces (0-5); the aligned track k on each surface forms cylinder k.]
Disk Operation (Single-Platter View)
- The disk surface spins at a fixed rotational rate around the spindle.
- The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
- By moving radially, the arm can position the read/write head over any track.
Disk Operation (Multi-Platter View)
The read/write heads are attached to a common arm assembly and move in unison from cylinder to cylinder.

[Figure: one arm per surface, all pivoting together around the spindle.]
Disk Access Time
Average time to access some target sector is approximated by:

    Taccess = Tavg seek + Tavg rotation + Tavg transfer

Seek time (Tavg seek):
- Time to position the heads over the cylinder containing the target sector
- Typical Tavg seek = 9 ms

Rotational latency (Tavg rotation):
- Time waiting for the first bit of the target sector to pass under the r/w head
- Tavg rotation = 1/2 x (1/RPM) x 60 secs/1 min

Transfer time (Tavg transfer):
- Time to read the bits in the target sector
- Tavg transfer = (1/RPM) x 1/(avg # sectors/track) x 60 secs/1 min
Disk Access Time Example
Given:
- Rotational rate = 7,200 RPM
- Average seek time = 9 ms
- Avg # sectors/track = 400

Derived:
- Tavg rotation = 1/2 x (60 secs / 7,200 RPM) x 1000 ms/sec = 4 ms
- Tavg transfer = (60 secs / 7,200 RPM) x (1/400 tracks/sec) x 1000 ms/sec = 0.02 ms
- Taccess = 9 ms + 4 ms + 0.02 ms

Important points:
- Access time is dominated by seek time and rotational latency
- The first bit in a sector is the most expensive; the rest are free
- SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
- Disk is about 40,000 times slower than SRAM, and 2,500 times slower than DRAM
Logical Disk Blocks
Modern disks present a simpler abstract view of the complex sector geometry:
- The set of available sectors is modeled as a sequence of b-sized logical blocks (0, 1, 2, ...)

Mapping between logical blocks and actual (physical) sectors:
- Maintained by a hardware/firmware device called the disk controller
- Converts requests for logical blocks into (surface, track, sector) triples
- Allows the controller to set aside spare cylinders for each zone
- Accounts for the difference between "formatted capacity" and "maximum capacity"
I/O Bus

[Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to the I/O bridge, and the I/O bridge connects over the memory bus to main memory; the I/O bridge also attaches to the I/O bus, which hosts a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]
Reading a Disk Sector (1)
CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.
Reading a Disk Sector (2)
Disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.
Reading a Disk Sector (3)
When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU).
Storage Trends
(Culled from back issues of Byte and PC Magazine)

DRAM
metric             1980    1985    1990    1995    2000    2000:1980
$/MB               8,000   880     100     30      1       8,000
access (ns)        375     200     100     70      60      6
typical size (MB)  0.064   0.256   4       16      64      1,000

SRAM
metric             1980    1985    1990    1995    2000    2000:1980
$/MB               19,200  2,900   320     256     100     190
access (ns)        300     150     35      15      2       150

Disk
metric             1980    1985    1990    1995    2000    2000:1980
$/MB               500     100     8       0.30    0.05    10,000
access (ms)        87      75      28      10      8       11
typical size (MB)  1       10      160     1,000   9,000   9,000
CPU Clock Rates

                   1980    1985    1990    1995    2000    2000:1980
processor          8080    286     386     Pent    P-III
clock rate (MHz)   1       6       20      150     750     750
cycle time (ns)    1,000   166     50      6       1.6     625
The CPU-Memory Gap
The increasing gap between DRAM, disk, and CPU speeds.
Locality
Principle of Locality:
- Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
- Temporal locality: Recently referenced items are likely to be referenced in the near future
- Spatial locality: Items with nearby addresses tend to be referenced close together in time

Locality Example:

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality
Locality Example
Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.

Question: Does this function have good locality?

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }
Locality Example
Question: Does this function have good locality?

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
Locality Example
Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

    int sumarray3d(int a[M][N][N])
    {
        int i, j, k, sum = 0;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < M; k++)
                    sum += a[k][i][j];
        return sum;
    }
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
- Fast storage technologies cost more per byte and have less capacity
- The gap between CPU and main memory speed is widening
- Well-written programs tend to exhibit good locality

These fundamental properties complement each other beautifully.
They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
An Example Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit toward the bottom:

L0: registers (CPU registers hold words retrieved from the L1 cache)
L1: on-chip L1 cache, SRAM (L1 cache holds cache lines retrieved from the L2 cache)
L2: off-chip L2 cache, SRAM (L2 cache holds cache lines retrieved from main memory)
L3: main memory, DRAM (main memory holds disk blocks retrieved from local disks)
L4: local secondary storage, local disks (local disks hold files retrieved from disks on remote network servers)
L5: remote secondary storage (distributed file systems, Web servers)
Caches
Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.

Fundamental idea of a memory hierarchy:
- For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1

Why do memory hierarchies work?
- Programs tend to access data at level k more often than they access data at level k+1
- Thus, storage at level k+1 can be slower, and thus larger and cheaper per bit
- Net effect: A large pool of memory that costs as little as the cheap storage near the bottom, but that serves data to programs at ≈ the rate of the fast storage near the top
Caching in a Memory Hierarchy
- Level k+1: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks (0 through 15 in the figure).
- Level k: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (here blocks 8, 9, 14, and 3).
- Data is copied between levels in block-sized transfer units.
General Caching Concepts
Program needs object d, which is stored in some block b.

Cache hit:
- Program finds b in the cache at level k. E.g., block 14

Cache miss:
- b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12
- If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
- Placement policy: where can the new block go? E.g., b mod 4
- Replacement policy: which block should be evicted? E.g., LRU

[Figure: a request for block 14 hits at level k; a request for block 12 misses, so block 12 is fetched from level k+1 and placed at level k, evicting a current block.]
General Caching Concepts
Types of cache misses:

Cold (compulsory) miss:
- Cold misses occur because the cache is empty

Conflict miss:
- Most caches limit blocks at level k to a small subset (sometimes a singleton) of the block positions at level k+1
- E.g., block i at level k+1 must be placed in block (i mod 4) at level k
- Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block
- E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time

Capacity miss:
- Occurs when the set of active cache blocks (the working set) is larger than the cache
Examples of Caching in the Hierarchy

Cache Type            What Cached           Where Cached         Latency (cycles)  Managed By
Registers             4-byte word           CPU registers        0                 Compiler
TLB                   Address translations  On-Chip TLB          0                 Hardware
L1 cache              32-byte block         On-Chip L1           1                 Hardware
L2 cache              32-byte block         Off-Chip L2          10                Hardware
Virtual memory        4-KB page             Main memory          100               Hardware+OS
Buffer cache          Parts of files        Main memory          100               OS
Network buffer cache  Parts of files        Local disk           10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk           10,000,000        Web browser
Web cache             Web pages             Remote server disks  1,000,000,000     Web proxy server