8/19/2019 Class10 Memory
The Memory Hierarchy
CS 105: Tour of the Black Holes of Computing

Topics:
- Storage technologies and trends
- Locality of reference
- Caching in the memory hierarchy
Random-Access Memory (RAM)
Key features:
- RAM is packaged as a chip
- Basic storage unit is a cell (one bit per cell)
- Multiple RAM chips form a memory

Static RAM (SRAM):
- Each cell stores a bit with a six-transistor circuit
- Retains value indefinitely, as long as it is kept powered
- Relatively insensitive to disturbances such as electrical noise
- Faster and more expensive than DRAM

Dynamic RAM (DRAM):
- Each cell stores a bit with a capacitor and transistor
- Value must be refreshed every 10-100 ms
- Sensitive to disturbances
- Slower and cheaper than SRAM
Non-Volatile RAM (NVRAM)
Key feature: Keeps data when power is lost

Several types:
- Most important is NAND flash
- Ongoing R&D

NAND flash:
- Reading is similar to DRAM (though somewhat slower)
- Writing is packed with restrictions:
  - Can't change existing data
  - Must erase in large blocks (e.g., 64K)
  - A block dies after about 100K erases
- Writing is slower than reading (mostly due to erase cost)
- Chips are often packaged with a Flash Translation Layer (FTL)
  - Spreads out writes ("wear leveling")
  - Makes the chip appear like a disk drive
Conventional DRAM Organization
d x w DRAM: dw total bits organized as d supercells of size w bits

[Figure: a 16 x 8 DRAM chip, organized as a 4 x 4 array of supercells (rows 0-3, cols 0-3) with an internal row buffer; the memory controller (to CPU) sends addresses over a 2-bit addr line and transfers data over an 8-bit data line; supercell (2,1) is highlighted.]
Reading DRAM Supercell (2,1)
Step 1(a): Row address strobe (RAS) selects row 2.
Step 1(b): Row 2 is copied from the DRAM array to the internal row buffer.

[Figure: the memory controller drives RAS = 2 over the 2-bit addr lines of the 16 x 8 DRAM chip; all of row 2 lands in the internal row buffer.]
Reading DRAM Supercell (2,1)
Step 2(a): Column access strobe (CAS) selects column 1.
Step 2(b): Supercell (2,1) is copied from the internal row buffer to the data lines, and eventually back to the CPU.

[Figure: the memory controller drives CAS = 1 over the addr lines; supercell (2,1) travels from the internal row buffer over the 8-bit data lines to the CPU.]
Memory Modules

[Figure: a 64 MB memory module consisting of eight 8M x 8 DRAMs (DRAM 0 through DRAM 7); the memory controller broadcasts addr (row = i, col = j) to every chip, and each chip answers with its supercell (i,j); chip d supplies bits 8d through 8d+7, so together the eight bytes form the 64-bit doubleword at main memory address A (DRAM 0 supplies bits 0-7, DRAM 1 bits 8-15, ..., DRAM 7 bits 56-63).]
Typical Bus Structure Connecting CPU and Memory
A bus is a collection of parallel wires that carry address, data, and control signals.
Buses are typically shared by multiple devices.

[Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to an I/O bridge, which connects over the memory bus to main memory.]
Memory Read Transaction (1)
Load operation: movl A, %eax

CPU places address A on the memory bus.

[Figure: the bus interface drives A onto the bus; main memory holds word x at address A.]
Memory Read Transaction (2)
Load operation: movl A, %eax

Main memory reads A from the memory bus, retrieves word x, and places it on the bus.

[Figure: main memory drives x onto the bus through the I/O bridge.]
Memory Read Transaction (3)
Load operation: movl A, %eax

CPU reads word x from the bus and copies it into register %eax.

[Figure: x travels through the bus interface into %eax.]
Memory Write Transaction (1)
Store operation: movl %eax, A

CPU places address A on the bus; main memory reads it and waits for the corresponding data word to arrive.

[Figure: %eax holds y; the bus interface drives A onto the bus.]
Memory Write Transaction (2)
Store operation: movl %eax, A

CPU places data word y on the bus.

[Figure: the bus interface drives y onto the bus.]
Memory Write Transaction (3)
Store operation: movl %eax, A

Main memory reads data word y from the bus and stores it at address A.

[Figure: main memory writes y into the location addressed by A.]
Disk Geometry
- Disks consist of platters, each with two surfaces.
- Each surface consists of concentric rings called tracks.
- Each track consists of sectors separated by gaps.

[Figure: one surface viewed from above, showing the spindle, the tracks (track k highlighted), and the sectors and gaps along a track.]
Disk Geometry (Multiple-Platter View)
Aligned tracks form a cylinder.

[Figure: three platters (0-2) on one spindle give six surfaces (0-5); the aligned track k on each surface forms cylinder k.]
Disk Operation (Single-Platter View)
- The disk surface spins at a fixed rotational rate around the spindle.
- The read/write head is attached to the end of the arm and flies over the disk surface on a thin cushion of air.
- By moving radially, the arm can position the read/write head over any track.
Disk Operation (Multi-Platter View)
The read/write heads are attached to a common arm assembly and move in unison from cylinder to cylinder.

[Figure: one arm per surface, all pivoting together around the spindle.]
Disk Access Time
Average time to access some target sector is approximated by:

    Taccess = Tavg seek + Tavg rotation + Tavg transfer

Seek time (Tavg seek):
- Time to position the heads over the cylinder containing the target sector
- Typical Tavg seek = 9 ms

Rotational latency (Tavg rotation):
- Time waiting for the first bit of the target sector to pass under the r/w head
- Tavg rotation = 1/2 x (1/RPM) x 60 secs/1 min

Transfer time (Tavg transfer):
- Time to read the bits in the target sector
- Tavg transfer = (1/RPM) x 1/(avg # sectors/track) x 60 secs/1 min
Disk Access Time Example
Given:
- Rotational rate = 7,200 RPM
- Average seek time = 9 ms
- Avg # sectors/track = 400

Derived:
- Tavg rotation = 1/2 x (60 secs / 7,200 RPM) x 1000 ms/sec = 4 ms
- Tavg transfer = (60 secs / 7,200 RPM) x (1/400 tracks/sec) x 1000 ms/sec = 0.02 ms
- Taccess = 9 ms + 4 ms + 0.02 ms

Important points:
- Access time is dominated by seek time and rotational latency
- The first bit in a sector is the most expensive; the rest are free
- SRAM access time is about 4 ns/doubleword, DRAM about 60 ns
- Disk is about 40,000 times slower than SRAM, and 2,500 times slower than DRAM
Logical Disk Blocks
Modern disks present a simpler abstract view of the complex sector geometry:
- The set of available sectors is modeled as a sequence of b-sized logical blocks (0, 1, 2, ...)

Mapping between logical blocks and actual (physical) sectors:
- Maintained by a hardware/firmware device called the disk controller
- Converts requests for logical blocks into (surface, track, sector) triples
- Allows the controller to set aside spare cylinders for each zone
- Accounts for the difference between "formatted capacity" and "maximum capacity"
I/O Bus

[Figure: the CPU chip (register file, ALU, bus interface) connects over the system bus to the I/O bridge, and the I/O bridge connects over the memory bus to main memory; the I/O bridge also attaches to the I/O bus, which hosts a USB controller (mouse, keyboard), a graphics adapter (monitor), a disk controller (disk), and expansion slots for other devices such as network adapters.]
Reading a Disk Sector (1)
CPU initiates a disk read by writing a command, logical block number, and destination memory address to a port (address) associated with the disk controller.
Reading a Disk Sector (2)
Disk controller reads the sector and performs a direct memory access (DMA) transfer into main memory.
Reading a Disk Sector (3)
When the DMA transfer completes, the disk controller notifies the CPU with an interrupt (i.e., asserts a special "interrupt" pin on the CPU).
Storage Trends
(Culled from back issues of Byte and PC Magazine)

DRAM
metric             1980    1985    1990    1995    2000    2000:1980
$/MB               8,000   880     100     30      1       8,000
access (ns)        375     200     100     70      60      6
typical size (MB)  0.064   0.256   4       16      64      1,000

SRAM
metric             1980    1985    1990    1995    2000    2000:1980
$/MB               19,200  2,900   320     256     100     190
access (ns)        300     150     35      15      2       150

Disk
metric             1980    1985    1990    1995    2000    2000:1980
$/MB               500     100     8       0.30    0.05    10,000
access (ms)        87      75      28      10      8       11
typical size (MB)  1       10      160     1,000   9,000   9,000
CPU Clock Rates

                   1980    1985    1990    1995    2000    2000:1980
processor          8080    286     386     Pent    P-III
clock rate (MHz)   1       6       20      150     750     750
cycle time (ns)    1,000   166     50      6       1.6     625
The CPU-Memory Gap
The increasing gap between DRAM, disk, and CPU speeds.
Locality
Principle of Locality:
- Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
- Temporal locality: Recently referenced items are likely to be referenced in the near future
- Spatial locality: Items with nearby addresses tend to be referenced close together in time

Locality Example:

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality
Locality Example
Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.

Question: Does this function have good locality?

    int sumarrayrows(int a[M][N])
    {
        int i, j, sum = 0;

        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }
Locality Example
Question: Does this function have good locality?

    int sumarraycols(int a[M][N])
    {
        int i, j, sum = 0;

        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }
Locality Example
Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

    int sumarray3d(int a[M][N][N])
    {
        int i, j, k, sum = 0;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < M; k++)
                    sum += a[k][i][j];
        return sum;
    }
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
- Fast storage technologies cost more per byte and have less capacity
- The gap between CPU and main memory speed is widening
- Well-written programs tend to exhibit good locality

These fundamental properties complement each other beautifully.
They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
An Example Memory Hierarchy
Smaller, faster, and costlier (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit toward the bottom:

L0: registers (CPU registers hold words retrieved from the L1 cache)
L1: on-chip L1 cache, SRAM (L1 cache holds cache lines retrieved from the L2 cache)
L2: off-chip L2 cache, SRAM (L2 cache holds cache lines retrieved from main memory)
L3: main memory, DRAM (main memory holds disk blocks retrieved from local disks)
L4: local secondary storage, local disks (local disks hold files retrieved from disks on remote network servers)
L5: remote secondary storage (distributed file systems, Web servers)
Caches
Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.

Fundamental idea of a memory hierarchy:
- For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1

Why do memory hierarchies work?
- Programs tend to access data at level k more often than they access data at level k+1
- Thus, storage at level k+1 can be slower, and thus larger and cheaper per bit
- Net effect: A large pool of memory that costs as little as the cheap storage near the bottom, but that serves data to programs at ≈ the rate of the fast storage near the top
Caching in a Memory Hierarchy
- Level k+1: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks (0 through 15 in the figure).
- Level k: the smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1 (here blocks 8, 9, 14, and 3).
- Data is copied between levels in block-sized transfer units.
General Caching Concepts
Program needs object d, which is stored in some block b.

Cache hit:
- Program finds b in the cache at level k. E.g., block 14

Cache miss:
- b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12
- If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
- Placement policy: where can the new block go? E.g., b mod 4
- Replacement policy: which block should be evicted? E.g., LRU

[Figure: a request for block 14 hits at level k; a request for block 12 misses, so block 12 is fetched from level k+1 and placed at level k, evicting a current block.]
General Caching Concepts
Types of cache misses:

Cold (compulsory) miss:
- Cold misses occur because the cache is empty

Conflict miss:
- Most caches limit blocks at level k to a small subset (sometimes a singleton) of the block positions at level k+1
- E.g., block i at level k+1 must be placed in block (i mod 4) at level k
- Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block
- E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time

Capacity miss:
- Occurs when the set of active cache blocks (the working set) is larger than the cache
Examples of Caching in the Hierarchy

Cache Type            What Cached           Where Cached         Latency (cycles)  Managed By
Registers             4-byte word           CPU registers        0                 Compiler
TLB                   Address translations  On-Chip TLB          0                 Hardware
L1 cache              32-byte block         On-Chip L1           1                 Hardware
L2 cache              32-byte block         Off-Chip L2          10                Hardware
Virtual memory        4-KB page             Main memory          100               Hardware+OS
Buffer cache          Parts of files        Main memory          100               OS
Network buffer cache  Parts of files        Local disk           10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk           10,000,000        Web browser
Web cache             Web pages             Remote server disks  1,000,000,000     Web proxy server