The Memory Hierarchy
In the book: 5.1-5.3, 5.7, 5.10
Goals for this Class
• Understand how CPUs run programs
  • How do we express the computation to the CPU?
  • How does the CPU execute it?
  • How does the CPU support other system components (e.g., the OS)?
  • What techniques and technologies are involved, and how do they work?
• Understand why CPU performance (and other metrics) varies
  • How does CPU design impact performance?
  • What trade-offs are involved in designing a CPU?
  • How can we meaningfully measure and compare computer systems?
• Understand why program performance varies
  • How do program characteristics affect performance?
  • How can we improve a program's performance by considering the CPU running it?
  • How do other system components impact program performance?
Memory
[Figure: the CPU connected to memory. Abstraction: memory is a big array of bytes.]
Main points for today
• What is a memory hierarchy?
• What is the CPU-DRAM gap?
• What is locality? What kinds are there?
• Learn a bunch of caching vocabulary.
Processor vs Memory Performance
• Memory is very slow compared to processors.
[Figure: performance relative to 1980 for processors vs. SRAM and DRAM.]
Silicon Memories
• Why store things in silicon?
  • It's fast!
  • Compatible with logic devices (mostly)
• The main goal is to be cheap
  • Dense -- the smaller the bits, the less area you need, and the more bits you can fit on a chip/wafer/through your fab.
  • Bit sizes are measured in F² -- the smallest feature you can create.
  • The number of F²/bit is a function of the memory technology, not the manufacturing technology.
  • i.e., an SRAM in today's technology will take the same number of F² in tomorrow's technology.
Questions
• What physical quantity should represent the bit?
  • Voltage/charge -- SRAMs, DRAMs, flash memories
  • Magnetic orientation -- MRAMs
  • Crystal structure -- phase-change memories
  • The orientation of organic molecules -- various exotic technologies
  • All that's required is that we can sense it and turn it into a logic one or zero.
• How do we achieve maximum density?
• How do we make them fast?
Anatomy of a Memory
• Dense: build a big array
  • the bigger the better
  • less other stuff
  • Bigger -> slower
• Row decoder
  • Selects a row by raising a "word line"
• Column decoder
  • Selects a slice of the row
• Decoders are pretty big.
The Storage Array
• Density is king.
  • Highly engineered, carefully tuned, automatically generated.
  • The smaller the devices, the better.
• Making them big makes them slow.
  • Bit/word lines are long (millimeters).
  • They have large capacitance, so their RC delay is long.
  • For the row decoder, use large transistors to drive them hard.
• For the bit cells...
  • There are lots of these, so they need to be as small as possible (but not smaller).
Measuring Memory Density
• We use a "technology independent" metric to measure the inherent size of different memory cells.
  • F == the "feature size" == the smallest dimension a CMOS process can create (e.g., the width of the narrowest wire).
  • In a 22nm process technology, F = 22nm.
  • F² (F-squared) is the smallest 2D feature we can manufacture.
  • A single bit of a given type of memory (e.g., SRAM or DRAM) requires a fixed number of F².
  • This number doesn't change with process technology.
  • e.g., NAND flash memory is 4F² in 90nm and in 22nm.
• This metric is useful because the relative sizes of different memory technologies don't change much, although absolute densities do.
Sense Amps
• Sense amplifiers take a difference between two signals and amplify it.
• Two scenarios:
  • Inputs are initially equal ("precharged") -- they each move in opposite directions.
  • One input is a reference -- so only one signal moves.
• Frequently used in memories
  • Storage cells are small, so the signals they produce are inherently weak.
  • Sense amps can detect these weak, analog signals and convert them into a logic one or logic zero.
Static Random Access Memory (SRAM)
• Storage
  • Voltage on a pair of cross-coupled inverters
  • Durable in the presence of power
• To read
  • Pre-charge the two bit lines to Vcc/2
  • Turn on the "word line"
  • Read the output of the sense amp
SRAM Writes
• To write
  • Turn off the sense amp
  • Turn on the wordline
  • Drive the bitlines to the correct state
  • Turn off the wordline
Building SRAM
• This is "6T SRAM"
  • 6 transistors is pretty big
  • SRAMs are not dense
SRAM Density
• At 65nm: 0.52 µm² per cell (65nm TSMC 6T SRAM)
  • 123-140 F²
  • [ITRS 2008]
SRAM Ports
• Add word and bit lines
• Read/write multiple things at once
• Density decreases quadratically
• Bandwidth increases linearly
SRAM Performance
• Read and write times
  • 10s-100s of ps
• Bandwidth
  • Registers -- 324GB/s
  • L1 cache -- 128GB/s
DRAM
Dynamic Random Access Memory (DRAM)
• Storage
  • Charge on a capacitor
  • Decays over time (µs-scale)
  • This is the "dynamic" part.
  • About 6F²: 20x better than SRAM
• Reading
  • Precharge
  • Assert the word line
  • Sense the output
  • Refresh the data
• Only one bit line is read at a time; the other bit line serves as a reference. (The bit cells attached to Wordline 1 are not shown.)
DRAM: Write and Refresh
• Writing
  • Turn on the wordline
  • Override the sense amp
• Refresh
  • Every few milliseconds, read and re-write every bit.
  • Consumes power
  • Takes time
DRAM Lithography: How do you get a big capacitor?
• C ~ Area / dielectric thickness
• Stacked capacitors [figure]
• Trench capacitors [figure]
Accessing DRAM
• Apply the row address
  • "Opens a page"
  • Slow (~12ns read + 24ns precharge)
  • Contents land in a "row buffer"
• Apply one or more column addresses
  • Fast (~3ns)
  • Reads and/or writes
[Figure: one DDR3 DRAM bank with 16k rows.]
DRAM Devices
• There are many banks per die (16 in the Micron 78nm 1Gb DDR3 at left)
  • Multiple pages can be open at once.
  • Can keep pages open longer
  • Parallelism
• Example
  • open bank 1, row 4
  • open bank 2, row 7
  • open bank 3, row 10
  • read bank 1, column 8
  • read bank 2, column 32
  • ...
DRAM: Micron MT47H512M4
DRAM Variants
• The basic DRAM technology has been wrapped in several different interfaces.
• SDRAM (synchronous)
• DDR SDRAM (double data rate)
  • Data clocked on both the rising and falling edges of the clock.
• DDR2 -- faster, lower-voltage DDR
• DDR3 -- even faster, even lower voltage
• GDDR2-5 -- for graphics cards
Current State-of-the-art: DDR3 SDRAM
• DIMM data path is 64 bits (72 with ECC)
• Data rate: up to 1066MHz DDR (2133MHz effective)
• Bandwidth per DIMM, GTNE: 16GB/s
  • "guaranteed not to exceed"
• Multiple DIMMs can attach to a bus
  • Reduces bandwidth/GB (a good idea?)
• Each chip provides one 8-bit slice; the chips are all synchronized and receive the same commands.
DRAM Scaling
• The long-term need for performance has driven DRAM hard
  • Complex interface
  • High performance
  • High power
• DRAM used to be the main driver for process scaling; now it's flash.
• Power is now a major concern.
• Scaling is expected to match CMOS tech scaling
  • F² cell size will probably not decrease
• Historical footnote: Intel got its start as a DRAM company, but got out when DRAM became a commodity.
A Typical Hierarchy: Costs and Speeds

Level             Technology  Capacity  Cost           Access time
On-chip L1 cache  SRAM        KBs       ???            < 1ns
On-chip L2 cache  SRAM        KBs       ???            < 2-3ns
On-chip L3 cache  SRAM        MBs       ???            < 10ns
Main memory       DRAM        GBs       0.009 $/MB     60ns
SSDs              flash       GBs       0.0006 $/MB    20,000ns
Disk              --          TBs       0.00004 $/MB   10,000,000ns
How far away is the data?
[Figure © 2004 Jim Gray, Microsoft Corporation: storage latency expressed as physical distance -- e.g., Los Angeles.]
Typical Hierarchy: Architecture
The Principle of Locality
• "Locality" is the tendency of data access to be predictable. There are two kinds:
• Spatial locality: the program is likely to access data that is close to data it has accessed recently.
• Temporal locality: the program is likely to access the same data repeatedly.
Memory's Impact
M = % mem ops
Mlat (cycles) = average memory latency
BaseCPI = base CPI with a single-cycle data memory
CPI = ?
Memory's Impact
M = % mem ops
Mlat (cycles) = average memory latency

TotalCPI = BaseCPI + M × Mlat

Example:
BaseCPI = 1; M = 0.2; Mlat = 240 cycles
TotalCPI = 1 + 0.2 × 240 = 49
Speedup = 1/49 ≈ 0.02 => a 98% drop in performance

Remember: Amdahl's law does not bound the slowdown. Poor memory performance can make your program arbitrarily slow.
Why should we expect caching to work?
• Why did branch prediction work?
Why should we expect caching to work?
• Why did branch prediction work?
• Where is memory access predictable?
  • Predictably accessing the same data
    • In loops: for(i = 0; i < 10; i++) {s += foo[i];}
    • foo = bar[4 + configuration_parameter];
  • Predictably accessing different data
    • In linked lists: while(l != NULL) {l = l->next;}
    • In arrays: for(i = 0; i < 10000; i++) {s += data[i];}
    • Structure access: foo(some_struct.a, some_struct.b);
The Principle of Locality
• "Locality" is the tendency of data access to be predictable. There are two kinds:
• Spatial locality: the program is likely to access data that is close to data it has accessed recently.
• Temporal locality: the program is likely to access the same data repeatedly.
Locality in Action
• Label each access with whether it has temporal or spatial locality, or neither:
  1, 2, 3, 10, 4, 1800, 11, 30, 1, 2, 3, 4, 10, 190, 11, 30, 12, 13, 182, 1004
Locality in Action
• Label each access with whether it has temporal or spatial locality, or neither:
  1 (n), 2 (s), 3 (s), 10 (n), 4 (s), 1800 (n), 11 (s), 30 (n), 1 (t), 2 (s,t), 3 (s,t), 4 (s,t), 10 (s,t), 190 (n), 11 (s,t), 30 (s), 12 (s), 13 (s), 182 (n?), 1004 (n)
• There is no hard and fast rule here. In practice, locality exists for an access if the cache performs well.
Cache Vocabulary
• Hit -- the data was found in the cache.
• Miss -- the data was not found in the cache.
• Hit rate -- hits/total accesses
• Miss rate = 1 - hit rate
• Locality -- see previous slides
• Cache line -- the basic unit of data in a cache; generally several words.
• Tag -- the high-order address bits stored along with the data to identify the actual address of the cache line.
• Hit time -- time to service a hit
• Miss time -- time to service a miss (a function of the lower-level caches)
Cache Vocabulary
• There can be many caches stacked on top of each other.
  • If you miss in one, you try in the "lower-level cache". Lower level means higher number.
• There can also be separate caches for data and instructions, or the cache can be "unified".
• In the 5-stage MIPS pipeline:
  • The L1 data cache (d-cache) is the one nearest the processor. It corresponds to the "data memory" block in our pipeline diagrams.
  • The L1 instruction cache (i-cache) corresponds to the "instruction memory" block in our pipeline diagrams.
  • The L2 sits underneath the L1s.
  • There is often an L3 in modern systems.
Typical Cache Hierarchy
Data vs Instruction Caches
• Why have different I and D caches?
Data vs Instruction Caches
• Why have different I and D caches?
  • They cover different areas of memory.
  • Different access patterns
    • I-cache accesses have lots of spatial locality: mostly sequential accesses.
    • I-cache accesses are also predictable to the extent that branches are predictable.
    • D-cache accesses are typically less predictable.
  • Not just different, but often at cross purposes
    • Sequential I-cache accesses may interfere with the data the D-cache has collected.
    • This is "interference", just as we saw with branch predictors.
  • At the L1 level, splitting them avoids a structural hazard in the pipeline.
  • Writes to the I-cache by the program (i.e., self-modifying code) are rare enough that they can be slow.