The Memory Hierarchy
In the book: 5.1-5.3, 5.7, 5.10
Goals for this Class
• Understand how CPUs run programs
  • How do we express the computation to the CPU?
  • How does the CPU execute it?
  • How does the CPU support other system components (e.g., the OS)?
  • What techniques and technologies are involved, and how do they work?
• Understand why CPU performance (and other metrics) varies
  • How does CPU design impact performance?
  • What trade-offs are involved in designing a CPU?
  • How can we meaningfully measure and compare computer systems?
• Understand why program performance varies
  • How do program characteristics affect performance?
  • How can we improve a program's performance by considering the CPU running it?
  • How do other system components impact program performance?
Memory
[Figure: the CPU connected to memory. Abstraction: memory is a big array of bytes.]
Main points for today
• What is a memory hierarchy?
• What is the CPU-DRAM gap?
• What is locality? What kinds are there?
• Learn a bunch of caching vocabulary.
Processor vs Memory Performance
• Memory is very slow compared to processors.
[Figure: performance relative to 1980 for processors vs. SRAM and DRAM.]
Silicon Memories
• Why store things in silicon?
  • It's fast!
  • Compatible with logic devices (mostly)
• The main goal is to be cheap
  • Dense -- the smaller the bits, the less area you need, and the more bits you can fit on a chip/wafer/through your fab.
  • Bit sizes are measured in F² -- the smallest feature you can create.
  • The number of F²/bit is a function of the memory technology, not the manufacturing technology.
  • i.e., an SRAM in today's technology will take the same number of F² in tomorrow's technology.
Questions
• What physical quantity should represent the bit?
  • Voltage/charge -- SRAMs, DRAMs, flash memories
  • Magnetic orientation -- MRAMs
  • Crystal structure -- phase-change memories
  • The orientation of organic molecules -- various exotic technologies
  • All that's required is that we can sense it and turn it into a logic one or zero.
• How do we achieve maximum density?
• How do we make them fast?
Anatomy of a Memory
• Dense: build a big array
  • the bigger the better
  • less other stuff
  • Bigger -> slower
• Row decoder
  • Selects a row by raising a "word line"
• Column decoder
  • Selects a slice of the row
• Decoders are pretty big.
The Storage Array
• Density is king.
  • Highly engineered, carefully tuned, automatically generated.
  • The smaller the devices, the better.
• Making them big makes them slow.
  • Bit/word lines are long (millimeters).
  • They have large capacitance, so their RC delay is long.
  • For the row decoder, use large transistors to drive them hard.
• For the bit cells...
  • There are lots of these, so they need to be as small as possible (but not smaller).
Measuring Memory Density
• We use a "technology independent" metric to measure the inherent size of different memory cells.
  • F == the "feature size" == the smallest dimension a CMOS process can create (e.g., the width of the narrowest wire).
  • In a 22nm process technology, F = 22nm.
  • F² (F-squared) is the smallest 2D feature we can manufacture.
  • A single bit of a given type of memory (e.g., SRAM or DRAM) requires a fixed number of F².
  • This number doesn't change with process technology.
  • e.g., NAND flash memory is 4F² in 90nm and in 22nm.
• This metric is useful because the relative sizes of different memory technologies don't change much, although absolute densities do.
Sense Amps
• Sense amplifiers take a difference between two signals and amplify it.
• Two scenarios:
  • Inputs are initially equal ("precharged") -- they each move in opposite directions.
  • One input is a reference -- so only one signal moves.
• Frequently used in memories
  • Storage cells are small, so the signals they produce are inherently weak.
  • Sense amps can detect these weak, analog signals and convert them into a logic one or logic zero.
Static Random Access Memory (SRAM)
• Storage
  • Voltage on a pair of cross-coupled inverters
  • Durable in the presence of power
• To read
  • Pre-charge the two bit lines to Vcc/2
  • Turn on the "word line"
  • Read the output of the sense amp
SRAM Writes
• To write
  • Turn off the sense amp
  • Turn on the wordline
  • Drive the bitlines to the correct state
  • Turn off the wordline
Building SRAM
• This is "6T SRAM"
  • 6 transistors is pretty big
  • SRAMs are not dense
SRAM Density
• At 65nm: 0.52 µm² per cell (65nm TSMC 6T SRAM)
  • 123-140 F²
  • [ITRS 2008]
SRAM Ports
• Add word and bit lines
• Read/write multiple things at once
• Density decreases quadratically
• Bandwidth increases linearly
SRAM Performance
• Read and write times
  • 10s-100s of ps
• Bandwidth
  • Registers -- 324GB/s
  • L1 cache -- 128GB/s
DRAM
Dynamic Random Access Memory (DRAM)
• Storage
  • Charge on a capacitor
  • Decays over time (µs-scale)
  • This is the "dynamic" part.
  • About 6F²: 20x better than SRAM
• Reading
  • Precharge
  • Assert the word line
  • Sense the output
  • Refresh the data
• Only one bit line is read at a time; the other bit line serves as a reference. (The bit cells attached to Wordline 1 are not shown.)
DRAM: Write and Refresh
• Writing
  • Turn on the wordline
  • Override the sense amp
• Refresh
  • Every few milliseconds, read and re-write every bit.
  • Consumes power
  • Takes time
DRAM Lithography: How do you get a big capacitor?
• C ~ Area / dielectric thickness
• Stacked capacitors [figure]
• Trench capacitors [figure]
Accessing DRAM
• Apply the row address
  • "Opens a page"
  • Slow (~12ns read + 24ns precharge)
  • Contents land in a "row buffer"
• Apply one or more column addresses
  • Fast (~3ns)
  • Reads and/or writes
[Figure: one DDR3 DRAM bank with 16k rows.]
DRAM Devices
• There are many banks per die (16 in the Micron 78nm 1Gb DDR3 at left)
  • Multiple pages can be open at once.
  • Can keep pages open longer
  • Parallelism
• Example
  • open bank 1, row 4
  • open bank 2, row 7
  • open bank 3, row 10
  • read bank 1, column 8
  • read bank 2, column 32
  • ...
DRAM: Micron MT47H512M4
DRAM Variants
• The basic DRAM technology has been wrapped in several different interfaces.
• SDRAM (synchronous)
• DDR SDRAM (double data rate)
  • Data clocked on both the rising and falling edges of the clock.
• DDR2 -- faster, lower-voltage DDR
• DDR3 -- even faster, even lower voltage
• GDDR2-5 -- for graphics cards
Current State-of-the-art: DDR3 SDRAM
• DIMM data path is 64 bits (72 with ECC)
• Data rate: up to 1066MHz DDR (2133MHz effective)
• Bandwidth per DIMM, GTNE: 16GB/s
  • "guaranteed not to exceed"
• Multiple DIMMs can attach to a bus
  • Reduces bandwidth/GB (a good idea?)
• Each chip provides one 8-bit slice; the chips are all synchronized and receive the same commands.
DRAM Scaling
• The long-term need for performance has driven DRAM hard
  • Complex interface
  • High performance
  • High power
• DRAM used to be the main driver for process scaling; now it's flash.
• Power is now a major concern.
• Scaling is expected to match CMOS tech scaling
  • F² cell size will probably not decrease
• Historical footnote: Intel got its start as a DRAM company, but got out when DRAM became a commodity.
A Typical Hierarchy: Costs and Speeds

Level             Technology  Capacity  Cost           Access time
On-chip L1 cache  SRAM        KBs       ???            < 1ns
On-chip L2 cache  SRAM        KBs       ???            < 2-3ns
On-chip L3 cache  SRAM        MBs       ???            < 10ns
Main memory       DRAM        GBs       0.009 $/MB     60ns
SSDs              flash       GBs       0.0006 $/MB    20,000ns
Disk              --          TBs       0.00004 $/MB   10,000,000ns
How far away is the data?
[Figure © 2004 Jim Gray, Microsoft Corporation: storage latency expressed as physical distance -- e.g., Los Angeles.]
Typical Hierarchy: Architecture
The Principle of Locality
• "Locality" is the tendency of data access to be predictable. There are two kinds:
• Spatial locality: the program is likely to access data that is close to data it has accessed recently.
• Temporal locality: the program is likely to access the same data repeatedly.
Memory's Impact
M = % mem ops
Mlat (cycles) = average memory latency
BaseCPI = base CPI with a single-cycle data memory
CPI = ?
Memory's Impact
M = % mem ops
Mlat (cycles) = average memory latency

TotalCPI = BaseCPI + M × Mlat

Example:
BaseCPI = 1; M = 0.2; Mlat = 240 cycles
TotalCPI = 1 + 0.2 × 240 = 49
Speedup = 1/49 ≈ 0.02 => a 98% drop in performance

Remember: Amdahl's law does not bound the slowdown. Poor memory performance can make your program arbitrarily slow.
Why should we expect caching to work?
• Why did branch prediction work?
Why should we expect caching to work?
• Why did branch prediction work?
• Where is memory access predictable?
  • Predictably accessing the same data
    • In loops: for(i = 0; i < 10; i++) {s += foo[i];}
    • foo = bar[4 + configuration_parameter];
  • Predictably accessing different data
    • In linked lists: while(l != NULL) {l = l->next;}
    • In arrays: for(i = 0; i < 10000; i++) {s += data[i];}
    • Structure access: foo(some_struct.a, some_struct.b);
The Principle of Locality
• "Locality" is the tendency of data access to be predictable. There are two kinds:
• Spatial locality: the program is likely to access data that is close to data it has accessed recently.
• Temporal locality: the program is likely to access the same data repeatedly.
Locality in Action
• Label each access with whether it has temporal or spatial locality, or neither:
  1, 2, 3, 10, 4, 1800, 11, 30, 1, 2, 3, 4, 10, 190, 11, 30, 12, 13, 182, 1004
Locality in Action
• Label each access with whether it has temporal or spatial locality, or neither:
  1 (n), 2 (s), 3 (s), 10 (n), 4 (s), 1800 (n), 11 (s), 30 (n), 1 (t), 2 (s,t), 3 (s,t), 4 (s,t), 10 (s,t), 190 (n), 11 (s,t), 30 (s), 12 (s), 13 (s), 182 (n?), 1004 (n)
• There is no hard and fast rule here. In practice, locality exists for an access if the cache performs well.
Cache Vocabulary
• Hit -- the data was found in the cache.
• Miss -- the data was not found in the cache.
• Hit rate -- hits/total accesses
• Miss rate = 1 - hit rate
• Locality -- see previous slides
• Cache line -- the basic unit of data in a cache; generally several words.
• Tag -- the high-order address bits stored along with the data to identify the actual address of the cache line.
• Hit time -- time to service a hit
• Miss time -- time to service a miss (a function of the lower-level caches)
Cache Vocabulary
• There can be many caches stacked on top of each other.
  • If you miss in one, you try in the "lower-level cache". Lower level means higher number.
• There can also be separate caches for data and instructions, or the cache can be "unified".
• In the 5-stage MIPS pipeline:
  • The L1 data cache (d-cache) is the one nearest the processor. It corresponds to the "data memory" block in our pipeline diagrams.
  • The L1 instruction cache (i-cache) corresponds to the "instruction memory" block in our pipeline diagrams.
  • The L2 sits underneath the L1s.
  • There is often an L3 in modern systems.
Typical Cache Hierarchy
Data vs Instruction Caches
• Why have different I and D caches?
Data vs Instruction Caches
• Why have different I and D caches?
  • They cover different areas of memory.
  • Different access patterns
    • I-cache accesses have lots of spatial locality: mostly sequential accesses.
    • I-cache accesses are also predictable to the extent that branches are predictable.
    • D-cache accesses are typically less predictable.
  • Not just different, but often at cross purposes
    • Sequential I-cache accesses may interfere with the data the D-cache has collected.
    • This is "interference", just as we saw with branch predictors.
  • At the L1 level, splitting them avoids a structural hazard in the pipeline.
  • Writes to the I-cache by the program (i.e., self-modifying code) are rare enough that they can be slow.