www.cs.utah.edu/~udipi
Designing Efficient Memory for Future Computing Systems
Aniruddha N. Udipi University of Utah
Ph.D. Dissertation Defense, March 7, 2012 Advisor: Rajeev Balasubramonian
My other computer is…
Scaling server farms
• Facebook: 30,000 servers, 80 billion images stored, serves 600,000 photos a second, logs 25 TB of data per day… the statistics go on
• The primary challenge to scaling: efficient supply of data to thousands of cores
• It’s all about the memory!
Performance Trends
• Demand side
  – Multi-socket, multi-core, multi-thread
  – Large datasets: big-data analytics, scientific computation models
  – RAMCloud-like designs
  – 1 TB/s per node by 2017
• Supply side
  – Pin count, per-pin bandwidth, capacity
  – Severely power limited
Energy Trends
• Datacenters consume ~2% of all power generated in the US
  – Operation + cooling
  – 100 billion kWh, $7.4 billion
• 25–40% of total power in large systems is consumed in memory
• As processors get simpler, this fraction is likely to increase
Cost-per-bit
• Traditionally the holy grail of DRAM design
• Operational expenditure over 3 years now equals capital expenditure in datacenter servers
  – Cost-per-bit is less important than before
(Figure: $3.00 / 13 W vs. $0.30 / 60 W)
Complexity Trends
• The job of the memory controller is hard
  – 18+ timing parameters for DRAM!
  – Maintenance operations: refresh, scrub, power-down, etc.
• Several DIMM and controller variants
  – Hard to provide interoperability
  – Need processor-side support for new memory features
• Now throw in heterogeneity
  – Memristors, PCM, STT-RAM, etc.
Reliability Trends
• Shrinking feature sizes are not helping
• Nor is the scale
  – 64 × 10^15 DRAM cells in a typical datacenter
• DRAM errors are the #1 reason for servers at Google to enter repair
• Datacenters are the backbone of web-connected infrastructure
  – Reliability is essential
• Server downtime has huge economic impact
  – Breached SLAs, for example
Thesis statement
• Main memory systems are at an inflection point
  – Convergence of several trends
• A major overhaul is required to achieve a system that is
  – Energy-efficient, high-performance, low-complexity, reliable, and cost-effective
• This requires a combination of two things
  – Prudent application of novel technologies
  – Fundamental rethinking of conventional design decisions
Designing Future Memory Systems
1. Memory Chip Architecture – reducing overfetch & increasing parallelism [ISCA ’10]
2. Memory Interconnect – prudent use of silicon photonics, without modifying DRAM dies [ISCA ’11]
3. Memory Access Protocol – streamlined slot-based interface with semi-autonomous memory [ISCA ’11]
4. Memory Reliability – efficient RAID-based high-availability chipkill memory [ISCA ’12]
PART 1 – Memory Chip Organization
Key bottleneck
(Figure: a conventional read – RAS then CAS select a row into each chip's row buffer, and the cache line is striped across all the DRAM chips; one bank shown in each chip.)
Why this is a problem
SSA Architecture
(Figure: SSA architecture – the memory controller drives a shared address/command bus to the DIMM, and each of the eight DRAM chips has its own 8-bit slice of the data bus. Within a chip, each bank is divided into subarrays with their own bitlines and row buffers, linked by a global interconnect to I/O. A full 64-byte cache line is delivered by a single chip.)
SSA Operation
(Figure: SSA operation – the address maps an entire cache line to a single subarray of a single DRAM chip; the remaining subarrays and chips stay in sleep mode or serve other accesses in parallel.)
SSA Impact
• Energy reduction
  – Dynamic: fewer bitlines activated
  – Static: smaller activation footprint means more and longer spells of inactivity, and better power-down
• Latency impact
  – Limited pins per cache line: serialization latency
  – Higher bank-level parallelism: shorter queuing delays
• Area increase
  – More peripheral circuitry and I/O at finer granularities: area overhead (< 5%)
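The serialization-vs-parallelism trade-off above can be sketched with back-of-the-envelope arithmetic. The numbers below (per-pin data rate, pin counts) are illustrative assumptions, not figures from the dissertation:

```python
# Back-of-the-envelope sketch of SSA's latency trade-off: a 64-byte cache
# line is serialized over one chip's 8 data pins, instead of being striped
# over all 64 pins of a conventional 8-chip rank.

LINE_BYTES = 64
DATA_RATE_GBPS = 1.6  # assumed per-pin data rate (illustrative)

def transfer_ns(line_bytes, pins, gbps_per_pin):
    """Time (ns) to move a cache line over the given number of pins."""
    bits = line_bytes * 8
    return bits / (pins * gbps_per_pin)  # Gbps = bits per ns

conventional = transfer_ns(LINE_BYTES, 64, DATA_RATE_GBPS)  # all chips share the line
ssa = transfer_ns(LINE_BYTES, 8, DATA_RATE_GBPS)            # one chip supplies the line

print(f"conventional: {conventional:.1f} ns, SSA: {ssa:.1f} ns")
# SSA pays 8x serialization on the data pins, but the other seven chips
# are free to serve independent requests in parallel, cutting queuing delay.
```

The 8× serialization penalty is the price of the higher bank-level parallelism and smaller activation footprint listed above.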
Key Contributions
• Up to 6X reduction in DRAM chip dynamic energy
• Up to 5X reduction in DRAM chip static energy
• Up to 50% improvements in performance in applications limited by bank contention
• All for ~5% increase in area
PART 2 – Memory Interconnect
Key Bottleneck
• Fundamental nature of electrical pins
  – Limited pin count, per-pin bandwidth, memory capacity, etc.
• Diverging growth rates of core count and pin count
• Limited by physics, not engineering!
Silicon Photonic Interconnects
• We need something that can break the edge-bandwidth bottleneck
• Ring-modulator-based photonics
  – Off-chip light source
  – Indirect modulation using resonant rings
  – Relatively cheap coupling on- and off-chip
• DWDM for high bandwidth density
  – As many as 67 wavelengths possible
  – Limited by free spectral range and coupling losses between rings
  – 64 λ × 10 Gbps/λ = 80 GB/s per waveguide
Source: Xu et al., Optics Express 16(6), 2008
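The DWDM bandwidth figure on the slide is simple arithmetic and can be sanity-checked directly:

```python
# Sanity check of the slide's DWDM claim:
# 64 wavelengths x 10 Gbps per wavelength on one waveguide.

wavelengths = 64
gbps_per_wavelength = 10

total_gbps = wavelengths * gbps_per_wavelength   # 640 Gbps aggregate
total_gbytes_per_s = total_gbps / 8              # 8 bits per byte

print(total_gbytes_per_s)  # 80.0 GB/s per waveguide
```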
The Questions We’re Trying to Answer
• Should we replace all interconnects with photonics? On-chip too?
• Should we be designing photonic DRAM dies? Stacks? Channels?
• How do we make photonics less invasive to memory die design?
• What should the role of 3D be in an optically connected memory?
• What should the role of electrical signaling be?
Design Considerations – I
• Photonic interconnects
  – Large static power dissipation: ring tuning
     Rings are designed to resonate at a specific frequency
     Process defects and temperature shifts change this
     The rings must be heated to correct for it
  – Much lower dynamic energy consumption, relatively independent of distance
• Electrical interconnects
  – Relatively small static power dissipation
  – Large dynamic energy consumption
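The static-vs-dynamic trade-off above implies a utilization crossover: the fixed ring-tuning power only pays off when it is amortized over enough traffic. A minimal sketch, with all numbers (tuning power, bandwidth, per-bit costs) assumed purely for illustration:

```python
# Illustrative energy model for the photonic-vs-electrical trade-off:
# photonics pays a fixed ring-tuning power regardless of traffic, while
# electrical links pay mostly per-bit dynamic energy. All constants here
# are made-up illustrative values, not measurements.

def photonic_pj_per_bit(utilization, tuning_mw=500, bandwidth_gbps=640,
                        dynamic_pj=0.1):
    """Energy per bit when fixed tuning power is amortized over traffic."""
    static_pj = tuning_mw / (bandwidth_gbps * utilization)  # mW / Gbps = pJ/bit
    return static_pj + dynamic_pj

ELECTRICAL_PJ_PER_BIT = 5.0  # assumed off-chip electrical link cost

for util in (0.05, 0.2, 0.8):
    pj = photonic_pj_per_bit(util)
    winner = "photonic" if pj < ELECTRICAL_PJ_PER_BIT else "electrical"
    print(f"utilization {util:.0%}: photonic {pj:.2f} pJ/bit -> {winner} wins")
```

At low utilization the electrical link wins; at high utilization photonics wins. This is why the design that follows concentrates photonic traffic rather than over-provisioning it.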
Design Considerations – II
• Should not over-provision photonic bandwidth; use it only where necessary
• Use photonics where they're really useful
  – To break the off-chip pin barrier
• Exploit 3D stacking and TSVs
  – High bandwidth, low static power, decouples memory dies
• Exploit low-swing wires
  – Cheap on-chip communication
Proposed Design
(Figure: proposed design – the processor's memory controller drives a photonic waveguide to a photonic interface die on the DIMM; commodity DRAM chips sit behind the interface die. Advantage 1: increased activity factor – more efficient use of photonics. Advantage 2: rings are co-located – easier to isolate or tune thermally. Advantage 3: not disruptive to the design of commodity memory dies.)
Key Contributions
• 23% reduced energy consumption
• 4X capacity per channel
• Potential for performance improvements due to increased bank count
• Less disruptive to memory die design
• But: this makes the job of the memory controller difficult!
PART 3 – Memory Access Protocol
Key Bottleneck
• Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
• The memory controller micro-manages every operation of the memory system
  – Processor-side support required for every memory innovation
  – Several signals between processor and memory
     Heavy pressure on the address/command bus
     Worse with several independent banks and large amounts of state
Proposed Solution
• Release the MC's tight control; make the memory stack more autonomous
• Move mundane tasks to the interface die
  – Maintenance operations (refresh, scrub, etc.)
  – Routine operations (DRAM precharge, NVM wear leveling)
  – Timing control (18+ constraints for DRAM alone)
  – Coding and any other special requirements
• The processor-side controller only schedules requests and controls the data bus
Memory Access Operation
(Figure: slot-based timing – the data bus is divided into slots, each one cache line's bus occupancy; X marks a reserved slot. ML, the memory latency, equals address latency + bank access + data-bus latency. On a request's arrival, the controller starts looking ML ahead, issues into the first free slot S1, and holds a backup slot S2.)
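The slot reservation above can be sketched in a few lines. This is a deliberate simplification (integer slot indices, a single memory latency ML, no backup slot) of the protocol in the figure:

```python
# Minimal sketch of slot-based data-bus scheduling: the bus is divided
# into cache-line-sized slots, and on issue the controller reserves the
# first free slot at least ML slots in the future. Simplified from the
# full protocol (no backup slot, uniform ML).

def reserve_slot(reserved, issue_time, ml):
    """Reserve the first free data-bus slot no earlier than issue_time + ml.

    reserved -- set of slot indices already owned by in-flight requests
    """
    slot = issue_time + ml
    while slot in reserved:   # skip slots other requests already hold
        slot += 1
    reserved.add(slot)
    return slot

reserved = {12, 13}                                   # taken by in-flight requests
first = reserve_slot(reserved, issue_time=2, ml=10)   # wants 12, bumped to 14
second = reserve_slot(reserved, issue_time=3, ml=10)  # wants 13, bumped to 15
print(first, second)  # 14 15
```

Because the controller only books bus slots, the memory stack is free to manage its own internal timing, which is the point of the semi-autonomous interface.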
Performance Impact – Synthetic Traffic
 < 9% latency impact, even at maximum load
 Virtually no impact on achieved bandwidth
Performance Impact – PARSEC/STREAM
 Apps have very low bandwidth requirements
 Scaled-down system shows similar trends
Key Contributions
• Plug and play
  – Everything is interchangeable and interoperable
  – Only interface-die support required (communicate ML)
• Better support for heterogeneous systems
  – Easier DRAM–NVM data movement on the same channel
• More innovation in the memory system
  – Without processor-side support constraints
• Fewer commands between processor and memory
  – Energy and performance advantages
PART 4 – Memory Reliability
Key Bottleneck
• Increased access granularity
  – Every data access is spread across 36 DRAM chips
  – DRAM industry standards define a minimum access granularity from each chip
  – Massive overfetch of data at multiple levels
     Wastes energy
     Wastes bandwidth
     Occupies ranks/banks for longer, hurting performance
• x4 device-width restriction
  – Fewer ranks for a given DIMM real estate
  – x8/x16/x32 are more power-efficient per unit capacity
• Reliability level: 1 failed chip out of 36
A new approach: LOT-ECC
• Operate on a single rank of memory: 9 chips
  – and support the failure of 1 chip per 9-chip rank
• Multiple tiers of localized protection
  – Tier 1: Local Error Detection (checksums)
  – Tier 2: Global Error Correction (parity)
  – Tiers 3 & 4 handle specific failure cases
• Error-correction data is stored in data memory
• Data mapping is handled by the memory controller with firmware support
  – Transparent to OS, caches, etc.
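The two main tiers can be illustrated with a toy model. This is not LOT-ECC's actual coding (the real scheme uses proper checksums over cache-line segments); it only shows the division of labor: local checksums locate the failed chip, global parity rebuilds its data:

```python
# Toy sketch of LOT-ECC's tiered idea, with one integer "word" per chip.
# Tier 1 (local detection) is modeled as a simple modular checksum and
# Tier 2 (global correction) as XOR parity across the rank's chips.
# These are illustrative stand-ins, not the real codes.

def make_codeword(data):
    """Return (data copy, per-chip checksums, global parity)."""
    checksums = [d % 251 for d in data]        # Tier 1: local error detection
    parity = 0
    for d in data:
        parity ^= d                            # Tier 2: global XOR parity
    return data[:], checksums, parity

def recover(data, checksums, parity):
    """Locate a single failed chip via its checksum; rebuild it via parity."""
    bad = [i for i, d in enumerate(data) if d % 251 != checksums[i]]
    if not bad:
        return data                             # nothing detected
    (i,) = bad                                  # single-chip failure assumed
    rebuilt = parity
    for j, d in enumerate(data):
        if j != i:
            rebuilt ^= d                        # XOR of survivors + parity
    data[i] = rebuilt
    return data

orig = [3, 141, 59, 26, 53, 589, 79, 32, 384]   # 9 chips in the rank
data, sums, par = make_codeword(orig)
data[4] = 9999                                   # chip 4 fails
assert recover(data, sums, par) == orig          # data fully rebuilt
```

Because detection is local to each chip, a read only needs that chip's checksum; the global parity is touched only on writes and on the rare correction path.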
LOT-ECC Design
The Devil is in the Details
• We borrow one bit from [data + LED] to use in the GEC
  – Put them all in the same DRAM row
• When a cache line is written:
  – Data, LED, and GEC are all written together, fully self-contained
  – No read-before-write
  – Guaranteed row-buffer hit
(Figure: GEC word layout across chips 0–8 – 7-bit parity fragments PA0-6, PA7-13, …, PA49-55 and PPA, each with a T4 bit; PA56 occupies the surplus bit borrowed from data + LED.)
Key Benefits
• Energy Efficiency: Fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes
• Performance Gains: More rank-level parallelism, reduced access granularity
• Improved Protection: Can handle 1 failed chip out of 9, compared to 1 in 36 currently
• Flexibility: Works with a single rank of x4 DRAMs or more efficient wide-I/O x8/x16 DRAMs
• Implementation Ease: Changes to memory controller and system firmware only; commodity processor/memory/OS
Power Results
(Figure: 55% reduction in memory power.)
Performance Results
 Latency reduction: LOT-ECC x8 – 43%; + GEC coalescing – 47%; oracular – 57%
Exploiting features in SSA
(Figure: LOT-ECC data layout on an SSA DIMM – each DRAM device stores cache lines L0–L63, each with its local checksum C, plus global parity words P0–P7.)
Putting it all together
Summary
• Tremendous pressure on the memory system
  – Bandwidth, energy, complexity, reliability
• Prudently apply novel technologies
  – Silicon photonics
  – Low-swing wires
  – 3D stacking
• Rethink some fundamental design choices
  – Micromanagement by the memory controller
  – Overfetch in the face of diminishing locality
  – Conventional ECC codes
Impact
• Significant static/dynamic energy reduction
  – Memory core, channel, controller, reliability
• Significant performance improvement
  – Bank parallelism, channel bandwidth, reliability
• Significant complexity reduction
  – Memory controller
• Improved reliability
Synergies
• SSA + Photonics
• Photonics + Autonomous memory
• SSA + Reliability
• SSA, photonics, and LOT-ECC provide additive energy benefits
  – Each targets one of three major sources of energy consumption: DRAM array, off-chip channel, reliability
• SSA, photonics, and LOT-ECC also provide additive performance benefits
  – Each targets one of three major performance bottlenecks: bank contention, off-chip bandwidth, reliability
Research Contributions
• Memory reliability [ISCA 2012]
• Memory access protocol [ISCA 2011]
• Memory channel architecture [ISCA 2011]
• Memory chip microarchitecture [ISCA 2010]

• On-chip networks [HPCA 2010]
• Non-uniform power caches [HiPC 2009]
• 3D stacked cache design [HPCA 2009]
Future Work
• Future project ideas include:
  – Memory architectures for graphics/throughput-oriented applications
  – Memory optimizations for handheld devices
     Tightly integrated software support
     Managing heterogeneity and reconfigurability
     Novel memory hierarchies
  – Memory autonomy and virtualization
  – Refresh management in DRAM
Acknowledgements
• Rajeev
• Naveen
• Committee: Al, Norm, Erik, Ken
• Awesome lab-mates
• Karen, Ann, Emily… the front office
• Parents & family
• Friends