www.cs.utah.edu/~udipi
Designing Efficient Memory for Future Computing Systems
Aniruddha N. Udipi University of Utah
Ph.D. Dissertation Defense, March 7, 2012 Advisor: Rajeev Balasubramonian
My other computer is…
Scaling server farms
• Facebook: 30,000 servers, 80 billion images stored, serves 600,000 photos a second, logs 25 TB of data per day… the statistics go on
• The primary challenge to scaling: efficient supply of data to thousands of cores
• It’s all about the memory!
Performance Trends
• Demand side
  – Multi-socket, multi-core, multi-thread
  – Large datasets: big-data analytics, scientific computation models
  – RAMCloud-like designs
  – 1 TB/s per node by 2017
• Supply side
  – Pin count, per-pin bandwidth, capacity
  – Severely power limited
Energy Trends
• Datacenters consume ~2% of all power generated in the US
  – Operation + cooling
  – 100 billion kWh, $7.4 billion
• 25–40% of total power in large systems is consumed in memory
• As processors get simpler, this fraction is likely to increase
Cost-per-bit
• Traditionally the holy grail of DRAM design
• Operational expenditure over 3 years now equals capital expenditure in datacenter servers
  – Cost-per-bit is less important than before
(Figure: $3.00 / 13 W vs. $0.30 / 60 W)
Complexity Trends
• The job of the memory controller is hard
  – 18+ timing parameters for DRAM!
  – Maintenance operations: refresh, scrub, power-down, etc.
• Several DIMM and controller variants
  – Hard to provide interoperability
  – Need processor-side support for new memory features
• Now throw in heterogeneity
  – Memristors, PCM, STT-RAM, etc.
Reliability Trends
• Shrinking feature sizes are not helping
• Nor is the scale
  – 64 × 10^15 DRAM cells in a typical datacenter
• DRAM errors are the #1 reason for servers at Google to enter repair
• Datacenters are the backbone of web-connected infrastructure
  – Reliability is essential
• Server downtime has huge economic impact
  – Breached SLAs, for example
Thesis statement
• Main memory systems are at an inflection point
  – Convergence of several trends
• A major overhaul is required to achieve a system that is
  – Energy-efficient, high-performance, low-complexity, reliable, and cost-effective
• This requires a combination of two things
  – Prudent application of novel technologies
  – Fundamental rethinking of conventional design decisions
Designing Future Memory Systems
1. Memory Chip Architecture – reducing overfetch & increasing parallelism [ISCA ’10]
2. Memory Interconnect – prudent use of silicon photonics, without modifying DRAM dies [ISCA ’11]
3. Memory Access Protocol – streamlined slot-based interface with semi-autonomous memory [ISCA ’11]
4. Memory Reliability – efficient RAID-based high-availability chipkill memory [ISCA ’12]
PART 1 – Memory Chip Organization
Key bottleneck
(Figure: a conventional read – RAS then CAS select a row into each chip's row buffer, and the cache line is striped across all the DRAM chips; one bank shown in each chip.)
Why this is a problem
SSA Architecture
(Figure: SSA architecture – the memory controller drives a shared address/command bus to the DIMM, and each of the eight DRAM chips has its own 8-bit slice of the data bus. Within a chip, each bank is divided into subarrays with their own bitlines and row buffers, linked by a global interconnect to I/O. A full 64-byte cache line is delivered by a single chip.)
SSA Operation
(Figure: SSA operation – the address maps an entire cache line to a single subarray of a single DRAM chip; the remaining subarrays and chips stay in sleep mode or serve other accesses in parallel.)
SSA Impact
• Energy reduction
  – Dynamic: fewer bitlines activated
  – Static: smaller activation footprint means more and longer spells of inactivity, and better power-down
• Latency impact
  – Limited pins per cache line: serialization latency
  – Higher bank-level parallelism: shorter queuing delays
• Area increase
  – More peripheral circuitry and I/O at finer granularities: area overhead (< 5%)
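The serialization-vs-parallelism trade-off above can be sketched with back-of-the-envelope arithmetic. The numbers below (per-pin data rate, pin counts) are illustrative assumptions, not figures from the dissertation:

```python
# Back-of-the-envelope sketch of SSA's latency trade-off: a 64-byte cache
# line is serialized over one chip's 8 data pins, instead of being striped
# over all 64 pins of a conventional 8-chip rank.

LINE_BYTES = 64
DATA_RATE_GBPS = 1.6  # assumed per-pin data rate (illustrative)

def transfer_ns(line_bytes, pins, gbps_per_pin):
    """Time (ns) to move a cache line over the given number of pins."""
    bits = line_bytes * 8
    return bits / (pins * gbps_per_pin)  # Gbps = bits per ns

conventional = transfer_ns(LINE_BYTES, 64, DATA_RATE_GBPS)  # all chips share the line
ssa = transfer_ns(LINE_BYTES, 8, DATA_RATE_GBPS)            # one chip supplies the line

print(f"conventional: {conventional:.1f} ns, SSA: {ssa:.1f} ns")
# SSA pays 8x serialization on the data pins, but the other seven chips
# are free to serve independent requests in parallel, cutting queuing delay.
```

The 8× serialization penalty is the price of the higher bank-level parallelism and smaller activation footprint listed above.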
Key Contributions
• Up to 6X reduction in DRAM chip dynamic energy
• Up to 5X reduction in DRAM chip static energy
• Up to 50% improvements in performance in applications limited by bank contention
• All for ~5% increase in area
PART 2 – Memory Interconnect
Key Bottleneck
• Fundamental nature of electrical pins
  – Limited pin count, per-pin bandwidth, memory capacity, etc.
• Diverging growth rates of core count and pin count
• Limited by physics, not engineering!
Silicon Photonic Interconnects
• We need something that can break the edge-bandwidth bottleneck
• Ring-modulator-based photonics
  – Off-chip light source
  – Indirect modulation using resonant rings
  – Relatively cheap coupling on- and off-chip
• DWDM for high bandwidth density
  – As many as 67 wavelengths possible
  – Limited by free spectral range and coupling losses between rings
  – 64 λ × 10 Gbps/λ = 80 GB/s per waveguide
Source: Xu et al., Optics Express 16(6), 2008
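The DWDM bandwidth figure on the slide is simple arithmetic and can be sanity-checked directly:

```python
# Sanity check of the slide's DWDM claim:
# 64 wavelengths x 10 Gbps per wavelength on one waveguide.

wavelengths = 64
gbps_per_wavelength = 10

total_gbps = wavelengths * gbps_per_wavelength   # 640 Gbps aggregate
total_gbytes_per_s = total_gbps / 8              # 8 bits per byte

print(total_gbytes_per_s)  # 80.0 GB/s per waveguide
```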
The Questions We’re Trying to Answer
• Should we replace all interconnects with photonics? On-chip too?
• Should we be designing photonic DRAM dies? Stacks? Channels?
• How do we make photonics less invasive to memory die design?
• What should the role of 3D be in an optically connected memory?
• What should the role of electrical signaling be?
Design Considerations – I
• Photonic interconnects
  – Large static power dissipation: ring tuning
     Rings are designed to resonate at a specific frequency
     Process defects and temperature shifts change this
     The rings must be heated to correct for it
  – Much lower dynamic energy consumption, relatively independent of distance
• Electrical interconnects
  – Relatively small static power dissipation
  – Large dynamic energy consumption
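The static-vs-dynamic trade-off above implies a utilization crossover: the fixed ring-tuning power only pays off when it is amortized over enough traffic. A minimal sketch, with all numbers (tuning power, bandwidth, per-bit costs) assumed purely for illustration:

```python
# Illustrative energy model for the photonic-vs-electrical trade-off:
# photonics pays a fixed ring-tuning power regardless of traffic, while
# electrical links pay mostly per-bit dynamic energy. All constants here
# are made-up illustrative values, not measurements.

def photonic_pj_per_bit(utilization, tuning_mw=500, bandwidth_gbps=640,
                        dynamic_pj=0.1):
    """Energy per bit when fixed tuning power is amortized over traffic."""
    static_pj = tuning_mw / (bandwidth_gbps * utilization)  # mW / Gbps = pJ/bit
    return static_pj + dynamic_pj

ELECTRICAL_PJ_PER_BIT = 5.0  # assumed off-chip electrical link cost

for util in (0.05, 0.2, 0.8):
    pj = photonic_pj_per_bit(util)
    winner = "photonic" if pj < ELECTRICAL_PJ_PER_BIT else "electrical"
    print(f"utilization {util:.0%}: photonic {pj:.2f} pJ/bit -> {winner} wins")
```

At low utilization the electrical link wins; at high utilization photonics wins. This is why the design that follows concentrates photonic traffic rather than over-provisioning it.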
Design Considerations – II
• Should not over-provision photonic bandwidth; use it only where necessary
• Use photonics where they're really useful
  – To break the off-chip pin barrier
• Exploit 3D stacking and TSVs
  – High bandwidth, low static power, decouples memory dies
• Exploit low-swing wires
  – Cheap on-chip communication
Proposed Design
(Figure: proposed design – the processor's memory controller drives a photonic waveguide to a photonic interface die on the DIMM; commodity DRAM chips sit behind the interface die. Advantage 1: increased activity factor – more efficient use of photonics. Advantage 2: rings are co-located – easier to isolate or tune thermally. Advantage 3: not disruptive to the design of commodity memory dies.)
Key Contributions
• 23% reduced energy consumption
• 4X capacity per channel
• Potential for performance improvements due to increased bank count
• Less disruptive to memory die design
• But: this makes the job of the memory controller difficult!
PART 3 – Memory Access Protocol
Key Bottleneck
• Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
• The memory controller micro-manages every operation of the memory system
  – Processor-side support required for every memory innovation
  – Several signals between processor and memory
     Heavy pressure on the address/command bus
     Worse with several independent banks and large amounts of state
Proposed Solution
• Release the MC's tight control; make the memory stack more autonomous
• Move mundane tasks to the interface die
  – Maintenance operations (refresh, scrub, etc.)
  – Routine operations (DRAM precharge, NVM wear leveling)
  – Timing control (18+ constraints for DRAM alone)
  – Coding and any other special requirements
• The processor-side controller only schedules requests and controls the data bus
Memory Access Operation
(Figure: slot-based timing – the data bus is divided into slots, each one cache line's bus occupancy; X marks a reserved slot. ML, the memory latency, equals address latency + bank access + data-bus latency. On a request's arrival, the controller starts looking ML ahead, issues into the first free slot S1, and holds a backup slot S2.)
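The slot reservation above can be sketched in a few lines. This is a deliberate simplification (integer slot indices, a single memory latency ML, no backup slot) of the protocol in the figure:

```python
# Minimal sketch of slot-based data-bus scheduling: the bus is divided
# into cache-line-sized slots, and on issue the controller reserves the
# first free slot at least ML slots in the future. Simplified from the
# full protocol (no backup slot, uniform ML).

def reserve_slot(reserved, issue_time, ml):
    """Reserve the first free data-bus slot no earlier than issue_time + ml.

    reserved -- set of slot indices already owned by in-flight requests
    """
    slot = issue_time + ml
    while slot in reserved:   # skip slots other requests already hold
        slot += 1
    reserved.add(slot)
    return slot

reserved = {12, 13}                                   # taken by in-flight requests
first = reserve_slot(reserved, issue_time=2, ml=10)   # wants 12, bumped to 14
second = reserve_slot(reserved, issue_time=3, ml=10)  # wants 13, bumped to 15
print(first, second)  # 14 15
```

Because the controller only books bus slots, the memory stack is free to manage its own internal timing, which is the point of the semi-autonomous interface.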
Performance Impact – Synthetic Traffic
 < 9% latency impact, even at maximum load
 Virtually no impact on achieved bandwidth
Performance Impact – PARSEC/STREAM
 Apps have very low bandwidth requirements
 Scaled-down system shows similar trends
Key Contributions
• Plug and play
  – Everything is interchangeable and interoperable
  – Only interface-die support required (communicate ML)
• Better support for heterogeneous systems
  – Easier DRAM–NVM data movement on the same channel
• More innovation in the memory system
  – Without processor-side support constraints
• Fewer commands between processor and memory
  – Energy and performance advantages
PART 4 – Memory Reliability
Key Bottleneck
• Increased access granularity
  – Every data access is spread across 36 DRAM chips
  – DRAM industry standards define a minimum access granularity from each chip
  – Massive overfetch of data at multiple levels
     Wastes energy
     Wastes bandwidth
     Occupies ranks/banks for longer, hurting performance
• x4 device-width restriction
  – Fewer ranks for a given DIMM real estate
  – x8/x16/x32 are more power-efficient per unit capacity
• Reliability level: 1 failed chip out of 36
A new approach: LOT-ECC
• Operate on a single rank of memory: 9 chips
  – and support the failure of 1 chip per 9-chip rank
• Multiple tiers of localized protection
  – Tier 1: Local Error Detection (checksums)
  – Tier 2: Global Error Correction (parity)
  – Tiers 3 & 4 handle specific failure cases
• Error-correction data is stored in data memory
• Data mapping is handled by the memory controller with firmware support
  – Transparent to OS, caches, etc.
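The two main tiers can be illustrated with a toy model. This is not LOT-ECC's actual coding (the real scheme uses proper checksums over cache-line segments); it only shows the division of labor: local checksums locate the failed chip, global parity rebuilds its data:

```python
# Toy sketch of LOT-ECC's tiered idea, with one integer "word" per chip.
# Tier 1 (local detection) is modeled as a simple modular checksum and
# Tier 2 (global correction) as XOR parity across the rank's chips.
# These are illustrative stand-ins, not the real codes.

def make_codeword(data):
    """Return (data copy, per-chip checksums, global parity)."""
    checksums = [d % 251 for d in data]        # Tier 1: local error detection
    parity = 0
    for d in data:
        parity ^= d                            # Tier 2: global XOR parity
    return data[:], checksums, parity

def recover(data, checksums, parity):
    """Locate a single failed chip via its checksum; rebuild it via parity."""
    bad = [i for i, d in enumerate(data) if d % 251 != checksums[i]]
    if not bad:
        return data                             # nothing detected
    (i,) = bad                                  # single-chip failure assumed
    rebuilt = parity
    for j, d in enumerate(data):
        if j != i:
            rebuilt ^= d                        # XOR of survivors + parity
    data[i] = rebuilt
    return data

orig = [3, 141, 59, 26, 53, 589, 79, 32, 384]   # 9 chips in the rank
data, sums, par = make_codeword(orig)
data[4] = 9999                                   # chip 4 fails
assert recover(data, sums, par) == orig          # data fully rebuilt
```

Because detection is local to each chip, a read only needs that chip's checksum; the global parity is touched only on writes and on the rare correction path.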
LOT-ECC Design
The Devil is in the Details
• We borrow one bit from [data + LED] to use in the GEC
  – Put them all in the same DRAM row
• When a cache line is written:
  – Data, LED, and GEC are all written together, fully self-contained
  – No read-before-write
  – Guaranteed row-buffer hit
(Figure: GEC word layout across chips 0–8 – 7-bit parity fragments PA0-6, PA7-13, …, PA49-55 and PPA, each with a T4 bit; PA56 occupies the surplus bit borrowed from data + LED.)
Key Benefits
• Energy Efficiency: Fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes
• Performance Gains: More rank-level parallelism, reduced access granularity
• Improved Protection: Can handle 1 failed chip out of 9, compared to 1 in 36 currently
• Flexibility: Works with a single rank of x4 DRAMs or more efficient wide-I/O x8/x16 DRAMs
• Implementation Ease: Changes to memory controller and system firmware only; commodity processor/memory/OS
Power Results
(Figure: 55% reduction in memory power.)
Performance Results
 Latency reduction: LOT-ECC x8 – 43%; + GEC coalescing – 47%; oracular – 57%
Exploiting features in SSA
(Figure: LOT-ECC data layout on an SSA DIMM – each DRAM device stores cache lines L0–L63, each with its local checksum C, plus global parity words P0–P7.)
Putting it all together
Summary
• Tremendous pressure on the memory system
  – Bandwidth, energy, complexity, reliability
• Prudently apply novel technologies
  – Silicon photonics
  – Low-swing wires
  – 3D stacking
• Rethink some fundamental design choices
  – Micromanagement by the memory controller
  – Overfetch in the face of diminishing locality
  – Conventional ECC codes
Impact
• Significant static/dynamic energy reduction
  – Memory core, channel, controller, reliability
• Significant performance improvement
  – Bank parallelism, channel bandwidth, reliability
• Significant complexity reduction
  – Memory controller
• Improved reliability
Synergies
• SSA + Photonics
• Photonics + Autonomous memory
• SSA + Reliability
• SSA, photonics, and LOT-ECC provide additive energy benefits
  – Each targets one of three major sources of energy consumption: DRAM array, off-chip channel, reliability
• SSA, photonics, and LOT-ECC also provide additive performance benefits
  – Each targets one of three major performance bottlenecks: bank contention, off-chip bandwidth, reliability
Research Contributions
• Memory reliability [ISCA 2012]
• Memory access protocol [ISCA 2011]
• Memory channel architecture [ISCA 2011]
• Memory chip microarchitecture [ISCA 2010]

• On-chip networks [HPCA 2010]
• Non-uniform power caches [HiPC 2009]
• 3D stacked cache design [HPCA 2009]
Future Work
• Future project ideas include:
  – Memory architectures for graphics/throughput-oriented applications
  – Memory optimizations for handheld devices
     Tightly integrated software support
     Managing heterogeneity and reconfigurability
     Novel memory hierarchies
  – Memory autonomy and virtualization
  – Refresh management in DRAM
Acknowledgements
• Rajeev
• Naveen
• Committee: Al, Norm, Erik, Ken
• Awesome lab-mates
• Karen, Ann, Emily… the front office
• Parents & family
• Friends