LOT-ECC: LOcalized and Tiered Reliability Mechanisms for Commodity Memory Systems Ani Udipi § Naveen Muralimanohar* Rajeev Balasubramonian Al Davis Norm Jouppi* University of Utah and *HP Labs § Currently with ARM


LOT-ECC: LOcalized and Tiered Reliability Mechanisms for Commodity Memory Systems

Ani Udipi§, Naveen Muralimanohar*, Rajeev Balasubramonian, Al Davis, Norm Jouppi*

University of Utah and *HP Labs
§ Currently with ARM


Memory Reliability

• Datacenters are the backbone of the web-connected infrastructure
– Reliability is essential
• Memory reliability is a major concern [Schroeder et al., SIGMETRICS ’09]
– Among the most error-prone parts of a server
– Even a few uncorrectable errors will require DIMM replacement
– Ranks near the top of component replacements in datacenters
  – Increases downtime
  – Increases operational cost

Source: Nagios


Some Numbers

A single server blade
2 billion DRAM cells per chip
× 36 DRAM chips per DIMM
× 2 DIMMs per channel
× 4 channels per processor
× 4 processors per blade
= ~2.5 × 10^12 DRAM cells

Datacenter
16 blades per enclosure
× 4 enclosures per rack
× 10 racks per container
× 40 containers per datacenter
= ~64 × 10^15 DRAM cells

Assume MTTF per cell is the age of the universe, ~14 billion years:
– Blade DRAM MTTF = 2 days
– Datacenter DRAM MTTF = 7 seconds
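The cell counts and MTTF figures on this slide can be reproduced with a few multiplications. The sketch below assumes that "2 billion" means 2^31 cells and that cells fail independently at a constant rate, so the expected time to the first failure among N cells is the per-cell MTTF divided by N. Neither assumption is stated on the slide, and all names are illustrative.

```python
# Back-of-the-envelope check of the "Some Numbers" slide.
# Assumptions (not stated on the slide): "2 billion" cells means 2**31,
# and cells fail independently at a constant rate, so the expected time
# to the first failure among N cells is cell_mttf / N.

CELLS_PER_CHIP = 2**31
CHIPS_PER_DIMM = 36
DIMMS_PER_CHANNEL = 2
CHANNELS_PER_PROCESSOR = 4
PROCESSORS_PER_BLADE = 4

BLADES_PER_ENCLOSURE = 16
ENCLOSURES_PER_RACK = 4
RACKS_PER_CONTAINER = 10
CONTAINERS_PER_DATACENTER = 40

CELL_MTTF_YEARS = 14e9      # age of the universe, as on the slide
DAYS_PER_YEAR = 365.25
SECONDS_PER_DAY = 86_400

blade_cells = (CELLS_PER_CHIP * CHIPS_PER_DIMM * DIMMS_PER_CHANNEL
               * CHANNELS_PER_PROCESSOR * PROCESSORS_PER_BLADE)
blades_per_datacenter = (BLADES_PER_ENCLOSURE * ENCLOSURES_PER_RACK
                         * RACKS_PER_CONTAINER * CONTAINERS_PER_DATACENTER)
datacenter_cells = blade_cells * blades_per_datacenter


def first_failure_days(cells):
    """Expected days until the first of `cells` independent cells fails."""
    return CELL_MTTF_YEARS * DAYS_PER_YEAR / cells


blade_mttf_days = first_failure_days(blade_cells)
datacenter_mttf_seconds = first_failure_days(datacenter_cells) * SECONDS_PER_DAY

print(f"blade: {blade_cells:.2e} cells, MTTF {blade_mttf_days:.1f} days")
print(f"datacenter: {datacenter_cells:.2e} cells, "
      f"MTTF {datacenter_mttf_seconds:.1f} seconds")
```

Under these assumptions the results land close to the slide's rounded values of 2 days and 7 seconds.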


Target Reliability

• High-end servers commonly have high reliability expectations
– Single Symbol Correct, Double Symbol Detect (SSC-DSD)
– One symbol == one DRAM chip (“Chipkill”)
• Today’s systems employ symbol-based ECC codes


Problems with Existing Solutions

• Increased access granularity
– Every data access is spread across 36 DRAM chips
– JEDEC standards define minimum access granularity per chip
– Massive overfetch of data at multiple levels
  – Wastes energy
  – Wastes bandwidth
  – Reduced rank-level parallelism
• x4 device width restriction
– Fewer ranks for given DIMM real estate
• Reliability level: 1 failed chip out of 36
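The overfetch can be made concrete with a short calculation. The sketch below assumes a DDR3 burst length of 8 and a 64-byte cache line (both figures appear elsewhere in this deck) and the usual x72 split in which 8 of every 9 transferred bytes are data. The function names are illustrative.

```python
# Access granularity of a 36-chip x4 access versus a 9-chip x8 rank.
# Assumptions: DDR3 burst length of 8, 64-byte cache lines, and a
# standard 8:1 data-to-ECC split on the bus.

BURST_LENGTH = 8
CACHE_LINE_BYTES = 64


def bytes_per_access(chips, bits_per_chip):
    """Bytes moved by one full burst across all chips in the access."""
    return chips * bits_per_chip * BURST_LENGTH // 8


def data_bytes(total_bytes):
    """Data portion of a transfer when 1 of every 9 bytes is ECC."""
    return total_bytes * 8 // 9


chipkill_total = bytes_per_access(chips=36, bits_per_chip=4)
single_rank_total = bytes_per_access(chips=9, bits_per_chip=8)
lines_fetched = data_bytes(chipkill_total) / CACHE_LINE_BYTES

print(f"36 x4 chips: {chipkill_total} bytes per access, "
      f"{lines_fetched:.0f} cache lines of data")
print(f"9 x8 chips:  {single_rank_total} bytes per access")
```

Under these assumptions the 36-chip access moves twice the data a single cache-line request needs, which is the overfetch the slide refers to.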


A New Approach: LOT-ECC

• Operate on a single rank of x8 memory: 9 chips
– and support 1 failed chip out of 9
• Multiple tiers of localized protection
– Tier 1: Local Error Detection (checksum)
– Tier 2: Global Error Correction (parity)
– Tiers 3 and 4 handle specific failure cases
• Data mapping handled by memory controller with firmware support
– Transparent to OS, caches, etc.
– Strictly commodity DRAM used
• Significant power and performance benefits


Tier 1 – Local Error Detection (LED)

• Standard x72 DIMM (nine x8 parts): eight data + one ECC
• We use all 9 chips for both data and ECC
• 64 bits per chip per burst – 57 data + 7 checksum

[Figure: per-chip layout across Chip 0 ... Chip 7, Chip 8]


Tier 1 – Local Error Detection (LED)

• 57 bits * 9 = 513 bits of data capacity for a 512-bit cache line
– Only 1 cache line read at a time
– 57 bits/chip on first 8 chips; 56 bits on 9th chip
– 1 bit extra on the 9th chip: use in a different tier of protection
• No performance impact on memory reads or writes
– LED ops occur in parallel with data ops
• Note that LED is local to each chip
– Need to pin-point exact failed chip, not simply detect an error in the rank
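The bit budget above can be sketched as a split of one cache line into nine per-chip segments. The bit ordering (least significant bits go to Chip 0) and the function names are assumptions made for illustration; the slides do not specify the mapping.

```python
# Splitting a 512-bit cache line across nine x8 chips for Tier 1 (LED).
# Each chip supplies 64 bits per burst: 57 data bits + 7 checksum bits.
# The bit ordering used here is a hypothetical choice, not the authors'.

CHIPS = 9
BITS_PER_CHIP_PER_BURST = 64
LED_BITS = 7
DATA_BITS = BITS_PER_CHIP_PER_BURST - LED_BITS   # 57
CACHE_LINE_BITS = 512

# 9 * 57 = 513, so one bit is left over on the ninth chip.
surplus_bits = CHIPS * DATA_BITS - CACHE_LINE_BITS


def split_cache_line(line):
    """Return nine data segments; the ninth carries only 56 line bits."""
    mask = (1 << DATA_BITS) - 1
    return [(line >> (chip * DATA_BITS)) & mask for chip in range(CHIPS)]


def join_segments(segments):
    """Inverse of split_cache_line."""
    line = 0
    for chip, segment in enumerate(segments):
        line |= segment << (chip * DATA_BITS)
    return line


line = (1 << CACHE_LINE_BITS) - 1          # all-ones cache line
segments = split_cache_line(line)
print([segment.bit_length() for segment in segments], surplus_bits)
```

Each segment would then get its own 7-bit checksum, which is what lets the controller identify the failing chip rather than just the failing rank.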


Tier 2 – Global Error Correction (GEC)

[Figure: Data, LED, and GEC layout across Chip 0, Chip 1, ..., Chip 7, Chip 8. Each chip holds 57-bit data segments (A0 ... A8 through G0 ... G8) with 7-bit LED fields. The GEC for cache line A is spread across the chips as PA0-6, PA7-13, ..., PA49-55, with PA56 and PPA drawn alongside.]

• A, B, C, D, E, F, G, H – cache lines, each comprising segments X0 through X8
• LXN – Tier 1 Local Error Detection for cache line X, segment N
• [PX0:PXN] – Tier 2 Global Error Correction across segments X0 through X8
• PPX – parity across GEC segments PX0-6 through PX49-55


The Devil is in the Details..

• ..and the details are in the paper!
• Need to detect and correct additional errors in GEC region
– Parity is 57 bits; write granularity is 72 bits
– Use the remaining 15 bits wisely, add two more tiers of protection

[Figure: GEC layout for cache line A across Chip 0, Chip 1, ..., Chip 7, Chip 8. Each chip holds a 7-bit field (PA0-6, PA7-13, ..., PA49-55, and PPA) plus a 1-bit T4 field. PA56 and its T4 bit are labeled "Surplus bit borrowed from data + LED".]


Optimizing Write Behavior

• Every write has to update its GEC bits
– Already borrowing one bit from [data + LED] to use in the GEC
– Put them all in the same DRAM row!
– Guaranteed row-buffer hit
– Data mapping handled by the memory controller

[Figure: one DRAM row per chip, shown for Chip 0, Chip 7, and Chip 8. Each row holds the 57-bit data segments and 7-bit LED fields of cache lines A through H (A0, B0, ..., G0, H0 on Chip 0), followed by the matching GEC fields (PA0-6, PB0-6, ..., PH0-6 on Chip 0; PA49-55, PB49-55, ..., PH49-55 on Chip 7; PPA, PPB, ..., PPH on Chip 8).]


GEC Coalescing

• DDR3 burst of 8 forces 72 bytes per access
– GEC per cache line is only 72 bits
• With sufficient locality, one GEC write can potentially cover 8 data writes
– In reality, each write becomes 1 + δ writes (for 0.125 ≤ δ ≤ 1)
• Note that even with δ = 1, benefits of row-buffer hit remain
• Writes are typically buffered at the memory controller to avoid bus turnaround overheads
– Controller can re-order accesses to accommodate coalescing
• Results show three cases: Basic design (δ = 1), Simple coalescing (measured δ), and Oracular design (δ = 0.125)
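The meaning of δ can be illustrated with a toy model of one drained write buffer. The sketch assumes that eight consecutive cache lines share a single 72-byte GEC access (72 bytes divided by 72 bits of GEC per line) and that the batch contains distinct cache lines. The index-based grouping and the function name are hypothetical, not the controller's actual mapping.

```python
# Toy model of GEC coalescing: delta = GEC writes per data write.
# Assumption: cache lines whose indices fall in the same group of eight
# share one GEC access. This grouping is illustrative only.

LINES_PER_GEC_ACCESS = 8   # 72 bytes per access / 9 bytes of GEC per line


def coalescing_delta(buffered_lines):
    """Return delta for one batch of distinct cache-line indices."""
    if not buffered_lines:
        raise ValueError("batch must contain at least one write")
    gec_accesses = {line // LINES_PER_GEC_ACCESS for line in buffered_lines}
    return len(gec_accesses) / len(buffered_lines)


best_case = coalescing_delta(list(range(8)))       # perfect locality
worst_case = coalescing_delta([0, 8, 16, 24])      # no two lines share GEC
mixed_case = coalescing_delta([0, 1, 8, 9])

print(best_case, worst_case, mixed_case)
```

Perfect locality gives δ = 0.125 (the oracular case) and no locality gives δ = 1 (the basic design), matching the two endpoints on the slide.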


Constructing the LED Code

• Use a 7-bit ECC code to detect errors in 57 data bits
– We choose a 7-bit 1’s complement checksum
• Paper details code operation and computes FIT
– Single-bit, double-bit, row, column, row-column, pin, chip, multiple random, combinations
• Very small rate of undetected errors
– Caused by very specific, uncommon bit-flip combinations
– Less than 5E-5 FIT!
• Captures some failure modes NOT captured by existing mechanisms (failure of 2 chips out of 18, errors in >2 chips/rank, etc.)
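The slide names a 7-bit 1's complement checksum over 57 data bits but leaves the construction to the paper. The sketch below assumes the 57 bits are cut into 7-bit chunks (the last chunk holds a single bit), summed with end-around carry, and then inverted. The chunking and the function names are illustrative assumptions, not necessarily the authors' exact code.

```python
# A 7-bit one's complement checksum over a 57-bit data segment.
# Assumption: the segment is processed in 7-bit chunks, least
# significant chunk first; the slides do not give this detail.

CHECKSUM_BITS = 7
DATA_BITS = 57
MASK = (1 << CHECKSUM_BITS) - 1   # 0x7F


def led_checksum(segment):
    """One's complement sum of 7-bit chunks, then bitwise inverted."""
    total = 0
    for shift in range(0, DATA_BITS, CHECKSUM_BITS):
        total += (segment >> shift) & MASK
        # End-around carry: fold the carry-out back into the low bits.
        total = (total & MASK) + (total >> CHECKSUM_BITS)
    return ~total & MASK


def led_check(segment, stored_checksum):
    """True when the segment still matches its stored checksum."""
    return led_checksum(segment) == stored_checksum


segment = 0x123456789ABCDE & ((1 << DATA_BITS) - 1)
stored = led_checksum(segment)

# Every single-bit flip in the data changes the checksum.
single_flips_detected = all(
    not led_check(segment ^ (1 << bit), stored) for bit in range(DATA_BITS)
)
print(single_flips_detected)
```

A single-bit flip changes the sum by a power of two, which is never a multiple of 127, so this construction always detects it. The multi-bit failure modes listed on the slide need the fuller analysis in the paper.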


Checksum Design

• Not all error combinations actually occur in DRAM
– Small number of failure modes with specific root causes
– Code’s effectiveness under those failures is important
• Current symbol-based codes guarantee capturing 100% of SSC-DSD errors
– At huge power and performance penalties
– Likely overkill
• Not scalable as error rates increase
– Use strong yet practical codes + RAS features
– Example: Proactive patrol scrubbing will capture a majority of soft errors; may not coincide with hard errors


Evaluation Methodology

• Performance analysis: In-house DRAM simulator
– Models refresh, address/command bus, data bus, banks/ranks/channels contention, read/write queues
• Power analysis: Micron power calculator spreadsheet
– Reflects timing parameters assumed for performance simulations
– Bus utilization and bank utilization numbers obtained from performance simulations
– Accounts for activation power, read/write power, termination power, and background power
– Includes low-power sleep modes


Evaluation Platforms

• Xeon 7500-like system
– 8 DDR3 channels, 2 DIMMs/channel
– Dual-ranked x4 or quad-ranked x8 DIMMs
– “Lockstep mode” is the only supported mode
  – Two ranks operate together to provide a 144-bit bus
  – Wasted bandwidth by masking out half the burst, OR forced prefetching
• Also evaluate Xeon 5500-like systems
– 3 DDR3 channels, 3 DIMMs/channel
– “Lockstep mode” wastes one channel entirely, gangs other two
• Evaluate five design points each
– Baseline symbol-based SSC-DSD
– Virtualized ECC (Yoon & Erez, ASPLOS ’10)
– LOT-ECC with no coalescing, simple coalescing, oracular coalescing


Power Results 7500

[Figure: power results for the Xeon 7500-like system; annotated reduction: -43%]


Power Results 5500

[Figure: power results for the Xeon 5500-like system; annotated reduction: -32%]


Performance Results 7500

[Figure: performance results for the Xeon 7500-like system]

Latency reduction:
– LOT-ECC: 4.6%
– + GEC coalescing: 7.7%
– Oracular: 16.2%


Performance Results 5500

[Figure: performance results for the Xeon 5500-like system]

Latency reduction:
– LOT-ECC: 42.9%
– + GEC coalescing: 46.9%
– Oracular: 57.3%


Storage Overhead

• For each 64-byte cache line
– 63 bits of LED checksum
– 57 bits of GEC parity
– 7 bits of T3 code
– 9 bits of T4 code
• Total storage overhead of 26.5%
• Current ECC implementations and DIMMs already accept 12.5% through extra chip
• Additional 14% in data memory via firmware
• Memory capacity is cheap if commodity
– Better to spend on this than power/performance


Key Contributions

• Multi-tiered protection design to keep fault tolerance contained to fewer chips
• Unique data layout tailored to the access mechanism of commodity DRAM systems
• Exploit row-buffer efficiency – co-locate data and all tiers of fault-tolerance codes
– Mitigates overheads of additional writes typical in parity-based systems
• Coalescing optimization to further minimize impact of parity writes


Key Benefits

• Power Efficiency: Fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes (43% memory power savings)

• Performance Gains: More rank-level parallelism, reduced access granularity (7.7% memory latency reduction)

• Improved Protection: Can handle 1 failed chip out of 9, compared to 1 in 36 currently

• Flexibility: Works with a single rank of x4 DRAMs or more efficient x8 DRAMs

• Implementation Ease: Changes to memory controller and system firmware only; commodity processor/memory/OS


BACKUP SLIDES


Tier 2 – Global Error Correction (GEC)

• GEC is a parity written across the cache line segments in each chip
• LED has already pinpointed erroneous segment
– Error correction is trivial
• Storing the parity
– A portion of memory set aside to hold GEC
– Handled by memory controller + firmware
• No impact on reads unless error is detected
• GEC also self contained (single cache line)
– No read-before-write
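Once LED has identified the failing chip, recovery reduces to an XOR. The sketch below assumes each chip's segment is an integer of up to 57 bits and takes the failed chip index as an argument. The function names are illustrative, and Tiers 3 and 4, which protect the GEC bits themselves, are not modeled.

```python
# Tier 2 (GEC) as bitwise parity across the nine per-chip segments.
# Assumption: the failed chip index is supplied by the LED check.

from functools import reduce
from operator import xor

CHIPS = 9
DATA_BITS = 57
MASK = (1 << DATA_BITS) - 1


def gec_parity(segments):
    """Bitwise XOR of all per-chip segments of one cache line."""
    return reduce(xor, segments, 0)


def reconstruct(segments, parity, failed_chip):
    """Rebuild the failed chip's segment from the parity and survivors."""
    survivors = (s for chip, s in enumerate(segments) if chip != failed_chip)
    repaired = list(segments)
    repaired[failed_chip] = reduce(xor, survivors, parity)
    return repaired


original = [(chip * 0x1234567 + 1) & MASK for chip in range(CHIPS)]
parity = gec_parity(original)

damaged = list(original)
damaged[3] = 0                       # pretend Chip 3 returned garbage
recovered = reconstruct(damaged, parity, failed_chip=3)
print(recovered == original)
```

Because the parity is computed from a single cache line, a write can generate it from the data in hand, which is why no read-before-write is needed.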
