Reliability of ECC-Based Memory Architectures with Online Self-Repair Capabilities

Gian Mayuga 1, Yuta Yamato 1, Tomokazu Yoneda 1, Yasuo Sato 2, Michiko Inoue 1

1 Nara Institute of Science and Technology, Ikoma, Nara, Japan
2 Kyushu Institute of Technology, Iizuka, Fukuoka, Japan


Outline

•  Research Background
•  Proposed ECC-Based Memory Architecture
•  Proposed Online Repair Strategy
•  Reliability Evaluation
•  Results
•  Conclusion

12/19/14 IEICE DC 2


Issues on Embedded Memory

•  Memory takes up most of the area in large-scale SoCs

•  Post-manufacturing failures occur frequently in memory

•  Periodic in-field test and repair, together with error correction, are required to maintain reliability

Source: Semico Research Corp http://www.semico.com/content/semico-systems-chip-%E2%80%93-braver-new-world

[Figure: SoC Area Partitioning. Chart of percent of area (axis 0 to 100) per year (1999, 2000, 2005, 2008*, 2011*, 2014*, 2017*), split into % area new logic, % area reused logic, and % area memory. Data labels shown: 20, 38, 44, 53, 58, 65, 69.]


Memory Errors and Mechanisms for Repair and Correction

•  Conventionally, errors are treated independently
–  Hard error: Repair
–  Soft error: Correction

•  Recently, a Combined Approach is used
–  Hard errors are also Corrected
–  Errors in a memory word can be classified as:
    •  Uncorrectable error
    •  Correctable error

[Figure: Repair and Correction mechanisms related to Hard Errors and Soft Errors. Labels: Alpha Ray, Cosmic Ray, Row/Column Faults, Array Faults, Random Bit Faults.]

Given m faulty bits in a memory word, the word can be corrected if the ECC is capable of correcting up to m bits.

Issue on Memory Word Reliability (1/2)

•  Memory word with Uncorrectable error

–  Must be Repaired

•  Memory word with Correctable error

–  Can be Corrected

–  But the word is more vulnerable to further errors, since it already has a faulty bit

For example, using Single-Error Correction:

[Figure: a word with 2-bit errors is replaced by a spare word; a word with a 1-bit error becomes a corrected word; a word that already has a 1-bit error can still end in word failure.]
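To make the Single-Error Correction example concrete, the sketch below uses a Hamming(7,4) code. The slides do not say which SEC code or word width is used, so the code choice and the function names `hamming74_encode` and `hamming74_decode` are assumptions made for brevity; a real memory word would be much wider.

```python
from typing import List


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword.

    Codeword positions are 1..7; parity bits sit at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(word: List[int]) -> List[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    w = list(word)
    syndrome = 0
    for position in range(1, 8):
        if w[position - 1]:
            syndrome ^= position  # XOR of the positions of all set bits
    if syndrome:
        # A non-zero syndrome is treated as the position of a single error.
        w[syndrome - 1] ^= 1
    return [w[2], w[4], w[5], w[6]]


data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

one_error = list(codeword)
one_error[4] ^= 1  # a single faulty bit is corrected

two_errors = list(codeword)
two_errors[0] ^= 1
two_errors[5] ^= 1  # a second faulty bit defeats single-error correction

print(hamming74_decode(one_error) == data)
print(hamming74_decode(two_errors) == data)
```

With two flipped bits the syndrome points at a third position, so the decoder flips a correct bit and returns wrong data. This is the reason a word that already holds one hard error is vulnerable to any further error.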


Issue on Memory Word Reliability (2/2)

•  Memory word with Uncorrectable error
–  Must be repaired

•  Memory word with Correctable error
–  Can be corrected
–  Can also be repaired


Combined Approach of Repair and Correction

[2] T.H. Wu, et al. A Memory Yield Improvement Scheme Combining Built-In Self-Repair and Error Correction Codes. ITC 2012.
[5] C.L. Su, et al. An Integrated ECC and Redundancy Repair Scheme for Memory Reliability Enhancement. DFT 2005.
[6] M. Nicolaidis, P. Papavramidou. Transparent BIST for ECC-Based Memory Repair. IOLTS 2013.

Comparison of Studies using Combined Approach

Reference | Target Error (ECC) | Target Error (Repair) | Time Test Performed | Remark
[2] | Correctable | Uncorrectable | Manufacturing | Enhances Yield
[5] | Correctable | Uncorrectable | In-Field | Transparent BIST w/ ECC
[6] | Soft | Hard | In-Field | Repairs all Hard errors
Proposed | Correctable | Uncorrectable and Correctable | In-Field | Enhances Reliability

Proposed Online Repair Scheme

To enhance reliability, erroneous words are repaired with spare words for as long as possible

•  Both Uncorrectable errors and Correctable errors are targeted
–  Repair (Address Remapping)
    •  An Uncorrectable error is repaired
    •  A Correctable error is repaired if spare space is available
    •  The remapping of a word with a Correctable error is cancelled if needed
–  Correction

Test, Repair and Correction in Field

•  Test (mBIST)
–  Test performed to identify errors

•  Repair (mBISR)
–  Use Spare Space to remap the address of a memory word
–  Physical replacement is performed in manufacturing test

•  Correction (ECC)
–  Use ECC to correct errors

•  Scrubbing
–  Re-write data to eliminate soft errors

[Figure: Sample SoC with Embedded Memory. Blocks: Processors, Cache, Logic block, Memory, Test and Repair, Correction.]

Proposed Architecture and Strategy Overview

•  Normal Mode
    •  ECC and Scrubbing Controller

•  Test Mode
    •  BIST and Diagnosis CAM
    •  Remap CAM and Remap Controller
        –  Proposed Online Repair Scheme
            »  Remap CAM Strategy
            »  Cases for Remap CAM

Proposed ECC-Based Memory Architecture

Proposed ECC-Based Architecture: Normal Mode

The Remap CAM is enabled for remapping operation, and ECC and scrubbing are in effect.

Remap CAM: If a memory address is stored in it, the address is remapped to a spare address
Scrubbing Controller: Protection against soft errors
ECC: Protection against errors

Proposed ECC-Based Architecture: Test Mode

Test mode is performed when the memory is idle.

BIST: Performs the test and determines error information
Diagnosis CAM: Stores error information, including the number of errors
Remap Controller: Remaps faulty words based on error classification
Remap CAM: Stores remapping information

Remap CAM Strategy

Proposed Scheme

Spare Available? | No. of Hard Errors | Action | Remark
Yes | 1 | Repair | -
Yes | ≥2 | Repair | -
Limited ~ None | 1 | - | Correction
Limited ~ None | ≥2 | Repair | Cancel word with 1 error

Traditional Scheme

No. of Hard Errors | Action | Remark
1 | - | Correction
≥2 | Repair | -

In this work, Single Error Correction ECC is used.
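The strategy table can be read as a decision function. The following is a minimal sketch, not the authors' implementation: the function name `choose_action`, the string labels, and the reduction of "Limited ~ None" to a single boolean `spare_available` are all assumptions made for illustration.

```python
def choose_action(hard_errors: int, spare_available: bool,
                  scheme: str = "proposed") -> str:
    """Return the action for a word with the given number of hard errors.

    Results: 'none', 'correct' (rely on ECC), 'repair' (remap to a spare
    word), or 'repair_after_cancel' (cancel a 1-error remapping first).
    """
    if hard_errors < 0:
        raise ValueError("hard_errors must be non-negative")
    if scheme not in ("proposed", "traditional"):
        raise ValueError("scheme must be 'proposed' or 'traditional'")
    if hard_errors == 0:
        return "none"
    if hard_errors >= 2:
        # Uncorrectable under Single Error Correction: always repaired.
        if scheme == "proposed" and not spare_available:
            return "repair_after_cancel"
        # The traditional table has no spare column; a real controller
        # would also have to handle running out of spare words here.
        return "repair"
    # Exactly one hard error: correctable by ECC.
    if scheme == "proposed" and spare_available:
        return "repair"
    return "correct"
```

The only difference between the two schemes is the handling of 1-error words while spare space remains, and the cancellation of such remappings once spare space runs out.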


Uses Two Counters: RC2F and RC1F

[Figure: Remap CAM with entries from Top Address to Bottom Address, each pairing a Faulty Address with a Spare Address. One region holds faulty words with 2 or more errors (counter RC2F); the other holds faulty words with 1 error (counter RC1F).]

RC2F and RC1F each point at the next address to be written to.

How Remap CAM works – Spare Available (1/2)

[Figure: Remap CAM (Top Address to Bottom Address) with a new Faulty Address entry in the region for faulty words with 2 or more errors.]

Write Faulty Address with ≥2 errors: the address is written at RC2F, and RC2F then points at the next address to be written to.

How Remap CAM works – Spare Available (2/2)

[Figure: Remap CAM (Top Address to Bottom Address) with a new Faulty Address entry in the region for faulty words with 1 error.]

Write Faulty Address with 1 error: the address is written at RC1F, and RC1F then points at the next address to be written to.

How Remap CAM works – FULL (1/2)

[Figure: full Remap CAM (Top Address to Bottom Address), with RC2F and RC1F meeting between the region for words with 2 or more errors and the region for words with 1 error.]

Write Faulty Address with ≥2 errors: a previous remapping of a word with 1 error is cancelled, and the freed entry is used for the new address.

How Remap CAM works – FULL (2/2)

[Figure: full Remap CAM (Top Address to Bottom Address), with RC2F and RC1F meeting between the region for words with 2 or more errors and the region for words with 1 error.]

Write Faulty Address with 1 error: do nothing; the word is left to ECC correction.
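The four Remap CAM cases can be sketched as a toy model. The slides do not specify from which end each counter grows or which 1-error entry is cancelled, so the layout here is an assumption: RC2F grows from the top (index 0), RC1F grows from the bottom, and the most recently written 1-error entry is the one cancelled. The class name `RemapCAM` and the returned labels are hypothetical, and duplicate addresses are not handled.

```python
from typing import List, Optional


class RemapCAM:
    """Toy model of a Remap CAM with two counters (layout is an assumption)."""

    def __init__(self, size: int) -> None:
        # Slot i maps the stored faulty address to spare word i.
        self.entries: List[Optional[int]] = [None] * size
        self.rc2f = 0          # next slot for words with >= 2 errors
        self.rc1f = size - 1   # next slot for words with 1 error

    def is_full(self) -> bool:
        return self.rc2f > self.rc1f

    def write(self, address: int, num_errors: int) -> str:
        if num_errors < 1:
            raise ValueError("only faulty words are written to the CAM")
        if num_errors >= 2:
            if self.is_full():
                if self.rc1f == len(self.entries) - 1:
                    return "failed"  # no 1-error remapping left to cancel
                # Cancel the most recently written 1-error remapping.
                self.rc1f += 1
                self.entries[self.rc1f] = None
                outcome = "remapped_after_cancel"
            else:
                outcome = "remapped"
            self.entries[self.rc2f] = address
            self.rc2f += 1
            return outcome
        # Exactly one error: remap only while spare space remains.
        if self.is_full():
            return "not_remapped"  # do nothing; ECC corrects the word
        self.entries[self.rc1f] = address
        self.rc1f -= 1
        return "remapped"

    def lookup(self, address: int) -> Optional[int]:
        """Return the spare slot for a remapped address, else None."""
        for slot, stored in enumerate(self.entries):
            if stored == address:
                return slot
        return None
```

In this model a cancelled 1-error word simply falls back to ECC correction, which matches the "Cancel word with 1 error" remark in the strategy table. A hardware CAM would match all entries in parallel rather than scanning them as `lookup` does.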


Reliability of Proposed Architecture

[Figure: timeline starting at time 0, alternating Normal Mode intervals (ECC and scrubbing active) with the 1st, 2nd, ..., k-th Self-Test and Repair periods.]

Poisson distribution for hard errors: $\lambda_h = 10^{-11}, 10^{-10}$

Poisson distribution for soft errors: $\lambda_s = 10^{-7}$

Memory is reliable if, at any given time, there are no uncorrectable errors that are not repaired/corrected.

Self-test: March-like test that repeatedly performs writes and reads on cells. It is assumed that there is little chance of a soft error between a read and the last write.
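As an illustration of how Poisson error rates translate into per-bit and per-word probabilities, the sketch below may help. The slides give no time unit for the rates, so treating them as errors per bit per hour is an assumption, as are the function names and the independence of bits within a word. It is not the authors' evaluation, which was done in R.

```python
import math


def poisson_pmf(k: int, rate: float, duration: float) -> float:
    """Probability of exactly k events for a Poisson process."""
    mean = rate * duration
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_at_least_one(rate: float, duration: float) -> float:
    """Probability of one or more events: 1 - exp(-rate * duration)."""
    # expm1 keeps precision when rate * duration is tiny.
    return -math.expm1(-rate * duration)


def prob_two_or_more_faulty_bits(p_bit: float, word_bits: int) -> float:
    """Probability that a word has >= 2 faulty bits (uncorrectable under
    single-error correction), assuming independent bits."""
    p_none = (1.0 - p_bit) ** word_bits
    p_one = word_bits * p_bit * (1.0 - p_bit) ** (word_bits - 1)
    return 1.0 - p_none - p_one


# Example: per-bit hard-error probability over 1000 hours (assumed unit)
p_hard = prob_at_least_one(1e-10, 1000.0)
print(p_hard)
```

Because the rates are so small, the per-bit probability is very close to rate times duration, and the chance of two faulty bits in the same word is smaller by several orders of magnitude.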


Reliability Evaluation

•  Evaluation done using R

•  Condition: $\lambda_h \ll \lambda_s$, since hard errors occur less frequently than soft errors

•  Scrubbing period: every 6 minutes

•  Self-test period: every 10 days

•  Reliability observed over a period of up to 50 years
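The evaluation periods above imply the following counts, shown as a short worked calculation. The 365-day year and the per-hour unit for the error rate are assumptions, and the helper `expected_events` is hypothetical.

```python
SCRUB_PERIOD_MIN = 6
SELF_TEST_PERIOD_DAYS = 10
HORIZON_YEARS = 50
DAYS_PER_YEAR = 365  # assumption: leap days ignored

# Scrubbing passes between two consecutive self-tests
scrubs_per_self_test = SELF_TEST_PERIOD_DAYS * 24 * 60 // SCRUB_PERIOD_MIN

# Self-test and repair periods within the observation horizon
self_tests_in_horizon = HORIZON_YEARS * DAYS_PER_YEAR // SELF_TEST_PERIOD_DAYS


def expected_events(rate_per_hour: float, hours: float) -> float:
    """Mean number of Poisson events in the interval (unit is assumed)."""
    return rate_per_hour * hours


# Expected hard errors per bit in one self-test period, for a rate of 1e-10
hard_per_period = expected_events(1e-10, SELF_TEST_PERIOD_DAYS * 24)

print(scrubs_per_self_test, self_tests_in_horizon, hard_per_period)
```

Under these assumptions there are 2400 scrubbing passes per self-test period and 1825 self-test periods in 50 years.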


Remapping Scheme

•  Proposed: repairs Uncorrectable words, and also Correctable words while spare space is available

•  Traditional: repairs only Uncorrectable words

Reliability of proposed scheme vs traditional scheme, $\lambda_h = 10^{-11}$

Hard errors do not occur frequently, and reliability is expected to be near 1

[Figure: Reliability versus Years.]

Reliability of proposed scheme vs traditional scheme, $\lambda_h = 10^{-10}$

When hard errors occur more often, the proposed scheme may use all the spare words, but it still has better reliability than the traditional scheme

[Figure: Reliability versus Years.]

Conclusion

•  For an ECC-based memory architecture, an online repair scheme is proposed that repairs uncorrectable words and, where possible, correctable words

•  A novel memory reconfiguration using a Remap CAM is proposed

•  Reliability is evaluated under Poisson distributions of hard errors and soft errors

•  The proposed scheme is demonstrated to improve reliability and extend memory lifetime