Reliability of ECC-Based Memory Architectures with Online Self-Repair Capabilities
Gian Mayuga (1), Yuta Yamato (1), Tomokazu Yoneda (1), Yasuo Sato (2), Michiko Inoue (1)
(1) Nara Institute of Science and Technology, Ikoma, Nara, Japan
(2) Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
Outline
• Research Background
• Proposed ECC-Based Memory Architecture
• Proposed Online Repair Strategy
• Reliability Evaluation
• Results
• Conclusion
12/19/14 IEICE DC 2
Issues on Embedded Memory
• Memory takes up most of the area in large-scale SoCs
• Post-manufacturing failures are highly likely to occur in memory
• Periodic field-level test and repair, together with error correction, are required to maintain reliability
Source: Semico Research Corp http://www.semico.com/content/semico-systems-chip-%E2%80%93-braver-new-world
[Figure: SoC Area Partitioning. Y-axis: percent of area (0 to 100). X-axis: year (1999, 2000, 2005, 2008*, 2011*, 2014*, 2017*). Series: % area new logic, % area reused logic, % area memory. Data labels shown: 20, 38, 44, 53, 58, 65, 69.]
Memory Errors and Mechanisms for Repair and Correction
• Conventionally, errors are treated independently
  – Hard error: Repair
  – Soft error: Correction
• Recently, a combined approach is used
  – Hard errors are also corrected
  – Errors in a memory word can be classified as:
    • Uncorrectable error
    • Correctable error
[Figure: error types and mechanisms. Hard errors: row/column faults, array faults, random bit faults. Soft errors: alpha ray, cosmic ray. Mechanisms: Repair, Correction.]
Given m faulty bits in a memory word, the word can be corrected if the ECC can correct up to m bits.
Issue on Memory Word Reliability (1/2)
• Memory word with Uncorrectable error
– Must be Repaired
• Memory word with Correctable error
– Corrected
– More vulnerable if word already has faulty bit
For example, using Single-Error Correction:
[Figure: Word with 2-bit errors → spare word. Word with 1-bit error → corrected word ✓. Word with 1-bit error → word failure.]
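The classification above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name `classify_word` and the label strings are hypothetical, and the default of one correctable bit reflects the Single-Error Correction example on the slide.

```python
def classify_word(faulty_bits: int, correctable_bits: int = 1) -> str:
    """Classify a memory word by comparing its faulty-bit count
    with the number of bits the ECC can correct (1 for SEC)."""
    if faulty_bits < 0:
        raise ValueError("faulty_bits must be non-negative")
    if faulty_bits == 0:
        return "error-free"
    if faulty_bits <= correctable_bits:
        return "correctable"
    return "uncorrectable"


# A word that already holds one hard error is still correctable ...
hard_errors = 1
print(classify_word(hard_errors))

# ... but a single additional soft error makes it uncorrectable under SEC,
# which is why such a word is more vulnerable.
soft_errors = 1
print(classify_word(hard_errors + soft_errors))
```

The second call shows the vulnerability the slide points out: the ECC budget of a word is shared between hard and soft errors.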
Issue on Memory Word Reliability (2/2)
• Memory word with Uncorrectable error
  – Must be repaired
• Memory word with Correctable error
  – Can be corrected
  – Can also be repaired
Combined Approach of Repair and Correction
[2] T.H. Wu, et al. A Memory Yield Improvement Scheme Combining Built-In Self-Repair and Error Correction Codes. ITC 2012.
[5] C.L. Su, et al. An Integrated ECC and Redundancy Repair Scheme for Memory Reliability Enhancement. DFT 2005.
[6] M. Nicolaidis, P. Papavramidou. Transparent BIST for ECC-Based Memory Repair. IOLTS 2013.
Comparison of Studies using Combined Approach

Reference   Target Error (ECC)   Target Error (Repair)           Time Test Performed   Remark
[2]         Correctable          Uncorrectable                   Manufacturing         Enhances Yield
[5]         Correctable          Uncorrectable                   In-Field              Transparent BIST w/ ECC
[6]         Soft                 Hard                            In-Field              Repairs all Hard errors
Proposed    Correctable          Uncorrectable and Correctable   In-Field              Enhances Reliability
Proposed Online Repair Scheme
To enhance reliability, erroneous words are repaired with spare words for as long as possible
• Applies to both Uncorrectable and Correctable errors
  – Repair (Address Remapping)
    • Uncorrectable error is repaired
    • Correctable error is repaired if spare space is available
    • Remapping of a word with a Correctable error is cancelled if needed
  – Correction
Test, Repair and Correction in Field
• Test (mBIST)
  – Test performed to identify errors
• Repair (mBISR)
  – Use Spare Space to remap address of memory word
  – Physical replacement performed in manufacturing test
• Correction (ECC)
  – Use ECC to correct errors
• Scrubbing
  – Re-write data to eliminate soft errors
[Figure: Sample SoC with Embedded Memory. Blocks shown: processors, caches, logic blocks, memory, test and repair, correction.]
Proposed Architecture and Strategy Overview
• Normal Mode
  • ECC and Scrubbing Controller
• Test Mode
  • BIST and Diagnosis CAM
  • Remap CAM and Remap Controller
    – Proposed Online Repair Scheme
      » Remap CAM Strategy
      » Cases for Remap CAM
Proposed ECC-Based Memory Architecture
Proposed ECC-Based Architecture Normal Mode
Remap CAM is enabled for remapping operation, and ECC and scrubbing are in effect.
Remap CAM: if a memory address is stored in the CAM, the access is remapped to its spare address
Scrubbing Controller: protection against soft errors
ECC: protection against errors
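As a rough illustration of the normal-mode address path, the sketch below models the Remap CAM as a plain Python dictionary from faulty address to spare address. The function name `translate`, the dictionary model, and the "main"/"spare" labels are assumptions made for this example; the slides do not describe the hardware at this level.

```python
from typing import Dict, Tuple


def translate(address: int, remap_cam: Dict[int, int]) -> Tuple[str, int]:
    """Return which array serves an access and at which address.

    remap_cam maps a faulty main-memory address to a spare address
    (a software stand-in for the content-addressable lookup).
    """
    if address in remap_cam:
        return ("spare", remap_cam[address])
    return ("main", address)


# Hypothetical contents: main address 0x1A was remapped to spare word 0.
remap_cam = {0x1A: 0}
print(translate(0x1A, remap_cam))  # served by the spare word
print(translate(0x1B, remap_cam))  # served by main memory
```

In either case the data read would still pass through the ECC, which is not modelled here.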
Proposed ECC-Based Architecture Test Mode
Diagnosis CAM: stores error information, including the number of errors
Remap Controller: remaps faulty words based on error classification
BIST: performs the test and determines error information
Remap CAM: stores remapping information
Test mode is performed when the memory is idle.
Remap CAM Strategy
Proposed Scheme

Spare Available?   No. of Hard Errors   Action   Remark
Yes                1                    Repair   -
Yes                ≥2                   Repair   -
Limited ~ None     1                    -        Correction
Limited ~ None     ≥2                   Repair   Cancel word with 1 error

Traditional Scheme

No. of Hard Errors   Action   Remark
1                    -        Correction
≥2                   Repair   -

In this work, Single Error Correction ECC is used.
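The two strategy tables can be written as small decision functions. This is a sketch under assumptions: the function names and the returned strings are hypothetical, `spare_available=False` stands for the "Limited ~ None" row, and the case where no remapped 1-error word is left to cancel is not covered by the table and is not modelled.

```python
def proposed_action(hard_errors: int, spare_available: bool) -> str:
    """Action of the proposed scheme for a word with the given hard-error count."""
    if hard_errors <= 0:
        return "none"
    if hard_errors >= 2:
        # Uncorrectable under SEC: always repaired. Without free spares,
        # a remapped word with 1 error is cancelled to make room.
        return "repair" if spare_available else "repair (cancel word with 1 error)"
    # Exactly one hard error: repaired only while spares are available.
    return "repair" if spare_available else "correction"


def traditional_action(hard_errors: int) -> str:
    """Action of the traditional scheme: repair only uncorrectable words."""
    if hard_errors <= 0:
        return "none"
    return "repair" if hard_errors >= 2 else "correction"


for errors in (1, 2):
    print(errors, proposed_action(errors, True), "|", traditional_action(errors))
```

The only difference between the schemes appears for words with one hard error while spares remain: the proposed scheme repairs them, so their ECC capability stays free for soft errors.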
Uses Two Counters: RC2F and RC1F
[Figure: Remap CAM. Each entry pairs a faulty address with a spare address. The CAM is bounded by a top address and a bottom address, with one region for faulty words with 2 or more errors (counter RC2F) and one region for faulty words with 1 error (counter RC1F).]
RC2F and RC1F each point at the next address to be written to.
How Remap CAM works – Spare Available (1/2)
[Figure: Remap CAM with spare entries available. A faulty address with ≥2 errors is written at the entry indicated by RC2F.]
RC2F points at the next address to be written to.
How Remap CAM works – Spare Available (2/2)
[Figure: Remap CAM with spare entries available. A faulty address with 1 error is written at the entry indicated by RC1F.]
RC1F points at the next address to be written to.
How Remap CAM works – FULL (1/2)
[Figure: Remap CAM when full. A faulty address with ≥2 errors is written; counters RC2F and RC1F are shown.]
A previous remapping is cancelled.
How Remap CAM works – FULL (2/2)
[Figure: Remap CAM when full; counters RC2F and RC1F are shown.]
Write faulty address with 1 error: do nothing.
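The four cases above can be simulated with a small Python class. This is a sketch, not the authors' design: it assumes that entries for words with 2 or more errors fill the CAM from index 0 upward (tracked by `rc2f`), entries for words with 1 error fill it from the last index downward (tracked by `rc1f`), the CAM is full when the counters cross, and the 1-error entry cancelled is the most recently written one. The class name, method names, and returned strings are hypothetical.

```python
from typing import List, Optional


class RemapCAM:
    """Toy model of a remap CAM; slot i stands for spare word i."""

    def __init__(self, size: int) -> None:
        self.entries: List[Optional[int]] = [None] * size
        self.rc2f = 0          # next slot for a word with >= 2 errors
        self.rc1f = size - 1   # next slot for a word with 1 error

    def is_full(self) -> bool:
        return self.rc2f > self.rc1f

    def write(self, faulty_address: int, errors: int) -> str:
        if errors < 1:
            raise ValueError("only faulty words are written")
        if errors == 1:
            if self.is_full():
                return "do nothing"  # word is left to ECC correction
            self.entries[self.rc1f] = faulty_address
            self.rc1f -= 1
            return "repaired"
        # errors >= 2
        outcome = "repaired"
        if self.is_full():
            if self.rc1f == len(self.entries) - 1:
                return "unrepaired"  # no 1-error entry left to cancel
            # Slot rc2f holds the latest 1-error entry: cancel it by
            # moving the boundary and overwriting the slot below.
            self.rc1f += 1
            outcome = "repaired, previous remapping cancelled"
        self.entries[self.rc2f] = faulty_address
        self.rc2f += 1
        return outcome


cam = RemapCAM(size=4)
print(cam.write(10, errors=2))  # spare available
print(cam.write(20, errors=1))  # spare available
print(cam.write(21, errors=1))
print(cam.write(22, errors=1))  # CAM is now full
print(cam.write(30, errors=1))  # full, 1 error
print(cam.write(11, errors=2))  # full, >= 2 errors
print(cam.entries)
```

With this layout a cancellation needs no data movement: the boundary between the two regions simply shifts by one slot, and the cancelled word falls back to ECC correction.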
Reliability of Proposed Architecture
[Figure: timeline starting at time 0. Normal mode, with ECC and scrubbing active, alternates with the 1st, 2nd, and up to the k-th Self-Test and Repair period.]
Poisson distribution for hard errors: λh = 10^-11, 10^-10
Poisson distribution for soft errors: λs = 10^-7
Memory is reliable if there are no uncorrectable errors that are not repaired/corrected at any given time
Self-test: a March-like test that repeatedly performs writes and reads on the cells. Soft errors are assumed to have little chance of occurring between a read and the last write.
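For readers unfamiliar with the Poisson model, the sketch below computes the probability of exactly k errors and of at least one error in an interval. The slides do not state the units of λh and λs; the example assumes errors per bit per hour and a 72-bit word, both of which are assumptions made only for illustration. The function names are hypothetical, and this is not the authors' R evaluation.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """P(exactly k events) for a Poisson distribution with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def prob_at_least_one(rate: float, interval: float, bits: int = 1) -> float:
    """P(at least one error) across `bits` independent bits in `interval`.

    Independent Poisson processes add, so the mean is rate * interval * bits.
    expm1 keeps precision when that mean is very small.
    """
    return -math.expm1(-rate * interval * bits)


# Rates from the slide; units (per bit per hour) are an assumption.
LAMBDA_S = 1e-7
LAMBDA_H = 1e-11
WORD_BITS = 72  # hypothetical word width including check bits

one_hour = 1.0
print(prob_at_least_one(LAMBDA_S, one_hour, WORD_BITS))
print(prob_at_least_one(LAMBDA_H, one_hour, WORD_BITS))
```

Under these assumptions a soft error in a given word is several orders of magnitude more likely than a hard error over the same interval.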
Reliability Evaluation
• Evaluation done using R
• Conditions: λh<<λs since hard errors occur less frequently
• Scrubbing period: every 6 minutes
• Self-test period: every 10 days
• Reliability observed up to 50 years
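To give a sense of scale for these settings, the short sketch below counts how many scrubbing passes fall within one self-test period and how many self-test periods fall within the observation window. The function names are hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_DAY = 24 * 60
DAYS_PER_YEAR = 365  # assumption: leap days ignored


def scrubs_per_self_test(scrub_minutes: int = 6, self_test_days: int = 10) -> int:
    """Number of scrubbing passes between two consecutive self-tests."""
    return (self_test_days * MINUTES_PER_DAY) // scrub_minutes


def self_tests_in(years: int = 50, self_test_days: int = 10) -> int:
    """Number of self-test periods within the observation window."""
    return (years * DAYS_PER_YEAR) // self_test_days


print(scrubs_per_self_test())  # 2400
print(self_tests_in())         # 1825
```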
Remapping Scheme
• Proposed
  – Proposed method
• Traditional
  – Repair only Uncorrectable words
Reliability of proposed scheme vs traditional scheme, λh = 10^-11
Hard errors do not occur frequently, and reliability is expected to be near 1.
[Figure: reliability versus years for the proposed and traditional schemes.]
Reliability of proposed scheme vs traditional scheme, λh = 10^-10
When hard errors occur more often, the proposed scheme may use all the spare words, but it still has better reliability than the traditional scheme.
[Figure: reliability versus years for the proposed and traditional schemes.]
Conclusion
• Given an ECC-based memory architecture, an online repair scheme is proposed that repairs uncorrectable words and, where possible, correctable words
• A novel memory reconfiguration using a remap CAM is proposed
• Reliability is evaluated under Poisson distributions of hard errors and soft errors
• The proposed scheme is shown to improve reliability and extend memory lifetime