LA-UR-14-27913

Supercomputer Field Data – DRAM, SRAM, and Projections for Future Systems

Nathan DeBardeleben, Ph.D. (LANL), Ultrascale Systems Research Center (USRC)

6th Soft Error Rate (SER) Workshop, Santa Clara, October 16, 2014




Acknowledgements

n  Collaborators
•  Vilas Sridharan, AMD RAS, Advanced Micro Devices Inc.
•  Sudhanva Gurumurthi, AMD Research, Advanced Micro Devices Inc.
•  Jon Stearley, Scalable Architectures, Sandia National Laboratories
•  Kurt Ferreira, Scalable Architectures, Sandia National Laboratories
•  John Shalf, Computational Research Division, Lawrence Berkeley National Laboratory

n  Thanks to
•  Many folks at Cray
•  Many folks at LANL, LBNL, SNL, ORNL

n  This is a team effort
•  By no means is this entirely my own work!


Cielo Data – A More Detailed Analysis Than Is Often Possible

n  Cielo affords us more data sources than are often available

n  It’s very easy to jump to wrong conclusions with the kind of data usually available

n  Such as . . .


Vendor effects

•  A correlation to physical location...
•  ...is due to non-uniform distribution of vendor...
•  ...and disappears when examined by vendor.

•  DRAM reliability studies must account for DRAM vendor or risk inaccurate conclusions

Sridharan et al., Feng Shui of Supercomputer Memory, SC 2013

[Figure: fault rate by physical location across the machine room, annotated "South?", "Hotter?", "why?"]


Field Data – It Matters

[Figure: billions of DRAM hours by manufacturer (A, B, C): 3.14, 14.48, 5.41. FITs/DRAM by manufacturer, split into permanent and transient: 73.6, 41.2, 18.8.]

n  Not all vendors are created equal

n  As much as a 4x difference in FIT rate depending on DRAM vendor used in Cielo nodes

n  While B and C are about 50/50 vulnerable to permanent/transient, vendor A is closer to 30/70 permanent/transient

Sridharan et al., Feng Shui of Supercomputer Memory, SC 2013
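The FIT rates above follow directly from fault counts and accumulated device-hours (one FIT is one failure per 10^9 device-hours). A minimal sketch, using one of the per-manufacturer DRAM-hour totals from the figure and a hypothetical fault count (not a number from the study):

```python
def fit_rate(n_faults: int, device_hours: float) -> float:
    """FIT = failures in time: faults per 10**9 device-hours."""
    return n_faults / device_hours * 1e9

# Illustrative only: the fault count is hypothetical; 3.14e9 DRAM-hours
# is one of the per-manufacturer totals from the slide.
print(round(fit_rate(231, 3.14e9), 1))  # → 73.6 FITs/DRAM
```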


What About Hopper?

[Figure: Cielo FITs/DRAM by manufacturer, permanent and transient (73.6, 41.2, 18.8), shown again for comparison]

n  Cielo = 8.5k nodes, Hopper = 6k nodes

n  Cielo at 7.3k feet elevation

n  Hopper at 43 feet elevation

n  Same DRAM vendor IDs, approximately same relative concentration in entire system

n  Vendor A has a higher fault rate in Cielo, likely attributable to altitude

Currently under peer review . . .

[Figure: Hopper FITs/DRAM by DRAM vendor (A, B, C), transient and permanent: 32.3, 22.2, 21.5. Panels compare Cielo and Hopper.]
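The altitude effect can be roughly quantified with a JESD89A-style exponential model of atmospheric neutron flux. The constants below are common approximations (not values from the study), and the barometric depth model is deliberately simplified:

```python
import math

SEA_LEVEL_DEPTH = 1033.0  # g/cm^2, approximate atmospheric depth at sea level
ATTEN_LENGTH = 148.0      # g/cm^2, approximate neutron attenuation length
SCALE_HEIGHT_M = 8400.0   # m, rough barometric scale height

def relative_neutron_flux(alt_m: float) -> float:
    """High-energy neutron flux relative to sea level (JESD89A-style
    exponential approximation; constants are rough)."""
    depth = SEA_LEVEL_DEPTH * math.exp(-alt_m / SCALE_HEIGHT_M)
    return math.exp((SEA_LEVEL_DEPTH - depth) / ATTEN_LENGTH)

cielo = relative_neutron_flux(7300 * 0.3048)   # ~2225 m elevation
hopper = relative_neutron_flux(43 * 0.3048)    # ~13 m elevation
print(cielo / hopper)  # roughly 5x more neutron flux at Cielo's altitude
```

This order-of-magnitude estimate is consistent with vendor A's elevated fault rate on Cielo being driven by particle-strike-induced transients.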


Positional Effects

n  Fault rates increase as you go vertically in a rack

n  Shielding? Temperature?

n  We found a similar correlation in DRAM

n  More studies are needed to explain why we see this

Sridharan et al., Feng Shui of Supercomputer Memory, SC 2013

[Figure: relative SRAM fault rate by chassis location in rack (bottom = 1.0 baseline). Cielo: +11% middle, +22% top. Jaguar: +14% middle, +19% top.]


DDR Command and Address Parity – It’s a Good Thing

n  Key feature of DDR3 (and on) is the ability to add parity-check logic to the command and address bus.

n  Can have a significant positive impact on DDR memory reliability •  Not previously shown empirically

n  DDR3 sub-system on Cielo includes command and address parity checking.

n  The rate of command/address parity errors was 75% of the rate of uncorrected ECC errors.

n  Increasing DDR memory channel speeds may cause an increase in signaling-related errors.

DeBardeleben et al., Extra Bits on SRAM and DRAM Errors, SELSE 2014
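Command/address parity is ordinary even parity driven alongside the command/address pins; the receiving side recomputes it and flags a mismatch. A minimal sketch (the example word is hypothetical; the actual pin coverage is defined by the JEDEC DDR specs):

```python
def even_parity(word: int) -> int:
    """Parity bit to drive so that (word bits + parity pin) contain an
    even number of 1s: returns 1 if the word has odd parity, else 0."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

# Hypothetical command/address word (real DDR3 parity covers specific
# command/address pins defined by JEDEC):
ca_word = 0b1011001110001
par = even_parity(ca_word)

# Receiver side: recompute parity and compare against the parity pin;
# a mismatch would flag a command/address parity error.
assert even_parity(ca_word) ^ par == 0
```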


Where Do Faults Occur?

n  Data from Hopper
•  12,000 CPU sockets
•  12MB L3 cache / socket
•  3MB L2 cache / socket
•  ~1.5 years of observation

n  Faults occur often
•  Even small structures (TLBs) see faults

n  Exascale systems will:
•  Have 4-5x the number of compute sockets
•  Have much more SRAM per socket
•  Have more faults!

Caveat: vendors must pay attention to reliable design

Attribution: Vilas Sridharan, AMD


Accelerated Testing Comparison

n  Studies of DOE supercomputers compared to AMD accelerated testing

n  Accelerated testing remains a good proxy for what is seen in the field

n  We would expect lower field FIT rates than accelerated testing due to workload differences, faults being overwritten, etc.

[Figure: fault rate per bit (a.u.), field measurements vs. accelerated testing]


What Will SRAM Errors Look Like at Exascale?

SRAM uncorrected error rate relative to Cielo

n  Two potential systems
•  Small: 50k nodes
•  Large: 100k nodes

n  Same fault rate as 45nm
•  Sky is falling

n  Scale faults per current trend
•  Sky falls more slowly
•  Switch to FinFETs may make this even better

n  Add some engineering effort
•  Sky stops falling

SRAM faults are unlikely to be a significantly larger problem than today

Attribution: Vilas Sridharan, AMD
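The three scenarios above amount to simple multiplicative scaling of socket count, SRAM per socket, and per-bit fault rate. Every factor in this sketch is an assumed, illustrative value, not a number from the study:

```python
def relative_sram_error_rate(nodes: int, sram_scale: float,
                             per_bit_scale: float,
                             base_nodes: int = 8500) -> float:
    """Uncorrected SRAM error rate relative to a Cielo-like baseline,
    assuming the rate scales with socket count, SRAM capacity per
    socket, and per-bit fault rate. All inputs are illustrative."""
    return (nodes / base_nodes) * sram_scale * per_bit_scale

# Hypothetical scaling factors for a 100k-node system:
sky_is_falling = relative_sram_error_rate(100_000, sram_scale=4.0,
                                          per_bit_scale=1.0)   # 45nm rate
trend_scaled   = relative_sram_error_rate(100_000, sram_scale=4.0,
                                          per_bit_scale=0.25)  # rate improves
print(sky_is_falling, trend_scaled)
```

The point of the sketch: holding the per-bit rate at 45nm levels inflates the system error rate by more than an order of magnitude, while process and engineering improvements largely cancel the growth in socket count and capacity.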


What About DRAM? (and other external memory subsystems)

n  DRAM faults are...weird
•  Affect multiple rows/columns/chips
•  Not just simple particle strikes...

n  Many permanent faults
•  Entirely unlike SRAM

[Figure: DRAM fault maps — faulty locations plotted by row and column (0–2048)]

Fault Mode      % Faulty DRAMs
Single-bit      67.7%
Single-word      0.2%
Single-column    8.7%
Single-row      11.8%
Single-bank      9.6%
Multi-bank       1.0%
Multi-rank       1.1%

Attribution: Vilas Sridharan, AMD
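Given logged error addresses for a single DRAM device, fault modes like those in the table can be approximated with a simple classifier. This sketch is illustrative only: it omits rank information (so no multi-rank class) and merges single-bit with single-word, since it only tracks addresses down to the column:

```python
from typing import List, Tuple

# (bank, row, column) addresses that logged errors on one DRAM device
Addr = Tuple[int, int, int]

def classify_fault(errors: List[Addr]) -> str:
    """Rough fault-mode classifier mirroring the table above.
    Categories and logic are illustrative, not the study's method."""
    uniq = set(errors)
    banks = {b for b, _, _ in uniq}
    if len(banks) > 1:
        return "multi-bank"
    if len(uniq) == 1:
        return "single-bit"          # repeated errors at one address
    rows = {r for _, r, _ in uniq}
    cols = {c for _, _, c in uniq}
    if len(rows) == 1:
        return "single-row"          # one row, many columns
    if len(cols) == 1:
        return "single-column"       # one column, many rows
    return "single-bank"             # scattered within one bank

print(classify_fault([(0, 7, 100), (0, 7, 512)]))  # → single-row
```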


[Figure: relative rate of uncorrected errors by month (log scale, 0.1–100), SEC-DED vs. Chipkill]

Projecting to Exascale

Sridharan and Liberty, A Study of DRAM Failures in the Field, SC12

n  Uncorrected error rate
•  10-70x the error rate of current systems
•  Is the sky falling?

n  This is not just a problem for exascale
•  Cost problem for data centers / cloud
•  Reliability problem in client?

n  Solutions are out there
•  Including for die-stacked DRAM
•  Lots of people working on this...

n  Historical example
•  Chipkill vs. SEC-DED

Caveat: DRAM subsystems need higher reliability than today, but will likely get it

Attribution: Vilas Sridharan, AMD
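The historical example works because SEC-DED and Chipkill sit at different points on the correction spectrum: SEC-DED (a Hamming code plus an overall parity bit) corrects one flipped bit and detects two per word, while Chipkill-class codes use symbol-based correction to survive an entire failed DRAM device. A minimal SEC-DED sketch over four data bits (illustrative, not the code used in any particular memory controller):

```python
def secded_encode(d1, d2, d3, d4):
    """Hamming(7,4) plus an overall parity bit: corrects any single-bit
    error and detects any double-bit error in the 8-bit codeword."""
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    word = [p1, p2, d1, p3, d2, d3, d4]   # Hamming positions 1..7
    p0 = 0
    for b in word:
        p0 ^= b                           # overall parity over the 7 bits
    return [p0] + word

def secded_decode(cw):
    """Returns (status, data) where status is 'ok', 'corrected',
    or 'double-error' (data is None when uncorrectable)."""
    p0, word = cw[0], list(cw[1:])
    syndrome = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            syndrome ^= pos               # XOR of positions of set bits
    overall = p0
    for b in word:
        overall ^= b
    if syndrome == 0 and overall == 0:
        return "ok", [word[2], word[4], word[5], word[6]]
    if overall == 1:                      # odd number of flips: fix one bit
        if syndrome:
            word[syndrome - 1] ^= 1       # flip the erroneous position
        return "corrected", [word[2], word[4], word[5], word[6]]
    return "double-error", None           # even flips but bad syndrome

cw = secded_encode(1, 0, 1, 1)
cw[5] ^= 1                                # inject a single-bit error
print(secded_decode(cw))                  # → ('corrected', [1, 0, 1, 1])
```

A second injected flip drives the decoder into the 'double-error' branch: detected but uncorrected, which is exactly the error class that shows up in the SEC-DED curve above and that Chipkill removes.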


Conclusions

n  It is not often one gets to see field studies in HPC

n  We have shown the value of:
•  Collaborating with vendors to interpret the data
•  Analyzing reliability based on vendor choice
•  Studying positional effects of faults in a data center

n  SRAM would benefit from more advanced ECC

n  Accelerated testing is useful

n  DDR3 address and command parity check is useful

n  Exascale trends are a mixed bag:
•  Sky is probably not falling
•  But there is no doubt that user-experienced uncorrectable error rates will increase

n  Always interested in collaboration!

n  I haven’t said anything about silent data corruption here