Emerging NVM Enabled Storage Architecture
Source: acs.ict.ac.cn/ncis2014/slides/NCIS2014_Plenary_ChenYiran.pdf

Yiran Chen
Electrical and Computer Engineering, University of Pittsburgh
Sponsors: NSF, DARPA, AFRL, and HP Labs

Emerging NVM Enabled Storage Architecture: From Evolution to Revolution.

Outline

• Introduction
• Evolution with eNVM:
– On-chip high-speed storage;
– Off-chip secondary storage;
• Revolution with eNVM:
– Memristor-based neuromorphic accelerator
• Conclusion

Conventional Memory Scaling

• 2012-2013, 38nm-32nm: M: Stacked MIM; P: Planar; A: 6F², bWL; G: poly/SiO2; C: Si; V: 1.35V
• 2014-2015, 29nm-22nm: M: Stacked MIM; P: Planar, HKMG; A: 6F², bWL; G: HKMG; C: Si; V: 1.2V
• 2016-2017, 22nm-16nm: M: Stacked MIM; P: Planar; A: 6F², bBL, LBL, 1T1C (VFET); G: HKMG; C: Si; V: 1.1V
• 2018-2019, 16nm-14nm: M: FBRAM, STT-RAM, RRAM, PCRAM; P: Planar; A: 4F², 1T, 1T1R, 1TMTJ (VFET); G: HKMG; C: Si; V: ~1V

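[Figure: aspect ratio (A/R, axis 20 to 100) and TOX (11Å down to 3Å) versus technology node (60 down to 20), with the Burj Khalifa (A/R = 6) shown for comparison. Second plot: Mb/chip (10^1 to 10^4) and data rate in Mbps versus year (1990 to 2010): EDO 50, SDRAM 133, DDR1 200-400, DDR2 400-800, DDR3 800-1600.]

Sources: ASML, ITRS, IMEC, Hynix, IBM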

Intrinsic difficulty of charge-based computing and storage!

Emerging Nonvolatile Memory

Memory Technologies Comparison

                                    SRAM             DRAM           NAND FLASH  PCRAM   STT-RAM  ReRAM
Data Retention                      N                4ms            10y         >10y    >10y     >10y
Memory Cell (F²)                    120-140          7-9            4           4       8        <1
Read Time                           0.2ns            2ns            0.1ms       12ns    5-10ns   5-10ns
Write/Erase Time                    70ps             1ns            1/0.1ms     <50ns   <10ns    <10ns
Number of Rewrites                  10^16            10^16          10^5        10^8    10^15    10^15
Power Consumption (Read/Write)      Low              Low            High        Low     Low      Low
Power Consumption (other than R/W)  Leakage Current  Refresh Power  None        None    None     None

Source: ITRS ERD workshop presentation by Prof. Y. Chen

Challenges:

• Identifying the evolutionary applications that can
– Be easily and seamlessly integrated into the current memory hierarchy and computing platform;
– Fully leverage the advantages of emerging NVM;
– Not be easily replaced by other alternative technologies or architectures.
• Inventing a revolutionary computing and storage architecture that can
– Offer a high-performance, power-efficient, and scalable computing model;
– Provide a truly seamless integration between computing and memory.

Outline

– On-chip high-speed storage;
  • STT-RAM based 3D cache for CPU.
  • Racetrack based register file for GPU.
– Memristor-based neuromorphic accelerator.
• Conclusion

STT-RAM based 3D cache
Spin-Transfer Torque Random Access Memory

• A scalable technology

[Figure: 1T-1MTJ STT-RAM schematic with panels for writing '0' and writing '1'. Labels: bit-line, word-line, source-line, and the magnetic tunneling junction (MTJ) with its reference layer, MgO layer, and free layer.]

SRAM vs. MRAM (STT-RAM)

• Pros: low leakage power, high density.
• Cons: long write latency and large write power.

                 SRAM      MRAM
Area (65nm)      3.66mm²   3.30mm²
Capacity/Bank    128KB     512KB
Read latency     2.25ns    2.32ns
Write latency    2.26ns    11.02ns
Read energy      0.90nJ    0.86nJ
Write energy     0.80nJ    5.00nJ

Cache configuration          Leakage power
2MB (16x128KB) SRAM cache    2.09W
8MB (16x512KB) MRAM cache    0.26W
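The trade-off in the comparison above can be made concrete with a little arithmetic. The following is a minimal sketch in Python that only reuses the numbers from the table; the helper names and the notion of a workload "write fraction" are assumptions introduced for illustration, not part of the original slides.

```python
# Figures copied from the SRAM vs. MRAM (STT-RAM) comparison (65nm banks).
SRAM = {"area_mm2": 3.66, "capacity_kb": 128, "read_ns": 2.25, "write_ns": 2.26,
        "read_nj": 0.90, "write_nj": 0.80}
MRAM = {"area_mm2": 3.30, "capacity_kb": 512, "read_ns": 2.32, "write_ns": 11.02,
        "read_nj": 0.86, "write_nj": 5.00}


def density_kb_per_mm2(bank):
    """Storage density of one bank."""
    return bank["capacity_kb"] / bank["area_mm2"]


def mean_access_energy_nj(bank, write_fraction):
    """Average dynamic energy per access; write_fraction is an assumed workload parameter."""
    return (1.0 - write_fraction) * bank["read_nj"] + write_fraction * bank["write_nj"]


density_gain = density_kb_per_mm2(MRAM) / density_kb_per_mm2(SRAM)
write_latency_ratio = MRAM["write_ns"] / SRAM["write_ns"]
leakage_saving = 1.0 - 0.26 / 2.09  # 8MB MRAM cache vs. 2MB SRAM cache

# Write fraction at which both technologies spend the same mean energy per access.
read_gap = SRAM["read_nj"] - MRAM["read_nj"]
write_gap = MRAM["write_nj"] - SRAM["write_nj"]
break_even_write_fraction = read_gap / (write_gap + read_gap)

print(f"density gain: {density_gain:.2f}x")
print(f"write latency ratio: {write_latency_ratio:.2f}x")
print(f"leakage saving: {leakage_saving:.1%}")
print(f"break-even write fraction: {break_even_write_fraction:.2%}")
```

Under these numbers the MRAM bank is denser and leaks far less, while its dynamic energy advantage holds only for workloads with very few writes, which is why the write path is the focus of the optimizations that follow.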

STT-RAM based 3D cache

• Baseline 3D architecture
– Core layer + cache layers.
– NUCA caches with NoC connections.

[Figure: two-layer 3D stack connected by TSVs. Labels: Layer 1, Layer 2, core, cache controller, cache banks, routers (R), horizontal hop, vertical hop, data migration.]

G. Sun, X. Dong, Y. Xie, J. Li, Y. Chen, HPCA, 2009.

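• Challenge: long write latency of STT-RAM.
• Solution 1 (S1): read-preemptive write buffer.

[Figure: cores exchange read requests, read data, and write requests with the STT-RAM caches; write requests pass through a write buffer (FIFO). Two cases for a read operation that meets a write operation are shown: "Write just begins." and "Write is almost done."]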
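One way to picture the read-preemptive write buffer is as a policy decision taken when a read meets an in-flight write. The sketch below is a toy Python model under stated assumptions: the `Bank` class, the 0.5 progress threshold, and the retry-from-FIFO-head behavior are hypothetical choices for illustration, and only the latencies come from the SRAM vs. MRAM table.

```python
from collections import deque

READ_NS = 2.32            # MRAM read latency from the comparison table
WRITE_NS = 11.02          # MRAM write latency from the comparison table
PREEMPT_THRESHOLD = 0.5   # assumption: preempt only if the write is less than half done


class Bank:
    """Toy STT-RAM cache bank with a FIFO write buffer."""

    def __init__(self):
        self.write_buffer = deque()   # pending write addresses, oldest first
        self.write_started_at = None  # start time of the in-flight write, if any

    def enqueue_write(self, addr):
        self.write_buffer.append(addr)

    def start_write(self, now):
        # Begin writing the request at the head of the FIFO.
        if self.write_buffer and self.write_started_at is None:
            self.write_started_at = now

    def read(self, now):
        """Return (time the read data is ready, whether a write was preempted)."""
        if self.write_started_at is None:
            return now + READ_NS, False
        progress = (now - self.write_started_at) / WRITE_NS
        if progress >= 1.0:
            # The write has already completed: retire it and read immediately.
            self.write_buffer.popleft()
            self.write_started_at = None
            return now + READ_NS, False
        if progress < PREEMPT_THRESHOLD:
            # "Write just begins": cancel it; the request stays at the FIFO head for a retry.
            self.write_started_at = None
            return now + READ_NS, True
        # "Write is almost done": let it finish, then serve the read.
        finish = self.write_started_at + WRITE_NS
        self.write_buffer.popleft()
        self.write_started_at = None
        return finish + READ_NS, False
```

For example, a read arriving 1 ns into a write preempts it and returns after one read latency, while a read arriving 9 ns into the write waits for the remaining write time first.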

STT-RAM based 3D cache

• Solution 2 (S2): SRAM-MRAM hybrid L2 cache

[Figure: cores stacked with MRAM banks and SRAM banks through TSVs. Labels: 32-way STT-RAM; 31-way STT-RAM & 1-way SRAM.]

• Results (S1 & S2):
– Performance is improved by 4.91% compared with the STT-RAM baseline.
– Power consumption is reduced by 73.5%.

[Figure: normalized IPC and power (0 to 1) for 2M-SRAM-DNUCA, 8M-MRAM-DNUCA, and 8M Hybrid DNUCA.]

Outline

– On-chip high-speed storage;
  • STT-RAM based 3D cache for CPU.
  • Racetrack based register file for GPU.



Racetrack for GPU

• Racetrack cell:
– Two fixed pinning regions: free region and fixed region
– Write '0'
– Write '1'
– Read

[Figure: racetrack cell. Labels: WWL, RWL, SL, BL, two pinning layers, free layer, reference layer.]

• Racetrack
– Racetrack: magnetic track
– Inject current to move cells
– Access port

Racetrack for GPU

• Benefit of racetrack:
– Extremely small cell size;
• Major challenge:
– Delay and energy caused by shifting.
• Warp register remapping (WRR)
– 60.0% of the RF is allocated during execution
– Non-optimal warp register mapping: max shift distance is 8 cells
– WRR interleaves the warp registers across the access ports: max shift distance is 4 cells

[Figure: racetrack register file array with WWL/RWL rows and SL/BL columns, row decoder, write/read/shifter driver, column mux, sense amplifier arrays, shift controller, and arbitrator; Warp 0 register placement is marked.]

M. Mao, W. Wen, Y. Zhang, Y. Chen, H. Li, DAC 2014
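The effect of interleaving warp registers across access ports can be illustrated with a toy model. The Python sketch below assumes each access port serves a one-directional run of cells and that the shift distance of a register equals its cell offset from its port; these rules, the function names, and the parameters (8 registers, 2 ports) are hypothetical. It does not attempt to reproduce the 8-cell and 4-cell figures quoted above, which depend on the array geometry of the cited work.

```python
def contiguous_mapping(num_regs):
    """Place all registers of a warp back to back behind a single access port.

    Each entry is (port index, cell offset from that port).
    """
    return [(0, r) for r in range(num_regs)]


def interleaved_mapping(num_regs, num_ports):
    """Spread consecutive registers across the access ports in round-robin order."""
    return [(r % num_ports, r // num_ports) for r in range(num_regs)]


def max_shift_distance(mapping):
    """Worst-case number of cell shifts needed to bring a register under its port."""
    return max(offset for _port, offset in mapping)


NUM_REGS, NUM_PORTS = 8, 2  # toy parameters, not taken from the slides
baseline = max_shift_distance(contiguous_mapping(NUM_REGS))
remapped = max_shift_distance(interleaved_mapping(NUM_REGS, NUM_PORTS))
print(f"contiguous: {baseline} cells, interleaved: {remapped} cells")
```

In this model the worst-case shift drops from 7 cells to 3 cells, showing the same qualitative reduction that WRR targets.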

Racetrack for GPU

• Write buffer
– "Piggyback-write" to write back to the RF from the write buffer;
– Relies on the track movement triggered by read requests;
– Positive side effect: filters redundant RF reads/writes by leveraging RAW and WAW.

[Figure: write-buffer data path with numbered steps, with output to EXE/MEM.]
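The write-buffer behavior described above can be sketched as follows. This is a hypothetical Python model, not the design from the cited work: the class name, the `aligned_regs` argument (standing in for whatever the shift controller would report as being under an access port after a read-triggered shift), and the counters are all assumptions made for illustration.

```python
class PiggybackWriteBuffer:
    """Toy write buffer in front of a racetrack register file (RF)."""

    def __init__(self):
        self.pending = {}   # register id -> latest buffered value
        self.rf = {}        # register id -> value stored on the racetrack
        self.rf_reads = 0
        self.rf_writes = 0

    def write(self, reg, value):
        # WAW: a newer write to the same register replaces the buffered one,
        # so the older value never reaches the RF.
        self.pending[reg] = value

    def read(self, reg, aligned_regs=()):
        """Read `reg`; `aligned_regs` are registers under a port after this read's shift."""
        if reg in self.pending:
            # RAW: serve the value from the buffer, with no RF access and no shift.
            return self.pending[reg]
        self.rf_reads += 1
        # Piggyback: drain buffered writes whose cells are aligned by this read's shift.
        for r in aligned_regs:
            if r in self.pending:
                self.rf[r] = self.pending.pop(r)
                self.rf_writes += 1
        return self.rf.get(reg)
```

A short usage example: two writes to `r1` followed by a read of `r1` cost no RF access at all, and a later read of `r0` that happens to align `r1` writes the buffered value back with a single RF write.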

Racetrack for GPU

• Experiment results:
– Baseline: SRAM-based register files.
– Energy reduction: 59%.
– Performance improvement: 4%.

Outline

– Secondary storage;
  • PCRAM and NAND hybrid SSD;
• Revolution with eNVM:
– Memristor-based neuromorphic accelerator.
• Conclusion

Hybrid SSD

• Memory hierarchy
– On-chip memory: 1~30 cycles
– Off-chip memory: 100~300 cycles
– Solid state disk (flash): 25K~2M cycles
• Page mode → random access
• Erase-before-write (EBW) → in-place update (IPU)

[Figure: memory hierarchy and a flash erase unit holding pages PN=0 … PN=n, each marked V. Courtesy: Al Fazio (Intel)]

PRAM (PCM) Cell

• One transistor/diode and one GST (GeSbTe).
• In-place updating (IPU)
• High resistance: '0'; low resistance: '1'

[Figure: two cell cross-sections, amorphous and crystalline. Labels in each: top electrode, GST, heater, bottom electrode, +N, substrate.]

Hybrid SSD

• Conventional SSD: flash.
• Promising candidate: PRAM (phase change).
• To combine the benefits of both technologies:
– Hybrid SSD.
• Two usages:
– Performance;
– Reliability.

Hybrid SSD: performance enhancement

[Figure: three erase units, each holding pages PN=0 … PN=n. Updated pages (PN=2, PN=n) are marked invalid and their new versions occupy formerly empty pages; a merge operation (time consuming) gathers the valid pages. PN = Page Number; V = Valid; I = Invalid.]

Erase unit = 128/256KB, page = 512 bytes ~ 8KB

G. Sun, Y. Joo, Y. Chen, Y. Xie, Y. Chen, H. Li, HPCA, 2010.
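The cost of the merge operation follows from erase-before-write: an updated page cannot be overwritten in place, so the old copy is invalidated and the new copy goes to an empty page until space runs out. The Python sketch below is a toy model with one data unit and one log unit; the class names, the two-unit layout, and the counters are assumptions for illustration rather than the scheme of the cited paper.

```python
class EraseUnit:
    def __init__(self, num_pages):
        # Each slot is None (empty) or [page_number, valid_flag].
        self.pages = [None] * num_pages

    def free_slot(self):
        return next((i for i, p in enumerate(self.pages) if p is None), None)


class LogStructuredFlash:
    """Toy NAND model: updates are written out of place into a log unit."""

    def __init__(self, num_pages):
        self.num_pages = num_pages
        self.data = EraseUnit(num_pages)
        for pn in range(num_pages):
            self.data.pages[pn] = [pn, True]
        self.log = EraseUnit(num_pages)
        self.page_writes = num_pages  # initial fill of the data unit
        self.erases = 0

    def update(self, pn):
        if self.log.free_slot() is None:
            self.merge()
        # Invalidate the previous valid copy, wherever it lives.
        for unit in (self.data, self.log):
            for p in unit.pages:
                if p is not None and p[0] == pn and p[1]:
                    p[1] = False
        self.log.pages[self.log.free_slot()] = [pn, True]
        self.page_writes += 1

    def merge(self):
        """Copy every valid page into a fresh unit, then erase the old data and log units."""
        fresh = EraseUnit(self.num_pages)
        for unit in (self.data, self.log):
            for p in unit.pages:
                if p is not None and p[1]:
                    fresh.pages[p[0]] = [p[0], True]
                    self.page_writes += 1
        self.data = fresh
        self.log = EraseUnit(self.num_pages)
        self.erases += 2
```

With a 4-page unit, five updates to the same page already force one merge, which copies four pages and erases two units; this is the overhead the hybrid design aims to avoid.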

Hybrid Architecture

[Figure: physical view and structural view. Data region: NAND flash, organized in erase units, with a data buffer in memory. Log region: PRAM, with in-place updating at sector (512 bytes) granularity.]
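Why a PRAM log region helps can be seen by counting merges. The Python sketch below compares a NAND-only log, where every update consumes a fresh log page, with a PRAM log that overwrites an existing entry in place. Both functions, the merge trigger, and the workloads are simplifying assumptions for illustration; they are not the log management policy of the cited paper.

```python
def nand_log_merges(updates, log_pages):
    """NAND-only log: each update takes a fresh log page; a full log forces a merge."""
    used, merges = 0, 0
    for _ in updates:
        if used == log_pages:
            merges += 1
            used = 0
        used += 1
    return merges


def pram_log_merges(updates, log_pages):
    """PRAM log: an update to a page already in the log overwrites its entry in place."""
    log, merges = set(), 0
    for pn in updates:
        if pn not in log and len(log) == log_pages:
            merges += 1
            log.clear()
        log.add(pn)
    return merges


hot_page_updates = [2] * 100  # hypothetical workload: one page updated repeatedly
print(nand_log_merges(hot_page_updates, log_pages=4))  # merges with a NAND log
print(pram_log_merges(hot_page_updates, log_pages=4))  # merges with a PRAM log
```

For a workload that keeps rewriting the same page, the in-place log never fills, while the NAND log must merge repeatedly; for a workload that touches each page once, the two behave the same.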

Different Log Assignments

[Figure: three panels, each showing a data region, a log region, and erase units.]

• Static log assignment: fixed assignment
• Group log assignment: organize log pages in groups
• Dynamic log assignment: dynamic assignment

Outline

– Secondary storage;
• Revolution with eNVM:


Computing: Present and Future

[Figure: clock frequency (MHz, 100 to 1000) versus year (1990 to 2010), with the multi-core era marked; power density (mW/mm², 100 to 10000) versus year (1990 to 2010), with hot plate, nuclear reactor, and rocket launch as reference levels; neural network illustration. Source: CPUDB, Intel]

New trend:
- Multi-core, advanced power management, large on-chip storage.

Future:
- Heterogeneous system, brain-like computing.

Brain: The Most Efficient Computing Machine

• Brain: 15-30B neurons; extremely complex organ; 4 km/mm³; 35 W
• Gray matter and white matter
• Neocortex: 6 layers; signals travel within and between layers
• Neuron: processes signals from other neurons.
• Synapse: memory; weights signals
• Neural network

Brain-like Neuromorphic Circuits

• Highly parallel
• Ultra power efficient
• Flexible
• Extremely robust
• Real-world input
• Human-friendly output
• Data friendly

Slow progress in neuromorphic hardware implementation
• Lack of efficient synapse design
• Not supportive of mass connection

Memristor: Rebirth of Neuromorphic Circuits

Memristor ↔ Synapse
• Two-terminal, high density
• Non-volatility
• Analog/multi-level states

Crossbar ↔ Network
• Natural matrix function
• A MIMO system
• Good combination with memristor

[Figure: resistance (Ω, 300 to 700) and voltage (V, -4 to 4) versus pulse number (0 to 70) for a TiN-TaOx device whose pulses grow linearly in amplitude (EI lab & HP lab); device image labeled TaN1+x (HP lab, 2012); crossbar array with rows 1 … n and columns 1 … n (EI lab, DAC'12; EI lab, APL'13).]
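The "natural matrix function" of a crossbar can be written down directly: with a memristor of conductance G[i][j] at each crosspoint and voltage V[i] applied to row i, the current collected on column j is the sum of V[i] * G[i][j] over all rows (Ohm's law plus Kirchhoff's current law). The Python sketch below shows this for an ideal crossbar; the function name and the 3x2 conductance values are hypothetical, and wire resistance and device non-idealities are ignored.

```python
def crossbar_output_currents(conductance, voltages):
    """Column currents of an ideal crossbar: I[j] = sum over i of V[i] * G[i][j]."""
    rows, cols = len(conductance), len(conductance[0])
    if len(voltages) != rows:
        raise ValueError("one input voltage is required per crossbar row")
    return [sum(voltages[i] * conductance[i][j] for i in range(rows))
            for j in range(cols)]


# Hypothetical 3x2 crossbar; conductances in siemens, input voltages in volts.
G = [[1e-3, 2e-3],
     [2e-3, 1e-3],
     [0.0,  3e-3]]
V = [1.0, 0.5, 2.0]
I = crossbar_output_currents(G, V)
print(I)  # column currents in amperes
```

Each conductance plays the role of a synaptic weight, so a single read of the array evaluates a full vector-matrix product in parallel, which is the multiple-input multiple-output (MIMO) behavior the slide refers to.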

Conclusion

• Emerging nonvolatile memory (NVM) technologies such as STT-RAM, racetrack, and PRAM deliver significant improvements for various applications.
• Challenges exist and can be solved by architecture-level optimization.
• Innovation in revolutionary architectures promises multi-order speedup, power-efficiency improvement, and hardware cost reduction.
