1
Yiran Chen, Electrical and Computer Engineering, University of Pittsburgh
Sponsors: NSF, DARPA, AFRL, and HP Labs
Emerging NVM Enabled Storage Architecture: From Evolution to Revolution
2
Outline
• Introduction
• Evolution with eNVM:
  – On-chip high-speed storage;
  – Off-chip secondary storage;
• Revolution with eNVM:
  – Memristor-based neuromorphic accelerator
• Conclusion
3
Conventional Memory Scaling
2012-2013, 38nm-32nm: M: Stacked MIM; P: Planar; A: 6F^2, bWL; G: poly/SiO2; C: Si; V: 1.35V
2014-2015, 29nm-22nm: M: Stacked MIM; P: Planar, HKMG; A: 6F^2, bWL; G: HKMG; C: Si; V: 1.2V
2016-2017, 22nm-16nm: M: Stacked MIM; P: Planar; A: 6F^2, bBL, LBL, 1T1C (VFET); G: HKMG; C: Si; V: 1.1V
2018-2019, 16nm-14nm: M: FBRAM, STT-RAM, RRAM, PCRAM; P: Planar; A: 4F^2, 1T, 1T1R, 1T MTJ (VFET); G: HKMG; C: Si; V: ~1V
[Figure: aspect ratio (A/R, 20 to 100) and TOX (11 Å down to 3 Å) versus technology node (60 down to 20), with Burj Khalifa (A/R = 6) as a reference.]
[Figure: Mb/chip (10^1 to 10^4) and data rate (Mbps) from 1990 to 2010: EDO 50, SDRAM 133, DDR1 200-400, DDR2 400-800, DDR3 800-1600.]
Sources: ASML, ITRS, IMEC, Hynix, IBM
Intrinsic difficulty of charge-based computing and storage!
4
Emerging Nonvolatile Memory
5
Memory Technologies Comparison

                                    SRAM             DRAM           NAND FLASH  PCRAM    STT-RAM  ReRAM
Data retention                      N                4 ms           10 y        >10 y    >10 y    >10 y
Memory cell (F^2)                   120-140          7-9            4           4        8        <1
Read time                           0.2 ns           2 ns           0.1 ms      12 ns    5-10 ns  5-10 ns
Write/erase time                    70 ps            1 ns           1/0.1 ms    <50 ns   <10 ns   <10 ns
Number of rewrites                  10^16            10^16          10^5        10^8     10^15    10^15
Power consumption, read/write       Low              Low            High        Low      Low      Low
Power consumption other than R/W    Leakage current  Refresh power  None        None     None     None

Source: ITRS ERD workshop presentation by Prof. Y. Chen
6
Challenges:
• Identifying the evolutionary applications that can
  – Be easily and seamlessly integrated into the current memory hierarchy and computing platform;
  – Fully leverage the advantages of emerging NVM;
  – Not be easily replaced by other alternative technologies or architectures.
• Inventing a revolutionary computing and storage architecture that can
  – Offer a high-performance, power-efficient, and scalable computing model;
  – Provide a truly seamless integration between computing and memory.
7
Outline
• Introduction
• Evolution with eNVM:
  – On-chip high-speed storage;
    • STT-RAM based 3D cache for CPU.
    • Racetrack based register file for GPU.
  – Off-chip secondary storage;
• Revolution with eNVM:
  – Memristor-based neuromorphic accelerator.
• Conclusion
8
STT-RAM based 3D cache
Spin-Transfer Torque Random Access Memory: a scalable technology
[Figure: 1T-1MTJ STT-RAM schematic. The magnetic tunneling junction (MTJ) has a reference layer, an MgO layer, and a free layer; the cell connects to the bit-line, word-line, and source-line. Panels: writing '0' and writing '1'.]
9
SRAM vs. MRAM (STT-RAM)
• Pros: low leakage power, high density.
• Cons: long write latency and large write power.

                 SRAM       MRAM
Area (65nm)      3.66mm^2   3.30mm^2
Capacity/Bank    128KB      512KB
Read latency     2.25ns     2.32ns
Write latency    2.26ns     11.02ns
Read energy      0.90nJ     0.86nJ
Write energy     0.80nJ     5.00nJ

Cache configuration           Leakage power
2MB (16x128KB) SRAM cache     2.09W
8MB (16x512KB) MRAM cache     0.26W
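The trade-off in the table can be made concrete with a little arithmetic. The sketch below copies the figures from the table and derives ratios; it assumes the area row refers to one bank, which the slide does not state explicitly, and the dictionary and function names are illustrative only.

```python
# Figures copied from the SRAM vs. MRAM table (65nm).
# Assumption: the "Area" row is the area of one bank.
SRAM = {"area_mm2": 3.66, "bank_kb": 128, "write_ns": 2.26,
        "write_nj": 0.80, "cache_mb": 2, "leak_w": 2.09}
MRAM = {"area_mm2": 3.30, "bank_kb": 512, "write_ns": 11.02,
        "write_nj": 5.00, "cache_mb": 8, "leak_w": 0.26}


def density_kb_per_mm2(tech):
    """Storage density of one bank in KB per mm^2."""
    return tech["bank_kb"] / tech["area_mm2"]


def leakage_w_per_mb(tech):
    """Leakage power of the whole cache divided by its capacity."""
    return tech["leak_w"] / tech["cache_mb"]


density_gain = density_kb_per_mm2(MRAM) / density_kb_per_mm2(SRAM)
leakage_gain = leakage_w_per_mb(SRAM) / leakage_w_per_mb(MRAM)
write_latency_penalty = MRAM["write_ns"] / SRAM["write_ns"]
write_energy_penalty = MRAM["write_nj"] / SRAM["write_nj"]

print(f"density gain:          {density_gain:.2f}x")
print(f"leakage per MB gain:   {leakage_gain:.1f}x")
print(f"write latency penalty: {write_latency_penalty:.2f}x")
print(f"write energy penalty:  {write_energy_penalty:.2f}x")
```

Under that assumption, MRAM wins on density and leakage per MB but pays several times more latency and energy per write, which is the problem the following slides address.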
10
STT-RAM based 3D cache
• Baseline 3D architecture
  – Core layer + cache layers.
  – NUCA caches with NoC connections.
[Figure: Layer 1 and Layer 2 connected by TSVs, showing cores, cache controllers, and cache banks attached to routers (R); data migration uses horizontal hops and vertical hops.]
G. Sun, X. Dong, Y. Xie, J. Li, Y. Chen, HPCA, 2009.
11
STT-RAM based 3D cache
• Challenge: long write latency of STT-RAM.
• Solution 1 (S1): read-preemptive write buffer.
[Figure: cores send read and write requests to the STT-RAM caches; write requests pass through a FIFO write buffer. Two cases are shown for a read that arrives during a write operation: the write has just begun, or the write is almost done.]
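A minimal sketch of the read-preemptive idea follows. It is not the authors' implementation: the cycle counts, the preemption threshold, and the `Bank` class are all hypothetical, chosen only to show a read cancelling a write that has just begun but waiting for one that is almost done.

```python
from collections import deque

WRITE_CYCLES = 11     # hypothetical write service time, in cycles
READ_CYCLES = 2       # hypothetical read service time, in cycles
PREEMPT_LIMIT = 0.5   # hypothetical: preempt only writes less than half done


class Bank:
    """One STT-RAM cache bank fed by a FIFO write buffer."""

    def __init__(self):
        self.write_buffer = deque()  # pending write addresses, oldest first
        self.active_write = None     # address currently being written
        self.progress = 0            # cycles spent on the active write
        self.preempted = 0           # how many writes were cancelled

    def enqueue_write(self, addr):
        self.write_buffer.append(addr)

    def tick(self):
        """Advance the bank by one cycle, starting a buffered write if idle."""
        if self.active_write is None and self.write_buffer:
            self.active_write = self.write_buffer.popleft()
            self.progress = 0
        if self.active_write is not None:
            self.progress += 1
            if self.progress == WRITE_CYCLES:
                self.active_write = None

    def read(self):
        """Return the latency, in cycles, seen by a read arriving now."""
        if self.active_write is None:
            return READ_CYCLES
        if self.progress / WRITE_CYCLES < PREEMPT_LIMIT:
            # Write has just begun: cancel it, put it back at the head of
            # the buffer, and serve the read immediately.
            self.write_buffer.appendleft(self.active_write)
            self.active_write = None
            self.progress = 0
            self.preempted += 1
            return READ_CYCLES
        # Write is almost done: let it finish, then serve the read.
        remaining = WRITE_CYCLES - self.progress
        self.active_write = None  # modelled as completed after `remaining`
        self.progress = 0
        return remaining + READ_CYCLES


bank = Bank()
bank.enqueue_write(0x40)
bank.tick()                 # the write has used 1 of 11 cycles
early = bank.read()         # preempts the write
for _ in range(9):
    bank.tick()             # the write restarts and reaches 9 of 11 cycles
late = bank.read()          # waits for the write to finish
print(early, late, bank.preempted)
```

The threshold trades read latency against wasted write work: a cancelled write must restart from scratch, so preempting late writes would throw away most of an expensive operation.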
12
STT-RAM based 3D cache
• Solution 2 (S2): SRAM-MRAM hybrid L2 cache.
[Figure: core layers stacked with MRAM banks and SRAM banks through TSVs. Labels: 32-way STT-RAM; 31-way STT-RAM & 1-way SRAM.]
13
STT-RAM based 3D cache
• Results (S1 & S2):
  – Performance is improved by 4.91% compared with the STT-RAM baseline.
  – Power consumption is reduced by 73.5%.
[Figure: IPC and power on a 0 to 1 scale for 2M-SRAM-DNUCA, 8M-MRAM-DNUCA, and 8M Hybrid DNUCA.]
14
Outline
• Introduction
• Evolution with eNVM:
  – On-chip high-speed storage;
    • STT-RAM based 3D cache for CPU.
    • Racetrack based register file for GPU.
  – Off-chip secondary storage;
• Revolution with eNVM:
  – Memristor-based neuromorphic accelerator.
• Conclusion
15
Racetrack for GPU
• Racetrack cell:
  – Two fixed pinning regions: free region and fixed region
  – Write '0'
  – Write '1'
  – Read
[Figure: racetrack cell with two pinning layers, a free layer, and a reference layer, connected to WWL, RWL, SL, and BL.]
• Racetrack:
  – Racetrack: magnetic track
  – Inject current to move cells
  – Access port
16
Racetrack for GPU
• Benefits from racetrack:
  – Extremely small cell size.
• Major challenges:
  – Delay and energy caused by shifting.
• Warp register remapping (WRR)
  – 60.0% of the RF is allocated during execution.
  – Non-optimal warp register mapping: max shift distance of 8 cells.
  – WRR interleaves the warp registers across the access ports: max shift distance of 4 cells.
[Figure: racetrack register file array with row decoder, write/read/shifter driver, column mux, sense amplifier arrays, shift controller, and arbitrator; tracks connect to WWL, RWL, SL, and BL; Warp 0 register placement is marked.]
M. Mao, W. Wen, Y. Zhang, Y. Chen, H. Li, DAC 2014
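Why interleaving shortens the worst-case shift can be seen in a toy model. The sketch below is an illustration under stated assumptions, not the mapping from the DAC 2014 paper: each access port is assumed to own a segment of cells, reaching the cell at offset k in a segment is assumed to cost k shifts, and the sizes (4 ports, 8 cells per port, 19 allocated registers, roughly 60% of 32) are hypothetical.

```python
NUM_PORTS = 4        # hypothetical number of access ports
CELLS_PER_PORT = 8   # hypothetical cells served by each port
NUM_REGS = 19        # roughly 60% of the 32 cells are allocated


def contiguous_offsets(num_regs, cells_per_port):
    """Fill one port's segment completely before using the next port."""
    return [i % cells_per_port for i in range(num_regs)]


def interleaved_offsets(num_regs, num_ports):
    """Spread registers round-robin so every segment fills evenly."""
    return [i // num_ports for i in range(num_regs)]


max_contiguous = max(contiguous_offsets(NUM_REGS, CELLS_PER_PORT))
max_interleaved = max(interleaved_offsets(NUM_REGS, NUM_PORTS))
print("max shift, contiguous: ", max_contiguous)
print("max shift, interleaved:", max_interleaved)
```

Because only part of the register file is allocated, spreading the live registers across all ports keeps each one close to a port, while contiguous placement pushes some registers to the far end of a segment.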
17
Racetrack for GPU
• Write buffer
  – "Piggyback-write" to write back to the RF from the write buffer;
  – Relies on the track movement triggered by the read requests;
  – Positive side effect: filters redundant RF reads and writes by leveraging RAW and WAW.
[Figure: write-buffer datapath with numbered steps 1 to 9; output to EXE/MEM.]
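The write-buffer behaviour can be sketched as follows. This is a coarse illustration, not the published design: the `PiggybackRF` class and its counters are hypothetical, and it assumes that any pending write on a track can be committed whenever a read shifts that same track.

```python
class PiggybackRF:
    """Toy racetrack register file fronted by a write buffer."""

    def __init__(self, regs_per_track):
        self.regs_per_track = regs_per_track
        self.rf = {}        # committed register values
        self.buffer = {}    # pending writes: register -> value
        self.rf_reads = 0   # reads that actually shifted a track
        self.rf_writes = 0  # writes that actually reached the RF
        self.forwarded = 0  # reads served from the buffer (RAW)

    def write(self, reg, value):
        # WAW: a newer write to the same register replaces the pending one,
        # so the older value never reaches the RF.
        self.buffer[reg] = value

    def read(self, reg):
        # RAW: a pending write satisfies the read without touching the RF.
        if reg in self.buffer:
            self.forwarded += 1
            return self.buffer[reg]
        self.rf_reads += 1
        track = reg // self.regs_per_track
        # Piggyback: this read moves the track anyway, so commit the
        # pending writes that target registers on the same track.
        same_track = [r for r in self.buffer
                      if r // self.regs_per_track == track]
        for pending in same_track:
            self.rf[pending] = self.buffer.pop(pending)
            self.rf_writes += 1
        return self.rf.get(reg, 0)


rf = PiggybackRF(regs_per_track=8)
rf.write(1, 10)
rf.write(1, 11)        # WAW: replaces the pending value 10
value = rf.read(1)     # RAW: served from the buffer
rf.write(2, 20)
rf.read(3)             # shifts track 0 and piggybacks both pending writes
print(value, rf.rf_writes, rf.rf_reads, rf.forwarded)
```

Three writes were issued but only two reached the register file, and one read never touched it, which is the filtering effect the slide describes.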
18
Racetrack for GPU
• Experiment results:
  – Baseline: SRAM-based register files.
  – Energy reduction: 59%.
  – Performance improvement: 4%.
19
Outline
• Introduction
• Evolution with eNVM:
  – On-chip high-speed storage;
  – Secondary storage;
    • PCRAM and NAND hybrid SSD;
• Revolution with eNVM:
  – Memristor-based neuromorphic accelerator.
• Conclusion
20
Hybrid SSD
• Memory hierarchy
  – On-chip memory: 1~30 cycles
  – Off-chip memory: 100~300 cycles
  – Solid State Disk (Flash): 25K~2M cycles
  – Page mode → random access
  – Erase-before-write (EBW) → in-place update (IPU)
[Figure: an erase unit holding valid (V) pages PN=0 through PN=n.]
Courtesy: Al Fazio (Intel)
X
21
PRAM (PCM) Cell
• One transistor/diode and one GST (GeSbTe).
• In-place updating (IPU).
• High resistance: '0'; low resistance: '1'.
[Figure: two cell cross-sections (top electrode, GST, heater, bottom electrode, +N, substrate), one amorphous and one crystalline.]
22
Hybrid SSD
• Conventional SSD: FLASH.
• Promising candidate: PRAM (phase change).
• To combine the benefits of both technologies:
  – Hybrid SSD.
• Two usages:
  – Performance;
  – Reliability.
23
Hybrid SSD: performance enhancement
[Figure: Erase Units 1 to 3 holding pages PN=0 through PN=n. Updated pages (PN=2, PN=n) leave invalid copies behind, and a merge operation (time consuming) moves the valid pages into an erase unit with empty pages.]
PN = Page Number; V = Valid; I = Invalid
Erase Unit = 128/256KB, Page = 512Bytes ~ 8KB
G. Sun, Y. Joo, Y. Chen, Y. Xie, Y. Chen, H. Li, HPCA, 2010.
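The cost of a merge can be illustrated with a small model. The sketch assumes the smallest sizes quoted on the slide (128KB erase unit, 512-byte pages); the `merge` function and the payload strings are hypothetical and stand in for whatever the flash translation layer actually does.

```python
ERASE_UNIT_BYTES = 128 * 1024   # smallest erase unit size on the slide
PAGE_BYTES = 512                # smallest page size on the slide
PAGES_PER_UNIT = ERASE_UNIT_BYTES // PAGE_BYTES


def merge(old_unit, updates):
    """Build a fresh erase unit from the old pages plus newer copies.

    old_unit: list of page payloads indexed by page number (PN).
    updates:  dict mapping PN to the newer payload for that page.
    Returns the new unit and the number of pages that had to be programmed.
    """
    new_unit = [updates.get(pn, data) for pn, data in enumerate(old_unit)]
    return new_unit, len(new_unit)


old_unit = [f"v0-{pn}" for pn in range(PAGES_PER_UNIT)]
new_unit, programmed = merge(old_unit, {2: "v1-2"})

# One logical page changed, but every page of the unit was rewritten.
write_amplification = programmed / 1
print(PAGES_PER_UNIT, programmed, write_amplification)
```

Because flash is erase-before-write, a single updated page forces the whole unit to be copied and the old unit erased, which is why the slide calls the merge time consuming.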
24
Hybrid SSD: performance enhancement
[Figure: hybrid architecture, physical view and structural view. The data region is NAND flash organized in erase units; the log region is PRAM with in-place updating at sector (512 Bytes) granularity; a data buffer sits in memory.]
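The benefit of in-place updating in the log region can be shown with a short comparison. This is a sketch under simple assumptions: the update trace is made up, an append-only log is assumed to consume one fresh page per update, and an in-place log is assumed to keep one entry per distinct sector.

```python
def append_only_log_entries(updates):
    """Append-only log: every update consumes a fresh log page."""
    return len(updates)


def in_place_log_entries(updates):
    """In-place log: repeated updates to a sector overwrite one entry."""
    return len(set(updates))


# Hypothetical trace of updated sector numbers.
trace = [7, 7, 3, 7, 3, 9]

nand_entries = append_only_log_entries(trace)
pram_entries = in_place_log_entries(trace)
print("append-only log entries:", nand_entries)
print("in-place log entries:   ", pram_entries)
```

A log that fills more slowly triggers fewer merges, so workloads that update the same sectors repeatedly gain the most from an in-place log.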
25
Different Log Assignments
[Figure: three panels, each showing erase units in the data region and their mapping to the log region.]
• Static log assignment: fixed assignment.
• Group log assignment: organize log pages in groups.
• Dynamic log assignment: dynamic assignment.
26
Hybrid SSD: performance enhancement
27
Outline
• Introduction
• Evolution with eNVM:
  – On-chip high-speed storage;
  – Secondary storage;
• Revolution with eNVM:
  – Memristor-based neuromorphic accelerator.
• Conclusion
28
Computing: Present and Future
• New trend: multi-core, advanced power management, large on-chip storage.
• Future: heterogeneous system, brain-like computing.
[Figure: clock frequency (MHz, 100 to 1000) versus year, 1990 to 2010, with multi-core and neural network marked.]
[Figure: power density (mW/mm^2, 100 to 10000) versus year, 1990 to 2010, with hot plate, nuclear reactor, and rocket launch as reference levels.]
Source: CPUDB, Intel
29
Brain: The Most Efficient Computing Machine
• Brain: 15-30B neurons; extremely complex organ; 4 km/mm^3; 35 W.
• Neocortex: 6 layers; signals travel within and between layers.
• Neuron: processes signals from other neurons.
• Synapse: memory; weights signals.
[Figure: gray matter and white matter; neural network.]
30
Brain-like Neuromorphic Circuits
• Highly parallel
• Ultra power efficient
• Flexible
• Extremely robust
• Real-world input
• Human-friendly output
• Data friendly
Slow progress in neuromorphic hardware implementation:
• Lack of efficient synapse design
• Not supportive of mass connection
31
Memristor: Rebirth of Neuromorphic Circuits
Memristor ↔ Synapse
• Two-terminal, high density
• Non-volatility
• Analog/multi-level states
Crossbar ↔ Network
• Natural matrix function
• A MIMO system
• Good combination with memristor
[Figure: resistance (Ω, 300 to 700) and voltage (V, -4 to 4) versus pulse number (0 to 70) for a TiN-TaOx device with pulses growing linearly in amplitude (EI lab & HP lab); device labeled TaN1+x; n-by-n crossbar array. Credits: HP lab, 2012; EI lab, DAC'12; EI lab, APL'13.]
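The "natural matrix function" of a crossbar is a matrix-vector product: with a voltage on each row and a memristor conductance at each crosspoint, Ohm's law and Kirchhoff's current law give each column current as a weighted sum of the inputs. The sketch below models an ideal crossbar; the conductance and voltage values are hypothetical, and wire resistance and device nonlinearity are ignored.

```python
def crossbar_output(voltages, conductances):
    """Column currents of an ideal crossbar: I_j = sum_i V_i * G[i][j].

    voltages:     input voltage applied to each row, in volts.
    conductances: conductances[i][j] is the memristor conductance, in
                  siemens, between row i and column j.
    """
    rows = len(conductances)
    cols = len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("one input voltage is needed per crossbar row")
    return [sum(voltages[i] * conductances[i][j] for i in range(rows))
            for j in range(cols)]


# Hypothetical 2x2 crossbar: each conductance plays the role of a synapse weight.
G = [[0.002, 0.001],
     [0.004, 0.0025]]
V = [1.0, 0.5]

currents = crossbar_output(V, G)
print(currents)
```

All columns are evaluated at once in the analog domain, which is why the slide pairs the crossbar with the network and the memristor with the synapse.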
32
Conclusion
• Emerging nonvolatile memory (NVM) technologies such as STT-RAM, racetrack, and PRAM deliver significant improvements for various applications.
• Challenges exist and can be solved by architecture-level optimization.
• Innovation in revolutionary architectures promises multi-order speedup, power-efficiency improvement, and hardware cost reduction.