MICRO ‘18
Amber: Enabling Precise Full-System Simulation with Detailed Modeling of All SSD Resources
Donghyun Gouk, Miryeong Kwon, Jie Zhang, Sungjoon Koh, Wonil Choi, Nam Sung Kim, Mahmut Kandemir, and Myoungsoo Jung
Computer Architecture and MEmory Systems Lab
SimpleSSD 2.0
Executive Summary
SSD simulation
[Memory hierarchy figure: registers, CPU caches, DRAM (fast, expensive, small capacity) down to solid state drives and hard disk drives (slow, cheap, large capacity)]
Full-System simulation
Co-simulation
[Figure: bandwidth (MB/s), real device vs. simulator]
Amber offers full-system simulation support with good performance accuracy compared to real devices.
Agenda
• Background
  • What and why are trace-based storage simulations not good enough?
• Amber overview
  • What do we have to model for SSD internal components?
    • Storage complex
    • Computation complex
  • Storage interface models
    • UFS/SATA/NVM Express/Open-Channel SSD
• Evaluation results
• Demo
• Conclusion
Full System Simulation
Models the computer system: CPU, caches, GPU, network, and storage.
Full-system simulators (e.g., Simics) run a real OS with detailed hardware models.
SSD Simulator — Simulation Results
• Bandwidth: Avg 660.3 MB/s, Min 237.6 MB/s, Max 1.935 GB/s, ...
• Latency: Avg 110.5 ms, Min 1.66 ms, Max 1.207 s, ...
• GC statistics: # GC: 0, # page copy: 0, ...
Trace-based Simulation
Block trace file(s) (blktrace-style records):
MAJ,MIN CPUID SEQ# SEC.NS PID ACTION RWBS OFFSET + LENGTH
259,0 4 2 0.000007455 3003 Q RM 76040 + 8
259,0 4 3 0.000008428 3003 G RM 76040 + 8
259,0 4 4 0.000013460 3003 D RM 76040 + 8
259,0 4 5 0.000128406 0 C RM 76040 + 8
...
I/O traces are captured along the kernel storage stack (user → VFS/FS → block I/O layer → device driver → storage device), producing an I/O trace file such as:
0.000013460 Read 76040 + 8
...
which is then converted for SSD simulators into (Tick, Oper., Offset, Length) records, as sketched after this slide.
A trace replayer issues the Read/Write requests from the trace to the SSD simulator, which models:
• Host protocol simulation
• DRAM latency model and data buffer model
• Flash Translation Layer (FTL): address translation, GC, wear-leveling, NAND transaction scheduling
• NAND flash array model: internal parallelism, thermal/power
A statistic collector gathers the simulation results.
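For illustration, a blktrace-style completion record can be converted into the (tick, operation, offset, length) form a trace replayer consumes. The sketch below is a hypothetical converter, not Amber's code; the field layout and the TraceEntry struct are assumptions.

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical record format used by a trace replayer: tick, op, offset, length.
struct TraceEntry {
  uint64_t tick;    // simulation tick (here: nanoseconds)
  char op;          // 'R' or 'W'
  uint64_t offset;  // start sector
  uint32_t length;  // number of sectors
};

// Parse one blktrace-like line, e.g. "259,0 4 5 0.000128406 0 C RM 76040 + 8".
// Only completion ('C') events are converted; other events are dropped.
bool convert(const std::string &line, TraceEntry &out) {
  std::istringstream in(line);
  std::string dev, action, rwbs, plus;
  unsigned cpu, seq;
  double sec;
  uint64_t pid, sector;
  uint32_t nsectors;
  if (!(in >> dev >> cpu >> seq >> sec >> pid >> action >> rwbs >> sector >> plus >> nsectors))
    return false;
  if (action != "C")
    return false;
  out.tick = static_cast<uint64_t>(std::llround(sec * 1e9));  // seconds -> ns
  out.op = (rwbs.find('W') != std::string::npos) ? 'W' : 'R';
  out.offset = sector;
  out.length = nsectors;
  return true;
}

int main() {
  TraceEntry e;
  if (convert("259,0 4 5 0.000128406 0 C RM 76040 + 8", e))
    std::cout << e.tick << " " << e.op << " " << e.offset << " + " << e.length << "\n";
  return 0;
}
```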
Trace-based Simulation
Real system: user-level program, kernel, and hardware (CPU, cache, and DRAM) on the host side; host interface, embedded CPU & DRAM, SSD firmware, and NAND flash array on the SSD side.
Trace-based simulation: the host is replaced by an I/O trace file and static values that drive an isolated SSD model (host interface, embedded CPU & DRAM, SSD firmware, NAND flash array).
Trace-based simulations cannot capture the performance brought by bidirectional communication between the host and the SSD; they only report performance simulated from an isolated storage model.
Trace-based Simulation — what it cannot reflect:
1. Host hardware changes (e.g., CPU @ 2 GHz vs. 4 GHz, DDR3-1600 vs. DDR4-2400)
2. Host software changes (e.g., Linux 3.x vs. Linux 4.x)
3. Protocol differences (e.g., the NVMe protocol with 64K queues and 64K entries per queue vs. the SATA protocol & PHY with NCQ of 32 entries)
4. Active vs. passive storage (with active storage, the data buffer and FTL sit inside the device behind the host interface; with passive storage, they move to the host side, below the application, block I/O, and device driver)
Accuracy – Bandwidth
Sequential read, random read, sequential write, and random write; bs=4K, QD=1~32
[Figure: bandwidth (MB/s) vs. I/O depth for a real Intel 750 and four trace-based simulators (A–D); annotated deviations include 34.8%, 28.7%, 64.9%, and 59.3%]
Average error rate: A: 128%, B: 65%, C: 74%, D: 42%
Accuracy – Latency
Sequential read, random read, sequential write, and random write; bs=4K, QD=1~32
[Figure: latency (us) vs. I/O depth for a real Intel 750 and four trace-based simulators (A–D); annotated deviations include 37.4%, 50.2%, 140.3%, and 179.5%]
Average error rate: A: 4560%, B: 507%, C: 4090%, D: 602%
Amber*
* Amber is the project name of SimpleSSD 2.0
• Offers full-system simulation by tightly integrating storage and computing resources over diverse interface protocols (e.g., SATA, UFS, NVMe, and Open-Channel SSD)
• For storage-specific studies, it also supports a traditional trace-based simulation model
• Provides good accuracy: 4~28% (bandwidth) and 6~36% (latency) error rates against diverse real SSDs
[Amber architecture diagram — Host: cores (functional CPU: AtomicSimple; timing CPU: InOrder, O3, SMT), DRAM, MCH/ICH, PCH, SATA/PCIe PHY, a memory-mapped region with SW queues and data, and a HW queue with a pointer list. SSD: host controller (SATA/UFS) or device controller, embedded core, internal DRAM, I/O queue, HIL, ICL (cache, flush/page replacement), FTL (mapping table, LBA→PPN translation, GC & WL), FIL with internal requests, and the storage complex with transaction scheduling and internal parallelism.]
Powered by gem5, a popular full-system simulator.
SSD interface modules are added by Amber.
[SSD internals diagram — Host: cores, L1/L2 caches, and DRAM, connected to the SSD over PCIe/SATA. SSD: host interface, computation complex (cores, caches, DRAM and DRAM controller), and storage complex (flash interface driving channels 0..N of NAND packages arranged across ways 0/1).]
Computation Complex
• Includes multiple CPUs, caches, and DRAM modules
• Runs the diverse components of the SSD firmware

Storage Complex
• Composed of multiple flash packages, channels, and flash controllers, including flash interface management

NAND Flash Package
• Multiple components for massive parallelism
• Multiple planes, dies, control logic, etc.
• Some components cannot operate concurrently
[NAND package diagram: dies 0 and 1, each with planes 0 and 1 holding blocks of pages 0..N, plus control logic and I/O; adjacent storage-complex labels: embedded core, cache, NAND flash array.]
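As a rough data-structure sketch of the hierarchy above (assumed, simplified geometry; not SimpleSSD's actual classes), a package can be represented as dies containing planes containing blocks of pages, with per-die and per-plane busy times to capture which components can and cannot operate concurrently:

```cpp
#include <cstdint>
#include <vector>

// Assumed, simplified NAND geometry: package -> dies -> planes -> blocks -> pages.
struct Page    { /* user data + spare area would live here */ };
struct Block   { std::vector<Page> pages; };
struct Plane   { std::vector<Block> blocks; uint64_t busyUntil = 0; };  // planes can operate in parallel
struct Die     { std::vector<Plane> planes; uint64_t busyUntil = 0; };  // but share die-level control logic
struct Package {
  std::vector<Die> dies;        // e.g., 2 dies per package
  uint64_t ioBusBusyUntil = 0;  // the package I/O bus is shared: transfers on it cannot overlap
};

Package makePackage(unsigned dies, unsigned planes, unsigned blocks, unsigned pages) {
  Package pkg;
  pkg.dies.resize(dies);
  for (auto &d : pkg.dies) {
    d.planes.resize(planes);
    for (auto &p : d.planes) {
      p.blocks.resize(blocks);
      for (auto &b : p.blocks)
        b.pages.resize(pages);
    }
  }
  return pkg;
}
```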
Modeling all timing details of the storage complex is required for high accuracy, but unfortunately it can lead to slow simulation speed in full-system simulation!
Storage Complex Abstraction — PAL: Parallelism Abstraction Layer
Goal: capture all the latency from activations, conflicts, and idleness of each component, while simplifying unnecessary operation timings/details
PAL: physical address → latency
• Simplified NAND command protocol
• Conflict modeling
[Timing diagrams: the full NAND command sequence (CMD, OpCode, Addr, tADL, Data, CMD) is abstracted into pre-DMA, mem_op, and post-DMA phases on the channel and die. Consecutive I/Os on the same channel but different dies conflict only on the channel; consecutive I/Os on the same channel and same die also conflict on the die's memory operation.]
Memory operation latency depends on:
• Idleness of the plane
• Idleness of the block
• Page offset (LSB? CSB? MSB?)
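A hedged sketch of the conflict idea above: per-channel and per-die "busy-until" times serialize consecutive I/Os exactly where they share a resource. The timing constants, and the decision to omit the post-DMA phase, are illustrative assumptions rather than PAL's real model.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>

// Illustrative "busy-until" bookkeeping for one channel and its dies.
// Each I/O occupies the channel for the command/DMA transfer, then the
// target die for the memory operation (post-DMA omitted for brevity).
struct Resource { uint64_t busyUntil = 0; };

constexpr uint64_t kDmaUs   = 5;   // assumed channel transfer time (us)
constexpr uint64_t kMemOpUs = 60;  // assumed cell read/program time (us)

uint64_t schedule(uint64_t now, Resource &channel, Resource &die) {
  uint64_t xferStart = std::max(now, channel.busyUntil);
  uint64_t xferEnd   = xferStart + kDmaUs;
  channel.busyUntil  = xferEnd;                              // channel conflict window
  uint64_t memEnd    = std::max(xferEnd, die.busyUntil) + kMemOpUs;
  die.busyUntil      = memEnd;                               // die conflict window
  return memEnd;
}

int main() {
  Resource ch0, die0, die1;
  std::cout << schedule(0, ch0, die0) << "\n";  // 65: first I/O
  // Same channel, different die: only the channel transfer serializes -> 70.
  std::cout << schedule(0, ch0, die1) << "\n";
  // Same channel, same die: the die conflict dominates -> 125.
  std::cout << schedule(0, ch0, die0) << "\n";
  return 0;
}
```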
Computation Complex Modeling
[Amber architecture diagram (as before), highlighting the computation complex: the embedded core model (ARMv8/aarch64), the DRAM model (from gem5), and PAL.]
HIL: Host Interface Layer
• Handles the communication and controls the datapath between host and SSD
• Implementation can be either hardware or software

ICL: Internal Cache Layer
• Buffers data according to the adopted storage interface
• Supports different associativities and cache schemes

FTL: Flash Translation Layer
• Address translation (between logical and physical addresses)
• Supports various algorithms (page-level mapping, etc.)
• Reliability management such as garbage collection and wear-leveling
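As a hedged illustration of the FTL role (not Amber's implementation), a page-level mapping table translates each LBA to a physical page number and remaps it out-of-place on writes, leaving invalid pages behind for garbage collection:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical page-level FTL: LBA -> physical page number (PPN).
class PageFTL {
 public:
  explicit PageFTL(uint64_t totalPages) {
    for (uint64_t p = 0; p < totalPages; ++p) freePages.push_back(p);
  }

  // Reads simply look up the current mapping.
  std::optional<uint64_t> translate(uint64_t lba) const {
    auto it = map.find(lba);
    if (it == map.end()) return std::nullopt;
    return it->second;
  }

  // Writes are out-of-place: allocate a fresh page, remap, and mark the
  // old page invalid (to be reclaimed later by garbage collection).
  uint64_t write(uint64_t lba) {
    assert(!freePages.empty());  // a real FTL would trigger GC before running dry
    uint64_t ppn = freePages.back();
    freePages.pop_back();
    auto it = map.find(lba);
    if (it != map.end()) invalidPages.push_back(it->second);
    map[lba] = ppn;
    return ppn;
  }

 private:
  std::unordered_map<uint64_t, uint64_t> map;  // mapping table (held in internal DRAM)
  std::vector<uint64_t> freePages;
  std::vector<uint64_t> invalidPages;          // candidates for GC / wear-leveling
};
```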
SSD Interface Modeling
H-Type – hardware driven (e.g., SATA, eMMC, UFS): a dedicated host controller sits between the device driver and the PHY; commands and data pass through a queue and the controller's internal buffer via DMA, introducing a redundant data copy before reaching the SSD's device controller, internal DRAM, and storage complex.
S-Type – software driven (e.g., NVMe): the device driver does everything; data moves by DMA between host DRAM (through the MCH and PCIe root port) and the SSD's PCIe endpoint, device controller, internal DRAM, and storage complex, with doorbell (DB) and MSI signaling.
[H-type command path diagram — System memory holds a command list (SATA) / UTP transfer request list (UFS) of 32 entries. Each entry (SATA command header / UFS UTP Transfer Request Descriptor, UTRD) points to a command FIS / UFS Protocol Information Unit (UPIU) carrying OPCODE, SLBA, and NLB, plus a PRDT (offset and length) pointing to the data buffers. The driver writes the command/queue offset to MMIO controller registers (doorbell); the host controller (AHCI for SATA, UFSHCI for UFS) fetches the entries into its internal memory and transfers the FIS/UPIU to the SSD over the PHY.]
SSD Interface Modeling
[S-type (NVMe / Open-Channel SSD) command path diagram — System memory holds a submission queue of up to 64K entries; each NVMe submission queue entry carries OPCODE, SLBA, NLB, and PRP 1/PRP 2 (or a PRP list) pointing to the data buffers. The driver writes the queue offset to MMIO controller registers (doorbell); the SSD's device controller, with its embedded cores and internal memory, fetches the entries and drives the storage complex directly.]
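To make the S-type flow concrete, the sketch below builds an abbreviated NVMe-style submission-queue entry and rings the tail doorbell. The struct is deliberately simplified (a real SQE is 64 bytes with many more fields), and the doorbell pointer is assumed to be mapped from the controller's registers.

```cpp
#include <cstdint>

// Abbreviated NVMe-style submission-queue entry (a real SQE is 64 bytes).
struct SubmissionQueueEntry {
  uint8_t  opcode;   // e.g., 0x01 = Write, 0x02 = Read (NVM command set)
  uint64_t prp1;     // physical address of the data buffer (or first PRP)
  uint64_t prp2;     // second PRP, or pointer to a PRP list for larger transfers
  uint64_t slba;     // starting LBA
  uint16_t nlb;      // number of logical blocks, zero-based
};

// Host-side view of one submission queue living in system memory.
struct SubmissionQueue {
  SubmissionQueueEntry *entries;    // queue memory shared with the device
  uint32_t size;                    // number of entries (up to 64K)
  uint32_t tail = 0;
  volatile uint32_t *tailDoorbell;  // MMIO register in the controller's BAR (mapping assumed)
};

// Enqueue one command and notify the device by writing the new tail to the
// doorbell register; the controller then fetches the entry via DMA.
void submit(SubmissionQueue &sq, const SubmissionQueueEntry &cmd) {
  sq.entries[sq.tail] = cmd;
  sq.tail = (sq.tail + 1) % sq.size;
  *sq.tailDoorbell = sq.tail;       // doorbell write: "software queue has new work"
}
```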
SSD Interface Modeling
[1] "Open-Channel SSD Spec. Revision 2.0," http://lightnvm.io/docs/OCSSD-2_0-20180129.pdf[2] M. Bjøring, et. al, "LightNVM: The Linux Open-Channel SSD Subsystem," FAST 17
NVMe Open-Channel SSD [1]
ApplicationUser-level
Virtual File System
File System
Block I/O Layer
NVMe DriverKernel-level
Host Interface
Flash Translation Layer
NAND Flash Array
Device-level
ApplicationUser-level
Virtual File System
File System
Block I/O Layer
NVMe DriverKernel-level
Host Interface
Flash Translation Layer
NAND Flash Array
Device-level
pblk driver [2]
LightNVM driver [2]
File I/O
Block I/O
NAND I/O
File I/O
Block I/O
NAND I/O
Defines or overrides NVMe commands for low-level NAND I/O operations.
Power/Energy Modeling
• Embedded Core: McPAT from HP Labs [1]
• DRAM: DRAMPower from gem5 [2]
• NAND Flash Array
[1] S. Li, et al., "McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures," MICRO 42
[2] K. Chandrasekar, et al., "DRAMPower: Open-source DRAM Power & Energy Estimation Tool," http://www.drampower.info
Energy inputs: the ARMv8/aarch64 core model feeds functional instructions and latencies to McPAT; the gem5 DRAM model feeds commands, accesses, and latencies to DRAMPower.
NAND flash array model, in three parts:
• Cache registers: current and voltage of SRAM-like 6-transistor cells at a specified clock speed; pre-DMA times of each read and post-DMA times of each write
• Page/block access: current/voltage of the flash cell array (page) or block for READ, PROGRAM, and ERASE; memory operation times (measured at simulation)
• I/O bus: current/voltage of the NVDDR bus protocol at a specific (configurable) clock speed; active time of the flash bus (between the AMBA channel and internal registers)
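A hedged sketch of how these inputs combine into energy, E = V × I × t, per operation class; all numeric values below are placeholders, not measured NAND parameters.

```cpp
#include <cstdio>

// Placeholder electrical parameters for one NAND operation class.
struct OpPower {
  double voltage;  // V
  double current;  // A
};

// Energy in joules for an operation that keeps the component active for t seconds.
double energy(const OpPower &p, double seconds) {
  return p.voltage * p.current * seconds;
}

int main() {
  // Illustrative numbers only; real values come from the device datasheet/config.
  OpPower read    {3.3, 0.025};   // cell array during READ
  OpPower program {3.3, 0.035};   // cell array during PROGRAM
  OpPower erase   {3.3, 0.030};   // block during ERASE
  OpPower ioBus   {1.8, 0.010};   // NVDDR I/O bus during data transfer

  double readEnergy  = energy(read, 60e-6)      // memory operation time (from simulation)
                     + energy(ioBus, 10e-6);    // bus transfer time for the page
  double progEnergy  = energy(program, 600e-6) + energy(ioBus, 10e-6);
  double eraseEnergy = energy(erase, 3e-3);

  std::printf("read: %.3e J, program: %.3e J, erase: %.3e J\n",
              readEnergy, progEnergy, eraseEnergy);
  return 0;
}
```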
Evaluation – Setup
User-level programs
1) Microbenchmarks: Flexible I/O tester (FIO) [1]
2) Workloads: reconstructed MSPS [2], replayed with FIO [1]
[1] J. Axboe, "Flexible I/O tester," https://github.com/axboe/fio
[2] M. Kwon, et al., "TraceTracker: Hardware/Software Co-Evaluation for Large-Scale I/O Workload Reconstruction," IISWC '17
Unmodified Linux kernel
- 4.9.92 for performance validation
- 4.14.42 for Open-Channel SSD evaluations
gem5 configuration
Amber configurations
1) Intel 750 400GB  2) Z-SSD 800GB  3) 983 DCT 1.92TB  4) 850 PRO 256GB
ISA: ARMv8
CPU: 4 cores @ 4.4GHz
L1I: 32KB, 2-way, private
L1D: 64KB, 2-way, private
L2: 2MB, 8-way, shared
DRAM: DDR4-2400, 4GB
Note that all performance results and validations are obtained by running actual user-level applications on the Linux-enabled full-system emulation.
Evaluation – Validation (Bandwidth)
Sequential read, random read, sequential write, and random write; bs=4K, QD=1~32
[Figure: simulated vs. measured bandwidth (MB/s) over I/O depth for Intel 750, 850 PRO, Z-SSD, and 983 DCT. Per-panel accuracies — Intel 750: 72%, 93%, 88%, 81%; 850 PRO: 91%, 86%, 86%, 78%; Z-SSD: 88%, 83%, 80%, 96%; 983 DCT: 72%, 94%, 91%, 88%. Legend: <Device name> (<Accuracy in %>).]
Evaluation – Validation (Latency)
Sequential read, random read, sequential write, and random write; bs=4K, QD=1~32
[Figure: simulated vs. measured latency (us) over I/O depth for Intel 750, 850 PRO, Z-SSD, and 983 DCT; per-device accuracies range from 64% to 96% across the four workloads. Legend: <Device name> (<Accuracy in %>).]
Evaluation – Open-Channel SSD
Kernel: 4.14 with Open-Channel SSD 1.2; the pblk driver (host-side FTL) maps the Open-Channel SSD to a block device.
SSD: modified Intel 750 (4 channels, 1 package/channel)
How much is the kernel CPU involved in handling I/O requests over an Open-Channel SSD?
[Figure: bandwidth (MB/s) of NVMe vs. OCSSD for sequential/random reads and writes at 4 KiB and 64 KiB block sizes — OCSSD is 11–26 MB/s faster in some cases and 11–29 MB/s slower in others (small data buffer, small I/O unit for low-level NAND access). Kernel CPU utilization over time (FIO init, pblk init, then the I/O phase) stays below 10% for NVMe but exceeds 50% for OCSSD.]
Data buffering and FTL operations consume host CPU cycles.
DEMO
Download
Website of SimpleSSD 2.0
• Providing step-by-step execution instructions and tutorials
• Offering detailed explanations of each simulation model
• Will be updated in November 2018
Source code
• SimpleSSD: SSD models, including the computation and storage complexes
• SimpleSSD-FullSystem: fork of gem5 employing Amber-specific host interfaces
• SimpleSSD-Standalone: for trace-based simulation studies
https://github.com/SimpleSSD
https://simplessd.org/
Conclusion
• We introduced Amber, the project name of SimpleSSD 2.0
• Amber models all necessary SSD internals, including hardware and software
• gem5-integrated Amber supports models of SATA, UFS, NVMe, and Open-Channel SSD, which can be used for diverse computing domains ranging from embedded systems to personal computers to servers
• Amber also
  • allows system designers to modify the kernel stack and entire operating systems with diverse sets of SSD devices
  • simulates emerging systems like Open-Channel SSDs, which are still in development
• The paper contains more simulation-based studies and validations related to power, the computing model, memory usage, etc.
Q&A