Network-on-SSD: A Scalable and High-Performance Communication Design Paradigm for SSDs





Arash Tavakkol, Mohammad Arjomand and Hamid Sarbazi-Azad
HPCAN Lab, Computer Engineering Department, Sharif University of Technology, Tehran, Iran
School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran


Abstract—In recent years, flash memory solid state disks (SSDs) have shown great potential to change the storage infrastructure because of their high-speed, high-throughput random access. This promising storage medium, however, suffers greatly from performance loss caused by frequent erase-before-write and garbage collection operations. Thus, novel circuit-level, architectural, and algorithmic techniques are currently being explored to address these limitations. In parallel with these efforts, the current study investigates replacing the shared buses in the multi-channel architecture of SSDs with an interconnection network to achieve scalable, high-throughput, and reliable SSD storage systems. Roughly speaking, such a communication scheme provides superior parallelism that allows us to compensate for the main part of the performance loss related to the aforementioned limitations by increasing data storage and retrieval throughput.

Index Terms—Flash memory, Solid state disk, Inter-package parallelism, Interconnection network.



DUE to high-latency access to randomly addressed data and low-throughput concurrent access to multiple data items, widely used rotating Hard Disk Drives (HDDs) are known to be performance- and bandwidth-limited media. Flash-based Solid State Disks (SSDs), on the other hand, are becoming the mainstream high-performance, energy-efficient storage devices because of their fast random access, superior throughput, and low power consumption. Leveraging such a promising medium, researchers have studied this technology extensively for storage systems and proposed solutions for performance optimization. Most of these studies have focused on either mitigating SSD technology limitations (e.g., slow random write access), lowering the performance overhead of the erase-before-write and garbage collection processes, or enabling access load balancing by means of circuit-level, architectural, or algorithmic techniques.

As random writes to flash memory are slow and the maximum parallelism within a package is limited to concurrent access to its planes, one flash package can only provide a limited bandwidth of 32-40 MB/s [1]. Therefore, SSDs usually utilize a multi-channel, multi-way bus structure to conduct multiple data accesses in parallel while maximizing the likelihood of striping. This enhances the aggregate bandwidth of a set of flash packages and improves back-end reliability. Although increasing the number of bus channels, and hence the level of parallelism, results in increased performance and bandwidth, there are tradeoffs between flash controller complexity, design cost, and maximum achievable throughput. These tradeoffs become even more challenging as the need for better performance grows with the steep increase of processor parallelism in many-core systems.
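The striping idea above can be sketched as follows. This is an illustrative round-robin mapping under assumed channel and way counts, not the authors' design; only the 32-40 MB/s per-package figure comes from the text.

```python
# Hypothetical sketch: striping logical pages across a multi-channel,
# multi-way SSD back end so consecutive pages can be accessed in parallel.

PER_PACKAGE_BW_MB_S = 32      # lower bound quoted in the text
CHANNELS = 8                  # assumed channel count
WAYS = 4                      # assumed packages (ways) per channel

def stripe(logical_page: int) -> tuple[int, int]:
    """Map a logical page to (channel, way): consecutive pages land on
    different channels first, then rotate through the ways."""
    channel = logical_page % CHANNELS
    way = (logical_page // CHANNELS) % WAYS
    return channel, way

# Eight consecutive pages spread over all eight channels.
assert [stripe(p)[0] for p in range(8)] == list(range(8))

# With one package streaming per channel, aggregate bandwidth scales with
# the channel count instead of a single package's 32-40 MB/s.
print(CHANNELS * PER_PACKAGE_BW_MB_S)  # 256
```

The tradeoff the text mentions shows up here directly: each extra channel raises the aggregate figure but also adds controller pins and arbitration logic.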

In this paper, we suggest replacing the shared multi-channel bus wiring with an interconnection network. Using a network can result in higher performance, larger aggregate bandwidth, more scalability, and better reliability. In fact, an interconnection network structures the global wires so that their electrical properties are optimized and well controlled. These controlled electrical parameters, in turn, result in reduced propagation latency and increased aggregate bandwidth [3]. Moreover, sharing the communication resources among many read/write requests makes more efficient use of those resources: when one set of flash packages is idle, others continue to make use of the network resources.

Manuscript submitted: 07-Feb-2012. Manuscript accepted: 28-Feb-2012. Final manuscript received: 04-Mar-2012.

In a nutshell, networks are generally preferable to shared buses because they have higher bandwidth and support multiple concurrent communications. Note that some of our motivations for an inter-flash-package network stem from inter-chip networks in general-purpose multiprocessors and intra-chip networks (Networks-on-Chip) in chip multiprocessors.


Figure 1 illustrates the internal structure of an SSD, which uses several non-volatile memory microchips to retain data. As of 2012, most SSDs use NAND flash memories to provide a high-density medium with random read and write access capabilities. Each NAND flash package consists of multiple dies, and each die contains multiple planes holding the actual physical memory pages (e.g., 2, 4, or 8 KB). Though each die can perform a read, write, or erase operation independently of the others, all planes within a die can only carry out the same (or at most two) command(s) at a time. To support parallelism inside a die, each plane includes an embedded data register to hold data when a read or write request is issued. Both read and write operations are processed in page units. Nevertheless, flash memory pages must be erased before any write, which incurs a latency of milliseconds. To hide this large stall time, the unit of an erase operation in NAND flash memories is generally tens of consecutive pages, forming an erase block. To further

[Figure 1 residue: the drawing shows an RC processor running the Flash Translation Layer, a cache controller with an SDRAM cache, a host interface, and flash packages (each with four planes holding data and cache registers, organized into 64-page blocks) attached to the controller through serial connections.]

Fig. 1. A shared bus-based SSD internal structure.

reduce frequent erases, current flash memories use a simple invalidation mechanism along with out-of-place updates [1].

In order to emulate the block-device interface provided by conventional HDDs, an SSD contains a special software layer, namely the Flash Translation Layer (FTL). The FTL has two main roles: address mapping, for translating logical addresses to physical addresses, and garbage collection, for reclaiming invalid pages. Moreover, the FTL is responsible for prolonging the lifetime of NAND memories through a wear-leveling process. In more detail, as the number of reliable erase cycles onto each flash page is limited to about 10^5 [6], this process tries to reduce the erase count and make all blocks within an SSD wear out evenly. To realize the FTL processes, an embedded RC processor augmented with an SDRAM buffer (responsible for keeping block mapping information and acting as a write buffer) is utilized.

Considering a single flash package, the limited support for parallel access to the planes of a single die and the high latency of random writes limit the maximum provided bandwidth to 32-40 MB/s [1]. In addition, garbage collection and erase-before-write require repetitive high-latency write and erase operations, which incur large latencies. To overcome these shortcomings, a multi-channel bus structure is used to connect the flash packages to the embedded controller. In this structure, a group of flash packages share a single serial bus: the controller dynamically selects the target of each command, while a shared data path connects the flash packages to the controller. This configuration increases capacity without requiring more pins, but it does not increase bandwidth.

By enhancing parallelism through an increased number of parallel channels, SSD performance has improved rapidly over the last few years, so that cutting-edge SSDs now exhibit sustained read/write throughput of up to 250 MB/s. However, demand for storage devices with better performance and bandwidth (1 GB/s, as projected in [4]) continues to grow.

3 NOSSD: NETWORK-ON-SSD

As mentioned before, solutions for the communication structure between the flash packages and the embedded controller have generally been characterized by designs of multi-channel, multi-way buses. Although the channels in this architecture build on well-understood on-board routing concepts with minimum complexity, enhancing parallelism by increasing the number of channels has both advantages and disadvantages. On one hand, high concurrency generally improves resource utilization and increases throughput. On the other hand, addressing the channel access arbitration problem in the controller is not trivial, and its latency and cost overhead may not be tolerable. For instance, a fully interconnected structure (one outgoing channel per flash package) is optimal in terms of bandwidth, latency, and power usage. However, the controller complexity of this structure increases linearly with the number of flash packages. Thus, there always have to be tradeoffs between controller complexity, cost, and maximum throughput. Conversely, attaching more flash packages to a shared channel in a bus-based communication structure largely degrades the overall SSD performance and power efficiency. Indeed, this inefficiency is a direct effect of the increased access contention and capacitive load of the attached units.

A lack of pipelining support is the other shortcoming of shared channels. More precisely, when the depth of the service queue in the controller exceeds the number of physical channels, a single channel has to be shared by more than one job, and those jobs must be serialized [2]. In short, a communication structure that supports pipelined data transmission would substantially reduce the aforementioned performance and power loss.
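The serialization penalty can be seen with a back-of-the-envelope calculation. The function and its parameter values are illustrative assumptions (a fixed per-transfer time and no pipelining), not a model from the paper.

```python
# Sketch: with J queued transfers and C physical channels and no pipelining,
# each channel must carry ceil(J / C) transfers back to back.
import math

def completion_time(jobs: int, channels: int, transfer_us: float) -> float:
    """Time to drain the service queue when jobs beyond the channel count
    are serialized on shared channels."""
    rounds = math.ceil(jobs / channels)
    return rounds * transfer_us

# 16 queued page transfers of 100 us each:
print(completion_time(16, 16, 100.0))  # 100.0: one transfer per channel
print(completion_time(16, 4, 100.0))   # 400.0: four serialized per channel
```

A pipelined channel would overlap the command/data phases of the queued jobs instead of paying the full transfer time per round, which is the gain the text attributes to changing the communication structure.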

    For maximum flexibil