<ul><li><p>Network-on-SSD: A Scalable and High-Performance Communication Design Paradigm for SSDs</p><p>Arash Tavakkol, Mohammad Arjomand and Hamid Sarbazi-Azad</p><p>HPCAN Lab, Computer Engineering Department, Sharif University of Technology, Tehran, Iran</p><p>School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran</p><p>Abstract: In recent years, flash memory solid state disks (SSDs) have shown great potential to change storage infrastructure because of their advantages of high speed and high-throughput random access. This promising storage, however, suffers greatly from performance loss caused by frequent erase-before-write and garbage collection operations. Thus, novel circuit-level, architectural, and algorithmic techniques are currently being explored to address these limitations. In parallel with others, the current study investigates replacing the shared buses in the multi-channel architecture of SSDs with an interconnection network to achieve scalable, high-throughput, and reliable SSD storage systems. Roughly speaking, such a communication scheme provides superior parallelism that allows us to compensate for the main part of the performance loss related to the aforementioned limitations by increasing data storage and retrieval throughput.</p><p>Index Terms: Flash memory, Solid state disk, Inter-package parallelism, Interconnection network.</p><p>1 INTRODUCTION</p><p>Due to high-latency access to randomly-addressed data and low-throughput concurrent access to multiple data, the widely-used rotating Hard Disk Drives (HDD) are known to be performance- and bandwidth-limited media. Flash-based Solid State Disks (SSD), on the other hand, are becoming the mainstream high-performance and energy-efficient storage devices because of their fast random access, superior throughput, and low power features.
Leveraging such a promising medium, researchers have made extensive studies to exploit this technology in storage systems and have thus proposed solutions for performance optimization. Most of these studies have focused on either improving SSD technology limitations (e.g. slow random write access), lowering the performance overhead of the erase-before-write and garbage collection processes, or enabling access load balancing by means of circuit-level, architectural, or algorithmic techniques.</p><p>As random writes to flash memory are slow and the maximum parallelism within a package is limited to concurrent access to its planes, one flash package can only provide a limited bandwidth of 32-40 MB/s [1]. Therefore, SSDs usually utilize a multi-channel, multi-way bus structure to conduct multiple data accesses in parallel while providing maximum striping likelihood. This enhances the aggregate bandwidth of a set of flash packages and improves back-end reliability. Although increasing the number of bus channels, and hence the level of parallelism, results in increased performance and bandwidth, there have to be tradeoffs between flash controller complexity, design cost, and maximum achievable throughput. These tradeoffs are even more challenging as the need for better performance persists due to the steep increase of processor parallelism in many-core systems.</p><p>Manuscript submitted: 07-Feb-2012. Manuscript accepted: 28-Feb-2012. Final manuscript received: 04-Mar-2012.</p><p>In this paper, we suggest replacing the shared multi-channel bus wiring with an interconnection network. Using a network can result in higher performance, larger aggregate bandwidth, more scalability, and better reliability. In fact, an interconnection network structures the global wires so that their electrical properties are optimized and well-controlled. These controlled electrical parameters, in turn, result in reduced propagation latency and increased aggregate bandwidth [3].
Moreover, sharing the communication resources between many read/write requests makes more efficient use of those resources: when one set of flash packages is idle, others continue to make use of the network resources.</p><p>In a nutshell, networks are generally preferable to shared buses because they have higher bandwidth and support multiple concurrent communications. Note that some of our motivations for an inter-flash-package network stem from inter-chip networks in general-purpose multi-processors and intra-chip networks (Network-on-Chip) in chip multi-processors.</p><p>2 THE SSD STRUCTURE</p><p>Figure 1 illustrates the internal structure of an SSD that uses some microchips of non-volatile memories to retain data. As of 2012, most SSDs use NAND flash memories to provide a high-density medium with random read and write access capabilities. Each NAND flash package consists of multiple dies, and each die comprises multiple planes containing the actual physical memory pages (e.g. 2, 4, or 8 KB). Though each die can perform a read, write, or erase operation independent of the others, all planes within a die can only carry out the same command (or at most two commands) at a time. To support parallelism inside a die, each plane includes an embedded data register to hold data when a read or write request is issued. Both read and write operations are processed in page units. Nevertheless, flash memory pages must be erased before any write, which incurs a latency of milliseconds. To hide this large stall time, the unit of an erase operation in NAND flash memories is generally tens of consecutive pages, forming an erase block.
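As a rough illustration of these access-granularity constraints, the following sketch models a single plane's block in which reads and writes operate on pages while an erase clears the whole block. All names and sizes here are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of NAND access granularity: page-level read/write,
# block-level erase. Sizes and names are illustrative, not from the paper.
PAGES_PER_BLOCK = 64          # an erase block spans tens of consecutive pages
PAGE_SIZE = 2048              # e.g. a 2 KB page

class Block:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK   # None = erased (writable)

    def write(self, page_no, data):
        # A page can only be written once after an erase (no in-place update).
        if self.pages[page_no] is not None:
            raise ValueError("page must be erased before rewrite")
        self.pages[page_no] = data

    def read(self, page_no):
        return self.pages[page_no]

    def erase(self):
        # Erase granularity is the whole block, a millisecond-scale operation.
        self.pages = [None] * PAGES_PER_BLOCK

blk = Block()
blk.write(0, b"x" * PAGE_SIZE)
# Writing page 0 again would fail until blk.erase() clears the whole block.
```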
To further reduce frequent erases, current flash memories use a simple invalidation mechanism along with out-of-place update [1].</p></li><li><p>Fig. 1. A shared bus-based SSD internal structure.</p><p>In order to emulate the block-device interface provided by conventional HDDs, an SSD contains a special software layer, namely the Flash Translation Layer (FTL). The FTL has two main roles: address mapping, which translates logical addresses to physical addresses, and garbage collection, which reclaims invalid pages. Moreover, the FTL is responsible for prolonging the lifetime of NAND memories through a wear-leveling process. In more detail, as the number of reliable erase cycles for each flash page is limited to about 10<sup>5</sup> [6], this process tries to reduce the erase count and make all blocks within an SSD wear out evenly. To realize the FTL processes, an embedded RC processor augmented with an SDRAM buffer (responsible for keeping block mapping information and acting as a write buffer) is utilized.</p><p>Considering a single flash package, the limited support for parallel access to the planes of a single die and the high latency of random writes limit the maximum provided bandwidth to 32-40 MB/s [1]. In addition, garbage collection and erase-before-write require repetitive write and erase operations, which incur large latencies.
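To make the mapping and invalidation mechanism concrete, here is a toy page-mapping FTL sketch (the class and field names are illustrative assumptions, not the paper's design): a logical rewrite goes to a fresh physical page and the old page is merely marked invalid, deferring the expensive erase to garbage collection.

```python
# Toy page-mapping FTL: out-of-place update with invalidation.
# All names and sizes are illustrative, not from the paper.
class ToyFTL:
    def __init__(self, num_pages):
        self.mapping = {}                   # logical page -> physical page
        self.free = list(range(num_pages))  # erased physical pages
        self.invalid = set()                # stale pages awaiting GC

    def write(self, lpn, data_store, data):
        if lpn in self.mapping:
            # Out-of-place update: mark the old physical page stale
            self.invalid.add(self.mapping[lpn])
        ppn = self.free.pop(0)              # the write goes to a fresh page
        self.mapping[lpn] = ppn
        data_store[ppn] = data

    def read(self, lpn, data_store):
        return data_store[self.mapping[lpn]]

ftl = ToyFTL(num_pages=8)
store = {}
ftl.write(3, store, "v1")
ftl.write(3, store, "v2")   # rewrite: old page invalidated, not erased
```

Garbage collection would later copy any still-valid pages out of a mostly-invalid block and erase it in one shot, while wear-leveling steers those erases so that blocks age evenly.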
To overcome these shortcomings, a multi-channel bus structure is used to connect the flash packages to the embedded controller. In this structure, a group of flash packages share a single serial bus where a controller dynamically selects the target of each command, while a shared data path connects the flash packages to the controller. This configuration increases capacity without requiring more pins, but it does not increase the bandwidth.</p><p>By enhancing parallelism through an increased number of parallel channels, SSD performance has improved rapidly over the last few years, so that cutting-edge SSDs now exhibit sustained read/write throughput of up to 250 MB/s. However, the demand for storage devices with better performance and bandwidth (1 GB/s, as projected in [4]) keeps growing.</p><p>3 NOSSD: NETWORK-ON-SSD</p><p>As mentioned before, the solutions for the communication structure between the flash packages and the embedded controller have generally been characterized by the design of multi-channel, multi-way buses. Although the channels in this architecture build on well-understood on-board routing concepts with minimum complexity, enhancing parallelism in terms of an increased number of channels has both advantages and disadvantages. On one hand, high concurrency would generally improve resource utilization and increase throughput. On the other hand, addressing the channel access arbitration problem in the controller is not trivial, and its latency and cost overhead may not be tolerable. For instance, a fully-interconnected structure (one outgoing channel per flash package) is optimal in terms of bandwidth, latency, and power usage. However, the controller complexity in this structure increases linearly with the number of flash packages. Thus, there always have to be tradeoffs between the controller complexity, cost, and the maximum throughput.
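The striping idea behind the multi-channel structure can be sketched as a simple address-to-channel mapping. The round-robin layout below is a hypothetical example, not the paper's exact policy:

```python
# Hypothetical round-robin striping of logical pages across bus channels.
# Channel/package counts are illustrative, not from the paper.
NUM_CHANNELS = 4
PACKAGES_PER_CHANNEL = 2

def locate(lpn):
    """Map a logical page number to (channel, package, page offset)."""
    channel = lpn % NUM_CHANNELS
    package = (lpn // NUM_CHANNELS) % PACKAGES_PER_CHANNEL
    offset = lpn // (NUM_CHANNELS * PACKAGES_PER_CHANNEL)
    return channel, package, offset

# Consecutive pages land on different channels, so a sequential request
# streams from several packages in parallel:
layout = [locate(lpn)[0] for lpn in range(8)]
```

With this layout a run of eight consecutive pages touches every channel twice, which is what lets the aggregate bandwidth approach the sum of the per-package bandwidths.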
On the contrary, attaching more flash packages to a shared channel in a bus-based communication structure largely degrades the overall SSD performance and power efficiency. Indeed, this inefficiency is a direct effect of the increased access contention and capacitive load of the attached units.</p><p>Lack of pipelining support is the other shortcoming of shared channels. More precisely, when the depth of the service queue in the controller grows beyond the number of physical channels, a single channel has to be shared by more than one job, and those jobs must be serialized [2]. In short, a change in the communication structure that supports pipelined data transmission would substantially reduce the aforementioned performance and power loss.</p><p>For maximum flexibility and scalability, it is generally accepted to move towards a shared, segmented network communication structure. This notion translates into a data-routing network consisting of communication links and routing nodes, namely Network-on-SSD (NoSSD). In contrast to the shared-channel communication scheme, such a distributed medium scales well with package count and internal parallelism. Additional advantages include increased bandwidth as well as improved throughput and reliability by exploiting the parallelism provided in NoSSD.</p><p>3.1 NoSSD Design Concepts</p><p>Figure 2 illustrates topological aspects of a sample NoSSD structured as a 4-by-4 grid that is connected to the flash controller on one side (one outgoing controller channel per row).
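The 4-by-4 grid just described can be captured by a small adjacency helper. This is only an illustrative sketch; modeling the controller as attaching at column 0 of each row is an assumption based on Figure 2.

```python
# Illustrative 4x4 grid topology: each routing node links to its
# north/south/east/west neighbors; the flash controller is assumed to
# attach one channel per row, at column 0.
ROWS, COLS = 4, 4

def neighbors(r, c):
    """Grid neighbors of node (r, c); the local flash port is implicit."""
    cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(nr, nc) for nr, nc in cand if 0 <= nr < ROWS and 0 <= nc < COLS]

def controller_attached(r, c):
    return c == 0   # one outgoing controller channel per row (assumption)

# A corner node has 2 grid neighbors, an edge node 3, an inner node 4.
```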
In a simplified perspective, the NoSSD contains routing nodes, which route data according to the chosen routing protocol, and inter-router links connecting the nodes. Each routing node has to be logically embedded within the flash access circuitry, so that the routing of the inter-node links onto the SSD board shows more regularity.</p><p>Fig. 2. Logical view of a NoSSD with 4-by-4 grid topology.</p><p>Details of the input-buffered router element alongside the flash memory logic are shown in Figure 2. The router connects to the local flash and adjacent units through bidirectional channels. A local port is used to connect the flash logic to the network substrate, while the other four ports provide connections to the neighbors in the grid topology. This rich connectivity, however, is limited by the package pin count. To alleviate this constraint, we assume that the data and control pins of a flash package are now multiplexed, forming a wide access port (Section 3.2). This port is then uniformly separated into parts, each connected to an adjacent package.</p></li><li><p>Considering a commodity design of the buffer storage, routing algorithm, and arbiters, the router logic is simple, with a few hundred gates per flash package. Among the router elements, the area of a router is dominated by buffer storage. If we assume one flit of buffering for each input channel (with 8-bit width, as shown in Figure 3), the total buffering requirement becomes about 40 bits per router. Our estimation shows the routing logic, arbiters, and buffer storage will occupy an area of less than 1% of a dense flash chip.
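As one concrete possibility for the routing protocol (illustrative only; the text does not fix the algorithm at this point), dimension-order XY routing keeps the per-node decision logic to a few comparators, consistent with the small router budget estimated above:

```python
# Dimension-order (XY) routing on a 4x4 grid: travel along the row (X)
# first, then along the column (Y). One simple routing-protocol choice,
# shown here only for illustration.
def xy_route(src, dst):
    """Return the list of hops from src to dst, e.g. (0, 0) -> (2, 1)."""
    (r, c), path = src, []
    while c != dst[1]:                 # X dimension first
        c += 1 if dst[1] > c else -1
        path.append((r, c))
    while r != dst[0]:                 # then Y dimension
        r += 1 if dst[0] > r else -1
        path.append((r, c))
    return path

# Buffering check against the text: one 8-bit flit on each of the
# 5 input ports (4 neighbors + local) gives 5 * 8 = 40 bits per router.
BUFFER_BITS = 5 * 8
```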
In addition to this area, we assume the same inter-package wiring layout as shared-channel SSDs use, which makes the effect of the wiring on design complexity substantially smaller than that of the router design.</p><p>3.2 Messaging Protocol in NoSSD</p><p>Using concepts from interconnection networks, a NoSSD can use the wormhole packetization model [3], in which control information and raw data are all integrated into packets. In more detail, each packet carries read/write data, flash access commands, and network address information. The packet format, shown in Figure 3, consists of header flow control units (flits) followed by the payload flits. The header carries the destination address, request/response information, and the packet size, while the payload flits include raw data. An additional 1-bit field identifies the type of a flit, which can be head or payload. The routing logic decides on the output port for a message when its header reaches the head of an input port's buffer. When the route is determined, the message waits until the next-hop buffer is available and preempts an output channel after arbitration. A crossbar switch uses the arbitration signals to provide synchronous connections between any pair of input and output lines. Finally, each flit is physically transmitted over a channel. When a message reaches its destination, the local port is allocated and the message is fed into the flash control logic flit by flit.</p><p>Fig. 3. Packet format for the 4×4 NoSSD messaging protocol. The head flit carries a 4-bit destination address, 3-bit request/response information, and the packet size; every flit carries a 1-bit type field (head or payload). The 3-bit request/response codes are: 000 ReadReq, 001 Read, 010 Write, 011 WriteAck, 100 Erase, 101 EraseAck.</p><p>To access a page for a read operation, a read command is encapsulated into a packet (i.e.
ReadReq packet) and is issued from the controller toward the target flash package. The response is returned by the destination through a Read packet, which includes both the data of the read page and its meta-data (e.g. CRC and wear-leveling information). For a write operation, on the other hand, the controller initially issues a packet (i.e. a Write packet) containing both the write command and the data, with the target package as the destination. In reply, after a successful write, the flash package forms a WriteAck packet with the controller as the destination and the meta-data as the payload, which is then injected into the network for routing. The same mechanism is applied to the erase operation through the Erase and EraseAck packets.</p><p>3.3 Routing Mechanism in NoSSD</p><p>Similar to conventional SSDs, NoSSD is designed as a modular architecture, but some components are added to obtain a fully-pipelined inter-package communication structure with specific characteristics (e.g. the messaging protocol). Compared to interconnection networks, NoSSD has some common characteristics and some specific features, as follows.</p><p>NoSSD uses a synchronous pipelined router architecture in order to be compliant with available network and flash memory standards. The data transmission between routers, on the other hand, is asynchronous....</p></li></ul>
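The head-flit fields of Figure 3 (1-bit flit type, 4-bit destination address, 3-bit request/response code) can be sketched as simple bit-packing. The exact bit ordering below is an assumption for illustration, since the text does not specify it:

```python
# Bit-packing sketch for a NoSSD head flit, following Figure 3:
# [type:1][dest:4][cmd:3] in one byte. The bit ordering chosen here
# is an assumption for illustration, not specified in the paper.
HEAD, PAYLOAD = 0, 1
CMD = {"ReadReq": 0b000, "Read": 0b001, "Write": 0b010,
       "WriteAck": 0b011, "Erase": 0b100, "EraseAck": 0b101}

def pack_header(dest, cmd):
    """Pack a head flit for one of the 16 nodes of the 4x4 grid."""
    assert 0 <= dest < 16 and cmd in CMD
    return (HEAD << 7) | (dest << 3) | CMD[cmd]

def unpack_header(flit):
    """Return (flit type, destination address, command code)."""
    return (flit >> 7) & 0x1, (flit >> 3) & 0xF, flit & 0x7

hdr = pack_header(dest=9, cmd="Write")
```

A Write transaction would then be a head flit like `hdr` followed by payload flits carrying the page data, with the WriteAck reply built the same way in the opposite direction.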