The Journal of Systems and Software 97 (2014) 86–96
Contents lists available at ScienceDirect
Journal homepage: www.elsevier.com/locate/jss

O1FS: Flash file system with O(1) crash recovery time

Hyunchan Park (a), Sam H. Noh (b), Chuck Yoo (c,*)

(a) Graduate School of Convergence Information Technology, Korea University, Anam-Dong, Sungbuk-Gu, Seoul 136-713, Republic of Korea
(b) School of Computer and Information Engineering, Hongik University, 72-1 Sangsu-Dong, Mapo-Gu, Seoul 121-791, Republic of Korea
(c) College of Information and Communications, Korea University, Anam-Dong, Sungbuk-Gu, Seoul 136-713, Republic of Korea

Article history: Received 5 September 2013; received in revised form 30 June 2014; accepted 1 July 2014; available online 10 July 2014.

Keywords: O1FS; crash recovery technique; NAND flash file system

Abstract: The crash recovery time of NAND flash file systems increases with flash memory capacity. Crash recovery usually takes several minutes for a gigabyte of flash memory and becomes a serious problem for mobile devices. To address this problem, we propose a new flash file system, O1FS. A key concept of our system is that a small number of blocks are modified exclusively until we change the blocks explicitly. To recover from crashes, O1FS only accesses the most recently modified blocks rather than the entire flash memory. Therefore, the crash recovery time is bounded by the size of the blocks. We develop mathematical models of crash recovery techniques and prove that the time complexity of O1FS is O(1), whereas that of other methods is proportional to the number of blocks in the flash memory. Our evaluation shows that the crash recovery of O1FS is about 18.5 times faster than that of a state-of-the-art method.

© 2014 Elsevier Inc. All rights reserved.

* Corresponding author. Tel.: +82 2 3290 3198; fax: +82 2 922 6341. E-mail address: [email protected] (C. Yoo).

http://dx.doi.org/10.1016/j.jss.2014.07.008
0164-1212/© 2014 Elsevier Inc. All rights reserved.

1. Introduction

In recent years, NAND flash memory has been widely adopted in portable devices, such as mobile phones, personal digital assistants (PDAs), and MP3 players, because of its non-volatility, low power consumption, and shock resistance (Douglis et al., 1994). Flash memory capacity has also increased rapidly in order to keep pace with market demands. A 128 GB multi-level cell (MLC) NAND flash memory chip is currently the largest available memory of its kind in the market (Micron, 2013). However, as flash memory capacity continues to increase, portable device users are consequently burdened with longer boot times (Bird, 2004).

The main cause of longer boot times is that the time required to initialize flash file systems increases in proportion to the capacity of the flash memory. This is because the entire flash memory is accessed during initialization in most current systems. Typical flash file systems are log-structured due to flash memory characteristics (Rosenblum and Ousterhout, 1992; Kawaguchi et al., 1995; Gal and Toledo, 2005). Therefore, the metadata in the logs are stored at arbitrary positions in the flash memory. During file system initialization, metadata scattered throughout the flash memory need to be accessed to construct an in-memory metadata structure. This has a time requirement that is proportional to the flash memory capacity.

Several techniques have been proposed to reduce the initialization time (Lim and Park, 2006; Wu et al., 2006; Yim et al., 2005). The aim of these initialization techniques is to store essential metadata in an indexed location so that initialization may be completed simply by reading the metadata. However, if the file system is abnormally unmounted by a system crash or abrupt removal of the battery, the metadata may not reflect the correct state of the file system. In these situations, crash recovery is necessary to check the integrity of the file system, which requires accessing the entire flash memory. Hereafter, we refer to initialization without recovery as normal initialization and initialization after abnormal unmount as crash recovery.

In this paper, we propose O1FS to provide a time-bound crash recovery technique irrespective of flash memory capacity. The key concept behind O1FS is to divide the entire flash memory into working areas. A working area is simply a logical group of blocks. At a specific time, all file system modifications are recorded in only one working area. The working area can change, and when it does, O1FS records relevant information, such as the locations of the previous and next working areas, at designated locations. O1FS recovers efficiently from crashes because it can immediately access the most recent working area where all the missing data are stored. To verify the effectiveness of O1FS, we develop execution time models for existing crash recovery techniques. Using these models, we show that the time complexities of existing techniques are O(n), where n denotes the capacity of the flash memory, whereas the time complexity of O1FS is O(1).

We implement O1FS in Linux 2.6.29 and evaluate several aspects of its performance. We first experimentally verify the complexity





analysis. In our experiments, performance comparisons are made using the state-of-the-art log-based method (Wu et al., 2006). O1FS completes crash recovery within 1.5 s, whereas the log-based method takes approximately 26 s in the worst case with 1 GB flash memory. Furthermore, we show that O1FS incurs negligible performance degradation during flash file system operations, such as read and write, and has no negative effects on the wear leveling of the flash memory.

The main contributions of this paper are summarized as follows:

i. We develop mathematical models of crash recovery techniques for flash file systems.
ii. We prove that the crash recovery times of existing techniques increase with flash memory capacity.
iii. We propose a new flash file system, O1FS, with O(1) crash recovery time.
iv. We implement O1FS in Linux and verify the bounded crash recovery time of O1FS.
v. O1FS provides a level of performance and flash memory lifetime similar to that of existing file systems.

The remainder of this paper is organized as follows. We present the background and related work in Section 2. In Section 3, we explain our design goals and the details of O1FS. Section 4 contains the time complexity analysis of various crash recovery techniques. The results of our experimental evaluation are presented in Section 5. In Section 6, we discuss the possibility of adopting the key idea of O1FS in other flash storage systems. Our conclusions and future work are presented in Section 7.

2. Background and related work

In this section, we present the characteristics of NAND flash memory. We also describe the crash recovery techniques currently used by flash file systems, which form the basis of our study. Because our interest is in initialization and crash recovery, we do not give detailed descriptions of these systems and concentrate only on relevant issues.

2.1. Characteristics of NAND flash memory

We describe in this section the features of NAND flash memory related to our research. NAND flash memory devices are composed of blocks, each of which contains pages. A block is an erase operation unit, whereas a page, the size of which is currently 512 B, 2 KB, or 4 KB, is a read/write operation unit. Each page has additional space, known as the spare area, for storing metadata. The size of the spare area is currently 16 B, 64 B, or 128 B. The remainder of the page (other than the spare area) is known as the data area. The spare area is much smaller than the data area, due to which a read operation on the spare area is faster than one on the data area.
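The geometry described above can be sketched as a toy model. The specific sizes chosen here (2 KB pages, 64 B spare areas, 64 pages per block) are illustrative assumptions drawn from the ranges cited, not values prescribed by the paper:

```python
# Illustrative NAND geometry: a block is the erase unit, a page the
# read/write unit, and each page carries a small spare area for metadata.
PAGE_DATA_BYTES = 2048   # data area (one of the cited page sizes)
PAGE_SPARE_BYTES = 64    # spare area (one of the cited spare sizes)
PAGES_PER_BLOCK = 64     # assumed block geometry

class Page:
    def __init__(self):
        self.data = bytearray(PAGE_DATA_BYTES)
        self.spare = bytearray(PAGE_SPARE_BYTES)

class Block:
    """A block is the erase unit; erasing resets all of its pages."""
    def __init__(self):
        self.pages = [Page() for _ in range(PAGES_PER_BLOCK)]

    def erase(self):
        self.pages = [Page() for _ in range(PAGES_PER_BLOCK)]

# Reading only the spare areas touches a small fraction of the block,
# which is why spare-area scans are faster than full-page reads.
block = Block()
spare_bytes = sum(len(p.spare) for p in block.pages)
full_bytes = sum(len(p.data) + len(p.spare) for p in block.pages)
```

With these assumed sizes, the spare areas are 1/33 of the total block bytes, which illustrates why a spare-area-only scan is much cheaper than reading whole pages.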

Out-of-place updates are an important characteristic of flash memory. Unlike magnetic storage devices (e.g., hard disk drives (HDDs)), flash memory cannot be over-written, and a block must be erased before more data can be written to it. Due to the electrical properties of NAND flash memory, the number of erase operations in a block is limited. If the erase count exceeds the limit, the block wears out, and this may produce errors during read/write operations. To prevent blocks from wearing out, they need to be erased as evenly as possible. This is known as the wear leveling of flash memory and is an essential design consideration in all flash memory file systems. Wear leveling is an important issue because it is related to the lifetime of flash-based storage. Thus, new wear leveling algorithms and lifetime-aware garbage collection algorithms have been proposed in recent years (Kwon et al., 2011; Wang et al., 2012).

Fig. 1. Examples of pages accessed during initialization of JFFS2 and YAFFS2 (access left to right). Compared with JFFS2, YAFFS2 accesses the full page only when the page contains an I-node.
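The wear-leveling goal just stated can be sketched with a minimal greedy policy. This is a toy illustration of the idea of spreading erases evenly, not one of the cited algorithms:

```python
# A toy wear-leveling policy: always erase the least-worn block so that
# erase counts stay as even as possible across the device.
def pick_block_to_erase(erase_counts):
    """Return the index of the block with the fewest erases so far."""
    return min(range(len(erase_counts)), key=lambda i: erase_counts[i])

counts = [5, 2, 9, 2]            # per-block erase counts (illustrative)
victim = pick_block_to_erase(counts)
counts[victim] += 1              # erasing the victim increments its count
```

Real policies must also weigh which blocks hold invalid data, but the greedy minimum-count choice captures the evenness objective.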

Because of the peculiar characteristics of flash memory devices, several studies on storage systems have been conducted. Chung and Park propose a new flash driver algorithm to minimize erase operations (Chung and Park, 2007). They observe that numerous erase operations can degrade storage performance because an erase operation is much slower than a read or write operation. They delay the erase operation as long as possible by tracking the detailed states of blocks, so that unnecessary erase operations are reduced. Deng et al. model the aging process of flash storage according to the data access pattern (Deng et al., 2014). They present an analysis showing that the aging rate of flash storage can be stabilized if it does not exceed a certain threshold. However, their model and analysis concern flash storage used in a redundant array of independent disks (RAID), so they are difficult to apply to a flash file system, which mostly uses a single NAND flash chip. The access pattern also affects the performance of flash storage. Jung et al. propose a new buffer cache replacement algorithm to enhance the I/O performance of flash storage (Jung et al., 2008b). Because a write operation is slower than a read operation and causes erase operations, which are the slowest operations in a flash device, they reduce write operations in the buffer cache layer. By hot and cold analysis of data, they delay the flush of not-cold data to flash storage, so that the write hit ratio of the buffer cache is increased and write operations to flash storage are reduced.

Although the above studies address important issues for flash-based storage systems, they are not germane to our focus, the crash recovery of NAND flash file systems. The next subsection introduces the crash recovery techniques in NAND flash file systems, from the basic techniques to the state of the art.

2.2. Crash recovery techniques

We categorize crash recovery techniques into the basic technique, the super block technique, and the logical block layer technique. Below are the characteristics and disadvantages of each of these techniques.

2.2.1. Basic techniques in JFFS2 and YAFFS2

The Journaling Flash File System version 2 (JFFS2) and the Yet Another Flash File System version 2 (YAFFS2) are two representative flash file systems that are used widely (Woodhouse, 2001; AlephOneLtd, 2002). JFFS2 and YAFFS2 access pages during initialization, as shown in Fig. 1. JFFS2 has to read all used pages because it stores the metadata and data together in the data areas. YAFFS2 reads the spare areas of all used pages in order to collect the metadata. YAFFS2 reads only those pages that contain
Fig. 2. Example of pages accessed during the initialization of flash file systems using super blocks. The dotted-line arrows represent arbitrary metadata locations stored in the super block.

metadata, due to which it requires far fewer read operations than JFFS2. However, YAFFS2 still needs to access every block. We refer to these access methods as full scanning of the flash memory. Full scanning requires that the first spare area in each block be accessed at least once in order to determine whether the block is empty. Crash recovery in both file systems operates in the same way as initialization.
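The YAFFS2-style scan just described can be sketched as a toy read counter. The dict-based page representation and block shapes are illustrative assumptions, not the on-flash format:

```python
# Toy model of full scanning: every block's spare areas are read in
# order until an empty page ('None') is found, and a page's data area
# is read only when its spare area says it holds an I-node.
def full_scan_yaffs2_style(blocks):
    """Return the number of page reads performed by the scan."""
    reads = 0
    for block in blocks:
        for page in block:
            reads += 1                 # spare-area read (also detects emptiness)
            if page is None:           # first empty page: rest of block unused
                break
            if page.get("inode"):      # metadata page: read the data area too
                reads += 1
    return reads

# Two blocks of four pages: one partly used, one completely empty.
# Even the empty block costs one spare-area read, which is why the scan
# grows with the number of blocks rather than the amount of data.
blocks = [
    [{"inode": True}, {"inode": False}, None, None],
    [None, None, None, None],
]
```

Here the scan costs 5 reads: four for the first block (including the I-node's data area and the empty-page check) and one to discover that the second block is empty.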

To expedite initialization, Yim et al. have proposed the Snapshot technique (Yim et al., 2005). "Snapshot" here refers to a collection of in-memory information for the file system. When the file system is unmounted, a snapshot of it is stored in some fixed blocks. During the next bootup of the device, initialization is completed after restoring the snapshot to the main memory. Experimental results have shown that the initialization of a device with 100 MB of data on a 128 MB NAND flash memory device using the Snapshot technique takes 160 ms. However, the Snapshot technique introduces new overheads when storing the snapshot during the unmount time. In the same experiments, the overhead required 250–500 ms for a 128 MB flash memory (Yim et al., 2005). The unmount time increases with the size of the stored data. The Snapshot technique achieves faster initialization than JFFS2 and YAFFS2, but its crash recovery does not improve because a snapshot cannot be recorded if the file system crashes.
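The Snapshot technique's limitation can be sketched in a few lines. The class and method names here are hypothetical; the point is that the snapshot is written only at clean unmount, so a crash forces a fall back to full scanning:

```python
# Toy model of the Snapshot technique: in-memory state is persisted to
# fixed blocks only at clean unmount time.
class SnapshotFS:
    def __init__(self):
        self.state = {}          # in-memory file system metadata
        self.snapshot = None     # contents of the fixed snapshot blocks

    def unmount(self):
        # The unmount-time overhead: the whole in-memory state is written.
        self.snapshot = dict(self.state)

    def mount(self, crashed=False):
        # After a crash the snapshot is missing or stale, so the file
        # system must fall back to scanning the entire flash memory.
        if crashed or self.snapshot is None:
            return "full-scan"
        self.state = dict(self.snapshot)
        return "snapshot"

fs = SnapshotFS()
fs.state = {"file": "data"}
fs.unmount()                     # clean unmount: snapshot recorded
```

A clean mount restores from the snapshot, while a crashed mount degrades to the full scan, which is exactly why the technique speeds up normal initialization but not crash recovery.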

In 2013, Park and Kim proposed the Enhanced NAND Flash memory File System (ENFFiS) for mobile embedded multimedia systems (Park and Kim, 2013). The design focus of ENFFiS is on building a stable flash file system that is resistant to crashing. The file system adopts the Snapshot technique to improve normal initialization time and records shorter times than YAFFS2 and JFFS2. Although ENFFiS has been shown to be stable in case of a system crash, the crash recovery time of ENFFiS has not been evaluated. Because it does not have a scheme for efficient crash recovery, we assume that ENFFiS has the same limitation as the Snapshot technique.

2.2.2. Super block technique

A super block is a group of blocks at a fixed location that stores the addresses of the metadata. The file system allocates blocks for metadata on demand and records the addresses of these blocks in the super block. Because the super block technique gradually records metadata, it does not incur any unmount time overheads. However, the technique introduces runtime overheads when storing the address of the block containing the metadata, although this overhead is not concentrated at a specific period. Fig. 2 shows the search procedure during initialization with the super block technique (accessing metadata from the super block). It is faster than the initialization of YAFFS2 because it eliminates the search of the spare areas in all used pages.

The log-based method (LBM) is a representative super block technique (Wu et al., 2006). The LBM records a log when the file system performs a write/delete operation. It completes the initialization process by replaying the operations in the log. All logs are stored at an indexed position called the log-segment, and the index is stored in reserved blocks called log-segment directories. Therefore, the LBM can directly access logs without full scanning.


Because the LBM stores metadata in the form of a log, it is similar to the checkpoint technique used for crash recovery by the Sprite log-structured file system (LFS) (Rosenblum and Ousterhout, 1992). However, Sprite LFS always stores its logs at a fixed location because it is designed for HDDs, which are capable of in-place updates. Therefore, Sprite LFS does not use any type of super block.

The Core Flash File System (CFFS) is another flash file system that uses the super block technique (Lim and Park, 2006). Unlike other flash file systems, CFFS is not log-structured. Instead, it is structured in a manner similar to the UNIX file system, which stores metadata and data for files in separate locations. CFFS stores metadata in dynamically allocated blocks, whereas the index for the blocks is stored in the super block, known as the InodeMapBlock.

The super block technique achieves fast initialization, but crash recovery still requires full scanning to search for missing modifications. A system crash deprives a file system of the free block list, block states, and data page addresses. To recover from a crash, the states of the blocks and pages that were not updated have to be identified. The main difference between O1FS and previous techniques is the block allocation policy: O1FS limits the blocks that need to be scanned when searching for missing data.

2.2.3. Logical block layer technique

A flash file system with a logical block layer consists of two layers: a file system layer at the top and a logical block layer at the bottom. The logical block layer provides a logical view of the flash blocks to the upper layer. The two layers manage the logical and physical blocks, respectively, such that each of them has to independently recover from crashes. We now describe the crash recovery techniques for the file system layer and the logical block layer in JFFS3 and ScaleFFS.

JFFS3 (also known as the Unsorted Block Images File System (UBIFS)) is a representative flash file system built on a logical block layer (Hunter, 2008). JFFS3 comprises an upper flash file system layer and a lower flash block layer, known as a UBI-layer. The main goal of a UBI-layer is to ensure wear leveling across partitions in a flash memory device. If different flash management systems manage each partition in a flash device in different ways, wear leveling cannot be achieved across the device. To overcome this problem, the UBI-layer maps physical blocks to logical blocks and dynamically allocates the blocks across the partitions.

The file system on top of the UBI-layer uses the super block technique to solve the initialization time problem of JFFS2. The file system stores the indices of all the metadata and data located in scattered and arbitrary positions, and maintains the addresses of the positions in the super block, in a master node stored at a fixed position. Furthermore, the file system also uses a journaling mechanism to store recent modifications to the index. Journaling means that the file system itself does not require full scanning to find missing modifications to the index after a crash.

Despite the efforts of the file system in its top half, the overall initialization time of JFFS3 has the same problem as YAFFS2 because it depends largely on the initialization time of the UBI-layer in its bottom half. The UBI-layer stores the mapping information in an erase block assignment table (EAT). To construct the EAT after a crash, the UBI-layer requires a full scanning of the flash memory. The UBI-layer needs only one page per block to store the metadata for the EAT. Thus, the initialization of the UBI-layer is faster than that of YAFFS2. However, the initialization problem persists because it increases with the capacity of the flash memory.

ScaleFFS is another log-structured flash file system that uses the logical block layer technique (Jung et al., 2008). ScaleFFS divides the flash memory into two regions: a log area and a checkpoint area. The log area stores logs in virtually sequentially ordered blocks. To implement virtual ordering, blocks in the log area are logically threaded, similarly to a linked list structure, by storing a pointer to the next adjacent block in the spare area of the very first page. The checkpoint area stores checkpoints that contain complete and consistent file system metadata, including the block number in the log area that will be modified next. As with the Snapshot technique, initialization in ScaleFFS can be achieved using the information from the checkpoint. ScaleFFS loads the file system metadata from the checkpoint into the main memory and searches, starting from the next target block, for any modifications in the log area that were not stored in the checkpoint. If any modifications are found during the search, the in-memory information is updated appropriately.
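The virtual ordering just described can be sketched as a pointer walk. The dict-based next-pointer map stands in for the spare area of each block's first page; the names are illustrative:

```python
# Toy model of ScaleFFS-style virtual ordering: each log block records
# the number of the next block in its first page's spare area, forming
# a singly linked chain that can be walked without a full scan.
def walk_log_chain(next_ptr, head):
    """Follow next-block pointers from the checkpoint's head block."""
    order, block = [], head
    while block is not None:
        order.append(block)
        block = next_ptr[block]    # None terminates the chain
    return order

# next-pointers: block 4 -> 1 -> 6 -> end (physical order is arbitrary)
chain = {4: 1, 1: 6, 6: None}
```

Because each pointer lives in an already-written spare area, it cannot be rewritten in place under out-of-place updates, which is what makes replacing a block in the middle of the chain problematic.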

The virtual ordering of blocks using a logical block layer in ScaleFFS avoids full scanning to search for missing modifications. However, the implementation of virtual ordering may be difficult because of the out-of-place update characteristic of flash memory. The pointer to the next adjacent block cannot be updated if garbage collection or bad block management replaces the next block. To avoid updating the pointers during virtual ordering, the current implementation of ScaleFFS uses a round-robin policy of garbage collection, similar to the original JFFS. However, this policy is known to be unrealistic because it requires excessive erase operations to move valid data, which motivated the development of JFFS2 from JFFS. The authors of ScaleFFS also admit that their garbage collection policy needs to be improved for better flash memory utilization (Jung et al., 2008). Furthermore, there are no evaluation results for crash recovery using ScaleFFS, whereas it was evaluated for normal initialization.

In summary, the initialization time of a flash storage system with a logical block layer technique still increases with flash memory capacity, whereas O1FS achieves faster crash recovery that is not dependent on capacity. Furthermore, O1FS does not require an additional layer, which may slow down the performance of a storage system. We analyze crash recovery techniques in Section 4 and verify the crash recovery performance of O1FS with several evaluations in Section 5.

3. Design of O1FS

In this section, we detail the design of O1FS, which divides the flash memory into working areas where the data and the metadata are stored in a log-structured manner. Our design goals for O1FS are: (1) to construct a file system structure that does not lose the metadata used for crash recovery, and (2) to develop an efficient crash recovery technique using the metadata.

3.1. Block classification

To achieve our design goals, we classify all blocks into four types: super block, journal block, data block, and free block. For clarity, we refer to a block of flash memory as a flash block. A super block is the only type of block found in a fixed location, and it consists of four or more flash blocks. The super block contains static configurations of the file system and a series of journal block tables (JBTs). The static configurations, such as the size of a working area, are stored in the very first page of the block, whereas the JBTs are stored in the remaining pages of the block. A JBT comprises several pages, where the data area stores the addresses of the journal blocks. We store in the spare area information such as the length of the JBT, the number of journal blocks in the JBT, the unmount flag, the location of the latest working area, and the version number. When a JBT is exhausted, a new JBT is appended to the super block. Therefore, the most recent JBT is located in the last written page of the super block. During the unmounting of the file system, O1FS sets the unmount flag of the most recent JBT to zero. This shows that the file system has been correctly unmounted.

Fig. 3. Flash layout example of O1FS. A working area consists of six flash blocks, and there are three working areas in the flash memory.

The journal block contains journal pages that store the journals. A journal stores metadata related to a file system modification: namely, the I-node number, operation type, start offset, target page number, size, and version. The version is an incremental counter used for sequence recognition. O1FS initialization is completed by sequentially replaying the journals because they represent all modifications made to the file system. In the spare area of each journal page, we record information related to the current working area: namely, the number of remaining free blocks and the total number of erase operations performed on the area. The next working area is decided based on this information.
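The journal fields named above can be sketched as a record, replayed in version order. The types and the tuple returned by the replay helper are assumptions for illustration, not the on-flash encoding:

```python
# Toy journal record with the fields the paper names; the version is an
# incremental counter, so sorting by it recovers the modification order.
from dataclasses import dataclass

@dataclass
class Journal:
    inode: int       # I-node number
    op: str          # operation type
    offset: int      # start offset
    page: int        # target page number
    size: int
    version: int     # incremental counter for sequence recognition

def replay_in_order(journals):
    """Apply journals by ascending version; return the ordered (inode, op) trace."""
    return [(j.inode, j.op) for j in sorted(journals, key=lambda j: j.version)]

# Journals flushed out of order still replay correctly thanks to versions.
js = [
    Journal(2, "write", 0, 10, 512, 3),
    Journal(1, "create", 0, 4, 0, 1),
    Journal(1, "write", 0, 5, 2048, 2),
]
```

Sorting by version reconstructs create-then-write for I-node 1 before the later write to I-node 2, which is the sequence-recognition role the version counter plays.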

The data block stores file data. Data blocks are allocated sequentially, starting from the first free block in the current working area. Free blocks are flash blocks that have not been allocated.

3.2. Block management

In this subsection, we describe the management of blocks. During file system generation, we first statically allocate the super block using the first few flash blocks. The number of blocks must be a multiple of two because we duplicate the super block. This prevents the loss of the JBT due to wear out or other incidents. Furthermore, we reserve the same number of blocks to immediately replace an old block with a new block if the pages in the block are exhausted. Therefore, the minimum number of flash blocks allocated to a super block is four. As shown in Fig. 3, O1FS stores the new JBT in the first flash block, and a third flash block is reserved to replace the first block. The second and fourth flash blocks are duplicates of the first and third, respectively.

The static allocation of super blocks can cause a wear leveling problem because the flash blocks allocated for the super block can wear out earlier than other blocks. Yim et al. experienced the same problem when they designed the Snapshot technique. This is because a snapshot needs to be stored in several blocks at a fixed position in a round-robin manner, similar to a super block (Yim et al., 2005). To address this problem, they dynamically adjusted the number of blocks that store the snapshot: if these blocks are erased more often than the rest of the blocks, more blocks are allocated to store snapshots.

A journal block is allocated from a free block in the current working area. Frequent updates to the JBT would inhibit file system performance. Thus, a bundle of journal blocks is allocated together, and the addresses of the blocks are immediately recorded in the JBT. The pre-allocation of journal blocks and the pre-update of the JBT (using a bundle of blocks) improve file system performance by reducing synchronous write operations for the JBT while user I/O requests are being processed.

O1FS caches journals in the main memory and flushes them to the flash memory when the cached journals fill a page. To ensure that the flushed journals represent the actual data blocks in the flash memory, O1FS also flushes the write buffer at the same time. This is a common approach to journal management. For example, the third extended filesystem (EXT3) uses a similar mechanism, known as the ordered mode (Johnson, 2010).
Page 5: O1FS: Flash file system with O(1) crash recovery time

90 H. Park et al. / The Journal of Systems and Software 97 (2014) 86–96

Fig. 4. Steps during crash recovery with O1FS.

Fig. 3 shows an example of an on-flash layout for O1FS based on the above description. The example contains three working areas, each comprising six flash blocks. The super block comprises four flash blocks.

3.3. Working area management

The management of working areas is a key component of our approach. During file system generation, the user configures the number of blocks in a working area. This number cannot be changed after this phase. The size of the working area produces a trade-off between crash recovery time and file system performance. Larger working areas increase crash recovery execution time, whereas smaller ones cause more frequent working area changes, where each change requires one JBT page write. However, the choice of working area size is not a significant issue, as we show in Section 5.3.

The working area changes if free blocks in the current working area are exhausted. This change requires an immediate one-page write operation to update the JBT. We flush all journals in the cache and update the information for the current area in the journal block. O1FS chooses the next working area based on the total erase count. To support wear leveling, O1FS chooses the working area with the lowest total erase count and at least one free block. This policy prevents the concentration of erase operations on one working area.

If free blocks are no longer available, garbage collection is carried out. The selection of the victim block during garbage collection involves two steps. We first select the victim working area with the lowest erase count. If there is no dirty block to collect, the working area with the next lowest erase count is selected as the victim. We then select the victim block with the highest number of invalid pages among the blocks in the selected working area. Therefore, our policy simultaneously considers the erase count and the number of invalid pages. We expect that the wear leveling of O1FS is similar to that of YAFFS2 because it is a widely used flash file system that uses a similar policy (AlephOneLtd, 2002). The experimental results in Section 5.4 confirm this similarity. O1FS performs garbage collection during its idle time because on-demand garbage collection can be detrimental to file system performance.

3.4. Initialization procedure

This section describes the normal initialization and crash recovery processes of O1FS. The steps of crash recovery are shown in Fig. 4.

When the system is switched on, O1FS finds the valid super block. It then scans the versions in the spare areas of the JBT pages until the most recent JBT is found. If the unmount flag in the most recent JBT is set to true, O1FS performs normal initialization. It reads the addresses of the journal blocks in the JBT as well as the journals. We complete initialization by replaying these journals and constructing the in-memory metadata.

If the file system is not unmounted correctly, crash recovery kicks in and the journals are collected, which is identical to the first step of normal initialization. O1FS then checks for modifications that were not journaled by searching the latest working area. O1FS can recognize the most recent working area because the most recent JBT shows its location. O1FS sequentially scans the spare areas of the blocks in the most recent working area and checks the correspondence between the blocks and the in-memory metadata in valid journals. The integrity of the file system is guaranteed if all metadata correspond to the information in the data blocks. We discard the metadata and the data blocks that do not correspond. O1FS completes crash recovery when the scan ends.
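The correspondence check over the most recent working area can be sketched as follows; the spare-area contents are reduced to a single tag per block for illustration, which is our simplification of the real on-flash format:

```python
# Sketch of the recovery scan: walk the blocks of the latest working
# area, compare what the spare areas claim with what the replayed
# journals recorded, and keep only matching metadata. Mismatched
# entries (and their data blocks) are discarded.

def recover_working_area(spare_summaries, journaled_meta):
    """spare_summaries: {block_no: tag read from the spare area}.
    journaled_meta: {block_no: tag recorded in valid journals}.
    Returns the metadata that is consistent with the flash contents."""
    consistent = {}
    for blk, tag in spare_summaries.items():
        if journaled_meta.get(blk) == tag:
            consistent[blk] = tag  # metadata matches the data block
        # else: drop both the metadata entry and the data block
    return consistent
```

Because only the blocks of one working area are scanned, the cost of this step is bounded by the (fixed) working area size rather than by the flash capacity.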

During the boot period, O1FS constructs in-memory metadata that represent the locations of all data, including the I-nodes of all files, using the information in the journals. O1FS manages the I-nodes and their locations in a hash table, keyed by the I-node number to facilitate rapid searching. All the locations of the data blocks of the I-nodes are stored in a B-tree. Because the I-node hash table and the B-tree of the data blocks are constructed completely during the mounting period, O1FS can immediately access any item of the metadata without having to search after the system boots up.
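The shape of this in-memory metadata can be illustrated with a dictionary standing in for the I-node hash table and a bisect-maintained sorted offset list standing in for the B-tree; both are our simplifications, not O1FS's actual structures:

```python
import bisect

# Illustrative in-memory metadata: a hash table keyed by I-node
# number and, per I-node, an ordered offset index mapping file
# offsets to flash locations.

class INode:
    def __init__(self, number):
        self.number = number
        self.offsets = []    # sorted file offsets (the B-tree's key role)
        self.locations = {}  # offset -> (block, page) in flash

    def insert(self, offset, location):
        if offset not in self.locations:
            bisect.insort(self.offsets, offset)  # keep offsets ordered
        self.locations[offset] = location

    def lookup(self, offset):
        return self.locations.get(offset)

inode_table = {}  # I-node number -> INode (hash table for fast search)

def add_inode(number):
    inode_table[number] = INode(number)
    return inode_table[number]
```

After mount-time construction, a metadata lookup is a hash probe plus an index search, with no flash access required.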

4. Performance analysis

In this section, we develop mathematical models of the execution times of existing crash recovery techniques. Our analysis aims to prove that the execution time of existing crash recovery techniques increases in proportion to the total number of blocks in a NAND flash memory. By contrast, the crash recovery time of O1FS is independent of flash memory capacity, and we prove this through mathematical analysis. We provide comparisons based on experimental results, but the mathematical analysis is important because it enables us to forecast the future performance of crash recovery methods.

When the first flash file system was introduced, the flash memory capacity was 16 MB, and the crash recovery time was thus not an issue. At present, crash recovery takes several minutes for mobile devices with a 32–64 GB flash memory, which is a very annoying problem for users. However, the relationship between flash memory capacity and crash recovery time has not been examined mathematically.

Our analysis compares YAFFS2 and LBM with O1FS. We omit JFFS2 because the crash recovery of JFFS2 obviously increases with flash memory capacity. We selected YAFFS2 because Snapshot uses the same crash recovery technique as this file system. We selected LBM as representative of the super block technique. We also omit the logical block layer technique because it uses either the YAFFS2 technique or the super block technique. Table 1 shows a summary of the results.

Table 1. Model of the execution times of crash recovery techniques.

Technique | Execution time model | Time complexity | Proportional to
YAFFS2 and Snapshot | (1) Ty = (Bf + Np)Ts + Nf(Td + Ts) | (2) Ty(Bt) ∈ O(Bt) | # of free blocks, # of files and their data pages
Log-based method | (3) Tlbm = (Bf + Mp)Ts + (L + Mf)(Td + Ts) | (4) Tlbm(Bt) ∈ O(Bt) | # of free blocks, # of missing files and their data pages
O1FS | (5) Twam = (Bfw + Mpw)Ts + (J + Mfw)(Td + Ts) | (6) Twam(Bt) ∈ O(1) | Nothing

The notations used in this section are defined as follows: the data and spare area reading times are Td and Ts, respectively; the total number of blocks in the flash memory is Bt; the number of free blocks in the flash memory is Bf; and the number of files and the total pages in all files are Nf and Np, respectively.
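For concreteness, the three models in Table 1 can be written out as functions. The per-page times below are the ones reported in Table 2 (a full data-plus-spare page read of 638.43 us and a spare-only read of 78.92 us); the function names are our own:

```python
# Execution-time models (1), (3), and (5) from Table 1 as functions.
# Ts is a spare-area read; Td + Ts together form a full (data + spare)
# page read, so Td is derived from the two Table 2 numbers.

TS = 78.92e-6        # spare-only read, seconds (Table 2)
TD = 638.43e-6 - TS  # data portion, so that TD + TS = 638.43 us

def t_yaffs2(Bf, Np, Nf, Td=TD, Ts=TS):      # model (1)
    return (Bf + Np) * Ts + Nf * (Td + Ts)

def t_lbm(Bf, Mp, L, Mf, Td=TD, Ts=TS):      # model (3)
    return (Bf + Mp) * Ts + (L + Mf) * (Td + Ts)

def t_o1fs(Bfw, Mpw, J, Mfw, Td=TD, Ts=TS):  # model (5)
    return (Bfw + Mpw) * Ts + (J + Mfw) * (Td + Ts)
```

For example, t_yaffs2(Bf=0, Np=0, Nf=1) is exactly one full page read, 638.43 us.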

4.1. YAFFS2

The execution time model Ty of YAFFS2 is shown in Table 1. YAFFS2 first searches all free blocks by reading the spare area in the first page of each of the blocks. We express this search as BfTs. If the spare area contains valid data, YAFFS2 recognizes the block as in use. YAFFS2 then reads all spare areas in the used pages to collect the metadata. We express this access as NpTs. If a page contains the metadata of a file in its data area, such as chunk 0 in YAFFS2, it must read the data area to build the I-node of the file. This read can be expressed as Nf(Td + Ts). We rearrange the equation using two terms, Ts and (Td + Ts), because Ts is much smaller than (Td + Ts). The result is Ty, which is expressed as (1) in Table 1. This expression shows that the number of files dominates the crash recovery delay. We now prove (2) from (1) to show that the crash recovery execution time increases with flash memory capacity.

Theorem 1. The time complexity of Ty(Bt) is O(Bt).

Proof. We rewrite (1) as an expression related to flash memory capacity to prove Theorem 1. We present Bt below, where p is the ratio of blocks allocated to used pages, with 0 < p ≤ 1.

Bt = Bf + p(Nf + Np) (7)

p(Nf + Np) represents the number of used blocks, Bu. If we define q as the ratio of the number of used blocks to the total number of blocks, where 0 ≤ q ≤ 1, Bf and Bu can be expressed as (8).

Bf = (1 − q)Bt, Bu = p(Nf + Np) = qBt (8)

Moreover, if we define r as the ratio of Nf to the total number of used pages, where 0 < r ≤ 1, we can express Nf and Np as Nf = r(Nf + Np), Np = (1 − r)(Nf + Np). Thus, Nf and Np are expressed as

Nf = (qr/p)Bt, Np = (q(1 − r)/p)Bt (9)

If we rewrite the execution time model of YAFFS2 using (8) and (9), Ty can be expressed as

Ty = ((1 − q)Ts + (q/p)Ts + (qr/p)Td)Bt (10)

Because p, q, r, Ts, and Td are constants, we can conclude that the time complexity of Ty(Bt) is O(Bt) from (10). □

4.2. Log-based method

The main difference between LBM and YAFFS2 is that the logs indicate whether there are any missing blocks following a crash. After reading the log segments, LBM scans the unspecified blocks to find any missing pages. If the first spare area of a block contains valid data, LBM scans the block because it contains missing pages. Therefore, we express the execution time model of LBM crash recovery, Tlbm, as (3) in Table 1. Instead of Nf and Np in the model of YAFFS2, we insert the number of missing files and the total number of pages of the missing files, Mf and Mp, respectively, into the LBM model. Further, L(Td + Ts) expresses the scanning of all log segments, where L is the total number of used pages in all log segments.

We now prove (4) from (3).

Theorem 2. The time complexity of Tlbm(Bt) is O(Bt).

Proof. We rewrite (3) as an expression related to flash memory capacity. We represent Bt as below, where p is the ratio of allocated blocks to used pages, and 0 < p ≤ 1.

Bt = Bf + p(Mf + Mp + L) (11)

p(Mf + Mp + L) represents the number of used blocks, Bu. If we define q as the ratio of the number of used blocks to the total number of blocks, where 0 ≤ q ≤ 1, Bf and Bu can be expressed as (12).

Bf = (1 − q)Bt, Bu = p(Mf + Mp + L) = qBt (12)

Moreover, we define r as the ratio of Mf to the total number of missing pages, and express Mf and Mp as Mf = r(Mf + Mp), Mp = (1 − r)(Mf + Mp), where 0 < r ≤ 1. Thus, Mf and Mp can be expressed as

Mf = r((q/p)Bt − L), Mp = (1 − r)((q/p)Bt − L) (13)

If we rewrite (3) using (12) and (13), Tlbm can be expressed as follows.

Tlbm = ((1 − q)Ts + (q/p)Ts + (qr/p)Td)Bt + (1 − r)TdL (14)

Note that p, q, r, Ts, and Td are constants, and we can assume that L is a constant value for Bt because it depends only on the number of operations performed on valid files. Therefore, we conclude that the time complexity of Tlbm(Bt) is O(Bt) from (14). □

4.3. O1FS

The main difference between O1FS and LBM is that unspecified blocks are located only in the latest working area, rather than throughout the entire flash memory. After reading the journal pages, O1FS scans the unspecified blocks in the latest working area to find any missing pages. Therefore, we express the execution time model of O1FS crash recovery, Twam, as (5) in Table 1. Instead of Nf and Np in (1), we insert Mfw, the number of missing files, and Mpw, the total number of pages of the missing files, into the model. Moreover, we use J(Td + Ts) to express the scanning of the journal pages, where J is the number of pages in each journal block.

We now prove (6) from (5). To do this, we define Bfw as the number of free blocks in the most recent working area. Bfw, Mpw, and Mfw are constant because a working area has a fixed number of blocks.

Theorem 3. The time complexity of Twam(Bt) is O(1).

Proof. We rewrite (5) as an expression related to flash memory capacity. We present Bt as shown below, where p is the ratio of the number of allocated blocks to the number of used pages in the most recent working area and 0 < p ≤ 1.

Bt = wBw = w(Bfw + p(Mfw + Mpw + Jw)) (15)


Table 2. Target system environments.

Component | Detail
Evaluation board | SIB-N200
CPU | S3C6410 667 MHz
Main memory | 128 MB SDRAM
NAND flash memory | 1 GB (4096 blocks)
NAND page read time | Data + Spare (2 KB + 64 B): 638.43 us; Spare only (64 B): 78.92 us
Kernel | Version 2.6.29
YAFFS2 | Version 12/29/09 GIT download version


We define w as the number of working areas in the flash memory, where w is an integer with the range 1 ≤ w ≤ Bt. Bw is the number of blocks in a working area and Jw is the number of journal pages in the most recent working area. Based on the assumption that each working area has an equal number of journal pages, J is equal to wJw. Therefore, p(Mfw + Mpw + Jw) represents the number of used blocks in the latest working area, Buw. If we define q as the ratio of the number of used blocks to the total number of blocks in the most recent working area, where 0 ≤ q ≤ 1, then Bfw and Buw can be expressed as (16).

Bfw = (1 − q)Bw, Buw = p(Mfw + Mpw + Jw) = qBw (16)

Further, we define r as the ratio of Mfw to the total number of missing pages in the most recent working area and express Mfw and Mpw as r(Mfw + Mpw) and (1 − r)(Mfw + Mpw), respectively, where 0 < r ≤ 1. Thus, Mfw and Mpw can be expressed as

fw = r(

q

pBw − Jw

), Mpw = (1 − r)

(q

pBw − Jw

)(17)

f we rewrite (5) using (16) and (17), Twam can be expressed asollows:

wam =(

(1 − q)Ts + q

pTs + qr

pTd

)Bw + (Ts + Td)J − (Ts + rTd)Jw

(18)

Note that p, q, r, Ts, and Td are constants, and we can assume that has a constant value for Bt because it depends only on the numberf operations in the file system. Jw is a constant because Jw = J/w

nd w is an integer. O1FS lets users configure Bw during file systemeneration, so that Bw is a constant value. Thus, we prove that theime complexity of Twam(Bt) is O(1) from (18). �
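The algebra in this proof is easy to spot-check numerically. The following sketch, with arbitrary constants of our own choosing, derives Bfw, Mfw, and Mpw from (16) and (17) and confirms that the closed form (18) reproduces the model (5):

```python
# Numerical sanity check that closed form (18) equals model (5) under
# the substitutions used in the proof. All constants are arbitrary.

Ts, Td = 1.0, 8.0        # per-page spare and data read costs
p, q, r = 0.5, 0.6, 0.3  # ratios defined in the proof
w, Bw, Jw = 4, 1000, 20  # working areas, blocks/area, journal pages/area
J = w * Jw               # total journal pages, J = w * Jw

Bfw = (1 - q) * Bw                         # from (16)
missing = (q / p) * Bw - Jw                # Mfw + Mpw, implied by (16)
Mfw, Mpw = r * missing, (1 - r) * missing  # from (17)

t5 = (Bfw + Mpw) * Ts + (J + Mfw) * (Td + Ts)                  # model (5)
t18 = (((1 - q) * Ts + (q / p) * Ts + (q * r / p) * Td) * Bw
       + (Ts + Td) * J - (Ts + r * Td) * Jw)                   # closed form (18)
assert abs(t5 - t18) < 1e-6
```

Both expressions agree, and (18) no longer references Bt at all, which is the substance of Theorem 3: only the fixed-size quantities Bw, J, and Jw enter the recovery time.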

Fig. 5. The execution time for crash recovery using YAFFS2, LBM, JFFS3, and O1FS on file system sizes (a), number of files (b), and size of missing data (c). Each panel plots crash recovery time (sec); the x-axes are file system size (MB), number of files, and missing data size (MB), respectively.


5. Evaluation

We implemented O1FS in Linux 2.6.29. Table 2 describes the environment used for the implementation and the evaluations. Once our 1 GB flash memory had stored a boot loader, a kernel image, and a random-access memory (RAM) disk, we configured the size of one working area as 34 MB by equally dividing the remaining space.

We evaluated O1FS and other file systems based on three characteristics: (1) the time complexity of crash recovery, which we analyzed in Section 4; (2) a comparison of the mount and unmount times; and (3) the effect on file system performance and wear leveling. Using these evaluations, we showed that O1FS completes initialization very efficiently without significant performance degradation. Memory usage and journaling cost are important issues, but we did not evaluate them because Wu et al. have already examined these issues (Wu et al., 2006).

5.1. Comparison of crash recovery times

We ascertained how flash memory capacity affects the crash recovery techniques of YAFFS2, LBM, JFFS3, and O1FS. As described in Table 1, crash recovery is proportional to the number of free blocks, the number of files, and the size of the missing data. We created various states in the file systems to control each factor independently and measured the crash recovery times for each of the file systems. Note that the result for JFFS3 is the sum of the crash recovery times of the UBI-layer and the FS-layer.

In the first experiment, we varied the file system size from 100 MB to 900 MB. We created the target file system in partitions with these sizes and one empty file, before crashing the file system. As shown in Fig. 5a, the crash recovery times of YAFFS2, LBM, and JFFS3 increased with file system size because they have to access all blocks to determine whether a block is empty. With YAFFS2 and LBM, scanning free blocks only requires accessing the spare areas, and thus their crash recovery times were less than 1 s each. However, if each GB of free space requires 1 s for crash recovery, this becomes a serious problem for systems using hundreds of GBs. With JFFS3, the crash recovery of UBI-FS always required approximately 0.4 s, while the UBI-layer required most of the total time. The UBI-layer checks the integrity of the block mapping information in all flash blocks, due to which the crash recovery time increases from 1 s to 4.9 s depending on partition size.

In the second experiment, we varied the number of files from 1000 to 5000 in a target file system with a 1 GB partition. We fixed the total size used to 600 MB in all cases. Having created the files, we crashed the file system. As shown in Fig. 5b, the crash recovery time of YAFFS2 increased with the number of files. Most of its



Table 3. Mount and unmount times. The unit is milliseconds; "M" means mount and "UM" means unmount.

Initialization type | Normal initialization | Normal initialization | Crash recovery
Used space/# of files | 0 MB/0 files | 388.6 MB/6656 files | 388.6 MB/6656 files
Execution time of | M / UM | M / UM | M / UM
JFFS2 | 3618 / 17 | 54,966 / 44 | 55,095 / 117
YAFFS2 | 398 / 192 | 31,101 / 196 | 31,239 / 189
Snapshot | 54 / 489 | 681 / 485 | 30,354 / 488
JFFS3 | 5399 / 28 | 5435 / 861 | 12,840 / 62
LBM | 31 / 17 | 843 / 18 | 2555 / 19
O1FS | 31 / 19 | 80 / 42 | 2538 / 20

execution time, roughly 18 s, was spent scanning the data pages. Each file required a one-page read operation to access its I-node, which required 187 ns on average. Therefore, crash recovery time using YAFFS2 increased in proportion to the number of files. With JFFS3, the time increased with the number of files until the latter reached 4000. With 5000 files, the crash recovery time decreased to about 8 s from 20 s with 4000 files. The reason for the decrease was that UBI-FS had committed its journals a short time before the crash. UBI-FS with 5000 files required 2.83 s for crash recovery, but the overall crash recovery time was 7.81 s due to the crash recovery of the UBI-layer. In each case, the crash recovery time for the UBI-layer was approximately 5 s because it is dependent only on flash memory capacity. The results for LBM and O1FS were irregular because their performance was not linked to the number of files, as described in Table 1. The results for LBM seem to be inversely proportional to the number of files, but this was just a coincidence caused by a decrease in missing data size.

Finally, we varied the missing data size to compare O1FS and the other methods. We created the file system in a 1 GB partition, used files of various sizes, ranging from 100 MB to 900 MB, and crashed the file system. During the experiment, LBM did not flush any logs because the file creation process required the allocation of sequential pages, which eventually generated only one log. Therefore, all allocated pages became missing pages when a crash occurred. LBM yielded a performance similar to that of YAFFS2, as shown in Fig. 5c. LBM only scanned the spare areas of the missing pages, and thus the results were poor because every 1 MB of missing data required approximately 32 ms. Thus, 1 GB of missing data would require roughly 30 s for crash recovery. The crash recovery time for JFFS3 also increased with missing data size, although it was shorter than that for YAFFS2 and LBM. In this experiment, the UBI-layer of JFFS3 still had a constant crash recovery time of approximately 5 s, which was the same as that in the previous experiment because the flash memory capacity was kept constant. By contrast, O1FS produced constant, short execution times in each situation. O1FS completed crash recovery in less than 1.5 s in all cases. The results are consistent with our analysis in Section 4.

5.2. Comparison of mount and unmount times

In this experiment, we measured the execution times of the mount and unmount operations using empty and aged file systems. This evaluation compared the initialization techniques of all file systems described in this paper. Unlike previous studies (Lim and Park, 2006; Wu et al., 2006; Yim et al., 2005), we used the workload described by Smith and Seltzer (Smith and Seltzer, 1997), which was designed to allow a more realistic evaluation of file system aging. Smith and Seltzer collected 10 months of network file system (NFS) operation data from a research group and used this as a tool to replay the collected operations on other file systems. We refer to this data set as the Smith-workload.


To age a file system, we replayed the Smith-workload on the target file system. After the execution of 60,000 operations, we crashed the system. Up to the crash point, the operations produced a total of 4.33 GB of write requests, and 6656 files remained, storing 388.6 MB of data. We also measured the mount and unmount times under the same conditions without a crash. Table 3 shows the results. JFFS2 required approximately 3.618 s for normal initialization with an empty file system. For an aged file system, normal initialization and crash recovery required approximately 55 s in both cases. This result was not surprising because the initialization technique of JFFS2 is the same whether the file system crashes or not.

Like JFFS2, YAFFS2 uses a uniform initialization technique, and thus also produced similar results during normal initialization and crash recovery with an aged file system, i.e., approximately 31 s. YAFFS2 determines whether a flash block is empty or not by reading the spare area of the very first page of the block. Thus, normal initialization with an empty file system required only 0.398 s.

The crash recovery time with an aged file system using Snapshot was 30.354 s. This is almost identical to the result of YAFFS2, although Snapshot completed its normal initialization in the shortest time, 0.681 s. Crash recovery was slow because Snapshot could not record a snapshot to the flash memory during the system crash. In this situation, Snapshot initialized the file system using the same technique as YAFFS2. This is a major drawback of Snapshot. Moreover, Snapshot required approximately 0.485 s to record the snapshot during unmounting.

JFFS3 required approximately 5.4 s for the normal initialization of both empty and aged file systems. The UBI-layer gave a constant result of approximately 5 s for the same flash capacity. The normal initialization of UBI-FS required less than 0.45 s because of the successful committing of journals, which required 0.86 s during unmounting. During crash recovery with an aged file system, UBI-FS required 7.87 s while the UBI-layer took 4.97 s, which was the same as during its normal initialization. The longer crash recovery time of UBI-FS was caused by searching for and recovering incorrectly committed journals. The crash recovery time of UBI-FS was longer than its normal initialization, but the overall result of JFFS3 was better than those of JFFS2 and of YAFFS2 with Snapshot.

LBM and O1FS performed much better than YAFFS2 and JFFS2 in all cases. The normal initialization of the empty file system with each of LBM and O1FS required 0.031 s because they only needed to access the super block at a fixed location. With aged file systems, LBM took approximately 0.843 s. It was thus slower than Snapshot during normal initialization because it had to search for a valid log segment directory. O1FS was faster than LBM because the valid JBT was stored at the front of the super block. Both file systems also completed unmounting much more quickly than Snapshot for empty and aged file systems.

During crash recovery, LBM and O1FS required 2.555 s and 2.538 s, respectively. O1FS was slightly faster than LBM because it accessed only the blocks in the most recent working area when searching for missing data, whereas LBM searched the entire flash


Table 4. Comparison of file system performance. The unit is KB/s. The number next to "O1FS" refers to the size of the working area. The percentage in parentheses presents the difference ratio compared to YAFFS2; "+" means that the file system outperforms YAFFS2. "nPageWrites" refers to the total number of pages written during the benchmark execution. "Diff. Writes" is the difference in "nPageWrites" compared to YAFFS2.

Throughput of | O1FS-16 | O1FS-34 | O1FS-128 | LBM | YAFFS2 | JFFS3
Seq. Write | 1568 (+3%) | 1546 (+1%) | 1476 (−3%) | 1573 (+3%) | 1528 | 15,030
Over-write | 1060 (+4%) | 1046 (+2%) | 1011 (−1%) | 1058 (+4%) | 1021 | 5691
Random Read | 1673 (+2%) | 1670 (+1%) | 1671 (+1%) | 1699 (+3%) | 1647 | 1380
Random Write | 858 (+2%) | 853 (+1%) | 825 (−2%) | 849 (+1%) | 843 | 1961
nPageWrites | 197,407 | 197,341 | 197,303 | 197,682 | 197,136 | 6670
Diff. Writes | +271 (0.14%) | +205 (0.10%) | +167 (0.08%) | +546 (0.28%) | – | –

memory. This difference was much larger than the value shown in Fig. 5a because LBM accessed the very first spare area in each empty block, as well as the spare areas of pages containing obsolete data. When the size of the journal cache was larger than one page, the difference was greater because of the increased loss of data.

We also report the overheads incurred by the journals. In our experiments, O1FS wrote 2,726,183 pages, whereas YAFFS2 wrote 2,715,591 pages. This excluded page writes required for garbage collection in both cases. O1FS allocated 0.39% more pages than YAFFS2 to record 89,914 journals, which were produced through 60,000 file system operations. These journals were reduced to 25,347 after crash recovery because O1FS removed the journals for deleted files. If a crash did not occur, the reduction procedure was executed during unmounting.
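The 0.39% figure can be reproduced directly from the page counts reported above:

```python
# Worked check of the journal overhead reported in the text.
o1fs_pages = 2_726_183    # pages written by O1FS (excluding GC)
yaffs2_pages = 2_715_591  # pages written by YAFFS2 (excluding GC)

extra = o1fs_pages - yaffs2_pages  # pages spent recording journals
overhead = extra / yaffs2_pages

assert extra == 10_592
assert round(overhead * 100, 2) == 0.39  # about 0.39% more page writes
```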

5.3. File system performance

O1FS requires synchronous operations to record journal pages and JBT pages when changing the working area. To evaluate the precise performance overheads of these synchronous operations, we executed the IOzone benchmark (Norcott, 2005) using YAFFS2, LBM, JFFS3, and three O1FS settings with 16 MB, 34 MB, and 128 MB working area sizes. The IOzone benchmark is a micro-benchmark suite used to evaluate file system performance. We measured the throughput of the sequential write, rewrite, random read, and random write operations. The target file size was 128 MB and the record size was 2 KB. The total data size written was 384 MB, which was large enough to change the working area several times. Once the operations were executed, we recorded the number of changes in the working area. We also collected the number of page writes in /proc/yaffs during the experiments.

As shown in Table 4, the difference between the throughput of O1FS and that of YAFFS2 was less than 4% in all cases. The number of page writes was larger for all working area settings. However, the effect on overall performance was only approximately 0.14% of the total page writes. Because a change of working area only requires a single update of the journal block in the JBT, it is carried out with a single-page write. For example, O1FS-128 requires one 2 KB page write for almost 128 MB of data written to a working area when the area is changed. The other additional page writes are used for journal pages. Because the working area is much larger than a single page with practical configurations, such as 16, 34, and 128 MB, the runtime overhead of O1FS is negligible.

LBM showed performance similar to that of O1FS and YAFFS2. JFFS3 recorded the highest write performance because it uses write-back buffering for write operations. Due to the 128 MB main memory, almost all of the written data is buffered in the main memory. The file is deleted after each experiment so that the data pages of the file are not flushed. In spite of its high write performance, the read performance was slower than that of the other file systems. The random read performance of JFFS3 is about 84% of that of YAFFS2. The FS-layer and the UBI-layer processed the I/O operations of JFFS3, while the other file systems required only one layer. The transmission of I/O requests between the two layers and the address translation in the UBI-layer constituted the main overhead of synchronous I/O processing.

Our experiments showed that the performance degradation caused by the introduction of the working area was negligible. We also showed that the trade-off between working area size and file system performance is an insignificant issue.

5.4. Wear leveling effect

We consider in this subsection how the logical boundaries for block allocation affect wear leveling. To evaluate the precise effect of O1FS on wear leveling, we executed 200,000 operations of the Smith-workload using YAFFS2 and O1FS. The workload involved 102,563 write operations and wrote 12.99 GB of data.

The erase counts of all flash blocks are shown in Fig. 6, which shows that O1FS differed only slightly from YAFFS2 and LBM in terms of erase counts. In the figures, we specify the total number, the average, and the standard deviation of erase operations for each file system.

YAFFS2, LBM, and O1FS show similar results, while JFFS3 shows very low total erase counts and a uniform distribution of erase operations across all flash blocks. The write buffering of JFFS3 leads to this noticeable result. JFFS3 uses the free space of the 128 MB system memory as a write buffer. Thus, most I/O operations are processed in the main memory. Write buffering is an arguable feature for embedded devices because the power supply of such devices can be easily removed, due to which user data in the write buffer will be lost. Because YAFFS2, LBM, and O1FS do not adopt write buffering, they show more erase operations than JFFS3 for the same workload. However, we believe that user data is preserved more securely in them.

In summary, the results showed that O1FS has no negative effect on wear leveling in NAND flash memory compared to YAFFS2 and LBM.

6. Discussion

The concept and design of O1FS are not only effective for flash file systems, but can also be adopted by the flash translation layer (FTL). The flash file system and the FTL are representative storage architectures for flash memory (Gal and Toledo, 2005; Deng and Zhou, 2011). The flash file system is a new file system designed for flash memory, whereas the FTL is a new storage layer that supports compatibility with ordinary file systems designed for hard disk drives (HDDs). The main design goal of the FTL is to map the physical addresses of the flash memory to logical addresses for the storage system above it, such as a file system or the block management layer of a database system (Gal and Toledo, 2005; Chung et al., 2009; Deng and Zhou, 2011). To preserve the mapping information held in main memory, the FTL stores a mapping table in flash memory; it also covers some fastidious characteristics of flash memory, such as wear leveling.

Fig. 6. Comparison of wear leveling between YAFFS2 (a), LBM (b), JFFS3 (c), and O1FS (d). Each panel plots the erase count of each block against the block address.
(a) YAFFS2. Total: 65111, Average: 16.16, Std. Deviation: 5.91
(b) Log-based Method (LBM). Total: 68021, Average: 16.89, Std. Deviation: 6.27
(c) JFFS3 with write buffering. Total: 2723, Average: 0.68, Std. Deviation: 0.44
(d) O1FS. Total: 64434, Average: 16.00, Std. Deviation: 5.90

The FTL requires a crash recovery technique because the mapping table can be corrupted by a crash, and its integrity should be checked and recovered at initialization (Chung et al., 2009; Deng and Zhou, 2011). As with the flash file system, crash recovery in the FTL requires a full scan of flash memory to gather metadata that are not stored in the most recent mapping table. Although the FTL is likewise affected by a crash recovery time that increases with flash memory capacity, most research on the FTL has focused on performance and wear leveling rather than efficient crash recovery. Some research has been conducted on crash recovery in the FTL, but none of it argues that crash recovery should be done in O(1) time. In 2008, Chung et al. proposed a POwer-off Recovery sChEme (PORCE) technique (Chung et al., 2008). However, while their goal is a crash recovery technique with no degradation in runtime performance, they do not perform any relevant experiments. In 2012, C.H. Wu and H.H. Lin proposed a timing analysis of the system initialization of FTL schemes (Wu and Lin, 2012). They proved that the crash recovery times of traditional FTL schemes increase with flash memory capacity.

We believe that the key idea of O1FS can be adopted in the FTL without major modifications. The key idea of O1FS is to divide the entire flash memory into logical groups of blocks, and to modify blocks in only one group until the working group changes to another, with a definite record of the change. Although many FTL algorithms have their own methods of storing metadata, we can apply the division and change of logical groups without needing to take the FTL algorithm into account. The idea only restricts the available range of blocks, and an FTL algorithm can work in its own manner within those blocks.
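The group-based restriction described above can be sketched as follows. This is our own illustration of the idea, not O1FS's actual on-flash layout; the class name, the in-memory change log, and the example sizes are all hypothetical.

```python
# Hypothetical sketch of O1FS's key idea applied to block allocation:
# flash blocks are partitioned into logical groups, all writes are confined
# to the current working group, and switching groups appends an explicit
# record that crash recovery can use to locate recently modified blocks.

class GroupedAllocator:
    def __init__(self, num_blocks, group_size):
        self.groups = [list(range(i, min(i + group_size, num_blocks)))
                       for i in range(0, num_blocks, group_size)]
        self.current = 0
        self.change_log = [0]                  # records of working-group switches

    def allocate(self):
        """Hand out a free block only from the current working group."""
        if not self.groups[self.current]:
            self.current = (self.current + 1) % len(self.groups)
            self.change_log.append(self.current)   # the "definite record"
        return self.groups[self.current].pop(0)

    def recovery_scan_target(self):
        """Crash recovery only needs the last recorded group, not every block."""
        return self.change_log[-1]

alloc = GroupedAllocator(num_blocks=16, group_size=4)
blocks = [alloc.allocate() for _ in range(5)]  # exhausts group 0, moves to group 1
print(blocks)                                  # → [0, 1, 2, 3, 4]
print(alloc.recovery_scan_target())            # → 1
```

Because the scan target after a crash is a single group of bounded size, the recovery cost stays constant regardless of total flash capacity, which mirrors the O(1) argument made for O1FS itself.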

In spite of our expectation that the O1FS idea can easily be adopted in the FTL, more research is needed into the garbage collection mechanism and the performance of SSDs. Because O1FS restricts the possible target blocks for garbage collection victims, the efficiency of garbage collection can be reduced. For the FTL of an SSD, the division imposed by O1FS may hinder the maximization of parallelism across multiple flash chips and reduce the overall I/O performance of the SSD (Agrawal et al., 2008). As an example, fragmentation, which can be intensified by working areas, could seriously degrade read and write performance in an SSD (Chen et al., 2009). Therefore, more advanced methods are needed to solve these SSD-specific issues.

7. Conclusion and future work

We proposed O1FS in this paper as an efficient crash recovery method for flash file systems to address the problem of lengthy boot times in mobile devices that use NAND flash memory. We proved that the time complexity of the crash recovery of O1FS is O(1), irrespective of flash memory capacity. The state-of-the-art log-based method (LBM) delivers initialization with an O(1) time complexity, but is ineffective for crash recovery when the system has not been shut down properly. O1FS has an O(1) time complexity during both normal initialization and crash recovery.

In various experiments using 1 GB of flash memory, O1FS completed crash recovery within 1.5 s in all cases, whereas JFFS3 (UBIFS) and LBM required up to 19.36 s and 26.32 s, respectively. This gap in recovery time will widen as flash memory capacity increases. We also showed that O1FS has negligible effects on file system performance and no negative effects on wear leveling. Therefore, if our technique is adopted by mobile computing devices, users will no longer experience long initialization times during either normal initialization or crash recovery.

In future research, we will focus on journal management. We recognize that the efficient merging and flushing of journals are important issues for embedded systems because they relate directly to main memory usage. Furthermore, we plan to develop efficient garbage collection methods for journals, because invalid journals delay crash recovery.

Acknowledgments

The authors would like to thank the anonymous reviewers of the Journal of Systems and Software for their valuable comments and suggestions to improve the quality of this paper. Kiman Kim and Seungyup Kang contributed implementations and experiments. This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MEST) (No. 2010-0029180) with KREONET.

References

Agrawal, N., Prabhakaran, V., Wobber, T., Davis, J.D., Manasse, M.S., Panigrahy, R., 2008. Design tradeoffs for SSD performance. In: USENIX Annual Technical Conference, pp. 57–70.

Aleph One Ltd., 2002. YAFFS: a NAND-flash filesystem. http://www.aleph1.co.uk/yaffs

Bird, T.R., 2004. Methods to improve bootup time in Linux. In: Linux Symposium, vol. 1, pp. 79–88.

Chen, F., Koufaty, D.A., Zhang, X., 2009. Understanding intrinsic characteristics and system implications of flash memory based solid state drives. In: ACM SIGMETRICS Performance Evaluation Review, vol. 37, ACM, pp. 181–192.

Chung, T.-S., Lee, M., Ryu, Y., Lee, K., 2008. PORCE: an efficient power off recovery scheme for flash memory. J. Syst. Architect. 54 (10), 935–943.

Chung, T.-S., Park, D.-J., Park, S., Lee, D.-H., Lee, S.-W., Song, H.-J., 2009. A survey of flash translation layer. J. Syst. Architect. 55 (5), 332–343.

Chung, T.-S., Park, H.-S., 2007. STAFF: a flash driver algorithm minimizing block erasures. J. Syst. Architect. 53 (12), 889–901.

Deng, Y., Lu, L., Zou, Q., Huang, S., Zhou, J., 2014. Modeling the aging process of flash storage by leveraging semantic I/O. Future Gener. Comp. Syst. 32, 338–344.

Deng, Y., Zhou, J., 2011. Architectures and optimization methods of flash memory based storage systems. J. Syst. Architect. 57 (2), 214–227.

Douglis, F., Caceres, R., Kaashoek, F., Li, K., Marsh, B., Tauber, J.A., 1994. Storage alternatives for mobile computers. In: Proceedings of the 1st USENIX Conference on Operating Systems Design and Implementation, USENIX Association, p. 3.

Gal, E., Toledo, S., 2005. Algorithms and data structures for flash memories. ACM Comput. Surv. (CSUR) 37 (2), 138–163.

Hunter, A., 2008. A Brief Introduction to the Design of UBIFS. http://www.linux-mtd.infradead.org/doc/ubifs_whitepaper.pdf

Johnson, M.K., 2010. Red Hat's New Journaling File System: ext3. Red Hat Inc. http://www.redhat.com/resourcelibrary/whitepapers/ext3/639ea22d7f000001055ebf86b6df975d

Jung, D., Kim, J., Kim, J.-S., Lee, J., 2008. ScaleFFS: a scalable log-structured flash file system for mobile multimedia systems. ACM Trans. Multim. Comput. Commun. Appl. 5 (1), 9.

Jung, H., Shim, H., Park, S., Kang, S., Cha, J., 2008b. LRU-WSR: integration of LRU and writes sequence reordering for flash memory. IEEE Trans. Consum. Electron. 54 (3), 1215–1223.

Kawaguchi, A., Nishioka, S., Motoda, H., 1995. A flash-memory based file system. In: Proceedings of the Winter 1995 USENIX Technical Conference, pp. 155–164.

Kwon, O., Koh, K., Lee, J., Bahn, H., 2011. FeGC: an efficient garbage collection scheme for flash memory based storage systems. J. Syst. Softw. 84 (9), 1507–1523.

Lim, S.-H., Park, K.-H., 2006. An efficient NAND flash file system for flash memory storage. IEEE Trans. Comput. 55 (7), 906–912.

Micron, 2013. NAND flash products of Micron. http://www.micron.com/products/nand-flash

Norcott, O.W.D., 2005. Iozone file system benchmark white paper.

Park, S.O., Kim, S.J., 2013. ENFFiS: an enhanced NAND flash memory file system for mobile embedded multimedia system. ACM Trans. Embed. Comp. Syst. 12 (2), 23.

Rosenblum, M., Ousterhout, J.K., 1992. The design and implementation of a log-structured file system. ACM Trans. Comp. Syst. 10 (1), 26–52.

Smith, K.A., Seltzer, M.I., 1997. File system aging – increasing the relevance of file system benchmarks. In: ACM SIGMETRICS Performance Evaluation Review, vol. 25, ACM, pp. 203–213.

Wang, W.-n., Chen, F.-s., Wang, Z.-q., 2012. An endurance solution for solid state drives with cache. J. Syst. Softw. 85 (11), 2553–2558.

Woodhouse, D., 2001. JFFS: the journalling flash file system. In: Ottawa Linux Symposium, vol. 2001.

Wu, C.-H., Kuo, T.-W., Chang, L.-P., 2006. The design of efficient initialization and crash recovery for log-based file systems over flash memory. ACM Trans. Storage 2 (4), 449–467.

Wu, C.-H., Lin, H.-H., 2012. Timing analysis of system initialization and crash recovery for a segment-based flash translation layer. ACM Trans. Des. Autom. Electron. Syst. 17 (2), 14.

Yim, K.S., Kim, J., Koh, K., 2005. A fast start-up technique for flash memory based computing systems. In: Proceedings of the 2005 ACM Symposium on Applied Computing, ACM, pp. 843–849.

Hyunchan Park received the BS degree and the MS/PhD in computer science from Korea University, Korea, in 2004 and 2014, respectively. He is now a post-doctoral researcher in the Graduate School of Convergence IT, Korea University. His current research interests include embedded systems, virtualization, storage performance, and operating systems.

Sam H. Noh received the BS degree in computer engineering from Seoul National University, Korea, in 1986, and the PhD degree from the University of Maryland at College Park in 1993. He held a visiting faculty position at George Washington University from 1993 to 1994 before joining Hongik University in Seoul, Korea, where he is now a professor in the School of Information and Computer Engineering. His current research interests include I/O issues in operating systems, parallel and distributed systems, and real-time systems.

Chuck Yoo received BS and MS degrees in electronic engineering from Seoul National University, Seoul, Korea, and an MS degree and PhD in computer science from the University of Michigan. He worked as a researcher in Sun Microsystems Laboratory from 1990 to 1995. He has been a professor in the College of Information and Communications, Korea University, Seoul, Korea, since 1995. His research interests include operating systems, embedded systems, virtualization, and high performance networks.