zFS File System Project

Improving Performance of a Distributed File System Using OSDs and Cooperative Cache

Submitted By: Parvez Gupta

Varenya Agrawal

Introduction

This work describes the cooperative cache algorithm used in zFS and evaluates the effectiveness of both the algorithm and zFS as a file system

This is done by comparing the performance of zFS to that of NFS using the IOZONE benchmark

Results show that:

• zFS performs better than NFS when cooperative cache is activated

• Using pre-fetching in zFS also increases performance significantly

zFS

It is a distributed file system that uses Object Store Devices (OSD) and a set of cooperating machines

The objectives of the zFS design are:

• Achieve a scalable file system

• Build it from off-the-shelf components

• Make use of the memory of all participating machines

• Increase performance linearly with each added machine

• Separate storage management from file management

The Architecture

zFS has six components:

• Front End (FE)

• Cooperative Cache (Cache)

• File Manager (FMGR)

• Lease Manager (LMGR)

• Transaction Server (TSVR)

• Object Store (ObS)

The Components

Object Store

• It is the storage device on which files and directories are created and from where they are retrieved

• It handles the physical disk chores of block allocation and mapping

• ObS API enables creation and deletion of objects (files)

Front End

• Runs on every workstation on which a client wants to use zFS

• Provides access to zFS files and directories

Lease Manager

• Leases are used to maintain data integrity in zFS

• They have an expiration period that is set in advance

• Each ObS has one lease manager which acquires the major lease

• It grants exclusive leases on objects residing on the ObS
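The lease behaviour described above can be illustrated with a minimal sketch. It assumes a single lease manager per ObS and a fixed expiration period; the class names, field names, and the `acquire` method are hypothetical and not taken from zFS itself.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    """Hypothetical lease record; field names are illustrative only."""
    object_id: str
    holder: str
    expires_at: float  # absolute time in seconds

    def is_valid(self, now: float) -> bool:
        return now < self.expires_at


class LeaseManager:
    """Grants at most one unexpired exclusive lease per object on one ObS."""

    def __init__(self, period: float):
        self.period = period  # expiration period, set in advance
        self.leases: Dict[str, Lease] = {}

    def acquire(self, object_id: str, holder: str, now: float) -> Optional[Lease]:
        current = self.leases.get(object_id)
        if current is not None and current.is_valid(now) and current.holder != holder:
            return None  # another holder still has an unexpired lease
        lease = Lease(object_id, holder, now + self.period)
        self.leases[object_id] = lease
        return lease
```

Because every lease expires on its own, a holder that fails simply stops renewing, and the object becomes available again once the period has passed.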

File Manager

• Each zFS file is managed by a single file manager

• It obtains the exclusive lease from the lease manager

• It keeps track of each accomplished open() and read() request

Cooperative Cache

• Due to fast network connections, it takes less time to retrieve data from another machine's memory than from a local disk

Transaction Server

• Each directory operation is protected inside a transaction

• It helps maintain consistency of the file system

• It acquires all required leases and holds onto them for as long as it can

The Cooperative Cache

It is integrated with the Linux kernel cache because:

• The OS then does not need two separate caches with different policies that may interfere with each other

• This provides comparable local performance between zFS and other local file systems in Linux

As a result, the following is achieved:

• The kernel invokes page eviction when available memory is low

• Caching is done on a per-page basis, not on whole files

• Pages of zFS and of other file systems are treated equally

• Pages remain in the cache until memory pressure causes the kernel to discard them

• When eviction is invoked and a zFS page is the candidate, the decision is passed to a zFS routine

Cooperative cache algorithm

A page in the cooperative cache is either a singlet (present in the memory of only one node) or replicated (present in the memory of several nodes)

When a client on node A wants to open a file for reading:

• The local cache is checked for the page

• In case of a cache miss, zFS requests the page and its lease from the file manager

• The file manager checks whether the requested pages are already present in the memory of another machine in the network

• If not, zFS grants the leases to the client, which in turn reads the pages directly from the OSD and marks each page as a singlet

• If the requested pages reside in the memory of some other node B, the file manager sends a message to B asking it to send the pages and leases to A

• Both A and B mark the pages as replicated. Node B is called a third-party node
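The read path above can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: leases are omitted, the OSD is a plain dictionary, and the names `FileManager`, `Node`, `request`, and `read` are hypothetical rather than zFS interfaces.

```python
class FileManager:
    """Toy file manager: remembers which nodes hold each page (assumed layout)."""

    def __init__(self):
        self.holders = {}  # page id -> set of node names

    def request(self, page, requester, nodes, osd):
        holders = self.holders.setdefault(page, set())
        others = [n for n in sorted(holders) if n != requester]
        if others:
            # Page is cached on another node B: B sends it to the requester
            # and both copies become replicated.
            b = nodes[others[0]]
            data = b.cache[page][0]
            b.cache[page] = (data, "replicated")
            state = "replicated"
        else:
            # No node caches the page: the requester reads it from the OSD.
            data = osd[page]
            state = "singlet"
        holders.add(requester)
        return data, state


class Node:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # page id -> (data, state)

    def read(self, page, fmgr, nodes, osd):
        if page in self.cache:  # local cache hit
            return self.cache[page][0]
        data, state = fmgr.request(page, self.name, nodes, osd)
        self.cache[page] = (data, state)
        return data
```

In this sketch the first reader of a page ends up holding a singlet, and a second reader on a different node turns both copies into replicated pages without touching the OSD.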


When memory becomes scarce on a node A, the kernel invokes page eviction. For a zFS page there are two cases:

• If the page is replicated, it is discarded

• If the page is a singlet, it is forwarded to another node using the following steps:

1. A message is sent to the zFS file manager indicating that the page is being sent to another machine B, the node with the largest free memory known to A

2. The page is forwarded to B

3. The page is discarded from the page cache of A
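The eviction decision and the three forwarding steps can be sketched as follows. The data layout (`nodes` as a dictionary with `cache` and `free` entries), the `fmgr_log` list, and the function name `evict` are assumptions made for illustration. The sketch also assumes that a replicated page is simply dropped, which the slide does not spell out.

```python
def evict(page, node_a, nodes, fmgr_log):
    """Evict `page` from node_a, keeping the order of steps given in the text.

    nodes: dict name -> {"cache": {page: state}, "free": int}  (assumed layout)
    fmgr_log: list collecting the messages sent to the file manager
    Returns the name of the node the page was forwarded to, or None.
    """
    state = nodes[node_a]["cache"][page]
    if state == "replicated":
        # Assumption: another node still holds a copy, so just drop this one.
        del nodes[node_a]["cache"][page]
        return None
    # Singlet: choose B, the node with the most free memory known to A.
    others = [n for n in nodes if n != node_a]
    node_b = max(others, key=lambda n: nodes[n]["free"])
    fmgr_log.append(("moved", page, node_a, node_b))  # step 1: notify file manager
    nodes[node_b]["cache"][page] = "singlet"          # step 2: forward the page
    del nodes[node_a]["cache"][page]                  # step 3: discard locally
    return node_b
```

Notifying the file manager before the page leaves A means the file manager may briefly believe a page is on B before it arrives, but a page never sits on a node that the file manager does not know about.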


Effects of Node Failure and Network Delays

Node Failure:

• It is acceptable for the file manager to assume that pages exist on nodes even when they do not

• It is unacceptable for pages to exist on nodes without the file manager being aware of them

• Thus the order of the steps for forwarding a singlet page is important:

– Node failure before step 1: the file manager will eventually detect this and update its data

– Node failure after step 1: the file manager is informed that the page is on B although it is not true. This is the acceptable situation described in the first bullet

– Node failure after step 2: does not pose any problem


Network Delays:

Case 1:

• A replicated page residing on nodes M and N is discarded from M

– The zFS file manager sends a singlet message to N

– Due to network delay, this message reaches N only after memory pressure developed on N and N discarded the page, since the page was still marked replicated


Case 2 :

• A singlet message arrived at N before the page itself had arrived and was ignored. N then sent a reject message when asked to forward the page

• No problem if the page never arrives

• However, if the page arrives after the reject message is sent, it causes inconsistency


Case 3 :


Case 4 :

• The page was moved from N to M to O, where its recirculation count exceeded its limit

• O sends a release_lease message, which arrives before the move notification

Choosing proper third party node

• The zFS FMGR uses an enhanced round-robin method

• For each page range granted to node N, the FMGR records the time t(N)

• For every request, the FMGR scans all nodes holding the page range

• For each scanned node Ni, the FMGR checks whether currentTime - t(Ni) > C. This checks whether enough time has passed for the pages granted to Ni to reach it

• If true, Ni is marked as a potential provider and the next node is checked

• Among the marked nodes, the node with the largest range, Nmax, is chosen

• For the next request, the FMGR starts the scan from node Nmax+1
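The selection rule above can be sketched in a few lines. The tuple layout of `holders`, the function name `choose_provider`, and the tie-breaking rule (the first qualifying node in scan order wins) are assumptions for illustration; the slide gives only the time check, the largest-range rule, and the rotating start position.

```python
def choose_provider(holders, now, c, start=0):
    """Pick a third-party node among `holders`.

    holders: list of (node_name, grant_time, range_len) in a fixed order.
    c: threshold C from the text (time for granted pages to reach a node).
    Returns (chosen_name_or_None, index_to_start_the_next_scan_from).
    """
    n = len(holders)
    best = None  # index of the best marked node so far
    for step in range(n):
        i = (start + step) % n
        _name, granted_at, range_len = holders[i]
        if now - granted_at > c:  # pages have had time to arrive at this node
            if best is None or range_len > holders[best][2]:
                best = i
    if best is None:
        return None, start
    return holders[best][0], (best + 1) % n  # next scan starts after Nmax
```

A short usage example: with `holders = [("N1", 0.0, 4), ("N2", 0.0, 8), ("N3", 9.5, 16)]`, a request at `now=10.0` with `c=1.0` skips N3, whose grant is too recent, and picks N2. The next scan then starts at N3.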

Pre-fetching data in zFS

• The overhead of transmitting a data block over a network is composed of two parts:

– The network setup overhead

– The transmission time of the data block

• It is more efficient to transmit k pages in one message than to transmit each page in a separate message

– The researchers measured the time it takes to transmit a file of N pages in chunks of 1...k pages per message

– Best results were achieved for k=4 and k=8

– Similar performance was achieved by the zFS pre-fetching mechanism
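The two-part overhead can be made concrete with a simplified cost model. It assumes that each message pays one fixed setup cost and that transmission time grows linearly with the number of pages. The function name and the parameter values in the usage note are hypothetical, not measurements from the study.

```python
import math


def transfer_time(n_pages, k, setup, per_page):
    """Time to send n_pages in messages carrying up to k pages each.

    Simplified model: every message pays one fixed setup overhead, and
    every page adds a fixed transmission time.
    """
    messages = math.ceil(n_pages / k)
    return messages * setup + n_pages * per_page
```

With `setup=1.0` and `per_page=0.5`, sending 16 pages costs 24.0 time units for k=1, 12.0 for k=4, and 10.0 for k=8. This model only shows how batching amortizes the setup overhead; since it improves monotonically with k, it does not by itself explain why k=4 and k=8 gave the best measured results.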

zFS Testing environment

The Server PC ran an OSD simulator

Another PC ran the Lease Manager, File Manager and Transaction Server

Four PCs ran the zFS front-end

NFS Testing environment

The Server PC ran an NFS server with eight NFS daemons (nfsd)

Four PCs ran the NFS clients

Methodology Used

• IOZONE benchmark tool was used to compare zFS’ performance to that of NFS

• NFS does not carry out pre-fetching. To compensate, IOZONE was configured to read the NFS-mounted file using record sizes of n=1,4,8,16 pages

• zFS-mounted files were read with a record size of one page but with pre-fetching parameter R=1,4,8,16 pages

Comparing zFS and NFS

Two scenarios were investigated during testing :

• The file is smaller than the server’s cache, so all the data resides in the server’s cache

• The file is much larger than the server’s cache

Results for scenario I

Results for scenario II

Observations

• The performance of NFS was almost the same for different block sizes

• However, NFS performance was almost four times better when the file fit entirely in memory

• The performance of zFS with cooperative cache was much better than that of NFS

• When cooperative cache was deactivated, different behaviors were observed for different page ranges

Observations

• The performance of zFS for R=1 is lower than that of NFS

• For larger ranges, the performance of zFS was slightly better than that of NFS due to pre-fetching

• When cooperative cache is used, zFS performance is significantly better than NFS

• Performance with cooperative cache is lower in the second scenario because of memory pressure and because discarded pages generate reject messages

Conclusion

• The results show that using the caches of all the clients as one cooperative cache gives better performance than both NFS and zFS without the cooperative cache

• The results also show that using pre-fetching with ranges of four and eight pages results in much better performance
