Performance of Large Journaling File Systems

Contents

Performance of Large Journaling File Systems
  Objectives for the journaling file systems performance tests
  Summary for the journaling file systems performance tests
  Hardware equipment and software environment for the journaling file systems performance tests
    Environment
  Workload description for the journaling file systems performance tests
    IOzone
    Format utility
    Filling up the file systems
    dbench
  System setup for the journaling file systems performance tests
    Linux disk I/O features
    Logical volumes
    Multipath
    File systems
    Storage server disk usage
    Instruction sequence to prepare the measurement
    Understanding the CPU utilization
  Results for the journaling file systems performance tests
    Finding the best logical volume configuration
    Run times for the format utility
    Run times to fill up the file system to 40% utilization
    Measurements on a 40% filled file system
    Measurements on an empty file system
  Other sources of information for the journaling file systems performance tests
  Notices for the journaling file systems performance tests


Performance of Large Journaling File Systems

The paper investigates the performance of journaling file systems on Linux on IBM System z using a large file system size of 1.6 TB. The file systems tested are EXT3 with and without directory index, ReiserFS v3, XFS, and, as reference for a non-journaling file system, EXT2. All tests were done with FCP/SCSI disks. The paper first determines the best logical volume configuration and failover mode for the environment. The various journaling file systems are then tested using sequential I/O and a workload emulating a file server workload.

Published November 2007


Objectives for the journaling file systems performance tests

Linux® file system sizes have grown rapidly over the last few years, with sizes of several terabytes being common. This study was intended to show how current file system types, like EXT3, XFS, and ReiserFS v3, behave with a file system of 1.6 TB in a multiprocessor environment on Linux on IBM System z®.

This is especially true in the mainframe area for applications like database systems or large backup volumes for IBM® Tivoli® Storage Manager (TSM). Such large file systems imply a large amount of metadata to manage the file data. The effort of finding and navigating inside a file system of 1 terabyte is considerably higher than for a file system size of 10 GB.

Because all file systems are expected to perform equally on an empty file system, regardless of the file system size, the file system we used was filled up to 40% to emulate a file system in use.

The file system itself was created on a striped logical volume, including failover handling by using two paths to one disk. The best number of disks and the best stripe size were evaluated before we analyzed the file systems. We wanted to see the impact of:
v The number of disks
v The stripe size
v The multipath environment
v The various file systems

Summary for the journaling file systems performance tests

After performing performance tests on the large journaling file systems environment, we compiled a summary of our results and recommendations.

Our test results and recommendations are specific to our environment. Parameters useful in our environment might be useful in other environments, but are dependent on application usage and system configuration. You will need to determine what works best for your environment.

1

Page 6: Performance of Large Journaling File Systems - ibm.com filePerformance of Large Journaling File Systems The paper investigates the performance of journaling files systems on Linux

All tests were done with FCP/SCSI disks.

Our summary results regarding the correct configuration for the logical volumes follow:
v A single path connection can give the best throughput, but offers no fault tolerance. For high availability we recommend the failover multipath mode. Its throughput was close to the results of the single path connection, which were very good. We used the failover multipath mode for the rest of our measurements.
v We varied the number of physical disks of a logical volume from 4 disks up to 32 disks. The best throughput and lowest CPU utilization were seen with 8 and 16 disks.
v The best stripe size for the logical volume depends on the access pattern. We found a good compromise at 64 KB, which is also the default. Sequential and random write workloads were best at 128 KB and larger. Random read had its peak at 16 KB and sequential read worked well with everything between 16 KB and 128 KB.

Our summary results regarding our tests with the journaling file systems follow:
v The format times for the file systems were very short for all the file system types we tested. Creating a file system on a 1.6 TB volume takes between a few seconds and one and a half minutes. Even the longest time might be negligible.
v When filling the 1.6 TB file system with 619 GB of data, we found that the fastest journaling file system was EXT3. XFS needed a little bit longer. Both consumed about the same amount of CPU for this work.
v Two different disk I/O workloads were tried on the non-empty file system. First we used IOzone, a file I/O workload with separate phases of sequential/random and read/write access on dedicated files. Second we tried dbench, which generated a file server I/O mix on a larger set of files with various sizes.
  – The IOzone workload causes significant metadata changes only during the initial write of the files. This phase was not monitored during our tests. Overall, XFS was the best file system regarding highest throughput and lowest CPU utilization. XFS was not the best in all disciplines, but showed the best average, being even better than the non-journaling file system, EXT2.
  – dbench causes a lot of metadata changes when creating, modifying, or deleting large numbers of files and directories. This workload is closer to a customer-like workload than IOzone. In this test, XFS was the best journaling file system.
v When comparing the measurement results from the empty file system with those from the 40% filled file system, we saw no big difference with EXT2, EXT3, and ReiserFS. XFS degraded a bit more in the 40% filled case, but was still much faster than the other journaling file systems. EXT2 and EXT3 both showed equal results with empty and filled file systems. This was not expected because they distribute the metadata all over the volume.
v Using the index option with EXT3 did not lead to better results in our measurements.
v Overall we recommend XFS as the best file system for large multiprocessor systems because it is fast at low CPU utilization. EXT3 is the second-best choice.


Hardware equipment and software environment for the journaling file systems performance tests

To perform our large journaling file system performance tests, we created a customer-like environment. We configured the hardware, software, and storage server.

Server hardware

Host

18-way IBM System z9® Enterprise Class (z9® EC), model 2094-S18 with:
v 0.58 ns cycle time (1.7 GHz)
v 2 books with 8/10 CPUs
v 2 * 40 MB L2 cache
v 128 GB memory
v FICON® Express 2 cards

One LPAR was used for our measurements with:
v 8 shared CPUs
v 256 MB memory (2048 MB memory for the "Measurements on a 40% filled file system" and "Measurements on an empty file system" test cases)
v 8 FICON channels
v 8 FCP channels

Storage server setup

2107-922 (DS8300) with:
v 256 GB cache
v 8 GB NVS
v 256 * 73 GB disks (15,000 RPM), organized in units of 8 disks building one RAID5 array (called a rank)
v 8 FCP attachments
v 8 FICON attachments

For the operating system:
v 2 ECKD™ mod9 from one rank/LCU
v 8 FICON paths

For the file system disks:
v 32 SCSI disks, 100 GB each, spread over 16 ranks
v 8 FCP paths

Server software

Table 1. Server software used

Product                                   Version/Level
SUSE Linux Enterprise Server (64-bit)     SLES10 GA
                                          v 2.6.16.21-0.8-default (SUSE)
                                          v 2.6.16.21-0.8-normalstacksize (built ourselves)
dbench                                    2.1, compiled for 64-bit
IOzone                                    3.196, compiled for 64-bit

File systems
For this evaluation we used several popular Linux file systems. EXT2 is a good reference file system because it has no journaling and has been used over many years without major problems.

The file systems we used were:
1. EXT2 (1993) – the "old" Linux file system without any journaling.
2. EXT3 (1999) – the EXT2 file system with a journal. We used EXT3 with and without a dir_index.
3. ReiserFS v3 (2001) – the standard file system on SUSE Linux, which was developed by Namesys (Hans Reiser).
4. XFS (1994) – the IRIX file system, which was released in 2000 as open source. This file system was developed by SGI and is one of the oldest journaling file systems.

Environment
Our environment consisted of an IBM System z and a DS8300 storage server. They were connected over a switched fabric with eight 2 Gbps FCP links.

On System z we used one LPAR (configured as mentioned above) which had access to the disks on the storage server. Figure 1 shows the storage server on the right side. In this storage server we used up to 32 disks. Details on the storage server are described in "Storage server disk usage".

Workload description for the journaling file systems performance tests

We used IOzone and dbench as our benchmark tools for our large journaling file systems tests.

IOzone
IOzone is a file system benchmark tool. This tool reads and writes to a file.

Figure 1. Journaling test environment


We had sixteen instances running at the same time. IOzone can perform sequential and random operations to the file.

The first write operation was only used to allocate the space on the file system. For our measurements, only the rewrite and read results were used.

IOzone setup used for our measurements:
v Sequential I/O
  – Write, rewrite (cmdline option "-i 0"), and read (cmdline option "-i 1") of a 2000 MB (cmdline option "-s 2000m") file
  – 16 threads (cmdline option "-t 16") working on one file system
  – 64 KB record size (cmdline option "-r 64k")
v Random I/O
  – Write, random write, and random read (cmdline option "-i 2") of a 2000 MB file (cmdline option "-s 2000m")
  – 16 threads (cmdline option "-t 16") working on one file system
  – 64 KB record size (cmdline option "-r 64k")
  – The random I/O modes produce separate values for read and write throughput, but only one value for CPU utilization, because the reads and writes are mixed

Other command line options we used for our measurements follow:
v "-C" - Show bytes transferred by each child in throughput testing
v "-e" - Include flush (fsync, fflush) in the timing calculations
v "-R" - IOzone will generate an Excel-compatible report to standard out
v "-w" - Do not unlink temporary files when finished using them

For a detailed description of these and all possible parameters see the documentation located at http://www.iozone.org/.
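The exact invocations are not reproduced in this paper; combining the options listed above, the two runs might have looked roughly like the following sketch (the working directory, which must be on the file system under test, is an assumption):

  # sequential run: initial write, rewrite, and read of one 2000 MB file per thread
  cd /mnt/largefs
  iozone -t 16 -s 2000m -r 64k -i 0 -i 1 -C -e -R -w

  # random run: initial write, then mixed random read/write on the same files
  iozone -t 16 -s 2000m -r 64k -i 0 -i 2 -C -e -R -w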

Format utility
A format utility was used to format the file systems for our large journaling file systems performance tests.

We used the mkfs utilities provided with each file system. To compare the time it took to format the file systems we collected sysstat data (using sadc with a resolution of one second) and CPU times (using the time command) while the format utility was running.
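A minimal sketch of such a measurement, assuming the logical volume device /dev/largefs/lvol0 from the setup section and an arbitrary sysstat output file:

  # sample CPU utilization once per second in the background while formatting
  sar -o /tmp/mkfs.sa 1 120 >/dev/null 2>&1 &
  # measure elapsed and CPU time of the format run itself
  time mkfs.ext3 -q /dev/largefs/lvol0
  # afterwards, display the collected utilization data
  sar -u -f /tmp/mkfs.sa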

Filling up the file systems
To create a more customer-like environment, we filled the file system to 40%.

To fill the file system up to 40% we needed a large amount of data. We decided to use a Linux vanilla kernel 2.6.19 archive (240690 KB). This archive was an uncompressed tar, containing 1252 directories and 20936 files.


The directory structure produced had three levels. Each of the ten first level directories contained eleven second level directories. Each level 2 directory contained 21 subdirectories, each holding one extracted kernel.

Figure 2. Schemata of extracted kernel directories

The CPU times were collected by sysstat using sadc with a resolution of one second.

The tar archive was extracted 2310 times in total; the extractions into the 21 level three directories of one level two tree ran in parallel. Because the tar file is not compressed, this load consisted mostly of file system operations. The total used file system space was 619 GB.

The execution time and the CPU utilization were measured.
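The scripts used for the fill up are not included in this paper; the following bash sketch merely illustrates the directory layout and the parallel extraction described above (archive location and mount point are assumptions):

  #!/bin/bash
  # Fill the file system with 10 x 11 x 21 = 2310 extracted copies of an
  # uncompressed vanilla kernel 2.6.19 tar archive.
  TAR=/root/linux-2.6.19.tar      # assumed location of the uncompressed archive
  TARGET=/mnt/largefs             # assumed mount point of the 1.6 TB file system

  for l1 in $(seq 1 10); do       # ten first level directories
    for l2 in $(seq 1 11); do     # eleven second level directories each
      for l3 in $(seq 1 21); do   # 21 third level directories each
        dir=$TARGET/dir$l1/sub$l2/kernel$l3
        mkdir -p "$dir"
        tar -xf "$TAR" -C "$dir" &   # the 21 extractions of one level two tree run in parallel
      done
      wait                        # wait until all 21 parallel extractions have finished
    done
  done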

dbench
We used dbench as one of our benchmark tools in our large journaling file systems performance tests.

dbench (samba.org/ftp/tridge/dbench) is an emulation of the file system load function of the Netbench benchmark. It does all the same I/O calls that the smbd server daemon in Samba would produce when it is driven by a Netbench workload. However, it does not perform network calls. This benchmark is a good simulation of a real server setup (such as a Web server, proxy server, or mail server), in which a large number of files of different sizes in different directories have to be created, written, read, and deleted.

dbench takes only one parameter on the command line, which is the number of processes (clients) to start. Issuing the command "dbench 30", for example, creates 30 parallel processes of dbench. All processes are started at the same time and each of them runs the same workload. The workload for each dbench process is specified by a client.txt configuration file in the working (testing) directory. It consists of a mixture of file system operations executed by each dbench process. dbench runs with n parallel processes and delivers only one value as a result. The resulting value is the average throughput of the file system operations described in client.txt, measured in megabytes per second.

With a large amount of memory, dbench measurements can be used to detect the effects of memory scaling.

For the dbench workload used for our measurements the number of clients wasscaled as shown in the following sequence: 1, 4, 8, 12, 16, 20, 32, 40, 46, 50, 54, 62.
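A minimal sketch of such a scaling run, assuming the file system under test is mounted at /mnt/largefs and the client.txt file is present in that directory:

  #!/bin/bash
  # Run dbench once for each client count used in this paper.
  cd /mnt/largefs
  for clients in 1 4 8 12 16 20 32 40 46 50 54 62; do
    echo "=== dbench with $clients clients ==="
    dbench $clients
  done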

System setup for the journaling file systems performance tests

To emulate a customer-like environment, we configured Linux disk I/O features, logical volumes, the multipath mode, and our storage server disk usage.

We used a SLES10 installation on System z with developer tools.

Linux disk I/O features
To perform our large journaling file systems performance tests, we attached FCP disks, used the deadline I/O scheduler, and set the read ahead on our Linux data disks.

Attaching FCP disks

For attaching the FCP disks we followed the instructions described in Chapter 6, "SCSI-over-Fibre Channel device driver," section "Working with the zfcp device driver," in Linux on System z - Device Drivers, Features, and Commands, SC33-8289. This book can be found at:

http://www.ibm.com/developerworks/linux/linux390/

Choose your stream (for example, the October 2005 stream) and then click on the Documentation link.

I/O scheduler

For our measurements we used the deadline I/O scheduler. This was set as a kernel option at boot time (option "elevator=deadline" in the zipl.conf file).
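For illustration, the corresponding stanza in /etc/zipl.conf might look like the following sketch; the image, ramdisk, and root device entries are assumptions for a SLES10 installation, and only the elevator=deadline parameter is the setting discussed here. Remember to run zipl after changing the file so that the new parameter line is used at the next boot.

  [ipl]
      target = /boot/zipl
      image = /boot/image
      ramdisk = /boot/initrd
      parameters = "root=/dev/dasda1 elevator=deadline"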

After booting with this scheduler (verify with 'grep scheduler /var/log/boot.msg') we modified the following parameters in /sys/block/<sd[a-z][a-z]>/queue/iosched for each disk device concerned:
v front_merges to 0 (default 1)
  Avoids scanning of the scheduler queue for front merges, which rarely occur. Using this setting saves CPU utilization and time.
v write_expire to 500 (default 5000) and read_expire to 500 (default 500)
  We set write to the same expiration value as read to force the same timeout for read and write requests.
v writes_starved to 1 (default 2)
  Processes the next write request after each read request (by default, two read requests are done before the next write request). This handles reads and writes with the same priority.


These settings are mostly suitable for servers such as database and file servers. The intention of these settings is to handle writes in the same manner as reads.
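A small sketch of how these values can be applied follows; the device list is an assumption, and in our setup every SCSI disk belonging to the logical volume was adjusted:

  #!/bin/bash
  # Apply the deadline scheduler tuning described above to all SCSI disks.
  for q in /sys/block/sd*/queue/iosched; do
    echo 0   > $q/front_merges    # do not scan the queue for front merges
    echo 500 > $q/write_expire    # same expiration time for writes ...
    echo 500 > $q/read_expire     # ... as for reads
    echo 1   > $q/writes_starved  # one read request per write request
  done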

Read ahead

The read ahead on any Linux data disk and logical volume is set to zero to measure exactly the requested I/O.
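The commands for this, also shown in the preparation sequence later in this paper, are roughly (device names assumed):

  # physical disks (multipath device nodes under /dev/disk/by-id/)
  for d in /dev/disk/by-id/scsi-3*; do blockdev --setra 0 $d; done
  # logical volume
  lvchange -r 0 /dev/largefs/lvol0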

Logical volumes
The logical volume setup we used for our large journaling file systems performance tests is described below.

The logical volume was striped over several disks, depending on the test case. Do not use the devices /dev/sd* for your logical volume, because you would bypass the device mapper. Use the device nodes under /dev/disk/by-id/ instead.

Multipath
The multipath setup we used for our large journaling file systems performance tests is described below.

The multipath setup is quite simple, but you must check that the utilization of your fibre channel paths is balanced. The critical point here is the order of the device names. The multipath/device mapper layer uses the old device names such as sda, sdb, and so on. This means that the device sdaa comes before device sdb, which may cause an unbalanced load on the fibre channel paths.

To make sure the fibre channel path utilization is optimal, we used the following sequence for adding disks:
1. Connect the first path to the disks in ascending order over all eight fibre channel paths.
2. Connect the second path to the disks in descending order over all eight fibre channel paths.

If you use two paths to every disk, you should add the primary path to the first disk, the secondary path to the first disk, the primary path to the second disk, the secondary path to the second disk, and so on. With this you will get a balanced fibre channel usage.
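The following sketch only illustrates this attachment order for a handful of disks; the device bus IDs, WWPN, and LUNs are invented placeholders, and it assumes the SLES10-era zfcp sysfs interface (port_add and unit_add) described in the device driver book referenced above:

  #!/bin/bash
  # Attach each disk over two FCP channels: the first path walks the channel
  # list in ascending order, the second path in descending order.
  CHANNELS=(0.0.5400 0.0.5500 0.0.5600 0.0.5700 0.0.5800 0.0.5900 0.0.5a00 0.0.5b00)
  WWPN=0x5005076300c00000                                  # hypothetical storage server port
  LUNS=(0x4014401300000000 0x4016401300000000 0x4015401300000000 0x4017401300000000)

  attach() {   # attach one LUN ($2) through one FCP channel ($1)
    echo $WWPN > /sys/bus/ccw/drivers/zfcp/$1/port_add
    echo $2    > /sys/bus/ccw/drivers/zfcp/$1/$WWPN/unit_add
  }

  n=${#CHANNELS[@]}
  for i in ${!LUNS[@]}; do
    attach ${CHANNELS[$(( i % n ))]}         ${LUNS[$i]}   # primary path, ascending order
    attach ${CHANNELS[$(( n - 1 - i % n ))]} ${LUNS[$i]}   # secondary path, descending order
  done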

The two path setup is described in Table 2.

Table 2. FCP disk setup with 16 disks

Attached   Disk ID   Fibre channel   Device         Internal   Rank
order                path            adapter pair   server     (RAID array)
 1         0x1413    1               0-1            0           9
 2         0x1413    8               0-1            0           9
 3         0x1613    1               0-1            0          12
 4         0x1613    8               0-1            0          12
 5         0x1513    2               0-1            1          11
 6         0x1513    7               0-1            1          11
 7         0x1713    2               0-1            1          14
 8         0x1713    7               0-1            1          14
 9         0x1013    3               2-3            0           1
10         0x1013    6               2-3            0           1
11         0x1213    3               2-3            0           4
12         0x1213    6               2-3            0           4
13         0x1113    4               2-3            1           3
14         0x1113    5               2-3            1           3
15         0x1313    4               2-3            1           6
16         0x1313    5               2-3            1           6
17         0x1c13    5               4-5            0          24
18         0x1c13    4               4-5            0          24
19         0x1e13    5               4-5            0          29
20         0x1e13    4               4-5            0          29
21         0x1d13    6               4-5            1          26
22         0x1d13    3               4-5            1          26
23         0x1f13    6               4-5            1          31
24         0x1f13    3               4-5            1          31
25         0x1813    7               6-7            0          16
26         0x1813    2               6-7            0          16
27         0x1a13    7               6-7            0          21
28         0x1a13    2               6-7            0          21
29         0x1913    8               6-7            1          18
30         0x1913    1               6-7            1          18
31         0x1b13    8               6-7            1          23
32         0x1b13    1               6-7            1          23

For your setup you will use either "failover" or "multibus" as the policy for the multipath daemon. Be sure that you set this using "multipath -p <mypolicy>" after starting the multipath daemon.

File systems
The file systems setup we used for our large journaling file systems performance tests is detailed here.

The file systems were created with their default values using the following commands:
v mkfs.ext2 -q
v mkfs.ext3 -q
v mkfs.ext3 -q -O dir_index
v mkfs.reiserfs -q
v mkfs.xfs -q -f


The option "-q" means quiet execution. In most commands this shows no interactive output and no prompts. The option "-f" at mkfs.xfs tells mkfs that it is OK to overwrite an existing file system on the target device.
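As an example, the XFS variant applied to the striped logical volume created in "Instruction sequence to prepare the measurement" would be (device node and mount point are assumptions):

  mkfs.xfs -q -f /dev/largefs/lvol0
  mount /dev/largefs/lvol0 /mnt/largefs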

Storage server disk usage
The setup of the storage server disks we used for our large journaling file systems performance tests is described below.

To get good results you should choose the disks from your storage server with care. For example, if you have a DS8300, choose the disks from different ranks (RAID arrays). See the storage server manual (IBM System Storage DS8000 Series: Architecture and Implementation, SG24-6786) for details.

The figure below shows the disk layout used for our tests.

For performance reasons, we selected disks evenly over as many ranks as possible (which was 16) and over both servers within the storage server. The disks with the white background were controlled by the first internal server of the DS8300 and the other disks were controlled by the second internal server. The disks inside the storage server were organized into 16 ranks and connected via four device adapter pairs.

Figure 3. Storage server disk layout

Instruction sequence to prepare the measurement
The instruction sequence we used to prepare the measurement for each of our large journaling file systems performance tests is detailed below.

We performed the steps shown below to execute each run:
v Boot Linux
  SLES10 default zipl with one change: elevator=deadline
v Attach SCSI disks
v Start multipath daemon
  service multipathd start
v Configure multipath policy
  multipath -p multibus or multipath -p failover
v Prepare the disks for use with LVM (each disk has one big partition)
  /sbin/pvcreate -f -y /dev/disk/by-id/scsi-3...1013-part1
v Create volume group
  /sbin/vgcreate largefs /dev/disk/by-id/scsi-3...1413-part1 /dev/disk/by-id/scsi-3...1513-part1 ...
v Create logical volume
  /sbin/lvcreate -i 16 -I 64 -L 1.56T -n lvol0 largefs
v Set read_ahead for physical and logical volumes to 0
  lvchange -r 0 /dev/largefs/lvol0
  blockdev --setra 0 /dev/disk/by-id/scsi-3...1013 /dev/disk/by-id/scsi-3...1113 ...
v Set parameters for the I/O scheduler (see "I/O scheduler")
v Perform run (contains the formatting of the file system)

Understanding the CPU utilization
The five different types of CPU load used in the CPU utilization reports are described below.

The CPU utilization reports show five different types of CPU load. These types are defined in Table 3, which comes from the sar man page.

Table 3. CPU utilization types

%idle     Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
%iowait   Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
%system   Percentage of CPU utilization that occurred while executing at the system level (kernel).
%nice     Percentage of CPU utilization that occurred while executing at the user level with nice priority.
%user     Percentage of CPU utilization that occurred while executing at the user level (application).

Results for the journaling file systems performance tests

After performing our large journaling file systems performance tests, we charted our test results, interpreted the results, and created recommendations.

The tests we performed include:
v "Finding the best logical volume configuration"
v "Run times for the format utility"
v "Run times to fill up the file system to 40% utilization"
v "Measurements on a 40% filled file system"
v "Measurements on an empty file system"


Finding the best logical volume configuration
These tests were used to determine the best logical volume configuration for our tests.

To find the best logical volume configuration for our testing environment, we performed the following tests:
v Using different multipath modes
v Scaling disks in failover multipath mode
v Finding the best stripe size for the logical volume

Using different multipath modes

Our goal in this test case was to determine the best multipath mode.

To exploit fault tolerance we used the following multipath modes:
v failover (one active path and one standby path)
v multibus (two active paths)
v singlepath (no fault tolerance) - used for performance comparisons

We used the IOzone workload for these tests. The workload operates on a striped logical volume (LVM2) with 4, 8, 16, 24, and 32 physical disks. The file system was EXT2. The stripe size used was 64 KB (the default).

Conclusion from the multipath modes measurements

The failover multipath mode behaves similarly to the singlepath mode, except for random readers. The failover mode provides fault tolerance with the lowest impact on performance. The multibus multipath mode always had lower throughput. Based on these results we decided to use the failover mode for our remaining measurements.

Scaling disks in failover multipath mode

Our goal with this test case was to find the best number of physical volumes within our striped logical volume. In this test case we continued using the IOzone workload. We used the same setup that we used in "Using different multipath modes." The file system was EXT2.

We used the failover multipath mode (as determined in our previous tests) and five different physical volume configurations:
v 4 physical disks
v 8 physical disks
v 16 physical disks
v 24 physical disks
v 32 physical disks

The best configuration was expected to have the best throughput and the least amount of CPU utilization (user and system CPU together). The stripe size was 64 KB (the default).


Conclusion from the scaling disks in failover multipath mode measurements

The best throughput at low CPU utilization was seen with 8 and 16 disks. For our remaining measurements we used 16 disks.

Finding the best stripe size for the logical volume

To determine the best stripe size for our logical volume we used the same IOzone workload as in the two previous test cases. For this test we scaled the stripe size of the logical volume from 8 KB to 512 KB.

We wanted to determine which stripe size is the best for a logical volume of 8 disks and which stripe size is the best for a logical volume of 16 disks.

Conclusion from the stripe size measurements

We have two optimal stripe sizes: 32 KB is the optimal stripe size for random workloads and 128 KB is optimal for sequential workloads. We assumed that most real-life workloads are somewhere in between sequential and randomized. Therefore, we decided to use a 64 KB stripe size as a good overall solution. A database server with many small randomized reads would get better results with a smaller stripe size.
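A sketch of how this stripe size scan can be driven, assuming the volume group, mount point, and IOzone invocation described elsewhere in this paper:

  #!/bin/bash
  # Recreate the striped logical volume with different stripe sizes (in KB)
  # and run the IOzone workload on each variant.
  for ss in 8 16 32 64 128 256 512; do
    lvcreate -i 16 -I $ss -L 1.56T -n lvol0 largefs
    mkfs.ext2 -q /dev/largefs/lvol0
    mount /dev/largefs/lvol0 /mnt/largefs
    ( cd /mnt/largefs && iozone -t 16 -s 2000m -r 64k -i 0 -i 1 -C -e -R -w )
    umount /mnt/largefs
    lvremove -f /dev/largefs/lvol0
  done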

For all further measurements we used a striped (64 KB) logical volume (LVM2) with 16 physical disks and a Linux system with 2 GB of memory.

Run times for the format utility
These tests were performed to determine how long the formatting of different file systems takes.

Formatting a file system is a fast and easy task, but how long does the formatting of a 1.6 TB file system actually take? Usually, there is no time limitation the first time this task is performed. However, the format time needed in a disaster recovery case may be critical. This test case was used to find an average format time.

We used the UNIX® time command and sysstat to measure the format time.


Figure 4 and Figure 5 show the total run time to format the file system. Lower values are better. The bars in the charts are split to also show the CPU utilization. Colors that do not appear have a value of zero.

Note: CPU utilization types are explained in "Understanding the CPU utilization".

Figure 5 is the detailed view of the ReiserFS3 and XFS file systems. These file systems are much faster than the EXT2 and EXT3 file systems, so the overall chart does not show the details of the CPU utilization.

Figure 4. Run times for the format utility - file system format times

Note: CPU utilization types are explained in "Understanding the CPU utilization".

Observations

EXT2 and EXT3 write the metadata all over the file system on the disk, which takes a little extra time. However, both are still done in less than two minutes. All other file systems write only some information at the beginning, which takes between one and four seconds. There is almost no "nice" or "user" CPU consumption.

Conclusion

ReiserFS and XFS are very fast when formatting the disk. EXT2 and EXT3 write some data over the whole disk, which takes a little more time. However, the format times are also low enough that they might be negligible.

Figure 5. Run times for the format utility - file system format times, ReiserFS3 and XFS details



Run times to fill up the file system to 40% utilization
These tests were performed to discover how long it took to fill up a file system to 40% on the different file system types.

In this test case we measured the time to fill our 1.6 TB file system to 40% with the workload described in "Filling up the file systems". We measured the fill up times and CPU utilization. We changed the kernel's stack size to normal because, with XFS and the small stack size (the default), we hit a kernel stack overflow. The Linux kernel was therefore self-built with the normal stack size.

Note: CPU utilization types are explained in "Understanding the CPU utilization".

Figure 6. File system fill up times

Figure 6 shows the total fill up time in seconds. The bars in the chart are split to show the CPU utilization. Colors which do not appear have a value of zero.

Observations

We have the following observations from these test runs:
v EXT2 is the fastest file system, running at about 40 minutes. EXT3, with and without an index, took about one hour, which is close to the EXT2 time.
v XFS needs about one hour and 20 minutes and has no I/O wait.
v ReiserFS Version 3 needs about two hours and 50 minutes and has high system CPU utilization.
v As in "Run times for the format utility" we have no "nice" and only very little "user" CPU consumption.

Conclusion

The best journaling file system for the fill up is EXT3. XFS needs a little bit longer, but the CPU utilization is about the same.

Measurements on a 40% filled file system
These tests were performed to determine how a 40% filled file system on different file system types affected performance. We used the IOzone and dbench workloads to measure the behavior of the different file systems.

IOzone workload

The following charts show the results of the IOzone workload on the 40% utilized 1.6 TB file system. We varied the file systems in this test. Be aware that on the throughput charts the Y-axis does not start at zero!

Sequential I/O readers

Figure 7 and Figure 8 show the throughput and CPU utilization for the sequential read I/O.

Note: CPU utilization types are explained in "Understanding the CPU utilization".

Observations

Figure 7. Disk I/O measurements - sequential I/O readers throughput

Figure 8. Disk I/O measurements - sequential I/O readers CPU utilization


XFS has the best read performance and ReiserFS has the worst. The performance of EXT3, with and without an index, is very close to XFS and slightly better than EXT2. XFS also has the lowest CPU utilization, but EXT3 and EXT2 are very close.

Sequential I/O writers

Figure 9 and Figure 10 show the throughput and CPU utilization for the sequential write I/O.

Figure 9. IOzone workload sequential I/O writers throughput

Note: CPU utilization types are explained in "Understanding the CPU utilization".

Observations

ReiserFS has the best throughput. XFS is only 1% below ReiserFS and EXT3 is 4% lower than ReiserFS (regardless of whether an index is used). The lowest CPU utilization was with EXT2. Of the journaling file systems, XFS has the lowest CPU utilization, which is also very close to the EXT2 CPU utilization, but the throughput for XFS is higher.

Random I/O

Figure 11 and Figure 12 show the throughput and CPU utilization for the random write and read I/O.

Figure 10. IOzone workload sequential I/O writers CPU utilization

Note: CPU utilization types are explained in "Understanding the CPU utilization".

Observations

Figure 11. IOzone workload random I/O readers and writers throughput

Figure 12. IOzone workload random I/O readers and writers CPU utilization


EXT2 is the best file system for reading random data from the storage server, but it is also the worst for writing random data. For EXT3, with and without an index, we see similar results, but a bit worse for reading and a bit better for writing. ReiserFS has the lowest throughput for reading random data, but has good performance for writing random data. XFS has the best throughput for writing random data. The lowest CPU utilization is with EXT2, EXT3, and XFS, while XFS has the lowest I/O wait.

Conclusion

The IOzone workload causes very little journaling effort, because most of the metadata changes occur during the initial write phase (creating and enlarging the file), which is not monitored here. In our tests, only the rewrite phase is shown, whose only metadata changes are updates of the access times in the file and directory inodes. The same is true for reading. This workload shows just the overhead related to the journaling mechanisms for normal read and update operations with minimal metadata changes.

For the IOzone workload we see that XFS is the best overall journaling file system solution for both throughput and CPU utilization. Even though other file systems might be better in some categories, they have significant weaknesses elsewhere. This is also true in comparison to EXT2.

dbench workload

Figure 13 and Figure 14 show the results from the dbench workload on a 40% filled 1.6 TB file system.

Figure 13. dbench workload on a 40% filled file system throughput

Note: CPU utilization types are explained in "Understanding the CPU utilization".

Observations

The best performance for our dbench workload is seen with EXT2, which has no additional journaling effort. The journaling file system with the best throughput is XFS, especially with a high number of workload generators. XFS also has up to 70% more throughput than the second best journaling file system, EXT3. Looking at the cost in terms of throughput per 1% CPU, XFS drives more throughput with the same amount of CPU.

The worst performance with the highest cost is seen with ReiserFS.

Conclusion

The dbench workload makes a lot of changes in metadata (creating, modifying, and deleting a high number of files and directories), which causes a significant effort for journaling. This workload is much closer to a customer-like workload. On a 40% filled large volume, XFS is the best journaling file system solution with the dbench workload.

Figure 14. dbench workload on a 40% filled file system CPU utilization


Measurements on an empty file system
These tests were performed to determine how an empty file system affects performance on different file system types. We used the dbench workload to take some reference numbers.

We used the dbench workload to compare the results of the empty large file system with the results from our test case "Measurements on a 40% filled file system". This test was intended to show how a 40% filled system impacts throughput. We varied the file systems. The charts show only results for EXT3 with an index because the results without the index were nearly identical.

The results for a certain file system have the same color; the empty file system curve uses the square symbol, and the curve for the 40% utilized file system uses the triangle symbol.

Figure 15. dbench workload throughput

Note: CPU utilization types are explained in "Understanding the CPU utilization".

Observations

Throughput results for most of the file systems are only slightly degraded when the file system is filled to 40%. XFS shows a degradation in throughput of 14% between an empty and a 40% filled file system, which is the largest impact so far. However, it is still much faster than the other journaling file systems. Once again there is no difference for EXT3 regardless of whether an index is used.

For all file systems, the CPU utilization is very similar for the empty and the 40% filled file system.

Conclusion

Comparing the results for the empty file system with the 40% filled file system shows no major differences. This was not expected, especially for EXT2 and EXT3, because they distribute the metadata all over the disk. Only XFS shows a degradation in throughput, but it is still much faster than the other journaling file systems.

Figure 16. dbench workload CPU utilization


Other sources of information for the journaling file systems performance tests

Additional resources providing information on the products, hardware, and software discussed in this paper can be found in various books and at various Web sites.

For information on IOzone see:
v www.iozone.org

For information on dbench see:
v samba.org/ftp/tridge/dbench/

For information on Linux on System z see:
v www.ibm.com/servers/eserver/zseries/os/linux/

For information about FCP disks and other Linux device driver specifics see:
v Linux on System z - Device Drivers, Features, and Commands, SC33-8289
  www.ibm.com/developerworks/linux/linux390/

For information on IBM open source projects see:
v www.ibm.com/developerworks/opensource/index.html

See the following Redbooks/Redpapers for additional information:
v IBM System Storage™ DS8000® Series: Architecture and Implementation, SG24-6786
  www.redbooks.ibm.com/Redbooks.nsf/RedbookAbstracts/sg246786.html?OpenDocument
v Linux on zSeries®: Samba-3 Performance Observations
  www.redbooks.ibm.com/abstracts/redp3988.html
v Linux Performance and Tuning Guidelines
  www.redbooks.ibm.com/redpieces/abstracts/redp4285.html


Notices for the journaling file systems performance tests

IBM, IBM eServer, IBM logo, DB2, DB2 Universal Database, DS8000, ECKD, FICON, HiperSockets, Performance Toolkit for z/VM, System Storage, System z, System z9, WebSphere, xSeries, and z/VM are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both.

The following are trademarks or registered trademarks of other companies:

Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Intel and Xeon are trademarks of Intel Corporation in the United States, other countries, or both.

Linux is a registered trademark of Linus Torvalds in the United States and other countries.

Microsoft and Windows are registered trademarks of Microsoft Corporation in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

Information concerning non-IBM products was obtained from the suppliers of their products or their published announcements. Questions on the capabilities of the non-IBM products should be addressed to the suppliers.

IBM hardware products are manufactured from new parts, or new and serviceable used parts. Regardless, our warranty terms apply.

IBM may not offer the products, services, or features discussed in this document in other countries, and the information may be subject to change without notice. Consult your local IBM business contact for information on the products or services available in your area.

All statements regarding IBM's future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

Performance is in Internal Throughput Rate (ITR) ratio based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput that any user will experience will vary depending upon considerations such as the amount of multiprogramming in the user's job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve throughput improvements equivalent to the performance ratios stated here.
