Operating System Enhancements for
Data-Intensive Server Systems
by
Chuanpeng Li
Submitted in Partial Fulfillment
of the
Requirements for the Degree
Doctor of Philosophy
Supervised by
Kai Shen
Department of Computer Science
The College
Arts and Sciences
University of Rochester
Rochester, New York
2008
Curriculum Vitae
Chuanpeng Li was born in Shandong province, China, on May 2nd, 1977. He
attended Northeastern University in Shenyang, China from 1995 to 1999, and
graduated with a Bachelor of Engineering degree in Computer Science and Tech-
nology. He attended the same university from 1999 to 2002, and graduated with
a Master of Engineering degree. He entered the Ph.D. program in the Computer
Science Department at the University of Rochester in Fall 2002. Under the direc-
tion of Professor Kai Shen, his research is on operating system enhancements for
data-intensive server systems. He received a Master of Science degree in Computer
Science from the University of Rochester in 2003.
Acknowledgments
First of all, I would like to thank Professor Kai Shen. I am very lucky to have
him as my advisor. He spent a lot of time on my work. I remember many nights
he worked with me until early morning. I remember many times he sacrificed
his weekend and even dinner time in order to discuss the work with me. I still
remember several times he encouraged me to hold on when I did not even believe
in myself. I appreciate all the time he spent with me and all the great ideas and
efforts he put into this work. In addition to guiding my research, Professor Shen
also gave me great advice on how to be a happy person and how to achieve a
successful career. His advice will continue to help me in the future.
I am truly grateful to the other members of my thesis committee: Professor Chen
Ding, Professor Michael Huang, and Professor Michael Scott, for their advice and
patience. Their feedback and comments were extremely important in helping me
adjust my research and finally finish the thesis. Their encouragement helped me
build confidence. I am very lucky to have them on my committee.
I would like to thank Athanasios Papathanasiou for many excellent discussions
on my research. I also learned a great deal about Linux kernel hacking from
him. Many thanks to Professor Sandhya Dwarkadas for her advice during our
collaborative work. Thanks to Ming Zhong, Christopher Stewart, and Pin Lu for
their comments on my research, and to the whole URCS systems group for their
comments on my work and my practice talks.
I would like to thank the other URCS faculty, staff, and my fellow graduate
students for making this department a pleasant place to work. Thanks to my
office mates Manu Chhabra and Kirk Kelsey for many helpful discussions about
work and life.
I would also like to thank Ask.com for providing a prototype index searching
application and its workload trace in support of this work.
Finally, I am deeply indebted to my great parents and my dear wife. They
always tell me I am the best. They always make me feel comfortable. They have
stood behind me at all times, whether I was excited or frustrated. I love them.
The material presented in this dissertation is based upon work supported by
the National Science Foundation (NSF) grants CCR-0306473, ITR/IIS-0312925,
CNS-0615045, CCF-0621472, CCF-0448413, and an IBM Faculty Award. Any
opinions, findings, and conclusions or recommendations expressed in this material
are those of the author(s) and do not necessarily reflect the views of the above
named organizations.
Abstract
Recent studies on operating system support for concurrent server systems
mostly target CPU-intensive workloads with light disk I/O activities. However, an
important class of server systems that access a large amount of disk-resident data,
such as the index searching server of large-scale Web search engines, has received
limited attention. In this thesis work, we examine operating system techniques to
improve the performance of data-intensive server systems under concurrent exe-
cution. We propose OS enhancements in three aspects of the operating system:
file system prefetching, memory management, and disk I/O system. First, we
propose a competitive prefetching strategy that can balance the overhead of disk
I/O switching and the wasted I/O bandwidth on prefetching unnecessary data.
Second, we explore a new memory management scheme for prefetched data, in or-
der to reduce prefetching-incurred page thrashing at high execution concurrency.
Third, we present five performance anomalies we identified in the Linux I/O sys-
tem, and provide general discussions on operating system performance debugging.
We have implemented the proposed techniques in the Linux 2.6 kernel. Per-
formance evaluation on microbenchmarks and real applications shows promising
results.
Table of Contents
Curriculum Vitae ii
Acknowledgments iii
Abstract v
List of Figures x
List of Tables xiv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Targeted applications . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Memory Replacement . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Disk I/O Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 OS Support for Highly-Concurrent Servers . . . . . . . . . . . . . 10
2.6 Other Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Competitive Prefetching 12
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Existing Operating System Support . . . . . . . . . . . . . 14
3.2.2 Disk Drive Characteristics . . . . . . . . . . . . . . . . . . 16
3.3 Competitive Prefetching . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 2-Competitive Prefetching . . . . . . . . . . . . . . . . . . 18
3.3.2 Optimality of 2-Competitiveness . . . . . . . . . . . . . . . 21
3.3.3 Accommodation for Random-Access
Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.1 Evaluation Benchmarks . . . . . . . . . . . . . . . . . . . 29
3.5.2 Base-Case Performance . . . . . . . . . . . . . . . . . . . . 33
3.5.3 Performance with
Process-Context-Oblivious I/O Scheduling . . . . . . . . . 37
3.5.4 Performance on Different Storage Devices . . . . . . . . . 38
3.5.5 Performance of Serial Workloads . . . . . . . . . . . . . . 40
3.5.6 Summary of Results . . . . . . . . . . . . . . . . . . . . . 42
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 Conclusion and Discussions . . . . . . . . . . . . . . . . . . . . . 45
4 Managing Prefetch Memory 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Targeted Applications and Existing OS Support . . . . . . . . . . 52
4.3 Prefetch Memory Management . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Page Reclamation Policy . . . . . . . . . . . . . . . . . . . 56
4.3.2 Memory Allocation for Prefetched Pages . . . . . . . . . . 58
4.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Evaluation Benchmarks . . . . . . . . . . . . . . . . . . . 64
4.4.2 Microbenchmark Performance . . . . . . . . . . . . . . . . 65
4.4.3 Performance of Real Applications . . . . . . . . . . . . . . 70
4.4.4 Impact of Factors . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.5 Summary of Results . . . . . . . . . . . . . . . . . . . . . 75
4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5 I/O System Performance Anomalies 81
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Linux I/O System Algorithms . . . . . . . . . . . . . . . . . . . . 83
5.2.1 Prefetching Algorithm in Linux . . . . . . . . . . . . . . . 83
5.2.2 I/O Scheduling Algorithms in Linux . . . . . . . . . . . . 86
5.3 Anomalies in Linux I/O System Implementation . . . . . . . . . . 88
5.3.1 Suppression of Prefetching . . . . . . . . . . . . . . . . . . 89
5.3.2 Threshold of Disk Congestion . . . . . . . . . . . . . . . . 91
5.3.3 Estimation of Seek Cost . . . . . . . . . . . . . . . . . . . 92
5.3.4 Impact of System Timer Frequency . . . . . . . . . . . . . 93
5.3.5 Timing to Start Anticipation . . . . . . . . . . . . . . . . 96
5.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6 Conclusions and Future Work 102
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . 102
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Bibliography 106
List of Figures
2.1 Concurrent sequential I/O in a data-intensive server system. . . . 6
2.2 Concurrent sequential I/O under alternating accesses from a single
process/thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1 An illustration of I/O scheduling at the virtual machine monitor
(VMM). Typically, the VMM I/O scheduler has no knowledge of
the request-issuing process contexts within the guest systems. As
a result, the anticipatory I/O scheduling at the VMM may not
function properly. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Sequential transfer rate and seek time for two disk drives and a
RAID. Most results were acquired under the Linux 2.6.10 kernel.
The transfer rate of the IBM drive appears to be affected by the
operating system software. We also show its transfer rate on a
Linux 2.6.16 kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 I/O access pattern of certain application workloads. Virtual time
represents the sequence of discrete data accesses. In (A), each
straight line implies sequential access pattern of a file. In (B),
random and non-sequential accesses make the files indistinguishable. 32
3.4 Application I/O throughput of server-style workloads (four mi-
crobenchmarks and two real applications). The emulated optimal
performance is presented only for the microbenchmarks. . . . . . 34
3.5 Normalized execution time of non-server concurrent applications.
The values on top of the “Linux” bars represent the running time
of each workload under the original Linux kernel (in minutes or
seconds). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Performance of server-style workloads on a Xen virtual machine
platform where the I/O scheduling is oblivious to request-issuing
process contexts. We present results only for workloads whose performance
substantially differs from that in the base case (shown in
Figure 3.4). For the microbenchmark, we also show the perfor-
mance of an emulated optimal prefetching. . . . . . . . . . . . . 38
3.7 Normalized execution time of non-server concurrent applications on
a virtual machine platform where the I/O scheduling is oblivious
of request-issuing process contexts. We present results only for
workloads whose performance substantially differs from that in the
base case (shown in Figure 3.5). The values on top of the “Linux”
bars represent the running time of each workload under the original
Linux kernel (in seconds). . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Normalized execution time of two non-server concurrent workloads
on three different storage devices. . . . . . . . . . . . . . . . . . 40
3.9 Normalized inverse of throughput for two server-style concurrent
applications on three different storage devices. We use the inverse
of throughput (as opposed to throughput itself) so that the illustra-
tion is more comparable to that in Figure 3.8. For the microbench-
mark, we also show an emulated optimal performance. For clear
illustration, we only show results at concurrency level 4. The performance
at other concurrency levels follows a similar pattern. . . . . 41
3.10 Normalized execution time of serial application workloads. The
values on top of the “Linux” bars represent the running time of
each workload under the original Linux prefetching (in minutes,
seconds, or milliseconds). . . . . . . . . . . . . . . . . . . . . . . . 42
3.11 Normalized latency of individual I/O request under concurrent work-
load. The latency of an individual small I/O request is measured
with microbenchmark Two-Rand-0 running in the background. The
values on top of the “Linux” bars represent the request latency un-
der the original Linux prefetching (in seconds). . . . . . . . . . . . 47
4.1 Application throughput and prefetch page thrashing rate of a trace-
driven index searching server under the original Linux 2.6.10 kernel.
A prefetch thrashing is counted as a page miss on a prefetched but
then prematurely evicted memory page. . . . . . . . . . . . . . . . 54
4.2 A separate prefetch cache for prefetched pages. . . . . . . . . . . . 55
4.3 An illustration of the prefetch cache data structure and our imple-
mentation in Linux. . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Microbenchmark I/O throughput at different concurrency levels.
Our proposed prefetch memory management produces throughput
improvement on One-Rand and Two-Rand. . . . . . . . . . . . . . 68
4.5 Comparison among AccessHistory, Coldest+, and an approximated
optimal page reclamation policy. The throughput of Coldest+ is
10–32% lower than that of the approximated optimal while it is
8–97% higher than that of AccessHistory in high concurrencies. . 70
4.6 I/O throughput and page thrashing rate of the index searching server. 71
4.7 Percentage of prematurely evicted pages that have stayed in mem-
ory for more than 8 seconds for the index searching server. We
only consider those prefetched but then prematurely evicted pages
which later cause page misses. . . . . . . . . . . . . . . . . . . . . 72
4.8 Throughput of the Apache Web server hosting media clips. . . . . 73
4.9 Performance impact of the maximum prefetching depth for the mi-
crobenchmark Two-Rand. “PCC+” stands for PC-Coldest+ and
“AH” stands for AccessHistory. . . . . . . . . . . . . . . . . . . . 74
4.10 Performance impact of the server memory size for the microbench-
mark Two-Rand. “PCC+” stands for PC-Coldest+ and “AH”
stands for AccessHistory. . . . . . . . . . . . . . . . . . . . . . . . 75
4.11 Performance impact of the workload data size for the microbench-
mark Two-Rand. “PCC+” stands for PC-Coldest+ and “AH”
stands for AccessHistory. . . . . . . . . . . . . . . . . . . . . . . . 76
5.1 Current and ahead windows in asynchronous prefetching . . . . . 84
5.2 Throughput of a server workload. The anticipatory I/O scheduler
is used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3 Throughput of a server-style microbenchmark workload. Change of
congestion threshold may cause anomalous performance deviations. 91
5.4 Effect of kernel correction (for anomaly #3) on a microbenchmark
that alternatively reads random size of data in two files. The an-
ticipatory I/O scheduler is used. . . . . . . . . . . . . . . . . . . . 93
5.5 Effect of kernel correction (for anomaly #4) on a microbenchmark
that reads 256 KB randomly in large files. . . . . . . . . . . . . . 95
5.6 Effect of kernel corrections on a microbenchmark that reads 256 KB
randomly in large files. . . . . . . . . . . . . . . . . . . . . . . . . 97
List of Tables
4.1 Benchmark statistics. The column “Whole-file access?” indicates
whether a request handler would prefetch some data that it would
never access. No such data exists if whole files are accessed by
application request handlers. The column “Stream per request”
indicates the number of prefetching streams each request handler
initiates simultaneously. More prefetching streams per request han-
dler would create higher memory contention. . . . . . . . . . . . 66
1 Introduction
1.1 Motivation
With the ubiquity of computer networks, network-based applications have become
increasingly popular. A large number of network and distributed applications
are built upon on-line servers. Server systems usually provide computing or
storage services to many simultaneous clients. Meanwhile, modern computing
systems have become increasingly data-driven, and the dataset sizes of typical
server applications have grown markedly. As an example, from 2003 to 2008, the
estimated number of indexed web pages at Google increased from 3.2 billion to
over 20 billion [SWW, SES]. Against this backdrop, many data-intensive server
systems have emerged. In addition to large-scale Web search engines, examples
of such systems include parallel file systems and FTP/Web servers that host
large files, such as free software and multimedia data. Such data-intensive
server systems access large disk-resident datasets while serving many clients
simultaneously.
The operating system (OS) is the lowest layer of software in a computer system.
Efficient OS support for upper-layer applications has a direct and significant
impact on the overall system performance and reliability. Previous studies have
investigated OS support for concurrent server systems. Techniques such as event-driven
concurrency management [PDZ99a, WCB01] and user-level threads [vBCZ+03]
were proposed to avoid expensive kernel-level context switches and reduce the
synchronization overhead. However, these techniques are mostly designed to sup-
port CPU-intensive workloads with light disk I/O activities (e.g., the typical Web
server workload and application-level packet routing). The existing OS support
is insufficient for concurrent server systems to achieve high performance while
accessing a large amount of disk-resident data.
1.2 Thesis Statement
The goal of this thesis work is to enhance the OS support to achieve high
performance and efficient resource utilization for data-intensive server systems.
Specifically, we investigate transparent enhancements in three aspects
of the operating system: file system prefetching, memory management,
and disk I/O system. Although large-scale server applications often run on
clusters or distributed systems, we concentrate on OS enhancements for a single
node of a larger system. OS support for server systems at the cluster level is
beyond the scope of this work.
First, for data-intensive server systems, the disk I/O performance dominates
the overall system throughput when the dataset size far exceeds the available
server memory. Many applications are carefully constructed to follow a
sequential pattern when accessing disk-resident data in order to achieve high
I/O performance. However, such sequential access patterns are often disrupted
by concurrent request processing. More specifically, the data access of one
request handler (e.g., a server process or thread) can be frequently interrupted
by other active request handlers in the system. This phenomenon, which we call
disk I/O switching, can severely degrade I/O efficiency due to long disk seek
and rotational delays.
Aggressive prefetching can improve the granularity of sequential data access, but
it comes with a higher risk of fetching unneeded data. To address this problem,
we propose a competitive prefetching technique that balances the overhead of
I/O switching against that of unnecessary prefetching.
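The intuition behind this balance (developed fully in Chapter 3) can be sketched as follows: if the prefetching depth is chosen so that the time spent transferring the prefetched data equals the cost of one disk I/O switch, then the time wasted on unneeded data never exceeds the switch cost itself, bounding total I/O time at twice that of an optimal offline strategy. The sketch below is illustrative only; the device parameters are hypothetical, not measurements from this work.

```python
# Illustrative sketch (not the kernel implementation): pick a prefetching
# depth at which the sequential transfer time of the prefetched data equals
# the cost of one disk I/O switch (seek + rotation).

def competitive_prefetch_depth(transfer_rate_bytes_per_s, io_switch_cost_s):
    """Depth (in bytes) at which transfer time equals the I/O switch cost.

    Prefetching this much per switch wastes at most as much time on unneeded
    data as one switch itself costs, so total I/O time is at most twice that
    of an optimal offline strategy (2-competitiveness).
    """
    return int(transfer_rate_bytes_per_s * io_switch_cost_s)

# Hypothetical device: 50 MB/s sequential transfer, 8 ms seek + rotation.
depth = competitive_prefetch_depth(50 * 1024 * 1024, 0.008)
```

With these assumed parameters the policy prefetches roughly 400 KB per stream before allowing a switch; a faster disk or a slower seek both push the depth up, matching the intuition that cheap sequential bandwidth justifies bolder prefetching.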
Second, data-intensive server systems may contain a substantial amount of
prefetched data in memory due to large-granularity I/O prefetching and high
execution concurrency. Traditional access recency or frequency-based page recla-
mation methods are not effective for prefetched data because prefetched memory
pages have no access history. Using a traditional page reclamation policy, memory
contention can result in premature eviction of prefetched pages before they are
actually accessed. This phenomenon, which we call page thrashing, may degrade
server system performance dramatically. To address this problem, we explore a
new memory management scheme for prefetched but not-yet-accessed pages. We
examine efficient heuristic page reclamation policies that are not based on access
recency or frequency.
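As an illustration only (the actual policies are designed and evaluated in Chapter 4), a separate cache for prefetched pages can pick eviction victims without any per-page access history, for example by evicting from the stream whose owner has gone longest without consuming a prefetched page. The sketch below is a simplified stand-in for such a heuristic; the class name and interface are invented for exposition, not taken from our implementation.

```python
from collections import OrderedDict

# Minimal sketch of a separate cache for prefetched, not-yet-accessed pages.
# Victims come from the "coldest" stream: the one whose owner has gone
# longest without consuming a prefetched page. Consuming a page moves its
# stream to the hot end of the ordering.

class PrefetchCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.streams = OrderedDict()  # stream id -> pages, coldest stream first

    def insert(self, stream, page):
        """Add a prefetched page; return the evicted page, if any."""
        self.streams.setdefault(stream, []).append(page)
        if sum(len(p) for p in self.streams.values()) > self.capacity:
            cold = next(iter(self.streams))       # coldest stream
            victim = self.streams[cold].pop(0)    # its oldest prefetched page
            if not self.streams[cold]:
                del self.streams[cold]
            return victim
        return None

    def access(self, stream, page):
        """Consume a prefetched page; False means a page miss (thrashing)."""
        pages = self.streams.get(stream, [])
        if page in pages:
            pages.remove(page)                    # page leaves the prefetch cache
            self.streams.move_to_end(stream)      # stream becomes hottest
            return True
        return False
```

A miss returned by `access` on a page that was previously inserted corresponds exactly to the prefetch-thrashing event counted in Figure 4.1: the page was prefetched but evicted before use.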
Third, OS support for data-intensive server systems can also be improved by
reducing I/O system performance bugs or anomalies. The I/O storage system is
usually the slowest part of a computer system, and its efficiency is crucial for
data-intensive server applications. In a typical monolithic operating system,
the I/O system is large and complex, and it may not always behave as well as
expected. We investigate the Linux I/O system and identify several performance
anomalies. In this dissertation, we present five performance anomalies and
discuss our experience in operating system performance debugging. In particular,
two of the bugs we identified were reported to the Linux kernel development
team [Bug]. One of the two has been accepted by the team and its fix will be
included in an upcoming Linux kernel version. The other three anomalies are not
outright "bugs" that must be corrected (see Section 5.4 for details).
This work aims to provide OS enhancements that can transparently benefit
a large number of data-intensive applications. No application assistance or hints
will be needed to take advantage of our research results. Our design will also
take measures to minimize negative impact on applications that are not directly
targeted in this work.
1.3 Organization
The rest of this dissertation is organized as follows. Chapter 2 provides more
background information on this work. Chapter 3 presents the design, implementa-
tion, and evaluation of the competitive prefetching technique. Chapter 4 presents
the design, implementation, and evaluation of the enhanced memory management
strategy for prefetched data. Chapter 5 presents the performance anomalies we
identified in the Linux I/O system and provides discussions on operating sys-
tem performance debugging. Chapter 6 concludes this work with a summary of
contributions and provides discussions on future directions.
2 Background
In this chapter, we introduce some high-level background information on this
thesis work. We first describe the I/O characteristics of our targeted data-intensive
server systems. We then review concepts and existing work on prefetching, memory
management, disk I/O, and concurrent servers, and discuss limitations of the
existing work.
2.1 Targeted applications
The targeted applications that motivate us to conduct this work are data-
intensive server systems that may serve many clients simultaneously. In these
systems, each user request is typically serviced by a request handler which can be
a thread in a multi-threaded server or a series of event handlers in an event-driven
server. The request handler then repeatedly accesses disk data and consumes
the CPU before completion. We define a sequential I/O stream as a group of
spatially contiguous data items that are accessed by a single process or thread.
Concurrent requests to a data-intensive server will cause concurrent accesses to
multiple sequential I/O streams in the storage system. We call this type of I/O
workload concurrent sequential I/O, or concurrent I/O in short.
Figure 2.1: Concurrent sequential I/O in a data-intensive server system. (The
original figure shows requests arriving from the network, with each request
handler alternating among disk I/O, CPU computation, and waiting for resources
along a timeline until completion.)
Each request handler on the server does not necessarily make use of the full
content of each I/O stream. It may also perform interleaving I/O operations that
do not belong to the same sequential stream. Disk I/O can be triggered explicitly
through library calls such as fread, or implicitly through memory-mapped I/O.
For instance, a substantial fraction of disk operations in FTP and web servers,
which tend to retrieve whole or partial files, are associated with sequential I/O
streams. For more complex server applications, their developers may employ
data correlation techniques in order to lay out their data sequentially on disk and
increase I/O efficiency. Figure 2.1 illustrates concurrent I/O operations for a
typical data-intensive server application.
In addition to data-intensive server systems, concurrent I/O may also be gen-
erated by concurrent execution of independent processes or a standalone process
that multiplexes accesses among multiple sequential I/O streams (Figure 2.2). Al-
though these applications are not our major concern, the proposed techniques are
also beneficial for them. Examples of such applications include concurrent copying
of multiple files, the file comparison utility diff, and several database workloads.
For instance, web search engines [Ask] maintain an ordered list of matching web
pages for each indexed keyword. For multi-keyword queries, the search engine has
to compute the intersection of multiple such lists. Additionally, SQL multi-table
join queries require accessing a number of files sequentially.
Figure 2.2: Concurrent sequential I/O under alternating accesses from a single
process/thread. (The original figure shows one timeline on which a single
process alternates disk I/O and CPU between two sequential streams.)
Many applications read a portion of each data stream at a time instead of
fetching the complete data stream into application-level buffers at once. Reading
complete data streams is generally avoided for the following reasons. First, some
applications do not have prior knowledge of the exact amount of data required
from a certain data stream. For instance, index search and SQL join queries will
often terminate when a desired number of “matches” are found. Second, applica-
tions that employ memory-mapped I/O do not directly initiate I/O operations.
The operating system loads into memory a small number of pages (through page
faults) at a time on their behalf. Third, fetching complete data streams into
application-level buffers may result in double buffering and/or increase memory
contention. For instance, while GNU diff [GNU06] reads the complete files into
memory at once, the BSD diff can read a portion of the files at a time and thus
it consumes significantly less memory when comparing large files.
2.2 Prefetching
Prefetching, or read-ahead, means reading adjacent pages of a file before
they are actually accessed by the application. If prefetched pages are later
accessed, they are already available in the in-memory page/buffer cache.
Prefetching can therefore improve disk I/O efficiency by reducing
seek and rotational overhead for sequential accesses to disk-resident data, although
it is not very useful for random data accesses. Section 5.2.1 provides a description
of the prefetching algorithm in Linux.
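As a rough illustration of the sequential-detection style of readahead described in Section 5.2.1, the sketch below starts with a small window, roughly doubles it on each sequential access, caps it at a maximum, and resets on a random access. The constants and exact ramp-up rule are illustrative assumptions, not the Linux kernel's actual values.

```python
# Simplified sketch of sequential-detection readahead in the spirit of the
# Linux algorithm described in Section 5.2.1. Constants are illustrative.

INITIAL_WINDOW = 4   # pages prefetched on the first sequential access
MAX_WINDOW = 32      # cap on the readahead window, in pages

def next_window(prev_window, sequential):
    """Compute the next readahead window size, in pages."""
    if not sequential:
        return 0                      # random access: disable readahead
    if prev_window == 0:
        return INITIAL_WINDOW         # first sequential access after a reset
    return min(prev_window * 2, MAX_WINDOW)   # ramp up, capped at the maximum

# A run of sequential accesses ramps the window up and then holds it.
w, sizes = 0, []
for _ in range(5):
    w = next_window(w, sequential=True)
    sizes.append(w)
```

The ramp-up (here 4, 8, 16, 32 pages) keeps the cost of a mispredicted stream small while quickly reaching full depth on genuinely sequential streams.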
Previous research has examined file-system prefetching. We will discuss related
work on prefetching in detail in Section 3.6. Here we briefly mention three pieces
of representative work. Cao et al. proposed integrated prefetching and caching
strategies in [CFKL95]. They presented four rules that an optimal prefetching
and caching strategy must follow. Patterson et al. proposed informed prefetching
which used application hints to guide prefetching and used a cost-benefit model
to evaluate the benefit of prefetching [PGG+95, TPG97]. Aggressive prefetching
was also proposed to increase disk access burstiness and thus to save energy for
mobile devices [PS03, PS04, PS05].
2.3 Memory Replacement
Virtual memory [BCD72] is one of the great ideas in computer systems [BO03,
Nut99]. It gives applications the impression that they have contiguous working
memory which may be larger than the physical memory. It is the basis for many
more advanced techniques or concepts such as multitasking and shared memory.
In virtual memory, memory/page replacement policies decide which memory pages
to evict when a page fault occurs and no free pages are available. Due to its
ease of implementation and good performance, an approximated LRU policy is
widely used in modern operating systems.
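The LRU approximation mentioned above is commonly realized with a CLOCK-style scheme [Smi78]: each resident page carries a reference bit that is set on access, and a sweeping clock hand clears bits and evicts the first unreferenced page it finds. The following is a minimal sketch for illustration, not any particular kernel's code.

```python
# Minimal sketch of the CLOCK approximation of LRU: each resident page has a
# reference bit set on access; the clock hand clears bits as it sweeps and
# evicts the first page it finds unreferenced ("second chance").

class Clock:
    def __init__(self, nframes):
        self.frames = [None] * nframes    # page resident in each frame
        self.refbit = [False] * nframes   # reference bit per frame
        self.hand = 0                     # clock hand position

    def access(self, page):
        """Access a page; return the evicted page on a miss, else None."""
        if page in self.frames:                        # hit: set reference bit
            self.refbit[self.frames.index(page)] = True
            return None
        while True:                                    # miss: sweep for a victim
            if self.frames[self.hand] is None or not self.refbit[self.hand]:
                victim = self.frames[self.hand]
                self.frames[self.hand] = page
                self.refbit[self.hand] = True
                self.hand = (self.hand + 1) % len(self.frames)
                return victim
            self.refbit[self.hand] = False             # give a second chance
            self.hand = (self.hand + 1) % len(self.frames)
```

Note that a freshly prefetched page would enter this structure with no reference history at all, which is precisely why recency-based schemes handle prefetched-but-unaccessed pages poorly, as argued below.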
In addition to LRU, a large body of previous work has explored efficient mem-
ory/page replacement policies for data-intensive applications, including AR-1 [Bel66],
Working-Set [Den68], LFU [MGST70], CLOCK [Smi78], CLOCK-Pro [JCZ05],
FBR [RD90], LRU-K [OOW93], 2Q [JS94], Unified Buffer Management [KCK+00],
Multi-Queue [ZPL01], and LIRS [JZ02]. All these memory replacement strategies
are based on the page reference history such as reference recency or frequency. In
data-intensive server systems, a large number of prefetched but not-yet-accessed
pages may emerge due to aggressive I/O prefetching and high execution concur-
rency. Because of the lack of any access history, these pages may not be properly
managed by access recency or frequency-based replacement policies.
2.4 Disk I/O Scheduler
I/O schedulers are also called elevators. They manage I/O operations for each
block device by reordering and merging requests. Disk scheduling policies have
been studied extensively. SSF (Shortest Seek First) and Scan [Tan01] are
seek-reducing schedulers. YFQ (Yet-another Fair Queuing) [BBG+99], Lottery
scheduling [WW94], and Stride scheduling [WW95] are examples of proportional-
share schedulers. A disk scheduler is work-conserving if it always tries to schedule
the next request whenever the previous request has finished. Both seek-reducing
schedulers and proportional-share schedulers are work-conserving.
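As a concrete illustration of a seek-reducing, work-conserving policy, the textbook Scan ("elevator") order can be sketched as follows. This captures only the concept from [Tan01], not an actual kernel elevator, and the cylinder numbers are invented for the example.

```python
# Textbook Scan ("elevator") ordering: service pending requests in the
# current direction of head movement, then reverse and sweep back.

def scan_order(head, requests, direction_up=True):
    """Return the order in which Scan services the pending cylinder requests."""
    up = sorted(r for r in requests if r >= head)                  # ahead of head
    down = sorted((r for r in requests if r < head), reverse=True) # behind head
    return up + down if direction_up else down + up

# Head at cylinder 50, moving toward higher cylinder numbers:
order = scan_order(50, [10, 95, 60, 20, 70])
```

The scheduler is work-conserving: as soon as one request finishes, the next one in the sweep is dispatched, with no idle waiting.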
The drawback of work-conserving disk schedulers is that they all suffer from
deceptive idleness: the scheduler incorrectly assumes that the last
request-issuing process has no further requests and is thus forced to switch to
a request from another process [ID01]. In data-intensive server systems, where
workloads mostly access the disk sequentially, deceptive idleness occurs very
often under a work-conserving scheduler. This can negatively affect server
performance.
Iyer and Druschel proposed the anticipatory disk scheduling framework to avoid
deceptive idleness [ID01]. An anticipatory disk scheduler is not
work-conserving: it briefly delays servicing disk I/O requests from other
processes, waiting instead for a nearby request from the process that issued
the previous one. Anticipatory scheduling has been shown to improve the
performance of many data-intensive workloads. However, anticipatory scheduling
may not be effective when substantial think time exists between consecutive I/O
requests or when the request handler has to perform some interleaving synchronous
I/O that does not exhibit strong locality. Section 3.2.1 has more on this issue.
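The anticipation decision itself can be sketched as follows. This is a heavily simplified illustration of the idea in [ID01]; the waiting budget, the think-time estimate, and the request representation are all assumptions made for exposition, not the real scheduler's interface.

```python
# Hedged sketch of the anticipation decision from [ID01]: after a request
# from process `last_proc` completes, a non-work-conserving scheduler may
# idle the disk briefly, betting that the process's next (nearby) request
# arrives before the waiting period expires. Timings are illustrative.

def should_wait(last_proc, pending, expected_think_time_s, wait_budget_s=0.005):
    """Decide whether to idle the disk instead of dispatching now.

    pending: list of (process, block) requests already queued.
    Wait only if no request from last_proc is queued yet and that process is
    expected to issue its next request within the waiting budget.
    """
    if any(proc == last_proc for proc, _ in pending):
        return False   # its next request is already queued; just dispatch it
    return expected_think_time_s <= wait_budget_s
```

This also exposes the limitation noted above: a process with substantial think time between requests (say 50 ms against a 5 ms budget) makes waiting pointless, so anticipation degrades to work-conserving behavior.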
2.5 OS Support for Highly-Concurrent Servers
In support of concurrent servers, operating systems traditionally provide sev-
eral programming models, such as multi-process (MP), multi-threaded (MT),
and single-process event-driven (SPED) models. The debate between threads
and events has been ongoing in the literature for a long time. Pai et al. pro-
posed a server architecture, called AMPED (Asymmetric Multiprocess Event
Driven) [PDZ99a]. It extends SPED with multiple helper processes (or threads),
which handle blocking disk I/O operations. Welsh et al. proposed a staged event-
driven architecture (SEDA) for high-concurrent online servers [WCB01]. The
basic idea is to construct each application as a network of event-driven stages
connected by explicit event queues. More recently, Behren et al. argued that
thread-based systems can not only provide a simpler programming model but
also provide equivalent or superior performance compared with event-based sys-
tems [BCB03, vBCZ+03]. They implemented a user-level thread library, called
Capriccio, which scales well when supporting hundreds of thousands of threads
running Internet services.
However, these studies mainly focus on the CPU utilization efficiency of highly-
concurrent servers. In our study, the throughput of data-intensive server systems
is bounded by the I/O performance when the application data size far exceeds the
available system memory.
2.6 Other Aspects
There is some other research work that is relevant to data-intensive server
systems.
• Many researchers have explored file system data allocation to reduce aver-
age disk head positioning overheads. Some studies focused on collocation of
related data [RO92, BHS95, SS96, BC02, LSG+00]. Some other studies sug-
gested making use of the high bandwidth of outer regions of the disk to host
frequently accessed data [Met97, BDF+96]. In a recent work, Schindler et
al. proposed a way to avoid rotational delay and track crossing overheads
by allocating related data on disk track boundaries [SGLG02].
• Several studies explored efficient I/O buffering and caching to avoid redun-
dant data copying and multiple buffering in server systems [FP93, DP93,
KEG+97, Bru99, CP99, PDZ99b]. For instance, Pai et al. proposed unifying
all buffering and caching in the system, so that the file system, network sub-
system, IPC system, and application can share a single physical copy of the
data.
Those studies provide important OS support for data-intensive server systems. In
contrast with our work, they address different aspects of the OS; their techniques
complement ours and can be used alongside the techniques we propose.
In summary, existing operating system support for data-intensive server
systems is insufficient in terms of file-system prefetching, memory management,
and the disk I/O system. In the following chapters, we propose system-level
enhancements in these three aspects.
3 Competitive Prefetching
3.1 Introduction
Concurrent access to multiple sequential I/O streams is commonplace in both
server systems and desktop workloads. For such concurrent I/O workloads, con-
tinuous accesses to one sequential data stream can be interrupted by accesses to
other streams in the system. Frequent switching between multiple sequential I/O
streams may severely affect I/O efficiency due to long disk seek and rotational de-
lays of disk-based storage devices. In the rest of this dissertation, we refer to any
disk head movement that interrupts a sequential data transfer as an I/O switch.
The cost of an I/O switch includes both seek and rotational delays.
The problem of unnecessary disk I/O switches during concurrent I/O workloads
may be partially alleviated by non-work-conserving disk schedulers. In particular,
the anticipatory scheduler [ID01] may temporarily idle the disk (even when there
are outstanding I/O requests) so that consecutive I/O requests belonging to the
same process/thread can be serviced without interruption. However, anticipatory
scheduling is ineffective in cases where a single process or thread synchronously
accesses multiple sequential data streams in an interleaving fashion, or when a sub-
stantial amount of processing or think time exists between consecutive sequential
CHAPTER 3. COMPETITIVE PREFETCHING 13
requests from a process. In addition, anticipatory scheduling may not function
properly when it lacks the knowledge of I/O request-issuing process contexts, as
in the cases of I/O scheduling within a virtual machine monitor [JADAD06] or
within a parallel/distributed file system server [LS07, CIRT00].
Another technique to reduce I/O switch frequency for concurrent I/O work-
loads is to increase the granularity of sequential I/O operations. This can be
accomplished by aggressive OS-level prefetching that prefetches deeper in each
sequential I/O stream. However, under certain scenarios, aggressive prefetching
may have a negative impact on performance due to buffer cache pollution and inef-
ficient use of I/O bandwidth. Specifically, aggressive prefetching has a higher risk
of retrieving data that are not needed by the application. This chapter proposes
an OS-level prefetching technique that balances the overhead of I/O switching
and the disk bandwidth wasted on prefetching unnecessary data. We focus on the
mechanism that controls the prefetching depth for sequential or partially sequen-
tial accesses. We do not consider other aspects or forms of prefetching.
The proposed prefetching technique is based on a fundamental 2-competitiveness
result: “when the prefetching depth is equal to the amount of data that can be
sequentially transferred within the average time of a single I/O switch, the total
disk resource consumption of an I/O workload is at most twice that of the optimal
offline prefetching strategy”. Our proposed prefetching technique does not require
a-priori knowledge of an application’s data access pattern or any other type of
application support. We also show that the 2-competitiveness result is optimal,
i.e., no transparent OS-level online prefetching can achieve a competitiveness ra-
tio lower than 2. Finally, we further extend the competitiveness result to capture
prefetching in the case of random-access workloads.
The rest of this chapter is structured as follows. Section 3.2 provides more
background information on existing OS support and relevant storage device
characteristics. Section 3.3 presents the analysis and design of the proposed
competitive prefetching strategy. Section 3.4 discusses practical issues concerning storage
device characterization and the implementation of our proposed strategy. Sec-
tion 3.5 evaluates the proposed strategy using several microbenchmarks and a
variety of real application workloads. Section 3.6 discusses related previous work.
Section 3.7 discusses some performance issues and concludes the chapter.
3.2 Background
3.2.1 Existing Operating System Support
During concurrent I/O workloads, sequential access to a data stream can be
interrupted by accesses to other streams, increasing the number of expensive disk
seek operations. Anticipatory scheduling [ID01] can partially alleviate the prob-
lem. At the completion of an I/O request, anticipatory scheduling may keep the
disk idle for a short period of time (even if there are outstanding requests) in
anticipation of a new I/O request from the process/thread that issued the just-
completed request. Consecutive requests generated from the same process/thread
often involve data residing on neighboring disk locations and require little or no
seeking. This technique yields significant performance improvements for concurrent
I/O workloads in which the I/O requests issued by an individual process exhibit
strong locality. However, anticipatory scheduling is ineffective in several cases.
• Since anticipatory scheduling expects a process to issue requests with strong
locality in a consecutive way, it cannot avoid disk seek operations when a
single process multiplexes accesses among two or more data streams (Fig-
ure 2.2).
• The limited anticipation period introduced by anticipatory scheduling may
not avoid I/O switches when a substantial amount of processing or think
Figure 3.1: An illustration of I/O scheduling at the virtual machine
monitor (VMM). [Figure: processes proc1–proc3 issue requests through a guest
system's virtual I/O device; the VMM I/O scheduler multiplexes this guest and
other guest systems onto the physical device.] Typically, the VMM I/O scheduler
has no knowledge of the request-issuing process contexts within the guest systems.
As a result, the anticipatory I/O scheduling at the VMM may not function properly.
time exists between consecutive sequential requests from the same process.
More specifically, the anticipation may timeout before the anticipated re-
quest actually arrives.
• Anticipatory scheduling needs to keep track of the I/O behavior of each pro-
cess so that it can decide whether to wait for a process to issue a new request
and how long to wait. It may not function properly when the I/O scheduler
does not know the process contexts from which I/O requests are issued. In
particular, Jones et al. [JADAD06] pointed out that I/O schedulers within
virtual machine monitors typically do not know the request’s process con-
texts inside the guest systems. As shown in figure 3.1, the process contexts
CHAPTER 3. COMPETITIVE PREFETCHING 16
information is encapsulated inside the guest OS and can not be easily used by
the I/O scheduler in the virtual machine monitor. Another example is that
the I/O scheduler at a parallel/distributed file system server [LS07, CIRT00]
may not have the knowledge of remote process contexts for I/O requests.
An alternative to anticipatory scheduling is I/O prefetching, which can in-
crease the granularity of sequential data accesses and, consequently, decrease
the frequency of disk I/O switches. Since sequential transfer rates improve sig-
nificantly faster than seek and rotational latencies on modern disk drives, aggres-
sive prefetching becomes increasingly appealing. To limit the amount of wasted
I/O bandwidth and memory space, operating systems often employ an upper
bound on the prefetching depth, without a systematic understanding of the
impact of the prefetching depth on performance.
Our analysis in this chapter assumes that application-level logically sequential
data blocks are mapped to physically contiguous blocks on storage devices. This
is mostly true because file systems usually attempt to organize logically sequential
data blocks in a contiguous way in order to achieve high performance for sequential
access. Bad sector remapping on storage devices can disrupt sequential block
allocation. However, such remapping does not occur for the majority of sectors in
practice. We do not consider its impact on our analysis.
3.2.2 Disk Drive Characteristics
The disk service time of an I/O request consists of three main components: the
head seek time, the rotational delay, and the data transfer time. Disk performance
characteristics have been extensively studied in the past [KTR94, RW94, SMW98].
The seek time depends on the distance to be traveled by the disk head. It consists
roughly of a fixed head initiation cost plus a cost linear to the seek distance. The
rotational delay mainly depends on the rotation distance (fraction of a revolution).
Techniques such as out-of-order transfer and free-block scheduling [LSGN00] are
available to reduce the rotational delay. The sequential data transfer rate varies
at different data locations because of zoning on modern disks. Specifically, tracks
on outer cylinders often contain more sectors per track and, consequently, exhibit
higher data transfer rates.
Modern disks are equipped with read-ahead caches so they may continue to
read data after the requested data have already been retrieved. However, the disk
cache size is relatively small (usually less than 16 MB) and, consequently, data
stored in the disk cache are replaced frequently under concurrent I/O workloads.
In addition, disk cache read-ahead is usually performed only when there are no
pending requests, which is rare in the context of I/O-intensive workloads. There-
fore, we do not consider the impact of disk cache read-ahead on our analysis of
OS-level prefetching.
3.3 Competitive Prefetching
We investigate I/O prefetching strategies that address the following tradeoff in
deciding the I/O prefetching depth: conservative prefetching may lead to a high
I/O switch overhead, while aggressive prefetching may waste too much I/O band-
width on fetching unnecessary data. Section 3.3.1 presents a prefetching policy
that can achieve at least half the performance (in terms of I/O throughput) of the
optimal offline prefetching strategy without a-priori knowledge of the application
data access pattern. In the context of this work, we refer to the “optimal” prefetch-
ing as the best-performing sequential prefetching strategy that minimizes the com-
bined cost of disk I/O switch and retrieving unnecessary data. We do not consider
other issues such as prefetching-induced memory contention in our analysis. Pre-
vious work has explored such issues [CFKL95, KMC02, PGG+95, TPG97]. We
will look at memory-related issues in Chapter 4. In Section 3.3.2, we show that
2-competitive prefetching is the best competitive strategy possible. Section 3.3.3
examines prefetching strategies adapted to accommodate mostly random-access
workloads, along with their competitiveness results.
The competitiveness analysis in this section focuses on systems employing
standalone disk drives. In later sections, we will discuss practical issues and
present experimental results when extending our results for disk arrays.
3.3.1 2-Competitive Prefetching
Consider a concurrent I/O workload consisting of N sequential streams. For
each stream i (1 ≤ i ≤ N), we assume the total amount of accessed data (or stream
length) is S_{tot}[i]. Due to zoning on modern disks, the sequential data transfer rate
can vary depending on the data location. Since each stream is likely
localized to one disk zone, it is reasonable to assume that the transfer rate for any
data access on stream i is a constant, denoted R_{tr}[i].
For stream i, in the ideal case, the optimal prefetching strategy has a-priori
knowledge of Stot[i] (the amount of data to be accessed). It performs a single
I/O switch (including seek and rotation) to the beginning of the stream and then
sequentially transfers data of size Stot[i]. Let the I/O switch cost for stream i be
Cswitch[i]. The disk resource consumption (in time) for accessing all N streams
under optimal prefetching is:
Copttotal =
∑
1≤i≤N
(
Stot[i]
Rtr[i]+ Cswitch[i]
)
(3.1)
However, in practice the total amount of data to be accessed S_{tot}[i] is not known
a-priori. Consequently, OS-level prefetching tries to approximate S_{tot}[i] using an
appropriate prefetching depth. For simplicity, we assume that the OS employs a
constant prefetching depth S_p[i] for each stream. (This assumption constrains only
the design of the prefetching policy; what can be achieved with a constrained policy
can certainly be achieved with a non-constrained, broader policy, so the simplification
does not restrict the competitiveness result presented in this section.) Therefore,
accessing stream i requires \lceil S_{tot}[i] / S_p[i] \rceil prefetch operations, each
of which requires one I/O switch. Let the I/O switch cost be C_{switch}[i, j] for the
j-th prefetch operation of stream i. The total I/O switch cost for accessing stream i is:

C_{switch}^{tot}[i] = \sum_{1 \le j \le \lceil S_{tot}[i]/S_p[i] \rceil} C_{switch}[i, j]    (3.2)

The time wasted on fetching unnecessary data is bounded by the cost of the
last prefetch operation:

C_{waste}[i] \le \frac{S_p[i]}{R_{tr}[i]}    (3.3)

Therefore, the total disk resource (in time) consumed to access all streams is
bounded by:

C_{total} = \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} + C_{waste}[i] + C_{switch}^{tot}[i] \right)
          \le \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} + \frac{S_p[i]}{R_{tr}[i]} + \sum_{1 \le j \le \lceil S_{tot}[i]/S_p[i] \rceil} C_{switch}[i, j] \right)    (3.4)

The expected average I/O switch time (including seek and rotation) for a
concurrent I/O workload can be derived from the disk drive characteristics and
the workload concurrency. We consider the seek time and the rotational delay below.

• I/O scheduling algorithms such as Cyclic-SCAN reorder outstanding
I/O requests based on their data locations and schedule the I/O request clos-
est to the current disk head location. Suppose constant C is the size of
the contiguous portion of the disk where the application dataset resides.
When the disk scheduler can choose from n concurrent I/O requests at uni-
formly random disk locations, the inter-request seek distance D_{seek} follows
the following distribution:

\Pr[D_{seek} \ge x] = \left( 1 - \frac{x}{C} \right)^n    (3.5)

The expected seek time can then be obtained by combining Equation (3.5)
with the disk's mapping from seek distance to seek time. (Figure 3.2(B)
shows an example of this mapping for two disk drives.)
• The expected rotational delay of an I/O workload is the mean rotational
time between two random track locations (i.e., the time it takes the disk to
spin half a revolution).
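As a concrete check of Equation (3.5), the expected seek time can be estimated by Monte Carlo simulation: the distance to the closest of n uniformly placed requests has exactly the tail distribution above. The linear seek-time mapping below is a hypothetical stand-in for a measured seek profile, not data from a real drive:

```python
import random

def expected_seek_time(n, dataset_span, seek_time, samples=50_000):
    """Estimate the expected seek time when the scheduler picks the closest
    of n outstanding requests at uniformly random locations, which matches
    Pr[D_seek >= x] = (1 - x/C)^n from Equation (3.5)."""
    total = 0.0
    for _ in range(samples):
        # Distance from the head to the closest of n uniform request locations.
        d = min(random.uniform(0, dataset_span) for _ in range(n))
        total += seek_time(d)
    return total / samples

# Hypothetical linear mapping: fixed initiation cost plus a distance term
# (d expressed as a fraction of the disk span), per Section 3.2.2.
linear_seek = lambda d: 0.001 + 0.008 * d
```

With the identity mapping, the estimate converges to C/(n + 1), the mean of the distribution above; higher concurrency n yields a shorter expected seek, consistent with the discussion of runtime depth adaptation later in this chapter.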
The above result suggests that the expected average I/O switch time of a
concurrent workload is independent of the prefetching scheme (optimal or not)
being employed. Instead, it is related to the I/O concurrency n. For a specific
workload with I/O concurrency N, we denote the expected average I/O switch
cost as C_{switch}^{avg}. Therefore, Equation (3.1) can be simplified to:

C_{total}^{opt} = \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} \right) + N \cdot C_{switch}^{avg}    (3.6)
And Equation (3.4) can be simplified to:

C_{total} \le \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} + \frac{S_p[i]}{R_{tr}[i]} \right) + \sum_{1 \le i \le N} \left\lceil \frac{S_{tot}[i]}{S_p[i]} \right\rceil \cdot C_{switch}^{avg}
          < \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} + \frac{S_p[i]}{R_{tr}[i]} \right) + \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{S_p[i]} + 1 \right) \cdot C_{switch}^{avg}    (3.7)

Based on Equations (3.6) and (3.7), we find that:

if S_p[i] = C_{switch}^{avg} \cdot R_{tr}[i], then C_{total} < 2 \cdot C_{total}^{opt}    (3.8)
Equation 3.8 allows us to design a prefetching strategy with bounded worst-
case performance:
When the prefetching depth is equal to the amount of data that can be
sequentially transferred within the average time of a single I/O switch,
the total disk resource consumption of an I/O workload is at most twice
that of the optimal offline prefetching strategy.
In the meantime, the I/O throughput of a prefetching strategy that follows the
above guideline is at least half of the throughput of the optimal offline prefetching
strategy. Competitive strategies commonly refer to solutions whose performance
can be shown to be no worse than some constant factor of an optimal offline
strategy. We follow such convention to name our prefetching strategy competitive
prefetching.
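The guideline can be illustrated with a small cost model built directly from Equations (3.1) and (3.4); the stream lengths and device parameters below are hypothetical:

```python
import math

def total_time(stream_lengths, depth, switch_cost, transfer_rate):
    """Disk time under a fixed prefetching depth (the Equation (3.4) bound):
    useful transfer + wasted last prefetch + one switch per prefetch."""
    return sum(length / transfer_rate              # useful data transfer
               + depth / transfer_rate             # wasted-data bound (Eq. 3.3)
               + math.ceil(length / depth) * switch_cost
               for length in stream_lengths)

def optimal_time(stream_lengths, switch_cost, transfer_rate):
    """Equation (3.1): one I/O switch per stream, no wasted transfer."""
    return sum(length / transfer_rate + switch_cost
               for length in stream_lengths)

# Hypothetical device: 40 MB/s sequential transfer, 10 ms per I/O switch.
rate, switch = 40e6, 0.010
competitive_depth = switch * rate   # Equation (3.8): data per switch time
```

For any set of stream lengths, `total_time` at the competitive depth stays below twice `optimal_time`, whereas a much smaller depth inflates switch costs and a much larger one inflates wasted transfer.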
3.3.2 Optimality of 2-Competitiveness
In this section, we show that no transparent OS-level online prefetching strat-
egy (without prior knowledge of the sequential stream length Stot[i] or any other
application-specific information) can achieve a competitiveness ratio lower than
2. Intuitively, we can reach the above conclusion since:
• Short sequential streams demand conservative prefetching, while long streams
require aggressive prefetching.
• No online prefetching strategy can unify the above conflicting goals with a
competitiveness ratio lower than 2.
We prove the above by contradiction. Assume there exists an α-competitive
solution X (where 1 < α < 2). In the derivation of the contradiction, we assume
the I/O switch cost C_{switch} is constant. (This assumption is safe because a
contradiction under the constrained disk system model entails a contradiction under
the more general disk system model: if an α-competitive solution does not exist
under the constrained model, then it certainly does not exist in the general case.)
The discussion below is in the context of a single stream, so we drop the stream id
for notational simplicity (e.g., S_{tot}[i] → S_{tot}). In X, let S_p^n denote the
size of the n-th prefetch for a stream. Since X is α-competitive, we have
C_{total} ≤ α · C_{total}^{opt}.

First, we show that the size of the first prefetch operation S_p^1 satisfies:

S_p^1 < C_{switch} \cdot R_{tr}    (3.9)

Otherwise, we let S_{tot} = ε where 0 < ε < \frac{2-α}{α} \cdot C_{switch} \cdot R_{tr}. In this case,
C_{total}^{opt} = C_{switch} + \frac{ε}{R_{tr}} and ε < S_p^1. Because
C_{total} = C_{switch} + \frac{S_p^1}{R_{tr}} ≥ 2 \cdot C_{switch}, we
have C_{total} > α \cdot C_{total}^{opt}. Contradiction!

We then show that:

\forall k ≥ 2, \quad S_p^k < \left( \sum_{j=1}^{k-1} S_p^j \right) - (k-2) \cdot C_{switch} \cdot R_{tr}    (3.10)

Otherwise (if Equation (3.10) is not true for a particular k), we define T as below,
and we know T > 0:

T = S_p^k + (k - α) \cdot C_{switch} \cdot R_{tr} + (1 - α) \cdot \left( \sum_{j=1}^{k-1} S_p^j \right)

We then let S_{tot} = \left( \sum_{j=1}^{k-1} S_p^j \right) + ε where 0 < ε < \frac{T}{α}. In this case,
C_{total}^{opt} = C_{switch} + \frac{\left( \sum_{j=1}^{k-1} S_p^j \right) + ε}{R_{tr}} and
C_{total} ≥ k \cdot C_{switch} + \frac{\sum_{j=1}^{k} S_p^j}{R_{tr}}. Therefore, we have C_{total} > α \cdot C_{total}^{opt}.
Contradiction!

We next show:

\forall k ≥ 2, \quad S_p^k < S_p^{k-1}    (3.11)

We prove Equation (3.11) by induction. Equation (3.11) is true for k = 2 as a
direct result of Equation (3.10). Assume Equation (3.11) is true for all 2 ≤ k ≤
\bar{k} - 1. Together with Equation (3.9), this also means that S_p^k < C_{switch} \cdot R_{tr} for all
1 ≤ k ≤ \bar{k} - 1. Below we show Equation (3.11) is true for k = \bar{k}:

S_p^{\bar{k}} < \left( \sum_{j=1}^{\bar{k}-1} S_p^j \right) - (\bar{k} - 2) \cdot C_{switch} \cdot R_{tr}
             = S_p^{\bar{k}-1} + \sum_{j=1}^{\bar{k}-2} \left( S_p^j - C_{switch} \cdot R_{tr} \right)
             < S_p^{\bar{k}-1}    (3.12)

Our final contradiction is that for any k ≥ \frac{2 \cdot C_{switch} \cdot R_{tr} - S_p^1}{C_{switch} \cdot R_{tr} - S_p^1}, we have S_p^k < 0:

S_p^k < \left( \sum_{j=1}^{k-1} S_p^j \right) - (k-2) \cdot C_{switch} \cdot R_{tr}
      < (k-1) \cdot S_p^1 - (k-2) \cdot C_{switch} \cdot R_{tr}
      = 2 \cdot C_{switch} \cdot R_{tr} - S_p^1 - k \cdot (C_{switch} \cdot R_{tr} - S_p^1)
      ≤ 2 \cdot C_{switch} \cdot R_{tr} - S_p^1 - (2 \cdot C_{switch} \cdot R_{tr} - S_p^1)
      = 0    (3.13)
3.3.3 Accommodation for Random-Access Workloads
Competitive strategies only address the worst-case performance. A general-
purpose operating system prefetching policy should deliver high average-case per-
formance for common application workloads. In particular, many applications
only exhibit fairly short sequential access streams or even completely random ac-
cess patterns. To accommodate these workloads, our competitive prefetching
policy employs a slow-start phase similar to the one commonly found in modern
operating systems such as Linux. For a new data stream, prefetching starts with a
relatively small initial depth. Upon detection of a sequential access pattern, the
depth of each additional prefetching operation is increased (often doubled) until
it reaches the desired maximum prefetching depth. In our implementation, the
maximum prefetching depth is set to the competitive prefetching depth.
The following example illustrates the slow-start phase. Let the initial prefetch-
ing depth be 16 pages (64 KB). Upon detection of a sequential access pattern, the
depth of each consecutive prefetch operation increases by a factor of two until it
reaches its maximum permitted value. For a sequential I/O stream with a com-
petitive prefetching depth of 98 pages (392 KB), a typical series of prefetching
depths should look like: 16 pages → 32 pages → 64 pages → 98 pages → 98 pages
→ · · ·
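The doubling schedule in this example can be sketched as follows (16-page initial depth and a 98-page competitive maximum, as above):

```python
def slow_start_depths(num_ops, initial=16, max_depth=98):
    """Prefetching depth (in pages) of each successive prefetch operation:
    start small, double upon detected sequential access, and cap at the
    (competitive) maximum prefetching depth."""
    depths, depth = [], initial
    for _ in range(num_ops):
        depths.append(min(depth, max_depth))
        depth *= 2
    return depths
```

Here `slow_start_depths(5)` reproduces the 16 → 32 → 64 → 98 → 98 series above.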
For our targeted sequential I/O workloads, the introduction of slow-start phase
will affect the competitiveness of our proposed technique. In the rest of this
subsection, we extend the previous competitiveness result when a slow-start phase
is employed by the prefetching policy. Assume the slow-start phase consists of at
most P prefetching operations (3 in the above example). Also let the total amount
of data fetched during a slow-start phase be bounded by T (112 pages in the above
example). The total disk resource consumed (Equation (3.7)) becomes:
Ctotal
≤∑
1≤i≤N
(
Stot[i]
Rtr[i]+
Sp[i]
Rtr[i]
)
+∑
1≤i≤N
(
⌈Stot[i] − T
Sp[i]⌉ + P
)
· Cavgswitch
<∑
1≤i≤N
(
Stot[i]
Rtr[i]+
Sp[i]
Rtr[i]
)
+∑
1≤i≤N
(
Stot[i]
Sp[i]+ 1 + P
)
· Cavgswitch
(3.14)
Since the competitive prefetching depth S_p[i] is equal to C_{switch}^{avg} \cdot R_{tr}[i], Equa-
tion (3.14) becomes:

C_{total} < 2 \cdot \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} \right) + (2 + P) \cdot N \cdot C_{switch}^{avg}    (3.15)

Equations (3.6) and (3.15) lead to:

C_{total} < 2 \cdot C_{total}^{opt} + P \cdot N \cdot C_{switch}^{avg}    (3.16)
Consequently, the prefetching strategy that employs a slow-start phase is still
competitive, but with an additive constant term (P \cdot N \cdot C_{switch}^{avg}). Despite this weaker
analytical result, our experimental results in Section 3.5 show that the new strategy
still delivers at least half the emulated optimal performance in practice.
3.4 Implementation Issues
We have implemented the proposed competitive prefetching technique in the
Linux 2.6.10 kernel. In our implementation, we use the competitive prefetching
policy to control the depth of prefetching operations. In the default Linux kernel,
the maximum prefetch depth is constrained by the constant VM_MAX_READAHEAD
defined in the header file mm.h. We make the maximum prefetch depth a global
variable whose value may be changed during application execution. We have not
modified other aspects of the prefetching algorithm used in the Linux kernel; in
particular, the slow-start phase of the default kernel is kept to accommodate
random-access workloads.
To control the maximum prefetch depth, our competitive prefetching policy
depends on several storage device characteristics, including the functional map-
ping from seek distance to seek time (denoted by f_{seek}) and the mapping from
data location to sequential transfer rate (denoted by f_{transfer}).
The disk drive characteristics can be evaluated offline, at disk installation time
or OS boot time. Using the two mappings, we determine the runtime C_{switch}^{avg} and
R_{tr}[i] in the following way:
• During runtime, our system tracks the seek distances of past I/O opera-
tions and determines their seek times using f_{seek}. We use an exponentially-
weighted moving average of the past disk seek times to estimate the seek
component of the next I/O switch cost. The rotational delay component
of C_{switch}^{avg} is estimated as the average rotational delay between two
random track locations (i.e., the time it takes the disk to spin half a revo-
lution).

• For each I/O request starting at logical disk location L_{transfer}, the data
transfer rate is determined as f_{transfer}(L_{transfer}).
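The runtime estimation above might be sketched as follows; the smoothing factor `alpha` and the particular seek mapping are illustrative assumptions, not values from the actual implementation:

```python
class SwitchCostEstimator:
    """Estimate the average I/O switch cost as an exponentially-weighted
    moving average of recent seek times plus the half-revolution rotational
    delay (a sketch of the runtime estimation described in the text)."""

    def __init__(self, f_seek, half_revolution_s, alpha=0.3):
        self.f_seek = f_seek                  # seek distance -> seek time (s)
        self.half_revolution_s = half_revolution_s
        self.alpha = alpha                    # assumed smoothing factor
        self.ewma_seek = None

    def observe(self, seek_distance):
        """Fold the seek time of one completed I/O into the moving average."""
        t = self.f_seek(seek_distance)
        self.ewma_seek = t if self.ewma_seek is None else (
            self.alpha * t + (1 - self.alpha) * self.ewma_seek)

    def switch_cost(self):
        """Current estimate of the average I/O switch cost, in seconds."""
        return (self.ewma_seek or 0.0) + self.half_revolution_s
```

For a 10 KRPM drive, the half-revolution delay fed to the estimator would be 3 ms (60 s / 10000 revolutions / 2).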
The competitiveness analysis in the previous section targets systems with stan-
dalone disk drives. However, the analysis applies also to more complex storage
devices as long as they exhibit disk-like I/O switch and sequential data transfer
characteristics. Hence, multi-disk storage devices can utilize the proposed compet-
itive prefetching algorithm. Disk arrays allow simultaneous data transfers out of
multiple disks so that they may offer higher aggregate I/O bandwidth. However,
seek and rotational delays are still limited by individual disks. Overall, multi-disk
storage devices often expect larger competitive prefetching depths.
We include three disk-based storage devices in our experimental evaluation: a
36.4 GB IBM 10 KRPM SCSI drive, a 146 GB Seagate 10 KRPM SCSI drive, and a
RAID0 disk array with two Seagate drives using an Adaptec RAID controller. We
have also measured a RAID5 disk array with five Seagate drives, which has a lower
transfer rate than the RAID0 due to the overhead of relatively complex RAID5
processing. We keep the RAID0 device in our evaluation as a representative of
high-bandwidth devices.
We measure the storage device properties by issuing direct SCSI commands
through the Linux generic SCSI interface, which allows us to bypass the OS mem-
ory cache and selectively disable the disk controller cache. Our disk profiling takes
Figure 3.2: Sequential transfer rate and seek time for two disk drives
and a RAID. [Panel (A): sequential transfer rate (MB/sec) vs. starting logical
block number, for the Adaptec RAID0 with two Seagate drives, a single Seagate
drive, and a single IBM drive under Linux 2.6.10 and 2.6.16. Panel (B): seek time
(milliseconds) vs. seek distance, for the Seagate and IBM drives.] Most results
were acquired under the Linux 2.6.10 kernel. The transfer rate of the IBM drive
appears to be affected by the operating system software. We also show its transfer
rate on a Linux 2.6.16 kernel.
less than two minutes to complete for each disk drive/array and it could be easily
performed at disk installation time. Figure 3.2 shows the measured disk charac-
teristics (sequential transfer rate and the seek time). Most measurements were
performed on a Linux 2.6.10 kernel. We find that the transfer rate of the IBM drive
is constrained by a bottleneck (at 37.5 MB/sec, or 300 Mbps). We suspect a
software-related bug in the disk driver of the Linux 2.6.10 kernel; a more recent
Linux kernel (2.6.16) does not exhibit this problem.
For the IBM drive (under Linux 2.6.10 kernel), the transfer speed is around
37.3 MB/sec; the average seek time between two independently random disk lo-
cations is 7.53 ms; and the average rotational delay is 3.00 ms. For this drive, the
average competitive prefetching depth—the amount of data that can be sequentially
transferred within the average time of a single disk seek and rotation—is 393 KB. For
the Seagate drive and the RAID0 disk array, measured average sequential transfer
rates are 55.8 MB/sec and 80.9 MB/sec respectively. With the assumption that
the average seek time of the disk array is the same as that of a single disk drive
(7.98 ms), the average competitive prefetching depths for the Seagate drive and
the disk array are 446 KB and 646 KB respectively. In comparison, the default
maximum prefetching depth in Linux 2.6.10 is only 128 KB.
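The 393 KB figure for the IBM drive follows directly from the guideline of Equation (3.8); a quick check (treating 1 MB as 1000 KB):

```python
def competitive_depth_kb(avg_seek_ms, avg_rotation_ms, transfer_mb_per_s):
    """Competitive prefetching depth: the amount of data that can be
    sequentially transferred within one average I/O switch (seek plus
    rotation) time. Assumes 1 MB = 1000 KB for the unit conversion."""
    switch_s = (avg_seek_ms + avg_rotation_ms) / 1000.0
    return switch_s * transfer_mb_per_s * 1000.0   # (MB/s * s) reported in KB

# IBM drive: 7.53 ms average seek + 3.00 ms rotation at 37.3 MB/s -> ~393 KB.
depth = competitive_depth_kb(7.53, 3.00, 37.3)
```

The same function with a larger switch time or a faster transfer rate reproduces the larger depths quoted for the Seagate drive and the RAID0 array.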
Note that the above average competitive prefetching depths are calculated
based on the average seek time between two independently random disk locations
(i.e., when there are two concurrent processes). The average seek time is smaller at
higher concurrency levels, since the disk scheduler can choose from more concurrent
requests for seek reduction; thus the runtime competitive prefetching depth
should be lower in those cases. We use the average competitive prefetching depth
as a reference for implementing aggressive prefetching (Section 3.5.2).
3.5 Experimental Evaluation
We assess the effectiveness and competitiveness of the proposed technique
in this section. In our experimental evaluation, we compare a Linux kernel ex-
tended with our competitive prefetching policy against the original Linux kernel
and a Linux kernel employing an aggressive prefetching algorithm. Anticipatory
scheduling is enabled across all experiments and kernel configurations. Our main
experimental platform is a machine equipped with two 2 GHz Intel Xeon pro-
cessors, 2 GB memory, and three storage drives. Figure 3.2 presents sequential
transfer rate and seek time characteristics of the three storage drives.
The performance of competitive prefetching is affected by several factors: 1)
virtualized or distributed I/O architecture that may limit the effectiveness of
anticipatory I/O scheduling; 2) the storage device characteristics; and 3) serial
vs. concurrent workloads. Section 3.5.2 presents the base-case performance of
concurrent workloads using the standalone IBM drive. Then we explicitly examine
the impact of each of these factors (sections 3.5.3, 3.5.4, and 3.5.5). Finally,
section 3.5.6 summarizes the experimental results.
3.5.1 Evaluation Benchmarks
Our evaluation benchmarks consist of a variety of I/O-intensive workloads,
including four microbenchmarks, two common file utilities, and four real appli-
cations. We mostly focus on server workloads. Some non-server workloads are
included to show that they can also benefit from the competitive prefetching
strategy.
Each of our microbenchmarks is a server workload involving a server and a
client. The server provides access to a dataset of 6000 4 MB disk-resident files.
Upon arrival of a new request from the client, the server spawns a thread, which
we call a request handler, to process it. Each request handler reads one or more
of the disk files in a certain pattern. The microbenchmark suite consists of four
microbenchmarks that are characterized by different access patterns. Specifically,
the microbenchmarks differ in the number of files accessed by each request handler
(one to four), the portion of each file accessed (the whole file, a random portion,
or a 64 KB chunk) and a processing delay (0 ms or 10 ms). We use a descriptive
naming convention to refer to each benchmark. For example, <Two-Rand-0>
describes a microbenchmark whose request handler accesses a random portion of
two files with 0 ms processing delay. The random portion starts from the beginning
of each file. Its size is evenly distributed between 64 KB and 4 MB. Accesses within
a portion of a file are always sequential in nature. The specifications of the four
microbenchmarks are:
• One-Whole-0: Each request handler randomly chooses a file and it repeat-
edly reads 64 KB data blocks until the whole file is accessed.
• One-Rand-10: Each request handler randomly chooses a file and it repeat-
edly reads 64 KB data blocks from the beginning of the file up to a random
total size (evenly distributed between 64 KB and 4 MB). Additionally, we
add a 10 ms processing delay at four random file access points during the

processing of each request. The processing delays are used to emulate pos-
sible delays during request processing and may cause the anticipatory I/O
scheduler to timeout.
• Two-Rand-0: Each request handler alternates reading 64 KB data blocks
sequentially from two randomly chosen files. It accesses a random portion
of each file from the beginning. This workload emulates applications that
simultaneously access multiple sequential data streams.
• Four-64KB-0: Each request handler randomly chooses four files and reads
a 64 KB random data block from each file.
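As a concrete illustration, the read loop shared by these request handlers can be sketched as below; the byte counting stands in for the actual 64 KB read() calls on a dataset file, and the helper name is ours, not part of the benchmark source.

```c
#define BLOCK_SIZE (64 * 1024)   /* read granularity used by all handlers */

/* One-Whole-0-style handler: consume a file in 64 KB blocks until the
 * whole file has been accessed. Returns the number of read calls issued;
 * a real handler would issue read(fd, buf, BLOCK_SIZE) at each step. */
static long handle_request_whole(long file_size)
{
    long offset = 0, reads = 0;
    while (offset < file_size) {
        long n = file_size - offset;
        if (n > BLOCK_SIZE)
            n = BLOCK_SIZE;
        offset += n;
        reads++;
    }
    return reads;
}
```

For the 4 MB dataset files, each request therefore issues 64 sequential 64 KB reads; the Rand variants simply stop the loop at a randomly drawn total size.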
We include two common file utilities and four real applications in our evalua-
tion:
• cp: cp reads a source file chunk by chunk and copies the chunks into a
destination file. We run cp on files of different sizes. We use cp5 and cp50
to represent test runs of cp on files of 5 MB and 50 MB respectively.
• BSD diff: diff is a utility that compares the content of two text files.
BSD diff alternately accesses two files in making the comparison and it is
more scalable than GNU diff [GNU06]. We have ported BSD diff to our
experimental platform. Again, we run it on files of different sizes. diff5
works on 2 files of 5 MB each and diff50 works on 2 files of 50 MB each.
• Kernel build: The kernel build workload includes compiling and linking a
Linux 2.6.10 kernel. The kernel build involves reading from and writing to
many small files.
• TPC-H: We evaluate a local implementation of the TPC-H decision support
benchmark [TPCH] running on the MySQL 5.0.17 database. The TPC-H
workload consists of 22 SQL queries with 2 GB database size. The maximum
database file size is 730 MB, while the minimum is 432 bytes. Each query
accesses several database files with mixed random and sequential access
patterns.
• Apache hosting media clips: We include the Apache web server in our evalu-
ation. Typical web workloads often contain many small files. Since our work
focuses on applications with substantially sequential access pattern, we use
a workload containing a set of media clips, following the file size and access
distribution of the video/audio clips portion of the 1998 World Cup work-
load [AJ99]. About 9% of files in the workload are large video clips while
the rest are small audio clips. The overall file size range is 24 KB–1418 KB
with an average of 152 KB. The total dataset size is 20.4 GB. During the
tests, individual media files are chosen in the client requests according to a
Zipf distribution.
• Index searching: Another application we evaluate is a trace-driven index
searching server using a dataset we acquired from the Ask Jeeves search
engine [Ask]. The dataset supports search on 12.6 million web pages. It
includes a 522 MB mapping file that maps MD5-encoded keywords to proper
locations in the search index. The search index file itself is approximately
18.5 GB, divided into 8 partitions. For each keyword in an input query,
a binary search on the mapping file returns the location and size of this
keyword’s matching web page identifier list in the main dataset. The dataset
is then accessed following a sequential access pattern. Multiple sequential
I/O streams on the dataset are accessed alternately for each multi-keyword
query. The search query words in our test workload are based on a trace
recorded at the Ask Jeeves site in summer 2004.
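The Zipf-based file selection used by the media-clip client can be sketched by inverting the cumulative distribution; the exponent of 1 (probability of rank i proportional to 1/i) is our assumption here, since the workload description does not state the Zipf parameter.

```c
/* Pick a file rank in 1..nfiles under a Zipf(1) popularity distribution,
 * given a uniform random draw u in [0, 1). Rank 1 is the most popular
 * file. The O(nfiles) scan is kept simple for clarity; a real client
 * would precompute the cumulative table once. */
static int zipf_pick(int nfiles, double u)
{
    double h = 0.0, c = 0.0;
    int i;
    for (i = 1; i <= nfiles; i++)
        h += 1.0 / i;            /* harmonic-number normalizer */
    for (i = 1; i <= nfiles; i++) {
        c += (1.0 / i) / h;
        if (u < c)
            return i;
    }
    return nfiles;               /* guard against rounding when u is near 1 */
}
```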
These workloads contain a variety of different I/O access patterns. cp, kernel
build, and the Apache web server request handler follow a simple sequential access
[Figure: three panels plotting offset against virtual time — (A) TPC-H Q2 (block offset in four files), (B) TPC-H Q16 (block offset in four files), (C) index searching (page offset on the search index).]
Figure 3.3: I/O access pattern of certain application workloads. Virtual
time represents the sequence of discrete data accesses. In (A), each straight
line implies a sequential access pattern within a file. In (B), random and
non-sequential accesses make the files indistinguishable.
pattern to access whole files. BSD diff alternately accesses two files, again
reading them in their entirety. TPC-H and the index searching server access
partial files with more complex I/O access patterns. Figure 3.3 illustrates the
access pattern of two TPC-H queries (Q2 and Q16) as well as a typical index
searching request processing. We observe that the TPC-H Q2 and the index
searching exhibit alternating sequential access patterns while TPC-H Q16 has a
mixed pattern with a large amount of random accesses.
Among these workloads, the microbenchmarks, Apache web server, and in-
dex searching are considered as server-style concurrent applications. The other
workloads are considered as non-server applications. Each server-style applica-
tion may run at different concurrency levels depending on the input workload. In
our experiments, each server workload involves a server application and a load
generation client (running on a different machine) that can adjust the number of
simultaneous requests to control the server concurrency level.
3.5.2 Base-Case Performance
We assess the effectiveness of the proposed technique by first showing the base-
case performance. All the performance tests in this subsection are performed
on the standalone IBM drive. We compare application performance under the
following different kernel versions:
#1. Linux: The original Linux 2.6.10 kernel with a maximum prefetching depth
of 128 KB.
#2. CP: Our proposed competitive prefetching strategy described in Section 3.3
(with the slow-start phase to accommodate random-access workloads).
#3. AP: Aggressive prefetching with a maximum prefetching depth of 800 KB
(about twice that of average competitive prefetching).
#4. OPT: An oracle prefetching policy implemented using application disclosed
hints. We have only implemented this policy for microbenchmarks. Each
microbenchmark provides accurate data access pattern to the OS through an
extended open() system call. The OS then prefetches exactly the amount
of data needed for each sequential stream. The oracle prefetching policy is a
hypothetical approach employed to approximate the performance of optimal
prefetching.
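The hint interface used by OPT can be pictured roughly as follows; the structure layout and function names are illustrative stand-ins, not the actual extended open() signature from our implementation.

```c
/* Hypothetical hint describing one sequential stream that a request
 * handler will read; an application would pass such hints through the
 * extended open() call. */
struct access_hint {
    long offset;   /* byte offset where the sequential stream begins */
    long length;   /* exact number of bytes the stream will read */
};

/* With perfect hints, the oracle prefetches exactly the data remaining
 * in the stream beyond the current position, no more and no less. */
static long opt_prefetch_amount(const struct access_hint *h, long pos)
{
    long remaining = h->offset + h->length - pos;
    return remaining > 0 ? remaining : 0;
}
```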
We separately show the results for server-style and non-server workloads be-
cause server workloads naturally run at varying concurrency levels. Figure 3.4
illustrates the performance of server-style workloads (including four microbench-
marks, Apache hosting media clips, and index searching).
For the first microbenchmark (One-Whole-0), all practical policies (Linux, CP,
and AP) perform equally well (Figure 3.4(A)). In this case, since anticipatory
scheduling can avoid I/O switching, files are accessed sequentially. However, Fig-
ures 3.4(B) and 3.4(C) demonstrate that anticipatory scheduling is not effective
[Figure: six panels plotting application I/O throughput (MB/sec) against the number of concurrent request handlers (1–64): (A) microbenchmark One-Whole-0, (B) One-Rand-10, (C) Two-Rand-0, (D) Four-64KB-0, (E) Apache, (F) index searching; curves compare Linux, CP, AP, and (for the microbenchmarks) OPT.]
Figure 3.4: Application I/O throughput of server-style workloads (four
microbenchmarks and two real applications). The emulated optimal per-
formance is presented only for the microbenchmarks.
when significant processing time (think time) is present or when a request han-
dler accesses multiple streams alternately. In such cases, our proposed competi-
tive prefetching policy can significantly improve the overall I/O performance (16–
24% and 10–35% for One-Rand-10 and Two-Rand-0 respectively). Figure 3.4(D)
presents performance results for the random-access microbenchmark. We observe
that all policies perform similarly. Competitive and aggressive prefetching do
not exhibit degraded performance because the slow-start phase quickly identi-
fies the non-sequential access pattern and avoids prefetching a significant amount
of unnecessary data. Figures 3.4(A–D) also demonstrate that the performance
of competitive prefetching is always at least half the performance of the oracle
prefetching policy for all microbenchmarks.
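The slow-start behavior referenced above (Section 3.3) can be pictured as follows; the 16 KB initial window and 400 KB cap are illustrative stand-ins for the implementation's actual initial depth and per-device competitive depth.

```c
#define MIN_DEPTH  (16 * 1024)    /* illustrative initial window */
#define COMP_DEPTH (400 * 1024)   /* illustrative competitive depth */

/* Compute the next prefetching depth: double on each sequential hit up
 * to the competitive depth; fall back to the small initial window
 * whenever a non-sequential access is observed. */
static long next_depth(long depth, int sequential)
{
    if (!sequential)
        return MIN_DEPTH;
    depth *= 2;
    return depth > COMP_DEPTH ? COMP_DEPTH : depth;
}
```

This is why Four-64KB-0 sees no penalty: its random 64 KB accesses keep the depth pinned near the initial window, so little unnecessary data is fetched.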
Figure 3.4(E) shows the throughput of the Apache web server hosting media
clips. Each request follows a strictly sequential data access pattern on a single
file. Similarly to One-Whole-0, anticipatory scheduling minimizes the disk I/O
switch cost and competitive prefetching does not lead to further performance
improvement. Figure 3.4(F) presents the I/O throughput of the index search
server at various concurrency levels. Each request handler for this application
alternates among several sequential streams. Competitive prefetching can improve
I/O throughput by 10–28% compared to the original Linux kernel.
We use the following concurrent workload scenarios for the non-server appli-
cations (cp, BSD diff, and TPC-H). 2cp5 represents the concurrent execution
of two instances of cp5 (copying a 5 MB file). 2cp50 represents two instances
of cp50 running concurrently. 2diff5 represents the concurrent execution of
two instances of diff5. 2diff50 represents two instances of diff50 running con-
currently. cp5+diff5 represents the concurrent execution of cp5 and diff5.
cp50+diff50 represents the concurrent execution of cp50 and diff50. 2TPC-H
represents the concurrent run of the TPC-H query series Q2→Q3→ · · · →Q11
and Q12→Q13→ · · · →Q22.
[Figure: grouped bars of normalized execution time for 2cp5, 2cp50, 2diff5, 2diff50, cp5+diff5, cp50+diff50, and 2TPC-H under Linux, CP, and AP; Linux running times are 1.18 s, 3.85 s, 2.57 s, 18.56 s, 1.59 s, 11.43 s, and 19.92 m respectively.]
Figure 3.5: Normalized execution time of non-server concurrent appli-
cations. The values on top of the "Linux" bars represent the running time
of each workload under the original Linux kernel (in minutes or seconds).
Figure 3.5 presents the execution time of the seven non-server concurrent work-
loads. There is not much performance variation between different prefetching poli-
cies for the concurrent execution of cp, because with anticipatory scheduling, files
are read with almost no disk I/O switching, even under concurrent execution.
However, for 2diff5 and 2diff50 with alternating sequential access patterns,
competitive prefetching reduces the execution time by 31% and 34% respectively.
For 2TPC-H, the performance improvement achieved by competitive prefetching
is 8%.
While competitive prefetching improves the performance of several workload
scenarios, aggressive prefetching provides very little additional improvement. For
index searching, aggressive prefetching performs significantly worse than compet-
itive prefetching, since competitive prefetching maintains a good balance between
the overhead of disk I/O switch and that of unnecessary prefetching. Further-
more, aggressive prefetching risks wasting too much I/O bandwidth and memory
resources on fetching and storing unnecessary data.
3.5.3 Performance with
Process-Context-Oblivious I/O Scheduling
The experimental results in the previous section show that during the concur-
rent execution of applications with strictly sequential access patterns, anticipa-
tory scheduling avoids unnecessary I/O switch and competitive prefetching leads
to little further performance improvement. Such workloads include One-Whole-0,
Apache web server, 2cp5, 2cp50, cp5+diff5, and cp50+diff50. However, our
discussion in Section 3.2.1 suggests that anticipatory scheduling is not effective in
virtualized [JADAD06] or parallel/distributed I/O architectures [LS07] due to the
lack of knowledge of request-issuing process contexts. In this subsection, we ex-
amine this performance effect on a virtual machine platform. Specifically, we
use the Xen (version 3.0.2) [BDF+03] virtual machine monitor and the guest OS is
a modified Linux 2.6.16 kernel. We have implemented our competitive prefetching
algorithm in the guest OS.
The experiments are conducted using the standalone IBM hard drive. How-
ever, since the drive delivers higher I/O throughput under the Linux 2.6.16 kernel
than under the Linux 2.6.10 kernel (as illustrated in Figure 3.2), the absolute
workload I/O throughput shown here appears higher than that in the previous
subsection.
Figure 3.6 and Figure 3.7 show the performance of server-style and non-server
workloads on a virtual machine platform. We present results only for the targeted
workloads mentioned at the beginning of this subsection. In contrast to Fig-
ure 3.4(A), Figure 3.6(A) shows that competitive prefetching is very effective in
improving the I/O throughput of the microbenchmark (up to 71% throughput en-
hancement compared to the original Linux kernel). Similarly, comparison between
Figure 3.6(B) and Figure 3.4(E) shows that competitive prefetching improves the
I/O throughput of the Apache web server. Comparison between Figure 3.7 and
[Figure: two panels plotting application I/O throughput (MB/sec) against the number of concurrent request handlers (1–64): (A) microbenchmark One-Whole-0 (Linux, CP, AP, OPT) and (B) Apache (Linux, CP, AP).]
Figure 3.6: Performance of server-style workloads on a Xen virtual ma-
chine platform where the I/O scheduling is oblivious to request-issuing pro-
cess contexts. We present results only for workloads whose performance
differs substantially from that in the base case (shown in Figure 3.4). For
the microbenchmark, we also show the performance of an emulated optimal
prefetching.
Figure 3.5 shows that competitive prefetching also improves the I/O throughput
of four non-server workloads. In the virtual machine environment, since anticipa-
tory I/O scheduling is not effective, it is competitive prefetching that can reduce
the frequency of I/O switches and thus improve the I/O performance. In addi-
tion, Figure 3.6(A) demonstrates that the performance of competitive prefetching
is always at least half the performance of the oracle prefetching policy for the
microbenchmark.
3.5.4 Performance on Different Storage Devices
The experiments of the previous sections are all performed on a standalone
IBM disk drive. In this section, we assess the impact of different storage devices
on the effectiveness of competitive prefetching. We compare results from three
different storage devices characterized in Section 3.4: a standalone IBM drive, a
[Figure: grouped bars of normalized execution time for 2cp5, 2cp50, cp5+diff5, and cp50+diff50 under Linux, CP, and AP; Linux running times are 0.86 s, 3.29 s, 0.55 s, and 4.85 s respectively.]
Figure 3.7: Normalized execution time of non-server concurrent applica-
tions on a virtual machine platform where the I/O scheduling is oblivious
to request-issuing process contexts. We present results only for workloads
whose performance differs substantially from that in the base case (shown
in Figure 3.5). The values on top of the "Linux" bars represent the running
time of each workload under the original Linux kernel (in seconds).
standalone Seagate drive, and a RAID0 disk array with two Seagate drives using
an Adaptec RAID controller. Experiments in this section are performed on a
normal I/O architecture (not on a virtual machine platform).
We choose the workloads that already exhibit significant performance variation
among different prefetching policies on the standalone IBM drive. Figure 3.8
shows the normalized execution time of two non-server workloads 2diff5 and
2diff50. We observe that the competitive prefetching strategy is effective for all
three storage devices (up to 53% for 2diff50 on RAID0). In addition, we find
that the performance improvement is more pronounced on the disk array than on
standalone disks. This is because the disk array's higher sequential data transfer
rate yields a larger competitive prefetching depth than that of standalone disks.
Figure 3.9 shows the normalized inverse of throughput for two server applica-
tions (microbenchmark Two-Rand-0 and index searching) on the three different
[Figure: two panels of normalized execution time on the IBM disk, Seagate disk, and RAID0 under Linux, CP, and AP: (A) 2diff5 (Linux times 2.57 s, 1.32 s, and 1.08 s) and (B) 2diff50 (Linux times 18.56 s, 14.29 s, and 10.31 s).]
Figure 3.8: Normalized execution time of two non-server concurrent
workloads on three different storage devices.
storage devices. We use the inverse of throughput (as opposed to throughput
itself) so that the illustration is more comparable to that in Figure 3.8. Figure 3.9(A)
shows that competitive prefetching can improve the microbenchmark I/O through-
put by up to 52% on RAID0. In addition, the throughput of competitive prefetch-
ing is at least half the throughput of the oracle prefetching policy for all stor-
age devices. Figure 3.9(B) shows that competitive prefetching can improve I/O
throughput by 19–26% compared to the original Linux kernel. In contrast, ag-
gressive prefetching provides significantly less improvement (up to 15%) due to
wasted I/O bandwidth on fetching unnecessary data. For index searching, the
performance improvement ratio on the disk array is not larger than that on stan-
dalone disks, since the sequential access streams for this application are typically
not very long and consequently they do not benefit from the larger competitive
prefetching depth on the disk array.
3.5.5 Performance of Serial Workloads
Though our competitive prefetching policy targets concurrent sequential I/O,
we also assess its performance for standalone serially executed workloads. For the
[Figure: two panels of normalized inverse of throughput on the IBM disk, Seagate disk, and RAID0: (A) microbenchmark Two-Rand-0 (Linux, CP, AP, OPT) and (B) index searching (Linux, CP, AP).]
Figure 3.9: Normalized inverse of throughput for two server-style con-
current applications on three different storage devices. We use the inverse
of throughput (as opposed to throughput itself) so that the illustration
is more comparable to that in Figure 3.8. For the microbenchmark, we
also show an emulated optimal performance. For clarity, we only show
results at concurrency level 4; the performance at other concurrency levels
follows a similar pattern.
server applications (Apache web server and index search), we show their average
request response time when each request runs individually.
Figure 3.10 presents the normalized execution time of the non-concurrent work-
loads. Competitive prefetching has little or no performance impact on cp, kernel
build, TPC-H, and the Apache web server, since serial execution of the above
applications exhibits no concurrent I/O accesses. However, performance is im-
proved for diff and the index search server by 33% and 18% respectively. Both
diff and the index search server exhibit concurrent sequential I/O (in the form of
alternation among multiple streams). Query Q2 of TPC-H exhibits similar access
patterns (as illustrated in Figure 3.3(A)). Our evaluation (Figure 3.10) suggests
a 25% performance improvement due to competitive prefetching for query Q2.
Overall, competitive prefetching did not lead to performance degradation for any
[Figure: grouped bars of normalized execution time for cp5, cp50, diff5, diff50, kernel build, TPC-H, TPC-H Q2, Apache, and index search under Linux, CP, and AP; Linux running times are 1.02 s, 1.76 s, 1.21 s, 9.37 s, 7.65 m, 20.95 m, 4.91 s, 2.83 ms, and 718.15 ms respectively.]
Figure 3.10: Normalized execution time of serial application workloads.
The values on top of the "Linux" bars represent the running time of each
workload under the original Linux prefetching (in minutes, seconds, or
milliseconds).
of our experimental workloads.
3.5.6 Summary of Results
• Compared to the original Linux kernel, competitive prefetching can improve
application performance by up to 53% for applications with significant con-
current sequential I/O. Our experiments show no negative side effects for
applications without such I/O access patterns. In addition, competitive
prefetching trails the performance of the oracle prefetching policy by no
more than 42% across all microbenchmarks.
• Overly aggressive prefetching does not perform significantly better than
competitive prefetching. In certain cases (as demonstrated for index search),
it hurts performance due to I/O bandwidth and memory space wasted on
fetching and storing unnecessary data.
• Since anticipatory scheduling can minimize the disk I/O switch cost during
concurrent application execution, it diminishes the performance benefit of
competitive prefetching. However, anticipatory scheduling may not be effec-
tive when: 1) applications exhibit alternating sequential access patterns (as
in BSD diff, the microbenchmark Two-Rand-0, and index searching); 2)
substantial processing time exists between consecutive sequential requests
from the same process (as in the microbenchmark One-Rand-10); 3) I/O
scheduling in the system architecture is oblivious to request-issuing process
contexts (as in a virtual machine environment or parallel/distributed file sys-
tem server). In such scenarios, competitive prefetching can yield significant
performance improvement.
• Competitive prefetching can be applied to both standalone disk drives and
disk arrays that exhibit disk-like characteristics on I/O switch and sequential
data transfer. Disk arrays often offer higher aggregate I/O bandwidth than
standalone disks and consequently their competitive prefetching depths are
larger.
3.6 Related Work
The literature on prefetching is very rich. Cao et al. first proposed integrated
prefetching and caching [CFKL95], where they presented four rules that an opti-
mal prefetching and caching strategy must follow. Then they explored application-
controlled prefetching and caching [CFL94]. They proposed a two-level page re-
placement scheme that allows applications to control their memory replacement
policy, while the kernel controls the allocation of memory space among applica-
tions. Several researchers have studied application-assisted prefetching [PGG+95,
KTP+96, TPG97, LD97, YLB01]. Mowry et al. proposed compiler-assisted
prefetching for out-of-core applications [MDK96]. Patterson et al. explored a
cost-benefit model to guide prefetching decisions [PGG+95]. Their results sug-
gest a prefetching depth up to a process’ prefetch horizon under the assumption
of no disk congestion. Tomkins et al. extended the proposed cost-benefit model
for workloads consisting of multiple applications [TPG97]. The above techniques
require a-priori knowledge of the application I/O access patterns. Patterson et al.
and Tomkins et al. depend on application-disclosed hints to predict future data ac-
cesses. In contrast, our work proposes a transparent operating system prefetching
technique that does not require any application support or modification.
Previous work has also examined ways to acquire or predict I/O access pat-
terns without direct application involvement. Proposed techniques include mod-
eling and analysis of important system events [LD97, YLB01], offline application
profiling [PS04], and speculative program execution [CG99, FC03]. The reliance
on I/O access pattern prediction affects the applicability of these approaches in
several ways. Particularly, the accuracy of predicted information is not guaran-
teed and no general method exists to assess the accuracy for different application
workloads with certainty. Some of these approaches still require offline profiling
or application changes, which increases the barrier for deployment in real systems.
Besides, these studies do not address the performance problem of frequent I/O
switches exhibited in highly concurrent server systems.
Shriver et al. studied performance factors in a disk-based file system [SSS99]
and suggested that aggressive prefetching should be employed when it is safe to
assume the prefetched data will be used. More recently, Papathanasiou and Scott
argued that aggressive prefetching has become more and more appropriate due to
recent developments in I/O hardware and emerging needs of applications [PS05].
Our work builds upon this idea by providing a systematic way of controlling
prefetching depth.
Previous work has also explored performance issues associated with concurrent
sequential I/O at various system levels. Carrera and Bianchini explored disk cache
management and proposed two disk firmware-level techniques to improve the I/O
throughput of data intensive servers [CB04]. Our work shares their objective
while focusing on operating system-level techniques. Anastasiadis et al. explored
an application-level block reordering technique that can reduce server disk traffic
when large content files are shared by concurrent clients [AWC04]. Our work
provides transparent operating system level support for a wider scope of data-
intensive workloads.
Barve et al. [BKVV00] have investigated “competitive prefetching” in parallel
I/O systems with a given lookahead amount (i.e., the knowledge of a certain
number of upcoming I/O requests). Their problem model is different from ours
in at least the following respects. First, the target performance metric in their
study (the number of parallel I/O operations to complete a given workload) is
only meaningful in the context of parallel I/O systems. Second, the lookahead of
upcoming I/O requests is essential in their system model, while our study only
considers transparent OS-level prefetching techniques that require no information
about application data access pattern.
3.7 Conclusion and Discussions
To conclude, this chapter presents the design and implementation of a compet-
itive I/O prefetching technique targeting concurrent sequential I/O workloads on
data-intensive server systems. Our model and analysis show that the performance
of competitive prefetching (in terms of I/O throughput) is at least half the per-
formance of the optimal offline policy on prefetching depth selection. In practice,
the new prefetching scheme is most beneficial when anticipatory I/O scheduling
is not effective (e.g., when the application workload has an alternating sequential
data access pattern or possesses substantial think time between consecutive I/O
accesses, or when the system’s I/O scheduler is oblivious to I/O request-issuing
process contexts).
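One simplified way to see the factor-of-two bound (ignoring the slow-start phase and variation in the I/O switch cost) is the following. Let $O$ be the disk I/O switch cost, $R$ the sequential transfer rate, and set the competitive prefetching depth to $p = R\,O$, so that transferring one prefetching window takes exactly one switch cost. Serving a sequential stream of $s$ bytes, the optimal offline policy pays at least one switch plus the transfer time,

\[
T_{\mathrm{opt}} \;\ge\; O + \frac{s}{R},
\]

while competitive prefetching uses at most $\lceil s/p \rceil$ windows, each costing one switch plus one window transfer:

\[
T_{\mathrm{cp}} \;\le\; \left\lceil \frac{s}{p} \right\rceil \left( O + \frac{p}{R} \right)
 \;=\; 2\,O \left\lceil \frac{s}{p} \right\rceil
 \;\le\; 2\,O \left( \frac{s}{p} + 1 \right)
 \;=\; 2 \left( \frac{s}{R} + O \right)
 \;\le\; 2\,T_{\mathrm{opt}}.
\]

Hence the achieved I/O throughput is at least half that of the optimal offline policy.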
We implemented the proposed prefetching technique in the Linux 2.6.10 kernel
and conducted experiments using four microbenchmarks, two common file utili-
ties, and four real applications on several disk-based storage devices. Overall, our
evaluation demonstrates that competitive prefetching can improve the throughput
of real applications by up to 53% and it does not cause noticeable performance
degradation on a variety of workloads not directly targeted in this work. We also
show that the performance of competitive prefetching strategy trails that of an
oracle offline prefetching policy by no more than 42%, affirming its competitive-
ness.
In the remainder of this section, we discuss implications related to our com-
petitive prefetching policy. Prefetching and caching are two complementary OS
techniques that both make use of data locality to hide disk access latency. Un-
fortunately, they compete for memory resources against each other. Aggressive
prefetching can consume a significant amount of memory during highly concurrent
workloads. In extreme cases, memory contention may lead to significant perfor-
mance degradation because previously cached or prefetched data may be prema-
turely evicted before they are accessed. The competitiveness of our prefetching
strategy would not hold in the presence of high memory contention. A possible
method to avoid performance degradation due to prefetching during periods of
memory contention is to adaptively reduce the prefetching depth when increased
memory contention is detected, and to reinstate competitive prefetching when the
memory pressure subsides.
allocated for caching and prefetching has been previously addressed in the liter-
ature by Cao et al. [CFKL95] and Patterson et al. [PGG+95]. More recently,
Kaplan et al. proposed a dynamic memory allocation method based on detailed
bookkeeping of cost and benefit statistics [KMC02]. Gill et al. proposed a cache
management policy that dynamically partitions the cache space amongst sequen-
tial and random streams in order to reduce read misses [GM05]. Memory con-
tention can also be avoided by limiting the server execution concurrency level with
[Figure: normalized latency of an individual I/O request at workload concurrency levels 4, 16, and 64 under Linux, CP, and AP; Linux latencies are 0.085 s, 0.245 s, and 0.810 s respectively.]
Figure 3.11: Normalized latency of an individual I/O request under a
concurrent workload. The latency of an individual small I/O request is
measured with microbenchmark Two-Rand-0 running in the background.
The values on top of the "Linux" bars represent the request latency under
the original Linux prefetching (in seconds).
admission control or load conditioning [vBCZ+03, WCB01]. In addition, in the
next chapter, we propose a set of OS-level techniques that can be used to manage
the memory dedicated to prefetched data.
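A minimal sketch of such an adaptive policy, assuming some external memory-pressure signal is available (the halving and doubling steps and the bounds are our illustration, not an implemented kernel interface):

```c
/* Shrink the prefetching depth while memory is contended; grow it back
 * toward the competitive depth once the pressure is relieved. */
static long adapt_depth(long depth, int memory_pressure,
                        long min_depth, long comp_depth)
{
    if (memory_pressure) {
        depth /= 2;
        return depth < min_depth ? min_depth : depth;
    }
    depth *= 2;
    return depth > comp_depth ? comp_depth : depth;
}
```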
The proposed prefetching technique has been designed to minimize possible
negative impact on applications not directly targeted by our work. However,
some negative impact may still exist. In particular, large prefetching depths can
lead to increased latencies for individual I/O requests under concurrent workloads.
Figure 3.11 presents a quantitative illustration of this impact. In particular, at
concurrency of 64, compared with the I/O request latency in original Linux, com-
petitive prefetching causes the average I/O request latency to increase by 53%,
while aggressive prefetching increases the average request latency by one and a
half times. This result also suggests that for certain interactive applications, the
performance impact of aggressive prefetching may be unacceptable. Techniques
such as priority-based disk queues [GP98] and semi-preemptible I/O [DRC03]
can be employed to alleviate this problem. Additional investigation is needed to
address the integration of such techniques in the future.
4 Managing Prefetch Memory
4.1 Introduction
Data-intensive online servers access large disk-resident datasets while serving
many clients simultaneously. For such data-intensive server systems, the disk
I/O performance and memory utilization efficiency dominate the overall system
performance when the dataset size far exceeds the available server memory. As
described in the previous chapter, during concurrent execution, sequential data
access of one request handler (e.g., server process or thread) can be frequently in-
terrupted by I/O operations of other active request handlers in the server. Due to
the high overhead of I/O switches in disk-based storage devices, large-granularity
I/O prefetching is often employed to reduce the I/O switch frequency and thus
decrease its overhead. Consequentially, at high execution concurrency, there can
be many memory pages containing prefetched but not-yet-accessed data, which
we call prefetched pages in short.
The prefetched pages are traditionally managed together with the rest of the
memory buffer cache. Existing memory management methods generally fall into
two categories.
• Application-assisted techniques [CFL94, KTP+96, PS04, PGG+95, TPG97]
achieve efficient memory utilization with application-disclosed information
or hints on their I/O access patterns. However, the reliance on such infor-
mation affects the applicability of these approaches, and their relatively slow
adoption in production operating systems is a reflection of this problem.
Our objective in this work is to provide a transparent, OS-level memory
management scheme that does not require any explicit application assistance.
• Existing transparent memory management methods typically use access
history-based page reclamation policies, such as LRU/LFU [MGST70], Working-
Set [Den68], or CLOCK [Smi78]. Because prefetched pages do not have any
access history yet, an access recency or frequency-based page reclamation
policy tends to evict them earlier than those pages that have been accessed
before. Although MRU-like schemes may benefit prefetched pages, they perform
poorly with general application workloads, and in practice they are only
used for workloads with exclusively sequential data access patterns. Among
prefetched pages themselves, the eviction order based on traditional recla-
mation policies would actually be First-In-First-Out (FIFO), again due to
the lack of any access history. Under such management, memory contention
at high execution concurrency may result in premature eviction of prefetched
pages before they get accessed, which we call page thrashing. These pages
may have to be read again from disk to meet the application’s needs, which
can severely degrade system performance.
One reason for the lack of specific research attention on managing the prefetched
(but not-yet-accessed) memory pages is that traditionally these pages only con-
stitute a small portion of the total memory in a normal system. However, their
presence is substantial for data-intensive online servers executing at high concur-
rencies. Further, it was recently argued that more aggressive prefetching should
be supported in modern operating systems due to emerging application needs and
disk I/O energy efficiency [PS05]. Aggressive prefetching would also contribute
to a more substantial presence of prefetched memory pages in data-intensive server
systems.
In this chapter, we propose a new memory management scheme that han-
dles prefetched pages separately from other pages in the memory system. Such
a separation protects against excessive eviction of prefetched pages and allows a
customized page reclamation policy for them. We explore heuristic page reclama-
tion policies that can identify the prefetched pages least likely to be accessed in
the near future. In addition to the page reclamation policy, the prefetch mem-
ory management must also address the memory allocation between caching and
prefetching. We employ a gradient descent-based approach that dynamically ad-
justs the memory allocation for prefetched pages in order to minimize the overall
page miss rate in the system.
A number of earlier studies have investigated I/O prefetching and its memory
management in an integrated fashion [CFKL95, KMC02, PS04, PGG+95]. These
approaches can adjust the prefetching strategy depending on the memory cache
content and therefore achieve better memory utilization. Our work is exclusively
focused on the prefetch memory management with a given prefetching strategy.
Combining our results with adaptive prefetching may further improve the overall
system performance.
The rest of the chapter is organized as follows. Section 4.2 discusses the char-
acteristics of targeted data-intensive online servers and describes the existing OS
support. Section 4.3 presents the design of our proposed prefetch memory man-
agement techniques and its implementation in the Linux 2.6.10 kernel. Section 4.4
provides the performance results based on microbenchmarks and two real appli-
cation workloads. Section 4.5 reviews the related work and Section 4.6 concludes
the chapter.
4.2 Targeted Applications and Existing OS Support
As mentioned in Section 2.1, our work focuses on online servers supporting
highly concurrent workloads that access a large amount of disk-resident data.
In these servers, each incoming user request is typically serviced by a request
handler (e.g., server process or thread) that repeatedly accesses disk data and
consumes the CPU before completion. A request handler may block if the needed
resource is unavailable. While request processing consumes both disk I/O and
CPU resources, the overall server throughput is often dominated by the disk I/O
performance and the memory utilization efficiency when the application data size
far exceeds the available server memory. Figure 2.1 in Chapter 2 illustrates such
an execution environment.
For data-intensive server systems, improving the I/O efficiency can be accom-
plished by employing a large I/O prefetching depth when the application exhibits
some sequential I/O access patterns [SSS99]. A larger prefetching depth results
in less frequent I/O switches, and consequently yields fewer disk seeks per time
unit. In practice, the maximum prefetching depths of default Linux and FreeBSD
are 128 KB and 256 KB respectively during sequential file accesses. In Chapter 3,
we proposed a competitive prefetching strategy that can achieve at least half the
I/O throughput of optimal prefetching when there is no memory contention. This
strategy specifies that the prefetching depth should be equal to the amount of
data that can be sequentially transferred within a single I/O switching period.
The competitive prefetching depth is around 400 KB for several modern IBM and
Seagate disks that we measured.
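The competitive depth rule amounts to depth = sequential transfer rate × I/O switch time. With illustrative disk parameters (not the measured values from the dissertation) of roughly 50 MB/s sustained transfer and an 8 ms switch period, the rule lands near the ~400 KB figure quoted above:

```c
/* Sketch of the competitive-depth rule from Chapter 3: prefetch as
 * much data as the disk can transfer sequentially within one I/O
 * switching period.  The disk parameters used in the test below are
 * illustrative assumptions. */
static double competitive_depth_kb(double transfer_rate_mb_s,
                                   double switch_time_ms)
{
    /* depth = sequential transfer rate x I/O switch time */
    return transfer_rate_mb_s * 1024.0 * (switch_time_ms / 1000.0);
}
```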
I/O prefetching can generate significant memory demands. Such demands may
result in severe memory contention at high execution concurrencies, where many
request handlers in the server are simultaneously competing for memory resource.
Access recency or frequency-based page replacement policies are ill-suited for
handling prefetched pages because these policies fail to give proper priority to
pages that are prefetched but not yet accessed. We use the following two examples
to illustrate the problems of access history-based page replacement policies for
prefetched pages.
• Priority between prefetched pages and accessed pages: Assume page p1 is
prefetched and then accessed at t1 while page p2 is prefetched at t2 and it
has not yet been accessed. If the prefetching operation itself is not counted
as an access, then p2 will be evicted earlier under an access recency or
frequency-based policy although it may be more useful in the near future.
If the prefetching operation is counted as an access but t1 occurs after t2,
then p2 will still be evicted earlier than p1.
• Priority among prefetched pages: Due to the lack of any access history,
prefetched pages will follow a FIFO eviction order among themselves during
memory contention. Assume page p1 and p2 are both prefetched but not yet
accessed and p1 is prefetched ahead of p2. Then p1 will always be evicted
earlier under the FIFO order even if p2’s owner request handler (defined
as the request handler that initiated p2’s prefetching) has completed its
task and exited from the server, a strong indication that p2 would not be
accessed in the near future.
Under poor memory management, many prefetched pages may be evicted be-
fore their owner request handlers get the chance to access them. Prematurely
evicted pages will have to be fetched again from the disk and we call such a
phenomenon prefetch page thrashing. In addition to wasting I/O bandwidth on
makeup fetches, page thrashing further reduces the I/O efficiency. This is because
page thrashing destroys the sequential access patterns crucial for large-granularity
[Plot: application I/O throughput (MB/sec, left axis, 0–10) and prefetch page thrashing rate (pages/sec, right axis, 0–300) vs. number of concurrent request handlers (1–1000).]
Figure 4.1: Application throughput and prefetch page thrashing rate
of a trace-driven index searching server under the original Linux 2.6.10
kernel. A prefetch thrashing event is counted as a page miss on a prefetched
but then prematurely evicted memory page.
prefetching and consequently many of the makeup fetches are inefficient small-
granularity I/O operations.
Measured performance We use the Linux kernel as an example to illustrate
the performance of an access history-based memory management method. Linux
employs two LRU lists for memory buffer cache: an active list used to cache hot
pages and an inactive list used to cache cold pages. Prefetched pages are initially
attached to the tail of the inactive list. Frequently accessed pages are moved from
the inactive list to the active list. When the available memory falls below a certain
reclamation threshold, an aging algorithm moves pages from the active list to the
inactive list and cold pages at the head of the inactive list are considered first for
eviction.
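As a toy illustration of why prefetched pages end up in FIFO eviction order under this scheme, consider a deliberately minimal model of the inactive list (a drastic simplification of the real mm/vmscan.c; names and sizes are invented):

```c
#include <string.h>

/* Toy model of the inactive LRU list: prefetched pages attach to the
 * tail, and reclamation evicts from the head.  Because prefetched
 * pages carry no access history, eviction among them is pure FIFO. */
enum { NPAGES = 8 };

struct toy_lru {
    int inactive[NPAGES];   /* page ids, index 0 = head (evict first) */
    int n_inactive;
};

static void page_prefetched(struct toy_lru *l, int page)
{
    l->inactive[l->n_inactive++] = page;   /* attach to the tail */
}

static int reclaim_one(struct toy_lru *l)
{
    int victim = l->inactive[0];           /* coldest page at the head */
    memmove(l->inactive, l->inactive + 1,
            --l->n_inactive * sizeof(int));
    return victim;
}
```

In this model the page prefetched first is always evicted first, regardless of whether its owner still needs it, which is exactly the behavior the heuristics in Section 4.3.1 are designed to improve upon.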
Figure 4.1 shows the performance of a trace-driven index searching server
under the Linux 2.6.10 kernel. This kernel is slightly modified to correct several
Figure 4.2: A separate prefetch cache for prefetched pages.
performance anomalies during highly concurrent I/O (as illustrated in Chapter 5
and [SZL05]). The details about the benchmark and the experimental settings
can be found in Section 4.4.1. We show the application I/O throughput and
page thrashing rate for up to 1000 concurrent request handlers in the server. At
high execution concurrency, the I/O throughput drops significantly as a result of
high page thrashing rate. Particularly at the concurrency level of 1000, the I/O
throughput degrades to about 25% of its peak performance while the prefetch
page thrashing rate increases to about 318 pages/sec.
4.3 Prefetch Memory Management
We propose a new memory management scheme with the aim to reduce prefetch
page thrashing rate and improve the performance for data-intensive server sys-
tems. Our framework is based on a separate prefetch cache to manage prefetched
but not-yet-accessed pages (as shown in Figure 4.2). The basic concept of prefetch
cache was employed earlier by Papathanasiou and Scott [PS04] to manage the
prefetched memory pages for energy-efficient bursty disk I/O. Prefetched pages
are initially placed in the prefetch cache. When a page is referenced, it is moved
from the prefetch cache to the normal memory buffer cache and is thereafter
controlled by the kernel’s default page replacement policy. In the presence of
high memory pressure, some not-yet-accessed pages in the prefetch cache may be
moved to the normal buffer cache for possible eviction or reclamation.
The separation of prefetched pages from the rest of the memory system pro-
tects against excessive eviction of prefetched pages and allows a customized page
reclamation policy for them. Like many of the other previous studies [CFL94,
KTP+96, PGG+95, TPG97], Papathanasiou and Scott’s prefetch cache manage-
ment [PS04] requires application hints on their I/O access patterns. The reliance
on such information affects the applicability of these approaches and our objective
is to provide transparent memory management that does not require any appli-
cation assistance. The design of our prefetch memory management addresses the
reclamation order among prefetched pages (Section 4.3.1) and the partitioning
between prefetched pages and the rest of the memory buffer cache (Section 4.3.2).
Section 4.3.3 describes our implementation in the Linux 2.6.10 kernel.
4.3.1 Page Reclamation Policy
Previous studies pointed out that the offline optimal page replacement rule for
a prefetching system is that every prefetch should discard the page whose next
reference is farthest in the future [Bel66, CFKL95]. Such a rule was introduced
for minimizing the execution time of a standalone application process. We adapt
such a rule in the context of optimizing the I/O throughput of data-intensive
online servers. In our case, a large number of server request handlers execute
concurrently and each request handler runs for a relatively short period of time.
Because request handlers in an online server are independent from each other,
the data prefetched by one request handler are unlikely to be accessed by other
request handlers in the near future. Our adapted offline optimal replacement rules
are the following:
A. If there exist prefetched pages that will not be accessed by their respective
owner request handlers, discard one of these pages first.
B. Otherwise, discard the page whose next reference is farthest in the future.
The offline optimal replacement rules can only be followed when the I/O access
patterns for all request handlers are known. This requirement is not realistic
for most applications. We explore heuristic page reclamation policies that can
approximate the above rules without such requirement. We define a prefetch
stream as a group of sequentially prefetched pages that have not yet been accessed.
We examine the following online heuristics that can be used to identify a victim
page for reclamation.
#1. Discard any page whose owner request handler has completed its task and
exited from the server.
#2. Discard the last page from the longest prefetch stream.
#3. Discard the last page from the prefetch stream whose last access is least
recent. The recency can be measured by either the elapsed wall clock time
or the CPU time of each stream’s owner request handler. Since in most cases
only the stream’s owner request handler would reference it, the CPU time
may be a better indicator of how much opportunity the stream has had to be
accessed in the past.
Heuristic #1 is designed to follow the offline optimal replacement rule A while
heuristic #2 approximates rule B. Heuristic #3 approximates both rules A and
B because a prefetch stream that has not yet been referenced for a long time
is not likely to be referenced anytime soon and it may even not be referenced
after all. Although heuristic #3 is based on access recency, it is different from
the traditional LRU policy in that it relies on the access recency of the prefetch
stream it belongs to, not the access recency of each page. Note that every page
that was referenced after being prefetched would have already been moved out of
the prefetch cache.
Heuristic #1 can be combined with either heuristic #2 or heuristic #3 to form
a page reclamation policy. In either case, heuristic #1 should take precedence since
it is guaranteed to discard pages that are unlikely to be accessed in the near
future. When applying heuristic #3, we have two different ways of identifying
the least recently used prefetch stream. Putting all these together, we may have
three page reclamation policies when a victim page needs to be identified for
reclamation:
• Longest: If there exist pages whose owner request handlers have exited
from the server, discard one of them first. Otherwise, discard the last page
from the longest prefetch stream.
• Coldest: If there exist pages whose owner request handlers have exited
from the server, discard one of them first. Otherwise, discard the last page
of the prefetch stream which has not been accessed for the longest time.
• Coldest+: If there exist pages whose owner request handlers have exited
from the server, discard one of them first. Otherwise, discard the last page
of the prefetch stream whose owner request handler has consumed the most
amount of CPU since the last time the request handler accessed the prefetch
stream.
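A sketch of how the Coldest policy's victim-stream selection might look in C follows; the struct fields and the function are hypothetical stand-ins for the kernel stream descriptors described in Section 4.3.3:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the "Coldest" policy: prefer any stream whose owner
 * request handler has exited (rule A); otherwise pick the stream
 * whose last access is least recent (an approximation of rule B). */
struct stream {
    bool owner_exited;          /* owner request handler has exited   */
    unsigned long last_access;  /* timestamp of the stream's last use */
};

/* Returns the index of the stream to take a victim page from. */
static size_t pick_victim_stream(const struct stream *s, size_t n)
{
    size_t i, coldest = 0;
    for (i = 0; i < n; i++) {
        if (s[i].owner_exited)
            return i;                       /* exited owner wins outright */
        if (s[i].last_access < s[coldest].last_access)
            coldest = i;                    /* least recently accessed */
    }
    return coldest;
}
```

Coldest+ would differ only in the timestamp: it compares CPU time consumed by each owner since its last access to the stream, rather than wall clock time.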
4.3.2 Memory Allocation for Prefetched Pages
The memory allocation between prefetched pages and the rest of the memory
buffer cache can affect the memory utilization efficiency. Too small a prefetch
cache would not sufficiently reduce the prefetch page thrashing. Too large a
prefetch cache, on the other hand, would result in many page misses on the normal
memory buffer cache. To achieve high performance, we aim to minimize the
combined prefetch miss rate and the cache miss rate. A prefetch miss is defined as
a miss on a page which was prefetched and then prematurely evicted without being
accessed while a cache miss is a miss on a page that has already been accessed at
the time of last eviction.
Previous studies [KMC02, TSW92] have proposed to use histograms of page
hits to adaptively partition the memory between prefetching and caching or among
multiple processes. In particular, based on Thiebaut et al.’s work [TSW92], the
minimal overall page miss rate can be achieved when the current miss-rate deriva-
tive of the prefetch cache as a function of its allocated size is the same as the
current miss-rate derivative of the normal memory buffer cache. This indicates
an allocation point where there is no benefit of either increasing or decreasing the
prefetch cache allocation. These solutions have not been implemented in practice
and implementations suggested in the previous studies require significant over-
head in maintaining page hit histograms. In order to allow a relatively lightweight
implementation, we approximate Thiebaut et al.’s result using a gradient descent-
based greedy algorithm [RN03]. Given the curve of combined prefetch miss rates
and cache miss rates as a function of the prefetch cache allocation, the gradient
descent algorithm takes steps proportional to the negative of the gradient of the
function at the current point and proceeds downhill toward the bottom [Gra].
We divide the system execution time into epochs (or time windows) and the
prefetch cache allocation is adjusted epoch by epoch based on gradient descent
on the combined prefetch and cache miss rates. In the first epoch, we randomly
increase or decrease the prefetch cache allocation by an adjustment unit. We
calculate the change of combined miss rates in each subsequent epoch. The value
of the change is used to approximate the derivative of the combined miss rates.
If the miss rate increases by a substantial margin, we reverse the adjustment of
the previous epoch (e.g., increase the prefetch cache allocation if the previous
adjustment has decreased it). Otherwise, we further the adjustment made in the
previous epoch (e.g., increase the prefetch cache allocation further if the previous
adjustment has increased it).
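The epoch-by-epoch adjustment can be sketched as follows. The adjustment unit and the "substantial margin" threshold below are illustrative assumptions, not values from the dissertation:

```c
/* Sketch of the epoch-by-epoch gradient descent allocation from
 * Section 4.3.2.  ADJUST_UNIT and MARGIN are assumed constants. */
#define ADJUST_UNIT 64       /* pages added/removed per epoch          */
#define MARGIN      0.01     /* miss-rate increase deemed substantial  */

struct pc_alloc {
    long size;               /* current prefetch cache allocation      */
    long last_step;          /* +ADJUST_UNIT or -ADJUST_UNIT           */
    double last_miss_rate;   /* combined miss rate of previous epoch   */
};

static void end_of_epoch(struct pc_alloc *a, double miss_rate)
{
    /* Reverse the previous adjustment if the combined miss rate rose
     * substantially; otherwise keep moving in the same direction. */
    if (miss_rate > a->last_miss_rate + MARGIN)
        a->last_step = -a->last_step;
    a->size += a->last_step;
    a->last_miss_rate = miss_rate;
}
```

The change in the combined miss rate between epochs serves as the approximate derivative, so the allocation walks downhill toward the point where neither growing nor shrinking the prefetch cache helps.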
4.3.3 Implementation
We have implemented the proposed prefetch memory management scheme in
the Linux 2.6.10 kernel. In the Linux kernel, page reclamation is initiated when
the number of free pages falls below a certain threshold (function
try_to_free_pages() in source file mm/vmscan.c). The OS first scans the active list to move those
pages that have not been accessed recently to the inactive list. Then the OS
scans the inactive list to find candidate pages for eviction. In our implementation,
when a page in the prefetch cache is referenced, it is moved to the inactive LRU
list of the Linux memory buffer cache. Prefetch cache reclamation is triggered
at each invocation of the global Linux page reclamation. If the prefetch cache
size exceeds its allocation, prefetched pages will be selected for reclamation until
the desired allocation is reached. We also use a kernel thread that checks
every 4 seconds whether any prefetched pages need to be reclaimed.
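The trimming logic described above might be sketched as follows; the function and its signature are hypothetical stand-ins for the in-kernel code hooked into the reclamation path:

```c
/* Hypothetical sketch of prefetch cache trimming: invoked from the
 * global page reclamation path and from a periodic kernel thread,
 * it reclaims prefetched pages until the cache shrinks back to its
 * target allocation.  reclaim_page stands in for the policy's
 * victim-eviction routine. */
static void trim_prefetch_cache(long *cache_pages, long target_pages,
                                long (*reclaim_page)(void))
{
    while (*cache_pages > target_pages) {
        reclaim_page();         /* evict one victim per the policy */
        (*cache_pages)--;
    }
}
```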
We initialize the prefetch cache allocation to be 25% of the total physical
memory size. Our gradient descent-based prefetch cache allocation requires the
information on several memory performance statistics (e.g., the prefetch miss rate
and the cache miss rate). In order to identify misses on those prematurely evicted
pages, the system maintains a bounded-length history of meta-data for recently
evicted pages along with their status at the time of eviction (e.g., whether a page
is prefetched but not-yet-accessed). In the page data structure, we also keep track
of the information on how long each page has stayed in memory before being
evicted.
We use one bit in the flags field of the page data structure to signify whether
the page is prefetched or not (in page-flags.h). For a prefetched page, another bit
will show whether the page has been accessed. The prefetch cache data structure
is organized as a list of stream descriptors. Each stream descriptor points to a
list of ordered page descriptors, with newly prefetched pages at the tail of the list.
Each stream descriptor also maintains a link to its owner process/thread (i.e.,
the process/thread that initiated the prefetching of pages in this stream) and the
timestamp showing when the prefetch stream was last accessed. The timestamp uses
either the wall clock time or the CPU execution time of the stream’s owner
process depending on the adopted recency heuristic policy. Figure 4.3 provides an
illustration of the prefetch cache data structure. In the current implementation, we
assume a request handler is a process (or thread) in a multi-processed (or multi-
threaded) server. Each prefetch stream can track the status of its owner process
through the link in the stream descriptor. When a process exits, the pages in
the associated prefetch stream will be moved to the inactive LRU list. Due to
its reliance on the process status, the current implementation cannot monitor
the status of request handlers in event-driven servers that reuse each process for
multiple request executions. In a future implementation, we can associate each
prefetch stream with its open file data structure directly. In this way, a file close
operation will also cause pages in the associated stream to become reclamation
candidates.
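One possible C rendering of the descriptors described above is given below. Field names are assumptions made for illustration; the actual patch extends kernel data structures such as struct page rather than defining new page objects:

```c
#include <sys/types.h>

/* Hypothetical rendering of the prefetch cache data structure of
 * Figure 4.3: a list of stream descriptors, each linking its pages,
 * its owner process, and a last-access timestamp. */
struct pf_page {
    struct pf_page *next;      /* next (newer) page in the stream     */
    unsigned long flags;       /* prefetched / accessed status bits   */
};

struct pf_stream {
    struct pf_stream *next;    /* next stream descriptor              */
    struct pf_page *pages;     /* ordered pages, newest at the tail   */
    pid_t owner_pid;           /* owner request handler               */
    unsigned long timestamp;   /* wall clock or owner CPU time of the
                                  stream's last access                */
};
```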
We assess the space overhead of our implementation. Since the number of
active prefetch streams in the server is relatively small compared with the number
of pages, the space consumption of stream descriptors is not a serious concern.
The extra space overhead in each page includes a 4-byte pointer to the prefetch
stream and a 4-byte timestamp, which together incur about 0.2% space overhead (8
bytes per 4 KB page). An additional space overhead is the memory used by the
eviction history for identifying misses on recently evicted pages. We can limit
this overhead by controlling the number of entries in the eviction history. A
smaller history size may be adopted at the expense of the accuracy of the memory
performance statistics. A 20-byte data structure is used to record the information
[Diagram: the prefetch cache as a list of stream descriptors, each recording an owner pid and a timestamp and pointing to an ordered list of page descriptors; pages leave the prefetch cache for the buffer cache's active/inactive page lists upon reference or reclamation.]
Figure 4.3: An illustration of the prefetch cache data structure and our
implementation in Linux.
(meta-data) about each evicted page. When we limit the eviction history size to
40% of the total number of physical memory pages, the space overhead for the
eviction history is about 0.2% of the total memory size.
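The overhead arithmetic above can be checked directly: 8 extra bytes per 4 KB page, and 20-byte history entries for 40% of all physical pages, each come to roughly 0.2% of memory:

```c
/* Checking the space-overhead figures quoted in the text. */
#define PAGE_SIZE_BYTES   4096
#define PER_PAGE_EXTRA    8      /* 4-byte stream pointer + 4-byte timestamp */
#define HISTORY_ENTRY     20     /* bytes of meta-data per evicted page      */
#define HISTORY_FRACTION  0.40   /* history entries per physical page        */

static double per_page_overhead_pct(void)
{
    return 100.0 * PER_PAGE_EXTRA / PAGE_SIZE_BYTES;              /* ~0.2% */
}

static double history_overhead_pct(void)
{
    return 100.0 * HISTORY_ENTRY * HISTORY_FRACTION / PAGE_SIZE_BYTES;
}
```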
Our implementation has added/edited a total of about 1100 lines of C code in
the Linux kernel. A kernel patch containing an implementation of the heuristic policy
Coldest is available at http://www.cs.rochester.edu/~cli/Publication/patch2.htm.
All the modified files in the Linux source tree are listed as follows.
fs/file_table.c
fs/mpage.c
fs/open.c
mm/filemap.c
mm/page_alloc.c
mm/readahead.c
mm/swap.c
mm/truncate.c
mm/vmscan.c
include/linux/fs.h
include/linux/mm.h
include/linux/mm_inline.h
include/linux/page-flags.h
include/linux/pagevec.h
include/linux/swap.h
4.4 Experimental Evaluation
We assess the effectiveness of our proposed prefetch memory management tech-
niques on improving the performance of data-intensive server systems. Experi-
ments were conducted on servers each with dual 2.0 GHz Xeon processors, 2 GB
memory, a 36.4 GB IBM 10K RPM SCSI drive, and a 146 GB Seagate 10K RPM
SCSI drive. The Linux installation is Red Hat 9 with our enhanced kernel. Note
that we lower the memory size used in some experiments to better illustrate the
memory contention. Each experiment involves a server and a load-generating
client. The client program can adjust the number of simultaneous requests and
thus controls the server concurrency level.
The evaluation results are affected by several factors, including the I/O prefetch-
ing aggressiveness, server memory size, and the application dataset size. Our
methodology is to first demonstrate the performance at a typical setting and then
explicitly evaluate the impact of various factors. We perform experiments on four
microbenchmarks and two real application workloads. Each experiment is run for
4 rounds and each round takes 120 seconds. The performance metric we report
for all workloads is the I/O throughput observed at the application level. These statistics
are acquired by instrumenting the server applications so that they return I/O
performance statistics to the client program. In addition to showing the server
I/O throughput, we also provide some results on the page thrashing statistics of
the prefetched memory. We summarize the evaluation results at the end of this
section.
4.4.1 Evaluation Benchmarks
We explore the performance of four microbenchmarks and two real applica-
tions with different I/O access patterns. All the microbenchmarks share the same
server program that accesses a dataset of 6000 4 MB disk-resident files. Upon the
arrival of each user request, the server spawns a thread to process it. The mi-
crobenchmarks differ in the number of files accessed by each request handler (one
to four) and the portion of each file accessed (the whole file, a random portion, or
a 64 KB chunk). We use the following microbenchmarks in the evaluation:
• One-Whole: Each request handler randomly chooses a file and it repeatedly
reads 64 KB data blocks until the whole file is accessed.
• One-Rand: Each request handler randomly chooses a file and it repeatedly
reads 64 KB data blocks from the file up to a random total size (evenly
distributed between 64 KB and 4 MB).
• Two-Rand: Each request handler alternates reading 64 KB data blocks from
two randomly chosen files. For each file, the request handler accesses the
same random number of blocks. This workload emulates applications that
simultaneously access multiple sequential data streams.
• Four-64KB: Each request handler randomly chooses four files and reads a
64 KB random data block from each file.
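For concreteness, a user-level sketch of the One-Rand request handler follows; the helper names are invented and the file path is supplied by the caller:

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the One-Rand request handler: open a chosen file and
 * sequentially read 64 KB blocks up to a random total size evenly
 * distributed between 64 KB and 4 MB. */
#define BLOCK   (64 * 1024)
#define MAXSIZE (4 * 1024 * 1024)

/* Evenly distributed multiple of 64 KB in [64 KB, 4 MB]. */
static long one_rand_total_bytes(unsigned seed)
{
    return (long)(seed % (MAXSIZE / BLOCK) + 1) * BLOCK;
}

/* Returns the number of bytes read, or -1 if the file cannot be opened. */
static long handle_request(const char *path, unsigned seed)
{
    char buf[BLOCK];
    long total = one_rand_total_bytes(seed), done = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while (done < total) {
        ssize_t n = read(fd, buf, sizeof buf);  /* sequential 64 KB reads */
        if (n <= 0)
            break;
        done += n;
    }
    close(fd);
    return done;
}
```

Two-Rand would interleave such reads across two open files, and Four-64KB would read a single 64 KB block from each of four files.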
We include the following two real application workloads in the evaluation:
• Index searching: This application and its dataset are the same as the one
introduced in Section 3.5. The dataset contains a 522 MB mapping file and
an 18.5 GB search index file. For each keyword in an input query, a binary
search is first performed on the mapping file and then the search index is
accessed following a sequential access pattern. Multiple prefetching streams
on the search index are accessed for each multi-keyword query. The search
query words in our test workload are based on a trace recorded at the Ask
Jeeves online site in early 2004.
• Apache hosting media clips: We include the Apache Web server in our eval-
uation. Since our work focuses on data-intensive applications, we use a
workload containing a set of media clips, following the file size and access
distribution of the video/audio clips portion of the 1998 World Cup work-
load [AJ99]. About 9% of files in the workload are large video clips while
the rest are small audio clips. The overall file size range is 24 KB–1418 KB
with an average of 152 KB. The total dataset size is 20.4 GB. During the
tests, individual media files are chosen in the client requests according to a
Zipf distribution. A random-size portion of each chosen file is accessed.
We summarize our benchmark statistics in Table 4.1.
4.4.2 Microbenchmark Performance
We assess the effectiveness of the proposed techniques by comparing the server
performance under five kernel versions using different memory management meth-
ods for prefetched pages:
#1. AccessHistory: We use the original Linux 2.6.10 to represent kernels that
manage the prefetched pages along with other memory pages using an ac-
cess history-based LRU reclamation policy. Note that a more sophisticated
Benchmark/workload           Whole-file  Streams per  Memory        Total data  Min–mean–max size of
                             access?     request      footprint     size        sequential access streams
Microbenchmark: One-Whole    Yes         Single       1 MB/request  24.0 GB     4 MB–4 MB–4 MB
Microbenchmark: One-Rand     No          Single       1 MB/request  24.0 GB     64 KB–2 MB–4 MB
Microbenchmark: Two-Rand     No          Multiple     1 MB/request  24.0 GB     64 KB–2 MB–4 MB
Microbenchmark: Four-64KB    No          N/A          1 MB/request  24.0 GB     N/A
Index searching              No          Multiple     Unknown       19.0 GB     Unknown
Apache hosting media clips   No          Single       Unknown       20.4 GB     24 KB–152 KB–1418 KB
Table 4.1: Benchmark statistics. The column “Whole-file access?” indi-
cates whether a request handler would prefetch some data that it would
never access. No such data exists if whole files are accessed by application
request handlers. The column “Streams per request” indicates the num-
ber of prefetching streams each request handler initiates simultaneously.
More prefetching streams per request handler would create higher memory
contention.
access recency or frequency-based reclamation policy would not make much
difference for managing prefetched pages. Although MRU-like schemes may
benefit prefetched pages, they perform poorly with general application work-
loads and in practice they are only used for workloads with exclusively se-
quential data access pattern.
#2. PC-AccessHistory: The original kernel with a prefetch cache whose allo-
cation is dynamically maintained using our gradient-descent algorithm de-
scribed in Section 4.3.2. Pages in the prefetch cache are reclaimed in FIFO
order, which is the effective reclamation order for prefetched pages under
any access history-based reclamation policy.
#3. PC-Longest: The original kernel with a prefetch cache managed by the
Longest page reclamation policy described in Section 4.3.1.
CHAPTER 4. MANAGING PREFETCH MEMORY 67
#4. PC-Coldest: The original kernel with a prefetch cache managed by the Cold-
est page reclamation policy described in Section 4.3.1.
#5. PC-Coldest+: The original kernel with a prefetch cache managed by the
Coldest+ page reclamation policy described in Section 4.3.1.
The original Linux 2.6.10 exhibits several performance anomalies when sup-
porting data-intensive online servers [SZL05]. They include a mis-management of
prefetching during disk congestion and some lesser issues in the disk I/O sched-
uler. Details about these problems and some suggested fixes can be found in the
next chapter. All experimental results in this dissertation are based on corrected
kernels.
Figure 4.4 illustrates the I/O throughput of the four microbenchmarks at
all concurrency levels. The results are measured with the maximum prefetching
depth at 512 KB and the server memory size at 512 MB. Figure 4.4(A) shows con-
sistently high performance attained by all memory management methods for the
first microbenchmark (One-Whole). This is because this microbenchmark accesses
whole files and consequently all prefetched pages will be accessed. Further, for
applications with strictly sequential access pattern, the anticipatory I/O schedul-
ing allows each request handler to run without interruption. This results in a
nearly-serial execution even at high concurrency levels, therefore little memory
contention exists in the system.
Figure 4.4(B) shows the I/O throughput results for microbenchmark One-Rand.
Since this microbenchmark does not access whole files, some prefetched pages will
never be accessed. Results indicate that our proposed techniques can
improve the I/O throughput of access history-based memory management at high
execution concurrencies. In particular, the improvement achieved by PC-Coldest+
is 8–97% when the concurrency level is above 200.
[Figure 4.4: four panels, (A) One-Whole, (B) One-Rand, (C) Two-Rand, and (D) Four-64KB, each plotting application I/O throughput (in MB/sec) against the number of concurrent request handlers for AccessHistory, PC-AccessHistory, PC-Longest, PC-Coldest, and PC-Coldest+.]
Figure 4.4: Microbenchmark I/O throughput at different concurrency
levels. Our proposed prefetch memory management produces throughput
improvement on One-Rand and Two-Rand.
Figure 4.4(C) shows the performance for Two-Rand. The anticipatory schedul-
ing is not effective for this microbenchmark since a request handler in this case
accesses two streams alternately. Results show that our proposed techniques can
improve the performance of access history-based memory management at the con-
currency level of 200 or higher. For instance, the improvement of PC-Coldest+
is 34% and two-fold at the concurrency levels of 400 and 500 respectively. This
improvement is attributed to two factors: 1) the difference between PC-Coldest+
and PC-AccessHistory is attributed to the better page reclamation order inside
the prefetch cache; 2) the improvement of PC-AccessHistory over AccessHistory
is due to the separation of prefetched pages from the rest of the memory buffer
cache. Finally, we observe that the performance of PC-Coldest and PC-Longest
is not always as good as that of PC-Coldest+.
Figure 4.4(D) illustrates the performance of the random-access microbench-
mark Four-64KB. We observe that all the kernels perform similarly well at all
concurrency levels. Since prefetched pages in this microbenchmark are mostly not
used, early eviction would not hurt the server performance. On the other hand,
our memory partitioning scheme can quickly converge to small prefetch cache al-
locations through greedy adjustments and thus would not waste space on keeping
prefetched pages.
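The greedy convergence described above can be illustrated with a toy stand-in for the gradient-descent partitioning of Section 4.3.2. The function name, step size, and bounds are all illustrative assumptions; the real algorithm runs inside the kernel.

```python
def adjust_allocation(alloc, benefit, step=64, lo=0, hi=4096):
    """One greedy adjustment step for the prefetch-cache allocation
    (in pages): move in whichever direction improves the estimated
    benefit, preferring to shrink when growing does not help."""
    up = min(alloc + step, hi)
    down = max(alloc - step, lo)
    if benefit(up) > benefit(alloc):
        return up
    if benefit(down) >= benefit(alloc):
        return down
    return alloc

# For a workload whose prefetched pages are mostly unused (like
# Four-64KB), the benefit estimate is flat in the allocation size,
# so the allocation greedily shrinks toward zero.
alloc = 1024
for _ in range(16):
    alloc = adjust_allocation(alloc, benefit=lambda a: 0.0)
print(alloc)
```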
Note that despite the substantial improvement on Two-Rand and One-Rand,
our proposed prefetch memory management cannot eliminate memory contention
and the resulting performance degradation at high concurrency. Such elimination
is neither intended nor achievable. Consider a server with 1000 concurrent threads
each prefetching up to 512 KB. Just the pages pinned for outstanding I/O oper-
ations alone may occupy up to 512 MB memory space in this server.
To assess the remaining room for improvement, we approximate the offline
optimal reclamation policy through direct application assistance. In our approxi-
mation, applications provide exact hints about their data access pattern through
an augmented open() system call. With the hints, the approximated optimal pol-
icy can easily identify those prefetched but unneeded pages and evict them first
(following the offline optimal replacement rule A in Section 4.3.1). The approx-
imated optimal policy does not fully implement offline optimal replacement
rule B, because there is no way to determine the access order among pages from
different processes without the process scheduling knowledge. For the eviction
order among pages that would be accessed by their owner request handlers, the
approximated optimal policy follows that of Coldest+. Figure 4.5 shows that
[Figure 4.5: two panels, (A) One-Rand and (B) Two-Rand, each plotting application I/O throughput (in MB/sec) against the number of concurrent request handlers for AccessHistory, PC-Coldest+, and the approximated optimal policy.]
Figure 4.5: Comparison among AccessHistory, Coldest+, and an ap-
proximated optimal page reclamation policy. The throughput of Coldest+
is 10–32% lower than that of the approximated optimal while it is 8–97%
higher than that of AccessHistory at high concurrencies.
the performance of the Coldest+ approach is 10–32% lower than the approximated
optimal at high concurrency for the two tested microbenchmarks.
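At a high level, the hint-assisted approximation above orders evictions as follows. The function below is a hypothetical user-level sketch of offline-optimal rule A from Section 4.3.1, not the actual augmented open() interface or kernel code; the Coldest+ ordering is passed in as an opaque input.

```python
def eviction_order(prefetched_pages, will_access, coldest_plus_order):
    """Evict prefetched-but-unneeded pages first (rule A), then fall
    back to the Coldest+ ordering for pages that will be accessed."""
    unneeded = [p for p in prefetched_pages if p not in will_access]
    needed = [p for p in coldest_plus_order if p in will_access]
    return unneeded + needed

order = eviction_order(
    prefetched_pages=[1, 2, 3, 4],
    will_access={2, 4},          # exact hints from the application
    coldest_plus_order=[4, 3, 2, 1])
print(order)  # pages 1 and 3 (never to be accessed) come first
```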
4.4.3 Performance of Real Applications
We examine the performance of the proposed techniques for two real applica-
tion workloads. Figure 4.6 (A) shows the application-perceived I/O throughput of
the index searching server at various concurrency levels. The results are measured
with the maximum prefetching depth at 512 KB and the server memory size of
512 MB. All schemes perform similarly well at very low concurrency. We notice
that the I/O throughput initially climbs up as the concurrency level increases
from 1 to 100. This is mainly due to shorter average seek distance when the disk
scheduler can choose from more concurrent I/O requests for seek reduction. At
high concurrency levels (100 and above), the PC-Coldest+ scheme outperforms
the access history-based management by 5–52%. Again, this improvement is at-
tributed to two factors. The performance difference between PC-AccessHistory
[Figure 4.6: panel (A) plots the index searching application I/O throughput (in MB/sec) and panel (B) plots the page thrashing rate (in pages/sec), both against the number of concurrent request handlers, for the five schemes.]
Figure 4.6: I/O throughput and page thrashing rate of the index search-
ing server.
and AccessHistory (about 10%) is due to separated management of prefetched
pages while the rest is attributed to better reclamation order within the prefetch
cache.
In order to better understand the performance results, we examine the page
thrashing rate in the server. The page thrashing rate is measured as the number of
page misses on prefetched but then prematurely evicted memory pages per second.
Figure 4.6 (B) shows the result for the index searching server. We observe that the
page thrashing rate increases as the server concurrency level goes up. Particularly
at the concurrency of 800, the page thrashing rates are around 306, 285, 274, 271
and 267 pages per second for the five schemes respectively. We observe that the
increase in page thrashing rates across the five schemes matches well with the
decrease of the application I/O throughput in Figure 4.6 (A), indicating that
page thrashing is a main factor for the server performance degradation at high
concurrency.
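The thrashing metric above can be expressed as a small piece of bookkeeping. This is an illustrative user-level sketch (all names are hypothetical); in our experiments the accounting is done inside the kernel, and dividing the event count by elapsed time gives the pages-per-second rate.

```python
class ThrashingMeter:
    """Count misses on pages that were prefetched and then evicted
    before ever being accessed (premature evictions)."""
    def __init__(self):
        self.prefetched = set()   # prefetched, in memory, not yet accessed
        self.premature = set()    # evicted before the first access
        self.thrash_events = 0

    def on_prefetch(self, page):
        self.prefetched.add(page)

    def on_evict(self, page):
        if page in self.prefetched:
            self.prefetched.discard(page)
            self.premature.add(page)       # evicted too early

    def on_access(self, page):
        if page in self.prefetched:
            self.prefetched.discard(page)  # hit: the prefetch was useful
        elif page in self.premature:
            self.premature.discard(page)
            self.thrash_events += 1        # miss on a prematurely evicted page

m = ThrashingMeter()
m.on_prefetch(10); m.on_prefetch(11)
m.on_evict(10)   # page 10 leaves memory before being read
m.on_access(10)  # the later read misses: one thrashing event
m.on_access(11)  # page 11 was still cached: no thrashing
print(m.thrash_events)
```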
We notice that even under the improved management, the page thrashing rate
still increases at high concurrency levels. We measure the statistics on how long
each page has stayed in memory before being evicted. We concentrate on those
prefetched but then prematurely evicted pages which later cause page misses.
[Figure 4.7: percentage of evicted pages plotted against I/O concurrency (400, 600, and 800 concurrent request handlers) for AccessHistory, PC-AccessHistory, and PC-Coldest+.]
Figure 4.7: Percentage of prematurely evicted pages that have stayed in
memory for more than 8 seconds for the index searching server. We only
consider those prefetched but then prematurely evicted pages which later
cause page misses.
Figure 4.7 shows the percentage of those pages that stay in memory for more than
8 seconds. We observe that at high concurrency levels (i.e., 400, 600 and 800),
this percentage for the PC-Coldest+ scheme is much higher than the percentages
for the other two schemes, PC-AccessHistory and AccessHistory. Across all
three I/O concurrencies shown in the figure, the average time each such page
stays in memory is 12.44, 8.31, and 6.22 seconds for the three page reclamation
policies respectively. These results show that the proposed techniques keep
useful prefetched pages in memory much longer than both the default Linux
page reclamation policy and the FIFO policy combined with a prefetch cache.
The results also suggest that, to further reduce the page thrashing rate at high
concurrency levels, evicted pages would need to be kept in memory even longer,
which our current heuristic policies do not achieve. We leave this issue for
future consideration. However, this might not be achievable. Again,
consider a server with 1000 concurrent threads each prefetching up to 512 KB.
[Figure 4.8: application I/O throughput (in MB/sec) of the Apache Web server hosting media clips, plotted against the number of concurrent request handlers for the five schemes.]
Figure 4.8: Throughput of the Apache Web server hosting media clips.
Just the pages pinned for outstanding I/O operations alone may occupy up to
512 MB memory space in this server.
Figure 4.8 shows the application I/O throughput of the Apache Web server
hosting media clips. The results are measured with the maximum prefetching
depth at 512 KB and the server memory size at 256 MB. The four prefetch
cache-based approaches perform better than the access history-based approach
at all concurrency levels. This improvement is due to the separated management
of prefetched pages. The differences among the four prefetch cache-based
approaches are slight, but PC-Coldest+ consistently performs better than the
others, owing to its more effective reclamation of prefetched pages.
4.4.4 Impact of Factors
In this section, we explore the performance impact of the I/O prefetching
aggressiveness, the server memory size, and the application dataset size. We use
a microbenchmark in this evaluation because of its flexibility in adjusting the
[Figure 4.9: Two-Rand application I/O throughput (in MB/sec) against the number of concurrent request handlers, for PC-Coldest+ and AccessHistory at 512 KB, 256 KB, and 128 KB maximum prefetching depths.]
Figure 4.9: Performance impact of the maximum prefetching depth for
the microbenchmark Two-Rand. “PCC+” stands for PC-Coldest+ and
“AH” stands for AccessHistory.
workload parameters. When we evaluate the impact of one factor, we set the
other factors to default values: 512 KB maximum prefetching depth, 512 MB for
the server memory size, and 24 GB for the application dataset size. We only show
the performance of PC-Coldest+ and AccessHistory in the results.
Figure 4.9 shows the performance of the microbenchmark Two-Rand with
different maximum prefetching depths: 128 KB, 256 KB, and 512 KB. Various
maximum prefetching depths are achieved by adjusting the appropriate parameter
in the Linux kernel. We observe that our proposed prefetch memory management
scheme performs consistently better than AccessHistory at all configurations. The
improvement tends to be more substantial for the servers with higher prefetching
depths due to higher memory contention (and therefore more room to improve
performance).
Figure 4.10 shows the performance of the microbenchmark at different server
memory sizes: 2 GB, 1 GB, and 512 MB. Our proposed prefetch memory man-
[Figure 4.10: Two-Rand application I/O throughput (in MB/sec) against the number of concurrent request handlers, for PC-Coldest+ and AccessHistory at 2 GB, 1 GB, and 512 MB server memory sizes.]
Figure 4.10: Performance impact of the server memory size for the mi-
crobenchmark Two-Rand. “PCC+” stands for PC-Coldest+ and “AH”
stands for AccessHistory.
agement scheme performs consistently better than AccessHistory at all memory
sizes. The improvement tends to be less substantial for the servers with large
memory due to lower memory contention (and therefore less room to improve
performance).
Figure 4.11 illustrates the performance of the microbenchmark at different
workload data sizes: 1 GB, 8 GB, and 24 GB. Intuitively better performance is
correlated with smaller workload data sizes. However, the change in abso-
lute performance does not significantly affect the performance advantage of our
strategy over access history-based memory management.
4.4.5 Summary of Results
• Compared with access history-based memory management, our proposed
prefetch memory management scheme (PC-Coldest+) can improve the per-
formance of real workloads by 5–52% at high execution concurrencies. The
[Figure 4.11: Two-Rand application I/O throughput (in MB/sec) against the number of concurrent request handlers, for PC-Coldest+ and AccessHistory at 1 GB, 8 GB, and 24 GB workload data sizes.]
Figure 4.11: Performance impact of the workload data size for the mi-
crobenchmark Two-Rand. “PCC+” stands for PC-Coldest+ and “AH”
stands for AccessHistory.
performance improvement can be attributed to two factors: the separated
management of prefetched pages and enhanced reclamation policies for them.
• At high concurrencies, the performance of PC-Coldest+ is 10–32% below
an approximated optimal page reclamation policy that uses application-
provided I/O access hints.
• The microbenchmark results suggest that our scheme does not provide per-
formance enhancement for applications that follow a strictly sequential access
pattern in accessing whole files. This is because the anticipatory I/O sched-
uler executes request handlers serially for these applications, thus eliminat-
ing memory contention even at high concurrency. In addition, our scheme
does not provide performance enhancement for applications with only small-
size random data accesses since most prefetched pages will not be accessed
for these applications (and thus their reclamation order does not matter).
• The performance benefit of the proposed prefetch memory management may
be affected by various system parameters. In general, the benefit is higher
for systems with more memory contention. Additionally, our scheme does
not degrade the server performance in any of the configurations that we have
tested.
4.5 Related Work
Concurrency management. The subject of highly-concurrent online servers
has been investigated by several researchers. Pai et al. showed that the HTTP
server workload with moderate disk I/O activities can be efficiently managed by
incorporating asynchronous I/O or helper threads into an event-driven HTTP
server [PDZ99a]. A later study by Welsh et al. further improved the perfor-
mance by balancing the load of multiple event-driven stages in an HTTP request
handler [WCB01]. The more recent Capriccio work provided a user-level thread
package that can scale to support hundreds of thousands of threads [vBCZ+03].
These studies mostly targeted CPU-intensive workloads with light disk I/O activ-
ities (e.g., the typical Web server workload and application-level packet routing).
They did not directly address the memory contention problem for data-intensive
server systems at high concurrency.
Admission control. The concurrency level of an online server can be con-
trolled by employing a fixed-size process/thread pool and a request waiting queue.
QoS-oriented admission control strategies [LJ00, VG01, VTFM01, WC03] can ad-
just the server concurrency according to a desired performance level. Controlling
the execution concurrency may mitigate the memory contention caused by I/O
prefetching. However, these strategies may not always identify the optimal con-
currency threshold when the server load fluctuates. Additionally, keeping the
concurrency level too low may reduce the server resource utilization efficiency
(due to longer average disk seek distance) and lower the server responsiveness
(due to higher probability for short requests to be blocked by long requests).
Our work complements the concurrency control management by inducing slower
performance degradation when the server load increases and thus making it less
critical to choose the optimal concurrency threshold.
Memory replacement. A large body of previous work has explored efficient
memory buffer page replacement policies for data-intensive applications, such as
LRU/LFU [MGST70], Working-Set [Den68], CLOCK [Smi78], 2Q [JS94], Unified
Buffer Management [KCK+00], and LIRS [JZ02]. All these memory replacement
strategies are based on the page reference history such as reference recency or
frequency. In data-intensive server systems, a large number of prefetched but
not-yet-accessed pages may emerge due to aggressive I/O prefetching and high
execution concurrency. Due to a lack of any access history, these pages may not
be properly managed by access recency or frequency-based replacement policies.
Memory management for caching and prefetching. Cao et al. proposed
a two-level page replacement scheme that allows applications to control their own
cache replacement while the kernel controls the allocation of cache space among
processes [CFL94]. Patterson et al. examined memory management based on
application-disclosed hints on application I/O access pattern [PGG+95]. Such
hints were used to construct a cost-benefit model for deciding the prefetching and
caching strategy. Kimbrel et al. explored the performance of application-assisted
memory management for multi-disk systems [KTP+96]. Tomkins et al. further
expanded the study to multi-programmed execution of several processes [TPG97].
Hand studied application-assisted paging techniques with the goal of performance
isolation among multiple concurrent processes [Han99]. Anastasiadis et al. ex-
plored an application-level block reordering technique that can reduce server disk
traffic when large content files are shared by concurrent clients [AWC04]. Mem-
ory management with application assistance or application-level information can
achieve high (or even optimal) performance. However, the reliance on such
assistance may limit the applicability of these approaches. The goal of our work
is to provide transparent memory management that does not require any
application assistance.
Chang et al. studied the automatic generation of I/O access hints through
OS-supported speculative execution [CG99, FC03]. The purpose of their work is
to improve the prefetching accuracy. They do not address the management of
prefetched memory pages during memory contention.
Several researchers have addressed the problem of memory allocation between
prefetched and cached pages. Many such studies required application assistance or
hints [CFKL95, KTP+96, PGG+95, TPG97]. Among those that did not require
such assistance, Thiebaut et al. [TSW92] and Kaplan et al. [KMC02] employed
page hit histogram-based cost-benefit models to dynamically control the prefetch
memory allocation with the goal of minimizing the overall page fault rate. Their
solutions require expensive tracking of page hits and they have not been imple-
mented in practice. In this work, we employ a greedy partitioning algorithm that
requires much less state maintenance and thus is easier to implement.
4.6 Conclusion
This chapter presents the design and implementation of a prefetch memory
management scheme supporting data-intensive server systems. Our main con-
tribution is the proposal of novel page reclamation policies for prefetched but
not-yet-accessed pages. Previously proposed page reclamation policies are prin-
cipally based on the page access history (access recency or frequency), which are
not well suited for prefetched memory pages that have no such history. Addi-
tionally, we employ a gradient descent-based greedy algorithm that dynamically
adjusts the memory allocation for prefetched pages. Our work is guided by previ-
ous results on optimal memory partitioning while we focus on a design with low
implementation overhead. We have implemented the proposed techniques in the
Linux 2.6.10 kernel and conducted experiments using microbenchmarks and two
real application workloads. Results suggest that our approach can substantially
reduce prefetch page thrashing and improve application I/O throughput at high
concurrency levels.
5 I/O System Performance
Anomalies
5.1 Introduction
As discussed in previous chapters, operating system support for data-intensive
server systems can be improved by more efficient design and implementation in
different system components, such as file system prefetching and caching. On
the other hand, better OS support for data-intensive server systems can also be
achieved by reducing system performance bugs or anomalies.
Large and complex system software, such as an operating system, may not al-
ways behave as well as expected. We define performance bugs or anomalies as
problems in system implementation that cause the performance to deviate from
the expectation based on the design protocol or algorithm. For example, performance
anomalies may be caused by overly-simplified implementation, mis-management
of special cases, or erroneous approximations. Unlike most correctness-related
bugs, performance anomalies usually do not cause system crashes or deadlocks.
It is challenging to identify performance anomalies and pinpoint their root
causes in complex systems, because such systems support wide ranges of workloads
and performance anomalies only manifest under particular system configurations
and workload conditions. We examined closed (i.e., resolved and corrected) bugs
CHAPTER 5. I/O SYSTEM PERFORMANCE ANOMALIES 82
concerning Linux 2.4/2.6 IO/storage, file system, and memory management in the
Linux bug tracking system [LKB]. Among 219 reported bugs, only 9 are primarily
performance-related (about 4%) 1. In a previous work [SZL05], we proposed a
model-driven anomaly characterization approach and used it to discover operating
system performance bugs when supporting disk data-intensive server systems.
We constructed a whole-system I/O throughput model as the reference of expected
performance and we used statistical clustering and characterization of performance
anomalies to guide debugging. As a result, our approach helped us quickly identify
four performance anomalies in the I/O system of Linux kernel.
In this chapter, we will present the Linux I/O system performance anomalies
we identified during the previous work [SZL05] and more recent follow-up work.
Our focus is on examining the anomalies themselves and exploring their root
causes. All the anomalies are about file system prefetching and I/O scheduling.
For most anomalies, we describe the high-level algorithmic design, illustrate the
anomalous implementation, and show the scenarios where they may manifest. The
rest of this chapter is organized as follows. Section 5.2 reviews Linux file system
prefetching algorithm and I/O scheduling algorithms. Section 5.3 elaborates on
the performance anomalies in Linux I/O system implementation. Section 5.4
provides general discussions on our experience in operating system performance
debugging. Section 5.5 discusses some related work about system debugging. And
Section 5.6 concludes this chapter.
1We manually examined each reported bug and determined a bug to be primarily
performance-related if it causes significant performance degradation but it does not cause any
correctness-related symptoms like system crashes or deadlocks.
5.2 Linux I/O System Algorithms
All the anomalies covered in this chapter are about file system prefetching
and I/O scheduling. We review Linux file system prefetching algorithm and I/O
scheduling algorithms in this section.
5.2.1 Prefetching Algorithm in Linux
Prefetching or read-ahead2 means reading adjacent pages from the file before
they are actually accessed by the application. Then if prefetched pages are ac-
cessed later by the application, they are already available in the page cache in
memory. Therefore, prefetching can improve disk I/O efficiency by reducing seek
overhead for sequential accesses to disk-resident data, although it is not very useful
for random data accesses.
In practice, sequential I/O workloads are commonplace in most applications.
To support sequential disk accesses, Linux 2.6 adopts a sophisticated prefetching
algorithm which tries to detect sequential or non-sequential patterns in file ac-
cesses. To this end, the Linux prefetching algorithm manages two windows - the
current window and the ahead window, as shown in Figure 5.1. They are two sets
of pages, each corresponding to a contiguous region of the file.
The current window contains pages requested by the process or prefetched by
the kernel. All the pages in the current window have their descriptors in the
page cache, although some pages may still be unavailable (locked) due to
ongoing data transfer. The ahead window contains only those pages that are
prefetched by the kernel but not yet requested by the process. The pages in
the ahead window sequentially follow those in the current window.
When a file is opened, both windows are empty. When the process’s first read
2Read-ahead is the term used in the Linux kernel source. We prefer prefetching in our work.
Figure 5.1: Current and ahead windows in asynchronous prefetching
access to the file starts from its first page, the kernel assumes the process will
read the file sequentially. In this case, the current window is created from the
first page, and the window size depends on the number of pages requested by the
process. At this moment, the ahead window is still empty. Later on, if the kernel
detects sequential access from the process, it creates the ahead window following
the last page of the current window. The ahead window size is either two or four
times the size of the current window. The ideal case is that while the application
is accessing the cached pages in the current window, the kernel is prefetching the
pages in the ahead window. When all the pages in current window are traversed
and the process begins to read a page in the ahead window, the current window
will be replaced by the ahead window (see Figure 5.1). The ahead window will
be reset following the new current window and its size will be either two or four
times the size of the current window. But size of both windows can not be larger
than the default maximum prefetching depth of Linux, which is 32 pages. This
process is called asynchronous prefetching because the prefetching is always ahead
of the data accesses by the application.
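The window turnover described above can be sketched as follows. This is a simplification under assumed parameters: a fixed doubling growth factor (the kernel may also grow by four times) and the default 32-page cap; none of the kernel's edge cases are modeled.

```python
MAX_PREFETCH = 32  # default maximum prefetching depth, in pages

def advance_windows(cur_size, ahead_size, growth=2):
    """One window turnover: when the reader crosses into the ahead
    window, the ahead window becomes the current window and a new
    ahead window of growth x its size is created, capped at
    MAX_PREFETCH pages."""
    new_cur = ahead_size
    new_ahead = min(new_cur * growth, MAX_PREFETCH)
    return new_cur, new_ahead

cur, ahead = 4, 8   # e.g. the first sequential read requested 4 pages
sizes = []
for _ in range(4):
    cur, ahead = advance_windows(cur, ahead)
    sizes.append((cur, ahead))
print(sizes)  # window sizes ramp up and saturate at the 32-page cap
```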
On the other hand, in the following two cases, the current window and ahead
window are emptied and asynchronous prefetching is temporarily disabled.
• when the kernel detects a file access that is not sequential.
• when the process’s first read access to the file does not start from its first
page.
In these two cases, the kernel performs synchronous prefetching. Because the
requested pages are not available in the page cache, the application blocks and
waits for the I/O operations to complete. It is still prefetching, because the
kernel usually reads more pages than the application requested. Asynchronous
prefetching is re-enabled as soon as sequential file access is detected.
Therefore, with sequential file accesses, the prefetching algorithm aggressively
increases the prefetching depth (window sizes). However, the aggressiveness of
prefetching needs to be constrained by the available system resources. In general,
the size of the ahead window is shrunk when the kernel cannot find a prefetched
page in the page cache. This may happen when the prefetched page is reclaimed
by the kernel from the page cache under memory pressure.
The Linux prefetching depth can be tuned in different ways. In Linux 2.4,
system administrators may change the variable vm_max_readahead by modifying
the /proc/sys/vm/max-readahead file. In Linux 2.6, users are able to change the
prefetching depth for an open file by modifying its ra_pages field through the
posix_fadvise() system call. In both kernels, users can also change the prefetching
depth by modifying constant variables in the Linux source file mm.h.
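For instance, a user program can pass whole-file access-pattern advice through the standard posix_fadvise() interface, shown here via Python's os wrapper. Whether and how the kernel adjusts ra_pages in response depends on the kernel version; this sketch only demonstrates the call.

```python
import os
import tempfile

def advise_sequential(fd):
    """Advise the kernel that fd will be read sequentially.  With
    offset=0 and length=0 the advice covers the whole file.  Returns
    False on platforms without posix_fadvise (it is POSIX-only and
    exposed by Python 3.3+)."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return True
    return False

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 4096)
    applied = advise_sequential(fd)
finally:
    os.close(fd)
    os.unlink(path)
print("sequential advice applied:", applied)
```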
It is necessary for prefetching to avoid congesting the request queue in the
block-device layer. By default, the maximum number of pending read requests
for a block device is 128. The kernel regards the request queue as congested
when the number of pending requests is more than a certain threshold (113 by
default 3).
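The threshold arithmetic can be checked directly; the integer division below is an assumption about the kernel's rounding, not a claim about the exact source line.

```python
def congestion_threshold(queue_size=128):
    """Congestion threshold of roughly 7/8 of the queue size plus one;
    this gives 113 for the default 128-entry read request queue."""
    return queue_size * 7 // 8 + 1

print(congestion_threshold())     # 113 for the default queue size
print(congestion_threshold(64))
```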
5.2.2 I/O Scheduling Algorithms in Linux
I/O schedulers are also called elevators. They manage I/O requests for each
block device. Specifically, an I/O scheduler organizes the request queue by keeping
requests in order and merging the consecutive ones. The overall objectives include
reducing the disk seek overhead, improving the I/O throughput, and improving
fairness among I/O requests or among processes in the system. We need to note
that the processing of a specific I/O request may be delayed by the scheduler
in order to improve overall performance. Linux 2.6 supports four different I/O
schedulers explained as follows.
• Noop scheduler: Just as its name implies, this I/O scheduler does almost
nothing to each request. It simply adds each request to the request queue in
FIFO order. The only optimization the noop scheduler performs is to merge
consecutive requests when that is possible.
• CFQ scheduler: The Completely Fair Queuing scheduler aims to treat all
processes in the system fairly. In addition to the dispatch queue for
the device, there is a request queue for each process that performs I/O
operations. The scheduler works on the per-process queues in a round-robin
fashion. In each round, the scheduler selects a batch of requests from
one request queue, dispatches the requests, and then moves to the next
queue. The CFQ scheduler was designed for multimedia workloads that require
short I/O latency.
• Deadline scheduler: This scheduler was introduced to avoid starvation
when requests keep arriving in one direction of the request queue. In addition
to the dispatch queue, there are two sorted queues (read and write)
and two deadline queues (read and write). Each I/O request belongs to
both a sorted queue and a deadline (FIFO) queue. In general, the scheduler
chooses requests from a sorted queue and moves them to the dispatch queue.
The scheduler also checks the deadline queues and first schedules those
requests whose deadlines have expired.
3In the Linux source file ll_rw_blk.c, the threshold is calculated as 7/8 · Q + 1, where Q is the
maximum queue size for the block device (default 128).
• Anticipatory scheduler: This I/O scheduler was first proposed by Iyer and
Druschel [ID01] in 2001. It is widely used in Linux 2.6. Basically, after
each I/O request, the scheduler will idle the disk head for a short period of
time (even when there are pending I/O requests), so that consecutive I/O
requests that belong to the same process/thread may be serviced without
interruption.
We briefly describe how the anticipatory I/O scheduler is implemented in Linux.
The anticipatory scheduler is built on top of the deadline scheduler. It also has
two sorted queues (read and write) and two deadline queues (read and write) for
each block device. The scheduler alternately scans the sorted read and write
queues, with preference given to reads. The scheduler implements a one-way
elevator algorithm; however, a backward seek is possible when the seek distance
to a request behind the elevator is much shorter than the distance to the request
in front of it. When the deadline of an I/O request has expired, the scheduler
stops the current anticipation or scanning and services the expired request. The
scheduler services read or write requests in batches, where a batch is a collection
of I/O requests. Read batches and write batches are processed alternately.
Read and write expirations are checked only when the corresponding batch type
is underway. Anticipation occurs only when the scheduler is processing a read
batch. When a read request is completed, the scheduler checks the next request
from the sorted read queue. If the next request is from the same process that
issued the just-completed request, or if the next request is very close to the just-
completed request, it is dispatched immediately. Otherwise, the scheduler may
decide to wait for a short period of time, anticipating that another request
close to the previous one is coming up. When the anticipation timer expires, the
scheduler stops anticipating and dispatches the next read request in the sorted
read queue. The scheduler's anticipation decision is based on statistics maintained
for each process, including mean think time and mean seek distance. The kernel
source file for the anticipatory scheduler implementation is as-iosched.c. There
are five parameters in the Linux implementation of the anticipatory I/O scheduler.
They can be customized in the kernel source.
• antic_expire: The amount of time in milliseconds the scheduler will anticipate
for. According to the design, it is set to around 6 by default.
• read_expire: The deadline expiration time in milliseconds for read requests.
• read_batch_expire: The deadline expiration time in milliseconds for a read
request batch to be scheduled consecutively.
• write_expire: The deadline expiration time in milliseconds for write requests.
• write_batch_expire: The deadline expiration time in milliseconds for a write
request batch to be scheduled consecutively.
5.3 Anomalies in Linux I/O System Implementation
We show five performance anomalies we identified in the I/O system of the
Linux 2.6 kernel. Three of them are included in the previous work [SZL05] and
two are from a recent follow-up work.
5.3.1 Suppression of Prefetching
First of all, we show a performance anomaly concerning the prefetching algorithm
implementation. It causes performance losses for highly concurrent workloads
(concurrency ≥ 128) with moderately long sequential access streams (length ≥
256 KB). As noted for the kernel prefetching algorithm, it is necessary to avoid
congesting the request queue in the block-device layer. In the implementation of
Linux 2.6.10, upon submitting a prefetching request, if the read request queue is
considered congested, the prefetching request is abandoned.
The intuition behind this treatment is that asynchronous prefetching should be
disabled when the I/O device is fully scheduled. However, the same function
implementing asynchronous prefetching operations also handles synchronous I/O
operations, and the implementation does not distinguish the two kinds. That is,
when the read request queue is considered congested, all prefetching requests are
abandoned. In practice, it is harmless to discard the asynchronous I/O requests,
because the requested pages have not yet been accessed by the application. For
the synchronous I/O requests, however, things are different: the application is
waiting for the requested pages. If the synchronous I/O requests are discarded,
each requested page results in an inefficient single-page makeup I/O. In effect,
the system works as if each requested page caused a page fault.
This problem is named performance anomaly #1. To fix this anomaly, the
kernel needs to distinguish asynchronous I/O requests from synchronous ones;
only asynchronous I/O requests should be abandoned when disk congestion
occurs. We make changes to the main function implementing the prefetching
algorithm (page_cache_readahead() in mm/readahead.c). When disk congestion
occurs, the corrected kernel lets the synchronous prefetching requests proceed
and only discards asynchronous requests. Our fix coincides with the later kernel
[Figure: server I/O throughput (MB/sec) vs. I/O concurrency (1–256 concurrent ops) for the index searching workload, comparing the original Linux 2.6.10 with the #1 anomaly fix.]
Figure 5.2: Throughput of a server workload. The anticipatory I/O
scheduler is used.
change in the official Linux 2.6.11.
We show the application I/O throughput of a server workload in Figure 5.2. The
index searching application involves multiple concurrent request handlers. Each
request handler accesses one or more concurrent I/O streams alternately. In this
test, the kernel with anomaly fix #1 substantially improves I/O performance at
high execution concurrency compared with the original Linux kernel. Specifically,
the performance improvement is about 3.8 times and 3.7 times at concurrency
128 and concurrency 256, respectively. The average performance improvement
(across all tested concurrencies) of the corrected kernel over the original kernel is
25%. Overall, anomaly fix #1 helps make the application performance more
predictable at high execution concurrency. In addition, we can infer that in our
experiment disk congestion occurs when the I/O concurrency is somewhere
between 32 and 128.
[Figure: I/O throughput (MBytes/sec) vs. concurrency (32–128 concurrent ops) for microbenchmark One-64KB-0, with congestion thresholds 113 and 62.]
Figure 5.3: Throughput of a server-style microbenchmark workload.
Change of congestion threshold may cause anomalous performance deviations.
5.3.2 Threshold of Disk Congestion
As just mentioned, later Linux kernels (2.6.11 or later) have corrected
anomaly #1 by suppressing only asynchronous prefetching when the disk device
is congested. However, the concept of disk congestion is not well defined.
Currently, the somewhat arbitrary disk congestion threshold of 113 queued
I/O requests still remains, and it is not clear why this threshold was chosen. In
fact, it causes anomalous performance deviations when the threshold is crossed.
Specifically, medium-concurrency I/O may sometimes deliver much worse
performance than high-concurrency I/O.
This problem is called performance anomaly #2. We illustrate the problem
in Figure 5.3, which shows the I/O throughput of the server-style microbenchmark
One-64KB-0 running on the Linux 2.6.23 kernel. Each thread of the microbenchmark
reads a 64 KB chunk of data at a random location in a large file. We can see
that when the congestion threshold is the default 113, the throughput at
concurrency 96 and 128 is much better than at concurrency 32 and 64. Similarly,
when the congestion threshold is set to 62, the throughput at concurrency 64, 96,
and 128 is much better than at concurrency 32. Comparing the throughput at
concurrency 64, it improves by 61% when the threshold changes from 113 to 62.
This is due to the benefit of suppressing asynchronous prefetching under disk
congestion. When the threshold is 113, asynchronous prefetching is discarded at
concurrency 96 and 128; accordingly, when the threshold is 62, it is discarded at
concurrency 64, 96, and 128. The reason asynchronous prefetching may degrade
I/O throughput in this situation is that the disk is already busy and asynchronous
prefetching may waste disk resources on unwanted data. Note that each
application thread only sequentially accesses a stream of 64 KB.
To fix this anomaly, a systematic analysis may be needed to decide the best
disk congestion threshold based on device characteristics and workload
configurations. It may also make sense to make the threshold adaptive. We leave
this as a future direction.
5.3.3 Estimation of Seek Cost
This anomaly is about the anticipatory I/O scheduler. It involves workloads
at moderately high concurrency (8 and above) with stream length larger than
the maximum prefetching size (128 KB). Our investigation discovers the follow-
ing performance problem with kernel function as can break anticipation() in as-
iosched.c. The current implementation of the anticipatory scheduler will stop
an ongoing anticipation if there exists a pending I/O request with shorter seek
distance (compared with the statistical average seek distance of the anticipating
process). However, due to a significant seek initiation cost on modern disks (as
shown in Figure 3.2), the seek distance is not an accurate and reliable indication
of the seek time cost. For example, the average cost of a 0-distance seek and a
CHAPTER 5. I/O SYSTEM PERFORMANCE ANOMALIES 93
1 2 4 8 16 32 64 128 2560
5
10
15
I/O concurency (num. of concurrent ops)
Ser
ver
I/O th
roug
hput
(in
MB
/sec
)
Microbenchmark Two−Rand−0
#3 anomaly fixesOriginal Linux 2.6.10
Figure 5.4: Effect of kernel correction (for anomaly #3) on a microbench-
mark that alternatively reads random size of data in two files. The antic-
ipatory I/O scheduler is used.
2x-distance seek is much less than that of an x-distance seek. Consequentially,
the current implementation may tend to stop the anticipation when the benefit
of breaking the anticipation does not actually exceed that of continuing it.
This problem is called performance anomaly #3. We solve this problem by
using estimated seek time (instead of the seek distance) in the request anticipa-
tion cost/benefit analysis. Figure 5.4 illustrates effect of the anomaly fix #3 on
Linux 2.6.10 kernel. The average performance improvement (across all tested con-
currencies) of the corrected kernel with anomaly fix #3 is 6.4% (compared with
the original Linux kernel) for the microbenchmark Two-Rand-0.
5.3.4 Impact of System Timer Frequency
The default system timer interrupt frequency (hertz) in the Linux 2.4 kernel was
100 Hz. Starting from Linux 2.6.0, the default kernel hertz became 1000 Hz.
More recently, the default kernel hertz has been changed to 250 Hz, since Linux 2.6.13.
Although there has been some controversy over this change in the Linux
community [LKT], the default of 250 Hz remains, and it is now configurable
through the Kconfig file (kernel/Kconfig.hz) in the kernel source.
With the change of the default kernel hertz, another I/O system anomaly has
manifested. In the implementation of the anticipatory scheduler (as-iosched.c),
the anticipation expiration time is defined as follows. Note that the constant HZ
is the value of the kernel hertz.
/* max time we may wait to anticipate a read (default around 6ms) */
#define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)
With a 1000 Hz kernel hertz (HZ=1000) in early 2.6 kernels, this code correctly
computes a timeout of six ticks. However, the code is problematic once the
default kernel hertz becomes 250 after Linux 2.6.13: the value of the constant
default_antic_expire is then 1, which means the expiration time is one kernel
timer tick. In a 250 Hz system, one timer tick takes 4 milliseconds. This number
is acceptable compared with the expected 6 milliseconds, although it is a little
smaller. But in general the kernel cannot deliver every timer interrupt in exactly
one tick. Even worse, a timer timeout of only one tick could expire at any time
between 0 and 4 milliseconds after the timer is set. This problematic
implementation, with an average expiration time of 2 milliseconds, is too far from
the expected 6 milliseconds. In practice, we observed timeouts as short as 100
microseconds in our kernel trace analysis.
This problem is called performance anomaly #4. Our fix to this anomaly is
to make the value of default_antic_expire 2 whenever it would otherwise be 1,
as demonstrated in the following code.
/* max time we may wait to anticipate a read (default around 6ms) */
#define default_antic_expire ((HZ/150) ? ((HZ/150 >1)?(HZ/150):2):1)
[Figure: I/O throughput (MBytes/sec) vs. I/O concurrency (1–8 concurrent ops), comparing the original Linux 2.6.23 with the #4 anomaly fix.]
Figure 5.5: Effect of kernel correction (for anomaly #4) on a microbenchmark
that reads 256 KB randomly in large files.
We show the effect of this anomaly fix on a microbenchmark in Figure 5.5. It
demonstrates that compared with the original Linux 2.6.23 kernel, our correction
with anomaly fix #4 improves system I/O throughput by 10–13% during
concurrent executions. This anomaly also reminds us that as the kernel evolves,
changes with global impact may cause unexpected problems for descendant kernel
versions. In fact, this anomaly with the anticipation timer is not the only problem
related to the minimum value of the Linux kernel timer. The implementation of
the CFQ I/O scheduler in Linux needs a similar fix: its timer currently ensures
the timeout value is at least 1 tick, but it should be at least 2 ticks. Recently,
this performance anomaly has been accepted by the Linux kernel development
team [Bug] and the anomaly fix will be included in an upcoming Linux kernel
version.
5.3.5 Timing to Start Anticipation
As mentioned earlier, the anticipatory I/O scheduler services read or write
requests in batches, and anticipation occurs when the scheduler is processing a read
batch. According to this design, when anticipation is enabled (default_antic_expire
is not zero), only one read request is allowed to be dispatched at a time. In
contrast, many asynchronous write requests may be dispatched simultaneously.
When a read request completes, the scheduler decides whether it should
start the anticipation timer for the next read request by checking whether the
completed request is the one that was waited on. However, in current Linux
kernels, the one-read-at-a-time rule is not implemented correctly. Multiple read
requests may be dispatched to the block device at the same time for at least
the following two reasons.
• During sequential data access, Linux asynchronous prefetching may submit
extra read requests, even though there is an outstanding read request in the
dispatch queue of the block device.
• Large read requests need to be split into several smaller requests so that the
size of each request does not exceed the maximum request size acceptable to
the device. The smaller requests created from the same large request are
usually dispatched at the same time, even though the intention of anticipatory
scheduling is to keep only one outstanding request dispatched at any time.
Large requests can be created by request merging and prefetching.
In the implementation of the anticipatory I/O scheduler, when multiple read
requests are dispatched, the anticipation timer is started for the next request as
soon as the first outstanding read request completes. This is inappropriate,
because other requests are still being serviced by the device, and the disk head
position does not depend on the just-completed request. Therefore, it does not
make sense to start anticipation based on the just-completed request.
[Figure: I/O throughput (MBytes/sec) vs. I/O concurrency (1–8 concurrent ops), comparing the original Linux 2.6.23, the #5 anomaly fix, and the combined #4 and #5 anomaly fixes.]
Figure 5.6: Effect of kernel corrections on a microbenchmark that reads
256 KB randomly in large files.
This problem is called performance anomaly #5. It may cause premature
timeout and issuance of other processes' requests (often before all outstanding
requests from the current process return). We correct this anomaly (in function
as_dispatch_request() in as-iosched.c) by strictly allowing only one read request
to be dispatched at any moment: the scheduler always makes sure the number
of outstanding requests is one whenever it dispatches a new read request.
Note that multiple requests are still allowed to pass to the I/O scheduler and stay
in the sorted queues and deadline queues, but only one read request can be
dispatched at any time. Another possible fix is to start anticipation only when
the last outstanding request has returned. This performance anomaly has been
reported to the Linux kernel development team and we are trying to get it across
to them. We illustrate the effect of this anomaly fix in Figure 5.6, which shows
the I/O performance of a server-style microbenchmark on a corrected Linux 2.6.23
kernel. Each thread of the microbenchmark reads 256 KB data streams at
random offsets in randomly selected large files. It demonstrates that compared
with the original Linux 2.6.23, our correction with anomaly fix #5 improves
system I/O throughput by 13–20% during concurrent execution. When we
combine the fixes for anomalies #4 and #5, the performance improvement of the
corrected kernel is 26–37%.
5.4 Discussions
Based on our performance debugging experience in previous work, we make
the following observations on Linux I/O system performance anomalies. We
argue that by identifying and fixing I/O system performance anomalies, operating
system support for data-intensive server systems can be improved.
• General-purpose monolithic operating systems such as Linux are not always
reliable as they grow more and more complex. Performance anomalies are
not easy to identify because they are usually hidden in the system and
manifest only under specific system configurations and workload conditions.
Once identified, however, these performance anomalies are typically fairly
easy to fix in comparison with developing new algorithms or protocols to
realize certain functionality.
• Performance anomalies are not always unqualified “bugs” that have to be
corrected. Some anomalies (e.g., #2) are better described as partially
beneficial optimizations. Others (e.g., #3) are most likely intentional
simplifications whose ideal alternatives (i.e., optimal threshold setting and
accurate disk seek time estimation) are difficult to realize.
• Some performance anomalies are corrected as the system software evolves;
however, new anomalies may manifest in later software versions. In
particular, when the kernel evolves, changes with global impact may cause
unexpected problems for descendant kernel versions. The change of the default
kernel timer frequency mentioned in Section 5.3.4 is an example of such a case.
• Operating systems such as Linux are not always accurately documented;
some of their documentation and source code comments are not reliable.
The same point of view is shared in recent work by Tan et al. [TYKZ07].
Believing such documentation unreservedly may cause unexpected trouble
when dealing with performance anomalies, although in general the
documentation is still a good reference. For example, the anticipatory
scheduling documentation emphasizes that at most one read request should be
dispatched to the device at any moment, but our measurements show that it
is common to have more than one read request dispatched during several of
our microbenchmark and real-application workloads.
• Kernel event tracing tools such as LTT [YD00, LTT, LTTng] are of great use
for identifying and fixing Linux kernel performance anomalies. We extended
the existing LTT tool for our tested kernels by adding customized trace
events. This tool can conveniently trace almost all kinds of kernel events
with very low overhead. Similar tools should be helpful for other complex
systems.
• When looking for performance anomalies, we need to understand what
normal performance behaviors look like. “Normal” performance behaviors are
not always easily known or even well defined. If available, an advisory
performance model can help characterize performance anomalies and guide
debugging [SZL05]. In addition, reference executions (defined next) are very
helpful for identifying the symptoms of performance anomalies. A reference
execution runs under an execution condition similar to that of the target
(the suspected anomaly). For example, a reference execution condition might
differ from the suspected anomalous condition in only one of its ten
configuration parameters. Like a literary reference, it serves as a basis of
comparison that helps us understand the commonality and uniqueness of the
target.
• In addition to improving system performance, the identification and
correction of performance anomalies helps improve the predictability of the
target system's behaviors. Such predictability is important for automatic
system management such as resource provisioning and capacity planning
[DCA+03, ABG+01, SS05, IBM]. For instance, with anomaly fix #1, the
application performance becomes more predictable at high execution
concurrency (as shown in Figure 5.2).
5.5 Related Work
Several previous studies have explored performance problems in computer
systems and applications. Two techniques (i.e., MemSpy [MGA92] and Mtool [GH93])
were proposed to analyze application performance bottlenecks or performance
losses through program instrumentation. Perl and Weihl studied performance
assertion checking [PW93] to automate the testing of performance properties of
complex systems. Crovella and LeBlanc examined performance prediction of
parallel programs based on lost cycles analysis [CL94]. Rosenblum et al. analyzed
large and complex workloads using complete system simulation [RBDH97]. In
addition, recent performance debugging work utilized statistical analysis of
online system traces [AMW+03, CKF+02] to identify defective components of
complex systems. These techniques focused on fine-grained examination of the
target system/application in specific workload settings, but they offered little on
operating system performance anomalies and no general discussion of their
characteristics.
Some other recent studies have investigated techniques to identify non-performance
bugs in operating systems. Engler et al. provided techniques that automatically
extract system correctness rules/patterns from system source code [ECH+01].
Chou et al. studied operating system errors using automatic, static compiler
analysis applied to the Linux and OpenBSD kernels [CYC+01]. Li et al. employed
data mining techniques to identify copy-paste related bugs in operating system
code [LLMZ04]. Compared with these studies, performance-oriented debugging
can be more challenging because performance bugs/anomalies are usually
connected with code semantics or algorithms, and they often do not follow simple
correctness rules. Further, performance anomalies occur when system
performance falls below expectation, and they may not cause obvious malfunctions
such as incorrect states or system crashes.
5.6 Conclusion
In this chapter, we present five performance anomalies we identified in the I/O
system of the Linux kernel. All the anomalies concern file system prefetching and
I/O scheduling. For most anomalies, we describe the high-level algorithmic design,
illustrate the anomalous implementation, and show the scenarios where they may
manifest. Driven by our experience with real Linux performance anomalies, we
also provide a general discussion of operating system performance debugging.
6 Conclusions and Future Work
A large number of networked and distributed applications in the world are built
upon data-intensive server systems. The performance of such systems is crucial
for industrial production and people's daily lives. This thesis examines OS
techniques to improve the I/O performance of data-intensive server systems. We
look at transparent OS enhancements in three aspects (file system prefetching,
memory management, and disk I/O performance) that can improve system
performance without any application assistance or changes.
6.1 Summary of Contributions
First, we propose a competitive prefetching strategy that controls the
prefetching depth so that the overhead of disk I/O switching and that of
unnecessary prefetching are balanced. The proposed strategy does not require a priori
information on the data access pattern, and achieves at least half the performance
(in terms of I/O throughput) of the optimal offline policy. We also analyze the
optimality of our competitiveness result and extend the result to capture
prefetching in the case of random-access workloads. We have implemented the
proposed competitive prefetching policy in Linux 2.6.10 and evaluated its
performance on both standalone disks and a disk array using a variety of workloads
(including two common file utilities, Linux kernel compilation, the TPC-H
benchmark, the Apache web server, and index searching). Compared to the original
Linux kernel, our competitive prefetching system improves performance by up to
53%. At the same time, it trails the performance of an oracle prefetching strategy
by no more than 42%.
Second, we explore a new memory management scheme that handles prefetched
(but not-yet-accessed) pages separately from the rest of the memory buffer cache.
We examine three new heuristic policies for identifying a victim page (among the
prefetched pages) for eviction: 1) evict the last page of the longest prefetch
stream; 2) evict the last page of the least recently accessed prefetch stream; and
3) evict the last page of the prefetch stream whose owner process has consumed
the most CPU since it last accessed the stream. These policies require no
application changes or hints about data access patterns. We have implemented
the proposed techniques in the Linux 2.6.10 kernel and conducted experiments
based on microbenchmarks and two real application workloads (a trace-driven
index searching server and the Apache web server hosting media clips). Compared
with access-history-based policies, our memory management scheme improves
the server I/O throughput of real workloads by 5–52% at high concurrency levels.
Further, the proposed approach is 10–32% below an approximated optimal page
reclamation policy that uses application-provided I/O access hints. The space
overhead of our implementation is about 0.4% of the physical memory size.
Third, we present five performance anomalies we identified in the Linux I/O
system. All the anomalies concern file system prefetching and I/O scheduling.
For most anomalies, we describe the high-level algorithmic design, illustrate the
anomalous implementation, and show the scenarios where they may manifest.
We also discuss our experience in operating system performance debugging.
In particular, two of the “bugs” we identified were reported to the Linux kernel
development team [Bug]. One of the two has been accepted and its fix will be
included in an upcoming kernel version.
6.2 Future Directions
Based on this thesis work, we suggest the following future research directions.
• Adaptive prefetching using online workload profiling: Competitive prefetching
only addresses the worst-case performance. A general-purpose operating
system prefetching policy should also deliver high average-case performance
for common application workloads. Previous work has studied adaptive
prefetching using application-disclosed data access patterns or hints [PGG+95,
TPG97]. Data access patterns can also be learned by online workload
monitoring or profiling. Previous work [PS04, PS03] has proposed a monitoring
daemon to generate hints on application data access patterns and burstiness.
We believe this technique, if combined with competitive prefetching, may
yield better average-case performance for common application workloads.
• I/O cancellation: A larger prefetching depth results in less frequent I/O
switching, and consequently fewer disk seeks per time unit. On the other
hand, prefetching may waste disk bandwidth by retrieving unwanted data.
To reduce this waste while supporting aggressive prefetching, we suggest
examining the feasibility of canceling an I/O prefetch once it is identified as
unwanted. Although there may be overhead and some undesirable impacts,
I/O cancellation provides great opportunities to make up for inaccurate
prefetching predictions and to improve I/O performance and application
responsiveness.
I/O request cancellation actually reflects a general issue for operating
systems. The idea of cancellation also makes sense for some other OS entities,
such as threads, processes, and virtual machines. Process cancellation, for
instance, would enable the system to avoid wasting resources on unneeded
tasks.
• Online system performance debugging: In systems research, performance im-
provement has been widely and deeply studied. Performance debugging is a
complementary approach that improves system performance by identifying
and fixing performance anomalies. Large and complex system software, such
as an operating system, may not always behave as expected. Because perfor-
mance anomalies can arise from many causes under different system configu-
rations, it is challenging to identify them and pinpoint their root causes in
complex systems. We participated in previous work [SZL05] that proposed a
model-driven anomaly characterization approach and used it to discover op-
erating system performance anomalies in a single-node system. It would be
desirable to study online performance debugging on large-scale systems, such
as distributed file systems over wide-area networks or parallel file systems on
large clusters.
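The anomaly-flagging step can be illustrated with a toy check that compares measured throughput against a simple seek-plus-transfer model and flags configurations that fall too far below the prediction. Both the model and the 20% tolerance are illustrative assumptions, not the actual methodology of [SZL05]:

```python
# Illustrative sketch of model-driven anomaly flagging: configurations
# whose measured I/O throughput falls well below a whole-system model's
# prediction are flagged for root-cause analysis.  Model and threshold
# are hypothetical.

def predicted_throughput(seq_bytes, seek_ms, transfer_mb_per_s):
    """Naive model: one seek per sequential run, then streaming transfer.
    Returns throughput in bytes per millisecond."""
    transfer_ms = seq_bytes / (transfer_mb_per_s * 1000.0)  # MB/s -> bytes/ms
    return seq_bytes / (seek_ms + transfer_ms)

def flag_anomalies(samples, tolerance=0.2):
    """samples: (config_name, measured, predicted) tuples.  Flag any
    configuration measuring more than `tolerance` below the model."""
    return [cfg for cfg, measured, predicted in samples
            if measured < (1.0 - tolerance) * predicted]

if __name__ == "__main__":
    pred = predicted_throughput(seq_bytes=1_000_000, seek_ms=8.0,
                                transfer_mb_per_s=50.0)
    samples = [("anticipatory+128KB", 0.70 * pred, pred),  # suspicious
               ("deadline+512KB", 0.95 * pred, pred)]      # within tolerance
    print("anomalous configurations:", flag_anomalies(samples))
```

The interesting research questions lie beyond this sketch: calibrating the model online, and sampling the configuration space efficiently on large distributed systems.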
Bibliography
[ABG+01] G. A. Alvarez, E. Borowsky, S. Go, T. H. Romer, R. Becker-Szendy,
R. Golding, A. Merchant, M. Spasojevic, A. Veitch, and J. Wilkes.
Minerva: An automated resource provisioning tool for large-scale
storage systems. ACM Trans. on Computer Systems, 19(4):483–518,
November 2001.
[AJ99] M. Arlitt and T. Jin. Workload Characterization of the 1998 World
Cup Web Site. Technical Report HPL-1999-35, HP Laboratories Palo
Alto, 1999.
[AMW+03] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and
A. Muthitacharoen. Performance Debugging for Distributed Systems
of Black Boxes. In Proc. of the 19th ACM SOSP, pages 74–89, Bolton
Landing, NY, October 2003.
[Ask] Ask Jeeves Search. http://www.ask.com.
[AWC04] S. V. Anastasiadis, R. G. Wickremesinghe, and J. S. Chase. Circus:
Opportunistic Block Reordering for Scalable Content Servers. In
Proc. of the Third USENIX FAST, pages 201–212, San Francisco,
CA, March 2004.
[BBG+99] J. Bruno, J. Brustoloni, E. Gabber, B. Ozden, and A. Silberschatz.
Disk Scheduling With Quality of Service Guarantees. In Proc. of
the IEEE International Conference on Multimedia Computing and
Systems, Centro Affari, Firenze, Italy, June 1999.
[BC02] D. P. Bovet and M. Cesati. Understanding the Linux Kernel. O’Reilly
& Associates, 2nd edition, 2002.
[BCB03] R. von Behren, J. Condit, and E. Brewer. Why Events Are A Bad Idea
(for high concurrency servers). In Proc. of the 9th Workshop on Hot
Topics in Operating Systems, Lihue, Hawaii, May 2003.
[BCD72] A. Bensoussan, C. Clingen, and R. Daley. The Multics Virtual Mem-
ory: Concepts and Design. Communications of the ACM, 15(5):308–
318, May 1972.
[BDF+96] W. Bolosky, J. Draves, R. Fitzgerald, G. Gibson, M. Jones, S. Levi,
N. Myhrvold, and R. Rashid. The Tiger Video Fileserver. In Proc. of
the 6th International Workshop on Network and Operating Systems
Support for Digital Audio and Video, Zushi, Japan, April 1996.
[BDF+03] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho,
R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtu-
alization. In Proc. of the 19th ACM SOSP, pages 164–177, Bolton
Landing, NY, October 2003.
[Bel66] L. A. Belady. A Study of Replacement Algorithms for a Virtual-
storage Computer. IBM Systems Journal, 5(2):78–101, 1966.
[BHS95] T. Blackwell, J. Harris, and M. Seltzer. Heuristic Cleaning Algo-
rithms in Log-structured File Systems. In Proc. of the USENIX An-
nual Technical Conference, New Orleans, LA, January 1995.
[BKVV00] R. Barve, M. Kallahalla, P. J. Varman, and J. S. Vitter. Compet-
itive Parallel Disk Prefetching and Buffer Management. Journal of
Algorithms, 38(2):152–181, August 2000.
[BO03] R. Bryant and D. O’Hallaron. Computer Systems: A Programmer’s
Perspective. Prentice Hall, 2003.
[Bru99] J. C. Brustoloni. Interoperation of Copy Avoidance in Network and
File I/O. In Proc. of the IEEE INFOCOM, pages 534–542, New York
City, March 1999.
[Bug] Kernel Bug Tracker Bug 10756. http://bugzilla.kernel.org/show_bug.cgi?id=10756.
[CB04] E. Carrera and R. Bianchini. Improving Disk Throughput in Data-
Intensive Servers. In Proc. of the 10th HPCA, pages 130–141,
Madrid, Spain, February 2004.
[CFKL95] P. Cao, E. W. Felten, A. R. Karlin, and K. Li. A Study of Integrated
Prefetching and Caching Strategies. In Proc. of the ACM SIGMET-
RICS, pages 188–197, Ottawa, Canada, June 1995.
[CFL94] P. Cao, E. W. Felten, and K. Li. Implementation and Performance of
Application-Controlled File Caching. In Proc. of the First USENIX
OSDI, Monterey, CA, November 1994.
[CG99] F. Chang and G. Gibson. Automatic I/O Hint Generation Through
Speculative Execution. In Proc. of the Third USENIX OSDI, New
Orleans, LA, February 1999.
[CIRT00] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A
Parallel File System For Linux Clusters. In Proc. of the 4th Annual
Linux Showcase and Conf., pages 317–327, Atlanta, GA, October
2000.
[CKF+02] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint:
Problem Determination in Large, Dynamic Systems. In Proc. of Int’l
Conf. on Dependable Systems and Networks, pages 595–604, Wash-
ington, DC, June 2002.
[CL94] M. E. Crovella and T. J. LeBlanc. Parallel Performance Prediction
Using Lost Cycles Analysis. In Proc. of SuperComputing, pages 600–
610, Washington, DC, November 1994.
[CP99] C. D. Cranor and G. M. Parulkar. The UVM Virtual Memory System.
In Proc. of the USENIX Annual Technical Conference, pages 117–130,
Monterey, CA, June 1999.
[CYC+01] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical
Study of Operating Systems Errors. In Proc. of the 18th ACM SOSP,
pages 73–88, Banff, Canada, October 2001.
[DCA+03] R. P. Doyle, J. S. Chase, O. M. Asad, W. Jin, and A. M. Vahdat.
Model-based resource provisioning in a web service utility. In Proc.
of the 4th USENIX Symp. on Internet Technologies and Systems,
Seattle, WA, March 2003.
[Den68] P. J. Denning. The Working Set Model for Program Behavior. Com-
munications of the ACM, 11:323–333, May 1968.
[DP93] P. Druschel and L. L. Peterson. Fbufs: A High-Bandwidth
Cross-Domain Transfer Facility. In Proc. of the 14th ACM SOSP,
Asheville, NC, December 1993.
[DRC03] Z. Dimitrijevic, R. Rangaswami, and E. Chang. Design and Imple-
mentation of Semi-preemptible IO. In Proc. of the 2nd USENIX
Conf. on File and Storage Technologies, pages 145–158, San Fran-
cisco, CA, March 2003.
[ECH+01] D. Engler, D. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as De-
viant Behavior: A General Approach to Inferring Errors in Systems
Code. In Proc. of the 18th ACM SOSP, pages 57–72, Banff, Canada,
October 2001.
[FC03] K. Fraser and F. Chang. Operating System I/O Speculation: How
Two Invocations Are Faster Than One. In Proc. of the USENIX
Annual Technical Conference, San Antonio, TX, June 2003.
[FP93] K. Fall and J. Pasquale. Exploring In-kernel Data Paths to Improve
I/O Throughput and CPU Availability. In Proc. of the 1993 Winter
Unix Conference, January 1993.
[GH93] A. J. Goldberg and J. L. Hennessy. Mtool: An Integrated System
for Performance Debugging Shared Memory Multiprocessor Applica-
tions. IEEE Transactions on Parallel and Distributed Systems,
4(1):28–40, January 1993.
[GM05] B. S. Gill and D. S. Modha. SARC: Sequential Prefetching in Adap-
tive Replacement Cache. In Proc. of the USENIX Annual Technical
Conference, pages 293–308, Anaheim, CA, April 2005.
[GNU06] GNU Diffutils Project, 2006. http://www.gnu.org/software/diffutils/diffutils.html.
[GP98] G. R. Ganger and Y. N. Patt. Using System-Level Models to Eval-
uate I/O Subsystem Designs. IEEE Transactions on Computers,
47(6):667–678, June 1998.
[Gra] Gradient Descent Algorithm. http://en.wikipedia.org/wiki/Gradient_descent.
[Han99] S. M. Hand. Self-paging in the Nemesis Operating System. In Proc.
of the Third USENIX OSDI, New Orleans, LA, February 1999.
[IBM] IBM Autonomic Computing. http://www.research.ibm.com/autonomic/.
[ID01] S. Iyer and P. Druschel. Anticipatory Scheduling: A Disk Schedul-
ing Framework to Overcome Deceptive Idleness in Synchronous I/O.
In Proc. of the 18th ACM SOSP, pages 117–130, Banff, Canada,
October 2001.
[JADAD06] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Ant-
farm: Tracking Processes in a Virtual Machine Environment. In Proc.
of the USENIX Annual Technical Conference, pages 1–14, Boston,
MA, June 2006.
[JCZ05] S. Jiang, F. Chen, and X. Zhang. CLOCK-Pro: an effective improve-
ment of the CLOCK replacement. In Proc. of the USENIX Annual
Technical Conference, pages 323–336, Anaheim, CA, April 2005.
[JS94] T. Johnson and D. Shasha. 2Q: A Low Overhead High Performance
Buffer Management Replacement Algorithm. In Proc. of the 20th
VLDB Conference, pages 439–450, Santiago, Chile, September 1994.
[JZ02] S. Jiang and X. Zhang. LIRS: An Efficient Low Inter-reference Re-
cency Set Replacement Policy to Improve Buffer Cache Performance.
In Proc. of the ACM SIGMETRICS, pages 31–42, Marina Del Rey,
CA, June 2002.
[KCK+00] J. M. Kim, J. Choi, J. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S.
Kim. A Low-Overhead High-Performance Unified Buffer Manage-
ment Scheme that Exploits Sequential and Looping References. In
Proc. of the 4th USENIX OSDI, San Diego, CA, October 2000.
[KEG+97] M. Kaashoek, D. Engler, G. Ganger, H. Briceno, R. Hunt,
D. Mazieres, T. Pinckney, R. Grimm, J. Jannotti, and K. MacKen-
zie. Application Performance and Flexibility on Exokernel Systems.
In Proc. of the 16th ACM SOSP, Saint-Malo, France, October 1997.
[KMC02] S. F. Kaplan, L. A. McGeoch, and M. F. Cole. Adaptive Caching for
Demand Prepaging. In Proc. of the Third Int’l Symp. on Memory
Management, pages 114–126, Berlin, Germany, June 2002.
[KTP+96] T. Kimbrel, A. Tomkins, R. H. Patterson, B. Bershad, P. Cao, E. W.
Felten, G. A. Gibson, A. R. Karlin, and K. Li. A Trace-Driven
Comparison of Algorithms for Parallel Prefetching and Caching. In
Proc. of the Second USENIX OSDI, Seattle, WA, October 1996.
[KTR94] D. Kotz, S. B. Toh, and S. Radhakrishnan. A Detailed Simulation
Model of the HP 97560 Disk Drive. Technical Report PCS-TR94-220,
Dept. of Computer Science, Dartmouth College, July 1994.
[LD97] H. Lei and D. Duchamp. An Analytical Approach to File Prefetching.
In Proc. of the USENIX Annual Technical Conference, Anaheim, CA,
January 1997.
[LJ00] K. Li and S. Jamin. A Measurement-Based Admission-Controlled
Web Server. In Proc. of the IEEE INFOCOM, pages 651–659, Tel-
Aviv, Israel, March 2000.
[LKB] Linux kernel bug tracker. http://bugzilla.kernel.org/.
[LKT] Linux kernel trap. http://kerneltrap.org/node/5411.
[LLMZ04] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A Tool for
Finding Copy-paste and Related Bugs in Operating System Code. In
Proc. of the 6th USENIX OSDI, pages 289–302, San Francisco, CA,
December 2004.
[LS07] P. Lu and K. Shen. Multi-Layer Event Trace Analysis for Parallel I/O
Performance Tuning. In Proc. of the 36th International Conference
on Parallel Processing, Xi'an, China, September 2007.
[LSG+00] C. Lumb, J. Schindler, G. R. Ganger, E. Riedel, and D. F. Nagle. Towards
Higher Disk Head Utilization: Extracting “Free” Bandwidth from
Busy Disk Drives. In Proc. of the 4th USENIX OSDI, San Diego,
CA, October 2000.
[LSGN00] C. R. Lumb, J. Schindler, G. R. Ganger, and D. F. Nagle. Towards
Higher Disk Head Utilization: Extracting Free Bandwidth From Busy
Disk Drives. In Proc. of the 4th USENIX OSDI, San Diego, CA,
October 2000.
[LTT] Linux Trace Toolkit. http://www.opersys.com/LTT/.
[LTTng] Linux Trace Toolkit Next Generation. http://ltt.polymtl.ca/.
[MDK96] T. C. Mowry, A. K. Demke, and O. Krieger. Automatic Compiler-
Inserted I/O Prefetching for Out-of-Core Applications. In Proc. of
the Second USENIX OSDI, Seattle, WA, October 1996.
[Met97] R. V. Meter. Observing the effects of multi-zone disks. In Proc. of
the USENIX Annual Technical Conference, Anaheim, CA, January
1997.
[MGA92] M. Martonosi, A. Gupta, and T. Anderson. MemSpy: Analyzing
Memory System Bottlenecks in Programs. In Proc. of the ACM SIG-
METRICS, pages 1–12, Newport, RI, June 1992.
[MGST70] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation
Techniques for Storage Hierarchies. IBM Systems J., 9(2):78–117,
1970.
[Nut99] G. Nutt. Operating Systems: A Modern Perspective. Addison Wesley,
2nd edition, 1999.
[OOW93] E. J. O’Neil, P. E. O’Neil, and G. Weikum. The LRU-K Page Re-
placement Algorithm for Database Disk Buffering. In Proc. of the
ACM SIGMOD, pages 297–306, Washington, DC, May 1993.
[PDZ99a] V. S. Pai, P. Druschel, and W. Zwaenepoel. Flash: An Efficient
and Portable Web Server. In Proc. of the USENIX Annual Technical
Conference, Monterey, CA, June 1999.
[PDZ99b] V. S. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: A Unified I/O
Buffering and Caching System. In Proc. of the Third USENIX OSDI,
New Orleans, LA, February 1999.
[PGG+95] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Ze-
lenka. Informed Prefetching and Caching. In Proc. of the 15th ACM
SOSP, pages 79–95, Copper Mountain Resort, CO, December 1995.
[PS03] A. E. Papathanasiou and M. L. Scott. Energy Efficiency Through
Burstiness. In Proc. of the 5th IEEE Workshop on Mobile Computing
Systems and Applications, Monterey, CA, October 2003.
[PS04] A. E. Papathanasiou and M. L. Scott. Energy Efficient Prefetching
and Caching. In Proc. of the USENIX Annual Technical Conference,
Boston, MA, June 2004.
[PS05] A. Papathanasiou and M. Scott. Aggressive Prefetching: An Idea
Whose Time Has Come. In Proc. of the 10th Workshop on Hot Topics
in Operating Systems, Santa Fe, NM, June 2005.
[PW93] S. E. Perl and W. E. Weihl. Performance Assertion Checking. In Proc.
of the 14th ACM SOSP, pages 134–145, Asheville, NC, December
1993.
[RBDH97] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the
SimOS Machine Simulator to Study Complex Computer Systems.
ACM Trans. on Modeling and Computer Simulation, 7(1):78–103,
January 1997.
[RD90] J. T. Robinson and M. V. Devarakonda. Data Cache Management
Using Frequency-Based Replacement. In Proc. of the ACM SIGMET-
RICS, pages 134–142, Boulder, CO, May 1990.
[RN03] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach.
Prentice Hall, 2nd edition, 2003.
[RO92] M. Rosenblum and J. K. Ousterhout. The Design and Implementa-
tion of A Log-Structured File System. ACM Transactions on Com-
puter Systems, 10(1):100–114, 1992.
[RW94] C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling.
IEEE Computer, 27(3):17–28, March 1994.
[SES] Search Engine Watch. http://searchenginewatch.com/showPage.html?page=2156481.
[SGLG02] J. Schindler, J. L. Griffin, C. R. Lumb, and G. R. Ganger. Track-
aligned Extents: Matching Access Patterns to Disk Drive Character-
istics. In Proc. of the 1st Conference on File and Storage Technolo-
gies, Monterey, CA, January 2002.
[Smi78] A. J. Smith. Sequentiality and Prefetching in Database Systems.
ACM Transactions on Database Systems, 3(3):223–247, September
1978.
[SMW98] E. Shriver, A. Merchant, and J. Wilkes. An Analytical Behavior
Model for Disk Drives with Readahead Caches and Request Reorder-
ing. In Proc. of the ACM SIGMETRICS, pages 182–192, Madison,
WI, June 1998.
[SS96] K. A. Smith and M. Seltzer. A Comparison of FFS Disk Allocation
Policies. In Proc. of the USENIX Annual Technical Conference, San
Diego, CA, January 1996.
[SS05] C. Stewart and K. Shen. Performance modeling and system man-
agement for multi-component online services. In Second USENIX
Symp. on Networked Systems Design and Implementation, pages 71–
84, Boston, MA, May 2005.
[SSS99] E. Shriver, C. Small, and K. Smith. Why Does File System Prefetch-
ing Work? In Proc. of the USENIX Annual Technical Conference,
Monterey, CA, June 1999.
[SWW] The Size of the World Wide Web.
http://www.worldwidewebsize.com/.
[SZL05] K. Shen, M. Zhong, and C. Li. I/O System Performance Debugging
Using Model-driven Anomaly Characterization. In Proc. of the 4th
USENIX FAST, San Francisco, CA, December 2005.
[Tan01] A. Tanenbaum. Modern Operating Systems. Prentice Hall, 2nd edi-
tion, 2001.
[TPCH] TPC-H Benchmark. http://www.tpc.org.
[TPG97] A. Tomkins, R. H. Patterson, and G. A. Gibson. Informed Multi-
Process Prefetching and Caching. In Proc. of the ACM SIGMET-
RICS, pages 100–114, Seattle, WA, June 1997.
[TSW92] D. Thiebaut, H. S. Stone, and J. L. Wolf. Improving Disk Cache Hit-
Ratios Through Cache Partitioning. IEEE Transactions on Comput-
ers, 41(6):665–676, June 1992.
[TYKZ07] L. Tan, D. Yuan, G. Krishna, and Y. Zhou. /* iComment: Bugs
or Bad Comments? */. In Proc. of the 21st ACM SOSP, pages
145–158, Stevenson, WA, October 2007.
[vBCZ+03] R. von Behren, J. Condit, F. Zhou, G. C. Necula, and E. Brewer.
Capriccio: Scalable Threads for Internet Services. In Proc. of the 19th
ACM SOSP, pages 268–281, Bolton Landing, NY, October 2003.
[VG01] T. Voigt and P. Gunningberg. Dealing with Memory-Intensive Web
Requests. Technical Report 2001-010, Uppsala University, May 2001.
[VTFM01] T. Voigt, R. Tewari, D. Freimuth, and A. Mehra. Kernel Mechanisms
for Service Differentiation in Overloaded Web Servers. In Proc. of the
USENIX Annual Technical Conference, Boston, MA, June 2001.
[WC03] M. Welsh and D. Culler. Adaptive Overload Control for Busy Internet
Servers. In Proc. of the 4th USENIX Symp. on Internet Technologies
and Systems, Seattle, WA, March 2003.
[WCB01] M. Welsh, D. Culler, and E. Brewer. SEDA: An Architecture for
Well-Conditioned, Scalable Internet Services. In Proc. of the 18th
ACM SOSP, pages 230–243, Banff, Canada, October 2001.
[WW94] C. Waldspurger and W. Weihl. Lottery Scheduling: Flexible
Proportional-Share Resource Management. In Proc. of the First
USENIX OSDI, Monterey, CA, November 1994.
[WW95] C. Waldspurger and W. Weihl. Stride Scheduling: Deterministic
Proportional-Share Resource Management. Technical Report
MIT/LCS/TM-528, MIT Laboratory for Computer Science, June 1995.
[YD00] K. Yaghmour and M. R. Dagenais. Measuring and Characterizing
System Behavior Using Kernel-Level Event Logging. In Proc. of the
USENIX Annual Technical Conference, San Diego, CA, June 2000.
[YLB01] T. Yeh, D. Long, and S. A. Brandt. Using Program and User In-
formation to Improve File Prediction Performance. In Proc. of the
Int’l Symposium on Performance Analysis of Systems and Software,
pages 111–119, Tucson, AZ, November 2001.
[ZPL01] Y. Zhou, J. F. Philbin, and K. Li. The Multi-Queue Replacement
Algorithm for Second Level Buffer Caches. In Proc. of the USENIX
Annual Technical Conference, Boston, MA, June 2001.