Operating System Enhancements for
Data-Intensive Server Systems
by
Chuanpeng Li
Submitted in Partial Fulfillment
of the
Requirements for the Degree
Doctor of Philosophy
Supervised by
Kai Shen
Department of Computer Science
The College
Arts and Sciences
University of Rochester
Rochester, New York
2008
Curriculum Vitae
Chuanpeng Li was born in Shandong province, China, on May 2nd, 1977. He
attended Northeastern University in Shenyang, China from 1995 to 1999, and
graduated with a Bachelor of Engineering degree in Computer Science and Tech-
nology. He attended the same university from 1999 to 2002, and graduated with
a Master of Engineering degree. He entered the Ph.D. program in the Computer
Science Department at the University of Rochester in Fall 2002. Under the direc-
tion of Professor Kai Shen, his research is on operating system enhancements for
data-intensive server systems. He received a Master of Science degree in Computer
Science from the University of Rochester in 2003.
Acknowledgments
First of all, I would like to thank Professor Kai Shen. I am very lucky to have
him as my advisor. He spent a lot of time on my work. I remember many nights
he worked with me until early morning. I remember many times he sacrificed
his weekend and even dinner time in order to discuss the work with me. I still
remember several times he encouraged me to hold on when I did not even believe
in myself. I appreciate all the time he spent with me and all the great ideas and
efforts he put into this work. In addition to guiding my research, Professor Shen
also gave me great advice on how to be a happy person and how to achieve a
successful career. His advice will continue to help me in the future.
I am truly grateful to the other members of my thesis committee: Professor Chen
Ding, Professor Michael Huang, and Professor Michael Scott, for their advice and
patience. Their feedback and comments were extremely important in helping me
adjust my research and finally finish the thesis. Their encouragement helped me
build confidence. I am very lucky to have them on my committee.
I would like to thank Athanasios Papathanasiou for many excellent discussions
on my research. I also learned a great deal about Linux kernel hacking from
him. Many thanks to Professor Sandhya Dwarkadas for her advice during our
collaborative work. Thanks to Ming Zhong, Christopher Stewart, and Pin Lu for
their comments on my research, and to the whole URCS systems group for their
comments on my work and my practice talks.
I would like to thank the other URCS faculty, staff, and my fellow graduate
students for making this department a pleasant place to work. Thanks to my
office mates Manu Chhabra and Kirk Kelsey for many helpful discussions about
work and life.
I would also like to thank Ask.com for providing a prototype index searching
application and its workload trace in support of this work.
Finally, I am deeply indebted to my great parents and my dear wife. They
always tell me I am the best. They always make me feel comfortable. They have
stood behind me at all times, whether I was excited or frustrated. I love them.
The material presented in this dissertation is based upon work supported by
the National Science Foundation (NSF) grants CCR-0306473, ITR/IIS-0312925,
CNS-0615045, CCF-0621472, CCF-0448413, and an IBM Faculty Award. Any
opinions, findings, and conclusions or recommendations expressed in this material
are those of the author(s) and do not necessarily reflect the views of the above
named organizations.
Abstract
Recent studies on operating system support for concurrent server systems
mostly target CPU-intensive workloads with light disk I/O activities. However, an
important class of server systems that access a large amount of disk-resident data,
such as the index searching server of large-scale Web search engines, has received
limited attention. In this thesis work, we examine operating system techniques to
improve the performance of data-intensive server systems under concurrent exe-
cution. We propose OS enhancements in three aspects of the operating system:
file system prefetching, memory management, and disk I/O system. First, we
propose a competitive prefetching strategy that can balance the overhead of disk
I/O switching and the wasted I/O bandwidth on prefetching unnecessary data.
Second, we explore a new memory management scheme for prefetched data, in or-
der to reduce prefetching-incurred page thrashing at high execution concurrency.
Third, we present five performance anomalies we identified in the Linux I/O sys-
tem, and provide general discussions on operating system performance debugging.
We have implemented the proposed techniques in the Linux 2.6 kernel. Per-
formance evaluation on microbenchmarks and real applications shows promising
results.
Table of Contents
Curriculum Vitae ii
Acknowledgments iii
Abstract v
List of Figures x
List of Tables xiv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Targeted applications . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Memory Replacement . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Disk I/O Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 OS Support for Highly-Concurrent Servers . . . . . . . . . . . . . 10
2.6 Other Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3 Competitive Prefetching 12
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2.1 Existing Operating System Support . . . . . . . . . . . . . 14
3.2.2 Disk Drive Characteristics . . . . . . . . . . . . . . . . . . 16
3.3 Competitive Prefetching . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.1 2-Competitive Prefetching . . . . . . . . . . . . . . . . . . 18
3.3.2 Optimality of 2-Competitiveness . . . . . . . . . . . . . . . 21
3.3.3 Accommodation for Random-Access
Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.4 Implementation Issues . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 28
3.5.1 Evaluation Benchmarks . . . . . . . . . . . . . . . . . . . 29
3.5.2 Base-Case Performance . . . . . . . . . . . . . . . . . . . . 33
3.5.3 Performance with
Process-Context-Oblivious I/O Scheduling . . . . . . . . . 37
3.5.4 Performance on Different Storage Devices . . . . . . . . . 38
3.5.5 Performance of Serial Workloads . . . . . . . . . . . . . . 40
3.5.6 Summary of Results . . . . . . . . . . . . . . . . . . . . . 42
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.7 Conclusion and Discussions . . . . . . . . . . . . . . . . . . . . . 45
4 Managing Prefetch Memory 49
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2 Targeted Applications and Existing OS Support . . . . . . . . . . 52
4.3 Prefetch Memory Management . . . . . . . . . . . . . . . . . . . . 55
4.3.1 Page Reclamation Policy . . . . . . . . . . . . . . . . . . . 56
4.3.2 Memory Allocation for Prefetched Pages . . . . . . . . . . 58
4.3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 63
4.4.1 Evaluation Benchmarks . . . . . . . . . . . . . . . . . . . 64
4.4.2 Microbenchmark Performance . . . . . . . . . . . . . . . . 65
4.4.3 Performance of Real Applications . . . . . . . . . . . . . . 70
4.4.4 Impact of Factors . . . . . . . . . . . . . . . . . . . . . . . 73
4.4.5 Summary of Results . . . . . . . . . . . . . . . . . . . . . 75
4.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5 I/O System Performance Anomalies 81
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Linux I/O System Algorithms . . . . . . . . . . . . . . . . . . . . 83
5.2.1 Prefetching Algorithm in Linux . . . . . . . . . . . . . . . 83
5.2.2 I/O Scheduling Algorithms in Linux . . . . . . . . . . . . 86
5.3 Anomalies in Linux I/O System Implementation . . . . . . . . . . 88
5.3.1 Suppression of Prefetching . . . . . . . . . . . . . . . . . . 89
5.3.2 Threshold of Disk Congestion . . . . . . . . . . . . . . . . 91
5.3.3 Estimation of Seek Cost . . . . . . . . . . . . . . . . . . . 92
5.3.4 Impact of System Timer Frequency . . . . . . . . . . . . . 93
5.3.5 Timing to Start Anticipation . . . . . . . . . . . . . . . . 96
5.4 Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6 Conclusions and Future Work 102
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . 102
6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Bibliography 106
List of Figures
2.1 Concurrent sequential I/O in a data-intensive server system. . . . 6
2.2 Concurrent sequential I/O under alternating accesses from a single
process/thread . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.1 An illustration of I/O scheduling at the virtual machine monitor
(VMM). Typically, the VMM I/O scheduler has no knowledge of
the request-issuing process contexts within the guest systems. As
a result, the anticipatory I/O scheduling at the VMM may not
function properly. . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Sequential transfer rate and seek time for two disk drives and a
RAID. Most results were acquired under the Linux 2.6.10 kernel.
The transfer rate of the IBM drive appears to be affected by the
operating system software. We also show its transfer rate on a
Linux 2.6.16 kernel. . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 I/O access pattern of certain application workloads. Virtual time
represents the sequence of discrete data accesses. In (A), each
straight line implies sequential access pattern of a file. In (B),
random and non-sequential accesses make the files indistinguishable. 32
3.4 Application I/O throughput of server-style workloads (four mi-
crobenchmarks and two real applications). The emulated optimal
performance is presented only for the microbenchmarks. . . . . . 34
3.5 Normalized execution time of non-server concurrent applications.
The values on top of the “Linux” bars represent the running time
of each workload under the original Linux kernel (in minutes or
seconds). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 Performance of server-style workloads on a Xen virtual machine
platform where the I/O scheduling is oblivious to request-issuing
process contexts. We present results only for workloads whose performance
substantially differs from that in the base case (shown in
Figure 3.4). For the microbenchmark, we also show the perfor-
mance of an emulated optimal prefetching. . . . . . . . . . . . . 38
3.7 Normalized execution time of non-server concurrent applications on
a virtual machine platform where the I/O scheduling is oblivious
of request-issuing process contexts. We present results only for
workloads whose performance substantially differs from that in the
base case (shown in Figure 3.5). The values on top of the “Linux”
bars represent the running time of each workload under the original
Linux kernel (in seconds). . . . . . . . . . . . . . . . . . . . . . . 39
3.8 Normalized execution time of two non-server concurrent workloads
on three different storage devices. . . . . . . . . . . . . . . . . . 40
3.9 Normalized inverse of throughput for two server-style concurrent
applications on three different storage devices. We use the inverse
of throughput (as opposed to throughput itself) so that the illustra-
tion is more comparable to that in Figure 3.8. For the microbench-
mark, we also show an emulated optimal performance. For clear
illustration, we only show results at concurrency level 4. The performance
at other concurrency levels follows a similar pattern. . . . . 41
3.10 Normalized execution time of serial application workloads. The
values on top of the “Linux” bars represent the running time of
each workload under the original Linux prefetching (in minutes,
seconds, or milliseconds). . . . . . . . . . . . . . . . . . . . . . . . 42
3.11 Normalized latency of individual I/O request under concurrent work-
load. The latency of an individual small I/O request is measured
with microbenchmark Two-Rand-0 running in the background. The
values on top of the “Linux” bars represent the request latency un-
der the original Linux prefetching (in seconds). . . . . . . . . . . . 47
4.1 Application throughput and prefetch page thrashing rate of a trace-
driven index searching server under the original Linux 2.6.10 kernel.
A prefetch thrashing is counted as a page miss on a prefetched but
then prematurely evicted memory page. . . . . . . . . . . . . . . . 54
4.2 A separate prefetch cache for prefetched pages. . . . . . . . . . . . 55
4.3 An illustration of the prefetch cache data structure and our imple-
mentation in Linux. . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4 Microbenchmark I/O throughput at different concurrency levels.
Our proposed prefetch memory management produces throughput
improvement on One-Rand and Two-Rand. . . . . . . . . . . . . . 68
4.5 Comparison among AccessHistory, Coldest+, and an approximated
optimal page reclamation policy. The throughput of Coldest+ is
10–32% lower than that of the approximated optimal while it is
8–97% higher than that of AccessHistory in high concurrencies. . 70
4.6 I/O throughput and page thrashing rate of the index searching server. 71
4.7 Percentage of prematurely evicted pages that have stayed in mem-
ory for more than 8 seconds for the index searching server. We
only consider those prefetched but then prematurely evicted pages
which later cause page misses. . . . . . . . . . . . . . . . . . . . . 72
4.8 Throughput of the Apache Web server hosting media clips. . . . . 73
4.9 Performance impact of the maximum prefetching depth for the mi-
crobenchmark Two-Rand. “PCC+” stands for PC-Coldest+ and
“AH” stands for AccessHistory. . . . . . . . . . . . . . . . . . . . 74
4.10 Performance impact of the server memory size for the microbench-
mark Two-Rand. “PCC+” stands for PC-Coldest+ and “AH”
stands for AccessHistory. . . . . . . . . . . . . . . . . . . . . . . . 75
4.11 Performance impact of the workload data size for the microbench-
mark Two-Rand. “PCC+” stands for PC-Coldest+ and “AH”
stands for AccessHistory. . . . . . . . . . . . . . . . . . . . . . . . 76
5.1 Current and ahead windows in asynchronous prefetching . . . . . 84
5.2 Throughput of a server workload. The anticipatory I/O scheduler
is used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.3 Throughput of a server-style microbenchmark workload. Change of
congestion threshold may cause anomalous performance deviations. 91
5.4 Effect of kernel correction (for anomaly #3) on a microbenchmark
that alternatively reads random size of data in two files. The an-
ticipatory I/O scheduler is used. . . . . . . . . . . . . . . . . . . . 93
5.5 Effect of kernel correction (for anomaly #4) on a microbenchmark
that reads 256 KB randomly in large files. . . . . . . . . . . . . . 95
5.6 Effect of kernel corrections on a microbenchmark that reads 256 KB
randomly in large files. . . . . . . . . . . . . . . . . . . . . . . . . 97
List of Tables
4.1 Benchmark statistics. The column “Whole-file access?” indicates
whether a request handler would prefetch some data that it would
never access. No such data exists if whole files are accessed by
application request handlers. The column “Stream per request”
indicates the number of prefetching streams each request handler
initiates simultaneously. More prefetching streams per request han-
dler would create higher memory contention. . . . . . . . . . . . 66
1 Introduction
1.1 Motivation
With the ubiquity of computer networks, network-based applications have become
increasingly popular. A large number of network and distributed applications
are built upon on-line servers. Server systems usually provide computing or
storage services to many simultaneous clients. Meanwhile, modern computing
systems have become increasingly data-driven, and the dataset sizes of typical
server applications have grown markedly. As an example, from 2003 to 2008, the
estimated number of indexed web pages at Google increased from 3.2 billion to
over 20 billion [SWW, SES]. Against this backdrop, many data-intensive server
systems have emerged. In addition to large-scale Web search engines, examples
of such systems include parallel file systems and FTP/Web servers that host
large files, such as free software and multimedia data. Such data-intensive
server systems access large disk-resident datasets while serving many clients
simultaneously.
The operating system (OS) is the lowest layer of software in a computer system.
Efficient OS support for upper-layer applications has a direct and significant
impact on the overall system performance and reliability. Previous studies have
investigated OS support for concurrent server systems. Techniques such as event-driven
concurrency management [PDZ99a, WCB01] and user-level threads [vBCZ+03]
were proposed to avoid expensive kernel-level context switches and reduce the
synchronization overhead. However, these techniques are mostly designed to sup-
port CPU-intensive workloads with light disk I/O activities (e.g., the typical Web
server workload and application-level packet routing). The existing OS support
is insufficient for concurrent server systems to achieve high performance while
accessing a large amount of disk-resident data.
1.2 Thesis Statement
The goal of this thesis work is to enhance the OS support to achieve high
performance and efficient resource utilization for data-intensive server systems.
Specifically, we investigate transparent enhancements in three aspects
of the operating system: file system prefetching, memory management,
and disk I/O system. Although large-scale server applications often run on
clusters or distributed systems, we concentrate on OS enhancements for a single
node of a larger system. OS support for server systems at the cluster level is
beyond the scope of this work.
First, for data-intensive server systems, the disk I/O performance dominates
the overall system throughput when the dataset size far exceeds the available
server memory. Many applications are carefully constructed to follow a
sequential pattern when accessing disk-resident data in order to achieve high
I/O performance. However, such sequential access patterns are often disrupted
by concurrent request processing. More specifically, the data access of one
request handler (e.g., a server process or thread) can be frequently interrupted
by other active request handlers in the system. This phenomenon, which we call
disk I/O switching, can severely degrade I/O efficiency due to long disk seek
and rotational delays.
Aggressive prefetching can improve the granularity of sequential data access, but
it comes with a higher risk of fetching unneeded data. To address this problem,
we propose a competitive prefetching technique that balances the overhead of
I/O switching against that of unnecessary prefetching.
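The intuition behind this balance (developed fully in Chapter 3) can be sketched as follows: if the prefetching depth is chosen so that the time spent transferring the prefetched data equals the cost of one disk I/O switch, then the time wasted on unneeded data never exceeds the switch cost itself, bounding total I/O time at twice that of an optimal offline strategy. The sketch below is illustrative only; the device parameters are hypothetical, not measurements from this work.

```python
# Illustrative sketch (not the kernel implementation): pick a prefetching
# depth at which the sequential transfer time of the prefetched data equals
# the cost of one disk I/O switch (seek + rotation).

def competitive_prefetch_depth(transfer_rate_bytes_per_s, io_switch_cost_s):
    """Depth (in bytes) at which transfer time equals the I/O switch cost.

    Prefetching this much per switch wastes at most as much time on unneeded
    data as one switch itself costs, so total I/O time is at most twice that
    of an optimal offline strategy (2-competitiveness).
    """
    return int(transfer_rate_bytes_per_s * io_switch_cost_s)

# Hypothetical device: 50 MB/s sequential transfer, 8 ms seek + rotation.
depth = competitive_prefetch_depth(50 * 1024 * 1024, 0.008)
```

With these assumed parameters the policy prefetches roughly 400 KB per stream before allowing a switch; a faster disk or a slower seek both push the depth up, matching the intuition that cheap sequential bandwidth justifies bolder prefetching.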
Second, data-intensive server systems may contain a substantial amount of
prefetched data in memory due to large-granularity I/O prefetching and high
execution concurrency. Traditional access recency or frequency-based page recla-
mation methods are not effective for prefetched data because prefetched memory
pages have no access history. Using a traditional page reclamation policy, memory
contention can result in premature eviction of prefetched pages before they are
actually accessed. This phenomenon, which we call page thrashing, may degrade
server system performance dramatically. To address this problem, we explore a
new memory management scheme for prefetched but not-yet-accessed pages. We
examine efficient heuristic page reclamation policies that are not based on access
recency or frequency.
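As an illustration only (the actual policies are designed and evaluated in Chapter 4), a separate cache for prefetched pages can pick eviction victims without any per-page access history, for example by evicting from the stream whose owner has gone longest without consuming a prefetched page. The sketch below is a simplified stand-in for such a heuristic; the class name and interface are invented for exposition, not taken from our implementation.

```python
from collections import OrderedDict

# Minimal sketch of a separate cache for prefetched, not-yet-accessed pages.
# Victims come from the "coldest" stream: the one whose owner has gone
# longest without consuming a prefetched page. Consuming a page moves its
# stream to the hot end of the ordering.

class PrefetchCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.streams = OrderedDict()  # stream id -> pages, coldest stream first

    def insert(self, stream, page):
        """Add a prefetched page; return the evicted page, if any."""
        self.streams.setdefault(stream, []).append(page)
        if sum(len(p) for p in self.streams.values()) > self.capacity:
            cold = next(iter(self.streams))       # coldest stream
            victim = self.streams[cold].pop(0)    # its oldest prefetched page
            if not self.streams[cold]:
                del self.streams[cold]
            return victim
        return None

    def access(self, stream, page):
        """Consume a prefetched page; False means a page miss (thrashing)."""
        pages = self.streams.get(stream, [])
        if page in pages:
            pages.remove(page)                    # page leaves the prefetch cache
            self.streams.move_to_end(stream)      # stream becomes hottest
            return True
        return False
```

A miss returned by `access` on a page that was previously inserted corresponds exactly to the prefetch-thrashing event counted in Figure 4.1: the page was prefetched but evicted before use.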
Third, OS support for data-intensive server systems can also be improved by
reducing I/O system performance bugs or anomalies. The I/O storage system is
usually the slowest part of a computer system, and its efficiency is crucial for
data-intensive server applications. In a typical monolithic operating system,
the I/O system is large and complex, and it may not always behave as well as
expected. We investigate the Linux I/O system and identify several performance
anomalies. In this dissertation, we present five performance anomalies and
discuss our experience in operating system performance debugging. In particular,
two of the bugs we identified were reported to the Linux kernel development
team [Bug]. One of the two has been accepted by the team and its fix will be
included in an upcoming Linux kernel version. The other three anomalies are not
outright "bugs" that must be corrected (see Section 5.4 for details).
This work aims to provide OS enhancements that can transparently benefit
a large number of data-intensive applications. No application assistance or hints
will be needed to take advantage of our research results. Our design will also
take measures to minimize negative impact on applications that are not directly
targeted in this work.
1.3 Organization
The rest of this dissertation is organized as follows. Chapter 2 provides more
background information on this work. Chapter 3 presents the design, implementa-
tion, and evaluation of the competitive prefetching technique. Chapter 4 presents
the design, implementation, and evaluation of the enhanced memory management
strategy for prefetched data. Chapter 5 presents the performance anomalies we
identified in the Linux I/O system and provides discussions on operating sys-
tem performance debugging. Chapter 6 concludes this work with a summary of
contributions and provides discussions on future directions.
2 Background
In this chapter, we introduce some high-level background information on this
thesis work. We first describe the I/O characteristics of our targeted data-intensive
server systems. We then review concepts and existing work on prefetching, memory
management, disk I/O, and concurrent servers, and discuss limitations of the
existing work.
2.1 Targeted applications
The targeted applications that motivate us to conduct this work are data-
intensive server systems that may serve many clients simultaneously. In these
systems, each user request is typically serviced by a request handler which can be
a thread in a multi-threaded server or a series of event handlers in an event-driven
server. The request handler then repeatedly accesses disk data and consumes
the CPU before completion. We define a sequential I/O stream as a group of
spatially contiguous data items that are accessed by a single process or thread.
Concurrent requests to a data-intensive server will cause concurrent accesses to
multiple sequential I/O streams in the storage system. We call this type of I/O
workload concurrent sequential I/O, or concurrent I/O in short.
Figure 2.1: Concurrent sequential I/O in a data-intensive server system. (The
original figure shows requests arriving from the network, with each request
handler alternating among disk I/O, CPU computation, and waiting for resources
along a timeline until completion.)
Each request handler on the server does not necessarily make use of the full
content of each I/O stream. It may also perform interleaving I/O operations that
do not belong to the same sequential stream. Disk I/O can be triggered explicitly
through library calls such as fread, or implicitly through memory-mapped I/O.
For instance, a substantial fraction of disk operations in FTP and web servers,
which tend to retrieve whole or partial files, are associated with sequential I/O
streams. For more complex server applications, their developers may employ
data correlation techniques in order to lay out their data sequentially on disk and
increase I/O efficiency. Figure 2.1 illustrates concurrent I/O operations for a
typical data-intensive server application.
In addition to data-intensive server systems, concurrent I/O may also be gen-
erated by concurrent execution of independent processes or a standalone process
that multiplexes accesses among multiple sequential I/O streams (Figure 2.2). Al-
though these applications are not our major concern, the proposed techniques are
also beneficial for them. Examples of such applications include concurrent copying
of multiple files, the file comparison utility diff, and several database workloads.
For instance, web search engines [Ask] maintain an ordered list of matching web
pages for each indexed keyword. For multi-keyword queries, the search engine has
to compute the intersection of multiple such lists. Additionally, SQL multi-table
join queries require accessing a number of files sequentially.
Figure 2.2: Concurrent sequential I/O under alternating accesses from a single
process/thread. (The original figure shows one timeline on which a single
process alternates disk I/O and CPU between two sequential streams.)
Many applications read a portion of each data stream at a time instead of
fetching the complete data stream into application-level buffers at once. Reading
complete data streams is generally avoided for the following reasons. First, some
applications do not have prior knowledge of the exact amount of data required
from a certain data stream. For instance, index search and SQL join queries will
often terminate when a desired number of “matches” are found. Second, applica-
tions that employ memory-mapped I/O do not directly initiate I/O operations.
The operating system loads into memory a small number of pages (through page
faults) at a time on their behalf. Third, fetching complete data streams into
application-level buffers may result in double buffering and/or increase memory
contention. For instance, while GNU diff [GNU06] reads the complete files into
memory at once, the BSD diff can read a portion of the files at a time and thus
it consumes significantly less memory when comparing large files.
2.2 Prefetching
Prefetching, or read-ahead, means reading adjacent pages of a file before
they are actually accessed by the application. If prefetched pages are later
accessed, they are already available in the in-memory page/buffer cache.
Prefetching can therefore improve disk I/O efficiency by reducing
seek and rotational overhead for sequential accesses to disk-resident data, although
it is not very useful for random data accesses. Section 5.2.1 provides a description
of the prefetching algorithm in Linux.
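As a rough illustration of the sequential-detection style of readahead described in Section 5.2.1, the sketch below starts with a small window, roughly doubles it on each sequential access, caps it at a maximum, and resets on a random access. The constants and exact ramp-up rule are illustrative assumptions, not the Linux kernel's actual values.

```python
# Simplified sketch of sequential-detection readahead in the spirit of the
# Linux algorithm described in Section 5.2.1. Constants are illustrative.

INITIAL_WINDOW = 4   # pages prefetched on the first sequential access
MAX_WINDOW = 32      # cap on the readahead window, in pages

def next_window(prev_window, sequential):
    """Compute the next readahead window size, in pages."""
    if not sequential:
        return 0                      # random access: disable readahead
    if prev_window == 0:
        return INITIAL_WINDOW         # first sequential access after a reset
    return min(prev_window * 2, MAX_WINDOW)   # ramp up, capped at the maximum

# A run of sequential accesses ramps the window up and then holds it.
w, sizes = 0, []
for _ in range(5):
    w = next_window(w, sequential=True)
    sizes.append(w)
```

The ramp-up (here 4, 8, 16, 32 pages) keeps the cost of a mispredicted stream small while quickly reaching full depth on genuinely sequential streams.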
Previous research has examined file-system prefetching. We will discuss related
work on prefetching in detail in Section 3.6. Here we briefly mention three pieces
of representative work. Cao et al. proposed integrated prefetching and caching
strategies in [CFKL95]. They presented four rules that an optimal prefetching
and caching strategy must follow. Patterson et al. proposed informed prefetching
which used application hints to guide prefetching and used a cost-benefit model
to evaluate the benefit of prefetching [PGG+95, TPG97]. Aggressive prefetching
was also proposed to increase disk access burstiness and thus to save energy for
mobile devices [PS03, PS04, PS05].
2.3 Memory Replacement
Virtual memory [BCD72] is one of the great ideas in computer systems [BO03,
Nut99]. It gives applications the impression that they have contiguous working
memory which may be larger than the physical memory. It is the basis for many
more advanced techniques or concepts such as multitasking and shared memory.
In virtual memory, memory/page replacement policies decide which memory pages
to evict when a page fault occurs and no free pages are available. Due to its
ease of implementation and good performance, an approximated LRU policy is
widely used in modern operating systems.
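The LRU approximation mentioned above is commonly realized with a CLOCK-style scheme [Smi78]: each resident page carries a reference bit that is set on access, and a sweeping clock hand clears bits and evicts the first unreferenced page it finds. The following is a minimal sketch for illustration, not any particular kernel's code.

```python
# Minimal sketch of the CLOCK approximation of LRU: each resident page has a
# reference bit set on access; the clock hand clears bits as it sweeps and
# evicts the first page it finds unreferenced ("second chance").

class Clock:
    def __init__(self, nframes):
        self.frames = [None] * nframes    # page resident in each frame
        self.refbit = [False] * nframes   # reference bit per frame
        self.hand = 0                     # clock hand position

    def access(self, page):
        """Access a page; return the evicted page on a miss, else None."""
        if page in self.frames:                        # hit: set reference bit
            self.refbit[self.frames.index(page)] = True
            return None
        while True:                                    # miss: sweep for a victim
            if self.frames[self.hand] is None or not self.refbit[self.hand]:
                victim = self.frames[self.hand]
                self.frames[self.hand] = page
                self.refbit[self.hand] = True
                self.hand = (self.hand + 1) % len(self.frames)
                return victim
            self.refbit[self.hand] = False             # give a second chance
            self.hand = (self.hand + 1) % len(self.frames)
```

Note that a freshly prefetched page would enter this structure with no reference history at all, which is precisely why recency-based schemes handle prefetched-but-unaccessed pages poorly, as argued below.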
In addition to LRU, a large body of previous work has explored efficient mem-
ory/page replacement policies for data-intensive applications, including AR-1 [Bel66],
Working-Set [Den68], LFU [MGST70], CLOCK [Smi78], CLOCK-Pro [JCZ05],
FBR [RD90], LRU-K [OOW93], 2Q [JS94], Unified Buffer Management [KCK+00],
Multi-Queue [ZPL01], and LIRS [JZ02]. All these memory replacement strategies
are based on the page reference history such as reference recency or frequency. In
data-intensive server systems, a large number of prefetched but not-yet-accessed
pages may emerge due to aggressive I/O prefetching and high execution concur-
rency. Because of the lack of any access history, these pages may not be properly
managed by access recency or frequency-based replacement policies.
2.4 Disk I/O Scheduler
I/O schedulers are also called elevators. They manage I/O operations for each
block device by reordering and merging requests. Disk scheduling policies have
been studied extensively. SSF (Shortest Seek First) and Scan [Tan01] are
seek-reducing schedulers. YFQ (Yet-another Fair Queuing) [BBG+99], Lottery
scheduling [WW94], and Stride scheduling [WW95] are examples of proportional-
share schedulers. A disk scheduler is work-conserving if it always tries to schedule
the next request whenever the previous request has finished. Both seek-reducing
schedulers and proportional-share schedulers are work-conserving.
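As a concrete illustration of a seek-reducing, work-conserving policy, the textbook Scan ("elevator") order can be sketched as follows. This captures only the concept from [Tan01], not an actual kernel elevator, and the cylinder numbers are invented for the example.

```python
# Textbook Scan ("elevator") ordering: service pending requests in the
# current direction of head movement, then reverse and sweep back.

def scan_order(head, requests, direction_up=True):
    """Return the order in which Scan services the pending cylinder requests."""
    up = sorted(r for r in requests if r >= head)                  # ahead of head
    down = sorted((r for r in requests if r < head), reverse=True) # behind head
    return up + down if direction_up else down + up

# Head at cylinder 50, moving toward higher cylinder numbers:
order = scan_order(50, [10, 95, 60, 20, 70])
```

The scheduler is work-conserving: as soon as one request finishes, the next one in the sweep is dispatched, with no idle waiting.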
The drawback of work-conserving disk schedulers is that they all suffer from
deceptive idleness: the scheduler incorrectly assumes that the last
request-issuing process has no further requests and is thus forced to switch to
a request from another process [ID01]. In data-intensive server systems, where
workloads mostly access the disk sequentially, deceptive idleness occurs very
often under a work-conserving scheduler. This can negatively affect server
performance.
Iyer and Druschel proposed the anticipatory disk scheduling framework to avoid
deceptive idleness [ID01]. An anticipatory disk scheduler is not
work-conserving: it briefly delays servicing disk I/O requests from other
processes, waiting instead for a nearby request from the process that issued
the previous one. Anticipatory scheduling has been shown to improve the
performance of many data-intensive workloads. However, anticipatory scheduling
may not be effective when substantial think time exists between consecutive I/O
requests or when the request handler has to perform some interleaving synchronous
I/O that does not exhibit strong locality. Section 3.2.1 has more on this issue.
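The anticipation decision itself can be sketched as follows. This is a heavily simplified illustration of the idea in [ID01]; the waiting budget, the think-time estimate, and the request representation are all assumptions made for exposition, not the real scheduler's interface.

```python
# Hedged sketch of the anticipation decision from [ID01]: after a request
# from process `last_proc` completes, a non-work-conserving scheduler may
# idle the disk briefly, betting that the process's next (nearby) request
# arrives before the waiting period expires. Timings are illustrative.

def should_wait(last_proc, pending, expected_think_time_s, wait_budget_s=0.005):
    """Decide whether to idle the disk instead of dispatching now.

    pending: list of (process, block) requests already queued.
    Wait only if no request from last_proc is queued yet and that process is
    expected to issue its next request within the waiting budget.
    """
    if any(proc == last_proc for proc, _ in pending):
        return False   # its next request is already queued; just dispatch it
    return expected_think_time_s <= wait_budget_s
```

This also exposes the limitation noted above: a process with substantial think time between requests (say 50 ms against a 5 ms budget) makes waiting pointless, so anticipation degrades to work-conserving behavior.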
2.5 OS Support for Highly-Concurrent Servers
In support of concurrent servers, operating systems traditionally provide sev-
eral programming models, such as multi-process (MP), multi-threaded (MT),
and single-process event-driven (SPED) models. The debate between threads
and events has been ongoing in the literature for a long time. Pai et al. pro-
posed a server architecture, called AMPED (Asymmetric Multiprocess Event
Driven) [PDZ99a]. It extends SPED with multiple helper processes (or threads),
which handle blocking disk I/O operations. Welsh et al. proposed a staged event-
driven architecture (SEDA) for high-concurrent online servers [WCB01]. The
basic idea is to construct each application as a network of event-driven stages
connected by explicit event queues. More recently, Behren et al. argued that
thread-based systems can not only provide a simpler programming model but
also provide equivalent or superior performance compared with event-based sys-
tems [BCB03, vBCZ+03]. They implemented a user-level thread library, called
Capriccio, which scales well when supporting hundreds of thousands of threads
running Internet services.
However, these studies mainly focus on the CPU utilization efficiency of highly-
concurrent servers. In our study, the throughput of data-intensive server systems
is bounded by the I/O performance when the application data size far exceeds the
available system memory.
2.6 Other Aspects
There is some other research work that is relevant to data-intensive server
systems.
• Many researchers have explored file system data allocation to reduce aver-
age disk head positioning overheads. Some studies focused on collocation of
related data [RO92, BHS95, SS96, BC02, LSG+00]. Some other studies sug-
gested making use of the high bandwidth of outer regions of the disk to host
frequently accessed data [Met97, BDF+96]. In a recent work, Schindler et
al. proposed a way to avoid rotational delay and track crossing overheads
by allocating related data on disk track boundaries [SGLG02].
• Several studies explored efficient I/O buffering and caching to avoid redun-
dant data copying and multiple buffering in server systems [FP93, DP93,
KEG+97, Bru99, CP99, PDZ99b]. For instance, Pai et al. proposed unifying
all buffering and caching in the system, so that the file system, network sub-
system, IPC system, and application can share a single physical copy of the
data.
Those studies provide important OS support for data-intensive server systems. In
contrast with our work, they address different aspects of the OS; their techniques
complement ours and can be used alongside the techniques we propose.
In summary, existing operating system support for data-intensive server
systems is insufficient in terms of file-system prefetching, memory management,
and the disk I/O system. In the following chapters, we propose system-level
enhancements in these three aspects.
3 Competitive Prefetching
3.1 Introduction
Concurrent access to multiple sequential I/O streams is commonplace in both
server systems and desktop workloads. For such concurrent I/O workloads, con-
tinuous accesses to one sequential data stream can be interrupted by accesses to
other streams in the system. Frequent switching between multiple sequential I/O
streams may severely affect I/O efficiency due to long disk seek and rotational de-
lays of disk-based storage devices. In the rest of this dissertation, we refer to any
disk head movement that interrupts a sequential data transfer as an I/O switch.
The cost of an I/O switch includes both seek and rotational delays.
The problem of unnecessary disk I/O switches during concurrent I/O workloads
may be partially alleviated by non-work-conserving disk schedulers. In particular,
the anticipatory scheduler [ID01] may temporarily idle the disk (even when there
are outstanding I/O requests) so that consecutive I/O requests belonging to the
same process/thread can be serviced without interruption. However, anticipatory
scheduling is ineffective in cases where a single process or thread synchronously
accesses multiple sequential data streams in an interleaving fashion, or when a sub-
stantial amount of processing or think time exists between consecutive sequential
CHAPTER 3. COMPETITIVE PREFETCHING 13
requests from a process. In addition, anticipatory scheduling may not function
properly when it lacks the knowledge of I/O request-issuing process contexts, as
in the cases of I/O scheduling within a virtual machine monitor [JADAD06] or
within a parallel/distributed file system server [LS07, CIRT00].
Another technique to reduce I/O switch frequency for concurrent I/O work-
loads is to increase the granularity of sequential I/O operations. This can be
accomplished by aggressive OS-level prefetching that prefetches deeper in each
sequential I/O stream. However, under certain scenarios, aggressive prefetching
may have a negative impact on performance due to buffer cache pollution and inef-
ficient use of I/O bandwidth. Specifically, aggressive prefetching has a higher risk
of retrieving data that are not needed by the application. This chapter proposes
an OS-level prefetching technique that balances the overhead of I/O switching
and the disk bandwidth wasted on prefetching unnecessary data. We focus on the
mechanism that controls the prefetching depth for sequential or partially sequen-
tial accesses. We do not consider other aspects or forms of prefetching.
The proposed prefetching technique is based on a fundamental 2-competitiveness
result: “when the prefetching depth is equal to the amount of data that can be
sequentially transferred within the average time of a single I/O switch, the total
disk resource consumption of an I/O workload is at most twice that of the optimal
offline prefetching strategy”. Our proposed prefetching technique does not require
a-priori knowledge of an application’s data access pattern or any other type of
application support. We also show that the 2-competitiveness result is optimal,
i.e., no transparent OS-level online prefetching can achieve a competitiveness ra-
tio lower than 2. Finally, we further extend the competitiveness result to capture
prefetching in the case of random-access workloads.
The rest of this chapter is structured as follows. Section 3.2 provides more
background information on existing OS support and relevant storage device
characteristics. Section 3.3 presents the analysis and design of the proposed
competitive prefetching strategy. Section 3.4 discusses practical issues concerning storage
device characterization and the implementation of our proposed strategy. Sec-
tion 3.5 evaluates the proposed strategy using several microbenchmarks and a
variety of real application workloads. Section 3.6 discusses related previous work.
Section 3.7 discusses some performance issues and concludes the chapter.
3.2 Background
3.2.1 Existing Operating System Support
During concurrent I/O workloads, sequential access to a data stream can be
interrupted by accesses to other streams, increasing the number of expensive disk
seek operations. Anticipatory scheduling [ID01] can partially alleviate the prob-
lem. At the completion of an I/O request, anticipatory scheduling may keep the
disk idle for a short period of time (even if there are outstanding requests) in
anticipation of a new I/O request from the process/thread that issued the just-
completed request. Consecutive requests generated from the same process/thread
often involve data residing on neighboring disk locations and require little or no
seeking. This technique yields significant performance improvements for concurrent
I/O workloads in which the I/O requests issued by an individual process exhibit
strong locality. However, anticipatory scheduling is ineffective in several cases.
• Since anticipatory scheduling expects a process to issue requests with strong
locality in a consecutive way, it cannot avoid disk seek operations when a
single process multiplexes accesses among two or more data streams (Fig-
ure 2.2).
• The limited anticipation period introduced by anticipatory scheduling may
not avoid I/O switches when a substantial amount of processing or think
Figure 3.1: An illustration of I/O scheduling at the virtual machine
monitor (VMM). [Figure: processes proc1–proc3 issue requests through a guest
system's virtual I/O device; the VMM I/O scheduler multiplexes this guest and
other guest systems onto the physical device.] Typically, the VMM I/O scheduler
has no knowledge of the request-issuing process contexts within the guest systems.
As a result, the anticipatory I/O scheduling at the VMM may not function properly.
time exists between consecutive sequential requests from the same process.
More specifically, the anticipation may timeout before the anticipated re-
quest actually arrives.
• Anticipatory scheduling needs to keep track of the I/O behavior of each pro-
cess so that it can decide whether to wait for a process to issue a new request
and how long to wait. It may not function properly when the I/O scheduler
does not know the process contexts from which I/O requests are issued. In
particular, Jones et al. [JADAD06] pointed out that I/O schedulers within
virtual machine monitors typically do not know the request’s process con-
texts inside the guest systems. As shown in figure 3.1, the process contexts
CHAPTER 3. COMPETITIVE PREFETCHING 16
information is encapsulated inside the guest OS and can not be easily used by
the I/O scheduler in the virtual machine monitor. Another example is that
the I/O scheduler at a parallel/distributed file system server [LS07, CIRT00]
may not have the knowledge of remote process contexts for I/O requests.
An alternative to anticipatory scheduling is I/O prefetching, which can in-
crease the granularity of sequential data accesses and, consequently, decrease
the frequency of disk I/O switches. Since sequential transfer rates improve sig-
nificantly faster than seek and rotational latencies on modern disk drives, aggres-
sive prefetching becomes increasingly appealing. To limit the amount of wasted
I/O bandwidth and memory space, operating systems often employ an upper
bound on the prefetching depth, without a systematic understanding of the
impact of the prefetching depth on performance.
Our analysis in this chapter assumes that application-level logically sequential
data blocks are mapped to physically contiguous blocks on storage devices. This
is mostly true because file systems usually attempt to organize logically sequential
data blocks in a contiguous way in order to achieve high performance for sequential
access. Bad sector remapping on storage devices can disrupt sequential block
allocation. However, such remapping does not occur for the majority of sectors in
practice. We do not consider its impact on our analysis.
3.2.2 Disk Drive Characteristics
The disk service time of an I/O request consists of three main components: the
head seek time, the rotational delay, and the data transfer time. Disk performance
characteristics have been extensively studied in the past [KTR94, RW94, SMW98].
The seek time depends on the distance to be traveled by the disk head. It consists
roughly of a fixed head initiation cost plus a cost linear to the seek distance. The
rotational delay mainly depends on the rotation distance (fraction of a revolution).
Techniques such as out-of-order transfer and free-block scheduling [LSGN00] are
available to reduce the rotational delay. The sequential data transfer rate varies
at different data locations because of zoning on modern disks. Specifically, tracks
on outer cylinders often contain more sectors per track and, consequently, exhibit
higher data transfer rates.
Modern disks are equipped with read-ahead caches so they may continue to
read data after the requested data have already been retrieved. However, the disk
cache size is relatively small (usually less than 16 MB) and, consequently, data
stored in the disk cache are replaced frequently under concurrent I/O workloads.
In addition, disk cache read-ahead is usually performed only when there are no
pending requests, which is rare in the context of I/O-intensive workloads. There-
fore, we do not consider the impact of disk cache read-ahead on our analysis of
OS-level prefetching.
3.3 Competitive Prefetching
We investigate I/O prefetching strategies that address the following tradeoff in
deciding the I/O prefetching depth: conservative prefetching may lead to a high
I/O switch overhead, while aggressive prefetching may waste too much I/O band-
width on fetching unnecessary data. Section 3.3.1 presents a prefetching policy
that can achieve at least half the performance (in terms of I/O throughput) of the
optimal offline prefetching strategy without a-priori knowledge of the application
data access pattern. In the context of this work, we refer to the “optimal” prefetch-
ing as the best-performing sequential prefetching strategy that minimizes the com-
bined cost of disk I/O switch and retrieving unnecessary data. We do not consider
other issues such as prefetching-induced memory contention in our analysis. Pre-
vious work has explored such issues [CFKL95, KMC02, PGG+95, TPG97]. We
will look at memory-related issues in Chapter 4. In Section 3.3.2, we show that
2-competitive prefetching is the best competitive strategy possible. Section 3.3.3
examines prefetching strategies adapted to accommodate mostly random-access
workloads, along with their competitiveness results.
The competitiveness analysis in this section focuses on systems employing
standalone disk drives. In later sections, we will discuss practical issues and
present experimental results when extending our results for disk arrays.
3.3.1 2-Competitive Prefetching
Consider a concurrent I/O workload consisting of N sequential streams. For
each stream i (1 ≤ i ≤ N), we assume the total amount of accessed data (or stream
length) is S_{tot}[i]. Due to zoning on modern disks, the sequential data transfer rate
can vary depending on the data location. Since each stream is likely
localized to one disk zone, it is reasonable to assume that the transfer rate for any
data access on stream i is a constant, denoted R_{tr}[i].
For stream i, in the ideal case, the optimal prefetching strategy has a-priori
knowledge of Stot[i] (the amount of data to be accessed). It performs a single
I/O switch (including seek and rotation) to the beginning of the stream and then
sequentially transfers data of size Stot[i]. Let the I/O switch cost for stream i be
Cswitch[i]. The disk resource consumption (in time) for accessing all N streams
under optimal prefetching is:
Copttotal =
∑
1≤i≤N
(
Stot[i]
Rtr[i]+ Cswitch[i]
)
(3.1)
However, in practice the total amount of data to be accessed S_{tot}[i] is not known
a-priori. Consequently, OS-level prefetching tries to approximate S_{tot}[i] using an
appropriate prefetching depth. For simplicity, we assume that the OS employs a
constant prefetching depth S_p[i] for each stream. (This assumption constrains only
the design of the prefetching policy; what can be achieved with a constrained policy
can certainly be achieved with a non-constrained, broader policy, so the simplification
does not restrict the competitiveness result presented in this section.) Therefore,
accessing stream i requires \lceil S_{tot}[i] / S_p[i] \rceil prefetch operations, each
of which requires one I/O switch. Let the I/O switch cost be C_{switch}[i, j] for the
j-th prefetch operation of stream i. The total I/O switch cost for accessing stream i is:

C_{switch}^{tot}[i] = \sum_{1 \le j \le \lceil S_{tot}[i]/S_p[i] \rceil} C_{switch}[i, j]    (3.2)

The time wasted on fetching unnecessary data is bounded by the cost of the
last prefetch operation:

C_{waste}[i] \le \frac{S_p[i]}{R_{tr}[i]}    (3.3)

Therefore, the total disk resource (in time) consumed to access all streams is
bounded by:

C_{total} = \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} + C_{waste}[i] + C_{switch}^{tot}[i] \right)
          \le \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} + \frac{S_p[i]}{R_{tr}[i]} + \sum_{1 \le j \le \lceil S_{tot}[i]/S_p[i] \rceil} C_{switch}[i, j] \right)    (3.4)

The expected average I/O switch time (including seek and rotation) for a
concurrent I/O workload can be derived from the disk drive characteristics and
the workload concurrency. We consider the seek time and the rotational delay below.

• I/O scheduling algorithms such as Cyclic-SCAN reorder outstanding
I/O requests based on their data locations and schedule the I/O request clos-
est to the current disk head location. Suppose constant C is the size of
the contiguous portion of the disk where the application dataset resides.
When the disk scheduler can choose from n concurrent I/O requests at uni-
formly random disk locations, the inter-request seek distance D_{seek} follows
the following distribution:

\Pr[D_{seek} \ge x] = \left( 1 - \frac{x}{C} \right)^n    (3.5)

The expected seek time can then be obtained by combining Equation (3.5)
with the disk's mapping from seek distance to seek time. (Figure 3.2(B)
shows an example of this mapping for two disk drives.)
• The expected rotational delay of an I/O workload is the mean rotational
time between two random track locations (i.e., the time it takes the disk to
spin half a revolution).
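As a concrete check of Equation (3.5), the expected seek time can be estimated by Monte Carlo simulation: the distance to the closest of n uniformly placed requests has exactly the tail distribution above. The linear seek-time mapping below is a hypothetical stand-in for a measured seek profile, not data from a real drive:

```python
import random

def expected_seek_time(n, dataset_span, seek_time, samples=50_000):
    """Estimate the expected seek time when the scheduler picks the closest
    of n outstanding requests at uniformly random locations, which matches
    Pr[D_seek >= x] = (1 - x/C)^n from Equation (3.5)."""
    total = 0.0
    for _ in range(samples):
        # Distance from the head to the closest of n uniform request locations.
        d = min(random.uniform(0, dataset_span) for _ in range(n))
        total += seek_time(d)
    return total / samples

# Hypothetical linear mapping: fixed initiation cost plus a distance term
# (d expressed as a fraction of the disk span), per Section 3.2.2.
linear_seek = lambda d: 0.001 + 0.008 * d
```

With the identity mapping, the estimate converges to C/(n + 1), the mean of the distribution above; higher concurrency n yields a shorter expected seek, consistent with the discussion of runtime depth adaptation later in this chapter.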
The above result suggests that the expected average I/O switch time of a
concurrent workload is independent of the prefetching scheme (optimal or not)
being employed. Instead, it is related to the I/O concurrency n. For a specific
workload with I/O concurrency N, we denote the expected average I/O switch
cost as C_{switch}^{avg}. Therefore, Equation (3.1) can be simplified to:

C_{total}^{opt} = \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} \right) + N \cdot C_{switch}^{avg}    (3.6)
And Equation (3.4) can be simplified to:

C_{total} \le \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} + \frac{S_p[i]}{R_{tr}[i]} \right) + \sum_{1 \le i \le N} \left\lceil \frac{S_{tot}[i]}{S_p[i]} \right\rceil \cdot C_{switch}^{avg}
          < \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} + \frac{S_p[i]}{R_{tr}[i]} \right) + \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{S_p[i]} + 1 \right) \cdot C_{switch}^{avg}    (3.7)

Based on Equations (3.6) and (3.7), we find that:

if S_p[i] = C_{switch}^{avg} \cdot R_{tr}[i], then C_{total} < 2 \cdot C_{total}^{opt}    (3.8)
Equation 3.8 allows us to design a prefetching strategy with bounded worst-
case performance:
When the prefetching depth is equal to the amount of data that can be
sequentially transferred within the average time of a single I/O switch,
the total disk resource consumption of an I/O workload is at most twice
that of the optimal offline prefetching strategy.
In the meantime, the I/O throughput of a prefetching strategy that follows the
above guideline is at least half of the throughput of the optimal offline prefetching
strategy. Competitive strategies commonly refer to solutions whose performance
can be shown to be no worse than some constant factor of an optimal offline
strategy. We follow such convention to name our prefetching strategy competitive
prefetching.
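The guideline can be illustrated with a small cost model built directly from Equations (3.1) and (3.4); the stream lengths and device parameters below are hypothetical:

```python
import math

def total_time(stream_lengths, depth, switch_cost, transfer_rate):
    """Disk time under a fixed prefetching depth (the Equation (3.4) bound):
    useful transfer + wasted last prefetch + one switch per prefetch."""
    return sum(length / transfer_rate              # useful data transfer
               + depth / transfer_rate             # wasted-data bound (Eq. 3.3)
               + math.ceil(length / depth) * switch_cost
               for length in stream_lengths)

def optimal_time(stream_lengths, switch_cost, transfer_rate):
    """Equation (3.1): one I/O switch per stream, no wasted transfer."""
    return sum(length / transfer_rate + switch_cost
               for length in stream_lengths)

# Hypothetical device: 40 MB/s sequential transfer, 10 ms per I/O switch.
rate, switch = 40e6, 0.010
competitive_depth = switch * rate   # Equation (3.8): data per switch time
```

For any set of stream lengths, `total_time` at the competitive depth stays below twice `optimal_time`, whereas a much smaller depth inflates switch costs and a much larger one inflates wasted transfer.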
3.3.2 Optimality of 2-Competitiveness
In this section, we show that no transparent OS-level online prefetching strat-
egy (without prior knowledge of the sequential stream length Stot[i] or any other
application-specific information) can achieve a competitiveness ratio lower than
2. Intuitively, we can reach the above conclusion since:
• Short sequential streams demand conservative prefetching, while long streams
require aggressive prefetching.
• No online prefetching strategy can unify the above conflicting goals with a
competitiveness ratio lower than 2.
We prove the above by contradiction. Assume there exists an α-competitive
solution X (where 1 < α < 2). In the derivation of the contradiction, we assume
the I/O switch cost C_{switch} is constant. (This assumption is safe because a
contradiction under the constrained disk system model entails a contradiction under
the more general disk system model: if an α-competitive solution does not exist
under the constrained model, then it certainly does not exist in the general case.)
The discussion below is in the context of a single stream, so we drop the stream id
for notational simplicity (e.g., S_{tot}[i] → S_{tot}). In X, let S_p^n denote the
size of the n-th prefetch for a stream. Since X is α-competitive, we have
C_{total} ≤ α · C_{total}^{opt}.

First, we show that the size of the first prefetch operation S_p^1 satisfies:

S_p^1 < C_{switch} \cdot R_{tr}    (3.9)

Otherwise, we let S_{tot} = ε where 0 < ε < \frac{2-α}{α} \cdot C_{switch} \cdot R_{tr}. In this case,
C_{total}^{opt} = C_{switch} + \frac{ε}{R_{tr}} and ε < S_p^1. Because
C_{total} = C_{switch} + \frac{S_p^1}{R_{tr}} ≥ 2 \cdot C_{switch}, we
have C_{total} > α \cdot C_{total}^{opt}. Contradiction!

We then show that:

\forall k ≥ 2, \quad S_p^k < \left( \sum_{j=1}^{k-1} S_p^j \right) - (k-2) \cdot C_{switch} \cdot R_{tr}    (3.10)

Otherwise (if Equation (3.10) is not true for a particular k), we define T as below,
and we know T > 0:

T = S_p^k + (k - α) \cdot C_{switch} \cdot R_{tr} + (1 - α) \cdot \left( \sum_{j=1}^{k-1} S_p^j \right)

We then let S_{tot} = \left( \sum_{j=1}^{k-1} S_p^j \right) + ε where 0 < ε < \frac{T}{α}. In this case,
C_{total}^{opt} = C_{switch} + \frac{\left( \sum_{j=1}^{k-1} S_p^j \right) + ε}{R_{tr}} and
C_{total} ≥ k \cdot C_{switch} + \frac{\sum_{j=1}^{k} S_p^j}{R_{tr}}. Therefore, we have C_{total} > α \cdot C_{total}^{opt}.
Contradiction!

We next show:

\forall k ≥ 2, \quad S_p^k < S_p^{k-1}    (3.11)

We prove Equation (3.11) by induction. Equation (3.11) is true for k = 2 as a
direct result of Equation (3.10). Assume Equation (3.11) is true for all 2 ≤ k ≤
\bar{k} - 1. Together with Equation (3.9), this also means that S_p^k < C_{switch} \cdot R_{tr} for all
1 ≤ k ≤ \bar{k} - 1. Below we show Equation (3.11) is true for k = \bar{k}:

S_p^{\bar{k}} < \left( \sum_{j=1}^{\bar{k}-1} S_p^j \right) - (\bar{k} - 2) \cdot C_{switch} \cdot R_{tr}
             = S_p^{\bar{k}-1} + \sum_{j=1}^{\bar{k}-2} \left( S_p^j - C_{switch} \cdot R_{tr} \right)
             < S_p^{\bar{k}-1}    (3.12)

Our final contradiction is that for any k ≥ \frac{2 \cdot C_{switch} \cdot R_{tr} - S_p^1}{C_{switch} \cdot R_{tr} - S_p^1}, we have S_p^k < 0:

S_p^k < \left( \sum_{j=1}^{k-1} S_p^j \right) - (k-2) \cdot C_{switch} \cdot R_{tr}
      < (k-1) \cdot S_p^1 - (k-2) \cdot C_{switch} \cdot R_{tr}
      = 2 \cdot C_{switch} \cdot R_{tr} - S_p^1 - k \cdot (C_{switch} \cdot R_{tr} - S_p^1)
      ≤ 2 \cdot C_{switch} \cdot R_{tr} - S_p^1 - (2 \cdot C_{switch} \cdot R_{tr} - S_p^1)
      = 0    (3.13)
3.3.3 Accommodation for Random-Access Workloads
Competitive strategies only address the worst-case performance. A general-
purpose operating system prefetching policy should deliver high average-case per-
formance for common application workloads. In particular, many applications
only exhibit fairly short sequential access streams or even completely random ac-
cess patterns. To accommodate these workloads, our competitive prefetching
policy employs a slow-start phase similar to the one commonly found in modern
operating systems such as Linux. For a new data stream, prefetching starts with a
relatively small initial depth. Upon detection of a sequential access pattern, the
depth of each additional prefetching operation is increased (often doubled) until
it reaches the desired maximum prefetching depth. In our implementation, the
maximum prefetching depth is set to the competitive prefetching depth.
The following example illustrates the slow-start phase. Let the initial prefetch-
ing depth be 16 pages (64 KB). Upon detection of a sequential access pattern, the
depth of each consecutive prefetch operation increases by a factor of two until it
reaches its maximum permitted value. For a sequential I/O stream with a com-
petitive prefetching depth of 98 pages (392 KB), a typical series of prefetching
depths should look like: 16 pages → 32 pages → 64 pages → 98 pages → 98 pages
→ · · ·
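The doubling schedule in this example can be sketched as follows (16-page initial depth and a 98-page competitive maximum, as above):

```python
def slow_start_depths(num_ops, initial=16, max_depth=98):
    """Prefetching depth (in pages) of each successive prefetch operation:
    start small, double upon detected sequential access, and cap at the
    (competitive) maximum prefetching depth."""
    depths, depth = [], initial
    for _ in range(num_ops):
        depths.append(min(depth, max_depth))
        depth *= 2
    return depths
```

Here `slow_start_depths(5)` reproduces the 16 → 32 → 64 → 98 → 98 series above.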
For our targeted sequential I/O workloads, the introduction of slow-start phase
will affect the competitiveness of our proposed technique. In the rest of this
subsection, we extend the previous competitiveness result when a slow-start phase
is employed by the prefetching policy. Assume the slow-start phase consists of at
most P prefetching operations (3 in the above example). Also let the total amount
of data fetched during a slow-start phase be bounded by T (112 pages in the above
example). The total disk resource consumed (Equation (3.7)) becomes:
Ctotal
≤∑
1≤i≤N
(
Stot[i]
Rtr[i]+
Sp[i]
Rtr[i]
)
+∑
1≤i≤N
(
⌈Stot[i] − T
Sp[i]⌉ + P
)
· Cavgswitch
<∑
1≤i≤N
(
Stot[i]
Rtr[i]+
Sp[i]
Rtr[i]
)
+∑
1≤i≤N
(
Stot[i]
Sp[i]+ 1 + P
)
· Cavgswitch
(3.14)
Since the competitive prefetching depth S_p[i] is equal to C_{switch}^{avg} \cdot R_{tr}[i], Equa-
tion (3.14) becomes:

C_{total} < 2 \cdot \sum_{1 \le i \le N} \left( \frac{S_{tot}[i]}{R_{tr}[i]} \right) + (2 + P) \cdot N \cdot C_{switch}^{avg}    (3.15)

Equations (3.6) and (3.15) lead to:

C_{total} < 2 \cdot C_{total}^{opt} + P \cdot N \cdot C_{switch}^{avg}    (3.16)
Consequently, the prefetching strategy that employs a slow-start phase is still
competitive, but with an additive constant term (P \cdot N \cdot C_{switch}^{avg}). Despite this weaker
analytical result, our experimental results in Section 3.5 show that the new strategy
still delivers at least half the emulated optimal performance in practice.
3.4 Implementation Issues
We have implemented the proposed competitive prefetching technique in the
Linux 2.6.10 kernel. In our implementation, we use the competitive prefetching
policy to control the depth of prefetching operations. In the default Linux kernel,
the maximum prefetch depth is constrained by the constant VM_MAX_READAHEAD
defined in the header file mm.h. We make the maximum prefetch depth a global
variable whose value may be changed during application execution. We have not
modified other aspects of the prefetching algorithm used in the Linux kernel; in
particular, the slow-start phase of the default kernel is kept to accommodate
random-access workloads.
To control the maximum prefetch depth, our competitive prefetching policy
depends on several storage device characteristics, including the functional map-
ping from seek distance to seek time (denoted by f_{seek}) and the mapping from
data location to sequential transfer rate (denoted by f_{transfer}).
The disk drive characteristics can be evaluated offline, at disk installation time
or OS boot time. Using the two mappings, we determine the runtime C_{switch}^{avg} and
R_{tr}[i] in the following way:
• During runtime, our system tracks the seek distances of past I/O opera-
tions and determines their seek times using f_{seek}. We use an exponentially-
weighted moving average of the past disk seek times to estimate the seek
component of the next I/O switch cost. The rotational delay component
of C_{switch}^{avg} is estimated as the average rotational delay between two
random track locations (i.e., the time it takes the disk to spin half a revo-
lution).

• For each I/O request starting at logical disk location L_{transfer}, the data
transfer rate is determined as f_{transfer}(L_{transfer}).
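The runtime estimation above might be sketched as follows; the smoothing factor `alpha` and the particular seek mapping are illustrative assumptions, not values from the actual implementation:

```python
class SwitchCostEstimator:
    """Estimate the average I/O switch cost as an exponentially-weighted
    moving average of recent seek times plus the half-revolution rotational
    delay (a sketch of the runtime estimation described in the text)."""

    def __init__(self, f_seek, half_revolution_s, alpha=0.3):
        self.f_seek = f_seek                  # seek distance -> seek time (s)
        self.half_revolution_s = half_revolution_s
        self.alpha = alpha                    # assumed smoothing factor
        self.ewma_seek = None

    def observe(self, seek_distance):
        """Fold the seek time of one completed I/O into the moving average."""
        t = self.f_seek(seek_distance)
        self.ewma_seek = t if self.ewma_seek is None else (
            self.alpha * t + (1 - self.alpha) * self.ewma_seek)

    def switch_cost(self):
        """Current estimate of the average I/O switch cost, in seconds."""
        return (self.ewma_seek or 0.0) + self.half_revolution_s
```

For a 10 KRPM drive, the half-revolution delay fed to the estimator would be 3 ms (60 s / 10000 revolutions / 2).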
The competitiveness analysis in the previous section targets systems with stan-
dalone disk drives. However, the analysis applies also to more complex storage
devices as long as they exhibit disk-like I/O switch and sequential data transfer
characteristics. Hence, multi-disk storage devices can utilize the proposed compet-
itive prefetching algorithm. Disk arrays allow simultaneous data transfers out of
multiple disks so that they may offer higher aggregate I/O bandwidth. However,
seek and rotational delays are still limited by individual disks. Overall, multi-disk
storage devices often expect larger competitive prefetching depths.
We include three disk-based storage devices in our experimental evaluation: a
36.4 GB IBM 10 KRPM SCSI drive, a 146 GB Seagate 10 KRPM SCSI drive, and a
RAID0 disk array with two Seagate drives using an Adaptec RAID controller. We
have also measured a RAID5 disk array with five Seagate drives, which has a lower
transfer rate than the RAID0 due to the overhead of relatively complex RAID5
processing. We keep the RAID0 device in our evaluation as a representative of
high-bandwidth devices.
We measure the storage device properties by issuing direct SCSI commands
through the Linux generic SCSI interface, which allows us to bypass the OS mem-
ory cache and selectively disable the disk controller cache. Our disk profiling takes
Figure 3.2: Sequential transfer rate and seek time for two disk drives
and a RAID. [Panel (A): sequential transfer rate (MB/sec) vs. starting logical
block number, for the Adaptec RAID0 with two Seagate drives, a single Seagate
drive, and a single IBM drive under Linux 2.6.10 and 2.6.16. Panel (B): seek time
(milliseconds) vs. seek distance, for the Seagate and IBM drives.] Most results
were acquired under the Linux 2.6.10 kernel. The transfer rate of the IBM drive
appears to be affected by the operating system software. We also show its transfer
rate on a Linux 2.6.16 kernel.
less than two minutes to complete for each disk drive/array and it could be easily
performed at disk installation time. Figure 3.2 shows the measured disk charac-
teristics (sequential transfer rate and the seek time). Most measurements were
performed on a Linux 2.6.10 kernel. We find that the transfer rate of the IBM drive
is constrained by a bottleneck (at 37.5 MB/sec, or 300 Mbps). We suspect a
software-related bug in the disk driver of the Linux 2.6.10 kernel; a more recent
Linux kernel (2.6.16) does not exhibit this problem.
For the IBM drive (under Linux 2.6.10 kernel), the transfer speed is around
37.3 MB/sec; the average seek time between two independently random disk lo-
cations is 7.53 ms; and the average rotational delay is 3.00 ms. For this drive, the
average competitive prefetching depth—the amount of data that can be sequentially
transferred within the average time of a single disk seek and rotation—is 393 KB. For
the Seagate drive and the RAID0 disk array, measured average sequential transfer
rates are 55.8 MB/sec and 80.9 MB/sec respectively. With the assumption that
the average seek time of the disk array is the same as that of a single disk drive
(7.98 ms), the average competitive prefetching depths for the Seagate drive and
the disk array are 446 KB and 646 KB respectively. In comparison, the default
maximum prefetching depth in Linux 2.6.10 is only 128 KB.
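The 393 KB figure for the IBM drive follows directly from the guideline of Equation (3.8); a quick check (treating 1 MB as 1000 KB):

```python
def competitive_depth_kb(avg_seek_ms, avg_rotation_ms, transfer_mb_per_s):
    """Competitive prefetching depth: the amount of data that can be
    sequentially transferred within one average I/O switch (seek plus
    rotation) time. Assumes 1 MB = 1000 KB for the unit conversion."""
    switch_s = (avg_seek_ms + avg_rotation_ms) / 1000.0
    return switch_s * transfer_mb_per_s * 1000.0   # (MB/s * s) reported in KB

# IBM drive: 7.53 ms average seek + 3.00 ms rotation at 37.3 MB/s -> ~393 KB.
depth = competitive_depth_kb(7.53, 3.00, 37.3)
```

The same function with a larger switch time or a faster transfer rate reproduces the larger depths quoted for the Seagate drive and the RAID0 array.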
Note that the above average competitive prefetching depths are calculated
based on the average seek time between two independently random disk locations
(i.e., when there are two concurrent processes). The average seek time is smaller at
higher concurrency levels, since the disk scheduler can choose from more concurrent
requests for seek reduction; thus the runtime competitive prefetching depth
should be lower in those cases. We use the average competitive prefetching depth
as a reference for implementing aggressive prefetching (Section 3.5.2).
3.5 Experimental Evaluation
We assess the effectiveness and competitiveness of the proposed technique
in this section. In our experimental evaluation, we compare a Linux kernel ex-
tended with our competitive prefetching policy against the original Linux kernel
and a Linux kernel employing an aggressive prefetching algorithm. Anticipatory
scheduling is enabled across all experiments and kernel configurations. Our main
experimental platform is a machine equipped with two 2 GHz Intel Xeon pro-
cessors, 2 GB memory, and three storage drives. Figure 3.2 presents sequential
transfer rate and seek time characteristics of the three storage drives.
The performance of competitive prefetching is affected by several factors: 1)
virtualized or distributed I/O architecture that may limit the effectiveness of
anticipatory I/O scheduling; 2) the storage device characteristics; and 3) serial
vs. concurrent workloads. Section 3.5.2 presents the base-case performance of
concurrent workloads using the standalone IBM drive. Then we explicitly examine
the impact of each of these factors (sections 3.5.3, 3.5.4, and 3.5.5). Finally,
section 3.5.6 summarizes the experimental results.
3.5.1 Evaluation Benchmarks
Our evaluation benchmarks consist of a variety of I/O-intensive workloads,
including four microbenchmarks, two common file utilities, and four real appli-
cations. We mostly focus on server workloads. Some non-server workloads are
included to show that they can also benefit from the competitive prefetching
strategy.
Each of our microbenchmarks is a server workload involving a server and a
client. The server provides access to a dataset of 6000 4 MB disk-resident files.
Upon arrival of a new request from the client, the server spawns a thread, which
we call a request handler, to process it. Each request handler reads one or more
of the disk files in a certain pattern. The microbenchmark suite consists of four
microbenchmarks that are characterized by different access patterns. Specifically,
the microbenchmarks differ in the number of files accessed by each request handler
(one to four), the portion of each file accessed (the whole file, a random portion,
or a 64 KB chunk) and a processing delay (0 ms or 10 ms). We use a descriptive
naming convention to refer to each benchmark. For example, <Two-Rand-0>
describes a microbenchmark whose request handler accesses a random portion of
two files with 0 ms processing delay. The random portion starts from the beginning
of each file. Its size is evenly distributed between 64 KB and 4 MB. Accesses within
a portion of a file are always sequential in nature. The specifications of the four
microbenchmarks are:
• One-Whole-0: Each request handler randomly chooses a file and it repeat-
edly reads 64 KB data blocks until the whole file is accessed.
• One-Rand-10: Each request handler randomly chooses a file and it repeat-
edly reads 64 KB data blocks from the beginning of the file up to a random
total size (evenly distributed between 64 KB and 4 MB). Additionally, we
add a 10 ms processing delay at four random file access points during the

processing of each request. The processing delays are used to emulate pos-
sible delays during request processing and may cause the anticipatory I/O
scheduler to timeout.
• Two-Rand-0: Each request handler alternates reading 64 KB data blocks
sequentially from two randomly chosen files. It accesses a random portion
of each file from the beginning. This workload emulates applications that
simultaneously access multiple sequential data streams.
• Four-64KB-0: Each request handler randomly chooses four files and reads
a 64 KB random data block from each file.
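As a concrete illustration, the read loop shared by these request handlers can be sketched as below; the byte counting stands in for the actual 64 KB read() calls on a dataset file, and the helper name is ours, not part of the benchmark source.

```c
#define BLOCK_SIZE (64 * 1024)   /* read granularity used by all handlers */

/* One-Whole-0-style handler: consume a file in 64 KB blocks until the
 * whole file has been accessed. Returns the number of read calls issued;
 * a real handler would issue read(fd, buf, BLOCK_SIZE) at each step. */
static long handle_request_whole(long file_size)
{
    long offset = 0, reads = 0;
    while (offset < file_size) {
        long n = file_size - offset;
        if (n > BLOCK_SIZE)
            n = BLOCK_SIZE;
        offset += n;
        reads++;
    }
    return reads;
}
```

For the 4 MB dataset files, each request therefore issues 64 sequential 64 KB reads; the Rand variants simply stop the loop at a randomly drawn total size.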
We include two common file utilities and four real applications in our evalua-
tion:
• cp: cp reads a source file chunk by chunk and copies the chunks into a
destination file. We run cp on files of different sizes. We use cp5 and cp50
to represent test runs of cp on files of 5 MB and 50 MB respectively.
• BSD diff: diff is a utility that compares the content of two text files.
BSD diff alternately accesses two files in making the comparison and it is
more scalable than GNU diff [GNU06]. We have ported BSD diff to our
experimental platform. Again, we run it on files of different sizes. diff5
works on 2 files of 5 MB each and diff50 works on 2 files of 50 MB each.
• Kernel build: The kernel build workload includes compiling and linking a
Linux 2.6.10 kernel. The kernel build involves reading from and writing to
many small files.
• TPC-H: We evaluate a local implementation of the TPC-H decision support
benchmark [TPCH] running on the MySQL 5.0.17 database. The TPC-H
workload consists of 22 SQL queries with 2 GB database size. The maximum
database file size is 730 MB, while the minimum is 432 bytes. Each query
accesses several database files with mixed random and sequential access
patterns.
• Apache hosting media clips: We include the Apache web server in our evalu-
ation. Typical web workloads often contain many small files. Since our work
focuses on applications with substantially sequential access pattern, we use
a workload containing a set of media clips, following the file size and access
distribution of the video/audio clips portion of the 1998 World Cup work-
load [AJ99]. About 9% of files in the workload are large video clips while
the rest are small audio clips. The overall file size range is 24 KB–1418 KB
with an average of 152 KB. The total dataset size is 20.4 GB. During the
tests, individual media files are chosen in the client requests according to a
Zipf distribution.
• Index searching: Another application we evaluate is a trace-driven index
searching server using a dataset we acquired from the Ask Jeeves search
engine [Ask]. The dataset supports search on 12.6 million web pages. It
includes a 522 MB mapping file that maps MD5-encoded keywords to proper
locations in the search index. The search index file itself is approximately
18.5 GB, divided into 8 partitions. For each keyword in an input query,
a binary search on the mapping file returns the location and size of this
keyword’s matching web page identifier list in the main dataset. The dataset
is then accessed following a sequential access pattern. Multiple sequential
I/O streams on the dataset are accessed alternately for each multi-keyword
query. The search query words in our test workload are based on a trace
recorded at the Ask Jeeves site in summer 2004.
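The Zipf-based file selection used by the media-clip client can be sketched by inverting the cumulative distribution; the exponent of 1 (probability of rank i proportional to 1/i) is our assumption here, since the workload description does not state the Zipf parameter.

```c
/* Pick a file rank in 1..nfiles under a Zipf(1) popularity distribution,
 * given a uniform random draw u in [0, 1). Rank 1 is the most popular
 * file. The O(nfiles) scan is kept simple for clarity; a real client
 * would precompute the cumulative table once. */
static int zipf_pick(int nfiles, double u)
{
    double h = 0.0, c = 0.0;
    int i;
    for (i = 1; i <= nfiles; i++)
        h += 1.0 / i;            /* harmonic-number normalizer */
    for (i = 1; i <= nfiles; i++) {
        c += (1.0 / i) / h;
        if (u < c)
            return i;
    }
    return nfiles;               /* guard against rounding when u is near 1 */
}
```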
These workloads contain a variety of different I/O access patterns. cp, kernel
build, and the Apache web server request handler follow a simple sequential access
[Figure: three panels plotting offset against virtual time — (A) TPC-H Q2 (block offset in four files), (B) TPC-H Q16 (block offset in four files), (C) index searching (page offset on the search index).]
Figure 3.3: I/O access pattern of certain application workloads. Virtual
time represents the sequence of discrete data accesses. In (A), each straight
line implies a sequential access pattern within a file. In (B), random and
non-sequential accesses make the files indistinguishable.
pattern to access whole files. BSD diff alternately accesses two files, again
reading them in their entirety. TPC-H and the index searching server access
partial files with more complex I/O access patterns. Figure 3.3 illustrates the
access pattern of two TPC-H queries (Q2 and Q16) as well as a typical index
searching request processing. We observe that the TPC-H Q2 and the index
searching exhibit alternating sequential access patterns while TPC-H Q16 has a
mixed pattern with a large amount of random accesses.
Among these workloads, the microbenchmarks, Apache web server, and in-
dex searching are considered as server-style concurrent applications. The other
workloads are considered as non-server applications. Each server-style applica-
tion may run at different concurrency levels depending on the input workload. In
our experiments, each server workload involves a server application and a load
generation client (running on a different machine) that can adjust the number of
simultaneous requests to control the server concurrency level.
3.5.2 Base-Case Performance
We assess the effectiveness of the proposed technique by first showing the base-
case performance. All the performance tests in this subsection are performed
on the standalone IBM drive. We compare application performance under the
following different kernel versions:
#1. Linux: The original Linux 2.6.10 kernel with a maximum prefetching depth
of 128 KB.
#2. CP: Our proposed competitive prefetching strategy described in Section 3.3
(with the slow-start phase to accommodate random-access workloads).
#3. AP: Aggressive prefetching with a maximum prefetching depth of 800 KB
(about twice that of average competitive prefetching).
#4. OPT: An oracle prefetching policy implemented using application disclosed
hints. We have only implemented this policy for microbenchmarks. Each
microbenchmark provides accurate data access pattern to the OS through an
extended open() system call. The OS then prefetches exactly the amount
of data needed for each sequential stream. The oracle prefetching policy is a
hypothetical approach employed to approximate the performance of optimal
prefetching.
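The hint interface used by OPT can be pictured roughly as follows; the structure layout and function names are illustrative stand-ins, not the actual extended open() signature from our implementation.

```c
/* Hypothetical hint describing one sequential stream that a request
 * handler will read; an application would pass such hints through the
 * extended open() call. */
struct access_hint {
    long offset;   /* byte offset where the sequential stream begins */
    long length;   /* exact number of bytes the stream will read */
};

/* With perfect hints, the oracle prefetches exactly the data remaining
 * in the stream beyond the current position, no more and no less. */
static long opt_prefetch_amount(const struct access_hint *h, long pos)
{
    long remaining = h->offset + h->length - pos;
    return remaining > 0 ? remaining : 0;
}
```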
We separately show the results for server-style and non-server workloads be-
cause server workloads naturally run at varying concurrency levels. Figure 3.4
illustrates the performance of server-style workloads (including four microbench-
marks, Apache hosting media clips, and index searching).
For the first microbenchmark (One-Whole-0), all practical policies (Linux, CP,
and AP) perform equally well (Figure 3.4(A)). In this case, since anticipatory
scheduling can avoid I/O switching, files are accessed sequentially. However, Fig-
ures 3.4(B) and 3.4(C) demonstrate that anticipatory scheduling is not effective
[Figure: six panels plotting application I/O throughput (MB/sec) against the number of concurrent request handlers (1–64): (A) microbenchmark One-Whole-0, (B) One-Rand-10, (C) Two-Rand-0, (D) Four-64KB-0, (E) Apache, (F) index searching; curves compare Linux, CP, AP, and (for the microbenchmarks) OPT.]
Figure 3.4: Application I/O throughput of server-style workloads (four
microbenchmarks and two real applications). The emulated optimal per-
formance is presented only for the microbenchmarks.
when significant processing time (think time) is present or when a request han-
dler accesses multiple streams alternately. In such cases, our proposed competi-
tive prefetching policy can significantly improve the overall I/O performance (16–
24% and 10–35% for One-Rand-10 and Two-Rand-0 respectively). Figure 3.4(D)
presents performance results for the random-access microbenchmark. We observe
that all policies perform similarly. Competitive and aggressive prefetching do
not exhibit degraded performance because the slow-start phase quickly identi-
fies the non-sequential access pattern and avoids prefetching a significant amount
of unnecessary data. Figures 3.4(A–D) also demonstrate that the performance
of competitive prefetching is always at least half the performance of the oracle
prefetching policy for all microbenchmarks.
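The slow-start behavior referenced above (Section 3.3) can be pictured as follows; the 16 KB initial window and 400 KB cap are illustrative stand-ins for the implementation's actual initial depth and per-device competitive depth.

```c
#define MIN_DEPTH  (16 * 1024)    /* illustrative initial window */
#define COMP_DEPTH (400 * 1024)   /* illustrative competitive depth */

/* Compute the next prefetching depth: double on each sequential hit up
 * to the competitive depth; fall back to the small initial window
 * whenever a non-sequential access is observed. */
static long next_depth(long depth, int sequential)
{
    if (!sequential)
        return MIN_DEPTH;
    depth *= 2;
    return depth > COMP_DEPTH ? COMP_DEPTH : depth;
}
```

This is why Four-64KB-0 sees no penalty: its random 64 KB accesses keep the depth pinned near the initial window, so little unnecessary data is fetched.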
Figure 3.4(E) shows the throughput of the Apache web server hosting media
clips. Each request follows a strictly sequential data access pattern on a single
file. Similarly to One-Whole-0, anticipatory scheduling minimizes the disk I/O
switch cost and competitive prefetching does not lead to further performance
improvement. Figure 3.4(F) presents the I/O throughput of the index search
server at various concurrency levels. Each request handler for this application
alternates among several sequential streams. Competitive prefetching can improve
I/O throughput by 10–28% compared to the original Linux kernel.
We use the following concurrent workload scenarios for the non-server appli-
cations (cp, BSD diff, and TPC-H). 2cp5 represents the concurrent execution
of two instances of cp5 (copying a 5 MB file). 2cp50 represents two instances
of cp50 running concurrently. 2diff5 represents the concurrent execution of
two instances of diff5. 2diff50 represents two instances of diff50 running con-
currently. cp5+diff5 represents the concurrent execution of cp5 and diff5.
cp50+diff50 represents the concurrent execution of cp50 and diff50. 2TPC-H
represents the concurrent run of the TPC-H query series Q2→Q3→ · · · →Q11
and Q12→Q13→ · · · →Q22.
[Figure: grouped bars of normalized execution time for 2cp5, 2cp50, 2diff5, 2diff50, cp5+diff5, cp50+diff50, and 2TPC-H under Linux, CP, and AP; Linux running times are 1.18 s, 3.85 s, 2.57 s, 18.56 s, 1.59 s, 11.43 s, and 19.92 m respectively.]
Figure 3.5: Normalized execution time of non-server concurrent appli-
cations. The values on top of the "Linux" bars represent the running time
of each workload under the original Linux kernel (in minutes or seconds).
Figure 3.5 presents the execution time of the seven non-server concurrent work-
loads. There is not much performance variation between different prefetching poli-
cies for the concurrent execution of cp, because with anticipatory scheduling, files
are read with almost no disk I/O switching, even under concurrent execution.
However, for 2diff5 and 2diff50 with alternating sequential access patterns,
competitive prefetching reduces the execution time by 31% and 34% respectively.
For 2TPC-H, the performance improvement achieved by competitive prefetching
is 8%.
While competitive prefetching improves the performance of several workload
scenarios, aggressive prefetching provides very little additional improvement. For
index searching, aggressive prefetching performs significantly worse than compet-
itive prefetching, since competitive prefetching maintains a good balance between
the overhead of disk I/O switch and that of unnecessary prefetching. Further-
more, aggressive prefetching risks wasting too much I/O bandwidth and memory
resources on fetching and storing unnecessary data.
3.5.3 Performance with
Process-Context-Oblivious I/O Scheduling
The experimental results in the previous section show that during the concur-
rent execution of applications with strictly sequential access patterns, anticipa-
tory scheduling avoids unnecessary I/O switch and competitive prefetching leads
to little further performance improvement. Such workloads include One-Whole-0,
Apache web server, 2cp5, 2cp50, cp5+diff5, and cp50+diff50. However, our
discussion in Section 3.2.1 suggests that anticipatory scheduling is not effective in
virtualized [JADAD06] or parallel/distributed I/O architectures [LS07] due to the
lack of knowledge of request-issuing process contexts. In this subsection, we ex-
amine this performance effect on a virtual machine platform. Specifically, we
use the Xen (version 3.0.2) [BDF+03] virtual machine monitor and the guest OS is
a modified Linux 2.6.16 kernel. We have implemented our competitive prefetching
algorithm in the guest OS.
The experiments are conducted using the standalone IBM hard drive. How-
ever, since the drive delivers higher I/O throughput under the Linux 2.6.16 kernel
than under the Linux 2.6.10 kernel (as illustrated in Figure 3.2), the absolute
workload I/O throughput shown here appears higher than that in the previous
subsection.
Figure 3.6 and Figure 3.7 show the performance of server-style and non-server
workloads on a virtual machine platform. We present results only for the targeted
workloads mentioned at the beginning of this subsection. In contrast to Fig-
ure 3.4(A), Figure 3.6(A) shows that competitive prefetching is very effective in
improving the I/O throughput of the microbenchmark (up to 71% throughput en-
hancement compared to the original Linux kernel). Similarly, comparison between
Figure 3.6(B) and Figure 3.4(E) shows that competitive prefetching improves the
I/O throughput of the Apache web server. Comparison between Figure 3.7 and
[Figure: two panels plotting application I/O throughput (MB/sec) against the number of concurrent request handlers (1–64): (A) microbenchmark One-Whole-0 (Linux, CP, AP, OPT) and (B) Apache (Linux, CP, AP).]
Figure 3.6: Performance of server-style workloads on a Xen virtual ma-
chine platform where the I/O scheduling is oblivious to request-issuing pro-
cess contexts. We present results only for workloads whose performance
differs substantially from that in the base case (shown in Figure 3.4). For
the microbenchmark, we also show the performance of an emulated optimal
prefetching.
Figure 3.5 shows that competitive prefetching also improves the I/O throughput
of four non-server workloads. In the virtual machine environment, since anticipa-
tory I/O scheduling is not effective, it is competitive prefetching that can reduce
the frequency of I/O switches and thus improve the I/O performance. In addi-
tion, Figure 3.6(A) demonstrates that the performance of competitive prefetching
is always at least half the performance of the oracle prefetching policy for the
microbenchmark.
3.5.4 Performance on Different Storage Devices
The experiments of the previous sections are all performed on a standalone
IBM disk drive. In this section, we assess the impact of different storage devices
on the effectiveness of competitive prefetching. We compare results from three
different storage devices characterized in Section 3.4: a standalone IBM drive, a
[Figure: grouped bars of normalized execution time for 2cp5, 2cp50, cp5+diff5, and cp50+diff50 under Linux, CP, and AP; Linux running times are 0.86 s, 3.29 s, 0.55 s, and 4.85 s respectively.]
Figure 3.7: Normalized execution time of non-server concurrent applica-
tions on a virtual machine platform where the I/O scheduling is oblivious
to request-issuing process contexts. We present results only for workloads
whose performance differs substantially from that in the base case (shown
in Figure 3.5). The values on top of the "Linux" bars represent the running
time of each workload under the original Linux kernel (in seconds).
standalone Seagate drive, and a RAID0 disk array with two Seagate drives using
an Adaptec RAID controller. Experiments in this section are performed on a
normal I/O architecture (not on a virtual machine platform).
We choose the workloads that already exhibit significant performance variation
among different prefetching policies on the standalone IBM drive. Figure 3.8
shows the normalized execution time of two non-server workloads 2diff5 and
2diff50. We observe that the competitive prefetching strategy is effective for all
three storage devices (up to 53% for 2diff50 on RAID0). In addition, we find
that the performance improvement is more pronounced on the disk array than on
standalone disks. This is because the disk array's higher sequential data transfer
rate yields a larger competitive prefetching depth than that of standalone disks.
Figure 3.9 shows the normalized inverse of throughput for two server applica-
tions (microbenchmark Two-Rand-0 and index searching) on the three different
[Figure: two panels of normalized execution time on the IBM disk, Seagate disk, and RAID0 under Linux, CP, and AP: (A) 2diff5 (Linux times 2.57 s, 1.32 s, and 1.08 s) and (B) 2diff50 (Linux times 18.56 s, 14.29 s, and 10.31 s).]
Figure 3.8: Normalized execution time of two non-server concurrent
workloads on three different storage devices.
storage devices. We use the inverse of throughput (as opposed to throughput
itself) so that the illustration is more comparable to that in Figure 3.8. Figure 3.9(A)
shows that competitive prefetching can improve the microbenchmark I/O through-
put by up to 52% on RAID0. In addition, the throughput of competitive prefetch-
ing is at least half the throughput of the oracle prefetching policy for all stor-
age devices. Figure 3.9(B) shows that competitive prefetching can improve I/O
throughput by 19–26% compared to the original Linux kernel. In contrast, ag-
gressive prefetching provides significantly less improvement (up to 15%) due to
wasted I/O bandwidth on fetching unnecessary data. For index searching, the
performance improvement ratio on the disk array is not larger than that on stan-
dalone disks, since the sequential access streams for this application are typically
not very long and consequently they do not benefit from the larger competitive
prefetching depth on the disk array.
3.5.5 Performance of Serial Workloads
Though our competitive prefetching policy targets concurrent sequential I/O,
we also assess its performance for standalone serially executed workloads. For the
[Figure: two panels of normalized inverse of throughput on the IBM disk, Seagate disk, and RAID0: (A) microbenchmark Two-Rand-0 (Linux, CP, AP, OPT) and (B) index searching (Linux, CP, AP).]
Figure 3.9: Normalized inverse of throughput for two server-style con-
current applications on three different storage devices. We use the inverse
of throughput (as opposed to throughput itself) so that the illustration
is more comparable to that in Figure 3.8. For the microbenchmark, we
also show an emulated optimal performance. For clarity, we only show
results at concurrency level 4; the performance at other concurrency levels
follows a similar pattern.
server applications (Apache web server and index search), we show their average
request response time when each request runs individually.
Figure 3.10 presents the normalized execution time of the non-concurrent work-
loads. Competitive prefetching has little or no performance impact on cp, kernel
build, TPC-H, and the Apache web server, since serial execution of the above
applications exhibits no concurrent I/O accesses. However, performance is im-
proved for diff and the index search server by 33% and 18% respectively. Both
diff and the index search server exhibit concurrent sequential I/O (in the form of
alternation among multiple streams). Query Q2 of TPC-H exhibits similar access
patterns (as illustrated in Figure 3.3(A)). Our evaluation (Figure 3.10) suggests
a 25% performance improvement due to competitive prefetching for query Q2.
Overall, competitive prefetching did not lead to performance degradation for any
[Figure: grouped bars of normalized execution time for cp5, cp50, diff5, diff50, kernel build, TPC-H, TPC-H Q2, Apache, and index search under Linux, CP, and AP; Linux running times are 1.02 s, 1.76 s, 1.21 s, 9.37 s, 7.65 m, 20.95 m, 4.91 s, 2.83 ms, and 718.15 ms respectively.]
Figure 3.10: Normalized execution time of serial application workloads.
The values on top of the "Linux" bars represent the running time of each
workload under the original Linux prefetching (in minutes, seconds, or
milliseconds).
of our experimental workloads.
3.5.6 Summary of Results
• Compared to the original Linux kernel, competitive prefetching can improve
application performance by up to 53% for applications with significant con-
current sequential I/O. Our experiments show no negative side effects for
applications without such I/O access patterns. In addition, competitive
prefetching trails the performance of the oracle prefetching policy by no
more than 42% across all microbenchmarks.
• Overly aggressive prefetching does not perform significantly better than
competitive prefetching. In certain cases (as demonstrated for index search),
it hurts performance due to I/O bandwidth and memory space wasted on
fetching and storing unnecessary data.
• Since anticipatory scheduling can minimize the disk I/O switch cost during
concurrent application execution, it diminishes the performance benefit of
competitive prefetching. However, anticipatory scheduling may not be effec-
tive when: 1) applications exhibit alternating sequential access patterns (as
in BSD diff, the microbenchmark Two-Rand-0, and index searching); 2)
substantial processing time exists between consecutive sequential requests
from the same process (as in the microbenchmark One-Rand-10); 3) I/O
scheduling in the system architecture is oblivious to request-issuing process
contexts (as in a virtual machine environment or parallel/distributed file sys-
tem server). In such scenarios, competitive prefetching can yield significant
performance improvement.
• Competitive prefetching can be applied to both standalone disk drives and
disk arrays that exhibit disk-like characteristics on I/O switch and sequential
data transfer. Disk arrays often offer higher aggregate I/O bandwidth than
standalone disks and consequently their competitive prefetching depths are
larger.
3.6 Related Work
The literature on prefetching is very rich. Cao et al. first proposed integrated
prefetching and caching [CFKL95], where they presented four rules that an opti-
mal prefetching and caching strategy must follow. Then they explored application-
controlled prefetching and caching [CFL94]. They proposed a two-level page re-
placement scheme that allows applications to control their memory replacement
policy, while the kernel controls the allocation of memory space among applica-
tions. Several researchers have studied application-assisted prefetching [PGG+95,
KTP+96, TPG97, LD97, YLB01]. Mowry et al. proposed compiler-assisted
prefetching for out-of-core applications [MDK96]. Patterson et al. explored a
cost-benefit model to guide prefetching decisions [PGG+95]. Their results sug-
gest a prefetching depth up to a process’ prefetch horizon under the assumption
of no disk congestion. Tomkins et al. extended the proposed cost-benefit model
for workloads consisting of multiple applications [TPG97]. The above techniques
require a-priori knowledge of the application I/O access patterns. Patterson et al.
and Tomkins et al. depend on application-disclosed hints to predict future data ac-
cesses. In contrast, our work proposes a transparent operating system prefetching
technique that does not require any application support or modification.
Previous work has also examined ways to acquire or predict I/O access pat-
terns without direct application involvement. Proposed techniques include mod-
eling and analysis of important system events [LD97, YLB01], offline application
profiling [PS04], and speculative program execution [CG99, FC03]. The reliance
on I/O access pattern prediction affects the applicability of these approaches in
several ways. Particularly, the accuracy of predicted information is not guaran-
teed and no general method exists to assess the accuracy for different application
workloads with certainty. Some of these approaches still require offline profiling
or application changes, which increases the barrier for deployment in real systems.
Besides, these studies do not address the performance problem of frequent I/O
switches exhibited in highly concurrent server systems.
Shriver et al. studied performance factors in a disk-based file system [SSS99]
and suggested that aggressive prefetching should be employed when it is safe to
assume the prefetched data will be used. More recently, Papathanasiou and Scott
argued that aggressive prefetching has become more and more appropriate due to
recent developments in I/O hardware and emerging needs of applications [PS05].
Our work builds upon this idea by providing a systematic way of controlling
prefetching depth.
Previous work has also explored performance issues associated with concurrent
sequential I/O at various system levels. Carrera and Bianchini explored disk cache
management and proposed two disk firmware-level techniques to improve the I/O
throughput of data intensive servers [CB04]. Our work shares their objective
while focusing on operating system-level techniques. Anastasiadis et al. explored
an application-level block reordering technique that can reduce server disk traffic
when large content files are shared by concurrent clients [AWC04]. Our work
provides transparent operating system level support for a wider scope of data-
intensive workloads.
Barve et al. [BKVV00] have investigated “competitive prefetching” in parallel
I/O systems with a given lookahead amount (i.e., the knowledge of a certain
number of upcoming I/O requests). Their problem model is different from ours
in at least the following respects. First, the target performance metric in their
study (the number of parallel I/O operations to complete a given workload) is
only meaningful in the context of parallel I/O systems. Second, the lookahead of
upcoming I/O requests is essential in their system model, while our study only
considers transparent OS-level prefetching techniques that require no information
about application data access pattern.
3.7 Conclusion and Discussions
To conclude, this chapter presents the design and implementation of a compet-
itive I/O prefetching technique targeting concurrent sequential I/O workloads on
data-intensive server systems. Our model and analysis show that the performance
of competitive prefetching (in terms of I/O throughput) is at least half the per-
formance of the optimal offline policy on prefetching depth selection. In practice,
the new prefetching scheme is most beneficial when anticipatory I/O scheduling
is not effective (e.g., when the application workload has an alternating sequential
data access pattern or possesses substantial think time between consecutive I/O
accesses, or when the system’s I/O scheduler is oblivious to I/O request-issuing
process contexts).
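One simplified way to see the factor-of-two bound (ignoring the slow-start phase and variation in the I/O switch cost) is the following. Let $O$ be the disk I/O switch cost, $R$ the sequential transfer rate, and set the competitive prefetching depth to $p = R\,O$, so that transferring one prefetching window takes exactly one switch cost. Serving a sequential stream of $s$ bytes, the optimal offline policy pays at least one switch plus the transfer time,

\[
T_{\mathrm{opt}} \;\ge\; O + \frac{s}{R},
\]

while competitive prefetching uses at most $\lceil s/p \rceil$ windows, each costing one switch plus one window transfer:

\[
T_{\mathrm{cp}} \;\le\; \left\lceil \frac{s}{p} \right\rceil \left( O + \frac{p}{R} \right)
 \;=\; 2\,O \left\lceil \frac{s}{p} \right\rceil
 \;\le\; 2\,O \left( \frac{s}{p} + 1 \right)
 \;=\; 2 \left( \frac{s}{R} + O \right)
 \;\le\; 2\,T_{\mathrm{opt}}.
\]

Hence the achieved I/O throughput is at least half that of the optimal offline policy.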
We implemented the proposed prefetching technique in the Linux 2.6.10 kernel
and conducted experiments using four microbenchmarks, two common file utili-
ties, and four real applications on several disk-based storage devices. Overall, our
evaluation demonstrates that competitive prefetching can improve the throughput
of real applications by up to 53% and it does not cause noticeable performance
degradation on a variety of workloads not directly targeted in this work. We also
show that the performance of competitive prefetching strategy trails that of an
oracle offline prefetching policy by no more than 42%, affirming its competitive-
ness.
In the remainder of this section, we discuss implications related to our com-
petitive prefetching policy. Prefetching and caching are two complementary OS
techniques that both make use of data locality to hide disk access latency. Un-
fortunately, they compete for memory resources against each other. Aggressive
prefetching can consume a significant amount of memory during highly concurrent
workloads. In extreme cases, memory contention may lead to significant perfor-
mance degradation because previously cached or prefetched data may be prema-
turely evicted before they are accessed. The competitiveness of our prefetching
strategy would not hold in the presence of high memory contention. A possible
method to avoid performance degradation due to prefetching during periods of
memory contention is to adaptively reduce the prefetching depth when increased
memory contention is detected, and to reinstate competitive prefetching when the
memory pressure subsides.
allocated for caching and prefetching has been previously addressed in the liter-
ature by Cao et al. [CFKL95] and Patterson et al. [PGG+95]. More recently,
Kaplan et al. proposed a dynamic memory allocation method based on detailed
bookkeeping of cost and benefit statistics [KMC02]. Gill et al. proposed a cache
management policy that dynamically partitions the cache space amongst sequen-
tial and random streams in order to reduce read misses [GM05]. Memory con-
tention can also be avoided by limiting the server execution concurrency level with
[Figure: normalized latency of an individual I/O request at workload concurrency levels 4, 16, and 64 under Linux, CP, and AP; Linux latencies are 0.085 s, 0.245 s, and 0.810 s respectively.]
Figure 3.11: Normalized latency of an individual I/O request under a
concurrent workload. The latency of an individual small I/O request is
measured with microbenchmark Two-Rand-0 running in the background.
The values on top of the "Linux" bars represent the request latency under
the original Linux prefetching (in seconds).
admission control or load conditioning [vBCZ+03, WCB01]. In addition, in the
next chapter, we propose a set of OS-level techniques that can be used to manage
the memory dedicated to prefetched data.
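A minimal sketch of such an adaptive policy, assuming some external memory-pressure signal is available (the halving and doubling steps and the bounds are our illustration, not an implemented kernel interface):

```c
/* Shrink the prefetching depth while memory is contended; grow it back
 * toward the competitive depth once the pressure is relieved. */
static long adapt_depth(long depth, int memory_pressure,
                        long min_depth, long comp_depth)
{
    if (memory_pressure) {
        depth /= 2;
        return depth < min_depth ? min_depth : depth;
    }
    depth *= 2;
    return depth > comp_depth ? comp_depth : depth;
}
```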
The proposed prefetching technique has been designed to minimize possible
negative impact on applications not directly targeted by our work. However,
some negative impact may still exist. In particular, large prefetching depths can
lead to increased latencies for individual I/O requests under concurrent workloads.
Figure 3.11 presents a quantitative illustration of this impact. In particular, at
concurrency of 64, compared with the I/O request latency in original Linux, com-
petitive prefetching causes the average I/O request latency to increase by 53%,
while aggressive prefetching increases the average request latency by one and a
half times. This result also suggests that for certain interactive applications, the
performance impact of aggressive prefetching may be unacceptable. Techniques
such as priority-based disk queues [GP98] and semi-preemptible I/O [DRC03]
can be employed to alleviate this problem. Additional investigation is needed to
address the integration of such techniques in the future.
4 Managing Prefetch Memory
4.1 Introduction
Data-intensive online servers access large disk-resident datasets while serving
many clients simultaneously. For such data-intensive server systems, the disk
I/O performance and memory utilization efficiency dominate the overall system
performance when the dataset size far exceeds the available server memory. As
described in the previous chapter, during concurrent execution, sequential data
access of one request handler (e.g., server process or thread) can be frequently in-
terrupted by I/O operations of other active request handlers in the server. Due to
the high overhead of I/O switches in disk-based storage devices, large-granularity
I/O prefetching is often employed to reduce the I/O switch frequency and thus
decrease its overhead. Consequentially, at high execution concurrency, there can
be many memory pages containing prefetched but not-yet-accessed data, which
we call prefetched pages in short.
The prefetched pages are traditionally managed together with the rest of the
memory buffer cache. Existing memory management methods generally fall into
two categories.
• Application-assisted techniques [CFL94, KTP+96, PS04, PGG+95, TPG97]
achieve efficient memory utilization with application-disclosed information
or hints on their I/O access patterns. However, the reliance on such infor-
mation affects the applicability of these approaches, and their relatively slow
adoption in production operating systems is a reflection of this problem.
Our objective in this work is to provide a transparent, OS-level memory
management scheme that does not require any explicit application assistance.
• Existing transparent memory management methods typically use access
history-based page reclamation policies, such as LRU/LFU [MGST70], Working-
Set [Den68], or CLOCK [Smi78]. Because prefetched pages do not have any
access history yet, an access recency or frequency-based page reclamation
policy tends to evict them earlier than those pages that have been accessed
before. Although MRU-like schemes may benefit prefetched pages, they perform
poorly with general application workloads, and in practice they are only
used for workloads with exclusively sequential data access patterns. Among
prefetched pages themselves, the eviction order based on traditional recla-
mation policies would actually be First-In-First-Out (FIFO), again due to
the lack of any access history. Under such management, memory contention
at high execution concurrency may result in premature eviction of prefetched
pages before they get accessed, which we call page thrashing. These pages
may have to be read again from disk to meet the application’s needs, which
can severely degrade system performance.
One reason for the lack of specific research attention on managing the prefetched
(but not-yet-accessed) memory pages is that traditionally these pages only con-
stitute a small portion of the total memory in a normal system. However, their
presence is substantial for data-intensive online servers executing at high concur-
rencies. Further, it was recently argued that more aggressive prefetching should
be supported in modern operating systems due to emerging application needs and
disk I/O energy efficiency [PS05]. Aggressive prefetching would also contribute
to a more substantial presence of prefetched memory pages in data-intensive server
systems.
In this chapter, we propose a new memory management scheme that han-
dles prefetched pages separately from other pages in the memory system. Such
a separation protects against excessive eviction of prefetched pages and allows a
customized page reclamation policy for them. We explore heuristic page reclama-
tion policies that can identify the prefetched pages least likely to be accessed in
the near future. In addition to the page reclamation policy, the prefetch mem-
ory management must also address the memory allocation between caching and
prefetching. We employ a gradient descent-based approach that dynamically ad-
justs the memory allocation for prefetched pages in order to minimize the overall
page miss rate in the system.
A number of earlier studies have investigated I/O prefetching and its memory
management in an integrated fashion [CFKL95, KMC02, PS04, PGG+95]. These
approaches can adjust the prefetching strategy depending on the memory cache
content and therefore achieve better memory utilization. Our work is exclusively
focused on the prefetch memory management with a given prefetching strategy.
Combining our results with adaptive prefetching may further improve the overall
system performance.
The rest of the chapter is organized as follows. Section 4.2 discusses the char-
acteristics of targeted data-intensive online servers and describes the existing OS
support. Section 4.3 presents the design of our proposed prefetch memory man-
agement techniques and its implementation in the Linux 2.6.10 kernel. Section 4.4
provides the performance results based on microbenchmarks and two real appli-
cation workloads. Section 4.5 reviews the related work and Section 4.6 concludes
the chapter.
4.2 Targeted Applications and Existing OS Support
As mentioned in Section 2.1, our work focuses on online servers supporting
highly concurrent workloads that access a large amount of disk-resident data.
In these servers, each incoming user request is typically serviced by a request
handler (e.g., server process or thread) that repeatedly accesses disk data and
consumes the CPU before completion. A request handler may block if the needed
resource is unavailable. While request processing consumes both disk I/O and
CPU resources, the overall server throughput is often dominated by the disk I/O
performance and the memory utilization efficiency when the application data size
far exceeds the available server memory. Figure 2.1 in Chapter 2 illustrates such
an execution environment.
For data-intensive server systems, improving the I/O efficiency can be accom-
plished by employing a large I/O prefetching depth when the application exhibits
some sequential I/O access patterns [SSS99]. A larger prefetching depth results
in less frequent I/O switches, and consequently yields fewer disk seeks per time
unit. In practice, the maximum prefetching depths of default Linux and FreeBSD
are 128 KB and 256 KB respectively during sequential file accesses. In Chapter 3,
we proposed a competitive prefetching strategy that can achieve at least half the
I/O throughput of optimal prefetching when there is no memory contention. This
strategy specifies that the prefetching depth should be equal to the amount of
data that can be sequentially transferred within a single I/O switching period.
The competitive prefetching depth is around 400 KB for several modern IBM and
Seagate disks that we measured.
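The competitive depth rule amounts to depth = sequential transfer rate × I/O switch time. With illustrative disk parameters (not the measured values from the dissertation) of roughly 50 MB/s sustained transfer and an 8 ms switch period, the rule lands near the ~400 KB figure quoted above:

```c
/* Sketch of the competitive-depth rule from Chapter 3: prefetch as
 * much data as the disk can transfer sequentially within one I/O
 * switching period.  The disk parameters used in the test below are
 * illustrative assumptions. */
static double competitive_depth_kb(double transfer_rate_mb_s,
                                   double switch_time_ms)
{
    /* depth = sequential transfer rate x I/O switch time */
    return transfer_rate_mb_s * 1024.0 * (switch_time_ms / 1000.0);
}
```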
I/O prefetching can generate significant memory demands. Such demands may
result in severe memory contention at high execution concurrencies, where many
request handlers in the server are simultaneously competing for memory resource.
Access recency or frequency-based page replacement policies are ill-suited for
handling prefetched pages because these policies fail to give proper priority to
pages that are prefetched but not yet accessed. We use the following two examples
to illustrate the problems of access history-based page replacement policies for
prefetched pages.
• Priority between prefetched pages and accessed pages: Assume page p1 is
prefetched and then accessed at t1 while page p2 is prefetched at t2 and it
has not yet been accessed. If the prefetching operation itself is not counted
as an access, then p2 will be evicted earlier under an access recency or
frequency-based policy although it may be more useful in the near future.
If the prefetching operation is counted as an access but t1 occurs after t2,
then p2 will still be evicted earlier than p1.
• Priority among prefetched pages: Due to the lack of any access history,
prefetched pages will follow a FIFO eviction order among themselves during
memory contention. Assume page p1 and p2 are both prefetched but not yet
accessed and p1 is prefetched ahead of p2. Then p1 will always be evicted
earlier under the FIFO order even if p2’s owner request handler (defined
as the request handler that initiated p2’s prefetching) has completed its
task and exited from the server, a strong indication that p2 would not be
accessed in the near future.
Under poor memory management, many prefetched pages may be evicted be-
fore their owner request handlers get the chance to access them. Prematurely
evicted pages will have to be fetched again from the disk and we call such a
phenomenon prefetch page thrashing. In addition to wasting I/O bandwidth on
makeup fetches, page thrashing further reduces the I/O efficiency. This is because
page thrashing destroys the sequential access patterns crucial for large-granularity
[Plot: application I/O throughput (MB/sec, left axis, 0–10) and prefetch page thrashing rate (pages/sec, right axis, 0–300) vs. number of concurrent request handlers (1–1000).]
Figure 4.1: Application throughput and prefetch page thrashing rate
of a trace-driven index searching server under the original Linux 2.6.10
kernel. A prefetch thrashing event is counted as a page miss on a prefetched
but then prematurely evicted memory page.
prefetching and consequently many of the makeup fetches are inefficient small-
granularity I/O operations.
Measured performance We use the Linux kernel as an example to illustrate
the performance of an access history-based memory management method. Linux
employs two LRU lists for memory buffer cache: an active list used to cache hot
pages and an inactive list used to cache cold pages. Prefetched pages are initially
attached to the tail of the inactive list. Frequently accessed pages are moved from
the inactive list to the active list. When the available memory falls below a certain
reclamation threshold, an aging algorithm moves pages from the active list to the
inactive list and cold pages at the head of the inactive list are considered first for
eviction.
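As a toy illustration of why prefetched pages end up in FIFO eviction order under this scheme, consider a deliberately minimal model of the inactive list (a drastic simplification of the real mm/vmscan.c; names and sizes are invented):

```c
#include <string.h>

/* Toy model of the inactive LRU list: prefetched pages attach to the
 * tail, and reclamation evicts from the head.  Because prefetched
 * pages carry no access history, eviction among them is pure FIFO. */
enum { NPAGES = 8 };

struct toy_lru {
    int inactive[NPAGES];   /* page ids, index 0 = head (evict first) */
    int n_inactive;
};

static void page_prefetched(struct toy_lru *l, int page)
{
    l->inactive[l->n_inactive++] = page;   /* attach to the tail */
}

static int reclaim_one(struct toy_lru *l)
{
    int victim = l->inactive[0];           /* coldest page at the head */
    memmove(l->inactive, l->inactive + 1,
            --l->n_inactive * sizeof(int));
    return victim;
}
```

In this model the page prefetched first is always evicted first, regardless of whether its owner still needs it, which is exactly the behavior the heuristics in Section 4.3.1 are designed to improve upon.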
Figure 4.1 shows the performance of a trace-driven index searching server
under the Linux 2.6.10 kernel. This kernel is slightly modified to correct several
Figure 4.2: A separate prefetch cache for prefetched pages.
performance anomalies during highly concurrent I/O (as illustrated in Chapter 5
and [SZL05]). The details about the benchmark and the experimental settings
can be found in Section 4.4.1. We show the application I/O throughput and
page thrashing rate for up to 1000 concurrent request handlers in the server. At
high execution concurrency, the I/O throughput drops significantly as a result of
high page thrashing rate. Particularly at the concurrency level of 1000, the I/O
throughput degrades to about 25% of its peak performance while the prefetch
page thrashing rate increases to about 318 pages/sec.
4.3 Prefetch Memory Management
We propose a new memory management scheme with the aim to reduce prefetch
page thrashing rate and improve the performance for data-intensive server sys-
tems. Our framework is based on a separate prefetch cache to manage prefetched
but not-yet-accessed pages (as shown in Figure 4.2). The basic concept of prefetch
cache was employed earlier by Papathanasiou and Scott [PS04] to manage the
prefetched memory pages for energy-efficient bursty disk I/O. Prefetched pages
are initially placed in the prefetch cache. When a page is referenced, it is moved
from the prefetch cache to the normal memory buffer cache and is thereafter
controlled by the kernel’s default page replacement policy. In the presence of
high memory pressure, some not-yet-accessed pages in the prefetch cache may be
moved to the normal buffer cache for possible eviction or reclamation.
The separation of prefetched pages from the rest of the memory system pro-
tects against excessive eviction of prefetched pages and allows a customized page
reclamation policy for them. Like many of the other previous studies [CFL94,
KTP+96, PGG+95, TPG97], Papathanasiou and Scott’s prefetch cache manage-
ment [PS04] requires application hints on their I/O access patterns. The reliance
on such information affects the applicability of these approaches and our objective
is to provide transparent memory management that does not require any appli-
cation assistance. The design of our prefetch memory management addresses the
reclamation order among prefetched pages (Section 4.3.1) and the partitioning
between prefetched pages and the rest of the memory buffer cache (Section 4.3.2).
Section 4.3.3 describes our implementation in the Linux 2.6.10 kernel.
4.3.1 Page Reclamation Policy
Previous studies pointed out that the offline optimal page replacement rule for
a prefetching system is that every prefetch should discard the page whose next
reference is farthest in the future [Bel66, CFKL95]. Such a rule was introduced
for minimizing the execution time of a standalone application process. We adapt
such a rule in the context of optimizing the I/O throughput of data-intensive
online servers. In our case, a large number of server request handlers execute
concurrently and each request handler runs for a relatively short period of time.
Because request handlers in an online server are independent from each other,
the data prefetched by one request handler are unlikely to be accessed by other
request handlers in the near future. Our adapted offline optimal replacement rules
are the following:
A. If there exist prefetched pages that will not be accessed by their respective
owner request handlers, discard one of these pages first.
B. Otherwise, discard the page whose next reference is farthest in the future.
The offline optimal replacement rules can only be followed when the I/O access
patterns for all request handlers are known. This requirement is not realistic
for most applications. We explore heuristic page reclamation policies that can
approximate the above rules without such requirement. We define a prefetch
stream as a group of sequentially prefetched pages that have not yet been accessed.
We examine the following online heuristics that can be used to identify a victim
page for reclamation.
#1. Discard any page whose owner request handler has completed its task and
exited from the server.
#2. Discard the last page from the longest prefetch stream.
#3. Discard the last page from the prefetch stream whose last access is least
recent. The recency can be measured by either the elapsed wall clock time
or the CPU time of each stream’s owner request handler. Since in most cases
only the stream’s owner request handler would reference it, the CPU time
may be a better indicator of how much opportunity the stream has had to be
accessed in the past.
Heuristic #1 is designed to follow the offline optimal replacement rule A while
heuristic #2 approximates rule B. Heuristic #3 approximates both rules A and
B because a prefetch stream that has not yet been referenced for a long time
is not likely to be referenced anytime soon and it may even not be referenced
after all. Although heuristic #3 is based on access recency, it is different from
the traditional LRU policy in that it relies on the access recency of the prefetch
stream it belongs to, not the access recency of each page. Note that every page
that was referenced after being prefetched would have already been moved out of
the prefetch cache.
Heuristic #1 can be combined with either heuristic #2 or heuristic #3 to form
a page reclamation policy. In either case, heuristic #1 should take precedence since
it is guaranteed to discard pages that are unlikely to be accessed in the near
future. When applying heuristic #3, we have two different ways of identifying
the least recently used prefetch stream. Putting all these together, we may have
three page reclamation policies when a victim page needs to be identified for
reclamation:
• Longest: If there exist pages whose owner request handlers have exited
from the server, discard one of them first. Otherwise, discard the last page
from the longest prefetch stream.
• Coldest: If there exist pages whose owner request handlers have exited
from the server, discard one of them first. Otherwise, discard the last page
of the prefetch stream which has not been accessed for the longest time.
• Coldest+: If there exist pages whose owner request handlers have exited
from the server, discard one of them first. Otherwise, discard the last page
of the prefetch stream whose owner request handler has consumed the most
amount of CPU since the last time the request handler accessed the prefetch
stream.
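A sketch of how the Coldest policy's victim-stream selection might look in C follows; the struct fields and the function are hypothetical stand-ins for the kernel stream descriptors described in Section 4.3.3:

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch of the "Coldest" policy: prefer any stream whose owner
 * request handler has exited (rule A); otherwise pick the stream
 * whose last access is least recent (an approximation of rule B). */
struct stream {
    bool owner_exited;          /* owner request handler has exited   */
    unsigned long last_access;  /* timestamp of the stream's last use */
};

/* Returns the index of the stream to take a victim page from. */
static size_t pick_victim_stream(const struct stream *s, size_t n)
{
    size_t i, coldest = 0;
    for (i = 0; i < n; i++) {
        if (s[i].owner_exited)
            return i;                       /* exited owner wins outright */
        if (s[i].last_access < s[coldest].last_access)
            coldest = i;                    /* least recently accessed */
    }
    return coldest;
}
```

Coldest+ would differ only in the timestamp: it compares CPU time consumed by each owner since its last access to the stream, rather than wall clock time.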
4.3.2 Memory Allocation for Prefetched Pages
The memory allocation between prefetched pages and the rest of the memory
buffer cache can affect the memory utilization efficiency. Too small a prefetch
cache would not sufficiently reduce the prefetch page thrashing. Too large a
prefetch cache, on the other hand, would result in many page misses on the normal
memory buffer cache. To achieve high performance, we aim to minimize the
combined prefetch miss rate and the cache miss rate. A prefetch miss is defined as
a miss on a page which was prefetched and then prematurely evicted without being
accessed while a cache miss is a miss on a page that has already been accessed at
the time of last eviction.
Previous studies [KMC02, TSW92] have proposed to use histograms of page
hits to adaptively partition the memory between prefetching and caching or among
multiple processes. In particular, based on Thiebaut et al.’s work [TSW92], the
minimal overall page miss rate can be achieved when the current miss-rate deriva-
tive of the prefetch cache as a function of its allocated size is the same as the
current miss-rate derivative of the normal memory buffer cache. This indicates
an allocation point where there is no benefit of either increasing or decreasing the
prefetch cache allocation. These solutions have not been implemented in practice
and implementations suggested in the previous studies require significant over-
head in maintaining page hit histograms. In order to allow a relatively lightweight
implementation, we approximate Thiebaut et al.’s result using a gradient descent-
based greedy algorithm [RN03]. Given the curve of combined prefetch miss rates
and cache miss rates as a function of the prefetch cache allocation, the gradient
descent algorithm takes steps proportional to the negative of the gradient of the
function at the current point and proceeds downhill toward the bottom [Gra].
We divide the system execution time into epochs (or time windows) and the
prefetch cache allocation is adjusted epoch by epoch based on gradient descent
on the combined prefetch and cache miss rates. In the first epoch, we randomly
increase or decrease the prefetch cache allocation by an adjustment unit. We
calculate the change of combined miss rates in each subsequent epoch. The value
of the change is used to approximate the derivative of the combined miss rates.
If the miss rate increases by a substantial margin, we reverse the adjustment of
the previous epoch (e.g., increase the prefetch cache allocation if the previous
adjustment has decreased it). Otherwise, we further the adjustment made in the
previous epoch (e.g., increase the prefetch cache allocation further if the previous
adjustment has increased it).
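The epoch-by-epoch adjustment can be sketched as follows. The adjustment unit and the "substantial margin" threshold below are illustrative assumptions, not values from the dissertation:

```c
/* Sketch of the epoch-by-epoch gradient descent allocation from
 * Section 4.3.2.  ADJUST_UNIT and MARGIN are assumed constants. */
#define ADJUST_UNIT 64       /* pages added/removed per epoch          */
#define MARGIN      0.01     /* miss-rate increase deemed substantial  */

struct pc_alloc {
    long size;               /* current prefetch cache allocation      */
    long last_step;          /* +ADJUST_UNIT or -ADJUST_UNIT           */
    double last_miss_rate;   /* combined miss rate of previous epoch   */
};

static void end_of_epoch(struct pc_alloc *a, double miss_rate)
{
    /* Reverse the previous adjustment if the combined miss rate rose
     * substantially; otherwise keep moving in the same direction. */
    if (miss_rate > a->last_miss_rate + MARGIN)
        a->last_step = -a->last_step;
    a->size += a->last_step;
    a->last_miss_rate = miss_rate;
}
```

The change in the combined miss rate between epochs serves as the approximate derivative, so the allocation walks downhill toward the point where neither growing nor shrinking the prefetch cache helps.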
4.3.3 Implementation
We have implemented the proposed prefetch memory management scheme in
the Linux 2.6.10 kernel. In the Linux kernel, page reclamation is initiated when
the number of free pages falls below a certain threshold (function
try_to_free_pages() in source file mm/vmscan.c). The OS first scans the active list to move those
pages that have not been accessed recently to the inactive list. Then the OS
scans the inactive list to find candidate pages for eviction. In our implementation,
when a page in the prefetch cache is referenced, it is moved to the inactive LRU
list of the Linux memory buffer cache. Prefetch cache reclamation is triggered
at each invocation of the global Linux page reclamation. If the prefetch cache
size exceeds its allocation, prefetched pages will be selected for reclamation until
the desired allocation is reached. We also use a kernel thread that checks
every 4 seconds whether any prefetched pages need to be reclaimed.
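The trimming logic described above might be sketched as follows; the function and its signature are hypothetical stand-ins for the in-kernel code hooked into the reclamation path:

```c
/* Hypothetical sketch of prefetch cache trimming: invoked from the
 * global page reclamation path and from a periodic kernel thread,
 * it reclaims prefetched pages until the cache shrinks back to its
 * target allocation.  reclaim_page stands in for the policy's
 * victim-eviction routine. */
static void trim_prefetch_cache(long *cache_pages, long target_pages,
                                long (*reclaim_page)(void))
{
    while (*cache_pages > target_pages) {
        reclaim_page();         /* evict one victim per the policy */
        (*cache_pages)--;
    }
}
```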
We initialize the prefetch cache allocation to be 25% of the total physical
memory size. Our gradient descent-based prefetch cache allocation requires the
information on several memory performance statistics (e.g., the prefetch miss rate
and the cache miss rate). In order to identify misses on those prematurely evicted
pages, the system maintains a bounded-length history of meta-data for recently
evicted pages along with their status at the time of eviction (e.g., whether a page
is prefetched but not-yet-accessed). In the page data structure, we also keep track
of the information on how long each page has stayed in memory before being
evicted.
We use one bit in the flags field of the page data structure to signify whether
the page is prefetched or not (in page-flags.h). For a prefetched page, another bit
will show whether the page has been accessed. The prefetch cache data structure
is organized as a list of stream descriptors. Each stream descriptor points to a
list of ordered page descriptors, with newly prefetched pages at the tail of the list.
Each stream descriptor also maintains a link to its owner process/thread (i.e.,
the process/thread that initiated the prefetching of pages in this stream) and the
timestamp showing when the prefetch stream was last accessed. The timestamp uses
either the wall clock time or the CPU execution time of the stream’s owner
process depending on the adopted recency heuristic policy. Figure 4.3 provides an
illustration of the prefetch cache data structure. In the current implementation, we
assume a request handler is a process (or thread) in a multi-processed (or multi-
threaded) server. Each prefetch stream can track the status of its owner process
through the link in the stream descriptor. When a process exits, the pages in
the associated prefetch stream will be moved to the inactive LRU list. Due to
its reliance on the process status, the current implementation cannot monitor
the status of request handlers in event-driven servers that reuse each process for
multiple request executions. In a future implementation, we can associate each
prefetch stream with its open file data structure directly. In this way, a file close
operation will also cause pages in the associated stream to become reclamation
candidates.
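One possible C rendering of the descriptors described above is given below. Field names are assumptions made for illustration; the actual patch extends kernel data structures such as struct page rather than defining new page objects:

```c
#include <sys/types.h>

/* Hypothetical rendering of the prefetch cache data structure of
 * Figure 4.3: a list of stream descriptors, each linking its pages,
 * its owner process, and a last-access timestamp. */
struct pf_page {
    struct pf_page *next;      /* next (newer) page in the stream     */
    unsigned long flags;       /* prefetched / accessed status bits   */
};

struct pf_stream {
    struct pf_stream *next;    /* next stream descriptor              */
    struct pf_page *pages;     /* ordered pages, newest at the tail   */
    pid_t owner_pid;           /* owner request handler               */
    unsigned long timestamp;   /* wall clock or owner CPU time of the
                                  stream's last access                */
};
```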
We assess the space overhead of our implementation. Since the number of
active prefetch streams in the server is relatively small compared with the number
of pages, the space consumption of stream descriptors is not a serious concern.
The extra space overhead in each page includes a 4-byte pointer to the prefetch
stream and a 4-byte timestamp, which together incur about 0.2% space overhead (8
bytes per 4 KB page). An additional space overhead is the memory used by the
eviction history for identifying misses on recently evicted pages. We can limit
this overhead by controlling the number of entries in the eviction history. A
smaller history size may be adopted at the expense of the accuracy of the memory
performance statistics. A 20-byte data structure is used to record the information
[Diagram: the prefetch cache as a list of stream descriptors, each recording an owner pid and a timestamp and pointing to an ordered list of page descriptors; pages leave the prefetch cache for the buffer cache's active/inactive page lists upon reference or reclamation.]
Figure 4.3: An illustration of the prefetch cache data structure and our
implementation in Linux.
(meta-data) about each evicted page. When we limit the eviction history size to
40% of the total number of physical memory pages, the space overhead for the
eviction history is about 0.2% of the total memory size.
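The overhead arithmetic above can be checked directly: 8 extra bytes per 4 KB page, and 20-byte history entries for 40% of all physical pages, each come to roughly 0.2% of memory:

```c
/* Checking the space-overhead figures quoted in the text. */
#define PAGE_SIZE_BYTES   4096
#define PER_PAGE_EXTRA    8      /* 4-byte stream pointer + 4-byte timestamp */
#define HISTORY_ENTRY     20     /* bytes of meta-data per evicted page      */
#define HISTORY_FRACTION  0.40   /* history entries per physical page        */

static double per_page_overhead_pct(void)
{
    return 100.0 * PER_PAGE_EXTRA / PAGE_SIZE_BYTES;              /* ~0.2% */
}

static double history_overhead_pct(void)
{
    return 100.0 * HISTORY_ENTRY * HISTORY_FRACTION / PAGE_SIZE_BYTES;
}
```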
Our implementation has added/edited a total of about 1100 lines of C code in
the Linux kernel. A kernel patch containing an implementation of the heuristic policy
Coldest is available at http://www.cs.rochester.edu/~cli/Publication/patch2.htm.
All the modified files in the Linux source tree are listed as follows.
fs/file_table.c
fs/mpage.c
fs/open.c
mm/filemap.c
mm/page_alloc.c
mm/readahead.c
mm/swap.c
mm/truncate.c
mm/vmscan.c
include/linux/fs.h
include/linux/mm.h
include/linux/mm_inline.h
include/linux/page-flags.h
include/linux/pagevec.h
include/linux/swap.h
4.4 Experimental Evaluation
We assess the effectiveness of our proposed prefetch memory management tech-
niques on improving the performance of data-intensive server systems. Experi-
ments were conducted on servers each with dual 2.0 GHz Xeon processors, 2 GB
memory, a 36.4 GB IBM 10K RPM SCSI drive, and a 146 GB Seagate 10K RPM
SCSI drive. The Linux installation is Red Hat 9 with our enhanced kernel. Note
that we lower the memory size used in some experiments to better illustrate the
memory contention. Each experiment involves a server and a load-generating
client. The client program can adjust the number of simultaneous requests and
thus controls the server concurrency level.
The evaluation results are affected by several factors, including the I/O prefetch-
ing aggressiveness, server memory size, and the application dataset size. Our
methodology is to first demonstrate the performance at a typical setting and then
explicitly evaluate the impact of various factors. We perform experiments on four
microbenchmarks and two real application workloads. Each experiment is run for
4 rounds and each round takes 120 seconds. The performance metric we report
for all workloads is the I/O throughput observed at the application level. These statistics
are acquired by instrumenting the server applications so that they return I/O
performance statistics to the client program. In addition to showing the server
I/O throughput, we also provide some results on the page thrashing statistics of
the prefetched memory. We summarize the evaluation results at the end of this
section.
4.4.1 Evaluation Benchmarks
We explore the performance of four microbenchmarks and two real applica-
tions with different I/O access patterns. All the microbenchmarks share the same
server program that accesses a dataset of 6000 4 MB disk-resident files. Upon the
arrival of each user request, the server spawns a thread to process it. The mi-
crobenchmarks differ in the number of files accessed by each request handler (one
to four) and the portion of each file accessed (the whole file, a random portion, or
a 64 KB chunk). We use the following microbenchmarks in the evaluation:
• One-Whole: Each request handler randomly chooses a file and it repeatedly
reads 64 KB data blocks until the whole file is accessed.
• One-Rand: Each request handler randomly chooses a file and it repeatedly
reads 64 KB data blocks from the file up to a random total size (evenly
distributed between 64 KB and 4 MB).
• Two-Rand: Each request handler alternates reading 64 KB data blocks from
two randomly chosen files. For each file, the request handler accesses the
same random number of blocks. This workload emulates applications that
simultaneously access multiple sequential data streams.
• Four-64KB: Each request handler randomly chooses four files and reads a
64 KB random data block from each file.
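For concreteness, a user-level sketch of the One-Rand request handler follows; the helper names are invented and the file path is supplied by the caller:

```c
#include <fcntl.h>
#include <unistd.h>

/* Sketch of the One-Rand request handler: open a chosen file and
 * sequentially read 64 KB blocks up to a random total size evenly
 * distributed between 64 KB and 4 MB. */
#define BLOCK   (64 * 1024)
#define MAXSIZE (4 * 1024 * 1024)

/* Evenly distributed multiple of 64 KB in [64 KB, 4 MB]. */
static long one_rand_total_bytes(unsigned seed)
{
    return (long)(seed % (MAXSIZE / BLOCK) + 1) * BLOCK;
}

/* Returns the number of bytes read, or -1 if the file cannot be opened. */
static long handle_request(const char *path, unsigned seed)
{
    char buf[BLOCK];
    long total = one_rand_total_bytes(seed), done = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while (done < total) {
        ssize_t n = read(fd, buf, sizeof buf);  /* sequential 64 KB reads */
        if (n <= 0)
            break;
        done += n;
    }
    close(fd);
    return done;
}
```

Two-Rand would interleave such reads across two open files, and Four-64KB would read a single 64 KB block from each of four files.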
We include the following two real application workloads in the evaluation:
• Index searching: This application and its dataset are the same as the one
introduced in Section 3.5. The dataset contains a 522 MB mapping file and
an 18.5 GB search index file. For each keyword in an input query, a binary
search is first performed on the mapping file and then the search index is
accessed following a sequential access pattern. Multiple prefetching streams
on the search index are accessed for each multi-keyword query. The search
query words in our test workload are based on a trace recorded at the Ask
Jeeves online site in early 2004.
• Apache hosting media clips: We include the Apache Web server in our eval-
uation. Since our work focuses on data-intensive applications, we use a
workload containing a set of media clips, following the file size and access
distribution of the video/audio clips portion of the 1998 World Cup work-
load [AJ99]. About 9% of files in the workload are large video clips while
the rest are small audio clips. The overall file size range is 24 KB–1418 KB
with an average of 152 KB. The total dataset size is 20.4 GB. During the
tests, individual media files are chosen in the client requests according to a
Zipf distribution. A random-size portion of each chosen file is accessed.
We summarize our benchmark statistics in Table 4.1.
4.4.2 Microbenchmark Performance
We assess the effectiveness of the proposed techniques by comparing the server
performance under five kernel versions using different memory management meth-
ods for prefetched pages:
#1. AccessHistory: We use the original Linux 2.6.10 to represent kernels that
manage the prefetched pages along with other memory pages using an ac-
cess history-based LRU reclamation policy. Note that a more sophisticated
Benchmark/workload           Whole-file  Streams per  Memory        Total data  Min–mean–max size of
                             access?     request      footprint     size        sequential access streams
Microbenchmark: One-Whole    Yes         Single       1 MB/request  24.0 GB     4 MB–4 MB–4 MB
Microbenchmark: One-Rand     No          Single       1 MB/request  24.0 GB     64 KB–2 MB–4 MB
Microbenchmark: Two-Rand     No          Multiple     1 MB/request  24.0 GB     64 KB–2 MB–4 MB
Microbenchmark: Four-64KB    No          N/A          1 MB/request  24.0 GB     N/A
Index searching              No          Multiple     Unknown       19.0 GB     Unknown
Apache hosting media clips   No          Single       Unknown       20.4 GB     24 KB–152 KB–1418 KB
Table 4.1: Benchmark statistics. The column “Whole-file access?” indi-
cates whether a request handler would prefetch some data that it would
never access. No such data exists if whole files are accessed by application
request handlers. The column “Streams per request” indicates the num-
ber of prefetching streams each request handler initiates simultaneously.
More prefetching streams per request handler would create higher memory
contention.
access recency or frequency-based reclamation policy would not make much
difference for managing prefetched pages. Although MRU-like schemes may
benefit prefetched pages, they perform poorly with general application work-
loads and in practice they are only used for workloads with exclusively se-
quential data access pattern.
#2. PC-AccessHistory: The original kernel with a prefetch cache whose allo-
cation is dynamically maintained using our gradient-descent algorithm de-
scribed in Section 4.3.2. Pages in the prefetch cache are reclaimed in FIFO
order, which is the effective reclamation order for prefetched pages under
any access history-based reclamation policy.
#3. PC-Longest: The original kernel with a prefetch cache managed by the
Longest page reclamation policy described in Section 4.3.1.
CHAPTER 4. MANAGING PREFETCH MEMORY 67
#4. PC-Coldest: The original kernel with a prefetch cache managed by the Cold-
est page reclamation policy described in Section 4.3.1.
#5. PC-Coldest+: The original kernel with a prefetch cache managed by the
Coldest+ page reclamation policy described in Section 4.3.1.
The original Linux 2.6.10 exhibits several performance anomalies when sup-
porting data-intensive online servers [SZL05]. They include a mis-management of
prefetching during disk congestion and some lesser issues in the disk I/O sched-
uler. Details about these problems and some suggested fixes can be found in the
next chapter. All experimental results in this dissertation are based on corrected
kernels.
Figure 4.4 illustrates the I/O throughput of the four microbenchmarks at
all concurrency levels. The results are measured with the maximum prefetching
depth at 512 KB and the server memory size at 512 MB. Figure 4.4(A) shows con-
sistently high performance attained by all memory management methods for the
first microbenchmark (One-Whole). This is because this microbenchmark accesses
whole files and consequently all prefetched pages will be accessed. Further, for
applications with strictly sequential access pattern, the anticipatory I/O schedul-
ing allows each request handler to run without interruption. This results in a
nearly-serial execution even at high concurrency levels, therefore little memory
contention exists in the system.
Figure 4.4(B) shows the I/O throughput results for microbenchmark One-Rand.
Since this microbenchmark does not access whole files, some prefetched pages will
never be accessed. Results indicate that our proposed techniques can
improve the I/O throughput of access history-based memory management at high
execution concurrencies. In particular, the improvement achieved by PC-Coldest+
is 8–97% when the concurrency level is above 200.
[Figure 4.4: four panels, (A) One-Whole, (B) One-Rand, (C) Two-Rand, and (D) Four-64KB, each plotting application I/O throughput (in MB/sec) against the number of concurrent request handlers for AccessHistory, PC-AccessHistory, PC-Longest, PC-Coldest, and PC-Coldest+.]
Figure 4.4: Microbenchmark I/O throughput at different concurrency
levels. Our proposed prefetch memory management produces throughput
improvement on One-Rand and Two-Rand.
Figure 4.4(C) shows the performance for Two-Rand. The anticipatory schedul-
ing is not effective for this microbenchmark since a request handler in this case
accesses two streams alternately. Results show that our proposed techniques can
improve the performance of access history-based memory management at the con-
currency level of 200 or higher. For instance, the improvement of PC-Coldest+
is 34% and two-fold at the concurrency levels of 400 and 500 respectively. This
improvement is attributed to two factors: 1) the difference between PC-Coldest+
and PC-AccessHistory is attributed to the better page reclamation order inside
the prefetch cache; 2) the improvement of PC-AccessHistory over AccessHistory
is due to the separation of prefetched pages from the rest of the memory buffer
cache. Finally, we observe that the performance of PC-Coldest and PC-Longest
is not always as good as that of PC-Coldest+.
Figure 4.4(D) illustrates the performance of the random-access microbench-
mark Four-64KB. We observe that all the kernels perform similarly well at all
concurrency levels. Since prefetched pages in this microbenchmark are mostly not
used, early eviction would not hurt the server performance. On the other hand,
our memory partitioning scheme can quickly converge to small prefetch cache al-
locations through greedy adjustments and thus would not waste space on keeping
prefetched pages.
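The greedy convergence described above can be illustrated with a toy stand-in for the gradient-descent partitioning of Section 4.3.2. The function name, step size, and bounds are all illustrative assumptions; the real algorithm runs inside the kernel.

```python
def adjust_allocation(alloc, benefit, step=64, lo=0, hi=4096):
    """One greedy adjustment step for the prefetch-cache allocation
    (in pages): move in whichever direction improves the estimated
    benefit, preferring to shrink when growing does not help."""
    up = min(alloc + step, hi)
    down = max(alloc - step, lo)
    if benefit(up) > benefit(alloc):
        return up
    if benefit(down) >= benefit(alloc):
        return down
    return alloc

# For a workload whose prefetched pages are mostly unused (like
# Four-64KB), the benefit estimate is flat in the allocation size,
# so the allocation greedily shrinks toward zero.
alloc = 1024
for _ in range(16):
    alloc = adjust_allocation(alloc, benefit=lambda a: 0.0)
print(alloc)
```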
Note that despite the substantial improvement on Two-Rand and One-Rand,
our proposed prefetch memory management cannot eliminate memory contention
and the resulting performance degradation at high concurrency. Such elimination
is neither intended nor achievable. Consider a server with 1000 concurrent threads
each prefetching up to 512 KB. Just the pages pinned for outstanding I/O oper-
ations alone may occupy up to 512 MB memory space in this server.
To assess the remaining room for improvement, we approximate the offline
optimal reclamation policy through direct application assistance. In our approxi-
mation, applications provide exact hints about their data access pattern through
an augmented open() system call. With the hints, the approximated optimal pol-
icy can easily identify those prefetched but unneeded pages and evict them first
(following the offline optimal replacement rule A in Section 4.3.1). The approx-
imated optimal policy does not fully implement offline optimal replacement
rule B, because there is no way to determine the access order among pages from
different processes without the process scheduling knowledge. For the eviction
order among pages that would be accessed by their owner request handlers, the
approximated optimal policy follows that of Coldest+. Figure 4.5 shows that
[Figure 4.5: two panels, (A) One-Rand and (B) Two-Rand, each plotting application I/O throughput (in MB/sec) against the number of concurrent request handlers for AccessHistory, PC-Coldest+, and the approximated optimal policy.]
Figure 4.5: Comparison among AccessHistory, Coldest+, and an ap-
proximated optimal page reclamation policy. The throughput of Coldest+
is 10–32% lower than that of the approximated optimal while it is 8–97%
higher than that of AccessHistory at high concurrencies.
the performance of the Coldest+ approach is 10–32% lower than the approximated
optimal at high concurrency for the two tested microbenchmarks.
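At a high level, the hint-assisted approximation above orders evictions as follows. The function below is a hypothetical user-level sketch of offline-optimal rule A from Section 4.3.1, not the actual augmented open() interface or kernel code; the Coldest+ ordering is passed in as an opaque input.

```python
def eviction_order(prefetched_pages, will_access, coldest_plus_order):
    """Evict prefetched-but-unneeded pages first (rule A), then fall
    back to the Coldest+ ordering for pages that will be accessed."""
    unneeded = [p for p in prefetched_pages if p not in will_access]
    needed = [p for p in coldest_plus_order if p in will_access]
    return unneeded + needed

order = eviction_order(
    prefetched_pages=[1, 2, 3, 4],
    will_access={2, 4},          # exact hints from the application
    coldest_plus_order=[4, 3, 2, 1])
print(order)  # pages 1 and 3 (never to be accessed) come first
```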
4.4.3 Performance of Real Applications
We examine the performance of the proposed techniques for two real applica-
tion workloads. Figure 4.6 (A) shows the application-perceived I/O throughput of
the index searching server at various concurrency levels. The results are measured
with the maximum prefetching depth at 512 KB and the server memory size of
512 MB. All schemes perform similarly well at very low concurrency. We notice
that the I/O throughput initially climbs up as the concurrency level increases
from 1 to 100. This is mainly due to shorter average seek distance when the disk
scheduler can choose from more concurrent I/O requests for seek reduction. At
high concurrency levels (100 and above), the PC-Coldest+ scheme outperforms
the access history-based management by 5–52%. Again, this improvement is at-
tributed to two factors. The performance difference between PC-AccessHistory
[Figure 4.6: panel (A) plots the index searching application I/O throughput (in MB/sec) and panel (B) plots the page thrashing rate (in pages/sec), both against the number of concurrent request handlers, for the five schemes.]
Figure 4.6: I/O throughput and page thrashing rate of the index search-
ing server.
and AccessHistory (about 10%) is due to separated management of prefetched
pages while the rest is attributed to better reclamation order within the prefetch
cache.
In order to better understand the performance results, we examine the page
thrashing rate in the server. The page thrashing rate is measured as the number of
page misses on prefetched but then prematurely evicted memory pages per second.
Figure 4.6 (B) shows the result for the index searching server. We observe that the
page thrashing rate increases as the server concurrency level goes up. Particularly
at the concurrency of 800, the page thrashing rates are around 306, 285, 274, 271
and 267 pages per second for the five schemes respectively. We observe that the
increase in page thrashing rates across the five schemes matches well with the
decrease of the application I/O throughput in Figure 4.6 (A), indicating that
page thrashing is a main factor for the server performance degradation at high
concurrency.
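The thrashing metric above can be expressed as a small piece of bookkeeping. This is an illustrative user-level sketch (all names are hypothetical); in our experiments the accounting is done inside the kernel, and dividing the event count by elapsed time gives the pages-per-second rate.

```python
class ThrashingMeter:
    """Count misses on pages that were prefetched and then evicted
    before ever being accessed (premature evictions)."""
    def __init__(self):
        self.prefetched = set()   # prefetched, in memory, not yet accessed
        self.premature = set()    # evicted before the first access
        self.thrash_events = 0

    def on_prefetch(self, page):
        self.prefetched.add(page)

    def on_evict(self, page):
        if page in self.prefetched:
            self.prefetched.discard(page)
            self.premature.add(page)       # evicted too early

    def on_access(self, page):
        if page in self.prefetched:
            self.prefetched.discard(page)  # hit: the prefetch was useful
        elif page in self.premature:
            self.premature.discard(page)
            self.thrash_events += 1        # miss on a prematurely evicted page

m = ThrashingMeter()
m.on_prefetch(10); m.on_prefetch(11)
m.on_evict(10)   # page 10 leaves memory before being read
m.on_access(10)  # the later read misses: one thrashing event
m.on_access(11)  # page 11 was still cached: no thrashing
print(m.thrash_events)
```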
We notice that even under the improved management, the page thrashing rate
still increases at high concurrency levels. We measure the statistics on how long
each page has stayed in memory before being evicted. We concentrate on those
prefetched but then prematurely evicted pages which later cause page misses.
[Figure 4.7: percentage of evicted pages plotted against I/O concurrency (400, 600, and 800 concurrent request handlers) for AccessHistory, PC-AccessHistory, and PC-Coldest+.]
Figure 4.7: Percentage of prematurely evicted pages that have stayed in
memory for more than 8 seconds for the index searching server. We only
consider those prefetched but then prematurely evicted pages which later
cause page misses.
Figure 4.7 shows the percentage of those pages that stay in memory for more than
8 seconds. We observe that at high concurrency levels (i.e., 400, 600 and 800),
this percentage for the PC-Coldest+ scheme is much higher than the percentages
for the other two schemes, PC-AccessHistory and AccessHistory. Across all
three I/O concurrencies shown in the figure, the average time each such page
stays in memory is 12.44, 8.31, and 6.22 seconds for the three page reclamation
policies respectively. These results show that the proposed techniques keep
useful prefetched pages in memory much longer than both the default Linux
page reclamation policy and the FIFO policy combined with a prefetch cache.
The results also suggest that, to further reduce the page thrashing rate at high
concurrency levels, evicted pages would need to be kept in memory even longer,
which our current heuristic policies do not achieve. We leave this issue for
future consideration. However, this might not be achievable. Again,
consider a server with 1000 concurrent threads each prefetching up to 512 KB.
[Figure 4.8: application I/O throughput (in MB/sec) of the Apache Web server hosting media clips, plotted against the number of concurrent request handlers for the five schemes.]
Figure 4.8: Throughput of the Apache Web server hosting media clips.
Just the pages pinned for outstanding I/O operations alone may occupy up to
512 MB memory space in this server.
Figure 4.8 shows the application I/O throughput of the Apache Web server
hosting media clips. The results are measured with the maximum prefetching
depth at 512 KB and the server memory size at 256 MB. The four prefetch
cache-based approaches perform better than the access history-based approach
at all concurrency levels. This improvement is due to the separated management
of prefetched pages. The differences among the four prefetch cache-based
approaches are slight, but PC-Coldest+ consistently performs better than the
others, owing to its more effective reclamation of prefetched pages.
4.4.4 Impact of Factors
In this section, we explore the performance impact of the I/O prefetching
aggressiveness, the server memory size, and the application dataset size. We use
a microbenchmark in this evaluation because of its flexibility in adjusting the
[Figure 4.9: Two-Rand application I/O throughput (in MB/sec) against the number of concurrent request handlers, for PC-Coldest+ and AccessHistory at 512 KB, 256 KB, and 128 KB maximum prefetching depths.]
Figure 4.9: Performance impact of the maximum prefetching depth for
the microbenchmark Two-Rand. “PCC+” stands for PC-Coldest+ and
“AH” stands for AccessHistory.
workload parameters. When we evaluate the impact of one factor, we set the
other factors to default values: 512 KB maximum prefetching depth, 512 MB for
the server memory size, and 24 GB for the application dataset size. We only show
the performance of PC-Coldest+ and AccessHistory in the results.
Figure 4.9 shows the performance of the microbenchmark Two-Rand with
different maximum prefetching depths: 128 KB, 256 KB, and 512 KB. Various
maximum prefetching depths are achieved by adjusting the appropriate parameter
in the Linux kernel. We observe that our proposed prefetch memory management
scheme performs consistently better than AccessHistory at all configurations. The
improvement tends to be more substantial for the servers with higher prefetching
depths due to higher memory contention (and therefore more room to improve
performance).
Figure 4.10 shows the performance of the microbenchmark at different server
memory sizes: 2 GB, 1 GB, and 512 MB. Our proposed prefetch memory man-
[Figure 4.10: Two-Rand application I/O throughput (in MB/sec) against the number of concurrent request handlers, for PC-Coldest+ and AccessHistory at 2 GB, 1 GB, and 512 MB server memory sizes.]
Figure 4.10: Performance impact of the server memory size for the mi-
crobenchmark Two-Rand. “PCC+” stands for PC-Coldest+ and “AH”
stands for AccessHistory.
agement scheme performs consistently better than AccessHistory at all memory
sizes. The improvement tends to be less substantial for the servers with large
memory due to lower memory contention (and therefore less room to improve
performance).
Figure 4.11 illustrates the performance of the microbenchmark at different
workload data sizes: 1 GB, 8 GB, and 24 GB. Intuitively better performance is
correlated with smaller workload data sizes. However, the change in abso-
lute performance does not significantly affect the performance advantage of our
strategy over access history-based memory management.
4.4.5 Summary of Results
• Compared with access history-based memory management, our proposed
prefetch memory management scheme (PC-Coldest+) can improve the per-
formance of real workloads by 5–52% at high execution concurrencies. The
[Figure 4.11: Two-Rand application I/O throughput (in MB/sec) against the number of concurrent request handlers, for PC-Coldest+ and AccessHistory at 1 GB, 8 GB, and 24 GB workload data sizes.]
Figure 4.11: Performance impact of the workload data size for the mi-
crobenchmark Two-Rand. “PCC+” stands for PC-Coldest+ and “AH”
stands for AccessHistory.
performance improvement can be attributed to two factors: the separated
management of prefetched pages and enhanced reclamation policies for them.
• At high concurrencies, the performance of PC-Coldest+ is 10–32% below
an approximated optimal page reclamation policy that uses application-
provided I/O access hints.
• The microbenchmark results suggest that our scheme does not provide per-
formance enhancement for applications that follow a strictly sequential access
pattern in accessing whole files. This is because the anticipatory I/O sched-
uler executes request handlers serially for these applications, thus eliminat-
ing memory contention even at high concurrency. In addition, our scheme
does not provide performance enhancement for applications with only small-
size random data accesses since most prefetched pages will not be accessed
for these applications (and thus their reclamation order does not matter).
• The performance benefit of the proposed prefetch memory management may
be affected by various system parameters. In general, the benefit is higher
for systems with more memory contention. Additionally, our scheme does
not degrade the server performance in any of the configurations that we have
tested.
4.5 Related Work
Concurrency management. The subject of highly-concurrent online servers
has been investigated by several researchers. Pai et al. showed that the HTTP
server workload with moderate disk I/O activities can be efficiently managed by
incorporating asynchronous I/O or helper threads into an event-driven HTTP
server [PDZ99a]. A later study by Welsh et al. further improved the perfor-
mance by balancing the load of multiple event-driven stages in an HTTP request
handler [WCB01]. The more recent Capriccio work provided a user-level thread
package that can scale to support hundreds of thousands of threads [vBCZ+03].
These studies mostly targeted CPU-intensive workloads with light disk I/O activ-
ities (e.g., the typical Web server workload and application-level packet routing).
They did not directly address the memory contention problem for data-intensive
server systems at high concurrency.
Admission control. The concurrency level of an online server can be con-
trolled by employing a fixed-size process/thread pool and a request waiting queue.
QoS-oriented admission control strategies [LJ00, VG01, VTFM01, WC03] can ad-
just the server concurrency according to a desired performance level. Controlling
the execution concurrency may mitigate the memory contention caused by I/O
prefetching. However, these strategies may not always identify the optimal con-
currency threshold when the server load fluctuates. Additionally, keeping the
concurrency level too low may reduce the server resource utilization efficiency
(due to longer average disk seek distance) and lower the server responsiveness
(due to higher probability for short requests to be blocked by long requests).
Our work complements the concurrency control management by inducing slower
performance degradation when the server load increases and thus making it less
critical to choose the optimal concurrency threshold.
Memory replacement. A large body of previous work has explored efficient
memory buffer page replacement policies for data-intensive applications, such as
LRU/LFU [MGST70], Working-Set [Den68], CLOCK [Smi78], 2Q [JS94], Unified
Buffer Management [KCK+00], and LIRS [JZ02]. All these memory replacement
strategies are based on the page reference history such as reference recency or
frequency. In data-intensive server systems, a large number of prefetched but
not-yet-accessed pages may emerge due to aggressive I/O prefetching and high
execution concurrency. Due to a lack of any access history, these pages may not
be properly managed by access recency or frequency-based replacement policies.
Memory management for caching and prefetching. Cao et al. proposed
a two-level page replacement scheme that allows applications to control their own
cache replacement while the kernel controls the allocation of cache space among
processes [CFL94]. Patterson et al. examined memory management based on
application-disclosed hints on application I/O access pattern [PGG+95]. Such
hints were used to construct a cost-benefit model for deciding the prefetching and
caching strategy. Kimbrel et al. explored the performance of application-assisted
memory management for multi-disk systems [KTP+96]. Tomkins et al. further
expanded the study to multi-programmed execution of several processes [TPG97].
Hand studied application-assisted paging techniques with the goal of performance
isolation among multiple concurrent processes [Han99]. Anastasiadis et al. ex-
plored an application-level block reordering technique that can reduce server disk
traffic when large content files are shared by concurrent clients [AWC04]. Mem-
ory management with application assistance or application-level information can
achieve high (or even optimal) performance. However, the reliance on such
assistance may limit the applicability of these approaches. The goal of our work
is to provide transparent memory management that does not require any
application assistance.
Chang et al. studied the automatic generation of I/O access hints through
OS-supported speculative execution [CG99, FC03]. The purpose of their work is
to improve the prefetching accuracy. They do not address the management of
prefetched memory pages during memory contention.
Several researchers have addressed the problem of memory allocation between
prefetched and cached pages. Many such studies required application assistance or
hints [CFKL95, KTP+96, PGG+95, TPG97]. Among those that did not require
such assistance, Thiebaut et al. [TSW92] and Kaplan et al. [KMC02] employed
page hit histogram-based cost-benefit models to dynamically control the prefetch
memory allocation with the goal of minimizing the overall page fault rate. Their
solutions require expensive tracking of page hits and they have not been imple-
mented in practice. In this work, we employ a greedy partitioning algorithm that
requires much less state maintenance and thus is easier to implement.
4.6 Conclusion
This chapter presents the design and implementation of a prefetch memory
management scheme supporting data-intensive server systems. Our main con-
tribution is the proposal of novel page reclamation policies for prefetched but
not-yet-accessed pages. Previously proposed page reclamation policies are prin-
cipally based on the page access history (access recency or frequency), which are
not well suited for prefetched memory pages that have no such history. Addi-
tionally, we employ a gradient descent-based greedy algorithm that dynamically
adjusts the memory allocation for prefetched pages. Our work is guided by previ-
ous results on optimal memory partitioning while we focus on a design with low
implementation overhead. We have implemented the proposed techniques in the
Linux 2.6.10 kernel and conducted experiments using microbenchmarks and two
real application workloads. Results suggest that our approach can substantially
reduce prefetch page thrashing and improve application I/O throughput at high
concurrency levels.
5 I/O System Performance
Anomalies
5.1 Introduction
As discussed in previous chapters, operating system support for data-intensive
server systems can be improved by more efficient design and implementation in
different system components, such as file system prefetching and caching. On
the other hand, better OS support for data-intensive server systems can also be
achieved by reducing system performance bugs or anomalies.
Large and complex system software, such as an operating system, may not al-
ways behave as well as expected. We define performance bugs or anomalies as
problems in system implementation that cause the performance to deviate from
the expectation based on the design protocol or algorithm. For example, performance
anomalies may be caused by overly-simplified implementation, mis-management
of special cases, or erroneous approximations. Unlike most correctness-related
bugs, performance anomalies usually do not cause system crashes or deadlocks.
It is challenging to identify performance anomalies and pinpoint their root
causes in complex systems, because such systems support wide ranges of workloads
and performance anomalies only manifest under particular system configurations
and workload conditions. We examined closed (i.e., resolved and corrected) bugs
CHAPTER 5. I/O SYSTEM PERFORMANCE ANOMALIES 82
concerning Linux 2.4/2.6 IO/storage, file system, and memory management in the
Linux bug tracking system [LKB]. Among 219 reported bugs, only 9 are primarily
performance-related (about 4%) 1. In a previous work [SZL05], we proposed a
model-driven anomaly characterization approach and used it to discover operating
system performance bugs when supporting disk data-intensive server systems.
We constructed a whole-system I/O throughput model as the reference of expected
performance and we used statistical clustering and characterization of performance
anomalies to guide debugging. As a result, our approach helped us quickly identify
four performance anomalies in the I/O system of Linux kernel.
In this chapter, we will present the Linux I/O system performance anomalies
we identified during the previous work [SZL05] and more recent follow-up work.
Our focus is on examining the anomalies themselves and exploring their root
causes. All the anomalies are about file system prefetching and I/O scheduling.
For most anomalies, we describe the high-level algorithmic design, illustrate the
anomalous implementation, and show the scenarios where they may manifest. The
rest of this chapter is organized as follows. Section 5.2 reviews Linux file system
prefetching algorithm and I/O scheduling algorithms. Section 5.3 elaborates on
the performance anomalies in Linux I/O system implementation. Section 5.4
provides general discussions on our experience in operating system performance
debugging. Section 5.5 discusses some related work about system debugging. And
Section 5.6 concludes this chapter.
1We manually examined each reported bug and determined a bug to be primarily
performance-related if it causes significant performance degradation but it does not cause any
correctness-related symptoms like system crashes or deadlocks.
5.2 Linux I/O System Algorithms
All the anomalies covered in this chapter are about file system prefetching
and I/O scheduling. We review Linux file system prefetching algorithm and I/O
scheduling algorithms in this section.
5.2.1 Prefetching Algorithm in Linux
Prefetching or read-ahead2 means reading adjacent pages from the file before
they are actually accessed by the application. Then if prefetched pages are ac-
cessed later by the application, they are already available in the page cache in
memory. Therefore, prefetching can improve disk I/O efficiency by reducing seek
overhead for sequential accesses to disk-resident data, although it is not very useful
for random data accesses.
In practice, sequential I/O workloads are commonplace in most applications.
To support sequential disk accesses, Linux 2.6 adopts a sophisticated prefetching
algorithm which tries to detect sequential or non-sequential patterns in file ac-
cesses. To this end, the Linux prefetching algorithm manages two windows - the
current window and the ahead window, as shown in Figure 5.1. They are two sets
of pages, each corresponding to a contiguous region of the file.
The current window contains pages requested by the process or prefetched by
the kernel. All the pages in the current window have their descriptors in the
page cache, although some pages may still be unavailable (locked) due to
ongoing data transfer. The ahead window contains only those pages that are
prefetched by the kernel but not yet requested by the process. The pages in
the ahead window sequentially follow those in the current window.
When a file is opened, both windows are empty. When the process’s first read
2Read-ahead is the term used in the Linux kernel source. We prefer prefetching in our work.
Figure 5.1: Current and ahead windows in asynchronous prefetching
access to the file starts from its first page, the kernel assumes the process will
read the file sequentially. In this case, the current window is created from the
first page, and the window size depends on the number of pages requested by the
process. At this moment, the ahead window is still empty. Later on, if the kernel
detects sequential access from the process, it creates the ahead window following
the last page of the current window. The ahead window size is either two or four
times the size of the current window. The ideal case is that while the application
is accessing the cached pages in the current window, the kernel is prefetching the
pages in the ahead window. When all the pages in current window are traversed
and the process begins to read a page in the ahead window, the current window
will be replaced by the ahead window (see Figure 5.1). The ahead window will
be reset following the new current window and its size will be either two or four
times the size of the current window. But size of both windows can not be larger
than the default maximum prefetching depth of Linux, which is 32 pages. This
process is called asynchronous prefetching because the prefetching is always ahead
of the data accesses by the application.
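The window turnover described above can be sketched as follows. This is a simplification under assumed parameters: a fixed doubling growth factor (the kernel may also grow by four times) and the default 32-page cap; none of the kernel's edge cases are modeled.

```python
MAX_PREFETCH = 32  # default maximum prefetching depth, in pages

def advance_windows(cur_size, ahead_size, growth=2):
    """One window turnover: when the reader crosses into the ahead
    window, the ahead window becomes the current window and a new
    ahead window of growth x its size is created, capped at
    MAX_PREFETCH pages."""
    new_cur = ahead_size
    new_ahead = min(new_cur * growth, MAX_PREFETCH)
    return new_cur, new_ahead

cur, ahead = 4, 8   # e.g. the first sequential read requested 4 pages
sizes = []
for _ in range(4):
    cur, ahead = advance_windows(cur, ahead)
    sizes.append((cur, ahead))
print(sizes)  # window sizes ramp up and saturate at the 32-page cap
```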
On the other hand, in the following two cases, the current window and ahead
window are emptied and asynchronous prefetching is temporarily disabled.
• when the kernel detects a file access that is not sequential.
• when the process’s first read access to the file does not start from its first
page.
In these two cases, the kernel performs synchronous prefetching. Because the
requested pages are not available in the page cache, the application blocks and
waits for the I/O operations to complete. It is still prefetching, because the
kernel usually reads more pages than the application requested. Asynchronous
prefetching is re-enabled as soon as sequential file access is detected.
Therefore, with sequential file accesses, the prefetching algorithm aggressively
increases the prefetching depth (window sizes). However, the aggressiveness of
prefetching needs to be constrained by the available system resources. In general,
the size of the ahead window is shrunk when the kernel cannot find a prefetched
page in the page cache. This may happen when the prefetched page is reclaimed
by the kernel from the page cache under memory pressure.
The Linux prefetching depth can be tuned in different ways. In Linux 2.4,
system administrators may change the variable vm_max_readahead by modifying
the /proc/sys/vm/max-readahead file. In Linux 2.6, users are able to change the
prefetching depth for an open file by modifying its ra_pages field through the
posix_fadvise() system call. In both kernels, users can also change the prefetching
depth by modifying constant variables in the Linux source file mm.h.
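For instance, a user program can pass whole-file access-pattern advice through the standard posix_fadvise() interface, shown here via Python's os wrapper. Whether and how the kernel adjusts ra_pages in response depends on the kernel version; this sketch only demonstrates the call.

```python
import os
import tempfile

def advise_sequential(fd):
    """Advise the kernel that fd will be read sequentially.  With
    offset=0 and length=0 the advice covers the whole file.  Returns
    False on platforms without posix_fadvise (it is POSIX-only and
    exposed by Python 3.3+)."""
    if hasattr(os, "posix_fadvise"):
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        return True
    return False

fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 4096)
    applied = advise_sequential(fd)
finally:
    os.close(fd)
    os.unlink(path)
print("sequential advice applied:", applied)
```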
It is necessary for prefetching to avoid congesting the request queue in the
block-device layer. By default, the maximum number of pending read requests
for a block device is 128. The kernel regards the request queue as congested
when the number of pending requests is more than a certain threshold (113 by
default 3).
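The threshold arithmetic can be checked directly; the integer division below is an assumption about the kernel's rounding, not a claim about the exact source line.

```python
def congestion_threshold(queue_size=128):
    """Congestion threshold of roughly 7/8 of the queue size plus one;
    this gives 113 for the default 128-entry read request queue."""
    return queue_size * 7 // 8 + 1

print(congestion_threshold())     # 113 for the default queue size
print(congestion_threshold(64))
```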
5.2.2 I/O Scheduling Algorithms in Linux
I/O schedulers are also called elevators. They manage I/O requests for each
block device. Specifically, an I/O scheduler organizes the request queue by keeping
requests in order and merging the consecutive ones. The overall objectives include
reducing the disk seek overhead, improving the I/O throughput, and improving
fairness among I/O requests or among processes in the system. We need to note
that the processing of a specific I/O request may be delayed by the scheduler
in order to improve overall performance. Linux 2.6 supports four different I/O
schedulers explained as follows.
• Noop scheduler: Just as its name implies, this I/O scheduler does almost
nothing to each request. It simply adds each request to the request queue in
FIFO order. The only optimization the noop scheduler performs is to merge
consecutive requests when that is possible.
• CFQ scheduler: The Completely Fair Queuing scheduler aims to treat all
processes in the system fairly. In addition to the dispatch queue for
the device, there is a request queue for each process that performs I/O
operations. The scheduler works on the per-process queues in a round-robin
fashion. In each round, the scheduler selects a batch of requests from
one request queue, dispatches the requests, and then moves to the next
queue. The CFQ scheduler was designed for multimedia workloads that require
short I/O latency.
• Deadline scheduler: This scheduler was introduced to avoid starvation
when requests keep arriving in one direction of the request queue. In addition
to the dispatch queue, there are two sorted queues (read and write)
and two deadline queues (read and write). Each I/O request belongs to
both a sorted queue and a deadline (FIFO) queue. In general, the scheduler
chooses requests from a sorted queue and moves them to the dispatch queue.
The scheduler also checks the deadline queues and first schedules those
requests whose deadlines have expired.
3In the Linux source file ll_rw_blk.c, the threshold is calculated as 7/8 · Q + 1, where Q is the
maximum queue size for the block device (default 128).
• Anticipatory scheduler: This I/O scheduler was first proposed by Iyer and
Druschel [ID01] in 2001. It is widely used in Linux 2.6. Basically, after
each I/O request, the scheduler will idle the disk head for a short period of
time (even when there are pending I/O requests), so that consecutive I/O
requests that belong to the same process/thread may be serviced without
interruption.
We briefly describe how the anticipatory I/O scheduler is implemented in Linux.
The anticipatory scheduler is built on top of the deadline scheduler. It also has
two sorted queues (read and write) and two deadline queues (read and write) for
each block device. The scheduler alternately scans the sorted read and write
queues, with preference given to reads. The scheduler implements a one-way
elevator algorithm; however, a backward seek is possible when the seek distance
to a request behind the elevator is much shorter than the distance to the request
in front of it. When the deadline of an I/O request has expired, the scheduler
stops the current anticipation or scanning and services the expired request. The
scheduler services read or write requests in batches, where a batch is a collection
of I/O requests. Read batches and write batches are processed alternately.
Read and write expirations are checked only when the corresponding batch type
is underway. Anticipation occurs only when the scheduler is processing a read
batch. When a read request is completed, the scheduler checks the next request
from the sorted read queue. If the next request is from the same process that
issued the just-completed request, or if the next request is very close to the just-
completed request, it is dispatched immediately. Otherwise, the scheduler may
decide to wait for a short period of time, anticipating that another request
close to the previous one is coming up. When the anticipation timer expires, the
scheduler stops anticipating and dispatches the next read request in the sorted
read queue. The scheduler's anticipation decision is based on statistics maintained
for each process, including mean think time and mean seek distance. The kernel
source file for the anticipatory scheduler implementation is as-iosched.c. There
are five parameters in the Linux implementation of the anticipatory I/O scheduler.
They can be customized in the kernel source.
• antic_expire: The amount of time in milliseconds the scheduler will anticipate
for. According to the design, it is set to around 6 by default.
• read_expire: The deadline expiration time in milliseconds for read requests.
• read_batch_expire: The deadline expiration time in milliseconds for a read
request batch to be scheduled consecutively.
• write_expire: The deadline expiration time in milliseconds for write requests.
• write_batch_expire: The deadline expiration time in milliseconds for a write
request batch to be scheduled consecutively.
5.3 Anomalies in Linux I/O System Implementation
We show five performance anomalies we identified in the I/O system of the
Linux 2.6 kernel. Three of them are included in the previous work [SZL05] and
two are from a recent follow-up work.
5.3.1 Suppression of Prefetching
First of all, we show a performance anomaly concerning the prefetching algorithm
implementation. It causes performance losses for highly concurrent workloads
(concurrency ≥ 128) with moderately long sequential access streams (length ≥
256 KB). As noted for the kernel prefetching algorithm, it is necessary to avoid
congesting the request queue in the block-device layer. In the implementation of
Linux 2.6.10, upon submitting a prefetching request, if the read request queue is
considered congested, the prefetching request is abandoned.
The intuition behind this treatment is that asynchronous prefetching should be
disabled when the I/O device is fully scheduled. However, the same function
implementing asynchronous prefetching operations also handles synchronous I/O
operations, and the implementation does not distinguish the two kinds. That is,
when the read request queue is considered congested, all prefetching requests are
abandoned. In practice, it is harmless to discard the asynchronous I/O requests,
because the requested pages have not yet been accessed by the application. For
the synchronous I/O requests, however, things are different: the application is
waiting for the requested pages. If the synchronous I/O requests are discarded,
each requested page results in an inefficient single-page makeup I/O. In effect,
the system works as if each requested page caused a page fault.
This problem is named performance anomaly #1. To fix this anomaly, the
kernel needs to distinguish asynchronous I/O requests from synchronous ones;
only asynchronous I/O requests should be abandoned when disk congestion
occurs. We make changes to the main function implementing the prefetching
algorithm (page_cache_readahead() in mm/readahead.c). When disk congestion
occurs, the corrected kernel lets the synchronous prefetching requests proceed
and only discards asynchronous requests. Our fix coincides with the later kernel
[Figure: server I/O throughput (MB/sec) vs. I/O concurrency (1–256 concurrent ops) for the index searching workload, comparing the original Linux 2.6.10 with the #1 anomaly fix.]
Figure 5.2: Throughput of a server workload. The anticipatory I/O
scheduler is used.
change in the official Linux 2.6.11.
We show the application I/O throughput of a server workload in Figure 5.2. The
index searching application involves multiple concurrent request handlers. Each
request handler accesses one or more concurrent I/O streams alternately. In this
test, the kernel with anomaly fix #1 substantially improves I/O performance at
high execution concurrency compared with the original Linux kernel. Specifically,
the performance improvement is about 3.8 times and 3.7 times at concurrency
128 and concurrency 256, respectively. The average performance improvement
(across all tested concurrencies) of the corrected kernel over the original kernel is
25%. Overall, anomaly fix #1 helps make the application performance more
predictable at high execution concurrency. In addition, we can infer that in our
experiment disk congestion occurs when the I/O concurrency is somewhere
between 32 and 128.
[Figure: I/O throughput (MBytes/sec) vs. concurrency (32–128 concurrent ops) for microbenchmark One-64KB-0, with congestion thresholds 113 and 62.]
Figure 5.3: Throughput of a server-style microbenchmark workload.
Change of congestion threshold may cause anomalous performance deviations.
5.3.2 Threshold of Disk Congestion
As just mentioned, later Linux kernels (2.6.11 or later) have corrected
anomaly #1 by suppressing only asynchronous prefetching when the disk device
is congested. However, the concept of disk congestion is not well defined.
Currently, the somewhat arbitrary disk congestion threshold of 113 queued
I/O requests still remains, and it is not clear why this threshold was chosen. In
fact, it causes anomalous performance deviations when the threshold is crossed.
Specifically, medium-concurrency I/O may sometimes deliver much worse
performance than high-concurrency I/O.
This problem is called performance anomaly #2. We illustrate the problem
in Figure 5.3, which shows the I/O throughput of the server-style microbenchmark
One-64KB-0 running on the Linux 2.6.23 kernel. Each thread of the microbenchmark
reads a 64 KB chunk of data at a random location in a large file. We can see
that when the congestion threshold is the default 113, the throughput at
concurrency 96 and 128 is much better than at concurrency 32 and 64. Similarly,
when the congestion threshold is set to 62, the throughput at concurrency 64, 96,
and 128 is much better than at concurrency 32. Comparing the throughput at
concurrency 64, it improves by 61% when the threshold changes from 113 to 62.
This is due to the benefit of suppressing asynchronous prefetching under disk
congestion. When the threshold is 113, asynchronous prefetching is discarded at
concurrency 96 and 128; accordingly, when the threshold is 62, it is discarded at
concurrency 64, 96, and 128. The reason asynchronous prefetching may degrade
I/O throughput in this situation is that the disk is already busy and asynchronous
prefetching may waste disk resources on unwanted data. Note that each
application thread only sequentially accesses a stream of 64 KB.
To fix this anomaly, a systematic analysis may be needed to decide the best
disk congestion threshold based on device characteristics and workload
configurations. It may also make sense to make the threshold adaptive. We leave
this as a future direction.
5.3.3 Estimation of Seek Cost
This anomaly is about the anticipatory I/O scheduler. It involves workloads
at moderately high concurrency (8 and above) with stream length larger than
the maximum prefetching size (128 KB). Our investigation discovers the follow-
ing performance problem with kernel function as can break anticipation() in as-
iosched.c. The current implementation of the anticipatory scheduler will stop
an ongoing anticipation if there exists a pending I/O request with shorter seek
distance (compared with the statistical average seek distance of the anticipating
process). However, due to a significant seek initiation cost on modern disks (as
shown in Figure 3.2), the seek distance is not an accurate and reliable indication
of the seek time cost. For example, the average cost of a 0-distance seek and a
CHAPTER 5. I/O SYSTEM PERFORMANCE ANOMALIES 93
1 2 4 8 16 32 64 128 2560
5
10
15
I/O concurency (num. of concurrent ops)
Ser
ver
I/O th
roug
hput
(in
MB
/sec
)
Microbenchmark Two−Rand−0
#3 anomaly fixesOriginal Linux 2.6.10
Figure 5.4: Effect of kernel correction (for anomaly #3) on a microbench-
mark that alternatively reads random size of data in two files. The antic-
ipatory I/O scheduler is used.
2x-distance seek is much less than that of an x-distance seek. Consequentially,
the current implementation may tend to stop the anticipation when the benefit
of breaking the anticipation does not actually exceed that of continuing it.
This problem is called performance anomaly #3. We solve this problem by
using estimated seek time (instead of the seek distance) in the request anticipa-
tion cost/benefit analysis. Figure 5.4 illustrates effect of the anomaly fix #3 on
Linux 2.6.10 kernel. The average performance improvement (across all tested con-
currencies) of the corrected kernel with anomaly fix #3 is 6.4% (compared with
the original Linux kernel) for the microbenchmark Two-Rand-0.
5.3.4 Impact of System Timer Frequency
The default system timer interrupt frequency (hertz) in the Linux 2.4 kernel was
100 Hz. Starting from Linux 2.6.0, the default kernel hertz became 1000 Hz.
More recently, the default kernel hertz has been changed to 250 Hz, since Linux 2.6.13.
Although there has been some controversy over this change in the Linux
community [LKT], the default of 250 Hz remains, and it is now configurable
through the Kconfig file (kernel/Kconfig.hz) in the kernel source.
With the change of the default kernel hertz, another I/O system anomaly has
manifested. In the implementation of the anticipatory scheduler (as-iosched.c),
the anticipation expiration time is defined as follows. Note that the constant HZ
is the value of the kernel hertz.
/* max time we may wait to anticipate a read (default around 6ms) */
#define default_antic_expire ((HZ / 150) ? HZ / 150 : 1)
With a 1000 Hz kernel hertz (HZ=1000) in early 2.6 kernels, this code correctly
computes a timeout of six ticks. However, the code is problematic once the
default kernel hertz becomes 250 after Linux 2.6.13: the value of the constant
default_antic_expire is then 1, which means the expiration time is one kernel
timer tick. In a 250 Hz system, one timer tick takes 4 milliseconds. This number
is acceptable compared with the expected 6 milliseconds, although it is a little
smaller. But in general the kernel cannot deliver every timer interrupt in exactly
one tick. Even worse, a timer timeout of only one tick could expire at any time
between 0 and 4 milliseconds after the timer is set. This problematic
implementation, with an average expiration time of 2 milliseconds, is too far from
the expected 6 milliseconds. In practice, we observed timeouts as short as 100
microseconds in our kernel trace analysis.
This problem is called performance anomaly #4. Our fix to this anomaly is
to make the value of default_antic_expire 2 whenever it would otherwise be 1,
as demonstrated in the following code.
/* max time we may wait to anticipate a read (default around 6ms) */
#define default_antic_expire ((HZ/150) ? ((HZ/150 >1)?(HZ/150):2):1)
[Figure: I/O throughput (MBytes/sec) vs. I/O concurrency (1–8 concurrent ops), comparing the original Linux 2.6.23 with the #4 anomaly fix.]
Figure 5.5: Effect of kernel correction (for anomaly #4) on a microbenchmark
that reads 256 KB randomly in large files.
We show the effect of this anomaly fix on a microbenchmark in Figure 5.5. It
demonstrates that compared with the original Linux 2.6.23 kernel, our correction
with anomaly fix #4 improves system I/O throughput by 10–13% during
concurrent executions. This anomaly also reminds us that as the kernel evolves,
changes with global impact may cause unexpected problems for descendant kernel
versions. In fact, this anomaly with the anticipation timer is not the only problem
related to the minimum value of the Linux kernel timer. The implementation of
the CFQ I/O scheduler in Linux needs a similar fix: its timer currently ensures
the timeout value is at least 1 tick, but it should be at least 2 ticks. Recently,
this performance anomaly has been accepted by the Linux kernel development
team [Bug] and the anomaly fix will be included in an upcoming Linux kernel
version.
5.3.5 Timing to Start Anticipation
As mentioned earlier, the anticipatory I/O scheduler services read or write
requests in batches, and anticipation occurs when the scheduler is processing a read
batch. According to this design, when anticipation is enabled (default_antic_expire
is not zero), only one read request is allowed to be dispatched at a time. In
contrast, many asynchronous write requests may be dispatched simultaneously.
When a read request completes, the scheduler decides whether it should
start the anticipation timer for the next read request by checking whether the
completed request is the one that was waited on. However, in current Linux
kernels, the one-read-at-a-time rule is not implemented correctly. Multiple read
requests may be dispatched to the block device at the same time for at least
the following two reasons.
• During sequential data access, Linux asynchronous prefetching may submit
extra read requests, even though there is an outstanding read request in the
dispatch queue of the block device.
• Large read requests need to be split into several smaller requests so that the
size of each request does not exceed the maximum request size acceptable to
the device. The smaller requests created from the same large request are
usually dispatched at the same time, even though the intention of anticipatory
scheduling is to keep only one outstanding request dispatched at any time.
Large requests can be created by request merging and prefetching.
In the implementation of the anticipatory I/O scheduler, when multiple read
requests are dispatched, the anticipation timer is started for the next request as
soon as the first outstanding read request completes. This is inappropriate,
because other requests are still being serviced by the device, and the disk head
position does not depend on the just-completed request. Therefore, it does not
make sense to start anticipation based on the just-completed request.
[Figure: I/O throughput (MBytes/sec) vs. I/O concurrency (1–8 concurrent ops), comparing the original Linux 2.6.23, the #5 anomaly fix, and the combined #4 and #5 anomaly fixes.]
Figure 5.6: Effect of kernel corrections on a microbenchmark that reads
256 KB randomly in large files.
This problem is called performance anomaly #5. It may cause premature
timeout and issuance of other processes' requests (often before all outstanding
requests from the current process return). We correct this anomaly (in function
as_dispatch_request() in as-iosched.c) by strictly allowing only one read request
to be dispatched at any moment: the scheduler always makes sure the number
of outstanding requests is one whenever it dispatches a new read request.
Note that multiple requests are still allowed to pass to the I/O scheduler and stay
in the sorted queues and deadline queues, but only one read request can be
dispatched at any time. Another possible fix is to start anticipation only when
the last outstanding request has returned. This performance anomaly has been
reported to the Linux kernel development team and we are trying to get it across
to them. We illustrate the effect of this anomaly fix in Figure 5.6, which shows
the I/O performance of a server-style microbenchmark on a corrected Linux 2.6.23
kernel. Each thread of the microbenchmark reads 256 KB data streams at
random offsets in randomly selected large files. It demonstrates that compared
with the original Linux 2.6.23, our correction with anomaly fix #5 improves
system I/O throughput by 13–20% during concurrent execution. When we
combine the fixes for anomalies #4 and #5, the performance improvement of the
corrected kernel is 26–37%.
5.4 Discussions
Based on our performance debugging experience in previous work, we make
the following observations on Linux I/O system performance anomalies. We
argue that by identifying and fixing I/O system performance anomalies, operating
system support for data-intensive server systems can be improved.
• General-purpose monolithic operating systems such as Linux are not always
reliable as they grow more and more complex. Performance anomalies are
not easy to identify because they are usually hidden in the system and
manifest only under specific system configurations and workload conditions.
Once identified, however, these performance anomalies are typically fairly
easy to fix in comparison with developing new algorithms or protocols to
realize certain functionality.
• Performance anomalies are not always unqualified “bugs” that have to be
corrected. Some anomalies (e.g., #2) are better described as partially
beneficial optimizations. Others (e.g., #3) are most likely intentional
simplifications whose ideal alternatives (i.e., optimal threshold setting and
accurate disk seek time estimation) are difficult to realize.
• Some performance anomalies are corrected as the system software evolves;
however, new anomalies may manifest in later software versions. In
particular, when the kernel evolves, changes with global impact may cause
unexpected problems for descendant kernel versions. The change of the default
kernel timer frequency mentioned in Section 5.3.4 is an example of such a case.
• Operating systems such as Linux are not always accurately documented;
some of their documentation and source code comments are not reliable.
The same point of view is shared in recent work by Tan et al. [TYKZ07].
Believing such documentation unreservedly may cause unexpected trouble
when dealing with performance anomalies, although in general the
documentation is still a good reference. For example, the anticipatory
scheduling documentation emphasizes that at most one read request should be
dispatched to the device at any moment, but our measurements show that it
is common to have more than one read request dispatched during several of
our microbenchmark and real-application workloads.
• Kernel event tracing tools such as LTT [YD00, LTT, LTTng] are of great use
for identifying and fixing Linux kernel performance anomalies. We extended
the existing LTT tool for our tested kernels by adding customized trace
events. This tool can conveniently trace almost all kinds of kernel events
with very low overhead. Similar tools should be helpful for other complex
systems.
• When looking for performance anomalies, we need to understand what
normal performance behaviors look like. “Normal” performance behaviors are
not always easily known or even well defined. If available, an advisory
performance model can help characterize performance anomalies and guide
debugging [SZL05]. In addition, reference executions (defined next) are very
helpful for identifying the symptoms of performance anomalies. A reference
execution runs under an execution condition similar to that of the target
(the suspected anomaly). For example, a reference execution condition might
differ from the suspected anomalous condition in only one of its ten
configuration parameters. Like a literary reference, it serves as a basis of
comparison that helps us understand the commonality and uniqueness of the
target.
• In addition to improving system performance, the identification and
correction of performance anomalies helps improve the predictability of the
target system's behaviors. Such predictability is important for automatic
system management such as resource provisioning and capacity planning
[DCA+03, ABG+01, SS05, IBM]. For instance, with anomaly fix #1, the
application performance becomes more predictable at high execution
concurrency (as shown in Figure 5.2).
5.5 Related Work
Several previous studies have explored performance problems in computer
systems and applications. Two techniques (i.e., MemSpy [MGA92] and Mtool [GH93])
were proposed to analyze application performance bottlenecks or performance
losses through program instrumentation. Perl and Weihl studied performance
assertion checking [PW93] to automate the testing of performance properties of
complex systems. Crovella and LeBlanc examined performance prediction of
parallel programs based on lost cycles analysis [CL94]. Rosenblum et al. analyzed
large and complex workloads using complete system simulation [RBDH97]. In
addition, recent performance debugging work utilized statistical analysis of
online system traces [AMW+03, CKF+02] to identify defective components of
complex systems. These techniques focused on fine-grained examination of the
target system/application in specific workload settings, but they offered little on
operating system performance anomalies and no general discussion of their
characteristics.
Some other recent studies have investigated techniques to identify non-performance
bugs in operating systems. Engler et al. provided techniques that automatically
extract system correctness rules/patterns from system source code [ECH+01].
Chou et al. studied operating system errors using automatic, static compiler
analysis applied to the Linux and OpenBSD kernels [CYC+01]. Li et al. employed
data mining techniques to identify copy-paste related bugs in operating system
code [LLMZ04]. Compared with these studies, performance-oriented debugging
can be more challenging because performance bugs/anomalies are usually
connected with code semantics or algorithms, and they often do not follow simple
correctness rules. Further, performance anomalies occur when system
performance falls below expectation, and they may not cause obvious malfunctions
such as incorrect states or system crashes.
5.6 Conclusion
In this chapter, we present five performance anomalies we identified in the I/O
system of the Linux kernel. All the anomalies concern file system prefetching and
I/O scheduling. For most anomalies, we describe the high-level algorithmic design,
illustrate the anomalous implementation, and show the scenarios where they may
manifest. Driven by our experience with real Linux performance anomalies, we
also provide a general discussion of operating system performance debugging.
6 Conclusions and Future Work
A large number of networked and distributed applications in the world are built
upon data-intensive server systems. The performance of such systems is crucial
for industrial production and people's daily lives. This thesis examines OS
techniques to improve the I/O performance of data-intensive server systems. We
look at transparent OS enhancements in three aspects (file system prefetching,
memory management, and disk I/O performance) that can improve system
performance without any application assistance or changes.
6.1 Summary of Contributions
First, we propose a competitive prefetching strategy that controls the
prefetching depth so that the overhead of disk I/O switching and that of
unnecessary prefetching are balanced. The proposed strategy does not require a priori
information on the data access pattern, and achieves at least half the performance
(in terms of I/O throughput) of the optimal offline policy. We also analyze the
optimality of our competitiveness result and extend the result to capture
prefetching in the case of random-access workloads. We have implemented the
proposed competitive prefetching policy in Linux 2.6.10 and evaluated its
performance on both standalone disks and a disk array using a variety of workloads
(including two common file utilities, Linux kernel compilation, the TPC-H
benchmark, the Apache web server, and index searching). Compared to the original
Linux kernel, our competitive prefetching system improves performance by up to
53%. At the same time, it trails the performance of an oracle prefetching strategy
by no more than 42%.
Second, we explore a new memory management scheme that handles prefetched
(but not-yet-accessed) pages separately from the rest of the memory buffer cache.
We examine three new heuristic policies for identifying a victim page (among the
prefetched pages) for eviction: 1) evict the last page of the longest prefetch
stream; 2) evict the last page of the least recently accessed prefetch stream; and
3) evict the last page of the prefetch stream whose owner process has consumed
the most CPU since it last accessed the stream. These policies require no
application changes or hints about data access patterns. We have implemented
the proposed techniques in the Linux 2.6.10 kernel and conducted experiments
based on microbenchmarks and two real application workloads (a trace-driven
index searching server and the Apache web server hosting media clips). Compared
with access-history-based policies, our memory management scheme improves
the server I/O throughput of real workloads by 5–52% at high concurrency levels.
Further, the proposed approach is 10–32% below an approximated optimal page
reclamation policy that uses application-provided I/O access hints. The space
overhead of our implementation is about 0.4% of the physical memory size.
Third, we present five performance anomalies we identified in the Linux I/O
system. All the anomalies concern file system prefetching and I/O scheduling.
For most anomalies, we describe the high-level algorithmic design, illustrate the
anomalous implementation, and show the scenarios where they may manifest.
We also discuss our experience in operating system performance debugging.
In particular, two of the “bugs” we identified were reported to the Linux kernel
development team [Bug]. One of the two has been accepted and its fix will be
included in an upcoming kernel version.
6.2 Future Directions
Based on this thesis work, we suggest the following future research directions.
• Adaptive prefetching using online workload profiling: Competitive prefetching
only addresses the worst-case performance. A general-purpose operating
system prefetching policy should also deliver high average-case performance
for common application workloads. Previous work has studied adaptive
prefetching using application-disclosed data access patterns or hints [PGG+95,
TPG97]. Data access patterns can also be learned by online workload
monitoring or profiling. Previous work [PS04, PS03] has proposed a monitoring
daemon to generate hints on application data access patterns and burstiness.
We believe this technique, if combined with competitive prefetching, may
yield better average-case performance for common application workloads.
• I/O cancellation: A larger prefetching depth results in less frequent I/O
switching, and consequently fewer disk seeks per time unit. On the other
hand, prefetching may waste disk bandwidth by retrieving unwanted data.
To reduce this waste while supporting aggressive prefetching, we suggest
examining the feasibility of canceling an I/O prefetch once it is identified as
unwanted. Although there may be overhead and some undesirable impacts,
I/O cancellation provides great opportunities to make up for inaccurate
prefetching predictions and to improve I/O performance and application
responsiveness.
I/O request cancellation actually reflects a general issue for operating
systems. The idea of cancellation also makes sense for some other OS entities,
such as threads, processes, and virtual machines. Process cancellation, for
instance, would enable the system to avoid wasting resources on unneeded
tasks.
• Online system performance debugging: In systems research, performance im-
provement has been widely and deeply studied. Performance debugging is a
complementary approach that improves system performance by identifying
and fixing performance anomalies. Large and complex system software, such
as an operating system, may not always behave as expected. Because perfor-
mance anomalies can arise from many causes under different system configu-
rations, it is challenging to identify them and pinpoint their root causes in
complex systems. We participated in previous work [SZL05] that proposed a
model-driven anomaly characterization approach and used it to discover op-
erating system performance anomalies in a single-node system. It would be
desirable to study online performance debugging on large-scale systems, such
as distributed file systems over wide-area networks or parallel file systems on
large clusters.
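The anomaly-flagging step can be illustrated with a toy check that compares measured throughput against a simple seek-plus-transfer model and flags configurations that fall too far below the prediction. Both the model and the 20% tolerance are illustrative assumptions, not the actual methodology of [SZL05]:

```python
# Illustrative sketch of model-driven anomaly flagging: configurations
# whose measured I/O throughput falls well below a whole-system model's
# prediction are flagged for root-cause analysis.  Model and threshold
# are hypothetical.

def predicted_throughput(seq_bytes, seek_ms, transfer_mb_per_s):
    """Naive model: one seek per sequential run, then streaming transfer.
    Returns throughput in bytes per millisecond."""
    transfer_ms = seq_bytes / (transfer_mb_per_s * 1000.0)  # MB/s -> bytes/ms
    return seq_bytes / (seek_ms + transfer_ms)

def flag_anomalies(samples, tolerance=0.2):
    """samples: (config_name, measured, predicted) tuples.  Flag any
    configuration measuring more than `tolerance` below the model."""
    return [cfg for cfg, measured, predicted in samples
            if measured < (1.0 - tolerance) * predicted]

if __name__ == "__main__":
    pred = predicted_throughput(seq_bytes=1_000_000, seek_ms=8.0,
                                transfer_mb_per_s=50.0)
    samples = [("anticipatory+128KB", 0.70 * pred, pred),  # suspicious
               ("deadline+512KB", 0.95 * pred, pred)]      # within tolerance
    print("anomalous configurations:", flag_anomalies(samples))
```

The interesting research questions lie beyond this sketch: calibrating the model online, and sampling the configuration space efficiently on large distributed systems.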
Bibliography
[ABG+01] G. A. Alvarez, E. Borowsky, S. Go, T. H. Romer, R. Becker-Szendy,
R. Golding, A. Merchant, M. Spasojevic, A. Veitch, and J. Wilkes.
Minerva: An automated resource provisioning tool for large-scale
storage systems. ACM Trans. on Computer Systems, 19(4):483–518,
November 2001.
[AJ99] M. Arlitt and T. Jin. Workload Characterization of the 1998 World
Cup Web Site. Technical Report HPL-1999-35, HP Laboratories Palo
Alto, 1999.
[AMW+03] M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and
A. Muthitacharoen. Performance Debugging for Distributed Systems
of Black Boxes. In Proc. of the 19th ACM SOSP, pages 74–89, Bolton
Landing, NY, October 2003.
[Ask] Ask Jeeves Search. http://www.ask.com.
[AWC04] S. V. Anastasiadis, R. G. Wickremesinghe, and J. S. Chase. Circus:
Opportunistic Block Reordering for Scalable Content Servers. In
Proc. of the Third USENIX FAST, pages 201–212, San Francisco,
CA, March 2004.
[BBG+99] J. Bruno, J. Brustoloni, E. Gabber, B. Ozden, and A. Silberschatz.
Disk Scheduling With Quality of Service Guarantees. In Proc. of
the IEEE International Conference on Multimedia Computing and
Systems, Centro Affari, Firenze, Italy, June 1999.
[BC02] D. P. Bovet and M. Cesati. Understanding the Linux Kernel. O’Reilly
& Associates, 2nd edition, 2002.
[BCB03] R. von Behren, J. Condit, and E. Brewer. Why Events Are A Bad Idea
(for high concurrency servers). In Proc. of the 9th Workshop on Hot
Topics in Operating Systems, Lihue, Hawaii, May 2003.
[BCD72] A. Bensoussan, C. Clingen, and R. Daley. The Multics Virtual Mem-
ory: Concepts and Design. Communications of the ACM, 15(5):308–
318, May 1972.
[BDF+96] W. Bolosky, J. Draves, R. Fitzgerald, G. Gibson, M. Jones, S. Levi,
N. Myhrvold, and R. Rashid. The Tiger Video Fileserver. In Proc. of
the 6th International Workshop on Network and Operating Systems
Support for Digital Audio and Video, Zushi, Japan, April 1996.
[BDF+03] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho,
R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtu-
alization. In Proc. of the 19th ACM SOSP, pages 164–177, Bolton
Landing, NY, October 2003.
[Bel66] L. A. Belady. A Study of Replacement Algorithms for a Virtual-
storage Computer. IBM Systems Journal, 5(2):78–101, 1966.
[BHS95] T. Blackwell, J. Harris, and M. Seltzer. Heuristic Cleaning Algo-
rithms in Log-structured File Systems. In Proc. of the USENIX An-
nual Technical Conference, New Orleans, LA, January 1995.
[BKVV00] R. Barve, M. Kallahalla, P. J. Varman, and J. S. Vitter. Compet-
itive Parallel Disk Prefetching and Buffer Management. Journal of
Algorithms, 38(2):152–181, August 2000.
[BO03] R. Bryant and D. O’Hallaron. Computer Systems: A Programmer’s
Perspective. Prentice Hall, 2003.
[Bru99] J. C. Brustoloni. Interoperation of Copy Avoidance in Network and
File I/O. In Proc. of the IEEE INFOCOM, pages 534–542, New York
City, March 1999.
[Bug] Kernel Bug Tracker Bug 10756. http://bugzilla.kernel.org/show_bug.cgi?id=10756.
[CB04] E. Carrera and R. Bianchini. Improving Disk Throughput in Data-
Intensive Servers. In Proc. of the 10th HPCA, pages 130–141,
Madrid, Spain, February 2004.
[CFKL95] P. Cao, E. W. Felten, A. R. Karlin, and K. Li. A Study of Integrated
Prefetching and Caching Strategies. In Proc. of the ACM SIGMET-
RICS, pages 188–197, Ottawa, Canada, June 1995.
[CFL94] P. Cao, E. W. Felten, and K. Li. Implementation and Performance of
Application-Controlled File Caching. In Proc. of the First USENIX
OSDI, Monterey, CA, November 1994.
[CG99] F. Chang and G. Gibson. Automatic I/O Hint Generation Through
Speculative Execution. In Proc. of the Third USENIX OSDI, New
Orleans, LA, February 1999.
[CIRT00] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A
Parallel File System For Linux Clusters. In Proc. of the 4th Annual
Linux Showcase and Conf., pages 317–327, Atlanta, GA, October
2000.
[CKF+02] M. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. Brewer. Pinpoint:
Problem Determination in Large, Dynamic Systems. In Proc. of Int’l
Conf. on Dependable Systems and Networks, pages 595–604, Wash-
ington, DC, June 2002.
[CL94] M. E. Crovella and T. J. LeBlanc. Parallel Performance Prediction
Using Lost Cycles Analysis. In Proc. of SuperComputing, pages 600–
610, Washington, DC, November 1994.
[CP99] C. D. Cranor and G. M. Parulkar. The UVM Virtual Memory System.
In Proc. of the USENIX Annual Technical Conference, pages 117–130,
Monterey, CA, June 1999.
[CYC+01] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical
Study of Operating Systems Errors. In Proc. of the 18th ACM SOSP,
pages 73–88, Banff, Canada, October 2001.
[DCA+03] R. P. Doyle, J. S. Chase, O. M. Asad, W. Jin, and A. M. Vahdat.
Model-based resource provisioning in a web service utility. In Proc.
of the 4th USENIX Symp. on Internet Technologies and Systems,
Seattle, WA, March 2003.
[Den68] P. J. Denning. The Working Set Model for Program Behavior. Com-
munications of the ACM, 11:323–333, May 1968.
[DP93] P. Druschel and L. L. Peterson. Fbufs: A High-Bandwidth
Cross-Domain Transfer Facility. In Proc. of the 14th ACM SOSP,
Asheville, NC, December 1993.
[DRC03] Z. Dimitrijevic, R. Rangaswami, and E. Chang. Design and Imple-
mentation of Semi-preemptible IO. In Proc. of the 2nd USENIX
Conf. on File and Storage Technologies, pages 145–158, San Fran-
cisco, CA, March 2003.
[ECH+01] D. Engler, D. Chen, S. Hallem, A. Chou, and B. Chelf. Bugs as De-
viant Behavior: A General Approach to Inferring Errors in Systems
Code. In Proc. of the 18th ACM SOSP, pages 57–72, Banff, Canada,
October 2001.
[FC03] K. Fraser and F. Chang. Operating System I/O Speculation: How
Two Invocations Are Faster Than One. In Proc. of the USENIX
Annual Technical Conference, San Antonio, TX, June 2003.
[FP93] K. Fall and J. Pasquale. Exploring In-kernel Data Paths to Improve
I/O Throughput and CPU Availability. In Proc. of the 1993 Winter
Unix Conference, January 1993.
[GH93] A. J. Goldberg and J. L. Hennessy. Mtool: An Integrated System
for Performance Debugging Shared Memory Multiprocessor Applica-
tions. IEEE Transactions on Parallel and Distributed Systems,
4(1):28–40, January 1993.
[GM05] B. S. Gill and D. S. Modha. SARC: Sequential Prefetching in Adap-
tive Replacement Cache. In Proc. of the USENIX Annual Technical
Conference, pages 293–308, Anaheim, CA, April 2005.
[GNU06] GNU Diffutils Project, 2006. http://www.gnu.org/software/diffutils/diffutils.html.
[GP98] G. R. Ganger and Y. N. Patt. Using System-Level Models to Eval-
uate I/O Subsystem Designs. IEEE Transactions on Computers,
47(6):667–678, June 1998.
[Gra] Gradient Descent Algorithm. http://en.wikipedia.org/wiki/Gradient_descent.
[Han99] S. M. Hand. Self-paging in the Nemesis Operating System. In Proc.
of the Third USENIX OSDI, New Orleans, LA, February 1999.
[IBM] IBM Autonomic Computing. http://www.research.ibm.com/autonomic/.
[ID01] S. Iyer and P. Druschel. Anticipatory Scheduling: A Disk Schedul-
ing Framework to Overcome Deceptive Idleness in Synchronous I/O.
In Proc. of the 18th ACM SOSP, pages 117–130, Banff, Canada,
October 2001.
[JADAD06] S. T. Jones, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Ant-
farm: Tracking Processes in a Virtual Machine Environment. In Proc.
of the USENIX Annual Technical Conference, pages 1–14, Boston,
MA, June 2006.
[JCZ05] S. Jiang, F. Chen, and X. Zhang. CLOCK-Pro: an effective improve-
ment of the CLOCK replacement. In Proc. of the USENIX Annual
Technical Conference, pages 323–336, Anaheim, CA, April 2005.
[JS94] T. Johnson and D. Shasha. 2Q: A Low Overhead High Performance
Buffer Management Replacement Algorithm. In Proc. of the 20th
VLDB Conference, pages 439–450, Santiago, Chile, September 1994.
[JZ02] S. Jiang and X. Zhang. LIRS: An Efficient Low Inter-reference Re-
cency Set Replacement Policy to Improve Buffer Cache Performance.
In Proc. of the ACM SIGMETRICS, pages 31–42, Marina Del Rey,
CA, June 2002.
[KCK+00] J. M. Kim, J. Choi, J. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S.
Kim. A Low-Overhead High-Performance Unified Buffer Manage-
ment Scheme that Exploits Sequential and Looping References. In
Proc. of the 4th USENIX OSDI, San Diego, CA, October 2000.
[KEG+97] M. Kaashoek, D. Engler, G. Ganger, H. Briceno, R. Hunt,
D. Mazieres, T. Pinckney, R. Grimm, J. Jannotti, and K. MacKen-
zie. Application Performance and Flexibility on Exokernel Systems.
In Proc. of the 16th ACM SOSP, Saint-Malo, France, October 1997.
[KMC02] S. F. Kaplan, L. A. McGeoch, and M. F. Cole. Adaptive Caching for
Demand Prepaging. In Proc. of the Third Int’l Symp. on Memory
Management, pages 114–126, Berlin, Germany, June 2002.
[KTP+96] T. Kimbrel, A. Tomkins, R. H. Patterson, B. Bershad, P. Cao, E. W.
Felten, G. A. Gibson, A. R. Karlin, and K. Li. A Trace-Driven
Comparison of Algorithms for Parallel Prefetching and Caching. In
Proc. of the Second USENIX OSDI, Seattle, WA, October 1996.
[KTR94] D. Kotz, S. B. Toh, and S. Radhakrishnan. A Detailed Simulation
Model of the HP 97560 Disk Drive. Technical Report PCS-TR94-220,
Dept. of Computer Science, Dartmouth College, July 1994.
[LD97] H. Lei and D. Duchamp. An Analytical Approach to File Prefetching.
In Proc. of the USENIX Annual Technical Conference, Anaheim, CA,
January 1997.
[LJ00] K. Li and S. Jamin. A Measurement-Based Admission-Controlled
Web Server. In Proc. of the IEEE INFOCOM, pages 651–659, Tel-
Aviv, Israel, March 2000.
[LKB] Linux kernel bug tracker. http://bugzilla.kernel.org/.
[LKT] Linux kernel trap. http://kerneltrap.org/node/5411.
[LLMZ04] Z. Li, S. Lu, S. Myagmar, and Y. Zhou. CP-Miner: A Tool for
Finding Copy-paste and Related Bugs in Operating System Code. In
Proc. of the 6th USENIX OSDI, pages 289–302, San Francisco, CA,
December 2004.
[LS07] P. Lu and K. Shen. Multi-Layer Event Trace Analysis for Parallel I/O
Performance Tuning. In Proc. of the 36th International Conference
on Parallel Processing, Xi'an, China, September 2007.
[LSG+00] C. Lumb, J. Schindler, G. R. Ganger, E. Riedel, and D. F. Nagle. Towards
Higher Disk Head Utilization: Extracting “Free” Bandwidth from
Busy Disk Drives. In Proc. of the 4th USENIX OSDI, San Diego,
CA, October 2000.
[LSGN00] C. R. Lumb, J. Schindler, G. R. Ganger, and D. F. Nagle. Towards
Higher Disk Head Utilization: Extracting Free Bandwidth From Busy
Disk Drives. In Proc. of the 4th USENIX OSDI, San Diego, CA,
October 2000.
[LTT] Linux Trace Toolkit. http://www.opersys.com/LTT/.
[LTTng] Linux Trace Toolkit Next Generation. http://ltt.polymtl.ca/.
[MDK96] T. C. Mowry, A. K. Demke, and O. Krieger. Automatic Compiler-
Inserted I/O Prefetching for Out-of-Core Applications. In Proc. of
the Second USENIX OSDI, Seattle, WA, October 1996.
[Met97] R. V. Meter. Observing the effects of multi-zone disks. In Proc. of
the USENIX Annual Technical Conference, Anaheim, CA, January
1997.
[MGA92] M. Martonosi, A. Gupta, and T. Anderson. MemSpy: Analyzing
Memory System Bottlenecks in Programs. In Proc. of the ACM SIG-
METRICS, pages 1–12, Newport, RI, June 1992.
[MGST70] R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation
Techniques for Storage Hierarchies. IBM Systems J., 9(2):78–117,
1970.
[Nut99] G. Nutt. Operating Systems: A Modern Perspective. Addison Wesley,
2nd edition, 1999.
[OOW93] E. J. O’Neil, P. E. O’Neil, and G. Weikum. The LRU-K Page Re-
placement Algorithm for Database Disk Buffering. In Proc. of the
ACM SIGMOD, pages 297–306, Washington, DC, May 1993.
[PDZ99a] V. S. Pai, P. Druschel, and W. Zwaenepoel. Flash: An Efficient
and Portable Web Server. In Proc. of the USENIX Annual Technical
Conference, Monterey, CA, June 1999.
[PDZ99b] V. S. Pai, P. Druschel, and W. Zwaenepoel. IO-Lite: A Unified I/O
Buffering and Caching System. In Proc. of the Third USENIX OSDI,
New Orleans, LA, February 1999.
[PGG+95] R. H. Patterson, G. A. Gibson, E. Ginting, D. Stodolsky, and J. Ze-
lenka. Informed Prefetching and Caching. In Proc. of the 15th ACM
SOSP, pages 79–95, Copper Mountain Resort, CO, December 1995.
[PS03] A. E. Papathanasiou and M. L. Scott. Energy Efficiency Through
Burstiness. In Proc. of the 5th IEEE Workshop on Mobile Computing
Systems and Applications, Monterey, CA, October 2003.
[PS04] A. E. Papathanasiou and M. L. Scott. Energy Efficient Prefetching
and Caching. In Proc. of the USENIX Annual Technical Conference,
Boston, MA, June 2004.
[PS05] A. Papathanasiou and M. Scott. Aggressive Prefetching: An Idea
Whose Time Has Come. In Proc. of the 10th Workshop on Hot Topics
in Operating Systems, Santa Fe, NM, June 2005.
[PW93] S. E. Perl and W. E. Weihl. Performance Assertion Checking. In Proc.
of the 14th ACM SOSP, pages 134–145, Asheville, NC, December
1993.
[RBDH97] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod. Using the
SimOS Machine Simulator to Study Complex Computer Systems.
ACM Trans. on Modeling and Computer Simulation, 7(1):78–103,
January 1997.
[RD90] J. T. Robinson and M. V. Devarakonda. Data Cache Management
Using Frequency-Based Replacement. In Proc. of the ACM SIGMET-
RICS, pages 134–142, Boulder, CO, May 1990.
[RN03] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach.
Prentice Hall, 2nd edition, 2003.
[RO92] M. Rosenblum and J. K. Ousterhout. The Design and Implementa-
tion of A Log-Structured File System. ACM Transactions on Com-
puter Systems, 10(1):100–114, 1992.
[RW94] C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling.
IEEE Computer, 27(3):17–28, March 1994.
[SES] Search Engine Watch. http://searchenginewatch.com/showPage.html?page=2156481.
[SGLG02] J. Schindler, J. L. Griffin, C. R. Lumb, and G. R. Ganger. Track-
aligned Extents: Matching Access Patterns to Disk Drive Character-
istics. In Proc. of the 1st Conference on File and Storage Technolo-
gies, Monterey, CA, January 2002.
[Smi78] A. J. Smith. Sequentiality and Prefetching in Database Systems.
ACM Transactions on Database Systems, 3(3):223–247, September
1978.
[SMW98] E. Shriver, A. Merchant, and J. Wilkes. An Analytical Behavior
Model for Disk Drives with Readahead Caches and Request Reorder-
ing. In Proc. of the ACM SIGMETRICS, pages 182–192, Madison,
WI, June 1998.
[SS96] K. A. Smith and M. Seltzer. A Comparison of FFS Disk Allocation
Policies. In Proc. of the USENIX Annual Technical Conference, San
Diego, CA, January 1996.
[SS05] C. Stewart and K. Shen. Performance modeling and system man-
agement for multi-component online services. In Second USENIX
Symp. on Networked Systems Design and Implementation, pages 71–
84, Boston, MA, May 2005.
[SSS99] E. Shriver, C. Small, and K. Smith. Why Does File System Prefetch-
ing Work? In Proc. of the USENIX Annual Technical Conference,
Monterey, CA, June 1999.
[SWW] The Size of the World Wide Web.
http://www.worldwidewebsize.com/.
[SZL05] K. Shen, M. Zhong, and C. Li. I/O System Performance Debugging
Using Model-driven Anomaly Characterization. In Proc. of the 4th
USENIX FAST, San Francisco, CA, December 2005.
[Tan01] A. Tanenbaum. Modern Operating Systems. Prentice Hall, 2nd edi-
tion, 2001.
[TPCH] TPC-H Benchmark. http://www.tpc.org.
[TPG97] A. Tomkins, R. H. Patterson, and G. A. Gibson. Informed Multi-
Process Prefetching and Caching. In Proc. of the ACM SIGMET-
RICS, pages 100–114, Seattle, WA, June 1997.
[TSW92] D. Thiebaut, H. S. Stone, and J. L. Wolf. Improving Disk Cache Hit-
Ratios Through Cache Partitioning. IEEE Transactions on Comput-
ers, 41(6):665–676, June 1992.
[TYKZ07] L. Tan, D. Yuan, G. Krishna, and Y. Zhou. /* iComment: Bugs
or Bad Comments? */. In Proc. of the 21st ACM SOSP, pages
145–158, Stevenson, WA, October 2007.
[vBCZ+03] R. von Behren, J. Condit, F. Zhou, G. C. Necula, and E. Brewer.
Capriccio: Scalable Threads for Internet Services. In Proc. of the 19th
ACM SOSP, pages 268–281, Bolton Landing, NY, October 2003.
[VG01] T. Voigt and P. Gunningberg. Dealing with Memory-Intensive Web
Requests. Technical Report 2001-010, Uppsala University, May 2001.
[VTFM01] T. Voigt, R. Tewari, D. Freimuth, and A. Mehra. Kernel Mechanisms
for Service Differentiation in Overloaded Web Servers. In Proc. of the
USENIX Annual Technical Conference, Boston, MA, June 2001.
[WC03] M. Welsh and D. Culler. Adaptive Overload Control for Busy Internet
Servers. In Proc. of the 4th USENIX Symp. on Internet Technologies
and Systems, Seattle, WA, March 2003.
[WCB01] M. Welsh, D. Culler, and E. Brewer. SEDA: An Architecture for
Well-Conditioned, Scalable Internet Services. In Proc. of the 18th
ACM SOSP, pages 230–243, Banff, Canada, October 2001.
[WW94] C. Waldspurger and W. Weihl. Lottery Scheduling: Flexible
Proportional-Share Resource Management. In Proc. of the First
USENIX OSDI, Monterey, CA, November 1994.
[WW95] C. Waldspurger and W. Weihl. Stride Scheduling: Deterministic
Proportional-Share Resource Management. Technical Report
MIT/LCS/TM-528, MIT Laboratory for Computer Science, June 1995.
[YD00] K. Yaghmour and M. R. Dagenais. Measuring and Characterizing
System Behavior Using Kernel-Level Event Logging. In Proc. of the
USENIX Annual Technical Conference, San Diego, CA, June 2000.
[YLB01] T. Yeh, D. Long, and S. A. Brandt. Using Program and User In-
formation to Improve File Prediction Performance. In Proc. of the
Int’l Symposium on Performance Analysis of Systems and Software,
pages 111–119, Tucson, AZ, November 2001.
[ZPL01] Y. Zhou, J. F. Philbin, and K. Li. The Multi-Queue Replacement
Algorithm for Second Level Buffer Caches. In Proc. of the USENIX
Annual Technical Conference, Boston, MA, June 2001.