E-528-529, Sector-7, Dwarka, New Delhi-110075
(Nr. Ramphal chowk and Sector 9 metro station) Ph. 011-47350606, (M) 7838010301-04 www.eduproz.in
Educate Anytime...Anywhere...
"Greetings For The Day" About Eduproz
We, at EduProz, started our voyage with a dream of making higher education available for everyone. Since
its inception, EduProz has been working as a stepping-stone for the students coming from varied
backgrounds. The best part is that the classroom sessions for distance learning or correspondence courses in both
management (MBA and BBA) and Information Technology (MCA and BCA) streams are free of cost.
Experienced faculty-members, a state-of-the-art infrastructure and a congenial environment for learning -
are a few of the things that we offer to our students. Our panel of industrial experts, drawn from various
industrial domains, leads students not only to secure good marks in examinations, but also to get an edge over
others in their professional lives. Our study materials are sufficient to keep students abreast of the present
nuances of the industry. In addition, we give importance to regular tests and sessions to evaluate our
students’ progress.
Students can attend regular classes of distance learning MBA, BBA, MCA and BCA courses at EduProz
without paying anything extra. Our centrally air-conditioned classrooms, well-maintained library and well-
equipped laboratory facilities provide a comfortable environment for learning.
Honing specific skills is essential to succeed in an interview. Keeping this in mind, EduProz has a career
counselling and career development cell where we help students prepare for interviews. Our dedicated
placement cell has been helping students land their dream jobs on completion of the course.
EduProz is strategically located in Dwarka, West Delhi (walking distance from Dwarka Sector 9 Metro
Station and a four-minute drive from the national highway); students can easily come to our centre from
anywhere in Delhi and neighbouring Gurgaon, Haryana, and avail of a quality-oriented education facility at
no extra cost.
Why Choose Edu Proz for distance learning?
• Edu Proz provides classroom facilities free of cost.
• At EduProz, classroom teaching is conducted by experienced faculty.
• Classrooms are spacious and fully air-conditioned, ensuring a comfortable ambience.
• Course fees are not overly expensive.
• Placement assistance and student counselling facilities are available.
• Unlike several other distance-learning providers, Edu Proz strives to help and motivate pupils to get
high grades, thus ensuring that they are well placed in life.
• Students are groomed and prepared to face interview boards.
• Mock tests, unit tests and examinations are held to evaluate progress.
• Special care is taken with personality development.
"HAVE A GOOD DAY"
Karnataka State Open University
(KSOU) was established on 1st June 1996 with the assent of H.E. the Governor of Karnataka as a full-fledged University in the academic year 1996, vide Government notification No. EDI/UOV dated 12th February 1996 (Karnataka State Open University Act, 1992). The Act was promulgated with the object of incorporating an Open University at the State level for the introduction and promotion of Open University and Distance Education systems in the education pattern of the State and the country, and for the co-ordination and determination of the standards of such systems. Keeping in view the educational needs of our country in general, and the State in particular, the policies and programmes have been geared to cater to the needy.

Karnataka State Open University is a UGC-recognised University of the Distance Education Council (DEC), New Delhi, a regular member of the Association of Indian Universities (AIU), Delhi, a permanent member of the Association of Commonwealth Universities (ACU), London, UK, and of the Asian Association of Open Universities (AAOU), Beijing, China, and also has an association with the Commonwealth of Learning (COL).

Karnataka State Open University is situated at the north-western end of the Manasagangotri campus, Mysore. The campus, which is about 5 km from the city centre, has a serene atmosphere ideally suited for academic pursuits. The University at present houses the Administrative Office, Academic Block, Lecture Halls, a well-equipped Library, Guest House Cottages, a modest Canteen, a Girls' Hostel and a few cottages providing limited accommodation to students coming to Mysore for attending the Contact Programmes or Term-end examinations.
Unit 1: Overview of Operating Systems:
This unit covers the introduction to and evolution of operating systems. It also covers OS components and
their services.
Introduction to Operating Systems
Programs, Code files, Processes and Threads
• A sequence of instructions telling the computer what to do is called a program.
The user normally uses a text editor to write their program in a high level
language, such as Pascal, C, Java, etc. Alternatively, they may write it in
assembly language. Assembly language is a computer language whose statements
have an almost one-to-one correspondence to the instructions understood by the
CPU of the computer. It provides a way of specifying in precise detail what
machine code the assembler should create.
A compiler is used to translate a high level language program into assembly
language or machine code, and an assembler is used to translate an assembly
language program into machine code. A linker is used to combine relocatable
object files (code files corresponding to incomplete portions of a program) into
executable code files (complete code files, for which the addresses have been
resolved for all global functions and variables).
The text for a program written in a high level language or assembly language is
normally saved in a source file on disk. Machine code for a program is normally
saved in a code file on disk. The machine code is loaded into the virtual memory
for a process, when the process attempts to execute the program.
The notion of a program is becoming more complex nowadays, because of
shared libraries. In the old days, the user code for a process was all in one file.
However, with GUI libraries becoming so large, this is no longer possible.
Library code is now stored in memory that is shared by all processes that use it.
Perhaps it is best to use the term program for the machine code stored in or
derived from a single code file.
Code files contain more than just machine code. On UNIX, a code file starts with
a header, containing information on the position and size of the code (“text”),
initialised data, and uninitialised data segments of the code file. The header also
contains other information, such as the initial value to give the program counter
(the “entry point”) and global pointer register. The data for the code and
initialised data segments then follows.
As well as the above information, code files can contain a symbol table – a table
indicating the names of all functions and global variables, and the virtual
addresses they correspond to. The symbol table is used by the linker, when it
combines several relocatable object files into a single executable code file, to
resolve references to functions in shared libraries. The symbol table is also used
for debugging. The structure of UNIX code files on the Alpha is very complex,
due to the use of shared libraries.
• When a user types in the name of a command in the UNIX shell, this results in the
creation of what is called a process. On any large computer, especially one with
more than one person using it at the same time, there are normally many
processes executing at any given time. Under UNIX, every time a user types in a
command, they create a separate process. If several users execute the same
command, then each one creates a different process. The Macintosh is a little
different from UNIX. If the user double clicks on several data files for an
application, only one process is created, and this process manages all the data
files.
A process consists of the virtual memory, information on open files, and other
operating system resources shared by its threads of execution, all of which execute
in the same virtual memory.
The threads in a process execute not only the code from a user program. They can
also execute the shared library code, operating system kernel code, and (on the
Alpha) what is called PALcode.
A process is created to execute a command. The code file for the command is
used to initialise the virtual memory containing the user code and global
variables. The user stack for the initial thread is cleared, and the parameters to the
command are passed as parameters to the main function of the program. Files are
opened corresponding to the standard input and output (keyboard and screen,
unless file redirection is used).
When a process is created, it is created with a single thread of execution.
Conventional processes never have more than a single thread of execution, but
multi-threaded processes are now becoming commonplace. We often speak about
a program executing, or a process executing a program, when we really mean a
thread within the process executes the program.
In UNIX, a new process executing a new program is created by the fork() system
call (which creates an almost identical copy of an existing process, executing the
same program), followed by the exec() system call (which replaces the program
being executed by the new program).
In the Java programming language, a new process executing a new program is
created by the exec() method in the Runtime class. The Java exec() is probably
implemented as a combination of the UNIX fork() and exec() system calls.
• A thread is an instance of execution (the entity that executes). All the threads that
make up a process share access to the same user program, virtual memory, open
files, and other operating system resources. Each thread has its own program
counter, general purpose registers, and user and kernel stack. The program
counter and general purpose registers for a thread are stored in the CPU when the
thread is executing, and saved away in memory when it is not executing.
The Java programming language supports the creation of multiple threads. To
create a thread in Java, we create an object that implements the Runnable
interface (has a run() method), and use this to create a new Thread object. To
initiate the execution of the thread, we invoke the start() method of the thread,
which invokes the run() method of the Runnable object. The threads that make up
a process need to use some kind of synchronisation mechanism to avoid more
than one thread accessing shared data at the same time. In Java, synchronisation is
done by synchronized methods. The wait(), notify(), and notifyAll() methods in
the Object class are used to allow a thread to wait until the data has been updated
by another thread, and to notify other threads when the data has been altered.
In UNIX C, the pthreads library contains functions to create new threads, and to
provide the equivalent of synchronized methods, wait(), notify(), etc. The Java
mechanism is in fact based on the pthreads library. In Java, synchronisation is
built into the design of the language (the compiler knows about synchronised
methods). In C, there is no syntax to specify that a function (method) is
synchronised, and the programmer has to explicitly put in code at the start and
end of the method to gain and relinquish exclusive access to a data structure.
Some people call threads lightweight processes, and processes heavyweight
processes. Some people call processes tasks.
Many application programs, such as Microsoft Word, are starting to make use of
multiple threads. For example, there is a thread that processes the input, and a
thread for doing repagination in the background. A compiler could have multiple
threads, one for lexical analysis, one for parsing, one for analysing the abstract
syntax tree. These can all execute in parallel, although the parser cannot execute
ahead of the lexical analyser, and the abstract syntax tree analyser can only
process the portion of the abstract syntax tree already generated by the parser. The
code for performing graphics can easily be sped up by having multiple threads,
each painting a portion of the screen. File and network servers have to deal with
multiple external requests, many of which block before the reply is given. An
elegant way of programming servers is to have a thread for each request.
Multi-threaded processes are becoming very important, because computers with
multiple processors are becoming commonplace, as are distributed systems, and servers.
It is important that you learn how to program in this manner. Multi-threaded
programming, particularly dealing with synchronisation issues, is not trivial, and a good
conceptual understanding of synchronisation is essential. Synchronisation is dealt with
fully in the stage 3 operating systems paper.
Objectives
An operating system can be thought of as having three objectives:
Convenience: An operating system makes a computer more convenient to use.
Efficiency: An operating system allows the computer system resources to be used in an
efficient manner.
Ability to evolve: An operating system should be constructed in such a way as to permit
the effective development, testing and introduction of new system functions without
interfering with current services provided.
What is an Operating System?
An operating system (OS) is a program that controls the execution of an application
program and acts as an interface between the user and computer hardware. The purpose
of an OS is to provide an environment in which a user can execute programs in a
convenient and efficient manner.
The operating system must provide certain services to programs and to the users of those
programs in order to make the programming task easier; these services differ from
one OS to another.
Functions of an Operating System
Modern operating systems generally have the following three major goals. Operating
systems generally accomplish these goals by running processes in a low-privilege state and
providing service calls that invoke the operating system kernel in a high-privilege state.
To hide details of hardware
An abstraction is software that hides lower-level details and provides a set of higher-level
functions. An operating system transforms the physical world of devices, instructions,
memory, and time into a virtual world that is the result of abstractions built by the
operating system. There are several reasons for abstraction.
First, the code needed to control peripheral devices is not standardized. Operating
systems provide subroutines called device drivers that perform operations on behalf of
programs, for example input/output operations.
Second, the operating system introduces new functions as it abstracts the hardware. For
instance, the operating system introduces the file abstraction so that programs do not have to
deal with disks.
Third, the operating system transforms the computer hardware into multiple virtual
computers, each belonging to a different program. Each program that is running is called
a process. Each process views the hardware through the lens of abstraction.
Fourth, the operating system can enforce security through abstraction.
Resources Management
An operating system, as a resource manager, controls how processes (the active agents)
may access resources (passive entities). One can view operating systems from two
points of view: resource manager and extended machine. From the resource-manager
point of view, operating systems manage the different parts of the system efficiently, and
from the extended-machine point of view, operating systems provide a virtual machine to
users that is more convenient to use. Structurally, operating systems can be designed as
a monolithic system, a hierarchy of layers, a virtual machine system, a micro-kernel, or
using the client-server model. The basic concepts of operating systems are processes,
memory management, I/O management, the file system, and security.
Provide an effective user interface
The user interacts with the operating system through the user interface and is usually
interested in the look and feel of the operating system. The most important components
of the user interface are the command interpreter, the file system, on-line help, and
application integration. The recent trend has been toward increasingly integrated
graphical user interfaces that encompass the activities of multiple processes on networks
of computers.
Evolution of Operating System
Operating system and computer architecture have had a great deal of influence on each
other. To facilitate the use of the hardware, OSs were developed. As operating systems
were designed and used, it became obvious that changes in the design of the hardware
could simplify them.
Early Systems
In the earliest days of electronic digital computing, everything was done on the bare
hardware. Very few computers existed and those that did exist were experimental in
nature. The researchers who were making the first computers were also the programmers
and the users. They worked directly on the “bare hardware”. There was no operating
system. The experimenters wrote their programs in assembly language and a running
program had complete control of the entire computer. Debugging consisted of a
combination of fixing both the software and hardware, rewriting the object code and
changing the actual computer itself.
The lack of any operating system meant that only one person could use a computer at a
time. Even in the research lab, there were many researchers competing for limited
computing time. The first solution was a reservation system, with researchers signing up
for specific time slots.
The high cost of early computers meant that it was essential that the rare computers be
used as efficiently as possible. The reservation system was not particularly efficient. If a
researcher finished work early, the computer sat idle until the next time slot. If the
researcher’s time ran out, the researcher might have to pack up his or her work in an
incomplete state at an awkward moment to make room for the next researcher. Even
when things were going well, a lot of the time the computer actually sat idle while the
researcher studied the results (or studied memory of a crashed program to figure out what
went wrong).
The solution to this problem was to have programmers prepare their work off-line on
some input medium (often on punched cards, paper tape, or magnetic tape) and then hand
the work to a computer operator. The computer operator would load up jobs in the order
received (with priority overrides based on politics and other factors). Each job still ran
one at a time with complete control of the computer, but as soon as a job finished, the
operator would transfer the results to some output medium (punched tape, paper tape,
magnetic tape, or printed paper) and deliver the results to the appropriate programmer. If
the program ran to completion, the result would be some end data. If the program
crashed, memory would be transferred to some output medium for the programmer to
study (because some of the early business computing systems used magnetic core
memory, these became known as “core dumps”).
Soon after the first successes with digital computer experiments, computers moved out of
the lab and into practical use. The first practical application of these experimental digital
computers was the generation of artillery tables for the British and American armies.
Much of the early research in computers was paid for by the British and American
militaries. Business and scientific applications followed.
As computer use increased, programmers noticed that they were duplicating the same
efforts.
Every programmer was writing his or her own routines for I/O, such as reading input
from a magnetic tape or writing output to a line printer. It made sense to write a common
device driver for each input or output device and then have every programmer share the
same device drivers rather than each programmer writing his or her own. Some
programmers resisted the use of common device drivers in the belief that they could write
“more efficient”, faster, or “better” device drivers of their own.
Additionally each programmer was writing his or her own routines for fairly common
and repeated functionality, such as mathematics or string functions. Again, it made sense
to share the work instead of everyone repeatedly “reinventing the wheel”. These shared
functions would be organized into libraries and could be inserted into programs as
needed. In the spirit of cooperation among early researchers, these library functions were
published and distributed for free, an early example of the power of the open source
approach to software development.
Simple Batch Systems
When punched cards were used for user jobs, processing of a job involved physical
actions by the system operator, e.g., loading a deck of cards into the card reader, pressing
switches on the computer’s console to initiate a job, etc. These actions wasted a lot of
central processing unit (CPU) time.
Figure 1.1: Simple Batch System (the batch monitor / operating system resident in one
part of memory; the remainder is the user program area)
To speed up processing, jobs with similar needs were batched together and were run as a
group. Batch processing (BP) was implemented by locating a component of the BP
system, called the batch monitor or supervisor, permanently in one part of computer’s
memory. The remaining memory was used to process a user job – the current job in the
batch as shown in the figure 1.1 above.
The delay between job submission and completion was considerable in batch-processed
systems, as a number of programs were put in a batch and the entire batch had to be
processed before the results were printed. Further, card reading and printing were slow, as
they used slower mechanical units compared with the CPU, which was electronic. The speed
mismatch was of the order of 1000. To alleviate this problem, programs were spooled.
Spool is an acronym for simultaneous peripheral operation on-line. In essence, the idea
was to use a cheaper processor, known as a peripheral processing unit (PPU), to read
programs and data from cards and store them on a disk. The faster CPU read programs/data
from the disk, processed them, and wrote the results back on the disk. The cheaper
processor then read the results from the disk and printed them.
Multi Programmed Batch Systems
Even though disks are faster than card readers and printers, they are still two orders of
magnitude slower than the CPU. It is thus useful to have several programs ready to run
waiting in main memory. When one program needs input/output (I/O) from
disk, it is suspended and another program, whose data is already in main memory (as
shown in figure 1.2 below), is taken up for execution. This is called
multiprogramming.
Figure 1.2: Multi Programmed Batch Systems (memory holds the operating system
together with several ready programs, Program 1 through Program 4)
Multiprogramming (MP) increases CPU utilization by organizing jobs such that the CPU
always has a job to execute. Multiprogramming is the first instance where the operating
system must make decisions for the user.
The MP arrangement ensures concurrent operation of the CPU and the I/O subsystem. It
ensures that the CPU is allocated to a program only when it is not performing an I/O
operation.
Time Sharing Systems
Multiprogramming features were superimposed on BP to ensure good utilization of CPU
but from the point of view of a user the service was poor as the response time, i.e., the
time elapsed between submitting a job and getting the results was unacceptably high.
Development of interactive terminals changed the scenario. Computation became an on-
line activity. A user could provide inputs to a computation from a terminal and could also
examine the output of the computation on the same terminal. Hence, the response time
needed to be drastically reduced. This was achieved by storing the programs of several users
in memory and providing each user a slice of time on the CPU to process his/her program.
Distributed Systems
A recent trend in computer systems is to distribute computation among several processors.
In loosely coupled systems, the processors do not share memory or a clock. Instead,
each processor has its own local memory. The processors communicate with one another
using a communication network.
The processors in a distributed system may vary in size and function, and are referred to by a
number of different names, such as sites, nodes, computers and so on, depending on the
context. The major reasons for building distributed systems are:
Resource sharing: If a number of different sites are connected to one another, then a
user at one site may be able to use the resources available at the other.
Computation speed up: If a particular computation can be partitioned into a number of
sub computations that can run concurrently, then a distributed system may allow a user to
distribute computation among the various sites to run them concurrently.
Reliability: If one site fails in a distributed system, the remaining sites can potentially
continue operations.
Communication: There are many instances in which programs need to exchange data
with one another. A distributed database system is an example of this.
Real-time Operating System
The advent of timesharing provided good response times to computer users. However,
timesharing could not satisfy the requirements of some applications. Real-time (RT)
operating systems were developed to meet the response requirements of such
applications.
There are two flavors of real-time systems. A hard real-time system guarantees that
critical tasks complete at a specified time. A less restrictive type of real-time system is the
soft real-time system, where a critical real-time task gets priority over other tasks, and
retains that priority until it completes. The areas in which this type is useful include
multimedia, virtual reality, and advanced scientific projects such as undersea exploration
and planetary rovers. Because of the expanded uses for soft real-time functionality, it is
finding its way into most current operating systems, including major versions of Unix and
Windows NT.
A real-time operating system is one which helps to fulfil the worst-case response-time
requirements of an application. An RT OS provides the following facilities for this
purpose:
1. Multitasking within an application.
2. Ability to define the priorities of tasks.
3. Priority driven or deadline oriented scheduling.
4. Programmer defined interrupts.
A task is a sub-computation in an application program, which can be executed
concurrently with other sub-computations in the program, except at specific places in its
execution called synchronization points. Multi-tasking, which permits the existence of
many tasks within the application program, provides the possibility of overlapping the
CPU and I/O activities of the application with one another. This helps in reducing its
elapsed time. The ability to specify priorities for the tasks provides additional controls to
a designer while structuring an application to meet its response-time requirements.
Real time operating systems (RTOS) are specifically designed to respond to events that
happen in real time. This can include computer systems that run factory floors, computer
systems for emergency room or intensive care unit equipment (or even the entire ICU),
computer systems for air traffic control, or embedded systems. RTOSs are grouped
according to the response time that is acceptable (seconds, milliseconds, microseconds)
and according to whether or not they involve systems where failure can result in loss of
life. Examples of real-time operating systems include QNX, Jaluna-1, ChorusOS,
LynxOS, Windows CE .NET, and VxWorks AE.
Self assessment questions
1. What do the terms program, process, and thread mean?
2. What is the purpose of a compiler, assembler and linker?
3. What is the structure of a code file? What is the purpose of the symbol table in a
code file?
4. Why are shared libraries essential on modern computers?
Operating System Components
Even though not all systems have the same structure, many modern operating systems
share the goal of supporting the following types of system components.
Process Management
The operating system manages many kinds of activities ranging from user programs to
system programs like printer spooler, name servers, file server etc. Each of these
activities is encapsulated in a process. A process includes the complete execution context
(code, data, PC, registers, OS resources in use etc.).
It is important to note that a process is not a program. A process is only one instance of a
program in execution. Many processes can be running the same program. The
five major activities of an operating system in regard to process management are:
1. Creation and deletion of user and system processes.
2. Suspension and resumption of processes.
3. A mechanism for process synchronization.
4. A mechanism for process communication.
5. A mechanism for deadlock handling.
Main-Memory Management
Primary memory, or main memory, is a large array of words or bytes. Each word or byte
has its own address. Main memory provides storage that can be accessed directly by the
CPU. That is to say, for a program to be executed, it must be in main memory.
The major activities of an operating system in regard to memory management are:
1. Keep track of which parts of memory are currently being used and by whom.
2. Decide which processes are loaded into memory when memory space becomes
available.
3. Allocate and de-allocate memory space as needed.
File Management
A file is a collection of related information defined by its creator. Computers can store
files on disk (secondary storage), which provides long-term storage. Some examples
of storage media are magnetic tape, magnetic disk and optical disk. Each of these media
has its own properties, like speed, capacity, data transfer rate and access method.
A file system is normally organized into directories to ease use. These directories may
contain files and other directories.
The five major activities of an operating system in regard to file management are:
1. The creation and deletion of files.
2. The creation and deletion of directories.
3. The support of primitives for manipulating files and directories.
4. The mapping of files onto secondary storage.
5. The backup of files on stable storage media.
I/O System Management
The I/O subsystem hides the peculiarities of specific hardware devices from the user. Only the
device driver knows the peculiarities of the specific device to which it is assigned.
Secondary-Storage Management
Generally speaking, systems have several levels of storage, including primary storage,
secondary storage and cache storage. Instructions and data must be placed in primary
storage or cache to be referenced by a running program. Because main memory is too
small to accommodate all data and programs, and its data are lost when power is lost, the
computer system must provide secondary storage to back up main memory. Secondary
storage consists of tapes, disks, and other media designed to hold information that will
eventually be accessed in primary storage. Storage at each level (primary, secondary, cache)
is ordinarily divided into bytes, or words consisting of a fixed number of bytes. Each location in
storage has an address; the set of all addresses available to a program is called an address
space.
The three major activities of an operating system in regard to secondary storage
management are:
1. Managing the free space available on the secondary-storage device.
2. Allocation of storage space when new files have to be written.
3. Scheduling the requests for disk access.
Networking
A distributed system is a collection of processors that do not share memory, peripheral
devices, or a clock. The processors communicate with one another through
communication lines called a network. The communication-network design must consider
routing and connection strategies, and the problems of contention and security.
Protection System
If a computer system has multiple users and allows the concurrent execution of multiple
processes, then various processes must be protected from one another’s activities.
Protection refers to a mechanism for controlling the access of programs, processes, or users
to the resources defined by a computer system.
Command Interpreter System
A command interpreter is an interface of the operating system with the user. The user
gives commands which are executed by the operating system (usually by turning them into
system calls). The main function of a command interpreter is to get and execute the next
user-specified command. The command interpreter is usually not part of the kernel, since
multiple command interpreters (shell, in UNIX terminology) may be supported by an
operating system, and they do not really need to run in kernel mode. There are two main
advantages of separating the command interpreter from the kernel.
1. If we want to change the way the command interpreter looks, i.e., change the
interface of the command interpreter, we can do so if the command interpreter is
separate from the kernel. We cannot change the code of the kernel, so we could
not modify the interface otherwise.
2. If the command interpreter is a part of the kernel, it is possible for a malicious
process to gain access to certain parts of the kernel that it should not have. To
avoid this scenario it is advantageous to have the command interpreter separate
from kernel.
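The get-and-execute loop described above can be sketched entirely in user space, which is exactly why the interpreter need not live in the kernel. In this toy Python version, the echo built-in and the use of the subprocess module (which wraps the operating system's process-creation services) are illustrative choices, not how a production shell is written:

```python
import shlex
import subprocess

def interpret(command_line):
    """Get one user-specified command and execute it."""
    args = shlex.split(command_line)
    if not args:
        return ""
    if args[0] == "echo":
        # A built-in: handled by the interpreter itself.
        return " ".join(args[1:])
    # Anything else is turned into a request to the operating system.
    result = subprocess.run(args, capture_output=True, text=True)
    return result.stdout.strip()
```

A real shell would loop, reading one line at a time and dispatching each through a function like this.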
Self Assessment Questions
1. Discuss the various components of an OS.
2. Explain memory management and file management in brief.
3. Write a note on:
1. Secondary-Storage Management
2. Command Interpreter System
Operating System Services
Following are the five services provided by operating systems for the convenience of the
users.
Program Execution
The purpose of a computer system is to allow the user to execute programs. So the
operating system provides an environment where the user can conveniently run programs.
The user does not have to worry about memory allocation or multitasking; these
things are taken care of by the operating system.
Running a program involves allocating and de-allocating memory, and CPU scheduling
in the case of multiple processes. These functions cannot be given to user-level
programs, so user-level programs cannot help the user to run programs independently
without help from the operating system.
I/O Operations
Each program requires input and produces output. This involves the use of I/O. The
operating system hides from the user the details of the underlying hardware for the I/O.
All the user sees is that the I/O has been performed, without any of the details. So the
operating system, by providing I/O, makes it convenient for the users to run programs.
For efficiency and protection, users cannot control I/O directly, so this service cannot be
provided by user-level programs.
File System Manipulation
The output of a program may need to be written into new files or input taken from some
files. The operating system provides this service. The user does not have to worry about
secondary storage management. User gives a command for reading or writing to a file
and sees his/her task accomplished. Thus operating system makes it easier for user
programs to accomplish their task.
This service involves secondary storage management. The speed of I/O that depends on
secondary storage management is critical to the speed of many programs, and hence it is
best left to the operating system to manage rather than giving individual users control of
it. It is not difficult for user-level programs to provide these services, but for the
above-mentioned reasons it is best if this service is left with the operating system.
Communications
There are instances where processes need to communicate with each other to exchange
information. It may be between processes running on the same computer or running on
the different computers. By providing this service the operating system relieves the user
from the worry of passing messages between processes. In cases where messages need
to be passed to processes on other computers through a network, this can be done by
user programs. The user program may be customized to the specifications of the
hardware through which the message transits, and provides the service interface to the
operating system.
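On a POSIX system, the simplest form of this service is a pipe between two processes on the same computer. The sketch below (the message text is arbitrary) relies on the kernel to carry the bytes; the two processes never share memory:

```python
import os

# Two processes exchange a message through a kernel-provided pipe
# (POSIX-only sketch).
read_fd, write_fd = os.pipe()

pid = os.fork()
if pid == 0:
    # Child: close the unused end, send, and exit.
    os.close(read_fd)
    os.write(write_fd, b"hello from child")
    os._exit(0)
else:
    # Parent: close the unused end, receive, and reap the child.
    os.close(write_fd)
    message = os.read(read_fd, 1024)
    os.close(read_fd)
    os.wait()
```

The pipe and the process-creation call are both services the operating system provides; the user program only names what it wants done.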
Error Detection
An error in one part of the system may cause malfunctioning of the complete system. To
avoid such a situation the operating system constantly monitors the system for detecting
the errors. This relieves the user from the worry of errors propagating to various parts of
the system and causing malfunctioning.
This service cannot be allowed to be handled by user programs because it involves
monitoring, and in some cases altering, areas of memory, de-allocating memory of a
faulty process, or perhaps relinquishing the CPU from a process that goes into an infinite loop.
These tasks are too critical to be handed over to the user programs. A user program if
given these privileges can interfere with the correct (normal) operation of the operating
systems.
Self Assessment Questions
1. Explain the five services provided by the operating system.
Operating Systems for Different Computers
Operating systems can be grouped according to functionality: operating systems for
Supercomputers, Computer Clusters, Mainframes, Servers, Workstations, Desktops,
Handheld Devices, Real Time Systems, or Embedded Systems.
OS for Supercomputers:
Supercomputers are the fastest computers, very expensive and are employed for
specialized applications that require immense amounts of mathematical calculations, for
example, weather forecasting, animated graphics, fluid dynamic calculations, nuclear
energy research, and petroleum exploration. Out of many operating systems used for
supercomputing UNIX and Linux are the most dominant ones.
Computer Clusters Operating Systems:
A computer cluster is a group of computers that work together closely so that in many
respects they can be viewed as though they are a single computer. The components of a
cluster are commonly connected to each other through fast local area networks. Besides
many open-source operating systems and two versions of Windows 2003 Server, Linux
is popularly used for computer clusters.
Mainframe Operating Systems:
Mainframes used to be the primary form of computer. Mainframes are large centralized
computers and at one time they provided the bulk of business computing through time
sharing. Mainframes are still useful for some large scale tasks, such as centralized billing
systems, inventory systems, database operations, etc.
Minicomputers were smaller, less expensive versions of mainframes for businesses that
couldn’t afford true mainframes. The chief difference between a supercomputer and a
mainframe is that a supercomputer channels all its power into executing a few programs
as fast as possible, whereas a mainframe uses its power to execute many programs
concurrently. Besides various versions of operating systems by IBM for its early
System/360, to newest Z series operating system z/OS, Unix and Linux are also used as
mainframe operating systems.
Servers Operating Systems:
Servers are computers or groups of computers that provide services to other computers
connected via a network. Based on the requirements, there are various versions of server
operating systems from different vendors, starting with Microsoft’s Servers from
Windows NT to Windows 2003, OS/2 servers, UNIX servers, Mac OS servers, and
various flavors of Linux.
Workstation Operating Systems:
Workstations are more powerful versions of personal computers. Like desktop
computers, workstations are often used by only one person, but they run a more
powerful version of a desktop operating system. Most of the time, workstations are used
as clients in a network environment. The popular workstation operating systems are
Windows NT Workstation, Windows 2000 Professional, OS/2 Clients, Mac OS, UNIX, Linux, etc.
Desktop Operating Systems:
A personal computer (PC) is a microcomputer whose price, size, and capabilities make it
useful for individuals; such machines are also known as desktop computers or home computers.
Desktop operating systems are used for personal computers, for example DOS, Windows
9x, Windows XP, Macintosh OS, Linux, etc.
Embedded Operating Systems:
Embedded systems are combinations of processors and special software that are inside of
another device, such as the electronic ignition system on cars. Examples of embedded
operating systems are Embedded Linux, Windows CE, Windows XP Embedded, Free
DOS, Free RTOS, etc.
Operating Systems for Handheld Computers:
Handheld operating systems are much smaller and less capable than desktop operating
systems, so that they can fit into the limited memory of handheld devices. These
operating systems include Palm OS, Windows CE, EPOC, and many Linux versions
such as Qt Palmtop, Pocket Linux, etc.
Summary
An operating system (OS) is a program that controls the execution of an application
program and acts as an interface between the user and computer hardware. The objectives
of an operating system are convenience, efficiency, and the ability to evolve. Besides this,
the operating system performs functions such as hiding details of the hardware, resource
management, and providing an effective user interface.
The process management component of the operating system is responsible for the
creation, termination, and state transitions of processes. The memory management unit
is mainly responsible for allocation and de-allocation of memory to processes, and for
keeping track of memory usage by different processes. The operating system services
are program execution, I/O operations, file system manipulation, communication and
error detection.
Terminal Questions
1. What is an operating system?
2. What are the objectives of an operating system?
3. Describe in brief, the function of an operating system.
4. Explain the evolution of operating system in brief.
5. Write a note on Batch OS. Discuss how it differs from Multi-Programmed Batch
Systems.
6. What is the difference between multi-programming and time-sharing operating
systems?
7. What are the typical features that an operating system provides?
8. Explain the functions of operating system as file manager.
9. What are different services provided by an operating system?
10. Write a note on:
1. Mainframe Operating Systems
2. Embedded Operating Systems
3. Servers Operating Systems
4. Desktop Operating Systems
Unit 2: Operating System Architecture
This unit deals with the simple structure, the extended machine, and layered approaches.
It covers the different methodologies for OS design (models). It introduces virtual
machines, virtual environments and machine aggregation, and also describes the
implementation techniques.
Introduction
A system as large and complex as a modern operating system must be engineered
carefully if it is to function properly and be modified easily. A common approach is to
partition the task into small components rather than have one monolithic system. Each of
these modules should be a well-defined portion of the system, with carefully defined
inputs, outputs, and functions. In this unit, we discuss how various components of an
operating system are interconnected and melded into a kernel.
Objective:
At the end of this unit, readers would be able to understand:
• What is Kernel? Monolithic Kernel Architecture
• Layered Architecture
• Microkernel Architecture
• Operating System Components
• Operating System Services
OS as an Extended Machine
We can think of an operating system as an Extended Machine standing between our
programs and the bare hardware.
As shown in figure 2.1 above, the operating system interacts with the hardware, hiding it
from the application program and the user. Thus it acts as an interface between user
programs and hardware.
Self Assessment Questions
1. What is the role of an Operating System?
Simple Structure
Many commercial systems do not have well-defined structures. Frequently, such
operating systems started as small, simple, and limited systems and then grew beyond
their original scope. MS-DOS is an example of such a system. It was originally designed
and implemented by a few people who had no idea that it would become so popular. It
was written to provide the most functionality in the least space, so it was not divided into
modules carefully. In MS-DOS, the interfaces and levels of functionality are not well
separated. For instance, application programs are able to access the basic I/O routines to
write directly to the display and disk drives. Such freedom leaves MS-DOS vulnerable to
errant (or malicious) programs, causing entire system crashes when user programs fail.
Of course, MS-DOS was also limited by the hardware of its era. Because the Intel 8088
for which it was written provides no dual mode and no hardware protection, the designers
of MS-DOS had no choice but to leave the base hardware accessible.
Another example of limited structuring is the original UNIX operating system. UNIX is
another system that initially was limited by hardware functionality. It consists of two
separable parts:
• the kernel and
• the system programs
The kernel is further separated into a series of interfaces and device drivers, which have
been added and expanded over the years as UNIX has evolved. We can view the
traditional UNIX operating system as being layered. Everything below the system call
interface and above the physical hardware is the kernel. The kernel provides the file
system, CPU scheduling, memory management, and other operating-system functions
through system calls. Taken in sum, that is an enormous amount of functionality to be
combined into one level. This monolithic structure was difficult to implement and
maintain.
Self Assessment Questions
1. ”In MS-DOS, the interfaces and levels of functionality are not well separated”.
Comment on this.
2. What are the components of a Unix Operating System?
Layered Approach
With proper hardware support, operating systems can be broken into pieces that are
smaller and more appropriate than those allowed by the original MS-DOS or UNIX
systems. The operating system can then retain much greater control over the computer
and over the applications that make use of that computer. Implementers have more
freedom in changing the inner workings of the system and in creating modular operating
systems. Under the top-down approach, the overall functionality and features are
determined and the separated into components. Information hiding is also important,
because it leaves programmers free to implement the low-level routines as they see fit,
provided that the external interface of the routine stays unchanged and that the routine
itself performs the advertised task.
A system can be made modular in many ways. One method is the layered approach, in
which the operating system is broken up into a number of layers (levels). The bottom
layer (layer 0) is the hardware; the highest (layer N) is the user interface.
Users
File Systems
Inter-process Communication
I/O and Device Management
Virtual Memory
Primitive Process Management
Hardware
Fig. 2.2: Layered Architecture
An operating-system layer is an implementation of an abstract object made up of data and
the operations that can manipulate those data. A typical operating-system layer (say,
layer M) consists of data structures and a set of routines that can be invoked by
higher-level layers. Layer M, in turn, can invoke operations on lower-level layers.
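A toy sketch of this discipline: each layer below is a class whose routines call only the layer beneath it, mirroring Fig. 2.2 (the three layers shown, the block-naming scheme, and the name-to-sector mapping are all invented for illustration):

```python
class Hardware:                     # layer 0
    def read_block(self, n):
        return f"raw-block-{n}"

class DeviceLayer:                  # layer 1: invokes only Hardware
    def __init__(self, hw):
        self._hw = hw
    def read_sector(self, n):
        return self._hw.read_block(n)

class FileSystemLayer:              # layer 2: invokes only DeviceLayer
    def __init__(self, dev):
        self._dev = dev
    def read_file(self, name):
        sector = hash(name) % 8     # a made-up name-to-sector mapping
        return self._dev.read_sector(sector)

fs = FileSystemLayer(DeviceLayer(Hardware()))
data = fs.read_file("notes.txt")    # the call descends layer by layer
```

Note that FileSystemLayer never touches Hardware directly; it needs to know only what DeviceLayer's operations do, not how they are implemented.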
The main advantage of the layered approach is simplicity of construction and debugging.
The layers are selected so that each uses functions (operations) and services of only
lower-level layers. This approach simplifies debugging and system verification. The first
layer can be debugged without any concern for the rest of the system, because, by
definition, it uses only the basic hardware (which is assumed correct) to implement its
functions. Once the first layer is debugged, its correct functioning can be assumed while
the second layer is debugged, and so on. If an error is found during debugging of a
particular layer, the error must be on that layer, because the layers below it are already
debugged. Thus, the design and implementation of the system is simplified.
Each layer is implemented with only those operations provided by lower-level layers. A
layer does not need to know how these operations are implemented; it needs to know
only what these operations do. Hence, each layer hides the existence of certain data
structures, operations, and hardware from higher-level layers. The major difficulty with
the layered approach involves appropriately defining the various layers. Because a layer
can use only lower-level layers, careful planning is necessary. For example, the device
driver for the backing store (disk space used by virtual-memory algorithms) must be at a
lower level than the memory-management routines, because memory management
requires the ability to use the backing store.
Other requirements may not be so obvious. The backing-store driver would normally be
above the CPU scheduler, because the driver may need to wait for I/O and the CPU can
be rescheduled during this time. However, on a larger system, the CPU scheduler may
have more information about all the active processes than can fit in memory. Therefore,
this information may need to be swapped in and out of memory, requiring the backing-
store driver routine to be below the CPU scheduler.
A final problem with layered implementations is that they tend to be less efficient than
other types. For instance, when a user program executes an I/O operation, it executes a
system call that is trapped to the I/O layer, which calls the memory-management layer,
which in turn calls the CPU-scheduling layer, which is then passed to the hardware. At
each layer, the parameters may be modified; data may need to be passed, and so on. Each
layer adds overhead to the system call; the net result is a system call that takes longer
than does one on a non-layered system. These limitations have caused a small backlash
against layering in recent years. Fewer layers with more functionality are being designed,
providing most of the advantages of modularized code while avoiding the difficult
problems of layer definition and interaction.
Self Assessment Questions
1. What is the layered Architecture of UNIX?
2. What are the advantages of layered Architecture?
Micro-kernels
We have already seen that as UNIX expanded, the kernel became large and difficult to
manage. In the mid-1980s, researchers at Carnegie Mellon University developed an
operating system called Mach that modularized the kernel using the microkernel
approach. This method structures the operating system by removing all nonessential
components from the kernel and implementing them as system and user-level programs.
The result is a smaller kernel. There is little consensus regarding which services should
remain in the kernel and which should be implemented in user space. Typically, however,
micro-kernels provide minimal process and memory management, in addition to a
communication facility.
Device
Drivers
File Server
Client Process
….
Virtual Memory
Microkernel
Hardware
Fig. 2.3: Microkernel Architecture
The main function of the microkernel is to provide a communication facility between the
client program and the various services that are also running in user space.
Communication is provided by message passing. The client program and the service
never interact directly; rather, they communicate indirectly by exchanging messages
with the microkernel.
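That indirection can be sketched as follows; the mailbox dictionary, the service names, and the message format are invented for illustration, standing in for the kernel's real message-passing primitives:

```python
import queue

class Microkernel:
    """Only job here: pass messages between user-space processes."""

    def __init__(self):
        self._mailboxes = {}

    def register(self, name):
        self._mailboxes[name] = queue.Queue()

    def send(self, dest, message):
        self._mailboxes[dest].put(message)

    def receive(self, name):
        return self._mailboxes[name].get()

kernel = Microkernel()
kernel.register("file_server")
kernel.register("client")

# The client asks the file server for a file, via the kernel only.
kernel.send("file_server", ("read", "notes.txt", "client"))

# The file server handles the request and replies, again via the kernel.
op, fname, reply_to = kernel.receive("file_server")
kernel.send(reply_to, f"contents of {fname}")

reply = kernel.receive("client")
```

At no point does the client hold a reference to the file server; every interaction goes through the kernel's send and receive operations.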
One benefit of the microkernel approach is ease of extending the operating system. All
new services are added to user space and consequently do not require modification of the
kernel. When the kernel does have to be modified, the changes tend to be fewer, because
the microkernel is a smaller kernel. The resulting operating system is easier to port from
one hardware design to another. The microkernel also provides more security and
reliability, since most services run as user (rather than kernel) processes; if a service
fails, the rest of the operating system remains untouched.
Several contemporary operating systems have used the microkernel approach. Tru64
UNIX (formerly Digital UNIX) provides a UNIX interface to the user, but it is
implemented with a Mach kernel. The Mach kernel maps UNIX system calls into
messages to the appropriate user-level services.
The following figure shows the UNIX operating system architecture. At the center is the
hardware, covered by the kernel. Above that are the UNIX utilities and the command
interface, such as the shell (sh), etc.
Self Assessment Questions
1. What other facilities Micro-kernel provides in addition to Communication
facility?
2. What are the benefits of Micro-kernel?
UNIX kernel Components
The UNIX kernel has components as depicted in figure 2.5 below. The figure is
divided into three modes: user mode, kernel mode, and hardware. The user mode
contains user programs which can access the services of the kernel components using
system call interface.
The kernel mode has four major components: system calls, file subsystem, process
control subsystem, and hardware control. The system calls are interface between user
programs and file and process control subsystems. The file subsystem is responsible for
file and I/O management through device drivers.
The process control subsystem contains scheduler, Inter-process communication and
memory management. Finally the hardware control is the interface between these two
subsystems and hardware.
Fig. 2.5: Unix kernel components
Another example is QNX. QNX is a real-time operating system that is also based on the
microkernel design. The QNX microkernel provides services for message passing and
process scheduling. It also handles low-level network communication and hardware
interrupts. All other services in QNX are provided by standard processes that run outside
the kernel in user mode.
Unfortunately, microkernels can suffer from performance decreases due to increased
system function overhead. Consider the history of Windows NT. The first release had a
layered microkernel organization. However, this version delivered low performance
compared with that of Windows 95. Windows NT 4.0 partially redressed the performance
problem by moving layers from user space to kernel space and integrating them more
closely. By the time Windows XP was designed, its architecture was more monolithic
than microkernel.
Self Assessment Questions
1. What are the components of UNIX Kernel?
2. Under what circumstances a Micro-kernel may suffer from performance
decrease?
Modules
Perhaps the best current methodology for operating-system design involves using object-
oriented programming techniques to create a modular kernel. Here, the kernel has a set of
core components and dynamically links in additional services either during boot time or
during run time. Such a strategy uses dynamically loadable modules and is common in
modern implementations of UNIX, such as Solaris, Linux and Mac OS X. For example,
the Solaris operating system structure is organized around a core kernel with seven types
of loadable kernel modules:
1. Scheduling classes
2. File systems
3. Loadable system calls
4. Executable formats
5. STREAMS modules
6. Miscellaneous
7. Device and bus drivers
Such a design allows the kernel to provide core services yet also allows certain
features to be implemented dynamically. For example, device and bus drivers for
specific hardware can be added to the kernel, and support for different file
systems can be added as loadable modules. The overall result resembles a layered
system in that each kernel section has defined, protected interfaces; but it is more
flexible than a layered system in that any module can call any other module.
Furthermore, the approach is like the microkernel approach in that the primary
module has only core functions and knowledge of how to load and communicate
with other modules; but it is more efficient, because modules do not need to
invoke message passing in order to communicate.
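A hedged sketch of the idea: a core "kernel" object knows only how to load and look up services linked in at run time. The module name fat32 and its mount entry point are invented for illustration; a real system would perform an actual dynamic-link step rather than registering a Python object:

```python
import types

class Kernel:
    """Core kernel: knows only how to load and look up modules."""

    def __init__(self):
        self.services = {}

    def load_module(self, name, module):
        # Stands in for the dynamic-link step a real kernel performs.
        self.services[name] = module

# Build a "file system" module at run time and load it.
fs_module = types.ModuleType("fat32")
fs_module.mount = lambda device: f"{device} mounted as fat32"

kernel = Kernel()
kernel.load_module("fat32", fs_module)
result = kernel.services["fat32"].mount("/dev/sda1")
```

Because the module is called directly once loaded, there is no message-passing overhead, which is the efficiency advantage the text notes over the microkernel approach.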
Self Assessment Questions
1. Which strategy uses dynamically loadable modules and is common in
modern implementations of UNIX?
2. What are different loadable modules based on which the Solaris operating
system structure is organized around a core kernel?
Introduction to Virtual Machine
The layered approach of operating systems is taken to its logical conclusion in the
concept of virtual machine. The fundamental idea behind a virtual machine is to abstract
the hardware of a single computer (the CPU, Memory, Disk drives, Network Interface
Cards, and so forth) into several different execution environments and thereby creating
the illusion that each separate execution environment is running its own private
computer. By using CPU Scheduling and Virtual Memory techniques, an operating
system can create the illusion that a process has its own processor with its own (virtual)
memory. Normally a process has additional features, such as system calls and a file
system, which are not provided by the hardware. The Virtual machine approach does not
provide any such additional functionality but rather an interface that is identical to the
underlying bare hardware. Each process is provided with a (virtual) copy of the
underlying computer.
Hardware Virtual machine
The original meaning of virtual machine, sometimes called a hardware virtual
machine, is that of a number of discrete identical execution environments on a single
computer, each of which runs an operating system (OS). This can allow applications
written for one OS to be executed on a machine which runs a different OS, or provide
execution “sandboxes” which provide a greater level of isolation between processes than
is achieved when running multiple processes on the same instance of an OS. One use is to
provide multiple users the illusion of having an entire computer, one that is their
“private” machine, isolated from other users, all on a single physical machine. Another
advantage is that booting and restarting a virtual machine can be much faster than with a
physical machine, since it may be possible to skip tasks such as hardware initialization.
Such software is now often referred to with the terms virtualization and virtual servers.
The host software which provides this capability is often referred to as a virtual machine
monitor or hypervisor.
Software virtualization can be done in three major ways:
• Emulation, full system simulation, or "full virtualization with dynamic recompilation":
the virtual machine simulates the complete hardware, allowing an unmodified OS for a
completely different CPU to be run.
• Paravirtualization: the virtual machine does not simulate hardware but instead offers a
special API that requires OS modifications. An example of this is XenSource's
XenEnterprise (www.xensource.com).
• Native virtualization and "full virtualization": the virtual machine only partially
simulates enough hardware to allow an unmodified OS to be run in isolation, but the
guest OS must be designed for the same type of CPU. The term native virtualization is
also sometimes used to designate that hardware assistance through Virtualization
Technology is used.
Application virtual machine
Another meaning of virtual machine is a piece of computer software that isolates the
application being used by the user from the computer. Because versions of the virtual
machine are written for various computer platforms, any application written for the
virtual machine can be operated on any of the platforms, instead of having to produce
separate versions of the application for each computer and operating system. The
application is run on the computer using an interpreter or Just In Time compilation. One
of the best known examples of an application virtual machine is Sun Microsystems' Java
Virtual Machine.
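The interpreter at the heart of such a machine can be sketched as a small stack-based bytecode loop; the opcodes below are invented for illustration and are far simpler than real JVM bytecode:

```python
def run(bytecode):
    """Interpret a list of (opcode, argument) pairs on a value stack."""
    stack = []
    for op, arg in bytecode:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return stack.pop()

# The same "program" runs wherever the interpreter has been ported.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]
```

Here run(program) computes (2 + 3) * 4; only the interpreter itself would need to be rewritten for a new platform, not the program.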
Self Assessment Questions
1. What do you mean by a Virtual Machine?
2. Differentiate Hardware Virtual Machines and Software Virtual Machines.
Virtual Environment
A virtual environment (otherwise referred to as Virtual private server) is another kind of
a virtual machine. In fact, it is a virtualized environment for running user-level programs
(i.e. not the operating system kernel and drivers, but applications). Virtual environments
are created using the software implementing operating system-level virtualization
approach, such as Virtuozzo, FreeBSD Jails, Linux-VServer, Solaris Containers, chroot
jail and OpenVZ.
Machine Aggregation
A less common use of the term is to refer to a computer cluster consisting of many
computers that have been aggregated together as a larger and more powerful “virtual”
machine. In this case, the software allows a single environment to be created spanning
multiple computers, so that the end user appears to be using only one computer rather
than several.
PVM (Parallel Virtual Machine) and MPI (Message Passing Interface) are two common
software packages that permit a heterogeneous collection of networked UNIX and/or
Windows computers to be used as a single, large, parallel computer. Thus large
computational problems can be solved more cost effectively by using the aggregate
power and memory of many computers than with a traditional supercomputer. The Plan9
Operating System from Bell Labs uses this approach.
Boston Circuits had released the gCore (grid-on-chip) Central Processing Unit (CPU)
with 16 ARC 750D cores and a Time-machine hardware module to provide a virtual
machine that uses this approach.
Self Assessment Questions
1. What is Virtual Environment?
2. Explain Machine Aggregation.
Implementation Techniques
Emulation of the underlying raw hardware (native execution)
This approach is described as full virtualization of the hardware, and can be implemented
using a Type 1 or Type 2 hypervisor. (A Type 1 hypervisor runs directly on the hardware;
a Type 2 hypervisor runs on another operating system, such as Linux.) Each virtual
machine can run any operating system supported by the underlying hardware. Users can
thus run two or more different “guest” operating systems simultaneously, in separate
“private” virtual computers.
The pioneer system using this concept was IBM’s CP-40, the first (1967) version of
IBM’s CP/CMS (1967-1972) and the precursor to IBM’s VM family (1972-present).
With the VM architecture, most users run a relatively simple interactive computing
single-user operating system, CMS, as a “guest” on top of the VM control program (VM-
CP). This approach kept the CMS design simple, as if it were running alone; the control
program quietly provides multitasking and resource management services “behind the
scenes”. In addition to CMS, VM users can run any of the other IBM operating systems,
such as MVS or z/OS. z/VM is the current version of VM, and is used to support
hundreds or thousands of virtual machines on a given mainframe. Some installations use
Linux for zSeries to run Web servers, where Linux runs as the operating system within
many virtual machines.
Full virtualization is particularly helpful in operating system development, when
experimental new code can be run at the same time as older, more stable, versions, each
in separate virtual machines. (The process can even be recursive: IBM debugged new
versions of its virtual machine operating system, VM, in a virtual machine running under
an older version of VM, and even used this technique to simulate new hardware.)
The x86 processor architecture as used in modern PCs does not actually meet the Popek
and Goldberg virtualization requirements. Notably, there is no execution mode where all
sensitive machine instructions always trap, which would allow per-instruction
virtualization.
Despite these limitations, several software packages have managed to provide
virtualization on the x86 architecture, even though dynamic recompilation of privileged
code, as first implemented by VMware, incurs some performance overhead compared to a
VM running on a natively virtualizable architecture such as the IBM System/370 or
Motorola MC68020. Several other software packages, such as Virtual PC, VirtualBox,
Parallels Workstation and Virtual Iron, have since implemented virtualization on x86
hardware.
On the other hand, plex86 can run only Linux under Linux, using a specially patched
kernel. It does not emulate a processor, but uses Bochs for emulation of motherboard
devices.
Intel and AMD have introduced features to their x86 processors to enable virtualization
in hardware.
Emulation of a non-native system
Virtual machines can also perform the role of an emulator, allowing software applications
and operating systems written for one computer processor architecture to be run on another.
Some virtual machines emulate hardware that only exists as a detailed specification. For
example:
• One of the first was the p-code machine specification, which allowed
programmers to write Pascal programs that would run on any computer running
virtual machine software that correctly implemented the specification.
• The specification of the Java virtual machine.
• The Common Language Infrastructure virtual machine at the heart of the
Microsoft .NET initiative.
• Open Firmware allows plug-in hardware to include boot-time diagnostics,
configuration code, and device drivers that will run on any kind of CPU.
This technique allows diverse computers to run any software written to that specification;
only the virtual machine software itself must be written separately for each type of
computer on which it runs.
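The idea of a machine defined purely by a specification can be illustrated with a toy interpreter. The instruction set below is invented for illustration (it is not real p-code or JVM bytecode); the point is that any host which implements the same specification can run the same "portable" program unchanged:

```python
# A toy stack-based virtual machine, in the spirit of p-code or JVM
# bytecode. The opcodes (PUSH, ADD, MUL, PRINT) are hypothetical.

def run(bytecode):
    """Interpret a list of (opcode, operand) pairs on an operand stack."""
    stack = []
    output = []
    for op, arg in bytecode:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            output.append(stack.pop())
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return output

# Computes (2 + 3) * 4. Only run() must be rewritten per host machine;
# the bytecode itself is machine-independent.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None),
           ("PUSH", 4), ("MUL", None), ("PRINT", None)]
print(run(program))  # [20]
```

Only the interpreter (`run`) is machine-specific, mirroring the statement above that only the virtual machine software must be written separately for each type of computer.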
Self Assessment Questions
1. What are the techniques to realize Virtual Machines concept?
2. What are the advantages of Virtual Machines?
Operating system-level virtualization
Operating System-level Virtualization is a server virtualization technology which
virtualizes servers on an operating system (kernel) layer. It can be thought of as
partitioning: a single physical server is sliced into multiple small partitions (otherwise
called virtual environments (VE), virtual private servers (VPS), guests, zones etc); each
such partition looks and feels like a real server, from the point of view of its users.
The operating system level architecture has low overhead, which helps to maximize
efficient use of server resources. The virtualization introduces only a negligible overhead
and allows running hundreds of virtual private servers on a single physical server. In
contrast, approaches such as full virtualization (like VMware) and paravirtualization (like
Xen or UML) cannot achieve such a level of density, due to the overhead of running
multiple kernels. On the other hand, operating system-level virtualization does not allow
running different operating systems (i.e. different kernels), although different libraries,
distributions etc. are possible.
Self Assessment Questions
1. Describe the Operating System Level Virtualization.
Summary
The virtual machine concept has several advantages. In this environment, there is
complete protection of the various system resources. Each virtual machine is completely
isolated from all other virtual machines, so there are no protection problems. At the same
time, however, there is no direct sharing of resources. Two approaches to provide sharing
have been implemented. A virtual machine is a perfect vehicle for operating systems
research and development.
The operating system, as an extended machine, acts as an interface between the hardware
and user application programs. The kernel is the essential center of a computer operating
system, i.e. the core that provides basic services for all other parts of the operating
system. It includes the interrupt handler, scheduler, operating system address space
manager, etc.
In the layered architecture of operating systems, the components of the kernel are built
as layers on one another, and each layer can interact with its neighbour through an
interface. In micro-kernel architecture, by contrast, most of these components are not part
of the kernel but act as another layer on top of it, and the kernel comprises only the
essential and basic components.
Terminal Questions
1. Explain operating system as extended machine.
2. What is a kernel? What are the main components of a kernel?
3. Explain monolithic type of kernel architecture in brief.
4. What is a micro-kernel? Describe its architecture.
5. Compare micro-kernel with layered architecture of operating system.
6. Describe UNIX kernel components in brief.
7. What are the components of operating system?
8. Explain the responsibilities of operating system as process management.
9. Explain the function of operating system as file management.
10. What are different services provided by an operating system?
Unit 3: Process Management
This unit covers process management and threads: process creation, termination, process
states and process control, along with processes vs. threads and the types of threads.
Introduction
This unit discusses the definition of a process, process creation, process termination,
process states, and process control. It also deals with threads and thread types.
A process can be simply defined as a program in execution. A process, along with the
program code, comprises the program counter value, processor register contents, values
of variables, the stack and program data.
A process is created and terminated, and it follows some or all of the states of process
transition, such as New, Ready, Running, Waiting, and Exit.
A thread is a single sequence stream within a process. Because threads have some of the
properties of processes, they are sometimes called lightweight processes. There are two
types of threads: user level threads (ULT) and kernel level threads (KLT). User level
threads are mostly used on systems where the operating system does not support threads,
but they can also be combined with kernel level threads. Threads have properties similar
to those of processes, e.g. execution states, context switch etc.
Objectives
At the end of this unit, you will be able to understand:
• What is a Process?
• Process Creation, Process Termination
• Process States, Process Control
• Threads
• Types of Threads
What is a Process?
The notion of process is central to the understanding of operating systems. The term
process is used somewhat interchangeably with ‘task’ or ‘job’. There are quite a few
definitions presented in the literature, for instance:
• A program in execution.
• An asynchronous activity.
• The entity to which processors are assigned.
• The ‘dispatchable’ unit.
And many more, but the definition “Program in Execution” seems to be the most
frequently used, and this is the concept we will use in the present study of operating
systems.
Now that we have agreed upon the definition of a process, the question is: what is the
relation between a process and a program? Are they the same thing with different names,
or is it called a program when it is not executing and a process when it is executing?
To be precise, a process is not the same as a program. A process is more than a program
code. A process is an ‘active’ entity, as opposed to a program, which is considered a
‘passive’ entity. As we all know, a program is an algorithm expressed in some
programming language. Being passive, a program is only a part of a process. A process,
on the other hand, includes:
• Current value of the Program Counter (PC)
• Contents of the processor’s registers
• Values of the variables
• The process stack, which typically contains temporary data such as subroutine
parameters, return addresses, and temporary variables
• A data section that contains global variables
A process is the unit of work in a system.
In the process model, all software on the computer is organized into a number of sequential
processes. A process includes the PC, registers, and variables. Conceptually, each process
has its own virtual CPU. In reality, the CPU switches back and forth among processes.
Process Creation
In general-purpose systems, some way is needed to create processes as needed during
operation. There are four principal events that lead to process creation:
• System initialization.
• Execution of a process-creation system call by a running process.
• A user request to create a new process.
• Initiation of a batch job.
Foreground processes interact with users. Background processes stay in the background,
sleeping, but suddenly spring to life to handle activity such as email, web pages, printing,
and so on. Background processes are called daemons.
A process may create a new process by executing the system call ‘fork’ in UNIX; this
call creates an exact clone of the calling process. The creating process is called the parent
process and the created one is called the child process. Only one parent is needed to
create a child process. This creation of processes yields a hierarchical structure of
processes. Note that each child has only one parent, but each parent may have many
children. After the fork, the two processes, the parent and the child, initially have the
same memory image, the same environment strings and the same open files. After a
process is created, both the parent and the child have their own distinct address spaces.
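The fork-and-wait pattern can be sketched with Python's POSIX-only os module. The exit status 7 below is arbitrary, chosen just for illustration:

```python
import os

def spawn_child():
    pid = os.fork()              # clone the calling process
    if pid == 0:
        # Child: starts as an exact clone of the parent, but from here
        # on has its own distinct address space.
        os._exit(7)              # terminate with an arbitrary status
    # Parent: fork() returned the child's PID. Wait for the child and
    # collect its exit status (this also reaps the process-table entry).
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)

print(spawn_child())  # 7
```

Note that both parent and child continue from the same point after fork(); the return value of fork() (0 in the child, the child's PID in the parent) is what distinguishes the two paths.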
Following are some reasons for creation of a process
1. User logs on.
2. User starts a program.
3. Operating systems creates process to provide service, e.g., to manage printer.
4. Some program starts another process.
Creation of a process involves following steps:
1. Assign a unique process identifier to the new process, followed by making new
entry in to the process table regarding this process.
2. Allocate space for the process: this operation involves finding how much space is
needed by the process and allocating space to the parts of the process, such as user
program, user data, stack and process attributes. The space requirement can be
taken by default based on the type of the process, or from the parent process if the
process is spawned by another process.
3. Initialize Process Control Block: the PCB contains various attributes required to
execute and control a process, such as process identification, processor status
information and control information. This can be initialized to standard default values
plus attributes that have been requested for this process.
4. Set the appropriate linkages: the operating system maintains various queues
related to a process in the form of linked lists, the newly created process should be
attached to one of such queues.
5. Create or expand other data structures: depending on the implementation, an
operating system may need to create some data structures for this process, for
example to maintain accounting file for billing or performance assessment.
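The creation steps above can be sketched schematically. All names here (PCB, Kernel, ready_queue) are illustrative, not a real operating system API:

```python
from dataclasses import dataclass, field
from itertools import count

@dataclass
class PCB:                      # step 3: the Process Control Block
    pid: int                    # process identification
    state: str = "New"          # control information
    program_counter: int = 0    # processor status information
    registers: dict = field(default_factory=dict)
    memory: int = 0             # step 2: space allocated to the process

class Kernel:
    def __init__(self):
        self._next_pid = count(1)   # step 1: unique process identifiers
        self.process_table = {}     # step 1: entry per process
        self.ready_queue = []       # step 4: linkage to a scheduling queue

    def create_process(self, memory_needed=4096):
        pid = next(self._next_pid)              # step 1
        pcb = PCB(pid=pid, memory=memory_needed)  # steps 2-3
        self.process_table[pid] = pcb
        pcb.state = "Ready"
        self.ready_queue.append(pid)            # step 4
        return pid

kernel = Kernel()
print(kernel.create_process())   # 1
print(kernel.ready_queue)        # [1]
```

Step 5 (extra data structures such as accounting files) is omitted here, since it depends entirely on the implementation.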
Process Termination
A process terminates when it finishes executing its last statement. Its resources are
returned to the system, it is purged from any system lists or tables, and its process control
block (PCB) is erased i.e., the PCB’s memory space is returned to a free memory pool.
A process terminates, usually due to one of the following reasons:
• Normal Exit: Most processes terminate because they have done their job. This
call is exit in UNIX.
• Error Exit: When a process discovers a fatal error. For example, a user tries to
compile a program that does not exist.
• Fatal Error: An error caused by the process due to a bug in the program, for
example, executing an illegal instruction, referring to non-existent memory or
dividing by zero.
• Killed by another Process: A process executes a system call telling the
operating system to terminate some other process.
Process States
A process goes through a series of discrete process states during its lifetime. Depending
on the implementation, operating systems may differ in the number of states a process
goes through. Though there are various state models, ranging from two states to nine
states, we will first see a five-state model and then a seven-state model, as the lower-state
models are now obsolete.
Five State Process Model
Following are the states of the five-state process model. Figure 3.1 shows these state
transitions.
• New State: The process is being created.
• Terminated State: The process has finished execution.
• Blocked (waiting) State: When a process blocks, it does so because logically it
cannot continue, typically because it is waiting for input that is not yet available.
Formally, a process is said to be blocked if it is waiting for some event to happen
(such as an I/O completion) before it can proceed. In this state a process is unable
to run until some external event happens.
• Running State: A process is said to be running if it currently has the CPU, that
is, it is actually using the CPU at that particular instant.
• Ready State: A process is said to be ready if it could use a CPU if one were
available. It is runnable but temporarily stopped to let another process run.
Logically, the ‘Running’ and ‘Ready’ states are similar. In both cases the process is
willing to run, only in the case of ‘Ready’ state, there is temporarily no CPU available for
it. The ‘Blocked’ state is different from the ‘Running’ and ‘Ready’ states in that the
process cannot run, even if the CPU is available.
Following are the six possible transitions among the above-mentioned five states.
Transition 1 occurs when a process discovers that it cannot continue. If the running
process initiates an I/O operation before its allotted time expires, it voluntarily
relinquishes the CPU.
This state transition is:
Block (process): Running → Blocked.
Transition 2 occurs when the scheduler decides that the running process has run long
enough and it is time to let another process have CPU time.
This state transition is:
Time-Run-Out (process): Running → Ready.
Transition 3 occurs when all other processes have had their share and it is time for the
first process to run again.
This state transition is:
Dispatch (process): Ready → Running.
Transition 4 occurs when the external event for which a process was waiting (such as
arrival of input) happens.
This state transition is:
Wakeup (process): Blocked → Ready.
Transition 5 occurs when the process is created.
This state transition is:
Admitted (process): New → Ready.
Transition 6 occurs when the process has finished execution.
This state transition is:
Exit (process): Running → Terminated.
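The five states and six transitions above can be captured as a small transition table; any transition not listed is rejected. This is only a sketch of the model, not of a real scheduler:

```python
# The six legal transitions of the five-state model, numbered as above.
LEGAL = {
    ("Running", "Blocked"),     # 1: Block
    ("Running", "Ready"),       # 2: Time-Run-Out
    ("Ready",   "Running"),     # 3: Dispatch
    ("Blocked", "Ready"),       # 4: Wakeup
    ("New",     "Ready"),       # 5: Admitted
    ("Running", "Terminated"),  # 6: Exit
}

class Process:
    def __init__(self):
        self.state = "New"

    def move(self, new_state):
        if (self.state, new_state) not in LEGAL:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        return self.state

# A typical lifetime: admitted, dispatched, blocked on I/O, woken up,
# dispatched again, and finally exiting.
p = Process()
for s in ("Ready", "Running", "Blocked", "Ready", "Running", "Terminated"):
    p.move(s)
print(p.state)  # Terminated
```

Note that, matching the model, there is no direct New → Running or Blocked → Running edge: a process must always pass through Ready first.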
Swapping
Many operating systems follow the above process model. However, in operating systems
that do not employ virtual memory, the processor will be idle most of the time,
considering the difference between the speed of I/O and that of the processor. There will
be many processes waiting for I/O in memory, exhausting the memory. If there is no
ready process to run, new processes cannot be created, as there is no memory available to
accommodate a new process. Thus the processor has to wait till one of the waiting
processes becomes ready after completion of an I/O operation.
This problem can be solved by adding two more states to the above process model, using
the swapping technique. Swapping involves moving part or all of a process from main
memory to disk. When none of the processes in main memory is in the ready state, the
operating system swaps one of the blocked processes out onto disk, into a suspend queue.
This is a queue of existing processes that have been temporarily shifted out of main
memory, or suspended. The operating system then either creates a new process or brings
in a swapped process from the disk which has become ready.
Seven State Process Model
Figure 3.2 shows the seven-state process model, which uses the swapping technique
described above.
Apart from the transitions we have seen in the five-state model, following are the new
transitions which occur in the seven-state model.
• Blocked to Blocked / Suspend: If there are no ready processes in main
memory, at least one blocked process is swapped out to make room for another
process that is not blocked.
• Blocked / Suspend to Blocked: If a process terminates, making space in main
memory, and there is a high-priority process which is blocked and suspended,
then, anticipating that it will become unblocked very soon, that process is brought
into main memory.
• Blocked / Suspend to Ready / Suspend: A process is moved from Blocked /
Suspend to Ready / Suspend, if the event occurs on which the process was
waiting, as there is no space in the main memory.
• Ready / Suspend to Ready: If there are no ready processes in main memory,
the operating system has to bring one into main memory to continue execution.
Sometimes this transition takes place even when there are ready processes in
main memory, if they have lower priority than one of the processes in the Ready /
Suspend state; the high-priority process is then brought into main memory.
• Ready to Ready / Suspend: Normally, blocked processes are suspended by
the operating system, but sometimes, to free a large block of memory, a ready
process may be suspended. In this case, normally the low-priority processes are
suspended.
• New to Ready / Suspend: When a new process is created, it should be added to
the Ready state. But sometimes sufficient memory may not be available to
allocate to the newly created process. In this case, the new process is shifted to
Ready / Suspend.
Process Control
In this section we will study structure of a process, process control block, modes of
process execution, and process switching.
Process Structure
After studying the process states, we will now see where the process resides, and what
the physical manifestation of a process is.
The location of a process depends on the memory management scheme being used. In the
simplest case, a process is maintained in secondary memory, and to manage this process,
at least a small part of it is maintained in main memory. To execute the process, the
entire process or a part of it is brought into main memory, and for that the operating
system needs to know the location of the process.
• Process identification
• Processor state information
• Process control information
• User stack
• Private user address space (program, data)
• Shared address space
Figure 3.3: Process Image
The obvious contents of a process are the User Program to be executed and the User
Data associated with that program. Apart from these, there are two major parts of a
process: the System Stack, which is used to store parameters and calling addresses for
procedure and system calls, and the Process Control Block, which is a collection of
process attributes needed by the operating system to control a process. The collection of
user program, data, system stack, and process control block is called the Process Image,
as shown in figure 3.3 above.
Process Control Block
A process control block, as shown in figure 3.4 below, contains various attributes
required by the operating system to control a process, such as process state, program
counter, CPU state, CPU scheduling information, memory management information, I/O
state information, etc.
These attributes can be grouped into three general categories as follows:
• Process identification
• Processor state information
• Process control information
The first category stores information related to process identification, such as the
identifier of the current process; the identifier of the process which created this process,
to maintain the parent-child process relationship; and the user identifier, i.e. the identifier
of the user on whose behalf the process is being run.
The Processor state information consists of the contents of the processor registers, such
as the user-visible registers, the control and status registers, which include the program
counter and program status word, and the stack pointers.
The third category, Process Control Information, is mainly required for the control of a
process. The information includes: scheduling and state information, data structuring,
inter-process communication, process privileges, memory management, and resource
ownership and utilization.
• pointer
• process state
• process number
• program counter
• registers
• memory limits
• list of open files
• …
Figure 3.4: Process Control Block
Modes of Execution
In order to ensure the correct execution of each process, an operating system must protect
each process’s private information (executable code, data, and stack) from uncontrolled
interferences from other processes. This is accomplished by suitably restricting the
memory address space available to a process for reading/writing, so that the OS can
regain CPU control through hardware-generated exceptions whenever a process violates
those restrictions.
Also the OS code needs to execute in a privileged condition with respect to “normal”: to
manage processes, it needs to be enabled to execute operations which are forbidden to
“normal” processes. Thus most processors support at least two modes of execution.
Certain instructions can only be executed in the more privileged mode; these include
reading or altering a control register such as the program status word, primitive I/O
instructions, and memory management instructions.
The less privileged mode is referred to as user mode, as user programs are typically
executed in this mode; the more privileged mode, in which important operating system
functions are executed, is called kernel mode (also system mode or control mode).
The current mode information, i.e. whether the processor is running in user mode or
kernel mode, is stored in the PSW. The mode change is normally done by executing a
change-mode instruction, typically when a user process invokes a system call or
whenever an interrupt occurs, as these are operating system functions and need to be
executed in privileged mode. After completion of the system call or interrupt routine, the
mode is changed back to user mode to continue the user process’s execution.
Context Switching
To give each process on a multiprogrammed machine a fair share of the CPU, a hardware
clock generates interrupts periodically. This allows the operating system to schedule all
processes in main memory (using scheduling algorithm) to run on the CPU at equal
intervals. Each time a clock interrupt occurs, the interrupt handler checks how much time
the current running process has used. If it has used up its entire time slice, then the CPU
scheduling algorithm (in kernel) picks a different process to run. Each switch of the CPU
from one process to another is called a context switch.
A context is the contents of a CPU’s registers and program counter at any point in time.
Context switching can be described as the kernel (i.e., the core of the operating system)
performing the following activities with regard to processes on the CPU: (1) suspending
the progression of one process and storing the CPU’s state (i.e., the context) for that
process somewhere in memory, (2) retrieving the context of the next process from
memory and restoring it in the CPU’s registers and (3) returning to the location indicated
by the program counter (i.e., returning to the line of code at which the process was
interrupted) in order to resume the process. Figure 3.5 below depicts a context switch
from process P0 to process P1.
Figure 3.5: Process switching
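As a rough illustration of the three activities above, Python generators can stand in for processes: a generator's suspended frame plays the role of the saved context, and next() "restores" it and resumes execution where it left off. This is only an analogy, not how a kernel saves CPU registers:

```python
def process(name, steps):
    """A toy 'process' that runs one step per dispatch, then yields."""
    for i in range(steps):
        yield f"{name}:{i}"          # run one step, then be switched out

def round_robin(procs):
    """Dispatch each runnable process for one step in turn."""
    trace = []
    while procs:
        p = procs.pop(0)             # scheduler picks the next process
        try:
            trace.append(next(p))    # restore its context, run one step
            procs.append(p)          # suspend it: context saved in the
                                     # generator frame, back of the queue
        except StopIteration:
            pass                     # process terminated; drop it
    return trace

trace = round_robin([process("P0", 2), process("P1", 2)])
print(trace)  # ['P0:0', 'P1:0', 'P0:1', 'P1:1']
```

Each call to next() resumes a process exactly where it was interrupted, which is the essence of step (3) above.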
Self Assessment Questions:
1. Discuss the process state with its five state process model.
2. Explain the seven state process model.
3. What is Process Control ? Discuss the process control block.
4. Write note on Context Switching.
A context switch is sometimes described as the kernel suspending execution of one
process on the CPU and resuming execution of some other process that had previously
been suspended.
A context switch occurs due to an interrupt, a trap (an error due to the current instruction)
or a system call, as described below:
• Clock interrupt: when a process has executed its current time quantum which
was allocated to it, the process must be switched from running state to ready state,
and another process must be dispatched for execution.
• I/O interrupt: whenever any I/O-related event occurs, the OS is interrupted;
the OS has to determine the reason for it and take the necessary action for that
event. Thus the current process is switched to the ready state and the interrupt
routine is loaded to handle the interrupt event (e.g. after an I/O interrupt the OS
moves all the processes which were blocked on the event from the blocked state
to the ready state, and from blocked/suspended to ready/suspended). After
completion of the interrupt-related actions, one might expect the process which
was switched out to be brought back for execution, but that does not necessarily
happen. At this point the scheduler again decides afresh which of all the ready
processes is to be scheduled for execution. This is important, as it will schedule
any high-priority process added to the ready queue during the interrupt handling
period.
• Memory fault: when the virtual memory technique is used for memory
management, it often happens that a process refers to a memory address which is
not present in main memory and needs to be brought in. As the memory block
transfer takes time, another process should be given a chance to execute and the
current process should be blocked. Thus the OS blocks the current process, issues
an I/O request to get the memory block into memory, switches the current process
to the blocked state, and loads another process for execution.
• Trap: if the instruction being executed causes an error or exception, then
depending on the criticality of the error/exception and the design of the operating
system, it may either move the process to the exit state, or may continue executing
the current process after a possible recovery.
• System call: often a process has to invoke a system call for a privileged job; for
this, the current process is blocked and the respective operating system’s system
call code is executed. Thus the context of the current process is switched to the
system call code.
Example: UNIX Process
Let us see the example of UNIX System V, which makes use of a simple but powerful
process facility that is highly visible to the user. The following figure shows the model
followed by UNIX, in which most of the operating system executes within the
environment of a user process. Thus, two modes, user and kernel, are required. UNIX
uses two categories of processes: system processes and user processes. System processes
run in kernel mode and execute operating system code to perform administrative and
housekeeping functions, such as allocation of memory and process swapping. User
processes operate in user mode to execute user programs and utilities, and in kernel mode
to execute instructions belonging to the kernel. A user process enters kernel mode by
issuing a system call, when an exception (fault) is generated, or when an interrupt occurs.
A total of nine process states are recognized by the UNIX operating system, as explained
below:
• User Running: Executing in user mode.
• Kernel Running: Executing in kernel mode.
• Ready to Run, in Memory: Ready to run as soon as the kernel schedules it.
• Asleep in Memory: Unable to execute until an event occurs; process is in main
memory (a blocked state).
• Ready to Run, Swapped: Process is ready to run, but the swapper must swap the
process into main memory before the kernel can schedule it to execute.
• Sleeping, Swapped: The process is awaiting an event and has been swapped to
secondary storage (a blocked state).
• Preempted: Process is returning from kernel to user mode, but the kernel
preempts it and does a process switch to schedule another process.
• Created: Process is newly created and not yet ready to run.
• Zombie: Process no longer exists, but it leaves a record for its parent process to
collect.
UNIX employs two Running states to indicate whether the process is executing in user
mode or kernel mode. A distinction is made between the two states: (Ready to Run, in
Memory) and (Preempted). These are essentially the same state, as indicated by the
dotted line joining them. The distinction is made to emphasize the way in which the
preempted state is entered. When a process is running in kernel mode (as a result of a
supervisor call, clock interrupt, or I/O interrupt), there will come a time when the kernel
has completed its work and is ready to return control to the user program. At this point,
the kernel may decide to preempt the current process in favor of one that is ready and of
higher priority. In that case, the current process moves to the preempted state. However,
for purposes of dispatching, those processes in the preempted state and those in the
Ready to Run, in Memory state form one queue.
Preemption can only occur when a process is about to move from kernel mode to user
mode. While a process is running in kernel mode, it may not be preempted. This makes
UNIX unsuitable for real-time processing.
Two processes are unique in UNIX. Process 0 is a special process that is created when
the system boots; in effect, it is predefined as a data structure loaded at boot time. It is the
swapper process. In addition, process 0 spawns process 1, referred to as the init process;
all other processes in the system have process 1 as an ancestor. When a new interactive
user logs onto the system, it is process 1 that creates a user process for that user.
Subsequently, the user process can create child processes in a branching tree, so that any
particular application can consist of a number of related processes.
Threads
A thread is a single sequence stream within a process. Because threads have some of
the properties of processes, they are sometimes called lightweight processes. Within a
process, threads allow multiple streams of execution. In many respects, threads are a
popular way to improve applications through parallelism. The CPU switches rapidly
back and forth among the threads, giving the illusion that the threads are running in
parallel. Like a traditional process, i.e. a process with one thread, a thread can be in any
of several states (Running, Blocked, Ready or Terminated). Each thread has its own
stack: since threads will generally call different procedures, each has a different
execution history, and this is why a thread needs its own stack. In an operating system
that has a thread facility, the basic unit of CPU utilization is a thread. A thread has, or
consists of, a program counter (PC), a register set, and a stack space. Threads are not
independent of one another in the way processes are; as a result, a thread shares with the
other threads its code section, data section, and OS resources (collectively known as a
task), such as open files and signals.
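The sharing of the data section among threads can be demonstrated directly. In the sketch below, four threads update one shared counter, with a lock protecting the shared variable (separate processes would need interprocess communication to do the same):

```python
import threading

counter = 0                      # shared data: visible to every thread
lock = threading.Lock()

def worker(n):
    """Increment the shared counter n times."""
    global counter
    for _ in range(n):
        with lock:               # protect the shared variable
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()                     # wait for all threads to terminate

print(counter)  # 40000
```

Without the lock, the final count could be anything up to 40000, since `counter += 1` is not atomic; the ease of this kind of accidental interference is revisited under the disadvantages of threads below.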
Processes Vs Threads
As we mentioned earlier, in many respects threads operate in the same way as processes.
Some of the similarities and differences are:
Similarities
• Like processes, threads share the CPU, and only one thread is running at a time.
• Like processes, threads within a process execute sequentially.
• Like processes, a thread can create children.
• And like processes, if one thread is blocked, another thread can run.
Differences
• Unlike processes, threads are not independent of one another.
• Unlike processes, all threads can access every address in the task.
• Unlike processes, threads are designed to assist one another. (Processes might
or might not assist one another, because processes may originate from different
users.)
Why Threads?
Following are some reasons why we use threads in designing operating systems:
1. A process with multiple threads makes a great server, for example a printer server.
2. Because threads can share common data, they do not need to use interprocess
communication.
3. Because of their very nature, threads can take advantage of multiprocessors.
Threads are cheap in the sense that:
1. They only need a stack and storage for registers; therefore, threads are cheap to
create.
2. Threads use very few resources of the operating system in which they are
working. That is, threads do not need a new address space, global data, program
code or operating system resources.
3. Context switching is fast when working with threads, because we only have to
save and/or restore the PC, the SP and the registers.
Advantages of Threads over Multiple Processes
• Context Switching: Threads are very inexpensive to create and destroy, and
they are inexpensive to represent. For example, they require space to store the
PC, the SP, and the general-purpose registers, but they do not require space for
memory-management information, information about open files or I/O devices in use,
etc. With so little context, it is much faster to switch between threads; in other
words, a context switch between threads is relatively cheap.
• Sharing: Threads allow the sharing of many resources that cannot be shared
between processes, for example the code section, the data section, and operating
system resources such as open files.
A proxy server satisfying the requests for a number of computers on a LAN would
benefit from a multi-threaded process. In general, any program that has to do more than
one task at a time could benefit from multithreading. For example, a program that reads
input, processes it, and writes output could have three threads, one for each task.
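The three-task program described above can be sketched in Python with the standard threading and queue modules (the function names and sample data are invented for illustration): one thread reads input, one processes it, and one writes output, with queues passing work between the stages.

```python
import queue
import threading

raw = queue.Queue()        # input -> processing
processed = queue.Queue()  # processing -> output
results = []               # stands in for the program's output

def reader():
    # Thread 1: reads input (here, a fixed sample list).
    for item in ["a", "b", "c"]:
        raw.put(item)
    raw.put(None)  # sentinel: no more input

def processor():
    # Thread 2: processes each item as it arrives.
    while True:
        item = raw.get()
        if item is None:
            processed.put(None)
            break
        processed.put(item.upper())

def writer():
    # Thread 3: writes the processed items out.
    while True:
        item = processed.get()
        if item is None:
            break
        results.append(item)

threads = [threading.Thread(target=f) for f in (reader, processor, writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # ['A', 'B', 'C']
```

Because the three threads share the queues directly, no interprocess communication mechanism is needed, which is the second advantage listed above.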
Disadvantages of Threads over Multiple Processes
• Blocking: The major disadvantage is that if the kernel is single-threaded, a system
call by one thread will block the whole process, and the CPU may be idle during the
blocking period.
• Security: Since there is extensive sharing among threads, there is a potential
security problem. It is quite possible that one thread overwrites the stack of
another thread (or damages shared data), although this is very unlikely since threads
are meant to cooperate on a single task.
Any sequential process that cannot be divided into parallel tasks will not benefit from
threads, as each thread would block until the previous one completes. For example, a
program that displays the time of day would not benefit from multiple threads.
Self Assessment Questions
1. Define a thread.
2. Discuss processes vs. threads.
3. State the advantages and disadvantages of threads over multiple processes.
Types of Threads
There are two types of threads: user level threads (ULT) and kernel level threads (KLT).
User Level Threads
User-level threads are implemented in user-level libraries, rather than via system
calls, so thread switching does not need to call the operating system or cause an
interrupt to the kernel. In fact, the kernel knows nothing about user-level threads and
manages them as if they were single-threaded processes, as shown in figure 3.7 below.
Figure 3.7: User Level Thread
Advantages:
The most obvious advantage of this technique is that a user-level threads package can be
implemented on an operating system that does not support threads. Some other
advantages are:
• User-level threads do not require modification of the operating system.
• Simple Representation: Each thread is represented simply by a PC, registers, a stack
and a small control block, all stored in the user process address space.
• Simple Management: Creating a thread, switching between threads and
synchronizing between threads can all be done without intervention of the kernel.
• Fast and Efficient: Thread switching is not much more expensive than a procedure
call.
Disadvantages:
• There is a lack of coordination between threads and the operating system kernel.
Therefore, the process as a whole gets one time slice irrespective of whether it
has one thread or 1000 threads within it. It is up to each thread to relinquish
control to other threads.
• User-level threads require non-blocking system calls, i.e., a multithreaded kernel.
Otherwise, the entire process will block in the kernel, even if there are runnable
threads left in the process. For example, if one thread causes a page fault, the
whole process blocks.
Kernel Level Threads:
As shown in figure 3.8 below, in this method the kernel knows about and manages the
threads. No runtime system is needed in this case. Instead of a thread table in each
process, the kernel has a thread table that keeps track of all threads in the system. In
addition, the kernel also maintains the traditional process table to keep track of
processes. The operating system kernel provides system calls to create and manage
threads.
Figure 3.8: Kernel Level Thread
Advantages:
• Because the kernel has full knowledge of all threads, the scheduler may decide to
give more time to a process having a large number of threads than to a process
having a small number of threads.
• Kernel-level threads are especially good for applications that frequently block.
Disadvantages:
• Kernel-level threads are slow and inefficient. For instance, kernel thread
operations are hundreds of times slower than those of user-level threads.
• Since the kernel must manage and schedule threads as well as processes, it
requires a full thread control block (TCB) for each thread to maintain information
about threads. As a result there is significant overhead and increased kernel
complexity.
Thread States
Like processes, threads also go through some similar states, as depicted in the figure
below. The figure only shows the three main states, i.e. ready, running and blocked.
Apart from these states there are the new and terminated states, very similar to the
process states.
Figure 3.9: Thread States
The only difference between thread states and process states is that, depending on the
implementation, a running process may contain many threads, but only one of them will be
in the running state while the others are in the blocked or ready states. Thus a process
may be running while a blocked thread exists inside it. Also, with user-level threads, a
process may be blocked due to an I/O request by a thread, or may be switched to the
ready state after executing for some time, but the thread that was in the running state
at the time of the switch or I/O request will remain in the running state. Thus the
process is not in the running state, but the thread within the process is.
Self Assessment Questions
1. Write the advantages and disadvantages of user-level threads.
2. Write a note on kernel-level threads.
Summary
A process can be simply defined as a program in execution. A process, along with its
program code, comprises the program counter value, processor register contents, values
of variables, the stack and program data.
A process is created and terminated, and it follows some or all of the states of process
transition; such as New, Ready, Running, Waiting, and Exit.
A thread is a single sequence stream within a process. Because threads have some of the
properties of processes, they are sometimes called lightweight processes. There are two
types of threads: user level threads (ULT) and kernel level threads (KLT), user level
threads are mostly used on the systems where the operating system does not support
threads, but also can be combined with the kernel level threads.
Threads also have similar properties like processes e.g. execution states, context switch
etc.
Terminal Questions
1. Define process. Explain the major components of a process.
2. What are the events for process creation?
3. Explain the reasons for termination of a process.
4. Explain the process state transition with diagram.
5. Explain the event for transition of a process
1. from New to Ready
2. from Ready to Running
3. from Running to Blocked
6. What are threads?
7. State advantages and disadvantages of thread over a process.
8. What are different types of threads? Explain.
Unit 4: Memory Management
This unit covers the memory hierarchy, paging and segmentation and their policies. It
discusses cache memory, its performance, fetch and write mechanisms, and replacement
policy. It also covers associative memory.
Introduction
The part of the operating system which handles memory management is called the memory
manager. Since every process must have some amount of primary memory in order to
execute, the performance of the memory manager is crucial to the performance of the
entire system. Virtual memory refers to the technology in which some space on the hard
disk is used as an extension of main memory, so that a user program need not worry if
its size exceeds the size of the main memory.
For paging memory management, each process is associated with a page table. Each
entry in the table contains the frame number of the corresponding page in the virtual
address space of the process. This same page table is also the central data structure for
virtual memory mechanism based on paging, although more facilities are needed. It
covers the Control bits, Multi-level page table etc.
Segmentation is another popular method for both memory management and virtual
memory
Basic Cache Structure: The idea of cache memory is similar to virtual memory in that
some active portion of a low-speed memory is stored in duplicate in a higher-speed cache
memory. When a memory request is generated, the request is first presented to the cache
memory, and if the cache cannot respond, the request is then presented to main memory.
Content-Addressable Memory (CAM) is a special type of computer memory used in
certain very high speed searching applications. It is also known as associative memory,
associative storage, or associative array, although the last term is more often used for a
programming data structure.
Objectives
At the end of this unit, you will be able to understand:
• The memory hierarchy and allocation strategies
• Virtual memory and its mechanism
• Paging and segmentation
• Replacement policies and replacement algorithms
Memory Hierarchy
In addition to the responsibility of managing processes, the operating system must
efficiently manage the primary memory of the computer. The part of the operating system
which handles this responsibility is called the memory manager. Since every process
must have some amount of primary memory in order to execute, the performance of the
memory manager is crucial to the performance of the entire system. Nutt explains: “The
memory manager is responsible for allocating primary memory to processes and for
assisting the programmer in loading and storing the contents of the primary memory.
Managing the sharing of primary memory and minimizing memory access time are the
basic goals of the memory manager.”
The real challenge of efficiently managing memory is seen in the case of a system which
has multiple processes running at the same time. Since primary memory can be space-
multiplexed, the memory manager can allocate a portion of primary memory to each
process for its own use. However, the memory manager must keep track of which
processes are running in which memory locations, and it must also determine how to
allocate and de-allocate available memory when new processes are created and when old
processes complete execution. While various different strategies are used to allocate
space to processes competing for memory, three of the most popular are Best fit, Worst
fit, and First fit. Each of these strategies is described below:
• Best fit: The allocator places a process in the smallest block of unallocated
memory in which it will fit. For example, suppose a process requests 12KB of
memory and the memory manager currently has a list of unallocated blocks of
6KB, 14KB, 19KB, 11KB, and 13KB blocks. The best-fit strategy will allocate
12KB of the 13KB block to the process.
• Worst fit: The memory manager places a process in the largest block of
unallocated memory available. The idea is that this placement will create the
largest hole after the allocation, thus increasing the possibility that, compared to
best fit, another process can use the remaining space. Using the same example as
above, worst fit will allocate 12KB of the 19KB block to the process, leaving a
7KB block for future use.
• First fit: There may be many holes in the memory, so the operating system, to
reduce the amount of time it spends analyzing the available spaces, begins at the
start of primary memory and allocates memory from the first hole it encounters
large enough to satisfy the request. Using the same example as above, first fit will
allocate 12KB of the 14KB block to the process.
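The three strategies can be sketched in a few lines of Python; the function names are invented for the example, and the block list matches the 12KB request example used above.

```python
def best_fit(blocks, request):
    # Smallest unallocated block that still fits the request.
    candidates = [b for b in blocks if b >= request]
    return min(candidates) if candidates else None

def worst_fit(blocks, request):
    # Largest unallocated block, leaving the biggest usable hole behind.
    candidates = [b for b in blocks if b >= request]
    return max(candidates) if candidates else None

def first_fit(blocks, request):
    # First block (in memory order) large enough to satisfy the request.
    for b in blocks:
        if b >= request:
            return b
    return None

blocks = [6, 14, 19, 11, 13]   # unallocated block sizes in KB
print(best_fit(blocks, 12))    # 13
print(worst_fit(blocks, 12))   # 19
print(first_fit(blocks, 12))   # 14
```

The outputs match the worked example in the text: best fit chooses the 13KB block, worst fit the 19KB block, and first fit the 14KB block.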
Notice in the diagram above that the Best fit and First fit strategies both leave a tiny
segment of memory unallocated just beyond the new process. Since the amount of
memory is small, it is not likely that any new processes can be loaded here. This
condition of splitting primary memory into segments as the memory is allocated and
deallocated is known as fragmentation. The Worst fit strategy attempts to reduce the
problem of fragmentation by allocating the largest fragments to new processes. Thus, a
larger amount of space will be left as seen in the diagram above.
Another way in which the memory manager enhances the ability of the operating system
to support multiple processes running simultaneously is by the use of virtual memory.
According to Nutt, “virtual memory strategies allow a process to use the CPU when
only part of its address space is loaded in the primary memory. In this approach, each
process’s address space is partitioned into parts that can be loaded into primary memory
when they are needed and written back to secondary memory otherwise.” Another
consequence of this approach is that the system can run programs which are actually
larger than the primary memory of the system, hence the idea of “virtual memory.”
Brookshear explains how this is accomplished:
“Suppose, for example, that a main memory of 64 megabytes is required but only 32
megabytes is actually available. To create the illusion of the larger memory space, the
memory manager would divide the required space into units called pages and store the
contents of these pages in mass storage. A typical page size is no more than four
kilobytes. As different pages are actually required in main memory, the memory manager
would exchange them for pages that are no longer required, and thus the other software
units could execute as though there were actually 64 megabytes of main memory in the
machine.”
In order for this system to work, the memory manager must keep track of all the pages
that are currently loaded into the primary memory. This information is stored in a page
table maintained by the memory manager. A page fault occurs whenever a process
requests a page that is not currently loaded into primary memory. To handle page faults,
the memory manager takes the following steps:
1. The memory manager locates the missing page in secondary memory.
2. The page is loaded into primary memory, usually causing another page to be
unloaded.
3. The page table in the memory manager is adjusted to reflect the new state of the
memory.
4. The processor re-executes the instructions which caused the page fault.
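The four steps above can be sketched as a toy Python simulation. The page numbers, frame numbers and data structures are invented for illustration, and victim selection is omitted for brevity (a free frame is assumed to be available).

```python
# page_table maps virtual page number -> frame number (None if not resident)
page_table = {0: 3, 1: None, 2: 7}
free_frames = [5]

def access(page):
    """Return the frame holding `page`, handling a page fault if needed."""
    if page_table.get(page) is None:   # the requested page is not resident
        # step 1: the memory manager would locate the page in secondary memory
        frame = free_frames.pop()      # step 2: load it into a primary frame
        # (if no frame were free, some resident page would be unloaded first)
        page_table[page] = frame       # step 3: update the page table
        # step 4: the faulting instruction would then be re-executed
    return page_table[page]

print(access(1))  # page fault: page 1 is loaded into frame 5
print(access(1))  # no fault this time: the page is already resident
```

A real memory manager performs these steps in the kernel, with hardware raising the page-fault exception; the sketch only mirrors the bookkeeping.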
Virtual Memory – An Introduction
In an operating system, it is possible that a program is too large to be loaded into the
main memory. In theory, a 32-bit program may have a linear address space of up to 4
gigabytes, which is larger than the physical memory of almost all computers nowadays.
Thus we need some mechanism that allows the execution of a process that is not
completely in main memory. Overlaying is one choice. With it, the programmers have to
deal with swapping in and out themselves, to make sure that at any moment the
instruction to be executed next is physically in main memory. Obviously this places a
heavy burden on the programmers. In this unit, we introduce another solution called
virtual memory, which has been adopted by almost all modern operating systems.
Virtual memory refers to the technology in which some space on the hard disk is used as
an extension of main memory, so that a user program need not worry if its size exceeds
the size of the main memory. If that does happen, at any time only a part of the program
will reside in main memory, and the other parts will remain on the hard disk and may be
brought into memory later if needed.
This mechanism is similar to the two-level memory hierarchy we discussed before,
consisting of cache and main memory, because the principle of locality is also the basis
here. With virtual memory, if a piece of the process that is needed is not in a full
main memory, then another piece will be swapped out and the former brought in. If,
unfortunately, the swapped-out piece is needed immediately, it will have to be loaded
back into main memory right away. As we know, access to the hard disk is time-consuming
compared to access to main memory; thus references to the virtual memory space on the
hard disk could deteriorate system performance significantly. Fortunately, the principle
of locality holds: the instruction and data references during a short period tend to be
bounded to one piece of the process, so accesses to the hard disk will not be frequently
requested and performed. Thus the same principle, on the one hand, enables the caching
mechanism to increase system performance and, on the other hand, avoids the
deterioration of performance with virtual memory. With virtual memory, there must be
some facility to separate a process into several pieces so that they may reside
separately either on the hard disk or in main memory. Paging and/or segmentation are two
methods that are usually used to achieve this goal.
Paging
For paging memory management, each process is associated with a page table. Each
entry in the table contains the frame number of the corresponding page in the virtual
address space of the process. This same page table is also the central data structure for
virtual memory mechanism based on paging, although more facilities are needed.
Control bits
Since only some pages of a process may be in main memory, a bit in the page table entry,
P in Figure 1(a), is used to indicate whether the corresponding page is present in main
memory or not. Another control bit needed in the page table entry is a modified bit, M,
indicating whether the contents of the corresponding page have been altered or not since
the page was last loaded into main memory. We often speak of swapping in and swapping
out, suggesting that a process is typically separated into two parts, one residing in
main memory and the other in secondary memory, with pages moving from one part to the
other. Together they make up the whole process image. Actually the secondary memory
contains the whole image of the process, part of which may have been loaded into main
memory. When swapping out is to be performed, typically the page to be swapped out may
simply be overwritten by the new page, since a copy of that page is already in secondary
memory. However, sometimes the contents of a page may have been altered at runtime, say
a page containing data. In this case, the alteration should be reflected in secondary
memory, so when the M bit is 1, the page to be swapped out should be written out. Other
bits may also be used for sharing or protection.
Multi-level page table
Typically, there is only one page table for each process, which is completely loaded
into main memory during the execution of the process. However, some processes may be so
large that even the page table cannot be held fully in main memory. For example, in the
32-bit x86 architecture, each process may have up to 2^32 = 4G bytes of virtual memory.
With pages of 2^9 = 512 bytes, as many as 2^23 pages are needed, as well as a page table
of 2^23 entries. If each entry requires 4 bytes, that will be 2^25 bytes = 32 Mbytes.
Thus some mechanism is needed to allow only part of a page table to be loaded in main
memory. Naturally, we use paging for this. That is, page tables are subject to paging
just as other pages are; this is called multi-level paging. Figure 2 shows an example of
a two-level scheme with a 32-bit address. If we assume 4-Kbyte pages, then the 4G-byte
virtual address space is composed of 2^20 pages. If each page table entry requires 4
bytes, then a user page table of 2^20 entries requires 4 Mbytes. This huge page table
itself needs 2^10 pages. For paging it, a root page table of 2^10 entries is needed,
requiring 4 Kbytes.
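The arithmetic of the 4-Kbyte-page example can be checked with a few lines of Python:

```python
address_bits = 32
page_size = 4 * 1024                    # 4-Kbyte pages
entry_size = 4                          # bytes per page-table entry

pages = 2**address_bits // page_size    # pages in the virtual address space
table_bytes = pages * entry_size        # size of the user page table
table_pages = table_bytes // page_size  # pages needed to hold that table itself
root_bytes = table_pages * entry_size   # size of the root page table

print(pages)        # 1048576  (2^20 pages)
print(table_bytes)  # 4194304  (4 Mbytes)
print(table_pages)  # 1024     (2^10 pages)
print(root_bytes)   # 4096     (4 Kbytes)
```

The root table is small enough (4 Kbytes, exactly one page) to stay resident in main memory at all times, which is what makes the two-level scheme workable.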
Fig. 1: Typical memory management formats
With this two-level paging scheme, the root page table always remains in main memory.
The first 10 bits of a virtual address are used to index into the root page table to find an
entry for a page of the user page table. If that page is not in main memory, a page fault
occurs and the operating system is asked to load that page. If it is in main memory, then
the next 10 bits of the virtual address index into the user page table to find the entry for
the page that is referenced by the virtual address. This whole process is illustrated in
Figure 3.
Fig. 2: A two-level hierarchical page table
Fig. 3: Address translation in a two-level paging system
Translation lookaside buffer
As we discussed before, a translation lookaside buffer (TLB) may be used to speed up
paging and avoid frequent access to main memory, which is shown in Figure 4. With
multi-level paging scheme, the benefit of TLB will be even more significant.
Fig. 4: Use of a translation lookaside buffer
It should be noted that the TLB is a cache for a page table while the regular cache we
mentioned before is for main memory and these facilities should work together when
they are both present in a system. As figure 5 illustrates, for a virtual address consisting
of a page number and an offset address, the memory system consults the TLB first to see
if the matching page entry is present. If yes, the real address is generated by combining
the frame number with the offset. If not, the entry is accessed from a page table. Once the
real address is generated, the cache is consulted to see if the block containing that word is
present. If so, it is returned to the CPU. If not, the word is retrieved from main memory.
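The TLB-then-page-table part of this lookup can be sketched as a toy Python model. A real TLB is hardware; the dictionaries, page size and frame numbers here are invented for illustration.

```python
tlb = {}                     # page number -> frame number (small, fast cache)
page_table = {0: 8, 1: 2}    # the full translation, normally in main memory
PAGE_SIZE = 4096

def translate(vaddr):
    """Translate a virtual address to a real (physical) address."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    if page in tlb:                    # TLB hit: no page-table access needed
        frame = tlb[page]
    else:                              # TLB miss: consult the page table
        frame = page_table[page]
        tlb[page] = frame              # remember the translation for next time
    return frame * PAGE_SIZE + offset  # combine frame number with the offset

print(translate(4100))  # page 1, offset 4 -> frame 2 -> real address 8196
print(translate(4101))  # same page again: this time the TLB supplies the frame
```

In a real system the physical address produced here would then be presented to the cache, and only on a cache miss would main memory be accessed, exactly as described above.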
Self Assessment Questions
1. Discuss the page table with suitable example.
2. Explain the significant of control bits in paging mechanism.
3. What strategy would you follow in paging if a process demands such a large
memory space that its page table cannot be held in memory?
Cleaning policy
A cleaning policy is the opposite of a fetch policy. It deals with when a modified page
should be written out to secondary memory. There are two common choices:
• Demand cleaning: A page is written out only when it has been selected for
replacement.
• Pre-cleaning: Modified pages are updated in secondary memory before their
page frames are needed, so that pages can be written out in batches.
Pre-cleaning has an advantage over demand cleaning, but it cannot be performed too
frequently, because some pages may be modified so often that writing them out repeatedly
turns out to be unnecessary.
Frame locking
One point that is worth mentioning is that some of the frames in main memory may not
be replaced, or may be locked. For example, the frames occupied by the kernel of the
operating system, used for I/O buffers and other time-critical areas should always be
available in main memory for the operating system to operate properly. This requirement
can be satisfied by adding an additional bit in the page table.
Load control
Another related question is how many processes may be started to run and reside in main
memory simultaneously, which is called load control. Load control is critical in memory
management because, if too few processes are in main memory at any one time, it will be
very likely for all the processes to be blocked, and thus much time will be spent in
swapping. On the other hand, if too many processes exist, each individual process will be
allocated a small number of frames, and thus frequent page faulting will occur. Figure
10 shows that, with all other aspects fixed, there is a specific multiprogramming level
that achieves the highest utilization.
Fig. 10: Multiprogramming effects
Cache Memory
Basic Cache Structure
Processors are generally able to perform operations on operands faster than the access
time of large capacity main memory. Though semiconductor memory which can operate
at speeds comparable with the operation of the processor exists, it is not economical to
provide all the main memory with very high speed semiconductor memory. The problem
can be alleviated by introducing a small block of high speed memory called a cache
between the main memory and the processor.
The idea of cache memory is similar to virtual memory in that some active portion of a
low-speed memory is stored in duplicate in a higher-speed cache memory. When a
memory request is generated, the request is first presented to the cache memory, and if
the cache cannot respond, the request is then presented to main memory.
The difference between cache and virtual memory is a matter of implementation; the two
notions are conceptually the same because they both rely on the correlation properties
observed in sequences of address references. Cache implementations are totally different
from virtual memory implementation because of the speed requirements of cache.
We define a cache miss to be a reference to an item that is not resident in cache, but
is resident in main memory. The corresponding concept for virtual memory is the page
fault, which is defined to be a reference to a page that is not resident in main memory.
For cache misses, the fast memory is the cache and the slow memory is main memory. For
page faults, the fast memory is main memory and the slow memory is auxiliary memory.
Fig. 11: A cache-memory reference. The tag 0117X matches address 01173, so the cache returns the
item in the position X=3 of the matched block
An address is presented to the cache. The cache searches its directory of address tags,
shown in the figure, to see if the item is in the cache. If the item is not in the
cache, a miss occurs.
For READ operations that cause a cache miss, the item is retrieved from main memory and
copied into the cache. During the short period available before the main-memory
operation is complete, some other item in the cache is removed from the cache to make
room for the new item.
The cache-replacement decision is critical; a good replacement algorithm can yield
somewhat higher performance than a bad replacement algorithm. The effective cycle time
of a cache memory (t_eff) is the average of the cache-memory cycle time (t_cache) and
the main-memory cycle time (t_main), where the probabilities in the averaging process
are the probabilities of hits and misses.
If we consider only READ operations, then a formula for the average cycle time is:
t_eff = t_cache + (1 – h) × t_main
where h is the probability of a cache hit (sometimes called the hit rate); the quantity
(1 – h), which is the probability of a miss, is known as the miss rate.
In Fig.11 we show an item in the cache surrounded by nearby items, all of which are
moved into and out of the cache together. We call such a group of data a block of the
cache.
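As a quick worked example of the effective-cycle-time formula (the cycle times and hit rate below are invented for illustration):

```python
def effective_cycle_time(t_cache, t_main, hit_rate):
    # t_eff = t_cache + (1 - h) * t_main : every access pays the cache time,
    # and a miss (probability 1 - h) additionally pays the main-memory time.
    return t_cache + (1 - hit_rate) * t_main

# e.g. a 10 ns cache, 100 ns main memory, and a 95% hit rate
# give an effective cycle time of about 15 ns.
print(effective_cycle_time(t_cache=10, t_main=100, hit_rate=0.95))
```

Note how strongly the hit rate dominates: dropping h from 0.95 to 0.90 doubles the miss penalty term and raises the effective cycle time from about 15 ns to about 20 ns.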
Cache Memory Organizations
Fig. 12: The logical organization of a four-way set-associative cache
Fig. 12 shows a conceptual implementation of a cache memory. This system is called set
associative because the cache is partitioned into distinct sets of blocks, and each set
contains a small fixed number of blocks. The sets are represented by the rows in the
figure. In this case, the cache has N sets, and each set contains four blocks. When an
access occurs to this cache, the cache controller does not search the entire cache
looking for a match. Instead, the controller maps the address to a particular set of the
cache and searches only that set for a match.
If the block is in the cache, it is guaranteed to be in the set that is searched. Hence, if the
block is not in that set, the block is not present in the cache, and the cache controller
searches no further. Because the search is conducted over four blocks, the cache is said to
be four-way set associative or, equivalently, to have an associativity of four.
Fig. 12 is only one example; there are various ways that a cache can be arranged
internally to store the cached data. In all cases, the processor references the cache
with the main-memory address of the data it wants. Hence each cache organization must
use this address to find the data in the cache if it is stored there, or to indicate to
the processor when a miss has occurred. The problem of mapping the information held in
main memory into the cache must be totally implemented in hardware to achieve
improvements in system operation. Various strategies are possible.
Fully associative mapping
Perhaps the most obvious way of relating cached data to the main-memory address is to
store both the memory address and the data together in the cache. This is the fully
associative mapping approach. A fully associative cache requires the cache to be
composed of associative memory holding both the memory address and the data for each
cached line. The incoming memory address is simultaneously compared with all stored
addresses using the internal logic of the associative memory, as shown in Fig. 13. If a
match is found, the corresponding data is read out. Single words from anywhere within
the main memory can be held in the cache, if the associative part of the cache is
capable of holding a full address.
Fig. 13: Cache with fully associative mapping
In all organizations, the data can be more than one word, i.e., a block of consecutive
locations, to take advantage of spatial locality. In Fig. 14 a line constitutes four
words, each word being 4 bytes. The least significant part of the address selects the
particular byte, the next part selects the word, and the remaining bits form the address
compared to the address in the cache. The whole line can be transferred to and from the
cache in one transaction if there are sufficient data paths between the main memory and
the cache. With only one data-word path, the words of the line have to be transferred in
separate transactions.
Fig. 14: Fully associative mapped cache with multi-word lines
The fully associative mapped cache gives the greatest flexibility in holding
combinations of blocks in the cache and the minimum conflict for a given sized cache,
but it is also the most expensive, due to the cost of the associative memory. It
requires a replacement algorithm to select a block to remove upon a miss, and the
algorithm must be implemented in hardware to maintain a high speed of operation. The
fully associative cache can only be formed economically with a moderate capacity.
Microprocessors with small internal caches often employ the fully associative mechanism.
Direct mapping
The fully associative cache is expensive to implement because it requires a comparator
for each cache location, effectively a special type of memory. In direct mapping, the
cache consists of normal high-speed random access memory, and each location in the
cache holds the data, at an address in the cache given by the lower significant bits of
the main-memory address. This enables the block to be selected directly from the lower
significant bits of the memory address. The remaining higher significant bits of the
address are stored in the cache with the data to complete the identification of the
cached data.
Consider the example shown in Fig. 15. The address from the processor is divided into
two fields, a tag and an index. The tag consists of the higher significant bits of the
address, which are stored with the data. The index is the lower significant bits of the
address, used to address the cache.
Figure 15: Direct Mapping
When the memory is referenced, the index is first used to access a word in the cache.
Then the tag stored in the accessed word is read and compared with the tag in the address.
If the two tags are the same, indicating that the word is the one required, access is made
to the addressed cache word. However, if the tags are not the same, indicating that the
required word is not in the cache, reference is made to the main memory to find it. For a
memory read operation, the word is then transferred into the cache where it is accessed. It
is possible to pass the information to the cache and the processor simultaneously, i.e., to
read-through the cache, on a miss. The cache location is altered for a write operation. The
main memory may be altered at the same time (write-through) or later.
Fig. 15. shows the direct mapped cache with a line consisting of more than one word. The
main memory address is composed of a tag, an index, and a word within a line. All the
words within a line in the cache have the same stored tag. The index part of the address is
used to access the cache and the stored tag is compared with the required tag address. For a
read operation, if the tags are the same the word within the block is selected for transfer
to the processor. If the tags are not the same, the block containing the required word is
first transferred to the cache.
In direct mapping, the corresponding blocks with the same index in the main memory
will map into the same block in the cache, and hence only blocks with different indices
can be in the cache at the same time. A replacement algorithm is unnecessary, since there
is only one allowable location for each incoming block. Efficient replacement relies on
the low probability of lines with the same index being required. However there are such
occurrences, for example, when two data vectors are stored starting at the same index and
pairs of elements need to be processed together. To gain the greatest performance, data
arrays and vectors need to be stored in a manner which minimizes the conflicts in
processing pairs of elements. Fig. 15 shows the lower bits of the processor address used to
address the cache location directly. It is possible to introduce a mapping function between
the address index and the cache index so that they are not the same.
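The direct-mapped address split and lookup described above can be sketched in code. The field widths below (4 offset bits for a 16-word line, 6 index bits for 64 cache lines) are illustrative assumptions, not values from the text, and only the stored tags are modelled:

```python
# Direct-mapped cache lookup sketch. Assumed layout: 16-word lines -> 4
# offset bits, 64 cache lines -> 6 index bits, remaining bits form the tag.

OFFSET_BITS = 4
INDEX_BITS = 6

def split_address(addr):
    """Split an address into (tag, index, word offset within the line)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

class DirectMappedCache:
    def __init__(self):
        # Each line holds only its stored tag; the cached data is omitted.
        self.tags = [None] * (1 << INDEX_BITS)

    def access(self, addr):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, index, _ = split_address(addr)
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag  # block fetched from main memory replaces the line
        return False
```

Two addresses with the same index but different tags contend for the same line, which is exactly the conflict situation described above.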
II Set-associative mapping
In the direct scheme, all words stored in the cache must have different indices.
The tags may be the same or different. In the fully associative scheme, a block
can displace any other block and can be placed anywhere, but fully associative
memories are costly and operate relatively slowly.
Set-associative mapping allows a limited number of blocks, with the same index
and different tags, in the cache and can therefore be considered as a compromise
between a fully associative cache and a direct mapped cache. The cache is divided
into “sets” of blocks. A four-way set associative cache would have four blocks in
each set. The number of blocks in a set is known as the associativity or set size.
Each block in each set has a stored tag which, together with the index, completes
the identification of the block. First, the index of the address from the processor is
used to access the set. Then, comparators are used to compare all tags of the
selected set with the incoming tag. If a match is found, the corresponding location
is accessed, otherwise, as before, an access to the main memory is made.
Figure 16: Cache with set-associative mapping
The tag address bits are always chosen to be the most significant bits of the full
address, the block address bits are the next significant bits and the word/byte
address bits form the least significant bits as this spreads out consecutive main
memory blocks throughout consecutive sets in the cache. This addressing format
is known as bit selection and is used by all known systems. In a set-associative
cache it would be possible to have the set address bits as the most significant bits
of the address and the block address bits as the next significant, with the word
within the block as the least significant bits, or with the block address bits as the
least significant bits and the word within the block as the middle bits.
Notice that the comparison between the stored tags and the incoming tag is done
using comparators, which can be shared across sets for each associative search, and
all the information, tags and data, can be stored in ordinary random access memory. The
number of comparators required in the set-associative cache is given by the
number of blocks in a set, not the number of blocks in all, as in a fully associative
memory. The set can be selected quickly and all the blocks of the set can be read
out simultaneously with the tags before waiting for the tag comparisons to be
made. After a tag has been identified, the corresponding block can be selected.
The replacement algorithm for set-associative mapping need only consider the
lines in one set, as the choice of set is predetermined by the index in the address.
Hence, with two blocks in each set, for example, only one additional bit is
necessary in each set to identify the block to replace.
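As a sketch of the mechanism (the 4-bit index and the single replacement bit per set are assumptions for illustration), a two-way set-associative lookup might look like:

```python
# Two-way set-associative cache sketch: each set holds two stored tags and
# one bit identifying the block to replace, as described above.

INDEX_BITS = 4              # 16 sets, an assumed size for illustration
NUM_SETS = 1 << INDEX_BITS

class TwoWaySetAssocCache:
    def __init__(self):
        self.tags = [[None, None] for _ in range(NUM_SETS)]
        self.next_victim = [0] * NUM_SETS  # one replacement bit per set

    def access(self, addr):
        """Return True on a hit; on a miss, replace the set's LRU block."""
        index = addr & (NUM_SETS - 1)
        tag = addr >> INDEX_BITS
        ways = self.tags[index]
        for w in (0, 1):
            if ways[w] == tag:            # both comparators work in parallel
                self.next_victim[index] = 1 - w
                return True
        victim = self.next_victim[index]  # miss: evict least recently used way
        ways[victim] = tag
        self.next_victim[index] = 1 - victim
        return False
```

Note that, unlike direct mapping, two blocks with the same index but different tags can reside in the cache at the same time.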
III Sector Mapping
In sector mapping, the main memory and the cache are both divided into sectors;
each sector is composed of a number of blocks. Any sector in the main memory
can map into any sector in the cache and a tag is stored with each sector in the
cache to identify the main memory sector address. However, a complete sector is
not transferred to the cache or back to the main memory as one unit. Instead,
individual blocks are transferred as required. On a cache sector miss, the required
block of the sector is transferred into a specific location within one sector. The
sector location in the cache is selected and, until they are replaced, all the other
blocks in that cache sector still hold data from a previous sector.
Sector mapping might be regarded as a fully associative mapping scheme with
valid bits, as in some microprocessor caches. Each block in the fully associative
mapped cache corresponds to a sector, and each byte corresponds to a “sector
block”.
Self Assessment Questions
1. Discuss the basic ideas of using the cache memory.
2. Write a note on:
a. Cache Memory Organization b. Direct Mapping
3. Explain the cache with set-associative mapping with neat diagram.
Cache Performance
The performance of a cache can be quantified in terms of the hit and miss rates, the cost
of a hit, and the miss penalty, where a cache hit is a memory access that finds data in the
cache and a cache miss is one that does not.
When reading, the cost of a cache hit is roughly the time to access an entry in the cache.
The miss penalty is the additional cost of replacing a cache line with one containing the
desired data.
(Access time) = (hit cost) + (miss rate)*(miss penalty)
= (Fast memory access time) + (miss rate)*(slow memory access time)
Note that the approximation is an underestimate – control costs have been left out. Also
note that only one word is being loaded from the faster memory while a whole cache
block’s worth of data is being loaded from the slower memory.
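Plugging in illustrative numbers (a 2 ns hit cost, a 50 ns miss penalty and a 5% miss rate; these are assumed values, not from the text), the approximation gives:

```python
# Average access time = (hit cost) + (miss rate) * (miss penalty)
hit_cost = 2e-9       # fast (cache) access time, assumed
miss_penalty = 50e-9  # slow (main memory) access time, assumed
miss_rate = 0.05      # assumed

access_time = hit_cost + miss_rate * miss_penalty
# 2 ns + 0.05 * 50 ns = 4.5 ns average
```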
Since the speeds of the actual memory used will be improving “independently”, most
effort in cache design is spent on fast control and decreasing the miss rates. We can
classify misses into three categories, compulsory misses, capacity misses and conflict
misses. Compulsory misses are when data is loaded into the cache for the first time (e.g.
program startup) and are unavoidable. Capacity misses are when data is reloaded because
the cache is not large enough to hold all the data no matter how we organize the data (i.e.
even if we changed the hash function and made it omniscient). All other misses are
conflict misses – there is theoretically enough space in the cache to avoid the miss but our
fast hash function caused a miss anyway.
Fetch and write mechanism
Fetch policy
We can identify three strategies for fetching bytes or blocks from the main memory to the
cache, namely:
1. Demand fetch
Which is fetching a block when it is needed and is not already in the cache,
i.e. to fetch the required block on a miss. This strategy is the simplest and requires
no additional hardware or tags in the cache recording the references, except to
identify the block in the cache to be replaced.
2. Pre-fetch
Which is fetching blocks before they are requested. A simple prefetch strategy is
to prefetch the (i+1)th block when the ith block is initially referenced on the
expectation that it is likely to be needed if the ith block is needed. On the simple
prefetch strategy, not all first references will induce a miss, as some will be to
prefetched blocks.
3. Selective fetch
Which is the policy of not always fetching blocks, dependent upon some defined
criterion, and in these cases using the main memory rather than the cache to hold
the information. For example, shared writable data might be easier to maintain if
it is always kept in the main memory and not passed to a cache for access,
especially in multi-processor systems. Cache systems need to be designed so that
the processor can access the main memory directly and bypass the cache.
Individual locations could be tagged as non-cacheable.
Instruction and data caches
The basic stored program computer provides for one main memory for holding
both program instructions and program data. The cache can be organized in the
same fashion, with the cache holding both program instructions and data. This is
called a unified cache. We also can separate the cache into two parts: data cache
and instruction (code) cache. The general arrangement of separate caches is
shown in fig. 17. Often the cache will be integrated inside the processor chip.
Figure 17: Separate instruction and data caches
Write operations
As reading the required word in the cache does not affect the cache contents, there
can be no discrepancy between the cache word and the copy held in the main
memory after a memory read instruction. However, in general, writing can occur
to cache words and it is possible that the cache word and copy held in the main
memory may be different. It is necessary to keep the cache and the main memory
copy identical if input/output transfers operate on the main memory contents, or if
multiple processors operate on the main memory, as in a shared memory multiple
processor system.
If we ignore the overhead of maintaining consistency and the time for writing data
back to the main memory, then the average access time is given by the previous
equation, i.e. teff = tcache + ( 1 – h ) tmain , assuming that all accesses are first made
to the cache. The average access time including write operations will add
additional time to this equation that will depend upon the mechanism used to
maintain data consistency. There are two principal alternative mechanisms to
update the main memory, namely the write-through mechanism and the write-
back mechanism.
Write-through mechanism
In the write-through mechanism, every write operation to the cache is repeated to
the main memory, normally at the same time. The additional write operation to
the main memory will, of course, take much longer than to the cache and will
dominate the access time for write operations. The average access time of write-
through with transfers from main memory to the cache on all misses (read and
write) is given by:
ta = tcache + ( 1 – h ) ttrans + w(tmain – tcache)
= (1 – w) tcache + (1 – h) ttrans + w tmain
where
ttrans = time to transfer a block to the cache, assuming the whole block must be
transferred together
w = fraction of write references.
The term (tmain - tcache) is the additional time to write the word to main memory
whether a hit or a miss has occurred, given that both cache and main memory
write operation occur simultaneously but the main memory write operation must
complete before any subsequent cache read/write operation can proceed. If the
size of the block matches the external data path size, a whole block can be
transferred in one transaction and
ttrans = tmain.
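As a numerical illustration of the write-through expression (the timings and ratios below are assumed values chosen only for illustration, with ttrans = tmain since the block matches the data path):

```python
# Write-through average access time:
# ta = (1 - w) tcache + (1 - h) ttrans + w tmain
t_cache = 10.0    # ns, assumed cache access time
t_main = 100.0    # ns, assumed main memory access time
t_trans = t_main  # block size matches the external data path
h = 0.95          # hit ratio, assumed
w = 0.2           # fraction of write references, assumed

t_a = (1 - w) * t_cache + (1 - h) * t_trans + w * t_main
# 8 ns + 5 ns + 20 ns = 33 ns
```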
On a cache miss, a block could be transferred from the main memory to the cache
whether the miss was caused by a write or by a read operation. The term allocate
on write is used to describe a policy of bringing a word/block from the main
memory into the cache for a write operation. In write-through, fetch on write
transfers are often not done on a miss, i.e., a non-allocate on write policy. The
information will be written back to the main memory but not kept in the cache.
The write-through scheme can be enhanced by incorporating buffers, as shown in
Fig. 18, to hold information to be written back to the main memory, freeing the
cache for subsequent accesses.
Figure 18: Cache with write buffer
For write-through, each item to be written back to the main memory is held in a
buffer together with the corresponding main memory address if the transfer
cannot be made immediately. Immediate writing to main memory when new
values are generated ensures that the most recent values are held in the main
memory and hence that any device or processor accessing the main memory
should obtain the most recent values immediately, thus avoiding the need for
complicated consistency mechanisms. There will be latency before the main
memory has been updated, and the cache and main memory values are not
consistent during this period.
2. Write-back mechanism
In the write-back mechanism, the write operation to the main memory is only
done at block replacement time. At this time, the block displaced by the incoming
block might be written back to the main memory irrespective of whether the block
has been altered. The policy is known as simple write-back, and leads to an
average access time of:
ta = tcache + ( 1 – h ) ttrans + (1 – h) ttrans
Where one (1 – h) ttrans term is due to fetching a block from memory and the other
(1 – h) ttrans term is due to writing back a block. Write-back normally handles
write misses as allocate on write, as opposed to write-through, which often
handles write misses as non-allocate on write.
The write-back mechanism usually only writes back lines that have been altered.
To implement this policy, a 1-bit tag is associated with each cache line and is set
whenever the block is altered. At replacement time, the tags are examined to
determine whether it is necessary to write the block back to the main memory.
The average access time now becomes:
ta = tcache + ( 1 – h ) ttrans + wb(1 – h) ttrans
where wb is the probability that a block has been altered (fraction of blocks
altered). The probability that a block has been altered could be as high as the
probability of write references, w, but is likely to be much less, as more than one
write reference to the same block is likely and some references to the same
byte/word within the block are likely. However, under this policy the complete
block is written back, even if only one word in the block has been altered, and
thus the policy results in more traffic than is necessary, especially for memory
data paths narrower than a line, but still there is usually less memory traffic than
write-through, which causes every alteration to be recorded in the main memory.
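As a numerical illustration of the write-back expression with the altered-block (dirty) tag (the timings and probabilities below are assumed values, not from the text):

```python
# Write-back average access time with dirty-bit tagging:
# ta = tcache + (1 - h) ttrans + wb (1 - h) ttrans
t_cache = 10.0   # ns, assumed cache access time
t_trans = 100.0  # ns, assumed block transfer time
h = 0.95         # hit ratio, assumed
wb = 0.5         # probability a replaced block has been altered, assumed

t_a = t_cache + (1 - h) * t_trans + wb * (1 - h) * t_trans
# 10 ns + 5 ns + 2.5 ns = 17.5 ns
```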
The write-back scheme can also be enhanced by incorporating buffers to hold
information to be written back to the main memory, just as is possible and
normally done with write-through.
Self Assessment Questions
1. List and explain the various activities involved in fetch and write
mechanism.
2. When is the write-back mechanism used, and what is its average access time?
Replacement policy
When the required word of a block is not held in the cache, we have seen that it is
necessary to transfer the block from the main memory into the cache, displacing an
existing block if the cache is full. Except for direct mapping, which does not allow a
replacement algorithm, the existing block in the cache is chosen by a replacement
algorithm. The replacement mechanism must be implemented totally in hardware,
preferably such that the selection can be made completely during the main memory cycle
for fetching the new block. Ideally, the block replaced will not be needed again in the
future. However, such future events cannot be known and a decision has to be made
based upon facts that are known at the time.
1. Random replacement algorithm
Perhaps the easiest replacement algorithm to implement is a pseudo-random
replacement algorithm. A true random replacement algorithm would select a
block to replace in a totally random order, with no regard to memory references or
previous selections; practical random replacement algorithms can approximate
this algorithm in one of several ways. For example, one counter for the whole
cache could be incremented at intervals (for example after each clock cycle, or
after each reference, irrespective of whether it is a hit or a miss). The value held in
the counter identifies the block in the cache (if fully associative) or the block in
the set if it is a set-associative cache. The counter should have sufficient bits to
identify any block. For a fully associative cache, an n-bit counter is necessary if
there are 2^n blocks in the cache. For a four-way set-associative cache, one 2-bit
counter would be sufficient, together with logic to increment the counter.
2. First-in first-out replacement algorithm
The first-in first-out replacement algorithm removes the block that has been in the
cache for the longest time. The first-in first-out algorithm would naturally be
implemented with a first-in first-out queue of block addresses, but can be more
easily implemented with counters, only one counter for a fully associative cache
or one counter for each set in a set-associative cache, each with a sufficient
number of bits to identify the block.
3. Least recently used algorithm for a cache
In the least recently used (LRU) algorithm, the block which has not been
referenced for the longest time is removed from the cache. Only those blocks in
the cache are considered. The word “recently” is used because the block chosen is
not the least used overall; the least used blocks are likely to be back in main memory.
It is the least used of those blocks in the cache, and all of those are likely to have
been used recently, otherwise they would not be in the cache. The least recently used (LRU)
algorithm is popular for cache systems and can be implemented fully when the
number of blocks involved is small. There are several ways the algorithm can be
implemented in hardware for a cache, these include:
1) Counters
In the counter implementation, a counter is associated with each block. A simple
implementation would be to increment each counter at regular intervals and to
reset a counter when the associated line had been referenced. Hence the value in
each counter would indicate the age of a block since last referenced. The block
with the largest age would be replaced at replacement time.
2) Register stack
In the register stack implementation, a set of n-bit registers is formed, one for
each block in the set to be considered. The most recently used block is recorded at
the “top” of the stack and the least recently used block at the bottom. Actually, the
set of registers does not form a conventional stack, as both ends and internal
values are accessible. The value held in one register is passed to the next register
under certain conditions. When a block is referenced, starting at the top of the
stack, the values held in the registers are shifted
one place towards the bottom of the stack until a register is found to hold the same
value as the incoming block identification. Subsequent registers are not shifted.
The top register is loaded with the incoming block identification. This has the
effect of moving the contents of the register holding the incoming block number
to the top of the stack. This logic is fairly substantial and slow, and not really a
practical solution.
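The register-stack behaviour can be mimicked with an ordinary list (a software sketch, not the hardware shift logic itself):

```python
def touch(stack, block):
    """Reference `block`: move it to the top (most recently used) position.
    Entries above the referenced block shift down one place, as in the
    register stack; entries below it are untouched. On a miss, the bottom
    (least recently used) entry is evicted."""
    if block in stack:
        stack.remove(block)
    else:
        stack.pop()  # evict the LRU block at the bottom
    stack.insert(0, block)

recency = [3, 1, 0, 2]  # top (MRU) ... bottom (LRU)
touch(recency, 0)       # hit on block 0: it moves to the top
touch(recency, 5)       # miss: block 2 (the LRU) is evicted
```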
Fig. 19
3) Reference matrix
The reference matrix method centers around a matrix of status bits. There is more
than one version of the method. In one version (Smith, 1982), the upper triangular
matrix of a B × B matrix is formed without the diagonal, if there are B blocks to
consider. The triangular matrix has (B * (B – 1))/2 bits. When the ith block is
referenced, all the bits in the ith row of the matrix are set to 1 and then all the bits
in the ith column are set to 0. The least recently used block is one which has all
0’s in its row and all 1’s in its column, which can be detected easily by logic. The
method is demonstrated in Fig. 19 for
B = 4 and the reference sequence 2, 1, 3, 0, 3, 2, 1, …, together with the values
that would be obtained using a register stack.
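The reference-matrix update rule can be sketched as follows. For simplicity this sketch stores the full B × B bit matrix rather than only the upper triangle (the triangular version keeps the same information in (B * (B – 1))/2 bits):

```python
class ReferenceMatrixLRU:
    """Reference-matrix LRU tracking (full-matrix formulation)."""
    def __init__(self, b):
        self.b = b
        self.m = [[0] * b for _ in range(b)]  # B x B matrix of status bits

    def reference(self, i):
        # When block i is referenced: set row i to all 1s, then clear column i.
        for j in range(self.b):
            self.m[i][j] = 1
        for row in self.m:
            row[i] = 0

    def lru(self):
        # The least recently used block is the one whose row is all 0s
        # (blocks never referenced also qualify).
        for i, row in enumerate(self.m):
            if not any(row):
                return i
```

Running the reference sequence 2, 1, 3, 0 from the text leaves block 2 with an all-zero row, identifying it as least recently used.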
4) Approximate methods.
When the number of blocks to consider increases above about four to eight,
approximate methods are necessary for the LRU algorithm. Fig. 20 shows a two-
stage approximation method with eight blocks, which is applicable to any
replacement algorithm. The eight blocks in Fig. 20 are divided into four pairs, and
each pair has one status bit to indicate the most/least recently used block in the
pair (simply set or reset by reference to each block). The least recently used
replacement algorithm now only considers the four pairs. Six status bits are
necessary (using the reference matrix) to identify the least recently used pair
which, together with the status bit of the pair, identifies the least recently used
block of a pair.
Figure 20: Two-stage replacement algorithm
The method can be extended to further levels. For example, sixteen blocks can be
divided into four groups, each group having two pairs. One status bit can be
associated with each pair, identifying the block in the pair, and another with each
group, identifying the group in a pair of groups. A true least recently used
algorithm is applied to the groups. In fact, the scheme could be taken to its logical
conclusion of extending to a full binary tree.
Fig. 21 gives an example. Here, there are four blocks in a set. One status bit, B0,
specifies which half of the blocks is most/least recently used. Two more bits, B1
and B2, specify which block of each pair is most/least recently used. Every time a
cache block is referenced (or loaded on a miss), the status bits are updated. For
example, if block L2 is referenced, B2 is set to a 0 to indicate that L2 is the most
recently used of the pair L2 and L3. B0 is set to a 1 to indicate that L2/L3 is the most
recently used of the four blocks, L0, L1, L2 and L3. To identify the line to replace
on a miss, the status bits are examined. If B0 = 0, then the block is either L0 or L1.
If then B1 = 0, it is L0.
Figure 21: Replacement algorithm using a tree selection
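A sketch of the tree selection for a four-block set follows. The bit polarity used here is one common convention (each bit points toward the less recently used half), which may differ from the figure's labelling:

```python
class TreePLRU4:
    """Tree-based pseudo-LRU for one four-block set (L0..L3)."""
    def __init__(self):
        # b0 chooses between halves {L0,L1} and {L2,L3};
        # b1 and b2 choose within each pair.
        self.b0 = self.b1 = self.b2 = 0

    def access(self, k):
        # Point every bit on the path away from the block just used.
        if k < 2:
            self.b0 = 1              # right half {L2,L3} is now the LRU side
            self.b1 = 1 if k == 0 else 0
        else:
            self.b0 = 0              # left half {L0,L1} is now the LRU side
            self.b2 = 1 if k == 2 else 0

    def victim(self):
        # Follow the bits to the (approximately) least recently used block.
        if self.b0 == 0:
            return 0 if self.b1 == 0 else 1
        return 2 if self.b2 == 0 else 3
```

After accessing L0, L1, L2, L3 in order, the victim is L0 (the true LRU block); the scheme is only an approximation, so after a further reference it may pick a block that is not the exact LRU.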
Self Assessment Questions
1. Discuss the various types of memory replacement algorithms in brief.
2. Write a note on:
a. Register Stack method b. Reference matrix method
Second-level caches
When the cache is integrated into the processor, it will be impossible to increase its size
should the performance not be sufficient. In any case, increasing the size of the cache
may create a slower cache. As an alternative, which has become very popular, a second
larger cache can be introduced between the first cache and the main memory as shown in
Fig. 22. This “second-level” cache is sometimes called a secondary cache.
Figure 22: Two-level caches
On a memory reference, the processor will access the first-level cache. If the information
is not found there (a first-level cache miss occurs), the second-level cache will be
accessed. If it is not in the second cache (a second-level cache miss occurs), then the
main memory must be accessed. Memory locations will be transferred to the second-level
cache and then to the first-level cache, so that two copies of a memory location will exist
in the cache system at least initially, i.e., locations cached in the second-level cache also
exist in the first-level cache. This is known as the Principle of Inclusion. (Of course, the
copies held in the second-level cache will not normally be accessed while the locations
remain in the first-level cache.) Whether this continues will depend upon the replacement and
write policies. The replacement policy practiced in both caches would normally be the
least recently used algorithm. Normally write-through will be practiced between the
caches, which will maintain duplicate copies. The block size of the second-level cache
will be at least the same size as, if not larger than, that of the first-level cache, because
otherwise on a first-level cache miss, more than one second-level cache line would need
to be transferred into the first-level cache block.
Optimizing the data cache performance
When we deal with multiple arrays with some arrays accessed by rows and some by
columns, storing the arrays row-by-row or column-by-column does not solve the
problem because both rows and columns are used in each iteration of the loop. We must
bring the same data into the cache again and again if the cache is not large enough to hold
all the data, which is a waste. We will use a matrix multiplication (C = A.B, where A, B,
and C are respectively m x p, p x n, and m x n matrices) as an example to show how to
utilize the locality to improve cache performance.
Principle of Locality
Since code is generally executed sequentially, virtually all programs repeat sections of
code and repeatedly access the same or nearby data. This characteristic is embodied in
the Principle of Locality, which has been found empirically to be obeyed by most
programs. It applies to both instruction references and data references, though it is more
likely in instruction references. It has two main aspects:
1. Temporal locality (locality in time) – individual locations, once referenced, are
likely to be referenced again in the near future.
2. Spatial locality (locality in space) – references, including the next location, are
likely to be near the last reference.
Temporal locality is found in instruction loops, data stacks and variable accesses. Spatial
locality describes the characteristic that programs access a number of distinct regions.
Sequential locality describes sequential locations being referenced and is a main attribute
of program construction. It can also be seen in data accesses, as data items are often
stored in sequential locations.
Taking advantage of temporal locality
When instructions are formed into loops which are executed many times, the length of a
loop is usually quite small. Therefore once a cache is loaded with loops of instructions
from the main memory, the instructions are used more than once before new instructions
are required from the main memory. The same situation applies to data; data is repeatedly
accessed. Suppose the reference is repeated n times in all during a program loop and after
the first reference, the location is always found in the cache, then the average access time
would be:
ta = (n*tcache + tmain)/n = tcache + tmain/n
where n = number of references. As n increases, the average access time decreases. The
increase in speed will, of course, depend upon the program. Some programs might have a
large amount of temporal locality, while others have less. We can do some optimization
about this.
Taking advantage of spatial locality
To take advantage of spatial locality, we will transfer not just one byte or word from the
main memory to the cache (and vice versa) but a series of sequential locations called a
block. We have assumed that it is necessary to reference the cache before a reference is
made to the main memory to fetch a word, and it is usual to look into the cache first to
see if the information is held there.
Data Blocking
For the matrix multiplication C = A.B, suppose we write the code as below:

for (I = 0; I < m; I++)
  for (J = 0; J < n; J++) {
    R = 0;
    for (K = 0; K < p; K++)
      R = R + A[I][K] * B[K][J];
    C[I][J] = R;
  }
The two inner loops read all p by n elements of B and access the same p elements in a
row of A repeatedly, and write one row of n elements of C. The number of capacity
misses clearly depends on the dimension parameters: m, n, p and the size of the cache. If
the cache can hold all three matrices, then all is well, provided there are no cache
conflicts. In the worst case, there would be (2*m*n*p + m*n) words read from memory
for m*n*p operations.
To enhance the cache performance if it is not big enough, we use an optimization
technique: blocking. The block method for this matrix product consists of:
• Splitting the result matrix C into blocks CI,J of size Nb x Nb; each block is computed
in a contiguous array Cb which is then copied back into the right CI,J.
• Splitting matrices A and B into panels AI and BJ of size (Nb x p) and (p x Nb); each
panel is copied into contiguous arrays Ab and Bb. The choice of Nb must ensure
that Cb, Ab and Bb fit into one level of cache, usually the L2 cache.
Then we rewrite the code as:
for (I = 0; I < m/Nb; I++) {
  Ab = AI;
  for (J = 0; J < n/Nb; J++) {
    Bb = BJ;
    Cb = 0;
    for (K = 0; K < p/Nb; K++)
      Cb = Cb + AbK * BKb;
    CI,J = Cb;
  }
}
Here “=” means assignment for matrices.
We suppose for simplicity that Nb divides m, n and p. Figure 23 below may help you
in understanding the operations performed on blocks. With this algorithm, matrix A
is loaded into the cache only once, compared with the n accesses of the original
version, while matrix B is still accessed m times. This simple block method greatly
reduces memory accesses, and real codes may choose, by looking at the matrix sizes,
which loop structure (ijk vs. jik) is most appropriate and whether some matrix
operand fits entirely into the cache.
Figure 23
In the above we have not discussed the use of the L1 cache. In fact, L1 will generally
be too small to hold a CI,J block and one panel each of A and B, but remember that the
operation Cb = Cb + AbK*BKb is itself a matrix-matrix product, so each operand AbK
and BKb is accessed Nb times: this part could also use a block method. Since Nb is
relatively small, the implementation may load only one of Cb, AbK and BKb into the
L1 cache and work with the others from L2.
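The blocked loop structure described above can be sketched as runnable code (plain Python lists; the explicit copies into contiguous arrays Ab, Bb and Cb are elided, since they only matter for real cache behaviour, but the tiled loop nest is the same):

```python
def matmul_blocked(A, B, m, n, p, Nb):
    """Compute C = A.B (A is m x p, B is p x n) with Nb x Nb tiles.
    Assumes Nb divides m, n and p, as in the text."""
    C = [[0.0] * n for _ in range(m)]
    for I in range(0, m, Nb):
        for J in range(0, n, Nb):
            for K in range(0, p, Nb):
                # Accumulate the tile product of A[I:I+Nb, K:K+Nb]
                # and B[K:K+Nb, J:J+Nb] into the C tile.
                for i in range(I, I + Nb):
                    for k in range(K, K + Nb):
                        a = A[i][k]
                        for j in range(J, J + Nb):
                            C[i][j] += a * B[k][j]
    return C
```

The three outer loops walk over tiles, so at any moment only one tile of each matrix is "hot"; in a compiled implementation Nb would be chosen so that the three tiles fit in the target cache level.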
Summary
This unit dealt with memory management, one of the responsibilities of the operating
system. It covered the memory hierarchy, paging and page handling, segmentation with
its policies and algorithms, cache memory, cache memory organization and associative
mapping, and cache performance. Managing the sharing of primary and secondary
memory and minimizing memory access time are the vital goals of memory
management. The unit also covered the memory fetch and write mechanisms,
replacement policies, etc.
Terminal Questions
1. Memory management is important in operating systems. Discuss the main
problems that can occur if memory is managed poorly.
2. Explain the difference between logical and physical addresses.
3. Consider a paging system with a page-table stored in memory. If a memory
reference takes 200 nanoseconds, how long does a paged memory reference take? If
we add associative registers, and 75 percent of all page table references are found in
the associative registers, what is the effective memory reference time? (Assume that
looking for (and maybe finding) a page-table entry in the associative memory takes
zero time).
4. Consider a demand-paging system with a paging disk that has an average access
and transfer time of 20 milliseconds. Addresses are translated through a page table in
main memory, with an access time of 1 microsecond per memory access. Thus, each
memory reference through the page table takes two accesses. To improve this time we
have added an associative memory that reduces access time to one memory reference
if the page table entry is in the associative memory. Assume that 80 percent of the
accesses are in the associative memory and that of the remaining, 10 percent (or 2
percent of the total) cause page faults. What is the effective memory access time?
5. We have discussed LRU as an attempt to predict future memory access patterns
based on previous access patterns (i.e. if we haven’t accessed a particular page in a
while, we are not likely to reference it again soon). Another idea that some
researchers have explored is to record the memory reference pattern from the last
time the program was run and use it to predict what it will access next time. Discuss
the positive and negative aspects of this idea.
Unit 5: CPU Scheduling
This unit covers Brief introduction of CPU scheduling, scheduling criteria and various
types of scheduling algorithms. Multiple-Processing scheduling and thread scheduling.
Introduction
Almost all programs have some alternating cycle of CPU number crunching and waiting
for I/O of some kind. (Even a simple fetch from memory takes a long time relative to
CPU speeds.). In a simple system running a single process, the time spent waiting for I/O
is wasted, and those CPU cycles are lost forever. A scheduling system allows one process
to use the CPU while another is waiting for I/O, thereby making full use of otherwise lost
CPU cycles. The challenge is to make the overall system as “efficient” and “fair” as
possible, subject to varying and often dynamic conditions, and where “efficient” and
“fair” are somewhat subjective terms, often subject to shifting priority policies.
Objective:
At the end of this unit, you will be able to understand the:
• CPU-I/O Burst Cycle
• CPU Scheduler
• Scheduling Algorithms
• Multiple-Processor Scheduling
• Symmetric Multithreading
• Thread Scheduling
• Algorithm Evaluation
CPU-I/O Burst Cycle
Almost all processes alternate between two states in a continuing cycle, as shown in Figure
5.1 below:
• A CPU burst of performing calculations, and
• An I/O burst, waiting for data transfer in or out of the system.
Fig. 5.1: Alternating sequence of CPU and I/O Bursts
CPU bursts vary from process to process and from program to program, but extensive
studies show frequency patterns similar to those shown in Figure 5.2:
Fig. 5.2: Histogram of CPU-burst durations
Self Assessment Questions
1. Discuss the process alternate between two states in a continuing cycle.
2. Explain preemptive scheduling and non preemptive scheduling.
3. What is dispatcher?
CPU Scheduler
Whenever the CPU becomes idle, it is the job of the CPU Scheduler (a.k.a. the short-term
scheduler) to select another process from the ready queue to run next. The storage
structure for the ready queue and the algorithm used to select the next process are not
necessarily a FIFO queue. There are several alternatives to choose from, as well as
numerous adjustable parameters for each algorithm, which is the basic subject of this
entire unit.
Preemptive Scheduling
CPU scheduling decisions take place under one of four conditions:
1. When a process switches from the running state to the waiting state, such as for an I/O
request or invocation of the wait( ) system call.
2. When a process switches from the running state to the ready state, for example in
response to an interrupt.
3. When a process switches from the waiting state to the ready state, say at completion of
I/O or a return from wait( ).
4. When a process terminates.
For conditions 1 and 4 there is no choice – A new process must be selected. For
conditions 2 and 3 there is a choice – To either continue running the current process, or
select a different one. If scheduling takes place only under conditions 1 and 4, the system
is said to be non-preemptive, or cooperative. Under these conditions, once a process
starts running it keeps running, until it either voluntarily blocks or until it finishes.
Otherwise the system is said to be preemptive. Windows used non-preemptive scheduling
up to Windows 3.x, and started using pre-emptive scheduling with Win95. Macs used
non-preemptive prior to OSX, and pre-emptive since then. Note that pre-emptive
scheduling is only possible on hardware that supports a timer interrupt. It is to be noted
that pre-emptive scheduling can cause problems when two processes share data, because
one process may get interrupted in the middle of updating shared data structures.
Preemption can also be a problem if the kernel is busy implementing a system call (e.g.
updating critical kernel data structures) when the preemption occurs. Most modern
UNIXes deal with this problem by making the process wait until the system call has
either completed or blocked before allowing the preemption. Unfortunately, this solution is
problematic for real-time systems, as real-time response can no longer be guaranteed.
Some critical sections of code protect themselves from concurrency problems by
disabling interrupts before entering the critical section and re-enabling interrupts on
exiting the section. Needless to say, this should only be done in rare situations, and only
on very short pieces of code that will finish quickly (usually just a few machine
instructions).
Dispatcher
The dispatcher is the module that gives control of the CPU to the process selected by the
scheduler. This function involves:
• Switching context.
• Switching to user mode.
• Jumping to the proper location in the newly loaded program.
The dispatcher needs to be as fast as possible, as it is run on every context switch. The
time consumed by the dispatcher is known as dispatch latency.
Scheduling Criteria
There are several different criteria to consider when trying to select the “best” scheduling
algorithm for a particular situation and environment, including:
• CPU utilization – Ideally the CPU would be busy 100% of the time, so as to waste no
CPU cycles. On a real system CPU usage should range from 40% (lightly loaded) to
90% (heavily loaded).
• Throughput – The number of processes completed per unit time. May range from
10 per second to 1 per hour depending on the specific processes.
• Turnaround time – The time required for a particular process to complete, from
submission time to completion (wall-clock time).
• Waiting time – How much time processes spend in the ready queue waiting their
turn to get on the CPU.
• Load average – The average number of processes sitting in the ready queue waiting
their turn to get onto the CPU. Reported as 1-minute, 5-minute, and 15-minute
averages by “uptime” and “who”.
• Response time – The time taken in an interactive program from the issuance of a
command to the commencement of a response to that command.
In general one wants to optimize the average value of a criterion (maximize CPU
utilization and throughput, and minimize all the others). However, sometimes one wants
to do something different, such as to minimize the maximum response time. Sometimes it
is more desirable to minimize the variance of a criterion than its actual value, i.e. users
are more accepting of a consistent, predictable system than an inconsistent one, even if it
is a little bit slower.
Scheduling Algorithms
The following subsections will explain several common scheduling strategies, looking at
only a single CPU burst each for a small number of processes. Obviously real systems
have to deal with a lot more simultaneous processes executing their CPU-I/O burst
cycles.
First-Come First-Serve Scheduling, FCFS
FCFS is very simple – Just a FIFO queue, like customers waiting in line at the bank or
the post office or at a copying machine. Unfortunately, however, FCFS can yield some
very long average wait times, particularly if the first process to get there takes a long
time. For example, consider the following three processes:
Process Burst Time
P1 24
P2 3
P3 3
In the first Gantt chart below, process P1 arrives first. The average waiting time for the
three processes is (0 + 24 + 27) / 3 = 17.0 ms. In the second Gantt chart below, the same
three processes have an average wait time of
(0 + 3 + 6) / 3 = 3.0 ms. The total run time for the three bursts is the same, but in the
second case two of the three finish much quicker, and the other process is only delayed
by a short amount.
FCFS can also block the system in a busy dynamic system in another way, known as the
convoy effect. When one CPU intensive process blocks the CPU, a number of I/O
intensive processes can get backed up behind it, leaving the I/O devices idle. When the
CPU hog finally relinquishes the CPU, then the I/O processes pass through the CPU
quickly, leaving the CPU idle while everyone queues up for I/O, and then the cycle
repeats itself when the CPU intensive process gets back to the ready queue.
Shortest-Job-First Scheduling, SJF
The idea behind the SJF algorithm is to pick the quickest little job that needs to be
done, get it out of the way first, and then pick the next smallest job to do next.
(Technically this algorithm picks a process based on the next shortest CPU burst, not the
overall process time.) For example, the Gantt chart below is based upon the following
CPU burst times (and the assumption that all jobs arrive at the same time):
Process Burst Time
P1 6
P2 8
P3 7
P4 3
In the case above the average wait time is (0 + 3 + 9 + 16) / 4 = 7.0 ms, (as opposed to
10.25 ms for FCFS for the same processes.)
SJF can be proven to be optimal with respect to average waiting time, but it suffers from
one important problem: How do you know how long the next CPU burst is going to be?
• For long-term batch jobs this can be done based upon the limits that users set for
their jobs when they submit them, which encourages them to set low limits, but risks
their having to re-submit the job if they set the limit too low. However, that does not
work for short-term CPU scheduling on an interactive system.
• Another option would be to statistically measure the run-time characteristics of jobs,
particularly if the same tasks are run repeatedly and predictably. But once again that
really isn’t a viable option for short-term CPU scheduling in the real world.
• A more practical approach is to predict the length of the next burst, based on some
historical measurement of recent burst times for this process. One simple, fast, and
relatively accurate method is the exponential average, which can be defined as
follows:
estimate[i + 1] = alpha * burst[i] + (1.0 – alpha) * estimate[i]

In this scheme the previous estimate contains the history of all previous times, and alpha
serves as a weighting factor for the relative importance of recent data versus past history.
If alpha is 1.0, then past history is ignored, and we assume the next burst will be the same
length as the last burst. If alpha is 0.0, then all measured burst times are ignored, and we
just assume a constant burst time. Most commonly alpha is set at 0.5, as illustrated in
Figure 5.3:
Fig. 5.3: Prediction of the length of the next CPU burst
SJF can be either preemptive or non-preemptive. Preemption occurs when a new process
arrives in the ready queue that has a predicted burst time shorter than the time remaining
in the process whose burst is currently on the CPU. Preemptive SJF is sometimes referred
to as shortest remaining time first scheduling. For example, the following Gantt chart is
based upon the following data:
Process Arrival Time Burst Time
P1 0 8
P2 1 4
P3 2 9
P4 3 5
The average wait time in this case is ((10 – 1) + 0 + (17 – 2) + (5 – 3)) / 4 = 26 / 4 = 6.5 ms
for P1 through P4 respectively (as opposed to 7.75 ms for non-preemptive SJF or 8.75 ms
for FCFS).
Priority Scheduling
Priority scheduling is a more general case of SJF, in which each job is assigned a priority
and the job with the highest priority gets scheduled first. (SJF uses the inverse of the next
expected burst time as its priority – The smaller the expected burst, the higher the
priority.)
Note that in practice, priorities are implemented using integers within a fixed range, but
there is no agreed-upon convention as to whether “high” priorities use large numbers or
small numbers. This book uses low number for high priorities, with 0 being the highest
possible priority. For example, the following Gantt chart is based upon these process
burst times and priorities, and yields an average waiting time of 8.2 ms:
Process Burst Time Priority
P1 10 3
P2 1 1
P3 2 4
P4 1 5
P5 5 2
Priorities can be assigned either internally or externally. Internal priorities are assigned
by the OS using criteria such as average burst time, ratio of CPU to I/O activity, system
resource use, and other factors available to the kernel. External priorities are assigned by
users, based on the importance of the job, fees paid, politics, etc. Priority scheduling can
be either preemptive or non-preemptive. Priority scheduling can suffer from a major
problem known as indefinite blocking, or starvation, in which a low-priority task can
wait forever because there are always some other jobs around that have higher priority.
• If this problem is allowed to occur, then processes will either run eventually when
the system load lightens (at say 2:00 a.m.), or will eventually get lost when the
system is shut down or crashes. (There are rumors of jobs that have been stuck for
years.)
• One common solution to this problem is aging, in which the priorities of jobs
increase the longer they wait. Under this scheme a low-priority job will eventually
get its priority raised high enough that it gets run.
Round Robin Scheduling
Round robin scheduling is similar to FCFS scheduling, except that CPU bursts are
assigned with limits called time quanta. When a process is given the CPU, a timer is
set for whatever value has been set for the time quantum.
• If the process finishes its burst before the time-quantum timer expires, then it
releases the CPU voluntarily, just as in the normal FCFS algorithm.
• If the timer goes off first, then the process is swapped out of the CPU and moved to
the back of the ready queue.
The ready queue is maintained as a circular queue, so when all processes have had a turn,
the scheduler gives the first process another turn, and so on. RR scheduling can give the
effect of all processes sharing the CPU equally, although the average wait time can be
longer than with other scheduling algorithms. In the following example the average wait
time is 5.66 ms.
Process Burst Time
P1 24
P2 3
P3 3
The performance of RR is sensitive to the time quantum selected. If the quantum is large
enough, then RR reduces to the FCFS algorithm; if it is very small, then each of the n
processes gets 1/nth of the processor time, effectively sharing the CPU equally.
BUT, a real system invokes overhead for every context switch, and the smaller the time
quantum the more context switches there are. (See Figure 5.4 below.) Most modern
systems use time quantum between 10 and 100 milliseconds, and context switch times on
the order of 10 microseconds, so the overhead is small relative to the time quantum.
Fig. 5.4: The way in which a smaller time quantum increases context switches
Turnaround time also varies with the time quantum, in a non-obvious manner. Consider, for
example the processes shown in Figure 5.5:
Fig. 5.5: The way in which turnaround time varies with the time quantum
In general, turnaround time is minimized if most processes finish their next CPU burst
within one time quantum. For example, with three processes of
10 ms bursts each, the average turnaround time for 1 ms quantum is 29, and for 10 ms
quantum it reduces to 20. However, if it is made too large, then RR just degenerates to
FCFS. A rule of thumb is that 80% of CPU bursts should be smaller than the time
quantum.
Multilevel Queue Scheduling
When processes can be readily categorized, then multiple separate queues can be
established, each implementing whatever scheduling algorithm is most appropriate for
that type of job, and/or with different parametric adjustments. Scheduling must also be
done between queues, that is scheduling one queue to get time relative to other queues.
Two common options are strict priority (no job in a lower priority queue runs until all
higher priority queues are empty) and round-robin (each queue gets a time slice in turn,
possibly of different sizes.)
Note that under this algorithm jobs cannot switch from queue to queue – Once they are
assigned a queue, that is their queue until they finish.
Fig. 5.6: Multilevel queue scheduling
Multilevel Feedback-Queue Scheduling
Multilevel feedback queue scheduling is similar to the ordinary multilevel queue
scheduling described above, except jobs may be moved from one queue to another for a
variety of reasons:
� If the characteristics of a job change between CPU-intensive and I/O intensive, then it
may be appropriate to switch a job from one queue to another.
� Aging can also be incorporated, so that a job that has waited for a long time can get
bumped up into a higher priority queue for a while.
Multilevel feedback queue scheduling is the most flexible, because it can be tuned for
any situation. But it is also the most complex to implement because of all the adjustable
parameters. Some of the parameters which define one of these systems include:
• The number of queues.
• The scheduling algorithm for each queue.
• The methods used to upgrade or demote processes from one queue to another
(which may be different).
• The method used to determine which queue a process enters initially.
Fig. 5.7: Multilevel feedback queues
Self Assessment Questions
1. Explain the several common scheduling strategies in brief.
2. Explain the FCFS scheduling with a suitable example.
3. Write short notes on:
a. Priority Scheduling b. RR Scheduling
Multiple-Processor Scheduling
When multiple processors are available, then the scheduling gets more complicated,
because now there is more than one CPU which must be kept busy and in effective use at
all times. Load sharing revolves around balancing the load between multiple processors.
Multi-processor systems may be heterogeneous, (different kinds of CPUs), or
homogenous, (all the same kind of CPU). Even in the latter case there may be special
scheduling constraints, such as devices which are connected via a private bus to only one
of the CPUs. This book will restrict its discussion to homogenous systems.
Approaches to Multiple-Processor Scheduling
One approach to multi-processor scheduling is asymmetric multiprocessing, in which
one processor is the master, controlling all activities and running all kernel code, while
the others run only user code. This approach is relatively simple, as there is no need to
share critical system data. Another approach is symmetric multiprocessing, SMP, where
each processor schedules its own jobs, either from a common ready queue or from
separate ready queues for each processor. Virtually all modern OSes support SMP,
including XP, Win 2000, Solaris, Linux, and Mac OSX.
Processor Affinity
Processors contain cache memory, which speeds up repeated accesses to the same
memory locations. If a process were to switch from one processor to another each time it
got a time slice, the data in the cache (for that process) would have to be invalidated and
re-loaded from main memory, thereby obviating the benefit of the cache. Therefore SMP
systems attempt to keep processes on the same processor, via processor affinity.
Soft affinity occurs when the system attempts to keep processes on the same processor
but makes no guarantees. Linux and some other OSes support hard affinity, in which a
process specifies that it is not to be moved between processors.
Load Balancing
Obviously an important goal in a multiprocessor system is to balance the load between
processors, so that one processor won’t be sitting idle while another is overloaded.
Systems using a common ready queue are naturally self-balancing, and do not need any
special handling. Most systems, however, maintain separate ready queues for each
processor.
Balancing can be achieved through either push migration or pull migration:
• Push migration involves a separate process that runs periodically,
(e.g. every 200 milliseconds), and moves processes from heavily loaded
processors onto less loaded ones.
• Pull migration involves idle processors taking processes from the ready queues of
other processors.
• Push and pull migration are not mutually exclusive.
Note that moving processes from processor to processor to achieve load balancing works
against the principle of processor affinity, and if not carefully managed, the savings
gained by balancing the system can be lost in rebuilding caches. One option is to only
allow migration when imbalance surpasses a given threshold.
Symmetric Multithreading
An alternative strategy to SMP is SMT, Symmetric Multi-Threading, in which multiple
virtual (logical) CPUs are used instead of (or in combination with) multiple physical
CPUs. SMT must be supported in hardware, as each logical CPU has its own registers
and handles its own interrupts. (Intel refers to SMT as hyperthreading technology.) To
some extent the OS does not need to know if the processors it is managing are real or
virtual. On the other hand, some scheduling decisions can be optimized if the scheduler
knows the mapping of virtual processors to real CPUs. (Consider the scheduling of two
CPU-intensive processes on the architecture shown below.)
Fig. 5.8: A typical SMT architecture
Thread Scheduling
The process scheduler schedules only the kernel threads. User threads are mapped to
kernel threads by the thread library – The OS (and in particular the scheduler) is unaware
of them.
Contention Scope
Contention scope refers to the scope in which threads compete for the use of physical
CPUs. On systems implementing many-to-one and many-to-many threads, Process
Contention Scope, PCS, occurs, because competition occurs between threads that are
part of the same process.
(This is the management / scheduling of multiple user threads on a single kernel thread,
and is managed by the thread library.)
System Contention Scope, SCS, involves the system scheduler scheduling kernel threads
to run on one or more CPUs. Systems implementing one-to-one threads (XP, Solaris 9,
Linux), use only SCS. PCS scheduling is typically done with priority, where the
programmer can set and/or change the priority of threads created by his or her programs.
Even time slicing is not guaranteed among threads of equal priority.
Pthread Scheduling
The Pthread library provides for specifying scope contention:
• PTHREAD_SCOPE_PROCESS schedules threads using PCS, by scheduling user
threads onto available LWPs using the many-to-many model.
• PTHREAD_SCOPE_SYSTEM schedules threads using SCS, by binding user
threads to particular LWPs, effectively implementing a one-to-one model.
The pthread_attr_getscope() and pthread_attr_setscope() functions provide for
determining and setting the contention scope, respectively:
Fig. 5.9: Pthread Scheduling API
Operating System Examples
Example: Solaris Scheduling
• Priority-based kernel thread scheduling.
• Four classes (real-time, system, interactive, and time-sharing), and multiple
queues / algorithms within each class.
• Default is time-sharing.
o Process priorities and time slices are adjusted dynamically in a multilevel-
feedback priority queue system.
o Time slices are inversely proportional to priority – Higher priority jobs get
smaller time slices.
o Interactive jobs have higher priority than CPU-Bound ones.
o See the table below for some of the 60 priority levels and how they shift.
“Time quantum expired” and “return from sleep” indicate the new priority
when those events occur.
Fig. 5.10: Solaris scheduling
Fig. 5.11: Solaris dispatch table for interactive and time-sharing threads
Solaris 9 introduced two new scheduling classes: Fixed priority and fair share.
• Fixed priority is similar to time sharing, but not adjusted dynamically.
• Fair share uses shares of CPU time rather than priorities to schedule jobs. A
certain share of the available CPU time is allocated to a project, which is a set of
processes.
System class is reserved for kernel use. (User programs running in kernel mode are NOT
considered in the system scheduling class.)
Fig. 5.13: Windows XP priorities
Fig. 5.14: List of tasks indexed according to priority
Algorithm Evaluation
The first step in determining which algorithm (and what parameter settings within that
algorithm) is optimal for a particular operating environment is to determine what criteria
are to be used, what goals are to be targeted, and what constraints if any must be applied.
For example, one might want to “maximize CPU utilization, subject to a maximum
response time of
1 second”.
Once criteria have been established, then different algorithms can be analyzed and a “best
choice” determined. The following sections outline some different methods for
determining the “best choice”.
Deterministic Modeling
If a specific workload is known, then the exact values for major criteria can be fairly
easily calculated, and the “best” determined. For example, consider the following
workload (with all processes arriving at time 0), and the resulting schedules determined
by three different algorithms:
Process Burst Time
P1 10
P2 29
P3 3
P4 7
P5 12
The average waiting times for FCFS, SJF, and RR are 28ms, 13ms, and 23ms
respectively. Deterministic modeling is fast and easy, but it requires specific known
input, and the results only apply for that particular set of input. However by examining
multiple similar cases, certain trends can be observed. (Like the fact that for processes
arriving at the same time, SJF will always yield the shortest average wait time.)
Queuing Models
Specific process data is often not available, particularly for future times. However a study
of historical performance can often produce statistical descriptions of certain important
parameters, such as the rate at which new processes arrive, the ratio of CPU bursts to I/O
times, the distribution of CPU burst times and I/O burst times, etc.
Armed with those probability distributions and some mathematical formulas, it is
possible to calculate certain performance characteristics of individual waiting queues. For
example, Little’s Formula says that for an average queue length of N, with an average
waiting time in the queue of W, and an average arrival of new jobs in the queue of
Lambda, then these three terms can be related by:
N = Lambda * W
Queuing models treat the computer as a network of interconnected queues, each of which
is described by its probability distribution statistics and formulas such as Little’s formula.
Unfortunately real systems and modern scheduling algorithms are so complex as to make
the mathematics intractable in many cases with real systems.
Simulations
Another approach is to run computer simulations of the different proposed algorithms
(and adjustment parameters) under different load conditions, and to analyze the results to
determine the “best” choice of operation for a particular load pattern. Operating
conditions for simulations are often randomly generated using distribution functions
similar to those described above. A better alternative when possible is to generate trace
tapes, by monitoring and logging the performance of a real system under typical expected
work loads. These are better because they provide a more accurate picture of system
loads, and also because they allow multiple simulations to be run with the identical
process load, and not just statistically equivalent loads. A compromise is to randomly
determine system loads and then save the results into a file, so that all simulations can be
run against identical randomly determined system loads.
Although trace tapes provide more accurate input information, they can be difficult and
expensive to collect and store, and their use increases the complexity of the simulations
significantly. There is also some question as to whether the future performance of the
new system will really match the past performance of the old system. (If the system runs
faster, users may take fewer coffee breaks, and submit more processes per hour than
under the old system. Conversely if the turnaround time for jobs is longer, intelligent
users may think more carefully about the jobs they submit rather than randomly
submitting jobs and hoping that one of them works out.)
Fig. 5.15: Evaluation of CPU schedulers by simulation
Implementation
The only real way to determine how a proposed scheduling algorithm is going to operate
is to implement it on a real system. For experimental algorithms and those under
development, this can cause difficulties and resistances among users who don’t care
about developing OS’s and are only trying to get their daily work done. Even in this case,
the measured results may not be definitive, for at least two major reasons: (1) System
work loads are not static, but change over time as new programs are installed, new users
are added to the system, new hardware becomes available, new work projects get started,
and even societal changes. (For example the explosion of the Internet has drastically
changed the amount of network traffic that a system sees and the importance of handling
it with rapid response times.) (2) As mentioned above, changing the scheduling system
may have an impact on the work load and the ways in which users use the system.
Most modern systems provide some capability for the system administrator to adjust
scheduling parameters, either on the fly or as the result of a reboot or a kernel rebuild.
Summary
This unit covered the alternating sequence of CPU and I/O bursts, and the CPU
scheduler, for which there are several alternatives to choose from as well as numerous
adjustable parameters for each scheduling algorithm. We discussed various common
scheduling strategies, such as FCFS scheduling, shortest-job-first scheduling, priority
scheduling, RR scheduling, multilevel queue scheduling and multiple-processor
scheduling. Finally, we also discussed load balancing, thread scheduling and various
algorithm-evaluation models.
Terminal Questions
1. What do you understand by the scheduling process? What are the conditions
which guide CPU scheduling decisions?
2. What is the significance of the dispatcher module in the scheduling process?
Explain dispatch latency.
3. What are the various scheduling algorithms? Discuss the advantages of one over
the other.
4. When is it advisable to follow the priority scheduling approach? What is the
suggested solution to deal with the starvation problem in this approach?
5. What is load balancing? How is load balancing achieved in multiprocessor
systems?
Unit 6 : Deadlocks:
This unit covers the deadlock principles, deadlock detection and recovery, deadlock
avoidance , prevention, pipes.
Introduction
Recall that one definition of an operating system is a resource allocator. There are many
resources that can be allocated to only one process at a time, and we have seen several
operating system features that allow this, such as mutexes, semaphores or file locks.
Sometimes a process has to reserve more than one resource. For example, a process
which copies files from one tape to another generally requires two tape drives. A process
which deals with databases may need to lock multiple records in a database.
A deadlock is a situation in which two computer programs sharing the same resource are
effectively preventing each other from accessing the resource, resulting in both programs
ceasing to function.
The earliest computer operating systems ran only one program at a time. All of the
resources of the system were available to this one program. Later, operating systems ran
multiple programs at once, interleaving them. Programs were required to specify in
advance what resources they needed so that they could avoid conflicts with other
programs running at the same time. Eventually some operating systems offered dynamic
allocation of resources. Programs could request further allocations of resources after they
had begun running. This led to the problem of the deadlock. Here is the simplest
example:
Program 1 requests resource A and receives it.
Program 2 requests resource B and receives it.
Program 1 requests resource B and is queued up, pending the release of B.
Program 2 requests resource A and is queued up, pending the release of A.
Now neither program can proceed until the other program releases a resource. The
operating system cannot know what action to take. At this point the only alternative is to
abort (stop) one of the programs.
Learning to deal with deadlocks had a major impact on the development of operating
systems and the structure of databases. Data was structured and the order of requests was
constrained in order to avoid creating deadlocks.
In general, resources allocated to a process are not preemptable; this means that once a
resource has been allocated to a process, there is no simple mechanism by which the
system can take the resource back from the process unless the process voluntarily gives it
up or the system administrator kills the process. This can lead to a situation called
deadlock. A set of processes or threads is deadlocked when each process or thread is
waiting for a resource to be freed which is controlled by another process. Here is an
example of a situation where deadlock can occur.
Mutex M1, M2;

/* Thread 1 */
while (1) {
    NonCriticalSection();
    Mutex_lock(&M1);
    Mutex_lock(&M2);
    CriticalSection();
    Mutex_unlock(&M2);
    Mutex_unlock(&M1);
}

/* Thread 2 */
while (1) {
    NonCriticalSection();
    Mutex_lock(&M2);
    Mutex_lock(&M1);
    CriticalSection();
    Mutex_unlock(&M1);
    Mutex_unlock(&M2);
}
Suppose thread 1 is running and locks M1, but before it can lock M2, it is interrupted.
Thread 2 starts running; it locks M2, but when it tries to obtain and lock M1, it is
blocked because M1 is already locked (by thread 1). Eventually thread 1 starts running
again and tries to obtain and lock M2, but it is blocked because M2 is already locked by
thread 2. Both threads are blocked; each is waiting for an event which will never occur.
Traffic gridlock is an everyday example of a deadlock situation.
In order for deadlock to occur, four conditions must be true.
• Mutual exclusion – Each resource is either currently allocated to exactly one
process or it is available. (Two processes cannot simultaneously control the same
resource or be in their critical section).
• Hold and Wait – Processes currently holding resources can request new resources.
• No preemption – Once a process holds a resource, it cannot be taken away by
another process or the kernel.
• Circular wait – Each process is waiting to obtain a resource which is held by
another process.
The dining philosophers problem discussed in an earlier section is a classic example of
deadlock. Each philosopher picks up his or her left fork and waits for the right fork to
become available, but it never does.
Deadlock can be modeled with a directed graph. In a deadlock graph, vertices represent
either processes (circles) or resources (squares). A process which has acquired a resource
is shown with an arrow (edge) from the resource to the process. A process which has
requested a resource which has not yet been assigned to it is modeled with an arrow from
the process to the resource. If these edges form a cycle, there is deadlock.
The deadlock situation in the above code can be modeled as such a cycle: thread 1 holds
M1 and requests M2, while thread 2 holds M2 and requests M1.
This graph shows an extremely simple deadlock situation, but it is also possible for a
more complex situation to create deadlock. Here is an example of deadlock with four
processes and four resources.
There are a number of ways that deadlock can occur in an operating system. We have
seen some examples; here are two more.
• Two processes need to lock two files, the first process locks one file the second
process locks the other, and each waits for the other to free up the locked file.
• Two processes want to write a file to a print spool area at the same time and both
start writing. However, the print spool area is of fixed size, and it fills up before
either process finishes writing its file, so both wait for more space to become
available.
Objective :
At the end of this unit, you will be able to understand the :
• Solutions to deadlock
• Deadlock detection and recovery
• Deadlock avoidance
• Deadlock Prevention
• Pipes
Solutions to deadlock
There are several ways to address the problem of deadlock in an operating system.
• Just ignore it and hope it doesn’t happen
• Detection and recovery – if it happens, take action
• Dynamic avoidance by careful resource allocation. Check to see if a resource can
be granted, and if granting it will cause deadlock, don’t grant it.
• Prevention – change the rules
Ignore deadlock
The text refers to this as the Ostrich Algorithm. Just hope that deadlock doesn’t happen.
In general, this is a reasonable strategy. Deadlock is unlikely to occur very often; a
system can run for years without deadlock occurring. If the operating system has a
deadlock prevention or detection system in place, this will have a negative impact on
performance (slow the system down) because whenever a process or thread requests a
resource, the system will have to check whether granting this request could cause a
potential deadlock situation.
If deadlock does occur, it may be necessary to bring the system down, or at least
manually kill a number of processes, but even that is not an extreme solution in most
situations.
Deadlock detection and recovery
As we saw above, if there is only one instance of each resource, it is possible to detect
deadlock by constructing a resource allocation/request graph and checking for cycles.
Graph theorists have developed a number of algorithms to detect cycles in a graph. The
book discusses one of these. It uses only one data structure: a list of nodes, L.
A cycle detection algorithm
For each node N in the graph
1. Initialize L to the empty list and designate all edges as unmarked
2. Add the current node to L and check to see if it appears twice. If it does, there is a
cycle in the graph.
3. From the given node, check to see if there are any unmarked outgoing edges. If
yes, go to the next step, if no, skip the next step
4. Pick an unmarked edge, mark it, then follow it to the new current node and go to
step 3.
5. We have reached a dead end. Go back to the previous node and make that the
current node. If the current node is the starting Node and there are no unmarked
edges, there are no cycles in the graph. Otherwise, go to step 3.
Let’s work through an example with five processes and five resources. Here is the
resource request/allocation graph.
The algorithm needs to search each node; let’s start at node P1. We add P1 to L and
follow the only edge to R1, marking that edge. R1 is now the current node so we add that
to L, checking to confirm that it is not already in L. We then follow the unmarked edge to
P2, marking the edge, and making P2 the current node. We add P2 to L, checking to
make sure that it is not already in L, and follow the edge to R2. This makes R2 the
current node, so we add it to L, checking to make sure that it is not already there. We are
now at a dead end so we back up, making P2 the current node again. There are no more
unmarked edges from P2 so we back up yet again, making R1 the current node. There are
no more unmarked edges from R1 so we back up yet again, making P1 the current node.
Since there are no more unmarked edges from P1 and since this was our starting point,
we are through with this node (and all of the nodes visited so far).
We move to the next unvisited node P3, and initialize L to empty. We first follow the
unmarked edge to R1, putting R1 on L. Continuing, we make P2 the current node and
then R2. Since we are at a dead end, we repeatedly back up until P3 becomes the current
node again.
L now contains P3, R1, P2, and R2. P3 is the current node, and it has another unmarked
edge to R3. We make R3 the current node, add it to L, follow its edge to P4. We repeat
this process, visiting R4, then P5, then R5, then P3. When we visit P3 again we note that
it is already on L, so we have detected a cycle, meaning that there is a deadlock situation.
Once deadlock has been detected, it is not clear what the system should do to correct the
situation. There are three strategies.
• Preemption – we can take an already allocated resource away from a process and
give it to another process. This can present problems. Suppose the resource is a
printer and a print job is half completed. It is often difficult to restart such a job
without completely starting over.
• Rollback – In situations where deadlock is a real possibility, the system can
periodically make a record of the state of each process and when deadlock occurs,
roll everything back to the last checkpoint, and restart, but allocating resources
differently so that deadlock does not occur. This means that all work done after
the checkpoint is lost and will have to be redone.
• Kill one or more processes – this is the simplest and crudest, but it works.
Deadlock avoidance
The above solution allowed deadlock to happen, then detected that deadlock had occurred
and tried to fix the problem after the fact. Another solution is to avoid deadlock by only
granting resources if granting them cannot result in a deadlock situation later. However,
this works only if the system knows what requests for resources a process will be making
in the future, and this is an unrealistic assumption. The text describes the banker's
algorithm but then points out that it is essentially impossible to implement because of this
assumption.
Deadlock Prevention
The difference between deadlock avoidance and deadlock prevention is a little subtle.
Deadlock avoidance refers to a strategy where whenever a resource is requested, it is only
granted if it cannot result in deadlock. Deadlock prevention strategies involve changing
the rules so that processes will not make requests that could result in deadlock.
Here is a simple example of such a strategy. Suppose every possible resource is
numbered (easy enough in theory, but often hard in practice), and processes must make
their requests in order; that is, they cannot request a resource with a number lower than
any of the resources that they have been granted so far. Deadlock cannot occur in this
situation.
As an example, consider the dining philosophers problem. Suppose each chopstick is
numbered, and philosophers always have to pick up the lower numbered chopstick before
the higher numbered chopstick. Philosopher five picks up chopstick 4, philosopher 4
picks up chopstick 3, philosopher 3 picks up chopstick 2, philosopher 2 picks up
chopstick 1. Philosopher 1 is hungry, and without this assumption, would pick up
chopstick 5, thus causing deadlock. However, if the lower number rule is in effect, he/she
has to pick up chopstick 1 first, and it is already in use, so he/she is blocked. Philosopher
5 picks up chopstick 5, eats, and puts both down, allows philosopher 4 to eat. Eventually
everyone gets to eat.
An alternative strategy is to require all processes to request all of their resources at once,
and either all are granted or none are granted. Like the above strategy, this is
conceptually easy but often hard to implement in practice because it assumes that a
process knows what resources it will need in advance.
Livelock
There is a variant of deadlock called livelock. This is a situation in which two or more
processes continuously change their state in response to changes in the other process(es)
without doing any useful work. This is similar to deadlock in that no progress is made but
differs in that neither process is blocked or waiting for anything.
A human example of livelock would be two people who meet face-to-face in a corridor
and each moves aside to let the other pass, but they end up swaying from side to side
without making any progress because they always move the same way at the same time.
Addressing deadlock in real systems
Deadlock is a terrific theoretical problem for graduate students, but none of the solutions
discussed above can be implemented in a real world, general purpose operating system. It
would be difficult to require a user program to make requests for resources in a certain
way or in a certain order. As a result, most operating systems use the ostrich algorithm.
Some specialized systems have deadlock avoidance/prevention mechanisms. For
example, many database operations involve locking several records, and this can result in
deadlock, so database software often has a deadlock prevention algorithm.
The Unix file locking system lockf has a deadlock detection mechanism built into it.
Whenever a process attempts to lock a file or a record of a file, the operating system
checks to see if that process has locked other files or records, and if it has, it uses a graph
algorithm similar to the one discussed above to see if granting that request will cause
deadlock, and if it does, the request for the lock will fail, and the lockf system call will
return and errno will be set to EDEADLK.
Killing Zombies
Recall that if a child dies before its parent calls wait, the child becomes a zombie. In
some applications, a web server for example, the parent forks off lots of children but
doesn’t care whether the child is dead or alive. For example, a web server might fork a
new process to handle each connection, and each child dies when the client breaks the
connection. Such an application is at risk of producing many zombies, and zombies can
clog up the process table.
When a child dies, it sends a SIGCHLD signal to its parent. The parent process can
prevent zombies from being created by creating a signal handler routine for SIGCHLD
which calls wait whenever it receives a SIGCHLD signal. There is no danger that this
will cause the parent to block because it would only call wait when it knows that a child
has just died.
There are several versions of wait on a Unix system. The system call waitpid has this
prototype
#include <sys/types.h>
#include <sys/wait.h>
pid_t waitpid(pid_t pid, int *stat_loc, int options)
This will function like wait in that it waits for a child to terminate, but this function
allows the process to wait for a particular child by setting its first argument to the pid that
we want to wait for. However, that is not our interest here. If the first argument is set to
-1, it will wait for any child to terminate, just like wait. However, the third argument
can be set to WNOHANG. This will cause the function to return immediately if there are
no dead children. It is customary to use this function rather than wait in the signal
handler.
Here is some sample code
#include <sys/types.h>
#include <stdio.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

void zombiekiller(int n)
{
    int status;
    /* reap every dead child; SIGCHLD signals may coalesce, so loop */
    while (waitpid(-1, &status, WNOHANG) > 0)
        ;
    signal(SIGCHLD, zombiekiller);
}

int main()
{
    signal(SIGCHLD, zombiekiller);
    ....
}
Pipes
A second form of redirection is a pipe. A pipe is a connection between two processes in
which one process writes data to the pipe and the other reads from the pipe. Thus, it
allows one process to pass data to another process.
The Unix system call to create a pipe is
int pipe(int fd[2])
This function takes an array of two ints (file descriptors) as an argument. It creates a pipe
with fd[0] at one end and fd[1] at the other. Reading from the pipe and writing to the pipe
are done with the read and write calls that you have seen and used before. Although both
ends are opened for both reading and writing, by convention a process writes to fd[1] and
reads from fd[0]. Pipes only make sense if the process calls fork after creating the pipe.
Each process should close the end of the pipe that it is not using. Here is a simple
example in which a child sends a message to its parent through a pipe.
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

int main()
{
    pid_t pid;
    int retval;
    int fd[2];
    int n;

    retval = pipe(fd);
    if (retval < 0) {
        printf("Pipe failed\n");   /* pipe is unlikely to fail */
        exit(0);
    }
    pid = fork();
    if (pid == 0) {                /* child */
        close(fd[0]);
        n = write(fd[1], "Hello from the child", 20);
        exit(0);
    }
    else if (pid > 0) {            /* parent */
        char buffer[64];
        close(fd[1]);
        n = read(fd[0], buffer, 64);
        buffer[n] = '\0';
        printf("I got your message: %s\n", buffer);
    }
    return 0;
}
There is no need for the parent to wait for the child to finish because reading from a pipe
will block until there is something in the pipe to read. If the parent runs first, it will try to
execute the read statement, and will immediately block because there is nothing in the
pipe. After the child writes a message to the pipe, the parent will wake up.
Pipes have a fixed size (often 4096 bytes) and if a process tries to write to a pipe which is
full, the write will block until a process reads some data from the pipe.
Here is a program which combines dup2 and pipe to redirect the output of the ls process
to the input of the more process as would be the case if the user typed
ls | more at the Unix command line.
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

void error(char *msg)
{
    perror(msg);
    exit(1);
}

int main()
{
    int p[2], retval;

    retval = pipe(p);
    if (retval < 0) error("pipe");
    retval = fork();
    if (retval < 0) error("forking");
    if (retval == 0) {      /* child */
        dup2(p[1], 1);      /* redirect stdout to pipe */
        close(p[0]);        /* don't permit this process to read from pipe */
        execl("/bin/ls", "ls", "-l", NULL);
        error("Exec of ls");
    }
    /* if we get here, we are the parent */
    dup2(p[0], 0);          /* redirect stdin to pipe */
    close(p[1]);            /* don't permit this process to write to pipe */
    execl("/bin/more", "more", NULL);
    error("Exec of more");
    return 0;
}
Summary
A deadlock is a situation that, whenever it occurs, halts the normal flow of execution of
an application, and so it needs to be understood well. To that end, this unit began with a
detailed discussion of the fundamental concepts of deadlock, followed by the various
situations that force a deadlock to occur. Finally, the unit covered methods of preventing
and avoiding deadlocks and, in case one does occur, mechanisms to detect it so that
corrective measures can be taken.
Terminal Questions
1. What do you mean by a deadlock? Explain.
2. Discuss the various conditions that must be true for deadlock to occur.
3. Discuss various approaches to overcome the problem of deadlock.
4. What do you mean by a zombie? Discuss in brief.
5. Explain the concept of pipes.
Unit 7 : Concurrency Control :
This unit deals with the concurrency, race condition, critical section, mutual exclusion
and Semaphores
Introduction
Concurrency is a property of systems which execute processes overlapped in time on
single or multiple processors, and which may permit the sharing of common resources
between those overlapped processes. Concurrent use of shared resources is the source of
many difficulties, such as race conditions. Concurrency control is a method used to
ensure that processes are executed in a safe manner without affecting each other and
correct results are generated, while getting those results as quickly as possible. Mutual
exclusion is a way of making sure that if one process is using a shared modifiable data,
the other processes will be excluded from doing the same thing. The mutual exclusion
have a basic problem of busy waiting. If a process is unable to enter in to its critical
section; it tightly executes the loop of testing the shared global variable, wasting CPU
time, as well as resources. Semaphores avoid this wastage of time and resources by
blocking the process if it can not enter into its critical section. This process will be wake
up by the currently running process after coming out of critical section. Following
sections covers various aspects and issues related to concurrent transactions.
Objectives:
At the end of this unit you will be able to understand the:
• Brief introduction of Concurrency Control
• Conditions for Deadlocks
• Semaphores
What is concurrency?
“Concurrency occurs when two or more execution flows are able to run simultaneously.”
– Edsger Dijkstra.
Concurrency is a property of systems which execute processes overlapped in time on
single or multiple processors, and which may permit the sharing of common resources
between those overlapped processes. Concurrent use of shared resources is the source of
many difficulties, such as race conditions (as explained below). The introduction of
mutual exclusion can prevent race conditions, but can lead to problems such as deadlock
and starvation.
In a single-processor multiprogramming system, processes must be interleaved in time to
yield the appearance of simultaneous execution. In a multiple-processor system, it is
possible not only to interleave the execution of multiple processes but also to overlap
them. Interleaving and overlapping can both be viewed as examples of concurrent
processing.
Concurrency control is a method used to ensure that processes are executed in a safe
manner (i.e., without affecting each other) and correct results are generated, while getting
those results as quickly as possible.
Race Conditions
A race condition occurs when multiple processes or threads read and write data items so
that the final result depends on the order of execution of instructions in the multiple
processes.
Suppose that two processes, P1 and P2, share the global variable A. At some point in its
execution, P1 updates variable A to the value 1, and at some point in its execution, P2
updates variable A to the value 2. Thus, the two processes are in a race to write variable
A. In this example the “loser” of the race (the process that updates last) determines the
final value of A.
Critical Section
A critical section is a part of program that accesses a shared resource (data structure or
device) that must not be concurrently accessed by more than one process of execution.
The key to preventing trouble involving shared storage is to find some way to prohibit
more than one process from reading and writing the shared data simultaneously. To avoid
race conditions and flawed results, one must identify the critical sections of code in each
process.
Mutual Exclusion
Mutual exclusion is a way of making sure that if one process is using a shared modifiable
data, the other processes will be excluded from doing the same thing.
That is, while one process is accessing the shared variable, all other processes desiring to
do so at the same moment should be kept waiting; when that process has finished using
the shared variable, one of the waiting processes should be allowed to proceed.
In this fashion, each process using the shared data (variables) excludes all others from
doing so simultaneously. This is called Mutual Exclusion.
Mutual exclusion needs to be enforced only when processes access shared modifiable
data – when processes are performing operations that do not conflict with one another
they should be allowed to proceed concurrently.
Requirements for mutual exclusion
Following are the six requirements for mutual exclusion.
1. Mutual exclusion must be enforced: Only one process at a time is allowed into its
critical section, among all processes that have critical sections for the same
resource or shared object.
2. A process that halts in its non critical section must do so without interfering with
other processes.
3. It must not be possible for a process requiring access to a critical section to be
delayed indefinitely.
4. When no process is in a critical section, any process that requests entry to its
critical section must be permitted to enter without delay.
5. No assumptions are made about relative process speed or number of
processors.
6. A process remains inside its critical section for a finite time only.
Following are some of the methods for achieving mutual exclusion.
Mutual exclusion by disabling interrupts:
In an interrupt driven system, context switches from one process to another can only
occur on interrupts (timer, I/O device, etc). If a process disables all interrupts then it
cannot be switched out.
On entry to the critical section the process can disable all interrupts, and on exit from it
can enable them again as shown bellow.
while (true)
{
/* disable interrupts */;
/* critical section */;
/* enable interrupts */;
/* remainder */;
}
Figure 7.1: Mutual exclusion by disabling interrupts
Because the critical section cannot be interrupted, mutual exclusion is guaranteed. But
since the processor cannot interleave processes while interrupts are disabled, system
performance is degraded. Also, this solution does not work on a multiprocessor system:
disabling interrupts on one processor does not prevent processes on the other processors
from running concurrently.
Mutual exclusion by using Lock variable:
In this method, we consider a single, shared, (lock) variable, initially 0. When a process
wants to enter in its critical section, it first tests the lock value. If lock is 0, the process
first sets it to 1 and then enters the critical section. If the lock is already 1, the process just
waits until the lock variable becomes 0. Thus, 0 means that no process is in its critical
section and 1 means that some process is in its critical section.
process (i)
{
while(lock != 0)
/* no operation */;
lock = 1;
/* critical section */;
lock = 0;
/* remainder */;
}
Figure 7.2: Mutual exclusion using lock variable
The flaw in this proposal can be best explained by example. Suppose process A sees that
the lock is 0. Before it can set the lock to 1 another process B is scheduled, runs, and sets
the lock to 1. When the process A runs again, it will also set the lock to 1, and two
processes will be in their critical section simultaneously. Thus this method does not
guarantee mutual exclusion.
Mutual exclusion by Strict Alternation:
In this method, the integer variable ‘turn’ keeps track of whose turn it is to enter the
critical section. Initially, process 0 inspects turn, finds it to be 0, and enters its critical
section. Process 1 also finds it to be 0 and sits in a loop, continually testing ‘turn’ to see
when it becomes 1. Process 0, after coming out of its critical section, sets turn to 1 to
allow process 1 to enter its critical section, as shown below.
/* Process 0 */
while (true)
{
while(turn != 0)
/* no operation */;
/* critical section */;
turn = 1;
/* remainder */;
}
/* Process 1 */
while (true)
{
while(turn != 1)
/* no operation */;
/* critical section */;
turn = 0;
/* remainder */;
}
Figure 7.3: Mutual exclusion by strict alternation
Taking turns is not a good idea when one of the processes is much slower than the other.
Suppose process 0 finishes its critical section quickly and wants to enter it again, but it
cannot do so because turn is set to 1. It has to wait for process 1 to finish its critical
section, even though both processes are in their non-critical sections. This situation
violates mutual exclusion requirement no. 4 above.
Mutual exclusion by Peterson’s Method:
The algorithm uses two variables, flag, a boolean array and turn, an integer. A true flag
value indicates that the process wants to enter the critical section. The variable turn holds
the id of the process whose turn it is. Entrance to the critical section is granted for process
P0 if P1 does not want to enter its critical section or if P1 has given priority to P0 by
setting turn to 0.
flag[0]=false;
flag[1]=false;
turn = 0;
/* Process 0 */
while (true)
{
flag[0] = true;
turn = 1;
while(flag[1] && turn == 1)
/* no operation */;
/* critical section */;
flag[0] = false;
/* remainder */;
}
/* Process 1 */
while (true)
{
flag[1] = true;
turn = 0;
while(flag[0] && turn == 0)
/* no operation */;
/* critical section */;
flag[1] = false;
/* remainder */;
}
Figure 7.4: Peterson’s algorithm
Mutual exclusion by using Special Machine Instructions:
In a multiprocessor environment, the processors share access to a common main memory
and at the hardware level, only one access to a memory location is permitted at a time.
With this as a foundation, processor designers provided machine instructions that carry
out two actions, such as a read and a write, on a single memory location in one atomic
step. Since processes interleave only at the instruction level, such special instructions are
not subject to interference from other processes. Two instructions of this kind are
discussed in the following parts.
Test and Set Instruction: The test and set instruction can be defined as follows:
boolean testset (int i)
{
    if (i == 0)
    {
        i = 1;
        return true;
    }
    else
    {
        return false;
    }
}
Figure 7.5: Test and Set Instruction
where the variable i is used like a traffic light. If it is 0, meaning green, the instruction
sets it to 1, i.e. red, and returns true; the current process is permitted to pass but the
others are told to stop. On the other hand, if the light is already red, the running process
receives false and realizes it is not supposed to proceed.
Exchange Instruction: The exchange instruction can be defined as follows:
void exchange (int register, int memory)
{
int temp;
temp = memory;
memory = register;
register = temp;
}
Figure 7.6: Exchange Instruction
The instruction exchanges the contents of a register with that of a memory location. A
shared variable bolt is initialized to 0. Each process uses a local variable key that is
initialized to 1, and executes the instruction as exchange(key, bolt). The only process
that may enter its critical section is the one that finds bolt equal to 0. It excludes all other
processes from the critical section by setting bolt to 1. When a process leaves its critical
section, it resets bolt to 0, allowing another process to gain access to its critical section.
Semaphores
All of the above methods of mutual exclusion share the basic problem of busy waiting. If
a process is unable to enter its critical section, it sits in a tight loop testing the shared
global variable, wasting CPU time as well as other resources. Semaphores avoid this
waste of time and resources by blocking a process that cannot enter its critical section;
the blocked process is later woken up by the currently running process when the latter
leaves its critical section.
What are Semaphores?
A semaphore is a mechanism that prevents two or more processes from accessing a
shared resource simultaneously. On the railroads a semaphore prevents two trains from
crashing on a shared section of track. On railroads and computers, semaphores are
advisory: if a train engineer doesn’t observe and obey it, the semaphore won’t prevent a
crash, and if a process doesn’t check a semaphore before accessing a shared resource,
chaos might result.
Semaphores can be thought of as flags (hence their name, semaphores). They are either
on or off. A process can turn on the flag or turn it off. If the flag is already on, processes
that try to turn on the flag will sleep until the flag is off. Upon awakening, the process
will reattempt to turn the flag on, possibly succeeding or possibly sleeping again. Such
behavior allows semaphores to be used in implementing a post-wait driver – a system
where processes can wait for events (i.e., wait on turning on a semaphore) and post
events (i.e. turning off of a semaphore).
Dijkstra in 1965 proposed semaphores as a solution to the problems of concurrent
processes. The fundamental principle is that two or more processes can cooperate by
means of simple signals, such that a process can be forced to stop at a specified place
until it has received a specific signal.
For signaling, special variables called semaphores are used.
Primitive signal (s) is used to transmit a signal
Primitive wait (s) is used to receive a signal
Semaphore Implementation:
To achieve desired effect, view semaphores as variables that have an integer value upon
which three operations are defined:
• A semaphore may be initialized to a non-negative value
• The wait operation decrements the semaphore value. If the value becomes
negative, the process executing the wait is blocked
• The signal operation increments the semaphore value. If the resulting value is not
positive, then a process blocked by a wait operation is unblocked.
There is no other way to manipulate semaphores.
wait (S)
{
    while (S <= 0)
        ; /* no operation */
    S--;
}

signal (S)
{
    S++;
}
Figure 7.7: Semaphore operations
Mutual Exclusion using Semaphore:
The following example illustrates mutual exclusion using semaphore:
A process before entering in to its critical section, performs wait(mutex) operating and
after coming out of critical section, signal(mutex) operation; thus achieving mutual
exclusion.
Shared data:
semaphore mutex; //initially mutex = 1
Process: Pi:
do
{
wait(mutex);
/* critical section */
signal(mutex);
/* remainder section */
} while (1);
Figure 7.8: Mutual exclusion using semaphore
Following code gives the detailed implementation of wait and signal procedures for
above example. The structure definition has semaphore value and process link. The wait
operation decrements the semaphore value, and if it is less than 0 then adds it to waiting
queue and blocks the process.
Declaration:
typedef struct
{
int value;
struct process *L;
} semaphore;
wait(S):
{
S.value--;
if (S.value < 0)
{
add this process to S.L;
block;
}
}
signal(S):
{
S.value++;
if (S.value <= 0)
{
remove a process P from S.L;
wakeup(P);
}
}
Figure 7.9: wait() and signal() for mutual exclusion
The process currently in its critical section increments the semaphore value after
coming out, and checks whether the value is less than or equal to 0. If so, it removes a
process from the waiting queue and wakes it up.
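In user space, the block/wakeup behaviour of Figure 7.9 can be approximated with a pthread mutex and condition variable: the condition variable's internal wait queue plays the role of S.L, pthread_cond_wait() is "block", and pthread_cond_signal() is "wakeup(P)". This is a sketch, not the kernel implementation; note that here the value never goes negative, and a while loop re-tests the condition after each wakeup.

```c
#include <pthread.h>

/* Blocking counting semaphore modelled on Figure 7.9. */
typedef struct {
    int value;
    pthread_mutex_t lock;      /* protects value */
    pthread_cond_t  nonzero;   /* its wait queue stands in for S.L */
} blk_sem;

void blk_sem_init(blk_sem *s, int initial) {
    s->value = initial;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->nonzero, NULL);
}

void blk_sem_wait(blk_sem *s) {
    pthread_mutex_lock(&s->lock);
    while (s->value <= 0)                  /* nothing available: block */
        pthread_cond_wait(&s->nonzero, &s->lock);
    s->value--;                            /* claim one unit */
    pthread_mutex_unlock(&s->lock);
}

void blk_sem_signal(blk_sem *s) {
    pthread_mutex_lock(&s->lock);
    s->value++;
    pthread_cond_signal(&s->nonzero);      /* wakeup(P) */
    pthread_mutex_unlock(&s->lock);
}
```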
Summary
Concurrency is a property of systems which execute processes overlapped in time on
single or multiple processors, and which may permit the sharing of common resources
between those overlapped processes. Concurrency control is a method used to ensure that
processes are executed in a safe manner (i.e., without affecting each other) and correct
results are generated, while getting those results as quickly as possible. A race condition
occurs when multiple processes or threads read and write data items so that the final
result depends on the order of execution of instructions in the multiple processes.
Mutual exclusion is a way of making sure that if one process is using a shared modifiable
data, the other processes will be excluded from doing the same thing. Mutual exclusion
can be achieved by various ways such as using lock variable, by strict alternation, by
disabling interrupts, using Peterson’s method, through special machine instructions, and
Semaphores.
Terminal Questions
1. What is concurrency?
2. Discuss the problems caused by concurrent executions of processes.
3. What is race condition?
4. Describe critical section.
5. What is mutual exclusion? What are its requirements?
6. Explain any one method for achieving mutual exclusion.
7. Explain the Peterson’s solution for mutual exclusion.
8. What are special machine instructions? How do they support mutual exclusion?
9. What are Semaphores? How can we achieve mutual exclusion using Semaphores?
Unit 8 : File Systems and Space Management :
This unit covers file management: file structure, implementing file systems, and space
management – block size and extents, free space, reliability, bad blocks and backup
dumps. Consistency checking, transactions and performance are also discussed in brief.
Introduction
Most operating systems provide a file system, as a file system is an integral part of any
modern operating system. Early microcomputer operating systems’ only real task was file
management – a fact reflected in their names. Some early operating systems had a
separate component for handling file systems which was called a disk operating system.
On some microcomputers, the disk operating system was loaded separately from the rest
of the operating system. On early operating systems, there was usually support for only
one, native, unnamed file system; for example, CP/M supports only its own file system,
which might be called “CP/M file system” if needed, but which didn’t bear any official
name at all. Because of this, there needs to be an interface provided by the operating
system software between the user and the file system. This interface can be textual (such
as provided by a command line interface, such as the UNIX shell, or OpenVMS DCL) or
graphical (such as provided by a graphical user interface, such as file browsers). If
graphical, the metaphor of the folder, containing documents, other files, and nested
folders is often used. This unit covers various issues related to Files.
Objectives:
At the end of this unit you will understand:
• Brief introduction of File Systems and Structures and their implementation
• Storage and Space management with consistency checking, Performance
evaluation and transaction related issues
• Fundamental understanding of Access Methods
File Systems
Just as the process abstraction beautifies the hardware by making a single CPU (or a
small number of CPUs) appear to be many CPUs, one per “user,” the file system
beautifies the hardware disk, making it appear to be a large number of disk-like objects
called files. Like a disk, a file is capable of storing a large amount of data cheaply,
reliably, and persistently. The fact that there are lots of files is one form of beautification:
Each file is individually protected, so each user can have his own files, without the
expense of requiring each user to buy his own disk. Each user can have lots of files,
which makes it easier to organize persistent data. The file system also makes each
individual file more beautiful than a real disk. At the very least, it erases block
boundaries, so a file can be any length (not just a multiple of the block size) and
programs can read and write arbitrary regions of the file without worrying about whether
they cross block boundaries. Some systems (not Unix) also provide assistance in
organizing the contents of a file.
Systems use the same sort of device (a disk drive) to support both virtual memory and
files. The question arises why these have to be distinct facilities, with vastly different user
interfaces. The answer is that they don’t. In Multics, there was no difference whatsoever.
Everything in Multics was a segment. The address space of each running process
consisted of a set of segments (each with its own segment number), and the “file system”
was simply a set of named segments. To access a segment from the file system, a process
would pass its name to a system call that assigned a segment number to it. From then on,
the process could read and write the segment simply by executing ordinary loads and
stores. For example, if the segment was an array of integers, the program could access the
ith number with a notation like a[i] rather than having to seek to the appropriate offset and
then execute a read system call. If the block of the file containing this value wasn’t in
memory, the array access would cause a page fault, which was serviced.
This user-interface idea, sometimes called “single-level store,” is a great idea. So why is
it not common in current operating systems? In other words, why are virtual memory and
files presented as very different kinds of objects? There are several possible explanations
one might propose:
The address space of a process is small compared to the size of a file system.
There is no reason why this has to be so. In Multics, a process could have up to 256K
segments, but each segment was limited to 64K words. Multics allowed for lots of
segments because every “file” in the file system was a segment. The upper bound of 64K
words per segment was considered large by the standards of the time; the hardware
actually allowed segments of up to 256K words (over one megabyte). Most new
processors introduced in the last few years allow 64-bit virtual addresses. In a few years,
such processors will dominate. So there is no reason why the virtual address space of a
process cannot be large enough to include the entire file system.
The virtual memory of a process is transient – it goes away when the process
terminates – while files must be persistent.
Multics showed that this doesn’t have to be true. A segment can be designated as
“permanent,” meaning that it should be preserved after the process that created it
terminates. Permanent segments do raise a need for one “file-system-like” facility: the
ability to give names to segments so that new processes can find them.
Files are shared by multiple processes, while the virtual address space of a process is
associated with only that process.
Most modern operating systems (including most variants of Unix) provide some way for
processes to share portions of their address spaces anyhow, so this is a particularly weak
argument for a distinction between files and segments.
The real reason single-level store is not ubiquitous is probably a concern for efficiency.
The usual file-system interface encourages a particular style of access: Open a file, go
through it sequentially, copying big chunks of it to or from main memory, and then close
it. While it is possible to access a file like an array of bytes, jumping around and
accessing the data in tiny pieces, it is awkward. Operating system designers have found
ways to implement files that make the common “file like” style of access very efficient.
While there appears to be no reason in principle why memory-mapped files cannot be
made to give similar performance when they are accessed in this way, in practice, the
added functionality of mapped files always seems to pay a price in performance. Besides,
if it is easy to jump around in a file, applications programmers will take advantage of it,
overall performance will suffer, and the file system will be blamed.
Naming
Every file system provides some way to give a name to each file. We will consider only
names for individual files here, and talk about directories later. The name of a file is (at
least sometimes) meant to be used by human beings, so it should be easy for humans to use.
Different operating systems put different restrictions on names:
Size
Some systems put severe restrictions on the length of names. For example DOS restricts
names to 11 characters, while early versions of Unix (and some still in use today) restrict
names to 14 characters. The Macintosh operating system, Windows 95, and most modern
version of Unix allow names to be essentially arbitrarily long. I say “essentially” since
names are meant to be used by humans, so they don’t really need to be all that long. A name
that is 100 characters long is just as difficult to use as one that is forced to be under 11
characters long (but for different reasons). Most modern versions of Unix, for example,
restrict names to a limit of 255 characters.
Case
Are upper and lower case letters considered different? The Unix tradition is to consider
the names FILE1 and file1 to be completely different and unrelated names. In DOS and
its descendants, however, they are considered the same. Some systems translate names to
one case (usually upper case) for storage. Others retain the original case, but consider it
simply a matter of decoration. For example, if you create a file named “Fil1,” you could
open it as “FILE1” or “fil1,” but if you list the directory, you would still see the file
listed as “Fil1”.
Character Set
Different systems put different restrictions on what characters can appear in file names.
The Unix directory structure supports names containing any character other than NUL
(the byte consisting of all zero bits), but many utility programs (such as the shell) would
have troubles with names that have spaces, control characters or certain punctuation
characters (particularly ‘/’). MacOS allows all of these (e.g., it is not uncommon to see a
file name with the Copyright symbol © in it). With the world-wide spread of computer
technology, it is becoming increasingly important to support languages other than
English, and in fact alphabets other than Latin. There is a move to support character
strings (and in particular file names) in the Unicode character set, which devotes 16 bits
to each character rather than 8 and can represent the alphabets of all major modern
languages from Arabic to Devanagari to Telugu to Khmer.
Format
It is common to divide a file name into a base name and an extension that indicates the
type of the file. DOS requires that each name be composed of a base name of eight or fewer
characters and an extension of three or fewer characters. When the name is displayed, it is
represented as base.extension. Unix internally makes no such distinction, but it is a
common convention to include exactly one period in a file name (e.g. fil.c for a C source
file).
File Structure
Unix hides the “chunkiness” of tracks, sectors, etc. and presents each file as a “smooth”
array of bytes with no internal structure. Application programs can, if they wish, use the
bytes in the file to represent structures. For example, a wide-spread convention in Unix is
to use the newline character (the character with bit pattern 00001010) to break text files
into lines. Some other systems provide a variety of other types of files. The most
common are files that consist of an array of fixed or variable size records and files that
form an index mapping keys to values. Indexed files are usually implemented as B-trees.
File Types
Most systems divide files into various “types.” The concept of “type” is a confusing one,
partially because the term “type” can mean different things in different contexts. Unix
initially supported only four types of files: directories, two kinds of special files, and
“regular” files. Just about any type of file is considered a “regular” file by Unix. Within
this category, however, it is useful to distinguish text files from binary files; within binary
files there are executable files (which contain machine-language code) and data files; text
files might be source files in a particular programming language (e.g. C or Java) or they
may be human-readable text in some mark-up language such as html (hypertext markup
language). Data files may be classified according to the program that created them or is
able to interpret them, e.g., a file may be a Microsoft Word document or Excel
spreadsheet or the output of TeX. The possibilities are endless.
In general (not just in Unix) there are three ways of indicating the type of a file:
1. The operating system may record the type of a file in meta-data stored separately
from the file, but associated with it. Unix only provides enough meta-data to
distinguish a regular file from a directory (or special file), but other systems
support more types.
2. The type of a file may be indicated by part of its contents, such as a header made
up of the first few bytes of the file. In Unix, files that store executable programs
start with a two byte magic number that identifies them as executable and selects
one of a variety of executable formats. In the original Unix executable format,
called the a.out format, the magic number is the octal number 0407, which
happens to be the machine code for a branch instruction on the PDP-11 computer,
one of the first computers to implement Unix. The operating system could run a
file by loading it into memory and jumping to the beginning of it. The 0407 code,
interpreted as an instruction, jumps to the word following the 16-byte header,
which is the beginning of the executable code in this format. The PDP-11
computer is extinct by now, but it lives on through the 0407 code!
3. The type of a file may be indicated by its name. Sometimes this is just a
convention, and sometimes it’s enforced by the OS or by certain programs. For
example, the Unix Java compiler refuses to believe that a file contains Java source
unless its name ends with .java.
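Approach 2 above - detecting type from a header - can be sketched in a few lines of C. Modern Unix executables use the ELF format, which really does begin with the four bytes 0x7f 'E' 'L' 'F', much as a.out files began with octal 0407; the helper name below is made up for the illustration.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative type detection by magic number: does the file
   start with the documented ELF magic bytes 0x7f 'E' 'L' 'F'? */
int looks_like_elf(const char *path) {
    unsigned char magic[4];
    FILE *f = fopen(path, "rb");
    if (!f) return 0;
    size_t n = fread(magic, 1, 4, f);  /* read just the header */
    fclose(f);
    return n == 4 && memcmp(magic, "\x7f" "ELF", 4) == 0;
}
```

The same pattern works for any header-based format: read the first few bytes and compare against the known magic value.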
Some systems enforce the types of files more vigorously than others. File types may be
enforced
• Not at all,
• Only by convention,
• By certain programs (e.g. the Java compiler), or
• By the operating system itself.
Unix tends to be very lax in enforcing types.
Access Modes
Systems support various access modes for operations on a file.
• Sequential. Read or write the next record or next n bytes of the file. Usually,
sequential access also allows a rewind operation.
• Random. Read or write the nth record or bytes i through j. Unix provides an
equivalent facility by adding a seek operation to the sequential operations listed
above. This packaging of operations allows random access but encourages
sequential access.
• Indexed. Read or write the record with a given key. In some cases, the “key”
need not be unique – there can be more than one record with the same key. In this
case, programs use a combination of indexed and sequential operations: Get the
first record with a given key, then get other records with the same key by doing
sequential reads.
Note that access modes are distinct from file structure – e.g., a record-structured file
can be accessed either sequentially or randomly – but the two concepts are not entirely
unrelated. For example, indexed access mode only makes sense for indexed files.
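The Unix packaging of random access - a seek plus the ordinary sequential operations - looks like the following sketch; the file name and helper name are arbitrary.

```c
#include <stdio.h>

/* Unix-style random access: a seek repositions the file offset,
   after which a plain sequential read returns the nth byte. */
int read_nth_byte(const char *path, long n) {
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    int c = -1;
    if (fseek(f, n, SEEK_SET) == 0)   /* jump straight to offset n */
        c = fgetc(f);                 /* then read sequentially */
    fclose(f);
    return c;
}
```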
File Attributes
This is the area where there is the most variation among file systems. Attributes can also
be grouped by general category.
Name
Ownership and Protection
Owner, owner’s “group,” creator, access-control list (information about who can do what
to this file, for example, perhaps the owner can read or modify it, other members of his
group can only read it, and others have no access).
Time Stamps
Time created, time last modified, time last accessed, time the attributes were last
changed, etc. Unix maintains the last three of these. Some systems record not only when
the file was last modified, but by whom.
Sizes
Current size, size limit, “high-water mark”, space consumed (which may be larger than
size because of internal fragmentation or smaller because of various compression
techniques).
Type Information
As described above: File is ASCII, is executable, is a “system” file, is an Excel spread
sheet, etc.
Misc
Some systems have attributes describing how the file should be displayed when a directory
is listed. For example MacOS records an icon to represent the file and the screen
coordinates where it was last displayed. DOS has a “hidden” attribute meaning that the
file is not normally shown. Unix achieves a similar effect by convention: The ls program
that is usually used to list files does not show files with names that start with a period
unless you explicitly request it to (with the -a option).
Unix records a fixed set of attributes in the meta-data associated with a file. If you want
to record some fact about the file that is not included among the supported attributes, you
have to use one of the tricks listed above for recording type information: encode it in the
name of the file, put it into the body of the file itself, or store it in a file with a related
name. Other systems (notably MacOS and Windows NT) allow new attributes to be
invented on the fly. In MacOS, each file has a resource fork, which is a list of (attribute-
name, attribute-value) pairs. The attribute name can be any four-character string, and the
attribute value can be anything at all. Indeed, some kinds of files put the entire “contents”
of the file in an attribute and leave the “body” of the file (called the data fork) empty.
Self Assessment Questions
1. Discuss the three ways of indicating the type of files.
2. Explain the various types of file access modes.
3. Explain the file system attributes in brief.
Implementing File Systems
Files
We will assume that all the blocks of the disk are given block numbers starting at zero
and running through consecutive integers up to some maximum. We will further assume
that blocks with numbers that are near each other are located physically near each other
on the disk (e.g., same cylinder) so that the arithmetic difference between the numbers of
two blocks gives a good estimate of how long it takes to get from one to the other. First let’s
consider how to represent an individual file. There are (at least!) four possibilities:
Contiguous
The blocks of a file are the block numbered n, n+1, n+2, …, m. We can represent any file
with a pair of numbers: the block number of the first block and the length of the file (in
blocks). The advantages of this approach are
• It’s simple
• The blocks of the file are all physically near each other on the disk and in order so
that a sequential scan through the file will be fast.
The problem with this organization is that you can only grow a file if the block following
the last block in the file happens to be free. Otherwise, you would have to find a long
enough run of free blocks to accommodate the new length of the file and copy it. As a
practical matter, operating systems that use this organization require the maximum size of
the file to be declared when it is created and pre-allocate space for the whole file. Even
then, storage allocation has all the problems we considered when studying main-memory
allocation including external fragmentation.
Linked List
A file is represented by the block number of its first block, and each block contains the
block number of the next block of the file. This representation avoids the problems of the
contiguous representation: We can grow a file by linking any disk block onto the end of
the list, and there is no external fragmentation. However, it introduces a new problem:
Random access is effectively impossible. To find the 100th block of a file, we have to
read the first 99 blocks just to follow the list. We also lose the advantage of very fast
sequential access to the file since its blocks may be scattered all over the disk. However,
if we are careful when choosing blocks to add to a file, we can retain pretty good
sequential access performance.
Both the space overhead (the percentage of the space taken up by pointers) and the time
overhead (the percentage of the time seeking from one place to another) can be decreased
by using larger blocks. The hardware designer fixes the block size (which is usually quite
small) but the software can get around this problem by using “virtual” blocks, sometimes
called clusters. The OS simply treats each group of (say) four contiguous physical disk
sectors as one cluster. Large clusters, particularly if they can be of variable size, are
sometimes called extents. Extents can be thought of as a compromise between linked and
contiguous allocation.
Disk Index
The idea here is to keep the linked-list representation, but take the link fields out of the
blocks and gather them together all in one place. This approach is used in the “FAT” file
system of DOS, OS/2 and older versions of Windows. At some fixed place on disk,
allocate an array I with one element for each block on the disk, and move the link field
from block n to I[n]. The whole array of links, called a file allocation table (FAT), is now
small enough that it can be read into main memory when the systems starts up. Accessing
the 100th block of a file still requires walking through 99 links of a linked list, but now
the entire list is in memory, so time to traverse it is negligible (recall that a single disk
access takes as long as 10’s or even 100’s of thousands of instructions). This
representation has the added advantage of getting the “operating system” stuff (the links)
out of the pages of “user data”. The pages of user data are now full-size disk blocks, and
lots of algorithms work better with chunks that are a power of two bytes long. Also, it
means that the OS can prevent users (who are notorious for screwing things up) from
getting their grubby hands on the system data.
The main problem with this approach is that the index array I can get quite large with
modern disks. For example, consider a 2 GB disk with 2K blocks. There are a million
blocks, so a block number must be at least 20 bits. Rounded up to an even number of
bytes, that’s 3 bytes–4 bytes if we round up to a word boundary–so the array I is three or
four megabytes. While that’s not an excessive amount of memory given today’s RAM
prices, if we can get along with less, there are better uses for the memory.
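The two calculations in this section - chasing links in an in-memory FAT, and the size of the table I - can be sketched as follows. The geometry matches the example above (2 GB disk, 2K blocks, so 2^20 blocks); the names are illustrative.

```c
/* Sketch of an in-memory FAT: fat[n] holds the number of the block
   that follows block n in its file, with -1 marking end-of-file. */
enum { BLOCK_SIZE = 2048, NBLOCKS = 1 << 20 };   /* 2 GB / 2K blocks */

/* Follow the chain in memory to find the kth block of a file
   (k = 0 is the first block). One link per step, but no disk I/O. */
long nth_block(const long *fat, long first_block, int k) {
    long b = first_block;
    while (k-- > 0 && b != -1)
        b = fat[b];
    return b;
}

/* Size of the whole table I for a given entry width. */
long fat_bytes(int bytes_per_entry) {
    return (long)NBLOCKS * bytes_per_entry;
}
```

With 4-byte entries the table is 4 MB, matching the "three or four megabytes" estimate above.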
File Index
Although a typical disk may contain tens of thousands of files, only a few of them are
open at any one time, and it is only necessary to keep index information about open files
in memory to get good performance. Unfortunately the whole-disk index described in the
previous paragraph mixes index information about all files for the whole disk together,
making it difficult to cache only information about open files. The inode structure
introduced by Unix groups together index information about each file individually. The
basic idea is to represent each file as a tree of blocks, with the data blocks as leaves. Each
internal block (called an indirect block in Unix jargon) is an array of block numbers,
listing its children in order. If a disk block is 2K bytes and a block number is four bytes,
512 block numbers fit in a block, so a one-level tree (a single root node pointing directly
to the leaves) can accommodate files up to 512 blocks, or one megabyte in size. If the
root node is cached in memory, the “address” (block number) of any block of the file can
be found without any disk accesses. A two-level tree, with 513 total indirect blocks, can
handle files 512 times as large (up to one-half gigabyte).
The only problem with this idea is that it wastes space for small files. Any file with more
than one block needs at least one indirect block to store its block numbers. A 4K file
would require three 2K blocks, wasting up to one third of its space. Since many files are
quite small, this is a serious problem. The Unix solution is to use a different kind of
“block” for the root of the tree.
An index node (or inode for short) contains almost all the meta-data about a file listed
above: ownership, permissions, time stamps, etc. (but not the file name). Inodes are small
enough that several of them can be packed into one disk block. In addition to the meta-
data, an inode contains the block numbers of the first few blocks of the file. What if the
file is too big to fit all its block numbers into the inode? The earliest version of Unix had
a bit in the meta-data to indicate whether the file was “small” or “big.” For a big file, the
inode contained the block numbers of indirect blocks rather than data blocks. More recent
versions of Unix contain pointers to indirect blocks in addition to the pointers to the first
few data blocks. The inode contains pointers to (i.e., block numbers of) the first few
blocks of the file, a pointer to an indirect block containing pointers to the next several
blocks of the file, a pointer to a doubly indirect block, which is the root of a two-level
tree whose leaves are the next blocks of the file, and a pointer to a triply indirect block. A
large file is thus a lop-sided tree.
A real-life example is given by the Solaris 2.5 version of Unix. Block numbers are four
bytes and the size of a block is a parameter stored in the file system itself, typically 8K
(8192 bytes), so 2048 pointers fit in one block. An inode has direct pointers to the first 12
blocks of the file, as well as pointers to singly, doubly, and triply indirect blocks. A file
of up to 12+2048+2048*2048 = 4,196,364 blocks or 34,376,613,888 bytes (about 32 GB)
can be represented without using triply indirect blocks, and with the triply indirect
block, the maximum file size is (12+2048+2048*2048+2048*2048*2048)*8192 =
70,403,120,791,552 bytes (slightly more than 2^46 bytes, or about 64 terabytes). Of
course, for such huge files, the size of the
file cannot be represented as a 32-bit integer. Modern versions of Unix store the file
length as a 64-bit integer, called a “long” integer in Java. An inode is 128 bytes long,
allowing room for the 15 block pointers plus lots of meta-data. 64 inodes fit in one disk
block. Since the inode for a file is kept in memory while the file is open, locating an
arbitrary block of any file requires at most three I/O operations, not counting the
operation to read or write the data block itself.
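The Solaris arithmetic above is easy to check in code. The constants come straight from the text: 8K blocks, 4-byte block numbers (so 2048 pointers per block), and 12 direct pointers in the inode.

```c
#include <stdint.h>

/* Solaris 2.5 inode capacity, per the figures in the text. */
enum { BLK = 8192, NPTRS = 8192 / 4, NDIRECT = 12 };  /* 2048 ptrs/block */

/* Maximum file size in blocks without the triply indirect block:
   direct + singly indirect + doubly indirect. */
uint64_t max_blocks_double(void) {
    return NDIRECT + (uint64_t)NPTRS + (uint64_t)NPTRS * NPTRS;
}

/* Maximum file size in bytes with the triply indirect block. */
uint64_t max_bytes_triple(void) {
    uint64_t blocks = max_blocks_double() + (uint64_t)NPTRS * NPTRS * NPTRS;
    return blocks * BLK;
}
```

Evaluating these reproduces the 4,196,364-block (about 32 GB) and 70,403,120,791,552-byte (about 64 TB) figures quoted above, which is also why the file length must be a 64-bit integer.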
Directories
A directory is simply a table mapping character-string human-readable names to
information about files. The early PC operating system CP/M shows how simple a
directory can be. Each entry contains the name of one file, its owner, size (in blocks) and
the block numbers of 16 blocks of the file. To represent files with more than 16 blocks,
CP/M used multiple directory entries with the same name and different values in a field
called the extent number. CP/M had only one directory for the entire system.
DOS uses a similar directory entry format, but stores only the first block number of the
file in the directory entry. The entire file is represented as a linked list of blocks using the
disk index scheme described above. All but the earliest version of DOS provide
hierarchical directories using a scheme similar to the one used in Unix.
Unix has an even simpler directory format. A directory entry contains only two fields: a
character-string name (up to 14 characters) and a two-byte integer called an inumber,
which is interpreted as an index into an array of inodes in a fixed, known location on
disk. All the remaining information about the file (size, ownership, time stamps,
permissions, and an index to the blocks of the file) are stored in the inode rather than the
directory entry. A directory is represented like any other file (there’s a bit in the inode to
indicate that the file is a directory). Thus the inumber in a directory entry may designate a
“regular” file or another directory, allowing arbitrary graphs of nodes. However, Unix
carefully limits the set of operating system calls to ensure that the set of directories is
always a tree. The root of the tree is the file with inumber 1 (some versions of Unix use
other conventions for designating the root directory). The entries in each directory point
to its children in the tree. For convenience, each directory also has two special entries: an
entry with name “..”, which points to the parent of the directory in the tree and an entry
with name “.”, which points to the directory itself. Inumber 0 is not used, so an entry is
marked “unused” by setting its inumber field to 0.
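The classic Unix directory entry described above - a two-byte inumber followed by a 14-character name - can be written down as a C struct; the struct name is made up, but the layout follows the text.

```c
#include <stdint.h>

/* Classic (V7-era) Unix directory entry: 16 bytes in all. */
struct v7_dirent {
    uint16_t d_ino;        /* index into the inode array; 0 = unused */
    char     d_name[14];   /* not NUL-terminated when all 14 chars used */
};

/* An entry is free when its inumber field is 0, as the text notes. */
int entry_is_free(const struct v7_dirent *e) {
    return e->d_ino == 0;
}
```

Because a directory is stored like any other file, reading it is just reading an array of these fixed-size records.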
Self Assessment Questions
1. What is a block? Write its advantages.
2. Explain the disk index scheme and its advantages.
3. Explain the Unix directory format with a suitable example.
Space Management
Block Size and Extents
All of the file organizations I’ve mentioned store the contents of a file in a set of disk
blocks. How big should a block be? The problem with small blocks is I/O overhead.
There is a certain overhead to read or write a block beyond the time to actually transfer
the bytes. If we double the block size, a typical file will have half as many blocks.
Reading or writing the whole file will transfer the same amount of data, but it will
involve half as many disk I/O operations. The overhead for an I/O operations includes a
variable amount of latency (seek time and rotational delay) that depends on how close the
blocks are to each other, as well as a fixed overhead to start each operation and respond
to the interrupt when it completes.
Many years ago, researchers at the University of California at Berkeley studied the
original Unix file system. They found that when they tried reading or writing a single
very large file sequentially, they were getting only about 2% of the potential speed of the
disk. In other words, it took about 50 times as long to read the whole file as it would if
they simply read that many sequential blocks directly from the raw disk (with no file
system software). They tried doubling the block size (from 512 bytes to 1K) and the
performance more than doubled. The reason the speed more than doubled was that it took
less than half as many I/O operations to read the file. Because the blocks were twice as
large, twice as much of the file’s data was in blocks pointed to directly by the inode.
Indirect blocks were twice as large as well, so they could hold twice as many pointers.
Thus four times as much data could be accessed through the singly indirect block without
resorting to the doubly indirect block.
If doubling the block size more than doubled performance, why stop there? Why didn’t
the Berkeley folks make the blocks even bigger? The problem with big blocks is internal
fragmentation. A file can only grow in increments of whole blocks. If the sizes of files
are random, we would expect on the average that half of the last block of a file is wasted.
If most files are many blocks long, the relative amount of waste is small, but if the block
size is large compared to the size of a typical file, half a block per file is significant. In
fact, if files are very small (compared to the block size), the problem is even worse. If, for
example, we choose a block size of 8k and the average file is only 1K bytes long, we
would be wasting about 7/8 of the disk.
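The internal-fragmentation arithmetic in this paragraph amounts to one formula: the tail of a file's last block is lost.

```c
/* Bytes wasted by internal fragmentation when a file of the given
   size is stored in whole blocks of the given block size. */
long waste_bytes(long file_size, long block_size) {
    long tail = file_size % block_size;
    return tail == 0 ? 0 : block_size - tail;   /* unused tail of last block */
}
```

For the example above, a 1K file in an 8K block wastes 7168 bytes, i.e. 7/8 of the block.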
Most files in a typical Unix system are very small. The Berkeley researchers made a list
of the sizes of all files on a typical disk and did some calculations of how much space
would be wasted by various block sizes. Simply rounding the size of each file up to a
multiple of 512 bytes resulted in wasting 4.2% of the space. Including overhead for
inodes and indirect blocks, the original 512-byte file system had a total space overhead of
6.9%. Changing to 1K blocks raised the overhead to 11.8%. With 2k blocks, the overhead
would be 22.4% and with 4k blocks it would be 45.6%. Would 4k blocks be worthwhile?
The answer depends on economics. In those days disks were very expensive, and
wasting half the disk seemed extreme. These days, disks are cheap, and for many
applications people would be happy to pay twice as much per byte of disk space to get a
disk that was twice as fast.
But there’s more to the story. The Berkeley researchers came up with the idea of breaking
up the disk into blocks and fragments. For example, they might use a block size of 2k and
a fragment size of 512 bytes. Each file is stored in some number of whole blocks plus 0
to 3 fragments at the end. The fragments at the end of one file can share a block with
fragments of other files. The problem is that when we want to append to a file, there may
not be any space left in the block that holds its last fragment. In that case, the Berkeley
file system copies the fragments to a new (empty) block. A file that grows a little at a
time may require each of its fragments to be copied many times. They got around this
problem by modifying application programs to buffer their data internally and add it to a
file a whole block’s worth at a time. In fact, most programs already used library routines
to buffer their output (to cut down on the number of system calls), so all they had to do
was to modify those library routines to use a larger buffer size. This approach has been
adopted by many modern variants of Unix. The Solaris system you are using for this
course uses 8k blocks and 1K fragments.
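The block-and-fragment arithmetic can be sketched as follows; this is only an illustration of the layout rule, not the actual Berkeley code (2 KB blocks and 512-byte fragments, as in the example above):

```python
# Layout rule: a file occupies whole blocks plus 0 to (block//frag - 1)
# trailing fragments; a full block's worth of fragments is just a block.
def layout(size, block=2048, frag=512):
    whole, tail = divmod(size, block)
    frags = -(-tail // frag)          # round the tail up to whole fragments
    if frags == block // frag:
        whole, frags = whole + 1, 0
    return whole, frags

print(layout(2500))   # (1, 1): one 2 KB block plus one 512 B fragment
```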
As disks get cheaper and CPU’s get faster, wasted space is less of a problem and the
speed mismatch between the CPU and the disk gets worse. Thus the trend is towards
larger and larger disk blocks.
At first glance it would appear that the OS designer has no say in how big a block is. Any
particular disk drive has a sector size, usually 512 bytes, wired in. But it is possible to use
larger “blocks”. For example, if we think it would be a good idea to use 2K blocks, we
can group together each run of four consecutive sectors and call it a block. In fact, it
would even be possible to use variable-sized “blocks,” so long as each one is a multiple
of the sector size. A variable-sized “block” is called an extent. When extents are used,
they are usually used in addition to multi-sector blocks. For example, a system may use
2k blocks, each consisting of 4 consecutive sectors, and then group them into extents of 1
to 10 blocks. When a file is opened for writing, it grows by adding an extent at a time.
When it is closed, the unused blocks at the end of the last extent are returned to the
system. The problem with extents is that they introduce all the problems of external
fragmentation that we saw in the context of main memory allocation. Extents are
generally only used in systems such as databases, where high-speed access to very large
files is important.
Free Space
We have seen how to keep track of the blocks in each file. How do we keep track of the
free blocks – blocks that are not in any file? There are two basic approaches.
• Use a bit vector. That is simply an array of bits with one bit for each block on the
disk. A 1 bit indicates that the corresponding block is allocated (in some file) and
a 0 bit says that it is free. To allocate a block, search the bit vector for a zero bit,
and set it to one.
• Use a free list. The simplest approach is simply to link together the free blocks by
storing the block number of each free block in the previous free block. The
problem with this approach is that when a block on the free list is allocated, you
have to read it into memory to get the block number of the next block in the list.
This problem can be solved by storing the block numbers of additional free blocks
in each block on the list. In other words, the free blocks are stored in a sort of
lopsided tree on disk. If, for example, 128 block numbers fit in a block, 1/128 of
the free blocks would be linked into a list. Each block on the list would contain a
pointer to the next block on the list, as well as pointers to 127 additional free
blocks. When the first block of the list is allocated to a file, it has to be read into
memory to get the block numbers stored in it, but then we can allocate 127 more
blocks without reading any of them from disk. Freeing blocks is done by running
this algorithm in reverse: Keep a cache of 127 block numbers in memory. When a
block is freed, add its block number to this cache. If the cache is full when a block
is freed, use the block being freed to hold all the block numbers in the cache and
link it to the head of the free list by adding to it the block number of the previous
head of the list.
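A minimal sketch of the bitmap approach (a toy in-memory version; a real file system keeps the bitmap on disk and searches it more cleverly):

```python
# 1 = allocated (in some file), 0 = free; one bit per disk block.
class Bitmap:
    def __init__(self, nblocks):
        self.bits = [0] * nblocks
    def alloc(self):
        for i, bit in enumerate(self.bits):   # search for a zero bit
            if bit == 0:
                self.bits[i] = 1              # ...and set it to one
                return i
        raise RuntimeError("disk full")
    def free(self, i):
        self.bits[i] = 0

bm = Bitmap(8)
a, b = bm.alloc(), bm.alloc()   # blocks 0 and 1
bm.free(a)
print(bm.alloc())               # 0: the lowest-numbered free block again
```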
How do these methods compare? Neither requires significant space overhead on disk.
The bitmap approach needs one bit for each block. Even for a tiny block size of 512
bytes, each bit of the bitmap describes 512*8 = 4096 bits of free space, so the overhead is
less than 1/40 of 1%. The free list is even better. All the pointers are stored in blocks that
are free anyhow, so there is no space overhead (except for one pointer to the head of the
list). Another way of looking at this is that when the disk is full (which is the only time
we should be worried about space overhead!) the free list is empty, so it takes up no
space. The real advantage of bitmaps over free lists is that they give the space allocator
more control over which block is allocated to which file. Since the blocks of a file are
generally accessed together, we would like them to be near each other on disk. To ensure
this clustering, when we add a block to a file we would like to choose a free block that is
near the other blocks of a file. With a bitmap, we can search the bitmap for an appropriate
block. With a free list, we would have to search the free list on disk, which is clearly
impractical. Of course, to search the bitmap, we have to have it all in memory, but since
the bitmap is so tiny relative to the size of the disk, it is not unreasonable to keep the
entire bitmap in memory all the time. To do the comparable operation with a free list, we
would need to keep the block numbers of all free blocks in memory. If a block number is
four bytes (32 bits), that means that 32 times as much memory would be needed for the
free list as for a bitmap. For a concrete example, consider a 2 gigabyte disk with 8K
blocks and 4-byte block numbers. The disk contains 2^31/2^13 = 2^18 = 262,144 blocks. If
they are all free, the free list has 262,144 entries, so it would take one megabyte of
memory to keep them all in memory at once. By contrast, a bitmap requires 2^18 bits, or
2^15 = 32K bytes (just four blocks). (On the other hand, the bitmap takes the same amount
of memory regardless of the number of blocks that are free.)
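The arithmetic in this example can be verified directly:

```python
disk_bytes  = 2**31    # 2 gigabyte disk
block_bytes = 2**13    # 8K blocks
nblocks = disk_bytes // block_bytes
print(nblocks)                        # 262144 blocks (2**18)
print(nblocks * 4 // 2**20)           # 1 MB free list if every block is free
print(nblocks // 8)                   # 32768-byte bitmap (2**18 bits)
print((nblocks // 8) // block_bytes)  # 4 blocks to hold the bitmap
```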
Reliability
Disks fail, disks sectors get corrupted, and systems crash, losing the contents of volatile
memory. There are several techniques that can be used to mitigate the effects of these
failures. We only have room for a brief survey.
Bad-block Forwarding
When the disk drive writes a block of data, it also writes a checksum, a small number of
additional bits whose value is some function of the “user data” in the block. When the
block is read back in, the checksum is also read and compared with the data. If either the
data or checksum were corrupted, it is extremely unlikely that the checksum comparison
will succeed. Thus the disk drive itself has a way of discovering bad blocks with
extremely high probability.
The hardware is also responsible for recovering from bad blocks. Modern disk drives do
automatic bad-block forwarding. The disk drive or controller is responsible for mapping
block numbers to absolute locations on the disk (cylinder, track, and sector). It holds a
little bit of space in reserve, not mapping any block numbers to this space. When a bad
block is discovered, the disk allocates one of these reserved blocks and maps the block
number of the bad block to the replacement block. All references to this block number
access the replacement block instead of the bad block. There are two problems with this
scheme. First, when a block goes bad, the data in it is lost. In practice, blocks tend to be
bad from the beginning, because of small defects in the surface coating of the disk
platters. There is usually a stand-alone formatting program that tests all the blocks on the
disk and sets up forwarding entries for those that fail. Thus the bad blocks never get used
in the first place. The main reason for the forwarding is that it is just too hard (expensive)
to create a disk with no defects. It is much more economical to manufacture a “pretty
good” disk and then use bad-block forwarding to work around the few bad blocks. The
other problem is that forwarding interferes with the OS’s attempts to lay out files
optimally. The OS may think it is doing a good job by assigning consecutive blocks of a
file to consecutive block numbers, but if one of those blocks is forwarded, it may be very
far away from the others. In practice, this is not much of a problem since a disk typically
has only a handful of forwarded sectors out of millions.
The software can also help avoid bad blocks by simply leaving them out of the free list
(or marking them as allocated in the allocation bitmap).
Back-up Dumps
There are a variety of storage media that are much cheaper than (hard) disks but are also
much slower. An example is 8 millimeter video tape. A “two-hour” tape costs just a few
dollars and can hold two gigabytes of data. By contrast, a 2GB hard drive currently costs
several hundred dollars. On the other hand, while worst-case access time to a hard drive
is a few tens of milliseconds, rewinding or fast-forwarding a tape to desired location can
take several minutes. One way to use tapes is to make periodic back up dumps. Dumps
are really used for two different purposes:
• To recover lost files. Files can be lost or damaged by hardware failures, but far
more often they are lost through software bugs or human error (accidentally
deleting the wrong file). If the file is saved on tape, it can be restored.
• To recover from catastrophic failures. An entire disk drive can fail, or the whole
computer can be stolen, or the building can burn down. If the contents of the disk
have been saved to tape, the data can be restored (to a repaired or replacement
disk). All that is lost is the work that was done since the information was dumped.
Corresponding to these two ways of using dumps, there are two ways of doing dumps. A
physical dump simply copies all of the blocks of the disk, in order, to tape. It’s very fast,
both for doing the dump and for recovering a whole disk, but it makes it extremely slow
to recover any one file. The blocks of the file are likely to be scattered all over the tape,
and while seeks on disk can take tens of milliseconds, seeks on tape can take tens or
hundreds of seconds. The other approach is a logical dump, which copies each file
sequentially. A logical dump makes it easy to restore individual files. It is even easier to
restore files if the directories are dumped separately at the beginning of the tape, or if the
name(s) of each file are written to the tape along with the file.
The problem with logical dumping is that it is very slow. Dumps are usually done
much more frequently than restores. For example, you might dump your disk every
night for three years before something goes wrong and you need to do a restore. An
important trick that can be used with logical dumps is to only dump files that have
changed recently. An incremental dump saves only those files that have been
modified since a particular date and time. Fortunately, most file systems record the
time each file was last modified. If you do a backup each night, you can save only
those files that have changed since the last backup. Every once in a while (say once a
month), you can do a full backup of all files. In Unix jargon, a full backup is called an
epoch (pronounced “eepock”) dump, because it dumps everything that has changed
since “the epoch”–January 1, 1970, which is the earliest possible date in Unix.
The Computer Sciences department currently does backup dumps on about 260 GB of
disk space. Epoch dumps are done once every 14 days, with the timing on different
file systems staggered so that about 1/14 of the data is dumped each night. Daily
incremental dumps save about 6-10% of the data on each file system.
Incremental dumps go fast because they dump only a small fraction of the files, and they
don’t take up a lot of tape. However, they introduce new problems:
• If you want to restore a particular file, you need to know when it was last
modified so that you know which dump tape to look at.
• If you want to restore the whole disk (to recover from a catastrophic failure), you
have to restore from the last epoch dump, and then from every incremental dump
since then, in order. A file that is modified every day will appear on every tape.
Each restore will overwrite the file with a newer version. When you’re done,
everything will be up-to-date as of the last dump, but the whole process can be
extremely slow (and labor-intensive).
• You have to keep around all the incremental tapes since the last epoch. Tapes are
cheap, but they’re not free, and storing them can be a hassle.
The first problem can be solved by keeping a directory of what was dumped when. A
UW alumnus (the same person who invented NFS) made himself a millionaire by
marketing software to do this. The other problems can be solved by a
clever trick. Each dump is assigned a positive integer level. A level n dump is an
incremental dump that dumps all files that have changed since the most recent previous
dump with a level greater than or equal to n. An epoch dump is considered to have
infinitely high level. Levels are assigned to dumps as follows: the nth dump gets a level
one greater than the number of times 2 evenly divides n, yielding the sequence 1, 2, 1,
3, 1, 2, 1, 4, 1, 2, 1, 3, …
This scheme is sometimes called a ruler schedule for obvious reasons. Level-1 dumps
only save files that have changed in the previous day. Level-2 dumps save files that have
changed in the last two days, level-3 dumps cover four days, level-4 dumps cover 8 days,
etc. Higher-level dumps will thus include more files (so they will take longer to do), but
they are done infrequently. The nice thing about this scheme is that you only need to save
one tape from each level, and the number of levels is the logarithm of the interval
between epoch dumps. Thus even if you did a dump each night and an epoch dump only
once a year, you would need only nine levels (hence nine tapes). That also
means that a full restore needs at worst one restore from each of nine tapes (rather than
365 tapes!). To figure out what tapes you need to restore from if your disk is destroyed
after dump number n, express n in binary, and number the bits from right to left, starting
with 1. The 1 bits tell you which dump tapes to use. Restore them in order of decreasing
level. For example, 20 in binary is 10100, so if the disk is destroyed after the 20th dump,
you only need to restore from the epoch dump and from the most recent dumps at levels 5
and 3.
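The restore rule can be written down in a few lines of Python (the function name is invented for illustration):

```python
def restore_levels(n):
    """Levels of the tapes needed after dump n, highest (restore first) to lowest."""
    levels = [i + 1 for i in range(n.bit_length()) if n >> i & 1]
    return sorted(levels, reverse=True)

print(restore_levels(20))   # [5, 3]: 20 is 10100 in binary
```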
Self Assessment Questions
1. Explain how the block size is affected on I/O operation to read the file.
2. Explain how do you keep a track of the free blocks that are not in any file?
3. Explain the techniques that can be used to mitigate the effects of the disk fail,
system crash and losing the content of volatile memory.
Consistency Checking
Some of the information in a file system is redundant. For example, the free list could be
reconstructed by checking which blocks are not in any file. Redundancy arises because
the same information is represented in different forms to make different operations faster.
If you want to know which blocks are in a given file, look at the inode. If you want to
know which blocks are not in any inode, use the free list. Unfortunately, various
hardware and software errors can cause the data to become inconsistent. File systems
often include a utility that checks for consistency and optionally attempts to repair
inconsistencies. These programs are particularly handy for cleaning up the disks after a
crash.
Unix has a utility called fsck. It has two principal tasks. First, it checks that blocks are
properly allocated. Each inode is supposed to be the root of a tree of blocks, the free list
is supposed to be a tree of blocks, and each block is supposed to appear in exactly one of
these trees. Fsck runs through all the inodes, checking each allocated inode for
reasonable values, and walking through the tree of blocks rooted at the inode. It maintains
a bit vector to record which blocks have been encountered. If a block is encountered that
has already been seen, there is a problem: Either it occurred twice in the same file (in
which case it isn’t a tree), or it occurred in two different files. A reasonable recovery
would be to allocate a new block, copy the contents of the problem block into it, and
substitute the copy for the problem block in one of the two places where it occurs. It
would also be a good idea to log an error message so that a human being can check up
later to see what’s wrong. After all the files are scanned, any block that hasn’t been found
should be on the free list. It would be possible to scan the free list in a similar manner,
but it’s probably easier just to rebuild the free list from the set of blocks that were not
found in any file. If a bitmap instead of a free list is used, this step is even easier: Simply
overwrite the file system’s bitmap with the bitmap constructed during the scan.
The other main consistency requirement concerns the directory structure. The set of
directories is supposed to be a tree, and each inode is supposed to have a link count that
indicates how many times it appears in directories. The tree structure could be checked
by a recursive walk through the directories, but it is more efficient to combine this check
with the walk through the inodes that checks disk blocks, recording, for each
directory inode encountered, the inumber of its parent. The set of directories is a tree if
and only if every directory other than the root has a unique parent. This pass
can also rebuild the link count for each inode by maintaining in memory an array with
one slot for each inumber. Each time the inumber is found in a directory, increment the
corresponding element of the array. The resulting counts should match the link counts in
the inodes. If not, correct the counts in the inodes.
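The link-count pass can be sketched with toy data (the directory contents, inumbers, and stored counts here are invented):

```python
from collections import Counter

directories = {                  # directory name -> list of (entry name, inumber)
    "/":  [("a", 2), ("b", 3)],
    "/a": [("x", 3)],            # inode 3 is linked from two directories
}
stored_link_counts = {2: 1, 3: 1}    # inode 3's stored count is wrong

# Count actual directory references to each inumber, then compare.
actual = Counter(inum for entries in directories.values() for _, inum in entries)
for inum, count in stored_link_counts.items():
    if actual[inum] != count:
        print(f"inode {inum}: stored link count {count}, actual {actual[inum]}")
```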
This illustrates a very important principle that pops up throughout operating system
implementation (indeed, throughout any large software system): the doctrine of hints and
absolutes. Whenever the same fact is recorded in two different ways, one of them should
be considered the absolute truth, and the other should be considered a hint. Hints are
handy because they allow some operations to be done much more quickly than they could
be if only the absolute information were available. But if the hint and the absolute do not
agree, the hint can be rebuilt from the absolutes. In a well-engineered system, there
should be some way to verify a hint whenever it is used. Unix is a bit lax about this. The
link count is a hint (the absolute information is a count of the number of times the
inumber appears in directories), but Unix treats it like an absolute during normal
operation. As a result, a small error can snowball into completely trashing the file system.
For another example of hints, each allocated block could have a header containing the
inumber of the file containing it and its offset in the file. There are systems that do this
(Unix isn’t one of them). The tree of blocks rooted at an inode then becomes a hint,
providing an efficient way of finding a block, but when the block is found, its header
could be checked. Any inconsistency would then be caught immediately, and the inode
structures could be rebuilt from the information in the block headers.
By the way, if the link count calculated by the scan is zero (i.e., the inode, although
marked as allocated, does not appear in any directory), it would not be prudent to delete
the file. A better recovery is to add an entry to a special lost+found directory pointing to
the orphan inode, in case it contains something really valuable.
Transactions
The previous section talks about how to recover from situations that “can’t happen.” How
do these problems arise in the first place? Wouldn’t it be better to prevent these problems
rather than recover from them after the fact? Many of these problems arise, particularly
after a crash, because some operation was “half-completed.” For example, suppose the
system was in the middle of executing an unlink system call when the lights went out. An
unlink operation involves several distinct steps:
• remove an entry from a directory,
• decrement a link count, and if the count goes to zero,
• move all the blocks of the file to the free list, and
• free the inode.
If the crash occurs between the first and second steps, the link count will be wrong. If it
occurs during the third step, a block may be linked both into the file and the free list, or
neither, depending on the details of how the code is written. And so on…
To deal with this kind of problem in a general way, transactions were invented.
Transactions were first developed in the context of database management systems, and
are used heavily there, so there is a tradition of thinking of them as “database stuff” and
teaching about them only in database courses and text books. But they really are an
operating system concept. Here’s a two-bit introduction.
We have already seen a mechanism for making complex operations appear atomic. It is
called a critical section. Critical sections have a property that is sometimes called
synchronization atomicity. It is also called serializability because if two processes try to
execute their critical sections at about the same time, the net effect will be as if they
occurred in some serial order. If systems can crash (and they can!), synchronization
atomicity isn’t enough. We need another property, called failure atomicity, which means
an “all or nothing” property: Either all of the modifications of nonvolatile storage
complete or none of them do.
There are basically two ways to implement failure atomicity. They both depend on the
fact that writing a single block to disk is an atomic operation. The first approach is
called logging. An append-only file called a log is maintained on disk. Each time a
transaction does something to file-system data, it creates a log record describing the
operation and appends it to the log. The log record contains enough information to undo
the operation. For example, if the operation made a change to a disk block, the log record
might contain the block number, the length and offset of the modified part of the block,
and the original content of that region. The transaction also writes a begin record
when it starts, and a commit record when it is done. After a crash, a recovery process
scans the log looking for transactions that started (wrote a begin record) but never
finished (wrote a commit record). If such a transaction is found, its partially completed
operations are undone (in reverse order) using the undo information in the log records.
Sometimes, for efficiency, disk data is cached in memory. Modifications are made to the
cached copy and only written back out to disk from time to time. If the system crashes
before the changes are written to disk, the data structures on disk may be inconsistent.
Logging can also be used to avoid this problem by putting into each log record redo
information as well as undo information. For example, the log record for a modification
of a disk block should contain both the old and new value. After a crash, if the recovery
process discovers a transaction that has completed, it uses the redo information to make
sure the effects of all of its operations are reflected on disk. Full recovery is always
possible provided
• The log records are written to disk in order,
• The commit record is written to disk when the transaction completes, and
• The log record describing a modification is written to disk before any of the
changes made by that operation are written to disk.
This algorithm is called write-ahead logging.
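A toy sketch of log-based recovery, assuming an in-memory “disk” and a list of log records (the record format is invented for illustration; real logs are far more elaborate):

```python
# Records: {"op": "write", "txn": id, "block": n, "old": v, "new": v} or
# {"op": "commit", "txn": id}. "disk" maps block number to contents.
def recover(log, disk):
    committed = {r["txn"] for r in log if r["op"] == "commit"}
    for r in log:                          # redo committed writes, in log order
        if r["op"] == "write" and r["txn"] in committed:
            disk[r["block"]] = r["new"]
    for r in reversed(log):                # undo uncommitted writes, in reverse
        if r["op"] == "write" and r["txn"] not in committed:
            disk[r["block"]] = r["old"]
    return disk

log = [
    {"op": "write", "txn": 1, "block": 7, "old": "A", "new": "B"},
    {"op": "commit", "txn": 1},
    {"op": "write", "txn": 2, "block": 8, "old": "X", "new": "Y"},  # never committed
]
print(recover(log, {7: "A", 8: "Y"}))   # {7: 'B', 8: 'X'}
```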
The other way of implementing transactions is called shadow blocks. Suppose the data
structure on disk is a tree. The basic idea is never to change any block (disk block) of the
data structure in place. Whenever you want to modify a block, make a copy of it (called a
shadow of it) instead, and modify the parent to point to the shadow. Of course, to make
the parent point to the shadow you have to modify it, so instead you make a shadow of
the parent and modify it instead. In this way, you shadow not only each block you really
wanted to modify, but also all the blocks on the path from it to the root. You keep the
shadow of the root block in memory. At the end of the transaction, you make sure the
shadow blocks are all safely written to disk and then write the shadow of the root directly
onto the root block. If the system crashes before you overwrite the root block, there will
be no permanent change to the tree on disk. Overwriting the root block has the effect of
linking all the modified (shadow blocks) into the tree and removing all the old blocks.
Crash recovery is simply a matter of garbage collection. If the crash occurs before the
root was overwritten, all the shadow blocks are garbage. If it occurs after, the blocks they
replaced are garbage. In either case, the tree itself is consistent, and it is easy to find the
garbage blocks (they are blocks that aren’t in the tree).
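A sketch of the shadowing idea on an in-memory tree of “blocks” (the representation is invented for illustration; a real system writes the shadow blocks to disk before swapping the root):

```python
import copy

def shadow_update(blocks, root, path, new_value):
    """path: list of child indices leading from the root to the target block."""
    blocks = dict(blocks)                 # shadows are added alongside originals
    def clone(node_id, rest):
        shadow_id = max(blocks) + 1       # "allocate" a fresh block number
        blocks[shadow_id] = copy.deepcopy(blocks[node_id])
        if rest:                          # not at the target yet: shadow the child too
            child = blocks[node_id]["children"][rest[0]]
            blocks[shadow_id]["children"][rest[0]] = clone(child, rest[1:])
        else:
            blocks[shadow_id]["value"] = new_value
        return shadow_id
    new_root = clone(root, path)          # committing = swapping in new_root
    return blocks, new_root

disk = {0: {"value": "root", "children": [1]},
        1: {"value": "leaf", "children": []}}
disk2, new_root = shadow_update(disk, 0, [0], "leaf2")
print(disk[1]["value"])   # leaf: the original tree is untouched until the swap
```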
Database systems almost universally use logging, and shadowing is mentioned only in
passing in database texts. But the shadowing technique is used in a variant of the Unix
file system called (somewhat misleadingly) the Log-structured File System (LFS). The
entire file system is made into a tree by replacing the array of inodes with a tree of
inodes. LFS has the added advantage (beyond reliability) that all blocks are written
sequentially, so write operations are very fast. It has the disadvantage that files that are
modified here and there by random access tend to have their blocks scattered about, but
that pattern of access is comparatively rare, and there are techniques to cope with it when
it occurs. The main source of complexity in LFS is figuring out when and how to do the
“garbage collection.”
Performance
The main trick to improve file system performance (like anything else in computer
science) is caching. The system keeps a disk cache (sometimes also called a buffer pool)
of recently used disk blocks. In contrast with the page frames of virtual memory, where
there were all sorts of algorithms proposed for managing the cache, management of the
disk cache is pretty simple. On the whole, it is simply managed LRU (least recently
used). Why is it that for paging we went to great lengths trying to come up with an
algorithm that is “almost as good as LRU” while here we can simply use true LRU? The
problem with implementing LRU is that some information has to be updated on every
single reference. In the case of paging, references can be as frequent as every instruction,
so we have to make do with whatever information hardware is willing to give us. The
best we can hope for is that the paging hardware will set a bit in a page-table entry. In the
case of file system disk blocks, however, each reference is the result of a system call, and
adding a few extra instructions to a system call for cache maintenance is not
unreasonable.
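A minimal LRU block-cache sketch (the names are invented; real disk caches add write-back, pinning, and so on):

```python
from collections import OrderedDict

class BlockCache:
    def __init__(self, capacity, read_from_disk):
        self.capacity = capacity
        self.read_from_disk = read_from_disk
        self.cache = OrderedDict()           # block number -> data, LRU first
    def read(self, block):
        if block in self.cache:
            self.cache.move_to_end(block)    # mark most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)   # evict the true LRU block
            self.cache[block] = self.read_from_disk(block)
        return self.cache[block]

cache = BlockCache(2, lambda b: f"data{b}")
cache.read(1); cache.read(2); cache.read(1); cache.read(3)  # 3 evicts block 2
print(list(cache.cache))   # [1, 3]
```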
Summary
File systems and space management are an integral part of operating systems. This
section covers file management and space management, including file structure, file
types, and the different file access modes, and also deals with implementing file
systems. Under space management it covers block sizes and extents, the basic
approaches to keeping track of free space, and disk reliability techniques.
Terminal Questions
1. What do you mean by file? Explain the significance.
2. Explain why virtual memory and files are different kinds of objects.
3. Discuss the file structure? Explain the various access modes.
4. Discuss the various file organization methods?
5. What do you mean by a block & an Extent?
6. Discuss the concept of space management.
7. What do you mean by consistency checking? Discuss how it affects the file
system.
Unit 9 : Input-Output Architecture :
This unit covers the I/O structure , I/O control strategies, Program-controlled I/O,
Interrupt-controlled I/O Direct Memory Access and cover the I/O address space.
Introduction
In our discussion of the memory hierarchy (in Unit 4), it was implicitly assumed that
memory in the computer system would be “fast enough” to match the speed of the
processor (at least for the highest elements in the memory hierarchy) and that no special
consideration need be given about how long it would take for a word to be transferred
from memory to the processor – an address would be generated by the processor, and
after some fixed time interval, the memory system would provide the required
information. (In the case of a cache miss, the time interval would be longer, but generally
still fixed. For a page fault, the processor would be interrupted; and the page fault
handling software invoked.)
Although input-output devices are “mapped” to appear like memory devices in many
computer systems, I/O devices have characteristics quite different from memory devices,
and often pose special problems for computer systems. This is principally for two
reasons:
• I/O devices span a wide range of speeds. (e.g. terminals accepting input at a few
characters per second; disks reading data at over 10 million characters / second).
• Unlike memory operations, I/O operations and the CPU are not generally
synchronized with each other.
Objectives
At the end of this unit, you will be able to understand:
• Fundamentals and significance of I/O Operations
• I/O structure for a medium-scale processor system
• I/O Control Strategies
• Various Mechanisms for I/O Operations
I/O structure
Figure-1 shows the general I/O structure associated with many medium-scale processors.
Note that the I/O controllers and main memory are connected to the main system bus.
The cache memory (usually found on-chip with the CPU) has a direct connection to the
processor, as well as to the system bus.
Figure 1: A general I/O structure for a medium-scale processor system
Note that the I/O devices shown here are not connected directly to the system bus; they
interface with another device called an I/O controller. In simpler systems, the CPU may
also serve as the I/O controller, but in systems where throughput and performance are
important, I/O operations are generally handled outside the processor.
Until relatively recently, the I/O performance of a system was somewhat of an
afterthought for systems designers. The reduced cost of high-performance disks,
permitting the proliferation of virtual memory systems, and the dramatic reduction in the
cost of high-quality video display devices, have meant that designers must pay much
more attention to this aspect to ensure adequate performance in the overall system.
Because of the different speeds and data requirements of I/O devices, different I/O
strategies may be useful, depending on the type of I/O device which is connected to the
computer. Because the I/O devices are not synchronized with the CPU, some information
must be exchanged between the CPU and the device to ensure that the data is received
reliably. This interaction between the CPU and an I/O device is usually referred to as
“handshaking”. For a complete “handshake,” four events are important:
• The device providing the data (the talker) must indicate that valid data is now
available.
• The device accepting the data (the listener) must indicate that it has accepted the
data. This signal informs the talker that it need not maintain this data word on the
data bus any longer.
• The talker indicates that the data on the bus is no longer valid, and removes the
data from the bus. The talker may then set up new data on the data bus.
• The listener indicates that it is not now accepting any data on the data bus. The
listener may use data previously accepted during this time, while it is waiting for
more data to become valid on the bus.
Note that the talker and the listener each supply two signals. The talker supplies a signal
(say, data valid, or DAV) at step (1). It supplies another signal (say, data not valid, or
NOT DAV) at step (3). Both these signals can be coded as a single binary variable (DAV)
which takes the value 1 at step (1) and 0 at step (3). The listener supplies a signal (say,
data accepted, or DAC) at step (2). It supplies a signal (say, data not now accepted, or
NOT DAC) at step (4). It, too, can be coded as a single binary variable, DAC. Because
only two binary variables are required, the handshaking information can be communicated
over two wires, and the form of handshaking described above is called a two-wire handshake.
Other forms of handshaking are used in more complex situations; for example, where
there may be more than one controller on the bus, or where the communication is among
several devices. Figure 2 shows a timing diagram for the signals DAV and DAC which
identifies the timing of the four events described previously.
Figure 2: Timing diagram for two-wire handshake
Either the CPU or the I/O device can act as the talker or the listener. In fact, the CPU may
act as a talker at one time and a listener at another. For example, when communicating
with a terminal screen (an output device) the CPU acts as a talker, but when
communicating with a terminal keyboard (an input device) the CPU acts as a listener.
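The four-step handshake above can be sketched as a small simulation. This is an illustrative model of the protocol, not real bus hardware; the variable names follow the DAV and DAC signals described in the text.

```python
# A minimal simulation of the two-wire handshake described above.
# DAV (data valid) is driven by the talker; DAC (data accepted) by the listener.

def two_wire_handshake(words):
    """Transfer a list of data words, recording the four handshake events."""
    bus = {"DAV": 0, "DAC": 0, "data": None}
    received = []
    events = []
    for word in words:
        # (1) Talker places data on the bus and asserts DAV.
        bus["data"] = word
        bus["DAV"] = 1
        events.append("DAV=1")
        # (2) Listener sees DAV, latches the data, and asserts DAC.
        received.append(bus["data"])
        bus["DAC"] = 1
        events.append("DAC=1")
        # (3) Talker sees DAC, drops DAV and removes the data from the bus.
        bus["DAV"] = 0
        bus["data"] = None
        events.append("DAV=0")
        # (4) Listener sees DAV drop and drops DAC, ready for the next word.
        bus["DAC"] = 0
        events.append("DAC=0")
    return received, events

received, events = two_wire_handshake([0x41, 0x42])
print(received)    # the listener's copy of the data
print(events[:4])  # one full handshake cycle: DAV=1, DAC=1, DAV=0, DAC=0
```

Note that either party may play either role, just as the CPU acts as talker to a screen and listener to a keyboard.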
Self Assessment Questions
1. Explain the general I/O structure for a medium-scale processor system with a neat
diagram.
2. What do you mean by ‘handshaking’? Write the four important events in this
context.
I/O Control Strategies
Several I/O strategies are used between the computer system and I/O devices, depending
on the relative speeds of the computer system and the I/O devices. The simplest strategy
is to use the processor itself as the I/O controller, and to require that the device follow a
strict order of events under direct program control, with the processor waiting for the I/O
device at each step.
Another strategy is to allow the processor to be “interrupted” by the I/O devices, and to
have a (possibly different) “interrupt handling routine” for each device. This allows for
more flexible scheduling of I/O events, as well as more efficient use of the processor.
(Interrupt handling is an important component of the operating system.)
A third general I/O strategy is to allow the I/O device, or the controller for the device,
access to the main memory. The device would write a block of information in main
memory, without intervention from the CPU, and then inform the CPU in some way that
that block of memory had been overwritten or read. This might be done by leaving a
message in memory, or by interrupting the processor. (This is generally the I/O strategy
used by the highest speed devices – hard disks and the video controller.)
Program-controlled I/O
One common I/O strategy is program-controlled I/O, (often called polled I/O). Here all
I/O is performed under control of an “I/O handling procedure,” and input or output is
initiated by this procedure.
The I/O handling procedure will require some status information (handshaking
information) from the I/O device (e.g., whether the device is ready to receive data). This
information is usually obtained through a second input from the device; a single bit is
usually sufficient, so one input “port” can be used to collect status, or handshake,
information from several I/O devices. (A port is the name given to a connection to an I/O
device; e.g., to the memory location into which an I/O device is mapped). An I/O port is
usually implemented as a register (possibly a set of D flip flops) which also acts as a
buffer between the CPU and the actual I/O device. The word port is often used to refer to
the buffer itself.
Typically, there will be several I/O devices connected to the processor; the processor
checks the “status” input port periodically, under program control by the I/O handling
procedure. If an I/O device requires service, it will signal this need by altering its input to
the “status” port. When the I/O control program detects that this has occurred (by reading
the status port) then the appropriate operation will be performed on the I/O device which
requested the service. A typical configuration might look somewhat as shown in
Figure – 3. The outputs labeled “handshake out” would be connected to bits in the “status”
port. The input labeled “handshake in” would typically be generated by the appropriate
decode logic when the I/O port corresponding to the device was addressed.
Figure 3:
Program controlled I/O
Program-controlled I/O has a number of advantages:
• All control is directly under the control of the program, so changes can be readily
implemented.
• The order in which devices are serviced is determined by the program; this order
is not necessarily fixed but can be altered by the program, as necessary. This
means that the “priority” of a device can be varied under program control. (The
“priority” of a device determines which of a set of devices that are simultaneously
ready for servicing will actually be serviced first.)
• It is relatively easy to add or delete devices.
Perhaps the chief disadvantage of program-controlled I/O is that a great deal of time may
be spent testing the status inputs of the I/O devices, when the devices do not need
servicing. This “busy wait” or “wait loop” during which the I/O devices are polled but no
I/O operations are performed is really time wasted by the processor, if there is other work
which could be done at that time. Also, if a particular device has its data available for
only a short time, the data may be missed because the input was not tested at the
appropriate time.
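The polling of a status port described above can be sketched as follows. The layout assumed here, one ready bit per device in a single status port, is an illustration, not a specific machine's register map.

```python
# Program-controlled (polled) I/O sketch: the I/O handling procedure reads a
# status port in which each bit i indicates that device i requires service.

def poll_devices(status_port, handlers):
    """Service every device whose status bit is set; return those serviced."""
    serviced = []
    for bit, handler in enumerate(handlers):
        if status_port & (1 << bit):   # device `bit` is requesting service
            handler()                  # perform the appropriate I/O operation
            serviced.append(bit)
    return serviced

log = []
handlers = [lambda: log.append("keyboard"),
            lambda: log.append("printer"),
            lambda: log.append("disk")]

# Status port value 0b101: devices 0 (keyboard) and 2 (disk) are ready.
print(poll_devices(0b101, handlers))
```

The service order is simply the order the program checks the bits, which is why the "priority" of a device can be changed by rewriting the loop.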
Program controlled I/O is often used for simple operations which must be performed
sequentially. For example, the following may be used to control the temperature in a
room:
DO forever
    INPUT temperature
    IF (temperature < setpoint) THEN
        turn heat ON
    ELSE
        turn heat OFF
    END IF
END DO
Note here that the order of events is fixed in time, and that the program loops forever.
(It is really waiting for a change in the temperature, but it is a “busy wait.”)
Self Assessment Questions
1. Write the advantages of program-controlled I/O.
Interrupt-controlled I/O
Interrupt-controlled I/O reduces the severity of the two problems mentioned for program-
controlled I/O by allowing the I/O device itself to initiate the device service routine in the
processor. This is accomplished by having the I/O device generate an interrupt signal
which is tested directly by the hardware of the CPU. When the interrupt input to the CPU
is found to be active, the CPU itself initiates a subprogram call to somewhere in the
memory of the processor; the particular address to which the processor branches on an
interrupt depends on the interrupt facilities available in the processor.
The simplest type of interrupt facility is where the processor executes a subprogram
branch to some specific address whenever an interrupt input is detected by the CPU. The
return address (the location of the next instruction in the program that was interrupted) is
saved by the processor as part of the interrupt process.
If there are several devices which are capable of interrupting the processor, then with this
simple interrupt scheme the interrupt handling routine must examine each device to
determine which one caused the interrupt. Also, since only one interrupt can be handled
at a time, there is usually a hardware “priority encoder” which allows the device with the
highest priority to interrupt the processor, if several devices attempt to interrupt the
processor simultaneously. In Figure – 3, the “handshake out” outputs would be connected
to a priority encoder to implement this type of I/O. The other connections remain the
same. (Some systems use a “daisy chain” priority system to determine which of the
interrupting devices is serviced first. “Daisy chain” priority resolution is discussed later.)
In most modern processors, interrupt return points are saved on a “stack” in memory, in
the same way as return addresses for subprogram calls are saved. In fact, an interrupt can
often be thought of as a subprogram which is invoked by an external device. If a stack is
used to save the return address for interrupts, it is then possible to allow one interrupt to
interrupt the handling routine of another interrupt. In modern computer systems, there are
often several “priority levels” of interrupts, each of which can be disabled, or “masked.”
There is usually one type of interrupt input which cannot be disabled (a non-maskable
interrupt) which has priority over all other interrupts. This interrupt input is used for
warning the processor of potentially catastrophic events such as an imminent power
failure, to allow the processor to shut down in an orderly way and to save as much
information as possible.
Most modern computers make use of “vectored interrupts.” With vectored interrupts, it is
the responsibility of the interrupting device to provide the address in main memory of the
interrupt servicing routine for that device. This means, of course, that the I/O device itself
must have sufficient “intelligence” to provide this address when requested by the CPU,
and also to be initially “programmed” with this address information by the processor.
Although somewhat more complex than the simple interrupt system described earlier,
vectored interrupts provide such a significant advantage in interrupt handling speed and
ease of implementation (i.e., a separate routine for each device) that this method is almost
universally used on modern computer systems.
Some processors have a number of special inputs for vectored interrupts (each acting
much like the simple interrupt described earlier). Others require that the interrupting
device itself provide the interrupt address as part of the process of interrupting the
processor.
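Vectored dispatch can be sketched as a table of service routines indexed by the vector the device supplies. The table layout and vector numbers below are assumptions for illustration, not any particular processor's scheme.

```python
# Vectored interrupts: each device supplies a vector that selects its own
# service routine, so the handler need not poll devices to find the source.

vector_table = {}          # vector number -> interrupt service routine

def register_isr(vector, isr):
    """The device (or its driver) is 'programmed' with its vector at setup."""
    vector_table[vector] = isr

def interrupt(vector, saved_pc, stack):
    """CPU response: save the return point on the stack, call the vectored
    ISR, then restore the return point (return from interrupt)."""
    stack.append(saved_pc)             # return address saved, as for a call
    vector_table[vector]()             # branch to the device's own routine
    return stack.pop()                 # resume the interrupted program

events = []
register_isr(0x21, lambda: events.append("disk ISR"))
register_isr(0x09, lambda: events.append("keyboard ISR"))

stack = []
resume_at = interrupt(0x09, saved_pc=0x1234, stack=stack)
print(events, hex(resume_at))
```

Because each device gets a separate routine, no device-by-device examination is needed, which is the speed advantage the text describes.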
Direct Memory Access
In most mini- and mainframe computer systems, a great deal of input and output occurs
between the disk system and the processor. It would be very inefficient to perform these
operations directly through the processor; it is much more efficient if such devices, which
can transfer data at a very high rate, place the data directly into the memory, or take the
data directly from the processor without direct intervention from the processor. I/O
performed in this way is usually called direct memory access, or DMA. The controller for
a device employing DMA must have the capability of generating address signals for the
memory, as well as all of the memory control signals. The processor informs the DMA
controller that data is available (or is to be placed into) a block of memory locations
starting at a certain address in memory. The controller is also informed of the length of
the data block.
There are two possibilities for the timing of the data transfer from the DMA controller to
memory:
• The controller can cause the processor to halt if it attempts to access data in the
same bank of memory into which the controller is writing. This is the fastest option
for the I/O device, but may cause the processor to run more slowly because the
processor may have to wait until a full block of data is transferred.
• The controller can access memory in memory cycles which are not used by the
particular bank of memory into which the DMA controller is writing data. This
approach, called “cycle stealing,” is perhaps the most commonly used approach.
(In a processor with a cache that has a high hit rate this approach may not slow
the I/O transfer significantly).
DMA is a sensible approach for devices which have the capability of transferring blocks
of data at a very high data rate, in short bursts. It is not worthwhile for slow devices, or
for devices which do not provide the processor with large quantities of data. Because the
controller for a DMA device is quite sophisticated, the DMA devices themselves are
usually quite sophisticated (and expensive) compared to other types of I/O devices.
One problem that systems employing several DMA devices have to address is the
contention for the single system bus. There must be some method of selecting which
device controls the bus (acts as “bus master”) at any given time. There are many ways of
addressing the “bus arbitration” problem; three techniques which are often implemented
in processor systems are the following (these are also often used to determine the
priorities of other events which may occur simultaneously, like interrupts). They rely on
the use of at least two signals (bus_request and bus_grant), used in a manner similar to
the two-wire handshake:
Daisy chain arbitration Here, the requesting device or devices assert the signal
bus_request. The bus arbiter returns the bus_grant signal, which passes through each of
the devices which can have access to the bus, as shown in Figure - 4. Here, the priority of
a device depends solely on its position in the daisy chain. If two or more devices request
the bus at the same time, the highest priority device is granted the bus first, then the
bus_grant signal is passed further down the chain. Generally a third signal (bus_release)
is used to indicate to the bus arbiter that the first device has finished its use of the bus.
Holding bus_request asserted indicates that another device wants to use the bus.
Figure 4:
Daisy chain bus arbitration
Priority encoded arbitration Here, each device has a request line connected to a
centralized arbiter that determines which device will be granted access to the bus. The
order may be fixed by the order of connection (priority encoded), or it may be determined
by some algorithm preloaded into the arbiter. Figure - 5 shows this type of system. Note
that each device has a separate line to the bus arbiter. (The bus_grant signals have been
omitted for clarity.)
Figure 5:
Priority encoded bus arbitration
Distributed arbitration by self-selection Here, the devices themselves determine which of
them has the highest priority. Each device has a bus_request line or lines on which it
places a code identifying itself. Each device examines the codes for all the requesting
devices, and determines whether or not it is the highest priority requesting device.
These arbitration schemes may also be used in conjunction with each other. For example,
a set of similar devices may be daisy chained together, and this set may be an input to a
priority encoded scheme.
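The first and third schemes above can be sketched as follows. The daisy-chain grant stops at the first requester it reaches, so chain position alone fixes priority; for self-selection, the rule that the highest code wins is an assumed convention for illustration.

```python
# Daisy-chain arbitration sketch: bus_grant passes along the chain and is
# absorbed by the first device that asserted bus_request.

def daisy_chain_grant(requests):
    """requests[i] is True if device i asserts bus_request.
    Returns the index of the device granted the bus, or None."""
    for position, requesting in enumerate(requests):
        if requesting:
            return position        # grant absorbed here, not passed on
    return None                    # grant returns unused to the arbiter

# Distributed arbitration by self-selection: each requesting device places
# its identifying code on the bus and checks whether it has the winning code.
def self_select(codes):
    return max(codes) if codes else None

# Devices 1 and 3 request simultaneously; device 1 is nearer the arbiter.
print(daisy_chain_grant([False, True, False, True]))
print(self_select([3, 6, 1]))
```

A priority-encoded arbiter behaves like the daisy chain functionally, but each device has its own request line to a central arbiter rather than a position in a chain.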
Using interrupts driven device drivers to transfer data to or from hardware devices works
well when the amount of data is reasonably low. For example, a 9600 baud modem can
transfer approximately one character every millisecond (1/1000th of a second).
Figure – 6
If the interrupt latency, the amount of time that it takes between the hardware device
raising the interrupt and the device driver’s interrupt handling routine being called, is low
(say 2 milliseconds) then the overall system impact of the data transfer is very low. The
9600 baud modem data transfer would only take 0.002% of the CPU’s processing time.
For high speed devices, such as hard disk controllers or Ethernet devices, the data transfer
rate is a lot higher. A SCSI device can transfer up to 40 Mbytes of information per
second.
Direct Memory Access, or DMA, was invented to solve this problem. A DMA controller
allows devices to transfer data to or from the system’s memory without the intervention
of the processor. A PC’s ISA DMA controller has 8 DMA channels of which 7 are
available for use by the device drivers. Each DMA channel has associated with it a 16 bit
address register and a 16 bit count register. To initiate a data transfer the device driver
sets up the DMA channel’s address and count registers together with the direction of the
data transfer, read or write. It then tells the device that it may start the DMA when it
wishes. When the transfer is complete the device interrupts the PC. Whilst the transfer is
taking place the CPU is free to do other things.
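The driver-side setup just described can be sketched as a small model. The register names and 16-bit widths follow the ISA description above, but the interface itself is an illustrative assumption, not real controller programming.

```python
# Sketch of an ISA-style DMA channel: the driver programs the address and
# count registers, the device then transfers without the CPU and interrupts
# when done.

class DMAChannel:
    def __init__(self, memory):
        self.memory = memory
        self.address = 0        # 16-bit address register
        self.count = 0          # 16-bit count register
        self.writing = False

    def setup(self, address, count, write):
        """Device driver programs the channel before starting the device."""
        self.address = address & 0xFFFF
        self.count = count & 0xFFFF
        self.writing = write

    def run(self, data=None):
        """Device-side transfer; the CPU is free while this happens."""
        if self.writing:                         # device -> memory
            for i in range(self.count):
                self.memory[self.address + i] = data[i]
            return "interrupt: write complete"
        else:                                    # memory -> device
            return bytes(self.memory[self.address:self.address + self.count])

memory = bytearray(256)
dma = DMAChannel(memory)
dma.setup(address=0x40, count=4, write=True)
print(dma.run(b"DATA"))
print(bytes(memory[0x40:0x44]))
```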
Device drivers have to be careful when using DMA. First of all, the DMA controller
knows nothing of virtual memory; it only has access to the physical memory in the
system. Therefore the memory that is being DMA’d to or from must be a contiguous
block of physical memory. This means that you cannot DMA directly into the virtual
address space of a process. You can, however, lock the process’s physical pages into
memory, preventing them from being swapped out to the swap device during a DMA
operation. Secondly, the DMA controller cannot access the whole of physical memory.
The DMA channel’s address register represents the first 16 bits of the DMA address, the
next 8 bits come from the page register. This means that DMA requests are limited to the
bottom 16 Mbytes of memory.
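The 16 Mbyte limit follows directly from the register widths: 16 address bits plus 8 page-register bits give a 24-bit physical address. The bit layout shown below (page register as the high byte) follows that description:

```python
# Why ISA DMA reaches only the bottom 16 Mbytes: the full address is formed
# from a 16-bit address register plus an 8-bit page register, i.e. 24 bits.

address_bits = 16 + 8                 # address register + page register
max_dma_address = 2 ** address_bits   # one past the highest addressable byte
print(max_dma_address)                # 16777216
print(max_dma_address // (1024 * 1024), "Mbytes")

# Forming a physical address from the two registers:
page, offset = 0x12, 0x3456
physical = (page << 16) | offset      # page register supplies the high bits
print(hex(physical))
```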
DMA channels are scarce resources: there are only 7 of them, and they cannot be shared
between device drivers. Just like interrupts, the device driver must be able to work out
which DMA channel it should use. Like interrupts, some devices have a fixed DMA
channel. The floppy device, for example, always uses DMA channel 2. Sometimes the
DMA channel for a device can be set by jumpers; a number of Ethernet devices use this
technique. The more flexible devices can be told (via their CSRs) which DMA channels
to use and, in this case, the device driver can simply pick a free DMA channel to use.
Self Assessment Questions
1. What do you mean by direct memory access?
2. Explain the two possibilities for the timing of the data transfer from the DMA
controller to memory.
The I/O address space
Some processors map I/O devices in their own, separate, address space; others use
memory addresses as addresses of I/O ports. Both approaches have advantages and
disadvantages. The advantages of a separate address space for I/O devices are, primarily,
that the I/O operations would then be performed by separate I/O instructions, and that all
the memory address space could be dedicated to memory.
Typically, however, I/O is only a small fraction of the operations performed by a
computer system; generally less than 1 percent of all instructions are I/O instructions in a
program. It may not be worthwhile to support such infrequent operations with a rich
instruction set, so I/O instructions are often rather restricted.
In processors with memory mapped I/O, any of the instructions which references memory
directly can also be used to reference I/O ports, including instructions which modify the
contents of the I/O port (e.g., arithmetic instructions.)
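The memory-mapped case can be sketched as follows. The port address used here is made up for illustration; the point is only that an ordinary memory reference, even an arithmetic one, reaches the port.

```python
# Memory-mapped I/O sketch: an I/O port occupies an ordinary memory address,
# so any instruction that references memory can also reference the port.
# The address chosen for the port is an arbitrary assumption.

PORT_ADDR = 0xF000                    # hypothetical address of a device port

memory = {}                           # sparse model of the address space
memory[PORT_ADDR] = 0x05              # the device has placed 5 in its port

# An ordinary "arithmetic instruction" can modify the port directly,
# e.g. an increment applied to the mapped location:
memory[PORT_ADDR] = memory[PORT_ADDR] + 1
print(memory[PORT_ADDR])              # no special I/O instruction was used
```

With a separate I/O address space, by contrast, only the (often restricted) I/O instructions can touch the port.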
Some problems can arise with memory mapped I/O in systems which use cache memory
or virtual memory. If a processor uses a virtual memory mapping, and the I/O ports are
allowed to be in a virtual address space, the mapping to the physical device may not be
consistent if there is a context switch. Moreover, the device would have to be capable of
performing the virtual-to-physical mapping. If physical addressing is used, mapping
across page boundaries may be problematic.
If the memory locations are cached, then the value in cache may not be consistent with
the new value loaded in memory. Generally, either there is some method for invalidating
cache that may be mapped to I/O addresses, or the I/O addresses are not cached at all. We
will look at the general problem of maintaining cache in a consistent state (the cache
coherency problem) in more detail when we discuss multi-processor systems.
Terminal Questions
1. What is the significance of I/O Operations?
2. Draw a block diagram of an I/O structure and discuss the working principle.
3. What are the various I/O control strategies? Discuss in brief.
4. Explain programmed I/O and interrupt I/O. How do they differ?
5. Discuss the concept of Direct Memory Access. What are its advantages over other
methods?
Unit 10 : Case Study on Windows Operating Systems :
This unit covers the architecture of the Windows NT operating system family, including
Windows NT and Windows 2000, the common functionality used to handle different
activities, and the services the family provides. It also discusses the different versions of
the operating system.
Introduction
Windows 2000, Windows XP and Windows Server 2003 are all part of the Windows NT
family of Microsoft operating systems. They are all preemptive, reentrant operating
systems, which have been designed to work with either uniprocessor or symmetric
multiprocessor (SMP) based Intel x86 computers. To process input/output (I/O) requests
they use packet-driven I/O, which utilises I/O request packets (IRPs) and asynchronous I/O.
Starting with Windows XP, Microsoft began building in 64-bit support into their
operating systems – before this their operating systems were based on a 32-bit model.
The architecture of the Windows NT operating system line is highly modular, and
consists of two main layers: a user mode and a kernel mode. Programs and subsystems in
user mode are limited in terms of what system resources they have access to, while the
kernel mode has unrestricted access to the system memory and external devices. The
kernels of the operating systems in this line are all known as hybrid kernels as their
microkernel is essentially the kernel, while higher-level services are implemented by the
executive, which exists in kernel mode.
Objective:
At the end of this unit, you will understand:
• Architectural details of Windows NT
• Functionality and operations of Windows NT
• Services and functionality of Windows NT Operating Systems
• Deployment related issues in Windows NT
Architecture of the Windows NT operating system line
The Windows NT operating system family’s architecture consists of two layers (user
mode and kernel mode), with many different modules within both of these layers.
User mode in the Windows NT line is made of subsystems capable of passing I/O
requests to the appropriate kernel mode software drivers by using the I/O manager. Two
subsystems make up the user mode layer of Windows 2000: the Environment subsystem
(runs applications written for many different types of operating systems), and the Integral
subsystem (operates system specific functions on behalf of the environment subsystem).
Kernel mode in Windows 2000 has full access to the hardware and system resources of
the computer. The kernel mode stops user mode services and applications from accessing
critical areas of the operating system that they should not have access to.
The Executive interfaces with all the user mode subsystems. It deals with I/O, object
management, security and process management. The hybrid kernel sits between the
Hardware Abstraction Layer and the Executive to provide multiprocessor
synchronization, thread and interrupt scheduling and dispatching, and trap handling and
exception dispatching. The microkernel is also responsible for initializing device drivers
at bootup. Kernel mode drivers exist in three levels: highest level drivers, intermediate
drivers and low level drivers. Windows Driver Model (WDM) exists in the intermediate
layer and was mainly designed to be binary and source compatible between Windows 98
and Windows 2000. The lowest level drivers are either legacy Windows NT device
drivers that control a device directly or can be a PnP hardware bus.
User mode
The user mode is made up of subsystems which can pass I/O requests to the appropriate
kernel mode drivers via the I/O manager (which exists in kernel mode). Two subsystems
make up the user mode layer of Windows 2000: the Environment subsystem and the
Integral subsystem.
The environment subsystem was designed to run applications written for many different
types of operating systems. None of the environment subsystems can directly access
hardware, and must request access to memory resources through the Virtual Memory
Manager that runs in kernel mode. Also, applications run at a lower priority than kernel
mode processes. Currently, there are three main environment subsystems: the Win32
subsystem, an OS/2 subsystem and a POSIX subsystem.
The Win32 environment subsystem can run 32-bit Windows applications. It contains the
console as well as text window support, shutdown and hard-error handling for all other
environment subsystems. It also supports Virtual DOS Machines (VDMs), which allow
MS-DOS and 16-bit Windows 3.x (Win16) applications to be run on Windows. There is a
specific MS-DOS VDM which runs in its own address space and which emulates an Intel
80486 running MS-DOS 5. Win16 programs, however, run in a Win16 VDM. Each
program, by default, runs in the same process, thus using the same address space, and the
Win16 VDM gives each program its own thread to run on. However, Windows 2000 does
allow users to run a Win16 program in a separate Win16 VDM, which allows the
program to be preemptively multitasked as Windows 2000 will pre-empt the whole VDM
process, which only contains one running application. The OS/2 environment subsystem
supports 16-bit character-based OS/2 applications and emulates OS/2 1.x, but not 2.x or
later OS/2 applications. The POSIX environment subsystem supports applications that
are strictly written to either the POSIX.1 standard or the related ISO/IEC standards.
The integral subsystem looks after operating system specific functions on behalf of the
environment subsystem. It consists of a security subsystem, a workstation service and a
server service. The security subsystem deals with security tokens, grants or denies access
to user accounts based on resource permissions, handles logon requests and initiates
logon authentication, and determines which system resources need to be audited by
Windows 2000. It also looks after Active Directory. The workstation service is an API to
the network redirector, which provides the computer access to the network. The server
service is an API that allows the computer to provide network services.
Kernel mode
Windows 2000 kernel mode has full access to the hardware and system resources of the
computer and runs code in a protected memory area. It controls access to scheduling,
thread prioritization, memory management and the interaction with hardware. The kernel
mode stops user mode services and applications from accessing critical areas of the
operating system that they should not have access to as user mode processes ask the
kernel mode to perform such operations on its behalf.
Kernel mode consists of executive services, which are themselves made up of many
modules that do specific tasks, kernel drivers, a microkernel and a Hardware Abstraction
Layer, or HAL.
• Executive
The Executive interfaces with all the user mode subsystems. It deals with I/O, object
management, security and process management. It contains various components,
including the I/O Manager, the Security Reference Monitor, the Object Manager, the IPC
Manager, the Virtual Memory Manager (VMM), a PnP Manager and Power Manager,
as well as a Window Manager which works in conjunction with the Windows Graphics
Device Interface (GDI). Each of these components exports a kernel-only support routine
that allows other components to communicate with one another. Grouped together, the
components can be called executive services. No executive component has access to the
internal routines of any other executive component.
Each object in Windows 2000 exists in its own namespace. (Figure: a screenshot of the
object namespace from SysInternals’ WinObj.)
The object manager is a special executive subsystem that all other executive subsystems
must pass through to gain access to Windows 2000 resources – essentially making it a
resource management infrastructure service. The object manager is used to reduce the
duplication of object resource management functionality in other executive subsystems,
which could potentially lead to bugs and make development of Windows 2000 harder. To
the object manager, each resource is an object, whether that resource is a physical
resource (such as a file system or peripheral) or a logical resource (such as a file). Each
object has a structure or object type that the object manager must know about. When
another executive subsystem requests the creation of an object, they send that request to
the object manager which creates an empty object structure which the requesting
executive subsystem then fills in. Object types define the object procedures and any data
specific to the object. In this way, the object manager allows Windows 2000 to be an
object oriented operating system, as object types can be thought of as classes that define
objects.
Each instance of an object that is created stores its name, parameters that are passed to
the object creation function, security attributes and a pointer to its object type. The object
also contains an object close procedure and a reference count to tell the object manager
how many other objects in the system reference that object and thereby determines
whether the object can be destroyed when a close request is sent to it. Every object exists
in a hierarchical object namespace.
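The reference-counting behaviour just described can be sketched as follows. This is a deliberately simplified model, not the real Windows 2000 object layout; the field names are assumptions for illustration.

```python
# Simplified sketch of object-manager reference counting: an object is
# destroyed only when a close request arrives and no references remain.

class ManagedObject:
    def __init__(self, name, object_type):
        self.name = name                  # name in the object namespace
        self.object_type = object_type    # defines procedures for the object
        self.ref_count = 0                # how many holders reference it
        self.destroyed = False

    def reference(self):
        self.ref_count += 1

    def close(self):
        """Close procedure: drop one reference; destroy when none remain."""
        self.ref_count -= 1
        if self.ref_count == 0:
            self.destroyed = True
        return self.destroyed

obj = ManagedObject(r"\Device\Example", "Device")   # hypothetical name
obj.reference()          # two subsystems hold the object
obj.reference()
print(obj.close())       # one holder closes: object survives  -> False
print(obj.close())       # last holder closes: object destroyed -> True
```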
Further executive subsystems are the following:
(i) I/O Manager: allows devices to communicate with user-mode subsystems. It
translates user-mode read and write commands into read or write IRPs, which it passes to
device drivers. It accepts file system I/O requests and translates them into device-specific
calls, and can incorporate low-level device drivers that directly manipulate hardware to
either read input or write output. It also includes a cache manager to improve disk
performance by caching read requests and writing to the disk in the background.
(ii) Security Reference Monitor (SRM): the primary authority for enforcing the security
rules of the security integral subsystem. It determines whether an object or resource can
be accessed, via the use of access control lists (ACLs), which are themselves made up of
access control entries (ACEs). ACEs contain a security identifier (SID) and a list of
operations that the ACE gives a select group of trustees – a user account, group account,
or logon session – permission (allow, deny, or audit) to that resource.
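An access check over ACEs can be sketched as a first-match walk over the list. The rule used here (the first ACE whose SID and operation match decides, and no match means deny) is the usual ACL convention, shown as an assumption rather than the SRM's exact algorithm.

```python
# Sketch of an SRM-style access check: walk the ACL's ACEs in order; the
# first ACE matching the requester's SID and operation decides the outcome,
# and no matching ACE means access is denied.

def access_check(acl, sid, operation):
    """acl: list of ACEs, each (sid, kind, operations); kind is 'allow'/'deny'."""
    for ace_sid, kind, operations in acl:
        if ace_sid == sid and operation in operations:
            return kind == "allow"
    return False                      # no matching ACE: access denied

acl = [
    ("S-1-5-99", "deny",  {"write"}),          # hypothetical SIDs
    ("S-1-5-99", "allow", {"read", "write"}),
    ("S-1-5-32", "allow", {"read"}),
]

print(access_check(acl, "S-1-5-99", "write"))  # deny ACE matches first
print(access_check(acl, "S-1-5-99", "read"))   # allow ACE matches
print(access_check(acl, "S-1-5-7",  "read"))   # no ACE: denied
```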
(iii) IPC Manager: short for Interprocess Communication Manager, this manages the
communication between clients (the environment subsystem) and servers (components of
the Executive). It can use two facilities: the Local Procedure Call (LPC) facility (clients
and servers on the one computer) and the Remote Procedure Call (RPC) facility (where
clients and servers are situated on different computers). Microsoft has had significant
security issues with the RPC facility.
(iv) Virtual Memory Manager: manages virtual memory, allowing Windows 2000 to
use the hard disk as a primary storage device (although strictly speaking it is secondary
storage). It controls the paging of memory in and out of physical memory to disk storage.
(v) Process Manager: handles process and thread creation and termination
(vi) PnP Manager: handles Plug and Play and supports device detection and installation
at boot time. It also has the responsibility to stop and start devices on demand –
sometimes this happens when a bus gains a new device and needs to have a device driver
loaded to support that device. Both FireWire and USB are hot-swappable and require the
services of the PnP Manager to load, stop and start devices. The PnP manager interfaces
with the HAL, the rest of the executive (as necessary) and with device drivers.
(vii) Power Manager: the power manager deals with power events and generates power
IRPs. When several devices send requests to be turned off, it coordinates these events and
determines the best way of doing so.
The display system has been moved from user mode into the kernel mode as a device
driver contained in the file Win32k.sys. There are two components in this device driver –
the Window Manager and the GDI:
(viii) Window Manager: responsible for drawing windows and menus. It controls the
way that output is painted to the screen, handles input events (such as from the keyboard
and mouse), and passes messages to the applications that need to receive this input.
(ix) GDI: the Graphics Device Interface is responsible for tasks such as drawing lines
and curves, rendering fonts and handling palettes. Windows 2000 introduced native alpha
blending into the GDI.
(x) Microkernel & kernel-mode drivers
The Microkernel sits between the HAL and the Executive and provides multiprocessor
synchronization, thread and interrupt scheduling and dispatching, and trap handling and
exception dispatching. The Microkernel often interfaces with the process manager. The
microkernel is also responsible for initializing device drivers at bootup that are necessary
to get the operating system up and running.
Windows 2000 uses kernel-mode device drivers to enable it to interact with hardware
devices. Each of the drivers has well defined system routines and internal routines that it
exports to the rest of the operating system. All devices are seen by user mode code as a
file object in the I/O manager, though to the I/O manager itself the devices are seen as
device objects, which it defines as either file, device or driver objects. Kernel mode
drivers exist in three levels: highest level drivers, intermediate drivers and low level
drivers. The highest level drivers, such as file system drivers for FAT and NTFS, rely on
intermediate drivers. Intermediate drivers consist of function drivers – or main driver for
a device – that are optionally sandwiched between lower and higher level filter drivers.
The function driver then relies on a bus driver – or a driver that services a bus controller,
adapter, or bridge – which can have an optional bus filter driver that sits between itself
and the function driver. Intermediate drivers rely on the lowest level drivers to function.
The Windows Driver Model (WDM) exists in the intermediate layer. The lowest level
drivers are either legacy Windows NT device drivers that control a device directly or PnP
hardware bus drivers. These lower level drivers directly control hardware and do not
rely on any other drivers.
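The layering described above, in which a request travels from the highest level driver down to the hardware, might be sketched like this; the class names, the trace field, and the driver names are invented for illustration:

```python
# Hypothetical sketch of an IRP travelling down a layered driver stack.
class Irp:
    def __init__(self, op):
        self.op = op
        self.trace = []           # records which drivers handled the IRP

class Driver:
    def __init__(self, name, lower=None):
        self.name = name
        self.lower = lower        # next driver down the stack, if any
    def dispatch(self, irp):
        irp.trace.append(self.name)
        if self.lower:                    # pass the request down the stack
            return self.lower.dispatch(irp)
        return f"hardware <- {irp.op}"    # lowest level driver touches hardware

# highest level (file system) -> intermediate (function) -> lowest (bus)
bus = Driver("bus")
function = Driver("function", lower=bus)
ntfs = Driver("ntfs", lower=function)

irp = Irp("write")
result = ntfs.dispatch(irp)
```

The higher levels never touch hardware themselves; each one records its work and forwards the IRP, mirroring how the highest level drivers rely on intermediate drivers, which in turn rely on the lowest level drivers.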
(xi) Hardware abstraction layer
The Windows 2000 Hardware Abstraction Layer, or HAL, is a layer between the physical
hardware of the computer and the rest of the operating system. It was designed to hide
differences in hardware and therefore provide a consistent platform on which applications
may run. The HAL includes hardware specific code that controls I/O interfaces, interrupt
controllers and multiple processors.
Windows 2000 was designed to support the 64-bit DEC Alpha. After Compaq announced
they would discontinue support of the processor, Microsoft stopped releasing test builds
of Windows 2000 for AXP to the public, stopping with beta 3. Development of Windows
on the Alpha continued internally in order to continue to have a 64-bit architecture
development model ready until the wider availability of the Intel Itanium IA-64
architecture. The HAL now only supports hardware that is compatible with the Intel x86
architecture.
Microsoft has had numerous security issues caused by vulnerabilities in its RPC
mechanisms. The following security bulletins have been issued by Microsoft regarding
RPC vulnerabilities:
Microsoft Security Bulletin MS03-026 (“Buffer Overrun in RPC May Allow Code
Execution”): describes a vulnerability in the part of RPC that deals with message
exchange over TCP/IP, resulting from incorrect handling of malformed messages. This
particular vulnerability affects a Distributed Component Object Model (DCOM) interface
with RPC, which listens on RPC-enabled ports.
Microsoft Security Bulletin MS03-001: A security vulnerability results from an
unchecked buffer in the Locator service. By sending a specially malformed request to the
Locator service, an attacker could cause the Locator service to fail, or to run code of the
attacker’s choice on the system.
Microsoft Security Bulletin MS03-010: This particular vulnerability affects the RPC
Endpoint Mapper process, which listens on TCP/IP port 135. The RPC endpoint mapper
allows RPC clients to determine the port number currently assigned to a particular RPC
service. To exploit this vulnerability, an attacker would need to establish a TCP/IP
connection to the Endpoint Mapper process on a remote machine. Once the connection
was established, the attacker would begin the RPC connection negotiation before
transmitting a malformed message. At this point, the process on the remote machine
would fail. The RPC Endpoint Mapper process is responsible for maintaining the
connection information for all of the processes on that machine using RPC. Because the
Endpoint Mapper runs within the RPC service itself, exploiting this vulnerability would
cause the RPC service to fail, with the attendant loss of any RPC-based services the
server offers, as well as potential loss of some COM functions.
Microsoft Security Bulletin MS04-029: This RPC Runtime library vulnerability, tracked
as CAN-2004-0569, is titled “Vulnerability in RPC Runtime Library Could Allow
Information Disclosure and Denial of Service”.
Microsoft Security Bulletin MS00-066: describes a remote denial of service vulnerability
in RPC. Blocking ports 135-139 and 445 can stop such attacks.
Microsoft Security Bulletin MS03-039: “There are three newly identified vulnerabilities
in the part of RPCSS Service that deals with RPC messages for DCOM activation- two
that could allow arbitrary code execution and one that could result in a denial of service.
The flaws result from incorrect handling of malformed messages. These particular
vulnerabilities affect the Distributed Component Object Model (DCOM) interface within
the RPCSS Service. This interface handles DCOM object activation requests that are sent
from one machine to another. An attacker who successfully exploited these
vulnerabilities could be able to run code with Local System privileges on an affected
system, or could cause the RPCSS Service to fail. The attacker could then be able to take
any action on the system, including installing programs, viewing, changing or deleting
data, or creating new accounts with full privileges. To exploit these vulnerabilities, an
attacker could create a program to send a malformed RPC message to a vulnerable
system targeting the RPCSS Service.”
Microsoft Security Bulletin MS01-041: “Several of the RPC servers associated with
system services in Microsoft Exchange Server, SQL Server, Windows NT 4.0 and
Windows 2000 do not adequately validate inputs, and in some cases will accept invalid
inputs that prevent normal processing. The specific input values at issue here vary from
RPC server to RPC server. An attacker who sent such inputs to an affected RPC server
could disrupt its service. The precise type of disruption would depend on the specific
service, but could range in effect from minor (e.g., the service temporarily hanging) to
major (e.g., the service failing in a way that would require the entire system to be
restarted).”
Windows 2000
Windows 2000 (also referred to as Win2K or W2K) is a preemptible and interruptible,
graphical, business-oriented operating system that was designed to work with either
uniprocessor or symmetric multi-processor (SMP) 32-bit Intel x86 computers. It is part of
the Microsoft Windows NT line of operating systems and was released on February 17,
2000. Windows 2000 comes in four versions: Professional, Server, Advanced Server, and
Datacenter Server. Additionally, Microsoft offers Windows 2000 Advanced Server-
Limited Edition, which was released in 2001 and runs on 64-bit Intel Itanium
microprocessors. Windows 2000 is classified as a hybrid-kernel operating system, and its
architecture is divided into two modes: user mode and kernel mode. The kernel mode
provides unrestricted access to system resources and facilitates the user mode, which is
heavily restricted and designed for most applications.
All versions of Windows 2000 have common functionality, including many system
utilities such as the Microsoft Management Console (MMC) and standard system
management applications such as a disk defragmentation utility. Support for people with
disabilities has also been improved by Microsoft across their Windows 2000 line, and
they have included increased support for different languages and locale information. All
versions of the operating system support the Windows NT filesystem, NTFS 5, the
Encrypted File System (EFS), as well as basic and dynamic disk storage. Dynamic disk
storage allows different types of volumes to be used. The Windows 2000 Server family
has enhanced functionality, including the ability to provide Active Directory services (a
hierarchical framework of resources), Distributed File System (a file system that supports
sharing of files) and fault-redundant storage volumes.
Windows 2000 can be installed and deployed to an enterprise through either an attended
or unattended installation. Unattended installations rely on the use of answer files to fill
in installation information, and can be performed from a bootable CD, through Microsoft
Systems Management Server (SMS), or with the System Preparation Tool (Sysprep).
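As an illustration, a minimal answer file fragment might look like the following. The section and key names reflect the common unattend.txt layout, but the values are placeholders and this is not a complete, validated file:

```ini
; Illustrative fragment of a Windows 2000 answer file (unattend.txt);
; values are placeholders, not a working configuration.
[Unattended]
UnattendMode = FullUnattended
TargetPath = \WINNT

[UserData]
FullName = "Example User"
OrgName = "Example Org"
ComputerName = EXAMPLE-PC
```

Setup reads each section during the corresponding installation phase, so the answer file substitutes for the questions an attended installation would ask interactively.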
History
Windows 2000 originally descended from the Microsoft Windows NT operating system
product line. Originally called Windows NT 5.0, it was renamed Windows 2000 on
October 27, 1998. It was also the first Windows version that was released
without a code name, though Windows 2000 Service Pack 1 was codenamed “Asteroid”
and Windows 2000 64-bit was codenamed “Janus” (not to be confused with Windows
3.1, which had the same codename). The first beta for Windows 2000 was released on
September 27, 1997 and several further betas were released until Beta 3 which was
released on April 29, 1999. From here, Microsoft issued three release candidates between
July and November 1999, and finally released the operating system to partners on
December 12, 1999. The public received the full version of Windows 2000 on February
17, 2000 and the press immediately hailed it as the most stable operating system
Microsoft had ever released. Novell, however, was not so impressed with Microsoft’s
new directory service architecture as they found it to be less scalable or reliable than their
own Novell Directory Services (NDS) technology. On September 29, 2000, Microsoft
released Windows 2000 Datacenter. Microsoft released Service Pack 1 (SP1) on August
15, 2000, Service Pack 2 (SP2) on May 16, 2001, Service Pack 3 (SP3) on August 29,
2002 and its last Service Pack (SP4) on June 26, 2003. Microsoft has stated that they will
not release a Service Pack 5, but instead, have offered an “Update Rollup” for Service
Pack 4. Microsoft phased out all development of their Java Virtual Machine (JVM) from
Windows 2000 in Service Pack 3.
Windows 2000 has since been superseded by newer Microsoft operating systems.
Microsoft has replaced Windows 2000 Server products with Windows Server 2003, and
Windows 2000 Professional with Windows XP Professional. Windows Neptune started
development in 1999, and was supposed to be the home-user edition of Windows 2000.
However, the project lagged in production time – and only one alpha release was built.
Windows Me was released as a substitute, and the Neptune project was forwarded to the
production of Whistler (Windows XP). The only elements of the Neptune project which
were included in Windows 2000 were the ability to upgrade from Windows 95 or
Windows 98, and support for the FAT32 file system.
Several notable security flaws have been found in Windows 2000. Code Red and Code
Red II were famous (and highly visible to the worldwide press) computer worms that
exploited vulnerabilities of the indexing service of Windows 2000’s Internet Information
Services (IIS). In August 2003, two major worms, Sobig and Blaster, began to attack
millions of Microsoft Windows computers, resulting in some of the largest downtime and
clean-up costs to that date.
Architecture
Windows 2000 is a highly modular system that consists of two main layers: a user mode
and a kernel mode. The user mode refers to the mode in which user programs are run.
Such programs are limited in terms of what system resources they have access to, while
the kernel mode has unrestricted access to the system memory and external devices. All
user mode applications access system resources through the executive which runs in
kernel mode.
User mode
User mode in Windows 2000 is made of subsystems capable of passing I/O requests to
the appropriate kernel mode drivers by using the I/O manager. Two subsystems make up
the user mode layer of Windows 2000: the environment subsystem and the integral
subsystem.
The environment subsystem was designed to run applications written for many different
types of operating systems. These applications, however, run at a lower priority than
kernel mode processes. There are three main environment subsystems:
Win32 subsystem runs 32-bit Windows applications and also supports Virtual DOS
Machines (VDMs), which allows MS-DOS and 16-bit Windows 3.x (Win16) applications
to run on Windows.
OS/2 environment subsystem supports 16-bit character-based OS/2 applications and
emulates OS/2 1.3 and 1.x, but not 2.x or later OS/2 applications.
POSIX environment subsystem supports applications that are strictly written to either the
POSIX.1 standard or the related ISO/IEC standards.
The integral subsystem looks after operating system specific functions on behalf of the
environment subsystem. It consists of a security subsystem (grants/denies access and
handles logons), workstation service (helps the computer gain network access) and a
server service (lets the computer provide network services).
Kernel mode
Kernel mode in Windows 2000 has full access to the hardware and system resources of
the computer. The kernel mode stops user mode services and applications from accessing
critical areas of the operating system that they should not have access to.
The executive interfaces with all the user mode subsystems. It deals with I/O, object
management, security and process management. It contains various components,
including:
Object manager: a special executive subsystem that all other executive subsystems must
pass through to gain access to Windows 2000 resources. This essentially is a resource
management infrastructure service that allows Windows 2000 to be an object oriented
operating system.
I/O Manager: allows devices to communicate with user-mode subsystems by translating
user-mode read and write commands and passing them to device drivers.
Security Reference Monitor (SRM): the primary authority for enforcing the security
rules of the security integral subsystem.
IPC Manager: short for Interprocess Communication Manager, manages the
communication between clients (the environment subsystem) and servers (components of
the executive).
Virtual Memory Manager: manages virtual memory, allowing Windows 2000 to use
the hard disk as a primary storage device (although strictly speaking it is secondary
storage).
Process Manager: handles process and thread creation and termination.
PnP Manager: handles Plug and Play and supports device detection and installation at
boot time.
Power Manager: the power manager coordinates power events and generates power
IRPs.
The display system is handled by a device driver contained in Win32k.sys. The Window
Manager component of this driver is responsible for drawing windows and menus while
the GDI (graphical device interface) component is responsible for tasks such as drawing
lines and curves, rendering fonts and handling palettes.
The Windows 2000 Hardware Abstraction Layer, or HAL, is a layer between the physical
hardware of the computer and the rest of the operating system. It was designed to hide
differences in hardware and therefore provide a consistent platform to run applications
on. The HAL includes hardware specific code that controls I/O interfaces, interrupt
controllers and multiple processors.
The microkernel sits between the HAL and the executive and provides multiprocessor
synchronization, thread and interrupt scheduling and dispatching, trap handling and
exception dispatching. The microkernel often interfaces with the process manager. The
microkernel is also responsible for initializing device drivers at bootup that are necessary
to get the operating system up and running.
Common functionality
Certain features are common across all versions of Windows 2000 (both Professional and
the Server versions), among them being NTFS 5, the Microsoft Management Console
(MMC), the Encrypting File System (EFS), dynamic and basic disk storage, usability
enhancements and multi-language and locale support. Windows 2000 also includes
several standard system utilities. In addition to these features, Microsoft introduced a new
feature to protect critical system files, called Windows File Protection (WFP). This
prevents programs (with the exception of Microsoft’s update programs) from replacing
critical Windows system files and thus rendering the system inoperable.
Microsoft recognised that the infamous Blue Screen of Death (or stop error) could cause
serious problems for servers that needed to be constantly running and so provided a
system setting that would allow the server to automatically reboot when a stop error
occurred. Users have the option of dumping the first 64KB of memory to disk (the
smallest amount of memory that is useful for debugging purposes, also known as a
minidump), a dump of only the kernel’s memory, or a dump of the entire contents of
memory to disk, as well as having the event recorded in the Windows 2000 event log.
In order to improve performance on computers running Windows 2000 as a server
operating system, Microsoft gave administrators the choice of optimising the operating
system for background services or for applications.
NTFS 5
Windows 2000 supports disk quotas, which can be set via the “Quotas” tab found in the
hard disk properties dialog box.
Microsoft released the third version of the NT File System (NTFS) – also known as
version 5.0 – in Windows 2000; this introduced quotas, file-system-level encryption
(called EFS), sparse files and reparse points. Sparse files allow for the efficient storage of
data sets that are very large yet contain many areas that only have zeroes. Reparse points
allow the object manager to reset a file namespace lookup and let file system drivers
implement changed functionality in a transparent manner. Reparse points are used to
implement Volume Mount Points, Directory Junctions, Hierarchical Storage
Management, Native Structured Storage and Single Instance Storage. Volume mount
points and directory junctions allow a file to be transparently redirected from one file or
directory location to another.
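The idea behind sparse files, storing a large run of zeroes without consuming disk blocks for it, can be demonstrated by seeking past a hole before writing. Whether the hole actually saves physical space depends on the file system (NTFS requires the sparse attribute to be set on the file); the logical size, however, always includes the hole:

```python
# Write a file whose first 100 MB is a hole: seek past it, then write only
# the non-zero tail. On a sparse-capable file system the hole consumes no
# disk blocks, yet the file's logical size still covers the full run.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sparse.bin")
with open(path, "wb") as f:
    f.seek(100 * 1024 * 1024)   # skip a 100 MB run of zeroes
    f.write(b"tail")            # only these 4 bytes carry real data

logical_size = os.path.getsize(path)   # full 100 MB + 4 bytes
```

Reading anywhere inside the hole returns zero bytes, which is exactly the behaviour NTFS sparse files present to applications.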
Encrypting File System
The Encrypting File System (EFS) introduced strong encryption into the Windows file
world. It allowed any folder or drive on an NTFS volume to be encrypted transparently to
the end user. EFS works in conjunction with the EFS service, Microsoft’s CryptoAPI and
the EFS File System Run-Time Library (FSRTL). As of February 2004, its encryption
has not been compromised.
EFS works by encrypting a file with a bulk symmetric key (also known as the File
Encryption Key, or FEK), used because encrypting and decrypting large amounts of data
takes far less time with a symmetric cipher than with an asymmetric one. The symmetric
key that is used to encrypt the file is then encrypted with a public
key that is associated with the user who encrypted the file, and this encrypted data is
stored in the header of the encrypted file. To decrypt the file, the file system uses the
private key of the user to decrypt the symmetric key that is stored in the file header. It
then uses the symmetric key to decrypt the file. Because this is done at the file system
level, it is transparent to the user. Also, in case of a user losing access to their key,
support for recovery agents that can decrypt files has been built in to the EFS system.
Basic and dynamic disk storage
Windows 2000 introduced the Logical Disk Manager for dynamic storage. All versions
of Windows 2000 support three types of dynamic disk volumes (along with basic
storage): simple volumes, spanned volumes and striped volumes:
Simple volume: this is a volume with disk space from one disk.
Spanned volumes: a volume whose disk space spans up to 32 disks. If one disk fails, all
data in the volume is lost.
Striped volumes: also known as RAID-0, a striped volume stores all its data across
several disks in stripes. This allows better performance because disk read and writes are
balanced across multiple disks. Windows 2000 also added support for iSCSI protocol.
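The round-robin striping that balances reads and writes across the member disks can be sketched as follows; the stripe size and data are illustrative:

```python
# Minimal sketch of RAID-0 striping: split the data into fixed-size stripes
# and assign them round-robin across the member disks.
def stripe(data: bytes, disks: int, stripe_size: int):
    """Return one list of stripes per disk, assigned round-robin."""
    layout = [[] for _ in range(disks)]
    for i in range(0, len(data), stripe_size):
        layout[(i // stripe_size) % disks].append(data[i:i + stripe_size])
    return layout

layout = stripe(b"ABCDEFGH", disks=2, stripe_size=2)
# disk 0 holds stripes "AB" and "EF"; disk 1 holds "CD" and "GH"
```

Because consecutive stripes land on different disks, a sequential transfer keeps every disk busy at once, which is the source of the performance gain the text describes.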
Accessibility support
The Windows 2000 onscreen keyboard map allows users who have problems with using
the keyboard to use a mouse to input text.
Microsoft made an effort to increase the usability of Windows 2000 for people with
visual and auditory impairments and other disabilities. They included several utilities
designed to make the system more accessible:
FilterKeys: a group of keyboard-related accessibility features for people with typing
issues, which include:
SlowKeys: Windows is told to disregard keystrokes that are not held down for a certain
time period
BounceKeys: causes repeated keystrokes of the same key within a certain timeframe to be
ignored
RepeatKeys: allows users to slow down the rate at which keys are repeated via the
keyboard’s keyrepeat feature
ToggleKeys: when turned on, Windows will play a sound when either the CAPS LOCK,
NUM LOCK or SCROLL LOCK keys are pressed
MouseKeys: allows the cursor to be moved around the screen via the numeric keypad
instead of the mouse
On screen keyboard: assists those who are not familiar with a given keyboard by
allowing them to use a mouse to enter characters on the screen
SerialKeys: gives Windows 2000 the ability to support speech augmentation devices
StickyKeys: makes modifier keys (ALT, CTRL and SHIFT) become “sticky” – in other
words a user can press the modifier key, release that key and then press the combination
key. Normally the modifier key must remain pressed down to activate the sequence.
On screen magnifier: assists users with visual impairments by magnifying the part of
the screen they place their mouse over.
Narrator: Microsoft Narrator assists users with visual impairments by reading system
messages aloud via the sound system as they appear
High contrast theme: to assist users with visual impairments
SoundSentry: designed to help users with auditory impairments, Windows 2000 will
show a visual effect when a sound is played through the sound system
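The SlowKeys and BounceKeys timing rules above can be sketched as a simple event filter; the thresholds and the (key, press, release) millisecond event format are invented for illustration:

```python
# Sketch of the SlowKeys and BounceKeys rules described in the text.
def filter_keys(events, slow_ms=200, bounce_ms=150):
    """Keep a keystroke only if it was held for at least slow_ms (SlowKeys)
    and is not a repeat of the same key within bounce_ms of the last
    accepted press of that key (BounceKeys)."""
    accepted = []
    last_release = {}                 # key -> release time of last accepted press
    for key, press, release in events:
        if release - press < slow_ms:
            continue                  # SlowKeys: not held down long enough
        if key in last_release and press - last_release[key] < bounce_ms:
            continue                  # BounceKeys: accidental repeated keystroke
        accepted.append(key)
        last_release[key] = release
    return accepted

events = [("a", 0, 250), ("a", 300, 600), ("b", 700, 750), ("c", 800, 1100)]
```

Here the second "a" is discarded as a bounce and "b" is discarded for being held too briefly, so only "a" and "c" are accepted.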
Language & locale support
Windows 2000 has support for many languages other than English. It supports Arabic,
Armenian, Baltic, Central European, Cyrillic, Georgian, Greek, Hebrew, Indic, Japanese,
Korean, Simplified Chinese, Thai, Traditional Chinese, Turkic, Vietnamese and Western
European languages. It also has support for many different locales, a list of which can be
found on Microsoft’s website.
System utilities
The Microsoft Management Console (MMC) is used for administering Windows 2000
computers.
Windows 2000 introduced the Microsoft Management Console (MMC), which is used to
create, save, and open administrative tools. Each of the tools is called a console, and most
consoles allow an administrator to administer other Windows 2000 computers from one
centralised computer. Each console can contain one or many specific administrative
tools, called snap-ins. Snap-ins can be either standalone (performs one function), or
extensions (adds functionality to an existing snap-in). In order to provide the ability to
control what snap-ins can be seen in a console, the MMC allows consoles to be created in
author mode or in user mode. Author mode allows snap-ins to be added, new windows to
be created, all portions of the console tree to be displayed, and consoles to be saved. User
mode allows consoles to be distributed with restrictions applied. A user mode console can
grant users full access, so they can make whatever changes they desire; limited access, so
that users cannot add to the console but can still view multiple windows within it; or
limited access in which users can neither add to the console nor view multiple windows.
The Windows 2000 Computer Management console is capable of performing many
system tasks, including disk defragmentation.
The main tools that come with Windows 2000 can be found in the Computer
Management console (found in Administrative Tools in the Control Panel). This contains
the event viewer – a means of seeing events and the Windows equivalent of a log file, a
system information viewer, the ability to view open shared folders and shared folder
sessions, a device manager and a tool to view all the local users and groups on the
Windows 2000 computer. It also contains a disk management snap-in, which contains a
disk defragmenter as well as other disk management utilities. Lastly, it also contains a
services viewer, which allows users to view all installed services and to stop and start
them on demand, as well as configure what those services should do when the computer
starts.
REGEDIT.EXE utility:
Windows 2000 comes bundled with two utilities to edit the Windows registry. One acts
like the Windows 9x REGEDIT.EXE program and the other can edit registry permissions
in the same manner as Windows NT’s REGEDT32.EXE program.
REGEDIT.EXE has a left-side tree view that begins at “My Computer” and lists all
loaded hives. REGEDT32.EXE has a left-side tree view, but each hive has its own
window, so the tree displays only keys. REGEDIT.EXE represents the three components
of a value (its name, type, and data) as separate columns of a table. REGEDT32.EXE
represents them as a list of strings. REGEDIT.EXE was written for the Win32 API and
supports right-clicking of entries in a tree view to adjust properties and other settings.
REGEDT32.EXE was also written for the Win32 API and requires all actions to be
performed from the top menu bar. Because REGEDIT.EXE was directly ported from
Windows 98, it does not support permission editing (permissions do not exist in
Windows 9x). Therefore, the only way to access the full functionality of an NT registry
was with REGEDT32.EXE, which uses the older multiple document interface (MDI),
which newer versions of regedit do not use. Windows XP was the first system to integrate
these two programs into one, adopting the REGEDIT.EXE behavior with the additional
NT functionality.
The System File Checker (SFC) also comes bundled with Windows 2000. It is a
command line utility that scans system files and verifies whether they were signed by
Microsoft and works in conjunction with the Windows File Protection mechanism. It can
also repopulate and repair all the files in the Dllcache folder.
Recovery Console
The Recovery Console is an application that is run from outside the installed copy of
Windows and that enables a user to perform maintenance tasks that cannot be run from
inside the installed copy, or cannot feasibly be run from another computer or copy of
Windows 2000. It is most often used to recover the system from errors that cause booting
to fail, which would render other tools useless.
It presents itself as a simple command line interface. The commands are limited to ones
for checking and repairing the hard drive(s), repairing boot information (including
NTLDR), replacing corrupted system files with fresh copies from the CD, or
enabling/disabling services and drivers for the next boot.
The console can be accessed in one of two ways:
Starting from the Windows 2000 CD and choosing to enter the Recovery Console, or
Installing the Recovery Console via Winnt32.exe with the /cmdcons switch. In this case,
however, the console can only be used if the system boots to the point where NTLDR can
start it.
Server family functionality
The Windows 2000 server family consists of Windows 2000 Server, Windows 2000
Advanced Server and Windows 2000 Datacenter Server.
All editions of Windows 2000 Server have the following services and functionality built-
in:
Routing and Remote Access Service (RRAS) support: facilitates dial-up and VPN
connections, RADIUS authentication, network connection sharing, Network Address
Translation, and unicast and multicast routing
DNS server: includes support for Dynamic DNS; Active Directory relies heavily on DNS
Microsoft Connection Manager Administration Kit and Connection Point Services
Distributed File System (DFS) support
Hierarchical Storage Management support: a service that runs in conjunction with NTFS
and automatically transfers files that have not been used for some period of time to less
expensive storage media
Fault tolerant volumes: supports mirrored and RAID-5 volumes
Group Policy (part of Active Directory)
Distributed File System
The Distributed File System, or DFS, allows shares in multiple different locations to be
logically grouped under one folder, or DFS root. When users try to access a share under
the DFS root, they are really looking at a DFS link, and the DFS server transparently
redirects them to the correct file server and share. A DFS root can only exist
on a Windows 2000 version that is part of the server family, and only one DFS root can
exist on that server.
There can be two ways of implementing DFS on Windows 2000: through standalone
DFS, or through domain-based DFS. Standalone DFS allows for only DFS roots that
exist on the local computer, and thus does not use Active Directory. Domain-based DFS
roots exist within Active Directory and can have their information distributed to other
domain controllers within the domain – this provides fault tolerance to DFS. DFS roots
that exist on a domain must be hosted on a domain controller or on a domain member
server. The file and root information is replicated via the Microsoft File Replication
Service (FRS).
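The transparent redirection described above can be sketched as a prefix-rewriting lookup; the link table and UNC paths are hypothetical:

```python
# Conceptual sketch of DFS link resolution: a path under the DFS root is
# rewritten to the real server and share before the request proceeds.
dfs_links = {
    r"\\corp\dfs\finance": r"\\server1\finance$",
    r"\\corp\dfs\eng":     r"\\server2\engineering",
}

def resolve(path: str) -> str:
    """Rewrite the longest matching DFS link prefix to its real target."""
    for link, target in sorted(dfs_links.items(),
                               key=lambda kv: len(kv[0]), reverse=True):
        if path.startswith(link):
            return target + path[len(link):]
    return path                  # not under a DFS link: use the path as-is

resolved = resolve(r"\\corp\dfs\finance\q3.xls")
```

The user only ever sees the logical path under the DFS root; which file server actually holds the share can change without the client noticing, which is what makes domain-based DFS replication useful for fault tolerance.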
Active Directory
Active Directory allows administrators to assign enterprise-wide policies, deploy
programs to many computers and apply critical updates to an entire organization, and is
one of the main reasons why many corporations have moved to Windows 2000. Active
Directory stores information about users, computers and other resources, and can act in
a similar manner to a phone book: all of the information and computer settings of an
organization are stored in a central, organised database. Active Directory networks can
vary from a small installation with a few hundred objects to a large installation with
millions of objects. Active Directory organises groups of resources into domains;
domains that share a contiguous namespace can be linked together to form trees, and
groups of trees that do not exist within the same namespace can be linked together to
form forests.
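The naming structure can be illustrated with a short sketch: a domain's LDAP distinguished name is built from the labels of its DNS name, so child domains in a tree simply extend the parent's name (the domain names below are invented):

```python
# Illustrative only: mapping the DNS names of Active Directory domains
# to LDAP distinguished names. The domain names are made up.
def domain_to_dn(dns_name: str) -> str:
    """Build a domain's distinguished name from its DNS labels."""
    return ",".join("DC=" + label for label in dns_name.split("."))

# A contiguous namespace forms a tree: each child extends the parent name.
tree = ["example.com", "sales.example.com", "eu.sales.example.com"]
for domain in tree:
    print(domain_to_dn(domain))
# A forest can additionally contain trees with disjoint namespaces,
# e.g. "example.com" and "other.net" joined by transitive trusts.
```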
Active Directory can only be installed on a Windows 2000 Server, Advanced Server or
Datacenter Server computer, and cannot be installed on a Windows 2000 Professional
computer. It requires that a DNS service that supports SRV resource records be installed,
or that an existing DNS infrastructure be upgraded to support this functionality. It also
requires that one or more domain controllers exist to hold the Active Directory database
and provide Active Directory directory services.
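The SRV records in question advertise which hosts offer the LDAP and Kerberos services for a domain. In standard zone-file notation they look roughly like the following fragment (the domain and host names are illustrative):

```
; Format: _service._protocol.name  TTL class SRV priority weight port target
_ldap._tcp.example.com.     600 IN SRV 0 100 389 dc1.example.com.
_kerberos._tcp.example.com. 600 IN SRV 0 100 88  dc1.example.com.
```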
Volume fault tolerance
Along with support for simple, spanned and striped volumes, the server family of
Windows 2000 also supports fault tolerant volume types. The types supported are
mirrored volumes and RAID-5 volumes:
Mirrored volumes: the volume consists of two disks; when data is written to one disk, it
is mirrored to the other. This means that if one disk fails, the data can be fully
recovered from the other disk. Mirrored volumes are also known as RAID-1.
RAID-5 volumes: a RAID-5 volume consists of three or more disks and uses block-level
striping with parity data distributed across all member disks. Should a disk fail in the
array, the parity blocks from the surviving disks are combined mathematically (by
XOR) with the data blocks from the surviving disks to reconstruct the data on the failed
drive on the fly.
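The parity arithmetic is plain XOR, which a short sketch can demonstrate. Block sizes and disk counts here are toy values, and a real array rotates the parity block across its members rather than keeping it on one disk:

```python
# Simplified RAID-5 parity: XOR the data blocks of one stripe.
def parity(blocks: list[bytes]) -> bytes:
    """XOR a list of equal-sized blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

stripe = [b"\x0f\x0f", b"\xf0\x00", b"\x33\x3c"]  # data on three disks
p = parity(stripe)                                # parity on a fourth disk

# If disk 1 fails, XOR-ing the survivors with the parity rebuilds its data,
# because x ^ x = 0 cancels every block except the missing one.
rebuilt = parity([stripe[0], stripe[2], p])
assert rebuilt == stripe[1]
```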
Versions
Windows 2000 Professional was designed as the desktop operating system for
businesses and power users. It is the most basic, and the most common, edition of
Windows 2000. It offers greater security and stability than many of the previous
Windows desktop operating systems. It supports up to two processors and can address
up to 4 GB of RAM.
Windows 2000 Server products share the same user interface with Windows 2000
Professional, but contain additional components for running infrastructure and
application software. A significant component of the server products is Active Directory,
which is an enterprise-wide directory service based on LDAP. Additionally, Microsoft
integrated Kerberos network authentication, replacing the often-criticised NTLM
authentication system used in previous versions. This also provided a purely transitive-
trust relationship between Windows 2000 domains in a forest (a collection of one or more
Windows 2000 domains that share a common schema, configuration, and global
catalogue, being linked with two-way transitive trusts). Furthermore, Windows 2000
introduced a DNS server which allows dynamic registration of IP addresses.
Windows 2000 Advanced Server is a variant of the Windows 2000 Server operating
system designed for medium-to-large businesses. It offers clustering infrastructure for
high availability and scalability of applications and services, including main memory
support of up to 8 gigabytes (GB) on Physical Address Extension (PAE) systems and the
ability to do 8-way SMP. It has support for TCP/IP load balancing and enhanced
two-node server clusters based on the Microsoft Cluster Server (MSCS) of Windows NT
Server 4.0 Enterprise Edition. A limited-edition 64-bit version of Windows 2000
Advanced Server was made available via the OEM channel. It also supports failover and
load balancing.
Windows 2000 Datacenter Server is a variant of Windows 2000 Server designed for
large businesses that move large quantities of confidential or sensitive data frequently
via a central server. As with Advanced Server, it supports clustering, failover and load
balancing. Its minimum system requirements are modest, but it scales to very powerful
hardware: a Pentium-class CPU at 400 MHz or higher, with up to 32 processors
supported in one machine; 256 MB of RAM, with up to 64 GB supported in one
machine; and approximately 1 GB of available disk space.
Deployment
Windows 2000 can be deployed to a site via various methods. It can be installed onto
servers via traditional media (such as CD) or from distribution folders that reside on a
shared folder. Installations can be attended or unattended. An attended installation
requires the manual intervention of an operator to choose options when installing the
operating system. Unattended installations are scripted via an answer file, or predefined
script in the form of an INI file that has all the options filled in already. The Winnt.exe or
Winnt32.exe program then uses that answer file to automate the installation. Unattended
installations can be performed via a bootable CD, using Microsoft Systems Management
Server (SMS), via the System Preparation Tool (Sysprep), via running the Winnt32.exe
program using the /syspart switch or via running the Remote Installation Service (RIS).
The Syspart method is started on a standardised reference computer – though the
hardware need not be similar – and it copies the required installation files from the
reference computer’s hard drive to the target computer’s hard drive. The hard drive does
not need to be in the target computer and may be swapped out to it at any time, with
hardware configuration still needing to be done later. The Winnt.exe program must also
be passed a /unattend switch that points to a valid answer file and a /s switch that points
to the location of one or more valid installation sources.
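An answer file is an ordinary INI-style text file. A minimal example might look like the following; the section names follow the standard unattend.txt layout, but every value here is invented for illustration:

```
; unattend.txt - illustrative values only
[Unattended]
UnattendMode = FullUnattended
TargetPath = \WINNT

[UserData]
FullName = "Example User"
OrgName = "Example Corp"
ComputerName = SRV01

[GuiUnattended]
AdminPassword = "secret"
TimeZone = 035
```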
Sysprep allows the duplication of a disk image on an existing Windows 2000 Server
installation to multiple servers. This means that all applications and system configuration
settings will be copied across to the new Windows 2000 installations, but it also means
that the reference and target computers must have the same HALs, ACPI support, and
mass storage devices – though Windows 2000 automatically detects plug and play
devices. The primary reason for using Sysprep is for deploying Windows 2000 to a site
that has standard hardware and that needs a fast method of installing Windows 2000 to
those computers. If a system has different HALs, mass storage devices or ACPI support,
then multiple images would need to be maintained.
Systems Management Server can be used to upgrade multiple systems to Windows
2000. The operating systems to be upgraded in this process must be
running a version of Windows that can be upgraded (Windows NT 3.51, Windows NT 4,
Windows 98 and Windows 95 OSR2.x) and those versions must be running the SMS
client agent that can receive software installation operations. Using SMS allows
installations to happen over a wide geographical area and provides centralised control
over upgrades to systems.
Remote Installation Services (RIS) is a means to automatically install Windows 2000
Professional (and not Windows 2000 Server) to a local computer over a network from a
central server. Images do not have to support specific hardware configurations and the
security settings can be configured after the computer reboots as the service generates a
new unique security ID (SID) for the machine. This is required so that local accounts are
given the right identifier and do not clash with other Windows 2000 Professional
computers on a network.
RIS requires that client computers are able to boot over the network, either via a
network interface card that has a Pre-Boot Execution Environment (PXE) boot ROM
installed or via a network card that is supported by the remote boot disk generator.
The remote computer must also meet the Net PC specification. The server that RIS runs
on must be Windows 2000 Server and the server must be able to access a network DNS
Service, a DHCP service and the Active Directory services.
Active Directory's initial release drew criticism from competitors. Novell, for example,
promoted its rival directory service with the claim: “NDS eDirectory is a cross-platform
directory solution that works on NT 4, Windows 2000 when available, Solaris and
NetWare 5. Active Directory will only support the Windows 2000 environment. In
addition, eDirectory users can be assured they are using the most trusted, reliable and
mature directory service to manage and control their e-business relationships – not a 1.0
release.”